
In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# First choice: Spanish Wine Quality Dataset

# Spanish Wine Quality Dataset

## Source of Data

- [Kaggle - Spanish Wine Quality Dataset](https://www.kaggle.com/datasets/fedesoriano/spanish-wine-quality-dataset)

## Brief Description of Data

This dataset pertains to red variants of Spanish wines. It provides several popularity and description metrics and their relationship to wine quality. The dataset is suitable for both classification and regression tasks. The quality ratings are ordered and imbalanced; in this file they range from 4.2 to 4.9 points. The objective is to predict either the quality of a wine or its price from the available data.

## Dataset Details

- **Target:** `rating`
- **One Row Represents:** A product (specifically, a red variant of Spanish wine)
- **Problem Type:** Regression
- **Number of Features:** 10 (11 columns, minus the target)
- **Number of Rows:** 7500

## Challenges and Considerations

Several challenges might arise during data cleaning, exploration, and modeling:

- **Missing Data:** The dataset contains missing values in the 'year', 'type', 'body', and 'acidity' columns. Handling missing data is crucial to maintain the quality of analysis and modeling. Decisions regarding imputation or removal of missing values need to be made.

- **Categorical Data:** Categorical columns like 'winery', 'wine', 'country', and 'region' need numerical encoding. Dealing with high cardinality categorical columns (e.g., 'winery', 'wine') presents challenges.

- **Data Exploration:** Gaining insights into the distribution of numerical features such as 'rating', 'num_reviews', 'price', 'body', and 'acidity' is essential to identify potential outliers or skewed distributions that can impact model performance.

- **Feature Engineering:** Depending on modeling objectives, feature engineering may be required to generate meaningful new features from existing ones. For instance, an 'age' feature could be derived from the 'year' column, or new features could combine wine characteristics such as 'body' and 'acidity'.

- **Modeling Challenges:** Selecting an appropriate modeling technique based on data nature and objectives is critical. Different algorithms may perform differently based on the data. Handling categorical variables and addressing multicollinearity (if present) are key considerations.

- **Imbalanced Data:** The distribution of the target variable ('rating') is heavily imbalanced: a rating of 4.2 accounts for 5,679 of the 7,500 rows, which could affect model performance. If the ratings are treated as classes, techniques like oversampling, undersampling, or specialized evaluation metrics may be necessary.

- **Hyperparameter Tuning:** If machine learning algorithms are employed, hyperparameter tuning is essential to achieve optimal model performance. Tuning requires careful selection of hyperparameter values and can be time-consuming.

- **Interpreting Results:** After modeling, interpreting results and understanding the significance of features within the context of wine quality prediction can be complex.
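
One common way to handle the high-cardinality columns mentioned in the list above is frequency encoding, which replaces each category with the share of rows it occupies. The sketch below is only an illustration of that idea, not the approach this project settled on; the `toy` frame, its winery names, and the `winery_freq` column are made up.

```python
import pandas as pd

# Made-up winery names standing in for a high-cardinality column.
toy = pd.DataFrame({"winery": ["A", "A", "B", "C", "A", "B"]})

# Share of rows that each category occupies (sums to 1 across categories).
freq = toy["winery"].value_counts(normalize=True)

# Map each row's category to its share, giving one numeric column
# instead of one dummy column per winery.
toy["winery_freq"] = toy["winery"].map(freq)

print(toy)
```

When this feeds a model, the frequencies would normally be computed on the training split only and then mapped onto the test split, so that no information leaks from the test data.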

In summary, successfully cleaning, exploring, and modeling this dataset requires thoughtful consideration of these challenges to ensure accurate and dependable analysis and predictions.
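
A minimal sketch of two of the cleaning steps discussed above, run on a toy frame rather than the real file. It assumes that 'year' is stored as text because of non-numeric entries (the string "N.V." here is a placeholder for such an entry), that a fixed reference year is acceptable for the 'age' feature, and that median imputation is good enough for 'body' and 'acidity'. All three are assumptions, not decisions taken from the data.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking three of the columns; the values are made up.
toy = pd.DataFrame({
    "year": ["2013", "2018", "N.V.", None],
    "body": [5.0, 4.0, np.nan, 4.0],
    "acidity": [3.0, 2.0, np.nan, 3.0],
})

# 'year' has dtype object, so coerce it; non-numeric entries become NaN.
toy["year"] = pd.to_numeric(toy["year"], errors="coerce")

# Hypothetical reference year for the 'age' feature (an assumption).
REFERENCE_YEAR = 2023
toy["age"] = REFERENCE_YEAR - toy["year"]

# Median imputation is one simple option among several.
for col in ["body", "acidity"]:
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

Rows whose year could not be parsed keep a missing 'age', so they still need an explicit decision (impute or drop) before modeling.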


In [3]:
df = pd.read_csv('/content/drive/MyDrive/CodingDojo/02-MachineLearning/Week08/Data/wines_SPA.csv')
df.head()

Unnamed: 0,winery,wine,year,rating,num_reviews,country,region,price,type,body,acidity
0,Teso La Monja,Tinto,2013,4.9,58,Espana,Toro,995.0,Toro Red,5.0,3.0
1,Artadi,Vina El Pison,2018,4.9,31,Espana,Vino de Espana,313.5,Tempranillo,4.0,2.0
2,Vega Sicilia,Unico,2009,4.8,1793,Espana,Ribera del Duero,324.95,Ribera Del Duero Red,5.0,3.0
3,Vega Sicilia,Unico,1999,4.8,1705,Espana,Ribera del Duero,692.96,Ribera Del Duero Red,5.0,3.0
4,Vega Sicilia,Unico,1996,4.8,1309,Espana,Ribera del Duero,778.06,Ribera Del Duero Red,5.0,3.0


In [5]:
df['rating'].value_counts()

4.2    5679
4.3     707
4.4     484
4.5     281
4.6     191
4.7     112
4.8      44
4.9       2
Name: rating, dtype: int64
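
To make the imbalance easier to read, the counts above can be turned into shares. The sketch below retypes the printed counts by hand so that it runs without the CSV; on the loaded frame, `df['rating'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {4.2: 5679, 4.3: 707, 4.4: 484, 4.5: 281,
     4.6: 191, 4.7: 112, 4.8: 44, 4.9: 2},
    name="rating",
)

# Share of all rows that each rating value occupies.
shares = (counts / counts.sum()).round(4)
print(shares)
```

Roughly three quarters of the rows carry a rating of 4.2, so a model that always predicts 4.2 is a baseline worth comparing against.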

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   winery       7500 non-null   object 
 1   wine         7500 non-null   object 
 2   year         7498 non-null   object 
 3   rating       7500 non-null   float64
 4   num_reviews  7500 non-null   int64  
 5   country      7500 non-null   object 
 6   region       7500 non-null   object 
 7   price        7500 non-null   float64
 8   type         6955 non-null   object 
 9   body         6331 non-null   float64
 10  acidity      6331 non-null   float64
dtypes: float64(4), int64(1), object(6)
memory usage: 644.7+ KB
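
The non-null counts above translate into missing-value counts and percentages as sketched below. The figures are retyped from the `df.info()` output so the snippet runs on its own; on the loaded frame, `df.isna().sum()` should report the same counts.

```python
import pandas as pd

TOTAL_ROWS = 7500  # from "RangeIndex: 7500 entries"

# Non-null counts for the columns that are not fully populated.
non_null = pd.Series({"year": 7498, "type": 6955, "body": 6331, "acidity": 6331})

missing = TOTAL_ROWS - non_null
missing_pct = (missing / TOTAL_ROWS * 100).round(2)

print(pd.DataFrame({"missing": missing, "missing_pct": missing_pct}))
```

'body' and 'acidity' have the same non-null count, which may mean they are missing on the same rows; that would need checking on the real data.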


# Second choice: Cirrhosis Prediction Dataset

# Cirrhosis Prediction Dataset

## Source of Data

- [Kaggle - Cirrhosis Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/cirrhosis-prediction-dataset)

## Brief Description of Data

Cirrhosis is a late stage of scarring (fibrosis) of the liver caused by many forms of liver diseases and conditions, such as hepatitis and chronic alcoholism. The following data contains the information collected from the Mayo Clinic trial in primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984. A description of the clinical background for the trial and the covariates recorded here is in Chapter 0, especially Section 0.2 of Fleming and Harrington, Counting Processes and Survival Analysis, Wiley, 1991. A more extended discussion can be found in Dickson, et al., Hepatology 10:1-7 (1989) and in Markus, et al., N Eng J of Med 320:1709-13 (1989).

A total of 424 PBC patients, referred to Mayo Clinic during that ten-year interval, met eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. The first 312 cases in the dataset participated in the randomized trial and contain largely complete data. The additional 112 cases did not participate in the clinical trial but consented to have basic measurements recorded and to be followed for survival. Six of those cases were lost to follow-up shortly after diagnosis, so the data here are on an additional 106 cases as well as the 312 randomized participants.

## Dataset Details

- **Target:** `Stage`
- **One Row Represents:** A Person
- **Number of Features:** 19 (20 columns, minus the target; this count includes the `ID` identifier)
- **Number of Rows:** 418

## Challenges and Considerations

Several challenges might arise during data cleaning, exploration, and modeling:

- **Missing Data:** The dataset contains missing values in many columns: 'Drug', 'Ascites', 'Hepatomegaly', 'Spiders', 'Cholesterol', 'Copper', 'Alk_Phos', 'SGOT', 'Tryglicerides', 'Platelets', 'Prothrombin', and 'Stage'. Addressing missing data is crucial, as improper handling can lead to biased analysis and modeling results.

- **Categorical Data:** The presence of categorical variables, including 'Status', 'Drug', 'Sex', 'Ascites', 'Hepatomegaly', 'Spiders', and 'Edema', requires appropriate encoding or transformation for analysis and modeling. Handling categorical data effectively is essential for accurate predictions.

- **Feature Engineering:** Depending on the modeling goals, creating new features from existing attributes might be necessary to improve model performance. However, the challenge lies in identifying which new features are relevant and informative.

- **Imbalanced Data:** The target is 'Stage'; 'Status' records the patient's outcome (the sample rows show the codes C, CL, and D) rather than disease presence or absence. If the distribution of 'Stage' is imbalanced, with one class significantly outweighing the others, it can impact the predictive accuracy of the model. Balancing techniques such as oversampling or undersampling may be needed.

- **Model Selection:** Selecting the appropriate algorithm for modeling liver disease prediction is essential. Different algorithms have different strengths and weaknesses, and selecting the wrong one could lead to suboptimal results.

- **Hyperparameter Tuning:** Optimizing model performance through hyperparameter tuning can be challenging and time-consuming. Proper tuning is required to achieve the best predictive outcomes.

- **Interpreting Results:** Interpreting the results of the models, especially in the medical context of liver disease, can be complex. Understanding the significance of different attributes in contributing to disease prediction is important for informed decision-making.

- **Limited Sample Size:** The dataset consists of only 418 entries, which might limit the generalizability of the models. Ensuring that the models do not overfit due to the limited sample size is crucial.

- **Data Exploration:** Exploring and visualizing the distribution of attributes, relationships, and potential outliers can be time-consuming. However, it is essential for gaining insights and making informed decisions.

- **Ethical Considerations:** Given that the dataset involves medical information, ensuring ethical handling of sensitive patient data and adhering to data privacy regulations is critical.

Addressing these challenges will require a comprehensive and well-structured approach to data cleaning, exploration, and modeling. Each challenge necessitates careful consideration and domain expertise to ensure accurate and reliable results.
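
A rough sketch of how a few of these steps might look in pandas, on a toy frame with illustrative values. It rests on several assumptions: that 'Age' is recorded in days (the magnitudes suggest this), that a missing 'Drug' is better kept as its own category than imputed, and that median imputation is acceptable for 'Cholesterol'. The `Age_years` column and the "Unknown" label are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

# Toy rows with illustrative values; NaNs are inserted on purpose.
toy = pd.DataFrame({
    "Age": [21464, 20617, 25594, 19994],
    "Sex": ["F", "F", "M", "F"],
    "Drug": ["D-penicillamine", "D-penicillamine", np.nan, "Placebo"],
    "Cholesterol": [261.0, 302.0, np.nan, 244.0],
})

# Assumption: Age is in days, so convert to years for readability.
toy["Age_years"] = (toy["Age"] / 365.25).round(1)

# Median imputation for a numeric column (one option among several).
toy["Cholesterol"] = toy["Cholesterol"].fillna(toy["Cholesterol"].median())

# Keep a missing treatment as an explicit category instead of guessing it.
toy["Drug"] = toy["Drug"].fillna("Unknown")

# One-hot encode the low-cardinality categorical columns.
encoded = pd.get_dummies(toy, columns=["Sex", "Drug"])
print(encoded.columns.tolist())
```

Keeping "Unknown" as a category fits this dataset because the rows without a 'Drug' value correspond to patients who were not part of the randomized trial, so the missingness itself carries information.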


In [8]:
df = pd.read_csv('/content/drive/MyDrive/CodingDojo/02-MachineLearning/Week08/Data/cirrhosis.csv')
df.head()

Unnamed: 0,ID,N_Days,Status,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
0,1,400,D,D-penicillamine,21464,F,Y,Y,Y,Y,14.5,261.0,2.6,156.0,1718.0,137.95,172.0,190.0,12.2,4.0
1,2,4500,C,D-penicillamine,20617,F,N,Y,Y,N,1.1,302.0,4.14,54.0,7394.8,113.52,88.0,221.0,10.6,3.0
2,3,1012,D,D-penicillamine,25594,M,N,N,N,S,1.4,176.0,3.48,210.0,516.0,96.1,55.0,151.0,12.0,4.0
3,4,1925,D,D-penicillamine,19994,F,N,Y,Y,S,1.8,244.0,2.54,64.0,6121.8,60.63,92.0,183.0,10.3,4.0
4,5,1504,CL,Placebo,13918,F,N,Y,Y,N,3.4,279.0,3.53,143.0,671.0,113.15,72.0,136.0,10.9,3.0


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ID             418 non-null    int64  
 1   N_Days         418 non-null    int64  
 2   Status         418 non-null    object 
 3   Drug           312 non-null    object 
 4   Age            418 non-null    int64  
 5   Sex            418 non-null    object 
 6   Ascites        312 non-null    object 
 7   Hepatomegaly   312 non-null    object 
 8   Spiders        312 non-null    object 
 9   Edema          418 non-null    object 
 10  Bilirubin      418 non-null    float64
 11  Cholesterol    284 non-null    float64
 12  Albumin        418 non-null    float64
 13  Copper         310 non-null    float64
 14  Alk_Phos       312 non-null    float64
 15  SGOT           312 non-null    float64
 16  Tryglicerides  282 non-null    float64
 17  Platelets      407 non-null    float64
 18  Prothrombi