## EDA

#### Description of dataset variables
- Pregnancies. Number of pregnancies of the patient (numeric)
- Glucose. Plasma glucose concentration 2 hours after an oral glucose tolerance test (numeric)
- BloodPressure. Diastolic blood pressure (measured in mm Hg) (numeric)
- SkinThickness. Triceps skinfold thickness (measured in mm) (numeric)
- Insulin. 2-hour serum insulin (measured in mu U/ml) (numeric)
- BMI. Body mass index (numeric)
- DiabetesPedigreeFunction. Diabetes Pedigree Function (numeric)
- Age. Age of patient (numeric)
- Outcome. Class variable (0 or 1): 0 means negative for diabetes and 1 means positive (numeric)

The goal is to predict whether or not a patient has diabetes.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### 1) Scanning and surface data cleaning

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/decision-tree-project-tutorial/main/diabetes.csv")
df.head()

In [None]:
df.shape

The dataset consists of 768 entries and 9 variables.

In [None]:
df.info()

All variables are numerical, stored as integers or decimals.

In [None]:
df.isnull().sum()

There is no null data in the dataset.

In [None]:
df.duplicated().sum()

There are no duplicate entries.

### 2) Univariate analysis

In [None]:
# Obtain the numerical columns
column_num = df.select_dtypes(include=['int64', 'float64']).columns

# Calculate the number of rows and columns required for the subplots (rows are rounded up)
num_rows = (len(column_num) + 2) // 3
num_columns = 3

# Create the subplots
fig, axis = plt.subplots(num_rows, num_columns, figsize=(15, 3 * num_rows))

# Generate a histogram for each numerical variable
for i, column in enumerate(column_num):
    sns.histplot(ax=axis[i // num_columns, i % num_columns], data=df, x=column, bins=15, kde=True).set(ylabel=None)

plt.tight_layout()
plt.show()

In [None]:

# Selects only columns of type int or float
column_num = df.select_dtypes(include=['int64', 'float64']).columns

# Dictionary for storing the number of zero values per column
number_zeros = {}

# Iterate over all numeric columns
for column in column_num:
    # Count the zero values in the column
    sum_zeros = (df[column] == 0).sum()

    # If there is at least one zero value, add the column and its count to the dictionary
    if sum_zeros > 0:
        number_zeros[column] = sum_zeros

for column, sum_zeros in number_zeros.items():
    print(f"{column}: {sum_zeros}")


Several variables contain zero values that are physiologically impossible and were probably recorded in error. The most affected are Insulin (374 zeros) and SkinThickness (227 zeros), since a person cannot have zero insulin or zero skin thickness. BloodPressure has 35 zeros and BMI has 11.
We do not remove these entries: the dataset is small, and the rows we would have to drop are too many compared to its total number of entries, which would negatively affect our final model.
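
As a minimal sketch of how such zeros could be quantified and flagged without dropping rows, the snippet below works on a small made-up DataFrame (the `toy` name and its values are illustrative, not rows from the dataset). It reports the share of zeros per column and marks them as missing with `NaN`. This is one possible treatment, shown only for reference; it is not the one applied in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not rows from diabetes.csv)
toy = pd.DataFrame({
    "Insulin": [0, 94, 0, 168, 0],
    "SkinThickness": [35, 0, 0, 23, 32],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1],
})

# Share of zero values per column, as a percentage
zero_share = (toy == 0).mean() * 100
print(zero_share.round(1))

# Mark zeros as missing instead of deleting the rows
toy_nan = toy.replace(0, np.nan)
print(toy_nan.isnull().sum())
```

Flagging the zeros this way keeps every row available, so the decision about imputing or ignoring them can be taken later.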


### 3) Multivariate analysis

In [None]:
# Numerical-numerical analysis: regression plot and correlation for each pair of variables
fig, axis = plt.subplots(4, 3, figsize = (15, 10))


sns.regplot(ax = axis[0,0], data = df, x = "Glucose", y = "Outcome")
sns.heatmap(df[["Outcome", "Glucose"]].corr(), annot = True, fmt = ".2f", ax = axis[1,0], cbar = False)

sns.regplot(ax = axis[0,1], data = df, x = "BMI", y = "Outcome")
sns.heatmap(df[["Outcome", "BMI"]].corr(), annot = True, fmt = ".2f", ax = axis[1, 1], cbar = False)

sns.regplot(ax = axis[0,2], data = df, x = "BloodPressure", y = "Outcome")
sns.heatmap(df[["Outcome", "BloodPressure"]].corr(), annot = True, fmt = ".2f", ax = axis[1, 2], cbar = False)

sns.regplot(ax = axis[2, 0], data = df, x = "Pregnancies", y = "Age")
sns.heatmap(df[["Age", "Pregnancies"]].corr(), annot = True, fmt = ".2f", ax = axis[3, 0], cbar = False)

sns.regplot(ax = axis[2, 1], data = df, x = "DiabetesPedigreeFunction", y = "Outcome")
sns.heatmap(df[["Outcome", "DiabetesPedigreeFunction"]].corr(), annot = True, fmt = ".2f", ax = axis[3, 1], cbar = False)

sns.regplot(ax = axis[2, 2], data = df, x = "DiabetesPedigreeFunction", y = "Outcome")
sns.heatmap(df[["DiabetesPedigreeFunction", "Outcome"]].corr(), annot = True, fmt = ".2f", ax = axis[3, 2], cbar = False)

plt.tight_layout()
plt.show()

In [None]:
# Correlation map

column_num = df.select_dtypes(include=['int64', 'float64']).columns

plt.figure(figsize=(10, 8))
sns.heatmap((df[column_num].corr()), annot=True, cmap='PuBu', fmt=".2f")
plt.show()



We found moderate positive correlations between:
* SkinThickness and Insulin
* SkinThickness and BMI
* Glucose and Insulin
* Age and Pregnancies

There are no moderate or strong negative correlations; the few negative ones are weak.
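
Reading pairs off a heatmap by eye is error-prone, so here is a minimal sketch of how the pairs above a cut-off could be listed programmatically. It uses a made-up DataFrame, and the `threshold` of 0.4 is a hypothetical choice for "moderate", not a value taken from this analysis.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 5, 4, 5, 7],
    "c": [6, 1, 4, 2, 5, 3],
})

corr = toy.corr()

# Keep the upper triangle only, so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Hypothetical cut-off for a "moderate" relationship
threshold = 0.4
moderate = pairs[pairs.abs() >= threshold].sort_values(ascending=False)
print(moderate)
```

The same steps could be applied to `df[column_num].corr()` to list the pairs seen in the correlation map.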

### 4) Outlier analysis

In [None]:
df.describe()

In [None]:
# Display outliers
fig, axis = plt.subplots(2, 3, figsize = (14, 8))

sns.boxplot(ax = axis[0, 0], data = df, x = "Glucose")
sns.boxplot(ax = axis[0, 1], data = df, x = "BloodPressure")
sns.boxplot(ax = axis[0, 2], data = df, x = "BMI")
sns.boxplot(ax = axis[1, 0], data = df, x = "Insulin")
sns.boxplot(ax = axis[1, 1], data = df, x = "SkinThickness")

# Remove the unused sixth panel of the 2 x 3 grid
fig.delaxes(axis[1, 2])

plt.tight_layout()
plt.show()


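We do not modify or remove the outliers before running the decision tree model: tree splits depend only on the ordering of the values, so extreme values have little effect on them, and altering the data could have an impact on the performance of the model.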
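
With seaborn's default settings, the points drawn beyond the boxplot whiskers are the values lying more than 1.5 times the interquartile range (IQR) outside the quartiles. As a rough sketch of how those points could be counted, the snippet below applies that rule to a made-up series (`toy_feature` and its values are illustrative only).

```python
import pandas as pd

# Toy values, invented for illustration only
values = pd.Series([10, 12, 11, 13, 12, 14, 11, 13, 12, 60], name="toy_feature")

# Interquartile range and the usual 1.5 * IQR fences
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are the ones a boxplot would draw as outliers
outliers = values[(values < lower) | (values > upper)]
print(f"fences: [{lower}, {upper}], outliers: {outliers.tolist()}")
```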

### 5) Split into train and test

In [None]:
from sklearn.model_selection import train_test_split

# Separate the predictors from the target variable
X = df.drop("Outcome", axis = 1)
y = df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 71)

X_train.head()

No scaling of variables is done for the decision tree. Each split compares a single variable against a threshold, so the model is not affected by the scale of the data and the predictor variables do not need to be normalised.
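
A small experiment can illustrate this claim. The sketch below fits two trees on synthetic data, one on the raw features and one on the same features multiplied by a positive constant, and compares their predictions. The data, the `scale` factor and the manual 160/40 split are assumptions made for the example and are unrelated to the diabetes dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, invented for illustration only
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 4)).astype(float)
y = (X[:, 0] + X[:, 1] > 100).astype(int)

# Simple manual split: first 160 rows for training, last 40 for testing
X_train, X_test = X[:160], X[160:]
y_train = y[:160]

# Positive constant applied to every feature
scale = 10.0

tree_raw = DecisionTreeClassifier(random_state=71).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=71).fit(X_train * scale, y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(X_test * scale)

# Rescaling preserves the ordering of the values, so the same splits are found
same = bool((pred_raw == pred_scaled).all())
print("Identical predictions:", same)
```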

In [None]:
# Save the train and test sets

X_train.to_csv("/workspaces/ML-Decision-Tree-PilarZarco/data/processed/X_train.csv", index=False)  # train predictors
with open("/workspaces/ML-Decision-Tree-PilarZarco/data/processed/y_train.txt", "w") as f:  # train target
    f.write(y_train.to_string(index=False))

X_test.to_csv("/workspaces/ML-Decision-Tree-PilarZarco/data/processed/X_test.csv", index=False)  # test predictors
with open("/workspaces/ML-Decision-Tree-PilarZarco/data/processed/y_test.txt", "w") as f:  # test target
    f.write(y_test.to_string(index=False))
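
Since the targets are written as plain text with one value per line, they have to be parsed when loaded again. The following is a minimal sketch of that round trip, assuming a made-up series and a temporary directory in place of the project paths (`y_toy` and `y_toy.txt` are hypothetical names).

```python
import os
import tempfile

import pandas as pd

# Toy target, invented for illustration only
y_toy = pd.Series([0, 1, 1, 0, 1], name="Outcome")

# Hypothetical location: a temporary directory instead of the project path
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "y_toy.txt")

    # Write one value per line, in the same way as y_train and y_test
    with open(path, "w") as f:
        f.write(y_toy.to_string(index=False))

    # Read the values back, skipping empty lines; int() ignores padding whitespace
    with open(path) as f:
        y_loaded = pd.Series([int(line) for line in f if line.strip()], name="Outcome")

print(y_loaded.tolist())
```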