You are a data scientist working for a healthcare company, and you have been tasked with creating a
decision tree to help identify patients with diabetes based on a set of clinical variables. You have been
given a dataset (diabetes.csv) with the following variables:
1. Pregnancies: Number of times pregnant (integer)
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (integer)
3. BloodPressure: Diastolic blood pressure (mm Hg) (integer)
4. SkinThickness: Triceps skin fold thickness (mm) (integer)
5. Insulin: 2-Hour serum insulin (mu U/ml) (integer)
6. BMI: Body mass index (weight in kg/(height in m)^2) (float)
7. DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes
based on family history) (float)
8. Age: Age in years (integer)
9. Outcome: Class variable (0 if non-diabetic, 1 if diabetic) (integer)

Your goal is to create a decision tree to predict whether a patient has diabetes based on the other
variables. Here are the steps you can follow:

Q1. Import the dataset and examine the variables. Use descriptive statistics and visualizations to
understand the distribution and relationships between the variables.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv("diabetes.csv")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
sns.heatmap(df.corr(),annot=True,fmt=".2f",cmap="crest")
plt.show()

In [None]:
df.describe().T

In [None]:
sns.pairplot(df)

In [None]:
df["Outcome"].value_counts()

In [None]:
x = 1
plt.figure(figsize=(16,10))
plt.subplots_adjust(top = 0.99, bottom=0.01, hspace=0.5, wspace=0.5)
for i in df.columns:
    plt.subplot(3,3,x)
    x = x+1
    sns.histplot(data=df,x=i,kde=True)
    plt.title(f'Histogram for {i}')
plt.show()

In [None]:
# Univariate Analysis with respect to Outcome column
for i in df.columns:
    if i != 'Outcome':
        # Draw both histograms on the axes returned by plt.subplots
        fig, axes = plt.subplots(1,2,figsize=(15,7))
        sns.histplot(data=df,x=i,kde=True,bins='fd',color='g',ax=axes[0])
        axes[0].set_title(f'Histogram for {i}')
        sns.histplot(data=df,x=i,kde=True,bins='fd',hue='Outcome',ax=axes[1])
        axes[1].set_title(f'Histogram for {i} wrt Outcome')
        plt.show()

Observation

The Outcome column is imbalanced: class 0 has 500 rows and class 1 has 268.
The histograms above show 0 values in several columns where a zero is not physiologically plausible, for example Glucose, BloodPressure, SkinThickness, Insulin and BMI. These zeros most likely stand for missing measurements.
Such 0 values will be replaced with the median of their respective column.
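
As a rough illustration of why the imbalance matters, the class counts reported above imply a majority-class baseline: a model that always predicts class 0 would already be right about 65% of the time. The sketch below only redoes that arithmetic; the variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the Outcome value counts reported above
n_class0 = 500
n_class1 = 268

total = n_class0 + n_class1
# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = max(n_class0, n_class1) / total
# Share of diabetic patients in the data
positive_rate = n_class1 / total

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"Positive class share: {positive_rate:.3f}")
```

Any accuracy reported later is best read against this baseline rather than against 50%.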

Q2. Preprocess the data by cleaning missing values, removing outliers, and transforming categorical
variables into dummy variables if necessary.

In [None]:
# Zero is a valid count for Pregnancies, so only the measurement columns are treated
cols_zero_val = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']

In [None]:
for i in cols_zero_val:    
    print(f'{i} : {len(df[df[i]==0])}')

In [None]:
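for i in cols_zero_val:
    # Take the median over the non-zero values only, so the placeholder zeros do not bias it
    df[i] = df[i].replace(0,df.loc[df[i]!=0,i].median())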

In [None]:
df.describe().T

In [None]:
df.isnull().sum()

In [None]:
df.dtypes

In [None]:
plt.figure(figsize=(10,4))
sns.boxplot(data = df)
plt.xticks(rotation = 45)
plt.show()

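There are outliers, but it is not necessary to remove them: the decision tree algorithm is not sensitive to outliers in the features, because its splits depend on the ordering of the values rather than on their magnitude.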
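
To put a number on what the boxplot shows, one option is the usual 1.5 x IQR rule that boxplot whiskers are based on. The following is a minimal sketch on made-up values; the helper name `iqr_outlier_mask` and the sample list are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical measurements with one extreme value
sample = [85, 90, 95, 100, 105, 110, 115, 300]
mask = iqr_outlier_mask(sample)

print("Number of flagged values:", int(mask.sum()))
print("Flagged values:", np.asarray(sample)[mask])
```

Applied to the notebook's data, the same function could be called once per column, for example on `df['Insulin']`, to count the points drawn beyond the whiskers.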

Q3. Split the dataset into a training set and a test set. Use a random seed to ensure reproducibility.

In [None]:
#Dependent and independent features
X = df.iloc[:,:-1]
y = df["Outcome"]
X.head()

In [None]:
y.head()

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=40,test_size=0.33)

In [None]:
X_train.shape,X_test.shape

In [None]:
y_train.shape,y_test.shape
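
Because the classes are imbalanced, it may be worth checking that both splits keep roughly the same share of diabetic patients. The sketch below is an optional variation, not what the notebook does: it uses the `stratify` argument of `train_test_split` on synthetic labels with the same class counts as Outcome. The `*_demo` names and the placeholder feature are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as Outcome (500 zeros, 268 ones)
y_demo = np.array([0] * 500 + [1] * 268)
# Placeholder single feature; its values are irrelevant for this check
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions (almost) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=40, stratify=y_demo
)

print("Positive share overall:", round(y_demo.mean(), 3))
print("Positive share in train:", round(y_tr.mean(), 3))
print("Positive share in test:", round(y_te.mean(), 3))
```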

Q4. Use a decision tree algorithm, such as ID3 or C4.5, to train a decision tree model on the training set. Use
cross-validation to optimize the hyperparameters and avoid overfitting.

In [None]:
parameters = {
    'criterion':['gini','entropy','log_loss'],
    'splitter':['best','random'],
    'max_depth':list(range(1,15)),
    'max_features':['sqrt','log2']
}
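
For context on the `criterion` options: scikit-learn's trees are based on an optimized version of CART rather than ID3 or C4.5, and the criterion only selects the impurity measure used to score splits (`'entropy'` and `'log_loss'` both refer to Shannon entropy). The sketch below computes Gini impurity and entropy by hand for the class counts of the whole dataset; the helper functions are illustrative and not scikit-learn API.

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits: -sum(p * log2(p)) over non-empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Class counts of Outcome for the full dataset (class 0, class 1)
root_counts = [500, 268]
root_gini = gini(root_counts)
root_entropy = entropy(root_counts)

print(f"Gini impurity: {root_gini:.3f}")
print(f"Entropy (bits): {root_entropy:.3f}")

# A pure node has zero impurity under both measures
print(gini([10, 0]), entropy([10, 0]))
```

A split is preferred when it lowers the weighted impurity of the child nodes the most, whichever of the two measures is chosen.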

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

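# Fix random_state so the search is reproducible ('random' splitter and max_features involve randomness)
classifier = DecisionTreeClassifier(random_state=42)
clf = GridSearchCV(classifier,param_grid=parameters,cv=5,scoring = 'accuracy')
clf.fit(X_train,y_train)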
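
To get a feel for the cost of this search, the number of candidate settings is the product of the list lengths in the grid, and each candidate is fitted once per fold. The sketch below counts them with the standard library; `grid_copy` simply repeats the values of `parameters` so the block runs on its own.

```python
from itertools import product

# Same values as the `parameters` grid, repeated here so the block is self-contained
grid_copy = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'splitter': ['best', 'random'],
    'max_depth': list(range(1, 15)),
    'max_features': ['sqrt', 'log2'],
}

# Every combination of one value per hyperparameter is one candidate
n_candidates = len(list(product(*grid_copy.values())))
cv_folds = 5
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set.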

In [None]:
clf.best_params_

In [None]:
clf.best_score_

In [None]:
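# Refit with the best hyperparameters; a fixed random_state keeps the tree reproducible
model = DecisionTreeClassifier(**clf.best_params_,random_state=42)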
model.fit(X_train,y_train)

In [None]:
y_pred=model.predict(X_test)

Q5. Evaluate the performance of the decision tree model on the test set using metrics such as accuracy,
precision, recall, and F1 score. Use confusion matrices and ROC curves to visualize the results.

In [None]:
y_pred

In [None]:
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix

In [None]:
# scikit-learn metrics expect (y_true, y_pred) in that order
acc_test = accuracy_score(y_test,y_pred)
print(f'Accuracy Score for test data is {acc_test}')

In [None]:
# True labels first; swapping the arguments would swap precision and recall
print(classification_report(y_test,y_pred))

In [None]:
# True labels first, so rows are actual classes and columns are predicted classes
cf = confusion_matrix(y_test, y_pred)
sns.heatmap(cf, annot=True,fmt='d')
plt.show()
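
The metrics in the classification report can be read directly off the confusion matrix. The sketch below shows the arithmetic for the positive class, assuming scikit-learn's layout for `confusion_matrix(y_true, y_pred)` (rows are true labels, columns are predicted labels). The four counts are made up for illustration and are not results from this notebook.

```python
# Hypothetical counts, NOT results from this notebook
tn, fp = 140, 30   # true class 0: predicted 0, predicted 1
fn, tp = 35, 49    # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the patients flagged as diabetic, how many are
recall = tp / (tp + fn)      # of the diabetic patients, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In a screening setting, recall for class 1 is often the number to watch, since a false negative is a missed diabetic patient.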

In [None]:
from sklearn.metrics import roc_curve, auc

# roc_curve needs scores rather than hard 0/1 labels,
# so use the predicted probability of the positive class
y_proba = model.predict_proba(X_test)[:, 1]

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Calculate the area under the ROC curve (AUC)
roc_auc = auc(fpr, tpr)

# Create the ROC curve plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
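
One way to read the AUC value is as the probability that a randomly chosen diabetic patient receives a higher score than a randomly chosen non-diabetic patient, with ties counted as half. The sketch below computes it that way on four made-up scores; the helper `pairwise_auc` is illustrative and not a scikit-learn function.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for the positive class
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(labels, scores)
print("AUC:", auc_value)
```

Three of the four positive-negative pairs are ordered correctly here, so the AUC is 0.75.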

Q6. Interpret the decision tree by examining the splits, branches, and leaves. Identify the most important
variables and their thresholds. Use domain knowledge and common sense to explain the patterns and
trends.

In [None]:
from sklearn import tree
plt.figure(figsize=(12,10))
tree.plot_tree(model,filled=True)
plt.show()
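
When the plotted tree is too dense to read, the split thresholds can also be printed as text with `sklearn.tree.export_text`. The sketch below fits a one-split tree on toy data so that it runs on its own; the toy arrays and the single "Glucose" feature are assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): one feature standing in for Glucose
X_toy = np.array([[80], [90], [100], [150], [160], [170]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0)
toy_tree.fit(X_toy, y_toy)

# One line per split or leaf, with the feature name and threshold
rules = export_text(toy_tree, feature_names=["Glucose"])
print(rules)
```

For the notebook's model, the analogous call would be `export_text(model, feature_names=list(X_train.columns))`.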

In [None]:
model.feature_importances_

In [None]:
imp = model.feature_importances_
imp = pd.Series(imp)
imp.index = X_train.columns
imp = imp.sort_values(ascending=False)
imp

In [None]:
imp.plot(kind='bar',ylabel='Importance',title='Feature Importances')
plt.show()

Q7. Validate the decision tree model by applying it to new data or testing its robustness to changes in the
dataset or the environment. Use sensitivity analysis and scenario testing to explore the uncertainty and
risks.

In [None]:
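# Build the new sample as a DataFrame with the training column names,
# which avoids the feature-name warning without silencing all warnings
new_data = pd.DataFrame([[6,120,22,35,120,18.4,0.90,45]],columns=X_train.columns)

In [None]:
y_pred = model.predict(new_data)
y_pred[0]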

In [None]:
if y_pred[0] == 1:
    print("Diabetic")
else:
    print("Not Diabetic")
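
A single prediction says little about robustness. A simple form of sensitivity analysis is to sweep one input over a range while holding the rest fixed and note where the predicted class flips. The sketch below does this on a toy one-feature tree so that it is self-contained; the toy data, `toy_model` and the sweep range are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): one feature standing in for Glucose
X_toy = np.array([[80], [90], [100], [150], [160], [170]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_model = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

# Sweep the feature from 70 to 180 in steps of 10
grid = np.arange(70, 181, 10).reshape(-1, 1)
preds = toy_model.predict(grid)

# Grid values at which the prediction differs from the previous grid value
flip_points = grid[1:][np.diff(preds) != 0].ravel()

for value, pred in zip(grid.ravel(), preds):
    print(value, "->", "Diabetic" if pred == 1 else "Not Diabetic")
print("Prediction changes at:", flip_points)
```

The same loop could be applied to the notebook's `model` by copying the new sample, overwriting one column with each grid value, and calling `model.predict` on the result.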