

Q1. Import the dataset and examine the variables

	1.	Load the dataset:
If you’ve downloaded the diabetes.csv file, place it in your working directory and import it:

import pandas as pd
data = pd.read_csv('path/to/diabetes.csv')  # Replace with the correct path
print(data.head())


	2.	Descriptive statistics:
Use .describe() to summarize the dataset.

print(data.describe())


	3.	Visualizations:
	•	Distribution plots:

import matplotlib.pyplot as plt
data.hist(bins=20, figsize=(12, 10))
plt.show()


	•	Pairplot for feature relationships:

import seaborn as sns
sns.pairplot(data, hue='Outcome')
plt.show()


	•	Correlation heatmap:

sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()



Q2. Preprocess the data

	1.	Check for missing values:

print(data.isnull().sum())

If there are missing values, you can fill them with the median:

data.fillna(data.median(), inplace=True)
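
One caveat: if this is the Pima Indians Diabetes data (an assumption, based on the Outcome column and the eight features), missing measurements may be recorded as 0 rather than NaN, in which case isnull() reports nothing. The sketch below uses a toy DataFrame with assumed column names to show one way to treat such zeros as missing before imputing. Which columns qualify is a judgment call you should check against your own file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the column names are assumptions
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Pregnancies": [6, 1, 8, 0],  # a 0 here is a legitimate value
})

# Columns where a 0 is treated as "not measured" (assumed list)
zero_as_missing = ["Glucose", "BMI"]

for col in zero_as_missing:
    # Mark zeros as missing, then fill them with the column median
    toy[col] = toy[col].replace(0, np.nan)
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

Columns such as Pregnancies are left alone because 0 is a valid value there.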


	2.	Remove outliers:
Outliers can be handled using the IQR method or Z-score. Here’s an example with IQR:

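# Leave the binary target out of the outlier test
features = data.columns.drop('Outcome')
Q1 = data[features].quantile(0.25)
Q3 = data[features].quantile(0.75)
IQR = Q3 - Q1
# Flag rows where any feature falls outside 1.5 * IQR of the quartiles
outlier_rows = ((data[features] < (Q1 - 1.5 * IQR)) | (data[features] > (Q3 + 1.5 * IQR))).any(axis=1)
data = data[~outlier_rows]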
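
The Z-score alternative mentioned above can be sketched the same way. This is a minimal illustration on synthetic data with made-up column names; the cutoff of 3 standard deviations is a common convention, not something the dataset dictates.

```python
import numpy as np
import pandas as pd

# Synthetic data: 200 ordinary rows plus one extreme row (hypothetical columns)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": rng.normal(100, 10, 200),
    "feature_b": rng.normal(30, 5, 200),
})
demo.loc[len(demo)] = [400.0, 30.0]  # obvious outlier in feature_a

# Z-score: distance from the column mean in units of standard deviation
z_scores = (demo - demo.mean()) / demo.std()

# Keep only rows where every feature lies within the cutoff
threshold = 3
filtered = demo[(z_scores.abs() <= threshold).all(axis=1)]
print(len(demo), "->", len(filtered))
```

Unlike the IQR rule, the Z-score uses the mean and standard deviation, which extreme values themselves inflate, so it tends to be less reliable on strongly skewed columns.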


	3.	Categorical variables: The dataset contains only numerical variables, so no dummy variables are needed.

Q3. Split the dataset into a training set and a test set

Use train_test_split to divide the data into training and testing sets.

from sklearn.model_selection import train_test_split
X = data.drop('Outcome', axis=1)
y = data['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Q4. Train a decision tree model and perform cross-validation

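	1.	Train the model and cross-validate it:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

dt_model = DecisionTreeClassifier(random_state=42)

# 5-fold cross-validation estimates how well the model generalizes;
# it helps detect overfitting but does not prevent it
cv_scores = cross_val_score(dt_model, X_train, y_train, cv=5)
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")

# Fit the baseline tree on the full training set
dt_model.fit(X_train, y_train)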


	2.	Hyperparameter tuning using GridSearchCV:

from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': [3, 5, 7, 10], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 5]}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
best_model = grid_search.best_estimator_



Q5. Evaluate the performance of the decision tree model

	1.	Make predictions and evaluate metrics:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc

y_pred = best_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")


	2.	Confusion matrix:

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


	3.	ROC Curve:

y_prob = best_model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc:.2f})')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()



Q6. Interpret the decision tree

	1.	Visualizing the tree:

from sklearn import tree
plt.figure(figsize=(12, 8))
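# Pass the feature names as a plain list; some scikit-learn versions reject a pandas Index here
tree.plot_tree(best_model, feature_names=list(X.columns), class_names=['Non-Diabetic', 'Diabetic'], filled=True)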
plt.show()


	2.	Important features:

feature_importances = best_model.feature_importances_
important_features = pd.Series(feature_importances, index=X.columns).sort_values(ascending=False)
print(important_features)

	•	Example: Glucose, BMI, and Age might emerge as the most important features in predicting diabetes.

Q7. Validate the decision tree model

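	1.	Test on new data:
Load unseen data or construct new test cases and check that the predictions are plausible. A single hypothetical patient only illustrates the prediction call; judging robustness needs a larger held-out sample:

# Hypothetical patient; values follow the column order of X
# (a DataFrame with the training column names avoids a feature-name warning)
new_patient = pd.DataFrame([[3, 145, 85, 25, 0, 33.6, 0.627, 50]], columns=X.columns)
print(best_model.predict(new_patient))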


	2.	Sensitivity analysis:
Add noise to the input variables and observe how the predictions change. This shows how sensitive the model is to small changes in the input data.
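
One way to run such a check is sketched below. It is a self-contained illustration on synthetic data from make_classification, not the diabetes file; the noise levels, the max_depth value, and the "share of changed predictions" measure are all assumptions chosen for the example. With the real data you would substitute best_model and X_test.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data: 8 numeric features, binary target
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)
baseline_pred = model.predict(X_te)

rng = np.random.default_rng(42)
feature_std = X_tr.std(axis=0)  # scale the noise to each feature's spread

flip_rates = {}
for noise_level in [0.0, 0.05, 0.1, 0.25, 0.5]:
    # Gaussian noise with standard deviation = noise_level * feature std
    noise = rng.normal(0.0, 1.0, size=X_te.shape) * feature_std * noise_level
    noisy_pred = model.predict(X_te + noise)
    # Share of test rows whose predicted class changed under noise
    flip_rates[noise_level] = float(np.mean(noisy_pred != baseline_pred))
    print(f"noise = {noise_level:.2f} x std -> "
          f"{flip_rates[noise_level]:.1%} of predictions changed")
```

A model whose predictions change for a large share of rows under small noise is relying on thresholds that sit close to many data points, which is a sign to consider a shallower tree or more data.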

