

Q1. Import the dataset and examine the variables. Use descriptive statistics and visualizations to understand the distribution and relationships between the variables.

	1.	Importing the dataset:

import pandas as pd
data = pd.read_csv('diabetes.csv')
print(data.head())


	2.	Descriptive statistics:
Use .describe() to get an overview of the distribution of numerical features:

print(data.describe())


	3.	Visualizations:
	•	Histograms to check the distribution of variables like Glucose, BMI, etc.:

import matplotlib.pyplot as plt
data.hist(bins=20, figsize=(12, 10))
plt.show()


	•	Pairplot to explore relationships between variables:

import seaborn as sns
sns.pairplot(data, hue='Outcome')
plt.show()


	•	Correlation matrix to see the correlation between the variables:

sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()



Q2. Preprocess the data by cleaning missing values, removing outliers, and transforming categorical variables into dummy variables if necessary.

	1.	Check for missing values:

print(data.isnull().sum())

If there are missing values, you can fill them with the mean or median of the column:

data.fillna(data.median(), inplace=True)
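A caveat worth checking: in some copies of this kind of dataset, missing measurements are stored as 0 rather than left blank, so isnull() reports nothing. Whether that applies to this diabetes.csv is an assumption; the data.describe() minimums would show it. The sketch below uses a small made-up frame (not the real file) and a hypothetical list of columns where zero is treated as impossible:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real data; the values are made up for illustration.
demo = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Pregnancies": [6, 1, 8, 0],  # zero is a legitimate value here
})

# Assumption: a zero in these columns means "not recorded".
zero_as_missing = ["Glucose", "BMI"]
demo[zero_as_missing] = demo[zero_as_missing].replace(0, np.nan)

# Count the gaps that isnull() can now see, then fill with each column's median.
missing_counts = demo.isnull().sum()
demo = demo.fillna(demo.median())

print(missing_counts)
print(demo)
```

Columns such as Pregnancies are left alone because zero is a valid count there.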


	2.	Removing outliers:
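Outliers can be detected using methods like the IQR method or z-score.
For example, to remove outliers using IQR (note that this drops every row with an outlier in at least one column, which can remove a sizeable share of a small dataset):

# Compute the IQR fences on the feature columns only, leaving the Outcome label out
features = data.drop(columns='Outcome')
Q1 = features.quantile(0.25)
Q3 = features.quantile(0.75)
IQR = Q3 - Q1
# Flag rows where any feature lies more than 1.5 * IQR beyond the quartiles, then drop them
outlier_rows = ((features < (Q1 - 1.5 * IQR)) | (features > (Q3 + 1.5 * IQR))).any(axis=1)
data = data[~outlier_rows]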
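The z-score method mentioned above can be sketched in a similar way. This is a minimal illustration on synthetic data rather than the real file; the cutoff of 3 standard deviations is a common convention and an assumption here, not something the exercise prescribes:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two feature columns; the values are illustrative only.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 200),
    "BMI": rng.normal(32, 7, 200),
})
demo.loc[0, "Glucose"] = 600  # plant one obvious outlier

# Standardize each column, then keep rows whose absolute z-score is at most 3 everywhere.
z_scores = (demo - demo.mean()) / demo.std()
within_bounds = (z_scores.abs() <= 3).all(axis=1)
filtered = demo[within_bounds]

print(f"Rows before: {len(demo)}, after: {len(filtered)}")
```

The z-score rule assumes roughly bell-shaped columns; for strongly skewed features the IQR rule is usually the safer choice.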


	3.	Categorical variables: All the variables in this dataset are numerical, so there is no need to create dummy variables here.

Q3. Split the dataset into a training set and a test set. Use a random seed to ensure reproducibility.

	1.	Splitting the data:

from sklearn.model_selection import train_test_split
X = data.drop('Outcome', axis=1)
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)



Q4. Use a decision tree algorithm, such as ID3 or C4.5, to train a decision tree model on the training set. Use cross-validation to optimize the hyperparameters and avoid overfitting.

	1.	Training the Decision Tree Classifier:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

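# Initialize the Decision Tree
# Note: scikit-learn implements an optimized version of CART rather than ID3 or C4.5;
# passing criterion='entropy' would give information-gain splits in the spirit of those algorithms
dt_model = DecisionTreeClassifier(random_state=42)

# 5-fold cross-validation estimates how well the default tree generalizes (no tuning happens yet)
cross_val_scores = cross_val_score(dt_model, X_train, y_train, cv=5)
print(f"Cross-Validation Accuracy: {cross_val_scores.mean():.3f}")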


	2.	Hyperparameter tuning using GridSearchCV:

from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, 7, 10], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 5]}
grid_search = GridSearchCV(dt_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
best_model = grid_search.best_estimator_
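To see whether the search is actually curbing overfitting, it can help to compare training and validation scores per setting. The sketch below is a minimal illustration on synthetic data from make_classification (a stand-in, not the diabetes file), with a smaller hypothetical grid; return_train_score=True asks GridSearchCV to record the training-fold scores as well:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 4, 8, None]},  # None lets the tree grow without a depth limit
    cv=5,
    scoring="accuracy",
    return_train_score=True,
)
search.fit(X_demo, y_demo)

# One row per depth setting: a large train/validation gap suggests overfitting.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_train_score", "mean_test_score"]
]
results["gap"] = results["mean_train_score"] - results["mean_test_score"]
print(results)
```

Here mean_test_score is the cross-validated score on the held-out folds, not the score on the final test set.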



Q5. Evaluate the performance of the decision tree model on the test set using metrics such as accuracy, precision, recall, and F1 score. Use confusion matrices and ROC curves to visualize the results.

	1.	Model evaluation:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc

# Predictions
y_pred = best_model.predict(X_test)

# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")


	2.	Confusion matrix:

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


	3.	ROC curve:

y_prob = best_model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc:.2f})')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()



Q6. Interpret the decision tree by examining the splits, branches, and leaves. Identify the most important variables and their thresholds. Use domain knowledge and common sense to explain the patterns and trends.

	1.	Tree visualization (using plot_tree or exporting to a readable format):

from sklearn import tree
plt.figure(figsize=(12, 8))
# Pass the names as a plain list; some scikit-learn versions reject a pandas Index here
tree.plot_tree(best_model, feature_names=list(X.columns), class_names=['Non-Diabetic', 'Diabetic'], filled=True)
plt.show()


	2.	Variable importance:

feature_importances = best_model.feature_importances_
important_features = pd.Series(feature_importances, index=X.columns).sort_values(ascending=False)
print(important_features)

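	•	Example interpretation: You might find that Glucose and BMI have the highest importance scores, meaning the tree relies on them most when splitting. Importance scores do not indicate direction, so check the split thresholds in the tree itself to confirm that higher glucose and BMI values lead toward the Diabetic leaves.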
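The split thresholds can also be read as text rules with export_text from sklearn.tree, which is often easier to scan than the plot. A minimal sketch on synthetic data, with hypothetical feature names f0 to f3 standing in for the real columns:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the training data; the feature names are placeholders.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
names = ["f0", "f1", "f2", "f3"]

# A shallow tree keeps the printed rules short.
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# Each line shows a feature, a threshold, and eventually the predicted class at a leaf.
rules = export_text(demo_tree, feature_names=names)
print(rules)
```

With the real model, the same call would be export_text(best_model, feature_names=list(X.columns)).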

Q7. Validate the decision tree model by applying it to new data or testing its robustness to changes in the dataset or the environment. Use sensitivity analysis and scenario testing to explore the uncertainty and risks.

	1.	Testing robustness: You can add slight perturbations to the test data or introduce noise, and check how the model’s predictions change. This can help assess if the model is overfitting or sensitive to changes.
	2.	Scenario testing: Simulate different scenarios (e.g., older vs. younger patients, higher glucose vs. lower glucose levels) and observe how the model behaves.
Example:

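# Hypothetical patient; wrapping the values in a DataFrame keeps the training column names
test_patient = pd.DataFrame([[6, 148, 72, 35, 0, 33.6, 0.627, 50]], columns=X.columns)
print(best_model.predict(test_patient))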
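The robustness check from step 1 can be sketched as follows. This is one possible approach under stated assumptions: the data is synthetic (from make_classification), the noise is Gaussian and scaled to a fraction of each feature's standard deviation, and the noise levels are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes features and labels.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)

rng = np.random.default_rng(0)
baseline = model.predict(X_te)
feature_std = X_tr.std(axis=0)

# For each noise level, measure the share of test predictions that flip.
flip_rates = {}
for scale in [0.01, 0.05, 0.1, 0.2]:
    noise = rng.normal(0, scale * feature_std, size=X_te.shape)
    noisy_pred = model.predict(X_te + noise)
    flip_rates[scale] = float(np.mean(noisy_pred != baseline))
    print(f"noise = {scale:.2f} x std -> {flip_rates[scale]:.1%} of predictions changed")
```

A flip rate that climbs quickly at small noise levels suggests that many test points sit close to split thresholds, which is a sign of a brittle tree.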



