# **Developing and Evaluating a Machine Learning Model**

## **Task-1: Dataset Selection and Preprocessing**

### **1. Data Collection**
The dataset for this project was collected from **Kaggle**. It is titled **'Diabetes Dataset'** and can be accessed through the following link:<br>
__[Diabetes Dataset](https://www.kaggle.com/datasets/mathchi/diabetes-data-set)__<br>
The Pima Indians Diabetes dataset is a standard dataset used in machine learning to predict whether a person has diabetes based on diagnostic measurements. The dataset has 768 records and 8 features representing physiological measurements such as glucose concentration, BMI, age, and insulin level. The outcome variable is binary: 1 indicates the presence of diabetes, and 0 indicates its absence.

In [26]:
# importing required libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

#### **Loading dataset**

In [27]:
data = pd.read_csv('diabetes.csv')

In [28]:
data.head()

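| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |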


In [29]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [30]:
data.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


### **2. Data Cleaning and Pre-Processing**

#### **a. Handling missing values**
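The dataset has no explicit nulls (`data.info()` reports 768 non-null values in every column), but a value of 0 in Glucose, BloodPressure, SkinThickness, Insulin, and BMI is physiologically implausible and acts as a placeholder for a missing measurement. These were handled as follows:<br>

* **Glucose, BloodPressure, and BMI:** Replaced 0 values with the median of the non-zero values.
* **SkinThickness and Insulin:** Also imputed with the median. These were not measured for all patients: the 25th percentile of both columns is 0 in the summary above, so at least a quarter of their values are placeholders.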

In [48]:
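# Separate features and target (X and y are used in the later cells)
X = data.drop(columns='Outcome')
y = data['Outcome']

# Replace zeros with the median of the non-zero values for specific columns
cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for col in cols_with_zeros:
    X[col] = X[col].replace(0, X.loc[X[col] != 0, col].median())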

**Justification:** Median imputation was chosen because the median is robust to outliers: unlike the mean, it is not pulled toward extreme values such as the Insulin maximum of 846.
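
As a minimal sketch of why the zeros are excluded when computing the replacement value, consider a made-up column (the values and the name `toy` are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical insulin-like column; 0 marks a missing measurement
toy = pd.Series([0, 0, 0, 94, 168, 88, 543], name="Insulin")

median_with_zeros = toy.median()           # placeholder zeros drag the median down
median_non_zero = toy[toy != 0].median()   # median of the observed values only

# Impute the placeholder zeros with the median of the observed values
imputed = toy.replace(0, median_non_zero)

print(median_with_zeros, median_non_zero)
print(imputed.tolist())
```

If the zeros were included, the replacement value would itself be biased by the missing entries it is meant to fill.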

#### **b. Checking for duplicate values**

In [49]:
duplicates = data.duplicated().sum()
print("Duplicates:", duplicates)
if duplicates == 0:
    print("No duplicate data found.")
else:
    print("Duplicate data found.")

Duplicates: 0
No duplicate data found.


#### **c. Checking for datatypes**

In [50]:
data_types = data.dtypes
print("Data types:\n", data_types)

Data types:
 Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object


#### **d. Data Normalization**
All features were scaled to the range 0 to 1 with min-max normalization, so that features with a large range (such as Insulin) do not outweigh features with a small range (such as DiabetesPedigreeFunction) during training.

In [51]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

**Justification:** Normalization was necessary to bring the features to a similar scale, preventing any single feature from dominating the learning process due to its range of values.
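
For reference, `MinMaxScaler` with its default range rescales each column as (x - min) / (max - min). The sketch below checks a manual computation against the scaler on a small made-up array; the array `toy` is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two made-up features on different scales (glucose-like and BMI-like values)
toy = np.array([[85.0, 26.6],
                [148.0, 33.6],
                [183.0, 23.3]])

# Manual min-max scaling: (x - min) / (max - min), computed per column
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

# MinMaxScaler applies the same formula with its default feature_range=(0, 1)
scaled = MinMaxScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
```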

#### **e.  Converting Categorical Data into Numerical Format**
All features in the diabetes dataset are numerical, so no categorical data transformation was required.<br>
**Justification:** If there were categorical data, one-hot encoding would allow us to represent categories as numerical values, ensuring that machine learning algorithms can process them. This technique avoids creating misleading ordinal relationships between categories.
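
A minimal sketch of what one-hot encoding would look like, assuming a hypothetical `Smoker` column that does not exist in this dataset:

```python
import pandas as pd

# Hypothetical categorical column; the diabetes dataset has no such feature
toy = pd.DataFrame({"Age": [50, 31, 32],
                    "Smoker": ["never", "former", "current"]})

# One indicator column per category, so no ordering between categories is implied
encoded = pd.get_dummies(toy, columns=["Smoker"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator set to 1, which is what keeps the encoding free of an artificial order such as never < former < current.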

#### **f. Splitting the Dataset into Training and Testing Sets**
We split the dataset into training and testing sets, with 80% of the data used for training the model and 20% reserved for evaluating its performance on unseen data. This ensures that the model is evaluated on data it hasn't encountered before, giving a realistic measure of its generalization ability.

In [52]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

**Justification:** An 80-20 split is a common choice in machine learning: it leaves enough data for training (614 rows here) while reserving enough (154 rows) to estimate performance on unseen data. A held-out test set does not prevent overfitting by itself, but it makes overfitting detectable.
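
One variation worth considering, which the cell above does not do: pass `stratify` so that both subsets keep the same class ratio, and fit the scaler on the training rows only so that no information from the test rows reaches preprocessing. A sketch on synthetic data (all `*_demo` names and values are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))        # synthetic features, not the diabetes data
y_demo = np.array([0] * 65 + [1] * 35)    # roughly the 65/35 class ratio of the dataset

# stratify keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Learn min and max from the training rows only, then reuse them on the test rows
scaler_demo = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```

With this arrangement, scaled test values can fall slightly outside the 0 to 1 range, which is expected: the test rows are treated as genuinely unseen.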

## **Task-2: Model Design and Implementation**

### **1. Model Selection**

We will use the **Logistic Regression Algorithm**. This model is highly suitable for binary classification problems like predicting whether a patient has diabetes (Outcome = 1) or not (Outcome = 0). Logistic Regression is easy to interpret, requires less computational power, and is a solid baseline model for classification.<br>

**Justification:** <br>

* **Interpretable:** The model coefficients provide insights into the relationship between features and the outcome.
* **Simple to Implement:** Logistic Regression works well when the classes are approximately linearly separable, and it trains quickly on a dataset of this size (768 rows).
* **Performance:** Though simple, Logistic Regression can achieve high performance with well-preprocessed data.
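
To illustrate the interpretability point, here is a sketch on synthetic data (the data and all names are assumptions, not the diabetes features). It shows that each coefficient is a change in log-odds and that the predicted probability is the sigmoid of the linear score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic data: the first feature drives the label, the second is pure noise
X_demo = rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

clf = LogisticRegression(max_iter=200).fit(X_demo, y_demo)

# Each coefficient is the change in log-odds per unit increase of that feature;
# exponentiating it gives an odds ratio
coefs = clf.coef_[0]
odds_ratios = np.exp(coefs)

# predict_proba for class 1 is the sigmoid of intercept + coefs . x
score = clf.intercept_[0] + X_demo @ coefs
manual_proba = 1.0 / (1.0 + np.exp(-score))

print(coefs, odds_ratios)
```

On the real model, the same reading applies to `model.coef_`, with the caveat that the features were min-max scaled, so "one unit" means the full observed range of a feature.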

### **2. Model Implementation**

In [53]:
from sklearn.linear_model import LogisticRegression

# Initialize the model
model = LogisticRegression(max_iter=200)

# Train the model
model.fit(X_train, y_train)

### **3. Key Parameters and Hyperparameters of the Model**
* **penalty:** The type of regularization applied to prevent overfitting. Default is l2 (Ridge Regularization), which helps in reducing the complexity of the model.
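* **class_weight:** Per-class weights applied in the loss function. The default is None (all classes weighted equally); 'balanced' weights each class inversely to its frequency to compensate for class imbalance.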
* **solver:** Algorithm to be used for optimization. By default, lbfgs solver is used, which is efficient for small datasets. 
* **max_iter:** Maximum number of iterations for the solver to converge. Increased to 200 to ensure convergence.

## **Task-3: Model Training and Evaluation**

### **1. Training the Model**
The model was trained using the training data (X_train and y_train).

In [58]:
model.fit(X_train, y_train)

#### **Training Process:**

**Challenges:** The primary challenge was handling the imbalanced dataset, with more non-diabetic patients than diabetic patients. This was addressed by using the class_weight='balanced' parameter.

In [60]:
data['Outcome'].value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

In [61]:
# Handle class imbalance
model = LogisticRegression(class_weight='balanced', max_iter=200)
model.fit(X_train, y_train)
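
As a worked example of what `class_weight='balanced'` computes, the sketch below uses the class counts from `value_counts()` above (500 and 268). Note the assumption: the model itself is fitted on the training split only, so its actual weights are derived from the training counts and differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts reported for the full dataset: 500 non-diabetic, 268 diabetic
y_all = np.array([0] * 500 + [1] * 268)

# 'balanced' uses n_samples / (n_classes * count_of_class)
manual = len(y_all) / (2 * np.bincount(y_all))
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_all)

print(manual.round(3), weights.round(3))
```

The minority (diabetic) class receives the larger weight, so misclassifying a diabetic patient costs more during training than misclassifying a non-diabetic one.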

### **2. Model Evaluation**

In [64]:
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# Predictions on training data
y_train_pred = model.predict(X_train)

# Predictions on test data
y_test_pred = model.predict(X_test)

# Evaluate training performance
train_accuracy = accuracy_score(y_train, y_train_pred)
train_class_report = classification_report(y_train, y_train_pred)
train_roc_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])

# Evaluate testing performance
test_accuracy = accuracy_score(y_test, y_test_pred)
test_class_report = classification_report(y_test, y_test_pred)
test_roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Output the results
print(f"Training Accuracy: {train_accuracy * 100:.2f}%")
print(f"Training AUC-ROC: {train_roc_auc:.2f}")
print("Training Classification Report:\n", train_class_report)

print(f"Testing Accuracy: {test_accuracy * 100:.2f}%")
print(f"Testing AUC-ROC: {test_roc_auc:.2f}")
print("Testing Classification Report:\n", test_class_report)


Training Accuracy: 76.06%
Training AUC-ROC: 0.85
Training Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.78      0.81       401
           1       0.64      0.73      0.68       213

    accuracy                           0.76       614
   macro avg       0.74      0.75      0.74       614
weighted avg       0.77      0.76      0.76       614

Testing Accuracy: 69.48%
Testing AUC-ROC: 0.82
Testing Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.70      0.75        99
           1       0.56      0.69      0.62        55

    accuracy                           0.69       154
   macro avg       0.68      0.69      0.68       154
weighted avg       0.72      0.69      0.70       154



**Comparing the model's performance on the training set vs. the testing set**<br>
**Training Set Performance:**
* Accuracy: 76.06%
* AUC-ROC: 0.85
* Precision, Recall, F1-Score:
  * Diabetic (Class 1): Precision 0.64, Recall 0.73, F1-Score 0.68
  * Non-Diabetic (Class 0): Precision 0.84, Recall 0.78, F1-Score 0.81

**Testing Set Performance:**
* Accuracy: 69.48%
* AUC-ROC: 0.82
* Precision, Recall, F1-Score:
  * Diabetic (Class 1): Precision 0.56, Recall 0.69, F1-Score 0.62
  * Non-Diabetic (Class 0): Precision 0.80, Recall 0.70, F1-Score 0.75


**Interpretation:**
* **Training vs. Testing Accuracy:** The model shows 76.06% accuracy on the training set and 69.48% accuracy on the testing set. The performance drop between the two sets is noticeable, indicating that the model could be slightly overfitting to the training data. Overfitting happens when the model performs well on the training set but struggles to generalize on unseen test data.
* **AUC-ROC Comparison:** The AUC-ROC is 0.85 on the training set and 0.82 on the test set. These values suggest that the model is performing well in distinguishing between diabetic and non-diabetic patients across both datasets. The small drop in AUC-ROC on the test set implies the model maintains good discriminatory power, though it performs slightly better on the training data.
* **Precision, Recall, F1-Score (Class 1 - Diabetic):** In the training set, diabetic precision (0.64) and recall (0.73) indicate that the model can reasonably identify diabetic patients, but there's room for improvement in precision, as it is misclassifying some non-diabetic patients as diabetic. In the testing set, precision drops to 0.56, and recall is at 0.69. This lower precision means the model makes more false positive predictions for diabetic patients in the test data compared to the training data.
* **Non-Diabetic (Class 0) Performance:** The model shows higher precision and recall for non-diabetic patients (Class 0) in both the training and testing sets. In the training set, precision is 0.84 and recall is 0.78, while in the testing set, precision is 0.80 and recall is 0.70. This indicates that the model is better at correctly predicting non-diabetic individuals than diabetic ones.

**Overfitting/Underfitting:** Accuracy drops by about 6.6 percentage points and AUC-ROC by 0.03 between the training and testing sets. The difference is not drastic, suggesting that the model is not severely overfitting. However, the lower test accuracy and diabetic class precision suggest the model could be improved to generalize better, especially for predicting diabetic patients.

**Confusion Matrix and Classification Report:**<br>
* Accuracy: The proportion of all predictions that are correct.
* Precision, Recall, and F1-Score: Precision is the fraction of predicted positives that are truly positive, recall is the fraction of actual positives that the model finds, and F1-Score is their harmonic mean. All three are reported for both the diabetic and non-diabetic classes.
* AUC-ROC: The probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one.
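
The notebook does not print a confusion matrix. The counts below are inferred from the rounded test-set report (support 99 and 55, recall 0.70 and 0.69, accuracy 69.48%), so they should be read as a reconstruction consistent with those figures, not as notebook output. The sketch shows how each metric follows from the four counts.

```python
# Counts consistent with the test-set report above (inferred, not printed by the notebook)
tn, fp = 69, 30   # the 99 non-diabetic patients
fn, tp = 17, 38   # the 55 diabetic patients

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)    # of the predicted diabetic, how many are diabetic
recall_1 = tp / (tp + fn)       # sensitivity: of the actual diabetic, how many are found
specificity = tn / (tn + fp)    # recall of class 0
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} specificity={specificity:.2f} f1={f1_1:.2f}")
```

In the notebook itself, `sklearn.metrics.confusion_matrix(y_test, y_test_pred)` would return the actual counts directly.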


## **Task 4: Critical Analysis and Report**

### **1. Model Performance Analysis:**

**Strengths:** Logistic Regression gives a balanced result on the test set: sensitivity (recall on the diabetic class) is 0.69 and specificity (recall on the non-diabetic class) is 0.70, with an AUC-ROC of 0.82. It is also interpretable, simple to implement, and computationally efficient.<br>
**Weaknesses:** Logistic Regression assumes a linear relationship between the features and the log odds of the target, which may not hold true for more complex datasets. More sophisticated models like Random Forests or Neural Networks could capture non-linear patterns in the data.<br>

**Potential Improvements:**
* Feature Engineering: Adding new features or transforming existing features could improve the model's ability to capture underlying patterns in the data.
* Advanced Models: Implementing more complex models such as Random Forests or SVM could improve performance.
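
A hedged sketch of how such a comparison could be set up, using cross-validation and a synthetic stand-in for the data (the generated data, the `candidates` dictionary, and the chosen settings are all assumptions; no claim is made about which model wins on the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as the dataset (768 rows, 8 features)
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=200),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
}

# 5-fold cross-validated AUC-ROC gives a steadier comparison than a single split
results = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
    for name, est in candidates.items()
}
print(results)
```

Replacing `X_demo` and `y_demo` with the preprocessed features and the `Outcome` column would apply the same comparison to the diabetes data.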