# **Worksheet-07**

### **Download the Pima Indians Diabetes Dataset**

Dataset: available from sources such as Kaggle. It contains the columns Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, and Outcome (whether the patient has diabetes or not).

In [12]:
# Importing Necessary Libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    mean_squared_error, r2_score, accuracy_score, precision_score,
    recall_score, f1_score, confusion_matrix,
)


In [13]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"
]
df = pd.read_csv(url, header=None, names=columns)
print(df.head(10))

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   
5            5      116             74              0        0  25.6   
6            3       78             50             32       88  31.0   
7           10      115              0              0        0  35.3   
8            2      197             70             45      543  30.5   
9            8      125             96              0        0   0.0   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   2

In [26]:
# Data Preprocessing

# Checking for null values (the raw file stores missing readings as 0, so none are reported here)
print(df.isnull().sum())

# Replacing zeros with NaN in columns where a zero is not a plausible measurement
columns_to_replace = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[columns_to_replace] = df[columns_to_replace].replace(0, np.nan)

# Filling the resulting missing values with each column's mean
df.fillna(df.mean(), inplace=True)

# Check the dataset after preprocessing
print(df.describe())


Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  121.686763      72.405184      29.153420  155.548223   
std       3.369578   30.435949      12.096346       8.790942   85.021108   
min       0.000000   44.000000      24.000000       7.000000   14.000000   
25%       1.000000   99.750000      64.000000      25.000000  121.500000   
50%       3.000000  117.000000      72.202592      29.153420  155.548223   
75%       6.000000  140.250000      80.000000      32.000000  155.548223   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedig
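
The preprocessing cell computes the fill means from the whole dataset before any train/test split. A common alternative is to compute them from the training rows only and reuse them for the test rows, so that no test information reaches the model. The sketch below illustrates that idea on a tiny made-up frame; the `toy` data, the four-row/two-row split, and the variable names are assumptions for illustration and are not part of the worksheet.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Glucose": [120, 90, 0, 105, 130, 110],
    "Insulin": [0, 0, 0, 80, 150, 0],
    "Age": [45, 30, 28, 25, 38, 52],
})

# Columns where a zero is treated as a missing reading (assumed for this toy frame).
zero_as_missing = ["Glucose", "Insulin"]

# Count the placeholder zeros before they are replaced.
zero_counts = (toy[zero_as_missing] == 0).sum()
print(zero_counts)

# Mark zeros as missing in those columns only.
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)

# Hypothetical split: first four rows for training, last two for testing.
train, test = toy.iloc[:4], toy.iloc[4:]

# Means come from the training rows only, then fill both parts with them.
train_means = train[zero_as_missing].mean()
train = train.fillna(train_means)
test = test.fillna(train_means)
print(test)
```

With the real data, the same pattern would mean calling `train_test_split` first and imputing afterwards.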

### **Regression Task:**

### 1. Predict the patients' blood pressure from the other features.
### 2. Use the Linear Regression model from Scikit-learn.



In [29]:
# Regression Task: Predict BloodPressure

# Features and target for regression
X_reg = df.drop(columns=["BloodPressure"])
y_reg = df["BloodPressure"]

# Train-test split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.25, random_state=42)

# Training Linear Regression model
reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)

# Predicting and evaluating
reg_predictions = reg_model.predict(X_test_reg)
reg_mse = mean_squared_error(y_test_reg, reg_predictions)
reg_r2 = r2_score(y_test_reg, reg_predictions)

print("\nRegression Task Results:-")
print(f"Mean Squared Error (MSE): {reg_mse}")
print(f"R2 Score: {reg_r2}")
print(f"Original Mean BloodPressure: {y_test_reg.mean()}")
print(f"Predicted Mean BloodPressure: {reg_predictions.mean()}")



Regression Task Results:-
Mean Squared Error (MSE): 116.02540344946478
R2 Score: 0.2049813437400695
Original Mean BloodPressure: 72.80868434515689
Predicted Mean BloodPressure: 72.9187443936565
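
The MSE above is in squared units of the target, which makes it hard to read directly. As a small worked example, the snippet below takes the two printed values and derives the root mean squared error (same units as BloodPressure) and the MSE implied for a baseline that always predicts the test-set mean, using the definition R2 = 1 - MSE_model / MSE_baseline. The variable names `rmse` and `baseline_mse` are introduced here for illustration.

```python
import math

# Values copied from the regression output above.
reg_mse = 116.02540344946478
reg_r2 = 0.2049813437400695

# RMSE is the square root of MSE, so it is in the same units as BloodPressure.
rmse = math.sqrt(reg_mse)

# R2 = 1 - MSE_model / MSE_baseline, hence MSE_baseline = MSE_model / (1 - R2).
# The baseline here is a model that always predicts the mean of y_test.
baseline_mse = reg_mse / (1 - reg_r2)

print(f"RMSE: {rmse:.2f}")
print(f"Implied baseline MSE: {baseline_mse:.2f}")
```

The RMSE comes out at roughly 10.8 and the implied baseline MSE at about 146, so the linear model removes only about a fifth of the baseline error, which is what an R2 near 0.20 says.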


### **Classification Task:**
### 1. Predict whether the patient has diabetes (target column: Outcome).
### 2. Use a Logistic Regression or K-Nearest Neighbors (KNN) model.
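
The worksheet solution uses Logistic Regression. For comparison, a minimal sketch of the KNN option might look like the following. It runs on synthetic data from `make_classification` rather than the diabetes data, and the choice of `n_neighbors=5`, the feature scaling step, and all variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the worksheet data (768 rows, 8 features).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=42
)

# Same split proportions as the worksheet cells.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# KNN is distance-based, so features are standardised before the classifier.
knn_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_model.fit(X_tr, y_tr)

knn_predictions = knn_model.predict(X_te)
knn_accuracy = accuracy_score(y_te, knn_predictions)
print(f"KNN accuracy on synthetic data: {knn_accuracy}")
```

To try this on the worksheet data, `X_demo` and `y_demo` would be replaced by the feature frame and the Outcome column.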

In [30]:
# Classification Task: Predict Outcome

# Features and target for classification
X_clf = df.drop(columns=["Outcome"])
y_clf = df["Outcome"]

# Train-test split
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(X_clf, y_clf, test_size=0.25, random_state=42)

# Training Logistic Regression model
clf_model = LogisticRegression(max_iter=1000)
clf_model.fit(X_train_clf, y_train_clf)

# Predicting and evaluating
clf_predictions = clf_model.predict(X_test_clf)
clf_accuracy = accuracy_score(y_test_clf, clf_predictions)
clf_precision = precision_score(y_test_clf, clf_predictions)
clf_recall = recall_score(y_test_clf, clf_predictions)
clf_f1 = f1_score(y_test_clf, clf_predictions)
clf_conf_matrix = confusion_matrix(y_test_clf, clf_predictions)

print("\nClassification Task Results:-")
print(f"Accuracy: {clf_accuracy}")
print(f"Precision: {clf_precision}")
print(f"Recall: {clf_recall}")
print(f"F1 Score: {clf_f1}")
print(f"Confusion Matrix:\n{clf_conf_matrix}")



Classification Task Results:-
Accuracy: 0.7291666666666666
Precision: 0.6349206349206349
Recall: 0.5797101449275363
F1 Score: 0.6060606060606061
Confusion Matrix:
[[100  23]
 [ 29  40]]
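
The four scores above can be recomputed by hand from the confusion matrix. In scikit-learn's layout for binary labels 0 and 1, rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. The snippet below is a small check using the printed counts; the variable names are introduced here for illustration.

```python
# Counts read from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 100, 23, 29, 40

total = tn + fp + fn + tp

# Share of all test patients that were classified correctly.
accuracy = (tp + tn) / total

# Of the patients predicted diabetic, the share that truly are.
precision = tp / (tp + fp)

# Of the truly diabetic patients, the share the model found.
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
```

The recall of about 0.58 corresponds to the 29 diabetic patients in the test set (out of 69) that the model labelled as non-diabetic.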
