
# Logistic Regression

In this example, we will use logistic regression to predict diabetes.

We will use the Pima Indians Diabetes dataset for classification.

See the [dataset page on Kaggle](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database) for more details.

# Set the root directory for processing.

In [None]:
import os

root_dir = '/content/'
os.chdir(root_dir)

!ls -al

total 16
drwxr-xr-x 1 root root 4096 Feb  6 14:23 .
drwxr-xr-x 1 root root 4096 Feb  8 11:08 ..
drwxr-xr-x 4 root root 4096 Feb  6 14:23 .config
drwxr-xr-x 1 root root 4096 Feb  6 14:23 sample_data


# Set the Kaggle API token.

### Create a Kaggle API token and upload the token file 'kaggle.json' to the '/content/' directory.

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = root_dir

In [None]:
!chmod 600 /content/kaggle.json

# Download the Kaggle dataset for further processing.

In [None]:
!kaggle datasets download -d uciml/pima-indians-diabetes-database

Downloading pima-indians-diabetes-database.zip to /content
  0% 0.00/8.91k [00:00<?, ?B/s]
100% 8.91k/8.91k [00:00<00:00, 22.9MB/s]


In [None]:
!ls -al

total 36
drwxr-xr-x 1 root root 4096 Feb  8 11:31 .
drwxr-xr-x 1 root root 4096 Feb  8 11:31 ..
drwxr-xr-x 4 root root 4096 Feb  6 14:23 .config
drwxr-xr-x 2 root root 4096 Feb  8 11:15 .ipynb_checkpoints
-rw-r--r-- 1 root root   73 Feb  8 11:30 kaggle.json
-rw-r--r-- 1 root root 9128 Sep 19  2019 pima-indians-diabetes-database.zip
drwxr-xr-x 1 root root 4096 Feb  6 14:23 sample_data


In [None]:
!unzip pima-indians-diabetes-database.zip

Archive:  pima-indians-diabetes-database.zip
  inflating: diabetes.csv            


In [None]:
!ls -al

total 60
drwxr-xr-x 1 root root  4096 Feb  8 11:32 .
drwxr-xr-x 1 root root  4096 Feb  8 11:31 ..
drwxr-xr-x 4 root root  4096 Feb  6 14:23 .config
-rw-r--r-- 1 root root 23873 Sep 19  2019 diabetes.csv
drwxr-xr-x 2 root root  4096 Feb  8 11:15 .ipynb_checkpoints
-rw-r--r-- 1 root root    73 Feb  8 11:30 kaggle.json
-rw-r--r-- 1 root root  9128 Sep 19  2019 pima-indians-diabetes-database.zip
drwxr-xr-x 1 root root  4096 Feb  6 14:23 sample_data


# Load the Pima Indians Diabetes dataset.

### Import required Python modules.

In [None]:
import pandas as pd

In [None]:
column_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

In [None]:
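# header=None reads the file's own header row as data, so drop it afterwards.
# Side effect: every column is parsed as strings (dtype object), not numbers.
pima_dataset = pd.read_csv('diabetes.csv', header=None, names=column_names)
pima_dataset = pima_dataset[1:]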

In [None]:
pima_dataset.head()

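| | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | label |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 2 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 3 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 4 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 5 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |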


### Show dataset schema.

In [None]:
pima_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 1 to 768
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   pregnant  768 non-null    object
 1   glucose   768 non-null    object
 2   bp        768 non-null    object
 3   skin      768 non-null    object
 4   insulin   768 non-null    object
 5   bmi       768 non-null    object
 6   pedigree  768 non-null    object
 7   age       768 non-null    object
 8   label     768 non-null    object
dtypes: object(9)
memory usage: 54.1+ KB
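
Every column in the schema above has dtype `object` because the file's header row was read as data before being dropped. The sketch below shows two possible ways to obtain numeric columns instead: passing `header=0` to `pd.read_csv`, or converting an already loaded frame with `pd.to_numeric`. The in-memory `csv_text` is a stand-in assumed to mirror the first lines of `diabetes.csv` (its header names are an assumption here), and `numeric_df`, `object_df`, and `converted_df` are illustrative names that do not appear in the notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for diabetes.csv: an assumed header row plus the
# first three records shown by head() above.
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""
column_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# Option 1: header=0 consumes the first line as a header, and names= replaces
# those labels, so the remaining rows are parsed as numbers.
numeric_df = pd.read_csv(io.StringIO(csv_text), header=0, names=column_names)
print(numeric_df.dtypes)

# Option 2: load as in the notebook, then convert each column to a numeric dtype.
object_df = pd.read_csv(io.StringIO(csv_text), header=None, names=column_names)[1:]
converted_df = object_df.apply(pd.to_numeric)
print(converted_df.dtypes)
```

With either option the row index and dtypes differ from the schema printed above, so the `info()` output would change accordingly.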


# Preprocess dataset.

### Check for missing values in the dataset.

In [None]:
pima_dataset.isnull().sum()

pregnant    0
glucose     0
bp          0
skin        0
insulin     0
bmi         0
pedigree    0
age         0
label       0
dtype: int64
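
`isnull()` reports no missing values, but the `head()` output earlier shows zeros in columns such as `skin` and `insulin`, which are implausible as real measurements and may be placeholders for missing data. A minimal sketch of counting such zeros follows; the `sample` frame holds only the five rows printed earlier, not the full dataset, and it assumes the columns are numeric (the notebook's frame stores strings, so it would need converting first).

```python
import pandas as pd

# Illustrative rows copied from the head() output above (not the full dataset).
sample = pd.DataFrame({
    'glucose': [148, 85, 183, 89, 137],
    'bp': [72, 66, 64, 66, 40],
    'skin': [35, 29, 0, 23, 35],
    'insulin': [0, 0, 0, 94, 168],
    'bmi': [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count zero entries per column; a zero here may stand for "not measured".
zero_counts = (sample == 0).sum()
print(zero_counts)
```

Whether to impute, drop, or keep these zeros is a modeling choice that the notebook does not address.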

### Define input variables (X) and an output variable (y).

In [None]:
feature_columns = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']

In [None]:
X = pima_dataset[feature_columns]
y = pima_dataset.label

In [None]:
print(X.shape)
print(y.shape)

(768, 7)
(768,)


### Import required Python modules and split the dataset into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
seed = 9
test_size = 0.25

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state = seed)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(576, 7)
(192, 7)
(576,)
(192,)
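
The split above is purely random, so the share of positive labels can differ between the training and test sets. One optional variation is to pass `stratify=` to `train_test_split`, which keeps the class proportions roughly equal in both parts. The sketch below uses synthetic data with the same shape as the notebook's `X` and `y`; `X_demo` and `y_demo` are hypothetical names, and the 268 positives and 500 negatives are the class totals implied by the confusion matrices reported in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 768 rows, 7 features, 268 positive and 500 negative labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 7))
y_demo = np.array([1] * 268 + [0] * 500)

# stratify=y_demo preserves the positive/negative ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=9, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # fraction of positives in each split
```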


# Create logistic regression model.

### Import required Python modules.

See the [scikit-learn `LogisticRegression` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for more details.

In [None]:
from sklearn.linear_model import LogisticRegression

### Create and fit the logistic regression model.

In [None]:
model = LogisticRegression(max_iter=500, random_state=16)
model.fit(X_train, y_train)
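
To make the fitted model less of a black box, the sketch below reproduces a binary logistic regression prediction by hand: it applies the sigmoid function to the linear score built from `coef_` and `intercept_`, and compares the result with `predict_proba`. It runs on synthetic data from `make_classification` rather than on the diabetes dataset, and `demo_model`, `X_demo`, and `y_demo` are illustrative names that are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem with 7 features (an assumption made
# only so that this example is self-contained).
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)
demo_model = LogisticRegression(max_iter=500).fit(X_demo, y_demo)

# Linear score z = w . x + b, then the sigmoid maps it to a probability.
z = X_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn.
p_sklearn = demo_model.predict_proba(X_demo)[:, 1]
print(np.allclose(p_manual, p_sklearn))

# Thresholding the probability at 0.5 gives the predicted class labels.
labels_manual = (p_manual > 0.5).astype(int)
print((labels_manual == demo_model.predict(X_demo)).all())
```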

# Evaluate the model.

### Import required Python modules.

In [None]:
from sklearn import metrics
from sklearn.metrics import classification_report

### Evaluate the model on the training dataset.

In [None]:
y_train_predict = model.predict(X_train)

confusion_matrix = metrics.confusion_matrix(y_train, y_train_predict)
confusion_matrix

array([[342,  35],
       [ 83, 116]])

### Evaluate the model on the test dataset.

In [None]:
y_test_predict = model.predict(X_test)

confusion_matrix = metrics.confusion_matrix(y_test, y_test_predict)
confusion_matrix

array([[102,  21],
       [ 27,  42]])

### Show classification report.

In [None]:
target_names = ['without diabetes', 'with diabetes']
print(classification_report(y_test, y_test_predict, target_names=target_names))

                  precision    recall  f1-score   support

without diabetes       0.79      0.83      0.81       123
   with diabetes       0.67      0.61      0.64        69

        accuracy                           0.75       192
       macro avg       0.73      0.72      0.72       192
    weighted avg       0.75      0.75      0.75       192
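
The figures in the report for the 'with diabetes' class can be traced back to the test-set confusion matrix shown earlier. The sketch below recomputes accuracy, precision, recall, and F1 from those four counts, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Test-set confusion matrix reported above.
# Rows: true class (without, with diabetes); columns: predicted class.
cm = np.array([[102, 21],
               [27, 42]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # share of all test rows classified correctly
precision = tp / (tp + fp)            # of predicted positives, how many are correct
recall = tp / (tp + fn)               # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

Rounded to two decimals, these values agree with the accuracy row and the 'with diabetes' row of the report.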

