# Logistic Regression

In this tutorial, we will build a logistic regression model to predict diabetes.

### Load Data

We will use the Pima Indians Diabetes Database, which can be found at https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

In [7]:
import pandas as pd

# Load the data from my github link
df = pd.read_csv("https://raw.githubusercontent.com/yangliuiuk/data/main/diabetes.csv")

# Explore the data
df.head()

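| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |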


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


### Dependent and Independent Variables
We aim to predict the 'Outcome' column from the other columns. The 'Outcome' column indicates whether a person has diabetes.

So 'Outcome' is the dependent variable, and the remaining columns are the independent variables.

In [13]:
X = df.drop('Outcome', axis=1)
y = df['Outcome']

### Data Exploration
For any classification problem, we first need to explore the class distribution. If the class distribution is imbalanced, the trained model may be biased toward the majority class.

For instance, if 80% of the data points are positive and 20% are negative, a model that simply predicts every data point as positive yields 80% accuracy. In this case, high accuracy does not mean the model performs well.
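
The 80/20 scenario can be checked with a few lines of plain Python. This is a minimal sketch on made-up labels (not the diabetes data): the "model" always predicts the positive class, and we compare its accuracy with its recall on the negative class.

```python
# Hypothetical labels: 80 positive (1) and 20 negative (0) cases.
y_true = [1] * 80 + [0] * 20

# A "model" that ignores its input and always predicts positive.
y_pred = [1] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the negative class: fraction of actual negatives predicted as negative.
preds_for_negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
negative_recall = sum(p == 0 for p in preds_for_negatives) / len(preds_for_negatives)

print("Accuracy:", accuracy)                      # 0.8
print("Negative-class recall:", negative_recall)  # 0.0
```

The accuracy looks respectable, yet the model never identifies a single negative case.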

If the class distribution is imbalanced, oversampling or undersampling can be used to rebalance it.

First, we check the distribution of the dependent variable 'Outcome'.

In [10]:
df['Outcome'].value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

### Oversampling
The result above shows 500 negative instances versus 268 positive instances, so the class distribution is imbalanced. We will use oversampling to address this problem.
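
To put a number on the imbalance, the class shares can be computed from the two counts printed above. This is a small sketch that hard-codes those counts; in pandas, `df['Outcome'].value_counts(normalize=True)` should give the same shares directly.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 500, 1: 268}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a baseline that always predicts the majority class.
majority_baseline = max(counts.values()) / total

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

On the original data, a model therefore needs to beat roughly 65% accuracy to do better than always predicting "no diabetes".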

Oversampling and undersampling are techniques for addressing class imbalance, where one class has significantly fewer samples than the other(s). Both balance the class distribution in the dataset, which can improve the performance of machine learning models, particularly on minority classes.

Oversampling involves increasing the number of instances in the minority class(es) by generating synthetic samples or replicating existing ones. This is done to balance the class distribution and provide the model with more examples of the minority class. 

Undersampling involves reducing the number of instances in the majority class(es) to balance the class distribution. This is typically done by randomly selecting a subset of samples from the majority class. Undersampling aims to mitigate the impact of class imbalance by reducing the dominance of the majority class. 

Both oversampling and undersampling have their advantages and drawbacks. Oversampling can increase the risk of overfitting, especially when using naive approaches like random oversampling. Undersampling can discard potentially useful information from the majority class, leading to loss of data. 
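
The two strategies can be sketched in plain Python to show what each does to the class counts. This is an illustrative sketch only: the functions `random_oversample` and `random_undersample` are hypothetical helpers written for this example, not part of any library, and row ids stand in for real feature vectors.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

def random_undersample(rows, labels, seed=42):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        kept = rng.sample(pool, target)  # sampling without replacement
        out_rows.extend(kept)
        out_labels.extend([cls] * target)
    return out_rows, out_labels

# Toy data with the same class counts as this dataset: 500 negatives, 268 positives.
labels = [0] * 500 + [1] * 268
rows = list(range(len(labels)))  # row ids stand in for feature vectors

x_over, y_over = random_oversample(rows, labels)
x_under, y_under = random_undersample(rows, labels)

print("oversampled: ", Counter(y_over))   # both classes at 500, 1000 rows in total
print("undersampled:", Counter(y_under))  # both classes at 268, 536 rows in total
```

Oversampling keeps every original row and adds 232 duplicates, while undersampling throws away 232 negative rows.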

Because this dataset is small, we prefer oversampling.

In [11]:
!pip install imbalanced-learn






In [17]:
from imblearn.over_sampling import RandomOverSampler

oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X, y)

X_resampled.info()
y_resampled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               1000 non-null   int64  
 1   Glucose                   1000 non-null   int64  
 2   BloodPressure             1000 non-null   int64  
 3   SkinThickness             1000 non-null   int64  
 4   Insulin                   1000 non-null   int64  
 5   BMI                       1000 non-null   float64
 6   DiabetesPedigreeFunction  1000 non-null   float64
 7   Age                       1000 non-null   int64  
dtypes: float64(2), int64(6)
memory usage: 62.6 KB
<class 'pandas.core.series.Series'>
RangeIndex: 1000 entries, 0 to 999
Series name: Outcome
Non-Null Count  Dtype
--------------  -----
1000 non-null   int64
dtypes: int64(1)
memory usage: 7.9 KB


### Split Data into Train and Test Sets

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
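
One point worth checking with this order of operations: random oversampling duplicates rows, so oversampling before the split can place copies of the same row in both the training and test sets, which may inflate test scores. The sketch below is a pure-Python illustration under simplifying assumptions, not this tutorial's pipeline: row ids stand in for feature vectors, and the helpers `oversample` and `split` are hypothetical. It counts how many test rows also appear in the training set under each ordering.

```python
import random

rng = random.Random(42)

# Row ids stand in for feature vectors; counts match this dataset (500 vs 268).
negatives = [("neg", i) for i in range(500)]
positives = [("pos", i) for i in range(268)]

def oversample(minority, target_size, rng):
    """Pad the minority class with random duplicates up to target_size."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def split(rows, test_fraction, rng):
    """Shuffle a copy of rows and hold out the last test_fraction as the test set."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[:-n_test], shuffled[-n_test:]

# Ordering A: oversample first, then split.
balanced = negatives + oversample(positives, len(negatives), rng)
train_a, test_a = split(balanced, 0.2, rng)
train_a_set = set(train_a)
leaked_a = sum(row in train_a_set for row in test_a)

# Ordering B: split first, then oversample the training portion only.
train_b, test_b = split(negatives + positives, 0.2, rng)
train_neg = [r for r in train_b if r[0] == "neg"]
train_pos = [r for r in train_b if r[0] == "pos"]
train_b = train_neg + oversample(train_pos, len(train_neg), rng)
train_b_set = set(train_b)
leaked_b = sum(row in train_b_set for row in test_b)

print("test rows also in train (oversample, then split):", leaked_a)
print("test rows also in train (split, then oversample):", leaked_b)
```

With scikit-learn and imbalanced-learn, the second ordering would correspond to calling `train_test_split` on `X` and `y` first and then `fit_resample` on the training portion only.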

### Build and Train a Logistic Regression Model

In [19]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression(random_state=16)

# fit the model with data
logreg.fit(X_train, y_train)
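
For intuition about what the fitted model computes: logistic regression passes a weighted sum of the features through the sigmoid function to get a probability, then predicts class 1 when that probability is at least 0.5. The sketch below uses made-up weights for two features purely for illustration; the real values live in `logreg.coef_` and `logreg.intercept_` after fitting.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, intercept):
    """Probability of the positive class for one feature vector x."""
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical coefficients for two features (Glucose, BMI); not fitted values.
weights = [0.03, 0.09]
intercept = -7.0

patient = [148, 33.6]  # Glucose and BMI of the first row in the dataset
p = predict_proba(patient, weights, intercept)
label = 1 if p >= 0.5 else 0

print(f"P(Outcome = 1) = {p:.3f}, predicted class = {label}")
```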

### Evaluate the Model

In [20]:
# Make predictions on testing set
y_pred = logreg.predict(X_test)

# Calculate confusion matrix
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

array([[72, 27],
       [26, 75]], dtype=int64)

In [23]:
# Calculate accuracy, precision, recall, and F1 score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Reuse y_pred from the previous cell; average='binary' scores the positive class (Outcome = 1)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", fscore)

Accuracy: 0.735
Precision: 0.7352941176470589
Recall: 0.7425742574257426
F1 Score: 0.7389162561576355
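
These four metrics can be reproduced by hand from the confusion matrix printed earlier. scikit-learn lays the matrix out with true labels as rows and predicted labels as columns, so for labels 0 and 1 it reads `[[TN, FP], [FN, TP]]`. A short check in plain Python:

```python
# Entries of the confusion matrix shown above: [[TN, FP], [FN, TP]].
tn, fp = 72, 27
fn, tp = 26, 75

accuracy = (tp + tn) / (tp + tn + fp + fn)          # correct predictions / all predictions
precision = tp / (tp + fp)                          # predicted positives that are truly positive
recall = tp / (tp + fn)                             # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```

The values agree with the scikit-learn output: 147 of 200 test cases are classified correctly, 75 of the 102 predicted positives are true positives, and 75 of the 101 actual positives are detected.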
