# Logistic Regression in Machine Learning

<img src="https://media5.datahacker.rs/2021/01/83.jpg" height=500 width=500>

## Problem Statement

> Use Machine Learning to predict a [person's heart health condition](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease), expressed as a binary variable (heart disease: yes/no).

### Logistic Regression:
   - **Type:** Classification algorithm.
   - **Output:** Predicts the probability of an instance belonging to a particular class (binary or multiclass).
   - **Use case:** Used when the target variable is categorical and represents two or more classes.
   - **Equation:** Applies the logistic function (sigmoid function) to a linear combination of input features.
   - **Evaluation:** Common evaluation metrics for binary classification include accuracy, precision, recall, and F1-score.

<img src="https://saedsayad.com/images/LogReg_1.png" height=500 width=500>

   Logistic Regression is widely used in binary classification problems, such as spam vs. non-spam email detection, disease diagnosis, or any scenario where the goal is to predict a categorical outcome.

> Linear Regression is used for predicting continuous outcomes, while Logistic Regression is used for classification problems where the outcome is categorical. Despite the name "regression" in Logistic Regression, it is primarily a classification algorithm, and its output is transformed through the logistic (sigmoid) function to ensure it falls between 0 and 1, representing probabilities.
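
To make the description above concrete, here is a minimal sketch of how a linear combination of features is passed through the sigmoid function to produce a probability. The weights, bias and feature values are hypothetical numbers chosen only for illustration; they are not taken from the model trained later in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias and feature values (illustration only)
weights = np.array([0.8, -0.5])
bias = 0.3
features = np.array([1.5, 2.0])

score = features @ weights + bias          # linear combination of the inputs
probability = sigmoid(score)               # estimated probability of the positive class
predicted_class = int(probability >= 0.5)  # threshold at 0.5 to get a class label

print(score, probability, predicted_class)
```

A score of 0 maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class when the threshold is 0.5.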

### Load the Data

In [67]:
import opendatasets as od
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

In [None]:
px.line, px.bar, px.pie

In [None]:
# matplotlib - large datasets
# seaborn    - specific graphs
# plotly     - interactive graphs
# folium     - geographical maps

In [68]:
od.download('https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease')

Skipping, found downloaded files in "./personal-key-indicators-of-heart-disease" (use force=True to force download)


In [69]:
df = pd.read_csv('personal-key-indicators-of-heart-disease/2020/heart_2020_cleaned.csv')

In [70]:
df

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.60,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319790,Yes,27.41,Yes,No,No,7.0,0.0,Yes,Male,60-64,Hispanic,Yes,No,Fair,6.0,Yes,No,No
319791,No,29.84,Yes,No,No,0.0,0.0,No,Male,35-39,Hispanic,No,Yes,Very good,5.0,Yes,No,No
319792,No,24.24,No,No,No,0.0,0.0,No,Female,45-49,Hispanic,No,Yes,Good,6.0,No,No,No
319793,No,32.81,No,No,No,0.0,0.0,No,Female,25-29,Hispanic,No,No,Good,12.0,No,No,No


### Clean the Data

In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   HeartDisease      319795 non-null  object 
 1   BMI               319795 non-null  float64
 2   Smoking           319795 non-null  object 
 3   AlcoholDrinking   319795 non-null  object 
 4   Stroke            319795 non-null  object 
 5   PhysicalHealth    319795 non-null  float64
 6   MentalHealth      319795 non-null  float64
 7   DiffWalking       319795 non-null  object 
 8   Sex               319795 non-null  object 
 9   AgeCategory       319795 non-null  object 
 10  Race              319795 non-null  object 
 11  Diabetic          319795 non-null  object 
 12  PhysicalActivity  319795 non-null  object 
 13  GenHealth         319795 non-null  object 
 14  SleepTime         319795 non-null  float64
 15  Asthma            319795 non-null  object 
 16  KidneyDisease     31

In [72]:
df.isna().sum()

HeartDisease        0
BMI                 0
Smoking             0
AlcoholDrinking     0
Stroke              0
PhysicalHealth      0
MentalHealth        0
DiffWalking         0
Sex                 0
AgeCategory         0
Race                0
Diabetic            0
PhysicalActivity    0
GenHealth           0
SleepTime           0
Asthma              0
KidneyDisease       0
SkinCancer          0
dtype: int64

In [9]:
#No missing data!

### Split the Data

In [10]:
from sklearn.model_selection import train_test_split

#train and test

### Bias-Variance Tradeoff

<img src="https://static.packt-cdn.com/products/9781788830577/graphics/b4c9fc0d-ce22-4c30-8215-492a805a8b8c.png" height=550 width=550>

The bias-variance tradeoff is a fundamental concept in machine learning that deals with the balance between the bias and variance of a model.

1. **Bias:**
   - Bias refers to the error introduced by approximating a real-world problem, which may be complex, by a simplified model.
   - High bias models are too simple and may not capture the underlying patterns in the data, leading to systematic errors.

2. **Variance:**
   - Variance refers to the model's sensitivity to small fluctuations in the training data.
   - High variance models are overly complex, capturing noise in the training data and performing well on training data but poorly on new, unseen data.

The tradeoff arises because increasing model complexity generally decreases bias but increases variance, and vice versa. Since both cannot be driven to zero at once, the goal is to find the level of model complexity at which their combined contribution to the error is smallest, leading to a model that generalizes well to new, unseen data.

- **High Bias (Underfitting):**
  - Occurs when the model is too simple to capture the underlying patterns in the data.
  - Results in poor performance on both the training and test datasets.

- **High Variance (Overfitting):**
  - Occurs when the model is too complex and captures noise in the training data.
  - Performs well on the training data but poorly on new, unseen data.

- **Balancing Bias and Variance:**
  - The goal is to find a model complexity at which the combined error from bias and variance is lowest.
  - This is often achieved by using regularization to constrain complexity and cross-validation to compare candidate models.

- **Model Selection:**
  - Depending on the problem and dataset, different models may exhibit different bias-variance tradeoffs.
  - It's crucial to choose a model that suits the specific requirements of the problem at hand.

In summary, the bias-variance tradeoff is a critical consideration in machine learning model development. Striking the right balance helps create models that generalize well to new data while avoiding underfitting or overfitting.
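
The effect described above can be sketched on synthetic data. The example below is an illustration only and is unrelated to the heart disease dataset: it fits polynomials of increasing degree to noisy samples of a sine curve (the degrees, sample sizes and noise level are arbitrary assumptions) and compares training and test error.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n noisy samples from a sine curve on [0, 2*pi]."""
    x = rng.uniform(0, 2 * np.pi, n)
    y = np.sin(x) + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = make_data(40)
x_test, y_test = make_data(200)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit of the given degree on the training data
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.4f}  test MSE={results[degree][1]:.4f}")
```

The straight line (degree 1) underfits and has high error on both sets. Training error keeps falling as the degree grows, while test error does not necessarily follow; that growing gap is the signature of high variance.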

### Train, Validation & Test Sets

In machine learning, datasets are typically split into three main subsets: the training set, the validation set, and the test set. Each of these subsets serves a specific purpose in training and evaluating a machine learning model:

1. **Training Set:**
   - The training set is the portion of the dataset used to train the machine learning model.
   - The model learns from the patterns, relationships, and features present in the training data.
   - A larger, more diverse training set helps the model generalize better to new, unseen data.

2. **Validation Set:**
   - The validation set is a separate subset used to tune the model's hyperparameters (settings chosen before training, such as regularization strength).
   - The model's performance on the validation set indicates how well it generalizes to new data and helps detect overfitting.
   - Hyperparameter tuning is an iterative process, and the validation set provides feedback on the model's performance during this process.

3. **Test Set:**
   - The test set is a completely independent subset that is not used during the training or validation phases.
   - It serves as a final evaluation of the model's performance after training and hyperparameter tuning.
   - The test set provides an unbiased assessment of how well the model is expected to perform on new, unseen data.

The purpose of these three sets is to simulate the real-world scenario where a model is trained on historical data, fine-tuned on some intermediate data, and then evaluated on completely new data. This approach helps to ensure that the model generalizes well and is not simply memorizing the training examples.

It's important to note that the sizes of these sets can vary based on the available data, but common splits include a significant portion for training (e.g., 70-80%), a smaller portion for validation (e.g., 10-15%), and a separate portion for testing (e.g., 10-15%). The exact split depends on factors such as the size of the dataset and the specific requirements of the machine learning task.

In [83]:
# randomly 80 rows - 1 set - 56, 78, 24, 56,...- train
#          20 rows - 1 set - 24, 65, 78 93,.... test
    
# randomly 80 rows - 2 set - 76, 90, 12, 43,...- train
# 20 rows - 1 set - 24, 65, 78 93,.... test

# randomly 80 rows - 3 set - 54, 90, 12, 43,...- train
# 20 rows - 1 set - 24, 65, 78 93,.... test

In [82]:
#train_test_split(df, test_size=0.2,random_state=30)

In [84]:
#train_df, test_df = train_test_split(original_df, test_size=0.2, random_state = 42)

In [85]:
#train_df, val_df = train_test_split(train_df, test_size = 0.25, random_state=42 )

In [88]:
df.shape

(319795, 18)

In [86]:
train_val_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)

In [89]:
print('train_df.shape :', train_df.shape)
print('val_df.shape :', val_df.shape)
print('test_df.shape :', test_df.shape)

train_df.shape : (191877, 18)
val_df.shape : (63959, 18)
test_df.shape : (63959, 18)
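
The two-step split above yields roughly 60% / 20% / 20% because the second call takes 25% of the remaining 80%, and 0.25 × 0.8 = 0.2. A small sketch on a hypothetical 1,000-row frame (`toy_df` is made up for illustration) shows the arithmetic:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame standing in for the real dataset
toy_df = pd.DataFrame({'row_id': range(1000)})

# Step 1: hold out 20% as the test set
train_val, test = train_test_split(toy_df, test_size=0.2, random_state=42)
# Step 2: take 25% of the remaining 80% as the validation set
train, val = train_test_split(train_val, test_size=0.25, random_state=42)

fractions = {name: len(part) / len(toy_df)
             for name, part in [('train', train), ('val', val), ('test', test)]}
print(fractions)
```

Fixing `random_state` makes the split reproducible, so repeated runs evaluate the model on the same held-out rows.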


In [13]:
train_df

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
197501,No,25.77,Yes,No,No,0.0,0.0,No,Male,80 or older,Hispanic,No,No,Fair,8.0,No,No,No
74357,No,27.89,Yes,No,No,0.0,0.0,No,Female,55-59,Black,No,Yes,Good,6.0,Yes,No,No
202820,No,27.32,Yes,No,No,0.0,0.0,No,Male,40-44,White,No,Yes,Very good,6.0,No,No,No
59127,Yes,29.35,No,No,Yes,10.0,5.0,No,Female,60-64,Black,Yes,Yes,Very good,5.0,No,No,No
253134,No,40.74,No,No,No,0.0,0.0,No,Female,65-69,White,Yes (during pregnancy),Yes,Very good,8.0,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
303013,No,24.41,No,No,No,0.0,0.0,No,Female,75-79,White,No,Yes,Excellent,10.0,No,No,No
279613,No,27.96,Yes,No,No,0.0,0.0,No,Female,50-54,White,No,No,Good,5.0,No,No,No
139081,No,35.51,No,No,No,2.0,0.0,No,Male,55-59,White,No,No,Fair,8.0,No,No,No
119308,No,29.86,No,No,No,0.0,6.0,No,Female,18-24,Black,No,No,Very good,6.0,No,No,No


In [92]:
list(train_df.columns[1:])

['BMI',
 'Smoking',
 'AlcoholDrinking',
 'Stroke',
 'PhysicalHealth',
 'MentalHealth',
 'DiffWalking',
 'Sex',
 'AgeCategory',
 'Race',
 'Diabetic',
 'PhysicalActivity',
 'GenHealth',
 'SleepTime',
 'Asthma',
 'KidneyDisease',
 'SkinCancer']

In [93]:
#train_df[['BMI','Smoking','']]

In [94]:
inputs = list(train_df.columns)[1:]
target_col = 'HeartDisease'

In [95]:
inputs

['BMI',
 'Smoking',
 'AlcoholDrinking',
 'Stroke',
 'PhysicalHealth',
 'MentalHealth',
 'DiffWalking',
 'Sex',
 'AgeCategory',
 'Race',
 'Diabetic',
 'PhysicalActivity',
 'GenHealth',
 'SleepTime',
 'Asthma',
 'KidneyDisease',
 'SkinCancer']

In [96]:
target_col

'HeartDisease'

Creating inputs and targets for the training, validation and test sets for model training.

In [97]:
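# `inputs` is the list of feature columns defined above;
# .copy() gives independent DataFrames that can be modified during scaling and encoding
train_inputs = train_df[inputs].copy()
train_targets = train_df[target_col].copy()

In [98]:
val_inputs = val_df[inputs].copy()
val_targets = val_df[target_col].copy()

In [99]:
test_inputs = test_df[inputs].copy()
test_targets = test_df[target_col].copy()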

### Scaling the Data

In [26]:
numerics = train_inputs.select_dtypes(include=np.number).columns.tolist()
categoricals = train_inputs.select_dtypes('object').columns.tolist()

In [27]:
numerics

['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']

In [28]:
categoricals

['Smoking',
 'AlcoholDrinking',
 'Stroke',
 'DiffWalking',
 'Sex',
 'AgeCategory',
 'Race',
 'Diabetic',
 'PhysicalActivity',
 'GenHealth',
 'Asthma',
 'KidneyDisease',
 'SkinCancer']

In [29]:
from sklearn.preprocessing import StandardScaler

`StandardScaler` is a preprocessing technique used in machine learning to standardize or scale the features of a dataset. Standardization rescales each feature so that it has a mean of 0 and a standard deviation of 1; it shifts and stretches the values but does not change the shape of their distribution. This is also known as z-score normalization.

The `StandardScaler` is applied independently to each feature and transforms the data according to the following formula for each feature:

<img src="https://i.imgur.com/swnWR5C.png" height=300 width=300>

The purpose of using `StandardScaler` is to ensure that the features are on a similar scale. This is particularly important for machine learning algorithms that rely on distances between data points, such as k-nearest neighbors or support vector machines. Standardizing the features helps prevent certain features from dominating the learning process due to their larger scales.
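
As a quick check of the formula, the sketch below compares `StandardScaler` with a manual z-score, z = (x − mean) / std, on the five BMI values from the first rows of the dataset preview. `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# BMI values from the first five rows shown in the dataset preview
bmi = np.array([[16.60], [20.34], [26.58], [24.21], [23.71]])

scaler = StandardScaler().fit(bmi)
scaled = scaler.transform(bmi)

# Manual z-score: subtract the mean, divide by the population standard deviation
manual = (bmi - bmi.mean(axis=0)) / bmi.std(axis=0)

print(np.allclose(scaled, manual))
```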


In [31]:
scaler = StandardScaler()

In [33]:
scaler.fit(df[numerics])

In [34]:
train_inputs[numerics] = scaler.transform(train_inputs[numerics])
val_inputs[numerics] = scaler.transform(val_inputs[numerics])
test_inputs[numerics] = scaler.transform(test_inputs[numerics])

In [35]:
train_inputs[numerics].describe()

Unnamed: 0,BMI,PhysicalHealth,MentalHealth,SleepTime
count,191877.0,191877.0,191877.0,191877.0
mean,0.000343,-0.002574,-0.000622,-0.000789
std,1.000658,0.995764,0.999401,0.999739
min,-2.565319,-0.42407,-0.490039,-4.245859
25%,-0.675793,-0.42407,-0.490039,-0.763977
50%,-0.155032,-0.42407,-0.490039,-0.067601
75%,0.490018,-0.172524,-0.112928,0.628776
max,10.466277,3.349118,3.281069,11.7708


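In this example:
- `fit` computes the mean and standard deviation of each numeric feature; it does not transform anything by itself. The cell above fits on the full `df`, which is why the training-set means and standard deviations shown are close to, but not exactly, 0 and 1. To keep validation and test statistics out of preprocessing, fit on the training inputs only (`scaler.fit(train_inputs[numerics])`).
- `transform` then scales the training, validation and test inputs using the parameters learned by `fit`.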

Using `StandardScaler` is a common preprocessing step in machine learning pipelines to ensure consistent and meaningful comparisons between features.
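
A minimal sketch of the leakage-free pattern: fit the scaler on the training split only, then reuse it for the other splits. The arrays here are synthetic (normally distributed values with an arbitrary mean and spread) and stand in for a numeric column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-ins for one numeric feature in the training and validation splits
train_x = rng.normal(loc=28.0, scale=6.0, size=(800, 1))
val_x = rng.normal(loc=28.0, scale=6.0, size=(200, 1))

scaler = StandardScaler()
scaler.fit(train_x)                      # statistics come from the training split only

train_scaled = scaler.transform(train_x)
val_scaled = scaler.transform(val_x)     # reuses the training mean and std

print(train_scaled.mean(), train_scaled.std())  # 0 and 1 up to floating-point error
print(val_scaled.mean(), val_scaled.std())      # close to, but not exactly, 0 and 1
```

Only the split used for fitting ends up with a mean of exactly 0 and a standard deviation of exactly 1; the other splits are merely close, which is expected.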

### Encoding the Data

When working with machine learning models, it's often necessary to encode categorical data into a numerical format, as many algorithms require numerical input. There are different techniques for encoding categorical data, and the choice depends on the nature of the categorical variables. Two common methods are Label Encoding and One-Hot Encoding:

1. **Label Encoding:**
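   - Label Encoding involves assigning a unique integer to each category or label.
   - Integer codes are suitable for ordinal categorical data, where there is an inherent order among the categories, provided the integers follow that order.
   - In scikit-learn, `LabelEncoder` numbers the classes in sorted order and is intended for target labels; `OrdinalEncoder` is its counterpart for input features.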

   ```python
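   import pandas as pd
   from sklearn.preprocessing import LabelEncoder

   # Illustrative column; any 1-D array of category labels works
   categorical_column = pd.Series(['Good', 'Fair', 'Very good', 'Good'])

   # Create a LabelEncoder object
   label_encoder = LabelEncoder()

   # Fit and transform the categorical column (classes are numbered in sorted order)
   encoded_labels = label_encoder.fit_transform(categorical_column)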
   ```

   The drawback of Label Encoding is that the integer codes imply an order, and a distance, between categories, which is misleading for nominal variables such as `Race`.

2. **One-Hot Encoding:**
   - One-Hot Encoding creates binary columns for each category and represents the presence or absence of a category with 1s and 0s.
   - It is suitable for nominal categorical data without any inherent order among categories.
   - Libraries like scikit-learn provide the `OneHotEncoder` class, and pandas has a convenient `get_dummies` function for this purpose.

   ```python
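   import pandas as pd

   # Illustrative column; any Series of category labels works
   categorical_column = pd.Series(['White', 'Black', 'Asian', 'White'])

   # Use get_dummies for one-hot encoding (one indicator column per category)
   one_hot_encoded = pd.get_dummies(categorical_column)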
   ```

   One-Hot Encoding avoids introducing ordinal relationships among categories but may lead to a high-dimensional dataset if there are many categories.

The choice between Label Encoding and One-Hot Encoding depends on the nature of the data and the requirements of the machine learning algorithm. Some libraries, such as LightGBM and CatBoost, can handle categorical features natively, whereas most scikit-learn estimators, including `LogisticRegression`, expect numeric input. Always check the documentation of the specific machine learning algorithm you are using to determine the most suitable approach for handling categorical variables.
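
For an ordinal feature, the integer codes should follow the real order of the categories. The sketch below uses scikit-learn's `OrdinalEncoder` with an explicit category list for `GenHealth`; the ordering from 'Poor' to 'Excellent' is an assumption about how the levels rank, and this notebook itself one-hot encodes the column instead.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# A few GenHealth values as they appear in the dataset
health = pd.DataFrame({'GenHealth': ['Very good', 'Fair', 'Good', 'Poor', 'Excellent']})

# Assumed ranking of the levels, from worst to best
order = ['Poor', 'Fair', 'Good', 'Very good', 'Excellent']

ordinal_encoder = OrdinalEncoder(categories=[order])
health['GenHealth_code'] = ordinal_encoder.fit_transform(health[['GenHealth']]).ravel()

print(health)
```

Without the explicit `categories` argument the levels would be numbered alphabetically, which would place 'Excellent' below 'Fair'.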

In [36]:
from sklearn.preprocessing import OneHotEncoder

In [42]:
# scikit-learn >= 1.2 names this parameter sparse_output; older releases use sparse=False
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

In [44]:
encoder.fit(df[categoricals])



In [45]:
encoded_cols = list(encoder.get_feature_names_out(categoricals))
print(encoded_cols)

['Smoking_No', 'Smoking_Yes', 'AlcoholDrinking_No', 'AlcoholDrinking_Yes', 'Stroke_No', 'Stroke_Yes', 'DiffWalking_No', 'DiffWalking_Yes', 'Sex_Female', 'Sex_Male', 'AgeCategory_18-24', 'AgeCategory_25-29', 'AgeCategory_30-34', 'AgeCategory_35-39', 'AgeCategory_40-44', 'AgeCategory_45-49', 'AgeCategory_50-54', 'AgeCategory_55-59', 'AgeCategory_60-64', 'AgeCategory_65-69', 'AgeCategory_70-74', 'AgeCategory_75-79', 'AgeCategory_80 or older', 'Race_American Indian/Alaskan Native', 'Race_Asian', 'Race_Black', 'Race_Hispanic', 'Race_Other', 'Race_White', 'Diabetic_No', 'Diabetic_No, borderline diabetes', 'Diabetic_Yes', 'Diabetic_Yes (during pregnancy)', 'PhysicalActivity_No', 'PhysicalActivity_Yes', 'GenHealth_Excellent', 'GenHealth_Fair', 'GenHealth_Good', 'GenHealth_Poor', 'GenHealth_Very good', 'Asthma_No', 'Asthma_Yes', 'KidneyDisease_No', 'KidneyDisease_Yes', 'SkinCancer_No', 'SkinCancer_Yes']


In [46]:
train_inputs[encoded_cols] = encoder.transform(train_inputs[categoricals])
val_inputs[encoded_cols] = encoder.transform(val_inputs[categoricals])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categoricals])

### Fit the Model

In [47]:
from sklearn.linear_model import LogisticRegression

In [48]:
model = LogisticRegression()

In [49]:
model.fit(train_inputs[numerics + encoded_cols], train_targets)

In [54]:
print(model.coef_)

[[ 0.05487292  0.02431454  0.04437473 -0.03430672 -0.3480692   0.00787059
  -0.03956402 -0.30063459 -0.67225653  0.33205792 -0.28541285 -0.05478576
  -0.53634811  0.1961495  -1.58209827 -1.35869036 -1.25110455 -1.04516023
  -0.65242699 -0.33912195  0.11148421  0.3097046   0.58788235  0.84494886
   1.09911505  1.35041701  1.58485167  0.20500549 -0.3781326  -0.20331953
  -0.06935054  0.03465911  0.07093947 -0.29415981 -0.12533575  0.20089679
  -0.12159985 -0.18725052 -0.1529481  -1.06192193  0.47897968 -0.00790469
   0.82539075 -0.57474241 -0.29470069 -0.04549792 -0.46286488  0.12266627
  -0.23023018 -0.10996843]]


In [52]:
print(model.intercept_)

[-0.35362575]
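
The coefficients and intercept are all a binary logistic regression model needs to make a prediction: the probability of the positive class is the sigmoid of the linear score. The sketch below checks this on a synthetic dataset generated with `make_classification`; the names `X`, `y` and `clf` are illustrative and separate from the model trained above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustration only)
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

# Linear score: one weight per feature plus the intercept
scores = X @ clf.coef_.ravel() + clf.intercept_[0]
# Sigmoid turns the score into the probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-scores))

# Largest difference between the manual result and scikit-learn's own probabilities
print(np.max(np.abs(manual_proba - clf.predict_proba(X)[:, 1])))
```

A positive coefficient raises the predicted probability as its feature increases, and a negative one lowers it. Magnitudes are only comparable across features that share a scale, which is one reason the numeric inputs were standardized.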


### Evaluate the Model

In [55]:
Train = train_inputs[numerics + encoded_cols]
Val = val_inputs[numerics + encoded_cols]
Test = test_inputs[numerics + encoded_cols]

In [64]:
test_preds = model.predict(Test)

In [58]:
from sklearn.metrics import accuracy_score

In [65]:
accuracy_score(test_targets, test_preds)  # signature is (y_true, y_pred)

0.9138979658843946
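
Accuracy alone can be misleading when one class dominates, because a model that always predicts the majority class already scores well. The sketch below uses ten invented labels (not results from this notebook) to compare accuracy against an "always No" baseline and to compute the precision, recall and F1-score mentioned at the start.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Invented labels: 8 negatives and 2 positives
y_true = ['No'] * 8 + ['Yes'] * 2
# A model that finds only one of the two positive cases
y_pred = ['No'] * 8 + ['No', 'Yes']
# Baseline that always predicts the majority class
always_no = ['No'] * 10

acc = accuracy_score(y_true, y_pred)
baseline_acc = accuracy_score(y_true, always_no)
prec = precision_score(y_true, y_pred, pos_label='Yes')
rec = recall_score(y_true, y_pred, pos_label='Yes')
f1 = f1_score(y_true, y_pred, pos_label='Yes')

print(f"accuracy={acc:.2f}  baseline={baseline_acc:.2f}")
print(f"precision={prec:.2f}  recall={rec:.2f}  f1={f1:.2f}")
```

Here the model is 90% accurate yet misses half of the positive cases (recall 0.5), and the trivial baseline already reaches 80%. Applying the same metrics to `test_targets` and `test_preds` would show how much of the accuracy above comes from the majority class.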

### Exercise

**Dataset:**
- The dataset is available here: https://www.kaggle.com/datasets/saurabhbagchi/dish-network-hackathon?select=Train_Dataset.csv

**Tasks:**
1. Load the dataset and explore its structure.
2. Preprocess the data: 
    - Handle missing values
    - Scale numerical variables
    - Encode categorical variables
    - Split the data into training, validation and testing sets.
3. Implement Logistic Regression using a suitable library (e.g., scikit-learn).
4. Train the model on the training set.