# Logistic Regression in scikit-learn - Lab

## Introduction 

In this lab, you are going to fit a logistic regression model to a dataset concerning heart disease. Whether or not a patient has heart disease is indicated in the column labeled `'target'`. 1 is for positive for heart disease while 0 indicates no heart disease.

## Objectives

In this lab you will: 

- Fit a logistic regression model using scikit-learn 


## Let's get started!

Run the following cells that import the necessary functions and import the dataset: 

In [24]:
# Import necessary functions
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np


In [25]:
# Import data
df = pd.read_csv('heart.csv')
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


## Define appropriate `X` and `y` 

Recall the dataset contains information about whether or not a patient has heart disease and is indicated in the column labeled `'target'`. With that, define appropriate `X` (predictors) and `y` (target) in order to model whether or not a patient has heart disease.

In [26]:
# Split the data into target and predictors
y = df['target']
X = df.drop('target',axis = 1)

## Train- test split 

- Split the data into training and test sets 
- Assign 25% to the test set 
- Set the `random_state` to 0 

N.B. To avoid possible data leakage, it is best to split the data first, and then normalize.

In [27]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size =0.25, random_state = 0)
X_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
173,58,1,2,132,224,0,0,173,0,3.2,2,2,3
261,52,1,0,112,230,0,1,160,0,0.0,2,1,2
37,54,1,2,150,232,0,0,165,0,1.6,2,0,3
101,59,1,3,178,270,0,0,145,0,4.2,0,0,3
166,67,1,0,120,229,0,0,129,1,2.6,1,2,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,43,1,0,132,247,1,0,143,1,0.1,1,4,3
192,54,1,0,120,188,0,1,113,0,1.4,1,1,3
117,56,1,3,120,193,0,0,162,0,1.9,1,0,3
47,47,1,2,138,257,0,0,156,0,0.0,2,0,2


## Normalize the data 

Normalize the data (`X`) prior to fitting the model. 

In [36]:
# Your code here
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.fit(X_train)

X_train = pd.DataFrame(scaler.transform(X_train),index = X_train.index, columns = X_train.columns)

X_train.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
173,0.604167,1.0,0.666667,0.387755,0.214781,0.0,0.0,0.778626,0.0,0.516129,1.0,0.5,1.0
261,0.479167,1.0,0.0,0.183673,0.228637,0.0,0.5,0.679389,0.0,0.0,1.0,0.25,0.666667
37,0.520833,1.0,0.666667,0.571429,0.233256,0.0,0.0,0.717557,0.0,0.258065,1.0,0.0,1.0
101,0.625,1.0,1.0,0.857143,0.321016,0.0,0.0,0.564885,0.0,0.677419,0.0,0.0,1.0
166,0.791667,1.0,0.0,0.265306,0.226328,0.0,0.0,0.442748,1.0,0.419355,0.5,0.5,1.0


## Fit a model

- Instantiate `LogisticRegression`
  - Make sure you don't include the intercept  
  - set `C` to a very large number such as `1e12` 
  - Use the `'liblinear'` solver 
- Fit the model to the training data 

In [37]:
# Instantiate the model
logreg = LogisticRegression(fit_intercept = False,C = 1e12, solver = 'liblinear')


# Fit the model
model= logreg.fit(X_train,y_train)


## Predict
Generate predictions for the training and test sets. 

In [39]:
# Generate predictions
y_hat_train = logreg.predict(X_train)

y_hat_test = logreg.predict(X_test)

## How many times was the classifier correct on the training set?

In [40]:
# Your code here
train_residuals =np.abs(y_train -y_hat_train)
print(pd.Series(train_residuals, name="Residuals (counts)").value_counts())
print()
print(pd.Series(train_residuals, name="Residuals (proportions)").value_counts(normalize=True))


Residuals (counts)
0    194
1     33
Name: count, dtype: int64

Residuals (proportions)
0    0.854626
1    0.145374
Name: proportion, dtype: float64


## How many times was the classifier correct on the test set?

In [42]:
# Your code here
test_residuals =np.abs(y_test -y_hat_test)
print(pd.Series(test_residuals, name="Residuals (counts)").value_counts())
print()
print(pd.Series(test_residuals, name="Residuals (proportions)").value_counts(normalize=True))

Residuals (counts)
0    53
1    23
Name: count, dtype: int64

Residuals (proportions)
0    0.697368
1    0.302632
Name: proportion, dtype: float64


## Analysis
Describe how well you think this initial model is performing based on the training and test performance. Within your description, make note of how you evaluated performance as compared to your previous work with regression.

In [34]:
# Your analysis here
"""
Counts: The model made 53 correct predictions and 23 incorrect predictions on the test set.
Proportions: About 69.74% of the predictions were correct, while 30.26% were incorrect.
"""

In [None]:
"""
Training Performance:

The model performed well on the training set, with about 85.46% of the predictions being correct. This indicates that the model has learned the training data effectively.

Test Performance:
The performance on the test set is lower, with only 69.74% of the predictions being correct. This drop in accuracy suggests that the model may be overfitting the training data, especially given the high value of the regularization parameter (C=1e12), which reduces regularization.

Comparison to Regression:
In regression tasks, performance is  evaluated using metrics like Mean Squared Error (MSE) or R-squared, which provide a continuous measure of how well the model predicts the target variable. In classification tasks, we evaluate performance using accuracy, precision, recall, and F1-score.

The use of residuals in classification (0 for correct predictions and 1 for incorrect predictions) is a different approach compared to regression, where residuals are continuous values representing the difference between predicted and actual values.

Conclusion
Overall, while the initial logistic regression model shows good performance on the training set, the significant drop in accuracy on the test set indicates potential overfitting. Further tuning of hyperparameters, feature selection, or trying different models may be necessary to improve generalization to unseen data.

"""

## Summary

In this lab, you practiced a standard data science pipeline: importing data, split it into training and test sets, and fit a logistic regression model. In the upcoming labs and lessons, you'll continue to investigate how to analyze and tune these models for various scenarios.