# Logistic Regression in scikit-learn 

## Introduction 

In this lab, I am going to fit a logistic regression model to a dataset concerning heart disease. Whether or not a patient has heart disease is indicated in the column labeled `'target'`: 1 indicates heart disease, while 0 indicates no heart disease.

## Objectives

In this lab I will: 

- Fit a logistic regression model using scikit-learn 
- Practice using `Pipeline` to make my code neater and reduce the number of lines


## Let's get started!

The following cells import the necessary functions and load the dataset:

In [1]:
# Import necessary functions
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [2]:
# Import data
df = pd.read_csv('heart.csv')
df.head()

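| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |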


## Define appropriate `X` and `y` 

Recall that the `'target'` column indicates whether or not a patient has heart disease. With that in mind, define appropriate `X` (predictors) and `y` (target) to model whether or not a patient has heart disease.

In [3]:
# Split the data into target and predictors
y = df['target']
X = df.drop(['target'], axis=1)

## Normalize the data 

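Normalize the data (`X`) prior to fitting the model, so that each predictor has mean 0 and standard deviation 1. Note that the scaler in the next cell is fit on all rows, before the train-test split, so the test rows also influence the scaling statistics.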

In [4]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
X_norms = pd.DataFrame(X_normalized, columns=X.columns)
X_norms.head()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.952197 | 0.681005 | 1.973123 | 0.763956 | -0.256334 | 2.394438 | -1.005832 | 0.015443 | -0.696631 | 1.087338 | -2.274579 | -0.714429 | -2.148873 |
| 1 | -1.915313 | 0.681005 | 1.002577 | -0.092738 | 0.072199 | -0.417635 | 0.898962 | 1.633471 | -0.696631 | 2.122573 | -2.274579 | -0.714429 | -0.512922 |
| 2 | -1.474158 | -1.468418 | 0.032031 | -0.092738 | -0.816773 | -0.417635 | -1.005832 | 0.977514 | -0.696631 | 0.310912 | 0.976352 | -0.714429 | -0.512922 |
| 3 | 0.180175 | 0.681005 | 0.032031 | -0.663867 | -0.198357 | -0.417635 | 0.898962 | 1.239897 | -0.696631 | -0.206705 | 0.976352 | -0.714429 | -0.512922 |
| 4 | 0.290464 | -1.468418 | -0.938515 | -0.663867 | 2.08205 | -0.417635 | 0.898962 | 0.583939 | 1.435481 | -0.379244 | 0.976352 | -0.714429 | -0.512922 |
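
To make the normalization step concrete, here is a minimal sketch of what `StandardScaler` computes. It assumes a small toy array (`toy`, with hypothetical values) rather than the heart disease data. Each column is centered on its mean and divided by its population standard deviation (`ddof=0`), so every scaled column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales
toy = np.array([[50.0, 200.0],
                [60.0, 240.0],
                [40.0, 180.0],
                [70.0, 260.0]])

# StandardScaler learns each column's mean and standard deviation, then rescales
scaled = StandardScaler().fit_transform(toy)

# The same z-score computed by hand: (x - mean) / std, using the population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```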


## Train-test split

- Split the data into training and test sets 
- Assign 25% to the test set 
- Set the `random_state` to 0 

In [5]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_norms, y, test_size=0.25, random_state=0)

In [6]:
X_train.head()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 173 | 0.400752 | 0.681005 | 1.002577 | 0.021488 | -0.430263 | -0.417635 | -1.005832 | 1.021244 | -0.696631 | 1.863764 | 0.976352 | 1.244593 | 1.123029 |
| 261 | -0.26098 | 0.681005 | -0.938515 | -1.12077 | -0.31431 | -0.417635 | 0.898962 | 0.452748 | -0.696631 | -0.896862 | 0.976352 | 0.265082 | -0.512922 |
| 37 | -0.040403 | 0.681005 | 1.002577 | 1.04952 | -0.275659 | -0.417635 | -1.005832 | 0.6714 | -0.696631 | 0.483451 | 0.976352 | -0.714429 | 1.123029 |
| 101 | 0.511041 | 0.681005 | 1.973123 | 2.648682 | 0.458709 | -0.417635 | -1.005832 | -0.20321 | -0.696631 | 2.72646 | -2.274579 | -0.714429 | 1.123029 |
| 166 | 1.393352 | 0.681005 | -0.938515 | -0.663867 | -0.333636 | -0.417635 | -1.005832 | -0.902898 | 1.435481 | 1.346147 | -0.649113 | 1.244593 | 1.123029 |
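
The effect of `test_size` and `random_state` can be checked on a small stand-in dataset. The sketch below assumes a hypothetical 100-row frame (`toy_X`, `toy_y`), not the heart disease data: 25% of the rows go to the test set, and repeating the split with the same `random_state` selects the same rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 2 predictors, alternating binary target
toy_X = pd.DataFrame({'a': np.arange(100), 'b': np.arange(100) * 2})
toy_y = pd.Series(np.arange(100) % 2, name='target')

# Same arguments as in the lab: 25% test set, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.25, random_state=0)
print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# Splitting again with the same random_state selects the same rows
X_tr2, X_te2, _, _ = train_test_split(toy_X, toy_y, test_size=0.25, random_state=0)
print(X_tr.index.equals(X_tr2.index))  # True

# Original row labels are kept, which is why X_train.head() shows a shuffled index
print(X_tr.index.equals(y_tr.index))  # True
```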


## Fit a model

- Instantiate `LogisticRegression`
  - Exclude the intercept by setting `fit_intercept=False`
  - Set `C` to a very large number such as `1e12`, which effectively turns off regularization
  - Use the `'liblinear'` solver 
- Fit the model to the training data 

In [7]:
# Instantiate the model
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')

# Fit the model
model = logreg.fit(X_train, y_train)
model


LogisticRegression(C=1000000000000.0, fit_intercept=False, solver='liblinear')
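
One way to see why `C` is set so high: in scikit-learn, `C` is the inverse of the regularization strength, so a very large value leaves the coefficients essentially unpenalized. The sketch below assumes synthetic data from `make_classification` (not the heart disease data) and an arbitrarily chosen small value, `C=0.01`, for contrast. It compares the size of the fitted coefficients under strong and negligible regularization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy binary problem (an assumption for illustration only)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, n_informative=3,
                                   n_redundant=0, flip_y=0.1, random_state=0)

# Strong regularization (small C) vs. effectively none (very large C)
penalized = LogisticRegression(fit_intercept=False, C=0.01, solver='liblinear').fit(X_syn, y_syn)
unpenalized = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear').fit(X_syn, y_syn)

# The default L2 penalty shrinks coefficients toward zero, so a smaller C gives a smaller norm
print(np.linalg.norm(penalized.coef_), np.linalg.norm(unpenalized.coef_))
```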

## Predict
Generate predictions for the training and test sets. 

In [8]:
# Generate predictions
y_hat_train = model.predict(X_train)
y_hat_test = model.predict(X_test)
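
`predict` returns hard class labels. For a binary logistic regression, these correspond to thresholding the predicted probability of class 1 at 0.5. The sketch below checks this on synthetic data (an assumption for illustration; `X_syn` and `y_syn` are not the heart disease data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, for illustration only
X_syn, y_syn = make_classification(n_samples=200, n_features=5, flip_y=0.1, random_state=0)

clf = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear').fit(X_syn, y_syn)

# Column 1 of predict_proba holds the estimated probability of class 1
proba_pos = clf.predict_proba(X_syn)[:, 1]

# Hard labels are the probabilities thresholded at 0.5
labels_from_proba = (proba_pos > 0.5).astype(int)
print(np.array_equal(labels_from_proba, clf.predict(X_syn)))
```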

## How many times was the classifier correct on the training set?

In [9]:
# Compute accuracy on the training set, as a percentage
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_train, y_hat_train)*100
accuracy


84.58149779735683
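
`accuracy_score` returns the fraction of correct predictions, so multiplying by 100 gives a percentage rather than a count. To answer "how many times" literally, one option is to count the matches directly or to pass `normalize=False`. A minimal sketch with hypothetical labels (`y_true` and `y_pred` are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for eight patients
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Count of correct predictions, computed two equivalent ways
n_correct = int((y_true == y_pred).sum())
n_correct_sklearn = int(accuracy_score(y_true, y_pred, normalize=False))
print(n_correct, n_correct_sklearn)  # 6 6

# Fraction correct, as used in the lab before multiplying by 100
print(accuracy_score(y_true, y_pred))  # 0.75
```

In the lab, the same calls would take `y_train` and `y_hat_train` (or `y_test` and `y_hat_test`) as arguments.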

## How many times was the classifier correct on the test set?

In [10]:
# Compute accuracy on the test set, as a percentage
accuracy = accuracy_score(y_test, y_hat_test)*100
accuracy


82.89473684210526

## Logistic Regression Pipeline

In [11]:
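from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Chain scaling and logistic regression into a single estimator
model = Pipeline([
    ('preprocessor', ColumnTransformer(
        transformers=[
            ('num_transform', StandardScaler(), X.columns)
        ],
    )),
    ('logreg', LogisticRegression(C=1e12, fit_intercept=False, solver='liblinear'))
])

# Note: X_train and X_test were already standardized above,
# so the scaler inside the pipeline is applied a second time here
model_pipe = model.fit(X_train, y_train)

# Accuracy on the test set, as a percentage
accuracy = model_pipe.score(X_test, y_test)*100
accuracy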


82.89473684210526
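
A pipeline is most useful when it receives the raw, unscaled predictors, because the scaler is then fit on the training rows only and the same learned scaling is applied to the test rows. The sketch below assumes synthetic data from `make_classification` in place of `heart.csv`, and checks that the pipeline gives the same test-set predictions as scaling by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the raw predictors and target (illustration only)
X_raw, y_raw = make_classification(n_samples=300, n_features=6, flip_y=0.1, random_state=0)

# Split first, so the test rows play no part in fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.25, random_state=0)

# Pipeline: fit() scales with training statistics, then fits the classifier
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')),
])
pipe.fit(X_tr, y_tr)

# The same steps written out by hand
scaler = StandardScaler().fit(X_tr)
manual = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
manual.fit(scaler.transform(X_tr), y_tr)

# Both routes should produce identical test-set predictions
same = np.array_equal(pipe.predict(X_te), manual.predict(scaler.transform(X_te)))
print(same, pipe.score(X_te, y_te))
```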

## Summary

In this lab, I practiced a standard data science workflow: importing data, normalizing it, splitting it into training and test sets, and fitting a logistic regression model, both directly and through a `Pipeline`.