# General Modeling Framework: Simple
## Logistic Regression with Boston Housing

-------------------------------------------------

This is a bare-bones script to get you up and running with classification. For a given dataset, you should be able to fill in the cells below with the relevant code.

Note that this 'recipe' is similar for a regression problem: you just need to use a different model and different error metrics.

## Getting started
Import modules, mount Drive, read in the data, check data types and missing values. You may also do some light EDA prior to modeling.

Notice how we are using the same functions over and over again... it really is like following a recipe.


In [None]:
# import modules we need for EDA and wrangling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# import other functions we'll need for classification modeling
from sklearn.preprocessing import  MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression # logistic

# classification error metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
# mount your google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# read in some data
df = pd.read_csv('/content/drive/My Drive/Summer 2020/Week 3 Materials: Stats and Regression Modeling/Data/Boston Housing.csv')

In [None]:
# data type, shape and columns

print("This is the shape :\n", df.shape, '\n') # escape characters are fun! \n adds a line break
print("These are the column names: \n", df.columns, '\n') # helps keep things nice and clean
print("These are the data types: \n", df.dtypes)
print("\nThis is the head:") # see how I can pop that \n anywhere?
df.head()   # also note how much stuff I have pasted in ONE CELL...
            # now you are cooking with gas!

# the dtypes and the head confirm to us that every column is numeric
# so there is nothing we need to convert before modeling

This is the shape :
 (506, 14) 

These are the column names: 
 Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'b', 'lstat', 'medv'],
      dtype='object') 

These are the data types: 
 crim       float64
zn         float64
indus      float64
chas         int64
nox        float64
rm         float64
age        float64
dis        float64
rad          int64
tax          int64
ptratio    float64
b          float64
lstat      float64
medv       float64
dtype: object

This is the head:


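| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |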


In [None]:
# here's another cool way to get a lot of this info AND MORE
df.info() # gives you missing values report too - this is nice complete data

# data types, shape, missing values per column.... pandas rocks...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int64  
 4   nox      506 non-null    float64
 5   rm       506 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int64  
 9   tax      506 non-null    int64  
 10  ptratio  506 non-null    float64
 11  b        506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB
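
If you want the missing-value count as numbers you can work with, rather than reading it off the `df.info()` printout, `isna().sum()` is one common way to get it. The sketch below uses a small made-up frame (the `toy` name and its values are hypothetical, not the Boston data) so that there is actually something missing to count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (NOT the Boston data) with one missing value in 'rm'
toy = pd.DataFrame({"rm": [6.5, 6.4, np.nan], "age": [65.2, 78.9, 61.1]})

# Count of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(toy.isna().any().any())
print("Any missing values at all?", has_missing)
```

On the Boston data above, every column reports 506 non-null values, so the same check should come back as all zeros.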


## Data splitting
Subset your data into X features and a Y target variable for modeling. Then use train_test_split for data splitting (80/20 is very common); don't forget the random seed and shuffle. Finally, convert the resulting pieces to numpy arrays.

In [None]:
# one extra step here - we are making up our own problem.
# we want to predict if a house price is greater than the median
# so I will use a numpy.where() statement to do this
# (note that this overwrites the original 'medv' prices with 0/1 labels)
df['medv'] = np.where(df['medv'] > df['medv'].median(),
                      1, # if true
                      0) # if false

# check your work - looks good
df['medv'].head(n=10)

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    0
9    0
Name: medv, dtype: int64
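
It is worth checking how balanced the two classes are after this step. Because the comparison is a strict `>`, any price exactly equal to the median lands in class 0, so the split is not guaranteed to be 50/50. Here is a minimal sketch on made-up prices (the `prices` values are hypothetical and chosen to include ties at the median):

```python
import numpy as np
import pandas as pd

# Hypothetical prices (NOT the Boston data), with three ties at the median
prices = pd.Series([10.0, 20.0, 20.0, 20.0, 30.0])

median = prices.median()                   # 20.0
labels = np.where(prices > median, 1, 0)   # strict ">" sends ties to class 0

print(labels)                                          # [0 0 0 0 1]
print(pd.Series(labels).value_counts().sort_index())   # 4 zeros, 1 one
```

Running `df['medv'].value_counts()` on the real data gives you the same kind of check in one line.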

In [None]:
# the target variable is Y
# we know that this is 'medv'
Y = df['medv']
print(Y.shape) # a single column with 506 rows

(506,)


In [None]:
# everything else is X 
# so just drop 'medv' and you are done
X = df.drop('medv', axis=1)
print(X.shape) # note that we have gone from 14 to 13 columns, this is good! 506 rows.

(506, 13)


In [None]:
# now, split the data in ONE LINE OF CODE
# notice how we are assigning four different variables at once
# this makes it really clean

# be careful of capital vs. lowercase X and Y, you might get an error...
# notice the 80/20 split we perform

X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size = 0.2,
                                                    shuffle = True,
                                                    random_state = 42)

In [None]:
# check your work - does the shape match what you think it should be?
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(404, 13) (102, 13) (404,) (102,)
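
Why bother with `random_state`? With the same seed, the shuffle picks the same rows every time, so your results are reproducible. A small sketch, using random stand-in data with the same shape as ours (the `X_demo` and `y_demo` arrays are hypothetical, not the Boston data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 506 rows and 13 features, like the shapes above
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.integers(0, 2, size=506)

# Split twice with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=True, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=True, random_state=42)

# Same seed -> the four pieces are identical, row for row
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)
```

Change the seed in one of the two calls and the pieces will generally no longer match.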


In [None]:
# convert these all to numpy arrays
X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train)
y_test = np.array(y_test)

## Min/Max Scaling
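This will ensure all of your X training data is between 0 (min) and 1 (max). You will use fit_transform() on the train data first, then transform() (not fit!) on the test data, so the test set is scaled with the min and max learned from the training set. Do this step after splitting: if you fit the scaler on the full dataset before splitting, information from the test set leaks into training (data leakage).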

Only scale the X data, not the Y data!

Do yourself a favor and just overwrite X_train and X_test when scaling, as I do below. Min/max scaling requires that all of the data are numeric; numpy arrays work well as input.


**Like this example:**
```
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
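
To see what the scaler is doing, here is a minimal sketch that applies the min/max formula, (x - min) / (max - min), by hand and compares it with `MinMaxScaler`. The single-feature arrays are made up for illustration (not the Boston data). Notice that the min and max come from the training data only, so scaled test values can fall outside 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data (NOT the Boston data)
X_train_demo = np.array([[2.0], [4.0], [10.0]])
X_test_demo = np.array([[1.0], [6.0], [12.0]])

# Min and max are learned from the TRAINING data only
col_min = X_train_demo.min(axis=0)
col_max = X_train_demo.max(axis=0)

# Apply the same formula, with the same train min/max, to both sets
train_scaled = (X_train_demo - col_min) / (col_max - col_min)
test_scaled = (X_test_demo - col_min) / (col_max - col_min)

# Compare against sklearn's scaler fitted on the training data
scaler_demo = MinMaxScaler().fit(X_train_demo)
print(np.allclose(train_scaled, scaler_demo.transform(X_train_demo)))
print(np.allclose(test_scaled, scaler_demo.transform(X_test_demo)))
print(test_scaled.ravel())  # [-0.125  0.5    1.25 ]
```

A test value below the training min or above the training max is not an error; it just means the test set contains values the training set never saw.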


In [None]:
# you probably have already imported MinMaxScaler at the top of your script
# we converted to numpy arrays before scaling
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# if you want to run summary stats on these to check the range,
# an easy way is to convert to a pandas dataframe.
tmp = pd.DataFrame(X_train)
tmp.describe() # notice how all the max values are 1, all min values are 0.

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 |
| mean | 0.040465 | 0.115693 | 0.379446 | 0.071782 | 0.352848 | 0.498859 | 0.676173 | 0.243577 | 0.363323 | 0.414184 | 0.608332 | 0.89757 | 0.296009 |
| std | 0.099757 | 0.231525 | 0.255356 | 0.258447 | 0.24219 | 0.144285 | 0.28831 | 0.193802 | 0.373466 | 0.317123 | 0.237096 | 0.23089 | 0.196203 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.000814 | 0.0 | 0.162593 | 0.0 | 0.13786 | 0.412345 | 0.439238 | 0.08714 | 0.130435 | 0.175573 | 0.446809 | 0.945969 | 0.139142 |
| 50% | 0.002836 | 0.0 | 0.28963 | 0.0 | 0.314815 | 0.477324 | 0.77034 | 0.186066 | 0.173913 | 0.272901 | 0.648936 | 0.985892 | 0.253725 |
| 75% | 0.0359 | 0.2 | 0.642963 | 0.0 | 0.506173 | 0.564114 | 0.934604 | 0.3884 | 0.478261 | 0.914122 | 0.808511 | 0.997113 | 0.404042 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Fit The Model
Fit the model and make new variables to save your train and test predictions. Make sure you are using the appropriate regression or classification model.

In [None]:
# make a variable to store the general model
LR = LogisticRegression() # use logistic for a classification problem
# fit the model - one line of code
LR = LR.fit(X_train, y_train) # always going to be (X_train, y_train)

In [None]:
# store the predictions
train_preds = LR.predict(X_train) # same shape as y_train
test_preds = LR.predict(X_test)  # same shape as y_test
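
The predictions stored above are hard 0/1 labels. If you also want the probability behind each label, `LogisticRegression` offers `predict_proba()`. Here is a short sketch on synthetic data (the `make_classification` data and the `model` name are stand-ins, not the Boston data) showing that, for a two-class problem, the labels from `predict()` line up with a 0.5 cutoff on the class-1 probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class stand-in data (NOT the Boston data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

# One row per sample, one column per class (column order follows model.classes_)
proba = model.predict_proba(X_demo)
hard_labels = model.predict(X_demo)

# Rebuild the hard labels from the class-1 probability with a 0.5 cutoff
manual_labels = (proba[:, 1] > 0.5).astype(int)
print(proba[:3])
print(np.array_equal(hard_labels, manual_labels))
```

Having the probabilities on hand is useful later if you ever want to try a cutoff other than 0.5.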

## Evaluate the Model
Look at the appropriate error metrics depending on the problem you are solving. 

For a regression problem, look at the R2, MAE and MSE; then make a scatterplot of actual vs. predicted values with nice labels and titles.
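
To make the regression side of the recipe concrete, here is a rough sketch of the same steps with a regression model and regression metrics. Everything in it is a stand-in: the data are synthetic, and `LinearRegression` is just one possible choice of model. The scatterplot is left out to keep it short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: a continuous target built from two of three features plus noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1, size=200)

# Same recipe: split, then scale (fit on train, transform test)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          shuffle=True, random_state=42)
scaler_demo = MinMaxScaler()
X_tr = scaler_demo.fit_transform(X_tr)
X_te = scaler_demo.transform(X_te)

# Different model...
reg = LinearRegression().fit(X_tr, y_tr)
te_preds = reg.predict(X_te)

# ...and different error metrics, always (actual, predicted)
r2 = r2_score(y_te, te_preds)
mae = mean_absolute_error(y_te, te_preds)
mse = mean_squared_error(y_te, te_preds)
print("R2:", r2)
print("MAE:", mae)
print("MSE:", mse)
```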

For a classification problem, create the classification report (gives precision, recall, F1-score and support for each class in one line of code) and the confusion matrix.

In [None]:
# this is a classification problem, so we have other ways of
# evaluating our model than a regression problem

# train results
trainResults = classification_report(y_train, train_preds) # (actual, predicted)
print(trainResults)

              precision    recall  f1-score   support

           0       0.86      0.82      0.84       196
           1       0.84      0.87      0.85       208

    accuracy                           0.85       404
   macro avg       0.85      0.85      0.85       404
weighted avg       0.85      0.85      0.85       404



In [None]:
# train confusion matrix
confusion_matrix(y_train, train_preds)

# top left is TN
# bottom left is FN
# top right is FP
# bottom right is TP

array([[161,  35],
       [ 27, 181]])
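
If you have trouble remembering which corner is which, one option is to wrap the matrix in a labeled pandas dataframe. This is just a sketch on a handful of made-up labels (the `y_true_demo` and `y_pred_demo` arrays are hypothetical); rows are the actual classes and columns are the predicted classes.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (NOT the model results above)
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 1, 0, 1, 1, 0, 1])

# (actual, predicted), with the class order spelled out
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

# Rows = actual class, columns = predicted class
labeled = pd.DataFrame(cm,
                       index=["actual 0", "actual 1"],
                       columns=["predicted 0", "predicted 1"])
print(labeled)
```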

In [None]:
# here are tp, tn, fp, fn
tn, fp, fn, tp = confusion_matrix(y_train, train_preds).ravel()
print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 181
TN: 161
FP: 35
FN: 27


In [None]:
# test results
testResults = classification_report(y_test, test_preds)
# don't forget to use 'print' otherwise it looks goofy
print(testResults)

              precision    recall  f1-score   support

           0       0.89      0.85      0.87        60
           1       0.80      0.86      0.83        42

    accuracy                           0.85       102
   macro avg       0.85      0.85      0.85       102
weighted avg       0.86      0.85      0.85       102



In [None]:
# test confusion matrix
confusion_matrix(y_test, test_preds)

# top left is TN
# bottom left is FN
# top right is FP
# bottom right is TP

array([[51,  9],
       [ 6, 36]])

In [None]:
# here are tp, tn, fp, fn
tn, fp, fn, tp = confusion_matrix(y_test, test_preds).ravel()
print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 36
TN: 51
FP: 9
FN: 6
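
These four counts are all you need to rebuild the class 1 row of the test classification report by hand. A quick sketch using the test counts printed above:

```python
# Test-set counts from the confusion matrix above
tp, tn, fp, fn = 36, 51, 9, 6

precision = tp / (tp + fp)   # of everything predicted 1, how much was really 1?
recall = tp / (tp + fn)      # of everything really 1, how much did we catch?
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision: {precision:.2f}")  # 0.80
print(f"recall:    {recall:.2f}")     # 0.86
print(f"f1-score:  {f1:.2f}")         # 0.83
print(f"accuracy:  {accuracy:.2f}")   # 0.85
```

These match the class 1 row and the accuracy line in the test report.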


# Done!
You have just completed a very simple ML framework for classification modeling. Even though you used a simple logistic regression, you still got solid results: about 85% accuracy on both the train and test data.

Later on, you will expand on these topics and start fitting multiple models, and may start tweaking them ('hyperparameter tuning') to get even better performance. Sit tight!