
# Logistic Regression with CA Housing
-------------------------------------------------

**Dr. Dave Wanik - University of Connecticut**


This is a bare-bones script to get you up and running with classification. For a given dataset, you should be able to code relevant content in the cells below.

💡 Note that this 'recipe' is VERY similar for a regression problem - you just need to use a different model and different error metrics.

## Getting started
Import modules, mount Drive, read in the data, check data types and missing values. You may also do some light EDA prior to modeling.

Notice how we are using the same functions over and over again... it really is like following a recipe.


In [None]:
# import modules we need for EDA and wrangling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# import other functions we'll need for classification modeling
from sklearn.preprocessing import  MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression # logistic

# classification error metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
# # mount your google drive
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# read in some data
df = pd.read_csv('/content/sample_data/california_housing_train.csv')

In [None]:
# data type, shape and columns

print("This is the shape :\n", df.shape, '\n') # escape characters are fun! \n adds a newline
print("These are the column names: \n", df.columns, '\n') # helps keep things nice and clean
print("These are the data types: \n", df.dtypes)
print("\nThis is the head:") # see how I can pop that \n anywhere?
df.head()   # also note how much stuff I have pasted in ONE CELL...
            # now you are cooking with gas!

This is the shape :
 (17000, 9) 

These are the column names: 
 Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value'],
      dtype='object') 

These are the data types: 
 longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
dtype: object

This is the head:


|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |


In [None]:
# here's another cool way to get a lot of this info AND MORE
df.info() # gives you missing values report too - this is nice complete data

# data types, shape, missing values per column... pandas rocks...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB
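
If you want the missing-value counts as numbers you can reuse (rather than reading them off the `df.info()` printout), something like the sketch below should work. The `toy` DataFrame and its values are made up for illustration, since the housing data has no missing values to show.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "total_rooms": [5612.0, 7650.0, np.nan, 1501.0],
    "median_income": [1.4936, 1.82, 1.6509, np.nan],
    "median_house_value": [66900.0, 80100.0, 85700.0, 73400.0],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Count the missing values in the whole frame
total_missing = int(missing_per_column.sum())
print("total missing:", total_missing)
```

On the real data you would call `df.isna().sum()` and expect zeros everywhere, which matches the 17000 non-null counts above.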


## Data splitting
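Subset your data into X features and a y target variable for modeling. You can convert X and y to numpy arrays, although scikit-learn also accepts pandas objects directly, which is what the cells below do. Then use train_test_split for data splitting (80/20 is very common); don't forget to set a random seed and shuffle.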

In [None]:
# one extra step here - we are making up our own problem.
# we want to predict if a house price is greater than the median,
# so I will use a numpy.where() statement to do this
# (note: this overwrites the dollar values in the column with 0/1 labels)
df['median_house_value'] = np.where(df['median_house_value'] > df['median_house_value'].median(),
                                    1,  # if true
                                    0)  # if false

# check your work - looks good
df['median_house_value'].head(n=10)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: median_house_value, dtype: int64
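
To see what the `np.where()` step does on a scale you can check by hand, here is a small sketch. The six values are made up (they are not a sample of the dataset), and `cutoff` and `labels` are names chosen just for this example.

```python
import numpy as np

# Made-up house values, small enough to verify by hand
values = np.array([66900.0, 80100.0, 85700.0, 73400.0, 65500.0, 250000.0])

# The median of six values is the average of the two middle ones
cutoff = np.median(values)  # (73400 + 80100) / 2 = 76750.0

# 1 if strictly above the median, 0 otherwise
labels = np.where(values > cutoff, 1, 0)

print(cutoff)
print(labels)                         # [0 1 1 0 0 1]
print("share of 1s:", labels.mean())  # share of 1s: 0.5
```

Splitting at the median gives two classes of roughly equal size, which is why plain accuracy is a reasonable metric for this made-up problem.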

In [None]:
# the target variable is y
# we know that this is 'median_house_value'
y = df['median_house_value']
print(y.shape) # a single column with 17000 rows

(17000,)


In [None]:
# everything else is X
# so just drop 'median_house_value' and you are done
X = df.drop('median_house_value', axis=1)
print(X.shape) # note that we have gone from 9 to 8 columns, this is good! 17000 rows.

(17000, 8)


In [None]:
# now, split the data in ONE LINE OF CODE
# notice how we are assigning four different variables at once
# this makes it really clean

# be careful of capital vs. lowercase X and Y, you might get an error...
# notice the 80/20 split we perform

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    shuffle = True,
                                                    random_state = 42)

In [None]:
# check your work - does the shape match what you think it should be?
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(13600, 8) (3400, 8) (13600,) (3400,)
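
Conceptually, a shuffled 80/20 split is just "shuffle the row numbers, then cut". The numpy sketch below illustrates that idea on ten toy rows; it is not how `train_test_split` is implemented internally, and it will not reproduce the same rows that `random_state = 42` gives you in scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only for reproducibility

n_rows = 10
X_toy = np.arange(n_rows * 2).reshape(n_rows, 2)  # 10 rows, 2 features
y_toy = np.arange(n_rows)                         # row ids stand in for targets

# Shuffle the row indices, then cut them 80/20
shuffled = rng.permutation(n_rows)
n_test = int(np.ceil(0.2 * n_rows))  # 2 rows go to the test set
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

# Use the SAME indices for X and y so rows stay aligned
X_train_toy, X_test_toy = X_toy[train_idx], X_toy[test_idx]
y_train_toy, y_test_toy = y_toy[train_idx], y_toy[test_idx]

print(X_train_toy.shape, X_test_toy.shape, y_train_toy.shape, y_test_toy.shape)
```

The key property is that every row lands in exactly one of the two sets, and that X and y are indexed with the same shuffled row numbers.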


## Min/Max Scaling
This will ensure all of your X train data is between 0 (min) and 1 (max). You will use fit_transform() on the train data first, then transform() (not fit!) on the test data, so that the test data is scaled with the min and max learned from the train data. If you fit the scaler before splitting, or refit it on the test data, you will have data leakage.

Only scale the X data, not the Y data!

Do yourself a favor and just overwrite X_train and X_test when scaling, as I do below. Min/max scaling requires all-numeric data, and the scaler returns numpy arrays even if you pass in a pandas DataFrame.


**Like this example:**
```
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
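
Under the hood, min/max scaling with the default 0-to-1 range is `(x - min) / (max - min)`, where min and max come from the train data only. The numpy sketch below does that by hand on made-up numbers (the arrays and the names `col_min` and `col_max` are just for illustration).

```python
import numpy as np

# Made-up single-feature data
train = np.array([[2.0], [4.0], [6.0], [10.0]])
test = np.array([[1.0], [6.0], [12.0]])

# "fit": learn the min and max from the TRAIN data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# "transform": apply the same formula to both sets
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled.ravel())  # values: 0.0, 0.25, 0.5, 1.0
print(test_scaled.ravel())   # values: -0.125, 0.5, 1.25
```

Notice that the scaled test values can fall slightly below 0 or above 1. That is expected and is not a bug: it just means the test set contains values outside the range seen in training.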


In [None]:
# you probably have already imported MinMaxScaler at the top of your script
# the scaler returns a numpy array, which drops the column names
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# this is optional - just to check your work
# if you wanted to run summary stats on these to check the range,
# you would need to convert to a pandas dataframe.
tmp = pd.DataFrame(X_train)
tmp.describe() # notice how all the max values are 1, all min values are 0.

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| count | 13600.0 | 13600.0 | 13600.0 | 13600.0 | 13600.0 | 13600.0 | 13600.0 | 13600.0 |
| mean | 0.482551 | 0.327069 | 0.542284 | 0.069519 | 0.083567 | 0.040048 | 0.082229 | 0.23276 |
| std | 0.203859 | 0.226774 | 0.247018 | 0.057161 | 0.065365 | 0.032523 | 0.063121 | 0.131696 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.255341 | 0.147715 | 0.333333 | 0.038381 | 0.045779 | 0.022002 | 0.046045 | 0.142507 |
| 50% | 0.592065 | 0.181722 | 0.54902 | 0.055964 | 0.067349 | 0.032792 | 0.067259 | 0.209383 |
| 75% | 0.640895 | 0.549416 | 0.705882 | 0.083122 | 0.100403 | 0.048292 | 0.099326 | 0.293106 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Fit The Model
Fit the model and make new variables to save your train and test predictions. Make sure you are using the appropriate regression or classification model.

In [None]:
# make a variable to store the general model
LR = LogisticRegression() # use logistic for a classification problem
# fit the model - one line of code
LR = LR.fit(X_train, y_train) # always going to be (X_train, y_train)
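
If you are curious what the fitted model does with a row of data, the core idea is a weighted sum of the features passed through the sigmoid function. The sketch below uses hypothetical weights and a hypothetical intercept, not the values learned above (those live in `LR.coef_` and `LR.intercept_` after fitting).

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and intercept for two scaled features
weights = np.array([2.0, -1.0])
intercept = -0.5

x_row = np.array([0.75, 0.0])    # one made-up observation
z = x_row @ weights + intercept  # 0.75*2.0 + 0.0*(-1.0) - 0.5 = 1.0
p_class_1 = sigmoid(z)           # predicted probability of class 1

print(round(float(p_class_1), 4))  # 0.7311
```

A weighted sum of exactly 0 maps to a probability of 0.5, which is the dividing line between predicting 0 and predicting 1.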

In [None]:
# store the predictions
train_preds = LR.predict(X_train) # same shape as y_train
test_preds = LR.predict(X_test)  # same shape as y_test

In [None]:
# see how they are all 0s and 1s?
train_preds

array([1, 1, 1, ..., 0, 0, 1])

In [None]:
# what if you wanted the raw predicted probabilities?
LR.predict_proba(X_train) # like this!

array([[0.00891757, 0.99108243],
       [0.04251066, 0.95748934],
       [0.18231322, 0.81768678],
       ...,
       [0.6137507 , 0.3862493 ],
       [0.6003873 , 0.3996127 ],
       [0.47960173, 0.52039827]])
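
The 0s and 1s from `predict()` line up with these probabilities: for a two-class problem, a row is labeled 1 when the second column (the probability of class 1) is above 0.5. The sketch below applies that cutoff to the six rows printed above; the stricter 0.9 cutoff is a hypothetical choice, just to show that you could pick your own.

```python
import numpy as np

# The six rows of predict_proba output printed above
proba = np.array([
    [0.00891757, 0.99108243],
    [0.04251066, 0.95748934],
    [0.18231322, 0.81768678],
    [0.6137507, 0.3862493],
    [0.6003873, 0.3996127],
    [0.47960173, 0.52039827],
])

p_class_1 = proba[:, 1]  # second column = probability of class 1

default_labels = (p_class_1 > 0.5).astype(int)  # the usual 0.5 cutoff
strict_labels = (p_class_1 > 0.9).astype(int)   # hypothetical stricter cutoff

print(default_labels)  # [1 1 1 0 0 1]
print(strict_labels)   # [1 1 0 0 0 0]
```

The default labels match the first three and last three entries of `train_preds` shown earlier. Each row of `predict_proba` also sums to 1, since the two columns are the probabilities of class 0 and class 1.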

## Evaluate the Model
Look at the appropriate error metrics depending on the problem you are solving.

For a regression problem, look at the R2, MAE and MSE; then make a scatterplot of actual vs. predicted values with nice labels and titles.
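
For reference, here is a minimal numpy sketch of those three regression metrics computed by hand. The `actual` and `predicted` values are made up; in practice you would likely use `r2_score`, `mean_absolute_error` and `mean_squared_error` from `sklearn.metrics` instead.

```python
import numpy as np

# Made-up actual and predicted values for a regression problem
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.0, 8.0, 8.5])

errors = actual - predicted     # 0.5, 0.0, -1.0, 0.5

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error

# R2 = 1 - (unexplained variation / total variation)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)  # MAE: 0.5
print("MSE:", mse)  # MSE: 0.375
print("R2:", r2)    # R2: 0.925
```

MAE and MSE are in the units (and squared units) of the target, so lower is better; R2 is unitless and closer to 1 is better.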

For a classification problem, create the classification report (gives precision, recall, F1-score and support for each class in one line of code) and the confusion matrix (one more line of code).

In [None]:
# this is a classification problem, so we have other ways of
# evaluating our model than a regression problem

# train results
trainResults = classification_report(y_train, train_preds) # (actual, predicted)
print(trainResults)

              precision    recall  f1-score   support

           0       0.83      0.83      0.83      6852
           1       0.83      0.83      0.83      6748

    accuracy                           0.83     13600
   macro avg       0.83      0.83      0.83     13600
weighted avg       0.83      0.83      0.83     13600



In [None]:
# train confusion matrix
confusion_matrix(y_train, train_preds)

# top left is TN
# bottom left is FN
# top right is FP
# bottom right is TP

array([[5670, 1182],
       [1171, 5577]])
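
To convince yourself of that layout (rows are the actual class, columns are the predicted class), you can build a tiny confusion matrix by hand. The sketch below uses eight made-up labels and plain counting; it illustrates the layout and is not a replacement for `confusion_matrix()`.

```python
import numpy as np

# Eight made-up actual labels and predictions
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Row = actual class, column = predicted class
cm = np.zeros((2, 2), dtype=int)
for actual, predicted in zip(y_true, y_pred):
    cm[actual, predicted] += 1

print(cm)
# [[3 1]     -> [[TN FP]
#  [1 3]]        [FN TP]]
```

Flattening the matrix row by row gives TN, FP, FN, TP in that order, which is the order `.ravel()` unpacks.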

In [None]:
# here are tp, tn, fp, fn
tn, fp, fn, tp = confusion_matrix(y_train, train_preds).ravel()
print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 5577
TN: 5670
FP: 1182
FN: 1171
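
The numbers in the classification report can be rebuilt from these four counts. The sketch below does the arithmetic for class 1 using the train counts printed above (the formulas are the standard definitions; the variable names are just for this example).

```python
# Train counts printed above
tp, tn, fp, fn = 5577, 5670, 1182, 1171

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all rows predicted correctly
precision = tp / (tp + fp)                  # of the predicted 1s, how many were really 1?
recall = tp / (tp + fn)                     # of the actual 1s, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("accuracy: ", round(accuracy, 2))   # accuracy:  0.83
print("precision:", round(precision, 2))  # precision: 0.83
print("recall:   ", round(recall, 2))     # recall:    0.83
print("f1:       ", round(f1, 2))         # f1:        0.83
```

All four round to 0.83, which matches the class 1 row and the accuracy line of the train classification report.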


In [None]:
# test results
testResults = classification_report(y_test, test_preds)
# don't forget to use 'print' otherwise it looks goofy
print(testResults)

              precision    recall  f1-score   support

           0       0.81      0.83      0.82      1656
           1       0.84      0.82      0.83      1744

    accuracy                           0.82      3400
   macro avg       0.82      0.82      0.82      3400
weighted avg       0.82      0.82      0.82      3400



In [None]:
# test confusion matrix
confusion_matrix(y_test, test_preds)

# top left is TN
# bottom left is FN
# top right is FP
# bottom right is TP

array([[1377,  279],
       [ 319, 1425]])

In [None]:
# recall for class 0 (specificity) on the test set: TN / (TN + FP)
1377/(1377+279)

0.8315217391304348

In [None]:
# here are tp, tn, fp, fn
tn, fp, fn, tp = confusion_matrix(y_test, test_preds).ravel()
print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 1425
TN: 1377
FP: 279
FN: 319
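
One quick sanity check is to put train and test accuracy side by side: a large gap would hint at overfitting. The sketch below uses the counts printed above; the `accuracy()` helper is a small function written just for this example.

```python
# Counts printed above for the train and test sets
train_counts = {"tp": 5577, "tn": 5670, "fp": 1182, "fn": 1171}
test_counts = {"tp": 1425, "tn": 1377, "fp": 279, "fn": 319}

def accuracy(counts):
    """Correct predictions divided by all predictions."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

train_acc = accuracy(train_counts)
test_acc = accuracy(test_counts)
gap = train_acc - test_acc

print("train accuracy:", round(train_acc, 3))  # train accuracy: 0.827
print("test accuracy: ", round(test_acc, 3))   # test accuracy:  0.824
print("gap:           ", round(gap, 3))        # gap:            0.003
```

Here the gap is well under one percentage point, so the model performs about as well on unseen data as it does on the data it was fit to.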


# Done!
You have just completed a very simple ML framework for classification modeling. Even though you used a simple logistic regression, you still got solid results: about 82% accuracy on the test set, very close to the 83% on the train set.

Later on, you will expand on these topics and start fitting multiple models, and you may start tweaking them ('hyperparameter tuning') to get even better performance. Sit tight!