## Module 7: Logistic Regression

### Step 0

Load the required libraries and the data. Because CodeGrade cannot access the internet, the [California Housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) cannot be downloaded through scikit-learn directly. Instead, the script below builds the cached file that scikit-learn expects from a local archive, so that `fetch_california_housing` returns the same data it would normally provide.

In [1]:
# CodeGrade step0

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
import os
import tarfile
import joblib  # Import joblib directly
from sklearn.datasets._base import _pkl_filepath, get_data_home
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from statsmodels.stats.outliers_influence import variance_inflation_factor

archive_path = "cal_housing.tgz" # change the path if it's not in the current directory
data_home = get_data_home(data_home=None) # change data_home if you are not using ~/scikit_learn_data
if not os.path.exists(data_home):
    os.makedirs(data_home)
filepath = _pkl_filepath(data_home, 'cal_housing.pkz')

with tarfile.open(mode="r:gz", name=archive_path) as f:
    cal_housing = np.loadtxt(
        f.extractfile('CaliforniaHousing/cal_housing.data'),
        delimiter=',')
    # Columns are not in the same order compared to the previous
    # URL resource on lib.stat.cmu.edu
    columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
    cal_housing = cal_housing[:, columns_index]

    joblib.dump(cal_housing, filepath, compress=6) # Now using the directly imported joblib

# Load dataset
california = fetch_california_housing(as_frame=True)
data = california.data
data['MedianHouseValue'] = california.target

Print basic information about the data using `.info()` and `.describe()`.

In [7]:
# Display structure and summary
data.info()
data.describe().T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   MedInc            20640 non-null  float64
 1   HouseAge          20640 non-null  float64
 2   AveRooms          20640 non-null  float64
 3   AveBedrms         20640 non-null  float64
 4   Population        20640 non-null  float64
 5   AveOccup          20640 non-null  float64
 6   Latitude          20640 non-null  float64
 7   Longitude         20640 non-null  float64
 8   MedianHouseValue  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MedInc | 20640.0 | 3.870671 | 1.899822 | 0.4999 | 2.5634 | 3.5348 | 4.74325 | 15.0001 |
| HouseAge | 20640.0 | 28.639486 | 12.585558 | 1.0 | 18.0 | 29.0 | 37.0 | 52.0 |
| AveRooms | 20640.0 | 5.429 | 2.474173 | 0.846154 | 4.440716 | 5.229129 | 6.052381 | 141.909091 |
| AveBedrms | 20640.0 | 1.096675 | 0.473911 | 0.333333 | 1.006079 | 1.04878 | 1.099526 | 34.066667 |
| Population | 20640.0 | 1425.476744 | 1132.462122 | 3.0 | 787.0 | 1166.0 | 1725.0 | 35682.0 |
| AveOccup | 20640.0 | 3.070655 | 10.38605 | 0.692308 | 2.429741 | 2.818116 | 3.282261 | 1243.333333 |
| Latitude | 20640.0 | 35.631861 | 2.135952 | 32.54 | 33.93 | 34.26 | 37.71 | 41.95 |
| Longitude | 20640.0 | -119.569704 | 2.003532 | -124.35 | -121.8 | -118.49 | -118.01 | -114.31 |
| MedianHouseValue | 20640.0 | 2.068558 | 1.153956 | 0.14999 | 1.196 | 1.797 | 2.64725 | 5.00001 |


### Step 1

Define `threshold` as the median of `MedianHouseValue`.

Next, create a binary target variable called `HighValue` like so:

> `data['HighValue'] = (data['MedianHouseValue'] > threshold).astype(int)`

Finally, store the unique values of `HighValue` in an array called `unique_values`.


In [8]:
# CodeGrade step1
threshold = data['MedianHouseValue'].median()

data['HighValue'] = (data['MedianHouseValue'] > threshold).astype(int)

unique_values = np.unique(data['HighValue'])
print(unique_values)

[0 1]
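
Because the comparison is strict (`>`), any row whose value equals the median is labelled 0, so the two classes are only approximately balanced. A minimal sketch on made-up numbers illustrates this; the array `values` is hypothetical and not taken from the dataset.

```python
import numpy as np

# Hypothetical toy values; the median (2.0) occurs more than once.
values = np.array([1.0, 2.0, 2.0, 2.0, 3.0, 4.0])
toy_threshold = np.median(values)

# Strict inequality: values equal to the median fall into class 0.
toy_high_value = (values > toy_threshold).astype(int)

print(toy_threshold)                  # 2.0
print(toy_high_value)                 # [0 0 0 0 1 1]
print(np.bincount(toy_high_value))    # [4 2]
```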


### Step 2

Select `MedInc`, `AveRooms`, and `AveOccup` as the columns of `X`, and let `y` be the column `HighValue`.

Let `seed` be set to 42.

Now split the data into `X_train`, `X_test`, `y_train`, and `y_test`, with a test size of 30% and a random state of 42.

Return the shapes of these four arrays in the same order as listed above.

In [25]:
# CodeGrade step2
X = data[['MedInc', 'AveRooms', 'AveOccup']]
y = data['HighValue']

seed = 42

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=seed
)

print(f"{X_train.shape},{X_test.shape},{y_train.shape},{y_test.shape}")

(14448, 3),(6192, 3),(14448,),(6192,)
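
The printed shapes can be checked by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remaining rows go to the training set; the row count comes from the `data.info()` output.

```python
import math

n_samples = 20640     # rows reported by data.info()
test_size = 0.30

# Assumption: the test count is rounded up; the remainder is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 14448 6192
```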


### Step 3

Create `scaler = StandardScaler()` and call `fit_transform` on `X_train`, storing the result as `X_train_scaled`. Then call `.transform` on `X_test`, storing the result as `X_test_scaled`.

Now return the shape of `X_test_scaled`.

In [26]:
# CodeGrade step3
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_test_scaled.shape)

(6192, 3)
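
The test set is transformed with the mean and standard deviation learned from the training set, not with its own statistics. A minimal NumPy sketch of the same arithmetic, on hypothetical toy arrays (`X_train_toy` and `X_test_toy` are made up for illustration), might look like this:

```python
import numpy as np

# Hypothetical toy data standing in for X_train and X_test.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[2.0, 40.0]])

# Statistics are estimated from the training rows only.
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)   # ddof=0, the population standard deviation

X_train_toy_scaled = (X_train_toy - train_mean) / train_std
X_test_toy_scaled = (X_test_toy - train_mean) / train_std  # reuses training statistics

print(X_train_toy_scaled.mean(axis=0))  # approximately zero in each column
print(X_test_toy_scaled)                # the test row is not forced to be centred
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.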


### Step 0

Run the code below to train a logistic regression model on the scaled training data.

In [27]:
# CodeGrade step0

# Train logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

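Parameters of the fitted `LogisticRegression` model:

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |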


### Step 4

Return the model's intercept.

In [28]:
# CodeGrade step4
print(model.intercept_)

[0.12755905]


### Step 5

Return the model's coefficients.

In [29]:
# CodeGrade step5
print(model.coef_)

[[ 2.33711986 -0.88891482 -2.55688063]]
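
Since the features were standardized, each coefficient describes the change in log-odds per one standard deviation increase in that feature. The sketch below converts the printed values into a baseline probability and odds ratios; it assumes the coefficient order follows the column order of `X` (`MedInc`, `AveRooms`, `AveOccup`), and the helper `sigmoid` is defined here for illustration only.

```python
import math

# Values copied from the printed intercept_ and coef_ outputs.
intercept = 0.12755905
coefs = {"MedInc": 2.33711986, "AveRooms": -0.88891482, "AveOccup": -2.55688063}

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# With every standardized feature at 0 (the training mean), only the intercept remains.
p_at_mean = sigmoid(intercept)
print(f"P(HighValue=1) at the training mean: {p_at_mean:.3f}")

# exp(coefficient) is the multiplicative change in the odds for a one
# standard deviation increase in that feature, holding the others fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
```

An odds ratio above 1 (here `MedInc`) raises the odds of `HighValue = 1`, while ratios below 1 (`AveRooms`, `AveOccup`) lower them.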


### Step 6

Using the model, predict the probability of the positive class (`HighValue = 1`) for `X_test_scaled`, calling this `y_pred_prob`, and predict the class labels for `X_test_scaled`, calling this `y_pred_class`.

Now return the first five elements of both arrays, `y_pred_prob` and `y_pred_class`.

In [30]:
# CodeGrade step6
y_pred_prob = model.predict_proba(X_test_scaled)[:, 1]
y_pred_class = model.predict(X_test_scaled)

print(y_pred_prob[:5])
print(y_pred_class[:5])

[0.09349518 0.21617122 0.63030197 0.88899024 0.51316254]
[0 0 1 1 1]
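
For a binary problem, the predicted class corresponds to thresholding the positive-class probability at 0.5. The sketch below applies that rule to the five printed probabilities; the alternative cut-off of 0.6 is a hypothetical choice, included only to show how a borderline case can flip.

```python
import numpy as np

# First five positive-class probabilities copied from the output above.
probs = np.array([0.09349518, 0.21617122, 0.63030197, 0.88899024, 0.51316254])

# Thresholding at 0.5 reproduces the printed class labels.
classes_default = (probs > 0.5).astype(int)
print(classes_default)   # [0 0 1 1 1]

# A stricter, hypothetical cut-off of 0.6 flips the borderline fifth case.
classes_strict = (probs > 0.6).astype(int)
print(classes_strict)    # [0 0 1 1 0]
```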


### Step 7

Give the confusion matrix of `y_test` and `y_pred_class`.

In [31]:
# CodeGrade step7
cm = confusion_matrix(y_test, y_pred_class)
print(cm)

[[2478  591]
 [ 810 2313]]
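
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes. As a rough illustration, precision and recall for the positive class can be read off the four printed counts:

```python
# Counts copied from the printed confusion matrix
# (rows: true class 0 and 1, columns: predicted class 0 and 1).
tn, fp = 2478, 591
fn, tp = 810, 2313

precision = tp / (tp + fp)   # share of predicted high-value rows that are correct
recall = tp / (tp + fn)      # share of actual high-value rows that are found

print(f"precision={precision:.4f}, recall={recall:.4f}")
```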


### Step 8

Rounding to four decimal places, give the accuracy score of `y_test` and `y_pred_class`.

In [32]:
# CodeGrade step8
acc = accuracy_score(y_test, y_pred_class)
print(f"{acc:.4f}")

0.7737
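
The accuracy is the share of correct predictions, so it can be recomputed from the confusion matrix counts printed in Step 7. The sketch below also compares it with a simple baseline that always predicts the more common class in `y_test`; the names `cm_counts` and `majority_share` are introduced here for illustration.

```python
# Counts copied from the Step 7 confusion matrix (rows: true, columns: predicted).
cm_counts = [[2478, 591],
             [810, 2313]]

total = sum(sum(row) for row in cm_counts)
acc_from_cm = (cm_counts[0][0] + cm_counts[1][1]) / total

# Baseline: always predict the more common true class (largest row sum).
majority_share = max(sum(row) for row in cm_counts) / total

print(f"{acc_from_cm:.4f}")     # 0.7737
print(f"{majority_share:.4f}")  # 0.5044
```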


### Step 9

Rounding each to three decimal places, give the VIFs for the three columns of `X_train_scaled`.

In [33]:
# CodeGrade step9
X_vif = sm.add_constant(X_train_scaled, has_constant='add')
vifs = [variance_inflation_factor(X_vif, i) for i in range(1, X_vif.shape[1])]

vifs_rounded = [round(v, 3) for v in vifs]
print(vifs_rounded)

[1.118, 1.117, 1.001]
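
The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the remaining ones (with an intercept). The NumPy sketch below computes it from that definition on hypothetical synthetic data; the helper `vif` and the array `X_toy` are made up for illustration and are not part of statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic features: the second is correlated with the first.
x1 = rng.normal(size=500)
x2 = 0.5 * x1 + rng.normal(size=500)
x3 = rng.normal(size=500)
X_toy = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])  # add an intercept
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

vifs_toy = [vif(X_toy, j) for j in range(X_toy.shape[1])]
print([round(v, 3) for v in vifs_toy])
```

Values close to 1, like the three reported above, suggest that the features carry little redundant linear information.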
