# Logistic Regression Lab

## Logistic Regression Documentation

See the scikit-learn example at http://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html#sphx-glr-auto-examples-linear-model-plot-logistic-py

## Lab Instruction

### Part 1: Importing the Dataset

Import "Logistic_Regression_Lab.csv" with pandas.

In [1]:
import pandas as pd
df = pd.read_csv("Logistic_Regression_Lab.csv",index_col=False)

In [2]:
df.head()

Unnamed: 0.1,Unnamed: 0,Foundation,1stFlrSF,KitchenQual,Fireplaces,HeatingQC,FullBath,BsmtQual,OpenPorchSF,GarageYrBlt,...,BsmtFinType1,YearBuilt,GarageArea,TotRmsAbvGrd,GarageCars,GrLivArea,YearRemodAdd,LotFrontage,Fence,SalePrice
0,0,PConc,856,Gd,0,Ex,2,Gd,61,2003.0,...,GLQ,2003,548,8,2,1710,2003,65.0,,High
1,1,CBlock,1262,TA,1,Ex,2,Gd,0,1976.0,...,ALQ,1976,460,6,2,1262,1976,80.0,,Medium
2,2,PConc,920,Gd,1,Ex,2,Gd,42,2001.0,...,GLQ,2001,608,6,2,1786,2002,68.0,,High
3,3,BrkTil,961,Gd,1,Gd,1,TA,35,1998.0,...,ALQ,1915,642,7,3,1717,1970,60.0,,Medium
4,4,PConc,1145,Gd,1,Ex,2,Gd,84,2000.0,...,GLQ,2000,836,9,3,2198,2000,84.0,,High


In [3]:
columns = df.columns.tolist()

### Part 2: Preprocessing

Preprocess the dataset. Try to understand the techniques you use.

In [4]:
columns

['Unnamed: 0',
 'Foundation',
 '1stFlrSF',
 'KitchenQual',
 'Fireplaces',
 'HeatingQC',
 'FullBath',
 'BsmtQual',
 'OpenPorchSF',
 'GarageYrBlt',
 'ExterQual',
 'OverallQual',
 'BsmtFinType1',
 'YearBuilt',
 'GarageArea',
 'TotRmsAbvGrd',
 'GarageCars',
 'GrLivArea',
 'YearRemodAdd',
 'LotFrontage',
 'Fence',
 'SalePrice']

In [5]:
df.columns[df.isna().any()].tolist()

['BsmtQual', 'GarageYrBlt', 'BsmtFinType1', 'LotFrontage', 'Fence']

In [6]:
df.isna().sum()

Unnamed: 0         0
Foundation         0
1stFlrSF           0
KitchenQual        0
Fireplaces         0
HeatingQC          0
FullBath           0
BsmtQual          37
OpenPorchSF        0
GarageYrBlt       81
ExterQual          0
OverallQual        0
BsmtFinType1      37
YearBuilt          0
GarageArea         0
TotRmsAbvGrd       0
GarageCars         0
GrLivArea          0
YearRemodAdd       0
LotFrontage      259
Fence           1179
SalePrice          0
dtype: int64

In [7]:
# Drop the leftover index column, Fence (1179 missing values), and the year columns.
df = df.drop(columns=['Unnamed: 0', 'Fence', 'YearRemodAdd', 'GarageYrBlt', 'YearBuilt'])

In [8]:
df = df.dropna(axis=0)
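
Dropping every row that has a missing value is simple, but it leaves only 1170 rows (see the `df.info()` output that follows). One alternative worth comparing is imputation. The following is a minimal sketch on a hypothetical toy frame, not the lab data, and it assumes that a median fill is a reasonable choice for a numeric column such as LotFrontage.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame standing in for the lab data.
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, np.nan],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
})

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

In a full workflow the imputer would be fitted on the training split only and then applied to the test split, so that no information from the test rows leaks into the fill values.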

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1170 entries, 0 to 1459
Data columns (total 17 columns):
Foundation      1170 non-null object
1stFlrSF        1170 non-null int64
KitchenQual     1170 non-null object
Fireplaces      1170 non-null int64
HeatingQC       1170 non-null object
FullBath        1170 non-null int64
BsmtQual        1170 non-null object
OpenPorchSF     1170 non-null int64
ExterQual       1170 non-null object
OverallQual     1170 non-null int64
BsmtFinType1    1170 non-null object
GarageArea      1170 non-null int64
TotRmsAbvGrd    1170 non-null int64
GarageCars      1170 non-null int64
GrLivArea       1170 non-null int64
LotFrontage     1170 non-null float64
SalePrice       1170 non-null object
dtypes: float64(1), int64(9), object(7)
memory usage: 164.5+ KB


In [10]:
df = df.select_dtypes(include=['int64','float64']).join(df.SalePrice)
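
The cell above keeps only the numeric columns plus the target, so categorical features such as Foundation and KitchenQual are discarded. One way to keep them is one-hot encoding. The sketch below uses a hypothetical toy frame and `pd.get_dummies`; whether the extra columns help on the lab data is something to test, not a given.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Gd", "Ex"],
    "GrLivArea": [1710, 1262, 1786, 2198],
})

# Expand KitchenQual into one indicator column per category.
encoded = pd.get_dummies(toy, columns=["KitchenQual"])

print(encoded.columns.tolist())
```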

In [11]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [12]:
import numpy as np
from sklearn.model_selection import train_test_split

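# Renumber the rows 0..n-1 after dropna removed some of them.
df.reset_index(drop=True, inplace=True)

# Separate the target from the features.
df_y = df.SalePrice.copy()
df.drop('SalePrice', axis=1, inplace=True)

# Sort the feature columns alphabetically (reindex replaces the deprecated reindex_axis).
df = df.reindex(sorted(df.columns), axis=1)
X = df.copy().values
y = df_y.values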
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
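
The `StandardScaler` created in In [11] is not applied anywhere in the cells shown here. A minimal sketch of how it could be used follows. It assumes synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names) and fits the scaler on the training split only, then reuses the same statistics for the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=[1000.0, 5.0], scale=[300.0, 1.5], size=(200, 2))
y_demo = rng.choice(["High", "Medium"], size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
# Apply the training statistics to the test rows.
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

Scaling matters for regularized logistic regression because the penalty treats all coefficients alike, so features measured in large units would otherwise be penalized differently from features measured in small units.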


### Part 3: Perform Logistic Regression 

You can import and fit a Logistic Regression classifier with the following code:

In [14]:
from sklearn import linear_model


lr = linear_model.LogisticRegression()
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

The parameters and methods of LogisticRegression are documented at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
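
To explore how those parameters affect the model, one option is a cross-validated grid search over the regularization strength `C`. The sketch below is an illustration on synthetic data from `make_classification`, not on the lab data. The grid values, `max_iter=1000`, and the choice to put a `StandardScaler` in the pipeline are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the lab features and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Scaling inside the pipeline is refitted on each training fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Smaller C means stronger regularization.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

Selecting parameters by cross-validation on the training data keeps the held-out test set untouched for the final accuracy estimate.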

### Part 4: Analyze the Result

After you perform Logistic Regression, answer the following questions.

1. If you change your preprocessing method, can you improve the model?
2. If you change your parameter settings, can you improve the model?

In [15]:
y_pred = lr.predict(X_test)

In [16]:
from sklearn.metrics import accuracy_score

In [17]:
accuracy_score(y_test, y_pred)

0.7286821705426356
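
A single accuracy number is easier to interpret next to a confusion matrix and a majority-class baseline. The sketch below uses small hand-written label arrays (`y_true` and `y_hat` are hypothetical, not the lab predictions) to show how the three quantities can be computed.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels.
y_true = np.array(["High", "Medium", "Medium", "High", "Medium", "Medium"])
y_hat = np.array(["High", "Medium", "High", "Medium", "Medium", "Medium"])

# Rows are true classes, columns are predicted classes, in the order of `labels`.
labels = ["High", "Medium"]
cm = confusion_matrix(y_true, y_hat, labels=labels)
acc = accuracy_score(y_true, y_hat)

# Accuracy of always predicting the most frequent class.
values, counts = np.unique(y_true, return_counts=True)
baseline = counts.max() / counts.sum()

print(cm)
print(acc, baseline)
```

In this toy case the accuracy equals the baseline, which shows why accuracy alone can overstate how much a classifier has learned when the classes are imbalanced.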