# Logistic Regression Implementation

As an example of binary classification, consider a dataset describing credit card customers who did or did not default. We want to do the following:

- Basic EDA: compare the default and non-default groups on each individual feature (boxplots are a good starting point).
- Process categorical variables using `pd.get_dummies`.
- Split the data into training and test sets.
- Fit a `LogisticRegression` model to estimate the likelihood of default from the `balance` column.
- Cross-validate this model using the values $[0.1, 1, 5, 10, 100]$ for the `C` parameter.
- Incorporate `PolynomialFeatures` into the model and rerun. How did the performance change?
- Repeat for the `student` column.
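
The cross-validation and `PolynomialFeatures` steps above could be sketched roughly as follows. This is a minimal sketch, not the notebook's own solution: the data is a synthetic stand-in for `balance` and `default` so the block runs without `data/default.csv`, and the `StandardScaler` step is an addition (an assumption) that keeps the squared feature on a manageable scale for the solver.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the `balance` feature and the 0/1 default label
# (assumption: higher balances make default more likely).
rng = np.random.default_rng(42)
balance = rng.uniform(0, 2500, size=2000)
p_default = 1 / (1 + np.exp(-(balance - 1900) / 250))
y = (rng.random(2000) < p_default).astype(int)
X = balance.reshape(-1, 1)

# Hold out a test set, keeping the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scale, add the squared term, then fit the classifier.
pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(max_iter=1000),
)

# make_pipeline names each step after its lowercased class name, so the
# regularization parameter is addressed as `logisticregression__C`.
param_grid = {"logisticregression__C": [0.1, 1, 5, 10, 100]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))  # accuracy on the held-out data
```

Dropping the `PolynomialFeatures` step from the pipeline and rerunning the same grid gives the baseline to compare against.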

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

In [2]:
df = pd.read_csv('data/default.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
default    10000 non-null object
student    10000 non-null object
balance    10000 non-null float64
income     10000 non-null float64
dtypes: float64(2), object(2)
memory usage: 312.6+ KB


In [4]:
df.describe()

            balance        income
count       10000.0       10000.0
mean     835.374886  33516.981876
std      483.714985  13336.639563
min             0.0    771.967729
25%      481.731105  21340.462903
50%      823.636973  34552.644802
75%     1166.308386  43807.729272
max     2654.322576  73554.233495


In [32]:
df.head()

  default student      balance        income
0      No      No   729.526495  44361.625074
1      No     Yes   817.180407    12106.1347
2      No      No  1073.549164  31767.138947
3      No      No   529.250605  35704.493935
4      No      No   785.655883  38463.495879


In [None]:
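import seaborn as sns  # not imported in the first cell

plt.figure(figsize=(9, 4))
plt.subplot(121)
# Distribution of balance for defaulters vs. non-defaulters
sns.boxplot(x='default', y='balance', data=df)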
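
A numeric complement to the boxplot is a per-group summary, which reports the same quartiles the plot draws. The sketch below uses a small synthetic frame (an assumption, built only so the block runs on its own) with the same `default` and `balance` column names as the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the credit data; the group means are made up.
rng = np.random.default_rng(0)
n = 1000
is_default = rng.random(n) < 0.1
balance = np.where(
    is_default,
    rng.normal(1700, 350, n),  # hypothetical defaulters: higher balances
    rng.normal(800, 450, n),   # hypothetical non-defaulters
).clip(min=0)
toy = pd.DataFrame({
    "default": np.where(is_default, "Yes", "No"),
    "balance": balance,
})

# Count, mean, and quartiles of balance within each default group.
summary = toy.groupby("default")["balance"].describe()
print(summary[["count", "25%", "50%", "75%"]])
```

On the real data the same call would be `df.groupby('default')['balance'].describe()`.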

In [24]:
default_dum = pd.get_dummies(df['default'],drop_first=True)
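
Both categorical columns can also be encoded in one call. The following is a small sketch on a hand-made frame (the values are illustrative, not taken from the dataset). Passing `dtype=int` is an optional choice here: it yields 0/1 columns, whereas recent pandas versions return booleans by default.

```python
import pandas as pd

# Tiny hand-made frame with the same categorical columns as the dataset.
toy = pd.DataFrame({
    "default": ["No", "No", "Yes", "No"],
    "student": ["No", "Yes", "Yes", "No"],
    "balance": [729.5, 817.2, 1990.0, 529.3],
})

# drop_first=True keeps a single indicator per binary column (the "Yes"
# level), since the dropped "No" level is implied by a 0.
encoded = pd.get_dummies(
    toy, columns=["default", "student"], drop_first=True, dtype=int
)
print(encoded)
```

The encoded columns are named `default_Yes` and `student_Yes`, so `encoded["default_Yes"]` is a one-dimensional 0/1 target.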

In [26]:
print(default_dum)

      Yes
0       0
1       0
2       0
3       0
4       0
5       0
6       0
7       0
8       0
9       0
10      0
11      0
12      0
13      0
14      0
15      0
16      0
17      0
18      0
19      0
20      0
21      0
22      0
23      0
24      0
25      0
26      0
27      0
28      0
29      0
...   ...
9970    0
9971    0
9972    0
9973    0
9974    0
9975    0
9976    0
9977    0
9978    1
9979    0
9980    0
9981    0
9982    0
9983    0
9984    0
9985    0
9986    0
9987    0
9988    0
9989    0
9990    0
9991    0
9992    0
9993    0
9994    0
9995    0
9996    0
9997    0
9998    0
9999    0

[10000 rows x 1 columns]


In [27]:
default_dum.max()

Yes    1
dtype: uint8

In [28]:
default_dum.min()

Yes    0
dtype: uint8

In [29]:
default_dum.describe()

            Yes
count   10000.0
mean     0.0333
std    0.179428
min         0.0
25%         0.0
50%         0.0
75%         0.0
max         1.0


In [30]:
default_dum.head()

   Yes
0    0
1    0
2    0
3    0
4    0


In [31]:
X = df[['balance']]     # 2-D feature matrix with a single column
y = default_dum['Yes']  # 1-D 0/1 target; 1 means the customer defaulted

In [22]:
X = df['balance'].values.reshape(-1, 1)  # one feature, shape (n_samples, 1)
y = default_dum['Yes']                   # select the column so the target is 1-D
lr = LogisticRegression()
lr.fit(X, y)

In [None]:
default = df[df['default'] == 'Yes']  # rows for customers who defaulted

In [None]:
X = df['balance'].values.reshape(-1, 1)
y = df['default']  # string labels ('No'/'Yes') are accepted as class labels
lr = LogisticRegression()
lr.fit(X, y)
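
The split, fit, and "repeat for `student`" steps could look roughly like the sketch below. It runs on a synthetic stand-in for the dataset (an assumption, so it does not need `data/default.csv`), and `fit_and_score` is a hypothetical helper introduced only for this illustration. The synthetic `student` column carries no signal about default, so its score here says nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same column names as the real data.
rng = np.random.default_rng(1)
n = 2000
balance = rng.uniform(0, 2500, size=n)
p_default = 1 / (1 + np.exp(-(balance - 1900) / 250))
toy = pd.DataFrame({
    "student": np.where(rng.random(n) < 0.3, "Yes", "No"),
    "balance": balance,
    "default": np.where(rng.random(n) < p_default, "Yes", "No"),
})

# 1-D 0/1 target: 1 means the customer defaulted.
y = (toy["default"] == "Yes").astype(int)


def fit_and_score(X, y):
    """Hypothetical helper: split, fit, and return (model, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Model on balance alone.
balance_model, balance_acc = fit_and_score(toy[["balance"]], y)

# Same procedure for student, after encoding it as a 0/1 indicator.
student_X = pd.get_dummies(toy[["student"]], drop_first=True, dtype=int)
student_model, student_acc = fit_and_score(student_X, y)

# Predicted probability of default at a low and a high balance.
new_customers = pd.DataFrame({"balance": [500.0, 2200.0]})
probs = balance_model.predict_proba(new_customers)[:, 1]
print(balance_acc, student_acc, probs)
```

Column 1 of `predict_proba` is the probability of the positive class, since the classes are ordered as `[0, 1]`.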