# Micki Rhen HW1 - Classification models in sklearn

The goal of Homework 1 is to build a classification model that predicts political party from the other variables provided.

## Preliminaries

In [93]:
# Auto-reload modules in Jupyter Notebook (so that changes in *.py files don't require manual reloading):
# https://stackoverflow.com/questions/5364050/reloading-submodules-in-ipython
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Import commonly used libraries and magic command for inline plotting

In [94]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [95]:
%matplotlib inline

## The Initial Tasks

### Task 1 - Create project folder structure

Task 1 involved creating a new project folder structure with the cookiecutter-datascience-simple template which was introduced in Module 1.


The following command was run to create the new 'rhenhw1' project directory:

cookiecutter https://github.com/misken/cookiecutter-datascience-simple

![check](images/check.PNG)

### Task 2 - Put the new project folder under version control using git

After moving the supplied data to ...\rhenhw1\data\raw, the initial version control was implemented using the following commands:

* `git add *.ipynb`
* `git add docs/*.md`
* `git add .gitignore`
* `git commit -m 'initial commit'`

A new repository was created online for rhenhw1, and GitHub Desktop was used to push the committed files to this repository.

Adds and commits will be done periodically to back up work as it is produced. Other types of files (.py, etc.) may be added at these later times.

![check](images/check.PNG)

## The Real Goal... models

Assignment requirements: Build at least one logistic regression model (with regularization) and one random forest model to predict PoliticalParty.

### Data prep

I decided to follow your lead and initially explore the data by reading it into a pandas dataframe

In [96]:
tax_df = pd.read_csv("./data/raw/TaxInfo.csv")

This allows us to check out the structure of the dataframes and scan the values a bit...

In [97]:
tax_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   HHI             1004 non-null   int64 
 1   HHDL            1004 non-null   int64 
 2   Married         1004 non-null   int64 
 3   CollegGrads     1004 non-null   int64 
 4   AHHAge          1004 non-null   int64 
 5   Cars            1004 non-null   int64 
 6   Filed_2017      1004 non-null   int64 
 7   Filed_2016      1004 non-null   int64 
 8   Filed_2015      1004 non-null   int64 
 9   PoliticalParty  1004 non-null   object
dtypes: int64(9), object(1)
memory usage: 78.6+ KB


From the 'aap_hw1_s21_sklearn.ipynb' file we know the following information about the fields:

* `HHI` - household income
* `HHDL` - household debt level
* `Married` - categorical with a few levels
* `CollegGrads` - number of college grads in the household
* `AHHAge` - average age of people in the household
* `Cars` - number of cars in the household
* `Filed_2017` - 1 means they filed a tax return with the IRS for 2017
* `Filed_2016` - 1 means they filed a tax return with the IRS for 2016
* `Filed_2015` - 1 means they filed a tax return with the IRS for 2015
* `PoliticalParty` - categorical with 3 levels

'Political party' is categorical. 'Married' is categorical as well but read into our data frame as an integer.  My guess is that we have an entry for married, and blank for not, but we better take a quick look at some of the actual data to be sure...

In [98]:
tax_df.head()

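|   | HHI | HHDL | Married | CollegGrads | AHHAge | Cars | Filed_2017 | Filed_2016 | Filed_2015 | PoliticalParty |
|---|-----|------|---------|-------------|--------|------|------------|------------|------------|----------------|
| 0 | 49685 | 227187 | 0 | 0 | 105 | 0 | 1 | 1 | 1 | Democrat |
| 1 | 64756 | -507342 | 2 | 3 | 68 | 3 | 1 | 0 | 0 | Independent |
| 2 | 115435 | 521290 | 1 | 3 | 81 | 2 | 0 | 1 | 0 | Republican |
| 3 | 99454 | 251829 | 2 | 1 | 52 | 4 | 1 | 0 | 0 | Republican |
| 4 | 157274 | -472337 | 0 | 1 | 28 | 1 | 1 | 0 | 1 | Independent |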


'Filed_2017', 'Filed_2016', and 'Filed_2015' are actually categorical variables as well, but they have already been converted from strings to numbers, so I am not going to worry about them 

And well... I was wrong about the 'Married' variable lol. The variable has multiple levels (0, 1, and 2 all show up in the first five rows). Let's turn that one into a categorical variable and recheck our data types...

In [99]:
# Convert Married to strings so it is treated as a categorical variable
tax_df['Married'] = tax_df['Married'].apply(str)
print(tax_df.dtypes)


HHI                int64
HHDL               int64
Married           object
CollegGrads        int64
AHHAge             int64
Cars               int64
Filed_2017         int64
Filed_2016         int64
Filed_2015         int64
PoliticalParty    object
dtype: object
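A quick way to confirm how many levels `Married` has is to count them before converting. The sketch below uses a small made-up frame (`toy_df` is a hypothetical stand-in, not the real TaxInfo.csv) and shows that `astype(str)` produces the same string labels as `.apply(str)`.

```python
import pandas as pd

# Toy stand-in for tax_df; the values are made up for illustration.
toy_df = pd.DataFrame({"Married": [0, 2, 1, 2, 0, 1, 2]})

# Count how often each level occurs before converting anything.
level_counts = toy_df["Married"].value_counts().sort_index()
print(level_counts)

# Convert the integer codes to string labels.
toy_df["Married"] = toy_df["Married"].astype(str)
married_levels = sorted(toy_df["Married"].unique())
print(married_levels)
```

Running `value_counts()` on the real `tax_df['Married']` would show the actual levels and how balanced they are.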


We could also look at this data in something like 'SweetViz' to get a visual perspective on what is going on (processed in Jupyter Notebook, since Jupyter Lab doesn't play nice with SweetViz).

SweetViz is a nice visual.  It also shows us that we don't have any issues with missing data entries.  There is not a whole lot of data here, so I am going to keep it all in at this point.

In [None]:
import sweetviz

In [None]:
report = sweetviz.analyze(tax_df)

In [None]:
report.show_html("output/sweetviz_report.html")
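The missing-data observation can also be checked directly in pandas, without the report. This is a minimal sketch on a made-up frame (`toy_df` and its values are assumptions for illustration) that counts missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value; not the real data.
toy_df = pd.DataFrame({
    "HHI": [49685, 64756, np.nan, 99454],
    "Cars": [0, 3, 2, 4],
})

# isna() flags missing cells; sum() counts them per column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

total_missing = int(missing_per_column.sum())
print(f"Total missing entries: {total_missing}")
```

On `tax_df`, the same call should report zero for every column, matching the non-null counts from `tax_df.info()`.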

#### Data preprocessing - variable type lists

Create a list of numeric columns and categorical columns to facilitate preprocessing.

Let's break up the tax_df into two separate dataframes called X and y

In [101]:
X = tax_df.iloc[:, 0:9]
y = tax_df.iloc[:, 9]
X.head()

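|   | HHI | HHDL | Married | CollegGrads | AHHAge | Cars | Filed_2017 | Filed_2016 | Filed_2015 |
|---|-----|------|---------|-------------|--------|------|------------|------------|------------|
| 0 | 49685 | 227187 | 0 | 0 | 105 | 0 | 1 | 1 | 1 |
| 1 | 64756 | -507342 | 2 | 3 | 68 | 3 | 1 | 0 | 0 |
| 2 | 115435 | 521290 | 1 | 3 | 81 | 2 | 0 | 1 | 0 |
| 3 | 99454 | 251829 | 2 | 1 | 52 | 4 | 1 | 0 | 0 |
| 4 | 157274 | -472337 | 0 | 1 | 28 | 1 | 1 | 0 | 1 |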


In [102]:
y.head()

0       Democrat
1    Independent
2     Republican
3     Republican
4    Independent
Name: PoliticalParty, dtype: object

..., and we need to partition our data into training and test using the seed provided

In [103]:
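# train_test_split must be imported before it is used in this cell
from sklearn.model_selection import train_test_split

# Partition our data into train and test sets to use for model fitting and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)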
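Fixing `random_state` is what makes the partition reproducible. The sketch below, on made-up arrays (`X_toy` and `y_toy` are hypothetical), splits the same data twice with the same seed and checks that both splits match.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features, 3 made-up party labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(["Democrat", "Republican", "Independent", "Democrat"] * 5)

# Two independent calls with the same seed.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=21)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=21)

# Compare X_train, X_test, y_train, y_test pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(f"Identical splits: {same_split}")
print(f"Train shape: {split_a[0].shape}, test shape: {split_a[1].shape}")
```

With `test_size=0.2`, 20% of the rows go to the test set, so the 1004-row `tax_df` would be divided roughly 80/20 in the same way.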

In [104]:
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numeric_cols = X.select_dtypes(include=['number']).columns.tolist()

all_cols = X.columns.tolist()


In [105]:
numeric_cols

['HHI',
 'HHDL',
 'CollegGrads',
 'AHHAge',
 'Cars',
 'Filed_2017',
 'Filed_2016',
 'Filed_2015']

In [106]:
categorical_cols

['Married']

We will use an *assertion* to make sure we didn't miss any columns.

In [107]:
assert len(all_cols) == len(categorical_cols) + len(numeric_cols), 'each col should either be in categorical or numeric lists'

Nothing happened, so we are good to keep moving...
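A length check can pass even when the lists are wrong, for example if one column were listed twice and another left out. As a slightly stricter alternative, here is a sketch that compares sets instead; `is_partition` is a hypothetical helper, not part of the assignment.

```python
def is_partition(all_cols, *groups):
    """Return True if the groups are pairwise disjoint and together cover all_cols."""
    seen = set()
    for group in groups:
        group_set = set(group)
        if seen & group_set:  # a column appears in more than one group
            return False
        seen |= group_set
    return seen == set(all_cols)


# Column names taken from the dataframe described above.
all_cols = ["HHI", "HHDL", "Married", "CollegGrads", "AHHAge",
            "Cars", "Filed_2017", "Filed_2016", "Filed_2015"]
categorical_cols = ["Married"]
numeric_cols = ["HHI", "HHDL", "CollegGrads", "AHHAge",
                "Cars", "Filed_2017", "Filed_2016", "Filed_2015"]

assert is_partition(all_cols, categorical_cols, numeric_cols), \
    "each col should be in exactly one of the categorical or numeric lists"

# A case the length check would miss: 2 == 1 + 1, yet 'b' is never covered.
print(is_partition(["a", "b"], ["a"], ["a"]))
```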

We may need a list of column indices later, so we will do that too...

In [108]:
# Look up positions in X (the feature frame), since that is what the models see
categorical_cols_idx = [X.columns.get_loc(c) for c in categorical_cols]
categorical_cols_idx

[2]

In [109]:
# Look up positions in X (the feature frame), since that is what the models see
numeric_cols_idx = [X.columns.get_loc(c) for c in numeric_cols]
numeric_cols_idx

[0, 1, 3, 4, 5, 6, 7, 8]

### Let's make some models!

Let's start by loading in some things we may need

In [110]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split

Since the request was for regularized logistic regression, the numeric variables should be rescaled so that the units of measurement don't affect the model fitting process, and the categorical variables should be one-hot encoded.

In [111]:
# Create a StandardScaler object to use on our numeric variables
numeric_transformer = StandardScaler()

In [112]:
# Create an object to use on our categorical variables
categorical_transformer = OneHotEncoder(handle_unknown='ignore')


In [113]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)])
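To see what a preprocessor like this produces, here is a minimal sketch on a made-up four-row frame (`toy_X` and `toy_preprocessor` are hypothetical names, and `sparse_threshold=0` is added only so the output is a dense array that prints nicely). The numeric columns come out with mean 0 and unit variance, and `Married` becomes one indicator column per level.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for X with two numeric columns and one categorical column.
toy_X = pd.DataFrame({
    "HHI": [49685, 64756, 115435, 99454],
    "Cars": [0, 3, 2, 4],
    "Married": ["0", "2", "1", "2"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["HHI", "Cars"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Married"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = toy_preprocessor.fit_transform(toy_X)

# 2 scaled numeric columns + 3 one-hot columns (levels '0', '1', '2').
print(transformed.shape)
numeric_part = transformed[:, :2]
onehot_part = transformed[:, 2:]
print(numeric_part.mean(axis=0))  # approximately zero
print(onehot_part)                # exactly one 1 per row
```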

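... and we need to make a model object to use in our pipeline. I'm using the same ridge-style (L2-penalized) logistic regression model we used for the in-class example. In sklearn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.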

In [114]:
# Classifier model
clf_model = LogisticRegression(penalty='l2', C=1, solver='saga', max_iter=500)
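The effect of `C` on an L2-penalized fit can be sketched on synthetic data. The dataset below comes from `make_classification` and is purely illustrative; it is not the tax data, and the default solver is used rather than `saga`. The expectation is that the coefficient norm shrinks as `C` gets smaller.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem, standardized like the numeric columns above.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit the default (L2-penalized) model at several regularization strengths.
coef_norms = {}
for c_value in [0.01, 1, 100]:
    model = LogisticRegression(C=c_value, max_iter=2000)
    model.fit(X_demo, y_demo)
    coef_norms[c_value] = float(np.linalg.norm(model.coef_))

for c_value, norm in coef_norms.items():
    print(f"C={c_value}: coefficient norm = {norm:.3f}")
```

`LogisticRegressionCV`, which is already imported above, is one way to pick `C` by cross-validation instead of fixing it at 1.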

Here is what it looks like when they are all piped together...

In [115]:
# Create transformer objects
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine transformers into a preprocessor step
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)])

# Classifier model
clf_model = LogisticRegression(penalty='l2', C=1, solver='saga', max_iter=500)

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model)])

In [116]:
# Visual depiction of pipeline from the new 1.0 ColumnTransformer example. 
from sklearn import set_config

set_config(display='diagram')
clf

Now let's fit that 1st model!

In [117]:
# Fit model on the training data - notice that clf is actually the Pipeline
clf.fit(X_train, y_train)

print(f"Training score: {clf.score(X_train, y_train):.3f}")
print(f"Test score: {clf.score(X_test, y_test):.3f}")

Training score: 0.390
Test score: 0.313


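Ooof, that's REALLY bad! With three party labels, a test score of 0.313 is about what uniform random guessing would give (roughly 0.33).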
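To put a score like this in context, it can help to compare it against a trivial baseline. The sketch below uses random synthetic data (not the tax data) and sklearn's `DummyClassifier`, which always predicts the most frequent training label; all variable names here are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic features and randomly assigned labels, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = rng.choice(["Democrat", "Independent", "Republican"], size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21
)

# Baseline that ignores the features and predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
majority_label = baseline.predict(X_te[:1])[0]
print(f"Majority label: {majority_label}")
print(f"Baseline test accuracy: {baseline_acc:.3f}")
```

Fitting the same baseline on `X_train` and `y_train` would give a reference score that any real model should beat.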