<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Modeling Walkthrough

_Authors: Riley Dallas (AUS)_

---

### Learning Objectives
*After this lesson, you will be able to:*

- Gather, clean, explore and model a dataset from scratch.
- Split data into training and testing sets using both train/test split and cross-validation, and apply both techniques to score a model.


## Importing libraries
---

We'll need the following libraries for today's lesson:

1. `pandas`
2. `numpy`
3. `seaborn`
4. `matplotlib.pyplot`
5. `train_test_split` and `cross_val_score` from `sklearn`'s `model_selection` module
6. `LinearRegression` from `sklearn`'s `linear_model` module
7. `r2_score` from `sklearn`'s `metrics` module 

In [1]:
import numpy as np
import pandas as pd
import patsy

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn.metrics import r2_score

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## Load the Data

---

Today's [dataset](http://www-bcf.usc.edu/~gareth/ISL/data.html) (`College.csv`) is from the [ISLR website](http://www-bcf.usc.edu/~gareth/ISL/). 

Rename `Unnamed: 0` to `University`.

In [13]:
df = pd.read_csv('./datasets/College.csv')
df.rename(columns={"Unnamed: 0": "University"}, inplace=True)
df.head()

Unnamed: 0,University,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F.Undergrad,P.Undergrad,Outstate,Room.Board,Books,Personal,PhD,Terminal,S.F.Ratio,perc.alumni,Expend,Grad.Rate
0,Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
1,Adelphi University,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56
2,Adrian College,Yes,1428,1097,336,22,50,1036,99,11250,3750,400,1165,53,66,12.9,30,8735,54
3,Agnes Scott College,Yes,417,349,137,60,89,510,63,12960,5450,450,875,92,97,7.7,37,19016,59
4,Alaska Pacific University,Yes,193,146,55,16,44,249,869,7560,4120,800,1500,76,72,11.9,2,10922,15


In [9]:
df.dtypes

Unnamed: 0      object
Private         object
Apps             int64
Accept           int64
Enroll           int64
Top10perc        int64
Top25perc        int64
F.Undergrad      int64
P.Undergrad      int64
Outstate         int64
Room.Board       int64
Books            int64
Personal         int64
PhD             object
Terminal         int64
S.F.Ratio      float64
perc.alumni      int64
Expend           int64
Grad.Rate        int64
dtype: object

## Data cleaning: Initial check
---

Check the following in the cells below:
1. Do we have any null values?
2. Are any numerical columns being read in as `object`?

In [15]:
# Check for non-numeric values in the `PhD` column
def check_non_numeric(x):
    """Print any value that cannot be converted to an integer."""
    try:
        int(x)
    except (ValueError, TypeError):
        print(x)

df['PhD'].apply(check_non_numeric)

?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?


0      None
1      None
2      None
3      None
4      None
       ... 
772    None
773    None
774    None
775    None
776    None
Name: PhD, Length: 777, dtype: object

In [16]:
df['PhD'].unique()

array(['70', '29', '53', '92', '76', '?', '90', '89', '79', '40', '82',
       '73', '60', '36', '78', '48', '62', '69', '83', '55', '88', '57',
       '93', '85', '65', '66', '81', '59', '58', '68', '98', '71', '74',
       '61', '35', '87', '80', '63', '75', '39', '99', '100', '95', '77',
       '72', '64', '10', '86', '22', '50', '41', '8', '67', '94', '56',
       '46', '54', '84', '97', '51', '42', '49', '52', '43', '37', '45',
       '47', '91', '31', '96', '34', '33', '44', '32', '14', '103', '26',
       '16'], dtype=object)

In [17]:
df['PhD'] = df['PhD'].map(lambda x: np.nan if x == '?' else int(x))

In [20]:
df.dtypes

University      object
Private         object
Apps             int64
Accept           int64
Enroll           int64
Top10perc        int64
Top25perc        int64
F.Undergrad      int64
P.Undergrad      int64
Outstate         int64
Room.Board       int64
Books            int64
Personal         int64
PhD            float64
Terminal         int64
S.F.Ratio      float64
perc.alumni      int64
Expend           int64
Grad.Rate        int64
dtype: object

In [22]:
df.isna().sum()

University      0
Private         0
Apps            0
Accept          0
Enroll          0
Top10perc       0
Top25perc       0
F.Undergrad     0
P.Undergrad     0
Outstate        0
Room.Board      0
Books           0
Personal        0
PhD            29
Terminal        0
S.F.Ratio       0
perc.alumni     0
Expend          0
Grad.Rate       0
dtype: int64

In [23]:
df.dropna(inplace=True)

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 748 entries, 0 to 776
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   University   748 non-null    object 
 1   Private      748 non-null    object 
 2   Apps         748 non-null    int64  
 3   Accept       748 non-null    int64  
 4   Enroll       748 non-null    int64  
 5   Top10perc    748 non-null    int64  
 6   Top25perc    748 non-null    int64  
 7   F.Undergrad  748 non-null    int64  
 8   P.Undergrad  748 non-null    int64  
 9   Outstate     748 non-null    int64  
 10  Room.Board   748 non-null    int64  
 11  Books        748 non-null    int64  
 12  Personal     748 non-null    int64  
 13  PhD          748 non-null    float64
 14  Terminal     748 non-null    int64  
 15  S.F.Ratio    748 non-null    float64
 16  perc.alumni  748 non-null    int64  
 17  Expend       748 non-null    int64  
 18  Grad.Rate    748 non-null    int64  
dtypes: float

## Data cleaning: Clean up `PhD` column
---

`PhD` is being read in as a string because some of the cells contain non-numerical values. In the cell below, replace any non-numerical values with `NaN`'s, and change the column datatype to float.
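
As an alternative to mapping each value by hand, the same cleanup can be sketched with `pd.to_numeric`. The small `toy` DataFrame below is a made-up stand-in for the real data, with `'?'` as the non-numerical placeholder seen in the `unique()` output above.

```python
import pandas as pd

# Toy stand-in for the `PhD` column; the values are made up for illustration.
toy = pd.DataFrame({"PhD": ["70", "29", "?", "92"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
# Because NaN is a float, the resulting column has a float dtype.
toy["PhD"] = pd.to_numeric(toy["PhD"], errors="coerce")

print(toy["PhD"].dtype)
print(toy["PhD"].isna().sum())
```

Coercing is convenient when you do not know every placeholder in advance, but it also silently converts unexpected strings to `NaN`, so it is worth inspecting the unique values first.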

## Data cleaning: Drop `NaN`'s
---

Since only a small percentage of rows contain null cells, let's go ahead and drop them.
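
Before dropping rows, it can help to quantify how much data would be lost. In this dataset, 29 of 777 rows (about 3.7%) are missing `PhD`. The sketch below shows the check on a made-up `toy` DataFrame, and uses reassignment rather than `inplace=True` purely as a stylistic choice.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; one of four rows has a missing value.
toy = pd.DataFrame({
    "PhD": [70.0, np.nan, 53.0, 92.0],
    "Apps": [1660, 2186, 1428, 417],
})

# Fraction of rows that contain at least one null cell.
null_share = toy.isna().any(axis=1).mean()
print(f"{null_share:.1%} of rows contain a null")

# Drop those rows and keep the result in a new variable.
cleaned = toy.dropna()
print(cleaned.shape)
```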

## Feature engineering: Binarize `'Private'` column
---

In the cells below, convert the `Private` column into numerical values.
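
One possible approach is to map the string labels to integers. The sketch below assumes the column holds only `'Yes'` and `'No'` and runs on a made-up `toy` DataFrame; any other label would become `NaN` under `.map()`.

```python
import pandas as pd

# Toy stand-in for the `Private` column (assumed to hold 'Yes'/'No' only).
toy = pd.DataFrame({"Private": ["Yes", "No", "Yes"]})

# Map each label to 1 or 0; labels missing from the dict would become NaN.
toy["Private"] = toy["Private"].map({"Yes": 1, "No": 0})

print(toy["Private"].tolist())
```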

## EDA: Plot a Heatmap of the Correlation Matrix
---

Heatmaps are an effective way to visually examine the correlational structure of your predictors. 
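
A minimal sketch of such a heatmap follows. It uses randomly generated toy columns (not the College data), and the color map and mask are illustrative choices rather than requirements.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data for illustration: `Apps` is built from `Accept` plus noise,
# while `Books` is unrelated to both.
rng = np.random.default_rng(42)
accept = rng.integers(100, 5000, size=50)
toy = pd.DataFrame({
    "Apps": accept + rng.integers(0, 2000, size=50),
    "Accept": accept,
    "Books": rng.integers(300, 900, size=50),
})

# Correlation matrix of the numeric columns only.
corr = toy.select_dtypes(include="number").corr()

# Hide the upper triangle and the diagonal, which duplicate the lower half.
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, mask=mask, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the color scale comparable across different datasets.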

In [None]:
top_corr_cols

## EDA: Use seaborn's `pairplot()` function to create scatterplots 
---

Let's create a pairplot to see how some of our stronger predictors correlate with our target (`Apps`). Instead of creating a pairplot of the entire DataFrame, we can use the `y_vars` and `x_vars` params to plot only a smaller subset of columns.
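
The sketch below shows the `x_vars`/`y_vars` pattern on made-up toy data. `Accept` is named later in this lesson as the strongest predictor; including `Enroll` here is only an assumption for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Toy data for illustration only.
rng = np.random.default_rng(0)
accept = rng.integers(100, 5000, size=40)
toy = pd.DataFrame({
    "Apps": accept + rng.integers(0, 2000, size=40),
    "Accept": accept,
    "Enroll": (accept * 0.4).astype(int),
})

# One row of scatterplots: each predictor on the x-axis, the target on the y-axis.
grid = sns.pairplot(toy, x_vars=["Accept", "Enroll"], y_vars=["Apps"])
```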

## EDA: Create histograms of all numerical columns
---

In [None]:
df.hist(figsize=(15, 15));

## EDA: Boxplots
---

In the cells below, create two boxplots:
1. One for our target (`Apps`)
2. And one for our strongest predictor (`Accept`)
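
One way to lay the two boxplots side by side is sketched below, again on made-up toy data. The figure size and the horizontal orientation are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data for illustration only.
rng = np.random.default_rng(1)
accept = rng.integers(100, 5000, size=40)
toy = pd.DataFrame({
    "Apps": accept + rng.integers(0, 2000, size=40),
    "Accept": accept,
})

# Two horizontal boxplots on a shared figure, one per column.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x=toy["Apps"], ax=axes[0])
sns.boxplot(x=toy["Accept"], ax=axes[1])
axes[0].set_title("Apps")
axes[1].set_title("Accept")
```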

## Model Prep: Create our features matrix (`X`) and target vector (`y`)
---

Every **numerical** column (that is not our target) will be used as a feature.

The `Apps` column is our label: the number of applications received by that university.

In the cell below, create your `X` and `y` variables.
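
A minimal sketch of this step, assuming `Private` has already been binarized. The `toy` DataFrame and the `features` list name are illustrative, not part of the original notebook.

```python
import pandas as pd

# Toy data for illustration; `University` is a string column and is excluded.
toy = pd.DataFrame({
    "University": ["A", "B", "C"],
    "Private": [1, 0, 1],
    "Apps": [1660, 2186, 1428],
    "Accept": [1232, 1924, 1097],
})

# Keep every numeric column except the target.
features = [col for col in toy.select_dtypes(include="number").columns if col != "Apps"]

X = toy[features]
y = toy["Apps"]

print(features)
```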

## Model Prep: Train/test split
---

We always want to have a holdout set to test our model. Use the `train_test_split` function to split our `X` and `y` variables into a training set and a holdout set.
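
The split might look like the sketch below, shown on synthetic data. The `random_state` value is arbitrary and only makes the split reproducible; by default `train_test_split` holds out 25% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and target for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=100)

# Default test_size is 0.25, so 100 rows split into 75 training and 25 holdout rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(X_train.shape, X_test.shape)
```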

## Model Prep: Instantiate our model
---

Create an instance of `LinearRegression` in the cell below.

## Cross validation
---

Use `cross_val_score` to evaluate our model.
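
A sketch of the call on synthetic training data follows. The choice of `cv=5` is an assumption; for a regressor, `cross_val_score` reports the estimator's default score, which is R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic training data for illustration only: a linear signal plus small noise.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=100)

lr = LinearRegression()

# Fit and score the model on 5 different folds of the training data.
scores = cross_val_score(lr, X_train, y_train, cv=5)

print(scores)
print(scores.mean())
```

Cross-validating on the training set alone leaves the holdout set untouched for a final check.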

## Model Fitting and Evaluation
---

Fit the model to the training data, and evaluate the training and test scores below.
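
The sketch below fits and scores a model on synthetic data. It also shows that `r2_score` on the predictions matches what `.score()` returns for a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: a linear signal plus small noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

# .score() returns R^2 for regressors.
train_score = lr.score(X_train, y_train)
test_score = lr.score(X_test, y_test)

# Same metric computed explicitly from the holdout predictions.
test_r2 = r2_score(y_test, lr.predict(X_test))

print(train_score, test_score)
```

A training score far above the test score would suggest overfitting; similar scores suggest the model generalizes about as well as it fits.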