Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

# Imports

The code below is copied from yesterday's notebook as the starting point for today's assignment.

In [None]:
#Imports

import pandas as pd
import numpy as np
import plotly.express as px

from sklearn.linear_model import LinearRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer

from xgboost import XGBRegressor

# Import our data

In [None]:
#Import our data

#URL to our data on my github repo for build week
url = 'https://raw.githubusercontent.com/JeremySpradlin/DS-Unit-2-Build-Week/master/sunspot_data.csv'


df = pd.read_csv(url)
df.head()

Unnamed: 0.1,Unnamed: 0,Year,Month,Day,Date In Fraction Of Year,Number of Sunspots,Standard Deviation,Observations,Indicator
0,0,1818,1,1,1818.001,-1,-1.0,0,1
1,1,1818,1,2,1818.004,-1,-1.0,0,1
2,2,1818,1,3,1818.007,-1,-1.0,0,1
3,3,1818,1,4,1818.01,-1,-1.0,0,1
4,4,1818,1,5,1818.012,-1,-1.0,0,1


## Wrangle Function

Below we define the wrangle function that cleans our dataset and prepares it for fitting. It needs to perform the following actions:

- Remove spaces from column names
- Convert column names to lower case
- `Number of Sunspots`
 - Replace all -1's with NaN
 - Verify that 0's are backed by actual observations
    - Handled by removing rows with observations=0
- Remove Columns:
 - `Indicator`
 - `Unnamed: 0`
- Remove Rows:
 - Where `Observations` = 0
 - Where `Number of Sunspots` = NaN
- Split our data into training, validation, and testing sets
 - Since we are looking at data over a 200 yr period, we will split the data chronologically.
   - Training Set: 1818 - 1902
   - Validation Set: 1903 - 1952
   - Testing Set: 1953 - 2018

**NOTE:** Testing set sizes might change in the future

In [None]:
#Create our data wrangling function

def wrangle(df):
  """Clean a dataframe of sunspot activity and split it
  chronologically into training, validation, and test sets
  for use in a predictive model."""

  #Work on a copy so the caller's dataframe is left unchanged
  df = df.copy()

  #Remove spaces from column names and change to lowercase
  df.columns = df.columns.str.lower().str.replace(' ', '_')

  #Replace -1's (missing values) in target column with NaN
  df['number_of_sunspots'] = df['number_of_sunspots'].replace(-1, np.nan)

  #Remove columns
  df = df.drop(['indicator', 'unnamed:_0'], axis=1)

  #Remove rows with missing values or no observations
  df = df.dropna()
  df = df[df['observations'] != 0]

  #Split our dataset into data and target sets
  y = df['number_of_sunspots']
  X = df.drop('number_of_sunspots', axis=1)

  #Create training set (through 1902)
  X_train = X[X['year'] <= 1902]
  y_train = y.loc[X_train.index]

  #Create validation set (1903 - 1952)
  X_val = X[(X['year'] > 1902) & (X['year'] <= 1952)]
  y_val = y.loc[X_val.index]

  #Create test set (1953 onward)
  X_test = X[X['year'] > 1952]
  y_test = y.loc[X_test.index]

  #Return the dataframes
  return X_train, y_train, X_val, y_val, X_test, y_test


In [None]:
X_train, y_train, X_val, y_val, X_test, y_test = wrangle(df)
X_train.shape, y_train.shape

((27798, 6), (27798,))

# Baseline

In [None]:
guess = y_train.mean()
errors = guess - y_train
mae = errors.abs().mean()
print('Our naive baseline mae is:', mae)

Our naive baseline mae is: 57.92122350081531
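
The baseline above is scored on the same training data used to compute the mean. As a minimal sketch, the same mean guess can also be scored on held-out data, which gives a fairer number to compare against validation results. The arrays below are made-up stand-ins for `y_train` and `y_val`, and `mean_baseline_mae` is a hypothetical helper name, not part of the original notebook.

```python
import numpy as np

def mean_baseline_mae(y_fit, y_eval):
    """MAE of always guessing the mean of y_fit, evaluated on y_eval."""
    guess = np.mean(y_fit)
    return np.mean(np.abs(np.asarray(y_eval, dtype=float) - guess))

# Hypothetical stand-ins for y_train and y_val (not sunspot data)
y_train_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_val_demo = np.array([15.0, 35.0, 55.0])

train_mae = mean_baseline_mae(y_train_demo, y_train_demo)
val_mae = mean_baseline_mae(y_train_demo, y_val_demo)

print('Baseline MAE (train):', train_mae)
print('Baseline MAE (validation):', val_mae)
```

With the real data, the equivalent call would be `mean_baseline_mae(y_train, y_val)`.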


# Create our Pipeline

In [None]:
#Create our pipeline

model = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    LinearRegression()
)

In [None]:
#Fit our data to our model

model.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('simpleimputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('linearregression',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

# Testing our model

In [None]:
#Check our R^2 score on our different sets (score() returns R^2 for regressors)
print('Training R^2: ', model.score(X_train, y_train))
print('Validation R^2: ', model.score(X_val, y_val))
print('Testing R^2: ', model.score(X_test, y_test))

Training R^2:  0.9062704166214505
Validation R^2:  0.7675162495780349
Testing R^2:  0.6303832707461011
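
For scikit-learn regressors, `score` returns the coefficient of determination R², so the numbers above are not classification accuracy and are not on the same scale as the MAE baseline. The rough sketch below computes R² and MAE by hand on made-up arrays to show how the two relate. The helper names `r2` and `mae` and the demo values are assumptions for illustration only.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Made-up targets and predictions (not sunspot data)
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.5, 2.0, 2.5, 4.0])

# Always guessing the mean gives R^2 = 0 by construction
mean_guess = np.full_like(y_true_demo, y_true_demo.mean())

print('Model R^2:', r2(y_true_demo, y_pred_demo))
print('Model MAE:', mae(y_true_demo, y_pred_demo))
print('Mean-guess R^2:', r2(y_true_demo, mean_guess))
print('Mean-guess MAE:', mae(y_true_demo, mean_guess))
```

To compare the fitted pipeline with the MAE baseline directly, `sklearn.metrics.mean_absolute_error(y_val, model.predict(X_val))` gives the validation MAE.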


# XGB Regressor

In [None]:
#Create our new pipeline with XGB

xgbmodel = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    XGBRegressor(learning_rate=1.5, n_estimators=200, max_depth=2)
)

In [None]:
xgbmodel.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('simpleimputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('xgbregressor',
                 XGBRegressor(base_score=0.5, booster='gbtree',
                              colsample_bylevel=1, colsample_bynode=1,
                              colsample_bytree=1, gamma=0,
                              importance_type='gain', learning_rate=1.5,
                              max_delta_step=0, max_depth=2, min_child_weight=1,
                              missing=None, n_estimators=200, n_jobs=1,
                              nthread=None, objective='reg:linear',
                              random_state=0, reg_alpha=0, reg_lambda=1,
                              scale_pos_weight=

In [None]:
#Check our R^2 score on our different sets (score() returns R^2 for regressors)
print('Training R^2: ', xgbmodel.score(X_train, y_train))
print('Validation R^2: ', xgbmodel.score(X_val, y_val))
print('Testing R^2: ', xgbmodel.score(X_test, y_test))

Training R^2:  0.999337815394306
Validation R^2:  0.88258239364709
Testing R^2:  0.5342629550410566

# Permutation Importances


In [None]:
!pip install eli5



In [None]:
import eli5
from eli5.sklearn import PermutationImportance

#Wrap the already-fitted linear regression pipeline (model, not xgbmodel)
permuter = PermutationImportance(
    model,
    scoring='neg_mean_absolute_error',
    n_iter=5,
    random_state=42
)

In [None]:
permuter.fit(X_train, y_train)

PermutationImportance(cv='prefit',
                      estimator=Pipeline(memory=None,
                                         steps=[('simpleimputer',
                                                 SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               missing_values=nan,
                                                               strategy='mean',
                                                               verbose=0)),
                                                ('standardscaler',
                                                 StandardScaler(copy=True,
                                                                with_mean=True,
                                                                with_std=True)),
                                                ('linearregressio
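
The permuter above is fit on the training data. Permutation importances are often computed on held-out data instead, so they reflect what the model relies on when generalizing. The sketch below shows one way to do that with scikit-learn's own `permutation_importance`, as an alternative to eli5. It uses synthetic data and a hypothetical `demo_model`, so the numbers it prints have nothing to do with the sunspot results.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in data: the target depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
X_demo = rng.normal(size=(400, 3))
y_demo = 5.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)

# Ordered split standing in for the chronological train/validation split
X_tr, X_va = X_demo[:300], X_demo[300:]
y_tr, y_va = y_demo[:300], y_demo[300:]

demo_model = make_pipeline(StandardScaler(), LinearRegression())
demo_model.fit(X_tr, y_tr)

# Shuffle each column of the validation set and measure the score drop
result = permutation_importance(
    demo_model, X_va, y_va,
    scoring='neg_mean_absolute_error',
    n_repeats=5,
    random_state=42,
)

for name, mean, std in zip(['strong', 'weak', 'noise'],
                           result.importances_mean,
                           result.importances_std):
    print(f'{name}: {mean:.4f} ± {std:.4f}')
```

In this notebook the analogous call would pass `model` (or `xgbmodel`) together with `X_val` and `y_val`.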

In [None]:
eli5.show_weights(
    permuter,
    top=None,
    feature_names=X_train.columns.tolist()
)

Weight,Feature
885.8704  ± 4.1429,date_in_fraction_of_year
876.2188  ± 6.0254,year
63.4711  ± 0.3869,standard_deviation
3.0239  ± 0.0830,month
0.0258  ± 0.0144,day
0  ± 0.0000,observations
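
Because the scoring is `neg_mean_absolute_error`, each weight above can be read as how much the MAE grows when that column is shuffled. The from-scratch sketch below shows the idea: shuffle one column at a time, re-score, and record the increase in error. `permutation_importance_mae`, the fixed-coefficient `predict` function, and the demo data are all hypothetical. This is not eli5's implementation.

```python
import numpy as np

def permutation_importance_mae(predict, X, y, n_iter=5, seed=42):
    """Mean and std of the MAE increase when each column of X is shuffled."""
    rng = np.random.default_rng(seed)
    base_mae = np.mean(np.abs(y - predict(X)))
    means, stds = [], []
    for col in range(X.shape[1]):
        increases = []
        for _ in range(n_iter):
            X_perm = X.copy()
            # Break the link between this column and the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            perm_mae = np.mean(np.abs(y - predict(X_perm)))
            increases.append(perm_mae - base_mae)
        means.append(np.mean(increases))
        stds.append(np.std(increases))
    return np.array(means), np.array(stds)

# Hypothetical "fitted model": uses column 0 and ignores column 1
coef = np.array([3.0, 0.0])

def predict(X):
    return X @ coef

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = X_demo @ coef

means, stds = permutation_importance_mae(predict, X_demo, y_demo)
for name, m, s in zip(['used', 'ignored'], means, stds):
    print(f'{name}: {m:.4f} ± {s:.4f}')
```

A column the model ignores gets an importance of zero, which is one way a row like `observations` in the table above can end up at 0.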
