Lambda School Data Science

*Unit 2, Sprint 3, Module 4*

---


# Model Interpretation 2

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Continue to iterate on your project: data cleaning, exploratory visualization, feature engineering, modeling.
- [ ] Make a Shapley force plot to explain at least 1 individual prediction.
- [ ] Share at least 1 visualization (of any type) on Slack.

But, if you aren't ready to make a Shapley force plot with your own dataset today, that's okay. You can practice this objective with another dataset instead. You may choose any dataset you've worked with previously.

## Stretch Goals
- [ ] Make Shapley force plots to explain at least 4 individual predictions.
    - If your project is Binary Classification, you can do a True Positive, True Negative, False Positive, False Negative.
    - If your project is Regression, you can do a high prediction with low error, a low prediction with low error, a high prediction with high error, and a low prediction with high error.
- [ ] Use Shapley values to display verbal explanations of individual predictions.
- [ ] Use the SHAP library for other visualization types.

The [SHAP repo](https://github.com/slundberg/shap) has examples for many visualization types, including:

- Force Plot, individual predictions
- Force Plot, multiple predictions
- Dependence Plot
- Summary Plot
- Summary Plot, Bar
- Interaction Values
- Decision Plots

We just did the first type during the lesson. The [Kaggle microcourse](https://www.kaggle.com/dansbecker/advanced-uses-of-shap-values) shows two more. Experiment and see what you can learn!


## Links
- [Kaggle / Dan Becker: Machine Learning Explainability — SHAP Values](https://www.kaggle.com/learn/machine-learning-explainability)
- [Christoph Molnar: Interpretable Machine Learning — Shapley Values](https://christophm.github.io/interpretable-ml-book/shapley.html)
- [SHAP repo](https://github.com/slundberg/shap) & [docs](https://shap.readthedocs.io/en/latest/)

#### Provided

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*
    !pip install eli5
    !pip install pdpbox
    !pip install shap

# If you're working locally:
else:
    DATA_PATH = '../data/'

## Assignment

### Importing


In [2]:
import numpy as np 
import pandas as pd 
import os 
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import json
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
import category_encoders as ce
from sklearn.impute import SimpleImputer
import eli5
from eli5.sklearn import PermutationImportance
from xgboost import XGBClassifier
%matplotlib inline
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

#### Fetching

In [3]:
cd C:\Users\Hakuj\Documents\DataSets\Kickstarter

C:\Users\Hakuj\Documents\DataSets\Kickstarter


In [4]:
def get_a_year(year):
    df = pd.DataFrame(
            columns=['backers_count', 'blurb', 'category', 'converted_pledged_amount',
       'country', 'created_at', 'creator', 'currency', 'currency_symbol',
       'currency_trailing_code', 'current_currency', 'deadline',
       'disable_communication', 'fx_rate', 'goal', 'id', 'is_starrable',
       'launched_at', 'name', 'photo', 'pledged', 'profile', 'slug',
       'source_url', 'spotlight', 'staff_pick', 'state', 'state_changed_at',
       'static_usd_rate', 'urls', 'usd_pledged', 'usd_type', 'location',
       'friends', 'is_backing', 'is_starred', 'permissions']
    )
    folders = os.listdir(f'Data\\{year}') #Get the monthly folders inside the year
    for folder in folders:
        files = os.listdir(f'Data\\{year}\\{folder}')  #Get the filenames inside monthly folders
        monthly = pd.concat(
            [pd.read_csv(
                f'Data\\{year}\\{folder}\\{file}') for file in files]
        ) #Reads in all the csv files in a given month
        df = df.append(monthly)
        df = df.reset_index().drop(columns='index')
    return df
    

In [5]:
df = get_a_year(2019)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort)


### Cleaning

In [6]:
# I only care about these two states for now
df = df[df['state'].isin(['failed', 'successful'])]

In [7]:
def drop_dupes(df):
    df = df[~df.duplicated('id')]
    df = df.reset_index().drop(columns='index')
    return df

In [8]:
df = drop_dupes(df)

In [9]:
def datetime_convert(df):
    #Time is in seconds (epoch)
    df['created_at'] = pd.to_datetime(df['created_at'], unit='s')
    df['deadline'] = pd.to_datetime(df['deadline'], unit='s')
    df['launched_at'] = pd.to_datetime(df['launched_at'], unit='s')
    df['state_changed_at'] = pd.to_datetime(df['state_changed_at'], unit='s')
    #Convert back into strings so that we can pass to model
    df['created_at'] = pd.to_datetime(df['created_at'], format='%m%d%Y').astype(str)
    df['deadline'] = pd.to_datetime(df['deadline'], format='%m%d%Y').astype(str)
    df['launched_at'] = pd.to_datetime(df['launched_at'], format='%m%d%Y').astype(str)
    df['state_changed_at'] = pd.to_datetime(df['state_changed_at'], format='%m%d%Y').astype(str)
    #TO ADD: Break time up into columns Month day etc

    return df

In [10]:
df = datetime_convert(df)

In [11]:
def col_dict(df, col):
    #ONLY WORKS WITH 'category' AS IS!
    # So I removed the for loop for now
    """Takes in a DataFrame and a list of column
    names and unpacks the 'dictionaries' into new columns"""
#     for col in cols: #Loop over columns
    df[col] = df[col].apply(json.loads)
    df_of_column = df[col].apply(pd.Series)
    df_of_column.columns = [f'{col}_'+col_name for col_name in df_of_column.columns]
    df = df.join(df_of_column)
    return df.drop(columns=col)

In [12]:
df = col_dict(df, 'category')

In [13]:
df = df.drop(columns=['creator', 'location', 'photo', 'profile', 'urls', 'category_urls'])


In [14]:
df

Unnamed: 0,backers_count,blurb,converted_pledged_amount,country,created_at,currency,currency_symbol,currency_trailing_code,current_currency,deadline,disable_communication,friends,fx_rate,goal,id,is_backing,is_starrable,is_starred,launched_at,name,permissions,pledged,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,unread_messages_count,unseen_activity_count,usd_pledged,usd_type,category_id,category_name,category_slug,category_position,category_parent_id,category_color
0,4,Handmade glass trees that make a special Chris...,205,US,2016-06-03 04:20:02,USD,$,True,USD,2016-06-21 02:06:00,False,,1.000000,1000.0,1145008426,,False,,2016-06-04 02:31:55,Glass Christmas Trees & Glass Cross,,205.00,glass-christmas-trees-and-glass-cross,https://www.kickstarter.com/discover/categorie...,False,False,failed,2016-06-21 02:06:04,1.000000,,,2.050000e+02,domestic,347,Glass,crafts/glass,5,26.0,16744876
1,171,Perfect pair of Socks for any Adventurer! Sock...,6061,US,2018-10-24 14:34:20,USD,$,True,USD,2018-11-15 17:59:00,False,,1.000000,2000.0,1687733153,,False,,2018-10-30 20:00:02,Socks of Speed and Socks of Elvenkind,,6061.00,socks-of-speed-and-socks-of-elvenkind,https://www.kickstarter.com/discover/categorie...,True,False,successful,2018-11-15 17:59:00,1.000000,,,6.061000e+03,international,34,Tabletop Games,games/tabletop games,6,12.0,51627
2,9,This is a Series of 6 Books on Blessed Oscar A...,800,US,2015-06-17 23:47:06,USD,$,True,USD,2015-08-16 16:13:11,False,,1.000000,4400.0,1608693208,,False,,2015-07-07 16:13:11,The Complete Homilies of Blessed Oscar Romero:...,,800.00,the-complete-homilies-of-blessed-oscar-romero-...,https://www.kickstarter.com/discover/categorie...,False,False,failed,2015-08-16 16:13:11,1.000000,,,8.000000e+02,domestic,327,Translations,publishing/translations,13,18.0,14867664
3,24,Prodeus makes self employment simple and intui...,1484,US,2017-05-05 15:10:43,USD,$,True,USD,2017-06-21 16:01:16,False,,1.000000,50000.0,66308869,,False,,2017-05-22 16:01:16,Prodeus: The Future of Work & Learning,,1484.00,prodeus-social-network-learning-community-micr...,https://www.kickstarter.com/discover/categorie...,False,False,failed,2017-06-21 16:01:16,1.000000,,,1.484000e+03,domestic,342,Web,technology/web,15,16.0,6526716
4,73,Power Punch Boot Camp is an original all-ages ...,3871,GB,2018-07-25 14:06:52,GBP,£,False,USD,2018-09-05 10:00:43,False,,1.290333,3000.0,227936657,,False,,2018-08-06 10:00:43,Power Punch Boot Camp: An All-Ages Graphic Novel,,3010.00,power-punch-boot-camp-an-all-ages-graphic-novel,https://www.kickstarter.com/discover/categorie...,True,False,successful,2018-09-05 10:00:43,1.300500,,,3.914505e+03,domestic,250,Comic Books,comics/comic books,2,3.0,16776056
5,17,Sixxeight is a shirt brand hosting a live scre...,1110,US,2017-05-15 21:52:57,USD,$,True,USD,2017-07-09 15:41:03,False,,1.000000,1100.0,454186436,,False,,2017-06-09 15:41:03,"Live Printing with SX8: ""Squeegee Pulp Up""",,1110.00,live-printing-with-sx8-squeegee-pulp-up,https://www.kickstarter.com/discover/categorie...,True,False,successful,2017-07-09 15:41:04,1.000000,,,1.110000e+03,international,263,Apparel,fashion/apparel,2,9.0,16752598
6,68,Lost Dog Street Band is ready to record a new ...,4807,US,2014-08-30 20:54:40,USD,$,True,USD,2014-11-10 06:00:00,False,,1.000000,3500.0,629469071,,False,,2014-09-25 18:46:01,Lost Dog Street Band's Next Album,,4807.00,lost-dog-street-bands-next-album,https://www.kickstarter.com/discover/categorie...,True,True,successful,2014-11-10 06:00:13,1.000000,,,4.807000e+03,international,37,Country & Folk,music/country & folk,5,14.0,10878931
7,723,"its magnetic, no more switches no more heavy &...",40368,US,2016-02-21 13:54:48,USD,$,True,USD,2017-01-27 16:35:11,False,,1.000000,30000.0,183973060,,False,,2016-11-28 16:35:11,"Qto-X, a Tiny Lantern",,40368.00,qto-x-a-magnetic-light-bar,https://www.kickstarter.com/discover/categorie...,True,False,successful,2017-01-27 16:35:11,1.000000,,,4.036800e+04,international,337,Gadgets,technology/gadgets,7,16.0,6526716
8,3,World Marketplace for barter transactions and ...,57,IT,2018-10-22 12:13:01,EUR,€,False,USD,2019-01-27 11:19:22,False,,1.133688,35000.0,396641120,,False,,2018-12-13 11:19:22,Marketing Alliance - Barter & Partners.,,50.00,marketing-alliance-barter-and-partners,https://www.kickstarter.com/discover/categorie...,False,False,failed,2019-01-27 11:19:22,1.132671,,,5.663354e+01,domestic,342,Web,technology/web,15,16.0,6526716
9,3,Golf instruction book based on the principles ...,52,US,2016-02-03 22:12:19,USD,$,True,USD,2016-03-11 16:33:18,False,,1.000000,50000.0,459122294,,False,,2016-02-10 16:33:18,The Tai Chi of Golf,,52.00,the-tai-chi-of-golf,https://www.kickstarter.com/discover/categorie...,False,False,failed,2016-03-11 16:33:18,1.000000,,,5.200000e+01,international,359,Print,journalism/print,3,13.0,1228010


### ML

In [16]:
X = df.drop(columns=['state','pledged', 'usd_pledged', 'state_changed_at', 'spotlight',
                     'converted_pledged_amount', 'source_url', 'backers_count', 'state'])
y = df['state']
X_train, X_val,y_train, y_val = train_test_split(X, y, random_state=42)

In [17]:
processor = make_pipeline(
    ce.OrdinalEncoder(),
#     SimpleImputer(strategy='most_frequent')
)

In [18]:
X_train_processed = processor.fit_transform(X_train)
X_val_processed = processor.transform(X_val)

In [19]:
eval_set = [(X_train_processed, y_train), 
            (X_val_processed, y_val)]

model = XGBClassifier(n_estimators=1000, eval_set=eval_set, random_State=42, n_jobs=-1)

In [20]:
model.fit(X_train_processed, y_train, eval_set=eval_set, eval_metric='auc', 
          early_stopping_rounds=10)

[0]	validation_0-auc:0.776316	validation_1-auc:0.778764
Multiple eval metrics have been passed: 'validation_1-auc' will be used for early stopping.

Will train until validation_1-auc hasn't improved in 10 rounds.
[1]	validation_0-auc:0.800485	validation_1-auc:0.80175
[2]	validation_0-auc:0.810947	validation_1-auc:0.812107
[3]	validation_0-auc:0.818506	validation_1-auc:0.820282
[4]	validation_0-auc:0.818825	validation_1-auc:0.820402
[5]	validation_0-auc:0.820873	validation_1-auc:0.822424
[6]	validation_0-auc:0.82469	validation_1-auc:0.826313
[7]	validation_0-auc:0.825332	validation_1-auc:0.826956
[8]	validation_0-auc:0.826339	validation_1-auc:0.828016
[9]	validation_0-auc:0.827099	validation_1-auc:0.828802
[10]	validation_0-auc:0.829445	validation_1-auc:0.831095
[11]	validation_0-auc:0.831794	validation_1-auc:0.833314
[12]	validation_0-auc:0.838598	validation_1-auc:0.839822
[13]	validation_0-auc:0.838504	validation_1-auc:0.839719
[14]	validation_0-auc:0.839303	validation_1-auc:0.840548


[142]	validation_0-auc:0.883978	validation_1-auc:0.882567
[143]	validation_0-auc:0.884008	validation_1-auc:0.882594
[144]	validation_0-auc:0.884074	validation_1-auc:0.882662
[145]	validation_0-auc:0.88411	validation_1-auc:0.882709
[146]	validation_0-auc:0.884243	validation_1-auc:0.882702
[147]	validation_0-auc:0.884275	validation_1-auc:0.882706
[148]	validation_0-auc:0.884318	validation_1-auc:0.882742
[149]	validation_0-auc:0.884442	validation_1-auc:0.882869
[150]	validation_0-auc:0.884632	validation_1-auc:0.883047
[151]	validation_0-auc:0.884689	validation_1-auc:0.883122
[152]	validation_0-auc:0.884862	validation_1-auc:0.883303
[153]	validation_0-auc:0.885144	validation_1-auc:0.883561
[154]	validation_0-auc:0.885161	validation_1-auc:0.883561
[155]	validation_0-auc:0.885219	validation_1-auc:0.883606
[156]	validation_0-auc:0.885293	validation_1-auc:0.88366
[157]	validation_0-auc:0.885334	validation_1-auc:0.883686
[158]	validation_0-auc:0.885418	validation_1-auc:0.883672
[159]	validation

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1,
              eval_set=[(         blurb  country  created_at  currency  currency_symbol  currency_trailing_code  current_currency  deadline  disable_communication  friends   fx_rate       goal      id  is_backing  is_starrable  is_starred  launched_at    name  permissions    slug  staff_pick  static_usd_rate  unread_messages_...
114017    successful
22754     successful
154885        failed
Name: state, Length: 45891, dtype: object)],
              gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=1000, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_State=42,
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=None, silent=None, subsample=1, verbosity=1)

In [21]:
from sklearn.metrics import roc_auc_score
class_index = 1
y_pred_proba = model.predict_proba(X_val_processed)[:, class_index]
print(f'Test ROC AUC for class {class_index}:')
print(roc_auc_score(y_val, y_pred_proba))

Test ROC AUC for class 1:
0.8864146196610518


In [23]:
import shap
row = X_val.iloc[[1]]

explainer = shap.TreeExplainer(model)
row_processed = processor.transform(row)
shap_values = explainer.shap_values(row_processed)

shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value, 
    shap_values=shap_values, 
    features=row
)

In [24]:
import shap
row = X_val.iloc[[2]]

explainer = shap.TreeExplainer(model)
row_processed = processor.transform(row)
shap_values = explainer.shap_values(row_processed)

shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value, 
    shap_values=shap_values, 
    features=row
)