Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because we will spend the rest of the week making model interpretation visualizations.

But if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [1]:
import numpy as np 
import pandas as pd 
import os 
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
import category_encoders as ce
from sklearn.impute import SimpleImputer
import eli5
from eli5.sklearn import PermutationImportance
%matplotlib inline
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [2]:
cd C:\Users\Hakuj\Documents\DataSets\Kickstarter

C:\Users\Hakuj\Documents\DataSets\Kickstarter


## Getting csv

In [3]:
def get_a_year(year):
    df = pd.DataFrame(
            columns=['backers_count', 'blurb', 'category', 'converted_pledged_amount',
       'country', 'created_at', 'creator', 'currency', 'currency_symbol',
       'currency_trailing_code', 'current_currency', 'deadline',
       'disable_communication', 'fx_rate', 'goal', 'id', 'is_starrable',
       'launched_at', 'name', 'photo', 'pledged', 'profile', 'slug',
       'source_url', 'spotlight', 'staff_pick', 'state', 'state_changed_at',
       'static_usd_rate', 'urls', 'usd_pledged', 'usd_type', 'location',
       'friends', 'is_backing', 'is_starred', 'permissions']
    )
    year_dir = os.path.join('Data', str(year))
    folders = os.listdir(year_dir)  # Monthly folders inside the year
    for folder in folders[:1]:  # NOTE: only the first monthly folder is read
        files = os.listdir(os.path.join(year_dir, folder))  # Filenames inside the monthly folder
        monthly = pd.concat(
            [pd.read_csv(os.path.join(year_dir, folder, file)) for file in files[:1]]
        )  # NOTE: only the first csv file in the folder is read
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, monthly])
    return df

In [4]:
df = get_a_year(2018)

In [5]:
df.shape

(4076, 37)

In [6]:
df.describe()

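| | fx_rate | goal | pledged | static_usd_rate | usd_pledged | friends | is_backing | is_starred | permissions |
|---|---|---|---|---|---|---|---|---|---|
| count | 4076.0 | 4076.0 | 4076.0 | 4076.0 | 4076.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mean | 1.012103 | 30836.03 | 3746.904271 | 1.022456 | 3355.188621 | NaN | NaN | NaN | NaN |
| std | 0.144858 | 1566747.0 | 13626.882355 | 0.17612 | 8203.541462 | NaN | NaN | NaN | NaN |
| min | 0.052117 | 1.0 | 0.0 | 0.048231 | 0.0 | NaN | NaN | NaN | NaN |
| 25% | 1.0 | 676.125 | 110.0 | 1.0 | 120.0 | NaN | NaN | NaN | NaN |
| 50% | 1.0 | 2000.0 | 1065.89 | 1.0 | 1071.5 | NaN | NaN | NaN | NaN |
| 75% | 1.0 | 5000.0 | 3451.25 | 1.0 | 3462.934884 | NaN | NaN | NaN | NaN |
| max | 1.354969 | 100000000.0 | 485520.0 | 1.714466 | 167832.01 | NaN | NaN | NaN | NaN |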


In [7]:
df.head()

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,currency_symbol,currency_trailing_code,current_currency,deadline,disable_communication,fx_rate,goal,id,is_starrable,launched_at,name,photo,pledged,profile,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type,location,friends,is_backing,is_starred,permissions
0,1,"Monsters, Fantasy, Illusion, Delusion, and a h...","{""urls"":{""web"":{""discover"":""http://www.kicksta...",20,US,1332493397,"{""urls"":{""web"":{""user"":""https://www.kickstarte...",USD,$,True,USD,1336447572,False,1.0,5400.0,2016865793,False,1332818772,"Support the Strange and Unusual, from fantasy ...","{""small"":""https://ksr-ugc.imgix.net/assets/011...",20.0,"{""background_image_opacity"":0.8,""should_show_f...",represent-the-strange-and-unusual,https://www.kickstarter.com/discover/categorie...,False,False,failed,1336447572,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",20.0,domestic,"{""country"":""US"",""urls"":{""web"":{""discover"":""htt...",,,,
1,37,Nano Art will make and market customized piece...,"{""urls"":{""web"":{""discover"":""http://www.kicksta...",1974,US,1332823105,"{""urls"":{""web"":{""user"":""https://www.kickstarte...",USD,$,True,USD,1337287105,False,1.0,5000.0,120596924,False,1333399105,Nano Art: Reloaded,"{""small"":""https://ksr-ugc.imgix.net/assets/011...",1974.0,"{""background_image_opacity"":0.8,""should_show_f...",nano-art-reloaded,https://www.kickstarter.com/discover/categorie...,False,True,failed,1337287105,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",1974.0,domestic,"{""country"":""US"",""urls"":{""web"":{""discover"":""htt...",,,,
2,81,Video and audio coverage of the MUTEK festival...,"{""urls"":{""web"":{""discover"":""http://www.kicksta...",4845,US,1331241234,"{""urls"":{""web"":{""user"":""https://www.kickstarte...",USD,$,True,USD,1337227140,False,1.0,20000.0,694989709,False,1334604529,MUTEK 2012: the VIRTUAL FESTIVAL STUDIO project,"{""small"":""https://ksr-ugc.imgix.net/assets/011...",4845.0,"{""background_image_opacity"":0.8,""should_show_f...",mutek-festival-virtual-festival-studio,https://www.kickstarter.com/discover/categorie...,False,True,failed,1337227140,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",4845.0,domestic,"{""country"":""CA"",""urls"":{""web"":{""discover"":""htt...",,,,
3,95,"Finally, A Storyboard App Done Right!","{""urls"":{""web"":{""discover"":""http://www.kicksta...",2948,US,1332350493,"{""urls"":{""web"":{""user"":""https://www.kickstarte...",USD,$,True,USD,1338448438,False,1.0,20000.0,1254591807,False,1335424438,SketchPad Pro: A Filmmaker's Storyboard for th...,"{""small"":""https://ksr-ugc.imgix.net/assets/011...",2948.0,"{""background_image_opacity"":0.8,""should_show_f...",sketchpad-pro-a-filmmakers-storyboard-for-the-...,https://www.kickstarter.com/discover/categorie...,False,False,failed,1338448438,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",2948.0,domestic,"{""country"":""US"",""urls"":{""web"":{""discover"":""htt...",,,,
4,10,"If you like books and bookmarks, stick these s...","{""urls"":{""web"":{""discover"":""http://www.kicksta...",522,US,1333768943,"{""urls"":{""web"":{""user"":""https://www.kickstarte...",USD,$,True,USD,1338609540,False,1.0,6000.0,1162595888,False,1335499325,APPLES & LEMONS REVIEWS,"{""small"":""https://ksr-ugc.imgix.net/assets/011...",522.0,"{""background_image_opacity"":0.8,""should_show_f...",apples-and-lemons-reviews,https://www.kickstarter.com/discover/categorie...,False,False,failed,1338609542,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",522.0,domestic,"{""country"":""US"",""urls"":{""web"":{""discover"":""htt...",,,,


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4076 entries, 0 to 4075
Data columns (total 37 columns):
backers_count               4076 non-null object
blurb                       4075 non-null object
category                    4076 non-null object
converted_pledged_amount    4076 non-null object
country                     4076 non-null object
created_at                  4076 non-null object
creator                     4076 non-null object
currency                    4076 non-null object
currency_symbol             4076 non-null object
currency_trailing_code      4076 non-null object
current_currency            4076 non-null object
deadline                    4076 non-null object
disable_communication       4076 non-null object
fx_rate                     4076 non-null float64
goal                        4076 non-null float64
id                          4076 non-null object
is_starrable                4076 non-null object
launched_at                 4076 non-null object
name     

In [9]:
df.isna().sum()

backers_count                  0
blurb                          1
category                       0
converted_pledged_amount       0
country                        0
created_at                     0
creator                        0
currency                       0
currency_symbol                0
currency_trailing_code         0
current_currency               0
deadline                       0
disable_communication          0
fx_rate                        0
goal                           0
id                             0
is_starrable                   0
launched_at                    0
name                           0
photo                          0
pledged                        0
profile                        0
slug                           0
source_url                     0
spotlight                      0
staff_pick                     0
state                          0
state_changed_at               0
static_usd_rate                0
urls                           0
usd_pledge

In [10]:
df['state'].value_counts()

successful    2597
failed        1309
canceled       158
live            10
suspended        2
Name: state, dtype: int64

## Assignment 1 redo

### Target selection and baseline
- I will use funded as my target.
  - I want a classification of 'Funded', 'Failed', and 'Funded Early'.
    - I may not be able to do the last one.
  - I would also like to present a probability.
    - Suggestions on how to improve would be a good stretch goal for me.
- I will have to engineer it from 'state'.
- I can also see if it is funded ahead of time by using 'goal' and (usd) 'pledged'.
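A minimal sketch of how the target could be engineered from 'state', run on a toy frame rather than the real data. The function name `add_funded_target` and the new columns `funded` and `goal_met` are hypothetical, and it assumes that only 'successful' and 'failed' campaigns have a final outcome and that 'pledged' and 'goal' are in the same currency.

```python
import pandas as pd

def add_funded_target(df):
    """Keep finished campaigns and add hypothetical 'funded' and 'goal_met' columns."""
    # Assumption: 'live', 'canceled', and 'suspended' rows have no final outcome yet
    finished = df[df['state'].isin(['successful', 'failed'])].copy()
    # Binary target engineered from 'state'
    finished['funded'] = finished['state'] == 'successful'
    # Cross-check from the amounts: has the pledged total reached the goal?
    finished['goal_met'] = finished['pledged'].astype(float) >= finished['goal'].astype(float)
    return finished

# Toy rows for illustration only (not taken from the Kickstarter data)
toy = pd.DataFrame({
    'state': ['successful', 'failed', 'live', 'canceled'],
    'goal': [1000.0, 5000.0, 200.0, 300.0],
    'pledged': [1500.0, 20.0, 50.0, 0.0],
})
labeled = add_funded_target(toy)
print(labeled[['state', 'funded', 'goal_met']])
```

A 'Funded Early' class would additionally need the time at which the goal was reached, which a single snapshot row does not show.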

In [11]:
base_preds = ['successful'] * len(df)

In [12]:
accuracy_score(df['state'], base_preds)  # scikit-learn expects (y_true, y_pred)

0.637144259077527
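The same majority-class baseline can be read directly from the class shares. This is a small sketch that rebuilds a stand-in `state` series from the value counts printed above (the counts are from the notebook, the series itself is synthetic), so it runs without the original files.

```python
import pandas as pd

# Stand-in series rebuilt from the printed value counts, not the real rows
state = pd.Series(
    ['successful'] * 2597 + ['failed'] * 1309 + ['canceled'] * 158
    + ['live'] * 10 + ['suspended'] * 2,
    name='state',
)

# value_counts sorts in descending order, so the first entry is the majority class
shares = state.value_counts(normalize=True)
majority_class = shares.index[0]
baseline_accuracy = shares.iloc[0]
print(majority_class, baseline_accuracy)
```

Note that this baseline still counts 'live', 'canceled', and 'suspended' rows, so it would shift if those rows were dropped before modeling.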

### Feature selection
- There will be repeats, because some campaigns run longer than the scrape periods, so I will have to watch for that
- I will have to be careful with time travel (leaking information from the future into the features)
- There are some features that are mostly NaN
- Pledged and usd_pledged are essentially the same.
  - I may not even include these in my project as I want to see if you will be funded before you start
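The concerns above could be handled roughly as follows. This is a sketch on toy data, assuming that 'id' identifies a campaign across scrapes and that the latest 'state_changed_at' marks the most current row. The helper names `dedupe_and_prune` and `time_split`, the `max_nan_share` threshold, and the cutoff value are all hypothetical.

```python
import pandas as pd

def dedupe_and_prune(df, max_nan_share=0.9):
    """Keep one row per campaign and drop columns that are mostly NaN."""
    # Keep the most recently changed row for each campaign id
    latest = (df.sort_values('state_changed_at')
                .drop_duplicates(subset='id', keep='last'))
    # Share of missing values per column; drop columns above the threshold
    nan_share = latest.isna().mean()
    return latest.loc[:, nan_share <= max_nan_share]

def time_split(df, cutoff):
    """Train on campaigns launched before the cutoff, test on later ones."""
    train = df[df['launched_at'] < cutoff]
    test = df[df['launched_at'] >= cutoff]
    return train, test

# Toy rows for illustration only; campaign 1 appears in two scrapes
toy = pd.DataFrame({
    'id': [1, 1, 2, 3],
    'launched_at': [100, 100, 200, 300],
    'state_changed_at': [150, 180, 250, 350],
    'state': ['live', 'successful', 'failed', 'successful'],
    'friends': [None, None, None, None],
})
clean = dedupe_and_prune(toy)
train, test = time_split(clean, cutoff=250)
print(clean)
```

Splitting by launch time instead of at random is one way to guard against time travel, since the model is then only tested on campaigns that launched after everything it was trained on.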

## Assignment 2

In [13]:
def wrangle(df):
    df = df.copy()  # Avoid mutating the caller's DataFrame
    # The time columns hold Unix timestamps in seconds (e.g. 1332493397),
    # so they are parsed with unit='s' rather than a strftime format string
    time_cols = ['created_at', 'deadline', 'launched_at', 'state_changed_at']
    for col in time_cols:
        df[col] = pd.to_datetime(df[col].astype('int64'), unit='s').astype(str)
    return df

In [14]:
X = df.drop(columns=['state', 'pledged', 'usd_pledged'])
y = df['state']

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [16]:
pipeline1 = make_pipeline(
    SimpleImputer(strategy='most_frequent'), 
    ce.OrdinalEncoder(), 
    DecisionTreeClassifier(random_state=42, max_depth=3)
)

In [None]:
pipeline1.fit(X_train, y_train)

In [None]:
accuracy_score(y_test, pipeline1.predict(X_test))  # scikit-learn expects (y_true, y_pred)
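The assignment also asks for a boosted model. As a hedged sketch, the block below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for xgboost, on synthetic data from `make_classification` rather than the Kickstarter features. `XGBClassifier` from the xgboost package exposes the same fit/predict interface and could be tried in its place. The hyperparameter values are illustrative, not tuned.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Kickstarter dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)

# Boosting fits shallow trees one after another, each correcting the previous ones
booster = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
booster.fit(Xtr, ytr)
boost_accuracy = accuracy_score(yte, booster.predict(Xte))

# Majority-class baseline on the same test split, for comparison
majority_accuracy = max(np.mean(yte), 1 - np.mean(yte))
print(boost_accuracy, majority_accuracy)
```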

In [None]:
transformer = make_pipeline(
    SimpleImputer(strategy='most_frequent'), 
    ce.OrdinalEncoder()
)

In [None]:
model = RandomForestClassifier(random_state=42, max_depth=3)

In [None]:
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)

In [None]:
model.fit(X_train_transformed, y_train)

In [None]:
permuter = PermutationImportance(
    model,
    scoring='accuracy',
    n_iter=3,
    random_state=42
)

In [None]:
permuter.fit(X_test_transformed, y_test)

In [None]:
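# SimpleImputer drops columns that are entirely NaN at fit time. Here those appear to be
# the last four (friends, is_backing, is_starred, permissions), leaving 30 of the 34
# features, which is why the names are sliced to the first 30.
features = X_test.columns.tolist()
pd.Series(permuter.feature_importances_, features[:30])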

In [None]:
eli5.show_weights(
    permuter, 
    top=None,
    feature_names=features[:30]
)
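The eli5 cells above were not executed in this notebook. As an alternative sketch, scikit-learn ships its own `sklearn.inspection.permutation_importance`, shown here on synthetic data where only the first three features carry signal. The data, the `feature_i` names, and the model settings are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the 3 informative features come first,
# followed by 3 pure-noise features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42
)
names = [f'feature_{i}' for i in range(X_demo.shape[1])]
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)

forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
forest.fit(Xtr, ytr)

# Shuffle each column of the held-out set in turn and record the drop in accuracy
result = permutation_importance(
    forest, Xte, yte, scoring='accuracy', n_repeats=3, random_state=42
)
importances = pd.Series(result.importances_mean, index=names).sort_values(ascending=False)
print(importances)
```

Features whose shuffling barely changes the score have importances near zero and may be candidates for removal.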