Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [X] If you haven't completed assignment #1, please do so first.
- [X] Continue to clean and explore your data. Make exploratory visualizations.
- [X] Fit a model. Does it beat your baseline?
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because we'll spend the rest of the week making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
- [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
- [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
- [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
- _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
- [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
- _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [20]:
# - [X] Continue to clean and explore your data. Make exploratory visualizations.
# Goal: estimate how likely an Early Access game is to leave Early Access.

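import pandas as pd
import pandas_profiling as pdp  # importing this adds df.profile_report()


def google_drive_useable(link):
    """Turn a Google Drive sharing link into a direct-download link."""
    return link.replace('/open?', '/uc?')

data = r'https://drive.google.com/open?id=1pEOgqOZcgxwu7gA4GnFuMwO_wVBLQ7cf'
df   = pd.read_csv(google_drive_useable(data))
# Drop rows labeled 'Traditional Release'; they are out of scope for this question.
df   = df[df['ReleaseType'] != 'Traditional Release']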

df.profile_report(style = {'full_width': True})



In [25]:
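# Split the data: 60% train+val / 40% test, then 60/40 again,
# giving roughly 36% train, 24% validation, and 40% test.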
from sklearn.model_selection import train_test_split


train, test = train_test_split(df
                              ,train_size   = 0.60
                              ,random_state = 6
                              )

train, val  = train_test_split(train
                              ,train_size   = 0.60
                              ,random_state = 6
                              )
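Applying `train_size = 0.60` twice leaves about 36% of the rows for training, 24% for validation, and 40% for testing. The sketch below checks those proportions on a hypothetical 1,000-row frame; `toy` and its columns are stand-ins invented for illustration, not the project data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df: 1,000 rows, one feature, one two-class target.
toy = pd.DataFrame({'x': range(1000),
                    'ReleaseType': ['A'] * 700 + ['B'] * 300})

# Same two-step split as in the cell above.
toy_train, toy_test = train_test_split(toy, train_size=0.60, random_state=6)
toy_train, toy_val  = train_test_split(toy_train, train_size=0.60, random_state=6)

# Report each subset's share of the full frame.
for name, part in [('train', toy_train), ('val', toy_val), ('test', toy_test)]:
    print(name, len(part), round(len(part) / len(toy), 2))
```

If the classes are imbalanced, passing `stratify=` to `train_test_split` is one option for keeping the class mix similar across the three subsets; that would be a change from the cell above.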


In [77]:
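# Baseline guess: every game is 'Ex Early Access'.
# Note: unique()[0] returns the first label that appears in df, not the most
# frequent one; train[target].mode()[0] would give a majority-class baseline.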

target   = 'ReleaseType'
baseline = df[target].unique()[0]

baseline_y_pred = [baseline] * len(val)

baseline

'Ex Early Access'
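The cell above picks its baseline label with `unique()[0]`, which returns the first label in row order. A majority-class baseline uses the most frequent label instead. The following is a minimal sketch of the difference on hypothetical labels; `y_demo` and the `'Early Access'` label are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical labels standing in for train[target].
y_demo = pd.Series(['Ex Early Access', 'Early Access', 'Early Access',
                    'Early Access', 'Ex Early Access'])

first_seen = y_demo.unique()[0]               # first label in row order
majority   = y_demo.value_counts().idxmax()   # most frequent label

# Accuracy of always predicting each candidate label.
first_seen_acc = (y_demo == first_seen).mean()
majority_acc   = (y_demo == majority).mean()

print('first seen:', first_seen, first_seen_acc)
print('majority:  ', majority, majority_acc)
```

By construction, always predicting the majority class scores at least as well as predicting any other single label, so it sets the stricter bar for a model to beat.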

In [38]:
target   = 'ReleaseType'
features = ['Metacritic'
           ,'ReleaseDate'
           
           ,'RecommendationCount'
           ,'DeveloperCount'
           ,'PublisherCount'
           ,'DLCCount'
           ,'ScreenshotCount'
           
           ,'CategorySinglePlayer'
           ,'CategoryMultiplayer'
           ,'CategoryCoop'
           ,'CategoryMMO'
           ,'CategoryInAppPurchase'
           ,'CategoryIncludeSrcSDK'
           ,'CategoryIncludeLevelEditor'
           ,'CategoryVRSupport'
           
           ,'GenreIsIndie'
           ,'GenreIsAction'
           ,'GenreIsAdventure'
           ,'GenreIsCasual'
           ,'GenreIsStrategy'
           ,'GenreIsRPG'
           ,'GenreIsSimulation'
           ,'GenreIsSports'
           ,'GenreIsRacing'
           ,'GenreIsMassivelyMultiplayer'
           
           ,'PlatformWindows'
           ,'PlatformLinux'
           ,'PlatformMac'
           
           ,'PriceInitial'
           ,'PriceFinal'
           ]

In [48]:
X_train  = train[features]
y_train  = train[target]
X_val    =  val[features]
y_val    =  val[target]
X_test   =  test[features]
y_test   =  test[target]
train.shape, test.shape

((769, 59), (856, 59))

In [45]:
import category_encoders as ce
from sklearn.impute            import SimpleImputer
from sklearn.preprocessing     import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble          import RandomForestClassifier
from sklearn.pipeline          import make_pipeline
from sklearn.model_selection   import cross_val_score, RandomizedSearchCV


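def cat_predict_func(X_train, y_train):
    """Tune a random forest pipeline with a randomized search; return the fitted search."""
    pipeline = make_pipeline(ce.OrdinalEncoder()
                            ,SimpleImputer(strategy = 'mean')
                            ,StandardScaler()
                            ,SelectKBest(f_classif)
                            ,RandomForestClassifier(random_state = 6)
    )

    # Keys follow make_pipeline's lowercase step names: <step>__<parameter>.
    param_dict = {'selectkbest__k'                       :range(  1, X_train.shape[1], 1)
                 ,'randomforestclassifier__n_estimators' :range(100,              150, 1)
                 ,'randomforestclassifier__max_depth'    :range(  1,               30, 1)
                 }

    # No random_state is set on the search itself, so the 10 sampled
    # candidates (and therefore the scores) can differ from run to run.
    search = RandomizedSearchCV(pipeline
                               ,param_distributions = param_dict
                               ,n_iter              = 10
                               ,cv                  = 3
                               ,scoring             = 'accuracy'
                               ,return_train_score  = True
                               ,n_jobs              = -1
                               )

    return search.fit(X_train, y_train)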


In [46]:
search   = cat_predict_func(X_train, y_train)
search

RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=Pipeline(memory=None,
                                      steps=[('ordinalencoder',
                                              OrdinalEncoder(cols=None,
                                                             drop_invariant=False,
                                                             handle_missing='value',
                                                             handle_unknown='value',
                                                             mapping=None,
                                                             return_df=True,
                                                             verbose=0)),
                                             ('simpleimputer',
                                              SimpleImputer(add_indicator=False,
                                                            copy=True,
                                                            f
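The printed repr only shows how the search was configured. To see what the search actually selected, a fitted `RandomizedSearchCV` exposes `best_params_` and `best_score_`. The sketch below demonstrates this on synthetic data with a bare random forest; `X_demo`, `y_demo`, and `demo_search` are hypothetical names, and the parameter ranges are shrunk so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data, not the project dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=6)

demo_search = RandomizedSearchCV(RandomForestClassifier(random_state=6)
                                ,param_distributions = {'n_estimators': range(10, 30)
                                                       ,'max_depth'   : range(1, 6)}
                                ,n_iter              = 5
                                ,cv                  = 3
                                ,scoring             = 'accuracy'
                                ,random_state        = 6   # makes the sampled candidates repeatable
                                )
demo_search.fit(X_demo, y_demo)

print('Best hyperparameters:', demo_search.best_params_)
print('Mean cross-validated accuracy:', demo_search.best_score_)
```

The same two attributes should be available on the `search` object returned by `cat_predict_func`, where the keys carry the pipeline step prefixes such as `selectkbest__k`.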

In [81]:
# - [X] Fit a model. Does it beat your baseline? 
baseline_y_pred = [baseline] * len(train)
train_y_pred    = search.predict(X_train)
val_y_pred      = search.predict(X_val)


print("Baseline average accuracy:\t", (y_train == baseline_y_pred).mean(), '\n')

print("Training average accuracy:\t", (y_train == train_y_pred).mean())
print("Validation average accuracy:\t", (y_val   == val_y_pred).mean())

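# This is a much higher result than I expected. Note that the baseline predicts
# the first-seen label rather than the majority class, so this comparison
# likely overstates the gain.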

Baseline average accuracy:	 0.32639791937581275 

Training average accuracy:	 0.9648894668400521
Validation average accuracy:	 0.9317738791423001


In [None]:
# - [ ] Try xgboost.
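One possible starting point for this step, sketched on synthetic data. To keep the example dependent on scikit-learn only, it uses `GradientBoostingClassifier` as a stand-in for xgboost. xgboost's `XGBClassifier` follows the same `fit` / `predict` convention, so swapping it in should need few changes, though that swap is untested here. All names ending in `_gb` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the project dataset.
X_gb, y_gb = make_classification(n_samples=500, n_features=10,
                                 n_informative=4, random_state=6)
X_train_gb, X_val_gb, y_train_gb, y_val_gb = train_test_split(
    X_gb, y_gb, train_size=0.60, random_state=6)

# Boosting fits shallow trees one after another; each new tree is fit to the
# errors left by the trees before it, scaled by the learning rate.
model_gb = GradientBoostingClassifier(n_estimators  = 100
                                     ,learning_rate = 0.1
                                     ,max_depth     = 3
                                     ,random_state  = 6
                                     )
model_gb.fit(X_train_gb, y_train_gb)

val_accuracy_gb = model_gb.score(X_val_gb, y_val_gb)
print('Validation accuracy:', val_accuracy_gb)
```

For the project data, the classifier could replace `RandomForestClassifier` at the end of the existing pipeline so that encoding and imputation stay the same.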


In [None]:
# - [ ] Get your model's permutation importances.
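A minimal sketch of this step on synthetic data, assuming scikit-learn 0.22 or later, where `sklearn.inspection.permutation_importance` is available. Permutation importance shuffles one feature at a time in the validation set and records how much the score drops. Names ending in `_pi` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the project dataset.
X_pi, y_pi = make_classification(n_samples=500, n_features=6, n_informative=3,
                                 n_redundant=0, random_state=6)
X_train_pi, X_val_pi, y_train_pi, y_val_pi = train_test_split(
    X_pi, y_pi, train_size=0.60, random_state=6)

model_pi = RandomForestClassifier(n_estimators=100, random_state=6)
model_pi.fit(X_train_pi, y_train_pi)

# Shuffle each column 5 times on held-out data and measure the accuracy drop.
result_pi = permutation_importance(model_pi, X_val_pi, y_val_pi
                                  ,scoring      = 'accuracy'
                                  ,n_repeats    = 5
                                  ,random_state = 6
                                  )

# List features from largest to smallest mean drop in accuracy.
for i in result_pi.importances_mean.argsort()[::-1]:
    print(f'feature_{i}: {result_pi.importances_mean[i]:.3f} '
          f'+/- {result_pi.importances_std[i]:.3f}')
```

With the tuned pipeline, passing `search.best_estimator_` together with `X_val` and `y_val` should permute the raw input columns, so the importances would line up with the names in `features`.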
