Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [1]:
import pandas as pd
from pathlib import Path

filepath = Path('../data/steam/steam.csv')
data = pd.read_csv(filepath)

import math
import numpy as np


def create_target(data:pd.DataFrame):
    data = data.copy()
    df = data[['positive_ratings', 'negative_ratings']]
    data = data.drop(['positive_ratings', 'negative_ratings'], axis=1)
    df['total_reviews'] = df['positive_ratings'] + df['negative_ratings']
    df =df[df['total_reviews'] >= 500]
    df['review_score'] = df['positive_ratings'] / df['total_reviews']
    df['superscript']= [math.log10(x+1) for x in df['total_reviews']]
    df['exponent'] = [2**(-x) for x in df['superscript']]
    df['rating'] =  [x-(x-0.5) * y for x,y in zip(df['review_score'], df['exponent'])]
    df['good'] = df['rating'] >= 0.85
    data = data.merge(df[['good']], left_index=True,right_index=True)
    return data

data_with_target = create_target(data)


def x_y_split(df):
    target = 'good'
    X = df.copy().drop(target, axis=1)
    y = df.copy()[target]
    return X, y

from sklearn.model_selection import train_test_split
train_and_val_set, test_set = train_test_split(data_with_target, stratify = data_with_target['good'], random_state = 11)
train_set, val_set = train_test_split(train_and_val_set, stratify = train_and_val_set['good'], random_state = 11)

X_train, y_train = x_y_split(train_set)
X_val, y_val = x_y_split(val_set)
X_test, y_test = x_y_split(test_set)

train = (X_train, y_train)
val = (X_val, y_val)

In [2]:
from lightgbm import LGBMClassifier

from sklearn.base import TransformerMixin
import category_encoders as ce
import numpy as np


class Wrangler(TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        X = X.copy()
        cols_with_zeros = ['average_playtime', 'median_playtime']
        for col in cols_with_zeros:
            X[col] = X[col].replace(0, np.nan)
        X['release_date'] = pd.to_datetime(X['release_date'], infer_datetime_format=True)
        X['year'] = X['release_date'].dt.year
        X['month'] = X['release_date'].dt.month
        X['day'] = X['release_date'].dt.day
        genres = X['genres'].str.split(';', expand=True).stack().str.get_dummies().add_prefix('genre').sum(level=0)
        tags = X['steamspy_tags'].str.split(';', expand=True).stack().str.get_dummies().add_prefix('tag').sum(level=0)
        X = X.drop(columns=[ 'appid', 'name', 'release_date', 'genres', 'steamspy_tags', 'name'])
        X = X.merge(genres, left_index=True, right_index=True)
        X = X.merge(tags, left_index=True, right_index=True)
        return X


class NumericalFilter(TransformerMixin):

    def __init__(self, include = True):
        self.columns = None
        self.include = include

    def fit(self, X, y=None,**fit_params):
        if self.include:
            self.columns = X.select_dtypes(include='number').columns.tolist()
        else:
            self.columns = X.select_dtypes(exclude='number').columns.tolist()
        return self

    def transform(self, X):
        return X[self.columns]

In [3]:
X_train.columns

Index(['appid', 'name', 'release_date', 'english', 'developer', 'publisher',
       'platforms', 'required_age', 'categories', 'genres', 'steamspy_tags',
       'achievements', 'average_playtime', 'median_playtime', 'owners',
       'price'],
      dtype='object')

In [82]:
from sklearn.pipeline import make_pipeline, make_union
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
import lightgbm
from lightgbm import LGBMClassifier
import numpy as np
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

categorical_pipe = make_pipeline(NumericalFilter(False), ce.OrdinalEncoder())

numerical_pipe = make_pipeline(NumericalFilter(), SimpleImputer())

processing_pipe = make_union(categorical_pipe, numerical_pipe)

classifier = make_pipeline(
    Wrangler(),
    numerical_pipe,
    XGBClassifier()
    )

transformer = classifier = make_pipeline(
    Wrangler(),
    processing_pipe,
    )

In [83]:
classifier.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('wrangler', <__main__.Wrangler object at 0x7f47c5a7a810>),
                ('featureunion',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('pipeline-1',
                                                 Pipeline(memory=None,
                                                          steps=[('numericalfilter',
                                                                  <__main__.NumericalFilter object at 0x7f47c5a7a510>),
                                                                 ('ordinalencoder',
                                                                  OrdinalEncoder(cols=['developer',
                                                                                       'publisher',
                                                                                       'platforms',
                                                                                       'categories',
  

In [84]:
classifier.score(*val)

AttributeError: 'FeatureUnion' object has no attribute 'score'

In [None]:
from sklearn import metrics

metrics.recall_score(y_val, classifier.predict(X_val))

In [None]:
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier('stratified')

In [None]:
dummy.fit(*train)

In [None]:
metrics.recall_score(y_val, dummy.predict(X_val))

In [None]:
X_train['owners']

In [None]:
X_train['average_playtime']

In [91]:
X_train['genres'].str.split(';', expand=True).stack().str.get_dummies().add_prefix('genre_').sum(level=0)

Unnamed: 0,genre_Action,genre_Adventure,genre_Animation & Modeling,genre_Casual,genre_Design & Illustration,genre_Early Access,genre_Education,genre_Free to Play,genre_Game Development,genre_Gore,...,genre_Racing,genre_Sexual Content,genre_Simulation,genre_Software Training,genre_Sports,genre_Strategy,genre_Utilities,genre_Video Production,genre_Violent,genre_Web Publishing
2091,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
759,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1523,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1773,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
672,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4823,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4926,1,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2356,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [92]:
X_train['steamspy_tags'].str.split(';', expand=True).stack().str.get_dummies().add_prefix('tag_').sum(level=0)

Unnamed: 0,tag_1980s,tag_2D,tag_2D Fighter,tag_3D Platformer,tag_4X,tag_Action,tag_Action RPG,tag_Adventure,tag_Agriculture,tag_Aliens,...,tag_Violent,tag_Visual Novel,tag_Walking Simulator,tag_War,tag_Warhammer 40K,tag_Western,tag_World War I,tag_World War II,tag_Wrestling,tag_Zombies
2091,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
759,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1523,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1773,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
672,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4823,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4926,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2356,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [93]:
import eli5
from eli5.sklearn import PermutationImportance

In [94]:
pi = PermutationImportance(XGBClassifier)

In [95]:
pi.fit(transformer(*train), y_train)

TypeError: 'Pipeline' object is not callable

In [None]:
X_transformed = transformer.fit_transform(*train)
X_transformed

In [None]:
xg = XGBClassifier().fit(transformer.fit_transform(*train), y_train)

In [98]:
perm = PermutationImportance(xg).fit(transformer.fit_transform(*train), y_train)

In [99]:
columns = Wrangler().fit_transform(*train).columns.to_list()

In [100]:
eli5.show_weights(perm, feature_names= columns )

Weight,Feature
0.0538  ± 0.0066,platforms
0.0506  ± 0.0071,median_playtime
0.0501  ± 0.0070,owners
0.0496  ± 0.0056,year
0.0390  ± 0.0050,developer
0.0354  ± 0.0058,english
0.0281  ± 0.0036,required_age
0.0277  ± 0.0067,price
0.0242  ± 0.0033,day
0.0193  ± 0.0041,genreMassively Multiplayer
