Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because we'll spend the rest of the week making model interpretation visualizations.

If you aren't ready to try xgboost and permutation importances with your own dataset today, that's okay. You can practice with another dataset instead, and you may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [1]:
import numpy as np 
import pandas as pd 
import os 
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
import category_encoders as ce
from sklearn.impute import SimpleImputer
%matplotlib inline
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)



In [2]:
cd C:\Users\Hakuj\Documents\DataSets\Kickstarter

C:\Users\Hakuj\Documents\DataSets\Kickstarter


## Getting csv

In [3]:
def get_a_year(year):
    df = pd.DataFrame(
            columns=['backers_count', 'blurb', 'category', 'converted_pledged_amount',
       'country', 'created_at', 'creator', 'currency', 'currency_symbol',
       'currency_trailing_code', 'current_currency', 'deadline',
       'disable_communication', 'fx_rate', 'goal', 'id', 'is_starrable',
       'launched_at', 'name', 'photo', 'pledged', 'profile', 'slug',
       'source_url', 'spotlight', 'staff_pick', 'state', 'state_changed_at',
       'static_usd_rate', 'urls', 'usd_pledged', 'usd_type', 'location',
       'friends', 'is_backing', 'is_starred', 'permissions']
    )
    year_dir = os.path.join('Data', str(year))
    folders = os.listdir(year_dir)  # Get the monthly folders inside the year
    for folder in folders:
        month_dir = os.path.join(year_dir, folder)
        files = os.listdir(month_dir)  # Get the filenames inside the monthly folder
        # Read in all the csv files for a given month
        monthly = pd.concat(
            [pd.read_csv(os.path.join(month_dir, file)) for file in files]
        )
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        df = pd.concat([df, monthly], sort=False)
    return df

In [None]:
df = get_a_year(2018)



In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df['state'].value_counts()

## Assignment 1 redo

### Target selection and baseline
- I will use funded as my target.
  - I want a classification of 'Funded', 'Failed', and 'Funded Early'
    - I may not be able to do the last one
  - I would also like to present a probability
    - Suggestions on how to improve a campaign would be a good stretch goal for me.
- I will have to engineer it from 'state'
- I can also check whether a campaign is funded ahead of time by comparing 'goal' with 'pledged' (or 'usd_pledged')
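A minimal sketch of how the target could be engineered from 'state', 'goal', and 'pledged'. The column names come from the dataset above; the toy rows, the new column names `funded` and `met_goal`, and the 'failed' and 'live' labels are assumptions for illustration. 'Funded Early' is not derived here, because it would need pledge totals recorded before the deadline.

```python
import pandas as pd

# Toy rows standing in for the Kickstarter data (all values are made up)
toy = pd.DataFrame({
    'state': ['successful', 'failed', 'live', 'successful'],
    'goal': [1000.0, 5000.0, 2000.0, 300.0],
    'pledged': [1500.0, 120.0, 2500.0, 300.0],
})

# Binary target: True only when the campaign ended in the 'successful' state
toy['funded'] = toy['state'] == 'successful'

# Whether pledges have already reached the goal, regardless of the final state
toy['met_goal'] = toy['pledged'] >= toy['goal']

print(toy[['state', 'funded', 'met_goal']])
```

In this toy frame the 'live' campaign has already met its goal without being labelled 'successful', which is the kind of row a 'Funded Early' class would have to capture.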

In [None]:
base_preds = ['successful'] * len(df)

In [None]:
# accuracy_score expects (y_true, y_pred)
accuracy_score(df['state'], base_preds)
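The baseline above hard-codes 'successful' as the majority class. As a rough sketch, the majority class and its share can also be read directly from `value_counts`, which avoids guessing the label. The Series below is made up; in the notebook it would be `df['state']`.

```python
import pandas as pd

# Made-up target values standing in for df['state']
y = pd.Series(['successful', 'failed', 'successful', 'live', 'successful', 'failed'])

# Share of each class, sorted from most to least frequent
class_shares = y.value_counts(normalize=True)

majority_class = class_shares.index[0]
baseline_accuracy = class_shares.iloc[0]

print(majority_class, baseline_accuracy)
```

The share of the most frequent class equals the accuracy of always predicting that class, so it can serve as the number a model has to beat.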

### Feature selection
- There will be repeats, because some campaigns run longer than the scrape periods, so I will have to watch for duplicates
- I will have to be careful with time travel (leaking information that is only known after launch into the features)
- There are some features that are mostly NaN
- 'pledged' and 'usd_pledged' are essentially the same.
  - I may not even include these in my project, since I want to predict whether a campaign will be funded before it starts
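One possible way to handle the repeats and the mostly-NaN features, sketched on a made-up frame. It assumes that 'id' identifies a campaign and that a larger 'state_changed_at' means a more recent scrape, which may not hold for the real data. The 0.5 cutoff for missing values is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: campaign 1 appears in two scrapes
toy = pd.DataFrame({
    'id': [1, 1, 2, 3],
    'state_changed_at': [100, 200, 150, 175],
    'goal': [500.0, 500.0, 1000.0, 250.0],
    'friends': [np.nan, np.nan, np.nan, 'x'],
})

# Keep only the most recent row for each campaign id
deduped = (
    toy.sort_values('state_changed_at')
       .drop_duplicates(subset='id', keep='last')
)

# Drop columns where more than half of the values are missing
nan_share = deduped.isna().mean()
kept = deduped.loc[:, nan_share <= 0.5]

print(kept)
```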

## Assignment 2

In [None]:
def wrangle(df):
    # Time series data: converted on the assumption that these columns
    # hold Unix epoch seconds rather than formatted date strings
    for col in ['created_at', 'deadline', 'launched_at', 'state_changed_at']:
        df[col] = pd.to_datetime(df[col], unit='s').astype(str)
    return df

In [None]:
X = df.drop(columns=['state','pledged', 'usd_pledged', 'converted_pledged_amount'])
y = df['state']

In [None]:
# train_test_split returns X_train, X_test, y_train, y_test (in that order)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
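pipeline1 = make_pipeline(
    # Encode first so the imputer does not turn numeric columns into categories
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='most_frequent', add_indicator=True),
    # DecisionTreeClassifier has no n_jobs parameter
    DecisionTreeClassifier(random_state=42, max_depth=3)
)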

In [None]:
pipeline1.fit(X_train, y_train)

In [None]:
# predict() takes only X; score() takes X and y and returns accuracy
pipeline1.score(X_test, y_test)
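The remaining assignment tasks are a boosted model and permutation importances. The sketch below shows the general shape of both on synthetic data. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for xgboost, and `sklearn.inspection.permutation_importance` rather than any particular library from the readings. The data, the variable names, and the hyperparameters are all assumptions for illustration, not results from the Kickstarter dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Synthetic data: column 0 determines the label, column 1 is pure noise
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Gradient boosting model (stand-in for xgboost's classifier)
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Shuffle each validation column in turn and measure the drop in accuracy
result = permutation_importance(
    model, X_val, y_val, scoring='accuracy', n_repeats=5, random_state=42
)

print(result.importances_mean)
```

Computing the importances on held-out data rather than the training set means the scores reflect what the model relies on when generalizing. The informative column should show a much larger drop than the noise column.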