<a href="https://colab.research.google.com/github/lechemrc/DS-Unit-2-Applied-Modeling/blob/master/module2/assignment_applied_modeling_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 2

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Plot the distribution of your target. 
    - Classification problem: Are your classes imbalanced? Then, don't use just accuracy.
    - Regression problem: Is your target skewed? If so, let's discuss in Slack.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline?
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

### Colab Setup

In [0]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Applied-Modeling.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module1')

### Important Imports

In [0]:
# libraries and math functions
import pandas as pd
import numpy as np
import pandas_profiling
from scipy.stats import randint, uniform

# imports for pipeline and regression
import category_encoders as ce
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# plotting
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/lechemrc/Datasets-to-ref/master/Endangered%20Species%20and%20Environment%20sets/merged.csv')
print(df.shape)
df.head()

(48, 163)


Unnamed: 0,country,1990_emissions,1991_emissions,1992_emissions,1993_emissions,1994_emissions,1995_emissions,1996_emissions,1997_emissions,1998_emissions,1999_emissions,2000_emissions,2001_emissions,2002_emissions,2003_emissions,2004_emissions,2005_emissions,2006_emissions,2007_emissions,2008_emissions,2009_emissions,2010_emissions,2011_emissions,2012_emissions,2013_emissions,2014_emissions,2015_emissions,2016_emissions,2017_emissions,1990_stringency,1991_stringency,1992_stringency,1993_stringency,1994_stringency,1995_stringency,1996_stringency,1997_stringency,1998_stringency,1999_stringency,2000_stringency,...,1995_lead,2000_lead,2005_lead,2006_lead,2007_lead,2008_lead,2009_lead,2010_lead,2011_lead,2012_lead,2013_lead,2014_lead,2015_lead,2016_lead,2017_lead,Mammals_percent_species,Birds_percent_species,Reptiles_percent_species,Amphibians_percent_species,Fish_percent_species,Marine Fish_percent_species,Freshwater Fish_percent_species,Vascular plants_percent_species,Mosses_percent_species,Lichens_percent_species,Invertebrates_percent_species,2005_population,2006_population,2007_population,2008_population,2009_population,2010_population,2011_population,2012_population,2013_population,2014_population,2015_population,2016_population,2017_population,2018_population
0,Australia,420315,421381,425702,426232,426305,434913,442506,454629,468406,474027,485019,492462,496319,498119,515931,521801,526437,533138,537032,540913,537275,538281,540616,530434,524957,535174,546772,554127,0.5,0.5,0.5,0.75,0.5,0.5,0.46,0.46,0.77,1.02,0.98,...,2291,2220,2083,2078,2124,2132,2087,2043,2047,2000,1985,2031,2078,2134,2203,27,17,6,12,1,,,7,0,,0,20176844,20450966,20827622,21249199,21691653,22031750,22340024,22733465,23128129,23475686,23815995,24190907,24601860,24992860
1,Austria,78670,82349,75750,75932,76207,79584,82875,82405,81702,80105,80415,84324,86111,91788,91383,92567,90117,87473,86816,80329,84753,82460,79811,80353,76680,78897,79596,82261,1.17,1.42,1.42,1.48,1.48,1.94,1.94,1.85,1.94,1.85,2.02,...,688,684,608,588,591,584,592,589,594,595,589,589,596,590,590,27,27,64,60,46,,46.0,33,23,21.0,2,8225278,8267948,8295189,8321541,8341483,8361069,8388534,8426311,8477230,8543932,8629519,8739806,8795073,8837707
2,Belgium,146587,149324,148920,147890,152523,154665,158303,149782,154957,148645,149730,148132,147567,147932,149231,145284,142660,138961,138828,126263,132922,122198,119373,119304,113506,117122,115783,114540,0.67,0.67,0.71,0.77,0.77,0.77,0.77,0.77,0.77,0.77,0.85,...,2142,2131,1925,1841,1841,1840,1810,1766,1770,1783,1769,1702,1734,1712,1722,21,28,40,32,20,14.0,35.0,23,27,59.0,11,10478617,10547956,10625701,10709976,10796498,10895589,10993616,11067748,11125033,11179778,11238474,11295003,11349081,11403740
3,Canada,602184,593402,610441,612264,633675,651011,672053,686988,694531,707376,730588,719733,724349,741003,742972,730349,721445,743767,723225,681699,692619,703379,711023,722063,723091,721992,707727,715749,0.38,0.38,0.71,0.5,0.5,0.5,0.46,0.65,0.65,0.65,0.9,...,1414,1360,1260,1227,1231,1226,1201,1176,1161,1179,1168,1202,1237,1251,1272,24,13,58,34,6,4.0,22.0,21,21,21.0,6,32243753,32571174,32889025,33247118,33628895,34004889,34339328,34714222,35082954,35437435,35702908,36109487,36540268,37058856
4,Chile,52016,50681,52579,55432,58349,61452,67715,75203,76151,79279,76587,74732,76179,76956,82609,84334,85555,93656,94262,90887,91862,99861,104492,104304,101474,108243,111678,715749,0.38,0.38,0.71,0.5,0.5,0.5,0.46,0.65,0.65,0.65,0.9,...,237,248,271,270,274,269,284,293,290,294,300,301,308,315,327,21,9,40,68,6,4.0,22.0,21,21,21.0,1,16183489,16347890,16517933,16697754,16881078,17063927,17254159,17443491,17611902,17787617,17971423,18167147,18419192,18751405


In [0]:
numeric = df.select_dtypes(include= "number").columns
categorical = df.select_dtypes(exclude = "number").columns

In [0]:
c_steps = [('c_imputer', SimpleImputer(strategy="most_frequent"))]
c_pipeline = Pipeline(c_steps)

n_steps = [('n_imputer', SimpleImputer())]
n_pipeline = Pipeline(n_steps)

In [0]:
df[numeric] = n_pipeline.fit_transform(df[numeric])
df[categorical] = c_pipeline.fit_transform(df[categorical])

In [0]:
def simple_preprocess(df):
  
  numeric = df.select_dtypes(include= "number").columns
  categorical = df.select_dtypes(exclude = "number").columns
  
  c_steps = [('c_imputer', SimpleImputer(strategy="most_frequent"))]
  c_pipeline = Pipeline(c_steps)
  
  n_steps = [('n_imputer', SimpleImputer())]
  n_pipeline = Pipeline(n_steps)
  
  df[numeric] = n_pipeline.fit_transform(df[numeric])
  df[categorical] = c_pipeline.fit_transform(df[categorical])
  
  return df

In [0]:
def train_validate_test_split(df, train_percent=.5, validate_percent=.25, seed=42):
    np.random.seed(42)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    test = df.iloc[perm[validate_end:]]
    return train, validate, test

train, val, test = train_validate_test_split(df)

train.shape, val.shape, test.shape

((24, 163), (12, 163), (12, 163))

In [0]:
target = 'total_threatened'
features = df.columns.drop(target)

X_train = train.drop(columns=target)
y_train = train[target]
X_val = train.drop(columns=target)
y_val = train[target]
X_test = test

In [0]:
from xgboost import XGBClassifier

xgboost = make_pipeline(
    ce.OrdinalEncoder(),
    XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

xgboost.fit(X_train, y_train)
print('Validation Accuracy', xgboost.score(X_val, y_val))

Validation Accuracy 0.9166666666666666
