Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.
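
For the permutation importances task above, the idea is to shuffle one feature at a time in held-out data and measure how much the model's score degrades. A minimal sketch follows, using scikit-learn's `permutation_importance` as one readily available option. The synthetic data, the random forest model, and the feature names `x0`, `x1`, `x2` are assumptions made for illustration and are not part of the assignment.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): y depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the validation set several times and record
# the average drop in score (here: the increase in mean absolute error).
result = permutation_importance(
    model, X_va, y_va,
    scoring='neg_mean_absolute_error',
    n_repeats=5,
    random_state=0,
)
importances = result.importances_mean

for name, imp in zip(['x0', 'x1', 'x2'], importances):
    print(f'{name}: {imp:.3f}')
```

Computing the importances on validation data rather than training data keeps the scores tied to how the model generalizes.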

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard
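
To make the core idea concrete: with squared-error loss, gradient boosting repeatedly fits a small tree to the current residuals and adds a shrunken copy of its predictions to the running estimate. The sketch below is a hand-rolled illustration of that loop, not xgboost itself. The synthetic data, `learning_rate`, `n_rounds`, and the tree depth are all assumed values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1  # shrinkage applied to each tree's contribution
n_rounds = 50        # number of boosting rounds

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y, y.mean())
initial_mae = np.mean(np.abs(y - prediction))

trees = []
for _ in range(n_rounds):
    # For squared-error loss, the negative gradient is the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mae = np.mean(np.abs(y - prediction))
print(f'MAE before boosting: {initial_mae:.3f}')
print(f'MAE after {n_rounds} rounds: {final_mae:.3f}')
```

This measures training error only; in practice the number of rounds and the learning rate would be tuned against a validation set.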

In [1]:
# Install the profiling dependency before importing it (notebook shell command)
!pip install pandas-profiling==2.*

# Imports
import sys

import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from pandas_profiling import ProfileReport
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import mean_poisson_deviance, mean_gamma_deviance, mean_tweedie_deviance
from sklearn.metrics import explained_variance_score, max_error
from sklearn.model_selection import train_test_split





In [2]:
# Load the dataset and drop rows where Births (the target) is missing
TPPI = pd.read_csv('Total_Population_Period_Indicators')
df = TPPI.dropna(subset=['Births'])

## **Exploratory Analysis:**

In [3]:
# pandas-profiling report for the dataset
# (to_notebook_iframe() returns None, so keep the report object separately)
profile = ProfileReport(df, minimal=True)
profile.to_notebook_iframe()


In [4]:
df.describe()

Unnamed: 0.1,Unnamed: 0,LocID,MidPeriod,Total Fertility,NRR(surviving daughters),Crude Birth Rate,Births,Life Expectancy(birth),LExMale,LExFemale,Infant Mortality Rate,Under-five Mortality,Crude Death Rate,Deaths,DeathsMale,DeathsFemale,Net Migration Rate(pK),Net Migrants(K),GrowthRate,Natural Increase Rate,Sex ratio(m per f births,Mean Age Childbearing(f),PopFemale,PopMale,PopTotal,PopDensity
count,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0,7088.0
mean,3814.994357,1077.688488,1994.875,3.831392,1.534372,27.639051,57279.246641,64.279876,62.067798,66.536494,59.517827,87.951534,11.360619,24977.727445,13181.393661,11796.333783,-0.130939,-214.672219,1.617724,16.278433,1.053711,28.810993,245247.7,248902.7,494150.4,141.007502
std,2210.800671,729.124394,33.862164,1.897975,0.60477,12.934401,119362.543037,12.607073,12.158732,13.112974,52.344031,85.371346,5.500324,53517.256616,28209.979282,25322.540725,7.185746,2824.85032,1.247379,10.526515,0.017371,1.415274,543754.1,556001.4,1099675.0,827.146104
min,0.0,4.0,1953.0,0.85,0.41,5.558,1.198,14.49,11.88,18.12,0.126,0.231,1.147,1.597,0.808,0.681,-70.787,-24462.517,-5.321,-20.979,1.004,23.897,7.923,9.321,17.244,0.07
25%,1915.75,462.0,1971.75,2.07975,0.97,15.91175,602.71825,55.34,53.58,57.05,16.03125,19.55275,7.5085,287.015,150.79,136.60325,-1.4,-457.89,0.746,7.60475,1.05,27.712,2499.81,2478.078,4949.038,17.0575
50%,3783.5,922.0,1990.5,3.301,1.4575,25.893,5639.506,66.565,64.045,69.24,43.6235,56.9285,9.8805,2716.019,1410.282,1289.021,-0.239,-17.073,1.663,17.738,1.05,28.997,25883.8,25291.37,51352.56,37.36
75%,5779.25,1596.0,2009.25,5.642,2.039,39.64475,51853.4955,73.34,70.5,76.29,93.70925,137.602,13.35675,20751.77825,11056.459,9867.95,0.85925,88.075,2.431,25.1835,1.06,29.70325,212296.2,206557.2,418454.2,102.086
max,7631.0,5501.0,2098.0,8.8,3.653,58.263,701277.931,94.02,91.14,96.93,319.239,465.517,61.634,606048.782,313187.571,292861.211,134.414,23278.467,16.986,42.294,1.173,34.997,5425119.0,5443229.0,10868350.0,33470.669


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7088 entries, 0 to 7631
Data columns (total 28 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                7088 non-null   int64  
 1   LocID                     7088 non-null   int64  
 2   Location                  7088 non-null   object 
 3   Time Period               7088 non-null   object 
 4   MidPeriod                 7088 non-null   int64  
 5   Total Fertility           7088 non-null   float64
 6   NRR(surviving daughters)  7088 non-null   float64
 7   Crude Birth Rate          7088 non-null   float64
 8   Births                    7088 non-null   float64
 9   Life Expectancy(birth)    7088 non-null   float64
 10  LExMale                   7088 non-null   float64
 11  LExFemale                 7088 non-null   float64
 12  Infant Mortality Rate     7088 non-null   float64
 13  Under-five Mortality      7088 non-null   float64
 14  Crude De

In [6]:
df.corr()

Unnamed: 0.1,Unnamed: 0,LocID,MidPeriod,Total Fertility,NRR(surviving daughters),Crude Birth Rate,Births,Life Expectancy(birth),LExMale,LExFemale,Infant Mortality Rate,Under-five Mortality,Crude Death Rate,Deaths,DeathsMale,DeathsFemale,Net Migration Rate(pK),Net Migrants(K),GrowthRate,Natural Increase Rate,Sex ratio(m per f births,Mean Age Childbearing(f),PopFemale,PopMale,PopTotal,PopDensity
Unnamed: 0,1.0,0.115001,0.001882997,0.035619,0.049049,0.027552,0.083345,-0.007368,-0.00824,-0.005441,0.001786,0.002113,-0.00478,0.079705,0.079601,0.079773,0.009614,0.000467,0.036884,0.036352,-0.040937,0.030808,0.078898,0.076674,0.077779,-0.053888
LocID,0.115001,1.0,6.678162e-17,-0.018191,-0.042183,-0.00699,0.349902,-0.050572,-0.048896,-0.05198,0.074841,0.061842,0.022538,0.339482,0.339806,0.338917,0.012125,-0.075898,-0.010909,-0.020365,0.17133,-0.06617,0.326534,0.324994,0.32578,-0.087305
MidPeriod,0.001883,6.678162e-17,1.0,-0.53658,-0.488129,-0.577177,0.024017,0.657374,0.676729,0.636959,-0.592875,-0.558928,-0.350596,0.124363,0.123335,0.125432,0.010903,-0.03713,-0.438558,-0.52601,0.060118,0.239992,0.171441,0.169626,0.170536,0.066422
Total Fertility,0.035619,-0.01819134,-0.5365795,1.0,0.94737,0.983822,-0.026479,-0.878907,-0.862902,-0.887643,0.858983,0.858352,0.632857,-0.078312,-0.080097,-0.076276,-0.050887,-0.0957,0.713911,0.878185,-0.366239,0.336256,-0.161623,-0.158142,-0.159875,-0.112642
NRR(surviving daughters),0.049049,-0.04218319,-0.4881295,0.94737,1.0,0.926462,-0.046548,-0.729365,-0.714381,-0.737235,0.681679,0.663777,0.379474,-0.111123,-0.112176,-0.109884,-0.049454,-0.103613,0.767387,0.940102,-0.356688,0.305017,-0.174548,-0.171272,-0.172905,-0.114052
Crude Birth Rate,0.027552,-0.006989685,-0.577177,0.983822,0.926462,1.0,-0.00532,-0.910672,-0.897149,-0.917208,0.87455,0.871301,0.609647,-0.075568,-0.076566,-0.074409,-0.073771,-0.127895,0.727406,0.910191,-0.358069,0.255076,-0.150655,-0.146758,-0.148696,-0.112709
Births,0.083345,0.3499024,0.02401737,-0.026479,-0.046548,-0.00532,1.0,-0.067764,-0.054524,-0.078954,0.085547,0.070177,-0.019299,0.922801,0.927863,0.916606,-0.003621,-0.377816,0.000301,0.003547,0.331068,-0.124402,0.922531,0.924822,0.923757,-0.038314
Life Expectancy(birth),-0.007368,-0.05057234,0.6573743,-0.878907,-0.729365,-0.910672,-0.067764,1.0,0.996916,0.997463,-0.949806,-0.950813,-0.773516,-0.002939,-0.001851,-0.004149,0.081298,0.125577,-0.557435,-0.714805,0.284401,-0.122073,0.085866,0.082442,0.084141,0.12688
LExMale,-0.00824,-0.04889595,0.6767288,-0.862902,-0.714381,-0.897149,-0.054524,0.996916,1.0,0.988903,-0.939478,-0.939908,-0.769638,0.010209,0.011214,0.009082,0.085915,0.112837,-0.54238,-0.700215,0.289666,-0.090666,0.096811,0.09406,0.095427,0.129967
LExFemale,-0.005441,-0.0519796,0.6369594,-0.887643,-0.737235,-0.917208,-0.078954,0.997463,0.988903,1.0,-0.954617,-0.956178,-0.77425,-0.014487,-0.013277,-0.015825,0.078383,0.135599,-0.56561,-0.722453,0.279231,-0.149532,0.076107,0.072141,0.074107,0.123035


In [7]:
# Baseline regression model (from previous assignment):
# always predict the (rounded) mean number of births
baseline = round(df.Births.mean())
baselist = [baseline] * len(df.Births)
# Mean absolute error of the constant prediction over the whole dataset
errs = baseline - df['Births']
mae = errs.abs().mean()
print(f'The baseline is {baseline:,.0f} births,')
print(f'with a mean absolute error of {mae:,.0f}.')

The baseline is 57,279 births,
with a mean absolute error of 72,921.


In [8]:
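# Train/Val/Test split
# First hold out 20% as Test; `train` (lowercase) is the remaining 80%
train, Test = train_test_split(df, train_size=0.80, test_size=0.20)
# Then split that 80% again: Train (capitalized) for fitting, Val for validation
Train, Val = train_test_split(train, train_size=0.80, test_size=0.20)
Train.shape, Val.shape, Test.shape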

((4536, 28), (1134, 28), (1418, 28))
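
Without a fixed seed, a split like this produces different rows on every run, so downstream results will vary. A minimal sketch of a reproducible three-way split is shown below, assuming a toy DataFrame (`df_demo`) and an arbitrary seed of 42. Neither comes from the project data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the project DataFrame (an assumption for illustration)
df_demo = pd.DataFrame({
    'feature': np.arange(100),
    'target': np.arange(100) * 2.0,
})

# Hold out 20% for testing, then 20% of the remainder for validation.
# Fixing random_state makes both splits repeatable across runs.
train_val, test = train_test_split(df_demo, test_size=0.20, random_state=42)
train, val = train_test_split(train_val, test_size=0.20, random_state=42)

print(train.shape, val.shape, test.shape)

# Sanity check: no row appears in more than one split
all_idx = set(train.index) | set(val.index) | set(test.index)
no_overlap = len(all_idx) == len(train) + len(val) + len(test)
print('Splits are disjoint:', no_overlap)
```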

## **Intro:**

In [9]:
# In the first assignment I chose 'Births' as my target
target = 'Births'
# The features chosen
features = ['Total Fertility','GrowthRate','Crude Birth Rate','Natural Increase Rate','PopFemale','PopTotal']

In [10]:
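# Columns to exclude: the CSV index artifact, the non-numeric labels, and the target
drop = ['Unnamed: 0','Location','Time Period', target]
# Note: `train` (lowercase) is the combined train + validation portion from In [8]
X_train = train.drop(columns=drop)
X_test = Test.drop(columns=drop)
Y_train = train[target]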

In [12]:
# I became curious about which features SelectKBest would choose
# (score_func is not set, so SelectKBest uses its default, f_classif)
select = SelectKBest(k=10)

x_trainS = select.fit_transform(X_train, Y_train)
x_testS = select.transform(X_test)

# Map the boolean support mask back to column names
selected_mask = select.get_support()
all_names = X_train.columns
selected_names = all_names[selected_mask]

print('Features selected:')
for name in selected_names:
    print(name)

Features selected:
MidPeriod
Deaths
DeathsMale
DeathsFemale
Net Migrants(K)
Mean Age Childbearing(f)
PopFemale
PopMale
PopTotal
PopDensity
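
When `score_func` is omitted, `SelectKBest` scores features with `f_classif`, which is intended for classification targets. For a continuous target such as `Births`, `f_regression` (already imported above) is the usual choice. The sketch below shows that variant on synthetic data; the DataFrame `X_demo`, the target `y_demo`, and the column names `a` through `e` are assumptions for illustration, and the selection it produces says nothing about the project dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption): only columns 'a' and 'c' drive the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 5)), columns=['a', 'b', 'c', 'd', 'e'])
y_demo = 3.0 * X_demo['a'] - 2.0 * X_demo['c'] + rng.normal(scale=0.1, size=200)

# Score each feature with a univariate F-test for regression and keep the top 2
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X_demo, y_demo)

selected = list(X_demo.columns[selector.get_support()])
print('Features selected:', selected)
```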


In [13]:
# Feature matrices (x_*) and target vectors (y_*) for each split
x_train = Train[features]
x_val   = Val[features]
x_test  = Test[features]

y_train = Train[target]
y_val   = Val[target]
y_test  = Test[target]
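
With the splits prepared, the next step in the assignment is to fit a model and check whether it beats the baseline. A minimal sketch of that comparison is given below, using `LinearRegression` and validation MAE. The data is synthetic and the variable names (`X_tr`, `X_va`, and so on) are assumptions, so the numbers it prints do not describe the births dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the train/validation splits (an assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = 10.0 + 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=400)
X_tr, X_va = X[:300], X[300:]
y_tr, y_va = y[:300], y[300:]

# Baseline: always predict the training-set mean, scored on validation data
baseline_pred = np.full_like(y_va, y_tr.mean())
baseline_mae = mean_absolute_error(y_va, baseline_pred)

# Candidate model: ordinary least squares fit on the training split only
model = LinearRegression().fit(X_tr, y_tr)
model_mae = mean_absolute_error(y_va, model.predict(X_va))

print(f'Baseline validation MAE: {baseline_mae:.3f}')
print(f'Model validation MAE:    {model_mae:.3f}')
print('Beats baseline:', model_mae < baseline_mae)
```

Computing the baseline mean from the training split alone keeps the comparison fair, since neither the baseline nor the model sees the validation rows.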