Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because for the rest of the week we'll be making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.
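
The modeling tasks above can be sketched end to end on a practice dataset. This is a minimal sketch under stated assumptions, not a reference solution: the data is synthetic (generated with `make_classification`), and scikit-learn's `GradientBoostingClassifier` stands in for xgboost so that the example needs only scikit-learn. xgboost's `XGBClassifier` follows the same `fit`/`score` interface and could be swapped in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; use your own project data instead).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the majority class seen in the training set.
majority_class = np.bincount(y_train).argmax()
baseline_acc = np.mean(y_val == majority_class)

# Gradient boosting model (stand-in for xgboost, see the note above).
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
val_acc = model.score(X_val, y_val)

# Permutation importances: shuffle one column at a time on the validation
# set and record how much the score drops.
result = permutation_importance(
    model, X_val, y_val, n_repeats=5, random_state=42
)

print(f"Baseline accuracy:   {baseline_acc:.3f}")
print(f"Validation accuracy: {val_acc:.3f}")
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Computing the importances on the validation set rather than the training set shows which features the model relies on for data it has not seen.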


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from category_encoders import OneHotEncoder, OrdinalEncoder
from pandas_profiling import ProfileReport
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, plot_confusion_matrix, plot_roc_curve, roc_auc_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [3]:
def wrangle(filepath):
    # Establish path to datafiles
    DATA_PATH = '../data/class-project/LoanApproval/'
    
    # Read in the data
    df = pd.read_csv(DATA_PATH + filepath)
    
    # Remove underscores from column names (used in only 4 of the 12 features)
    # for consistent formatting and easier access to features
    new_cols = [col.replace('_', '') for col in df.columns]
    df.columns = new_cols
    
    # Drop high-cardinality categorical columns (here, 'LoanID')
    threshold = 10
    high_card_cols = [col for col in df.select_dtypes('object').columns
                      if df[col].nunique() > threshold]
    df.drop(high_card_cols, axis=1, inplace=True)
      
    # 'LoanAmount' and 'LoanAmountTerm' are continuous variables with NaN values.
    # Fill 'LoanAmount' with its median and 'LoanAmountTerm' with its mode, 360
    # (the most frequent term). The remaining NaNs are imputed later with
    # strategy='most_frequent'.
    df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())
    df['LoanAmountTerm'] = df['LoanAmountTerm'].fillna(df['LoanAmountTerm'].mode()[0])
    
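    # Strip the '+' from 'Dependents' values. The column stays string-typed
    # (NaNs are left for later imputation); regex=False treats '+' literally.
    df['Dependents'] = df['Dependents'].str.replace('+', '', regex=False)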
    
    # Feature Engineering (No features are being built based on the target; should not be any leakage)
    # Applicant Income to LoanAmount
    df['ApplicantIncome2LoanAmount'] = (df['ApplicantIncome'] / df['LoanAmount']).round(2)
    
    # Total Income to LoanAmount
    df['TotalIncome2LoanAmount'] = ((df['ApplicantIncome'] + df['CoapplicantIncome']) /
                                    df['LoanAmount']).round(2)
    
    # Loan Amount to Loan_Amount_Term
    df['LoanAmount2LoanTerm'] = (df['LoanAmount'] / df['LoanAmountTerm']).round(2)
    
    # Convert 'CoapplicantIncome', 'LoanAmount', and 'LoanAmountTerm' to integers from floats
    for col in ['CoapplicantIncome', 'LoanAmount','LoanAmountTerm']:
        df[col] = df[col].astype(int)
    
    return df

train_path = 'train_data.csv'
test_path = 'test_data.csv'

train = wrangle(train_path)
test = wrangle(test_path)
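
The `'Dependents'` step in `wrangle` only strips a character, so the values remain strings. The sketch below shows this on a small series with hypothetical values (the values, and the choice of the nullable `Int64` dtype, are assumptions for illustration) and shows one way to get actual integers if a later step needs them.

```python
import pandas as pd

# Hypothetical values mimicking the 'Dependents' column, with one missing entry.
dependents = pd.Series(['0', '1', '3+', None, '2'], name='Dependents')

# regex=False treats '+' as a literal character rather than a regex quantifier.
stripped = dependents.str.replace('+', '', regex=False)
print(stripped.tolist())  # values are still strings; the missing entry is kept

# One way to get numbers: parse with pd.to_numeric, then cast to the nullable
# Int64 dtype, which can hold missing values alongside integers.
as_int = pd.to_numeric(stripped).astype('Int64')
print(as_int.tolist())
```

Leaving the column as strings, as `wrangle` does, also works when the column is later passed to a categorical encoder. The numeric version preserves the ordering of the counts.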

In [None]:
# DATA_PATH is local to wrangle(), so the full path is spelled out here
# ProfileReport(pd.read_csv('../data/class-project/LoanApproval/train_data.csv'))

In [4]:
train.head()

| | Gender | Married | Dependents | Education | SelfEmployed | ApplicantIncome | CoapplicantIncome | LoanAmount | LoanAmountTerm | CreditHistory | PropertyArea | LoanStatus | ApplicantIncome2LoanAmount | TotalIncome2LoanAmount | LoanAmount2LoanTerm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | No | 0 | Graduate | No | 5849 | 0 | 128 | 360 | 1.0 | Urban | Y | 45.7 | 45.7 | 0.36 |
| 1 | Male | Yes | 1 | Graduate | No | 4583 | 1508 | 128 | 360 | 1.0 | Rural | N | 35.8 | 47.59 | 0.36 |
| 2 | Male | Yes | 0 | Graduate | Yes | 3000 | 0 | 66 | 360 | 1.0 | Urban | Y | 45.45 | 45.45 | 0.18 |
| 3 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358 | 120 | 360 | 1.0 | Urban | Y | 21.52 | 41.18 | 0.33 |
| 4 | Male | No | 0 | Graduate | No | 6000 | 0 | 141 | 360 | 1.0 | Urban | Y | 42.55 | 42.55 | 0.39 |


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 15 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Gender                      601 non-null    object 
 1   Married                     611 non-null    object 
 2   Dependents                  599 non-null    object 
 3   Education                   614 non-null    object 
 4   SelfEmployed                582 non-null    object 
 5   ApplicantIncome             614 non-null    int64  
 6   CoapplicantIncome           614 non-null    int64  
 7   LoanAmount                  614 non-null    int64  
 8   LoanAmountTerm              614 non-null    int64  
 9   CreditHistory               564 non-null    float64
 10  PropertyArea                614 non-null    object 
 11  LoanStatus                  614 non-null    object 
 12  ApplicantIncome2LoanAmount  614 non-null    float64
 13  TotalIncome2LoanAmount      614 non