## BloomTech Data Science

---

# Permutation & Boosting

- Get **permutation importances** for model interpretation and feature selection
- Use xgboost for **gradient boosting**

# Downloading the Tanzania Waterpump Dataset

Make sure you only use the dataset that is available through the **DS** **Kaggle Competition**. DO NOT USE any other Tanzania waterpump datasets that you might find online.

There are two ways to get the dataset. Make sure you have joined the competition first!

1. Download the dataset directly by accessing the challenge and its files through the Kaggle Competition URL on Canvas.

2. Use the Kaggle API with the code in the following cells. This article explains how to fetch a Kaggle dataset into Google Colab using the Kaggle API:

> https://medium.com/analytics-vidhya/how-to-fetch-kaggle-datasets-into-google-colab-ea682569851a

# Using the Kaggle API to download the dataset

In [1]:
# mounting your google drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
# Change your working directory to the Drive folder that holds (or will hold)
# the dataset and/or your Kaggle API token (kaggle.json). Update the path to match your own folder.
%cd /content/gdrive/My Drive/Kaggle

/content/gdrive/My Drive/Kaggle


In [4]:
# Download your Kaggle Dataset, if you haven't already done so.
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"
!kaggle competitions download -c bloomtech-water-pump-challenge # double check if this command corresponds with the one given in Canvas

Downloading bloomtech-water-pump-challenge.zip to /content/gdrive/My Drive/Kaggle
  0% 0.00/4.18M [00:00<?, ?B/s]
100% 4.18M/4.18M [00:00<00:00, 63.4MB/s]


In [5]:
# Unzip your Kaggle dataset, if you haven't already done so.
!unzip \*.zip  && rm *.zip

Archive:  bloomtech-water-pump-challenge.zip
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: sample_submission.csv   
  inflating: test_features.csv       
  inflating: train_features.csv      
  inflating: train_labels.csv        


In [6]:
# List all files in your Kaggle folder on your google drive.
!ls

2024-01-08_2015_submission.csv	new_submission.csv     train_features.csv
kaggle.json			sample_submission.csv  train_labels.csv
model_rf_rs_80			test_features.csv      Untitled0.ipynb


# Install Libraries

In [7]:
%%capture
!pip install category_encoders==2.*

In [8]:
# data analysis and wrangling
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split

# encoders
from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer

#pipeline
from sklearn.pipeline import make_pipeline

# Bagged Model
from sklearn.ensemble import RandomForestClassifier

# Boosted Models
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Permutation Importance
from sklearn.inspection import permutation_importance

# for displaying images and html (HTML is importable from IPython.display directly)
from IPython.display import Image, HTML

# Wrangle Data

We'll go back to Tanzania Waterpumps for this lesson.

In [9]:
def wrangle(fm_path, tv_path=None):
    if tv_path:
        df = pd.merge(pd.read_csv(fm_path, parse_dates=['date_recorded'], na_values=[-2e-08, 0]),
                         pd.read_csv(tv_path)).set_index('date_recorded')

    else:
        df = pd.read_csv(fm_path, parse_dates=['date_recorded']).set_index('date_recorded')

    # Zeros in these columns are not real measurements; they stand in for nulls.
    # Replace the zeros with NaN and impute the missing values later.
    # Also create a "missing indicator" column for each one, because the fact
    # that a value is missing may itself be a predictive signal.
    cols_with_zeros = ['longitude', 'latitude', 'construction_year',
                       'gps_height', 'population']
    for col in cols_with_zeros:
        df[col] = df[col].replace(0, np.nan)
        df[col+'_MISSING'] = df[col].isnull()

    # Drop duplicate columns
    duplicates = ['quantity_group', 'payment_type']
    df = df.drop(columns=duplicates)

    # Drop recorded_by (never varies) and id (always varies, random)
    unusable_variance = ['recorded_by', 'id']
    df = df.drop(columns=unusable_variance)

    # Extract components from date_recorded
    df['year_recorded'] = df.index.year
    df['month_recorded'] = df.index.month
    df['day_recorded'] = df.index.day

    # Engineer feature: how many years from construction_year to date_recorded
    df['pump_age'] = df['year_recorded'] - df['construction_year']
    df['years_MISSING'] = df['pump_age'].isnull()

    # return the wrangled dataframe
    return df
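
The zero-to-null step inside `wrangle` is easier to see on a few rows. Below is a minimal sketch on a made-up frame (the name `toy` and its values are illustrative assumptions, not rows from the competition data):

```python
import numpy as np
import pandas as pd

# Toy data: 0 stands in for "not recorded" in both columns.
toy = pd.DataFrame({
    'construction_year': [1995, 0, 2008],
    'population': [120, 0, 0],
})

for col in ['construction_year', 'population']:
    toy[col] = toy[col].replace(0, np.nan)     # zeros become nulls
    toy[col + '_MISSING'] = toy[col].isnull()  # flag the rows that were missing

print(toy)
```

The imputer in the pipeline later fills the nulls, while the `_MISSING` columns preserve the information that a value was absent.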

In [10]:
fm_path = 'train_features.csv'
tv_path = 'train_labels.csv'

df = wrangle(fm_path, tv_path)

In [11]:
# Split data into feature and target
target = 'status_group'
X, y = df.drop(columns = target), df[target]

In [21]:
# Split data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=42)

In [22]:
# Baseline Metric
print('The baseline accuracy is ', y_train.value_counts(normalize=True).max())

The baseline accuracy is  0.5425489938182296


# Build Model

In [23]:
# Random Forest Model

model_rf = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=75)
)

# Fit on train (scoring on train and val happens under "Check Metrics")
model_rf.fit(X_train, y_train)

In [24]:
# Gradient Boosted Model

model_gb = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    GradientBoostingClassifier(random_state=42, n_estimators=75)
)

model_gb.fit(X_train, y_train)

In [29]:
# XGB Model

model_xgb = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    XGBClassifier(random_state=42, n_estimators=75, n_jobs=-1)
)

# NOTE: fitting on the string labels raises the "Invalid classes inferred from
# unique values of `y`" error shown below, because XGBClassifier expects the
# classes to be encoded as integers (0, 1, 2).

# Attempt 1: map the target to integers by hand (suggested by an AI assistant).
# I suspect this affects the accuracy; still to be investigated.
# yz_train = y_train.replace({'functional': 0, 'functional needs repair': 1, 'non functional': 2})
# yz_val = y_val.replace({'functional': 0, 'functional needs repair': 1, 'non functional': 2})

# Attempt 2: LabelEncoder, from a Stack Overflow answer.
# Note that this encodes y_train only; y_val would need le.transform(y_val) as well.
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# y_train = le.fit_transform(y_train)

# Neither attempt worked properly as written, so back to the regular code:

model_xgb.fit(X_train,y_train);

ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1 2], got ['functional' 'functional needs repair' 'non functional']
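
The error message says the classifier expected `[0 1 2]` rather than strings. One common way to handle this is to encode the target with a single `LabelEncoder`, fitted on the training labels and reused on the validation labels. The sketch below uses toy targets, and the names `y_train_enc` and `y_val_enc` are hypothetical; it is an illustration under those assumptions, not necessarily the fix this notebook ends up using.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy targets standing in for y_train / y_val.
y_train_toy = pd.Series(['functional', 'non functional',
                         'functional needs repair', 'functional'])
y_val_toy = pd.Series(['non functional', 'functional'])

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train_toy)  # learn the mapping on train only
y_val_enc = le.transform(y_val_toy)          # reuse the same mapping on val

print(dict(zip(le.classes_, le.transform(le.classes_))))  # string -> integer mapping
print(le.inverse_transform(y_val_enc))                    # back to the original strings
```

Both splits have to go through the same encoder. If a model is fit on encoded labels but scored against the original strings, no prediction can match a label, which is one plausible way to end up with a validation accuracy of 0.0.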

# Check Metrics

In [26]:
print('Training Accuracy', model_rf.score(X_train, y_train))
print('Validation Accuracy', model_rf.score(X_val, y_val))

Training Accuracy 0.9999210837827174
Validation Accuracy 0.8041877104377104


In [27]:
print('Training Accuracy', model_gb.score(X_train, y_train))
print('Validation Accuracy', model_gb.score(X_val, y_val))

Training Accuracy 0.7537287912666053
Validation Accuracy 0.7432659932659933


In [28]:
print('Training Accuracy', model_xgb.score(X_train, y_train))
print('Validation Accuracy', model_xgb.score(X_val, y_val))

Training Accuracy 0.8464553465737209
Validation Accuracy 0.0


### Try adjusting these hyperparameters

#### Random Forest
- class_weight (for imbalanced classes)
- max_depth (usually high, can try decreasing)
- n_estimators (too low underfits, too high wastes time)
- min_samples_leaf (increase if overfitting)
- max_features (decrease for more diverse trees)

#### XGBoost
- scale_pos_weight (for imbalanced classes; applies to binary classification)
- max_depth (usually low, can try increasing)
- n_estimators (too low underfits, too high wastes time and can overfit; use early stopping!)
- learning_rate (too low underfits, too high overfits)

For more ideas, see [Notes on Parameter Tuning](https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html) and [DART booster](https://xgboost.readthedocs.io/en/latest/tutorials/dart.html).
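
To illustrate the early-stopping idea from the list above without depending on a particular xgboost version, here is a minimal sketch using scikit-learn's `GradientBoostingClassifier` as a stand-in. The synthetic data and the specific values (500 rounds, patience of 10) are assumptions chosen for illustration; XGBoost has its own early-stopping options, which are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, only for demonstration.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)

model = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,
    validation_fraction=0.2,  # held out internally to monitor the score
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
model.fit(X_toy, y_toy)

print('Rounds actually fitted:', model.n_estimators_)
```

With early stopping, `n_estimators` becomes a ceiling rather than a value that has to be tuned exactly.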

# Tuning / Communication

- How can we determine or communicate which features are most important to our model when making predictions?

**Option 1:** Grab feature importances from our pipeline


In [None]:



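# Assumption: read importances from the random forest pipeline (its importances are Gini-based)
importances = model_rf.named_steps['randomforestclassifier'].feature_importances_
feat_imp = pd.Series(importances, index=X_train.columns).sort_values()
feat_imp.tail(10).plot(kind='barh')
plt.xlabel('Gini Importance')
plt.ylabel('Feature')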
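
As a self-contained illustration of Option 1, the sketch below fits a small random forest on synthetic data and reads its `feature_importances_`. The data and the feature names `f0` to `f4` are made up for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 2 informative features out of 5.
X_arr, y_toy = make_classification(n_samples=300, n_features=5, n_informative=2,
                                   n_redundant=0, random_state=42)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(5)])

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_toy, y_toy)

# Impurity-based (Gini) importances are normalized to sum to 1.
gini_imp = pd.Series(forest.feature_importances_, index=X_toy.columns).sort_values()
print(gini_imp)
```

These importances are computed from the training data only, which is part of the motivation for the validation-based options that follow.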

**Option 2:** Drop-column Importance


In [None]:
selected_column = 'quantity'  # example choice; any feature column in X_train works

In [None]:
model_with_col = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    XGBClassifier(n_estimators=75,
                  random_state=42,
                  n_jobs=-1,)
)

model_with_col.fit(X_train, y_train)

print(f'Validation Accuracy w/ "{selected_column}" included:', model_with_col.score(X_val, y_val))

In [None]:




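# Retrain the same pipeline without the selected column
model_without_col = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    XGBClassifier(n_estimators=75, random_state=42, n_jobs=-1)
)
model_without_col.fit(X_train.drop(columns=selected_column), y_train)

print(f'Validation Accuracy w/ "{selected_column}" excluded:',
      model_without_col.score(X_val.drop(columns=selected_column), y_val))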
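
Drop-column importance can be repeated for every feature: retrain once per dropped column and compare against the full model. A minimal sketch on synthetic data follows; the helper `val_accuracy` and the feature names are hypothetical, and a random forest is used only to keep the example light.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_arr, y_toy = make_classification(n_samples=400, n_features=5, n_informative=2,
                                   n_redundant=0, random_state=42)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(5)])
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

def val_accuracy(columns):
    """Fit a fresh model on the given columns and return its validation accuracy."""
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X_tr[columns], y_tr)
    return model.score(X_va[columns], y_va)

all_cols = list(X_toy.columns)
baseline = val_accuracy(all_cols)

# Importance of a column = accuracy lost when the model is retrained without it.
drop_col_imp = {col: baseline - val_accuracy([c for c in all_cols if c != col])
                for col in all_cols}
print(pd.Series(drop_col_imp).sort_values())
```

The cost is one full retraining per feature, which is what makes permutation importance attractive on wider datasets.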

**Option 3:** Permutation Importance

![](https://i.imgur.com/h17tMUU.png)

In [None]:
# By hand

# Step 1: Choose my feature
column_to_permute = 'quantity'  # example choice; any feature column in X_val works

# Step 2: Train model w/ ALL features
model_to_permute = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    XGBClassifier(n_estimators=75,
                  random_state=42,
                  n_jobs=-1,)
)

model_to_permute.fit(X_train, y_train);

In [None]:
# Step 3: Evaluate model using VALIDATION DATA.
print('Validation Accuracy', model_to_permute.score(X_val, y_val))

In [None]:
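# Step 4: In VALIDATION DATA, permute the feature we're evaluating
X_val_permuted = X_val.copy()
X_val_permuted[column_to_permute] = np.random.permutation(X_val_permuted[column_to_permute].to_numpy())

In [None]:
# Step 5: Calculate our error metric with the permuted data
print('Validation Accuracy', model_to_permute.score(X_val_permuted, y_val))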
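
The five steps above can be sketched end to end on synthetic data. The feature name `f0`, the seed, and the random forest model are assumptions made for the example; the point is that the model is trained once and only the validation column is shuffled.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_arr, y_toy = make_classification(n_samples=400, n_features=5, n_informative=2,
                                   n_redundant=0, random_state=42)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(5)])
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Train once on all features, then score on untouched validation data.
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
baseline = model.score(X_va, y_va)

# Shuffle a single validation column, leaving every other column intact.
col = 'f0'  # hypothetical feature to evaluate
rng = np.random.default_rng(42)
X_va_perm = X_va.copy()
X_va_perm[col] = rng.permutation(X_va_perm[col].to_numpy())

permuted = model.score(X_va_perm, y_va)
print(f'Permutation importance of {col}:', baseline - permuted)
```

Because shuffling is random, a single run is noisy; repeating the shuffle several times and averaging gives a steadier estimate.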

In [None]:
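# Automated using sklearn (model_to_permute is the pipeline fit in Step 2)
perm_imp = permutation_importance(model_to_permute, X_val, y_val, n_repeats=5, n_jobs=-1, random_state=42)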


In [None]:
data_perm = {'imp_mean':perm_imp['importances_mean'],
             'imp_std':perm_imp['importances_std']}
df_perm = pd.DataFrame(data_perm, index=X_val.columns).sort_values('imp_mean')

In [None]:
df_perm

In [None]:
df_perm['imp_mean'].tail(10).plot(kind='barh')
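
Permutation importances can also drive feature selection, the first objective of this lesson. One simple rule, shown here as a sketch on synthetic data, is to keep only the features whose mean importance is positive and refit. The threshold of 0 and all names in the block are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_arr, y_toy = make_classification(n_samples=400, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=42)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
result = permutation_importance(model, X_va, y_va, n_repeats=5, random_state=42)

# Keep features whose shuffling hurt validation accuracy on average.
imp_mean = pd.Series(result.importances_mean, index=X_va.columns)
keep = imp_mean[imp_mean > 0].index.tolist() or list(X_va.columns)  # fall back to all columns

model_small = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr[keep], y_tr)
print('Kept features:', keep)
print('Val accuracy, all features: ', model.score(X_va, y_va))
print('Val accuracy, kept features:', model_small.score(X_va[keep], y_va))
```

Selecting features on the validation set and then reporting accuracy on that same set is optimistic, so a separate test set is the fairer final check.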
