<h3><b> Kaggle: Intermediate Machine Learning </b></h3>

In [None]:
!pip3 install xgboost

Dummy dataframe and train/validation split:

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("file_name.csv")
y = df["predictor_col_name"]                 # Target column (the value to predict)
exp_rows = ["col_name1", "col_name2", ...]   # Explanatory (feature) column names
X = df[exp_rows]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

<b> Chapter 2: Missing Values </b>

Approaches to missing values:
1. Drop columns
2. Imputation (replace missing values with the mean, median, etc.)
3. Imputation plus a new binary column flagging which values were originally missing

In [None]:
# Libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error 
from sklearn.impute import SimpleImputer

# Fit a random forest on the training data and return its MAE on the validation data
def score_dataset(train_X, val_X, train_y, val_y):
    model = RandomForestRegressor(n_estimators = 10, random_state = 0)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    return mean_absolute_error(val_y, preds) # Can replace MAE with other types of error metric

# 1. Dropping columns
cols_with_missing = [col for col in train_X.columns if train_X[col].isnull().any()]
reduced_train_X = train_X.drop(cols_with_missing, axis = 1)
reduced_val_X = val_X.drop(cols_with_missing, axis = 1)

# 2. Imputation
my_imputer = SimpleImputer()
imputed_train_X = pd.DataFrame(my_imputer.fit_transform(train_X))
imputed_val_X = pd.DataFrame(my_imputer.transform(val_X))
imputed_train_X.columns = train_X.columns    # Put back column names
imputed_val_X.columns = val_X.columns

# 3. Imputation with dummy missing column
train_X_plus = train_X.copy()
val_X_plus = val_X.copy()
for col in cols_with_missing:                # For making new dummy columns
    train_X_plus[col + '_was_missing'] = train_X_plus[col].isnull()
    val_X_plus[col + '_was_missing'] = val_X_plus[col].isnull()

my_imputer = SimpleImputer()
imputed_train_X_plus = pd.DataFrame(my_imputer.fit_transform(train_X_plus))
imputed_val_X_plus = pd.DataFrame(my_imputer.transform(val_X_plus))
imputed_train_X_plus.columns = train_X_plus.columns
imputed_val_X_plus.columns = val_X_plus.columns
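
To see how the approaches can be compared with `score_dataset`, here is a rough sketch on synthetic data. The names `demo_X`, `demo_y`, `tr_X`, `va_X` and the data itself are made up for illustration; which approach scores better depends on the dataset, so no result is claimed here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): three numeric features, target depends on "a" and "b"
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
demo_y = 2 * demo_X["a"] - demo_X["b"] + rng.normal(scale=0.1, size=200)
demo_X.loc[rng.random(200) < 0.2, "b"] = np.nan   # blank out roughly 20% of column "b"

tr_X, va_X, tr_y, va_y = train_test_split(demo_X, demo_y, random_state=0)

def score_dataset(train_X, val_X, train_y, val_y):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# 1. Drop every column that has a missing value in the training data
missing = [col for col in tr_X.columns if tr_X[col].isnull().any()]
mae_drop = score_dataset(tr_X.drop(columns=missing), va_X.drop(columns=missing), tr_y, va_y)

# 2. Impute with the column mean (fit on training data only, then reuse on validation data)
imputer = SimpleImputer()
imp_tr = pd.DataFrame(imputer.fit_transform(tr_X), columns=tr_X.columns, index=tr_X.index)
imp_va = pd.DataFrame(imputer.transform(va_X), columns=va_X.columns, index=va_X.index)
mae_impute = score_dataset(imp_tr, imp_va, tr_y, va_y)

print("MAE (drop columns):", mae_drop)
print("MAE (imputation):  ", mae_impute)
```

Passing `columns=` and `index=` to `pd.DataFrame` keeps the original labels in one step, which is an alternative to reassigning `.columns` afterwards.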

<b> Chapter 3: Categorical Variables </b>

Approaches to categorical variables:
1. Drop variables
2. Ordinal encoding
3. One-Hot encoding (n binary cols for n categories)
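
As a small illustration of approaches 2 and 3, the sketch below encodes a toy `color` column (made up for this example) both ways. It relies on the default behaviour of scikit-learn's encoders, which sort the categories, so the ordinal codes follow alphabetical order here.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Ordinal encoding: one integer code per category (blue=0, green=1, red=2)
ordinal_codes = OrdinalEncoder().fit_transform(toy[["color"]])

# One-hot encoding: one binary column per category, in the order blue, green, red
one_hot = OneHotEncoder().fit_transform(toy[["color"]]).toarray()

print(ordinal_codes.ravel())
print(one_hot)
```

Ordinal encoding implies an ordering between categories, so it fits best when the categories really are ranked; one-hot encoding makes no such assumption but adds one column per category.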

In [None]:
# Libraries
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

# Get list of categorical variables
object_cols = [col for col in train_X.columns if train_X[col].dtype == "object"]

# 1. Drop categorical variables
drop_train_X = train_X.select_dtypes(exclude = ['object'])
drop_val_X = val_X.select_dtypes(exclude = ['object'])

# 2. Ordinal encoding 
label_train_X = train_X.copy()
label_val_X = val_X.copy()
ordinal_encoder = OrdinalEncoder()

# OrdinalEncoder raises an error if the validation data contains a category not seen in the training data
# Only encode columns whose validation categories all appear in the training set
good_label_cols = [col for col in object_cols if set(val_X[col]).issubset(set(train_X[col]))]
bad_label_cols = list(set(object_cols) - set(good_label_cols))  # Problematic columns, will be dropped

label_train_X = train_X.drop(bad_label_cols, axis=1)            # Drop problematic columns
label_val_X = val_X.drop(bad_label_cols, axis=1)
label_train_X[good_label_cols] = ordinal_encoder.fit_transform(train_X[good_label_cols])
label_val_X[good_label_cols] = ordinal_encoder.transform(val_X[good_label_cols])

# 3. One-Hot encoding
# Only choose columns with low cardinality (low # of categories), or else size of dataset explodes
low_cardinality_cols = [col for col in object_cols if train_X[col].nunique() < 10]
high_cardinality_cols = list(set(object_cols) - set(low_cardinality_cols))

# scikit-learn >= 1.2 names this argument sparse_output (older versions used sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(train_X[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(val_X[low_cardinality_cols]))

OH_cols_train.index = train_X.index                             # Put back the index (one-hot encoding removed it)
OH_cols_valid.index = val_X.index

num_train_X = train_X.drop(object_cols, axis = 1)               # Remove categorical columns
num_val_X = val_X.drop(object_cols, axis = 1)

OH_train_X = pd.concat([num_train_X, OH_cols_train], axis = 1)  # Add one-hot encoded columns
OH_val_X = pd.concat([num_val_X, OH_cols_valid], axis = 1)

OH_train_X.columns = OH_train_X.columns.astype(str)             # Ensure all columns are string dtype
OH_val_X.columns = OH_val_X.columns.astype(str)

<b> Chapter 4: Pipelines </b>

In [None]:
# Libraries
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# 1. Check for missing values
cols_with_missing = [col for col in train_X.columns if train_X[col].isnull().any()]

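# 2. Split columns into categorical and numerical (these names are used by the preprocessor below)
categorical_cols = [col for col in train_X.columns if train_X[col].dtype == "object"]
numerical_cols = [col for col in train_X.columns if train_X[col].dtype in ["int64", "float64"]]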

# 3. Preprocess numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# 4. Preprocess categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# 5. Initiate preprocessor for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# 6. Initiate model
model = RandomForestRegressor(n_estimators = 100, random_state = 0)

# 7. Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# 8. Fit model 
my_pipeline.fit(train_X, train_y)

# 9. Predict y
preds = my_pipeline.predict(val_X)

# 10. Evaluate model
score = mean_absolute_error(val_y, preds)

# 11. Generate predictions for the full data X (train_X + val_X);
#     for a real submission, pass the held-out test features here instead
preds_test = my_pipeline.predict(X)
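
The cell above depends on a real CSV file, so here is a minimal end-to-end sketch of the same pipeline structure on a tiny made-up table. The column names `area` and `city`, the values, and the `toy_` names are all assumptions for illustration; the median strategy for the numeric column is just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with one numeric and one categorical column, each with a missing value
toy_X = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0],
    "city": ["A", "B", "A", np.nan, "B", "A"],
})
toy_y = pd.Series([100.0, 160.0, 130.0, 250.0, 125.0, 190.0])

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["area"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])

toy_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
toy_pipeline.fit(toy_X, toy_y)

# New rows may contain missing values or unseen categories ("C");
# the fitted preprocessing steps handle both before the model sees the data
new_rows = pd.DataFrame({"area": [70.0, np.nan], "city": ["C", "A"]})
toy_preds = toy_pipeline.predict(new_rows)
print(toy_preds)
```

Because imputation and encoding live inside the pipeline, they are fitted only on the data passed to `fit`, which avoids repeating the preprocessing by hand for every new dataset.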

<b> Chapter 5: Cross-Validation </b> 

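Point of cross-validation: rotate which chunk (fold) of the data is held out for validation while the rest is used for training, so that every row is used for validation exactly once and the averaged error gives a more holistic measure for evaluating models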
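
To make the fold rotation concrete, the sketch below splits ten made-up sample indices into five folds with scikit-learn's `KFold`. Shuffling is left off here so the folds are contiguous and easy to read; `sample_ids` is a placeholder, not data from this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

sample_ids = np.arange(10)       # stand-in for the rows of a dataset
kfold = KFold(n_splits=5)        # no shuffling: folds are contiguous blocks

folds = []
for fold_number, (train_idx, val_idx) in enumerate(kfold.split(sample_ids)):
    folds.append((train_idx, val_idx))
    print(f"Fold {fold_number}: validate on {val_idx.tolist()}, train on {train_idx.tolist()}")
```

Each row lands in the validation set of exactly one fold, and the model is refitted from scratch for every fold.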

In [None]:
# Libraries
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y, cv = 5, scoring = 'neg_mean_absolute_error')

<b> Chapter 6: XGBoost </b>

Gradient boosting methodology:
1. Start with a single naive model for prediction
2. Calculate the loss (MSE, MAE, etc.) of the current ensemble's predictions
3. Fit a new model that reduces the loss
4. Add the new model to the ensemble
5. Predict with the updated ensemble and repeat steps 2-5
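
The loop above can be sketched by hand for squared-error loss, where "fit a new model to reduce loss" amounts to fitting a small tree to the current residuals. This is a simplified illustration with scikit-learn trees on synthetic data, not how XGBoost is implemented internally; `X_demo`, `y_demo`, `n_rounds` and the tree depth are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Step 1: naive model that always predicts the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
ensemble = []
mse_history = [np.mean((y_demo - prediction) ** 2)]   # Step 2: loss of the current ensemble

for _ in range(n_rounds):
    residuals = y_demo - prediction                    # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                        # Step 3: new model targets the residuals
    ensemble.append(tree)                              # Step 4: add it to the ensemble
    prediction = prediction + learning_rate * tree.predict(X_demo)   # Step 5: updated prediction
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("Training MSE before boosting:", mse_history[0])
print("Training MSE after boosting: ", mse_history[-1])
```

The `learning_rate` factor shrinks each tree's contribution, which is why a smaller learning rate usually needs more rounds.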

In [None]:
# Libraries
from xgboost import XGBRegressor

# Gradient boosting regressor model
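# In XGBoost >= 2.0, early_stopping_rounds is a constructor argument (older versions took it in fit())
model = XGBRegressor(n_estimators = ..., learning_rate = ..., n_jobs = ...,
                     early_stopping_rounds = ...)
model.fit(train_X, train_y,
          eval_set = [(val_X, val_y)], verbose = False)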
predictions = model.predict(val_X)

XGBRegressor parameters:
1. n_estimators: Number of boosting rounds (models in the ensemble); too few underfits, too many overfits; typically 100-1000
2. early_stopping_rounds: Stops training once the validation score has not improved for this many consecutive rounds; typically 5
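3. learning_rate: Weight applied to each new model's predictions before adding it to the ensemble; default 0.1 in older versions of the scikit-learn wrapper (recent XGBoost releases use 0.3)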
4. n_jobs: Number of parallel threads used to build the model; typically set to the number of cores on the machine
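
The stopping rule behind `early_stopping_rounds` can be sketched in plain Python. The helper `stop_round` and the loss values below are hypothetical and only mimic the idea (stop after a set number of rounds without improvement); they are not part of the XGBoost API.

```python
def stop_round(val_losses, early_stopping_rounds):
    """Return (best_round, last_round_run) for a sequence of validation losses."""
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= early_stopping_rounds:
            return best_round, round_idx      # no improvement for too long: stop here
    return best_round, len(val_losses) - 1    # never triggered: all rounds were run

# Hypothetical validation losses, one per boosting round
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.69]
best_round, last_round = stop_round(losses, early_stopping_rounds=3)
print(best_round, last_round)   # 2 5
```

Training stops at round 5 and never reaches the lower loss at round 6, which is the trade-off of a small `early_stopping_rounds`: it saves time but can stop on a temporary plateau.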

<b> Chapter 7: Data Leakage </b>

Data leakage: 