### Missing Data
1. Drop columns with missing values
2. Imputation - fill missing values with some number, e.g. the column mean
3. Imputation + indicator column - impute, and add a new column marking which values were imputed

In [None]:
# Get names of columns with missing values (decided from the training data only)
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

# Drop those columns in both training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

In [None]:
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
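
Approach 3 has no cell of its own in these notes. A minimal sketch of it follows, using small made-up frames (`toy_train`, `toy_valid`) in place of `X_train` / `X_valid`; the `_was_missing` suffix is just an illustrative naming choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for X_train / X_valid
toy_train = pd.DataFrame({"area": [50.0, np.nan, 70.0], "rooms": [2.0, 3.0, np.nan]})
toy_valid = pd.DataFrame({"area": [np.nan, 60.0], "rooms": [1.0, 2.0]})

# Make copies to avoid changing the original data
train_plus = toy_train.copy()
valid_plus = toy_valid.copy()

# Columns to flag are chosen from the training data only
cols_with_missing = [c for c in toy_train.columns if toy_train[c].isnull().any()]

# Add one indicator column per affected column (1 = value was missing)
for col in cols_with_missing:
    train_plus[col + "_was_missing"] = train_plus[col].isnull().astype(int)
    valid_plus[col + "_was_missing"] = valid_plus[col].isnull().astype(int)

# Mean imputation, fitted on training data and reused for validation data
imputer = SimpleImputer()
imputed_train = pd.DataFrame(imputer.fit_transform(train_plus),
                             columns=train_plus.columns, index=train_plus.index)
imputed_valid = pd.DataFrame(imputer.transform(valid_plus),
                             columns=valid_plus.columns, index=valid_plus.index)

print(imputed_train)
```

`SimpleImputer(add_indicator=True)` produces similar indicator columns in one step, at the cost of losing the readable column names.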

### Categorical Variables
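1. Drop the categorical columns
2. Ordinal encoding - map each category to an integer, e.g. Likert responses -> integers (for ordinal variables)
3. One-hot encoding - new columns indicating presence/absence of each value, e.g. color -> one column each for red, green, blue (for nominal variables)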

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE from Approach 2 (Ordinal Encoding):") 
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

# Ordinal encoding fails on categories that appear in X_valid but were never seen in X_train.
# Check that every value in each X_valid column also occurs in the matching X_train column.
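
One way to make that check concrete is sketched below on made-up data: keep only the categorical columns whose validation values are all present in training, and drop the rest. The frames and the names `good_label_cols` / `bad_label_cols` are illustrative, not part of the original cells.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "dirt" appears in validation but never in training
toy_train = pd.DataFrame({"quality": ["low", "high", "mid"],
                          "street": ["paved", "gravel", "paved"]})
toy_valid = pd.DataFrame({"quality": ["mid", "low"],
                          "street": ["paved", "dirt"]})
object_cols = ["quality", "street"]

# Columns that can be safely ordinal encoded
good_label_cols = [c for c in object_cols
                   if set(toy_valid[c]).issubset(set(toy_train[c]))]

# Problematic columns that will be dropped from the dataset
bad_label_cols = sorted(set(object_cols) - set(good_label_cols))

label_train = toy_train.drop(bad_label_cols, axis=1)
label_valid = toy_valid.drop(bad_label_cols, axis=1)

# Encode only the safe columns
encoder = OrdinalEncoder()
label_train[good_label_cols] = encoder.fit_transform(toy_train[good_label_cols])
label_valid[good_label_cols] = encoder.transform(toy_valid[good_label_cols])

print("Encoded:", good_label_cols, "Dropped:", bad_label_cols)
```

An alternative that keeps the column is `OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)`, which maps unseen categories to a sentinel value instead of raising an error.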

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# (`sparse_output` replaced the older `sparse` argument in scikit-learn 1.2)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

### Pipelines
1. Define preprocessing steps
2. Define model
3. Create and evaluate pipeline

This is far easier than the manual preprocessing in the earlier sections.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),  # Impute numerical data
        ('cat', categorical_transformer, categorical_cols)  # Impute and one-hot code categorical data
    ])

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

In [None]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

### Cross Validation
- Run the modeling process on different subsets of the data to get multiple measures of model quality
- Best suited to small datasets, where the extra computation time is not a problem

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50,
                                                              random_state=0))
                             ])

In [None]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)
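
The cells above assume `X` and `y` already exist. As a self-contained sketch, the same steps can be run on synthetic data and the per-fold scores averaged into a single number; the data and the `demo_` names here are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data with about 5% of entries knocked out
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

demo_pipeline = Pipeline(steps=[
    ("preprocessor", SimpleImputer()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# Multiply by -1 since sklearn calculates *negative* MAE
demo_scores = -1 * cross_val_score(demo_pipeline, X_demo, y_demo,
                                   cv=5,
                                   scoring="neg_mean_absolute_error")

# A single summary number for comparing alternative models
average_mae = demo_scores.mean()
print("Average MAE across folds:", average_mae)
```

Because the imputer sits inside the pipeline, it is re-fitted on the training portion of each fold, so the held-out fold never influences the imputed values.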

### XGBoost
- Ensemble - combines the predictions of several models (e.g. several trees in a random forest)
- Gradient boosting - iteratively adds models to an ensemble in cycles

naive model -> (make predictions -> calculate loss -> train new model -> add new model to ensemble -> repeat)
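
The cycle above can be sketched by hand for squared-error loss, where "calculate loss, train new model" reduces to fitting a small tree to the current residuals. This is a simplified illustration on synthetic data, not XGBoost's actual implementation; the variable names and settings are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_estimators = 50

# Naive model: always predict the mean of the target
prediction = np.full_like(y, y.mean())
initial_mae = np.mean(np.abs(y - prediction))

ensemble = []
for _ in range(n_estimators):
    # Residuals are the negative gradient of squared-error loss
    residuals = y - prediction
    # Train a new small model on what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add it to the ensemble, scaled down by the learning rate
    ensemble.append(tree)
    prediction += learning_rate * tree.predict(X)

final_mae = np.mean(np.abs(y - prediction))
print("Training MAE before:", initial_mae, "after:", final_mae)
```

The numbers printed are training error only, so they show the mechanism rather than how well the ensemble generalizes.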

In [None]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

In [None]:
from sklearn.metrics import mean_absolute_error

predictions = my_model.predict(X_valid)
# mean_absolute_error expects (y_true, y_pred)
print("Mean Absolute Error: " + str(mean_absolute_error(y_valid, predictions)))

#### Parameter Tuning
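1. `n_estimators` - how many times to go through the modeling cycle; equal to the number of models in the ensemble
- too low = underfitting
- too high = overfitting
- typically 100-1000

2. `early_stopping_rounds` - stop adding models once the validation score has not improved for this many rounds, which finds a good value for `n_estimators` automatically
- set a high value for `n_estimators` (e.g. 500) and let early stopping decide where to stop
- `early_stopping_rounds=5` is a reasonable choice; it is off unless set, and it needs validation data passed via `eval_set`

3. `learning_rate` - multiply each model's predictions by a small number before adding them in
- each tree added to the ensemble helps less, so `n_estimators` can be set higher without overfitting
- small learning rate + large `n_estimators` = more accurate XGBoost models, at the cost of longer training
- default `learning_rate` = 0.1 (may differ between XGBoost versions; check the documentation for yours)

4. `n_jobs` - parallelism for large datasets
- commonly set to the number of cores on the machine
- doesn't help with small datasets
- doesn't improve the model, only the fit time; use it if fitting takes long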

### Data Leakage
- The training data contains information about the target, but similar data will not be available when the model is used for prediction. This leads to high performance on the training set, but the model performs poorly in production.

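1. **Target Leakage** - predictors include data that will not be available at the time you make predictions
- an X variable depends on the outcome Y, so X is not truly independent
- e.g. in a model predicting pneumonia, whether the patient took antibiotics (which depends on whether they were diagnosed)
- data available before the prediction moment is usable; data created or updated after it is not (Usable -> Prediction Moment -> Unusable)

2. **Train-Test Contamination** - validation data influences preprocessing or model fitting
- set the validation data aside before fitting anything, including preprocessing steps
- do preprocessing inside the pipeline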
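
To illustrate target leakage, the sketch below builds a made-up version of the pneumonia example, in which `took_antibiotic` is generated from the diagnosis itself. Comparing cross-validated accuracy with and without that column shows the typical warning sign: a score that looks too good. All data and column names are synthetic assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400

age = rng.integers(20, 80, size=n)
got_pneumonia = rng.random(n) < 0.3
# Leaky feature: antibiotics are prescribed *after* the diagnosis
took_antibiotic = got_pneumonia & (rng.random(n) < 0.95)

X = pd.DataFrame({"age": age,
                  "took_antibiotic": took_antibiotic.astype(int)})
y = got_pneumonia.astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Suspiciously high accuracy with the leaky column included
leaky_acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# More realistic accuracy once the leaky column is dropped
X_clean = X.drop(columns=["took_antibiotic"])
clean_acc = cross_val_score(model, X_clean, y, cv=5, scoring="accuracy").mean()

print("With leaky column:   ", leaky_acc)
print("Without leaky column:", clean_acc)
```

Cross-validation cannot detect this kind of leak by itself, because the leaky column is present in every fold; it takes knowledge of when each column's value becomes available.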