# **INTERMEDIATE MACHINE LEARNING**

## LESSON 1: ***Missing Values***

----
### ***Three Approaches***
#### 1) A Simple Option: Drop Columns with Missing Values
#### The simplest option is to drop columns with missing values.


#### Unless most values in the dropped columns are missing, the model loses access to a lot of (potentially useful!) information with this approach. As an extreme example, consider a dataset with 10,000 rows, where one important column is missing a single entry. This approach would drop the column entirely!

#### 2) A Better Option: Imputation
#### Imputation fills in the missing values with some number. For instance, we can fill in the mean value along each column.

#### The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.

#### 3) An Extension To Imputation
#### Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.
----
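Before the real example, the three approaches can be sketched on a made-up four-row table. The column names `rooms` and `building_area` and all values are assumptions chosen for illustration; they are not taken from the Melbourne data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing entry in building_area
toy = pd.DataFrame({
    "rooms": [2, 3, 4, 3],
    "building_area": [80.0, np.nan, 150.0, 100.0],
})

# 1) Drop every column that contains at least one missing value
dropped = toy.dropna(axis=1)

# 2) Imputation: fill each missing value with the mean of its column
imputed = toy.fillna(toy.mean())

# 3) Extension: impute, and also record which entries were originally missing
extended = imputed.copy()
extended["building_area_was_missing"] = toy["building_area"].isnull()

print(dropped.columns.tolist())
print(imputed)
print(extended)
```

Here approach 1 throws away `building_area` because of a single missing entry, while approaches 2 and 3 keep the three observed values.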


#### ***Example:***

##### *Define Function to Measure Quality of Each Approach*

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
#load the data
data=pd.read_csv("melb_data.csv")
#select the target
y=data.Price
# to keep things simple, we'll use only numerical predictors
melb_predictors=data.drop(['Price'],axis=1)
X=melb_predictors.select_dtypes(exclude=['object'])
#divide data into training and validation subsets
X_train,X_valid,y_train,y_valid=train_test_split(X,y,train_size=0.8,test_size=0.2,random_state=0)

In [7]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
#function for comparing different approaches
#We define a function score_dataset()
#to compare different approaches to dealing with missing values.
def score_dataset(X_train,X_valid,y_train,y_valid):
    model=RandomForestRegressor(n_estimators=10,random_state=0)
    model.fit(X_train,y_train)
    preds=model.predict(X_valid)
    return mean_absolute_error(y_valid,preds)

##### ***Score from Approach 1 (Drop Columns with Missing Values)***

In [9]:
#Get names of columns with missing values
cols_with_missing=[col for col in X_train.columns if X_train[col].isnull().any()]
#Drop columns in training and validation data
reduced_X_train=X_train.drop(cols_with_missing,axis=1)
reduced_X_valid=X_valid.drop(cols_with_missing,axis=1)
print("MAE from approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train,reduced_X_valid,y_train,y_valid))

MAE from approach 1 (Drop columns with missing values):
183550.22137772635


##### ***Score from Approach 2 (Imputation)***

In [11]:
#Next, we use SimpleImputer to replace missing values with the mean value along each column.
from sklearn.impute import SimpleImputer
#Imputation
my_imputer=SimpleImputer()
imputed_X_train=pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid=pd.DataFrame(my_imputer.transform(X_valid))
#Imputation removed column names;put them back
imputed_X_train.columns=X_train.columns
imputed_X_valid.columns=X_valid.columns
print("MAE from approach 2 (Imputation):")
print(score_dataset(imputed_X_train,imputed_X_valid,y_train,y_valid))

MAE from approach 2 (Imputation):
178166.46269899711


###### *We see that Approach 2 has lower MAE than Approach 1, so Approach 2 performed better on this dataset.*

##### ***Score from Approach 3 (An Extension to Imputation)***

In [14]:
# Make copy to avoid changing original data(When imputing)
X_train_plus=X_train.copy()
X_valid_plus=X_valid.copy()
# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing']=X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing']=X_valid_plus[col].isnull()
#Imputation
my_imputer=SimpleImputer()
imputed_X_train_plus=pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus=pd.DataFrame(my_imputer.transform(X_valid_plus))
#Imputation removed column names; put them back
imputed_X_train_plus.columns=X_train_plus.columns
imputed_X_valid_plus.columns=X_valid_plus.columns
print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus,imputed_X_valid_plus,y_train,y_valid))


MAE from Approach 3 (An Extension to Imputation):
178927.503183954


##### *So, why did imputation perform better than dropping the columns?*

In [16]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)
# Number of missing values in each column of training data
missing_val_count_by_column=(X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column>0])

(10864, 12)
Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64


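##### ***Conclusion:***
###### *The training data has 10,864 rows and 12 columns, three of which contain missing values, and in each of those columns fewer than half of the entries are missing. Dropping them therefore discards a lot of useful information, which is why imputation (Approach 2 and Approach 3) yielded better results than simply dropping columns with missing values (Approach 1), as is common.*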
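As an aside, scikit-learn's `SimpleImputer` accepts an `add_indicator=True` argument that appends the was-missing columns automatically, which may be a shorter way to write Approach 3. The sketch below uses a made-up two-column table (the names `a`, `b` and `a_was_missing` are hypothetical) and was not scored on the Melbourne data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: column "a" has one missing entry, column "b" has none
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, 30.0]})

# Mean imputation plus a 0/1 indicator for each column that had missing values
imputer = SimpleImputer(strategy="mean", add_indicator=True)
out = imputer.fit_transform(toy)

# The indicator columns are appended after the imputed features
result = pd.DataFrame(out, columns=["a", "b", "a_was_missing"], index=toy.index)
print(result)
```

Only columns that contained missing values during `fit` get an indicator, so `b` has none here.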

## LESSON 2: ***Categorical Variables***

###### *A categorical variable takes only a limited number of values.*

### **Three Approaches**

### *1).Drop Categorical Variables.*

###### The easiest approach to dealing with categorical variables is to simply remove them from the dataset. This approach will only work well if the columns did not contain useful information.
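The two alternatives scored later in this lesson are ordinal encoding and one-hot encoding. As a rough sketch of how they differ, the toy column below (a hypothetical `Type` column with made-up values) is encoded both ways: ordinal encoding maps each category to one integer, while one-hot encoding creates one 0/1 column per category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy column with three distinct categories
toy = pd.DataFrame({"Type": ["h", "u", "t", "h"]})

# Ordinal encoding: categories are sorted, then numbered 0, 1, 2, ...
ordinal_encoder = OrdinalEncoder()
ordinal = ordinal_encoder.fit_transform(toy)

# One-hot encoding: one indicator column per category
# (the default output is sparse, so convert it to a dense array for display)
one_hot_encoder = OneHotEncoder()
one_hot = one_hot_encoder.fit_transform(toy).toarray()

print(ordinal_encoder.categories_)
print(ordinal.ravel())
print(one_hot)
```

Ordinal encoding implies an ordering among the categories, which may or may not be meaningful for a given column; one-hot encoding avoids that assumption at the cost of more columns.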

#### *Example:*

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
#read the data
data=pd.read_csv("melb_data.csv")
#Separate target from predictors
y=data.Price
X=data.drop(['Price'],axis=1)
#Divide data into training and validation subsets
X_train_full,X_valid_full,y_train,y_valid=train_test_split(X,y,train_size=0.8,test_size=0.2,random_state=0)
#drop columns with missing values (simplest approach)
cols_with_missing=[col for col in X_train_full.columns if X_train_full[col].isnull().any()]
X_train_full.drop(cols_with_missing,axis=1,inplace=True)
X_valid_full.drop(cols_with_missing,axis=1,inplace=True)
#cardinality means the number of unique values in a column
#select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols=[cname for cname in X_train_full.columns if X_train_full[cname].nunique()<10 and X_train_full[cname].dtype == 'object']
#Select numerical columns
numerical_cols=[cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64','float64']]
#keep selected columns only
my_cols=low_cardinality_cols + numerical_cols
X_train=X_train_full[my_cols].copy()
X_valid =X_valid_full[my_cols].copy()

In [25]:
# taking a look at the top 5 rows
X_train.head()

| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 0.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 193.0 | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 555.0 | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 265.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 673.0 | -37.7623 | 144.8272 | 4217.0 |


In [26]:
# Get a list of categorical variables
s = (X_train.dtypes == 'object')
object_cols=list(s[s].index)
print('Categorical variables:')
print(object_cols)

Categorical variables:
['Type', 'Method', 'Regionname']


#### *Define Function to Measure Quality of Each Approach*

In [28]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
#function for comparing different approaches
def score_dataset(X_train,X_valid,y_train,y_valid):
    model=RandomForestRegressor(n_estimators=100,random_state=0)
    model.fit(X_train,y_train)
    preds=model.predict(X_valid)
    return mean_absolute_error(y_valid,preds)

### *Score from approach 1:* ***Drop Categorical Variables***

In [30]:
drop_X_train=X_train.select_dtypes(exclude=['object'])
drop_X_valid=X_valid.select_dtypes(exclude=['object'])
print("MAE from approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train,drop_X_valid,y_train,y_valid))

MAE from approach 1 (Drop categorical variables):
175703.48185157913


### *Score from approach 2:* ***Ordinal Encoding.***

In [32]:
from sklearn.preprocessing import OrdinalEncoder
# make copy to avoid changing original data
label_X_train=X_train.copy()
label_X_valid=X_valid.copy()
#Apply ordinal encoder to each column with categorical data
ordinal_encoder=OrdinalEncoder()
label_X_train[object_cols]=ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols]=ordinal_encoder.transform(X_valid[object_cols])
print("MAE from Approach 2 (Ordinal Encoding):")
print(score_dataset(label_X_train,label_X_valid,y_train,y_valid))

MAE from Approach 2 (Ordinal Encoding):
165936.40548390493


### *Score from approach 3:* ***One-Hot Encoding***

In [34]:
from sklearn.preprocessing import OneHotEncoder
#apply one-hot encoder to each column with categorical data
OH_encoder=OneHotEncoder(handle_unknown ='ignore', sparse_output=False)
OH_cols_train=pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid=pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
#one-hot encoding removed index;put it back
OH_cols_train.index=X_train.index
OH_cols_valid.index=X_valid.index
#Remove categorical columns (Will replace with one-hot encoding)
num_X_train=X_train.drop(object_cols,axis=1)
num_X_valid=X_valid.drop(object_cols,axis=1)
#Add one-hot encoded columns to  numerical features
OH_X_train=pd.concat([num_X_train,OH_cols_train],axis=1)
OH_X_valid=pd.concat([num_X_valid,OH_cols_valid],axis=1)
#Ensure all column names have string type
OH_X_train.columns=OH_X_train.columns.astype(str)
OH_X_valid.columns=OH_X_valid.columns.astype(str)
print("MAE from Approach 3(One-hot Encoding):")
print(score_dataset(OH_X_train,OH_X_valid,y_train,y_valid))

MAE from Approach 3(One-hot Encoding):
166089.4893009678


## LESSON 3: ***Pipelines***

#### Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

##### ***1. Cleaner Code***: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
##### ***2. Fewer Bugs***: There are fewer opportunities to misapply a step or forget a preprocessing step.
##### ***3. Easier to Productionize***: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
##### ***4. More Options for Model Validation***: You will see an example in the next tutorial, which covers cross-validation.

### *Example:*

In [47]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

In [39]:
# first few rows
X_train.head()

| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 1.0 | 0.0 | NaN | 1940.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 1.0 | 193.0 | NaN | NaN | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 1.0 | 555.0 | NaN | NaN | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 1.0 | 265.0 | NaN | 1995.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 2.0 | 673.0 | 673.0 | 1970.0 | -37.7623 | 144.8272 | 4217.0 |


#### STEP 1: *Define Preprocessing Steps*

In [49]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

#### STEP 2: *Define the Model.*

In [51]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

#### STEP 3: *Create and Evaluate the Pipeline.*

----
##### Finally, we use the **Pipeline** class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to notice:
##### *. With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables!)*
##### *. With the pipeline, we supply the unprocessed features in ***X_valid*** to the ***predict()*** command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the validation data before making predictions.)*
----

In [53]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocess the training data and fit the model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

MAE: 160679.18917034855


## **LESSON 4:** ***Cross-validation***

#### Instead of training your model once on the training data and testing it on a separate test set, cross-validation splits the training data into multiple smaller folds (subsets).
#### The model is trained on some folds and validated on the remaining fold.

#### This process is repeated several times, and the performance scores are averaged.
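To make the fold mechanics concrete, here is a small sketch that uses scikit-learn's `KFold` on ten made-up samples and prints which row indices land in training and validation for each fold. The toy array `X_toy` and the choice of five folds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, identified only by their row index
X_toy = np.arange(10).reshape(-1, 1)

# Five folds: each fold holds out two samples for validation
kf = KFold(n_splits=5)

folds = []
for fold, (train_idx, valid_idx) in enumerate(kf.split(X_toy)):
    folds.append((train_idx, valid_idx))
    print(f"Fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

Every sample is used for validation exactly once and for training in the other four folds, which is why the averaged score uses all of the data.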

### *Example:*

In [69]:
import pandas as pd

# Read the data
data = pd.read_csv('melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

In [71]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50,
                                                              random_state=0))
                             ])


In [73]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)

MAE scores:
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]


In [75]:
print("Average MAE score (across experiments):")
print(scores.mean())

Average MAE score (across experiments):
277707.3795913405


## **LESSON 5:** ***XGBoost***

###### We refer to the random forest method as an "ensemble method". By definition, ensemble methods combine the predictions of several models (e.g., several trees, in the case of random forests).

### ***Gradient Boosting***

##### ***Gradient boosting*** is a method that goes through cycles to iteratively add models into an ensemble.
##### It begins by initializing the ensemble with a single model, whose predictions can be pretty naive. (Even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.)
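The cycle can be sketched by hand for squared-error loss, where each new tree is fit to the residuals of the current ensemble. This is a simplified illustration on synthetic data, not how XGBoost is implemented; the learning rate, tree depth, number of rounds and the data itself are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical): a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.3  # shrinks the contribution of each new tree
n_rounds = 20

# Start with a naive model: predict the mean for every row
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]
trees = []

for _ in range(n_rounds):
    # Residuals are the errors the current ensemble still makes
    residuals = y_toy - prediction
    # Fit a small tree to those errors and add it to the ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print("Training MSE before boosting:", mse_history[0])
print("Training MSE after boosting: ", mse_history[-1])
```

Each round reduces the training error a little, which also hints at why too many rounds can overfit and why a validation set is used to decide when to stop.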

### *Example:*

In [82]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

In [84]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)


In [86]:
from sklearn.metrics import mean_absolute_error

predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(y_valid, predictions)))

Mean Absolute Error: 233153.5562177835


### ***Parameter Tuning***

In [89]:
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)


In [95]:
!pip install --upgrade xgboost





In [101]:
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {"objective": "reg:squarederror", "eval_metric": "rmse"}

model = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    early_stopping_rounds=5
)




[0]	train-rmse:532403.32730	valid-rmse:574083.95170
[1]	train-rmse:473138.89976	valid-rmse:518727.14202
[2]	train-rmse:433047.69722	valid-rmse:482500.37379
[3]	train-rmse:407227.43926	valid-rmse:460746.13057
[4]	train-rmse:389103.85490	valid-rmse:446136.18073
[5]	train-rmse:376451.09658	valid-rmse:438844.30861
[6]	train-rmse:367601.67879	valid-rmse:432483.38377
[7]	train-rmse:360787.41031	valid-rmse:427378.33555
[8]	train-rmse:355667.06754	valid-rmse:423189.72672
[9]	train-rmse:349132.67380	valid-rmse:418478.76289
[10]	train-rmse:344068.76103	valid-rmse:415951.15896
[11]	train-rmse:340431.63380	valid-rmse:414105.61394
[12]	train-rmse:334512.24939	valid-rmse:410240.26369
[13]	train-rmse:333861.85263	valid-rmse:410042.92054
[14]	train-rmse:330686.92939	valid-rmse:407901.82187
[15]	train-rmse:329319.41832	valid-rmse:407450.55909
[16]	train-rmse:324573.02834	valid-rmse:406484.18698
[17]	train-rmse:323940.88238	valid-rmse:406089.63761
[18]	train-rmse:322820.09779	valid-rmse:405463.06461

## **LESSON 6:** ***Data Leakage***

#### *Data leakage (or leakage)* happens when your training data contains information about the target, but similar data will not be available when the model is used for prediction. This leads to high performance on the training set (and possibly even the validation data), but the model will perform poorly in production.

### ***Target Leakage***
----
##### *Target leakage* occurs when your predictors include data that will not be available at the time you make predictions. It is important to think about target leakage in terms of the timing or chronological order that data becomes available, not merely whether a feature helps make good predictions.
----
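The effect can be reproduced on synthetic data. In the sketch below, `card_spend` is a hypothetical feature that only becomes nonzero after a card has been issued, so it would not be available when deciding on a new application. All names, distributions and sizes are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500

# Hypothetical applicants: approval depends on income plus noise
income = rng.normal(3.0, 1.0, size=n)
got_card = (income + rng.normal(0.0, 1.0, size=n)) > 3.0

# Leaky feature: spending on the card, recorded only AFTER approval
card_spend = np.where(got_card, rng.uniform(10, 500, size=n), 0.0)

X_leaky = pd.DataFrame({"income": income, "card_spend": card_spend})
X_clean = X_leaky[["income"]]

model = RandomForestClassifier(n_estimators=50, random_state=0)
acc_leaky = cross_val_score(model, X_leaky, got_card, cv=5, scoring="accuracy").mean()
acc_clean = cross_val_score(model, X_clean, got_card, cv=5, scoring="accuracy").mean()

print("Accuracy with leaky feature:   %.3f" % acc_leaky)
print("Accuracy without leaky feature: %.3f" % acc_clean)
```

The score with `card_spend` looks far better, but only because the feature encodes the outcome; the lower score is the more honest estimate of what the model could do at prediction time.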

### *Example:*

In [113]:
import pandas as pd
#read the data
data=pd.read_csv("AER_credit_card_data.csv",true_values=['yes'],false_values=['no'])
#select target
y=data.card
#select predictors
X=data.drop(['card'],axis=1)
print('Number of rows in the dataset:',X.shape[0])
X.head()

Number of rows in the dataset: 1319


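| | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | True | False | 3 | 54 | 1 | 12 |
| 1 | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | False | False | 3 | 34 | 1 | 13 |
| 2 | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | True | False | 4 | 58 | 1 | 5 |
| 3 | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | False | False | 0 | 25 | 1 | 7 |
| 4 | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | True | False | 2 | 64 | 1 | 5 |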


In [115]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Since there is no preprocessing, we don't need a pipeline (used anyway as best practice!)
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y, 
                            cv=5,
                            scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())

Cross-validation accuracy: 0.981810


In [117]:
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]

print('Fraction of those who did not receive a card and had no expenditures: %.2f' \
      %((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f' \
      %(( expenditures_cardholders == 0).mean()))


Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.02


In [119]:
# Drop leaky predictors from dataset
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y, 
                            cv=5,
                            scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())

Cross-val accuracy: 0.831694
