# Hypothesis Testing

In this optional notebook, I test a hypothesis about the missing values in Mileage. The null hypothesis is that Mileage is Missing Not At Random (MNAR); the alternative is that it is Missing At Random (MAR). If the data support MAR, I will simply drop the rows where Mileage is NA.  
I also use the H2O library, which handles categorical columns natively; scikit-learn, by contrast, requires them to be encoded numerically first.
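
As a complementary diagnostic (not part of this notebook's pipeline), one can check whether the rate of missing Mileage differs across the levels of an observed column. The sketch below uses made-up toy data and a hypothetical helper, `missingness_by_group`. Clearly different rates suggest that missingness depends on observed variables; the check cannot by itself confirm or rule out MNAR, because that depends on the unobserved values.

```python
import numpy as np
import pandas as pd


def missingness_by_group(df, target, by):
    """Share of missing `target` values within each level of `by`."""
    indicator = df[target].isna()
    return indicator.groupby(df[by]).mean()


# Toy data, made up for illustration (not the car dataset).
toy = pd.DataFrame({
    "Fuel": ["gas", "gas", "diesel", "diesel", "diesel", "gas"],
    "Mileage": [120000, np.nan, 80000, np.nan, np.nan, 95000],
})

rates = missingness_by_group(toy, "Mileage", "Fuel")
print(rates)
```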

In [17]:
import set_jupyter_path

In [29]:
import pandas as pd
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init(nthreads=-1, max_mem_size=4)

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O cluster uptime: 2 days 3 hours 38 mins
H2O cluster timezone: Asia/Bishkek
H2O data parsing timezone: UTC
H2O cluster version: 3.22.0.2
H2O cluster version age: 20 days
H2O cluster name: H2O_from_python_danberd_9jeqmn
H2O cluster total nodes: 1
H2O cluster free memory: 2.661 Gb
H2O cluster total cores: 12
H2O cluster allowed cores: 12


In [19]:
from src.car_price_prediction.data_cleaning import data_cleaner

In [20]:
from src.car_price_prediction.data_cleaning import processed_data_maker

In [21]:
data = pd.read_excel('../data/raw/cars_raw_data.xlsx')

In [22]:
pro_data = processed_data_maker.get_processed_data((data_cleaner.get_clean_data(data)))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df.loc[:, df.columns != 'Mileage'].dropna(inplace=True)


In [138]:
pro_data.Publication = pd.to_datetime(pro_data.Publication)

In [139]:
import datetime
hard_date = datetime.date(2017, 1, 1)

In [140]:
pro_data.Publication = pro_data.Publication.apply(lambda x: (x.date() - hard_date).days)

In [23]:
pro_data.isna().sum()

Expiration      0
Year            0
Publication     0
Transmission    0
Brand           0
Model           0
Capacity        0
Drive           0
Mileage         0
Wheel           0
Carcass         0
Fuel            0
Color           0
Price           0
dtype: int64

In [24]:
pro_data_dropped = pro_data.dropna()

In [142]:
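from sklearn.ensemble import RandomForestRegressor


def knn_impute_mileage(df):
    """Fill missing Mileage values in place with model predictions.

    Despite the name, the model is a random forest, not k-NN.
    """
    # Predictors: every column except Mileage and the last column (Price).
    cols = df.loc[:, df.columns != 'Mileage'].columns.delete(-1)
    print(cols)
    X_train, X_test, y_train, y_test = get_train_test(
        df, cols, 'Mileage')
    y_pred = get_y_pred(X_train, X_test, y_train, y_test)
    # Reuse the index of the missing rows so the assignment aligns by label.
    y_pred = pd.Series(y_pred, index=y_test.index, name=y_test.name)
    df.loc[df.Mileage.isnull(), 'Mileage'] = y_pred
    return df


def get_train_test(df, df_columns, target):
    """Build one-hot encoded train/test sets with identical columns."""
    X_train, y_train = get_train(df, df_columns, target)
    X_test, y_test = get_test(df, df_columns, target)
    # Categories seen only in training get an all-zero column in the test set.
    missing_cols = set(X_train.columns) - set(X_test.columns)
    for c in missing_cols:
        X_test[c] = 0
    # Drop test-only columns and match the training column order.
    X_test = X_test[X_train.columns]
    return X_train, X_test, y_train, y_test


def get_train(df, data_columns, target):
    """Training set: the complete rows (no missing values)."""
    train_data = df.dropna()
    X_train, y_train = train_data[data_columns], train_data[target]
    X_train = pd.get_dummies(X_train)
    return X_train, y_train


def get_test(df, data_columns, target):
    """Prediction set: the rows where the target is missing."""
    test_data = df[df[target].isnull()]
    X_test, y_test = test_data[data_columns], test_data[target]
    X_test = pd.get_dummies(X_test)
    return X_test, y_test


def get_y_pred(X_train, X_test, y_train, y_test):
    """Fit a random forest on the complete rows and predict the missing ones."""
    forest = RandomForestRegressor(n_estimators=30)
    forest.fit(X_train, y_train)
    y_pred = forest.predict(X_test)
    return y_pred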
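
The column-alignment step in `get_train_test` (add the dummy columns missing from the test set, then reorder to match training) can also be written as a single `reindex` call. This is a minimal sketch on made-up frames, not a replacement for the functions above; the `train` and `test` frames and their values are assumptions for illustration.

```python
import pandas as pd

# Made-up frames for illustration (not the car dataset).
train = pd.DataFrame({"Fuel": ["gas", "diesel", "gas"],
                      "Year": [2010, 2012, 2015]})
test = pd.DataFrame({"Fuel": ["gas", "electric"],
                     "Year": [2011, 2018]})

X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test)

# In one step: add columns seen only in training (filled with 0), drop
# columns seen only in the test rows, and match the training column order.
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)
print(list(X_test_aligned.columns))
```

Note that a category present only in the rows to be predicted (here `electric`) is silently dropped, in the loop-based version as well as in this one.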


In [143]:
pro_data_imputed = knn_impute_mileage(pro_data)

Index(['Expiration', 'Year', 'Publication', 'Transmission', 'Brand', 'Model',
       'Capacity', 'Drive', 'Wheel', 'Carcass', 'Fuel', 'Color'],
      dtype='object')


In [25]:
drop = h2o.H2OFrame(pro_data_dropped)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [145]:
impute = h2o.H2OFrame(pro_data_imputed)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [30]:
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [31]:
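# The ratios must sum to less than 1, so split_frame returns three frames;
# the tiny third one (about 0.01% of the rows) is not used.
splits = drop.split_frame(ratios=[0.8, 0.1999], seed=10012)
drop_train = splits[0]
drop_test = splits[1]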

In [None]:
splits = impute.split_frame(ratios=[0.8, 0.1999], seed = 123)
impute_train = splits[0]
impute_test = splits[1]

In [32]:
y = 'Price'
x = list(pro_data.columns)
x.remove('Price')

In [33]:
rf_fit1 = H2ORandomForestEstimator(model_id='rf_fit1', ntrees = 100, seed = 1231)

In [34]:
rf_fit1.train(x=x, y=y, training_frame=drop_train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [None]:
rf_fit2 = H2ORandomForestEstimator(model_id='rf_fit2', ntrees = 100, seed =34)
rf_fit2.train(x=x, y=y, training_frame=impute_train)

drf Model Build progress: |██████████████████████████████████████████

In [36]:
rf_perf1 = rf_fit1.model_performance(drop_test)

In [None]:
rf_perf2 = rf_fit2.model_performance(impute_test)

In [37]:
rf_perf1


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 17442445.795701068
RMSE: 4176.41542422459
MAE: 1669.4881179334723
RMSLE: 0.20077436513314742
Mean Residual Deviance: 17442445.795701068




In [None]:
rf_perf2