Feature engineering is very important in most machine learning problems.

The following code used continuous & categorical features to encode some 'important' features, especially manager_id. The code was partially copied from Little Boat: https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/discussion/32123
The following algorithms and public log loss for the test dataset are achieved:

- random forests: 0.62909
- xgboost: 0.60205
- logistic regression: 1.52583
- Gaussian Naive Bayes: 3.49425

In [22]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

In [23]:
train_df = pd.read_json("train.json")
test_df = pd.read_json("test.json")
sub = pd.DataFrame()
sub["listing_id"] = test_df["listing_id"]

### Feature Engineering

#### Define get_stats function
- It first merge train_df and test_df, followed by grouping the dataframe by group_column, then calculating the count, mean, std, median, max, min of the target_column feature.
- It returns the train and test df with the newly added columns as numpy array (selected_train, selected_test).

In [24]:
def get_stats(train_df, test_df, target_column, group_column = 'manager_id'):
    '''
    target_column: numeric columns to group with (e.g. price, bedrooms, bathrooms)
    group_column: categorical columns to group on (e.g. manager_id, building_id)
    '''
    train_df['row_id'] = range(train_df.shape[0])
    test_df['row_id'] = range(test_df.shape[0])
    train_df['train'] = 1
    test_df['train'] = 0
    all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train', target_column, group_column]])
    all_df = all_df.reindex()
    grouped = all_df[[target_column, group_column]].groupby(group_column)
    
    the_size = pd.DataFrame(grouped.size()).reset_index()
    the_size.columns = [group_column, '%s_size' % target_column]
    
    the_mean = pd.DataFrame(grouped.mean()).reset_index()
    the_mean.columns = [group_column, '%s_mean' % target_column]
    
    the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0)
    the_std.columns = [group_column, '%s_std' % target_column]
    
    the_median = pd.DataFrame(grouped.median()).reset_index()
    the_median.columns = [group_column, '%s_median' % target_column]
    
    the_stats = pd.merge(the_size, the_mean)
    the_stats = pd.merge(the_stats, the_std)
    the_stats = pd.merge(the_stats, the_median)

    the_max = pd.DataFrame(grouped.max()).reset_index()
    the_max.columns = [group_column, '%s_max' % target_column]
    
    the_min = pd.DataFrame(grouped.min()).reset_index()
    the_min.columns = [group_column, '%s_min' % target_column]

    the_stats = pd.merge(the_stats, the_max)
    the_stats = pd.merge(the_stats, the_min)

    all_df = pd.merge(all_df, the_stats)

    selected_train = all_df[all_df['train'] == 1]
    selected_test = all_df[all_df['train'] == 0]
    
    selected_train.sort_values('row_id', inplace=True)
    selected_test.sort_values('row_id', inplace=True)
    
    selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)
    selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)

    return np.array(selected_train), np.array(selected_test)

#### Use the get_stats function

The following code set group_column = 'manager_id', scan target_id = 'bathrooms', 'bedrooms', 'latitude', 'longitude', 'price', and return train_stack_list and test_stack_list both with dimension of (6, 49352, 6).

In [25]:
# target_column
selected_manager_id_proj = ['bathrooms', 'bedrooms', 'latitude', 'longitude', 'price']

train_stack_list = []
test_stack_list = []

# group column = 'manager_id'
for target_col in selected_manager_id_proj:
    tmp_train, tmp_test = get_stats(train_df, test_df, target_column=target_col)
    # dimension of tmp_train is (49352, 6)
    train_stack_list.append(tmp_train)
    test_stack_list.append(tmp_test)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


The following code set group_column = 'bedrooms', scan target_id = 'price', 'listing_id'.

In [26]:
selected_bedrooms_proj = ['price', 'listing_id']

for target_col in selected_bedrooms_proj:
    tmp_train, tmp_test = get_stats(train_df, test_df, target_column=target_col, group_column='bedrooms')
    train_stack_list.append(tmp_train)
    test_stack_list.append(tmp_test)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [27]:
np.shape(train_stack_list)

(7, 49352, 6)

In [28]:
np.shape(test_stack_list)

(7, 74659, 6)

In [29]:
train_df = train_df.set_index([[i for i in range(len(train_df))]])
test_df = test_df.set_index([[i for i in range(len(test_df))]])

for i in range(7):
    train_add = pd.DataFrame(train_stack_list[i],  columns = ['size_{0}'.format(i), 'mean_{0}'.format(i), 'std_{0}'.format(i),
                                                              'median_{0}'.format(i), 'max_{0}'.format(i), 'min_{0}'.format(i)])
    test_add = pd.DataFrame(test_stack_list[i],  columns = ['size_{0}'.format(i), 'mean_{0}'.format(i), 'std_{0}'.format(i), 
                                                            'median_{0}'.format(i), 'max_{0}'.format(i), 'min_{0}'.format(i)])
    train_add = train_add.set_index([[i for i in range(len(train_add))]])
    test_add = test_add.set_index([[i for i in range(len(test_add))]])
    train_df = pd.concat([train_df, train_add], axis = 1)
    test_df = pd.concat([test_df, test_add], axis = 1)

In [30]:
def featureE(df):
    df["num_photos"] = df["photos"].apply(len)
    df["num_features"] = df["features"].apply(len)
    df["num_description_words"] = df["description"].apply(lambda x: len(x.split(" ")))
    df["created"] = pd.to_datetime(df["created"])
    df["created_year"] = df["created"].dt.year
    df["created_month"] = df["created"].dt.month
    df["created_day"] = df["created"].dt.day
    df = df.drop(['photos', 'features', 'description', 'listing_id', 'created', 'building_id', 'manager_id', 'display_address', 'street_address'], axis=1)
    return df

In [31]:
train_df = featureE(train_df)
test_df = featureE(test_df)
y = train_df['interest_level']
train_df = train_df.drop(['interest_level', 'row_id', 'train', 'created_year'], axis = 1)
test_df = test_df.drop(['row_id', 'train', 'created_year'], axis = 1)

In [32]:
x = train_df

### Random Forest Algorithm

Training model

In [33]:
validation_size = 0.30
seed = 2018
X_train, X_validation, Y_train, Y_validation = train_test_split(x, y, test_size = validation_size, random_state = seed)

In [34]:
clf = RandomForestClassifier(n_estimators=1000)
clf.fit(X_train, Y_train)
y_val_pred = clf.predict_proba(X_validation)
log_loss(Y_validation, y_val_pred)

0.64070518969863688

Make predictions

In [35]:
y = clf.predict_proba(test_df)

In [36]:
clf.classes_

array(['high', 'low', 'medium'], dtype=object)

In [37]:
labels2idx = {label: i for i, label in enumerate(clf.classes_)}

In [38]:
labels2idx

{'high': 0, 'low': 1, 'medium': 2}

In [39]:
for label in ["high", "low", "medium"]:
    sub[label] = y[:, labels2idx[label]]
sub.to_csv("ml_3_rf.csv", index=False)

Public score: 0.62909
Private score: 0.62976

### XGBoost

In [40]:
from xgboost import XGBClassifier
from xgboost import plot_importance
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

validation_size = 0.30
seed = 2018
X_train, X_validation, Y_train, Y_validation = train_test_split(x, y, test_size = validation_size, random_state = seed)

model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate = learning_rate)
kfold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 2018)
grid_search = GridSearchCV(model, param_grid, scoring = 'neg_log_loss', n_jobs = 1, cv = kfold)
result = grid_search.fit(X_train, Y_train)

# summarize results
print("BestL %f using %s" % (- result.best_score_, result.best_params_))
means, stdevs = [], []
for params, mean_score, scores in result.grid_scores_:
    stdev = scores.std()
    means.append(- mean_score)
    stdevs.append(stdev)
    print("%f (%f) with: %r" %(- mean_score, stdev, params))    


ValueError: Found input variables with inconsistent numbers of samples: [49352, 74659]

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

model = XGBClassifier(learning_rate = 0.3)
model.fit(X_train, Y_train)
print(model.feature_importances_)

plot_importance(model)
plt.show()

In [15]:
test_df = test_df[train_df.columns]

In [None]:
y = model.predict_proba(test_df)

In [18]:
labels2idx = {label: i for i, label in enumerate(model.classes_)}

In [19]:
for label in ["high", "medium", "low"]:
    sub[label] = y[:, labels2idx[label]]
sub.to_csv("ml_3_xgb.csv", index = False)

Private score: 0.60389
Public score: 0.60205

### Logistic Regression

In [375]:
from sklearn.linear_model import LogisticRegression

validation_size = 0.30
seed = 2018
y = y.map({'high':0, 'medium':1, 'low':2})
X_train, X_validation, Y_train, Y_validation = train_test_split(x, y, test_size = validation_size, random_state = seed)
clf = LogisticRegression(tol = 1e-8, fit_intercept = True, random_state = 5, max_iter = 1000)
clf.fit(X_train, Y_train)
y_val_pred = clf.predict_proba(X_validation)
log_loss(Y_validation, y_val_pred)

  np.exp(prob, prob)


0.71912043179077489

In [381]:
y_pred = clf.predict_proba(test_df.values)
labels2idx = {label: i for i, label in enumerate(clf.classes_)}
labels2idx = {'high': 0, 'low': 1, 'medium': 2}
for label in ["high", "medium", "low"]:
    sub[label] = y_pred[:, labels2idx[label]]
sub.to_csv("ml_3_lr.csv", index = False)

  np.exp(prob, prob)


Public score: 1.52583
Private score: 1.52764

Logistic regression performs really bad here.

### Gaussian Naive Bayes

In [400]:
from sklearn.naive_bayes import GaussianNB

train_df = pd.read_json("train.json")
y = train_df['interest_level']

validation_size = 0.30
seed = 2018
y = y.map({'high':0, 'medium':1, 'low':2})
X_train, X_validation, Y_train, Y_validation = train_test_split(x, y, test_size = validation_size, random_state = seed)
clf = GaussianNB(priors = None)
clf.fit(X_train, Y_train)

GaussianNB(priors=None)

In [401]:
y_pred = clf.predict_proba(test_df.values)
labels2idx = {label: i for i, label in enumerate(clf.classes_)}
labels2idx = {'high': 0, 'low': 1, 'medium': 2}
for label in ["high", "medium", "low"]:
    sub[label] = y_pred[:, labels2idx[label]]
sub.to_csv("ml_3_nb.csv", index = False)

Private score: 3.57894
Public score: 3.49425

Gaussian Naive Bayes performs bad here. Probably the naive independent assumption doesn't hold for the features here.