This notebook engineers features from the cleaned data. The feature engineering includes:
- Naive feature engineering: sums, differences and counts of some features (the per-room price averages are commented out)
- The get_stats function from Little Boat: https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/discussion/32123

### Import Data

In [13]:
%matplotlib inline
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [14]:
train = pd.read_json('Datacleaned_train.json')
test = pd.read_json('test.json')

### 1. Naive Feature Engineering

In [15]:
def naiveFE(df):
    '''Apply naive feature engineering to the train or test data frame.
    The frame is modified in place and also returned.
    '''
    # total number of rooms, and bedrooms minus bathrooms
    df["sum_room"] = df["bedrooms"] + df["bathrooms"]
    df["room_diff"] = df["bedrooms"] - df["bathrooms"]
    
    # average price per room (disabled: gives NaN/inf when the room count is 0)
    #df["price_s"] = df["price"]/df["sum_room"]
    #df["price_bed"] = df["price"]/df["bedrooms"]
    #df["price_bath"] = df["price"]/df["bathrooms"]
    
    # sum of bedrooms and bathrooms (same values as sum_room above)
    df["room_sum"] = df["bedrooms"] + df["bathrooms"]
    
    # number of photos
    df["num_photos"] = df["photos"].apply(len)
    
    # number features
    df["num_features"] = df["features"].apply(len)
    
    # count of words present in description column
    df["num_description_words"] = df["description"].apply(lambda x: len(x.split(" ")))
    
    # creation time: the year is always 2016, so only month and day are kept
    df["created"] = pd.to_datetime(df["created"])
    df["created_month"] = df["created"].dt.month
    df["created_day"] = df["created"].dt.day
    
    return df

In [16]:
train_df = naiveFE(train)
test_df = naiveFE(test)
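
The per-room price features are commented out in naiveFE. As a minimal sketch of one way to compute such a feature without dividing by zero, the example below replaces a room count of 0 with NaN before dividing. The toy data frame and the column name price_per_room are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy listings (hypothetical values); the second one has no rooms at all
toy = pd.DataFrame({
    "price": [3000, 2000, 1500],
    "bedrooms": [2, 0, 1],
    "bathrooms": [1.0, 0.0, 1.0],
})

# Replace a zero room count with NaN so the division yields NaN instead of inf
rooms = (toy["bedrooms"] + toy["bathrooms"]).replace(0, np.nan)
toy["price_per_room"] = toy["price"] / rooms

print(toy)
```

Whether NaN is acceptable downstream depends on the model: tree-based libraries often handle it directly, while others need an explicit fill value.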

### get_stats function

#### Define get_stats function
It first concatenates train_df and test_df, groups the combined frame by group_column (typically manager_id), and computes the size, mean, std, median, max and min of target_column within each group. It returns only the six new statistic columns, as two numpy arrays (selected_train, selected_test) whose rows follow the original row order of train_df and test_df.

The following code was partially copied from Little Boat: https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/discussion/32123

In [17]:
def get_stats(train_df, test_df, target_column, group_column = 'manager_id'):
    '''
    target_column: numeric column to aggregate (e.g. price, bedrooms, bathrooms)
    group_column: categorical column to group by (e.g. manager_id, building_id)
    '''
    train_df['row_id'] = range(train_df.shape[0])
    test_df['row_id'] = range(test_df.shape[0])
    train_df['train'] = 1
    test_df['train'] = 0
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    cols = ['row_id', 'train', target_column, group_column]
    all_df = pd.concat([train_df[cols], test_df[cols]], ignore_index=True)
    grouped = all_df[[target_column, group_column]].groupby(group_column)
    
    the_size = pd.DataFrame(grouped.size()).reset_index()
    the_size.columns = [group_column, '%s_size' % target_column]
    
    the_mean = pd.DataFrame(grouped.mean()).reset_index()
    the_mean.columns = [group_column, '%s_mean' % target_column]
    
    the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0)
    the_std.columns = [group_column, '%s_std' % target_column]
    
    the_median = pd.DataFrame(grouped.median()).reset_index()
    the_median.columns = [group_column, '%s_median' % target_column]
    
    the_stats = pd.merge(the_size, the_mean)
    the_stats = pd.merge(the_stats, the_std)
    the_stats = pd.merge(the_stats, the_median)

    the_max = pd.DataFrame(grouped.max()).reset_index()
    the_max.columns = [group_column, '%s_max' % target_column]
    
    the_min = pd.DataFrame(grouped.min()).reset_index()
    the_min.columns = [group_column, '%s_min' % target_column]

    the_stats = pd.merge(the_stats, the_max)
    the_stats = pd.merge(the_stats, the_min)

    all_df = pd.merge(all_df, the_stats)

    selected_train = all_df[all_df['train'] == 1]
    selected_test = all_df[all_df['train'] == 0]
    
    selected_train.sort_values('row_id', inplace=True)
    selected_test.sort_values('row_id', inplace=True)
    
    selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)
    selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)

    return np.array(selected_train), np.array(selected_test)
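
The same per-group statistics can also be computed more compactly with a single groupby(...).agg(...) call followed by a left merge. The sketch below is an alternative under stated assumptions, not the function from the original discussion: the name group_stats and the toy data are hypothetical, and it returns data frames rather than numpy arrays.

```python
import pandas as pd

def group_stats(train_df, test_df, target_column, group_column="manager_id"):
    """Per-group statistics of target_column, computed over train and test together."""
    cols = [group_column, target_column]
    all_df = pd.concat([train_df[cols], test_df[cols]], ignore_index=True)

    # One pass over the groups for all six statistics
    stats = all_df.groupby(group_column)[target_column].agg(
        ["size", "mean", "std", "median", "max", "min"]
    )
    stats["std"] = stats["std"].fillna(0)  # groups with a single row have no std
    stats.columns = ["%s_%s" % (target_column, name) for name in stats.columns]
    stats = stats.reset_index()

    # A left merge keeps the row order of the left frame, so no row_id is needed
    train_out = train_df[[group_column]].merge(stats, on=group_column, how="left")
    test_out = test_df[[group_column]].merge(stats, on=group_column, how="left")
    return train_out.drop(columns=group_column), test_out.drop(columns=group_column)

# Toy data (hypothetical values)
toy_train = pd.DataFrame({"manager_id": ["a", "a", "b"], "price": [1000, 3000, 2000]})
toy_test = pd.DataFrame({"manager_id": ["a", "c"], "price": [2000, 500]})

train_stats, test_stats = group_stats(toy_train, toy_test, "price")
print(train_stats)
print(test_stats)
```

Unlike get_stats, this sketch does not add row_id or train columns to the input frames, so the later cells that merge on row_id would need to be adapted if it were used instead.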

#### Use the get_stats function
The following code loops over every group_column ('manager_id' here; 'building_id' could be added) and every target_column ('bathrooms', 'bedrooms', 'latitude', 'longitude', 'price'), calls get_stats for each pair, and collects the resulting arrays and their column names in lists.

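Note: The SettingWithCopyWarning tells users that they may be operating on a copy of a DataFrame slice rather than on the original, as they might expect. Here it is triggered by the in-place sort_values and drop calls on the filtered frames inside get_stats and does not affect the returned arrays.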
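
As a minimal sketch of how this kind of warning is usually avoided, the example below takes an explicit copy of the filtered rows before modifying them in place. The toy frame is an assumption for illustration; the column names mirror the ones used in get_stats.

```python
import pandas as pd

# Toy frame (hypothetical values) with the helper columns used in get_stats
df = pd.DataFrame({
    "train": [1, 0, 1],
    "row_id": [1, 0, 0],
    "price_mean": [10.0, 20.0, 30.0],
})

# .copy() makes the subset an independent frame, so in-place operations
# on it are unambiguous and pandas has nothing to warn about
subset = df[df["train"] == 1].copy()
subset.sort_values("row_id", inplace=True)
subset.drop(["train", "row_id"], axis=1, inplace=True)

print(subset)
```

The original frame df is left untouched; only the copy is sorted and trimmed.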

In [18]:
target_column = ['bathrooms', 'bedrooms', 'latitude', 'longitude', 'price']
group_column = ['manager_id']

train_stack_list = []
test_stack_list = []
column_name_list = []

for target_col in target_column:
    for group_col in group_column:
        tmp_train, tmp_test = get_stats(train_df, test_df, target_column = target_col, group_column = group_col)
        tmp_name = target_col + '_' + group_col
        tmp_name_list = [tmp_name + '_count', tmp_name + '_mean', tmp_name + '_std', tmp_name + '_median', tmp_name + '_max', tmp_name + '_min']
        train_stack_list.append(tmp_train)
        test_stack_list.append(tmp_test)
        column_name_list.append(tmp_name_list)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


#### Add engineered statistics into original train_df and test_df

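With the settings above, train_stack_list and test_stack_list each hold 5 arrays (one per target/group pair) with 6 columns each. The train arrays have 49121 rows and the test arrays 74659 rows, matching the info() output below.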

In [19]:
for i in range(len(train_stack_list)):
    stat = pd.DataFrame(train_stack_list[i], columns = column_name_list[i])
    stat['row_id'] = range(stat.shape[0])
    train_df = pd.merge(train_df, stat)

for i in range(len(test_stack_list)):
    stat = pd.DataFrame(test_stack_list[i], columns = column_name_list[i])
    stat['row_id'] = range(stat.shape[0])
    test_df = pd.merge(test_df, stat)
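
Because get_stats returns its rows in the original row order, the arrays can also be attached in one step instead of one merge per array. The sketch below shows the idea on toy data: it stacks the arrays side by side with np.hstack and concatenates the result to the base frame. The variable names and values are hypothetical, and the approach assumes every array has the same row order as the base frame.

```python
import numpy as np
import pandas as pd

# Toy base frame and two statistic arrays (hypothetical values)
base = pd.DataFrame({"price": [1000, 3000, 2000]})
stack_list = [
    np.array([[2.0, 2000.0], [2.0, 2000.0], [1.0, 2000.0]]),
    np.array([[2.0, 1.5], [2.0, 1.5], [1.0, 1.0]]),
]
name_list = [
    ["price_manager_id_count", "price_manager_id_mean"],
    ["bedrooms_manager_id_count", "bedrooms_manager_id_mean"],
]

# Flatten the nested name lists to match the horizontally stacked columns
flat_names = [name for names in name_list for name in names]
stats = pd.DataFrame(np.hstack(stack_list), columns=flat_names, index=base.index)

# Reusing base.index keeps the rows aligned during the column-wise concat
combined = pd.concat([base, stats], axis=1)
print(combined)
```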

### Prepare data for ML & Export data

In [20]:
train_df.drop(['created', 'building_id', 'manager_id', 'description', 'row_id', 'display_address', 'features', 'photos', 
               'street_address', 'train', 'listing_id'], axis = 1, inplace = True)

test_df.drop(['building_id', 'created', 'description', 'display_address', 'features', 'listing_id', 
               'manager_id', 'photos', 'street_address', 'row_id', 'train'], axis = 1, inplace = True)

In [21]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49121 entries, 0 to 49120
Data columns (total 44 columns):
bathrooms                      49121 non-null float64
bedrooms                       49121 non-null int64
interest_level                 49121 non-null object
latitude                       49121 non-null float64
longitude                      49121 non-null float64
price                          49121 non-null int64
sum_room                       49121 non-null float64
room_diff                      49121 non-null float64
room_sum                       49121 non-null float64
num_photos                     49121 non-null int64
num_features                   49121 non-null int64
num_description_words          49121 non-null int64
created_month                  49121 non-null int64
created_day                    49121 non-null int64
bathrooms_manager_id_count     49121 non-null float64
bathrooms_manager_id_mean      49121 non-null float64
bathrooms_manager_id_std       49121 non-n

In [22]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 74659 entries, 0 to 74658
Data columns (total 43 columns):
bathrooms                      74659 non-null float64
bedrooms                       74659 non-null int64
latitude                       74659 non-null float64
longitude                      74659 non-null float64
price                          74659 non-null int64
sum_room                       74659 non-null float64
room_diff                      74659 non-null float64
room_sum                       74659 non-null float64
num_photos                     74659 non-null int64
num_features                   74659 non-null int64
num_description_words          74659 non-null int64
created_month                  74659 non-null int64
created_day                    74659 non-null int64
bathrooms_manager_id_count     74659 non-null float64
bathrooms_manager_id_mean      74659 non-null float64
bathrooms_manager_id_std       74659 non-null float64
bathrooms_manager_id_median    74659 non-

In [23]:
train_df.to_json('Datacleaned_FE4_train.json')
test_df.to_json('FE4_test.json')