# Introduction/Data Understanding

In this competition, you will predict how popular an apartment rental listing is based on the listing content like text description, photos, number of bedrooms, price, etc. The data comes from renthop.com, an apartment listing website. These apartments are located in New York City.

The target variable, **interest_level**, is defined by the number of inquiries a listing has in the duration that the listing was live on the site. 

**File descriptions**

train.json - the training set

test.json - the test set

sample_submission.csv - a sample submission file in the correct format

images_sample.zip - listing images organized by listing_id (a sample of 100 listings)

Kaggle-renthop.7z - (optional) listing images organized by listing_id. Total size: 78.5GB compressed. Distributed by BitTorrent (Kaggle-renthop.torrent). 

**Data fields**

bathrooms: number of bathrooms

bedrooms: number of bedrooms

building_id

**created**

**description**

**display_address**

**features:** a list of features about this apartment

latitude

listing_id

longitude

manager_id

**photos:** a list of photo links. You are welcome to download the pictures yourselves from renthop's site, but they are the same as imgs.zip.

price: in USD

street_address

interest_level: this is the target variable. It has 3 categories: 'high', 'medium', 'low'


**Some thoughts on features + data construction**

I've bolded the fields which I think will separate the really good classifiers from the okay ones. I also want to do some research into the process of searching RentHop to see what might implicitly be important.

1. created: people may be more interested when they see that a posting is recent (neophilia bias)
2. description: more qualitative facts; do some textual analysis
3. display_address: are people obsessed with location? Some ritzy streets could draw interest
4. features: 
5. photos: I'm first thinking of some basic heuristics, e.g. what is the resolution of the images? Then move on to more algorithmic approaches, such as sentiment analysis on the images. At a high level, what makes people interested or disinterested in a photo?
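
As a rough illustration of the cheapest version of these ideas, the sketch below derives simple numeric signals from the bolded fields. It runs on a made-up two-row frame that only borrows the column names from train.json; the output column names (`created_hour`, `description_words`, `num_features`, `num_photos`) are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy listings that reuse the train.json column names; all values are made up.
demo = pd.DataFrame({
    'created': ['2016-06-24 07:54:24', '2016-06-12 12:19:27'],
    'description': ['Sunny renovated 2BR near the park', ''],
    'features': [['Doorman', 'Elevator', 'Dishwasher'], []],
    'photos': [['a.jpg', 'b.jpg'], []],
})

created = pd.to_datetime(demo['created'])

# Hypothetical engineered columns: one cheap numeric signal per bolded field.
engineered = pd.DataFrame({
    'created_hour': created.dt.hour,                                  # time of day the listing went live
    'description_words': demo['description'].str.split().str.len(),  # length of the free text
    'num_features': demo['features'].apply(len),                      # how many amenities are listed
    'num_photos': demo['photos'].apply(len),                          # how many photos are attached
}, index=demo.index)

print(engineered)
```

Counts like these are only a starting point; they say nothing about what the text or the photos actually contain.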


# Data Ingestion


In [59]:
%matplotlib inline
import numpy as np 
import pandas as pd 
import matplotlib
import matplotlib.pyplot as plt

UNWANTED_FEATURES = ['building_id', 'features', 'manager_id', 
                     'street_address', 'description', 'display_address',
                     ]
X_train = pd.read_json("../input/train.json").set_index('listing_id')
Y_train = X_train['interest_level']
X_train = X_train.drop('interest_level', axis=1)


After a little exploratory analysis, we see that we have 49352 training examples, each with 14 features. Before I decide on a model/approach, I'm going to explore the data some more. Below, we examine how many photos each listing has.

In [None]:
print(list(X_train.columns))
# Count how many listings have each number of photos, ordered by photo count
photo_counts = X_train['photos'].apply(len).value_counts().sort_index()

plt.bar(photo_counts.index, photo_counts.values, align='center')
plt.xlabel('number of photos')
plt.ylabel('number of listings')

plt.show()

This makes a lot of sense: most listings include a few photos, and as the number increases past 10 we see a marked decrease (although apparently there is a listing with 68 photos??). I'm going to try replacing the list of photo URLs with a plain count of photos and see how that works. Also a bunch of other feature modifications, as in...

## Feature Engineering!!!


#### Photo count transformation

In [60]:
X_train['photos'] = X_train['photos'].apply(len)

#### Datetime normalization
We need a numerical measure of the 'created' feature. There are a couple of options: the first transformation treats the earliest date as 0 and counts the whole days elapsed since then. The second uses the day of the month, and it is the one applied below.

In [61]:
from datetime import datetime
# Earliest timestamp in the training set; strings in this format sort chronologically
MIN_DATE = datetime.strptime(X_train['created'].min(), '%Y-%m-%d %H:%M:%S')

def norm_date_absolute(curr):
    # Whole days elapsed since MIN_DATE
    curr_date = datetime.strptime(curr, '%Y-%m-%d %H:%M:%S')
    res = curr_date - MIN_DATE
    return res.days

def norm_date_day(curr):
    # Day of the month (1-31)
    return datetime.strptime(curr, '%Y-%m-%d %H:%M:%S').day

# Apply the day-of-month transformation
X_train['created'] = X_train['created'].apply(norm_date_day)
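
For comparison, the two transformations above can also be written in vectorized form with `pd.to_datetime` and the `.dt` accessor. This is a minimal sketch on three made-up timestamps, assuming the same `'%Y-%m-%d %H:%M:%S'` format; the variable names are illustrative only.

```python
import pandas as pd

# Made-up timestamps in the same string format as the 'created' column.
created = pd.Series([
    '2016-04-01 22:12:41',
    '2016-04-15 03:00:00',
    '2016-06-29 21:41:47',
])

parsed = pd.to_datetime(created, format='%Y-%m-%d %H:%M:%S')

# Option 1: whole days elapsed since the earliest timestamp.
days_since_first = (parsed - parsed.min()).dt.days

# Option 2: day of the month.
day_of_month = parsed.dt.day

print(days_since_first.tolist())
print(day_of_month.tolist())
```

If the elapsed-days version is used, the reference date would need to be computed once on the training set and reused for the test set so that both are measured from the same origin.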


#### Location Normalization

The location fields overlap a lot: building_id, display_address, lat/long, and street_address all encode location data. (WORK IN PROGRESS, PICK UP HERE)
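
One possible way to probe that overlap, sketched on a made-up frame: count how many distinct address spellings each building has, and collapse latitude/longitude into a coarse grid cell that could stand in for the text fields. The `grid_cell` column and the two-decimal rounding are assumptions chosen for illustration, not something the dataset provides.

```python
import pandas as pd

# Toy rows that reuse the dataset's column names; all values are made up.
demo = pd.DataFrame({
    'building_id': ['b1', 'b1', 'b2', 'b3'],
    'display_address': ['W 34th St', 'West 34th Street', 'Bedford Ave', 'Broadway'],
    'latitude': [40.7501, 40.7501, 40.7145, 40.7831],
    'longitude': [-73.9866, -73.9866, -73.9425, -73.9712],
})

# How many different spellings of the address does each building have?
spellings = demo.groupby('building_id')['display_address'].nunique()

# Hypothetical coarse location key: round the coordinates to two decimals.
demo['grid_cell'] = (
    demo['latitude'].round(2).astype(str) + '_' + demo['longitude'].round(2).astype(str)
)

print(spellings)
print(demo[['building_id', 'grid_cell']])
```

If most buildings map to a single grid cell, the text address fields add little beyond the coordinates, which would support dropping them.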

In [62]:
X_train.groupby('building_id').agg('count').sort_values('bedrooms', ascending=False)


X_train = X_train.drop(UNWANTED_FEATURES, axis=1)
X_train

listing_id,bathrooms,bedrooms,created,latitude,longitude,photos,price
7211212,1.5,3,24,40.7145,-73.9425,5,3000
7150865,1.0,2,12,40.7947,-73.9667,11,5465
6887163,1.0,1,17,40.7388,-74.0018,8,2850
6888711,1.0,1,18,40.7539,-73.9677,3,3275
6934781,1.0,4,28,40.8241,-73.9493,3,3350
6894514,2.0,4,19,40.7429,-74.0028,5,7995
6930771,1.0,2,27,40.8012,-73.9660,10,3600
6867392,2.0,1,13,40.7427,-73.9957,5,5645
6898799,1.0,1,20,40.8234,-73.9457,5,1725
6814332,2.0,4,2,40.7278,-73.9808,9,5800


# Data Modelling

We will use the base AdaBoost classifier and tune its hyperparameters.

In [63]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

clf = AdaBoostClassifier()
# Search learning rates from 0.1 to 0.95 in steps of 0.05
gs_params = {'learning_rate': list(np.arange(0.1,1,0.05))}

clf = GridSearchCV(clf, gs_params, cv=2)
clf.fit(X_train, Y_train)



GridSearchCV(cv=2, error_score='raise',
       estimator=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'learning_rate': [0.10000000000000001, 0.15000000000000002, 0.20000000000000004, 0.25000000000000006, 0.30000000000000004, 0.35000000000000009, 0.40000000000000013, 0.45000000000000007, 0.50000000000000011, 0.55000000000000016, 0.6000000000000002, 0.65000000000000013, 0.70000000000000018, 0.75000000000000022, 0.80000000000000016, 0.8500000000000002, 0.90000000000000024, 0.95000000000000029]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
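
With `scoring=None`, the search above ranks candidates by the classifier's default score, which is accuracy. Since the final output is a table of class probabilities, it may be worth scoring the search on log loss instead. The sketch below shows that variant on synthetic data; the choice of `neg_log_loss`, the extra `n_estimators` values, and the three stratified folds are assumptions for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 3-class problem standing in for the 7 numeric listing features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, n_classes=3, random_state=0
)

# Small illustrative grid; the values are assumptions, not tuned results.
param_grid = {'learning_rate': [0.1, 0.5, 1.0], 'n_estimators': [25, 50]}

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid,
    scoring='neg_log_loss',  # higher is better, so log loss is negated
    cv=cv,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print('cross-validated log loss:', -search.best_score_)
```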

Below is the testing code: 

In [65]:
X_test = pd.read_json("../input/test.json").set_index('listing_id')
X_test = X_test.drop(UNWANTED_FEATURES, axis=1)
X_test['photos'] = X_test['photos'].apply(len)
X_test['created'] = X_test['created'].apply(norm_date_day)
X_test

listing_id,bathrooms,bedrooms,created,latitude,longitude,photos,price
7142618,1.0,1,11,40.7185,-73.9865,8,2950
7210040,1.0,2,24,40.7278,-74.0000,3,2850
7103890,1.0,1,3,40.7306,-73.9890,6,3758
7143442,1.0,2,11,40.7109,-73.9571,6,3300
6860601,2.0,2,12,40.7650,-73.9845,7,4900
6840081,3.0,3,7,40.7901,-73.9774,8,9000
6922337,1.0,2,25,40.7730,-73.9571,8,2800
6913616,1.0,0,22,40.6751,-73.9511,5,1900
6937820,1.0,2,28,40.7597,-73.9929,3,3000
6893933,1.0,0,19,40.7208,-73.9887,3,2300


In [66]:
# Label the probability columns with the class order the fitted model actually uses
out = pd.DataFrame(clf.predict_proba(X_test), index = X_test.index, columns = clf.classes_)
out = out[['high','medium','low']]
out.to_csv('../output/ada_5.csv')
out


listing_id,high,medium,low
7142618,0.324302,0.334973,0.340725
7210040,0.328142,0.335699,0.336159
7103890,0.319537,0.334180,0.346282
7143442,0.324157,0.334917,0.340926
6860601,0.318972,0.335116,0.345912
6840081,0.309943,0.331551,0.358506
6922337,0.323709,0.335372,0.340920
6913616,0.325566,0.334876,0.339558
6937820,0.325861,0.335697,0.338443
6893933,0.327158,0.334497,0.338345
