# Two Sigma Connect - Rental Listing Inquiries

[Kaggle link](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries)

In [27]:
%matplotlib inline
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import DistanceMetric
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import log_loss

In [25]:
os.chdir("../data/")
train_df = pd.read_json("train.json")

# Business Understanding

### What problem are we trying to solve?
* RentHop is an apartment search website. We are trying to predict the interest level (high, medium, low) of a new listing. 
* This will help RentHop better handle fraud control, identify potential listing quality issues, and allow owners and agents to better understand renters’ needs and preferences.

### What are the relevant metrics? How much do we plan to improve them?
* The evaluation metric is multiclass logarithmic loss (log loss) over the 3 interest levels.
* A baseline prediction of 0.33 for each class results in a log loss of about 1.1 (-ln(1/3) ≈ 1.10). We plan to reduce the log loss to 0.7 or lower, which corresponds to assigning a probability of about 0.5 to the correct class, roughly a 50% relative increase in confidence over the baseline's 0.33.
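
The two loss figures quoted above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `single_listing_logloss` is made up for illustration and is not part of the notebook.

```python
import math

def single_listing_logloss(p_true_class):
    """Log loss contribution of one listing: -ln(p), where p is the
    probability assigned to the listing's true class."""
    return -math.log(p_true_class)

uniform_loss = single_listing_logloss(1.0 / 3)  # uniform guess over 3 classes
target_loss = single_listing_logloss(0.5)       # 0.5 on the correct class

print("uniform baseline: {:.4f}".format(uniform_loss))  # 1.0986
print("target level:     {:.4f}".format(target_loss))   # 0.6931
```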

### What will we deliver?
* A categorical response prediction model for predicting the interest level of an apartment listing.
* This prediction will primarily be used to rank apartments on the RentHop search page.

# Data Understanding

### What are the raw data sources?
* The data source is the train.json file

### What does each 'unit' (e.g. row) of data represent?
* Each row is an apartment listing

### What are the fields (columns)?
* Dependent variable: interest_level
* Independent variables: bathrooms, bedrooms, building_id, created, description, display_address, features, latitude, longitude, listing_id, manager_id, photos, price, street_address

In [8]:
train_df.columns

Index([u'bathrooms', u'bedrooms', u'building_id', u'created', u'description',
       u'display_address', u'features', u'interest_level', u'latitude',
       u'listing_id', u'longitude', u'manager_id', u'photos', u'price',
       u'street_address'],
      dtype='object')

### EDA
* Missing values

In [17]:
np.sum(train_df.isnull().any(axis=1)) # there are no missing values

0

* Distribution of target
    * There are about 70% low interest, 23% medium interest and 8% high interest listings

In [24]:
print("raw counts of targets: ")
print(train_df.interest_level.value_counts())

print("\n\npercentages for targets: ")
print(train_df.interest_level.value_counts() * 100.0 / train_df.shape[0])

raw counts of targets: 
low       34284
medium    11229
high       3839
Name: interest_level, dtype: int64


percentages for targets: 
low       69.468309
medium    22.752877
high       7.778813
Name: interest_level, dtype: float64


* Distribution of each feature

* Relationships between features

# Data Preparation

### What steps are taken to prepare the data for modeling?
* feature transformations? engineering?

### Encoding the target as integer labels
* Interest level (high, medium, low) is mapped to an integer class label (0, 1, 2)

In [15]:
target_num_map = {'high':0, 'medium':1, 'low':2}
y = np.array(train_df['interest_level'].apply(lambda x: target_num_map[x]))
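
The mapping above yields integer class labels. If a one-hot representation is needed (for example, to compute log loss by hand), one possible way to expand the labels is to index into an identity matrix. This is a sketch on a toy label list, not on the real data.

```python
import numpy as np

target_num_map = {'high': 0, 'medium': 1, 'low': 2}
labels = ['low', 'high', 'medium', 'low']  # toy example, not real listings

y_int = np.array([target_num_map[label] for label in labels])

# Row i of the identity matrix is the one-hot vector for class i,
# so fancy-indexing with the integer labels gives one row per listing.
y_onehot = np.eye(len(target_num_map), dtype=int)[y_int]
print(y_onehot)
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]]
```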

### NLP Features from the Apartment Listing Description
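
The notebook imports `TfidfVectorizer` and `TruncatedSVD` but does not show how they are applied. One plausible use, sketched here as an assumption and not as the author's actual pipeline, is to turn each description into TF-IDF weights and compress them into a few dense components. The toy descriptions and the choice of 2 components are made up for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions standing in for train_df["description"].
descriptions = [
    "Spacious two bedroom with doorman and elevator",
    "Sunny studio near the subway, no fee",
    "Renovated one bedroom, laundry in building",
    "Large two bedroom with fitness center and doorman",
]

# Sparse TF-IDF matrix: one row per listing, one column per term.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(descriptions)

# Reduce the sparse matrix to a small number of dense components
# that can sit next to the numerical features.
svd = TruncatedSVD(n_components=2, random_state=0)
description_svd = svd.fit_transform(tfidf_matrix)

print(description_svd.shape)  # (4, 2)
```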

### Numerical features from the listing
* Number of photos
* Number of features (features are the tags provided by the listing, e.g. Doorman, Elevator, Fitness Center, etc.)
* Number of words in description
* Year created
* Month created
* Day created
* Hour created

In [13]:
train_df["num_photos"] = train_df["photos"].apply(len)
train_df["num_features"] = train_df["features"].apply(len)
train_df["num_description_words"] = train_df["description"].apply(lambda x: len(x.split(" ")))
train_df["created"] = pd.to_datetime(train_df["created"])
train_df["created_year"] = train_df["created"].dt.year
train_df["created_month"] = train_df["created"].dt.month
train_df["created_day"] = train_df["created"].dt.day
train_df["created_hour"] = train_df["created"].dt.hour

### Precise description of modeling base tables.
* What are the rows/columns of X (the predictors)?
* What is y (the target)?
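
As a rough illustration of what the base tables might look like, the sketch below builds `X` and `y` from a toy frame. The column selection in `feature_cols` is an assumption; the notebook does not state which predictors make up the final table.

```python
import pandas as pd

# Toy rows standing in for train_df after feature engineering.
toy_df = pd.DataFrame({
    "bathrooms": [1.0, 2.0, 1.0],
    "bedrooms": [1, 3, 0],
    "price": [3000, 5500, 2400],
    "num_photos": [5, 8, 0],
    "interest_level": ["medium", "low", "high"],
})

# Assumed predictor columns; one row of X per listing.
feature_cols = ["bathrooms", "bedrooms", "price", "num_photos"]
target_num_map = {"high": 0, "medium": 1, "low": 2}

X = toy_df[feature_cols].values
y = toy_df["interest_level"].map(target_num_map).values

print("X shape: {}, y shape: {}".format(X.shape, y.shape))
```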

# Modeling

### What model are we using? Why?
### Assumptions?
### Regularization?

# Evaluation

### How well does the model perform?
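
No results are reported here yet. One way the evaluation could be set up is a stratified hold-out split scored with `log_loss`, sketched below on synthetic data from `make_classification`. The data, the random forest settings, and the 25% validation share are all assumptions for illustration, so the printed number says nothing about the rental listings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real feature table.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Stratify so each split keeps the same class proportions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# log_loss expects one probability column per class.
val_proba = model.predict_proba(X_val)
val_loss = log_loss(y_val, val_proba, labels=[0, 1, 2])

print("validation log loss: {:.3f}".format(val_loss))
```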

# Deployment

### How is the model deployed?
* The model will be deployed via a Python-based API server.
* Each request will contain all the necessary covariates, and the API will respond with a predicted probability for each interest level.
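
The request/response contract described above might look roughly like the sketch below. Everything here is hypothetical: `StubModel` stands in for a fitted classifier, and `handle_request`, the JSON field names, and the class order are illustrative choices, not a description of an existing server.

```python
import json

# Column order of predict_proba, matching target_num_map in this notebook.
CLASS_NAMES = ["high", "medium", "low"]

class StubModel(object):
    """Hypothetical stand-in for a fitted classifier."""
    def predict_proba(self, rows):
        # Returns fixed probabilities; a real model would use the features.
        return [[0.1, 0.3, 0.6] for _ in rows]

def handle_request(model, request_body, feature_names):
    """Parse a JSON request, score one listing, return JSON probabilities."""
    payload = json.loads(request_body)
    row = [payload[name] for name in feature_names]
    proba = model.predict_proba([row])[0]
    return json.dumps(dict(zip(CLASS_NAMES, proba)), sort_keys=True)

request_body = '{"bedrooms": 2, "bathrooms": 1, "price": 3200}'
response = handle_request(StubModel(), request_body,
                          ["bathrooms", "bedrooms", "price"])
print(response)  # {"high": 0.1, "low": 0.6, "medium": 0.3}
```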

### What support is provided after initial deployment?
* The model will be updated nightly to provide the most up-to-date predictions.