# Two Sigma Connect - Rental Listing Inquiries

[Kaggle link](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries)

In [27]:
%matplotlib inline

# Base dependencies
import feather # may need to put this in your .bashrc: export MACOSX_DEPLOYMENT_TARGET=10.10
import json
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool
import os
import pandas as pd
import requests
import seaborn as sns
import uuid

# Machine learning / stats dependencies
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import DistanceMetric
from sklearn.neighbors import KNeighborsClassifier

# Image processing dependencies
from PIL import Image
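try:
    from StringIO import StringIO  # Python 2
except ImportError:
    # Python 3: image bytes must be wrapped in BytesIO; aliased to keep the name used below
    from io import BytesIO as StringIO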

# Suppress annoying deprecation warnings
import warnings
warnings.filterwarnings('ignore')

In [25]:
os.chdir("../data/")
train_df = pd.read_json("train.json")

# Business Understanding

### What problem are we trying to solve?
* RentHop is an apartment search website. We are trying to predict the interest level (high, medium, low) of a new listing.
* RentHop could use the model developed in this exercise to improve the quality of search results and therefore increase the frequency of bookings.
* In addition, our analytic might help RentHop better handle fraud control, identify potential listing quality issues, and allow owners and agents to better understand renters’ needs and preferences.

### What are the relevant metrics? How much do we plan to improve them?
* The evaluation metric is multiclass logarithmic loss (log loss) over the three interest levels.
* A baseline prediction of 0.33 for each class results in a log loss of about 1.1 (-ln(1/3) ≈ 1.099). We plan to reduce the log loss to 0.7 or lower, which corresponds to predicting about 0.5 for the correct class (-ln(0.5) ≈ 0.693), roughly a 50% increase in confidence over the baseline prediction.
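
The baseline and target figures above can be checked with a few lines of arithmetic. The sketch below is a minimal illustration, assuming the standard definition of multiclass log loss (the mean negative log of the probability assigned to the true class). The helper name `multiclass_log_loss` and the three toy listings are ours and not part of the competition code.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs):
    """Mean negative log-probability assigned to the true class.

    true_labels: list of class names
    predicted_probs: list of dicts mapping class name -> probability
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        total += -math.log(probs[label])
    return total / len(true_labels)

classes = ("low", "medium", "high")
labels = ["low", "medium", "high"]  # three toy listings, one per class

# Uniform baseline: every class gets probability 1/3
uniform = [{c: 1 / 3.0 for c in classes} for _ in labels]
baseline_loss = multiclass_log_loss(labels, uniform)

# Target scenario: probability 0.5 on the correct class for every listing
confident = [
    {c: (0.5 if c == label else 0.25) for c in classes}
    for label in labels
]
target_loss = multiclass_log_loss(labels, confident)

print(round(baseline_loss, 3))  # 1.099
print(round(target_loss, 3))    # 0.693
```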

### What will we deliver?
* A categorical response prediction model for predicting the interest level of an apartment listing.
* This prediction will primarily be used to rank apartments on the RentHop search page.

# Data Understanding

### What are the raw data sources?
* The training data provided are raw listing details, provided in JSON format by RentHop.

### What does each 'unit' (e.g. row) of data represent?
* Each row is an apartment listing, containing internal apartment characteristics (like number of bathrooms) and contextual metadata (like latitude/longitude and street address).

### What are the fields (columns)?
* Dependent variable: 
    - *interest_level (categorical)*: 'high', 'medium', or 'low' rental interest, calculated by RentHop using an algorithm undisclosed to the public
* Independent variables:
    - *bathrooms (numeric)*: Number of bathrooms in the unit
    - *bedrooms (numeric)*: Number of bedrooms in the unit
    - *building_id (categorical)*: Unique ID for particular building
    - *created (date_string)*: Date the listing was first created on RentHop
    - *description (string)*: Open-text description of the unit, provided by the listing author
    - *display_address (string)*: Marketing-friendly address (not strictly a Post Office address) like "Studio at 5528-5532 S. Everett Avenue"
    - *features (list of strings)*: Semi-structured list of features like "gas stove" and "air conditioning"
    - *latitude (numeric)*: Latitude of the listed property
    - *longitude (numeric)*: Longitude of the listed property
    - *listing_id (categorical)*: Unique ID for a particular listing
    - *manager_id (categorical)*: Unique ID for a building manager
    - *photos (list of strings)*: List of URLs to listing photos on RentHop
    - *price (numeric)*: Monthly rent price (in USD)
    - *street_address (string)*: Actual street address of the listed property

In [8]:
train_df.columns

Index([u'bathrooms', u'bedrooms', u'building_id', u'created', u'description',
       u'display_address', u'features', u'interest_level', u'latitude',
       u'listing_id', u'longitude', u'manager_id', u'photos', u'price',
       u'street_address'],
      dtype='object')

### EDA
* Missing values

In [17]:
np.sum(train_df.isnull().any(axis=1)) # there are no missing values

0

* Distribution of target
    * About 69% of listings are low interest, 23% are medium interest, and 8% are high interest

In [24]:
print("raw counts of targets: ")
print(train_df.interest_level.value_counts())

print("\n\npercentages for targets: ")
print(train_df.interest_level.value_counts() * 100.0 / train_df.shape[0])

raw counts of targets: 
low       34284
medium    11229
high       3839
Name: interest_level, dtype: int64


percentages for targets: 
low       69.468309
medium    22.752877
high       7.778813
Name: interest_level, dtype: float64
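
The class imbalance shown above also suggests a stronger reference point than a uniform guess. As a rough sketch (our own illustration, not part of the original analysis), predicting the observed class frequencies for every listing gives an expected log loss equal to the entropy of the class distribution. The counts are copied from the output above; the variable names are ours.

```python
import math

# Raw class counts copied from the value_counts() output above
counts = {"low": 34284, "medium": 11229, "high": 3839}
total = sum(counts.values())

# Class priors: the fraction of listings in each class
priors = {label: n / float(total) for label, n in counts.items()}

# Predicting the priors for every listing gives a log loss equal to
# the entropy of the class distribution
prior_loss = -sum(p * math.log(p) for p in priors.values())

print(round(prior_loss, 3))
```

On these counts the result is roughly 0.79, noticeably lower than the uniform-guess loss of about 1.1, so it may be a more demanding yardstick for any fitted model.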


* Distribution of each feature

* Relationships between features

# Data Preparation

### What steps are taken to prepare the data for modeling?
* feature transformations? engineering?

### Transforming the target into integer class labels
* Interest level (high, medium, low) is mapped to an integer label (0, 1, 2). Note that this is a label encoding, not a one-hot encoding (e.g. 1,0,0).

In [15]:
target_num_map = {'high':0, 'medium':1, 'low':2}
y = np.array(train_df['interest_level'].apply(lambda x: target_num_map[x]))
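
If a downstream model needs a one-hot representation instead of integer labels, the integer codes can be expanded into indicator vectors. The following is a minimal sketch on a few hand-written labels; `to_one_hot` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
target_num_map = {'high': 0, 'medium': 1, 'low': 2}

def to_one_hot(labels, mapping):
    """Turn class names into indicator vectors, one column per class."""
    n_classes = len(mapping)
    rows = []
    for label in labels:
        row = [0] * n_classes
        row[mapping[label]] = 1  # set the column for this label's class
        rows.append(row)
    return rows

example_labels = ['low', 'high', 'medium']
one_hot = to_one_hot(example_labels, target_num_map)
print(one_hot)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```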

### NLP Features from the Apartment Listing Description

### Listing description Numerical features
* Number of photos
* Number of features (features are the tags provided by the listing, e.g. Doorman, Elevator, Fitness Center..etc)
* Number of words in description
* Year created
* Month created
* Day created
* Hour created

In [13]:
train_df["num_photos"] = train_df["photos"].apply(len)
train_df["num_features"] = train_df["features"].apply(len)
train_df["num_description_words"] = train_df["description"].apply(lambda x: len(x.split()))  # split() ignores repeated whitespace
train_df["created"] = pd.to_datetime(train_df["created"])
train_df["created_year"] = train_df["created"].dt.year
train_df["created_month"] = train_df["created"].dt.month
train_df["created_day"] = train_df["created"].dt.day
train_df["created_hour"] = train_df["created"].dt.hour

### Features Derived from Listing Images

We use Python's PIL library to load the provided image files as numeric arrays. Each pixel in an image is encoded as a 3-element tuple of RGB (red, green, blue) values. Each color's value is referred to as a "channel" in the image processing literature and in this report.

* Mean pixel value, red channel
* Mean pixel value, green channel
* Mean pixel value, blue channel
* Standard deviation of pixel values, red channel
* Standard deviation of pixel values, green channel
* Standard deviation of pixel values, blue channel
* Image resolution (total number of pixels)
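
As a small illustration of these features, the sketch below computes them on a synthetic 2x2 RGB array rather than a downloaded photo. It assumes the image has been loaded as a NumPy array of shape (height, width, 3), which is what `np.array` returns for an RGB PIL image. The helper name `channel_stats` is ours.

```python
import numpy as np

def channel_stats(img):
    """Per-channel mean/std and pixel count for an (H, W, 3) RGB array."""
    stats = {}
    for idx, name in enumerate(['red', 'green', 'blue']):
        channel = img[:, :, idx]  # all rows, all columns, one channel
        stats['mean_' + name] = channel.mean()
        stats['std_' + name] = channel.std()
    # Number of pixels is height * width (img.size would also count channels)
    stats['img_resolution'] = img.shape[0] * img.shape[1]
    return stats

# Synthetic 2x2 image: red varies, green is constant, blue is zero
img = np.array([
    [[0, 100, 0], [255, 100, 0]],
    [[0, 100, 0], [255, 100, 0]],
], dtype=np.uint8)

stats = channel_stats(img)
print(stats['mean_red'])        # 127.5
print(stats['std_green'])       # 0.0
print(stats['img_resolution'])  # 4
```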

In [None]:
# Define helper functions for creating image features
def parallelize_dataframe(df, func):
    num_partitions = 250 #number of partitions to split dataframe
    num_cores = 7 #number of cores on your machine
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

def append_image_features(data):
    img_features = data['photos'].map(lambda photo_album: get_image_features(photo_album))
    img_df = pd.DataFrame({
            'mean_red': np.array([feature_dict['mean_red'] for feature_dict in img_features]),
            'mean_green': np.array([feature_dict['mean_green'] for feature_dict in img_features]),
            'mean_blue': np.array([feature_dict['mean_blue'] for feature_dict in img_features]),
            'std_red': np.array([feature_dict['std_red'] for feature_dict in img_features]),
            'std_green': np.array([feature_dict['std_green'] for feature_dict in img_features]),
            'std_blue': np.array([feature_dict['std_blue'] for feature_dict in img_features]),
            'img_resolution': np.array([feature_dict['img_resolution'] for feature_dict in img_features])
        })
    return img_df

def get_image_features(photo_url_list):
    """
    Create one row of features for a collection of
    images.
    """
    
    # Write a temp file to disk to track progress
    fname = '/Users/jlamb/repos/sandbox/tmp/' + str(uuid.uuid1())
    with open(fname, 'w') as f:
        f.write('x')
    
    if len(photo_url_list) > 0:
        
        try: 
            # Set up collectors
            mean_red = []
            mean_green = []
            mean_blue = []
            std_red = []
            std_green = []
            std_blue = []
            img_resolution = []

            # TESTING: Just use first image for now
            photo_url_list = [photo_url_list[0]]
            for url in photo_url_list:

                # Get photo (http://stackoverflow.com/questions/7391945/how-do-i-read-image-data-from-a-url-in-python)
                response = requests.get(url)
                # Convert to RGB so every image has exactly three channels
                img = np.array(Image.open(StringIO(response.content)).convert('RGB'))

                # Mean value by channel (img has shape height x width x channels)
                mean_red.append(img[:, :, 0].mean())
                mean_green.append(img[:, :, 1].mean())
                mean_blue.append(img[:, :, 2].mean())

                # standard deviation by channel
                std_red.append(img[:, :, 0].std())
                std_green.append(img[:, :, 1].std())
                std_blue.append(img[:, :, 2].std())

                # resolution (num pixels = height * width; img.size would also count channels)
                img_resolution.append(img.shape[0] * img.shape[1])

            # Summarize 
            out_dict = {
                'mean_red': np.mean(np.array(mean_red)),
                'mean_green': np.mean(np.array(mean_green)),
                'mean_blue': np.mean(np.array(mean_blue)),
                'std_red': np.mean(np.array(std_red)),
                'std_green': np.mean(np.array(std_green)),
                'std_blue': np.mean(np.array(std_blue)),
                'img_resolution': np.mean(np.array(img_resolution))
            }
            
        except Exception:
            # Any download or decoding failure yields missing (NaN) features
            
            out_dict = {
                'mean_red': float('nan'),
                'mean_green': float('nan'),
                'mean_blue': float('nan'),
                'std_red': float('nan'),
                'std_green': float('nan'),
                'std_blue': float('nan'),
                'img_resolution': float('nan')
            }
            
        
    else:
    
        out_dict = {
            'mean_red': float('nan'),
            'mean_green': float('nan'),
            'mean_blue': float('nan'),
            'std_red': float('nan'),
            'std_green': float('nan'),
            'std_blue': float('nan'),
            'img_resolution': float('nan')
        }
        
    return out_dict

# Apply the image processing or read in cached features
if not os.path.isfile('img_df.feather'):
    img_df = parallelize_dataframe(train_df, append_image_features)
    feather.write_dataframe(img_df, 'img_df.feather')
else:
    img_df = feather.read_dataframe('img_df.feather')

# Append image features to train_df (equivalent to R 'cbind')
# drop=True on img_df avoids a second, duplicate 'index' column after the concat
train_df = pd.concat([train_df.reset_index(), img_df.reset_index(drop=True)], axis=1)

### Precise description of modeling base tables.
* What are the rows/columns of X (the predictors)?
* Target variable:
    - *interest_level (categorical)*: A three-level categorical variable, encoded as 0, 1, 2 for the levels "high", "medium", and "low" interest, respectively

# Modeling

### What model are we using? Why?
### Assumptions?
### Regularization?

# Evaluation

### How well does the model perform?

# Deployment

Implementing a deployed analytic is outside the scope of this exercise. The answers below are hypothetical only.

### How is the model deployed?
* To be deployed at RentHop, the model would likely need to run as a standalone microservice that can be managed by operations personnel. Regardless of the exact technology used, the application should expect to receive a JSON payload with raw listing details and should return a small JSON object with probabilities for each class ('low', 'medium', 'high'). We see a few possible configurations that could support such an app:
    1. All-Python app (e.g. Flask) listening for POST requests with listing data
    2. Containerized Python app (e.g. in Docker) whose container also runs a web server
    3. Python model rewritten in Java by engineers
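
Whichever configuration is chosen, the request and response contract would look much the same. The sketch below illustrates it with the standard library only. `score_listing` and `predict_proba_for_listing` are hypothetical names, the example payload is made up, and the returned probabilities are simply the approximate training-set class frequencies standing in for a real fitted model.

```python
import json

# Placeholder "model": approximate training-set class frequencies, NOT a fitted model
CLASS_PRIORS = {'low': 0.6947, 'medium': 0.2275, 'high': 0.0778}

def predict_proba_for_listing(listing):
    """Hypothetical stand-in for the real model's prediction step."""
    return dict(CLASS_PRIORS)

def score_listing(request_body):
    """Take a JSON string of raw listing details, return JSON probabilities."""
    listing = json.loads(request_body)
    probs = predict_proba_for_listing(listing)
    return json.dumps({
        'listing_id': listing.get('listing_id'),
        'probabilities': probs,
    })

# Made-up payload for illustration only
request_body = json.dumps({'listing_id': 123, 'bedrooms': 2, 'price': 3000})
response = json.loads(score_listing(request_body))
print(response['probabilities'])
```

In a Flask or similar web app, a function like `score_listing` would sit behind the POST handler, so the model logic could be tested without a running server.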

### What support is provided after initial deployment?
* Model results will be tracked by nightly batch jobs to catch cases where the model's fit erodes as the environment changes
* The model may need to be updated periodically to reflect a changing rental environment or to incorporate new data sources