# Introduction/Data Understanding

In this competition, you will predict how popular an apartment rental listing is based on the listing content like text description, photos, number of bedrooms, price, etc. The data comes from renthop.com, an apartment listing website. These apartments are located in New York City.

The target variable, **interest_level**, is defined by the number of inquiries a listing received while it was live on the site.

**File descriptions**

train.json - the training set

test.json - the test set

sample_submission.csv - a sample submission file in the correct format

images_sample.zip - listing images organized by listing_id (a sample of 100 listings)

Kaggle-renthop.7z - (optional) listing images organized by listing_id. Total size: 78.5GB compressed. Distributed by BitTorrent (Kaggle-renthop.torrent). 

**Data fields**

bathrooms: number of bathrooms

bedrooms: number of bedrooms

building_id

**created**

**description**

**display_address**

**features:** a list of features about this apartment

latitude

listing_id

longitude

manager_id

**photos:** a list of photo links. You are welcome to download the pictures yourselves from renthop's site, but they are the same as imgs.zip.

price: in USD

street_address

interest_level: this is the target variable. It has 3 categories: 'high', 'medium', 'low'


**Some thoughts on features + data construction**

I've bolded the fields which I think will separate the really good classifiers from the okay ones. I also want to do some research into the process of searching RentHop to see what might implicitly be important.

1. created: people would be more interested if they see the posting is recent (neophilia bias)
2. description: more qualitative facts, do some textual analysis
3. display_address: are people obsessed with location? Some ritzy streets could draw interest.
4. features:
5. photos: I'm first thinking of some basic heuristics, e.g. what is the resolution of the images? Then move on to more algorithmic approaches, such as sentiment analysis on the images. At a high level, what makes people interested or uninterested in a photo?
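
As a first pass at the ideas in the list above, here is a minimal sketch of turning the bolded fields into simple numeric columns. The two listings are made up for illustration and the output column names (`created_hour`, `description_words`, `n_features`, `n_photos`, ...) are my own choices, not something defined by the dataset.

```python
import pandas as pd

# Two made-up listings using the same field names as train.json (values are illustrative only).
toy = pd.DataFrame({
    "created": ["2016-06-24 07:54:24", "2016-04-17 03:26:41"],
    "description": ["Sunny two bedroom near the park", ""],
    "features": [["Doorman", "Elevator"], []],
    "photos": [["a.jpg", "b.jpg", "c.jpg"], []],
})

# Parse the timestamp so that date parts can be used as numeric features.
created = pd.to_datetime(toy["created"])

feats = pd.DataFrame({
    "created_hour": created.dt.hour,
    "created_dayofweek": created.dt.dayofweek,  # Monday=0 ... Sunday=6
    "description_words": toy["description"].str.split().str.len(),
    "n_features": toy["features"].apply(len),
    "n_photos": toy["photos"].apply(len),
}, index=toy.index)

print(feats)
```

These counts are crude, but they can go straight into the same kind of model as the numeric fields, so they are a cheap way to test whether a field carries any signal before doing real text or image work.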


# Data Ingestion
We drop a few fields (the IDs, photos, and the free-form feature descriptions) that will probably be helpful later but not right now.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

X_train = pd.read_json("../input/train.json").set_index('listing_id')
Y_train = X_train['interest_level']
# Drop the target, the ID columns, and the text/list/date columns in one call,
# leaving only the numeric fields for now.
X_train = X_train.drop(columns=['building_id', 'photos', 'features', 'manager_id', 'interest_level',
                                'street_address', 'description', 'display_address', 'created'])


After a little exploratory analysis, we see we have 49352 training examples, each with 14 features. Before I decide on my model/approach, I'm going to do some more exploration of the data. Below, we look at the distinct numbers of photos per listing.

In [None]:
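print(list(X_train.columns))
# 'photos' was dropped from X_train above, so read that column from the raw file again.
photos = pd.read_json("../input/train.json")['photos']
# Distinct numbers of photos per listing.
print(set(photos.apply(len)))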
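
A set only tells us which photo counts occur, not how common each one is. A small sketch of getting the full distribution with `value_counts`, using a made-up `photos` column in place of the real one:

```python
import pandas as pd

# Made-up 'photos' column: each entry is a list of photo links, as in train.json.
photos = pd.Series([["a.jpg"], [], ["b.jpg", "c.jpg"], ["d.jpg"], []])

# Number of listings for each photo count, ordered by photo count.
photo_counts = photos.apply(len).value_counts().sort_index()
print(photo_counts)
```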

# Data Modelling

We start with just a basic logistic regression, no tuning or anything. 

In [3]:
from sklearn.linear_model import LogisticRegression

lg = LogisticRegression()
lg.fit(X_train, Y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
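
The fit above does not report a score. A minimal sketch of a hold-out check follows, assuming the metric of interest is multi-class log loss on the predicted probabilities. The data is randomly generated for illustration (so the score itself means nothing), and `X_toy`, `y_toy`, and `model` are hypothetical names rather than variables from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Randomly generated stand-in for the numeric columns (illustration only).
rng = np.random.RandomState(0)
n = 300
X_toy = pd.DataFrame({
    "bathrooms": rng.choice([1.0, 1.5, 2.0], size=n),
    "bedrooms": rng.randint(0, 4, size=n),
    "price": rng.randint(1500, 6000, size=n),
})
# Random labels with an imbalanced class mix; they carry no real signal.
y_toy = pd.Series(rng.choice(["high", "medium", "low"], size=n, p=[0.1, 0.2, 0.7]))

# Hold out 25% of the rows, keeping the class proportions the same in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)

model = LogisticRegression(max_iter=1000)
model.fit(X_fit, y_fit)

# predict_proba columns follow model.classes_, so pass the same order to log_loss.
val_proba = model.predict_proba(X_val)
score = log_loss(y_val, val_proba, labels=model.classes_)
print(score)
```

With the real data, the same split on `X_train` and `Y_train` would give a number to compare later models against before submitting anything.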

Below is the testing code: 

In [4]:
X_test = pd.read_json("../input/test.json").set_index('listing_id')
# Drop the same non-numeric columns as for the training set (the test set has no interest_level).
X_test = X_test.drop(columns=['building_id', 'photos', 'features', 'manager_id', 'created',
                              'street_address', 'description', 'display_address'])


ValueError: Expected object or value

In [2]:
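# predict_proba returns one column per class in the order of lg.classes_,
# so label the columns from the model instead of hard-coding the order.
out = pd.DataFrame(lg.predict_proba(X_test), index=X_test.index, columns=lg.classes_)
out = out[['high', 'medium', 'low']]
out.to_csv('../output/log_reg_2.csv')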


NameError: name 'lg' is not defined
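
Since the cell above never ran, here is a small self-contained sketch of the submission step on made-up data. It shows that `predict_proba` orders its columns by `classes_` (sorted labels), which is why the columns have to be reordered to high/medium/low afterwards. The names `X_toy`, `y_toy`, `clf`, and `sub` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Six made-up listings indexed by listing_id (values are illustrative only).
X_toy = pd.DataFrame(
    {"bedrooms": [0, 1, 2, 3, 1, 2], "price": [1.8, 2.5, 3.2, 4.9, 2.1, 3.0]},
    index=pd.Index([11, 12, 13, 14, 15, 16], name="listing_id"),
)
y_toy = ["low", "medium", "high", "low", "high", "medium"]

clf = LogisticRegression().fit(X_toy, y_toy)
print(clf.classes_)  # class labels in sorted order

# Label the probability columns from the fitted model, then reorder for the submission.
sub = pd.DataFrame(clf.predict_proba(X_toy), index=X_toy.index, columns=clf.classes_)
sub = sub[["high", "medium", "low"]]
print(sub.head())
```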