# Creating the model

This notebook builds the model served by the API: a classifier that labels hotel reviews as truthful (`True`) or deceptive (`False`).

## Data preparation

In [6]:
import glob
import pandas as pd

truthful_files = glob.glob('/Users/jediv/repos/papis-deploy-python-model-apis/model/op_spam/*/truthful*/*/*txt')
deceptive_files = glob.glob('/Users/jediv/repos/papis-deploy-python-model-apis/model/op_spam/*/deceptive*/*/*txt')

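def read_file(path):
    """Return the full contents of a review file as a string."""
    with open(path) as f:
        return f.read()

# Load every review; truthful files come first so the labels below line up.
text = [read_file(path) for path in truthful_files + deceptive_files]

# True marks a truthful review, False a deceptive one.
labels = [True] * len(truthful_files) + [False] * len(deceptive_files)

data = pd.DataFrame(data=list(zip(text, labels)), columns=['text', 'label'])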
data.head()

                                                text  label
0  My $200 Gucci sunglasses were stolen out of my...   True
1  This was a gorgeous hotel from the outside and...   True
2  The hotel is very impressive upon entering and...   True
3  Going to the Internet Retailer 2010 at the las...   True
4  I checked into this hotel, Rm 1760 on 11/13/20...   True
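
The labels in the cell above depend on the order in which the two file lists are concatenated. As a hedged alternative, a label could be derived from each path itself, assuming (as the glob patterns suggest) that one directory name in the path starts with `truthful` or `deceptive`. The helper `label_from_path` and the example paths are hypothetical and are not part of the original notebook.

```python
from pathlib import PurePosixPath

def label_from_path(path):
    """Return True for a truthful review, False for a deceptive one.

    Assumes one directory in the path starts with 'truthful' or 'deceptive'.
    """
    # Inspect only the directory components, not the file name itself.
    for part in PurePosixPath(path).parts[:-1]:
        if part.startswith('truthful'):
            return True
        if part.startswith('deceptive'):
            return False
    raise ValueError('no truthful/deceptive directory in %r' % path)

# Hypothetical paths shaped like the glob patterns used in this notebook.
example_paths = [
    'op_spam/negative/truthful_reviews/fold1/t_1.txt',
    'op_spam/positive/deceptive_reviews/fold2/d_1.txt',
]
example_labels = [label_from_path(p) for p in example_paths]
```

Deriving the label per file keeps text and label paired even if the file lists are later shuffled or filtered.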


In [30]:
real = data[data['label'] == True].iloc[50]
fake = data[data['label'] == False].iloc[50]
print(real.text)
print(fake.text)

The service was reasonably well...they seemed to have my reservation and checked me in fairly easily. The location was fine being that it was central to the city but I found out that it was also a tremendous drawback. It was interesting that the survey I took at the hotel in order to obtain internet access at the hotel asked what my most important quality was in a hotel visit and my response "Quiet". My visit was anything but, mainly attributed to the fact that they put me on the second floor, facing the street. I awoke at 4 a.m. and never did return to sleep because of the constant street noise and screeching of the L train that sounded as though it was just outside my window. Not the most conducive for a restful sleep prior to an important meeting. At check-out when I informed the hotel clerk of my dissatisfation he chuckled and said, " Ah, city noise."

While visiting the Chicago area, we chose the Hotel Monaco Chicago for our stay. As one of the premier luxury hotels in Chicago, we

## Training the model

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# TF-IDF features feeding a random forest.
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', RandomForestClassifier(n_estimators=1000))])

# Search over tree depth and stop-word removal; the forest size stays fixed.
parameters = {"clf__n_estimators": [1000],
              "clf__max_depth": [2, 4, 10, None],
              "tfidf__stop_words": [None, 'english']}
model = GridSearchCV(text_clf, parameters, n_jobs=-1, scoring='accuracy')

model.fit(data.text.values, data.label.values)

print(model.best_params_)
print(model.best_score_)

{'tfidf__stop_words': None, 'clf__max_depth': None, 'clf__n_estimators': 1000}
0.845625
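
To make the size of this search concrete, the sketch below enumerates the candidate settings in plain Python, without calling scikit-learn. The variable names `names` and `candidates` are illustrative only. Each candidate is fitted once per cross-validation fold, and because the cell does not set `cv`, the number of folds is whatever the installed scikit-learn version uses by default.

```python
from itertools import product

# Same grid as in the training cell.
parameters = {"clf__n_estimators": [1000],
              "clf__max_depth": [2, 4, 10, None],
              "tfidf__stop_words": [None, 'english']}

# Build one dict per combination of parameter values.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

print(len(candidates))  # 1 * 4 * 2 = 8 candidate settings
```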


In [3]:
# Apply the best parameters found by the grid search and refit on the full dataset.
text_clf.set_params(**model.best_params_)
text_clf.fit(data.text.values, data.label.values)

Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=Tru...ators=1000, n_jobs=1, oob_score=False,
            random_state=None, verbose=0, warm_start=False))])

## Storing the model

In [4]:
import joblib
joblib.dump(text_clf, 'model.pkl', compress=5)

['model.pkl']

In [15]:
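import time

import joblib

# Reload the stored pipeline and time ten single-review predictions.
model = joblib.load('model.pkl')
start = time.time()
for _ in range(10):
    print(model.predict(["I will NEVER stay in this hotel again!"]))
end = time.time()
print(end - start)  # total seconds for all ten calls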

[False]
[False]
[False]
[False]
[False]
[False]
[False]
[False]
[False]
[False]
6.02303814888
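
The ten calls above took about 6.02 seconds in total, or roughly 0.6 seconds per prediction. A minimal sketch of a reusable timing helper follows; `time_calls` and `fake_predict` are hypothetical names, and the stand-in predictor is there only so the sketch runs without the trained model. With the notebook's objects, `model.predict` could be passed in its place.

```python
import time

def time_calls(predict, inputs, repeats=10):
    """Call predict(inputs) `repeats` times; return (total, mean) in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        predict(inputs)
    total = time.perf_counter() - start
    return total, total / repeats

# Stand-in predictor (an assumption for illustration, not the real model).
def fake_predict(texts):
    return [False for _ in texts]

total, mean = time_calls(fake_predict, ["I will NEVER stay in this hotel again!"])
print('total: %.6f s, mean per call: %.6f s' % (total, mean))
```

Reporting the mean per call alongside the total makes the number easier to compare against an API latency budget.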


In [14]:
deceptive_review = data[data['label'] == False].iloc[25].text
print(model.predict([deceptive_review]))
print(deceptive_review)

[False]
James Chicago; the luxurious nice hotel as it was advertised. I didn't want stay in this hotel, but my wife insisted. Ok, so when I booked the ticket, it cost me $299 for one room. $299! I could get a new honeycomb android tablet with dual core processor at that price point! The room was stylish though, it looked nice. There was a TV, and the beds were fine. Although I couldn't quite understand why they had Wifi and no computer. Why would I even need wifi if there is no computer? What were they thinking!? The experience I had with James Chicago was a bit below of what I had expected. I mean the food available there wasn't very tasty at all. I had to wait in a long line to get the ticket and I was also hoping to browse the internet, but they had no computer, only wifi. That's like only having bread, without any fillings to make it into a sandwich!

