<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 7

## NLP and Machine Learning on [travel.stackexchange.com](http://travel.stackexchange.com/) data

---

In Project 7 you'll be doing NLP and machine learning on post data from Stack Exchange's travel subdomain.

This project is set up like a mini Kaggle competition. You are given the training data, and when projects are submitted your model will be tested on the held-out testing data. There will be prizes for the people who build the models that perform best on the held-out test set!

---

## Notes on the data

The data is again compressed into the `.7z` file format to save space. There are six `.csv` files and one readme file that describes the fields.

    posts_train.csv
    comments_train.csv
    users.csv
    badges.csv
    votes_train.csv
    tags.csv
    readme.txt
    
The data is located in your datasets folder:

    DSI-SF-2/datasets/stack_exchange_travel.7z
    
If you're interested in where this data came from and where to get more data from other stackexchange subdomains, see here:

https://ia800500.us.archive.org/22/items/stackexchange/readme.txt


### Recommended Utilities for .7z

- For OSX, use [Keka](http://www.kekaosx.com/en/) or [The Unarchiver](http://wakaba.c3.cx/s/apps/unarchiver.html).
- For Windows, [7-zip](http://www.7-zip.org/) is the standard.
- For Linux, try the `p7zip` utility: `sudo apt-get install p7zip`.



<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 1. Use LDA to find what topics are discussed on travel.stackexchange.com.

---

Text can be found in the posts and the comments datasets. The `ParentId` column in the posts dataset indicates what the "question" post was for a given post. Comment text can be merged onto the post they are part of with the `PostId` field.

The text may have some HTML tags. BeautifulSoup has convenient ways to get rid of markup or extract text if you need to. You can also parse the strings yourself if you like.
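
As an alternative to BeautifulSoup, the "parse the strings yourself" route can be sketched with the standard library alone. The `strip_html` helper, the `_TextExtractor` class, and the sample string are hypothetical names and data made up for illustration. This is only a minimal sketch that keeps text nodes and drops tags, assuming the post bodies are ordinary HTML fragments.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in this notebook


class _TextExtractor(HTMLParser):
    """Collects only the text between tags (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of plain text found outside the tags
        self.chunks.append(data)


def strip_html(markup):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Made-up sample body, not taken from the dataset
sample = "<p>Do I need a <strong>visa</strong> to transit?</p>\n<p>Thanks!</p>"
clean = strip_html(sample)
```

Collapsing whitespace with `split()` and `join()` keeps words on adjacent lines separated by a space, which matters once the text is tokenized.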

The tags dataset has the "tags" that the users have officially given the post.

**1.1 Implement LDA against the text features of the dataset(s).**

- This can be posts or a combination of posts and comments if you want more power.
- Find optimal **K/num_topics**.

**1.2 Compare your topics to the tags. Do the LDA topics make sense? How do they compare to the tags?**
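
One rough way to approach the comparison in 1.2 is to count the official tags and check which topic words also appear in them. The sketch below assumes the `Tags` strings look like `<tag-a><tag-b>`; the helper `parse_tags` and the sample tag strings are hypothetical, while the topic words are the ones printed for one of the LDA topics in this notebook. Substring matching is crude (a short word can match inside an unrelated tag), so treat the overlap as a starting point only.

```python
import re
from collections import Counter


def parse_tags(tag_string):
    """Split a string such as '<visas><uk>' into ['visas', 'uk']."""
    return re.findall(r"<([^<>]+)>", tag_string)


# Made-up tag strings in the assumed format; None stands in for posts without tags
tag_strings = ["<visas><uk>", "<air-travel><airports>", "<visas><passports>", None]

# Count how often each official tag is used
tag_counts = Counter(
    tag
    for s in tag_strings
    if isinstance(s, str)
    for tag in parse_tags(s)
)

# Top words of one LDA topic, as printed by lda.print_topics
topic_words = ["visa", "passport", "uk", "need", "country"]

# Topic words that occur inside at least one official tag
overlap = sorted(w for w in topic_words if any(w in tag for tag in tag_counts))
```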


In [54]:
#import 
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict
from gensim import corpora, models, matutils
from sklearn import grid_search
from sklearn.grid_search import GridSearchCV

import nltk
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier


In [61]:
#data fetch
path = '/Users/paulmartin/Desktop/DSI-SF-2-GitPaulM/datasets/stack_exchange_travel/'

tposts    = pd.read_csv(path+'posts_train.csv')
tcomments = pd.read_csv(path+'comments_train.csv')
users     = pd.read_csv(path+'users.csv')
badges    = pd.read_csv(path+'badges.csv')
tvotes    = pd.read_csv(path+'votes_train.csv')
tags      = pd.read_csv(path+'tags.csv')
tposts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41289 entries, 0 to 41288
Data columns (total 21 columns):
AcceptedAnswerId         6516 non-null float64
AnswerCount              13988 non-null float64
Body                     40476 non-null object
ClosedDate               2627 non-null object
CommentCount             41289 non-null int64
CommunityOwnedDate       181 non-null object
CreationDate             41289 non-null object
FavoriteCount            3522 non-null float64
Id                       41289 non-null int64
LastActivityDate         41289 non-null object
LastEditDate             23363 non-null object
LastEditorDisplayName    844 non-null object
LastEditorUserId         22883 non-null float64
OwnerDisplayName         1173 non-null object
OwnerUserId              40552 non-null float64
ParentId                 23967 non-null float64
PostTypeId               41289 non-null int64
Score                    41289 non-null int64
Tags                     13988 non-null object
Titl

1.1

In [62]:
#clean up the post bodies

def chicken_noodle(x):
    # Missing bodies are read in as NaN; keep them as NaN so they can be dropped later
    if pd.isnull(x):
        return np.nan
    # Strip the HTML markup and keep only the visible text
    y = BeautifulSoup(x, 'html.parser').get_text()
    y = y.replace('\n','')
    y = y.replace('ewr','')
    return y

tposts['Body2'] = tposts['Body'].map(chicken_noodle)




In [63]:
#Vectorizer
# Drop the missing bodies here; dropna(inplace=True) on a selected column does not reliably change tposts
documents = tposts['Body2'].dropna().tolist()
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
# Map column index -> word, which is the id2word format gensim expects
vocab = {v: k for k, v in vectorizer.vocabulary_.iteritems()}

In [64]:
#LDA
lda = models.LdaModel(
    matutils.Sparse2Corpus(X, documents_columns=False),
    num_topics  =  3,
    passes      =  5,
    id2word     =  vocab
)

lda.print_topics(num_topics=3, num_words=5)

[(0,
  u'0.015*flight + 0.012*airport + 0.009*ticket + 0.008*time + 0.006*check'),
 (1, u'0.007*like + 0.006*just + 0.006*people + 0.005*don + 0.005*use'),
 (2, u'0.033*visa + 0.014*passport + 0.010*uk + 0.010*need + 0.009*country')]

In [None]:
# Y = CountVectorizer()
# Y.fit(temp)
# columns=Y.get_feature_names()
# columns


1.2

In [65]:
#Cleanup

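def cleantags(x):
    # Strip angle brackets and a few common markup tokens from a string.
    # The plain substring replaces also hit ordinary words
    # (for example, removing 'li' turns 'flight' into 'fght').
    try:
        y = x.replace('<','')
        y = y.replace('>',',')
        y = y.replace('http','')
        y = y.replace('com','')
        y = y.replace('href','')
        y = y.replace('amp','')
        y = y.replace('www','')
        y = y.replace('use','')
        y = y.replace('li','')
    except AttributeError:
        # Non-string input such as NaN
        y = ""
    return y

# NOTE: this maps over 'Body', so 'Tags2' holds stripped post HTML rather than the official tags.
# Map over tposts['Tags'] instead to compare against the tags themselves.
tposts['Tags2'] = tposts['Body'].map(cleantags)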


In [66]:
#Vectorizer
documents = tposts['Tags2'].tolist()
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
vocab = {v: k for k, v in vectorizer.vocabulary_.iteritems()}

In [67]:
#LDA
lda = models.LdaModel(
    matutils.Sparse2Corpus(X, documents_columns=False),
    num_topics  =  3,
    passes      =  5,
    id2word     =  vocab
)

lda.print_topics(num_topics=3, num_words=10)

[(0,
  u'0.014*nofollow + 0.014*rel + 0.008*imgur + 0.008*stack + 0.008*train + 0.007*strong + 0.007*airport + 0.006*bus + 0.006*time + 0.005*jpg'),
 (1,
  u'0.020*visa + 0.008*travel + 0.008*uk + 0.008*passport + 0.007*need + 0.007*strong + 0.007*fght + 0.007*em + 0.006*blockquote + 0.006*country'),
 (2,
  u'0.009*strong + 0.008*em + 0.007*org + 0.007*rel + 0.006*people + 0.006*wikipedia + 0.006*nofollow + 0.006*wiki + 0.005*ke + 0.005*en')]

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 2. What makes an answer likely to be "accepted"?

---

**2.1 Build a model to predict whether a post will be marked as the answer.**

- This is a classification problem.
- You're free to use any of the machine learning algorithms or techniques we have learned in class to build the best model you can.
- NLP will be very useful here for pulling out useful and relevant features from the data. 
- Though not required, using bagging and boosting models like Random Forests and Gradient Boosted Trees will _probably_ get you the highest performance on the test data (but who knows!).


**2.2 Evaluate the performance of your classifier with a confusion matrix and accuracy. Explain how your model is performing.**

**2.3 Plot either a ROC curve or precision-recall curve (or both!) and explain what they tell you about your model.**
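
To make clear what a ROC curve plots, here is a small pure-Python sketch that computes the (false positive rate, true positive rate) points and the area under them with the trapezoid rule. The function names `roc_points` and `auc_trapezoid` and the four toy labels and scores are made up for illustration. In practice `sklearn.metrics.roc_curve` and `auc` do this work.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct score threshold, starting at (0, 0)."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; everything scored >= threshold is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / float(n_neg), tp / float(n_pos)))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = auc_trapezoid(points)
```

An area of 0.5 corresponds to the diagonal (random guessing), and 1.0 to a model that ranks every positive above every negative.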

NOTE: You should only be predicting this for `PostTypeId=2` posts, which are the "answer" posts. This doesn't mean, however, that you can't or shouldn't use the parent questions as predictors!
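
A possible way to build the target for this task is sketched below with plain dictionaries. It assumes that a question's `AcceptedAnswerId` holds the `Id` of the answer that was accepted, and that `PostTypeId` 1 marks questions; check both against `readme.txt`. The toy records are made up.

```python
# Made-up toy posts using the column names from posts_train.csv
posts = [
    {"Id": 1, "PostTypeId": 1, "AcceptedAnswerId": 3, "ParentId": None},
    {"Id": 2, "PostTypeId": 2, "AcceptedAnswerId": None, "ParentId": 1},
    {"Id": 3, "PostTypeId": 2, "AcceptedAnswerId": None, "ParentId": 1},
    {"Id": 4, "PostTypeId": 1, "AcceptedAnswerId": None, "ParentId": None},
    {"Id": 5, "PostTypeId": 2, "AcceptedAnswerId": None, "ParentId": 4},
]

# Ids of all answers that some question marked as accepted (assumption: stored on the question)
accepted_ids = {
    p["AcceptedAnswerId"]
    for p in posts
    if p["PostTypeId"] == 1 and p["AcceptedAnswerId"] is not None
}

# Keep only the answer posts, then label each one 1 if accepted, else 0
answers = [p for p in posts if p["PostTypeId"] == 2]
labels = {p["Id"]: int(p["Id"] in accepted_ids) for p in answers}

# Parent questions stay available as a lookup for question-level predictors
questions = {p["Id"]: p for p in posts if p["PostTypeId"] == 1}
parent_of_first_answer = questions[answers[0]["ParentId"]]
```

With the notebook's DataFrame, the same idea could be written as `tposts['Id'].isin(tposts['AcceptedAnswerId'].dropna())` on the rows where `PostTypeId == 2`.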


In [108]:
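def Flag_it(x):
    # Return 1 if the text mentions any of the topic keywords, else 0
    keywords = ("visa", "train", "passport", "airport", "travel", "country")
    try:
        return int(any(word in x for word in keywords))
    except TypeError:
        # Missing values arrive as float NaN, which cannot be searched
        return 0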

2.1

In [109]:
#Prep
tposts.info()

# NOTE: Target is 1 for every answer post (PostTypeId == 2); it does not check whether the answer was accepted
tposts['Target'] = tposts['PostTypeId'].map(lambda x: 1 if x == 2 else 0)
tposts['NLPf'] = tposts['Body'].map(Flag_it)
tposts['Tagf'] = tposts['Tags'].map(Flag_it)
tposts['CreationYear'] = tposts['CreationDate'].map(lambda x: int(x[:4]))
# ViewCount and FavoriteCount are missing for many posts; treat missing as 0
tposts['ViewCount'] = tposts['ViewCount'].fillna(0)
tposts['FavoriteCount'] = tposts['FavoriteCount'].fillna(0)

df_cols = ['Target','NLPf','Tagf','Score','ViewCount','CommentCount','FavoriteCount','CreationYear']
df = tposts[df_cols]
# No explicit shuffle is needed: train_test_split shuffles the rows by default
df.reset_index()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41289 entries, 0 to 41288
Data columns (total 27 columns):
AcceptedAnswerId         6516 non-null float64
AnswerCount              13988 non-null float64
Body                     40476 non-null object
ClosedDate               2627 non-null object
CommentCount             41289 non-null int64
CommunityOwnedDate       181 non-null object
CreationDate             41289 non-null object
FavoriteCount            41289 non-null float64
Id                       41289 non-null int64
LastActivityDate         41289 non-null object
LastEditDate             23363 non-null object
LastEditorDisplayName    844 non-null object
LastEditorUserId         22883 non-null float64
OwnerDisplayName         1173 non-null object
OwnerUserId              40552 non-null float64
ParentId                 23967 non-null float64
PostTypeId               41289 non-null int64
Score                    41289 non-null int64
Tags                     13988 non-null object
Tit

   index  Target  NLPf  Tagf  Score  ViewCount  CommentCount  FavoriteCount  CreationYear
0      0       0     1     0      8      361.0             4            0.0          2011
1      1       0     0     0      8      219.0             1            0.0          2011
2      2       0     0     0     11      340.0             0            2.0          2011
3      3       0     1     1     11     9219.0             2            1.0          2011
4      4       0     0     0     12     1503.0             1            8.0          2011
5      5       0     1     1     24     1604.0             1            4.0          2011
6      6       0     1     0     11      449.0             4            0.0          2011
7      7       0     1     1      7      329.0             4            0.0          2011
8      8       0     1     1     57     4561.0             2           33.0          2011
9      9       1     0     0     10        0.0             3            0.0          2011


In [110]:
def conf_class(mod, X, y, margins=False):
    pred = pd.Series(mod.predict(X), name='Predicted')
    true = pd.Series(y, name='True')
    confusion = pd.crosstab(true, pred, margins=margins)
    print "Confusion Matrix:"
    print confusion
    print "\nClassification Report:"
    classify_rept = classification_report(y, pred)
    print classify_rept
    return confusion


In [111]:
#KNN
print df.shape
X = df.iloc[:,1:]
y = np.ravel([df['Target']])
Xpipe = X # for prob 5
ypipe = y # for prob 5
ss = StandardScaler()
Xn = ss.fit_transform(X)

print Xn.shape, y.shape
X_train, X_test, y_train, y_test =  train_test_split(Xn, y, test_size=0.4)
knn = KNeighborsClassifier()

search_parameters = {
    'n_neighbors':  range(1,10,2), 
    'weights':      ("uniform", "distance"),
    'algorithm':    ("ball_tree", "kd_tree", "brute", "auto"),
    'p':            [1,2]
}


knns = grid_search.GridSearchCV(knn, search_parameters, cv=5, verbose=1, n_jobs=-1)
knns.fit(X_train, y_train)
print "Best Estimator:", knns.best_estimator_.n_neighbors
print "Best Params:", knns.best_params_
print "Best Score:", knns.best_score_
y_pred = knns.predict(X_test)

(41289, 8)
(41289, 7) (41289,)
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   35.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:  4.8min finished


Best Estimator: 1
Best Params: {'n_neighbors': 1, 'weights': 'uniform', 'algorithm': 'brute', 'p': 2}
Best Score: 0.948694142817


2.2

In [72]:
cc = conf_class(knns, X_test, y_test, margins=False)

Confusion Matrix:
Predicted     0     1
True                 
0          6479   465
1           287  9285

Classification Report:
             precision    recall  f1-score   support

          0       0.96      0.93      0.95      6944
          1       0.95      0.97      0.96      9572

avg / total       0.95      0.95      0.95     16516



In [73]:
#Random Forest for comparison

forest = RandomForestRegressor( )

params = {'max_depth':[3,4,5], 
          'max_features':[2,3,4], 
          'max_leaf_nodes':[5,6,7], 
          'min_samples_split':[3,4],
          'n_estimators': [50]
         }

rf_gs = GridSearchCV(forest, params, n_jobs=-1,  cv=5,verbose=1) 

rf_gs.fit(X_train, y_train)

## Print best estimator, best parameters, and best score
rf_gs_best = rf_gs.best_estimator_
print "best estimator", rf_gs_best
print "\n==========\n"
print "best parameters", rf_gs.best_params_
print "\n==========\n"
print "best score", rf_gs.best_score_
y_hat = rf_gs.predict(X_test)

Fitting 5 folds for each of 54 candidates, totalling 270 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    6.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   26.5s
[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed:   35.2s finished


best estimator RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,
           max_features=4, max_leaf_nodes=7, min_samples_leaf=1,
           min_samples_split=4, min_weight_fraction_leaf=0.0,
           n_estimators=50, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)


best parameters {'max_features': 4, 'max_leaf_nodes': 7, 'min_samples_split': 4, 'n_estimators': 50, 'max_depth': 5}


best score 0.95433806533


In [None]:
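# The confusion matrix did not work here: RandomForestRegressor predicts continuous values
# rather than class labels (RandomForestClassifier would return labels)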

2.3

In [114]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
%matplotlib inline

# KNeighborsClassifier has no decision_function, so score the test set with
# the predicted probability of the positive class from the fitted grid search
probabilities = knns.predict_proba(X_test)

# False positive rate, true positive rate and area under the curve for class 1
FPR, TPR, _ = roc_curve(y_test, probabilities[:, 1])
ROC_AUC = auc(FPR, TPR)

# Start an empty figure and plot the ROC curve
plt.figure(figsize=[11,9])
plt.plot(FPR, TPR, label='ROC curve (area = %0.2f)' % ROC_AUC, linewidth=4)

# Dotted diagonal: the performance expected from random guessing (AUC = 0.5)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Receiver operating characteristic for answer-post predictions', fontsize=18)
plt.legend(loc="lower right")
plt.show()

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 3. What is the score of a post?

---

**3.1 Build a model that predicts the score of a post.**

- This is a regression problem now. 
- You can and should be predicting score for both "question" and "answer" posts, so keep them both in your dataset.
- Again, use any techniques that you think will get you the best model.

**3.2 Evaluate the performance of your model with cross-validation and report the results.**

**3.3 What is important for determining the score of a post, if anything?**
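
The cross-validation asked for in 3.2 can be illustrated without any library: split the rows into folds, fit on all folds but one, and score on the held-out fold. The sketch below uses a trivial baseline that always predicts the training mean; the helper names `k_fold_indices` and `mean_absolute_error` are hypothetical, and the ten `Score` values are the ones shown in the preview table earlier in this notebook. It is a baseline to compare a real model against, not a model itself.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / float(len(y_true))


# Score column of the ten preview rows
y = [8, 8, 11, 11, 12, 24, 11, 7, 57, 10]

fold_errors = []
for train_idx, test_idx in k_fold_indices(len(y), 5):
    # "Fit" the baseline on the training folds only
    baseline = sum(y[i] for i in train_idx) / float(len(train_idx))
    held_out = [y[i] for i in test_idx]
    fold_errors.append(mean_absolute_error(held_out, [baseline] * len(held_out)))

mean_error = sum(fold_errors) / len(fold_errors)
```

A regression model is only useful here if its cross-validated error is clearly lower than this mean-only baseline.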


3.1

In [83]:
df_cols = ['PostTypeId','NLPf','Tagf','Score','ViewCount','CommentCount','FavoriteCount','CreationYear']
df = tposts[df_cols]
print df.shape
print df_cols
print type(df)

(41289, 8)
['PostTypeId', 'NLPf', 'Tagf', 'Score', 'ViewCount', 'CommentCount', 'FavoriteCount', 'CreationYear']
<class 'pandas.core.frame.DataFrame'>


In [91]:
y = df['Score'].values
# PostTypeId stays in as a feature so the model can tell questions from answers
X = df[[x for x in df_cols if x != 'Score']]
feature_cols = X.columns.values
ss = StandardScaler()
Xn = ss.fit_transform(X)

print Xn.shape, y.shape
X_train, X_test, y_train, y_test =  train_test_split(Xn, y, test_size=0.4)


(41289, 7) (41289,)


X.info()

3.2

In [92]:
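from sklearn.ensemble import AdaBoostClassifier
# NOTE: AdaBoostClassifier treats each distinct Score value as a class, so the
# cross-validation scores below are accuracies; AdaBoostRegressor is the regression counterpart
adb = AdaBoostClassifier(n_estimators=2)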
adb.fit(X, y)
adb_scores = cross_val_score(adb, X, y, cv=5)
print adb_scores, np.mean(adb_scores)



[ 0.24777911  0.24181269  0.19139394  0.18743163  0.14463263] 0.202609999622


In [None]:
# The model performs poorly: the mean cross-validated score is about 0.20.
# Gradient boosting did not yield results.


3.3

In [93]:
## Print Feature importances
feature_importance = pd.DataFrame({ 'feature':feature_cols, 
                                   'importance':adb.feature_importances_
                                  })

feature_importance.sort_values('importance', ascending=False, inplace=True)
feature_importance

         feature  importance
0     PostTypeId         0.5
4   CommentCount         0.5
1           NLPf         0.0
2           Tagf         0.0
3      ViewCount         0.0
5  FavoriteCount         0.0
6   CreationYear         0.0


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 4. How many views does a post have?

---

**4.1 Build a model that predicts the number of views a post has.**

- This is another regression problem. 
- Predict the views for all posts, not just the "answer" posts.

**4.2 Evaluate the performance of your model with cross-validation and report the results.**

**4.3 What is important for the number of views a post has, if anything?**
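
When a scikit-learn regressor is passed to `cross_val_score` without a `scoring` argument, the reported number is R², which can be negative. The short sketch below shows why: R² compares the model's squared error with that of always predicting the mean, so a model that does worse than the mean scores below zero. The function name `r_squared` is hypothetical, and the five view counts are taken from the preview table earlier in this notebook.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# ViewCount values of the first five preview rows
views = [361.0, 219.0, 340.0, 9219.0, 1503.0]
mean_views = sum(views) / len(views)

# Always predicting the mean gives R^2 of zero
r2_mean = r_squared(views, [mean_views] * len(views))

# A constant guess far from the mean is worse than the mean, so R^2 drops below zero
r2_bad = r_squared(views, [5000.0] * len(views))
```

Heavily skewed targets such as view counts make this easy to trigger, because a few very large values dominate the squared error.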

4.1

In [94]:
df_cols = ['PostTypeId','NLPf','Tagf','Score','ViewCount','CommentCount','FavoriteCount','CreationYear']
df = tposts[df_cols]
print df.shape
print df_cols

(41289, 8)
['PostTypeId', 'NLPf', 'Tagf', 'Score', 'ViewCount', 'CommentCount', 'FavoriteCount', 'CreationYear']


In [95]:
y = np.ravel([df['ViewCount']])
X = df[[x for x in df_cols if x not in ['ViewCount','PostTypeId']]]
feature_cols = X.columns
ss = StandardScaler()
Xn = ss.fit_transform(X)

print Xn.shape, y.shape
X_train, X_test, y_train, y_test =  train_test_split(Xn, y, test_size=0.4)
feature_cols

(41289, 6) (41289,)


Index([u'NLPf', u'Tagf', u'Score', u'CommentCount', u'FavoriteCount',
       u'CreationYear'],
      dtype='object')

In [97]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(max_depth=10)
rfr.fit(Xn, y)
rfr_scores = cross_val_score(rfr, Xn, y, cv=5)
print rfr_scores, np.mean(rfr_scores)

[-5.50946852  0.1697011  -0.06927937  0.09396441 -2.55529346] -1.57407516782


4.2

4.3

In [98]:
## Print Feature importances
feature_importance = pd.DataFrame({ 'feature':feature_cols, 
                                   'importance':rfr.feature_importances_
                                  })

feature_importance.sort_values('importance', ascending=False, inplace=True)
feature_importance

         feature  importance
2          Score    0.296347
4  FavoriteCount    0.295878
3   CommentCount    0.171134
5   CreationYear    0.150149
0           NLPf    0.046042
1           Tagf    0.040451


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5. Build a pipeline or other code to automate evaluation of your models on the test data.

---

Now that you've constructed your three predictive models, build a pipeline or code that can easily load up the raw testing data and evaluate your models on it.

The testing data that is held out is in the same raw format as the training data you have. _Any cleaning and preprocessing that you did on the training data will need to be done on the testing data as well!_

This is a good opportunity to practice building pipelines, but you're not required to. Custom functions and classes are fine as long as they are able to process and test the new data.
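
The key point when automating evaluation is that every cleaning step is applied identically to train and test, and that anything learned from data (such as scaling constants) is learned from the training set only. A library-free sketch of that pattern follows. The names `preprocess` and `TrainFittedScaler` and the toy records are made up, and the three features are just examples of the cleaning done above.

```python
def preprocess(raw_post):
    """Turn one raw post record into a numeric feature row (same rule for train and test)."""
    views = raw_post.get("ViewCount")
    return [
        float(views) if views is not None else 0.0,  # missing views become 0
        float(raw_post["CommentCount"]),
        float(raw_post["CreationDate"][:4]),          # year prefix of the date string
    ]


class TrainFittedScaler(object):
    """Standardizes columns using means and standard deviations learned in fit()."""

    def fit(self, rows):
        n = float(len(rows))
        columns = list(zip(*rows))
        self.means = [sum(col) / n for col in columns]
        self.stds = []
        for col, mean in zip(columns, self.means):
            std = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
            self.stds.append(std if std > 0 else 1.0)  # avoid dividing by zero
        return self

    def transform(self, rows):
        return [
            [(v - m) / s for v, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


# Made-up toy records in the shape of the raw posts file
train_raw = [
    {"ViewCount": 361, "CommentCount": 4, "CreationDate": "2011-06-21T20:19:34"},
    {"ViewCount": None, "CommentCount": 3, "CreationDate": "2011-06-21T20:25:56"},
    {"ViewCount": 1503, "CommentCount": 1, "CreationDate": "2012-01-10T09:00:00"},
]
test_raw = [{"ViewCount": 219, "CommentCount": 1, "CreationDate": "2013-03-05T12:00:00"}]

train_rows = [preprocess(p) for p in train_raw]
test_rows = [preprocess(p) for p in test_raw]

scaler = TrainFittedScaler().fit(train_rows)   # learn constants from train only
train_scaled = scaler.transform(train_rows)
test_scaled = scaler.transform(test_rows)      # reuse the same constants on test
```

A scikit-learn `Pipeline` gives the same guarantee automatically, because `fit` is only ever called on the data passed to `pipeline.fit`.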


In [None]:

# df_cols = ['Target','NLPf','Tagf','Score','ViewCount','CommentCount','FavoriteCount','CreationYear']
# df = tposts[df_cols]
# df.iloc[np.random.permutation(len(df))]

from sklearn.base import BaseEstimator, TransformerMixin

class PipelinePreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def _create_features(self, df): 
        df['NLPf'] = df['Body'].map(lambda x: Flag_it(x))
        df['Tagf'] = df['Tags'].map(lambda x: Flag_it(x))
        df['Target'] = df['PostTypeId'].map(lambda x: 1 if x == 2 else 0)
        return df
    
    def _get_year(self, df):
        df['CreationYear'] = df['CreationDate'].map(lambda x: int(x[:4]))
        return df
    
    def _manage_nulls(self,df):
        # Treat missing view and favorite counts as 0
        df['ViewCount'] = df['ViewCount'].fillna(0)
        df['FavoriteCount'] = df['FavoriteCount'].fillna(0)
        return df
    
# Originally included, but later excluded from the pipeline.
    
#     def _build_df(self, df):
#         df_cols = ['Target','NLPf','Tagf','Score','ViewCount','CommentCount','FavoriteCount','CreationYear']
#         df = df[df_cols]
#         df.iloc[np.random.permutation(len(df))]
#         return df
    
    def transform(self, X, *args):
        X = self._create_features(X)
        X = self._get_year(X)
        X = self._manage_nulls(X)
#         X = self._build_df(X)
        return X
    
    def fit(self, X, *args):
        return self
    


In [99]:
#pipe_prep = PipelinePreprocessor()
ss = StandardScaler()
knn_for_pipe =  KNeighborsClassifier(n_neighbors=1,weights='uniform',algorithm='brute',p=2)

In [100]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

knn_pipe = Pipeline(steps=[('ss', ss),
                          ('knnfp', knn_for_pipe)])

In [None]:
# Best Estimator: 1
# Best Params: {'n_neighbors': 1, 'weights': 'uniform', 'algorithm': 'brute', 'p': 2}
# Best Score: 0.953699592298

In [101]:
Xpipe.shape, ypipe.shape

((41289, 7), (41289,))

In [102]:
# Split the data into train and test sets, then fit and score the pipeline


Xtrain, Xtest, ytrain, ytest = train_test_split(Xpipe, ypipe, test_size=0.5)
print Xtrain.shape, ytrain.shape, Xtest.shape, ytest.shape
#pd.options.mode.chained_assignment = None  # default='warn'
knn_pipe.fit(Xtrain, ytrain)
knn_pipe.score(Xtest, ytest)

(20644, 7) (20644,) (20645, 7) (20645,)


0.9392104625817389