<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 5


## NLP and Machine Learning on [travel.statsexchange.com](http://travel.stackexchange.com/) data

---

In Project 7 you'll be doing NLP and machine learning on post data from stackexchange's travel subdomain. 

This project is setup like a mini Kaggle competition. You are given the training data and when projects are submitted your model will be tested on the held-out testing data. There will be prizes for the people who build models that perform best on the held out test set!

---

## Notes on the data

The data is again compressed into the `.7z` file format to save space. There are 6 .csv files and one readme file that contains some information on the fields.

    posts_train.csv
    comments_train.csv
    users.csv
    badges.csv
    votes_train.csv
    tags.csv
    readme.txt
    
The data is located in your datasets folder:

    DSI-SF-2/datasets/stack_exchange_travel.7z
    
If you're interested in where this data came from and where to get more data from other stackexchange subdomains, see here:

https://ia800500.us.archive.org/22/items/stackexchange/readme.txt


### Recommended Utilities for .7z

- For OSX [Keka](http://www.kekaosx.com/en/) or [The Unarchiver](http://wakaba.c3.cx/s/apps/unarchiver.html). 
- For Windows [7-zip](http://www.7-zip.org/) is the standard. 
- For Linux try the `p7zip` utility.  `sudo apt-get install p7zip`.



In [1]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns 

from bs4 import BeautifulSoup

import nltk
from nltk.corpus import stopwords # Import the stop word list
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

from gensim import corpora, models
import gensim 

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

import xgboost as xgb

from scipy.stats.mstats import gmean


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 1. Use LDA to find what topics are discussed on travel.stackexchange.com.

---

Text can be found in the posts and the comments datasets. The `ParentId` column in the posts dataset indicates what the "question" post was for a given post. Comment text can be merged onto the post they are part of with the `PostId` field.

The text may have some HTML tags. BeautifulSoup has convenient ways to get rid of markup or extract text if you need to. You can also parse the strings yourself if you like.

The tags dataset has the "tags" that the users have officially given the post.

**1.1 Implement LDA against the text features of the dataset(s).**

- This can be posts or a combination of posts and comments if you want more power.
- Find optimal **K/num_topics**.

**1.2 Compare your topics to the tags. Do the LDA topics make sense? How do they compare to the tags?**


In [2]:
post = pd.read_csv('../../datasets/stack_exchange_travel/posts_train.csv')
comment = pd.read_csv('../../datasets/stack_exchange_travel/comments_train.csv')

In [3]:
post.head(3)

Unnamed: 0,AcceptedAnswerId,AnswerCount,Body,ClosedDate,CommentCount,CommunityOwnedDate,CreationDate,FavoriteCount,Id,LastActivityDate,...,LastEditorDisplayName,LastEditorUserId,OwnerDisplayName,OwnerUserId,ParentId,PostTypeId,Score,Tags,Title,ViewCount
0,393.0,4.0,<p>My fiancée and I are looking for a good Car...,2013-02-25T23:52:47.953,4,,2011-06-21T20:19:34.730,,1,2012-05-24T14:52:14.760,...,,101.0,,9.0,,1,8,<caribbean><cruising><vacations>,What are some Caribbean cruises for October?,361.0
1,,1.0,<p>Singapore Airlines has an all-business clas...,,1,,2011-06-21T20:24:57.160,,4,2013-01-09T09:55:22.743,...,,693.0,,24.0,,1,8,<loyalty-programs><routes><ewr><singapore-airl...,Does Singapore Airlines offer any reward seats...,219.0
2,770.0,5.0,<p>Another definition question that interested...,,0,,2011-06-21T20:25:56.787,2.0,5,2012-10-12T20:49:08.110,...,,101.0,,13.0,,1,11,<romania><transportation>,What is the easiest transportation to use thro...,340.0


In [4]:
# mask = (post['ParentId'] == 1) & (post['PostTypeId']==2)
# temp_df = pd.DataFrame(post[mask]) # create new temporary dataframe to hold question and answers
# pd.concat(objs=[post[post['Id']==1], temp_df], axis=0)

In [5]:
# mask = (post['PostTypeId']==2) & (post['ParentId']==1)
# # post.iloc[0]
# pd.concat(objs=[post.iloc[0],post[mask]], axis=1, )

In [6]:
post.dropna(axis=0, subset=['Body'], inplace=True)
post = post.reset_index()
post = post.drop(labels='index',axis=1)
post = post[['Id'] + post.columns.tolist()[:8] + post.columns.tolist()[9:]]

In [7]:
def processHTML(text, stemming=True, output_type='string'):
    text = BeautifulSoup(text).get_text()

    text = text.replace('\n', '')
        
    # 2. Extract special negator like n't
    text = re.sub('n\'t', ' not', text)
    
    letters_only = re.sub('[^a-zA-Z]',
                     ' ',
                     text)
    
    words = letters_only.lower().split()       # Convert to lower case
    
    stop = set(stopwords.words('english'))
    
    meaningful_words = [w for w in words if w not in stop]
    
    # Optional stemming 
    if stemming == True:
        s_stemmer = SnowballStemmer('english')
        meaningful_words = [s_stemmer.stem(w) for w in meaningful_words]
    
    if output_type == 'string':
        return(" ".join(meaningful_words)) 
    elif output_type == 'list':
        return meaningful_words

    
def getCleanPosts(df, output_type='string'):
    clean_posts = []
    for i, post in enumerate(df['Body'].values):        
        if (i+1) % 5000 == 0:
            print 'cleaning review #{} out of {}'.format(i+1, df.shape[0])
        clean_posts.append(processHTML(post, output_type))
    
    return clean_posts
        
def getBOW(posts, ngram_range=(1,3), max_features=5000, min_df=0.0, max_df = 1.0):
    # First create the vectorizer class
    vectorizer = CountVectorizer(ngram_range = ngram_range,
                                 min_df = min_df,
                                 max_df = max_df,
                                 analyzer='word',
                                 max_features = max_features,
                                 stop_words=None,
                                 preprocessor=None,
                                 tokenizer=None
                                )

    feature_matrix = vectorizer.fit_transform(posts)

    return feature_matrix  

def getTFIDF( clean_reviews, ngram_range=(1,3), max_features=5000, min_df=0.0, max_df = 1.0):
    vectorizer = TfidfVectorizer(ngram_range = ngram_range,
                                 min_df = min_df,
                                 max_df = max_df,
                                 analyzer='word',
                                 max_features = max_features,
                                 stop_words=None,
                                 preprocessor=None,
                                 tokenizer=None
                                )
    

    feature_matrix = vectorizer.fit_transform(clean_reviews)

    return feature_matrix

In [8]:
posts = getCleanPosts(post)



 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))


cleaning review #5000 out of 40476
cleaning review #10000 out of 40476
cleaning review #15000 out of 40476
cleaning review #20000 out of 40476
cleaning review #25000 out of 40476
cleaning review #30000 out of 40476
cleaning review #35000 out of 40476
cleaning review #40000 out of 40476


** Convert to Bag of Words **

In [183]:
# feature_matrix = getBOW(posts, ngram_range=(1,1))

# feature_matrix = feature_matrix.todense()

In [204]:
dictionary = corpora.Dictionary(posts)
corpus = [dictionary.doc2bow(text) for text in posts]

## LDA

In [224]:
# https://radimrehurek.com/gensim/models/ldamulticore.html#module-gensim.models.ldamulticore
ldamulticore = gensim.models.ldamulticore.LdaMulticore(corpus, num_topics=100, workers=3, id2word=dictionary)

In [225]:
print(ldamulticore.print_topics(num_topics=10, num_words=5))

[(41, u'0.009*"http" + 0.009*"strike" + 0.009*"travel" + 0.008*"visa" + 0.008*"would"'), (98, u'0.011*"restaur" + 0.010*"find" + 0.008*"bar" + 0.007*"like" + 0.007*"place"'), (61, u'0.015*"us" + 0.014*"passport" + 0.010*"record" + 0.009*"uk" + 0.008*"travel"'), (75, u'0.018*"batteri" + 0.011*"pakistan" + 0.010*"airport" + 0.008*"surnam" + 0.006*"travel"'), (64, u'0.014*"camp" + 0.011*"sleep" + 0.008*"also" + 0.007*"take" + 0.006*"car"'), (51, u'0.028*"card" + 0.022*"bank" + 0.009*"v" + 0.008*"statement" + 0.008*"use"'), (83, u'0.029*"ticket" + 0.027*"flight" + 0.024*"book" + 0.017*"airlin" + 0.011*"check"'), (42, u'0.027*"visa" + 0.017*"countri" + 0.016*"travel" + 0.015*"schengen" + 0.011*"would"'), (91, u'0.016*"seat" + 0.012*"use" + 0.011*"would" + 0.011*"travel" + 0.008*"flight"'), (6, u'0.022*"marri" + 0.016*"certif" + 0.014*"birth" + 0.012*"marriag" + 0.010*"travel"')]


In [229]:
print(ldamulticore[corpus[0]])[0] # get topic probability distribution for a document

(80, 0.96586206896551619)


In [238]:
doc_num = 0
ldamulticore.print_topic(ldamulticore[corpus[doc_num]][0][0], 20)

u'0.012*"place" + 0.012*"one" + 0.010*"like" + 0.009*"go" + 0.008*"would" + 0.007*"time" + 0.007*"get" + 0.006*"food" + 0.006*"look" + 0.006*"travel" + 0.005*"cruis" + 0.005*"day" + 0.005*"car" + 0.005*"restaur" + 0.005*"also" + 0.005*"take" + 0.004*"good" + 0.004*"way" + 0.004*"even" + 0.004*"buy"'

In [233]:
post['Body'][0]

"<p>My fianc\xc3\xa9e and I are looking for a good Caribbean cruise in October and were wondering which islands are best to see and which Cruise line to take?</p>\n\n<p>It seems like a lot of the cruises don't run in this month due to Hurricane season so I'm looking for other good options.</p>\n\n<p><strong>EDIT</strong> We'll be travelling in 2012.</p>\n"

The word list for the topic of the first document ('place', 'one','like') doesn't really match up with the tags for that document ('caribbean','cruising','vacation').

In [241]:
doc_num = 1
print ldamulticore.print_topic(ldamulticore[corpus[doc_num]][0][0], 20), '\n'
print post['Body'][doc_num]
print post['Tags'][doc_num]


0.026*"flight" + 0.016*"mile" + 0.013*"airlin" + 0.009*"ticket" + 0.008*"get" + 0.007*"time" + 0.007*"fare" + 0.007*"one" + 0.006*"fli" + 0.006*"need" + 0.006*"would" + 0.005*"book" + 0.005*"also" + 0.005*"may" + 0.005*"program" + 0.004*"check" + 0.004*"frequent" + 0.004*"free" + 0.004*"travel" + 0.004*"aa" 

<p>Singapore Airlines has an all-business class flight from EWR-SIN (Newark->Singapore), but I can't seem to find any reward Krisflyer flights for <em>any</em> dates.  </p>

<loyalty-programs><routes><ewr><singapore-airlines><sin>


The 'Body' text matches more closely with the words in the topic ('flight','mile','airline','ticket'), which sort of match the tags. 

**NOTE** I'm not sure how to find the optimal number of topics.


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 2. What makes an answer likely to be "accepted"?

---

**2.1 Build a model to predict whether a post will be marked as the answer.**

- This is a classification problem.
- You're free to use any of the machine learning algorithms or techniques we have learned in class to build the best model you can.
- NLP will be very useful here for pulling out useful and relevant features from the data. 
- Though not required, using bagging and boosting models like Random Forests and Gradient Boosted Trees will _probably_ get you the highest performance on the test data (but who knows!).


**2.2 Evaluate the performance of your classifier with a confusion matrix and accuracy. Explain how your model is performing.**

**2.3 Plot either a ROC curve or precision-recall curve (or both!) and explain what they tell you about your model.**

NOTE: You should only be predicting this for `PostTypeID=2` posts, which are the "answer" posts. This doesn't mean, however, that you can't or shouldn't use the parent questions as predictors!


I'm thinking that maybe you can quantify how similar an answer is to a question, and then use that info to make the prediction. Similarity can be quantified using cosine similarity. And in order to vectorize the text we can use tf-idf. I'm thinking that we should:
1. vectorize the text by making each set of question and answer it's own corpus (actually let's just first try using the entire dataset for our corpus). 
2. Then vectorize each corpus. 
3. Then calculate cosine similarity between each question-answer pair. 
4. Also calculate the length of each answer.
4. Create a feature for each answer from it's cosine similarity, and a feature from the answer's length. 
5. Run Random Forest, XGBoost, Logistic Regression using the cosine similarity as the feature. 



In [9]:
# We need to make a dataframe that contains only questions and answers 
posts_strings = getCleanPosts(post, output_type='string')

cleaning review #5000 out of 40476
cleaning review #10000 out of 40476
cleaning review #15000 out of 40476
cleaning review #20000 out of 40476
cleaning review #25000 out of 40476
cleaning review #30000 out of 40476
cleaning review #35000 out of 40476
cleaning review #40000 out of 40476


**Insert column for cleaned post text within original dataframe**

In [10]:
body_cleaned = pd.Series(posts_strings, name='Body_cleaned')

post.insert(loc=4, column='Body_cleaned', value=body_cleaned.values)

** TF-IDF on question-answer groups (with PCA) **

In [74]:
qa_groups = post.copy()
qa_groups['AcceptedAnswer'] = 0
qa_groups['component1'] = 0
qa_groups['component2'] = 0
qa_groups['component3'] = 0
qa_groups['component4'] = 0
# qa_groups.head()

In [75]:
for i, index in enumerate(post.index.values):
    if i % 500 == 0:
        print "calculating tf-idf for row {} of {}".format(i, post.index.values.shape[0])

    if post.ix[index,'PostTypeId'] == 1: # Question
        
        post_id = post.ix[index,'Id'] # grab question post ID

        accepted_answer_id = post.ix[index,'AcceptedAnswerId'] # grab the answer ID
        try:
            accepted_answer_id = int(accepted_answer_id)
        except: # the accepted_answer_id is np.nan
            accepted_answer_id = accepted_answer_id

        mask = (post['ParentId'] == post_id) & (post['PostTypeId']==2) # create mask to find answers to this question

        # create new temporary dataframe to hold question and answers
        temp_df = pd.DataFrame(post[mask]) 
        temp_df = pd.concat(objs=[post[post['Id']==post_id], temp_df], axis=0)

        # create array of post text 
        text_array = temp_df['Body_cleaned'].values

        # use tf-idf on new temp dataframe 
        tfidf_matrix = getTFIDF(text_array, ngram_range=(1,3)).todense()

        # Reduce size using PCA 
        n_components = 4 # CHOOSE THIS 
        
        pca = PCA(n_components=n_components)
        pca_matrix = pca.fit_transform(tfidf_matrix)

        # If pca reduced our features below n_components, then we won't be able to concat 
        if pca_matrix.shape[1] < n_components: 
            rows = pca_matrix.shape[0]
            cols = n_components-pca_matrix.shape[1]
            b = np.zeros((rows, n_components)) # make a matrix that is the size of pca_matrix + however many cols needed to make n_components
            b[:,:-cols] = pca_matrix
            pca_matrix = b
        

         # create new feature/column in qa_groups in temp_df with cosine similarity value for each answer 
        tfidf_df = pd.DataFrame(pca_matrix, index=temp_df.index.values, columns=['component1', 'component2', 'component3', 'component4'])

        # Concatenate old dataframe with new 
        temp_df = pd.concat(objs=[temp_df, tfidf_df], axis=1)

         # Assign values in CosineSimilarity column in temp_df to the respective rows in the qa_groups df 
        for row in temp_df.index.values:

            # update new dataframe with cosine similarity values
            qa_groups.ix[row, 'component1'] = temp_df.ix[row, 'component1']
            qa_groups.ix[row, 'component2'] = temp_df.ix[row, 'component2']
            qa_groups.ix[row, 'component3'] = temp_df.ix[row, 'component3']
            qa_groups.ix[row, 'component4'] = temp_df.ix[row, 'component4']

            # update new dataframe with AcceptedAnswer status
            if qa_groups.ix[row, 'Id'] == accepted_answer_id:
                qa_groups.ix[row, 'AcceptedAnswer'] = 1
            else:
                qa_groups.ix[row, 'AcceptedAnswer'] = 0
                
qa_groups['PostLength'] = qa_groups['Body_cleaned'].apply(lambda x: len(x))                

calculating cosine similarity for row 0 of 40476


  explained_variance_ratio_ = explained_variance_ / total_var


calculating cosine similarity for row 500 of 40476
calculating cosine similarity for row 1000 of 40476
calculating cosine similarity for row 1500 of 40476
calculating cosine similarity for row 2000 of 40476
calculating cosine similarity for row 2500 of 40476
calculating cosine similarity for row 3000 of 40476
calculating cosine similarity for row 3500 of 40476
calculating cosine similarity for row 4000 of 40476
calculating cosine similarity for row 4500 of 40476
calculating cosine similarity for row 5000 of 40476
calculating cosine similarity for row 5500 of 40476
calculating cosine similarity for row 6000 of 40476
calculating cosine similarity for row 6500 of 40476
calculating cosine similarity for row 7000 of 40476
calculating cosine similarity for row 7500 of 40476
calculating cosine similarity for row 8000 of 40476
calculating cosine similarity for row 8500 of 40476
calculating cosine similarity for row 9000 of 40476
calculating cosine similarity for row 9500 of 40476
calculating c

In [78]:
answers = qa_groups[qa_groups['PostTypeId']==2]  # grab just the answers
answers = answers.dropna(axis=0, subset=['AcceptedAnswer']) # drop the rows with np.nan in 'AcceptedAnswer's
answers['AcceptedAnswer'] = answers['AcceptedAnswer'].astype(int)

In [81]:
# Create design matrix 
features = ['CommentCount','component1', 'component2', 'component3', 'component4']
X = answers[features].values # grab features 
y = answers['AcceptedAnswer'].values # grab target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Logistic Regression

In [79]:
lr = LogisticRegression()

lr.fit(X_train, y_train)

pred = lr.predict(X_test)

scores = cross_val_score(estimator=lr, X=X, y=y, cv=7, scoring='accuracy')

print np.mean(scores)

0.728126181096


Random Forest 

In [82]:
rfc = RandomForestClassifier(n_estimators=100)

scores = cross_val_score(estimator=rfc, X=X, y=y, cv=3, scoring='accuracy')

print np.mean(scores)

0.696582801352


** TF-IDF + BoW + features => PCA => modeling (using entire corpus) **

In [11]:
qa_all = post.copy()
qa_all['AcceptedAnswer'] = 0

for i, index in enumerate(post.index.values):
    if i % 5000 == 0:
        print "creating answer column for row {} of {}".format(i, post.index.values.shape[0])

    if post.ix[index,'PostTypeId'] == 1: # Question
        
        post_id = post.ix[index,'Id'] # grab question post ID

        accepted_answer_id = post.ix[index,'AcceptedAnswerId'] # grab the answer ID
        try:
            accepted_answer_id = int(accepted_answer_id)
        except: # the accepted_answer_id is np.nan
            accepted_answer_id = accepted_answer_id

        mask = (post['ParentId'] == post_id) & (post['PostTypeId']==2) # create mask to find answers to this question

        # create new temporary dataframe to hold question and answers
        temp_df = pd.DataFrame(post[mask]) 
        temp_df = pd.concat(objs=[post[post['Id']==post_id], temp_df], axis=0)

        for row in temp_df.index.values:

            # update new dataframe with AcceptedAnswer status
            if qa_all.ix[row, 'Id'] == accepted_answer_id:
                qa_all.ix[row, 'AcceptedAnswer'] = 1
            else:
                qa_all.ix[row, 'AcceptedAnswer'] = 0
                
qa_all['PostLength'] = qa_all['Body_cleaned'].apply(lambda x: len(x))                

creating answer column for row 0 of 40476
creating answer column for row 5000 of 40476
creating answer column for row 10000 of 40476
creating answer column for row 15000 of 40476
creating answer column for row 20000 of 40476
creating answer column for row 25000 of 40476
creating answer column for row 30000 of 40476
creating answer column for row 35000 of 40476
creating answer column for row 40000 of 40476


In [12]:
answers = qa_all[qa_all['PostTypeId']==2]  # grab just the answers
answers = answers.dropna(axis=0, subset=['AcceptedAnswer']) # drop the rows with np.nan in 'AcceptedAnswer's
answers['AcceptedAnswer'] = answers['AcceptedAnswer'].astype(int)

In [13]:
clean_answers = getCleanPosts(answers)

cleaning review #5000 out of 23967
cleaning review #10000 out of 23967
cleaning review #15000 out of 23967
cleaning review #20000 out of 23967


In [21]:
answers = answers[['ViewCount','Score','FavoriteCount','CommentCount','AcceptedAnswer']]
answers = answers.drop(labels=['ViewCount','FavoriteCount','Score'], axis=1)

Creating TF-IDF matrix

In [22]:
pca = PCA(n_components=5)

vec = TfidfVectorizer(ngram_range = (1,3),
                     analyzer='word',
                     max_features = 3000,
                     stop_words=None,
                     preprocessor=None,
                     tokenizer=None
                    )

tfidf_matrix = vec.fit_transform(clean_answers).todense()

tfidf_df = pd.DataFrame(tfidf_matrix)

Creating BoW Matrix

In [23]:
countvec = CountVectorizer(ngram_range = (1,3),
                             analyzer='word',
                             max_features=3000,
                             stop_words=None,
                             preprocessor=None,
                             tokenizer=None
                            )

bow_matrix = countvec.fit_transform(clean_answers).todense()

bow_df = pd.DataFrame(bow_matrix)

Transforming with PCA

In [24]:
bow_pca = pca.fit_transform(bow_matrix)
tfidf_pca = pca.fit_transform(tfidf_matrix)

 Converting matrices into dataframes 

In [25]:
bow_pca_df = pd.DataFrame(bow_pca)
tfidf_pca_df = pd.DataFrame(tfidf_pca)
final_df = pd.concat(objs=[bow_pca_df, tfidf_pca_df, answers], axis=1)

Train test split

In [26]:
# Create design matrix 
X = answers.ix[:,:-1].values # grab features 
y = answers['AcceptedAnswer'].values # grab target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Model 

In [27]:
rfc = RandomForestClassifier(n_estimators=100)

scores = cross_val_score(estimator=rfc, X=X, y=y, cv=3, scoring='accuracy')

print np.mean(scores)

0.727875829265


** Cosine Similarity **
----------------------

In [33]:
#---- TESTS -----#
# mask = (post['ParentId'] == 1) & (post['PostTypeId']==2)
# temp_df = pd.DataFrame(post[mask]) # create new temporary dataframe to hold question and answers

# temp_df = pd.concat(objs=[post[post['Id']==1], temp_df], axis=0)

# text_array = temp_df['Body_cleaned'].values

# tfidf_matrix = getTFIDF(text_array, max_features=20, ngram_range=(1,3)).todense()

# cosine_similarity = (tfidf_matrix * tfidf_matrix.T).A

# tfidf_df = pd.DataFrame(cosine_similarity[:, 0], index=temp_df.index.values, columns=['CosineSimilarity'])

# pd.concat(objs=[temp_df, tfidf_df], axis=1)

In [300]:
question_answers = post.copy()
question_answers['CosineSimilarity'] = np.nan
question_answers['AcceptedAnswer'] = 0

In [353]:
# I hope there are no questions with 0 answers in the post df

for i, index in enumerate(post.index.values[:500]):
    if i % 500 == 0:
        print "calculating cosine similarity for row {} of {}".format(i, post.index.values.shape[0])

    if post.ix[index,'PostTypeId'] == 1: # Question
        
        post_id = post.ix[index,'Id'] # grab question post ID
        
        accepted_answer_id = post.ix[index,'AcceptedAnswerId'] # grab the answer ID
        try:
            accepted_answer_id = int(accepted_answer_id)
        except: # the accepted_answer_id is np.nan
            accepted_answer_id = accepted_answer_id
            
        mask = (post['ParentId'] == post_id) & (post['PostTypeId']==2) # create mask to find answers to this question
        
        # create new temporary dataframe to hold question and answers
        temp_df = pd.DataFrame(post[mask]) 
        temp_df = pd.concat(objs=[post[post['Id']==post_id], temp_df], axis=0)
        
        # create array of post text 
        text_array = temp_df['Body_cleaned'].values
        
        # use tf-idf on new temp dataframe 
        tfidf_matrix = getTFIDF(text_array, max_features=20, ngram_range=(1,3)).todense()
        
        # calculate cosine similarity between each question (first row) and each answer (other rows)
        cosine_similarity = (tfidf_matrix * tfidf_matrix.T).A

        # create new feature/column in question_answers in temp_df with cosine similarity value for each answer 
        tfidf_df = pd.DataFrame(cosine_similarity[:, 0], index=temp_df.index.values, columns=['CosineSimilarity'])
        
        # Concatenate old dataframe with new 
        temp_df = pd.concat(objs=[temp_df, tfidf_df], axis=1)
        
        # Assign values in CosineSimilarity column in temp_df to the respective rows in the question_answers df 
        for row in temp_df.index.values:
            
            # update new dataframe with cosine similarity values
            question_answers.ix[row, 'CosineSimilarity'] = temp_df.ix[row, 'CosineSimilarity']
                   
            # update new dataframe with AcceptedAnswer status
            if question_answers.ix[row, 'Id'] == accepted_answer_id:
                question_answers.ix[row, 'AcceptedAnswer'] = 1
            else:
                question_answers.ix[row, 'AcceptedAnswer'] = 0
            
            
        del temp_df #delete dataframe
        
question_answers['PostLength'] = question_answers['Body_cleaned'].apply(lambda x: len(x))


calculating cosine similarity for row 0 of 40476
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan


6. Run RF, XGBOOST, LR to see which model performs best. 

In [367]:
# grab just the answers
answers = question_answers[question_answers['PostTypeId']==2] 
# drop the rows with np.nan in 'AcceptedAnswer's
answers = answers.dropna(axis=0, subset=['AcceptedAnswer'])
answers['AcceptedAnswer'] = answers['AcceptedAnswer'].astype(int)
# answers[answers['AcceptedAnswer'].isnull()]

Check the baseline and imbalance 

In [381]:
from __future__ import division 

print 'baseline accuracy:', 1-answers[answers['AcceptedAnswer']==1].shape[0] / answers.shape[0] 

baseline accuracy: 0.728126173489


In [384]:
# Create design matrix 
features = ['PostLength']
X = answers[features].values # grab features 
y = answers['AcceptedAnswer'].values # grab target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

lr = LogisticRegression()

lr.fit(X_train, y_train)

pred = lr.predict(X_test)

scores = cross_val_score(estimator=lr, X=X, y=y, cv=7, scoring='accuracy')

print np.mean(scores)

0.726707440186


In [383]:
# Create design matrix 
features = ['CosineSimilarity','PostLength']
X = answers[features].values # grab features 
y = answers['AcceptedAnswer'].values # grab target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

lr = LogisticRegression()

lr.fit(X_train, y_train)

pred = lr.predict(X_test)

scores = cross_val_score(estimator=lr, X=X, y=y, cv=7, scoring='accuracy')

print np.mean(scores)

0.72679087259


In [388]:
params = {
    'n_estimators':[100, 400],
    'max_depth':[None,50, 100], 
    'min_samples_split':[4, 7], 
    'min_samples_leaf':[2, 6]
}

gs = GridSearchCV(estimator=RandomForestClassifier(), param_grid=params, scoring='accuracy', verbose=1, cv=3)

gs.fit(X_train, y_train)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:  4.8min finished


GridSearchCV(cv=3, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [100, 400], 'min_samples_split': [4, 7], 'max_depth': [None, 50, 100], 'min_samples_leaf': [2, 6]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=1)

In [390]:
print gs.best_score_

0.710843373494


---
** Conclusion**: I tried the following methods:
1. TF-IDF on question answer pairs + PCA -> LR, RF  -> accuracy : 0.728126181096
2. TF-IDF on question answer pairs + Cosine Similarity -> LR, RF -> accuracy: 0.72679087259
3. TF-IDF + BoW + PCA on entire corpus -> LR, RF -> accuracy: 0.727875829265

But I was unable to get an accuracy score that was higher than the baseline (72%). 

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 3. What is the score of a post?

---

**3.1 Build a model that predicts the score of a post.**

- This is a regression problem now. 
- You can and should be predicting score for both "question" and "answer" posts, so keep them both in your dataset.
- Again, use any techniques that you think will get you the best model.

**3.2 Evaluate the performance of your model with cross-validation and report the results.**

**3.3 What is important for determining the score of a post, if anything?**


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 4. How many views does a post have?

---

**4.1 Build a model that predicts the number of views a post has.**

- This is another regression problem. 
- Predict the views for all posts, not just the "answer" posts.

**4.2 Evaluate the performance of your model with cross-validation and report the results.**

**4.3 What is important for the number of views a post has, if anything?**

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5. Build a pipeline or other code to automate evaluation of your models on the test data.

---

Now that you've constructed your three predictive models, build a pipeline or code that can easily load up the raw testing data and evaluate your models on it.

The testing data that is held out is in the same raw format as the training data you have. _Any cleaning and preprocessing that you did on the training data will need to be done on the testing data as well!_

This is a good opportunity to practice building pipelines, but you're not required to. Custom functions and classes are fine as long as they are able to process and test the new data.


<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 6. Lets Model - Tournament for stock market predictions

>Start this section of the project by downloading the train and test datasets from the following site: https://numer.ai/rules

> - The data set is clean, your goal is to develop a classification model(s) 
> - Report all the results including log loss, and other coefficients you consider iteresting