## This notebook contains 1-gram TF-IDF without any other text preprocessing, and 2-gram TF-IDF with only the low-frequency 2-grams removed. Both use logistic regression.
* Logistic regression over a bag of 1-grams with TF-IDF reaches 88.3% prediction accuracy
* Logistic regression over a bag of 2-grams with TF-IDF reaches 89.2% prediction accuracy (an increase of 0.9 percentage points)


First, let us look at the details of logistic regression over a bag of 1-grams with TF-IDF.

**What is TF-IDF?**    

From Wikipedia: TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. 

Example:  
Consider a document of 100 words in which the word "business" appears 10 times. The term frequency of "business" is 10 / 100 = 0.1. In addition, assume we have 10,000 documents and "business" appears in 100 of them. The inverse document frequency, using a base-10 logarithm, is log10(10000 / 100) = 2. Thus, the TF-IDF weight of "business" is the product 0.1 * 2 = 0.2.
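The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the textbook formula with a base-10 logarithm; the variable names are made up for illustration.

```python
import math

# Numbers taken from the "business" example above.
words_in_document = 100
occurrences_of_term = 10
documents_in_corpus = 10_000
documents_with_term = 100

# Term frequency: share of the document's words that are the term.
tf = occurrences_of_term / words_in_document

# Inverse document frequency with a base-10 logarithm, as in the example.
idf = math.log10(documents_in_corpus / documents_with_term)

# The TF-IDF weight is the product of the two.
tf_idf = tf * idf
print(f"tf={tf}, idf={idf}, tf-idf={tf_idf}")
```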

In practice, we don't calculate the TF-IDF weight of every word in the dataset by hand. In sklearn, TfidfVectorizer does it for us.
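A minimal sketch of TfidfVectorizer on a toy corpus follows. The three reviews are invented for illustration and are not from the dataset. With its default settings, TfidfVectorizer uses a smoothed IDF with a natural logarithm and scales each row to unit L2 length, so its weights will not match the hand calculation above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only.
toy_reviews = [
    "great movie great cast",
    "boring movie",
    "great fun",
]

vectorizer = TfidfVectorizer()
toy_features = vectorizer.fit_transform(toy_reviews)  # SciPy sparse matrix

# One row per document, one column per distinct word.
print(toy_features.shape)

# The vocabulary maps each word to its column index.
print(sorted(vectorizer.vocabulary_))

# Dense view of the weights; only safe because the toy matrix is tiny.
print(toy_features.toarray().round(3))
```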

**Why do I choose logistic regression?**    

After taking a bag of 1-grams with TF-IDF, we get a feature matrix with 25,000 rows and 75,000 columns. It is extremely sparse: 99.8% of its values are zero.
Linear models handle sparse, high-dimensional data well, and this is a binary classification problem (positive vs. negative review). Thus, I choose logistic regression.  
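To get a feeling for what that sparsity means, here is a rough back-of-the-envelope estimate of the memory needed for a matrix of this size. It is only a sketch: it uses the rounded figures quoted above and assumes 8-byte floats and a CSR layout with 4-byte indices.

```python
rows, cols = 25_000, 75_000   # matrix shape quoted above
nonzero_per_thousand = 2      # 99.8% zeros means 0.2% non-zero entries

# Dense storage: one 8-byte float64 per cell.
dense_bytes = rows * cols * 8

# CSR storage (assumed layout): an 8-byte value plus a 4-byte column index
# per non-zero entry, plus one 4-byte row pointer per row (and one extra).
nonzeros = rows * cols * nonzero_per_thousand // 1000
sparse_bytes = nonzeros * (8 + 4) + (rows + 1) * 4

print(f"dense:  {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

Under these assumptions the dense form needs gigabytes while the sparse form needs megabytes, which is why it matters whether a model can work on the sparse matrix directly.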

Let's look at the code for logistic regression over a bag of 1-grams with TF-IDF.


In [2]:
# import required libraries
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
# import os
# import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [4]:
# store all train reviews in the reviews_train list
with open('movie_data/full_train.txt') as f:
    reviews_train = [line.strip() for line in f]

# store all test reviews in the reviews_test list
with open('movie_data/full_test.txt') as f:
    reviews_test = [line.strip() for line in f]

In [3]:
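# take a bag of 1-grams with tfidf
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(reviews_train)
# note: todense() builds the full 25,000 x 75,000 matrix in memory
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
X = pd.DataFrame(features.todense(),
            columns=tfidf.get_feature_names_out())
X_test = tfidf.transform(reviews_test)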

In [4]:
# the first 12500 rows are positive reviews, the rest are negative reviews
target = [1 if i < 12500 else 0 for i in range(25000)]

# hold out 25% of the training data as a validation set to choose C
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.75)

# try several values of C (inverse regularization strength) and compare validation accuracy
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print("Accuracy for C=%s: %s" % (c, accuracy_score(y_val, lr.predict(X_val))))

Accuracy for C=0.01: 0.7936
Accuracy for C=0.05: 0.83152
Accuracy for C=0.25: 0.86832
Accuracy for C=0.5: 0.87968
Accuracy for C=1: 0.88672


In [5]:
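# C=1 has the highest validation accuracy, so refit on all training data with C=1
# the test accuracy is 88.3%, well above random classification (50%)
# note: target is reused as the test labels, which assumes the test file has the same
# layout as the training file (first 12500 reviews positive, the rest negative)
final_model = LogisticRegression(C=1)
final_model.fit(X, target)
print("Final Accuracy: %s" % accuracy_score(target, final_model.predict(X_test)))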

Final Accuracy: 0.88316


Look at the top positive words: great, excellent, best, perfect, wonderful. The model captured positive sentiment even though it knows nothing about English. It also learned negative sentiment, as the top negative words show: worst, bad, awful, waste, boring. **Pretty Cool!**

In [10]:
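# find the top positive and negative words by model coefficient
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
feature_to_coef = {word: coef for word, coef in zip(tfidf.get_feature_names_out(), final_model.coef_[0])}

# largest coefficients first: words that push the prediction toward "positive"
positives = sorted(feature_to_coef.items(), key=lambda x: x[1], reverse=True)
for best_positive in positives[:5]:
    print("top positive word: ", best_positive)

print('-'*60)
# smallest coefficients first: words that push the prediction toward "negative"
negatives = sorted(feature_to_coef.items(), key=lambda x: x[1])

for best_negative in negatives[:5]:
    print("top negative word: ", best_negative)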

top positive word:  ('great', 7.597249852954926)
top positive word:  ('excellent', 6.184830619925593)
top positive word:  ('best', 5.127475318155725)
top positive word:  ('perfect', 4.818259658999337)
top positive word:  ('wonderful', 4.675988654271976)
------------------------------------------------------------
top negative word:  ('worst', -9.241902496068317)
top negative word:  ('bad', -7.9557462534905765)
top negative word:  ('awful', -6.4928601936902535)
top negative word:  ('waste', -6.278174894676147)
top negative word:  ('boring', -6.019791640101872)


Next, let us try to improve the model with logistic regression over a bag of 2-grams with TF-IDF. With ngram_range=(1, 2) the features include the single words as well as the 2-grams.  
This time we throw away the n-grams that appear in fewer than 5 documents (min_df=5), because they are likely typos or phrasings that people rarely use, and some of them make no sense. After throwing away the low-frequency n-grams, we get a matrix with 25000 rows and 156821 columns.  
This time we just use C=1 for logistic regression: if we search for the best C on a validation split as in the 1-gram case, the kernel dies and restarts.
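The effect of min_df can be sketched in plain Python. The threshold is applied to the document frequency, that is, the number of reviews that contain an n-gram, not to its total number of occurrences. The toy reviews and the threshold of 2 below are made up for illustration, and the helper `bigrams` is hypothetical, not part of sklearn.

```python
from collections import Counter

# Toy corpus, made up for illustration only.
toy_reviews = [
    "the worst movie",
    "the worst acting",
    "the wurst movie",  # typo on purpose
]
min_df = 2  # hypothetical threshold for the toy corpus; the notebook uses 5


def bigrams(text):
    """Return the set of distinct 2-grams in one review."""
    words = text.split()
    return {" ".join(pair) for pair in zip(words, words[1:])}


# Document frequency: number of reviews containing each 2-gram at least once.
document_frequency = Counter()
for review in toy_reviews:
    document_frequency.update(bigrams(review))

kept = sorted(g for g, df in document_frequency.items() if df >= min_df)
dropped = sorted(g for g, df in document_frequency.items() if df < min_df)
print("kept:   ", kept)
print("dropped:", dropped)
```

Only "the worst" occurs in at least two reviews, so it is the only 2-gram kept; the typo "the wurst" is dropped along with the other one-off 2-grams.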

In [5]:
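## 1-grams and 2-grams, throwing away n-grams that appear in fewer than 5 documents
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
features = tfidf.fit_transform(reviews_train)
# note: todense() builds the full 25000 x 156821 matrix in memory
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
X = pd.DataFrame(features.todense(),
            columns=tfidf.get_feature_names_out())
X_test = tfidf.transform(reviews_test)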

In [6]:
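# the test accuracy increased to 89.2%
# note: target is reused as the test labels, which assumes the test file has the same
# layout as the training file (first 12500 reviews positive, the rest negative)
target = [1 if i < 12500 else 0 for i in range(25000)]
model = LogisticRegression(C=1)
model.fit(X, target)
print("Final Accuracy: %s" % accuracy_score(target, model.predict(X_test)))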

Final Accuracy: 0.89228
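The search for C that was skipped for this model might become feasible if the features stay in the sparse format returned by fit_transform, because LogisticRegression accepts sparse input directly. The sketch below shows the idea on a toy corpus invented for illustration. It has not been run on the real dataset, so whether it avoids the kernel restart there is an assumption. It also swaps the single validation split for 2-fold cross-validation via cross_val_score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy corpus and labels, made up for illustration only (1 = positive).
toy_reviews = [
    "great film with an excellent cast",
    "wonderful story and a perfect ending",
    "the best film of the year",
    "great fun and a wonderful cast",
    "the worst film of the year",
    "boring story and awful acting",
    "a waste of time and a bad ending",
    "bad film with a boring cast",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Keep the SciPy sparse matrix as it is: no todense(), no DataFrame.
features = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(toy_reviews)

# Score each candidate C with 2-fold cross-validation on the sparse features.
mean_accuracy = {}
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    scores = cross_val_score(LogisticRegression(C=c), features, toy_labels, cv=2)
    mean_accuracy[c] = scores.mean()
    print(f"C={c}: mean accuracy {mean_accuracy[c]:.3f}")
```

The accuracies on such a tiny corpus are meaningless; the point is only that the loop never builds a dense matrix.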


In [7]:
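# find the top positive and negative n-grams by model coefficient
# one 2-gram ("the worst") shows up among the top negative ones
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
feature_to_coef = {word: coef for word, coef in zip(tfidf.get_feature_names_out(), model.coef_[0])}

# largest coefficients first: n-grams that push the prediction toward "positive"
positives = sorted(feature_to_coef.items(), key=lambda x: x[1], reverse=True)
for best_positive in positives[:5]:
    print("top positive word: ", best_positive)

print('-'*60)
# smallest coefficients first: n-grams that push the prediction toward "negative"
negatives = sorted(feature_to_coef.items(), key=lambda x: x[1])

for best_negative in negatives[:5]:
    print("top negative word: ", best_negative)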

top positive word:  ('great', 7.663454840228788)
top positive word:  ('excellent', 5.315721883208201)
top positive word:  ('wonderful', 4.363274665123357)
top positive word:  ('best', 4.132473357265538)
top positive word:  ('perfect', 4.124617663286926)
------------------------------------------------------------
top negative word:  ('bad', -8.448189376176252)
top negative word:  ('worst', -7.323039515276972)
top negative word:  ('the worst', -5.867879346841515)
top negative word:  ('awful', -5.76580591273964)
top negative word:  ('boring', -5.360840023004068)
