## Natural Language Processing in a Kaggle Competition for Movie Reviews

**Shivam Panchal**
**(shivam.panchal@mail.com)**

**Importing the required libraries**

In [1]:
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # English stop-word list used by review_to_words

## Cleaning the Reviews

**Now we will define a function that cleans up the reviews for us.**

In [2]:
def review_to_words(raw_review):
    # Convert a raw review to a string of words.
    # The input is a single string (a raw movie review), and
    # the output is a single string (a preprocessed movie review).
    #
    # 1. Remove HTML. Naming the parser explicitly avoids the
    #    "no parser was explicitly specified" warning from bs4.
    review_text = BeautifulSoup(raw_review, "lxml").get_text()
    #
    # 2. Replace everything that is not a letter with a space
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by spaces,
    #    and return the result.
    return " ".join(meaningful_words)
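To see what the cleaning steps do to a single review, here is a minimal, standard-library-only sketch. It is not the function above: it swaps in Python's built-in `html.parser` for BeautifulSoup and a tiny made-up stop-word set (`TOY_STOPS`) for the NLTK list, and the sample review is invented for illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text between tags (a stand-in for get_text())."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.chunks)


# Hypothetical stop words; the notebook uses NLTK's much longer English list.
TOY_STOPS = {"the", "was", "a", "it", "and", "this", "is"}


def clean_demo(raw_review, stops=TOY_STOPS):
    text = strip_html(raw_review)                      # 1. remove HTML
    letters_only = re.sub("[^a-zA-Z]", " ", text)      # 2. keep letters only
    words = letters_only.lower().split()               # 3. lower-case and split
    return " ".join(w for w in words if w not in stops)  # 4. drop stop words


sample = "The plot was <b>great</b>!<br />It's a 10/10, and the cast is superb."
cleaned = clean_demo(sample)
print(cleaned)  # plot great s cast superb
```

Note the stray `s` left over from "It's": replacing the apostrophe with a space splits the contraction, so the quality of the stop-word list matters.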

**Great! Now, it's time to go ahead and load the data**
**Link: https://www.kaggle.com/c/word2vec-nlp-tutorial/data**

In [3]:
import pandas as pd

## Loading the Datasets. We will use the labeled training data and the test data.

In [4]:
# Setting the work directory
import os
os.chdir('C://users/Shivam Panchal/Desktop/bags of popcorn/Datasets')
print(os.getcwd())

C:\users\Shivam Panchal\Desktop\bags of popcorn\Datasets


In [5]:
train = pd.read_csv('labeledTrainData.tsv', header = 0, delimiter="\t", quoting =3)

test = pd.read_csv('testData.tsv', header = 0, delimiter="\t", quoting =3)

**Get the labels from our training dataset to teach our classifier.**

In [6]:
y_train = train['sentiment']

**Now we need to clean both the train and test data to get it ready for the next part of our program.**

In [7]:
from nltk.corpus import stopwords # Import the stop word list

num_reviews = train["review"].size
print("Cleaning and parsing the training set movie reviews...\n")
clean_train_reviews = []
for i in range(num_reviews):
    # Print a progress message every 1000 reviews
    if (i + 1) % 1000 == 0:
        print("Review %d of %d" % (i + 1, num_reviews))
    clean_train_reviews.append(review_to_words(train["review"][i]))

Cleaning and parsing the training set movie reviews...







Review 1000 of 25000
Review 2000 of 25000
Review 3000 of 25000
Review 4000 of 25000
Review 5000 of 25000
Review 6000 of 25000
Review 7000 of 25000
Review 8000 of 25000
Review 9000 of 25000
Review 10000 of 25000
Review 11000 of 25000
Review 12000 of 25000
Review 13000 of 25000
Review 14000 of 25000
Review 15000 of 25000
Review 16000 of 25000
Review 17000 of 25000
Review 18000 of 25000
Review 19000 of 25000
Review 20000 of 25000
Review 21000 of 25000
Review 22000 of 25000
Review 23000 of 25000
Review 24000 of 25000
Review 25000 of 25000


In [8]:
len(clean_train_reviews)

25000

# TF-IDF Vectorization

**The next step is to build TF-IDF (term frequency-inverse document frequency) vectors from our reviews. In case you are not familiar with the technique: we measure how often a term occurs in a review, then down-weight it according to how many reviews that term occurs in.**

This can be a great technique for determining which words (or n-grams of words) make good features for classifying a review as positive or negative.

To do this, we are going to use the TF-IDF vectorizer from scikit-learn and then decide what settings to use.
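The weighting can be worked through by hand on a toy corpus. The sketch below uses only the standard library; the three documents are made up, and the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization is an assumption chosen to mirror what scikit-learn documents as its default behaviour, not a copy of its implementation.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of already-cleaned reviews.
docs = [
    "great movie great cast",
    "boring movie",
    "great fun",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many reviews does each term appear?
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))


def idf(term):
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(tokens):
    counts = Counter(tokens)                       # term frequency
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}  # L2-normalized


weights = tfidf(tokenized[1])  # the review "boring movie"
for term, w in sorted(weights.items()):
    print(term, round(w, 3))
```

In "boring movie" both words occur once, but "movie" also appears in another review, so "boring" ends up with the larger weight.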

In [9]:
# Read the test data
test = pd.read_csv("testData.tsv", header=0, delimiter="\t", quoting=3)
# Verify that there are 25,000 rows and 2 columns
print(test.shape)
# Create an empty list and append the clean reviews one by one
num_reviews = len(test["review"])
clean_test_reviews = [] 

print("Cleaning and parsing the test set movie reviews...")
for i in range(num_reviews):
    # Print a progress message every 5000 reviews
    if (i + 1) % 5000 == 0:
        print("Review %d of %d" % (i + 1, num_reviews))
    clean_review = review_to_words(test["review"][i])
    clean_test_reviews.append(clean_review)

(25000, 2)
Cleaning and parsing the test set movie reviews...
Review 5000 of 25000
Review 10000 of 25000
Review 15000 of 25000
Review 20000 of 25000
Review 25000 of 25000


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer as TFIDFV

In [12]:
tfv = TFIDFV(min_df=3, max_features=None,
            strip_accents='unicode', analyzer='word', 
            token_pattern=r'\w{1,}', stop_words= 'english')

**Now that we have the vectorization object, we need to fit it on all of the data (both training and test) so that both datasets share the same vocabulary and weights. This could take some time on your computer!**

In [17]:
X_all = clean_train_reviews + clean_test_reviews # Combine both to fit the TFIDF vectorization.

tfv.fit(X_all) # This is the slow part!
X_all = tfv.transform(X_all)

X = X_all[:25000] # Separate back into training and test sets. 
X_test = X_all[25000:]
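The fit-on-everything-then-split pattern is easier to see on a toy example. This is a minimal sketch assuming a reasonably recent scikit-learn and default vectorizer settings; the three short documents and the `_demo` names are hypothetical. It slices by `len(train_docs)` rather than a hard-coded row count.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned reviews standing in for the real lists.
train_docs = ["good film", "bad film"]
test_docs = ["good plot"]
n_train = len(train_docs)

vectorizer = TfidfVectorizer()
# Fit on the combined text so train and test share one vocabulary.
X_all_demo = vectorizer.fit_transform(train_docs + test_docs)

# Rows keep their order, so slicing by the training size separates them again.
X_train_demo = X_all_demo[:n_train]
X_test_demo = X_all_demo[n_train:]

print(X_train_demo.shape, X_test_demo.shape)  # (2, 4) (1, 4)
```

Both matrices have the same four columns (bad, film, good, plot), including "plot", which only occurs in the test text.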

# Making our Classifier

In [18]:
X.shape

(25000, 47038)

That means we have 25,000 training examples (rows) and 47,038 features (columns). We need something computationally efficient given how many features we have. A random forest would be unwieldy at this dimensionality (and older scikit-learn releases could not train random forests on sparse matrices at all). That means we need something lightweight and fast that scales well to many dimensions. Some possible candidates are:

- Naive Bayes
- Logistic Regression
- SGD Classifier (uses Stochastic Gradient Descent for a much faster runtime)

**Let’s just try all three as submissions to Kaggle and see how they perform.**

**First up: Logistic Regression**

In [19]:
from sklearn.linear_model import LogisticRegression as LR
from sklearn.grid_search import GridSearchCV  # lives in sklearn.model_selection from scikit-learn 0.18 onward

In [28]:
# The grid holds a single value of C, so the search just cross-validates that setting.
grid_values = {'C': [30]}

# Set the scoring to what the contest asks for: area under the ROC curve.
# dual=True is only supported by the liblinear solver, so name it explicitly.
model_LR = GridSearchCV(LR(C=30, class_weight=None, dual=True, fit_intercept=True,
                           intercept_scaling=1, penalty='l2', random_state=0,
                           solver='liblinear', tol=0.0001),
                        grid_values, scoring='roc_auc', cv=20)

model_LR.fit(X, y_train)



GridSearchCV(cv=20, error_score='raise',
       estimator=LogisticRegression(C=30, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='L2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1, param_grid={'C': [30]},
       pre_dispatch='2*n_jobs', refit=True, scoring='roc_auc', verbose=0)

**You can investigate which parameters did the best and what scores they received by looking at the model_LR object.**
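As a minimal sketch of that inspection, the snippet below runs a small grid search on synthetic data and reads the results back. It assumes a recent scikit-learn, where the search lives in `sklearn.model_selection` and per-setting scores are exposed through `cv_results_`; the data, the `C` grid, and the `_demo` names are made up for illustration and are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data: 200 samples, 20 features, binary labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.1, 1, 10]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_demo, y_demo)

# Best setting and its mean cross-validated AUC.
print(search.best_params_)
print(search.best_score_)

# Mean AUC for every setting that was tried.
for params, mean in zip(search.cv_results_["params"],
                        search.cv_results_["mean_test_score"]):
    print(params, round(mean, 4))
```

The same attributes (`best_params_`, `best_score_`) are what you would read off `model_LR` after fitting.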

# Let’s make our Multinomial Naive Bayes object, and train it.

In [29]:
from sklearn.naive_bayes import MultinomialNB as MNB


model_NB = MNB()
model_NB.fit(X, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

# One last classifier to try is the SGD classifier, which comes in handy when you need speed on a really large number of training examples/features.

In [33]:
from sklearn.linear_model import SGDClassifier as SGD


  
# Find out which regularization parameter (alpha) works best.
# n_iter, fit_params and iid are argument names from the scikit-learn
# release used for this run; later releases renamed or removed them.
model_SGD = GridSearchCV(cv=20,
                         estimator=SGD(alpha=0.0001, class_weight=None, epsilon=0.1,
                                       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
                                       learning_rate='optimal', loss='modified_huber',
                                       n_iter=5, n_jobs=1, penalty='l2', power_t=0.5,
                                       random_state=0, shuffle=True, verbose=0,
                                       warm_start=False),
                         fit_params={}, iid=True, n_jobs=1,
                         param_grid={'alpha': [6e-05, 7e-05, 8e-05, 0.0001, 0.0005]},
                         pre_dispatch='2*n_jobs', refit=True,
                         scoring='roc_auc', verbose=0)
                            
model_SGD.fit(X, y_train) # Fit the model.

GridSearchCV(cv=20, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='modified_huber', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=0, shuffle=True, verbose=0,
       warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': [6e-05, 7e-05, 8e-05, 0.0001, 0.0005]},
       pre_dispatch='2*n_jobs', refit=True, scoring='roc_auc', verbose=0)

**Again, similar to the Logistic Regression model, we can see which parameter did the best.**

In [51]:
model_SGD.grid_scores_

[mean: 0.96310, std: 0.00498, params: {'alpha': 6e-05},
 mean: 0.96334, std: 0.00498, params: {'alpha': 7e-05},
 mean: 0.96350, std: 0.00495, params: {'alpha': 8e-05},
 mean: 0.96363, std: 0.00498, params: {'alpha': 0.0001},
 mean: 0.95919, std: 0.00526, params: {'alpha': 0.0005}]
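Reading the winner off that list can also be done programmatically. This small sketch copies the mean scores printed above into a plain dictionary (the `mean_auc` name is just for illustration) and picks the `alpha` with the highest mean AUC.

```python
# Mean cross-validated AUC per alpha, copied from the grid_scores_ output above.
mean_auc = {
    6e-05: 0.96310,
    7e-05: 0.96334,
    8e-05: 0.96350,
    0.0001: 0.96363,
    0.0005: 0.95919,
}

best_alpha = max(mean_auc, key=mean_auc.get)
print(best_alpha, mean_auc[best_alpha])  # 0.0001 0.96363
```

The differences between the first four settings are far smaller than the reported standard deviations of about 0.005, so the choice among them is not critical.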

***With a best mean AUC of 0.96363 at alpha = 0.0001, this beat our previous Logistic Regression model by a very small amount. Now that we have our three models, we can write out our predictions in the proper format. Contest participants found that submitting the predicted probability for each review scored better than submitting the final 0/1 prediction, which makes sense for a ranking-based metric like area under the ROC curve, so that is what we output.***
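The point about probabilities versus hard labels can be checked on a toy example. The sketch below computes the area under the ROC curve directly from its ranking definition (the fraction of positive/negative pairs ordered correctly, with ties counted as half); the labels and scores are invented, and the helper `roc_auc` is written here for illustration rather than taken from any library.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct pair
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities of a positive review.
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.4, 0.35, 0.8, 0.55, 0.9]

# The same predictions collapsed to hard 0/1 labels at a 0.5 threshold.
labels = [1 if p >= 0.5 else 0 for p in probs]

auc_probs = roc_auc(y_true, probs)
auc_labels = roc_auc(y_true, labels)
print(round(auc_probs, 3), round(auc_labels, 3))  # 0.778 0.667
```

Thresholding throws away the ordering within each predicted class, which creates ties and lowers the score here.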

# Submissions

**First, do our Logistic Regression submission.**

In [35]:
LR_result = model_LR.predict_proba(X_test)[:,1] # We only need the probabilities that the movie review was a 7 or greater. 
LR_output = pd.DataFrame(data={"id":test["id"], "sentiment":LR_result}) # Create our dataframe that will be written.
LR_output.to_csv('Logistic_Reg_Proj.csv', index=False, quoting=3) # Get the .csv file we will submit to Kaggle.

**Repeat this for Multinomial Naive Bayes**

In [37]:
MNB_result = model_NB.predict_proba(X_test)[:,1]
MNB_output = pd.DataFrame(data={"id":test["id"], "sentiment":MNB_result})
MNB_output.to_csv('MNB_Proj.csv', index = False, quoting = 3)

**Last, do the Stochastic Gradient Descent model**

In [38]:
SGD_result = model_SGD.predict_proba(X_test)[:,1]
SGD_output = pd.DataFrame(data={"id":test["id"], "sentiment":SGD_result})
SGD_output.to_csv('SGD_Proj.csv', index = False, quoting = 3)

**Submitting the results**