# Step 1: Download the dataset

**I have already downloaded the "Large Movie Review Dataset v1.0". To download it yourself, go to https://ai.stanford.edu/~amaas/data/sentiment/**

**Note: IMDB lets users rate movies on a scale from 1 to 10. The dataset's curators labelled anything with <= 4 stars as negative and >= 7 stars as positive. Reviews with 5 or 6 stars have been left out.**
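
The labelling rule in the note can be written as a small function. This is only a sketch: the name `label_from_rating` is hypothetical, and the dataset already ships with its labels assigned, so you would not normally need to run it.

```python
def label_from_rating(stars):
    """Map an IMDB star rating (1-10) to a sentiment label.

    Returns 0 for negative (<= 4 stars), 1 for positive (>= 7 stars),
    and None for the neutral ratings that the dataset leaves out.
    """
    if not 1 <= stars <= 10:
        raise ValueError("rating must be between 1 and 10")
    if stars <= 4:
        return 0
    if stars >= 7:
        return 1
    return None  # 5 and 6 stars are excluded from the dataset


labels = [label_from_rating(s) for s in (2, 5, 6, 9)]
print(labels)  # → [0, None, None, 1]
```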

# Step 2: Reading the dataset in Python.

In [4]:
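# Each file is expected to hold one review per line.
with open('full_train.txt', 'r', encoding="utf8") as f:
    review_train = [line.strip() for line in f]

# Read the test reviews the same way; `with` closes the file afterwards.
with open('full_test.txt', 'r', encoding="utf8") as f:
    review_test = [line.strip() for line in f]
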

# Step 3: Cleaning and Preprocessing.

**The raw text is pretty messy for these reviews so before we can do any analytics we need to clean things up.**

In [5]:
import re

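# Punctuation to delete outright, and separators (paired <br /> tags,
# hyphens, slashes) to turn into spaces. Raw strings keep the backslashes intact.
replace_no_space = re.compile(r"[.;:!'?,\"()\[\]]")
replace_with_space = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")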

def preprocess_review(reviews):
    reviews = [replace_no_space.sub("",line.lower()) for line in reviews]
    reviews = [replace_with_space.sub(" ",line) for line in reviews]
    return reviews

review_train_clean = preprocess_review(review_train)
review_test_clean = preprocess_review(review_test)

**This is how the first cleaned review looks:**

In [6]:
review_train_clean[0]

'bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt'
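
To see what the two patterns do step by step, here is a small sketch on a made-up sentence. The sample text is an assumption for illustration and is not taken from the dataset.

```python
import re

# Same patterns as in the cleaning cell, written as raw strings.
replace_no_space = re.compile(r"[.;:!'?,\"()\[\]]")
replace_with_space = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

sample = "A well-made film!<br /><br />Loved it (mostly)."

# Step 1: lowercase, then delete punctuation without leaving a gap.
step1 = replace_no_space.sub("", sample.lower())

# Step 2: turn paired <br /> tags, hyphens and slashes into spaces.
step2 = replace_with_space.sub(" ", step1)

print(step1)
print(step2)  # → a well made film loved it mostly
```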

**Vectorization**: In order for this data to make sense to our machine learning algorithm we’ll need to convert each review to a numeric representation, which we call vectorization.

In [16]:
# One-hot encoding using CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary = True)
cv.fit(review_train_clean)
X = cv.transform(review_train_clean)
X_test = cv.transform(review_test_clean)
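
To make the idea of a binary (presence/absence) vectorization concrete, here is a minimal pure-Python sketch on a toy corpus. It is not how `CountVectorizer` is implemented: the helper names are hypothetical, the tokenizer is a plain `split()`, and the result is a dense list rather than a sparse matrix.

```python
def build_vocabulary(docs):
    """Assign a column index to every distinct word, in sorted order."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


def binary_vector(doc, vocabulary):
    """Return a 0/1 list with a 1 wherever a vocabulary word occurs in doc."""
    present = set(doc.split())
    return [1 if word in present else 0 for word in vocabulary]


toy_reviews = ["great movie great cast", "boring movie"]
vocabulary = build_vocabulary(toy_reviews)
vectors = [binary_vector(doc, vocabulary) for doc in toy_reviews]

print(vocabulary)  # → {'boring': 0, 'cast': 1, 'great': 2, 'movie': 3}
print(vectors)     # → [[0, 1, 1, 1], [1, 0, 0, 1]]
```

Note that "great" appears twice in the first toy review but still contributes a single 1, which is what `binary = True` asks for.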

# Step 4: Building Classifier

**Logistic Regression is a good baseline model for this project for these reasons:
(1) It is easy to interpret.
(2) Linear models tend to perform well on sparse datasets like this one.
(3) It trains very quickly compared to other models because of its simplicity.**

**Note**: The targets/labels are the same for the training and test datasets because both are structured in the same way: the first 12.5k reviews are positive and the last 12.5k reviews are negative.

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

target = [1 if i<12500 else 0 for i in range(25000)]

X_train, X_val, Y_train, Y_val = train_test_split(X, target, train_size = 0.75)

for c in [0.01, 0.05,0.25, 0.5, 1]:
    lr = LogisticRegression(C=c, solver = 'lbfgs', max_iter = 1000)
    lr.fit(X_train, Y_train)
    print("Accuracy for C=%s: %s" 
           % (c, accuracy_score(Y_val, lr.predict(X_val))))

Accuracy for C=0.01: 0.8752
Accuracy for C=0.05: 0.88448
Accuracy for C=0.25: 0.8824
Accuracy for C=0.5: 0.87696
Accuracy for C=1: 0.87616


**The value of C that gives the highest validation accuracy is 0.05.**
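
Rather than reading the best value off the printout by eye, it can be picked programmatically. The sketch below simply hard-codes the validation accuracies printed above; in a real run you would collect them inside the loop instead (the dictionary name `val_accuracy` is an assumption).

```python
# Validation accuracies copied from the printed output of the C sweep.
val_accuracy = {
    0.01: 0.8752,
    0.05: 0.88448,
    0.25: 0.8824,
    0.5: 0.87696,
    1: 0.87616,
}

# Pick the C whose validation accuracy is highest.
best_c = max(val_accuracy, key=val_accuracy.get)
print(best_c, val_accuracy[best_c])  # → 0.05 0.88448
```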

# Training the final model

In [13]:
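final_model = LogisticRegression(C=0.05, solver="lbfgs", max_iter=1000)
final_model.fit(X, target)
# The test set has the same label layout as the training set,
# so `target` also serves as the test labels.
print("Final Accuracy: %s"
      % accuracy_score(target, final_model.predict(X_test)))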


Final Accuracy: 0.88156


As a sanity check, let's look at the 5 most discriminating words for both positive and negative reviews. We will do this by looking at the largest and the smallest coefficients, respectively.

In [21]:
# Requires scikit-learn >= 1.0; older versions use cv.get_feature_names() instead.
feature_to_coef = {
    word: coef for word, coef in zip(
        cv.get_feature_names_out(), final_model.coef_[0]
    )
}
for best_positive in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1], 
    reverse=True)[:5]:
    print (best_positive)


('excellent', 0.9287783331979843)
('perfect', 0.7916858965240479)
('great', 0.6740498048937984)
('amazing', 0.6131909605881833)
('superb', 0.6010839559331261)


In [24]:
for best_negative in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1])[:5]:
    print (best_negative)

('worst', -1.3646347396142875)
('waste', -1.1668254583062538)
('awful', -1.0321153550729136)
('poorly', -0.8751983530173937)
('boring', -0.8567551153619931)
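
If you only need the top few words, sorting the whole vocabulary is not required. Here is a small sketch using `heapq` from the standard library. The helper name `top_words` is hypothetical, the coefficients are rounded from the output above, and the value for 'movie' is made up to stand in for a neutral word.

```python
import heapq


def top_words(feature_to_coef, k, positive=True):
    """Return the k (word, coefficient) pairs with the largest or smallest coefficients."""
    pick = heapq.nlargest if positive else heapq.nsmallest
    return pick(k, feature_to_coef.items(), key=lambda item: item[1])


# Rounded coefficients from the printed output; 'movie' is a made-up neutral word.
toy_coefs = {
    "excellent": 0.93,
    "great": 0.67,
    "movie": 0.02,
    "waste": -1.17,
    "worst": -1.36,
}

print(top_words(toy_coefs, 2))                  # most positive words
print(top_words(toy_coefs, 2, positive=False))  # most negative words
```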


# Step 5: More Text Preprocessing.

**For our first iteration, we did only very basic cleaning: removing punctuation marks and HTML tags and converting everything to lower case. Now we will clean the text further by removing stop words and normalizing the text.**

To do these transformations, we will use the Natural Language Toolkit (NLTK), a very popular Python library.

In [28]:
import nltk
nltk.download('wordnet')    # used by the lemmatizer further down
nltk.download('stopwords')  # needed for stopwords.words('english')
from nltk.corpus import stopwords

# A set makes the membership test below much faster than a list.
english_stop_words = set(stopwords.words('english'))

def remove_stop_words(corpus):
    removed_stop_words = []
    for review in corpus:
        removed_stop_words.append(
            ' '.join([word for word in review.split()
                      if word not in english_stop_words])
        )
    return removed_stop_words

final_review = remove_stop_words(review_train_clean)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Pushkar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [30]:
print("Before")
print(" ")
print(review_train_clean[0])
print(" ")
print('After')
print(" ")
print(final_review[0])

Before
 
bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt
 
After
 
bromwell high cartoon comedy ran time programs school life teachers 35 years teaching profession lead believe bromwell highs satire much closer reality teachers scramble survive financially insightful students see righ

**A common next step in text preprocessing is to normalize the words in your corpus by converting all the different forms of a given word into one. Two methods exist for this: (1) stemming and (2) lemmatization.**
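
The difference between the two approaches can be sketched without any library. This toy version is only an illustration: the suffix list and the lookup table are hypothetical and far smaller than what the Porter stemmer or WordNet actually use. Stemming chops suffixes by rule and may produce non-words, while lemmatization maps a word to a dictionary form.

```python
# Hypothetical rule set, far smaller than the Porter stemmer's.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical lookup table standing in for a real dictionary such as WordNet.
LEMMA_TABLE = {"ran": "run", "better": "good", "movies": "movie"}


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def crude_lemma(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)


words = ["movies", "teaching", "ran", "better"]
print([crude_stem(w) for w in words])   # → ['movi', 'teach', 'ran', 'better']
print([crude_lemma(w) for w in words])  # → ['movie', 'teaching', 'run', 'good']
```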

In [32]:
# Stemming
def get_stemmed_text(corpus):
    from nltk.stem.porter import PorterStemmer
    stemmer = PorterStemmer()
    return [' '.join([stemmer.stem(word) for word in review.split()]) for review in corpus]

stemmed_reviews = get_stemmed_text(review_train_clean)

In [34]:
# Lemmatization
def get_lemmatized_text(corpus):
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]

lemmatized_reviews = get_lemmatized_text(review_train_clean)

In [36]:
print("No Normalization")
print(" ")
print(review_train_clean[0])
print(" ")
print("Stemming")
print(" ")
print(stemmed_reviews[0])
print(" ")
print("Lemmatization")
print(" ")
print(lemmatized_reviews[0])

No Normalization
 
bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt
 
Stemming
 
bromwel high is a cartoon comedi it ran at the same time as some other program about school life such as teacher my 35 year in the teach profess lead me to believ that bromwel high satir is much closer to 

**n-gram**: Previously, we used only single-word features in our model, which we call 1-grams or unigrams. We can potentially add more predictive power to our model by adding two- or three-word sequences (bigrams or trigrams) as well. For example, if a review contained the three-word sequence “didn’t love movie”, a unigram-only model would consider these words individually and probably not capture that this is actually a negative sentiment, because the word ‘love’ by itself is highly correlated with positive reviews.
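
As a rough sketch of what an n-gram range of 1 to 2 produces, the hypothetical helper below lists every unigram and bigram of a cleaned review. It splits on whitespace only, so it is simpler than the tokenizer `CountVectorizer` uses.

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams of text for n between min_n and max_n."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the review.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("didnt love movie"))
# → ['didnt', 'love', 'movie', 'didnt love', 'love movie']
```

The bigram "didnt love" is the feature that lets the model separate this review from one that simply says "love".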

In [39]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(review_train_clean)
X = ngram_vectorizer.transform(review_train_clean)
X_test = ngram_vectorizer.transform(review_test_clean)

X_train, X_val, Y_train, Y_val = train_test_split(
    X, target, train_size = 0.75
)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c, solver='lbfgs', max_iter=1000)
    lr.fit(X_train, Y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(Y_val, lr.predict(X_val))))

Accuracy for C=0.01: 0.884
Accuracy for C=0.05: 0.89008
Accuracy for C=0.25: 0.8928
Accuracy for C=0.5: 0.89344
Accuracy for C=1: 0.89424


**C=1 gives the best validation accuracy here, close to 90%.**

In [40]:
# final model.
final_ngram = LogisticRegression(C=1, solver='lbfgs', max_iter=1000)
final_ngram.fit(X, target)
print ("Final Accuracy: %s" 
       % accuracy_score(target, final_ngram.predict(X_test)))

Final Accuracy: 0.89728


**The accuracy is now pretty close to 90%! Simply considering two-word sequences in addition to single words increased our test accuracy by about 1.6 percentage points (from 0.88156 to 0.89728).**

# Representations

**Word Counts**: Instead of simply noting whether a word appears in the review or not, we can include the number of times a given word appears. This can give our sentiment classifier a lot more predictive power. For example, if a movie reviewer says ‘amazing’ or ‘terrible’ multiple times in a review it is considerably more probable that the review is positive or negative, respectively.
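
The contrast between the two representations is easy to see on a single made-up review. This is a sketch using `collections.Counter`; the review text is an assumption for illustration.

```python
from collections import Counter

review = "amazing cast amazing story amazing ending"

# Word-count representation: how many times each word occurs.
counts = Counter(review.split())

# Binary representation: 1 for every word that occurs at least once.
binary = {word: 1 for word in counts}

print(counts["amazing"], binary["amazing"])  # → 3 1
```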

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

wc_vectorizer = CountVectorizer(binary=False)
wc_vectorizer.fit(review_train_clean)
X = wc_vectorizer.transform(review_train_clean)
X_test = wc_vectorizer.transform(review_test_clean)

X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size = 0.75, 
)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c, solver='lbfgs', max_iter=1000)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

    
final_wc = LogisticRegression(C=0.05, solver='lbfgs', max_iter=1000)
final_wc.fit(X, target)
print ("Final Accuracy: %s" 
       % accuracy_score(target, final_wc.predict(X_test)))

Accuracy for C=0.01: 0.88112
Accuracy for C=0.05: 0.88576
Accuracy for C=0.25: 0.88208
Accuracy for C=0.5: 0.87984
Accuracy for C=1: 0.87888
Final Accuracy: 0.8822


**TF-IDF**: Another common way to represent each document in a corpus is to use the tf-idf statistic (term frequency-inverse document frequency) for each word, which is a weighting factor that we can use in place of binary or word count representations.
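
To show how the weighting works, here is a hand-rolled sketch of tf-idf on two toy documents. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function name `tfidf_vectors` is hypothetical and tokenization is a plain `split()`.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) with smoothed, L2-normalized tf-idf weights."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(docs)

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = [tf[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty docs
        rows.append([v / norm for v in raw])
    return vocabulary, rows


vocabulary, vectors = tfidf_vectors(["great movie", "boring movie"])
print(vocabulary)
print([round(v, 4) for v in vectors[0]])
```

In the first document "great" ends up with a larger weight than "movie", because "movie" appears in every document and therefore carries less information.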

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(review_train_clean)
X = tfidf_vectorizer.transform(review_train_clean)
X_test = tfidf_vectorizer.transform(review_test_clean)

X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size = 0.75
)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c, solver='lbfgs', max_iter=1000)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))
    
# C=0.25 is used here, although C=1 scored highest on the validation split in the output below.
final_tfidf = LogisticRegression(C=0.25, solver='lbfgs', max_iter=1000)
final_tfidf.fit(X, target)
print ("Final Accuracy: %s" 
       % accuracy_score(target, final_tfidf.predict(X_test)))


Accuracy for C=0.01: 0.792
Accuracy for C=0.05: 0.8256
Accuracy for C=0.25: 0.86384
Accuracy for C=0.5: 0.87728
Accuracy for C=1: 0.88608
Final Accuracy: 0.8676


So far we’ve chosen to represent each review as a very sparse vector (lots of zeros!) with a slot for every unique n-gram in the corpus (minus n-grams that appear too often or not often enough). Linear classifiers typically perform better than other algorithms on data that is represented in this way.
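
Here is a minimal sketch of why sparse inputs suit linear models: the score is just a sum over the few features that are actually present. The helper name `linear_score` is hypothetical, the weights are rounded from the logistic regression coefficients printed earlier, and the zero bias is an assumption.

```python
def linear_score(sparse_features, weights, bias=0.0):
    """Dot product between a sparse feature dict and a weight dict, plus bias."""
    return bias + sum(
        value * weights.get(name, 0.0)  # words without a weight contribute nothing
        for name, value in sparse_features.items()
    )


# Weights rounded from the coefficients printed earlier in the notebook.
weights = {"excellent": 0.93, "worst": -1.36, "waste": -1.17}

# A review stored sparsely: only the words that occur are listed.
review = {"excellent": 1, "cast": 1}

score = linear_score(review, weights)
prediction = 1 if score > 0 else 0
print(score, prediction)
```

Only two multiplications are needed here, no matter how many thousands of n-grams the full vocabulary contains.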

**Support Vector Machines (SVM)**:
Linear classifiers tend to work well on very sparse datasets like the one we have. Another algorithm that can produce great results with a quick training time is the Support Vector Machine with a linear kernel.

In [47]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(review_train_clean)
X = ngram_vectorizer.transform(review_train_clean)
X_test = ngram_vectorizer.transform(review_test_clean)

X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size = 0.75
)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    svm = LinearSVC(C=c, max_iter=1500)
    svm.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, svm.predict(X_val))))

    
final_svm_ngram = LinearSVC(C=0.01, max_iter=1500)
final_svm_ngram.fit(X, target)
print ("Final Accuracy: %s" 
       % accuracy_score(target, final_svm_ngram.predict(X_test)))

Accuracy for C=0.01: 0.89056
Accuracy for C=0.05: 0.88784
Accuracy for C=0.25: 0.8864
Accuracy for C=0.5: 0.88624
Accuracy for C=1: 0.88608
Final Accuracy: 0.89708


# Final Model

**I found that removing a small set of stop words, using an n-gram range from 1 to 3, and training a linear SVC gave me the best result.**

In [48]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC


stop_words = ['in', 'of', 'at', 'a', 'the']
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3), stop_words=stop_words)
ngram_vectorizer.fit(review_train_clean)
X = ngram_vectorizer.transform(review_train_clean)
X_test = ngram_vectorizer.transform(review_test_clean)

X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size = 0.75
)

for c in [0.001, 0.005, 0.01, 0.05, 0.1]:
    
    svm = LinearSVC(C=c, max_iter=1000)
    svm.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, svm.predict(X_val))))

    
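# C=0.01 is used here, although C=0.05 and C=0.1 scored slightly higher on the validation split in the output below.
final = LinearSVC(C=0.01, max_iter=1000)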
final.fit(X, target)
print ("Final Accuracy: %s" 
       % accuracy_score(target, final.predict(X_test)))

Accuracy for C=0.001: 0.8936
Accuracy for C=0.005: 0.89712
Accuracy for C=0.01: 0.89776
Accuracy for C=0.05: 0.89872
Accuracy for C=0.1: 0.89872
Final Accuracy: 0.90024


**Hurray! I have reached the 90% mark!**

**Summary**: We have gone over several options for transforming text that can improve the accuracy of an NLP model. Which combination of these techniques will yield the best results will depend on the task, data representations, and the algorithm you choose. It's always a good idea to try out different combinations to see what works well.