# Part 3: Text Data

# Summary of the business problem

The task is to perform sentiment analysis on IMDB movie reviews in order to understand audience preferences for movies.

# Solution details

In [1]:
!pip install nltk





## Load packages

In [127]:
import os
import re
import string
import sys

import numpy as np
import pandas as pd
import nltk
from nltk.collocations import *
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import WordPunctTokenizer
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## Load data

In [130]:
train_path = "aclImdb/train/" # source data
test_path = "aclImdb/test/" # test data for grade evaluation. 

## Remove stopwords

REMOVE_STOPWORDS takes a sentence and the stopwords as inputs and returns the sentence without any stopwords 
Sentence - The input from which the stopwords have to be removed
Stopwords - A list of stopwords  

In [5]:
def remove_stopwords(sentence, stopwords):
    sentencewords = sentence.split()
    resultwords  = [word for word in sentencewords if word.lower() not in stopwords]
    result = ' '.join(resultwords)
    return result
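
As a quick check of `remove_stopwords`, the sketch below applies it to one sentence. The three-word stopword set is a hypothetical stand-in for `stopwords.words("english")`, chosen only so that the example runs without downloading the NLTK corpus.

```python
def remove_stopwords(sentence, stopwords):
    # Same logic as the function above: drop every word whose
    # lowercase form appears in the stopword collection
    sentencewords = sentence.split()
    resultwords = [word for word in sentencewords if word.lower() not in stopwords]
    return ' '.join(resultwords)

# Hypothetical mini stopword set (an assumption, not the NLTK list)
toy_stopwords = {"this", "is", "a"}

cleaned = remove_stopwords("This is a typical Mel Brooks film.", toy_stopwords)
print(cleaned)  # typical Mel Brooks film.
```

Because the sentence is split on whitespace only, a stopword with punctuation attached (for example `is,`) does not match the list and is kept.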

IMDB_DATA_PREPROCESS reads the neg and pos folders under aclImdb/train and writes an output file in the required format
Inpath - Path of the training samples 
Outpath - Path where the file is saved 
Name  - Name under which the file is saved 
Mix - Whether to shuffle the rows 

In [16]:
def imdb_data_preprocess(inpath, outpath="./", name="imdb_tr.csv", mix=False):
    sw = stopwords.words("english")
    indices = []
    text = []
    rating = []

    i = 0

    # Positive reviews are labelled "1", negative reviews "0"
    for folder, label in (("pos", "1"), ("neg", "0")):
        for filename in os.listdir(inpath + folder):
            # A context manager makes sure every file handle is closed
            with open(inpath + folder + "/" + filename, 'r', encoding="ISO-8859-1") as f:
                data = f.read()
            data = remove_stopwords(data, sw)
            indices.append(i)
            text.append(data)
            rating.append(label)
            i = i + 1

    Dataset = list(zip(indices, text, rating))

    # Optionally shuffle the rows in place
    if mix:
        np.random.shuffle(Dataset)

    df = pd.DataFrame(data=Dataset, columns=['row_Number', 'text', 'polarity'])
    df.to_csv(outpath + name, index=False, header=True)

In [17]:
imdb_data_preprocess(train_path)

In [39]:
data = pd.read_csv("imdb_tr.csv",header=0, encoding = 'ISO-8859-1')
data.head()

Unnamed: 0,row_Number,text,polarity
0,0,Bromwell High cartoon comedy. ran time program...,1
1,1,Homelessness (or Houselessness George Carlin s...,1
2,2,Brilliant over-acting Lesley Ann Warren. Best ...,1
3,3,easily underrated film inn Brooks cannon. Sure...,1
4,4,typical Mel Brooks film. much less slapstick m...,1


In [132]:
def retrieve_data(name="imdb_tr.csv", train=True):
    import pandas as pd 
    data = pd.read_csv(name,header=0, encoding = 'ISO-8859-1')
    X = data['text']
    
    if train:
        Y = data['polarity']
        return X, Y

    return X

## Tokenize

In [72]:
sent = []
for i in range(0,len(data["text"])):
    par = data["text"][i]
    sent.append(sent_tokenize(par))
sent

[['Bromwell High cartoon comedy.',
  'ran time programs school life, "Teachers".',
  '35 years teaching profession lead believe Bromwell High\'s satire much closer reality "Teachers".',
  "scramble survive financially, insightful students see right pathetic teachers' pomp, pettiness whole situation, remind schools knew students.",
  'saw episode student repeatedly tried burn school, immediately recalled ......... .......... High.',
  "classic line: INSPECTOR: I'm sack one teachers.",
  'STUDENT: Welcome Bromwell High.',
  'expect many adults age think Bromwell High far fetched.',
  "pity isn't!"],
 ['Homelessness (or Houselessness George Carlin stated) issue years never plan help street considered human everything going school, work, vote matter.',
  "people think homeless lost cause worrying things racism, war Iraq, pressuring kids succeed, technology, elections, inflation, worrying they'll next end streets.<br /><br />But given bet live streets month without luxuries home, entertainm

In [71]:
data = data.assign(sentence = sent)
data.head()

Unnamed: 0,row_Number,text,polarity,sentence
0,0,Bromwell High cartoon comedy. ran time program...,1,"[Bromwell High cartoon comedy., ran time progr..."
1,1,Homelessness (or Houselessness George Carlin s...,1,[Homelessness (or Houselessness George Carlin ...
2,2,Brilliant over-acting Lesley Ann Warren. Best ...,1,"[Brilliant over-acting Lesley Ann Warren., Bes..."
3,3,easily underrated film inn Brooks cannon. Sure...,1,"[easily underrated film inn Brooks cannon., Su..."
4,4,typical Mel Brooks film. much less slapstick m...,1,"[typical Mel Brooks film., much less slapstick..."


In [73]:
words = []
tokenizer = RegexpTokenizer(r"[\w']+")  # raw string avoids the invalid "\w" escape warning
for i in range(0,len(sent)):
    for sentence in sent[i]:
        for w in tokenizer.tokenize(sentence):
            words.append(w)
words  

['Bromwell',
 'High',
 'cartoon',
 'comedy',
 'ran',
 'time',
 'programs',
 'school',
 'life',
 'Teachers',
 '35',
 'years',
 'teaching',
 'profession',
 'lead',
 'believe',
 'Bromwell',
 "High's",
 'satire',
 'much',
 'closer',
 'reality',
 'Teachers',
 'scramble',
 'survive',
 'financially',
 'insightful',
 'students',
 'see',
 'right',
 'pathetic',
 "teachers'",
 'pomp',
 'pettiness',
 'whole',
 'situation',
 'remind',
 'schools',
 'knew',
 'students',
 'saw',
 'episode',
 'student',
 'repeatedly',
 'tried',
 'burn',
 'school',
 'immediately',
 'recalled',
 'High',
 'classic',
 'line',
 'INSPECTOR',
 "I'm",
 'sack',
 'one',
 'teachers',
 'STUDENT',
 'Welcome',
 'Bromwell',
 'High',
 'expect',
 'many',
 'adults',
 'age',
 'think',
 'Bromwell',
 'High',
 'far',
 'fetched',
 'pity',
 "isn't",
 'Homelessness',
 'or',
 'Houselessness',
 'George',
 'Carlin',
 'stated',
 'issue',
 'years',
 'never',
 'plan',
 'help',
 'street',
 'considered',
 'human',
 'everything',
 'going',
 'school',
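
The pattern `[\w']+` keeps runs of word characters and apostrophes together and drops all other punctuation. The sketch below illustrates this on one sentence from the first review. It uses `re.findall` from the standard library as a stand-in, on the assumption that this mirrors the default (non-gap) behaviour of `RegexpTokenizer` closely enough for illustration.

```python
import re

sentence = "classic line: INSPECTOR: I'm sack one teachers."

# Same pattern as the RegexpTokenizer above:
# one or more word characters or apostrophes
tokens = re.findall(r"[\w']+", sentence)
print(tokens)
# ['classic', 'line', 'INSPECTOR', "I'm", 'sack', 'one', 'teachers']
```

Contractions such as `I'm` survive as single tokens, which is what allows them to be expanded in a later step.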
 

## Stemming words

In [83]:
stemmer = PorterStemmer() 
stemed_word = []
for word in words:
    word = stemmer.stem(word)
    stemed_word.append(word)
stemed_word

['bromwel',
 'high',
 'cartoon',
 'comedi',
 'ran',
 'time',
 'program',
 'school',
 'life',
 'teacher',
 '35',
 'year',
 'teach',
 'profess',
 'lead',
 'believ',
 'bromwel',
 "high'",
 'satir',
 'much',
 'closer',
 'realiti',
 'teacher',
 'scrambl',
 'surviv',
 'financi',
 'insight',
 'student',
 'see',
 'right',
 'pathet',
 "teachers'",
 'pomp',
 'petti',
 'whole',
 'situat',
 'remind',
 'school',
 'knew',
 'student',
 'saw',
 'episod',
 'student',
 'repeatedli',
 'tri',
 'burn',
 'school',
 'immedi',
 'recal',
 'high',
 'classic',
 'line',
 'inspector',
 "i'm",
 'sack',
 'one',
 'teacher',
 'student',
 'welcom',
 'bromwel',
 'high',
 'expect',
 'mani',
 'adult',
 'age',
 'think',
 'bromwel',
 'high',
 'far',
 'fetch',
 'piti',
 "isn't",
 'homeless',
 'or',
 'houseless',
 'georg',
 'carlin',
 'state',
 'issu',
 'year',
 'never',
 'plan',
 'help',
 'street',
 'consid',
 'human',
 'everyth',
 'go',
 'school',
 'work',
 'vote',
 'matter',
 'peopl',
 'think',
 'homeless',
 'lost',
 'caus

In [82]:
stemmer.stem("years")

'year'

## Punctuation removal

In [87]:
# Raw strings are used for the replacements so that "\g<1>" is not
# treated as an (invalid) string escape sequence
replacement_patterns = [(r'won\'t', 'will not'),
                        (r'can\'t', 'cannot'),
                        (r'i\'m', 'i am'),
                        (r'ain\'t', 'is not'),
                        (r'(\w+)\'ll', r'\g<1> will'),
                        (r'(\w+)n\'t', r'\g<1> not'),
                        (r'(\w+)\'ve', r'\g<1> have'),
                        (r'(\w+)\'s', r'\g<1> is'),
                        (r'(\w+)\'re', r'\g<1> are'),
                        (r'(\w+)\'d', r'\g<1> would')]
class RegexpReplacer(object):  
    def __init__(self, patterns=replacement_patterns):    
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
    def replace(self, text):    
        s = text 
        for (pattern, repl) in self.patterns:      
            s = re.sub(pattern, repl, s)
        return s
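
The rules are applied in list order, so the specific patterns (`won't`, `can't`) have to come before the general `n't` rule. The sketch below uses only two of the rules and a hypothetical helper named `expand` to show what happens when the order is reversed.

```python
import re

# Two of the rules from replacement_patterns; the order matters
patterns = [(r"won't", "will not"),
            (r"(\w+)n't", r"\g<1> not")]

def expand(text, rules):
    # Apply each (regex, replacement) pair in turn
    for regex, repl in rules:
        text = re.sub(regex, repl, text)
    return text

print(expand("won't", patterns))        # will not
print(expand("isn't", patterns))        # is not
print(expand("won't", patterns[::-1]))  # wo not
```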

In [89]:
replacer = RegexpReplacer()
clean_word = []
for word in stemed_word:
    word = replacer.replace(word)
    clean_word.append(word)
clean_word

['bromwel',
 'high',
 'cartoon',
 'comedi',
 'ran',
 'time',
 'program',
 'school',
 'life',
 'teacher',
 '35',
 'year',
 'teach',
 'profess',
 'lead',
 'believ',
 'bromwel',
 "high'",
 'satir',
 'much',
 'closer',
 'realiti',
 'teacher',
 'scrambl',
 'surviv',
 'financi',
 'insight',
 'student',
 'see',
 'right',
 'pathet',
 "teachers'",
 'pomp',
 'petti',
 'whole',
 'situat',
 'remind',
 'school',
 'knew',
 'student',
 'saw',
 'episod',
 'student',
 'repeatedli',
 'tri',
 'burn',
 'school',
 'immedi',
 'recal',
 'high',
 'classic',
 'line',
 'inspector',
 'i am',
 'sack',
 'one',
 'teacher',
 'student',
 'welcom',
 'bromwel',
 'high',
 'expect',
 'mani',
 'adult',
 'age',
 'think',
 'bromwel',
 'high',
 'far',
 'fetch',
 'piti',
 'is not',
 'homeless',
 'or',
 'houseless',
 'georg',
 'carlin',
 'state',
 'issu',
 'year',
 'never',
 'plan',
 'help',
 'street',
 'consid',
 'human',
 'everyth',
 'go',
 'school',
 'work',
 'vote',
 'matter',
 'peopl',
 'think',
 'homeless',
 'lost',
 'ca

## Bag of Words (BoW) 

### Try unigram and bigram parameters 

#### Unigram N = 1

In [90]:
# Initialize a CountVectorizer object: count_vec
count_vec = CountVectorizer(analyzer='word', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)


# Fit the vocabulary and transform the data into a bag of words
count_train = count_vec.fit(clean_word)
bag_of_words = count_vec.transform(clean_word)

# Count the features learned by count_vec
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names())
feature_name = count_vec.get_feature_names_out()
len(feature_name)

54274

In [123]:
def unigram_process(data):
    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer()
    vectorizer = vectorizer.fit(data)
    return vectorizer

In [112]:
count_train.vocabulary_

{'bromwell': 227454,
 'high': 780242,
 'cartoon': 261997,
 'comedy': 327513,
 'ran': 1334273,
 'time': 1698698,
 'programs': 1307688,
 'school': 1442167,
 'life': 952738,
 'teachers': 1649647,
 '35': 11680,
 'years': 1892438,
 'teaching': 1649784,
 'profession': 1306370,
 'lead': 935456,
 'believe': 170945,
 'satire': 1426919,
 'much': 1106650,
 'closer': 316166,
 'reality': 1348099,
 'scramble': 1446219,
 'survive': 1629676,
 'financially': 625924,
 'insightful': 848741,
 'students': 1604377,
 'see': 1456729,
 'right': 1395431,
 'pathetic': 1223386,
 'pomp': 1275131,
 'pettiness': 1243409,
 'whole': 1848358,
 'situation': 1522211,
 'remind': 1373033,
 'schools': 1443142,
 'knew': 912057,
 'saw': 1429227,
 'episode': 532236,
 'student': 1604085,
 'repeatedly': 1376285,
 'tried': 1732379,
 'burn': 237255,
 'immediately': 827058,
 'recalled': 1354733,
 'classic': 308633,
 'line': 966742,
 'inspector': 849328,
 'sack': 1418864,
 'one': 1177769,
 'welcome': 1834462,
 'expect': 564136,
 'ma

#### Bigram N = 2

In [92]:
# Initialize a CountVectorizer object: count_vec
# ngram_range=(1, 2) keeps both unigrams and bigrams
count_vec = CountVectorizer(analyzer='word', ngram_range=(1, 2), max_df=1.0, min_df=1, max_features=None)


# Fit the vocabulary and transform the data into a bag of words
count_train = count_vec.fit(clean_word)
bag_of_words = count_vec.transform(clean_word)

# Count the features learned by count_vec
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names())
len(count_vec.get_feature_names_out())

54532

In [124]:
def bigram_process(data):
    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer(ngram_range=(1,2))
    vectorizer = vectorizer.fit(data)
    return vectorizer

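As observed, the number of features increases as N increases. The increase is small here (54274 to 54532), most likely because each item in `clean_word` is a single token, so bigrams can only come from the few entries that contain more than one word, such as the expanded contraction 'is not'.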
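
The effect of `ngram_range` is easier to see on full sentences. The following sketch fits both settings on two made-up documents (toy data, not taken from the IMDB set) and compares the vocabulary sizes. A third vectorizer is fitted on single-word documents to show that no bigrams can be formed in that case.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, for illustration only
docs = ["bromwell high cartoon comedy", "typical mel brooks film"]

uni = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(len(uni.vocabulary_))  # 8 unigrams
print(len(bi.vocabulary_))   # 8 unigrams + 6 bigrams = 14

# With one word per document there is nothing to pair up
single = CountVectorizer(ngram_range=(1, 2)).fit(["bromwell", "high", "cartoon"])
print(len(single.vocabulary_))  # 3
```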

##  Term Frequency - Inverse Document Frequency (TF-IDF) 

In [115]:
tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')
txt_fitted = tf.fit(text)
txt_transformed = txt_fitted.transform(text)
vocabulary = [tf.vocabulary_]
vocabulary

[{'bromwell': 9227,
  'high': 30773,
  'cartoon': 10814,
  'comedy': 13446,
  'ran': 53198,
  'time': 66678,
  'programs': 51721,
  'school': 57799,
  'life': 38515,
  'teachers': 65496,
  '35': 641,
  'years': 73904,
  'teaching': 65498,
  'profession': 51662,
  'lead': 37913,
  'believe': 6699,
  'satire': 57354,
  'much': 44060,
  'closer': 12889,
  'reality': 53599,
  'scramble': 57987,
  'survive': 64444,
  'financially': 24527,
  'insightful': 33685,
  'students': 63532,
  'see': 58352,
  'right': 55606,
  'pathetic': 48493,
  'pomp': 50538,
  'pettiness': 49325,
  'whole': 72654,
  'situation': 60259,
  'remind': 54561,
  'schools': 57820,
  'knew': 36706,
  'saw': 57466,
  'episode': 22132,
  'student': 63531,
  'repeatedly': 54722,
  'tried': 67896,
  'burn': 9739,
  'immediately': 32595,
  'recalled': 53736,
  'classic': 12638,
  'line': 38703,
  'inspector': 33726,
  'sack': 56795,
  'one': 46733,
  'welcome': 72286,
  'expect': 23035,
  'many': 40650,
  'adults': 1912,
  'a

In [94]:
idf = tf.idf_
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
print(dict(zip(txt_fitted.get_feature_names_out(), idf)))



In [125]:
def tfidf_process(data):
    from sklearn.feature_extraction.text import TfidfTransformer 
    transformer = TfidfTransformer()
    transformer = transformer.fit(data)
    return transformer
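
With the settings used for `TfidfVectorizer` earlier in this section (`smooth_idf=False`, `sublinear_tf=False`, `norm=None`), scikit-learn computes each weight as the raw term count multiplied by `ln(n / df) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The sketch below checks this by hand on three made-up documents. The documents and variable names are assumptions for illustration.

```python
import math
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, for illustration only
docs = ["good film", "bad film", "good good plot"]

tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None)
X = tf.fit_transform(docs)

# Manual value for "good" in the third document:
# term count 2, document frequency 2 out of n = 3 documents
manual = 2 * (math.log(3 / 2) + 1)

col = tf.vocabulary_["good"]
print(np.isclose(X[2, col], manual))  # True
```

Note that `tfidf_process` uses `TfidfTransformer` with its default settings (smoothed IDF and L2 normalisation), so the weights used in the final models differ from this unnormalised form.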

## Get the features with the lowest and highest TF-IDF

In [109]:
new1 = tf.transform(clean_word)

# Find the maximum value of each feature over the whole dataset
max_val = new1.max(axis=0).toarray()

# Sort weights from smallest to largest and extract their indices
sort_by_tfidf = max_val.argsort()

# Note: the indices refer to the columns of `tf`, so `feature_name` must hold
# the feature names of `tf` (not of `count_vec`) for this lookup to be valid
print("Features with lowest tfidf:\n{}".format(
      feature_name[sort_by_tfidf[0][0]]))

#print("\nFeatures with highest tfidf: \n{}".format(
      #feature_name[sort_by_tfidf[0][902]]))

Features with lowest tfidf:
powell


In [108]:
sort_by_tfidf[0]

array([37311, 26635, 26636, ..., 63778, 51148, 74623], dtype=int64)

## Feature hashing

In [116]:
h = FeatureHasher(n_features=10)
f = h.transform(vocabulary)
f.toarray()

array([[ 5128730.,   -18003., -3122563., -6115759.,   312894.,  5721570.,
         3554191.,  1981390.,  1044407.,  2031793.]])
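
In the cell above, the hasher receives the vocabulary dictionary, whose values are column indices, so the hashed numbers reflect those indices rather than word counts. A more typical use is to hash the tokens of each document directly. The sketch below shows this with `input_type="string"` on two made-up token lists. The data and the choice of 16 output columns are assumptions for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

# Each document is a list of tokens (toy data, for illustration only)
token_docs = [["bromwell", "high", "cartoon", "comedy"],
              ["typical", "mel", "brooks", "film"]]

# Every token is hashed to one of n_features columns,
# so no vocabulary has to be stored
hasher = FeatureHasher(n_features=16, input_type="string")
hashed = hasher.transform(token_docs)

print(hashed.shape)  # (2, 16)
```

The output width is fixed in advance regardless of how many distinct words appear, which is where the storage saving comes from. The trade-off is that different tokens can collide in the same column.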

## Apply sentiment analysis 

STOCHASTIC_DESCENT trains a linear classifier with stochastic gradient descent (SGD) on the training data and returns the predicted labels for the test data 
Xtrain - Training Data
Ytrain - Training Labels
Xtest - Test Data 

In [117]:
def stochastic_descent(Xtrain, Ytrain, Xtest):
    from sklearn.linear_model import SGDClassifier 
    # max_iter replaces the removed n_iter argument;
    # tol=None disables early stopping so that 20 epochs are always run
    clf = SGDClassifier(loss="hinge", penalty="l1", max_iter=20, tol=None)
    print ("SGD Fitting")
    clf.fit(Xtrain, Ytrain)
    print ("SGD Predicting")
    Ytest = clf.predict(Xtest)
    return Ytest
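
To see the classifier in isolation, the sketch below trains the same kind of model (hinge loss with an L1 penalty) on four made-up reviews. It assumes a recent scikit-learn release, where the number of epochs is set with `max_iter`. The toy data, `random_state=0` and the variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Toy reviews and labels (made up for illustration)
train_text = ["great film loved it", "wonderful acting great plot",
              "terrible film hated it", "awful acting boring plot"]
train_y = ["1", "1", "0", "0"]
test_text = ["great plot", "boring film"]

# Build the bag-of-words features from the training text only
vec = CountVectorizer().fit(train_text)

# Hinge loss gives a linear SVM; the L1 penalty pushes weights towards zero
clf = SGDClassifier(loss="hinge", penalty="l1", max_iter=20, tol=None,
                    random_state=0)
clf.fit(vec.transform(train_text), train_y)

pred = clf.predict(vec.transform(test_text))
print(pred)
```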

ACCURACY returns the percentage of positions at which two sets of labels agree (here, the true test labels and the predicted labels) 
Ytrain - One set of labels 
Ytest - Other set of labels

In [120]:
def accuracy(Ytrain, Ytest):
    assert (len(Ytrain)==len(Ytest))
    num =  sum([1 for i, word in enumerate(Ytrain) if Ytest[i]==word])
    n = len(Ytrain)  
    return (num*100)/n
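
A small worked case makes the calculation concrete. The sketch below restates the function with hypothetical argument names (`y_true`, `y_pred`) and applies it to four made-up labels, three of which match.

```python
def accuracy(y_true, y_pred):
    # Percentage of positions where the two label sequences agree
    assert len(y_true) == len(y_pred)
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches * 100 / len(y_true)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 75.0
```

The comparison is exact, so both sequences need the same label type: the string `"1"` and the integer `1` would count as a mismatch.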

In [136]:
def write_txt(data, name):
    # Concatenate the predicted labels into one string and write it to `name`
    data = ''.join(str(word) for word in data)
    with open(name, 'w') as file:
        file.write(data)

In [140]:
import time
start = time.time()

#imdb_data_preprocess(inpath=train_path, mix=True)
#imdb_data_preprocess(inpath=test_path, mix=True, name="imdb_te.csv")

[Xtrain_text, Ytrain] = retrieve_data()
[Xtest_text, Ytest] = retrieve_data(name="imdb_te.csv")

#ANALYSIS ON THE INSAMPLE DATA (TRAINING DATA)
## Unigram Model on the Train Data
uni_vectorizer = unigram_process(Xtrain_text)
Xtrain_uni = uni_vectorizer.transform(Xtrain_text)

## Bigram Model on the Train Data
bi_vectorizer = bigram_process(Xtrain_text)
Xtrain_bi = bi_vectorizer.transform(Xtrain_text)

## Unigram TF Model on the Train Data
uni_tfidf_transformer = tfidf_process(Xtrain_uni)
Xtrain_tf_uni = uni_tfidf_transformer.transform(Xtrain_uni)

## Bigram TF Model on the Train Data
bi_tfidf_transformer = tfidf_process(Xtrain_bi)
Xtrain_tf_bi = bi_tfidf_transformer.transform(Xtrain_bi)


#ANALYSIS ON THE TEST DATA
# Unigram Model on the Test Data
Xtest_uni = uni_vectorizer.transform(Xtest_text)
Ytest_uni = stochastic_descent(Xtrain_uni, Ytrain, Xtest_uni)
write_txt(Ytest_uni, name="unigram.output.txt")


# Bigram Model on the Test Data
Xtest_bi = bi_vectorizer.transform(Xtest_text)
Ytest_bi = stochastic_descent(Xtrain_bi, Ytrain, Xtest_bi)
write_txt(Ytest_bi, name="bigram.output.txt")


# Unigram TF Model on the Test Data
Xtest_tf_uni = uni_tfidf_transformer.transform(Xtest_uni)
Ytest_tf_uni = stochastic_descent(Xtrain_tf_uni, Ytrain, Xtest_tf_uni)
write_txt(Ytest_tf_uni, name="unigramtfidf.output.txt")


# Bigram TF Model on the Test Data
Xtest_tf_bi = bi_tfidf_transformer.transform(Xtest_bi)
Ytest_tf_bi = stochastic_descent(Xtrain_tf_bi, Ytrain, Xtest_tf_bi)
write_txt(Ytest_tf_bi, name="bigramtfidf.output.txt")

print ("Total time taken is ", time.time()-start, " seconds")



SGD Fitting
SGD Predicting
SGD Fitting
SGD Predicting
SGD Fitting
SGD Predicting
SGD Fitting
SGD Predicting
Total time taken is  89.54912805557251  seconds


In [141]:
print ("The accuracy of Unigram Model is ", accuracy(Ytest,Ytest_uni))
print ("The accuracy of Bigram Model is ", accuracy(Ytest,Ytest_bi))
print ("The accuracy of Unigram TF Model is ", accuracy(Ytest,Ytest_tf_uni))
print ("The accuracy of Bigram TF Model is ", accuracy(Ytest,Ytest_tf_bi))

The accuracy of Unigram Model is  85.0
The accuracy of Bigram Model is  85.396
The accuracy of Unigram TF Model is  87.268
The accuracy of Bigram TF Model is  86.124


# Summary of the solution and key highlights

After removing stop words, tokenizing, stemming and transforming punctuation (expanding contractions), I tried four models on the review text: the Unigram Model, Bigram Model, Unigram TF Model and Bigram TF Model. Based on test accuracy, the Unigram TF Model scores highest at 87.268%, so it is the best of the four for this sentiment analysis task.

# Key learnings

From this exercise I learned how to preprocess text data and turn it into a feature matrix. I also learned that feature hashing is useful for massive text data because it saves storage space. Because this dataset has labels, I used a machine learning method trained on the sentiment labels instead of applying sentiment analysis directly. If I have time in the future, I will definitely try that approach and compare it with this method.