<u><h1 align='center'>Lab2: Building a Sentiment Analysis System</h1></u>

    
### Problem Definition
Given tweets about six US airlines, your task is to predict whether a
tweet expresses positive, negative, or neutral sentiment about the airline.
The Twitter data was scraped in February 2015, and contributors were
asked to first classify each tweet as positive, negative, or neutral, and
then to categorize the reasons for negative tweets (such as “late flight”
or “rude service”).
Sentiment analysis is a typical supervised learning task: given a
text string, you have to assign it to one of several predefined
categories.
To solve this problem, you will first import the required libraries and
the dataset. Next, you will perform text pre-processing (data cleaning).
Then, you will extract features using the bag-of-words model, the TF-IDF
model, and word2vec. Finally, you will train and test your sentiment
analysis models with three machine learning algorithms of your choice.


### Steps:

- Import the required libraries and the dataset
- Pre-process the text
- Extract features using the bag-of-words model, the TF-IDF model, and word2vec
- Modeling: choose three algorithms

# 0. Imports

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from gensim.models import Word2Vec
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.metrics import accuracy_score, recall_score, precision_score

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
!unzip Tweets.csv.zip

Archive:  Tweets.csv.zip
  inflating: Tweets.csv              


In [3]:
data=pd.read_csv('Tweets.csv')
print(data.shape)
data.head()

(14640, 15)


,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [0]:
data.airline_sentiment.value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64
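
The counts above show a strong imbalance toward negative tweets. As a rough reference point (not part of the original lab), the sketch below computes the class proportions and the accuracy of a trivial classifier that always predicts the most frequent class. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# A classifier that always answers with the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = proportions[majority_label]

for label, p in proportions.items():
    print(f"{label:<9}{p:.4f}")
print(f"majority baseline: always '{majority_label}' -> {baseline_accuracy:.2%} accuracy")
```

Any accuracy reported for a trained model is best read against this floor of roughly 63%.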

# 1. Text Pre-processing

In [8]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
# Build these once, outside the loop, instead of once per word
lem = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

corpus = []
for tweet in data['text']:
    Document = re.sub('[^a-zA-Z]',' ',tweet)    # keep letters only
    Document = Document.lower().split()         # lowercase, tokenize on whitespace
    Document = [lem.lemmatize(word) for word in Document if word not in stop_words]
    corpus.append(' '.join(Document))

# Reuse the DataFrame index so the cleaned text aligns row by row
Documents = pd.Series(corpus, index=data.index)

In [0]:
data['cleaned_text']=Documents
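
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same operations (keep letters only, lowercase, drop stop words) to the first tweet of the dataset. To stay self-contained it uses a tiny hand-picked stop-word set and skips lemmatization, so `TOY_STOP_WORDS` and `clean_tweet` are illustrative assumptions, not the NLTK resources used in the lab.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stop-word list
TOY_STOP_WORDS = {"what", "you", "ve", "to", "the", "a", "and", "it", "s", "i"}

def clean_tweet(text, stop_words=TOY_STOP_WORDS):
    """Keep letters only, lowercase, tokenize, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

cleaned = clean_tweet("@VirginAmerica What @dhepburn said.")
print(cleaned)  # virginamerica dhepburn said
```

Note that the `@` is removed but the handle itself survives as an ordinary token, so airline names stay in the vocabulary.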

# 2. Split data into train and test

In [11]:
Train, Test =train_test_split(data[['cleaned_text','airline_sentiment']],train_size=0.7,random_state=1,stratify=data['airline_sentiment'])
#frequency distribution of the class attribute
#train set
freqTrain = pd.crosstab(index=Train["airline_sentiment"],columns="count")
print('frequency distribution of the class attribute in Training set: \n\n',freqTrain/freqTrain.sum())
#test set
freqTest = pd.crosstab(index=Test["airline_sentiment"],columns="count")
print('frequency distribution of the class attribute in Test set: \n\n',freqTest/freqTest.sum())

frequency distribution of the class attribute in Training set: 

 col_0                 count
airline_sentiment          
negative           0.626952
neutral            0.211651
positive           0.161397
frequency distribution of the class attribute in Test set: 

 col_0                 count
airline_sentiment          
negative           0.626821
neutral            0.211749
positive           0.161430
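
The near-identical proportions in both sets are what `stratify` is meant to guarantee. The idea can be sketched in plain Python as below; this is a simplified illustration under the assumption that splitting each class separately is enough, not scikit-learn's actual implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_size=0.7, seed=1):
    """Return (train_idx, test_idx) with roughly equal class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                     # shuffle within the class
        cut = round(len(indices) * train_size)   # same fraction for every class
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with an imbalance similar in spirit to the airline data
labels = ["negative"] * 60 + ["neutral"] * 20 + ["positive"] * 20
train_idx, test_idx = stratified_split(labels)

train_neg = sum(labels[i] == "negative" for i in train_idx) / len(train_idx)
test_neg = sum(labels[i] == "negative" for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_neg, test_neg)
```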


# 3. Feature extraction:
   ## a) Bag of Words Model

In [0]:
#bag of words
parseur = CountVectorizer()
#create the document term matrix
XTrain = parseur.fit_transform(Train['cleaned_text'])
mdtTrain = XTrain.toarray()


#create the document term matrix for test set
mdtTest = parseur.transform(Test['cleaned_text'])
mdtTest = mdtTest.toarray()
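
The fit/transform asymmetry above (vocabulary learned on the training set, then reused unchanged on the test set) can be shown with a hand-rolled sketch. The toy documents and the helper `to_counts` are assumptions for illustration; `CountVectorizer` additionally lowercases the text and, by default, drops single-character tokens.

```python
from collections import Counter

train_docs = ["flight late late", "great flight", "rude service"]
test_docs = ["late service terrible"]  # "terrible" never appears in training

# "fit": the vocabulary is learned from the training documents only
vocabulary = sorted({word for doc in train_docs for word in doc.split()})
column = {word: j for j, word in enumerate(vocabulary)}

def to_counts(doc):
    """One row of the document-term matrix: word counts in vocabulary order."""
    row = [0] * len(vocabulary)
    for word, n in Counter(doc.split()).items():
        if word in column:          # words unseen during fit are ignored
            row[column[word]] = n
    return row

# "transform": both sets are mapped onto the same columns
train_matrix = [to_counts(doc) for doc in train_docs]
test_matrix = [to_counts(doc) for doc in test_docs]

print(vocabulary)
print(train_matrix)
print(test_matrix)
```

Because the columns are fixed at fit time, the train and test matrices always have the same width, which is what the classifiers require.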

   ## b) TF-IDF model

In [0]:
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(mdtTrain)
print ("IDF:",tfidf.idf_)

IDF: [4.96707726 9.54178824 9.54178824 ... 8.62549751 9.54178824 9.54178824]
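
With its default `smooth_idf=True`, `TfidfTransformer` computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t. The sketch below checks this against the largest value printed above, assuming that value belongs to a term seen in exactly one of the 10248 training tweets (70% of 14640). The function name `smoothed_idf` is illustrative.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed variant."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_train = 10248  # 70% of the 14640 tweets

# Assumption: the rarest terms occur in a single training tweet
rare_idf = smoothed_idf(n_train, doc_freq=1)
print(round(rare_idf, 4))

# A term present in every document gets the minimum weight
everywhere_idf = smoothed_idf(n_train, doc_freq=n_train)
print(everywhere_idf)
```

The first value agrees with the 9.5418 that appears repeatedly in the printed `idf_` array, which suggests that many vocabulary entries occur only once in the training set.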


In [0]:
tf_idf_matrix = tfidf.transform(mdtTrain)
print (tf_idf_matrix.todense())
tf_idf_matrix = tf_idf_matrix.toarray()


[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [0]:
#create the document term matrix for test set

tfidfTest = tfidf.transform(mdtTest)
tfidfTest = tfidfTest.toarray()

print('size of the matrix: ',tfidfTest.shape)

size of the matrix:  (4392, 10193)
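
Each row of these matrices is obtained by multiplying the raw counts by the idf weights and then, because of `norm="l2"`, dividing the row by its Euclidean length. A minimal sketch for a single document, with hypothetical counts and idf values:

```python
import math

counts = [1, 0, 2]       # raw term counts for one document (hypothetical)
idf = [1.0, 2.0, 1.5]    # hypothetical idf weights, one per term

weighted = [c * w for c, w in zip(counts, idf)]     # tf * idf
norm = math.sqrt(sum(v * v for v in weighted))      # Euclidean length of the row
tfidf_row = [v / norm for v in weighted]

print([round(v, 4) for v in tfidf_row])
```

After normalization every non-empty row has unit length, so long and short tweets end up on the same scale.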


   ## c) Word2Vec Model

In [0]:
# Build these once instead of once per word
lem = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def text_preprocessing(Document):
    """Clean one string and return its list of lemmatized, non-stop-word tokens."""
    Document = re.sub('[^a-zA-Z]',' ',Document)
    Document = Document.lower().split()
    return [lem.lemmatize(word) for word in Document if word not in stop_words]

In [0]:
def return_sentences(Text):
    """Split a text on '.', '!' and '?' and tokenize each non-empty piece."""
    return [text_preprocessing(sent) for sent in re.split(r'[.!?]',Text) if len(sent)>0]

In [0]:
sentences=[]
for Text in data.text:
    sentences.extend(return_sentences(Text))

In [15]:
len(sentences)

31396

In [16]:
sentences[:3]

[['virginamerica', 'dhepburn', 'said'],
 ['virginamerica', 'plus', 'added', 'commercial', 'experience'],
 ['tacky']]

In [0]:
# gensim < 4.0 API; in gensim >= 4.0 the `size` argument is named `vector_size`
wv_model = Word2Vec(sentences,window=2, size=100)

In [18]:
wv_model.wv['said']

array([ 0.4471718 , -0.29934147, -0.38920575,  0.1000734 , -0.01909727,
        0.27499887,  0.19545735, -0.47218323,  0.25697604, -0.0167041 ,
        0.01443389,  0.66365325, -0.1103569 ,  0.39532128,  0.03015464,
       -0.52213633,  0.08424173,  0.2080123 ,  0.05740016, -0.31629822,
       -0.09770916,  0.06987256, -0.02733484, -0.27587214,  0.13436335,
        0.21138057,  0.05217331,  0.21302634,  0.18201472,  0.08663177,
       -0.14469871,  0.2752368 , -0.37069973,  0.07828341, -0.1065984 ,
       -0.25163063,  0.34588635,  0.31370258,  0.11873006,  0.16105972,
        0.01733721,  0.04801674, -0.29237103,  0.2580621 , -0.15820517,
        0.02220287, -0.21669261, -0.07741657,  0.32924348, -0.07723841,
        0.20602418, -0.06632921,  0.08607916, -0.14943449,  0.3197164 ,
       -0.11283532, -0.24914756,  0.31342834,  0.1090398 , -0.55513996,
       -0.06493661,  0.32072833, -0.38897002, -0.09584168, -0.10547344,
        0.03595581,  0.00430905,  0.02573651, -0.11478409, -0.04

In [19]:
wv_model.wv[['said','added']].sum(axis=0)

array([ 0.6609105 , -0.45178246, -0.5546513 ,  0.13263896, -0.01592185,
        0.40004283,  0.25921875, -0.69163656,  0.3526222 , -0.02063236,
        0.01933887,  0.95572466, -0.15609789,  0.5798899 ,  0.03072104,
       -0.77395433,  0.11112138,  0.31873274,  0.08782855, -0.4581898 ,
       -0.14890629,  0.09501515, -0.0539802 , -0.39806914,  0.19964752,
        0.28870142,  0.06485673,  0.31498328,  0.27187645,  0.12256157,
       -0.21799102,  0.36953637, -0.5321285 ,  0.09695002, -0.14625543,
       -0.34877828,  0.52059555,  0.42904764,  0.18915366,  0.21328726,
        0.02579263,  0.06916293, -0.42801774,  0.38108358, -0.2170225 ,
        0.03533864, -0.31979138, -0.12013727,  0.48967296, -0.11542636,
        0.30086482, -0.08685351,  0.10916246, -0.21683541,  0.43818554,
       -0.18848783, -0.3685473 ,  0.44920284,  0.16168621, -0.7902601 ,
       -0.09649501,  0.45945188, -0.55955964, -0.12683281, -0.16085006,
        0.05766296,  0.01756825,  0.04001569, -0.16386864, -0.06

In [0]:
vTrain, vTest =train_test_split(data[['text','airline_sentiment']],train_size=0.7,random_state=1,stratify=data['airline_sentiment'])


In [24]:
#TRAIN features: each tweet is the mean of its in-vocabulary word vectors
trainvect=[]
for text in vTrain.text:
  text_vect=np.zeros(wv_model.wv.vector_size)
  nword=0
  for sent in return_sentences(text):
      for word in sent:
        if word in wv_model.wv:
          nword+=1
          text_vect+=wv_model.wv[word]
  # max(nword,1) leaves an all-zero vector for tweets with no known word
  trainvect.append(text_vect/max(nword,1))


In [0]:
len(trainvect)

In [25]:
#TEST features: each tweet is the mean of its in-vocabulary word vectors
testvect=[]
for text in vTest.text:
  text_vect=np.zeros(wv_model.wv.vector_size)
  nword=0
  for sent in return_sentences(text):
      for word in sent:
        if word in wv_model.wv:
          nword+=1
          text_vect+=wv_model.wv[word]
  # max(nword,1) leaves an all-zero vector for tweets with no known word
  testvect.append(text_vect/max(nword,1))


In [0]:
len(testvect)
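
The document vectors built above are plain averages of word vectors. The sketch below shows the same idea on a toy embedding table, including the edge case of a text with no known word. `toy_vectors` and `mean_vector` are hypothetical stand-ins for `wv_model.wv` and the loops above.

```python
# Assumption: a toy 3-dimensional embedding table standing in for wv_model.wv
toy_vectors = {
    "flight": [1.0, 0.0, 2.0],
    "late":   [3.0, 2.0, 0.0],
}

def mean_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; all zeros if none is known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]

print(mean_vector(["flight", "late", "zzz"], toy_vectors))  # [2.0, 1.0, 1.0]
print(mean_vector(["zzz"], toy_vectors))                    # [0.0, 0.0, 0.0]
```

Averaging discards word order and gives every word the same weight, which may partly explain why these features can trail the count-based ones on short texts.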

# 4. Modeling & prediction & Evaluation:


In [0]:
classifiers = [
    
    LogisticRegression(solver='liblinear'),
    DecisionTreeClassifier(),
    RandomForestClassifier(n_estimators=100),
    AdaBoostClassifier(),
    GaussianNB()]

## a) BOW model

In [0]:
results_list = []


for clf in classifiers:
    clf_name = clf.__class__.__name__
    print("="*30)
    print(clf_name)
    clf.fit(mdtTrain,Train['airline_sentiment'])

        
    predTest = clf.predict(mdtTest)
    acc = metrics.accuracy_score(Test['airline_sentiment'],predTest)
    results_list.append((clf_name, acc*100))


results_df = pd.DataFrame(results_list,columns=["Classifier", "Accuracy"])
results_df.set_index('Classifier',inplace=True)

LogisticRegression
DecisionTreeClassifier
RandomForestClassifier
AdaBoostClassifier
GaussianNB


In [0]:
results_df.head(6)

Classifier,Accuracy
LogisticRegression,78.893443
DecisionTreeClassifier,68.14663
RandomForestClassifier,75.774135
AdaBoostClassifier,71.880692
GaussianNB,48.269581
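
Since the classes are imbalanced, accuracy alone can hide weak performance on the neutral and positive tweets. The sketch below computes per-class precision and recall from two label lists; the labels are hypothetical and only exercise the function, and `precision_recall` is an illustrative helper.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels, only to exercise the function
y_true = ["negative", "negative", "negative", "neutral", "positive", "positive"]
y_pred = ["negative", "negative", "neutral",  "neutral", "negative", "positive"]

scores = {label: precision_recall(y_true, y_pred, label)
          for label in ("negative", "neutral", "positive")}
for label, (p, r) in scores.items():
    print(f"{label:<9} precision={p:.2f} recall={r:.2f}")
```

In the notebook itself, the already imported `precision_score` and `recall_score` with `average=None` should give the same per-class figures for `predTest`.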


## b) TF-IDF model

In [0]:
results_list = []


for clf in classifiers:
    clf_name = clf.__class__.__name__
    print("="*30)
    print(clf_name)
    clf.fit(tf_idf_matrix,Train['airline_sentiment'])

        
    predTest = clf.predict(tfidfTest)
    acc = metrics.accuracy_score(Test['airline_sentiment'],predTest)
    results_list.append((clf_name, acc*100))


results_df = pd.DataFrame(results_list,columns=["Classifier", "Accuracy"])
results_df.set_index('Classifier',inplace=True)

LogisticRegression
DecisionTreeClassifier
RandomForestClassifier
AdaBoostClassifier
GaussianNB


In [0]:
results_df.head()

Classifier,Accuracy
LogisticRegression,77.709472
DecisionTreeClassifier,66.279599
RandomForestClassifier,75.523679
AdaBoostClassifier,71.766849
GaussianNB,48.269581


## c) Word2Vec model

In [26]:
results_list = []


for clf in classifiers:
    clf_name = clf.__class__.__name__
    print("="*30)
    print(clf_name)
    # labels taken from the same split as the Word2Vec features
    clf.fit(trainvect,vTrain['airline_sentiment'])

        
    predTest = clf.predict(testvect)
    acc = metrics.accuracy_score(vTest['airline_sentiment'],predTest)
    results_list.append((clf_name, acc*100))


results_df = pd.DataFrame(results_list,columns=["Classifier", "Accuracy"])
results_df.set_index('Classifier',inplace=True)

LogisticRegression
DecisionTreeClassifier
RandomForestClassifier
AdaBoostClassifier
GaussianNB


In [27]:
results_df.head()

Classifier,Accuracy
LogisticRegression,67.440801
DecisionTreeClassifier,59.995446
RandomForestClassifier,70.03643
AdaBoostClassifier,67.668488
GaussianNB,45.013661
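
To compare the three feature sets side by side, the sketch below gathers the test accuracies from the three result tables (rounded to two decimals) and reports the best classifier for each representation. The dictionary layout is an illustrative choice.

```python
# Test accuracies (%) copied from the three result tables, rounded to two decimals
results = {
    "BOW": {
        "LogisticRegression": 78.89, "DecisionTreeClassifier": 68.15,
        "RandomForestClassifier": 75.77, "AdaBoostClassifier": 71.88,
        "GaussianNB": 48.27,
    },
    "TF-IDF": {
        "LogisticRegression": 77.71, "DecisionTreeClassifier": 66.28,
        "RandomForestClassifier": 75.52, "AdaBoostClassifier": 71.77,
        "GaussianNB": 48.27,
    },
    "Word2Vec": {
        "LogisticRegression": 67.44, "DecisionTreeClassifier": 60.00,
        "RandomForestClassifier": 70.04, "AdaBoostClassifier": 67.67,
        "GaussianNB": 45.01,
    },
}

# Best classifier per feature representation
best = {features: max(scores, key=scores.get) for features, scores in results.items()}

for features, clf_name in best.items():
    print(f"{features:<9}{clf_name:<24}{results[features][clf_name]:.2f}")
```

Logistic regression on bag-of-words counts gives the highest accuracy overall, while GaussianNB stays below the share of negative tweets (about 63%) with all three representations.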
