# Natural Language Processing - Twitter Sentiment Analysis

***Natural Language Processing*** helps us identify relationships and patterns in text, which we can then use as features for prediction. Below I conduct a sentiment analysis to determine which tweets have a positive connotation and which have a negative one. For a better understanding of sentiment analysis, take a look at the paper written by Stanford professors and researchers; its live demo page gives an excellent illustration of sentiment analysis (http://nlp.stanford.edu/sentiment/index.html).



***Problem:*** Conduct a sentiment analysis on the following tweets to determine their polarity, how often each unique word is mentioned across all tweets in this dataset, and each unique word's tf-idf (term frequency - inverse document frequency) score.
<br><br>
***Data:***

In [1]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
tweets_df = pd.read_csv('tweet_data.csv', sep=';', index_col=0)
tweets_df.head()

id,polarity,tweet
1467933112,0,the angel is going to miss the athlete this we...
2323395086,0,It looks as though Shaq is getting traded to C...
1467968979,0,@clarianne APRIL 9TH ISN'T COMING SOON ENOUGH
1990283756,0,drinking a McDonalds coffee and not understand...
1988884918,0,So dissapointed Taylor Swift doesnt have a Twi...


The single tweet below is labeled as a negative statement (polarity 0).

In [3]:
tweets_df.loc[1988884918]

polarity                                                    0
tweet       So dissapointed Taylor Swift doesnt have a Twi...
Name: 1988884918, dtype: object

In [4]:
train_tweets = tweets_df['tweet'].values
train_tweets

array(['the angel is going to miss the athlete this weekend ',
       "It looks as though Shaq is getting traded to Cleveland to play w/ LeBron... Too bad for Suns' fans. The Big Cactus is no more ",
       "@clarianne APRIL 9TH ISN'T COMING SOON ENOUGH ", ...,
       '@jordanhowell lol only a PSP, had a game boy but my cousin lost it  &amp; theres a N64 around',
       'Good morning everyone!  It is such a beautiful day here in New England!  ',
       'hey guess was @magicmanil the Lakers won and KOBE is mvp  just thought I would tell ya haha'], dtype=object)

<br>***CountVectorizer:*** identifies the unique words across all tweets and counts how often each one occurs in each tweet

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
#Initialize the vectorizer: remove English stop words and convert text to lower case
vectorizer = CountVectorizer(stop_words='english', lowercase=True)

In [7]:
#Learn the vocabulary (the list of unique words) from the tweets
vectorizer.fit(train_tweets)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [8]:
#Creates a sparse matrix: one row per tweet, one column per unique word; each entry is the word's count in that tweet.
X_train = vectorizer.transform(train_tweets)
print(train_tweets[0])
print(X_train[0])

the angel is going to miss the athlete this weekend 
  (0, 319)	1
  (0, 414)	1
  (0, 1980)	1
  (0, 3062)	1
  (0, 5002)	1


In [9]:
#Look up the column index assigned to the word 'athlete'
print(vectorizer.vocabulary_.get('athlete'))

414


In [10]:
print(train_tweets[186])
print(X_train[186])
print(vectorizer.vocabulary_.get('new'))
#Here we can see that the word 'new' (column 3218) appears twice

At starbucks with my new sister  learning her new phone.
  (0, 2685)	1
  (0, 3218)	2
  (0, 3472)	1
  (0, 4138)	1
  (0, 4313)	1
3218
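The `(0, 3218)	2` lines above are (row, column) positions followed by a count. As a minimal sketch, assuming a one-tweet toy corpus instead of the full dataset, the column indices can be turned back into words by inverting `vocabulary_`. The names `toy_tweets`, `toy_vectorizer` and `index_to_word` are introduced here only for illustration, and the column numbers will differ from the ones in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumption): a single tweet, so column indices differ from the notebook's.
toy_tweets = ["At starbucks with my new sister  learning her new phone."]

toy_vectorizer = CountVectorizer(stop_words='english', lowercase=True)
toy_counts = toy_vectorizer.fit_transform(toy_tweets)

# Invert the word -> column mapping so that columns can be read as words.
index_to_word = {idx: word for word, idx in toy_vectorizer.vocabulary_.items()}

# Row 0 of the sparse matrix: pair every stored column index with its count.
row = toy_counts[0]
decoded = {index_to_word[col]: int(count) for col, count in zip(row.indices, row.data)}
print(decoded)
```

The stop words ('at', 'with', 'my', 'her') are dropped before counting, which is why only five distinct words remain.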


<br>***TfidfTransformer:*** assigns a tf-idf weight to each unique word in each tweet. A weight is high when the word is rare in the other tweets and therefore more distinctive for this tweet. A weight is low when the word is common across the other tweets.
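To make the weighting concrete, here is a small sketch that recomputes the tf-idf values by hand and compares them with `TfidfTransformer` under its default settings (`smooth_idf=True`, `norm='l2'`). The three-document corpus and the variable names are assumptions made for illustration; they are not part of the tweet dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (assumption), small enough to check by hand.
docs = ["new phone new case", "new coffee", "coffee shop"]

counts = CountVectorizer().fit_transform(docs)
sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()

# Manual computation with the default settings:
tf = counts.toarray().astype(float)          # tf is the raw count of each word
n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                    # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed inverse document frequency
manual_tfidf = tf * idf
manual_tfidf /= np.linalg.norm(manual_tfidf, axis=1, keepdims=True)  # scale rows to unit length

print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Because every row is scaled to unit length, the weights are only comparable within a tweet, not as absolute frequencies.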

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer

In [12]:
transformer = TfidfTransformer()
transformer.fit(X_train)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [13]:
X_train_tfidf = transformer.transform(X_train)

In [14]:
print(train_tweets[186])
print(X_train_tfidf[186])
print(vectorizer.vocabulary_.get('new'))
print(vectorizer.vocabulary_.get('starbucks'))
print(vectorizer.vocabulary_.get('learning'))

At starbucks with my new sister  learning her new phone.
  (0, 4313)	0.285913376715
  (0, 4138)	0.422150263806
  (0, 3472)	0.386021064679
  (0, 3218)	0.573798111469
  (0, 2685)	0.511650428206
3218
4313
2685


The string 'starbucks' appears in 70 of the tweets in this dataset

In [15]:
sum(['starbucks' in tweet.lower() for tweet in train_tweets])

70

In [16]:
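#After stop-word removal we have ['starbucks', 'new', 'sister', 'learning', 'new', 'phone']
#tf(starbucks) = 1, tf(new) = 2   (TfidfTransformer uses the raw counts as tf)
#idf = ln((1 + n) / (1 + df)) + 1, where n = number of tweets and df = number of tweets containing the word
#tfidf = tf * idf, and each row is then scaled to unit length (norm='l2')
#tf is almost the same for every word in this tweet, so the differences come mostly from df:
#a lower score means that the word is common in other tweets.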

In [17]:
print(sum(['starbucks' in tweet.lower() for tweet in train_tweets]))
print(sum(['new' in tweet.lower() for tweet in train_tweets]))
print(sum(['sister' in tweet.lower() for tweet in train_tweets]))
print(sum(['learning' in tweet.lower() for tweet in train_tweets]))
print(sum(['phone' in tweet.lower() for tweet in train_tweets]))

70
87
9
1
93
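Note that `'new' in tweet.lower()` is a substring test, so it also matches words such as 'news' or 'knew'. The sketch below, on a made-up four-tweet corpus, contrasts that with a token-level document frequency read from the count matrix. The corpus and the names `substring_hits` and `token_df` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumption): two tweets contain the token 'new', two only contain it as a substring.
toy_tweets = ["new phone", "I knew it", "good news", "brand new day"]

# Substring test, as in the cell above: also matches 'knew' and 'news'.
substring_hits = sum('new' in tweet.lower() for tweet in toy_tweets)

# Token-level document frequency: tweets whose count for the token 'new' is non-zero.
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_tweets)
new_column = toy_vectorizer.vocabulary_['new']
token_df = int((toy_counts[:, new_column].toarray() > 0).sum())

print(substring_hits, token_df)
```

The token-level figure is the one that corresponds to the df used inside the tf-idf weights.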


In [18]:
#for x in train_tweets:
#    print(x)

Identify the polarity values. They turn out to be 0 (negative) and 4 (positive), so recode them as 0 and 1.

In [19]:
y = tweets_df['polarity'].values
print(y)
#[tweet for tweet in train_tweets if '!!!' in tweet.lower()]

[0 0 0 ..., 0 4 4]


In [20]:
#Replace the 4's with 1's; the rows shown below are the positive tweets.
tweets_df['polarity'] = tweets_df['polarity'].replace(4, 1)
tweets_df[tweets_df['polarity']>0].head()

id,polarity,tweet
1680347120,1,@ mcdonalds with my litto sis aka cuzin lol cr...
1835259469,1,@AnnaSaccone Love your new cards! I would de...
1983068285,1,@supricky06 that was one of the most enjoyable...
1559842363,1,Dallas vegas goodness http://twitpic.com/3lzt...
1999078293,1,@JBsFanArgentina Hey I luv this pic!!! was ama...


In [21]:
tweets_df['polarity'].unique()

array([0, 1])

In [22]:
X = X_train_tfidf.toarray()    #converts sparse matrix to ndarray
y = tweets_df['polarity'].values
print(X.shape)
print(y.shape)
print(X)
#X has 2034 tweets (rows) and 5220 unique words (columns)
#y holds one sentiment value for every tweet

(2034, 5220)
(2034,)
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
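Converting to a dense array is not strictly required: scikit-learn's `MultinomialNB` and `LogisticRegression` both accept SciPy sparse matrices, which saves memory when most entries are zero. The following is a rough sketch on a made-up six-tweet corpus (the data and variable names are assumptions) that checks the sparse and dense inputs lead to the same predictions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (assumption): three positive and three negative tweets.
toy_tweets = ["what an awesome morning", "awesome game tonight", "this coffee is awesome",
              "what an awful morning", "awful game tonight", "this coffee is awful"]
toy_labels = np.array([1, 1, 1, 0, 0, 0])

toy_counts = CountVectorizer().fit_transform(toy_tweets)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)   # stays a sparse matrix
toy_dense = toy_tfidf.toarray()

same_predictions = []
for make_model in (MultinomialNB, lambda: LogisticRegression(C=1000, max_iter=1000)):
    sparse_pred = make_model().fit(toy_tfidf, toy_labels).predict(toy_tfidf)
    dense_pred = make_model().fit(toy_dense, toy_labels).predict(toy_dense)
    same_predictions.append(bool(np.array_equal(sparse_pred, dense_pred)))

print(same_predictions)
```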


<br>***Logistic Regression***

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score   #sklearn.cross_validation was removed in scikit-learn 0.20

In [24]:
logreg = LogisticRegression(C=1000)   #Large C means weak regularization; small C adds a
#stronger penalty on the magnitude of the weights to the cost
logreg.fit(X,y)    #computes coefficients from our training features

cross_val_score(logreg, X, y, cv=3)   #cv = number of cross-validation folds (K)

array([ 0.82474227,  0.82743363,  0.80945347])
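One caveat: the vocabulary and idf weights above were fitted on all tweets before cross-validating, so each held-out fold has already influenced the features. A possible alternative, sketched here on a made-up twelve-tweet corpus, is to wrap the three steps in a scikit-learn `Pipeline` so that they are refitted inside every fold. The toy data and the name `pipe` are assumptions, and the scores on such a tiny corpus say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus (assumption): six positive (1) and six negative (0) tweets.
toy_tweets = [
    "awesome morning with friends", "great game tonight", "love this coffee",
    "awesome concert last night", "great weather today", "love my new phone",
    "awful morning at work", "terrible game tonight", "hate this coffee",
    "awful traffic last night", "terrible weather today", "hate my old phone",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizer, tf-idf weighting and classifier are all refitted on each training fold.
pipe = make_pipeline(
    CountVectorizer(lowercase=True),
    TfidfTransformer(),
    LogisticRegression(C=1000, max_iter=1000),
)

scores = cross_val_score(pipe, toy_tweets, toy_labels, cv=3)
print(scores)
```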

<br>***Multinomial Naive Bayes***

In [25]:
nb = MultinomialNB()     #Naive Bayes
nb.fit(X, y)

cross_val_score(nb, X, y, cv=3)

array([ 0.73490427,  0.73303835,  0.73855244])

In [26]:
#Small C gives smaller coefficients.
#With C = 10000 the coefficients range from about -28 to 24. Regularization shrinks the coefficients, so checking their min and max shows its effect.
logreg.coef_.min(), logreg.coef_.max()

(-20.662861614009636, 18.036108603583237)
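The effect of `C` on the coefficients can be illustrated with a short experiment. This is only a sketch on synthetic data from `make_classification`, standing in for the tf-idf matrix (an assumption, as are the chosen `C` values); it prints the smallest and largest coefficient together with the overall coefficient norm for each setting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): stands in for the tf-idf matrix and polarity labels.
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

norms = []
for C in [0.01, 1, 1000]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_toy, y_toy)
    coef = model.coef_[0]
    norms.append(float(np.linalg.norm(coef)))   # overall size of the weight vector
    print(C, coef.min(), coef.max(), norms[-1])
```

The norm grows as `C` grows, because a larger `C` puts less weight on the penalty term.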

<br>***Improvements to make:*** to improve our score, we can try other models (e.g. neural networks), try stemming, or revisit stop-word removal.<br><br>

In [27]:
print(train_tweets[100])
print(X_train_tfidf[100])
print(vectorizer.vocabulary_.get('woo'))

hanging out with biology til 4am woo  !
  (0, 5087)	0.442275696252
  (0, 4637)	0.417479175819
  (0, 2117)	0.428629186309
  (0, 585)	0.459869107486
  (0, 107)	0.484665627919
5087


In [28]:
logreg.coef_[0][5087]
#strong positive coefficient

7.9179434329172311

In [29]:
print(train_tweets[36])
print(X_train_tfidf[36])
print(vectorizer.vocabulary_.get('ugh'))

Hmm im usually dead rite now...ugh skool monday..no more oprah ellen or kathie lee and hoda     i wish u went to skool for a millisecond!!
  (0, 5061)	0.157668011698
  (0, 5015)	0.19748714517
  (0, 4879)	0.236782492398
  (0, 4809)	0.189948566976
  (0, 4161)	0.508079976472
  (0, 3827)	0.267738033303
  (0, 3359)	0.15172580304
  (0, 3099)	0.216924980458
  (0, 3042)	0.267738033303
  (0, 2696)	0.244321070592
  (0, 2548)	0.254039988236
  (0, 2342)	0.149105611639
  (0, 2230)	0.254039988236
  (0, 2227)	0.254039988236
  (0, 1554)	0.236782492398
  (0, 1298)	0.201998307776
4809


In [30]:
logreg.coef_[0][4809]
#strong negative coefficient

-7.4697648471867515
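Instead of looking words up one at a time, the coefficients can be sorted to list the most negative and most positive words. The sketch below does this on a made-up six-tweet corpus; the data and the names `index_to_word`, `most_negative` and `most_positive` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Toy corpus (assumption): 'awesome' occurs only in positive tweets, 'awful' only in negative ones.
toy_tweets = ["what an awesome morning", "awesome game tonight", "this coffee is awesome",
              "what an awful morning", "awful game tonight", "this coffee is awful"]
toy_labels = np.array([1, 1, 1, 0, 0, 0])

toy_vectorizer = CountVectorizer()
toy_tfidf = TfidfTransformer().fit_transform(toy_vectorizer.fit_transform(toy_tweets))
model = LogisticRegression(C=1000, max_iter=1000).fit(toy_tfidf, toy_labels)

# Map column indices back to words, then sort the columns by coefficient.
index_to_word = {idx: word for word, idx in toy_vectorizer.vocabulary_.items()}
order = np.argsort(model.coef_[0])          # ascending: most negative first
most_negative = index_to_word[order[0]]
most_positive = index_to_word[order[-1]]

print(most_negative, most_positive)
```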

In [31]:
vectorizer = CountVectorizer(stop_words='english', ngram_range = (1,2), #Unigrams+Bigrams
                             lowercase=True)
vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
#ngram_range=(2,2) would give bigrams only.
#With (1,2), a tweet like 'finals baby yeah' yields 'finals' + 'baby' + 'yeah' + 'finals baby' + 'baby yeah' = 5 features
#instead of the 3 from before
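The 'finals baby yeah' example from the comments can be checked directly. This small sketch fits three vectorizers on that single string (a toy input, not a tweet from the dataset) and prints the features each `ngram_range` produces.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_tweet = ["finals baby yeah"]

# Fit one vectorizer per n-gram setting and list the learned features alphabetically.
unigrams = sorted(CountVectorizer(ngram_range=(1, 1)).fit(toy_tweet).vocabulary_)
uni_and_bigrams = sorted(CountVectorizer(ngram_range=(1, 2)).fit(toy_tweet).vocabulary_)
bigrams = sorted(CountVectorizer(ngram_range=(2, 2)).fit(toy_tweet).vocabulary_)

print(unigrams)
print(uni_and_bigrams)
print(bigrams)
```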

In [32]:
print(train_tweets[186])
print(X_train[186])
print(vectorizer.vocabulary_.get('new'))
print(vectorizer.vocabulary_.get('starbucks'))
print(vectorizer.vocabulary_.get('learning'))
X_train.shape   #Now we have 18051 features (unigrams and bigrams)

At starbucks with my new sister  learning her new phone.
  (0, 8787)	1
  (0, 8788)	1
  (0, 10896)	2
  (0, 10927)	1
  (0, 10932)	1
  (0, 11854)	1
  (0, 13933)	1
  (0, 13937)	1
  (0, 14435)	1
  (0, 14470)	1
10896
14435
8787


(2034, 18051)

<br>***TfidfTransformer*** refitted on the unigram and bigram counts

In [33]:
#vectorizer.vocabulary_

In [34]:
transformer = TfidfTransformer()
transformer.fit(X_train)
X_train_tfidf = transformer.transform(X_train)

In [35]:
X = X_train_tfidf.toarray()    #converts sparse matrix to ndarray
y = tweets_df['polarity'].values
print(X.shape)
print(y.shape)
print(X)

(2034, 18051)
(2034,)
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]


<br>***Logistic Regression***

In [36]:
logreg = LogisticRegression(C=1000)
logreg.fit(X,y)    #computes coefficients from our training features

cross_val_score(logreg, X, y, cv=3)   #cv = number of cross-validation folds (K)

array([ 0.80265096,  0.81563422,  0.79468242])

In [37]:
sentence = ["I'm crying because it's gloomy out."]
sentence_counts = vectorizer.transform(sentence)
sentence_tfidf = transformer.transform(sentence_counts)
print(logreg.predict(sentence_tfidf.toarray()))  # 0 means negative
print(logreg.predict_proba(sentence_tfidf.toarray()))  # [P(negative), P(positive)]

print(sentence_counts)   #shows only one word was in the original vocabulary ('crying', index 3374)
print(vectorizer.vocabulary_.get('crying'))
print(logreg.coef_[0][3374])   #strong negative coefficient for 'crying'

[0]
[[  9.99939712e-01   6.02883632e-05]]
  (0, 3374)	1
3374
-7.23869236133


In [38]:
sentence = ["Yes! Obama is in town!"]
sentence_counts = vectorizer.transform(sentence)
sentence_tfidf = transformer.transform(sentence_counts)
print(logreg.predict(sentence_tfidf.toarray()))  # 0 means negative (sarcasm? or a model error?)
print(logreg.predict_proba(sentence_tfidf.toarray()))  # [P(negative), P(positive)]

print(sentence_counts)   #shows three words in the original vocabulary (obama, town, yes)
print(vectorizer.vocabulary_.get('town'))
print(vectorizer.vocabulary_.get('yes'))
print(vectorizer.vocabulary_.get('obama'))
print(logreg.coef_[0][15991])   #coefficient for 'town'

[0]
[[ 0.80005783  0.19994217]]
  (0, 11230)	1
  (0, 15991)	1
  (0, 17905)	1
15991
17905
11230
-0.986603939374
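The two cells above repeat the same transform-then-predict steps, so they could be wrapped in a small helper. The function below is a sketch: `predict_sentiment` is a hypothetical name, and the six-tweet training corpus is made up so that the block runs on its own. It also returns the words that were found in the vocabulary, which helps explain predictions that rest on very few known words.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression


def predict_sentiment(sentence, vectorizer, transformer, model):
    """Return (label, probability of the positive class, known words) for one sentence."""
    counts = vectorizer.transform([sentence])
    tfidf = transformer.transform(counts)
    label = int(model.predict(tfidf)[0])
    prob_positive = float(model.predict_proba(tfidf)[0, 1])
    index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}
    known_words = sorted(index_to_word[i] for i in counts.indices)
    return label, prob_positive, known_words


# Toy training data (assumption), standing in for the tweet dataset.
toy_tweets = ["what an awesome morning", "awesome game tonight", "this coffee is awesome",
              "what an awful morning", "awful game tonight", "this coffee is awful"]
toy_labels = np.array([1, 1, 1, 0, 0, 0])

toy_vectorizer = CountVectorizer()
toy_transformer = TfidfTransformer()
toy_tfidf = toy_transformer.fit_transform(toy_vectorizer.fit_transform(toy_tweets))
toy_model = LogisticRegression(C=1000, max_iter=1000).fit(toy_tfidf, toy_labels)

positive_result = predict_sentiment("awesome xyzzy", toy_vectorizer, toy_transformer, toy_model)
negative_result = predict_sentiment("awful", toy_vectorizer, toy_transformer, toy_model)
print(positive_result)
print(negative_result)
```

Words that are not in the fitted vocabulary (here the made-up token 'xyzzy') are silently ignored, which is one reason a short sentence can get an unexpected label.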
