## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we will classify whether the sentiment the user directs at a particular mobile device is positive or negative.

### 1. Read the dataset (tweets.csv) and drop the rows that contain missing values (NA)

In [1]:
import pandas as pd
import numpy as np
import scipy as sp

In [2]:
df1 = pd.read_csv("tweets.csv", encoding = 'latin')

In [3]:
df1.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [4]:
df1.shape

(9093, 3)

In [5]:
df1["is_there_an_emotion_directed_at_a_brand_or_product"].unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

In [6]:
df1.isna().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

In [7]:
df1.dropna(inplace = True)
df1.isna().sum()

tweet_text                                            0
emotion_in_tweet_is_directed_at                       0
is_there_an_emotion_directed_at_a_brand_or_product    0
dtype: int64

### 2. Preprocess data
1. Convert all text to lowercase using `.lower()`.
2. Keep only digits, letters, and the characters `#`, `+`, and `_`; replace everything else with a space using `re.sub()`.
3. Remove leading and trailing whitespace using `.strip()`.

In [8]:
import re

In [9]:
def preprocess(text):
    text = text.lower()
    text = re.sub("[^0-9a-zA-Z#+_]+", " ", text)
    text = text.strip()
    return text
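
To see what these three steps do to a raw tweet, the function can be tried on a made-up example. The sample string below is hypothetical and not a row from `tweets.csv`; the function body repeats the definition above so the snippet runs on its own.

```python
import re

def preprocess(text):
    # 1. lowercase, 2. replace each run of disallowed characters with one space, 3. trim
    text = text.lower()
    text = re.sub("[^0-9a-zA-Z#+_]+", " ", text)
    return text.strip()

sample = ".@user_1 I have a 3G iPhone. #SXSW!"  # hypothetical tweet
cleaned = preprocess(sample)
print(cleaned)  # user_1 i have a 3g iphone #sxsw
```

The `@` of a mention and all punctuation are replaced by spaces, while the hashtag marker `#` and the underscore are kept. Apostrophes are also replaced, so a word like "isn't" is split into two tokens.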

In [10]:
df1["text"] = [preprocess(text) for text in df1.tweet_text]

In [11]:
df1.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | wesley83 i have a 3g iphone after 3 hrs tweeti... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | jessedee know about fludapp awesome ipad iphon... |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | swonderlin can not wait for #ipad 2 also they ... |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion | sxsw i hope this year s festival isn t as cras... |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | sxtxstate great stuff on fri #sxsw marissa may... |


In [12]:
df1["is_there_an_emotion_directed_at_a_brand_or_product"].unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

### 3. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [13]:
df2 = df1[(df1["is_there_an_emotion_directed_at_a_brand_or_product"] == 'Negative emotion') | 
          (df1["is_there_an_emotion_directed_at_a_brand_or_product"] == 'Positive emotion')]

In [14]:
df2["is_there_an_emotion_directed_at_a_brand_or_product"].unique()

array(['Negative emotion', 'Positive emotion'], dtype=object)

### 4. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
%matplotlib inline

In [16]:
vect = CountVectorizer()

In [17]:
df2_dtm = vect.fit_transform(df2.text)
df2_dtm

<3191x5610 sparse matrix of type '<class 'numpy.int64'>'
	with 53151 stored elements in Compressed Sparse Row format>
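
As a rough illustration of what the document-term matrix contains, here is a simplified pure-Python sketch. It is not scikit-learn's implementation: it only mimics the default token pattern of `CountVectorizer`, `(?u)\b\w\w+\b`, which keeps tokens of two or more word characters, and it builds a dense list of lists instead of a sparse matrix. The two documents and the helper name `build_dtm` are made up for the example.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def build_dtm(docs):
    """Return (sorted vocabulary, list of per-document count rows)."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    column = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for term, n in Counter(toks).items():
            row[column[term]] = n
        rows.append(row)
    return vocabulary, rows

docs = ["i love my #ipad 2", "my iphone battery died my day is ruined"]  # hypothetical
vocabulary, dtm = build_dtm(docs)
print(vocabulary)
for row in dtm:
    print(row)
```

Each row is one document and each column one vocabulary term, which is how the 3191 tweets above become a 3191x5610 matrix. With this default pattern, the `#` and `+` characters kept by `preprocess` do not reach the vocabulary, and one-character tokens such as `i` or `2` are dropped.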

### 5. Find the number of distinct words in the vocabulary

In [18]:
df2_dtm.shape[1]

5610

#### Tip: To see all available attributes and methods of an object, use `dir()`

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [19]:
df2["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
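
The classes are clearly imbalanced, which is worth keeping in mind when reading accuracy scores. As a small worked calculation using only the counts printed above, the sketch below computes the accuracy a model would get by always predicting the majority class. This is computed on the full filtered dataset, so the proportion in any particular test split may differ slightly.

```python
# Class counts copied from the value_counts() output above
counts = {"Positive emotion": 2672, "Negative emotion": 519}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total)                        # 3191
print(majority_label)               # Positive emotion
print(round(baseline_accuracy, 3))  # 0.837
```

An accuracy close to 0.84 can therefore be reached without learning anything from the text.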

### 7. Change the labels for Positive and Negative emotions as 1 and 0 respectively and store in a different column in the same dataframe named 'Label'

Hint: use map on that column and give labels

In [20]:
# work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
df2 = df2.copy()
df2["is_there_an_emotion_directed_at_a_brand_or_product"] = df2["is_there_an_emotion_directed_at_a_brand_or_product"].map({'Positive emotion': 1, 'Negative emotion': 0})

df2["is_there_an_emotion_directed_at_a_brand_or_product"].unique()

array([0, 1], dtype=int64)

In [21]:
df3 = df2.rename({'is_there_an_emotion_directed_at_a_brand_or_product': 'Labels'}, axis=1)
df3.head(2)

| | tweet_text | emotion_in_tweet_is_directed_at | Labels | text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | 0 | wesley83 i have a 3g iphone after 3 hrs tweeti... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | 1 | jessedee know about fludapp awesome ipad iphon... |


### 8. Define the feature set (independent variable, X) as the `text` column and `Labels` as the target (dependent variable), then split into train and test datasets

The label column was already renamed to `Labels` in the previous step.

In [22]:
X = df3["text"]
y = df3["Labels"]

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [24]:
X_train.head()

8347    tried installing mention on my iphone but it c...
2381    #ipad2 rocks #sxsw mention apple pop up store ...
8703    what s your take on ipad mention i really want...
4152    aron pilhofer from the new york times just end...
3368    lt guess who won an ipad at the #unsix tweetup...
Name: text, dtype: object

### 9. Predicting the sentiment

#### Use Naive Bayes and Logistic Regression and compare their accuracy scores for predicting the sentiment of the given text

In [25]:
X_train.shape

(2393,)
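
The training size above follows from the defaults of `train_test_split`: when neither `test_size` nor `train_size` is passed, 25% of the rows are held out for testing, and the test count is rounded up. A short worked calculation, assuming those defaults and the 3191 rows reported in the sparse matrix shape earlier:

```python
import math

n_samples = 3191   # rows after keeping only positive and negative tweets
test_size = 0.25   # scikit-learn default when no size is specified

n_test = math.ceil(test_size * n_samples)   # test count is rounded up
n_train = n_samples - n_test                # the remainder is used for training

print(n_train, n_test)  # 2393 798
```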

In [26]:
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# use Naive Bayes to predict the sentiment label
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

In [27]:
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

Accuracy:  0.8483709273182958
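
`metrics.accuracy_score` reports the fraction of test tweets whose predicted label equals the true label. The following is a minimal pure-Python sketch of that calculation on made-up labels, not the notebook's data; the helper name `accuracy` is an assumption for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels: four positive tweets (1) and two negative tweets (0)
y_true_demo = [1, 1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 1, 1, 1, 0]   # one negative tweet misclassified
always_positive = [1] * len(y_true_demo)

print(accuracy(y_true_demo, y_pred_demo))      # 5 of 6 correct
print(accuracy(y_true_demo, always_positive))  # 4 of 6 correct
```

The second call shows why accuracy alone can look flattering on imbalanced data: a predictor that ignores the text still scores well when most labels are positive.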


In [28]:
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# use Logistic Regression to predict the sentiment label
log = LogisticRegression()
log.fit(X_train_dtm, y_train)
y_pred_class_l = log.predict(X_test_dtm)



In [29]:
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class_l))

Accuracy:  0.8696741854636592


### 10. Create a function called `tokenize_predict` that takes a count vectorizer object as input and prints the accuracy for X (text) and y (labels)

In [30]:
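def tokenize_predict(vect):
    """Fit `vect` on X_train, train Multinomial Naive Bayes, and print the test accuracy.

    Relies on the X_train, X_test, y_train and y_test variables defined above.
    """
    # learn the vocabulary from the training text only and build its document-term matrix
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    # encode the test text with the vocabulary learned from the training text
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))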

### 11. Create a count vectorizer that includes n-grams of one to three words (`ngram_range=(1, 3)`) and pass it to the `tokenize_predict` function to print the accuracy score

In [31]:
vect = CountVectorizer(ngram_range=(1, 3))

In [32]:
tokenize_predict(vect)

Features:  51746
Accuracy:  0.8558897243107769
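
With `ngram_range=(1, 3)`, every run of one, two, or three consecutive tokens becomes a feature, which is why the feature count grows so much compared with single words. A simplified sketch of word n-gram generation follows; the helper name `word_ngrams` and the sample tokens are made up, and this is an illustration rather than scikit-learn's code.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all n-grams of consecutive tokens for n in the given inclusive range."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = ["ipad", "battery", "life", "rocks"]  # hypothetical tokenized tweet
features = word_ngrams(tokens)
print(features)
```

For four tokens this yields 4 unigrams, 3 bigrams, and 2 trigrams, 9 features in total.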


### 12. Create a count vectorizer with `stop_words='english'` and pass it to the `tokenize_predict` function to print the accuracy score

In [33]:
vect = CountVectorizer(stop_words='english')

In [34]:
tokenize_predict(vect)

Features:  4647
Accuracy:  0.8571428571428571
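
Stop-word removal discards very common words before counting, which is why the vocabulary shrinks. The sketch below shows the idea with a tiny, hypothetical stop list; the built-in English list used by `stop_words='english'` is much larger, and the helper name `remove_stop_words` is an assumption for illustration.

```python
# Hypothetical, deliberately tiny stop list for demonstration only
STOP_WORDS_DEMO = {"the", "on", "my", "is", "a", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS_DEMO):
    """Keep only the tokens that are not in the stop list."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = "the battery on my iphone is great".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['battery', 'iphone', 'great']
```

For sentiment tasks it may be worth checking whether negations such as "not" are part of the stop list in use, since dropping them can change the meaning of a tweet.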


### 13. Create a count vectorizer with `stop_words='english'` and `max_features=300` and pass it to the `tokenize_predict` function to print the accuracy score

In [35]:
vect = CountVectorizer(stop_words='english', max_features = 300)
tokenize_predict(vect)

Features:  300
Accuracy:  0.8095238095238095
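
`max_features` keeps only the terms with the highest total count across the training corpus. A simplified sketch of that selection on made-up documents is shown below; the helper name `top_k_vocabulary` is hypothetical, and the way ties at the cutoff are broken here may differ from scikit-learn.

```python
from collections import Counter

def top_k_vocabulary(tokenized_docs, k):
    """Return the k terms with the highest total count, sorted alphabetically."""
    totals = Counter(tok for doc in tokenized_docs for tok in doc)
    return sorted(term for term, _ in totals.most_common(k))

docs = [
    ["ipad", "rocks", "ipad"],
    ["iphone", "battery", "ipad"],
    ["battery", "died"],
]  # hypothetical tokenized tweets
vocab = top_k_vocabulary(docs, k=2)
print(vocab)  # ['battery', 'ipad']
```

Cutting the vocabulary this aggressively discards rarer words that may carry sentiment, which is consistent with the lower accuracy seen with 300 features.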


### 14. Create a count vectorizer with n-grams of one to three words (`ngram_range=(1, 3)`) and `max_features=15000` and pass it to the `tokenize_predict` function to print the accuracy score

In [36]:
vect = CountVectorizer(ngram_range=(1, 3), max_features = 15000)
tokenize_predict(vect)

Features:  15000
Accuracy:  0.8583959899749374


### 15. Create a count vectorizer with n-grams of one to three words (`ngram_range=(1, 3)`) that keeps only terms appearing in at least 2 documents (`min_df=2`) and pass it to the `tokenize_predict` function to print the accuracy score

In [37]:
vect = CountVectorizer(ngram_range=(1, 3), min_df=2)
tokenize_predict(vect)

Features:  12733
Accuracy:  0.8521303258145363
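
An integer `min_df` is a threshold on document frequency, that is, the number of documents a term appears in, not its total number of occurrences. A simplified sketch on made-up documents follows; the helper name `min_df_vocabulary` is hypothetical.

```python
from collections import Counter

def min_df_vocabulary(tokenized_docs, min_df):
    """Return terms that occur in at least `min_df` documents, sorted alphabetically."""
    # set(doc) makes each term count at most once per document
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

docs = [
    ["rocks", "rocks", "ipad"],
    ["ipad", "battery"],
    ["battery", "died"],
]  # hypothetical tokenized tweets
vocab_min_df = min_df_vocabulary(docs, min_df=2)
print(vocab_min_df)  # ['battery', 'ipad']
```

Here "rocks" occurs twice but only in one document, so it is dropped, while "ipad" and "battery" each appear in two documents and are kept.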
