## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we classify whether the sentiment the user expresses toward a particular mobile device is positive or not.

### 1. Read the dataset (`tweets.csv`) and drop the rows that contain missing values

In [276]:
import pandas as pd  
import re
data = pd.read_csv("tweets.csv",encoding="unicode_escape")

In [277]:
data.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [278]:
data.isna().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

In [279]:
data.shape

(9093, 3)

In [280]:
data.dropna(inplace=True)

In [281]:
data.shape

(3291, 3)

### 2. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [282]:
import unicodedata

In [283]:
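def preprocess(text, remove_digits=False):
    """Strip accents, drop non-alphanumeric characters, and lowercase a tweet."""
    try:
        # Note: the range A-z also matches [ \ ] ^ _ and `, so those characters
        # survive preprocessing; use A-Z instead to drop them as well.
        pattern = r'[^a-zA-z0-9\s]' if not remove_digits else r'[^a-zA-z\s]'
        # Decompose accented characters, then discard anything outside ASCII.
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        text = re.sub(pattern, '', text)
        text = text.lower()
        return text
    except Exception:
        # Non-string input (for example NaN) becomes an empty string.
        return ""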
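
To see what the `unicodedata.normalize` step in `preprocess` does on its own, here is a minimal sketch. The helper name `to_ascii` and the sample strings are made up for illustration and are not taken from `tweets.csv`.

```python
import unicodedata

def to_ascii(text):
    # NFKD splits an accented letter into a base letter plus a combining mark;
    # encoding to ASCII with errors='ignore' then drops the mark.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

accented = to_ascii('caf\u00e9')   # 'cafe'
curly = to_ascii('I\u2019m')       # 'Im': the curly apostrophe has no ASCII form and is dropped
```

Characters with no ASCII equivalent are removed silently, so contractions typed with curly apostrophes lose the apostrophe before the regular expression ever runs.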

In [284]:
data['text'] = [preprocess(text,True) for text in data.tweet_text]

In [285]:
data.head(2)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,wesley i have a g iphone after hrs tweeting a...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,jessedee know about fludapp awesome ipadiphon...


### 3. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [286]:
X = data[data['is_there_an_emotion_directed_at_a_brand_or_product']=='Positive emotion']

In [287]:
X.shape

(2672, 4)

In [288]:
X_neg = data[data['is_there_an_emotion_directed_at_a_brand_or_product']=='Negative emotion']

In [289]:
X_neg.shape

(519, 4)

In [290]:
reviewData = pd.concat([X, X_neg])

In [291]:
reviewData.shape

(3191, 4)

In [292]:
reviewData['is_there_an_emotion_directed_at_a_brand_or_product'].unique()

array(['Positive emotion', 'Negative emotion'], dtype=object)

### 4. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [293]:
from sklearn.feature_extraction.text import CountVectorizer

In [294]:
vect = CountVectorizer(stop_words = 'english') 

In [295]:
train_data_features = vect.fit_transform(reviewData['text'])

In [296]:
train_data_features.shape

(3191, 5706)
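
The shape `(3191, 5706)` means one row per tweet and one column per vocabulary term. The idea behind a document-term matrix can be sketched in plain Python as follows. This is a simplified stand-in, not `CountVectorizer`'s implementation: it splits on whitespace only and does not remove stop words, and the three documents and the helper `build_dtm` are hypothetical.

```python
from collections import Counter

docs = ["ipad is great", "iphone battery is bad", "great great ipad app"]

def build_dtm(docs):
    # Tokenize by whitespace (a simplification of the real tokenizer).
    tokenized = [doc.split() for doc in docs]
    # The vocabulary is the sorted set of all distinct tokens.
    vocab = sorted({tok for doc in tokenized for tok in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # One count per vocabulary term; a Counter returns 0 for absent terms.
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows

vocab, dtm = build_dtm(docs)
print(len(docs), len(vocab))  # rows and columns of the matrix
```

Most entries are zero because each document uses only a few of the vocabulary terms, which is why `fit_transform` returns a sparse matrix.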

### 5. Find number of different words in vocabulary

In [297]:
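vocab = vect.get_feature_names()  # renamed to get_feature_names_out() in scikit-learn 1.0
print(len(vocab))  # number of distinct terms in the vocabulary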



#### Tip: To see all available attributes and methods of an object, use `dir()`

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [298]:
reviewData['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
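
Because the classes are imbalanced, it helps to know what accuracy a trivial classifier would reach. The arithmetic below uses the counts shown above; it describes the full filtered dataset, so the proportion in any particular test split may differ slightly.

```python
positive, negative = 2672, 519
total = positive + negative

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = max(positive, negative) / total
print(round(baseline_accuracy, 3))
```

A model that always predicts "Positive emotion" would be right on roughly 84% of these tweets, so accuracy scores need to be read against that figure rather than against 50%.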

### 7. Change the labels for Positive and Negative emotions to 1 and 0 respectively and store them in a new column named `label` in the same dataframe

Hint: use map on that column and give labels

In [299]:
from sklearn.preprocessing import LabelEncoder

# Create an object of the label encoder class
labelencoder = LabelEncoder()

In [300]:
reviewData['label'] = labelencoder.fit_transform(reviewData['is_there_an_emotion_directed_at_a_brand_or_product'])

In [301]:
reviewData['label'].value_counts()

1    2672
0     519
Name: label, dtype: int64
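
`LabelEncoder` assigns integers in sorted class order, which happens to give Negative = 0 and Positive = 1 here. The hint's alternative, an explicit mapping, makes that choice visible. The sketch below uses plain lists to compare the two approaches; the `emotions` list is illustrative, and with pandas the same dictionary could be passed to `Series.map`.

```python
emotions = ['Positive emotion', 'Negative emotion', 'Positive emotion']

# Explicit mapping: the intended encoding is stated directly.
label_map = {'Positive emotion': 1, 'Negative emotion': 0}
mapped = [label_map[e] for e in emotions]

# What sorted-order encoding does: index each value in the sorted class list.
classes = sorted(set(emotions))          # ['Negative emotion', 'Positive emotion']
encoded = [classes.index(e) for e in emotions]
```

Both approaches agree for these two class names, but the explicit dictionary would keep working as intended if the class names were ever renamed.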

### 8. Define the feature set (independent variable, X) to be the `text` column and `label` as the target (dependent variable), and divide into train and test datasets

In [302]:
from sklearn.model_selection import train_test_split

In [303]:
X = reviewData['text']
Y = reviewData['label']

In [311]:
vect = CountVectorizer(stop_words = 'english')
train_data_features = vect.fit_transform(X)

In [312]:
x_names = vect.get_feature_names()

In [313]:
train_data_features.shape

(3191, 5706)

In [316]:
len(x_names)

5706

In [317]:
review_df = pd.DataFrame(train_data_features.toarray(),columns=x_names)

In [318]:
review_df.head()  # document-term matrix

Unnamed: 0,______,_______quot,a_,a_i_____oei_aoycu,aa,aapl,abacus,abandoned,aber,able,...,zero,zimride,zip,zite,zms,zombies,zomg,zone,zoom,zzzs
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
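
The underscore-only column names in the matrix above (for example `______`) are most likely a side effect of the character class used in `preprocess`: the range `A-z` spans the ASCII characters between `Z` and `a`, which include `[`, `\`, `]`, `^`, `_` and the backtick. A minimal sketch of the difference, using a made-up sample string:

```python
import re

as_written = r'[^a-zA-z\s]'   # pattern used in preprocess when remove_digits=True
strict = r'[^a-zA-Z\s]'       # letters and whitespace only

sample = "a_b [c] d^e"

kept_as_written = re.sub(as_written, '', sample)  # underscores, brackets and ^ survive
kept_strict = re.sub(strict, '', sample)          # only letters and spaces survive
```

Switching to `A-Z` would change the vocabulary, so the feature counts reported in this notebook would no longer match.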


In [320]:
X_train, X_test, y_train, y_test = train_test_split(review_df, Y, test_size=0.30, random_state=42)

### 9. Predicting the sentiment

#### Use Naive Bayes and Logistic Regression to predict the sentiment of the given text and report their accuracy scores

In [325]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

In [260]:
GBmodel = GaussianNB()

In [326]:
MBModel = MultinomialNB()

In [321]:
GBmodel.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [327]:
MBModel.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [322]:
GBmodel.score(X_test,y_test)

0.7693110647181628

In [328]:
MBModel.score(X_test,y_test)

0.8444676409185804

In [329]:
LRmodel = LogisticRegression()

In [330]:
LRmodel.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [331]:
LRmodel.score(X_test,y_test)

0.8653444676409185
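
The three scores can be turned back into counts of correctly classified tweets. The sketch below assumes that `train_test_split` rounds the test size up, so 30% of 3191 rows gives 958 test rows; the dictionary keys are just labels for the three models above.

```python
import math

n_samples = 3191
n_test = math.ceil(0.30 * n_samples)   # assumed test-set size
n_train = n_samples - n_test

scores = {
    'GaussianNB': 0.7693110647181628,
    'MultinomialNB': 0.8444676409185804,
    'LogisticRegression': 0.8653444676409185,
}

# Each accuracy is (correct predictions) / n_test, so multiplying recovers the count.
correct = {name: round(acc * n_test) for name, acc in scores.items()}
print(n_train, n_test, correct)
```

Seen this way, the gap between MultinomialNB and Logistic Regression is 20 tweets out of 958.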

### 10. Create a function called `tokenize_test` which takes a count vectorizer object as input and prints the accuracy for x (text) and y (labels)

In [338]:
from sklearn import metrics

In [339]:
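def tokenize_test(vect):
    """Fit `vect` on the training text, train MultinomialNB, and print test accuracy.

    Relies on the module-level x_train, x_test, y_train and y_test.
    """
    # Learn the vocabulary from the training split only.
    x_train_dtm = vect.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    # Reuse the fitted vocabulary for the test split; unseen terms are ignored.
    x_test_dtm = vect.transform(x_test)
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))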

In [332]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=42)

In [333]:
vectNew = CountVectorizer(stop_words = 'english')

In [340]:
tokenize_test(vectNew)

Features:  4665
Accuracy:  0.860125260960334


### 11. Create a count vectorizer with `ngram_range=(1, 2)` and pass it to the `tokenize_test` function to print the accuracy score

In [343]:
vectNew = CountVectorizer(stop_words = 'english',ngram_range=(1, 2))

In [344]:
tokenize_test(vectNew)

Features:  18065
Accuracy:  0.8653444676409185
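
With `ngram_range=(1, 2)` the vocabulary contains single words and pairs of adjacent words, which is why the feature count grows so sharply. A minimal sketch of n-gram extraction over an already tokenized tweet follows; the function name `ngrams` and the sample tokens are illustrative.

```python
def ngrams(tokens, n_min=1, n_max=2):
    # Slide a window of each size from n_min to n_max over the token list.
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

features = ngrams(["not", "worth", "buying"])
print(features)
```

Bigrams such as "not worth" let the model see some word order, which single-word counts discard.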


### 12. Create a count vectorizer with `stop_words='english'` and pass it to the `tokenize_test` function to print the accuracy score

This was already done in step 10, where `vectNew` was created with `stop_words='english'` (4665 features, accuracy 0.8601).

### 13. Create a count vectorizer with `stop_words='english'` and `max_features=300` and pass it to the `tokenize_test` function to print the accuracy score

In [345]:
vectNew = CountVectorizer(stop_words = 'english',max_features =300)

In [346]:
tokenize_test(vectNew)

Features:  300
Accuracy:  0.826722338204593


### 14. Create a count vectorizer with `ngram_range=(1, 2)` and `max_features=15000` and pass it to the `tokenize_test` function to print the accuracy score

In [347]:
vectNew = CountVectorizer(ngram_range=(1, 2),max_features =15000)
tokenize_test(vectNew)

Features:  15000
Accuracy:  0.8632567849686847


### 15. Create a count vectorizer with `ngram_range=(1, 2)` that includes only terms appearing in at least 2 documents (`min_df=2`) and pass it to the `tokenize_test` function to print the accuracy score

In [348]:
vectNew = CountVectorizer(ngram_range=(1, 2),min_df = 2)
tokenize_test(vectNew)

Features:  7144
Accuracy:  0.8465553235908142
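
An integer `min_df` is a threshold on document frequency, meaning the number of documents that contain a term, not on how many times the term occurs overall. The sketch below shows the distinction on a hypothetical three-document corpus.

```python
from collections import Counter

docs = ["ipad ipad ipad", "iphone app", "iphone great"]

# Document frequency: count each term at most once per document.
doc_freq = Counter(tok for doc in docs for tok in set(doc.split()))

min_df = 2
vocab = sorted(tok for tok, df in doc_freq.items() if df >= min_df)
```

"ipad" occurs three times but in a single document, so it is dropped, while "iphone" appears in two documents and is kept.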
