## Sentiment analysis

The objective of this problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of each tweet, we classify whether the sentiment the user directs at a particular mobile device is positive or negative.

In [2]:
import pandas as pd

### 1. Read the dataset (tweets.csv) and drop the rows with missing values

In [4]:
sentdata = pd.read_csv("tweets.csv",engine='python')

In [22]:
sentdata.shape

(9093, 4)

In [25]:
sentdata.isna().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
text                                                     0
dtype: int64

In [33]:
sentdata.dropna(axis = 0,inplace = True)

In [34]:
sentdata.shape

(3291, 4)
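
Most of the rows lost here (the count falls from 9093 to 3291) are missing only `emotion_in_tweet_is_directed_at`, because `dropna(axis = 0)` discards a row if any column is empty. The sketch below, on a made-up toy frame with the same column names, contrasts that with restricting the check to `tweet_text`. Whether keeping those extra rows is desirable depends on the task; this is an illustration, not a change to the analysis.

```python
import pandas as pd

# Toy frame mimicking the tweets.csv columns (the rows are made up for illustration).
toy = pd.DataFrame({
    "tweet_text": ["love my iPad", None, "battery died again", "at #SXSW"],
    "emotion_in_tweet_is_directed_at": ["iPad", "iPhone", None, None],
    "is_there_an_emotion_directed_at_a_brand_or_product": [
        "Positive emotion", "Negative emotion", "Negative emotion",
        "No emotion toward brand or product",
    ],
})

# Dropping on every column also removes rows that only lack the brand.
dropped_any = toy.dropna(axis=0)

# Restricting the check to tweet_text keeps rows whose only gap is the brand column.
dropped_text_only = toy.dropna(subset=["tweet_text"])

print(len(dropped_any), len(dropped_text_only))
```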

### 2. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [35]:
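def preprocess(text):
    # Keep ASCII characters and replace every other character with a space.
    try:
        return ''.join(i if ord(i) < 128 else ' ' for i in text)
    except Exception:
        # Non-string input (for example NaN) cannot be iterated; return an empty string.
        return ""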

In [36]:
sentdata['text'] = [preprocess(text) for text in sentdata.tweet_text]

In [37]:
sentdata['text']

0       .@wesley83 I have a 3G iPhone. After 3 hrs twe...
1       @jessedee Know about @fludapp ? Awesome iPad/i...
2       @swonderlin Can not wait for #iPad 2 also. The...
3       @sxsw I hope this year's festival isn't as cra...
4       @sxtxstate great stuff on Fri #SXSW: Marissa M...
                              ...                        
9077    @mention your PR guy just convinced me to swit...
9079    &quot;papyrus...sort of like the ipad&quot; - ...
9080    Diller says Google TV &quot;might be run over ...
9085    I've always used Camera+ for my iPhone b/c it ...
9088                        Ipad everywhere. #SXSW {link}
Name: text, Length: 3291, dtype: object
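
The output still contains HTML entities such as `&quot;`, because `preprocess` only filters by character code. As a rough sketch, the variant below (the name `preprocess_unescaped` is hypothetical and not part of the notebook) decodes entities with the standard library's `html.unescape` before applying the same ASCII rule.

```python
import html

def preprocess(text):
    # Same rule as the notebook: keep ASCII characters, replace the rest with a space.
    try:
        return ''.join(ch if ord(ch) < 128 else ' ' for ch in text)
    except Exception:
        # Non-string input (for example NaN) ends up here.
        return ""

def preprocess_unescaped(text):
    # Hypothetical variant: decode HTML entities first, then apply the ASCII filter.
    return preprocess(html.unescape(text)) if isinstance(text, str) else ""

sample = "&quot;papyrus\u2026sort of like the ipad&quot; caf\u00e9"
print(preprocess(sample))
print(preprocess_unescaped(sample))
```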

### 3. Consider only rows having a Positive or Negative emotion and remove other rows from the dataframe.

In [38]:
sentdata['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion                      2672
Negative emotion                       519
No emotion toward brand or product      91
I can't tell                             9
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [39]:
sentdata1 = sentdata.loc[sentdata['is_there_an_emotion_directed_at_a_brand_or_product'].isin(['Positive emotion','Negative emotion'])]

In [40]:
sentdata1['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
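
A frame produced by a boolean filter like this one is tracked by pandas as a possible copy of the original, and assigning to it later can raise `SettingWithCopyWarning`. A minimal sketch of one common way around it, on made-up data with hypothetical column names, is to take an explicit `.copy()` right after filtering.

```python
import pandas as pd

toy = pd.DataFrame({
    "sentiment": ["Positive emotion", "Negative emotion", "I can't tell"],
    "text": ["great app", "crashy app", "an app"],
})

# .copy() makes the filtered frame independent of the original,
# so later column assignments do not raise SettingWithCopyWarning.
kept = toy.loc[toy["sentiment"].isin(["Positive emotion", "Negative emotion"])].copy()
kept["n_chars"] = kept["text"].str.len()

print(kept)
```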

In [50]:
# drop() returns a new dataframe; reassigning avoids the SettingWithCopyWarning that inplace=True raises on a filtered frame
sentdata1 = sentdata1.drop(columns='tweet_text')


In [51]:
sentdata1

| | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|
| 0 | iPhone | Negative emotion | .@wesley83 I have a 3G iPhone. After 3 hrs twe... |
| 1 | iPad or iPhone App | Positive emotion | @jessedee Know about @fludapp ? Awesome iPad/i... |
| 2 | iPad | Positive emotion | @swonderlin Can not wait for #iPad 2 also. The... |
| 3 | iPad or iPhone App | Negative emotion | @sxsw I hope this year's festival isn't as cra... |
| 4 | Google | Positive emotion | @sxtxstate great stuff on Fri #SXSW: Marissa M... |
| ... | ... | ... | ... |
| 9077 | iPhone | Positive emotion | @mention your PR guy just convinced me to swit... |
| 9079 | iPad | Positive emotion | &quot;papyrus...sort of like the ipad&quot; - ... |
| 9080 | Other Google product or service | Negative emotion | Diller says Google TV &quot;might be run over ... |
| 9085 | iPad or iPhone App | Positive emotion | I've always used Camera+ for my iPhone b/c it ... |


### 4. Represent text as numerical data using `CountVectorizer` and get the document-term matrix



In [42]:
from sklearn.feature_extraction.text import CountVectorizer

In [63]:
Countvect = CountVectorizer()

In [80]:
X = Countvect.fit_transform(sentdata1['text'])

In [79]:
# X is a SciPy sparse matrix with one row per tweet and one column per vocabulary term.
# Inspect it directly; storing it in a dataframe column would repeat the whole matrix in every row.
X.shape
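
To make the shape of a document-term matrix concrete, here is a small sketch on three made-up sentences (not taken from the dataset). Each row is a document, each column is a term, and each cell holds how often the term occurs in that document.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["iPad is great", "iPhone battery is dead", "great great iPad app"]

vec = CountVectorizer()          # lowercases text and keeps tokens of two or more characters by default
dtm = vec.fit_transform(docs)    # sparse matrix: one row per document, one column per term

print(dtm.shape)
print(sorted(vec.vocabulary_))
print(dtm.toarray())
```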

In [78]:
list(Countvect.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

['000',
 '02',
 '03',
 '08',
 '10',
 '100',
 '100s',
 '100tc',
 '101',
 '106',
 '10am',
 '10k',
 '10mins',
 '10pm',
 '10x',
 '11',
 '11ntc',
 '11th',
 '12',
 '12b',
 '12th',
 '13',
 '130',
 '14',
 '1406',
 '1413',
 '1415',
 '15',
 '150',
 '1500',
 '150m',
 '157',
 '15am',
 '15k',
 '16162',
 '16gb',
 '16mins',
 '17',
 '188',
 '1986',
 '1990style',
 '1m',
 '1of',
 '1pm',
 '1st',
 '20',
 '200',
 '2010',
 '2011',
 '2012',
 '20s',
 '21',
 '22',
 '23',
 '24',
 '25',
 '250k',
 '25th',
 '2am',
 '2day',
 '2honor',
 '2moro',
 '2nd',
 '2nite',
 '2s',
 '2yrs',
 '30',
 '300',
 '3000',
 '30a',
 '30am',
 '30p',
 '30pm',
 '32',
 '32gb',
 '35',
 '36',
 '37',
 '3d',
 '3g',
 '3gs',
 '3k',
 '3rd',
 '3x',
 '40',
 '400',
 '40min',
 '41',
 '45',
 '45am',
 '47',
 '48',
 '4android',
 '4chan',
 '4g',
 '4nqv92l',
 '4sq',
 '4sq3',
 '4square',
 '50',
 '54',
 '55',
 '58',
 '59',
 '59p',
 '59pm',
 '5pm',
 '5th',
 '60',
 '64g',
 '64gb',
 '64gig',
 '64mb',
 '65',
 '6hours',
 '6th',
 '70',
 '75',
 '7th',
 '80',
 '800',

### 5. Find number of different words in vocabulary

In [77]:
Countvect.vocabulary_

{'wesley83': 5413,
 'have': 2282,
 '3g': 79,
 'iphone': 2637,
 'after': 225,
 'hrs': 2422,
 'tweeting': 5141,
 'at': 415,
 'rise_austin': 4141,
 'it': 2659,
 'was': 5370,
 'dead': 1287,
 'need': 3300,
 'to': 5011,
 'upgrade': 5227,
 'plugin': 3702,
 'stations': 4616,
 'sxsw': 4769,
 'jessedee': 2688,
 'know': 2779,
 'about': 150,
 'fludapp': 1904,
 'awesome': 472,
 'ipad': 2627,
 'app': 345,
 'that': 4915,
 'you': 5577,
 'll': 2923,
 'likely': 2893,
 'appreciate': 365,
 'for': 1933,
 'its': 2661,
 'design': 1348,
 'also': 279,
 'they': 4939,
 're': 3956,
 'giving': 2099,
 'free': 1965,
 'ts': 5110,
 'swonderlin': 4760,
 'can': 802,
 'not': 3361,
 'wait': 5342,
 'should': 4375,
 'sale': 4194,
 'them': 4925,
 'down': 1498,
 'hope': 2395,
 'this': 4953,
 'year': 5559,
 'festival': 1836,
 'isn': 2654,
 'as': 405,
 'crashy': 1180,
 'sxtxstate': 4794,
 'great': 2175,
 'stuff': 4681,
 'on': 3434,
 'fri': 1970,
 'marissa': 3059,
 'mayer': 3090,
 'google': 2138,
 'tim': 4986,
 'reilly': 4028,
 

In [67]:
len(Countvect.vocabulary_)

5610
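
The `vocabulary_` attribute maps each term to its column index, so its length equals the number of columns in the document-term matrix. The sketch below, on a made-up three-document corpus, uses that mapping to read corpus-wide counts per term from the column sums.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["ipad app", "ipad launch", "iphone app crash"]
vec = CountVectorizer()
dtm = vec.fit_transform(docs)

# vocabulary_ maps each term to its column index in the matrix.
n_terms = len(vec.vocabulary_)

# Column sums give the total count of each term across all documents.
totals = np.asarray(dtm.sum(axis=0)).ravel()
counts = {term: int(totals[idx]) for term, idx in vec.vocabulary_.items()}

print(n_terms, counts)
```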

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [68]:
sentdata1['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
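
These counts show a clear class imbalance, which gives a useful reference point for any accuracy score: a model that always predicts the majority class is already right most of the time. The arithmetic below uses only the two counts reported above; the share within a particular test split may differ slightly.

```python
positive, negative = 2672, 519   # counts reported by value_counts() above

total = positive + negative
# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(positive, negative) / total

print(total, round(majority_baseline, 4))
```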

### 7. Encode Positive and Negative emotions as 1 and 0 respectively and store them in a new column named `Label` in the same dataframe

Hint: use map on that column and give labels

In [69]:
# assign() returns a new dataframe, which avoids the SettingWithCopyWarning raised by in-place column assignment on a filtered frame
sentdata1 = sentdata1.assign(Label=sentdata1.is_there_an_emotion_directed_at_a_brand_or_product.map({'Positive emotion': 1, 'Negative emotion': 0}))


In [70]:
sentdata1.head()

| | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text | Label |
|---|---|---|---|---|
| 0 | iPhone | Negative emotion | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | 0 |
| 1 | iPad or iPhone App | Positive emotion | @jessedee Know about @fludapp ? Awesome iPad/i... | 1 |
| 2 | iPad | Positive emotion | @swonderlin Can not wait for #iPad 2 also. The... | 1 |
| 3 | iPad or iPhone App | Negative emotion | @sxsw I hope this year's festival isn't as cra... | 0 |
| 4 | Google | Positive emotion | @sxtxstate great stuff on Fri #SXSW: Marissa M... | 1 |
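
One property of `Series.map` with a dictionary is worth keeping in mind: any value that is not a key in the dictionary becomes `NaN`. The toy series below (made up for illustration) shows this, along with a quick check that can be run after encoding.

```python
import pandas as pd

labels = pd.Series(["Positive emotion", "Negative emotion", "I can't tell"])
mapping = {"Positive emotion": 1, "Negative emotion": 0}

encoded = labels.map(mapping)
print(encoded)

# A non-zero count here means some category was missing from the mapping.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```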


### 8. Define the feature set (independent variable, X) as the `text` column and the `Label` column as the target (dependent variable), then split into train and test datasets

In [129]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(sentdata1.text, sentdata1.Label, random_state=234)
print(x_train.shape,x_test.shape,y_train.shape, y_test.shape)

(2393,) (798,) (2393,) (798,)
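
The split above relies on the default `test_size` of 0.25, which is why 798 of the 3191 rows land in the test set. With imbalanced labels it can also help to pass `stratify`, so both parts keep the same class proportions. The sketch below shows this on made-up data; it is an optional variation, and using it on the real data would change the reported scores.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a similar imbalance (80% positive); the values are made up.
X = pd.Series([f"tweet {i}" for i in range(100)])
y = pd.Series([1] * 80 + [0] * 20)

# stratify=y keeps the positive/negative ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=234
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```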


In [147]:
Cvect = CountVectorizer(ngram_range = (1,2),min_df = 2)
x_train_dtm = Cvect.fit_transform(x_train)
x_test_dtm = Cvect.transform(x_test)
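
The vectorizer is fitted on the training text only and then reused on the test text, so both matrices share the same columns and no information from the test set leaks into the vocabulary. A small sketch with made-up sentences shows the consequence: words that occur only in the test text are silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["great ipad app", "iphone battery dead"]
test_docs = ["great android app"]   # "android" never appears in the training text

vec = CountVectorizer()
train_dtm = vec.fit_transform(train_docs)   # learns the vocabulary from the training text
test_dtm = vec.transform(test_docs)         # reuses that vocabulary; unseen terms are dropped

print(train_dtm.shape, test_dtm.shape)
print(test_dtm.sum())
```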

## 9. **Predicting the sentiment:**


### Use Naive Bayes and Logistic Regression and print their accuracy scores for predicting the sentiment of the given text

In [148]:
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
gnb = GaussianNB()
y_pred = gnb.fit(x_train_dtm.toarray(), y_train).predict(x_test_dtm.toarray())

In [149]:
metrics.accuracy_score(y_test, y_pred)

0.8395989974937343

In [150]:
from sklearn.linear_model import LogisticRegression
y_pred1 = LogisticRegression().fit(x_train_dtm.toarray(), y_train).predict(x_test_dtm.toarray())

In [151]:
metrics.accuracy_score(y_test, y_pred1)

0.8696741854636592

In [152]:
from sklearn.naive_bayes import ComplementNB
from sklearn import metrics
cnb = ComplementNB()
y_pred = cnb.fit(x_train_dtm.toarray(), y_train).predict(x_test_dtm.toarray())
metrics.accuracy_score(y_test, y_pred)

0.8521303258145363
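
All three scores lie within a few points of the share of positive tweets in the filtered data (2672 of 3191, roughly 84%), so accuracy alone says little about how well the negative class is recognised. As a hedged illustration on made-up labels, the sketch below shows a classifier that always predicts "positive": its accuracy looks respectable while its recall on the negative class is zero.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 8 positive (1) and 2 negative (0) test tweets.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A trivial classifier that always predicts "positive".
y_always_pos = [1] * 10

acc = accuracy_score(y_true, y_always_pos)
neg_recall = recall_score(y_true, y_always_pos, pos_label=0)
cm = confusion_matrix(y_true, y_always_pos)   # rows: true class, columns: predicted class

print(acc, neg_recall)
print(cm)
```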

## 10. Create a function called `tokenize_test` that takes a count vectorizer object as input, builds document-term matrices from `x_train` and `x_test`, trains a model on them, and prints the accuracy

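In [108]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

def tokenize_test(vect):
    """Fit `vect` on x_train, train MultinomialNB, and print the test accuracy.

    Relies on the global x_train, x_test, y_train and y_test defined above.
    """
    # Learn the vocabulary from the training text only
    x_train_dtm = vect.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    # Reuse that vocabulary for the test text
    x_test_dtm = vect.transform(x_test)
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))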
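
Because `tokenize_test` reads its data from global variables and only prints its result, it is awkward to reuse elsewhere. The following sketch is one possible variant, assuming a hypothetical helper name `evaluate_vectorizer`: it takes the data as arguments, chains the vectorizer and classifier with scikit-learn's `make_pipeline`, and returns the accuracy. The training and test sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def evaluate_vectorizer(vect, x_train, y_train, x_test, y_test):
    """Hypothetical variant of tokenize_test that returns the accuracy."""
    # The pipeline fits the vectorizer on the training text and reuses it at predict time.
    model = make_pipeline(vect, MultinomialNB())
    model.fit(x_train, y_train)
    return accuracy_score(y_test, model.predict(x_test))

# Made-up data, only to show the call.
train_x = ["love this ipad", "great app", "battery is dead", "app keeps crashing"]
train_y = [1, 1, 0, 0]
test_x = ["love this app", "dead battery"]
test_y = [1, 0]

acc = evaluate_vectorizer(CountVectorizer(), train_x, train_y, test_x, test_y)
print(acc)
```

Returning the score instead of printing it makes it straightforward to collect results for several vectorizer settings in a list or dataframe.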

### Create a count vectorizer that includes unigrams and bigrams (`ngram_range = (1,2)`) and pass it to the `tokenize_test` function to print the accuracy score

In [109]:

x_train, x_test, y_train, y_test = train_test_split(sentdata1.text, sentdata1.Label, random_state=234)
vect = CountVectorizer(ngram_range = (1,2))
tokenize_test(vect)
#tokenize_test(CountVectorizer(ngram_range = (1,2)))

Features:  24422
Accuracy:  0.8734335839598998


### Create a count vectorizer with `stop_words = 'english'` and pass it to the `tokenize_test` function to print the accuracy score

In [111]:
vect = CountVectorizer(stop_words = 'english')
tokenize_test(vect)

Features:  4600
Accuracy:  0.8571428571428571


### Create a count vectorizer with `stop_words = 'english'` and `max_features = 300` and pass it to the `tokenize_test` function to print the accuracy score

In [112]:
tokenize_test(CountVectorizer(stop_words = 'english',max_features = 300))

Features:  300
Accuracy:  0.8220551378446115


### Create a count vectorizer with `ngram_range = (1,2)` and `max_features = 15000` and pass it to the `tokenize_test` function to print the accuracy score

In [114]:
tokenize_test(CountVectorizer(ngram_range = (1,2) ,max_features = 15000))

Features:  15000
Accuracy:  0.868421052631579


### Create a count vectorizer with `ngram_range = (1,2)` that keeps only terms appearing in at least 2 documents (`min_df = 2`) and pass it to the `tokenize_test` function to print the accuracy score

In [115]:
tokenize_test(CountVectorizer(ngram_range = (1,2) ,min_df = 2))

Features:  7519
Accuracy:  0.8634085213032582
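
The feature counts reported in this section follow a pattern: adding bigrams inflates the vocabulary, while stop-word removal, `max_features`, and `min_df` shrink it. The sketch below reproduces that pattern on a five-sentence made-up corpus, so the effect of each parameter can be seen in isolation. The configuration names are labels chosen for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great ipad app", "great iphone app", "ipad battery dead",
    "the iphone is great", "the app is dead",
]

configs = {
    "unigrams": CountVectorizer(),
    "uni+bigrams": CountVectorizer(ngram_range=(1, 2)),
    "stop words removed": CountVectorizer(stop_words="english"),
    "uni+bigrams, min_df=2": CountVectorizer(ngram_range=(1, 2), min_df=2),
    "max_features=3": CountVectorizer(max_features=3),
}

# Fit each vectorizer on the same corpus and record how many features it produces.
n_features = {name: v.fit_transform(docs).shape[1] for name, v in configs.items()}
for name, n in n_features.items():
    print(f"{name:>24}: {n} features")
```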
