## Sentiment analysis

The objective of this problem is to perform sentiment analysis on tweets in which users comment on various mobile devices.
Based on the text of each tweet, we classify whether the sentiment the user directs at a particular mobile device is positive or negative.

### 1. Read the dataset (tweets.csv) and drop the NA's while reading the dataset

In [1]:
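# Change the working directory to the folder that contains tweets.csv
import os
os.chdir(r'G:\Paridhi')  # raw string so the backslash is not read as an escape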

In [2]:
import pandas as pd
data = pd.read_csv('tweets.csv', encoding='unicode_escape').dropna()

In [3]:
data.head(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


### 2. Preprocess the text and store the preprocessed text in a new column named `text` in the dataframe.

In [4]:
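def preprocess(text):
    """Replace every non-ASCII character in `text` with a space."""
    try:
        return ''.join(i if ord(i) < 128 else ' ' for i in text)
    except Exception:
        # Input that cannot be iterated character by character (e.g. a float NaN)
        return " "

data['text'] = [preprocess(text) for text in data.tweet_text]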

In [5]:
data.head(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,.@wesley83 I have a 3G iPhone. After 3 hrs twe...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,@jessedee Know about @fludapp ? Awesome iPad/i...
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,@swonderlin Can not wait for #iPad 2 also. The...
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,@sxsw I hope this year's festival isn't as cra...
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,@sxtxstate great stuff on Fri #SXSW: Marissa M...
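The character-by-character filter in `preprocess` can also be written as a single regular-expression substitution. The sketch below is one possible alternative, assuming every input is either a string or a missing value; the names `NON_ASCII` and `preprocess_regex` are hypothetical and not part of the notebook.

```python
import re

# Matches any character outside the 7-bit ASCII range
NON_ASCII = re.compile(r'[^\x00-\x7F]')

def preprocess_regex(text):
    """Replace each non-ASCII character with a space (hypothetical helper)."""
    if not isinstance(text, str):
        # Mirror the notebook's fallback for non-string input
        return " "
    return NON_ASCII.sub(" ", text)

sample = "caf\u00e9 \u2764 iPad"   # made-up text with two non-ASCII characters
cleaned = preprocess_regex(sample)
print(repr(cleaned))
```

Each non-ASCII character becomes exactly one space, so the cleaned string keeps the length of the input, just as in the original `preprocess`.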


### 3. Keep only rows labelled with a Positive or Negative emotion and remove all other rows from the dataframe.

In [6]:
# Keep only the two sentiment classes; .copy() avoids a SettingWithCopyWarning when columns are added later
data = data[data['is_there_an_emotion_directed_at_a_brand_or_product'].isin(['Negative emotion', 'Positive emotion'])].copy()

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3191 entries, 0 to 9088
Data columns (total 4 columns):
tweet_text                                            3191 non-null object
emotion_in_tweet_is_directed_at                       3191 non-null object
is_there_an_emotion_directed_at_a_brand_or_product    3191 non-null object
text                                                  3191 non-null object
dtypes: object(4)
memory usage: 124.6+ KB


### 4. Represent text as numerical data using `CountVectorizer` and get the document-term matrix



In [8]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv.fit(data['text'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [9]:
doc_matrix = cv.transform(data.text)

In [10]:
doc_matrix

<3191x5610 sparse matrix of type '<class 'numpy.int64'>'
	with 53151 stored elements in Compressed Sparse Row format>

In [11]:
doc_matrix1 = doc_matrix.toarray()

In [12]:
doc_matrix1.shape

(3191, 5610)
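To make the shape `(3191, 5610)` concrete: each row is one tweet and each column is one term of the learned vocabulary. The following is a minimal sketch on a made-up two-sentence corpus (not taken from tweets.csv); `toy_cv` and `toy_dtm` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = ["I love my iPad", "I hate my iPhone battery"]

toy_cv = CountVectorizer()
toy_dtm = toy_cv.fit_transform(docs)

# vocabulary_ maps each term to its column index; sort terms by that index
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(toy_dtm.toarray())
```

With the default settings the text is lowercased and tokens shorter than two characters are dropped, so "I" never becomes a column.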

### 5. Find number of different words in vocabulary

In [13]:
# Method 1: number of distinct tokens after splitting on single spaces (case-sensitive)
from collections import Counter
r = Counter(" ".join(data['text']).split(" ")).items()
len(r)

9912

In [14]:
# Method 2: number of distinct tokens after lowercasing; smaller than Method 1 because case variants merge
uniqueWords = list(set(" ".join(data['text']).lower().split(" ")))
count = len(uniqueWords)
count

8619
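Both counts are larger than the 5610 columns of the document-term matrix from step 4 because splitting on spaces keeps punctuation attached to the words. The vocabulary the vectorizer actually learned is available as `len(cv.vocabulary_)`. The sketch below illustrates the difference on made-up text, reusing the default `token_pattern` shown in the `CountVectorizer` output above; it uses only the standard library.

```python
import re

# Default CountVectorizer token pattern: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

texts = ["Great #iPad app!", "great ipad app, great price."]  # invented examples
joined = " ".join(texts).lower()

whitespace_tokens = set(joined.split(" "))
pattern_tokens = set(TOKEN_PATTERN.findall(joined))

print(sorted(whitespace_tokens))
print(sorted(pattern_tokens))
```

Here "app!", "app," and "#ipad" count as separate whitespace tokens, while the token pattern reduces them to "app" and "ipad".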

In [15]:
uniqueWords

['',
 'intro,',
 '#crazyco',
 'fr',
 '&quot;not',
 'product...',
 'experts:',
 'wilderness',
 "company's",
 'more.',
 'class.',
 '*slips',
 'ordered',
 '#wssxsw',
 'cab',
 'please?',
 'drowning',
 'device.&quot;',
 'visiting',
 "frickin'",
 'destroyed',
 'having',
 'realize',
 'plied',
 'arm.',
 'fanboys.',
 'dropped',
 'guides',
 'convience',
 'come?',
 'amazon.',
 'goddamn',
 'yelping!!',
 '#smileyparty',
 '#iphone,',
 'installs',
 'etc.',
 'down',
 'center,',
 'may',
 'asked,',
 '98.5%',
 'good.',
 'touched',
 'one!',
 'version',
 "registrant's",
 'conquered.',
 'flip',
 'enchanting,',
 'native',
 "won't",
 'finder',
 'tuned',
 'see-really',
 'day.from',
 'geo-location',
 'popular',
 'perhaps',
 'expect.',
 'mean,',
 'replacement',
 'true:',
 'mindjet',
 'interface',
 '#futuremf',
 'pen.',
 'seats.',
 'discuss',
 '*strums',
 'shuffling',
 'free:',
 'table',
 'classiest,',
 'droid,',
 'you,',
 'source',
 'yrs',
 '#h4ckers',
 't-mobile',
 'yet:',
 'asddieu',
 "it'll",
 'tempt',
 'forg

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [16]:
data['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
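The classes are imbalanced, which matters when reading accuracy scores: a classifier that always predicts the majority class already reaches a fairly high accuracy. A small sketch of that baseline, using the two counts printed above:

```python
# Class counts taken from the value_counts() output
positive, negative = 2672, 519

# Accuracy of always predicting the majority class
baseline_accuracy = max(positive, negative) / (positive + negative)
print(round(baseline_accuracy, 4))
```

Any model accuracy is best judged against this baseline rather than against 0.5.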

### 7. Map the Positive and Negative emotion labels to 1 and 0 respectively and store them in a new column named `Label` in the same dataframe

Hint: use map on that column and give labels

In [17]:
data['Label'] = data['is_there_an_emotion_directed_at_a_brand_or_product']


In [18]:
new_values = {'Positive emotion': 1, 'Negative emotion': 0}

In [19]:
data.Label = data.Label.map(new_values)

In [20]:
data.head(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text,Label
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,0
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,@jessedee Know about @fludapp ? Awesome iPad/i...,1
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,@swonderlin Can not wait for #iPad 2 also. The...,1
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,@sxsw I hope this year's festival isn't as cra...,0
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,@sxtxstate great stuff on Fri #SXSW: Marissa M...,1


### 8. Define the feature set (independent variable or X) to be the `text` column and the `Label` column as target (dependent variable or y), then split into train and test datasets

In [21]:
from sklearn.model_selection import train_test_split
X = data.text
y = data.Label


In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

In [23]:
X_train.shape

(2233,)

In [24]:
X_test.shape

(958,)
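The split above is random, so the accuracy figures in the following sections change from run to run. A possible variant, sketched on toy data, fixes the seed and stratifies on the label so that both splits keep the class ratio. The seed value `42` and the toy data are assumptions for illustration; the notebook itself uses neither `random_state` nor `stratify`.

```python
from sklearn.model_selection import train_test_split

# Toy data with a similar imbalance: 16 positive, 4 negative
X_toy = [f"tweet {i}" for i in range(20)]
y_toy = [1] * 16 + [0] * 4

# Two calls with the same seed produce the same split
split_a = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy)
split_b = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))
print(split_a[1] == split_b[1])  # identical test rows for identical seeds
```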

## 9. Predicting the sentiment


### Use Naive Bayes and Logistic Regression and print their accuracy scores for predicting the sentiment of the given text

In [25]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [26]:
vect = CountVectorizer()

# create document-term matrices
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [27]:
X_train_dtm.shape

(2233, 4678)

In [28]:
X_test_dtm.shape

(958, 4678)

In [29]:
print('Features: ', X_train_dtm.shape[1])

Features:  4678


In [30]:
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

In [31]:
# Accuracy of Naive Bayes on the test set
print(metrics.accuracy_score(y_test, y_pred_class))

0.8695198329853863


In [33]:
# Logistic regression on the same document-term matrices (text column only)
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))

0.8830897703549061
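Because most tweets are positive, accuracy alone can hide poor performance on the negative class. The sketch below uses invented labels (not the notebook's predictions) to show how a confusion matrix and per-class recall expose this; in the notebook the same functions could be applied to `y_test` and `y_pred_class`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Invented example: 8 positive and 2 negative tweets
y_true = [1] * 8 + [0] * 2
# A degenerate classifier that always predicts "positive"
y_always_positive = [1] * 10

acc = accuracy_score(y_true, y_always_positive)
cm = confusion_matrix(y_true, y_always_positive)  # rows: true 0/1, columns: predicted 0/1
negative_recall = recall_score(y_true, y_always_positive, pos_label=0)

print(acc)
print(cm)
print(negative_recall)
```

The accuracy looks respectable even though no negative tweet is recognised.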




## 10. Create a function called `tokenize_test` that takes a `CountVectorizer` object as input, builds document-term matrices from `X_train` and `X_test`, trains a model on them, and prints the accuracy

In [37]:
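def tokenize_test(vect):
    """Fit `vect` on X_train, train Multinomial Naive Bayes, and print the test accuracy."""
    # Learn the vocabulary from the training split only
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    # Reuse that vocabulary for the test split
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))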


### Create a `CountVectorizer` with `ngram_range=(1, 2)` and pass it to the `tokenize_test` function to print the accuracy score

In [39]:
vect1 = CountVectorizer(ngram_range=(1, 2))

In [40]:
tokenize_test(vect1)

Features:  23443
Accuracy:  0.8736951983298539
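The jump in feature count comes from the bigrams: with `ngram_range=(1, 2)` every pair of adjacent tokens becomes a feature in addition to the single tokens. A minimal sketch on one invented sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy = ["not good at all"]  # invented example

unigrams = CountVectorizer().fit(toy)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(toy)

unigram_terms = sorted(unigrams.vocabulary_)
uni_bi_terms = sorted(uni_bi.vocabulary_)
print(unigram_terms)
print(uni_bi_terms)
```

Bigrams such as "not good" keep a negation together with the word it modifies, which single tokens cannot express.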


### Create a `CountVectorizer` with `stop_words='english'` and pass it to the `tokenize_test` function to print the accuracy score

In [35]:
vect2 = CountVectorizer(stop_words='english')

In [38]:
tokenize_test(vect2)

Features:  4438
Accuracy:  0.8716075156576201
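Stop-word removal shrinks the vocabulary, but for sentiment it can also discard useful words. The sketch below, on one invented sentence, assumes that scikit-learn's built-in English list contains the negation "not"; the first test line checks that assumption directly.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

toy = ["this is not a good phone"]  # invented example

plain = CountVectorizer().fit(toy)
no_stop = CountVectorizer(stop_words='english').fit(toy)

plain_terms = sorted(plain.vocabulary_)
no_stop_terms = sorted(no_stop.vocabulary_)
print(plain_terms)
print(no_stop_terms)
print('not' in ENGLISH_STOP_WORDS)
```

If the negation is removed, "not a good phone" and "a good phone" produce the same features.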


### Create a `CountVectorizer` with `stop_words='english'` and `max_features=300` and pass it to the `tokenize_test` function to print the accuracy score

In [41]:
vect3 = CountVectorizer(stop_words='english', max_features=300)

In [42]:
tokenize_test(vect3)

Features:  300
Accuracy:  0.824634655532359


### Create a `CountVectorizer` with `ngram_range=(1, 2)` and `max_features=15000` and pass it to the `tokenize_test` function to print the accuracy score

In [43]:
vect4 = CountVectorizer(ngram_range=(1, 2), max_features=15000)

In [44]:
tokenize_test(vect4)

Features:  15000
Accuracy:  0.8716075156576201


### Create a `CountVectorizer` with `ngram_range=(1, 2)` that keeps only terms appearing in at least 2 documents (`min_df=2`) and pass it to the `tokenize_test` function to print the accuracy score

In [45]:
vect5 = CountVectorizer(ngram_range=(1, 2), min_df=2)

In [46]:
tokenize_test(vect5)

Features:  7101
Accuracy:  0.8674321503131524
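Note that `min_df` counts documents, not occurrences: a term repeated many times inside a single tweet is still dropped. A small sketch on an invented corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# "ipad" occurs twice, but only in one document; "rocks" occurs in two documents
toy = ["ipad ipad rocks", "iphone rocks", "android lags"]

vect_df2 = CountVectorizer(min_df=2).fit(toy)
kept_terms = sorted(vect_df2.vocabulary_)
print(kept_terms)
```

Only "rocks" survives, because it is the only term present in at least two documents.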
