### Emotion Classifier

#### Dataset

The dataset, contained in four text files, consists of tweets labelled with one of four emotions: anger, fear, joy and sadness.<br>
Along with the tweet, the intensity or degree of emotion X felt by the speaker (a real-valued score between 0 and 1) is also provided. <br>
The maximum possible score 1 stands for feeling the maximum amount of emotion X (or having a mental state maximally inclined towards feeling emotion X). The minimum possible score 0 stands for feeling the least amount of emotion X (or having a mental state maximally away from feeling emotion X). 

Installing required package:<br>
```
pip3 install nltk
```

In [48]:
import nltk    

#### Reading the tweets and their corresponding emotion and intensity

In [67]:
from pandas import DataFrame
import pandas as pd

data = [] # Tweets
data_labels = [] # Emotion label (anger, fear, joy, or sadness)
data_int = [] # Intensity of each emotion

names = ['id', 'tweet', 'emotion', 'intensity']
for emotion in ['anger', 'fear', 'joy', 'sadness']:
    # Each file is tab-separated with the columns listed in `names`
    dataset = pd.read_csv(f"{emotion}-ratings-0to1.train.txt", delimiter="\t", names=names)
    data.extend(dataset['tweet'].tolist())
    data_labels.extend([emotion] * len(dataset))
    data_int.extend(dataset['intensity'].tolist())
# After the loop, `dataset` still holds the last file read (sadness)

In [68]:
# Display first few examples
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
dataset.head()

| | id | tweet | emotion | intensity |
|---|---|---|---|---|
| 0 | 40000 | Depression sucks! #depression | sadness | 0.958 |
| 1 | 40001 | Feeling worthless as always #depression | sadness | 0.958 |
| 2 | 40002 | Feeling worthless as always | sadness | 0.958 |
| 3 | 40003 | My #Fibromyalgia has been really bad lately which is not good for my mental state. I feel very overwhelmed #anxiety #bipolar #depression | sadness | 0.946 |
| 4 | 40004 | Im think ima lay in bed all day and sulk. Life is hitting me to hard rn | sadness | 0.934 |


In [74]:
# Shuffling the data
from random import shuffle
dv = []
dl = []
di = []
index_shuf = list(range(len(data)))
shuffle(index_shuf)
for i in index_shuf:
    dv.append(data[i])
    dl.append(data_labels[i])
    di.append(data_int[i])
data = dv
data_labels = dl
data_int = di
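
The index-based shuffle above keeps the three lists aligned. A more compact way to get the same effect, sketched here on made-up toy rows, is to zip the lists, shuffle the tuples with a seeded `random.Random` so the order is reproducible, and unzip again. The toy values and the seed `1234` are illustrative assumptions, not part of the dataset.

```python
import random

# Toy stand-ins for data, data_labels and data_int (hypothetical values)
tweets = ["t0", "t1", "t2", "t3"]
labels = ["anger", "fear", "joy", "sadness"]
scores = [0.1, 0.2, 0.3, 0.4]

# Bundle each tweet with its label and score so they move together
rows = list(zip(tweets, labels, scores))
random.Random(1234).shuffle(rows)  # seeded, so reruns give the same order

# Unzip back into three parallel lists
tweets_s, labels_s, scores_s = (list(col) for col in zip(*rows))

for t, l, s in zip(tweets_s, labels_s, scores_s):
    print(t, l, s)
```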

#### Feature extraction using CountVectorizer

In [52]:
from sklearn.feature_extraction.text import CountVectorizer    

vectorizer = CountVectorizer(
    analyzer = 'word',
    lowercase = False,
)


#### An example using CountVectorizer

In [53]:
example = ['this is great','This is too great to be great','THIS IS GREAT!']
print(example)

['this is great', 'This is too great to be great', 'THIS IS GREAT!']


In [54]:
features_eg = vectorizer.fit_transform(
    example
)
features_nd_eg = features_eg.toarray() # for easy usage
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out().tolist()

['GREAT', 'IS', 'THIS', 'This', 'be', 'great', 'is', 'this', 'to', 'too']
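
To see what the count matrix behind this vocabulary looks like, the following is a minimal pure-Python sketch of the bag-of-words idea. It assumes the default `CountVectorizer` token pattern (words of two or more characters) and no lowercasing, as configured above. It is a simplified illustration rather than scikit-learn's implementation, and the rows should correspond to what `features_nd_eg` holds.

```python
import re

# Default CountVectorizer token pattern: words of two or more characters
TOKEN = re.compile(r"(?u)\b\w\w+\b")

docs = ['this is great', 'This is too great to be great', 'THIS IS GREAT!']
tokens = [TOKEN.findall(d) for d in docs]  # case is preserved (lowercase=False)

# Vocabulary: sorted unique tokens, each mapped to a column index
vocab = sorted({t for doc in tokens for t in doc})
index = {t: i for i, t in enumerate(vocab)}

# One row per document, one column per vocabulary entry
matrix = []
for doc in tokens:
    row = [0] * len(vocab)
    for t in doc:
        row[index[t]] += 1
    matrix.append(row)

print(vocab)
for row in matrix:
    print(row)
```

Because case is preserved, `this`, `This` and `THIS` end up in three separate columns, which is why the three example sentences share almost no features.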

#### Extracting features from tweets

In [55]:
features = vectorizer.fit_transform(
    data
)
features_nd = features.toarray() # for easy usage

In [56]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test  = train_test_split(
        features_nd, 
        data_labels,
        train_size=0.80, test_size=0.20, 
        random_state=1234)

### Linear Classifier

In [57]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()

In [58]:
log_model = log_model.fit(X=X_train, y=y_train)

In [59]:
y_pred = log_model.predict(X_test)

In [60]:
import numpy as np
np.mean(y_pred==y_test)

0.8492392807745505

### Accuracy

In [66]:
# Printing the predictions for some random test data
import random

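rows = features_nd.tolist()  # convert once instead of on every iteration
j = random.randint(0,len(X_test)-7)
for i in range(j,j+7):
    # Recover the original tweet by matching its feature vector
    # (tweets with identical word counts map to the first match)
    ind = rows.index(X_test[i].tolist())
    print(y_pred[i],":",data[ind].strip())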

joy : #BridgetJonesBaby is the best thing I've seen in ages! So funny, I've missed Bridget! #love  #TeamMark
joy : @Zerfash — can't wait.' She said cheerfully and grinned.
fear : I don't know how people can binge watch horror films ...ALONE!😓😰
sadness : Carry on my wayward son, there'll be peace when you are done. Lay your weary head to rest. Don't you cry no more. #Supernatural
sadness : If a friend lost his/her phone, how long do they have to mourn their lost phones before you ask for their earpiece?
fear : Y'all really insult coz of soccer???  Lmao, wow!!!!!!
fear : I want to slide into the dms but im too fucking shy


In [75]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.8492392807745505
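
A single accuracy figure does not show which emotions get confused with each other. The sketch below counts (true, predicted) pairs and derives overall accuracy and per-class recall from them. The label lists are hypothetical toy values, not results from this notebook. In practice `sklearn.metrics.confusion_matrix` or `classification_report` could be applied to `y_test` and `y_pred` for the same purpose.

```python
from collections import Counter

# Hypothetical true and predicted labels, for illustration only
y_true = ["joy", "joy", "fear", "sadness", "anger", "fear"]
y_hat = ["joy", "fear", "fear", "sadness", "anger", "sadness"]

# Count how often each (true, predicted) pair occurs
pairs = Counter(zip(y_true, y_hat))

# Accuracy: share of samples where the prediction matches the true label
accuracy = sum(n for (t, p), n in pairs.items() if t == p) / len(y_true)
print("accuracy:", accuracy)

# Per-class recall: correct predictions for a class / samples of that class
recall = {}
for label in sorted(set(y_true)):
    total = sum(n for (t, _), n in pairs.items() if t == label)
    recall[label] = pairs[(label, label)] / total
    print(label, recall[label])
```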


## Exercise
```
Two sets of files are provided, one for training and one for development, each containing four files (one per emotion).
Combine the two sets for training and use 5-fold cross-validation
to measure the accuracy in all the cases listed below.
```
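
As a reminder of what 5-fold cross-validation does, here is a minimal sketch of the index splitting: the samples are divided into five folds, each fold serves once as the test set while the rest is used for training, and the five accuracies are averaged. The helper `k_fold_indices` is a hypothetical name used only for illustration. It assumes the data has already been shuffled. In scikit-learn, `KFold` and `cross_val_score` from `sklearn.model_selection` cover this.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 12 toy samples split into 5 folds
folds = list(k_fold_indices(12, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```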

1. Calculate the accuracy using a Random Forest classifier and tune the number of estimators to get the best results. Comment on the results.
2. Now use Logistic Regression and observe the accuracy value. Can the performance be further improved by using L1 and L2 regularizations?
3. Repeat the same using Support Vector Classifier.
4. Estimate the training & testing time for each classifier and comment on the results.
5. Fit different regression models for each emotion and display the mean squared error on the test set.
6. A separate test set is provided. Use one of the classification models implemented earlier to determine the corresponding emotion for each tweet in this set. Use the linear regression models to calculate the emotional intensity.

```In all the above cases, CountVectorizer can be used for feature extraction```
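
For the timing comparison in item 4, one possible approach is a small wrapper around `time.perf_counter`. The helper name `timed` and the stand-in workload are assumptions for illustration. With a real classifier, the wrapped call would be something like the model's `fit` or `predict` method.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; replace with e.g. a model's fit or predict call
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, took {seconds:.4f} s")
```

Timing training and prediction separately makes it easier to see where each classifier spends its time.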