### Emotion Classifier

#### Dataset

The dataset set contained in four text files consists of tweets for four different emotions: anger, fear, joy and sadness.<br>

Along with the tweet, the intensity or degree of emotion X felt by the speaker (a real-valued score between 0 and 1) is also provided. <br>

The maximum possible score 1 stands for feeling the maximum amount of emotion X (or having a mental state maximally inclined towards feeling emotion X). The minimum possible score 0 stands for feeling the least amount of emotion X (or having a mental state maximally away from feeling emotion X). 

#### Goals: 
i) To classify a given tweet into one of the four classes: anger, fear, joy or sadness. <br>
ii) To display the degree of the classified emotion in the tweet.

Installing required package:<br>
```
pip3 install nltk
 (or)
pip install nltk
```

In [2]:
import nltk  

In [25]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

In [3]:
from pandas import DataFrame
import pandas as pd

data = [] # Tweets
data_labels = [] # Emotion label (anger, fear, joy, or sadness)
data_int = [] # Intensityy of each emotion

dataset=pd.read_csv("training_set/anger-ratings-0to1.train.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])

# Display first few examples
pd.set_option('display.max_colwidth', -1)
dataset.head()

Unnamed: 0,id,tweet,emotion,intensity
0,10000,How the fu*k! Who the heck! moved my fridge!... should I knock the landlord door. #angry #mad ##,anger,0.938
1,10001,So my Indian Uber driver just called someone the N word. If I wasn't in a moving vehicle I'd have jumped out #disgusted,anger,0.896
2,10002,@DPD_UK I asked for my parcel to be delivered to a pick up store not my address #fuming #poorcustomerservice,anger,0.896
3,10003,so ef whichever butt wipe pulled the fire alarm in davis bc I was sound asleep #pissed #angry #upset #tired #sad #tired #hangry ######,anger,0.896
4,10004,"Don't join @BTCare they put the phone down on you, talk over you and are rude. Taking money out of my acc willynilly! #fuming",anger,0.896


In [4]:
len(dataset)

857

#### Reading the tweets and their corresponding emotion and intensity

In [5]:
from pandas import DataFrame
import pandas as pd

data = [] # Tweets
data_labels = [] # Emotion label (anger, fear, joy, or sadness)
data_int = [] # Intensityy of each emotion

dataset=pd.read_csv("training_set/anger-ratings-0to1.train.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    data.append(dataset.iat[i,1])
    data_labels.append('anger')
    data_int.append(dataset.iat[i,3])
    
dataset=pd.read_csv("training_set/fear-ratings-0to1.train.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    data.append(dataset.iat[i,1])
    data_labels.append('fear')
    data_int.append(dataset.iat[i,3])

dataset=pd.read_csv("training_set/joy-ratings-0to1.train.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    data.append(dataset.iat[i,1])
    data_labels.append('joy')
    data_int.append(dataset.iat[i,3])

dataset=pd.read_csv("training_set/sadness-ratings-0to1.train.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    data.append(dataset.iat[i,1])
    data_labels.append('sadness')
    data_int.append(dataset.iat[i,3])

#### Shuffling the data

In [6]:
from random import shuffle
dv = []
dl = []
di = []
index_shuf = list(range(len(data)))
shuffle(index_shuf)
for i in index_shuf:
    dv.append(data[i])
    dl.append(data_labels[i])
    di.append(data_int[i])
data = dv
data_labels = dl
data_int = di

#### Feature extraction using CountVectorizer

In [7]:
from sklearn.feature_extraction.text import CountVectorizer    

vectorizer = CountVectorizer(
    analyzer = 'word',
    lowercase = False,
)


#### An example using CountVectorizer

In [8]:
example = ['this is great','This is too great to be great','THIS IS GREAT!']
print(example)

['this is great', 'This is too great to be great', 'THIS IS GREAT!']


In [9]:
features_eg = vectorizer.fit_transform(
    example
)
features_nd_eg = features_eg.toarray() # for easy usage
print(vectorizer.get_feature_names())
print(features_nd_eg)

['GREAT', 'IS', 'THIS', 'This', 'be', 'great', 'is', 'this', 'to', 'too']
[[0 0 0 0 0 1 1 1 0 0]
 [0 0 0 1 1 2 1 0 1 1]
 [1 1 1 0 0 0 0 0 0 0]]


#### Extracting features from tweets

In [10]:
features = vectorizer.fit_transform(
    data
)
features_nd = features.toarray() # for easy usage

In [11]:
features_nd.shape


(3613, 11239)

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test  = train_test_split(
        features_nd, 
        data_labels,
        train_size=0.80, test_size=0.20, 
        random_state=1234)

### Linear Classifier

In [13]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()

In [14]:
log_model = log_model.fit(X=X_train, y=y_train)

In [15]:
y_pred = log_model.predict(X_test)

In [16]:
import numpy as np
np.mean(y_pred==y_test)

0.8561549100968188

### Accuracy

In [17]:
# Printing the predictions for some random test data
import random

j = random.randint(0,len(X_test)-7)
for i in range(j,j+7):
    ind = features_nd.tolist().index(X_test[i].tolist())
    print(y_pred[i],":",data[ind].strip())

sadness : @ticcikasie1 With a frown, she let's out a distraught 'Gardevoir' saying that she wishes she had a trainer
fear : What an actual nightmare
anger : It takes a man to suffer ignorance and smile. Be yourself, no matter what they say. #sting
anger : ordered my vacation bathing suits. one less thing to fret about.
anger : Get to work and there's a fire drill. #fire  #outthere #inthedark
fear : Rojo is shocking.......absolutely shocking !!!
fear : STAY JADED everyone is #terrible


In [18]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.8561549100968188


## Exercise
```
There are two sets each containing 4 files for each emotion provided for training and development. 
Combine these two sets for training and use 5-fold cross-validation 
to find out the Accuracy in all the cases mentioned below.
```

1. Calculate the accuracy using Random Forest Classifier and tune the number of estimators to get the best results. Comment on the same.
2. Now use Logistic Regression and observe the accuracy value. Can the performance be further improved by using L1 and L2 regularizations?
3. Repeat the same using Support Vector Classifier.
4. Estimate the training & testing time for each classifier and comment on the results.
5. Now, the emotion intensity score for each tweet is to be found on top of classification. To do this, fit different regression models on the training set for each emotion and find the emotion intensity score for each of the test set. Also, display mean square error for test set.
6. In all the above cases, create a user-defined function, which takes a tweet (text) as input and displays the predicted emotion.
7. A separate test set is provided. Use one of the classification models implemented earlier to determine the corresponding emotion for each tweet in this set. Use the linear regression models to calculate the emotional intensity.

```In all the above cases, use a feature extractor other than CountVectorizer and observe performance.```

In [19]:
features_nd.shape

(3613, 11239)

## Data extracting

In [20]:
from pandas import DataFrame
import pandas as pd

data = [] # Tweets
data_labels = [] # Emotion label (anger, fear, joy, or sadness)
data_int = [] # Intensityy of each emotion

dataset=pd.read_csv("training_set/anger-ratings-0to1.train.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    data.append(dataset.iat[i,1])
    data_labels.append('anger')
    data_int.append(dataset.iat[i,3])
    
dataset=pd.read_csv("training_set/fear-ratings-0to1.train.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    data.append(dataset.iat[i,1])
    data_labels.append('fear')
    data_int.append(dataset.iat[i,3])

dataset=pd.read_csv("training_set/joy-ratings-0to1.train.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    data.append(dataset.iat[i,1])
    data_labels.append('joy')
    data_int.append(dataset.iat[i,3])

dataset=pd.read_csv("training_set/sadness-ratings-0to1.train.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    data.append(dataset.iat[i,1])
    data_labels.append('sadness')
    data_int.append(dataset.iat[i,3])

In [21]:
dataset=pd.read_csv("dev_set/anger-ratings-0to1.dev.gold.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    data.append(dataset.iat[i,1])
    data_labels.append('anger')
    data_int.append(dataset.iat[i,3])
    
dataset=pd.read_csv("dev_set/fear-ratings-0to1.dev.gold.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    data.append(dataset.iat[i,1])
    data_labels.append('fear')
    data_int.append(dataset.iat[i,3])

dataset=pd.read_csv("dev_set/joy-ratings-0to1.dev.gold.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    data.append(dataset.iat[i,1])
    data_labels.append('joy')
    data_int.append(dataset.iat[i,3])

dataset=pd.read_csv("dev_set/sadness-ratings-0to1.dev.gold.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    data.append(dataset.iat[i,1])
    data_labels.append('sadness')
    data_int.append(dataset.iat[i,3])

In [22]:
from pandas import DataFrame
import pandas as pd

testdata = [] # Tweets
testdata_labels = [] # Emotion label (anger, fear, joy, or sadness)
testdata_int = [] # Intensityy of each emotion

dataset=pd.read_csv("testing_set/anger-ratings-0to1.test.gold.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    testdata.append(dataset.iat[i,1])
    testdata_labels.append('anger')
    testdata_int.append(dataset.iat[i,3])
    
dataset=pd.read_csv("testing_set/fear-ratings-0to1.test.gold.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    testdata.append(dataset.iat[i,1])
    testdata_labels.append('fear')
    testdata_int.append(dataset.iat[i,3])

dataset=pd.read_csv("testing_set/joy-ratings-0to1.test.gold.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    testdata.append(dataset.iat[i,1])
    testdata_labels.append('joy')
    testdata_int.append(dataset.iat[i,3])

dataset=pd.read_csv("testing_set/sadness-ratings-0to1.test.gold.txt",delimiter="\t",names=['id','tweet','emotion','intensity'])
for i in range(len(dataset)):
    testdata.append(dataset.iat[i,1])
    testdata_labels.append('sadness')
    testdata_int.append(dataset.iat[i,3])

In [23]:
from sklearn.feature_extraction.text import CountVectorizer    

vectorizer = CountVectorizer(
    analyzer = 'word',
    lowercase = False,
)
features = vectorizer.fit_transform(
    data
)
features_nd = features.toarray()

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test  = train_test_split(
        features_nd, 
        data_labels,
        train_size=0.80, test_size=0.20, 
        random_state=1234)

## Random Forest

In [31]:
from sklearn.model_selection import cross_val_score 
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier(criterion='gini',n_estimators=500,min_samples_leaf=2)
import timeit
tic = timeit.default_timer()
# X_train is split into training and testing sets for each of the k folds and scores are obtained
scores = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5, n_jobs=-1)
toc = timeit.default_timer()
print("Training time" , toc - tic)
print('CV accuracy scores: %s' % scores) 
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores))) 
from sklearn.metrics import classification_report


Training time 757.3631913413255
CV accuracy scores: [0.84591195 0.80757098 0.82306477 0.83728278 0.82911392]
CV accuracy: 0.829 +/- 0.013


In [28]:
from sklearn.grid_search import GridSearchCV
param_grid = {'n_estimators': [100,200,400,500,600]}
grid = GridSearchCV(model, param_grid,n_jobs=-1)
%time grid.fit(X_train, y_train)
print(grid.best_params_)
model = grid.best_estimator_



Wall time: 17min 17s
{'n_estimators': 100}


In [29]:
import timeit
tic = timeit.default_timer()
model.fit(X_train, y_train)
toc = timeit.default_timer()
print("Training time" , toc - tic)
from sklearn.metrics import classification_report
tic = timeit.default_timer()
y_fit=model.predict(X_test)
print("Classification report for validation dataset\n",classification_report(y_test, y_fit))
toc = timeit.default_timer()
print("Testing time time" , toc - tic)

Training time 45.385237231955216
Classification report for validation dataset
              precision    recall  f1-score   support

      anger       0.94      0.75      0.83       197
       fear       0.70      0.95      0.81       240
        joy       0.96      0.83      0.89       191
    sadness       0.88      0.77      0.82       164

avg / total       0.86      0.83      0.84       792

Testing time time 0.710537148117055


### If we use random forest the cross validation accuracy is 0.83. It is less than logistic regression and SVM(linear) because the boundary would be linear in nature and random forest is trying to make boxes but is unable to do a very good job

## Logistic Regression

In [41]:
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
tic = timeit.default_timer()
model.fit(X_train, y_train)
toc = timeit.default_timer()
print("Training time" , toc - tic)
tic = timeit.default_timer()
y_fit=model.predict(X_test)
print("Classification report for validation dataset\n",classification_report(y_test, y_fit))
toc = timeit.default_timer()
print("Testing time time" , toc - tic)

Training time 1.9924491568635858
Classification report for validation dataset
              precision    recall  f1-score   support

      anger       0.88      0.86      0.87       197
       fear       0.82      0.93      0.87       240
        joy       0.94      0.85      0.89       191
    sadness       0.84      0.80      0.82       164

avg / total       0.87      0.87      0.87       792

Testing time time 0.471371835539685


In [42]:
tic = timeit.default_timer()
# X_train is split into training and testing sets for each of the k folds and scores are obtained
scores = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5, n_jobs=-1)
toc = timeit.default_timer()
print("Training time" , toc - tic)
print('CV accuracy scores: %s' % scores) 
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores))) 

Training time 165.31165567550306
CV accuracy scores: [0.83647799 0.8044164  0.835703   0.84202212 0.82594937]
CV accuracy: 0.829 +/- 0.013


In [47]:
from sklearn.grid_search import GridSearchCV
param_grid = {'penalty': ['l1','l2'],
             'C':[1,5,10]}
grid = GridSearchCV(model, param_grid,n_jobs=-1)
%time grid.fit(X_train, y_train)
print(grid.best_params_)
model = grid.best_estimator_

Wall time: 6min 29s
{'C': 5, 'penalty': 'l1'}


In [48]:
tic = timeit.default_timer()
y_fit=model.predict(X_test)
print("Classification report for validation dataset\n",classification_report(y_test, y_fit))
toc = timeit.default_timer()
print("Testing time time" , toc - tic)

Classification report for validation dataset
              precision    recall  f1-score   support

      anger       0.93      0.82      0.87       197
       fear       0.78      0.93      0.85       240
        joy       0.95      0.87      0.91       191
    sadness       0.87      0.84      0.85       164

avg / total       0.87      0.87      0.87       792

Testing time time 2.659489818408474


In [49]:
tic = timeit.default_timer()
# X_train is split into training and testing sets for each of the k folds and scores are obtained
scores = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5, n_jobs=-1)
toc = timeit.default_timer()
print("Training time" , toc - tic)
print('CV accuracy scores: %s' % scores) 
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores))) 

Training time 72.60058481490705
CV accuracy scores: [0.8663522  0.83280757 0.83886256 0.84834123 0.83544304]
CV accuracy: 0.844 +/- 0.012


### The logistic regression Cross validation accuracy increased once we used L1 regularization. This might be because of multicollinearity in the data and because of regularization we are able to mitigate its effect on accuracy

## SVC

In [125]:
from sklearn.svm import SVC
model=SVC()

In [126]:
model=SVC(kernel='linear')
tic = timeit.default_timer()
model.fit(X_train, y_train)
toc = timeit.default_timer()
print("Training time" , toc - tic)

Training time 278.18751782643085


In [127]:
tic = timeit.default_timer()
y_fit=model.predict(X_test)
print("Classification report for validation dataset\n",classification_report(y_test, y_fit))
toc = timeit.default_timer()
print("Testing time time" , toc - tic)

Classification report for validation dataset
              precision    recall  f1-score   support

      anger       0.89      0.83      0.86       197
       fear       0.79      0.95      0.86       240
        joy       0.94      0.84      0.89       191
    sadness       0.85      0.77      0.81       164

avg / total       0.86      0.85      0.85       792

Testing time time 43.52747052696941


In [128]:
tic = timeit.default_timer()
# X_train is split into training and testing sets for each of the k folds and scores are obtained
scores = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5, n_jobs=-1)
toc = timeit.default_timer()
print("Training time" , toc - tic)
print('CV accuracy scores: %s' % scores) 
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores))) 

Training time 595.6886770845049
CV accuracy scores: [0.83490566 0.78391167 0.82464455 0.81516588 0.8085443 ]
CV accuracy: 0.813 +/- 0.017


### Linear kernel of SVM performed much better than Rbf because the boundary might be linear. The boundary being linear is also seen by higher accuracy of logistic regression than random forrest

### Training and testing time is least for logistic regression. It is less than random forest because random forest has 500 eastimators and hence it will be more complex and take more time. SVM is not good with a lot of features and in this we have around 12000 features and hence it is taking a lot of time to classify

In [63]:
data_fear=[]
data_anger=[]
data_joy=[]
data_sadness=[]
target_fear=[]
target_anger=[]
target_joy=[]
target_sadness=[]
for i in range(len(data_labels)):
    if(data_labels[i]=='fear'):
        data_fear.append(features_nd[i])
        target_fear.append(data_int[i])
    if(data_labels[i]=='anger'):
        data_anger.append(features_nd[i])
        target_anger.append(data_int[i])
    if(data_labels[i]=='joy'):
        data_joy.append(features_nd[i])
        target_joy.append(data_int[i])
    if(data_labels[i]=='sadness'):
        data_sadness.append(features_nd[i])
        target_sadness.append(data_int[i])

## Anger intensity

In [64]:
import numpy as np

In [65]:
from sklearn.ensemble import RandomForestRegressor
anger=RandomForestRegressor()

In [66]:
from sklearn.linear_model import LinearRegression

In [94]:
data_anger=np.array(data_anger)
target_anger=np.array(target_anger)
#anger=LinearRegression()
anger=RandomForestRegressor()
X_train, X_test, y_train, y_test = train_test_split(data_anger, target_anger, test_size=0.20, random_state=1234) 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
# Split the data set into 'k' folds
#kfold = StratifiedKFold(n_splits=5, random_state=1)
kfold = KFold(n_splits=5, random_state=1)
scores = [] 
k = 0

for (train, test) in kfold.split(X_train, y_train): 
    anger.fit(X_train[train], y_train[train])          # Perform functions in pipeline
    score = anger.score(X_train[test], y_train[test])  # Calculate score for each fold 
    y_pred=anger.predict(X_train[test])
    scores.append(mean_squared_error(y_train[test], y_pred)) 
    score=mean_squared_error(y_train[test], y_pred)
    k = k+1
    y_pred=[]
    print('Fold: %s Mean Squared Error: %.3f' % (k, score))  

Fold: 1 Mean Squared Error: 0.024
Fold: 2 Mean Squared Error: 0.025
Fold: 3 Mean Squared Error: 0.025
Fold: 4 Mean Squared Error: 0.019
Fold: 5 Mean Squared Error: 0.022


## Joy

In [84]:
from sklearn.ensemble import RandomForestRegressor
joy=RandomForestRegressor()
#joy=LinearRegression()

In [85]:
data_joy=np.array(data_joy)
target_joy=np.array(target_joy)
#joy=LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(data_joy, target_joy, test_size=0.20, random_state=1234) 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
# Split the data set into 'k' folds
#kfold = StratifiedKFold(n_splits=5, random_state=1)
kfold = KFold(n_splits=5, random_state=1)
scores = [] 
k = 0

for (train, test) in kfold.split(X_train, y_train): 
    joy.fit(X_train[train], y_train[train])          # Perform functions in pipeline
    #score = anger.score(X_train[test], y_train[test])  # Calculate score for each fold 
    y_pred=joy.predict(X_train[test])
    scores.append(mean_squared_error(y_train[test], y_pred)) 
    score=mean_squared_error(y_train[test], y_pred)
    k = k+1
    y_pred=[]
    print('Fold: %s Mean Squared Error: %.3f' % (k, score))  

Fold: 1 Mean Squared Error: 0.039
Fold: 2 Mean Squared Error: 0.033
Fold: 3 Mean Squared Error: 0.035
Fold: 4 Mean Squared Error: 0.029
Fold: 5 Mean Squared Error: 0.037


## Fear

In [87]:
data_fear=np.array(data_fear)
target_fear=np.array(target_fear)
#fear=LinearRegression()
fear=RandomForestRegressor()
X_train, X_test, y_train, y_test = train_test_split(data_fear, target_fear, test_size=0.20, random_state=1234) 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
# Split the data set into 'k' folds
#kfold = StratifiedKFold(n_splits=5, random_state=1)
kfold = KFold(n_splits=5, random_state=1)
scores = [] 
k = 0

for (train, test) in kfold.split(X_train, y_train): 
    fear.fit(X_train[train], y_train[train])          # Perform functions in pipeline
    #score = anger.score(X_train[test], y_train[test])  # Calculate score for each fold 
    y_pred=fear.predict(X_train[test])
    scores.append(mean_squared_error(y_train[test], y_pred)) 
    score=mean_squared_error(y_train[test], y_pred)
    k = k+1
    y_pred=[]
    print('Fold: %s Mean Squared Error: %.3f' % (k, score))  

Fold: 1 Mean Squared Error: 0.030
Fold: 2 Mean Squared Error: 0.031
Fold: 3 Mean Squared Error: 0.028
Fold: 4 Mean Squared Error: 0.026
Fold: 5 Mean Squared Error: 0.028


## Sadness

In [88]:
data_sadness=np.array(data_sadness)
target_sadness=np.array(target_sadness)
#sadness=LinearRegression()
sadness=RandomForestRegressor()
X_train, X_test, y_train, y_test = train_test_split(data_sadness, target_sadness, test_size=0.20, random_state=1234) 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
# Split the data set into 'k' folds
#kfold = StratifiedKFold(n_splits=5, random_state=1)
kfold = KFold(n_splits=5, random_state=1)
scores = [] 
k = 0

for (train, test) in kfold.split(X_train, y_train): 
    sadness.fit(X_train[train], y_train[train])          # Perform functions in pipeline
    #score = anger.score(X_train[test], y_train[test])  # Calculate score for each fold 
    y_pred=sadness.predict(X_train[test])
    scores.append(mean_squared_error(y_train[test], y_pred)) 
    score=mean_squared_error(y_train[test], y_pred)
    k = k+1
    y_pred=[]
    print('Fold: %s Mean Squared Error: %.3f' % (k, score))  

Fold: 1 Mean Squared Error: 0.030
Fold: 2 Mean Squared Error: 0.029
Fold: 3 Mean Squared Error: 0.030
Fold: 4 Mean Squared Error: 0.025
Fold: 5 Mean Squared Error: 0.028


In [91]:
sadness.predict(data_sadness[1:3])

array([0.8137, 0.7727])

In [93]:
target_sadness[2]

0.958

## Function

In [95]:
def func(tweet2):
    tweet2=[tweet2]
    features = vectorizer.transform(tweet2)
    features_nd = features.toarray()
    tweet=features_nd
    classify=model.predict(tweet)
    if(classify=='anger'):
        print("Anger with intensity-", anger.predict(tweet))
    if(classify=='sadness'):
         print("Sadness with intensity-", sadness.predict(tweet))
    if(classify=='fear'):
        print("Fear with intensity-", fear.predict(tweet))
    if(classify=='joy'):
        print("Joy with intensity-", joy.predict(tweet))

## Checking result on test data

In [96]:
features = vectorizer.transform(testdata)
features_nd = features.toarray()
y_pred=model.predict(features_nd)
print("Classification report for valid dataset\n",classification_report(y_pred, testdata_labels))

Classification report for valid dataset
              precision    recall  f1-score   support

      anger       0.79      0.87      0.83       686
       fear       0.86      0.78      0.82      1097
        joy       0.87      0.91      0.89       684
    sadness       0.82      0.82      0.82       675

avg / total       0.84      0.84      0.84      3142



In [99]:
func(testdata[1])

Anger with intensity- [0.2189]


In [100]:
testdata_int[1]

0.14400000000000002

## TfidfVectorizer

In [101]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    analyzer = 'word',
    lowercase = False,
)


In [119]:
features = vectorizer.fit_transform(
    data
)
features_nd = features.toarray()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test  = train_test_split(
        features_nd, 
        data_labels,
        train_size=0.80, test_size=0.20, 
        random_state=1234)    

## Random Forest

In [103]:
from sklearn.model_selection import cross_val_score 
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier(criterion='gini',n_estimators=500,min_samples_leaf=2)
import timeit
tic = timeit.default_timer()
# X_train is split into training and testing sets for each of the k folds and scores are obtained
scores = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5, n_jobs=-1)
toc = timeit.default_timer()
print("Training time" , toc - tic)
print('CV accuracy scores: %s' % scores) 
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores))) 
from sklearn.metrics import classification_report

Training time 525.2617492372447
CV accuracy scores: [0.8254717  0.80757098 0.81832543 0.82938389 0.8164557 ]
CV accuracy: 0.819 +/- 0.008


In [104]:
from sklearn.grid_search import GridSearchCV
param_grid = {'n_estimators': [100,200,400,500,600]}
grid = GridSearchCV(model, param_grid,n_jobs=-1)
%time grid.fit(X_train, y_train)
print(grid.best_params_)
model = grid.best_estimator_

Wall time: 17min 1s
{'n_estimators': 600}


In [105]:
tic = timeit.default_timer()
y_fit=model.predict(X_test)
print("Classification report for validation dataset\n",classification_report(y_test, y_fit))
toc = timeit.default_timer()
print("Testing time time" , toc - tic)

Classification report for validation dataset
              precision    recall  f1-score   support

      anger       0.94      0.77      0.84       197
       fear       0.69      0.96      0.81       240
        joy       0.97      0.83      0.89       191
    sadness       0.88      0.74      0.80       164

avg / total       0.86      0.83      0.84       792

Testing time time 1.5385596461528621


In [106]:
tic = timeit.default_timer()
# X_train is split into training and testing sets for each of the k folds and scores are obtained
scores = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5, n_jobs=-1)
toc = timeit.default_timer()
print("Training time" , toc - tic)
print('CV accuracy scores: %s' % scores) 
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

Training time 644.9806059257207
CV accuracy scores: [0.8254717  0.81388013 0.80884676 0.82938389 0.81170886]
CV accuracy: 0.818 +/- 0.008


## Logistic Regression

In [107]:
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
tic = timeit.default_timer()
model.fit(X_train, y_train)
toc = timeit.default_timer()
print("Training time" , toc - tic)
tic = timeit.default_timer()
y_fit=model.predict(X_test)
print("Classification report for validation dataset\n",classification_report(y_test, y_fit))
toc = timeit.default_timer()
print("Testing time time" , toc - tic)

Training time 1.1365342397075437
Classification report for validation dataset
              precision    recall  f1-score   support

      anger       0.90      0.77      0.83       197
       fear       0.66      0.96      0.78       240
        joy       0.92      0.73      0.82       191
    sadness       0.85      0.63      0.72       164

avg / total       0.82      0.79      0.79       792

Testing time time 0.4278164677307359


In [108]:
tic = timeit.default_timer()
# X_train is split into training and testing sets for each of the k folds and scores are obtained
scores = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5, n_jobs=-1)
toc = timeit.default_timer()
print("Training time" , toc - tic)
print('CV accuracy scores: %s' % scores) 
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores))) 

Training time 47.97415437903328
CV accuracy scores: [0.74528302 0.72397476 0.73301738 0.75671406 0.74841772]
CV accuracy: 0.741 +/- 0.012


In [109]:
from sklearn.grid_search import GridSearchCV
param_grid = {'penalty': ['l1','l2'],
             'C':[1,5,10]}
grid = GridSearchCV(model, param_grid,n_jobs=-1)
%time grid.fit(X_train, y_train)
print(grid.best_params_)
model = grid.best_estimator_

Wall time: 1min 4s
{'C': 5, 'penalty': 'l1'}


In [110]:
tic = timeit.default_timer()
y_fit=model.predict(X_test)
print("Classification report for validation dataset\n",classification_report(y_test, y_fit))
toc = timeit.default_timer()
print("Testing time time" , toc - tic)

Classification report for validation dataset
              precision    recall  f1-score   support

      anger       0.91      0.79      0.85       197
       fear       0.76      0.94      0.84       240
        joy       0.96      0.86      0.91       191
    sadness       0.88      0.83      0.86       164

avg / total       0.87      0.86      0.86       792

Testing time time 0.7451992886708467


In [111]:
tic = timeit.default_timer()
# X_train is split into training and testing sets for each of the k folds and scores are obtained
scores = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5, n_jobs=-1)
toc = timeit.default_timer()
print("Training time" , toc - tic)
print('CV accuracy scores: %s' % scores) 
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores))) 

Training time 52.084890534923034
CV accuracy scores: [0.86477987 0.84700315 0.83886256 0.84992101 0.83386076]
CV accuracy: 0.847 +/- 0.011


## Support Vector machine

In [123]:
from sklearn.svm import SVC
model=SVC(kernel='linear')
tic = timeit.default_timer()
model.fit(X_train, y_train)
toc = timeit.default_timer()
print("Training time" , toc - tic)
tic = timeit.default_timer()
y_fit=model.predict(X_test)
print("Classification report for validation dataset\n",classification_report(y_test, y_fit))
toc = timeit.default_timer()
print("Testing time time" , toc - tic)

Training time 276.1597939480889
Classification report for validation dataset
              precision    recall  f1-score   support

      anger       0.89      0.83      0.86       197
       fear       0.79      0.95      0.86       240
        joy       0.94      0.84      0.89       191
    sadness       0.85      0.77      0.81       164

avg / total       0.86      0.85      0.85       792

Testing time time 43.27931745304886


In [124]:
tic = timeit.default_timer()
# X_train is split into training and testing sets for each of the k folds and scores are obtained
scores = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5, n_jobs=-1)
toc = timeit.default_timer()
print("Training time" , toc - tic)
print('CV accuracy scores: %s' % scores) 
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

Training time 600.4453322846348
CV accuracy scores: [0.83490566 0.78391167 0.82464455 0.81516588 0.8085443 ]
CV accuracy: 0.813 +/- 0.017


## Intensity calculation

In [113]:
anger=RandomForestRegressor()
X_train, X_test, y_train, y_test = train_test_split(data_anger, target_anger, test_size=0.20, random_state=1234) 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
# Split the data set into 'k' folds
#kfold = StratifiedKFold(n_splits=5, random_state=1)
kfold = KFold(n_splits=5, random_state=1)
scores = [] 
k = 0

for (train, test) in kfold.split(X_train, y_train): 
    anger.fit(X_train[train], y_train[train])          # Perform functions in pipeline
    score = anger.score(X_train[test], y_train[test])  # Calculate score for each fold 
    y_pred=anger.predict(X_train[test])
    scores.append(mean_squared_error(y_train[test], y_pred)) 
    score=mean_squared_error(y_train[test], y_pred)
    k = k+1
    y_pred=[]
    print('Fold: %s Mean Squared Error: %.3f' % (k, score))  

Fold: 1 Mean Squared Error: 0.027
Fold: 2 Mean Squared Error: 0.026
Fold: 3 Mean Squared Error: 0.022
Fold: 4 Mean Squared Error: 0.020
Fold: 5 Mean Squared Error: 0.022


In [114]:
joy=LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(data_joy, target_joy, test_size=0.20, random_state=1234) 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
# Split the data set into 'k' folds
#kfold = StratifiedKFold(n_splits=5, random_state=1)
kfold = KFold(n_splits=5, random_state=1)
scores = [] 
k = 0

for (train, test) in kfold.split(X_train, y_train): 
    anger.fit(X_train[train], y_train[train])          # Perform functions in pipeline
    #score = anger.score(X_train[test], y_train[test])  # Calculate score for each fold 
    y_pred=anger.predict(X_train[test])
    scores.append(mean_squared_error(y_train[test], y_pred)) 
    score=mean_squared_error(y_train[test], y_pred)
    k = k+1
    y_pred=[]
    print('Fold: %s Mean Squared Error: %.3f' % (k, score))  

Fold: 1 Mean Squared Error: 0.043
Fold: 2 Mean Squared Error: 0.035
Fold: 3 Mean Squared Error: 0.035
Fold: 4 Mean Squared Error: 0.032
Fold: 5 Mean Squared Error: 0.039


In [115]:
fear=LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(data_fear, target_fear, test_size=0.20, random_state=1234) 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
# Split the data set into 'k' folds
#kfold = StratifiedKFold(n_splits=5, random_state=1)
kfold = KFold(n_splits=5, random_state=1)
scores = [] 
k = 0

for (train, test) in kfold.split(X_train, y_train): 
    anger.fit(X_train[train], y_train[train])          # Perform functions in pipeline
    #score = anger.score(X_train[test], y_train[test])  # Calculate score for each fold 
    y_pred=anger.predict(X_train[test])
    scores.append(mean_squared_error(y_train[test], y_pred)) 
    score=mean_squared_error(y_train[test], y_pred)
    k = k+1
    y_pred=[]
    print('Fold: %s Mean Squared Error: %.3f' % (k, score))  

Fold: 1 Mean Squared Error: 0.032
Fold: 2 Mean Squared Error: 0.031
Fold: 3 Mean Squared Error: 0.029
Fold: 4 Mean Squared Error: 0.027
Fold: 5 Mean Squared Error: 0.026


In [116]:
sadness=LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(data_sadness, target_sadness, test_size=0.20, random_state=1234) 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
# Split the data set into 'k' folds
#kfold = StratifiedKFold(n_splits=5, random_state=1)
kfold = KFold(n_splits=5, random_state=1)
scores = [] 
k = 0

for (train, test) in kfold.split(X_train, y_train): 
    anger.fit(X_train[train], y_train[train])          # Perform functions in pipeline
    #score = anger.score(X_train[test], y_train[test])  # Calculate score for each fold 
    y_pred=anger.predict(X_train[test])
    scores.append(mean_squared_error(y_train[test], y_pred)) 
    score=mean_squared_error(y_train[test], y_pred)
    k = k+1
    y_pred=[]
    print('Fold: %s Mean Squared Error: %.3f' % (k, score))  

Fold: 1 Mean Squared Error: 0.028
Fold: 2 Mean Squared Error: 0.031
Fold: 3 Mean Squared Error: 0.030
Fold: 4 Mean Squared Error: 0.026
Fold: 5 Mean Squared Error: 0.027


In [117]:
features = vectorizer.transform(testdata)
features_nd = features.toarray()
y_pred=model.predict(features_nd)
print("Classification report for valid dataset\n",classification_report(y_pred, testdata_labels))

Classification report for valid dataset
              precision    recall  f1-score   support

      anger       0.79      0.89      0.84       671
       fear       0.88      0.78      0.83      1118
        joy       0.88      0.93      0.90       675
    sadness       0.83      0.82      0.83       678

avg / total       0.85      0.85      0.85      3142



### We can clearly see that tfidfvectorizer is performing better than count vectorizer. Tf idf is different from countvectorizer. Countvectorizer gives equal weightage to all the words, i.e. a word is converted to a column (in a dataframe for example) and for each document, it is equal to 1 if it is present in that doc else 0.  Apart from giving this information, tfidf says how important that word is to that document with respect to the corpus. 

### We can observe that logistic regression takes the least time for traing and testing the dataset for both the vectorizer