# IST736 Text Mining
## Homework 5
### Martin Alonso
### 2019-02-13

#### Objectives
For this assignment, I asked five friends to review 200 tweets on Artificial Intelligence and classify the sentiment of each tweet as one of three options: positive, neutral, or negative. The objective is to understand how each person subjectively labels a tweet and then determine how the group scores the sentiment of each tweet. I'll then compare how close the taggers' labels were using the Cohen's kappa score, which measures how much agreement there was between each pair of taggers. 

#### Analysis

##### Loading the packages and creating the corpus
We'll first load the packages that will be used (pandas, numpy, and sklearn) to manipulate the data and compute the Cohen's kappa metric. Then, before passing the sentiment arrays through the metric, we'll look at the distribution of labels for each reviewer: how many tweets each one classified as positive, negative, and neutral.

In [12]:
# Load required packages
import pandas as pd
import numpy as np 
from sklearn.metrics import cohen_kappa_score

In [13]:
# Load data
df = pd.read_csv('ai_tweets_amz.csv')
df.head()

Unnamed: 0,id,label,user,timestamp,text,reviewer1,reviewer2,reviewer3,reviewer4,reviewer5
0,1,0,_ItsJustChris,12/26/2011 22:44,Artificial Intelligence Is Going To Be The End...,neg,neg,neg,neg,neg
1,47,0,AnonTechOps,1/14/2019 23:20,The Weaponization Of Artificial Intelligence #...,neg,neg,neg,neg,neg
2,53,0,Arnold_Haine,1/14/2019 23:25,The Weaponization Of Artificial Intelligence #...,neg,neg,neg,neg,neg
3,86,0,BitHack3r,5/6/2011 18:53,@TheChameleon84 wow. I hate Artificial Intelli...,neg,neu,neu,neg,neg
4,103,0,brittneyinc,7/19/2014 21:50,this artificial intelligence performance kind ...,neg,neg,neg,neg,neu


The data has been loaded correctly and we can see that there are five columns that show each reviewer's sentiment of the tweet.  
Let's compare how many tweets each reviewer classified as positive, negative, or neutral. 

In [60]:
# Select the five reviewer columns. 
cols = ['reviewer1', 'reviewer2', 'reviewer3', 'reviewer4', 'reviewer5']
reviewers = df[cols]

# Initialize an empty list and count the tweets per sentiment for each reviewer. 
reviews = []

for col in cols:
    # Group by the reviewer's own labels so the counts come back sorted as neg, neu, pos.
    sentiment = reviewers.groupby(col)[col].count()
    reviews.append(sentiment)

# Convert results to data frame; pass cols to index for easier identification; rename columns to neg, neu, and pos. 
results = pd.DataFrame(np.array(reviews))
results.set_index(pd.Index(cols), inplace=True)
results.columns = ['neg', 'neu', 'pos']
print(results)

           neg  neu  pos
reviewer1   44   93   64
reviewer2   43   87   71
reviewer3   42   89   70
reviewer4   49   86   66
reviewer5   45   91   65


Overall, the taggers' label counts are very similar. No single count stands out in any of the columns, though reviewers 2 and 3 are a bit more positive (71 and 70 positive tags), while reviewer 4 is the most negative (49 negative tags).  
Similar totals do not guarantee that the reviewers agreed on the same tweets, so let's check how they score on Cohen's kappa.
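
Before relying on the sklearn function, it may help to see what the statistic measures. The sketch below is a hand-rolled version, not the sklearn implementation: it compares the observed agreement between two raters with the agreement expected if each rater assigned labels independently at their own rates. The function name `manual_cohen_kappa` and the toy labels are made up for illustration, and the sketch assumes the two raters do not both use a single identical label for every item (which would make the denominator zero).

```python
from collections import Counter

def manual_cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of labels (illustrative sketch)."""
    a, b = list(a), list(b)
    n = len(a)
    # Observed agreement: share of items that both raters labeled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement: chance that both raters pick the same label
    # if each one labels independently at their own observed rates.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Toy labels, not taken from the tweet data.
rater_a = ['pos', 'pos', 'neg', 'neu', 'neg', 'neu']
rater_b = ['pos', 'neu', 'neg', 'neu', 'neg', 'pos']
print(manual_cohen_kappa(rater_a, rater_b))
```

In this toy case the raters agree on four of six items, but a third of that agreement is expected by chance, which is why kappa comes out lower than the raw agreement rate.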

In [75]:
# Initialize an empty list to store the Cohen's kappa results. 
ck_results = []

# Loop over each reviewer and compute Cohen's kappa against every reviewer, including themselves. 
# This ensures we have a 5 by 5 matrix. 

for i in cols:
    for j in cols: 
        result = cohen_kappa_score(reviewers[i], reviewers[j])
        ck_results.append(result)
        
# Build a pandas DataFrame with the results. Change the names and index to identify the reviewers. 
ck_results = pd.DataFrame(np.array(ck_results).reshape(-1, 5))
ck_results.set_index(pd.Index(cols), inplace=True)
ck_results.columns = cols

Since every reviewer was scored against every reviewer, including themselves, we have a 5 by 5 matrix whose diagonal holds the agreement between each reviewer and themselves (a score of 1). The matrix is also symmetric, so each reviewer pair appears twice. To avoid this duplication, we'll create a new (final) data frame that masks the values on and above the diagonal, replacing them with NaN. 

In [76]:
final_results = ck_results.mask(np.triu(np.ones(ck_results.shape, dtype=np.bool_)))
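
As an alternative to computing the full matrix and masking half of it, each unordered pair could be scored exactly once. The following is a minimal sketch of that idea; `pairwise_scores` and `percent_agreement` are hypothetical helpers, and the toy ratings are not from the tweet data. The scoring function is passed in as an argument so the example runs on its own.

```python
from itertools import combinations

def pairwise_scores(ratings, score_fn):
    """Apply score_fn once to each unordered pair of raters.

    ratings: mapping of rater name -> sequence of labels. A DataFrame of
    reviewer columns should also work, since iterating it yields column names.
    """
    return {
        (a, b): score_fn(ratings[a], ratings[b])
        for a, b in combinations(list(ratings), 2)
    }

def percent_agreement(x, y):
    # Simple stand-in scorer so the sketch is self-contained.
    return sum(p == q for p, q in zip(x, y)) / len(x)

# Toy ratings for three hypothetical raters.
toy = {
    'r1': ['pos', 'neg', 'neu', 'pos'],
    'r2': ['pos', 'neg', 'pos', 'pos'],
    'r3': ['neg', 'neg', 'neu', 'pos'],
}
pairs = pairwise_scores(toy, percent_agreement)
for pair, score in pairs.items():
    print(pair, score)
```

In this notebook, `pairwise_scores(reviewers, cohen_kappa_score)` would be expected to return the ten reviewer pairs that make up the lower triangle of the matrix.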

#### Results
Even though the initial exploration showed that the reviewers used each sentiment tag almost the same number of times, the Cohen's kappa scores were not as high as expected. 

In [77]:
print(final_results)

           reviewer1  reviewer2  reviewer3  reviewer4  reviewer5
reviewer1        NaN        NaN        NaN        NaN        NaN
reviewer2   0.627116        NaN        NaN        NaN        NaN
reviewer3   0.454567   0.440847        NaN        NaN        NaN
reviewer4   0.482496   0.422989   0.406306        NaN        NaN
reviewer5   0.540313   0.581476   0.417098     0.3449        NaN


Reviewers 1 and 2 agreed the most on the sentiment of the tweets (0.627, substantial agreement), even though reviewer 1 was more neutral and reviewer 2 was more positive.  
On the other hand, reviewers 4 and 5, whose sentiment counts looked similar, received the lowest Cohen's kappa score (0.345), which indicates only fair agreement. 
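
The labels "substantial", "moderate", and "fair" follow the commonly cited Landis and Koch (1977) bands for interpreting kappa. The sketch below encodes those bands as a small helper; `landis_koch_label` is a hypothetical name, and the cut-offs are a convention rather than a statistical test.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch descriptive band."""
    if kappa < 0:
        return 'poor'
    if kappa <= 0.20:
        return 'slight'
    if kappa <= 0.40:
        return 'fair'
    if kappa <= 0.60:
        return 'moderate'
    if kappa <= 0.80:
        return 'substantial'
    return 'almost perfect'

# Scores taken from the results table above.
for pair, score in [('reviewer1-reviewer2', 0.627116),
                    ('reviewer1-reviewer5', 0.540313),
                    ('reviewer4-reviewer5', 0.3449)]:
    print(pair, landis_koch_label(score))
```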

#### Conclusions
Given the overall results, there is moderate agreement among the five taggers on the overall sentiment of the tweets. Given that most of the tweets read as neutral to me, I am surprised that there wasn't more agreement among the taggers. Perhaps if the tweet pool were larger or I had asked more taggers to help with the experiment, we would see a higher agreement level.  
I would also like to add that, because Cohen's kappa can only score two raters at a time, it is hard to determine agreement among five different taggers. Perhaps another metric, such as Fleiss' kappa or another multi-rater kappa, would give a different picture of the overall agreement between the taggers. 
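
To make the Fleiss suggestion concrete, here is a minimal hand-written sketch of Fleiss' kappa, which produces a single agreement score for any fixed number of raters. It was not run on the tweet data; the function name `fleiss_kappa` and the toy rows are illustrative assumptions, and the sketch assumes every item was labeled by the same number of raters and that more than one label is in use overall.

```python
from collections import Counter

def fleiss_kappa(rows):
    """Fleiss' kappa. rows holds one sequence of labels per item, one label per rater."""
    rows = [list(r) for r in rows]
    n_items = len(rows)
    n_raters = len(rows[0])
    if any(len(r) != n_raters for r in rows):
        raise ValueError('every item needs the same number of ratings')

    counts = [Counter(r) for r in rows]
    # Per-item agreement: share of rater pairs that chose the same label.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each label.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Toy data: four items, three hypothetical raters each.
toy_rows = [
    ['pos', 'pos', 'pos'],
    ['neg', 'neg', 'neg'],
    ['neu', 'neu', 'pos'],
    ['neg', 'neu', 'neg'],
]
print(fleiss_kappa(toy_rows))
```

In this notebook, the five reviewer columns could be passed as `fleiss_kappa(reviewers.values.tolist())`, which hands the function one row of five labels per tweet.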