# Exercise 4.2: Extra Credit: Sentiment analysis on public tweets about airlines

## For up to 5% extra credit, find another set of comments, e.g., some tweets, and perform the same sentiment analysis.

### Here we use the data available at the GitHub link https://raw.githubusercontent.com/kolaveridi/kaggle-Twitter-US-Airline-Sentiment-/master/Tweets.csv

### This file contains public tweets about airlines in the "text" column, which we will analyze to determine the sentiment of each tweet.

### This file also comes with a pre-populated "airline_sentiment" column. We will add a new column, "derived_sentiment", populated using our own scheme for sentiment analysis.

### We will then compare the existing "airline_sentiment" values with our new "derived_sentiment" values to see how often they agree.



In [133]:
# Load libraries
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import numpy as np

## 1. Load the tweet data into a data frame.


In [134]:
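# Temporary workaround: disable SSL certificate verification so that read_csv can fetch the file.
# Note that this turns off certificate checks for every HTTPS request made in this session.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context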

data_source_url = "https://raw.githubusercontent.com/kolaveridi/kaggle-Twitter-US-Airline-Sentiment-/master/Tweets.csv"
airline_tweets_corpus = pd.read_csv(data_source_url)
#airline_tweets_corpus= airline_tweets_corpus.head()
airline_tweets_corpus.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [135]:
# Checking the number of tweets in the file.
airline_tweets_corpus.tweet_id.count()

14640


### Here we use VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon and rule-based sentiment analysis tool.

### A sentiment lexicon is a list of lexical features (e.g., words) which are generally labelled according to their semantic orientation as either positive or negative.

### We will use the polarity_scores() method to obtain the polarity indices for the given sentence.

### The Positive, Negative and Neutral scores represent the proportion of text that falls in these categories.
### The Compound score is the sum of the lexicon ratings of the words in the text, normalized to lie between -1 (most extreme negative) and +1 (most extreme positive).
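
As a rough illustration of that normalization: to the best of my understanding, the vaderSentiment source maps the raw sum x of valence ratings to x / sqrt(x^2 + alpha) with alpha = 15. The sketch below re-implements only that formula in plain Python. The function name `normalize_compound` is hypothetical, and the real library applies further rule-based adjustments (negation, punctuation, capitalization) before this step, so treat it as an approximation of the idea rather than the library's code.

```python
import math


def normalize_compound(raw_sum, alpha=15.0):
    """Squash an unbounded sum of valence ratings into the range [-1, 1].

    alpha = 15 is assumed to match the default used by vaderSentiment.
    """
    norm = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clamp defensively so that rounding can never push the value outside [-1, 1].
    return max(-1.0, min(1.0, norm))


# Larger raw sums approach +1 or -1 but never exceed them.
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw, round(normalize_compound(raw), 4))
```

The mapping is symmetric around zero, which is why the sentiment thresholds used in this notebook are symmetric as well.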

### Below we define a function which takes a sentence as input and returns a one-row data frame with the positive, negative, neutral and compound values and the overall sentiment of the sentence.

### Using the compound value, we set the sentiment according to the rules below:
#### 1. Positive sentiment -> compound score >= 0.05
#### 2. Neutral sentiment -> (compound score > -0.05) and (compound score < 0.05)
#### 3. Negative sentiment -> compound score <= -0.05

In [136]:
analyser = SentimentIntensityAnalyzer()


def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    # using the compound value to set the sentiment according to the rules below
    # 1. Positive sentiment -> compound score >= 0.05
    # 2. Neutral sentiment -> (compound score > -0.05) and (compound score < 0.05)
    # 3. Negative sentiment -> compound score <= -0.05
    compound = score.get('compound')
    if compound >= 0.05:
        v_sentiment = "positive"
    elif compound <= -0.05:
        v_sentiment = "negative"
    else:
        v_sentiment = "neutral"

    # adding the derived_sentiment key:value pair to the score dictionary
    score.update(derived_sentiment = v_sentiment)

    # converting the dictionary to a one-row data frame
    df = pd.DataFrame([score])

    # returning the data frame
    return df


### Calling the sentiment_analyzer_scores function for every comment in the corpus
### and concatenating the scores into one data frame.

In [137]:
score_frames = []
for comment in airline_tweets_corpus['text']:
    # getting the score for each row/comment in the corpus (a one-row data frame)
    score_frames.append(sentiment_analyzer_scores(comment))

# combining all the individual row scores one below the other (axis = 0)
# ignore_index=True renumbers the rows 0..n-1; without it every row of df2 would keep index 0.
df2 = pd.concat(score_frames, axis=0, ignore_index=True)

### Concatenate the scores data frame and the airline_tweets_corpus data frame.

In [138]:
# concatenate the corpus and score data frames along axis=1 to view the scores side by side with the comments.
results2 = pd.concat([airline_tweets_corpus,df2],axis =1, sort = False)

results2.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,neg,neu,pos,compound,derived_sentiment
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),0.0,1.0,0.0,0.0,neutral
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),0.0,1.0,0.0,0.0,neutral
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),0.0,1.0,0.0,0.0,neutral
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),0.226,0.645,0.129,-0.2716,negative
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),0.296,0.704,0.0,-0.5829,negative
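
One point worth keeping in mind about the side-by-side concatenation: `pd.concat(..., axis=1)` aligns rows by index label, not by position. It works here because both data frames carry the default 0..n-1 index. The sketch below uses small made-up frames (the names `left` and `right` and their values are purely illustrative) to show what happens when the indexes differ, and one way to guard against it.

```python
import pandas as pd

# Hypothetical frames: 'left' has a non-default index, 'right' has the default 0, 1, 2.
left = pd.DataFrame({"text": ["a", "b", "c"]}, index=[10, 11, 12])
right = pd.DataFrame({"compound": [0.1, -0.2, 0.0]})

# Index labels do not overlap, so the result is the union of both indexes with NaN gaps.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first makes the rows line up by position.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape)
print(aligned.shape)
```

If the corpus were filtered or sorted before scoring, resetting its index in this way would keep the scores attached to the right tweets.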


In [139]:
results2[["tweet_id","airline_sentiment","text","derived_sentiment"]].head()

Unnamed: 0,tweet_id,airline_sentiment,text,derived_sentiment
0,570306133677760513,neutral,@VirginAmerica What @dhepburn said.,neutral
1,570301130888122368,positive,@VirginAmerica plus you've added commercials t...,neutral
2,570301083672813571,neutral,@VirginAmerica I didn't today... Must mean I n...,neutral
3,570301031407624196,negative,@VirginAmerica it's really aggressive to blast...,negative
4,570300817074462722,negative,@VirginAmerica and it's a really big bad thing...,negative


### As we can see, the result above shows the derived_sentiment value against each row in the corpus. Only a few columns are selected for display to avoid horizontal scrolling.


### Now we will compare the values of the pre-existing airline_sentiment and the new derived_sentiment values and add the results to the data frame as a new "comparison_column" column.

In [144]:
# compare "airline_sentiment" and "derived_sentiment"; the == comparison already yields a boolean column
comparison_column = results2["airline_sentiment"] == results2["derived_sentiment"]
results2["comparison_column"] = comparison_column

#results2.head()
results2[["tweet_id","airline_sentiment","text","derived_sentiment","comparison_column"]].head()

Unnamed: 0,tweet_id,airline_sentiment,text,derived_sentiment,comparison_column
0,570306133677760513,neutral,@VirginAmerica What @dhepburn said.,neutral,True
1,570301130888122368,positive,@VirginAmerica plus you've added commercials t...,neutral,False
2,570301083672813571,neutral,@VirginAmerica I didn't today... Must mean I n...,neutral,True
3,570301031407624196,negative,@VirginAmerica it's really aggressive to blast...,negative,True
4,570300817074462722,negative,@VirginAmerica and it's a really big bad thing...,negative,True


### Displaying the unique values in "comparison_column"

In [141]:
results2.comparison_column.unique()

array([ True, False])

In [142]:
results2.tweet_id.count()

14640

### Displaying the number of unique tweet_id values grouped by "comparison_column", so that we can see how many of the pre-existing "airline_sentiment" values match the new "derived_sentiment" values.

In [143]:
unique_number = results2.groupby('comparison_column')['tweet_id'].nunique()
unique_number

comparison_column
False    7406
True     7093
Name: tweet_id, dtype: int64

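### The number against "True" is the count of distinct tweet IDs for which the two values match, and the number against "False" is the count for which they do not match. The two counts sum to 14499 rather than 14640 because nunique() counts distinct tweet_id values and the file contains some duplicate tweet IDs. By this count the two labels agree for 7093 of 14499 tweets, roughly 49%.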
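
A single match count hides which classes disagree. As a possible next step, the sketch below computes the row-level agreement rate and a cross-tabulation of the two labels. It uses a small toy frame whose labels are copied from the five rows displayed earlier, not the full corpus, and the variable names `toy`, `matches`, `agreement_rate` and `confusion` are illustrative only.

```python
import pandas as pd

# Toy data: the two label columns of the first five rows shown in this notebook.
toy = pd.DataFrame({
    "airline_sentiment": ["neutral", "positive", "neutral", "negative", "negative"],
    "derived_sentiment": ["neutral", "neutral", "neutral", "negative", "negative"],
})

# Boolean Series: True where the two labels agree.
matches = toy["airline_sentiment"] == toy["derived_sentiment"]

# The mean of a boolean Series is the fraction of True values, counted per row.
agreement_rate = matches.mean()

# Rows are the pre-existing labels, columns are the derived labels.
confusion = pd.crosstab(toy["airline_sentiment"], toy["derived_sentiment"])

print(agreement_rate)
print(confusion)
```

Applied to `results2`, the same two calls would count every row, including those with duplicate tweet IDs, so the totals could differ slightly from the nunique() figures above.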