## Challenge

In [1]:
# dataframes
import pandas as pd

# numerical arrays
import numpy as np

# plotting library
import matplotlib.pyplot as plt

# accessing the operating system
import os

# ensure plots occur in the notebook
%matplotlib inline

# set seed for reproducibility
np.random.seed(0)

In [2]:
os.getcwd()

'/Users/johannesscr/Box Sync/Kaggle/SentimentAnalysis'

In [3]:
# read data from file
tweets_data = pd.read_csv('./Datasets/Tweets.csv')

In [13]:
tweets_data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [5]:
tweets_data[tweets_data.airline != 'Virgin America'][0:5]

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
504,570307876897628160,positive,1.0,,,United,,rdowning76,,0,@united thanks,,2015-02-24 11:42:48 -0800,usa,
505,570307847281614848,positive,1.0,,,United,,CoreyAStewart,,0,@united Thanks for taking care of that MR!! Ha...,,2015-02-24 11:42:41 -0800,"Richmond, VA",Eastern Time (US & Canada)
506,570307109704900608,negative,1.0,Cancelled Flight,0.703,United,,CoralReefer420,,0,@united still no refund or word via DM. Please...,,2015-02-24 11:39:45 -0800,"Bay Area, California",Alaska
507,570307026263384064,negative,1.0,Late Flight,1.0,United,,lsalazarll,,0,@united Delayed due to lack of crew and now de...,,2015-02-24 11:39:25 -0800,,Mountain Time (US & Canada)
508,570306733010264064,positive,0.3441,,0.0,United,,rombaa,,0,@united thanks -- we filled it out. How's our ...,,2015-02-24 11:38:15 -0800,,


In [6]:
tweets_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
tweet_id                        14640 non-null int64
airline_sentiment               14640 non-null object
airline_sentiment_confidence    14640 non-null float64
negativereason                  9178 non-null object
negativereason_confidence       10522 non-null float64
airline                         14640 non-null object
airline_sentiment_gold          40 non-null object
name                            14640 non-null object
negativereason_gold             32 non-null object
retweet_count                   14640 non-null int64
text                            14640 non-null object
tweet_coord                     1019 non-null object
tweet_created                   14640 non-null object
tweet_location                  9907 non-null object
user_timezone                   9820 non-null object
dtypes: float64(2), int64(2), object(11)
memory usage: 1.7+ MB


In [7]:
tweets_data.describe()

Unnamed: 0,tweet_id,airline_sentiment_confidence,negativereason_confidence,retweet_count
count,14640.0,14640.0,10522.0,14640.0
mean,5.692184e+17,0.900169,0.638298,0.08265
std,779111200000000.0,0.16283,0.33044,0.745778
min,5.675883e+17,0.335,0.0,0.0
25%,5.685592e+17,0.6923,0.3606,0.0
50%,5.694779e+17,1.0,0.6706,0.0
75%,5.698905e+17,1.0,1.0,0.0
max,5.703106e+17,1.0,1.0,44.0


### Task

The task is to predict sentiment of tweets to U.S. airlines. We are less interested in the solution to this problem and more in understanding your thought process when approaching the problem. We want to see how you approach preparing and exploring the data, model training, validation, testing, evaluation, which metrics you use (and why) and critical thinking in evaluating the results. Finally we would like you to present your findings.

> **Note**: At present, my knowledge of machine learning is limited to linear regression.

I will therefore start the analysis from my current base of knowledge.

Method:
1. Load the data
2. Look at missing data
3. Research interpretation of the missing data
4. Investigate methods of evaluation
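
Step 2 of the list above could be sketched as follows. The small frame is a made-up stand-in for `tweets_data` (its values are illustrative only, not taken from the dataset); on the real frame the same idea would be `tweets_data.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tweets_data; the values are made up for illustration.
toy = pd.DataFrame({
    "airline_sentiment": ["neutral", "positive", "negative", "negative"],
    "negativereason": [np.nan, np.nan, "Bad Flight", "Can't Tell"],
    "tweet_coord": [np.nan, np.nan, np.nan, "[40.7, -74.0]"],
})

# Count missing values per column and express them as a share of all rows.
missing_count = toy.isna().sum()
missing_share = missing_count / len(toy)

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```

Sorting by the number of missing values puts the sparsest columns first, which helps decide which columns are worth researching further in step 3.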


Looking at the distribution on the graph

## 1 Method
Take each word of each tweet and build a reference table that serves as a lookup table.

For each tweet, take its `airline_sentiment` label and increment the matching count (`positive`, `negative` or `neutral`) for every word in the tweet.

Normalise the counts to a rating between -1 and 1, where -1 denotes negative sentiment, 0 neutral sentiment and 1 positive sentiment.

The result should be a hash table:
```json
{
    "word": {
        "positive": 100,
        "neutral": 0,
        "negative": 10,
        "sentiment": 0.8181
    }
}
```
where `sentiment = (positive - negative) / total_sentiment` and `total_sentiment = positive + neutral + negative`.
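
As a minimal sketch, the formula can be written as a small function. The name `word_sentiment` and the choice to return 0 for a word with no observations are assumptions made for this illustration.

```python
def word_sentiment(positive, neutral, negative):
    """Return (positive - negative) / (positive + neutral + negative)."""
    total = positive + neutral + negative
    if total == 0:
        # Assumption: a word with no observations is treated as neutral.
        return 0.0
    return (positive - negative) / total


# The example entry above: 100 positive, 0 neutral, 10 negative.
print(word_sentiment(100, 0, 10))
```

For the example entry this gives 90/110, roughly 0.818, in line with the `sentiment` value shown in the hash table. A word seen equally often in all three classes scores 0, and a word seen only in negative tweets scores -1.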

---

### Building a dictionary of words

In [8]:
words = ['one', 'two', 'three', 'one', 'one', 'three', 'two']

total_words = {}
for word in words:
    if word in total_words:
        total_words[word]['count'] += 1
    else:
        total_words[word] = {
            'count': 1  # the first occurrence counts as one
        }

print(total_words)

{'one': {'count': 3}, 'two': {'count': 2}, 'three': {'count': 2}}
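
An alternative sketch uses `collections.Counter` from the standard library, which tallies every occurrence including the first. The conversion back to the nested layout is only there to match the structure used in this notebook.

```python
from collections import Counter

words = ['one', 'two', 'three', 'one', 'one', 'three', 'two']

# Counter tallies every occurrence, including the first one.
word_counts = Counter(words)

# Convert to the nested {'word': {'count': n}} layout used in this notebook.
total_words = {word: {'count': n} for word, n in word_counts.items()}
print(total_words)
```

Since `'one'` appears three times and `'two'` and `'three'` twice each, the counts come out as 3, 2 and 2.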


---

In [34]:
len(tweets_data)

14640

In [16]:
tweets_data[:][0:3]
word_dict = {}


def increment_sentiment(dictionary, word, sentiment):
    """
    Increment the count for `sentiment` and recompute the word's score.

    @param dictionary - the word dictionary
    @param word - word to be investigated
    @param sentiment - sentiment associated with the word
    """
    counts = dictionary[word]
    counts[sentiment] += 1
    counts["sentiment"] = (counts["positive"] - counts["negative"]) / (
        counts["positive"] + counts["neutral"] + counts["negative"])
    return dictionary


for i in range(10):
    sentiment = tweets_data["airline_sentiment"][i]
    tweet = tweets_data["text"][i]
    tweet = tweet.replace('.', '')
    tweet = tweet.replace(',', '')
    tweet = tweet.split()
    # other punctuation is kept, as the sentiment of e.g. 'hi' and 'hi!' differs
    
    for word in tweet:
        # create an empty entry the first time a word is seen
        if word not in word_dict:
            word_dict[word] = {
                "positive": 0,
                "neutral": 0,
                "negative": 0,
                "sentiment": 0,
            }
        word_dict = increment_sentiment(word_dict, word, sentiment)

print(word_dict)

{'@VirginAmerica': {'positive': 3, 'neutral': 3, 'negative': 3, 'sentiment': 0.0}, 'What': {'positive': 0, 'neutral': 1, 'negative': 0, 'sentiment': 0.0}, '@dhepburn': {'positive': 0, 'neutral': 1, 'negative': 0, 'sentiment': 0.0}, 'said': {'positive': 0, 'neutral': 1, 'negative': 0, 'sentiment': 0.0}, 'plus': {'positive': 1, 'neutral': 0, 'negative': 0, 'sentiment': 1.0}, "you've": {'positive': 1, 'neutral': 0, 'negative': 0, 'sentiment': 1.0}, 'added': {'positive': 1, 'neutral': 0, 'negative': 0, 'sentiment': 1.0}, 'commercials': {'positive': 1, 'neutral': 0, 'negative': 0, 'sentiment': 1.0}, 'to': {'positive': 2, 'neutral': 1, 'negative': 1, 'sentiment': 0.25}, 'the': {'positive': 1, 'neutral': 0, 'negative': 1, 'sentiment': 0.0}, 'experience': {'positive': 1, 'neutral': 0, 'negative': 0, 'sentiment': 1.0}, 'tacky': {'positive': 1, 'neutral': 0, 'negative': 0, 'sentiment': 1.0}, 'I': {'positive': 3, 'neutral': 2, 'negative': 0, 'sentiment': 0.6}, "didn't": {'positive': 0, 'neutral':
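
The splitting step above only strips `.` and `,` and is case sensitive, so `What` and `what` end up as separate keys. A hedged alternative is a regex tokeniser. The function name `tokenize` and its rules (lowercase everything, keep @mentions and #hashtags whole, keep apostrophes inside words, drop all other punctuation) are assumptions for this sketch, not the approach used in the cell above.

```python
import re


def tokenize(text):
    """Lowercase a tweet and split it into word tokens.

    Assumption: @mentions and #hashtags are kept whole, apostrophes inside
    words are kept (e.g. "didn't"), and all other punctuation is dropped.
    """
    return re.findall(r"[@#]?\w+(?:'\w+)*", text.lower())


print(tokenize("@VirginAmerica Hi, hi! I didn't today..."))
```

Dropping punctuation merges `hi` and `hi!` into one key, which gives each word more observations but loses whatever signal the exclamation mark carried. Which trade-off works better is something to check against the data.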

In [15]:
summed_sentiment = 0.0
summed_pos_sentiment = 0
summed_neu_sentiment = 0
summed_neg_sentiment = 0

tweet_number = 1
sentiment = tweets_data["airline_sentiment"][tweet_number]
tweet = tweets_data["text"][tweet_number]
tweet = tweet.replace('.', '')
tweet = tweet.replace(',', '')
words = tweet.split()

print('The chosen tweet is:\n{:s}'.format(tweet))
print('The chosen tweet has sentiment: {:s}\n'.format(sentiment))
print('{:15s}|{:10s}|{:10s}|{:10s}|{:10s}'.format('Word', 'Positive', 
                                                  'Neutral', 'Negative', 
                                                  'Sentiment'))

for word in words:
    summed_sentiment += word_dict[word]['sentiment']
    summed_pos_sentiment += word_dict[word]['positive']
    summed_neu_sentiment += word_dict[word]['neutral']
    summed_neg_sentiment += word_dict[word]['negative']
    print('{:15s}|{:10d}|{:10d}|{:10d}|{:9.3f}'.format(word, 
                                                       word_dict[word]['positive'], 
                                                       word_dict[word]['neutral'],
                                                       word_dict[word]['negative'], 
                                                       word_dict[word]['sentiment']))
    
    
# totals row: summed counts and the mean per-word sentiment
print('{:15s}|{:10d}|{:10d}|{:10d}|{:9.3f}'.format('Total',
                                                   summed_pos_sentiment, 
                                                   summed_neu_sentiment, 
                                                   summed_neg_sentiment, 
                                                   summed_sentiment/len(words)))
# print(words)
# for tweet

The chosen tweet is:
@VirginAmerica plus you've added commercials to the experience tacky
The chosen tweet has sentiment: positive

Word           |Positive  |Neutral   |Negative  |Sentiment 
@VirginAmerica |       143|       161|       177|   -0.071
plus           |         4|         2|        21|   -0.630
you've         |         4|         1|        13|   -0.500
added          |         2|         3|         3|   -0.125
commercials    |         1|         0|         0|    1.000
to             |       248|       440|      1617|   -0.594
the            |       246|       231|      1148|   -0.555
experience     |        12|         1|        47|   -0.583
tacky          |         1|         0|         0|    1.000
Total          |       661|       839|      3026|   -0.118
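
The per-tweet score in the last row is the mean of the word sentiments. A minimal sketch of turning that into a predicted label follows. The names `score_tweet` and `label_from_score`, the neutral default for unseen words, and the threshold of 0.1 are all assumptions for illustration; the threshold in particular is arbitrary and would need tuning.

```python
def score_tweet(words, word_dict, default=0.0):
    """Return the mean per-word sentiment of a tokenised tweet.

    Assumption: words missing from word_dict contribute `default` (neutral).
    """
    if not words:
        return default
    total = sum(word_dict.get(word, {}).get("sentiment", default) for word in words)
    return total / len(words)


def label_from_score(score, threshold=0.1):
    """Map a score to a label. The threshold is an arbitrary assumption."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Toy lookup table with made-up scores, only to exercise the functions.
toy_dict = {
    "great": {"sentiment": 0.8},
    "delayed": {"sentiment": -0.9},
}

score = score_tweet(["great", "flight"], toy_dict)  # "flight" is unseen
print(score, label_from_score(score))
```

Using `.get` with a default avoids a `KeyError` when a tweet contains a word that is not in the lookup table. With this particular threshold, the tweet in the table above (mean -0.118) would be labelled negative even though its recorded sentiment is positive.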


---
Findings thus far

From the earlier look at the dataset, it is biased towards negative sentiment. The table above shows the effect: even common words such as `to` and `the` score around -0.6.
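
One way to quantify this bias is to look at the share of each label. The sketch below uses made-up toy labels; on the real data the series would be `tweets_data["airline_sentiment"]`. Treating the majority-class share as a baseline accuracy is an addition made here, not something from the task description.

```python
import pandas as pd

# Toy labels, made up for illustration; on the real data use
# tweets_data["airline_sentiment"] instead.
labels = pd.Series(["negative", "negative", "negative", "neutral", "positive"])

# Share of each class among all tweets.
class_share = labels.value_counts(normalize=True)
print(class_share)

# Always predicting the most common class would be right this often.
baseline_accuracy = class_share.max()
print(baseline_accuracy)
```

Any model should beat this baseline before its accuracy means much, which is one reason plain accuracy can be misleading on an imbalanced dataset.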

With an initial sample of 20 tweets, this analysis seems promising.

With a sample of 100 tweets, the negative bias starts to show slightly.

With a 