## Training Data
#### We obtained a CSV of tweets about a few airlines. The tweets are already labeled with a positive, negative, or neutral sentiment, so it is a good place to start for our CTA model.

* Link to original data source: https://www.figure-eight.com/data-for-everyone/

In [9]:
# import dependencies
import pandas as pd
import re
import tweepy
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
# import the airline csv
airlines_csv = '../Resources/Airline-Sentiment-2-w-AA.csv'
airlines_df = pd.read_csv(airlines_csv, encoding = 'ISO-8859-1')
airlines_df.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,airline_sentiment,airline_sentiment:confidence,negativereason,negativereason:confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
0,681448150,False,finalized,3,2/25/15 5:24,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2/24/15 11:35,5.70306e+17,,Eastern Time (US & Canada)
1,681448153,False,finalized,3,2/25/15 1:53,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2/24/15 11:15,5.70301e+17,,Pacific Time (US & Canada)
2,681448156,False,finalized,3,2/25/15 10:01,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2/24/15 11:15,5.70301e+17,Lets Play,Central Time (US & Canada)
3,681448158,False,finalized,3,2/25/15 3:05,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2/24/15 11:15,5.70301e+17,,Pacific Time (US & Canada)
4,681448159,False,finalized,3,2/25/15 5:50,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2/24/15 11:14,5.70301e+17,,Pacific Time (US & Canada)


## Cleaning the data
 * Remove usernames, special characters, extra spaces, numbers, and NaN rows
 * Found great code to use here: https://github.com/mertkahyaoglu/twitter-sentiment-analysis/blob/master/utils.py

In [11]:
# pull the columns that contain the tweets (features) and sentiment labels (labels)
# column 14 is 'text' and column 5 is 'airline_sentiment'
features = airlines_df.iloc[:, 14].values
# print(features)
labels = airlines_df.iloc[:, 5].values
# print(labels)
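
Positional indices depend on the column order in the CSV. A minimal sketch of the same selection by column name, assuming the `text` and `airline_sentiment` column names shown in the `head()` output. The small `sample_df` is made up for illustration, and the `dropna` call is one possible way to handle the NaN rows mentioned in the cleaning list.

```python
import pandas as pd

# Hypothetical stand-in for airlines_df with the two columns used in this notebook
sample_df = pd.DataFrame({
    "airline_sentiment": ["neutral", "positive", None, "negative"],
    "text": ["@VirginAmerica What @dhepburn said.", "great flight!", "lost", None],
})

# Drop rows where either the tweet text or the label is missing
sample_df = sample_df.dropna(subset=["text", "airline_sentiment"])

# Select by name so the code keeps working if the column order changes
sample_features = sample_df["text"].values
sample_labels = sample_df["airline_sentiment"].values
print(len(sample_features), len(sample_labels))
```

Dropping rows on the DataFrame before pulling out the arrays keeps the tweets and their labels the same length.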

In [12]:
# lowercase every tweet and drop hashtags, mentions, links, and retweet markers
clean_data = []
for feature in features:
    item = ' '.join(word.lower() for word in feature.split()
                    if not word.startswith(('#', '@', 'http', 'RT')))
    # keep every tweet, even if it is now empty, so clean_data stays aligned with labels
    clean_data.append(item)
        


In [13]:
# https://www.earthdatascience.org/courses/earth-analytics-python/using-apis-natural-language-processing-twitter/calculate-tweet-word-frequencies-in-python/
# used code from the link above to remove URLs and all special characters
def remove_url(txt):
    # raw string so the regex escapes (\w, \S) are passed to re unchanged
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())

cleaned_data_no_urls = [remove_url(tweet) for tweet in clean_data]
cleaned_data_no_urls[:10]

['what said',
 'plus youve added commercials to the experience tacky',
 'i didnt today must mean i need to take another trip',
 'its really aggressive to blast obnoxious entertainment in your guests faces amp they have little recourse',
 'and its a really big bad thing about it',
 'seriously would pay 30 a flight for seats that didnt have this playing its really the only bad thing about flying va',
 'yes nearly every time i fly vx this ear worm wont go away',
 'really missed a prime opportunity for men without hats parody there',
 'well i didntbut now i do d',
 'it was amazing and arrived an hour early youre too good to me']
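
To see what the two cleaning steps do to a single tweet, here is a small self-contained sketch. The function name `clean_tweet` and the sample tweet are made up for illustration; the token filter and the regular expression mirror the cells above.

```python
import re

def clean_tweet(tweet):
    """Lowercase a tweet and strip hashtags, mentions, links, RT markers, and punctuation."""
    # Step 1: drop tokens that are hashtags, mentions, links, or retweet markers
    kept = [word.lower() for word in tweet.split()
            if not word.startswith(('#', '@', 'http', 'RT'))]
    # Step 2: remove any remaining URLs and non-alphanumeric characters
    text = re.sub(r"([^0-9A-Za-z \t])|(\w+://\S+)", "", " ".join(kept))
    # Collapse repeated whitespace
    return " ".join(text.split())

sample = "RT @VirginAmerica it's really #late!! see https://example.com now"
print(clean_tweet(sample))  # its really see now
```

A tweet made up only of mentions or hashtags comes back as an empty string. If such tweets are dropped, the matching label has to be dropped as well, otherwise the tweets and labels no longer line up.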

## Train and test the model 

In [14]:
# Split the data into training data and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(cleaned_data_no_urls, labels, random_state=1)

In [21]:
# Encode the labels to numbers
sentiments = ['positive', 'negative', 'neutral']
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sentiments)
# list(le.classes_)
train_labels = le.transform(y_train) 
test_labels = le.transform(y_test)
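
`LabelEncoder` sorts the classes it sees, so the integer codes follow alphabetical order and not the order of the `sentiments` list. A quick check using only the three label strings from this dataset (the name `le_demo` is just for this sketch):

```python
from sklearn import preprocessing

le_demo = preprocessing.LabelEncoder()
le_demo.fit(['positive', 'negative', 'neutral'])

# classes_ holds the sorted class names; a label's code is its position in that array
print(le_demo.classes_.tolist())
print(le_demo.transform(['negative', 'neutral', 'positive']).tolist())

# inverse_transform maps codes back to the original strings
print(le_demo.inverse_transform([2, 0]).tolist())
```

The pipeline in the next cell is fit on the string labels directly, so the encoded labels are only needed if a later step requires numeric targets.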

In [24]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import svm
# chain the steps: word counts -> TF-IDF weights -> support vector classifier
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', svm.SVC(gamma='scale')),
])

# fit on the string labels; the pipeline vectorizes the tweets internally
predictor = text_clf.fit(X_train, y_train)
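
Once fitted, a pipeline like this accepts raw strings and handles the counting and TF-IDF weighting internally. A minimal sketch on a tiny made-up training set; the sentences and the name `demo_clf` are hypothetical, and six examples are far too few for meaningful predictions.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import svm

# Made-up training data, already cleaned the same way as the tweets above
demo_texts = ['great flight thank you', 'love the crew', 'flight delayed again',
              'lost my bag terrible service', 'what time is boarding', 'is there wifi on board']
demo_labels = ['positive', 'positive', 'negative', 'negative', 'neutral', 'neutral']

demo_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', svm.SVC(gamma='scale')),
])
demo_clf.fit(demo_texts, demo_labels)

# New text should go through the same cleaning before prediction
new_tweets = ['thank you for the great crew', 'my bag is lost']
for tweet, label in zip(new_tweets, demo_clf.predict(new_tweets)):
    print(f'{label}: {tweet}')
```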

In [25]:
# measure accuracy on the training data and on the held-out test data
import numpy as np
train_predicted = text_clf.predict(X_train)
train = np.mean(train_predicted == y_train) 

test_predicted = text_clf.predict(X_test)
test = np.mean(test_predicted == y_test)            

print(f'Training prediction: {train}')
print(f'Test prediction: {test}')

Training prediction: 0.9647540983606557
Test prediction: 0.7898907103825137


In [26]:
# print a report of our model performance
# the labels are already strings, so the report names each row itself (in alphabetical order)
from sklearn import metrics
print(metrics.classification_report(y_test, test_predicted))

              precision    recall  f1-score   support

    negative       0.79      0.95      0.86      2291
     neutral       0.72      0.46      0.56       774
    positive       0.85      0.61      0.71       595

   micro avg       0.79      0.79      0.79      3660
   macro avg       0.79      0.67      0.71      3660
weighted avg       0.79      0.79      0.77      3660



 ### This model does best when locating negative statements in the sample
 * This model found most of the negative statements in the test set (recall 0.95). However, about 21% of the tweets it labeled negative were false positives (precision 0.79). Negative tweets make up 2291 of the 3660 test tweets.
 * This model did not do well at finding the neutral statements in the sample (recall 0.46). Of the statements it labeled neutral, the model was correct approximately 72% of the time
 * This model was correct 85% of the time when it labeled a statement positive, but it found only 61% of the positive statements in the sample
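
Precision and recall for each class can be read off a confusion matrix, which also shows which classes get mixed up. A minimal sketch with made-up label lists (`y_true_demo` and `y_pred_demo` are hypothetical, not this notebook's results), passing `labels` explicitly so the row and column order is unambiguous:

```python
from sklearn import metrics

label_order = ['negative', 'neutral', 'positive']
y_true_demo = ['negative', 'negative', 'negative', 'neutral', 'neutral', 'positive', 'positive']
y_pred_demo = ['negative', 'negative', 'negative', 'negative', 'neutral', 'positive', 'negative']

# Rows are true classes, columns are predicted classes, both in label_order
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo, labels=label_order)
print(cm)

# Recall = correct predictions / all true members of the class (row sums)
recall = cm.diagonal() / cm.sum(axis=1)
# Precision = correct predictions / all predictions of the class (column sums)
precision = cm.diagonal() / cm.sum(axis=0)
for name, p, r in zip(label_order, precision, recall):
    print(f'{name}: precision={p:.2f} recall={r:.2f}')
```

In this toy example every mistake lands in the negative column, which gives the negative class perfect recall but lowers its precision.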
 
 
