## Training Data
#### We obtained a CSV of tweets about several airlines. The CSV already has Twitter sentiments labeled as positive, negative, or neutral, so it is a good place to start for our CTA model.

* Link to original data source: https://www.figure-eight.com/data-for-everyone/

In [1]:
# import dependencies
import pandas as pd
import re
import tweepy
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# import the airline csv
airlines_csv = '../Resources/Airline-Sentiment-2-w-AA.csv'
airlines_df = pd.read_csv(airlines_csv, encoding = 'ISO-8859-1')
# airlines_df.head()

## Cleaning the data
 * Remove usernames, special characters, spaces, numbers, and NaN rows
 * Found great code to use here: https://github.com/mertkahyaoglu/twitter-sentiment-analysis/blob/master/utils.py

In [3]:
# pull the columns that contain the tweets (features) and sentiment labels (labels).
features = airlines_df.iloc[:, 14].values
# print(features)
labels = airlines_df.iloc[:, 5].values
# print(labels)

In [4]:
# lowercase every tweet and drop hashtags, mentions, links, and retweet markers
clean_data = []
for feature in features:
    item = ' '.join(
        word.lower() for word in feature.split()
        if not word.startswith(('#', '@', 'http', 'RT'))
    )
    # NOTE: skipping a tweet here leaves clean_data shorter than labels,
    # so the two can fall out of alignment
    if item == "":
        continue
    clean_data.append(item)
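
One thing to watch in the loop above: a tweet that cleans down to an empty string is skipped, while `labels` keeps its entry. A minimal sketch of one way to keep the two aligned is to filter tweet/label pairs together. The helper name `clean_tweet` and the sample rows are hypothetical, not part of the original notebook.

```python
def clean_tweet(text):
    """Lowercase a tweet and drop hashtags, mentions, links, and RT markers."""
    return ' '.join(
        word.lower() for word in text.split()
        if not word.startswith(('#', '@', 'http', 'RT'))
    )

# Hypothetical rows standing in for `features` and `labels`.
sample_features = [
    '@airline What @friend said.',
    '@airline #fail',                      # cleans down to an empty string
    'RT @airline thanks! http://t.co/x',
]
sample_labels = ['neutral', 'negative', 'positive']

# Clean each tweet, then drop empty tweets together with their labels.
pairs = [(clean_tweet(t), l) for t, l in zip(sample_features, sample_labels)]
pairs = [(t, l) for t, l in pairs if t]

kept_tweets = [t for t, _ in pairs]
kept_labels = [l for _, l in pairs]
print(kept_tweets, kept_labels)
```

Filtering the pairs rather than the tweets alone means the label array passed to the model always has the same length as the list of cleaned tweets.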

In [5]:
# https://www.earthdatascience.org/courses/earth-analytics-python/using-apis-natural-language-processing-twitter/calculate-tweet-word-frequencies-in-python/
# used code from the link above to remove URLs and all special characters
def remove_url(txt):
    # raw string avoids invalid-escape warnings for \w, \/ and \S
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())

cleaned_data_no_urls = [remove_url(tweet) for tweet in clean_data]
cleaned_data_no_urls[:10]

['what said',
 'plus youve added commercials to the experience tacky',
 'i didnt today must mean i need to take another trip',
 'its really aggressive to blast obnoxious entertainment in your guests faces amp they have little recourse',
 'and its a really big bad thing about it',
 'seriously would pay 30 a flight for seats that didnt have this playing its really the only bad thing about flying va',
 'yes nearly every time i fly vx this ear worm wont go away',
 'really missed a prime opportunity for men without hats parody there',
 'well i didntbut now i do d',
 'it was amazing and arrived an hour early youre too good to me']
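
The `amp` in the fourth cleaned tweet above looks like the remnant of an HTML entity (`&amp;`) whose punctuation was stripped. As a sketch, assuming the raw tweets do contain HTML entities, decoding them with the standard library's `html.unescape` before calling `remove_url` avoids the stray token. The sample string is made up for illustration.

```python
import html
import re

def remove_url(txt):
    # Same pattern as the notebook: strip URLs and non-alphanumeric characters.
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())

# Made-up sample containing an HTML entity.
raw = "guests' faces &amp; they have little recourse"

without_unescape = remove_url(raw)
with_unescape = remove_url(html.unescape(raw))

print(without_unescape)
print(with_unescape)
```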

## Train and test the model 

In [7]:
# # Split the data into training data and testing data
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(cleaned_data_no_urls, labels, random_state=1)
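
If the hold-out split commented out above is re-enabled, passing `stratify` keeps the class proportions the same in both parts, which may matter when one sentiment dominates. The toy data below is hypothetical and only shows the mechanics; `stratify` is an addition here, not something the original cell used.

```python
from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced toy data: 8 samples of class 0, 4 each of classes 1 and 2.
toy_tweets = [f'tweet {i}' for i in range(16)]
toy_labels = [0] * 8 + [1] * 4 + [2] * 4

X_train, X_test, y_train, y_test = train_test_split(
    toy_tweets, toy_labels,
    test_size=0.25,        # same fraction as scikit-learn's default
    stratify=toy_labels,   # preserve the 2:1:1 class ratio in both parts
    random_state=1,
)
print(len(X_train), len(X_test), sorted(y_test))
```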

In [6]:
# Encode the labels to numbers
from sklearn import preprocessing
sentiments = ['positive', 'negative', 'neutral']
le = preprocessing.LabelEncoder()
le.fit(sentiments)
# LabelEncoder sorts classes alphabetically, so the codes are
# negative=0, neutral=1, positive=2 (not the order of `sentiments`)
# list(le.classes_)
train_labels = le.transform(labels)
# test_labels = le.transform(y_test)
train_labels[:10]

array([1, 2, 1, 0, 0, 0, 2, 1, 2, 2])
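
To read the encoded array above, it helps to check which code belongs to which sentiment. The short sketch below refits the same encoder and prints the mapping; `LabelEncoder` stores its classes in sorted order, so the codes do not follow the order of the `sentiments` list.

```python
from sklearn import preprocessing

sentiments = ['positive', 'negative', 'neutral']
le = preprocessing.LabelEncoder()
le.fit(sentiments)

# Classes are stored sorted, so code i maps to class_order[i].
class_order = list(le.classes_)
codes = le.transform(['neutral', 'positive', 'neutral', 'negative'])
decoded = list(le.inverse_transform([0, 1, 2]))

print(class_order)
print(codes)
```

`le.inverse_transform` is the safest way to turn predicted codes back into sentiment names, since it uses the encoder's own ordering.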

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import svm
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', svm.SVC(gamma='scale'))
])

In [10]:
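# set parameters to test
from sklearn.model_selection import GridSearchCV
parameters = {
    # n-gram sizes used for tokenization
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    # ignore terms whose document frequency exceeds this threshold (used in place of stop words)
    'vect__max_df': (0.25, 0.5, 0.75, 1.0),
    # toggle inverse-document-frequency reweighting
    'tfidf__use_idf': (True, False)
}
# gs_clf is used in the next cell but its definition is missing from this export;
# the settings below are inferred from the fit log (5 folds, n_jobs=-1, verbose output)
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1, verbose=1)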


In [11]:
predictor = gs_clf.fit(cleaned_data_no_urls, train_labels)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 36.1min
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed: 88.6min finished
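
The candidate and fit counts in the log above follow directly from the parameter grid. As a quick check, scikit-learn's `ParameterGrid` can enumerate the combinations without fitting anything; the fold count of 5 is taken from the log.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__max_df': (0.25, 0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
}

# 3 n-gram ranges x 4 max_df values x 2 idf settings
n_candidates = len(ParameterGrid(parameters))
n_folds = 5  # as reported in the fit log
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Counting the fits up front is a cheap way to estimate run time before launching a search that, as here, can take over an hour.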


In [12]:
# report the best cross-validated score and the parameters that achieved it
print(f'Best Score: {gs_clf.best_score_}')

for param_name in sorted(parameters.keys()):
    print((param_name, gs_clf.best_params_[param_name]))

Best Score: 0.7838122511454888
('tfidf__use_idf', False)
('vect__max_df', 0.25)
('vect__ngram_range', (1, 1))


In [13]:
# Save the fitted grid search (the models/ directory must already exist)
from joblib import dump, load
dump(predictor, 'models/svm.joblib') 

['models/svm.joblib']
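
`joblib.dump` does not create missing directories, so the save step fails if `models/` is absent. The sketch below shows a save and reload round trip that creates the directory first. It writes to a temporary location and uses a placeholder dictionary in place of the fitted model, both of which are assumptions made so the example runs on its own.

```python
import os
import tempfile
from joblib import dump, load

# Temporary stand-in for the notebook's 'models' directory.
model_dir = os.path.join(tempfile.mkdtemp(), 'models')
os.makedirs(model_dir, exist_ok=True)  # no error if it already exists
model_path = os.path.join(model_dir, 'svm.joblib')

# Placeholder object; in the notebook this would be the fitted `predictor`.
stand_in_model = {'note': 'placeholder for the fitted GridSearchCV object'}

written = dump(stand_in_model, model_path)  # returns the list of files written
restored = load(model_path)
print(written, restored == stand_in_model)
```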


### Load the saved model 

In [12]:
from joblib import dump, load
predictor = load('models/svm.joblib') 

In [39]:
predictions = predictor.predict(cleaned_data_no_urls[:10])
predictions

array([1, 2, 1, 0, 0, 0, 2, 1, 2, 2])

In [None]:
# from sklearn import metrics
# print(metrics.classification_report(train_labels[:10], predictions,
#     target_names=le.classes_))

In [26]:
predictor.decision_function(cleaned_data_no_urls[:10])

array([[ 1.07872946,  2.41389228, -0.49262174],
       [-0.09837445,  0.86488566,  2.23348879],
       [ 1.07196639,  2.32183401, -0.39380041],
       [ 2.42888912,  0.86536879, -0.29425791],
       [ 2.42886332,  0.80647378, -0.2353371 ],
       [ 2.5       ,  0.89588038, -0.39588038],
       [ 1.04588664, -0.19954154,  2.15365491],
       [ 1.0777164 ,  2.26834767, -0.34606407],
       [-0.13876501,  0.97804442,  2.16072059],
       [ 1.01998935, -0.4683252 ,  2.44833585]])
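
Each row of the array above holds one score per class for a tweet. As a sketch, assuming the columns follow the encoder's sorted class order (negative, neutral, positive), taking the highest score in each row gives the predicted code, which can then be mapped back to a sentiment name. The scores are copied from the output above.

```python
import numpy as np

# Decision scores copied from the notebook output (one row per tweet).
scores = np.array([
    [ 1.07872946,  2.41389228, -0.49262174],
    [-0.09837445,  0.86488566,  2.23348879],
    [ 1.07196639,  2.32183401, -0.39380041],
    [ 2.42888912,  0.86536879, -0.29425791],
    [ 2.42886332,  0.80647378, -0.2353371 ],
    [ 2.5       ,  0.89588038, -0.39588038],
    [ 1.04588664, -0.19954154,  2.15365491],
    [ 1.0777164 ,  2.26834767, -0.34606407],
    [-0.13876501,  0.97804442,  2.16072059],
    [ 1.01998935, -0.4683252 ,  2.44833585],
])

# Assumed column order: the encoder's sorted classes.
classes = np.array(['negative', 'neutral', 'positive'])

predicted_codes = scores.argmax(axis=1)
predicted_names = classes[predicted_codes]
print(predicted_codes)
print(predicted_names)
```

For these ten tweets the argmax matches the output of `predict` shown earlier. The gap between the top two scores in a row gives a rough sense of how close the call was.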

In [10]:
# # print a report of our model performance 
# from sklearn import metrics
# print(metrics.classification_report(y_test, test_predicted,
#     target_names=sentiments))

              precision    recall  f1-score   support

    positive       0.79      0.95      0.86      2291
    negative       0.72      0.46      0.56       774
     neutral       0.85      0.61      0.71       595

   micro avg       0.79      0.79      0.79      3660
   macro avg       0.79      0.67      0.71      3660
weighted avg       0.79      0.79      0.77      3660



### Original SVM model (no grid search)
 
               precision    recall  f1-score   support

    positive       0.79      0.95      0.86      2291
    negative       0.72      0.46      0.56       774
     neutral       0.85      0.61      0.71       595

   micro avg       0.79      0.79      0.79      3660
   macro avg       0.79      0.67      0.71      3660
weighted avg       0.79      0.79      0.77      3660

 
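#### This model does best when locating positive statements in the sample
 * The model found most of the positive statements (recall 0.95). However, a significant portion of the statements it labeled positive were false positives (precision 0.79).
 * The model did not do well at identifying all of the negative statements (recall 0.46). Of the statements it labeled negative, it was correct about 72% of the time.
 * The model was correct 85% of the time when it labeled a statement neutral, but it found only 61% of the neutral statements in the sample.
 * Caveat: the row names in the reports come from `sentiments` (positive, negative, neutral), while `LabelEncoder` assigns codes alphabetically (negative, neutral, positive). If the reports were produced with `target_names=sentiments`, the rows labeled positive, negative, and neutral actually describe the negative, neutral, and positive classes, respectively.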
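
The F1 scores and averages in the reports can be re-derived from the per-class precision, recall, and support. The sketch below does this with the rounded values from the table, so the results are only approximate, but they round to the same figures.

```python
# Per-class values copied from the report, in row order.
precision = [0.79, 0.72, 0.85]
recall = [0.95, 0.46, 0.61]
support = [2291, 774, 595]

# F1 is the harmonic mean of precision and recall.
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

# Macro average: plain mean over classes, ignoring class size.
macro_recall = sum(recall) / len(recall)

# Weighted average: mean over classes weighted by support.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print([round(x, 2) for x in f1])
print(round(macro_recall, 2), round(weighted_recall, 2))
```

The gap between the macro recall (0.67) and the weighted recall (0.79) reflects how much the largest class lifts the overall numbers.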
 
 
