## Training Data
#### We obtained a CSV of tweets about several airlines. Each tweet is already labeled with a sentiment (positive, negative, or neutral), so it is a good starting point for our CTA model.

* Link to original data source: https://www.figure-eight.com/data-for-everyone/

In [1]:
# import dependencies
import pandas as pd
import re
import tweepy
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# import the airline csv
airlines_csv = '../Resources/Airline-Sentiment-2-w-AA.csv'
airlines_df = pd.read_csv(airlines_csv, encoding = 'ISO-8859-1')
# airlines_df.head()

## Cleaning the data
 * Remove usernames, special characters, extra spaces, numbers, and NaN rows
 * Found great code to use here: https://github.com/mertkahyaoglu/twitter-sentiment-analysis/blob/master/utils.py
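
A minimal sketch of the NaN-row step, under the assumption that the tweets and labels sit in columns named `text` and `airline_sentiment`. Those names, and the toy rows, are hypothetical and used only for illustration; the notebook itself selects columns by position. Dropping rows in the DataFrame before extracting the arrays keeps each tweet paired with its label.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the airline data; column names are assumptions.
toy_df = pd.DataFrame({
    "text": ["@airline great flight!", np.nan, "delayed again #fail"],
    "airline_sentiment": ["positive", "neutral", "negative"],
})

# Drop rows whose tweet text is missing, so features and labels stay the same length.
toy_clean = toy_df.dropna(subset=["text"]).reset_index(drop=True)

toy_features = toy_clean["text"].values
toy_labels = toy_clean["airline_sentiment"].values
print(len(toy_features), len(toy_labels))
```

Filtering at the DataFrame level avoids having to remove the matching label by hand whenever a tweet is discarded.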

In [5]:
# pull the columns that contain the tweets (features) and sentiment labels (labels).
features = airlines_df.iloc[:, 14].values
# print(features)
labels = airlines_df.iloc[:, 5].values
# print(labels)

In [7]:
# make sure all tweets are in lowercase and remove hashtags, mentions, links, and retweet markers.
clean_data = []
for feature in features:
    item = ' '.join(
        word.lower() for word in str(feature).split()
        if not word.startswith(('#', '@', 'http', 'RT'))
    )

    # skip tweets that are empty after cleaning (item is already lowercase here).
    # NOTE: skipping a tweet without skipping its entry in `labels` leaves the two out of step.
    if item == "" or item == "rt":
        continue
    clean_data.append(item)
        
        

In [8]:
# https://www.earthdatascience.org/courses/earth-analytics-python/using-apis-natural-language-processing-twitter/calculate-tweet-word-frequencies-in-python/
# used code from the link above to remove URLs and all special characters
def remove_url(txt):
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())

cleaned_data_no_urls = [remove_url(tweet) for tweet in clean_data]
cleaned_data_no_urls[:10]

['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan']
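
To make the effect of the pattern concrete, here is a small sketch that applies the same regular expression to an invented sample tweet. The helper name `strip_urls_and_symbols` and the sample text are assumptions for illustration only. It also shows that `str()` of a missing value is the literal string `nan`, which passes through the cleaning untouched; that would be consistent with the output above if the selected entries were missing.

```python
import re

def strip_urls_and_symbols(txt):
    # Same pattern as remove_url, written as a raw string.
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())

sample = "flight delayed 3 hrs!!! see https://example.com/x :("
cleaned_sample = strip_urls_and_symbols(sample)
print(cleaned_sample)

# A missing value converted with str() becomes the text 'nan' and survives cleaning.
cleaned_missing = strip_urls_and_symbols(str(float("nan")))
print(cleaned_missing)
```

Note that the character class keeps digits, so numbers are not removed by this step.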

## Train and test the model 

In [9]:
# # Split the data into training data and testing data
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(cleaned_data_no_urls, labels, random_state=1)
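
With the split commented out, the cross-validation score from the grid search is the only performance estimate. A hold-out split might look like the sketch below; the toy tweets and labels are invented, and `stratify` is an addition not present in the original call, included on the assumption that the sentiment classes are imbalanced.

```python
from sklearn.model_selection import train_test_split

# Invented toy data standing in for the cleaned tweets and their labels.
toy_tweets = [
    "great flight", "loved the crew", "on time and easy", "best airline",
    "lost my bag", "terrible delay", "rude staff", "never again",
]
toy_sentiments = ["positive"] * 4 + ["negative"] * 4

# stratify keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    toy_tweets, toy_sentiments, random_state=1, stratify=toy_sentiments
)
print(len(X_train), len(X_test))
```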

In [10]:
# Encode the labels to numbers
sentiments = ['positive', 'negative', 'neutral']
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sentiments)
# list(le.classes_)
train_labels = le.transform(labels) 
# test_labels = le.transform(y_test)
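
`LabelEncoder` sorts its classes, so the integer codes do not follow the order of the `sentiments` list. The sketch below uses a separate encoder (named `encoder` here only to avoid clashing with `le`) to show the mapping and how to turn predicted integers back into names.

```python
from sklearn import preprocessing

sentiments = ['positive', 'negative', 'neutral']
encoder = preprocessing.LabelEncoder()
encoder.fit(sentiments)

# classes_ is sorted, so the integer codes follow alphabetical order.
mapping = {
    str(label): int(code)
    for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))
}
print(mapping)

# inverse_transform maps integer codes back to sentiment names.
decoded = [str(name) for name in encoder.inverse_transform([2, 0, 1])]
print(decoded)
```

Any report that names the classes should take its names from `classes_` rather than from the original list, so that names and codes line up.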

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier())
])

In [12]:
#set parameters to test
from sklearn.model_selection import GridSearchCV
parameters = {
    #number of combined words for tokenization
    'vect__ngram_range': [(1, 1), (1, 2), (1,3)],
    #ignore terms that appear in more than this fraction of tweets (used in place of stop words)
    'vect__max_df': (0.25, 0.5, 0.75, 1.0),
    #include idf
    'tfidf__use_idf': (True, False),
    'clf__n_estimators':(50, 100, 150, 200), 
    'clf__criterion': ('gini', 'entropy')
}

gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1, verbose=1)
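
The size of the search is the product of the number of values per parameter. A quick way to check it before committing to a long run is `ParameterGrid`, sketched here with the same grid:

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__max_df': (0.25, 0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
    'clf__n_estimators': (50, 100, 150, 200),
    'clf__criterion': ('gini', 'entropy'),
}

# 3 * 4 * 2 * 4 * 2 parameter combinations, each fitted once per fold.
n_candidates = len(ParameterGrid(parameters))
cv_folds = 5
print(n_candidates, n_candidates * cv_folds)  # 192 960
```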

In [13]:
predictor = gs_clf.fit(cleaned_data_no_urls, train_labels)

Fitting 5 folds for each of 192 candidates, totalling 960 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  5.2min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 13.8min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 21.7min
[Parallel(n_jobs=-1)]: Done 960 out of 960 | elapsed: 28.3min finished


In [15]:
from joblib import dump, load
dump(predictor, 'Models/Complement.joblib')

['Models/Complement.joblib']
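
A saved model can be restored with `load` and used on raw text. The sketch below is a self-contained round trip on a tiny stand-in pipeline; the toy tweets, the integer targets, and the temporary file path are all assumptions for illustration, not the notebook's data or its `Models/` folder.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Tiny stand-in model; the tweets and targets are invented.
toy_tweets = ["great flight", "lost my bag", "on time", "terrible delay"]
toy_targets = [2, 0, 2, 0]
toy_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
]).fit(toy_tweets, toy_targets)

# Round-trip through joblib, using a temporary path.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")
    dump(toy_clf, model_path)
    restored = load(model_path)

# The restored pipeline accepts raw text because vectorizing is part of it.
restored_predictions = restored.predict(["great flight"])
print(restored_predictions)
```

A fitted `GridSearchCV` behaves the same way after loading: with the default `refit=True`, its `predict` delegates to the best estimator found.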

In [16]:
# print the best cross-validation score and the parameters that produced it
print(f'Best Score: {gs_clf.best_score_}')                              

for param_name in sorted(parameters.keys()):
    print((param_name, gs_clf.best_params_[param_name]))

Best Score: 0.626912615335706
('clf__criterion', 'gini')
('clf__n_estimators', 50)
('tfidf__use_idf', True)
('vect__max_df', 0.25)
('vect__ngram_range', (1, 1))


In [None]:
# # print a report of our model performance
# from sklearn import metrics
# print(metrics.classification_report(y_test, test_predicted,
#     target_names=le.classes_))

 ### This model does best when locating positive statements in the sample
 * This model did a great job finding most of the positive statements in the sample (high recall). However, a significant portion of the statements it labeled positive were false positives (low precision).
 * This model did not do well at identifying all of the negative statements in the sample. Of the statements it identified as negative, it was correct approximately 70% of the time.
 * This model correctly identified neutral statements 85% of the time, but did a poor job of finding all of the neutral statements in the sample.
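
The observations above are statements about recall (how many of a class were found) and precision (how often a predicted class was right). The sketch below computes both per class on invented toy labels that mimic the same pattern; the numbers are illustrative and are not the model's results.

```python
from sklearn.metrics import precision_score, recall_score

# Invented toy labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [2, 2, 2, 0, 0, 1, 1, 0]
y_pred = [2, 2, 2, 2, 2, 1, 0, 0]

# average=None returns one score per class, in the order given by `labels`.
precision = precision_score(y_true, y_pred, labels=[0, 1, 2], average=None)
recall = recall_score(y_true, y_pred, labels=[0, 1, 2], average=None)

for name, p, r in zip(["negative", "neutral", "positive"], precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy case the positive class has perfect recall but lower precision, while the neutral class has perfect precision but lower recall.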
 
 
