# Final Project Proposal - Draft 2 - Vision Statement

### by Joseph Jinn

# Final Project Proposal - Draft 2 - Rough Code-Base

In [1]:
"""
Course: CS 344 - Artificial Intelligence
Instructor: Professor VanderLinden
Name: Joseph Jinn
Date: 4-23-19

Final Project - SLO Topic Classification

###########################################################
Notes:

Proceeding with provided labeled SLO TBL dataset.  Will attempt to preprocess and train this.

Using the "NLTK" Natural Language Toolkit as replacement for CMU Tweet Tagger preprocessor.

Using a combination of Sci-kit Learn, Numpy/Pandas, Tensorflow/Keras, and matplotlib.

###########################################################
Resources Used:

https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-certain-columns-is-nan
(drop rows with NaN values in columns)

https://www.geeksforgeeks.org/different-ways-to-iterate-over-rows-in-pandas-dataframe/
(pandas row iteration methods)

https://stackoverflow.com/questions/40408471/select-data-when-specific-columns-have-null-value-in-pandas
(create boolean indexing mask)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
(drop pandas columns)

https://stackoverflow.com/questions/12850345/how-to-combine-two-data-frames-in-python-pandas
(combine dataframes)

https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas
(drop duplicate examples)

https://www.nltk.org/
http://www.nltk.org/book/
(text pre-processing)

https://stackoverflow.com/questions/34784004/python-text-processing-nltk-and-pandas
https://stackoverflow.com/questions/48049087/nltk-based-text-processing-with-pandas
https://stackoverflow.com/questions/44173624/how-to-apply-nltk-word-tokenize-library-on-a-pandas-dataframe-for-twitter-data
(tokenize tweets using pands and nltk)

https://www.dataquest.io/blog/settingwithcopywarning/
(SettingWithCopyWarning explanation)

https://stackoverflow.com/questions/42750551/converting-a-string-to-a-lower-case-pandas
(down-case all text)

https://stackoverflow.com/questions/20490274/how-to-reset-index-in-a-pandas-data-frame
(reindex the dataframe)

###########################################################
Regular expressions section:

https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/40823105#40823105
(remove URL's)

https://stackoverflow.com/questions/8376691/how-to-remove-hashtag-user-link-of-a-tweet-using-regular-expression
(remove mentions)

https://www.machinelearningplus.com/python/python-regex-tutorial-examples/
(remove stuff from tweets via regex)

https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
(remove punctuation from strings)

https://www.w3schools.com/python/python_regex.asp
(basic tutorial on regular expressions)

###########################################################
Sci-kit Learn section:

https://www.dataquest.io/blog/sci-kit-learn-tutorial/
(sci-kit learn tutorial)

https://stackoverflow.com/questions/49806790/iterable-over-raw-text-documents-expected-string-object-received
(saved my life)

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
(CounteVectorizer)

https://realpython.com/python-keras-text-classification/
(text classification tutorial using python, sci-kit learn, and keras)

https://nlpforhackers.io/keras-intro/
(text classification using keras and NN's)

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
(encode labels from categorical to numerical)

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
(followed this one initially for text classification)

https://stackoverflow.com/questions/45804133/dimension-mismatch-error-in-countvectorizer-multinomialnb
(only call fit_transform() once to fit to the dataset; afterwards, use transform() only otherwise issues)

https://pypi.org/project/tweet-preprocessor/
(a simple Tweet pre-processor)

https://towardsdatascience.com/extracting-twitter-data-pre-processing-and-sentiment-analysis-using-python-3-0-7192bd8b47cf
(Twitter tweet retrieval)

https://machinelearningmastery.com/gentle-introduction-bag-words-model/
(bag of words)

"""

################################################################################################################
import string
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import nltk as nltk
from nltk.tokenize import TweetTokenizer
import re

from sklearn.pipeline import Pipeline
from sklearn import metrics

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

debug = True
################################################################################################################

# Import the dataset.
slo_dataset = \
    pd.read_csv("datasets/tbl_training_set.csv", sep=",")

# Shuffle the data randomly.
slo_dataset = slo_dataset.reindex(
    np.random.permutation(slo_dataset.index))

# Rename columns to something that makes sense.
column_names = ['Tweet', 'SLO1', 'SLO2', 'SLO3']

# Generate a Pandas dataframe.
slo_dataframe = pd.DataFrame(slo_dataset)

# Print shape and column names.
print("The shape of our SLO dataframe:")
print(slo_dataframe.shape)
print()
print("The columns of our SLO dataframe:")
print(slo_dataframe.head)
print()

# Assign column names.
slo_dataframe.columns = column_names

################################################################################################################

# Data pre-processing
# TODO - use pre-processing methods indicated in SLO article
# TODO - https://github.com/Calvin-CS/slo-classifiers/tree/feature/keras-nn/stance/data
# TODO - https://github.com/Calvin-CS/slo-classifiers/blob/feature/keras-nn/stance/data/tweet_preprocessor.py
# FIXME - preprocessor will only work on Linux/Mac

# Drop all rows with only NaN in all columns.
slo_dataframe = slo_dataframe.dropna(how='all')
# Drop all rows without at least 2 non NaN values - indicating no SLO TBL classification labels.
slo_dataframe = slo_dataframe.dropna(thresh=2)

print(slo_dataframe.shape)
print()

if debug:
    # Iterate through each row and check we dropped properly.
    print()
    print("Dataframe only with examples that have SLO TBL classification labels:")
    for index in slo_dataframe.index:
        print(slo_dataframe['Tweet'][index] + '\tSLO1: ' + str(slo_dataframe['SLO1'][index])
              + '\tSLO2: ' + str(slo_dataframe['SLO2'][index]) + '\tSLO3: ' + str(slo_dataframe['SLO3'][index]))
    print("Shape of dataframe with SLO TBL classifications: " + str(slo_dataframe.shape))

#######################################################

# Boolean indexing to select examples with only a single SLO TBL classification.
mask = slo_dataframe['SLO1'].notna() & (slo_dataframe['SLO2'].isna() & slo_dataframe['SLO3'].isna())

# Check that boolean indexing is working.
print()
print("Check that our boolean indexing mask gives only examples with a single SLO TBL classifications:")
print(mask.tail)
print("The shape of our boolean indexing mask:")
print(mask.shape)

# Create new dataframe with only those examples with a single SLO TBL classification.
slo_dataframe_single_classification = slo_dataframe[mask]

# Check that we have created the new dataframe properly.
if debug:
    # Iterate through each row and check that only examples with multiple SLO TBL classifications are left.
    print("Dataframe only with examples that have a single SLO TBL classification label:")
    for index in slo_dataframe_single_classification.index:
        print(slo_dataframe_single_classification['Tweet'][index]
              + '\tSLO1: ' + str(slo_dataframe_single_classification['SLO1'][index])
              + '\tSLO2: ' + str(slo_dataframe_single_classification['SLO2'][index])
              + '\tSLO3: ' + str(slo_dataframe_single_classification['SLO3'][index]))
    print("Shape of dataframe with a single SLO TBL classification: "
          + str(slo_dataframe_single_classification.shape))

#######################################################

# Drop SLO2 and SLO3 columns as they are just NaN values.
slo_dataframe_single_classification = slo_dataframe_single_classification.drop(columns=['SLO2', 'SLO3'])

if debug:
    print('\n')
    print("Dataframe with SLOW2 and SLO3 columns dropped as they are just NaN values:")
    # Iterate through each row and check that each example only has one SLO TBL Classification left.
    for index in slo_dataframe_single_classification.index:
        print(slo_dataframe_single_classification['Tweet'][index] + '\tSLO1: '
              + str(slo_dataframe_single_classification['SLO1'][index]))
    print("Shape of slo_dataframe_single_classification: " + str(slo_dataframe_single_classification.shape))

# Re-name columns.
column_names_single = ['Tweet', 'SLO']

slo_dataframe_single_classification.columns = column_names_single

#######################################################

# Boolean indexing to select examples with multiple SLO TBL classifications.
mask = slo_dataframe['SLO1'].notna() & (slo_dataframe['SLO2'].notna() | slo_dataframe['SLO3'].notna())

# Check that boolean indexing is working.
print()
print("Check that our boolean indexing mask gives only examples with multiple SLO TBL classifications:")
print(mask.tail)
print("The shape of our boolean indexing mask:")
print(mask.shape)

# Create new dataframe with only those examples with multiple SLO TBL classifications.
slo_dataframe_multiple_classifications = slo_dataframe[mask]

# Check that we have created the new dataframe properly.
if debug:
    # Iterate through each row and check that only examples with multiple SLO TBL classifications are left.
    print("Dataframe only with examples that have multiple SLO TBL classification labels:")
    for index in slo_dataframe_multiple_classifications.index:
        print(slo_dataframe_multiple_classifications['Tweet'][index]
              + '\tSLO1: ' + str(slo_dataframe_multiple_classifications['SLO1'][index])
              + '\tSLO2: ' + str(slo_dataframe_multiple_classifications['SLO2'][index])
              + '\tSLO3: ' + str(slo_dataframe_multiple_classifications['SLO3'][index]))
    print("Shape of dataframe with multiple SLO TBL classifications: "
          + str(slo_dataframe_multiple_classifications.shape))

#######################################################

# Duplicate examples with multiple SLO TBL classifications into examples with only 1 SLO TBL classification each.
slo1_dataframe = slo_dataframe_multiple_classifications.drop(columns=['SLO2', 'SLO3'])
slo2_dataframe = slo_dataframe_multiple_classifications.drop(columns=['SLO1', 'SLO3'])
slo3_dataframe = slo_dataframe_multiple_classifications.drop(columns=['SLO1', 'SLO2'])

if debug:
    print('\n')
    print("Separated dataframes single label for examples with multiple SLO TBL classification labels:")
    # Iterate through each row and check that each example only has one SLO TBL Classification left.
    for index in slo1_dataframe.index:
        print(slo1_dataframe['Tweet'][index] + '\tSLO1: ' + str(slo1_dataframe['SLO1'][index]))
    for index in slo2_dataframe.index:
        print(slo2_dataframe['Tweet'][index] + '\tSLO2: ' + str(slo2_dataframe['SLO2'][index]))
    for index in slo3_dataframe.index:
        print(slo3_dataframe['Tweet'][index] + '\tSLO3: ' + str(slo3_dataframe['SLO3'][index]))
    print("Shape of slo1_dataframe: " + str(slo1_dataframe.shape))
    print("Shape of slo2_dataframe: " + str(slo2_dataframe.shape))
    print("Shape of slo3_dataframe: " + str(slo3_dataframe.shape))

# Re-name columns.
column_names_single = ['Tweet', 'SLO']

slo1_dataframe.columns = column_names_single
slo2_dataframe.columns = column_names_single
slo3_dataframe.columns = column_names_single

#######################################################

# Concatenate the individual dataframes back together.
frames = [slo1_dataframe, slo2_dataframe, slo3_dataframe, slo_dataframe_single_classification]
slo_dataframe_combined = pd.concat(frames, ignore_index=True)

# Note: Doing this as context-sensitive menu stopped displaying all useable function calls after concat.
slo_dataframe_combined = pd.DataFrame(slo_dataframe_combined)

if debug:
    print('\n')
    print("Recombined individual dataframes for the dataframe representing Tweets with only a single SLO TBL "
          "classification example\n and for the dataframes representing Tweets with multiple SLO TBL classification "
          "labels:")
    # Iterate through each row and check that each example only has one SLO TBL Classification left.
    for index in slo_dataframe_combined.index:
        print(slo_dataframe_combined['Tweet'][index] + '\tSLO: ' + str(slo_dataframe_combined['SLO'][index]))
    print('Shape of recombined dataframes: ' + str(slo_dataframe_combined.shape))

#######################################################

# Drop all rows with only NaN in all columns.
slo_dataframe_combined = slo_dataframe_combined.dropna()

if debug:
    print('\n')
    print("Recombined dataframes - NaN examples removed:")
    # Iterate through each row and check that we no longer have examples with NaN values.
    for index in slo_dataframe_combined.index:
        print(slo_dataframe_combined['Tweet'][index] + '\tSLO: ' + str(slo_dataframe_combined['SLO'][index]))
    print('Shape of recombined dataframes without NaN examples: ' + str(slo_dataframe_combined.shape))

#######################################################

# Drop duplicate examples with the same SLO TBL classification values.
slo_dataframe_TBL_duplicates_dropped = slo_dataframe_combined.drop_duplicates(subset=['Tweet', 'SLO'], keep=False)

if debug:
    print('\n')
    print("Same examples with duplicate SLO TBL classifications removed:")
    # Iterate through each row and check that we no longer have examples with NaN values.
    for index in slo_dataframe_TBL_duplicates_dropped.index:
        print(slo_dataframe_TBL_duplicates_dropped['Tweet'][index] + '\tSLO: '
              + str(slo_dataframe_TBL_duplicates_dropped['SLO'][index]))
    print('Shape of dataframes without duplicate TBL values: ' + str(slo_dataframe_TBL_duplicates_dropped.shape))


#######################################################

def preprocess_tweets(tweet_text):
    """
    Function performs NLTK text pre-processing.

    Notes:

    Stop words are retained.

    TODO - shrink character elongations
    TODO - remove non-english tweets
    TODO - remove non-company associated tweets
    TODO - remove year and time.
    TODO - remove cash items?

    :return:
    """

    # Remove "RT" tags.
    preprocessed_tweet_text = re.sub("rt", "", tweet_text)

    # Remove URL's.
    preprocessed_tweet_text = re.sub("http[s]?://\S+", "slo_url", preprocessed_tweet_text)

    # Remove Tweet mentions.
    preprocessed_tweet_text = re.sub("@\S+", "slo_mention", preprocessed_tweet_text)

    # Remove Tweet hashtags.
    preprocessed_tweet_text = re.sub("#\S+", "slo_hashtag", preprocessed_tweet_text)

    # Remove all punctuation.
    preprocessed_tweet_text = preprocessed_tweet_text.translate(str.maketrans('', '', string.punctuation))

    return preprocessed_tweet_text


# Assign new dataframe to contents of old.
slo_df_tokenized = slo_dataframe_TBL_duplicates_dropped

# Down-case all text.
slo_df_tokenized['Tweet'] = slo_df_tokenized['Tweet'].str.lower()

# Pre-process each tweet individually.
for index in slo_df_tokenized.index:
    slo_df_tokenized['Tweet'][index] = preprocess_tweets(slo_df_tokenized['Tweet'][index])

################################################################################################################

# # Use NLTK to tokenize each Tweet.
# tweet_tokenizer = TweetTokenizer()
# slo_df_tokenized['Tweet'] = slo_dataframe_TBL_duplicates_dropped['Tweet'].apply(tweet_tokenizer.tokenize)

# Use for NLTK debugging.
# if debug:
#     print('\n')
#     print("SLO TBL dataframe tokenized:")
#     # Iterate through each row and check that we no longer have examples with NaN values.
#     for index in slo_df_tokenized.index:
#         print(slo_df_tokenized['Tweet'][index])
#     print('Shape of tokenized dataframe: ' + str(slo_df_tokenized.shape))

# for index in slo_df_tokenized.index:
#     slo_df_tokenized['Tweet'][index] = vectorizer.transform(slo_df_tokenized['Tweet'][index]).toarray()

################################################################################################################

# Reindex everything.
slo_df_tokenized.index = pd.RangeIndex(len(slo_df_tokenized.index))
# slo_df_tokenized.index = range(len(slo_df_tokenized.index))

################################################################################################################

# Create input features.
selected_features = slo_df_tokenized[column_names_single]
processed_features = selected_features.copy()

# Check what we are using for input features.
if debug:
    print()
    print("The tweets as a string:")
    print(processed_features['Tweet'])
    print()
    print("SLO TBL classification:")
    print(processed_features['SLO'])

# Create feature and target sets.
slo_feature_input = processed_features['Tweet']
slo_targets = processed_features['SLO']

# Create training and test sets.
from sklearn.model_selection import train_test_split

# Note: these are no longer Pandas format, they're Sci-kit Learn format.
tweet_train, tweet_test, target_train, target_test = train_test_split(slo_feature_input, slo_targets, test_size=0.33,
                                                                      random_state=42)

if debug:
    print("Shape of tweet training set:")
    print(tweet_train.data.shape)
    print("Shape of tweet test set:")
    print(tweet_test.data.shape)
    print("Shape of target training set:")
    print(target_train.data.shape)
    print("Shape of target test set:")
    print(target_test.data.shape)

#######################################################

# Use Sci-kit learn to encode labels into integer values - one assigned integer value per class.
from sklearn import preprocessing

target_label_encoder = preprocessing.LabelEncoder()

target_train_encoded = target_label_encoder.fit_transform(target_train)
target_test_encoded = target_label_encoder.fit_transform(target_test)
target_train_DEcoded = target_label_encoder.inverse_transform(target_train_encoded)
target_test_DEcoded = target_label_encoder.inverse_transform(target_test_encoded)

if debug:
    print("Encoded target training labels:")
    print(target_train_encoded)
    print("Decoded target training labels:")
    print(target_train_DEcoded)

    print("Encoded target test labels:")
    print(target_test_encoded)
    print("Decoded target test labels:")
    print(target_test_DEcoded)

#######################################################

# Use Sci-kit learn to tokenize each Tweet.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=0, lowercase=False)
tweet_train_encoded = vectorizer.fit_transform(tweet_train)
tweet_test_encoded = vectorizer.transform(tweet_test)

if debug:
    print("Vectorized tweet training set:")
    print(tweet_train_encoded)
    print("Vectorized tweet testing set:")
    print(tweet_test_encoded)
    print("Shape of the tweet training set:")
    print(tweet_train_encoded.shape)
    print("Shape of the tweet testing set:")
    print(tweet_test_encoded.shape)

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()

tweet_train_encoded_tfidf = tfidf_transformer.fit_transform(tweet_train_encoded)
tweet_test_encoded_tfidf = tfidf_transformer.transform(tweet_test_encoded)

if debug:
    print("vectorized tweet training set term frequencies down-sampled:")
    print(tweet_train_encoded_tfidf)
    print("Shape of the tweet training set term frequencies: ")
    print(tweet_train_encoded_tfidf.shape)
    print("vectorized tweet test set term frequencies down-sampled:")
    print(tweet_test_encoded_tfidf)
    print("Shape of the tweet test set term frequencies: ")
    print(tweet_test_encoded_tfidf.shape)

################################################################################################################
"""
Train the model using a variety of different classifiers.
"""

from sklearn.naive_bayes import MultinomialNB

clf_multinomialNB = MultinomialNB().fit(tweet_train_encoded_tfidf, target_train_encoded)

# from sklearn.svm import LinearSVC
# from sklearn.metrics import accuracy_score
#
# # create an object of type LinearSVC
# svc_model = LinearSVC(random_state=0)
#
# # train the algorithm on training data and predict using the testing data
# pred = svc_model.fit(tweet_train, target_train).predict(tweet_test)
#
# # print the accuracy score of the model
# print("LinearSVC accuracy : ", accuracy_score(target_test, pred, normalize=True))

################################################################################################################
"""
Make predictions using pre-processed and tokenized Tweets from CMU Tweet Tagger.
Note: This required .csv import and vectorization.

Probably won't be the best generalization to new data as the vocabulary between these two different datasets could
be drastically different.

"""
# Import the dataset.
slo_dataset_cmu = \
    pd.read_csv("borg-SLO classifiers/dataset_20100101-20180510_tok.csv", sep=",")

# Shuffle the data randomly.
slo_dataset_cmu = slo_dataset_cmu.reindex(
    np.random.permutation(slo_dataset_cmu.index))

# Generate a Pandas dataframe.
slo_dataframe_cmu = pd.DataFrame(slo_dataset_cmu)

# Print shape and column names.
print()
print("The shape of our SLO CMU dataframe:")
print(slo_dataframe_cmu.shape)
print()
print("The columns of our SLO CMU dataframe:")
print(slo_dataframe_cmu.head)
print()

# Create input features.
selected_features_cmu = slo_dataframe_cmu['tweet_t']
processed_features_cmu = selected_features_cmu.copy()

# Check what we are using for predictions.
if debug:
    print("The shape of our SLO CMU feature dataframe:")
    print(slo_dataframe_cmu.shape)
    print()
    print("The columns of our SLO CMU feature dataframe:")
    print(processed_features_cmu.head)
    print()

#######################################################

# Vectorize the categorical data for use in predictions.
tweet_predict_encoded = vectorizer.transform(processed_features_cmu)

if debug:
    print("Vectorized tweet predictions set:")
    print(tweet_predict_encoded)
    print("Shape of the tweet predictions set:")
    print(tweet_predict_encoded.shape)
    print()

tweet_predict_encoded_tfidf = tfidf_transformer.transform(tweet_predict_encoded)

if debug:
    print("vectorized tweet predictions set term frequencies down-sampled:")
    print(tweet_predict_encoded_tfidf)
    print("Shape of the tweet predictions set term frequencies: ")
    print(tweet_predict_encoded_tfidf.shape)
    print()

# Generalize to new data and predict.
tweet_generalize_new_data_predictions = clf_multinomialNB.predict(tweet_predict_encoded_tfidf)

# View the results.
# Note: There are 500k+ Tweets in this dataset, don't print out unless you want a very long output list.
# for doc, category in zip(processed_features_cmu, tweet_generalize_new_data_predictions):
#     print('%r => %s' % (doc, category))

################################################################################################################

# Predict using test dataset.
tweet_test_predictions = clf_multinomialNB.predict(tweet_test_encoded_tfidf)

# View the results.
# for doc, category in zip(tweet_test, tweet_test_predictions):
#     print('%r => %s' % (doc, category))

# Measure accuracy.
print()
print("Accuracy for test set predictions using multinomialNB:")
print(str(np.mean(tweet_test_predictions == target_test_encoded)))

################################################################################################################
"""
multinomialNB Pipeline.
"""
multinomialNB_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('multinomialNB', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)),
])

multinomialNB_clf.fit(tweet_train, target_train)
multinomialNB_predictions = multinomialNB_clf.predict(tweet_test)

# Measure accuracy.
print()
print("Accuracy for test set predictions using multinomialNB:")
print(str(np.mean(multinomialNB_predictions == target_test)))
print()

print("multinomialNB Metrics")
print(metrics.classification_report(target_test, multinomialNB_predictions,
                                    target_names=['economic', 'environmental', 'social']))

print("multinomialNB confusion matrix:")
print(metrics.confusion_matrix(target_test, multinomialNB_predictions))

################################################################################################################
"""
SGD Classifier Pipeline.
"""
from sklearn.linear_model import SGDClassifier

SGDClassifier_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(alpha=0.0001, average=False, class_weight=None,
                          early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
                          l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=5,
                          n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
                          power_t=0.5, random_state=None, shuffle=True, tol=None,
                          validation_fraction=0.1, verbose=0, warm_start=False)),
])

SGDClassifier_clf.fit(tweet_train, target_train)
SGDClassifier_predictions = SGDClassifier_clf.predict(tweet_test)

# Measure accuracy.
print()
print("Accuracy for test set predictions using SGDClassifier:")
print(str(np.mean(SGDClassifier_predictions == target_test)))
print()

print("SGD Classifier Metrics")
print(metrics.classification_report(target_test, SGDClassifier_predictions,
                                    target_names=['economic', 'environmental', 'social']))

print("SGD Classifier confusion matrix:")
print(metrics.confusion_matrix(target_test, SGDClassifier_predictions))

################################################################################################################
"""
SVM SVC Classifiers Pipeline.
"""
from sklearn import svm

SVC_classifier_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
                    max_iter=-1, probability=False, random_state=None, shrinking=True,
                    tol=0.001, verbose=False)),
])

SVC_classifier_clf.fit(tweet_train, target_train)
SVC_classifier_predictions = SVC_classifier_clf.predict(tweet_test)

# Measure accuracy.
print()
print("Accuracy for test set predictions using SVC_classifier:")
print(str(np.mean(SVC_classifier_predictions == target_test)))
print()

print("SVC_classifier Metrics")
print(metrics.classification_report(target_test, SVC_classifier_predictions,
                                    target_names=['economic', 'environmental', 'social']))

print("SVC_classifier confusion matrix:")
print(metrics.confusion_matrix(target_test, SVC_classifier_predictions))

################################################################################################################
"""
SVM LinearSVC Pipeline.
"""

LinearSVC_classifier_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', svm.LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
                          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
                          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
                          verbose=0)),
])

LinearSVC_classifier_clf.fit(tweet_train, target_train)
LinearSVC_classifier_predictions = LinearSVC_classifier_clf.predict(tweet_test)

# Measure accuracy.
print()
print("Accuracy for test set predictions using LinearSVC_classifier:")
print(str(np.mean(LinearSVC_classifier_predictions == target_test)))
print()

print("LinearSVC_classifier Metrics")
print(metrics.classification_report(target_test, LinearSVC_classifier_predictions,
                                    target_names=['economic', 'environmental', 'social']))

print("LinearSVC_classifier confusion matrix:")
print(metrics.confusion_matrix(target_test, LinearSVC_classifier_predictions))

################################################################################################################
"""
K Neighbors Classifier Pipeline.
"""
from sklearn.neighbors import KNeighborsClassifier

KNeighbor_classifier_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', KNeighborsClassifier(n_neighbors=3)),
])

KNeighbor_classifier_clf.fit(tweet_train, target_train)
KNeighbor_classifier_predictions = KNeighbor_classifier_clf.predict(tweet_test)

# Measure accuracy.
print()
print("Accuracy for test set predictions using RadiusNeighbor_classifier:")
print(str(np.mean(KNeighbor_classifier_predictions == target_test)))
print()

print("RadiusNeighbor_classifier Metrics")
print(metrics.classification_report(target_test, KNeighbor_classifier_predictions,
                                    target_names=['economic', 'environmental', 'social']))

print("RadiusNeighbor_classifier confusion matrix:")
print(metrics.confusion_matrix(target_test, KNeighbor_classifier_predictions))

################################################################################################################
"""
Decision Tree Classifier.
"""
from sklearn import tree

DecisionTree_classifier_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', tree.DecisionTreeClassifier(random_state=0)),
])

DecisionTree_classifier_clf.fit(tweet_train, target_train)
DecisionTree_classifier_predictions = DecisionTree_classifier_clf.predict(tweet_test)

# Measure accuracy.
print()
print("Accuracy for test set predictions using DecisionTree_classifier:")
print(str(np.mean(DecisionTree_classifier_predictions == target_test)))
print()

print("DecisionTree_classifier Metrics")
print(metrics.classification_report(target_test, DecisionTree_classifier_predictions,
                                    target_names=['economic', 'environmental', 'social']))

print("DecisionTree_classifier confusion matrix:")
print(metrics.confusion_matrix(target_test, DecisionTree_classifier_predictions))

################################################################################################################
"""
Multi-layer Perceptron Classifier Pipeline.
"""
from sklearn.neural_network import MLPClassifier

MLP_classifier_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MLPClassifier(activation='relu', alpha=1e-5, batch_size='auto',
                          beta_1=0.9, beta_2=0.999, early_stopping=True,
                          epsilon=1e-08, hidden_layer_sizes=(15,),
                          learning_rate='constant', learning_rate_init=0.001,
                          max_iter=1000, momentum=0.9, n_iter_no_change=10,
                          nesterovs_momentum=True, power_t=0.5, random_state=1,
                          shuffle=True, solver='lbfgs', tol=0.0001,
                          validation_fraction=0.1, verbose=False, warm_start=False)),
])

# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
# scaler.fit(tweet_train)
# tweet_train_scaled = scaler.transform(tweet_train)
# tweet_test_scaled = scaler.transform(tweet_test)

MLP_classifier_clf.fit(tweet_train, target_train)
MLP_classifier_predictions = MLP_classifier_clf.predict(tweet_test)

# Measure accuracy.
print()
print("Accuracy for test set predictions using MLP_classifier:")
print(str(np.mean(MLP_classifier_predictions == target_test)))
print()

print("MLP_classifier Metrics")
print(metrics.classification_report(target_test, MLP_classifier_predictions,
                                    target_names=['economic', 'environmental', 'social']))

print("MLP_classifier confusion matrix:")
print(metrics.confusion_matrix(target_test, MLP_classifier_predictions))

################################################################################################################
"""
Logistic Regression Classifier Pipeline.
"""
from sklearn.linear_model import LogisticRegression

LogisticRegressionCV_classifier_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(random_state=0, solver='lbfgs',
                               multi_class='multinomial')),
])

LogisticRegressionCV_classifier_clf.fit(tweet_train, target_train)
LogisticRegressionCV_classifier_predictions = LogisticRegressionCV_classifier_clf.predict(tweet_test)

# Measure accuracy.
print()
print("Accuracy for test set predictions using LogisticRegressionCV_classifier:")
print(str(np.mean(LogisticRegressionCV_classifier_predictions == target_test)))
print()

print("LogisticRegressionCV_classifier Metrics")
print(metrics.classification_report(target_test, LogisticRegressionCV_classifier_predictions,
                                    target_names=['economic', 'environmental', 'social']))

print("LogisticRegressionCV_classifier confusion matrix:")
print(metrics.confusion_matrix(target_test, LogisticRegressionCV_classifier_predictions))

################################################################################################################
"""
Keras Neural Network.
"""
from keras.models import Sequential
from keras import layers

################################################################################################################
"""
Parameter tuning using Grid Search.
"""
# from sklearn.model_selection import GridSearchCV
#
# # What parameters do we search for?
# parameters = {
#     'vect__ngram_range': [(1, 1), (1, 2)],
#     'tfidf__use_idf': (True, False),
#     'clf__alpha': (1e-2, 1e-3),
# }
#
# # Perform the grid search using all cores.
# gs_clf = GridSearchCV(SGDClassifier_clf, parameters, cv=5, iid=False, n_jobs=-1)
#
# gs_clf_fit = gs_clf.fit(tweet_train, target_train)
# gs_clf_predict = gs_clf_fit.predict(tweet_test)

############################################################################################

"""
Main function.  Execute the program.
"""
# Debug variable.
debug_main = 0

if __name__ == '__main__':
    print()

############################################################################################


The shape of our SLO dataframe:
(299, 4)

The columns of our SLO dataframe:
<bound method NDFrame.head of     RT @Qldaah: Ciobo, questioner just told you mining declining in Central Qld &amp; first thing you do is talk about Adani Carmichael coal mine.…  \
7    RT @CatchNews: Adani accused former environmen...                                                                                                
23   Brisbane trends now: Pauline Hanson, Muslims, ...                                                                                                
164  Most Active Exchange Traded Options: CALLS WOW...                                                                                                
182  RT @BTS_tumblr: In case you missed it, check o...                                                                                                
123  Tues 14 Jun Phantom Dancer w Greg Poppleton pl...                                                                                     

RT @KrankyKerry: @abcnews @TurnbullMalcolm  #Adanis , Ginas dirty massive #Coal Mines	SLO: social
@Hamigakiwanwan If Australia was run like a business we wouldn't do Adani as too financially risky. But we're run for business. #auspol16	SLO: social
RT @arianewilkinson: Corporate background check ignores foreign environmental offences.  Fails to assess real risks inc re #Adani https://t…	SLO: environmental
RT @StopShenhua: "#Shenhua is well aware its project doesn't stack up financially" #ausvotes #newengland #coal #liverpoolplains https://t.c…	SLO: economic
RT @ConversationEDU: From the archive:  Shenhua mine: the federal government could have chosen farming over coal. https://t.co/jSwQHiiObG	SLO: social
Steve Ciobo, wha good could the Carmichael mine possibly do to Qld'ers? Coal is not innovative. #auspol #ausvotes https://t.co/aV6fPeOHn6	SLO: environmental
Government rules out public funds for Adani coal project, activists claim | Business | The Guardian https://t.co/e9kzSxMqV8	SLO: e

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  exec(code_obj, self.user_global_ns, self.user_ns)



The tweets as a string:
0       slomention adani accused former environment m...
1       slomention latest worrying news from the grea...
2      the slohashtag crops in both winter amp summer...
3       slomention war on slohashtag santos to drill ...
4      oh  ffs greg hunt no definite link between coa...
                             ...                        
240    we live in hope slohashtag may abandon austral...
241     slomention slomention adve in todays adelaide...
242    bhp consoium amci vie for anglo american coalm...
243    let 100 flowers bloom said slomention but only...
244     slomention no place for coal mines or a gas f...
Name: Tweet, Length: 245, dtype: object

SLO TBL classification:
0             social
1      environmental
2           economic
3           economic
4      environmental
           ...      
240           social
241    environmental
242         economic
243    environmental
244    environmental
Name: SLO, Length: 245, dtype: object
Shape of tweet


The shape of our SLO CMU dataframe:
(658982, 11)

The columns of our SLO CMU dataframe:
<bound method NDFrame.head of                         id lang language_textblob  retweeted  \
430510  370323489032376320   en                en      False   
400504  972286005553250304   en                en       True   
295401  925088895070453761   en                en       True   
42190   805619444558901248   en                en       True   
384726  968006580481376256   en                en      False   
...                    ...  ...               ...        ...   
141930  865520514239913985   en                en       True   
438442  501622937116352513   en                en      False   
585940  805712775443947521   en                en       True   
163935  869722889087361024   en                en      False   
355113  949934483859320832   en                en       True   

                hashtags company  \
430510            ausbiz     bhp   
400504               NaN   adani   
2954


Accuracy for test set predictions using MLP_classifier:
0.3333333333333333

MLP_classifier Metrics
               precision    recall  f1-score   support

     economic       0.35      0.39      0.37        18
environmental       0.12      0.09      0.10        22
       social       0.41      0.44      0.42        41

    micro avg       0.33      0.33      0.33        81
    macro avg       0.29      0.31      0.30        81
 weighted avg       0.32      0.33      0.32        81

MLP_classifier confusion matrix:
[[ 7  2  9]
 [ 3  2 17]
 [10 13 18]]

Accuracy for test set predictions using LogisticRegressionCV_classifier:
0.4444444444444444

LogisticRegressionCV_classifier Metrics
               precision    recall  f1-score   support

     economic       0.47      0.39      0.42        18
environmental       0.00      0.00      0.00        22
       social       0.49      0.71      0.58        41

    micro avg       0.44      0.44      0.44        81
    macro avg       0.32      0

Using TensorFlow backend.


# Final Project Proposal - Draft 2 - Preliminary Draft of Final Paper:

Social License to Operator Triple-Bottom-Line Tweet Classification</p>


The application domain is the Triple-Bottom-Line (TBL) classification of Tweet in the context of the Social License to Operate (SLO) of mining companies.  The objective of this project is to continue and extend the earlier work on Tweet TBL classification done at CSIRO – the Commonwealth Scientific and Industrial Research Organization (Australia’s National Science Agency).  The goal is to set up a prototype machine learning system that is capable of identifying the topic classification of a Tweet as either Environmental, Social, or Economic.  The initial milestone is to achieve at an absolute minimum a 50% accuracy metric or higher, indicating the ability to at least guess on par with a flip of a coin.</p>
	
    
The Social License to Operate is defined as when an existing project has the ongoing approval of the local community and other stakeholders within the domain the project operates in.  It is the ongoing social acceptance of that project in regards to a favorable or dis-favorable disposition by those who are concerned.  The SLO must not only be earned but also maintained as the beliefs, opinions, and perceptions of people tend to be dynamic over the course of time.  It is beneficial to the project owners and managers to maintain an agreeable relationship with the local population and their stakeholders.</p>
	
    
The Triple Bottom Line is defined as a framework where organizations and companies dedicate themselves not only to profit but also the social and environmental impact of their operation.  The phrase was coined by the British management consultant John Elkington as a metric to measure the performance of corporate America.  According to Investopedia, the corporate business should be done according to:</p>


Profit – the traditional measure of corporate profit – the profit and loss (P & L) account.</p>

People – the measure of how socially responsible an organization has been throughout its operations.</p>

Planet – the measure of how environmentally responsible a firm has been.</p>

These are the three elements of TBL which are then sourced into the terms Economy (profit), Environmental (planet), and Social (people).</p>
	
    
Twitter data (Tweets) can be obtained in 4 distinct ways – retrieval from the Twitter public API, use of an existing Twitter dataset, purchase from Twitter directly, or purchase access from a 3rd party Twitter service provider.  For the purposes of this project, we will be using existing Twitter datasets provided by Professor VanderLinden via access to Calvin College’s Borg supercomputer.  Specifically, we will be using a training set consisting of crowdsourced Triple Bottom Line labeled Tweets used by CSIRO in their preliminary topic classification research.  For the test set, we will be using a small dataset consisting of TBL labeled Tweets hand-labeled by Professor VanderLinden.  With the machine learning model trained on these two sets, we will then generalize the model to make predictions on the dataset used for stance classification of Tweets in earlier research by Professor VanderLinden and Roy Adams.</p>
	
    
As our research is a continuation of prior research from CSIRO and based on the foundation laid by Professor VanderLinden’s “Machine Learning for Social Media” project, we see no reason to not use machine learning.  While we might consider symbolic artificial intelligence (GOFAI – Good, Old-Fashioned AI), we learned in CS-344 that symbolic reasoning implementations resulted in rules engines, also known as expert systems or knowledge graphs.  These proved to be too brittle and became unmanageable as the knowledge base grew beyond a few thousand rules.  Considering the nature of Tweets, GOFAI seems not to be a viable solution.  The language of Tweets is often informal, prone to slang, misuse of established grammatical rules, and in general a chimeric bastardization of known human languages (insofar in my experience).  It is doubtful a purely symbolic AI would be computationally feasible.  Perhaps as Professor VanderLinden mentioned, a hybrid A.I. combining symbolic reasoning and deep neural networks is the future of A.I. and would prove to be a feasible approach.</p>


Preliminary analysis of the two provided datasets indicates that they will require significant pre-processing before becoming useable as input features for machine learning.  The Tweets are stored as comma delimited CSV files.  The training dataset consists of 299 total Tweets, of which 198 are unlabeled due to not being associated with any TBL classification.  The test dataset consists of 31 hand-labeled Tweets.  Based on the size of the datasets we are working with neural networks may not be the best choice to start with.  Neural networks typically require larger datasets in order to train and as we barely have 330 total examples to work with, the results may be less than optimal.  Therefore, we will start with Bayesian models and SVM’s – Support Vector Machines.  Later, we will expand to using supervised neural networks just to see if we can tune hyperparameters to obtain results closely comparable to our non-NN models.</p>
	 
     
For fast prototyping, we will be using Scikit-Learn in Python rather than Keras or straight Tensorflow, at least until we have established which baseline supervised learning algorithm will provide us with the potential for the best results.  We will also use Pandas, built on NumPy, for data-frame manipulation and matplotlib for visualizations.  To encode our categorical Tweet data into useable numerical Tweet data, we will be using the tools provided by Scikit-Learn.</p>
	
    
Our Bayesian model will be the MultinomialNB classifier that implements the naïve Bayes algorithm for multinomially distributed data.  Scikit-Learn.org indicates that it is one of the two classic Naïve Bayes variants used in textual classification problems.  This indicates it will be an excellent starting point as we have decided our two datasets are too small to initially warrant the use of a supervised neural network training algorithm.  “Naïve” in this case indicates the application of Bayes’ theorem with the “naïve” assumption of conditional independence between every pair of features given the value of the class variable (4).  Further information indicates the classifier performs fast and works in many real-world applications, including document classification and spam filtering.  We built a spam filter based on Paul Graham’s “A Plan for Spam” and indeed it worked well.</p> 
	
    
Our SVM classification model will be the LinearSVC Classifier– Linear Support Vector Classification.  Sci-Kit Learn indicates it is effective in high dimensional spaces and when the number of dimensions is greater than the number of samples.  This will be the case for us as we have a limited 330 samples and after multi-hot encoding to form a feature vector to create a bag-of-words vocabulary, our dimensionality is bound to be pretty high in comparison to the samples.  The memory efficiency of this algorithm should also help as we will no doubt have sparse vectors in comparison to the total vocabulary present across all of the Tweets.  Of note, is that SVM algorithms are not scaling invariant, so data scaling is required, which will matter in our case as encoding our categorical word data will result in word occurrence values for the input feature vector (unless we choose to simply represent as binary: 0 – word not present and 1- word is present). API documentation indicates that the classifier supports sparse input (good for us) and supports multi-class using the one-vs-the-rest scheme.</p>
	
    
Our deep neural network will be the MLP Classifier – multi-layer perceptron.  Scikit-Learn indicates it uses a Softmax layer as the output function to perform multi-class classification and uses the cross-entropy loss function.  MLP also supports multi-label classification through use of the logistic activation function where values > 0.5  1 and values < 0.5  0.  Given this, it would be possible for us to perform multi-class multi-label TBL classification on our training dataset.  Our training dataset does possess Tweets that have been given multiple topic classifications, although some are redundant duplicates of either economic, social, or environmental.  We will leave this possibility for the future, time permitting.  Effective use of the MLP classifier would most likely require us to hand-label additional training example from the larger Twitter datasets present on Calvin’s Borg supercomputer.  Crowdsourcing does not seem a viable option so this task would be tediously time-consuming.</p>
	
    
The application of machine learning to Social License to Operate on Triple-Bottom-Line topic classification can potentially assist any organization or company in evaluating their current level of acceptability by the local population and relevant stakeholders.  Specifically, it could help evaluate whether people are more concerned about the economic, social, or environmental aspects of the project.  In conjunction with stance and sentiment SLO machine learning models, it should be plausible that the level of acceptability of a project can be accurately judged.</p>
	
    
With social media so prevalent in this day and age, it is a simple matter to obtain fresh new datasets on a daily basis to gauge the SLO.  As such, the synchronicity between the dynamism of maintaining the SLO and new Tweets pertaining to the associated project works well.  Rather than conduct old fashioned mail surveys, which is time-consuming and potentially expensive, the entire procedure can be automated.  Extract Twitter data using the Twitter API, pre-process the dataset, post-process the dataset, insert into the machine learning model(s) as input feature vectors, and predict the level of approval.  Given a good model, any organization, corporation, or other entity, can perform a pseudo-real-time estimate on how accepted their current operations and activities are.</p>
	
    
The initial investment would be in adjusting hyperparameters with the validation set to achieve the optimal results while avoiding overfitting and ensuring the model generalizes well to new data.  Once this is achieved, the model should be relevant and usable as an SLO predictor for a given period of time for a particular project and organization.  Of course, even with a good model perhaps the best way to judge SLO would still be to do a face-to-face interview with the individuals in the community and stakeholders and simply ask how they feel about the project.  Then again, the anonymity of the Internet does provide an outlet for people to vent and voice their opinions with less fear of reprisal than in reality.  So perhaps anonymous Tweeters are more honest.  But, anonymity could also cause people to simply say whatever they desire with little regard to how their words actually correlate to their own personal beliefs and opinions on the matter.  Either way, an SLO TBL machine learned prediction model won’t be the be all and end all in estimating Social License to Operate.  But, it can be a useful cog in the whole machine in order to generate the necessary analysis required to measure the components of SLO.</p>
 
 
Works Referenced:


1)	Anonymous ACL submission. “Classifying Stance Using Profile Texts”.</p>

2)	“1. Supervised Learning¶.” Scikit, scikit-learn.org/stable/supervised_learning.html#supervised-learning.</p>

3)	“A Gentle Introduction to the Bag-of-Words Model.” Machine Learning Mastery, 12 Mar. 2019, machinelearningmastery.com/gentle-introduction-bag-words-model/.</p>

4)	“Introduction to Machine Learning  |  Machine Learning Crash Course  |  Google Developers.” Google, Google, developers.google.com/machine-learning/crash-course/ml-intro.</p>

5)	Kenton, Will. “How Can There Be Three Bottom Lines?” Investopedia, Investopedia, 9 Apr. 2019, www.investopedia.com/terms/t/triple-bottom-line.asp.</p>

6)	Littman, Justin. “Where to Get Twitter Data for Academic Research.” Social Feed Manager, 14 Sept. 2017, gwu-libraries.github.io/sfm-ui/posts/2017-09-14-twitter-data.</p>

7)	Mohammad, Saif, et al. “SemEval-2016 Task 6: Detecting Stance in Tweets.” Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016, doi:10.18653/v1/s16-1003.</p>

8)	“Multiclass Classification.” Wikipedia, Wikimedia Foundation, 18 Apr. 2019, en.wikipedia.org/wiki/Multiclass_classification.</p>

9)	“Symbolic Reasoning (Symbolic AI) and Machine Learning.” Skymind, skymind.ai/wiki/symbolic-reasoning.
10)	Walker, Leslie. “Learn Tweeting Slang: A Twitter Dictionary.” Lifewire, Lifewire, 8 Nov. 2017, www.lifewire.com/twitter-slang-and-key-terms-explained-2655399.</p>

11)	“What Is the Social License?” The Social License To Operate, socialicense.com/definition.html.</p>

12)	“Working With Text Data¶.” Scikit, scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.</p>