# CS-344 Artificial Intelligence - Final Project
## Project: SLO TBL Topic Classification
## Author: Joseph Jinn
## Date: 5-3-19

Note: I am using HTML for some of the formatting.  I am not sure how compatible that is with PyCharm's Jupyter plugin.  I am viewing the notebook in the Chrome browser through a local Jupyter server.

### Social License to Operate: Triple-Bottom-Line Topic Classification Introduction

<br>
<span style="font-family:Arial; font-size:1.2em;">
    
<br><br>

The application domain is the Triple-Bottom-Line (TBL) classification of Tweets in the context of the Social License to Operate (SLO) of mining companies.  The objective of this project is to continue and extend the earlier internal work on Tweet TBL topic classification done at CSIRO, the Commonwealth Scientific and Industrial Research Organisation (Australia’s National Science Agency).  The goal is to train a machine learning model capable of identifying the topic of a Tweet as Environmental, Social, or Economic.  The initial milestone is an accuracy of at least 50%, which would indicate performance clearly better than chance on a 3-way multi-class single-label classification task. </p>

<br><br>
	
The Social License to Operate exists when a project has the ongoing approval of the local community and other stakeholders within the domain the project operates in.  It is the ongoing social acceptance of that project, reflected in the favorable or unfavorable disposition of those concerned with it.  The SLO must not only be earned but also maintained, as the beliefs, opinions, and perceptions of people tend to change over time.  It is beneficial to the project owners and managers to maintain an agreeable relationship with the local population and their stakeholders. </p>

<br><br>

The Triple Bottom Line is defined as a framework in which organizations and companies dedicate themselves not only to profit but also to the social and environmental impact of their operations.  The phrase was coined by the British management consultant John Elkington as a way to measure the performance of corporate America.  According to Investopedia, a business should be judged on three bottom lines: </p>

<br><br>

Profit – the traditional measure of corporate profit – the profit and loss (P & L) account.
<br>
People – the measure of how socially responsible an organization has been throughout its operations.
<br>
Planet – the measure of how environmentally responsible a firm has been.
<br>
These are the three elements of TBL, which map onto the topic labels Economic (profit), Social (people), and Environmental (planet).

<br><br>

Twitter data (Tweets) can be obtained in 4 distinct ways – retrieval from the Twitter public API, use of an existing Twitter dataset, purchase from Twitter directly, or access purchased from a 3rd party Twitter service provider.  For the purposes of this project, we will be using existing Twitter datasets provided by Professor VanderLinden via access to Calvin College’s Borg supercomputer.  Specifically, we will be using a training set consisting of crowdsourced Triple Bottom Line labeled Tweets used by CSIRO in their preliminary topic classification research.  We will also be using a small dataset consisting of TBL labeled Tweets hand-labeled by Professor VanderLinden.  With the machine learning models trained on these two sets, we will then make predictions on the dataset used for stance classification of Tweets in earlier research by Professor VanderLinden and Roy Adams. </p>

<br><br>

As our research is a continuation of prior research from CSIRO and based on the foundation laid by Professor VanderLinden’s “Machine Learning for Social Media” project, we see no reason to not use machine learning.  While we might consider symbolic artificial intelligence (GOFAI – Good, Old-Fashioned AI), we learned in CS-344 that symbolic reasoning implementations resulted in rules engines, also known as expert systems or knowledge graphs.  These proved to be too brittle and became unmanageable as the knowledge base grew beyond a few thousand rules.  Considering the nature of Tweets, the knowledge base would incorporate far too many rules to be manageable.  The language of Tweets has its own nuances, acronyms, and other peculiarities.  It is doubtful a purely symbolic AI would be computationally feasible.  Perhaps as Professor VanderLinden mentioned, a hybrid A.I. combining symbolic reasoning and deep neural networks is the future of A.I. and would prove to be a feasible approach. </p>

<br><br>

Preliminary analysis of the two provided datasets indicates that they will require significant pre-processing before they are usable as input features for machine learning.  The Tweets are stored as CSV files.  The first dataset consists of 299 total Tweets, of which 198 are unlabeled due to not being associated with any TBL classification.  The second, smaller dataset consists of 31 hand-labeled Tweets.  Given the size of the datasets we are working with, neural networks may not be the best choice to start with.  Neural networks typically require larger datasets in order to train, and as we have barely 330 total examples to work with, the results may be less than optimal.  Therefore, we will start with a variety of non-neural-network models.  Later, we will expand to supervised neural networks to see if we can tune hyperparameters to obtain results comparable to our non-NN models. </p>

<br><br>

For fast prototyping, we will be using Scikit-Learn rather than Keras or TensorFlow directly, at least until we have established which baseline supervised learning algorithm offers the best results.  Keras and TensorFlow are also geared primarily toward deep learning.  We will use Pandas, built on NumPy, for dataframe manipulation and matplotlib for visualizations.  To encode our categorical Tweet data as numerical features, we will use the tools provided by Scikit-Learn. </p>

<br><br>

Our first ML algorithm will be the MultinomialNB classifier that implements the naïve Bayes algorithm for multinomially distributed data.  Scikit-Learn.org indicates that it is one of the two classic Naïve Bayes variants used in text-based classification problems.  This indicates it will be an excellent starting point as we have decided our two datasets are too small to initially warrant the use of a supervised neural network training algorithm.  “Naïve” in this case indicates the application of Bayes’ theorem with the “naïve” assumption of conditional independence between every pair of features given the value of the class variable (4).  Further information indicates the classifier performs fast and works in many real-world applications, including document classification and spam filtering.  We built a spam filter based on Paul Graham’s “A Plan for Spam” and indeed it worked well. </p>
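For reference, the decision rule behind this classifier can be written out.  The notation is an assumption borrowed from the Scikit-Learn documentation rather than anything defined in this report: $y$ is a topic class, $x_i$ is the count of vocabulary term $i$ in a Tweet, $n$ is the vocabulary size, and $\alpha$ is the smoothing parameter.

$$\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{i=1}^{n} x_i \log \hat{\theta}_{yi} \right], \qquad \hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n}$$

Here $N_{yi}$ is the number of times term $i$ appears in training Tweets of class $y$ and $N_{y} = \sum_{i} N_{yi}$.  Setting $\alpha = 1$ gives Laplace smoothing, which keeps a term unseen in one class from driving that class's probability to zero.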

<br><br>

Our second ML algorithm will be the LinearSVC (Linear Support Vector Classification) classifier.  Scikit-Learn indicates that support vector machines are effective in high-dimensional spaces, including when the number of dimensions is greater than the number of samples.  This will be the case for us: we have only 330 samples, and after encoding the Tweets as bag-of-words feature vectors our dimensionality is bound to be high relative to the sample count.  The memory efficiency of this algorithm should also help, as each Tweet’s vector will be sparse relative to the total vocabulary across all of the Tweets.  Note that SVM algorithms are not scale-invariant, so data scaling is recommended.  This matters in our case because encoding the Tweets yields word occurrence counts in the input feature vector (unless we choose a binary representation: 0 – word not present, 1 – word present).  The API documentation indicates that the classifier supports sparse input (good for us) and handles multi-class problems using the one-vs-the-rest scheme. </p>

<br><br>

We also plan to utilize the MLP (Multi-Layer Perceptron) Classifier.  Scikit-Learn indicates it uses a Softmax layer as the output function to perform multi-class classification and uses the cross-entropy loss function.  MLP also supports multi-label classification through use of the logistic activation function, where output values of 0.5 or above are rounded to 1 and values below 0.5 are rounded to 0.  Given this, it would be possible for us to perform multi-class multi-label TBL classification on our training dataset.  Our training dataset does contain Tweets that have been given multiple topic classifications, although some of these are repeated instances of the same economic, social, or environmental label.  We will leave this possibility for the future, time permitting.  Effective use of the MLP classifier would most likely require us to hand-label additional training examples from the larger Twitter datasets present on Calvin’s Borg supercomputer.  Crowdsourcing does not seem a viable option, so this task would be tediously time-consuming. </p>
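As a rough illustration of the multi-label thresholding described above, the sketch below applies a logistic function to three raw output scores and keeps every label whose probability reaches 0.5.  The function names and the example scores are hypothetical; this is not the MLPClassifier implementation, only the rounding rule it documents.

```python
import math

LABELS = ("economic", "environmental", "social")


def sigmoid(z: float) -> float:
    """Logistic activation: squashes a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_predict(scores, threshold=0.5):
    """Return every label whose logistic output is at or above the threshold."""
    return [label for label, z in zip(LABELS, scores) if sigmoid(z) >= threshold]


# Hypothetical raw scores for one Tweet (one score per label).
predicted = multilabel_predict([2.0, -1.0, 0.3])
print(predicted)
```

Because each label is thresholded independently, a Tweet can receive zero, one, or several topic labels, unlike the Softmax case where exactly one class wins.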

<br><br>

We may also add additional algorithms capable of multi-class single-label classification as our work progresses to widen the range of models we are considering for further research. </p>

<br><br>

The application of machine learning to Social License to Operate on Triple-Bottom-Line topic classification can potentially assist any organization or company in evaluating their current level of acceptability by the local population and relevant stakeholders.  Specifically, it could help evaluate whether people are more concerned about the economic, social, or environmental aspects of the project.  In conjunction with stance and sentiment SLO machine learning models, it should be plausible that the level of acceptability of a project can be accurately judged. </p>

<br><br>

With social media so prevalent in this day and age, it is a simple matter to obtain fresh datasets on a daily basis to gauge the SLO.  The need to continually maintain the SLO therefore pairs well with the continual availability of new Tweets about the associated project.  Rather than conducting old-fashioned mail surveys, which are time-consuming and potentially expensive, the entire procedure can be automated: extract Twitter data using the Twitter API, pre-process the dataset, feed it into the machine learning model(s) as input feature vectors, and predict the level of approval.  Given a good model, any organization, corporation, or other entity can produce a near-real-time estimate of how accepted its current operations and activities are. </p>

<br><br>

There would be an initial time investment in adjusting hyperparameters with the validation set to achieve the best results while avoiding overfitting and ensuring the model generalizes well to new data.  Once this is achieved, the model should be relevant and usable as an SLO predictor for a given period of time for a particular project and organization.  Of course, even with a good model, perhaps the best way to judge SLO would still be to interview the individuals in the community and the stakeholders face-to-face and simply ask how they feel about the project.  Then again, the anonymity of the Internet does provide an outlet for people to vent and voice their opinions with less fear of reprisal than in person, so perhaps anonymous Tweeters are more honest.  However, anonymity could also lead people to say whatever they like with little regard to how their words correlate with their own beliefs and opinions on the matter.  Either way, an SLO TBL machine-learned prediction model won’t be the be-all and end-all in estimating Social License to Operate, but it can be a useful cog in the larger machine that generates the analysis required to measure the components of SLO. </p>

<br><br>

 
</span>

### Works Referenced:

<br><br>

1)	Anonymous ACL submission. “Classifying Stance Using Profile Texts”.
<br><br>

2)	“1. Supervised Learning.” Scikit, scikit-learn.org/stable/supervised_learning.html#supervised-learning.
<br><br>

3)	“A Gentle Introduction to the Bag-of-Words Model.” Machine Learning Mastery, 12 Mar. 2019, machinelearningmastery.com/gentle-introduction-bag-words-model/.
<br><br>

4)	“Introduction to Machine Learning  |  Machine Learning Crash Course  |  Google Developers.” Google, Google, developers.google.com/machine-learning/crash-course/ml-intro.
<br><br>

5)	Kenton, Will. “How Can There Be Three Bottom Lines?” Investopedia, Investopedia, 9 Apr. 2019, www.investopedia.com/terms/t/triple-bottom-line.asp.
<br><br>

6)	Littman, Justin. “Where to Get Twitter Data for Academic Research.” Social Feed Manager, 14 Sept. 2017, gwu-libraries.github.io/sfm-ui/posts/2017-09-14-twitter-data.
<br><br>

7)	Mohammad, Saif, et al. “SemEval-2016 Task 6: Detecting Stance in Tweets.” Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016, doi:10.18653/v1/s16-1003.
<br><br>

8)	“Multiclass Classification.” Wikipedia, Wikimedia Foundation, 18 Apr. 2019, en.wikipedia.org/wiki/Multiclass_classification.
<br><br>

9)	“Symbolic Reasoning (Symbolic AI) and Machine Learning.” Skymind, skymind.ai/wiki/symbolic-reasoning.
<br><br>

10)	Walker, Leslie. “Learn Tweeting Slang: A Twitter Dictionary.” Lifewire, Lifewire, 8 Nov. 2017, www.lifewire.com/twitter-slang-and-key-terms-explained-2655399.
<br><br>

11)	“What Is the Social License?” The Social License To Operate, socialicense.com/definition.html.
<br><br>

12)	“Working With Text Data.” Scikit, scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.
<br><br>


# Social License to Operate: Triple-Bottom-Line Topic Classification Report

## Vision Section:

<span style="font-family:Arial; font-size:1.2em;">
    
The general purpose of the project is to perform Social License to Operate Triple-Bottom-Line topic classification on Twitter data associated with various mining companies.  Social License to Operate indicates the ongoing acceptance of a company or industry’s standard business practices and operating procedures by its employees, stakeholders, and the general public (Investopedia).  Triple Bottom Line is a framework or theory that recommends that companies commit to focus on social and environmental concerns just as they do on profits (Investopedia).  We will use supervised machine learning algorithms to perform multi-class single-label classification of Tweets to predict whether their topic of discussion corresponds to social, environmental, or economic concerns.</p>

</span>

## Background Section:

<span style="font-family:Arial; font-size:1.2em;">
    
Our work is a revival and continuation of the work initially done at the Commonwealth Scientific and Industrial Research Organisation (CSIRO) by (insert name here) on TBL topic classification.  We are not directly referencing that research but are instead basing our initial data pre-processing on the anonymous ACL submission titled “Classifying Stance Using Profile Texts”.  We are, however, using the exact same labeled training dataset that was used in the prior research for TBL topic classification on SLO for mining companies.  Our work will also involve use of the datasets available on Calvin College’s Borg supercomputer and will be uploaded to the Calvin-CS / slo-classifiers GitHub repository.  This project will be a prelude to continued research on topic, stance, and sentiment analysis utilizing machine learning for Social License to Operate of mining companies in connection with Professor VanderLinden’s “Machine Learning for Social Media” research project.</p>

At the time of this report, we are rapid prototyping using Scikit-Learn machine learning classifiers.  These classifiers require minimal effort to set up initially with default hyperparameters.  They train quickly and provide results in a timely manner, allowing us to adjust our hyperparameters on the fly to see if there are any noticeable differences.  It is also quite simple to add additional Classifiers, as the Pipeline class allows literal copy/paste of a code template.  All that is required is to add a new import statement for that Classifier and to replace the name of the old Classifier and its corresponding parameters with the new one.  This design feature is one of the reasons we chose to utilize Scikit-Learn; the other is that Professor VanderLinden recommended it as the starting point.</p>  

Of note is that Scikit-Learn provides automated parameter tuning via its grid search and randomized search classes (GridSearchCV and RandomizedSearchCV).  Grid search methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.  Randomized search instead builds and evaluates a model for a fixed number of parameter combinations sampled at random from the specified values or distributions.  We plan to utilize one or both of these hyperparameter tuning methods in order to expedite the search for good hyperparameters for all of the Scikit-Learn Classifiers we are prototyping with.  As we add additional Classifiers to our codebase, automating parameter tuning as much as possible saves increasing amounts of time.</p>
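The difference between the two search strategies can be sketched with the standard library alone.  The parameter names and values below are hypothetical, and the random sampler draws with replacement for simplicity, so it is only an approximation of what Scikit-Learn's own search classes do.

```python
import itertools
import random

# Hypothetical search space: 3 * 2 * 2 = 12 combinations in total.
param_grid = {
    "alpha": [1.0, 0.1, 0.01],
    "use_idf": [True, False],
    "ngram_range": [(1, 1), (1, 2)],
}


def grid_candidates(grid):
    """Enumerate every combination of parameter values (exhaustive grid search)."""
    keys = list(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


def random_candidates(grid, n_iter, seed=0):
    """Draw a fixed number of combinations at random (with replacement)."""
    rng = random.Random(seed)
    return [{key: rng.choice(values) for key, values in grid.items()}
            for _ in range(n_iter)]


all_candidates = grid_candidates(param_grid)
sampled_candidates = random_candidates(param_grid, n_iter=5)
print(len(all_candidates), "grid candidates;", len(sampled_candidates), "random candidates")
```

The grid grows multiplicatively with every added parameter, whereas the cost of the random variant is fixed by `n_iter`, which is the trade-off that matters as more Classifiers and parameters are added.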

Once we have established which classifiers have the most potential to provide favorable metrics, we may migrate towards Keras and TensorFlow for GPU support and more versatility.  Scikit-Learn does not provide GPU support for its machine learning algorithms.  This does not matter at the moment, as we are working with two very small datasets which in total provide us with only 330 samples.  In addition, GPU support primarily benefits deep neural networks, while we are also using non-NN algorithms.  However, if we wish to crowdsource TBL classification on significantly larger Twitter datasets and work with those, then GPU support will become necessary.  We have heard it requires approximately 24 hours utilizing one Nvidia GeForce Titan on the Borg supercomputer to perform stance analysis training on the larger Twitter datasets consisting of 500k+ examples.  It would be expedient to parallelize this process across all 4 Nvidia GeForce Titans on Borg, which in the ideal case would cut the training time to roughly a quarter.</p>

We plan to implement metric visualizations using the matplotlib library and SciView in PyCharm.  The Scikit-Learn online documentation has a section on “Classification of text documents using sparse features” that can hopefully be modified to suit our purposes.  Its example code constructs a bar plot comparing a variety of Classifiers side by side, visualizing the accuracy score, training time, and test time.  As we are also training multiple Classifiers in the hope of finding one or more suitable candidates to explore further with the Keras and TensorFlow APIs, this type of visualization would be very useful.  Individual charts summarizing the micro, macro, and weighted averages along with the associated precision, recall, f1-score, and support values are also planned.</p>

</span>

## Implementation Section:

<span style="font-family:Arial; font-size:1.2em;">
    
These sections describe in detail (perhaps too much detail) our current implementation of SLO TBL topic classification in Python, in line with the current state of the codebase.  We have decided to keep all debug output as “log.debug()” statements that can be shown or hidden by setting the level passed to “log.basicConfig(level=log.DEBUG)” appropriately.

Our Tweet preprocessor file is separated into 3 functions, each performing preprocessing specific to one of the datasets we are utilizing.  The first is a Tweet dataset consisting of 229 labeled examples, the second is another Tweet dataset consisting of 31 labeled examples, and the third is a dataset consisting of 658983 unlabeled examples.

The first Tweet dataset we are performing text pre-processing on is the training dataset that consists of 229 Tweet examples.  Not all of them are labeled with a TBL topic classification, and the unlabeled ones are dropped from consideration.  The data is shuffled randomly upon import to ensure there is no biased structure to the import order.  We do so using NumPy’s “random.permutation” function.  Then, a Pandas dataframe is constructed to store the dataset.  Custom column names are added for clarity of purpose, as none originally exist.  The “Tweet” column stores the Tweet, “SLO1” stores the first assigned topic label, and “SLO2” and “SLO3” store the second and third labels, if any.</p>

Pandas provides a “dropna()” method, which we use to drop all rows without at least 2 non-NaN values.  A row with fewer than 2 non-NaN values lacks any TBL classification label and can be safely discarded.  We then use Boolean indexing, combining the results of the “.notna()” method with bitwise operators, to construct a mask that isolates the examples with only a single TBL classification.  These examples are placed in a new dataframe, and afterward we drop the SLO2 and SLO3 columns, as they contain only NaN values.  This procedure is effective because a preliminary analysis of the CSV file indicates that all labeled examples have a label in the “SLO1” column.  Our objective is to construct a dataframe consisting of a column storing the raw Tweet and another column storing a single topic classification.  We rename the columns of this new dataframe to “Tweet” and “SLO”.

Next, we construct another mask to isolate all examples with multiple SLO TBL classifications and apply the mask to construct a new dataframe containing only those examples.  We then perform “drop()” operations on the new dataframe to construct 3 separate dataframes: the first by dropping SLO2 and SLO3, the second by dropping SLO1 and SLO3, and the third by dropping SLO1 and SLO2.  This inefficient but workable solution effectively creates one copy of each multi-label example per label slot, with just a single label per copy.  We then name the columns “Tweet” and “SLO”.  This is done so that our machine learning model can take into consideration those examples that can be classified under multiple topics.</p>

The separate dataframes constructed by the above operations are then concatenated back together into a single Pandas dataframe.  Any rows with a NaN value in any column are then dropped via “dropna()”, which removes the empty copies produced when a multi-label example had a topic in SLO2 but not SLO3, or vice versa.  Last, we drop all duplicated examples that have the same TBL classification value in the “SLO” column.  We do this because the initial imported dataset sometimes contained duplicate labels for the same example.  We surmise this is because multiple people were hand-tagging the Tweets and sometimes they were in agreement.</p>
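The label-flattening steps described above can be condensed into the following sketch.  It is a simplified variant rather than our exact code: instead of building separate masks for single-label and multi-label rows, it slices one frame per label column and lets “dropna()” and “drop_duplicates()” clean up the result.  The toy Tweets are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the crowdsourced CSV (up to three labels per Tweet).
raw = pd.DataFrame(
    {
        "Tweet": [
            "mine creates jobs",
            "river polluted, town angry",
            "no label here",
            "taggers agree on this one",
        ],
        "SLO1": ["economic", "environmental", None, "social"],
        "SLO2": [None, "social", None, "social"],
        "SLO3": [None, None, None, None],
    }
)

# Keep rows with at least 2 non-NaN values: the Tweet plus at least one label.
labeled = raw.dropna(thresh=2)

# One two-column frame per label slot, all renamed to the same "Tweet"/"SLO" layout.
parts = [
    labeled[["Tweet", column]].rename(columns={column: "SLO"})
    for column in ["SLO1", "SLO2", "SLO3"]
]

flat = (
    pd.concat(parts, ignore_index=True)
    .dropna()            # discard copies that came from an empty label slot
    .drop_duplicates()   # collapse identical labels given to the same Tweet
    .reset_index(drop=True)
)
print(flat)
```

Each remaining row pairs one Tweet with exactly one topic label, so a Tweet tagged with two different topics appears twice, once per topic.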

According to the dataframe’s “shape” attribute, our final training dataframe contains a total of 245 Tweets, each with a single TBL topic classification label.  We are also using a large Twitter dataset that has already been pre-processed and tokenized as the set we will make predictions on in order to test how well our model(s) generalize to new data.  This set does not contain any target labels and thus we cannot use part of it to supplement our small training and test sets.  There are a total of 658983 Tweets included.  The CMU Tweet Tagger was used to pre-process the text but, unfortunately, this is not a feasible option for us as we are working solely on Windows workstation(s).</p>

As we are unable to use the Linux/Mac-only CMU Tweet Tagger for pre-processing, we decided to clean the raw Tweets ourselves using Python regular expressions and other libraries.  The Natural Language Toolkit was considered as an alternative, but ultimately we chose to use only built-in Python libraries and functions.  A for loop sends each Tweet to a preprocessing function that does the following:</p>

1)	Removes “RT” tags indicating retweets.</p>

2)	Replaces URLs (e.g. https://…) with slo_url.</p>

3)	Replaces Tweet mentions (e.g. @mention) with slo_mention.</p>

4)	Replaces Tweet hashtags (e.g. #hashtag) with slo_hashtag.</p>

5)	Removes all punctuation from the Tweet.</p>

We also convert all text to lowercase.  On our TODO list is to implement regular expressions or other methods in order to:</p>

1)	Shrink character elongations (e.g. “yeees” → “yes”)</p>

2)	Remove non-English tweets</p>

3)	Remove non-company associated Tweets.</p>

4)	Remove year and time.</p>

The yet-to-be-implemented preprocessing features do not seem to be an issue as the preliminary analysis indicates those elements are not present or have already been considered.  We save the processed dataframe to a comma-delimited CSV file to be used in training our Scikit-Learn Classifiers.</p>
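The five implemented cleaning steps and the lowercasing described above can be sketched with Python's built-in “re” and “string” modules.  The exact patterns are assumptions and will differ from our preprocessor file; in particular, keeping the underscore during punctuation removal is a choice made here so that the placeholder tokens survive intact.

```python
import re
import string

# Punctuation to strip; "_" is kept so slo_url, slo_mention, slo_hashtag survive.
PUNCTUATION = string.punctuation.replace("_", "")
PUNCTUATION_TABLE = str.maketrans("", "", PUNCTUATION)


def preprocess_tweet(text: str) -> str:
    """Clean one raw Tweet (patterns are illustrative assumptions)."""
    text = re.sub(r"\bRT\b", " ", text)                  # 1) drop retweet tags
    text = re.sub(r"https?://\S+", " slo_url ", text)    # 2) URLs -> slo_url
    text = re.sub(r"@\w+", " slo_mention ", text)        # 3) mentions -> slo_mention
    text = re.sub(r"#\w+", " slo_hashtag ", text)        # 4) hashtags -> slo_hashtag
    text = text.translate(PUNCTUATION_TABLE)             # 5) strip punctuation
    return " ".join(text.lower().split())                # lowercase, tidy spaces


cleaned = preprocess_tweet("RT @BHP: Great news for #mining jobs! https://t.co/abc123")
print(cleaned)
```

URLs are replaced before mentions and hashtags so that an “@” or “#” inside a link is not mistaken for a mention or hashtag.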

The second Tweet dataset we are performing text preprocessing on consists of 31 hand-labeled examples provided by Professor VanderLinden.  We follow a similar path as above with our first dataset of 229 labeled examples.  We noticed a spelling error in one of the examples, where the label “environmental” was misspelled, resulting in the erroneous creation of a 4th target label later on when we were training our Classifiers.  This was corrected manually by editing the original CSV file before re-running the preprocessing and saving the result to a CSV file.  Of note is that each example has at most two different TBL classifications, as opposed to up to three in the first dataset.  The Tweets themselves were in the same format as those of the first dataset, and thus we could trust that preprocessing them in the same manner would yield similarly processed data.</p>

The third Tweet dataset we are performing text preprocessing on consists of 658983 unlabeled examples with 11 columns of different data including Tweet ID#, language of the Tweet, whether it is a re-Tweet, associated hashtags, associated mining company, Tweet text with mentions, user screen name, user description, Tweet text without mentions (replaced with slo_mention), and Tweet author profile description.  While this dataset has technically been previously preprocessed by earlier research on SLO stance classification, we noticed some discrepancies between these processed Tweets and ours.</p>
    
1)	The Tweets still had “#” hashtags, whereas we replaced with slo_hashtag in ours.</p>

2)	The Tweets still had punctuation, whereas we removed them in ours.</p>

Consequently, we decided to run the entire set through our custom preprocessor in order to normalize the Tweets to be consistent with ours.  Python’s timer class records that it took approximately 11412.2 seconds (a little over three hours) to process the entire dataset of 658,983 Tweets.  This was done overnight and the results were again saved to a CSV file.</p>

Please refer to “SLO_TBL_Tweet_Preprocessor_Specialized.py” for the codebase.  It has also been included in our “proposal.ipynb” Jupyter Notebook file.</p>

Our “slo_topic_classification_clean.py” program implements Scikit-Learn Classifier training, prediction, and parameter tuning via the Pipeline and GridSearchCV classes.  We import our processed datasets, re-index and shuffle the data, and generate a Pandas dataframe for each.  We then concatenate the individual datasets into one cohesive dataframe and re-index again to ensure our index starts from 0.  The total number of usable labeled examples is 277, each with a single TBL topic classification of economic, environmental, or social.</p>

The next step was to create the input feature set from the “Tweet” column and the target label set from the “SLO” column.  We chose to refactor this code into a separate function so that each training iteration runs on freshly randomized training and test sets of Tweets and target labels.  Scikit-Learn includes a handy function, “train_test_split()”, which allows us to easily split our input features and target labels into a training and a test set for each.  </p>

With the training, test, and generalization sets properly prepared, we utilized Scikit-Learn’s Pipeline class to set up various Classifiers.  Each Classifier is contained in its own module and, provided the log level is set to “debug” or lower, displays accuracy metrics and a classification report summary.  The summary includes statistics on precision, recall, and f1-score, as well as the micro, macro, and weighted averages for each.  A for loop is used to generate metrics over N iterations, and a mean accuracy metric is reported.  The trained Classifier is then passed to our “make_predictions” method, which attempts topic classification on the large, unlabeled, processed dataset of 658,983 Tweets.  The currently implemented Classifiers include:</p>

1)	Multinomial Naïve Bayes</p>

2)	Stochastic Gradient Descent (SGD)</p>

3)	Support Vector Machine – Support Vector Classifier.</p>

4)	Support Vector Machine – Linear Support Vector Classifier.</p>

5)	Nearest Neighbor KNeighbors Classifier.</p>

6)	Decision Tree Classifier.</p>

7)	Multi-layer Perceptron Neural Network Classifier.</p>

8)	Logistic Regression Classifier.</p>

These represent many of the Scikit-Learn Classifiers capable of multi-class single-label topic classification.  We have decided to implement as many as we can to see which performs best and is worthy of further consideration with the Keras and TensorFlow APIs, provided those APIs support or can be made to support that Classifier.</p>
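The repeated-split evaluation described above (train on a fresh random split each iteration, then average the accuracy) might look roughly like the sketch below.  The toy corpus, the function name `mean_accuracy`, the 10 iterations, and the 0.33 test fraction are all assumptions for illustration; Multinomial Naïve Bayes stands in for any of the listed Classifiers.

```python
from statistics import mean

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the labeled Tweets.
texts = [
    "new mine brings jobs and profit",
    "share price and export revenue rise",
    "royalties boost the state budget",
    "coal dust pollutes the river",
    "reef damaged by dredging",
    "land clearing threatens wildlife habitat",
    "community protest against the mine",
    "traditional owners demand consultation",
    "workers strike over safety conditions",
] * 2
labels = (["economic"] * 3 + ["environmental"] * 3 + ["social"] * 3) * 2


def mean_accuracy(texts, labels, n_iterations=10, test_size=0.33):
    """Average test accuracy over several random train/test splits."""
    scores = []
    for seed in range(n_iterations):
        x_train, x_test, y_train, y_test = train_test_split(
            texts, labels, test_size=test_size, random_state=seed
        )
        model = Pipeline([
            ("vect", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", MultinomialNB()),
        ])
        model.fit(x_train, y_train)
        scores.append(float(accuracy_score(y_test, model.predict(x_test))))
    return mean(scores)


average = mean_accuracy(texts, labels)
print(f"mean accuracy over 10 splits: {average:.3f}")
```

Averaging over many splits matters with so few examples, because a single split can swing the accuracy by several points depending on which Tweets land in the test set.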

Each Scikit-Learn Classifier Pipeline chains a CountVectorizer(), a TfidfTransformer(), and the relevant Classifier class.  The following paragraphs describe how the target labels are encoded and why we utilize the two text-transformation classes:</p>

The target label train and test sets were encoded using the Scikit-Learn LabelEncoder class.  This converted our categorical labels of “economic”, “environmental”, and “social” into the integer values 0, 1, and 2, respectively.  This was a necessary step, as most of the machine learning algorithms we are interested in prototyping with require numerical data. (Note: this is deprecated – may or may not use in the future)</p>
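The integer mapping described above can be reproduced in a few lines of plain Python.  This is only an illustration of the idea (classes sorted alphabetically, then numbered), not the LabelEncoder implementation; the variable names are hypothetical.

```python
# Hypothetical target labels in arbitrary order.
labels = ["social", "economic", "environmental", "social"]

# Sort the distinct classes and number them, as LabelEncoder does.
classes = sorted(set(labels))
label_to_id = {label: index for index, label in enumerate(classes)}
id_to_label = {index: label for label, index in label_to_id.items()}

encoded = [label_to_id[label] for label in labels]
decoded = [id_to_label[index] for index in encoded]
print(label_to_id, encoded)
```

Keeping the inverse mapping around makes it easy to turn numeric predictions back into readable topic names.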

The Scikit-Learn CountVectorizer class was used to convert the processed Tweet training and test sets into numerical feature vectors.  The documentation indicates that the class converts a collection of text documents to a matrix of token counts and produces a sparse representation of the counts.  As we did not provide an a-priori dictionary or an analyzer that performs feature selection, the total number of features is equal to the vocabulary size of the analyzed data.  Hence, we have a very high dimensionality in our feature vectors compared to our small number of samples.  This effectively creates the bag-of-words representation that we use for our categorical Tweet data, with the number of occurrences of each word stored in the feature vector.  Console output shows that we are dealing with a vocabulary size of 809 in comparison to 164 examples for the training set and 81 examples for the test set (Note: deprecated numbers, TODO - update for new training set).</p>
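To make the bag-of-words idea concrete, the sketch below builds count vectors, and their binary presence/absence counterparts, for three hypothetical Tweets using only the standard library.  It tokenizes on whitespace, which is simpler than CountVectorizer's default tokenizer, so it should be read as an illustration rather than an equivalent.

```python
from collections import Counter

# Hypothetical, already-cleaned Tweets.
tweets = ["mine jobs jobs", "river pollution", "jobs and pollution"]

# The vocabulary is every distinct token in the corpus, in sorted order.
vocabulary = sorted({token for tweet in tweets for token in tweet.split()})


def count_vector(tweet):
    """Number of occurrences of each vocabulary token in the Tweet."""
    counts = Counter(tweet.split())
    return [counts[token] for token in vocabulary]


def binary_vector(tweet):
    """1 if the vocabulary token is present in the Tweet, else 0."""
    return [int(count > 0) for count in count_vector(tweet)]


count_matrix = [count_vector(tweet) for tweet in tweets]
print(vocabulary)
print(count_matrix)
```

Even in this tiny corpus most entries are zero, which is why a sparse matrix representation pays off once the vocabulary reaches hundreds of terms.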

The Scikit-Learn TfidfTransformer class was used to convert the vectorized Tweet data into term-frequency * inverse document-frequency (tf-idf) values.  The purpose of this is to scale down the impact of tokens that occur very frequently and are therefore empirically less informative than features that occur in a small fraction of the training set.  Term frequencies, in general, are better than raw occurrence counts, as longer documents will have higher average word occurrence values than shorter documents.  Normalization of this kind therefore provides better input feature vectors for training our model.</p>
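The weighting can be worked through by hand on a small count matrix.  The sketch below assumes TfidfTransformer's default settings, namely the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row; the count matrix and function name are hypothetical.

```python
import math

# Hypothetical count matrix: rows are Tweets, columns are vocabulary terms
# ("and", "jobs", "mine", "pollution", "river").
count_matrix = [
    [0, 2, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 1, 0],
]


def tfidf(counts):
    """Smoothed tf-idf with L2 row normalization (assumed default behavior)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many Tweets does each term occur?
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    weighted_rows = []
    for row in counts:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(value * value for value in weighted)) or 1.0
        weighted_rows.append([value / norm for value in weighted])
    return weighted_rows


weights = tfidf(count_matrix)
for row in weights:
    print([round(value, 3) for value in row])
```

In the third Tweet, “and” appears in only one document while “jobs” appears in two, so “and” ends up with the larger weight even though both occur once.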

Each Classifier is also paired with a Grid Search Function utilizing Scikit-Learn’s GridSearchCV() class that provides automated parameter tuning.  The grid search requires the setup of a classifier (which we did via Pipeline) and the specification of a dictionary storing all the keys (parameters) and values (parameter values) to tune with.  The dictionary is passed as an argument to the GridSearchCV() class along with the Classifier Pipeline.  We also passed along optional arguments specifying it should run in parallel using all available cores and perform 5-fold cross-validation splitting.  This class provides an exhaustive search of all possibilities, meaning it tries all possible combinations of the parameters and associated values you provide it with.  Hence, the time to find optimal parameters using our grid searches varied drastically from a few minutes to a few hours.</p>
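A minimal version of this setup, assuming a Multinomial Naïve Bayes Pipeline and a small hypothetical parameter dictionary, is sketched below.  The toy corpus, the step names (“vect”, “tfidf”, “clf”), and the parameter values are illustrative and are not the grids used in our codebase.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; each class appears 6 times so 5-fold CV is possible.
texts = [
    "new mine brings jobs and profit",
    "share price and export revenue rise",
    "royalties boost the state budget",
    "coal dust pollutes the river",
    "reef damaged by dredging",
    "land clearing threatens wildlife habitat",
    "community protest against the mine",
    "traditional owners demand consultation",
    "workers strike over safety conditions",
] * 2
labels = (["economic"] * 3 + ["environmental"] * 3 + ["social"] * 3) * 2

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Keys follow the "<step name>__<parameter>" convention of Pipeline.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1.0, 0.1],
}

# cv=5 requests 5-fold cross-validation; n_jobs=-1 uses all available cores.
search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
search.fit(texts, labels)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

This grid has 2 × 2 × 2 = 8 combinations, each fitted 5 times, which shows how quickly the run time grows as more parameters and values are added.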

As mentioned above, we also use a large dataset of 658,983 Tweets on which each of our trained Classifiers makes predictions.  This prediction set is prepared in its own function.  We drop all columns except the “tweet_t” column, which contains the processed Tweet, to create an input feature dataframe.  Each Classifier’s module then calls our prediction function, passing in the trained Classifier.  The Classifier makes predictions on all Tweets, and we use counter variables to calculate what percentage of the entire dataset was classified as economic, social, or environmental.</p>
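
The prediction step can be sketched roughly as follows.  The helper name topic_percentages, the "tweet_id" column, and the tiny dataframes are hypothetical; only the “tweet_t” column name comes from our dataset.

```python
from collections import Counter

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def topic_percentages(classifier, tweets):
    """Return the percentage of Tweets predicted for each topic label."""
    predictions = classifier.predict(tweets)
    counts = Counter(predictions)
    return {label: 100.0 * n / len(predictions) for label, n in counts.items()}


# Hypothetical labeled examples and a tiny unlabeled frame with an extra column.
train = pd.DataFrame({
    "tweet_t": ["river pollution spill", "jobs wages profit", "community protest health"],
    "label": ["environmental", "economic", "social"],
})
unlabeled = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "tweet_t": ["pollution spill", "protest health", "community protest", "wages profit"],
})

# Stand-in for one of the trained Classifiers.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train["tweet_t"], train["label"])

# Keep only the processed-Tweet column as the input feature.
features = unlabeled[["tweet_t"]]
percentages = topic_percentages(classifier, features["tweet_t"])
print(percentages)
```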

</span>

## Results Section:

<span style="font-family:Arial; font-size:1.2em;">
    
Grid search was the essential component in obtaining the best possible results with our limited training and test sets, which consist of a total of 277 labeled examples.  With default and manually tuned parameters, our accuracy metrics were abysmally low and inconsistent.  The inconsistency was due in part to initially not running 1000 iterations and taking the mean of the accuracy metric to obtain a stable estimate.  Using the optimal parameters suggested by the exhaustive grid search, we were able to raise the accuracy of each Classifier to around 50%.  The lowest were the Multi-Layer Perceptron Classifier at 0.490, the Stochastic Gradient Descent Classifier at 0.492, and the Multinomial Naive Bayes Classifier at 0.493, approximately.  The highest were the Support Vector Classifier at 0.535 and the Decision Tree Classifier at 0.532.  The remainder fell somewhere in between.</p>
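
Averaging accuracy over repeated runs can be sketched as below.  This is one plausible reading of the procedure, not a copy of our code: it assumes each iteration re-splits the data and retrains, and it uses synthetic features from make_classification rather than our Tweets.  The function name mean_accuracy is hypothetical.

```python
from statistics import mean

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the vectorized Tweets (an assumption).
X, y = make_classification(
    n_samples=277, n_features=20, n_informative=5, n_classes=3, random_state=0
)


def mean_accuracy(n_iterations):
    """Retrain on a fresh split each iteration and average the test accuracy."""
    scores = []
    for seed in range(n_iterations):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.33, random_state=seed, stratify=y
        )
        clf = SGDClassifier(random_state=seed)
        clf.fit(X_train, y_train)
        scores.append(clf.score(X_test, y_test))
    return mean(scores)


# The report used 1000 iterations; 20 keeps this example quick.
print(mean_accuracy(20))
```

A single run on a dataset this small can swing by several percentage points depending on the split, which is why the mean is the more trustworthy figure.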

Our prediction results indicate that the Classifiers will not generalize well to new Tweets, or at least not to the processed Tweets we are using.  “Social” was the favored classification for our trained models, with the Stochastic Gradient Descent Classifier predicting every Tweet as social (100%).  On the flip side, the Decision Tree Classifier was the most balanced, identifying 40% as social, 54% as environmental, and 6% as economic.  The rest of the trained models predicted almost all Tweets as “social”.  There are three likely explanations: we are applying machine learning methodologies improperly, almost all the Tweets in that dataset really are “social” in nature, or we simply do not have enough relevant Twitter data to train decent models for TBL topic classification, let alone any deep neural networks.  Refer to the code output in the notebook for further details.</p>

Of particular concern to us is performing the proper pre-processing and post-processing needed to turn the Twitter data into usable sparse feature vectors.  Regretfully, we will need the assistance of other researchers who have a Linux/Mac workstation and the proper setup in order to run the CMU Tweet Tagger on the labeled TBL datasets.</p>

We also plan to implement matplotlib visualizations of our metric summaries to display the results of training our models and their ability to generalize to new data.  As of the writing of this report, this is where we are in our research efforts.  Please refer to the code modules included in this Jupyter Notebook for further details.</p>

Placeholder – discuss comparison with similar works. (not really possible since the similar work was internal at CSIRO and Professor VanderLinden is unsure he can retrieve the relevant materials from years ago; otherwise we are using the prior summer’s stance classification research material as a reference for our own work)</p>

</span>

## Implications Section:

<span style="font-family:Arial; font-size:1.2em;">
    
One social and ethical implication is that a machine learning algorithm could become a substitute for the voice of the local population and stakeholders concerning a project.  Perhaps the future holds a system where the Social License to Operate is maintained simply by plugging in a Tweet dataset: if the result is above a certain metric threshold, the company or organization keeps its SLO.  There is a danger that a company or organization would use a trained model to predict SLO levels and assume the results are reliable when reality could be different.  These are hypothetical situations that may or may not (probably not) ever occur, as we are currently performing stance, sentiment, and topic classification on Twitter data purely for the sake of research.</p>

</span>

## Works Referenced:

1)	“1. Supervised Learning.” Scikit, scikit-learn.org/stable/supervised_learning.html#supervised-learning.
<br><br>
2)	“A Gentle Introduction to the Bag-of-Words Model.” Machine Learning Mastery, 12 Mar. 2019, machinelearningmastery.com/gentle-introduction-bag-words-model/.
<br><br>
3)	“A Gentle Introduction to k-Fold Cross-Validation.” Machine Learning Mastery, 21 May 2018, machinelearningmastery.com/k-fold-cross-validation/.
<br><br>
4)	“Classification of Text Documents Using Sparse Features.” Scikit, scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py.
<br><br>
5)	“Introduction to Machine Learning  |  Machine Learning Crash Course  |  Google Developers.” Google, Google, developers.google.com/machine-learning/crash-course/ml-intro.
<br><br>
6)	“How to Tune Algorithm Parameters with Scikit-Learn.” Machine Learning Mastery, 1 Nov. 2018, machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/.
<br><br>
7)	Kenton, Will. “How Can There Be Three Bottom Lines?” Investopedia, Investopedia, 9 Apr. 2019, www.investopedia.com/terms/t/triple-bottom-line.asp.
<br><br>
8)	Littman, Justin. “Where to Get Twitter Data for Academic Research.” Social Feed Manager, 14 Sept. 2017, gwu-libraries.github.io/sfm-ui/posts/2017-09-14-twitter-data.
<br><br>
9)	Mohammad, Saif, et al. “SemEval-2016 Task 6: Detecting Stance in Tweets.” Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016, doi:10.18653/v1/s16-1003.
<br><br>
10)	“Multiclass Classification.” Wikipedia, Wikimedia Foundation, 18 Apr. 2019, en.wikipedia.org/wiki/Multiclass_classification.
<br><br>
11)	“Symbolic Reasoning (Symbolic AI) and Machine Learning.” Skymind, skymind.ai/wiki/symbolic-reasoning.
<br><br>
12)	Walker, Leslie. “Learn Tweeting Slang: A Twitter Dictionary.” Lifewire, Lifewire, 8 Nov. 2017, www.lifewire.com/twitter-slang-and-key-terms-explained-2655399.
<br><br>
13)	“What Is the Social License?” The Social License To Operate, socialicense.com/definition.html.
<br><br>
14)	“Working With Text Data.” Scikit, scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.
<br><br>