# Part 3: Applying Neural Nets (ANN, CNN, LSTM) to real labeled text data

For this part of the Exam, I have gathered articles on three topics: *football*, *science*, and *politics*. The data has already been cleaned, tokenized, and vectorized. Each row (vector) in the dataset is an article, labeled as *football*, *science*, or *politics*. Each column is a word in the vocabulary. Each value is the number of times that word appears in that article. (The data was gathered from [newsapi.org](https://newsapi.org/).)

**Here is a link to the cleaned, prepared, labeled data.**

<https://drive.google.com/file/d/1-ZAbxWN29iCo44kaLSfYmV2E8YDKcgGE/view?usp=sharing>

(If you want to know how this was done (**not required**) - here is code and a tutorial)

<https://gatesboltonanalytics.com/?page_id=254>

---

## The overall goals here include:
1. Coding, comparing, and using an ANN, CNN, and LSTM RNN in TF/Keras (Python) to Train models and to Test their accuracy.
2. You want to see if you can predict the topic of an article (in this case - *football*, *science*, or *politics*).
3. You also want to compare and illustrate the accuracy of your models and determine/discuss which model (ANN, CNN, or LSTM) is best and why this might be.
4. **It is up to you how to do this and how best to illustrate and explain your steps, results, and conclusions. Assume the reader is non-technical.**
5. You will include a **link** to your code, but do not paste or otherwise include code on the Exam document. (Again, you can place your code wherever you want as long as there is a link to it).

## Specific requirements:

There are many ways to do this. The following offers a few core requirements. Beyond this, **YOU must decide what to do and how best to do it.** Part of your grade will be based on your flow, discussion, illustrations, report, and communication of methods and results. Again, you will post a link to the code, but you will not include or paste code in the Word doc.

1. Use Python and TF/Keras to Train and then Test the accuracy for an ANN, CNN, and LSTM RNN. In other words, you will use three different Neural Networks to create models that should predict whether a test vector (which represents an article on a topic) is on the topic of *science*, *football*, or *politics*. You will need to write code to do this. You already have code for ANNs, CNNs, and LSTM RNNs, so you may choose to repurpose/update your code as needed. 
2. To show your work and to **illustrate and explain** your work, results, and conclusions you must include at least the following:
    
    (a) A link to your code. If you wish, you can put your code on your website, Google Colab, GitHub, or wherever, and then include the URL in the Word doc.
    
    (b) Show and explain how you **prepared the data** so that you can use it properly to Train and Test your models. (You are not required to validate - but you certainly can). Specifically, if you split the data, discuss and illustrate this. If you encode the labels, discuss and illustrate this, etc. Use images (like screenshots) as needed. YOU decide and explain/show what you are doing.
    
    (c) **DO NOT** include or paste any code (into the Word doc). You do not need, nor should you use, "code" to explain or illustrate what you are doing. Use illustrations, images, and explanations. Pretend that the person grading this paper does not know Python but does want to see and understand what you did, what you found, how your models compare, which model worked best, etc.
    
    (d) To be clear - you will be coding, training, and then testing three types of models: ANN, CNN, LSTM. Therefore, you should include screen images (small portions) of the training for each (a few of the last epochs), as well as **confusion matrices** for each that illustrate the test data accuracy for each model.
    
    (e) Discuss and describe what you are doing and showing.
    
    (f) Discuss and illustrate the results. Which model worked best? (Have confusion matrices that support this discussion.) Comment on which model you expected to work the best, which model actually worked the best, and why.
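
As a small illustration of the confusion matrices asked for in (d) and (f), the sketch below builds one from one-hot test labels and model output probabilities using only NumPy. The labels, probabilities, and class order are made up for the example (they are not results from this dataset); rows are the true class and columns are the predicted class, which is the same layout `sklearn.metrics.confusion_matrix` uses.

```python
import numpy as np

# Hypothetical class order; in practice it comes from the label encoder.
classes = ["football", "politics", "science"]

# Made-up one-hot test labels and made-up model output probabilities.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],   # a science article mistaken for politics
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
])

# Turn one-hot rows and probability rows into class indices.
true_idx = y_true.argmax(axis=1)
pred_idx = y_prob.argmax(axis=1)

# cm[i, j] counts articles of true class i that were predicted as class j.
cm = np.zeros((len(classes), len(classes)), dtype=int)
np.add.at(cm, (true_idx, pred_idx), 1)

# Accuracy is the share of articles on the diagonal (correct predictions).
accuracy = np.trace(cm) / cm.sum()

print(cm)
print("accuracy:", accuracy)
```

The off-diagonal cells show which topics get confused with each other, which a single accuracy number hides.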

---


In [1]:
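# %% libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os, sys

# used below for label encoding and the train/test split
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split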

In [2]:
# %% working directory
src_file_dir = os.path.abspath("")          # directory holding this script file
src_dir = os.path.dirname(src_file_dir)     # parent directory of above directory
os.chdir(src_dir)                           # working directory should now be ".../CSCI5922/Exam3"
print("current working directory:", os.getcwd())

current working directory: /home/jasminekobayashi/gh_repos/CSCI5922/Exam3


# Data

In [3]:
# %% load data
data = pd.read_csv("data/Final_News_DF_Labeled_ExamDataset.csv")
data.head()

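| Unnamed: 0 | LABEL | according | agency | ahead | alabama | amazon | america | american | announced | appeared | ... | wolverines | women | work | working | world | wrote | year | years | york | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | politics | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | politics | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | politics | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | politics | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | politics | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |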


In [None]:
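# %% features vs. targets (aka: predictors vs. responses, inputs vs. outputs)
# Drop the label and, if present, a leftover CSV index column ("Unnamed: 0") so it is not used as a feature.
X = data.drop(columns=["LABEL", "Unnamed: 0"], errors="ignore").to_numpy()   # features: word counts only
y = data[["LABEL"]]                             # targets: column "LABEL" (kept 2D for OneHotEncoder)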

In [None]:
# %% One Hot encode label
OHE = OneHotEncoder()
y = OHE.fit_transform(y).toarray()
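
To make the one-hot step concrete, here is a minimal sketch on four made-up labels (not rows from the dataset). It shows that `OneHotEncoder` orders its output columns by the sorted category names, which is what later decides which column of a prediction means which topic, and that `inverse_transform` maps rows back to label strings.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up labels, shaped (n_samples, 1) like data[["LABEL"]].
labels = np.array([["politics"], ["football"], ["science"], ["football"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(labels).toarray()   # dense 0/1 matrix, one column per class

# Column order of the encoded matrix (sorted category names).
class_order = list(encoder.categories_[0])

# Map one-hot rows back to the original label strings.
decoded = encoder.inverse_transform(encoded).ravel()

print(class_order)
print(encoded)
print(decoded)
```

Keeping the fitted encoder around (rather than only its output) makes it easy to label the axes of a confusion matrix later.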

In [None]:
# %% Training & Testing set
test_size = 0.2 # fraction of the data held out as the testing set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=test_size,random_state=123)
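
One optional variation on the split above, shown here on toy data, is to pass `stratify` so that each topic keeps the same share in the training and testing sets. This is a sketch under assumptions: the toy matrix, its size, and the choice to stratify are illustrative and are not part of the original cell, which splits without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the count matrix: 30 "articles", 5 "words", 10 articles per class.
rng = np.random.default_rng(0)
n_per_class, n_words = 10, 5
X_toy = rng.integers(0, 3, size=(3 * n_per_class, n_words))

# Integer class ids and their one-hot version (what the encoder would produce).
class_idx = np.repeat([0, 1, 2], n_per_class)
y_toy = np.eye(3)[class_idx]

# Stratify on the integer class ids so every class is represented proportionally.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123, stratify=class_idx
)

# Number of test articles per class (column sums of the one-hot test labels).
test_counts = y_te.sum(axis=0)

print(X_tr.shape, X_te.shape)
print(test_counts)
```

With 30 balanced rows and `test_size=0.2`, the test set holds 6 rows, 2 from each class.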

# ANN

# CNN
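
Keras `Conv1D` layers (and `LSTM` layers) expect 3D input shaped `(samples, steps, features)`, while the count matrix `X` is 2D. One possible way to bridge that, sketched below on a made-up two-article matrix, is to add a trailing axis so each vocabulary column becomes one step with a single feature. Treating the columns as a sequence is an assumption made for illustration: the columns are vocabulary words in alphabetical order, not the order in which words occur in an article.

```python
import numpy as np

# Made-up count matrix: 2 articles, 4 vocabulary words.
X_toy = np.array([[0, 2, 1, 0],
                  [1, 0, 0, 3]], dtype="float32")

# Add a trailing "features" axis: (samples, words) -> (samples, words, 1).
X_seq = X_toy[..., np.newaxis]

print(X_toy.shape, "->", X_seq.shape)
```

The same reshape would apply to both the training and testing features so the two stay consistent.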

# LSTM