## News Classification

Use Jupyter Notebooks to create a News Classifier with Multimodal Naive Bayes Classification.

### Read in the Data:

Use the Kaggle News Category Classification data set for this project. Note this is a JSON dataset and will be read in differently than a CSV.

Markdown help:
https://www.markdownguide.org/cheat-sheet/

In [None]:
#This block will read in the data, converting from JSON to a dataframe

import pandas, json

df = pandas.read_json("News_Category_Dataset_v3.json", lines=True)

# let's see what categories/keys we're working with
# here's an example:
# {
    # "link": "https://www.huffpost.com/entry/covid-boosters-uptake-us_n_632d719ee4b087fae6feaac9",
    # "headline": "Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters",
    # "category": "U.S. NEWS",
    # "short_description": "Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.",
    # "authors": "Carla K. Johnson, AP",
    # "date": "2022-09-23"
# }

# so we have
# keys = [link, headline, category, short_description, authors, date]

# let's check to make sure all our data is at least somewhat valid
# print(df.info())


### Test Data: 

Before you start building your classifier, remove 20% of the data and place it in a separate table; this will be your testing dataset. 

Once you have created your model (Steps 0 and 1) with the other 80% of the data you will test to see how accurate your model is by having the program categorize the records in your testing data set.


In [None]:
# this block will partition our data into the 80-20 split we need

# let's find out the shape of our dataframe

# print(df.shape) returns (209527, 6), which means we have 209527 rows and 6 columns

# we'll calculate the last 20% and reserve those entries for testing

test_index_start = int(df.shape[0]*4/5)

model_bounds:tuple = (0, test_index_start-1)

test_bounds:tuple = (test_index_start, df.shape[0])

print(model_bounds)

print(test_bounds)



### Stop Words:

Just like in your first Project you should remove the Stop Words in all of the articles before building your model or testing.

### Classifier:

You may use whatever python libraries you wish but you should write the code for each of the four steps in the Multimodal Naive Bayes Classification yourself.

- Step 0: Laplace Smoothing (remember this is for word counts not the categories’ probabilities)

- Step 1: Find probabilities for each word for each category

- Step 2: Calculate the probability that a record in the testing data set is part of each category.

- Step 3: Compare probabilities calculated in Step 2. Choose the largest probability to assign the category tag for the data.


In [14]:
# this block will remove stop words from the headlines and short descriptions, and then run our classifier
# for each row in the dataframe, we'll remove stop words from the headline and short_description

# we'll use the nltk library to help us with this
import time
import nltk
import string
from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download('stopwords')

punc = str.maketrans('', '', string.punctuation)

stop_words = stopwords.words("english")

# here's a list of all words, coupled with their frequency in each category
# each word is a key, and the value is a dictionary of categories and their counts
# {word : {category : count, category : count, ...}, word : {category : count, category : count, ...}, ...}
words_list = {}

# this is a list of all possible categories in category-count pairs
category_list = {}

# we'll also keep track of the total number of words in each category through the incrementWord method
category_words = {}

# this is our method from the last project, modified to work here
# this method takes in a word or string and returns the cleaned word or string
def clean_string(current_word:str, article_category:str, record = True):
    # remove all numbers: if a number is found, return without adding word
    for num in range(0,9):
        if (current_word.find(str(num)) != -1):
            return ""

    # we're going to turn dashes into spaces so that we can split on spaces later
    current_word = current_word.replace("-", " ")

    # make it all lowercase and remove most punctuation
    current_word = current_word.casefold().translate(punc)
    cleanedString:str = ""
    # completely ditch the word if it's just an empty string
    if (current_word != ''):
        # if the above operations turned the word into multiple words, recurse this method for each new word
        if (current_word.find(' ') != -1):
            allWords = current_word.split()
            for i in range(len(allWords)):
                cleaned_word = clean_string(allWords[i], article_category)
                if cleaned_word is not None:
                    cleanedString = cleanedString + " " + cleaned_word
            while (cleanedString.find("  ") != -1):
                cleanedString = cleanedString.replace("  ", " ")
            return cleanedString.strip()

        # make sure the word is not a stopword before taking it into account
        if current_word not in stop_words:
            # print("\""+thisWord+"\" is KEPT")
            # add 1 to the count of this word for this category
            if record:
                increment(current_word, article_category)
            return current_word
        else:
            # print("\""+thisWord+"\" is TOSSED")
            return ""
    else:
        return ""

# this is a method to send valid words into wordsList or increment their count
def increment(current_word: str, article_category:str):
    # if the article_category is not in category_list, add it
    if(article_category not in category_list):
        category_list[article_category] = 1
    else:
        category_list[article_category] += 1
    # print("incrementing "+current_word+" in "+article_category)
    # for the current_word, we need to check if it's already in wordsList
    if(current_word in words_list):
        # if it is, we need to check if it's in the current category
        if(article_category in words_list[current_word]):
            # if it is, we need to increment the count
            words_list[current_word][article_category] += 1
        else:
            # if it's not, we need to add it to the current category
            words_list[current_word][article_category] = 1
    else:
        # if it's not, we need to add it to words_list and add the category
        words_list[current_word] = {article_category:1}

    # now we need to increment the total number of words in this category
    # we'll have a dictionary where the keys are category names and the values are the total number of words in that category
    if(article_category in category_words):
        category_words[article_category] += 1
    else:
        category_words[article_category] = 1

# now we call this method to replace each headline and short_description in df with the cleaned version
# this will take a while, so we'll print out the progress as we go

for i in range(model_bounds[0], model_bounds[1]):
# this one is for testing purposes and only runs the first 1000 rows
# for i in range(1000):

    df.loc[i, 'headline'] = clean_string(df['headline'].loc[i], df['category'].loc[i])
    df.loc[i, 'short_description'] = clean_string(df['short_description'].loc[i], df['category'].loc[i])
    # todo: timer here, but maybe later. it could help us estimate how long it will take to clean the whole dataset.
    if (i % 100 == 0):
        print("Cleaned row "+str(i)+" of "+str(df.shape[0]))

# CLASSIFIER

# step 0: Laplace smoothing
def laplace_smoothing(alpha=1):
    # we'll make sure every word in words_list has a count for every category
    for word in words_list:
        for category in category_list:
            if category not in words_list[word]:
                words_list[word][category] = 0
    # then, we'll add 1 to every count
    for word in words_list:
        for category in words_list[word]:
            words_list[word][category] += alpha

# print(words_list)
laplace_smoothing()
# these print statements can be used to verify that laplace smoothing works
# print(words_list)
# print(category_words)
# print(category_list)
# step 1: Find probabilities for each word for each category
# step 2: Calculate the probability that a record in the testing data set is part of each category.
# step 3: Compare probabilities calculated in Step 2. Choose the largest probability to assign the category tag for the data.

# to reduce risk of underflow, we could use log probabilities
# import math

def prob_category(category:str):
    return category_list[category]/model_bounds[1]

def prob_category_given_word(category:str, word:str):
    # print("prob_category_given_word("+category+", "+word+")")
    # print("words_list["+word+"]["+category+"]: "+str(words_list[word][category]))
    # print("category_words["+category+"]: "+str(len(category_words[category])))
    try:
        return words_list[word][category]/category_words[category]
    except KeyError:
        print("the key "+word+" was not found in the dictionary for category "+category)
        return -1

def prob_category_given_words(category:str, word_list:list[str]):
    # first guess begins with the probability of the category
    product:int = 1
    # then we multiply by the probability of each category given each word in the headline and short_description
    for word in word_list:
        # print("prob_category_given_words("+category+", "+word+")")
        product *= prob_category_given_word(category, word)
    return product

all_predictions = {}

def classify(headline:str, short_description:str):
    # takes in a headline and short_description and returns the highest probability category
    predictions = []
    for category in category_list:
        probability = prob_category(category) * prob_category_given_words(category, clean_string(headline, category, False).split() + clean_string(short_description, category, False).split())
        predictions.append((category, probability))
    # print(predictions)

    max = 0
    max_category = ""

    for prediction in predictions:
        if prediction[1] > max:
            max = prediction[1]
            max_category = prediction[0]

    # store the predictions made for each article
    # all_predictions[headline] = predictions

    return max_category

# for each row in the test data, we'll run the classify method and see if it matches the actual category
correct = 0
total = test_bounds[1] - test_bounds[0]

for i in range(test_bounds[0], test_bounds[1]):
    actual = df['category'].loc[i]
    predicted = classify(df['headline'].loc[i], df['short_description'].loc[i])
    print("Predicted: "+predicted)
    print("Actual: "+actual)
    if (actual == predicted):
        correct += 1
    print("Current accuracy: "+str(correct/(i-test_bounds[0] + 1)))

print("Final accuracy: "+str(correct/total))


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\liamz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\liamz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Cleaned row 0 of 209527
Cleaned row 100 of 209527
Cleaned row 200 of 209527
Cleaned row 300 of 209527
Cleaned row 400 of 209527
Cleaned row 500 of 209527
Cleaned row 600 of 209527
Cleaned row 700 of 209527
Cleaned row 800 of 209527
Cleaned row 900 of 209527
Cleaned row 1000 of 209527
Cleaned row 1100 of 209527
Cleaned row 1200 of 209527
Cleaned row 1300 of 209527
Cleaned row 1400 of 209527
Cleaned row 1500 of 209527
Cleaned row 1600 of 209527
Cleaned row 1700 of 209527
Cleaned row 1800 of 209527
Cleaned row 1900 of 209527
Cleaned row 2000 of 209527
Cleaned row 2100 of 209527
Cleaned row 2200 of 209527
Cleaned row 2300 of 209527
Cleaned row 2400 of 209527
Cleaned row 2500 of 209527
Cleaned row 2600 of 209527
Cleaned row 2700 of 209527
Cleaned row 2800 of 209527
Cleaned row 2900 of 209527
Cleaned row 3000 of 209527
Cleaned row 3100 of 209527
Cleaned row 3200 of 209527
Cleaned row 3300 of 209527
Cleaned row 3400 of 209527
Cleaned row 3500 of 209527
Cleaned row 3600 of 209527
Cleaned row 3

### Results:

Since you have the categories that the test data was originally sorted into you can compare the predicted probabilities with the original classifications of the news articles. Report the overall effectiveness of your classifier as a percentage of news items categorized correctly. This should be done as a percentage across all the data in the test data set as well as the percent in each category that were categorized correctly.


In [None]:
# here we can print the effectiveness of our classifier
# Final accuracy: 0.47697227127380326
# for each category, we'll print the number of correct predictions and the total number of predictions
for category in category_list:
    correct = 0
    total = 0
    for i in range(test_bounds[0], test_bounds[1]):
        actual = df['category'].loc[i]
        predicted = classify(df['headline'].loc[i], df['short_description'].loc[i])
        if (actual == category):
            total += 1
            if (actual == predicted):
                correct += 1
    print("Category: "+category)
    print("Correct: "+str(correct))
    print("Total: "+str(total))
    print("Accuracy: "+str(correct/total))
    print("")


### Documentation:

Make sure to comment your code and give credit to any sources you used in creating your code.

### Reflection:

Create a document with the answers to the following questions. This should be formatted with each numbered question as the start of its own section, with the answer below it. Make sure your document has a title and your name on it.

#### Technical:

1. How is JSON formatted? Give an example with an explanation.
- JSON is formatted similarly to python dictionaries, with key-value pairs such as "headline" as a key and "World Cup Captains Want To Wear Rainbow Armbands In Qatar" as a value.
2. Why were the stop words removed from the text? If you don’t do this step in your code what changes? What other steps could you take along the same lines to improve your program?
- Removing stop words allows the program to focus on the words that are more important to the meaning of the text. If stop words were not removed, the program would have to sift through a lot of words that are not important to the meaning of the text. Another step would be to implement lemmatization, which would reduce words to their root form and would hopefully help reduce the run time of the program.
3. Why was Laplace Smoothing done?
- Laplace smoothing was done to prevent the probability of a word being 0, which would cause the probability of the entire category to be 0. This would cause the program to be unable to classify the text. What is learning if not adapting to new information?
4. This is a machine learning algorithm. What is the purpose of training and testing data sets? Why might 20% of the data have been reserved for testing, not more, not less?
- The purpose of training and testing data sets is to train the program to classify the text, and then to test the program to see how well it can classify the text. 20% of the data was reserved for testing because it is a good balance between having enough data to train the program and having enough data to test the program. If there was less data, the program would not be able to train as well, and if there was more data, the program would not be able to test as well. What use is precision without accuracy?
5. How did the size of data affected the time complexity? How did you manage this challenge?
- The size of the data affected the time complexity by increasing the run time of the program. I managed this challenge by using a smaller data set to test the program, and then using the full data set to test the program once I was confident that it worked. I had a similar approach to testing as well.

#### Process:

1. How did you approach the assignment? Did you give yourself enough time?
- My main drivers for figuring things out were breaking down the problem into smaller parts and using print statements to see what was happening at each step. I definitely did not give myself enough time to complete the assignment, but I was able to get a working program.
2. What challenges did you have with the code?
- I had a small stumble with laplace smoothing where not all categories were being initalized before adding one, meaning that all zero categories remained zero while all others gained a small boost. I also had a small issue with the testing data set where I was not removing the stop words from the testing data set, which caused the program to be unable to classify the text.
3. What did you learn while working on this assignment?
- I learned how to use JSON files, work with dataframes, and how to use the Multimodal Naive Bayes Classification algorithm with large data sets. 