## News Classification

Use Jupyter Notebooks to create a News Classifier with Multimodal Naive Bayes Classification.

### Read in the Data:

Use the Kaggle News Category Classification data set for this project. Note this is a JSON dataset and will be read in differently than a CSV.

Markdown help:
https://www.markdownguide.org/cheat-sheet/

In [None]:
#This block will read in the data, converting from JSON to a dataframe

import pandas, json

df = pandas.read_json("News_Category_Dataset_v3.json", lines=True)

# let's see what categories/keys we're working with
# here's an example:
# {
    # "link": "https://www.huffpost.com/entry/covid-boosters-uptake-us_n_632d719ee4b087fae6feaac9",
    # "headline": "Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters",
    # "category": "U.S. NEWS",
    # "short_description": "Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.",
    # "authors": "Carla K. Johnson, AP",
    # "date": "2022-09-23"
# }

# so we have
# keys = [link, headline, category, short_description, authors, date]

# let's check to make sure all our data is at least somewhat valid
# print(df.info())


### Test Data: 

Before you start building your classifier, remove 20% of the data and place it in a separate table; this will be your testing dataset. 

Once you have created your model (Steps 0 and 1) with the other 80% of the data you will test to see how accurate your model is by having the program categorize the records in your testing data set.


In [None]:
# this block will partition our data into the 80-20 split we need

# let's find out the shape of our dataframe

# print(df.shape) returns (209527, 6), which means we have 209527 rows and 6 columns

# we'll calculate the last 20% and reserve those entries for testing

test_index_start = int(df.shape[0]*4/5)

model_bounds:tuple = (0, test_index_start-1)

test_bounds:tuple = (test_index_start, df.shape[0])

print(model_bounds)

print(test_bounds)



### Stop Words:

Just like in your first Project you should remove the Stop Words in all of the articles before building your model or testing.


In [83]:
# this block will help remove stop words from our 80% before testing
# for each row in the dataframe, we'll remove stop words from the headline and short_description

# we'll use the nltk library to help us with this
import time
import nltk
import string
from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download('stopwords')

punc = str.maketrans('', '', string.punctuation)

stop_words = stopwords.words("english")

# here's a list of all words, coupled with their frequency in each category
# each word is a key, and the value is a dictionary of categories and their counts
# {word : {category : count, category : count, ...}, word : {category : count, category : count, ...}, ...}
words_list = {}

# we'll also keep track of the total number of words in each category through the incrementWord method
category_word_counts = {}

# this is our method from the last project, modified to work here
# this method takes in a word or string and returns the cleaned word or string
def cleanString(current_word:str, article_category:str):
    # remove all numbers: if a number is found, return without adding word
    for num in range(0,9):
        if (current_word.find(str(num)) != -1):
            return ""

    # make it all lowercase and remove most punctuation
    current_word = current_word.casefold().translate(punc)
    cleanedString:str = ""
    # completely ditch the word if it's just an empty string
    if (current_word != ''):
        # if the above operations turned the word into multiple words, recurse this method for each new word
        if (current_word.find(' ') != -1):
            allWords = current_word.split()
            for i in range(len(allWords)):
                cleaned_word = cleanString(allWords[i], article_category)
                if cleaned_word is not None:
                    cleanedString = cleanedString + " " + cleaned_word
            while (cleanedString.find("  ") != -1):
                cleanedString = cleanedString.replace("  ", " ")
            return cleanedString.strip()

        # make sure the word is not a stopword before taking it into account
        if current_word not in stop_words:
            # print("\""+thisWord+"\" is KEPT")
            # add 1 to the count of this word for this category
            incrementWord(current_word, article_category)
            return current_word
        else:
            # print("\""+thisWord+"\" is TOSSED")
            return ""
    else:
        return ""

# this is a method to send valid words into wordsList or increment their count
def incrementWord(current_word: str, article_category:str):
    print("incrementing "+current_word+" in "+article_category)
    # for the current_word, we need to check if it's already in wordsList
    if(current_word in words_list):
        # if it is, we need to check if it's in the current category
        if(article_category in words_list[current_word]):
            # if it is, we need to increment the count
            words_list[current_word][article_category] += 1
        else:
            # if it's not, we need to add it to the current category
            words_list[current_word][article_category] = 1
    else:
        # if it's not, we need to add it to words_list and add the category
        words_list[current_word] = {article_category:1}

# now we call this method to replace each headline and short_description in df with the cleaned version
# this will take a while, so we'll print out the progress as we go

for i in range(100+0*df.shape[0]):
    df.loc[i, 'headline'] = cleanString(df['headline'].loc[i], df['category'].loc[i])
    df.loc[i, 'short_description'] = cleanString(df['short_description'].loc[i], df['category'].loc[i])
    # todo: timer here later
    # print("Cleaned row "+str(i)+" of "+str(df.shape[0]))

print(words_list)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\liamz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\liamz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


incrementing american in U.S. NEWS
incrementing airlines in U.S. NEWS
incrementing flyer in U.S. NEWS
incrementing charged in U.S. NEWS
incrementing banned in U.S. NEWS
incrementing life in U.S. NEWS
incrementing punching in U.S. NEWS
incrementing flight in U.S. NEWS
incrementing attendant in U.S. NEWS
incrementing video in U.S. NEWS
incrementing subdued in U.S. NEWS
incrementing passengers in U.S. NEWS
incrementing crew in U.S. NEWS
incrementing fled in U.S. NEWS
incrementing back in U.S. NEWS
incrementing aircraft in U.S. NEWS
incrementing confrontation in U.S. NEWS
incrementing according in U.S. NEWS
incrementing us in U.S. NEWS
incrementing attorneys in U.S. NEWS
incrementing office in U.S. NEWS
incrementing los in U.S. NEWS
incrementing angeles in U.S. NEWS
incrementing dog in COMEDY
incrementing dont in COMEDY
incrementing understand in COMEDY
incrementing could in COMEDY
incrementing eaten in COMEDY
incrementing accidentally in PARENTING
incrementing put in PARENTING
incrementin


### Classifier:

You may use whatever python libraries you wish but you should write the code for each of the four steps in the Multimodal Naive Bayes Classification yourself.

- Step 0: Laplace Smoothing (remember this is for word counts not the categories’ probabilities)

- Step 1: Find probabilities for each word for each category

- Step 2: Calculate the probability that a record in the testing data set is part of each category.

- Step 3: Compare probabilities calculated in Step 2. Choose the largest probability to assign the category tag for the data.


In [None]:
# this block will be our classifier function

# step 0: Laplace smoothing
def laplaceSmoothing():
    # for every single word and category, add 1 to the count
    for word in words_list:
        for category in words_list[word]:
            words_list[word][category] += 1

# laplaceSmoothing()

# step 1: Find probabilities for each word in each category, from both headline and short_description

# step 2: Calculate the probability that a record in the testing data set is part of each category

# step 3: Compare probabilities calculated in Step 2. Choose the largest probability to assign the category tag for the data.

### Results:

Since you have the categories that the test data was originally sorted into you can compare the predicted probabilities with the original classifications of the news articles. Report the overall effectiveness of your classifier as a percentage of news items categorized correctly. This should be done as a percentage across all the data in the test data set as well as the percent in each category that were categorized correctly.


In [None]:
# here we can print the effectiveness of our classifier


### Documentation:

Make sure to comment your code and give credit to any sources you used in creating your code.

### Reflection:

Create a document with the answers to the following questions. This should be formatted with each numbered question as the start of its own section, with the answer below it. Make sure your document has a title and your name on it.

#### Technical:

1. How is JSON formatted? Give an example with an explanation.
2. Why were the stop words removed from the text? If you don’t do this step in your code what changes? What other steps could you take along the same lines to improve your program?
3. Why was Laplace Smoothing done?
- What is learning if not adapting to new information?
4. This is a machine learning algorithm. What is the purpose of training and testing data sets? Why might 20% of the data have been reserved for testing, not more, not less?
- What use is precision without accuracy?
5. How did the size of data affected the time complexity? How did you manage this challenge?

#### Process:

1. How did you approach the assignment? Did you give yourself enough time?
2. What challenges did you have with the code?
3. What did you learn while working on this assignment?