# NLP & Classification Group Project

### Project Title: News Article Category Predictions
#### Done By: Amogelang Mogwane, Chris Phillip, Moosa Molibeli & Tiisetso Gabaza

© ExploreAI 2024

----

![NewsPaper](newspaper1.jpg)

<a id="cont"></a>
## Table of Contents

<a href=#one>1. Background Context</a>

<a href=#two>2. Importing Packages</a>

<a href=#three>3. Data Collection and Description</a>

<a href=#four>4. Loading Data </a>

<a href=#five>5. Data Cleaning and PreProcessing</a>

<a href=#six>6. Exploratory Data Analysis (EDA)</a>

<a href=#seven>7. Modeling</a>

<a href=#eight>8. Evaluation and Validation</a>

<a href=#nine>9. Final Model</a>

<a href=#ten>10. Conclusion and Future Work</a>

<a href=#eleven>11. References</a>

---
<a id="one"></a>
## Background Context
<a href=#cont>Back to Table of Contents</a>

From the inception of printed newspapers, every article has been assigned to a specific section. While many aspects of the newspaper industry have evolved—ranging from ink and paper types to distribution methods—the practice of categorizing news has persisted across generations, extending into digital formats. Newspaper articles cover a broad spectrum of topics, including politics, sports, and entertainment. Historically, categorization was primarily a manual process, but advancements in technology now allow for automated classification with minimal effort.

This project aims to design and develop an application that predicts the categories of news articles intended for publication. By utilizing classification algorithms, we will analyze the content of articles to determine their respective genres. The proposed algorithm will not only classify existing topics but also adapt to new topics as they emerge in the content. While the algorithm is extendable to multiple languages, this project will primarily focus on English.

News article classification can be framed as a multi-label text classification problem, which poses a significant challenge. Our objective is to assign one or more category labels to each article. For each category, a binary classifier answers "yes" or "no" to indicate whether that category applies to a given article. We will implement several standard algorithms that are commonly used for binary classification, namely K-Nearest Neighbours, Support Vector Machines, and Logistic Regression. We will evaluate these three approaches and select the best model for predicting news categories based on predetermined evaluation criteria.
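
The per-category "yes"/"no" idea described above can be sketched with scikit-learn's one-vs-rest wrapper. This is only an illustration under stated assumptions: the toy corpus and its labels are hypothetical, Logistic Regression stands in for any of the three candidate algorithms, and TF-IDF is assumed as the text representation because `TfidfVectorizer` is among the imports below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, not drawn from the project dataset
texts = [
    "team wins the cricket final",
    "striker scores twice in league match",
    "central bank raises interest rates",
    "company reports quarterly profit",
    "new smartphone chip announced",
    "software update fixes security bug",
]
labels = ["sports", "sports", "business", "business", "technology", "technology"]

# One binary "is it this category?" classifier is fitted per category
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, labels)

ovr = model.named_steps["onevsrestclassifier"]
print(ovr.classes_)          # categories seen during training
print(len(ovr.estimators_))  # one binary estimator per category
print(model.predict(["bank profit rises"]))
```

Adding a category in this setup means fitting one more binary estimator, which is one way to read the "easily updatable" requirement.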

A robust category classification algorithm for news articles must achieve high precision while remaining easily updatable. Given the continuous evolution of news topics and events, the ability to seamlessly add new categories to the classifier is essential.

![News Articles](newspaper2.jpg)

---
<a id="two"></a>
## Importing Packages
<a href=#cont>Back to Table of Contents</a>

**Please Note:**
*Below are all the libraries we expect to need for this project. This list will be adjusted as needed throughout the project.*

In [23]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt 
import seaborn as sns
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

import warnings
warnings.filterwarnings('ignore')

In [1]:
# !pip install wordcloud
# !pip install imbalanced-learn

---
<a id="three"></a>
## Data Collection and Description
<a href=#cont>Back to Table of Contents</a>

This project utilizes a news dataset containing articles categorized into five distinct groups: Sports, Business, Entertainment, Education, and Technology. The dataset is randomly partitioned into training and testing sets, with the challenge of developing a classification model to predict the category of each news article.

The training dataset comprises 5,520 records across five columns, while the testing dataset consists of 2,000 observations with the same column structure. The columns are 'headlines', 'description', 'content', 'url', and 'category'. The target variable for prediction is the 'category' column. Other columns will either be discarded (e.g., the 'url' column) or combined to create a single content column for the articles.
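
As a rough sketch of how the described structure could be checked before any cleaning, the snippet below compares a frame's columns against the expected names and counts articles per category. The two-row `sample_df` is a hypothetical stand-in for `train.csv`, and `expected_columns` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical two-row stand-in for train.csv with the same column names
sample_df = pd.DataFrame({
    "headlines": ["Headline A", "Headline B"],
    "description": ["Summary A", "Summary B"],
    "content": ["Body text A", "Body text B"],
    "url": ["https://example.com/a", "https://example.com/b"],
    "category": ["business", "sports"],
})

expected_columns = ["headlines", "description", "content", "url", "category"]

# Columns named in the description that are absent from the frame
missing = [col for col in expected_columns if col not in sample_df.columns]
print(missing)

# Number of articles per category, useful for spotting class imbalance
category_counts = sample_df["category"].value_counts()
print(category_counts)
```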

---
<a id="four"></a>
## Loading Data
<a href=#cont>Back to Table of Contents</a>

In [24]:
# loading our training and our testing data sets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

In [25]:
# getting a sense of the training data from the first 5 observations
train_df.head()

| | headlines | description | content | url | category |
|---|---|---|---|---|---|
| 0 | RBI revises definition of politically-exposed ... | The central bank has also asked chairpersons a... | The Reserve Bank of India (RBI) has changed th... | https://indianexpress.com/article/business/ban... | business |
| 1 | NDTV Q2 net profit falls 57.4% to Rs 5.55 cror... | NDTV's consolidated revenue from operations wa... | Broadcaster New Delhi Television Ltd on Monday... | https://indianexpress.com/article/business/com... | business |
| 2 | Akasa Air ‘well capitalised’, can grow much fa... | The initial share sale will be open for public... | Homegrown server maker Netweb Technologies Ind... | https://indianexpress.com/article/business/mar... | business |
| 3 | India’s current account deficit declines sharp... | The current account deficit (CAD) was 3.8 per ... | India’s current account deficit declined sharp... | https://indianexpress.com/article/business/eco... | business |
| 4 | States borrowing cost soars to 7.68%, highest ... | The prices shot up reflecting the overall high... | States have been forced to pay through their n... | https://indianexpress.com/article/business/eco... | business |


In [26]:
# understanding the structure of the training dataset
train_df.shape

(5520, 5)

In [27]:
# understanding the structure of the testing dataset
test_df.shape

(2000, 5)

---
<a id="five"></a>
## Data Cleaning and PreProcessing
<a href=#cont>Back to Table of Contents</a>

Data cleaning is an essential first step in any data-driven project, ensuring the dataset is accurate, consistent, and ready for analysis. For our news article classification project, this involves several processes to improve data quality.

We will start by creating a copy of the training dataset. This way, any major changes can be made while preserving the original for reference.

In [28]:
# creating a copy of the training dataframe
train_df_copy = train_df.copy()
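
To illustrate why working on a copy preserves the original, here is a minimal sketch on a hypothetical two-row frame (the names `original` and `working` are not part of the project code). Changes made to the copy leave the source frame untouched.

```python
import pandas as pd

original = pd.DataFrame({"category": ["business", "sports"]})

working = original.copy()  # independent copy (deep by default)
working["category"] = working["category"].str.upper()

print(original["category"].tolist())  # original values are untouched
print(working["category"].tolist())
```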

Next, we'll address any missing or incomplete data. Missing values can distort results and lead to inaccurate conclusions, so it's crucial to identify and handle these gaps through imputation, removal, or replacement.

In [29]:
# checking for missing values
train_df_copy.isnull().sum()

headlines      0
description    0
content        0
url            0
category       0
dtype: int64

Since there are no missing values, we can confidently move on to the next phase of the project.
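
Had gaps been present, the removal and replacement strategies mentioned above might look like the following sketch. The frame is hypothetical, and the choice of which columns to drop on or fill is an assumption made for illustration only.

```python
import pandas as pd

# Hypothetical frame with gaps; the project data has none
df = pd.DataFrame({
    "headlines": ["Headline A", None, "Headline C"],
    "content": ["Body A", "Body B", None],
    "category": ["business", "sports", None],
})

print(df.isnull().sum())

# Removal: a row without a target label cannot be used for training
labelled = df.dropna(subset=["category"])

# Replacement: fill missing text with an empty string so later
# string operations do not fail or propagate NaN
text_cols = ["headlines", "content"]
filled = labelled.copy()
filled[text_cols] = filled[text_cols].fillna("")

print(filled)
```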

Next, we will address any duplicate records to avoid bias in model training. Ensuring each article is unique is essential for accurately training the classification model.

In [30]:
# dropping any duplicates that might exist
train_df_copy.drop_duplicates(inplace=True)

train_df_copy.shape

(5520, 5)

The shape of the training dataset is unchanged at (5520, 5), confirming that it contained no exact duplicate rows. No article is therefore over-represented through repetition when we train the model.
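
A shape comparison only reveals rows that match in every column. As a hedged sketch on hypothetical data, `duplicated()` can count duplicates explicitly, and its `subset` argument can catch articles whose body text repeats under a different headline.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 are identical, rows 2 and 3 share content only
df = pd.DataFrame({
    "headlines": ["A", "A", "B", "C"],
    "content": ["x", "x", "y", "y"],
})

n_exact = int(df.duplicated().sum())                         # identical in every column
n_by_content = int(df.duplicated(subset=["content"]).sum())  # same content only

deduped = df.drop_duplicates().reset_index(drop=True)
print(n_exact, n_by_content, deduped.shape)
```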

Next, we will remove the URL column, as it does not add value to our news category predictions and is considered redundant.

In [31]:
# removing redundant url column
train_df_copy.drop(columns="url", inplace=True)

train_df_copy.head()

| | headlines | description | content | category |
|---|---|---|---|---|
| 0 | RBI revises definition of politically-exposed ... | The central bank has also asked chairpersons a... | The Reserve Bank of India (RBI) has changed th... | business |
| 1 | NDTV Q2 net profit falls 57.4% to Rs 5.55 cror... | NDTV's consolidated revenue from operations wa... | Broadcaster New Delhi Television Ltd on Monday... | business |
| 2 | Akasa Air ‘well capitalised’, can grow much fa... | The initial share sale will be open for public... | Homegrown server maker Netweb Technologies Ind... | business |
| 3 | India’s current account deficit declines sharp... | The current account deficit (CAD) was 3.8 per ... | India’s current account deficit declined sharp... | business |
| 4 | States borrowing cost soars to 7.68%, highest ... | The prices shot up reflecting the overall high... | States have been forced to pay through their n... | business |


Now that we've removed the URL column, we can map the category column for classification. Currently, the categories are in text format, so converting them to integers will allow us to use classification algorithms. We will define a function to facilitate this conversion.

In [32]:
# mapping category column for classification
def map_category(category):
    category_map = {
        'sports': 0,
        'business': 1,
        'entertainment': 2,
        'education': 3,
        'technology': 4
    }
    return category_map.get(category, -1)

# applying the mapping; any category missing from category_map becomes -1
train_df_copy['Category'] = train_df_copy['category'].apply(map_category)

In [33]:
train_df_copy.head()

| | headlines | description | content | category | Category |
|---|---|---|---|---|---|
| 0 | RBI revises definition of politically-exposed ... | The central bank has also asked chairpersons a... | The Reserve Bank of India (RBI) has changed th... | business | 1 |
| 1 | NDTV Q2 net profit falls 57.4% to Rs 5.55 cror... | NDTV's consolidated revenue from operations wa... | Broadcaster New Delhi Television Ltd on Monday... | business | 1 |
| 2 | Akasa Air ‘well capitalised’, can grow much fa... | The initial share sale will be open for public... | Homegrown server maker Netweb Technologies Ind... | business | 1 |
| 3 | India’s current account deficit declines sharp... | The current account deficit (CAD) was 3.8 per ... | India’s current account deficit declined sharp... | business | 1 |
| 4 | States borrowing cost soars to 7.68%, highest ... | The prices shot up reflecting the overall high... | States have been forced to pay through their n... | business | 1 |
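
Two small follow-ups to the mapping may be worth sketching: checking whether any label fell through to the `-1` default, and keeping a reverse lookup so that integer predictions can later be turned back into category names. The `labels` series and the `inverse_map` name below are hypothetical, introduced only for this illustration.

```python
import pandas as pd

category_map = {
    "sports": 0,
    "business": 1,
    "entertainment": 2,
    "education": 3,
    "technology": 4,
}
# Reverse lookup for turning predicted integers back into names
inverse_map = {code: name for name, code in category_map.items()}

# Hypothetical labels, including one that is not in the map
labels = pd.Series(["business", "sports", "politics"])
codes = labels.map(lambda c: category_map.get(c, -1))

unmapped = labels[codes == -1].tolist()
decoded = [inverse_map.get(c, "unknown") for c in codes]

print(codes.tolist())
print(unmapped)
print(decoded)
```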


Now that we have created the numeric 'Category' column, the original text 'category' column is redundant and can be removed to keep the dataframe tidy.

In [34]:
train_df_copy.drop('category', axis=1, inplace=True)

In [35]:
train_df_copy.head()

| | headlines | description | content | Category |
|---|---|---|---|---|
| 0 | RBI revises definition of politically-exposed ... | The central bank has also asked chairpersons a... | The Reserve Bank of India (RBI) has changed th... | 1 |
| 1 | NDTV Q2 net profit falls 57.4% to Rs 5.55 cror... | NDTV's consolidated revenue from operations wa... | Broadcaster New Delhi Television Ltd on Monday... | 1 |
| 2 | Akasa Air ‘well capitalised’, can grow much fa... | The initial share sale will be open for public... | Homegrown server maker Netweb Technologies Ind... | 1 |
| 3 | India’s current account deficit declines sharp... | The current account deficit (CAD) was 3.8 per ... | India’s current account deficit declined sharp... | 1 |
| 4 | States borrowing cost soars to 7.68%, highest ... | The prices shot up reflecting the overall high... | States have been forced to pay through their n... | 1 |


Next, we will merge the text columns 'headlines', 'description', and 'content' into a single column that represents each article in full. This consolidation gives the model all of the available text for an article in one field.

In [36]:
# joining the columns to create one and then dropping the redundant columns
train_df_copy['Content'] = train_df_copy['headlines'] + ' ' + train_df_copy['description'] + ' ' + train_df_copy['content']

train_df_copy.drop(['headlines', 'description', 'content'], axis=1, inplace=True)

In [37]:
train_df_copy.head()

| | Category | Content |
|---|---|---|
| 0 | 1 | RBI revises definition of politically-exposed ... |
| 1 | 1 | NDTV Q2 net profit falls 57.4% to Rs 5.55 cror... |
| 2 | 1 | Akasa Air ‘well capitalised’, can grow much fa... |
| 3 | 1 | India’s current account deficit declines sharp... |
| 4 | 1 | States borrowing cost soars to 7.68%, highest ... |
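
Joining columns with `+` works here because the dataset has no missing values. As a sketch under the assumption that a future dataset might contain gaps, the hypothetical frame below shows how a single missing field turns the whole concatenated row into NaN, and one possible NaN-safe alternative.

```python
import pandas as pd

# Hypothetical frame with one missing description
df = pd.DataFrame({
    "headlines": ["Headline A", "Headline B"],
    "description": ["Summary A", None],
    "content": ["Body A", "Body B"],
})

text_cols = ["headlines", "description", "content"]

# Plain '+' yields NaN for the whole row if any part is missing
naive = df["headlines"] + " " + df["description"] + " " + df["content"]

# Filling gaps first keeps the remaining text
safe = df[text_cols].fillna("").agg(" ".join, axis=1)

print(naive.isna().tolist())
print(safe.tolist())
```

The doubled space left where the description was missing is harmless if the text is tokenized afterwards.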


Lastly, we will standardize the text data by converting everything to lowercase for consistency, removing punctuation, eliminating stop words that add little meaning, and lemmatizing the dataset. These steps will streamline the data, making it easier for classification algorithms to process effectively. Lemmatization will reduce words to their base forms, enhancing classification across all observations.

We will define a function to perform these final cleanup steps on our content column.

In [38]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# NLTK resources needed once per environment (newer NLTK releases may also need 'punkt_tab'):
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

# built once, outside the function, so they are not recreated for every article
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# text cleaning
def clean_text(text):
    text = re.sub(r'\W', ' ', text)    # replacing non-word characters (punctuation, symbols) with spaces
    text = text.lower()    # lowercasing the text
    tokens = word_tokenize(text)   # tokenizing the text

    # removing stop words and lemmatizing to reduce the words to base words
    cleaned_tokens = [
        lemmatizer.lemmatize(token) for token in tokens if token not in stop_words
    ]

    # joining tokens back to string
    return ' '.join(cleaned_tokens)

train_df_copy['Cleaned_Content'] = train_df_copy['Content'].apply(clean_text)

In [39]:
train_df_copy.head()

| | Category | Content | Cleaned_Content |
|---|---|---|---|
| 0 | 1 | RBI revises definition of politically-exposed ... | rbi revise definition politically exposed pers... |
| 1 | 1 | NDTV Q2 net profit falls 57.4% to Rs 5.55 cror... | ndtv q2 net profit fall 57 4 r 5 55 crore impa... |
| 2 | 1 | Akasa Air ‘well capitalised’, can grow much fa... | akasa air well capitalised grow much faster ce... |
| 3 | 1 | India’s current account deficit declines sharp... | india current account deficit decline sharply ... |
| 4 | 1 | States borrowing cost soars to 7.68%, highest ... | state borrowing cost soar 7 68 highest far fis... |
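
The output above shows figures such as "57.4%" turning into the separate tokens "57" and "4". The sketch below reproduces that effect without NLTK so it can run anywhere. It is an approximation, not the project's function: `basic_clean` and the tiny `STOP_WORDS` set are hypothetical, a whitespace split stands in for `word_tokenize`, and lemmatization is left out.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the project uses NLTK's English list
STOP_WORDS = {"to", "the", "of", "a"}

def basic_clean(text):
    """Apply the regex, lowercasing and stop-word steps (no lemmatization)."""
    text = re.sub(r"\W", " ", text)  # every non-word character becomes a space
    tokens = text.lower().split()    # whitespace split stands in for word_tokenize
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = basic_clean("NDTV Q2 net profit falls 57.4% to Rs 5.55 crore")
print(tokens)
```

Because `\W` matches the decimal point, numbers are split at the point; whether that matters depends on how informative numeric tokens turn out to be for the categories.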


We will close off this section by removing the 'Content' column which is now redundant. 

In [40]:
train_df_copy.drop('Content', axis=1, inplace=True)
train_df_copy.head()

| | Category | Cleaned_Content |
|---|---|---|
| 0 | 1 | rbi revise definition politically exposed pers... |
| 1 | 1 | ndtv q2 net profit fall 57 4 r 5 55 crore impa... |
| 2 | 1 | akasa air well capitalised grow much faster ce... |
| 3 | 1 | india current account deficit decline sharply ... |
| 4 | 1 | state borrowing cost soar 7 68 highest far fis... |


By applying these data cleaning techniques, we aim to build a high-quality dataset that improves the reliability of our classification model, ultimately leading to more accurate predictions of news categories.

---
<a id="six"></a>
## Exploratory Data Analysis (EDA)
<a href=#cont>Back to Table of Contents</a>

----
<a id="seven"></a>
## Modeling
<a href=#cont>Back to Table of Contents</a>

---
<a id="eight"></a>
## Evaluation and Validation
<a href=#cont>Back to Table of Contents</a>

---
<a id="nine"></a>
## Final Model
<a href=#cont>Back to Table of Contents</a>

---
<a id="ten"></a>
## Conclusion and Future Work
<a href=#cont>Back to Table of Contents</a>

---
<a id="eleven"></a>
## References
<a href=#cont>Back to Table of Contents</a>

[1] Rao, S., Sudarshan, K. and Abhishek (2020) 'News Article Category Predictor', Department of Computer Science and Engineering, Srinivas Institute of Technology, Valachil, India.

[2] Tong, S. and Koller, D. (2000) 'Support vector machine active learning with applications to text classification', in Langley, P. (ed.) Proceedings ICML-00, 17th International Conference on Machine Learning, pp. 999–1006.

[3] McCallum, A. and Nigam, K. (1998) 'A comparison of event models for naive Bayes text classification', in AAAI/ICML-98 Workshop on Learning for Text Categorization.