# MACHINE LEARNING MODEL FOR NEWS ARTICLE CLASSIFICATION

## Data pre-processing

In [40]:
# importing libraries
import numpy as np
import pandas as pd
import sklearn
import nltk

In [2]:
print("numpy",np.__version__)
print("pandas",pd.__version__)
print("sklearn",sklearn.__version__)

numpy 1.26.4
pandas 2.2.2
sklearn 1.4.2


In [5]:
# creating pandas DataFrame
df = pd.read_csv('NEWS_CATEGORY_DATASET.csv')
df.head()

Unnamed: 0,CATEGORY,TITLE
0,SCIENCE,A closer look at water-splitting's solar fuel ...
1,SCIENCE,An irresistible scent makes locusts swarm
2,SCIENCE,Artificial intelligence warning: AI will know ...
3,SCIENCE,Glaciers Could Have Sculpted Mars Valleys: Study
4,SCIENCE,Perseid meteor shower 2020: What time and how ...


In [7]:
# News Categories
pd.unique(df['CATEGORY'])

array(['SCIENCE', 'TECHNOLOGY', 'HEALTH', 'WORLD', 'ENTERTAINMENT',
       'SPORTS', 'BUSINESS', 'NATION'], dtype=object)

In [9]:
df.info()
print(df.columns)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108691 entries, 0 to 108690
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   CATEGORY  108691 non-null  object
 1   TITLE     108627 non-null  object
dtypes: object(2)
memory usage: 1.7+ MB
Index(['CATEGORY', 'TITLE'], dtype='object')


### Tokenizing Titles

Tokenization breaks each title into individual components (tokens), which simplifies processing and lets a model map words to numerical representations.
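To make the idea of a token concrete, here is a minimal sketch using only the standard library. The `simple_tokenize` helper and its regular expression are assumptions made for illustration; they are a simplified stand-in and not the algorithm NLTK's `word_tokenize` uses.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word, possessive and punctuation tokens."""
    # Words (optionally hyphenated), then possessive-style suffixes, then single punctuation marks
    pattern = r"\w+(?:-\w+)*|'\w+|[^\w\s]"
    return re.findall(pattern, text.lower())

tokens = simple_tokenize("Glaciers Could Have Sculpted Mars Valleys: Study")
print(tokens)
```

Each headline becomes a list of lowercase strings, with punctuation kept as separate tokens so that it can be filtered out in a later step.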

In [12]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mdshahbazshamim/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [13]:
# importing required library
from nltk.tokenize import word_tokenize

TOKENIZED_TITLES = []

for headline in df['TITLE']:
    if isinstance(headline, str):  # Check if the value is a string
        TOKENIZED_TITLES.append(word_tokenize(headline.lower()))
    else:
        TOKENIZED_TITLES.append([])  # Missing titles (non-string values such as NaN) become empty token lists

### Pickling Tokenized Titles

Pickling saves the tokenized titles to disk so they can be reloaded later without re-tokenizing on every run. This keeps results consistent between runs, speeds up the workflow, and makes it easy to share preprocessed data between teams.

In [15]:
# Required library
import pickle
import os

# Ensuring the directory exists
os.makedirs('pklFiles', exist_ok=True)

# Now open the file and pickle data
file = "pklFiles/TOKENIZED_TITLES.pkl"
with open(file, 'wb') as fileobj:
    pickle.dump(TOKENIZED_TITLES, fileobj)
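The saved file is only useful if it can be read back. The following sketch shows the round trip with `pickle.dump` and `pickle.load`; the `sample_titles` list and the temporary directory are stand-ins chosen so the example does not touch the real `pklFiles` folder.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for TOKENIZED_TITLES
sample_titles = [["irresistible", "scent"], ["glaciers", "mars"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "TOKENIZED_TITLES.pkl")

    # Write the object to disk
    with open(path, "wb") as fileobj:
        pickle.dump(sample_titles, fileobj)

    # Read it back in a later session
    with open(path, "rb") as fileobj:
        restored_titles = pickle.load(fileobj)

print(restored_titles == sample_titles)
```

Pickle files can execute code when loaded, so it is safest to unpickle only files that you or your team created.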

### Removal of Stop Words, Punctuation & Possessives ('s)

Removing stop words, punctuation, and possessives strips out tokens that carry little information about the topic. This lets the model concentrate on the key terms, which can improve its precision and efficiency. It also shortens the text, leaving fewer distinct tokens for the model to process.

In [72]:
# Required Libraries
from nltk.corpus import stopwords
import string

# stop words for English language
stop_words = set(stopwords.words("english"))
print("stop words : ")
print(stop_words)

# punctuations
punctuations = set(string.punctuation)
print("\npunctuations : ")
print(punctuations)

# FILTERED TITLE = Title without stop words & punctuations
FILTERED_TITLES = []

for title in TOKENIZED_TITLES:
    temp_title = []
    for word in title:
        if((word not in stop_words) and (word not in punctuations) and (word != "'s")):
            temp_title.append(word)

    FILTERED_TITLES.append(temp_title)

print("\nFiltered Titles : ")
print(FILTERED_TITLES[0:5])

stop words : 
{'which', 'few', 'than', 'hadn', "doesn't", 'do', 't', 'under', 'an', "don't", "should've", 'himself', 'until', "he'd", "she'd", "we'd", 'both', "they'd", "it'd", "mustn't", "they'll", 'don', 's', 'below', 'myself', "she'll", "won't", 'herself', 'can', 'all', "isn't", 'ma', 'because', 'our', 'being', 'll', 'where', 'just', 'when', 'your', "hadn't", 'am', 'their', 'once', 'she', 'doing', 'couldn', 'i', 'through', 'aren', 'in', 'won', 'further', "hasn't", 'of', 'up', 'those', 'haven', 'needn', 'me', 'as', 'above', 'how', 'more', 'before', "mightn't", 'against', 'he', 'between', 'shouldn', 'very', 'any', 'been', "you'd", 'over', 'so', 'each', 'itself', 'who', 'y', 'a', 'her', 'should', 'nor', 'most', 'doesn', 'about', 'some', "aren't", 'didn', 'for', 'm', 'had', "it's", 'will', 'mustn', 'own', "they're", 'isn', "i've", 'o', 'same', 'why', "they've", "weren't", 'them', 'down', 'be', 'from', 'that', 'there', 'themselves', "i'm", "shan't", 'after', 'into', "needn't", 'no', 'out

In [74]:
# Pickling FILTERED_TITLES

file = "pklFiles/FILTERED_TITLES.pkl"
with open(file, 'wb') as fileobj:
    pickle.dump(FILTERED_TITLES, fileobj)

### Stemmed Titles Headlines

Stemming reduces each word in a headline to its root form (for instance, changing "running" to "run"). This collapses word variations into a single token, making it easier for the model to identify patterns, which can improve classification accuracy. It also speeds up processing by decreasing the number of unique words the model has to handle.

In [77]:
# stemming using porter stemmer

# Required Library
from nltk.stem import PorterStemmer

porter = PorterStemmer()

STEMMED_TITLES_HEADLINES = []

for title in FILTERED_TITLES:
    temp_title = []
    for word in title:
        temp_title.append(porter.stem(word))

    STEMMED_TITLES_HEADLINES.append(" ".join(temp_title))

print("\nStemmed Titles Headlines : ")
print(STEMMED_TITLES_HEADLINES[0:5])


Stemmed Titles Headlines : 
['closer look water-split solar fuel potenti', 'irresist scent make locust swarm', 'artifici intellig warn ai know us better know', 'glacier could sculpt mar valley studi', 'perseid meteor shower 2020 time see huge bright firebal uk tonight']


In [143]:
# pickling STEMMED_TITLES_HEADLINES

file = "pklFiles/STEMMED_TITLES_HEADLINES.pkl"
with open(file, 'wb') as fileobj:
    pickle.dump(STEMMED_TITLES_HEADLINES, fileobj)

In [145]:
# Data Frame
df.head()

Unnamed: 0,CATEGORY,TITLE
0,SCIENCE,closer look water-split solar fuel potenti
1,SCIENCE,irresist scent make locust swarm
2,SCIENCE,artifici intellig warn ai know us better know
3,SCIENCE,glacier could sculpt mar valley studi
4,SCIENCE,perseid meteor shower 2020 time see huge brigh...


In [147]:
# Replacing HEADLINES with STEMMED TITLES HEADLINES
df = df.drop(['TITLE'], axis=1)
df.insert(1,"TITLE",STEMMED_TITLES_HEADLINES, True)

df.head()

Unnamed: 0,CATEGORY,TITLE
0,SCIENCE,closer look water-split solar fuel potenti
1,SCIENCE,irresist scent make locust swarm
2,SCIENCE,artifici intellig warn ai know us better know
3,SCIENCE,glacier could sculpt mar valley studi
4,SCIENCE,perseid meteor shower 2020 time see huge brigh...


### Encoding News Categories

In [150]:
# Required Library
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()

# Adding Column of ENCODED CATEGORY
df.insert(2, "ENCODED CATEGORY",labelencoder.fit_transform(df['CATEGORY']),True)

# BUSINESS -> 0
# ENTERTAINMENT -> 1
# HEALTH -> 2
# NATION -> 3
# SCIENCE -> 4
# SPORTS -> 5
# TECHNOLOGY -> 6
# WORLD -> 7

df.head()

Unnamed: 0,CATEGORY,TITLE,ENCODED CATEGORY
0,SCIENCE,closer look water-split solar fuel potenti,4
1,SCIENCE,irresist scent make locust swarm,4
2,SCIENCE,artifici intellig warn ai know us better know,4
3,SCIENCE,glacier could sculpt mar valley studi,4
4,SCIENCE,perseid meteor shower 2020 time see huge brigh...,4
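The integer codes in the comments above follow from alphabetical ordering of the category names. A minimal sketch of that mapping, written with plain Python rather than `LabelEncoder` so the rule is visible (the dictionary names are assumptions for illustration):

```python
# Category names as returned by pd.unique(df['CATEGORY']) earlier in the notebook
categories = ['SCIENCE', 'TECHNOLOGY', 'HEALTH', 'WORLD',
              'ENTERTAINMENT', 'SPORTS', 'BUSINESS', 'NATION']

# Sort the distinct names and number them from 0
classes = sorted(set(categories))
category_to_code = {name: code for code, name in enumerate(classes)}
code_to_category = {code: name for name, code in category_to_code.items()}

for name, code in category_to_code.items():
    print(f"{name} -> {code}")
```

In the notebook itself, `labelencoder.classes_` holds the same sorted names and `labelencoder.inverse_transform` converts codes back to names, which avoids maintaining the mapping by hand.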


In [152]:
# pickling DataFrame

file = "pklFiles/DATAFRAME.pkl"
with open(file, 'wb') as fileobj:
    pickle.dump(df, fileobj)

print(type(df))
df.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,CATEGORY,TITLE,ENCODED CATEGORY
0,SCIENCE,closer look water-split solar fuel potenti,4
1,SCIENCE,irresist scent make locust swarm,4
2,SCIENCE,artifici intellig warn ai know us better know,4
3,SCIENCE,glacier could sculpt mar valley studi,4
4,SCIENCE,perseid meteor shower 2020 time see huge brigh...,4


## Using Naive Bayes Classification

Naive Bayes classification is a simple and efficient algorithm that uses probability to classify data. It assumes that features are independent of each other given the class, which makes it fast and often gives reliable results in applications such as spam detection and text classification.
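For the multinomial variant used later in this notebook, the decision rule can be written as below. The notation is an assumption introduced here: $c$ is a category, $x_i$ is the count of vocabulary word $w_i$ in a headline, and $V$ is the vocabulary size.

$$
\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{V} x_i \log P(w_i \mid c) \right]
$$

Working with logarithms turns a product of many small probabilities into a sum, which avoids numerical underflow.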

### Data Understanding

By analyzing the data, we can identify trends, address problems, and make well-informed decisions, which leads to outcomes that are both accurate and dependable.

In [156]:
# Loading Data Frame
import pickle

file = "pklFiles/DATAFRAME.pkl"
with open(file, 'rb') as fileobj:
    df = pickle.load(fileobj)

print(type(df))
df.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,CATEGORY,TITLE,ENCODED CATEGORY
0,SCIENCE,closer look water-split solar fuel potenti,4
1,SCIENCE,irresist scent make locust swarm,4
2,SCIENCE,artifici intellig warn ai know us better know,4
3,SCIENCE,glacier could sculpt mar valley studi,4
4,SCIENCE,perseid meteor shower 2020 time see huge brigh...,4


### Selecting features and labels

Choosing the right features and labels increases a model's accuracy and efficiency. Focusing on the significant data not only reduces the model's complexity but also improves its ability to make correct predictions. Labels represent the outcomes that the model must learn from. Here the feature is the stemmed title and the label is the encoded category.

In [159]:
# News Headlines
X = df['TITLE']

# Encoded News Category
Y = df['ENCODED CATEGORY']

### Data Division (Train and Test)

Dividing the dataset into training and testing subsets allows us to evaluate the model's performance on unseen data, helps us detect overfitting, and enables us to adjust the model for improved outcomes.

In [162]:
# Required Library
from sklearn.model_selection import train_test_split

# Testing_set -> 25% and Training_set -> 75%
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.25,random_state=50)

In [164]:
# understanding how much data is used for training and testing from the original dataset

print("shape of X : " + str(X.shape))
print("shape of Y : " + str(Y.shape))

print("\n")
print("shape of X_train : " + str(X_train.shape))
print("shape of Y_train : " + str(Y_train.shape))
print("shape of X_test : " + str(X_test.shape))
print("shape of Y_test : " + str(Y_test.shape))

shape of X : (108691,)
shape of Y : (108691,)


shape of X_train : (81518,)
shape of Y_train : (81518,)
shape of X_test : (27173,)
shape of Y_test : (27173,)
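The shapes printed above follow from simple arithmetic. A small sketch, assuming the test count is the requested fraction rounded up and the training set takes the remainder (this rounding rule is an assumption, though it reproduces the printed shapes):

```python
import math

n_samples = 108691   # rows in the DataFrame
test_size = 0.25     # fraction passed to train_test_split

# Test rows: the fraction of the data, rounded up to a whole row
n_test = math.ceil(test_size * n_samples)

# Training rows: everything that is left
n_train = n_samples - n_test

print("train:", n_train, "test:", n_test)
```

Because `random_state=50` is fixed, the same rows land in each subset on every run, which keeps later accuracy figures comparable.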


### Feature Selection 

Selecting relevant features enhances the accuracy of the model, reduces the risk of overfitting, speeds up training, and simplifies interpretation by concentrating on the most important data.

### Bag of Words (BoW) Approach

The Bag of Words (BoW) method converts text into counts of word occurrences while disregarding grammar and word order. It builds a vocabulary of all distinct words in the dataset and then represents each document as a vector holding the frequency of each word. Although this is a straightforward way to analyze text, it does not capture the meaning or order of the words.
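The following sketch builds a bag-of-words representation by hand for two of the stemmed titles shown earlier. It uses only the standard library; the `tokenize` and `to_vector` helpers are illustrative assumptions that loosely mimic, but do not reproduce, what `CountVectorizer` does internally.

```python
import re
from collections import Counter

docs = [
    "irresist scent make locust swarm",
    "artifici intellig warn ai know us better know",
]

def tokenize(doc):
    # Keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token across all documents, in sorted order
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

def to_vector(doc):
    # One count per vocabulary word; words absent from the document count as 0
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

The word "know" appears twice in the second title, so its column holds 2 for that row and 0 for the first. Word order is lost, which is the trade-off described above.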

In [167]:
# Converting text documents into matrix of token counts

# Required Library
from sklearn.feature_extraction.text import CountVectorizer

# Instantiating CountVectorizer
count_vectorizer = CountVectorizer()

# Fitting & Transforming Training Data (X_train)
count_X_train = count_vectorizer.fit_transform(X_train.values)

# Transforming Testing Data (X_test)
count_X_test = count_vectorizer.transform(X_test.values)

# Saving count_vectorizer
with open("pklFiles/count_vectorizer.pkl", "wb") as fileobj:
    pickle.dump(count_vectorizer, fileobj)

### Model Training

Model training allows a machine learning model to learn patterns within data, enabling it to make predictions on new information. This process is what lets a model perform tasks like detecting spam or making product recommendations.

In [170]:
# Multinomial Naive Bayes Classifier -> Predicts category based on the word frequency in text

# Required Library
from sklearn.naive_bayes import MultinomialNB

# Instantiating Naive Bayes Classifier with alpha = 1.0
nb_classifier = MultinomialNB()

# Fitting nb_classifier to training Data 
nb_classifier.fit(count_X_train, Y_train)

# Saving nb_classifier for count_vectorizer
with open("pklFiles/nb_classifier_for_count_vectorizer.pkl", "wb") as fileobj:
    pickle.dump(nb_classifier, fileobj)

### Model Prediction

Making predictions with a model is crucial for applying its learned insights to new data. This process aids in automating certain tasks, facilitating data-informed decisions, and evaluating the model's performance in real-life scenarios.

In [173]:
# Prediction
pred = nb_classifier.predict(count_X_test)

### Evaluation of Prediction

#### Accuracy Score & Confusion Matrix

* *Accuracy Score*: Reflects the frequency with which the model produces correct predictions. It is calculated as the proportion of accurate predictions against the overall number of predictions made.
* *Confusion Matrix*: A table that breaks down the model's predictions by class. Each row is a true category and each column is a predicted category, so the diagonal holds the correct predictions and the off-diagonal cells show which categories are confused with each other. It shows where the model excels or falters, offering deeper insight than the accuracy metric alone.

In [176]:
# Required Library
from sklearn import metrics

# Accuracy
accuracy=metrics.accuracy_score(Y_test,pred)
print("accuracy: "+str("{:.2f}".format(accuracy*100)),"%")

print("\n")

# Confusion Matrix
# Labels : 0(BUSINESS), 1(ENTERTAINMENT), 2(HEALTH), 3(NATION), 4(SCIENCE), 5(SPORTS), 6(TECHNOLOGY), 7(WORLD)
# Rows are the true labels, ordered from 0 to 7
# Columns are the predicted labels, ordered from 0 to 7
conf_matrix=metrics.confusion_matrix(Y_test,pred)
print("confusion_matrix: ")
print(conf_matrix)

accuracy: 77.00 %


confusion_matrix: 
[[2744   57  267  174    5   40  215  230]
 [  42 3234   72  122    8   71  141  110]
 [ 104   42 3270  142   21   23   32  159]
 [ 185  143  449 2303   12   84   51  578]
 [  31   25  104   10  689    3   31   28]
 [  37  124   56   64    3 3304   59   64]
 [ 146  148   88   34   26   46 3192   54]
 [ 185  143  548  488   25   42   60 2186]]
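The matrix printed above contains more than the overall accuracy. As a sketch, the code below recomputes accuracy from it and derives per-category precision and recall using plain Python. The matrix values are copied from the output above; the variable names are assumptions for illustration.

```python
labels = ["BUSINESS", "ENTERTAINMENT", "HEALTH", "NATION",
          "SCIENCE", "SPORTS", "TECHNOLOGY", "WORLD"]

# Rows: true category, columns: predicted category
conf_matrix = [
    [2744,   57,  267,  174,    5,   40,  215,  230],
    [  42, 3234,   72,  122,    8,   71,  141,  110],
    [ 104,   42, 3270,  142,   21,   23,   32,  159],
    [ 185,  143,  449, 2303,   12,   84,   51,  578],
    [  31,   25,  104,   10,  689,    3,   31,   28],
    [  37,  124,   56,   64,    3, 3304,   59,   64],
    [ 146,  148,   88,   34,   26,   46, 3192,   54],
    [ 185,  143,  548,  488,   25,   42,   60, 2186],
]

n = len(labels)
total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(n))
accuracy = correct / total

# Recall: correct predictions divided by the row total (all true members of the class)
recall = {labels[i]: conf_matrix[i][i] / sum(conf_matrix[i]) for i in range(n)}

# Precision: correct predictions divided by the column total (all predictions of the class)
precision = {
    labels[j]: conf_matrix[j][j] / sum(row[j] for row in conf_matrix)
    for j in range(n)
}

print(f"accuracy: {accuracy * 100:.2f} %")
for name in labels:
    print(f"{name:<14} precision={precision[name]:.3f} recall={recall[name]:.3f}")
```

The weakest recall belongs to WORLD and NATION, which are often mistaken for each other: 578 NATION headlines were predicted as WORLD and 488 WORLD headlines as NATION.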


##### Laplace Smoothing

Laplace smoothing prevents the model from assigning a probability of zero to words or categories that have not been encountered before, ensuring that every potential outcome is considered. This enhancement increases the accuracy and reliability of the model, particularly when it comes to new or infrequent data.
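As a worked illustration, additive (Laplace) smoothing adds `alpha` to every word count before normalizing. The function name and the counts below are hypothetical and chosen only to show the arithmetic.

```python
def smoothed_likelihood(word_count, total_count, vocab_size, alpha=1.0):
    """Estimate P(word | category) with additive smoothing."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Hypothetical category: 1000 word occurrences in total, vocabulary of 5000 words,
# and a word that never appeared in this category during training
with_smoothing = smoothed_likelihood(0, 1000, 5000, alpha=1.0)
without_smoothing = smoothed_likelihood(0, 1000, 5000, alpha=0.0)

print("alpha=1.0:", with_smoothing)
print("alpha=0.0:", without_smoothing)
```

With `alpha=0.0` the unseen word gets probability zero, which would rule the category out for any headline containing that word. With `alpha=1.0` it gets a small non-zero probability instead.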

In [179]:
# Laplace Smoothing (Tuning parameter - alpha)

# List of alphas
alphas =np.arange(0,1,0.1)

# Function for training nb_classifier with different alpha values
def train_and_predict(alpha):
    # Instantiating Naive Bayes Classifier
    nb_classifier = MultinomialNB(alpha=alpha)

    # Fitting nb_classifier to training Data 
    nb_classifier.fit(count_X_train, Y_train)

    # Prediction
    pred = nb_classifier.predict(count_X_test)
    
    # Accuracy Score
    accuracy=metrics.accuracy_score(Y_test,pred)

    return accuracy

# Iterating over alphas & printing the corresponding Accuracy Score
for alpha in alphas:
    print("alpha : ",alpha)
    print("Accuracy Score : ",train_and_predict(alpha))
    print()

alpha :  0.0
Accuracy Score :  0.7130607588414971

alpha :  0.1
Accuracy Score :  0.7793029845802819

alpha :  0.2
Accuracy Score :  0.7786037610863725

alpha :  0.30000000000000004
Accuracy Score :  0.7768005004968167

alpha :  0.4
Accuracy Score :  0.775549258455084

alpha :  0.5
Accuracy Score :  0.7750340411437824

alpha :  0.6000000000000001
Accuracy Score :  0.7742244139403084

alpha :  0.7000000000000001
Accuracy Score :  0.7725315570603172

alpha :  0.8
Accuracy Score :  0.7717955323298863

alpha :  0.9
Accuracy Score :  0.7706914952342399





With alpha=1.0, we got an accuracy of 77%.

Trying other values of alpha from 0.1 to 0.9 still gives an accuracy of roughly 77% to 78%, while alpha=0.0 (no smoothing) drops to about 71%.

The gain is under one percentage point, so we don't need to change the value of alpha.
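Two details of the sweep are worth a closer look: `np.arange(0, 1, 0.1)` includes 0.0 (no smoothing), stops before 1.0, and yields values such as 0.30000000000000004 because the step 0.1 has no exact binary floating-point representation. The sketch below builds a cleaner grid and picks the best of the scores reported above. The `reported` dictionary is copied from the output; using it this way is an illustration, not part of the original notebook.

```python
# A grid from 0.1 to 1.0 that skips alpha=0 and avoids floating-point drift
alphas = [i / 10 for i in range(1, 11)]
print(alphas)

# Accuracy scores copied from the output above
reported = {
    0.0: 0.7130607588414971,
    0.1: 0.7793029845802819,
    0.2: 0.7786037610863725,
    0.3: 0.7768005004968167,
    0.4: 0.775549258455084,
    0.5: 0.7750340411437824,
    0.6: 0.7742244139403084,
    0.7: 0.7725315570603172,
    0.8: 0.7717955323298863,
    0.9: 0.7706914952342399,
}

best_alpha = max(reported, key=reported.get)
print("best alpha:", best_alpha, "accuracy:", round(reported[best_alpha], 4))
```

These scores were measured on the test set, so picking alpha from them gives a slightly optimistic estimate. A separate validation split would be a cleaner way to compare values.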

### Loading Model

* *Count Vectorizer*: This tool converts text into numerical values by tallying the frequency of each word found in a document. It transforms text into a format that algorithms can utilize.
* *Naive Bayes Classifier*: This algorithm uses these numerical values to predict categories. For instance, it can determine whether an email is spam based on its word composition.

In [183]:
import pickle

try:
    # Load the CountVectorizer and Naive Bayes Classifier
    with open("pklFiles/count_vectorizer.pkl", "rb") as f:
        count_vectorizer = pickle.load(f)

    with open("pklFiles/nb_classifier_for_count_vectorizer.pkl", "rb") as f:
        nb_classifier = pickle.load(f)

    print("Models loaded successfully.")

except FileNotFoundError as e:
    print(f"File not found: {e}")

except pickle.UnpicklingError as e:
    print(f"Error unpickling object: {e}")

except Exception as e:
    print(f"An error occurred: {e}")

Models loaded successfully.
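Because the vectorizer and the classifier are always used together, one option is to wrap them in a single scikit-learn pipeline, so that only one object has to be saved and loaded. This is a sketch of that alternative, not the approach the notebook takes; the toy titles and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data standing in for the stemmed titles and their categories
toy_titles = [
    "stock market fall bank rate",
    "bank profit rise share market",
    "team win final match goal",
    "player score goal match season",
]
toy_labels = ["BUSINESS", "BUSINESS", "SPORTS", "SPORTS"]

# The pipeline vectorizes the text and then classifies it in one call
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(toy_titles, toy_labels)

toy_prediction = model.predict(["bank share market"])[0]
toy_probabilities = model.predict_proba(["bank share market"])[0]

print(toy_prediction)
print(toy_probabilities)
```

A fitted pipeline can be pickled like any other object, which removes the risk of loading a classifier together with a vectorizer it was not trained with.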


### Taking User Input

Taking a headline from the user lets us apply the saved vectorizer and classifier to new text: the input is converted to word counts and the model returns its predicted category.

In [200]:
# Value encoded by Label Encoder
encoded = {0:'BUSINESS',1:'ENTERTAINMENT',2:'HEALTH',3:'NATION',4:'SCIENCE',5:'SPORTS',6:'TECHNOLOGY',7:'WORLD'}

# Input
User_Headline = [input("News Headline : ")]

# Transformation and Prediction of User Headline
Headline_counts = count_vectorizer.transform(User_Headline)
prediction = nb_classifier.predict(Headline_counts)

print("News Category : ",encoded[prediction[0]])

News Headline :  https://timesofindia.indiatimes.com/world/south-asia/bangladesh-currency-drops-bangabandhus-image-new-notes-without-sheikh-mujibur-rahmans-picture-are-out/articleshow/121559121.cms


News Category :  WORLD
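The vectorizer was fitted on lowercased, filtered, and stemmed titles, so a raw headline (or a URL, as in the run above) will match its vocabulary poorly. A possible remedy is to pass user input through the same steps first. The helper below is a hypothetical sketch: `preprocess_headline` is not part of the notebook, and the demo uses simple stand-ins so that it runs without NLTK.

```python
import string

def preprocess_headline(headline, tokenize, stop_words, stem):
    """Apply the same lowercase, filter and stem steps used on the training titles."""
    punctuations = set(string.punctuation)
    tokens = tokenize(headline.lower())
    kept = [
        stem(token)
        for token in tokens
        if token not in stop_words and token not in punctuations and token != "'s"
    ]
    return " ".join(kept)

# Stand-ins for demonstration only; they are NOT the notebook's tokenizer, stop words or stemmer
demo_stop_words = {"the", "of", "a"}
demo_result = preprocess_headline("The Rise of AI", str.split, demo_stop_words, lambda word: word)
print(demo_result)

# In the notebook, the call would presumably look like this:
# cleaned = preprocess_headline(text, word_tokenize, stop_words, porter.stem)
# Headline_counts = count_vectorizer.transform([cleaned])
```

Passing the tokenizer, stop-word set, and stemmer as arguments keeps the helper independent of how those objects were created earlier in the notebook.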
