# Analysing News Articles Dataset

<div id="main image" align="center">
  <img src="https://github.com/Ganizo/Classification-ProjectTeam-11/blob/main/Images/newspaper-with-the-headline-news-and-glasses-and-coffee-cup-on-wooden-table-daily-newspaper-mock-up-concept-photo.jpg" width="550" height="300" alt=""/>
</div>


## Structure of the Notebook
* 1. Importing Packages
* 2. Data Loading
* 3. Pre-Processing
* 4. DataPrep
* 5. Model train and test
* 6. Model performance

## 1. Importing Packages <a class="anchor" id="packages"></a>

In [2]:
#Importing Packages

# Importing Pandas for data manipulation and analysis
import pandas as pd

# Importing NumPy for numerical operations
import numpy as np

# Importing Matplotlib for data visualization
import matplotlib.pyplot as plt

# Importing Seaborn for advanced data visualization
import seaborn as sns

# Importing IPython.display for displaying rich media in Jupyter Notebooks
from IPython.display import display, Image

# Importing NLTK and its stopwords corpus for text cleaning
import nltk
from nltk.corpus import stopwords

# Importing models used for training
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Importing model metrics to evaluate performance
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import confusion_matrix, log_loss

#Importing Pickle to save and use the model
import pickle

#Importing re
import re


__________________________________________________________________________________________

############################################################################
__________________________________________________________________________________________

## 2. Data Loading <a class="anchor" id="dataset"></a>
    
With the required packages imported, this project uses Pandas to load the training and test CSV files.

In [3]:
# Load the CSV file
train_df = pd.read_csv('Data/processed/train.csv')
test_df = pd.read_csv('Data/processed/test.csv')

In [4]:
test_df.head()

Unnamed: 0,headlines,description,content,url,category
0,NLC India wins contract for power supply to Ra...,State-owned firm NLC India Ltd (NLCIL) on Mond...,State-owned firm NLC India Ltd (NLCIL) on Mond...,https://indianexpress.com/article/business/com...,business
1,SBI Clerk prelims exams dates announced; admit...,SBI Clerk Prelims Exam: The SBI Clerk prelims ...,SBI Clerk Prelims Exam: The State Bank of Indi...,https://indianexpress.com/article/education/sb...,education
2,"Golden Globes: Michelle Yeoh, Will Ferrell, An...","Barbie is the top nominee this year, followed ...","Michelle Yeoh, Will Ferrell, Angela Bassett an...",https://indianexpress.com/article/entertainmen...,entertainment
3,"OnePlus Nord 3 at Rs 27,999 as part of new pri...",New deal makes the OnePlus Nord 3 an easy purc...,"In our review of the OnePlus Nord 3 5G, we pra...",https://indianexpress.com/article/technology/t...,technology
4,Adani family’s partners used ‘opaque’ funds to...,Citing review of files from multiple tax haven...,Millions of dollars were invested in some publ...,https://indianexpress.com/article/business/ada...,business


In [5]:
train_df.head()

Unnamed: 0,headlines,description,content,url,category
0,RBI revises definition of politically-exposed ...,The central bank has also asked chairpersons a...,The Reserve Bank of India (RBI) has changed th...,https://indianexpress.com/article/business/ban...,business
1,NDTV Q2 net profit falls 57.4% to Rs 5.55 cror...,NDTV's consolidated revenue from operations wa...,Broadcaster New Delhi Television Ltd on Monday...,https://indianexpress.com/article/business/com...,business
2,"Akasa Air ‘well capitalised’, can grow much fa...",The initial share sale will be open for public...,Homegrown server maker Netweb Technologies Ind...,https://indianexpress.com/article/business/mar...,business
3,India’s current account deficit declines sharp...,The current account deficit (CAD) was 3.8 per ...,India’s current account deficit declined sharp...,https://indianexpress.com/article/business/eco...,business
4,"States borrowing cost soars to 7.68%, highest ...",The prices shot up reflecting the overall high...,States have been forced to pay through their n...,https://indianexpress.com/article/business/eco...,business
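The `Unnamed: 0` column in both previews is the row index that was written out when the CSV files were saved. A minimal sketch of one way to avoid carrying it around, assuming the first column of each file really is just a saved index, is to pass `index_col=0` to `pd.read_csv`. The CSV text below is a made-up two-row stand-in, not the project data.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the real files;
# the empty first header is what pandas displays as "Unnamed: 0".
csv_text = """,headlines,category
0,Example business headline,business
1,Example education headline,education
"""

# Default load: the saved index comes back as an extra column
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

The rest of the notebook never uses `Unnamed: 0`, so either form works; dropping it simply keeps the previews tidier.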


__________________________________________________________________________________________

############################################################################
__________________________________________________________________________________________

## 3. Pre-Processing <a class="anchor" id="preprocess"></a>

- Remove unnecessary columns (url) since it may not contribute to categorization.
- *Text Cleaning*: Convert text to lowercase, remove punctuation, stopwords, and special characters.
- *Tokenization*: Split text into individual words or phrases.
- *Vectorization*: Convert text into numerical form using TF-IDF or CountVectorizer.

After analysing the data, we decided to reduce the dataset to the headlines and category columns. The description and content columns usually contain a far more verbose explanation of the headline, so processing, training and testing on them would require considerably more resources.

Headlines are similar to article titles, so we expect to get a reasonably good model at a fraction of the resource cost. This is our initial hypothesis.

As a result we will drop the **url**, **description** and **content** columns.

In [6]:
#Remove unnecessary columns
train_df = train_df.drop(columns=['url', 'description', 'content'])
train_df.head()

Unnamed: 0,headlines,category
0,RBI revises definition of politically-exposed ...,business
1,NDTV Q2 net profit falls 57.4% to Rs 5.55 cror...,business
2,"Akasa Air ‘well capitalised’, can grow much fa...",business
3,India’s current account deficit declines sharp...,business
4,"States borrowing cost soars to 7.68%, highest ...",business


Next, we need to remove the noise within the headlines column. We use regular expressions to remove punctuation, special characters and numbers. We then use NLTK to download the English stop word list and filter the stop words out of each headline with a list comprehension.

Lastly, we convert all text to lowercase to make it uniform.

In [7]:
#Text Cleaning

# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Function to clean text: strips punctuation, numbers and stop words
def clean_text(text):
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', ' ', text)
    # Remove numbers from the text
    text = re.sub(r'\d+', ' ', text)
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word.lower() not in stop_words])
    return text

# Apply the above function to the headlines column
train_df['cleaned_headlines'] = train_df['headlines'].apply(clean_text)


# Make the entire column lowercase
train_df['cleaned_headlines'] = train_df['cleaned_headlines'].str.lower()


train_df = train_df.drop('headlines', axis=1)


train_df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,category,cleaned_headlines
0,business,rbi revises definition politically exposed per...
1,business,ndtv q net profit falls rs crore impacted lowe...
2,business,akasa air well capitalised grow much faster ce...
3,business,india current account deficit declines sharply...
4,business,states borrowing cost soars highest far fiscal
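To make the effect of each cleaning step easier to follow, here is a minimal, self-contained sketch of the same idea applied to a single made-up headline. It assumes a tiny hand-picked stop word set so that it runs without downloading the NLTK corpus; the real NLTK list is much longer, so results on the actual data will differ.

```python
import re

# Assumption: a tiny illustrative stop word set, NOT the NLTK English list
stop_words = {"of", "to", "the", "for", "in", "as", "a"}

def clean_text(text):
    # Replace punctuation and special characters with spaces
    text = re.sub(r'[^\w\s]', ' ', text)
    # Replace digits with spaces
    text = re.sub(r'\d+', ' ', text)
    # Drop stop words and lowercase the remaining words
    return ' '.join(w.lower() for w in text.split() if w.lower() not in stop_words)

# Made-up headline, used only for illustration
sample = "Rupee falls 12 paise to 83.25 against the US dollar!"
print(clean_text(sample))  # rupee falls paise against us dollar
```

Note that replacing characters with a space rather than an empty string keeps words such as "politically-exposed" from being glued together.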


We experimented with various tokenizers and vectorizers and ultimately settled on the TF-IDF vectorizer. The earlier attempts are kept below, commented out, for reference.

In [8]:

# Tokenise the text using the TreebankWordTokenizer
#tokeniser = TreebankWordTokenizer()
#train_df['tokens_content'] = train_df['cleaned_content'].apply(tokeniser.tokenize)
#train_df['tokens_description'] = train_df['cleaned_description'].apply(tokeniser.tokenize)
#train_df['tokens_headlines'] = train_df['cleaned_headlines'].apply(tokeniser.tokenize)

#train_df = train_df.drop('cleaned_content', axis=1)
#train_df = train_df.drop('cleaned_description', axis=1)
#train_df = train_df.drop('cleaned_headlines', axis=1)

#train_df.head()

In [9]:
#from sklearn.feature_extraction.text import CountVectorizer

# Initialise CountVectorizer
#vect = CountVectorizer()
# Fit the CountVectorizer on the preprocessed 'cleaned_content' column
#vect.fit(train_df['cleaned_content'] )
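To illustrate the difference between the two vectorizers mentioned above, the following sketch fits `CountVectorizer` and `TfidfVectorizer` on three made-up headlines. The toy corpus is an assumption for illustration only; it is not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy, made-up cleaned headlines (not from the dataset)
docs = [
    "bank profit rises",
    "bank exam dates announced",
    "film wins award",
]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs)

tfidf_vect = TfidfVectorizer()
tfidf = tfidf_vect.fit_transform(docs)

# Both learn the same vocabulary, so the matrices have the same shape
print(counts.shape, tfidf.shape)

# In the first headline, "bank" occurs in two documents and "profit" in one,
# so TF-IDF gives "profit" the larger weight even though both counts are 1.
vocab = tfidf_vect.vocabulary_
row = tfidf[0].toarray().ravel()
print(row[vocab["bank"]], row[vocab["profit"]])
```

Down-weighting words that appear in many headlines is the main reason TF-IDF can be a better fit than raw counts for short texts like these.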

__________________________________________________________________________________________

############################################################################
__________________________________________________________________________________________

## 4. DataPrep <a class="anchor" id="dataprep"></a>

Below we separate our features and labels so they can be passed to the models. We were unsure whether a train/test split was needed, since the dataset already came separated into train and test files. Ultimately we decided to split anyway, as this is the workflow we are comfortable with for training and testing our models.

Given more time we would have explored this further. As it stands, we applied a train/test split to both the train dataset and the test dataset.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_vectorized = vectorizer.fit_transform(train_df["cleaned_headlines"]) 

# Setting up our labels
y = train_df["category"]

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)
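One thing to be aware of: the cells above fit the vectorizer on every training headline before splitting, so the held-out 20% also contributes to the vocabulary and IDF weights. A possible alternative, sketched here on made-up data and not what this notebook does, is to split the raw text first and wrap the vectorizer and model in a scikit-learn `Pipeline`, which then fits TF-IDF on the training split only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy, made-up cleaned headlines and labels (not from the dataset)
texts = [
    "bank profit rises", "stock market falls", "rupee gains dollar",
    "company shares jump", "exam dates announced", "board results declared",
    "university admission opens", "school exam postponed",
]
labels = ["business"] * 4 + ["education"] * 4

# Split the raw text first; stratify keeps the class proportions
text_train, text_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training split only
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
model.fit(text_train, y_train)

print(model.predict(text_val))
```

A pipeline also has the practical advantage that a single object can be pickled, instead of saving the model and the vectorizer separately.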

__________________________________________________________________________________________

############################################################################
__________________________________________________________________________________________

## 5. Model Train and Test <a class="anchor" id="modeltandt"></a>

In [12]:


# Train Logistic Regression and save for streamlit usage
lr = LogisticRegression(multi_class='multinomial', solver='lbfgs')
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
with open("Streamlit/LogisticRegression.pkl", "wb") as f:
    pickle.dump(lr, f)

# Train Naive Bayes and save for streamlit usage
nb = MultinomialNB()
nb.fit(X_train, y_train)
nb_pred = nb.predict(X_test)
with open("Streamlit/Naive_Bayes.pkl", "wb") as f:
    pickle.dump(nb, f)

# Train SVM and save for streamlit usage
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
with open("Streamlit/SVM.pkl", "wb") as f:
    pickle.dump(svm, f)

# Save Vectorizer for streamlit usage
with open("Streamlit/vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# Evaluate Models
print("Naive Bayes Accuracy:", accuracy_score(y_test, nb_pred))
print("SVM Accuracy:", accuracy_score(y_test, svm_pred))
print("LogisticRegression Accuracy:", accuracy_score(y_test, lr_pred))



Naive Bayes Accuracy: 0.9039855072463768
SVM Accuracy: 0.9320652173913043
LogisticRegression Accuracy: 0.9193840579710145
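The pickled model and vectorizer are meant to be loaded again later. The sketch below shows the round trip on made-up data, assuming the consumer restores both objects and passes new text through the same fitted vectorizer before predicting. It pickles to in-memory bytes so that it runs anywhere; the notebook itself writes to files under `Streamlit/`.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy, made-up training data (not from the dataset)
texts = ["bank profit rises", "stock market falls",
         "exam dates announced", "school exam postponed"]
labels = ["business", "business", "education", "education"]

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Serialise to bytes (stand-in for the .pkl files)
vectorizer_bytes = pickle.dumps(vectorizer)
model_bytes = pickle.dumps(model)

# Later, for example inside the app: restore both objects
loaded_vectorizer = pickle.loads(vectorizer_bytes)
loaded_model = pickle.loads(model_bytes)

# A new headline must go through the same fitted vectorizer before predict.
# In the notebook it would also pass through clean_text first.
new_headline = ["exam results announced"]
prediction = loaded_model.predict(loaded_vectorizer.transform(new_headline))
print(prediction[0])  # education
```

Calling `transform` rather than `fit_transform` on new text matters here: refitting would produce a different vocabulary and the model's feature columns would no longer line up.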


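Next, we prepare the provided test dataset with the same cleaning steps and the already fitted vectorizer, so the trained models can be tested on it.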

In [13]:

test_df['cleaned_headlines'] = test_df['headlines'].apply(clean_text)


# Make the entire column lowercase
test_df['cleaned_headlines'] = test_df['cleaned_headlines'].str.lower()


test_df = test_df.drop(columns=['content', 'description', 'headlines'])

# Reuse the vectorizer fitted on the training data (transform only, no refit)
X_test_vectorized = vectorizer.transform(test_df["cleaned_headlines"]) 
y_test = test_df["category"]



__________________________________________________________________________________________

############################################################################
__________________________________________________________________________________________

## 6. Model Performance <a class="anchor" id="modelperform"></a>

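Checking the performance of the models on unseen data. The cell below splits the vectorized test data again and evaluates on the 20% hold-out (`X_test2`), so the metrics are computed on that subset rather than on the full test file.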

In [14]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_test_vectorized, y_test, test_size=0.2, random_state=42)

#performing model predictions
lr_pred2 = lr.predict(X_test2)
nb_pred2 = nb.predict(X_test2)
svm_pred2= svm.predict(X_test2)

#LogisticRegression metrics
lr_cm = confusion_matrix(y_test2, lr_pred2)
#lr_tn, lr_fp, lr_fn, lr_tp = lr_cm.ravel()
lr_accuracy = accuracy_score(y_test2, lr_pred2)
lr_precision = precision_score(y_test2, lr_pred2, average='weighted')
lr_recall = recall_score(y_test2, lr_pred2, average='weighted')
lr_f1 = f1_score(y_test2, lr_pred2, average='weighted')
lr_auc_roc = roc_auc_score(y_test2, lr.predict_proba(X_test2), multi_class='ovr')
lr_log_loss_val = log_loss(y_test2, lr.predict_proba(X_test2))

#Naive bayes Metrics
nb_cm = confusion_matrix(y_test2, nb_pred2)
#nb_tn, nb_fp, nb_fn, nb_tp = nb_cm.ravel()
nb_accuracy = accuracy_score(y_test2, nb_pred2)
nb_precision = precision_score(y_test2, nb_pred2, average='weighted')
nb_recall = recall_score(y_test2, nb_pred2, average='weighted')
nb_f1 = f1_score(y_test2, nb_pred2, average='weighted')
nb_auc_roc = roc_auc_score(y_test2, nb.predict_proba(X_test2), multi_class='ovr')
nb_log_loss_val = log_loss(y_test2, nb.predict_proba(X_test2))


#SVM metrics
svm_accuracy = accuracy_score(y_test2, svm_pred2)
svm_precision = precision_score(y_test2, svm_pred2, average='weighted')
svm_recall = recall_score(y_test2, svm_pred2, average='weighted')
svm_f1 = f1_score(y_test2, svm_pred2, average='weighted')
svm_cm = confusion_matrix(y_test2, svm_pred2)
#svm_tn, svm_fp, svm_fn, svm_tp = svm_cm.ravel()
# AUC-ROC and log loss are skipped for SVM: SVC only provides predict_proba
# when it is created with probability=True
#svm_auc_roc = roc_auc_score(y_test2, svm.predict_proba(X_test2), multi_class='ovr')
#svm_log_loss_val = log_loss(y_test2, svm.predict_proba(X_test2))

print("LogisticRegression Accuracy:")
print(f"Accuracy: {lr_accuracy:.4f}")
print(f"Precision: {lr_precision:.4f}")
print(f"Recall: {lr_recall:.4f}")
print(f"F1 Score: {lr_f1:.4f}")
print(f"AUC-ROC: {lr_auc_roc:.4f}")
print(f"Log Loss: {lr_log_loss_val:.4f}")
print(lr_cm)
print("------------------------------------------")
print("******************************************")
print("------------------------------------------")
print("Naive Bayes Accuracy:")
print(f"Accuracy: {nb_accuracy:.4f}")
print(f"Precision: {nb_precision:.4f}")
print(f"Recall: {nb_recall:.4f}")
print(f"F1 Score: {nb_f1:.4f}")
print(f"AUC-ROC: {nb_auc_roc:.4f}")
print(f"Log Loss: {nb_log_loss_val:.4f}")
print(nb_cm)
#print(f"confusion_matrix: {nb_cm:.4f}")
print("------------------------------------------")
print("******************************************")
print("------------------------------------------")
print("SVM Accuracy:")
print(f"Accuracy: {svm_accuracy:.4f}")
print(f"Precision: {svm_precision:.4f}")
print(f"Recall: {svm_recall:.4f}")
print(f"F1 Score: {svm_f1:.4f}")
print(svm_cm)
#print(f"AUC-ROC: {svm_auc_roc:.4f}")
#print(f"Log Loss: {svm_log_loss_val:.4f}")

LogisticRegression Accuracy:
Accuracy: 0.9075
Precision: 0.9133
Recall: 0.9075
F1 Score: 0.9072
AUC-ROC: 0.9920
Log Loss: 0.5550
[[67  1  1  0  4]
 [ 0 78  0  0  2]
 [ 1  0 75  1  4]
 [ 5  2  5 68  6]
 [ 4  1  0  0 75]]
------------------------------------------
******************************************
------------------------------------------
Naive Bayes Accuracy:
Accuracy: 0.8850
Precision: 0.8950
Recall: 0.8850
F1 Score: 0.8821
AUC-ROC: 0.9935
Log Loss: 0.5652
[[66  4  0  0  3]
 [ 0 78  0  0  2]
 [ 0  2 75  0  4]
 [ 7  5  8 59  7]
 [ 3  1  0  0 76]]
------------------------------------------
******************************************
------------------------------------------
SVM Accuracy:
Accuracy: 0.9300
Precision: 0.9320
Recall: 0.9300
F1 Score: 0.9301
[[68  1  1  0  3]
 [ 0 78  0  0  2]
 [ 1  0 76  1  3]
 [ 3  0  5 75  3]
 [ 3  1  0  1 75]]
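The confusion matrices carry more detail than the single summary numbers. As a worked example, the sketch below takes the logistic regression matrix printed above and derives the overall accuracy plus per-class recall and precision with NumPy. Rows are true classes and columns are predicted classes, in scikit-learn's sorted label order; the class names are left out because the output does not show them.

```python
import numpy as np

# Logistic regression confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
lr_cm = np.array([
    [67,  1,  1,  0,  4],
    [ 0, 78,  0,  0,  2],
    [ 1,  0, 75,  1,  4],
    [ 5,  2,  5, 68,  6],
    [ 4,  1,  0,  0, 75],
])

# Overall accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(lr_cm) / lr_cm.sum()

# Per-class recall: diagonal divided by row totals (true counts per class)
recall_per_class = np.diag(lr_cm) / lr_cm.sum(axis=1)

# Per-class precision: diagonal divided by column totals (predicted counts)
precision_per_class = np.diag(lr_cm) / lr_cm.sum(axis=0)

print(f"Accuracy: {accuracy:.4f}")
print("Recall per class:", np.round(recall_per_class, 4))
print("Precision per class:", np.round(precision_per_class, 4))
```

This reproduces the reported accuracy of 0.9075 (363 correct out of 400) and shows that the fourth class has the lowest recall, 68 of 86. It also explains why the weighted recall equals the accuracy for every model: weighting each class's recall by its row total adds up to the diagonal sum divided by the total.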


__________________________________________________________________________________________

############################################################################
__________________________________________________________________________________________