# News categorization

In [1]:
import numpy as np
import pandas as pd

import sklearn.datasets as skd

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split


from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

import json


## Load and Analyze the data

Dataset is based on https://www.kaggle.com/rmisra/news-category-dataset/data#

In [2]:
# the file holds one JSON object per line rather than a single JSON array,
# so join the lines into an array before parsing:
with open('../../datasets/TextClassification/News_Category_Dataset_v2.json') as f:
    lines = f.readlines()
    joined_lines = '[' + ','.join(lines) + ']'
    json_data = json.loads(joined_lines)

full_df = pd.DataFrame(json_data)
full_df.head()

| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... | Melissa Jeltsen | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... | 2018-05-26 |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... | Andy McDonald | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. | 2018-05-26 |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 | Ron Dicker | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... | 2018-05-26 |
| 3 | ENTERTAINMENT | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | Ron Dicker | https://www.huffingtonpost.com/entry/jim-carre... | The actor gives Dems an ass-kicking for not fi... | 2018-05-26 |
| 4 | ENTERTAINMENT | Julianna Margulies Uses Donald Trump Poop Bags... | Ron Dicker | https://www.huffingtonpost.com/entry/julianna-... | The "Dietland" actress said using the bags is ... | 2018-05-26 |
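
The file layout described above (one JSON object per line) is commonly called JSON Lines, and pandas can parse it directly. The following is a minimal sketch of that alternative, not the loading code used in this notebook; the two records in `sample` are made up for illustration.

```python
import io
import pandas as pd

# Two made-up records in JSON Lines form (hypothetical sample, not taken from the dataset)
sample = io.StringIO(
    '{"category": "ARTS", "headline": "A gallery opens"}\n'
    '{"category": "BUSINESS", "headline": "Markets rally"}\n'
)

# lines=True tells pandas to parse one JSON object per line
sample_df = pd.read_json(sample, lines=True)
print(sample_df)
```

With the real file, passing its path instead of the `StringIO` object should produce the same DataFrame as the manual join, without building the intermediate string.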


In [3]:
# count the documents in each category, largest first:
full_df.groupby('category').count().sort_values(by='headline', ascending=False)

| category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|
| POLITICS | 32739 | 32739 | 32739 | 32739 | 32739 |
| WELLNESS | 17827 | 17827 | 17827 | 17827 | 17827 |
| ENTERTAINMENT | 16058 | 16058 | 16058 | 16058 | 16058 |
| TRAVEL | 9887 | 9887 | 9887 | 9887 | 9887 |
| STYLE & BEAUTY | 9649 | 9649 | 9649 | 9649 | 9649 |
| PARENTING | 8677 | 8677 | 8677 | 8677 | 8677 |
| HEALTHY LIVING | 6694 | 6694 | 6694 | 6694 | 6694 |
| QUEER VOICES | 6314 | 6314 | 6314 | 6314 | 6314 |
| FOOD & DRINK | 6226 | 6226 | 6226 | 6226 | 6226 |
| BUSINESS | 5937 | 5937 | 5937 | 5937 | 5937 |


## Preprocess the data

For the sake of example, we will use only two categories: 'ARTS' and 'BUSINESS'. Let's create a new DataFrame containing only the rows for these categories.

In [4]:
df = full_df[
    (full_df.category=='ARTS') |
    (full_df.category=='BUSINESS')
]

In [5]:
df.shape

(7446, 6)

In [6]:
# check the categories:
df.groupby('category').count()

| category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|
| ARTS | 1509 | 1509 | 1509 | 1509 | 1509 |
| BUSINESS | 5937 | 5937 | 5937 | 5937 | 5937 |


In [7]:
# Check for NaN values:
df.isna().sum()

category             0
headline             0
authors              0
link                 0
short_description    0
date                 0
dtype: int64

In [8]:
# Check for empty string values:
(df == '').sum()

category                0
headline                1
authors              1008
link                    0
short_description    1506
date                    0
dtype: int64

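### Deal with empty 'short_description'

Our main feature is 'short_description'. The intended fallback is to replace an empty description with the value in the 'headline' column. Note, however, that the filter in the cell below keeps only rows where both fields are non-empty, so every row with an empty description is dropped before the fill runs and the fill changes nothing. The outputs in the rest of this notebook are consistent with that behaviour. To actually apply the headline fallback, the two conditions would have to be combined with `|` instead of `&`.

In [9]:
# keep only rows where both headline and short_description are non-empty
# (this also drops every row with an empty short_description)
df = df.loc[ (df.short_description!='') & (df.headline!='') ].copy()

# fallback fill: a no-op after the filter above, since no empty descriptions remain
df['short_description'] = np.where(df.short_description=='', df.headline, df.short_description)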

# check again:
df[df.short_description==''].count()

category             0
headline             0
authors              0
link                 0
short_description    0
date                 0
dtype: int64
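
To make the difference between dropping rows with an empty description and falling back to the headline concrete, here is a small sketch on a toy frame. The three rows are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows: one complete, one with an empty description, one fully empty
toy = pd.DataFrame({
    "headline": ["Markets rally", "Gallery opens", ""],
    "short_description": ["Stocks rose on Monday.", "", ""],
})

# '&' keeps rows where BOTH fields are non-empty
both = toy[(toy.short_description != "") & (toy.headline != "")]

# '|' keeps rows where AT LEAST ONE field is non-empty,
# which leaves something for the headline fallback to fill
either = toy[(toy.short_description != "") | (toy.headline != "")].copy()
either["short_description"] = np.where(
    either.short_description == "", either.headline, either.short_description
)

print(len(both), len(either))
print(either.short_description.tolist())
```

With `&`, only the complete row survives; with `|`, the row with an empty description is kept and receives its headline as the description.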

### Select Features and separate the train/test sets

For the feature column we will use 'short_description'.
The target will be the 'category' column.

In [10]:
# features: short_description, target: category
# (test_size is not set, so scikit-learn's default 75% / 25% split applies)
X_train, X_test, y_train, y_test = train_test_split(
    df['short_description'],
    df['category'],
    random_state=42)
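
Because the two categories are far from balanced, it can help to keep the class proportions identical in the train and test sets. The sketch below shows the `stratify` argument of `train_test_split` on made-up labels; it is a possible variation, not the split used in this notebook, and the 20/80 label counts are hypothetical.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 ARTS vs 80 BUSINESS
texts = [f"doc {i}" for i in range(100)]
labels = ["ARTS"] * 20 + ["BUSINESS"] * 80

# test_size defaults to 0.25; stratify keeps the 20/80 ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, random_state=42, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```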

### Preprocessing (stop-words removal, stemming)

### TF-IDF

To convert the raw documents into a TF-IDF matrix representation we will use [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
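
As a quick illustration of what the vectorizer produces, the sketch below fits `TfidfVectorizer` with its default settings on three made-up sentences. The sentences are assumptions for the example; the real input is the 'short_description' column.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents (hypothetical; not taken from the dataset)
docs = [
    "stocks fall as markets react",
    "gallery opens new art show",
    "markets rally as stocks rise",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix: one row per document, one column per term

print(X.shape)
print(sorted(vectorizer.vocabulary_))

# with the default norm='l2', every row is scaled to unit length
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)
```

Each row is one document and each column one distinct term, so terms shared by several documents (here "stocks", "markets" and "as") receive a lower weight than terms that appear in only one.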

## Train the classifier

For features transformed with TF-IDF vectorization, we use the Multinomial Naive Bayes classifier, as it is well-suited for frequency-based data, such as term frequencies from text.

To streamline the process, we combine the two steps, TF-IDF vectorization and classification, in a single Pipeline, so that one call to `fit` or `predict` runs both in order.

In [11]:
# instantiate the classifier:
clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

clf.fit(X_train, y_train)

### Test the classifier

In [12]:
# Predict, using the test cases
y_pred = clf.predict(X_test)

res = y_pred == y_test

print( f'Correct: {np.count_nonzero( (res) )} out of {np.size(res)}' )
print( f'False: {np.size(res) - np.count_nonzero((res))} ' )


Correct: 1279 out of 1485
False: 206 


## Results estimation

In [13]:
accuracy = str(np.mean(y_pred == y_test))
print(f'Accuracy = {accuracy}\n')

report = metrics.classification_report(
    y_test, y_pred, target_names=['ARTS','BUSINESS']
)

print(f'Classification report: \n{report}')

Accuracy = 0.8612794612794613

Classification report: 
              precision    recall  f1-score   support

        ARTS       1.00      0.00      0.01       207
    BUSINESS       0.86      1.00      0.93      1278

    accuracy                           0.86      1485
   macro avg       0.93      0.50      0.47      1485
weighted avg       0.88      0.86      0.80      1485



In [14]:
confusion_matrix(y_test, y_pred, labels=['ARTS','BUSINESS'])

array([[   1,  206],
       [   0, 1278]])
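
The figures in the classification report can be recomputed from the confusion matrix by hand. The sketch below copies the four counts from the output above and derives recall, precision and accuracy, plus the accuracy of always predicting the majority class; the variable names are chosen for this example only.

```python
import numpy as np

# rows = actual class, columns = predicted class, order ['ARTS', 'BUSINESS']
cm = np.array([[1, 206],
               [0, 1278]])

recall = cm.diagonal() / cm.sum(axis=1)      # correct / actual instances per class
precision = cm.diagonal() / cm.sum(axis=0)   # correct / predicted instances per class
accuracy = cm.diagonal().sum() / cm.sum()

# accuracy obtained by labelling every document with the majority class
baseline = cm.sum(axis=1).max() / cm.sum()

print("recall   :", recall.round(4))
print("precision:", precision.round(4))
print("accuracy :", round(accuracy, 4))
print("baseline :", round(baseline, 4))
```

The ARTS recall is 1/207, which the report rounds to 0.00, and the accuracy exceeds the majority-class baseline by exactly one document.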

### Interpretation of the Report:

- **ARTS**:
  - **Precision**: 1.00 (The model predicted ARTS only once, and that prediction was correct)
  - **Recall**: 0.00 (Only 1 of the 207 actual ARTS instances was classified correctly; 1/207 ≈ 0.005, which rounds to 0.00)
  - **F1-Score**: 0.01 (Poor performance due to the near-zero recall)
  - **Support**: 207 (Total actual instances of ARTS)

- **BUSINESS**:
  - **Precision**: 0.86 (86% of the predicted BUSINESS instances were correct)
  - **Recall**: 1.00 (All actual BUSINESS instances were correctly classified)
  - **F1-Score**: 0.93 (High, but largely because the model predicts BUSINESS for almost every document)
  - **Support**: 1278 (Total actual instances of BUSINESS)

#### Overall Metrics:
- **Accuracy**: 0.86 (86.13% of the total predictions were correct; always predicting BUSINESS would already reach 1278/1485 ≈ 86.06%)
- **Macro Average**:
  - **Precision**: 0.93 (Unweighted mean of the per-class precision)
  - **Recall**: 0.50 (Unweighted mean of the per-class recall, heavily affected by the ARTS class)
  - **F1-Score**: 0.47 (Reflects poor classification of ARTS)

- **Weighted Average**:
  - **Precision**: 0.88 (Per-class precision weighted by support, so the larger BUSINESS class dominates)
  - **Recall**: 0.86 (Per-class recall weighted by support, which equals the accuracy)
  - **F1-Score**: 0.80 (Per-class F1-score weighted by support)

#### Conclusion:
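- The model classifies **BUSINESS** well but almost never predicts **ARTS**: it labels 1484 of the 1485 test documents as BUSINESS, so its accuracy is barely above that of always predicting the majority class. The near-zero ARTS recall pulls down the macro-average recall and F1-score, and the class imbalance is a likely cause.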
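
One direction that might be worth trying for the low ARTS recall is a Naive Bayes variant aimed at imbalanced classes. The sketch below swaps `ComplementNB` into the same kind of pipeline; this is an assumption for illustration, not the model evaluated in this notebook, and the tiny corpus is made up, so its predictions say nothing about performance on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Tiny made-up, imbalanced corpus (hypothetical; not taken from the dataset)
texts = [
    "stocks rally as markets open",
    "company reports quarterly earnings",
    "bank raises interest rates",
    "startup secures new funding",
    "retail sales beat forecasts",
    "oil prices fall on weak demand",
    "museum unveils new painting exhibit",
    "sculptor opens gallery show",
]
labels = ["BUSINESS"] * 6 + ["ARTS"] * 2

alt_clf = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", ComplementNB()),  # swapped in for MultinomialNB as an experiment
])
alt_clf.fit(texts, labels)

preds = alt_clf.predict(["painter shows work at gallery", "markets react to earnings"])
print(preds)
```

Whether this helps on the real data would have to be checked by rerunning the classification report and confusion matrix with the alternative pipeline.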
