### **spacy_text_classification : Exercise**


- In this exercise, you are going to classify a given text into one of three possible classes: ['BUSINESS', 'SPORTS', 'CRIME'].

- You are going to use spaCy to pre-process the text, convert the text to numbers, and apply different classification algorithms.

### **About Data: News Category Classifier**

Credits: https://www.kaggle.com/code/hengzheng/news-category-classifier-val-acc-0-65


- This data consists of two columns:
    - Text
    - Category
- Text is the description of a particular topic.
- Category determines which class the text belongs to.
- The classes are 'BUSINESS', 'SPORTS', and 'CRIME', so this is a **multi-class** classification problem.

In [12]:
#import pandas library
import pandas


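#read the dataset "News_Category_Dataset_v3.json" (one JSON record per line) and load it into dataframe "df"
df = pandas.read_json('News_Category_Dataset_v3.json', lines = True)
#keep only the three categories used in this exercise
df = df[df.category.isin(["BUSINESS", "SPORTS", "CRIME"])]

#keep only the label and text columns; copy so that later column assignments do not act on a view
df = df[['category', 'short_description']].copy()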
#print the shape of data
print(df.shape)


#print the top5 rows
df.head()


(14631, 2)


Unnamed: 0,category,short_description
17,SPORTS,"Maury Wills, who helped the Los Angeles Dodger..."
26,SPORTS,Las Vegas never had a professional sports cham...
61,SPORTS,The race's organizers say nonbinary athletes w...
62,SPORTS,Varvaro pitched mostly with the Atlanta Braves...
67,SPORTS,Carlos Alcaraz defeated Casper Ruud in the U.S...


In [9]:
#check the distribution of labels 
df.category.value_counts()


category
BUSINESS    5992
SPORTS      5077
CRIME       3562
Name: count, dtype: int64
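
The counts above can be turned into class shares, which gives the later accuracy figure a point of comparison. The sketch below is only an illustration: it copies the three counts from the output by hand, and the names `label_counts`, `label_shares`, and `majority_baseline` are introduced here and are not part of the original notebook. It computes each class's share and the accuracy of always guessing the largest class.

```python
# Counts copied by hand from the value_counts() output above.
label_counts = {"BUSINESS": 5992, "SPORTS": 5077, "CRIME": 3562}

total = sum(label_counts.values())

# Share of each class in the filtered dataset.
label_shares = {name: count / total for name, count in label_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = max(label_counts.values()) / total

for name, share in label_shares.items():
    print(f"{name:<10}{share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

The classes are moderately imbalanced (roughly 41%, 35%, and 24%), so per-class precision and recall are worth reading alongside overall accuracy.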

In [13]:
#Add the new column "label_num", which assigns a unique integer to each label (numbered in order of first appearance)
df['label_num'] = pandas.factorize(df.category)[0]
#rename "short_description" to "text" for convenience
df.rename(columns= {'short_description': 'text'}, inplace=True)

#check the results with the top 5 rows
df.head()


Unnamed: 0,category,text,label_num
17,SPORTS,"Maury Wills, who helped the Los Angeles Dodger...",0
26,SPORTS,Las Vegas never had a professional sports cham...,0
61,SPORTS,The race's organizers say nonbinary athletes w...,0
62,SPORTS,Varvaro pitched mostly with the Atlanta Braves...,0
67,SPORTS,Carlos Alcaraz defeated Casper Ruud in the U.S...,0
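
`pandas.factorize` returns both the integer codes and the unique labels, and the second value is what maps a number such as `0` back to a category name. The sketch below shows this on a small made-up series (the `toy_categories` values and the `label_to_name` dictionary are hypothetical, introduced only for illustration). On the real data, the numbering depends on the order in which categories first appear in `df`.

```python
import pandas as pd

# Toy stand-in for df.category; the real column has thousands of rows.
toy_categories = pd.Series(["SPORTS", "CRIME", "SPORTS", "BUSINESS", "CRIME"])

# factorize returns (codes, uniques); codes are assigned in order of first appearance.
codes, uniques = pd.factorize(toy_categories)

# Build a lookup from integer label back to category name.
label_to_name = dict(enumerate(uniques))

print(list(codes))
print(label_to_name)
```

Keeping such a lookup around makes the numeric class labels in a classification report easier to interpret.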


### **Preprocess the text**

In [14]:
#build a pipeline that pre-processes each text, converts it to a spaCy vector, and classifies it
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

import spacy

# load the English language model and create the nlp object from it
# (the model must be installed first: python -m spacy download en_core_web_lg)
nlp = spacy.load("en_core_web_lg")


#use this utility function to get the pre-processed text data
def preprocess(text):
    # remove stop words and punctuation, then lemmatize the remaining tokens
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

from sklearn.base import BaseEstimator, TransformerMixin

class PreProcessText(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.apply(preprocess)
        return X

class SpacyVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # doc.vector is the average of the token vectors in the document
        vectors = [nlp(text).vector for text in X]
        return vectors

#1. create a pipeline object


model = Pipeline(steps= [
    ('PreprocessText', PreProcessText()),
    ('SpacyVectorization', SpacyVectorizer()),
    ('XgBoost', XGBClassifier()),
])


#2. split the data into train and test sets, then fit with X_train and y_train
X_train, X_test, y_train, y_test = train_test_split(
    df.text,
    df.label_num,
    test_size= 0.2,
    random_state=2022
)


model.fit(X_train, y_train)

#3. get the predictions for X_test and store them in y_pred

y_pred = model.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.72      0.77       998
           1       0.82      0.58      0.68       700
           2       0.71      0.90      0.79      1229

    accuracy                           0.76      2927
   macro avg       0.78      0.73      0.74      2927
weighted avg       0.77      0.76      0.75      2927
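
The report lists precision, recall, F1-score, and support for each class. As a rough illustration of how those columns are defined, the sketch below recomputes them in plain Python for a tiny made-up set of labels. `y_true_toy` and `y_pred_toy` are hypothetical and unrelated to the model's actual predictions, and `per_class_metrics` is a helper written for this example, not a scikit-learn function.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and support for every class label."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            true_pos[actual] += 1
        else:
            false_pos[predicted] += 1  # predicted this class, but it was wrong
            false_neg[actual] += 1     # missed an example of the actual class

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        n_predicted = true_pos[label] + false_pos[label]
        n_actual = true_pos[label] + false_neg[label]
        precision = true_pos[label] / n_predicted if n_predicted else 0.0
        recall = true_pos[label] / n_actual if n_actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": n_actual,
        }
    return metrics


# Hypothetical labels: class 2 is over-predicted, as a classifier biased
# toward one class might do.
y_true_toy = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 0, 2, 1, 2, 2, 2, 2]

toy_metrics = per_class_metrics(y_true_toy, y_pred_toy)
for label, values in toy_metrics.items():
    print(label, {name: round(value, 2) for name, value in values.items()})
```

In the toy data, class 2 absorbs the misclassified examples of the other classes, so it has perfect recall but lower precision. Class 2 in the report above shows the same pattern (recall 0.90, precision 0.71).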

