### **spacy_text_classification : Exercise**


- In this exercise, you will classify a given text into one of three possible classes: ['BUSINESS', 'SPORTS', 'CRIME'].

- You will use spaCy to preprocess the text, convert the text to numeric vectors, and apply different classification algorithms.

In [27]:
#uncomment the line below and run this cell to install the large English model (en_core_web_lg), which ships with word vectors

# !python -m spacy download en_core_web_lg

In [28]:
#import spacy and load the downloaded language model

import spacy
nlp = spacy.load("en_core_web_lg")


### **About Data: News Category Classifier**

Credits: https://www.kaggle.com/code/hengzheng/news-category-classifier-val-acc-0-65


- This data consists of two columns:
        - Text
        - Category
- Text is a description of a particular topic.
- Category determines which class the text belongs to.
- The classes are 'BUSINESS', 'SPORTS', and 'CRIME', so this is a **multi-class** classification problem.

In [29]:
#import pandas library
import pandas as pd


#read the provided dataset "news_dataset.json" and load it into the dataframe "df"
df = pd.read_json("news_dataset.json")

#check the shape of the data
df.shape

#show the top 5 rows
df.head()

Unnamed: 0,text,category
0,"Larry Nassar Blames His Victims, Says He 'Was ...",CRIME
1,"Woman Beats Cancer, Dies Falling From Horse",CRIME
2,Vegas Taxpayers Could Spend A Record $750 Mill...,SPORTS
3,This Richard Sherman Interception Literally Sh...,SPORTS
4,7 Things That Could Totally Kill Weed Legaliza...,BUSINESS


In [30]:
#check the distribution of labels 
df['category'].value_counts()


category
CRIME       2500
SPORTS      2500
BUSINESS    2500
Name: count, dtype: int64

In [31]:
#Add the new column "label_num" which gives a unique number to each of these labels 
df["label_num"]=df['category'].map({
    "CRIME":0,
    "SPORTS":1,
    "BUSINESS":2
})


#check the results with top 5 rows
df.head()

Unnamed: 0,text,category,label_num
0,"Larry Nassar Blames His Victims, Says He 'Was ...",CRIME,0
1,"Woman Beats Cancer, Dies Falling From Horse",CRIME,0
2,Vegas Taxpayers Could Spend A Record $750 Mill...,SPORTS,1
3,This Richard Sherman Interception Literally Sh...,SPORTS,1
4,7 Things That Could Totally Kill Weed Legaliza...,BUSINESS,2


### **Preprocess the text**

In [33]:
#use this utility function to preprocess the text
#1. Remove the stop words
#2. Convert to base form using lemmatisation

def preprocess(text):
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    return ' '.join(filtered_tokens)


In [35]:
#create a new column "newText" that stores the cleaned form of the given text [use apply with the preprocess function]

df["newText"] = df["text"].apply(preprocess)

In [36]:
#display the dataframe (pandas shows only the first and last rows)
df

Unnamed: 0,text,category,label_num,newText
0,"Larry Nassar Blames His Victims, Says He 'Was ...",CRIME,0,Larry Nassar blame victim say victimize newly ...
1,"Woman Beats Cancer, Dies Falling From Horse",CRIME,0,woman Beats Cancer die fall horse
2,Vegas Taxpayers Could Spend A Record $750 Mill...,SPORTS,1,vegas taxpayer spend Record $ 750 million New ...
3,This Richard Sherman Interception Literally Sh...,SPORTS,1,Richard Sherman Interception literally shake W...
4,7 Things That Could Totally Kill Weed Legaliza...,BUSINESS,2,7 thing totally kill Weed Legalization Buzz
...,...,...,...,...
7495,Sex Offender Registries Are Not Really Keeping...,CRIME,0,sex offender registry keep child safe problem ...
7496,'Stockbroker's Bible' Just Told Oil Industry T...,BUSINESS,2,Stockbroker Bible tell Oil Industry accept dem...
7497,Want to Change It? Scale It!,BUSINESS,2,want change scale
7498,"How To Make A Billion Dollar Drug In 1961, new...",BUSINESS,2,billion Dollar Drug 1961 newspaper world run s...


### **Get the spacy embeddings for each preprocessed text**

In [43]:
#create a new column "vector" that stores the vector representation of each preprocessed text
df['vector'] = df['newText'].apply(lambda text: nlp(text).vector)
df['vector']

0       [-0.5585511, -0.29323253, -0.9253956, 0.189389...
1       [-0.73039824, -0.43196002, -1.2930516, -1.0628...
2       [-1.9413117, 0.121578515, -3.2996283, 1.511650...
3       [-1.4702771, -0.685319, 0.57398, -0.31135806, ...
4       [-1.037173, -1.9495698, -1.7179357, 1.2975286,...
                              ...                        
7495    [-0.80910146, 1.0078055, -2.4174294, 0.2242247...
7496    [0.9950101, -0.58799165, 0.01528129, 0.7908599...
7497    [1.4338999, 2.9818058, -5.5303, 0.044243336, 0...
7498    [-0.23529872, -0.12220071, -1.9055535, -1.0336...
7499    [-0.7867514, 0.022580221, -0.9533115, -0.46140...
Name: vector, Length: 7500, dtype: object
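
As a rough illustration of what `nlp(text).vector` returns: spaCy's `Doc.vector` defaults to the average of the token vectors, so every text ends up with a vector of the same length no matter how many tokens it has. The sketch below reproduces that averaging with NumPy on made-up 4-dimensional token vectors. The values and the `token_vectors` name are hypothetical; the vectors in `en_core_web_lg` have 300 dimensions.

```python
import numpy as np

# Hypothetical 4-dimensional token vectors for a three-token text
# (assumption for illustration; the real model uses 300 dimensions).
token_vectors = np.array([
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
], dtype=np.float32)

# Averaging over the token axis gives one fixed-length vector per text.
doc_vector = token_vectors.mean(axis=0)

print(doc_vector)        # [2. 2. 1. 0.]
print(doc_vector.shape)  # (4,)
```

Because the result is an average, word order is lost; two texts with the same tokens in a different order would get the same vector.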

In [41]:
#print the top 5 rows
df.head()

Unnamed: 0,text,category,label_num,newText,vector
0,"Larry Nassar Blames His Victims, Says He 'Was ...",CRIME,0,Larry Nassar blame victim say victimize newly ...,"[-0.5585511, -0.29323253, -0.9253956, 0.189389..."
1,"Woman Beats Cancer, Dies Falling From Horse",CRIME,0,woman Beats Cancer die fall horse,"[-0.73039824, -0.43196002, -1.2930516, -1.0628..."
2,Vegas Taxpayers Could Spend A Record $750 Mill...,SPORTS,1,vegas taxpayer spend Record $ 750 million New ...,"[-1.9413117, 0.121578515, -3.2996283, 1.511650..."
3,This Richard Sherman Interception Literally Sh...,SPORTS,1,Richard Sherman Interception literally shake W...,"[-1.4702771, -0.685319, 0.57398, -0.31135806, ..."
4,7 Things That Could Totally Kill Weed Legaliza...,BUSINESS,2,7 thing totally kill Weed Legalization Buzz,"[-1.037173, -1.9495698, -1.7179357, 1.2975286,..."


In [46]:
df.vector.values

array([array([-0.5585511 , -0.29323253, -0.9253956 ,  0.18938938,  1.0181136 ,
               1.7050675 ,  0.700774  ,  2.2029855 , -1.7906338 , -0.5034125 ,
               1.6184038 ,  0.61051875, -1.3079705 ,  0.584547  ,  0.63700706,
              -0.8729482 ,  1.00014   , -1.4759021 ,  0.17712572,  0.61367625,
              -0.29666373,  0.7998125 , -0.03366186, -1.3914751 ,  0.02639747,
              -0.29605618,  0.82793623,  0.04722439, -0.18659225,  0.61112875,
               0.5447923 ,  0.70491487, -0.23620602,  1.504215  , -0.65176713,
              -0.487085  , -0.6281269 , -0.86626047,  1.6398525 ,  1.2386812 ,
              -2.010722  , -0.7159607 ,  0.4391331 ,  1.1089885 , -0.70660996,
              -0.7859553 ,  0.13826066, -2.4608564 , -0.25043505,  1.6539725 ,
              -0.23573613,  0.36592993,  1.5811756 , -2.9950502 , -1.1908113 ,
               0.8907906 ,  0.65194976,  0.7145287 ,  0.68849564,  1.0460285 ,
               1.8116534 ,  1.0965936 , -1.250417  ,

**Train-Test splitting**

In [54]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(df.vector.values, 
    df.label_num, 
    test_size=0.2, # 20% samples will go to test dataset
    random_state=2022,
    stratify=df.label_num)
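
To see what `stratify` does in the split above, here is a small sketch on synthetic data (the arrays `X` and `y` below are made up and are not the notebook's data). With a stratified split, each class should keep the same share in the training and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 30 samples, 3 balanced classes (10 samples each).
X = np.arange(60, dtype=np.float32).reshape(30, 2)
y = np.repeat([0, 1, 2], 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.2,      # 20% of the samples go to the test set
    random_state=2022,
    stratify=y,         # keep the class proportions in both splits
)

# Count how many samples of each class landed in each split.
print(np.bincount(y_tr))  # [8 8 8]
print(np.bincount(y_te))  # [2 2 2]
```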


In [55]:
x_train

array([array([-0.75421077,  0.6757007 , -0.8103016 , -0.36451474,  1.943962  ,
               0.89710534,  1.6667575 ,  2.2389388 , -1.4035333 , -0.04789238,
               2.1404338 ,  1.0904554 , -1.4379984 , -0.96389085, -0.09045536,
               0.59112   , -0.31627148, -1.4955684 , -0.7992073 ,  0.4178073 ,
               0.8412622 ,  1.2190624 ,  1.0316141 , -1.2051774 ,  1.0414815 ,
              -0.7621094 , -0.644522  ,  0.07541333,  0.48121196, -0.5205901 ,
               1.4094892 ,  0.8018907 ,  0.36769167,  0.65687066, -0.8550199 ,
              -1.3493154 , -1.7030555 ,  0.34671888, -0.00570598,  0.48409203,
              -0.27079943,  0.05977068,  1.2400585 , -0.28377095, -0.58447367,
               0.6155707 ,  0.08372927, -1.3828926 , -0.47318897,  1.3583547 ,
               0.11905539, -0.22138388,  1.8554968 , -0.50412124, -0.0278773 ,
              -1.1545568 ,  0.60336   ,  0.03980036,  0.11591005,  0.16612536,
               1.8213847 ,  0.8264089 , -0.8314514 ,

**Reshape x_train and x_test into 2D arrays so the models can use them**

In [58]:
import numpy as np

#stack the per-sample vectors in x_train and x_test into 2D arrays using numpy's 'stack' function. Store the results in the new variables "x_train2" and "x_test2"
x_train2 = np.stack(x_train)
x_test2 = np.stack(x_test)
x_train2


array([[-0.75421077,  0.6757007 , -0.8103016 , ...,  0.11203392,
        -1.2481873 ,  0.8206914 ],
       [-0.30570197,  0.17626004,  2.2580261 , ..., -0.717418  ,
        -2.4548218 , -0.8858727 ],
       [-0.3888616 ,  0.7500783 , -0.27698502, ..., -1.1810415 ,
        -0.8416365 ,  0.11569308],
       ...,
       [-2.94532   ,  0.236612  , -0.165432  , ..., -1.304252  ,
         0.31972402,  0.944558  ],
       [-2.260163  , -0.9833932 , -1.0096097 , ..., -0.32583067,
        -0.3160187 ,  1.9718864 ],
       [ 0.9070337 ,  2.0025    ,  0.29584482, ...,  0.13635537,
         0.19664921,  0.6167868 ]], dtype=float32)
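
The reason for this step can be sketched on a toy example: a pandas column of vectors comes out as a 1D object array whose elements are themselves arrays, while scikit-learn estimators expect a 2D numeric array of shape (n_samples, n_features). The names `vectors` and `stacked` below are hypothetical and the data is made up.

```python
import numpy as np

# A 1D object array whose elements are 1D float arrays,
# similar in structure to df.vector.values.
vectors = np.empty(3, dtype=object)
for i in range(3):
    vectors[i] = np.full(4, float(i), dtype=np.float32)

print(vectors.shape)   # (3,)

# np.stack joins the inner arrays along a new first axis.
stacked = np.stack(vectors)

print(stacked.shape)   # (3, 4)
print(stacked.dtype)   # float32
```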

**Attempt 1:**


- use spaCy word embeddings for text vectorization.

- use Decision Tree as the classifier.

- print the classification report.

In [60]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

#1. create a Decision Tree model object
clf = DecisionTreeClassifier()

#2. fit with x_train2 and y_train
clf.fit(x_train2, y_train)

#3. get the predictions for x_test2 and store them in y_pred
y_pred = clf.predict(x_test2)

#4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.73      0.75      0.74       500
           1       0.69      0.70      0.70       500
           2       0.73      0.71      0.72       500

    accuracy                           0.72      1500
   macro avg       0.72      0.72      0.72      1500
weighted avg       0.72      0.72      0.72      1500



**Attempt 2:**


- use spaCy word embeddings for text vectorization.
- use MultinomialNB as the classifier after applying the MinMaxScaler.
- print the classification report.

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report



#scale the features first, because MultinomialNB does not accept negative values



#1. create a MultinomialNB model object



#2. fit with the scaled x_train2 and y_train



#3. get the predictions for the scaled x_test2 and store them in y_pred



#4. print the classification report
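
One possible shape for this attempt, shown on synthetic data so it runs without the dataset or the spaCy model. All names ending in `_demo` and the generated data are assumptions for illustration; in the notebook you would use `x_train2`, `x_test2`, `y_train`, and `y_test` instead. The scaler is fitted on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic 20-dimensional "embeddings" for 3 classes (made-up data).
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 20))
y_train_demo = np.repeat([0, 1, 2], 40)
y_test_demo = np.repeat([0, 1, 2], 10)
X_train_demo = centers[y_train_demo] + 0.3 * rng.normal(size=(120, 20))
X_test_demo = centers[y_test_demo] + 0.3 * rng.normal(size=(30, 20))

# MultinomialNB raises an error on negative inputs, so map every
# feature into the range [0, 1] using statistics from the training set.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)

# Test values outside the training range could fall below 0, so clip them.
X_test_scaled = np.clip(scaler.transform(X_test_demo), 0.0, 1.0)

clf_nb = MultinomialNB()
clf_nb.fit(X_train_scaled, y_train_demo)
y_pred_demo = clf_nb.predict(X_test_scaled)

print(classification_report(y_test_demo, y_pred_demo))
```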


**Attempt 3:**


- use spaCy word embeddings for text vectorization.
- use KNeighborsClassifier as the classifier after applying the MinMaxScaler.
- print the classification report.

In [None]:
from sklearn.neighbors import KNeighborsClassifier


#1. creating a KNN model object



#2. fit with the scaled x_train2 and y_train



#3. get the predictions for the scaled x_test2 and store them in y_pred



#4. print the classification report
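
A hedged sketch of how this attempt could look, again on synthetic data from `make_classification` rather than the news dataset. Wrapping the scaler and the classifier in a pipeline is one option, not the exercise's required approach; `n_neighbors=5` and `metric="euclidean"` are illustrative choices, and the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up 3-class data standing in for the stacked embeddings.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2022, stratify=y_demo
)

# The pipeline fits the scaler on the training data only and applies
# the same scaling automatically when predicting.
knn = make_pipeline(
    MinMaxScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
)
knn.fit(X_tr, y_tr)
y_pred_demo = knn.predict(X_te)

print(classification_report(y_te, y_pred_demo))
```

KNN relies on distances between samples, so features on very different scales can dominate the result; that is the usual motivation for scaling before this classifier.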


**Attempt 4:**


- use spaCy word embeddings for text vectorization.
- use RandomForestClassifier as the classifier after applying the MinMaxScaler.
- print the classification report.

In [None]:
from sklearn.ensemble import RandomForestClassifier


#1. creating a Random Forest model object



#2. fit with the scaled x_train2 and y_train



#3. get the predictions for the scaled x_test2 and store them in y_pred



#4. print the classification report


**Attempt 5:**


- use spaCy word embeddings for text vectorization.
- use GradientBoostingClassifier as the classifier after applying the MinMaxScaler.
- print the classification report.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier


#1. creating a GradientBoosting model object



#2. fit with the scaled x_train2 and y_train



#3. get the predictions for the scaled x_test2 and store them in y_pred



#4. print the classification report
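
The two tree-ensemble attempts (RandomForestClassifier and GradientBoostingClassifier) follow the same fit and predict pattern, so one loop can cover both. The sketch below uses synthetic data and small, arbitrary `n_estimators` values to keep it fast; the `models` and `scores` names are hypothetical, and the accuracies it prints say nothing about the news dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up 3-class data standing in for the stacked embeddings.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2022, stratify=y_demo
)

# Scale as the exercise asks; the scaler is fitted on training data only.
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr_scaled, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te_scaled))
    print(name, round(scores[name], 2))
```

Tree-based models split on feature thresholds, so min-max scaling generally does not change their predictions much; it is kept here only to match the exercise instructions.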


**Print the confusion matrix for the best model**

In [None]:
#finally print the confusion matrix for the best model: GradientBoostingClassifier

# from sklearn.metrics import confusion_matrix
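
To read the output of `confusion_matrix`, a tiny hand-checkable example may help. The labels below are made up; in the notebook you would pass `y_test` and the predictions of your best model. Rows correspond to the true labels and columns to the predicted labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for 7 samples.
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 2]]
```

The diagonal holds the correct predictions. For example, the entry in row 2, column 0 says that one sample of class 2 was predicted as class 0.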





## [**Solution**](./spacy_word_embeddings_solution.ipynb)