# spaCy encoding


This notebook encodes the captions in the given dataset with spaCy and trains several machine learning models on the resulting vectors.


With an extension package such as `spacy-universal-sentence-encoder`, spaCy can use Universal Sentence Encoder embeddings of Docs, Spans and Tokens directly from TensorFlow Hub. This notebook instead uses the word vectors that ship with the `en_core_web_lg` pipeline.

spaCy also lets you share a single transformer or other token-to-vector (“tok2vec”) embedding layer between multiple components. It can even update the shared layer, performing multi-task learning. Reusing the tok2vec layer between components can make the pipeline run a lot faster and result in much smaller models. However, it can make the pipeline less modular and make it more difficult to swap components or retrain parts of the pipeline. Multi-task learning can affect accuracy (either positively or negatively), and may require some retuning of your hyperparameters.



In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("/content/drive/MyDrive/nlp/spacy_preprocessed_labeledtext.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,File Name,Caption,LABEL
0,0,1.txt,feel today legday jelly ache gym,negative
1,1,10.txt,absolute disgrace carriage Bangor half way sta...,negative
2,2,100.txt,Valentine 1 nephew elated little thing big goo...,positive
3,3,1000.txt,betterfeelingfilm RT Instagram day film powerl...,neutral
4,4,1001.txt,Zoe love rattle,positive


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df['LABEL'].value_counts()

neutral     1771
positive    1646
negative    1452
Name: LABEL, dtype: int64
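
The counts above also give a rough reference point for the accuracies reported later: a classifier that always predicts the most frequent class would be correct on that class's share of the rows. A minimal sketch of the arithmetic, using the counts printed above (the variable names are illustrative, not part of the original notebook):

```python
# Class counts copied from the value_counts() output above
counts = {"neutral": 1771, "positive": 1646, "negative": 1452}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a hypothetical model that always predicts the majority class
baseline_accuracy = counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 3))
```

This baseline is computed on the full dataset, so it is only an approximate yardstick for scores measured on a random 20% test split.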

In [None]:
df['label_num'] = df['LABEL'].map({'neutral': 0, 'positive': 1, 'negative': 2})

# Check the result on the first 5 rows
df.head(5)

Unnamed: 0.1,Unnamed: 0,File Name,Caption,LABEL,label_num
0,0,1.txt,feel today legday jelly ache gym,negative,2
1,1,10.txt,absolute disgrace carriage Bangor half way sta...,negative,2
2,2,100.txt,Valentine 1 nephew elated little thing big goo...,positive,1
3,3,1000.txt,betterfeelingfilm RT Instagram day film powerl...,neutral,0
4,4,1001.txt,Zoe love rattle,positive,1
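
`Series.map` with a dict returns `NaN` for any value that is not a key, so a misspelled or differently cased label would silently become a missing `label_num`. A small check along these lines can catch that. The toy series and the stray `Positive` value below are made up for illustration and do not come from the dataset:

```python
import pandas as pd

label_map = {"neutral": 0, "positive": 1, "negative": 2}

# Toy labels; the last one has stray capitalisation on purpose
labels = pd.Series(["negative", "positive", "neutral", "Positive"])

label_num = labels.map(label_map)

# Rows whose label was not found in the mapping come back as NaN
unmapped = labels[label_num.isna()]
print(unmapped.tolist())
```

On the real data, an empty `unmapped` result would suggest that every `LABEL` value was covered by the mapping.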


In [None]:
df.shape

(4869, 5)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4869 entries, 0 to 4868
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  4869 non-null   int64 
 1   File Name   4869 non-null   object
 2   Caption     4869 non-null   object
 3   LABEL       4869 non-null   object
 4   label_num   4869 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 190.3+ KB


In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,File Name,Caption,LABEL,label_num
0,0,1.txt,feel today legday jelly ache gym,negative,2
1,1,10.txt,absolute disgrace carriage Bangor half way sta...,negative,2
2,2,100.txt,Valentine 1 nephew elated little thing big goo...,positive,1
3,3,1000.txt,betterfeelingfilm RT Instagram day film powerl...,neutral,0
4,4,1001.txt,Zoe love rattle,positive,1


In [None]:
!python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load("en_core_web_lg")




In [None]:
nlp("feel today legday jelly ache gym").vector

array([ 1.68045175e+00,  8.15226734e-01, -1.20932662e+00, -5.10283351e-01,
       -8.26773345e-01, -1.25090659e+00, -3.69484335e-01,  1.86937511e+00,
       -9.36916649e-01,  1.23137343e+00,  2.54665017e+00,  7.98018277e-01,
       -1.26332445e-02,  2.32047486e+00,  2.38205504e+00, -2.12688828e+00,
        6.24416649e-01, -8.75853360e-01, -1.60570002e+00, -1.84330022e+00,
        9.68449891e-01,  1.58537996e+00,  2.72878408e-01, -2.08236146e+00,
       -2.49944970e-01, -4.17616695e-01, -4.86057848e-01, -1.17727339e+00,
        1.03483546e+00,  1.25202835e+00,  1.51832831e+00, -6.13253295e-01,
        6.02168823e-03, -2.07076693e+00,  1.55910015e+00,  1.34525168e+00,
        1.07888329e+00,  2.28481007e+00,  4.24191743e-01,  1.02048302e+00,
       -9.88665044e-01,  1.40706670e+00,  7.63133347e-01, -3.20828319e-01,
        2.57764280e-01,  1.49299657e+00, -4.64185067e-02, -2.08144522e+00,
       -6.32928252e-01, -5.26746690e-01,  2.25669503e+00,  3.14990687e-03,
        1.09429669e+00, -

In [None]:
df['Caption']= df['Caption'].astype(str)

In [None]:
df['vector'] = df['Caption'].apply(lambda text: nlp(text).vector)
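
Each entry in the `vector` column is one fixed-length vector per caption. For pipelines with static word vectors such as `en_core_web_lg`, spaCy's `Doc.vector` defaults to the average of the token vectors. The sketch below illustrates that averaging with NumPy only; the 4-dimensional token vectors are hypothetical stand-ins (the real vectors have 300 dimensions):

```python
import numpy as np

# Hypothetical 4-dimensional token vectors for a three-token caption
token_vectors = np.array([
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
])

# Mean pooling: average over the token axis to get one vector per caption
doc_vector = token_vectors.mean(axis=0)
print(doc_vector.shape, doc_vector)
```

Because the average ignores token order, two captions containing the same words in a different order would receive the same vector.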

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,File Name,Caption,LABEL,label_num,vector
0,0,1.txt,feel today legday jelly ache gym,negative,2,"[1.6804518, 0.81522673, -1.2093266, -0.5102833..."
1,1,10.txt,absolute disgrace carriage Bangor half way sta...,negative,2,"[-1.0308651, 0.71679777, -2.4452846, 0.2788891..."
2,2,100.txt,Valentine 1 nephew elated little thing big goo...,positive,1,"[0.48003995, -1.0579224, -1.9142222, -0.367658..."
3,3,1000.txt,betterfeelingfilm RT Instagram day film powerl...,neutral,0,"[0.82350415, -1.7657572, 0.62956333, -1.842519..."
4,4,1001.txt,Zoe love rattle,positive,1,"[1.3921299, -0.79847, -1.9492999, -2.2701333, ..."


In [None]:
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(df.vector.values, df.label_num, test_size=0.2)

# Each row holds one 1-D vector; stack them into 2-D arrays for scikit-learn
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# Fit the scaler on the training data only, then apply it to the test data
scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

clf = GaussianNB()
clf.fit(scaled_train_embed, y_train)

y_pred = clf.predict(scaled_test_embed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.52      0.34      0.41       355
           1       0.58      0.69      0.63       320
           2       0.54      0.65      0.59       299

    accuracy                           0.55       974
   macro avg       0.54      0.56      0.54       974
weighted avg       0.54      0.55      0.54       974
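
To read the report: the f1-score is the harmonic mean of precision and recall, and the macro average is the unweighted mean over the classes. A short worked check using the rounded precision and recall values from the Naive Bayes report above (the helper name `f1_score_from` is illustrative):

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the Naive Bayes report above
report = {0: (0.52, 0.34), 1: (0.58, 0.69), 2: (0.54, 0.65)}

f1_by_class = {label: round(f1_score_from(p, r), 2) for label, (p, r) in report.items()}

# Macro average: every class counts equally, regardless of its support
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

print(f1_by_class, round(macro_f1, 2))
```

Here the recomputed values agree with the f1-score column. In general, working from rounded inputs can differ from the report in the last digit.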



In [None]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# A fresh random split is drawn here, so the test set differs from the Naive Bayes cell above
X_train, X_test, y_train, y_test = train_test_split(df.vector.values, df.label_num, test_size=0.2)
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# Fit the scaler on the training data only, then apply it to the test data
scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

clf = RandomForestClassifier()
clf.fit(scaled_train_embed, y_train)

y_pred = clf.predict(scaled_test_embed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.63      0.57      0.60       367
           1       0.62      0.73      0.67       300
           2       0.64      0.60      0.62       307

    accuracy                           0.63       974
   macro avg       0.63      0.63      0.63       974
weighted avg       0.63      0.63      0.63       974
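
The cells above call `train_test_split` without `random_state`, so every call draws a different split, which is why the class supports in the two reports differ. A minimal sketch of a fixed, stratified split that could be created once and reused by every model. The synthetic `X` and `y` stand in for the stacked vectors and `label_num`, and the seed value 42 is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))    # stand-in for the stacked document vectors
y = np.repeat([0, 1, 2], 100)    # stand-in for label_num, 100 rows per class

# random_state fixes the shuffle; stratify keeps class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Calling again with the same arguments reproduces the same split
_, _, _, y_test_again = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape, np.bincount(y_test))
```

With one shared split, differences between the reports reflect the models and not the sampling of the test set.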



In [None]:
from sklearn.svm import SVC # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(scaled_train_embed, y_train)

In [None]:
from sklearn.metrics import classification_report
y_pred = classifier.predict(scaled_test_embed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.56      0.63      0.59       367
           1       0.69      0.74      0.71       300
           2       0.70      0.54      0.61       307

    accuracy                           0.64       974
   macro avg       0.65      0.64      0.64       974
weighted avg       0.64      0.64      0.64       974



In [None]:
classifier = SVC(kernel='poly', random_state=0)
classifier.fit(scaled_train_embed, y_train)
y_pred = classifier.predict(scaled_test_embed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.62      0.65      0.64       367
           1       0.70      0.73      0.71       300
           2       0.69      0.63      0.66       307

    accuracy                           0.67       974
   macro avg       0.67      0.67      0.67       974
weighted avg       0.67      0.67      0.67       974



In [None]:
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(scaled_train_embed, y_train)
y_pred = classifier.predict(scaled_test_embed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.56      0.63      0.59       367
           1       0.69      0.73      0.71       300
           2       0.67      0.54      0.59       307

    accuracy                           0.63       974
   macro avg       0.64      0.63      0.63       974
weighted avg       0.64      0.63      0.63       974
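
The three SVC cells above differ only in the `kernel` argument, so the comparison can also be written as a loop over one shared split. The sketch below assumes synthetic data from `make_classification` in place of the caption vectors, and wraps the scaler and classifier in a pipeline so that scaling is always fitted on the training part only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the document vectors and label_num
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

accuracy_by_kernel = {}
for kernel in ("linear", "poly", "rbf"):
    # Same preprocessing and same split for every kernel
    model = make_pipeline(MinMaxScaler(), SVC(kernel=kernel, random_state=0))
    model.fit(X_train, y_train)
    accuracy_by_kernel[kernel] = model.score(X_test, y_test)

for kernel, accuracy in accuracy_by_kernel.items():
    print(f"{kernel}: {accuracy:.2f}")
```

The accuracies printed for this synthetic data say nothing about the caption dataset; only the structure of the comparison carries over.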



In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(scaled_train_embed, y_train)
# Predict on the held-out test set, which the model has not seen
print(knn.predict(scaled_test_embed))

[0 1 2 1 0 0 1 2 1 0 1 1 1 0 2 0 1 1 1 2 0 1 2 1 1 1 1 1 0 1 1 2 0 1 0 1 1
 1 0 0 1 2 0 0 0 0 0 0 0 1 2 0 1 1 1 1 1 1 1 1 1 2 2 2 0 1 1 0 0 1 0 0 0 0
 2 0 1 0 0 1 1 1 1 0 2 1 1 2 0 1 1 0 1 0 0 1 0 2 2 0 1 1 1 0 2 0 0 2 2 2 1
 0 0 0 1 1 1 0 1 2 0 0 1 1 1 0 1 2 0 0 0 0 1 0 1 2 1 0 0 1 1 1 2 1 2 1 2 0
 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 1 1 1 0 2 2 1 2 0 1 1 1 0 0 1 0 0 0 0
 1 0 1 2 1 1 1 0 1 1 0 2 0 2 0 1 1 1 1 1 0 0 1 0 1 1 2 1 1 0 1 0 2 0 2 0 2
 1 0 2 1 0 2 2 1 2 0 1 1 0 0 0 2 1 2 0 0 1 2 2 2 0 1 2 1 1 0 2 1 2 0 2 0 1
 1 0 0 0 1 0 2 0 2 1 1 2 0 0 1 0 1 1 2 0 2 2 0 1 0 1 1 0 1 0 0 0 1 0 1 2 0
 0 0 1 1 0 2 0 1 0 2 0 1 0 0 0 2 1 1 0 1 1 1 1 1 0 2 2 1 2 2 1 1 0 1 0 0 0
 2 1 1 0 0 0 0 1 0 0 0 1 1 1 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 2 2 1 0 2 2 2
 1 0 1 2 1 0 1 0 0 1 1 0 0 2 2 2 1 1 0 1 1 2 1 0 0 1 2 1 1 1 2 1 0 0 1 0 0
 1 1 0 1 0 1 1 1 0 2 1 2 2 2 2 1 2 1 1 0 0 2 2 1 0 2 2 0 1 2 0 2 1 1 1 0 1
 1 1 0 0 1 1 1 1 0 1 1 0 1 2 0 1 2 1 1 0 2 1 0 1 2 0 1 2 0 1 2 2 2 0 0 0 1
 1 0 0 2 1 0 2 1 1 0 1 0 

In [None]:
y_pred = knn.predict(scaled_test_embed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.53      0.52      0.53       367
           1       0.47      0.66      0.55       300
           2       0.54      0.34      0.42       307

    accuracy                           0.51       974
   macro avg       0.51      0.51      0.50       974
weighted avg       0.51      0.51      0.50       974
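
The KNN cell above fixes `n_neighbors=7`. One common way to choose that value is cross-validation on the training data, sketched below under the assumption of synthetic data and an arbitrary list of candidate values (neither comes from the original notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic 3-class training data standing in for the scaled caption vectors
X_train, y_train = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

candidate_ks = [3, 5, 7, 11, 15]  # illustrative choices
mean_accuracy = {}
for k in candidate_ks:
    model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    # 5-fold cross-validation; the scaler is refitted inside each fold
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    mean_accuracy[k] = scores.mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
print(best_k, round(mean_accuracy[best_k], 2))
```

The test set stays untouched during this search, so the final report remains an honest estimate for the chosen `k`.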



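**Observation**:

*   The support vector classifier with the polynomial kernel gave the highest accuracy, 0.67.
*   The support vector classifier with the linear kernel reached 0.64.
*   Random forest and the SVM with the RBF kernel both reached 0.63, Naive Bayes 0.55, and KNN (k = 7) 0.51.
*   Naive Bayes was evaluated on a different random train/test split from the other models (the class supports in its report differ), so its score is only approximately comparable.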