<h3>FastText encoding</h3>

This section encodes the captions with FastText and trains several machine learning models on the resulting vectors to perform sentiment analysis.


FastText is an extension of Word2Vec proposed by Facebook in 2016. Instead of feeding whole words into the neural network, FastText breaks each word into several character n-grams (sub-words). For instance, the tri-grams of the word apple are app, ppl, and ple (ignoring the markers for the start and end of the word). The embedding vector for apple is the sum of the vectors of all these n-grams. After training the neural network, we have embeddings for all the n-grams that occur in the training dataset. Rare words can now be represented properly, since it is highly likely that some of their n-grams also appear in other words.
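The n-gram idea above can be sketched in a few lines of plain Python. This is only an illustration: `char_ngrams`, `word_vector` and the two-dimensional `toy_vectors` are hypothetical names and values made up for this example, not part of the fasttext library. The actual library additionally wraps each word in `<` and `>` boundary markers, uses a range of n-gram lengths, and hashes the n-grams into buckets.

```python
def char_ngrams(word, n=3, boundaries=False):
    """Return the character n-grams of a word (illustrative helper)."""
    # Optionally wrap the word in boundary markers, as fastText does internally.
    token = f"<{word}>" if boundaries else word
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def word_vector(word, ngram_vectors, n=3):
    """Sum the vectors of all known n-grams of a word (toy version)."""
    dim = len(next(iter(ngram_vectors.values())))
    total = [0.0] * dim
    for gram in char_ngrams(word, n):
        vec = ngram_vectors.get(gram)
        if vec is not None:  # n-grams without a vector are skipped
            total = [t + v for t, v in zip(total, vec)]
    return total


# Made-up 2-dimensional n-gram vectors, for illustration only.
toy_vectors = {"app": [1.0, 0.0], "ppl": [0.0, 1.0], "ple": [0.5, 0.5]}

print(char_ngrams("apple"))                   # ['app', 'ppl', 'ple']
print(char_ngrams("apple", boundaries=True))  # ['<ap', 'app', 'ppl', 'ple', 'le>']
print(word_vector("apple", toy_vectors))      # [1.5, 1.5]
```

Because a word such as "apples" shares most of its tri-grams with "apple", it would receive a similar vector even if it never appeared in the training data.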

In [None]:
!pip install fasttext

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2
  Using cached pybind11-2.10.4-py3-none-any.whl (222 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp39-cp39-linux_x86_64.whl size=4395888 sha256=c61246a67515014476d9480da3d7e45880269baf0f576f5b9a5dae66fc13369e
  Stored in directory: /root/.cache/pip/wheels/64/57/bc/1741406019061d5664914b070bd3e71f6244648732bc96109e
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.10.4


In [None]:
import fasttext
import fasttext.util

In [None]:
fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')

Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz





In [None]:
ft.get_nearest_neighbors('good')

[(0.7517593502998352, 'bad'),
 (0.7426098585128784, 'great'),
 (0.7299689054489136, 'decent'),
 (0.7123614549636841, 'nice'),
 (0.6796907186508179, 'Good'),
 (0.6737031936645508, 'excellent'),
 (0.669592022895813, 'goood'),
 (0.6602178812026978, 'ggod'),
 (0.6479219794273376, 'semi-good'),
 (0.6417751908302307, 'good.Good')]

In [None]:
# Embed a whole string as a single 300-dimensional vector
vect = ft.get_sentence_vector("some string")
vect
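As a rough intuition for what `get_sentence_vector` returns: for pretrained models like this one, fastText is understood to divide each word vector by its L2 norm and then average the results. The sketch below mimics that with made-up two-dimensional vectors. `sentence_vector`, `l2_normalize` and `toy_word_vectors` are hypothetical names for this illustration. The sketch ignores library details such as the end-of-sentence token and the sub-word vectors used for out-of-vocabulary words.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; return None for a zero vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else None


def sentence_vector(sentence, word_vectors):
    """Average the unit-length vectors of the known words in a sentence."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for word in sentence.split():
        vec = word_vectors.get(word)
        unit = l2_normalize(vec) if vec is not None else None
        if unit is None:  # toy simplification: unknown words are skipped
            continue
        total = [t + u for t, u in zip(total, unit)]
        count += 1
    return [t / count for t in total] if count else total


# Made-up vectors, for illustration only.
toy_word_vectors = {"good": [3.0, 4.0], "film": [0.0, 2.0]}

# "good" normalizes to [0.6, 0.8] and "film" to [0.0, 1.0],
# so the average is approximately [0.3, 0.9].
print(sentence_vector("good film", toy_word_vectors))
```

Normalizing before averaging keeps words with large-magnitude vectors from dominating the sentence representation.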

In [None]:
import pandas as pd

# Assumes Google Drive is already mounted at /content/drive
df = pd.read_csv("/content/drive/MyDrive/nlp/spacy_preprocessed_labeledtext.csv", nrows=5000)
df

Unnamed: 0.1,Unnamed: 0,File Name,Caption,LABEL
0,0,1.txt,feel today legday jelly ache gym,negative
1,1,10.txt,absolute disgrace carriage Bangor half way sta...,negative
2,2,100.txt,Valentine 1 nephew elated little thing big goo...,positive
3,3,1000.txt,betterfeelingfilm RT Instagram day film powerl...,neutral
4,4,1001.txt,Zoe love rattle,positive
...,...,...,...,...
4864,4864,995.txt,OMG Eskom Man die LoadShedding powerless,positive
4865,4865,996.txt,Feelin love ValentinesDay care,positive
4866,4866,997.txt,blue eye beat,neutral
4867,4867,998.txt,LA CHUCHA LOUUU TE CHUPO LOS OJOS,neutral


In [None]:
# Make sure every caption is a string (missing values become the string 'nan')
df['Caption'] = df['Caption'].astype(str)

In [None]:
# Encode each caption as a FastText sentence vector
df['vector'] = df['Caption'].apply(ft.get_sentence_vector)

In [None]:
df

Unnamed: 0.1,Unnamed: 0,File Name,Caption,LABEL,vector
0,0,1.txt,feel today legday jelly ache gym,negative,"[0.023793615, 0.0010435967, 0.02133198, 0.0540..."
1,1,10.txt,absolute disgrace carriage Bangor half way sta...,negative,"[0.00859744, 0.00888663, -0.003554554, 0.08510..."
2,2,100.txt,Valentine 1 nephew elated little thing big goo...,positive,"[-0.016047364, -0.01680923, -0.011857084, 0.08..."
3,3,1000.txt,betterfeelingfilm RT Instagram day film powerl...,neutral,"[-0.0207995, 0.04026957, 0.013544136, 0.035259..."
4,4,1001.txt,Zoe love rattle,positive,"[0.02244046, 0.035908155, 0.06669589, 0.073361..."
...,...,...,...,...,...
4864,4864,995.txt,OMG Eskom Man die LoadShedding powerless,positive,"[0.0017393455, 0.0013330536, 0.0436339, 0.0512..."
4865,4865,996.txt,Feelin love ValentinesDay care,positive,"[0.023763092, 0.004180819, 0.056516614, 0.0438..."
4866,4866,997.txt,blue eye beat,neutral,"[0.014739843, -0.04131049, 0.03872008, 0.08647..."
4867,4867,998.txt,LA CHUCHA LOUUU TE CHUPO LOS OJOS,neutral,"[-0.0022629295, -0.03879671, -0.033079512, -0...."


In [None]:
df['label_num'] = df['LABEL'].map({'neutral': 0, 'positive': 1, 'negative': 2})

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

# Hold out 20% of the rows for testing (no random_state, so the split changes on each run)
X_train, X_test, y_train, y_test = train_test_split(df.vector.values, df.label_num, test_size=0.2)

# Each element is one sentence vector; stack them into 2D feature matrices
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# Fit the scaler on the training data only, then apply it to the test data
scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

# Train Gaussian Naive Bayes and evaluate it on the held-out set
clf = GaussianNB()
clf.fit(scaled_train_embed, y_train)
y_pred = clf.predict(scaled_test_embed)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.61      0.41      0.49       350
           1       0.64      0.72      0.68       343
           2       0.54      0.67      0.59       281

    accuracy                           0.59       974
   macro avg       0.59      0.60      0.59       974
weighted avg       0.60      0.59      0.59       974
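To read the report above: the f1-score of each class is the harmonic mean of its precision and recall, and the macro average is the unweighted mean of the per-class scores. A minimal check using the rounded values printed for classes 0 and 1 (the helper name `f1_from` is made up for this example) reproduces the reported numbers. Class 2 is left out of the first check because recomputing from inputs rounded to two decimals can differ by 0.01 from the value scikit-learn computes from unrounded counts.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the Naive Bayes report above.
reported = {0: (0.61, 0.41), 1: (0.64, 0.72)}

for label, (p, r) in reported.items():
    print(label, round(f1_from(p, r), 2))  # 0 -> 0.49, 1 -> 0.68

# Macro average: plain mean of the three reported per-class f1-scores.
macro_f1 = (0.49 + 0.68 + 0.59) / 3
print(round(macro_f1, 2))  # 0.59
```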



In [None]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Draw a fresh 80/20 split; the SVM and KNN cells below reuse these scaled arrays
X_train, X_test, y_train, y_test = train_test_split(df.vector.values, df.label_num, test_size=0.2)
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# Fit the scaler on the training data only, then apply it to the test data
scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

# Train a random forest with default settings and evaluate it on the held-out set
clf = RandomForestClassifier()
clf.fit(scaled_train_embed, y_train)
y_pred = clf.predict(scaled_test_embed)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.60      0.66      0.63       351
           1       0.76      0.71      0.74       340
           2       0.70      0.66      0.68       283

    accuracy                           0.68       974
   macro avg       0.69      0.68      0.68       974
weighted avg       0.69      0.68      0.68       974



In [None]:
from sklearn.svm import SVC # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(scaled_train_embed, y_train)

In [None]:
from sklearn.metrics import classification_report
y_pred = classifier.predict(scaled_test_embed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.58      0.63      0.61       351
           1       0.74      0.69      0.72       340
           2       0.68      0.66      0.67       283

    accuracy                           0.66       974
   macro avg       0.67      0.66      0.66       974
weighted avg       0.67      0.66      0.66       974



In [None]:
classifier = SVC(kernel='poly', random_state=0)
classifier.fit(scaled_train_embed, y_train)
y_pred = classifier.predict(scaled_test_embed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.60      0.58      0.59       351
           1       0.73      0.72      0.73       340
           2       0.65      0.68      0.67       283

    accuracy                           0.66       974
   macro avg       0.66      0.66      0.66       974
weighted avg       0.66      0.66      0.66       974



In [None]:
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(scaled_train_embed, y_train)
y_pred = classifier.predict(scaled_test_embed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.63      0.68      0.65       351
           1       0.80      0.71      0.75       340
           2       0.70      0.72      0.71       283

    accuracy                           0.70       974
   macro avg       0.71      0.70      0.70       974
weighted avg       0.71      0.70      0.70       974



In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(scaled_train_embed, y_train)

In [None]:
y_pred = knn.predict(scaled_test_embed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.59      0.36      0.45       351
           1       0.56      0.76      0.65       340
           2       0.56      0.58      0.57       283

    accuracy                           0.57       974
   macro avg       0.57      0.57      0.55       974
weighted avg       0.57      0.57      0.55       974



**Observations**:


*   FastText encoding gave higher accuracy than the CBOW, Skip-gram, spaCy and Gensim encodings.
*   FastText, an extension of Word2Vec that also uses sub-word information, performed better than the Word2Vec models on this dataset.
*   SVM (rbf kernel) gave the best accuracy at 70 percent, followed by the RandomForest classifier at 68 percent.
*   SVM (linear kernel) and SVM (poly kernel) each gave an accuracy of 66 percent.
*   Gaussian Naive Bayes (59 percent) and KNN with 7 neighbors (57 percent) gave the lowest accuracy.
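The cells above call `train_test_split` without a `random_state`, and the Naive Bayes model was evaluated on a different split than the other models (its class supports are 350/343/281 rather than 351/340/283). For a like-for-like comparison, one option is to fix the seed and evaluate every model on the same split. The sketch below shows the pattern on synthetic data from `make_classification`, which merely stands in for the FastText vectors. The sample sizes, seeds and the `models` and `results` names are assumptions for this illustration, and the accuracies it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the sentence vectors (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# One fixed split shared by every model, so the scores are comparable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(random_state=0),
    "SVC (rbf)": SVC(kernel="rbf", random_state=0),
    "KNN (k=7)": KNeighborsClassifier(n_neighbors=7),
}

results = {}
for name, model in models.items():
    # The pipeline fits the scaler on the training data only.
    pipeline = make_pipeline(MinMaxScaler(), model)
    pipeline.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, pipeline.predict(X_test))

for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in one pipeline also removes the need to repeat the scaling code in every cell.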



