# Using fastText for text classification
In this notebook we go over how to use pretrained fastText word embeddings to convert text data into vectors. We then feed the resulting vectors into an MLP (multilayer perceptron) classifier.

In [1]:
from gensim.models import KeyedVectors


In [2]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
!unzip wiki-news-300d-1M.vec.zip

--2021-10-08 01:27:06--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 104.22.75.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: ‘wiki-news-300d-1M.vec.zip’


2021-10-08 01:27:38 (20.9 MB/s) - ‘wiki-news-300d-1M.vec.zip’ saved [681808098/681808098]

Archive:  wiki-news-300d-1M.vec.zip
  inflating: wiki-news-300d-1M.vec   


In [3]:
model = KeyedVectors.load_word2vec_format("/content/wiki-news-300d-1M.vec")
print(model.most_similar('desk'))

[('desks', 0.7923153638839722), ('Desk', 0.6869951486587524), ('desk.', 0.6602819561958313), ('desk-', 0.6187258958816528), ('credenza', 0.5955315828323364), ('roll-top', 0.5875717401504517), ('rolltop', 0.5837830305099487), ('bookshelf', 0.5758029222488403), ('Desks', 0.5755287408828735), ('sofa', 0.5617446899414062)]
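
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that computation, the snippet below uses made-up 3-dimensional vectors (the values and the `toy` dictionary are invented for illustration and are not taken from the fastText file):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, invented for illustration only.
toy = {
    "desk": [1.0, 0.0, 1.0],
    "desks": [1.0, 0.1, 0.9],
    "juice": [0.0, 1.0, 0.0],
}

sim_close = cosine_similarity(toy["desk"], toy["desks"])
sim_far = cosine_similarity(toy["desk"], toy["juice"])
print(sim_close, sim_far)
```

Vectors pointing in nearly the same direction score close to 1, while orthogonal vectors score 0.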


In [4]:
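# gensim 4.x exposes the vocabulary as key_to_index; gensim 3.x used vocab
vocab = model.key_to_index if hasattr(model, "key_to_index") else model.vocab
words = list(vocab)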

In [5]:
print("Vector components of a word: {}".format(model[words[0]]))

Vector components of a word: [ 1.0730e-01  8.9000e-03  6.0000e-04  5.5000e-03 -6.4600e-02 -6.0000e-02
  4.5000e-02 -1.3300e-02 -3.5700e-02  4.3000e-02 -3.5600e-02 -3.2000e-03
  7.3000e-03 -1.0000e-04  2.5800e-02 -1.6600e-02  7.5000e-03  6.8600e-02
  3.9200e-02  7.5300e-02  1.1500e-02 -8.7000e-03  4.2100e-02  2.6500e-02
 -6.0100e-02  2.4200e-01  1.9900e-02 -7.3900e-02 -3.1000e-03 -2.6300e-02
 -6.2000e-03  1.6800e-02 -3.5700e-02 -2.4900e-02  1.9000e-02 -1.8400e-02
 -5.3700e-02  1.4200e-01  6.0000e-02  2.2600e-02 -3.8000e-03 -6.7500e-02
 -3.6000e-03 -8.0000e-03  5.7000e-02  2.0800e-02  2.2300e-02 -2.5600e-02
 -1.5300e-02  2.2000e-03 -4.8200e-02  1.3100e-02 -6.0160e-01 -8.8000e-03
  1.0600e-02  2.2900e-02  3.3600e-02  7.1000e-03  8.8700e-02  2.3700e-02
 -2.9000e-02 -4.0500e-02 -1.2500e-02  1.4700e-02  4.7500e-02  6.4700e-02
  4.7400e-02  1.9900e-02  4.0800e-02  3.2200e-02  3.6000e-03  3.5000e-02
 -7.2300e-02 -3.0500e-02  1.8400e-02 -2.6000e-03  2.4000e-02 -1.6000e-02
 -3.0800e-02  4.3400e-

# The Problem
We will use fastText word embeddings to classify sentences with the scikit-learn MLP classifier. The sentences below are already tokenized.

In [6]:
sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
             ['this', 'is', 'another', 'machine', 'learning', 'book'],
             ['one', 'more', 'new', 'book'],
             ['this', 'is', 'about', 'machine', 'learning', 'post'],
             ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'fruit'],
             ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
             ['this', 'is', 'the', 'last', 'machine', 'learning', 'book'],
             ['orange', 'juice', 'comes', 'in', 'several', 'different', 'packages'],
             ['orange', 'juice', 'is', 'liquid', 'extract', 'from', 'fruit', 'on', 'orange', 'tree']]

The problem of classifying the sentences can be visualized by the following flowchart: <img src="https://bit.ly/3oCtx5U">

# Data preparation
We first need to convert the text into numeric form. For each sentence, we look up the embedding of every word and average the embeddings over the sentence. The resulting sentence vectors are stored in the list V.

In [9]:
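import numpy as np

def sent_vectorizer(sent, model):
    """Average the embeddings of the words in `sent` that the model knows."""
    sent_vec = np.zeros(model.vector_size)
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except KeyError:
            # Skip words that are missing from the embedding vocabulary
            pass
    # Guard against division by zero when no word was found
    return sent_vec / numw if numw else sent_vec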

V = []
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))
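
To make the averaging step concrete without loading the 300-dimensional model, here is a minimal sketch on hypothetical 2-dimensional embeddings. The `toy_vectors` dictionary and the `mean_vector` helper are assumptions introduced for illustration; they are not part of gensim or of this notebook's code.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, invented for illustration.
toy_vectors = {
    "orange": np.array([1.0, 0.0]),
    "juice": np.array([0.0, 1.0]),
    "book": np.array([2.0, 2.0]),
}

def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

avg = mean_vector(["orange", "juice", "unknownword"], toy_vectors, dim=2)
empty = mean_vector(["unknownword"], toy_vectors, dim=2)
print(avg)    # [0.5 0.5]
print(empty)  # [0. 0.]
```

Out-of-vocabulary tokens are simply ignored, so the average is taken over the known words only. Note that averaging discards word order: two sentences with the same words in a different order get the same vector.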

After converting the text to vectors, we can divide the data into training and testing datasets and attach class labels (0 for machine learning sentences, 1 for orange juice sentences).

In [13]:
X_train = V[0:6]
X_test = V[6:9]

Y_train = [0, 0, 0, 0, 1, 1]
Y_test = [0, 1, 1]
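
The split above is done by hand with list slices. As an alternative (not what this notebook does), scikit-learn's `train_test_split` can produce a shuffled, stratified split. The sketch below uses random stand-in features (`X_demo`) in place of the real sentence vectors, so it runs without the fastText model; the labels follow the sentence order used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in features: 9 random vectors in place of the sentence vectors V.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(9, 4))
# Labels of the nine sentences, in their original order.
y_demo = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1])

# Hold out 3 samples, keeping both classes represented in the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=3, stratify=y_demo, random_state=0
)
print(X_tr.shape, X_te.shape)
```

Stratifying keeps the class proportions roughly equal in both parts, which matters with a dataset this small.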

# Text classification 
It's time to load the MLP classifier for the text classification.

In [16]:
from sklearn.neural_network import MLPClassifier
import pandas as pd


In [14]:
classifier = MLPClassifier(alpha = 0.7, max_iter=400)
classifier.fit(X_train, Y_train)

MLPClassifier(activation='relu', alpha=0.7, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=400,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
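
The printed parameters show `random_state=None`, so the weight initialization differs between runs and the fitted model can vary. A minimal sketch of making the fit reproducible by fixing `random_state` follows; the synthetic two-cluster data (`X_syn`, `y_syn`) is an assumption used only so the example runs on its own.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic, well-separated two-class data (illustrative only).
rng = np.random.default_rng(0)
X_syn = np.vstack([rng.normal(-2.0, 0.5, (20, 5)),
                   rng.normal(2.0, 0.5, (20, 5))])
y_syn = np.array([0] * 20 + [1] * 20)

# Same hyperparameters as in the notebook, plus a fixed seed.
clf_a = MLPClassifier(alpha=0.7, max_iter=400, random_state=42).fit(X_syn, y_syn)
clf_b = MLPClassifier(alpha=0.7, max_iter=400, random_state=42).fit(X_syn, y_syn)

# Two fits with the same seed give the same predicted probabilities.
same = np.allclose(clf_a.predict_proba(X_syn), clf_b.predict_proba(X_syn))
print(same)
```

Here `alpha` is the strength of the L2 regularization term and `max_iter` caps the number of training epochs for the default `adam` solver.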

In [18]:
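# Results table; note that np.zeros creates a placeholder row 0 filled with zeros
df_results = pd.DataFrame(data=np.zeros(shape=(1,3)), columns=['classifier', 'train_score', 'test_score'])

# Mean accuracy on the training and test sets
train_score = classifier.score(X_train, Y_train)
test_score = classifier.score(X_test, Y_test)

# Class probabilities and predicted labels for the test sentences
print(classifier.predict_proba(X_test))
print(classifier.predict(X_test))

# Store the MLP scores in row 1, below the placeholder row
df_results.loc[1,'classifier'] = 'MLP'
df_results.loc[1,'train_score'] = train_score
df_results.loc[1,'test_score'] = test_score

print(df_results)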


[[0.76104308 0.23895692]
 [0.49024427 0.50975573]
 [0.44699475 0.55300525]]
[0 1 1]
  classifier  train_score  test_score
0          0          0.0         0.0
1        MLP          1.0         1.0
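
The score of 1.0 is the accuracy on only three test sentences, and two of the predicted probabilities (about 0.51 and 0.55) sit close to the 0.5 decision boundary, so the result should be read with caution. As a small sketch of inspecting the predictions in more detail, the snippet below recomputes the accuracy and a confusion matrix from the labels and predictions printed above, using `sklearn.metrics`:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 1, 1]  # Y_test from the notebook
y_pred = [0, 1, 1]  # predictions printed above

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print(acc)  # 1.0
print(cm)   # rows are true classes, columns are predicted classes
```

The off-diagonal entries of the confusion matrix count misclassified sentences; here they are all zero.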
