In [1]:
import numpy

In [6]:
# !pip install fasttext

In [7]:
import fasttext

In [8]:
from google.colab import files
path_to_file = list(files.upload().keys())[0]

Saving ecc.csv to ecc.csv


In [9]:
path_to_file

'ecc.csv'

In [10]:
with open(path_to_file, 'r') as file:
    content = file.read()
    # Print only the first 500 characters; printing the whole file
    # exceeds the notebook's IOPub data rate limit.
    print(content[:500])


In [12]:
import pandas as pd

In [15]:

df = pd.read_csv(path_to_file , names = ['category' , 'description'] , header = None)
print(df.head())

    category                                        description
0  Household  Paper Plane Design Framed Wall Hanging Motivat...
1  Household  SAF 'Floral' Framed Painting (Wood, 30 inch x ...
2  Household  SAF 'UV Textured Modern Art Print Framed' Pain...
3  Household  SAF Flower Print Framed Painting (Synthetic, 1...
4  Household  Incredible Gifts India Wooden Happy Birthday U...


In [16]:
df.dropna(inplace=True)
df.shape

(50424, 2)

In [17]:
df.category.unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)

In [18]:
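# Assign the result back instead of relying on inplace=True on a column accessor,
# which may not modify df in newer pandas versions.
df['category'] = df['category'].replace("Clothing & Accessories", "Clothing_Accessories")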

In [19]:
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

When you train a supervised fastText model, it expects every label to be marked with a label prefix (__label__ by default). We first add this prefix to the category column, then create a third column in the dataframe that holds the label together with the product description.

In [20]:
df['category'] = '__label__' + df['category'].astype(str)
df.head(5)

             category                                        description
0  __label__Household  Paper Plane Design Framed Wall Hanging Motivat...
1  __label__Household  SAF 'Floral' Framed Painting (Wood, 30 inch x ...
2  __label__Household  SAF 'UV Textured Modern Art Print Framed' Pain...
3  __label__Household  SAF Flower Print Framed Painting (Synthetic, 1...
4  __label__Household  Incredible Gifts India Wooden Happy Birthday U...


In [22]:
df['category_description'] = df['category'] + ' ' + df['description']
df.head(3)

             category                                        description                               category_description
0  __label__Household  Paper Plane Design Framed Wall Hanging Motivat...  __label__Household Paper Plane Design Framed W...
1  __label__Household  SAF 'Floral' Framed Painting (Wood, 30 inch x ...  __label__Household SAF 'Floral' Framed Paintin...
2  __label__Household  SAF 'UV Textured Modern Art Print Framed' Pain...  __label__Household SAF 'UV Textured Modern Art...
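
The label formatting done above (renaming "Clothing & Accessories" by hand, then prepending the prefix) can be generalized. The following is a minimal sketch, not part of the original notebook; the helper name to_fasttext_line is hypothetical, and it assumes that replacing every run of non-word characters in a label with an underscore is acceptable for your label set.

```python
import re

def to_fasttext_line(label: str, text: str, prefix: str = "__label__") -> str:
    """Build one training line of the form '<prefix><label> <text>'."""
    # Collapse every run of characters that are not letters, digits, or
    # underscores into a single "_", so the label stays one token.
    safe_label = re.sub(r"\W+", "_", label.strip())
    return f"{prefix}{safe_label} {text}"

# Hypothetical example row, mirroring the category renamed earlier.
print(to_fasttext_line("Clothing & Accessories", "Men's cotton t shirt"))
```

With this approach, new categories containing spaces or symbols would not need a separate replace call each.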


Pre-processing

(1). Remove punctuation
(2). Remove extra spaces
(3). Make the entire sentence lower case

In [23]:
import re

text = "  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi"
text = re.sub(r'[^\w\s\']',' ', text)
text = re.sub(' +', ' ', text)
text.strip().lower()

"viki's bookcase bookshelf 3 shelf shelve white hi"

In [24]:
def preprocess(text):
    text = re.sub(r'[^\w\s\']',' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip().lower()

In [25]:
df['category_description'] = df['category_description'].map(preprocess)
df.head()

             category                                        description                               category_description
0  __label__Household  Paper Plane Design Framed Wall Hanging Motivat...  __label__household paper plane design framed w...
1  __label__Household  SAF 'Floral' Framed Painting (Wood, 30 inch x ...  __label__household saf 'floral' framed paintin...
2  __label__Household  SAF 'UV Textured Modern Art Print Framed' Pain...  __label__household saf 'uv textured modern art...
3  __label__Household  SAF Flower Print Framed Painting (Synthetic, 1...  __label__household saf flower print framed pai...
4  __label__Household  Incredible Gifts India Wooden Happy Birthday U...  __label__household incredible gifts india wood...


In [26]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

In [27]:
train.shape, test.shape

((40339, 3), (10085, 3))

In [28]:
train.to_csv("ecommerce.train", columns=["category_description"], index=False, header=False)
test.to_csv("ecommerce.test", columns=["category_description"], index=False, header=False)

Train the model and evaluate

In [29]:
import fasttext

model = fasttext.train_supervised(input="ecommerce.train")
model.test("ecommerce.test")

(10085, 0.9680713931581557, 0.9680713931581557)

The first value (10085) is the number of test examples. The second and third values are precision and recall, respectively. Precision is about 96.8%, which is pretty good for a model trained with default settings.
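
To make the three values easier to read, here is a small sketch that unpacks the tuple and derives an F1 score and an approximate count of correct predictions. The numbers are copied from the output above; the sketch assumes each test example carries exactly one label and that one label is predicted per example, which is why precision and recall coincide.

```python
# Values copied from the model.test output shown above.
n_examples, precision, recall = (10085, 0.9680713931581557, 0.9680713931581557)

# With one true label and one predicted label per example (assumption),
# precision and recall are both equal to correct / n_examples.
f1 = 2 * precision * recall / (precision + recall)
n_correct = round(precision * n_examples)

print(f"examples={n_examples}  precision={precision:.4f}  recall={recall:.4f}  f1={f1:.4f}")
print(f"approximate number of correct predictions: {n_correct}")
```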

Now let's predict the category for a few product descriptions

In [30]:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")

(('__label__electronics',), array([0.99900275]))

In [31]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")

(('__label__clothing_accessories',), array([1.00001001]))

In [32]:
model.predict("think and grow rich deluxe edition")

(('__label__books',), array([1.00000989]))

In [33]:
model.get_nearest_neighbors("painting")

[(0.9987263679504395, 'cauliflower'),
 (0.9987250566482544, "clients'"),
 (0.9987231492996216, 'pad1'),
 (0.9987231492996216, '5ldiameter'),
 (0.9987231492996216, 'personspackage'),
 (0.9987231492996216, 'inchsuitable'),
 (0.9987231492996216, 'timelightweight'),
 (0.9987231492996216, 'personscompatible'),
 (0.9987231492996216, 'activitiesspecification'),
 (0.9987231492996216, 'graycapacity')]

In [34]:
model.get_nearest_neighbors("sony")

[(0.9995042085647583, '950'),
 (0.999500572681427, 'us9'),
 (0.9994949102401733, 'synced'),
 (0.9994925856590271, 'synthesizers'),
 (0.9994832873344421, '560mb'),
 (0.999480664730072, '7805'),
 (0.9994799494743347, 'eyecups'),
 (0.9994739890098572, '420ft'),
 (0.9994739890098572, 'ipd'),
 (0.9994739890098572, 'pupillary')]

In [35]:
model.get_nearest_neighbors("banglore")

[(0.0, 'to'),
 (0.0, 'and'),
 (0.0, 'a'),
 (0.0, 'with'),
 (0.0, 'for'),
 (0.0, 'is'),
 (0.0, '</s>'),
 (0.0, 'subs'),
 (0.0, '190rms'),
 (0.0, 'revolutionise')]

What is FastText?

FastText is a free, open-source library from Facebook AI Research (FAIR) for learning word embeddings and for text classification. It supports both unsupervised and supervised training to obtain vector representations of words, and it can also evaluate the resulting models. For word embeddings, FastText supports both the CBOW and Skip-gram models.

Uses of FastText:

(1). It is used for finding semantically similar words.

(2). It can also be used for text classification (e.g., spam filtering).

(3). It can train on large datasets in minutes.

How does FastText work?



FastText trains word vector models very quickly: it can train on about 1 billion words in less than 10 minutes. Models built on deep neural networks can be slow to train and test. FastText instead uses a linear classifier to train the model.

Linear classifier: Here, both the text and the labels are represented as vectors. We look for vector representations such that a text and its associated label have similar vectors. In simple words, the vector corresponding to a text ends up close to the vector of its correct label.




The probability of the correct label is computed by applying a softmax to the scores of all labels. To maximize this probability, we can use the gradient descent algorithm.

This is quite computationally expensive: for every piece of text, we need not only the score of its correct label but also the score of every other label in the training set. This limits the use of such models on very large datasets with many labels.
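
The linear classifier and its cost can be illustrated with a toy NumPy sketch. This is not fastText's implementation: the vocabulary, labels, dimensions, and learning rate are made up for illustration, and the vectors are random. It shows the text vector as an average of word vectors, one score per label (the step whose cost grows with the number of labels), a softmax, and a single gradient-descent step that raises the probability of the correct label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and labels (assumptions, for illustration only).
vocab = {"cotton": 0, "shirt": 1, "laptop": 2, "ram": 3}
labels = ["clothing", "electronics"]
dim = 8

word_vectors = rng.normal(size=(len(vocab), dim))    # one row per word
label_vectors = rng.normal(size=(len(labels), dim))  # one row per label

def text_vector(tokens):
    # The text is represented by the average of its known word vectors.
    ids = [vocab[t] for t in tokens if t in vocab]
    return word_vectors[ids].mean(axis=0)

def softmax(scores):
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

tokens, target = ["cotton", "shirt"], labels.index("clothing")
x = text_vector(tokens)

# One score per label: this is the part that scales with the label count.
p_before = softmax(label_vectors @ x)

# One gradient-descent step on the label vectors for the loss -log p[target].
one_hot = np.eye(len(labels))[target]
grad = np.outer(p_before - one_hot, x)
label_vectors -= 0.1 * grad

p_after = softmax(label_vectors @ x)
print("P(correct label) before:", p_before[target])
print("P(correct label) after: ", p_after[target])
```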

FastText addresses this problem by using a hierarchical classifier (hierarchical softmax, selected with the -loss hs option) to train the model.

Hierarchical Classifier used by FastText:




In this method, the labels are arranged in a binary tree. Every node in the binary tree represents a probability, and a label is represented by the probabilities along the path from the root to that label. This means that the leaf nodes of the binary tree represent the labels.



FastText uses the Huffman algorithm to build these trees, which takes advantage of the fact that classes can be imbalanced: frequently occurring labels sit at a smaller depth than infrequent ones.

Using a binary tree speeds up the search: instead of going through all the labels, you only visit the nodes on one path. We no longer have to compute the score for every single possible label; we only compute the probability at each node on the path to the one correct label. Hence this method vastly reduces the time complexity of training the model.

In practice, this gain in speed comes at little cost to the accuracy of the model.
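
The effect of the Huffman construction on label depth can be sketched in a few lines of plain Python. This is an illustration of the general Huffman algorithm, not fastText's code, and the label counts below are hypothetical (the notebook does not report per-class counts).

```python
import heapq
from itertools import count

def huffman_depths(label_counts):
    """Return {label: depth} for a Huffman tree built from label frequencies."""
    tie = count()  # tie-breaker so heapq never has to compare the dict payloads
    heap = [(freq, next(tie), {label: 0}) for label, freq in label_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)   # the two least frequent subtrees
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every label inside them one level deeper.
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Hypothetical, imbalanced label counts.
counts = {"household": 19000, "books": 12000, "clothing": 9000, "electronics": 500}
depths = huffman_depths(counts)
for label in sorted(depths, key=depths.get):
    print(f"{label:12s} depth={depths[label]}")
```

The depth of a label is the number of node probabilities evaluated on its path, so the most frequent label needs the fewest evaluations.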

When we have an unlabeled dataset, FastText trains the model using character n-grams. Let us look at how this technique works in more detail.
Consider a word from our dataset, for example “kingdom”. FastText takes the word “kingdom” and breaks it into its n-gram components:

kingdom = ['k','in','kin','king','kingd','kingdo','kingdom',...]




These are some of the n-gram components of the given word. There are many more, but only a few are listed here to give an idea. You can choose the size of the n-gram components: their length ranges between a minimum and a maximum number of characters, set with the -minn and -maxn flags respectively.
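
The n-gram extraction described above can be sketched as follows. This is a simplified illustration rather than the library's code: it wraps the word in the boundary markers < and > and lists every character n-gram whose length lies between minn and maxn. The function name char_ngrams is hypothetical.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of '<word>' with lengths from minn to maxn (inclusive)."""
    if minn <= 0 or maxn <= 0:
        return []  # mirrors the idea of switching sub-words off with 0 and 0
    token = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    return [
        token[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(token) - n + 1)
    ]

grams = char_ngrams("kingdom", minn=3, maxn=4)
print(grams)
```

Narrowing or widening the minn/maxn range directly changes how many components each word produces.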

Note: When your text does not consist of words from a particular language, using n-grams won’t make sense. For example, when the corpus contains IDs, it stores numbers and special characters rather than words. In this case, you can turn off the n-gram embeddings by setting both the -minn and -maxn parameters to 0.

When the model updates, fastText learns the weights for every n-gram along with the entire word token.




In this manner, each token/word is expressed as the sum (or average) of the vectors of its n-gram components.
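
As a rough sketch of this idea, the code below builds a word vector by averaging the vectors of the word's character n-grams together with the whole token. It is a toy model under stated assumptions: the table is random and untrained, the sizes are arbitrary, the whole token is hashed like any other piece, and crc32 stands in for whatever hash function the library actually uses. Because the vector is assembled from n-grams, a word never seen before still gets a vector.

```python
import zlib
import numpy as np

DIM, BUCKETS = 16, 1000  # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
ngram_table = rng.normal(size=(BUCKETS, DIM))  # untrained, random vectors

def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of '<word>' with lengths from minn to maxn."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(token) - n + 1)]

def bucket(piece):
    # crc32 is a stand-in hash (assumption), mapping each piece to a table row.
    return zlib.crc32(piece.encode("utf-8")) % BUCKETS

def word_vector(word):
    # Components: every distinct character n-gram plus the whole token itself.
    pieces = sorted(set(char_ngrams(word)) | {f"<{word}>"})
    rows = [bucket(p) for p in pieces]
    return ngram_table[rows].mean(axis=0)

# "king" and "kingdom" share several n-grams, so their vectors share components.
shared = set(char_ngrams("king")) & set(char_ngrams("kingdom"))
print(sorted(shared))

# A misspelled, unseen word still gets a vector built from its n-grams.
print(word_vector("banglore").shape)
```

Since the table here is random, similarities computed from these vectors are not meaningful; the sketch only shows how the components are combined.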

(1). Word vectors generated through fastText hold extra information about their sub-words. In the example above, one of the components of the word “kingdom” is the word “king”. This information helps the model capture the semantic similarity between the two words.

(2). It also captures the meaning of suffixes and prefixes of the words in the corpus.

(3). It generates better word embeddings for rare words as well.

(4). It can also generate word embeddings for out-of-vocabulary (OOV) words.

(5). Because fastText provides sub-word information, it can also be used on morphologically rich languages like Spanish, French, German, etc.

With fastText, accuracy is not compromised even if you don’t remove the stopwords. You can still perform simple pre-processing steps on your corpus if you feel like it.

We do get better word embeddings through fastText, but it uses more memory than word2vec or GloVe because it generates a lot of sub-words for each word.




