
# Learn FastText

This notebook follows the official fastText text classification tutorial: https://fasttext.cc/docs/en/supervised-tutorial.html



## Downloading and installing fasttext

In [None]:
!wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
!unzip v0.9.2.zip

# `cd` in a `!` command does not persist to the next line, so point make at the source directory with -C
!make -C fastText-0.9.2
!pip install fastText-0.9.2/

In [2]:
import fasttext

In [3]:
help(fasttext.FastText)

Help on module fasttext.FastText in fasttext:

NAME
    fasttext.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.

FUNCTIONS
    cbow(*kargs, **kwargs)
    
    eprint(*args, **kwargs)
    
    load_model(path)
        Load a model given a filepath and return a model object.
    
    read_args(arg_list, arg_dict, arg_names, default_values)
    
    skipgram(*kargs, **kwargs)
    
    supervised(*kargs, **kwargs)
    
    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
    
    train_supervised(*kargs, **kwargs)
        Train a supervised model and return a model object.
        
        input must be a filepath. The input text does not need to be tokenized
        as per the tokenize function, but it must be preprocessed and encoded
        as UTF-8. You might wan

## Download and prepare the dataset

In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
!head cooking.stackexchange.txt

In [5]:
!wc cooking.stackexchange.txt

  15404  169582 1401900 cooking.stackexchange.txt
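
Each line of the dataset is expected to hold one or more labels, marked with the `__label__` prefix, followed by the question text. The following is a minimal sketch, assuming that layout, of how a line could be split into labels and text. `parse_line` is a hypothetical helper and not part of the fasttext API, and the sample line is only illustrative.

```python
LABEL_PREFIX = "__label__"  # prefix that marks a token as a label


def parse_line(line):
    """Split one dataset line into (labels, text)."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            # Strip the prefix so only the tag name remains.
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, " ".join(words)


# Illustrative line in the assumed format (not read from the file).
sample = "__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?"
labels, text = parse_line(sample)
print(labels)
print(text)
```

A helper like this can be handy for counting how often each label occurs before training.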


## Split dataset into train and validation

In [6]:
!head -n 12404 cooking.stackexchange.txt > cooking.train
!tail -n 3000 cooking.stackexchange.txt > cooking.valid
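
The `head`/`tail` split above keeps the first 12404 lines for training and the last 3000 for validation, which together cover all 15404 lines. The same idea can be sketched in plain Python; `split_train_valid` is a hypothetical helper, and it works on a list of lines already held in memory rather than on the file itself.

```python
def split_train_valid(lines, n_valid):
    """Return (train, valid), where the last n_valid lines form the validation set."""
    if not 0 < n_valid < len(lines):
        raise ValueError("n_valid must be between 1 and len(lines) - 1")
    return lines[:-n_valid], lines[-n_valid:]


# Stand-in for the 15404 lines of the dataset.
lines = [f"line {i}" for i in range(15404)]
train, valid = split_train_valid(lines, n_valid=3000)
print(len(train), len(valid))
```

With the real file, `lines` could come from `open("cooking.stackexchange.txt", encoding="utf-8").readlines()`.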

## Train the model

In [7]:
model = fasttext.train_supervised(input="cooking.train")

Save the trained model to disk so that it can be reloaded later with `fasttext.load_model`.

In [8]:
model.save_model("model_cooking.bin")

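Predict the most likely label for a single question. `predict` returns a tuple of labels and an array with their probabilities.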

In [9]:
model.predict("Which baking dish is best to bake a banana bread ?")

(('__label__baking',), array([0.07257967]))

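Test the model on the validation set. `test` returns the number of examples, the precision at 1, and the recall at 1.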

In [10]:
model.test("cooking.valid")

(3000, 0.135, 0.05838258613233386)
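
To make the two scores concrete, here is a minimal sketch of precision and recall at k, under the assumption that precision is the number of correct predictions divided by the number of predictions made, and recall is the number of correct predictions divided by the number of true labels. `precision_recall_at_k` is a hypothetical helper written for illustration, not the library's implementation, and the toy data is made up.

```python
def precision_recall_at_k(predicted, gold, k=1):
    """predicted: ranked label lists per example; gold: sets of true labels per example."""
    n_correct = n_predicted = n_gold = 0
    for ranked, truth in zip(predicted, gold):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_gold += len(truth)
    return n_correct / n_predicted, n_correct / n_gold


# Toy data: two examples with ranked predictions and their true labels.
predicted = [["baking", "bread"], ["equipment", "knives"]]
gold = [{"baking", "bread", "bananas"}, {"knives"}]
precision, recall = precision_recall_at_k(predicted, gold, k=1)
print(precision, recall)
```

Because questions in this dataset can carry several labels, recall at 1 stays well below precision at 1: a single prediction cannot cover every true label.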

In [11]:
model.predict("Why not put knives in the dishwasher?", k=5)

(('__label__food-safety',
  '__label__baking',
  '__label__bread',
  '__label__substitutions',
  '__label__equipment'),
 array([0.07451777, 0.07366108, 0.04390582, 0.0373    , 0.03408055]))
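
The labels and probabilities come back as two parallel sequences. A small sketch of pairing them up and dropping the `__label__` prefix follows; `format_predictions` is a hypothetical helper, and the values are copied from the output above as a plain list instead of a NumPy array.

```python
def format_predictions(labels, probs, prefix="__label__"):
    """Pair each label (without its prefix) with its probability."""
    pairs = []
    for label, prob in zip(labels, probs):
        name = label[len(prefix):] if label.startswith(prefix) else label
        pairs.append((name, float(prob)))
    return pairs


# Values copied from the prediction output above.
labels = ("__label__food-safety", "__label__baking", "__label__bread",
          "__label__substitutions", "__label__equipment")
probs = [0.07451777, 0.07366108, 0.04390582, 0.0373, 0.03408055]

for name, prob in format_predictions(labels, probs):
    print(f"{name:15s} {prob:.4f}")
```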

## Making the model better

### Preprocessing the data

In [12]:
!cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
!head -n 12404 cooking.preprocessed.txt > cooking.train
!tail -n 3000 cooking.preprocessed.txt > cooking.valid
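
The shell pipeline above puts spaces around punctuation and lowercases everything. A rough Python approximation is sketched below; `preprocess` is a hypothetical helper, and its character class is meant to mirror the intent of the `sed` expression rather than reproduce it byte for byte (case folding of non-ASCII text may also differ from `tr`).

```python
import re

# Punctuation characters that get surrounded by spaces.
_PUNCT = re.compile(r"([.!?,'/()])")


def preprocess(line):
    """Pad punctuation with spaces, then lowercase the line."""
    return _PUNCT.sub(r" \1 ", line).lower()


print(preprocess("Why not put knives in the dishwasher?"))
```

Text passed to `model.predict` should go through the same normalization as the training data, so that tokens match what the model saw during training.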

#### Train for more epochs

In [13]:
model = fasttext.train_supervised(input="cooking.train",  epoch=25)

In [14]:
model.test("cooking.valid")

(3000, 0.52, 0.22488107250973044)

#### Train with a different learning rate

In [15]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0)

In [None]:
model.test("cooking.valid")

In [16]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25)

In [17]:
model.test("cooking.valid")

(3000, 0.5843333333333334, 0.25270289750612657)

#### Add word bigrams

In [18]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, wordNgrams=2)

In [19]:
model.test("cooking.valid")

(3000, 0.5996666666666667, 0.2593340060544904)
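
The train-then-test pattern repeated in the cells above can be wrapped in a loop. The sketch below is one possible way to do it: `run_experiments` is a hypothetical helper that takes the training and evaluation steps as callables, so it makes no assumption about the library beyond `test` returning a tuple of (examples, precision, recall).

```python
def run_experiments(configs, train, evaluate):
    """Train one model per config and return (config, precision, recall), best first."""
    results = []
    for config in configs:
        model = train(**config)
        _, precision, recall = evaluate(model)
        results.append((config, precision, recall))
    # Highest precision first.
    return sorted(results, key=lambda r: r[1], reverse=True)


# The hyperparameter settings tried in the cells above.
configs = [
    {"epoch": 25},
    {"lr": 1.0, "epoch": 25},
    {"lr": 1.0, "epoch": 25, "wordNgrams": 2},
]

# With fasttext installed, the callables could be, for example:
#   train = lambda **kw: fasttext.train_supervised(input="cooking.train", **kw)
#   evaluate = lambda model: model.test("cooking.valid")
```

Passing the callables in keeps the loop independent of file paths and makes it easy to try further settings by extending `configs`.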

#### Multi-label classification with one-vs-all loss

In [20]:
model = fasttext.train_supervised(input="cooking.train", lr=0.5, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='ova')

In [21]:
model.predict("Which baking dish is best to bake a banana bread ?", k=-1, threshold=0.5)

(('__label__baking',
  '__label__equipment',
  '__label__bread',
  '__label__bananas'),
 array([1.00001001, 0.97967768, 0.97632056, 0.8872146 ]))
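
With `k=-1` and `threshold=0.5`, every label whose probability reaches the threshold is returned. The sketch below shows that filtering idea on the values above and also clips probabilities to 1.0, since the first value comes back marginally above 1 (presumably a numerical artifact). `filter_by_threshold` is a hypothetical helper, and whether the library compares with `>=` or `>` is an assumption here.

```python
def filter_by_threshold(labels, probs, threshold=0.5):
    """Keep (label, probability) pairs at or above the threshold, clipping to 1.0."""
    kept = []
    for label, prob in zip(labels, probs):
        if prob >= threshold:
            kept.append((label, min(float(prob), 1.0)))
    return kept


# Values copied from the prediction output above.
labels = ("__label__baking", "__label__equipment",
          "__label__bread", "__label__bananas")
probs = [1.00001001, 0.97967768, 0.97632056, 0.8872146]

print(filter_by_threshold(labels, probs, threshold=0.9))
```

Raising the threshold trades recall for precision: fewer labels are returned, but those that remain are the ones the model is most confident about.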