# **fastText** - Training
This notebook trains a fastText supervised model and generates predictions on the test set. 


## Sources
Uses Facebook AI Research's Python implementation of fastText: https://github.com/facebookresearch/fastText/tree/master/python

## Reproducibility
Running this notebook reproduces the model used for submission **#109984** on AIcrowd:

| Accuracy | F1 |
|:---:|:---:|
| 85.9% | 86.0% |
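
For reference, the two metrics in the table can be computed from 0/1 labels as sketched below. This is a minimal illustration, assuming label 1 is the positive class for F1; the function name `accuracy_and_f1` is hypothetical, and AIcrowd's own scoring code may differ.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary 0/1 labels, with 1 as the positive class (assumption)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1
```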

## Notes
fastText trains quickly: this notebook runs end to end in under 5 minutes. It is a good option for simple text classification problems where state-of-the-art performance is not required.

### Creating a fastText labeled dataset

In [1]:
import numpy as np 
import pandas as pd 
import fasttext_models as mod
import os 
import wget
import fasttext

root = 'data/'
os.makedirs(root, exist_ok=True)

seed = 0

CREATE_NEW_DATASET = True  # Set to False if the labeled text file for fastText already exists
if CREATE_NEW_DATASET:
    
    # Download negative full
    neg_url = 'https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDQ0eDZMdDI5WXBlVXYyZGc_ZT1ZZDJn/root/content'
    neg_filename = root + 'train_neg_full_u.txt'
    wget.download(neg_url, neg_filename)
    neg_tweets = mod.txt_to_list(neg_filename)

    # Download positive full
    pos_url = 'https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDQzcTc3QmNPbUdIWHQ3TXc_ZT01ejdG/root/content'
    pos_filename = root + 'train_pos_full_u.txt'
    wget.download(pos_url, pos_filename)
    pos_tweets = mod.txt_to_list(pos_filename)
    
    # Create a labeled dataset 
    all_tweets, y = mod.merge_shuffle_label(pos_tweets, neg_tweets, seed = seed)
    
    # Create a labeled text file for supervised fastText
    labeled_filename_full = root + 'full_u_labeled.txt'

    mod.write_labeled(labeled_filename_full, all_tweets, y)

100% [........................................................................] 78157401 / 78157401
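
fastText's supervised mode reads one example per line, with the class given as a `__label__`-prefixed token (the default prefix). The helpers `mod.merge_shuffle_label` and `mod.write_labeled` come from the local `fasttext_models` module, which is not shown here. The sketch below is one plausible way such helpers could work, not their actual implementation; the names `merge_shuffle_label_sketch` and `write_labeled_sketch` are hypothetical.

```python
import random


def merge_shuffle_label_sketch(pos_tweets, neg_tweets, seed=0):
    """Concatenate both classes, label them 1 (positive) / 0 (negative), shuffle reproducibly."""
    pairs = [(t, 1) for t in pos_tweets] + [(t, 0) for t in neg_tweets]
    random.Random(seed).shuffle(pairs)  # local RNG, so the global random state is untouched
    tweets = [t for t, _ in pairs]
    labels = [y for _, y in pairs]
    return tweets, labels


def write_labeled_sketch(filename, tweets, labels):
    """Write one '__label__<y> <tweet>' line per example."""
    with open(filename, "w", encoding="utf-8") as f:
        for tweet, y in zip(tweets, labels):
            # Collapse all whitespace (including newlines) so each example stays on one line
            text = " ".join(tweet.split())
            f.write(f"__label__{y} {text}\n")
```

With this layout, a positive tweet `good day` would be stored as the line `__label__1 good day`.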

### Training the model

In [2]:
# File to use for training
labeled_filename_full = root + 'full_u_labeled.txt'

# Train full model
model = fasttext.train_supervised(labeled_filename_full, epoch = 3, dim = 100, wordNgrams = 2, lr = 1)

# Save it 
model.save_model(root + "fasttext_trained_model.bin")
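
The cell above trains on the full labeled file, so it gives no local estimate of accuracy before submitting. One option is to hold out part of the labeled file first. The sketch below is an illustration only: the function name `split_labeled_file`, the destination paths, and the 10% default are assumptions, and it relies on the labeled file already being shuffled.

```python
def split_labeled_file(src, train_dst, valid_dst, valid_fraction=0.1):
    """Copy the last `valid_fraction` of lines to valid_dst and the rest to train_dst.

    Returns (n_train, n_valid). Assumes the lines in `src` are already shuffled.
    """
    if not 0 < valid_fraction < 1:
        raise ValueError("valid_fraction must be strictly between 0 and 1")

    with open(src, encoding="utf-8") as f:
        lines = f.readlines()

    n_valid = int(len(lines) * valid_fraction)
    n_train = len(lines) - n_valid

    with open(train_dst, "w", encoding="utf-8") as f:
        f.writelines(lines[:n_train])
    with open(valid_dst, "w", encoding="utf-8") as f:
        f.writelines(lines[n_train:])
    return n_train, n_valid
```

A model trained on the training part could then be scored with `model.test(valid_path)`, which in the official Python bindings returns the number of examples together with precision and recall at `k=1`.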

### Make predictions on the test set

In [4]:
# Prepare test set
test_url = 'https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDR5Q3hoWXM4T2FJd1JLenc_ZT1hSXh0/root/content'
test_filename = root + 'test.txt'
wget.download(test_url, test_filename)

test_tweets = []
with open(test_filename, encoding = 'utf-8') as f:
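    for line in f:
        # Each line is "<index>,<tweet>": drop the index and the trailing newline,
        # keeping any commas inside the tweet itself
        tweet = line.rstrip('\n').partition(',')[2]
        test_tweets.append(tweet)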
        
# Generate predictions
res = {'__label__0': 0, '__label__1': 1}
predictions = np.array([res[el[0]] for el in model.predict(test_tweets, k=1)[0]])

# Save predictions
save_filename = 'submission_fasttext_training.csv'
mod.save_pred(save_filename, predictions)

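
The two small steps in the prediction cell, stripping the index from each test line and turning fastText labels back into integers, can be checked in isolation. The sketch below assumes each test line has the form `<index>,<tweet>` (as the loop above implies) and that the first element returned by `model.predict` holds one sequence of labels per input, best label first. The function names are hypothetical.

```python
def parse_test_line(line):
    """Return the tweet from an '<index>,<tweet>' line, keeping commas inside the tweet."""
    # partition splits on the first comma only, so later commas stay in the tweet
    _, _, tweet = line.rstrip("\n").partition(",")
    return tweet


def labels_to_ints(predicted_labels, prefix="__label__"):
    """Map per-example label sequences (top-1 first) to integer class ids."""
    return [int(labels[0][len(prefix):]) for labels in predicted_labels]


# Hand-written stand-ins for a test line and for the labels part of a prediction
print(parse_test_line("12,hello, world\n"))              # hello, world
print(labels_to_ints([["__label__1"], ["__label__0"]]))  # [1, 0]
```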