<h1>FastText Implementation</h1>

In this notebook we apply FastText to our tweet dataset to classify each tweet as positive (1) or negative (-1).

In [None]:
# General imports
import csv
import os
import time

import numpy as np

# Helper code
from fastText.helpers import clean_str

# Libraries for FastText and cross-validation
import fasttext
from sklearn.model_selection import KFold

<h3>Loading Data</h3>

This function loads data from the processed tweet files (one tweet per line) and generates labels: 1 for positive tweets and -1 for negative tweets. It returns the training tweets, their labels, and the test tweets. The function after it, clean_files, applies clean_str to the raw tweet files and writes the processed versions that the loader reads.

In [None]:
def load_data_and_labels(positive_data_file, negative_data_file, test_data_file):
    """
    Loads tweets from files (one tweet per line) and generates labels.
    Returns the training tweets, their labels (1 = positive, -1 = negative)
    and the test tweets.
    """
    # Load data from files, closing each file once it has been read
    with open(positive_data_file, "r", encoding="utf-8") as f:
        positive_examples = [s.strip() for s in f]
    
    with open(negative_data_file, "r", encoding="utf-8") as f:
        negative_examples = [s.strip() for s in f]
    
    with open(test_data_file, "r", encoding="utf-8") as f:
        test = [s.strip() for s in f]
    
    # Concatenate the training tweets, positives first
    train = positive_examples + negative_examples
    
    # Generate labels in the same order as the training tweets
    positive_labels = [1 for _ in positive_examples]
    negative_labels = [-1 for _ in negative_examples]
    labels = np.concatenate([positive_labels, negative_labels], 0)
    return [train, labels, test]
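
To make the return values concrete, here is a minimal sketch of the same labelling logic run on in-memory lists instead of files. The helper name label_examples and the toy tweets are assumptions made for illustration, and a plain Python list stands in for the NumPy array of labels.

```python
def label_examples(positive_examples, negative_examples):
    """Hypothetical helper: concatenate tweets and build matching labels."""
    # Positives come first, so their labels (1) come first as well
    train = positive_examples + negative_examples
    labels = [1] * len(positive_examples) + [-1] * len(negative_examples)
    return train, labels


# Toy input, not taken from the dataset
train, labels = label_examples(["great day"], ["bad day", "so tired"])
print(list(zip(labels, train)))
```

The order matters: the label at position i always belongs to the tweet at position i, which is what the later cells rely on when they pair the two.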

In [None]:
def clean_files():
    """
    Cleans the raw tweet files with clean_str and writes the results to processed/.
    """
    with open('twitter-datasets/train_pos_full.txt', "r", encoding="utf-8") as f:
        positive_examples = [s.strip() for s in f]
    
    with open('twitter-datasets/train_neg_full.txt', "r", encoding="utf-8") as f:
        negative_examples = [s.strip() for s in f]
    
    with open('twitter-datasets/test_data.txt', "r", encoding="utf-8") as f:
        test_examples = [s.strip() for s in f]
    
    # Clean every tweet
    positive_string = [clean_str(sent) for sent in positive_examples]
    negative_string = [clean_str(sent) for sent in negative_examples]
    test_string = [clean_str(sent) for sent in test_examples]

    with open('processed/train_pos_fastText_full.txt', 'w', encoding="utf-8") as f:
        for sent in positive_string:
            f.write(sent + '\n')

    with open('processed/train_neg_fastText_full.txt', 'w', encoding="utf-8") as f:
        for sent in negative_string:
            f.write(sent + '\n')

    with open('processed/test_data_fastText.txt', 'w', encoding="utf-8") as f:
        for sent in test_string:
            f.write(sent + '\n')

<h3>Classification</h3>

Here we run our classification. We first load the datasets, then write the training set to a file with the label prepended to each tweet in the form '__label__<X>'. We feed the resulting file to FastText to train a classifier, which we then use to predict labels for the testing set.

In [None]:
# Start the timer
start = time.time()

# Clean the files if any of the processed files does not exist
processed_files = ['processed/train_pos_fastText_full.txt',
                   'processed/train_neg_fastText_full.txt',
                   'processed/test_data_fastText.txt']
if not all(os.path.exists(path) for path in processed_files):
    print('Clean fastText files did not exist')
    clean_files()

# Load data from processed files
train, labels, test = load_data_and_labels('processed/train_pos_fastText_full.txt', 'processed/train_neg_fastText_full.txt', 
                                           'processed/test_data_fastText.txt')

# Prepend the label to every tweet as '__label__<X>'
with open('outputs/fastText_labels.txt', 'w', encoding="utf-8") as f:
    for sent, label in zip(train, labels):
        f.write('__label__' + str(label) + ' ' + sent + '\n')

# define the parameters for the fastText classifier
window, epochs = 10, 10

# Build the fastText classifier 
classifier = fasttext.supervised('outputs/fastText_labels.txt', 'model', label_prefix='__label__', ws=window, epoch=epochs)

# Predict a label and its probability for every test tweet
prediction = classifier.predict_proba(test)

# Report the elapsed time in seconds
end = time.time()
print(end - start)
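
The labelled-file format written above can be illustrated in isolation. This is a minimal sketch: to_fasttext_line is a hypothetical helper name and the tweets are toy examples, but the line layout matches what the cell writes to outputs/fastText_labels.txt.

```python
LABEL_PREFIX = "__label__"


def to_fasttext_line(label, text):
    """Hypothetical helper: build one line of the labelled training file."""
    return LABEL_PREFIX + str(label) + " " + text


# Toy tweets, not taken from the dataset
for label, text in [(1, "love this"), (-1, "hate this")]:
    print(to_fasttext_line(label, text))
```

The prefix has to match the label_prefix argument passed to the classifier, otherwise the labels would be read as ordinary words.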

In [None]:
prediction
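
The submission cell at the end of this notebook indexes each prediction as pred[0][0], which suggests that every entry is a list of (label, probability) pairs with the most probable label first. Under that assumption, a minimal sketch for turning predictions into integer labels could look like this. The helper top_label and the sample values are hypothetical and are not real classifier output.

```python
def top_label(pred):
    """Hypothetical helper: return the most probable label as an int."""
    label, _probability = pred[0]
    return int(label)


# Made-up predictions in the assumed shape, not real classifier output
sample_prediction = [[("1", 0.9)], [("-1", 0.7)]]
print([top_label(pred) for pred in sample_prediction])
```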

<h3>Cross-Validation</h3>

We can now estimate the accuracy of our classifier by running a 10-fold cross-validation on it. With the same parameters as before, we split our data into 10 shuffled subsets; in each iteration, one subset acts as the testing set and the other nine form the training set.

In [None]:
# Load data from processed files
train, labels, test = load_data_and_labels('train_pos_proc.txt', 'train_neg_proc.txt', 'test_data.txt')

# define the parameters for the fastText classifier
window, epochs = 10, 10

# Create a random permutation of the row indices
num_row = len(labels)
indices = np.random.permutation(num_row)

# Define the number of folds for the cross-validation
fold = 10
k_fold = KFold(n_splits=fold)
accuracy = np.zeros(fold)

i = 0

for train_indices, test_indices in k_fold.split(labels):
    # Map the fold positions through the random permutation created above,
    # so that every fold is a shuffled subset of the data
    train_indices = indices[train_indices]
    test_indices = indices[test_indices]
    
    # Prepend the label to every tweet as '__label__<X>'
    # For the training set
    with open('outputs/fastText_train_labels.txt', 'w', encoding="utf-8") as f:
        for idx in train_indices:
            f.write('__label__' + str(labels[idx]) + ' ' + train[idx] + '\n')
    
    # For the testing set
    with open('outputs/fastText_test_labels.txt', 'w', encoding="utf-8") as f:
        for idx in test_indices:
            f.write('__label__' + str(labels[idx]) + ' ' + train[idx] + '\n')
    
    # Build the fastText classifier 
    classifier = fasttext.supervised('outputs/fastText_train_labels.txt', 'model_cros_val', label_prefix='__label__', ws=window, epoch=epochs)
    
    # Evaluate how the classifier performs on the testing set
    result = classifier.test('outputs/fastText_test_labels.txt')
    
    # Save the accuracy for every iteration; each tweet has exactly one label,
    # so precision at 1 equals accuracy
    accuracy[i] = result.precision
    i += 1
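
The shuffled split used above can also be sketched without scikit-learn, which makes the idea easier to inspect on a small example. This is a pure-Python stand-in, not KFold itself: shuffled_folds is a hypothetical helper, the seed is an arbitrary choice, and it builds strided folds rather than the contiguous blocks that KFold produces.

```python
import random


def shuffled_folds(num_rows, num_folds, seed=0):
    """Hypothetical helper: yield (train_indices, test_indices) per fold."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    for k in range(num_folds):
        # Every num_folds-th shuffled index, starting at k, forms the test fold
        test_indices = indices[k::num_folds]
        held_out = set(test_indices)
        train_indices = [i for i in indices if i not in held_out]
        yield train_indices, test_indices


for train_indices, test_indices in shuffled_folds(num_rows=10, num_folds=5):
    print(sorted(test_indices))
```

Every row lands in exactly one test fold, so each tweet is used for evaluation once and for training in all the other iterations.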

We can now view the accuracy of each of the folds of the cross-validation.

In [None]:
accuracy
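
The per-fold values can be condensed into a mean and a spread, which is easier to compare between parameter settings. A minimal sketch, assuming a plain list of per-fold accuracies; summarise_folds is a hypothetical helper and the input numbers are placeholders, not results from this notebook.

```python
import statistics


def summarise_folds(fold_accuracies):
    """Hypothetical helper: mean and sample standard deviation across folds."""
    mean = statistics.mean(fold_accuracies)
    spread = statistics.stdev(fold_accuracies) if len(fold_accuracies) > 1 else 0.0
    return mean, spread


# Placeholder numbers for illustration only, not results from this notebook
mean, spread = summarise_folds([0.5, 1.0])
print(mean, spread)
```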

<h3>Submission for FastText</h3>

This code writes the predictions from the Classification section to a CSV submission file with the columns Id and Prediction. Ids start at 1.

In [None]:
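# newline='' lets the csv module control line endings itself
with open('outputs/sub_fasttext.csv', 'w', newline='') as csvfile:
    fieldnames = ['Id', 'Prediction']
    sub_writer = csv.DictWriter(csvfile, fieldnames)
    sub_writer.writeheader()
    
    # Ids start at 1; pred[0][0] is the most probable label for the tweet
    for index, pred in enumerate(prediction, start=1):
        sub_writer.writerow({'Id': str(index), 'Prediction': str(pred[0][0])})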
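
To check the layout of the submission file without touching the disk, the same writer logic can be pointed at an in-memory buffer. This is a minimal sketch: write_submission is a hypothetical helper and the labels are made up for illustration.

```python
import csv
import io


def write_submission(predicted_labels, out_file):
    """Hypothetical helper: write Id/Prediction rows, with Ids starting at 1."""
    writer = csv.DictWriter(out_file, fieldnames=["Id", "Prediction"])
    writer.writeheader()
    for index, label in enumerate(predicted_labels, start=1):
        writer.writerow({"Id": index, "Prediction": label})


# Made-up labels, not real predictions
buffer = io.StringIO()
write_submission(["1", "-1"], buffer)
print(buffer.getvalue())
```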