# Introduction to Natural Language Processing with fastText
In this notebook we discuss what Natural Language Processing (NLP) is and how to implement several projects with the [fastText](https://github.com/facebookresearch/fastText) library.

In [1]:
# Load all libraries
import os, sys
import pandas as pd
import numpy as np
import fasttext

print(sys.version)

3.5.2 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]


## Text classification
The first task is text classification on the DBPedia dataset.

In [2]:
#Load train set
train_file = 'dbpedia_train.csv'
df = pd.read_csv(train_file, header=None, names=['class','name','description'])

#Load test set
test_file = 'dbpedia_test.csv'
df_test = pd.read_csv(test_file, header=None, names=['class','name','description'])

#Mapping from class number to class name
class_dict={
1:'Company',
2:'EducationalInstitution',
3:'Artist',
4:'Athlete',
5:'OfficeHolder',
6:'MeanOfTransportation',
7:'Building',
8:'NaturalPlace',
9:'Village',
10:'Animal',
11:'Plant',
12:'Album',
13:'Film',
14:'WrittenWork'
}
df['class_name'] = df['class'].map(class_dict)
df.head()

| | class | name | description | class_name |
|---|---|---|---|---|
| 0 | 1 | E. D. Abbott Ltd | Abbott of Farnham E D Abbott Limited was a Br... | Company |
| 1 | 1 | Schwan-Stabilo | Schwan-STABILO is a German maker of pens for ... | Company |
| 2 | 1 | Q-workshop | Q-workshop is a Polish company located in Poz... | Company |
| 3 | 1 | Marvell Software Solutions Israel | Marvell Software Solutions Israel known as RA... | Company |
| 4 | 1 | Bergan Mercy Medical Center | Bergan Mercy Medical Center is a hospital loc... | Company |


In [3]:
#df.describe().transpose()
desc = df.groupby('class')
desc.describe().transpose()

class,1,1,1,1,2,2,2,2,3,3,...,12,12,13,13,13,13,14,14,14,14
Unnamed: 0_level_1,count,unique,top,freq,count,unique,top,freq,count,unique,...,top,freq,count,unique,top,freq,count,unique,top,freq
class_name,40000,1,Company,40000,40000,1,EducationalInstitution,40000,40000,1,...,Album,40000,40000,1,Film,40000,40000,1,WrittenWork,40000
description,40000,39996,DTOX is a mobile recovery smartphone app that...,2,40000,39992,St. Croix Country Day School is an independen...,2,40000,40000,...,Before Smile Empty Soul became Smile Empty So...,2,40000,40000,Azhagan is a 1991 Indian Tamil language film ...,1,40000,39984,Tom Clancy's Net Force Explorers or Net Force...,15
name,40000,40000,ArmorGroup,1,40000,40000,Highland Park High School (University Park Texas),1,40000,40000,...,Made to Be Broken,1,40000,40000,Uruvangal Maralam,1,40000,40000,Night Fall (novel),1


The next step is to prepare the data. At the time of writing, the Python wrapper of fastText does not accept dataframes or iterators as input to its functions (although the maintainers are [working on it](https://github.com/salestock/fastText.py/issues/78)), so we have to create an intermediate file. In this file commas are removed, punctuation is separated from the words, and everything is lowercase; non-ASCII characters can optionally be stripped with the `encode` flag. The changes are based on [this script](https://github.com/facebookresearch/fastText/blob/a88344f6de234bdefd003e9e55512eceedde3ec0/classification-example.sh#L17).
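
To make the target format concrete, here is a minimal sketch that cleans a single record with plain Python and builds one training line. The helper name `to_fasttext_line` and the sample record are made up for illustration and are not part of the notebook. Unlike the pandas cell, which writes the three columns through `to_csv`, this sketch joins the fields with spaces and collapses repeated whitespace, so the exact spacing differs.

```python
def to_fasttext_line(label, name, description):
    """Build one '__label__<n> <text>' line (hypothetical helper, for illustration only)."""
    text = name + ' ' + description
    # Characters that are dropped or replaced by a space
    for old, new in [(',', ' '), ('"', ''), (':', ' '), (';', ' ')]:
        text = text.replace(old, new)
    # Surround punctuation with spaces so that each mark becomes its own token
    for mark in ["'", '.', '(', ')', '!', '?']:
        text = text.replace(mark, ' ' + mark + ' ')
    # Lowercase and collapse repeated whitespace
    text = ' '.join(text.lower().split())
    return '__label__' + str(label) + ' ' + text

# Fictional record, not taken from DBPedia
line = to_fasttext_line(1, 'Acme Ltd.', 'Acme is a company, founded in 1990 (fictional).')
print(line)
```

Each line starts with the `__label__` prefix followed by the class number, which is how the classifier tells the label apart from the text.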

In [9]:
%%time
def clean_dataset(dataframe, shuffle=False, encode=False):
    # Literal replacements applied in order to both text columns
    replacements = [(',', ' '), ('"', ''), ('\'', ' \' '), ('.', ' . '), ('(', ' ( '), (')', ' ) '),
                    ('!', ' ! '), ('?', ' ? '), (':', ' '), (';', ' ')]
    df = dataframe[['name','description']].copy()
    for old, new in replacements:
        # regex=False treats characters such as '.', '(' and '?' literally
        df = df.apply(lambda x: x.str.replace(old, new, regex=False))
    df = df.apply(lambda x: x.str.lower())
    df['class'] = '__label__' + dataframe['class'].astype(str) + ' '
    if shuffle:
        # Import under an alias so that it does not shadow the `shuffle` argument
        from sklearn.utils import shuffle as shuffle_rows
        df = shuffle_rows(df).reset_index(drop=True)
    if encode:
        # Decompose accented characters and drop everything that is not ASCII
        df[['name','description']] = df[['name','description']].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii','ignore').str.decode('utf-8'))
    df['name'] = ' ' + df['name'] + ' '
    df['description'] = ' ' + df['description'] + ' '
    return df

# Transform datasets
df_train_clean = clean_dataset(df, True, False)
df_test_clean = clean_dataset(df_test, False, False)

# Write files to disk
train_file_clean = 'dbpedia.train'
df_train_clean.to_csv(train_file_clean, header=None, index=False, columns=['class','name','description'] )

test_file_clean = 'dbpedia.test'
df_test_clean.to_csv(test_file_clean, header=None, index=False, columns=['class','name','description'] )


CPU times: user 23.2 s, sys: 596 ms, total: 23.8 s
Wall time: 25.2 s


Once the dataset is cleaned, the next step is to train the classifier. 

In [10]:
%%time
# Train a classifier
output_file = 'dp_model'
classifier = fasttext.supervised(train_file_clean, output_file, label_prefix='__label__')

CPU times: user 1min 13s, sys: 1.26 s, total: 1min 14s
Wall time: 20.6 s


Once the model is trained, we can evaluate it on the test set by computing its [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall). High precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.

In [11]:
%%time
# Evaluate classifier
result = classifier.test(test_file_clean)
print('P@1:', result.precision)
print('R@1:', result.recall)
print('Number of examples:', result.nexamples)

P@1: 0.9832428571428572
R@1: 0.9832428571428572
Number of examples: 70000
CPU times: user 580 ms, sys: 12 ms, total: 592 ms
Wall time: 591 ms
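
P@1 and R@1 are identical in the output above. That is expected when every example has exactly one true label and the model returns exactly one prediction: both metrics then reduce to the fraction of correct predictions. The sketch below reproduces this with toy data. It assumes precision is counted over the predicted labels and recall over the true labels; the function name `precision_recall_at_1` is illustrative and not part of fastText.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Compute P@1 and R@1 when each example has one true and one predicted label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    n_predictions = len(predicted_labels)  # one prediction per example (k = 1)
    n_true_labels = len(true_labels)       # one true label per example
    return correct / n_predictions, correct / n_true_labels

# Toy labels, not taken from DBPedia: 4 of the 5 predictions are correct
y_true = [3, 4, 13, 1, 7]
y_pred = [3, 4, 3, 1, 7]
p_at_1, r_at_1 = precision_recall_at_1(y_true, y_pred)
print('P@1:', p_at_1)
print('R@1:', r_at_1)
```

The two values only diverge when examples carry several labels or when more than one prediction per example is requested.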


The next step is to check how the model works with real sentences.

In [12]:
sentence1 = ['Picasso was a famous painter born in Malaga, Spain. He revolutionized the art in the 20th century.']
labels1 = classifier.predict(sentence1)
class1 = int(labels1[0][0])
print("Sentence: ", sentence1[0])
print("Label: %d; label name: %s" %(class1, class_dict[class1]))

sentence2 = ['One of my favourite tennis players in the world is Rafa Nadal.']
labels2 = classifier.predict_proba(sentence2)
class2, prob2 = labels2[0][0] # it returns class2 as string
print("Sentence: ", sentence2[0])
print("Label: %s; label name: %s; certainty: %f" %(class2, class_dict[int(class2)], prob2))

sentence3 = ['Say what one more time, I dare you, I double-dare you motherfucker!']
number_responses = 3
labels3 = classifier.predict_proba(sentence3, k=number_responses)
print("Sentence: ", sentence3[0])
for l in range(number_responses):
    class3, prob3 = labels3[0][l]
    print("Label: %s; label name: %s; certainty: %f" %(class3, class_dict[int(class3)], prob3))


Sentence:  Picasso was a famous painter born in Malaga, Spain. He revolutionized the art in the 20th century.
Label: 3; label name: Artist
Sentence:  One of my favourite tennis players in the world is Rafa Nadal.
Label: 4; label name: Athlete; certainty: 0.492188
Sentence:  Say what one more time, I dare you, I double-dare you motherfucker!
Label: 3; label name: Artist; certainty: 0.562500
Label: 14; label name: WrittenWork; certainty: 0.201172
Label: 7; label name: Building; certainty: 0.150391


The model predicts sentence 1 as `Artist`, which is correct. Sentence 2 is also predicted correctly as `Athlete`; this time we used the function `predict_proba`, which returns the certainty of the prediction as a probability. Finally, sentence 3 was not classified correctly. The correct label would be `Film`, since the sentence comes from a famous scene of a very good film. If by any chance you don't know [what I'm talking about](https://www.youtube.com/watch?v=xwT60UbOZnI), well, please put your priorities in order. Stop reading this notebook, go watch Pulp Fiction, and then come back to keep learning NLP :-)
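
The nested result of `predict_proba` can be turned into readable rows with a small helper. This is only a sketch: it assumes the return shape used in the cell above (one list per sentence, holding `(label, probability)` tuples with the label as a string), and the name `describe_predictions` is hypothetical. The values are copied from the output for sentence 3.

```python
# Subset of the class mapping defined earlier in the notebook
class_dict = {3: 'Artist', 7: 'Building', 14: 'WrittenWork'}

def describe_predictions(predictions, class_names):
    """Map (label, probability) tuples to (class name, probability) tuples."""
    rows = []
    for label, prob in predictions:
        # Labels come back as strings, so convert them before the lookup
        rows.append((class_names[int(label)], prob))
    return rows

# Same shape as the result of predict_proba(sentence3, k=3)
labels3 = [[('3', 0.5625), ('14', 0.201172), ('7', 0.150391)]]
rows = describe_predictions(labels3[0], class_dict)
for name, prob in rows:
    print('%-12s %f' % (name, prob))
```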