# Input Data and Classification Model

### The Milky Way was so meh! Been there, done that!


Every group of document are separate by language and label. I am going to unify the languages to make three different models, one per language. So, the input must be in the same language and with the specific label.

In [1]:
# Imports
import pandas as pd
import numpy as np
import re
import glob, os

import spacy

import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

In [2]:
from spacy.lang.en.stop_words import STOP_WORDS
stopwords_en = list(STOP_WORDS)
from spacy.lang.fr.stop_words import STOP_WORDS
stopwords_fr = list(STOP_WORDS)
from spacy.lang.es.stop_words import STOP_WORDS
stopwords_es = list(STOP_WORDS)

from spacy.lang.en import English
from spacy.lang.fr import French
from spacy.lang.es import Spanish


# 0. Preparing Data
In this part we are going to read the documents in the folders and join then in the same dataframe with the different labels for the training model. Because we are going to create 3 different models -one for each language- we are going to separate the dataframes because of the language.

In [3]:
def union_df(path, label):
    df_all = pd.DataFrame()
    for f in glob.glob(path + '/*.txt'):
        doc = [open(f, encoding='utf-8').read()]
        df = pd.DataFrame(doc, columns=['text'])
        df['label'] = label
        #df['file_name'] = re.findall('.+\/(.+\.txt)', f)
        #df = df[['label', 'text']]
        df_all = pd.concat([df_all,df], ignore_index = True)
    return df_all

In [4]:
label1 = 'APR'
label2 = 'Conference_papers'
label3 = 'PAN11'
label4 = 'Wikipedia'

# 1. English

### 1.1. Joing Dataframes

In [5]:
path1 = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/APR/en'

path2 = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Conference_papers/en'

path3 = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/PAN11/en'

path4 = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Wikipedia/en'


In [6]:
df_en1 = union_df(path1,label1)


In [7]:
df_en2 = union_df(path2,label2)


In [8]:
df_en3 = union_df(path3,label3)


In [9]:
df_en4 = union_df(path4,label4)

In [10]:
df_en = pd.concat([df_en1,df_en2, df_en3, df_en4], ignore_index = True)
df_en

Unnamed: 0,text,label
0,"i read this book because in my town, everyone ...",APR
1,recipes appreciated by the family (small and l...,APR
2,i say no to ease ..... and not to the author w...,APR
3,milady has found a good vein: anita blake. bas...,APR
4,"460 bc, somewhere in greece: ""gentlemen, i dec...",APR
...,...,...
9644,"Bupyeong-gu, Incheon. | location_country =...",Wikipedia
9645,Freedom Call is a German power metal band form...,Wikipedia
9646,majesty|consortname = Paola Ruffo di Calabriat...,Wikipedia
9647,Sertã (pron. ) is a municipality in Portugal ...,Wikipedia


I will be check few things about the new English dataframe:
- Lenght
- Columns
- Nulls

In [11]:
def checkingdf(df):
    length = len(df)
    columns = df.columns
    nulls = df.isnull().sum()
    return length, columns, nulls

In [12]:
checkingdf(df_en)

(9649,
 Index(['text', 'label'], dtype='object'),
 text     0
 label    0
 dtype: int64)

In [57]:
df_en.to_csv('/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/training_data/english.csv', sep='\t', index=False)

Everything looks fine, or I should say... *Roger, Roger!*

### 1.2. Removing StopWords, Lemmanization, Tokenization and remove puntuation
Removing **stopwords** is an important step in NLP. It involves filtering out hihg-frecuency words that add little or no semantic value to a sentence.

**Lemmanization** is the act that convert the several infected forms of a word into a single form to make the analysis process easier. For example, if we have *I'am*, *you're* or *she wasn't* we will get the verb *to be* instead of the variations.

**Tokenization** is a technique where we divide the words into morphemes (basic unit of meaning).

Also, we are going to **remove puntuation** to reduce the content of the text files into words.

I am using SpaCy

In [15]:
lang = 'en'
nlp = spacy.load('en_core_web_md')

In [16]:
punctuations = string.punctuation
parser_en = English()
parser = parser_en

In [17]:
def spacy_tokenizer(sentence):
    #if lang == 'en':
     #   nlp = spacy.load(lang)
     #   tokens = English(sentence)
    #elif lang == 'fr':
     #   tokens == French(sentence)
    #elif lang == 'es':
     #   tokens == Spanish(sentence)
    #else:
     #   raise error
    tokens = parser(sentence)
    tokens = [w.lemma_.lower().strip() if w.lemma_ != '-PRON-!' else w.lower_ for w in tokens]
    tokens = [w for w in tokens if w not in stopwords_en and w not in punctuations]
    return tokens

### 1.3. ML with SKLearn

Custom transformer


In [18]:
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return[clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return[]

def clean_text(text):
    return text.strip().lower()

In [19]:
# Vectorization
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
classifier = LinearSVC()

# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer = spacy_tokenizer)

I am going to split the dataframe. We areg going to need to select the right columns of the dataframe, but also, we want to keep some data out from training model to test it. I am going to do a 80 - 20.

In [20]:
# Features
X = df_en['text']

# Classification Labels
y = df_en['label']

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

In [22]:
# Apply the pipeline to clean, tokenize, vectorize and classify
pipe = Pipeline([('cleaner', predictors()),('vectorizer', vectorizer), ('classifier', classifier)])

In [23]:
pipe.fit(X_train,y_train)

Pipeline(steps=[('cleaner', <__main__.predictors object at 0x1a241d1290>),
                ('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x1a2edcbcb0>)),
                ('classifier', LinearSVC())])

In [24]:
sample_prediction = pipe.predict(X_test)

In [25]:

for(sample, pred) in zip(X_test,sample_prediction):
    print(f'Predict: {pred}')

Predict: APR
Predict: Wikipedia
Predict: APR
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Conference_papers
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: APR
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: Conference_papers
Predict: Wikipedia
Predict: PAN11
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: APR
Predict: APR
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: APR
Predict: APR
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: PAN11
Predict: PAN11
Predict: APR
Predict: APR
Predict: APR
Predict: PAN11
Predict: Wikipedia
Predict: PAN11
Predict: Wikipedia
Predict: Conference_papers
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: PAN11
Predict: Wikipedia

In [32]:
print(f'Accuracy test:{pipe.score(X_test, y_test)}')
print(f'Accuracy sample:{pipe.score(X_test, sample_prediction)}')

Accuracy test:0.9569948186528497
Accuracy sample:1.0


In [33]:
print(f'Accuracy score:{pipe.score(X_train, y_train)}')

Accuracy score:1.0


In [38]:
pipe.predict(['With a scientific method we can show the data in this graph. We want to show our investigation results of five years with machine learning models'])

array(['APR'], dtype=object)

In [39]:
pipe.predict(['I do not like this item at all'])

array(['APR'], dtype=object)

In [40]:
pipe.predict(['and the computer that ran the first public Bulletin Board Systems, CBBS]]A Bulletin Board System, or BBS, is a computer system running software that allows users to connect and login to the system using a terminal program. Originally BBSes were accessed only over a phone line using a modem, but by the early 1990s some BBSes allowed access via a Telnet or packet radio connection.Once a user logged in, they could perform functions such as downloading or uploading software and data, reading news, and exchanging messages with other users. Many BBSes also offered on-line games, in which users could compete with each other, and BBSes with multiple phone lines often offered IRC-like chat rooms, allowing users to meet each other.In recent years, the term BBS is sometimes incorrectly used to refer to any online forum or message board.During their heyday from the late 1970s to the mid 1990s, most BBSes were run as a hobby free of charge by the system operator (or "sysop"), while other BBSes charged their users a subscription fee for access, or were operated by a business as a means of supporting their customers. Bulletin Board Systems were in many ways a precursor to the modern form of the World Wide Web and other aspects of the Internet.Early BBSes were often a local phenomenon, as one had to dial into a BBS'])

array(['Wikipedia'], dtype=object)

In [41]:
pipe.predict(['A Long Island iced tea is a type of alcoholic mixed drink typically made with vodka, tequila, light rum, triple sec, gin, and a splash of cola, which gives the drink the same amber hue as its namesake.[1] A popular version mixes equal parts vodka, tequila, gin, rum, triple sec, with ​1 1⁄2 parts sour mix and a splash of cola. Lastly, it is decorated with the lemon and straw, after stirring with bar spoon smoothly.[2]Most variants use equal parts of the main liquors, but include a smaller amount of triple sec (or other orange-flavored liqueur). Close variants often replace the sour mix with lemon juice, replace the cola with diet cola or actual iced tea, or add white crème de menthe. Most variants do not include any tea.The drink has a much higher alcohol concentration (approximately 22 percent) than most highball drinks due to the relatively small amount of mixer.'])

array(['APR'], dtype=object)

# 2. French

In [42]:
path1 = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/APR/fr'

path2 = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Conference_papers/fr'

path4 = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Wikipedia/fr'

In [43]:
df_fr1 = union_df(path1,label1)
df_fr2 = union_df(path2,label2)
df_fr4 = union_df(path4,label4)
df_fr = pd.concat([df_fr1, df_fr2, df_fr4], ignore_index = True)
df_fr

Unnamed: 0,text,label
0,"J'avais beaucoup aimé les premiers albums du ""...",APR
1,Je me joins aux commentaires peu satisfaits......,APR
2,"À sa parution en 1979, ce livre n'a pas rencon...",APR
3,Je découvre Douglas Kennedy et j'aimerais que ...,APR
4,J'ai acheté ce livre à la lecture des commenta...,APR
...,...,...
7958,", Nuremberg |années actives = depuis 1998 |gen...",Wikipedia
7959,Une cellule polyploïde (du grec : πολλαπλόν - ...,Wikipedia
7960,", George W. Bush et Albert II le .]]La reine ...",Wikipedia
7961,Sertã est une petite ville portugaise de 5 50...,Wikipedia


In [44]:
checkingdf(df_fr)

(7963,
 Index(['text', 'label'], dtype='object'),
 text     0
 label    0
 dtype: int64)

In [45]:
df_fr.to_csv('/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/training_data/french.csv', header= False, index=False)

In [46]:
lang = 'fr'
nlp = spacy.load('fr_core_news_lg')
parser_fr = French()
parser = parser_fr
# Vectorization
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
classifier = LinearSVC()

# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer = spacy_tokenizer)

# Features
X = df_fr['text']

# Classification Labels
y = df_fr['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

# Apply the pipeline to clean, tokenize, vectorize and classify
pipe = Pipeline([('cleaner', predictors()),('vectorizer', vectorizer), ('classifier', classifier)])

pipe.fit(X_train,y_train)

sample_prediction = pipe.predict(X_test)



In [47]:

for(sample, pred) in zip(X_test,sample_prediction):
    print(f'Predict: {pred}')
print(f'Accuracy test:{pipe.score(X_test, y_test)}')
print(f'Accuracy sample:{pipe.score(X_test, sample_prediction)}')

Predict: APR
Predict: APR
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: APR
Predict: APR
Predict: APR
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: APR
Predict: APR
Predict: APR
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Conference_papers
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: APR
Predict: APR
Predict: W

Accuracy test:0.9943502824858758
Accuracy sample:1.0


In [50]:
pipe.predict(["Le Long Island iced tea ou LIIT ou Long Island littéralement thé glacé Long Island en anglais est un cocktail à base de tequila gin vodka rhum liqueur doranges et cola Ce cocktail officiel de lIBA tient son nom de sa ressemblance à du thé glacé bien quil nen contienne pas)"])

array(['APR'], dtype=object)

# Spanish

In [51]:
path3 = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/PAN11/es'

path4 = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Wikipedia/es'

In [52]:
df_es3 = union_df(path3,label3)
df_es4 = union_df(path4,label4)
df_es = pd.concat([df_es3, df_es4], ignore_index = True)
df_es

Unnamed: 0,text,label
0,"El primero dia pafamos aquellas Lagunas, i pa...",PAN11
1,"A la puesta del Sol, por vnos llanos, i entre...",PAN11
2,"\n\nLa Tierra, por la maior parte, defde donde...",PAN11
3,\n\nCAP. XXXVI. De como hecimos hacer Iglesias...,PAN11
4,¡Asombra el imaginar lo que hubiera dado este...,PAN11
...,...,...
4992,| fecha_de_fallecimiento = 8 de enero de 1981|...,Wikipedia
4993,Red Hat es la compañía responsable de la creac...,Wikipedia
4994,Bashkortostán (en ruso: Республика Башкортоста...,Wikipedia
4995,|zona=Polinesia |hablantes=165.000 (censo 200...,Wikipedia


In [53]:
checkingdf(df_es)

(4997,
 Index(['text', 'label'], dtype='object'),
 text     0
 label    0
 dtype: int64)

In [54]:
df_es.to_csv('/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/training_data/spanish.csv', header= False, index=False)

In [55]:
lang = 'es'
nlp = spacy.load('es_core_news_lg')
parser_fr = French()
parser = parser_fr
# Vectorization
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
classifier = LinearSVC()

# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer = spacy_tokenizer)

# Features
X = df_fr['text']

# Classification Labels
y = df_fr['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

# Apply the pipeline to clean, tokenize, vectorize and classify
pipe = Pipeline([('cleaner', predictors()),('vectorizer', vectorizer), ('classifier', classifier)])

pipe.fit(X_train,y_train)

sample_prediction = pipe.predict(X_test)



In [56]:
for(sample, pred) in zip(X_test,sample_prediction):
    print(f'Predict: {pred}')
print(f'Accuracy test:{pipe.score(X_test, y_test)}')
print(f'Accuracy sample:{pipe.score(X_test, sample_prediction)}')

Predict: APR
Predict: APR
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: APR
Predict: APR
Predict: APR
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: APR
Predict: APR
Predict: APR
Predict: Wikipedia
Predict: APR
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Conference_papers
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: Wikipedia
Predict: APR
Predict: APR
Predict: APR
Predict: W

Accuracy test:0.9943502824858758
Accuracy sample:1.0


In [None]:
pipe.predict(["Le Long Island iced tea ou LIIT ou Long Island littéralement thé glacé Long Island en anglais est un cocktail à base de tequila gin vodka rhum liqueur doranges et cola Ce cocktail officiel de lIBA tient son nom de sa ressemblance à du thé glacé bien quil nen contienne pas)"])

In [None]:
Para más adelante:
    - Balancear el dataset para que la categoría APR no sea tan dominante
    - Unir el modelo a la función de selección de lenguaje para entrar en uno u otro
