<p><img alt="INGEOTEC logo" height="50px" src="https://github.com/INGEOTEC/text_models/raw/master/docs/source/ingeotec.png" align="left" hspace="10px" vspace="0px" /></p>

<h1>EvoMSA 2.0</h1>
<h2>Quickstart Guide</h2>

### http://ingeotec.mx

### http://github.com/ingeotec

# Installing EvoMSA

The first step is to install the library.

In [1]:
try:
    import EvoMSA
except ImportError:
    !pip install evomsa

# Libraries and Text Classification Problem

Once EvoMSA is installed, a few libraries must be loaded. The first line imports the EvoMSA core classes. The second line imports `TWEETS`, the path of the file containing the text classification problem. Finally, the third line imports `tweet_iterator`, a function that reads a file containing one JSON object per line.

In [2]:
from EvoMSA import BoW, DenseBoW, StackGeneralization
from EvoMSA.tests.test_base import TWEETS
from microtc.utils import tweet_iterator
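
To make the "one JSON per line" format concrete, the following is a minimal sketch that uses only the standard library. The function `read_json_lines` and the two records are hypothetical and serve only as an illustration; this is not the implementation of `tweet_iterator`.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Yield one dictionary per non-empty line of a JSON-per-line file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Hypothetical records with the same fields used in this guide.
records = [
    {"text": "buenos días", "klass": "P"},
    {"text": "sin comentarios", "klass": "NONE"},
]

# Write the records to a temporary file, one JSON object per line.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    for record in records:
        tmp.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back and clean up.
loaded = list(read_json_lines(tmp.name))
os.remove(tmp.name)
print(loaded[0]["klass"])
```

Each line is parsed independently, so a file of this kind can be read lazily without loading it whole into memory.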

The text classification problem can be read using the following instruction. It is stored in the variable `D`, which is a list of dictionaries. The second line shows the content of the first element in `D`.

In [3]:
D = list(tweet_iterator(TWEETS))
D[0]

{'text': '| R #OnPoint | #Summer16 ...... 🚣🏽🌴🌴🌴 @ Ocho Rios, Jamacia https://t.co/8xkfjhk52L',
 'klass': 'NONE',
 'q_voc_ratio': 0.4585635359116022}

The field `text` is self-explanatory, and the field `klass` contains the label associated with that text.
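
Before training, it can help to separate the texts from the labels and check how the labels are distributed. The sketch below assumes records shaped like the elements of `D`; the list `sample` and its contents are hypothetical stand-ins, not the actual dataset.

```python
from collections import Counter

# Hypothetical records; in the notebook this would be the list `D`.
sample = [
    {"text": "buenos días", "klass": "P"},
    {"text": "qué mal servicio", "klass": "N"},
    {"text": "gracias por todo", "klass": "P"},
    {"text": "sin comentarios", "klass": "NONE"},
]

# Split every record into its text and its label.
texts = [record["text"] for record in sample]
labels = [record["klass"] for record in sample]

# Count how many examples each label has.
label_counts = Counter(labels)
print(label_counts.most_common())
```

A strongly unbalanced label distribution is usually a reason to prefer per-class measures over plain accuracy when evaluating a classifier.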

# BoW Classifier

The first text classifier presented is the pre-trained BoW. The following line creates and trains the classifier: the first part, `BoW(lang='es')`, initializes the class, and the second, `fit(D)`, estimates the parameters of the linear SVM.

In [4]:
bow = BoW(lang='es').fit(D)

After training, the text classifier can make predictions. For instance, the first line predicts the labels of the training set, while the second line predicts the label of the phrase _good morning_ in Spanish, _buenos días_.

In [5]:
hy = bow.predict(D)
bow.predict(['buenos días'])

array(['P'], dtype='<U4')
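
Predictions such as `hy` can be compared against the true labels to get a rough idea of performance. The following is a minimal sketch of two common measures written in plain Python; the functions `accuracy` and `macro_recall` and the example labels are hypothetical, standing in for `[x['klass'] for x in D]` and `hy`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def macro_recall(y_true, y_pred):
    """Average of the per-class recall, weighting every class equally."""
    recalls = []
    for label in sorted(set(y_true)):
        positions = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(y_pred[i] == label for i in positions)
        recalls.append(correct / len(positions))
    return sum(recalls) / len(recalls)


# Hypothetical labels and predictions, not taken from the dataset.
y_true = ["P", "N", "P", "NONE"]
y_pred = ["P", "P", "P", "NONE"]

acc = accuracy(y_true, y_pred)
rec = macro_recall(y_true, y_pred)
print(acc, rec)
```

Scores computed on the same data used for training tend to be optimistic, so a held-out set or cross-validation gives a more reliable estimate.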

# DenseBoW Classifier

The second method, DenseBoW, is trained on the same dataset following the same steps. The following instruction trains the text classifier.

In [6]:
dense = DenseBoW(lang='es').fit(D)

The code to predict is equivalent; therefore, only the prediction for the phrase _good morning_ is shown.

In [7]:
dense.predict(['buenos días'])

array(['P'], dtype='<U4')

# Stack Generalization

The final text classifier uses a stack generalization approach. The first step is to create the base text classifiers, which correspond to the two previous text classifiers, BoW and DenseBoW.

In [8]:
bow = BoW(lang='es')
dense = DenseBoW(lang='es')

It is worth noting that the base classifiers are not trained; as can be seen, the method `fit` is not called. These base classifiers are trained inside the stack generalization algorithm.

The second step is to initialize and train the stack generalization class, shown in the following instruction.

In [9]:
stack = StackGeneralization([bow, dense]).fit(D)

There is no need to specify the language for stack generalization because the base text classifiers already define it.
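
One reason stacking receives unfitted base classifiers is that, in the general technique, each base model is retrained on part of the data so that its predictions on the remaining part come from a model that never saw those examples. The sketch below illustrates that idea in plain Python. The names `MajorityClassifier` and `out_of_fold_predictions`, the interleaved fold assignment, and the toy data are all assumptions made for illustration; this is not necessarily how `StackGeneralization` is implemented.

```python
from collections import Counter


class MajorityClassifier:
    """Toy base classifier: always predicts the most frequent training label."""

    def fit(self, texts, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.label_ for _ in texts]


def out_of_fold_predictions(make_classifier, texts, labels, n_folds=3):
    """Predict every example with a model trained without that example."""
    n = len(texts)
    predictions = [None] * n
    for fold in range(n_folds):
        # Simplified interleaved split; real implementations often stratify.
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        # A fresh, unfitted classifier is trained for every fold.
        model = make_classifier().fit([texts[i] for i in train_idx],
                                      [labels[i] for i in train_idx])
        fold_pred = model.predict([texts[i] for i in test_idx])
        for i, pred in zip(test_idx, fold_pred):
            predictions[i] = pred
    return predictions


# Hypothetical toy data, not taken from the dataset.
toy_texts = ["t0", "t1", "t2", "t3", "t4", "t5"]
toy_labels = ["P", "N", "N", "P", "N", "N"]
oof = out_of_fold_predictions(MajorityClassifier, toy_texts, toy_labels)
print(oof)
```

In a complete stacking procedure, these out-of-fold outputs would become the input features of a second-level classifier, and the base classifiers would then be refitted on the whole training set for use at prediction time.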

The code to predict is the same in all the classes; therefore, the following code predicts the class for the phrase _good morning_.

In [10]:
stack.predict(['buenos días'])

array(['P'], dtype='<U4')