# Introduction

This project involves creating a classifier that categorizes news from the hypothetical newspaper "One News". 

Two techniques are covered: one-hot encoding, which represents each word as a sparse vector with a single position set to 1, and word2vec, which represents words as dense, fixed-size vectors that capture context and generalize across similar words.

Word2vec can be trained in two ways: continuous bag of words (CBOW), which predicts a word from its context, and skip-gram, which predicts the context from a word. Pre-trained models of both kinds are loaded with the Gensim library in Python.
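The difference between the two training schemes can be illustrated by the examples each one builds from a sentence. The sketch below is only an illustration: the function name `training_pairs`, the `window` parameter, and the English sentence are assumptions made for this example and are not part of the project.

```python
def training_pairs(tokens, window=2):
    """Build CBOW examples (context -> target) and skip-gram pairs (target -> context word)."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # words within `window` positions of the target, excluding the target itself
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))                 # CBOW: whole context predicts the target
        skipgram.extend((target, c) for c in context)  # skip-gram: target predicts each context word
    return cbow, skipgram

sentence = "the cat sat on the mat".split()
cbow_pairs, skipgram_pairs = training_pairs(sentence, window=1)
print(cbow_pairs[1])       # (['the', 'sat'], 'cat')
print(skipgram_pairs[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

CBOW yields one example per word, while skip-gram yields one example per (word, neighbor) pair, which is one reason skip-gram training tends to take longer.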

In [1]:
import pandas as pd
article_train = pd.read_csv('https://caelum-online-public.s3.amazonaws.com/1638-word-embedding/treino.csv')
article_test = pd.read_csv('https://caelum-online-public.s3.amazonaws.com/1638-word-embedding/teste.csv')

In [2]:
article_train.head()

| Unnamed: 0 | title | text | date | category | subcategory | link |
|---|---|---|---|---|---|---|
| 0 | Após polêmica, Marine Le Pen diz que abomina n... | A candidata da direita nacionalista à Presidên... | 2017-04-28 | mundo | | http://www1.folha.uol.com.br/mundo/2017/04/187... |
| 1 | Macron e Le Pen vão ao 2º turno na França, em ... | O centrista independente Emmanuel Macron e a d... | 2017-04-23 | mundo | | http://www1.folha.uol.com.br/mundo/2017/04/187... |
| 2 | Apesar de larga vitória nas legislativas, Macr... | As eleições legislativas deste domingo (19) na... | 2017-06-19 | mundo | | http://www1.folha.uol.com.br/mundo/2017/06/189... |
| 3 | Governo antecipa balanço, e Alckmin anuncia qu... | O número de ocorrências de homicídios dolosos ... | 2015-07-24 | cotidiano | | http://www1.folha.uol.com.br/cotidiano/2015/07... |
| 4 | Após queda em maio, a atividade econômica sobe... | A economia cresceu 0,25% no segundo trimestre,... | 2017-08-17 | mercado | | http://www1.folha.uol.com.br/mercado/2017/08/1... |


In [3]:
article_test.head()

| Unnamed: 0 | title | text | date | category | subcategory | link |
|---|---|---|---|---|---|---|
| 0 | Grandes irmãos | RIO DE JANEIRO - O Brasil, cada vez menos famí... | 2017-03-06 | colunas | ruycastro | http://www1.folha.uol.com.br/colunas/ruycastro... |
| 1 | Haddad congela orçamento e suspende emendas de... | O prefeito de São Paulo, Fernando Haddad (PT),... | 2016-08-10 | colunas | monicabergamo | http://www1.folha.uol.com.br/colunas/monicaber... |
| 2 | Proposta de reforma da Fifa tem a divulgação d... | A Fifa divulgou, nesta quinta (10), um relatór... | 2015-10-09 | esporte | | http://www1.folha.uol.com.br/esporte/2015/09/1... |
| 3 | Mercado incipiente, internet das coisas conect... | Bueiros, coleiras, aparelhos hospitalares, ele... | 2016-11-09 | mercado | | http://www1.folha.uol.com.br/mercado/2016/09/1... |
| 4 | Mortes: Psicanalista, estudou o autismo em cri... | Toda vez que o grupo de amigos de Silvana Rabe... | 2017-02-07 | cotidiano | | http://www1.folha.uol.com.br/cotidiano/2017/07... |


# One-hot encoding

One-hot encoding splits every string into words and creates one column per unique word found across all strings of the text column, marking the presence or absence of that word. Each data point in the column is thus represented by a binary vector. The example below uses scikit-learn's CountVectorizer, which counts occurrences, so the vector of a single word is one-hot.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
#example

example = ['Hello world, how are you?', 
           'Hello you, how is the world?', 
           'World says hello to you.']

vectorizer = CountVectorizer()
vectorizer.fit(example)
vocabulary = vectorizer.vocabulary_ #unique words
print('Unique words =', vocabulary)

you_vector = vectorizer.transform(['you'])
print('Vector of the word "you"',you_vector.toarray())
hello_vector = vectorizer.transform(['hello'])
print('Vector of the word "hello"',hello_vector.toarray())

Unique words = {'hello': 1, 'world': 7, 'how': 2, 'are': 0, 'you': 8, 'is': 3, 'the': 5, 'says': 4, 'to': 6}
Vector of the word "you" [[0 0 0 0 0 0 0 0 1]]
Vector of the word "hello" [[0 1 0 0 0 0 0 0 0]]


With one-hot encoding, the vector size equals the number of unique words in the data, so it grows whenever new words appear. Is there a way to represent every word with a vector of the same fixed size?
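The growth in vector size can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming a whitespace split and lowercase words; the helpers `build_vocabulary` and `one_hot` are hypothetical names used only here and do not replace CountVectorizer.

```python
def build_vocabulary(sentences):
    """Map each unique lowercase word to a column index, in alphabetical order."""
    words = sorted({word for s in sentences for word in s.lower().split()})
    return {word: index for index, word in enumerate(words)}

def one_hot(word, vocabulary):
    """Return a list of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[word]] = 1
    return vector

corpus = ["hello world", "world says hello"]
vocab_small = build_vocabulary(corpus)
vocab_large = build_vocabulary(corpus + ["a brand new sentence"])

print(len(one_hot("hello", vocab_small)))  # 3
print(len(one_hot("hello", vocab_large)))  # 7
```

The same word gets a longer vector as soon as the corpus gains new words, which is the limitation that dense embeddings avoid.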

# Using pre-trained CBOW Word2Vec model

CBOW (Continuous Bag of Words) Word2Vec is a word embedding technique used in natural language processing to represent words as dense, numerical vectors. In simple terms, CBOW tries to predict a target word from its surrounding context words. It takes a group of context words as input and tries to generate the missing target word. The model learns to associate similar words based on their context and creates meaningful vector representations for each word. These word embeddings capture semantic relationships, allowing machines to understand the meaning and context of words. CBOW Word2Vec is popular for tasks like sentiment analysis, language modeling, and information retrieval in NLP.
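The prediction step described above can be sketched with NumPy. This is a simplified illustration, not the training code behind the pre-trained model: the weights are random and untrained, the vocabulary and the name `cbow_probabilities` are made up, and a full softmax is assumed, whereas practical implementations usually rely on cheaper approximations such as negative sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary = ["the", "cat", "sat", "on", "mat"]
word_to_index = {w: i for i, w in enumerate(vocabulary)}
embedding_dim = 4  # the pre-trained model uses 300; 4 keeps the example small

# Two weight matrices: input embeddings (one row per word) and output projection.
W_in = rng.normal(size=(len(vocabulary), embedding_dim))
W_out = rng.normal(size=(embedding_dim, len(vocabulary)))

def cbow_probabilities(context_words):
    """Average the context embeddings and score every vocabulary word as the target."""
    indices = [word_to_index[w] for w in context_words]
    hidden = W_in[indices].mean(axis=0)          # dense representation of the context
    scores = hidden @ W_out                      # one score per candidate target word
    exp_scores = np.exp(scores - scores.max())   # numerically stable softmax
    return exp_scores / exp_scores.sum()

probs = cbow_probabilities(["the", "sat"])
print(probs.shape, round(float(probs.sum()), 6))  # (5,) 1.0
```

After training, the rows of `W_in` are the word embeddings that a lookup such as `get_vector` would return.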


In [5]:
#CBOW model = http://143.107.183.175:22980/download.php?file=embeddings/wang2vec/cbow_s300.zip
%pip install gensim
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('cbow_s300.txt')








In [6]:
print(model.get_vector('china'))
print("The vector len is:", len(model.get_vector('china')))
print('Words most related to "china"', model.most_similar('china'))

[-1.49033e-01  1.26020e-01  2.17628e-01  1.82684e-01  1.65151e-01
 -1.59660e-01 -2.34411e-01  6.00570e-02  8.03680e-02  2.87578e-01
 -4.81100e-03 -5.68800e-02  2.15676e-01  8.65540e-02  1.25983e-01
  3.36157e-01 -1.83254e-01 -1.18499e-01  1.13010e-02  1.03814e-01
  9.37640e-02  2.90178e-01 -1.64395e-01 -1.13300e-02 -1.80676e-01
 -1.15820e-02  1.08728e-01  1.65898e-01  9.37900e-02  2.66767e-01
 -1.29890e-02  9.16030e-02  2.21292e-01 -1.36497e-01 -4.26350e-02
 -1.30038e-01  2.17067e-01 -1.01963e-01 -3.70960e-02  1.42155e-01
  3.41109e-01  2.46560e-01  1.27458e-01  5.72360e-02 -1.47962e-01
 -1.60290e-02  1.86533e-01  7.71550e-02 -3.50024e-01 -4.06085e-01
  1.67131e-01 -4.75230e-02  5.13780e-02 -1.28224e-01  1.06580e-02
 -2.92652e-01  1.40540e-01 -4.57049e-01  1.31094e-01  2.03234e-01
  2.94019e-01  7.38370e-02  1.11554e-01 -1.64204e-01 -3.62020e-02
  1.29522e-01 -1.28321e-01  1.37502e-01 -7.99200e-03 -5.07100e-03
 -2.86010e-02 -8.99040e-02  8.82800e-03 -8.27730e-02  6.91940e-02
 -2.70182e

Note that, according to the model's vectors, the words most related to "china" are nearby countries. Does that mean one can perform operations with these vectors?

In [7]:
#this will result in the closest words to a list of two words
model.most_similar(positive=['brasil','argentina'])

[('chile', 0.6781662702560425),
 ('peru', 0.634803295135498),
 ('venezuela', 0.6273865699768066),
 ('equador', 0.6037014126777649),
 ('bolívia', 0.6017141342163086),
 ('haiti', 0.5993806719779968),
 ('méxico', 0.596230685710907),
 ('paraguai', 0.5957703590393066),
 ('uruguai', 0.590367317199707),
 ('japão', 0.5893509387969971)]
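The score next to each word is a cosine similarity, the measure Gensim's `most_similar` uses to rank candidates. The sketch below computes it by hand; the three-dimensional vectors are made up for illustration and are not taken from the model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors (hypothetical values, not the model's).
brasil = np.array([0.9, 0.1, 0.3])
chile = np.array([0.8, 0.2, 0.4])
banana = np.array([-0.2, 0.9, -0.5])

print(cosine_similarity(brasil, chile) > cosine_similarity(brasil, banana))  # True
```

Because only the direction of the vectors matters, two words can be close even when their vectors have different lengths.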

In [8]:
#here one tries to find the plural of "estrela" (star) by using the pair
#"nuven"/"nuvens" (cloud/clouds; the standard spelling is "nuvem") as a reference
model.most_similar(positive=['nuvens', 'estrela'], negative=['nuven'])

[('estrelas', 0.46129682660102844),
 ('névoas', 0.42456939816474915),
 ('sombras', 0.41976645588874817),
 ('galáxias', 0.3930572271347046),
 ('brumas', 0.3914782404899597),
 ('ondas', 0.3874913454055786),
 ('poeira', 0.3841949999332428),
 ('crateras', 0.38199013471603394),
 ('grimpas', 0.38016703724861145),
 ('rochas', 0.3793330788612366)]
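The query above amounts to vector arithmetic: add the positive words, subtract the negative ones, and look for the nearest remaining word. A simplified sketch with made-up two-dimensional vectors follows. The `analogy` helper and the toy values are assumptions for illustration, and Gensim additionally normalizes and averages the vectors, which this version skips.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 encodes the concept, axis 1 encodes plurality.
toy_vectors = {
    "cloud":  np.array([1.0, 0.0]),
    "clouds": np.array([1.0, 1.0]),
    "star":   np.array([3.0, 0.0]),
    "stars":  np.array([3.0, 1.0]),
    "rock":   np.array([5.0, 0.0]),
}

def analogy(positive, negative, vectors):
    """Return the word closest (by cosine similarity) to sum(positive) - sum(negative)."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    excluded = set(positive) | set(negative)  # query words are not valid answers

    def cosine(v):
        return np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v))

    candidates = {w: cosine(v) for w, v in vectors.items() if w not in excluded}
    return max(candidates, key=candidates.get)

print(analogy(positive=["clouds", "star"], negative=["cloud"], vectors=toy_vectors))  # stars
```

Subtracting "cloud" from "clouds" isolates the plural direction, and adding it to "star" lands on "stars".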

# Text vectorization

In [9]:
article_train.title.loc[12]

"Daniel Craig será stormtrooper em novo 'Star Wars', diz ator"

This string contains elements that are not words, such as punctuation, so it needs to be tokenized.

In [10]:
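import nltk
import string

# word_tokenize needs the Punkt tokenizer data; download it once if missing
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt', quiet=True)

def tokenizer(text):
    alphanumerical_list = []
    for valid_token in nltk.word_tokenize(text):
        # skip tokens that are punctuation marks (',', '?', ...);
        # punctuation attached to a word (e.g. "'Star") is kept
        if valid_token in string.punctuation:
            continue
        alphanumerical_list.append(valid_token)
    return alphanumerical_list

tokenizer(article_train.title.loc[12])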

['Daniel',
 'Craig',
 'será',
 'stormtrooper',
 'em',
 'novo',
 "'Star",
 'Wars',
 'diz',
 'ator']
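The token "'Star" in the output above shows that punctuation attached to a word survives this filter. One possible alternative, shown here only as a sketch and not as the method used in the rest of this project, is a regular expression that keeps runs of letters and digits. The name `regex_tokenizer` is hypothetical.

```python
import re

def regex_tokenizer(text):
    """Return runs of letters and digits, dropping punctuation attached to them."""
    # \w matches Unicode letters, digits and underscore, so accented words stay whole
    return re.findall(r"\w+", text)

title = "Daniel Craig será stormtrooper em novo 'Star Wars', diz ator"
print(regex_tokenizer(title))
# ['Daniel', 'Craig', 'será', 'stormtrooper', 'em', 'novo', 'Star', 'Wars', 'diz', 'ator']
```

The trade-off is that hyphenated words are split into separate tokens, which may or may not be desirable depending on the model's vocabulary.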

## Simple vectorization method

In [11]:
import numpy as np

def sum_of_vectors(alphanumerical_list):
    resulting_vector=np.zeros(300) #since the len is 300 and the type is array
    for token in alphanumerical_list:
        resulting_vector += model.get_vector(token)
    return resulting_vector

example = tokenizer("text example") 
resulting_vector = sum_of_vectors(example)

print(resulting_vector)
print("The vector len is:", len(resulting_vector))

[-0.017046    0.029228    0.22306101  0.152564   -0.200753   -0.127543
 -0.146758   -0.04999799 -0.25263101  0.27229601 -0.35945301  0.547111
 -0.176569    0.1486     -0.252841    0.079226   -0.52047399 -0.007093
 -0.44340901 -0.121615   -0.23984801 -0.039316   -0.268535    0.34372199
 -0.06969301  0.23496301 -0.18431701  0.140893    0.099615    0.26310501
  0.163359   -0.403374   -0.186685   -0.32900199 -0.56019901  0.247881
 -0.18952099 -0.19303299 -0.087729   -0.294744    0.08101501  0.22759
  0.232398    0.33104201 -0.104798   -0.34920001  0.041333    0.081159
  0.189505    0.45870599  0.272421   -0.056138    0.279457   -0.02023801
  0.326961   -0.133247   -0.45143999  0.31204    -0.214312    0.257893
 -0.036405    0.22963601  0.10984901  0.39851099 -0.20284601  0.29613101
 -0.069412    0.570878   -0.168102    0.02522     0.048971   -0.653547
 -0.004913   -0.11250401 -0.49022299 -0.18543     0.44946999 -0.092301
  0.26825399  0.05362599 -0.225497    0.148662    0.58682597 -0.084838

This method works, but it has limitations. One of the main ones is the vocabulary: if a token passed to sum_of_vectors() is not in the model's vocabulary, the function fails with a KeyError (word not found).

This can also be related to word frequency, since the pre-processing applied before training the Word2Vec model replaces words with fewer than 5 occurrences in the text dataset with an UNKNOWN token. All numbers in the strings were also transformed into zeros: 0 for units, 00 for tens, 000 for hundreds, and so on (Reference link: https://arxiv.org/abs/1301.3781).

This needs to be fixed, since that information is necessary.

In [12]:
def sum_of_vectors(alphanumerical_list):
    resulting_vector=np.zeros(300) #since the len is 300 and the type is array
    for token in alphanumerical_list:
        try:
            resulting_vector += model.get_vector(token)
        except KeyError:
            if token.isnumeric():
                token = '0'*len(token)
                resulting_vector += model.get_vector(token)
            else:
                resulting_vector += model.get_vector('unknown')
    return resulting_vector

example = tokenizer("number 2315 dracula dadouken") #words with low frequency and numbers
resulting_vector = sum_of_vectors(example)

print(resulting_vector)
print("The vector len is:", len(resulting_vector))

[ 4.47156008e-01 -1.35562994e-01 -5.09890113e-02  2.28716999e-01
 -2.11824998e-01 -3.06598997e-01  1.66520005e-01 -4.29335989e-01
 -7.13000353e-03 -5.56901009e-01  1.78813000e-01 -1.48512000e-01
 -3.14641990e-01  6.55426025e-01  1.96528994e-01  2.30792001e-01
 -9.69326997e-01  5.53601235e-03 -8.04177001e-01 -5.03518984e-01
 -3.70199978e-02 -5.61859952e-02 -3.04392993e-01  7.45470982e-01
 -5.40238023e-01  2.78465994e-01  5.20579815e-02 -2.75829003e-01
  1.45709008e-01  4.80225980e-01 -1.36143997e-01 -2.07818995e-01
  7.54090026e-02 -1.23510994e-01 -7.56703012e-01 -4.60499991e-02
 -4.97524001e-01 -3.79018009e-01  3.59106004e-01 -5.06000007e-02
  6.89790128e-02 -2.01459013e-01 -2.92449012e-01 -2.54795002e-01
  7.51613986e-01 -1.56610992e-01  1.04944000e-01 -1.57802977e-01
  5.90936012e-01  7.49940995e-01  6.13562991e-01  1.31321004e-01
  1.20389994e-01  2.15308993e-01  4.76178989e-01 -4.60175000e-01
  2.43807994e-01  6.14789002e-01  4.00899008e-01 -1.38468999e-01
  2.03336000e-01  8.07155

Now the function works with numbers and with words that are missing from the vocabulary.
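To see how often the fallback is triggered, the same rules can be applied to the tokens alone, without touching the vectors. The sketch below assumes a tiny made-up vocabulary standing in for the keys of the pre-trained model; `normalize_token` and `toy_vocabulary` are hypothetical names.

```python
def normalize_token(token, vocabulary):
    """Map a token to a key that exists in the vocabulary, mirroring the fallback rules above."""
    if token in vocabulary:
        return token
    if token.isnumeric():
        return "0" * len(token)   # e.g. '2315' -> '0000'
    return "unknown"              # placeholder for out-of-vocabulary words

# Hypothetical tiny vocabulary standing in for the keys of the pre-trained model.
toy_vocabulary = {"number", "0000", "unknown"}
tokens = ["number", "2315", "dracula", "dadouken"]

normalized = [normalize_token(t, toy_vocabulary) for t in tokens]
fallbacks = sum(original != mapped for original, mapped in zip(tokens, normalized))

print(normalized)  # ['number', '0000', 'unknown', 'unknown']
print(f"{fallbacks}/{len(tokens)} tokens needed a fallback")  # 3/4 tokens needed a fallback
```

A high fallback rate on real titles would suggest that many tokens collapse into the same placeholder vector, which limits what the classifier can learn from them.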

## Improving the method

In [13]:
def vector_matrix(texts):
    x =len(texts)
    y= 300
    matrix=np.zeros((x,y))
    for i in range(x):
        alphanumerical = tokenizer(texts.iloc[i])
        matrix[i] = sum_of_vectors(alphanumerical)
    return matrix

train_matrix = vector_matrix(article_train.title)
test_matrix = vector_matrix(article_test.title)

print('train shape', train_matrix.shape)
print('test shape', test_matrix.shape)

train shape (90000, 300)
test shape (20513, 300)


Now all the titles are vectorized and ready to be fed to the machine learning model.
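One property of summing is that the magnitude of the resulting vector grows with the number of tokens in the title. The sketch below illustrates this with a made-up word vector repeated 3 and 12 times, and compares it with averaging, a common alternative that is not the method used in this project.

```python
import numpy as np

word_vector = np.full(300, 0.1)             # made-up vector standing in for one word
short_title = np.tile(word_vector, (3, 1))  # title with 3 tokens, shape (3, 300)
long_title = np.tile(word_vector, (12, 1))  # title with 12 tokens, shape (12, 300)

# Summing: the norm scales with the number of tokens.
sum_ratio = np.linalg.norm(long_title.sum(axis=0)) / np.linalg.norm(short_title.sum(axis=0))
# Averaging: the norm is independent of the number of tokens.
mean_ratio = np.linalg.norm(long_title.mean(axis=0)) / np.linalg.norm(short_title.mean(axis=0))

print(round(float(sum_ratio), 2), round(float(mean_ratio), 2))  # 4.0 1.0
```

Whether averaging helps the classifier is an empirical question; it would have to be tested on the actual title vectors.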

# Building the classifier

## Establishing a baseline

In [14]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report
DC = DummyClassifier()
DC.fit(train_matrix, article_train.category)
dummy_pred = DC.predict(test_matrix)

print(f'General score= {DC.score(test_matrix, article_test.category)}')

dummy_CR = classification_report(article_test.category, dummy_pred)

print(dummy_CR)

General score= 0.29751864671184125




              precision    recall  f1-score   support

     colunas       0.30      1.00      0.46      6103
   cotidiano       0.00      0.00      0.00      1698
     esporte       0.00      0.00      0.00      4663
   ilustrada       0.00      0.00      0.00       131
     mercado       0.00      0.00      0.00      5867
       mundo       0.00      0.00      0.00      2051

    accuracy                           0.30     20513
   macro avg       0.05      0.17      0.08     20513
weighted avg       0.09      0.30      0.14     20513
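The general score of the dummy model can be checked by hand. With its default settings, DummyClassifier predicts the most frequent training class for every sample, and the report shows that this class was 'colunas' (recall 1.00, all other classes 0.00). The accuracy is therefore the share of 'colunas' in the test set, as this small calculation shows.

```python
# Test-set support per category, copied from the classification report above.
support = {"colunas": 6103, "cotidiano": 1698, "esporte": 4663,
           "ilustrada": 131, "mercado": 5867, "mundo": 2051}

total = sum(support.values())
# The dummy model predicted 'colunas' for every article, so it is right
# exactly as often as 'colunas' appears in the test set.
baseline_accuracy = support["colunas"] / total
print(total, round(baseline_accuracy, 4))  # 20513 0.2975
```

Any real classifier has to beat this value to be considered useful.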





In [15]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(max_iter=299) #tested before to avoid warnings
LR.fit(train_matrix, article_train.category)
print(f'General score={LR.score(test_matrix, article_test.category)}')

General score=0.6996538780285673


In [16]:
from sklearn.metrics import classification_report
predicted_labels = LR.predict(test_matrix)
LR_CR = classification_report(article_test.category, predicted_labels)

print(LR_CR)


              precision    recall  f1-score   support

     colunas       0.84      0.68      0.75      6103
   cotidiano       0.49      0.63      0.55      1698
     esporte       0.83      0.75      0.78      4663
   ilustrada       0.08      0.82      0.15       131
     mercado       0.79      0.72      0.75      5867
       mundo       0.53      0.66      0.59      2051

    accuracy                           0.70     20513
   macro avg       0.59      0.71      0.60     20513
weighted avg       0.76      0.70      0.72     20513



Judging by the general score and the classification report, the model performed clearly better than the dummy classifier, although the weighted averages show there is still room for improvement. The weighted average is the mean of the per-category metrics weighted by the number of samples (support) in each category. Precision was good for some categories but poor for others. For 'ilustrada', recall was high (0.82) while precision was very low (0.08): most real 'ilustrada' articles were found, but most articles labeled 'ilustrada' belong to other categories. This may be related to the small number of samples in that class (support=131). The 'cotidiano' and 'mundo' categories, which also have lower support, show lower metrics as well.
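As a worked example of the two averages, the macro and weighted precision can be recomputed from the per-category values printed in the logistic regression report above. The numbers are copied from that report, so small rounding differences with respect to the unrounded values are possible.

```python
# (precision, support) per category, copied from the logistic regression report above.
report = {
    "colunas":   (0.84, 6103),
    "cotidiano": (0.49, 1698),
    "esporte":   (0.83, 4663),
    "ilustrada": (0.08, 131),
    "mercado":   (0.79, 5867),
    "mundo":     (0.53, 2051),
}

total_support = sum(n for _, n in report.values())
# Macro average: every category counts the same.
macro_precision = sum(p for p, _ in report.values()) / len(report)
# Weighted average: every category counts in proportion to its support.
weighted_precision = sum(p * n for p, n in report.values()) / total_support

print(round(macro_precision, 2), round(weighted_precision, 2))  # 0.59 0.76
```

The macro average is pulled down by 'ilustrada', while the weighted average barely notices it because the class has only 131 samples.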

The metrics could be improved by data augmentation, fine-tuning, or adjusting class weights to address the class imbalance.
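For the class-weight option, scikit-learn offers a "balanced" heuristic, which can be requested with `LogisticRegression(class_weight='balanced')`. The sketch below reproduces the formula behind it by hand. The training counts are hypothetical, chosen only to show the effect; the real ones would come from the training data.

```python
# Hypothetical training counts per category (illustrative only; the real counts
# would come from article_train.category.value_counts()).
class_counts = {"colunas": 500, "mercado": 300, "ilustrada": 200}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: n_samples / (n_classes * count), so rarer classes weigh more.
class_weights = {label: n_samples / (n_classes * count)
                 for label, count in class_counts.items()}
print(class_weights)
```

Errors on a rare class then cost more during training, which usually trades some precision on frequent classes for better recall on rare ones.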

In summary, the model performed reasonably well, with an overall accuracy of 0.70, but performance varies considerably across categories. It does well on 'colunas', 'esporte', and 'mercado' (f1-scores between 0.75 and 0.78) and struggles with 'ilustrada' (f1-score 0.15), which has low support and low precision.

# Training using SkipGram



In [20]:
#SkipGram model = http://143.107.183.175:22980/download.php?file=embeddings/fasttext/skip_s300.zip
model2 = KeyedVectors.load_word2vec_format('skip_s300.txt')
LR2= LogisticRegression(max_iter=299)
def sum_of_vectors_model2(alphanumerical_list):
    resulting_vector=np.zeros(300) #since the len is 300 and the type is array
    for token in alphanumerical_list:
        try:
            resulting_vector += model2.get_vector(token)
        except KeyError:
            if token.isnumeric():
                token = '0'*len(token)
                resulting_vector += model2.get_vector(token)
            else:
                resulting_vector += model2.get_vector('unknown')
    return resulting_vector

def vector_matrix_model2(texts):
    x =len(texts)
    y= 300
    matrix=np.zeros((x,y))
    for i in range(x):
        alphanumerical = tokenizer(texts.iloc[i])
        matrix[i] = sum_of_vectors_model2(alphanumerical)
    return matrix

train_matrix2 = vector_matrix_model2(article_train.title)
test_matrix2 = vector_matrix_model2(article_test.title)

LR2.fit(train_matrix2, article_train.category)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [21]:
predicted_labels2 = LR2.predict(test_matrix2)
CR2 = classification_report(article_test.category, predicted_labels2)

print(CR2)

              precision    recall  f1-score   support

     colunas       0.83      0.67      0.75      6103
   cotidiano       0.50      0.64      0.56      1698
     esporte       0.84      0.75      0.79      4663
   ilustrada       0.08      0.80      0.15       131
     mercado       0.79      0.73      0.76      5867
       mundo       0.54      0.67      0.60      2051

    accuracy                           0.71     20513
   macro avg       0.60      0.71      0.60     20513
weighted avg       0.76      0.71      0.73     20513



Using skip-gram vectors with the same dimensions as the CBOW vectors used before, the logistic regression performed slightly better (accuracy 0.71 versus 0.70). A possible explanation is that skip-gram tends to capture semantic relations better, at the cost of slower training compared to CBOW. Note that the solver stopped at the iteration limit in this run, so the comparison is not fully settled. It is recommended to try both and see which one performs best for the specific case.
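The convergence warning printed when fitting LR2 suggests increasing max_iter or scaling the data. A minimal sketch of both suggestions follows, using synthetic data in place of the title vectors. The data, the feature count, and the value max_iter=1000 are assumptions for illustration, and the effect on the real matrices would have to be measured.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the summed title vectors: two classes, 20 features,
# deliberately on a large scale.
X = np.vstack([rng.normal(loc=-50.0, scale=100.0, size=(200, 20)),
               rng.normal(loc=50.0, scale=100.0, size=(200, 20))])
y = np.array([0] * 200 + [1] * 200)

# Standardize each feature, then fit with a higher iteration budget.
classifier = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
classifier.fit(X, y)
print(round(classifier.score(X, y), 2))
```

Wrapping the scaler and the model in one pipeline ensures the scaling learned on the training matrix is reused unchanged on the test matrix.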

# Conclusion

The classifier performed better than the dummy baseline, with an accuracy of about 0.70 against 0.30. It can be used as a category classifier for the articles, but it still needs improvements.