# Objetivo do código

Esse códido é o início de uma possível implementação do Chatbot. Foi pensando em treinar o modelo para ele classificar um nível de cada vez, usufruindo da resposta de um nível para treinar o próximo. Dessa maneira, o único input semlehante para o treinamento de todos os níveis é a descrição da compra. Até agora foi feito apenas o treinamento do primeiro nível

# Importação das bibliotecas necessárias

In [1]:
# Instalando os pacotes necessários
from google.colab import drive
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk import tokenize
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

import gensim
from scipy.spatial.distance import cosine
from gensim.models import KeyedVectors

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Importação do dataset e criação do dataframe

In [4]:
# Conectando ao Google Drive e efetuando a leitura do Dataset
drive.mount('/content/drive')

data = pd.read_excel("/content/drive/MyDrive/M8/ModeloSprint2/LATAM-Data.xlsx")
data.head()

Mounted at /content/drive


Unnamed: 0.1,Unnamed: 0,Supplier Name,Normalized Supplier Name,Parent Supplier Name,Region,Country Name,Strategic Region,Requestor Name,Preparer Name,Level 1,...,GL Desc (Level 6),Invoice ID,Invoice Number,Invoice Source,GL Description,Product,Project,"Month, Day, Year of Payment Date",PO Number,Amount USD
0,,20 TABELIAO DE NOTAS DA CAPITAL,20 TABELIAO DE NOTAS DA CAPITAL,20 TABELIAO DE NOTAS DA CAPITAL,LATAM,Brazil,LATAM,Daniela Fechio,Cindy Eurie,Uncategorized,...,Operating Expenses w/o Allocations,300002608576539,504851,LETTERBOX,Postage and courier,Default Product,31505 - Sao Paulo Birmann 32,2023-02-10,70000600000.0,6
1,,20 TABELIAO DE NOTAS DA CAPITAL,20 TABELIAO DE NOTAS DA CAPITAL,20 TABELIAO DE NOTAS DA CAPITAL,LATAM,Brazil,LATAM,Daniela Fechio,Cindy Eurie,Uncategorized,...,Operating Expenses w/o Allocations,300002647480228,505438,LETTERBOX,Postage and courier,Default Product,31505 - Sao Paulo Birmann 32,2023-03-08,70000600000.0,2
2,,20 TABELIAO DE NOTAS DA CAPITAL,20 TABELIAO DE NOTAS DA CAPITAL,20 TABELIAO DE NOTAS DA CAPITAL,LATAM,Brazil,LATAM,Daniela Fechio,Cindy Eurie,Uncategorized,...,Operating Expenses w/o Allocations,300002705372803,505806,LETTERBOX,Postage and courier,Default Product,31505 - Sao Paulo Birmann 32,2023-04-12,70000600000.0,6
3,,20 TABELIAO DE NOTAS DA CAPITAL,20 TABELIAO DE NOTAS DA CAPITAL,20 TABELIAO DE NOTAS DA CAPITAL,LATAM,Brazil,LATAM,Daniela Fechio,Cindy Eurie,Uncategorized,...,Operating Expenses w/o Allocations,300002712153834,506089,LETTERBOX,Postage and courier,Default Product,31505 - Sao Paulo Birmann 32,2023-04-14,70000600000.0,8
4,,20 TABELIAO DE NOTAS DA CAPITAL,20 TABELIAO DE NOTAS DA CAPITAL,20 TABELIAO DE NOTAS DA CAPITAL,LATAM,Brazil,LATAM,Daniela Fechio,Cindy Eurie,Uncategorized,...,Operating Expenses w/o Allocations,300002746687642,506689,LETTERBOX,Postage and courier,Default Product,31505 - Sao Paulo Birmann 32,2023-05-10,70000600000.0,6


In [5]:
df = pd.DataFrame(data)

# Tratamento dos dados

## Pré-processamento do PLN

A função "preprocess_text()" tem o objetivo de aplicar algumas técnicas de pré-processamento do PLN em uma string que receber como parâmetro

In [6]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Definindo método de pré-processamento de texto
def preprocess_text(text):

    tokens = tokenize.word_tokenize(text, language='english')  # Tokenização
    # tokens = [word for word in tokens if word.isalnum()]  # Removendo caracteres não alfanuméricos
    # tokens = [word.lower() for word in tokens]  # Convertendo para minúsculas
    # print("Tokenização: ", tokens)

    tokens = [word for word in tokens if word not in stop_words]  # Removendo Stopwords
    # print("Remoção das Stopwords: ", tokens)

    tokens = [stemmer.stem(word) for word in tokens]  # Aplicando stemming
    # print("Aplicação do Stemming: ", stemmed_tokens)

    return tokens

In [7]:
df_description = df[['GL Description']] # cópia de uma coluna do Dataset original para facilitar o tratamento

In [8]:
df_description['GL Description'] = df_description['GL Description'].apply(preprocess_text) # aplicação da função acima em todas as linhas da coluna

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_description['GL Description'] = df_description['GL Description'].apply(preprocess_text)


In [9]:
df_description

Unnamed: 0,GL Description
0,"[postag, courier]"
1,"[postag, courier]"
2,"[postag, courier]"
3,"[postag, courier]"
4,"[postag, courier]"
...,...
11260,"[busi, insur]"
11261,"[busi, insur]"
11262,"[busi, insur]"
11263,"[busi, insur]"


## Vetorização

O processo de vetorização foi feito utilizando o Word2Vec, e sua função está definida abaixo. O Word2Vec é uma técnica de aprendizado de máquina que converte palavras em vetores numéricos contínuos, permitindo que modelos capturem relações semânticas e sintáticas entre palavras com base em seus contextos de ocorrência. Tal processo ofi aplicado apenas na colunas "GL Description", ou seja, a colunas com os dados de entrada.

In [10]:
df_word2vec = df_description['GL Description'].to_frame()

In [11]:
df_word2vec

Unnamed: 0,GL Description
0,"[postag, courier]"
1,"[postag, courier]"
2,"[postag, courier]"
3,"[postag, courier]"
4,"[postag, courier]"
...,...
11260,"[busi, insur]"
11261,"[busi, insur]"
11262,"[busi, insur]"
11263,"[busi, insur]"


In [12]:
model_cbow = KeyedVectors.load_word2vec_format("/content/drive/Shareddrives/CogniVoice/cbow_s50.txt")

In [13]:
def word2vec(column):
  x = []
  for i in range(0,len(column)):
    vector = []
    for h in range(0,len(column[i])):
      # Adiciona o vetor da palavra na posição h (índice h) da frase atual no vetor temporário
      vector.append(model_cbow[h])
    # Após percorrer todas as palavras da frase e construir o vetor temporário, este trecho de código calcula a média dos vetores das palavras na frase e adiciona esse vetor médio à lista x.
    x.append(list(map(sum, zip(*vector))))
    vector = []
  print(x)
  return x


In [14]:
df_word2vec = pd.DataFrame(word2vec(df_word2vec['GL Description']))
df_word2vec

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,0.084044,0.194944,-0.155875,-0.441704,-0.165511,0.009908,-0.137812,-0.001082,-0.119002,-0.088,...,-0.035835,-0.002491,0.294376,-0.106502,-0.084032,0.058371,0.043544,0.033019,0.050086,0.185515
1,0.084044,0.194944,-0.155875,-0.441704,-0.165511,0.009908,-0.137812,-0.001082,-0.119002,-0.088,...,-0.035835,-0.002491,0.294376,-0.106502,-0.084032,0.058371,0.043544,0.033019,0.050086,0.185515
2,0.084044,0.194944,-0.155875,-0.441704,-0.165511,0.009908,-0.137812,-0.001082,-0.119002,-0.088,...,-0.035835,-0.002491,0.294376,-0.106502,-0.084032,0.058371,0.043544,0.033019,0.050086,0.185515
3,0.084044,0.194944,-0.155875,-0.441704,-0.165511,0.009908,-0.137812,-0.001082,-0.119002,-0.088,...,-0.035835,-0.002491,0.294376,-0.106502,-0.084032,0.058371,0.043544,0.033019,0.050086,0.185515
4,0.084044,0.194944,-0.155875,-0.441704,-0.165511,0.009908,-0.137812,-0.001082,-0.119002,-0.088,...,-0.035835,-0.002491,0.294376,-0.106502,-0.084032,0.058371,0.043544,0.033019,0.050086,0.185515
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11260,0.084044,0.194944,-0.155875,-0.441704,-0.165511,0.009908,-0.137812,-0.001082,-0.119002,-0.088,...,-0.035835,-0.002491,0.294376,-0.106502,-0.084032,0.058371,0.043544,0.033019,0.050086,0.185515
11261,0.084044,0.194944,-0.155875,-0.441704,-0.165511,0.009908,-0.137812,-0.001082,-0.119002,-0.088,...,-0.035835,-0.002491,0.294376,-0.106502,-0.084032,0.058371,0.043544,0.033019,0.050086,0.185515
11262,0.084044,0.194944,-0.155875,-0.441704,-0.165511,0.009908,-0.137812,-0.001082,-0.119002,-0.088,...,-0.035835,-0.002491,0.294376,-0.106502,-0.084032,0.058371,0.043544,0.033019,0.050086,0.185515
11263,0.084044,0.194944,-0.155875,-0.441704,-0.165511,0.009908,-0.137812,-0.001082,-0.119002,-0.088,...,-0.035835,-0.002491,0.294376,-0.106502,-0.084032,0.058371,0.043544,0.033019,0.050086,0.185515


## Label encoding

O Label encoding é uma técnica de transformar rótulos (labels) categóricos em números inteiros. Cada categoria recebe um número único, facilitando o uso de algoritmos de aprendizado de máquina que requerem entradas numéricas. Essa técnica foi aplicada na coluna de saída, que contém as 10 categorias de compra possíveis do nível 1.

In [15]:
df_labelEncoding = df[['Level 1']]
df_labelEncoding = df_labelEncoding.replace(['Energy & Utilities',
                                         'Human Resources',
                                         'Logistics',
                                         'Manufacturing',
                                         'Professional Services',
                                         'R&D Equipment (incl. Equipment Services and Supplies)',
                                         'Real Estate & Facilities',
                                         'Sales, Marketing & Events',
                                         'Technology/Telecom',
                                         'Travel & Expense',
                                         'Uncategorized'],
                                         [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

In [None]:
# df_final = pd.concat([df_word2vec, df_labelEncoding], axis=1)
# df_final

## PCA

O PCA (Principal Component Analysis) é uma técnica de análise de dados que reduz a dimensionalidade enquanto preserva a variação nos dados. Ele identifica os componentes principais, que são combinações lineares das variáveis originais, ordenados por sua contribuição para a variância total dos dados. Isso facilita a compressão de informações, eliminação de redundâncias e visualização de dados em dimensões reduzidas, muitas vezes sendo utilizado para simplificar conjuntos de dados complexos. Considerando tais fatores, o PCA foi aplicado nos dados de entrada, já que cada um deles, por conta da vetorização, se transformou em um vetor de tamanho 50. Foi decidido diminuir os dados pela metade, ou seja, 25 vetores.

In [17]:
!pip show scikit-learn

Name: scikit-learn
Version: 1.2.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: 
Author-email: 
License: new BSD
Location: /usr/local/lib/python3.10/dist-packages
Requires: joblib, numpy, scipy, threadpoolctl
Required-by: bigframes, fastai, imbalanced-learn, librosa, mlxtend, qudida, sklearn-pandas, yellowbrick


In [18]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [19]:
scaling = StandardScaler()

scaling.fit(df_word2vec)
scaled_data = scaling.transform(df_word2vec)

In [20]:
principal = PCA(n_components = 25)
principal.fit(scaled_data)
pca_calculation = principal.transform(scaled_data)

In [21]:
print(pca_calculation.shape)

(11265, 25)


In [22]:
final_df = pd.DataFrame(data = pca_calculation)

Por fim, é necessário concatenar os dataframes para melhor análise dos dados

In [23]:
df_final = pd.concat([final_df, df_labelEncoding], axis=1)
df_final

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,Level 1
0,-3.812749,2.262559,-2.366671,-1.05216,0.303172,-1.279635,0.025101,-0.015574,-0.006992,9.819629e-16,...,5.062861e-18,6.050539e-16,-2.246704e-15,1.815744e-15,1.413292e-15,-6.210968e-16,-1.484044e-15,-1.947077e-15,8.549342e-16,10
1,-3.812749,2.262559,-2.366671,-1.05216,0.303172,-1.279635,0.025101,-0.015574,-0.006992,9.819629e-16,...,5.062861e-18,6.050539e-16,-2.246704e-15,1.815744e-15,1.413292e-15,-6.210968e-16,-1.484044e-15,-1.947077e-15,8.549342e-16,10
2,-3.812749,2.262559,-2.366671,-1.05216,0.303172,-1.279635,0.025101,-0.015574,-0.006992,9.819629e-16,...,5.062861e-18,6.050539e-16,-2.246704e-15,1.815744e-15,1.413292e-15,-6.210968e-16,-1.484044e-15,-1.947077e-15,8.549342e-16,10
3,-3.812749,2.262559,-2.366671,-1.05216,0.303172,-1.279635,0.025101,-0.015574,-0.006992,9.819629e-16,...,5.062861e-18,6.050539e-16,-2.246704e-15,1.815744e-15,1.413292e-15,-6.210968e-16,-1.484044e-15,-1.947077e-15,8.549342e-16,10
4,-3.812749,2.262559,-2.366671,-1.05216,0.303172,-1.279635,0.025101,-0.015574,-0.006992,9.819629e-16,...,5.062861e-18,6.050539e-16,-2.246704e-15,1.815744e-15,1.413292e-15,-6.210968e-16,-1.484044e-15,-1.947077e-15,8.549342e-16,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11260,-3.812749,2.262559,-2.366671,-1.05216,0.303172,-1.279635,0.025101,-0.015574,-0.006992,9.819629e-16,...,5.062861e-18,6.050539e-16,-2.246704e-15,1.815744e-15,1.413292e-15,-6.210968e-16,-1.484044e-15,-1.947077e-15,8.549342e-16,10
11261,-3.812749,2.262559,-2.366671,-1.05216,0.303172,-1.279635,0.025101,-0.015574,-0.006992,9.819629e-16,...,5.062861e-18,6.050539e-16,-2.246704e-15,1.815744e-15,1.413292e-15,-6.210968e-16,-1.484044e-15,-1.947077e-15,8.549342e-16,10
11262,-3.812749,2.262559,-2.366671,-1.05216,0.303172,-1.279635,0.025101,-0.015574,-0.006992,9.819629e-16,...,5.062861e-18,6.050539e-16,-2.246704e-15,1.815744e-15,1.413292e-15,-6.210968e-16,-1.484044e-15,-1.947077e-15,8.549342e-16,10
11263,-3.812749,2.262559,-2.366671,-1.05216,0.303172,-1.279635,0.025101,-0.015574,-0.006992,9.819629e-16,...,5.062861e-18,6.050539e-16,-2.246704e-15,1.815744e-15,1.413292e-15,-6.210968e-16,-1.484044e-15,-1.947077e-15,8.549342e-16,10


# Treinamento do modelo

Foram utilizados 4 diferentes tipos de modelos de machine learning, sendo que todos tinham a mesma divisão de 80% para treino e 20% para teste. Os modelos foram:
- Naive Bayes: foi utilizado o tipo Gaussiano, uma vez que se trata de uma categorização e a coluna "target" é única e composta por números. O modelo obteve uma acurácia de 59%.
- Árvore de decisão: é um modelo que representa decisões e suas possíveis consequências em forma de uma estrutura de árvore. Esse, obeteve uma acurácia de 72%.
- Random Forest: é um modelo que utiliza da inteligência de várias árvores de decisão independentes que são treinadas em diferentes subconjuntos do conjunto de dados. Esse, também obteve uma acurácia de 72%.
- SVM: é um que busca encontrar um hiperplano de decisão ótimo para separar classes em um espaço dimensional, maximizando a margem entre elas. O modelo, semelhantes aos dois anteriores, também obteve uma acurácia de 72%.

**Observação:** como foi possível perceber, todos os modelos treinados, com excessão do Naive Bayes, obtiveram resultados semelhantes, o que mostra que o modelo ainda não está 100% confiável e pode apresentar viéses.

## Naive Bayes

In [24]:
from sklearn.naive_bayes import GaussianNB

x = df_final.iloc[:, 0:24]
y = df_final['Level 1']

# Dividindo os dados em conjuntos de treinamento e teste
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Treinando modelo Multinomial Naive Bayes
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Prevendo os valores do conjunto de teste
y_pred = nb_classifier.predict(X_test)

# Avaliando o modelo
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Acurácia do modelo: {accuracy}')
print(f'Relatório de classificação:\n{report}')

Acurácia do modelo: 0.5974256546826454
Relatório de classificação:
              precision    recall  f1-score   support

           0       0.01      0.60      0.01         5
           1       0.00      0.00      0.00        82
           2       0.16      0.75      0.27        16
           3       0.03      0.29      0.06        14
           4       0.89      0.87      0.88      1093
           5       0.02      0.11      0.03         9
           6       0.73      0.69      0.71       379
           7       0.00      0.00      0.00        80
           8       0.87      0.25      0.39       435
           9       0.33      0.08      0.13        37
          10       0.00      0.00      0.00       103

    accuracy                           0.60      2253
   macro avg       0.28      0.33      0.23      2253
weighted avg       0.73      0.60      0.63      2253



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
# Testando o modelo com um novo input
new_description = ["Kitchen"]
new_description = [preprocess_text(new_description[0])]
print(new_description)
new_description2 = word2vec(new_description)

predicted_category = nb_classifier.predict(new_description2)
print(f'A categoria prevista para a nova descrição é: {predicted_category[0]}')

## Árvore de decisão

In [25]:
from sklearn.tree import DecisionTreeClassifier

x = df_final.iloc[:, 0:24]
y = df_final['Level 1']

# Dividindo os dados em conjuntos de treinamento e teste
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Iniciando o modelo
dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(X_train, y_train)

# Prevendo os valores do conjunto de teste
y_pred = dt_classifier.predict(X_test)

# Avaliando o modelo
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Acurácia do modelo: {accuracy}')
print(f'Relatório de classificação:\n{report}')

Acurácia do modelo: 0.7288060363959166
Relatório de classificação:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         5
           1       0.00      0.00      0.00        82
           2       0.00      0.00      0.00        16
           3       0.00      0.00      0.00        14
           4       0.89      0.87      0.88      1093
           5       0.00      0.00      0.00         9
           6       0.73      0.69      0.71       379
           7       1.00      0.03      0.05        80
           8       0.52      0.97      0.68       435
           9       0.43      0.08      0.14        37
          10       0.00      0.00      0.00       103

    accuracy                           0.73      2253
   macro avg       0.32      0.24      0.22      2253
weighted avg       0.70      0.73      0.68      2253



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [1]:
# Testando o modelo com um novo input
new_description = ["Temporary Services"]
new_description = [preprocess_text(new_description[0])]
print(new_description)
new_description2 = word2vec(new_description)

predicted_category = dt_classifier.predict(new_description2)
print(f'A categoria prevista para a nova descrição é: {predicted_category[0]}')

NameError: ignored

## Random forest

In [28]:
from sklearn.ensemble import RandomForestClassifier

x = df_final.iloc[:, 0:24]
y = df_final['Level 1']

# Dividindo os dados em conjuntos de treinamento e teste
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Iniciando o modelo
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Prevendo os valores do conjunto de teste
y_pred = rf_classifier.predict(X_test)

# Avaliando o modelo
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Acurácia do modelo: {accuracy}')
print(f'Relatório de classificação:\n{report}')

Acurácia do modelo: 0.7288060363959166
Relatório de classificação:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         5
           1       0.00      0.00      0.00        82
           2       0.00      0.00      0.00        16
           3       0.00      0.00      0.00        14
           4       0.89      0.87      0.88      1093
           5       0.00      0.00      0.00         9
           6       0.73      0.69      0.71       379
           7       1.00      0.03      0.05        80
           8       0.52      0.97      0.68       435
           9       0.43      0.08      0.14        37
          10       0.00      0.00      0.00       103

    accuracy                           0.73      2253
   macro avg       0.32      0.24      0.22      2253
weighted avg       0.70      0.73      0.68      2253



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
# Testando o modelo com um novo input
new_description = ["Temporary Services"]
new_description = [preprocess_text(new_description[0])]
print(new_description)
new_description2 = word2vec(new_description)

predicted_category = dt_classifier.predict(new_description2)
print(f'A categoria prevista para a nova descrição é: {predicted_category[0]}')

## SVM

In [29]:
from sklearn.svm import SVC

x = df_final.iloc[:, 0:24]
y = df_final['Level 1']

# Dividindo os dados em conjuntos de treinamento e teste
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Iniciando o modelo
svm_classifier = SVC(kernel='linear', C=1.0, random_state=42)
svm_classifier.fit(X_train, y_train)

# Prevendo os valores do conjunto de teste
y_pred = svm_classifier.predict(X_test)

# Avaliando o modelo
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Acurácia do modelo: {accuracy}')
print(f'Relatório de classificação:\n{report}')

Acurácia do modelo: 0.7288060363959166
Relatório de classificação:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         5
           1       0.00      0.00      0.00        82
           2       0.00      0.00      0.00        16
           3       0.00      0.00      0.00        14
           4       0.89      0.87      0.88      1093
           5       0.00      0.00      0.00         9
           6       0.73      0.69      0.71       379
           7       1.00      0.03      0.05        80
           8       0.52      0.97      0.68       435
           9       0.43      0.08      0.14        37
          10       0.00      0.00      0.00       103

    accuracy                           0.73      2253
   macro avg       0.32      0.24      0.22      2253
weighted avg       0.70      0.73      0.68      2253



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
# Testando o modelo com um novo input
new_description = ["Temporary Services"]
new_description = [preprocess_text(new_description[0])]
print(new_description)
new_description2 = word2vec(new_description)

predicted_category = dt_classifier.predict(new_description2)
print(f'A categoria prevista para a nova descrição é: {predicted_category[0]}')

 # Parte 2 (classificação niível 2)

In [None]:
categories = ['Energy & Utilities', 'Human Resources', 'Logistics', 'Manufacturing',
              'Professional Services', 'R&D Equipment', 'Real Estate & Facilities',
              'Sales, Marketing & Events', 'Technology/Telecom', 'Travel & Expense']

subcategories_per_category = {}

for category in categories:
    subcategories = dataframe[dataframe['Level 1'] == category]['Level 2'].unique()
    subcategories_per_category[category] = subcategories

subcategories_per_category