## Classifying texts with machine learning

This notebook applies a linear SVM (Support Vector Machine), trained through scikit-learn's `SGDClassifier` estimator, which was already tested on the original database.

It uses the algorithm that performed best on this kind of data to obtain the classifications of the Dislike and Complain type evaluations for a new data set.
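As a rough orientation, the overall flow (TF-IDF features fed to an `SGDClassifier`, then prediction on unseen text) can be sketched on toy data. The texts, the label names `Complain` and `Dislike`, and the use of `make_pipeline` are assumptions for illustration; the notebook itself runs the vectorizer and the classifier as separate steps on `Base.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy evaluations; the real notebook reads them from Base.csv.
train_texts = [
    "the app crashes every time I open it",
    "I was charged twice and nobody answers",
    "I do not like the new layout",
    "the colors of the new version are ugly",
]
train_labels = ["Complain", "Complain", "Dislike", "Dislike"]

# TF-IDF features feed a linear classifier trained with SGD.
model = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=100))
model.fit(train_texts, train_labels)

# Classify a text the model has not seen before.
new_texts = ["the app crashes when I pay"]
predictions = model.predict(new_texts)
print(predictions)
```

With so little data the predicted label is not meaningful; the point is only the order of the steps.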

## Loading the libraries

In [37]:
# Libraries for preprocessing
import pandas as pd
pd.set_option('max_colwidth', 300)
import numpy as np
#!pip install texthero
#import texthero as hero
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Libraries for modeling
# A linear classifier trained with stochastic gradient descent (SGD).
# By default it uses the hinge loss, which gives a linear SVM.
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline

## Reading the database

In [188]:
df = pd.read_csv('Base.csv', sep= ';', encoding="latin-1")
#df.shape

(17177, 34)

In [189]:
df.dropna(subset=['Verb Native'], inplace=True)
#df.shape

In [108]:
df1=df.loc[df['SubClassification']!='#N/D']
#df1.shape

(7814, 34)

In [109]:
# Keep only evaluations longer than 4 characters
df1 = df1[df1['Verb Native'].map(len) > 4]
#df1.shape

(7785, 34)
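The three filtering steps above (drop rows without text, drop rows labeled `#N/D`, keep texts longer than 4 characters) can be checked on a small frame. The rows below are hypothetical; only the column names `Verb Native` and `SubClassification` come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for Base.csv.
toy = pd.DataFrame({
    "Verb Native": ["app trava sempre", np.nan, "ruim",
                    "não gostei do layout", "cobrança indevida"],
    "SubClassification": ["Complain", "Dislike", "Dislike", "#N/D", "Complain"],
})

# 1. Drop rows without text.
toy = toy.dropna(subset=["Verb Native"])
# 2. Drop rows whose label is the '#N/D' placeholder.
toy = toy.loc[toy["SubClassification"] != "#N/D"]
# 3. Keep texts longer than 4 characters ("ruim" has exactly 4 and is dropped).
toy = toy[toy["Verb Native"].str.len() > 4]

print(toy.shape)  # (2, 2)
```

Note that the surviving rows keep their original index labels (0 and 4 here), which matters later when predictions are joined back by index.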

## Separating the columns

In [110]:
# Separate the predictor column from the target
X = df1['Verb Native']
y = df1['SubClassification']

## Word Treatment and Vectorization

At this stage, the text needs to be normalized by converting it to lowercase and removing accents and punctuation.

In [111]:
# Cleaning: lowercase, remove apostrophes, quotation marks and similar characters
#X = hero.clean(X)
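The `hero.clean` call is commented out, so the raw text goes straight to the vectorizer. If a cleaning step is wanted without texthero, a minimal sketch using only the standard library could look like the following. The `clean_text` helper is hypothetical and is not a reimplementation of texthero's pipeline.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Lowercase, strip accents, and remove punctuation from one text."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, underscore, or whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Não gostei, péssimo!"))  # nao gostei pessimo
```

It could be applied with `X = X.map(clean_text)`. `TfidfVectorizer` already lowercases by default and also accepts a `strip_accents` parameter, so part of this normalization can be delegated to the vectorizer instead.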

In [112]:
# Split into training and test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = 0.3, random_state=100, shuffle=True)

In [113]:
# Vectorize the words with TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf_vectorize = vectorizer.fit_transform(Xtrain)
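The vectorizer is fitted on the training texts only; any later text must go through `transform` so that it is mapped onto the same vocabulary. A small sketch with hypothetical texts shows the effect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts standing in for Xtrain and Xtest.
train_texts = ["the app crashes", "the layout is ugly"]
test_texts = ["the app is slow"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training texts.
train_matrix = vectorizer.fit_transform(train_texts)
# transform reuses that vocabulary; unseen words such as "slow" are ignored.
test_matrix = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)  # (2, 6) (1, 6)
```

Both matrices have the same number of columns, which is what allows a classifier trained on one to score the other.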

## Training the classifier

In [174]:
clf = SGDClassifier()
# Fit the classifier
clf.fit(X_train_tfidf_vectorize, ytrain)

SGDClassifier()
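The classifier above is created with default settings. The sketch below, on hypothetical toy data, checks that the default loss is `hinge` (the linear SVM objective) and illustrates that fixing `random_state` makes two fits identical. The value 100 is borrowed from the `train_test_split` call and is only an assumption here, since the notebook does not set a seed on the classifier.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data; the labels are illustrative.
texts = ["the app crashes", "charged twice", "ugly layout", "do not like the colors"]
labels = ["Complain", "Complain", "Dislike", "Dislike"]
features = TfidfVectorizer().fit_transform(texts)

# loss="hinge" is the default and corresponds to a linear SVM.
clf_a = SGDClassifier(loss="hinge", random_state=100).fit(features, labels)
clf_b = SGDClassifier(loss="hinge", random_state=100).fit(features, labels)

print(SGDClassifier().loss)                     # hinge
print(np.allclose(clf_a.coef_, clf_b.coef_))    # True
```

Without a fixed `random_state`, SGD shuffles the training data differently on each run, so repeated fits may give slightly different predictions.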

In [175]:
# Vectorize the test set
X_test_tfidf_vectorize = vectorizer.transform(Xtest)

In [176]:
# Predict on the test set
predicted = clf.predict(X_test_tfidf_vectorize)

In [201]:
# Check the model's performance
#print(metrics.classification_report(ytest.values, predicted))

In [179]:
# Store the result under a new name so the imported function is not overwritten
#cm = confusion_matrix(ytest, predicted)
#print(cm)
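The two evaluation cells are commented out in the notebook. A minimal sketch of what they compute, on hypothetical labels, is shown below. The matrix is stored as `cm` so that the imported `confusion_matrix` function stays usable afterwards.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels standing in for ytest and predicted.
y_true = ["Complain", "Complain", "Dislike", "Dislike", "Dislike"]
y_pred = ["Complain", "Dislike", "Dislike", "Dislike", "Complain"]

labels = ["Complain", "Dislike"]
# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# Per-class precision, recall and F1.
print(classification_report(y_true, y_pred, labels=labels))
```

Here one of the two Complain texts and two of the three Dislike texts are classified correctly, which gives the matrix `[[1, 1], [1, 2]]`.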

In [180]:
predicted = pd.DataFrame(predicted)
predicted = predicted.reset_index(drop=True)
ytest = pd.DataFrame(ytest)
ytest = ytest.reset_index(drop=True)
Xtest = pd.DataFrame(Xtest)
Xtest = Xtest.reset_index(drop=True)
predicted = predicted.rename({0: 'Predito'}, axis = 1)

In [181]:
Base_Teste_Ajustada = Xtest.merge(ytest, right_index=True, left_index=True, how = "outer")

In [182]:
# Join the text data with the predicted classifications
Base_Teste_Ajustada1 = Base_Teste_Ajustada.merge(predicted, right_index=True, left_index=True, how = "outer")

In [183]:
# Export the classified test set
Base_Teste_Ajustada1.to_excel("Base_Teste_Ajustada.xlsx", index=False)
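The table built above with three `reset_index` calls and two merges can also be assembled more directly. The sketch below is one possible alternative on hypothetical data, not the notebook's method: `Xtest` and `ytest` share the index produced by `train_test_split`, so they can be concatenated first and the positional predictions attached afterwards.

```python
import pandas as pd

# Hypothetical test split; the non-contiguous index mimics train_test_split output.
Xtest = pd.Series(["app trava", "layout feio", "cobrança errada"],
                  index=[7, 2, 5], name="Verb Native")
ytest = pd.Series(["Complain", "Dislike", "Complain"],
                  index=[7, 2, 5], name="SubClassification")
# Plain array-like, in the same row order as Xtest, as returned by clf.predict.
predicted = ["Complain", "Dislike", "Dislike"]

# Xtest and ytest share an index, so their rows stay paired; the positional
# predictions are attached after the index is reset.
result = pd.concat([Xtest, ytest], axis=1).reset_index(drop=True)
result["Predito"] = predicted

print(result)
```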

## Reading from a new database

In [191]:
df = pd.read_csv('Base_Nova.csv', sep= ';', encoding="latin-1")
#df.shape

(8165, 33)

In [192]:
df.dropna(subset=['Verb Native'], inplace=True)
#df.shape

(3749, 33)

In [193]:
# Keep only evaluations longer than 4 characters
df1 = df[df['Verb Native'].map(len) > 4]
df1 = df1.reset_index(drop=True)
#df1.shape

(3740, 33)

In [194]:
# Define the column to be classified
X = df1['Verb Native']

## Word Treatment and Vectorization

In [40]:
# Cleaning: lowercase, remove apostrophes, quotation marks and other punctuation
#X_clean = hero.clean(X)

In [195]:
# Vectorize the words
Base_vectorize = vectorizer.transform(X)

## Classifier Predictions

In [196]:
# Predictions from the text classification model
predicted = clf.predict(Base_vectorize)

## Exporting the New Classified Base in xlsx Format

In [197]:
predicted = pd.DataFrame(predicted)
predicted = predicted.reset_index(drop=True)
# Rename the prediction column, which is named 0 by default
predicted = predicted.rename({0: 'Predito'}, axis = 1)

In [198]:
# Reindex the data
X = pd.DataFrame(X)
X = X.reset_index(drop=True)

In [199]:
# Join the text data with the predicted classifications
Base_Final = df1.merge(predicted, right_index=True, left_index=True, how = "outer")

In [200]:
# Export the final data set
Base_Final.to_excel("Base_Escorada.xlsx", index=False)
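The index merge used to build `Base_Final` pairs rows correctly only because `df1` was re-indexed with `reset_index(drop=True)` after filtering. The sketch below, on hypothetical data, shows what an outer merge on the index does when that step is skipped.

```python
import pandas as pd

# Hypothetical filtered frame: rows 1 and 3 were dropped, so the index has gaps.
df1 = pd.DataFrame({"Verb Native": ["texto a", "texto b", "texto c"]},
                   index=[0, 2, 4])
# Predictions come back positionally, so their index is 0, 1, 2.
predicted = pd.DataFrame({"Predito": ["Complain", "Dislike", "Complain"]})

# Without resetting, the outer merge aligns on index labels and misplaces predictions.
misaligned = df1.merge(predicted, left_index=True, right_index=True, how="outer")
# After resetting, both frames are indexed 0..n-1 and rows pair up by position.
aligned = df1.reset_index(drop=True).merge(
    predicted, left_index=True, right_index=True, how="outer"
)

print(len(misaligned), misaligned.isna().sum().sum())  # 4 2
print(len(aligned), aligned.isna().sum().sum())        # 3 0
```

A quick check such as `len(Base_Final) == len(df1)` before exporting would catch this kind of misalignment.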