# Analisando os dados dos sites da globo usando bag of words

* No notebook anterior, foram removidos as stop words e palavras que enviesavam o modelo

Agora, o objetivo é aplicar algoritmos para generalizar o modelo.


In [28]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

df = pd.read_csv('C:\\Users\\samsung\\Desktop\\df_sites_globo2_stopwords.csv')

In [29]:
df.columns

Index(['url', 'titulo', 'conteudo', 'url_origem', 'texto_processado'], dtype='object')

O método to_csv converte strings sem valor ('') em np.nan, verificar se apareceu alguma

In [30]:
df[df['texto_processado'].isnull()].index.tolist()

[587]

In [31]:
df.conteudo[587] # estava lá desde o início :( Melhor remover

'   '

In [32]:
df.drop([587], axis=0,inplace = True)

In [33]:
df.shape

(1499, 5)

## Testando alguns modelos de ML

In [34]:
from sklearn.feature_extraction.text import CountVectorizer
vetorizador = CountVectorizer(max_features = 50)
bag_of_words = vetorizador.fit_transform(df['texto_processado'])

In [35]:
from sklearn.linear_model import LogisticRegression
reg_log = LogisticRegression()

from sklearn import svm
svm_default = svm.SVC()

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
random_forest = RandomForestClassifier()

In [36]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(bag_of_words, df.url_origem, test_size=0.33)

In [37]:
reg_log.fit(x_train,y_train)
print(reg_log.score(x_test,y_test))

svm_default.fit(x_train,y_train)
print(svm_default.score(x_test,y_test))

random_forest.fit(x_train,y_train)
print(random_forest.score(x_test,y_test))

0.7757575757575758
0.7494949494949495
0.7333333333333333


Poderíamos otimizar os modelos nessa abordagem. No entanto, temos poucos dados e nosso modelo não será muito confiável<br>
Portanto, vamos usar uma abordagem usando TFIDF e validação cruzada

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=50)
modelo_tfidf = tfidf.fit_transform(df.texto_processado)

In [52]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(reg_log, modelo_tfidf, df.url_origem, cv=10)
print("Logistic regression reached %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

Logistic regression reached 0.75 accuracy with a standard deviation of 0.02


In [53]:
scores = cross_val_score(svm_default, modelo_tfidf, df.url_origem, cv=10)
print("SVM reached %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

SVM reached 0.72 accuracy with a standard deviation of 0.03


In [54]:
scores = cross_val_score(svm_default, modelo_tfidf, df.url_origem, cv=10)
print("random_forest reached %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

random_forest reached 0.72 accuracy with a standard deviation of 0.03


Repetindo para 100 palavras

In [61]:
tfidf_100 = TfidfVectorizer(max_features=100)
modelo_tfidf_100 = tfidf_100.fit_transform(df.texto_processado)

scores = cross_val_score(reg_log, modelo_tfidf_100, df.url_origem, cv=10)
print("Logistic regression reached %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

scores = cross_val_score(svm_default, modelo_tfidf_100, df.url_origem, cv=10)
print("SVM reached %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

scores = cross_val_score(random_forest, modelo_tfidf_100, df.url_origem, cv=10)
print("Random forest reached %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

Logistic regression reached 0.82 accuracy with a standard deviation of 0.03
SVM reached 0.78 accuracy with a standard deviation of 0.03
Random forest reached 0.81 accuracy with a standard deviation of 0.03


# Refinamento dos hiperparâmetros

Em todos os testes, a regressão logistica esteve melhor que os demais. Além disso, é bem mais performática para se treinar. <br>Portanto, vamos refinar os hiperparâmetros apenas desse modelo

In [66]:
liblinear = LogisticRegression(solver = 'liblinear')
newtoncg = LogisticRegression(solver = 'newton-cg')
sag = LogisticRegression(solver = 'sag')
saga = LogisticRegression(solver = 'saga')

In [67]:
scores = cross_val_score(liblinear, modelo_tfidf_100, df.url_origem, cv=10)
print(scores.mean(), scores.std())

scores = cross_val_score(newtoncg, modelo_tfidf_100, df.url_origem, cv=10)
print(scores.mean(), scores.std())

scores = cross_val_score(sag, modelo_tfidf_100, df.url_origem, cv=10)
print(scores.mean(), scores.std())

scores = cross_val_score(saga, modelo_tfidf_100, df.url_origem, cv=10)
print(scores.mean(), scores.std())

0.8185413870246085 0.02511594920391652
0.8192080536912751 0.02540953778806545
0.8192080536912751 0.02540953778806545
0.8185413870246085 0.02511594920391652


In [68]:
liblinear = LogisticRegression(solver = 'liblinear',max_iter = 500)
newtoncg = LogisticRegression(solver = 'newton-cg',max_iter = 500)
sag = LogisticRegression(solver = 'sag',max_iter = 500)
saga = LogisticRegression(solver = 'saga',max_iter = 500)

In [69]:
scores = cross_val_score(liblinear, modelo_tfidf_100, df.url_origem, cv=10)
print(scores.mean(), scores.std())

scores = cross_val_score(newtoncg, modelo_tfidf_100, df.url_origem, cv=10)
print(scores.mean(), scores.std())

scores = cross_val_score(sag, modelo_tfidf_100, df.url_origem, cv=10)
print(scores.mean(), scores.std())

scores = cross_val_score(saga, modelo_tfidf_100, df.url_origem, cv=10)
print(scores.mean(), scores.std())

0.8185413870246085 0.02511594920391652
0.8192080536912751 0.02540953778806545
0.8192080536912751 0.02540953778806545
0.8185413870246085 0.02511594920391652


In [65]:
reg_log = LogisticRegression(max_iter = 500)
scores = cross_val_score(reg_log, modelo_tfidf_100, df.url_origem, cv=10)
print("Logistic regression reached %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

Logistic regression reached 0.82 accuracy with a standard deviation of 0.03


Mudar os parâmetros da regressão não tiveram efeitos significativos, portanto vamos ficar com o modelo padrão

Resultado final: 82% de acc com sd de 0.03 :)