### Module 13 Project - Sentiment analysis on Twitter

https://www.kaggle.com/code/akshayarajasekaran/cd-sentiment-analysis-of-tweets/notebook

The dataset contains 1,600,000 tweets extracted with the Twitter API. Each tweet is labeled by polarity (0: negative, 4: positive), so the data can be used to detect sentiment.
It contains 6 fields:
- sentiment : polarity of the tweet (0 = negative, 4 = positive)
- ids
- date : date in the format (Sat May 16 23:58:44 UTC 2009)
- flag : the query
- user : user who posted the tweet
- text : text of the tweet

Only the text field needs to be analyzed.

----

### Pre-analysis and dataset setup

In [1]:
import pandas as pd
import numpy as np

In [2]:
path = "../../twitter_proj/training.1600000.processed.noemoticon.csv"
dataset = pd.read_csv(path, names = ["label", "ids", "date", "flag", "user", "tweet"], encoding = "ISO-8859-1")

In [3]:
# Peek at the dataset contents
dataset.iloc[150:160]

Unnamed: 0,label,ids,date,flag,user,tweet
150,0,1467844157,Mon Apr 06 22:28:32 PDT 2009,NO_QUERY,AKyarnie,@onemoreproject that is lame
151,0,1467844505,Mon Apr 06 22:28:38 PDT 2009,NO_QUERY,luimoral85,I don't understand... I really don't
152,0,1467844540,Mon Apr 06 22:28:38 PDT 2009,NO_QUERY,ceironous,HEROES just isn't doing it for me this season...
153,0,1467844907,Mon Apr 06 22:28:45 PDT 2009,NO_QUERY,brandonmcb,Living not downtown sure isn't much fun.
154,0,1467845095,Mon Apr 06 22:28:48 PDT 2009,NO_QUERY,mannyrique,@jonathanchard Not calorie wise I wish junk ...
155,0,1467845157,Mon Apr 06 22:28:51 PDT 2009,NO_QUERY,styletrain,Man Work is Hard
156,0,1467852031,Mon Apr 06 22:30:34 PDT 2009,NO_QUERY,kscud,"getting sick time for some hot tea, studying,..."
157,0,1467852067,Mon Apr 06 22:30:34 PDT 2009,NO_QUERY,kirstenj0y,Getting eyebrows waxed. More pain
158,0,1467852789,Mon Apr 06 22:30:45 PDT 2009,NO_QUERY,marculus,No phantasy star yesterday going to work...
159,0,1467853135,Mon Apr 06 22:30:50 PDT 2009,NO_QUERY,andrewofthediaz,Oh - Just got all my MacHeist 3.0 apps - sweet...


In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   label   1600000 non-null  int64 
 1   ids     1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   tweet   1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [5]:
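# Replace 'label' 4 (positive) with 1
# Assigning back avoids an in-place replace on a column selection, which may not update the DataFrame
dataset["label"] = dataset["label"].replace(4, 1)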

In [6]:
# Check the label balance
dataset["label"].value_counts(normalize = True)

0    0.5
1    0.5
Name: label, dtype: float64

The classes are balanced, so **accuracy** is a suitable metric for evaluating the classifier.
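To make the point about balance concrete, here is a minimal sketch (plain Python, with made-up label lists rather than the real dataset) of the accuracy obtained by always predicting the most frequent class. The 95/5 split is a hypothetical contrast case, not something present in this data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

balanced = [0] * 500 + [1] * 500      # 50/50 split, like this dataset
imbalanced = [0] * 950 + [1] * 50     # hypothetical 95/5 split, for contrast

print(majority_baseline_accuracy(balanced))    # 0.5
print(majority_baseline_accuracy(imbalanced))  # 0.95
```

With a 50/50 split a trivial classifier only reaches 0.5, so any accuracy well above that reflects real signal; with a skewed split, high accuracy alone would say little.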

We can drop ids, date (we do not need to know when the tweet was posted), flag (it has a single value) and user (we do not need to know who posted it). The dataset also has no null values.

In [7]:
dataset.drop(["ids","date","flag","user"], axis = 1, inplace = True)

In [8]:
dataset.head()

Unnamed: 0,label,tweet
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


----

### Dataset preprocessing

In [9]:
# Stopwords
stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
             'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
             'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
             'does', 'doing', 'down', 'during', 'each','few', 'for', 'from', 
             'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
             'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
             'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
             'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
             'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're',
             's', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
             't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 
             'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
             'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
             'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
             "youve", 'your', 'yours', 'yourself', 'yourselves']

In [10]:
# Preprocessing

# Libraries
import pandas as pd
from nltk.tokenize import word_tokenize  # tokenization
from nltk.stem import PorterStemmer, WordNetLemmatizer  # stemming and lemmatization
from nltk.corpus import stopwords  # stopwords
import nltk  # to download 'stopwords' and 'punkt' (tokenizer models)
import re  # regex

nltk.download('stopwords')
nltk.download('punkt')
# WordNetLemmatizer also needs the 'wordnet' corpus: nltk.download('wordnet')
# NLTK's English stopword set (unused alternative to the custom stopwordlist above)
#sw_english = set(stopwords.words('english'))

# Instantiate the PorterStemmer and the WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocessing(string, method = 'stem'):
    """
    Preprocess an English tweet: mask URLs and user mentions, strip special
    characters, lowercase, tokenize, then stem or lemmatize each token.
    """
    # Replace every URL with 'URL'
    string = re.sub(r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)",' URL',string)
    # Replace every @USERNAME with 'USER'
    string = re.sub(r'@[^\s]+',' USER', string)
    # Keep only letters and digits (no special characters)
    string = re.sub(r"[^a-zA-Z0-9]+", " ", string)
    # Lowercase
    string = string.lower()
    # Tokenization
    words = word_tokenize(string)

    # Stopwords are deliberately kept: accuracy was higher without dropping them.
    # To drop them: words = [word for word in words if word not in stopwordlist]

    if method == 'stem':
        return [stemmer.stem(word) for word in words]
    if method == 'lemma':
        return [lemmatizer.lemmatize(word) for word in words]
    raise ValueError("method must be 'stem' or 'lemma'")

# Use dataframe['new_col'] = dataframe['text_col'].apply(lambda x: preprocessing(x, "stem")) to create a new column in the dataframe

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gabri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gabri\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
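
As a quick illustration of the cleaning steps in `preprocessing`, the sketch below applies the same three regular expressions to a shortened version of the first tweet in the dataset. It is a simplified stand-in: the helper name `clean_tweet` is hypothetical, `str.split()` replaces `word_tokenize`, and no stemming or lemmatization is applied, so it runs without NLTK.

```python
import re

def clean_tweet(text):
    """Mask URLs and mentions, strip special characters, lowercase, split."""
    text = re.sub(r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)", " URL", text)
    text = re.sub(r"@[^\s]+", " USER", text)
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)
    # Assumption: whitespace splitting stands in for nltk's word_tokenize
    return text.lower().split()

tokens = clean_tweet("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.")
print(tokens)
# ['user', 'url', 'awww', 'that', 's', 'a', 'bummer']
```

Note how the apostrophe in "that's" is treated as a special character, which is why stray tokens such as "s" and "t" show up in the processed tweets.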


In [11]:
# Create new columns in the dataset for ML
dataset['filtered_words'] = dataset["tweet"].apply(lambda x: preprocessing(x, "lemma"))
dataset['joined_words'] = dataset["filtered_words"].apply(lambda x: " ".join(x))

In [12]:
dataset.head()

Unnamed: 0,label,tweet,filtered_words,joined_words
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...","[user, url, awww, that, s, a, bummer, you, sho...",user url awww that s a bummer you shoulda got ...
1,0,is upset that he can't update his Facebook by ...,"[is, upset, that, he, can, t, update, his, fac...",is upset that he can t update his facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...,"[user, i, dived, many, time, for, the, ball, m...",user i dived many time for the ball managed to...
3,0,my whole body feels itchy and like its on fire,"[my, whole, body, feel, itchy, and, like, it, ...",my whole body feel itchy and like it on fire
4,0,"@nationwideclass no, it's not behaving at all....","[user, no, it, s, not, behaving, at, all, i, m...",user no it s not behaving at all i m mad why a...


### Save the preprocessed dataset
path = "../../twitter_proj/dataset_twitter.csv"
dataset.to_csv(path)

----

### Most frequent words

In [None]:
import matplotlib.pyplot as plt
from nltk.probability import FreqDist

In [None]:
# Flatten the per-tweet token lists into one list (much faster than row-by-row iloc on 1.6M rows)
twitter_words = [word for words in dataset["filtered_words"] for word in words]

In [None]:
fdist = FreqDist(twitter_words)
fdist.plot(10, title = "Most common words on Twitter")
plt.show()

These are relatively neutral words.

### Analyzing the most common words in tweets with positive and negative sentiment

In [None]:
# Tweets with positive sentiment (1)
twitter_words_pos = dataset[dataset["label"] == 1]
# Tweets with negative sentiment (0)
twitter_words_neg = dataset[dataset["label"] == 0]

In [None]:
words_tweet_1 = [word for words in twitter_words_pos["filtered_words"] for word in words]

In [None]:
fdist = FreqDist(words_tweet_1)
fdist.plot(10, title = "Most common words in tweets with positive sentiment")
plt.show()

In [None]:
words_tweet_0 = [word for words in twitter_words_neg["filtered_words"] for word in words]

In [None]:
fdist = FreqDist(words_tweet_0)
fdist.plot(10, title = "Most common words in tweets with negative sentiment")
plt.show()

- Words associated with positive sentiment: "good", "love", "like", "lol", "thanks".
- Words associated with negative sentiment: "get", "work", "want", "going", "back", "miss".
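
One way to separate class-specific words from the neutral ones that dominate both plots is to count tokens per label and subtract the counts. The sketch below uses `collections.Counter` on a few hypothetical toy rows; it is not the `FreqDist` workflow used above, only an illustration of the idea.

```python
from collections import Counter

# Hypothetical toy rows: (label, tokens); the real data has 1.6M rows
rows = [
    (1, ["i", "love", "this", "good", "day"]),
    (1, ["thanks", "good", "friend"]),
    (0, ["i", "miss", "you"]),
    (0, ["back", "to", "work", "i", "miss", "home"]),
]

# One token counter per label
counts = {0: Counter(), 1: Counter()}
for label, tokens in rows:
    counts[label].update(tokens)

# Counter subtraction keeps only words that are more frequent in positive tweets
distinct_pos = counts[1] - counts[0]

print(counts[1].most_common(1))     # [('good', 2)]
print(distinct_pos.most_common(1))  # [('good', 2)]
print("i" in distinct_pos)          # False
```

Shared words such as "i" drop out of the difference, leaving the words that lean toward one class.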

---

#### Preprocessed dataset, reduced to a 30% sample

In [2]:
# Preprocessed dataset
path = "../../twitter_proj/dataset_twitter.csv"
dataset = pd.read_csv(path)

In [3]:
# The dataset is very large and running GridSearch on all of it is impractical, so a 30% sample is taken. Samples of 10%, 20% and 30% were tried, with no significant differences
twitter_sample = dataset.sample(frac = 0.3, replace=False)
twitter_sample.drop(columns = ["Unnamed: 0"], inplace = True)

In [4]:
twitter_sample.head()

Unnamed: 0,label,tweet,filtered_words,joined_words
138709,0,@mjfh81 still at least you only have today to ...,"['user', 'still', 'at', 'least', 'you', 'only'...",user still at least you only have today to go ...
1000243,1,@HouseofSpain thank you,"['user', 'thank', 'you']",user thank you
165721,0,is very upset to see carys and donna go today ...,"['is', 'very', 'upset', 'to', 'see', 'carys', ...",is very upset to see carys and donna go today ...
1290869,1,aw it's six months today &lt;3aab,"['aw', 'it', 's', 'six', 'month', 'today', 'lt...",aw it s six month today lt 3aab
461578,0,@sgtmongoose samus has bio armor and a plasma ...,"['user', 'samus', 'ha', 'bio', 'armor', 'and',...",user samus ha bio armor and a plasma cannon so...


In [5]:
# Check the label balance
twitter_sample["label"].value_counts(normalize = True)

1    0.500958
0    0.499042
Name: label, dtype: float64
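
The sample stays close to 50/50 because simple random sampling from a balanced population preserves the class proportions on average. The sketch below shows this with the standard library on hypothetical labels, using a fixed seed so the draw is reproducible; in pandas the equivalent would be passing `random_state` to `DataFrame.sample`.

```python
import random

rng = random.Random(42)                       # fixed seed, so the sample is reproducible
population = [0] * 50_000 + [1] * 50_000      # hypothetical balanced labels
sample_size = len(population) * 3 // 10       # 30% of the population
sample = rng.sample(population, k=sample_size)

positive_share = sum(sample) / len(sample)
print(len(sample), round(positive_share, 3))
```

Fixing the seed also means every model below would be trained and tested on the same rows, which makes their scores directly comparable.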

In [6]:
twitter_sample.iloc[1,3]

'user thank you'

In [7]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 5 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   Unnamed: 0      1600000 non-null  int64 
 1   label           1600000 non-null  int64 
 2   tweet           1600000 non-null  object
 3   filtered_words  1600000 non-null  object
 4   joined_words    1600000 non-null  object
dtypes: int64(2), object(3)
memory usage: 61.0+ MB


## Modelos de Machine Learning

In [10]:
# Libraries and models
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Processing
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD

# Models used below
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

### Baseline with LogisticRegression (no hyperparameter optimization)

In [20]:
# Split into training and test sets
from sklearn.model_selection import train_test_split

X = dataset["joined_words"]
y = dataset["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [21]:
# Vectorization and ML model
text_model_lr = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression())
])

text_model_lr

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('model', LogisticRegression())])
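
For intuition about what the `tfidf` step feeds the model, here is a minimal pure-Python sketch of TF-IDF weighting with smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is the formula scikit-learn documents for its default settings. The function name `tfidf` and the three documents are illustrative. One known difference: this sketch splits on whitespace, whereas `TfidfVectorizer`'s default token pattern keeps only tokens of two or more characters, so single letters such as "i" or "s" would be ignored there.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 normalization; returns one dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

    vectors = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["user thank you", "user i miss you", "i love this"]
vectors = tfidf(docs)
for vec in vectors:
    print({term: round(w, 3) for term, w in vec.items()})
```

Terms that appear in fewer documents ("thank") end up with a larger weight than terms shared across documents ("user"), which is what lets a linear model focus on the discriminative words.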

In [22]:
text_model_lr.fit(X_train, y_train)
predictions_lr = text_model_lr.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [23]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, roc_auc_score, f1_score

print(confusion_matrix(y_test, predictions_lr))
print()
print(classification_report(y_test, predictions_lr))
print()
print(f"Accuracy: {accuracy_score(y_test, predictions_lr)}")
print(f"ROC AUC score: {roc_auc_score(y_test, predictions_lr)}")
print(f"F1: {f1_score(y_test, predictions_lr)}")

[[188637  50724]
 [ 46116 194523]]

              precision    recall  f1-score   support

           0       0.80      0.79      0.80    239361
           1       0.79      0.81      0.80    240639

    accuracy                           0.80    480000
   macro avg       0.80      0.80      0.80    480000
weighted avg       0.80      0.80      0.80    480000


Accuracy: 0.79825
ROC AUC score: 0.7982230096218319
F1: 0.8006939899482595


Using the full dataset, the baseline reaches an accuracy of about 80% (0.798).
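
The reported scores can be re-derived from the confusion matrix alone, which is a useful sanity check. The sketch below does the arithmetic in plain Python with the four counts printed above. It also shows that the ROC AUC value printed in this run coincides with the balanced accuracy (the mean of recall and specificity): that is what `roc_auc_score` reduces to when it receives hard 0/1 predictions instead of scores. Passing `predict_proba(X_test)[:, 1]` would give a threshold-independent AUC; that is a suggestion, not something run in this notebook.

```python
# Confusion matrix from the baseline run: rows = true label, columns = predicted label
tn, fp = 188637, 50724
fn, tp = 46116, 194523

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)             # true positive rate
specificity = tn / (tn + fp)        # true negative rate
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy          = {accuracy:.5f}")           # 0.79825
print(f"f1                = {f1:.5f}")                 # 0.80069
print(f"balanced accuracy = {balanced_accuracy:.5f}")  # 0.79822
```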

### LogisticRegression with hyperparameter tuning

In [8]:
# Split into training and test sets
from sklearn.model_selection import train_test_split

X = twitter_sample["joined_words"]
y = twitter_sample["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [11]:
# Vectorization and ML model
text_model_lrh = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression())
])

# param_grid
param_grid={'model__penalty': ['l1', 'l2'],
            'model__solver': ['liblinear']}

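# GridSearch
# With 0/1 labels, neg_mean_absolute_error equals accuracy - 1, so it ranks candidates exactly as 'accuracy' would
fold = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 42)
grid_lr = GridSearchCV(text_model_lrh, param_grid = param_grid, cv = fold, scoring = 'neg_mean_absolute_error', return_train_score = True)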

In [12]:
grid_lr.fit(X_train, y_train)
predictions = grid_lr.predict(X_test)

In [13]:
# Best parameters
print(grid_lr.best_params_)

{'model__penalty': 'l2', 'model__solver': 'liblinear'}
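
The grid search above is scored with `neg_mean_absolute_error`, which looks like a regression metric. For binary labels encoded as 0 and 1 it is harmless: every misclassification contributes an absolute error of exactly 1, so the mean absolute error is the error rate and the negated value is accuracy minus one. The small sketch below checks this on hypothetical labels and predictions, with both metrics written out by hand.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 1, 1, 0, 1, 0, 1, 1]   # hypothetical labels
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]   # hypothetical predictions, 2 errors out of 8

acc = accuracy(y_true, y_pred)
neg_mae = -mean_absolute_error(y_true, y_pred)
print(acc, neg_mae)   # 0.75 -0.25
```

Since the two scores differ only by a constant, the selected parameters are the same; `scoring = 'accuracy'` would simply make the cross-validation results easier to read.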


In [14]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, roc_auc_score

print(confusion_matrix(y_test, predictions))
print()
print(classification_report(y_test, predictions))
print()
print(f"Accuracy: {accuracy_score(y_test, predictions)}")

[[56153 15398]
 [14276 58173]]

              precision    recall  f1-score   support

           0       0.80      0.78      0.79     71551
           1       0.79      0.80      0.80     72449

    accuracy                           0.79    144000
   macro avg       0.79      0.79      0.79    144000
weighted avg       0.79      0.79      0.79    144000


Accuracy: 0.7939305555555556


Hyperparameter tuning did not change much. With 30% of the data, accuracy is 79.39%.

### RandomForest

In [30]:
# RandomForest model
text_model_rf = Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('model', RandomForestClassifier())
])

text_model_rf

Pipeline(steps=[('tfidf', TfidfVectorizer()),
                ('model', RandomForestClassifier())])

In [None]:
text_model_rf.fit(X_train, y_train)
predictions = text_model_rf.predict(X_test)

In [None]:
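# Model parameters (a plain Pipeline has no best_params_; that attribute only exists on GridSearchCV)
print(text_model_rf.named_steps["model"].get_params())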

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, roc_auc_score

print(confusion_matrix(y_test, predictions))
print()
print(classification_report(y_test, predictions))
print()
print(f"Accuracy: {accuracy_score(y_test, predictions)}")

Accuracy was lower than with LogisticRegression.

### RandomForest with hyperparameter tuning

In [26]:
# RandomForest model
text_model_rfh = Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('model', RandomForestClassifier())
])

# Param_grid
param_grid={'model__max_depth': [10, 15, 20],
            'model__criterion': ['entropy', 'gini']}

# GridSearch
fold = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 42)
grid_rfh = GridSearchCV(text_model_rfh, param_grid = param_grid, cv = fold, scoring = 'neg_mean_absolute_error', return_train_score = True)

In [27]:
grid_rfh.fit(X_train, y_train)
predictions = grid_rfh.predict(X_test)

In [28]:
# Best parameters
print(grid_rfh.best_params_)

{'model__criterion': 'entropy', 'model__max_depth': 20}
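
To get a feel for the cost of this search, the sketch below enumerates the candidates the same way a grid search does, as the Cartesian product of the parameter lists, and counts the model fits. The `+ 1` assumes the default behaviour of refitting the best candidate on the whole training set; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "model__max_depth": [10, 15, 20],
    "model__criterion": ["entropy", "gini"],
}
n_splits = 3  # folds used by StratifiedKFold above

# Every combination of one value per parameter
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_fits = len(candidates) * n_splits + 1   # +1 for the final refit on all training data
print(len(candidates), n_fits)            # 6 19
```

Each of those fits trains a full random forest on TF-IDF features, which is why the search was run on a 30% sample rather than on all 1.6 million tweets.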


In [29]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, roc_auc_score

print(confusion_matrix(y_test, predictions))
print()
print(classification_report(y_test, predictions))
print()
print(f"Accuracy: {accuracy_score(y_test, predictions)}")

[[50676 21317]
 [16219 55788]]

              precision    recall  f1-score   support

           0       0.76      0.70      0.73     71993
           1       0.72      0.77      0.75     72007

    accuracy                           0.74    144000
   macro avg       0.74      0.74      0.74    144000
weighted avg       0.74      0.74      0.74    144000


Accuracy: 0.7393333333333333


### RandomForest with TruncatedSVD (PCA-like dimensionality reduction)

In [23]:
# Model with SVD
text_model_rfsvd = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ("sc", StandardScaler(with_mean=False)),
    ("svd", TruncatedSVD(n_components = 50, random_state = 42)),
    ('rf', RandomForestClassifier())
])

text_model_rfsvd

Pipeline(steps=[('tfidf', TfidfVectorizer()),
                ('sc', StandardScaler(with_mean=False)),
                ('svd', TruncatedSVD(n_components=50, random_state=42)),
                ('rf', RandomForestClassifier())])

In [24]:
text_model_rfsvd.fit(X_train, y_train)
predictions = text_model_rfsvd.predict(X_test)

In [25]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, roc_auc_score

print(confusion_matrix(y_test, predictions))
print()
print(classification_report(y_test, predictions))
print()
print(f"Accuracy: {accuracy_score(y_test, predictions)}")

[[46482 25511]
 [29712 42295]]

              precision    recall  f1-score   support

           0       0.61      0.65      0.63     71993
           1       0.62      0.59      0.61     72007

    accuracy                           0.62    144000
   macro avg       0.62      0.62      0.62    144000
weighted avg       0.62      0.62      0.62    144000


Accuracy: 0.6165069444444444


:(

More SVD components are needed.

### Linear SVC

In [10]:
text_model_lsvc = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LinearSVC())
])

text_model_lsvc

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('model', LinearSVC())])

In [11]:
text_model_lsvc.fit(X_train, y_train)
predictions = text_model_lsvc.predict(X_test)

In [12]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, roc_auc_score

print(confusion_matrix(y_test, predictions))
print()
print(classification_report(y_test, predictions))
print()
print(f"Accuracy: {accuracy_score(y_test, predictions)}")

[[55692 16301]
 [14907 57100]]

              precision    recall  f1-score   support

           0       0.79      0.77      0.78     71993
           1       0.78      0.79      0.79     72007

    accuracy                           0.78    144000
   macro avg       0.78      0.78      0.78    144000
weighted avg       0.78      0.78      0.78    144000


Accuracy: 0.7832777777777777


### Linear SVC with tuning

In [14]:
text_model_svc = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LinearSVC())
])

param_grid={
            'model__C': [1.0, 10.0, 100.0]}

# GridSearch
fold = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 42)
grid_svc = GridSearchCV(text_model_svc, param_grid = param_grid, cv = fold, scoring = 'neg_mean_absolute_error', return_train_score = True)

In [15]:
grid_svc.fit(X_train, y_train)
predictions = grid_svc.predict(X_test)



In [16]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, roc_auc_score

print(confusion_matrix(y_test, predictions))
print()
print(classification_report(y_test, predictions))
print()
print(f"Accuracy: {accuracy_score(y_test, predictions)}")

[[55692 16301]
 [14907 57100]]

              precision    recall  f1-score   support

           0       0.79      0.77      0.78     71993
           1       0.78      0.79      0.79     72007

    accuracy                           0.78    144000
   macro avg       0.78      0.78      0.78    144000
weighted avg       0.78      0.78      0.78    144000


Accuracy: 0.7832777777777777


### XGBoost

In [17]:
text_model_xgb = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", XGBClassifier(learning_rate = 0.1, eval_metric = "logloss"))
])

text_model_xgb

Pipeline(steps=[('tfidf', TfidfVectorizer()),
                ('model',
                 XGBClassifier(base_score=None, booster=None,
                               colsample_bylevel=None, colsample_bynode=None,
                               colsample_bytree=None, eval_metric='logloss',
                               gamma=None, gpu_id=None, importance_type='gain',
                               interaction_constraints=None, learning_rate=0.1,
                               max_delta_step=None, max_depth=None,
                               min_child_weight=None, missing=nan,
                               monotone_constraints=None, n_estimators=100,
                               n_jobs=None, num_parallel_tree=None,
                               random_state=None, reg_alpha=None,
                               reg_lambda=None, scale_pos_weight=None,
                               subsample=None, tree_method=None,
                               validate_parameters=None, verbosity=Non

In [18]:
text_model_xgb.fit(X_train, y_train)
predictions = text_model_xgb.predict(X_test)



In [19]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, roc_auc_score

print(confusion_matrix(y_test, predictions))
print()
print(classification_report(y_test, predictions))
print()
print(f"Accuracy: {accuracy_score(y_test, predictions)}")

[[53084 18909]
 [20764 51243]]

              precision    recall  f1-score   support

           0       0.72      0.74      0.73     71993
           1       0.73      0.71      0.72     72007

    accuracy                           0.72    144000
   macro avg       0.72      0.72      0.72    144000
weighted avg       0.72      0.72      0.72    144000


Accuracy: 0.7244930555555555


It got worse!

----

#### LogisticRegression with GridSearch had the highest accuracy on the 30% sample: 79.39%
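
For a side-by-side view, the sketch below collects the test accuracies printed in the runs above and ranks them. The labels are just descriptive strings. One caveat visible in the reports: the support counts for the LogisticRegression grid search (71551/72449) differ from those of the later models (71993/72007), so the runs did not all use the same random 30% sample, and small gaps between models should be read with that in mind.

```python
# Test accuracies copied from the outputs above
results = {
    "LogisticRegression baseline (full data)": 0.79825,
    "LogisticRegression + GridSearch (30% sample)": 0.7939305555555556,
    "LinearSVC (30% sample)": 0.7832777777777777,
    "LinearSVC + GridSearch (30% sample)": 0.7832777777777777,
    "RandomForest + GridSearch (30% sample)": 0.7393333333333333,
    "XGBoost (30% sample)": 0.7244930555555555,
    "RandomForest + TruncatedSVD (30% sample)": 0.6165069444444444,
}

# Rank all runs from best to worst
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{acc:.4f}  {name}")

# Best model among the runs that used the 30% sample
sample_only = {name: acc for name, acc in results.items() if "30% sample" in name}
best = max(sample_only, key=sample_only.get)
print(best)   # LogisticRegression + GridSearch (30% sample)
```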

----