# Text and Web Mining - Project: Building a Review Monitoring System

### Students:
- **Laianna Lana Virginio da Silva** - *llvs2@cin.ufpe.br*
- **Liviany Reis Rodrigues** - *lrr@cin.ufpe.br*

# Project Information

### GitHub Link:
- https://github.com/Laianna/projeto-mineracao-texto-web

### Product to Be Monitored:
- Smartwatch Xiaomi Mi Band 4 Oled Preto

### Data Source:
- https://www.amazon.com.br/Smartwatch-Xiaomi-Preto-Original-Lacrado/dp/B07SNG23JW/ref=cm_cr_arp_d_product_top?ie=UTF8

# Libraries

In [1]:
#####################################################################

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from scipy.spatial import distance

#####################################################################

In [2]:
SEED = 42

# Dataset

## Loading the Data

In [3]:
df = pd.read_csv("Dados/avaliações.csv", parse_dates = ["Data"])
df.head(1)

| | Review | Estrela | Data |
|---|---|---|---|
| 0 | No anúncio informa que a pulseira é a versão g... | 1 | 2019-07-17 |


## Formatting the Types

In [4]:
df.dtypes

Review             object
Estrela             int64
Data       datetime64[ns]
dtype: object

In [5]:
df["Review"] = df["Review"].astype("string")

In [6]:
df.dtypes

Review             string
Estrela             int64
Data       datetime64[ns]
dtype: object

# Class Preprocessing

**Recoding the "Estrela" (star rating) column into *Negative* and *Positive*:**

- **Negative:** 1 ★ 2 ★ 3 ★

- **Positive:** 4 ★ 5 ★

In [7]:
def recodificar_classe(estrela):

    if estrela == 4 or estrela == 5:
        return 1 # positive
    else:
        return 0 # negative

In [8]:
df["classe"] = df["Estrela"].apply(recodificar_classe)
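The same recoding can also be written without a helper function. The sketch below is only an illustration on a made-up `estrelas` series (a hypothetical stand-in for `df["Estrela"]`), and it assumes the ratings are integers from 1 to 5.

```python
import pandas as pd

# Hypothetical sample of star ratings; stands in for df["Estrela"]
estrelas = pd.Series([1, 2, 3, 4, 5], name="Estrela")

# Ratings of 4 or 5 become 1 (positive); everything else becomes 0 (negative)
classe = (estrelas >= 4).astype(int)

print(classe.tolist())
```

The comparison runs on the whole column at once, which avoids calling a Python function per row.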

# Most Representative Reviews

## Formatting the Input Data

We use CountVectorizer, a bag-of-words (BoW) model, to obtain a numerical representation of the reviews.

In [9]:
def formatar_entrada_rf_bow(dados, mf = 1000):

    # Bag-of-words counts restricted to the mf most frequent terms
    matriz = CountVectorizer(max_features = mf)
    X = matriz.fit_transform(dados).toarray()

    return X

In [10]:
mf = 50

In [11]:
X = formatar_entrada_rf_bow(df["Review"], mf)
y = df['classe']

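## Computation

We use 2 clusters to represent the two classes. The reviews must be converted from strings to a numerical vector representation before being passed to KMeans. Note that KMeans is unsupervised: scikit-learn's `KMeans.fit` accepts `y` only for API consistency and ignores it, so cluster ids 0 and 1 are not guaranteed to correspond to the negative and positive classes.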

In [12]:
# y is ignored by KMeans.fit; it is passed here only for API consistency
kmeans = KMeans(n_clusters = 2, random_state = SEED).fit(X, y)
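Because KMeans never uses the class labels during fitting, it can help to check how the clusters actually line up with the classes. The sketch below does this on synthetic data; `X_demo`, `y_demo`, and `kmeans_demo` are hypothetical stand-ins for `X`, `y`, and `kmeans`, and the two well-separated blobs are an assumption made only so the example is easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-ins for X and y (hypothetical data, not the review matrix)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(0.0, 0.3, size=(30, 5)),  # 30 points around 0 -> class 0
    rng.normal(3.0, 0.3, size=(20, 5)),  # 20 points around 3 -> class 1
])
y_demo = pd.Series([0] * 30 + [1] * 20, name="classe")

kmeans_demo = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_demo)

# Rows: known class; columns: cluster id assigned by KMeans
tabela = pd.crosstab(y_demo, pd.Series(kmeans_demo.labels_, name="cluster"))
print(tabela)

# Majority cluster for each class; cluster ids are arbitrary, so check this mapping
classe_para_cluster = tabela.idxmax(axis=1).to_dict()
print(classe_para_cluster)
```

If the table shows each class concentrated in a different cluster, the class-to-cluster mapping can be read off `classe_para_cluster`. If both classes fall mostly into the same cluster, the centers do not represent the classes.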

## Checking the KMeans Clusters

### Negative

In [13]:
kmeans.cluster_centers_[0]

array([0.10890465, 0.08652482, 0.08510638, 0.10953507, 0.20063042,
       0.19054374, 0.12608353, 0.03546099, 0.05626478, 0.06635146,
       0.22111899, 0.04775414, 0.05311269, 0.21245075, 0.03372734,
       0.10591017, 0.08037825, 0.05594957, 0.03750985, 0.12655634,
       0.05925926, 0.03798266, 0.03624901, 0.06540583, 0.04617809,
       0.05973207, 0.03829787, 0.36910954, 0.04712372, 0.07817179,
       0.12923562, 0.0750197 , 0.11615445, 0.10811663, 0.41229314,
       0.04491726, 0.06272656, 0.16044129, 0.10543735, 0.05263987,
       0.05263987, 0.02616233, 0.02789598, 0.06887313, 0.03404255,
       0.07659574, 0.06540583, 0.04586288, 0.05153664, 0.1605989 ])

### Positive

In [14]:
kmeans.cluster_centers_[1]

array([0.11979823, 0.34930643, 0.4148802 , 0.4110971 , 0.18789407,
       0.23581337, 0.81084489, 0.2370744 , 0.07944515, 0.44010088,
       2.09583859, 0.24968474, 0.29129887, 0.75535939, 0.30264817,
       0.5964691 , 0.13871375, 0.22068096, 0.33165195, 0.13871375,
       0.04035309, 0.19672131, 0.21437579, 0.13114754, 0.39470366,
       0.48297604, 0.21815889, 0.73896595, 0.29760404, 0.48423707,
       1.07313997, 0.1147541 , 0.91677175, 0.11601513, 0.63934426,
       0.30769231, 0.110971  , 1.55737705, 0.17654477, 0.25851198,
       0.09457755, 0.30264817, 0.33921816, 0.14754098, 0.24211854,
       0.15006305, 0.54981084, 0.35308953, 0.14501892, 0.16519546])

## Computing the Distance from the Center to Each Review

In [15]:
distancia = []

# Euclidean distance from each review vector to the center indexed by its class
for linha, classe in zip(X, y):

    distancia.append(distance.euclidean(linha, kmeans.cluster_centers_[classe]))

In [16]:
df["distancia"] = distancia
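The per-row distances can also be computed in one NumPy expression. The sketch below uses small made-up arrays (`X_demo`, `y_demo`, and `centros` are hypothetical stand-ins for `X`, `y`, and `kmeans.cluster_centers_`) and compares the result with the explicit loop. With a pandas Series as `y`, the assumption is that it would first be converted with `y.to_numpy()`.

```python
import numpy as np
from scipy.spatial import distance

# Small hypothetical stand-ins for X, y, and kmeans.cluster_centers_
X_demo = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
y_demo = np.array([0, 1, 1])
centros = np.array([[0.0, 0.0], [1.0, 1.0]])

# Fancy indexing picks, for every row, the center indexed by that row's class
dist_vetorizada = np.linalg.norm(X_demo - centros[y_demo], axis=1)

# Same computation as an explicit loop with scipy
dist_loop = [
    distance.euclidean(linha, centros[classe])
    for linha, classe in zip(X_demo, y_demo)
]

print(np.allclose(dist_vetorizada, dist_loop))
```

`centros[y_demo]` builds an array with one center per row, so the subtraction and the norm run over the whole matrix at once.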

### Selecting the *N* Most Representative Reviews

In [17]:
num_representativas = 20
reviews_representativas_negativas = df[df["classe"] == 0].nsmallest(num_representativas, 'distancia')
reviews_representativas_positivas = df[df["classe"] == 1].nsmallest(num_representativas, 'distancia')

## Exporting to a .csv

In [18]:
colunas = ["Review", "Estrela", "Data", "classe", "distancia"]

In [19]:
reviews_representativas_negativas[colunas].to_csv(f'./Dados/{num_representativas}_reviews_representativas_negativas.csv', index = True)
reviews_representativas_positivas[colunas].to_csv(f'./Dados/{num_representativas}_reviews_representativas_positivas.csv', index = True)

In [20]:
# To read the .csv back
#teste = pd.read_csv("./Dados/20_reviews_representativas_negativas.csv", parse_dates = ["Data"], index_col = 0)
#teste.head(1)