# Data Standardization and Deduplication Using AI

In this hands-on lesson, we explore how AI techniques can be used to standardize and deduplicate data. We use a dataset of product reviews from [Amazon](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews?resource=download) and apply both traditional and AI-based techniques.

## Objectives
- Understand the basic concepts of data standardization and deduplication.
- Apply traditional data standardization techniques.
- Use language models for data deduplication.
- Compare the results obtained with traditional methods and with AI.

In [None]:
import pandas as pd

# Loading the Data

First, we load the data into a DataFrame using the pandas library.

In [None]:
nrows_read = 1000
df = pd.read_csv("Reviews.csv", sep=",", engine="python", encoding="utf8", nrows=nrows_read)
df.head(5)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [None]:
df.shape

(1000, 10)

## Data standardization

To standardize this data, we use a traditional technique: strip leading and trailing whitespace, convert everything to lowercase, and remove punctuation.

In [None]:
import string

# Trim leading/trailing whitespace and lowercase both text columns
df['Text'] = df['Text'].str.strip().str.lower()
df['Summary'] = df['Summary'].str.strip().str.lower()

# Remove ASCII punctuation (build the translation table once and reuse it)
punct_table = str.maketrans('', '', string.punctuation)
df['Text'] = df['Text'].str.translate(punct_table)
df['Summary'] = df['Summary'].str.translate(punct_table)

df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,good quality dog food,i have bought several of the vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,not as advertised,product arrived labeled as jumbo salted peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,delight says it all,this is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,cough medicine,if you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,great taffy,great taffy at a great price there was a wide...
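The raw reviews in this dataset contain HTML line breaks, and removing punctuation alone turns each `<br />` into a stray `br` token (the deduplication output further down shows fragments such as `bummedbr br now`). The sketch below is one possible way to handle this, not part of the original lesson: `normalize_text` is a hypothetical helper that replaces HTML tags with a space before lowercasing and removing punctuation, then collapses repeated whitespace.

```python
import re
import string

_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize_text(text):
    """Hypothetical helper: lowercase, drop HTML tags and punctuation, collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', text)       # replace HTML tags such as <br /> with a space
    text = text.lower().translate(_PUNCT_TABLE)  # lowercase, then remove ASCII punctuation
    return re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace and trim the ends

# Made-up sentence in the style of the reviews (not a row from the dataset)
example = "We were bummed.<br /><br />Now,  we have a case!"
print(normalize_text(example))
```

Assuming the column has no missing values, the helper could be applied to the raw text with `df['Text'].map(normalize_text)`, in place of the strip, lowercase, and punctuation steps.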


## Deduplication using BERT

In [None]:
# Install the sentence-transformers library
!pip install sentence-transformers


In [None]:
from sentence_transformers import SentenceTransformer, util

# Load a BERT-based sentence-embedding model
model = SentenceTransformer('bert-base-nli-mean-tokens')

def deduplicate_reviews(df, threshold=0.90):
    """Drop every review whose cosine similarity to an earlier kept review exceeds `threshold`."""
    texts = df['Text'].tolist()
    embeddings = model.encode(texts, convert_to_tensor=True)
    # Pairwise cosine similarity between all reviews (an n x n matrix)
    cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

    # Mark duplicates by position, keeping the first review of each similar group
    duplicate_indices = set()
    for i in range(len(texts)):
        if i in duplicate_indices:
            continue
        for j in range(i + 1, len(texts)):
            if cosine_scores[i][j] > threshold:
                duplicate_indices.add(j)
                # Show the pair of reviews flagged as similar
                print("#")
                print(texts[i])
                print(texts[j])
                print("#")

    # Remove duplicates (positions are mapped to index labels before dropping)
    df_deduplicated = df.drop(df.index[sorted(duplicate_indices)]).reset_index(drop=True)
    return df_deduplicated, duplicate_indices

# Apply the deduplication function
df_deduplicated, indices = deduplicate_reviews(df)
df_deduplicated.head()



#
i dont know if its the cactus or the tequila or just the unique combination of ingredients but the flavour of this hot sauce makes it one of a kind  we picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away  when we realized that we simply couldnt find it anywhere in our city we were bummedbr br now because of the magic of the internet we have a case of the sauce and are ecstatic because of itbr br if you love hot saucei mean really love hot sauce but dont want a sauce that tastelessly burns your throat grab a bottle of tequila picante gourmet de inclan  just realize that once you taste it you will never want to use any other saucebr br thank you for the personal incredible service
i dont know if its the cactus or the tequila or just the unique combination of ingredients but the flavour of this hot sauce makes it one of a kind  we picked up a bottle once on a trip we were on and brought it back home with us and were totally blown awa

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,good quality dog food,i have bought several of the vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,not as advertised,product arrived labeled as jumbo salted peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,delight says it all,this is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,cough medicine,if you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,great taffy,great taffy at a great price there was a wide...
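The deduplication function compares reviews through the cosine similarity of their embeddings: the dot product of two vectors divided by the product of their lengths, which is 1 for vectors pointing in the same direction and 0 for orthogonal ones. As a minimal illustration, the sketch below computes it by hand for small made-up vectors; real sentence embeddings have hundreds of dimensions, and `cosine_similarity` here is an illustrative helper, not the library function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice the length
c = [3.0, 0.0, -1.0]   # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0

print(cosine_similarity(a, b))  # approximately 1: treated as a duplicate at any threshold below 1
print(cosine_similarity(a, c))  # 0: unrelated directions
```

Because the score ignores vector length, two reviews can score close to 1 even when one is much longer than the other, as long as their embeddings point in nearly the same direction.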


In [None]:
df_deduplicated

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,4,1342051200,nice taffy,i got a wild hair for taffy and ordered this f...
7,8,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0,0,5,1336003200,wonderful tasty taffy,this taffy is so good it is very soft and che...
13,14,B001GVISJM,A18ECVX2RJ7HUE,"willie ""roadie""",2,2,4,1288915200,fresh and greasy,good flavor these came securely packed they we...
18,19,B001GVISJM,A2A9X58G2GTBLP,Wolfee1,0,0,5,1324598400,great sweet candy,twizzlers strawberry my childhood favorite can...
24,25,B001GVISJM,A22P2J09NJ9HKE,"S. Cabanaugh ""jilly pepper""",0,0,5,1295481600,please sell these in mexico,i have lived out of the us for over 7 yrs now ...
...,...,...,...,...,...,...,...,...,...,...
995,996,B006F2NYI2,A1D3F6UI1RTXO0,Swopes,1,1,5,1331856000,hot flavorful,black market hot sauce is wonderful my husband...
996,997,B006F2NYI2,AF50D40Y85TV3,Mike A.,1,1,5,1328140800,great hot sauce and people who run it,man what can i say this salsa is the bomb i ha...
997,998,B006F2NYI2,A3G313KLWDG3PW,kefka82,1,1,5,1324252800,this sauce is the shiznit,this sauce is so good with just about anything...
998,999,B006F2NYI2,A3NIDDT7E7JIFW,V. B. Brookshaw,1,2,1,1336089600,not hot,not hot at all like the other low star reviewe...
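To compare the embedding-based result with a traditional method, a simple baseline is exact-match deduplication on the standardized text. The sketch below is an assumption about how that comparison could be set up, using pandas `duplicated` and `drop_duplicates` on a small toy DataFrame (the rows are invented for illustration, not taken from the dataset).

```python
import pandas as pd

# Toy data standing in for the standardized reviews (not taken from the dataset)
reviews = pd.DataFrame({
    "Text": [
        "great taffy at a great price",
        "great taffy at a great price",   # exact copy of the first row
        "great price for great taffy",    # same meaning, different wording
        "not hot at all",
    ]
})

# Flag rows whose text exactly repeats an earlier row, then drop them
exact_dupes = reviews.duplicated(subset="Text", keep="first")
reviews_exact = reviews.drop_duplicates(subset="Text", keep="first").reset_index(drop=True)

print(int(exact_dupes.sum()))  # number of exact duplicates found
print(len(reviews_exact))      # rows remaining after exact-match deduplication
```

Exact matching removes only the verbatim copy; the reworded third row survives, which is the kind of near-duplicate the embedding-based approach is meant to catch.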


# Using a dataset in Portuguese

This dataset is also available on [Kaggle](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce?select=olist_order_reviews_dataset.csv).


In [None]:
df_pt = pd.read_csv("olist_order_reviews_dataset.csv", encoding="utf8", nrows=5000)
df_pt.head()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53


In [None]:
import string

# Drop rows with no review message, then standardize the text columns.
# .copy() makes df an independent DataFrame, which avoids SettingWithCopyWarning.
df = df_pt.dropna(subset=['review_comment_message']).copy()
df['review_comment_title'] = df['review_comment_title'].str.strip().str.lower()
df['review_comment_message'] = df['review_comment_message'].str.strip().str.lower()

# Remove ASCII punctuation
punct_table = str.maketrans('', '', string.punctuation)
df['review_comment_title'] = df['review_comment_title'].str.translate(punct_table)
df['review_comment_message'] = df['review_comment_message'].str.translate(punct_table)

df.head()



Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,recebi bem antes do prazo estipulado,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,parabéns lojas lannister adorei comprar pela i...,2018-03-01 00:00:00,2018-03-02 10:26:53
9,8670d52e15e00043ae7de4c01cc2fe06,b9bf720beb4ab3728760088589c62129,4,recomendo,aparelho eficiente no site a marca do aparelho...,2018-05-22 00:00:00,2018-05-23 16:45:47
12,4b49719c8a200003f700d3d986ea1a19,9d6f15f95d01e79bd1349cc208361f09,4,,mas um pouco travandopelo valor ta boa,2018-02-16 00:00:00,2018-02-20 10:52:22
15,3948b09f7c818e2d86c9a546758b2335,e51478e7e277a83743b6f9991dbfa3fb,5,super recomendo,vendedor confiável produto ok e entrega antes ...,2018-05-23 00:00:00,2018-05-24 03:00:01


In [None]:
df = df.reset_index()

In [None]:
from sentence_transformers import SentenceTransformer, util

# Load a Portuguese BERT model.
# Note: this appears to be a general-purpose BERT checkpoint rather than one tuned for
# sentence similarity, so its cosine scores may be less discriminative.
model = SentenceTransformer('neuralmind/bert-base-portuguese-cased')

def deduplicate_reviews(df, threshold=0.9):
    """Drop every review whose cosine similarity to an earlier kept review exceeds `threshold`."""
    texts = df['review_comment_message'].tolist()
    embeddings = model.encode(texts, convert_to_tensor=True)
    # Pairwise cosine similarity between all reviews (an n x n matrix)
    cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

    # Mark duplicates by position, keeping the first review of each similar group
    duplicate_indices = set()
    for i in range(len(texts)):
        if i in duplicate_indices:
            continue
        for j in range(i + 1, len(texts)):
            if cosine_scores[i][j] > threshold:
                duplicate_indices.add(j)
                # Show the pair of reviews flagged as similar
                print("#")
                print(texts[i])
                print(texts[j])
                print("#")

    # Remove duplicates (positions are mapped to index labels before dropping)
    df_deduplicated = df.drop(df.index[sorted(duplicate_indices)]).reset_index(drop=True)
    return df_deduplicated

# Apply the deduplication function
df_deduplicated = deduplicate_reviews(df)
df_deduplicated.head()



#
recebi bem antes do prazo estipulado
recebi antes do prazo estimado
#
#
recebi bem antes do prazo estipulado
recebi antes do prazo
#
#
recebi bem antes do prazo estipulado
chegou bem antes do prazo informado
#
#
vendedor confiável produto ok e entrega antes do prazo
entregue no prazo produto muito bom
#
#
vendedor confiável produto ok e entrega antes do prazo
entrega antes do prazo produto muito bem conservado
#
#
vendedor confiável produto ok e entrega antes do prazo
entregue antes do prazo produto perfeito
#
#
vendedor confiável produto ok e entrega antes do prazo
produto bom entrega antes do prazo
#
#
vendedor confiável produto ok e entrega antes do prazo
ótimo atendimento e entrega antes do prazo
#
#
vendedor confiável produto ok e entrega antes do prazo
entrega dentro do prazo e produto em excelente condição
#
#
vendedor confiável produto ok e entrega antes do prazo
bom produto entrega no prazo
#
#
vendedor confiável produto ok e entrega antes do prazo
entrega no prazo produto d

Unnamed: 0,index,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,recebi bem antes do prazo estipulado,2017-04-21 00:00:00,2017-04-21 22:02:06
1,4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,parabéns lojas lannister adorei comprar pela i...,2018-03-01 00:00:00,2018-03-02 10:26:53
2,9,8670d52e15e00043ae7de4c01cc2fe06,b9bf720beb4ab3728760088589c62129,4,recomendo,aparelho eficiente no site a marca do aparelho...,2018-05-22 00:00:00,2018-05-23 16:45:47
3,12,4b49719c8a200003f700d3d986ea1a19,9d6f15f95d01e79bd1349cc208361f09,4,,mas um pouco travandopelo valor ta boa,2018-02-16 00:00:00,2018-02-20 10:52:22
4,15,3948b09f7c818e2d86c9a546758b2335,e51478e7e277a83743b6f9991dbfa3fb,5,super recomendo,vendedor confiável produto ok e entrega antes ...,2018-05-23 00:00:00,2018-05-24 03:00:01
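Several of the pairs printed above share a topic without being true duplicates (for example, `vendedor confiável produto ok e entrega antes do prazo` and `ótimo atendimento e entrega antes do prazo`), which suggests that a threshold of 0.9 may be too permissive for this model. One way to explore this is to count how many rows would be flagged at different thresholds. The sketch below repeats the same greedy pass on a small, made-up similarity matrix; `find_duplicates` and the scores in `toy_scores` are hypothetical and only illustrate the effect.

```python
def find_duplicates(scores, threshold):
    """Greedy pass: flag row j if it is too similar to an earlier, unflagged row i."""
    n = len(scores)
    flagged = set()
    for i in range(n):
        if i in flagged:
            continue
        for j in range(i + 1, n):
            if scores[i][j] > threshold:
                flagged.add(j)
    return flagged

# Made-up symmetric similarity matrix for four reviews (not real model output)
toy_scores = [
    [1.00, 0.99, 0.93, 0.40],
    [0.99, 1.00, 0.92, 0.35],
    [0.93, 0.92, 1.00, 0.30],
    [0.40, 0.35, 0.30, 1.00],
]

# A higher threshold flags fewer rows as duplicates
for threshold in (0.90, 0.95, 0.995):
    flagged = find_duplicates(toy_scores, threshold)
    print(threshold, sorted(flagged))
```

With the real data, the matrix returned by `util.pytorch_cos_sim` could be converted with `.tolist()` and passed to the same function, so that the embeddings are computed once and only the threshold varies.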


In [None]:
df_deduplicated.shape