# Data Quality Checks – Amazon Reviews (Electronics)

This notebook runs **data quality checks** using **two approaches**:

1. Declarative checks in the style of **Soda (pandas-based)**
2. Validations with **Great Expectations**

In [1]:
# Install dependencies (Google Colab)
!pip -q install great_expectations




## 1) Load the Parquet dataset


In [1]:
import pandas as pd

PARQUET_PATH = r"C:\Users\Rodrigo\Desktop\py\Prjt\DDF_TECH_122025\notebooks\data\electronics_reviews_prepared.parquet"

df = pd.read_parquet(PARQUET_PATH)
df.head(), df.shape

(       reviewerID        asin  \
 0   AO94DHGC771SJ  0528881469   
 1   AMO214LNFCEI4  0528881469   
 2  A3N7T0DY83Y4IG  0528881469   
 3  A1H8PY3QHMQQA0  0528881469   
 4  A24EV6RXELQZ63  0528881469   
 
                                           reviewText  overall  \
 0  We got this GPS for my husband who is an (OTR)...      5.0   
 1  I'm a professional OTR truck driver, and I bou...      1.0   
 2  Well, what can I say.  I've had this unit in m...      3.0   
 3  Not going to write a long review, even thought...      2.0   
 4  I've had mine for a year and here's what we go...      1.0   
 
                                   summary  unixReviewTime   reviewTime  \
 0                         Gotta have GPS!      1370131200   06 2, 2013   
 1                       Very Disappointed      1290643200  11 25, 2010   
 2                          1st impression      1283990400   09 9, 2010   
 3                 Great grafics, POOR GPS      1290556800  11 24, 2010   
 4  Major issues, onl

## 2) Quality checks – Soda-style approach (pandas)


In [2]:
report = []


def check(name, ok, details=""):
    """Append one PASS/FAIL row to the report."""
    report.append({"check": name, "status": "PASS" if ok else "FAIL", "details": details})


# Completeness
nulls = int(df.isna().sum().sum())
check("No null values in dataset", nulls == 0, f"Total nulls: {nulls}")

# Validity (between() is inclusive on both ends)
check("overall between 1 and 5", df["overall"].between(1, 5).all())
check("month between 1 and 12", df["month"].between(1, 12).all())
check("year between 1990 and 2030", df["year"].between(1990, 2030).all())

# Consistency
viol = int((df["helpful_up"] > df["helpful_total"]).sum())
check("helpful_up <= helpful_total", viol == 0, f"Violations: {viol}")

dq_report = pd.DataFrame(report)
dq_report


|   | check | status | details |
|---|-------|--------|---------|
| 0 | No null values in dataset | PASS | Total nulls: 0 |
| 1 | overall between 1 and 5 | PASS | |
| 2 | month between 1 and 12 | PASS | |
| 3 | year between 1990 and 2030 | PASS | |
| 4 | helpful_up <= helpful_total | PASS | Violations: 0 |
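
The range checks above share one shape (a column plus inclusive bounds), so they could also be declared as data and evaluated in a loop, which is closer to how declarative tools express rules. The following is a minimal sketch under that assumption: `RANGE_RULES`, `run_range_rules`, and the `toy` frame are hypothetical names introduced here for illustration, and the toy values are made up rather than taken from the reviews dataset.

```python
import pandas as pd

# Hypothetical rule table: (check name, column, lower bound, upper bound).
RANGE_RULES = [
    ("overall between 1 and 5", "overall", 1, 5),
    ("month between 1 and 12", "month", 1, 12),
    ("year between 1990 and 2030", "year", 1990, 2030),
]


def run_range_rules(frame, rules):
    """Evaluate each range rule and return one report row per rule."""
    rows = []
    for name, column, low, high in rules:
        # between() is inclusive and evaluates to False for NaN,
        # so missing values are counted as out of range here.
        bad = int((~frame[column].between(low, high)).sum())
        rows.append({
            "check": name,
            "status": "PASS" if bad == 0 else "FAIL",
            "details": f"Out-of-range rows: {bad}",
        })
    return pd.DataFrame(rows)


# Toy data for illustration only; the 7.0 rating is deliberately invalid.
toy = pd.DataFrame({
    "overall": [5.0, 1.0, 7.0],
    "month": [6, 11, 9],
    "year": [2013, 2010, 2010],
})
toy_report = run_range_rules(toy, RANGE_RULES)
print(toy_report)
```

Applied to the real data, the call would be `run_range_rules(df, RANGE_RULES)`. Unlike the boolean-only checks above, this variant also records how many rows violate each rule, which tends to be more useful when a check fails.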


## 3) Quality checks – Great Expectations


In [3]:
!pip install -q "great_expectations<0.18"




In [5]:
violations = (df["helpful_up"] > df["helpful_total"]).sum()

custom_expectation = {
    "expectation": "helpful_up <= helpful_total",
    "success": violations == 0,
    "violations": int(violations)
}

custom_expectation

{'expectation': 'helpful_up <= helpful_total',
 'success': np.True_,
 'violations': 0}

In [7]:
import great_expectations as ge

# Great Expectations (legacy Pandas API)
gdf = ge.dataset.PandasDataset(df)

gdf.expect_table_row_count_to_be_between(min_value=100000, max_value=10000000)
gdf.expect_column_values_to_not_be_null("asin")
gdf.expect_column_values_to_not_be_null("reviewerID")
gdf.expect_column_values_to_be_between("overall", 1, 5)
gdf.expect_column_values_to_be_between("month", 1, 12)
gdf.expect_column_values_to_be_between("year", 1990, 2030)

# GE validation
results = gdf.validate()
print("Great Expectations success:", results["success"])

# Custom consistency rule (evaluated outside the legacy GE API)
violations = (df["helpful_up"] > df["helpful_total"]).sum()
custom_check = {
    "expectation": "helpful_up <= helpful_total",
    "success": violations == 0,
    "violations": int(violations)
}

custom_check


Great Expectations success: True


{'expectation': 'helpful_up <= helpful_total',
 'success': np.True_,
 'violations': 0}

## 4) Export evidence


In [8]:
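import json

dq_report.to_csv("data_quality_report.csv", index=False)

# to_json_dict() yields a plain dict, so the file holds a JSON object
# instead of the whole result serialized as one quoted string via default=str.
with open("ge_validation_results.json", "w") as f:
    json.dump(results.to_json_dict(), f, indent=2, default=str)

print("Files generated:")
print("- data_quality_report.csv")
print("- ge_validation_results.json")


Files generated: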
- data_quality_report.csv
- ge_validation_results.json
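
Once the validation result is available as a plain dict, a short helper can list which expectations failed instead of relying only on the top-level `success` flag. This is a sketch under stated assumptions: it assumes the legacy result layout (a `results` list whose items carry `success` and an `expectation_config` with `expectation_type` and `kwargs`), `failed_expectations` is a hypothetical helper, and `sample` is hand-written for illustration rather than output from this notebook.

```python
def failed_expectations(validation):
    """Return (expectation_type, kwargs) for every expectation that did not pass."""
    failures = []
    for item in validation.get("results", []):
        if not item.get("success", False):
            config = item.get("expectation_config", {})
            failures.append((config.get("expectation_type"), config.get("kwargs", {})))
    return failures


# Hand-written sample in the assumed layout; not real output from this notebook.
sample = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "asin"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {"column": "overall", "min_value": 1, "max_value": 5},
            },
        },
    ],
}

for expectation_type, kwargs in failed_expectations(sample):
    print(expectation_type, kwargs)
```

For the run recorded in this notebook, where validation succeeded, the helper would be expected to return an empty list.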


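## 5) Conclusion

- The quality checks found no critical failures: all five pandas checks and the Great Expectations suite passed.
- The dataset shows high completeness (0 null values), validity (all range checks passed), and consistency (0 rows with `helpful_up > helpful_total`).
- The rules can be automated in pipelines of Dadosfera's Intelligence module.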
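
One way to automate the rules is to treat the exported CSV as a gate: a pipeline step reads the report and stops the run if any check failed. The sketch below uses only the Python standard library and assumes the `check,status,details` columns written by `dq_report.to_csv(...)`. `failed_checks` is a hypothetical helper, the in-memory sample contains a made-up failure, and how such a step is wired into a Dadosfera pipeline is not covered here.

```python
import csv
import io


def failed_checks(csv_file):
    """Return the names of checks whose status is not PASS."""
    reader = csv.DictReader(csv_file)
    return [row["check"] for row in reader if row["status"] != "PASS"]


# In a pipeline step the input would be the exported file, for example:
#     with open("data_quality_report.csv", newline="") as f:
#         failures = failed_checks(f)
# The in-memory sample below is illustrative and includes one made-up failure.
sample_csv = io.StringIO(
    "check,status,details\n"
    "No null values in dataset,PASS,Total nulls: 0\n"
    "helpful_up <= helpful_total,FAIL,Violations: 3\n"
)
failures = failed_checks(sample_csv)
if failures:
    # A real gate would typically exit non-zero here, e.g. raise SystemExit(1).
    print("Data quality gate failed:", failures)
```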
