# Dadosfera Case: Item 1 (Colab)
## Download and prepare the **Amazon Reviews (Electronics)** dataset locally

This notebook:
- downloads `reviews_Electronics_5.json.gz` (SNAP/Stanford)
- reads it incrementally (JSON Lines)
- applies minimal cleaning and typing
- saves the result as **Parquet**

> Date: 2026-01-03


In [1]:
!pip -q install pandas pyarrow requests tqdm




In [2]:
import os, json, gzip, math, requests
from pathlib import Path
import pandas as pd
from tqdm.auto import tqdm

# =========================
# CONFIG
# =========================
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

DATASET_URL = "https://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz"
RAW_GZ_PATH = DATA_DIR / "reviews_Electronics_5.json.gz"

# Volume control (to run quickly on Colab)
# - The full dataset is quite large; for the case, >=100k rows is enough.
MAX_ROWS = 400_000   # adjust as needed (e.g. 100_000)




## 1) Download the dataset (SNAP)

In [3]:
def download_file(url: str, dest: Path, chunk_size: int = 1024 * 1024):
    """Stream `url` to `dest` in chunks; skip the download if `dest` already exists."""
    if dest.exists() and dest.stat().st_size > 0:
        print(f"File already exists: {dest} ({dest.stat().st_size/1e6:.1f} MB)")
        return

    print(f"Downloading: {url}")
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        total = int(r.headers.get("content-length", 0))
        with open(dest, "wb") as f, tqdm(total=total, unit="B", unit_scale=True) as pbar:
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk:  # skip empty keep-alive chunks
                    f.write(chunk)
                    pbar.update(len(chunk))

download_file(DATASET_URL, RAW_GZ_PATH)


Downloading: https://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz


100%|██████████| 496M/496M [00:38<00:00, 12.8MB/s] 
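
`download_file` treats any non-empty file at `dest` as a finished download, so an interrupted run could leave a truncated archive that is later reused. One possible safeguard, sketched here under assumptions, is to write to a temporary sibling file and rename it only after the last chunk arrives. The helper name `write_chunks_atomically` and the `.part` suffix are hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path
from typing import Iterable


def write_chunks_atomically(chunks: Iterable[bytes], dest: Path) -> int:
    """Write byte chunks to "<dest>.part", then rename to dest on success.

    Returns the number of bytes written. If anything fails midway, the
    partial file is removed and dest is never created.
    """
    dest = Path(dest)
    tmp = dest.with_name(dest.name + ".part")  # hypothetical naming convention
    written = 0
    try:
        with open(tmp, "wb") as f:
            for chunk in chunks:
                if chunk:  # skip empty keep-alive chunks
                    f.write(chunk)
                    written += len(chunk)
        os.replace(tmp, dest)  # atomic when tmp and dest share a filesystem
    except BaseException:
        tmp.unlink(missing_ok=True)
        raise
    return written
```

Inside `download_file`, the inner write loop could then become `write_chunks_atomically(r.iter_content(chunk_size=chunk_size), dest)`, at the cost of moving the progress bar update elsewhere.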


## 2) Incremental read of the `.json.gz` file (JSON Lines)

We read at most `MAX_ROWS` records to speed up development.


In [4]:
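def read_gz_json_lines(path: Path, max_rows: int | None = None):
    """Read a gzipped JSON Lines file and return up to `max_rows` parsed records."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            # Count parsed records, not raw lines, so blank lines do not use up the limit
            if max_rows is not None and len(rows) >= max_rows:
                break
            line = line.strip()
            if not line:
                continue
            rows.append(json.loads(line))
    return rows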

data = read_gz_json_lines(RAW_GZ_PATH, max_rows=MAX_ROWS)
df = pd.DataFrame(data)

print("Rows:", len(df))
print("Columns:", list(df.columns))
df.head(3)


Rows: 400000
Columns: ['reviewerID', 'asin', 'reviewerName', 'helpful', 'reviewText', 'overall', 'summary', 'unixReviewTime', 'reviewTime']


| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AO94DHGC771SJ | 528881469 | amazdnu | [0, 0] | We got this GPS for my husband who is an (OTR)... | 5.0 | Gotta have GPS! | 1370131200 | 06 2, 2013 |
| 1 | AMO214LNFCEI4 | 528881469 | Amazon Customer | [12, 15] | I'm a professional OTR truck driver, and I bou... | 1.0 | Very Disappointed | 1290643200 | 11 25, 2010 |
| 2 | A3N7T0DY83Y4IG | 528881469 | C. A. Freeman | [43, 45] | Well, what can I say. I've had this unit in m... | 3.0 | 1st impression | 1283990400 | 09 9, 2010 |
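
The reader above streams the file line by line but still collects every record in one list. If memory becomes a concern at larger `MAX_ROWS`, a generator that yields fixed-size batches is one option. The sketch below is an assumption rather than part of the original notebook; the function name `iter_json_lines_batches` and the `batch_size` parameter are hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator, List, Optional


def iter_json_lines_batches(
    path: Path, batch_size: int = 50_000, max_rows: Optional[int] = None
) -> Iterator[List[dict]]:
    """Yield lists of parsed records, each holding at most batch_size items."""
    batch: List[dict] = []
    seen = 0  # number of records parsed so far
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if max_rows is not None and seen >= max_rows:
                break
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            batch.append(json.loads(line))
            seen += 1
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch
```

Each batch can be turned into its own `pd.DataFrame(batch)` and processed or written out before the next one is read, so only one batch is held in memory at a time.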


## 3) Minimal preparation (main columns + typing)

In [5]:
# Most useful columns in the dataset
wanted_cols = [
    "reviewerID", "asin", "reviewText", "overall", "summary",
    "unixReviewTime", "reviewTime", "helpful"
]
cols = [c for c in wanted_cols if c in df.columns]
df = df[cols].copy()

# Typing
if "overall" in df.columns:
    df["overall"] = pd.to_numeric(df["overall"], errors="coerce")

if "unixReviewTime" in df.columns:
    df["review_datetime"] = pd.to_datetime(df["unixReviewTime"], unit="s", errors="coerce")
    df["year"] = df["review_datetime"].dt.year
    df["month"] = df["review_datetime"].dt.month

# `helpful` is usually [upvotes, total votes]
if "helpful" in df.columns:
    df["helpful_up"] = df["helpful"].apply(lambda x: x[0] if isinstance(x, list) and len(x) > 0 else None)
    df["helpful_total"] = df["helpful"].apply(lambda x: x[1] if isinstance(x, list) and len(x) > 1 else None)

# Basic hygiene: review length in characters (missing text counts as 0, not as len("nan"))
if "reviewText" in df.columns:
    df["reviewText_len"] = df["reviewText"].fillna("").astype(str).str.len()

df.head(3)


| | reviewerID | asin | reviewText | overall | summary | unixReviewTime | reviewTime | helpful | review_datetime | year | month | helpful_up | helpful_total | reviewText_len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AO94DHGC771SJ | 528881469 | We got this GPS for my husband who is an (OTR)... | 5.0 | Gotta have GPS! | 1370131200 | 06 2, 2013 | [0, 0] | 2013-06-02 | 2013 | 6 | 0 | 0 | 805 |
| 1 | AMO214LNFCEI4 | 528881469 | I'm a professional OTR truck driver, and I bou... | 1.0 | Very Disappointed | 1290643200 | 11 25, 2010 | [12, 15] | 2010-11-25 | 2010 | 11 | 12 | 15 | 2175 |
| 2 | A3N7T0DY83Y4IG | 528881469 | Well, what can I say. I've had this unit in m... | 3.0 | 1st impression | 1283990400 | 09 9, 2010 | [43, 45] | 2010-09-09 | 2010 | 9 | 43 | 45 | 4607 |
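
With `helpful` split into upvotes and total votes, a natural derived feature is the share of votes that were positive. The sketch below is an illustration only: the function name `helpful_ratio` and the resulting column name are hypothetical, and returning `None` for reviews without votes is an assumed convention, not something the original notebook defines.

```python
from typing import Optional


def helpful_ratio(helpful) -> Optional[float]:
    """Return upvotes / total votes, or None when the ratio is undefined."""
    if not isinstance(helpful, (list, tuple)) or len(helpful) < 2:
        return None
    up, total = helpful[0], helpful[1]
    if not total:  # zero votes: avoid dividing by zero
        return None
    return up / total


print(helpful_ratio([12, 15]))  # 0.8
print(helpful_ratio([0, 0]))    # None
```

Applied to the DataFrame, this could look like `df["helpful_ratio"] = df["helpful"].apply(helpful_ratio)`.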


## 4) Save as Parquet (for ingestion and performance)

We save the prepared dataset plus a smaller sample for quick tests.

In [6]:
OUT_PARQUET = DATA_DIR / "electronics_reviews_prepared.parquet"
OUT_SAMPLE = DATA_DIR / "electronics_reviews_sample_50k.parquet"

df.to_parquet(OUT_PARQUET, index=False)

# Sample (for quick development)
sample = df.sample(min(50_000, len(df)), random_state=42) if len(df) > 0 else df
sample.to_parquet(OUT_SAMPLE, index=False)

print("Saved:", OUT_PARQUET, "->", OUT_PARQUET.stat().st_size/1e6, "MB")
print("Saved:", OUT_SAMPLE, "->", OUT_SAMPLE.stat().st_size/1e6, "MB")


Saved: data\electronics_reviews_prepared.parquet -> 164.936579 MB
Saved: data\electronics_reviews_sample_50k.parquet -> 21.731382 MB


## 5) Quick checks

- confirm `>= 100k` rows
- inspect the rating distribution (1–5)
- check nulls in the main columns


In [7]:
assert len(df) >= 100_000, f"Dataset has {len(df)} rows; the case requires >=100k."

checks = {
    "rows": len(df),
    "null_reviewText_pct": float(df["reviewText"].isna().mean()) if "reviewText" in df.columns else None,
    "null_overall_pct": float(df["overall"].isna().mean()) if "overall" in df.columns else None,
}
checks


{'rows': 400000, 'null_reviewText_pct': 0.0, 'null_overall_pct': 0.0}

In [8]:
if "overall" in df.columns:
    # Inside an `if` block the last expression is not auto-displayed, so print it explicitly
    print(df["overall"].value_counts(dropna=False).sort_index())
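
For the rating check, it can help to look at shares rather than raw counts and to ignore anything outside the 1 to 5 scale. The following is a minimal pure-Python sketch of that idea; the helper name `rating_distribution` is hypothetical, and silently dropping out-of-range or missing values is an assumption made for illustration.

```python
from collections import Counter


def rating_distribution(ratings):
    """Map each star rating 1..5 to its share of the valid ratings."""
    # Keep only values equal to 1, 2, 3, 4 or 5 (5.0 counts; NaN and None do not)
    valid = [int(r) for r in ratings if r in (1, 2, 3, 4, 5)]
    counts = Counter(valid)
    n = len(valid)
    return {star: (counts[star] / n if n else 0.0) for star in range(1, 6)}


print(rating_distribution([5.0, 5.0, 1.0, 3.0]))
# {1: 0.25, 2: 0.0, 3: 0.25, 4: 0.0, 5: 0.5}
```

On the DataFrame itself, `df["overall"].value_counts(normalize=True).sort_index()` gives comparable shares directly.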


In [17]:
# Load the prepared Parquet file (absolute local path to the file saved in step 4)
df_googleSheets = pd.read_parquet(r"C:\Users\Rodrigo\Desktop\py\Prjt\DDF_TECH_122025\notebooks\data\electronics_reviews_prepared.parquet")

# Reproducible random sample
df_150k = df_googleSheets.sample(150_000, random_state=42)

# Export to CSV (compatible with Google Sheets)
df_150k.to_csv(
    "amazon_reviews_electronics_150k.csv",
    index=False
)

print("File generated successfully:", len(df_150k))

File generated successfully: 150000
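
Before uploading a CSV of this size to Google Sheets, a quick cell-count estimate can show whether it is likely to fit. The sketch below assumes a limit of 10 million cells per spreadsheet; that figure is an assumption to verify against Google's current documentation, and the names `ASSUMED_SHEETS_CELL_LIMIT` and `csv_cell_count` are hypothetical. The 14 columns correspond to the prepared DataFrame shown in step 3.

```python
# Assumption: Google Sheets caps a spreadsheet at 10 million cells.
ASSUMED_SHEETS_CELL_LIMIT = 10_000_000


def csv_cell_count(n_rows: int, n_cols: int, header: bool = True) -> int:
    """Number of cells a CSV occupies once imported, including the header row."""
    return (n_rows + (1 if header else 0)) * n_cols


cells = csv_cell_count(150_000, 14)
print(cells, cells <= ASSUMED_SHEETS_CELL_LIMIT)  # 2100014 True
```

This only counts cells; very long `reviewText` values may still run into separate per-cell size limits.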


In [15]:
import os
print(os.getcwd())

c:\Users\Rodrigo\Desktop\py\Prjt\DDF_TECH_122025\notebooks


In [16]:
base = r"C:\Users\Rodrigo\Desktop\py\Prjt\DDF_TECH_122025"
print("Project contents:")
print(os.listdir(base))

Project contents:
['.git', '.gitignore', 'Data', 'docs', 'gen_ai_insight.py', 'Images_and_files', 'notebooks', 'README_v2.md', 'requirements.txt', 'scripts', 'venv']


In [19]:
# 110k-row sample (variable renamed to match the actual sample size)
df_110k = df_googleSheets.sample(110_000, random_state=42)

df_110k.to_csv(
    "amazon_reviews_electronics_110k.csv",
    index=False
)

print(len(df_110k))

110000
