# 01_eda.ipynb – Exploratory Data Analysis (EDA)

**Goal:** Overview and quality check of the Amazon Reviews data for the categories Automotive, Video_Games, and Books.

---



## 1. Setup and loading the data (sample)
To save memory, we read only the first 1,000,000 lines of each category file. The sample therefore follows the file order and is not a random draw.



In [10]:
import pandas as pd
import json
from pathlib import Path

DATA_DIR = Path('../data/raw')
CATEGORIES = ['Automotive', 'Video_Games', 'Books']
dfs = {}
N = 1000000        # number of lines to read per category

for cat in CATEGORIES:
    data = []
    with open(DATA_DIR / f'{cat}.jsonl', 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= N:          # stop after the first N lines
                break
            data.append(json.loads(line))
    df = pd.DataFrame(data)
    dfs[cat] = df
    print(f'{cat}: {df.shape[0]} entries, columns: {list(df.columns)}')



Automotive: 1000000 entries, columns: ['rating', 'title', 'text', 'images', 'asin', 'parent_asin', 'user_id', 'timestamp', 'helpful_vote', 'verified_purchase']
Video_Games: 1000000 entries, columns: ['rating', 'title', 'text', 'images', 'asin', 'parent_asin', 'user_id', 'timestamp', 'helpful_vote', 'verified_purchase']
Books: 1000000 entries, columns: ['rating', 'title', 'text', 'images', 'asin', 'parent_asin', 'user_id', 'timestamp', 'helpful_vote', 'verified_purchase']
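
The loading loop above can also be written with `itertools.islice`, which stops after the first `n` lines without a manual counter. The following is a minimal sketch of that idea; the helper name `read_first_n_jsonl` is an assumption for illustration and does not appear in the original notebook.

```python
import json
from itertools import islice

import pandas as pd


def read_first_n_jsonl(path, n):
    """Read at most the first n lines of a JSON Lines file into a DataFrame.

    Hypothetical helper, not part of the original notebook.
    """
    with open(path, 'r', encoding='utf-8') as f:
        # islice stops after n lines, so the rest of the file is never parsed
        records = [json.loads(line) for line in islice(f, n)]
    return pd.DataFrame(records)
```

In the notebook this could be used as `dfs = {cat: read_first_n_jsonl(DATA_DIR / f'{cat}.jsonl', N) for cat in CATEGORIES}`. Like the original loop, it takes the first lines of each file rather than a random sample.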


## 2. Number of entries per category
We check how many reviews were loaded for each category.


In [11]:
for cat, df in dfs.items():
    print(f'{cat}: {len(df)} reviews')


Automotive: 1000000 reviews
Video_Games: 1000000 reviews
Books: 1000000 reviews


## 3. Distribution of star ratings (labels)
We analyze how the ratings (1–5 stars) are distributed in each category.


In [12]:
for cat, df in dfs.items():
    print(f'--- {cat} ---')
    print(df['rating'].value_counts().sort_index())
    print()


--- Automotive ---
rating
1.0     77368
2.0     37808
3.0     62439
4.0    122296
5.0    700089
Name: count, dtype: int64

--- Video_Games ---
rating
1.0     93823
2.0     48940
3.0     77460
4.0    139600
5.0    640177
Name: count, dtype: int64

--- Books ---
rating
0.0         4
1.0     33952
2.0     35812
3.0     79114
4.0    170464
5.0    680654
Name: count, dtype: int64
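
Absolute counts are hard to compare at a glance, and the Books output contains four rows with a rating of 0.0, which lies outside the 1–5 scale. The sketch below shows one way to report percentage shares and count out-of-scale ratings. The function name `rating_summary` and the small example frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def rating_summary(df, valid=(1.0, 2.0, 3.0, 4.0, 5.0)):
    """Return (percentage share per rating, number of rows outside the valid scale).

    Hypothetical helper. Missing ratings are also counted as invalid.
    """
    shares = df['rating'].value_counts(normalize=True).sort_index() * 100
    n_invalid = int((~df['rating'].isin(valid)).sum())
    return shares.round(1), n_invalid


# Tiny illustrative frame, not taken from the dataset
example = pd.DataFrame({'rating': [5.0, 5.0, 4.0, 1.0, 0.0]})
shares, n_invalid = rating_summary(example)
print(shares)
print(f'invalid ratings: {n_invalid}')
```

Applied to the notebook data, this would be called as `rating_summary(df)` inside the `for cat, df in dfs.items()` loop.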



## 4. Checking for missing values
We check whether the most important fields (`title`, `text`, `rating`) contain missing values.


In [None]:
for cat, df in dfs.items():
    print(f'--- {cat} ---')
    print(df[['title', 'text', 'rating']].isnull().sum())
    print()


--- Automotive ---
title     0
text      0
rating    0
dtype: int64

--- Video_Games ---
title     0
text      0
rating    0
dtype: int64

--- Books ---
title     0
text      0
rating    0
dtype: int64
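
Note that `isnull()` counts only `NaN`/`None`; empty strings pass this check. Since the text-length statistics in section 6 report a minimum length of 0, empty texts do occur in the data. A minimal sketch of a stricter check follows; the helper name `count_blank` and the example frame are assumptions for illustration.

```python
import pandas as pd


def count_blank(df, columns=('title', 'text')):
    """Count values per column that are missing, empty, or whitespace-only.

    Hypothetical helper, not part of the original notebook.
    """
    result = {}
    for col in columns:
        # fillna('') treats missing values as blank before the string check
        is_blank = df[col].fillna('').astype(str).str.strip().eq('')
        result[col] = int(is_blank.sum())
    return pd.Series(result, name='blank_or_missing')


# Tiny illustrative frame, not taken from the dataset
example = pd.DataFrame({
    'title': ['Five Stars', '', None],
    'text': ['love it.', '   ', 'ok'],
})
blank_counts = count_blank(example)
print(blank_counts)
```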



## 5. Sample reviews
We show five randomly selected reviews (title, text, and rating) per category.


In [5]:
for cat, df in dfs.items():
    print(f'--- {cat} ---')
    display(df[['title', 'text', 'rating']].sample(5, random_state=42))


--- Automotive ---


| | title | text | rating |
|---|---|---|---|
| 987231 | Diesel additive | Don’t know yet truck seems a little smoother | 4.0 |
| 79954 | Great for our Prius | This fits our 2011 Prius perfectly. It matches... | 5.0 |
| 567130 | Economical, and they FIT | There are so many choices in windshield wipers... | 5.0 |
| 500891 | Obstructs your view | This product is a good idea but if you hang yo... | 2.0 |
| 55399 | It’s black | Worked well to seal a crack | 5.0 |


--- Video_Games ---


| | title | text | rating |
|---|---|---|---|
| 987231 | Might not last | Left ear stopped working | 4.0 |
| 79954 | 1 Step Forward, 1 Step Back | Go to Best Buy or some other electronics store... | 3.0 |
| 567130 | Five Stars | love it. | 5.0 |
| 500891 | Bad plastic | Snapped after a week of use | 3.0 |
| 55399 | Five Stars | Love it. Addictive, entertaining and challenging | 5.0 |


--- Books ---


| | title | text | rating |
|---|---|---|---|
| 987231 | Very great read | If you love any of the In Death series, this w... | 5.0 |
| 79954 | An Ad Pretending to be a Business Book | More a public relations puff piece than a busi... | 3.0 |
| 567130 | Great read for brothers | Great book for my grandsons. They love it. | 5.0 |
| 500891 | Guide to the Best Bets at Favorite Chains | Whether you have a hankering for a juicy strip... | 5.0 |
| 55399 | Four Stars | enjoyable | 4.0 |


## 6. Text length statistics
We analyze the length of the review texts, measured in characters.


In [6]:
for cat, df in dfs.items():
    df['text_length'] = df['text'].astype(str).str.len()
    print(f'--- {cat} ---')
    print(df['text_length'].describe())
    print()


--- Automotive ---
count    1000000.000000
mean         181.748387
std          276.136495
min            0.000000
25%           39.000000
50%           98.000000
75%          217.000000
max        19982.000000
Name: text_length, dtype: float64

--- Video_Games ---
count    1000000.000000
mean         325.374404
std          693.340170
min            0.000000
25%           44.000000
50%          127.000000
75%          316.000000
max        32866.000000
Name: text_length, dtype: float64

--- Books ---
count    1000000.000000
mean         442.261393
std          787.262362
min            0.000000
25%           63.000000
50%          168.000000
75%          460.000000
max        31508.000000
Name: text_length, dtype: float64
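
In all three categories the mean is well above the median, which points to a long right tail that `describe()` only hints at through `max`. The sketch below reports a few upper quantiles and the share of empty texts. The function name `length_tail` and the chosen quantile levels are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def length_tail(lengths, quantiles=(0.90, 0.95, 0.99)):
    """Summarise the upper tail of a text-length Series.

    Hypothetical helper: returns the requested quantiles plus the
    share of texts with length 0.
    """
    summary = lengths.quantile(list(quantiles))
    summary.index = [f'q{round(q * 100)}' for q in quantiles]
    summary['share_empty'] = (lengths == 0).mean()
    return summary


# Tiny illustrative Series, not taken from the dataset
example = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 80, 1000], name='text_length')
tail = length_tail(example)
print(tail)
```

In the notebook this could be called as `length_tail(df['text_length'])` for each category after the `text_length` column has been created.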



## 7. Notable findings & insights
This section documents the most important observations from the analysis.

Section 1:
Each of the three samples contains 1,000,000 entries and the same ten columns: `rating`, `title`, `text`, `images`, `asin`, `parent_asin`, `user_id`, `timestamp`, `helpful_vote`, `verified_purchase`. Because only the first 1,000,000 lines of each file are read, the samples follow the file order and are not random draws.

Section 2:
All three categories have exactly 1,000,000 reviews. This follows from the cap `N` and says nothing about the full size of each category.

Section 3:
The ratings are strongly skewed toward 5 stars in every category: 700,089 of 1,000,000 (70.0%) in Automotive, 640,177 (64.0%) in Video_Games, and 680,654 (68.1%) in Books. One-star reviews make up 7.7% (Automotive), 9.4% (Video_Games), and 3.4% (Books). In Automotive and Video_Games, 2 stars is the rarest rating; in Books, 1 star is. Books also contains 4 reviews with a rating of 0.0, which lies outside the 1–5 scale and has to be handled before the ratings are used as labels.

Section 4:
`title`, `text`, and `rating` contain no missing values in any category. However, `isnull()` does not count empty strings, and the minimum text length of 0 in section 6 shows that empty texts exist.

Section 5:
This section shows only sample reviews; see the output in the notebook.

Section 6:
Books reviews are the longest (mean 442.3 characters, median 168), followed by Video_Games (mean 325.4, median 127) and Automotive (mean 181.7, median 98). In all three categories the mean is well above the median and the standard deviation exceeds the mean, so the length distribution is strongly right-skewed. The longest texts have 19,982 (Automotive), 32,866 (Video_Games), and 31,508 (Books) characters. The minimum length is 0 in every category.




