# Data Exploration & Processing
This notebook performs environment setup, data loading, EDA, visualizations, basic cleaning, saves processed datasets and figures, and includes a simple modeling example. It uses `winequality-red.csv` and `books_data.json` from `Mock_student_packet_v4/`.

## 1) Environment Setup
- Commands to create a venv and install dependencies.

```bash
python -m venv .venv
# Windows (PowerShell)
.venv\Scripts\Activate.ps1
# macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt
```

The next cell installs any missing Python packages and then writes the environment's pinned versions to `requirements.txt`.

In [None]:
# Install common packages if missing and write requirements.txt
import importlib, subprocess, sys

# Map pip package names to import names; they differ for scikit-learn.
reqs = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "pyarrow": "pyarrow",
    "joblib": "joblib",
}
installed = []
for pkg, module in reqs.items():
    try:
        importlib.import_module(module)
    except ImportError:
        print(f"Installing {pkg}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])
    installed.append(pkg)
# Write requirements; the context manager closes the file handle
with open('requirements.txt', 'w') as f:
    subprocess.check_call([sys.executable, "-m", "pip", "freeze", "--local"], stdout=f)
print('Installed/verified:', installed)


## 2) Configure Jupyter Kernel
To register the environment as a kernel, run the following in a terminal after activating the venv (this requires the `ipykernel` package, which is not in the install list above):

```bash
python -m ipykernel install --user --name=mock_env --display-name "Python (mock_env)"
```

Then select the kernel in VS Code's kernel picker.

In [None]:
# Quick sanity checks and magic
%matplotlib inline
print('Hello, world - notebook environment is ready')

In [None]:
# Imports and helper functions
import os
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
import joblib

ROOT = Path('.') / 'Mock_student_packet_v4'
FIG_DIR = Path('figures')
PROC_DIR = Path('data/processed')
FIG_DIR.mkdir(parents=True, exist_ok=True)
PROC_DIR.mkdir(parents=True, exist_ok=True)

sns.set(style='whitegrid')

def save_fig(fig, name, dpi=150):
    path = FIG_DIR / name
    fig.tight_layout()
    fig.savefig(path, dpi=dpi)
    print('Saved figure:', path)


In [None]:
# Data Loading
wine_path = ROOT / 'winequality-red.csv'
books_path = ROOT / 'books_data.json'

wine = pd.read_csv(wine_path, sep=';')
books = pd.read_json(books_path)

print('wine:', wine.shape)
print('books:', books.shape)

# Save quick I/O checks
wine.describe().to_csv(PROC_DIR / 'wine_describe.csv')
books.describe(include='all').to_csv(PROC_DIR / 'books_describe.csv')


In [None]:
# Basic EDA - wine
print('Wine head:')
print(wine.head().to_string(index=False))
print('\nDtypes:\n', wine.dtypes)
print('\nMissing values:\n', wine.isna().sum())

# Basic EDA - books
print('\nBooks head:')
print(books.head().to_string(index=False))
print('\nBooks missing:\n', books.isna().sum())


In [None]:
# Wine visualizations
num_cols = wine.select_dtypes(include='number').columns.tolist()
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(12,12))
for ax, col in zip(axes.flatten(), num_cols):
    sns.histplot(wine[col], ax=ax, kde=True)
    ax.set_title(col)
save_fig(fig, 'wine_histograms.png')

# Correlation heatmap
corr = wine.corr()
fig, ax = plt.subplots(figsize=(8,6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='vlag', ax=ax)
save_fig(fig, 'wine_corr_heatmap.png')

# Quality distribution
fig, ax = plt.subplots(figsize=(6,4))
sns.countplot(x='quality', data=wine, palette='rocket', ax=ax)
ax.set_title('Quality counts')
save_fig(fig, 'wine_quality_counts.png')


In [None]:
# Books visualizations
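# Price distribution (coerce to numeric first; the cleaning cell below does the same)
fig, ax = plt.subplots(figsize=(6,4))
price = pd.to_numeric(books['price'], errors='coerce')
sns.histplot(price.dropna(), kde=True, ax=ax)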
ax.set_title('Book price distribution')
save_fig(fig, 'books_price_hist.png')

# Rating distribution
fig, ax = plt.subplots(figsize=(6,4))
sns.countplot(x='rating', data=books, palette='mako', ax=ax)
ax.set_title('Book rating counts')
save_fig(fig, 'books_rating_counts.png')

# Top categories
top_cats = books['category'].value_counts().nlargest(10)
fig, ax = plt.subplots(figsize=(8,4))
sns.barplot(x=top_cats.values, y=top_cats.index, palette='crest', ax=ax)
ax.set_title('Top 10 categories')
save_fig(fig, 'books_top_categories.png')


In [None]:
# Data cleaning
# Wine: check duplicates, dtypes
before = wine.shape[0]
wine = wine.drop_duplicates()
after = wine.shape[0]
print(f'Wine rows removed by duplicates: {before - after}')

# Books: convert numeric fields, drop duplicates
before = books.shape[0]
books['price'] = pd.to_numeric(books['price'], errors='coerce')
books['rating'] = pd.to_numeric(books['rating'], errors='coerce')
books = books.drop_duplicates()
after = books.shape[0]
print(f'Books rows removed by duplicates: {before - after}')

# Report remaining missing values (nothing is filled or dropped here)
print('Books missing after conversions:\n', books.isna().sum())


In [None]:
# Save cleaned datasets
wine.to_csv(PROC_DIR / 'winequality_red_clean.csv', index=False)
wine.to_parquet(PROC_DIR / 'winequality_red_clean.parquet')
books.to_csv(PROC_DIR / 'books_data_clean.csv', index=False)
books.to_parquet(PROC_DIR / 'books_data_clean.parquet')
print('Saved cleaned datasets to', PROC_DIR)


In [None]:
# Modeling example: simple regression for wine quality
features = ['alcohol', 'sulphates', 'volatile acidity', 'citric acid']
X = wine[features]
y = wine['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
mae = mean_absolute_error(y_test, preds)
print('Wine quality MAE (LinearRegression):', round(mae, 3))
# Save model
joblib.dump(model, PROC_DIR / 'wine_quality_lr.joblib')
print('Model saved to', PROC_DIR / 'wine_quality_lr.joblib')
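
The MAE printed above is easier to judge against a trivial baseline. The following is a minimal sketch, using only the standard library and made-up toy values (not results from the wine data), of how mean absolute error is computed and how it compares with always predicting the mean. The function name `mean_absolute_error_py` is hypothetical and exists only for this illustration.

```python
def mean_absolute_error_py(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only; these are not taken from the dataset.
y_true = [5, 6, 5, 7]
y_pred = [5.5, 5.5, 5.0, 6.0]

model_mae = mean_absolute_error_py(y_true, y_pred)

# Baseline: predict the mean of the targets for every sample.
mean_target = sum(y_true) / len(y_true)
baseline_mae = mean_absolute_error_py(y_true, [mean_target] * len(y_true))

print("model MAE:", model_mae)        # 0.5
print("baseline MAE:", baseline_mae)  # 0.75
```

A model is only useful if its MAE is clearly below the baseline's; the same comparison could be run on `y_test` with the mean of `y_train` as the baseline prediction.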


## 9) Unit tests / Reproducibility
- Move testable functions to `utils.py` and add pytest tests.
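- Seed every source of randomness for reproducible splits: `random.seed`, `numpy.random.seed`, and the `random_state` argument of scikit-learn functions such as `train_test_split` (the modeling cell uses `random_state=42`).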
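
As a rough illustration of both points, the sketch below shows a small, seedable helper of the kind that could live in `utils.py`, together with a pytest-style test. The helper `seeded_split` is hypothetical and uses only the standard library; the notebook itself relies on scikit-learn's `train_test_split` for the real split.

```python
import random


def seeded_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of `items` with a private RNG and return (train, test)."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


def test_seeded_split_is_reproducible():
    """pytest would collect this function from a file such as test_utils.py."""
    data = list(range(10))
    first = seeded_split(data, seed=42)
    second = seeded_split(data, seed=42)
    assert first == second               # same seed, same split
    train, test = first
    assert len(test) == 2                # 20% of 10 items
    assert sorted(train + test) == data  # no item lost or duplicated
```

Passing the seed as an argument, rather than seeding a global generator inside the function, keeps the helper deterministic without side effects on other code.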

## 10) Debugging & Logging
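- Use `%debug` for post-mortem inspection after an exception, `pdb.set_trace()` (or the built-in `breakpoint()`) to pause at a chosen line, and the `logging` module configured at `DEBUG` level for persistent diagnostics.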
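
One possible way to set up debug-level logging in a notebook is sketched below. The logger name `data_exploration` and the placeholder row counts are assumptions made for this example; the handler check keeps messages from being duplicated when the cell is re-run.

```python
import logging


def configure_logging(level=logging.DEBUG):
    """Return a notebook logger with exactly one stream handler attached."""
    logger = logging.getLogger("data_exploration")  # hypothetical logger name
    logger.setLevel(level)
    logger.propagate = False  # avoid double output if the root logger has handlers
    if not logger.handlers:   # re-running the cell must not add a second handler
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = configure_logging()

# Placeholder counts for illustration; in the notebook these would come
# from the duplicate-removal step.
before, after = 10, 8
logger.debug("Rows removed by duplicates: %d", before - after)
```

Switching `level` to `logging.INFO` silences the debug messages without touching the call sites.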

## 11) Profiling & Optimization
- Use `%timeit` to time individual statements and `cProfile` to find hotspots, then optimize the slowest pandas operations.
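
Outside of IPython magics, `cProfile` can be driven directly from Python. The sketch below is one way to do that; `profile_call` and `slow_sum_of_squares` are hypothetical names, and the toy function stands in for whatever pandas operation is being investigated.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run `func` under cProfile; return (result, text report of the top 10 entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def slow_sum_of_squares(n):
    """Deliberately loop-based stand-in for a slow operation."""
    total = 0
    for i in range(n):
        total += i * i
    return total


result, report = profile_call(slow_sum_of_squares, 100_000)
print(report)
```

Sorting by cumulative time puts the functions that dominate the total runtime at the top of the report, which is usually where optimization pays off first.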

## 12) Exporting & CI
- Export to HTML:

```bash
jupyter nbconvert --to html data_exploration.ipynb
```

- For CI, use `papermill` or `nbval` in GitHub Actions to validate notebook execution.
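
As a dependency-light stand-in for `papermill` or `nbval`, a CI job could also execute the notebook top to bottom with `jupyter nbconvert --execute`, which exits with a non-zero status when a cell raises. The sketch below only builds and prints that command; the helper names, the output filename, and the timeout value are assumptions for illustration.

```python
import shlex
import subprocess


def build_execute_command(notebook, output="executed.ipynb", timeout=600):
    """Return the argv list that executes `notebook` with nbconvert."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        f"--ExecutePreprocessor.timeout={timeout}",
        "--output", output,
        notebook,
    ]


def run_notebook(cmd):
    """Run the command; raises CalledProcessError if execution fails."""
    subprocess.run(cmd, check=True)


cmd = build_execute_command("data_exploration.ipynb")
print(shlex.join(cmd))
# In a CI step, call run_notebook(cmd) so a failing cell fails the job.
```

Keeping command construction separate from execution makes the helper testable without Jupyter installed.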

## Final notes
This notebook is a reproducible workflow: it installs required packages, runs EDA, generates figures in `figures/`, saves processed data into `data/processed/`, and includes a simple model example.