# Data Science Learnings - Kaggle

A personal notebook covering the core concepts learned through Kaggle's Data Science courses.

**Topics covered:**
1. Pandas — reading data, Series, DataFrames, `describe()`, `value_counts()`
2. Scikit-learn — Decision Trees (with `max_leaf_nodes`), Random Forests
3. Model Evaluation — Mean Absolute Error (MAE)

---
## 1. Pandas

Pandas is the core library for loading and manipulating tabular data in Python.

### 1.1 Reading Data

The most common read function is `pd.read_csv()`. Pandas also supports Excel, JSON, SQL, and more.

In [None]:
import pandas as pd

# The most common way to load data:
# df = pd.read_csv('path/to/file.csv')

# Other read functions:
# pd.read_excel('data.xlsx')      -> Excel files
# pd.read_json('data.json')       -> JSON files
# pd.read_sql(query, connection)  -> SQL databases
# pd.read_parquet('data.parquet') -> Parquet files (efficient columnar format)

# For this notebook I'll load the California housing dataset through sklearn
# (downloaded and cached on first use), so no local CSV file is needed
from sklearn.datasets import fetch_california_housing
import numpy as np

housing = fetch_california_housing(as_frame=True)
df = housing.frame
df.head()
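As a small illustration of `pd.read_csv()` that does not need a file on disk, the sketch below parses CSV text held in memory. The three rows and the names `csv_text` and `toy_csv` are made up for the example; they are not taken from the real dataset file.

```python
import io
import pandas as pd

# Hypothetical CSV content, used only to show how read_csv parses text
csv_text = """HouseAge,AveRooms,MedHouseVal
41,6.98,4.526
21,6.24,3.585
52,8.29,3.521
"""

# read_csv accepts a path or any file-like object, such as StringIO
toy_csv = pd.read_csv(io.StringIO(csv_text))

print(toy_csv.dtypes)   # column types are inferred from the text
print(toy_csv.head())
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.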

### 1.2 Series

A **Series** is a single column of data — essentially a labeled one-dimensional array.

In [None]:
# Selecting a single column returns a Series
house_age = df['HouseAge']

print(type(house_age))   # <class 'pandas.core.series.Series'>
print()
print(house_age.head(10))

In [None]:
# You can also create a Series manually
manual_series = pd.Series([10, 20, 30, 40, 50], name='example')
print(manual_series)
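The "labeled" part of the definition is easier to see with a custom index. A minimal sketch, where the labels `block_a` to `block_c` and the values are invented for illustration:

```python
import pandas as pd

# Hypothetical values and labels, chosen only for illustration
ages = pd.Series(
    [41.0, 21.0, 52.0],
    index=['block_a', 'block_b', 'block_c'],
    name='HouseAge',
)

print(ages['block_b'])   # lookup by label
print(ages.iloc[0])      # lookup by position
print(ages[ages > 30])   # boolean filtering keeps the labels
```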

### 1.3 DataFrames

A **DataFrame** is a table — a collection of Series sharing the same index. Think of it as a spreadsheet in Python.

In [None]:
print(type(df))      # <class 'pandas.core.frame.DataFrame'>
print('Shape:', df.shape)   # (rows, columns)
print('Columns:', df.columns.tolist())

In [None]:
# Selecting multiple columns returns a DataFrame (not a Series)
subset = df[['HouseAge', 'AveRooms', 'MedHouseVal']]
subset.head()

In [None]:
# Useful DataFrame inspection methods
df.info()    # column names, non-null counts, dtypes
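To make the "collection of Series" idea concrete, the following sketch assembles a tiny DataFrame from two hand-written Series and then pulls a column back out. The numbers and the name `toy_df` are illustrative assumptions.

```python
import pandas as pd

# Two hypothetical Series that share the default index 0, 1, 2
rooms = pd.Series([6.9, 6.2, 8.3], name='AveRooms')
value = pd.Series([4.5, 3.6, 3.5], name='MedHouseVal')

# Placing them side by side (axis=1) produces a DataFrame
toy_df = pd.concat([rooms, value], axis=1)

print(toy_df.shape)
print(type(toy_df['AveRooms']))     # single brackets: Series
print(type(toy_df[['AveRooms']]))   # double brackets: DataFrame
```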

### 1.4 `describe()`

`describe()` gives a quick statistical summary of all numeric columns: count, mean, std, min, quartiles, and max.

In [None]:
df.describe()

In [None]:
# You can also call it on a single column (Series)
df['MedHouseVal'].describe()
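The housing data is entirely numeric, so it does not show what `describe()` does with text columns. A small sketch on a made-up frame (the `rooms` and `city` columns are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'rooms': [2, 4, 6, 8],
    'city': ['a', 'b', 'a', 'c'],
})

# By default only numeric columns are summarised
toy_summary = toy.describe()
print(toy_summary)

# include='all' adds non-numeric columns (count, unique, top, freq)
print(toy.describe(include='all'))
```

Individual statistics can be read from the result by label, for example `toy_summary.loc['mean', 'rooms']`.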

### 1.5 `value_counts()`

`value_counts()` counts how many times each unique value appears in a Series, sorted from most to least frequent. It is most useful for categorical or discrete columns.

In [None]:
# HouseAge is discrete (in years), so value_counts is useful here
df['HouseAge'].value_counts().head(10)

In [None]:
# normalize=True gives proportions instead of raw counts
df['HouseAge'].value_counts(normalize=True).head(10)
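The same two calls on a small categorical Series, where the result can be checked by eye. The `colors` data is invented for the example.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

counts = colors.value_counts()                 # raw counts, most frequent first
shares = colors.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```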

---
## 2. Prediction Models with Scikit-learn

The standard workflow in sklearn:
1. Define features (`X`) and target (`y`)
2. Split data into train/validation sets
3. Instantiate and fit the model
4. Make predictions
5. Evaluate with MAE (Mean Absolute Error)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Define features and target
feature_cols = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']

X = df[feature_cols]
y = df['MedHouseVal']   # target: median house value (in units of $100,000)

# Split into training and validation sets (80% train, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print('Training rows  :', len(X_train))
print('Validation rows:', len(X_val))
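MAE, the metric used throughout this section, is the mean of the absolute differences between actual and predicted values. A minimal worked example with made-up numbers, comparing a manual calculation against `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Absolute errors are 0.5, 0.0, 1.5 and 1.0, so the mean is 3.0 / 4
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Because the target here is in units of $100,000, an MAE of 0.5 on the housing data would correspond to an average error of about $50,000.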

### 2.1 Decision Tree

A Decision Tree splits data into branches based on feature values, arriving at a prediction at each leaf.

- **Overfitting**: a deep tree memorises the training data but performs poorly on new data.
- **Underfitting**: a shallow tree is too simple to capture patterns.
- `max_leaf_nodes` caps the number of leaves, which makes it a simple knob for this tradeoff: too few leaves pushes toward underfitting, too many toward overfitting.

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Default tree (no limit on leaves or depth, so it is likely to overfit)
dt_default = DecisionTreeRegressor(random_state=42)
dt_default.fit(X_train, y_train)

preds_default = dt_default.predict(X_val)
mae_default = mean_absolute_error(y_val, preds_default)
print(f'Decision Tree (default) — Validation MAE: {mae_default:.4f}')
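One way to see overfitting directly is to compare the training MAE with the validation MAE for an unrestricted tree. The sketch below uses synthetic data (a noisy sine curve, an assumption made so the example runs on its own) rather than the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus random noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(500, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.3, size=500)

Xt, Xv, yt, yv = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# An unrestricted tree keeps splitting until it reproduces the training targets
tree = DecisionTreeRegressor(random_state=0).fit(Xt, yt)

train_mae = mean_absolute_error(yt, tree.predict(Xt))
val_mae = mean_absolute_error(yv, tree.predict(Xv))
print(f'train MAE: {train_mae:.4f}   validation MAE: {val_mae:.4f}')
```

A training error near zero paired with a clearly larger validation error indicates that the tree has memorised the noise in the training rows.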

#### Tuning `max_leaf_nodes`

We can try different values and pick the one that gives the lowest validation MAE.

In [None]:
def get_mae(max_leaf_nodes, X_train, X_val, y_train, y_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    return mean_absolute_error(y_val, preds)

leaf_counts = [5, 10, 25, 50, 100, 250, 500, 1000]

results = {n: get_mae(n, X_train, X_val, y_train, y_val) for n in leaf_counts}

print(f"{'max_leaf_nodes':>16} | {'MAE':>8}")
print('-' * 28)
for nodes, mae in results.items():
    print(f"{nodes:>16} | {mae:>8.4f}")

best_leaf_nodes = min(results, key=results.get)
print(f"\nBest max_leaf_nodes: {best_leaf_nodes}  (MAE = {results[best_leaf_nodes]:.4f})")

In [None]:
# Train the final Decision Tree with the best max_leaf_nodes
dt_best = DecisionTreeRegressor(max_leaf_nodes=best_leaf_nodes, random_state=42)
dt_best.fit(X_train, y_train)

mae_best_dt = mean_absolute_error(y_val, dt_best.predict(X_val))
print(f'Decision Tree (max_leaf_nodes={best_leaf_nodes}) — Validation MAE: {mae_best_dt:.4f}')
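Note that `max_leaf_nodes` limits the number of leaves, not the depth; depth is only constrained indirectly. A short sketch on synthetic data (generated with `make_regression`, an assumption for self-containment) that inspects both with `get_n_leaves()` and `get_depth()`:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_syn, y_syn = make_regression(n_samples=300, n_features=3, noise=5.0, random_state=0)

leaf_report = {}
for cap in [5, 50]:
    capped = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=0).fit(X_syn, y_syn)
    # Record the actual leaf count and depth of the fitted tree
    leaf_report[cap] = (capped.get_n_leaves(), capped.get_depth())

for cap, (n_leaves, depth) in leaf_report.items():
    print(f'max_leaf_nodes={cap:>3}  leaves={n_leaves:>3}  depth={depth}')
```

To limit depth directly, `DecisionTreeRegressor` also accepts a separate `max_depth` parameter.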

### 2.2 Random Forest

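A Random Forest builds **many** decision trees, each trained on a bootstrap sample of the rows (and, depending on `max_features`, a random subset of the features at each split), then **averages** their predictions.

Averaging reduces overfitting without requiring careful tuning of `max_leaf_nodes`, so a forest with default settings usually achieves a lower validation error than a single decision tree.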

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

rf_preds = rf_model.predict(X_val)
mae_rf = mean_absolute_error(y_val, rf_preds)
print(f'Random Forest (100 trees) — Validation MAE: {mae_rf:.4f}')
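The averaging step can be checked by hand: the fitted trees are exposed in `estimators_`, and the mean of their individual predictions should match the forest's own `predict()`. A minimal sketch on synthetic data (an assumption, so the cell runs independently of the housing download):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# One row of predictions per tree, then average across trees
per_tree = np.stack([t.predict(X_demo) for t in forest.estimators_])
manual_avg = per_tree.mean(axis=0)

print(per_tree.shape)
print(np.allclose(manual_avg, forest.predict(X_demo)))
```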

---
## 3. Model Comparison Summary

In [None]:
summary = {
    'Decision Tree (default / unlimited)': mae_default,
    f'Decision Tree (max_leaf_nodes={best_leaf_nodes})': mae_best_dt,
    'Random Forest (100 trees)': mae_rf,
}

print(f"{'Model':<45} | {'Validation MAE':>14}")
print('-' * 63)
for model_name, mae in summary.items():
    print(f"{model_name:<45} | {mae:>14.4f}")

best_model = min(summary, key=summary.get)
print(f"\nBest model: {best_model}")

---
## Key Takeaways

| Concept | What it does |
|---|---|
| `pd.read_csv()` | Loads a CSV file into a DataFrame |
| **Series** | A single labeled column of data |
| **DataFrame** | A table of data (collection of Series) |
| `.describe()` | Summary statistics for numeric columns |
| `.value_counts()` | Frequency count of each unique value |
| **Decision Tree** | Splits data on feature thresholds to make predictions |
| `max_leaf_nodes` | Caps the number of leaves to balance overfitting against underfitting |
| **Random Forest** | Ensemble of many trees — generally more accurate and robust |
| **MAE** | Average absolute difference between predictions and actual values |