# Data Science Learnings - Kaggle

A personal notebook covering the core concepts learned through Kaggle's Data Science courses.

**Topics covered:**
1. Pandas — reading data, Series, DataFrames, `describe()`, `value_counts()`
2. Scikit-learn — Decision Trees (with `max_leaf_nodes`), Random Forests
3. Model Evaluation — Mean Absolute Error (MAE)
4. More Pandas — filtering, sorting, missing values, adding/dropping columns, GroupBy, `loc`/`iloc`
5. Data Analysis — correlation, feature importance, cross-validation

---
## 1. Pandas

Pandas is the core library for loading and manipulating tabular data in Python.

### 1.1 Reading Data

The most common read function is `pd.read_csv()`. Pandas also supports Excel, JSON, SQL, and more.

In [1]:
import pandas as pd

# The most common way to load data:
# df = pd.read_csv('path/to/file.csv')

# Other read functions:
# pd.read_excel('data.xlsx')      -> Excel files
# pd.read_json('data.json')       -> JSON files
# pd.read_sql(query, connection)  -> SQL databases
# pd.read_parquet('data.parquet') -> Parquet files (efficient columnar format)

# For this notebook I'll load a dataset bundled with sklearn so it's self-contained
from sklearn.datasets import fetch_california_housing
import numpy as np

housing = fetch_california_housing(as_frame=True)
df = housing.frame
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
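
To try `pd.read_csv()` without a file on disk, a small sketch can read from an in-memory string instead. The CSV text and its column names are made up for illustration, and `io.StringIO` simply stands in for a file path.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for a file on disk.
csv_text = """city,rooms,price
Fresno,3,210000
Oakland,2,540000
Chico,4,330000
"""

# read_csv accepts any file-like object, not only a path.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)    # (rows, columns)
print(toy.dtypes)   # numeric columns are inferred automatically
```

The same call works unchanged with a real path, for example `pd.read_csv('path/to/file.csv')`.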


### 1.2 Series

A **Series** is a single column of data — essentially a labeled one-dimensional array.

In [2]:
# Selecting a single column returns a Series
house_age = df['HouseAge']

print(type(house_age))   # the Series class (module path shown varies by pandas version)
print()
print(house_age.head(10))

<class 'pandas.Series'>

0    41.0
1    21.0
2    52.0
3    52.0
4    52.0
5    52.0
6    52.0
7    52.0
8    42.0
9    52.0
Name: HouseAge, dtype: float64


In [3]:
# You can also create a Series manually
manual_series = pd.Series([10, 20, 30, 40, 50], name='example')
print(manual_series)

0    10
1    20
2    30
3    40
4    50
Name: example, dtype: int64
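
The "labeled" part of a Series is its index. As a minimal sketch with made-up labels and values, the same element can be reached either by its label or by its position:

```python
import pandas as pd

# Hypothetical labels and values, for illustration only.
ages = pd.Series([41.0, 21.0, 52.0], index=['a', 'b', 'c'], name='HouseAge')

print(ages['b'])      # access by label
print(ages.iloc[0])   # access by position
print(ages.mean())    # reductions run over the whole Series at once
```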


### 1.3 DataFrames

A **DataFrame** is a table — a collection of Series sharing the same index. Think of it as a spreadsheet in Python.

In [4]:
print(type(df))      # the DataFrame class (module path shown varies by pandas version)
print('Shape:', df.shape)   # (rows, columns)
print('Columns:', df.columns.tolist())

<class 'pandas.DataFrame'>
Shape: (20640, 9)
Columns: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', 'MedHouseVal']


In [5]:
# Selecting multiple columns returns a DataFrame (not a Series)
subset = df[['HouseAge', 'AveRooms', 'MedHouseVal']]
subset.head()

| | HouseAge | AveRooms | MedHouseVal |
|---|---|---|---|
| 0 | 41.0 | 6.984127 | 4.526 |
| 1 | 21.0 | 6.238137 | 3.585 |
| 2 | 52.0 | 8.288136 | 3.521 |
| 3 | 52.0 | 5.817352 | 3.413 |
| 4 | 52.0 | 6.281853 | 3.422 |


In [6]:
# Useful DataFrame inspection methods
df.info()    # column names, non-null counts, dtypes

<class 'pandas.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
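
To see the "collection of Series sharing the same index" idea directly, one option is to build a tiny DataFrame from a dict. The values and index labels below are invented for the example.

```python
import pandas as pd

# Each dict key becomes a column; all columns share one index.
toy = pd.DataFrame(
    {'HouseAge': [41.0, 21.0, 52.0], 'AveRooms': [6.98, 6.24, 8.29]},
    index=['a', 'b', 'c'],
)

col = toy['AveRooms']               # pulling out one column gives a Series
print(type(col).__name__)           # Series
print(col.index.equals(toy.index))  # the column carries the table's index
```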


### 1.4 `describe()`

`describe()` gives a quick statistical summary of all numeric columns: count, mean, std, min, quartiles, and max.

In [7]:
df.describe()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |


In [8]:
# You can also call it on a single column (Series)
df['MedHouseVal'].describe()

count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64
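
Each row of the `describe()` summary corresponds to an ordinary Series method. The sketch below uses five made-up numbers with one large outlier, which also shows how far the mean and the median (the `50%` row) can drift apart.

```python
import pandas as pd

# Hypothetical values with one large outlier.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

stats_summary = values.describe()
print(stats_summary)

# The same numbers, computed one at a time.
# Note that std() is the sample standard deviation (ddof=1).
print(values.mean(), values.median(), values.std())
```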

### 1.5 `value_counts()`

`value_counts()` counts how many times each unique value appears in a Series. It is most useful for categorical or discrete columns.

In [9]:
# HouseAge is discrete (in years), so value_counts is useful here
df['HouseAge'].value_counts().head(10)

HouseAge
52.0    1273
36.0     862
35.0     824
16.0     771
17.0     698
34.0     689
26.0     619
33.0     615
18.0     570
25.0     566
Name: count, dtype: int64

In [10]:
# normalize=True gives proportions instead of raw counts
df['HouseAge'].value_counts(normalize=True).head(10)

HouseAge
52.0    0.061676
36.0    0.041764
35.0    0.039922
16.0    0.037355
17.0    0.033818
34.0    0.033382
26.0    0.029990
33.0    0.029797
18.0    0.027616
25.0    0.027422
Name: proportion, dtype: float64
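
One detail worth knowing: `value_counts()` skips missing values unless told otherwise. A small sketch with invented data, using the `dropna=False` option to include them:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
ages = pd.Series([52.0, 36.0, 52.0, np.nan, 52.0, 36.0])

counts = ages.value_counts()                       # NaN is ignored by default
counts_with_nan = ages.value_counts(dropna=False)  # NaN gets its own row

print(counts)
print(counts_with_nan)
```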

---
## 2. Prediction Models with Scikit-learn

The standard workflow in sklearn:
1. Define features (`X`) and target (`y`)
2. Split data into train/validation sets
3. Instantiate and fit the model
4. Make predictions
5. Evaluate with MAE (Mean Absolute Error)

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Define features and target
feature_cols = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']

X = df[feature_cols]
y = df['MedHouseVal']   # target: median house value

# Split into training and validation sets (80% train, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print('Training rows  :', len(X_train))
print('Validation rows:', len(X_val))

Training rows  : 16512
Validation rows: 4128
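
Since MAE is the metric used throughout, it may help to see it computed by hand: MAE = (1/n) * sum(|y_i - prediction_i|). The four target and prediction values below are invented for the check.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true values and predictions.
y_true = np.array([2.0, 3.5, 1.0, 4.0])
y_pred = np.array([2.5, 3.0, 1.0, 5.0])

# Absolute errors are 0.5, 0.5, 0.0 and 1.0, so their mean is 0.5.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Because the errors are not squared, MAE stays in the same units as the target.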


### 2.1 Decision Tree

A Decision Tree splits data into branches based on feature values, arriving at a prediction at each leaf.

- **Overfitting**: a deep tree memorises the training data but performs poorly on new data.
- **Underfitting**: a shallow tree is too simple to capture patterns.
- `max_leaf_nodes` controls the maximum number of leaves — it's the main knob for tuning this tradeoff.

In [12]:
from sklearn.tree import DecisionTreeRegressor

# Default tree (no limit — will overfit)
dt_default = DecisionTreeRegressor(random_state=42)
dt_default.fit(X_train, y_train)

preds_default = dt_default.predict(X_val)
mae_default = mean_absolute_error(y_val, preds_default)
print(f'Decision Tree (default) — Validation MAE: {mae_default:.4f}')

Decision Tree (default) — Validation MAE: 0.6269
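
One way to make the overfitting visible is to compare the MAE on the training rows with the MAE on the validation rows. The sketch below uses synthetic data (a noisy sine curve, chosen only for illustration) rather than the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus noise (an assumption for this demo).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(500, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=500)

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# No size limit: the tree keeps splitting until every training row is fit exactly.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
val_mae = mean_absolute_error(y_va, tree.predict(X_va))
print(f'train MAE: {train_mae:.4f}   validation MAE: {val_mae:.4f}')
```

A training error of zero next to a clearly larger validation error is the typical signature of a memorising tree.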


#### Tuning `max_leaf_nodes`

We can try different values and pick the one that gives the lowest validation MAE.

In [13]:
def get_mae(max_leaf_nodes, X_train, X_val, y_train, y_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    return mean_absolute_error(y_val, preds)

leaf_counts = [5, 10, 25, 50, 100, 250, 500, 1000]

results = {n: get_mae(n, X_train, X_val, y_train, y_val) for n in leaf_counts}

print(f"{'max_leaf_nodes':>16} | {'MAE':>8}")
print('-' * 28)
for nodes, mae in results.items():
    print(f"{nodes:>16} | {mae:>8.4f}")

best_leaf_nodes = min(results, key=results.get)
print(f"\nBest max_leaf_nodes: {best_leaf_nodes}  (MAE = {results[best_leaf_nodes]:.4f})")

  max_leaf_nodes |      MAE
----------------------------
               5 |   0.6328
              10 |   0.5829
              25 |   0.5334
              50 |   0.5087
             100 |   0.4988
             250 |   0.5010
             500 |   0.5082
            1000 |   0.5241

Best max_leaf_nodes: 100  (MAE = 0.4988)


In [14]:
# Train the final Decision Tree with the best max_leaf_nodes
dt_best = DecisionTreeRegressor(max_leaf_nodes=best_leaf_nodes, random_state=42)
dt_best.fit(X_train, y_train)

mae_best_dt = mean_absolute_error(y_val, dt_best.predict(X_val))
print(f'Decision Tree (max_leaf_nodes={best_leaf_nodes}) — Validation MAE: {mae_best_dt:.4f}')

Decision Tree (max_leaf_nodes=100) — Validation MAE: 0.4988
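
To check what `max_leaf_nodes` actually constrains, a fitted tree can report its size through `get_n_leaves()` and `get_depth()`. The data below is synthetic and only serves to grow trees of different sizes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for this demo).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(500, 2))
y_toy = X_toy[:, 0] + np.sin(X_toy[:, 1]) + rng.normal(scale=0.3, size=500)

leaf_report = {}
for cap in [5, 50, None]:   # None means no limit
    tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=0).fit(X_toy, y_toy)
    leaf_report[cap] = (tree.get_n_leaves(), tree.get_depth())
    print(f'max_leaf_nodes={cap}: leaves={leaf_report[cap][0]}, depth={leaf_report[cap][1]}')
```

The parameter caps the number of leaves; depth is only limited indirectly (a separate `max_depth` parameter exists for that).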


### 2.2 Random Forest

A Random Forest builds **many** decision trees on random subsets of the data and features, then **averages** their predictions.

This reduces overfitting without requiring careful tuning of `max_leaf_nodes`. With default settings it usually outperforms a single decision tree.

In [15]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

rf_preds = rf_model.predict(X_val)
mae_rf = mean_absolute_error(y_val, rf_preds)
print(f'Random Forest (100 trees) — Validation MAE: {mae_rf:.4f}')

Random Forest (100 trees) — Validation MAE: 0.4619
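
The averaging step can be verified directly: a fitted forest exposes its individual trees as `estimators_`, and the mean of their predictions should match `predict()`. This is a sketch on synthetic data with a small forest, not the model trained above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for this demo).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 3))
y_toy = 2 * X_toy[:, 0] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# One row of predictions per tree, for the first five samples.
per_tree = np.stack([t.predict(X_toy[:5]) for t in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_pred = forest.predict(X_toy[:5])

print(np.allclose(manual_average, forest_pred))
```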


---
## 3. Model Comparison Summary

In [16]:
summary = {
    'Decision Tree (default / unlimited)': mae_default,
    f'Decision Tree (max_leaf_nodes={best_leaf_nodes})': mae_best_dt,
    'Random Forest (100 trees)': mae_rf,
}

print(f"{'Model':<45} | {'Validation MAE':>14}")
print('-' * 63)
for model_name, mae in summary.items():
    print(f"{model_name:<45} | {mae:>14.4f}")

best_model = min(summary, key=summary.get)
print(f"\nBest model: {best_model}")

Model                                         | Validation MAE
---------------------------------------------------------------
Decision Tree (default / unlimited)           |         0.6269
Decision Tree (max_leaf_nodes=100)            |         0.4988
Random Forest (100 trees)                     |         0.4619

Best model: Random Forest (100 trees)


---
## Checkpoint — Sections 1–3 Summary

| Concept | What it does |
|---|---|
| `pd.read_csv()` | Loads a CSV file into a DataFrame |
| **Series** | A single labeled column of data |
| **DataFrame** | A table of data (collection of Series) |
| `.describe()` | Summary statistics for numeric columns |
| `.value_counts()` | Frequency count of each unique value |
| **Decision Tree** | Splits data on feature thresholds to make predictions |
| `max_leaf_nodes` | Caps the number of leaves to control overfitting/underfitting |
| **Random Forest** | Ensemble of many trees — generally more accurate and robust |
| **MAE** | Average absolute difference between predictions and actual values |

---
## 4. More Pandas Operations

This section covers the everyday operations you'll use in almost every data project:
- Filtering rows with boolean indexing
- Sorting data
- Handling missing values
- Adding and dropping columns
- GroupBy aggregations
- Selecting rows and columns precisely with `loc` and `iloc`

### 4.1 Filtering Rows (Boolean Indexing)

You filter a DataFrame by passing a **boolean condition** inside `df[...]`. The condition is evaluated for every row at once and produces a boolean Series; pandas keeps only the rows where it is `True`.

Use `&` (and), `|` (or), and `~` (not) to combine conditions — always wrap each condition in parentheses.

In [17]:
# Single condition: houses with median income above 8
high_income = df[df['MedInc'] > 8]
print(f'High-income rows: {len(high_income)} out of {len(df)}')
high_income.head()

High-income rows: 690 out of 20640


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 131 | 11.6017 | 18.0 | 8.335052 | 1.082474 | 533.0 | 2.747423 | 37.84 | -122.19 | 3.926 |
| 134 | 8.2049 | 28.0 | 6.978947 | 0.968421 | 463.0 | 2.436842 | 37.83 | -122.19 | 3.352 |
| 135 | 8.401 | 26.0 | 7.530806 | 1.056872 | 542.0 | 2.56872 | 37.83 | -122.2 | 3.512 |


In [18]:
# Multiple conditions — use & (and) or | (or)
# Important: each condition MUST be wrapped in parentheses

# Houses that are old AND expensive
old_and_expensive = df[(df['HouseAge'] >= 40) & (df['MedHouseVal'] >= 4.0)]
print(f'Old AND expensive: {len(old_and_expensive)} rows')

# Houses that are very new OR very cheap
new_or_cheap = df[(df['HouseAge'] <= 5) | (df['MedHouseVal'] <= 0.5)]
print(f'New OR cheap:      {len(new_or_cheap)} rows')

Old AND expensive: 570 rows
New OR cheap:      766 rows


In [19]:
# The ~ operator negates a condition (NOT)
not_old = df[~(df['HouseAge'] >= 40)]
print(f'Houses NOT 40+ years old: {len(not_old)} rows')

# .isin() checks membership in a list — useful for categorical / discrete columns
selected_ages = df[df['HouseAge'].isin([10, 20, 30, 40, 50])]
print(f'Exactly 10, 20, 30, 40, or 50 years old: {len(selected_ages)} rows')

Houses NOT 40+ years old: 16458 rows
Exactly 10, 20, 30, 40, or 50 years old: 1645 rows
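
It can help to look at the boolean mask on its own, and to see what happens when Python's `and` is used in place of `&`. The four-row DataFrame here is invented for the demonstration.

```python
import pandas as pd

# Hypothetical mini dataset.
toy = pd.DataFrame({'HouseAge': [5, 45, 30, 50],
                    'MedHouseVal': [1.2, 4.5, 2.0, 0.4]})

# The mask is just a Series of True/False values, one per row.
mask = (toy['HouseAge'] >= 40) & (toy['MedHouseVal'] >= 4.0)
print(mask.tolist())
print(toy[mask])

# Python's `and` asks for a single True/False, which a Series cannot give.
try:
    toy[(toy['HouseAge'] >= 40) and (toy['MedHouseVal'] >= 4.0)]
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)
```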


### 4.2 Sorting

`sort_values()` sorts a DataFrame by one or more columns.
Use `ascending=False` to get the largest values first.

In [20]:
# Sort by median house value, highest first
df.sort_values('MedHouseVal', ascending=False).head(5)

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 5253 | 13.2935 | 27.0 | 7.607143 | 1.012673 | 2336.0 | 2.691244 | 34.11 | -118.49 | 5.00001 |
| 5254 | 10.7937 | 29.0 | 7.471787 | 1.217868 | 1500.0 | 2.351097 | 34.07 | -118.48 | 5.00001 |
| 5255 | 8.5153 | 40.0 | 6.407266 | 0.92543 | 1564.0 | 2.99044 | 34.07 | -118.48 | 5.00001 |
| 5256 | 12.8665 | 37.0 | 7.457565 | 1.012915 | 1318.0 | 2.431734 | 34.07 | -118.48 | 5.00001 |
| 5257 | 15.0001 | 42.0 | 9.229032 | 1.16129 | 829.0 | 2.674194 | 34.06 | -118.49 | 5.00001 |


In [21]:
# Sort by multiple columns:
# first by HouseAge (ascending), then by MedHouseVal (descending within each age group)
df.sort_values(['HouseAge', 'MedHouseVal'], ascending=[True, False]).head(8)

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 18972 | 5.2636 | 1.0 | 7.69403 | 1.279851 | 872.0 | 3.253731 | 38.23 | -122.0 | 1.913 |
| 19536 | 4.25 | 1.0 | 20.125 | 2.928571 | 402.0 | 3.589286 | 37.65 | -120.93 | 1.892 |
| 3130 | 4.875 | 1.0 | 5.533333 | 1.0 | 32.0 | 2.133333 | 35.08 | -117.95 | 1.417 |
| 12286 | 1.625 | 1.0 | 3.0 | 1.0 | 8.0 | 4.0 | 33.86 | -116.95 | 0.55 |
| 10336 | 7.1193 | 2.0 | 9.09375 | 1.09375 | 199.0 | 3.109375 | 33.81 | -117.76 | 5.00001 |
| 10519 | 10.1122 | 2.0 | 8.786655 | 1.27568 | 3550.0 | 3.116769 | 33.57 | -117.68 | 5.00001 |
| 13177 | 8.4411 | 2.0 | 10.296296 | 1.166667 | 179.0 | 3.314815 | 33.97 | -117.78 | 5.00001 |
| 10376 | 10.1531 | 2.0 | 9.906329 | 1.13038 | 2985.0 | 3.778481 | 33.64 | -117.62 | 4.841 |
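
As the outputs above suggest, sorting returns a new DataFrame and the original index labels travel with their rows. A minimal sketch on made-up data, including `reset_index(drop=True)` for when a fresh 0..n-1 index is wanted:

```python
import pandas as pd

# Hypothetical mini dataset with the default 0, 1, 2 index.
toy = pd.DataFrame({'HouseAge': [41, 21, 52],
                    'MedHouseVal': [4.5, 3.6, 3.5]})

by_age = toy.sort_values('HouseAge')
print(by_age.index.tolist())   # labels are reordered along with the rows
print(toy.index.tolist())      # the original DataFrame is untouched

# Renumber the rows after sorting if the old labels are not needed.
renumbered = by_age.reset_index(drop=True)
print(renumbered)
```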


### 4.3 Handling Missing Values

Real-world data almost always contains missing values, which pandas represents as `NaN` (Not a Number).
The California Housing dataset is clean, so we introduce some artificially for practice.

| Method | What it does |
|---|---|
| `df.isnull()` | Boolean DataFrame — `True` where values are missing |
| `df.isnull().sum()` | Count missing values per column |
| `df.dropna()` | Remove rows with **any** missing value |
| `df.fillna(value)` | Replace missing values with a given number or statistic |

In [22]:
import numpy as np

# Create a copy with artificial NaNs
df_missing = df.copy()
df_missing.loc[0:99, 'MedInc'] = np.nan        # 100 missing in MedInc
df_missing.loc[500:549, 'AveRooms'] = np.nan   #  50 missing in AveRooms

# Count missing values per column
missing_counts = df_missing.isnull().sum()
print("Missing values per column:")
print(missing_counts[missing_counts > 0])

Missing values per column:
MedInc      100
AveRooms     50
dtype: int64


In [23]:
# Strategy 1 — dropna(): remove any row that has at least one missing value
df_dropped = df_missing.dropna()
print(f"Original rows : {len(df_missing)}")
print(f"After dropna(): {len(df_dropped)}")
print(f"Rows removed  : {len(df_missing) - len(df_dropped)}")

Original rows : 20640
After dropna(): 20490
Rows removed  : 150


In [24]:
# Strategy 2 — fillna(): fill missing values with the column median
# The median is often preferred over the mean because it is less sensitive to outliers.
df_filled = df_missing.copy()
for col in df_filled.columns:
    if df_filled[col].isnull().any():
        median_val = df_filled[col].median()
        df_filled[col] = df_filled[col].fillna(median_val)
        print(f"Filled '{col}' with median = {median_val:.4f}")

print(f"\nMissing values remaining: {df_filled.isnull().sum().sum()}")

Filled 'MedInc' with median = 3.5422
Filled 'AveRooms' with median = 5.2301

Missing values remaining: 0
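
The column-by-column loop can also be written as a single call, because `fillna()` accepts a Series of per-column fill values. This is an alternative sketch on a tiny invented DataFrame; the extreme value 100.0 is there to show why the median and mean can differ a lot.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN per column and one outlier in MedInc.
toy = pd.DataFrame({'MedInc': [3.0, np.nan, 5.0, 100.0],
                    'AveRooms': [5.0, 6.0, np.nan, 7.0]})

medians = toy.median(numeric_only=True)   # one median per column
filled = toy.fillna(medians)              # each column gets its own median

print(medians)
print(filled)
print('Mean of MedInc would have been:', toy['MedInc'].mean())
```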


### 4.4 Adding and Dropping Columns

Creating new columns from existing ones is called **feature engineering** — often the single
most impactful step in a data science project.

In [25]:
df2 = df.copy()

# Rooms per bedroom — a ratio that captures housing density
df2['RoomsPerBedroom'] = df2['AveRooms'] / df2['AveBedrms']

# Income per occupant — purchasing power adjusted for household size
df2['IncomePerOccupant'] = df2['MedInc'] / df2['AveOccup']

print("New engineered columns:")
df2[['AveRooms', 'AveBedrms', 'RoomsPerBedroom',
     'MedInc', 'AveOccup', 'IncomePerOccupant']].head()

New engineered columns:


| | AveRooms | AveBedrms | RoomsPerBedroom | MedInc | AveOccup | IncomePerOccupant |
|---|---|---|---|---|---|---|
| 0 | 6.984127 | 1.02381 | 6.821705 | 8.3252 | 2.555556 | 3.257687 |
| 1 | 6.238137 | 0.97188 | 6.418626 | 8.3014 | 2.109842 | 3.934608 |
| 2 | 8.288136 | 1.073446 | 7.721053 | 7.2574 | 2.80226 | 2.589838 |
| 3 | 5.817352 | 1.073059 | 5.421277 | 5.6431 | 2.547945 | 2.214765 |
| 4 | 6.281853 | 1.081081 | 5.810714 | 3.8462 | 2.181467 | 1.763125 |


In [26]:
# Dropping columns — use drop(columns=[...])
# Default: returns a new DataFrame (does NOT modify df2 in place)
df_reduced = df2.drop(columns=['Latitude', 'Longitude', 'IncomePerOccupant'])
print("Remaining columns:", df_reduced.columns.tolist())

Remaining columns: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'MedHouseVal', 'RoomsPerBedroom']
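
Both operations can be done without touching the original DataFrame. The sketch below uses `assign()` as an alternative to `df['new'] = ...` and confirms that `drop()` returns a copy; the two-row DataFrame is invented.

```python
import pandas as pd

# Hypothetical mini dataset.
toy = pd.DataFrame({'AveRooms': [6.0, 8.0], 'AveBedrms': [1.0, 2.0]})

# assign() returns a new DataFrame with the extra column.
with_ratio = toy.assign(RoomsPerBedroom=toy['AveRooms'] / toy['AveBedrms'])

# drop() also returns a new DataFrame; with_ratio keeps all its columns.
reduced = with_ratio.drop(columns=['AveBedrms'])

print(toy.columns.tolist())
print(with_ratio.columns.tolist())
print(reduced.columns.tolist())
```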


### 4.5 GroupBy

`groupby()` is pandas' equivalent of SQL's `GROUP BY`:
1. **Split** the DataFrame into groups by a column's values
2. **Apply** an aggregation function to each group
3. **Combine** the results into a new DataFrame

Common aggregations: `mean()`, `median()`, `sum()`, `count()`, `min()`, `max()`, `std()`

In [27]:
# Create age buckets so groups are more meaningful
df2['AgeBucket'] = pd.cut(
    df2['HouseAge'],
    bins=[0, 10, 20, 30, 40, 52],
    labels=['0-10', '11-20', '21-30', '31-40', '41-52']
)

# Average house value per age group
avg_val_by_age = df2.groupby('AgeBucket', observed=True)['MedHouseVal'].mean()
print("Average median house value by age bucket:")
print(avg_val_by_age.round(3))

Average median house value by age bucket:
AgeBucket
0-10     2.003
11-20    1.912
21-30    2.068
31-40    2.067
41-52    2.290
Name: MedHouseVal, dtype: float64


In [28]:
# Multiple aggregations at once with .agg()
stats = (df2.groupby('AgeBucket', observed=True)['MedHouseVal']
           .agg(['mean', 'median', 'std', 'count']))
stats.columns = ['Mean', 'Median', 'Std Dev', 'Count']
print(stats.round(3))

            Mean  Median  Std Dev  Count
AgeBucket                               
0-10       2.003   1.704    1.031   1569
11-20      1.912   1.661    1.018   4724
21-30      2.068   1.848    1.150   4852
31-40      2.067   1.800    1.165   5617
41-52      2.290   1.975    1.301   3878


In [29]:
# Aggregate multiple columns at once (still grouping by a single column)
by_age = (df2.groupby('AgeBucket', observed=True)[['MedHouseVal', 'MedInc']]
            .mean()
            .round(3))
print(by_age)

           MedHouseVal  MedInc
AgeBucket                     
0-10             2.003   4.554
11-20            1.912   4.011
21-30            2.068   3.871
31-40            2.067   3.774
41-52            2.290   3.563
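
Grouping by more than one key works the same way: pass a list of column names. The sketch below also uses named aggregation to label the output columns directly. The `Region` column and all values are hypothetical, since the housing data has no such column.

```python
import pandas as pd

# Hypothetical data with two grouping keys.
toy = pd.DataFrame({
    'AgeBucket':   ['old', 'old', 'new', 'new', 'new'],
    'Region':      ['coast', 'inland', 'coast', 'coast', 'inland'],
    'MedHouseVal': [4.0, 2.0, 3.0, 5.0, 1.0],
})

# One output row per (AgeBucket, Region) combination.
grouped_stats = (toy.groupby(['AgeBucket', 'Region'])
                    .agg(mean_val=('MedHouseVal', 'mean'),
                         n_rows=('MedHouseVal', 'count')))
print(grouped_stats)
```

The result has a two-level index, so a single group is selected with a tuple, for example `grouped_stats.loc[('new', 'coast')]`.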


### 4.6 `loc` and `iloc` — Precise Row/Column Selection

| Accessor | Selects by | Slice upper bound |
|---|---|---|
| `.loc[rows, cols]` | **Labels** (index values, column names) | **Inclusive** |
| `.iloc[rows, cols]` | **Integer positions** (0-based) | **Exclusive** |

They both accept slices, lists, and boolean arrays.

In [30]:
# .loc — label-based selection
# Rows with index labels 10 through 14 (both ends INCLUSIVE)
# Columns selected by name
print("loc — rows 10:14, columns MedInc and MedHouseVal:")
print(df.loc[10:14, ['MedInc', 'MedHouseVal']])

loc — rows 10:14, columns MedInc and MedHouseVal:
    MedInc  MedHouseVal
10  3.2031        2.815
11  3.2705        2.418
12  3.0750        2.135
13  2.6736        1.913
14  1.9167        1.592


In [31]:
# .iloc — position-based selection
# Rows at positions 10, 11, 12, 13, 14 (upper bound 15 is EXCLUSIVE)
# Columns at positions 0 (MedInc) and 8 (MedHouseVal)
print("iloc — positions 10:15, column positions 0 and 8:")
print(df.iloc[10:15, [0, 8]])

iloc — positions 10:15, column positions 0 and 8:
    MedInc  MedHouseVal
10  3.2031        2.815
11  3.2705        2.418
12  3.0750        2.135
13  2.6736        1.913
14  1.9167        1.592


In [32]:
# Combining loc with a boolean mask
high_value_mask = df['MedHouseVal'] > 4.5
expensive = df.loc[high_value_mask, ['MedInc', 'HouseAge', 'MedHouseVal']]
print(f"Houses with MedHouseVal > 4.5: {len(expensive)} rows")
expensive.head()

Houses with MedHouseVal > 4.5: 1257 rows


| | MedInc | HouseAge | MedHouseVal |
|---|---|---|---|
| 0 | 8.3252 | 41.0 | 4.526 |
| 89 | 1.2434 | 52.0 | 5.00001 |
| 140 | 6.3624 | 30.0 | 4.833 |
| 459 | 1.1696 | 52.0 | 5.00001 |
| 489 | 3.0417 | 48.0 | 4.896 |
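
On the housing DataFrame, labels and positions happen to coincide, so `loc` and `iloc` look interchangeable. They stop agreeing as soon as rows are reordered or filtered. A small sketch with an invented index that starts at 10:

```python
import pandas as pd

# Hypothetical data whose labels (10..13) differ from positions (0..3).
toy = pd.DataFrame({'MedHouseVal': [4.5, 3.6, 3.5, 3.4]},
                   index=[10, 11, 12, 13])

cheapest_first = toy.sort_values('MedHouseVal')   # label order: 13, 12, 11, 10

print(cheapest_first.iloc[0])    # first row by position (label 13)
print(cheapest_first.loc[10])    # row labelled 10 (now in last position)

# Slice bounds: loc includes the end label, iloc excludes the end position.
print(len(toy.loc[10:12]), len(toy.iloc[0:2]))
```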


---
## 5. Data Analysis

Before building models, good data scientists **explore** their data first.
This section covers:
- **Correlation** — which features move together?
- **Feature importance** — which features does the model rely on most?
- **Cross-validation** — a more reliable way to estimate model performance

### 5.1 Correlation

**Pearson correlation** measures how strongly two numeric variables are *linearly* related.

| Value | Meaning |
|---|---|
| +1.0 | Perfect positive relationship (both go up together) |
| 0.0 | No linear relationship |
| -1.0 | Perfect inverse relationship (one goes up, the other down) |

`df.corr()` computes the correlation for every pair of numeric columns at once.

In [33]:
corr_matrix = df.corr(numeric_only=True)
corr_matrix.round(2)

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| MedInc | 1.0 | -0.12 | 0.33 | -0.06 | 0.0 | 0.02 | -0.08 | -0.02 | 0.69 |
| HouseAge | -0.12 | 1.0 | -0.15 | -0.08 | -0.3 | 0.01 | 0.01 | -0.11 | 0.11 |
| AveRooms | 0.33 | -0.15 | 1.0 | 0.85 | -0.07 | -0.0 | 0.11 | -0.03 | 0.15 |
| AveBedrms | -0.06 | -0.08 | 0.85 | 1.0 | -0.07 | -0.01 | 0.07 | 0.01 | -0.05 |
| Population | 0.0 | -0.3 | -0.07 | -0.07 | 1.0 | 0.07 | -0.11 | 0.1 | -0.02 |
| AveOccup | 0.02 | 0.01 | -0.0 | -0.01 | 0.07 | 1.0 | 0.0 | 0.0 | -0.02 |
| Latitude | -0.08 | 0.01 | 0.11 | 0.07 | -0.11 | 0.0 | 1.0 | -0.92 | -0.14 |
| Longitude | -0.02 | -0.11 | -0.03 | 0.01 | 0.1 | 0.0 | -0.92 | 1.0 | -0.05 |
| MedHouseVal | 0.69 | 0.11 | 0.15 | -0.05 | -0.02 | -0.02 | -0.14 | -0.05 | 1.0 |


In [34]:
# Most useful view: correlation of EACH FEATURE with the TARGET
target_corr = corr_matrix['MedHouseVal'].drop('MedHouseVal').sort_values(ascending=False)

print("Correlation with MedHouseVal (target):")
print(target_corr.round(3))
print()
print("Tip: features with |r| > 0.3 are generally worth keeping.")
print("     features near 0 may not help the model much.")

Correlation with MedHouseVal (target):
MedInc        0.688
AveRooms      0.152
HouseAge      0.106
AveOccup     -0.024
Population   -0.025
Longitude    -0.046
AveBedrms    -0.047
Latitude     -0.144
Name: MedHouseVal, dtype: float64

Tip: features with |r| > 0.3 are generally worth keeping.
     features near 0 may not help the model much.


In [35]:
# Check for highly correlated FEATURE pairs (multicollinearity)
# When two features are highly correlated, they carry redundant information.
high_corr_pairs = []
cols = [c for c in corr_matrix.columns if c != 'MedHouseVal']
for i, c1 in enumerate(cols):
    for c2 in cols[i + 1:]:
        r = corr_matrix.loc[c1, c2]
        if abs(r) > 0.5:
            high_corr_pairs.append((c1, c2, round(r, 3)))

if high_corr_pairs:
    print("Highly correlated feature pairs (|r| > 0.5):")
    for c1, c2, r in sorted(high_corr_pairs, key=lambda x: abs(x[2]), reverse=True):
        print(f"  {c1:<15} <-> {c2:<15}  r = {r}")
else:
    print("No pairs with |r| > 0.5 found among features.")

Highly correlated feature pairs (|r| > 0.5):
  Latitude        <-> Longitude        r = -0.925
  AveRooms        <-> AveBedrms        r = 0.848
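
Because Pearson correlation only measures *linear* association, a feature can have r near 0 and still be strongly related to the target. A synthetic sketch: `quadratic` is fully determined by `x`, yet its correlation with `x` is essentially zero.

```python
import numpy as np
import pandas as pd

# Synthetic data, symmetric around zero (an assumption for this demo).
x = np.linspace(-3, 3, 61)
toy = pd.DataFrame({'x': x,
                    'linear': 2 * x + 1,     # straight-line relationship
                    'quadratic': x ** 2})    # perfect but non-linear relationship

r = toy.corr()
print(r.round(3))
```

This is one reason a low |r| with the target is a hint rather than proof that a feature is useless, especially for tree-based models.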


### 5.2 Feature Importance

After training a Random Forest, `.feature_importances_` reports how much each feature
reduced impurity (for a regressor, squared error) at the splits that used it, averaged over all trees.
These scores are computed from the training data.

- Values are between 0 and 1
- Values are normalised so that they sum to 1.0
- Higher = more important

In [36]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

feature_cols = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']
X = df[feature_cols]
y = df['MedHouseVal']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

importances = pd.Series(rf.feature_importances_, index=feature_cols).sort_values(ascending=False)

print("Feature Importances (Random Forest):")
print(f"{'Feature':<15} | {'Importance':>10} | Relative bar")
print("-" * 50)
for feat, score in importances.items():
    bar = "#" * int(score * 60)
    print(f"{feat:<15} | {score:>10.4f} | {bar}")

Feature Importances (Random Forest):
Feature         | Importance | Relative bar
--------------------------------------------------
MedInc          |     0.5617 | #################################
AveOccup        |     0.1625 | #########
AveRooms        |     0.0798 | ####
HouseAge        |     0.0779 | ####
Population      |     0.0596 | ###
AveBedrms       |     0.0585 | ###


In [37]:
# Interpretation:
# MedInc (median income) is by far the most important predictor.
# This makes intuitive sense — richer neighbourhoods have higher house prices.
# AveOccup (average occupants) ranks second, capturing housing density effects.
# AveBedrms has very low importance — it adds little beyond what AveRooms already captures.

# Features with near-zero importance are candidates to DROP: they add little to the
# predictions, and models with fewer features are faster to train and easier to maintain.

least_important = importances.tail(3)
print("Least important features:")
for feat, score in least_important.items():
    print(f"  {feat}: {score:.4f}")

Least important features:
  HouseAge: 0.0779
  Population: 0.0596
  AveBedrms: 0.0585
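
A quick sanity check of these properties is possible on synthetic data where the answer is known in advance. In the sketch below only the column named `signal` drives the target; the column names and data are invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: one informative feature and two pure-noise features.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(400, 3)),
                     columns=['signal', 'noise_a', 'noise_b'])
y_toy = 3 * X_toy['signal'] + rng.normal(scale=0.1, size=400)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)
toy_importances = pd.Series(forest.feature_importances_, index=X_toy.columns)

print(toy_importances.sort_values(ascending=False).round(4))
print('Sum of importances:', toy_importances.sum())
```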


### 5.3 Cross-Validation

A single train/validation split gives an MAE that depends on **which rows** landed in the validation set.
Part of that number is luck, not a stable measurement.

**K-Fold Cross-Validation** repeats the evaluation *k* times, rotating which fold acts as validation:

```
Fold 1:  [VAL][---][---][---][---]   -> MAE_1
Fold 2:  [---][VAL][---][---][---]   -> MAE_2
Fold 3:  [---][---][VAL][---][---]   -> MAE_3
Fold 4:  [---][---][---][VAL][---]   -> MAE_4
Fold 5:  [---][---][---][---][VAL]   -> MAE_5

Final estimate = mean(MAE_1 … MAE_5)
Stability      = std(MAE_1 … MAE_5)   <- low std = stable model
```

In [38]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X = df[['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']]
y = df['MedHouseVal']

# scoring='neg_mean_absolute_error': sklearn returns negative MAE by convention
# so we negate the result to get positive values
dt_scores = -cross_val_score(
    DecisionTreeRegressor(max_leaf_nodes=100, random_state=42),
    X, y, cv=5, scoring='neg_mean_absolute_error'
)

rf_scores = -cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=42),
    X, y, cv=5, scoring='neg_mean_absolute_error'
)

print("5-Fold Cross-Validation MAE")
print(f"{'Fold':<6} | {'Decision Tree':>14} | {'Random Forest':>14}")
print("-" * 42)
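# Loop names dt_mae / rf_mae avoid overwriting the `rf` model fitted in section 5.2
for i, (dt_mae, rf_mae) in enumerate(zip(dt_scores, rf_scores), 1):
    print(f"  {i}    | {dt_mae:>14.4f} | {rf_mae:>14.4f}")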
print("-" * 42)
print(f"{'Mean':>6} | {dt_scores.mean():>14.4f} | {rf_scores.mean():>14.4f}")
print(f"{'Std':>6} | {dt_scores.std():>14.4f} | {rf_scores.std():>14.4f}")

5-Fold Cross-Validation MAE
Fold   |  Decision Tree |  Random Forest
------------------------------------------
  1    |         0.5262 |         0.4896
  2    |         0.4979 |         0.4619
  3    |         0.5003 |         0.4700
  4    |         0.5784 |         0.5433
  5    |         0.5499 |         0.5173
------------------------------------------
  Mean |         0.5306 |         0.4964
   Std |         0.0305 |         0.0302


In [39]:
print("Takeaways:")
print(f"  Decision Tree — Mean MAE: {dt_scores.mean():.4f}, Std: {dt_scores.std():.4f}")
print(f"  Random Forest — Mean MAE: {rf_scores.mean():.4f}, Std: {rf_scores.std():.4f}")
print()
print("Random Forest comes out ahead:")
print("  * Lower mean MAE   -> more accurate predictions")
print("  * Similar std      -> both models vary about equally across folds")

Takeaways:
  Decision Tree — Mean MAE: 0.5306, Std: 0.0305
  Random Forest — Mean MAE: 0.4964, Std: 0.0302

Random Forest comes out ahead:
  * Lower mean MAE   -> more accurate predictions
  * Similar std      -> both models vary about equally across folds
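
The fold rotation can also be written out as an explicit loop, which makes clear what `cross_val_score` does internally. For a regressor, passing an integer such as `cv=5` uses `KFold` without shuffling, so the folds are contiguous blocks of rows. The sketch below passes a shuffled `KFold` splitter instead; it runs on synthetic data, and the model settings are arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this demo).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = X_toy[:, 0] + rng.normal(scale=0.5, size=200)

# Shuffled 5-fold splitter; a fixed random_state keeps the folds reproducible.
splitter = KFold(n_splits=5, shuffle=True, random_state=0)

manual_maes = []
for train_idx, val_idx in splitter.split(X_toy):
    model = DecisionTreeRegressor(max_leaf_nodes=20, random_state=0)
    model.fit(X_toy[train_idx], y_toy[train_idx])       # train on 4 folds
    preds = model.predict(X_toy[val_idx])               # predict the held-out fold
    manual_maes.append(mean_absolute_error(y_toy[val_idx], preds))

# The helper gives the same numbers when handed the same splitter.
helper_maes = -cross_val_score(
    DecisionTreeRegressor(max_leaf_nodes=20, random_state=0),
    X_toy, y_toy, cv=splitter, scoring='neg_mean_absolute_error',
)

print(np.round(manual_maes, 4))
print(np.round(helper_maes, 4))
```

Shuffling matters most when the rows are stored in a meaningful order, since unshuffled folds then differ systematically from one another.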


---
## Key Takeaways

### Pandas Cheat Sheet
| Task | Method | Notes |
|---|---|---|
| Load data | `pd.read_csv()` | Also: `read_excel`, `read_json`, `read_parquet` |
| Inspect | `.shape`, `.info()`, `.head()` | Always do this first |
| Statistics | `.describe()` | Count, mean, std, quartiles |
| Frequencies | `.value_counts()` | Add `normalize=True` for proportions |
| Filter rows | `df[df['col'] > x]` | Boolean indexing |
| Combine filters | `(cond1) & (cond2)` | Use `&`, `\|`, `~` — never `and`/`or` |
| Sort | `.sort_values('col', ascending=False)` | Multi-column: pass a list |
| Missing values | `.isnull().sum()` → `.dropna()` / `.fillna()` | Check early, fix before modelling |
| New column | `df['new'] = expression` | Feature engineering |
| Drop column | `.drop(columns=['col'])` | Remove noise |
| Aggregate groups | `.groupby('col').agg(...)` | Split-apply-combine |
| Label selection | `.loc[rows, cols]` | Inclusive slicing |
| Position selection | `.iloc[rows, cols]` | Exclusive upper bound |

### Analysis & Modelling
| Concept | What it tells you |
|---|---|
| **Correlation matrix** | Which variables are linearly related |
| **Feature importance** | Which features the model relies on most |
| **Cross-validation (CV)** | More reliable performance estimate than a single split |
| **Low CV std** | Performance is stable: it depends little on which rows fall in each fold |
| **MAE** | Average absolute error — in the same units as the target |
