# Log Transform on Skewed Numerical Features

This notebook builds a static sample DataFrame with 10 numeric columns (including `Distance`) and applies a logarithmic transformation to the columns whose skewness is high. We use `np.log1p`, which computes log(1 + x), so zeros are handled safely: log1p(0) = log(1) = 0.

Steps:
- Build a small, static dataset (no randomness)
- Measure skewness of numeric columns
- Apply `log1p` to columns with |skew| > 0.75
- Compare skewness before vs after
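
To illustrate why `log1p` is used instead of a plain `log`, here is a minimal sketch on a few illustrative values (not the dataset built below). It matters here because `Distance` contains a 0.

```python
import numpy as np

x = np.array([0.0, 1.0, 9.0])  # illustrative values only

# log1p(x) computes log(1 + x), so zero maps to zero
log1p_values = np.log1p(x)
print(log1p_values)

# A plain log diverges at zero; the warning is silenced for this demo
with np.errstate(divide="ignore"):
    log_values = np.log(x)
print(log_values)
```

The first entry is 0 under `log1p` but negative infinity under `log`, which would break the skewness calculation later on.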

In [1]:
# Imports
import pandas as pd
import numpy as np

# Create a static DataFrame with 10 numeric columns (no randomness)
# Columns: Distance, Fare, Age, Income, Items, Rating, Duration, Area, Steps, Views
# Values are chosen to include some skew (e.g., a few larger numbers)

data = {
    "Distance": [0, 1, 2, 3, 5, 8, 13, 21, 34, 55],          # Fibonacci-like growth (skewed)
    "Fare":     [2, 2, 3, 3, 5, 8, 8, 13, 21, 34],           # right-skewed
    "Age":      [18, 22, 25, 27, 30, 34, 38, 45, 52, 60],    # mild right skew (stays below the threshold)
    "Income":   [15_000, 18_000, 22_000, 25_000, 30_000, 40_000, 60_000, 90_000, 140_000, 220_000],
    "Items":    [1, 1, 1, 2, 2, 3, 5, 8, 13, 21],            # many small, few large
    "Rating":   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],              # low skew
    "Duration": [5, 7, 9, 12, 18, 27, 41, 62, 94, 141],      # multiplicative growth
    "Area":     [30, 35, 40, 50, 65, 85, 110, 150, 220, 320],
    "Steps":    [500, 600, 800, 1000, 1400, 2000, 3000, 4500, 7000, 11000],
    "Views":    [10, 12, 15, 20, 30, 45, 70, 110, 170, 260]
}

df = pd.DataFrame(data)

df.head(10)

,Distance,Fare,Age,Income,Items,Rating,Duration,Area,Steps,Views
0,0,2,18,15000,1,1,5,30,500,10
1,1,2,22,18000,1,1,7,35,600,12
2,2,3,25,22000,1,2,9,40,800,15
3,3,3,27,25000,2,2,12,50,1000,20
4,5,5,30,30000,2,3,18,65,1400,30
5,8,8,34,40000,3,3,27,85,2000,45
6,13,8,38,60000,5,4,41,110,3000,70
7,21,13,45,90000,8,4,62,150,4500,110
8,34,21,52,140000,13,5,94,220,7000,170
9,55,34,60,220000,21,5,141,320,11000,260


In [2]:
# Skewness before transformation
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
skew_before = df[numeric_cols].skew()

skew_before.to_frame(name="skew_before")

,skew_before
Distance,1.617721
Fare,1.70412
Age,0.676185
Income,1.673228
Items,1.701551
Rating,0.0
Duration,1.466489
Area,1.454171
Steps,1.602905
Views,1.535086
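
For context on what these numbers mean: pandas' `skew()` reports the bias-adjusted sample skewness (the adjusted Fisher-Pearson coefficient), not the raw third standardized moment. The sketch below recomputes it by hand for two columns copied from the dataset. The helper name `adjusted_skew` is hypothetical and exists only for this illustration.

```python
import numpy as np
import pandas as pd

def adjusted_skew(values):
    """Adjusted Fisher-Pearson skewness: sqrt(n(n-1)) / (n-2) * m3 / m2**1.5."""
    x = np.asarray(values, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)  # biased second central moment
    m3 = np.mean(d ** 3)  # biased third central moment
    return np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

# Two columns copied from the dataset above
sample = pd.DataFrame({
    "Distance": [0, 1, 2, 3, 5, 8, 13, 21, 34, 55],
    "Rating":   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})

manual = sample.apply(adjusted_skew)  # column-wise manual computation
print(pd.concat([sample.skew().rename("pandas"), manual.rename("manual")], axis=1))
```

`Rating` is symmetric around its mean, so its third central moment and therefore its skewness are zero, while the long right tail of `Distance` gives a clearly positive value.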


In [3]:
# Apply log1p to columns whose absolute skew exceeds the threshold (0.75)
skew_threshold = 0.75
cols_to_log = [c for c, s in skew_before.items() if abs(s) > skew_threshold]

# Keep a copy for comparison
df_log = df.copy()
for col in cols_to_log:
    df_log[col] = np.log1p(df_log[col])

{
    "columns_transformed": cols_to_log,
    "count": len(cols_to_log)
}

{'columns_transformed': ['Distance',
  'Fare',
  'Income',
  'Items',
  'Duration',
  'Area',
  'Steps',
  'Views'],
 'count': 8}
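
Because `log1p` is invertible, a transformed column can be mapped back to its original scale with `np.expm1`. This is useful when results need to be reported in the original units. A minimal sketch using the `Distance` values from the dataset:

```python
import numpy as np
import pandas as pd

distance = pd.Series([0, 1, 2, 3, 5, 8, 13, 21, 34, 55], name="Distance")

distance_log = np.log1p(distance)        # forward transform: log(1 + x)
distance_back = np.expm1(distance_log)   # inverse transform: exp(y) - 1

# The round trip recovers the original values up to floating-point error
print(np.allclose(distance_back, distance))
```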

In [4]:
# Skewness after transformation
skew_after = df_log[numeric_cols].skew()

comparison = (
    pd.concat([skew_before.rename("skew_before"), skew_after.rename("skew_after")], axis=1)
      .assign(reduced=lambda d: (d["skew_before"].abs() - d["skew_after"].abs()))
      .sort_values("reduced", ascending=False)
)
comparison

,skew_before,skew_after,reduced
Distance,1.617721,0.017378,1.600343
Steps,1.602905,0.365428,1.237478
Duration,1.466489,0.260881,1.205608
Views,1.535086,0.349501,1.185585
Fare,1.70412,0.57509,1.129031
Income,1.673228,0.625644,1.047584
Area,1.454171,0.445389,1.008781
Items,1.701551,0.738919,0.962632
Age,0.676185,0.676185,0.0
Rating,0.0,0.0,0.0
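
One caveat is worth checking before reusing the `|skew| > 0.75` rule on other data. `log1p` compresses large values, so it helps right-skewed columns but tends to make left-skewed ones worse. The sketch below shows this with a hypothetical mirrored column (`max - Distance`), which is not part of the dataset above. Selecting on `skew > threshold` instead of the absolute value would be one way to guard against this; that is a suggestion, not something the notebook does.

```python
import numpy as np
import pandas as pd

distance = pd.Series([0, 1, 2, 3, 5, 8, 13, 21, 34, 55])

# Hypothetical left-skewed column: mirror Distance around its maximum
mirrored = distance.max() - distance

skew_raw = mirrored.skew()             # negative: the long tail is on the left
skew_log = np.log1p(mirrored).skew()   # log1p stretches that left tail further

print({"skew_raw": skew_raw, "skew_log": skew_log})
```

In this sketch the absolute skewness grows after the transform, so the column would have been better left untouched.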
