# Pandas for Machine Learning (Beginner-friendly)

**Learning Objectives (short):**
- Learn how to load and explore tabular data with Pandas
- Clean, transform, and prepare features for modeling
- Convert cleaned data to NumPy arrays for downstream tools

**Prerequisites:** Basic Python and NumPy

**Estimated Time:** ~45 minutes

---

Pandas is the go-to library for working with tabular data in Python. This notebook focuses on simple, clear examples and short explanations aimed at beginners. Advanced techniques are marked as optional.

In [1]:
from datetime import datetime

import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Pandas version: 2.3.2
NumPy version: 2.3.3


## 1. DataFrame Creation and Basic Operations

Understanding how to create and manipulate DataFrames is fundamental to ML data preprocessing.

In [2]:
# Create sample ML dataset
np.random.seed(42)
n_samples = 1000

# Generate synthetic customer data for ML
data = {
    'customer_id': range(1, n_samples + 1),
    # Note: a normal draw can produce unrealistic values (e.g. negative ages)
    'age': np.random.normal(35, 12, n_samples).astype(int),
    'income': np.random.lognormal(10, 0.5, n_samples),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples, p=[0.3, 0.4, 0.2, 0.1]),
    'experience_years': np.random.exponential(5, n_samples),
    'num_purchases': np.random.poisson(3, n_samples),
    'satisfaction_score': np.random.uniform(1, 5, n_samples),
    'is_premium': np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_samples),
    'signup_date': pd.date_range('2020-01-01', periods=n_samples, freq='D')
}

df = pd.DataFrame(data)

# Introduce some missing values (realistic scenario)
missing_indices = np.random.choice(df.index, size=int(0.05 * len(df)), replace=False)
df.loc[missing_indices[:20], 'income'] = np.nan
df.loc[missing_indices[20:40], 'satisfaction_score'] = np.nan

print("Sample ML Dataset:")
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

Sample ML Dataset:
   customer_id  age        income    education  experience_years  \
0            1   40  44341.562353     Bachelor          1.800687   
1            2   33  34972.483357  High School          4.143785   
2            3   42  22693.077136     Bachelor          8.143228   
3            4   53  15939.117886  High School          0.737563   
4            5   32  31229.288168       Master          4.345832   

   num_purchases  satisfaction_score  is_premium region signup_date  
0              3            1.991730           0   East  2020-01-01  
1              3            4.711166           0   East  2020-01-02  
2              7            4.728536           1  North  2020-01-03  
3              3            3.882346           1   East  2020-01-04  
4              3            4.063053           0   East  2020-01-05  

Dataset shape: (1000, 10)
Memory usage: 186.23 KB
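
Beyond creating a DataFrame, the everyday basic operations are selecting columns and filtering rows. The following is a minimal sketch on a small hypothetical frame; the `toy` data is invented for illustration and is not the notebook's dataset.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "age": [25, 41, 33, 58],
    "income": [32000.0, 54000.0, 41000.0, 77000.0],
    "region": ["North", "South", "East", "South"],
})

# Select one column (a Series) or several columns (a DataFrame)
ages = toy["age"]
subset = toy[["age", "income"]]

# Boolean filtering: combine conditions with & or |, each wrapped in parentheses
south_over_40 = toy[(toy["region"] == "South") & (toy["age"] > 40)]

# Label-based (.loc) and position-based (.iloc) access
first_income = toy.loc[0, "income"]
last_row = toy.iloc[-1]

print(south_over_40)
```

The same patterns apply unchanged to the larger `df` built above, for example `df[df['is_premium'] == 1]`.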


In [3]:
# Basic DataFrame information (essential for ML)
print("Dataset Information:")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()

print("\nData Types:")
print(df.dtypes)

print("\nMissing Values:")
print(df.isnull().sum())

print("\nBasic Statistics:")
print(df.describe())

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   customer_id         1000 non-null   int64         
 1   age                 1000 non-null   int64         
 2   income              980 non-null    float64       
 3   education           1000 non-null   object        
 4   experience_years    1000 non-null   float64       
 5   num_purchases       1000 non-null   int64         
 6   satisfaction_score  980 non-null    float64       
 7   is_premium          1000 non-null   int64         
 8   region              1000 non-null   object        
 9   signup_date         1000 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(2)
memory usage: 78.3+ KB

Data Types:
customer_id                    int64
age                            int64
income                     

## 2. Data Exploration and Analysis

Before building a model, check how each feature is distributed and how it relates to the target. What you find here determines the cleaning steps that follow.

In [4]:
# Categorical data analysis
print("Categorical Data Analysis:")

# Value counts for categorical features
print("Education distribution:")
print(df['education'].value_counts())
print("\nEducation percentages:")
print(df['education'].value_counts(normalize=True) * 100)

print("\nRegion distribution:")
print(df['region'].value_counts())

print("\nPremium customers:")
print(df['is_premium'].value_counts())
print(f"Premium rate: {df['is_premium'].mean():.2%}")

Categorical Data Analysis:
Education distribution:
education
Bachelor       382
High School    315
Master         209
PhD             94
Name: count, dtype: int64

Education percentages:
education
Bachelor       38.2
High School    31.5
Master         20.9
PhD             9.4
Name: proportion, dtype: float64

Region distribution:
region
South    271
West     260
North    235
East     234
Name: count, dtype: int64

Premium customers:
is_premium
0    706
1    294
Name: count, dtype: int64
Premium rate: 29.40%


In [5]:
# Numerical data analysis
print("Numerical Data Analysis:")

# Select numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns
print(f"Numerical columns: {list(numerical_cols)}")

# Correlation analysis (important for feature selection)
correlation_matrix = df[numerical_cols].corr()
print("\nCorrelation with target (is_premium):")
target_corr = correlation_matrix['is_premium'].sort_values(ascending=False)
print(target_corr)

# Identify highly correlated features (multicollinearity)
print("\nHighly correlated feature pairs (|correlation| > 0.5):")
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_val = correlation_matrix.iloc[i, j]
        if abs(corr_val) > 0.5:
            high_corr_pairs.append((correlation_matrix.columns[i], correlation_matrix.columns[j], corr_val))

for col1, col2, corr in high_corr_pairs:
    print(f"{col1} - {col2}: {corr:.3f}")

Numerical Data Analysis:
Numerical columns: ['customer_id', 'age', 'income', 'experience_years', 'num_purchases', 'satisfaction_score', 'is_premium']

Correlation with target (is_premium):
is_premium            1.000000
customer_id           0.058631
income                0.022244
age                   0.008329
satisfaction_score   -0.006930
experience_years     -0.036799
num_purchases        -0.042162
Name: is_premium, dtype: float64

Highly correlated feature pairs (|correlation| > 0.5):
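
The nested loop above can also be written without explicit Python loops. The following is a sketch of one alternative, using `np.triu_indices` to visit each pair once; the `demo` frame and its column names are hypothetical and built so that exactly one pair is strongly correlated.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data: x_noisy is correlated with x by construction
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.1, size=200),
    "z": rng.normal(size=200),  # independent noise
})

corr = demo.corr()
cols = corr.columns

# Index pairs (i, j) with i < j: each pair once, diagonal excluded
i, j = np.triu_indices(len(cols), k=1)
pairs = pd.DataFrame({
    "feature_a": cols[i],
    "feature_b": cols[j],
    "corr": corr.to_numpy()[i, j],
})

high_corr = pairs[pairs["corr"].abs() > 0.5]
print(high_corr)
```

For the handful of columns in this notebook the loop is just as readable; the vectorized form mainly helps when there are hundreds of features.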


In [6]:
# Groupby analysis (understanding patterns)
print("Group Analysis:")

# Analyze by education level
education_analysis = df.groupby('education').agg({
    'age': ['mean', 'std'],
    'income': ['mean', 'median'],
    'satisfaction_score': 'mean',
    'is_premium': 'mean',
    'customer_id': 'count'
}).round(2)

print("Analysis by Education Level:")
print(education_analysis)

# Analyze by region
print("\nAnalysis by Region:")
region_analysis = df.groupby('region').agg({
    'income': 'mean',
    'is_premium': 'mean',
    'satisfaction_score': 'mean'
}).round(2)
print(region_analysis)

Group Analysis:
Analysis by Education Level:
               age           income           satisfaction_score is_premium  \
              mean    std      mean    median               mean       mean   
education                                                                     
Bachelor     34.49  11.28  25722.33  22188.83               3.02       0.29   
High School  33.90  12.14  25138.59  22588.63               3.00       0.28   
Master       36.34  11.76  27224.65  23229.79               3.02       0.32   
PhD          35.02  12.08  24331.07  22946.36               2.86       0.29   

            customer_id  
                  count  
education                
Bachelor            382  
High School         315  
Master              209  
PhD                  94  

Analysis by Region:
          income  is_premium  satisfaction_score
region                                          
East    26469.21        0.27                2.90
North   25249.80        0.28                2.97
So

## 3. Data Cleaning and Preprocessing

Missing values, outliers, and inefficient dtypes should be handled before the data reaches an ML model.

In [7]:
# Handle missing values
print("Handling Missing Values:")
print("Missing values before cleaning:")
print(df.isnull().sum())

# Create a copy for cleaning
df_clean = df.copy()

# Fill numerical missing values: median for the skewed income, mean for satisfaction_score.
# Assign the result back to the column; the chained form df[col].fillna(..., inplace=True)
# is deprecated and will stop modifying the DataFrame in pandas 3.0.
df_clean['income'] = df_clean['income'].fillna(df_clean['income'].median())
df_clean['satisfaction_score'] = df_clean['satisfaction_score'].fillna(df_clean['satisfaction_score'].mean())

print("\nMissing values after cleaning:")
print(df_clean.isnull().sum())

# Alternative strategies (fillna(method=...) is deprecated; use ffill()/bfill())
print("\nAlternative missing value strategies:")
print("1. Forward fill: df.ffill()")
print("2. Backward fill: df.bfill()")
print("3. Interpolation: df.interpolate()")
print("4. Drop rows: df.dropna()")
print("5. Drop columns: df.dropna(axis=1)")

Handling Missing Values:
Missing values before cleaning:
customer_id            0
age                    0
income                20
education              0
experience_years       0
num_purchases          0
satisfaction_score    20
is_premium             0
region                 0
signup_date            0
dtype: int64

Missing values after cleaning:
customer_id           0
age                   0
income                0
education             0
experience_years      0
num_purchases         0
satisfaction_score    0
is_premium            0
region                0
signup_date           0
dtype: int64

Alternative missing value strategies:
1. Forward fill: df.ffill()
2. Backward fill: df.bfill()
3. Interpolation: df.interpolate()
4. Drop rows: df.dropna()
5. Drop columns: df.dropna(axis=1)
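
A common companion to imputation is a "was missing" flag, so a model can still see which values were filled in. The following is a minimal sketch on hypothetical data (the `toy` frame and the `income_was_missing` name are assumptions for illustration); it also shows a single `fillna` call with a column-to-value mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "income": [30000.0, np.nan, 52000.0, 41000.0],
    "satisfaction_score": [4.0, 3.0, np.nan, 5.0],
})

# Record which values were missing before imputing
toy["income_was_missing"] = toy["income"].isna().astype(int)

# One fillna call with a {column: value} mapping
fill_values = {
    "income": toy["income"].median(),                      # median of observed incomes
    "satisfaction_score": toy["satisfaction_score"].mean(),  # mean of observed scores
}
toy = toy.fillna(fill_values)

print(toy)
```

Whether the indicator helps depends on why values are missing; it is most useful when missingness itself carries information.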


In [8]:
# Handle outliers
print("Outlier Detection and Handling:")

# Identify outliers using IQR method
def detect_outliers_iqr(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (series < lower_bound) | (series > upper_bound)

# Check for outliers in income
income_outliers = detect_outliers_iqr(df_clean['income'])
print(f"Income outliers: {income_outliers.sum()} ({income_outliers.mean():.1%})")

# Summarize the income distribution to see how far the tail extends
print("Income statistics:")
print(f"Mean: ${df_clean['income'].mean():.0f}")
print(f"Median: ${df_clean['income'].median():.0f}")
print(f"95th percentile: ${df_clean['income'].quantile(0.95):.0f}")
print(f"99th percentile: ${df_clean['income'].quantile(0.99):.0f}")
print(f"Max: ${df_clean['income'].max():.0f}")

# Handle outliers (cap at 95th percentile)
income_cap = df_clean['income'].quantile(0.95)
df_clean['income_capped'] = df_clean['income'].clip(upper=income_cap)

print("\nAfter capping at 95th percentile:")
print(f"Max income: ${df_clean['income_capped'].max():.0f}")

Outlier Detection and Handling:
Income outliers: 37 (3.7%)
Income statistics:
Mean: $25670
Median: $22732
95th percentile: $50839
99th percentile: $69259
Max: $105754

After capping at 95th percentile:
Max income: $50839
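
The cell above detects outliers with the IQR rule but caps only the upper tail at a percentile. As an alternative sketch, the same IQR bounds can be used to clip both tails. The helper name `clip_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical values with one obvious outlier
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])
clipped = clip_iqr(values)

# Here Q1 = 11.0 and Q3 = 12.5, so the upper bound is 12.5 + 1.5 * 1.5 = 14.75
print(clipped.tolist())
```

Clipping keeps every row while limiting the influence of extreme values; dropping rows is the alternative when the extremes are known to be errors.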


In [9]:
# Data type optimization (important for large datasets)
print("Data Type Optimization:")
print(f"Memory usage before optimization: {df_clean.memory_usage(deep=True).sum() / 1024:.2f} KB")

# Optimize integer columns
int_cols = df_clean.select_dtypes(include=['int64']).columns
for col in int_cols:
    if col != 'customer_id':  # Keep ID as int64
        df_clean[col] = pd.to_numeric(df_clean[col], downcast='integer')

# Optimize float columns
float_cols = df_clean.select_dtypes(include=['float64']).columns
for col in float_cols:
    df_clean[col] = pd.to_numeric(df_clean[col], downcast='float')

# Convert categorical columns to category dtype
categorical_cols = ['education', 'region']
for col in categorical_cols:
    df_clean[col] = df_clean[col].astype('category')

print(f"Memory usage after optimization: {df_clean.memory_usage(deep=True).sum() / 1024:.2f} KB")
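# Note: the reduction is measured against the original df, which has one column fewer than df_clean
print(f"Memory reduction: {(1 - df_clean.memory_usage(deep=True).sum() / df.memory_usage(deep=True).sum()) * 100:.1f}%")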

print("\nOptimized data types:")
print(df_clean.dtypes)

Data Type Optimization:
Memory usage before optimization: 194.04 KB
Memory usage after optimization: 44.90 KB
Memory reduction: 75.9%

Optimized data types:
customer_id                    int64
age                             int8
income                       float64
education                   category
experience_years             float32
                           ...      
satisfaction_score           float32
is_premium                      int8
region                      category
signup_date           datetime64[ns]
income_capped                float64
Length: 11, dtype: object
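
Most of the saving above comes from the text columns. As a rough illustration of why, the sketch below compares a string column with its `category` version on hypothetical data; exact byte counts vary by pandas version, so none are quoted.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 10,000 rows drawn from four distinct labels
rng = np.random.default_rng(0)
regions = pd.Series(rng.choice(["North", "South", "East", "West"], size=10_000))

as_category = regions.astype("category")

bytes_strings = regions.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print(f"strings: {bytes_strings / 1024:.1f} KB, category: {bytes_category / 1024:.1f} KB")

# A categorical stores each label once plus one small integer code per row
print(as_category.cat.codes.dtype)

# downcast='integer' picks the smallest integer dtype that can hold the values
small = pd.to_numeric(pd.Series([0, 1, 1, 0], dtype="int64"), downcast="integer")
print(small.dtype)
```

The `category` dtype pays off when a column has few distinct values relative to its length; for nearly unique strings it can use more memory, not less.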


## 4. Feature Engineering

Creating new features that can improve ML model performance.

In [10]:
# Feature engineering examples
print("Feature Engineering:")

# 1. Binning continuous variables
df_clean['age_group'] = pd.cut(df_clean['age'],
                              bins=[0, 25, 35, 50, 100],
                              labels=['Young', 'Adult', 'Middle-aged', 'Senior'])

df_clean['income_tier'] = pd.qcut(df_clean['income_capped'],
                                 q=4,
                                 labels=['Low', 'Medium', 'High', 'Very High'])

print("Age group distribution:")
print(df_clean['age_group'].value_counts())

print("\nIncome tier distribution:")
print(df_clean['income_tier'].value_counts())

# 2. Mathematical transformations
df_clean['log_income'] = np.log1p(df_clean['income_capped'])  # log(1+x) to handle zeros
df_clean['income_per_purchase'] = df_clean['income_capped'] / (df_clean['num_purchases'] + 1)
df_clean['satisfaction_squared'] = df_clean['satisfaction_score'] ** 2

# 3. Date-based features
df_clean['signup_year'] = df_clean['signup_date'].dt.year
df_clean['signup_month'] = df_clean['signup_date'].dt.month
df_clean['signup_dayofweek'] = df_clean['signup_date'].dt.dayofweek
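# Note: datetime.now() makes this feature depend on the run date; use a fixed reference date for reproducible results
df_clean['days_since_signup'] = (datetime.now() - df_clean['signup_date']).dt.days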

print("\nNew features created:")
new_features = ['age_group', 'income_tier', 'log_income', 'income_per_purchase',
                'satisfaction_squared', 'signup_year', 'signup_month', 'days_since_signup']
print(df_clean[new_features].head())

Feature Engineering:
Age group distribution:
age_group
Middle-aged    374
Adult          309
Young          218
Senior          98
Name: count, dtype: int64

Income tier distribution:
income_tier
Medium       260
Low          250
Very High    250
High         240
Name: count, dtype: int64

New features created:
     age_group income_tier  log_income  income_per_purchase  \
0  Middle-aged   Very High   10.699700         11085.390588   
1        Adult   Very High   10.462345          8743.120839   
2  Middle-aged      Medium   10.029859          2836.634642   
3       Senior         Low    9.676594          3984.779471   
4        Adult        High   10.349144          7807.322042   

   satisfaction_squared  signup_year  signup_month  days_since_signup  
0              3.966988         2020             1               2079  
1             22.195084         2020             1               2078  
2             22.359049         2020             1               2077  
3             15.072
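
The age-group counts printed above sum to 999 rather than 1000, which is what happens when a value falls outside the bin edges: `pd.cut` returns NaN for it. The sketch below shows that behavior on hypothetical ages, along with one possible way to make the outer bins open-ended.

```python
import numpy as np
import pandas as pd

labels = ["Young", "Adult", "Middle-aged", "Senior"]
ages = pd.Series([-3, 0, 25, 26, 100, 101])  # hypothetical edge cases

# Default right=True gives the intervals (0, 25], (25, 35], (35, 50], (50, 100]
groups = pd.cut(ages, bins=[0, 25, 35, 50, 100], labels=labels)
print(groups.tolist())  # values outside (0, 100] become NaN

# Open-ended outer edges place every finite value in some bin.
# This is a modeling choice: whether a negative age should be kept at all
# is a separate data-quality question.
groups_open = pd.cut(ages, bins=[-np.inf, 25, 35, 50, np.inf], labels=labels)
print(groups_open.tolist())
```

Checking `isna().sum()` on a binned column right after creating it is a cheap way to catch values that fell outside the bins.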

In [11]:
# Interaction features
print("Interaction Features:")

# Create interaction between important features
df_clean['age_income_interaction'] = df_clean['age'] * df_clean['log_income']
df_clean['experience_satisfaction'] = df_clean['experience_years'] * df_clean['satisfaction_score']

# Boolean combinations
df_clean['high_income_high_satisfaction'] = (
    (df_clean['income_tier'] == 'Very High') &
    (df_clean['satisfaction_score'] > 4)
).astype(int)

df_clean['experienced_premium'] = (
    (df_clean['experience_years'] > 5) &
    (df_clean['is_premium'] == 1)
).astype(int)

print("Interaction features:")
interaction_features = ['age_income_interaction', 'experience_satisfaction',
                       'high_income_high_satisfaction', 'experienced_premium']
print(df_clean[interaction_features].describe())

Interaction Features:
Interaction features:
       age_income_interaction  experience_satisfaction  \
count             1000.000000              1000.000000   
mean               348.022457                13.925125   
std                118.560771                15.730806   
min                -30.808238                 0.000356   
25%                267.392329                 3.523085   
50%                348.322690                 9.498555   
75%                425.520200                18.350727   
max                877.751173               175.326508   

       high_income_high_satisfaction  experienced_premium  
count                    1000.000000          1000.000000  
mean                        0.067000             0.084000  
std                         0.250147             0.277527  
min                         0.000000             0.000000  
25%                         0.000000             0.000000  
50%                         0.000000             0.000000  
75%          

In [12]:
# Aggregation features (useful for time series or grouped data)
print("Aggregation Features:")

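# Features based on region
# observed=True groups only by categories present in the data and avoids
# the pandas FutureWarning raised when grouping on a categorical column
region_stats = df_clean.groupby('region', observed=True).agg({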
    'income_capped': ['mean', 'std'],
    'satisfaction_score': 'mean',
    'is_premium': 'mean'
}).round(3)

# Flatten column names
region_stats.columns = ['_'.join(col).strip() for col in region_stats.columns]
region_stats = region_stats.add_prefix('region_')

# Merge back to main dataframe
df_clean = df_clean.merge(region_stats, left_on='region', right_index=True, how='left')

print("Region-based features:")
region_features = [col for col in df_clean.columns if col.startswith('region_')]
print(df_clean[['region'] + region_features].head())

# Relative features (compare individual to group)
df_clean['income_vs_region_mean'] = df_clean['income_capped'] / df_clean['region_income_capped_mean']
df_clean['satisfaction_vs_region_mean'] = df_clean['satisfaction_score'] / df_clean['region_satisfaction_score_mean']

print("\nRelative features:")
print(df_clean[['income_vs_region_mean', 'satisfaction_vs_region_mean']].describe())

Aggregation Features:
Region-based features:
  region  region_income_capped_mean  region_income_capped_std  \
0   East                  25677.381                 10974.508   
1   East                  25677.381                 10974.508   
2  North                  24758.733                 11104.631   
3   East                  25677.381                 10974.508   
4   East                  25677.381                 10974.508   

   region_satisfaction_score_mean  region_is_premium_mean  
0                           2.901                   0.269  
1                           2.901                   0.269  
2                           2.971                   0.281  
3                           2.901                   0.269  
4                           2.901                   0.269  

Relative features:
       income_vs_region_mean  satisfaction_vs_region_mean
count            1000.000000                  1000.000000
mean                1.000000                     0.999992
std       
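
The aggregate-then-merge pattern above can often be shortened with `groupby(...).transform(...)`, which returns one value per original row. The following is a sketch on hypothetical data, not a replacement for the cell above; it covers a single statistic, while the merge approach is convenient when many statistics are needed at once.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "region": ["East", "East", "North", "North", "North"],
    "income": [20000.0, 30000.0, 40000.0, 50000.0, 60000.0],
})

# transform broadcasts the group mean back onto the original rows
toy["region_income_mean"] = toy.groupby("region")["income"].transform("mean")

# Relative feature: each row compared with its own group
toy["income_vs_region_mean"] = toy["income"] / toy["region_income_mean"]

print(toy)
```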



## 5. Categorical Encoding

Converting categorical variables to numerical format for ML models.

In [13]:
# One-hot encoding
print("One-Hot Encoding:")

# Select categorical columns for encoding
categorical_cols = ['education', 'region', 'age_group', 'income_tier']

# One-hot encode
df_encoded = pd.get_dummies(df_clean, columns=categorical_cols, prefix=categorical_cols, drop_first=True)

print(f"Shape before encoding: {df_clean.shape}")
print(f"Shape after encoding: {df_encoded.shape}")

# Show new columns: those created by get_dummies, i.e. not present before encoding.
# (Matching on substrings such as 'region' would also pick up earlier features like region_income_capped_mean.)
new_cols = [col for col in df_encoded.columns if col not in df_clean.columns]
print(f"\nNew encoded columns ({len(new_cols)}):")
for col in new_cols[:10]:  # Show first 10
    print(f"  {col}")
if len(new_cols) > 10:
    print(f"  ... and {len(new_cols) - 10} more")

One-Hot Encoding:
Shape before encoding: (1000, 30)
Shape after encoding: (1000, 38)

New encoded columns (12):
  education_High School
  education_Master
  education_PhD
  region_North
  region_South
  region_West
  age_group_Adult
  age_group_Middle-aged
  age_group_Senior
  income_tier_Medium
  ... and 2 more
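
One practical issue with `pd.get_dummies` is that it only creates columns for categories it sees, so new data may produce a different set of columns from the training data. The sketch below shows one way to align them with `reindex`; the `train` and `new` frames are hypothetical. It also passes `dtype=int`, since pandas 2.x returns boolean dummy columns by default.

```python
import pandas as pd

# Hypothetical data: the new batch never contains South or West
train = pd.DataFrame({"region": ["East", "North", "South", "West"]})
new = pd.DataFrame({"region": ["North", "North", "East"]})

train_dummies = pd.get_dummies(train, columns=["region"], dtype=int)
new_dummies = pd.get_dummies(new, columns=["region"], dtype=int)

# Reindex to the training columns; categories missing from the new data
# become all-zero columns, and the column order matches training
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(new_aligned)
```

An encoder object that remembers the training categories, such as scikit-learn's `OneHotEncoder`, is another common way to handle this.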


In [14]:
# Ordinal and label encoding (integer codes for categories)
print("Label Encoding:")

from sklearn.preprocessing import LabelEncoder

# Create a copy for label encoding
df_label_encoded = df_clean.copy()

# Education has natural ordering
education_order = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
df_label_encoded['education_encoded'] = df_label_encoded['education'].map(education_order)

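# LabelEncoder assigns integer codes in alphabetical order. It is designed for target labels;
# for non-ordinal features, one-hot encoding is usually safer because integer codes imply an order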
le_region = LabelEncoder()
df_label_encoded['region_encoded'] = le_region.fit_transform(df_label_encoded['region'])

print("Education encoding:")
print(df_label_encoded[['education', 'education_encoded']].drop_duplicates().sort_values('education_encoded'))

print("\nRegion encoding:")
print(df_label_encoded[['region', 'region_encoded']].drop_duplicates().sort_values('region_encoded'))

# Show encoding mapping
print("\nRegion encoding mapping:")
for i, region in enumerate(le_region.classes_):
    print(f"  {region}: {i}")

Label Encoding:
Education encoding:
     education education_encoded
0     Bachelor                 1
1  High School                 0
4       Master                 2
5          PhD                 3

Region encoding:
  region  region_encoded
0   East               0
2  North               1
6  South               2
9   West               3

Region encoding mapping:
  East: 0
  North: 1
  South: 2
  West: 3
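
For ordinal variables, pandas can also carry the ordering itself through an ordered categorical, whose integer codes follow the order you declare. The following is a minimal sketch with hypothetical values; it is an alternative to the explicit `map` dictionary used above, not the notebook's method.

```python
import pandas as pd

# Hypothetical values
education = pd.Series(["Bachelor", "High School", "PhD", "Master", "Bachelor"])

# Declare the order explicitly, lowest to highest
levels = ["High School", "Bachelor", "Master", "PhD"]
education_cat = pd.Categorical(education, categories=levels, ordered=True)

# Integer codes follow the declared order, not alphabetical order
codes = pd.Series(education_cat.codes, index=education.index)
print(codes.tolist())

# Ordered categoricals also support comparisons against a level
at_least_master = education_cat >= "Master"
print(at_least_master.tolist())
```

Values that are not in `levels` get the code -1, which makes unexpected categories easy to spot.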


In [15]:
# Target encoding (advanced technique)
print("Target Encoding:")

# Calculate mean target value for each category
def target_encode(df, categorical_col, target_col, smoothing=1):
    """
    Target encoding with smoothing to prevent overfitting
    """
    # Calculate global mean
    global_mean = df[target_col].mean()

    # Calculate category means and counts
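    # observed=True groups only by categories present in the data and avoids
    # the pandas FutureWarning raised when grouping on a categorical column
    category_stats = df.groupby(categorical_col, observed=True)[target_col].agg(['mean', 'count'])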

    # Apply smoothing
    smoothed_means = (
        (category_stats['mean'] * category_stats['count'] + global_mean * smoothing) /
        (category_stats['count'] + smoothing)
    )

    return smoothed_means

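# Target encode education based on premium rate
# Caution: the encoding is computed on all rows, including ones later used for validation and testing,
# which leaks target information; in practice, compute it on training data only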
education_target_encoding = target_encode(df_clean, 'education', 'is_premium')
df_clean['education_target_encoded'] = df_clean['education'].map(education_target_encoding)

print("Education target encoding (premium rate):")
print(education_target_encoding.sort_values(ascending=False))

# Target encode region
region_target_encoding = target_encode(df_clean, 'region', 'is_premium')
df_clean['region_target_encoded'] = df_clean['region'].map(region_target_encoding)

print("\nRegion target encoding (premium rate):")
print(region_target_encoding.sort_values(ascending=False))

Target Encoding:
Education target encoding (premium rate):
education
Master         0.320448
Bachelor       0.293196
PhD            0.287305
High School    0.279411
dtype: float64

Region target encoding (premium rate):
region
South    0.317257
West     0.303808
North    0.280907
East     0.269336
dtype: float64




## 6. Feature Scaling and Normalization

Preparing numerical features for ML algorithms that are sensitive to scale.

In [16]:
# Feature scaling
print("Feature Scaling:")

from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Select numerical features for scaling
numerical_features = ['age', 'income_capped', 'experience_years', 'satisfaction_score',
                     'log_income', 'days_since_signup']

print("Original feature statistics:")
print(df_clean[numerical_features].describe())

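# Standard scaling (z-score normalization)
# Fitted on the full dataset here to illustrate the transform; in a real workflow,
# fit the scaler on the training split only and reuse it for validation and test data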
scaler_standard = StandardScaler()
df_standard_scaled = df_clean.copy()
df_standard_scaled[numerical_features] = scaler_standard.fit_transform(df_clean[numerical_features])

print("\nAfter Standard Scaling (mean=0, std=1):")
print(df_standard_scaled[numerical_features].describe())

# Min-Max scaling (0-1 range)
scaler_minmax = MinMaxScaler()
df_minmax_scaled = df_clean.copy()
df_minmax_scaled[numerical_features] = scaler_minmax.fit_transform(df_clean[numerical_features])

print("\nAfter Min-Max Scaling (range 0-1):")
print(df_minmax_scaled[numerical_features].describe())

Feature Scaling:
Original feature statistics:
               age  income_capped  experience_years  satisfaction_score  \
count  1000.000000    1000.000000       1000.000000         1000.000000   
mean     34.743000   25075.212815          4.712564            2.998455   
std      11.748233   11361.041618          4.769008            1.106650   
min      -3.000000    5063.461821          0.000154            1.000211   
25%      27.000000   16372.665420          1.354730            2.094302   
50%      35.000000   22732.225225          3.295244            2.998455   
75%      42.000000   31391.594681          6.413300            3.903685   
max      81.000000   50838.771470         38.617649            4.994469   

        log_income  days_since_signup  
count  1000.000000        1000.000000  
mean     10.024256        1579.500000  
std       0.471203         288.819436  
min       8.530003        1080.000000  
25%       9.703429        1329.750000  
50%      10.031583        1579.500000 

In [17]:
# Robust scaling (less sensitive to outliers)
scaler_robust = RobustScaler()
df_robust_scaled = df_clean.copy()
df_robust_scaled[numerical_features] = scaler_robust.fit_transform(df_clean[numerical_features])

print("After Robust Scaling (median=0, IQR=1):")
print(df_robust_scaled[numerical_features].describe())

# Compare scaling methods side by side using summary statistics
print("\nScaling Comparison for 'income_capped':")
comparison_df = pd.DataFrame({
    'Original': df_clean['income_capped'],
    'Standard': df_standard_scaled['income_capped'],
    'MinMax': df_minmax_scaled['income_capped'],
    'Robust': df_robust_scaled['income_capped']
})

print(comparison_df.describe())

After Robust Scaling (median=0, IQR=1):
               age  income_capped  experience_years  satisfaction_score  \
count  1000.000000    1000.000000       1000.000000        1.000000e+03   
mean     -0.017133       0.156002          0.280182       -6.786049e-09   
std       0.783216       0.756448          0.942758        6.116174e-01   
min      -2.533333      -1.176433         -0.651388       -1.104379e+00   
25%      -0.533333      -0.423436         -0.383609       -4.997026e-01   
50%       0.000000       0.000000          0.000000        0.000000e+00   
75%       0.466667       0.576564          0.616391        5.002974e-01   
max       3.066667       1.871408          6.982686        1.103147e+00   

        log_income  days_since_signup  
count  1000.000000       1.000000e+03  
mean     -0.011256      -2.842171e-17  
std       0.723928       5.782171e-01  
min      -2.306936      -1.000000e+00  
25%      -0.504155      -5.000000e-01  
50%       0.000000       0.000000e+00  
75% 

## 7. Data Splitting and Sampling

Preparing data for training, validation, and testing.

In [18]:
# Train-validation-test split
print("Data Splitting:")

from sklearn.model_selection import train_test_split

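# Prepare features and target
# Caution: 'experienced_premium' and 'region_is_premium_mean' are derived from is_premium
# and leak the target; drop them before training a real model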
feature_columns = [col for col in df_encoded.columns
                  if col not in ['customer_id', 'is_premium', 'signup_date', 'education', 'region']]

X = df_encoded[feature_columns]
y = df_encoded['is_premium']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Target distribution: {y.value_counts().to_dict()}")

# First split: separate test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: divide the remaining 80% into train (75% of it) and validation (25% of it)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp  # 0.25 * 0.8 = 0.2 of total
)

print("\nSplit sizes:")
print(f"Train: {X_train.shape[0]} ({X_train.shape[0]/len(X):.1%})")
print(f"Validation: {X_val.shape[0]} ({X_val.shape[0]/len(X):.1%})")
print(f"Test: {X_test.shape[0]} ({X_test.shape[0]/len(X):.1%})")

# Check target distribution in each split
print("\nTarget distribution:")
print(f"Train: {y_train.mean():.3f}")
print(f"Validation: {y_val.mean():.3f}")
print(f"Test: {y_test.mean():.3f}")

Data Splitting:
Features shape: (1000, 35)
Target shape: (1000,)
Target distribution: {0: 706, 1: 294}

Split sizes:
Train: 600 (60.0%)
Validation: 200 (20.0%)
Test: 200 (20.0%)

Target distribution:
Train: 0.293
Validation: 0.295
Test: 0.295
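
Once the data is split, scalers and other fitted transforms should learn their parameters from the training split only, then be applied unchanged to validation and test data. The following is a minimal sketch on hypothetical data (the `X_demo` frame is invented and the names `X_tr` and `X_te` are placeholders, not the notebook's variables).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "age": rng.normal(35, 12, size=500),
    "income": rng.lognormal(10, 0.5, size=500),
})

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler()

# fit_transform on training data only: mean and std come from X_tr
X_tr_scaled = pd.DataFrame(scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)

# transform (no fit) on held-out data: reuses the training mean and std
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.mean().round(3).to_dict())
print(X_te_scaled.mean().round(3).to_dict())
```

The held-out means will be close to zero but not exactly zero, which is expected: the test data is scaled with statistics it did not contribute to.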


In [19]:
# Handling imbalanced data
print("Handling Imbalanced Data:")

# Check class imbalance
class_counts = y_train.value_counts()
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"Class distribution: {class_counts.to_dict()}")
print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")

if imbalance_ratio > 2:  # If significantly imbalanced
    print("\nDataset is imbalanced. Strategies to consider:")

    # 1. Undersampling majority class
    majority_class = y_train.value_counts().index[0]
    minority_class = y_train.value_counts().index[1]

    majority_indices = y_train[y_train == majority_class].index
    minority_indices = y_train[y_train == minority_class].index

    # Random undersample majority class
    undersampled_majority = np.random.choice(majority_indices, size=len(minority_indices), replace=False)
    balanced_indices = np.concatenate([undersampled_majority, minority_indices])

    X_train_balanced = X_train.loc[balanced_indices]
    y_train_balanced = y_train.loc[balanced_indices]

    print(f"1. Undersampling - New size: {len(X_train_balanced)}")
    print(f"   New distribution: {y_train_balanced.value_counts().to_dict()}")

    # 2. Class weights (for algorithms that support it)
    from sklearn.utils.class_weight import compute_class_weight

    class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
    class_weight_dict = dict(zip(np.unique(y_train), class_weights, strict=False))

    print(f"2. Class weights: {class_weight_dict}")

else:
    print("Dataset is reasonably balanced.")

Handling Imbalanced Data:
Class distribution: {0: 424, 1: 176}
Imbalance ratio: 2.41:1

Dataset is imbalanced. Strategies to consider:
1. Undersampling - New size: 352
   New distribution: {0: 176, 1: 176}
2. Class weights: {np.int8(0): np.float64(0.7075471698113207), np.int8(1): np.float64(1.7045454545454546)}
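
The "balanced" class weights printed above follow a simple formula: `n_samples / (n_classes * count_of_class)`. The sketch below reproduces them by hand from the same class counts (424 and 176) and converts the result to plain Python types, which print more readably than NumPy scalars. The `y_demo` array is a stand-in built from those counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in labels with the same class counts as the training split above
y_demo = np.array([0] * 424 + [1] * 176)

classes = np.unique(y_demo)
counts = np.bincount(y_demo)

# 'balanced' weights: n_samples / (n_classes * count_of_class)
manual = len(y_demo) / (len(classes) * counts)
from_sklearn = compute_class_weight("balanced", classes=classes, y=y_demo)

# Plain int keys and float values instead of NumPy scalars
weights = {int(c): round(float(w), 4) for c, w in zip(classes, from_sklearn)}
print(weights)  # {0: 0.7075, 1: 1.7045}
```

Many scikit-learn classifiers accept such a dictionary through their `class_weight` parameter, or the string `'balanced'` to compute it automatically.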


## 8. Converting cleaned data for other tools (framework-neutral)

After cleaning and encoding, you will often convert DataFrame features to NumPy arrays. The guidance below is framework-neutral and focuses on checks and formats most tools expect.

- Use numeric dtypes for features (float32 is common for inputs)
- Use integer dtypes for labels (int32/int64 depending on the tool)
- Verify shapes: (n_samples, n_features) for feature matrices
- Check for NaNs and infinite values before exporting

Example: convert train/val/test splits to NumPy and save them for later reuse.

In [20]:
print("Converting DataFrame to NumPy arrays (neutral):")

# Convert numeric and boolean columns to NumPy arrays (ensure appropriate dtype).
# pd.get_dummies returns bool columns in pandas 2.x, so selecting np.number alone
# would silently drop every one-hot encoded feature.
numeric_columns = X_train.select_dtypes(include=[np.number, 'bool']).columns

X_train_np = X_train[numeric_columns].to_numpy(dtype=np.float32)
y_train_np = y_train.to_numpy(dtype=np.int64)

X_val_np = X_val[numeric_columns].to_numpy(dtype=np.float32)
y_val_np = y_val.to_numpy(dtype=np.int64)

X_test_np = X_test[numeric_columns].to_numpy(dtype=np.float32)
y_test_np = y_test.to_numpy(dtype=np.int64)

print(f"Training data shape: {X_train_np.shape}")
print(f"Training data dtype: {X_train_np.dtype}")
print(f"Training labels dtype: {y_train_np.dtype}")

print("\nNotes:")
print(" - Save these NumPy arrays to disk for reproducibility and to reuse in other tools")
print(" - Save preprocessing objects (scalers, encoders) so the same transforms are applied in production")

# Example saving (commented out so the notebook can run without writing files)
# np.save('data/processed/X_train.npy', X_train_np)
# np.save('data/processed/y_train.npy', y_train_np)
# np.save('data/processed/X_val.npy', X_val_np)
# np.save('data/processed/y_val.npy', y_val_np)
# np.save('data/processed/X_test.npy', X_test_np)
# np.save('data/processed/y_test.npy', y_test_np)

# Save feature names and preprocessing info
# import pickle
# preprocessing_info = {'feature_names': list(numeric_columns)}
# with open('data/processed/preprocessing_info.pkl', 'wb') as f:
#     pickle.dump(preprocessing_info, f)

# Save scalers for future use
# with open('data/processed/scaler.pkl', 'wb') as f:
#     pickle.dump(scaler_standard, f)

Converting DataFrame to NumPy arrays (neutral):
Training data shape: (600, 35)
Training data dtype: float32
Training labels dtype: int64

Notes:
 - Save these NumPy arrays to disk for reproducibility and to reuse in other tools
 - Save preprocessing objects (scalers, encoders) so the same transforms are applied in production
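
The checks listed at the start of this section (dtype, shape, NaN and infinite values) can be bundled into a small helper that runs before export. The following is one possible sketch; the function name `check_feature_matrix` and the demo arrays are hypothetical, and the round trip uses an in-memory buffer so that no files are written.

```python
import io

import numpy as np

def check_feature_matrix(X: np.ndarray, y: np.ndarray) -> None:
    """Raise ValueError if X or y looks unsuitable for export."""
    if X.ndim != 2:
        raise ValueError(f"expected 2-D features, got shape {X.shape}")
    if X.shape[0] != y.shape[0]:
        raise ValueError("features and labels have different numbers of rows")
    if not np.issubdtype(X.dtype, np.floating):
        raise ValueError(f"expected a float dtype, got {X.dtype}")
    if not np.isfinite(X).all():
        raise ValueError("features contain NaN or infinite values")

# Hypothetical demo arrays
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 3)).astype(np.float32)
y_demo = np.array([0, 1, 0, 1, 1, 0], dtype=np.int64)

check_feature_matrix(X_demo, y_demo)  # passes silently

# Round trip through an in-memory .npz archive
buffer = io.BytesIO()
np.savez(buffer, X=X_demo, y=y_demo)
buffer.seek(0)
loaded = np.load(buffer)

print(loaded["X"].shape, loaded["X"].dtype, loaded["y"].dtype)
```

Replacing the buffer with a file path saves the same archive to disk, keeping features and labels together in one file.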


## Summary and Key Takeaways

**What we've learned (short):**
- Explore data with .info(), .describe(), and .value_counts()
- Handle missing values and outliers with straightforward strategies
- Create new features and encode categorical variables
- Scale numeric features and split data into train/val/test
- Convert cleaned DataFrames to NumPy arrays for downstream tools

**Next steps:** keep the preprocessing code and scalers so you can reproduce results later.
