## 3. NON-REPRESENTATIVE DATA

### Definition
**Non-Representative Data**: Dataset doesn't reflect the real-world population the model will be used on, leading to poor generalization.

### Two Types of Non-Representativeness:

#### 3.1 Sampling Noise

**Definition**: Random, unavoidable variability when sampling from population.


In [None]:
# Example: Sampling noise in elections
import numpy as np

# True population: 51% support candidate A, 49% support B
true_probability = 0.51

# Random sample of 1000 voters
sample_size = 1000
support_counts = np.random.binomial(1, true_probability, sample_size)
sample_support = np.mean(support_counts)

print(f"True population support: {true_probability*100:.1f}%")
print(f"Sample estimate: {sample_support*100:.1f}%")
print(f"Error: {abs(sample_support - true_probability)*100:.1f}%")

# If we repeat sampling multiple times:
estimates = []
for trial in range(100):
    sample = np.random.binomial(1, true_probability, sample_size)
    estimates.append(np.mean(sample))

# The standard deviation of the estimates measures how much they scatter
print(f"\nStd. dev. of estimates across 100 samples: {np.std(estimates)*100:.2f} points")
# This variation is SAMPLING NOISE (unavoidable)


**Solutions:**
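1. Increase sample size (noise shrinks in proportion to 1/√n, so 4× the data halves it)
2. Use stratified sampling (each group appears in its true proportion)
3. Report confidence intervals (quantifies the remaining noise rather than removing it)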
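
The first and third points can be sketched numerically. The sample sizes, the number of repetitions, and the seed below are arbitrary choices for illustration, and the interval uses the usual normal approximation.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility
true_probability = 0.51

# Spread of the sample estimate for increasing sample sizes
spreads = {}
for n in (100, 1_000, 10_000):
    # Each draw is the number of supporters in one sample of n voters
    estimates = rng.binomial(n, true_probability, size=2_000) / n
    spreads[n] = estimates.std()
    theory = np.sqrt(true_probability * (1 - true_probability) / n)
    print(f"n={n:>6}: observed std = {spreads[n]*100:.2f} pts, theory = {theory*100:.2f} pts")

# 95% confidence interval (normal approximation) for a single sample of 1,000
n = 1_000
p_hat = rng.binomial(n, true_probability) / n
standard_error = np.sqrt(p_hat * (1 - p_hat) / n)
ci_low, ci_high = p_hat - 1.96 * standard_error, p_hat + 1.96 * standard_error
print(f"Estimate: {p_hat:.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

Going from 100 to 10,000 voters (100× the data) shrinks the spread by roughly 10×. With 1,000 voters the interval is still about 6 points wide, too wide to call a 51/49 race.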


In [None]:
# Stratified sampling: Ensure each group represented
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Imbalanced dataset (99% class 0, 1% class 1)
X = np.random.rand(10000, 10)
y = np.hstack([np.zeros(9900), np.ones(100)])

print(f"Original class distribution:")
print(f"Class 0: {np.sum(y==0)} (99%)")
print(f"Class 1: {np.sum(y==1)} (1%)")

# ❌ Random sampling might miss rare class
random_indices = np.random.choice(10000, 100, replace=False)
random_y = y[random_indices]
print(f"\nRandom sample class distribution:")
print(f"Class 0: {np.sum(random_y==0)}")
print(f"Class 1: {np.sum(random_y==1)}")  # Might be 0-1 only!

# ✅ Stratified sampling preserves distribution
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.01)
_, test_idx = next(sss.split(X, y))  # n_splits=1, so take the only split
stratified_y = y[test_idx]

print(f"\nStratified sample class distribution:")
print(f"Class 0: {np.sum(stratified_y==0)} (99%)")
print(f"Class 1: {np.sum(stratified_y==1)} (1%)")  # Preserves ratio!


#### 3.2 Sampling Bias

**Definition**: Systematic error where certain groups are over/under-represented due to collection method.
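
The practical difference from sampling noise is that collecting more data does not help. The sketch below uses an entirely simulated student population: the study-hour distribution, and the rule that a student is more likely to be found in the library the more they study, are assumptions invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: weekly study hours for 50,000 students
study_hours = rng.gamma(shape=2.0, scale=5.0, size=50_000)
true_mean = study_hours.mean()

# Assumption: the chance of being found in the library grows with study hours
library_prob = study_hours / study_hours.sum()

results = {}
for n in (200, 5_000):
    random_sample = rng.choice(study_hours, size=n, replace=False)
    library_sample = rng.choice(study_hours, size=n, replace=False, p=library_prob)
    results[n] = (random_sample.mean(), library_sample.mean())
    print(f"n={n}: random = {results[n][0]:.1f} h, library = {results[n][1]:.1f} h "
          f"(true mean = {true_mean:.1f} h)")
```

The random sample gets closer to the true mean as n grows, while the library sample stays well above it at both sizes. A larger biased sample only gives a more precise wrong answer.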

### Real-World Examples of Sampling Bias:


In [None]:
# Example 1: Student Survey (sampling bias)
class Survey:
    def __init__(self):
        self.respondents = []
    
    def survey_in_library(self):
        """❌ BIASED: Only surveys students in library"""
        # Students found in the library likely study more than average
        # Missing: students who rarely study or who study elsewhere
        # Bias: Overrepresents studious students, underrepresents the rest
        pass
    
    def random_student_selection(self):
        """✅ UNBIASED: Randomly select from all students"""
        # Each student has equal chance of selection
        # Represents all types of students
        pass

# Example 2: Email Spam Detection
# ❌ BIASED DATASET:
# - Collected only from Gmail accounts
# - Gmail's spam filter already removes most spam
# - Missing many real-world spam types
# - Poor generalization to other email providers

# ✅ REPRESENTATIVE DATASET:
# - Include spam from multiple email providers
# - Include emails from different regions/languages
# - Include new spam patterns
# - Better generalization

# Example 3: Medical Diagnosis
# ❌ BIASED DATASET (PROBLEMATIC):
# - Trained only on patients aged 40-70
# - Mostly white population
# - Only from one hospital
# - Predictions are unreliable for younger patients and other ethnic groups

# ✅ REPRESENTATIVE DATASET:
# - Diverse ages: 20s to 80s
# - Multiple ethnicities
# - Multiple hospitals and regions
# - Generalizes better across patient populations


### Types of Sampling Bias:

#### 1. **Selection Bias** - Who is sampled / How data is collected


In [None]:
# Scenario: Survey about social media usage
# ❌ "Man on the street" interview at shopping mall
#    Problem: Only surveys people out shopping (busy, tech-savvy)
#    Missing: Housebound, ill, unemployed people

# ❌ Online survey only
#    Problem: Only reaches people with internet
#    Missing: Many elderly or low-income people, regions with poor connectivity

# ✅ Random sample from addresses/phone directory
#    Ensures representative cross-section


#### 2. **Non-Response Bias** - Who doesn't respond


In [None]:
# Survey about TV watching habits
# ❌ Problem: People who watch TV respond more
#            People who don't watch (busy/outdoor-oriented) skip survey
#            Results overestimate TV watching

# ✅ Solution: 
#    - Track and analyze non-response patterns
#    - Weight results by response probability
#    - Follow up with non-responders
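
Weighting by response probability can be sketched as follows. Everything here is simulated: the two age groups, their TV hours, and their response rates are invented numbers. In a real survey the response probabilities are not known and would have to be estimated, for example from the number of people contacted in each group.

```python
import numpy as np

rng = np.random.default_rng(7)
n_contacted = 20_000

# Hypothetical population: about half 'older' (watches more TV), half 'younger'
is_older = rng.random(n_contacted) < 0.5
tv_hours = np.where(is_older,
                    rng.normal(4.0, 1.0, n_contacted),
                    rng.normal(1.5, 1.0, n_contacted)).clip(min=0)
true_mean = tv_hours.mean()

# Assumed response rates: older people answer far more often
response_prob = np.where(is_older, 0.60, 0.15)
responded = rng.random(n_contacted) < response_prob

# Naive estimate: average over whoever answered
naive_mean = tv_hours[responded].mean()

# Inverse-probability weighting: each respondent stands in for 1/p people like them
weights = 1.0 / response_prob[responded]
weighted_mean = np.average(tv_hours[responded], weights=weights)

print(f"True mean:     {true_mean:.2f} h/day")
print(f"Naive mean:    {naive_mean:.2f} h/day")
print(f"Weighted mean: {weighted_mean:.2f} h/day")
```

The naive mean overestimates TV watching because heavy watchers dominate the responses. The weighted mean lands close to the true value, but only because the response probabilities depend on a variable (age group) that is observed for everyone.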


#### 3. **Survivorship Bias** - Only include "survivors"


In [None]:
# Example: Company dataset
# ❌ BIASED: 
#    - Use data only from companies that survived
#    - Missing: failed companies that followed the same practices
#    - Conclusion: "These practices work!"
#    - Actually: survivors may owe as much to luck as to their practices

# Real example: World War II planes
# Military only had data on planes that returned
# Naive conclusion: Reinforce areas with the most bullet holes
# Actually: Planes hit in the other areas never made it back,
#           so the undamaged areas are the ones that need armor
# Bias: Survivorship bias
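
The company example can be reproduced with a small simulation in which success is pure luck. The growth distribution and the failure rule below are invented for illustration and are not calibrated to any real data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_companies, n_years = 10_000, 10

# Assumption: yearly growth is pure luck, drawn from N(0, 20%) for every company
growth = rng.normal(loc=0.0, scale=0.20, size=(n_companies, n_years))

# Hypothetical failure rule: a company goes under once its
# cumulative growth drops to -30% or below
cumulative = growth.cumsum(axis=1)
survived = (cumulative > -0.30).all(axis=1)

mean_all = growth.mean()
mean_survivors = growth[survived].mean()

print(f"Survivors: {survived.sum()} of {n_companies}")
print(f"Mean yearly growth, all companies:  {mean_all*100:+.2f}%")
print(f"Mean yearly growth, survivors only: {mean_survivors*100:+.2f}%")
```

Across all companies the mean growth is close to zero, yet the survivors show clearly positive growth. A dataset containing only survivors would suggest they did something right, even though no company here had any skill at all.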


### Detecting Sampling Bias:


In [None]:
import pandas as pd
import numpy as np

def check_for_sampling_bias(dataset, population_stats):
    """Compare dataset characteristics to known population stats"""
    
    print("=== BIAS DETECTION ===\n")
    
    for column, pop_stat in population_stats.items():
        dataset_stat = dataset[column].value_counts(normalize=True)
        
        print(f"{column}:")
        print(f"  Population: {pop_stat}")
        print(f"  Dataset:    {dict(dataset_stat)}")
        
        # Calculate divergence: sum of absolute differences in category shares
        categories = set(pop_stat) | set(dataset_stat.index)
        divergence = sum(
            abs(dataset_stat.get(k, 0) - pop_stat.get(k, 0))
            for k in categories
        )
        
        if divergence > 0.1:
            print(f"  ⚠️  BIAS DETECTED: {divergence:.2%} divergence")
        else:
            print(f"  ✅ Representative")
        print()

# Example usage
dataset = pd.DataFrame({
    'gender': ['M'] * 700 + ['F'] * 300,
    'age_group': ['20-40'] * 600 + ['40+'] * 400
})

population_stats = {
    'gender': {'M': 0.49, 'F': 0.51},  # Real population: 49% male, 51% female
    'age_group': {'20-40': 0.45, '40+': 0.55}  # Real: 45% young, 55% older
}

check_for_sampling_bias(dataset, population_stats)

# Output shows:
# Gender: 70% male vs 49% in population → BIAS!
# Age: 60% young vs 45% in population → BIAS!
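
A fixed divergence threshold ignores dataset size, so it cannot tell sampling bias from sampling noise. One way to account for size, sketched here as an option rather than as part of the function above, is a Pearson chi-square goodness-of-fit statistic. The helper name `chi_square_statistic` and the n=20 dataset are made up for illustration. The hard-coded critical value applies only to two categories (1 degree of freedom) at the 5% level.

```python
import numpy as np

def chi_square_statistic(observed_counts, population_shares):
    """Pearson chi-square statistic for observed counts vs. expected population shares."""
    categories = sorted(population_shares)
    observed = np.array([observed_counts.get(c, 0) for c in categories], dtype=float)
    expected = observed.sum() * np.array([population_shares[c] for c in categories])
    return float(((observed - expected) ** 2 / expected).sum())

# Critical value of the chi-square distribution at the 5% level, 1 degree of freedom
# (two categories); other category counts need a different critical value.
CRITICAL_5PCT_DF1 = 3.841

population = {'M': 0.49, 'F': 0.51}

# The same 70/30 split at two dataset sizes
small = chi_square_statistic({'M': 14, 'F': 6}, population)
large = chi_square_statistic({'M': 700, 'F': 300}, population)

for name, stat in (("n=20", small), ("n=1000", large)):
    verdict = "unlikely to be noise" if stat > CRITICAL_5PCT_DF1 else "consistent with noise"
    print(f"{name}: chi-square = {stat:.2f} -> {verdict}")
```

A 70/30 split among 20 people is still consistent with random sampling from a 49/51 population, while the same split among 1,000 people is not. If SciPy is available, `scipy.stats.chisquare` computes the same statistic and also returns a p-value.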


### Solutions for Sampling Bias:


In [None]:
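# Solution 1: Reweighting samples
import numpy as np
from sklearn.linear_model import LogisticRegression

# Dataset is 70% males, 30% females
# Real population is 49% males, 51% females
# (`dataset` is the DataFrame from the bias-detection cell above)

weights = np.array([
    0.49 / 0.70 if sample_gender == 'M' else 0.51 / 0.30
    for sample_gender in dataset['gender']
])

# Placeholder features, labels, and model so the cell runs;
# substitute your own X, y, and estimator
X = np.random.rand(len(dataset), 10)
y = np.random.randint(0, 2, len(dataset))
model = LogisticRegression()

# Use these weights in model training
model.fit(X, y, sample_weight=weights)
# Each group's influence now matches its share of the real population

# Solution 2: Collect more representative data
# Ensure sampling method includes all subgroups

# Solution 3: Synthetic data augmentation
# Generate synthetic samples for the underrepresented class
# (requires the imbalanced-learn package)
from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_balanced, y_balanced = smote.fit_resample(X, y)
# SMOTE balances the classes in y; it does not fix under-represented
# demographic groups in the features

# Solution 4: Domain adaptation
# Train on biased data, then adapt to the real-world distribution,
# e.g. with transfer learning or importance weighting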
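
The reweighting in Solution 1 can be generalized and sanity-checked. The helper `poststratification_weights` below is a hypothetical name, not a library function. The sketch also computes Kish's effective sample size, which shows what the reweighting costs in statistical precision.

```python
import numpy as np

def poststratification_weights(groups, population_shares):
    """Weight each sample by (population share) / (dataset share) of its group."""
    groups = np.asarray(groups)
    labels, counts = np.unique(groups, return_counts=True)
    dataset_shares = dict(zip(labels, counts / len(groups)))
    return np.array([population_shares[g] / dataset_shares[g] for g in groups])

gender = np.array(['M'] * 700 + ['F'] * 300)
weights = poststratification_weights(gender, {'M': 0.49, 'F': 0.51})

# Sanity check: the weighted share of each group should match the population
weighted_share_m = weights[gender == 'M'].sum() / weights.sum()

# Kish's effective sample size: how many equally weighted samples
# the reweighted data is worth
effective_n = weights.sum() ** 2 / (weights ** 2).sum()

print(f"Weighted share of M: {weighted_share_m:.2f}")
print(f"Effective sample size: {effective_n:.0f} of {len(gender)}")
```

The weighted share of males matches the 49% population figure, but the 1,000 samples now carry the information of roughly 826 equally weighted ones. The further the dataset is from the population, the larger the weights and the bigger this loss, which is why collecting representative data (Solution 2) is preferable when it is feasible.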


---
