#### Exercise 1
<!-- @q -->

1. What kinds of EDA techniques might you use to explore the following types of data:
    - Numeric data?
        - Descriptive statistics (mean, median, mode, std, min, max, quartiles)
        - Distribution analysis with histograms and box plots
        - Density plots and Q-Q plots for checking normality
        - Correlation analysis with correlation matrices and heatmaps
        - Scatter plots for relationships between variables
        - Outlier detection using the IQR method or z-scores

    - Categorical data?
        - Frequency tables and value counts
        - Bar charts and count plots for visualization
        - Cross-tabulation for relationships between categorical variables
        - Pie charts for proportion visualization
        - Chi-square tests for independence testing

    - The relationship between categorical and numeric data?
        - Group-by statistics (mean, median by category)
        - Box plots comparing numeric values across categories
        - Violin plots for distribution comparison
        - ANOVA tests for statistical significance
        - Scatter plots with categorical encoding (colors/shapes)
        - Pivot tables for summarization


2. Generate some fake data (~1000 rows) with 1 categorical column (with 10 categories) and 2 numeric columns. Use the techniques you mentioned to explore the numeric, categorical, and the relationship between them.

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')



In [3]:
# Your code here
# Generate fake data with ~1000 rows
np.random.seed(42)
n_samples = 1000

# 1 categorical column with 10 categories
categories = ['Category_A', 'Category_B', 'Category_C', 'Category_D', 'Category_E', 
              'Category_F', 'Category_G', 'Category_H', 'Category_I', 'Category_J']
categorical_data = np.random.choice(categories, n_samples)

# 2 numeric columns
numeric_col1 = np.random.normal(50, 15, n_samples)  # mean=50, std=15
numeric_col2 = np.random.exponential(2, n_samples)  # exponential distribution

# Create DataFrame
df_eda = pd.DataFrame({
    'categorical_feature': categorical_data,
    'numeric_feature_1': numeric_col1,
    'numeric_feature_2': numeric_col2
})

print(f"Dataset shape: {df_eda.shape}")
print("\nFirst 5 rows:")
print(df_eda.head())


Dataset shape: (1000, 3)

First 5 rows:
  categorical_feature  numeric_feature_1  numeric_feature_2
0          Category_G          63.017736           0.205312
1          Category_D          35.952290           3.960813
2          Category_H          36.867125           1.232973
3          Category_E          53.069153           2.766867
4          Category_G          40.960283           0.686636


In [9]:
# Your code here
# EDA for Numeric Data
print("1. EDA FOR NUMERIC DATA:")
print("="*30)

# Descriptive statistics
print("Descriptive Statistics:")
print(df_eda[['numeric_feature_1', 'numeric_feature_2']].describe())

print("\nCorrelation Matrix:")
correlation_matrix = df_eda[['numeric_feature_1', 'numeric_feature_2']].corr()
print(correlation_matrix)

# Outlier detection using IQR method for numeric_feature_1
Q1 = df_eda['numeric_feature_1'].quantile(0.25)
Q3 = df_eda['numeric_feature_1'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df_eda[(df_eda['numeric_feature_1'] < lower_bound) | 
                  (df_eda['numeric_feature_1'] > upper_bound)]

print(f"\nOutlier Analysis for numeric_feature_1:")
print(f"Number of outliers detected: {len(outliers)}")
print(f"Percentage of outliers: {len(outliers)/len(df_eda)*100:.2f}%")


1. EDA FOR NUMERIC DATA:
Descriptive Statistics:
       numeric_feature_1  numeric_feature_2
count        1000.000000        1000.000000
mean           50.156663           2.065296
std            15.363112           2.094032
min            -3.523079           0.000377
25%            40.271307           0.585889
50%            50.410394           1.478348
75%            60.953650           2.861854
max            97.129505          16.317668

Correlation Matrix:
                   numeric_feature_1  numeric_feature_2
numeric_feature_1           1.000000           0.008202
numeric_feature_2           0.008202           1.000000

Outlier Analysis for numeric_feature_1:
Number of outliers detected: 6
Percentage of outliers: 0.60%
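
The answer to question 1 also lists z-scores for outlier detection, which the cell above does not use. A minimal sketch comparing the two rules follows. The helper names `zscore_outliers` and `iqr_outliers`, the demo data, and the planted value of 500 are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return the values whose absolute z-score exceeds `threshold`."""
    z = (series - series.mean()) / series.std(ddof=0)
    return series[z.abs() > threshold]


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]


# Illustrative data: roughly normal values plus one planted extreme value.
rng = np.random.default_rng(0)
demo = pd.Series(rng.normal(50, 15, 1000), name="numeric_feature_1")
demo.iloc[0] = 500.0

z_out = zscore_outliers(demo)
iqr_out = iqr_outliers(demo)
print(f"z-score rule flags {len(z_out)} rows, IQR rule flags {len(iqr_out)} rows")
```

The z-score rule depends on the mean and standard deviation, which extreme values themselves inflate, while the IQR rule relies on quartiles and is less affected. On skewed columns such as `numeric_feature_2` the two rules can disagree noticeably.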


In [10]:
# Your code here
# EDA for Categorical Data
print("2. EDA FOR CATEGORICAL DATA:")
print("="*32)

# Frequency tables and value counts
print("Value Counts:")
print(df_eda['categorical_feature'].value_counts())

print("\nProportions:")
proportions = df_eda['categorical_feature'].value_counts(normalize=True)
print(proportions)

print(f"\nNumber of unique categories: {df_eda['categorical_feature'].nunique()}")
# mode() returns a Series (ties are possible), so take its first element
print(f"Most frequent category: {df_eda['categorical_feature'].mode().iloc[0]}")

# Create a frequency table
freq_table = pd.DataFrame({
    'Category': df_eda['categorical_feature'].value_counts().index,
    'Frequency': df_eda['categorical_feature'].value_counts().values,
    'Percentage': df_eda['categorical_feature'].value_counts(normalize=True).values * 100
})
print("\nFrequency Table:")
print(freq_table)


2. EDA FOR CATEGORICAL DATA:
Value Counts:
categorical_feature
Category_A    118
Category_C    110
Category_E    107
Category_J    107
Category_H    100
Category_F     96
Category_G     94
Category_D     94
Category_I     91
Category_B     83
Name: count, dtype: int64

Proportions:
categorical_feature
Category_A    0.118
Category_C    0.110
Category_E    0.107
Category_J    0.107
Category_H    0.100
Category_F    0.096
Category_G    0.094
Category_D    0.094
Category_I    0.091
Category_B    0.083
Name: proportion, dtype: float64

Number of unique categories: 10
Most frequent category: Category_A

Frequency Table:
     Category  Frequency  Percentage
0  Category_A        118        11.8
1  Category_C        110        11.0
2  Category_E        107        10.7
3  Category_J        107        10.7
4  Category_H        100        10.0
5  Category_F         96         9.6
6  Category_G         94         9.4
7  Category_D         94         9.4
8  Category_I         91         9.1
9  Category_B         83         8.3
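
With only one categorical column, the chi-square test of independence listed in question 1 does not apply directly. A related check that does apply is a chi-square goodness-of-fit test, which asks whether the observed counts are consistent with the ten categories being equally likely. The sketch below uses the value counts printed above. The use of `scipy.stats.chisquare` is an assumption, since the original notebook does not import SciPy.

```python
from scipy.stats import chisquare

# Observed counts copied from the value_counts() output above (A, C, E, J, H, F, G, D, I, B).
observed = [118, 110, 107, 107, 100, 96, 94, 94, 91, 83]

# With no expected frequencies given, chisquare tests against a uniform distribution,
# i.e. 100 expected rows per category here.
stat, p_value = chisquare(observed)

print(f"chi-square statistic: {stat:.2f}")  # sum((obs - 100)^2 / 100) = 9.80
print(f"p-value: {p_value:.3f}")
```

A p-value well above 0.05 is what one would expect here, since the categories were drawn uniformly with `np.random.choice`.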

In [11]:
# Your code here
# EDA for Relationship between Categorical and Numeric Data
print("3. EDA FOR RELATIONSHIP BETWEEN CATEGORICAL AND NUMERIC DATA:")
print("="*65)

# Group-by statistics
print("Group-by Statistics for numeric_feature_1 by categorical_feature:")
group_stats1 = df_eda.groupby('categorical_feature')['numeric_feature_1'].agg([
    'count', 'mean', 'median', 'std', 'min', 'max'
]).round(3)
print(group_stats1)

print("\nGroup-by Statistics for numeric_feature_2 by categorical_feature:")
group_stats2 = df_eda.groupby('categorical_feature')['numeric_feature_2'].agg([
    'count', 'mean', 'median', 'std', 'min', 'max'
]).round(3)
print(group_stats2)

# Create pivot table
print("\nPivot Table (Mean values):")
pivot_table = df_eda.pivot_table(
    values=['numeric_feature_1', 'numeric_feature_2'], 
    index='categorical_feature', 
    aggfunc='mean'
).round(3)
print(pivot_table)


3. EDA FOR RELATIONSHIP BETWEEN CATEGORICAL AND NUMERIC DATA:
Group-by Statistics for numeric_feature_1 by categorical_feature:
                     count    mean  median     std     min     max
categorical_feature                                               
Category_A             118  49.656  50.262  14.655  13.173  86.808
Category_B              83  53.515  51.664  17.955  12.060  94.193
Category_C             110  52.129  51.513  13.914  25.607  87.686
Category_D              94  49.804  51.382  15.425  10.291  85.455
Category_E             107  50.372  52.235  14.461   7.907  96.411
Category_F              96  51.481  50.572  14.816  12.972  89.376
Category_G              94  50.471  50.204  14.320   9.313  84.063
Category_H             100  48.700  49.129  17.083  -0.826  97.130
Category_I              91  46.977  48.772  14.894  15.590  88.461
Category_J             107  48.771  48.968  15.998  -3.523  87.021

Group-by Statistics for numeric_feature_2 by categorical_feature:
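
The ANOVA test listed in the answer to question 1 is not run in the cells above. A minimal sketch of a one-way ANOVA across categories follows. It rebuilds a small stand-in for `df_eda` with a NumPy `Generator`, so the numbers will not match the notebook's data, and `scipy.stats.f_oneway` is an assumption because the notebook does not import SciPy.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Stand-in data with the same column names as df_eda (values differ from the notebook).
rng = np.random.default_rng(42)
labels = [f"Category_{c}" for c in "ABCDEFGHIJ"]
demo = pd.DataFrame({
    "categorical_feature": rng.choice(labels, size=1000),
    "numeric_feature_1": rng.normal(50, 15, size=1000),
})

# One array of numeric values per category.
groups = [
    g["numeric_feature_1"].to_numpy()
    for _, g in demo.groupby("categorical_feature")
]

# One-way ANOVA: the null hypothesis is that all category means are equal.
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

Because the numeric column is generated independently of the category, a large p-value is the expected outcome most of the time. A small one would suggest that at least one category mean differs.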
 

#### Exercise 2


Generate a data set you can use with a supervised ML model. The data should meet the following criteria:
   - It should have 1000 rows
   - It should have 6 columns: one boolean column (your "target" column), one categorical column with 5 categories, and 4 numeric columns.
   - The numeric columns should have dramatically different scales - different means, different standard deviations.
   - Each non-target column should have about 5% nulls.

Make this data a little more interesting by calculating the target column using a noisy function of the other columns.

In [13]:
# Your code here
# Generate dataset for supervised ML
np.random.seed(42)
n_samples = 1000

# Generate 4 numeric columns with dramatically different scales
numeric_col1 = np.random.normal(10, 2, n_samples)           # Small scale: mean=10, std=2
numeric_col2 = np.random.normal(1000, 200, n_samples)      # Medium scale: mean=1000, std=200  
numeric_col3 = np.random.normal(0.01, 0.005, n_samples)    # Very small scale: mean=0.01, std=0.005
numeric_col4 = np.random.normal(50000, 15000, n_samples)   # Large scale: mean=50000, std=15000

# Generate 1 categorical column with 5 categories
categories = ['Type_A', 'Type_B', 'Type_C', 'Type_D', 'Type_E']
categorical_col = np.random.choice(categories, n_samples)

# Create DataFrame with all features
df = pd.DataFrame({
    'numeric_1': numeric_col1,
    'numeric_2': numeric_col2, 
    'numeric_3': numeric_col3,
    'numeric_4': numeric_col4,
    'categorical': categorical_col
})


In [14]:
# Your code here
# Add 5% nulls to each non-target column
def add_nulls(series, null_percentage=0.05):
    n_nulls = int(len(series) * null_percentage)
    null_indices = np.random.choice(len(series), n_nulls, replace=False)
    series_copy = series.copy()
    series_copy.iloc[null_indices] = np.nan
    return series_copy

df['numeric_1'] = add_nulls(df['numeric_1'])
df['numeric_2'] = add_nulls(df['numeric_2'])
df['numeric_3'] = add_nulls(df['numeric_3'])
df['numeric_4'] = add_nulls(df['numeric_4'])
df['categorical'] = add_nulls(df['categorical'])

# Create target column using a noisy function of other columns
df_no_nulls = df.fillna(df.mean(numeric_only=True))  # Fill numeric nulls with mean
df_no_nulls['categorical'] = df_no_nulls['categorical'].fillna('Type_A')  # Fill categorical nulls

def create_target(row):
    # Normalize features to similar scales for calculation
    norm_n1 = (row['numeric_1'] - 10) / 2          
    norm_n2 = (row['numeric_2'] - 1000) / 200      
    norm_n3 = (row['numeric_3'] - 0.01) / 0.005    
    norm_n4 = (row['numeric_4'] - 50000) / 15000   
    
    # Categorical effect
    cat_effect = {'Type_A': 0.2, 'Type_B': -0.1, 'Type_C': 0.0, 'Type_D': 0.3, 'Type_E': -0.2}
    cat_val = cat_effect[row['categorical']]
    
    # Complex function with interactions
    score = (0.3 * norm_n1 + 
             0.2 * norm_n2 + 
             -0.4 * norm_n3 + 
             0.1 * norm_n4 + 
             cat_val +
             0.2 * norm_n1 * norm_n2 +  # interaction term
             0.1 * norm_n3 * norm_n4)   # another interaction
    
    # Add noise
    noise = np.random.normal(0, 0.5)
    score_with_noise = score + noise
    
    # Convert to boolean (threshold at 0)
    return score_with_noise > 0

# Apply function to create target
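# Reset seed for reproducible noise.
# Caveat: seed 42 was also set right before numeric_col1 was drawn, so these draws
# replay the same random stream and the "noise" tracks numeric_1 instead of being
# independent of it. Using a different seed here would make the noise independent.
np.random.seed(42)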
df['target'] = df_no_nulls.apply(create_target, axis=1)

print(f"Target distribution:")
print(df['target'].value_counts())
print(f"\nFinal dataset shape: {df.shape}")


Target distribution:
target
True     512
False    488
Name: count, dtype: int64

Final dataset shape: (1000, 6)
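
The row-wise `apply` above can also be written with vectorized column arithmetic. The sketch below is one way to do that, under stated assumptions: `make_target`, `CAT_EFFECT`, and the two generator seeds are hypothetical names and values, and it swaps in `np.random.default_rng` with a separate generator for the noise (rather than `np.random.seed`) so the noise is drawn independently of the features.

```python
import numpy as np
import pandas as pd

# Same per-category offsets as in create_target above.
CAT_EFFECT = {"Type_A": 0.2, "Type_B": -0.1, "Type_C": 0.0, "Type_D": 0.3, "Type_E": -0.2}


def make_target(filled: pd.DataFrame, rng: np.random.Generator, noise_sd: float = 0.5) -> pd.Series:
    """Boolean target from null-free features plus independently drawn Gaussian noise."""
    z1 = (filled["numeric_1"] - 10) / 2
    z2 = (filled["numeric_2"] - 1000) / 200
    z3 = (filled["numeric_3"] - 0.01) / 0.005
    z4 = (filled["numeric_4"] - 50000) / 15000
    cat = filled["categorical"].map(CAT_EFFECT)
    score = (0.3 * z1 + 0.2 * z2 - 0.4 * z3 + 0.1 * z4 + cat
             + 0.2 * z1 * z2 + 0.1 * z3 * z4)
    noise = rng.normal(0, noise_sd, size=len(filled))
    return (score + noise) > 0


# Separate generators for features and noise (seeds are arbitrary choices).
feature_rng = np.random.default_rng(42)
noise_rng = np.random.default_rng(7)
n = 1000
demo = pd.DataFrame({
    "numeric_1": feature_rng.normal(10, 2, n),
    "numeric_2": feature_rng.normal(1000, 200, n),
    "numeric_3": feature_rng.normal(0.01, 0.005, n),
    "numeric_4": feature_rng.normal(50000, 15000, n),
    "categorical": feature_rng.choice(list(CAT_EFFECT), n),
})

target = make_target(demo, noise_rng)
print(target.value_counts())
```

The class balance should again be close to 50/50, because the score is roughly symmetric around zero.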


#### Exercise 3

Use whatever resources you need to figure out how to build an SKLearn ML pipeline. Use a pipeline to build an ML approach to predicting your target column in the preceding data with logistic regression. I have set up the problem below so that you will write your code in a function that takes an SKLearn model and a data frame and returns the results of a cross-validation scoring routine.

I have not taught you how to do this; use the book, Google, the notes, ChatGPT, or whatever. This is a test of your ability to *find* information and use it to construct a solution. Your solution should:

- Use a transformer pipeline that processes your numeric and categorical features separately
- Place everything in a pipeline with the classifier that is passed in to the function.
- I've already implemented the call to cross_val_score - to make it work, you'll need to assign your pipeline to the `pipeline` variable.

_Note: You could just feed this question to AI and get an answer, and chances are, it will be right. But if you do, you won't really learn much. So, be thoughtful in your use of AI here - you can use it to build the solution step by step, and it will explain how everything works. It's all in how you use it. So, it's your choice - go for the easy grade, or learn something._

In [15]:
# --- Imports (already done above)

def run_classifier(df,classifier):
    # Separate features/target
    y = df["target"].astype(int) # logistic expects numeric; 0/1 from boolean
    X = df.drop(columns=["target"])
    
    # Identify numeric and categorical columns
    numeric_features = X.select_dtypes(include=['float64', 'int64']).columns.tolist()
    categorical_features = X.select_dtypes(include=['object']).columns.tolist()
    
    # Create preprocessing pipelines for numeric and categorical data
    
    # Numeric preprocessing: impute missing values + scale
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),  # Handle missing values
        ('scaler', StandardScaler())  # Standardize features
    ])
    
    # Categorical preprocessing: impute missing values + one-hot encode
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Handle missing values
        ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))  # One-hot encode
    ])
    
    # Combine preprocessing steps
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ]
    )
    
    # Create the complete pipeline with preprocessing + classifier
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', classifier)
    ])
    
    # --- 5-fold CV using F1
    return cross_val_score(pipeline, X, y, scoring="f1", cv=5)

# Test with Logistic Regression
scores = run_classifier(df,LogisticRegression(random_state=42))
print(f"F1 (5-fold): mean={scores.mean():.3f}, std={scores.std():.3f}")
print("Fold scores:", np.round(scores, 3))


F1 (5-fold): mean=0.948, std=0.018
Fold scores: [0.956 0.937 0.956 0.92  0.971]
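
A single F1 number can hide how a model behaves on other metrics. As a hedged sketch, `cross_validate` (a sibling of `cross_val_score`) can score the same kind of pipeline on several metrics at once. The small data set below, its column names (`x_small`, `x_large`, `group`), and the target rule are made up for illustration and are not the notebook's `df`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data: two numeric columns on very different scales and one categorical.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "x_small": rng.normal(10, 2, n),
    "x_large": rng.normal(50000, 15000, n),
    "group": rng.choice(["a", "b", "c"], n),
})
signal = (X["x_small"] - 10) / 2 + (X["x_large"] - 50000) / 15000
y = (signal + rng.normal(0, 0.5, n) > 0).astype(int)

# Knock out a few numeric values so the imputer has something to do.
X.loc[rng.choice(n, 15, replace=False), "x_small"] = np.nan

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer([
    ("num", numeric, ["x_small", "x_large"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["group"]),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression()),
])

# cross_validate returns a dict with one "test_<metric>" array per requested metric.
results = cross_validate(pipeline, X, y, cv=5, scoring=["f1", "accuracy"])
print("F1 per fold:      ", np.round(results["test_f1"], 3))
print("Accuracy per fold:", np.round(results["test_accuracy"], 3))
```

Because preprocessing lives inside the pipeline, the imputer and scaler are refit on each training fold, which keeps statistics from the held-out fold out of training.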


Try using a `RandomForestClassifier` in the preceding pipeline. Just call `run_classifier` with a `RandomForestClassifier`, and print out the results as above.

In [16]:
# Your code here
# Test with Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

scores_rf = run_classifier(df, RandomForestClassifier(random_state=42))
print(f"F1 (5-fold): mean={scores_rf.mean():.3f}, std={scores_rf.std():.3f}")
print("Fold scores:", np.round(scores_rf, 3))


F1 (5-fold): mean=0.929, std=0.021
Fold scores: [0.936 0.906 0.946 0.902 0.956]


Normally, `RandomForestClassifier`s are considered to be more powerful than `LogisticRegression`.  Depending on your data, this may or may not be the case. Reflect on your answers - which one does better here, and why do you think that is?  Once again, you might use AI, but you should probably also try to _understand_ the answer.

In this specific case, Logistic Regression performed better than Random Forest (mean F1 of 0.948 versus 0.929). Here are the likely reasons:

1. Linear Relationships:
   The target variable was created by thresholding a linear combination of normalized features plus noise. Since the underlying relationship is fundamentally linear (with two interaction terms), Logistic Regression is well suited to capture this pattern directly.

2. Feature Engineering in Data Generation:
   - Features were normalized before creating the target
   - The function used linear combinations with fixed weights
   - Interaction terms were simple multiplicative relationships
   - This structure favors linear models like Logistic Regression

3. Dataset Characteristics:
   - A small dataset (1000 samples) gives Random Forest limited data from which to learn fine-grained splits
   - Features have clear, interpretable relationships with the target
   - The data generation process contains few non-linear patterns

4. Random Forest Limitations:
   - Random Forest excels with complex, non-linear relationships and feature interactions, which are largely absent here
   - It might be overfitting somewhat at the current dataset size
   - The bootstrap sampling in Random Forest might be adding unnecessary variance to a relatively straightforward problem

5. Preprocessing Impact:
   - StandardScaler helps Logistic Regression handle the very different feature scales effectively
   - Random Forest is insensitive to feature scale, so it gains little from this preprocessing
   - The preprocessing pipeline is therefore better matched to linear models

General Principle:
While Random Forest is often more powerful for complex, real-world datasets with non-linear relationships, simpler linear models can outperform it when the underlying relationships are indeed linear or when the dataset is small. This demonstrates the importance of understanding your data and trying multiple approaches rather than assuming more complex models are always better.
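
To back up the comparison with the numbers reported above, the per-fold F1 scores of the two models can be compared fold by fold. Both runs used `cv=5` on the same `X` and `y`, and for a classifier scikit-learn then uses unshuffled stratified folds, so fold *i* should be the same split in both runs. A small sketch of the arithmetic:

```python
import numpy as np

# Fold scores copied from the two outputs above.
lr_folds = np.array([0.956, 0.937, 0.956, 0.920, 0.971])  # LogisticRegression
rf_folds = np.array([0.936, 0.906, 0.946, 0.902, 0.956])  # RandomForestClassifier

# Per-fold difference (positive means Logistic Regression scored higher).
diff = lr_folds - rf_folds
print("Per-fold difference:", np.round(diff, 3))
print(f"Mean difference: {diff.mean():.4f}")
print("Logistic Regression ahead on every fold:", bool((diff > 0).all()))
```

Logistic Regression is ahead on all five folds, by about 0.019 F1 on average. Five folds is a small sample, so this is supporting evidence for the ranking rather than a formal significance test.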