# Data Preprocessing & Feature Engineering Tutorial

Welcome to this **guided demo and exercise** notebook. In this tutorial, we'll explore:

1. **Loading a small dataset** with missing values, outliers, and categorical columns.  
2. **Identifying** and **handling** missing data (e.g., dropping or imputing).  
3. **Detecting** outliers and applying a basic clipping strategy.  
4. **Encoding** categorical features.  
5. **Creating** new features from existing columns.

The sample CSV file, loaded below from `./datasets/week-2_sample_data.csv`, represents a small user dataset with columns like:

- `user_id`: an identifier (might not be used in modeling)
- `age`: numeric, can contain missing values
- `income`: numeric, can have outliers
- `city`: categorical
- `purchases`: numeric (count of purchases)
- `score`: a numeric rating, such as a credit score

---

## 1. Imports & Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


## 2. Load the Dataset

We'll define a short function to read the CSV into a DataFrame.

In [None]:
DATA_PATH = "./datasets/week-2_sample_data.csv" 

def load_data():
    """
    Reads a CSV, returns the DataFrame.
    Columns: ['user_id','age','income','city','purchases','score']
    """
    df = pd.read_csv(DATA_PATH)
    return df

# Demo
if __name__ == "__main__":
    df_raw = load_data()
    display(df_raw.head(10))
    print(df_raw.info())

## 3. Handling Missing Data

### 3.1 Identify Missing Values

Check how many `NaN` or `None` entries each column has, then choose a strategy for each affected column (drop vs. fill).

**Practice**:  
1. Print `df.isna().sum()` to see missing counts.  
2. Decide how to handle missing `age` or other columns.

*(No right/wrong approach, but let’s do something simple for demonstration.)*
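
Before touching the real file, it can help to see both strategies on a tiny example. The sketch below uses a hypothetical `toy` DataFrame with made-up values (it is not `sample_data.csv`) to show how `isna().sum()` counts gaps and how dropping and filling differ in the number of rows they keep.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, np.nan],
    "city": ["Paris", None, "Tokyo", "Paris"],
    "income": [50000, 62000, 58000, 61000],
})

# Count missing entries per column; both NaN and None are counted.
missing_counts = toy.isna().sum()
print(missing_counts)

# Option A: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option B: keep all rows and fill the gaps instead
# (median for the numeric column, a placeholder label for the text column).
filled = toy.fillna({"age": toy["age"].median(), "city": "Unknown"})

print(len(toy), len(dropped), len(filled))
```

Dropping is simple but loses rows; filling keeps every row at the cost of introducing guessed values. Which trade-off is acceptable depends on how much data is missing.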

In [None]:
def handle_missing(df):
    """
    Example function that:
    1. Fills or drops missing values
    2. Demonstrates a simple approach for missing 'age'
    """
    # TODO: fill in your approach
    # e.g. fill 'age' with median:
    median_age = df['age'].median()
    df['age'] = df['age'].fillna(median_age)
    
    # If 'city' is missing, maybe fill with 'Unknown':
    df['city'] = df['city'].fillna('Unknown')
    
    return df

if __name__ == "__main__":
    df_demo = load_data()
    print("Before handling missing values:")
    print(df_demo.isna().sum())
    
    df_demo = handle_missing(df_demo)
    print("\nAfter handling missing values:")
    print(df_demo.isna().sum())


## Don't panic!
The error above is expected and part of the learning process. The error message says that the values in `age` cannot be converted to a numeric type. This happens because missing values are stored in the file as text placeholders such as 'nan'/'na', so pandas treats the column as text instead of numbers. This is common in real-world data. To handle it, we can convert the column to numeric first:

In [None]:
def handle_missing(df):
    """
    Example function that:
    1. Fills or drops missing values
    2. Demonstrates a simple approach for missing 'age'
    """
    # Convert 'age' to numeric, forcing invalid strings to NaN
    df['age'] = pd.to_numeric(df['age'], errors='coerce')

    # Explanation:
    #   - This will turn '12' -> 12, 'na' -> NaN, '' -> NaN, etc.
    #   - The column becomes float64 when any NaN is present
    #     (int64 only if every value parses as a whole number).

    # e.g. fill 'age' with median:
    median_age = df['age'].median()
    df['age'] = df['age'].fillna(median_age)
    
    # If 'city' is missing, maybe fill with 'Unknown':
    df['city'] = df['city'].fillna('Unknown')
    
    return df

df_demo = handle_missing(df_demo)
print("\nAfter handling missing values:")
print(df_demo.isna().sum())
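
To see the coercion step in isolation, here is a minimal sketch on a hypothetical `raw_age` Series (made-up values, stored as text the way a messy CSV column might be). It only uses `pd.to_numeric` with `errors="coerce"`, as in the function above.

```python
import pandas as pd

# Hypothetical raw values: numbers stored as text, plus placeholder strings.
raw_age = pd.Series(["34", "na", "27", "", "41"], dtype=object)

# errors="coerce" turns anything unparseable into NaN instead of raising.
age = pd.to_numeric(raw_age, errors="coerce")
print(age)

# The median skips NaN by default, so it can be used to fill the gaps.
median_age = age.median()
age_filled = age.fillna(median_age)
print(age_filled)
```

Note that coercion silently discards whatever the bad strings contained, so it is worth checking how many values became `NaN` before filling them.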


## 4. Detecting & Handling Outliers

We'll assume `income` might have outliers. A simple approach:

1. Plot a histogram or boxplot to see distribution.  
2. Decide on a cap or method (e.g., clip at 95th percentile).

**Practice**:  
- Use `df['income'].describe()` or a boxplot.  
- Clip incomes above e.g. 100000 to exactly 100000, if that’s your chosen approach.
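
The percentile-based cap mentioned above can be sketched as follows. The `income` values are hypothetical (20 evenly spaced incomes plus one extreme value), and `Series.clip(upper=...)` is used here as an alternative to a fixed cap such as 100000.

```python
import pandas as pd

# Hypothetical incomes: 30000, 35000, ..., 125000, plus one extreme value.
income = pd.Series(list(range(30000, 130000, 5000)) + [2_000_000])

# Use the 95th percentile of the data as the cap instead of a fixed number.
cap = income.quantile(0.95)

# Series.clip(upper=...) replaces values above the cap with the cap itself
# and leaves everything else unchanged.
income_clipped = income.clip(upper=cap)

print("cap:", cap)
print("max before:", income.max(), "max after:", income_clipped.max())
```

A percentile cap adapts to the data, but on very small samples the percentile itself can be pulled upward by the outlier, so it is worth printing the cap and checking that it looks plausible.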


In [None]:
def clip_outliers(df, col='income', cap=100000):
    """
    Example function that clips 'income' at a certain cap.
    """
    # TODO: implement your outlier strategy
    df[col] = np.where(df[col] > cap, cap, df[col])
    return df

if __name__ == "__main__":
    df_demo = load_data()
    # A quick distribution check
    df_demo['income'].hist(bins=20)
    plt.title("Income Distribution")
    plt.show()
    
    # clip approach
    df_demo = clip_outliers(df_demo, col='income', cap=100000)
    # see result
    df_demo['income'].hist(bins=20)
    plt.title("Income Distribution After Clipping")
    plt.show()
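
Besides eyeballing a histogram, outliers can be flagged numerically. The sketch below uses the 1.5 × IQR rule, which is also how a default matplotlib boxplot places its whiskers. This rule is not part of the tutorial's own code; it is shown as one possible detection step, on hypothetical income values.

```python
import pandas as pd

# Hypothetical incomes with one extreme value (made-up numbers).
income = pd.Series(
    [30000, 42000, 51000, 58000, 64000, 70000, 75000, 82000, 90000, 1500000]
)

# Interquartile range: the spread of the middle 50% of the data.
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (income < lower) | (income > upper)

print(income[is_outlier])
```

Flagging first and clipping second makes it easy to inspect the affected rows before changing them.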



## 5. Encoding Categorical Features

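- Two common options are **label encoding** (one integer per category) and **one-hot encoding** (one indicator column per category, e.g. with `pd.get_dummies()`).
- If `df['city']` contains values like `New York`, `Paris`, and `Tokyo`, we can one-hot encode it with: `df = pd.get_dummies(df, columns=['city'])`.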

**Practice**:  
- Create a function that one-hot encodes `city`.  
- Or do label encoding if you prefer.
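
For comparison, here is a minimal sketch of both encodings on a hypothetical `city` Series. Label encoding is done with pandas' `category` dtype, which is one of several ways to do it; this is an illustrative choice, not the tutorial's prescribed method.

```python
import pandas as pd

# Hypothetical city values (made up for illustration).
city = pd.Series(["New York", "Paris", "Tokyo", "Paris", "New York"], name="city")

# Label encoding: one integer code per distinct city.
# String categories are sorted alphabetically, so the codes are reproducible.
city_cat = city.astype("category")
city_code = city_cat.cat.codes
code_to_city = dict(enumerate(city_cat.cat.categories))

# One-hot encoding: one indicator column per distinct city.
city_onehot = pd.get_dummies(city, prefix="city")

print(city_code.tolist())
print(code_to_city)
print(city_onehot)
```

Label codes imply an ordering (0 < 1 < 2) that cities do not really have, which can mislead some models; one-hot encoding avoids that at the cost of extra columns.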


In [None]:
def encode_city(df):
    """
    Example: one-hot encode 'city'
    This might create columns city_New York, city_Paris, city_Tokyo, etc.
    """
    # drop_first=False keeps one indicator column for every city
    df = pd.get_dummies(df, columns=['city'], drop_first=False)
    return df

if __name__ == "__main__":
    df_demo = load_data()
    df_demo = encode_city(df_demo)
    display(df_demo.head())


## 6. Creating New Features

Examples:
- **`df['family_size'] = df['SibSp'] + df['Parch'] + 1`** (as in the Titanic dataset; these columns are not in our data).  
- **`df['purchases_per_income'] = df['purchases'] / (df['income'] + 1)`** (purchase count relative to income; the `+ 1` avoids division by zero).
- **`df['log_score'] = np.log1p(df['score'])`** if `score` is large or skewed.

**Practice**:  
- Add a new feature based on existing columns. 
- Inspect correlation or distribution.
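
As a rough illustration of the practice steps, the sketch below adds two of the features listed above to a hypothetical `toy` DataFrame (made-up values) and then inspects their correlations with `DataFrame.corr()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up); note the income of 0.
toy = pd.DataFrame({
    "purchases": [1, 3, 0, 12, 7],
    "income": [30000, 52000, 0, 88000, 61000],
    "score": [580, 640, 700, 810, 720],
})

# Adding 1 to the denominator avoids division by zero when income is 0.
toy["purchases_per_income"] = toy["purchases"] / (toy["income"] + 1)

# log1p(x) = log(1 + x): compresses large values and is defined at x = 0.
toy["log_score"] = np.log1p(toy["score"])

# Pairwise Pearson correlations between all numeric columns.
corr = toy.corr()
print(corr.round(2))
```

A new feature that is almost perfectly correlated with an existing column (as `log_score` is likely to be with `score`) adds little new information on its own.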


In [None]:
def create_features(df):
    """
    Example: create 'purchases_per_income'
    """
    # The +1 in the denominator avoids division by zero when income is 0
    df['purchases_per_income'] = df['purchases'] / (df['income'] + 1)
    return df

if __name__ == "__main__":
    df_demo = load_data()
    df_demo = create_features(df_demo)
    display(df_demo.head(10))


## 7. Putting It All Together: A Pipeline

We can define a single function that:

1. **Loads** data
2. **Handles** missing values
3. **Clips** outliers
4. **Encodes** categories
5. **Creates** new features
6. Returns the final cleaned DataFrame

Use the sub-steps from above. 


In [None]:
def preprocess_data():
    df = load_data()
    df = handle_missing(df)
    df = clip_outliers(df, col='income', cap=100000)
    df = encode_city(df)
    df = create_features(df)
    return df

if __name__ == "__main__":
    final_df = preprocess_data()
    display(final_df.head())
    print(final_df.info())

## 8. Next Steps

- **Scale** numeric columns (e.g., StandardScaler) if training a model.  
- **Split** the data into train and test sets, then check whether the new features help.  
- Optionally save the cleaned data as a final artifact: `final_df.to_csv("cleaned_data.csv", index=False)`.
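
The scaling and splitting steps could look roughly like this. To keep the sketch dependency-free it uses plain pandas rather than scikit-learn: a hand-written z-score (mean 0, population standard deviation 1, the same convention StandardScaler uses) and `DataFrame.sample` for the split. The `toy` frame is hypothetical and stands in for `final_df`.

```python
import pandas as pd

# Hypothetical cleaned frame (made-up values), standing in for final_df.
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 44, 36, 60, 23],
    "income": [31000, 45000, 72000, 80000, 56000,
               39000, 68000, 52000, 95000, 28000],
})

# Split first so that test rows do not influence the scaling statistics.
train = toy.sample(frac=0.8, random_state=0)
test = toy.drop(train.index)

# Z-score scaling using the training mean and population std (ddof=0).
mean = train.mean()
std = train.std(ddof=0)
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.describe().round(2))
```

Reusing the training statistics for the test set keeps the evaluation honest: the test rows are scaled exactly as unseen data would be.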

**End of Tutorial**  
Feel free to experiment with your own transformations and methods!