
# 0.2 Data Science and Data Engineering (part two)
The previous notebook ([0.1](https://github.com/MJMortensonWarwick/data_engineering_for_data_scientists/blob/main/0_1_Data_Science_and_Data_Engineering_p1.ipynb)) showed some of the issues you might find in a data scientist's notebook, compared with how a data engineer might approach the same task. This notebook shows you an alternative implementation!

In [None]:
# Import libraries (in a real project these dependencies would be listed in a requirements.txt file)
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import logging
from dataclasses import dataclass
from typing import Tuple

Rather than a bunch of `print()` statements, our data engineering approach writes its messages through the `logging` module. Each message carries a timestamp and a severity level and can be directed to a standalone log, so we can trace any issues after the run rather than relying on catching a mistake while the code executes.
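As a minimal sketch of what a standalone log could look like, the example below attaches a `FileHandler` to a separate logger and reads the file back. The logger name, the log path and the two messages are made up for illustration; a real pipeline would take the log location from the platform.

```python
import logging
import os
import tempfile

# Hypothetical log location, placed in the temp directory for this example only.
log_path = os.path.join(tempfile.gettempdir(), "pipeline_example.log")

# A named logger with its own FileHandler, so records land in a file on disk.
file_logger = logging.getLogger("pipeline_example")
file_logger.setLevel(logging.INFO)
file_logger.propagate = False  # keep this example's records out of the root logger

handler = logging.FileHandler(log_path, mode="w")
handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
file_logger.addHandler(handler)

file_logger.info("Pipeline started.")
file_logger.warning("3 rows had missing prices.")

# Close the handler so the file is flushed and can be read back immediately.
handler.close()
file_logger.removeHandler(handler)

with open(log_path) as f:
    log_text = f.read()
print(log_text)
```

Because the records are in a file, they remain available after the run has finished, which is the property the `print()` approach lacks.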

We will also create a `Config` class to store the paths we need for data and constants such as the exchange rate. Keeping these separate from the logic makes them easier to control and change. In practice we would not hard-code these values but supply them from the overall platform (for example as environment variables or stored 'secrets'), which is hard to illustrate here.

In [None]:
# Setup logging (instead of print statements)
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Use a config class (no more magic numbers)
@dataclass
class Config:
    # Paths (Ideally these come from env vars, but relative paths are better than absolute)
    INPUT_PATH: str = "data/raw/sales_data.csv"
    OUTPUT_PATH: str = "data/processed/scored_leads_{date}.csv"

    # Business Logic Constants
    FX_RATE_GBP_USD: float = 0.82
    EXCLUDED_REGIONS: Tuple[int, ...] = (4,)  # Immutable tuple, so it is safe as a dataclass default

    # Campaign Multipliers
    CLICK_MULTIPLIER_A: float = 0.05
    CLICK_MULTIPLIER_B: float = 0.12

    # Outlier Capping
    MAX_SCORE_CAP: int = 1000
    CAP_VALUE: int = 999

    # Model Params
    TEST_SPLIT_SIZE: float = 0.2
    RANDOM_SEED: int = 42

config = Config()
logger.info("Configuration loaded.")
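The comment in `Config` notes that paths would ideally come from environment variables. A minimal sketch of that idea, assuming hypothetical variable names (`PIPELINE_INPUT_PATH`, `PIPELINE_TEST_SPLIT_SIZE`) and a hypothetical `EnvConfig` class that falls back to the defaults used above, might look like this:

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EnvConfig:
    # Hypothetical variable names; each falls back to the notebook's default value.
    input_path: str = field(
        default_factory=lambda: os.environ.get(
            "PIPELINE_INPUT_PATH", "data/raw/sales_data.csv"
        )
    )
    test_split_size: float = field(
        default_factory=lambda: float(
            os.environ.get("PIPELINE_TEST_SPLIT_SIZE", "0.2")
        )
    )


# Simulate the platform setting a value before the pipeline starts.
os.environ["PIPELINE_TEST_SPLIT_SIZE"] = "0.3"

env_config = EnvConfig()
print(env_config.input_path, env_config.test_split_size)
```

Using `default_factory` means the environment is read when the object is created rather than when the class is defined, and `frozen=True` prevents the values from being changed accidentally mid-run.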

Rather than one large block of code that is hard to understand and not reusable, we will favour a series of functions. Each has a clearly defined scope (ideally performing a single task), a set of inputs and a clear output, all stated in the docstring at the very start of the function. Each function reports what it did directly to the logger.

In [None]:
def load_data(path: str) -> pd.DataFrame:
    """Loads data, logging and re-raising the error if the file is missing.
       INPUTS: the path to a CSV file
       OUTPUTS: a dataframe
    """
    try:
        df = pd.read_csv(path)
        logger.info(f"Data loaded successfully: {df.shape[0]} rows")
        return df
    except FileNotFoundError:
        logger.error(f"File not found at {path}")
        raise
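In the same single-purpose style, a loading step is often followed by a check that the expected columns are present. The sketch below is an assumption rather than part of the original pipeline: `validate_columns` is a hypothetical helper, and the column list is inferred from the columns the later functions use.

```python
import pandas as pd

# Columns the downstream functions rely on (inferred, not an official schema).
REQUIRED_COLUMNS = ("raw_price", "region_id", "type", "clicks", "conversions")


def validate_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> pd.DataFrame:
    """Raises a ValueError naming any required column that is missing.
       INPUTS: a dataframe and the required column names
       OUTPUTS: the same dataframe, unchanged
    """
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return df


# A made-up single-row dataframe with every required column.
good = pd.DataFrame(
    {"raw_price": [10.0], "region_id": [1], "type": ["A"],
     "clicks": [100], "conversions": [3]}
)
validate_columns(good)  # passes silently

try:
    validate_columns(good.drop(columns=["clicks"]))
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

Failing early with a message that names the missing column is usually easier to debug than a `KeyError` raised several steps later.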

We also favour speed. Our previous code used a for loop over `.iterrows()`. This is fine for experimentation and with a limited data size, but it is a relatively slow approach because each row is processed one by one in Python. Vectorising the operation (as below) expresses the transformation of the whole column as a single operation. As data engineers, even if the dataset we first work with is relatively small, we should design code that can easily scale up to much larger datasets.

In [None]:
def clean_data(df: pd.DataFrame, cfg: Config) -> pd.DataFrame:
    """Applies standard cleaning and filtering rules.
       INPUTS: a dataframe and a Config
       OUTPUTS: a new dataframe (the input is left unmodified)
    """

    # Filter using readable syntax. Taking a copy here keeps the caller's
    # dataframe untouched and avoids SettingWithCopy warnings below.
    initial_count = len(df)
    cleaned = df[~df['region_id'].isin(cfg.EXCLUDED_REGIONS)].copy()
    removed = initial_count - len(cleaned)

    # Vectorised operation (fast) instead of .iterrows (slow)
    cleaned['adjusted_price'] = cleaned['raw_price'] * cfg.FX_RATE_GBP_USD

    logger.info(f"Cleaned data. Removed {removed} rows from excluded regions.")
    return cleaned
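To check that the vectorised form gives the same result as a row-by-row loop, the sketch below runs both on a small, made-up dataframe. The prices are invented for illustration and `FX_RATE` simply mirrors the value in `Config`; with only three rows the speed difference is not visible, so the example only checks equivalence.

```python
import pandas as pd

FX_RATE = 0.82  # mirrors Config.FX_RATE_GBP_USD
prices = pd.DataFrame({"raw_price": [10.0, 20.0, 35.5]})

# Row-by-row version: one Python-level iteration per row.
looped = []
for _, row in prices.iterrows():
    looped.append(row["raw_price"] * FX_RATE)

# Vectorised version: a single expression over the whole column.
vectorised = prices["raw_price"] * FX_RATE

# Compare the two results element by element, allowing for float rounding.
same_result = all(abs(a - b) < 1e-9 for a, b in zip(looped, vectorised))
print(same_result)
```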

In [None]:
def calculate_score(df: pd.DataFrame, cfg: Config) -> pd.DataFrame:
    """Engineers the 'score' feature based on campaign type.
       INPUTS: a dataframe and a Config
       OUTPUTS: a new dataframe with a 'score' column
    """
    df = df.copy()  # Work on a copy so the caller's dataframe is not modified

    # Vectorised logic using numpy select (much faster than loops)
    conditions = [
        df['type'] == 'A',
        df['type'] == 'B'
    ]
    choices = [
        df['clicks'] * cfg.CLICK_MULTIPLIER_A,
        df['clicks'] * cfg.CLICK_MULTIPLIER_B
    ]

    df['score'] = np.select(conditions, choices, default=0)

    # Handle outliers: scores above MAX_SCORE_CAP are replaced with CAP_VALUE
    outlier_count = int((df['score'] > cfg.MAX_SCORE_CAP).sum())
    df['score'] = np.where(df['score'] > cfg.MAX_SCORE_CAP, cfg.CAP_VALUE, df['score'])

    if outlier_count > 0:
        logger.warning(f"Capped {outlier_count} scores > {cfg.MAX_SCORE_CAP} to {cfg.CAP_VALUE}")

    return df
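For readers who have not met `np.select` before, here is a small worked sketch on made-up data. Each condition is paired with the choice at the same position, and any row that matches no condition receives the `default`. The multipliers are copied from `Config`; the `campaigns` dataframe, including its type `C` row, is invented for illustration.

```python
import numpy as np
import pandas as pd

campaigns = pd.DataFrame({"type": ["A", "B", "C"], "clicks": [100, 100, 100]})

# conditions[i] selects the rows that should take their value from choices[i].
conditions = [campaigns["type"] == "A", campaigns["type"] == "B"]
choices = [campaigns["clicks"] * 0.05, campaigns["clicks"] * 0.12]

# Rows matching no condition (type "C" here) fall through to the default.
campaigns["score"] = np.select(conditions, choices, default=0)
print(campaigns)
```

Type `A` is scored at 5% of clicks, type `B` at 12%, and the unrecognised type `C` gets a score of 0 rather than raising an error, which is worth bearing in mind if new campaign types appear in the data.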

In [None]:
def train_model(df: pd.DataFrame, cfg: Config) -> LinearRegression:
    """Splits data and trains the model reproducibly.
       INPUTS: a dataframe with 'score' and 'conversions' columns, and a Config
       OUTPUTS: a trained LinearRegression model
    """

    X = df[['score']]
    y = df['conversions']

    # Reproducible split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=cfg.TEST_SPLIT_SIZE, random_state=cfg.RANDOM_SEED
    )

    model = LinearRegression()
    model.fit(X_train, y_train)

    score = model.score(X_test, y_test)
    logger.info(f"Model Training Complete. R^2 Score: {score:.4f}")
    return model

With all our operations now written as independent functions, our final job is to orchestrate them as a flow. The `if __name__ == "__main__":` guard means this part of the code runs only when the file is executed directly as a script, and not when it is imported as a module by other code.

In [None]:
# This orchestrates the flow. In a real system, this might be an Airflow DAG.
if __name__ == "__main__":
    # 1. Extract
    raw_df = load_data(config.INPUT_PATH)

    # 2. Transform
    clean_df = clean_data(raw_df, config)
    final_df = calculate_score(clean_df, config)

    # 3. Model
    model = train_model(final_df, config)

    # 4. Load / Save
    from datetime import datetime
    timestamp = datetime.now().strftime("%Y%m%d")
    output_filename = config.OUTPUT_PATH.format(date=timestamp)

    final_df.to_csv(output_filename, index=False)
    logger.info(f"Results saved to {output_filename}")
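The save step above assumes that the `data/processed` directory already exists; `to_csv` does not create missing folders. A minimal sketch of one way to guard against this, using a hypothetical helper `ensure_parent_dir` and an illustrative date in the filename, is:

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path: str) -> Path:
    """Creates the parent directory of `path` if needed.
       INPUTS: a file path
       OUTPUTS: the same path as a Path object
    """
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Demonstrated in a temporary directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    out = ensure_parent_dir(f"{tmp}/data/processed/scored_leads_20240101.csv")
    parent_exists = out.parent.is_dir()

print(parent_exists)
```

In the pipeline, the helper would be called on `output_filename` just before `final_df.to_csv(...)`.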