# TP03: Testing and Collaboration

## Exercise 1: Unit Test for Data Cleaning
**Objective**: Practice writing unit tests with pytest.

**Task**: Write tests for a `clean_data(df)` function that removes duplicate rows and rows containing null values. The tests should check that:

- Duplicate rows are removed.
- All rows containing null values are dropped.
- The number of rows decreases after cleaning when nulls or duplicates exist.

In [1]:
%%writefile cleaning.py
import pandas as pd

def clean_data(df):
    """
    Removes duplicates and null values from the DataFrame.
    """
    df_clean = df.drop_duplicates()
    df_clean = df_clean.dropna()
    return df_clean

Writing cleaning.py


In [4]:
%%writefile test_cleaning.py
import pandas as pd
import pytest
from cleaning import clean_data

def test_clean_data():
    # Create sample data with duplicates and nulls
    data = {
        "id": [1, 2, 2, 3, 4],
        "value": [10, 20, 20, None, 40]
    }
    df = pd.DataFrame(data)
    
    # Apply cleaning
    df_cleaned = clean_data(df)
    
    # Assertions
    assert df_cleaned.shape[0] == 3, "Should have 3 rows after cleaning (removed 1 duplicate and 1 null)"
    assert df_cleaned.isnull().sum().sum() == 0, "Should have no null values"
    assert df_cleaned.duplicated().sum() == 0, "Should have no duplicates"
    assert 2 in df_cleaned["id"].values, "ID 2 should still exist"
    assert 3 not in df_cleaned["id"].values, "ID 3 (with null value) should be removed"

Overwriting test_cleaning.py


In [5]:
!pytest test_cleaning.py

platform darwin -- Python 3.11.12, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/macbookair/Documents/Data Science 5th Year/Advanced Programing for DS
plugins: anyio-4.9.0, langsmith-0.4.8
collected 1 item

test_cleaning.py .                                                       [100%]
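
The single test above checks all three requirements at once, so a failure does not immediately say which behavior broke. As a minimal sketch, the same checks can be split into one test per requirement. The helper `make_dirty_frame` and the test names are hypothetical, and `clean_data` is repeated here only so the block runs on its own.

```python
import pandas as pd


def clean_data(df):
    """Same logic as cleaning.py, repeated so this sketch is self-contained."""
    return df.drop_duplicates().dropna()


def make_dirty_frame():
    # Hypothetical helper: one duplicated row (id 2) and one null value (id 3).
    return pd.DataFrame({"id": [1, 2, 2, 3, 4], "value": [10, 20, 20, None, 40]})


def test_duplicates_removed():
    cleaned = clean_data(make_dirty_frame())
    assert cleaned.duplicated().sum() == 0


def test_nulls_dropped():
    cleaned = clean_data(make_dirty_frame())
    assert cleaned.isnull().sum().sum() == 0


def test_row_count_decreases():
    df = make_dirty_frame()
    assert len(clean_data(df)) < len(df)


def test_clean_input_keeps_all_rows():
    # Edge case: nothing to remove, so the row count should not change.
    df = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
    assert len(clean_data(df)) == len(df)
```

Saved in a file whose name starts with `test_`, these functions would be collected by pytest as four separate items.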




## Exercise 2: TDD - Normalization Function
**Objective**: Apply Test-Driven Development (TDD).

**Task**:
1. Write tests first for a function `normalize_column(df, column)` that scales values between 0 and 1.
2. Implement the function to make the tests pass.

In [9]:
%%writefile normalization.py
import pandas as pd

def normalize_column(df, column):
    """
    Scales values in the specified column between 0 and 1.
    """
    if column not in df.columns:
        raise KeyError(f"Column {column} not found in DataFrame")
    
    df_copy = df.copy()
    min_val = df_copy[column].min()
    max_val = df_copy[column].max()
    
    if max_val - min_val == 0:
        df_copy[column] = 0.0
    else:
        df_copy[column] = (df_copy[column] - min_val) / (max_val - min_val)
        
    return df_copy

Writing normalization.py


In [11]:
%%writefile test_normalization.py
import pandas as pd
import pytest
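# In a strict TDD cycle this file is written before normalization.py exists,
# so the import below fails first (the "red" step) until the function is implemented.
# The ImportError is not caught: swallowing it would only turn it into a
# less informative NameError inside the tests.
from normalization import normalize_column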

def test_normalize_column():
    df = pd.DataFrame({"score": [10, 20, 30]})
    
    # Test normalization
    df_norm = normalize_column(df, "score")
    
    assert df_norm["score"].min() == 0.0
    assert df_norm["score"].max() == 1.0
    assert len(df_norm) == 3

def test_invalid_column():
    df = pd.DataFrame({"score": [10, 20, 30]})
    with pytest.raises(KeyError):
        normalize_column(df, "invalid_col")

Overwriting test_normalization.py


In [12]:
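# normalization.py was already written above, so both tests are expected to pass here.
# In a strict TDD cycle, this command would first be run before the implementation exists, to see the tests fail.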
!pytest test_normalization.py

platform darwin -- Python 3.11.12, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/macbookair/Documents/Data Science 5th Year/Advanced Programing for DS
plugins: anyio-4.9.0, langsmith-0.4.8
collected 2 items

test_normalization.py ..                                                 [100%]
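
The two tests above only pin down the minimum, the maximum, and the error path. Under TDD, further expected behaviors would be written as tests first. The sketch below adds three that the implementation already satisfies: an intermediate value, the constant-column branch, and the fact that the input DataFrame is left untouched. The test names are hypothetical, and `normalize_column` is repeated so the block runs standalone.

```python
import pandas as pd


def normalize_column(df, column):
    """Min-max scaling, repeated from normalization.py so the sketch runs on its own."""
    if column not in df.columns:
        raise KeyError(f"Column {column} not found in DataFrame")
    df_copy = df.copy()
    min_val = df_copy[column].min()
    max_val = df_copy[column].max()
    if max_val - min_val == 0:
        # Constant column: avoid dividing by zero and map every value to 0.0.
        df_copy[column] = 0.0
    else:
        df_copy[column] = (df_copy[column] - min_val) / (max_val - min_val)
    return df_copy


def test_midpoint_value():
    df = pd.DataFrame({"score": [10, 20, 30]})
    # (20 - 10) / (30 - 10) = 0.5
    assert normalize_column(df, "score")["score"].tolist() == [0.0, 0.5, 1.0]


def test_constant_column_maps_to_zero():
    df = pd.DataFrame({"score": [5, 5, 5]})
    assert normalize_column(df, "score")["score"].tolist() == [0.0, 0.0, 0.0]


def test_input_not_modified():
    df = pd.DataFrame({"score": [10, 20, 30]})
    normalize_column(df, "score")
    assert df["score"].tolist() == [10, 20, 30]
```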




## Exercise 3: Testing Model Evaluation Function
**Objective**: Test ML evaluation logic using pytest.

**Task**: Write tests for `evaluate_model(y_true, y_pred)` that returns a dictionary with accuracy and F1 score.

In [13]:
%%writefile evaluation.py
from sklearn.metrics import accuracy_score, f1_score

def evaluate_model(y_true, y_pred):
    """
    Returns a dictionary with accuracy and F1 score.
    """
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    return {"accuracy": acc, "f1_score": f1}

Writing evaluation.py


In [14]:
%%writefile test_evaluation.py
import pytest
from evaluation import evaluate_model

def test_evaluate_model_perfect():
    y_true = [1, 0, 1, 1]
    y_pred = [1, 0, 1, 1]
    metrics = evaluate_model(y_true, y_pred)
    
    # Check that the keys exist before reading their values
    assert "accuracy" in metrics
    assert "f1_score" in metrics
    assert metrics["accuracy"] == 1.0
    assert metrics["f1_score"] == 1.0

def test_evaluate_model_wrong():
    y_true = [1, 0, 1]
    y_pred = [0, 1, 0]
    metrics = evaluate_model(y_true, y_pred)
    
    assert metrics["accuracy"] == 0.0
    assert metrics["f1_score"] == 0.0

Writing test_evaluation.py


In [15]:
!pytest test_evaluation.py

platform darwin -- Python 3.11.12, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/macbookair/Documents/Data Science 5th Year/Advanced Programing for DS
plugins: anyio-4.9.0, langsmith-0.4.8
collected 2 items

test_evaluation.py ..                                                    [100%]
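
Both tests use extreme cases (all correct, all wrong), so they never exercise the support weighting behind `average='weighted'`. The sketch below recomputes accuracy and a support-weighted F1 by hand in plain Python for a partially correct prediction. It is a hand-rolled illustration, not scikit-learn's implementation. The function names are hypothetical, and undefined precision or recall is treated as 0, mirroring `zero_division=0`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    labels = sorted(set(y_true) | set(y_pred))
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios are treated as 0.0 (the zero_division=0 convention).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        support = tp + fn  # number of true instances of this class
        score += f1 * support / total
    return score


y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]  # one class-1 sample is misclassified as class 0
print(accuracy(y_true, y_pred))     # 3 of 4 correct
print(weighted_f1(y_true, y_pred))  # class 0: F1 = 2/3, class 1: F1 = 0.8
```

For this input the weighted F1 works out to (1 · 2/3 + 3 · 0.8) / 4 = 23/30. A test for `evaluate_model` on the same labels could compare against that value with `pytest.approx` rather than `==`, since the result is not exactly representable as a float.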



## Exercise 4: Continuous Integration with GitHub Actions
**Objective**: Automate testing using GitHub workflows.

**Task**: Create a `.github/workflows/run-tests.yml` file.

In [16]:
import os
os.makedirs(".github/workflows", exist_ok=True)

workflow_content = """name: Run Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout code
      uses: actions/checkout@v3

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install pytest pandas scikit-learn

    - name: Run tests
      run: |
        pytest -v
"""

with open(".github/workflows/run-tests.yml", "w") as f:
    f.write(workflow_content)

print("Workflow file created at .github/workflows/run-tests.yml")

Workflow file created at .github/workflows/run-tests.yml
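
A misspelled action name in a `uses:` line only surfaces once the workflow runs on GitHub. As a rough local guard, the sketch below scans workflow text for `uses:` references and reports any that are not in an allow-list. The function name `unknown_actions` and the `KNOWN_ACTIONS` set are hypothetical, and the sample text deliberately contains a misspelled action so the check has something to report. It is a plain-text scan with the standard library, not a YAML or GitHub Actions validator.

```python
import re

# Assumption: the workflow is expected to use only these two official actions.
KNOWN_ACTIONS = {"actions/checkout", "actions/setup-python"}


def unknown_actions(workflow_text):
    """Return action names referenced via `uses:` that are not in KNOWN_ACTIONS."""
    referenced = re.findall(
        r"^\s*(?:-\s*)?uses:\s*([^@\s]+)@\S+",
        workflow_text,
        flags=re.MULTILINE,
    )
    return [name for name in referenced if name not in KNOWN_ACTIONS]


# Sample text with a deliberate typo (underscore instead of hyphen).
sample = """
steps:
- name: Checkout code
  uses: actions/checkout@v3
- name: Set up Python
  uses: actions/setup_python@v4
"""

print(unknown_actions(sample))
```

Run against the text of `.github/workflows/run-tests.yml`, an empty list would indicate that every referenced action is in the allow-list.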


## Exercise 5: End-to-End Testing (Integration Test)
**Objective**: Combine testing of multiple components.

**Task**: Create and test a mini ML pipeline.

In [20]:
%%writefile test_integration.py
import pandas as pd
import pytest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from evaluation import evaluate_model

# Mock functions for the pipeline
def load_data():
    # Returns a simple dataframe
    data = {
        "feature1": [1, 2, 3, 4, 5, 6],
        "feature2": [10, 20, 30, 40, 50, 60],
        "target": [0, 0, 0, 1, 1, 1]
    }
    return pd.DataFrame(data)

def train_model(X, y):
    model = LogisticRegression()
    model.fit(X, y)
    return model

def test_ml_pipeline():
    # 1. Load Data
    df = load_data()
    assert not df.empty
    assert "target" in df.columns
    
    # 2. Train Model
    X = df[["feature1", "feature2"]]
    y = df["target"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    
    model = train_model(X_train, y_train)
    assert model is not None
    
    # 3. Evaluate Model
    y_pred = model.predict(X_test)
    metrics = evaluate_model(y_test, y_pred)
    
    assert "accuracy" in metrics
    assert 0 <= metrics["accuracy"] <= 1.0
    print(f"Integration Test Metrics: {metrics}")

Overwriting test_integration.py


In [21]:
!pytest test_integration.py

platform darwin -- Python 3.11.12, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/macbookair/Documents/Data Science 5th Year/Advanced Programing for DS
plugins: anyio-4.9.0, langsmith-0.4.8
collected 1 item

test_integration.py .                                                    [100%]

test_integration.py::test_ml_pipeline
    opt_res = optimize.minimize(
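
The pipeline test above exercises loading, training, and evaluation, but not the cleaning and normalization functions from Exercises 1 and 2. As a minimal sketch of how those two components could be tested together, the block below chains them on a small frame. The test name and the sample data are hypothetical, and both functions are repeated so the block runs on its own.

```python
import pandas as pd


def clean_data(df):
    """Same logic as cleaning.py."""
    return df.drop_duplicates().dropna()


def normalize_column(df, column):
    """Same logic as normalization.py."""
    if column not in df.columns:
        raise KeyError(f"Column {column} not found in DataFrame")
    df_copy = df.copy()
    min_val = df_copy[column].min()
    max_val = df_copy[column].max()
    if max_val - min_val == 0:
        df_copy[column] = 0.0
    else:
        df_copy[column] = (df_copy[column] - min_val) / (max_val - min_val)
    return df_copy


def test_clean_then_normalize():
    # One duplicated row (2, 0) and one row with a null feature.
    raw = pd.DataFrame(
        {"feature1": [1, 2, 2, None, 5], "target": [0, 0, 0, 1, 1]}
    )

    cleaned = clean_data(raw)
    prepared = normalize_column(cleaned, "feature1")

    # Cleaning leaves the rows with feature1 = 1, 2, 5.
    assert len(prepared) == 3
    assert prepared["feature1"].isnull().sum() == 0
    # After scaling, every value lies in [0, 1] and the labels are unchanged.
    assert prepared["feature1"].between(0.0, 1.0).all()
    assert prepared["target"].tolist() == [0, 0, 1]
```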

