# Melio Fullstack Data Scientist Technical Interview

### Task 1: Building the classifier

This is the main data science component of the technical assessment.

Build a classifier to determine whether the name belongs to a `Person`, `Company`, or `University`:

  - You can use any library you want.
  - You can use a rule-based classification, a pre-built model/embedding, build a model yourself or a hybrid. 
  - Format:
    - If you are building an ML solution, the training of your model can be in a Jupyter notebook. 
    - If you are not building an ML solution, you will have to embed your Python code into the app.

Note that the classifications are generated by the client's upstream system, but they are not always correct.

## Submission Requirements

  1. Give enough information on how to run your solution (i.e. python version, packages, requirements.txt, Dockerfile, etc.).
  2. State all of your assumptions, if any.
  3. There is no right or wrong answer, but give clear reasoning for each step you took.


# ! Please see README for assumptions and project planning process !

In [1]:
import pandas as pd

df = pd.read_csv('names_data_candidate.csv')
df.sample(10)

Unnamed: 0,dirty_name,dirty_label
2982,DR nomfundo buthelezi,Person
0,Wright Pentlow,Person
3106,rev. bryan edwards,Person
2289,Sr. dr. stephen viljoen,Person
947,MISS KABELO MAZIBUKO,Person
4242,MRS heidi mngomezulu,Person
2739,Altusi Group Ltd,Company
1152,Miss Tracy Coetzer,Person
3597,Judicaël Deware,Person
3799,SIPHOKAZI OBERHOLZER,Person


In [35]:
# imports
import string
import unidecode
import re
from sklearn.model_selection import train_test_split
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, log_loss, classification_report
from sklearn.pipeline import Pipeline
from joblib import dump
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from imblearn.over_sampling import RandomOverSampler
from sklearn.utils.class_weight import compute_class_weight



# Data Exploration  

## First remove all accents, capitals and punctuation to create cleaner data to work with

In [5]:
def lowercase_names(df):
    """
    Converts all entries in the 'dirty_name' column to lowercase.

    Parameters:
    - df: pandas.DataFrame
        Input DataFrame with a 'dirty_name' column.

    Returns:
    - df: pandas.DataFrame
        DataFrame with 'dirty_name' values converted to lowercase.

    This step standardises casing to avoid case-sensitive mismatches in rule-based
    keyword matching and improves model generalisation.
    """
    df['dirty_name'] = df['dirty_name'].str.lower()
    return df

def remove_punctuation(df):
    """
    Removes punctuation characters from the 'dirty_name' column.

    Parameters:
    - df: pandas.DataFrame
        Input DataFrame with a 'dirty_name' column.

    Returns:
    - df: pandas.DataFrame
        DataFrame with punctuation removed from 'dirty_name'.

    This reduces noise from characters like '.', ',', etc., which can vary in presence
    but rarely carry class-discriminative value.
    """
    translator = str.maketrans('', '', string.punctuation)
    df['dirty_name'] = df['dirty_name'].apply(lambda name: name.translate(translator))
    return df


def remove_accents(df):
    """
    Replaces accented characters in the 'dirty_name' column with ASCII equivalents.

    Parameters:
    - df: pandas.DataFrame
        Input DataFrame with a 'dirty_name' column.

    Returns:
    - df: pandas.DataFrame
        DataFrame with special/accented characters replaced.

    This improves matching for names with diacritics (e.g. 'Émile' → 'Emile').
    """
    df['dirty_name'] = df['dirty_name'].apply(unidecode.unidecode)
    return df


In [6]:
df = lowercase_names(df)
df = remove_punctuation(df)
df = remove_accents(df)
print(df.head(20))

                       dirty_name dirty_label
0                  wright pentlow      Person
1                ms sydney hadebe      Person
2             prof hennie vorster      Person
3                   enrica hayter      Person
4                    teboho ngema      Person
5                    irene klaves      Person
6                   aila tenpenny      Person
7            oceanne dawidowitsch      Person
8                 long mac geffen      Person
9                thabiso blignaut      Person
10              emalee le strange      Person
11                 lindiwe wright      Person
12  imibono fuels capital pty ltd     Company
13          imibono fuels pty ltd     Company
14                  imibono fuels     Company
15          mr miss bronwyn kotze      Person
16            dr anson dudderidge      Person
17          prof frederick turner     Company
18               katherina hawkey      Person
19        dr rev hannes mashinini      Person
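The three cleaning steps can also be applied to a single string, which is handy when the same normalisation has to run inside the app at prediction time. The sketch below is a minimal illustration and not the notebook's implementation: `clean_single_name` is a hypothetical helper, and it uses the standard library's `unicodedata` as a stand-in for `unidecode`, so characters with no decomposition (for example 'ø') are dropped rather than transliterated.

```python
import string
import unicodedata

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_single_name(name: str) -> str:
    """Lowercase, strip punctuation and fold accents for one name string."""
    lowered = name.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with errors='ignore' then drops the mark.
    decomposed = unicodedata.normalize('NFKD', no_punct)
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    # Collapse any repeated whitespace left behind by removed characters.
    return ' '.join(ascii_only.split())


print(clean_single_name('Sr. dr. stephen viljoen'))
print(clean_single_name('Judicaël Deware'))
```

Keeping training-time and inference-time cleaning in one function of this kind is one way to avoid the two paths drifting apart.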


## Remove known mislabelled instances

In [7]:
# identify noisy labels
def identify_noisy_labels(df):
    """
    Identifies and counts noisy or invalid labels in the classification dataset.

    Parameters:
    - df: pandas.DataFrame
        Input DataFrame containing at least 'dirty_label' column.

    Returns:
    - noisy_df: pandas.DataFrame
        Subset of rows with invalid labels.
    - num_noisy: int
        Count of noisy label rows.

    This function compares all entries in the 'dirty_label' column against a known set
    of valid classes ('Person', 'Company', 'University'). Any entry not matching these
    is considered noisy.
    """
    valid_labels = {'Person', 'Company', 'University'}

    # Filter out rows that don't match the valid label set
    noisy_df = df[~df['dirty_label'].isin(valid_labels)]
    num_noisy = len(noisy_df)

    print(f"Number of noisy label entries: {num_noisy}")
    return noisy_df, num_noisy



In [8]:
noisy_df, num_noisy = identify_noisy_labels(df)

Number of noisy label entries: 0


In [9]:
# no entries are missing classifications

In [10]:
# check for misclassifications using a rule-based approach

def get_label_patterns():
    """
    Defines common indicative substrings or patterns for each label class.
    
    Returns:
    - patterns: dict
        A dictionary with keys as class labels ('Person', 'Company', 'University') and
        values as lists of lowercase keywords typically found in those categories.
    """
    return {
        'Person': [
            'mr', 'mrs', 'ms', 'miss', 'dr', 'prof', 'rev', 'sr'
        ],
        'Company': [
            'pty', 'ltd', 'inc', 'cc', 'corp', 'company', 'llc', 'gmbh', 'foundation', 'trust',
            'capital', 'group', 'holdings', 'investments'
        ],
        'University': [
            'university', 'college', 'institute', 'politecnico', 'instituto', 'universidad', 'universidade', 'universite'
        ]
    }

def flag_misclassified_entries(df):
    """
    Identifies rows where the label seems to contradict common naming patterns.

    Parameters:
    - df: pandas.DataFrame
        DataFrame with 'dirty_name' and 'dirty_label' columns.

    Returns:
    - mismatched_df: pandas.DataFrame
        Subset of rows suspected to be misclassified based on rule-based patterns.
    - num_mismatches: int
        Count of suspected misclassifications.

    This function uses common keyword patterns for each class and flags entries
    where the name suggests one class, but the label is something else.
    """
    patterns = get_label_patterns()
    
    # Precompile regex patterns for efficiency
    compiled_patterns = {
        label: re.compile('|'.join([fr'\b{re.escape(p)}\b' for p in keywords]), re.IGNORECASE)
        for label, keywords in patterns.items()
    }

    mismatches = []
    
    for _, row in df.iterrows():
        name = row['dirty_name']
        label = row['dirty_label']
        
        # Determine expected label(s) based on pattern match
        expected_labels = [lbl for lbl, pat in compiled_patterns.items() if pat.search(name)]
        
        # If the actual label is not one of the expected ones, flag as mismatch
        if expected_labels and label not in expected_labels:
            mismatches.append(row)

    mismatched_df = pd.DataFrame(mismatches)
    num_mismatches = len(mismatched_df)

    print(f"Number of suspected misclassified rows: {num_mismatches}")
    return mismatched_df, num_mismatches

def clean_misclassified_entries(df):
    """
    Removes suspected misclassified rows from the dataset based on naming patterns.

    Parameters:
    - df: pandas.DataFrame
        Original dataset.

    Returns:
    - cleaned_df: pandas.DataFrame
        Dataset with suspected misclassified rows removed.

    This function uses the rule-based flagging logic to detect and exclude entries
    that appear to have incorrect labels.
    """
    mismatched_df, _ = flag_misclassified_entries(df)
    
    # Remove mismatched entries
    cleaned_df = df.drop(mismatched_df.index).reset_index(drop=True)
    print(f"Cleaned dataset size after removing misclassified rows: {len(cleaned_df)}")

    return cleaned_df


In [11]:
# get the label patterns
label_patterns = get_label_patterns()
# flag the misclassified entries and print the count (check it's not too high)
mismatched_df, num_mismatches = flag_misclassified_entries(df)
# the count is small relative to the dataset, so drop the suspect rows
df = clean_misclassified_entries(df)

Number of suspected misclassified rows: 84
Number of suspected misclassified rows: 84
Cleaned dataset size after removing misclassified rows: 4436
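To make the flagging rule concrete, the sketch below applies the same idea to individual names: a row is suspect when at least one class keyword matches as a whole word and the recorded label is not among the matched classes. It is a reduced illustration with a shortened keyword list; `expected_labels` and `is_suspect` are hypothetical helper names, and 'thabiso adams' is an invented example.

```python
import re

# Shortened keyword lists, for illustration only.
PATTERNS = {
    'Person': ['mr', 'mrs', 'ms', 'miss', 'dr', 'prof', 'rev', 'sr'],
    'Company': ['pty', 'ltd', 'inc', 'cc', 'group'],
    'University': ['university', 'college', 'institute'],
}

# One alternation per class; \b ensures whole-word matches only.
COMPILED = {
    label: re.compile('|'.join(rf'\b{re.escape(k)}\b' for k in keywords))
    for label, keywords in PATTERNS.items()
}


def expected_labels(name):
    """Return every class whose keywords appear in the name."""
    return [label for label, pat in COMPILED.items() if pat.search(name)]


def is_suspect(name, label):
    """True when keywords point to some class but not the recorded one."""
    expected = expected_labels(name)
    return bool(expected) and label not in expected


print(is_suspect('prof frederick turner', 'Company'))  # keyword says Person
print(is_suspect('wright pentlow', 'Person'))          # no keyword, not flagged
print(expected_labels('thabiso adams'))                # 'ms' is not a whole word here
```

Names with no keyword at all are never flagged, so this check can only catch label noise in rows that carry an explicit title or suffix.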


## Remove doubled prefixes for names

In [12]:
def remove_double_prefixes(df):
    """
    Removes the first two words in 'dirty_name' if both are common prefixes
    (e.g., 'Mr Dr Jane Doe' → 'Jane Doe').

    Parameters:
    - df: pandas.DataFrame
        Input DataFrame with a 'dirty_name' column (pre-cleaned, lowercase, no punctuation).

    Returns:
    - df: pandas.DataFrame
        DataFrame with modified 'dirty_name' values where double prefixes were removed.

    This function targets cases where two known titles/prefixes occur at the beginning of the name.
    It uses a defined prefix list and only alters names that start with two such prefixes.
    """
    # Define common person prefixes
    prefixes = {'mr', 'mrs', 'ms', 'miss', 'dr', 'prof', 'rev', 'sr'}

    def clean_name(name):
        # Split name into parts
        parts = name.strip().split()

        # Check if first two words are both valid prefixes
        if len(parts) >= 2 and parts[0] in prefixes and parts[1] in prefixes:
            return ' '.join(parts[2:])  # Remove both
        return name

    df['dirty_name'] = df['dirty_name'].apply(clean_name)
    return df


In [13]:
df = remove_double_prefixes(df)
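A string-level sketch of the same rule is shown below. It differs from the notebook function in one labelled assumption: it requires more than two words before stripping, so a name made up of only two titles is left alone instead of becoming an empty string. `strip_double_prefix` is a hypothetical name.

```python
PREFIXES = {'mr', 'mrs', 'ms', 'miss', 'dr', 'prof', 'rev', 'sr'}


def strip_double_prefix(name: str) -> str:
    """Drop the first two words when both are titles and something remains."""
    parts = name.strip().split()
    # Assumption: '> 2' (not '>= 2') so that 'dr rev' is not reduced to ''.
    if len(parts) > 2 and parts[0] in PREFIXES and parts[1] in PREFIXES:
        return ' '.join(parts[2:])
    return name


print(strip_double_prefix('mr miss bronwyn kotze'))
print(strip_double_prefix('dr anson dudderidge'))
```

A single title is kept on purpose; only the doubled case is treated as noise.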

## Count the number of entries for each category and spread of patterns

In [14]:
def count_all_patterns_by_label(df, pattern_dict):
    """
    Counts how often each pattern from a label-specific pattern dictionary appears
    in the corresponding names within each label group, and includes the total number
    of entries per group.

    Parameters:
    - df: pandas.DataFrame
        DataFrame with 'dirty_name' and 'dirty_label' columns.
    - pattern_dict: dict
        Dictionary where keys are label names (e.g., 'Person', 'Company') and values are
        lists of lowercase patterns to match.

    Returns:
    - results: dict
        A nested dictionary where each key is a label, and the value is another dictionary
        containing:
            - pattern counts (including 'no_pattern')
            - 'total_entries': total number of rows for that label

    This function gives insight into how many names match each pattern and the pattern
    coverage relative to total entries per category.
    """
    results = {}

    for label, patterns in pattern_dict.items():
        label_df = df[df['dirty_label'] == label]
        total = len(label_df)

        pattern_counts = {p: 0 for p in patterns}
        pattern_counts['no_pattern'] = 0
        pattern_counts['total_entries'] = total

        for name in label_df['dirty_name']:
            found = False
            for p in patterns:
                # Plain substring test: 'mr' also matches inside 'mrs', so counts are upper bounds
                if p in name:
                    pattern_counts[p] += 1
                    found = True
            if not found:
                pattern_counts['no_pattern'] += 1

        print(f"\nPattern match counts for label: '{label}'")
        print(f"Total entries: {total}")
        for k, v in pattern_counts.items():
            if k != 'total_entries':
                print(f"{k}: {v}")
        
        results[label] = pattern_counts

    return results


In [15]:
count_all_patterns_by_label(df, get_label_patterns())


Pattern match counts for label: 'Person'
Total entries: 3674
mr: 364
mrs: 171
ms: 121
miss: 230
dr: 372
prof: 119
rev: 69
sr: 55
no_pattern: 2383

Pattern match counts for label: 'Company'
Total entries: 664
pty: 96
ltd: 128
inc: 25
cc: 54
corp: 25
company: 11
llc: 15
gmbh: 21
foundation: 4
trust: 18
capital: 6
group: 12
holdings: 4
investments: 6
no_pattern: 348

Pattern match counts for label: 'University'
Total entries: 98
university: 58
college: 14
institute: 2
politecnico: 2
instituto: 3
universidad: 15
universidade: 2
universite: 2
no_pattern: 6


{'Person': {'mr': 364,
  'mrs': 171,
  'ms': 121,
  'miss': 230,
  'dr': 372,
  'prof': 119,
  'rev': 69,
  'sr': 55,
  'no_pattern': 2383,
  'total_entries': 3674},
 'Company': {'pty': 96,
  'ltd': 128,
  'inc': 25,
  'cc': 54,
  'corp': 25,
  'company': 11,
  'llc': 15,
  'gmbh': 21,
  'foundation': 4,
  'trust': 18,
  'capital': 6,
  'group': 12,
  'holdings': 4,
  'investments': 6,
  'no_pattern': 348,
  'total_entries': 664},
 'University': {'university': 58,
  'college': 14,
  'institute': 2,
  'politecnico': 2,
  'instituto': 3,
  'universidad': 15,
  'universidade': 2,
  'universite': 2,
  'no_pattern': 6,
  'total_entries': 98}}
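The counts above come from a plain substring test (`p in name`), whereas the misclassification check uses whole-word regex matching. The sketch below shows how the two can disagree on a few names; 'thabiso adams' is an invented example and the helper names are hypothetical.

```python
import re

names = ['mrs heidi mngomezulu', 'thabiso adams', 'mr takalani le roux']


def substring_count(pattern, names):
    """Count names containing the pattern anywhere, even inside a word."""
    return sum(pattern in name for name in names)


def word_count(pattern, names):
    """Count names containing the pattern as a whole word."""
    regex = re.compile(rf'\b{re.escape(pattern)}\b')
    return sum(bool(regex.search(name)) for name in names)


for pattern in ('mr', 'ms'):
    print(pattern, substring_count(pattern, names), word_count(pattern, names))
```

Here 'mr' is counted inside 'mrs' and 'ms' inside 'adams' by the substring test only, so the per-pattern totals and the `no_pattern` figure are best read as approximate.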

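The pattern counts are uneven within each class, so one augmentation option is to reassign the common keywords at random in the training data so that each appears roughly equally often.
The train/validation split also needs to be stratified so that every class is represented proportionally.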

# Data split and augmentation

## Split

In [16]:
def stratified_train_val_split(df, val_frac=0.1, random_state=42):
    """
    Splits a DataFrame into training and validation sets with stratified sampling,
    ensuring the validation set contains val_frac proportion of each label class.

    Parameters:
    - df: pandas.DataFrame
        DataFrame containing 'dirty_name' and 'dirty_label' columns.
    - val_frac: float (default=0.1)
        Proportion of each class to include in the validation set.
    - random_state: int
        Seed for reproducibility.

    Returns:
    - train_df: pandas.DataFrame
        Training set.
    - val_df: pandas.DataFrame
        Validation set with stratified class proportions.

    This function ensures that the split respects class distribution, which is
    essential for fair model evaluation when data is imbalanced.
    """
    train_df, val_df = train_test_split(
        df,
        test_size=val_frac,
        stratify=df['dirty_label'],
        random_state=random_state
    )
    
    # Print split counts for verification
    print("Split summary:")
    print("Train set:")
    print(train_df['dirty_label'].value_counts())
    print("\nValidation set:")
    print(val_df['dirty_label'].value_counts())

    return train_df.reset_index(drop=True), val_df.reset_index(drop=True)


In [17]:
train_df, test_df = stratified_train_val_split(df, val_frac=0.1, random_state=42)

Split summary:
Train set:
dirty_label
Person        3306
Company        598
University      88
Name: count, dtype: int64

Validation set:
dirty_label
Person        368
Company        66
University     10
Name: count, dtype: int64


In [18]:
train_df.head(50)

Unnamed: 0,dirty_name,dirty_label
0,mitchell claw,Person
1,mrs takalani le roux,Person
2,meline bermingham,Person
3,sr julie moland,Person
4,united insurance brokers ltd,Company
5,dr charmaine masilela,Person
6,kagiso van eeden,Person
7,osten marikhin,Person
8,rev derek robertson,Person
9,meedoo tazz latz jv co,Company


## Augment the training data to give a more even and realistic mix of patterns

In [19]:
def reassign_random_pattern_in_place(df, pattern_dict, seed=42):
    """
    Removes all label-specific patterns from each name, then re-adds one random
    pattern (or none) at the original position of a matched pattern. If no pattern was
    present, inserts at the beginning for 'Person' and end for 'Company' or 'University'.

    Parameters:
    - df: pandas.DataFrame
        DataFrame with 'dirty_name' and 'dirty_label' columns.
    - pattern_dict: dict
        Dictionary of label -> list of patterns.
    - seed: int
        Random seed for reproducibility.

    Returns:
    - updated_df: pandas.DataFrame
        DataFrame with modified 'dirty_name' values.
    """
    random.seed(seed)
    updated_names = []

    for _, row in df.iterrows():
        name = row['dirty_name']
        label = row['dirty_label']
        patterns = pattern_dict.get(label, [])

        # Find all pattern matches and track first position
        first_match_index = None
        name_lower = name.lower()
        for pat in patterns:
            match = re.search(r'\b' + re.escape(pat) + r'\b', name_lower)
            if match and first_match_index is None:
                first_match_index = match.start()

        # Remove all occurrences of patterns
        for pat in patterns:
            name = re.sub(r'\b' + re.escape(pat) + r'\b', '', name, flags=re.IGNORECASE)

        # Clean up whitespace
        name = re.sub(r'\s+', ' ', name).strip()

        # Choose random pattern to reinsert (or none)
        chosen_pattern = random.choice(patterns + [None])

        if chosen_pattern:
            name_words = name.split()
            if first_match_index is not None:
                # Estimate insert position based on original match character index
                words_before = name_lower[:first_match_index].split()
                insert_pos = len(words_before)
                insert_pos = min(insert_pos, len(name_words))  # Bound safety
                name_words.insert(insert_pos, chosen_pattern)
            else:
                # If no match existed, insert front or end by label
                if label == 'Person':
                    name_words = [chosen_pattern] + name_words
                else:
                    name_words = name_words + [chosen_pattern]

            name = ' '.join(name_words)

        updated_names.append(name.strip())

    updated_df = df.copy()
    updated_df['dirty_name'] = updated_names
    return updated_df


In [20]:
train_df_adjusted = reassign_random_pattern_in_place(train_df, get_label_patterns())

In [21]:
train_df_adjusted.head(50)

Unnamed: 0,dirty_name,dirty_label
0,mrs mitchell claw,Person
1,mr takalani le roux,Person
2,dr meline bermingham,Person
3,miss julie moland,Person
4,united insurance brokers cc,Company
5,ms charmaine masilela,Person
6,mrs kagiso van eeden,Person
7,osten marikhin,Person
8,mrs derek robertson,Person
9,meedoo tazz latz jv co trust,Company


## Rebalance the training set: downsample Person, then add character-noise copies of University names

In [103]:
# remove some entries from the person category
def remove_person_examples(train_df, percentage, seed=42):
    """
    Removes a specified percentage of rows from the 'Person' category in the training dataset.

    Parameters:
    - train_df: pandas.DataFrame
        Training dataset with a 'dirty_label' column.
    - percentage: float
        Percentage of 'Person' rows to remove (e.g., 0.3 for 30%).
    - seed: int
        Random seed for reproducibility.

    Returns:
    - reduced_df: pandas.DataFrame
        Training dataset with specified portion of 'Person' entries removed.
    """
    # pandas is already imported at the top of the notebook
    assert 0 <= percentage <= 1, "Percentage must be between 0 and 1"

    person_df = train_df[train_df['dirty_label'] == 'Person']
    non_person_df = train_df[train_df['dirty_label'] != 'Person']

    # Determine how many to keep
    keep_count = int(len(person_df) * (1 - percentage))
    reduced_person_df = person_df.sample(n=keep_count, random_state=seed)

    # Combine and shuffle
    reduced_df = pd.concat([reduced_person_df, non_person_df], ignore_index=True)
    reduced_df = reduced_df.sample(frac=1, random_state=seed).reset_index(drop=True)

    return reduced_df

In [139]:
reduced_df = remove_person_examples(train_df_adjusted, 0.8)

In [140]:
reduced_df.shape

(1347, 2)

In [22]:
# Augmentation for all categories
# def augment_dataset_with_noise(df, seed=42):
#     """
#     Augments the dataset by duplicating each entry once and applying one of two types
#     of random noise to the name:
#       1. Insert/remove a space or swap two adjacent characters.
#       2. Slightly altered duplicate version with same label.

#     Parameters:
#     - df: pandas.DataFrame
#         DataFrame containing 'dirty_name' and 'dirty_label' columns.
#     - seed: int
#         Random seed for reproducibility.

#     Returns:
#     - augmented_df: pandas.DataFrame
#         New DataFrame with both original and augmented examples (2x the size).
#     """
#     random.seed(seed)
#     augmented_rows = []

#     def add_character_noise(name):
#         """Applies one of three noise types to the string: insert space, remove space, or swap characters."""
#         if len(name) < 2:
#             return name

#         choice = random.choice(['insert_space', 'remove_space', 'swap_chars'])
#         chars = list(name)

#         if choice == 'insert_space':
#             idx = random.randint(1, len(chars) - 1)
#             return name[:idx] + ' ' + name[idx:]
#         elif choice == 'remove_space':
#             if ' ' in name:
#                 idx = name.index(' ')
#                 return name[:idx] + name[idx+1:]
#             else:
#                 return name
#         elif choice == 'swap_chars':
#             idx = random.randint(0, len(chars) - 2)
#             chars[idx], chars[idx+1] = chars[idx+1], chars[idx]
#             return ''.join(chars)

#     for _, row in df.iterrows():
#         name = row['dirty_name']
#         label = row['dirty_label']

#         augmented_name = add_character_noise(name)  # Apply noise
#         augmented_rows.append({
#             'dirty_name': augmented_name,
#             'dirty_label': label
#         })

#     # Combine original and augmented
#     df_aug = pd.DataFrame(augmented_rows)
#     augmented_df = pd.concat([df, df_aug], ignore_index=True)
#     return augmented_df


In [142]:
# Augmentation with specific categories
def augment_dataset_with_noise(df, seed=42):
    """
    Augments the dataset by adding noisy copies of 'University' entries,
    applying character-level noise:
      - 'University': 6 augmentations per original
      - 'Company': augmentation currently disabled (see commented-out branch)

    Parameters:
    - df: pandas.DataFrame
        DataFrame containing 'dirty_name' and 'dirty_label' columns.
    - seed: int
        Random seed for reproducibility.

    Returns:
    - augmented_df: pandas.DataFrame
        DataFrame with original and augmented entries.
    """

    random.seed(seed)
    augmented_rows = []

    def add_character_noise(name):
        """Applies one of three noise types to the string: insert space, remove space, or swap characters."""
        if len(name) < 2:
            return name

        choice = random.choice(['insert_space', 'remove_space', 'swap_chars'])
        chars = list(name)

        if choice == 'insert_space':
            idx = random.randint(1, len(chars) - 1)
            return name[:idx] + ' ' + name[idx:]
        elif choice == 'remove_space':
            if ' ' in name:
                idx = name.index(' ')
                return name[:idx] + name[idx+1:]
            else:
                return name
        elif choice == 'swap_chars':
            idx = random.randint(0, len(chars) - 2)
            chars[idx], chars[idx+1] = chars[idx+1], chars[idx]
            return ''.join(chars)

    for _, row in df.iterrows():
        name = row['dirty_name']
        label = row['dirty_label']

        if label == 'University':
            for _ in range(6):
                augmented_name = add_character_noise(name)
                augmented_rows.append({
                    'dirty_name': augmented_name,
                    'dirty_label': label
                })
        # elif label == 'Company':
        #     for _ in range(2):
        #         augmented_name = add_character_noise(name)
        #         augmented_rows.append({
        #             'dirty_name': augmented_name,
        #             'dirty_label': label
        #         })

    df_aug = pd.DataFrame(augmented_rows)
    augmented_df = pd.concat([df, df_aug], ignore_index=True)
    return augmented_df

In [143]:
augmented_train_df = augment_dataset_with_noise(reduced_df)
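The three noise operations can be tried on a single string as sketched below. This is a variation, not the notebook's code: it takes an explicit `random.Random` instance, lets the caller force a particular operation, and removes a randomly chosen space, whereas the notebook version always removes the first space (which is why the same fused form of a name can appear several times in the augmented data).

```python
import random


def add_character_noise(name, rng, choice=None):
    """Apply one noise operation: insert a space, remove a space, or swap neighbours."""
    if len(name) < 2:
        return name
    choice = choice or rng.choice(['insert_space', 'remove_space', 'swap_chars'])

    if choice == 'insert_space':
        idx = rng.randint(1, len(name) - 1)
        return name[:idx] + ' ' + name[idx:]
    if choice == 'remove_space':
        spaces = [i for i, ch in enumerate(name) if ch == ' ']
        if not spaces:
            return name
        idx = rng.choice(spaces)  # assumption: random space, not the first one
        return name[:idx] + name[idx + 1:]

    # swap_chars: exchange two adjacent characters
    idx = rng.randint(0, len(name) - 2)
    chars = list(name)
    chars[idx], chars[idx + 1] = chars[idx + 1], chars[idx]
    return ''.join(chars)


rng = random.Random(0)
for op in ('insert_space', 'remove_space', 'swap_chars'):
    print(op, '->', add_character_noise('luther university', rng, op))
```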

In [133]:
# augmented_train_df = reduced_df

In [66]:
# augmented_train_df = augment_dataset_with_noise(train_df_adjusted)

In [144]:
augmented_train_df = reassign_random_pattern_in_place(augmented_train_df, get_label_patterns())

In [145]:
print(augmented_train_df.shape)

(1875, 2)
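The reported shapes can be reproduced from the training split counts (3306 Person, 598 Company, 88 University), which is a quick way to confirm that only University rows were augmented and that six copies were added per row:

```python
person, company, university = 3306, 598, 88

# remove_person_examples keeps int(n * (1 - percentage)) Person rows
kept_person = int(person * (1 - 0.8))
reduced_rows = kept_person + company + university

# augment_dataset_with_noise adds 6 noisy copies per University row
augmented_rows = reduced_rows + university * 6

print(kept_person, reduced_rows, augmented_rows)
```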


In [146]:
augmented_train_df.head(50)

Unnamed: 0,dirty_name,dirty_label
0,thoughtbeat llp capital,Company
1,livetube geba ltd,Company
2,mr angelique ivashchenko,Person
3,peernet global bw group,Company
4,dr tshepiso manamela,Person
5,keeley gabbidon cc,Company
6,castelo branco politecnico,University
7,ms hannie woollam,Person
8,college of kashmir,University
9,bird lifsey,Person


In [147]:
augmented_train_df.tail(50)

Unnamed: 0,dirty_name,dirty_label
1825,brunel polit ecnico uxbridge instituto,University
1826,brunel politencico uxbridge universidade,University
1827,luther univresity university,University
1828,luther univ ersity universidad,University
1829,luther univers ity politecnico,University
1830,lutheruniversity universidade,University
1831,luther univers ity institute,University
1832,lutheruniversity college,University
1833,xiangtan instiutte,University
1834,xiantgan universidad,University


# Train model

In [148]:
def train_and_evaluate_logreg_model(train_df, test_df, model_path='best_logreg_model.joblib'):
    """
    Trains a Logistic Regression model with TF-IDF features and evaluates on the test set.
    Reports metrics every 25 iterations and saves separate plots for each metric, per-class metrics,
    and a confusion matrix plot.

    Parameters:
    - train_df: pandas.DataFrame
        Training dataset with 'dirty_name' and 'dirty_label' columns.
    - test_df: pandas.DataFrame
        Test dataset with same structure.
    - model_path: str
        Path to save the best model.

    Returns:
    - metrics: dict
        Dictionary of evaluation metrics (accuracy, recall, F1, confusion matrix).
    """
    X_train = train_df['dirty_name']
    y_train = train_df['dirty_label']
    X_test = test_df['dirty_name']
    y_test = test_df['dirty_label']

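    # Vectoriser: with the default analyzer='word', ngram_range=(2, 5) produces
    # word n-grams of 2 to 5 tokens, so single words are not used as features
    vectoriser = TfidfVectorizer(ngram_range=(2, 5), max_features=3000)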
    X_train_tfidf = vectoriser.fit_transform(X_train)
    X_test_tfidf = vectoriser.transform(X_test)

    # Model with warm_start to track progress
    model = LogisticRegression(
        C=1,
        class_weight='balanced',
        max_iter=25,
        warm_start=True,
        verbose=0
    )

    max_iter = 100
    steps = max_iter // 25
    all_metrics = {'loss': [], 'accuracy': [], 'recall': [], 'f1': []}

    for i in range(steps):
        model.max_iter += 25
        model.fit(X_train_tfidf, y_train)

        y_pred = model.predict(X_test_tfidf)
        y_proba = model.predict_proba(X_test_tfidf)

        loss = log_loss(y_test, y_proba)
        acc = accuracy_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred, average='macro')
        f1 = f1_score(y_test, y_pred, average='macro')

        all_metrics['loss'].append(loss)
        all_metrics['accuracy'].append(acc)
        all_metrics['recall'].append(recall)
        all_metrics['f1'].append(f1)

        print(f"After {model.max_iter} iterations:")
        print(f"  Loss: {loss:.4f}, Accuracy: {acc:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}\n")

    # Final predictions
    y_pred_final = model.predict(X_test_tfidf)
    cm = confusion_matrix(y_test, y_pred_final, labels=np.unique(y_test))
    report = classification_report(y_test, y_pred_final, output_dict=True)
    report_df = pd.DataFrame(report).transpose()

    # Save model and vectoriser in a pipeline
    pipeline = Pipeline([
        ('tfidf', vectoriser),
        ('clf', model)
    ])
    dump(pipeline, model_path)

    # Save plots directory
    plot_dir = os.path.splitext(model_path)[0] + "_plots"
    os.makedirs(plot_dir, exist_ok=True)
    x_ticks = [(i + 2) * 25 for i in range(steps)]  # max_iter values used: 50, 75, 100, 125

    # Save individual metric plots
    for metric in all_metrics:
        plt.figure()
        plt.plot(x_ticks, all_metrics[metric], marker='o')
        plt.title(f'{metric.capitalize()} over Iterations')
        plt.xlabel('Iterations')
        plt.ylabel(metric.capitalize())
        plt.grid(True)
        plt.tight_layout()
        plot_path = os.path.join(plot_dir, f"{metric}_plot.png")
        plt.savefig(plot_path)
        plt.close()

    # Save per-class metrics plot
    for class_label in report_df.index[:-3]:  # Exclude avg/total rows
        plt.figure()
        for metric in ['precision', 'recall', 'f1-score']:
            plt.bar(metric, report_df.loc[class_label, metric])
        plt.ylim(0, 1)
        plt.title(f'Performance for class: {class_label}')
        plt.tight_layout()
        plot_path = os.path.join(plot_dir, f"{class_label}_metrics.png")
        plt.savefig(plot_path)
        plt.close()

    # Save confusion matrix plot
    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test), cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    cm_path = os.path.join(plot_dir, "confusion_matrix.png")
    plt.tight_layout()
    plt.savefig(cm_path)
    plt.close()

    return {
        'accuracy': all_metrics['accuracy'][-1],
        'recall': all_metrics['recall'][-1],
        'f1_score': all_metrics['f1'][-1],
        'confusion_matrix': cm
    }


In [149]:
metrics = train_and_evaluate_logreg_model(augmented_train_df, test_df, model_path='best_logreg_model.joblib')

After 50 iterations:
  Loss: 0.9511, Accuracy: 0.8604, Recall: 0.5737, F1 Score: 0.6295

After 75 iterations:
  Loss: 0.9511, Accuracy: 0.8604, Recall: 0.5737, F1 Score: 0.6295

After 100 iterations:
  Loss: 0.9511, Accuracy: 0.8604, Recall: 0.5737, F1 Score: 0.6295

After 125 iterations:
  Loss: 0.9511, Accuracy: 0.8604, Recall: 0.5737, F1 Score: 0.6295
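One possible contributor to the low macro recall is the feature definition. `TfidfVectorizer` defaults to `analyzer='word'`, so `ngram_range=(2, 5)` builds features from runs of two to five words and never from a single word such as 'ltd' or 'university'. The sketch below approximates that behaviour with the standard library (using the default token pattern of two or more word characters) and contrasts it with character n-grams; the helper names are hypothetical and this is an illustration, not scikit-learn's implementation.

```python
import re

# Approximation of scikit-learn's default token pattern.
TOKEN_PATTERN = re.compile(r'\b\w\w+\b')


def word_ngrams(text, n_min=2, n_max=5):
    """Word n-grams of n_min..n_max tokens, joined by single spaces."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min=2, n_max=5):
    """Character n-grams of n_min..n_max characters."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


print(word_ngrams('altusi group ltd'))
print(word_ngrams('xiangtan instiutte'))
print('insti' in char_ngrams('xiangtan instiutte'))
```

With word n-grams, a misspelt augmented name contributes only features that are unlikely to recur in unseen data, while character n-grams still share fragments with the correctly spelt keyword. Passing `analyzer='char_wb'` (or `'char'`) to `TfidfVectorizer` may therefore be worth trying; whether it improves the scores here has not been tested in this notebook.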



In [102]:
print(metrics)

{'accuracy': 0.8581081081081081, 'recall': 0.5686868686868687, 'f1_score': 0.6209774204056441, 'confusion_matrix': array([[  7,  59,   0],
       [  0, 368,   0],
       [  0,   4,   6]])}
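The summary metrics can be recomputed from the confusion matrix alone, which makes it clearer where the macro scores come from. The sketch assumes the rows and columns are ordered Company, Person, University (the sorted order produced by `np.unique`, and consistent with the row sums 66, 368 and 10 in the validation split).

```python
labels = ['Company', 'Person', 'University']
cm = [[7, 59, 0],
      [0, 368, 0],
      [0, 4, 6]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

recalls, f1s = {}, {}
for i, label in enumerate(labels):
    tp = cm[i][i]
    fn = sum(cm[i]) - tp                                   # rest of the row
    fp = sum(cm[r][i] for r in range(len(labels))) - tp    # rest of the column
    recalls[label] = tp / (tp + fn)
    f1s[label] = 2 * tp / (2 * tp + fp + fn)

macro_recall = sum(recalls.values()) / len(labels)
macro_f1 = sum(f1s.values()) / len(labels)

print(f'accuracy={accuracy:.4f} macro_recall={macro_recall:.4f} macro_f1={macro_f1:.4f}')
print({label: round(value, 3) for label, value in recalls.items()})
```

The high accuracy is driven almost entirely by the Person class: 59 of 66 Company names and 4 of 10 University names are predicted as Person, which is what pulls the macro recall and macro F1 down.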
