# Melio Fullstack Data Scientist Technical Interview

### Task 1: Building the classifier

This is the main data science component of the technical assessment.

Build a classifier to determine whether the name belongs to a `Person`, `Company`, or `University`:

  - You can use any library you want.
  - You can use rule-based classification, a pre-built model/embedding, a model you build yourself, or a hybrid.
  - Format:
    - If you are building an ML solution, the training of your model can be in a Jupyter notebook.
    - If you are not building an ML solution, you will have to embed your Python code into the app.

Note that the classifications are generated by the client's upstream system and are not always correct.

## Submission Requirements

  1. Give enough information on how to run your solution (i.e. python version, packages, requirements.txt, Dockerfile, etc.).
  2. State all of your assumptions, if any.
  3. There is no right or wrong answer, but give clear reasoning for each step you took.


In [1]:
import pandas as pd

df = pd.read_csv('names_data_candidate.csv')
df.sample(10)

Unnamed: 0,dirty_name,dirty_label
3799,SIPHOKAZI OBERHOLZER,Person
3866,Kori Growcock,Person
992,NADEAN COCKERILL,Person
725,rev. anna mlambo,Person
113,nomfundo ngubane,Person
4123,deny gallihawk,Person
1356,MYMM skaboo voonyx Photospace inc.,Company
1453,Elvina Buchett,Person
1874,DR. BARRY ZUMA,Person
2316,CHESTON ARISS,Person


In [7]:
# imports
import string
import unidecode
import re

# Data Exploration  

## First, lowercase the names and remove punctuation and accents to create cleaner data to work with

In [4]:
def lowercase_names(df):
    """
    Converts all entries in the 'dirty_name' column to lowercase.

    Parameters:
    - df: pandas.DataFrame
        Input DataFrame with a 'dirty_name' column.

    Returns:
    - df: pandas.DataFrame
        DataFrame with 'dirty_name' values converted to lowercase.

    This step standardises casing to avoid case-sensitive mismatches in rule-based
    keyword matching and improves model generalisation.
    """
    df['dirty_name'] = df['dirty_name'].str.lower()
    return df

def remove_punctuation(df):
    """
    Removes punctuation characters from the 'dirty_name' column.

    Parameters:
    - df: pandas.DataFrame
        Input DataFrame with a 'dirty_name' column.

    Returns:
    - df: pandas.DataFrame
        DataFrame with punctuation removed from 'dirty_name'.

    This reduces noise from characters like '.', ',', etc., which can vary in presence
    but rarely carry class-discriminative value.
    """
    translator = str.maketrans('', '', string.punctuation)
    df['dirty_name'] = df['dirty_name'].apply(lambda name: name.translate(translator))
    return df


def remove_accents(df):
    """
    Replaces accented characters in the 'dirty_name' column with ASCII equivalents.

    Parameters:
    - df: pandas.DataFrame
        Input DataFrame with a 'dirty_name' column.

    Returns:
    - df: pandas.DataFrame
        DataFrame with special/accented characters replaced.

    This improves matching for names with diacritics (e.g. 'Émile' → 'Emile').
    """
    df['dirty_name'] = df['dirty_name'].apply(unidecode.unidecode)
    return df


In [5]:
df = lowercase_names(df)
df = remove_punctuation(df)
df = remove_accents(df)
print(df.head(20))

                       dirty_name dirty_label
0                  wright pentlow      Person
1                ms sydney hadebe      Person
2             prof hennie vorster      Person
3                   enrica hayter      Person
4                    teboho ngema      Person
5                    irene klaves      Person
6                   aila tenpenny      Person
7            oceanne dawidowitsch      Person
8                 long mac geffen      Person
9                thabiso blignaut      Person
10              emalee le strange      Person
11                 lindiwe wright      Person
12  imibono fuels capital pty ltd     Company
13          imibono fuels pty ltd     Company
14                  imibono fuels     Company
15          mr miss bronwyn kotze      Person
16            dr anson dudderidge      Person
17          prof frederick turner     Company
18               katherina hawkey      Person
19        dr rev hannes mashinini      Person
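
The three cleaning steps above can be sketched for a single string, which makes them easy to check by hand. This is only an illustration: `normalise_name` is a hypothetical helper, and it uses the standard-library `unicodedata` module as a stand-in for `unidecode`, so it only folds characters that decompose into a base letter plus combining marks.

```python
import string
import unicodedata


def normalise_name(name: str) -> str:
    """Lowercase, strip ASCII punctuation, and fold accents for one name."""
    name = name.lower()
    name = name.translate(str.maketrans('', '', string.punctuation))
    # Assumption: stand-in for unidecode. Decompose accented characters,
    # then drop the combining marks that carry the accents.
    decomposed = unicodedata.normalize('NFKD', name)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalise_name('DR. BARRY ZUMA'))  # dr barry zuma
print(normalise_name('Émile'))           # emile
```

Unlike `unidecode`, this stand-in leaves characters without a decomposition (for example 'ø') untouched, so the notebook's `remove_accents` remains the more thorough option.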


## Remove known mislabelled instances

In [4]:
# identify noisy labels
def identify_noisy_labels(df):
    """
    Identifies and counts noisy or invalid labels in the classification dataset.

    Parameters:
    - df: pandas.DataFrame
        Input DataFrame containing at least 'dirty_label' column.

    Returns:
    - noisy_df: pandas.DataFrame
        Subset of rows with invalid labels.
    - num_noisy: int
        Count of noisy label rows.

    This function compares all entries in the 'dirty_label' column against a known set
    of valid classes ('Person', 'Company', 'University'). Any entry not matching these
    is considered noisy.
    """
    valid_labels = {'Person', 'Company', 'University'}

    # Keep only the rows whose label is not in the valid label set
    noisy_df = df[~df['dirty_label'].isin(valid_labels)]
    num_noisy = len(noisy_df)

    print(f"Number of noisy label entries: {num_noisy}")
    return noisy_df, num_noisy



In [5]:
noisy_df, num_noisy = identify_noisy_labels(df)

Number of noisy label entries: 0


In [None]:
# no entries are missing classifications

In [8]:
# Check for misclassifications using a rule-based approach

def get_label_patterns():
    """
    Defines common indicative substrings or patterns for each label class.
    
    Returns:
    - patterns: dict
        A dictionary with keys as class labels ('Person', 'Company', 'University') and
        values as lists of lowercase keywords typically found in those categories.
    """
    return {
        'Person': [
            'mr', 'mrs', 'ms', 'miss', 'dr', 'prof', 'rev', 'sr'
        ],
        'Company': [
            'pty', 'ltd', 'inc', 'cc', 'corp', 'company', 'llc', 'gmbh', 'foundation', 'trust',
            'capital', 'group', 'enterprises', 'holdings', 'investments'
        ],
        'University': [
            'university', 'college', 'institute', 'technikon', 'polytechnic', 'school of',
            'politecnico', 'instituto', 'universidad', 'universidade', 'universite'
        ]
    }

def flag_misclassified_entries(df):
    """
    Identifies rows where the label seems to contradict common naming patterns.

    Parameters:
    - df: pandas.DataFrame
        DataFrame with 'dirty_name' and 'dirty_label' columns.

    Returns:
    - mismatched_df: pandas.DataFrame
        Subset of rows suspected to be misclassified based on rule-based patterns.
    - num_mismatches: int
        Count of suspected misclassifications.

    This function uses common keyword patterns for each class and flags entries
    where the name suggests one class, but the label is something else.
    """
    patterns = get_label_patterns()
    
    # Precompile regex patterns for efficiency
    compiled_patterns = {
        label: re.compile('|'.join([fr'\b{re.escape(p)}\b' for p in keywords]), re.IGNORECASE)
        for label, keywords in patterns.items()
    }

    mismatches = []
    
    for _, row in df.iterrows():
        name = row['dirty_name']
        label = row['dirty_label']
        
        # Determine expected label(s) based on pattern match
        expected_labels = [lbl for lbl, pat in compiled_patterns.items() if pat.search(name)]
        
        # If the actual label is not one of the expected ones, flag as mismatch
        if expected_labels and label not in expected_labels:
            mismatches.append(row)

    mismatched_df = pd.DataFrame(mismatches)
    num_mismatches = len(mismatched_df)

    print(f"Number of suspected misclassified rows: {num_mismatches}")
    return mismatched_df, num_mismatches

def clean_misclassified_entries(df):
    """
    Removes suspected misclassified rows from the dataset based on naming patterns.

    Parameters:
    - df: pandas.DataFrame
        Original dataset.

    Returns:
    - cleaned_df: pandas.DataFrame
        Dataset with suspected misclassified rows removed.

    This function uses the rule-based flagging logic to detect and exclude entries
    that appear to have incorrect labels.
    """
    mismatched_df, _ = flag_misclassified_entries(df)
    
    # Remove mismatched entries
    cleaned_df = df.drop(mismatched_df.index).reset_index(drop=True)
    print(f"Cleaned dataset size after removing misclassified rows: {len(cleaned_df)}")

    return cleaned_df


In [11]:
# Get the label patterns
label_patterns = get_label_patterns()
# Flag the misclassified entries and print the count (check it's not too high)
mismatched_df, num_mismatches = flag_misclassified_entries(df)
# The count is low enough, so remove the flagged rows
# (clean_misclassified_entries re-runs the flagging, so the count prints twice)
df = clean_misclassified_entries(df)

Number of suspected misclassified rows: 84
Number of suspected misclassified rows: 84
Cleaned dataset size after removing misclassified rows: 4436
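
To make the flagging rule concrete, here is a minimal sketch of the same word-boundary check applied to plain `(name, label)` pairs taken from the sample output earlier in the notebook. The keyword lists are a reduced, illustrative subset, and `is_suspect` is a hypothetical helper rather than part of the notebook.

```python
import re

# Reduced, illustrative keyword lists (not the full pattern dictionary).
PATTERNS = {
    'Person': ['mr', 'mrs', 'ms', 'miss', 'dr', 'prof', 'rev'],
    'Company': ['pty', 'ltd', 'inc'],
    'University': ['university', 'college'],
}

# One whole-word regex per class, as in flag_misclassified_entries.
COMPILED = {
    label: re.compile('|'.join(rf'\b{re.escape(k)}\b' for k in keywords))
    for label, keywords in PATTERNS.items()
}


def is_suspect(name: str, label: str) -> bool:
    """True when the name matches some class keyword, but none for its own label."""
    expected = [lbl for lbl, pat in COMPILED.items() if pat.search(name)]
    return bool(expected) and label not in expected


rows = [
    ('prof frederick turner', 'Company'),  # title suggests Person -> flagged
    ('imibono fuels pty ltd', 'Company'),  # keyword agrees with label
    ('lindiwe wright', 'Person'),          # no keyword at all -> not flagged
]
for name, label in rows:
    print(name, '|', label, '|', is_suspect(name, label))
```

Names without any keyword are never flagged, so this rule can only catch label noise among names that carry a title, company suffix, or institution word.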


## Remove doubled prefixes from names

In [12]:
def remove_double_prefixes(df):
    """
    Removes the first two words in 'dirty_name' if both are common prefixes
    (e.g., 'Mr Dr Jane Doe' → 'Jane Doe').

    Parameters:
    - df: pandas.DataFrame
        Input DataFrame with a 'dirty_name' column (pre-cleaned, lowercase, no punctuation).

    Returns:
    - df: pandas.DataFrame
        DataFrame with modified 'dirty_name' values where double prefixes were removed.

    This function targets cases where two known titles/prefixes occur at the beginning of the name.
    It uses a defined prefix list and only alters names that start with two such prefixes.
    """
    # Define common person prefixes
    prefixes = {'mr', 'mrs', 'ms', 'miss', 'dr', 'prof', 'rev', 'sr'}

    def clean_name(name):
        # Split name into parts
        parts = name.strip().split()

        # Check if first two words are both valid prefixes
        if len(parts) >= 2 and parts[0] in prefixes and parts[1] in prefixes:
            return ' '.join(parts[2:])  # Remove both
        return name

    df['dirty_name'] = df['dirty_name'].apply(clean_name)
    return df


In [13]:
df = remove_double_prefixes(df)
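
As a quick sanity check, the same logic can be applied to individual strings. The sketch below mirrors `clean_name` as a standalone function (`strip_double_prefix` is a hypothetical name) and runs it on names from the sample output above, plus one made-up edge case.

```python
PREFIXES = {'mr', 'mrs', 'ms', 'miss', 'dr', 'prof', 'rev', 'sr'}


def strip_double_prefix(name: str) -> str:
    """Drop the first two words only when both are known titles."""
    parts = name.strip().split()
    if len(parts) >= 2 and parts[0] in PREFIXES and parts[1] in PREFIXES:
        return ' '.join(parts[2:])
    return name


samples = [
    'mr miss bronwyn kotze',    # two titles: both removed
    'dr rev hannes mashinini',  # two titles: both removed
    'ms sydney hadebe',         # single title: left unchanged
    'dr rev',                   # made-up edge case: nothing left after stripping
]
for sample in samples:
    print(repr(strip_double_prefix(sample)))
```

The last sample shows that a name consisting only of two titles becomes an empty string, which may be worth checking for before training.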

## Count the number of entries for each category and the spread of patterns

In [20]:
def count_all_patterns_by_label(df, pattern_dict):
    """
    Counts how often each pattern from a label-specific pattern dictionary appears
    in the corresponding names within each label group, and includes the total number
    of entries per group.

    Parameters:
    - df: pandas.DataFrame
        DataFrame with 'dirty_name' and 'dirty_label' columns.
    - pattern_dict: dict
        Dictionary where keys are label names (e.g., 'Person', 'Company') and values are
        lists of lowercase patterns to match.

    Returns:
    - results: dict
        A nested dictionary where each key is a label, and the value is another dictionary
        containing:
            - pattern counts (including 'no_pattern')
            - 'total_entries': total number of rows for that label

    This function gives insight into how many names match each pattern and the pattern
    coverage relative to total entries per category.
    """
    results = {}

    for label, patterns in pattern_dict.items():
        label_df = df[df['dirty_label'] == label]
        total = len(label_df)

        pattern_counts = {p: 0 for p in patterns}
        pattern_counts['no_pattern'] = 0
        pattern_counts['total_entries'] = total

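        for name in label_df['dirty_name']:
            found = False
            for p in patterns:
                # Plain substring test: 'mr' also matches inside 'mrs', so these
                # counts overlap and are looser than the word-boundary flagging above.
                if p in name:
                    pattern_counts[p] += 1
                    found = True
            if not found:
                pattern_counts['no_pattern'] += 1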

        print(f"\nPattern match counts for label: '{label}'")
        print(f"Total entries: {total}")
        for k, v in pattern_counts.items():
            if k != 'total_entries':
                print(f"{k}: {v}")
        
        results[label] = pattern_counts

    return results


In [21]:
count_all_patterns_by_label(df, get_label_patterns())


Pattern match counts for label: 'Person'
Total entries: 3674
mr: 364
mrs: 171
ms: 121
miss: 230
dr: 372
prof: 119
rev: 69
sr: 55
no_pattern: 2383

Pattern match counts for label: 'Company'
Total entries: 664
pty: 96
ltd: 128
inc: 25
cc: 54
corp: 25
company: 11
llc: 15
gmbh: 21
foundation: 4
trust: 18
capital: 6
group: 12
enterprises: 0
holdings: 4
investments: 6
no_pattern: 348

Pattern match counts for label: 'University'
Total entries: 98
university: 58
college: 14
institute: 2
technikon: 0
polytechnic: 0
school of: 1
politecnico: 2
instituto: 3
universidad: 15
universidade: 2
universite: 2
no_pattern: 6


{'Person': {'mr': 364,
  'mrs': 171,
  'ms': 121,
  'miss': 230,
  'dr': 372,
  'prof': 119,
  'rev': 69,
  'sr': 55,
  'no_pattern': 2383,
  'total_entries': 3674},
 'Company': {'pty': 96,
  'ltd': 128,
  'inc': 25,
  'cc': 54,
  'corp': 25,
  'company': 11,
  'llc': 15,
  'gmbh': 21,
  'foundation': 4,
  'trust': 18,
  'capital': 6,
  'group': 12,
  'enterprises': 0,
  'holdings': 4,
  'investments': 6,
  'no_pattern': 348,
  'total_entries': 664},
 'University': {'university': 58,
  'college': 14,
  'institute': 2,
  'technikon': 0,
  'polytechnic': 0,
  'school of': 1,
  'politecnico': 2,
  'instituto': 3,
  'universidad': 15,
  'universidade': 2,
  'universite': 2,
  'no_pattern': 6,
  'total_entries': 98}}
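
When reading these counts, note that `count_all_patterns_by_label` tests `p in name`, which is a substring match, while the flagging step matches whole words. The sketch below compares the two on a few names to show how the counts can differ. The helper names are hypothetical, and the first two names are made up for illustration.

```python
import re


def substring_count(names, keyword):
    """Count names containing the keyword anywhere, even inside another word."""
    return sum(keyword in name for name in names)


def whole_word_count(names, keyword):
    """Count names containing the keyword as a separate word."""
    pat = re.compile(rf'\b{re.escape(keyword)}\b')
    return sum(bool(pat.search(name)) for name in names)


names = ['mr john smith', 'mrs jane williams', 'ms sydney hadebe']
for kw in ('mr', 'ms'):
    print(kw, substring_count(names, kw), whole_word_count(names, kw))
# mr 2 1   ('mr' is also found inside 'mrs')
# ms 2 1   ('ms' is also found at the end of 'williams')
```

If the per-keyword counts are meant to guide the augmentation, a whole-word count is probably the safer basis.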

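Suggested augmentation: add the common keywords above to names at random during training so that they are spread more evenly across each class.
The classes are also imbalanced (3674 Person, 664 Company, 98 University), so the data split needs to keep them evenly represented when training.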
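
One possible reading of this augmentation idea is sketched below: with some probability, prepend a random title to a Person name or append a random suffix to a Company name. Everything here is an assumption made for illustration: the `augment` helper, the probability `p`, the shortened keyword lists, and the choice to leave University names unchanged.

```python
import random

# Illustrative keyword lists, taken from the pattern dictionary above.
TITLES = ['mr', 'mrs', 'ms', 'miss', 'dr', 'prof', 'rev', 'sr']
COMPANY_SUFFIXES = ['pty', 'ltd', 'inc', 'cc', 'corp', 'llc', 'gmbh']


def augment(name, label, rng, p=0.5):
    """With probability p, add a random class keyword to the name."""
    if rng.random() >= p:
        return name
    if label == 'Person':
        return f'{rng.choice(TITLES)} {name}'
    if label == 'Company':
        return f'{name} {rng.choice(COMPANY_SUFFIXES)}'
    return name  # Assumption: other classes are left unchanged.


rng = random.Random(0)  # Seeded so a run is repeatable.
rows = [('teboho ngema', 'Person'), ('imibono fuels', 'Company')]
augmented = [(augment(name, label, rng), label) for name, label in rows]
print(augmented)
```

For the split itself, scikit-learn's `train_test_split` accepts a `stratify` argument that keeps the label proportions the same in each split, which would be one way to handle the imbalance.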