# Melio Fullstack Data Scientist Technical Interview

### Task 1: Building the classifier

This is the main data science component of the technical assessment.

Build a classifier to determine whether the name belongs to a `Person`, `Company`, or `University`:

  - You can use any library you want.
  - You can use a rule-based classification, a pre-built model/embedding, build a model yourself or a hybrid. 
  - Format:
    - If you are building an ML solution, the training of your model can be in a Jupyter notebook. 
    - If you are not building an ML solution, you will have to embed your python code into the app.

Note that the classifications are generated by the client's upstream system and are not always correct.

## Submission Requirements

  1. Give enough information on how to run your solution (i.e. python version, packages, requirements.txt, Dockerfile, etc.).
  2. State all of your assumptions, if any.
  3. There is no right or wrong answer, but give clear reasoning for each step you took.


In [4]:
import pandas as pd

df = pd.read_csv('data/names_data_candidate.csv')
df.head(10)

Unnamed: 0,dirty_name,dirty_label
0,Wright Pentlow,Person
1,MS Sydney Hadebe,Person
2,PROF. HENNIE VORSTER,Person
3,ENRICA HAYTER,Person
4,Teboho Ngema,Person
5,Irène Klaves,Person
6,Aila Tenpenny,Person
7,OCÉANNE DAWIDOWITSCH,Person
8,Lóng Mac Geffen,Person
9,THABISO BLIGNAUT,Person


- Thought process for Task 1
    1. Task 1 looks like a named entity recognition (NER) problem.
        - I have used spaCy before for a similar task.
        - Other options include Flair and Hugging Face bert-base-NER.
        - According to Papers with Code, the state of the art (SOTA) is ACE, an LSTM Transformer. The dataset used is CoNLL 2003 (English).
        - Let's go with spaCy for a baseline since I've used it before.

        - Let's go with a model-based approach; we can add rules if necessary.
        - We just need to make sure the model is compatible with the deployment.

    2. Inspect the data
        - Check for nulls, missing labels, and other anomalies.
        - Check the distribution of labels.

    3. Since the labels are dirty, we can run the names through the model and check the labels against its predictions.

    4. Prepare the data for training: let's go with 70% train, 10% validation, 20% test.
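
The 70/10/20 split in step 4 can be sketched as below. This is a minimal, standard-library illustration, not the code used later in this notebook: `stratified_split` is a hypothetical helper, and the toy labels only mimic the imbalance in the dataset. In practice, scikit-learn's `train_test_split` with its `stratify` argument is the usual route.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=42):
    """Return (train, val, test) index lists, splitting each label group by `fractions`."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever remains goes to test
    return train, val, test

# Toy labels with roughly the imbalance seen in this dataset (illustrative only).
labels = ["Person"] * 80 + ["Company"] * 16 + ["University"] * 4
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With only a handful of rows in a class, rounding can leave one of the splits without any examples of that class, so it is worth checking per-class counts after splitting.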

In [8]:
# Check for missing values
print(f"Missing values:\n{df.isnull().sum()}")

Missing values:
dirty_name     0
dirty_label    0
dtype: int64


In [9]:
print(f"Number of rows: {len(df)}")
print(f"Unique labels: {df['dirty_label'].unique()}")
print(f"Label distribution:\n{df['dirty_label'].value_counts()}")

Number of rows: 4520
Unique labels: ['Person' 'Company' 'University']
Label distribution:
dirty_label
Person        3690
Company        732
University      98
Name: count, dtype: int64


In [10]:
print(f"Percent label distribution:\n{(df['dirty_label'].value_counts() / len(df) * 100)}")

Percent label distribution:
dirty_label
Person        81.637168
Company       16.194690
University     2.168142
Name: count, dtype: float64


- Only about 2% of the data is labelled University, so the data is heavily imbalanced.
- We need to come up with a strategy.
- Thoughts:
    - Accuracy will be misleading because the data is skewed toward the Person label.
    - Since each row is a bare name and not a sentence, augmenting at the data level may be difficult. Options: oversample the minority class (duplicate rows) or augment the data.
    - At the algorithm level, we can adjust class weights or use prediction thresholds.
    - I could use an LLM to add universities and manually balance the dataset: remove some Persons and add more universities and companies.
    - Stratify when doing splits, especially for validation.
    - Use transfer learning:
        - spaCy is already trained on PERSON entities.
        - So we can leverage that.
        - We can check the labels by running the model over the Person rows only.
    - spaCy also has an organisation (ORG) entity and will classify a university as one.
    - If we treat Company and University as subcategories of ORG and default to Company, it may be a good first approach.
    - A rule-based approach mixed with the model.
    - If an organisation is not a university, it is most likely a company.
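
To make the class-weight idea above concrete, here is a small sketch that computes inverse-frequency weights from the label counts printed earlier. The helper name `balanced_class_weights` is hypothetical; the formula, `n_samples / (n_classes * class_count)`, is the same heuristic scikit-learn uses for `class_weight="balanced"`. This notebook does not go on to train with these weights; it is only an illustration of the option.

```python
def balanced_class_weights(label_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(label_counts.values())
    n_classes = len(label_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in label_counts.items()
    }

# Counts taken from the label distribution output above.
counts = {"Person": 3690, "Company": 732, "University": 98}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
# Person: 0.41
# Company: 2.06
# University: 15.37
```

A misclassified University would then cost roughly 37 times as much as a misclassified Person, which is one way to stop a model from defaulting to the majority class.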
    




#### Let's validate our Person class

Knowing that spaCy can classify entities and that the labels are dirty, let's see how often spaCy agrees with the labels and suggest changes.

In [8]:
import pandas as pd
import spacy
import numpy as np
from tqdm import tqdm  # For progress bar (optional)

In [9]:
def validate_person_labels(csv_path, column_name="dirty_name", label_column="dirty_label"):
    """
    Validate PERSON entity labels by comparing with spaCy's pre-trained model.
    
    Args:
        csv_path: Path to your CSV file
        column_name: Column containing the text data
        label_column: Column containing the entity labels
    
    Returns:
        DataFrame with validation results
    """
    # Load your CSV file
    print("Loading data from CSV...")
    df = pd.read_csv(csv_path)
    
    # Basic data exploration
    print(f"Dataset size: {len(df)} rows")
    print(f"Label distribution:\n{df[label_column].value_counts()}")
    
    # Load spaCy model - using the transformer model (en_core_web_trf) for better NER accuracy
    print("Loading spaCy model...")
    nlp = spacy.load("en_core_web_trf")  # sm, md, or lg are faster alternatives
    
    # Create an empty list to store results
    results = []
    
    # Process each row
    print("Processing entities...")
    for idx, row in tqdm(df.iterrows(), total=len(df)):
        text = row[column_name]
        original_label = row[label_column]
        
        # Process with spaCy
        doc = nlp(text)
        
        # Check if spaCy found any PERSON entities
        person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
        spacy_found_person = len(person_entities) > 0
        
        # Determine if this is a PERSON according to your dataset
        is_labeled_person = original_label.upper()  == "PERSON"
        
        # Check for agreement/disagreement
        status = "CORRECT" if spacy_found_person == is_labeled_person else "POTENTIAL_ERROR"
        
        # Add additional info for potential errors
        confidence = None
        suggestion = None
        
        if status == "POTENTIAL_ERROR":
            if is_labeled_person and not spacy_found_person:
                # You labeled it PERSON, but spaCy didn't find a PERSON
                suggestion = "This might not be a PERSON"
                # Get all entities spaCy found, if any
                other_entities = [(ent.text, ent.label_) for ent in doc.ents]
                confidence = "low" if other_entities else "medium"
            
            elif not is_labeled_person and spacy_found_person:
                # spaCy found PERSON but your label is different
                person_texts = [ent.text for ent in person_entities]
                suggestion = f"This might be a PERSON: {', '.join(person_texts)}"
                # Entities are spans of `text`, so this overlap check always passes
                # and confidence is always "high" in this branch
                confidence = "high" if any(text.find(ent.text) >= 0 for ent in person_entities) else "medium"
        
        # Store results
        results.append({
            "index": idx,
            "text": text,
            "your_label": original_label,
            "spacy_found_person": spacy_found_person,
            "spacy_entities": [(ent.text, ent.label_) for ent in doc.ents],
            "status": status,
            "confidence": confidence,
            "suggestion": suggestion
        })
    
    # Convert to DataFrame
    results_df = pd.DataFrame(results)
    
    # Generate summary statistics
    total = len(results_df)
    correct = len(results_df[results_df["status"] == "CORRECT"])
    potential_errors = len(results_df[results_df["status"] == "POTENTIAL_ERROR"])
    
    print("\n--- SUMMARY STATISTICS ---")
    print(f"Total entries analyzed: {total}")
    print(f"Correct labels: {correct} ({correct/total*100:.2f}%)")
    print(f"Potential errors: {potential_errors} ({potential_errors/total*100:.2f}%)")
    
    # Breakdown of error types
    false_persons = len(results_df[(results_df["your_label"].str.upper() == "PERSON") & (~results_df["spacy_found_person"])])
    missed_persons = len(results_df[(results_df["your_label"].str.upper() != "PERSON") & (results_df["spacy_found_person"])])
    
    print("\n--- ERROR BREAKDOWN ---")
    print(f"Potentially false PERSON labels: {false_persons}")
    print(f"Potentially missed PERSON labels: {missed_persons}")
    
    return results_df

# Function to save the results for further analysis
def save_validation_results(results_df, output_path="person_validation_results.csv"):
    """Save validation results to CSV file"""
    results_df.to_csv(output_path, index=False)
    print(f"Results saved to {output_path}")
    
    # Also save a filtered version with only potential errors for easier review
    errors_df = results_df[results_df["status"] == "POTENTIAL_ERROR"]
    error_path = output_path.replace(".csv", "_errors_only.csv")
    errors_df.to_csv(error_path, index=False)
    print(f"Potential errors saved to {error_path}")
    
    return errors_df

# Function to create a correction guide
def create_correction_guide(errors_df, output_path="person_correction_suggestions.csv"):
    """Create a simplified guide for manual corrections"""
    
    correction_guide = errors_df[["index", "text", "your_label", "spacy_entities", "suggestion"]].copy()
    
    # Add a column for manual decisions
    correction_guide["corrected_label"] = ""
    correction_guide["notes"] = ""
    
    correction_guide.to_csv(output_path, index=False)
    print(f"Correction guide saved to {output_path}")
    
    return correction_guide

# Main execution function
def main(csv_path):
    """Run the full validation process"""
    # Validate person labels
    results_df = validate_person_labels(csv_path)
    
    # Save results
    errors_df = save_validation_results(results_df)
    
    # Create correction guide
    create_correction_guide(errors_df)
    
    # Display sample errors for immediate review
    print("\n--- SAMPLE ERROR CASES ---")
    sample_size = min(10, len(errors_df))
    for i, (_, row) in enumerate(errors_df.sample(sample_size).iterrows()):
        print(f"\nCase {i+1}:")
        print(f"Text: '{row['text']}'")
        print(f"Your label: {row['your_label']}")
        print(f"spaCy entities: {row['spacy_entities']}")
        print(f"Suggestion: {row['suggestion']}")
    
    print("\n--- NEXT STEPS ---")
    print("1. Review the error cases in the generated files")
    print("2. Make corrections to your original dataset based on this analysis")
    print("3. Consider running this validation again after corrections")
    
    return results_df

In [10]:
results = main('data/names_data_candidate.csv') 

Loading data from CSV...
Dataset size: 4520 rows
Label distribution:
dirty_label
Person        3690
Company        732
University      98
Name: count, dtype: int64
Loading spaCy model...
Processing entities...


100%|██████████| 4520/4520 [03:30<00:00, 21.52it/s] 


--- SUMMARY STATISTICS ---
Total entries analyzed: 4520
Correct labels: 4024 (89.03%)
Potential errors: 496 (10.97%)

--- ERROR BREAKDOWN ---
Potentially false PERSON labels: 235
Potentially missed PERSON labels: 261
Results saved to person_validation_results.csv
Potential errors saved to person_validation_results_errors_only.csv
Correction guide saved to person_correction_suggestions.csv

--- SAMPLE ERROR CASES ---

Case 1:
Text: 'Adriaan Africa'
Your label: Person
spaCy entities: [('Adriaan Africa', 'LOC')]
Suggestion: This might not be a PERSON

Case 2:
Text: 'cléopatre zannuto'
Your label: Person
spaCy entities: []
Suggestion: This might not be a PERSON

Case 3:
Text: 'Mr. rev. brandon fredericks'
Your label: Company
spaCy entities: [('brandon fredericks', 'PERSON')]
Suggestion: This might be a PERSON: brandon fredericks

Case 4:
Text: 'Browsecat  Realblab YODO Blognation Bond'
Your label: Person
spaCy entities: []
Suggestion: This might not be a PERSON

Case 5:
Text: 'Mr Dr. Mart




- Cleaning the data manually will take too long, so let's just drop the cases we are unsure of.

In [25]:
print("Loading data from CSV...")
errors_df = pd.read_csv('data/person_validation_results_errors_only.csv')

# Basic data exploration
print(f"Dataset size: {len(errors_df)} rows")
errors_df["index"].to_list()

Loading data from CSV...
Dataset size: 496 rows


[17, 23, 28, 44, 48, 54, 105, 108, 136, 139, 142, 146,
 147, 155, 157, 160, 163, 182, 185, 198, 226, 230, 247, 255,
 287, 290, 294, 301, 307, 349, 354, 356, 364, 367, 371, 380,
 387, 390, 392, 407, 411, 436, 468, 469, 509, 514, 517, 518,
 534, 535, 538, 558, 568, 569, 582, 587, 595, 610, 625, 637,
 652, 657, 664, 672, 680, 692, 696, 723, 738, 748, 758, 762,
 771, 773, 777, 781, 791, 798, 799, 807, 818, 847, 855, 858,
 863, 864, 876, 884, 886, 936, 970, 979, 988, 990, 995, 1006,
 1008, 1018, 1027, 1030, 1032, 1058, 1063, 1092, 1097, 1122, 1131, 1151,
 1211, 1228, 1254, 1278, 1283, 1288, 1308, 1313, 1338, 1340, 1344, 1345,
 1350, 1363, 1364, 1372, 1374, 1380, 1384, 1385, 1392, 1412, 1421, 1424,
 1425, 1452, 1465, 1470, 1471, 1507, 1517, 1542, 1544, 1575, 1584, 1593,
 1603, 1618, 1634, 1635, 1639, 1647, 1648, 1649, 1656, 1662, 1663, 1664,
 1666,
 1

In [27]:
# Drop every row that spaCy flagged as a potential labelling error
df.drop(errors_df["index"].to_list(), axis=0, inplace=True)

In [32]:
df.to_csv("clean_data.csv", index=False)

- Let's build the classifier (thank you, Claude, for your assistance).
- Create our own classifier by keeping only the ORG and PERSON entities from spaCy.
- Add keywords and rules for classifying an organisation as a university or a company.

In [None]:
import spacy
from spacy.tokens import Doc
from spacy.language import Language
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd

# Add custom extension to store organization subtypes
Doc.set_extension("org_subtypes", default={}, force=True)

@Language.factory("org_subclassifier")
class OrganizationSubclassifier:
    """A component that filters for PERSON/ORG entities and subclassifies organizations"""
    
    def __init__(self, nlp, name):
        self.name = name
        
        # Keywords for university classification
        self.university_keywords = [
            "university", "college", "institute", "school", 
            "academy", "polytechnic", "conservatory"
        ]
        
        # Keywords for company classification
        self.company_keywords = [
            "inc", "corp", "ltd", "limited", "llc", "company", 
            "technologies", "systems", "group", "industries"
        ]
        
        # Known entities
        self.known_entities = {
            "mit": "UNIVERSITY",
            "harvard": "UNIVERSITY",
            "oxford": "UNIVERSITY",
            "cambridge": "UNIVERSITY",
            "apple": "COMPANY",
            "google": "COMPANY",
            "microsoft": "COMPANY",
            "amazon": "COMPANY"
        }
    
    def __call__(self, doc):
        # Filter entities to keep only PERSON and ORG
        filtered_ents = []
        
        # If no entities were found but we have text, try to classify it
        if not doc.ents and len(doc.text) > 0:
            # Since we're dealing with single entities, try to classify the whole text
            entity_text = doc.text.lower()
            entity_type = self._guess_entity_type(entity_text)
            
            if entity_type in ["PERSON", "ORG"]:
                # Create a span covering the whole text
                span = doc.char_span(0, len(doc.text), label=entity_type)
                if span:
                    filtered_ents.append(span)
        else:
            # Filter existing entities
            for ent in doc.ents:
                if ent.label_ in ["PERSON", "ORG"]:
                    filtered_ents.append(ent)
        
        # Overwrite doc.ents with our filtered list
        doc.ents = filtered_ents
        
        # Subclassify organizations
        for ent in doc.ents:
            if ent.label_ == "ORG":
                subtype = self._classify_organization(ent.text)
                doc._.org_subtypes[ent.text] = subtype
        
        return doc
    
    def _guess_entity_type(self, text):
        """Guess if an entity is a PERSON or ORG when spaCy NER doesn't detect it"""
        text = text.lower()
        
        # Check for known entities
        for entity, subtype in self.known_entities.items():
            if entity in text:
                return "ORG"  # All our known entities are organizations
        
        # Check for organization keywords
        if any(keyword in text for keyword in self.university_keywords + self.company_keywords):
            return "ORG"
        
        # Check for person-like patterns (1-3 words, no special chars except ' and -)
        words = text.split()
        if (1 <= len(words) <= 3 and 
            all(word.isalpha() or "'" in word or "-" in word for word in words)):
            return "PERSON"
        
        # Default to ORG 
        return "ORG"
    
    def _classify_organization(self, text):
        """Subclassify an organization as UNIVERSITY or COMPANY"""
        text = text.lower()
        
        # Check known entities first
        for entity, subtype in self.known_entities.items():
            if entity in text:
                return subtype
        
        # Check for university indicators
        if any(keyword in text for keyword in self.university_keywords) or " of " in text:
            return "UNIVERSITY"
        
        # Default to COMPANY
        return "COMPANY"

def evaluate_entity_classifier(nlp, data):
    """Evaluate the entity classifier on test data"""
    y_true_main = []  # Main entity type (PERSON/ORG)
    y_pred_main = []
    
    y_true_sub = []   # Organization subtype (UNIVERSITY/COMPANY)
    y_pred_sub = []
    
    for _, row in data.iterrows():
        text = row["entity"]
        true_label = row["label"].lower()
        
        # Map true labels
        if "person" in true_label:
            true_main = "PERSON"
            true_sub = None
        elif "company" in true_label:
            true_main = "ORG"
            true_sub = "COMPANY"
        elif "university" in true_label:
            true_main = "ORG"
            true_sub = "UNIVERSITY"
        else:
            # Skip other entity types that aren't relevant
            continue
        
        # Classify with our model
        doc = nlp(text)
        
        # Check if any entity was found
        if doc.ents:
            pred_main = doc.ents[0].label_
            
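            # Get subtype for organizations.
            # Note: subtypes are keyed by ent.text, so this lookup by the full input
            # text falls back to "UNKNOWN" when the entity covers only part of it.
            if pred_main == "ORG":
                pred_sub = doc._.org_subtypes.get(text, "UNKNOWN")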
            else:
                pred_sub = None
        else:
            # No entity detected
            pred_main = "UNKNOWN"
            pred_sub = None
        
        # Record results
        y_true_main.append(true_main)
        y_pred_main.append(pred_main)
        
        # Only evaluate subtypes for organizations
        if true_main == "ORG" and true_sub:
            y_true_sub.append(true_sub)
            
            # If main prediction is wrong, count subtype as wrong too
            if pred_main == "ORG":
                y_pred_sub.append(pred_sub)
            else:
                y_pred_sub.append("WRONG_MAIN_TYPE")
    
    # Calculate metrics
    results = {
        'main_accuracy': accuracy_score(y_true_main, y_pred_main),
        'main_report': classification_report(y_true_main, y_pred_main, zero_division=0),
    }
    
    if y_true_sub:
        results['sub_accuracy'] = accuracy_score(y_true_sub, y_pred_sub)
        results['sub_report'] = classification_report(y_true_sub, y_pred_sub, zero_division=0)
    
    return results

def main():
    # Load data
    df = pd.read_csv("clean_data.csv")
    df.columns = ["dirty_name", "dirty_label"]
    df["entity"] = df["dirty_name"].str.strip()
    df["label"] = df["dirty_label"].str.strip()
    
    # Load spaCy with NER
    nlp = spacy.load("en_core_web_md")
    
    # Add our subclassifier after NER
    nlp.add_pipe("org_subclassifier", after="ner")
    
    # Evaluate
    results = evaluate_entity_classifier(nlp, df)
    
    # Print results
    print("===== EVALUATION RESULTS =====")
    print(f"Main Entity Type Accuracy: {results['main_accuracy']:.4f}")
    print("\nDetailed Report:")
    print(results['main_report'])
    
    if 'sub_accuracy' in results:
        print(f"\nOrganization Subtype Accuracy: {results['sub_accuracy']:.4f}")
        print("\nDetailed Subtype Report:")
        print(results['sub_report'])
    
    # Save the model
    nlp.to_disk("models/entity_classifier")
    print("\nModel saved to 'models/entity_classifier'")

if __name__ == "__main__":
    main()

===== EVALUATION RESULTS =====
Main Entity Type Accuracy: 0.8735

Detailed Report:
              precision    recall  f1-score   support

         ORG       0.58      0.81      0.68       569
      PERSON       0.98      0.88      0.93      3455
     UNKNOWN       0.00      0.00      0.00         0

    accuracy                           0.87      4024
   macro avg       0.52      0.56      0.54      4024
weighted avg       0.92      0.87      0.89      4024


Organization Subtype Accuracy: 0.3726

Detailed Subtype Report:
                 precision    recall  f1-score   support

        COMPANY       0.93      0.34      0.50       473
     UNIVERSITY       0.78      0.52      0.62        96
        UNKNOWN       0.00      0.00      0.00         0
WRONG_MAIN_TYPE       0.00      0.00      0.00         0

       accuracy                           0.37       569
      macro avg       0.43      0.22      0.28       569
   weighted avg       0.90      0.37      0.52       569


Model saved
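
One thing worth checking in the rules above: `entity in text` and `keyword in text` are substring tests, so a short key such as "mit" also matches inside "smith", and "inc" matches inside "prince". The sketch below shows whole-word matching with the standard `re` module as a possible alternative. The names `_word_pattern`, `UNIVERSITY_RE`, and `looks_like_university` are hypothetical and are not part of the saved pipeline; the keyword lists are copied from the component.

```python
import re

UNIVERSITY_KEYWORDS = ["university", "college", "institute", "school",
                       "academy", "polytechnic", "conservatory"]
KNOWN_UNIVERSITIES = ["mit", "harvard", "oxford", "cambridge"]

def _word_pattern(words):
    """Compile a case-insensitive regex matching any of `words` as whole words."""
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

UNIVERSITY_RE = _word_pattern(UNIVERSITY_KEYWORDS + KNOWN_UNIVERSITIES)

def looks_like_university(name):
    """True if `name` contains a university keyword as a whole word."""
    return UNIVERSITY_RE.search(name) is not None

print("mit" in "john smith")                              # True: substring false positive
print(looks_like_university("John Smith"))                # False
print(looks_like_university("MIT Media Lab"))             # True
print(looks_like_university("University of Cape Town"))   # True
```

Whether this changes the reported scores has not been measured here; it only removes one known source of false positives in the keyword rules.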

In [2]:
# Add custom extension to store organization subtypes
Doc.set_extension("org_subtypes", default={}, force=True)

@Language.factory("org_subclassifier")
class OrganizationSubclassifier:
    """A component that filters for PERSON/ORG entities and subclassifies organizations"""
    
    def __init__(self, nlp, name):
        self.name = name
        
        # Keywords for university classification
        self.university_keywords = [
            "university", "college", "institute", "school", 
            "academy", "polytechnic", "conservatory"
        ]
        
        # Keywords for company classification
        self.company_keywords = [
            "inc", "corp", "ltd", "limited", "llc", "company", 
            "technologies", "systems", "group", "industries"
        ]
        
        # Known entities
        self.known_entities = {
            "mit": "UNIVERSITY",
            "harvard": "UNIVERSITY",
            "oxford": "UNIVERSITY",
            "cambridge": "UNIVERSITY",
            "apple": "COMPANY",
            "google": "COMPANY",
            "microsoft": "COMPANY",
            "amazon": "COMPANY"
        }
    
    def __call__(self, doc):
        # Filter entities to keep only PERSON and ORG
        filtered_ents = []
        
        # If no entities were found but we have text, try to classify it
        if not doc.ents and len(doc.text) > 0:
            # Since we're dealing with single entities, try to classify the whole text
            entity_text = doc.text.lower()
            entity_type = self._guess_entity_type(entity_text)
            
            if entity_type in ["PERSON", "ORG"]:
                # Create a span covering the whole text
                span = doc.char_span(0, len(doc.text), label=entity_type)
                if span:
                    filtered_ents.append(span)
        else:
            # Filter existing entities
            for ent in doc.ents:
                if ent.label_ in ["PERSON", "ORG"]:
                    filtered_ents.append(ent)
        
        # Overwrite doc.ents with our filtered list
        doc.ents = filtered_ents
        
        # Subclassify organizations
        for ent in doc.ents:
            if ent.label_ == "ORG":
                subtype = self._classify_organization(ent.text)
                doc._.org_subtypes[ent.text] = subtype
        
        return doc
    
    def _guess_entity_type(self, text):
        """Guess if an entity is a PERSON or ORG when spaCy NER doesn't detect it"""
        text = text.lower()
        
        # Check for known entities
        for entity, subtype in self.known_entities.items():
            if entity in text:
                return "ORG"  # All our known entities are organizations
        
        # Check for organization keywords
        if any(keyword in text for keyword in self.university_keywords + self.company_keywords):
            return "ORG"
        
        # Check for person-like patterns (1-3 words, no special chars except ' and -)
        words = text.split()
        if (1 <= len(words) <= 3 and 
            all(word.isalpha() or "'" in word or "-" in word for word in words)):
            return "PERSON"
        
        # Default to ORG 
        return "ORG"
    
    def _classify_organization(self, text):
        """Subclassify an organization as UNIVERSITY or COMPANY"""
        text = text.lower()
        
        # Check known entities first
        for entity, subtype in self.known_entities.items():
            if entity in text:
                return subtype
        
        # Check for university indicators
        if any(keyword in text for keyword in self.university_keywords) or " of " in text:
            return "UNIVERSITY"
        
        # Default to COMPANY
        return "COMPANY"


In [1]:
import spacy
from spacy.tokens import Doc
from spacy.language import Language
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd


In [3]:
nlp = spacy.load("models/entity_classifier")


In [4]:
text = "Apple Inc"

In [5]:
doc = nlp(text)

In [12]:
if doc.ents:
    pred_main = doc.ents[0].label_

    # Get subtype for organizations
    if pred_main == "ORG":
        pred_sub = doc._.org_subtypes[text]


In [13]:
doc.ents[0].label_

'ORG'

In [15]:
pred_sub

'COMPANY'
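
To report results in the task's own terms, the two-level output (main label plus ORG subtype) can be collapsed into `Person`, `Company`, or `University`. This is a minimal sketch: `to_task_label` and its "Unknown" fallback are hypothetical and not part of the saved pipeline.

```python
def to_task_label(main_label, org_subtype=None):
    """Map the pipeline's (main label, ORG subtype) pair to Person/Company/University."""
    if main_label == "PERSON":
        return "Person"
    if main_label == "ORG":
        # Mirror the classifier's own default: anything not a university is a company.
        return "University" if org_subtype == "UNIVERSITY" else "Company"
    return "Unknown"  # hypothetical fallback when no PERSON/ORG entity was found

# Values from the "Apple Inc" example above: label 'ORG', subtype 'COMPANY'.
print(to_task_label("ORG", "COMPANY"))  # Company
print(to_task_label("PERSON"))          # Person
```

In the app, `main_label` would come from `doc.ents[0].label_` and `org_subtype` from `doc._.org_subtypes`, as in the cells above.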