# Manual spell correction of the prompts

This notebook documents the manual spell-correction steps applied to the prompts table, so that the cleaning can be replicated and every change can be traced back. Each rule is documented with its purpose and implementation.
The source table (`translated_prompts`) remains unchanged; all output goes to `manually_corrected_prompts`.

In [5]:
import sqlite3
import pandas as pd

conn = sqlite3.connect("../../../giicg.db")
prompts = pd.read_sql("SELECT * FROM translated_prompts", conn)
prompts

Unnamed: 0,message_id,conversation_id,role,message_text,conversational,code,other,gender,user_id,language
0,1,1,user,"parsing data from python iterator, how it coul...","parsing data from python iterator, how it coul...",,,Man (cisgender),6,en
1,730,32,user,Write python function to do operations with in...,Write python function to do operations with in...,,report_dt\tsource\tmetric_name\tmetric_num\tme...,Man (cisgender),6,en
2,1133,55,user,Write shortest tutorial on creating RAG on ema...,Write shortest tutorial on creating RAG on ema...,,,Man (cisgender),6,en
3,1135,55,user,what is FAISS,what is FAISS,,,Man (cisgender),6,en
4,1137,55,user,Transform given code to process large .mbox file,Transform given code to process large .mbox file,,Transform given code to process large .mbox file,Man (cisgender),6,en
...,...,...,...,...,...,...,...,...,...,...
755,724,31,user,import pandas as pd\nimport numpy as np\nfrom ...,Please replace my retrieval pipeline here with...,import pandas as pd\nimport numpy as np\nfrom ...,You are tasked with separating user prompts in...,Man (cisgender),92,en
756,726,31,user,"please update my code accordingly, no comments...","please update my code accordingly, no comments...",,,Man (cisgender),92,en
757,1131,54,user,import pandas as pd\nimport numpy as np\nfrom ...,"I want to tune optimal thresholds. Currently, ...",import pandas as pd\nimport numpy as np\nfrom ...,The narratives list looks like this:\nnarrativ...,Man (cisgender),92,en
758,1532,71,user,"from transformers import AutoTokenizer, AutoMo...",I want to use an LLM for listwise reranking in...,"from transformers import AutoTokenizer, AutoMo...",,Man (cisgender),92,en


In [6]:
import re


# Correct Spelling

def fix_spelling_errors(df):
    """Fix common spelling errors in conversational content"""
    corrections = {
        'orthogogal': 'orthogonal',
        'Anothers' : 'Another',
        'follwoing' : 'following',
        'insted' : 'instead',
        'thats' : 'that\'s',
        'addtion' : 'addition',
        'develope' : 'develop',
        'familys' : 'families',
        'responsable' : 'responsible',
        'optimate' : 'optimize',
        'esier' : 'easier',
        'sentencs': 'sentences',
        'preveent' : 'prevent',
        'indexs': 'indices',
        'hundrets': 'hundreds',
        'orthotogonal' : 'orthogonal',
        'palatte' : 'palette',
        'favoriting' : 'favoring',
        'orignal' : 'original',
        'resemblens' : 'resemblance',
        'appoinment' : 'appointment',
        'Canvaslet\'s' : 'let\'s',
        'ealier' : 'earlier',
        'Rectlange' : 'Rectangle',
        'gesorächslauf' : 'chat history',
        'evry': 'every',
        'inlcude': 'include',
        'wrtie' : 'write',
        'tring' : 'trying',
        'im' : 'i\'m',
        'exaple' : 'example',
        'modalitieees': 'modalities',
        'doenst' : 'doesn\'t',
        'actualy': 'actually',
        'paremeter': 'parameter',
        'experiement': 'experiment',
        'concatination' : 'concatenation',
        'impliment' : 'implement',
        'impilmentation' : 'implementation',
        'nessacary' : 'necessary',
        'caclulated' : 'calculated',
        'matrixes' : 'matrices',
        'writie' : 'write',
        'follws' : 'follows',
        'inlcuded' : 'included',
        'origional' : 'original',
        'thta' : 'that',
        'heres' : 'here\'s',
        'laike' : 'like',
        'wehre' : 'where',
        'reuslts':'results',
        'intersetd' : 'interested',
        'concatition' : 'concatenation',
        'paramters' : 'parameters',
        'implyin' : 'implying',
        'guassion' : 'gaussian',
        'thigns' : 'things',
        'dockerfule' : 'dockerfile',
        'snipped'  : 'snippet',
        'viseos' : 'videos',
        'phne' : 'without',
        'Asked ChatGPTi' : 'i',
        'Asked ChatGPTplease' : 'please',
        'deday' : 'decay',
        'seperate' : 'separate',
        'apporach' : 'approach',
        'indiviually' : 'individually',
        'allegeded' : 'alleged',
        'Asked ChatGPTwhy' : 'why',
        'testset' : 'test set',
        'classificaiton' : 'classification',
        'optimice' : 'optimize',
        'adaptbefore' : 'adapt before',
        'onlz' : 'only',
        'sepearte' : 'separate',
        'informaiton' : 'information',
        'validaiton' : 'validation',
        'overfitiing' : 'overfitting',
        'eahc' : 'each',
        'directorys' : 'directories',
        'direcrtorys' : 'directories',
        'manipultaet' : 'manipulated',
        'aere' : 'are',
        'withoud' : 'without',
        'thorugh' : 'through',
        'please provide example hot to use' : 'please provide example how to use',
        'rewertie' : 'rewrite',
        'Uploaded an imagei' : 'i',
        'inforamtion' : 'information',
        'visualiysing' : 'visualizing',
        'achive' : 'achieve',
        'sohuellete' : 'silhouette',
        'autimatically' : 'automatically',
        'entrys' : 'entries',
        'visualisaitons' : 'visualizations',
        'seachr' : 'search',
        'searhc' : 'search',
        'shoulndt' : 'shouldn\'t',
        'llookng' : 'looking',
        'clutering' : 'clustering',
        'restuks' : 'results',
        'mehtod' : 'method',
        'delimimetr' : 'delimiter',
        'strucuture' : 'structure',
        'neccessary' : 'necessary',
        'extractzion' : 'extraction',
        'suddently' : 'suddenly',
        'miminal' : 'minimal',
        'isnt' : 'isn\'t',
        'anomalious' : 'anomalous',
        'staistical' : 'statistical',
        'vlaues' : 'values',
        'calsses' : 'classes',
        'ommitted' : 'omitted',
        'approcheas' : 'approaches',
        'inthe' : 'in the',
        'sceintificly' : 'scientifically',
        'Uploaded to filethis' : 'this',
        'soonn' : 'soon',
        'dissappear' : 'disappear',
        'tryping' : 'trying',
        'undefinded' : 'undefined',
        'lenght' : 'length',
        'overritten' : 'overwritten',
        'exisiting' : 'existing',
        'caluclate' : 'calculate',
        'optimise' : 'optimize',
        'Fehlermeldungen' : 'error messages',
        'HEader' : 'Header',
        'nich tunterstrichen' : 'no underscore',
        'carussel': 'carousel',
        'adress' : 'address',
        'blancs' : 'blanks',
        'doesnt' : 'doesn\'t',
        'didnt' : 'didn\'t',
        'lets' : 'let\'s',
        'cant' : 'can\'t',
        'formated' : 'formatted',

    }

    total_fixes = 0
    for incorrect, correct in corrections.items():
        # Create regex pattern for exact word match (case-insensitive)
        pattern = r'\b' + re.escape(incorrect) + r'\b'

        # Check if pattern exists in any row
        mask = df['conversational'].str.contains(pattern, case=False, na=False, regex=True)

        if mask.any():
            # Replace using regex with word boundaries
            df['conversational'] = df['conversational'].str.replace(
                pattern, correct, case=False, regex=True
            )
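            # Counts matching rows per rule, so a row with several
            # different misspellings is counted once for each of them
            total_fixes += mask.sum()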


    print(f"Fixed spelling in {total_fixes} rows")
    return df
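
The word-boundary anchors matter for short keys such as `im`: without `\b`, the pattern would also rewrite the middle of longer words such as "imprint" or "optimal". The following is a minimal sketch on toy data that uses the same pattern construction as `fix_spelling_errors`. The sample strings and the helper name `apply_corrections` are illustrative assumptions, not part of the notebook.

```python
import re

import pandas as pd


def apply_corrections(series, corrections):
    """Hypothetical helper: apply whole-word, case-insensitive replacements."""
    for incorrect, correct in corrections.items():
        # Same pattern construction as in fix_spelling_errors
        pattern = r"\b" + re.escape(incorrect) + r"\b"
        series = series.str.replace(pattern, correct, case=False, regex=True)
    return series


# Toy data, not taken from the prompts table
sample = pd.Series(["im tring to find optimal values", "the imprint"])
fixed = apply_corrections(sample, {"im": "i'm", "tring": "trying"})
print(fixed.tolist())
```

Only the standalone "im" and "tring" should be rewritten here; "optimal" and "imprint" contain `im` but not as a whole word, so they should stay untouched.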

In [7]:
def apply_all_rules(df):
    """Apply all cleaning rules in sequence"""
    print(f"Starting with {len(df)} rows")
    print("-" * 40)

    df = fix_spelling_errors(df)

    print("-" * 40)
    print(f"Final: {len(df)} rows")
    return df

# Apply rules
corrected_prompts = apply_all_rules(prompts.copy())
corrected_prompts

Starting with 760 rows
----------------------------------------
Fixed spelling in 178 rows
----------------------------------------
Final: 760 rows


Unnamed: 0,message_id,conversation_id,role,message_text,conversational,code,other,gender,user_id,language
0,1,1,user,"parsing data from python iterator, how it coul...","parsing data from python iterator, how it coul...",,,Man (cisgender),6,en
1,730,32,user,Write python function to do operations with in...,Write python function to do operations with in...,,report_dt\tsource\tmetric_name\tmetric_num\tme...,Man (cisgender),6,en
2,1133,55,user,Write shortest tutorial on creating RAG on ema...,Write shortest tutorial on creating RAG on ema...,,,Man (cisgender),6,en
3,1135,55,user,what is FAISS,what is FAISS,,,Man (cisgender),6,en
4,1137,55,user,Transform given code to process large .mbox file,Transform given code to process large .mbox file,,Transform given code to process large .mbox file,Man (cisgender),6,en
...,...,...,...,...,...,...,...,...,...,...
755,724,31,user,import pandas as pd\nimport numpy as np\nfrom ...,Please replace my retrieval pipeline here with...,import pandas as pd\nimport numpy as np\nfrom ...,You are tasked with separating user prompts in...,Man (cisgender),92,en
756,726,31,user,"please update my code accordingly, no comments...","please update my code accordingly, no comments...",,,Man (cisgender),92,en
757,1131,54,user,import pandas as pd\nimport numpy as np\nfrom ...,"I want to tune optimal thresholds. Currently, ...",import pandas as pd\nimport numpy as np\nfrom ...,The narratives list looks like this:\nnarrativ...,Man (cisgender),92,en
758,1532,71,user,"from transformers import AutoTokenizer, AutoMo...",I want to use an LLM for listwise reranking in...,"from transformers import AutoTokenizer, AutoMo...",,Man (cisgender),92,en


In [6]:

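# Cell-level view of what changed (original as "self", corrected as "other")
differences = prompts.compare(corrected_prompts)
# Get rows that have any differences. eq() treats NaN as unequal to NaN,
# so cells that are missing in both frames are counted as unchanged.
unchanged = prompts.eq(corrected_prompts) | (prompts.isna() & corrected_prompts.isna())
altered_rows_mask = ~unchanged.all(axis=1)
altered_df = corrected_prompts[altered_rows_mask]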

altered_df

Unnamed: 0,message_id,conversation_id,role,message_text,conversational,code,other,gender,user_id,language
1,730,32,user,Write python function to do operations with in...,Write python function to do operations with in...,,,Man (cisgender),6,en
18,1648,83,user,my code randomly selects 5 sentences and I wan...,my code randomly selects 5 sentences and I wan...,import json\nimport pandas as pd\nimport rando...,,Woman (cisgender),11,en
23,5,6,user,I want to use Dummy Hot encoding to replace th...,I want to use Dummy Hot encoding to replace th...,,,Woman (cisgender),16,en
26,11,6,user,Ok back to the Embarked column. I have realise...,Ok back to the Embarked column. I have realise...,,,Woman (cisgender),16,en
27,13,6,user,No thats not what I mean. I mean there is a dr...,No that's not what I mean. I mean there is a d...,,,Woman (cisgender),16,en
...,...,...,...,...,...,...,...,...,...,...
740,1097,53,user,ok nice. can i let the color switch already on...,ok nice. can i let the color switch already on...,,,Man (cisgender),91,en
743,1103,53,user,"ill send you some links, please turn all links...","ill send you some links, please turn all links...",,,Man (cisgender),91,en
745,1109,53,user,in deutschland: was muss im impressum auf eine...,in Germany: what does the i'mprint have to say...,,,Man (cisgender),91,de
758,1131,54,user,import pandas as pd\nimport numpy as np\nfrom ...,"I want to tune opti'mal thresholds. Currently,...",import pandas as pd\nimport numpy as np\nfrom ...,The narratives list looks like this:\nnarrativ...,Man (cisgender),92,en
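
To trace individual changes, it can help to view the original and corrected text side by side rather than the corrected rows alone. The following is a minimal sketch, assuming two frames that share the same index and a `conversational` column. The toy frames and the helper name `change_log` are illustrative assumptions.

```python
import pandas as pd


def change_log(original, corrected, column="conversational"):
    """Hypothetical helper: before/after pairs for rows whose text changed."""
    before = original[column]
    after = corrected[column]
    # Treat two missing values as equal so empty rows are not reported as changed
    changed = ~(before.eq(after) | (before.isna() & after.isna()))
    return pd.DataFrame({"before": before[changed], "after": after[changed]})


# Toy data, not taken from the prompts table
original = pd.DataFrame({"conversational": ["No thats not it", "what is FAISS", None]})
corrected = pd.DataFrame({"conversational": ["No that's not it", "what is FAISS", None]})
log = change_log(original, corrected)
print(log)
```

Applied to the notebook's frames, the call would be `change_log(prompts, corrected_prompts)`, which restricts the comparison to the only column the spelling rules modify.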


In [8]:
corrected_prompts.to_sql('manually_corrected_prompts', conn, if_exists='replace', index=False)

760
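
The value shown above is what `to_sql` returned; in recent pandas versions this is the number of rows written. A read-back check can confirm that the table round-trips. The following is a minimal sketch that uses an in-memory SQLite database and a toy frame as stand-ins (both are assumptions; the notebook itself writes to `giicg.db`).

```python
import sqlite3

import pandas as pd

# Toy data and an in-memory database stand in for the real table and file
toy = pd.DataFrame(
    {"message_id": [1, 2], "conversational": ["that's fine", "what is FAISS"]}
)
conn = sqlite3.connect(":memory:")
toy.to_sql("manually_corrected_prompts", conn, if_exists="replace", index=False)

# Read the table back and compare it record by record with what was written
round_trip = pd.read_sql("SELECT * FROM manually_corrected_prompts", conn)
conn.close()
matches = round_trip.to_dict("records") == toy.to_dict("records")
print(matches)
```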