# Manual spelling correction of the prompts

This notebook documents the manual spelling corrections applied to the prompts table, so that the cleaning steps can be replicated and every change can be traced back. Each rule is documented with its purpose and implementation.
The original prompts table remains unchanged: all output goes to `manually_corrected_prompts`.

In [1]:
import sqlite3
import pandas as pd

conn = sqlite3.connect("../../../giicg.db")
prompts = pd.read_sql("SELECT * FROM translated_prompts", conn)
prompts

Unnamed: 0,message_id,conversation_id,role,message_text,conversational,code,other,gender,user_id,language
0,1,1,user,"parsing data from python iterator, how it coul...","parsing data from python iterator, how it coul...",,,Man (cisgender),6,en
1,730,32,user,Write python function to do operations with in...,Write python function to do operations with in...,,report_dt\tsource\tmetric_name\tmetric_num\tme...,Man (cisgender),6,en
2,1133,55,user,Write shortest tutorial on creating RAG on ema...,Write shortest tutorial on creating RAG on ema...,,,Man (cisgender),6,en
3,1135,55,user,what is FAISS,what is FAISS,,,Man (cisgender),6,en
4,1137,55,user,Transform given code to process large .mbox file,Transform given code to process large .mbox file,,Transform given code to process large .mbox file,Man (cisgender),6,en
...,...,...,...,...,...,...,...,...,...,...
750,1646,82,user,"def run_query(query, n_results):\n query_em...",this is my code. I want to: Get nodes and edge...,"def run_query(query, n_results):\n query_em...",,Man (cisgender),92,en
751,1845,37,user,\n nun möchte ich judgement balancing m...,Now I want to bring judgement balancing into t...,,,Woman (cisgender),29,de
752,1847,37,user,\n ich sehe keine veränderung im Plot. Was ...,I don't see any change in the plot.,,,Woman (cisgender),29,de
753,1849,2,user,\n I am working on the problem of reconstru...,\n I am working on the problem of reconstru...,,Classic CV - Drone navigation\nIf you ever tho...,Man (cisgender),8,en


In [2]:
import re


# Correct Spelling

def fix_spelling_errors(df):
    """Fix common spelling errors in conversational content"""
    corrections = {
        'orthogogal': 'orthogonal',
        'Anothers' : 'Another',
        'follwoing' : 'following',
        'insted' : 'instead',
        'thats' : 'that\'s',
        'addtion' : 'addition',
        'develope' : 'develop',
        'familys' : 'families',
        'responsable' : 'responsible',
        'optimate' : 'optimize',
        'esier' : 'easier',
        'sentencs': 'sentences',
        'preveent' : 'prevent',
        'indexs': 'indices',
        'hundrets': 'hundreds',
        'orthotogonal' : 'orthogonal',
        'palatte' : 'palette',
        'favoriting' : 'favoring',
        'orignal' : 'original',
        'resemblens' : 'resemblance',
        'appoinment' : 'appointment',
        'Canvaslet\'s' : 'let\'s',
        'ealier' : 'earlier',
        'Rectlange' : 'Rectangle',
        'gesorächslauf' : 'chat history',
        'evry': 'every',
        'inlcude': 'include',
        'wrtie' : 'write',
        'tring' : 'trying',
        'im' : 'i\'m',
        'exaple' : 'example',
        'modalitieees': 'modalities',
        'doenst' : 'doesn\'t',
        'actualy': 'actually',
        'paremeter': 'parameter',
        'experiement': 'experiment',
        'concatination' : 'concatenation',
        'impliment' : 'implement',
        'impilmentation' : 'implementation',
        'nessacary' : 'necessary',
        'caclulated' : 'calculated',
        'matrixes' : 'matrices',
        'writie' : 'write',
        'follws' : 'follows',
        'inlcuded' : 'included',
        'origional' : 'original',
        'thta' : 'that',
        'heres' : 'here\'s',
        'laike' : 'like',
        'wehre' : 'where',
        'reuslts':'results',
        'intersetd' : 'interested',
        'concatition' : 'concatenation',
        'paramters' : 'parameters',
        'implyin' : 'implying',
        'guassion' : 'gaussian',
        'thigns' : 'things',
        'dockerfule' : 'dockerfile',
        'snipped'  : 'snippet',
        'viseos' : 'videos',
        'phne' : 'without',
        'Asked ChatGPTi' : 'i',
        'Asked ChatGPTplease' : 'please',
        'deday' : 'decay',
        'seperate' : 'separate',
        'apporach' : 'approach',
        'indiviually' : 'individually',
        'allegeded' : 'alleged',
        'Asked ChatGPTwhy' : 'why',
        'testset' : 'test set',
        'classificaiton' : 'classification',
        'optimice' : 'optimize',
        'adaptbefore' : 'adapt before',
        'onlz' : 'only',
        'sepearte' : 'separate',
        'informaiton' : 'information',
        'validaiton' : 'validation',
        'overfitiing' : 'overfitting',
        'eahc' : 'each',
        'directorys' : 'directories',
        'direcrtorys' : 'directories',
        'manipultaet' : 'manipulated',
        'aere' : 'are',
        'withoud' : 'without',
        'thorugh' : 'through',
        'please provide example hot to use' : 'please provide example how to use',
        'rewertie' : 'rewrite',
        'Uploaded an imagei' : 'i',
        'inforamtion' : 'information',
        'visualiysing' : 'visualizing',
        'achive' : 'achieve',
        'sohuellete' : 'silhouette',
        'autimatically' : 'automatically',
        'entrys' : 'entries',
        'visualisaitons' : 'visualizations',
        'seachr' : 'search',
        'searhc' : 'search',
        'shoulndt' : 'shouldn\'t',
        'llookng' : 'looking',
        'clutering' : 'clustering',
        'restuks' : 'results',
        'mehtod' : 'method',
        'delimimetr' : 'delimiter',
        'strucuture' : 'structure',
        'neccessary' : 'necessary',
        'extractzion' : 'extraction',
        'suddently' : 'suddenly',
        'miminal' : 'minimal',
        'isnt' : 'isn\'t',
        'anomalious' : 'anomalous',
        'staistical' : 'statistical',
        'vlaues' : 'values',
        'calsses' : 'classes',
        'ommitted' : 'omitted',
        'approcheas' : 'approaches',
        'inthe' : 'in the',
        'sceintificly' : 'scientifically',
        'Uploaded to filethis' : 'this',
        'soonn' : 'soon',
        'dissappear' : 'disappear',
        'tryping' : 'trying',
        'undefinded' : 'undefined',
        'lenght' : 'length',
        'overritten' : 'overwritten',
        'exisiting' : 'existing',
        'caluclate' : 'calculate',
        'optimise' : 'optimize',
        'Fehlermeldungen' : 'error messages',
        'HEader' : 'Header',
        'nich tunterstrichen' : 'no underscore',
        'carussel': 'carousel',
        'adress' : 'address',
        'blancs' : 'blanks',
        'doesnt' : 'doesn\'t',
        'didnt' : 'didn\'t',
        'lets' : 'let\'s',
        'cant' : 'can\'t',
        'formated' : 'formatted',
        'unmanipluated': 'unmanipulated',
        'additonal': 'additional',
        'pls': 'please',
        'plz': 'please'

    }

    total_fixes = 0
    for incorrect, correct in corrections.items():
        # Whole-word pattern; matching is case-insensitive via case=False below
        pattern = r'\b' + re.escape(incorrect) + r'\b'

        # Rows whose conversational text contains the misspelling
        mask = df['conversational'].str.contains(pattern, case=False, na=False, regex=True)

        if mask.any():
            # Replace every whole-word occurrence; the replacement is inserted
            # as written, so the casing of the matched word is not preserved
            df['conversational'] = df['conversational'].str.replace(
                pattern, correct, case=False, regex=True
            )
            # Counts matching rows per rule, so a row hit by several rules
            # is counted once for each of them
            total_fixes += mask.sum()

    print(f"Fixed spelling in {total_fixes} rows")
    return df
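
The matching behaviour of the loop above can be checked on a toy example. The sketch below is illustrative only: the `toy` DataFrame and the three-entry `toy_corrections` dictionary are made up, not taken from the prompts table. It shows two properties worth keeping in mind: the case-insensitive match also rewrites a correctly spelled lowercase `header` to `Header`, and summing per-rule matches can exceed the number of distinct rows that changed.

```python
import re
import pandas as pd

# Toy data (hypothetical, not from the prompts table)
toy = pd.DataFrame({"conversational": [
    "No thats not what I mean",
    "plot the header and the HEader",
    "thats fine, pls use thats",
    None,
]})

# Small hypothetical subset of the corrections dictionary
toy_corrections = {"thats": "that's", "HEader": "Header", "pls": "please"}

per_rule_total = 0                               # what the loop above accumulates
changed = pd.Series(False, index=toy.index)      # union of rows touched by any rule

for incorrect, correct in toy_corrections.items():
    pattern = r"\b" + re.escape(incorrect) + r"\b"
    mask = toy["conversational"].str.contains(pattern, case=False, na=False, regex=True)
    toy["conversational"] = toy["conversational"].str.replace(
        pattern, correct, case=False, regex=True
    )
    per_rule_total += int(mask.sum())
    changed |= mask

rows_changed = int(changed.sum())
print(toy["conversational"].tolist())
print(f"per-rule matches: {per_rule_total}, distinct rows changed: {rows_changed}")
```

If a count of distinct rows is wanted, tracking a union mask as in `changed` is one option; whether that is preferable to the per-rule total depends on what the log line is meant to report.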

In [3]:
def apply_all_rules(df):
    """Apply all cleaning rules in sequence"""
    print(f"Starting with {len(df)} rows")
    print("-" * 40)

    df = fix_spelling_errors(df)

    print("-" * 40)
    print(f"Final: {len(df)} rows")
    return df

# Apply rules
corrected_prompts = apply_all_rules(prompts.copy())
corrected_prompts

Starting with 755 rows
----------------------------------------
Fixed spelling in 178 rows
----------------------------------------
Final: 755 rows


Unnamed: 0,message_id,conversation_id,role,message_text,conversational,code,other,gender,user_id,language
0,1,1,user,"parsing data from python iterator, how it coul...","parsing data from python iterator, how it coul...",,,Man (cisgender),6,en
1,730,32,user,Write python function to do operations with in...,Write python function to do operations with in...,,report_dt\tsource\tmetric_name\tmetric_num\tme...,Man (cisgender),6,en
2,1133,55,user,Write shortest tutorial on creating RAG on ema...,Write shortest tutorial on creating RAG on ema...,,,Man (cisgender),6,en
3,1135,55,user,what is FAISS,what is FAISS,,,Man (cisgender),6,en
4,1137,55,user,Transform given code to process large .mbox file,Transform given code to process large .mbox file,,Transform given code to process large .mbox file,Man (cisgender),6,en
...,...,...,...,...,...,...,...,...,...,...
750,1646,82,user,"def run_query(query, n_results):\n query_em...",this is my code. I want to: Get nodes and edge...,"def run_query(query, n_results):\n query_em...",,Man (cisgender),92,en
751,1845,37,user,\n nun möchte ich judgement balancing m...,Now I want to bring judgement balancing into t...,,,Woman (cisgender),29,de
752,1847,37,user,\n ich sehe keine veränderung im Plot. Was ...,I don't see any change in the plot.,,,Woman (cisgender),29,de
753,1849,2,user,\n I am working on the problem of reconstru...,\n I am working on the problem of reconstru...,,Classic CV - Drone navigation\nIf you ever tho...,Man (cisgender),8,en


In [4]:

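# Cell-level differences between the original and the corrected table
differences = prompts.compare(corrected_prompts)
# Get rows that have any differences (cells that do not compare equal)
altered_rows_mask = ~prompts.eq(corrected_prompts).all(axis=1)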
altered_df = corrected_prompts[altered_rows_mask]

altered_df

Unnamed: 0,message_id,conversation_id,role,message_text,conversational,code,other,gender,user_id,language
1,730,32,user,Write python function to do operations with in...,Write python function to do operations with in...,,report_dt\tsource\tmetric_name\tmetric_num\tme...,Man (cisgender),6,en
14,1151,56,user,I tried to do the same to the monthly views bu...,I tried to do the same to the monthly views bu...,,,Woman (cisgender),11,en
23,5,6,user,I want to use Dummy Hot encoding to replace th...,I want to use Dummy Hot encoding to replace th...,,,Woman (cisgender),16,en
26,11,6,user,Ok back to the Embarked column. I have realise...,Ok back to the Embarked column. I have realise...,,,Woman (cisgender),16,en
27,13,6,user,No thats not what I mean. I mean there is a dr...,No that's not what I mean. I mean there is a d...,,,Woman (cisgender),16,en
...,...,...,...,...,...,...,...,...,...,...
726,1089,53,user,the approach is right. but i left the general ...,the approach is right. but i left the general ...,,,Man (cisgender),91,en
727,1091,53,user,how do i declare attributes to a inside of hea...,how do i declare attributes to a inside of Hea...,,,Man (cisgender),91,en
729,1095,53,user,"for the header links, i'd like to introduce an...","for the Header links, i'd like to introduce an...",,,Man (cisgender),91,en
730,1097,53,user,ok nice. can i let the color switch already on...,ok nice. can i let the color switch already on...,,,Man (cisgender),91,en
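
Because only the `conversational` column is modified, the set of altered rows can also be derived from that column alone. The sketch below uses made-up `before` and `after` frames and treats two missing values as equal. This matters because missing values do not compare equal to themselves, so an element-wise equality check across all columns may flag rows that merely contain empty cells; whether that affects the table above depends on how the empty cells were loaded.

```python
import pandas as pd

# Hypothetical before/after frames standing in for prompts and corrected_prompts
before = pd.DataFrame({
    "conversational": ["thats it", "fine", None],
    "code": [None, "x = 1", None],
})
after = before.copy()
after.loc[0, "conversational"] = "that's it"

old = before["conversational"]
new = after["conversational"]

# A cell counts as changed only if the values differ and are not both missing
changed_mask = (old != new) & ~(old.isna() & new.isna())
changed_rows = after[changed_mask]
print(changed_rows)
```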


In [5]:
corrected_prompts.to_sql('manually_corrected_prompts', conn, if_exists='replace', index=False)

755
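
A quick way to confirm that the write leaves the source table untouched is to read both tables back. The sketch below runs against an in-memory database with two made-up rows rather than `giicg.db`; the table names mirror the ones used in this notebook, everything else is an assumption for illustration.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the real database (hypothetical data)
demo_conn = sqlite3.connect(":memory:")
source = pd.DataFrame({"message_id": [1, 2], "conversational": ["thats it", "ok"]})
source.to_sql("translated_prompts", demo_conn, index=False)

# Correct a copy and write it to a separate table
corrected = source.copy()
corrected["conversational"] = corrected["conversational"].str.replace(
    r"\bthats\b", "that's", regex=True
)
corrected.to_sql("manually_corrected_prompts", demo_conn, if_exists="replace", index=False)

# Read both tables back: the source keeps the misspelling, the copy does not
source_after = pd.read_sql("SELECT * FROM translated_prompts", demo_conn)
written = pd.read_sql("SELECT * FROM manually_corrected_prompts", demo_conn)
demo_conn.close()

print(source_after["conversational"].tolist())
print(written["conversational"].tolist())
```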