## Cleaning messy data
https://oscarbaruffa.com/messy/

## IMPORT LIBRARIES

In [1]:
!pip install fuzzywuzzy
!pip install python-Levenshtein
!pip install pycountry
!pip install emoji

Defaulting to user installation because normal site-packages is not writeable
Looking in links: /usr/share/pip-wheels


In [2]:
import pandas as pd
import numpy as np
from fuzzywuzzy import process
import pycountry
import re
import emoji

## READING AND ACCESSING DATA

In [3]:
# Google Sheets CSV link
url = "https://docs.google.com/spreadsheets/d/1IPS5dBSGtwYVbjsfbaMCYIWnOuRmJcbequohNxCyGVw/export?format=csv&gid=1625408792"

# Load into DataFrame
df = pd.read_csv(url)

df.head()

Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


## FIX HEADERS (IF NEEDED)

In [4]:
# Simplify column names
df = df.rename(columns={
    "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)": "annual_salary"
})

In [5]:
df = df.rename(columns={
    "How old are you?": "age"
})

In [6]:
df = df.rename(columns={
    "How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.": "bonus"
})

In [7]:
df = df.rename(columns={
    "Please indicate the currency": "currency"
})

In [8]:
df = df.rename(columns={
    "What country do you work in?": "country"
})

In [9]:
df = df.rename(columns={
    "How many years of professional work experience do you have overall?": "years_working"
})

In [10]:
df = df.rename(columns={
    "What is your highest level of education completed?": "education"
})

In [11]:
df = df.rename(columns={
    "What is your gender?": "gender"
})

In [12]:
df = df.rename(columns={
    "What is your race? (Choose all that apply.)": "race"
})

In [13]:
df = df.rename(columns={
    "How many years of professional work experience do you have in your field?": "years_in_your_field"
})
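
The separate rename cells above could also be written as one call with a single mapping. The following is only a sketch: `sample` is a small stand-in frame and `column_map` is a hypothetical name; the real mapping would use the full survey headers.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the survey data.
sample = pd.DataFrame({
    "How old are you?": ["25-34"],
    "What is your gender?": ["Woman"],
})

# One mapping from original header to short name; extend it as needed.
column_map = {
    "How old are you?": "age",
    "What is your gender?": "gender",
}

# rename() ignores keys that are absent, so a typo in a long header
# fails silently; comparing the columns afterwards catches that.
renamed = sample.rename(columns=column_map)
print(list(renamed.columns))  # ['age', 'gender']
```

Keeping the mapping in one dict makes it easier to review every rename at once.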

In [14]:
df.head()

Unnamed: 0,Timestamp,age,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:",annual_salary,bonus,currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",country,"If you're in the U.S., what state do you work in?",What city do you work in?,years_working,years_in_your_field,education,gender,race
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


### Select the columns to keep for analysis

In [15]:
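# .copy() makes df an independent frame, so later column assignments do not trigger SettingWithCopyWarning
df = df[['Timestamp', 'age', 'annual_salary', 'bonus', 'currency', 'country', 'years_working', 'years_in_your_field', 'education', 'gender', 'race']].copy()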

In [16]:
df.head()

Unnamed: 0,Timestamp,age,annual_salary,bonus,currency,country,years_working,years_in_your_field,education,gender,race
0,4/27/2021 11:02:10,25-34,55000,0.0,USD,United States,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,54600,4000.0,GBP,United Kingdom,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,34000,,USD,US,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,62000,3000.0,USD,USA,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,60000,7000.0,USD,US,8 - 10 years,5-7 years,College degree,Woman,White


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28149 entries, 0 to 28148
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Timestamp            28149 non-null  object 
 1   age                  28149 non-null  object 
 2   annual_salary        28149 non-null  object 
 3   bonus                20813 non-null  float64
 4   currency             28149 non-null  object 
 5   country              28149 non-null  object 
 6   years_working        28149 non-null  object 
 7   years_in_your_field  28149 non-null  object 
 8   education            27921 non-null  object 
 9   gender               27975 non-null  object 
 10  race                 27966 non-null  object 
dtypes: float64(1), object(10)
memory usage: 2.4+ MB


### Cleaning and converting data types

In [18]:
# Extract year from Timestamp
df['Timestamp'] = pd.to_datetime(df['Timestamp'], errors='coerce')
df['year'] = df['Timestamp'].dt.year.astype('Int64')
df = df.drop(columns=['Timestamp'])

# Move 'year' column to the first position
cols = ['year'] + [col for col in df.columns if col != 'year']
df = df[cols]
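
Because `errors='coerce'` turns unparseable timestamps into `NaT` without raising, it can be worth counting how many rows failed. This sketch assumes the format `%m/%d/%Y %H:%M:%S`, inferred only from the five rows displayed earlier; `stamps` is made-up sample data.

```python
import pandas as pd

# Hypothetical sample: two valid timestamps and one bad entry.
stamps = pd.Series(["4/27/2021 11:02:10", "4/27/2021 11:02:22", "not a date"])

# An explicit format avoids ambiguous day/month guessing (assumed format).
parsed = pd.to_datetime(stamps, format="%m/%d/%Y %H:%M:%S", errors="coerce")

# Number of rows that could not be parsed and became NaT.
n_failed = int(parsed.isna().sum())
print(n_failed)  # 1

# Nullable Int64 keeps the missing year as <NA> instead of a float NaN.
years = parsed.dt.year.astype("Int64")
```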

In [19]:
df.head()

Unnamed: 0,year,age,annual_salary,bonus,currency,country,years_working,years_in_your_field,education,gender,race
0,2021,25-34,55000,0.0,USD,United States,5-7 years,5-7 years,Master's degree,Woman,White
1,2021,25-34,54600,4000.0,GBP,United Kingdom,8 - 10 years,5-7 years,College degree,Non-binary,White
2,2021,25-34,34000,,USD,US,2 - 4 years,2 - 4 years,College degree,Woman,White
3,2021,25-34,62000,3000.0,USD,USA,8 - 10 years,5-7 years,College degree,Woman,White
4,2021,25-34,60000,7000.0,USD,US,8 - 10 years,5-7 years,College degree,Woman,White


In [20]:
# Age — normalize dash spacing
df['age'] = df['age'].str.strip().str.replace(r'\s*-\s*', '-', regex=True)

In [21]:
# Remove commas and convert to integer
df['annual_salary'] = df['annual_salary'].str.replace(',', '').astype(int)

# Bonus → integer (handle NaN)
df['bonus'] = df['bonus'].fillna(0).astype(int)
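
`astype(int)` raises on the first value that is not a plain number. A more defensive variant, sketched here on made-up strings (`raw` is hypothetical, not taken from the survey), coerces bad entries to missing values so they can be inspected first.

```python
import pandas as pd

# Hypothetical salary strings, including one that is not a number.
raw = pd.Series(["55,000", "54600", "n/a", " 62,000 "])

# Remove thousands separators and padding, then coerce failures to NaN.
cleaned = pd.to_numeric(
    raw.str.replace(",", "", regex=False).str.strip(),
    errors="coerce",
)

# Original values that failed to parse, for manual inspection.
bad_rows = raw[cleaned.isna()]
print(bad_rows.tolist())  # ['n/a']

# Nullable integer dtype keeps missing values instead of raising.
salary = cleaned.astype("Int64")
```

Whether a missing bonus should become 0, as in the cell above, or stay missing is an analysis decision; the nullable `Int64` dtype supports either.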

In [22]:
df.head()

Unnamed: 0,year,age,annual_salary,bonus,currency,country,years_working,years_in_your_field,education,gender,race
0,2021,25-34,55000,0,USD,United States,5-7 years,5-7 years,Master's degree,Woman,White
1,2021,25-34,54600,4000,GBP,United Kingdom,8 - 10 years,5-7 years,College degree,Non-binary,White
2,2021,25-34,34000,0,USD,US,2 - 4 years,2 - 4 years,College degree,Woman,White
3,2021,25-34,62000,3000,USD,USA,8 - 10 years,5-7 years,College degree,Woman,White
4,2021,25-34,60000,7000,USD,US,8 - 10 years,5-7 years,College degree,Woman,White


In [23]:
# Years fields — remove "years" and fix dash spacing
for col in ['years_working', 'years_in_your_field']:
    df[col] = (
        df[col]
          .str.replace('years', '', case=False, regex=False)
          .str.strip()
          .str.replace(r'\s*-\s*', '-', regex=True)
    )
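
To check the normalisation on individual values, the same steps can be sketched with plain `re`. Two details differ from the cell above and are assumptions: the pattern `years?` also removes the singular "year", and a final step collapses the double space left behind. `normalize_range` and the sample strings are hypothetical.

```python
import re

def normalize_range(text: str) -> str:
    """Drop 'year(s)', tighten spacing around dashes, collapse extra spaces."""
    without_unit = re.sub(r"years?", "", text, flags=re.IGNORECASE)
    tight_dash = re.sub(r"\s*-\s*", "-", without_unit)
    return re.sub(r"\s+", " ", tight_dash).strip()

print(normalize_range("8 - 10 years"))      # 8-10
print(normalize_range("5-7 years"))         # 5-7
print(normalize_range("41 years or more"))  # 41 or more
```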

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28149 entries, 0 to 28148
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   year                 28149 non-null  Int64 
 1   age                  28149 non-null  object
 2   annual_salary        28149 non-null  int64 
 3   bonus                28149 non-null  int64 
 4   currency             28149 non-null  object
 5   country              28149 non-null  object
 6   years_working        28149 non-null  object
 7   years_in_your_field  28149 non-null  object
 8   education            27921 non-null  object
 9   gender               27975 non-null  object
 10  race                 27966 non-null  object
dtypes: Int64(1), int64(2), object(8)
memory usage: 2.4+ MB


In [25]:
df.head()

Unnamed: 0,year,age,annual_salary,bonus,currency,country,years_working,years_in_your_field,education,gender,race
0,2021,25-34,55000,0,USD,United States,5-7,5-7,Master's degree,Woman,White
1,2021,25-34,54600,4000,GBP,United Kingdom,8-10,5-7,College degree,Non-binary,White
2,2021,25-34,34000,0,USD,US,2-4,2-4,College degree,Woman,White
3,2021,25-34,62000,3000,USD,USA,8-10,5-7,College degree,Woman,White
4,2021,25-34,60000,7000,USD,US,8-10,5-7,College degree,Woman,White


In [26]:
# Count different values in the 'age' column
df['age'].value_counts()

age
25-34         12690
35-44          9910
45-54          3194
18-24          1251
55-64           995
65 or over       95
under 18         14
Name: count, dtype: int64

In [27]:
# Count values in the 'currency' column
df['currency'].value_counts()

currency
USD        23442
CAD         1675
GBP         1593
EUR          650
AUD/NZD      505
Other        167
CHF           37
SEK           37
JPY           23
ZAR           16
HKD            4
Name: count, dtype: int64

In [28]:
# Count values in the 'country' column
df['country'].value_counts()

country
United States     9015
USA               7951
US                2614
Canada            1573
United States      668
                  ... 
 New Zealand         1
Cuba                 1
Australi             1
Cote d'Ivoire        1
Thailand             1
Name: count, Length: 386, dtype: int64
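
Many of the 386 distinct values above differ only in case, whitespace, or a typo. One possible way to triage them before any manual work is sketched below. It uses the standard library's `difflib` in place of `fuzzywuzzy`, and the names `STANDARD`, `ALIASES`, `triage_country`, and the `0.85` cutoff are all assumptions chosen for illustration.

```python
import difflib

# Short illustrative lists; a fuller list of countries would replace these.
STANDARD = ["Canada", "Netherlands", "New Zealand", "United Kingdom", "United States"]
ALIASES = {"usa": "United States", "us": "United States", "uk": "United Kingdom"}

def triage_country(raw, cutoff=0.85):
    """Return (resolved_name, how); resolved_name is None when a person should decide."""
    # Collapse runs of whitespace, trim, and lowercase before comparing.
    key = " ".join(str(raw).split()).lower()
    by_lower = {name.lower(): name for name in STANDARD}

    if key in by_lower:
        return by_lower[key], "exact"
    if key in ALIASES:
        return ALIASES[key], "alias"

    # Accept a fuzzy match only above the cutoff; otherwise defer to review.
    close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
    if close:
        return by_lower[close[0]], "fuzzy"
    return None, "review"

print(triage_country("United States "))  # ('United States', 'exact')
print(triage_country("Canda"))           # ('Canada', 'fuzzy')
print(triage_country("NZ"))              # (None, 'review')
```

Short abbreviations such as "NZ" fall below the cutoff and are left for review instead of being matched to an unrelated country, so only the ambiguous remainder needs a manual decision.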

In [31]:
# Standard sovereign countries
standard_countries = [
    "Afghanistan","Argentina","Australia","Austria","Bangladesh","Belgium","Bosnia and Herzegovina",
    "Brazil","Bulgaria","Myanmar","Cambodia","Canada","Chile","China","Colombia","Congo","Costa Rica",
    "Côte d'Ivoire","Croatia","Cuba","Cyprus","Czech Republic","Denmark","Ecuador","Egypt","Eritrea",
    "Estonia","Finland","France","Germany","Ghana","Greece","Hungary","India","Indonesia","Ireland",
    "Israel","Italy","Jamaica","Japan","Jordan","Kenya","Kuwait","Latvia","Liechtenstein","Lithuania",
    "Luxembourg","Malaysia","Malta","Mexico","Morocco","Nigeria","Netherlands","New Zealand","Norway",
    "Pakistan","Panama","Philippines","Poland","Portugal","Qatar","Romania","Russia","Rwanda",
    "Saudi Arabia","Serbia","Sierra Leone","Singapore","Slovakia","Slovenia","Somalia","South Africa",
    "South Korea","Spain","Sri Lanka","Sweden","Switzerland","Taiwan","Tanzania","Thailand","Turkey",
    "Uganda","Ukraine","United Arab Emirates","United Kingdom","United States","Uruguay","Vietnam","Zimbabwe"
]

# Direct mapping for common messy variants (all lowercase keys)
direct_map = {
    "usa": "United States", "us": "United States", "u.s.": "United States",
    "u.s.a": "United States", "u.s.a.": "United States",
    "united states of america": "United States", "united states": "United States",
    "us of a": "United States", "america": "United States",

    "uk": "United Kingdom", "u.k.": "United Kingdom", "u.k": "United Kingdom",
    "england": "United Kingdom", "scotland": "United Kingdom",
    "wales": "United Kingdom", "northern ireland": "United Kingdom",

    "brasil": "Brazil",
    "czechia": "Czech Republic", "czech republic": "Czech Republic",
    "burma": "Myanmar",
    "france": "France",
    "new zealand aotearoa": "New Zealand",
    "new zealand": "New Zealand",
}


def clean_country_column(df, column, standard_countries, direct_map):
    cleaned = []
    standard_lower = {c.lower(): c for c in standard_countries}

    for value in df[column]:
        val_norm = str(value).strip().lower()

        # Case 1: already in standard list
        if val_norm in standard_lower:
            resolved = standard_lower[val_norm]

        # Case 2: direct mapping
        elif val_norm in direct_map:
            resolved = direct_map[val_norm]

        # Case 3: fuzzy suggestion
        else:
            guess, score = process.extractOne(value, standard_countries)
            print(f"\nFound '{value}' → Did you mean '{guess}'? (confidence {score}%)")
            choice = input("Accept suggestion? (y/n): ").strip().lower()

            if choice == "y":
                resolved = guess
            else:
                resolved = input("Enter the correct country: ").strip()

            # Store permanently in direct_map
            direct_map[val_norm] = resolved

        cleaned.append(resolved)
    
    df[column + "_cleaned"] = cleaned
    return df, direct_map


# Run interactive cleaner
df_cleaned, updated_map = clean_country_column(df, "country", standard_countries, direct_map)



Found 'The Netherlands' → Did you mean 'Netherlands'? (confidence 95%)


Accept suggestion? (y/n):  y



Found 'U.S>' → Did you mean 'United Arab Emirates'? (confidence 86%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'ISA' → Did you mean 'Bulgaria'? (confidence 72%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'Great Britain ' → Did you mean 'Taiwan'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'United State' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'Bermuda' → Did you mean 'Germany'? (confidence 57%)


Accept suggestion? (y/n):  n
Enter the correct country:  Bermuda



Found 'The United States' → Did you mean 'United States'? (confidence 95%)


Accept suggestion? (y/n):  y



Found 'United State of America' → Did you mean 'United Kingdom'? (confidence 86%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'United Stated' → Did you mean 'United States'? (confidence 92%)


Accept suggestion? (y/n):  y



Found 'Hong Kong' → Did you mean 'Congo'? (confidence 57%)


Accept suggestion? (y/n):  n
Enter the correct country:  China



Found 'Contracts' → Did you mean 'Congo'? (confidence 54%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'USA-- Virgin Islands' → Did you mean 'Finland'? (confidence 64%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'United Statws' → Did you mean 'United States'? (confidence 92%)


Accept suggestion? (y/n):  y



Found 'England/UK' → Did you mean 'Poland'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'U.S' → Did you mean 'United Arab Emirates'? (confidence 86%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'We don't get raises, we get quarterly bonuses, but they periodically asses income in the area you work, so I got a raise because a 3rd party assessment showed I was paid too little for the area we were located' → Did you mean 'Israel'? (confidence 38%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'Unites States ' → Did you mean 'United States'? (confidence 92%)


Accept suggestion? (y/n):  y



Found 'England, UK.' → Did you mean 'Finland'? (confidence 64%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'U. S. ' → Did you mean 'United Arab Emirates'? (confidence 86%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'Britain ' → Did you mean 'Brazil'? (confidence 62%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'United Sates' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'Canada, Ottawa, ontario' → Did you mean 'Canada'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Global' → Did you mean 'Colombia'? (confidence 57%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'United States of American ' → Did you mean 'United States'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Uniited States' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'United Kingdom (England)' → Did you mean 'United Kingdom'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Worldwide (based in US but short term trips aroudn the world)' → Did you mean 'Netherlands'? (confidence 58%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'Canadw' → Did you mean 'Canada'? (confidence 83%)


Accept suggestion? (y/n):  y



Found 'United Sates of America' → Did you mean 'United Kingdom'? (confidence 86%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'United States (I work from home and my clients are all over the US/Canada/PR' → Did you mean 'United States'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Unted States' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'United Statesp' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'United Stattes' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'United Statea' → Did you mean 'United States'? (confidence 92%)


Accept suggestion? (y/n):  y



Found 'United Kingdom.' → Did you mean 'United Kingdom'? (confidence 100%)


Accept suggestion? (y/n):  y



Found 'Trinidad and Tobago' → Did you mean 'India'? (confidence 72%)


Accept suggestion? (y/n):  n
Enter the correct country:  Trinidad and Tobago



Found 'United Statees' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'Cayman Islands' → Did you mean 'Finland'? (confidence 64%)


Accept suggestion? (y/n):  n
Enter the correct country:  Cayman Islands



Found 'Can' → Did you mean 'Canada'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'I am located in Canada but I work for a company in the US' → Did you mean 'Canada'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  Canada



Found 'Uniyed states' → Did you mean 'United States'? (confidence 92%)


Accept suggestion? (y/n):  y



Found 'Uniyes States' → Did you mean 'United States'? (confidence 85%)


Accept suggestion? (y/n):  y



Found 'United States of Americas' → Did you mean 'United States'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'U.A.' → Did you mean 'Bosnia and Herzegovina'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'Puerto Rico' → Did you mean 'South Africa'? (confidence 52%)


Accept suggestion? (y/n):  n
Enter the correct country:  Puerto Rico



Found 'U.SA' → Did you mean 'United States'? (confidence 68%)


Accept suggestion? (y/n):  y



Found 'United Kindom' → Did you mean 'United Kingdom'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'United Status' → Did you mean 'United States'? (confidence 92%)


Accept suggestion? (y/n):  y



Found 'Currently finance' → Did you mean 'Finland'? (confidence 64%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'UXZ' → Did you mean 'Luxembourg'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'England, UK' → Did you mean 'Finland'? (confidence 64%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'Canda' → Did you mean 'Canada'? (confidence 91%)


Accept suggestion? (y/n):  y



Found 'Canada and USA' → Did you mean 'Canada'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Catalonia' → Did you mean 'Australia'? (confidence 67%)


Accept suggestion? (y/n):  n
Enter the correct country:  Spain



Found '$2,175.84/year is deducted for benefits' → Did you mean 'Côte d'Ivoire'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'Italy (South)' → Did you mean 'Italy'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Jersey, Channel islands' → Did you mean 'Finland'? (confidence 64%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'Virginia' → Did you mean 'India'? (confidence 80%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'USS' → Did you mean 'Russia'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Uniteed States' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'Hartford' → Did you mean 'Qatar'? (confidence 46%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'Japan, US Gov position' → Did you mean 'Japan'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Csnada' → Did you mean 'Canada'? (confidence 83%)


Accept suggestion? (y/n):  y



Found 'United Stares' → Did you mean 'United States'? (confidence 92%)


Accept suggestion? (y/n):  y



Found 'Mainland China' → Did you mean 'China'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'I.S.' → Did you mean 'Afghanistan'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'UK (Northern Ireland)' → Did you mean 'Ireland'? (confidence 90%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'UK for U.S. company' → Did you mean 'Romania'? (confidence 56%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'NZ' → Did you mean 'Tanzania'? (confidence 90%)


Accept suggestion? (y/n):  n
Enter the correct country:  New Zealand



Found 'Canad' → Did you mean 'Canada'? (confidence 91%)


Accept suggestion? (y/n):  y



Found 'Unite States' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'The US' → Did you mean 'United States'? (confidence 60%)


Accept suggestion? (y/n):  y



Found 'Remote' → Did you mean 'Greece'? (confidence 50%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'For the United States government, but posted overseas' → Did you mean 'United States'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'IS' → Did you mean 'Afghanistan'? (confidence 90%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'United Kingdomk' → Did you mean 'United Kingdom'? (confidence 97%)


Accept suggestion? (y/n):  y



Found 'Australi' → Did you mean 'Australia'? (confidence 94%)


Accept suggestion? (y/n):  y



Found 'Cote d'Ivoire' → Did you mean 'Côte d'Ivoire'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'From Romania, but for an US based company' → Did you mean 'Romania'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Wales (United Kingdom)' → Did you mean 'United Kingdom'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'England, Gb' → Did you mean 'Finland'? (confidence 64%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'UnitedStates' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'Danmark' → Did you mean 'Denmark'? (confidence 86%)


Accept suggestion? (y/n):  y



Found 'U.K. (northern England)' → Did you mean 'Uganda'? (confidence 66%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'NL' → Did you mean 'Finland'? (confidence 90%)


Accept suggestion? (y/n):  n
Enter the correct country:  Netherlands



Found 'Nederland' → Did you mean 'Netherlands'? (confidence 80%)


Accept suggestion? (y/n):  y



Found 'England, United Kingdom' → Did you mean 'United Kingdom'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Englang' → Did you mean 'Finland'? (confidence 57%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'United statew' → Did you mean 'United States'? (confidence 92%)


Accept suggestion? (y/n):  y



Found 'UAE' → Did you mean 'Bangladesh'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Arab Emirates



Found 'bonus based on meeting yearly goals set w/ my supervisor' → Did you mean 'Eritrea'? (confidence 51%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'International ' → Did you mean 'Latvia'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'The Bahamas ' → Did you mean 'Panama'? (confidence 60%)


Accept suggestion? (y/n):  Bahamas
Enter the correct country:  The Bahamas



Found 'I earn commission on sales. If I meet quota, I'm guaranteed another 16k min. Last year i earned an additional 27k. It's not uncommon for people in my space to earn 100k+ after commission. ' → Did you mean 'France'? (confidence 40%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'United Statues' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'Untied States' → Did you mean 'United States'? (confidence 92%)


Accept suggestion? (y/n):  y



Found 'USA (company is based in a US territory, I work remote)' → Did you mean 'Eritrea'? (confidence 64%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'UK (England)' → Did you mean 'Uganda'? (confidence 66%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'UK, remote' → Did you mean 'United Arab Emirates'? (confidence 48%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'Scotland, UK' → Did you mean 'Poland'? (confidence 75%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'USAB' → Did you mean 'Cyprus'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'Unitied States' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'United Sttes' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'Remote (philippines)' → Did you mean 'Philippines'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Unites kingdom ' → Did you mean 'United Kingdom'? (confidence 93%)


Accept suggestion? (y/n):  y



Found 'Panamá' → Did you mean 'Panama'? (confidence 91%)


Accept suggestion? (y/n):  y



Found 'Austria, but I work remotely for a Dutch/British company' → Did you mean 'Austria'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'I work for an US based company but I'm from Argentina.' → Did you mean 'Argentina'? (confidence 90%)


Accept suggestion? (y/n):  n
Enter the correct country:  Argentina



Found 'I was brought in on this salary to help with the EHR and very quickly was promoted to current position but compensation was not altered. ' → Did you mean 'Bosnia and Herzegovina'? (confidence 86%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'Uniter Statez' → Did you mean 'United States'? (confidence 85%)


Accept suggestion? (y/n):  y



Found 'U. S ' → Did you mean 'United Arab Emirates'? (confidence 86%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'USA tomorrow ' → Did you mean 'Morocco'? (confidence 51%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'United Stateds' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'n/a (remote from wherever I want)' → Did you mean 'Taiwan'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'US govt employee overseas, country withheld' → Did you mean 'Kuwait'? (confidence 57%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'Africa' → Did you mean 'South Africa'? (confidence 90%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'San Francisco' → Did you mean 'France'? (confidence 75%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'Usat' → Did you mean 'Australia'? (confidence 68%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States





Found '🇺🇸 ' → Did you mean 'Afghanistan'? (confidence 0%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'Luxemburg' → Did you mean 'Luxembourg'? (confidence 95%)


Accept suggestion? (y/n):  y



Found 'Unitef Stated' → Did you mean 'United States'? (confidence 85%)


Accept suggestion? (y/n):  y



Found 'UA' → Did you mean 'Ecuador'? (confidence 90%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'Wales, UK' → Did you mean 'Ukraine'? (confidence 50%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'USaa' → Did you mean 'Australia'? (confidence 62%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'United States- Puerto Rico' → Did you mean 'United States'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'From New Zealand but on projects across APAC' → Did you mean 'New Zealand'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Y' → Did you mean 'Myanmar'? (confidence 90%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'United y' → Did you mean 'United Arab Emirates'? (confidence 86%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'Wales (UK)' → Did you mean 'Ukraine'? (confidence 50%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'Isle of Man' → Did you mean 'Chile'? (confidence 54%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'Northern Ireland, United Kingdom' → Did you mean 'Ireland'? (confidence 90%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'europe' → Did you mean 'Turkey'? (confidence 50%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'California ' → Did you mean 'India'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'UK, but for globally fully remote company' → Did you mean 'Romania'? (confidence 56%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'Australian ' → Did you mean 'Australia'? (confidence 95%)


Accept suggestion? (y/n):  y



Found 'México' → Did you mean 'Mexico'? (confidence 91%)


Accept suggestion? (y/n):  y



Found 'USD' → Did you mean 'Cyprus'? (confidence 72%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'USA, but for foreign gov't' → Did you mean 'Congo'? (confidence 54%)


Accept suggestion? (y/n):  n
Enter the correct country:  United States



Found 'United Statss' → Did you mean 'United States'? (confidence 92%)


Accept suggestion? (y/n):  y



Found 'ARGENTINA BUT MY ORG IS IN THAILAND' → Did you mean 'Argentina'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'I work for a UAE-based organization, though I am personally in the US.' → Did you mean 'Argentina'? (confidence 50%)


Accept suggestion? (y/n):  United Arab Emirates
Enter the correct country:  United Arab Emirates



Found 'United  States' → Did you mean 'United States'? (confidence 96%)


Accept suggestion? (y/n):  y



Found 'Aotearoa New Zealand' → Did you mean 'New Zealand'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'na' → Did you mean 'Argentina'? (confidence 90%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'Policy' → Did you mean 'Czech Republic'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'hong konh' → Did you mean 'Congo'? (confidence 57%)


Accept suggestion? (y/n):  n
Enter the correct country:  China



Found 'United States is America' → Did you mean 'United States'? (confidence 90%)


Accept suggestion? (y/n):  y



Found 'Company in Germany. I work from Pakistan.' → Did you mean 'Germany'? (confidence 90%)


Accept suggestion? (y/n):  n
Enter the correct country:  Pakistan



Found 'Canadá' → Did you mean 'Canada'? (confidence 91%)


Accept suggestion? (y/n):  y



Found 'London' → Did you mean 'Indonesia'? (confidence 60%)


Accept suggestion? (y/n):  n
Enter the correct country:  United Kingdom



Found 'ss' → Did you mean 'Russia'? (confidence 90%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'dbfemf' → Did you mean 'Belgium'? (confidence 46%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'ibdia' → Did you mean 'India'? (confidence 80%)


Accept suggestion? (y/n):  y



Found 'LOUTRELAND' → Did you mean 'Ireland'? (confidence 71%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'ff' → Did you mean 'Afghanistan'? (confidence 45%)


Accept suggestion? (y/n):  n
Enter the correct country:  



Found 'Česká republika' → Did you mean 'Czech Republic'? (confidence 64%)


Accept suggestion? (y/n):  y



Found 'Italia' → Did you mean 'Australia'? (confidence 75%)


Accept suggestion? (y/n):  n
Enter the correct country:  Italy



Found 'Hong KongKong' → Did you mean 'Congo'? (confidence 54%)


Accept suggestion? (y/n):  n
Enter the correct country:  China
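
Many of the prompts above come from inputs a fuzzy matcher cannot resolve on its own: flag emoji, abbreviations such as 'UA' or 'USD', and very short strings like 'na' or 'ss' that still score 90%. Below is a minimal sketch of a pre-pass that could cut down the number of prompts. It is not the notebook's implementation: it swaps in the standard library's `difflib` for fuzzywuzzy so it runs without extra installs, and `ALIASES`, `COUNTRIES`, `suggest_country`, and the 0.8 cutoff are all hypothetical names and values chosen for illustration.

```python
import difflib
import re

# Hypothetical alias table; extend it with whatever shorthand appears in the data.
ALIASES = {
    "us": "United States",
    "usa": "United States",
    "uk": "United Kingdom",
    "uae": "United Arab Emirates",
}

# Small illustrative list; the notebook would use its full country list instead.
COUNTRIES = [
    "United States", "United Kingdom", "United Arab Emirates",
    "Luxembourg", "Mexico", "Canada",
]


def suggest_country(raw, cutoff=0.8):
    """Return (suggestion, needs_review) for one raw survey answer."""
    # Replace punctuation and emoji with spaces, collapse whitespace, lowercase.
    text = re.sub(r"[^\w\s]", " ", raw)
    text = " ".join(text.split()).lower()
    if not text:
        return None, True  # nothing left to match, e.g. a lone flag emoji
    if text in ALIASES:
        return ALIASES[text], False
    lowered = {name.lower(): name for name in COUNTRIES}
    if text in lowered:
        return lowered[text], False
    # Accept a fuzzy match only when its similarity ratio reaches the cutoff.
    matches = difflib.get_close_matches(text, list(lowered), n=1, cutoff=cutoff)
    if matches:
        return lowered[matches[0]], False
    return None, True  # leave it for the interactive prompt


for answer in ["🇺🇸 ", "USA", "Luxemburg", "dbfemf"]:
    print(repr(answer), "->", suggest_country(answer))
```

Only answers that come back with `needs_review` set to `True` would then need the interactive prompt; the alias table handles abbreviations before any fuzzy scoring happens.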


In [32]:
df.head()

Unnamed: 0,year,age,annual_salary,bonus,currency,country,years_working,years_in_your_field,education,gender,race,country_cleaned
0,2021,25-34,55000,0,USD,United States,5-7,5-7,Master's degree,Woman,White,United States
1,2021,25-34,54600,4000,GBP,United Kingdom,8-10,5-7,College degree,Non-binary,White,United Kingdom
2,2021,25-34,34000,0,USD,US,2-4,2-4,College degree,Woman,White,United States
3,2021,25-34,62000,3000,USD,USA,8-10,5-7,College degree,Woman,White,United States
4,2021,25-34,60000,7000,USD,US,8-10,5-7,College degree,Woman,White,United States
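
The `country_cleaned` values above depend on answers typed during the interactive session, so rerunning the notebook would mean answering every prompt again. One way to avoid that is to store each decision in a small JSON file and reuse it. The sketch below assumes that approach; `load_corrections`, `save_corrections`, `resolve`, and the `ask` callback are hypothetical helpers, not functions from the notebook.

```python
import json
from pathlib import Path


def load_corrections(path):
    """Load previously saved raw -> cleaned decisions, or start empty."""
    file = Path(path)
    if file.exists():
        return json.loads(file.read_text(encoding="utf-8"))
    return {}


def save_corrections(path, corrections):
    """Write the decisions as readable JSON so they can be reviewed by hand."""
    text = json.dumps(corrections, ensure_ascii=False, indent=2, sort_keys=True)
    Path(path).write_text(text, encoding="utf-8")


def resolve(raw, corrections, ask):
    """Return the cleaned value, calling ask() only for unseen inputs."""
    key = raw.strip()
    if key not in corrections:
        corrections[key] = ask(key)
    return corrections[key]
```

In a notebook, `ask` could wrap `input()`; the first run records every answer, and later runs read them back without prompting.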


In [33]:
# Keep the analysis columns; country_cleaned replaces the raw country column.
df = df[['year', 'age', 'annual_salary', 'bonus', 'currency', 'country_cleaned',
         'years_working', 'years_in_your_field', 'education', 'gender', 'race']]
df

Unnamed: 0,year,age,annual_salary,bonus,currency,country_cleaned,years_working,years_in_your_field,education,gender,race
0,2021,25-34,55000,0,USD,United States,5-7,5-7,Master's degree,Woman,White
1,2021,25-34,54600,4000,GBP,United Kingdom,8-10,5-7,College degree,Non-binary,White
2,2021,25-34,34000,0,USD,United States,2-4,2-4,College degree,Woman,White
3,2021,25-34,62000,3000,USD,United States,8-10,5-7,College degree,Woman,White
4,2021,25-34,60000,7000,USD,United States,8-10,5-7,College degree,Woman,White
...,...,...,...,...,...,...,...,...,...,...,...
28144,2025,25-34,19200,10200,Other,Thailand,2-4,1 year or less,College degree,Woman,Asian or Asian American
28145,2025,25-34,25000,0,GBP,United Kingdom,2-4,2-4,College degree,Man,Another option not listed here or prefer not t...
28146,2025,18-24,8000,0,USD,United States,2-4,1 year or less,High School,Woman,White
28147,2025,25-34,1000000,30000,USD,India,5-7,5-7,College degree,Man,Asian or Asian American


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28149 entries, 0 to 28148
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   year                 28149 non-null  Int64 
 1   age                  28149 non-null  object
 2   annual_salary        28149 non-null  int64 
 3   bonus                28149 non-null  int64 
 4   currency             28149 non-null  object
 5   country_cleaned      28149 non-null  object
 6   years_working        28149 non-null  object
 7   years_in_your_field  28149 non-null  object
 8   education            27921 non-null  object
 9   gender               27975 non-null  object
 10  race                 27966 non-null  object
dtypes: Int64(1), int64(2), object(8)
memory usage: 2.4+ MB
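
The summary above shows that only `education`, `gender`, and `race` have missing values (228, 174, and 183 rows out of 28149, respectively). A per-column count and percentage makes that easier to read than subtracting non-null counts by hand. The sketch below uses a tiny stand-in frame with made-up values rather than the survey data; `sample` and `report` are illustrative names.

```python
import pandas as pd

# Tiny stand-in frame; the values are illustrative, not taken from the survey.
sample = pd.DataFrame({
    "education": ["College degree", None, "Master's degree"],
    "gender": ["Woman", "Man", None],
    "race": ["White", None, None],
})

# Count missing entries per column and express them as a share of all rows.
missing = sample.isna().sum()
percent = (missing / len(sample) * 100).round(1)
report = pd.DataFrame({"missing": missing, "percent": percent})
print(report)
```

Running the same three lines against `df` would give the equivalent report for the cleaned survey data.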
