Why .ipynb? The nature of Data Cleaning and EDA requires often re-running 
certain parts of code, displaying graphs, and drawing tables. 
This is why cell-alike interface is more beneficial. 

However, it is used mostly in Data Science, and should not be ubiquitous.
Certain cleaning functions can be written as a separate .py file for a 
later use.

## Loading Dataframe

In [595]:
# load and display dataset
import pandas as pd
df = pd.read_csv('Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.csv')
df.head()

Unnamed: 0,Timestamp,How old are you?,Industry,Job title,Additional context on job title,Annual salary,Other monetary comp,Currency,Currency - other,Additional context on income,Country,State,City,Overall years of professional experience,Years of experience in field,Highest level of education completed,Gender,Race
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


In [596]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27696 entries, 0 to 27695
Data columns (total 18 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Timestamp                                 27696 non-null  object 
 1   How old are you?                          27696 non-null  object 
 2   Industry                                  27627 non-null  object 
 3   Job title                                 27696 non-null  object 
 4   Additional context on job title           7168 non-null   object 
 5   Annual salary                             27696 non-null  object 
 6   Other monetary comp                       20531 non-null  float64
 7   Currency                                  27696 non-null  object 
 8   Currency - other                          185 non-null    object 
 9   Additional context on income              3015 non-null   object 
 10  Country                           

Based on the aquired information, we can see that data is very dirty and 
contains a lot of NaN values. Also, almost no columns are numerical data,
even though columns like 'Annual salary' are expected to be numbers.

Column 'State' applies only and exclusively to the United States.

## Column names

Some columns have too lengthy names, for example, "How old are you?", or
"Highest level of education". Instead, we could replace it with a shorter
and more convenient namings.

In [597]:
# renaming columns
df.rename(columns={'How old are you?':'Age',
                  'Years of experience in field': 'Experience',
                  'Currency - other': 'Currency-other',
                  'Overall years of professional experience': 'Overall_experience',
                  'Highest level of education completed': 'Education'}, inplace=True, errors='raise')

df.head()

Unnamed: 0,Timestamp,Age,Industry,Job title,Additional context on job title,Annual salary,Other monetary comp,Currency,Currency-other,Additional context on income,Country,State,City,Overall_experience,Experience,Education,Gender,Race
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


Additionally, writing column names through _ instead of the space can be helpful, I will write a function that will replace all the spaces to _


In [598]:
def remove_spaces(name):
    name = name.replace(' ', '_')
    return name

# apply to all columns using map() function
df.columns = list(map(remove_spaces, df.columns))
df.head()

Unnamed: 0,Timestamp,Age,Industry,Job_title,Additional_context_on_job_title,Annual_salary,Other_monetary_comp,Currency,Currency-other,Additional_context_on_income,Country,State,City,Overall_experience,Experience,Education,Gender,Race
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


## Checking for NaNs

In [599]:
# We already know that there are NaNs. However, where exactly are they?

print("Total NaNs:", df.isnull().sum().sum())
print("NaN by Columns:\n",df.isnull().sum())

Total NaNs: 85486
NaN by Columns:
 Timestamp                              0
Age                                    0
Industry                              69
Job_title                              0
Additional_context_on_job_title    20528
Annual_salary                          0
Other_monetary_comp                 7165
Currency                               0
Currency-other                     27511
Additional_context_on_income       24681
Country                                0
State                               4925
City                                  75
Overall_experience                     0
Experience                             0
Education                            207
Gender                               164
Race                                 161
dtype: int64


In [600]:
# lets look at the percentage of NaNs in each columns:
for name, value in zip(df.columns, [round(element/df.shape[0]*100, 2) for element in list(df.isnull().sum())]):
    print(f'{name} : {value}')


Timestamp : 0.0
Age : 0.0
Industry : 0.25
Job_title : 0.0
Additional_context_on_job_title : 74.12
Annual_salary : 0.0
Other_monetary_comp : 25.87
Currency : 0.0
Currency-other : 99.33
Additional_context_on_income : 89.11
Country : 0.0
State : 17.78
City : 0.27
Overall_experience : 0.0
Experience : 0.0
Education : 0.75
Gender : 0.59
Race : 0.58


So far, we can give a definitive answer on what columns are mostly useless:
'Additional_context_on_job_title', 
'Currency-other',
'Additional_context_on_income'. 

Let's find out why:




In [601]:
for col in ['Additional_context_on_job_title', 'Currency-other','Additional_context_on_income']:
    print(col, df[col].unique())

Additional_context_on_job_title [nan 'High school, FT' 'Data developer/ETL Developer' ...
 'Financial management Division' 'Construction' 'Meeting planner ']
Currency-other [nan 'INR' 'Peso Argentino' '$76,302.34'
 'My bonus is based on performance up to 10% of salary'
 'I work for an online state university, managing admissions data. Not direct tech support. '
 '0' 'MYR' 'CHF' 'KWD' 'NOK' 'Na ' 'USD' 'BR$' 'SEK'
 'Base plus Commission ' 'canadian' 'Dkk' 'EUR' 'COP' 'TTD'
 'Indian rupees' 'BRL (R$)' 'Mexican pesos' 'CZK' 'GBP' 'DKK' 'Bdt'
 'RSU / equity' 'ZAR' 'Additonal = Bonus plus stock' 'American Dollars'
 'Php' 'PLN (Polish zloty)' 'Overtime (about 5 hours a week) and bonus'
 'czech crowns' 'Stock ' 'TRY' 'Norwegian kroner (NOK)' 'CNY' 'ILS/NIS'
 '55,000' 'AUD & NZD are not the same currency...' 'US Dollar' 'Canadian '
 'AUD' 'BRL' 'NIS (new Israeli shekel)' '-' 'RMB (chinese yuan)'
 'Taiwanese dollars'
 "AUD and NZD aren't the same currency, and have absolutely nothing to do with

Looks like these three columns are representing some optional comments.
They might be valuable for certain studies, however, for the sake of overall
cleanliness of the dataset I will drop these columns since there is less likely
will be any use for them. 

In [602]:
df.drop(['Additional_context_on_job_title', 'Currency-other','Additional_context_on_income'], axis=1,  inplace=True)
df.head()

Unnamed: 0,Timestamp,Age,Industry,Job_title,Annual_salary,Other_monetary_comp,Currency,Country,State,City,Overall_experience,Experience,Education,Gender,Race
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,55000,0.0,USD,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,54600,4000.0,GBP,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,34000,,USD,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,62000,3000.0,USD,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,60000,7000.0,USD,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


After, we have to decide what to do with the rest of NaN cells.

We can assume that if the data is missing in a numerical column, it is most
likely a 0. So "Other_monetary_comp" will get 0 instead of NaN.

How about categorical data?

'Industry', 'Education', 'Gender', 'City' and 'Race' are string values, so it's
safe to replace NaNs with 'Unknown'. We don't know, really, what could be in there.

'State' is different, though. We need to check if the entry is for US or not, based on this, 
we will assign either 'Unknown' or 'None'.


In [603]:
df['Other_monetary_comp'] = df['Other_monetary_comp'].fillna(0)

categorical = ['Industry', 'Education', 'Gender', 'City', 'Race']

for col in categorical:
    df[col] = df[col].fillna('Unknown')
    
df.head()

Unnamed: 0,Timestamp,Age,Industry,Job_title,Annual_salary,Other_monetary_comp,Currency,Country,State,City,Overall_experience,Experience,Education,Gender,Race
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,55000,0.0,USD,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,54600,4000.0,GBP,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,34000,0.0,USD,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,62000,3000.0,USD,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,60000,7000.0,USD,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


In [604]:
# However, we are coming to some problem with US:
df['Country'].value_counts()

United States                                                                                                                                                                                                        8876
USA                                                                                                                                                                                                                  7870
US                                                                                                                                                                                                                   2579
Canada                                                                                                                                                                                                               1551
United States                                                                                                                   

In [605]:
# is the 'State' column contains EXCLUSIVELY US states names?

df['State'].unique()

array(['Massachusetts', nan, 'Tennessee', 'Wisconsin', 'South Carolina',
       'New Hampshire', 'Arizona', 'Missouri', 'Florida', 'Pennsylvania',
       'Michigan', 'Minnesota', 'Illinois', 'California', 'Georgia',
       'Ohio', 'District of Columbia', 'Maryland', 'Texas', 'Virginia',
       'North Carolina', 'New York', 'New Jersey', 'Rhode Island',
       'Colorado', 'Oregon', 'Washington', 'Indiana', 'Iowa', 'Nebraska',
       'Oklahoma', 'Maine', 'Connecticut', 'South Dakota',
       'West Virginia', 'Idaho', 'Louisiana', 'Montana', 'Kentucky',
       'North Dakota', 'Kansas', 'Vermont', 'Arkansas', 'Alabama',
       'Nevada', 'Delaware', 'New Mexico', 'Hawaii', 'Utah',
       'Mississippi', 'Kentucky, Ohio', 'District of Columbia, Virginia',
       'District of Columbia, Maryland', 'Alaska', 'Arizona, Washington',
       'Georgia, New York', 'California, Colorado', 'California, Oregon',
       'District of Columbia, Maryland, Pennsylvania, Virginia',
       'Arizona, California'

...yes, the states are US-based. However, there are multiple states sometimes.

As we see, there are multiple ways to write different countries. 
In order to address the missing 'State' column, we have to filter out everything
that is based in the US:

In [606]:
df.Country

0                                            United States
1                                           United Kingdom
2                                                       US
3                                                      USA
4                                                       US
5                                                      USA
6                                                      USA
7                                            United States
8                                                       US
9                                            United States
10                                           United States
11                                                     USA
12                                           United States
13                                           United States
14                                                  Canada
15                                         United Kingdom 
16                                                     U

Writing a function that strips spaces from the country name, casts it to
title: 

In [607]:
# Remove leading and trailing spaces
def strip_space(name):
    name = name.lstrip(" ")
    name = name.rstrip(" ")
    if len(name) >= 5:
        name = name.lower()
        name = name.title()
    return name

# calling this function on 'Country' column    
df.Country  = df.Country.apply(lambda x: strip_space(x)) 

In [608]:
import re # import regex
# USA
df.loc[df.Country.str.match(r'(\s)*(U|u)[a-zA-Z]*\s*(S|s)[a-zA-Z]*'), 
       'Country'] = 'United States'
df.loc[df.Country.str.match(r'\s*(U|u).(S|s)'), 
       'Country'] = 'United States'
df.loc[df.Country.str.match(r'\s*u[\s.]*s[ \.]*(?i)'), 
       'Country'] = 'United States'
df.loc[df.Country.str.match(r'america(?i)'),
       'Country'] = 'United States' 
df.loc[df.Country.str.match(r'the united s(?i)'),
       'Country'] = 'United States' 
df.loc[df.Country.str.match(r'the us(?i)'),
       'Country'] = 'United States' 


In [609]:
# United Kingdom
df.loc[
    df.Country.str.match(r'[a-zA-Z]*\s*,*(U|u)[a-zA-Z]*d\s*(K|k)[a-zA-Z]*m'),
    'Country'] = 'United Kingdom'
df.loc[df.Country.str.match(r'(U|u)[\s\.]*(K|k)') & (df.Country != 'Ukraine'),
       'Country'] = 'United Kingdom'
df.loc[(df.Country.str.match(r'eng(?i)')) |
       (df.Country.str.match(r'britain(?i)')) |
       (df.Country.str.match(r'g\s*b(?i)')), 'Country'] = 'United Kingdom'
df.loc[df.Country.str.match(r'[\s\.]*wales(?i)'), 'Country'] = 'United Kingdom'
df.loc[df.Country.str.match(r'[\s\.]*scot(?i)'), 'Country'] = 'United Kingdom'
df.loc[df.Country.str.match(r'(.)*northern ireland(?i)'),
       'Country'] = 'United Kingdom'
df.loc[df.Country.str.match(r'(.)*Great Britain(?i)'),
       'Country'] = 'United Kingdom'


In [610]:
# Germany
df.loc[df.Country.str.match(r'ger(?i)'), 'Country'] = 'Germany'

# Czech Republic
df.loc[df.Country.str.match(r'czech(?i)'), 'Country'] = 'Czech Republic'

# Canada
df.loc[df.Country.str.match(r'(C|c)...da(?i)') | (df.Country.str.match(r'(C|c)an(?i)')), 'Country'] = 'Canada'

# Australia
df.loc[df.Country.str.match(r'\s*austral(?i)'), 'Country'] = 'Australia'

# New Zealand 
df.loc[df.Country.str.match(r'[a-z]*\s*new\s*zeal(?i)'), 'Country'] = 'New Zealand'
df.loc[df.Country.str.match(r'[a-z]*\s*nz(?i)'), 'Country'] = 'New Zealand'

# Remote
df.loc[df.Country.str.match(r'(.)*remote(.)*(?i)'), 'Country'] = 'Remote'

# Netherlands
df.loc[df.Country.str.match(r'[a-z]*\s*nether(?i)'), 'Country'] = 'Netherlands'
df.loc[df.Country.str.match(r'NL'), 'Country'] = 'Netherlands'
df.loc[df.Country.str.match(r'N[a-z]*rland(?i)'), 'Country'] = 'Netherlands'

# Singapore
df.loc[df.Country.str.match(r'[a-z]*\s*singap(?i)'), 'Country'] = 'Singapore'

# Mexico
df.loc[df.Country.str.match(r'[a-z]*\s*m.xico(?i)'), 'Country'] = 'Mexico'

# Denmark
df.loc[df.Country.str.match(r'd..mark(?i)'), 'Country'] = 'Denmark'

# France
df.loc[df.Country.str.match(r'france(?i)'), 'Country'] = 'France'

# Japan
df.loc[df.Country.str.match(r'\s*japan(?i)'), 'Country'] = 'Japan'

# Italy
df.loc[df.Country.str.match(r'\s*italy(?i)'), 'Country'] = 'Italy'

# China
df.loc[df.Country.str.match(r'(.)*china(?i)'), 'Country'] = 'China'

# United Arab Emirates:
df.loc[df.Country.str.match(r'(.)*u[\.\s]*a[\.\s]*e(?i)'), 
       'Country'] = 'United Arab Emirates'

# Luxembourg:
df.loc[df.Country.str.match(r'(.)*Luxemb(?i)'), 'Country'] = 'Luxembourg'

# Hong Kong
df.loc[df.Country.str.match(r'(.)*hong k(?i)'), 'Country'] = 'Hong Kong'

# Brazil
df.loc[df.Country.str.match(r'(.)*bra.il(?i)'), 'Country'] = 'Brazil'

In [611]:
# Unclassified stuff:
# First of all: everything that contain digits needs to be reviewed
# None of it tells us what country is it from, so we can conclude that
# this is a column mismatched. The country is 'Unknown'
df.loc[df.Country.str.match(r'(.)*[\d/]+(.)*'), 'Country'] = 'Unknown'

In [612]:
# especially long ones:
print(df.loc[df.Country.str.len() >= 25]['Country'])


3019     Worldwide (Based In Us But Short Term Trips Ar...
4739     I Am Located In Canada But I Work For A Compan...
11529    For The United States Government, But Posted O...
11813            From Romania, But For An Us Based Company
16210    I Work For An Us Based Company But I'M From Ar...
16853    I Was Brought In On This Salary To Help With T...
25548                  Argentina But My Org Is In Thailand
27107            Company In Germany. I Work From Pakistan.
Name: Country, dtype: object


In [613]:
# Manually fixing the single occurences and errors:
df.loc[3019, 'Country'] = 'United States'
df.loc[4739, 'Country'] = "Canada"
df.loc[11529, 'Country'] = 'Unknown'
df.loc[11813, 'Country'] = 'Romania'
df.loc[16210, 'Country'] = 'Argentina'
df.loc[16853, 'Country'] = 'Unknown'
df.loc[25548, 'Country'] = 'Argentina'
df.loc[25952, 'Country'] = 'United States'
df.loc[27107, 'Country'] = 'Germany'
df.loc[df.Country == 'San Francisco', 'Country'] = 'United States' 
df.loc[df.Country == 'Jersey, Channel Islands', 'Country'] = 'United Kingdom'
df.loc[df.Country == 'United Y', 'Country'] = 'Unknown'
df.loc[df.Country == 'Y', 'Country'] = 'Unknown'
df.loc[df.Country == 'na', 'Country'] = 'Unknown'
df.loc[df.Country == 'Contracts', 'Country'] = 'Unknown'
df.loc[df.Country == 'Hartford', 'Country'] = 'Unknown'
df.loc[df.Country == 'Policy', 'Country'] = 'Unknown'
df.loc[df.Country == 'Virginia', 'Country'] = 'United States'
df.loc[df.Country == 'California', 'Country'] = 'United States'
df.loc[df.Country == '🇺🇸', 'Country'] = 'United States'
df.loc[(df.Country == 'I.S.') | (df.Country == 'IS') | (df.Country == 'ISA'), 'Country'] = 'Unknown'
df.loc[(df.Country == 'UA') | (df.Country == 'U.A.') | (df.Country == 'UXZ'), 'Country'] = 'Unknown'
df.loc[(df.Country == 'Currently Finance') | (df.Country == 'Global') | (df.Country == 'International'), 'Country'] = 'Unknown'
df.loc[df.Country == 'The Bahamas', 'Country'] = 'Bahamas'
df.loc[df.Country == 'Panamá', 'Country'] = 'Panama'






Note, that some errors are an obvious mistyping, for example, "United Stated". It is very clear what country user wanted to input, but most likely missed a key on the keyboard, or a phone autocorrected the phrase. However, some countries have completely ambiguous input, for example, 'UA'. Is it Ukraine? UsA? UAe? We cannot guess, since this will make the information in the table biased. For that matter, I marked everything that is not clear as 'Unknown'

In [614]:
# Checking if all the countries appearing OK:
print(df.Country.nunique())
print(sorted(df.Country.unique()))

97
['Afghanistan', 'Africa', 'Argentina', 'Australia', 'Austria', 'Bahamas', 'Bangladesh', 'Belgium', 'Bermuda', 'Brazil', 'Bulgaria', 'Cambodia', 'Canada', 'Catalonia', 'Cayman Islands', 'Chile', 'China', 'Colombia', 'Congo', 'Costa Rica', "Cote D'Ivoire", 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic', 'Denmark', 'Ecuador', 'Eritrea', 'Estonia', 'Europe', 'Finland', 'France', 'Germany', 'Ghana', 'Greece', 'Hong Kong', 'Hungary', 'India', 'Indonesia', 'Ireland', 'Isle Of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Kuwait', 'Latvia', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Malaysia', 'Malta', 'Mexico', 'Morocco', 'Netherlands', 'New Zealand', 'Nigeria', 'Norway', 'Pakistan', 'Panama', 'Philippines', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Remote', 'Romania', 'Russia', 'Rwanda', 'Saudi Arabia', 'Serbia', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia', 'Somalia', 'South Africa', 'South Korea', 'Spain', 'Sri Lanka', 'Sweden', 'Switzerland', 'Taiwan', '

This is the downside of the manual data entry, there are a lot of typing
errors. Whenever possible, dropdown options is a better choice

Finally, replacing the rest of NaNs with 'Unknown' if the country is "United States"
or with 'None' if the country is different

In [615]:
# 'Unknown' if Country is 'United States',
# 'None' if Country is different
df.loc[(df['Country'] == 'United States') & (df.State.isnull()), 
       'State'] = 'Unknown' 
df.loc[(df['Country'] != 'United States') & (df.State.isnull()), 
       'State'] = 'None'

In [616]:
df.head()

Unnamed: 0,Timestamp,Age,Industry,Job_title,Annual_salary,Other_monetary_comp,Currency,Country,State,City,Overall_experience,Experience,Education,Gender,Race
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,55000,0.0,USD,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,54600,4000.0,GBP,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,34000,0.0,USD,United States,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,62000,3000.0,USD,United States,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,60000,7000.0,USD,United States,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


In [617]:
# Final check of NaNs
df.isnull().sum()

Timestamp              0
Age                    0
Industry               0
Job_title              0
Annual_salary          0
Other_monetary_comp    0
Currency               0
Country                0
State                  0
City                   0
Overall_experience     0
Experience             0
Education              0
Gender                 0
Race                   0
dtype: int64

NaN problem is fixed.

# Dealing with data types:

In [618]:
# Looking at data types:
df.dtypes

Timestamp               object
Age                     object
Industry                object
Job_title               object
Annual_salary           object
Other_monetary_comp    float64
Currency                object
Country                 object
State                   object
City                    object
Overall_experience      object
Experience              object
Education               object
Gender                  object
Race                    object
dtype: object

In [619]:
# changing the first column from 'obj' to 'datetime'
df['Timestamp'] = df['Timestamp'].astype('datetime64[ns]')

In [620]:
# getting rid of commas in salary column and casting it as 'int'
df.Annual_salary = df.Annual_salary.apply(lambda x: x.replace(",", ""))
df.Annual_salary = df.Annual_salary.astype('int')

In [621]:
# turning another numerical column to int from float
df.Other_monetary_comp = df.Other_monetary_comp.astype('int')

In [622]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27696 entries, 0 to 27695
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Timestamp            27696 non-null  datetime64[ns]
 1   Age                  27696 non-null  object        
 2   Industry             27696 non-null  object        
 3   Job_title            27696 non-null  object        
 4   Annual_salary        27696 non-null  int32         
 5   Other_monetary_comp  27696 non-null  int32         
 6   Currency             27696 non-null  object        
 7   Country              27696 non-null  object        
 8   State                27696 non-null  object        
 9   City                 27696 non-null  object        
 10  Overall_experience   27696 non-null  object        
 11  Experience           27696 non-null  object        
 12  Education            27696 non-null  object        
 13  Gender               27696 non-

That is it, the rest of the columns are categorical, we cannot cast them as integer


# Checking Data For Consistency

In [623]:
# looking at potential column problems
sorted(list(df.Industry.unique()))

[' Buyer',
 ' Veterinary medicine',
 '"Government Relations" (Lobbying) ',
 'Academia',
 'Academia ',
 'Academia - STEM',
 'Academia / Research',
 'Academia--cell and molecular biology',
 'Academic Medicine ',
 'Academic Press Production',
 'Academic Publishing',
 'Academic Science ',
 'Academic Scientific Research',
 'Academic publishing',
 'Academic publishing ',
 'Academic research',
 'Academic research (Psychology)',
 'Academic research (social science)',
 'Academic science',
 'Academic/nonprofit research',
 'Accessibility',
 'Accounting, Banking & Finance',
 'Actuarial',
 'Administration',
 'Administration ',
 'Administration (food service)',
 'Administration in MLM',
 'Administration, IT',
 'Administrative ',
 'Administrative Support',
 'Administrative Work',
 'Adult education',
 'Aerospace',
 'Aerospace ',
 'Aerospace & Defense',
 'Aerospace and Defense',
 'Aerospace and Defense Manufacturing',
 'Aerospace and Defense/Government Contracting',
 'Aerospace contracting',
 'Aerospac

In [624]:
# Fixing minor descrepancies 
df.Industry = df.Industry.apply(lambda x: strip_space(x))

In [625]:
print(df.Age.unique()) # checks OK
print(df.Industry.unique()) # seems OK

['25-34' '45-54' '35-44' '18-24' '65 or over' '55-64' 'under 18']
['Education (Higher Education)' 'Computing Or Tech'
 'Accounting, Banking & Finance' 'Nonprofits' 'Publishing'
 'Education (Primary/Secondary)' 'Law' 'Health Care'
 'Utilities & Telecommunications' 'Business Or Consulting' 'Art & Design'
 'Government And Public Administration' 'Public Library'
 'Engineering Or Manufacturing' 'Media & Digital'
 'Marketing, Advertising & Pr' 'Retail' 'Property Or Construction'
 'Biotechnology' 'Aerospace Contracting' 'Insurance' 'Sales' 'Energy'
 'Environmental Regulation' 'Hospitality & Events'
 'Transport Or Logistics' 'Medical Devices'
 'Academic Research (Psychology)' 'Social Work' 'Surveying'
 'Recruitment Or Hr' 'PhD' 'Biopharma' 'Stem Research' 'Libraries'
 'Architecture' 'Academic Medicine' 'Commercial Real Estate'
 'Pet Care Industry (Dog Training/Walking)' 'Politics'
 'University Administration' 'Animal Health Product Manufacturing'
 'Educational Technology - Hybrid Between Book 

In [626]:
df.Job_title = df.Job_title.apply(lambda x: strip_space(x))
#print(list(df.Job_title.unique())) # this line is commented because the output is too large

In [627]:
df.Currency.unique() # seems OK

array(['USD', 'GBP', 'CAD', 'EUR', 'AUD/NZD', 'Other', 'CHF', 'ZAR',
       'SEK', 'HKD', 'JPY'], dtype=object)

In [628]:
df.State.unique() 
# There are dual states. I will not correct this, as some people work in two states,
# and the overall data seems to be consistent

array(['Massachusetts', 'None', 'Tennessee', 'Wisconsin',
       'South Carolina', 'New Hampshire', 'Arizona', 'Missouri',
       'Florida', 'Unknown', 'Pennsylvania', 'Michigan', 'Minnesota',
       'Illinois', 'California', 'Georgia', 'Ohio',
       'District of Columbia', 'Maryland', 'Texas', 'Virginia',
       'North Carolina', 'New York', 'New Jersey', 'Rhode Island',
       'Colorado', 'Oregon', 'Washington', 'Indiana', 'Iowa', 'Nebraska',
       'Oklahoma', 'Maine', 'Connecticut', 'South Dakota',
       'West Virginia', 'Idaho', 'Louisiana', 'Montana', 'Kentucky',
       'North Dakota', 'Kansas', 'Vermont', 'Arkansas', 'Alabama',
       'Nevada', 'Delaware', 'New Mexico', 'Hawaii', 'Utah',
       'Mississippi', 'Kentucky, Ohio', 'District of Columbia, Virginia',
       'District of Columbia, Maryland', 'Alaska', 'Arizona, Washington',
       'Georgia, New York', 'California, Colorado', 'California, Oregon',
       'District of Columbia, Maryland, Pennsylvania, Virginia',
       

In [629]:
df.Overall_experience.unique() # good

array(['5-7 years', '8 - 10 years', '2 - 4 years', '21 - 30 years',
       '11 - 20 years', '1 year or less', '41 years or more',
       '31 - 40 years'], dtype=object)

In [630]:
df.Experience.unique() # good

array(['5-7 years', '2 - 4 years', '21 - 30 years', '11 - 20 years',
       '1 year or less', '8 - 10 years', '31 - 40 years',
       '41 years or more'], dtype=object)

In [631]:
df.Education.unique() # OK

array(["Master's degree", 'College degree', 'PhD', 'Unknown',
       'Some college', 'High School',
       'Professional degree (MD, JD, etc.)'], dtype=object)

In [632]:
df.Gender.unique() # OK

array(['Woman', 'Non-binary', 'Man', 'Unknown',
       'Other or prefer not to answer', 'Prefer not to answer'],
      dtype=object)

In [633]:
df.Age.unique() # good

array(['25-34', '45-54', '35-44', '18-24', '65 or over', '55-64',
       'under 18'], dtype=object)

In [634]:
#print(list(df.City.unique())) # this column is definitely very dirty

First, dealing with Remote and Work From Home:

In [635]:
df.loc[(df.City.str.match(r'(.)*remote(.)*(?i)')) | (df.Country == 'Remote'), 'City'] = 'Remote'
df.loc[(df.City.str.match(r'(.)*wfh(.)*(?i)')), 'City'] = 'Remote'

Dealing with Large Cities and their common abbreviations:

In [636]:
#NYC
df.loc[(df.City.str.match(r'(.)*new york(.)*(?i)')) |
       (df.City.str.match(r'(.)*n[\.\s]*y[\.\s]*c[\.\s]*\s*(?i)')) |
       (df.City.str.match(r'(\s)*n[\.\s]*y(\.)*(\s)+(?i)')),
       'City'] = 'New York City'

# SF
df.loc[(df.City.str.match(r'(.)*sa. francis(.)*(?i)')) |
       (df.City.str.match(r'(.)*s[\.\s]*f(\s)+(?i)')),
       'City'] = 'San Francisco'

# LA
df.loc[(df.City.str.match(r'(.)*los angeles(.)*(?i)')) |
       (df.City.str.match(r'(.)*L[\.\s]*A(\s)+')), 'City'] = 'Los Angeles'

# DC
df.loc[(df.City.str.match(r'(.)*d[\.\s]*c[\.]*\s+(.)*(?i)')) |
       (df.City.str.match(r'(.)*district of columbia(.)*(?i)')) |
       (df.City.str.match(r'(.)*D.C.(.)*(?i)')), 'City'] = 'Washington, DC'


In [637]:
# Fixing minor problems
df.loc[df.City == 'Walnut Creek, California ', 'City'] = 'Walnut Creek, CA'

In [638]:

# A lot of entries are missing 'State' column value, however, 'City input
# contains state information.

# we could try to impute 'State' column from it
 
df.loc[(df.City.str.match(r'(.)*,(.)*')) & (df.State == 'Unknown')]



Unnamed: 0,Timestamp,Age,Industry,Job_title,Annual_salary,Other_monetary_comp,Currency,Country,State,City,Overall_experience,Experience,Education,Gender,Race
10,2021-04-27 11:03:03,25-34,Nonprofits,Office Manager,47500,0,USD,United States,Unknown,"Boston, MA",5-7 years,5-7 years,College degree,Woman,White
224,2021-04-27 11:06:59,25-34,Politics,Deputy C-Level,105000,0,USD,United States,Unknown,"Washington, DC",8 - 10 years,8 - 10 years,College degree,Woman,White
278,2021-04-27 11:07:40,45-54,Insurance,Supervisor,59000,7000,USD,United States,Unknown,"Omaha, NE",21 - 30 years,21 - 30 years,Some college,Woman,White
601,2021-04-27 11:12:23,35-44,Education (Higher Education),Lecturer,25000,0,USD,United States,Unknown,"Albany, NY",8 - 10 years,8 - 10 years,PhD,Woman,White
604,2021-04-27 11:12:24,45-54,Nonprofits,Vp Of Communications,154000,0,USD,United States,Unknown,"Walnut Creek, CA",11 - 20 years,11 - 20 years,Master's degree,Woman,White
802,2021-04-27 11:15:07,25-34,Government And Public Administration,"Director, Government Relations",139000,0,USD,United States,Unknown,"Washington, DC",5-7 years,5-7 years,College degree,Woman,White
856,2021-04-27 11:15:58,25-34,Media & Digital,Senior Managing Editor,110000,10000,USD,United States,Unknown,"Washington, DC",11 - 20 years,8 - 10 years,College degree,Woman,White
866,2021-04-27 11:16:09,35-44,Health Care,Physician (Fm),200000,0,USD,United States,Unknown,"Charlotte, NC",2 - 4 years,2 - 4 years,PhD,Woman,"Asian or Asian American, White"
869,2021-04-27 11:16:11,35-44,"Accounting, Banking & Finance",Director,158000,76000,USD,United States,Unknown,"Jersey City, NJ",11 - 20 years,11 - 20 years,College degree,Man,White
1085,2021-04-27 11:19:56,25-34,Government And Public Administration,Analyst,87000,0,USD,United States,Unknown,"Washington, DC",5-7 years,5-7 years,Master's degree,Woman,White


In [639]:
# Creating a dictionary with US state abbreviations to fill out 'Unknown'
# values in 'State' column:
states = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'DC': 'District of Columbia',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}


In [640]:
# Where 'State' is 'Unknown', but 'City' has state information, 
# we take state from there, decode the abbreviaiton, input it instead of
# 'Unknown' in 'State' column, and erase it from the 'City' for data
# consistency

import re

for i in range(0, 27696):
    if df.loc[i, 'State'] == 'Unknown' and  re.match(r'(.)*,(.)*', df.loc[i, 'City']):
       if df.loc[i, 'City'][-2:].upper() in states.keys():
           df.loc[i, 'State'] = states[df.loc[i, 'City'][-2:].upper()]
           df.loc[i, 'City'] = df.loc[i, 'City'][:-4]

In [641]:
# fixing dashed and n/a cities
df.loc[(df.City.str.match(r'(.)*--')) | (df.City.str.match(r'(.)*n/a\s(.)*(?i)')), 'City'] = 'Unknown'
# fixing 'sorryful' entries
df.loc[(df.City.str.match(r'(.)*sorry')), 'City'] = 'Unknown'
# Fixing 'prefer not to say'
df.loc[(df.City.str.match(r'(.)*prefer')), 'City'] = 'Unknown'
# Fixing 'too identifiable'
df.loc[(df.City.str.match(r'(.)*identif')), 'City'] = 'Unknown'

In [642]:
df.loc[(df.City.str.match(r'(.)*area'))]['City']

138                                     Detroit Metro area
259                                    Greater Boston area
322                                 Twin cities metro area
332                                               DFW area
489                                         Ann Arbor area
689                                     Atlanta metro area
843                                            Durham area
1015                             Greater Philadelphia area
1386                                          Houston area
1394                                              Bay area
1440                                        Vancouver area
1632                                    Orlando metro area
1701                                     Georgian Bay area
1709                                          Hanover area
1720                             St. Paul/Minniapolis area
2040                                           Boston area
2367                                        Milwaukee ar

In [643]:
print(df.City.nunique())
print(list(df.City.unique()))

4413
['Boston', 'Cambridge', 'Chattanooga', 'Milwaukee', 'Greenville', 'Hanover', 'Columbia', 'Yuma', 'St. Louis', 'Palm Coast', 'Scranton', 'Detroit', 'Saint Paul', 'Remote', 'Lincoln', 'Chicago', 'Pomona', 'Atlanta', 'Boca Raton', 'Philadelphia', 'Toronto', 'Dayton', 'Bradenton', 'Ann Arbor', 'Washington DC', 'Silver Spring', 'Washington', 'San Antonio', 'Minneapolis', 'Washington, DC', 'Richmond', 'Research Triangle', 'Kalamazoo', 'Manhattan', 'Sacramento', 'Dallas', 'waynesboro', 'Liverpool', 'Pittsburgh', 'Arlington, VA', 'Chapel Hill', 'Vancouver', 'Berkeley', 'Apple Valley', 'Troy', 'DC', 'Linden', 'Bristol', 'Champaign', 'Nova Scotia', 'Ottawa', 'Bryan', 'Providence', 'Bloomington', 'Denver', 'New York City', 'Colorado Springs', 'Southampton', 'Unknown', 'Raleigh', 'Edinburgh', 'St. Paul', 'Philadelphia (suburbs)', 'Kitchener, Ontario', 'Portland ', 'Nashville', 'Newcastle, UK', 'Seattle', 'Indianapolis', 'Baltimore', 'Solon', 'Montreal', 'Gainesville', 'Lafayette', 'Phoenix', 

In [644]:
list_of_cities = list(df.City.unique())



In [645]:
existing_cities = []

for city in cities:
    existing_cities.append(city.dict['name'])
    
print(existing_cities)

['Skrad', 'Zapponeta', 'Pio Duran', 'Bolszewo', 'East Port Orchard', 'Catumbela', 'Staindrop', 'Dhūri', 'Lentate sul Seveso', 'Rîbniţa', 'San Nicolás Tolentino', 'Punghina', 'Anneberg', 'Sobotka', 'Kamp-Lintfort', 'Airth', 'Marmentino', 'Old Kilcullen', 'Tarrytown', 'Linda', 'Townsend', 'East Renton Highlands', 'Schliersee', 'Tōno', 'Krasang', 'San Sebastián Alcomunga', 'Kakamas', 'Škabrnje', 'Bolków', 'Dhupgāri', 'Zanica', 'Lenta', 'Pio', 'Blender', 'Pungeşti', 'Björklinge', 'Linden', 'Stainburn', 'Airmyn', 'East Wenatchee', 'Sobotín', 'Oldcastle', 'Monticelli Brusati', 'Catabola', 'La Reforma', 'Lexington', 'Rezina', 'Kamp-Bornhofen', 'Sisak', 'Jozini', 'Middletown', 'Krasae Sin', 'Terrace Heights', 'Dhuliān', 'Marieberg', 'Truro', 'Zanè', 'Lenola', 'Muscoline', 'Pinukpuk', 'Airdrie', 'Puieşti', 'Lindsay', 'Schlierbach', 'Colonia del Sol', 'Camacupa', 'Stainborough', 'Newtown Trim', 'Tomobe', 'Soběslav', 'Boleszkowice', 'East Wenatchee Bench', 'Blekendorf', 'Liberty', 'Pervomaisc', '

In [646]:
df_exp = df.loc[:150]
print(df_exp.City.unique())

['Boston' 'Cambridge' 'Chattanooga' 'Milwaukee' 'Greenville' 'Hanover'
 'Columbia' 'Yuma' 'St. Louis' 'Palm Coast' 'Scranton' 'Detroit'
 'Saint Paul' 'Remote' 'Lincoln' 'Chicago' 'Pomona' 'Atlanta' 'Boca Raton'
 'Philadelphia' 'Toronto' 'Dayton' 'Bradenton' 'Ann Arbor' 'Washington DC'
 'Silver Spring' 'Washington' 'San Antonio' 'Minneapolis' 'Washington, DC'
 'Richmond' 'Research Triangle' 'Kalamazoo' 'Manhattan' 'Sacramento'
 'Dallas' 'waynesboro' 'Liverpool' 'Pittsburgh' 'Arlington, VA'
 'Chapel Hill' 'Vancouver' 'Berkeley' 'Apple Valley' 'Troy' 'DC' 'Linden'
 'Bristol' 'Champaign' 'Nova Scotia' 'Ottawa' 'Bryan' 'Providence'
 'Bloomington' 'Denver' 'New York City' 'Colorado Springs' 'Southampton'
 'Unknown' 'Raleigh' 'Edinburgh' 'St. Paul' 'Philadelphia (suburbs)'
 'Kitchener, Ontario' 'Portland ' 'Nashville' 'Newcastle, UK' 'Seattle'
 'Indianapolis' 'Baltimore' 'Solon' 'Montreal' 'Gainesville' 'Lafayette'
 'Phoenix' 'Boulder' 'Austin' 'Halifax' 'Portsmouth' 'Nottingham'
 'Prefer not

In [648]:
# Cleaning City Column

unique_city_names = df.City.unique()

for name in unique_city_names:
    for city in existing_cities:
        if re.match(f'(.)*{city}(.)*', name):
            df.loc[df.City == name, 'Decoded_city'] = city
            break
        else:
            df.loc[df.City == name, 'Decoded_city'] = 'Unknown'
            
            


KeyboardInterrupt: 

In [None]:
df.Decoded_city.unique()

In [594]:


for row in range(0, 27696):
    for city in existing_cities:
        if re.match(f'(.)*{city}(.)*', df.loc[row, 'City']):
            df.loc[row, 'City'] = city
            break
    else:   
        df.loc[row, 'City'] = 'Unknown'
    
    
df.City.unique()
            
        
    

KeyboardInterrupt: 

In [539]:
'Newcastle, UK' in existing_cities

False

In [346]:
from allcities import cities
bad_names = []
for city in list_of_cities:
    if cities.filter(name=city) == cities.filter(name='xxxxxxxxxxx'):
        bad_names.append(city)
        
print(bad_names)

['Remote', 'Washington DC', 'Washington, DC', 'Research Triangle', 'Arlington, VA', 'Nova Scotia', 'Unknown', 'Philadelphia (suburbs)', 'Kitchener, Ontario', 'Newcastle, UK', 'Montreal', 'Prefer not to answer', 'London, ON CAN', 'Philadelphia ', 'Whippany', 'Detroit Metro area', 'Newtown Square', 'Shaker Hts', 'Northeast Florida', 'Research Triangle region', 'Gorham/Portland', 'Rural Ontario', 'Greater Boston area', 'Western Mass', 'Indianapolis ', 'Houston Area', 'Pittsburgh ', 'Metro Boston', 'Twin cities metro area', 'DFW area', 'Suwanee ', 'DFW', 'Prefer not to say (Montana is small!)', 'Panhandle region', 'Bay Area', 'Springfield, IL', 'Columbia, SC', 'Harrisonburg, VA', 'Eglin AFB', 'Work from home ', 'Billingham ', 'Sunnyvale ', 'Ann Arbor area', 'Verysmall town', 'Rockland ', 'Twin Cities suburbs ', 'Port Carling, Muskoka, Ontario', 'Waterloo, ON', 'Toronto, ON', 'St Louis', 'Lansing ', 'Cornwall, Ontario', 'Saskatoon ', 'Huntsville ', 'City of Buenos Aires', 'Tallahassee, FL',

In [361]:
unmatched = []
for i in bad_names:
    for city in existing_cities:
        if not re.match(f'(.)*{city}(.)*', i):
            unmatched.append(i)
        else:
            break
        
print(unmatched)

MemoryError: 

In [376]:
print(list(df.City.unique()))

['Boston', 'Cambridge', 'Chattanooga', 'Milwaukee', 'Greenville', 'Hanover', 'Columbia', 'Y', 'St. Louis', 'Palm Coast', 'Boston+++', 'Scranton+++', 'Detroit+++', 'Pau+++', 'Re+++', 'Lincoln+++', 'Chicago+++', 'Pomona+++', 'Atlanta+++', 'Boca+++', 'Philadelphia+++', 'Toro+++', 'Dayton+++', 'Bra+++', 'Ann Arbor+++', 'Washington+++', 'Silver Spring+++', 'San Antonio+++', 'Minneapolis+++', 'St. Louis+++', 'Richmond+++', 'Triangle+++', 'Kalama+++', 'Man+++', 'Sacramento+++', 'Dallas+++', 'waynesboro+++', 'Liverpool+++', 'Pittsburg+++', 'Arlington+++', 'Chapel Hill+++', 'Vancouver+++', 'Berkeley+++', 'Apple Valley+++', 'Troy+++', 'DC+++', 'Linden+++', 'Bristol+++', 'Cham+++', 'Nov+++', 'Ottawa+++', 'Bryan+++', 'Providence+++', 'Bloomington+++', 'Denver+++', 'Y+++', 'Colorado Springs+++', 'Southampton+++', 'Columbia+++', 'Un+++', 'Raleigh+++', 'Ed+++', 'Ontario+++', 'Portland+++', 'Nashville+++', 'Newcastle+++', 'Seattle+++', 'Indiana+++', 'Baltimore+++', 'Solon+++', 'Mon+++', 'Gainesville++

In [295]:
df.loc[881, 'City'] = 'Remote'
df.loc[7500, 'City'] = 'Remote'
df.loc[18714, 'City'] = 'Remote'



In [296]:
df.City[df.City.str.len() == 30]

527      Port Carling, Muskoka, Ontario
6238     “Large” Canadian prairie city.
7603     Minneapolis/St Paul metro area
13189    Greater Minneapolis metro area
25892    independence (cincinnati area)
26089    Littleton, MA / Manchester, NH
26236    Te Whanganui-a-Tara Wellington
Name: City, dtype: object

In [297]:
pd.set_option('display.max_rows', None)
print(df.City.value_counts())

New York City                                                                                                                                                                  1606
Boston                                                                                                                                                                          771
Chicago                                                                                                                                                                         746
Seattle                                                                                                                                                                         690
San Francisco                                                                                                                                                                   639
London                                                                                              

In [906]:
pd.options.display.max_rows = 4000
print(list(df.City.unique()))

['Boston', 'Cambridge', 'Chattanooga', 'Milwaukee', 'Greenville', 'Hanover', 'Columbia', 'Yuma', 'St. Louis', 'Palm Coast', 'Boston, MA', 'Scranton', 'Detroit', 'Saint Paul', 'Remote', 'Lincoln', 'Chicago', 'Pomona', 'Atlanta', 'Boca Raton', 'Philadelphia', 'Toronto', 'Dayton', 'Bradenton', 'Ann Arbor', 'Washington DC', 'Silver Spring', 'Washington', 'San Antonio', 'Minneapolis', 'Washington, DC', 'Richmond', 'Research Triangle', 'Kalamazoo', 'Manhattan', 'Sacramento', 'Dallas', 'waynesboro', 'Liverpool', 'Pittsburgh', 'Arlington, VA', 'Chapel Hill', 'Vancouver', 'Berkeley', 'Apple Valley', 'Troy', 'DC', 'Linden', 'Bristol', 'Champaign', 'Nova Scotia', 'Ottawa', 'Bryan', 'Providence', 'District of Columbia', 'Bloomington', 'Denver', 'New York City', 'Colorado Springs', 'Southampton', 'prefer not to answer', 'NYC (remotely)', 'Raleigh', 'New York', '-----', 'Edinburgh', 'St. Paul', 'Philadelphia (suburbs)', 'Kitchener, Ontario', 'Portland ', 'Nashville', 'Newcastle, UK', 'Seattle', 'Ind

In [None]:
df.loc[881, 'City'] = 'Remote'
df.loc[7500, 'City'] = 'Remote'
df.loc[18714, 'City'] = 'Remote'