# Cleaning string data
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
For strings, this process starts by applying a list of common functions (lower, strip), then handles missing and duplicate entries, and finally applies custom functions as needed.
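As a rough sketch of those three stages, here is the same sequence on a tiny made-up Series (the values are hypothetical, not taken from the survey data):

```python
import pandas as pd

# Toy data (hypothetical), not the survey dataset
raw = pd.Series(["  USA", "usa ", "Canada", None, "canada"])

# 1. common functions: lowercase and trim whitespace
cleaned = raw.str.lower().str.strip()

# 2. missing entries: fill with a placeholder
cleaned = cleaned.fillna("unknown")

# 3. custom function: collapse a known variant into one spelling
cleaned = cleaned.map(lambda s: "usa" if s == "united states" else s)

unique_values = sorted(cleaned.unique())
print(unique_values)
```

Five raw entries collapse to three distinct values, which is the kind of reduction the rest of this notebook chases on the real column.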

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
import pandas as pd
import numpy as np

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

from IPython import get_ipython
ipython = get_ipython()

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# autoreload extension
if 'autoreload' not in ipython.extension_manager.loaded:
    %load_ext autoreload

%autoreload 2

# Set max rows and columns displayed in jupyter
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 10)

## Load a dataset

In [2]:
df=pd.read_csv('../projects/proj1/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.csv')
df.head()
print(df.shape)

Unnamed: 0,Timestamp,How old are you?,Industry,Job title,Additional context on job title,...,Overall years of professional experience,Years of experience in field,Highest level of education completed,Gender,Race
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,...,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,...,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,...,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,...,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,...,8 - 10 years,5-7 years,College degree,Woman,White


(27609, 18)


## Let's handle missing Country data (it would break the steps below)
There are lots of ways to do this. I'm going to do it the simple way: replace NaN with UNKNOWN.

This is not a good idea in general, especially if you have a way to figure out the country from the other data present.

In [3]:
#how many missing Countries
df.Country.isnull().sum()

0

In [4]:
# fill any missing values with a placeholder
df.Country = df.Country.fillna('UNKNOWN')
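If another column hints at the country, it may be worth using it before falling back to the placeholder. The sketch below assumes a hypothetical, abbreviated set of US state names and toy data; it is not the approach taken in this notebook.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "Country": ["usa", None, None],
    "State": ["Ohio", "Texas", None],
})

# Hypothetical, abbreviated lookup; a real one would list every state
us_states = {"Ohio", "Texas"}

# Rows with no country but a recognizable US state
missing = toy["Country"].isna()
in_us_state = toy["State"].isin(us_states)
toy.loc[missing & in_us_state, "Country"] = "usa"

# Whatever is still missing gets the placeholder
toy["Country"] = toy["Country"].fillna("UNKNOWN")
result = toy["Country"].tolist()
print(result)
```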

## There is a Country column, let's use it to get all the USA entries. First, take a look at the number of unique entries

In [5]:
#how many different countries are there
df.Country.nunique()

364

## How many occurrences for each unique entry?

In [6]:
#let's see what we have
vc=df.Country.value_counts()
print(f'There are {len(vc)} unique entries')
vc[:50]
# vc[50:100]

There are 364 unique entries


United States                8844
USA                          7847
US                           2572
Canada                       1549
United States                 652
U.S.                          571
UK                            566
United Kingdom                540
USA                           468
Usa                           441
United States of America      421
Australia                     312
United states                 203
usa                           180
Germany                       168
England                       134
united states                 113
Us                            103
Ireland                       102
New Zealand                    94
Uk                             84
Canada                         75
Australia                      67
United Kingdom                 65
France                         63
U.S.A.                         46
United States of America       43
Spain                          40
Netherlands                    40
Scotland      

## It looks like there was no validation on what a user could enter in this field. Anyway, let's get to it

## Apply lower and strip to get the easy gains

In [7]:
df.Country = df.Country.map(str.lower).map(str.strip)

#how many different countries are there now
df.Country.nunique()

250
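One thing to keep in mind: mapping `str.lower` directly only works because the missing values were filled earlier. A small sketch on toy data (hypothetical values) showing that the vectorized `.str` accessor passes NaN through, while `map(str.lower)` raises on it:

```python
import numpy as np
import pandas as pd

# Toy data with a missing value (hypothetical)
s = pd.Series([" USA ", "Usa", np.nan], dtype=object)

# The .str accessor skips missing values and leaves them as NaN
vectorized = s.str.lower().str.strip()

# str.lower cannot be applied to a float NaN, so map raises TypeError
try:
    s.map(str.lower)
    map_failed = False
except TypeError:
    map_failed = True

print(vectorized.tolist(), map_failed)
```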

## Looks like a lot of punctuation, let's get rid of it
Use regular expressions.

In [8]:
#the regular expressions package
import re
import string

#character class matching any ASCII punctuation character
#re.escape avoids escaping each character by hand (and the invalid "\`" escape)
punc = "[" + re.escape(string.punctuation) + "]"

def fun1(x):
    #re.sub removes every punctuation char matched by punc
    return re.sub(punc, '', x)

df.Country = df.Country.map(fun1)

#can do the same thing with a lambda
# df.Country = df.Country.map(lambda x: re.sub(punc, '', x))

#how many different countries are there now
df.Country.nunique()

236
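For plain character deletion, a regular expression is not strictly needed. As an alternative sketch (not what the cell above does), `str.translate` with a deletion table built from `string.punctuation` removes the same ASCII punctuation; `strip_punct` is a name made up for this example.

```python
import string

# Translation table that deletes every ASCII punctuation character
drop_punct = str.maketrans("", "", string.punctuation)

def strip_punct(text):
    """Return text with ASCII punctuation removed."""
    return text.translate(drop_punct)

examples = ["u.s.a.", "cote d'ivoire", "england, uk"]
stripped = [strip_punct(e) for e in examples]
print(stripped)
```

It could be applied to the column the same way, with `df.Country.map(strip_punct)`.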

## Looks like a lot of variations of 'united state'

Another easy gain: let's replace every string that contains 'united state' with 'usa'.

In [9]:
def fun(x):
    """
    replaces any string that contains 'united state' with 'usa'
    """
    if 'united state' in x:
        return 'usa'
    return x


df.Country = df.Country.map(fun)

#how many different countries are there now
df.Country.nunique()

219

## But you might want to do something similar with other strings. Do you write another function, or do something a little more general?
Prefer the general solution: use a Python closure.

## Closures
The problem we face is that map takes a function that takes 1 parameter, and we want it to take 3: the string value passed by map (call it x), the string to search for in x (call it str_to_find), and the string to replace x with if we find str_to_find in x (call it str_replacement).

We can't get around the fact that map only passes 1 parameter to the function. But we can create a function that already knows what str_to_find and str_replacement are. It's called a closure.

In [10]:
def fun1(str_to_find, str_replacement):
    """
    creates findandreplace Closure, which is a stateful function
    that remembers str_to_find and str_replacement values
    returns: findandreplace
    """
    def findandreplace(x):
        if str_to_find in x:
            return str_replacement
        return x
    # in python functions are first class objects
    # we are returning findandreplace, it in turn knows the value of 
    # str_to_find and str_replacement
    return findandreplace
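A closure is not the only way to pre-fill arguments. As an alternative sketch, `functools.partial` from the standard library binds the extra parameters of an ordinary 3-parameter function; `find_and_replace` and `to_usa` are names made up for this example.

```python
from functools import partial

def find_and_replace(x, str_to_find, str_replacement):
    """Return str_replacement if str_to_find occurs in x, else x unchanged."""
    if str_to_find in x:
        return str_replacement
    return x

# Bind the last two parameters; the result takes only x, as map requires
to_usa = partial(find_and_replace,
                 str_to_find="united state",
                 str_replacement="usa")

results = [to_usa(v) for v in ["united states of america", "canada"]]
print(results)
```

Either form gives map the 1-parameter callable it expects; the closure keeps the state in an enclosing scope, while partial stores it on the callable object.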


In [11]:
#using the closure
fn= fun1('usa', 'usa')
df.Country = df.Country.map(fn)

#caution: this is a substring test, so 'us' also matches 'australia', 'austria', 'russia', ...
fn= fun1('us', 'usa')
df.Country = df.Country.map(fn)

fn= fun1('u s', 'usa')
df.Country = df.Country.map(fn)

fn= fun1('unites states', 'usa')  #it's a bit suspicious that 17 people made this mistake
df.Country = df.Country.map(fn)

fn= fun1('united sates', 'usa')
df.Country = df.Country.map(fn)

fn= fun1('unitedstates', 'usa')
df.Country = df.Country.map(fn)

fn= fun1('united stares', 'usa')
df.Country = df.Country.map(fn)
#and so on

In [12]:
#you can simplify the above with a list of str_to_find
#and just iterate over it
vals=['usa', 'us', 'u s', 'unites states', 'united sates', 'unitedstates', 'united stares']
for val in vals:
    fn= fun1(val, 'usa')
    df.Country = df.Country.map(fn)


In [13]:
#how many different countries are there now
df.Country.nunique()

182
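Short search strings deserve some care with a substring test: `'us' in 'australia'` is True, so a closure built on `in` rewrites that entry as well. One possible guard, sketched here with a hypothetical helper `make_word_replacer`, is to match only whole words using a regular expression with `\b` word boundaries.

```python
import re

def make_word_replacer(word, replacement):
    """Closure that replaces x only when word appears as a whole word in x."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")

    def replacer(x):
        return replacement if pattern.search(x) else x

    return replacer

# The plain substring test fires inside longer words
substring_hit = "us" in "australia"

fn = make_word_replacer("us", "usa")
results = [fn(v) for v in ["us", "australia", "the us", "russia"]]
print(substring_hit, results)
```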

## Once you get down to the bottom of the unique values you will probably find a lot of one-offs
For instance, let's see what the values are

In [14]:
vals=df.Country.unique()
vals.sort()
vals

array(['217584year is deducted for benefits', 'afghanistan', 'africa',
       'america', 'aotearoa new zealand', 'argentina',
       'argentina but my org is in thailand', 'bangladesh', 'belgium',
       'bermuda', 'brasil', 'brazil', 'britain', 'bulgaria', 'california',
       'cambodia', 'can', 'canad', 'canada', 'canada ottawa ontario',
       'canadw', 'canadá', 'canda', 'catalonia', 'cayman islands',
       'chile', 'china', 'colombia',
       'company in germany i work from pakistan', 'congo', 'contracts',
       'costa rica', 'cote divoire', 'croatia', 'csnada', 'cuba',
       'currently finance', 'czech republic', 'czechia', 'danmark',
       'denmark', 'ecuador', 'england', 'england gb', 'england uk',
       'england united kingdom', 'englanduk', 'englang', 'eritrea',
       'estonia', 'europe', 'finland', 'france',
       'from new zealand but on projects across apac', 'germany', 'ghana',
       'global', 'great britain', 'greece', 'hartford', 'hong kong',
       'hong konh',
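For one-offs like these, an explicit lookup table is often the least surprising tool. A minimal sketch on toy data, using a hand-written dictionary of a few misspellings seen in the output above (the choice of which entries to map is an assumption of this example):

```python
import pandas as pd

# Hand-written corrections for a few of the one-offs listed above
fixes = {
    "canad": "canada",
    "canda": "canada",
    "canadw": "canada",
    "csnada": "canada",
    "brasil": "brazil",
}

# Toy Series standing in for df.Country
s = pd.Series(["canad", "csnada", "brasil", "chile"])

# With a dict, Series.replace swaps whole values that match a key exactly
fixed = s.replace(fixes)
print(fixed.tolist())
```

Because only exact matches are replaced, entries such as 'canada ottawa ontario' would be left alone and still need their own rule.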

## Notice there are a lot of united-states-ish entries at the bottom, let's see if fuzzywuzzy helps

In [15]:
# this package is available from conda-forge
# !conda install -c conda-forge fuzzywuzzy -y

In [16]:
# helpful modules
import fuzzywuzzy
from fuzzywuzzy import fuzz, process

In [17]:
#each match is (matched string, score, index label of the row in df.Country)
matches = fuzzywuzzy.process.extract("unit", df.Country, limit=50, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
matches

[('united y', 67, 22822),
 ('kuwait', 60, 1915),
 ('unite states', 50, 10447),
 ('unite states', 50, 13362),
 ('united sttes', 50, 14113),
 ('united statws', 47, 2003),
 ('united kindom', 47, 7156),
 ('uniter statez', 47, 16890),
 ('unitef stated', 47, 20206),
 ('united statss', 47, 25473),
 ('united  states', 47, 26059),
 ('lithuania', 46, 11426),
 ('lithuania', 46, 20695),
 ('united kingdom', 44, 1),
 ('united kingdom', 44, 15),
 ('united kingdom', 44, 59),
 ('united kingdom', 44, 93),
 ('united kingdom', 44, 114),
 ('united kingdom', 44, 126),
 ('united kingdom', 44, 147),
 ('united kingdom', 44, 177),
 ('united kingdom', 44, 207),
 ('united kingdom', 44, 295),
 ('united kingdom', 44, 301),
 ('united kingdom', 44, 339),
 ('india', 44, 434),
 ('united kingdom', 44, 493),
 ('united kingdom', 44, 518),
 ('united kingdom', 44, 641),
 ('united kingdom', 44, 648),
 ('united kingdom', 44, 664),
 ('united kingdom', 44, 682),
 ('united kingdom', 44, 699),
 ('united kingdom', 44, 785),
 ('uni

In [18]:
#get the first match to show that the third element identifies the row in the dataframe
df.loc[matches[0][2], 'Country']

'united y'
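Whether that third element is a position or an index label only matters once the index stops being the default 0..n-1 range, for example after filtering rows. The toy frame below (hypothetical values and index) sketches the difference between label-based `loc` and position-based `iloc`:

```python
import pandas as pd

# Toy frame whose index labels are not 0, 1, 2
toy = pd.DataFrame({"Country": ["usa", "kuwait", "united y"]},
                   index=[10, 20, 30])

label = 30

# loc looks rows up by label
by_label = toy.loc[label, "Country"]

# iloc looks rows up by position; position 30 does not exist here
try:
    toy.iloc[label]
    positional_ok = True
except IndexError:
    positional_ok = False

print(by_label, positional_ok)
```

On a freshly loaded DataFrame the two agree, which is why either works in this notebook.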

In [19]:
#they look pretty good, except for united kingdom, kuwait (kuwait???), lithuania, india
#first let's get all the unique matched strings in matches
l = {mtch[0] for mtch in matches}
print(l)

{'united statws', 'unite states', 'united  states', 'india', 'lithuania', 'united kingdom', 'unitef stated', 'united y', 'uniter statez', 'kuwait', 'united statss', 'united kindom', 'united sttes'}


In [20]:
#create a function that will sub in 'usa' for the remaining matches
def subinmatches(df, mtches, dont_sub_these, str_replacement):
    """
    sets Country to str_replacement for every match in mtches
    whose matched string is not in dont_sub_these (modifies df in place)
    """
    for mtch in mtches:
        if mtch[0] in dont_sub_these:
            continue
        #mtch[2] is the row's index label, so use loc rather than iloc
        df.loc[mtch[2], 'Country'] = str_replacement

#gleaned from the previous cell's output
dont_sub_these=['lithuania','kuwait','united kingdom', 'india', 'united kindom']
subinmatches(df,matches,dont_sub_these,'usa')

#how many different countries are there now
df.Country.nunique()

174
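If installing fuzzywuzzy is not an option, the standard library's `difflib` offers a comparable similarity search. This is a swapped-in technique, not what the cells above use, and its scores are on a 0 to 1 scale rather than 0 to 100. The sketch compares against the full target string with a high cutoff, which keeps near-misses such as 'united kingdom' out without a manual exclusion list; the candidate values are copied from the matches above.

```python
from difflib import get_close_matches

# A few candidate strings taken from the fuzzy matches above
candidates = ["unite states", "united sttes", "united kingdom",
              "kuwait", "india"]

# Keep only candidates at least 90% similar to the full target string
close = get_close_matches("united states", candidates, n=5, cutoff=0.9)
print(sorted(close))
```

The cutoff of 0.9 is an assumption for this toy list; on real data it would need tuning against the values it lets through.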

## There is a State field. Is this just US states? Let's pull out the rows that correspond

In [21]:
df.State.nunique()

129
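One way to pull out the matching rows, assuming "correspond" means rows whose cleaned Country is 'usa', is a boolean mask. The sketch below uses toy data with hypothetical values and applies the same lower/strip normalization before counting:

```python
import pandas as pd

# Toy frame standing in for df (hypothetical values)
toy = pd.DataFrame({
    "Country": ["usa", "canada", "usa", "usa"],
    "State": ["Ohio", None, "ohio ", "Texas"],
})

# Boolean mask selects only the rows cleaned to 'usa'
us_rows = toy[toy["Country"] == "usa"]

# Normalize the state strings, then count each distinct value
state_counts = us_rows["State"].str.lower().str.strip().value_counts()
print(state_counts)
```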

## Sigh, here we go again