# Exercise 3: Fix incorrect values in the 'State' column
In this exercise, we will clean the variable 'State' of a synthetic dataset listing finance officers in the USA.
The original dataset was shared by Forest Gregg and Derek Eder and can be found here: https://github.com/dedupeio/dedupe-examples/blob/master/extended-variables/officers.csv

1. Open a new Colab notebook and import the pandas package:

In [0]:
import pandas as pd

2. Assign the link to the dataset to a variable called 'file_url':

In [0]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter11/dataset/officers.csv'

3. Using the read_csv function from the pandas package, load the dataset into a new variable called 'df':

In [0]:
df = pd.read_csv(file_url)

4. Print the first 5 rows of the dataframe using the method .head():

In [4]:
df.head()

Unnamed: 0,ID,City,State,Zip,Title,RedactionRequested
0,804,Glenview,IL,60025,Treasurer,False
1,9177,Harrisburg,IL,62946,Treasurer,False
2,53011,Chicago,IL,60606,Treasurer,False
3,9176,Harrisburg,IL,62946,Chairman,False
4,33020,Mechanicsburg,IL,62545,Chairman,False


5. Print out all the unique values of the variable 'State':

In [5]:
df['State'].unique()

array(['IL', 'PA', 'DC', 'Il', nan, 'WI', 'CA', 'MO', 'NC', 'IA', 'MA',
       'IN', 'MI', 'TN', 'NY', 'ng', 'TX', 'CO', 'NV', 'il', 'WA', '8I',
       'In', 'iL', 'OH', 'SC', 'VA', 'NM', 'FL', 'LA', 'GA', 'II', 'NJ',
       'MD', 'I', 'AR', 'KS', 'DE', '60', 'SD', 'MN', 'VT', 'OK', 'KY',
       'CT', 'NH', 'AZ', 'OR', 'PR', 'RI'], dtype=object)

All the states are encoded as two uppercase characters. However, there are some incorrect values with lowercase characters, such as 'il' or 'iL', and unexpected values such as '8I', 'I' or '60'. In the next steps, we are going to fix these issues.
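The visual check above can also be expressed as a format rule. The following is a minimal sketch, assuming a small toy Series named `states` (a hypothetical stand-in for `df['State']`, filled with values taken from the output of step 5) and using `.str.fullmatch()` to flag anything that is not exactly two uppercase letters:

```python
import pandas as pd

# Hypothetical toy stand-in for df['State']; values come from the step 5 output.
states = pd.Series(['IL', 'il', 'iL', '8I', 'I', '60', None, 'PA'])

# True where the value is exactly two uppercase letters.
# na=False makes missing values evaluate to False instead of NaN.
is_valid = states.str.fullmatch(r'[A-Z]{2}', na=False)

# Keep only the non-missing values that fail the format check.
suspicious = states[~is_valid & states.notna()]
print(suspicious.tolist())
```

A format rule like this only catches malformed codes. A value such as 'II' has the right shape and would pass, so inspecting the unique values by eye remains useful.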

6. Print out the rows that have the value 'il' in the column 'State' using the pandas method .str.contains() and the subsetting API, dataframe[condition]. You will also have to set the parameter 'na' of '.str.contains()' to False to exclude observations with missing values:


In [6]:
df[df['State'].str.contains('il', na=False)]

Unnamed: 0,ID,City,State,Zip,Title,RedactionRequested
4245,47448,Chicago,il,60619,Treasurer,False
4651,47447,Chicago,il,60623-1614,Chairman,False
4652,54025,Chicago,il,60623-1614,Chairman,False
18939,39418,Kingston,il,60145,Chairman,False
29699,27124,Hampshire,il,60140,Chairman,False
43761,29179,McHenry,il,60050,Admin Asst,False


We can see that all the cities with the value 'il' are actually in the state of Illinois, so the correct 'State' value should be 'IL'. The values 'Il' and 'iL' probably refer to Illinois as well. Let's have a look at them next.

7. Create a for loop that iterates through the values 'Il' and 'iL' of the 'State' column and prints, for each value, the variables 'City' and 'State' of the matching observations using the pandas indexer '.loc': 'dataframe.loc[row_condition, column_condition]'

In [7]:
for state in ['Il', 'iL']:
  print(df.loc[df['State'] == state, ['City', 'State']])

            City State
43        Ottawa    Il
44        Ottawa    Il
493    Galesburg    Il
613      Chicago    Il
614      Chicago    Il
...          ...   ...
54915    Chicago    Il
54916    Chicago    Il
54918    Chicago    Il
54919    Chicago    Il
54921    Chicago    Il

[665 rows x 2 columns]
         City State
7052  Wheaton    iL


As we thought, all these cities belong to the state of Illinois, so let's replace these values with the correct one.

8. Create a condition mask to subset all rows that contain the 3 incorrect values ('il', 'Il', 'iL') using the method isin() with the list of values as parameter, and save the result into a variable called il_mask:

In [0]:
il_mask = df['State'].isin(['il', 'Il', 'iL'])

9. Print the number of rows that match the condition set in il_mask by using the method .sum(). This sums all the rows with the value True (i.e. those that match the condition).

In [9]:
il_mask.sum()

672
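This total is consistent with the earlier outputs: 6 rows with 'il' in step 6, plus 665 rows with 'Il' and 1 row with 'iL' in step 7. As a rough sketch of how such a breakdown could be checked, the example below uses a hypothetical toy Series `states` rather than the real dataset, and combines the mask with `.value_counts()`:

```python
import pandas as pd

# Hypothetical toy stand-in for df['State'].
states = pd.Series(['IL', 'il', 'Il', 'Il', 'iL', None, 'PA'])

# Boolean mask: True for rows holding one of the lowercase/mixed-case variants.
mask = states.isin(['il', 'Il', 'iL'])

# True counts as 1 and False as 0, so the sum is the number of matching rows.
total = mask.sum()

# Number of rows per variant among the matching rows.
per_variant = states[mask].value_counts()

print(total)
print(per_variant.to_dict())
```

The per-variant counts should add up to the total returned by `.sum()`, which is a quick sanity check before overwriting any values.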

10. Using the pandas indexer '.loc', subset the rows with the condition mask il_mask and replace the value of the column 'State' with 'IL':

In [0]:
df.loc[il_mask, 'State'] = 'IL'
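An alternative to listing every case variant is to upper-case the whole column with `.str.upper()`. This is not what the exercise does, and it is broader than the masked assignment: it touches every value, so 'In' becomes 'IN' and 'ng' becomes 'NG' as well. A minimal sketch on a hypothetical toy Series `states`:

```python
import pandas as pd

# Hypothetical toy stand-in for df['State'].
states = pd.Series(['IL', 'il', 'Il', 'iL', 'In', 'ng', None])

# Upper-case every string value; missing values stay missing.
upper = states.str.upper()

# Show the non-missing results.
print(upper.dropna().tolist())
```

The masked assignment used in the exercise is more conservative because it only changes the rows that were inspected first.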

11. Print out all the unique values of the variable 'State':

In [11]:
df['State'].unique()

array(['IL', 'PA', 'DC', nan, 'WI', 'CA', 'MO', 'NC', 'IA', 'MA', 'IN',
       'MI', 'TN', 'NY', 'ng', 'TX', 'CO', 'NV', 'WA', '8I', 'In', 'OH',
       'SC', 'VA', 'NM', 'FL', 'LA', 'GA', 'II', 'NJ', 'MD', 'I', 'AR',
       'KS', 'DE', '60', 'SD', 'MN', 'VT', 'OK', 'KY', 'CT', 'NH', 'AZ',
       'OR', 'PR', 'RI'], dtype=object)

Great! We see that these incorrect values are not present anymore. Let's have a look at the remaining incorrect values: 'II', 'I', '8I', '60'.

12. Print out the rows that have the value 'II' in the column 'State' using the pandas subsetting API, dataframe.loc[row_condition, column_condition]:



In [12]:
df.loc[df['State'] == 'II']

Unnamed: 0,ID,City,State,Zip,Title,RedactionRequested
14340,28039,Bloomington,II,61704,Co-Chairman,False
14341,31994,Bloomington,II,61704,Chairman,False


There are only 2 cases with the value 'II' in the 'State' column, and both have Bloomington as their city, which is in Illinois. So the correct 'State' value should be 'IL'.

13. Create a for loop that iterates through the 3 other incorrect values ('I', '8I', '60') and print out the subsetted rows using the same logic as in step 12 but only displaying the columns 'City' and 'State':

In [13]:
for val in ['I', '8I', '60']:
  print(df.loc[df['State'] == val, ['City', 'State']])

              City State
17596  Bloomington     I
             City State
5513  Springfield    8I
          City State
28060  Chicago    60


All the observations with these incorrect values are cities from Illinois. Let's fix them now.

14. Create a for loop that iterates through the 4 incorrect values ('II', 'I', '8I', '60') and reuse the subsetting logic from step 12 to replace the value in 'State' with 'IL':

In [0]:
for val in ['II', 'I', '8I', '60']:
  df.loc[df['State'] == val, 'State'] = 'IL'
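The same correction can also be written without a loop. The sketch below is an alternative and not the exercise's method: it passes a dictionary (named `fixes` here for illustration) to `Series.replace()`, which substitutes whole values that match a key exactly. It runs on a hypothetical toy Series `states`:

```python
import pandas as pd

# Hypothetical toy stand-in for df['State'].
states = pd.Series(['IL', 'II', 'I', '8I', '60', 'PA', None])

# Map each incorrect value to its correction (whole-value matches only).
fixes = {'II': 'IL', 'I': 'IL', '8I': 'IL', '60': 'IL'}

# Values not listed in the mapping are left unchanged.
cleaned = states.replace(fixes)

print(cleaned.dropna().tolist())
```

A mapping keeps all corrections in one place, which can be easier to review when different incorrect values need different targets.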

15. Print out all the unique values of the variable 'State':

In [15]:
df['State'].unique()

array(['IL', 'PA', 'DC', nan, 'WI', 'CA', 'MO', 'NC', 'IA', 'MA', 'IN',
       'MI', 'TN', 'NY', 'ng', 'TX', 'CO', 'NV', 'WA', 'In', 'OH', 'SC',
       'VA', 'NM', 'FL', 'LA', 'GA', 'NJ', 'MD', 'AR', 'KS', 'DE', 'SD',
       'MN', 'VT', 'OK', 'KY', 'CT', 'NH', 'AZ', 'OR', 'PR', 'RI'],
      dtype=object)

Great job! We fixed the issues for the state of Illinois. But there are still 2 more incorrect values in this column: 'In' and 'ng'.

16. Repeat step 13 but iterating through the values 'In' and 'ng' instead:

In [16]:
for val in ['In', 'ng']:
  print(df.loc[df['State'] == val, ['City', 'State']])

           City State
5733  Sherville    In
            City State
2428  none given    ng
2961  none given    ng


The rows with the value 'ng' in 'State' are actually missing values (their 'City' is 'none given'), and we will cover this topic in the next section. The observation with 'In' is actually a city in Indiana, so the correct value should be 'IN'. Let's fix it.

17. Subset the rows containing the value 'In' in 'State' using the indexer '.loc' and the method '.str.contains()' and replace the state value with 'IN'. Don't forget to specify the parameter 'na=False' for '.str.contains()':

In [0]:
df.loc[df['State'].str.contains('In', na=False), 'State'] = 'IN'
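Note that `.str.contains()` matches substrings, so it selects any value that contains 'In' and not only values equal to 'In'. In this dataset that makes no difference, since 'In' is the only such value among the unique values printed in step 15. The sketch below illustrates the distinction on a hypothetical toy Series `states`, where 'Ind' is an invented value that does not appear in the dataset:

```python
import pandas as pd

# Hypothetical toy stand-in for df['State']; 'Ind' is invented for illustration.
states = pd.Series(['In', 'IN', 'Ind', None, 'IL'])

# Substring match: True for every value that contains 'In' (case-sensitive).
by_contains = states.str.contains('In', na=False)

# Exact match: True only for values equal to 'In'.
by_equality = states == 'In'

print(by_contains.tolist())  # [True, False, True, False, False]
print(by_equality.tolist())  # [True, False, False, False, False]
```

When the goal is to target one specific value, an equality test or `.isin()` is the stricter choice.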

18. Print out all the unique values of the variable 'State':

In [18]:
df['State'].unique()

array(['IL', 'PA', 'DC', nan, 'WI', 'CA', 'MO', 'NC', 'IA', 'MA', 'IN',
       'MI', 'TN', 'NY', 'ng', 'TX', 'CO', 'NV', 'WA', 'OH', 'SC', 'VA',
       'NM', 'FL', 'LA', 'GA', 'NJ', 'MD', 'AR', 'KS', 'DE', 'SD', 'MN',
       'VT', 'OK', 'KY', 'CT', 'NH', 'AZ', 'OR', 'PR', 'RI'], dtype=object)

Excellent! We just fixed all the incorrect values for the variable 'State' using the methods provided by the package 'pandas'. We haven't handled the missing value cases but this will be the topic of the next section.