### Wrangling University of British Columbia Survey Data
In this notebook, I accomplish the following:
* Get an overview of the data.
* Check the Internal ID column for duplicates.
* Clean age values.
* Drop unnecessary columns.
* Combine the four media columns.
* Delete abandoned responses.
* Fill invalid categorical fields.
* Convert each of the three categories of opinion on each lolly to an integer.
* Standardise the country and state fields.
* 'Unpivot' the data, moving lolly preference columns to rows.


In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
pwd = os.getcwd() # Helps with file management.
survey_df = pd.read_excel(os.path.join(pwd, 'data_candyhierarchy2017.xlsx')) # os.path.join avoids the invalid '\d' escape and works on any OS.



### Data Overview

In [3]:
survey_df.head()

Unnamed: 0,Internal ID,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,"Q5: STATE, PROVINCE, COUNTY, ETC",Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,...,Q8: DESPAIR OTHER,Q9: OTHER COMMENTS,Q10: DRESS,Unnamed: 113,Q11: DAY,Q12: MEDIA [Daily Dish],Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo],"Click Coordinates (x, y)"
0,90258773,,,,,,,,,,...,,,,,,,,,,
1,90272821,No,Male,44.0,USA,NM,MEH,DESPAIR,JOY,MEH,...,,Bottom line is Twix is really the only candy w...,White and gold,,Sunday,,1.0,,,"(84, 25)"
2,90272829,,Male,49.0,USA,Virginia,,,,,...,,,,,,,,,,
3,90272840,No,Male,40.0,us,or,MEH,DESPAIR,JOY,MEH,...,,Raisins can go to hell,White and gold,,Sunday,,1.0,,,"(75, 23)"
4,90272841,No,Male,23.0,usa,exton pa,JOY,DESPAIR,JOY,DESPAIR,...,,,White and gold,,Friday,,1.0,,,"(70, 10)"


In [4]:
survey_df.dtypes

Internal ID                   int64
Q1: GOING OUT?               object
Q2: GENDER                   object
Q3: AGE                      object
Q4: COUNTRY                  object
                             ...   
Q12: MEDIA [Daily Dish]     float64
Q12: MEDIA [Science]        float64
Q12: MEDIA [ESPN]           float64
Q12: MEDIA [Yahoo]          float64
Click Coordinates (x, y)     object
Length: 120, dtype: object

It'll be worth converting all these once we've finished surgically altering this dataset.

### Checking for Duplicate Respondents
Never trust an assumption about your data until you've verified it, especially when it's this zany!

In [5]:
print(f'Total IDs: {len(survey_df["Internal ID"])}')
print(f'Unique IDs: {survey_df["Internal ID"].nunique()}')

Total IDs: 2460
Unique IDs: 2460
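
A related check, sketched on a toy frame with made-up IDs (not the survey data): `Series.is_unique` answers the question in one call, and `duplicated(keep=False)` lists the offending rows if the two counts ever differ.

```python
import pandas as pd

# Hypothetical toy data: ID 102 appears twice on purpose.
toy = pd.DataFrame({"Internal ID": [101, 102, 102, 103]})

ids = toy["Internal ID"]
has_duplicates = not ids.is_unique

# keep=False flags every member of a duplicated group, not just the repeats.
duplicate_rows = toy[ids.duplicated(keep=False)]
print(has_duplicates)
print(duplicate_rows)
```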


### Clean Age Values
The data contain a few... aberrant responses.

Note: I'll be making a few copies throughout so it's easier to track my progress at different stages and roll back mistakes. 
The tradeoff is storing a bunch of variables, but for a project of this size it's not a real concern.

In [6]:
survey_df_agedrop = survey_df.copy()

In [7]:
print(f'Null values: {survey_df_agedrop["Q3: AGE"].isna().sum()}')
print(f'Non-numeric values: {len([age for age in survey_df_agedrop["Q3: AGE"] if isinstance(age, str)])}')
print(f'Total values: {len(survey_df_agedrop["Q3: AGE"])}')

Null values: 84
Non-numeric values: 24
Total values: 2460


The 84 nulls plus the 24 non-numeric entries make 108 cells that need to become 0 so the column can be converted from object to a numeric dtype.

In [8]:
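# Coerce unparseable strings to NaN, then use 0 as the 'no usable age' marker.
survey_df_agedrop['Q3: AGE'] = pd.to_numeric(survey_df_agedrop['Q3: AGE'], errors='coerce')
survey_df_agedrop.replace(np.nan, 0, inplace=True) # Note: this fills NaN with 0 in every column, not just age.
# astype returns a new frame, so this unassigned call leaves the column as float64 (see the output below).
survey_df_agedrop.astype({'Q3: AGE': 'int64'}, errors='ignore', copy=False)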
survey_df_agedrop['Q3: AGE'].value_counts()

0.0       108
40.0       92
34.0       90
37.0       89
43.0       86
         ... 
88.0        1
312.0       1
70.5        1
99.0        1
1000.0      1
Name: Q3: AGE, Length: 84, dtype: int64
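
The coerce-then-cast pattern can be sketched on a toy Series. The values below are hypothetical and only mimic the kind of free-text answers in the age column. The point to notice is that `astype` returns a new object, so the cast only sticks if the result is assigned.

```python
import pandas as pd

# Hypothetical ages mixing numbers, free text and a missing value.
ages = pd.Series([44, "old enough", None, 23.0, "45-55"], dtype="object")

numeric = pd.to_numeric(ages, errors="coerce")  # unparseable text becomes NaN
filled = numeric.fillna(0)                      # 0 marks "no usable age"

# astype returns a new Series; without the assignment nothing changes.
as_int = filled.astype("int64")
print(as_int.tolist())
```

Casting to `int64` truncates fractional ages, so whether that is acceptable depends on how answers like 70.5 should be treated.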

In [9]:
print(f'Null values: {survey_df_agedrop["Q3: AGE"].isna().sum()}')
print(f'Non-numeric values: {len([age for age in survey_df_agedrop["Q3: AGE"] if isinstance(age, str)])}')
print(len(survey_df_agedrop['Q3: AGE']))

Null values: 0
Non-numeric values: 0
2460


Now that all the rubbish has been swept out, it's time for a sense-check. It's very unlikely that a 300-year-old is taking this survey. I'm going to assume anyone aged 90 or over is just trying to be funny.

In [10]:
print(f'Number of dubious ages: {len([age for age in survey_df_agedrop["Q3: AGE"] if age >= 90])}')

Number of dubious ages: 8


In [11]:
survey_df_sensecheck = survey_df_agedrop.copy()

In [12]:
survey_df_sensecheck["Q3: AGE"] = [int(age) if age < 90 else 0 for age in survey_df_sensecheck["Q3: AGE"]]
print(f'Number of dubious ages: {len([age for age in survey_df_sensecheck["Q3: AGE"] if age >= 90])}')

Number of dubious ages: 0


I'd also like to exclude a few exceptionally young ages. Halloween lollies are the domain of kids, but I'd say it's highly unlikely any kid under three is capable of coherent opinions. If you were an opinionated two-year-old, don't hesitate to refrain from letting me know.

In [13]:
print(f'Number of far-too-young ages: {len([age for age in survey_df_sensecheck["Q3: AGE"] if age <= 2 and age != 0])}')

Number of far-too-young ages: 1


In [14]:
survey_df_sensecheck["Q3: AGE"] = [int(age) if age > 2 else 0 for age in survey_df_sensecheck["Q3: AGE"]]
print(f'Number of far-too-young ages: {len([age for age in survey_df_sensecheck["Q3: AGE"] if age <= 2 and age != 0])}')

Number of far-too-young ages: 0


To finish up, I'm going to replace all 0s with the median age.

In [15]:
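median_age = survey_df_sensecheck["Q3: AGE"].median() # Computed once; note the 0 placeholders are still in the column at this point.
survey_df_sensecheck["Q3: AGE"] = [age if age > 0 else median_age for age in survey_df_sensecheck["Q3: AGE"]]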
survey_df_sensecheck["Q3: AGE"] = survey_df_sensecheck["Q3: AGE"].astype('int64')
survey_df_sensecheck["Q3: AGE"].value_counts()

41    191
40     92
34     90
37     89
43     86
     ... 
8       2
4       1
88      1
74      1
77      1
Name: Q3: AGE, Length: 74, dtype: int64
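
One caveat with median imputation: if the median is taken while the 0 placeholders are still in the column, they pull it downwards. The sketch below uses hypothetical ages to show the difference. I haven't checked whether it changes the result for the survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages where 0 is the placeholder for "unknown".
ages = pd.Series([0, 0, 30, 40, 50])

median_with_zeros = ages.median()   # placeholders drag this down
known = ages.replace(0, np.nan)     # hide placeholders from the statistic
median_known = known.median()       # NaN is skipped by default

imputed = known.fillna(median_known).astype("int64")
print(median_with_zeros, median_known, imputed.tolist())
```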

### Dropping Unneeded Columns
There are several suspect columns in this dataset that are unlikely to be valuable without the aid of quantum computing.

In [16]:
survey_df_sensecheck.columns

Index(['Internal ID', 'Q1: GOING OUT?', 'Q2: GENDER', 'Q3: AGE', 'Q4: COUNTRY',
       'Q5: STATE, PROVINCE, COUNTY, ETC', 'Q6 | 100 Grand Bar',
       'Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)',
       'Q6 | Any full-sized candy bar', 'Q6 | Black Jacks',
       ...
       'Q8: DESPAIR OTHER', 'Q9: OTHER COMMENTS', 'Q10: DRESS', 'Unnamed: 113',
       'Q11: DAY', 'Q12: MEDIA [Daily Dish]', 'Q12: MEDIA [Science]',
       'Q12: MEDIA [ESPN]', 'Q12: MEDIA [Yahoo]', 'Click Coordinates (x, y)'],
      dtype='object', length=120)

In [17]:
survey_df_sensecheck['Unnamed: 113'].value_counts()

0                                                                          2451
dress (https://survey.ubc.ca/media/assets/user/14372/storage/dress.png)       9
Name: Unnamed: 113, dtype: int64

In [18]:
survey_df_coldrops = survey_df_sensecheck.copy()
survey_df_coldrops.drop(columns=['Q7: JOY OTHER', 'Q8: DESPAIR OTHER', 'Q9: OTHER COMMENTS', 'Unnamed: 113', 'Q10: DRESS'], inplace=True)
survey_df_coldrops.columns

Index(['Internal ID', 'Q1: GOING OUT?', 'Q2: GENDER', 'Q3: AGE', 'Q4: COUNTRY',
       'Q5: STATE, PROVINCE, COUNTY, ETC', 'Q6 | 100 Grand Bar',
       'Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)',
       'Q6 | Any full-sized candy bar', 'Q6 | Black Jacks',
       ...
       'Q6 | Whatchamacallit Bars', 'Q6 | White Bread',
       'Q6 | Whole Wheat anything', 'Q6 | York Peppermint Patties', 'Q11: DAY',
       'Q12: MEDIA [Daily Dish]', 'Q12: MEDIA [Science]', 'Q12: MEDIA [ESPN]',
       'Q12: MEDIA [Yahoo]', 'Click Coordinates (x, y)'],
      dtype='object', length=115)

### Combine Media Columns
The media columns come from a largely unrelated part of the survey, where respondents are shown four mobile homepages and asked to honestly select which one they'd click on.

I want a single media column containing a categorical variable for each of the media organisations.

In [45]:
survey_df_media = survey_df_coldrops.copy()

I'm thinking I handle this via the following process:
1. Change media column values to 1 - 4, where 2 would be 'Science', 3 would be 'ESPN' etc.
2. Collapse all values into new MEDIA column.
3. Convert values 1 - 4 into 'Daily Dish' - 'Yahoo' respectively.
4. Drop all four original media columns.

If any of these clowns have selected multiple answers, we'll know soon enough!
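
One thing to watch with summed codes is that a double selection can masquerade as a single one (1 + 2 = 3 would read as 'ESPN'). A minimal sketch of an alternative, on a hypothetical miniature of the four indicator columns, counts the selections per row first and only then picks the column name. The shortened column names and the 'Multiple' label are my own assumptions, not part of the survey.

```python
import pandas as pd

# Hypothetical miniature of the indicator columns (1.0 = clicked, 0 = not).
media_cols = ["Daily Dish", "Science", "ESPN", "Yahoo"]
toy = pd.DataFrame(
    [[0, 1.0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1.0], [1.0, 1.0, 0, 0]],
    columns=media_cols,
)

picks = (toy[media_cols] > 0).astype(int)
n_selected = picks.sum(axis=1)

# idxmax returns the first column holding a 1; rows with no selection or with
# several selections are then overwritten with explicit labels.
media = picks.idxmax(axis=1)
media = media.where(n_selected > 0, "None")
media = media.where(n_selected <= 1, "Multiple")
print(media.tolist())
```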

In [46]:
def num_changer(data, numadd):
    result = [i + numadd if i > 0 else i for i in data]
    return result

In [47]:
media_data = ['Q12: MEDIA [Daily Dish]', 'Q12: MEDIA [Science]', 'Q12: MEDIA [ESPN]', 'Q12: MEDIA [Yahoo]']

for ind, media_col in enumerate(media_data):
    survey_df_media[media_col] = num_changer(survey_df_media[media_col], ind)

In [48]:
survey_df_media[media_data].loc[1770:1790] # I've taken this oddball slice because it contains responses in each column.

Unnamed: 0,Q12: MEDIA [Daily Dish],Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo]
1770,0.0,0.0,0.0,0.0
1771,0.0,0.0,0.0,0.0
1772,0.0,0.0,0.0,4.0
1773,0.0,2.0,0.0,0.0
1774,0.0,2.0,0.0,0.0
1775,0.0,0.0,0.0,0.0
1776,0.0,2.0,0.0,0.0
1777,0.0,0.0,0.0,0.0
1778,0.0,2.0,0.0,0.0
1779,0.0,2.0,0.0,0.0


In [49]:
survey_df_media['Q12: MEDIA'] = survey_df_media[media_data].sum(axis=1)
survey_df_media['Q12: MEDIA'].loc[1770:1790]

1770    0.0
1771    0.0
1772    4.0
1773    2.0
1774    2.0
1775    0.0
1776    2.0
1777    0.0
1778    2.0
1779    2.0
1780    0.0
1781    0.0
1782    0.0
1783    2.0
1784    1.0
1785    0.0
1786    2.0
1787    3.0
1788    0.0
1789    0.0
1790    2.0
Name: Q12: MEDIA, dtype: float64

In [52]:
valuedict = {0: 'None',
             1: 'Daily Dish',
             2: 'Science',
             3: 'ESPN',
             4: 'Yahoo'}

for key, value in valuedict.items():
    for ind, media in enumerate(survey_df_media['Q12: MEDIA']):
        if media == key:
            survey_df_media.loc[ind, 'Q12: MEDIA'] = value

survey_df_media.loc[1770:1790, 'Q12: MEDIA']

1770          None
1771          None
1772         Yahoo
1773       Science
1774       Science
1775          None
1776       Science
1777          None
1778       Science
1779       Science
1780          None
1781          None
1782          None
1783       Science
1784    Daily Dish
1785          None
1786       Science
1787          ESPN
1788          None
1789          None
1790       Science
Name: Q12: MEDIA, dtype: object

In [53]:
survey_df_media.drop(columns=media_data, inplace=True)

In [54]:
survey_df_media['Q12: MEDIA'].value_counts()

Science       1362
None           847
ESPN            99
Daily Dish      85
Yahoo           67
Name: Q12: MEDIA, dtype: int64

### Delete Abandoned Responses
Some users opened the survey, but immediately closed the tab. We'll delete these responses by flagging responses of 0 across the main question fields, getting the index of each row where the first lolly question ('Q6 | 100 Grand Bar') was left at 0, then dropping those rows.

In [55]:
rowdrops = (survey_df_media.iloc[:, 6:111:1] == 0) # Returns a dataframe of True/ False values, where True = 0.
dropindex = rowdrops.index[rowdrops['Q6 | 100 Grand Bar']].tolist()
dropindex[0:5]

[0, 2, 6, 10, 18]
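
The index above comes from a single column. A stricter variant, sketched here on a hypothetical three-row frame, only treats a row as abandoned when every Q6 answer is the 0 placeholder. The column names `Q6 | A` and `Q6 | B` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature: 0 marks an unanswered question.
toy = pd.DataFrame({
    "Internal ID": [1, 2, 3],
    "Q6 | A": [0, "JOY", 0],
    "Q6 | B": [0, "MEH", "DESPAIR"],
})

q6_cols = [col for col in toy.columns if col.startswith("Q6 |")]

# True only where every Q6 answer in the row is the 0 placeholder.
abandoned = (toy[q6_cols] == 0).all(axis=1)
kept = toy[~abandoned].reset_index(drop=True)
print(kept)
```

Respondent 3 skipped the first question but answered the second, so the stricter rule keeps them.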

In [56]:
survey_df_abandrops = survey_df_media.copy()
survey_df_abandrops.drop(dropindex, inplace=True)
survey_df_abandrops.reset_index(drop=True, inplace=True) # drop=True discards the old index instead of adding an 'index' column.

In [57]:
survey_df_abandrops.head(5)

Unnamed: 0,Internal ID,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,"Q5: STATE, PROVINCE, COUNTY, ETC",Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,...,Q6 | Twix,"Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein",Q6 | Vicodin,Q6 | Whatchamacallit Bars,Q6 | White Bread,Q6 | Whole Wheat anything,Q6 | York Peppermint Patties,Q11: DAY,"Click Coordinates (x, y)",Q12: MEDIA
0,90272821,No,Male,44,USA,NM,MEH,DESPAIR,JOY,MEH,...,JOY,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,Sunday,"(84, 25)",Science
1,90272840,No,Male,40,us,or,MEH,DESPAIR,JOY,MEH,...,JOY,DESPAIR,JOY,JOY,DESPAIR,DESPAIR,DESPAIR,Sunday,"(75, 23)",Science
2,90272841,No,Male,23,usa,exton pa,JOY,DESPAIR,JOY,DESPAIR,...,JOY,MEH,JOY,JOY,DESPAIR,DESPAIR,JOY,Friday,"(70, 10)",Science
3,90272852,No,Male,41,0,0,JOY,DESPAIR,JOY,0,...,JOY,DESPAIR,DESPAIR,JOY,DESPAIR,DESPAIR,JOY,0,"(75, 23)",Science
4,90272854,No,Male,33,canada,ontario,JOY,DESPAIR,JOY,DESPAIR,...,JOY,JOY,MEH,DESPAIR,DESPAIR,DESPAIR,DESPAIR,Friday,"(55, 5)",Science


### Filling Invalid Categorical Fields
'Q1: GOING OUT?' has only two valid choices ('Yes' and 'No'). I'm going to assume any response of 0 means they'd rather not say. This makes it consistent with the gender field, which offers that option explicitly but has the same stray 0s.

In [58]:
survey_df_abandrops[['Q1: GOING OUT?', 'Q2: GENDER']].value_counts()

Q1: GOING OUT?  Q2: GENDER        
No              Male                  860
                Female                493
Yes             Male                  137
                Female                 82
No              I'd rather not say     48
0               Male                   44
No              Other                  18
0               Female                  8
Yes             I'd rather not say      8
0               0                       4
                I'd rather not say      4
No              0                       3
Yes             Other                   3
0               Other                   1
dtype: int64

In [59]:
problemcols = ['Q1: GOING OUT?', 'Q2: GENDER']

for column in problemcols:
    survey_df_abandrops[column] = [answer if answer != 0 else "I'd rather not say" for answer in survey_df_abandrops[column]]

In [60]:
survey_df_abandrops[['Q1: GOING OUT?', 'Q2: GENDER']].value_counts()

Q1: GOING OUT?      Q2: GENDER        
No                  Male                  860
                    Female                493
Yes                 Male                  137
                    Female                 82
No                  I'd rather not say     51
I'd rather not say  Male                   44
No                  Other                  18
I'd rather not say  Female                  8
                    I'd rather not say      8
Yes                 I'd rather not say      8
                    Other                   3
I'd rather not say  Other                   1
dtype: int64

### Converting Lolly Categories
'Despair', 'Meh' and 'Joy' are reasonably functional, but if I were hired to clean this data set knowing there'd be further exploration, I'd want a slightly better way of scoring and aggregating this information. Just having them as integers seems like a solid idea. We can use some of the tricks employed earlier to convert these values to integers.

In [61]:
survey_df_cats = survey_df_abandrops.copy()

In [62]:
scoredict = {'DESPAIR': 1,
             'MEH': 2,
             'JOY': 3}

for column in survey_df_cats:
    for key, value in scoredict.items():
        survey_df_cats[column] = [answer if key != answer else value for answer in survey_df_cats[column]]

In [63]:
survey_df_cats.head(10)

Unnamed: 0,Internal ID,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,"Q5: STATE, PROVINCE, COUNTY, ETC",Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,...,Q6 | Twix,"Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein",Q6 | Vicodin,Q6 | Whatchamacallit Bars,Q6 | White Bread,Q6 | Whole Wheat anything,Q6 | York Peppermint Patties,Q11: DAY,"Click Coordinates (x, y)",Q12: MEDIA
0,90272821,No,Male,44,USA,NM,2,1,3,2,...,3,1,1,1,1,1,1,Sunday,"(84, 25)",Science
1,90272840,No,Male,40,us,or,2,1,3,2,...,3,1,3,3,1,1,1,Sunday,"(75, 23)",Science
2,90272841,No,Male,23,usa,exton pa,3,1,3,1,...,3,2,3,3,1,1,3,Friday,"(70, 10)",Science
3,90272852,No,Male,41,0,0,3,1,3,0,...,3,1,1,3,1,1,3,0,"(75, 23)",Science
4,90272854,No,Male,33,canada,ontario,3,1,3,1,...,3,3,2,1,1,1,1,Friday,"(55, 5)",Science
5,90272858,No,Male,40,Canada,Ontario,3,1,3,2,...,3,2,1,2,1,1,1,Sunday,"(76, 24)",Science
6,90272859,No,Female,53,Us,Wa,2,1,3,2,...,3,1,1,2,1,1,2,Sunday,"(70, 28)",Science
7,90272865,No,Male,56,Canada,Quebec,3,2,3,2,...,2,1,2,2,3,1,2,Friday,"(73, 24)",Science
8,90272866,No,Male,64,US,NY,2,2,3,2,...,2,3,3,1,1,1,2,Sunday,"(77, 24)",Science
9,90272867,Yes,Male,43,Murica,California,3,1,3,2,...,3,1,1,3,1,1,3,Sunday,0,Science
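
A more targeted way to apply the same mapping, sketched on a hypothetical two-row frame, is to restrict it to the Q6 columns so that no other field can be touched by accident. The column names here are placeholders.

```python
import pandas as pd

scoredict = {"DESPAIR": 1, "MEH": 2, "JOY": 3}

# Hypothetical miniature with one non-rating column and a 0 placeholder.
toy = pd.DataFrame({
    "Q2: GENDER": ["Male", "Female"],
    "Q6 | A": ["JOY", 0],
    "Q6 | B": ["MEH", "DESPAIR"],
})

q6_cols = [col for col in toy.columns if col.startswith("Q6 |")]

# Map each rating column; anything not in scoredict (e.g. the 0) is kept as-is.
for col in q6_cols:
    toy[col] = toy[col].map(lambda answer: scoredict.get(answer, answer))
print(toy)
```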


### Cleaning Country Field
Now for a real challenge! These fields are absolutely abysmal, filled with inconsistent capitalisation, abbreviations and colloquialisms (such as 'Murica' or 'Merica'). This means we can't just throw a dictionary at it. Let's start with the country field.

In [64]:
pd.set_option('display.max_rows', 500)
survey_df_cats['Q4: COUNTRY'].value_counts()

USA                                                                     515
United States                                                           379
usa                                                                     158
Canada                                                                  104
US                                                                       98
Usa                                                                      87
USA                                                                      53
United States of America                                                 43
united states                                                            30
United States                                                            26
us                                                                       23
0                                                                        20
canada                                                                   19
United state

Yikes! OK, here's the current plan:
1. Make everything upper case to cut a few inconsistencies in a broad sweep.
2. Cut all punctuation and spaces from each country.
3. Paste in a dictionary of ISO3166 country codes. Source: https://gist.github.com/carlopires/1261951/d13ca7320a6abcd4b0aa800d351a31b54cefdff4
4. Make the dictionary all uppercase.
5. Remove all spaces from existing dictionary values.
6. Append other common terms to dictionary values (eg 'US').
7. Loop through the dictionary and replace each value with ISO country codes, taking care to use '==' instead of 'in' to account for cases where new values exist in other countries (eg 'US' can be found in the word 'AUSTRALIA', so we need an exact match).
8. Celebrate.
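
The plan above can be sketched in plain Python. The `synonyms` dictionary is a hypothetical, heavily shortened stand-in for the ISO 3166 dictionary, and `sanitise` and `to_iso` are illustrative helper names. Inverting the dictionary turns every lookup into a single exact match.

```python
import string

# Hypothetical, shortened stand-in for the ISO 3166 dictionary plus synonyms.
synonyms = {
    "US": ["UNITEDSTATESOFAMERICA", "UNITEDSTATES", "USA", "US"],
    "CA": ["CANADA"],
    "AU": ["AUSTRALIA"],
}

# Invert to {sanitised name: code} so each lookup is one exact match.
lookup = {name: code for code, names in synonyms.items() for name in names}

def sanitise(raw):
    """Upper-case the value and drop spaces and punctuation."""
    drop = string.punctuation + " "
    return "".join(ch for ch in str(raw).upper() if ch not in drop)

def to_iso(raw):
    cleaned = sanitise(raw)
    # dict.get is an exact match, so 'US' can never fire inside 'AUSTRALIA'.
    return lookup.get(cleaned, cleaned)

result = [to_iso(c) for c in ["u.s.a.", "United States", "Australia", "Murica"]]
print(result)
```

Unmatched values such as 'MURICA' come back sanitised but unconverted, so they remain easy to spot afterwards.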

In [65]:
survey_df_countryfix = survey_df_cats.copy()

In [66]:
import string

survey_df_countryfix['Q4: COUNTRY'] = [str(country).upper() for country in survey_df_countryfix['Q4: COUNTRY']]

for ind, country in enumerate(survey_df_countryfix['Q4: COUNTRY']):
    survey_df_countryfix.loc[ind, 'Q4: COUNTRY'] = ''.join([char.replace(' ', '') for char in survey_df_countryfix['Q4: COUNTRY'].iloc[ind] if char not in string.punctuation])

In [67]:
from data_isodict import ISO3166

countrydict = ISO3166.copy()
sanitised_countries = [str(country).upper().replace(' ', '') for country in countrydict.values()]

In [68]:
for ind, key in enumerate(countrydict.keys()):
    countrydict[key] = sanitised_countries[ind]

# Given the data is static, I don't mind manually updating the dictionary with a few synonyms it missed.
countrydict['US'] = ['UNITEDSTATESOFAMERICA', 'UNITEDSTATES', 'USA', 'US', 'THEUNITEDSTATES', 'AMERICA']
countrydict['GB'] = ['SCOTLAND', 'UK', 'UNITEDKINGDOM']
countrydict['KR'] = ['SOUTHKOREA', 'REPUBLICOFSOUTHKOREA', 'KOREA']
countrydict['AE'] = 'UAE'
countrydict['TW'] = ['TAIWANPROVINCEOFCHINA', 'TAIWAN']

In [69]:
countrydict['US']

['UNITEDSTATESOFAMERICA',
 'UNITEDSTATES',
 'USA',
 'US',
 'THEUNITEDSTATES',
 'AMERICA']

In [70]:
for ind, country in enumerate(survey_df_countryfix['Q4: COUNTRY']):
    for key, values in countrydict.items():
        if isinstance(values, list):
            for value in values:
                if value == country:
                    survey_df_countryfix.loc[ind, 'Q4: COUNTRY'] = key
        else:
            if values == country:
                survey_df_countryfix.loc[ind, 'Q4: COUNTRY'] = key

In [71]:
survey_df_countryfix['Q4: COUNTRY'].value_counts()

US                                                       1475
CA                                                        129
0                                                          20
GB                                                         15
DE                                                          5
IE                                                          3
NL                                                          3
JP                                                          3
KR                                                          2
UNITESSTATES                                                2
MX                                                          2
FR                                                          2
UNITEDSTATE                                                 2
USAUSAUSA                                                   2
CH                                                          2
USOFA                                                       2
DK      

In [72]:
failures = len([country for country in survey_df_countryfix['Q4: COUNTRY'] if len(country) > 2])
successes = len([country for country in survey_df_countryfix['Q4: COUNTRY'] if len(country) == 2])

print(f"Failed ISO conversions: {failures}")
print(f"Successful ISO conversions: {successes}")
# Note: this divides failures by successes rather than by all attempts.
print(f"{round((failures/ successes) * 100, 2)}% fail rate.")

Failed ISO conversions: 41
Successful ISO conversions: 1651
2.48% fail rate.
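
The figure printed above is failures relative to successes. As a quick worked comparison using the two counts from that output, the share of all attempted conversions comes out slightly lower. Values of length 1, such as the '0' placeholder, fall into neither bucket and are left out of both figures.

```python
failures = 41      # counts taken from the output above
successes = 1651

ratio_to_successes = round(failures / successes * 100, 2)
share_of_attempts = round(failures / (failures + successes) * 100, 2)
print(ratio_to_successes, share_of_attempts)
```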


Decent! Given this is a static data set that won't be run again, I'm OK with having slapped a variety of synonyms into a dictionary.

Let's now get the index of each trouble row and replace each one with 'US', because that has the highest likelihood of being correct given Americans are vastly overrepresented in the results already. We can correct any cases where this assumption is wrong after we've cleaned the states field.

In [73]:
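# Rows whose value isn't exactly two characters failed the ISO conversion.
countrydrop = [ind for ind, country in enumerate(survey_df_countryfix['Q4: COUNTRY']) if len(country) != 2]
# Dropping them from the Series and assigning back leaves NaN in those rows (index alignment).
survey_df_countryfix['Q4: COUNTRY'] = survey_df_countryfix['Q4: COUNTRY'].drop(countrydrop)
survey_df_countryfix['Q4: COUNTRY'] = survey_df_countryfix['Q4: COUNTRY'].replace({np.nan: 'US', '0': 'US'})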

In [74]:
survey_df_countryfix['Q4: COUNTRY'].value_counts()

US    1537
CA     129
GB      15
DE       5
JP       3
NL       3
IE       3
MX       2
CH       2
FR       2
KR       2
DK       2
AE       1
CN       1
GR       1
IS       1
AU       1
TW       1
CR       1
ES       1
Name: Q4: COUNTRY, dtype: int64

### Cleaning State Field
Our ol' faithful dictionary-based solution gets markedly less efficient here. Every country has numerous states, all with their own abbreviations, slang and ways for a wayward survey-doer to go awry. 

In [75]:
survey_df_countryfix['Q5: STATE, PROVINCE, COUNTY, ETC'].value_counts()

California                                            104
CA                                                     62
Texas                                                  43
Illinois                                               42
Oregon                                                 38
0                                                      37
Ontario                                                34
New York                                               31
WA                                                     30
Washington                                             27
NY                                                     26
Ohio                                                   25
Virginia                                               22
Pennsylvania                                           21
Alberta                                                21
Massachusetts                                          19
California                                             19
PA            

In [76]:
survey_df_statefix = survey_df_countryfix.copy()

In [77]:
survey_df_statefix['Q5: STATE, PROVINCE, COUNTY, ETC'] = [str(state).upper() for state in survey_df_statefix['Q5: STATE, PROVINCE, COUNTY, ETC']]

for ind, state in enumerate(survey_df_statefix['Q5: STATE, PROVINCE, COUNTY, ETC']):
    survey_df_statefix.loc[ind, 'Q5: STATE, PROVINCE, COUNTY, ETC'] = ''.join([char.replace(' ', '') for char in survey_df_statefix.loc[ind, 'Q5: STATE, PROVINCE, COUNTY, ETC'] if char not in string.punctuation])

In [78]:
survey_df_statefix['Q5: STATE, PROVINCE, COUNTY, ETC'].value_counts()

CALIFORNIA                                     134
CA                                              83
ILLINOIS                                        55
TEXAS                                           50
ONTARIO                                         47
OREGON                                          45
WA                                              43
NY                                              40
NEWYORK                                         39
WASHINGTON                                      38
0                                               37
PENNSYLVANIA                                    36
OHIO                                            34
MASSACHUSETTS                                   32
MA                                              27
VIRGINIA                                        27
BC                                              26
ALBERTA                                         24
PA                                              24
MN                             

We'll use pycountry's subdivision object to iterate through our states and make them consistent. The final output will be state names. Process:
1. Create list of subdivision objects containing state codes and names.
2. Create separate list of sanitised state names, then replace the ones in the subdivision list with those.
3. Loop through each state in the dataframe, replacing them with codes from the subdivision list.
4. Use codes to insert original subdivision state names.

In [79]:
import pycountry as pc

subdivs = list(pc.subdivisions)
subdivs[0] # What a subdivision looks like. Each component is available as an attribute (e.g. .code, .name).

Subdivision(code='AD-02', country_code='AD', name='Canillo', parent_code=None, type='Parish')

In [80]:
sanitised_subdivs = subdivs.copy() # Shallow copy: the Subdivision objects are shared, so renaming them below also changes subdivs.
state_list = [state.name.upper().replace(' ', '') for state in subdivs]

for ind, state in enumerate(state_list):
    state_list[ind] = ''.join([char for char in state_list[ind] if char not in string.punctuation])
    sanitised_subdivs[ind].name = state_list[ind]

In [81]:
sanitised_subdivs[0]

Subdivision(code='AD-02', country_code='AD', name='CANILLO', parent_code=None, type='Parish')

In [82]:
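# This takes 1 - 2 minutes to run.
# Note: 'in' is a substring test, so any subdivision name contained in the cleaned state text counts as a match.
for ind, state in enumerate(survey_df_statefix['Q5: STATE, PROVINCE, COUNTY, ETC']):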
    for subdiv in sanitised_subdivs:
        if subdiv.country.alpha_2 in survey_df_statefix['Q4: COUNTRY'][ind]:
            if subdiv.name in survey_df_statefix['Q5: STATE, PROVINCE, COUNTY, ETC'][ind]:
                survey_df_statefix.loc[ind, 'Q5: STATE, PROVINCE, COUNTY, ETC'] = subdiv.code

In [83]:
pd.options.display.max_rows = 40
survey_df_statefix

Unnamed: 0,Internal ID,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,"Q5: STATE, PROVINCE, COUNTY, ETC",Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,...,Q6 | Twix,"Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein",Q6 | Vicodin,Q6 | Whatchamacallit Bars,Q6 | White Bread,Q6 | Whole Wheat anything,Q6 | York Peppermint Patties,Q11: DAY,"Click Coordinates (x, y)",Q12: MEDIA
0,90272821,No,Male,44,US,NM,2,1,3,2,...,3,1,1,1,1,1,1,Sunday,"(84, 25)",Science
1,90272840,No,Male,40,US,OR,2,1,3,2,...,3,1,3,3,1,1,1,Sunday,"(75, 23)",Science
2,90272841,No,Male,23,US,EXTONPA,3,1,3,1,...,3,2,3,3,1,1,3,Friday,"(70, 10)",Science
3,90272852,No,Male,41,US,0,3,1,3,0,...,3,1,1,3,1,1,3,0,"(75, 23)",Science
4,90272854,No,Male,33,CA,CA-ON,3,1,3,1,...,3,3,2,1,1,1,1,Friday,"(55, 5)",Science
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1708,90314022,No,Female,26,US,US-MI,3,2,3,1,...,3,2,2,3,2,2,3,Friday,"(68, 39)",Science
1709,90314359,No,Male,24,US,MD,3,1,2,1,...,3,2,3,1,2,1,2,Friday,0,
1710,90314580,No,Female,33,US,US-NY,2,1,3,0,...,3,0,0,3,1,2,3,Friday,"(70, 26)",Science
1711,90314634,No,Female,26,US,US-TN,2,1,3,1,...,2,2,3,2,1,1,2,Friday,"(67, 35)",Science
