In this notebook, we work with the cleaned Neighbour Survey data. Specifically, we consolidate the data and fill the null values in the basic info columns: Language_of_Survey, age, gender, gender_other, status_in_canada, and status_in_canada_other.

In [1]:
# importing libraries:
import numpy as np
import pandas as pd

In [2]:
# importing data:
NS = pd.read_csv('Cleaned_NS.csv')

In [3]:
NS.describe()

Unnamed: 0,participant_ID,Language_of_Survey,Enough_income,Enough_food,Afford_NBM,skip_meal,frequency_in_year,eat_less,hungry,first_use_years,...,income_source_CPP,income_source_private_pension,income_source_OAS,income_source_WSIB,income_source_disability,income_source_other_government_programs,income_source_no_income,income_source_prefer_not_to_answer,income_source_other,comments
count,4054,4054,4036,4026,4027,4040,2869,4018,4019,4037,...,99,41,81,28,32,88,220,192,67,2216
unique,4054,4,3,5,5,4,6,4,4,6,...,1,1,1,1,1,1,1,1,60,657
top,ns 757,english,no,sometimes true,sometimes true,yes,some months but not every month,yes,yes,1 to 2 years ago,...,canadian pension plan (cpp),private pension,old age security (oas),workplace safety and insurance board (wsib),short/long term disability,other government programs,no income,prefer not to answer,self employed,no
freq,1,3387,2874,1931,1907,2902,1247,2965,2508,1148,...,99,41,81,28,32,88,220,192,5,883


In [4]:
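# Create a subset of the data with some basic info.
# .copy() makes NS_basics an independent DataFrame, so later assignments modify it directly
# instead of a possible view of NS (the cause of SettingWithCopyWarning):
NS_basics = NS[['participant_ID','age','gender','gender_other','status_in_canada', 'status_in_canada_other']].copy()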

In [5]:
# Checking out null values in the NS_basics:
NS_basics.isna().sum()

participant_ID               0
age                         18
gender                      68
gender_other              4039
status_in_canada            82
status_in_canada_other    4024
dtype: int64

# Consolidating the data from the "other" columns:

### Consolidating gender columns:

In [6]:
# Checking the gender values: did respondents who did not pick a listed gender self-identify in "gender_other"?
LGBTQ = NS_basics[NS_basics['gender_other'].notna()]

In [7]:
LGBTQ.head()

Unnamed: 0,participant_ID,age,gender,gender_other,status_in_canada,status_in_canada_other
20,ns 517,21-30,"if none of the above, please self-identify:",non binary,canadian citizen,
169,ns 894,21-30,"if none of the above, please self-identify:",genderfluid,canadian citizen,
453,ns 2244,21-30,"if none of the above, please self-identify:",nonbinary,canadian citizen,
476,ns 3111,31-40,"if none of the above, please self-identify:",non binary/gender queer,canadian citizen,
578,ns 697,21-30,"if none of the above, please self-identify:",non-binary,canadian citizen,


In [8]:
# Checking how many self-identified as other gender:
LGBTQ.gender_other.value_counts()

non binary                               2
nonbinary                                2
non-binary                               2
genderfluid                              1
non binary/gender queer                  1
who cares?                               1
i identify as a lays dill pickle chip    1
genderqueer                              1
autigender                               1
struggling with gender identity          1
agender                                  1
non-binary/agender                       1
Name: gender_other, dtype: int64

In [9]:
# Some values are spelling variants of the same identity, while others are not gender identities at all:
for x in LGBTQ.index:
    value = NS_basics.loc[x, 'gender_other']
    if 'cares' in value or 'identify' in value:
        new_value = 'prefer not to answer'
    elif 'struggling' in value:
        new_value = 'person unsure'
    else:
        new_value = 'non-binary/ agender'
    # .loc assigns by index label on NS_basics itself (no chained indexing)
    NS_basics.loc[x, 'gender_other'] = new_value

In [10]:
NS_basics.gender_other.value_counts()

non-binary/ agender     12
prefer not to answer     2
person unsure            1
Name: gender_other, dtype: int64
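The same categorization rules can also be applied without an explicit index loop. The sketch below is one possible way to do it, run on made-up toy values rather than the survey data; the helper name `categorize` is hypothetical and only mirrors the substring rules used in the cell above.

```python
import pandas as pd

# Toy values standing in for the free-text gender_other column (not the survey data).
gender_other = pd.Series(
    ['non binary', 'nonbinary', 'who cares?', 'struggling with gender identity', None]
)

def categorize(value):
    """Map one free-text answer to a category; missing answers stay missing."""
    if pd.isna(value):
        return value
    if 'cares' in value or 'identify' in value:
        return 'prefer not to answer'
    if 'struggling' in value:
        return 'person unsure'
    return 'non-binary/ agender'

# Series.map applies the function to every element and returns a new Series.
cleaned = gender_other.map(categorize)
print(cleaned.value_counts())
```

Keeping the rules in one function makes it easier to review which answers fall into which category before overwriting the column.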

In [11]:
# Replacing the values in the gender column with those in the gender_other column for the self-identified rows:
NS_basics.loc[LGBTQ.index, 'gender'] = NS_basics.loc[LGBTQ.index, 'gender_other']


In [12]:
# Now that the self-identified values have been moved to the gender column, we can drop the gender_other column:
NS_basics = NS_basics.drop(columns=['gender_other'])


In [13]:
NS_basics.head()

Unnamed: 0,participant_ID,age,gender,status_in_canada,status_in_canada_other
0,ns 757,51-60,man,canadian citizen,
1,ns 318,51-60,man,canadian citizen,
2,ns 328,51-60,man,canadian citizen,
3,ns 646,21-30,woman,refugee,
4,ns 678,over 70,man,canadian citizen,


In [14]:
NS_basics.gender.value_counts()

woman                   2144
man                     1714
prefer not to answer      71
trans woman               27
trans man                 17
non-binary/ agender       12
person unsure              1
Name: gender, dtype: int64

In [15]:
NS_basics.isna().sum()

participant_ID               0
age                         18
gender                      68
status_in_canada            82
status_in_canada_other    4024
dtype: int64

There are 68 null values for gender at this point. One option is to infer gender from other data (e.g. the breastfeeding or pregnancy answers).

In [16]:
NS_basics.gender.value_counts()

woman                   2144
man                     1714
prefer not to answer      71
trans woman               27
trans man                 17
non-binary/ agender       12
person unsure              1
Name: gender, dtype: int64
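Before inferring gender from another column, it may help to check how well that column agrees with the stated gender. The sketch below uses made-up toy data, and `pregnant` is a hypothetical column name (the actual survey column names are not shown in this notebook). It illustrates one way to do the check, not the check that was actually run.

```python
import pandas as pd

# Toy data; `pregnant` is a hypothetical column name used only for illustration.
toy = pd.DataFrame({
    'gender':   ['woman', 'man', None, 'woman', 'man', None],
    'pregnant': ['yes',   'yes', 'yes', 'no',   'no',  'no'],
})

# Count every gender/proxy combination, keeping missing gender as its own row.
agreement = pd.crosstab(toy['gender'].fillna('missing'), toy['pregnant'])
print(agreement)
```

If rows such as `man` show non-zero counts under `yes`, the proxy column does not identify gender reliably, and using it to fill the nulls would introduce errors.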

In [17]:
# Replace the null values in the gender column with "prefer not to answer":
NS_basics['gender'] = NS_basics['gender'].fillna('prefer not to answer')


In [18]:
NS_basics.gender.value_counts()

woman                   2144
man                     1714
prefer not to answer     139
trans woman               27
trans man                 17
non-binary/ agender       12
person unsure              1
Name: gender, dtype: int64

In [19]:
NS_basics.isna().sum()

participant_ID               0
age                         18
gender                       0
status_in_canada            82
status_in_canada_other    4024
dtype: int64

### Issues with the gender, breastfeeding, and pregnancy data led us to fill the NaNs in the gender column with 'prefer not to answer'.

# Consolidating Status in Canada data:

In [20]:
# we'll check the values of the status_in_canada column:
NS_basics.status_in_canada.value_counts()

canadian citizen               2441
permanent resident              844
refugee                         211
international student           134
prefer not to answer            121
migrant worker                  108
applying for refugee status      78
other                            33
other (please specify)            2
Name: status_in_canada, dtype: int64

In [21]:
# Unify duplicate values:
NS_basics['status_in_canada'] = NS_basics['status_in_canada'].replace('other (please specify)', 'other')


In [22]:
# Checking values again:
NS_basics.status_in_canada.value_counts()

canadian citizen               2441
permanent resident              844
refugee                         211
international student           134
prefer not to answer            121
migrant worker                  108
applying for refugee status      78
other                            35
Name: status_in_canada, dtype: int64

In [23]:
# Checking values in the "status_in_canada_other" column:
NS_basics.status_in_canada_other.value_counts()

cuaet                                                    6
work permit                                              5
temporary resident                                       4
native status                                            2
canadien / american                                      1
status first nation                                      1
first nations                                            1
on work permit                                           1
resident                                                 1
protected person                                         1
born in quebec                                           1
student, applying for permanent residence                1
cuaet. refugee from ukraine with an open work permit.    1
work permit resident                                     1
dual citizenship                                         1
temporary                                                1
cuaet for ukrainians                                    

In [24]:
# Checking the rows with a free-text "other" status in Canada:
other_status = NS_basics[NS_basics['status_in_canada_other'].notna()]

In [25]:
other_status

Unnamed: 0,participant_ID,age,gender,status_in_canada,status_in_canada_other
161,ns 1395,21-30,trans woman,other,canadien / american
165,ns 1821,31-40,man,other,work permit resident
189,ns 3121,41-50,woman,other,temporary
254,ns 2597,21-30,woman,other,work permit
314,ns 2624,31-40,woman,other,cuaet
409,ns 544,51-60,man,other,native status
434,ns 2038,41-50,man,other,temporary resident
597,ns 634,41-50,woman,other,dual citizenship
614,ns 417,31-40,prefer not to answer,other,born in quebec
645,ns 600,51-60,woman,other,cuaet


Let's replace the "other" values in "status_in_canada" with the corresponding values from "status_in_canada_other":

In [26]:
# Replace the values in the "status_in_canada" column with the non-null values from the "status_in_canada_other" column:
NS_basics.loc[other_status.index, 'status_in_canada'] = NS_basics.loc[other_status.index, 'status_in_canada_other']


In [27]:
# Now we can drop the status_in_canada_other column:
NS_basics = NS_basics.drop(columns=['status_in_canada_other'])


In [28]:
NS_basics.shape

(4054, 4)
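Folding a free-text "other" column into its main column, as done for status_in_canada above, can also be written as a single expression. The sketch below is a minimal illustration on made-up toy data with hypothetical column names `status` and `status_other`. It assumes, as in the cells above, that any non-null free-text answer should overwrite the main value.

```python
import pandas as pd

# Toy data with hypothetical column names (not the survey data).
toy = pd.DataFrame({
    'status':       ['canadian citizen', 'other', 'other', None],
    'status_other': [None, 'work permit', None, 'cuaet'],
})

# Take the free-text answer wherever one exists; otherwise keep the original value.
toy['status'] = toy['status_other'].fillna(toy['status'])

# The free-text column is no longer needed once its values have been moved.
toy = toy.drop(columns=['status_other'])
print(toy)
```

Rows where both columns hold a value end up with the free-text answer, and rows without a free-text answer are left unchanged.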

In [29]:
# Now, let's unify the words in the status_in_canada as per the dictionary below:

In [30]:
status_dictionary = {
    'native status': 'citizen',
    'status first nation': 'citizen',
    'first nations': 'citizen',
    'canadian citizen': 'citizen',
    'born in quebec': 'citizen',
    'canadien / american': 'citizen',
    'dual citizenship': 'citizen',

    'permanent resident': 'permanet resident',
    'resident': 'permanet resident',

    'migrant worker': 'foreign worker',
    'work permit resident': 'foreign worker',
    'work permit': 'foreign worker',
    'on work permit': 'foreign worker',
    
    'international student': 'international student',
    'student, applying for permanent residence': 'international student',
    
    'refugee': 'refugee',
    'protected person': 'refugee',
    'cuaet': 'refugee', 
    'cuaet. refugee from ukraine with an open work permit.': 'refugee',
    'cuaet for ukrainians': 'refugee',
    'applying for refugee status': 'applying for refugee status',
    

    'prefer not to answer': 'prefer not to answer',
    np.nan: 'prefer not to answer',
    
    'temporary': 'temporary resident',
    'temporary resident': 'temporary resident',

    'other': 'prefer not to answer',

    }

In [31]:
# Passing the dictionary directly maps each key to its value:
NS_basics['status_in_canada'] = NS_basics['status_in_canada'].replace(status_dictionary)


In [32]:
NS_basics.status_in_canada.value_counts()

citizen                        2448
permanet resident               845
refugee                         220
prefer not to answer            208
international student           135
foreign worker                  115
applying for refugee status      78
temporary resident                5
Name: status_in_canada, dtype: int64
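Because `replace` leaves any value that is not a dictionary key untouched, it can be useful to list the values a mapping does not cover before applying it. The sketch below shows one way to do that on made-up toy data; `mapping` is a shortened, illustrative stand-in for the status dictionary above, and `'diplomat'` is an invented value.

```python
import numpy as np
import pandas as pd

# Toy data; 'diplomat' is an invented value that the mapping does not cover.
status = pd.Series(['canadian citizen', 'work permit', np.nan, 'diplomat'])

# A shortened, illustrative mapping (not the full dictionary above).
mapping = {
    'canadian citizen': 'citizen',
    'work permit': 'foreign worker',
    np.nan: 'prefer not to answer',
}

# Non-null values that are not keys would pass through replace() unchanged.
unmapped = sorted(set(status.dropna()) - set(mapping))
print(unmapped)

# Apply the mapping; uncovered values keep their original text.
unified = status.replace(mapping)
print(unified.tolist())
```

An empty `unmapped` list would indicate that every non-null value in the column has an explicit target category.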

In [33]:
NS_basics.isna().sum()

participant_ID       0
age                 18
gender               0
status_in_canada     0
dtype: int64

In [34]:
# let's check the age values:
NS_basics.age.value_counts()

31-40                   1459
21-30                   1130
41-50                    706
51-60                    326
18-20                    183
61-70                    136
over 70                   56
prefer not to answer      40
Name: age, dtype: int64

In [35]:
# Fill the null values in the age column with "prefer not to answer":
NS_basics['age'] = NS_basics['age'].fillna('prefer not to answer')


In [36]:
NS_basics.isna().sum()

participant_ID      0
age                 0
gender              0
status_in_canada    0
dtype: int64

In [37]:
NS_basics.describe()

Unnamed: 0,participant_ID,age,gender,status_in_canada
count,4054,4054,4054,4054
unique,4054,8,7,8
top,ns 757,31-40,woman,citizen
freq,1,1459,2144,2448


In [38]:
# I will save the NS_basics data into a csv file.
NS_basics.to_csv('NS_basics.csv', index = False)
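A few automated checks before saving can catch leftover problems that the `isna().sum()` and `describe()` outputs above are used to inspect by eye. The sketch below is one possible check, shown on made-up toy data; the helper `check_basics` is hypothetical and not part of the original workflow.

```python
import pandas as pd

def check_basics(df, id_column='participant_ID'):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if df.isna().any().any():
        problems.append('null values remain')
    if df[id_column].duplicated().any():
        problems.append('duplicate participant IDs')
    return problems

# Toy data with one missing age (not the survey data).
toy = pd.DataFrame({'participant_ID': ['ns 1', 'ns 2'], 'age': ['21-30', None]})
print(check_basics(toy))
print(check_basics(toy.fillna('prefer not to answer')))
```

Calling `check_basics(NS_basics)` just before `to_csv` would be one way to apply it here.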

In [39]:
NS_basics.status_in_canada.unique()

array(['citizen', 'refugee', 'foreign worker', 'international student',
       'prefer not to answer', 'permanet resident', 'temporary resident',
       'applying for refugee status'], dtype=object)