In this notebook, I'll examine the consolidated education data and save the final dataframe to a CSV file:

In [1]:
# import libraries:
import pandas as pd
import numpy as np

In [2]:
# import data:
NS_education = pd.read_csv('education_consolidated.csv')

In [3]:
NS_education.head(1)

Unnamed: 0,participant_ID,education_outside_Canada,education_consolidated,education_number_of_answers
0,ns 757,no,apprenticeship training and trades,1


## How were the questions answered?

In [4]:
NS_education.describe(include = 'all')

Unnamed: 0,participant_ID,education_outside_Canada,education_consolidated,education_number_of_answers
count,4054,3960,3974,4054.0
unique,4054,4,77,
top,ns 757,no,completed college / university,
freq,1,2313,1036,
mean,,,,1.079674
std,,,,0.445186
min,,,,0.0
25%,,,,1.0
50%,,,,1.0
75%,,,,1.0


In [5]:
# for the education outside Canada question the unique values are:
NS_education.education_outside_Canada.unique()

array(['no', 'yes', 'received canadian equivalency',
       'prefer not to answer', nan], dtype=object)

The answers to the education outside Canada question look clean. The consolidated education data is a different story: with 77 unique values, some participants evidently selected more than one education option. I'll therefore rank the options and write a function that keeps only the highest one. First, though, let's make sure we correctly identify those who preferred not to answer the questions.
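To make the multi-answer problem concrete, here is a minimal sketch on made-up rows. The toy dataframe and its values are assumptions for illustration only, not the survey data; the `'; '` separator is the one the `highest_education` function further down splits on.

```python
import pandas as pd

# Toy rows standing in for the survey data (hypothetical values).
toy = pd.DataFrame({
    'participant_ID': ['ns 1', 'ns 2', 'ns 3'],
    'education_consolidated': [
        'completed high school',
        'completed high school; some college / university',
        None,
    ],
})

# Count how many options each participant selected; unanswered rows count as 0.
n_selected = (
    toy['education_consolidated']
    .fillna('')
    .apply(lambda s: len(s.split('; ')) if s else 0)
)

# Rows where more than one option was selected.
multi_answer = toy[n_selected > 1]
print(multi_answer)
```

On the real data, the same count could be compared against `education_number_of_answers`, assuming that column records the number of selected options.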

## Who did not want to answer the questions:

In [6]:
NS_education.isna().sum()

participant_ID                  0
education_outside_Canada       94
education_consolidated         80
education_number_of_answers     0
dtype: int64

In [7]:
# find the participants who left the highest level of education question blank:
prefer_not_to_answer = NS_education[NS_education.education_consolidated.isna()]

In [8]:
# label-based assignment with .loc avoids chained indexing and the SettingWithCopyWarning it triggers:
NS_education.loc[prefer_not_to_answer.index, 'education_consolidated'] = 'prefer not to answer'


In [9]:
# how many left with nans?
NS_education.isna().sum()

participant_ID                  0
education_outside_Canada       94
education_consolidated          0
education_number_of_answers     0
dtype: int64

In [10]:
# let's also replace the NaNs for those who did not answer the "education outside Canada" question:
foreign_education_no_answer = NS_education[NS_education.education_outside_Canada.isna()]

In [11]:
# again, use .loc on the dataframe itself rather than chained indexing:
NS_education.loc[foreign_education_no_answer.index, 'education_outside_Canada'] = 'prefer not to answer'


In [12]:
NS_education.isna().sum()

participant_ID                 0
education_outside_Canada       0
education_consolidated         0
education_number_of_answers    0
dtype: int64

##### P.S. I will **not** change the number of answers provided, so that it stays available for data quality checks if needed.
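One such data quality check could look like the sketch below. It assumes that `education_number_of_answers` is meant to equal the number of `'; '`-separated options in `education_consolidated`; the notebook does not confirm this. The toy rows and the `count_mismatch` column name are hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical values), assuming education_number_of_answers should
# equal the number of '; '-separated options in education_consolidated.
toy = pd.DataFrame({
    'education_consolidated': [
        'completed high school',
        'some college / university; apprenticeship training and trades',
        'professional degree',
    ],
    'education_number_of_answers': [1, 2, 3],
})

# Recount the options from the string itself.
recount = toy['education_consolidated'].str.split('; ').str.len()

# Flag rows where the stored count disagrees with the recount.
toy['count_mismatch'] = recount != toy['education_number_of_answers']
print(toy[toy['count_mismatch']])
```

If the rows filled in above with 'prefer not to answer' carry a stored count of 0, they would be flagged by this check, so they may need to be excluded first.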

## Now, let us choose the highest education:

In [13]:
# First, I'll create the ordered list of education levels as ordered in the questions:
education_levels = [
    'some high school',
    'completed high school',
    'some college / university',
    'completed college / university',
    'apprenticeship training and trades',
    'some graduate education',
    'completed graduate education',
    'professional degree',
    'prefer not to answer'
]

In [14]:
# Then, I'll create a mapping of education levels to ranks
education_rank = {level: rank for rank, level in enumerate(education_levels)}

In [15]:
# Now, I'll create a function to determine the highest education level
def highest_education(education_string):
    """
    Determine the highest level of education from a semicolon-separated string.

    This function processes a string containing multiple education levels
    separated by semicolons and returns the highest education level based on
    a predefined ranking.

    Parameters:
    education_string (str): A string containing education levels separated by '; '.

    Returns:
    str: The highest education level from the input string based on the predefined ranking.
    """
    
    levels = education_string.split('; ')
    highest_level = max(levels, key=lambda level: education_rank[level])
    return highest_level
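Before applying the function to the whole column, it may help to try it on a few strings. The sketch below repeats the levels, the ranking, and the function so that it runs on its own; the sample strings are hypothetical, not taken from the survey data. It also lists any label that has no rank, since such a label would raise a `KeyError` inside `highest_education`.

```python
# Same levels and ranking as defined above, repeated so the sketch is self-contained.
education_levels = [
    'some high school',
    'completed high school',
    'some college / university',
    'completed college / university',
    'apprenticeship training and trades',
    'some graduate education',
    'completed graduate education',
    'professional degree',
    'prefer not to answer',
]
education_rank = {level: rank for rank, level in enumerate(education_levels)}

def highest_education(education_string):
    levels = education_string.split('; ')
    return max(levels, key=lambda level: education_rank[level])

# Hypothetical sample strings, not taken from the survey data.
samples = [
    'completed high school; some college / university',
    'professional degree; completed college / university',
    'apprenticeship training and trades',
]
results = [highest_education(s) for s in samples]
print(results)

# Labels without a rank would raise a KeyError inside highest_education.
unknown = {level for s in samples for level in s.split('; ')} - set(education_rank)
print(unknown)
```

Note that 'prefer not to answer' sits last in the list and therefore has the highest rank, so a string that combines it with an actual level would come back as 'prefer not to answer'.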

In [16]:
# Apply the function to the education_consolidated column
NS_education['highest_education'] = NS_education['education_consolidated'].apply(highest_education)

In [17]:
# Get a summary of the outcome:
NS_education.describe(include = 'all')

Unnamed: 0,participant_ID,education_outside_Canada,education_consolidated,education_number_of_answers,highest_education
count,4054,4054,4054,4054.0,4054
unique,4054,4,77,,9
top,ns 757,no,completed college / university,,completed college / university
freq,1,2313,1036,,1084
mean,,,,1.079674,
std,,,,0.445186,
min,,,,0.0,
25%,,,,1.0,
50%,,,,1.0,
75%,,,,1.0,


In [18]:
# with only 9 unique values in the highest_education column, this looks much better!
NS_education.highest_education.unique()

array(['apprenticeship training and trades',
       'completed college / university', 'completed high school',
       'some high school', 'some college / university',
       'prefer not to answer', 'completed graduate education',
       'some graduate education', 'professional degree'], dtype=object)

Now, we are good!!

In [19]:
# Select desired columns from dataframe above.
NS_highest_education = NS_education[['participant_ID', 'highest_education',  'education_number_of_answers', 'education_outside_Canada']]

In [20]:
# Save into a csv file.
NS_highest_education.to_csv('NS_highest_education_final.csv', index = False)

Question to consider: What if the apprenticeship program was done after a graduate or professional degree?
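One way to start on this question is to flag participants who selected both an apprenticeship and a graduate or professional level, so those rows can be reviewed separately. This is only a sketch: the toy rows, the helper `has_apprenticeship_and_higher`, and the `apprenticeship_plus_higher` column are hypothetical names introduced here. It can only show that both options were selected, not the order in which they were completed.

```python
import pandas as pd

# Toy rows (hypothetical values, not survey data).
toy = pd.DataFrame({
    'participant_ID': ['ns 1', 'ns 2', 'ns 3'],
    'education_consolidated': [
        'apprenticeship training and trades; completed graduate education',
        'apprenticeship training and trades',
        'professional degree; completed college / university',
    ],
})

# Levels ranked above apprenticeship in this notebook's ordering.
higher_levels = {
    'some graduate education',
    'completed graduate education',
    'professional degree',
}

def has_apprenticeship_and_higher(education_string):
    """Return True if apprenticeship was selected together with a higher-ranked level."""
    levels = set(education_string.split('; '))
    return 'apprenticeship training and trades' in levels and bool(levels & higher_levels)

toy['apprenticeship_plus_higher'] = toy['education_consolidated'].apply(
    has_apprenticeship_and_higher
)
print(toy[['participant_ID', 'apprenticeship_plus_higher']])
```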