# HarvardX_MITx_Person_Course_Dataset_Data_Wrangling

## Data Wrangling

In this file we wrangle the dataset to get it ready for later exploration.

The dataset we are using is the HarvardX_MITx_Person_Course_Dataset, stored in 'HMXPC13_DI_v2_5-14-14.csv'. It contains de-identified data from the first year (Academic Year 2013: Fall 2012, Spring 2013, and Summer 2013) of MITx and HarvardX courses on the edX platform, along with related documentation.

These data are aggregate records, and each record represents one individual's activity in one edX course.  

The data description can be found in the file 'Person+Course+Documentation.pdf'. To make each variable easier to understand, I also copied the description of each variable into a spreadsheet (named 'HMX Data description.xlsx') and added some comments.

The dataset and documents are available at the link below:

[dataverse.harvard.edu](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/26147)

### Loading Data

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from datetime import datetime
%matplotlib inline

In [2]:
# load data into a pandas dataframe and check the first 5 rows
df = pd.read_csv('HMXPC13_DI_v2_5-14-14.csv')
df.head()

Unnamed: 0,course_id,userid_DI,registered,viewed,explored,certified,final_cc_cname_DI,LoE_DI,YoB,gender,grade,start_time_DI,last_event_DI,nevents,ndays_act,nplay_video,nchapters,nforum_posts,roles,incomplete_flag
0,HarvardX/CB22x/2013_Spring,MHxPC130442623,1,0,0,0,United States,,,,0,2012-12-19,2013-11-17,,9.0,,,0,,1.0
1,HarvardX/CS50x/2012,MHxPC130442623,1,1,0,0,United States,,,,0,2012-10-15,,,9.0,,1.0,0,,1.0
2,HarvardX/CB22x/2013_Spring,MHxPC130275857,1,0,0,0,United States,,,,0,2013-02-08,2013-11-17,,16.0,,,0,,1.0
3,HarvardX/CS50x/2012,MHxPC130275857,1,0,0,0,United States,,,,0,2012-09-17,,,16.0,,,0,,1.0
4,HarvardX/ER22x/2013_Spring,MHxPC130275857,1,0,0,0,United States,,,,0,2012-12-19,,,16.0,,,0,,1.0


### Assessing Data

In [3]:
# dataset size
df.shape

print('Our dataset has {} records, with {} variables'.format(df.shape[0], df.shape[1]))

Our dataset has 641138 records, with 20 variables


In [4]:
# basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 641138 entries, 0 to 641137
Data columns (total 20 columns):
course_id            641138 non-null object
userid_DI            641138 non-null object
registered           641138 non-null int64
viewed               641138 non-null int64
explored             641138 non-null int64
certified            641138 non-null int64
final_cc_cname_DI    641138 non-null object
LoE_DI               535130 non-null object
YoB                  544533 non-null float64
gender               554332 non-null object
grade                592766 non-null object
start_time_DI        641138 non-null object
last_event_DI        462184 non-null object
nevents              441987 non-null float64
ndays_act            478395 non-null float64
nplay_video          183608 non-null float64
nchapters            382385 non-null float64
nforum_posts         641138 non-null int64
roles                0 non-null float64
incomplete_flag      100161 non-null float64
dtypes: floa

In [5]:
# missing values
df.isnull().sum()

course_id                 0
userid_DI                 0
registered                0
viewed                    0
explored                  0
certified                 0
final_cc_cname_DI         0
LoE_DI               106008
YoB                   96605
gender                86806
grade                 48372
start_time_DI             0
last_event_DI        178954
nevents              199151
ndays_act            162743
nplay_video          457530
nchapters            258753
nforum_posts              0
roles                641138
incomplete_flag      540977
dtype: int64

In [6]:
# check duplicate
df.duplicated().sum()

0
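
Since each record is meant to represent one individual's activity in one course, it can also be worth checking for duplicates on that key alone, not only on whole rows. The sketch below uses a small made-up frame (the user ids and grades are placeholders, not values from the real file) to show the difference between the two checks.

```python
import pandas as pd

# Toy rows standing in for the real file; the values are made up.
toy = pd.DataFrame({
    'course_id': ['HarvardX/CS50x/2012', 'HarvardX/CS50x/2012', 'MITx/6.00x/2012_Fall'],
    'userid_DI': ['u1', 'u1', 'u1'],
    'grade': ['0', '0.5', '0'],
})

# Full-row duplicates: rows that are identical in every column.
n_full_dupes = int(toy.duplicated().sum())

# Key duplicates: the same user appearing more than once in the same course.
n_key_dupes = int(toy.duplicated(subset=['course_id', 'userid_DI']).sum())

print(n_full_dupes, n_key_dupes)
```

On the real dataframe the same call would be `df.duplicated(subset=['course_id', 'userid_DI']).sum()`.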

According to the results above, we found that:
- There are a lot of missing values in the dataset. We will need to look deeper into them to figure out how to handle them.
- The data type of 'grade' is object, while float would be more reasonable.
- The data types of 'start_time_DI' and 'last_event_DI' are object. It is better to convert them to a date type.
- There is no value in variable 'roles', so we will drop it.
- Some variable names are not very intuitive. We will rename them to be more concise and intuitive.
- There are no duplicated records, which is good.

Next we will check the summary statistics of the numerical variables in our dataset.

In [7]:
# description of the dataset
df.describe()

Unnamed: 0,registered,viewed,explored,certified,YoB,nevents,ndays_act,nplay_video,nchapters,nforum_posts,roles,incomplete_flag
count,641138.0,641138.0,641138.0,641138.0,544533.0,441987.0,478395.0,183608.0,382385.0,641138.0,0.0,100161.0
mean,1.0,0.624299,0.061899,0.027587,1985.253279,431.008018,5.710254,114.844173,3.634423,0.018968,,1.0
std,0.0,0.484304,0.240973,0.163786,8.891814,1516.116057,11.866471,426.996844,4.490987,0.229539,,0.0
min,1.0,0.0,0.0,0.0,1931.0,1.0,1.0,1.0,1.0,0.0,,1.0
25%,1.0,0.0,0.0,0.0,1982.0,3.0,1.0,5.0,1.0,0.0,,1.0
50%,1.0,1.0,0.0,0.0,1988.0,24.0,2.0,18.0,2.0,0.0,,1.0
75%,1.0,1.0,0.0,0.0,1991.0,158.0,4.0,73.0,4.0,0.0,,1.0
max,1.0,1.0,1.0,1.0,2013.0,197757.0,205.0,98517.0,48.0,20.0,,1.0


- Since all the values of variable 'registered' are 1, it makes no difference to our analysis, so we will drop this variable.
- The maximum year of birth (variable 'YoB') is 2013, which cannot be true. We need to look deeper into the values to filter out the unreliable ones.
- The minimum values of 'nevents', 'ndays_act', 'nplay_video' and 'nchapters' are 1 instead of 0. This is misleading: per the data description, 'nevents' is left blank if there are no interactions beyond registration, so zero activity shows up as NA rather than 0. We will need to fill the NA with 0.
- All the non-null values of 'incomplete_flag' are 1. Since records with an incomplete flag are not reliable, we will consider dropping those records.
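
The observations about 'registered' and 'roles' can also be found programmatically rather than by reading the `describe()` table. The following is a minimal sketch on a made-up four-row frame; the column names mirror the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values that mimics the patterns seen above.
toy = pd.DataFrame({
    'registered': [1, 1, 1, 1],
    'viewed': [1, 0, 1, 1],
    'roles': [np.nan, np.nan, np.nan, np.nan],
    'incomplete_flag': [1.0, np.nan, np.nan, 1.0],
})

n_unique = toy.nunique()      # number of distinct non-null values per column
n_missing = toy.isna().sum()  # number of missing values per column

# Columns with no values at all (like 'roles').
all_null = n_unique[n_unique == 0].index.tolist()

# Columns with a single value and no gaps (like 'registered').
constant = [c for c in toy.columns if n_unique[c] == 1 and n_missing[c] == 0]

print(all_null, constant)
```

A column such as 'incomplete_flag' has a single distinct value but also has gaps, so it is not reported as constant; it behaves like a flag and needs separate treatment.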

Based on the exploration above, we will dive deep into each variable to see how to clean it.

**course_id**

In [8]:
df.course_id.value_counts()

HarvardX/CS50x/2012            169621
MITx/6.00x/2012_Fall            66731
MITx/6.00x/2013_Spring          57715
HarvardX/ER22x/2013_Spring      57406
HarvardX/PH207x/2012_Fall       41592
MITx/6.002x/2012_Fall           40811
HarvardX/PH278x/2013_Spring     39602
MITx/8.02x/2013_Spring          31048
HarvardX/CB22x/2013_Spring      30002
MITx/14.73x/2013_Spring         27870
MITx/6.002x/2013_Spring         22235
MITx/7.00x/2013_Spring          21009
MITx/3.091x/2012_Fall           14215
MITx/8.MReV/2013_Summer          9477
MITx/3.091x/2013_Spring          6139
MITx/2.01x/2013_Spring           5665
Name: course_id, dtype: int64

**userid_DI**

In [9]:
df.userid_DI.unique()

array(['MHxPC130442623', 'MHxPC130275857', 'MHxPC130539455', ...,
       'MHxPC130184108', 'MHxPC130359782', 'MHxPC130098513'], dtype=object)

The 'course_id' and 'userid_DI' variables look good.

**registered**

In [10]:
df.registered.value_counts()

1    641138
Name: registered, dtype: int64

The 'registered' variable can just be dropped, as discussed earlier.

**viewed**

In [11]:
df.viewed.value_counts()

1    400262
0    240876
Name: viewed, dtype: int64

**explored**

In [12]:
df.explored.value_counts()

0    601452
1     39686
Name: explored, dtype: int64

**certified**

In [13]:
df.certified.value_counts()

0    623451
1     17687
Name: certified, dtype: int64

The three variables above ('viewed', 'explored', 'certified') look good.

**final_cc_cname_DI**

In [14]:
df.final_cc_cname_DI.value_counts()

United States                             184240
India                                      88696
Unknown/Other                              82029
Other Europe                               40377
Other Africa                               23897
United Kingdom                             22131
Brazil                                     17856
Other Middle East/Central Asia             17325
Other South Asia                           12992
Canada                                     12738
Pakistan                                   10824
Russian Federation                         10432
Spain                                      10003
Other South America                         9916
Egypt                                       9286
Germany                                     8074
Nigeria                                     7483
Other East Asia                             6446
Australia                                   6419
Mexico                                      5638
Philippines         

**LoE_DI**

In [15]:
df.LoE_DI.value_counts()

Bachelor's             219768
Secondary              169694
Master's               118189
Less than Secondary     14092
Doctorate               13387
Name: LoE_DI, dtype: int64

Values of variable 'final_cc_cname_DI' and 'LoE_DI' all look good.

**YoB**

In [16]:
np.sort(df.YoB.unique())

array([1931., 1934., 1935., 1936., 1937., 1938., 1939., 1940., 1941.,
       1942., 1943., 1944., 1945., 1946., 1947., 1948., 1949., 1950.,
       1951., 1952., 1953., 1954., 1955., 1956., 1957., 1958., 1959.,
       1960., 1961., 1962., 1963., 1964., 1965., 1966., 1967., 1968.,
       1969., 1970., 1971., 1972., 1973., 1974., 1975., 1976., 1977.,
       1978., 1979., 1980., 1981., 1982., 1983., 1984., 1985., 1986.,
       1987., 1988., 1989., 1990., 1991., 1992., 1993., 1994., 1995.,
       1996., 1997., 1998., 1999., 2000., 2001., 2002., 2003., 2007.,
       2008., 2009., 2010., 2011., 2012., 2013.,   nan])

In [17]:
df[df.YoB>2002].YoB.value_counts()

2012.0    472
2013.0     61
2011.0     34
2010.0     17
2003.0     10
2008.0     10
2009.0      8
2007.0      6
Name: YoB, dtype: int64

There are quite a few unreliable values in the YoB variable. The Wikipedia page [Education in the United States](https://en.wikipedia.org/wiki/Education_in_the_United_States) lists the typical ages at which US students attend each level of school. Suppose a learner needs to have at least finished elementary school to study these online courses; then the youngest learners would be 10-11 years old, which for courses running in 2012-2013 corresponds to a year of birth of roughly 2002. Based on this, we will keep only the records with a YoB of 2002 or earlier.
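
One detail of this filter is worth seeing on a small example: a comparison against a missing value is always False in pandas, so filtering on `YoB <= 2002` also removes the rows where YoB is missing. The sketch below uses made-up values, and the name `YOB_CUTOFF` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up years of birth, including one missing value.
toy = pd.DataFrame({'YoB': [1988.0, 2002.0, 2003.0, 2012.0, np.nan]})

YOB_CUTOFF = 2002  # hypothetical constant name; the cutoff itself comes from the text above

mask = toy['YoB'] <= YOB_CUTOFF

# NaN <= 2002 evaluates to False, so the missing row is dropped as well.
kept = toy[mask]

# To keep rows with a missing YoB, the mask has to say so explicitly.
kept_with_na = toy[mask | toy['YoB'].isna()]

print(len(kept), len(kept_with_na))
```

Since we drop records with a missing YoB anyway, the first form is sufficient for this cleaning.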

**gender**

In [18]:
df.gender.value_counts()

m    411520
f    142795
o        17
Name: gender, dtype: int64

Values of variable 'gender' look good.

For the user-provided variables **'LoE_DI', 'YoB' and 'gender'**, the values look good but some are missing. A value is missing either because the student created an account before the corresponding registration question was available, or because the user declined to provide the information. There is not much we can do about this, so we will drop the records with missing values.

**grade**

In [19]:
df.grade.value_counts()

0       490868
0.01     19891
0.0      18525
          9028
0.02      5417
0.03      4769
0.04      3912
0.05      2238
0.06      2203
1         2010
0.07      1330
0.09      1293
0.08      1069
0.1        817
0.13       753
0.11       705
0.12       680
0.89       586
0.91       575
0.93       551
0.15       547
0.88       545
0.9        541
0.96       535
0.14       518
0.87       507
0.97       504
0.92       502
0.99       502
0.94       497
         ...  
0.65       250
0.61       250
0.34       249
0.6        238
0.38       226
0.39       226
0.4        217
0.36       214
0.37       213
0.35       208
0.42       184
0.56       182
0.57       174
0.45       174
0.55       173
0.5        170
0.46       170
0.53       167
0.59       165
0.51       160
0.43       157
0.47       155
0.41       155
0.58       153
0.44       149
0.54       145
0.52       142
0.48       139
0.49       118
1.01         6
Name: grade, Length: 104, dtype: int64

There are a few interesting things here:
- There is a '0' value as well as a '0.0' value. This will not be a problem once we convert the variable to a numerical type, since both become the same number.
- Grades should range from 0 to 1, but we found some 1.01 values. We need to drop those unreliable records.
- Some values are left blank. We will drop those records.
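
As an alternative to filtering on the raw strings, the three issues above could be handled in one pass by converting first and filtering afterwards. This is a sketch of that alternative on made-up values, not the approach used in the cleaning cells of this notebook; it relies on `pd.to_numeric` with `errors='coerce'`, which turns unparseable strings into NaN.

```python
import pandas as pd

# Made-up grade strings covering the cases seen above.
toy = pd.DataFrame({'grade': ['0', '0.0', ' ', '0.57', '1.01', '1', None]})

# '0' and '0.0' both become 0.0; the blank string and None become NaN.
grade_num = pd.to_numeric(toy['grade'], errors='coerce')

# between() is inclusive on both ends and returns False for NaN,
# so this drops the blank, the missing value and the 1.01 in one step.
valid = grade_num.between(0, 1)
cleaned = grade_num[valid]

print(cleaned.tolist())
```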

**incomplete_flag**

In [20]:
# check values of incomplete_flag variable
df.incomplete_flag.value_counts()

1.0    100161
Name: incomplete_flag, dtype: int64

According to the data description in the file 'Person+Course+Documentation.pdf', the variable **'incomplete_flag'** identifies records that are internally inconsistent. To ensure the reliability of the analysis, we prefer to keep only the records without this inconsistency, that is, those where incomplete_flag != 1. After filtering we will drop the variable, since it no longer carries any information.
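
Because the flag is either 1 or missing, the condition `incomplete_flag != 1` keeps exactly the rows where the flag is NaN (in pandas, `NaN != 1` evaluates to True). A minimal sketch with made-up rows shows that the two ways of writing the filter agree:

```python
import numpy as np
import pandas as pd

# Made-up rows: one flagged record and two unflagged ones.
toy = pd.DataFrame({
    'userid_DI': ['a', 'b', 'c'],
    'incomplete_flag': [1.0, np.nan, np.nan],
})

kept_ne = toy[toy['incomplete_flag'] != 1]     # the condition used in this notebook
kept_na = toy[toy['incomplete_flag'].isna()]   # equivalent while the only non-null value is 1

print(kept_ne['userid_DI'].tolist())
```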

**Based on the exploration above, we need to do the following to clean the data:**
- **Subset**
    - Filter to get a subset without inconsistent records (incomplete_flag != 1)
    - Drop variable 'registered'
    - Drop the records with unreliable YoB values
    - For variable 'grade', drop the records with unreliable or blank values
    - Drop variable 'roles'

- **Missing values**
    - Drop records with missing values in variables 'LoE_DI', 'YoB', 'gender' and 'grade'
    - For missing values in variables 'nevents', 'ndays_act', 'nplay_video' and 'nchapters', fill the NA with 0
    - The value of 'last_event_DI' is left blank if there are no interactions beyond registration; we will fill the NA with 'start_time_DI', which is the registration date
- **Data type**
    - Convert the data type of 'grade' from object type to numerical type
    - Convert 'start_time_DI' and 'last_event_DI' from object to date type
- **Variable names**
    - Rename some of the columns to be more concise and intuitive

    




### Cleaning Data

#### Subset

Filter to get a subset without inconsistent records, which is to say, the value of variable incomplete_flag != 1

In [21]:
# Filter to get a subset without inconsistent records, which is to say, the value of variable incomplete_flag != 1
df = df[df.incomplete_flag != 1]
df.head()

Unnamed: 0,course_id,userid_DI,registered,viewed,explored,certified,final_cc_cname_DI,LoE_DI,YoB,gender,grade,start_time_DI,last_event_DI,nevents,ndays_act,nplay_video,nchapters,nforum_posts,roles,incomplete_flag
5,HarvardX/PH207x/2012_Fall,MHxPC130275857,1,1,1,0,United States,,,,0,2012-09-17,2013-05-23,502.0,16.0,50.0,12.0,0,,
7,HarvardX/CB22x/2013_Spring,MHxPC130539455,1,1,0,0,France,,,,0,2013-01-01,2013-05-14,42.0,6.0,,3.0,0,,
8,HarvardX/CB22x/2013_Spring,MHxPC130088379,1,1,0,0,United States,,,,0,2013-02-18,2013-03-17,70.0,3.0,,3.0,0,,
10,HarvardX/ER22x/2013_Spring,MHxPC130088379,1,1,0,0,United States,,,,0,2013-02-23,2013-06-14,17.0,2.0,,2.0,0,,
11,HarvardX/ER22x/2013_Spring,MHxPC130198098,1,1,0,0,United States,,,,0,2013-06-17,2013-06-17,32.0,1.0,,3.0,0,,


In [22]:
# double check values of incomplete_flag
df.incomplete_flag.value_counts()

Series([], Name: incomplete_flag, dtype: int64)

In [23]:
# Drop variables incomplete_flag, registered and roles
# (reassigning instead of inplace=True, because df is a filtered subset)
df = df.drop(columns=['incomplete_flag', 'registered', 'roles'])
df.head()

Unnamed: 0,course_id,userid_DI,viewed,explored,certified,final_cc_cname_DI,LoE_DI,YoB,gender,grade,start_time_DI,last_event_DI,nevents,ndays_act,nplay_video,nchapters,nforum_posts
5,HarvardX/PH207x/2012_Fall,MHxPC130275857,1,1,0,United States,,,,0,2012-09-17,2013-05-23,502.0,16.0,50.0,12.0,0
7,HarvardX/CB22x/2013_Spring,MHxPC130539455,1,0,0,France,,,,0,2013-01-01,2013-05-14,42.0,6.0,,3.0,0
8,HarvardX/CB22x/2013_Spring,MHxPC130088379,1,0,0,United States,,,,0,2013-02-18,2013-03-17,70.0,3.0,,3.0,0
10,HarvardX/ER22x/2013_Spring,MHxPC130088379,1,0,0,United States,,,,0,2013-02-23,2013-06-14,17.0,2.0,,2.0,0
11,HarvardX/ER22x/2013_Spring,MHxPC130198098,1,0,0,United States,,,,0,2013-06-17,2013-06-17,32.0,1.0,,3.0,0


In [24]:
# drop the records with unreliable YoB; rows with a missing YoB fail the comparison and are dropped too
df = df[df.YoB <= 2002]
df.shape

(458388, 17)

In [25]:
# drop the unreliable records and blank values in grade
df = df[(df.grade != '1.01') & (df.grade != ' ')]
df.shape

(451090, 17)

#### Missing Values

In [26]:
# missing values by column
df.isna().sum()

course_id                 0
userid_DI                 0
viewed                    0
explored                  0
certified                 0
final_cc_cname_DI         0
LoE_DI                11176
YoB                       0
gender                    0
grade                 37768
start_time_DI             0
last_event_DI         81288
nevents               85963
ndays_act             85963
nplay_video          304907
nchapters            198253
nforum_posts              0
dtype: int64

Besides the number of missing values for each variable, we will also compute the proportion of missing values for each variable.

In [27]:
# proportion of missing values by column
df.isna().sum() / df.shape[0]

course_id            0.000000
userid_DI            0.000000
viewed               0.000000
explored             0.000000
certified            0.000000
final_cc_cname_DI    0.000000
LoE_DI               0.024776
YoB                  0.000000
gender               0.000000
grade                0.083726
start_time_DI        0.000000
last_event_DI        0.180204
nevents              0.190567
ndays_act            0.190567
nplay_video          0.675934
nchapters            0.439498
nforum_posts         0.000000
dtype: float64
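
The count and the proportion can also be put side by side in one table, which makes it easier to compare variables. The sketch below uses a made-up frame; the column labels `n_missing` and `share_missing` are names chosen here for illustration. It relies on the fact that the mean of a boolean column is the share of True values.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps.
toy = pd.DataFrame({
    'LoE_DI': ["Bachelor's", None, "Master's", 'Secondary'],
    'nevents': [10.0, np.nan, np.nan, 3.0],
    'nforum_posts': [0, 0, 1, 0],
})

is_missing = toy.isna()

# One row per variable, sorted so the most incomplete variables come first.
summary = pd.DataFrame({
    'n_missing': is_missing.sum(),
    'share_missing': is_missing.mean(),
}).sort_values('share_missing', ascending=False)

print(summary)
```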

Drop the records with missing values in variables 'LoE_DI', 'YoB', 'gender' and 'grade'.

In [28]:
# drop rows with missing values in columns 'LoE_DI', 'YoB', 'gender' and 'grade'
df = df.dropna(subset=['LoE_DI', 'YoB', 'gender', 'grade'])
df.isna().sum()

course_id                 0
userid_DI                 0
viewed                    0
explored                  0
certified                 0
final_cc_cname_DI         0
LoE_DI                    0
YoB                       0
gender                    0
grade                     0
start_time_DI             0
last_event_DI         73490
nevents               77559
ndays_act             77559
nplay_video          272089
nchapters            176528
nforum_posts              0
dtype: int64

For missing values in variables 'nevents', 'ndays_act', 'nplay_video', 'nchapters', fill the NA with 0

In [29]:
# For missing values in variables 'nevents', 'ndays_act', 'nplay_video', 'nchapters', fill the NA with 0
df.nevents = df.nevents.fillna(0)
df.ndays_act = df.ndays_act.fillna(0)
df.nplay_video = df.nplay_video.fillna(0)
df.nchapters = df.nchapters.fillna(0)
df.isna().sum()

course_id                0
userid_DI                0
viewed                   0
explored                 0
certified                0
final_cc_cname_DI        0
LoE_DI                   0
YoB                      0
gender                   0
grade                    0
start_time_DI            0
last_event_DI        73490
nevents                  0
ndays_act                0
nplay_video              0
nchapters                0
nforum_posts             0
dtype: int64
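
The four fills can also be written as a single assignment over a list of columns. This is a minimal sketch on made-up values; `activity_cols` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: one active learner and one who only registered.
toy = pd.DataFrame({
    'nevents': [502.0, np.nan],
    'ndays_act': [16.0, np.nan],
    'nplay_video': [np.nan, np.nan],
    'nchapters': [12.0, np.nan],
    'last_event_DI': ['2013-05-23', None],
})

activity_cols = ['nevents', 'ndays_act', 'nplay_video', 'nchapters']

# Fill only the activity counters; other columns keep their missing values.
toy[activity_cols] = toy[activity_cols].fillna(0)

print(toy.isna().sum())
```

Restricting the fill to a named list keeps 'last_event_DI' untouched, so it can still be handled separately.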

Next we will fill the NAs in 'last_event_DI' with the value of 'start_time_DI' from the same record.

In [30]:
# fill the NAs in last_event_DI with start_time_DI from the same row
# (assigning the result back; an in-place fill on a selected column may not update df)
df['last_event_DI'] = df['last_event_DI'].fillna(df['start_time_DI'])

In [31]:
df.isna().sum()

course_id            0
userid_DI            0
viewed               0
explored             0
certified            0
final_cc_cname_DI    0
LoE_DI               0
YoB                  0
gender               0
grade                0
start_time_DI        0
last_event_DI        0
nevents              0
ndays_act            0
nplay_video          0
nchapters            0
nforum_posts         0
dtype: int64

#### Data Type

Convert variable 'grade' from object type to numerical type

In [32]:
# Convert variable 'grade' from object type to numerical type
df['grade'] = pd.to_numeric(df.grade)

In [33]:
df.grade.unique()

array([0.  , 1.  , 0.01, 0.29, 0.3 , 0.18, 0.1 , 0.04, 0.11, 0.02, 0.13,
       0.92, 0.09, 0.89, 0.79, 0.99, 0.93, 0.52, 0.05, 0.15, 0.9 , 0.38,
       0.43, 0.22, 0.45, 0.07, 0.85, 0.12, 0.03, 0.83, 0.84, 0.77, 0.96,
       0.33, 0.06, 0.94, 0.6 , 0.87, 0.2 , 0.88, 0.23, 0.91, 0.34, 0.08,
       0.39, 0.98, 0.75, 0.57, 0.8 , 0.95, 0.97, 0.65, 0.7 , 0.37, 0.26,
       0.76, 0.35, 0.31, 0.36, 0.47, 0.16, 0.71, 0.48, 0.66, 0.42, 0.72,
       0.69, 0.4 , 0.14, 0.63, 0.25, 0.51, 0.28, 0.62, 0.54, 0.86, 0.56,
       0.24, 0.61, 0.82, 0.53, 0.17, 0.21, 0.49, 0.67, 0.55, 0.32, 0.64,
       0.78, 0.73, 0.81, 0.5 , 0.46, 0.19, 0.27, 0.68, 0.58, 0.44, 0.74,
       0.59, 0.41])

In [34]:
df.grade.describe()

count    402750.000000
mean          0.038274
std           0.159851
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: grade, dtype: float64

After cleaning, the minimum grade is 0 and the maximum is 1, which looks good.

Convert 'start_time_DI' and 'last_event_DI' from object to date type.

In [35]:
# Convert 'start_time_DI' and 'last_event_DI' from object to date
df['start_time_DI'] = pd.to_datetime(df['start_time_DI'])
df['last_event_DI'] = pd.to_datetime(df['last_event_DI'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 402750 entries, 19330 to 641122
Data columns (total 17 columns):
course_id            402750 non-null object
userid_DI            402750 non-null object
viewed               402750 non-null int64
explored             402750 non-null int64
certified            402750 non-null int64
final_cc_cname_DI    402750 non-null object
LoE_DI               402750 non-null object
YoB                  402750 non-null float64
gender               402750 non-null object
grade                402750 non-null float64
start_time_DI        402750 non-null datetime64[ns]
last_event_DI        402750 non-null datetime64[ns]
nevents              402750 non-null float64
ndays_act            402750 non-null float64
nplay_video          402750 non-null float64
nchapters            402750 non-null float64
nforum_posts         402750 non-null int64
dtypes: datetime64[ns](2), float64(6), int64(4), object(5)
memory usage: 55.3+ MB
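
After a date conversion it can be useful to check that nothing failed to parse and that the two dates are in a sensible order. The sketch below is one way to do this, shown on made-up dates that deliberately include one invalid string and one out-of-order pair; the explicit `format` and `errors='coerce'` arguments are choices made for this sketch, not what the cell above uses.

```python
import pandas as pd

# Made-up dates: row 1 is out of order, row 2 has an invalid date string.
toy = pd.DataFrame({
    'start_time_DI': ['2012-09-17', '2013-01-01', '2013-06-17'],
    'last_event_DI': ['2013-05-23', '2012-12-25', '2013-13-40'],
})

date_cols = ['start_time_DI', 'last_event_DI']
for col in date_cols:
    # With errors='coerce', strings that do not match the format become NaT.
    toy[col] = pd.to_datetime(toy[col], format='%Y-%m-%d', errors='coerce')

# Values that could not be parsed.
n_unparsed = int(toy[date_cols].isna().sum().sum())

# Records whose last event precedes registration (comparisons with NaT are False).
n_out_of_order = int((toy['last_event_DI'] < toy['start_time_DI']).sum())

print(n_unparsed, n_out_of_order)
```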


In [36]:
df.describe()

Unnamed: 0,viewed,explored,certified,YoB,grade,nevents,ndays_act,nplay_video,nchapters,nforum_posts
count,402750.0,402750.0,402750.0,402750.0,402750.0,402750.0,402750.0,402750.0,402750.0,402750.0
mean,0.594639,0.070977,0.033949,1985.252457,0.038274,343.834975,4.611816,38.21091,2.291404,0.015876
std,0.490962,0.256787,0.181098,8.82012,0.159851,1302.267213,10.550842,213.796306,4.158952,0.170387
min,0.0,0.0,0.0,1931.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1982.0,0.0,1.0,1.0,0.0,0.0,0.0
50%,1.0,0.0,0.0,1988.0,0.0,9.0,1.0,0.0,1.0,0.0
75%,1.0,0.0,0.0,1991.0,0.0,100.0,3.0,5.0,3.0,0.0
max,1.0,1.0,1.0,2002.0,1.0,53180.0,205.0,34596.0,47.0,6.0


In [37]:
df.shape
print('After cleaning, our dataset has {} records, with {} variables'.format(df.shape[0], df.shape[1]))

After cleaning, our dataset has 402750 records, with 17 variables


#### Variable Names

Some variable names are not very intuitive, so we will rename them to be more concise and intuitive.

In [38]:
# get the names of variables
df.columns

Index(['course_id', 'userid_DI', 'viewed', 'explored', 'certified',
       'final_cc_cname_DI', 'LoE_DI', 'YoB', 'gender', 'grade',
       'start_time_DI', 'last_event_DI', 'nevents', 'ndays_act', 'nplay_video',
       'nchapters', 'nforum_posts'],
      dtype='object')

In [39]:
# rename the columns
df.columns = ['course_id', 'user_id', 'viewed', 'explored', 'certified',
       'country', 'education', 'YoB', 'gender', 'grade',
       'time_registered', 'last_event', 'nevents', 'ndays_act', 'nplay_video',
       'nchapters', 'nforum_posts']
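
Assigning a full list to `df.columns` depends on the column order staying exactly as printed above. A mapping-based rename is a possible alternative that only touches the named columns. The sketch below applies the same new names to an empty toy frame; `new_names` is a name introduced here for illustration.

```python
import pandas as pd

# Empty toy frame with a subset of the original column names.
toy = pd.DataFrame(columns=['course_id', 'userid_DI', 'final_cc_cname_DI', 'LoE_DI',
                            'start_time_DI', 'last_event_DI', 'grade'])

# Old name -> new name; columns not listed keep their names.
new_names = {
    'userid_DI': 'user_id',
    'final_cc_cname_DI': 'country',
    'LoE_DI': 'education',
    'start_time_DI': 'time_registered',
    'last_event_DI': 'last_event',
}

toy = toy.rename(columns=new_names)

print(list(toy.columns))
```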

In [40]:
df_type = pd.DataFrame(df.dtypes).reset_index()
df_type.columns = ['Variable', 'Type']
df_type.reset_index(drop=True)

Unnamed: 0,Variable,Type
0,course_id,object
1,user_id,object
2,viewed,int64
3,explored,int64
4,certified,int64
5,country,object
6,education,object
7,YoB,float64
8,gender,object
9,grade,float64


### Saving Cleaned Data
We will save the cleaned dataset to 'hmx_cleaned.csv' for later exploration.

In [41]:
df.to_csv('hmx_cleaned.csv', index=False)
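
A CSV file does not store data types, so the date columns will come back as plain text unless they are parsed again on load. The sketch below illustrates this with a one-row made-up frame and an in-memory buffer standing in for 'hmx_cleaned.csv'.

```python
import io
import pandas as pd

# One made-up cleaned record with real datetime columns.
toy = pd.DataFrame({
    'user_id': ['MHxPC130275857'],
    'grade': [0.0],
    'time_registered': pd.to_datetime(['2012-09-17']),
    'last_event': pd.to_datetime(['2013-05-23']),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading back without hints leaves the dates as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime type for the listed columns.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['time_registered', 'last_event'])

print(plain.dtypes, parsed.dtypes, sep='\n')
```

When loading the saved file for exploration, the equivalent call would be `pd.read_csv('hmx_cleaned.csv', parse_dates=['time_registered', 'last_event'])`.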