# Wrangle Pisa 2012
#### By Gabriela Sikora


### Abstract



### Table of Contents

<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
    <ul>
    <li><a href="#gathering">Gathering Data</a></li>
    <li><a href="#assessing1">Initial Assessment</a></li>
        <ul>
          <li><a href="#issues1">Identified Issues</a></li>
        </ul>
    <li><a href="#cleaning1">Cleaning Data</a></li>
    <li><a href="#assessing2">Final Assessment</a></li>
    </ul>
<li><a href="#sav">Store</a></li>
<li><a href="#conclusion">Conclusion</a></li>
<li><a href="#references">References</a></li>    
</ul>

<a id='intro'></a>
## Introduction



To begin, we first need to import all the relevant libraries:
#### Import

In [1]:
# import all packages
import numpy as np
import pandas as pd

<a id='wrangling'></a>
## Data Wrangling

In this section of the report, we will load in the data, check it for cleanliness, and then trim and clean the datasets for analysis. 

<a id='gathering'></a>
### Gathering Data

Unfortunately the Pisa dataset is too large to be uploaded to github, but if you would like to work with it yourself then place follow this link: https://s3.amazonaws.com/udacity-hosted-downloads/ud507/pisa2012.csv.zip 

In [2]:
# Encoding help found here: https://stackoverflow.com/questions/30462807/encoding-error-in-panda-read-csv

# Load in the Pisa dataframe
pisa = pd.read_csv('pisa2012.csv', encoding="cp1252", sep=',')

# Load in the Pisa dictionary
pisa_dict = pd.read_csv('pisadict2012.csv', encoding="cp1252", sep=',')

  interactivity=interactivity, compiler=compiler, result=result)


<a id='assessing1'></a>
### First Assessment

In this section, we will do the preliminary visual and programmatic assessment of the pisa datasets to determine whether or not it holds any major quality or tidiness issues.

In [3]:
pisa

Unnamed: 0.1,Unnamed: 0,CNT,SUBNATIO,STRATUM,OECD,NC,SCHOOLID,STIDSTD,ST01Q01,ST02Q01,...,W_FSTR75,W_FSTR76,W_FSTR77,W_FSTR78,W_FSTR79,W_FSTR80,WVARSTRR,VAR_UNIT,SENWGT_STU,VER_STU
0,1,Albania,80000,ALB0006,Non-OECD,Albania,1,1,10,1.0,...,13.7954,13.9235,13.1249,13.1249,4.3389,13.0829,19,1,0.2098,22NOV13
1,2,Albania,80000,ALB0006,Non-OECD,Albania,1,2,10,1.0,...,13.7954,13.9235,13.1249,13.1249,4.3389,13.0829,19,1,0.2098,22NOV13
2,3,Albania,80000,ALB0006,Non-OECD,Albania,1,3,9,1.0,...,12.7307,12.7307,12.7307,12.7307,4.2436,12.7307,19,1,0.1999,22NOV13
3,4,Albania,80000,ALB0006,Non-OECD,Albania,1,4,9,1.0,...,12.7307,12.7307,12.7307,12.7307,4.2436,12.7307,19,1,0.1999,22NOV13
4,5,Albania,80000,ALB0006,Non-OECD,Albania,1,5,9,1.0,...,12.7307,12.7307,12.7307,12.7307,4.2436,12.7307,19,1,0.1999,22NOV13
5,6,Albania,80000,ALB0006,Non-OECD,Albania,1,6,9,1.0,...,12.7307,12.7307,12.7307,12.7307,4.2436,12.7307,19,1,0.1999,22NOV13
6,7,Albania,80000,ALB0006,Non-OECD,Albania,1,7,10,1.0,...,13.7954,13.9235,13.1249,13.1249,4.3389,13.0829,19,1,0.2098,22NOV13
7,8,Albania,80000,ALB0006,Non-OECD,Albania,1,8,10,1.0,...,14.4599,14.6374,15.8728,15.8728,5.2248,15.2579,19,1,0.2322,22NOV13
8,9,Albania,80000,ALB0006,Non-OECD,Albania,1,9,9,1.0,...,12.7307,12.7307,12.7307,12.7307,4.2436,12.7307,19,1,0.1999,22NOV13
9,10,Albania,80000,ALB0005,Non-OECD,Albania,2,10,10,1.0,...,3.3844,10.1533,3.3844,10.1533,10.1533,10.1533,74,2,0.1594,22NOV13


In [4]:
pisa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 485490 entries, 0 to 485489
Columns: 636 entries, Unnamed: 0 to VER_STU
dtypes: float64(250), int64(18), object(368)
memory usage: 2.3+ GB


In [5]:
pisa_dict

Unnamed: 0.1,Unnamed: 0,x
0,CNT,Country code 3-character
1,SUBNATIO,Adjudicated sub-region code 7-digit code (3-di...
2,STRATUM,Stratum ID 7-character (cnt + region ID + orig...
3,OECD,OECD country
4,NC,National Centre 6-digit Code
5,SCHOOLID,School ID 7-digit (region ID + stratum ID + 3-...
6,STIDSTD,Student ID
7,ST01Q01,International Grade
8,ST02Q01,National Study Programme
9,ST03Q01,Birth - Month


In [6]:
# Check for null values in relevant columns
columns = ['STIDSTD', 'ST01Q01', 'ST04Q01', 'ST13Q01', 
            'ST17Q01', 'ST28Q01', 'ST57Q01', 'ST57Q02', 'ST57Q03', 
            'ST57Q04', 'ST57Q05', 'MMINS', 'LMINS', 'SMINS',
            'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH',  
            'PV1READ', 'PV2READ',  'PV3READ', 'PV4READ', 'PV5READ',
            'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE']

for x in columns:
    print(x, pisa[x].isna().sum())

STIDSTD 0
ST01Q01 0
ST04Q01 0
ST13Q01 27511
ST17Q01 42229
ST28Q01 11725
ST57Q01 184123
ST57Q02 215682
ST57Q03 201677
ST57Q04 205833
ST57Q05 195988
MMINS 202187
LMINS 202624
SMINS 214576
PV1MATH 0
PV2MATH 0
PV3MATH 0
PV4MATH 0
PV5MATH 0
PV1READ 0
PV2READ 0
PV3READ 0
PV4READ 0
PV5READ 0
PV1SCIE 0
PV2SCIE 0
PV3SCIE 0
PV4SCIE 0
PV5SCIE 0


In [7]:
pisa['CNT'].value_counts()

Mexico                      33806
Italy                       31073
Spain                       25313
Canada                      21544
Brazil                      19204
Australia                   14481
United Kingdom              12659
United Arab Emirates        11500
Switzerland                 11229
Qatar                       10966
Colombia                     9073
Finland                      8829
Belgium                      8597
Denmark                      7481
Jordan                       7038
Chile                        6856
Thailand                     6606
Japan                        6351
Chinese Taipei               6046
Peru                         6035
Slovenia                     5911
Argentina                    5908
Kazakhstan                   5808
Portugal                     5722
Indonesia                    5622
Singapore                    5546
Macao-China                  5335
Czech Republic               5327
Uruguay                      5315
Bulgaria      

In [8]:
pisa.duplicated().sum()

0

<a id='issues1'></a>
### Identified Issues
Here we can see some of the issues noticed in the dataframes. These quality issues will be cleaned in the following section.

1. Narrow down the dataframe by only looking at variables of interest since there are more variables than needed
2. Make the variable names readable since they are codes that can only be read with the dictionary
3. Create an average of all Math, Reading, and Science scores since all 5 scores for each are not necessary
4. Missing values in relevant columns ('Mother Highest Schooling', 'Father Highest Schooling', 'How many books at home', 'Out-of-School Study Time - Homework', 'Out-of-School Study Time - Guided Homework', 'Out-of-School Study Time - Personal Tutor','Out-of-School Study Time - Commercial Company', 'Out-of-School Study Time - With Parent', 'Learning time (minutes per week)- Mathematics', 'Learning time (minutes per week)  - test language', 'Learning time (minutes per week) - Science')
5. Create a total for Out-of-School Study Time and Learning time (minutes per week)
6. Single countries have various entries in the country column (Florida (USA), Massachusetts (USA), Connecticut (USA) and United States of America)


<a id='cleaning1'></a>
### First Cleaning

In [9]:
# Create a copy to preserve the original dataframe
pisa_clean = pisa.copy()

### Narrow down the dataframe by only looking at variables of interest

There are 636 columns available in the original dataframe, which is far more than what is necessary. We will rather look at just the following variables.

In [10]:
pisa_clean = pisa_clean[['STIDSTD', 'ST04Q01', 'ST13Q01', 'ST14Q01', 'ST14Q02', 'ST14Q03', 'ST14Q04', 
                         'ST17Q01', 'ST18Q01', 'ST18Q02', 'ST18Q03', 'ST18Q04', 
                         'ST28Q01', 
                         'ST57Q01', 'ST57Q02', 'ST57Q03', 'ST57Q04', 'ST57Q05', 'MMINS', 
                         'LMINS', 'SMINS', 'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 
                         'PV5MATH', 'PV1READ', 'PV2READ',  'PV3READ', 'PV4READ', 'PV5READ',
                         'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE']]       

In [11]:
# Check to see if the columns are correctly reduced
list(pisa_clean)

['STIDSTD',
 'ST04Q01',
 'ST13Q01',
 'ST14Q01',
 'ST14Q02',
 'ST14Q03',
 'ST14Q04',
 'ST17Q01',
 'ST18Q01',
 'ST18Q02',
 'ST18Q03',
 'ST18Q04',
 'ST28Q01',
 'ST57Q01',
 'ST57Q02',
 'ST57Q03',
 'ST57Q04',
 'ST57Q05',
 'MMINS',
 'LMINS',
 'SMINS',
 'PV1MATH',
 'PV2MATH',
 'PV3MATH',
 'PV4MATH',
 'PV5MATH',
 'PV1READ',
 'PV2READ',
 'PV3READ',
 'PV4READ',
 'PV5READ',
 'PV1SCIE',
 'PV2SCIE',
 'PV3SCIE',
 'PV4SCIE',
 'PV5SCIE']

### Create an average of all Math, Reading, and Science scores

For each of the Math, Reading, and Science scores, there are 5 scores provided. The variation between these scores is not needed, so rather we will look at the average for each of these major categories and remove the individual scores.

In [12]:
# Create average of Math scores
pisa_clean['Average Math Score']    = (pisa_clean['PV1MATH'] + pisa_clean['PV2MATH'] + pisa_clean['PV3MATH'] + pisa_clean['PV4MATH'] + pisa_clean['PV5MATH']) / 5

# Create average of Reading scores
pisa_clean['Average Reading Score'] = (pisa_clean['PV1READ'] + pisa_clean['PV2READ'] + pisa_clean['PV3READ'] + pisa_clean['PV4READ'] + pisa_clean['PV5READ']) / 5

# Create average of Science scores
pisa_clean['Average Science Score'] = (pisa_clean['PV1SCIE'] + pisa_clean['PV2SCIE'] + pisa_clean['PV3SCIE'] + pisa_clean['PV4SCIE'] + pisa_clean['PV5SCIE']) / 5

# Create average score
pisa_clean['Average Total Score']   = (pisa_clean['Average Math Score'] + pisa_clean['Average Reading Score'] + pisa_clean['Average Science Score'] +pisa_clean['Average Math Score']) / 3

In [13]:
# Remove the individual scores for Math, Reading, and Science
pisa_clean.drop(columns=['PV1MATH',  'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH',   
                         'PV1READ', 'PV2READ',  'PV3READ', 'PV4READ',  'PV5READ',
                         'PV1SCIE',  'PV2SCIE', 'PV3SCIE',  'PV4SCIE', 'PV5SCIE'], inplace=True)

In [14]:
# Test to see accuracy of the Math score
print(pisa_clean['Average Math Score'].describe())
print('-'*22)

# Test to see accuracy of the Reading score
print(pisa_clean['Average Reading Score'].describe())
print('-'*22)

# Test to see accuracy of the Science score
print(pisa_clean['Average Science Score'].describe())

# Test to see accuracy of the Total score
print(pisa_clean['Average Total Score'].describe())

count    485490.000000
mean        469.651234
std         100.786610
min          54.767080
25%         396.019620
50%         465.734520
75%         540.123060
max         903.107960
Name: Average Math Score, dtype: float64
----------------------
count    485490.000000
mean        472.006964
std          98.863310
min           6.445400
25%         405.044200
50%         475.477980
75%         542.831195
max         849.359740
Name: Average Reading Score, dtype: float64
----------------------
count    485490.000000
mean        475.808094
std          97.998470
min          25.158540
25%         405.762800
50%         475.512860
75%         546.381920
max         857.832900
Name: Average Science Score, dtype: float64
count    485490.000000
mean        629.039175
std         128.839689
min         105.496473
25%         536.312167
50%         627.318087
75%         721.099990
max        1123.882273
Name: Average Total Score, dtype: float64


In [15]:
pisa_clean.sample(10)

Unnamed: 0,STIDSTD,ST04Q01,ST13Q01,ST14Q01,ST14Q02,ST14Q03,ST14Q04,ST17Q01,ST18Q01,ST18Q02,...,ST57Q03,ST57Q04,ST57Q05,MMINS,LMINS,SMINS,Average Math Score,Average Reading Score,Average Science Score,Average Total Score
437159,4154,Male,<ISCED level 2>,No,No,No,No,<ISCED level 2>,No,No,...,0.0,1.0,0.0,135.0,135.0,45.0,355.59276,326.93474,365.4793,467.86652
436644,3639,Male,<ISCED level 3A>,No,No,No,,<ISCED level 3A>,No,No,...,,,,,,,514.65184,463.34562,499.29132,663.980207
19543,3301,Female,<ISCED level 1>,,,,,,,,...,,,,,,,234.1564,247.19326,219.73144,311.745833
369007,412,Female,<ISCED level 3A>,No,Yes,No,No,<ISCED level 3A>,No,Yes,...,,,,,,,609.3706,771.67624,722.99512,904.470853
371359,2764,Female,"<ISCED level 3B, 3C>",No,No,No,No,"<ISCED level 3B, 3C>",No,No,...,0.0,0.0,1.0,180.0,180.0,180.0,452.8042,531.0797,535.84484,657.51098
423315,540,Female,<ISCED level 3A>,No,No,No,Yes,<ISCED level 3A>,No,No,...,,,,,,,604.22964,564.52016,600.46622,791.148553
180950,5276,Male,"<ISCED level 3B, 3C>",,Yes,,,<ISCED level 3A>,No,Yes,...,,,,,,,561.07648,524.29344,566.98992,737.812107
71769,16504,Male,She did not complete <ISCED level 1>,,,,,He did not complete <ISCED level 1>,,,...,,,,,,,293.97882,282.2664,317.26966,395.831233
434157,1152,Female,<ISCED level 3A>,No,No,No,Yes,<ISCED level 3A>,No,Yes,...,,,,,,,492.2963,509.63334,477.75084,657.325593
15284,10542,Male,<ISCED level 3A>,No,No,No,No,<ISCED level 3A>,,,...,,,,,,,352.71068,364.2252,343.47258,471.039713


### Make the variable names readable

In the original dataframe, the variable names are all codes and unreadable without the dictionary. Therefore, we will change them to accurately reflect the information within the column.

In [16]:
# Rename columns
pisa_clean.rename({'STIDSTD': 'Student ID',
                   'ST04Q01': 'Gender', 
                   'ST13Q01': 'Mother - Highest Schooling', 
                   'ST14Q01': 'Mother Qualifications - ISCED level 6',
                   'ST14Q02': 'Mother Qualifications - ISCED level 5A',
                   'ST14Q03': 'Mother Qualifications - ISCED level 5B',
                   'ST14Q04': 'Mother Qualifications - ISCED level 4',
                   'ST17Q01': 'Father - Highest Schooling',
                   'ST18Q01': 'Father Qualifications - ISCED level 6',
                   'ST18Q02': 'Father Qualifications - ISCED level 5A',
                   'ST18Q03': 'Father Qualifications - ISCED level 5B',
                   'ST18Q04': 'Father Qualifications - ISCED level 4',
                   'ST28Q01': 'How many books at home', 
                   'ST57Q01': 'Out-of-School Study Time - Homework',
                   'ST57Q02': 'Out-of-School Study Time - Guided Homework',
                   'ST57Q03': 'Out-of-School Study Time - Personal Tutor',
                   'ST57Q04': 'Out-of-School Study Time - Commercial Company',
                   'ST57Q05': 'Out-of-School Study Time - With Parent',
                   'MMINS': 'Learning time (minutes per week)- <Mathematics>',
                   'LMINS': 'Learning time (minutes per week)  - <test language>',
                   'SMINS': 'Learning time (minutes per week) - <Science>'}, axis='columns', inplace=True)

In [17]:
# Check to see if the columns are correctly renamed
list(pisa_clean)

['Student ID',
 'Gender',
 'Mother - Highest Schooling',
 'Mother Qualifications - ISCED level 6',
 'Mother Qualifications - ISCED level 5A',
 'Mother Qualifications - ISCED level 5B',
 'Mother Qualifications - ISCED level 4',
 'Father - Highest Schooling',
 'Father Qualifications - ISCED level 6',
 'Father Qualifications - ISCED level 5A',
 'Father Qualifications - ISCED level 5B',
 'Father Qualifications - ISCED level 4',
 'How many books at home',
 'Out-of-School Study Time - Homework',
 'Out-of-School Study Time - Guided Homework',
 'Out-of-School Study Time - Personal Tutor',
 'Out-of-School Study Time - Commercial Company',
 'Out-of-School Study Time - With Parent',
 'Learning time (minutes per week)- <Mathematics>',
 'Learning time (minutes per week)  - <test language>',
 'Learning time (minutes per week) - <Science>',
 'Average Math Score',
 'Average Reading Score',
 'Average Science Score',
 'Average Total Score']

# Null

In [18]:
pisa_clean.replace(["NaN"], np.nan, inplace = True)
pisa_clean.dropna(inplace = True)

### Education
Multiple columns with the single variable. 

http://uis.unesco.org/sites/default/files/documents/international-standard-classification-of-education-isced-2011-en.pdf

In [19]:
father_edu = pisa_clean.copy()

In [20]:
father_edu[['<ISCED level 1>', 
            '<ISCED level 2>', 
            '<ISCED level 3A>', 
            '<ISCED level 3B, 3C>', 
            'He did not complete <ISCED level 1>']] = pd.get_dummies(father_edu['Father - Highest Schooling'])

In [21]:
father_edu.rename({'He did not complete <ISCED level 1>': '<ISCED level 0>',
                   '<ISCED level 3A>': '<ISCED level 3>', 
                   '<ISCED level 3B, 3C>': '<ISCED level 3>',
                   'Father Qualifications - ISCED level 4': '<ISCED level 4>',
                   'Father Qualifications - ISCED level 5A': '<ISCED level 5>',
                   'Father Qualifications - ISCED level 5B': '<ISCED level 5>',
                   'Father Qualifications - ISCED level 6': '<ISCED level 6>'}, axis='columns', inplace=True)

In [22]:
father_edu['<ISCED level 0>'] = father_edu['<ISCED level 0>'].astype(str)
father_edu['<ISCED level 1>'] = father_edu['<ISCED level 1>'].astype(str)
father_edu['<ISCED level 2>'] = father_edu['<ISCED level 2>'].astype(str)
father_edu['<ISCED level 3>'] = father_edu['<ISCED level 3>'].astype(str)

In [23]:
father_edu['<ISCED level 0>'] = father_edu['<ISCED level 0>'].replace(regex='1', value='<ISCED level 0>')
father_edu['<ISCED level 1>'] = father_edu['<ISCED level 1>'].replace(regex='1', value='<ISCED level 1>')
father_edu['<ISCED level 2>'] = father_edu['<ISCED level 2>'].replace(regex='1', value='<ISCED level 2>')
father_edu['<ISCED level 3>'] = father_edu['<ISCED level 3>'].replace(regex='1', value='<ISCED level 3>')
father_edu['<ISCED level 4>'] = father_edu['<ISCED level 4>'].replace(regex='Yes', value='<ISCED level 4>')
father_edu['<ISCED level 5>'] = father_edu['<ISCED level 5>'].replace(regex='Yes', value='<ISCED level 5>')
father_edu['<ISCED level 6>'] = father_edu['<ISCED level 6>'].replace(regex='Yes', value='<ISCED level 6>')

In [24]:
father_edu.sample(10)

Unnamed: 0,Student ID,Gender,Mother - Highest Schooling,Mother Qualifications - ISCED level 6,Mother Qualifications - ISCED level 5A,Mother Qualifications - ISCED level 5B,Mother Qualifications - ISCED level 4,Father - Highest Schooling,<ISCED level 6>,<ISCED level 5>,...,Learning time (minutes per week) - <Science>,Average Math Score,Average Reading Score,Average Science Score,Average Total Score,<ISCED level 1>,<ISCED level 2>,<ISCED level 3>,<ISCED level 3>.1,<ISCED level 0>
148573,2991,Male,She did not complete <ISCED level 1>,No,No,No,No,<ISCED level 1>,No,No,...,275.0,453.58312,455.72714,494.90864,619.26734,<ISCED level 1>,0,0,0,0
23211,1061,Female,<ISCED level 3A>,No,No,No,No,<ISCED level 2>,No,No,...,180.0,370.62626,412.33032,389.63074,514.404527,0,<ISCED level 2>,0,0,0
62421,7156,Male,<ISCED level 3A>,No,No,No,No,<ISCED level 3A>,No,No,...,90.0,289.53888,340.56782,378.4409,432.695493,0,0,<ISCED level 3>,0,0
183432,7758,Female,<ISCED level 3A>,No,Yes,No,No,"<ISCED level 3B, 3C>",No,No,...,180.0,594.10342,612.0993,622.93922,807.748453,0,0,0,<ISCED level 3>,0
153207,7625,Female,<ISCED level 2>,No,No,No,No,<ISCED level 2>,No,No,...,150.0,471.73236,447.67714,451.36142,614.16776,0,<ISCED level 2>,0,0,0
48751,7365,Male,"<ISCED level 3B, 3C>",No,No,Yes,No,<ISCED level 3A>,No,<ISCED level 5>,...,200.0,666.23306,579.06628,633.2898,848.274067,0,0,<ISCED level 3>,0,0
249940,9351,Female,<ISCED level 3A>,No,Yes,No,No,<ISCED level 3A>,No,<ISCED level 5>,...,100.0,510.28976,581.59782,483.81202,695.329787,0,0,<ISCED level 3>,0,0
433166,161,Female,"<ISCED level 3B, 3C>",No,No,No,No,"<ISCED level 3B, 3C>",No,No,...,45.0,482.63748,485.248,492.5774,647.70012,0,0,0,<ISCED level 3>,0
235390,3364,Female,<ISCED level 3A>,No,No,No,Yes,<ISCED level 3A>,No,No,...,120.0,392.6702,435.0476,451.82768,557.405227,0,0,<ISCED level 3>,0,0
262258,21669,Female,<ISCED level 2>,No,No,No,Yes,<ISCED level 2>,No,No,...,110.0,484.74062,489.53728,471.22342,643.41398,0,<ISCED level 2>,0,0,0


In [25]:
father_edu = pd.melt(father_edu, 
                     id_vars=['Student ID', 'Gender',
                            'Mother - Highest Schooling', 
                            'Mother Qualifications - ISCED level 6', 
                            'Mother Qualifications - ISCED level 5A',
                            'Mother Qualifications - ISCED level 5B',
                            'Mother Qualifications - ISCED level 4',
                            'How many books at home',
                            'Out-of-School Study Time - Homework',
                            'Out-of-School Study Time - Guided Homework',
                            'Out-of-School Study Time - Personal Tutor',
                            'Out-of-School Study Time - Commercial Company',
                            'Out-of-School Study Time - With Parent',
                            'Learning time (minutes per week)- <Mathematics>',
                            'Learning time (minutes per week)  - <test language>',
                            'Learning time (minutes per week) - <Science>',
                            'Average Math Score',
                            'Average Reading Score',
                            'Average Science Score',
                            'Average Total Score'], var_name='Completed_edu - Father', value_name='Education - Father')


In [26]:
father_edu.sample(10)

Unnamed: 0,Student ID,Gender,Mother - Highest Schooling,Mother Qualifications - ISCED level 6,Mother Qualifications - ISCED level 5A,Mother Qualifications - ISCED level 5B,Mother Qualifications - ISCED level 4,How many books at home,Out-of-School Study Time - Homework,Out-of-School Study Time - Guided Homework,...,Out-of-School Study Time - With Parent,Learning time (minutes per week)- <Mathematics>,Learning time (minutes per week) - <test language>,Learning time (minutes per week) - <Science>,Average Math Score,Average Reading Score,Average Science Score,Average Total Score,Completed_edu - Father,Education - Father
102897,1624,Male,<ISCED level 2>,No,No,No,Yes,26-100 books,10.0,0.0,...,0.0,200.0,200.0,350.0,557.80496,532.31292,570.53338,739.485407,Father - Highest Schooling,"<ISCED level 3B, 3C>"
739764,844,Female,<ISCED level 2>,No,No,No,No,26-100 books,1.0,1.0,...,1.0,245.0,245.0,175.0,525.8685,516.06724,550.67138,706.15854,<ISCED level 2>,0
529019,19085,Male,<ISCED level 3A>,No,Yes,No,No,26-100 books,2.0,0.0,...,1.0,300.0,240.0,120.0,513.7171,509.69802,497.42634,678.186187,<ISCED level 4>,<ISCED level 4>
208301,3808,Male,"<ISCED level 3B, 3C>",No,No,No,No,11-25 books,0.0,0.0,...,0.0,180.0,180.0,180.0,434.26546,420.9227,445.48676,578.31346,<ISCED level 6>,No
181960,5608,Female,"<ISCED level 3B, 3C>",No,No,No,Yes,201-500 books,5.0,3.0,...,1.0,240.0,360.0,120.0,494.63312,582.78926,531.92838,701.32796,<ISCED level 6>,No
1127220,612,Female,<ISCED level 3A>,No,No,No,Yes,26-100 books,8.0,2.0,...,2.0,225.0,225.0,225.0,457.94516,478.0992,464.32302,619.437513,<ISCED level 0>,0
1083130,3984,Male,<ISCED level 2>,No,No,No,No,201-500 books,8.0,2.0,...,0.0,210.0,210.0,140.0,626.42934,543.37974,585.91942,794.052613,<ISCED level 0>,0
1038628,58,Female,<ISCED level 1>,No,No,No,No,0-10 books,5.0,1.0,...,0.0,135.0,135.0,135.0,405.67848,454.19048,436.53486,567.360767,<ISCED level 0>,0
911452,3640,Male,<ISCED level 2>,No,No,No,No,26-100 books,1.0,0.0,...,0.0,225.0,225.0,250.0,528.36112,567.2777,561.76798,728.589307,<ISCED level 3>,<ISCED level 3>
288177,1367,Female,<ISCED level 3A>,No,Yes,No,No,More than 500 books,20.0,1.0,...,0.0,180.0,180.0,180.0,546.66614,548.31622,541.71954,727.789347,<ISCED level 5>,<ISCED level 5>


In [27]:
father_edu = father_edu[(father_edu['Education - Father'] != '0')] 
father_edu = father_edu[(father_edu['Education - Father'] != 'No')]

In [28]:
father_edu.sample(70)

Unnamed: 0,Student ID,Gender,Mother - Highest Schooling,Mother Qualifications - ISCED level 6,Mother Qualifications - ISCED level 5A,Mother Qualifications - ISCED level 5B,Mother Qualifications - ISCED level 4,How many books at home,Out-of-School Study Time - Homework,Out-of-School Study Time - Guided Homework,...,Out-of-School Study Time - With Parent,Learning time (minutes per week)- <Mathematics>,Learning time (minutes per week) - <test language>,Learning time (minutes per week) - <Science>,Average Math Score,Average Reading Score,Average Science Score,Average Total Score,Completed_edu - Father,Education - Father
898853,1437,Male,<ISCED level 3A>,Yes,No,No,No,26-100 books,5.0,1.0,...,1.0,60.0,60.0,60.0,463.63140,570.88646,543.39800,680.515753,<ISCED level 3>,<ISCED level 3>
833448,3464,Female,<ISCED level 3A>,No,No,Yes,No,More than 500 books,3.0,1.0,...,1.0,180.0,270.0,180.0,471.34288,503.27884,437.18762,627.717407,<ISCED level 3>,<ISCED level 3>
99656,4593,Female,<ISCED level 3A>,No,No,No,No,11-25 books,7.0,0.0,...,0.0,180.0,180.0,90.0,599.47812,532.90660,577.52704,769.796627,Father - Highest Schooling,<ISCED level 1>
77407,30829,Male,<ISCED level 2>,No,No,No,No,26-100 books,8.0,0.0,...,0.0,180.0,360.0,120.0,528.90636,449.47196,534.91234,680.732340,Father - Highest Schooling,<ISCED level 2>
993125,627,Female,"<ISCED level 3B, 3C>",No,Yes,No,No,201-500 books,3.0,1.0,...,2.0,200.0,200.0,200.0,591.22138,602.64704,596.27004,793.786613,<ISCED level 3>,<ISCED level 3>
17527,3474,Female,<ISCED level 3A>,No,No,No,Yes,More than 500 books,4.0,0.0,...,1.0,180.0,180.0,180.0,523.14222,550.54030,522.41702,706.413920,Father - Highest Schooling,<ISCED level 3A>
808729,6942,Male,<ISCED level 3A>,No,Yes,No,No,11-25 books,4.0,1.0,...,0.0,250.0,200.0,100.0,528.51690,520.20352,522.51026,699.915860,<ISCED level 3>,<ISCED level 3>
537072,408,Female,<ISCED level 3A>,No,No,No,No,201-500 books,5.0,0.0,...,0.0,200.0,200.0,200.0,590.05296,545.93328,551.23088,759.090027,<ISCED level 4>,<ISCED level 4>
104444,1026,Male,<ISCED level 3A>,No,Yes,No,No,More than 500 books,11.0,2.0,...,3.0,225.0,135.0,360.0,549.70398,576.09908,594.03208,756.513040,Father - Highest Schooling,<ISCED level 3A>
328036,453,Female,"<ISCED level 3B, 3C>",No,Yes,No,No,11-25 books,28.0,0.0,...,0.0,200.0,160.0,320.0,553.52078,549.03108,594.40508,750.159240,<ISCED level 5>,<ISCED level 5>


In [29]:
father_edu = father_edu[(father_edu['Completed_edu - Father'] != 'Father - Highest Schooling')] 

In [30]:
father_edu['Education - Father'].value_counts()

<ISCED level 3>    84016
<ISCED level 5>    48776
<ISCED level 4>    26281
<ISCED level 2>    21813
<ISCED level 6>     6672
<ISCED level 1>     5901
<ISCED level 0>     2264
Name: Education - Father, dtype: int64

In [31]:
father_edu.drop(columns=['Completed_edu - Father'], axis=1, inplace=True)

In [32]:
father_edu.sample(10)

Unnamed: 0,Student ID,Gender,Mother - Highest Schooling,Mother Qualifications - ISCED level 6,Mother Qualifications - ISCED level 5A,Mother Qualifications - ISCED level 5B,Mother Qualifications - ISCED level 4,How many books at home,Out-of-School Study Time - Homework,Out-of-School Study Time - Guided Homework,...,Out-of-School Study Time - Commercial Company,Out-of-School Study Time - With Parent,Learning time (minutes per week)- <Mathematics>,Learning time (minutes per week) - <test language>,Learning time (minutes per week) - <Science>,Average Math Score,Average Reading Score,Average Science Score,Average Total Score,Education - Father
448249,1688,Female,<ISCED level 3A>,No,Yes,No,No,101-200 books,5.0,0.0,...,7.0,0.0,350.0,210.0,350.0,650.03116,600.8201,623.4987,841.460373,<ISCED level 5>
915143,11434,Male,<ISCED level 3A>,No,Yes,No,No,11-25 books,1.0,0.0,...,0.0,0.0,225.0,225.0,540.0,489.33636,514.99088,520.83178,671.49846,<ISCED level 3>
263884,4499,Female,<ISCED level 3A>,No,Yes,Yes,Yes,26-100 books,6.0,2.0,...,0.0,0.0,180.0,420.0,240.0,449.92212,488.0281,475.51288,621.128407,<ISCED level 5>
898380,173,Female,<ISCED level 3A>,No,No,No,Yes,26-100 books,1.0,1.0,...,0.0,1.0,225.0,225.0,225.0,492.76366,524.48694,507.49722,672.503827,<ISCED level 3>
845628,1349,Female,<ISCED level 3A>,Yes,Yes,No,No,201-500 books,1.0,1.0,...,0.0,1.0,180.0,135.0,180.0,575.25316,553.24092,541.71954,748.488927,<ISCED level 3>
819685,15710,Male,<ISCED level 3A>,No,No,No,No,More than 500 books,12.0,10.0,...,0.0,0.0,375.0,375.0,375.0,557.57126,557.01278,521.95078,731.368693,<ISCED level 3>
268369,10274,Female,<ISCED level 3A>,No,No,No,No,101-200 books,5.0,1.0,...,0.0,1.0,220.0,165.0,330.0,519.71492,523.29546,509.1757,690.633667,<ISCED level 5>
784806,1282,Male,<ISCED level 2>,No,No,No,No,0-10 books,2.0,1.0,...,0.0,1.0,250.0,250.0,250.0,382.6219,422.04546,413.50242,533.597227,<ISCED level 2>
886402,656,Female,"<ISCED level 3B, 3C>",No,No,No,Yes,26-100 books,4.0,0.0,...,0.0,1.0,180.0,180.0,120.0,353.80122,412.01262,395.6919,505.10232,<ISCED level 3>
721831,2425,Male,<ISCED level 3A>,No,No,No,No,0-10 books,3.0,2.0,...,0.0,0.0,240.0,240.0,240.0,453.97258,404.88382,443.34204,585.39034,<ISCED level 2>


In [33]:
father_edu.duplicated().sum()

6901

In [34]:
# https://stackoverflow.com/questions/33042777/removing-duplicates-from-pandas-dataframe-with-condition-for-retaining-original
father_edu['Education - Father'] = father_edu['Education - Father'].astype('category')

father_edu['Education - Father'] = father_edu['Education - Father'].cat.set_categories(['<ISCED level 0>', 
                                                                                        '<ISCED level 1>', 
                                                                                        '<ISCED level 2>', 
                                                                                        '<ISCED level 3>',
                                                                                        '<ISCED level 4>', 
                                                                                        '<ISCED level 5>', 
                                                                                        '<ISCED level 6>'], ordered=True)
father_edu.sort_values(['Education - Father'], inplace=True)
father_edu_clean = father_edu.drop_duplicates(subset='Student ID', keep='last')

In [35]:
father_edu_clean.duplicated().sum()

0

In [36]:
father_edu.shape

(195723, 21)

### Mother Edu

In [37]:
mother_edu = father_edu_clean.copy()

In [38]:
mother_edu.head()

Unnamed: 0,Student ID,Gender,Mother - Highest Schooling,Mother Qualifications - ISCED level 6,Mother Qualifications - ISCED level 5A,Mother Qualifications - ISCED level 5B,Mother Qualifications - ISCED level 4,How many books at home,Out-of-School Study Time - Homework,Out-of-School Study Time - Guided Homework,...,Out-of-School Study Time - Commercial Company,Out-of-School Study Time - With Parent,Learning time (minutes per week)- <Mathematics>,Learning time (minutes per week) - <test language>,Learning time (minutes per week) - <Science>,Average Math Score,Average Reading Score,Average Science Score,Average Total Score,Education - Father
1055573,6301,Female,She did not complete <ISCED level 1>,No,No,No,No,0-10 books,0.0,0.0,...,0.0,1.0,540.0,720.0,90.0,301.69028,311.29408,287.5233,400.732647,<ISCED level 0>
1048110,16914,Male,<ISCED level 3A>,No,Yes,No,No,26-100 books,0.0,0.0,...,0.0,0.0,280.0,280.0,280.0,497.43728,474.25206,511.5069,660.211173,<ISCED level 0>
1047956,16487,Male,<ISCED level 3A>,No,No,No,No,11-25 books,1.0,1.0,...,0.0,1.0,400.0,400.0,400.0,389.0092,406.7283,441.3838,542.0435,<ISCED level 0>
1047644,15611,Female,<ISCED level 3A>,No,No,No,No,26-100 books,2.0,0.0,...,0.0,0.0,375.0,375.0,375.0,383.7124,405.57868,411.7307,528.244727,<ISCED level 0>
1047154,14175,Male,<ISCED level 3A>,No,Yes,No,No,201-500 books,1.0,1.0,...,0.0,1.0,400.0,0.0,0.0,636.16608,622.45152,627.97466,840.919447,<ISCED level 0>


In [39]:
mother_edu['Mother - Highest Schooling'].value_counts()

<ISCED level 3A>                         15433
<ISCED level 3B, 3C>                      3798
<ISCED level 2>                           3605
<ISCED level 1>                            532
She did not complete <ISCED level 1>       152
Name: Mother - Highest Schooling, dtype: int64

In [40]:
pisa_clean['Mother - Highest Schooling'].value_counts()

<ISCED level 3A>                         63289
<ISCED level 3B, 3C>                     23470
<ISCED level 2>                          19733
<ISCED level 1>                           5149
She did not complete <ISCED level 1>      2353
Name: Mother - Highest Schooling, dtype: int64

In [41]:
mother_edu[['<ISCED level 1>', 
            '<ISCED level 2>', 
            '<ISCED level 3A>', 
            '<ISCED level 3B, 3C>', 
            'She did not complete <ISCED level 1>']] = pd.get_dummies(mother_edu['Mother - Highest Schooling'])

In [42]:
mother_edu.rename({'She did not complete <ISCED level 1>': '<ISCED level 0>',
                   '<ISCED level 3A>': '<ISCED level 3>', 
                   '<ISCED level 3B, 3C>': '<ISCED level 3>',
                   'Mother Qualifications - ISCED level 4': '<ISCED level 4>',
                   'Mother Qualifications - ISCED level 5A': '<ISCED level 5>',
                   'Mother Qualifications - ISCED level 5B': '<ISCED level 5>',
                   'Mother Qualifications - ISCED level 6': '<ISCED level 6>'}, axis='columns', inplace=True)

In [43]:
mother_edu['<ISCED level 0>'] = mother_edu['<ISCED level 0>'].astype(str)
mother_edu['<ISCED level 1>'] = mother_edu['<ISCED level 1>'].astype(str)
mother_edu['<ISCED level 2>'] = mother_edu['<ISCED level 2>'].astype(str)
mother_edu['<ISCED level 3>'] = mother_edu['<ISCED level 3>'].astype(str)

In [44]:
mother_edu['<ISCED level 0>'] = mother_edu['<ISCED level 0>'].replace(regex='1', value='<ISCED level 0>')
mother_edu['<ISCED level 1>'] = mother_edu['<ISCED level 1>'].replace(regex='1', value='<ISCED level 1>')
mother_edu['<ISCED level 2>'] = mother_edu['<ISCED level 2>'].replace(regex='1', value='<ISCED level 2>')
mother_edu['<ISCED level 3>'] = mother_edu['<ISCED level 3>'].replace(regex='1', value='<ISCED level 3>')
mother_edu['<ISCED level 4>'] = mother_edu['<ISCED level 4>'].replace(regex='Yes', value='<ISCED level 4>')
mother_edu['<ISCED level 5>'] = mother_edu['<ISCED level 5>'].replace(regex='Yes', value='<ISCED level 5>')
mother_edu['<ISCED level 6>'] = mother_edu['<ISCED level 6>'].replace(regex='Yes', value='<ISCED level 6>')

In [45]:
mother_edu.sample(10)

Unnamed: 0,Student ID,Gender,Mother - Highest Schooling,<ISCED level 6>,<ISCED level 5>,<ISCED level 5>.1,<ISCED level 4>,How many books at home,Out-of-School Study Time - Homework,Out-of-School Study Time - Guided Homework,...,Average Math Score,Average Reading Score,Average Science Score,Average Total Score,Education - Father,<ISCED level 1>,<ISCED level 2>,<ISCED level 3>,<ISCED level 3>.1,<ISCED level 0>
870885,18748,Male,<ISCED level 3A>,No,No,No,No,11-25 books,9.0,0.0,...,569.02166,544.18166,600.1865,760.803827,<ISCED level 3>,0,0,<ISCED level 3>,0,0
210761,4342,Female,<ISCED level 3A>,<ISCED level 6>,<ISCED level 5>,<ISCED level 5>,No,11-25 books,10.0,0.0,...,661.16996,634.18112,605.68816,854.069733,<ISCED level 6>,0,0,<ISCED level 3>,0,0
248339,11624,Male,<ISCED level 3A>,<ISCED level 6>,<ISCED level 5>,No,No,0-10 books,3.0,1.0,...,425.85294,381.46704,399.04886,544.073927,<ISCED level 5>,0,0,<ISCED level 3>,0,0
302277,22418,Male,<ISCED level 3A>,No,<ISCED level 5>,No,No,More than 500 books,30.0,0.0,...,509.90028,542.81838,544.7967,702.47188,<ISCED level 5>,0,0,<ISCED level 3>,0,0
178084,2244,Female,<ISCED level 3A>,<ISCED level 6>,<ISCED level 5>,No,No,101-200 books,4.0,2.0,...,524.07696,504.62918,497.70608,683.496393,<ISCED level 6>,0,0,<ISCED level 3>,0,0
281689,9405,Male,"<ISCED level 3B, 3C>",No,No,<ISCED level 5>,<ISCED level 4>,26-100 books,1.0,0.0,...,569.8785,557.0128,558.1313,751.6337,<ISCED level 5>,0,0,0,<ISCED level 3>,0
874111,27372,Male,"<ISCED level 3B, 3C>",No,<ISCED level 5>,No,No,26-100 books,4.0,1.0,...,456.85466,452.84012,426.65052,597.73332,<ISCED level 3>,0,0,0,<ISCED level 3>,0
385728,20642,Male,<ISCED level 3A>,No,No,No,<ISCED level 4>,201-500 books,8.0,1.0,...,438.16016,413.5448,435.7889,575.218007,<ISCED level 5>,0,0,<ISCED level 3>,0,0
363127,13981,Female,<ISCED level 3A>,No,<ISCED level 5>,No,No,201-500 books,6.0,0.0,...,586.85932,574.13128,639.44424,795.76472,<ISCED level 5>,0,0,<ISCED level 3>,0,0
217681,3824,Female,<ISCED level 3A>,<ISCED level 6>,<ISCED level 5>,<ISCED level 5>,<ISCED level 4>,201-500 books,20.0,0.0,...,595.19398,654.35664,552.62962,799.12474,<ISCED level 6>,0,0,<ISCED level 3>,0,0


In [46]:
mother_edu = pd.melt(mother_edu, 
                     id_vars=['Student ID', 'Gender',
                            'How many books at home',
                            'Out-of-School Study Time - Homework',
                            'Out-of-School Study Time - Guided Homework',
                            'Out-of-School Study Time - Personal Tutor',
                            'Out-of-School Study Time - Commercial Company',
                            'Out-of-School Study Time - With Parent',
                            'Learning time (minutes per week)- <Mathematics>',
                            'Learning time (minutes per week)  - <test language>',
                            'Learning time (minutes per week) - <Science>',
                            'Average Math Score',
                            'Average Reading Score',
                            'Average Science Score',
                            'Average Total Score',
                             'Education - Father'], var_name='Completed_edu - Mother', value_name='Education - Mother')


In [47]:
mother_edu.sample(10)

Unnamed: 0,Student ID,Gender,How many books at home,Out-of-School Study Time - Homework,Out-of-School Study Time - Guided Homework,Out-of-School Study Time - Personal Tutor,Out-of-School Study Time - Commercial Company,Out-of-School Study Time - With Parent,Learning time (minutes per week)- <Mathematics>,Learning time (minutes per week) - <test language>,Learning time (minutes per week) - <Science>,Average Math Score,Average Reading Score,Average Science Score,Average Total Score,Education - Father,Completed_edu - Mother,Education - Mother
231297,11342,Female,11-25 books,3.0,2.0,2.0,1.0,1.0,225.0,45.0,90.0,365.64106,394.6172,364.3603,496.753207,<ISCED level 6>,<ISCED level 0>,0
129125,10887,Female,101-200 books,12.0,0.0,0.0,0.0,1.0,300.0,240.0,420.0,610.8506,614.87942,636.27376,824.284793,<ISCED level 5>,<ISCED level 1>,0
181962,4682,Male,26-100 books,3.0,0.0,0.0,0.0,0.0,200.0,160.0,0.0,576.81102,583.47698,614.17382,783.757613,<ISCED level 5>,<ISCED level 3>,<ISCED level 3>
95094,24426,Female,0-10 books,6.0,0.0,0.0,0.0,0.0,150.0,200.0,400.0,474.77024,439.97232,487.6352,625.716,<ISCED level 2>,<ISCED level 4>,No
216823,17743,Male,26-100 books,2.0,0.0,0.0,0.0,0.0,300.0,375.0,300.0,506.0835,534.31776,550.85788,699.114213,<ISCED level 3>,<ISCED level 0>,0
12461,2578,Male,26-100 books,2.0,0.0,0.0,0.0,0.0,270.0,270.0,405.0,599.78968,516.274,573.79708,763.216813,<ISCED level 5>,Mother - Highest Schooling,<ISCED level 3A>
175667,7029,Male,101-200 books,1.0,0.0,0.0,0.0,0.0,360.0,315.0,450.0,483.26066,501.7588,530.99588,666.425333,<ISCED level 5>,<ISCED level 3>,0
1955,14789,Male,26-100 books,10.0,9.0,0.0,0.0,0.0,240.0,420.0,180.0,465.65664,417.31396,407.62776,585.418333,<ISCED level 2>,Mother - Highest Schooling,<ISCED level 3A>
195689,20872,Male,201-500 books,15.0,0.0,0.0,1.0,1.0,231.0,154.0,231.0,653.22482,644.104,640.46994,863.674527,<ISCED level 4>,<ISCED level 3>,0
172777,9845,Male,101-200 books,2.0,0.0,0.0,0.0,1.0,150.0,150.0,200.0,365.17372,370.80114,443.99478,515.047787,<ISCED level 4>,<ISCED level 3>,0


In [48]:
mother_edu = mother_edu[(mother_edu['Education - Mother'] != '0')] 
mother_edu = mother_edu[(mother_edu['Education - Mother'] != 'No')]

In [49]:
mother_edu.sample(70)

Unnamed: 0,Student ID,Gender,How many books at home,Out-of-School Study Time - Homework,Out-of-School Study Time - Guided Homework,Out-of-School Study Time - Personal Tutor,Out-of-School Study Time - Commercial Company,Out-of-School Study Time - With Parent,Learning time (minutes per week)- <Mathematics>,Learning time (minutes per week) - <test language>,Learning time (minutes per week) - <Science>,Average Math Score,Average Reading Score,Average Science Score,Average Total Score,Education - Father,Completed_edu - Mother,Education - Mother
172195,20134,Male,11-25 books,4.0,1.0,0.0,0.0,0.0,325.0,325.0,325.0,326.61630,388.20336,380.02612,473.820693,<ISCED level 4>,<ISCED level 3>,<ISCED level 3>
9113,8694,Male,101-200 books,3.0,2.0,0.0,8.0,0.0,15.0,0.0,0.0,384.10188,380.34428,431.87244,526.806827,<ISCED level 5>,Mother - Highest Schooling,<ISCED level 3A>
45665,4651,Female,101-200 books,17.0,7.0,0.0,0.0,3.0,520.0,260.0,260.0,519.32544,541.40570,551.32416,710.460247,<ISCED level 6>,<ISCED level 6>,<ISCED level 6>
4844,16070,Male,11-25 books,2.0,0.0,0.0,0.0,0.0,180.0,180.0,120.0,491.82894,485.47930,502.55500,657.230727,<ISCED level 3>,Mother - Highest Schooling,<ISCED level 3A>
111685,5071,Male,26-100 books,5.0,0.0,1.0,0.0,0.0,210.0,210.0,300.0,562.47860,604.40774,612.58860,780.651180,<ISCED level 5>,<ISCED level 4>,<ISCED level 4>
14880,5783,Female,11-25 books,2.0,2.0,0.0,0.0,2.0,180.0,240.0,120.0,363.07056,366.73692,369.20926,487.362433,<ISCED level 5>,Mother - Highest Schooling,<ISCED level 3A>
8258,19102,Female,101-200 books,4.0,1.0,6.0,6.0,0.0,180.0,135.0,0.0,487.23320,508.75954,511.87990,665.035280,<ISCED level 4>,Mother - Highest Schooling,"<ISCED level 3B, 3C>"
861,14332,Female,26-100 books,10.0,0.0,0.0,0.0,0.0,300.0,300.0,300.0,651.51114,640.29732,649.79486,864.371487,<ISCED level 2>,Mother - Highest Schooling,<ISCED level 3A>
194508,31064,Female,201-500 books,20.0,6.0,2.0,0.0,1.0,240.0,300.0,60.0,522.75276,520.27706,528.38494,698.055840,<ISCED level 3>,<ISCED level 3>,<ISCED level 3>
91758,20337,Male,26-100 books,1.0,1.0,5.0,0.0,1.0,240.0,180.0,180.0,460.43778,473.45012,446.51248,613.612720,<ISCED level 6>,<ISCED level 5>,<ISCED level 5>


In [50]:
mother_edu = mother_edu[(mother_edu['Completed_edu - Mother'] != 'Mother - Highest Schooling')] 

In [51]:
mother_edu['Education - Mother'].value_counts()

<ISCED level 3>    19231
<ISCED level 5>    14533
<ISCED level 4>     5094
<ISCED level 2>     3605
<ISCED level 6>     2611
<ISCED level 1>      532
<ISCED level 0>      152
Name: Education - Mother, dtype: int64

In [52]:
mother_edu.drop(columns=['Completed_edu - Mother'], axis=1, inplace=True)

In [53]:
mother_edu.sample(10)

Unnamed: 0,Student ID,Gender,How many books at home,Out-of-School Study Time - Homework,Out-of-School Study Time - Guided Homework,Out-of-School Study Time - Personal Tutor,Out-of-School Study Time - Commercial Company,Out-of-School Study Time - With Parent,Learning time (minutes per week)- <Mathematics>,Learning time (minutes per week) - <test language>,Learning time (minutes per week) - <Science>,Average Math Score,Average Reading Score,Average Science Score,Average Total Score,Education - Father,Education - Mother
172201,20555,Female,101-200 books,15.0,2.0,0.0,0.0,0.0,375.0,300.0,375.0,609.13696,655.23038,546.56844,806.690913,<ISCED level 4>,<ISCED level 3>
196447,18619,Male,101-200 books,10.0,0.0,7.0,0.0,0.0,240.0,240.0,180.0,440.96432,423.32854,470.85042,592.035867,<ISCED level 4>,<ISCED level 3>
68874,2823,Female,11-25 books,5.0,5.0,3.0,2.0,2.0,265.0,212.0,265.0,382.15454,491.84076,414.24842,556.79942,<ISCED level 6>,<ISCED level 5>
110542,8527,Female,26-100 books,10.0,5.0,7.0,2.0,10.0,200.0,200.0,250.0,524.77802,556.89478,553.65534,720.035387,<ISCED level 5>,<ISCED level 4>
101135,20265,Male,26-100 books,1.0,0.0,0.0,0.0,0.0,250.0,250.0,100.0,529.21796,469.60078,510.38792,679.474873,<ISCED level 4>,<ISCED level 4>
171354,30893,Male,201-500 books,2.0,0.0,0.0,0.0,0.0,100.0,150.0,0.0,568.78796,522.28856,565.68444,741.84964,<ISCED level 4>,<ISCED level 3>
168533,13055,Female,More than 500 books,5.0,2.0,0.0,0.0,2.0,150.0,200.0,100.0,482.55962,542.3589,566.61692,691.36502,<ISCED level 3>,<ISCED level 3>
172258,19745,Male,11-25 books,2.0,2.0,0.0,0.0,0.0,195.0,325.0,260.0,529.6074,552.1209,541.16002,717.498573,<ISCED level 4>,<ISCED level 3>
55825,17802,Male,201-500 books,0.0,0.0,1.0,0.0,0.0,300.0,200.0,100.0,614.27792,568.5608,597.48228,798.19964,<ISCED level 5>,<ISCED level 5>
93311,15497,Male,201-500 books,7.0,2.0,0.0,0.0,2.0,150.0,250.0,100.0,479.13226,437.52298,535.47184,643.753113,<ISCED level 6>,<ISCED level 5>


In [75]:
mother_edu['Student ID'].duplicated().sum()

22238

In [55]:
# mother_edu.replace(["NaN"], np.nan, inplace = True)
# mother_edu.dropna(inplace = True)

In [77]:
mother_edu['Education - Mother'] = mother_edu['Education - Mother'].astype('category')

mother_edu['Education - Mother'] = mother_edu['Education - Mother'].cat.set_categories(['<ISCED level 0>', 
                                                                                        '<ISCED level 1>', 
                                                                                        '<ISCED level 2>', 
                                                                                        '<ISCED level 3>',
                                                                                        '<ISCED level 4>', 
                                                                                        '<ISCED level 5>', 
                                                                                        '<ISCED level 6>'], ordered=True)
mother_edu.sort_values(['Education - Mother'], inplace=True)
mother_edu_clean = mother_edu.drop_duplicates(subset='Student ID', keep='last')

In [78]:
mother_edu['Student ID'].duplicated().sum()

22238

In [58]:
mother_edu.shape

(45758, 17)

In [59]:
mother_edu_clean.shape

(23520, 17)

In [60]:
mother_edu_clean.duplicated().sum()

0

In [61]:
mother_edu_clean.sample(10)

Unnamed: 0,Student ID,Gender,How many books at home,Out-of-School Study Time - Homework,Out-of-School Study Time - Guided Homework,Out-of-School Study Time - Personal Tutor,Out-of-School Study Time - Commercial Company,Out-of-School Study Time - With Parent,Learning time (minutes per week)- <Mathematics>,Learning time (minutes per week) - <test language>,Learning time (minutes per week) - <Science>,Average Math Score,Average Reading Score,Average Science Score,Average Total Score,Education - Father,Education - Mother
65070,12115,Male,0-10 books,1.0,1.0,0.0,0.0,0.0,120.0,120.0,150.0,393.21546,373.92872,353.0772,504.478947,<ISCED level 5>,<ISCED level 5>
61797,24192,Female,0-10 books,2.0,0.0,0.0,0.0,0.0,240.0,240.0,240.0,526.258,485.00972,556.07982,697.868513,<ISCED level 5>,<ISCED level 5>
79846,10937,Male,101-200 books,9.0,0.0,0.0,0.0,0.0,300.0,300.0,300.0,587.3267,578.8257,588.71688,780.731993,<ISCED level 5>,<ISCED level 5>
45279,6206,Female,201-500 books,1.0,1.0,2.0,2.0,0.0,250.0,150.0,250.0,323.50054,431.07606,358.95188,479.009673,<ISCED level 6>,<ISCED level 6>
104101,24966,Female,101-200 books,4.0,0.0,0.0,0.0,0.0,165.0,220.0,330.0,567.0743,567.4591,496.4006,732.669433,<ISCED level 5>,<ISCED level 4>
46814,2350,Male,26-100 books,12.0,8.0,4.0,6.0,12.0,240.0,240.0,180.0,395.3186,374.57028,394.1999,519.80246,<ISCED level 6>,<ISCED level 6>
82751,9239,Female,101-200 books,4.0,2.0,0.0,0.0,1.0,300.0,300.0,300.0,628.4546,610.03412,627.69488,831.546067,<ISCED level 5>,<ISCED level 5>
166578,14448,Female,26-100 books,19.0,19.0,0.0,0.0,2.0,165.0,220.0,110.0,380.67458,440.44892,402.77884,534.858973,<ISCED level 2>,<ISCED level 3>
100805,30765,Male,26-100 books,7.0,5.0,2.0,0.0,5.0,200.0,200.0,100.0,446.49478,449.31158,486.70272,609.667953,<ISCED level 4>,<ISCED level 4>
63463,4231,Male,More than 500 books,2.0,0.0,0.0,0.0,0.0,150.0,200.0,150.0,588.33932,562.22546,554.77434,764.55948,<ISCED level 5>,<ISCED level 5>


In [62]:
pisa_clean = mother_edu_clean.copy()

In [63]:
pisa_clean.shape

(23520, 17)

In [64]:
pisa_clean = mother_edu.copy()

In [65]:
pisa_clean.shape

(45758, 17)

### 4. Missing values in relevant columns 
('Mother Highest Schooling', 'Father Highest Schooling', 'How many books at home', 'Out-of-School Study Time - Homework', 'Out-of-School Study Time - Guided Homework', 'Out-of-School Study Time - Personal Tutor','Out-of-School Study Time - Commercial Company', 'Out-of-School Study Time - With Parent', 'Learning time (minutes per week)- Mathematics', 'Learning time (minutes per week)  - test language', 'Learning time (minutes per week) - Science')

Due to the amount of rows in this dataframe, we can comfortably remove the rows with missing values.

In [66]:

# pisa_clean.replace(["NaN"], np.nan, inplace = True)
# pisa_clean.dropna(inplace = True)

In [67]:
# Check to see if null values correctly dropped
pisa_clean

Unnamed: 0,Student ID,Gender,How many books at home,Out-of-School Study Time - Homework,Out-of-School Study Time - Guided Homework,Out-of-School Study Time - Personal Tutor,Out-of-School Study Time - Commercial Company,Out-of-School Study Time - With Parent,Learning time (minutes per week)- <Mathematics>,Learning time (minutes per week) - <test language>,Learning time (minutes per week) - <Science>,Average Math Score,Average Reading Score,Average Science Score,Average Total Score,Education - Father,Education - Mother
235181,1647,Male,0-10 books,2.0,1.0,0.0,0.0,0.0,150.0,200.0,350.0,449.53266,478.18156,474.20738,617.151420,<ISCED level 6>,<ISCED level 0>
212053,14280,Male,11-25 books,3.0,0.0,0.0,0.0,0.0,160.0,160.0,120.0,399.21326,383.87284,344.96456,509.087973,<ISCED level 1>,<ISCED level 0>
212039,18805,Female,0-10 books,0.0,0.0,0.0,0.0,0.0,150.0,150.0,150.0,438.93906,495.01800,456.39686,609.764327,<ISCED level 1>,<ISCED level 0>
212037,18913,Male,11-25 books,2.0,1.0,0.0,0.0,1.0,200.0,150.0,150.0,486.99952,524.61420,498.26558,665.626273,<ISCED level 1>,<ISCED level 0>
212019,12822,Male,11-25 books,30.0,30.0,30.0,30.0,30.0,360.0,240.0,120.0,284.08632,287.23844,312.23424,389.215107,<ISCED level 1>,<ISCED level 0>
212011,9143,Male,11-25 books,4.0,2.0,0.0,0.0,6.0,250.0,250.0,100.0,334.32778,415.54968,389.07126,491.092167,<ISCED level 1>,<ISCED level 0>
212009,10479,Male,0-10 books,2.0,0.0,0.0,0.0,0.0,200.0,200.0,600.0,376.85780,329.09998,330.04476,470.953447,<ISCED level 1>,<ISCED level 0>
211995,16142,Female,11-25 books,5.0,3.0,2.0,0.0,1.0,200.0,200.0,150.0,434.49916,365.22774,362.68184,532.302633,<ISCED level 1>,<ISCED level 0>
211994,16295,Female,0-10 books,1.0,1.0,0.0,1.0,0.0,140.0,220.0,110.0,355.74856,424.95988,393.73368,510.063560,<ISCED level 1>,<ISCED level 0>
211993,17846,Male,0-10 books,2.0,0.0,0.0,0.0,1.0,225.0,180.0,180.0,422.03614,430.94704,468.61244,581.210587,<ISCED level 1>,<ISCED level 0>


In [68]:
# Confirm with smaller shape of dataframe
pisa_clean.shape

(45758, 17)

### 5. Create a total for Out-of-School Study Time and Learning time (minutes per week)



In [69]:
# Create total of Out-of-School Study Time
pisa_clean['Out-of-School Study Time - Total'] = pisa_clean['Out-of-School Study Time - Homework'] + pisa_clean['Out-of-School Study Time - Guided Homework'] + pisa_clean['Out-of-School Study Time - Personal Tutor'] + pisa_clean['Out-of-School Study Time - Commercial Company'] + pisa_clean['Out-of-School Study Time - With Parent']

# Create total of Learning time (minutes per week)
pisa_clean['Learning time (minutes per week) - Total'] = pisa_clean['Learning time (minutes per week)- <Mathematics>'] + pisa_clean['Learning time (minutes per week)  - <test language>'] + pisa_clean['Learning time (minutes per week) - <Science>'] 
                                                                                                                 

In [70]:
list(pisa_clean)

['Student ID',
 'Gender',
 'How many books at home',
 'Out-of-School Study Time - Homework',
 'Out-of-School Study Time - Guided Homework',
 'Out-of-School Study Time - Personal Tutor',
 'Out-of-School Study Time - Commercial Company',
 'Out-of-School Study Time - With Parent',
 'Learning time (minutes per week)- <Mathematics>',
 'Learning time (minutes per week)  - <test language>',
 'Learning time (minutes per week) - <Science>',
 'Average Math Score',
 'Average Reading Score',
 'Average Science Score',
 'Average Total Score',
 'Education - Father',
 'Education - Mother',
 'Out-of-School Study Time - Total',
 'Learning time (minutes per week) - Total']

<a id='assessing2'></a>
### Final Assessment

In this section, we will do the preliminary visual and programmatic assessment of the pisa datasets to determine whether or not it holds any major quality or tidiness issues.

In [71]:
pisa_clean

Unnamed: 0,Student ID,Gender,How many books at home,Out-of-School Study Time - Homework,Out-of-School Study Time - Guided Homework,Out-of-School Study Time - Personal Tutor,Out-of-School Study Time - Commercial Company,Out-of-School Study Time - With Parent,Learning time (minutes per week)- <Mathematics>,Learning time (minutes per week) - <test language>,Learning time (minutes per week) - <Science>,Average Math Score,Average Reading Score,Average Science Score,Average Total Score,Education - Father,Education - Mother,Out-of-School Study Time - Total,Learning time (minutes per week) - Total
235181,1647,Male,0-10 books,2.0,1.0,0.0,0.0,0.0,150.0,200.0,350.0,449.53266,478.18156,474.20738,617.151420,<ISCED level 6>,<ISCED level 0>,3.0,700.0
212053,14280,Male,11-25 books,3.0,0.0,0.0,0.0,0.0,160.0,160.0,120.0,399.21326,383.87284,344.96456,509.087973,<ISCED level 1>,<ISCED level 0>,3.0,440.0
212039,18805,Female,0-10 books,0.0,0.0,0.0,0.0,0.0,150.0,150.0,150.0,438.93906,495.01800,456.39686,609.764327,<ISCED level 1>,<ISCED level 0>,0.0,450.0
212037,18913,Male,11-25 books,2.0,1.0,0.0,0.0,1.0,200.0,150.0,150.0,486.99952,524.61420,498.26558,665.626273,<ISCED level 1>,<ISCED level 0>,4.0,500.0
212019,12822,Male,11-25 books,30.0,30.0,30.0,30.0,30.0,360.0,240.0,120.0,284.08632,287.23844,312.23424,389.215107,<ISCED level 1>,<ISCED level 0>,150.0,720.0
212011,9143,Male,11-25 books,4.0,2.0,0.0,0.0,6.0,250.0,250.0,100.0,334.32778,415.54968,389.07126,491.092167,<ISCED level 1>,<ISCED level 0>,12.0,600.0
212009,10479,Male,0-10 books,2.0,0.0,0.0,0.0,0.0,200.0,200.0,600.0,376.85780,329.09998,330.04476,470.953447,<ISCED level 1>,<ISCED level 0>,2.0,1000.0
211995,16142,Female,11-25 books,5.0,3.0,2.0,0.0,1.0,200.0,200.0,150.0,434.49916,365.22774,362.68184,532.302633,<ISCED level 1>,<ISCED level 0>,11.0,550.0
211994,16295,Female,0-10 books,1.0,1.0,0.0,1.0,0.0,140.0,220.0,110.0,355.74856,424.95988,393.73368,510.063560,<ISCED level 1>,<ISCED level 0>,3.0,470.0
211993,17846,Male,0-10 books,2.0,0.0,0.0,0.0,1.0,225.0,180.0,180.0,422.03614,430.94704,468.61244,581.210587,<ISCED level 1>,<ISCED level 0>,3.0,585.0


In [72]:
pisa_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45758 entries, 235181 to 23582
Data columns (total 19 columns):
Student ID                                             45758 non-null int64
Gender                                                 45758 non-null object
How many books at home                                 45758 non-null object
Out-of-School Study Time - Homework                    45758 non-null float64
Out-of-School Study Time - Guided Homework             45758 non-null float64
Out-of-School Study Time - Personal Tutor              45758 non-null float64
Out-of-School Study Time - Commercial Company          45758 non-null float64
Out-of-School Study Time - With Parent                 45758 non-null float64
Learning time (minutes per week)- <Mathematics>        45758 non-null float64
Learning time (minutes per week)  - <test language>    45758 non-null float64
Learning time (minutes per week) - <Science>           45758 non-null float64
Average Math Score              

In [73]:
pisa_clean.shape

(45758, 19)

<a id='sav'></a>
## Store

In [74]:
# Store the clean DataFrame in a CSV file
pisa_clean.to_csv("pisa_df.csv", index=False, encoding='utf8')

<a id='conclusion'></a>
## Conclusion

<a id='references'></a>
## References