# PISA Data Exploration
## by Andreja Ho

## Preliminary Wrangling


> This document explores the PISA 2012 dataset. PISA is a survey of students' skills and knowledge as they approach the end of compulsory education. It is not a conventional school test: rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school. Around 510,000 students in 65 economies took part in the PISA 2012 assessment of reading, mathematics, and science, representing about 28 million 15-year-olds worldwide.

In [5]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [6]:
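# Read in the original dataset and explore it.
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0;
# dtype=str reads every column as text.
df_origin = pd.read_csv('pisa2012.csv', sep=',', encoding='latin-1', on_bad_lines='skip', index_col=False, dtype=str)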
pd.set_option('display.max_columns', None)
df_origin.head(3)

Unnamed: 0.1,Unnamed: 0,CNT,SUBNATIO,STRATUM,OECD,NC,SCHOOLID,STIDSTD,ST01Q01,...,WVARSTRR,VAR_UNIT,SENWGT_STU,VER_STU
0,1,Albania,80000,ALB0006,Non-OECD,Albania,1,1,10,...,19,1,0.2098,22NOV13
1,2,Albania,80000,ALB0006,Non-OECD,Albania,1,2,10,...,19,1,0.2098,22NOV13
2,3,Albania,80000,ALB0006,Non-OECD,Albania,1,3,9,...,19,1,0.1999,22NOV13

[3 rows x 636 columns]


In [7]:
# Dataset shape
df_origin.shape

(485490, 636)


The PISA 2012 dataset is large and complex: the original file contains 485,490 rows and 636 columns. For this analysis, I selected only a few columns of interest. To understand the variables (column headers), I used PISA's data dictionary[link] and PISA's codebook[link]. Both documents were needed to choose the variables of interest and to interpret them properly.

First, I looked through the dictionary to learn what data was gathered and to decide what the analysis should focus on. Next, I read through the codebook to decide which variables are best suited for the analysis. Once I had chosen the columns of interest, I used Excel's `TEXTJOIN()` function to turn their names into a single comma-separated list of quoted strings that could be pasted into Python.
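The same kind of column list can also be built without leaving the notebook. The following is a minimal sketch, not the method used in this analysis: it reads only the header row with `nrows=0` and filters the names. A small in-memory CSV stands in for `pisa2012.csv`, and the `wanted` set is a hypothetical selection.

```python
import io
import pandas as pd

# Tiny stand-in for pisa2012.csv (assumption: the real file has a comparable header row)
sample_csv = io.StringIO(
    "CNT,OECD,SCHOOLID,ST04Q01,WEALTH,PV1MATH\n"
    "Albania,Non-OECD,1,Female,-2.92,406.8469\n"
)

# nrows=0 parses only the header, so this stays cheap even for a very large file
all_columns = pd.read_csv(sample_csv, nrows=0).columns.tolist()

# Hypothetical selection of columns of interest
wanted = {"CNT", "OECD", "WEALTH", "PV1MATH"}
col_list = [name for name in all_columns if name in wanted]

print(col_list)
```

Filtering the header this way keeps the file's own column order and makes a typo in a column name easy to spot, because the misspelled name simply does not show up in `col_list`.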
 
I am interested in how specific features affect students' performance in math, reading, and science. The features of interest are:

- wealth
- cultural and home possessions
- educational resources
- sense of belonging
- parents' and siblings' presence at home
- the educational level of the parents

Besides those features, I will analyze whether performance differs by:

- gender
- OECD versus non-OECD countries
- country

Additionally, I would like to investigate whether mathematics anxiety is correlated with performance in math and science.
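As a rough illustration of the last question, the sketch below computes the Pearson correlation between an anxiety index and two scores. The numbers are made up for illustration and are not PISA results; the column names follow the PISA codes `ANXMAT`, `PV1MATH`, and `PV1SCIE`, and the columns are assumed to have been converted to numeric types already.

```python
import pandas as pd

# Toy data (invented for illustration, not taken from PISA)
toy = pd.DataFrame({
    "ANXMAT": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "PV1MATH": [560.0, 530.0, 500.0, 470.0, 450.0],
    "PV1SCIE": [550.0, 520.0, 505.0, 480.0, 455.0],
})

# Pearson correlation of the anxiety index with each score column
anxiety_corr = toy[["PV1MATH", "PV1SCIE"]].corrwith(toy["ANXMAT"])
print(anxiety_corr)
```

A coefficient near zero would suggest no linear relationship, while a clearly negative value would be consistent with higher anxiety going along with lower scores. Correlation alone does not say which way any effect runs.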



In [21]:
# Read in the CSV file:
# - select only the columns of interest (col_list)
#           - the quoted, comma-separated list of column names was built with Excel's TEXTJOIN() function
# - work around problems reading the large CSV:
#           - latin-1 encoding, bad lines skipped, and every column read as text
#           - on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0

col_list = ['CNT','OECD','SCHOOLID','ST04Q01','ST09Q01','ST115Q01','ST11Q01','ST11Q02','ST11Q03','ST11Q04','ANXMAT','CULTPOS','FISCED','HEDRES','HOMEPOS','MISCED','WEALTH','ANCBELONG','PV1MATH','PV1READ','PV1SCIE']
df = pd.read_csv('pisa2012.csv', sep=',', encoding='latin-1', on_bad_lines='skip', index_col=False, dtype=str, usecols=col_list)

In [22]:
# Show all columns in the output and display the first five rows
pd.set_option('display.max_columns', None)
df.head(5)

Unnamed: 0,CNT,OECD,SCHOOLID,ST04Q01,ST09Q01,ST115Q01,ST11Q01,ST11Q02,ST11Q03,ST11Q04,ANXMAT,CULTPOS,FISCED,HEDRES,HOMEPOS,MISCED,WEALTH,ANCBELONG,PV1MATH,PV1READ,PV1SCIE
0,Albania,Non-OECD,1,Female,,1,Yes,Yes,Yes,Yes,0.32,-0.48,"ISCED 3A, ISCED 4",-1.29,-2.61,"ISCED 3A, ISCED 4",-2.92,-0.7351,406.8469,249.5762,341.7009
1,Albania,Non-OECD,1,Female,,1,Yes,Yes,,Yes,,1.27,"ISCED 3A, ISCED 4",1.12,1.41,"ISCED 5A, 6",0.69,,486.1427,406.2936,548.9929
2,Albania,Non-OECD,1,Female,,1,Yes,Yes,No,Yes,,1.27,"ISCED 5A, 6",-0.69,0.14,"ISCED 5A, 6",-0.23,,533.2684,401.21,499.6643
3,Albania,Non-OECD,1,Female,,1,Yes,Yes,No,Yes,0.31,1.27,"ISCED 5A, 6",0.04,-0.73,"ISCED 3B, C",-1.17,,412.2215,547.363,438.6796
4,Albania,Non-OECD,1,Female,,2,Yes,Yes,Yes,,1.02,1.27,"ISCED 3A, ISCED 4",-0.69,-0.57,,-1.17,0.8675,381.9209,311.7707,361.5628


In [23]:
df.shape

(485490, 21)

In [24]:
# Rename columns to more descriptive names
df = df.rename(columns={
    'CNT': 'Country_code',
    'OECD': 'OECD_country',
    'SCHOOLID': 'School_ID',
    'ST04Q01': 'Gender',
    'ST09Q01': 'Truancy_Skip_whole_school_day',
    'ST115Q01': 'Truancy_Skip_classes_within_school_day',
    'ST11Q01': 'At_Home_Mother',
    'ST11Q02': 'At_Home_Father',
    'ST11Q03': 'At_Home_Brothers',
    'ST11Q04': 'At_Home_Sisters',
    'ANXMAT': 'Mathematics_Anxiety',
    'CULTPOS': 'Cultural_Possessions',
    'FISCED': 'Educational_level_of_father_ISCED',
    'HEDRES': 'Home_educational_resources',
    'HOMEPOS': 'Home_Possessions',
    'MISCED': 'Educational_level_of_mother_ISCED',
    'WEALTH': 'Wealth',
    'ANCBELONG': 'Sense_of_Belonging_to_School_Anchored',
    'PV1MATH': 'Plausible_value_mathematics',
    'PV1READ': 'Plausible_value_reading',
    'PV1SCIE': 'Plausible_value_science',
})

In [25]:
# Check renamed columns
df.head(3)

Unnamed: 0,Country_code,OECD_country,School_ID,Gender,Truancy_Skip_whole_school_day,Truancy_Skip_classes_within_school_day,At_Home_Mother,At_Home_Father,At_Home_Brothers,At_Home_Sisters,Mathematics_Anxiety,Cultural_Possessions,Educational_level_of_father_ISCED,Home_educational_resources,Home_Possessions,Educational_level_of_mother_ISCED,Wealth,Sense_of_Belonging_to_School_Anchored,Plausible_value_mathematics,Plausible_value_reading,Plausible_value_science
0,Albania,Non-OECD,1,Female,,1,Yes,Yes,Yes,Yes,0.32,-0.48,"ISCED 3A, ISCED 4",-1.29,-2.61,"ISCED 3A, ISCED 4",-2.92,-0.7351,406.8469,249.5762,341.7009
1,Albania,Non-OECD,1,Female,,1,Yes,Yes,,Yes,,1.27,"ISCED 3A, ISCED 4",1.12,1.41,"ISCED 5A, 6",0.69,,486.1427,406.2936,548.9929
2,Albania,Non-OECD,1,Female,,1,Yes,Yes,No,Yes,,1.27,"ISCED 5A, 6",-0.69,0.14,"ISCED 5A, 6",-0.23,,533.2684,401.21,499.6643


Based on the [Wikipedia article on ISCED](https://en.wikipedia.org/wiki/International_Standard_Classification_of_Education), I recoded the parents' educational levels as numbers to make them easier to interpret:

| PISA 2012 | From Wikipedia - ISCED 2011 | Equivalent level used in this dataset |
| :- | :- | :- |
| ISCED 1 | Primary education | 1 |
| ISCED 2 | Lower secondary education | 2 |
| ISCED 3B, C | Upper secondary education | 3 |
| ISCED 3A, ISCED 4 | Post-secondary non-tertiary education | 4 |
| ISCED 5B | Short-cycle tertiary education | 5 |
| ISCED 5A, 6 | Bachelor's or equivalent | 6 |
In [26]:
# Inspect the father's education levels before recoding
df.Educational_level_of_father_ISCED.value_counts()

ISCED 3A, ISCED 4    118890
ISCED 5A, 6          113406
ISCED 2               66728
ISCED 5B              61617
ISCED 3B, C           39789
ISCED 1               35938
None                  16535
Name: Educational_level_of_father_ISCED, dtype: int64

In [27]:
# Inspect the mother's education levels before recoding
df.Educational_level_of_mother_ISCED.value_counts()

ISCED 3A, ISCED 4    126768
ISCED 5A, 6          114452
ISCED 5B              68219
ISCED 2               66650
ISCED 1               36556
ISCED 3B, C           35672
None                  18768
Name: Educational_level_of_mother_ISCED, dtype: int64

In [37]:
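# Replace the father's education labels with numbered levels (see the table above).
# The result is assigned back because an in-place replace on a selected column
# may not update the DataFrame in recent pandas versions.
df["Educational_level_of_father_ISCED"] = df["Educational_level_of_father_ISCED"].replace({
    "ISCED 1": "1",
    "ISCED 2": "2",
    "ISCED 3B, C": "3",
    "ISCED 3A, ISCED 4": "4",
    "ISCED 5B": "5",
    "ISCED 5A, 6": "6",
})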

In [38]:
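# Replace the mother's education labels with numbered levels (see the table above).
# The result is assigned back because an in-place replace on a selected column
# may not update the DataFrame in recent pandas versions.
df["Educational_level_of_mother_ISCED"] = df["Educational_level_of_mother_ISCED"].replace({
    "ISCED 1": "1",
    "ISCED 2": "2",
    "ISCED 3B, C": "3",
    "ISCED 3A, ISCED 4": "4",
    "ISCED 5B": "5",
    "ISCED 5A, 6": "6",
})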

In [39]:
# Check for correct transformation 
df.Educational_level_of_mother_ISCED.value_counts()

4       126768
6       114452
5        68219
2        66650
1        36556
3        35672
None     18768
Name: Educational_level_of_mother_ISCED, dtype: int64

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 485490 entries, 0 to 485489
Data columns (total 21 columns):
 #   Column                                  Non-Null Count   Dtype 
---  ------                                  --------------   ----- 
 0   Country_code                            485490 non-null  object
 1   OECD_country                            485490 non-null  object
 2   School_ID                               485490 non-null  object
 3   Gender                                  485490 non-null  object
 4   Truancy_Skip_whole_school_day           479131 non-null  object
 5   Truancy_Skip_classes_within_school_day  479269 non-null  object
 6   At_Home_Mother                          460559 non-null  object
 7   At_Home_Father                          441036 non-null  object
 8   At_Home_Brothers                        400076 non-null  object
 9   At_Home_Sisters                         390768 non-null  object
 10  Mathematics_Anxiety                     314764 non-null 
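
Every column is listed as `object` because the file was read as text. Before plotting or computing statistics, the index and score columns need numeric dtypes. The following is a minimal sketch on a toy frame, assuming the renamed column names used in this notebook; the values are taken from the first preview rows.

```python
import pandas as pd

# Toy frame with text values, mimicking how the CSV was read
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female"],
    "Wealth": ["-2.92", "0.69", "-0.23"],
    "Mathematics_Anxiety": ["0.32", None, None],
    "Plausible_value_mathematics": ["406.8469", "486.1427", "533.2684"],
})

numeric_cols = ["Wealth", "Mathematics_Anxiety", "Plausible_value_mathematics"]

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
for col in numeric_cols:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```

Categorical columns such as `Gender` are left as text on purpose; only the columns named in `numeric_cols` are converted.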

In [43]:
pd.set_option('display.max_rows', None)
df.Country_code.value_counts().sort_index(ascending=False)

Vietnam                  4959
Uruguay                  5315
United Kingdom          12659
United Arab Emirates    11500
USA                     10294
Turkey                   4848
Tunisia                  4407
Thailand                 6606
Taiwan                   6046
Switzerland             11229
Sweden                   4736
Spain                   25313
Slovenia                 5911
Slovak Republic          4678
Singapore                5546
Serbia                   4684
Russia                   6992
Romania                  5074
Qatar                   10966
Portugal                 5722
Poland                   4607
Peru                     6035
Norway                   4686
New Zealand              4291
Netherlands              4460
Montenegro               4744
Mexico                  33806
Malaysia                 5197
Luxembourg               5258
Lithuania                4618
Liechtenstein             293
Latvia                   4306
Korea                    5033
Kazakhstan

In [41]:
# Merge entries reported per state, city, or region into a single country name.
# The replacement is limited to the country column so no other column is touched.
df['Country_code'] = df['Country_code'].replace(['Connecticut (USA)', 'Florida (USA)', 'Massachusetts (USA)', 'United States of America'], 'USA')
df['Country_code'] = df['Country_code'].replace(['Chinese Taipei'], 'Taiwan')
df['Country_code'] = df['Country_code'].replace(['China-Shanghai', 'Hong Kong-China', 'Macao-China'], 'China')
df['Country_code'] = df['Country_code'].replace(['Russian Federation', 'Perm(Russian Federation)'], 'Russia')
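
A quick way to confirm that the merge worked is to check that none of the old labels remain. The sketch below does this on a toy series standing in for `df['Country_code']`; the `country_map` dictionary is a hypothetical restatement of the groupings in the cell above.

```python
import pandas as pd

# Old label -> merged country name (same groupings as in the cell above)
country_map = {
    "Connecticut (USA)": "USA",
    "Florida (USA)": "USA",
    "Massachusetts (USA)": "USA",
    "United States of America": "USA",
    "Chinese Taipei": "Taiwan",
    "China-Shanghai": "China",
    "Hong Kong-China": "China",
    "Macao-China": "China",
    "Russian Federation": "Russia",
    "Perm(Russian Federation)": "Russia",
}

# Toy column standing in for df['Country_code']
countries = pd.Series(["Florida (USA)", "Albania", "Macao-China", "Perm(Russian Federation)"])
merged = countries.replace(country_map)

# None of the old labels should survive the replacement
leftover = merged[merged.isin(list(country_map))]
print(merged.tolist())
print("labels left to merge:", len(leftover))
```

Keeping the groupings in one dictionary also documents the decision to treat Shanghai, Hong Kong, and Macao as a single `China` entry, which affects every per-country comparison later on.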

In [45]:
df.Country_code.value_counts().sort_index(ascending=False)

Vietnam                  4959
Uruguay                  5315
United Kingdom          12659
United Arab Emirates    11500
USA                     10294
Turkey                   4848
Tunisia                  4407
Thailand                 6606
Taiwan                   6046
Switzerland             11229
Sweden                   4736
Spain                   25313
Slovenia                 5911
Slovak Republic          4678
Singapore                5546
Serbia                   4684
Russia                   6992
Romania                  5074
Qatar                   10966
Portugal                 5722
Poland                   4607
Peru                     6035
Norway                   4686
New Zealand              4291
Netherlands              4460
Montenegro               4744
Mexico                  33806
Malaysia                 5197
Luxembourg               5258
Lithuania                4618
Liechtenstein             293
Latvia                   4306
Korea                    5033
Kazakhstan

### What is the structure of your dataset?

> Your answer here!

### What is/are the main feature(s) of interest in your dataset?

> Your answer here!

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Your answer here!

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> Make sure that, after every plot or related series of plots, that you
include a Markdown cell with comments about what you observed, and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

> At the end of your report, make sure that you export the notebook as an
html file from the `File > Download as... > HTML` menu. Make sure you keep
track of where the exported file goes, so you can put it in the same folder
as this notebook for project submission. Also, make sure you remove all of
the quote-formatted guide notes like this one before you finish your report!