# PISA 2012 Data Exploration
### by Gabriela Sikora

## Introduction
This notebook will be dedicated to exploring details of the PISA 2012 dataset. PISA, in particular, is a "survey of students' skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school" (Udacity, 2019).

Within this dataset we can find information for about 510,000 students. The PISA 2012 dataset covers mathematics, reading in the test language, and science.


Throughout the course of this notebook I will have these two questions in mind:

- Are there differences in achievement based on gender or parental education levels?
- Is there a relationship between the amount of time a student dedicates to learning and their score? 

## Preliminary Wrangling 
To begin, let's start off by assessing the dataset and cleaning any remaining issues.

In [None]:
# Import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [None]:
# Read in the cleaned csv that was created in the wrangle_pisa notebook
pisa = pd.read_csv('pisa_df.csv')

### Assessing and Cleaning the Data

#### General

In [None]:
# How many rows and variables the dataset holds
pisa.shape

In [None]:
# What are the data types of the variables
pisa.dtypes

In [None]:
# See 10 examples of data in the dataset 
pisa.sample(10)

In [None]:
# Descriptive statistics for each numeric variable
pisa.describe()

#### Parental Education

In [None]:
# The type and quantity of the educational levels for 'Education - Father'
pisa['Education - Father'].value_counts()

In [None]:
# The type and quantity of the educational levels for 'Education - Mother'
pisa['Education - Mother'].value_counts()

In [None]:
# Convert parental level of education into ordered categorical types
ordinal_var_dict = {'Education - Father': ['<ISCED level 0>', '<ISCED level 1>', '<ISCED level 2>', '<ISCED level 3>', '<ISCED level 4>', '<ISCED level 5>', '<ISCED level 6>'],
                    'Education - Mother': ['<ISCED level 0>', '<ISCED level 1>', '<ISCED level 2>', '<ISCED level 3>', '<ISCED level 4>', '<ISCED level 5>', '<ISCED level 6>']}

for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[var])
    pisa[var] = pisa[var].astype(ordered_var)

#### Scores

In [None]:
high_score = pisa[pisa['Average Total Score'] >= 785]
low_score = pisa[pisa['Average Total Score'] <= 630]

In [None]:
high_score.head()

In [None]:
high_score.shape

In [None]:
low_score.head()

In [None]:
pisa.shape

In [None]:
pisa['Student ID'].duplicated().sum()

In [None]:
pisa.drop_duplicates(inplace=True)

In [None]:
pisa.duplicated().sum()

In [None]:
pisa.shape

### The structure of the dataset

This cleaned version of the PISA 2012 dataset is composed of 43,715 rows, each of which represents one student. There are 18 selected variables, most of which are numeric. Two of them, however, are ordered categorical variables: the highest educational levels of the student's mother and father, sorted from the lowest level of education to the highest:

**(least educated) —> (most educated)** <br>
**<ISCED level 0>** : Pre-primary education <br>
**<ISCED level 1>** : Primary education or first stage of basic education<br>
**<ISCED level 2>** : Lower secondary education or second stage of basic education<br>
**<ISCED level 3>** : Upper secondary education<br>
**<ISCED level 4>** : Post-secondary non-tertiary education <br>
**<ISCED level 5>** : First stage of tertiary education<br>
**<ISCED level 6>** : Second stage of tertiary education <br>
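
Because the two education columns are stored as ordered categoricals, pandas can compare and sort them by level rather than alphabetically. A minimal sketch of that behaviour, using a small made-up series (the three values below are illustrative and are not rows from the dataset):

```python
import pandas as pd

# ISCED levels in ascending order, matching the labels used in this notebook
isced_levels = ['<ISCED level %d>' % i for i in range(7)]
isced_dtype = pd.api.types.CategoricalDtype(categories=isced_levels, ordered=True)

# Hypothetical sample values, for illustration only
education = pd.Series(['<ISCED level 5>', '<ISCED level 1>', '<ISCED level 3>']).astype(isced_dtype)

# Ordered categoricals can be compared against one of their levels...
tertiary = education >= '<ISCED level 5>'

# ...and min, max and sorting follow the declared order
lowest = education.min()
highest = education.max()
sorted_levels = education.sort_values().tolist()
```

This is why the count plots and box plots later in the notebook show the education levels from lowest to highest without any extra sorting.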


### Main feature of interest in the dataset

The main feature that we will be exploring is the 'Average Total Score'. 

### Features that will support the investigation into 'Average Total Score'

To better understand the Average Total Score, I believe that 'Out-of-School Study Time - Total' and 'Learning Time - Total' (minutes per week) will be illuminating. The common assumption is that the more homework a student completes, the better they will perform on tests, but a recent rise in research suggests that homework time is not a good predictor of test success. Rather, I expect that the educational level of the parents and the number of books in the home will be better predictors of a student's test success.
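
One way this expectation could be checked is to compare rank correlations of the score with study time and with parental education. The sketch below uses a tiny made-up frame, so its numbers say nothing about the real data. Only the column names follow this notebook, and computing Spearman's rho as the Pearson correlation of ranks is my own choice of method:

```python
import pandas as pd

# Tiny made-up frame, for illustration only (not PISA data)
df = pd.DataFrame({
    'Education - Father': ['<ISCED level 1>', '<ISCED level 3>', '<ISCED level 5>',
                           '<ISCED level 3>', '<ISCED level 5>', '<ISCED level 1>'],
    'Out-of-School Study Time - Total': [10, 4, 6, 12, 8, 2],
    'Average Total Score': [410.0, 480.0, 560.0, 500.0, 590.0, 390.0],
})
levels = ['<ISCED level %d>' % i for i in range(7)]
df['Education - Father'] = df['Education - Father'].astype(
    pd.api.types.CategoricalDtype(categories=levels, ordered=True))

score_rank = df['Average Total Score'].rank()

# Spearman's rho is the Pearson correlation of the ranks
rho_time = df['Out-of-School Study Time - Total'].rank().corr(score_rank)

# For the ordinal column, the integer category codes carry the level order
rho_edu = df['Education - Father'].cat.codes.rank().corr(score_rank)
```

On the real data, the same two numbers would give a first, rough indication of which feature tracks the score more closely.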

## Univariate Exploration

We can start off by looking at the main feature of interest: the average total score. 

In particular, let's first look at a standard-scale plot of this value to see its distribution.

In [None]:
# Histogram of Average Total Score
binsize = 20
bins = np.arange(0, pisa['Average Total Score'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Average Total Score', bins=bins)
plt.xlabel('Average Total Score')

Here we can see that the distribution is approximately normal. This is not surprising, since bell curves are expected when it comes to student scores.
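
A histogram only gives a visual impression of normality. As a rough numeric check, one could also look at skewness, excess kurtosis, and the share of values within one standard deviation of the mean. The sketch below runs on synthetic data (the mean of 500 and standard deviation of 100 are assumptions for illustration); on the real data, `scores` would be `pisa['Average Total Score']`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a score column (assumed mean and sd, illustration only)
rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(loc=500, scale=100, size=10_000))

# For a normal distribution, skewness and excess kurtosis are both close to 0
skewness = scores.skew()
excess_kurtosis = scores.kurt()

# Share of values within one standard deviation of the mean (about 68% if normal)
within_one_sd = ((scores - scores.mean()).abs() <= scores.std()).mean()
```

Values far from these references would suggest that "approximately normal" is too generous a description.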

We can now move on to the three scores that make up the total score: Math, Reading, and Science.

In [None]:
# Histogram of Average Math Score
binsize = 20
bins = np.arange(0, pisa['Average Math Score'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Average Math Score', bins=bins)
plt.xlabel('Average Math Score')

Although the values along the x-axis are a bit lower than they are for the total scores, this distribution closely resembles that of the total score: it is also approximately normal.

In [None]:
# Histogram of Average Reading Score
binsize = 20
bins = np.arange(0, pisa['Average Reading Score'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Average Reading Score', bins=bins)
plt.xlabel('Average Reading Score')

Just as with the Math score, the average Reading score is approximately normally distributed.

In [None]:
# Histogram of Average Science Score
binsize = 20
bins = np.arange(0, pisa['Average Science Score'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Average Science Score', bins=bins)
plt.xlabel('Average Science Score')

Just as with the Total, Math, and Reading scores, the Science score is also approximately normally distributed.
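
The four score histograms share the same bin construction. As a sketch, that step could be factored into a small helper. The name `score_bins` is hypothetical and the scores below are made up:

```python
import numpy as np

def score_bins(values, binsize=20):
    """Evenly spaced bin edges starting at 0 that span the largest value."""
    return np.arange(0, np.max(values) + binsize, binsize)

# Made-up scores, for illustration only
example_scores = np.array([312.5, 480.0, 505.2, 641.9])
bins = score_bins(example_scores)

# np.histogram uses the same edges that plt.hist would receive via bins=bins
counts, edges = np.histogram(example_scores, bins=bins)
```

Each histogram cell could then pass `bins=score_bins(pisa[column])` to `plt.hist`, keeping the bin width consistent across the plots.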

In [None]:
# Histogram of the Total Out-of-School Study Time
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Out-of-School Study Time - Total')
plt.xlabel('Out-of-School Study Time - Total')

In [None]:
# Histogram of the Total Learning Time
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Learning Time - Total')
plt.xlabel('Learning Time - Total')

In [None]:
# The ordinal variable's distribution for both Mother's and Father's Education 
fig, ax = plt.subplots(nrows=2, figsize = [18,18])

default_color = sb.color_palette()[0]
sb.countplot(data = pisa, x = 'Education - Father', color = default_color, ax = ax[0])
sb.countplot(data = pisa, x = 'Education - Mother', color = default_color, ax = ax[1])

plt.show()


### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration


In [None]:
numeric_vars = ['Average Math Score', 'Average Reading Score', 'Average Science Score', 'Average Total Score', 'Out-of-School Study Time - Total', 'Learning Time - Total']


In [None]:
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(pisa[numeric_vars].corr(), annot = True, fmt = '.3f',
           cmap = 'vlag_r', center = 0)
plt.show()

In [None]:
numeric_vars = ['Average Math Score', 'Average Reading Score', 'Average Science Score', 
                'Average Total Score', 'Learning Time - Mathematics',
                'Learning Time - Test Language', 'Learning Time - Science', 
                'Learning Time - Total']


In [None]:
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(pisa[numeric_vars].corr(), annot = True, fmt = '.3f',
           cmap = 'vlag_r', center = 0)
plt.show()

In [None]:
samples = np.random.choice(pisa.shape[0], 500, replace = False)
# samples holds row positions, not index labels, so select with iloc
pisa_samp = pisa.iloc[samples, :]

g = sb.PairGrid(data = pisa_samp, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter);

In [None]:
homework_vars = ['Out-of-School Study Time - Homework',
                      'Out-of-School Study Time - Guided Homework',
                      'Out-of-School Study Time - Personal Tutor',
                      'Out-of-School Study Time - Commercial Company',
                      'Out-of-School Study Time - With Parent',
                      'Learning Time - Mathematics',
                      'Learning Time - Test Language',
                      'Learning Time - Science']

In [None]:
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(pisa[homework_vars].corr(), annot = True, fmt = '.3f',
           cmap = 'vlag_r', center = 0)
plt.show()

In [None]:
g = sb.FacetGrid(data = pisa, col = 'Education - Mother');
g.map(plt.hist, 'Average Total Score');

In [None]:
g = sb.FacetGrid(data = pisa, col = 'Education - Father');
g.map(plt.hist, 'Average Total Score');

In [None]:
sb.regplot(data= pisa, x = 'Average Total Score', 
           y = 'Learning Time - Total', 
           fit_reg = False, 
           scatter_kws = {'alpha': 1/5})
plt.xlabel('Average Total Score')
plt.ylabel('Learning Time - Total');

In [None]:
sb.regplot(data= pisa, x = 'Average Math Score', 
           y = 'Learning Time - Mathematics', 
           fit_reg = False, 
           scatter_kws = {'alpha': 1/5})
plt.xlabel('Average Math Score')
plt.ylabel('Learning Time - Mathematics');

In [None]:
sb.regplot(data= pisa, x = 'Average Reading Score', 
           y = 'Learning Time - Test Language', 
           fit_reg = False, 
           scatter_kws = {'alpha': 1/5})
plt.xlabel('Average Reading Score')
plt.ylabel('Learning Time - Test Language');

In [None]:
sb.regplot(data= pisa, x = 'Average Science Score', 
           y = 'Learning Time - Science', 
           fit_reg = False, 
           scatter_kws = {'alpha': 1/5})
plt.xlabel('Average Science Score')
plt.ylabel('Learning Time - Science');

In [None]:
sb.regplot(data= pisa, x = 'Average Reading Score', 
           y = 'Average Math Score', 
           fit_reg = False, 
           scatter_kws = {'alpha': 1/5})
plt.xlabel('Average Reading Score')
plt.ylabel('Average Math Score');

In [None]:
sb.regplot(data= pisa, x = 'Average Reading Score', 
           y = 'Average Science Score', 
           fit_reg = False, 
           scatter_kws = {'alpha': 1/5})
plt.xlabel('Average Reading Score')
plt.ylabel('Average Science Score');

In [None]:
sb.regplot(data= pisa, x = 'Average Math Score', 
           y = 'Average Science Score', 
           fit_reg = False, 
           scatter_kws = {'alpha': 1/5})
plt.xlabel('Average Math Score')
plt.ylabel('Average Science Score');

In [None]:
list(pisa)

In [None]:
plt.hist2d(data = pisa, x = 'Average Math Score', 
           y = 'Learning Time - Mathematics', cmin = 0.5);
plt.colorbar()
plt.xlabel('Average Math Score')
plt.ylabel('Learning Time - Mathematics');

In [None]:
plt.hist2d(data = pisa, x = 'Average Reading Score', 
           y = 'Learning Time - Test Language', cmin = 0.5);
plt.colorbar()
plt.xlabel('Average Reading Score')
plt.ylabel('Learning Time - Test Language');

In [None]:
plt.hist2d(data = pisa, x = 'Average Science Score', 
           y = 'Learning Time - Science', cmin = 0.5);
plt.colorbar()
plt.xlabel('Average Science Score')
plt.ylabel('Learning Time - Science');

In [None]:
plt.hist2d(data = pisa, x = 'Average Total Score', 
           y = 'Learning Time - Total', cmin = 0.5);
plt.colorbar()
plt.xlabel('Average Total Score')
plt.ylabel('Learning Time - Total');

In [None]:
sb.violinplot(data = pisa, 
              x = 'Education - Father', 
              y = 'Average Total Score');
plt.xticks(rotation = 15);

In [None]:
sb.violinplot(data = pisa, 
              x = 'Education - Mother', 
              y = 'Average Total Score');
plt.xticks(rotation = 15);

In [None]:
sb.boxplot(data = pisa, 
              x = 'Gender', 
              y = 'Average Total Score');
plt.xticks(rotation = 15);

In [None]:
list(pisa)

In [None]:
sb.boxplot(data = pisa, 
              x = 'Education - Father', 
              y = 'Average Total Score');
plt.xticks(rotation = 15);

In [None]:
sb.boxplot(data = pisa, 
              x = 'Education - Mother', 
              y = 'Average Total Score');
plt.xticks(rotation = 15);

In [None]:
sb.countplot(data = pisa, x = 'Education - Father', hue = 'Gender')
plt.xticks(rotation = 15);

In [None]:
# fix y axis
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, sharey = False);
g.map(plt.hist, 'Average Total Score');

In [None]:
# fix y axis
g = sb.FacetGrid(data = pisa, col = 'Education - Mother', col_wrap = 4, sharey = False);
g.map(plt.hist, 'Average Total Score');

In [None]:
base_color = sb.color_palette()[9]
sb.barplot(data = pisa, 
              x = 'Education - Father', 
              y = 'Average Total Score',
              color = base_color);
plt.xticks(rotation = 15);

In [None]:
base_color = sb.color_palette()[9]
sb.barplot(data = pisa, 
              x = 'Education - Mother', 
              y = 'Average Total Score',
              color = base_color);
plt.xticks(rotation = 15);

In [None]:
sb.pointplot(data = pisa, 
              x = 'Education - Father', 
              y = 'Average Total Score');
plt.xticks(rotation = 15);

In [None]:
sb.pointplot(data = pisa, 
              x = 'Education - Mother', 
              y = 'Average Total Score');
plt.xticks(rotation = 15);

In [None]:
list(pisa)

In [None]:
sb.pointplot(data = pisa, 
              x = 'Out-of-School Study Time - Total', 
              y = 'Average Total Score');
plt.xticks(rotation = 15);

In [None]:
np.random.seed(2018)
sample = np.random.choice(pisa.shape[0], 200, replace=False)
# sample holds row positions, not index labels, so select with iloc
pisa_subset = pisa.iloc[sample]

In [None]:
g = sb.FacetGrid(data = pisa_subset, hue = 'Gender')
g.map(sb.regplot, 'Average Math Score', 'Average Reading Score', fit_reg = False)
plt.legend();

In [None]:
g = sb.FacetGrid(data = pisa_subset, hue = 'Gender')
g.map(sb.regplot, 'Average Math Score', 'Average Science Score', fit_reg = False)
plt.legend();

In [None]:
g = sb.FacetGrid(data = pisa_subset, hue = 'Gender')
g.map(sb.regplot, 'Average Science Score', 'Average Reading Score', fit_reg = False)
plt.legend();

In [None]:
g = sb.FacetGrid(data = pisa_subset, hue = 'Gender')
g.map(sb.regplot, 'Average Total Score', 'Out-of-School Study Time - Total', fit_reg = False)
plt.legend();

In [None]:
g = sb.FacetGrid(data = pisa_subset, hue = 'Gender')
g.map(sb.regplot, 'Average Total Score', 'Learning Time - Total', fit_reg = False)
plt.legend();

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration


In [None]:
numeric_vars = ['Average Math Score', 'Average Reading Score', 'Average Science Score', 
                'Average Total Score']

In [None]:
samples = np.random.choice(pisa.shape[0], 500, replace = False)
# samples holds row positions, not index labels, so select with iloc
pisa_samp = pisa.iloc[samples, :]

g = sb.PairGrid(data = pisa_samp, vars = numeric_vars, hue='Education - Father')
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter)

In [None]:
list(pisa)

In [None]:
np.random.seed(2018)
sample = np.random.choice(pisa.shape[0], 200, replace=False)
# sample holds row positions, not index labels, so select with iloc
pisa_subset = pisa.iloc[sample]

In [None]:
g = sb.FacetGrid(data = pisa_subset, hue = 'Education - Father')
g.map(sb.regplot, 'Average Total Score', 'Average Math Score',  x_jitter = 0.04, fit_reg = False)
plt.legend()
plt.ylabel('Average Math Score');

In [None]:
plt.figure(figsize = [8,6])
plt.scatter(data = pisa, 
            y = 'Average Math Score',
            x = 'Average Reading Score',
            c = 'Average Total Score')
plt.colorbar(label = 'Average Total Score')
plt.xlabel('Average Reading Score')
plt.ylabel('Average Math Score');

In [None]:
pisa['Gender'].value_counts()

In [None]:
list(pisa)

In [None]:
pisa_sub = pisa.loc[pisa['Gender'].isin(['Male', 'Female'])]

sb.boxplot(data = pisa_sub, 
              x = 'Average Science Score', 
              y = 'Learning Time - Science',
             hue = 'Gender');
plt.xticks(rotation = 15);
plt.ylabel('Learning Time - Science')
plt.xlabel('Average Science Score')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5));

In [None]:
pisa.describe()

In [None]:
### Average score per study hours 

In [None]:
# https://python-graph-gallery.com/122-multiple-lines-chart/
plt.plot('Out-of-School Study Time - Total', 'Average Total Score', data=pisa )
plt.plot('Out-of-School Study Time - Total', 'Average Math Score', data=pisa )
plt.plot('Out-of-School Study Time - Total', 'Average Reading Score', data=pisa )
plt.plot('Out-of-School Study Time - Total', 'Average Science Score', data=pisa )

plt.legend()

In [None]:
carat_ticks = [0.2, 0.3, 0.5, 0.7, 1, 1.5, 2, 3]

In [None]:
# compute the logarithm of Average Total Score to make multivariate plotting easier
def log_trans(x, inverse = False):
    """ quick function for computing log and power operations """
    if not inverse:
        return np.log10(x)
    else:
        return np.power(10, x)
    
# compute the cuberoot of Out-of-School Study Time - Total to make multivariate plotting easier
def cuberoot_trans(x, inverse = False):
    """ quick function for computing cube root and cube operations """
    if not inverse:
        return x ** (1/3)
    else:
        return x ** 3
    
    
pisa['log_score'] = pisa['Average Total Score'].apply(log_trans)
pisa['cr_time'] = pisa['Out-of-School Study Time - Total'].apply(cuberoot_trans)

In [None]:
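# Country-level means of the numeric columns, defined here so the cells run top to bottom
pisa_countries = pisa.groupby('Country').mean(numeric_only=True)
pisa_countries.head()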

In [None]:
pisa_cnt = pisa_countries.copy()

In [None]:
pisa_cnt = pisa_cnt.reset_index()

In [None]:
pisa_cnt.head()

In [None]:
sb.regplot(data= pisa_cnt, x = 'Out-of-School Study Time - Total', 
           y = 'Average Total Score', 
           fit_reg = False)
plt.xlabel('Out-of-School Study Time - Total')
plt.ylabel('Average Total Score');

In [None]:
sb.pointplot(data = pisa, 
              x = 'Out-of-School Study Time - Total', 
              y = 'Average Total Score');
plt.xticks(rotation = 15);

In [None]:
pisa.head()

In [None]:
g = sb.FacetGrid(pisa, col='Country', col_wrap=3, height=10)
g = g.map(sb.pointplot, 'Average Total Score', 'Out-of-School Study Time - Total')

In [None]:
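# Average only the numeric columns; text and categorical columns cannot be averaged
pisa_countries = pisa.groupby('Country').mean(numeric_only=True)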
# ct_counts = ct_counts.reset_index(name = 'count')
# ct_counts = ct_counts.pivot(index = 'Country', columns = 'Average Total Score', values = 'count')

In [None]:
pisa_countries

In [None]:
g = sb.FacetGrid(pisa, col="Country", col_wrap=3, height=10)
g = g.map(plt.plot, 'log_score', 'cr_time')


# sb.heatmap(ct_counts, annot = True, fmt = 'd');
# fmt makes them into decimal values aka readable

In [None]:
# for each country, average score and time studied

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
