# PISA Dataset Exploration
## by Ellen Zhang

## Preliminary Wrangling


PISA is a survey of students' skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school.


Around 510,000 students in 65 economies took part in the PISA 2012 assessment of reading, mathematics and science, representing about 28 million 15-year-olds globally. Of those economies, 44 took part in an assessment of creative problem solving and 18 in an assessment of financial literacy.

References:

- [PISA Data Visualization Competition](http://www.oecd.org/pisa/pisaproducts/datavisualizationcontest.htm)

- [PISA 2012 Technical report](http://www.oecd.org/pisa/data/pisa2012technicalreport.htm)

Questions: 

- How does the choice of school play into academic performance?


- Are there differences in achievement based on gender, location, or student attitudes?


- Are there differences in achievement based on teacher practices and attitudes?


- Does there exist inequality in academic achievement?

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
# from IPython.core.display import display, HTML
# display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
pd.set_option('display.max_colwidth', 1000)
# pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

> Load in your dataset and describe its properties through the questions below.
Try and motivate your exploration goals through this section.

In [None]:
df_dict = pd.read_csv('./dataset/pisadict2012.csv',encoding='cp1252')
df_dict

In [None]:
print(df_dict.shape)
print(df_dict.dtypes) 

In [None]:
df_pisa = pd.read_csv('./dataset/pisa2012.csv',encoding='cp1252', low_memory=False)
df_pisa.sample(10)

In [None]:
df_pisa.shape

In [None]:
df_pisa.dtypes

In [None]:
list(df_pisa.columns.values)

In [None]:
list(df_pisa.iloc[0])

In [None]:
df_pisa.dtypes[df_pisa.dtypes != 'object'].shape

In [None]:
df_pisa.dtypes[df_pisa.dtypes == 'object'].shape

In [None]:
# Names of all columns stored as strings (dtype 'object')
category = df_pisa.dtypes[df_pisa.dtypes == 'object']
df_category_column = category.index

In [None]:
# Collect the sorted unique values of every categorical (object) column.
# DataFrame.append was removed in pandas 2.0, so gather the rows in a list
# and build the frame in one step.
rows = []
for column in df_category_column:
    category_values = sorted(df_pisa[column].dropna().unique().tolist())
    rows.append({'column_name': column,
                 'category_values': ','.join(str(cv) for cv in category_values)})
df_category = pd.DataFrame(rows, columns=['column_name', 'category_values'])

In [None]:
df_category

In [None]:
df_cate_counts = df_category.groupby(['category_values'])['category_values'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False)

df_cate_counts.head(10)

### What is the structure of your dataset?

> There are 485,490 observations in the dataset with 636 features. 268 of the variables are numeric and 368 are categorical.
For example:
- 98 columns have the category values: Agree,Disagree,Strongly agree,Strongly disagree
- 29 columns have the category values: No,Yes
- ....
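
The split between numeric and categorical columns can be counted directly from the dtypes. A minimal sketch, assuming a small stand-in frame (`toy` and its column names are hypothetical and are not PISA columns):

```python
import pandas as pd

# Hypothetical miniature frame standing in for df_pisa
toy = pd.DataFrame({
    "score": [480.5, 512.0, 455.3],
    "age": [15, 15, 16],
    "likes_math": ["Agree", "Disagree", "Agree"],
    "has_desk": ["Yes", "No", "Yes"],
})

# Numeric columns, and everything that is not numeric (treated as categorical here)
numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

n_numeric = len(numeric_cols)
n_categorical = len(categorical_cols)
print(n_numeric, n_categorical)
```

Applied to `df_pisa`, the same two calls should give the numeric and categorical counts quoted above, provided no columns fall outside those two groups.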


### What is/are the main feature(s) of interest in your dataset?

> I'm most interested in figuring out which features most influence students' average math performance. Math performance is taken from the columns `PV1MACC` through `PV5MACU`.
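
One way to average a contiguous block of score columns is a label-based slice, which does not depend on column positions. This is a sketch under assumptions: the toy frame below keeps only three of the math columns, and `student_id`, `PV1READ` and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: the first and last math column names come from the text above,
# everything else (other columns and all values) is invented for illustration
toy = pd.DataFrame({
    "student_id": [1, 2, 3],
    "PV1MACC": [500.0, 400.0, 450.0],
    "PV2MACC": [510.0, 410.0, 460.0],
    "PV5MACU": [520.0, 420.0, 470.0],
    "PV1READ": [300.0, 600.0, 500.0],
})

# .loc slices by label and includes both endpoints
pv_cols = toy.loc[:, "PV1MACC":"PV5MACU"].columns

# Row-wise mean over the selected columns only
toy["average_math_performance"] = toy[pv_cols].mean(axis=1)
print(toy[["student_id", "average_math_performance"]])
```

Slicing by label keeps working if columns are added or dropped before the block, whereas a positional slice would silently pick different columns.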


### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> I expect that a student's own `Math Behaviour` will have the strongest effect on their math grade: the more they do, the higher the grade. I also think that a student's attitude to math (`Perceived Control`, `Math Anxiety`, `Math Work Ethic`, `Perseverance`, `Learning Strategies`, `Math Interest`, `Attributions to Failure`) will affect the grade, though to a much smaller degree than `Math Behaviour`.
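
A first check of such an expectation could compare the mean score across the response categories of one attitude item. A minimal sketch, assuming a hypothetical column name (`math_anxiety`) and invented values rather than real PISA data:

```python
import pandas as pd

# Hypothetical responses and scores, for illustration only
toy = pd.DataFrame({
    "math_anxiety": ["Agree", "Agree", "Disagree", "Disagree", "Strongly agree"],
    "average_math_performance": [430.0, 450.0, 500.0, 520.0, 400.0],
})

# Mean score per response category, highest first
group_means = (
    toy.groupby("math_anxiety")["average_math_performance"]
       .mean()
       .sort_values(ascending=False)
)
print(group_means)
```

A difference in group means is only descriptive. It does not show that the attitude causes the difference in score.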

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

I'll start by looking at the distribution of the main variable of interest: ***Student Math Performance***.

In [None]:
# Average the 20 math plausible-value (PV) columns at positions 506-525 for each student
df_pisa['average_math_performance'] = df_pisa.iloc[:,506:526].mean(axis=1)

In [None]:
df_pisa['average_math_performance'].sample(10)

In [None]:
df_pisa['average_math_performance'].mean()

In [None]:
df_pisa['average_math_performance'].std()

In [None]:
df_pisa['average_math_performance'].median()

In [None]:
df_pisa['average_math_performance'].max()

In [None]:
df_pisa['average_math_performance'].min()

In [None]:
df_pisa['average_math_performance'].plot.hist(grid=True, bins=20, rwidth=0.95)
plt.axvline(df_pisa['average_math_performance'].mean(), color='r',  linewidth=2)
plt.title('Student Average Math Performance')
plt.xlabel('Math Performance')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)

In [None]:
df_pisa['average_math_performance'].plot.hist(grid=True, bins=500, rwidth=0.95)
plt.axvline(df_pisa['average_math_performance'].mean(), color='r',  linewidth=2)
plt.title('Student Average Math Performance')
plt.xlabel('Math Performance')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)

Student average math performance is approximately normally distributed. Across all 485,490 students, the mean is about 468 and the standard deviation is about 103.
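
The impression of normality from the histogram can be backed up with a few summary numbers. The sketch below uses synthetic scores drawn with the mean and standard deviation quoted above, not the PISA data, so it only illustrates which checks could be run on `df_pisa['average_math_performance']`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the score column (an assumption, not PISA data)
rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(loc=468, scale=103, size=10_000))

# For a normal distribution, skewness and excess kurtosis are both close to 0
skewness = scores.skew()
excess_kurtosis = scores.kurt()

# About 68% of a normal sample lies within one standard deviation of the mean
lower = scores.mean() - scores.std()
upper = scores.mean() + scores.std()
within_one_std = scores.between(lower, upper).mean()

print(scores.describe())
print(skewness, excess_kurtosis, within_one_std)
```

Values far from these benchmarks on the real column would suggest describing the distribution as skewed or heavy-tailed rather than normal.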

In [None]:
df_pisa['average_math_performance'].plot.box()

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

> At the end of your report, make sure that you export the notebook as an
html file from the `File > Download as... > HTML` menu. Make sure you keep
track of where the exported file goes, so you can put it in the same folder
as this notebook for project submission. Also, make sure you remove all of
the quote-formatted guide notes like this one before you finish your report!