In [2]:
# libraries
import numpy as np
import pandas as pd

df = pd.read_csv('../data/edattainxtry_5plus.csv')

# PSTAT 100 Project plan

This is a guide to preparing your project plan. It functions both as a guide to the work you'll need to do and as a guide to preparing the deliverable. You can use it as a template to draft the plan report; if so, **please remove the text explanations and instructions in each section so that it reads as a coherent and continuous document**.

While you may find it useful initially to follow the outline given, you do not need to adhere to it exactly -- you're free to organize your submission in the way that seems most natural to you. However, please do keep the high-level sections, so that your report includes the following headers:

0. Background
1. Data description
2. Initial explorations
3. Planned work

Your report does not need to be long. It should be about 2-4 pages, and might not be much longer than this template once you replace the guiding text with your own work.

## Group information

**Group members**:
Yibo Liang


**Contributions**:
1. Member 1 studied the data documentation and prepared the data description.
2. Member 2 worked on tidying the dataset.
3. Member 3 ...

---
## 0. Background

This section should introduce your reader to the general topic you're engaging with in your project and explain any specialized knowledge that they may need to understand your dataset and why it's interesting. It doesn't need to be long, but should touch on the following points:
* Introduce the topic of your project.
* What area or areas of study are you in dialogue with for your project?
* What is your data about, broadly? 
* What is the motivation for collecting the kind of data you're working with, and what sorts of things could you potentially learn?

You can look to the background sections in the homework assignments for examples. (There you can also see how to include images in your notebook.) The background sections of the homeworks are usually short and focused paragraphs intended to orient you to what you'll do in the assignment. They don't go into a lot of detail -- just enough to (hopefully) convince you that the data are interesting and explain any terminology or general information you may not know.

You may find it useful to write up the data description first, think about what the reader should know before they peek at your dataset, and then come back to the background section. I often write the background sections of your assignments last, once I have a sense of what kind of information would be most useful going into the assignment.

*Start your draft here.*

---
## 1. Data description

This section should introduce your dataset in detail. It should reflect your having gone through the collect/acquaint/tidy stages of the lifecycle. Below I've provided you with an outline. You do not need to adhere to this strictly -- in fact, it would be more natural to divide the items among a few short paragraphs -- but you should touch on each item in a format that suits your project.

### Basic information

Help your reader understand what your data is, where it came from, and how it can be used. Provide the following.

**General description**: provide a one- or two-sentence description of the data right at the beginning. For instance, "The data are diatom counts sampled from evenly-spaced depths in a sediment core from the Gulf of California." Nothing too complicated, just something to give your reader a sense of the 'what' right off the bat.

**Source**: indicate where your data came from. Provide a verbal description -- who collected it as part of what project and where -- and either a citation or a hyperlink.

**Collection methods**: How were the data values obtained? Provide a simple description of how measurements were taken (using scientific equipment? web scraping? surveys?).

**Sampling design and scope of inference**: Indicate the relevant population. If identifiable from data documentation, state the sampling frame and sampling mechanism and indicate the scope of inference. If no information is available about the sampling design, indicate this instead, and discuss the extent to which having no scope of inference is a limitation for the particular topic you're investigating.

*Start your draft here.*

### Data semantics and structure

**Units and observations**: State the observational units.

**Variable descriptions**: Provide a table of variable descriptions. If your dataset is large and you'll only work with a subset of the total available variables, limit your attention to the variables that you'll work with. Here's a template you can work with:

Name | Variable description | Type | Units of measurement
---|---|---|---
country | Country of observation | Character | Country name
year | Year of observation | Numeric | Calendar year
gdppc2015 | GDP per capita | Numeric | 2015 U.S. dollars
cpi2010 | Consumer price index | Numeric | Index (2010 = 100)
level | Level of education | Numeric | Grade (1 to 9)
sex | Sex of the group | Factor | Male/Female
location | Location type | Factor | Urban/Rural
prop | Proportion of the sample that attained the education level | Numeric | Proportion (0 to 1)


**Example rows**: Print a few example rows of your dataset in tidy format. Please don't include the code you used to manipulate the raw data. Do that in a separate notebook and export the result to a .csv file -- `data.to_csv('tidy-data.csv')` -- to load directly into the cell below.

*Start your draft here.*

In [63]:
# load tidied data and print rows
data_tidy = pd.read_csv('../data/tidy_data.csv').drop(columns='Unnamed: 0')
data_tidy

,country,year,gdppc2015,cpi2010,location,sex,level,prop
0,Afghanistan,2015,556.007,132.883,Urban,Male,1,0.926
1,Afghanistan,2007,392.710,83.074,Urban,Male,1,0.846
2,Afghanistan,2010,526.104,100.000,Urban,Male,1,0.862
3,Angola,2015,4166.980,159.405,Urban,Male,1,0.976
4,Angola,1999,2458.096,0.684,Urban,Male,1,
...,...,...,...,...,...,...,...,...
22531,Zimbabwe,2010,1110.447,100.000,Rural,Female,9,0.551
22532,Zimbabwe,2015,1445.070,106.213,Rural,Female,9,0.499
22533,Zimbabwe,2009,940.532,97.066,Rural,Female,9,0.532
22534,Zimbabwe,2014,1443.618,108.859,Rural,Female,9,0.594
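
The `Unnamed: 0` column dropped in the cell above typically appears when a DataFrame is written to CSV with its default row index and then read back. The following is a minimal sketch of that behaviour, not the code used to produce `tidy_data.csv`; the two-row frame and the temporary file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the tidied data.
toy = pd.DataFrame({"country": ["Afghanistan", "Angola"], "year": [2015, 1999]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    # Default behaviour: the row index is written as an unnamed first column.
    toy.to_csv(with_index)
    # index=False leaves the row index out of the file.
    toy.to_csv(without_index, index=False)

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'country', 'year']
print(cols_without_index)  # ['country', 'year']
```

Exporting with `index=False` would make the `.drop(columns='Unnamed: 0')` step unnecessary when loading.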


---
## 2. Initial explorations

At this stage, you may spend most of your effort on the computing side tidying up the data. You're not expected to complete a thorough exploratory analysis, and if your dataset was especially messy to start with, you may not even begin your exploratory analysis by the time you prepare this report. You have the option to leave exploration for the next stage of work and simply report basic properties of the dataset, but you should at minimum address the items in the 'basic properties' section below.

### Basic properties of the dataset

Help the reader get acquainted with your dataset on a simple level by identifying characteristics of the dataset and variable summaries. Some amount of code is fine here, but try to use code cells sparingly.

**Dimensions**: state the dimensions of the data (in tidy format, of course).

**Missing values**: Are there missing values? If so, why are they missing?

**Variable summaries**: Provide simple variable summaries for the most important variables in your dataset. Preferably, you'll do this for all variables, but if you have a large number, you might need to prioritize and focus on the ones most of interest. What exactly you do is a little case-specific, but think of things like means and variances, min/max, number of levels and observation counts for categorical variables, etc.

### Dimensions

In [64]:
data_tidy.shape

(22536, 8)

The tidy data has shape 22536 x 8, so it contains 22,536 observations (rows) and 8 variables (columns).

### Missing Data

In [65]:
data_tidy.isna().mean()

country      0.000000
year         0.000000
gdppc2015    0.011182
cpi2010      0.070288
location     0.000000
sex          0.000000
level        0.000000
prop         0.015974
dtype: float64

There is some missingness in our tidy data. Specifically, `gdppc2015` is missing about 1.1% of its values, `prop` about 1.6%, and `cpi2010` about 7%. The `gdppc2015` and `cpi2010` values are missing because the country did not report its GDP or CPI for that year.
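
One way to probe the non-reporting explanation is to check whether a variable is missing for all rows of a country-year or only for some of them. This is a sketch on a hypothetical toy frame (the values are invented; only the column names follow the tidy data), not a result from the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names follow the tidy data.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B"],
    "year": [2000, 2000, 2001, 2001, 2000, 2000],
    "cpi2010": [np.nan, np.nan, 80.0, 80.0, 95.0, 95.0],
    "prop": [0.5, 0.6, 0.7, np.nan, 0.8, 0.9],
})

# Share of missing values per country-year for each variable.
missing_share = (
    toy[["cpi2010", "prop"]]
    .isna()
    .groupby([toy["country"], toy["year"]])
    .mean()
)

# A share strictly between 0 and 1 means a variable is missing for only
# some rows of a country-year, which non-reporting alone would not explain.
partially_missing = ((missing_share > 0) & (missing_share < 1)).any()

print(missing_share)
print(partially_missing)
```

In the toy frame, `cpi2010` is either fully present or fully absent within each country-year, while `prop` is partially missing for one country-year.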

One thing to note is that `year` is not continuous for every country: some countries did not report anything in certain years, so we need to keep track of these gap years.
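
Gap years could be tracked with a small helper like the one below. This is a sketch on hypothetical toy data; `missing_years` is an illustrative name, and treating every unobserved year between a country's first and last observation as a gap is an assumption that may not suit data collected at irregular intervals.

```python
import pandas as pd

# Hypothetical toy rows; only the country and year columns matter here.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "year": [2000, 2001, 2004, 2010, 2011],
})


def missing_years(years):
    """Return the years between the first and last observation that never appear."""
    observed = {int(y) for y in years}
    return sorted(set(range(min(observed), max(observed) + 1)) - observed)


# One list of gap years per country.
gaps = toy.groupby("country")["year"].apply(missing_years)
print(gaps)
```

For the toy frame, country `A` has gaps in 2002 and 2003 and country `B` has none.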

### Variables Summary

Many of our variables are binary factors, so we focus on the most important variables of interest: `gdppc2015`, `cpi2010`, and `prop`. These summaries are grouped by year, since a summary over all years combined would not be meaningful.

#### `gdppc2015` and `cpi2010` summary grouped by year

In [66]:
data_tidy.loc[:, ['year','gdppc2015', 'cpi2010']].groupby('year').describe()

,gdppc2015,gdppc2015,gdppc2015,gdppc2015,gdppc2015,gdppc2015,gdppc2015,gdppc2015,cpi2010,cpi2010,cpi2010,cpi2010,cpi2010,cpi2010,cpi2010,cpi2010
year,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
1981,36.0,6079.244,9.22396e-13,6079.244,6079.244,6079.244,6079.244,6079.244,36.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1985,36.0,6178.524,9.22396e-13,6178.524,6178.524,6178.524,6178.524,6178.524,36.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1989,108.0,4700.474667,2758.735,883.763,883.763,5988.541,7229.12,7229.12,108.0,18.569667,14.710811,7.186,7.186,9.281,39.242,39.242
1990,288.0,3790.076875,1541.082,924.635,3075.924,3759.545,4706.346,6155.646,288.0,21.493625,17.910186,0.001,7.982,15.705,34.40925,49.759
1991,360.0,2609.1086,1732.949,553.526,1563.295,2015.6735,2774.255,6110.719,360.0,28.5206,22.569267,0.005,12.081,16.686,46.978,69.721
1992,504.0,1091.065857,865.686,243.371,457.653,696.756,1559.233,2990.413,360.0,29.1262,22.191499,1.139,2.92,27.6335,49.228,64.976
1993,396.0,2070.116727,1624.154,495.764,937.446,1643.149,2870.981,5804.319,324.0,22.434,14.946775,0.336,11.152,19.35,36.646,39.528
1994,468.0,2402.668769,2118.55,355.148,576.689,1662.913,3356.098,8018.721,396.0,37.190182,23.721174,2.024,14.67,51.094,59.448,61.996
1995,576.0,3375.670375,1975.943,243.598,1886.00875,3909.691,4476.79525,6584.739,576.0,37.650938,18.794739,6.906,26.96275,35.017,42.5855,72.937
1996,720.0,2724.90865,2473.522,236.461,571.922,1698.789,3886.13175,7770.539,612.0,39.818824,16.785738,9.63,24.934,42.414,43.76,65.004


At first glance, `gdppc2015` fluctuates from year to year, while `cpi2010` shows a clear upward trend as `year` increases.
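
Because `gdppc2015` and `cpi2010` are country-level values, they repeat across the location, sex, and level rows of each country-year (the 1981 rows above, for example, contain 36 identical values). A sketch, assuming this repetition holds throughout and using hypothetical toy data, of summarizing one row per country-year so that `count` reflects the number of countries rather than the number of rows:

```python
import pandas as pd

# Hypothetical toy rows: country-level values repeat across sex groups.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B", "B"],
    "year": [2000, 2000, 2000, 2000, 2001, 2001],
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "gdppc2015": [1000.0, 1000.0, 3000.0, 3000.0, 3200.0, 3200.0],
})

# Keep one row per country-year so each country counts once per year.
country_year = toy.drop_duplicates(subset=["country", "year"])

# Yearly summary over countries rather than over repeated rows.
yearly = country_year.groupby("year")["gdppc2015"].agg(["count", "mean", "median"])
print(yearly)
```

The `count` column then also shows how many countries contribute to each year, which matters when comparing years with different sets of reporting countries.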

#### `prop` summary grouped by year

In [70]:
data_tidy.loc[:, ['year','prop']].groupby(['year']).describe()

,prop,prop,prop,prop,prop,prop,prop,prop
year,count,mean,std,min,25%,50%,75%,max
1981,36.0,0.4655,0.30175,0.028,0.178,0.469,0.7225,0.938
1985,36.0,0.481528,0.30845,0.021,0.181,0.4995,0.76725,0.947
1989,108.0,0.785361,0.218442,0.158,0.7065,0.8855,0.943,0.995
1990,252.0,0.740282,0.28381,0.03,0.5845,0.873,0.974,1.0
1991,360.0,0.649672,0.291056,0.016,0.43425,0.747,0.8985,0.996
1992,504.0,0.589631,0.299883,0.002,0.33025,0.6495,0.845,0.995
1993,396.0,0.695737,0.269002,0.022,0.49575,0.7725,0.937,0.998
1994,468.0,0.612859,0.335876,0.0,0.31,0.699,0.942,1.0
1995,576.0,0.687719,0.305519,0.003,0.4585,0.805,0.953,1.0
1996,756.0,0.634366,0.298583,0.0,0.3995,0.713,0.8965,1.0


There is a clear upward trend in `prop`, the proportion of children (5-19 years old) who received an education, as `year` increases.

#### `prop` summary grouped by level

In [69]:
data_tidy.loc[:, ['level','prop']].groupby(['level']).describe()

,prop,prop,prop,prop,prop,prop,prop,prop
level,count,mean,std,min,25%,50%,75%,max
1,2464.0,0.886287,0.167455,0.011,0.85975,0.9595,0.987,1.0
2,2464.0,0.874638,0.172608,0.011,0.838,0.949,0.983,1.0
3,2464.0,0.851898,0.183777,0.011,0.797,0.929,0.976,1.0
4,2464.0,0.817203,0.200569,0.01,0.735,0.895,0.96425,1.0
5,2464.0,0.769976,0.223202,0.008,0.657,0.848,0.947,1.0
6,2464.0,0.700446,0.254869,0.007,0.53075,0.768,0.91425,1.0
7,2464.0,0.603727,0.276006,0.004,0.39,0.638,0.839,1.0
8,2464.0,0.513729,0.287668,0.0,0.26875,0.5125,0.752,1.0
9,2464.0,0.401914,0.269362,0.0,0.163,0.373,0.61425,0.989


There is a negative association between `prop` and `level`: the mean of `prop` falls steadily from about 0.89 at level 1 to about 0.40 at level 9.
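
The direction of this relationship can be checked from the per-level means in the summary table above. The sketch below uses only those nine means, so the resulting coefficient describes the level averages and will generally be stronger than a correlation computed on the individual rows.

```python
import numpy as np

# Mean of `prop` at each education level, copied from the summary table above.
levels = np.arange(1, 10)
mean_prop = np.array([
    0.886287, 0.874638, 0.851898, 0.817203, 0.769976,
    0.700446, 0.603727, 0.513729, 0.401914,
])

# Check that the mean falls at every step from one level to the next.
strictly_decreasing = bool(np.all(np.diff(mean_prop) < 0))

# Pearson correlation between level and the level means.
r = np.corrcoef(levels, mean_prop)[0, 1]

print(strictly_decreasing, round(r, 3))
```

A row-level version could be computed with `data_tidy['level'].corr(data_tidy['prop'])`, which ignores rows where `prop` is missing.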

### Exploratory analysis

If you were lucky and your dataset was neat, you should aim to include a few exploratory plots or tables here -- they don't need to be polished at this stage, but you should select plots that are informative (rather than including all plots you may have looked at). 

If you do include exploratory graphics or tables, please explain in a sentence or two what each one shows. Try to include a minimum of code. Consider [saving your plots as images](https://altair-viz.github.io/user_guide/saving_charts.html#png-svg-and-pdf-format) and inputting images into markdown cells instead of generating them anew via code cells.

---
## 3. Planned work

Here you should indicate your tentative ideas for your analysis. Don't worry, these aren't final -- you can always change your mind later or shift gears if they don't pan out. The objective is to have you start thinking ahead about what you'll do.

### Questions

Please propose two focused questions that you plan to explore.

1. *Question 1 here*
2. *Question 2 here*

### Proposed approaches

For each question, please describe an idea or two about how you might approach the question.

1. *Approach 1 here* 
2. *Approach 2 here*

---
## Submission Checklist
1. Save file to confirm all changes are on disk
2. Run *Kernel > Restart & Run All* to execute all code from top to bottom
3. Save file again to write any new output to disk
4. Generate PDF and submit to Gradescope