# Two Decades of Change in Education Around the World

**Yibo Liang**, **Aarya Kulkarni**, **Alan Su**, and **Nicole Magallanes Flores**

#### Author contributions

Author 1 contributed ...

Author 2 contributed ...

#### Abstract


---
## Introduction


### Background


### Aims

In this project, we explore educational attainment among youth across the world. How is educational attainment affected by a young person's circumstances, and which circumstances matter most? To approach these questions, we fit a linear regression model with the proportion of youth who completed a given grade as the response and with predictors including location (urban or rural home), level of education, year, sex, GDP per capita, and the consumer price index (CPI). We also made several visualizations to better understand the data and the fitted model. We found that location and level affected the response the most, with year and sex also noteworthy. GDP per capita and CPI remain in the model but have much less impact than the other variables. The following sections describe the project in more detail.

---
## Materials and methods

### Data description

The data describe the percentage of the population ages 15 to 19 that has completed each grade (1-9) in developing countries around the world from 1990 to 2020. They come from a World Bank database of summary information on educational attainment, compiled from household surveys in developing countries (http://www.worldbank.org/en/research/brief/edattain). Several household-based surveys were used to create the dataset, including the Demographic and Health Surveys (DHS; http://www.measuredhs.com), the Multiple Indicator Cluster Surveys (MICS; http://www.childinfo.org), the Living Standards Measurement Study surveys (LSMS; http://www.worldbank.org/lsms), and other household-based surveys (e.g., country-specific socio-economic surveys). In addition, selected country/year variables were added from the World Development Indicators database (http://databank.worldbank.org), including GDP per capita (in 2015 U.S. dollars) and the consumer price index (base year 2010) for various countries.

The relevant population is youth ages 15-19 in the countries surveyed. The dataset is aggregated from multiple surveys, and documentation was provided for two of the household survey types; their methodologies are described next. The Demographic and Health Surveys (DHS) use a mix of questionnaires, biomarkers, and geographic information as survey tools. Their sampling design is a two-stage stratified cluster design: Enumeration Areas (the areas canvassed) are drawn from Census files (stage 1), and then a sample of households is drawn from a list within each Enumeration Area (stage 2). For the Living Standards Measurement Study (LSMS) surveys, the sampling frame is given by the Population and Housing Census, and a two-round sampling method is used: the first round selects Primary Sampling Units (PSUs) by random sampling, and the second round selects subunits from each PSU by systematic sampling. Because the sampling frame is usually given by the country's Census and the sampling mechanism involves random sampling, the scope of inference is the households captured in the Census.

Initial exploration gave us a sense of the data's shape and composition. The tidied data contain 22,536 observations and 8 variables. Note that `gdppc2015` and `cpi2010` have missing values because some countries did not report GDP or CPI for certain years. We also found several trends. First, `cpi2010` is positively correlated with `year`. Second, `prop` is positively correlated with `year`. Third, there is a negative relationship between `level` and `prop`. Lastly, faceting by `sex` and `location` showed a positive correlation between `prop` and `year` in every level (male/female and urban/rural) of these two binary factors.

**Units and observations**: The observational units are 82 unique countries around the world from 1990 to 2020.

**Variable descriptions**:

Name | Variable description | Type | Units of measurement
---|---|---|---
country | country of observation | Character | country name
year | year of observation | Numeric | calendar year
gdppc2015 | GDP per capita | Numeric | 2015 U.S. dollars
cpi2010 | consumer price index | Numeric | index (2010 = 100)
level | level of education (grade completed, 1-9) | Numeric | grade
sex | sex of the group of youth aged 15-19 | Factor | none
location | location (urban or rural) of the group of youth aged 15-19 | Factor | none
prop | proportion of the group of youth aged 15-19 that completed the given level | Numeric | proportion

### Methods

To determine the relationships between our response and predictors, we begin by fitting a linear regression model to the data and assessing the coefficient estimates to identify the significant variables. We use 100 * `prop` (a percentage) as the response and the remaining variables as predictors. We dummy code the factor variables, impute missing values with column means, and convert the data frame into a design matrix. We then use the scikit-learn package to fit a multiple linear regression to this design matrix. After identifying the significant variables, we communicate their effects with a few more in-depth visualizations. Finally, we use the model to plot fitted (predicted) lines against all significant variables. We hope these steps reveal trends that describe the variation in `prop` among youth ages 15-19.
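The preprocessing and fitting steps described above can be sketched as follows. This is a minimal illustration rather than the code used for the report: the data frame `toy` is entirely made up (only the column names match the tidied data), the response is constructed to be exactly linear so the recovered coefficients are easy to check, and dropping `country` from the predictors is an assumption based on the model equation in the Results section.

```python
from itertools import product

import pandas as pd
from sklearn.linear_model import LinearRegression

NAN = float("nan")
# Made-up country/year indicators; two values are missing on purpose.
gdp = {("A", 2000): 1200.0, ("A", 2010): NAN, ("B", 2000): 800.0, ("B", 2010): 950.0}
cpi = {("A", 2000): 70.0, ("A", 2010): 100.0, ("B", 2000): NAN, ("B", 2010): 100.0}

rows = []
for country, year, level, sex, location in product(
    ["A", "B"], [2000, 2010], [1, 5, 9], ["Female", "Male"], ["Rural", "Urban"]
):
    # Synthetic response (hypothetical), exactly linear in the predictors.
    prop = (
        0.40
        + 0.005 * (year - 2000)
        - 0.04 * (level - 1)
        + 0.15 * (location == "Urban")
        + 0.01 * (sex == "Male")
    )
    rows.append(
        {
            "country": country,
            "year": year,
            "gdppc2015": gdp[(country, year)],
            "cpi2010": cpi[(country, year)],
            "level": level,
            "sex": sex,
            "location": location,
            "prop": prop,
        }
    )
toy = pd.DataFrame(rows)

# Response on the percentage scale.
y = 100 * toy["prop"]

# Predictors: everything except the response and (by assumption) the country label.
X = toy.drop(columns=["prop", "country"])

# Dummy code the factors; the first category of each becomes the baseline.
X = pd.get_dummies(X, columns=["level", "sex", "location"], drop_first=True, dtype=float)

# Impute remaining missing values with their column means.
X = X.fillna(X.mean())

# Fit the multiple linear regression and label the coefficients.
model = LinearRegression().fit(X, y)
coef = pd.Series(model.coef_, index=X.columns)
print(coef.round(2))
```

With `drop_first=True`, each dummy coefficient is read relative to the omitted baseline category (here level 1, `Female`, and `Rural`).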

---
## Results

### Linear Regression

The model below is our regression model (the response is multiplied by 100 to express it as a percentage):

$$
100 \cdot Y = \beta_0 + \beta_1 \cdot \operatorname{year} + \beta_2 \cdot \operatorname{gdppc2015} + \beta_3 \cdot \operatorname{cpi2010} + \beta_4 \cdot \operatorname{locationUrban} + \beta_5 \cdot \operatorname{sexMale} + \beta_6 \cdot \operatorname{level2} + \dots + \beta_{13} \cdot \operatorname{level9} + \epsilon
$$

(Table 1) The following table shows the coefficient estimates of our regression variables and their standard errors:

<div>
<img src="./Table.png" width="300" height="350" />
</div>

The key takeaway is that the coefficient estimates for `location` and `level` are much larger in magnitude than those of the other variables; `year` and `sex` are also worth noting. Contrary to what our initial exploration suggested, `cpi2010` is not as impactful as we had believed: although still significant, `cpi2010` and `gdppc2015` do not have a great effect on the response `prop`.

Side Note:

We assessed each variable's significance using its standard error and the corresponding confidence interval: a coefficient is treated as significant when 0 is not contained in its interval.

$$
{\displaystyle \beta \in \left[{\widehat {\beta }}-se_{\widehat {\beta }}t_{n-p}^{*},\ {\widehat {\beta }}+se_{\widehat {\beta }}t_{n-p}^{*}\right]}
$$
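The interval check can be sketched in a few lines. This is an illustration under stated assumptions, not the computation behind Table 1: the estimates and standard errors below are hypothetical, and the critical value $t_{n-p}^{*}$ is approximated by a standard normal quantile, which is reasonable only because $n - p$ is large for this dataset.

```python
from statistics import NormalDist


def conf_interval(estimate, std_error, level=0.95):
    """Approximate confidence interval for a regression coefficient.

    Assumption: the t critical value is replaced by a normal quantile,
    which is close when the residual degrees of freedom are large.
    """
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - crit * std_error, estimate + crit * std_error


def excludes_zero(interval):
    """Return True when 0 lies outside the interval."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Hypothetical (estimate, standard error) pairs, for illustration only.
examples = {"large_effect": (2.0, 0.5), "weak_effect": (0.3, 0.4)}
for name, (est, se) in examples.items():
    ci = conf_interval(est, se)
    print(name, [round(bound, 3) for bound in ci], excludes_zero(ci))
```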



### Visualization of Location and Year

(Visual 1) The following visual depicts `prop` against `year`, faceted by `location` (Y is on the proportion scale):

<div>
<img src="./visualization.png" width="800" height="450" />
</div>

Notice how average `prop` increases as `year` increases and notice how `Urban` areas have a higher overall mean for `prop`.

### Visualization of median `prop` against all significant variables

(Visual 2) The following visual depicts `prop` against `year`, colored by `level` and faceted by `location` and `sex` (Y is on the proportion scale):

<div>
<img src="./visualization2.png" width="800" height="600" />
</div>

Notice how average `prop` increases as `year` increases and how `Urban` areas have a higher overall mean for `prop`. Furthermore, notice how the average `prop` decreases as `level` increases. Meanwhile, the `sex` factor makes little visible difference.
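A per-group summary of the kind plotted here could be computed as in the sketch below. The data frame is made up (only the column names match the tidied data), and using the median as the summary statistic is an assumption taken from the section heading.

```python
import pandas as pd

# Made-up observations: three per location/year combination.
toy = pd.DataFrame(
    {
        "year": [2000, 2000, 2000, 2010, 2010, 2010] * 2,
        "location": ["Rural"] * 6 + ["Urban"] * 6,
        "sex": ["Female"] * 12,
        "level": [1] * 12,
        "prop": [0.2, 0.4, 0.9, 0.3, 0.5, 0.7, 0.5, 0.6, 0.8, 0.6, 0.9, 0.7],
    }
)

# One row per facet/color/x-axis combination, holding the median of prop.
summary = toy.groupby(
    ["location", "sex", "level", "year"], as_index=False
)["prop"].median()
print(summary)
```

Each row of `summary` corresponds to one point on a line in a faceted plot: the facet is given by `location` and `sex`, the color by `level`, and the horizontal position by `year`.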

### Model Visualization with All Significant Variables

(Visual 3) The following visual depicts our fitted model together with a combination of the previous two visuals (Y is on the percentage scale):

<div>
<img src="./visualization3.png" width="800" height="400" />
</div>

The percentage of youth who attain an education is positively correlated with `year` and negatively correlated with `level`. Notice the slope of the lines and the difference between the lines for each `level`.

---
## Discussion

Starting with the multiple linear regression (MLR) model, we can see which variables had the greatest impact on the response (`prop`): `location` and `level`. Holding all other variables fixed, the estimated average percentage of the population who completed a given level of education is 16.46 percentage points higher for observations in urban locations. The level with the greatest impact on the response is `level_9`: holding all other variables fixed, the estimated average percentage who completed the level is 48.58 percentage points lower when an observation refers to grade 9, compared with the baseline level (grade 1). `year` and `sex` also have a greater impact on the response than the remaining variables, although not as great as `location` and `level`. The model shows that `gdppc2015` and `cpi2010` have little effect on `prop`, so we did not focus on them in our further exploration.

Visual 1, which depicts `prop` against `year` faceted by `location`, shows an upward trend in `prop` over time for both levels of the location factor. However, observations in urban locations tend to have a higher median proportion of youth who have attained an education, and lower variability, than observations in rural locations. Visual 2 shows that average `prop` tends to increase over the years for all levels of education, and that urban areas have higher overall means for `prop` at all levels. For both urban/rural and male/female groups, average `prop` decreases as `level` increases. The difference in the average proportion of youth who have attained a given level also appears to become especially large after about level 5.

Our last visualization (Visual 3) shows the fitted model with the significant variables `location`, `level`, and `year`. It confirms a positive correlation between `year` and the percentage of youth who received an education. It also shows that `prop` is negatively correlated with `level`: the gap between the lines for successive levels grows as level increases, which supports what the earlier plots had shown. Our analysis supports the conclusion that, in both urban and rural locations and across male and female groups, the proportion of youth (ages 15-19) who have attained education levels 1-9 has increased over time. The proportion of youth who have attained a level also decreases as level increases from 1 to 9. Additionally, our findings indicate that `gdppc2015` and `cpi2010` contributed little to explaining the proportion of youth who have attained education levels 1-9.


When we first researched this dataset, we expected GDP per capita and CPI to be strong predictors of youth educational attainment, but our results indicate that they have little effect on `prop`. One possible explanation is that we analyzed educational attainment globally: GDP and CPI are country-specific variables, so they may matter less than variables such as sex or location, which are defined in the same way within every country. Studying specific countries might show a different picture for GDP and CPI. To conclude, our findings indicate that educational attainment has improved over time across the world, for both boys and girls, at all levels of education, and in both urban and rural areas.