<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 1: Standardized Test Analysis

--- 
# Part 1

Part 1 requires knowledge of basic Python.

---

## Problem Statement

### Has the College Board solved the problems it was seeking to fix with the 2016 format change?



### Contents:
- [Background](#Background)
- [Data Import & Cleaning](#Data-Import-and-Cleaning)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Data Visualization](#Visualize-the-Data)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

## Background

The SAT is a standardized test that many colleges and universities in the United States require for their admissions process. This score is used along with other materials such as grade point average (GPA) and essay responses to determine whether or not a potential student will be accepted to the university.

The SAT has two sections: Evidence-Based Reading and Writing, and Math ([*source*](https://www.princetonreview.com/college/sat-sections)).
* [SAT](https://collegereadiness.collegeboard.org/sat)

The SAT changed its format in 2016 to address problems that students and schools had with its style of questions and with grading they considered questionable. These problems, together with a perception that the SAT's questions were "class-biased", had allowed the ACT to surpass the SAT in popularity. The ACT also gained ground because many states made it free statewide and a requirement for high school graduation, a strategy the SAT had not yet matched.

### Choose your Data

* [`act_2017.csv`](./data/act_2017.csv): 2017 ACT Scores by State
* [`act_2018.csv`](./data/act_2018.csv): 2018 ACT Scores by State
* [`act_2019.csv`](./data/act_2019.csv): 2019 ACT Scores by State
* [`sat_2017.csv`](./data/sat_2017.csv): 2017 SAT Scores by State
* [`sat_2018.csv`](./data/sat_2018.csv): 2018 SAT Scores by State
* [`sat_2019.csv`](./data/sat_2019.csv): 2019 SAT Scores by State
* [`PopulationcsvData.csv`](./data/PopulationcsvData.csv): Population rankings by State


### Outside Research

* [`PopulationcsvData.csv`](./data/PopulationcsvData.csv): Population rankings by State ([source](https://worldpopulationreview.com/states))
* [`gdp2019rank.csv`](./data/cleaned_data/gdp2019rank.csv): DataFrame created in Python from outside sources on GDP rank by state in 2019 ([source](https://www.statista.com/statistics/248063/per-capita-us-real-gross-domestic-product-gdp-by-state/))
* [`actfreecols.csv`](./data/cleaned_data/actfreecols.csv): DataFrame created in Python from outside sources on states that offer the ACT free statewide ([source](https://blog.collegevine.com/states-that-require-the-act/#list))
* [`satfreecols.csv`](./data/cleaned_data/satfreecols.csv): DataFrame created in Python from outside sources on states that offer the SAT free statewide ([source](https://blog.collegevine.com/states-that-require-sat/))

### Additional Outside Research

[theolivebook](https://theolivebook.com/sat-vs-act-which-is-more-popular/#:~:text=As%20of%202019%2C%20the%20SAT,students%20who%20took%20the%20ACT): SAT-ACT popularity in 2019

[cnn](https://www.cnn.com/2014/03/05/living/sat-test-changes-schools/index.html): SAT Change details

[ivyscholars](https://www.ivyscholars.com/2021/06/11/are-the-sats-biased/): Are the SATs biased?

[edsource](https://edsource.org/2021/university-of-california-must-drop-sat-act-scores-for-admissions-and-scholarships/654842): University of California drops SAT-ACT scores for admissions

[greentestprep](https://greentestprep.com/resources/sat-prep/new-sat-march2016/why-is-the-college-board-changing-the-sat/): Why the College Board is changing the SAT

[collegevine](https://blog.collegevine.com/states-that-require-sat/): States where the SAT is free (and/or required)

[collegevine](https://blog.collegevine.com/states-that-require-the-act/#list): States where the ACT is free (and/or required)



### Coding Challenges

1. Manually calculate mean:

    Write a function that takes in values and returns the mean of the values. Create a list of numbers that you test on your function to check to make sure your function works!
    
    *Note*: Do not use any mean methods built-in to any Python libraries to do this! This should be done without importing any additional libraries.

In [6]:
# Code: 
def mean(x):
    return sum(x)/ (len(x))
mean([1,2,3,4,5])

3.0

2. Manually calculate standard deviation:

    The formula for standard deviation is below:

    $$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

    Where $x_i$ represents each value in the dataset, $\mu$ represents the mean of all values in the dataset and $n$ represents the number of values in the dataset.

    Write a function that takes in values and returns the standard deviation of the values using the formula above. Hint: use the function you wrote above to calculate the mean! Use the list of numbers you created above to test on your function.
    
    *Note*: Do not use any standard deviation methods built-in to any Python libraries to do this! This should be done without importing any additional libraries.
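
    As a quick hand check of the formula, here is the arithmetic for the test list $[1, 2, 3, 4, 5]$ used with the mean function above. It is only a worked instance of the population formula given here, not an additional requirement:

    $$\mu = \frac{1+2+3+4+5}{5} = 3, \qquad \sigma = \sqrt{\frac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5}} = \sqrt{\frac{10}{5}} = \sqrt{2} \approx 1.414$$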

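In [18]:
# Code:
def stddev(x):
    mu = mean(x)  # compute the mean once instead of on every iteration
    squared_diffs = []
    for num in x:
        squared_diffs.append((num - mu) ** 2)
    # Divide by n first, then take the square root of the whole quotient
    return (sum(squared_diffs) / len(x)) ** (1/2)
stddev([1,2,3,4,5])

1.4142135623730951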

3. Data cleaning function:
    
    Write a function that takes in a string that is a number and a percent symbol (ex. '50%', '30.5%', etc.) and converts this to a float that is the decimal approximation of the percent. For example, inputting '50%' in your function should return 0.5, '30.5%' should return 0.305, etc. Make sure to test your function to make sure it works!

You will use these functions later on in the project!

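In [20]:
# Code:

# Convert a single string such as '30.5%' to its decimal form (0.305)
def to_percent(x):
    return float(x.rstrip('%'))/100

# Code used in cleaning: x is a DataFrame and y is the name of a percent column;
# returns the whole column converted to decimal floats
def different_to_percent(x, y):
    return x[y].str.rstrip('%').astype('float') / 100.0

to_percent('5%')

0.05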

--- 
# Part 2

Part 2 requires knowledge of Pandas, EDA, data cleaning, and data visualization.

---

## Data Import and Cleaning

### Data Import & Cleaning

Import the datasets that you selected for this project and go through the following steps at a minimum. You are welcome to do further cleaning as you feel necessary:
1. Display the data: print the first 5 rows of each dataframe to your Jupyter notebook.
2. Check for missing values.
3. Check for any obvious issues with the observations (keep in mind the minimum & maximum possible values for each test/subtest).
4. Fix any errors you identified in steps 2-3.
5. Display the data types of each feature.
6. Fix any incorrect data types found in step 5.
    - Fix any individual values preventing other columns from being the appropriate type.
    - If your dataset has a column of percents (ex. '50%', '30.5%', etc.), use the function you wrote in Part 1 (coding challenges, number 3) to convert this to floats! *Hint*: use `.map()` or `.apply()`.
7. Rename Columns.
    - Column names should be all lowercase.
    - Column names should not contain spaces (underscores will suffice--this allows for using the `df.column_name` method to access columns in addition to `df['column_name']`).
    - Column names should be unique and informative.
8. Drop unnecessary rows (if needed).
9. Merge dataframes that can be merged.
10. Perform any additional cleaning that you feel is necessary.
11. Save your cleaned and merged dataframes as csv files.
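
The steps above can be sketched on a small scale as follows. This is a minimal illustration, not the code from the cleaning notebook: the two frames hold made-up values, and the `clean` helper, the `sat17`/`act17` suffixes, and the output file name are hypothetical choices made for the example.

```python
import pandas as pd

# Toy stand-ins for two raw files; every value here is made up for illustration.
sat_raw = pd.DataFrame({
    "State": ["Alpha", "Beta", "Gamma"],
    "Participation": ["5%", "30.5%", "100%"],
    "Total": [1200, 1100, 1000],
})
act_raw = pd.DataFrame({
    "State": ["Alpha", "Beta", "Gamma"],
    "Participation": ["100%", "60%", "20%"],
    "Composite": [19.5, 21.0, 24.1],
})


def to_percent(x):
    """Convert a string such as '30.5%' to its decimal form (0.305)."""
    return float(x.rstrip("%")) / 100


def clean(df, suffix):
    """Hypothetical helper: tidy column names, convert percents, tag columns."""
    out = df.copy()
    # Step 7: lowercase names with underscores instead of spaces.
    out.columns = [c.lower().replace(" ", "_") for c in out.columns]
    # Step 6: turn the percent strings into floats with .map().
    out["participation"] = out["participation"].map(to_percent)
    # Keep 'state' as the merge key and suffix every other column.
    return out.rename(
        columns={c: f"{c}_{suffix}" for c in out.columns if c != "state"}
    )


sat = clean(sat_raw, "sat17")
act = clean(act_raw, "act17")

print(sat.head())            # step 1: display the data
print(sat.isnull().sum())    # step 2: missing values per column
print(sat.dtypes)            # step 5: data types of each feature

# Step 9: merge on the shared key.
merged = sat.merge(act, on="state", how="inner")

# Step 11: save the result (hypothetical file name, left commented out).
# merged.to_csv("sat_act_2017.csv", index=False)
```

Suffixing the columns before merging keeps names unique and informative once both tests sit in the same frame.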

## Notebook that contains my data cleaning and creation of new CSVs
[`cleaning_and_creating.ipynb`](../cleaning_and_creating.ipynb)

## Data Dictionary


|Feature|Type|Dataset|Description|
|---|---|---|---|
|**participation17**|*float*|SATTotal|Participation numbers for the 2017 SAT(in decimal percentage)|
|**participation17**|*float*|ACTTotal|Participation numbers for the 2017 ACT(in decimal percentage)|
|**participation18**|*float*|ACTTotal|Participation numbers for the 2018 ACT(in decimal percentage)|
|**participation18**|*float*|SATTotal|Participation numbers for the 2018 SAT(in decimal percentage)|
|**participation19**|*float*|SATTotal|Participation numbers for the 2019 SAT(in decimal percentage)|
|**participation19**|*float*|ACTTotal|Participation numbers for the 2019 ACT(in decimal percentage)|
|**total17**|*int*|SATTotal|Average total score for the 2017 SAT(mean score)|
|**total18**|*int*|SATTotal|Average total score for the 2018 SAT(mean score)|
|**total19**|*int*|SATTotal|Average total score for the 2019 SAT(mean score)|
|**composite17**|*float*|ACTTotal|Average total score for the 2017 ACT(mean score)|
|**composite18**|*float*|ACTTotal|Average total score for the 2018 ACT(mean score)|
|**composite19**|*float*|ACTTotal|Average total score for the 2019 ACT(mean score)|
|**sat_free**|*boolean*|ACTTotal/SATTotal|States that have the SAT free statewide|
|**act_free**|*boolean*|ACTTotal/SATTotal|States that have the ACT free statewide|
|**population_rank**|*float/int*|ACTTotal/SATTotal|Population rank for each state|
|**gdp_rank_19**|*float/int*|ACTTotal/SATTotal|GDP rank for each state|
|**state**|*object*|SATTotal|The states where the data is located|
|**state**|*object*|ACTTotal|The states where the data is located|

## Exploratory Data Analysis

Complete the following steps to explore your data. You are welcome to do more EDA than the steps outlined here as you feel necessary:
1. Summary Statistics.
2. Use a **dictionary comprehension** to apply the standard deviation function you created in Part 1 to each numeric column in the dataframe.  **No loops**.
    - Assign the output to variable `sd` as a dictionary where: 
        - Each column name is now a key 
        - The standard deviation of the column is the value 
        - *Example Output :* `{'ACT_Math': 120, 'ACT_Reading': 120, ...}`
3. Investigate trends in the data.
    - Using sorting and/or masking (along with the `.head()` method to avoid printing our entire dataframe), consider questions relevant to your problem statement. Some examples are provided below (but feel free to change these questions for your specific problem):
        - Which states have the highest and lowest participation rates for the 2017, 2018, or 2019 SAT and ACT?
        - Which states have the highest and lowest mean total/composite scores for the 2017, 2018, or 2019 SAT and ACT?
        - Do any states with 100% participation on a given test have a rate change year-to-year?
        - Do any states have >50% participation on *both* tests each year?
        - Which colleges have the highest median SAT and ACT scores for admittance?
        - Which California school districts have the highest and lowest mean test scores?
    - **You should comment on your findings at each step in a markdown cell below your code block**. Make sure you include at least one example of sorting your dataframe by a column, and one example of using boolean filtering (i.e., masking) to select a subset of the dataframe.
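
A minimal sketch of steps 2 and 3 is shown here. The frame uses made-up values with column names borrowed from the data dictionary, and `mean` and `stddev` are restated so the block runs on its own; the actual analysis lives in the linked notebook and may differ.

```python
import pandas as pd


def mean(x):
    """Arithmetic mean of a list of numbers."""
    return sum(x) / len(x)


def stddev(x):
    """Population standard deviation, following the Part 1 formula."""
    mu = mean(x)
    return (sum((v - mu) ** 2 for v in x) / len(x)) ** 0.5


# Toy frame with made-up values; column names follow the data dictionary.
df = pd.DataFrame({
    "state": ["Alpha", "Beta", "Gamma", "Delta"],
    "participation17": [0.05, 0.60, 1.00, 0.55],
    "total17": [1200, 1100, 1000, 1050],
})

# Step 2: dictionary comprehension over the numeric columns only.
numeric_cols = df.select_dtypes(include="number").columns
sd = {col: stddev(df[col].tolist()) for col in numeric_cols}
print(sd)

# Step 3, sorting: states with the highest participation first.
top = df.sort_values("participation17", ascending=False).head(2)
print(top)

# Step 3, boolean mask: states where more than half of students took the test.
over_half = df[df["participation17"] > 0.5]
print(over_half)
```

Selecting the numeric columns first keeps the `state` column out of the comprehension, so `stddev` never receives strings.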

## This is the file that contains my EDA and visualization
[`visualization_and_analysis.ipynb`](../visualization_and_analysis.ipynb)

## Visualize the Data

There's not a magic bullet recommendation for the right number of plots to understand a given dataset, but visualizing your data is *always* a good idea. Not only does it allow you to quickly convey your findings (even if you have a non-technical audience), it will often reveal trends in your data that escaped you when you were looking only at numbers. It is important to not only create visualizations, but to **interpret your visualizations** as well.

**Every plot should**:
- Have a title
- Have axis labels
- Have appropriate tick labels
- Have legible text
- Demonstrate meaningful and valid relationships
- Have an interpretation to aid understanding

Here is an example of what your plots should look like following the above guidelines. Note that while the content of this example is unrelated, the principles of visualization hold:

![](https://snag.gy/hCBR1U.jpg)
*Interpretation: The above image shows that as we increase our spending on advertising, our sales numbers also tend to increase. There is a positive correlation between advertising spending and sales.*

---

Here are some prompts to get you started with visualizations. Feel free to add additional visualizations as you see fit:
1. Use Seaborn's heatmap with pandas `.corr()` to visualize correlations between all numeric features.
    - Heatmaps are generally not appropriate for presentations, and should often be excluded from reports as they can be visually overwhelming. **However**, they can be extremely useful in identifying relationships of potential interest (as well as identifying potential collinearity before modeling).
    - Please take time to format your output, adding a title. Look through some of the additional arguments and options. (Axis labels aren't really necessary, as long as the title is informative).
2. Visualize distributions using histograms. If you have a lot, consider writing a custom function and use subplots.
    - *OPTIONAL*: Summarize the underlying distributions of your features (in words & statistics)
         - Be thorough in your verbal description of these distributions.
         - Be sure to back up these summaries with statistics.
         - We generally assume that data we sample from a population will be normally distributed. Do we observe this trend? Explain your answers for each distribution and how you think this will affect estimates made from these data.
3. Plot and interpret boxplots. 
    - Boxplots demonstrate central tendency and spread in variables. In a certain sense, these are somewhat redundant with histograms, but you may be better able to identify clear outliers or differences in IQR, etc.
    - Multiple values can be plotted to a single boxplot as long as they are of the same relative scale (meaning they have similar min/max values).
    - Each boxplot should:
        - Only include variables of a similar scale
        - Have clear labels for each variable
        - Have appropriate titles and labels
4. Plot and interpret scatter plots to view relationships between features. Feel free to write a custom function, and subplot if you'd like. Functions save both time and space.
    - Your plots should have:
        - Two clearly labeled axes
        - A proper title
        - Colors and symbols that are clear and unmistakable
5. Additional plots of your choosing.
    - Are there any additional trends or relationships you haven't explored? Was there something interesting you saw that you'd like to dive further into? It's likely that there are a few more plots you might want to generate to support your narrative and recommendations that you are building toward. **As always, make sure you're interpreting your plots as you go**.
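
As one way to follow prompt 2, the sketch below wraps histogram plotting in a custom function that uses subplots and gives every plot a title and axis labels. The function name `plot_histograms`, its parameters, and the toy values are assumptions made for this example, not the code used in the visualization notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_histograms(df, columns, titles, xlabels, bins=10):
    """Draw one histogram per column on a single row of subplots."""
    fig, axes = plt.subplots(
        1, len(columns), figsize=(5 * len(columns), 4), squeeze=False
    )
    for ax, col, title, xlabel in zip(axes[0], columns, titles, xlabels):
        ax.hist(df[col], bins=bins)
        ax.set_title(title)
        ax.set_xlabel(xlabel)
        ax.set_ylabel("Number of states")
    fig.tight_layout()
    return fig


# Toy frame with made-up values; column names follow the data dictionary.
scores = pd.DataFrame({
    "participation17": [0.05, 0.60, 1.00, 0.55, 0.30],
    "total17": [1200, 1100, 1000, 1050, 1150],
})

fig = plot_histograms(
    scores,
    columns=["participation17", "total17"],
    titles=["2017 SAT Participation", "2017 SAT Mean Total Score"],
    xlabels=["Participation rate", "Mean total score"],
    bins=5,
)
```

Passing titles and labels as arguments keeps the helper reusable across years and tests while still meeting the plot checklist above.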

## Visualization Notebook attached to Analysis Notebook
[`visualization_and_analysis.ipynb`](../visualization_and_analysis.ipynb)

## Conclusions and Recommendations

From my analysis, I conclude that the SAT has once again overtaken the ACT as the market leader in standardized testing, primarily because of the changes made to its format and structure. The changes made the test more appealing not only to students but also to states, a number of which made it free between the date of the change and 2019. This is evidenced by the ACT's regression after the SAT change, which let the SAT retake the lead. While the participation problem was solved quickly and decisively, the second problem is where the SAT still needs work.

The SAT change occurred in part because of the test's reputation as class-biased. That reputation, the strange scoring of the 2400 scale, and a detachment from the actual school work that students were learning had led to the ACT becoming more popular. While the College Board did fix the participation problem, the lingering problem of class bias is still there. With more colleges pulling away from standardized testing, and with geographical location so heavily tied to participation and success, further changes are needed to rectify this issue and make the SAT a more equitable test.