# SAT & ACT EDA Notebook (2)

## Table of Contents
- [Importing Necessary Libraries & Loading Data](#Importing-Necessary-Libraries-&-Loading-Data)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Descriptive & Inferential Statistics](#Descriptive-&-Inferential-Statistics)
- [Outside Research](#Outside-Research)

## Importing Necessary Libraries & Loading Data

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings

# silence all warnings for cleaner notebook output
warnings.simplefilter(action="ignore")

In [2]:
combined = pd.read_csv('../Data/final.csv')
combined.head()

Unnamed: 0,state,sat_participation_17,sat_reading_and_writing_17,sat_math_17,sat_total_17,act_participation_17,act_english_17,act_math_17,act_reading_17,act_science_17,act_composite_17,sat_participation_18,sat_reading_and_writing_18,sat_math_18,sat_total_18,act_participation_18,act_composite_18
0,Alabama,5.0,593,572,1165,100.0,18.9,18.4,19.7,19.4,19.2,6.0,595,571,1166,100.0,19.1
1,Alaska,38.0,547,533,1080,65.0,18.7,19.8,20.4,19.9,19.8,43.0,562,544,1106,33.0,20.8
2,Arizona,30.0,563,553,1116,62.0,18.6,19.8,20.1,19.8,19.7,29.0,577,572,1149,66.0,19.2
3,Arkansas,3.0,614,594,1208,100.0,18.9,19.0,19.7,19.5,19.4,5.0,592,576,1169,100.0,19.4
4,California,53.0,531,524,1055,31.0,22.5,22.7,23.1,22.2,22.8,60.0,540,536,1076,27.0,22.7


## Exploratory Data Analysis

Looking at the summary statistics.

In [5]:
combined.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
sat_participation_17,51.0,39.803922,35.276632,2.0,4.0,38.0,66.0,100.0
sat_reading_and_writing_17,51.0,569.117647,45.666901,482.0,533.5,559.0,613.0,644.0
sat_math_17,51.0,556.882353,47.121395,468.0,523.5,548.0,599.0,651.0
sat_total_17,51.0,1126.098039,92.494812,950.0,1055.5,1107.0,1212.0,1295.0
act_participation_17,51.0,65.254902,32.140842,8.0,31.0,69.0,100.0,100.0
act_english_17,51.0,20.931373,2.353677,16.3,19.0,20.7,23.3,25.5
act_math_17,51.0,21.182353,1.981989,18.0,19.4,20.9,23.1,25.3
act_reading_17,51.0,22.013725,2.067271,18.1,20.45,21.8,24.15,26.0
act_science_17,51.0,21.45098,1.739353,18.2,19.95,21.3,23.2,24.9
act_composite_17,51.0,21.519608,2.020695,17.8,19.8,21.4,23.6,25.5


Investigating the standard deviations by computing the standard deviation manually for each numeric column in the dataframe using the population formula:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

and comparing these to the output from pandas `describe` and the output from numpy's `std` methods.

In [12]:
# function to compute the population standard deviation manually (divides by n)
def standard_dev(df, column_name):
    values = df[column_name]
    n = len(values)
    mean = values.mean()
    # accumulate the squared deviations from the mean
    sum_diff_squared = 0
    for x in values:
        sum_diff_squared += (x - mean) ** 2
    variance = sum_diff_squared / n
    return variance ** 0.5

In [18]:
# creating dictionary for each numeric column and its manually calculated standard deviation
{col: standard_dev(combined,col) for col in combined
      .select_dtypes(include=['int64', 'float64'])}

{'sat_participation_17': 34.92907076664508,
 'sat_reading_and_writing_17': 45.21697020437866,
 'sat_math_17': 46.65713364485503,
 'sat_total_17': 91.58351056778743,
 'act_participation_17': 31.824175751231806,
 'act_english_17': 2.3304876369363363,
 'act_math_17': 1.9624620273436781,
 'act_reading_17': 2.0469029314842646,
 'act_science_17': 1.7222161451443676,
 'act_composite_17': 2.000786081581989,
 'sat_participation_18': 36.946619223539415,
 'sat_reading_and_writing_18': 47.03460978357609,
 'sat_math_18': 47.30194550378352,
 'sat_total_18': 93.22742384464433,
 'act_participation_18': 33.70173582041031,
 'act_composite_18': 2.090779082141178}

In [17]:
#  creating dictionary for each numeric column and standard deviation using numpy
{col: np.std(combined[col]) for col in combined
      .select_dtypes(include=['int64', 'float64'])}

{'sat_participation_17': 34.92907076664508,
 'sat_reading_and_writing_17': 45.21697020437866,
 'sat_math_17': 46.65713364485503,
 'sat_total_17': 91.58351056778743,
 'act_participation_17': 31.824175751231806,
 'act_english_17': 2.3304876369363363,
 'act_math_17': 1.9624620273436781,
 'act_reading_17': 2.0469029314842646,
 'act_science_17': 1.7222161451443676,
 'act_composite_17': 2.000786081581989,
 'sat_participation_18': 36.946619223539415,
 'sat_reading_and_writing_18': 47.03460978357609,
 'sat_math_18': 47.30194550378352,
 'sat_total_18': 93.22742384464433,
 'act_participation_18': 33.70173582041031,
 'act_composite_18': 2.090779082141178}

In [25]:
#  creating dictionary for each numeric column and standard deviation using pandas
{col: combined[col].std() for col in combined
      .select_dtypes(include=['int64', 'float64'])}

{'sat_participation_17': 35.276632270013046,
 'sat_reading_and_writing_17': 45.66690138768932,
 'sat_math_17': 47.12139516560329,
 'sat_total_17': 92.49481172519046,
 'act_participation_17': 32.14084201588683,
 'act_english_17': 2.35367713980303,
 'act_math_17': 1.9819894936505533,
 'act_reading_17': 2.0672706264873146,
 'act_science_17': 1.7393530462812443,
 'act_composite_17': 2.020694891154341,
 'sat_participation_18': 37.31425633039196,
 'sat_reading_and_writing_18': 47.50262737831599,
 'sat_math_18': 47.77262322095955,
 'sat_total_18': 94.15508275097599,
 'act_participation_18': 34.03708473496081,
 'act_composite_18': 2.111583366510896}

The manually calculated standard deviations do not match the output from pandas because pandas computes the sample standard deviation *(with (n-1) in the denominator)*, while the manual calculation uses the population standard deviation *(with (n) in the denominator)*. However, numpy's `std` method does match the manually calculated values, as its default is the standard deviation for a population *(ddof = 0)*.
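As a quick check of the `ddof` behaviour described above, the sketch below compares the two conventions on a small sample. The five values are the `sat_total_17` entries from the first five rows shown earlier; the variable names are illustrative and not part of the original notebook. It also checks, using the `sat_participation_17` figures reported above, that the sample value equals the population value multiplied by $\sqrt{n/(n-1)}$ with $n = 51$.

```python
import numpy as np
import pandas as pd

# Illustrative sample: sat_total_17 for the first five rows of the dataframe
scores = pd.Series([1165, 1080, 1116, 1208, 1055], name="sat_total_17")
n = len(scores)

# numpy defaults to the population formula (ddof=0)
pop_std = np.std(scores.to_numpy())
# pandas defaults to the sample formula (ddof=1)
sample_std = scores.std()

# Either library can reproduce the other by setting ddof explicitly
pop_via_pandas = scores.std(ddof=0)
sample_via_numpy = np.std(scores.to_numpy(), ddof=1)

# The two conventions differ only by the factor sqrt(n / (n - 1))
correction = np.sqrt(n / (n - 1))

# Same check on the values reported above for sat_participation_17 (51 rows)
n_rows = 51
pop_reported = 34.92907076664508
sample_reported = 35.276632270013046
rescaled = pop_reported * np.sqrt(n_rows / (n_rows - 1))

print(f"population std: {pop_std:.4f}, sample std: {sample_std:.4f}")
print(f"rescaled population std: {rescaled:.6f} vs reported sample std: {sample_reported:.6f}")
```

With 51 rows the correction factor is only about 1%, which is why the two sets of values above are close but not identical.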

#### Participation Rates

In [26]:
participation_cols = ['sat_participation_17', 'sat_participation_18',
                      'act_participation_17', 'act_participation_18']

States with the highest participation rates.

In [46]:
# 2017 SAT states with highest participation rates
combined[['state', 'sat_participation_17']].sort_values('sat_participation_17',
                                                        ascending=False).head(10)

Unnamed: 0,state,sat_participation_17
8,District of Columbia,100.0
22,Michigan,100.0
6,Connecticut,100.0
7,Delaware,100.0
29,New Hampshire,96.0
19,Maine,95.0
12,Idaho,93.0
9,Florida,83.0
21,Massachusetts,76.0
39,Rhode Island,71.0


In 2017, the District of Columbia, Michigan, Connecticut, and Delaware had the highest SAT participation rates, at 100% participation.

In [44]:
# 2018 SAT states with highest participation rates
combined[['state', 'sat_participation_18']].sort_values('sat_participation_18',
                                                        ascending=False).head(10)

Unnamed: 0,state,sat_participation_18
5,Colorado,100.0
6,Connecticut,100.0
7,Delaware,100.0
22,Michigan,100.0
12,Idaho,100.0
19,Maine,99.0
13,Illinois,99.0
39,Rhode Island,97.0
29,New Hampshire,96.0
8,District of Columbia,92.0


In 2018, Colorado, Connecticut, Delaware, Michigan, and Idaho had the highest SAT participation rates at 100% participation.

In [50]:
# 2017 ACT states with highest participation rates
combined[['state', 'act_participation_17']].sort_values('act_participation_17',
                                                        ascending=False).head(20)

Unnamed: 0,state,act_participation_17
0,Alabama,100.0
17,Kentucky,100.0
49,Wisconsin,100.0
44,Utah,100.0
42,Tennessee,100.0
40,South Carolina,100.0
36,Oklahoma,100.0
33,North Carolina,100.0
28,Nevada,100.0
26,Montana,100.0


In [66]:
print("In 2017, states with 100% ACT participation = ")
print(set(combined[combined['act_participation_17']==100]['state']))

In 2017, states with 100% ACT participation = 
{'Louisiana', 'Arkansas', 'Wisconsin', 'Alabama', 'Montana', 'South Carolina', 'Tennessee', 'Utah', 'Kentucky', 'North Carolina', 'Missouri', 'Wyoming', 'Nevada', 'Minnesota', 'Colorado', 'Mississippi', 'Oklahoma'}


In [49]:
# 2018 ACT states with highest participation rates
combined[['state', 'act_participation_18']].sort_values('act_participation_18',
                                                        ascending=False).head(20)

Unnamed: 0,state,act_participation_18
0,Alabama,100.0
17,Kentucky,100.0
49,Wisconsin,100.0
44,Utah,100.0
42,Tennessee,100.0
40,South Carolina,100.0
36,Oklahoma,100.0
35,Ohio,100.0
33,North Carolina,100.0
28,Nevada,100.0


In [67]:
print("In 2018, states with 100% ACT participation = ")
print(set(combined[combined['act_participation_18']==100]['state']))

In 2018, states with 100% ACT participation = 
{'Louisiana', 'Arkansas', 'Nebraska', 'Wisconsin', 'Alabama', 'Montana', 'South Carolina', 'Tennessee', 'Utah', 'Kentucky', 'North Carolina', 'Missouri', 'Wyoming', 'Nevada', 'Ohio', 'Mississippi', 'Oklahoma'}
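The two printed sets can be compared directly with Python's set operations to see which states entered or left the 100% ACT group. The sketch below copies the sets from the outputs above; the variable names are illustrative and not part of the original notebook.

```python
# States with 100% ACT participation, copied from the printed outputs above
act_100_2017 = {
    'Louisiana', 'Arkansas', 'Wisconsin', 'Alabama', 'Montana',
    'South Carolina', 'Tennessee', 'Utah', 'Kentucky', 'North Carolina',
    'Missouri', 'Wyoming', 'Nevada', 'Minnesota', 'Colorado',
    'Mississippi', 'Oklahoma',
}
act_100_2018 = {
    'Louisiana', 'Arkansas', 'Nebraska', 'Wisconsin', 'Alabama', 'Montana',
    'South Carolina', 'Tennessee', 'Utah', 'Kentucky', 'North Carolina',
    'Missouri', 'Wyoming', 'Nevada', 'Ohio', 'Mississippi', 'Oklahoma',
}

# Set difference: at 100% in one year but not the other
left_group = act_100_2017 - act_100_2018
joined_group = act_100_2018 - act_100_2017
# Set intersection: at 100% in both years
stayed = act_100_2017 & act_100_2018

print("No longer at 100% in 2018:", sorted(left_group))
print("Newly at 100% in 2018:", sorted(joined_group))
print("At 100% in both years:", len(stayed), "states")
```

On these sets, Colorado and Minnesota drop out of the 100% group in 2018, while Nebraska and Ohio join it.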


Investigating changes between 2017 and 2018 in the states with the highest participation rates.

In [56]:
# 2017 SAT states with highest participation rates compared to other participation metrics
combined[['state']+ participation_cols].sort_values('sat_participation_17',
                                                        ascending=False).head(10)

Unnamed: 0,state,sat_participation_17,sat_participation_18,act_participation_17,act_participation_18
8,District of Columbia,100.0,92.0,32.0,32.0
22,Michigan,100.0,100.0,29.0,22.0
6,Connecticut,100.0,100.0,31.0,26.0
7,Delaware,100.0,100.0,18.0,17.0
29,New Hampshire,96.0,96.0,18.0,16.0
19,Maine,95.0,99.0,8.0,7.0
12,Idaho,93.0,100.0,38.0,36.0
9,Florida,83.0,56.0,73.0,66.0
21,Massachusetts,76.0,80.0,29.0,25.0
39,Rhode Island,71.0,97.0,21.0,15.0


In [55]:
# 2018 SAT states with highest participation rates, compared to other participation metrics
combined[['state']+ participation_cols].sort_values('sat_participation_18',
                                                        ascending=False).head(10)

Unnamed: 0,state,sat_participation_17,sat_participation_18,act_participation_17,act_participation_18
5,Colorado,11.0,100.0,100.0,30.0
6,Connecticut,100.0,100.0,31.0,26.0
7,Delaware,100.0,100.0,18.0,17.0
22,Michigan,100.0,100.0,29.0,22.0
12,Idaho,93.0,100.0,38.0,36.0
19,Maine,95.0,99.0,8.0,7.0
13,Illinois,9.0,99.0,93.0,43.0
39,Rhode Island,71.0,97.0,21.0,15.0
29,New Hampshire,96.0,96.0,18.0,16.0
8,District of Columbia,100.0,92.0,32.0,32.0


SAT participation rates increased sharply from 2017 to 2018 for Colorado and Illinois, going from 11% to 100% and from 9% to 99%, respectively.

In [68]:
# 2017 ACT states with highest participation rates, compared to other participation metrics
combined[['state']+ participation_cols].sort_values('act_participation_17',
                                                        ascending=False).head(20)

Unnamed: 0,state,sat_participation_17,sat_participation_18,act_participation_17,act_participation_18
0,Alabama,5.0,6.0,100.0,100.0
17,Kentucky,4.0,4.0,100.0,100.0
49,Wisconsin,3.0,3.0,100.0,100.0
44,Utah,3.0,4.0,100.0,100.0
42,Tennessee,5.0,6.0,100.0,100.0
40,South Carolina,50.0,55.0,100.0,100.0
36,Oklahoma,7.0,8.0,100.0,100.0
33,North Carolina,49.0,52.0,100.0,100.0
28,Nevada,26.0,23.0,100.0,100.0
26,Montana,10.0,10.0,100.0,100.0


From 2017 to 2018, ACT participation in Illinois and Colorado decreased sharply (from 93% to 43% and from 100% to 30%, respectively), while their SAT participation increased sharply.
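One way to surface shifts like these without scanning sorted tables is to compute the year-over-year change in percentage points for each test. The sketch below does this on a handful of rows copied from the tables above; the `sat_change` and `act_change` column names are assumptions introduced for illustration, and in the notebook the same two lines would be applied to the full `combined` dataframe.

```python
import pandas as pd

# A few rows copied from the participation tables above
sample = pd.DataFrame({
    'state': ['Colorado', 'Illinois', 'Florida', 'Rhode Island', 'Ohio', 'Alabama'],
    'sat_participation_17': [11, 9, 83, 71, 12, 5],
    'sat_participation_18': [100, 99, 56, 97, 18, 6],
    'act_participation_17': [100, 93, 73, 21, 75, 100],
    'act_participation_18': [30, 43, 66, 15, 100, 100],
}).set_index('state')

# Change in participation, in percentage points (2018 minus 2017)
changes = pd.DataFrame({
    'sat_change': sample['sat_participation_18'] - sample['sat_participation_17'],
    'act_change': sample['act_participation_18'] - sample['act_participation_17'],
})

# Largest SAT increases first
changes = changes.sort_values('sat_change', ascending=False)
print(changes)
```

In this subset, Illinois and Colorado stand out with SAT gains of 90 and 89 points alongside ACT drops of 50 and 70 points.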

In [69]:
# 2018 ACT states with highest participation rates, compared to other participation metrics
combined[['state']+ participation_cols].sort_values('act_participation_18',
                                                        ascending=False).head(20)

Unnamed: 0,state,sat_participation_17,sat_participation_18,act_participation_17,act_participation_18
0,Alabama,5.0,6.0,100.0,100.0
17,Kentucky,4.0,4.0,100.0,100.0
49,Wisconsin,3.0,3.0,100.0,100.0
44,Utah,3.0,4.0,100.0,100.0
42,Tennessee,5.0,6.0,100.0,100.0
40,South Carolina,50.0,55.0,100.0,100.0
36,Oklahoma,7.0,8.0,100.0,100.0
35,Ohio,12.0,18.0,75.0,100.0
33,North Carolina,49.0,52.0,100.0,100.0
28,Nevada,26.0,23.0,100.0,100.0


States with the lowest participation rates.

In [72]:
# 2017 SAT states with lowest participation rates
combined[['state', 'sat_participation_17']].sort_values('sat_participation_17').head()

Unnamed: 0,state,sat_participation_17
34,North Dakota,2.0
24,Mississippi,2.0
15,Iowa,2.0
25,Missouri,3.0
44,Utah,3.0


In 2017, North Dakota, Mississippi, and Iowa had the lowest SAT participation rates, at 2% participation.
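Because `head()` cuts a sorted table at a fixed number of rows, tied states can be split arbitrarily at the cutoff. A possible alternative, sketched below on the five rows shown above, is `DataFrame.nsmallest` with `keep='all'`, which returns every row tied for the lowest value; the `lowest` variable name is illustrative only.

```python
import pandas as pd

# The five rows from the 2017 SAT table above
sat_17 = pd.DataFrame({
    'state': ['North Dakota', 'Mississippi', 'Iowa', 'Missouri', 'Utah'],
    'sat_participation_17': [2.0, 2.0, 2.0, 3.0, 3.0],
})

# keep='all' retains every row tied for the minimum, even if that exceeds n=1
lowest = sat_17.nsmallest(1, 'sat_participation_17', keep='all')
print(lowest)
```

The same pattern with `nlargest` would list every state tied for the highest rate.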

In [73]:
# 2018 SAT states with lowest participation rates
combined[['state', 'sat_participation_18']].sort_values('sat_participation_18').head()

Unnamed: 0,state,sat_participation_18
34,North Dakota,2.0
50,Wyoming,3.0
41,South Dakota,3.0
27,Nebraska,3.0
49,Wisconsin,3.0


In 2018, North Dakota had the lowest SAT participation rate, at 2% participation.

In [74]:
# 2017 ACT states with lowest participation rates
combined[['state', 'act_participation_17']].sort_values('act_participation_17').head()

Unnamed: 0,state,act_participation_17
19,Maine,8.0
29,New Hampshire,18.0
7,Delaware,18.0
39,Rhode Island,21.0
38,Pennsylvania,23.0


In [75]:
# 2018 ACT states with lowest participation rates
combined[['state', 'act_participation_18']].sort_values('act_participation_18').head()

Unnamed: 0,state,act_participation_18
19,Maine,7.0
39,Rhode Island,15.0
29,New Hampshire,16.0
7,Delaware,17.0
38,Pennsylvania,20.0


In both 2017 and 2018, Maine had the lowest ACT participation rate, at 8% and 7% participation, respectively.

In [76]:
# 2017 SAT states with lowest participation rates, compared to other participation metrics
combined[['state']+ participation_cols].sort_values('sat_participation_17').head()

Unnamed: 0,state,sat_participation_17,sat_participation_18,act_participation_17,act_participation_18
34,North Dakota,2.0,2.0,98.0,98.0
24,Mississippi,2.0,3.0,100.0,100.0
15,Iowa,2.0,3.0,67.0,68.0
25,Missouri,3.0,4.0,100.0,100.0
44,Utah,3.0,4.0,100.0,100.0


In [77]:
# 2018 SAT states with lowest participation rates, compared to other participation metrics
combined[['state']+ participation_cols].sort_values('sat_participation_18').head()

Unnamed: 0,state,sat_participation_17,sat_participation_18,act_participation_17,act_participation_18
34,North Dakota,2.0,2.0,98.0,98.0
50,Wyoming,3.0,3.0,100.0,100.0
41,South Dakota,3.0,3.0,80.0,77.0
27,Nebraska,3.0,3.0,84.0,100.0
49,Wisconsin,3.0,3.0,100.0,100.0


In [79]:
# 2017 ACT states with lowest participation rates, compared to other participation metrics
combined[['state']+ participation_cols].sort_values('act_participation_17').head()

Unnamed: 0,state,sat_participation_17,sat_participation_18,act_participation_17,act_participation_18
19,Maine,95.0,99.0,8.0,7.0
29,New Hampshire,96.0,96.0,18.0,16.0
7,Delaware,100.0,100.0,18.0,17.0
39,Rhode Island,71.0,97.0,21.0,15.0
38,Pennsylvania,65.0,70.0,23.0,20.0


In [80]:
# 2018 ACT states with lowest participation rates, compared to other participation metrics
combined[['state']+ participation_cols].sort_values('act_participation_18').head()

Unnamed: 0,state,sat_participation_17,sat_participation_18,act_participation_17,act_participation_18
19,Maine,95.0,99.0,8.0,7.0
39,Rhode Island,71.0,97.0,21.0,15.0
29,New Hampshire,96.0,96.0,18.0,16.0
7,Delaware,100.0,100.0,18.0,17.0
38,Pennsylvania,65.0,70.0,23.0,20.0


Statewide participation rates for one test in a given year appear to be inversely related to participation rates for the other test in the same state and year: states with high SAT participation tend to have low ACT participation, and vice versa.
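The inverse relationship can be quantified with a Pearson correlation between the two participation columns for a given year. The sketch below is a minimal illustration on eight 2017 rows copied from the tables above; `participation_correlation` is a hypothetical helper, not part of the original notebook, and the coefficient for this subset will differ from the one for the full `combined` dataframe.

```python
import pandas as pd

def participation_correlation(df, year):
    """Pearson correlation between SAT and ACT participation for one year.

    Hypothetical helper: assumes columns named sat_participation_<year>
    and act_participation_<year>, as in the dataframe used above.
    """
    sat = df[f'sat_participation_{year}']
    act = df[f'act_participation_{year}']
    return sat.corr(act)  # Series.corr defaults to Pearson

# Eight 2017 rows copied from the tables above (illustrative subset only)
subset = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas',
              'California', 'Maine', 'Delaware', 'Florida'],
    'sat_participation_17': [5, 38, 30, 3, 53, 95, 100, 83],
    'act_participation_17': [100, 65, 62, 100, 31, 8, 18, 73],
})

r_2017 = participation_correlation(subset, 17)
print(f"SAT vs ACT participation, 2017 (subset): r = {r_2017:.2f}")
```

A coefficient near -1 would indicate a strong inverse linear relationship; calling `participation_correlation(combined, 17)` and `participation_correlation(combined, 18)` would give the values for all states.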

#### Mean Total/ Composite Scores