## OP Learning Agenda: SF Class of 2014

A project to determine what, if anything, influenced the graduation success of the San Francisco class of 2014.

In [1]:
import pandas as pd
from pathlib import Path
from datetime import datetime
import statsmodels.api as sm
import numpy as np
from tabulate import tabulate

In [2]:
%matplotlib inline
%load_ext nb_black



In [3]:
today = datetime.today()
in_file = Path.cwd() / "data" / "processed" / "processed_data.pkl"
report_dir = Path.cwd() / "reports"
# f-string so the {today:...} placeholder is actually filled in
report_file = report_dir / f"Excel_Analysis_{today:%b-%d-%Y}.xlsx"

in_file2 = Path.cwd() / "data" / "processed" / "processed_data_file2.pkl"

in_file3 = Path.cwd() / "data" / "processed" / "processed_data_file3.pkl"



In [4]:
df = pd.read_pickle(in_file)

df2 = pd.read_pickle(in_file2)

df3 = pd.read_pickle(in_file3)


In [5]:
def sf_cross_tab(df, column, normalize="index"):
    """Cross-tabulate `column` by high school class for San Francisco students."""
    sf = df[df.site == "San Francisco"]  # filter once, reuse for both axes
    return pd.crosstab(
        sf.high_school_class,
        sf[column],
        normalize=normalize,
        margins=True,
    )
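
As a quick illustration of what the `normalize` argument does, here is a minimal sketch on made-up data. The toy rows, the `outcome` column, and its `yes`/`no` labels are assumptions for illustration only and are not part of the project data. With `normalize="index"` each class row sums to 1 and the `All` row holds the overall shares; with `normalize=False` the table holds raw counts.

```python
import pandas as pd

# Made-up rows for illustration only (not project data)
toy = pd.DataFrame(
    {
        "site": ["San Francisco"] * 5 + ["Other Site"],
        "high_school_class": [2013, 2013, 2014, 2014, 2014, 2014],
        "outcome": ["yes", "no", "yes", "yes", "no", "yes"],
    }
)

sf = toy[toy.site == "San Francisco"]

# Row-normalized: each class row is a share of that class
rates = pd.crosstab(
    sf.high_school_class, sf.outcome, normalize="index", margins=True
)

# Raw counts, with row and column totals labelled "All"
counts = pd.crosstab(
    sf.high_school_class, sf.outcome, normalize=False, margins=True
)

print(rates)
print(counts)
```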


##  General Distributions

The SF class of 2014 is on track to have the highest 6-year graduation rate, with almost 70% of students having already graduated. That rate is not significantly higher than the class of 2013's, though we do see a reasonably big jump from 2012 to 2013.



#### Table 1. San Francisco 6 Year Graduation Rate by High School Class 

In [6]:
# Grad Rate Less than 6 years

grad_rate_6_year = sf_cross_tab(df, "graduated_4_year_degree_less_6_years")

print(tabulate(grad_rate_6_year, headers=["HS Class","% Did not Graduate", "Graduation Rate"], tablefmt='simple'))


HS Class      % Did not Graduate    Graduation Rate
----------  --------------------  -----------------
2011                    0.529412           0.470588
2012                    0.452381           0.547619
2013                    0.324324           0.675676
2014                    0.301887           0.698113
All                     0.391566           0.608434



#### Table 2. San Francisco 6 Year Graduation Count by High School Class 


In [28]:
sf_cross_tab(df, "graduated_4_year_degree_less_6_years", normalize=False)


graduated_4_year_degree_less_6_years  False  True  All
high_school_class
2011                                     18    16   34
2012                                     19    23   42
2013                                     12    25   37
2014                                     16    37   53
All                                      65   101  166
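
The rates in Table 1 can be recovered from the counts in Table 2 by dividing graduates by class size. A small sketch, with the counts copied by hand from the table above (the dictionary name is just for illustration):

```python
# (graduated within 6 years, class size), copied by hand from Table 2
six_year_counts = {
    "2011": (16, 34),
    "2012": (23, 42),
    "2013": (25, 37),
    "2014": (37, 53),
}

# Graduation rate per class = graduates / class size
six_year_rates = {
    cls: grads / total for cls, (grads, total) in six_year_counts.items()
}

# Overall rate across all four classes (the "All" row)
overall_rate = sum(g for g, _ in six_year_counts.values()) / sum(
    t for _, t in six_year_counts.values()
)

for cls, rate in six_year_rates.items():
    print(f"{cls}: {rate:.6f}")
print(f"All: {overall_rate:.6f}")
```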



#### Table 3. San Francisco 5 Year Graduation Rate by High School Class

To be more accurate, we can look at the 5 year grad rate, but this tells essentially the same story.

In [29]:
sf_cross_tab(df, "graduated_4_year_degree_less_5_years")


graduated_4_year_degree_less_5_years     False      True
high_school_class
2011                                  0.529412  0.470588
2012                                   0.47619   0.52381
2013                                  0.324324  0.675676
2014                                  0.301887  0.698113
All                                    0.39759   0.60241


### Statistical Test

If we run an independent two-sample t-test on the 5-year graduation outcomes, the class of 2014's rate is not significantly higher than that of the class of 2013 or 2012 at the 0.05 level. Only the comparison with the class of 2011 is significant.

#### P Value from t-test comparing 2014 -> 2013

In [20]:
population1 = (
    df[(df.site == "San Francisco") & (df.high_school_class == 2013)][
        "graduated_4_year_degree_less_5_years"
    ]
).values


population2 = (
    df[(df.site == "San Francisco") & (df.high_school_class == 2014)][
        "graduated_4_year_degree_less_5_years"
    ]
).values

In [22]:
# p value of independent t-test on populations above
sm.stats.ttest_ind(population1, population2)[1]


0.8234541444667163
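
Because graduation is a yes/no outcome, a two-proportion z-test is a common alternative to the t-test used here. It is shown only as a cross-check, not as the method used in this analysis. The sketch below uses only the standard library. The counts (25 of 37 for 2013, 37 of 53 for 2014) are an assumption derived from the Table 3 rates and the class sizes in Table 2, and `two_proportion_z_test` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal rates
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Assumed counts: 2013 = 25 of 37, 2014 = 37 of 53 (derived from the tables)
z_stat, z_p_value = two_proportion_z_test(25, 37, 37, 53)
print(f"z = {z_stat:.3f}, p = {z_p_value:.3f}")
```

On these counts the p-value should land close to the t-test result above, which points to the same conclusion.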

#### P Value from t-test comparing 2014 -> 2012 

In [23]:
population1 = (
    df[(df.site == "San Francisco") & (df.high_school_class == 2012)][
        "graduated_4_year_degree_less_5_years"
    ]
).values


population2 = (
    df[(df.site == "San Francisco") & (df.high_school_class == 2014)][
        "graduated_4_year_degree_less_5_years"
    ]
).values

In [24]:
sm.stats.ttest_ind(population1, population2)[1]


0.08361118835853203

#### P Value from t-test comparing 2014 -> 2011
Note: this value is < 0.05.

In [27]:
population1 = (
    df[(df.site == "San Francisco") & (df.high_school_class == 2011)][
        "graduated_4_year_degree_less_5_years"
    ]
).values


population2 = (
    df[(df.site == "San Francisco") & (df.high_school_class == 2014)][
        "graduated_4_year_degree_less_5_years"
    ]
).values

In [28]:
sm.stats.ttest_ind(population1, population2)[1]


0.03405678968608344

### Other Distributions

Based on the above results, I don't believe we can say that the class of 2014's graduation rate was notably higher than that of the classes immediately before it (2012 and 2013). It does, however, appear that the graduation rate has been on an upswing since 2011, with a decent jump from 2012 -> 2013 (though not a statistically significant one).

With that in mind, here are other notable differences in the high school class distributions which might indicate changes that are influencing the graduation rate.

#### 11th Grade College Eligibility GPA

This is the most notable distribution change: the class of 2014 has by far the highest 11th grade GPAs, with about 74% of that student group at 3.0 or above, compared to about 54% for the class of 2013 (the next highest).

In [33]:
sf_cross_tab(df, "gpa_bucket")

gpa_bucket         2.5 - 2.74  2.5 or less  2.75 - 2.9  3.0 - 3.49  3.5 or greater
high_school_class
2011                 0.058824          0.5    0.117647    0.117647        0.205882
2012                  0.02381     0.428571    0.095238    0.214286        0.238095
2013                 0.189189     0.135135    0.135135    0.243243        0.297297
2014                  0.09434     0.132075    0.037736    0.245283        0.490566
All                  0.090361     0.283133    0.090361    0.210843        0.325301
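
The share of each class at 3.0 or above comes from adding the two highest buckets in each row. A small sketch with the values copied by hand from the table above (the dictionary is illustrative, not project code):

```python
# Shares of the two highest GPA buckets, copied by hand from the table
top_buckets = {
    2011: {"3.0 - 3.49": 0.117647, "3.5 or greater": 0.205882},
    2012: {"3.0 - 3.49": 0.214286, "3.5 or greater": 0.238095},
    2013: {"3.0 - 3.49": 0.243243, "3.5 or greater": 0.297297},
    2014: {"3.0 - 3.49": 0.245283, "3.5 or greater": 0.490566},
}

# Share of each class with an 11th grade GPA of 3.0 or above
share_3_plus = {cls: sum(buckets.values()) for cls, buckets in top_buckets.items()}

for cls, share in share_3_plus.items():
    print(f"{cls}: {share:.1%}")
```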
