<a href="https://colab.research.google.com/github/AmpleOpportunity/nutrition-vs-academics/blob/master/nutrition_vs_academics_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
'''Students who are more likely to identify with healthier meal options and more 
accurately predict the caloric density of foods, on average, have higher 
academic performance as measured by GPA.'''

# dependencies
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.sandbox.stats.runs as sm


# read csv to dataframe
host = 'https://raw.githubusercontent.com/AmpleOpportunity/nutrition-vs-academics/master/food_coded.csv'
food_choices_master_df = pd.read_csv(host)
food_choices_master_df.info()

In [None]:
'''For the purposes of this study, the only fields used are those that contain
food self-identification choices (breakfast, coffee, drink, fries, soup) and
caloric identification of various food items (calories_chicken, calories_scone,
tortilla_calories, waffle_calories), with GPA as the continuous variable
throughout.'''

# extract relevant data points for cleaning
food_choices_extracted_df = food_choices_master_df[['GPA', 'breakfast', 'coffee', 
                                                   'drink', 'fries', 'soup', 'calories_chicken', 
                                                   'calories_scone', 
                                                   'tortilla_calories', 
                                                   'waffle_calories']]

food_choices_extracted_df.info()

'''The schema reveals some needed work before analysis. A few of the records are 
incomplete. Additionally, every column should be parsed as an integer or 
floating point number, but GPA is parsed as a string. Non-numerical GPA values 
will be converted into null values, and every record containing a null will 
then be purged from the dataframe.'''

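# coerce non-numerical GPA values to NaN, then drop incomplete records;
# assign() returns a new dataframe, avoiding a SettingWithCopyWarning
food_choices_extracted_df = food_choices_extracted_df.assign(
    GPA=pd.to_numeric(food_choices_extracted_df['GPA'], errors='coerce'))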

food_choices_cleaned_df = food_choices_extracted_df.dropna()
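
The coerce-then-drop step above can be illustrated on a small scale. The sketch below uses a hypothetical toy dataframe (`toy_df` and its values are invented for illustration and are not taken from food_coded.csv) to show how `errors='coerce'` and `dropna()` interact.

```python
import pandas as pd

# Hypothetical toy records; the strings mimic non-numerical GPA entries
toy_df = pd.DataFrame({
    'GPA': ['3.5', '2.9', 'Unknown', '3.2 bad', None],
    'coffee': [1, 2, 2, 1, 2],
})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
toy_df['GPA'] = pd.to_numeric(toy_df['GPA'], errors='coerce')
n_missing = int(toy_df['GPA'].isna().sum())

# dropna() removes every row that contains at least one NaN
toy_cleaned_df = toy_df.dropna()

print('{} records removed. Total records: {}.'.format(
    len(toy_df) - len(toy_cleaned_df), len(toy_cleaned_df)))
```

Note that `dropna()` with no arguments drops a row if any column is null, so a record with a valid GPA but a missing food response is removed as well.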

In [None]:
# verify new sample size integrity
print('{} records removed. Total records: {}.'.format(
    len(food_choices_extracted_df) - len(food_choices_cleaned_df), 
    len(food_choices_cleaned_df)))

'''The sample size is still robust enough to continue analysis.'''

In [None]:
# comparing each group for normality by extracting kurtosis and skewness

# compute kurtosis/skew for every column of food_choices_cleaned_df and collect
# the results in a separate df (DataFrame.append was removed in pandas 2.0,
# so the rows are built in a list first)
normality_rows = [[column,
                   stats.kurtosis(food_choices_cleaned_df[column]),
                   stats.skew(food_choices_cleaned_df[column])]
                  for column in food_choices_cleaned_df.columns]

normality_analysis_df = pd.DataFrame(normality_rows,
                                     columns=['name', 'kurtosis', 'skewness'])

normality_analysis_df
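
When reading the table above, it helps to know that `stats.kurtosis` reports excess (Fisher) kurtosis by default, so a normal distribution scores 0 and positive values indicate a leptokurtic shape. A minimal sketch on hypothetical toy arrays (not the survey data):

```python
import numpy as np
from scipy import stats

# Symmetric toy sample: skewness is 0 and excess kurtosis is negative
symmetric = np.array([1, 2, 3, 4, 5])
sym_skew = stats.skew(symmetric)
sym_kurt = stats.kurtosis(symmetric)

# Lopsided binary responses, as when most respondents pick option 1
lopsided = np.array([1] * 9 + [2])
lop_skew = stats.skew(lopsided)
lop_kurt = stats.kurtosis(lopsided)

# fisher=False returns Pearson kurtosis, which is the excess value plus 3
lop_kurt_pearson = stats.kurtosis(lopsided, fisher=False)

print('symmetric: skew={:.3f}, excess kurtosis={:.3f}'.format(sym_skew, sym_kurt))
print('lopsided:  skew={:.3f}, excess kurtosis={:.3f}'.format(lop_skew, lop_kurt))
```

A heavily one-sided binary column produces both a large positive skew and a large positive excess kurtosis, which is the pattern to look for in coded two-option responses.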

In [None]:
'''All skews are within acceptable ranges. breakfast and fries are leptokurtic.
Further investigation is warranted.'''

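# label each histogram so the two overlaid distributions can be told apart
plt.hist(food_choices_cleaned_df['breakfast'], alpha=.5, label='breakfast')
plt.hist(food_choices_cleaned_df['fries'], alpha=.5, label='fries')
plt.legend()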

'''There is a strong preference for the first option in both of these
responses, so they are unlikely to yield any deeper insight. Given that three
other usable food identification questions exist, testing can proceed with 
only the remaining variables; the breakfast and fries columns will be omitted 
from the analysis.'''

In [None]:
'''Is there a correlation between respondents' self-identification with
healthier/unhealthier food options and academic performance? Given that each 
self-identification response is a choice between two coded options, and the 
goal is to compare more than two unmatched groups, a Chi-Square test is 
appropriate.'''

# cross-tabulate each distinct GPA value against every combination of the
# coffee, drink, and soup responses, then execute a Chi-square test
id_contingency = pd.crosstab(index=food_choices_cleaned_df['GPA'], 
                          columns=[food_choices_cleaned_df['coffee'], 
                                   food_choices_cleaned_df['drink'], 
                                   food_choices_cleaned_df['soup']])

stat, p, dof, expected = stats.chi2_contingency(id_contingency)
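
The four values returned by `stats.chi2_contingency` can be inspected on a small table. The sketch below is an illustration on hypothetical toy data: the `gpa_band` column and its three bands are an assumption made only to keep the table small, whereas the analysis above uses each distinct GPA value as its own row.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy responses: 12 students, GPA grouped into three bands
toy_df = pd.DataFrame({
    'gpa_band': ['low'] * 4 + ['mid'] * 4 + ['high'] * 4,
    'coffee': [1, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, 1],
})

# rows are GPA bands, columns are the coded coffee responses
toy_contingency = pd.crosstab(index=toy_df['gpa_band'],
                              columns=toy_df['coffee'])

toy_stat, toy_p, toy_dof, toy_expected = stats.chi2_contingency(toy_contingency)

# degrees of freedom equal (rows - 1) * (columns - 1)
n_rows, n_cols = toy_contingency.shape
print('dof:', toy_dof, 'from shape:', (n_rows - 1) * (n_cols - 1))

# expected counts under independence, one per cell of the table
print('smallest expected count:', toy_expected.min())
```

A common rule of thumb is that the chi-square approximation is more trustworthy when expected counts are not very small (often quoted as at least 5 per cell), so the `expected` array is worth checking alongside the p-value.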

In [None]:
'''The p-value is the relevant measure of significance. The significance
threshold is usually set at 0.05, but the sample size merits reevaluation of 
this standard. Using the table below, a closer approximation of an appropriate 
threshold can be determined.'''

![picture](https://github.com/AmpleOpportunity/nutrition-vs-academics/blob/master/P-value%20based%20off%20sample%20size.png?raw=true)

In [None]:
# calculate an appropriate significance threshold for holding a strong standard
# of evidence: x holds the thresholds and y the matching sample sizes
x = np.array([.005, .003, .001, .0003])
y = np.array([30, 50, 100, 1000])
a = np.vstack([x, np.ones(len(x))]).T

'''Given that we only have four data points that span a large sample size range
with vanishingly small p-value thresholds, there is no "silver bullet" 
function (linear, quadratic, logarithmic) that is going to fit these points
perfectly. For this reason, a least-squares linear fit has been selected, 
which minimizes the sum of squared residuals.'''

# use least-squares to model regression line
m, c = np.linalg.lstsq(a, y, rcond=None)[0]

# the fitted line is y = m*x + c (sample size as a function of threshold);
# solve for x to find the threshold at this study's sample size
pval = (len(food_choices_cleaned_df) - c) / m

# plot the least-squares result
plt.plot(x, y, 'o', label='Original data', markersize = 10)
plt.plot(x, m*x + c, 'r', label='Fitted line')
plt.legend()
plt.show()

print('Given a sample size of {}, the significance threshold is approximately {} using the least-squares method.'.format(
    len(food_choices_cleaned_df), round(pval, 4)))
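
As a sanity check on the fit above, the same line can be reproduced with `np.polyfit` and the inversion verified by substituting the result back in. This is a minimal sketch; `hypothetical_n` is an illustrative sample size chosen for the example, not the study's actual record count.

```python
import numpy as np

# thresholds and sample sizes as used in the fit above
thresholds = np.array([.005, .003, .001, .0003])
sample_sizes = np.array([30, 50, 100, 1000])

# least-squares fit of sample_size = slope * threshold + intercept
design = np.vstack([thresholds, np.ones(len(thresholds))]).T
slope, intercept = np.linalg.lstsq(design, sample_sizes, rcond=None)[0]

# a degree-1 polyfit solves the same least-squares problem
poly_slope, poly_intercept = np.polyfit(thresholds, sample_sizes, 1)

# invert the line to get the threshold for an assumed sample size
hypothetical_n = 100
threshold_at_n = (hypothetical_n - intercept) / slope

# substituting the threshold back in should recover the sample size
recovered_n = slope * threshold_at_n + intercept
print(slope, intercept, threshold_at_n, recovered_n)
```

The slope is negative because larger samples are paired with stricter thresholds. A straight line is a coarse model for these four points, so the resulting threshold is best read as an approximation.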

In [None]:
# interpret the test statistic: compare it against the chi-square critical
# value at the computed significance threshold
prob = 1 - pval
critical = stats.chi2.ppf(prob, dof)
print('Stat: {} Critical: {} P: {}'.format(abs(stat), critical, p))

'''We fail to reject the null hypothesis that there is no correlation between 
respondents' self-identification with healthier/unhealthier food options and 
academic performance.'''
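
Comparing the statistic with a critical value and comparing the p-value with the threshold are two views of the same decision. The sketch below shows the equivalence with hypothetical numbers (`toy_stat`, `toy_dof`, and `toy_alpha` are illustrative values, not results from this study).

```python
from scipy import stats

# hypothetical test statistic, degrees of freedom, and threshold
toy_stat, toy_dof, toy_alpha = 12.0, 4, 0.05

# critical value: the point leaving toy_alpha probability in the upper tail
toy_critical = stats.chi2.ppf(1 - toy_alpha, toy_dof)

# p-value: upper-tail probability beyond the observed statistic
toy_p = stats.chi2.sf(toy_stat, toy_dof)

# both rules lead to the same decision
reject_by_critical = toy_stat > toy_critical
reject_by_p = toy_p < toy_alpha
print(toy_critical, toy_p, reject_by_critical, reject_by_p)
```

Because the two rules always agree, reporting either the critical value or the p-value against the chosen threshold is sufficient.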

In [None]:
'''Is there a correlation between respondents' successful identification of the
caloric content in food items and their academic performance? These responses 
are also categorical and span multiple groups, so a Chi-Square test will again 
be used for this analysis.'''

cal_contingency = pd.crosstab(index = food_choices_cleaned_df['GPA'], 
                           columns=[food_choices_cleaned_df['calories_chicken'], 
                                    food_choices_cleaned_df['calories_scone'], 
                                    food_choices_cleaned_df['tortilla_calories'], 
                                    food_choices_cleaned_df['waffle_calories']])

cal_stat, cal_p, cal_dof, cal_expected = stats.chi2_contingency(cal_contingency)

cal_critical = stats.chi2.ppf(prob, cal_dof)
print('Stat: {} Critical: {} P: {}'.format(abs(cal_stat), cal_critical, cal_p))

'''These results are of great interest. If the standard 0.05 significance level 
were adopted, then an argument could be made that the results are statistically
significant. Given our computed threshold of 0.0035, we must instead fail to 
reject the null hypothesis. The margin is narrow: the chi-squared value of 
1853.422 falls only about 3% short of the critical value of 1912.696.'''
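
The size of that shortfall follows directly from the two reported figures. A short arithmetic check, using the values quoted above (the variable names are illustrative):

```python
# figures as reported in the text above
cal_stat_reported = 1853.422
cal_critical_reported = 1912.696

# relative shortfall of the statistic from the critical value
shortfall = (cal_critical_reported - cal_stat_reported) / cal_critical_reported
print('Shortfall: {:.1%}'.format(shortfall))
```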