<a href="https://colab.research.google.com/github/AmpleOpportunity/nutrition-vs-academics/blob/master/nutrition_vs_academics_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Correlations Between Nutritional Identification and Grade Point Average Among College Students

## Problem Statement
Lifestyle choices correlate with health outcomes; innumerable studies on smoking risks, the effects of obesity, and other comorbidities document this. With childhood obesity a growing concern, there is rising interest in healthy school lunches and nutritional reform for young adults.

Nutrition is also widely studied, but many examinations focus on outcomes: primarily a subject’s actual diet, either baselined from a point in time or compared against another performance variable. This examination focuses on food identification, not consumption, among young adults. Specifically, it investigates whether students with higher academic performance also possess higher nutritional awareness.

## Hypothesis
Students who are more likely to identify with healthier meal options and more accurately predict the caloric density of foods, on average, have higher academic performance as measured by GPA.

## Interest in Study
As a person who suffered from childhood obesity and had to develop a healthy relationship with food as an adult, I consider nutrition foundational to my life. Through many years of trying and failing to adopt a healthy lifestyle, success did not arrive until I realized that health is a mindset rather than a set of actions; that it is not *what* I eat, but *how* I eat from a perceptual and emotional standpoint. This research focuses on that selfsame perception of food items.

## Methodology 
Using the 2017 survey conducted by BoraPajo on [Kaggle](https://www.kaggle.com/borapajo/food-choices), perform the following analyses:
1. Using GPA as the categorical variable, look for a correlation between respondents’ identification with healthier/unhealthier food options (oatmeal vs. donuts for breakfast, frappuccino vs. espresso for coffee, orange juice vs. soda for beverage, McDonald's french fries vs. home fries for potatoes, and vegetable vs. starchy for soup) and academic performance. Assuming the data is normally distributed, a Chi-squared Test would be appropriate to examine any statistically significant variability among these groups.

1. Using GPA as the categorical variable, look for a correlation between correct calorie identification of given foods (chicken piadina, scone, burrito, and waffle potato sandwich) and academic performance. Assuming a normal distribution, a Chi-squared Test would again be appropriate.

1. Using a Pearson correlation, examine the relationship between the following:
  - GPA and respondents’ self-perception of their diet (healthy, unhealthy, repetitive, unclear),
  - GPA and respondents’ body weight self-identification (slim, very fit, just right, slightly overweight, overweight, does not consider weight self-identification),
  - GPA and respondents’ likelihood to check nutritional values frequently (never, on certain products only, very rarely, on most products, on everything).

1. Using a Chi-squared Test, analyze GPA versus the previous self-perception questions as a group.
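
The Chi-squared analyses in the list above can be sketched on made-up data. This is a minimal illustration only, assuming GPA is made categorical by cutting it into bands; the band edges, the `demo_` names, and the randomly generated responses are hypothetical and are not taken from the survey.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data; NOT the survey responses.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'GPA': rng.uniform(2.0, 4.0, size=200).round(2),
    'coffee': rng.integers(1, 3, size=200),  # coded 1 or 2
})

# Hypothetical binning: treat GPA as categorical by cutting it into bands.
demo_df['gpa_band'] = pd.cut(demo_df['GPA'], bins=[2.0, 3.0, 3.5, 4.0],
                             include_lowest=True)

# Contingency table of GPA band vs. coded response.
demo_table = pd.crosstab(demo_df['gpa_band'], demo_df['coffee'])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the expected counts under independence.
demo_stat, demo_p, demo_dof, demo_expected = stats.chi2_contingency(demo_table)

print(demo_table)
print('stat={:.3f}, p={:.3f}, dof={}'.format(demo_stat, demo_p, demo_dof))
```

With three bands and two response codes, the table has (3 - 1) x (2 - 1) = 2 degrees of freedom. Because the responses here are generated independently of GPA, any apparent association is due to chance.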

## Intended Audience
Educational institutions could use a positive result to place greater emphasis on nutritional mindfulness as part of increasing academic performance. The current focus on healthier school lunches, while positive, is largely compulsory, whereas a confirmation of the hypothesis could be used to emphasize the willful selection of higher-quality food items.

Providers of educational instruction and coaching have the potential to use the results of the experiment to prioritize nutritional education as part of coaching study habits. Rather than focusing on study habits such as rote vs. mnemonic memorization, a more holistic approach may be adopted that students can apply more broadly.

Producers of health-conscious food products could market their offerings to students, and to those who provide for students, from the position that their product supports better educational outcomes when combined with a more mindful approach to diet and nutrition.

In [None]:
'''Students who are more likely to identify with healthier meal options and more 
accurately predict the caloric density of foods, on average, have higher 
academic performance as measured by GPA.'''

# dependencies
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.sandbox.stats.runs as sm
import seaborn as sns


# read csv to dataframe
host = 'https://raw.githubusercontent.com/AmpleOpportunity/nutrition-vs-academics/master/food_coded.csv'
food_choices_master_df = pd.read_csv(host)
food_choices_master_df.info()

In [None]:
'''For the purposes of this study, the only fields that will be used are those
that contain food self-identification choices (breakfast, coffee, drink, 
fries, soup) as well as caloric identification of various food items 
(calories_chicken, calories_scone, tortilla_calories, waffle_calories) with GPA 
as the continuous variable throughout.'''

# extract relevant data points for cleaning
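# .copy() makes this an independent frame so the GPA conversion below does not
# trigger pandas' SettingWithCopyWarning
food_choices_extracted_df = food_choices_master_df[['GPA', 'breakfast', 'coffee', 
                                                   'drink', 'fries', 'soup', 'calories_chicken', 
                                                   'calories_scone', 
                                                   'tortilla_calories', 
                                                   'waffle_calories']].copy()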

food_choices_extracted_df.info()

'''The schema reveals some needed work before analysis. A few of the records are 
incomplete. Additionally, all data types should be interpreted as an integer or 
floating point number but GPA is interpreted as a string. All non-numerical 
records will be converted into null values for that column and then purged from 
the dataframe.'''

# drop records where missing/non-numerical values exist
food_choices_extracted_df['GPA'] = pd.to_numeric(food_choices_extracted_df['GPA'], 
                                               errors='coerce')

food_choices_cleaned_df = food_choices_extracted_df.dropna()

In [None]:
# verify new sample size integrity
print('{} records removed. Total records: {}.'.format(
    len(food_choices_extracted_df) - len(food_choices_cleaned_df), 
    len(food_choices_cleaned_df)))

'''The sample size is still robust enough to continue analysis.'''

In [None]:
# comparing each group for normality by extracting kurtosis and skewness

# build one row per column holding its kurtosis and skewness
# (DataFrame.append was removed in pandas 2.0, so rows are collected first)
normality_analysis_df = pd.DataFrame(
    [{'name': column,
      'kurtosis': stats.kurtosis(food_choices_cleaned_df[column]),
      'skewness': stats.skew(food_choices_cleaned_df[column])}
     for column in food_choices_cleaned_df.columns])

normality_analysis_df
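
As an aid to reading the table, note that `scipy.stats.kurtosis` reports excess (Fisher) kurtosis by default, so a normal sample scores near 0 and positive values indicate a leptokurtic distribution. The sketch below uses synthetic samples, not survey data; the 90/10 split is an assumption chosen only to show how a lopsided two-option response behaves.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Reference sample drawn from a normal distribution.
normal_sample = rng.normal(size=10_000)

# Hypothetical two-option response where 90% of answers are option 1.
lopsided_sample = rng.choice([1, 2], size=10_000, p=[0.9, 0.1])

for label, sample in [('normal', normal_sample),
                      ('lopsided', lopsided_sample)]:
    print(label,
          'kurtosis={:.2f}'.format(stats.kurtosis(sample)),
          'skewness={:.2f}'.format(stats.skew(sample)))
```

A binary response that is dominated by one option shows both high skewness and high excess kurtosis, which is why such columns stand out in a normality screen.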

In [None]:
'''All skews are within acceptable ranges. breakfast and fries are leptokurtic.
Further investigation is warranted.'''

plt.hist(food_choices_cleaned_df['breakfast'], alpha = .5)
plt.hist(food_choices_cleaned_df['fries'], alpha = .5)

'''There is an extensive preference for the first option in both of these
responses and they are unlikely to yield any deeper insight. Given that three
other operable food identification questions exist, testing can proceed with 
only the remaining variables; the breakfast and fries columns will be omitted 
from the analysis.'''

In [None]:
'''Is there a correlation between respondents' self-identification with
healthier/unhealthier food options and academic performance? Given that 
self-identification food responses are between two coded options, and the 
goal is to compare more than two unmatched groups, a Chi-Square test is 
appropriate.'''

# execute a Chi-square test for all values
id_contingency = pd.crosstab(index=food_choices_cleaned_df['GPA'], 
                          columns=[food_choices_cleaned_df['coffee'], 
                                   food_choices_cleaned_df['drink'], 
                                   food_choices_cleaned_df['soup']])

stat, p, dof, expected = stats.chi2_contingency(id_contingency)

In [None]:
'''The p-value is the relevant measure of significance. The threshold is 
usually set at 0.05, but the sample size merits reevaluating this standard. 
Using the table below, provided by the author's mentor, a more appropriate 
significance threshold can be approximated.'''

![picture](https://github.com/AmpleOpportunity/nutrition-vs-academics/blob/master/P-value%20based%20off%20sample%20size.png?raw=true)

In [None]:
# calculate an appropriate p-value for holding a strong standard for evidence
x = np.array([.005, .003, .001, .0003])
y = np.array([30, 50, 100, 1000])
a = np.vstack([x, np.ones(len(x))]).T

'''Given that we only have four data points that span a large sample size range
with fleetingly small p-value thresholds, there is no "silver bullet" 
function (linear, quadratic, logarithmic) that's going to fit these points
perfectly. For this reason, a least-squares approach has been selected to 
fit with the intention of minimizing residuals.'''

# use least-squares to model regression line
m, c = np.linalg.lstsq(a, y, rcond=None)[0]

# using the slope-intercept form of a line, find the p-value at the sample size
pval = (len(food_choices_cleaned_df) - c) / m

# plot the least-squares result
plt.plot(x, y, 'o', label='Original data', markersize = 10)
plt.plot(x, m*x + c, 'r', label='Fitted line')
plt.legend()
plt.show()

print('Given a sample size of {}, the p-value should be approximated as {} using the least-squares method.'.format(
    len(food_choices_cleaned_df), round(pval, 4)))
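
The fit-and-invert step can be cross-checked in isolation. The sketch below reuses the four table values from the cell above and assumes `np.polyfit` with `deg=1` as an equivalent way to get the same line; `hypothetical_n` is an illustrative sample size, not the study's.

```python
import numpy as np

# Threshold table values as used in the fitting cell.
p_thresholds = np.array([.005, .003, .001, .0003])
sample_sizes = np.array([30, 50, 100, 1000])

# A degree-1 polyfit solves the same least-squares problem as
# np.linalg.lstsq on the [x, 1] design matrix.
slope, intercept = np.polyfit(p_thresholds, sample_sizes, deg=1)

design = np.vstack([p_thresholds, np.ones(len(p_thresholds))]).T
m_check, c_check = np.linalg.lstsq(design, sample_sizes, rcond=None)[0]

# Invert n = slope * p + intercept to recover p for a given n.
hypothetical_n = 100  # illustrative only
p_at_n = (hypothetical_n - intercept) / slope

print('slope={:.1f}, intercept={:.1f}'.format(slope, intercept))
print('threshold at n={}: {:.4f}'.format(hypothetical_n, p_at_n))
```

The slope is negative because larger samples are paired with stricter thresholds, so the inversion returns a smaller p-value threshold as the sample size grows.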

In [None]:
# interpret the test statistic
prob = 1 - pval
critical = stats.chi2.ppf(prob, dof)
print('Stat: ' + str(abs(stat)),' Critical: ' + str(critical), 'P: ' + str(p))

'''We fail to reject the null hypothesis that there is no correlation between 
respondents' self-identification with healthier/unhealthier food options and 
academic performance.'''
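
Comparing the statistic against a critical value and comparing the p-value against the threshold are two views of the same decision. The sketch below shows that equivalence with hypothetical numbers; `demo_stat`, `demo_dof`, and `demo_alpha` are illustrative and are not results from this study.

```python
from scipy import stats

# Hypothetical inputs for illustration only.
demo_stat, demo_dof, demo_alpha = 12.5, 6, 0.0035

# Critical value: the statistic that leaves demo_alpha in the upper tail.
demo_critical = stats.chi2.ppf(1 - demo_alpha, demo_dof)

# Upper-tail p-value of the observed statistic (survival function).
demo_p = stats.chi2.sf(demo_stat, demo_dof)

# Both rules lead to the same decision.
reject_by_critical = demo_stat > demo_critical
reject_by_p = demo_p < demo_alpha

print('critical={:.3f}, p={:.4f}'.format(demo_critical, demo_p))
print('reject (critical value):', reject_by_critical)
print('reject (p-value):', reject_by_p)
```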

In [None]:
# create boxplot to represent data

# convert coded responses into readable values
replacements_for_one = [{'coffee': 1, 'drink': 1, 'soup': 1,}, 
                        {'coffee': 'Frappuccino', 'drink': 'Soda', 'soup': 'Creamy'}]
replacements_for_two = [{'coffee': 2, 'drink': 2, 'soup': 2}, 
                        {'coffee': 'Espresso', 'drink': 'Orange Juice', 'soup': 'Vegetable'}]

boxplot_df = food_choices_cleaned_df
boxplot_df = boxplot_df.replace(replacements_for_one[0], replacements_for_one[1])
boxplot_df = boxplot_df.replace(replacements_for_two[0], replacements_for_two[1])

# generate boxplots
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(14,4))
f.suptitle('GPA vs. Identification with Food Items', fontsize=14)
bplot = sns.boxplot(x='coffee', y='GPA', data=boxplot_df, ax=ax1)
bplot = sns.boxplot(x='drink', y='GPA', data=boxplot_df, ax=ax2)
bplot = sns.boxplot(x='soup', y='GPA', data=boxplot_df, ax=ax3)

# write to file for use in presentation
fig = bplot.get_figure()
fig.savefig('id_axis_boxplot.png', dpi=400)

In [None]:
'''Is there a correlation between respondents' successful identification of the
caloric content in food items versus their academic performance? Responses are
also structured categorically and across multiple samples so a Chi-Square will
again be used for this analysis.'''

cal_contingency = pd.crosstab(index = food_choices_cleaned_df['GPA'], 
                           columns=[food_choices_cleaned_df['calories_chicken'], 
                                    food_choices_cleaned_df['calories_scone'], 
                                    food_choices_cleaned_df['tortilla_calories'], 
                                    food_choices_cleaned_df['waffle_calories']])

cal_stat, cal_p, cal_dof, cal_expected = stats.chi2_contingency(cal_contingency)

cal_critical = stats.chi2.ppf(prob, cal_dof)
print('Stat: ' + str(abs(cal_stat)),' Critical: ' + str(cal_critical), 'P: ' + str(cal_p))

'''These results are of great interest. If the standard 0.05 p-value assumption 
were adopted, then an argument could be made that the results are statistically
significant. Given our computed p-value of 0.0035, we must instead fail to 
reject the null hypothesis. This marginal deviation is further validated by our 
chi-squared value vs. our critical value with 1853.422 only being about 3% shy 
of significance at a critical value of 1912.696.'''

In [None]:
# create a boxplot to represent data
f, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(19,4))
f.suptitle('GPA vs. Caloric Estimation', fontsize=14)
bplot0 = sns.boxplot(x='calories_chicken', y='GPA', data=boxplot_df, ax=ax1)
bplot1 = sns.boxplot(x='calories_scone', y='GPA', data=boxplot_df, ax=ax2)
bplot2 = sns.boxplot(x='tortilla_calories', y='GPA', data=boxplot_df, ax=ax3)
bplot3 = sns.boxplot(x='waffle_calories', y='GPA', data=boxplot_df, ax=ax4)

# iterate through plots and assign them all a default color
plotlist = [bplot0, bplot1, bplot2, bplot3]
for plot in plotlist:
  for box in plot.artists:
    box.set_facecolor(sns.xkcd_rgb['windows blue'])

# highlight correct answers
bplot0.artists[3].set_facecolor(sns.xkcd_rgb['faded green'])
bplot1.artists[1].set_facecolor(sns.xkcd_rgb['faded green'])
bplot2.artists[2].set_facecolor(sns.xkcd_rgb['faded green'])
bplot3.artists[1].set_facecolor(sns.xkcd_rgb['faded green'])

# write to file for use in presentation
f.savefig('calorie_estimation_boxplot.png', dpi=400)

## Further Study
The major question of this investigation has been addressed: we fail to reject the null hypothesis of no association. The data do not show that students who are more likely to identify with healthier meal options and more accurately predict the caloric density of foods have, on average, higher academic performance as measured by GPA.

The results of the last experiment yielded insights that, while not statistically significant within the parameters defined, exhibited a weak correlation that merits additional investigation.

This writing has focused on the subjects' evaluation of food items but would also benefit from analyzing subjects' identification of *themselves* with regard to nutrition habits. The study has been extended to investigate a second question: Is there a correlation between respondents' self-perception of their own diet and nutritional habits and their academic performance?

This additional hypothesis will be tested by evaluating subjects' current assessment of their diet (healthy/balanced/moderated, unhealthy/cheap/too much/random, the same thing over and over, unclear), perception of their body weight (slim, very fit, just right, slightly overweight, overweight, does not regard body in terms of weight), and the frequency with which they check nutritional values on food items (never, on certain products only, very rarely, on most products, on everything).

In [None]:
# create a new dataframe with isolated responses
# .copy() avoids pandas' SettingWithCopyWarning on the GPA conversion below
self_assessment_df = food_choices_master_df[['GPA', 'diet_current_coded', 
                                             'self_perception_weight', 
                                             'nutritional_check']].copy()

# drop records where missing/non-numerical values exist
self_assessment_df['GPA'] = pd.to_numeric(self_assessment_df['GPA'], 
                                               errors='coerce')

self_assessment_df = self_assessment_df.dropna()


self_assessment_df.info()
self_assessment_df.head()

In [None]:
# calculate a new p-value from the new sample size using the least-squares
# results from before
pval_for_self_assessment = (len(self_assessment_df) - c) / m

print(pval_for_self_assessment)

In [None]:
# assess correlation between GPA and self-identification using a Pearson Correlation
print(stats.pearsonr(self_assessment_df['GPA'], self_assessment_df['diet_current_coded']))
print(stats.pearsonr(self_assessment_df['GPA'], self_assessment_df['self_perception_weight']))
print(stats.pearsonr(self_assessment_df['GPA'], self_assessment_df['nutritional_check']))

'''We fail to reject the null hypothesis based upon the association of these
values independently. A comparison of all values against GPA is also of interest.'''

# assess correlation between GPA and self-identification using Chi-squared Test
self_assessment_contingency = pd.crosstab(index = self_assessment_df['GPA'], 
                           columns=[self_assessment_df['diet_current_coded'], 
                                    self_assessment_df['self_perception_weight'], 
                                    self_assessment_df['nutritional_check']])

sa_stat, sa_p, sa_dof, sa_expected = stats.chi2_contingency(self_assessment_contingency)

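# use the threshold recomputed for this sample size rather than the earlier one
sa_critical = stats.chi2.ppf(1 - pval_for_self_assessment, sa_dof)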
print('Stat: ' + str(abs(sa_stat)),' Critical: ' + str(sa_critical), 'P: ' + str(sa_p))

'''We once again fail to reject the null.'''

## Findings
While poor dietary habits often precede health complications, their effect on academic performance, as measured by GPA, could not be demonstrated in this examination. No correlation was found between GPA and subjects' self-identification with food items, correct identification of the caloric content of food items, or self-assessment of body and nutritional habits.

While there is sometimes a reluctance to publish a null result, there is often value in understanding what something isn't. This study may still be useful to educators and providers of academic services, if only as a reference for ruling out areas that add little value to their efforts.

## Limitations
1. The computed p-value threshold was troublesome, both in selecting a method of calculation and in comparing it against the caloric identification results. The least-squares fit was computed linearly, whereas the actual data clearly illustrate a polynomial relationship between the points. This area would benefit from further exploration with additional data sets.
1. Even without rejecting the null hypothesis, caloric identification must be scrutinized by the maxim, "correlation does not imply causation." Healthy dietary habits may be a function of information, of which higher-performing subjects might have a more robust understanding.
1. In the final assessment, the data values are coded by the survey author whose sorting methodologies are not clearly stated. This breakdown introduces opportunities for bias among the coded samples as they likely come as a product of the author's interpretation of responses.
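
The first limitation above can be made concrete by measuring how well a straight line describes the four threshold points. The sketch below is one rough way to do that, assuming the same table values used in the analysis and using the coefficient of determination as the yardstick; it is a diagnostic illustration, not a replacement for the method used in the study.

```python
import numpy as np

# Threshold table values as used in the analysis.
p_thresholds = np.array([.005, .003, .001, .0003])
sample_sizes = np.array([30, 50, 100, 1000])

# Straight-line least-squares fit of sample size on p-value threshold.
slope, intercept = np.polyfit(p_thresholds, sample_sizes, deg=1)
fitted = slope * p_thresholds + intercept
residuals = sample_sizes - fitted

# Coefficient of determination: share of variance the line explains.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((sample_sizes - sample_sizes.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print('residuals:', residuals.round(1))
print('R^2: {:.2f}'.format(r_squared))
```

A low value here means the line leaves much of the variation between the points unexplained, which is consistent with the concern that the relationship is not linear.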