# Analysis of Variance, One-Way
## Statistics Methods II - GMAT Exam Sample Data

In class we are covering ANOVA with this small sample data set. The class uses Excel, but I want to complete all of the exercises in Python to become more familiar with it as a statistical tool.

### Importing the Tools
After reading through a lot of examples and the documentation, I found that statsmodels and scipy are the two packages with the statistical tests I need. 

pandas provides the DataFrame that holds the data in the tabular form the other packages expect.

In [1]:
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import scipy.stats as stats

### Reading the sample data for the GMAT Scores.

In [2]:
gmat = pd.read_csv('C:/Users/river/Data/Applied Statistics/GMAT.csv')
gmat.head()  # shows only the first five rows; the fitted model below reports 40 observations

Unnamed: 0,Fall,Winter,Spring,Summer
0,490,310,500,450
1,580,590,450,590
2,440,730,510,710
3,580,710,570,240
4,430,540,610,510
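The `Unnamed: 0` column in the output above is what pandas produces when the first column of a CSV has no header, which typically happens when a row index was saved along with the data. A minimal sketch of reading such a file with `index_col=0`, assuming that is the case here; the `csv_text` string is a hypothetical stand-in built from the five displayed rows, not the real file:

```python
import io

import pandas as pd

# Hypothetical stand-in for GMAT.csv: only the five rows shown above,
# with an unnamed first column holding the row index.
csv_text = """,Fall,Winter,Spring,Summer
0,490,310,500,450
1,580,590,450,590
2,440,730,510,710
3,580,710,570,240
4,430,540,610,510
"""

# index_col=0 tells pandas to use the first column as the row index
# instead of loading it as a data column named "Unnamed: 0".
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.columns.tolist())
print(sample)
```

This step is optional for the analysis, because the `melt()` call used later keeps only the four term columns.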


### Transforming the Data
The hardest task for me was transforming the data into a format that the statsmodels package could read.
The cell above shows the data in a wide layout resembling a pivot table in Excel. I first need to "unpivot" it into two columns: one with the exam scores and one with the terms (treatments).

This can be done with the melt() method of the pandas DataFrame, given the necessary arguments.

I am melting the columns 'Fall', 'Winter', 'Spring', and 'Summer' down into a new column called 'Term'. The exam scores go into a second column, which I name 'Data'.

In [3]:
gmat = gmat.melt(value_vars = ['Fall', 'Winter', 'Spring', 'Summer'], var_name = 'Term', value_name = 'Data')
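To make the reshaping concrete, here is a small sketch that applies the same `melt()` call to the five rows displayed earlier. The names `wide` and `long_df` are mine, and the frame is only the visible sample, not the full data set:

```python
import pandas as pd

# The five rows shown by gmat.head(), including the leftover index column.
wide = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2, 3, 4],
    'Fall':   [490, 580, 440, 580, 430],
    'Winter': [310, 590, 730, 710, 540],
    'Spring': [500, 450, 510, 570, 610],
    'Summer': [450, 590, 710, 240, 510],
})

# Same call as in the notebook: each term column becomes a block of rows.
long_df = wide.melt(value_vars=['Fall', 'Winter', 'Spring', 'Summer'],
                    var_name='Term', value_name='Data')

print(long_df.shape)            # 5 rows x 4 terms -> (20, 2)
print(long_df.columns.tolist())  # ['Term', 'Data']
print(long_df.head(7))
```

Because no `id_vars` are passed, columns not listed in `value_vars` (here `Unnamed: 0`) are dropped from the result, leaving one row per score.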

### Creating the ANOVA Model
statsmodels provides the ols() function (imported from statsmodels.formula.api) to build the linear model behind the ANOVA. It takes a formula and the data; in the formula, C(Term) marks Term as a categorical factor. Calling fit() estimates the model.

In [4]:
model = ols(formula = 'Data ~ C(Term)', data = gmat).fit()

### Model Summary
A quick look to see what's going on inside the model I created.

In [5]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   Data   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                 -0.076
Method:                 Least Squares   F-statistic:                   0.08713
Date:                Sun, 26 May 2019   Prob (F-statistic):              0.967
Time:                        17:28:09   Log-Likelihood:                -242.95
No. Observations:                  40   AIC:                             493.9
Df Residuals:                      36   BIC:                             500.7
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept           522.0000     35.02

### Passing the Model through the ANOVA Function
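The typ argument selects the type of sums of squares used to build the ANOVA table, not the number of factors:
1 = Type I (sequential) sums of squares,
2 = Type II sums of squares,
3 = Type III sums of squares.

With a single factor such as Term, Type I and Type II give the same test, so typ=1 is fine for this one-way ANOVA.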
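In statsmodels, `typ` refers to the type of sums of squares (Type I, II, or III). A quick sketch to check that, for a single-factor model, `typ=1` and `typ=2` report the same sum of squares and F statistic for the factor. It uses only the five rows per term displayed earlier, so the numbers will not match the full 40-observation results; `sample` and `sample_model` are illustrative names:

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Five scores per term from the displayed head(), already in long format.
sample = pd.DataFrame({
    'Term': ['Fall'] * 5 + ['Winter'] * 5 + ['Spring'] * 5 + ['Summer'] * 5,
    'Data': [490, 580, 440, 580, 430,
             310, 590, 730, 710, 540,
             500, 450, 510, 570, 610,
             450, 590, 710, 240, 510],
})

sample_model = ols('Data ~ C(Term)', data=sample).fit()

type1 = anova_lm(sample_model, typ=1)  # Type I (sequential) sums of squares
type2 = anova_lm(sample_model, typ=2)  # Type II sums of squares

# With one factor there is nothing to adjust for, so the rows should agree.
print(type1.loc['C(Term)', ['sum_sq', 'F']])
print(type2.loc['C(Term)', ['sum_sq', 'F']])
```

The distinction between the types only starts to matter when the model has more than one term and the design is unbalanced.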

In [6]:
anova_results = anova_lm(model, typ=1)

In [7]:
print(anova_results)
print('\n\nF-Critical', stats.f.ppf(0.95, dfn=3, dfd=36))

            df    sum_sq       mean_sq         F    PR(>F)
C(Term)    3.0    3207.5   1069.166667  0.087135  0.966644
Residual  36.0  441730.0  12270.277778       NaN       NaN


F-Critical 2.86626555094018
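The columns of the ANOVA table are linked by simple arithmetic, which can be reproduced by hand as a sanity check. The sketch below starts from the sums of squares and degrees of freedom printed above and recomputes the mean squares, the F statistic, the p-value, and the critical value; the variable names are mine:

```python
from scipy import stats

# Values copied from the ANOVA table above.
ss_between, df_between = 3207.5, 3     # C(Term) row
ss_within, df_within = 441730.0, 36    # Residual row

# Mean square = sum of squares / degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F statistic = between-group mean square / within-group mean square.
f_stat = ms_between / ms_within

# p-value = upper-tail probability of the F distribution at f_stat.
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value for alpha = 0.05 (the 95th percentile of F(3, 36)).
alpha = 0.05
f_crit = stats.f.ppf(1 - alpha, dfn=df_between, dfd=df_within)

print('MS between:', ms_between)
print('MS within: ', ms_within)
print('F:         ', f_stat)
print('p-value:   ', p_value)
print('F critical:', f_crit)
print('Reject H0 (critical value test)?', f_stat > f_crit)
print('Reject H0 (p-value test)?       ', p_value < alpha)
```

Both comparisons always agree, since `f_stat > f_crit` and `p_value < alpha` are two ways of stating the same condition.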


## Conclusion
The null hypothesis is that the mean scores of all treatments (terms) are equal. Based on the results returned by the statistics packages, the F statistic (0.087) is less than the F critical value (2.866), so by the critical value test we fail to reject the null hypothesis.
The p-value leads to the same decision. The confidence level we want is 95%, so alpha is 5%. The p-value in our results is 0.967, which is the probability of observing an F statistic at least this large if the null hypothesis were true. It is not the probability of committing a Type I error; that probability is set by alpha. Because 0.967 is far greater than 0.05, I fail to reject the null hypothesis based on the p-value test as well: the data give no evidence that mean GMAT scores differ between terms.