# ANOVA

Last week we did a deep dive into Analysis of Variance.  After that hands-on exercise we have a better understanding of the underlying process, which allows us to draw conclusions about the likelihood that various samples came from different populations.

Today we'll explore a multi-factor data set. We'll start out with some basic exploratory data analysis, then we'll apply our knowledge of t-tests and one-way ANOVA to each of the factors.  Finally we'll account for both factors with a two-way ANOVA and apply a post-hoc test to determine which levels of our factors differ significantly.

The data set contains some plant growth data.  Plant height was measured for several groups. Both watering schedule and sunlight intensity were varied in the experiment.  The goal is to see if watering frequency and sunlight intensity affect the final height of the plant.

Let's get some libraries!

In [3]:
import os
import warnings
import numpy as np
import pandas as pd
import seaborn as sn
from scipy import stats as st
import matplotlib.pyplot as plt

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

**Load the data**

Put the `plants.csv` file into your current directory and load it into a pandas DataFrame. If you forget the method, either do a search for it or use a previous studio.

In [4]:
def path_finder(file_name):
    
    # A notebook has no __file__, so resolve the file relative to the current working directory
    file_path = os.path.join(os.getcwd(), file_name)
    
    try:
        csv_read = pd.read_csv(file_path)
    
    except Exception as e:
        print('Failed to load data')
        print(f'Error: {e}')
        csv_read = None
        
    return csv_read

def data_sorter(data):
    
    # Group the rows by treatment combination: (watering schedule, sun intensity)
    grouped_data = {}
    
    for _, row in data.iterrows():
        
        key = (row['water'], row['sun'])
    
        if key not in grouped_data:
            grouped_data[key] = []
            
        grouped_data[key].append(row.to_dict())
        
    return grouped_data

def data_analyzer(data):
    
    warnings.filterwarnings('ignore', category=RuntimeWarning)
    
    analysis_results = {}
    
    for key, rows in data.items():

        df = pd.DataFrame(rows)
        
        # 'water' and 'sun' hold text labels, so only 'height' can be summarized numerically
        stats = {
            'height mean' : df['height'].mean(skipna=True),
            'height stdev' : df['height'].std(skipna=True),
            'height median' : df['height'].median(skipna=True),
            'Sample Size': len(df)
        }
        
        stats = {k: float(v) if isinstance(v, np.float64) else v for k, v in stats.items()}
        
        analysis_results[key] = stats
        
    return analysis_results

# Name of CSV file
file_name = 'plants.csv'

# Gets the data from the CSV
plants_data = path_finder(file_name)

# Sorts the data from the CSV
sorted_plants_data = data_sorter(plants_data)

# Analyzes the data that was sorted 
analyzed_plants_data = data_analyzer(sorted_plants_data)

print(plants_data, sorted_plants_data, analyzed_plants_data)

**Explore the data**

Have a look at the DataFrame.  How many columns are there?  What kind of data is in each column?  How many unique values are in the categorical columns?



In [None]:
# Show the first few rows and info about the DataFrame
print(plants_data.head())
print("\nDataFrame info:")
print(plants_data.info())

# Unique values in categorical columns
print("\nUnique values in 'water':", plants_data['water'].unique())
print("Unique values in 'sun':", plants_data['sun'].unique())

Create a boxplot which has plant height on the y-axis and watering schedule on the x-axis.  

The Seaborn library is the natural choice (look up the boxplot function), but feel free to use any library you wish. Just remember to import the associated library.

In [None]:
sn.boxplot(x='water', y='height', data=plants_data)
plt.title('Plant Height by Watering Schedule')
plt.show()

Now create a second box plot with plant height on the y-axis and sun intensity on the x-axis.  Take note of any differences.

In [None]:
sn.boxplot(x='sun', y='height', data=plants_data)
plt.title('Plant Height by Sun Intensity')
plt.show()

## Comparing daily vs weekly

First we'll look at the difference between watering schedules only.  Since there are only two levels to this factor, we can split the height data into two samples.  

First we'll isolate the "daily" data using a boolean index, then we'll grab the height values from that sample.

```
daily_idx = df["water"] == "daily"
daily_height = df[daily_idx]["height"]
```

NOTE: If 'df' doesn't work, use whichever variable you chose to load the data with.

In [None]:
daily_height = plants_data[plants_data['water'] == 'daily']['height']
weekly_height = plants_data[plants_data['water'] == 'weekly']['height']
print("Daily heights:", daily_height.values)
print("Weekly heights:", weekly_height.values)

Now, create corresponding weekly height values.

In [None]:
# done above

Get our function from a couple of weeks ago to do our two-sample t-test, and we will compare it to the built-in method from SciPy.

```
st.ttest_ind(sample_1,sample_2)
```

Perform both tests on the two different samples. For the SciPy method, remember to replace 'sample_1' and 'sample_2' with the variables you created.
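The function from the earlier studio is not reproduced in this notebook. As a stand-in, here is a minimal sketch of a pooled-variance two-sample t-test. The helper name `pooled_t_test` and the sample values are made up for illustration, and the sketch assumes equal population variances, which is also the default assumption of `st.ttest_ind`.

```python
import numpy as np
from scipy import stats as st


def pooled_t_test(sample_1, sample_2):
    """Two-sample t-test assuming equal variances (hypothetical helper)."""
    x1 = np.asarray(sample_1, dtype=float)
    x2 = np.asarray(sample_2, dtype=float)
    n1, n2 = len(x1), len(x2)

    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)

    # t statistic: difference in means divided by its standard error
    t_stat = (x1.mean() - x2.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
    p_value = 2 * st.t.sf(abs(t_stat), df=n1 + n2 - 2)
    return t_stat, p_value


# Invented heights for illustration only, not the plants.csv data
sample_1 = [10.2, 11.5, 9.8, 12.1, 10.9]
sample_2 = [8.7, 9.4, 10.1, 8.9, 9.6]

manual_t, manual_p = pooled_t_test(sample_1, sample_2)
scipy_result = st.ttest_ind(sample_1, sample_2)

print("manual:", manual_t, manual_p)
print("scipy: ", scipy_result.statistic, scipy_result.pvalue)
```

If your own function from the earlier studio uses a different variance assumption (for example Welch's unequal-variance test), its numbers will not match SciPy's default exactly.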

Just for fun, let's compare this value to a one-way ANOVA using just the two samples.

```
st.f_oneway(sample_1,sample_2)
```

Notice how the p-values are the same?  For two samples these methods are equivalent: the F statistic from the ANOVA equals the square of the t statistic from the pooled-variance t-test (the default in `st.ttest_ind`).  If we are doing the manual calculation it's a little easier to do the t-test, but with some algebra we could prove that they are calculating the same thing.
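To check the two-sample equivalence numerically, the sketch below runs both tests on the same pair of samples. The samples are randomly generated for illustration and are not the plant data.

```python
import numpy as np
from scipy import stats as st

# Synthetic samples for illustration only (not the plants.csv data)
rng = np.random.default_rng(0)
sample_1 = rng.normal(loc=10.0, scale=2.0, size=20)
sample_2 = rng.normal(loc=12.0, scale=2.0, size=20)

t_result = st.ttest_ind(sample_1, sample_2)  # pooled-variance t-test (the default)
f_result = st.f_oneway(sample_1, sample_2)   # one-way ANOVA on the same two samples

print("t squared: ", t_result.statistic ** 2)
print("F:         ", f_result.statistic)
print("p (t-test):", t_result.pvalue)
print("p (ANOVA): ", f_result.pvalue)
```

The squared t statistic and the F statistic should agree to floating-point precision, and so should the two p-values.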

So why do we have ANOVA?  As we saw last week, we will often want to look at more than two samples.  In that case, running a separate t-test on every pair of samples increases our likelihood of making at least one Type I error.  The methods are not equivalent when there are more than two samples.
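A rough calculation shows how quickly the Type I error risk grows with the number of pairwise t-tests. The sketch assumes a per-test significance level of 0.05 and treats the tests as independent. Pairwise tests on shared samples are not truly independent, so read the numbers as an approximation.

```python
from math import comb

alpha = 0.05  # per-test significance level (assumed)

for n_groups in (2, 3, 4, 5):
    n_tests = comb(n_groups, 2)  # number of pairwise comparisons
    # Chance of at least one false positive if the tests were independent
    familywise = 1 - (1 - alpha) ** n_tests
    print(f"{n_groups} groups -> {n_tests} t-tests, familywise error ~ {familywise:.3f}")
```

With three groups (the number of sunlight levels in this data set) there are three pairwise tests, and the approximate chance of at least one false positive is already about 0.14 instead of 0.05.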

## Comparing sunlight intensity

This factor has three levels, so we will have to perform an ANOVA to find our significance.  

Just like we did for daily and weekly, isolate the data for each of the three levels ("low", "med", and "high") and perform a one-way ANOVA on the height.



You can pass as many samples into `f_oneway()` as needed.

```
st.f_oneway(sample_1, sample_2, sample_3)
```
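As a sketch of the full workflow, the example below isolates the three sunlight levels with boolean indexing and passes them to `f_oneway()`. The DataFrame `demo` is a synthetic stand-in with invented values. With the real data you would use the DataFrame you loaded from `plants.csv`.

```python
import numpy as np
import pandas as pd
from scipy import stats as st

# Synthetic stand-in for the plants data; all values are invented for illustration
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "sun": np.repeat(["low", "med", "high"], 10),
    "height": np.concatenate([
        rng.normal(4.0, 1.0, 10),  # low
        rng.normal(5.0, 1.0, 10),  # med
        rng.normal(6.5, 1.0, 10),  # high
    ]),
})

# One height sample per sunlight level, selected with a boolean index
low_height = demo[demo["sun"] == "low"]["height"]
med_height = demo[demo["sun"] == "med"]["height"]
high_height = demo[demo["sun"] == "high"]["height"]

result = st.f_oneway(low_height, med_height, high_height)
print("F:", result.statistic, "p:", result.pvalue)
```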

### Basic two-way ANOVA

But what happens when we want to look at both factors?  After all, it's reasonable to think that they could both have an effect.  

In this case we need a two-way ANOVA.  We are not going to dive into the math of the two-way ANOVA, but it basically works the same as the one-way.  We are ultimately comparing within-group variance to between-group variance.

The syntax is a little weird.  Just go with it.

```
model = ols('height ~ C(water) + C(sun)', data=df).fit()
sm.stats.anova_lm(model, typ=2)
```

You can google the `anova_lm` function to read about the input.  Or if you search for "two-way ANOVA Python" you'll find a lot of articles.  ANOVA is one of the most widely used statistical techniques so it's well documented.

### Two-way ANOVA with factor interaction

We just looked at the influence of both factors on the height, but what about the "factor interaction"?  What if the intense sunlight allows the plant to use more water, or if the daily watering schedule allowed the plant to more efficiently utilize the sunlight?  This would create a factor interaction, where changing both of those variables together produces an outcome that is different from the effect of changing each one independently.

We basically use the same syntax as before, but we add a term for water-sun interaction like this:

`model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()`
`sm.stats.anova_lm(model, typ=2)`


We can see that we do not get a significant p-value for the interaction, so we cannot conclude that there is likely to be an additional effect when we change the level of both factors.

## Post-hoc

The ANOVA we just performed told us that there was likely to be a difference between our populations, but it did not tell us *which* populations were likely to be different.  For that we need to do a post-hoc test.

We will use a basic test called *Tukey's range test* which is also often called *Tukey's HSD (honestly significant difference) test*.

In this case we're going to skip the math altogether.  Just try the analysis.

`tukey = pairwise_tukeyhsd(endog=df['height'],groups=df['sun'],alpha=0.05)`       

`print(tukey)`


Now try it for the watering schedule.
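As a sketch of the call pattern for the watering factor, the example below runs Tukey's test on a synthetic two-level data set. The DataFrame `demo` and its values are invented for illustration. With the real data, pass your own DataFrame's `height` and `water` columns.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in with two watering levels; values are invented for illustration
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "water": np.repeat(["daily", "weekly"], 15),
    "height": np.concatenate([
        rng.normal(7.0, 1.0, 15),  # daily
        rng.normal(5.0, 1.0, 15),  # weekly
    ]),
})

tukey = pairwise_tukeyhsd(endog=demo["height"], groups=demo["water"], alpha=0.05)
print(tukey)
```

Because the watering factor has only two levels, the summary contains a single pairwise comparison, so here the post-hoc test adds little beyond the earlier t-test. It becomes more informative for factors with three or more levels, such as `sun`.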