**ANOVA**!

*NOTE*: Sometimes a package can't be imported because it hasn't been installed yet. Run the following in a notebook cell to install it; once installed, it can be imported whenever you need it:

%pip install scipy

**Difference between T-tests and ANOVA:**

**T-tests**: Categorical variables with TWO categories (male/female, yes/no)
**ANOVA**: Categorical variables with **THREE or MORE categories** (region: southwest, northwest, southeast, northeast)

**What this tests**: is there a difference in the MEAN of the NUMERICAL variable (or feature/column) for each CATEGORY of the CATEGORICAL variable?

**Example**: If we split BMI (numerical variable) by the four region categories (categorical variable), we test whether **at least two** *regions* differ in mean BMI.
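
To make the idea concrete, here is a minimal sketch on made-up numbers (the three lists below are invented for illustration and are not taken from the insurance data). It computes the F statistic by hand, as the variation between group means divided by the variation within groups, and compares it with what `stats.f_oneway` returns.

```python
from scipy import stats

# Hypothetical toy data: one numerical variable split into three categories
group_a = [20.0, 22.0, 24.0]
group_b = [30.0, 32.0, 34.0]
group_c = [25.0, 27.0, 29.0]
groups = [group_a, group_b, group_c]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)
k = len(groups)        # number of categories
n = len(all_values)    # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: how spread out the values are inside each group
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

# F = (between-group mean square) / (within-group mean square)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
result = stats.f_oneway(*groups)

print(f_manual, result.statistic, result.pvalue)
```

A large F means the group means are far apart relative to the spread inside each group, which is what produces a small p-value.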

This requires some critical thinking on the way we write the code. See below:

In [2]:
from scipy import stats
import statistics as stat
import pandas as pd

df = pd.read_csv('http://www.ishelp.info/data/insurance.csv')
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


For our t-test example, we wanted to separate the bmi (numerical) into two groups: male and female (categories), then save each as its own list to be compared. We plug-and-chug these two lists into the t-test function.

In [3]:
# compare charges and region
# how would we use the following code to compare regions and charges?

#malebmi = df.bmi[(df.sex=='male')]
#femalebmi = df.bmi[(df.sex=='female')]

We can do the SAME EXACT THING for ANOVA! Suppose we want to test if there is a significant difference in the charges put on insurance between the different regions.

We use brackets to tell Python to pull from the charges column ONLY those values that belong to a specific region. For example, the condition `df.region=='southwest'` keeps only the charges for the southwest region in that list.

THEN, we save each of these into new variables to make them easy to use.
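
As a small sketch of what the brackets are doing, the block below uses a four-row table invented for illustration (the name `toy` and its values are assumptions, not part of the insurance data). The condition produces one True/False value per row, and indexing with it keeps only the True rows.

```python
import pandas as pd

# Hypothetical four-row table, invented for illustration
toy = pd.DataFrame({
    'region':  ['southwest', 'southeast', 'southwest', 'northwest'],
    'charges': [100.0, 200.0, 300.0, 400.0],
})

mask = toy.region == 'southwest'   # a True/False value for every row
print(mask.tolist())               # [True, False, True, False]

southwest_toy = toy.charges[mask]  # keeps only the rows where mask is True
print(southwest_toy.tolist())      # [100.0, 300.0]
```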

In [4]:
southwest = df.charges[(df.region=='southwest')]
southeast = df.charges[(df.region=='southeast')]
northwest = df.charges[(df.region=='northwest')]
northeast = df.charges[(df.region=='northeast')]

# plug-and-chug into the ANOVA (F-test) function:

stats.f_oneway(southwest,southeast,northeast,northwest)

F_onewayResult(statistic=2.9696266935891193, pvalue=0.0308933560705201)

In [14]:
print(southeast.mean())
print(southwest.mean())
print(northeast.mean())
print(northwest.mean())

14735.411437609888
12346.937377292308
13406.384516385804
12417.57537396923


In [17]:
southwest = df.bmi[(df.region=='southwest')]
southeast = df.bmi[(df.region=='southeast')]
northwest = df.bmi[(df.region=='northwest')]
northeast = df.bmi[(df.region=='northeast')]

# plug-and-chug into the ANOVA (F-test) function:

stats.f_oneway(southwest,southeast,northeast,northwest)

F_onewayResult(statistic=39.49505720170283, pvalue=1.881838913929143e-24)

In [18]:
# Assuming df is your DataFrame and it contains 'bmi' and 'region' columns

# Calculate and print the mean BMI for each region
print("Mean BMI for Southeast:", df[df['region'] == 'southeast']['bmi'].mean())
print("Mean BMI for Southwest:", df[df['region'] == 'southwest']['bmi'].mean())
print("Mean BMI for Northeast:", df[df['region'] == 'northeast']['bmi'].mean())
print("Mean BMI for Northwest:", df[df['region'] == 'northwest']['bmi'].mean())


Mean BMI for Southeast: 33.35598901098901
Mean BMI for Southwest: 30.59661538461538
Mean BMI for Northeast: 29.173503086419753
Mean BMI for Northwest: 29.199784615384615


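The first test above (charges by region, In [4]) gives you TWO statistics: F-value = 2.9696... and p-value = 0.03, which is less than 0.05.

**MEANING**: there is statistically significant evidence in this dataset (at the 0.05 level) that *at least two regions differ* in mean insurance charges. The BMI test (In [17]) gives much stronger evidence: F-value = 39.495... and p-value = 1.88e-24.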

Your book gives some nice ways of pulling these two statistics and representing them more professionally.
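
One possible way to do this is sketched below; it is not necessarily the book's approach. The result of `stats.f_oneway` can be unpacked into two variables and printed with rounding. The small `toy` DataFrame is invented so the block runs on its own, and the 0.05 cutoff is the conventional choice assumed here; swap in the `df` loaded earlier to use the real data.

```python
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the insurance data so this block runs on its own
toy = pd.DataFrame({
    'region':  ['southwest', 'southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
    'charges': [1000.0, 1200.0, 5000.0, 5400.0, 3000.0, 3100.0],
})

# One Series of charges per region
samples = [toy.charges[toy.region == r] for r in toy.region.unique()]

# The result unpacks into (F statistic, p-value)
f_value, p_value = stats.f_oneway(*samples)

alpha = 0.05  # significance level (an assumption; 0.05 is the usual convention)
print(f"F = {f_value:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("At least two group means differ at the 0.05 level.")
else:
    print("No significant difference between group means at the 0.05 level.")
```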

*For-Loop Option:* The following is a for-loop you can use as well for the same thing. In fact, all you need to do is change the values for col (categorical variable) and label (numerical variable)! Try it out!

*SUGGESTION*: break down this code piece by piece and see what is happening in each step of the loop. Think back to your Python exercises. Remove the `#` in front of the print statement (uncomment it) to see what is happening in each iteration of the loop.

In [19]:
col = 'region'
label = 'charges'

groups = df[col].unique()
df_grouped = df.groupby(col)

group_labels = []                      # Step 3. Create an empty list that will hold one Series of label values per category
for g in groups:                       # Step 4. Loop through the unique category values (the four regions in this case)
    g_list = df_grouped.get_group(g)     # Step 5. Use the get_group() function to return a DataFrame containing only the rows for this category
    # print(g_list)
    group_labels.append(g_list[label])   # Keep only the label column (charges) for this category

oneway = stats.f_oneway(*group_labels)
oneway

F_onewayResult(statistic=2.96962669358912, pvalue=0.0308933560705201)
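
The same loop can be condensed into a small reusable function, sketched below. The function name `anova_for` and the `toy` table are hypothetical, introduced only for illustration; with the real data you would call it as `anova_for(df, 'region', 'charges')`.

```python
import pandas as pd
from scipy import stats

def anova_for(data, col, label):
    """Run a one-way ANOVA of the numerical column `label` across the categories in `col`."""
    # groupby yields (category value, sub-DataFrame) pairs; keep just the label column
    samples = [group[label] for _, group in data.groupby(col)]
    return stats.f_oneway(*samples)

# Hypothetical toy table, invented so the block runs on its own
toy = pd.DataFrame({
    'region': ['a', 'a', 'b', 'b', 'c', 'c'],
    'bmi':    [22.0, 24.0, 30.0, 32.0, 26.0, 28.0],
})

result = anova_for(toy, 'region', 'bmi')
print(result.statistic, result.pvalue)
```

Passing `col` and `label` as arguments means the same function works for any categorical/numerical pair without editing the loop.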

*PRACTICE*: Try this same exercise with BMI and Age. IMPORTANT: How do you interpret each? Look at the p-values.

**WHY ARE ANOVA AND T-TESTS IMPORTANT**? We could look at graphs to see a difference between the categories, like we did for our t-tests. We could also calculate means and variances and put them in a chart to compare. However, these are *subjective* (they depend too much on people's opinions and interpretations). We want some **SOLID EVIDENCE**, such as p-values from t-tests and ANOVA.