* **One-way ANOVA:** Compares the means of one continuous dependent variable based on three or more groups of one categorical variable.
* **Two-way ANOVA:** Compares the means of one continuous dependent variable across the groups defined by two categorical variables.
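To make the one-way case concrete, here is a minimal sketch of the F-statistic computed by hand with only the standard library. The group labels and values are made up for illustration and are not taken from the diamonds data. In practice you would likely use a statistics library rather than this manual calculation.

```python
from statistics import mean

# Made-up values for three hypothetical groups (illustration only)
groups = {
    "A": [10.0, 12.0, 11.0],
    "B": [14.0, 15.0, 16.0],
    "C": [20.0, 19.0, 21.0],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = mean(all_values)
k = len(groups)        # number of groups
n = len(all_values)    # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(
    len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
)

# Within-group sum of squares: how far each observation is from its own group mean
ss_within = sum(
    (v - mean(values)) ** 2 for values in groups.values() for v in values
)

# F = mean square between / mean square within
f_statistic = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f_statistic)
```

A large F-statistic suggests that the variation between group means is large relative to the variation within groups.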


In [1]:
# Importing necessary packages for data analysis 
import pandas as pd 
import seaborn as sns

In [2]:
# Load the diamonds dataset from seaborn
diamonds_data = sns.load_dataset("diamonds", cache=False)

# display the first 10 items
diamonds_data.head(10)

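| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| 5 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |
| 6 | 0.24 | Very Good | I | VVS1 | 62.3 | 57.0 | 336 | 3.95 | 3.98 | 2.47 |
| 7 | 0.26 | Very Good | H | SI1 | 61.9 | 55.0 | 337 | 4.07 | 4.11 | 2.53 |
| 8 | 0.22 | Fair | E | VS2 | 65.1 | 61.0 | 337 | 3.87 | 3.78 | 2.49 |
| 9 | 0.23 | Very Good | H | VS1 | 59.4 | 61.0 | 338 | 4.0 | 4.05 | 2.39 |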


Our main focus is on one-way and two-way ANOVA, so our dataset needs one continuous variable and up to two categorical variables.

**Note:** In the workplace, you will always start with a business problem and the data, and then determine the best models or tests to run on the data. You will *never* work in the reverse. For educational purposes, our goal is to teach you about ANOVA in this notebook and the accompanying resources.

In [3]:
# Check how many diamonds there are in each color grade
# using the .value_counts() method on the color column,
# which returns a Series containing counts of unique values.
# Other useful parameters:
#   ascending: bool, default False (sort counts in ascending order)
#   bins: group numeric values into half-open bins instead of counting unique values
#   dropna: bool, default True (don't include counts of NaN)

diamonds_data["color"].value_counts()

color
G    11292
E     9797
F     9542
H     8304
D     6775
I     5422
J     2808
Name: count, dtype: int64
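The `ascending`, `bins`, and `dropna` parameters mentioned in the comments can be sketched on small made-up Series. The values below are hypothetical and chosen only to show how each parameter changes the result.

```python
import pandas as pd

# Made-up color grades with one missing value (illustration only)
grades = pd.Series(["G", "E", "G", None, "E", "G"])

# Default: missing values are dropped, counts sorted from highest to lowest
default_counts = grades.value_counts()

# dropna=False: the missing value gets its own row in the result
with_missing = grades.value_counts(dropna=False)

# ascending=True: counts sorted from lowest to highest
ascending_counts = grades.value_counts(ascending=True)

# bins only applies to numeric data: group values into 2 half-open intervals
prices = pd.Series([326, 327, 334, 900, 950])
binned = prices.value_counts(bins=2)

print(default_counts)
print(with_missing)
print(ascending_counts)
print(binned)
```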

In [4]:
# Subset for colorless and near-colorless diamonds (color grades G and J are excluded)
colorless = diamonds_data[diamonds_data["color"].isin(["E", "F", "H", "D", "I"])]

# Selecting only color and price columns, and reset index
colorless = colorless[['color','price']].reset_index(drop=True)

**Note:** We took a subset of colorless and near-colorless diamonds. We excluded G color grade diamonds because there were many more of them, and we excluded J color grade diamonds because there were significantly fewer of them. In a workplace setting, you would typically go through a more thoughtful process of subsetting. The goal of this notebook is to focus on ANOVA, not data cleaning or variable selection.

In [5]:
# Remove the dropped categories of diamond color.
# Filtering rows does not remove G and J from the column's list of categories,
# so we remove them explicitly (the color column must be a categorical dtype).
colorless["color"] = colorless["color"].cat.remove_categories(["G", "J"])

# Check that the dropped categories have been removed
colorless['color'].values

['E', 'E', 'E', 'I', 'I', ..., 'D', 'D', 'D', 'H', 'D']
Length: 39840
Categories (5, object): ['D', 'E', 'F', 'H', 'I']
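The reason this step is needed can be sketched on a small made-up DataFrame: filtering rows of a categorical column keeps the unused categories until they are removed explicitly. The `toy` data below is hypothetical. The sketch uses `cat.remove_unused_categories()`, an alternative to naming the categories by hand, which is a swapped-in technique and not the call used above.

```python
import pandas as pd

# Made-up data with a categorical color column (illustration only)
toy = pd.DataFrame({
    "color": pd.Categorical(
        ["D", "E", "G", "J", "E"], categories=["D", "E", "G", "J"]
    ),
    "price": [500, 450, 400, 300, 470],
})

# Keep only the D and E rows
subset = toy[toy["color"].isin(["D", "E"])].reset_index(drop=True)

# The filtered column still lists G and J as categories
categories_before = list(subset["color"].cat.categories)

# Drop every category that no longer appears in the data
subset["color"] = subset["color"].cat.remove_unused_categories()
categories_after = list(subset["color"].cat.categories)

print(categories_before)
print(categories_after)
```

Leftover empty categories can show up as empty groups in later summaries or plots, which is why they are worth removing before modeling.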

In [None]:
# Importing the math package