#  ANOVA

In statistics, **Analysis of Variance (ANOVA)** is used to analyze the differences among group means. The difference between a t-test and ANOVA is that the former is used to compare two groups, whereas the latter is used to compare three or more groups. [Read more about the difference between t-test and ANOVA](http://b.link/anova24).

From the ANOVA test, you receive two numbers. The first number is called the **F-value**: if it exceeds a critical value, your null-hypothesis can be rejected. The critical F-value varies according to the number of total subjects and the number of subject groups in your experiment. In [this table](http://b.link/eda14) you can find the critical values of the F distribution. **If you are confused by the massive F-distribution table, don't worry. Skip the F-value for now and study it at a later time. In this challenge you only need to look at the p-value.**

The p-value is another number yielded by ANOVA which already takes the number of total subjects and the number of experiment groups into consideration. **Typically if your p-value is less than 0.05, you can declare the null-hypothesis is rejected.**
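As an alternative to looking up the F table, the critical value and the p-value can be computed from the F distribution directly. The following is only a sketch: the group count, subject count, and observed F-value are made-up numbers for illustration and do not come from the pokemon data.

```python
from scipy import stats

# Hypothetical experiment: 3 groups with 30 subjects in total.
n_groups = 3
n_subjects = 30
alpha = 0.05

# Degrees of freedom of the F distribution under the null hypothesis.
df_between = n_groups - 1          # numerator
df_within = n_subjects - n_groups  # denominator

# Critical F-value: the null hypothesis is rejected when the observed F exceeds it.
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

# The p-value of an observed F-value is the upper-tail probability beyond it.
f_observed = 4.2  # made-up value for illustration
p_value = stats.f.sf(f_observed, df_between, df_within)

print(f"critical F = {f_critical:.3f}, p-value = {p_value:.4f}")
```

Comparing `f_observed` with `f_critical` and comparing `p_value` with `alpha` always lead to the same decision, which is why looking at the p-value alone is enough.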

In this challenge, we want to understand whether there are significant differences among various types of pokemons' `Total` value, i.e. Grass vs Poison vs Fire vs Dragon... There are many types of pokemons which makes it a perfect use case for ANOVA. 

In [59]:
# Import libraries
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

In [60]:
# Read the text file
df = pd.read_csv('pokemon.txt')

# Write the dataframe to a CSV file
df.to_csv('pokemon.csv', index=False)
# https://www.kaggle.com/datasets/abcsds/pokemon #?

In [61]:
# Load the data:
pokemon = pd.read_csv('pokemon.csv')
pokemon = pokemon.drop(columns='#')
pokemon

Unnamed: 0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


In [50]:
# pokemon.to_excel('pokemon.xlsx', index=False)

**To achieve our goal, we use three steps:**

1. **Extract the unique values of the pokemon types.**

1. **Select dataframes for each unique pokemon type.**

1. **Conduct ANOVA analysis across the pokemon types.**

#### First let's obtain the unique values of the pokemon types. These values should be extracted from Type 1 and Type 2 aggregated. Assign the unique values to a variable called `unique_types`.

*Hint: the correct number of unique types is 19 including `NaN`. You can disregard `NaN` in the next step.*

In [51]:
# Your code here
unique_types = pokemon.groupby(['Type 1','Type 2']).agg({'Type 1':pd.Series.nunique, 'Type 2':pd.Series.nunique})
unique_types

Unnamed: 0_level_0,Unnamed: 1_level_0,Type 1,Type 2
Type 1,Type 2,Unnamed: 2_level_1,Unnamed: 3_level_1
Bug,Electric,1,1
Bug,Fighting,1,1
Bug,Fire,1,1
Bug,Flying,1,1
Bug,Ghost,1,1
...,...,...,...
Water,Ice,1,1
Water,Poison,1,1
Water,Psychic,1,1
Water,Rock,1,1


In [52]:
unique_types = unique_types.T
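The grouped counts above list pairs of types rather than a flat collection of type names. One possible way to get the flat collection is to stack both columns and take the unique values. The sketch below uses a small hypothetical frame named `toy` (four rows copied from the table above) so that it runs on its own; with the real `pokemon` frame the same two columns would be used.

```python
import numpy as np
import pandas as pd

# Small stand-in for the pokemon dataframe (hypothetical subset of rows).
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Rock", "Psychic"],
    "Type 2": ["Poison", np.nan, "Fairy", "Ghost"],
    "Total": [318, 309, 600, 600],
})

# Stack both type columns into one Series, then keep each value once.
# Series.unique() keeps NaN as one of the unique values.
unique_types = pd.concat([toy["Type 1"], toy["Type 2"]]).unique()

print(unique_types)
print(len(unique_types))
```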

#### Second we will create a list named `pokemon_totals` to contain the `Total` values of each unique type of pokemons.

Why do we use a list instead of a dictionary to store the pokemon `Total`? It's because ANOVA only tells us whether there is a significant difference among the group means but does not tell us which group(s) are significantly different. Therefore, we don't need to know which `Total` belongs to which pokemon type.

*Hints:*

* Loop through `unique_types` and append the selected type's `Total` to `pokemon_totals`. Be sure to loop through BOTH `Type 1` and `Type 2` to cover all occurrences of each unique type.
* Skip the `NaN` value in `unique_types`. `NaN` is a `float` value, which you can find out by using `type()`. The valid pokemon type values are all of the `str` type.
* At the end, the length of your `pokemon_totals` should be 18.
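The hints above can be sketched as a loop with a boolean mask over both type columns. This is a minimal illustration on a hypothetical six-row frame named `toy`, not the full dataset, so the resulting list has far fewer than 18 entries.

```python
import numpy as np
import pandas as pd

# Small stand-in for the pokemon dataframe (hypothetical subset of rows).
toy = pd.DataFrame({
    "Type 1": ["Grass", "Grass", "Fire", "Fire", "Bug", "Bug"],
    "Type 2": ["Poison", "Poison", np.nan, "Flying", "Flying", "Poison"],
    "Total": [318, 405, 309, 534, 395, 195],
})

unique_types = pd.concat([toy["Type 1"], toy["Type 2"]]).unique()

pokemon_totals = []
for type_name in unique_types:
    # NaN is a float, valid type names are str, so this skips NaN.
    if not isinstance(type_name, str):
        continue
    # A pokemon belongs to the group if either of its types matches.
    mask = (toy["Type 1"] == type_name) | (toy["Type 2"] == type_name)
    pokemon_totals.append(toy.loc[mask, "Total"])

print(len(pokemon_totals))
```

Note that a pokemon with two types lands in two groups with this approach, which is what the hint about looping through both columns implies.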

In [53]:
# Your code here
pokemon_groups = []
def func(Type1, Type2):
    for i in Type1:
        for j in Type2:
            pokemon_groups.append(pokemon['Total'])

In [55]:
pokemon_groups.to_excel('pokemon_groups.xlsx', index=False)

AttributeError: 'list' object has no attribute 'to_excel'

In [56]:
# The built-in map() calls func once for each pair of values taken from the two Series.
list(map(func, pokemon['Type 1'], pokemon['Type 2']))

TypeError: 'float' object is not iterable

In [57]:
def func(row):
    return [row['Type 1'], row['Type 2'], row['Total']]

pokemon_groups = pokemon.apply(func, axis=1)

In [58]:
pokemon_groups = list(map(func, zip(pokemon['Type 1'], pokemon['Type 2'], pokemon['Total'])))

TypeError: tuple indices must be integers or slices, not str

In [34]:
def func(Type1, Type2, Total):
    return [Type1, Type2, Total]

pokemon_groups = list(map(func, pokemon['Type 1'], pokemon['Type 2'], pokemon['Total']))

In [35]:
pokemon_groups = [[Type1, Type2, Total] for Type1, Type2, Total in zip(pokemon['Type 1'], pokemon['Type 2'], pokemon['Total'])]

In [36]:
pokemon_groups

[['Grass', 'Poison', 318],
 ['Grass', 'Poison', 405],
 ['Grass', 'Poison', 525],
 ['Grass', 'Poison', 625],
 ['Fire', nan, 309],
 ['Fire', nan, 405],
 ['Fire', 'Flying', 534],
 ['Fire', 'Dragon', 634],
 ['Fire', 'Flying', 634],
 ['Water', nan, 314],
 ['Water', nan, 405],
 ['Water', nan, 530],
 ['Water', nan, 630],
 ['Bug', nan, 195],
 ['Bug', nan, 205],
 ['Bug', 'Flying', 395],
 ['Bug', 'Poison', 195],
 ['Bug', 'Poison', 205],
 ['Bug', 'Poison', 395],
 ['Bug', 'Poison', 495],
 ['Normal', 'Flying', 251],
 ['Normal', 'Flying', 349],
 ['Normal', 'Flying', 479],
 ['Normal', 'Flying', 579],
 ['Normal', nan, 253],
 ['Normal', nan, 413],
 ['Normal', 'Flying', 262],
 ['Normal', 'Flying', 442],
 ['Poison', nan, 288],
 ['Poison', nan, 438],
 ['Electric', nan, 320],
 ['Electric', nan, 485],
 ['Ground', nan, 300],
 ['Ground', nan, 450],
 ['Poison', nan, 275],
 ['Poison', nan, 365],
 ['Poison', 'Ground', 505],
 ['Poison', nan, 273],
 ['Poison', nan, 365],
 ['Poison', 'Ground', 505],
 ['Fairy', nan,

#### Now we run ANOVA test on `pokemon_totals`.

*Hints:*

* To conduct ANOVA, you can use `scipy.stats.f_oneway()`. Here's the [reference](http://b.link/scipy44).

* What if `f_oneway` throws an error because it does not accept `pokemon_totals` as a list? The trick is to add a `*` in front of `pokemon_totals`, e.g. `stats.f_oneway(*pokemon_totals)`. This trick unpacks the list and supplies each list item as a separate argument to `f_oneway`.
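The effect of the `*` can be seen in a small sketch. The three groups below are made-up numbers chosen only to show that unpacking a list is equivalent to passing each group as its own argument.

```python
from scipy import stats

# Three hypothetical groups of measurements.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

# Unpacking the list: each inner list becomes one positional argument.
f_unpacked, p_unpacked = stats.f_oneway(*groups)

# Equivalent call with the groups written out explicitly.
f_explicit, p_explicit = stats.f_oneway(groups[0], groups[1], groups[2])

print(f_unpacked, p_unpacked)
```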

In [37]:
from scipy.stats import f_oneway

# Extract the Total value from each [Type 1, Type 2, Total] entry of the pokemon_groups list
pokemon_totals = [group[2] for group in pokemon_groups]

# Conduct the ANOVA test
f_stat, p_value = f_oneway(*pokemon_totals)

# Print the results
print("F-statistic: ", f_stat)
print("P-value: ", p_value)


ValueError: zero-dimensional arrays cannot be concatenated

In [38]:
# This code first extracts the Total value from each entry of the pokemon_groups list and assigns the result to the variable pokemon_totals, then it uses the f_oneway() function to conduct the ANOVA test on the pokemon_totals data.

# The f_oneway() function returns two values: the F-statistic and the p-value. The F-statistic is the ratio of the variation between the group means to the variation within the groups, and the p-value is the probability of observing an F-statistic at least this large if the null hypothesis (that the means of the groups are equal) is true.

# The results of the ANOVA test can be interpreted as follows:

# If the p-value is less than the significance level (usually 0.05), then we reject the null hypothesis and conclude that there is a significant difference in the means of the groups.
# If the p-value is greater than the significance level, then we fail to reject the null hypothesis: the data do not provide sufficient evidence of a difference in the means of the groups.
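To make the meaning of the F-statistic concrete, it can be computed by hand and compared with `f_oneway()`. This is a sketch on made-up groups; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

# Three hypothetical groups of measurements.
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 7.0]),
    np.array([5.0, 9.0, 7.0, 8.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Variation between groups: how far each group mean is from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation within groups: how far each value is from its own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares.
f_manual = (ss_between / df_between) / (ss_within / df_within)
# The p-value is the upper-tail probability of the F distribution.
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```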

In [39]:
# As a side note, you should make sure that the assumptions of ANOVA are met before conducting the test, such as normality and equal variances among groups. The Type 1 and Type 2 columns do not need to be converted to a categorical dtype, because f_oneway() only receives the numeric Total values of each group.
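One common way to check these two assumptions, shown here only as a sketch, is the Shapiro-Wilk test for normality within each group and Levene's test for equal variances across groups. The groups below are made-up numbers, and these two tests are a suggestion rather than something the challenge prescribes.

```python
from scipy import stats

# Three hypothetical groups of Total-like values.
groups = [
    [318, 405, 525, 625, 309, 405],
    [314, 405, 530, 630, 195, 205],
    [251, 349, 479, 579, 253, 413],
]

# Shapiro-Wilk: a small p-value suggests the group is not normally distributed.
shapiro_p_values = []
for values in groups:
    statistic, p_value = stats.shapiro(values)
    shapiro_p_values.append(p_value)

# Levene: a small p-value suggests the group variances are not equal.
levene_statistic, levene_p = stats.levene(*groups)

print(shapiro_p_values)
print(levene_p)
```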

In [40]:
# Your code here


In [41]:
# The error message you're seeing indicates that np.concatenate() is being asked to concatenate zero-dimensional arrays, which is not allowed. This happens because pokemon_totals is a flat list of single Total values, so *pokemon_totals passes every number to f_oneway() as if it were a group of its own.

# Filtering out falsy values, as in the next cell, does not change this. To fix it, pokemon_totals has to contain one list of Total values per pokemon type.



In [42]:
pokemon_totals = [group[2] for group in pokemon_groups if group[2]]

In [43]:
# Another thing is that you should make sure that the input passed to the f_oneway function is a list of lists, each containing the data for one group, so the f_oneway function knows which data belongs to which group.

In [44]:
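# Collect the type names from both type positions, skipping NaN (a float), then build one list of Total values per type
type_names = sorted({t for group in pokemon_groups for t in group[:2] if isinstance(t, str)})
pokemon_totals = [[group[2] for group in pokemon_groups if name in group[:2]] for name in type_names]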

In [45]:
# This way you can pass the list of lists to the f_oneway function.

# Also, f_oneway() only needs the numeric samples of each group, so the Type 1 and Type 2 columns do not have to be converted to a categorical dtype before running ANOVA.

In [46]:
from scipy.stats import f_oneway

# Extract the Total value from each [Type 1, Type 2, Total] entry of the pokemon_groups list
pokemon_totals = [group[2] for group in pokemon_groups]

# Conduct the ANOVA test
f_stat, p_value = f_oneway(*pokemon_totals)

# Print the results
print("F-statistic: ", f_stat)
print("P-value: ", p_value)

ValueError: zero-dimensional arrays cannot be concatenated
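The `ValueError` above comes from passing single numbers, instead of groups of numbers, to `f_oneway()`. A possible fix, sketched here on a few `[Type 1, Type 2, Total]` records copied from the `pokemon_groups` output above, is to regroup the records by type name first. The names `records` and `totals_by_type` are introduced only for this illustration.

```python
from collections import defaultdict
from scipy import stats

nan = float("nan")

# A few [Type 1, Type 2, Total] records in the same shape as pokemon_groups.
records = [
    ["Grass", "Poison", 318],
    ["Grass", "Poison", 405],
    ["Fire", nan, 309],
    ["Fire", "Flying", 534],
    ["Bug", "Flying", 395],
    ["Bug", "Poison", 195],
    ["Water", nan, 314],
    ["Water", nan, 405],
]

# Map each type name to the list of Total values of pokemons having that type.
totals_by_type = defaultdict(list)
for type_1, type_2, total in records:
    for type_name in (type_1, type_2):
        # NaN is a float, so only real type names (str) are kept.
        if isinstance(type_name, str):
            totals_by_type[type_name].append(total)

# One list per type: this is the shape f_oneway() expects after unpacking.
pokemon_totals = list(totals_by_type.values())

f_stat, p_value = stats.f_oneway(*pokemon_totals)
print("Number of groups:", len(pokemon_totals))
print("F-statistic:", f_stat)
print("P-value:", p_value)
```

With only eight records the numbers printed here say nothing about the real dataset. The point is the shape of `pokemon_totals`.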

In [None]:
# interest_r = pd.read_csv('rate_by_city.csv')
# interest_r.head(60)
# interest_r.to_excel('rate_by_city.xlsx', index=False)

In [None]:
# interest_r[interest_r.City==1].Rate

In [None]:
# interest_r['city_count'] = interest_r.groupby('City').cumcount() ##is the new index 

# interest_r_pivot = interest_r.pivot(index='city_count', columns='City', values='Rate')
# interest_r_pivot.columns = ['City_'+str(x) for x in interest_r_pivot.columns.values]
# interest_r_pivot.head(30)

#### Interpret the ANOVA test result. Is the difference significant?

In [6]:
# Your comment here
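Once a p-value is available, the decision rule from the introduction can be written down as a small helper. This is a sketch: the function name `interpret_anova` and the example p-values are hypothetical and are not results from the pokemon data.

```python
def interpret_anova(p_value, alpha=0.05):
    """Return a one-sentence reading of an ANOVA p-value at significance level alpha."""
    if p_value < alpha:
        return (f"p = {p_value:.4g} < {alpha}: reject the null hypothesis, "
                "at least one group mean differs.")
    return (f"p = {p_value:.4g} >= {alpha}: fail to reject the null hypothesis, "
            "no significant difference was detected.")


# Made-up p-values, only to show both branches.
print(interpret_anova(0.003))
print(interpret_anova(0.4))
```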
