#### Independent Chi-Square
An independent Chi-Square (also called the Chi-Square test of independence) is used when you want to determine whether two categorical variables are related to each other.

The assumption for Chi-Square is that the expected value of each cell is at least 5.

A contingency table is necessary to run a Chi-Square in Python.

#### Import Packages

In [1]:
import pandas as pd
from scipy import stats

#### Import Data
The data located here is about the lead content in lipstick. However, it also contains some useful categorical fields that you'll be using. The first is product type, prodType, which has two levels: LP is lipstick, and LG is lip gloss. The second is price category, priceCatgry, which has three levels:

1: < 5 euros
2: 5-15 euros
3: > 15 euros

#### Requirements
You will test whether the price category of a product depends on whether it is a lipstick or a lip gloss.

In [2]:
LeadLipstick = pd.read_csv('C:/Users/desja/python_course/datasets/LeadLipstick.csv')
LeadLipstick.head()

Unnamed: 0,JRC_code,purchCntry,prodCntry,Pb,sdPb,shade,prodType,priceCatgry
0,C135,NL,NL,3.75,0.24,Red,LP,2
1,C18,FI,FI,2.29,0.07,Red,LP,2
2,C20,FI,IT,1.27,0.06,Red,LP,2
3,C164,DE,FR,1.21,0.06,Red,LP,2
4,C71,MT,UK,0.85,0.04,Red,LP,2


#### Test Assumptions and Run the Analysis
There is only one assumption for Chi-Square: the expected frequency in each cell of the contingency table must be at least 5. In Python, a convenient way to get the expected frequencies table is to run the analysis itself, because the expected frequencies are part of its output. So, you will conduct your independent Chi-Square first, and then check that it meets this assumption.

In [4]:
LeadLipstick_crosstab = pd.crosstab(LeadLipstick['prodType'], LeadLipstick['priceCatgry'])
LeadLipstick_crosstab

prodType \ priceCatgry,1,2,3
LG,19,43,12
LP,34,92,23


#### Results
The arguments to pd.crosstab are the columns of your data frame that you want to cross-tabulate.
The table above is the result.

#### Note 
The three price categories run across the top, and the two product types run down the side. Each cell shows how many products fall into both categories. For instance, there are 19 lip glosses priced under 5 euros.
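To make the table easier to read, row and column totals or row proportions can help. The sketch below is only an illustration: `demo` is a hypothetical stand-in data frame rebuilt from the counts in the crosstab above, not the real LeadLipstick file, and the variable names are assumptions.

```python
import pandas as pd

# Rebuild one row per product from the counts shown in the crosstab above.
# This is a stand-in for the real LeadLipstick data frame (an assumption).
counts = {("LG", 1): 19, ("LG", 2): 43, ("LG", 3): 12,
          ("LP", 1): 34, ("LP", 2): 92, ("LP", 3): 23}
rows = [(prod, price) for (prod, price), n in counts.items() for _ in range(n)]
demo = pd.DataFrame(rows, columns=["prodType", "priceCatgry"])

# margins=True appends an "All" row and column holding the totals.
with_totals = pd.crosstab(demo["prodType"], demo["priceCatgry"], margins=True)

# normalize="index" turns each row into proportions that sum to 1,
# so the two product types can be compared directly.
row_props = pd.crosstab(demo["prodType"], demo["priceCatgry"], normalize="index")

print(with_totals)
print(row_props.round(3))
```

With the real data you would pass `LeadLipstick['prodType']` and `LeadLipstick['priceCatgry']` instead of the `demo` columns.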

#### Running the Independent Chi-Square
Once you have the contingency table, you can run the function stats.chi2_contingency on it:

In [5]:
stats.chi2_contingency(LeadLipstick_crosstab)

Chi2ContingencyResult(statistic=0.2969891724608704, pvalue=0.8620046738525345, dof=2, expected_freq=array([[17.58744395, 44.79820628, 11.61434978],
       [35.41255605, 90.20179372, 23.38565022]]))

#### Results
The first value, statistic=0.2969891724608704, is the Chi-Square statistic, and dof=2 is the degrees of freedom.
The pvalue of 0.86 is higher than the significance level of 0.05, which means there is no significant relationship between product type and price category.
In other words, this data gives no evidence that lipsticks and lip glosses are distributed differently across the price categories.
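The comparison against the significance level can also be written out in code. This is a minimal sketch, assuming a cutoff of `alpha = 0.05` and typing the observed counts in by hand from the contingency table; with the real data you would pass `LeadLipstick_crosstab` instead.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table (rows: LG, LP; columns: 1, 2, 3).
observed = np.array([[19, 43, 12],
                     [34, 92, 23]])

alpha = 0.05  # assumed significance level

# The result unpacks into statistic, p-value, degrees of freedom, expected counts.
chi2, p, dof, expected = stats.chi2_contingency(observed)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.3f}")
if p < alpha:
    print("Significant: product type and price category appear to be related.")
else:
    print("Not significant: no evidence of a relationship.")
```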

#### Test the Assumption of 5 Cases per Expected Cell
The last piece of the output, labeled expected_freq, is your expected count contingency table, albeit not a very pretty one! The expected count is what you would expect to see in each cell if there were no relationship between the two variables. Since all of these values are over 5, the assumption has been met, and you are free to present and discuss these results without any limitations!
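Each expected count is the row total times the column total, divided by the grand total. The sketch below recomputes the table by hand, compares it with `scipy.stats.contingency.expected_freq`, and checks the assumption programmatically. The observed counts are typed in from the contingency table, and the variable names are only illustrative.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts copied from the contingency table (rows: LG, LP; columns: 1, 2, 3).
observed = np.array([[19, 43, 12],
                     [34, 92, 23]])

row_totals = observed.sum(axis=1)   # one total per product type
col_totals = observed.sum(axis=0)   # one total per price category
n = observed.sum()                  # grand total

# Expected count for each cell: row total * column total / grand total.
expected_manual = np.outer(row_totals, col_totals) / n

# SciPy's helper computes the same table without running the full test.
expected_scipy = expected_freq(observed)

# The assumption holds when every expected count is at least 5.
assumption_met = bool((expected_manual >= 5).all())

print(expected_manual.round(2))
print("Smallest expected count:", round(float(expected_manual.min()), 2))
print("Assumption met:", assumption_met)
```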