# An independent Chi-Square is used when you want to determine whether two categorical variables are related to each other.



## Import data and packages

In [2]:
import pandas as pd
from scipy import stats

In [3]:
# Using lead_lipstick.csv
# prodType has 2 levels: LP for lipstick and LG for lip gloss.
# priceCatgry has 3 levels: 
# 1: < 5 euros
# 2: 5-15 euros
# 3: > 15 euros

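# Forward slashes throughout: a backslash before "D" is an invalid escape sequence in a normal string
lip = pd.read_csv('C:/Users/chris/Desktop/data/DS105/Lesson1/lead_lipstick.csv')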

## You will test to see if the price category of the product depends on whether it is a lipstick or a lip gloss.

In [4]:
lip.head()

Unnamed: 0,JRC_code,purchCntry,prodCntry,Pb,sdPb,shade,prodType,priceCatgry
0,C135,NL,NL,3.75,0.24,Red,LP,2
1,C18,FI,FI,2.29,0.07,Red,LP,2
2,C20,FI,IT,1.27,0.06,Red,LP,2
3,C164,DE,FR,1.21,0.06,Red,LP,2
4,C71,MT,UK,0.85,0.04,Red,LP,2


The main assumption to check for Chi-Square is that the expected frequency in every cell of the contingency table is at least 5. Expected frequencies are the counts you would see if the two variables were unrelated; they are not the observed counts in the crosstab. In Python, the easiest way to get the expected frequencies table is to run the analysis, because `stats.chi2_contingency` returns it along with the test statistic. So, you will conduct your independent Chi-Square first, and then make sure it meets this assumption!

A contingency table is also called a crosstab.

In [5]:
# Create the crosstab
lip_crosstab = pd.crosstab(lip['prodType'], lip['priceCatgry'])
lip_crosstab

priceCatgry,1,2,3
prodType,,,
LG,19,43,12
LP,34,92,23
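The expected frequencies that the assumption refers to can be reproduced by hand from the margins of this table. The sketch below is only an illustration: it hard-codes the observed counts shown above instead of reading the CSV, and the variable names are chosen for this example.

```python
import numpy as np

# Observed counts copied from the prodType x priceCatgry crosstab above
# (rows: LG, LP; columns: price categories 1, 2, 3).
observed = np.array([[19, 43, 12],
                     [34, 92, 23]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
n = observed.sum()                                # grand total

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / n

print(expected.round(2))
print("Smallest expected count:", round(float(expected.min()), 2))
```

Each expected count depends only on the row and column totals, so a cell can have a small observed count and still satisfy the assumption, or the other way around.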


# Running the Independent Chi-Square

In [6]:
# Run the contingency function on the contingency/crosstab table
stats.chi2_contingency(lip_crosstab)

Chi2ContingencyResult(statistic=0.2969891724608704, pvalue=0.8620046738525345, dof=2, expected_freq=array([[17.58744395, 44.79820628, 11.61434978],
       [35.41255605, 90.20179372, 23.38565022]]))
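To make the assumption check and the decision explicit, the result can be unpacked into its four parts. This is a minimal sketch that hard-codes the observed counts from the crosstab so it runs on its own; the 0.05 cutoff is the conventional significance level and is an assumption here, as are the variable names.

```python
import numpy as np
from scipy import stats

# Observed counts from the prodType x priceCatgry crosstab (rows: LG, LP)
observed = np.array([[19, 43, 12],
                     [34, 92, 23]])

# The result unpacks as (statistic, p value, degrees of freedom, expected counts)
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05  # assumed significance level

# Assumption: every expected frequency is at least 5
assumption_met = bool((expected >= 5).all())
# Decision: is the association statistically significant?
significant = bool(p_value < alpha)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
print("Expected counts all >= 5:", assumption_met)
print("Significant at alpha = 0.05:", significant)
```

With the real data you could pass `lip_crosstab` directly in place of `observed`.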

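### Based on the Chi-Square statistic (0.30, dof = 2) and a p value of .86, which is greater than .05, there does not appear to be a significant relationship between price category and product type. All expected frequencies are above 5, so the assumption is met.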

# Lesson 1 Turn in for Chi-Square

Compute an independent Chi-Square to see if the shade of lipstick and the price category are related.


In [7]:
color_crosstab = pd.crosstab(lip['shade'], lip['priceCatgry'])

In [8]:
# Look at the observed counts. The assumption of at least 5 per cell applies to the
# expected frequencies, which are returned by the test in the next step.
color_crosstab

priceCatgry,1,2,3
shade,,,
Brown,20,30,10
Pink,20,49,12
Purple,8,23,6
Red,5,33,7


In [9]:
# Compute Independent Chi-Square
stats.chi2_contingency(color_crosstab)

Chi2ContingencyResult(statistic=7.860569553614045, pvalue=0.2484973879479863, dof=6, expected_freq=array([[14.26008969, 36.32286996,  9.41704036],
       [19.25112108, 49.03587444, 12.71300448],
       [ 8.79372197, 22.39910314,  5.80717489],
       [10.69506726, 27.24215247,  7.06278027]]))
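To see where these numbers come from, the statistic can be recomputed by hand as the sum of (observed - expected)² / expected over all cells, with the p value taken from the Chi-Square distribution. The sketch below hard-codes the shade crosstab shown earlier and uses illustrative variable names; it should reproduce the output above.

```python
import numpy as np
from scipy import stats

# Observed counts from the shade x priceCatgry crosstab
# (rows: Brown, Pink, Purple, Red; columns: price categories 1, 2, 3)
observed = np.array([[20, 30, 10],
                     [20, 49, 12],
                     [ 8, 23,  6],
                     [ 5, 33,  7]])

# Expected counts under independence, built from the row and column totals
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Chi-Square statistic: sum of squared deviations scaled by the expected counts
chi2 = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p value: upper tail of the Chi-Square distribution
p_value = stats.chi2.sf(chi2, dof)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
print("Smallest expected count:", round(float(expected.min()), 2))
```

Most of the statistic comes from the Red and Brown rows in price category 1, where the observed counts (5 and 20) sit furthest from the expected counts.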

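### It appears there is not a significant relationship between shade and price category (Chi-Square = 7.86, dof = 6, p = .25). The smallest expected frequency is 5.81, so the assumption of at least 5 per cell is met.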