# **Chi-Square Test for Categorical Variables**

## Introduction
The chi-square test is a statistical method used to determine if there is a significant association between two categorical variables. This test is widely used in various fields, including social sciences, marketing, and healthcare, to analyze survey data, experimental results, and observational studies.

## Concept
The chi-square test is a non-parametric statistical method that evaluates whether the observed frequencies in each category differ significantly from the frequencies that would be expected if the variables were not associated.

The test is based on the chi-square distribution, which is a family of distributions defined by degrees of freedom (df). These distributions are right-skewed and vary depending on df. A chi-square distribution table lists critical values for given df and significance levels (α), which we use to assess if our computed test statistic is extreme enough to reject the null hypothesis.

## Null Hypothesis and Alternative Hypothesis
The chi-square test involves formulating two hypotheses:

Null Hypothesis ($H_0$): There is no association between the categorical variables, so any observed differences are due to random chance.

Alternative Hypothesis ($H_1$): There is an association between the variables, so the observed differences are not due to chance alone.

## Formula
The chi-square statistic is calculated using the formula:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

where

- $O_i$ is the observed frequency for cell $i$.
- $E_i$ is the expected frequency for cell $i$, calculated from the totals of the row and column that contain the cell:

$$E_i = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$

The sum is taken over all cells in the contingency table.

The calculated chi-square statistic is then compared to a critical value from the chi-square distribution table. This table provides critical values for different degrees of freedom (df) and significance levels ($\alpha$).

The degrees of freedom for the test are calculated as:

$$df = (r - 1) \times (c - 1)$$

where $r$ is the number of rows and $c$ is the number of columns in the table.
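The formulas above can be traced by hand on a small table. The following is a minimal sketch using NumPy; the 2×3 table of counts is made up for illustration and does not come from the text.

```python
import numpy as np

# Illustrative observed counts (hypothetical): 2 rows, 3 columns.
observed = np.array([[30, 20, 10],
                     [20, 25, 15]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (r, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, c)
grand_total = observed.sum()

# Expected count per cell: row total * column total / grand total.
expected = row_totals * col_totals / grand_total

# Chi-square statistic: sum over all cells of (O - E)^2 / E.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (r - 1) * (c - 1).
r, c = observed.shape
dof = (r - 1) * (c - 1)

print("Expected frequencies:\n", expected)
print("Chi-square statistic:", round(chi2_stat, 4))
print("Degrees of freedom:", dof)
```

Broadcasting the column of row totals against the row of column totals produces the full table of expected counts in one step, so no explicit loop over cells is needed.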

## Chi-Square Distribution Table
A chi-square distribution table provides critical values that vary by degrees of freedom and the significance level (α). These values indicate the threshold beyond which the test statistic would be considered statistically significant.

For example, with df = 1 and α = 0.05, the critical value is 3.841:

- If your calculated χ² > 3.841, you reject H₀.
- If χ² ≤ 3.841, you fail to reject H₀.

The higher the χ² value, the stronger the evidence against H₀.
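Instead of reading a printed table, the same critical value can be obtained from `scipy.stats.chi2`. This is a small sketch; the test statistic of 4.5 is an arbitrary illustrative value, not a result from the text.

```python
from scipy.stats import chi2

alpha = 0.05
dof = 1

# Critical value: the point that leaves probability alpha in the right tail.
critical_value = chi2.ppf(1 - alpha, dof)

# p-value for a hypothetical test statistic: the right-tail area beyond it.
test_statistic = 4.5  # illustrative value only
p_value = chi2.sf(test_statistic, dof)

print("Critical value:", round(critical_value, 3))  # 3.841
print("p-value:", p_value)
print("Reject H0:", test_statistic > critical_value)
```

Comparing the statistic to the critical value and comparing the p-value to α are equivalent decisions, so either check can be used.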

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Create the contingency table
data = [[20, 30],  # Male: [Like, Dislike]
        [25, 25]]  # Female: [Like, Dislike]

# Create a DataFrame for clarity
df = pd.DataFrame(data, columns=["Like", "Dislike"], index=["Male", "Female"])

# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(df)

# Display results
print("Chi-square Statistic:", chi2)
print("Degrees of Freedom:", dof)
print("P-value:", p)
print("Expected Frequencies:\n", expected)
```

```
Chi-square Statistic: 0.6464646464646464
Degrees of Freedom: 1
P-value: 0.4213795037428697
Expected Frequencies:
 [[22.5 27.5]
 [22.5 27.5]]
```


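Interpretation: Since the p-value (0.4214) is greater than 0.05, we fail to reject the null hypothesis, so the data show no significant association between gender and preference. For a 2×2 table, `chi2_contingency` applies Yates' continuity correction by default, which is reflected in the statistic above.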
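For tables with one degree of freedom, `chi2_contingency` applies Yates' continuity correction unless told otherwise, so its statistic differs from the plain formula given earlier. The sketch below, using the same counts as the example, compares the two settings via the `correction` parameter.

```python
from scipy.stats import chi2_contingency

observed = [[20, 30],   # Male: [Like, Dislike]
            [25, 25]]   # Female: [Like, Dislike]

# Default behaviour: continuity correction is applied because dof == 1.
chi2_corr, p_corr, dof, _ = chi2_contingency(observed)

# correction=False yields the uncorrected Pearson chi-square statistic.
chi2_raw, p_raw, _, _ = chi2_contingency(observed, correction=False)

print("With correction:    chi2 =", round(chi2_corr, 4), " p =", round(p_corr, 4))
print("Without correction: chi2 =", round(chi2_raw, 4), " p =", round(p_raw, 4))
```

Both p-values are above 0.05 here, so the conclusion is the same either way; the correction only makes the test slightly more conservative.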

## Applications
- Market Research: Analyzing the association between customer demographics and product preferences.
- Healthcare: Studying the relationship between patient characteristics and disease incidence.
- Social Sciences: Investigating the link between social factors (e.g., education level) and behavioral outcomes (e.g., voting patterns).
- Education: Examining the connection between teaching methods and student performance.
- Quality Control: Assessing the association between manufacturing conditions and product defects.
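In applications like these, the data often arrives as one record per respondent rather than as a ready-made table of counts. One possible way to get from raw records to a test result is `pandas.crosstab`; the sketch below uses made-up responses, and the column names `gender` and `preference` are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical raw survey data: one row per respondent.
responses = pd.DataFrame({
    "gender": ["Male"] * 50 + ["Female"] * 50,
    "preference": ["Like"] * 20 + ["Dislike"] * 30
                  + ["Like"] * 25 + ["Dislike"] * 25,
})

# Count respondents in each (gender, preference) combination.
table = pd.crosstab(responses["gender"], responses["preference"])

# Run the test directly on the table of counts.
chi2_stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print("Chi-square statistic:", round(chi2_stat, 4))
print("p-value:", round(p_value, 4))
```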

# **Data Analysis with Python**

## Cheat Sheet: Exploratory Data Analysis

| Package/Method | Description | Code Example |
| --- | --- | --- |
| Complete dataframe correlation | Correlation matrix created using all the attributes of the dataset. | `df.corr()` |
| Specific attribute correlation | Correlation matrix created using specific attributes of the dataset. | `df[['attribute_1','attribute_2',...]].corr()` |
| Scatter plot | Create a scatter plot with the independent variable along the x-axis and the dependent variable along the y-axis. | `from matplotlib import pyplot as plt`<br>`plt.scatter(df['attribute_1'], df['attribute_2'])` |
| Regression plot | Uses the dependent and independent variables in a pandas DataFrame to create a scatter plot with a fitted linear regression line. | `import seaborn as sns`<br>`sns.regplot(x='attribute_1', y='attribute_2', data=df)` |
| Box plot | Create a box-and-whisker plot from the pandas DataFrame using the dependent and independent variables. | `import seaborn as sns`<br>`sns.boxplot(x='attribute_1', y='attribute_2', data=df)` |
| Grouping by attributes | Select a group of attributes of a dataset to create a subset of the data. | `df_group = df[['attribute_1','attribute_2',...]]` |
| GroupBy statements | a. Group the data by the categories of one attribute, displaying the average value of the numerical attributes within each category.<br>b. Group the data by the categories of multiple attributes, displaying the average value of the numerical attributes within each combination of categories. | a. `df_group = df.groupby(['attribute_1'], as_index=False).mean()`<br>b. `df_group = df.groupby(['attribute_1','attribute_2'], as_index=False).mean()` |
| Pivot tables | Create a pivot table for a clearer representation of the data, based on the chosen index and column attributes. | `grouped_pivot = df_group.pivot(index='attribute_1', columns='attribute_2')` |
| Pseudocolor plot | Create a heatmap image using a pseudocolor plot (`pcolor`) with the pivot table as data. | `from matplotlib import pyplot as plt`<br>`plt.pcolor(grouped_pivot, cmap='RdBu')` |
| Pearson coefficient and p-value | Calculate the Pearson correlation coefficient and p-value for a pair of attributes. | `from scipy import stats`<br>`pearson_coef, p_value = stats.pearsonr(df['attribute_1'], df['attribute_2'])` |
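
Several of the cheat-sheet entries can be chained into one short workflow. The following sketch runs the non-plotting steps on a tiny made-up dataset; the column names (`body_style`, `drive`, `engine_size`, `price`) and all values are hypothetical stand-ins for `attribute_1`, `attribute_2`, and so on.

```python
import pandas as pd
from scipy import stats

# Hypothetical dataset; names and values are placeholders for illustration.
df = pd.DataFrame({
    "body_style": ["sedan", "sedan", "wagon", "wagon", "sedan", "wagon"],
    "drive": ["fwd", "rwd", "fwd", "rwd", "fwd", "rwd"],
    "engine_size": [100, 150, 110, 160, 120, 170],
    "price": [10000, 15500, 11000, 16500, 12500, 17000],
})

# Correlation matrix restricted to specific numeric attributes.
corr_matrix = df[["engine_size", "price"]].corr()

# Subset the data, then average price per (drive, body_style) combination.
df_group = df[["drive", "body_style", "price"]]
grouped = df_group.groupby(["drive", "body_style"], as_index=False).mean()

# Pivot so that drive indexes the rows and body_style the columns.
grouped_pivot = grouped.pivot(index="drive", columns="body_style")

# Pearson correlation coefficient and p-value for one pair of attributes.
pearson_coef, p_value = stats.pearsonr(df["engine_size"], df["price"])

print(corr_matrix)
print(grouped_pivot)
print("Pearson coefficient:", round(pearson_coef, 4))
```

The resulting `grouped_pivot` is the kind of table that `plt.pcolor` expects when drawing a heatmap.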