# Chi-Square Test

## Definition
A chi-square test is a statistical test that compares observed counts with the counts expected under a hypothesis. It is commonly used to determine whether two categorical variables are related or independent, by checking whether the observed data differ significantly from the expected data.

Karl Pearson introduced the chi-square test in 1900. It is denoted by χ².

It is a test for **categorical** data that measures the difference between observed and expected counts.

## Purpose
The goal of this test is to identify whether a disparity between observed and expected data is due to chance or to a link between the variables under consideration. This makes the chi-square test a natural choice for understanding and interpreting the connection between two categorical variables.

A chi-square test or ***comparable nonparametric test*** is required to test a hypothesis regarding the distribution of a categorical variable. Categorical variables, which indicate categories such as animals or countries, can be ***nominal or ordinal***.

They **cannot have a normal distribution** since they can only take a few particular values.

## Formula
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

$\chi^2$ = Chi-square statistic

$O_i$ = Observed value

$E_i$ = Expected value
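As a rough illustration of the formula above, the sketch below computes the statistic by hand with NumPy. It assumes a 2x2 table of counts (the same counts that appear in the worked example of this document) and assumes the expected counts come from the independence rule, row total times column total divided by the grand total. The variable names are illustrative only.

```python
import numpy as np

# Observed counts for a 2x2 table (rows: sex, columns: smoker).
observed = np.array([[60, 97],
                     [33, 54]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()

# Expected count under independence: (row total * column total) / grand total
expected = row_totals * col_totals / grand_total

# Pearson's statistic: sum over all cells of (O - E)^2 / E
chi_square = ((observed - expected) ** 2 / expected).sum()

print("Expected counts:\n", expected)
print("Chi-square statistic:", chi_square)
```

Because the observed and expected counts are almost identical here, the statistic comes out very close to zero.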

## Uses
1. The Chi-squared test can be used to see if your data follows a well-known theoretical probability distribution like the Normal or Poisson distribution.
2. The Chi-squared test allows you to assess your trained regression model's goodness of fit on the training, validation, and test data sets.
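To make the first use concrete, here is a minimal sketch of checking count data against a Poisson distribution with `scipy.stats.chisquare`. The data are simulated, and the rate, the sample size, and the choice to pool all values of 7 or more into one bin are assumptions made only for this illustration.

```python
import numpy as np
from scipy import stats

# Simulated count data (an assumption for illustration, not real measurements).
rng = np.random.default_rng(0)
sample = rng.poisson(lam=3.0, size=500)

# Estimate the Poisson rate from the sample itself.
lam_hat = sample.mean()

# Bin the data as 0, 1, ..., 6 and a pooled tail bin ">= 7".
max_bin = 7
observed = np.array(
    [(sample == k).sum() for k in range(max_bin)] + [(sample >= max_bin).sum()]
)

# Matching Poisson probabilities; sf(max_bin - 1) is P(X >= max_bin).
probs = np.append(
    stats.poisson.pmf(np.arange(max_bin), lam_hat),
    stats.poisson.sf(max_bin - 1, lam_hat),
)
expected = probs * sample.size

# ddof=1 removes one extra degree of freedom for the estimated rate.
result = stats.chisquare(observed, f_exp=expected, ddof=1)
print("Chi-square statistic:", result.statistic)
print("p-value:", result.pvalue)
```

A large p-value here would mean the data give no evidence against the Poisson assumption; it does not prove that the data are Poisson.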

These tests use the degrees of freedom, together with the chosen significance level, to determine whether a particular null hypothesis can be rejected. The degrees of freedom depend on the number of categories being compared. The larger the sample size, the more reliable the result.

There are two main types of chi-square tests:
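The degrees of freedom can be sketched in a few lines. The helper functions below are hypothetical names introduced for this illustration; they use the standard rules of (rows - 1) × (columns - 1) for a test of independence and (categories - 1) for a goodness-of-fit test, and look up the matching critical value at an assumed significance level of 0.05.

```python
from scipy import stats

def dof_independence(n_rows, n_cols):
    """Degrees of freedom for a test of independence on an r x c table."""
    return (n_rows - 1) * (n_cols - 1)

def dof_goodness_of_fit(n_categories, n_estimated_params=0):
    """Degrees of freedom for a goodness-of-fit test."""
    return n_categories - 1 - n_estimated_params

alpha = 0.05  # assumed significance level
cases = [
    ("2x2 table", dof_independence(2, 2)),
    ("3x4 table", dof_independence(3, 4)),
    ("5 categories", dof_goodness_of_fit(5)),
]
for label, dof in cases:
    # Critical value: the statistic must exceed this to reject H0 at level alpha
    critical = stats.chi2.ppf(1 - alpha, dof)
    print(f"{label}: dof = {dof}, critical value = {critical:.3f}")
```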
1. Independence 
2. Goodness-of-Fit 

**Independence** 
The Chi-Square Test of Independence is an inferential statistical test that examines whether two categorical variables are likely to be related. It is used when we have counts for two nominal or categorical variables, and it is considered a non-parametric test. A relatively large sample size and independence of observations are required for conducting this test.

For example:

In a movie theatre, suppose we make a list of movie genres. Let us consider this as the first variable. The second variable is whether or not the people who came to watch those genres bought snacks at the theatre. Here the null hypothesis is that the genre of the film and whether people bought snacks are unrelated. If this is true, the movie genre does not affect snack sales.
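The movie theatre example can be sketched with `scipy.stats.chi2_contingency`. The genre names and all counts below are invented purely for illustration; they are not real data.

```python
import pandas as pd
from scipy import stats

# Hypothetical counts: rows are genres, columns are snack purchases.
observed = pd.DataFrame(
    {"Snacks": [50, 125, 90, 45], "No snacks": [75, 175, 30, 10]},
    index=["Action", "Comedy", "Family", "Horror"],
)

# H0: genre and snack purchase are independent.
stat, p_value, dof, expected = stats.chi2_contingency(observed)

print("Chi-square statistic:", stat)
print("p-value:", p_value)
print("Degrees of freedom:", dof)  # (4 - 1) * (2 - 1)

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("Reject H0: genre and snack purchases appear to be related.")
else:
    print("Fail to reject H0: no evidence of a relationship.")
```

With these made-up counts the snack share differs strongly between genres, so the test rejects the null hypothesis.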

**Goodness-Of-Fit**
In statistical hypothesis testing, the Chi-Square Goodness-of-Fit test determines whether a variable is likely to come from a given distribution. We must have a set of data values and an idea of how the data are distributed. We can use this test when we have counts for the values of a categorical variable. The test gives us a way of deciding whether the data are a "good enough" fit for our idea, or whether the sample is representative of the entire population.

For example:

Suppose we have bags of balls with five different colours in each bag. The given condition is that each bag should contain an equal number of balls of each colour. The idea we would like to test here is that the proportions of the five colours of balls in each bag are equal.
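A minimal sketch of the bag example, using `scipy.stats.chisquare`. The colour names and counts are hypothetical. When no expected frequencies are passed, `chisquare` assumes all categories are equally likely, which matches the "equal number of each colour" condition.

```python
from scipy import stats

# Hypothetical counts of each colour found in one bag of 100 balls.
colours = ["red", "blue", "green", "yellow", "orange"]
observed_counts = [22, 18, 25, 15, 20]

# H0: all five colours are equally likely (expected 20 of each).
result = stats.chisquare(observed_counts)

print("Chi-square statistic:", result.statistic)
print("p-value:", result.pvalue)

alpha = 0.05  # assumed significance level
if result.pvalue < alpha:
    print("Reject H0: the colours do not look equally distributed.")
else:
    print("Fail to reject H0: the counts are consistent with equal proportions.")
```

Here the statistic is (4 + 4 + 25 + 25 + 0) / 20 = 2.9 with 4 degrees of freedom, which is not enough to reject the null hypothesis at the 0.05 level.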

Q1) Let's take the tips dataset as an example.

```python
import scipy.stats as stats
import seaborn as sns
import pandas as pd
import numpy as np
```

Null and Alternate Hypotheses:

Null Hypothesis (H0): There is no significant association between gender and smoking habits.

Alternate Hypothesis (H1): There is a significant association between gender and smoking habits.

Confidence Interval: Not applicable for a chi-square test.

```python
dataset = sns.load_dataset('tips')
dataset  # categorical data here is sex, smoker, day, time
```

```text
     total_bill   tip     sex smoker  day    time  size
0         16.99  1.01  Female     No  Sun  Dinner     2
1         10.34  1.66    Male     No  Sun  Dinner     3
2         21.01  3.50    Male     No  Sun  Dinner     3
3         23.68  3.31    Male     No  Sun  Dinner     2
4         24.59  3.61  Female     No  Sun  Dinner     4
...         ...   ...     ...    ...  ...     ...   ...
239       29.03  5.92    Male     No  Sat  Dinner     3
240       27.18  2.00  Female    Yes  Sat  Dinner     2
241       22.67  2.00    Male    Yes  Sat  Dinner     2
242       17.82  1.75    Male     No  Sat  Dinner     2
```


The categorical variables here are sex, smoker, day and time.

## The Contingency Table 
A contingency table (also called a crosstab) is used in statistics to summarise the relationship between categorical variables. Here, the table shows the number of smokers and non-smokers for each gender.

```python
# Cross-tabulate sex (rows) against smoker status (columns)
dataset_table = pd.crosstab(dataset['sex'], dataset['smoker'])
print(dataset_table)
```

```text
smoker  Yes  No
sex            
Male     60  97
Female   33  54
```
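It can help to look at row totals and row proportions before running the test. The sketch below rebuilds one row per customer from the four counts in the table, so that it runs without downloading the tips dataset; that reconstruction is an assumption made for this illustration. The `margins` and `normalize` arguments are standard `pd.crosstab` parameters.

```python
import pandas as pd

# Rebuild one (sex, smoker) row per customer from the four cell counts.
counts = {
    ("Male", "Yes"): 60, ("Male", "No"): 97,
    ("Female", "Yes"): 33, ("Female", "No"): 54,
}
rows = [(sex, smoker) for (sex, smoker), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["sex", "smoker"])

# Same contingency table, with row and column totals added ("All").
table_with_totals = pd.crosstab(df["sex"], df["smoker"], margins=True)
print(table_with_totals)

# Share of smokers and non-smokers within each sex (each row sums to 1).
row_share = pd.crosstab(df["sex"], df["smoker"], normalize="index")
print(row_share.round(3))
```

The share of smokers is about 38% for both men and women, which already suggests that the two variables are close to independent.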


```python
dataset_table.values
```

```text
array([[60, 97],
       [33, 54]], dtype=int64)
```

## Observed Values

```python
# Observed values
Observed_Values = dataset_table.values
print("Observed Values :-\n", Observed_Values)
```

```text
Observed Values :-
 [[60 97]
 [33 54]]
```


### Calculating Expected Values

```python
# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom and the expected counts under the null hypothesis
stat, p_value, dof, expected_value = stats.chi2_contingency(dataset_table)
print("Chi-square statistic = ", stat)
print("p-value = ", p_value)
print("Degrees of freedom = ", dof)
print("Expected values = ", expected_value)
```

```text
Chi-square statistic =  0.0
p-value =  1.0
Degrees of freedom =  1
Expected values =  [[59.84016393 97.15983607]
 [33.15983607 53.84016393]]
```
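A statistic of exactly 0.0 may look surprising, since the observed and expected counts are not identical. By default, `chi2_contingency` applies Yates' continuity correction when there is one degree of freedom, and the result above is consistent with that correction shrinking the already tiny differences to zero. The sketch below, which hard-codes the observed counts as an assumption so that it runs on its own, compares the default call with `correction=False`, which gives the plain Pearson statistic.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table (sex x smoker).
observed = np.array([[60, 97],
                     [33, 54]])

# Default behaviour: continuity correction is applied because dof == 1.
stat_corrected, p_corrected, dof, expected = stats.chi2_contingency(observed)

# Plain Pearson chi-square, without the continuity correction.
stat_plain, p_plain, _, _ = stats.chi2_contingency(observed, correction=False)

print("With correction:    stat =", stat_corrected, " p =", p_corrected)
print("Without correction: stat =", stat_plain, " p =", p_plain)
```

Either way the statistic is far below the critical value, so the conclusion does not change.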


```python
# Define significance level
alpha = 0.05

# Critical value from the chi-square distribution
critical_value = stats.chi2.ppf(1 - alpha, dof)
critical_value
```

```text
3.841458820694124
```
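The critical-value rule and the p-value rule always lead to the same decision, because the critical value is exactly the point whose upper-tail probability equals the significance level. A short sketch of that equivalence follows; the statistic values in the loop are arbitrary numbers chosen for illustration.

```python
from scipy import stats

alpha = 0.05
dof = 1

# Critical value: the (1 - alpha) quantile of the chi-square distribution.
critical_value = stats.chi2.ppf(1 - alpha, dof)

# Upper-tail probability at the critical value is alpha again.
tail_area = stats.chi2.sf(critical_value, dof)
print("Critical value:", critical_value)
print("Tail area beyond it:", tail_area)

# For any statistic, "stat > critical_value" and "p < alpha" agree.
for stat in (0.5, 3.0, 4.0, 10.0):
    p_value = stats.chi2.sf(stat, dof)
    print(stat, round(p_value, 4), stat > critical_value, p_value < alpha)
```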

```python
# Decision rule 1: compare the statistic with the critical value
if stat > critical_value:
    print("Reject Null Hypothesis: There is a significant relationship.")
else:
    print("Fail to Reject Null Hypothesis: There is no significant relationship.")

# Decision rule 2: compare the p-value with the significance level
if p_value < alpha:
    print("Reject Null Hypothesis: There is a significant relationship.")
else:
    print("Fail to Reject Null Hypothesis: There is no significant relationship.")
```

```text
Fail to Reject Null Hypothesis: There is no significant relationship.
Fail to Reject Null Hypothesis: There is no significant relationship.
```
