**Chi-Square Test**

The “titanic.csv” dataset contains details of the passengers on board the Titanic when it sank in 1912. Your job is to examine how passenger class affects the probability of survival with the help of the Chi-Square test. Load the “titanic.csv” data into a DataFrame and perform the following tasks:
1.	Create a DataFrame with three columns from the original dataset: "PassengerID", "PClass", and "Survived or not"
2.	Visualize the "PClass" and the "Survived or not" columns to get an overview of the columns
3.	Plot the relationship between passenger class and chances of survival, and calculate the survival rate for each class
4.	State the null hypothesis based on the class-wise survival rate
5.	Plot the difference between the expected and observed counts for passenger class and survival using heat maps to decide whether a Chi-Square test is needed
6.	Calculate the Chi-Square contribution of each passenger class, and the Chi-Square statistic and p-value for the entire distribution
7.	Reject or fail to reject the null hypothesis based on the results obtained


**Step-1:** Loading the dataset into a DataFrame.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("titanic_dataset.csv", index_col='PassengerId', usecols=['PassengerId','Pclass','Survived'])
data

In [None]:
data.info()

**Step-2:** Visualizing the "PClass" column using Seaborn.

In [None]:
# Pass the column as x explicitly; recent seaborn versions no longer accept it positionally
sns.countplot(x=data.Pclass)
plt.title("Passenger Class")

plt.show()

**Step-2:** Visualizing the "Survived or not" column using Seaborn.

In [None]:
# Pass the column as x explicitly; recent seaborn versions no longer accept it positionally
sns.countplot(x=data.Survived)
plt.title("Survived or not")

plt.show()

**Step-3:** Creating a pivot table using "PClass" and "Survived or not" columns.

In [None]:
PClass_survd = pd.pivot_table(data,index=['Pclass'],columns=['Survived'],aggfunc='size')
PClass_survd

**Step-4:** Plotting the observed counts for each combination of passenger class and survival.

In [None]:
sns.heatmap(PClass_survd,annot=True, fmt='g',square=True,cmap='hot')
plt.title('Class Vs Survived',fontsize=20)
plt.show()

**Observation:** The data contains 891 entries. Most 3rd-class passengers died, while most 1st-class passengers survived. To check whether this pattern could have arisen by chance, we can run a **Chi-Square Test** under the assumption that survival is independent of passenger class.

**Step-5:** Calculating the share of passengers in each class and the overall survival rate.

In [None]:
pct_class = PClass_survd.sum(axis=1)/891
pct_class

In [None]:
pct_survived = PClass_survd.sum(axis=0)/891
pct_survived
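
The two cells above give the marginal proportions (share of passengers per class, and overall share per survival outcome). The survival rate within each class can be obtained by dividing every row of the pivot table by its row total. The following is a minimal sketch of that idea; the counts in `observed` are hypothetical and are not read from the Titanic file.

```python
import pandas as pd

# Hypothetical counts (NOT taken from the Titanic file).
# Rows are passenger classes, columns are Survived (0 = died, 1 = survived).
observed = pd.DataFrame(
    [[30, 70], [50, 50], [150, 50]],
    index=pd.Index([1, 2, 3], name="Pclass"),
    columns=pd.Index([0, 1], name="Survived"),
)

# Divide each row by its row total so that every class sums to 1.
rate_by_class = observed.div(observed.sum(axis=1), axis=0)

# Column 1 holds the within-class survival rate.
survival_rate = rate_by_class[1]
print(survival_rate)
```

In this notebook the same pattern would be `PClass_survd.div(PClass_survd.sum(axis=1), axis=0)`.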

**Step-6:** Stating the null hypothesis based on survival rates.

We can see that 24.24% of all passengers were in Class 1, 20.65% in Class 2, and the remaining 55.11% in Class 3.

Also, 61.62% of all passengers did not survive, so only 38.38% survived.

This leads to the following:

**Null hypothesis:** 'Survival does not depend on the class in which the passengers were travelling.'

In [None]:
# Proportion of passengers expected in each class/survival cell under the null hypothesis
pct_class.to_frame() @ pct_survived.to_frame().T

In [None]:
# Expected number of passengers in each class/survival cell under the null hypothesis.
# Kept unrounded so that the Chi-Square statistic computed from it is not distorted.
exp = pct_class.to_frame() @ pct_survived.to_frame().T * 891
exp
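
Under independence, the expected count in a cell is the row total times the column total divided by the grand total, which is what the matrix product above computes. A minimal sketch of the same calculation with `np.outer` follows; the `observed` counts are hypothetical and are not read from the Titanic file.

```python
import numpy as np

# Hypothetical counts (NOT taken from the Titanic file).
# Rows are classes 1 to 3, columns are died / survived.
observed = np.array([[30, 70], [50, 50], [150, 50]])

n = observed.sum()                 # grand total
row_totals = observed.sum(axis=1)  # passengers per class
col_totals = observed.sum(axis=0)  # passengers per survival outcome

# Expected count per cell if class and survival were independent.
expected = np.outer(row_totals, col_totals) / n
print(expected)
```

The expected table keeps the same row and column totals as the observed one; only the way the counts are spread across the cells changes.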

**Step-7:** Showing the difference between the expected and observed counts for passenger class and survival using heat maps.

In [None]:
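plt.figure(figsize=(10,4))

# Left panel: observed counts (subplot takes integers; string arguments fail in recent matplotlib)
plt.subplot(1, 2, 1)
sns.heatmap(PClass_survd,annot=True, fmt='g',square=True,cmap='hot')
plt.title('Observed',fontsize=20)

# Right panel: counts expected under the null hypothesis
plt.subplot(1, 2, 2)
sns.heatmap(exp,annot=True, fmt='g',square=True,cmap='hot')
plt.title('Expected',fontsize=20)

plt.tight_layout()
plt.show()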

**Observation:** There is a visible difference between the expected and the observed distributions. Is this difference due to chance or not? Let's now find the Chi-Square value and the p-value.

**Step-8:** Tabulating the Chi-Square contribution of each class and survival outcome. With 3 classes and 2 outcomes, the test has (3 - 1) x (2 - 1) = 2 degrees of freedom.

In [None]:
Chi_table = ((PClass_survd - exp)**2)/exp
Chi_table
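
For reference, the quantities used in this step can be written out as follows, where $O_{ij}$ is the observed count for class $i$ and survival outcome $j$, $R_i$ and $C_j$ are the row and column totals, $N$ is the grand total, and $r$ and $c$ are the number of rows and columns. The symbol names are chosen here for illustration only.

$$
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\mathrm{df} = (r - 1)(c - 1)
$$

Each entry of `Chi_table` is one term of the double sum. For the 3 x 2 table used here, $\mathrm{df} = (3 - 1)(2 - 1) = 2$.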

**Step-9:** Calculating the Chi-Square value and the p-value of the distribution.

In [None]:
from scipy.stats import chi2

# Sum the per-cell contributions to get the Chi-Square statistic
Chi_value = Chi_table.sum().sum()

# Degrees of freedom: (3 classes - 1) * (2 outcomes - 1) = 2
p_value = chi2.sf(Chi_value, 2)

print("Chi square value is ",Chi_value)
print("P value is",p_value)

**Step-10:** We can also conduct the Chi-Square test directly from the contingency table.

In [None]:
from scipy import stats
chi2_stat, p_val, dof, ex = stats.chi2_contingency(PClass_survd)

print("Chi square value is ",chi2_stat)
print("P value is",p_val)
print("Degrees of Freedom:",dof)
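
As a sanity check, the statistic built by hand from observed and expected counts should agree with what `chi2_contingency` returns. The sketch below illustrates this on hypothetical counts (not read from the Titanic file). Note that SciPy applies Yates' continuity correction only when the table has one degree of freedom, so a 3 x 2 table is not affected.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical counts (NOT taken from the Titanic file).
# Rows are classes 1 to 3, columns are died / survived.
observed = np.array([[30, 70], [50, 50], [150, 50]])

# Library result: statistic, p-value, degrees of freedom, expected counts.
chi2_stat, p_val, dof, expected = chi2_contingency(observed)

# Manual result: sum of (O - E)^2 / E over all cells, then the upper-tail
# probability of the chi-square distribution with the same degrees of freedom.
manual_stat = ((observed - expected) ** 2 / expected).sum()
manual_p = chi2.sf(manual_stat, dof)

print("scipy :", chi2_stat, p_val, dof)
print("manual:", manual_stat, manual_p)
```

If the two results differ on a real table, the usual causes are rounded expected counts or a wrong number of degrees of freedom in the manual calculation.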

**Result:** Here we see that the p-value is far below the usual 0.05 significance level, so we can reject the null hypothesis.

There is a statistically significant dependence of survival on passenger class.

Having compared two categorical columns in this way, we can use the same methodology for feature selection as well.
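
One way the feature-selection idea could look is sketched below: run the same test for every categorical feature against the target and rank the features by p-value. The helper `chi2_p_values`, the toy DataFrame, and the `Deck` column are hypothetical names introduced for illustration; they are not part of the Titanic notebook above.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_p_values(df, features, target):
    """Return the chi-square p-value of each categorical feature against the target."""
    p_values = {}
    for col in features:
        # Contingency table of feature categories against target classes.
        table = pd.crosstab(df[col], df[target])
        _, p, _, _ = chi2_contingency(table)
        p_values[col] = p
    # Smallest p-value first: strongest evidence of dependence on the target.
    return pd.Series(p_values).sort_values()


# Toy data (hypothetical): Pclass is related to Survived, Deck is not.
toy = pd.DataFrame({
    "Pclass": [1] * 20 + [3] * 20,
    "Deck": ["A", "B"] * 20,
    "Survived": [1] * 16 + [0] * 4 + [1] * 4 + [0] * 16,
})

ranking = chi2_p_values(toy, ["Pclass", "Deck"], "Survived")
print(ranking)
```

Features with a small p-value are candidates to keep. The test only indicates whether a dependence exists, not how strong it is, so it is best treated as a first filter.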