In [60]:
import pandas as pd
import numpy as np
import scipy.stats
from scipy.stats import pearsonr

# Chi square test for nominal data
* To test whether two nominal variables are associated, we use the chi-square test of independence
* The null hypothesis is that the two nominal variables are independent (no association between them)
<br>
<br>
We will consider an example mentioned in the textbook to understand the concept.<br>
Suppose a group of 1500 people was surveyed. The gender of each person was noted. Each person was polled as to whether his or her preferred type of reading material was fiction or non-fiction. 250 males preferred fiction while 50 preferred non-fiction. 200 females preferred fiction while 1000 preferred non-fiction.
<br>
We need to check whether gender and reading preference are independent or not.
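
For reference, the quantities computed in the cells that follow can be written as below. The symbols are notation introduced here for illustration: $o_{ij}$ is the observed count in row $i$ and column $j$, $e_{ij}$ is the count expected under independence, $n$ is the total count, and $r$ and $c$ are the numbers of rows and columns.

$$
e_{ij} = \frac{(\text{total of row } i) \times (\text{total of column } j)}{n}, \qquad
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(o_{ij}-e_{ij})^2}{e_{ij}}, \qquad
\text{df} = (r-1)(c-1)
$$

For the survey above, the expected count of males who prefer fiction is $450 \times 300 / 1500 = 90$.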

We first set attribute1, index, attribute2, columns and los (level of significance), which are used later when performing the hypothesis test.

In [61]:
attribute1 = 'reading preference'
index      = ['fiction','non_fiction']

attribute2 = 'gender'
columns    = ['males','females']

los        = 0.05

print('---------------------------------------------------------------------------------------------')
print(f'We have attributes - {attribute1} and {attribute2} both of which are nominal')
print('\n')
print(f'The values corresponding to {attribute1} is {index}')
print('\n')
print(f'The values corresponding to {attribute2} is {columns}')
print('\n')
print(f'The level of significance for the test is {los}')
print('---------------------------------------------------------------------------------------------')

---------------------------------------------------------------------------------------------
We have attributes - reading preference and gender both of which are nominal


The values corresponding to reading preference is ['fiction', 'non_fiction']


The values corresponding to gender is ['males', 'females']


The level of significance for the test is 0.05
---------------------------------------------------------------------------------------------


Update the values from the problem statement manually. These are the observed frequencies.

In [62]:
contingency_df                              = pd.DataFrame(columns = columns,index=index)
contingency_df.loc['fiction','males']       = 250
contingency_df.loc['fiction','females']     = 200
contingency_df.loc['non_fiction','males']   = 50
contingency_df.loc['non_fiction','females'] = 1000

In [63]:
print('-----------------------------------------------------------------------------------------------------------')
print('We have contingency matrix defined as ')
display(contingency_df)
print('\n')
observed_matrix                             = contingency_df.values

row_sum                                     = contingency_df.sum(axis=1).values
col_sum                                     = contingency_df.sum(axis=0).values
expected_matrix                             = np.zeros((contingency_df.shape[0],contingency_df.shape[1]))
for i in range(expected_matrix.shape[0]):
  for j in range(expected_matrix.shape[1]):
    expected_matrix[i][j]                   = (row_sum[i]*col_sum[j])/np.sum(row_sum)

print('-----------------------------------------------------------------------------------------------------------')
print('The matrix with expected frequencies is given as ')
display(pd.DataFrame(expected_matrix,columns = columns,index=index))
print('\n')

print('-----------------------------------------------------------------------------------------------------------')
test_statistic                               = 0
for i in range(expected_matrix.shape[0]):
  for j in range(expected_matrix.shape[1]):
    test_statistic                          += ((observed_matrix[i][j]-expected_matrix[i][j])**2)/expected_matrix[i][j]
print(f'The value of test statistic is {test_statistic}')

degree_of_freedom                            = (len(index)-1)*(len(columns)-1)
chisq_val                                    = scipy.stats.chi2.ppf(1-los, df=degree_of_freedom)
print(f'The value of chisquare at {degree_of_freedom} degrees of freedom is {chisq_val}')
print('\n')
print('------------------------------------------------------------------------------------------------------------')

print('Conclusion - ')
if test_statistic<chisq_val:
  print(f'We fail to reject the null hypothesis H0: {attribute1} and {attribute2} are independent at level of significance {los}')
else:
  print(f'We reject the null hypothesis H0 of independence: {attribute1} and {attribute2} are associated at level of significance {los}')
print('\n')
print('------------------------------------------------------------------------------------------------------------')

-----------------------------------------------------------------------------------------------------------
We have contingency matrix defined as 


| | males | females |
|---|---|---|
| fiction | 250 | 200 |
| non_fiction | 50 | 1000 |




-----------------------------------------------------------------------------------------------------------
The matrix with expected frequencies is given as 


| | males | females |
|---|---|---|
| fiction | 90.0 | 360.0 |
| non_fiction | 210.0 | 840.0 |




-----------------------------------------------------------------------------------------------------------
The value of test statistic is 507.93650793650795
The value of chisquare at 1 degrees of freedom is 3.841458820694124


------------------------------------------------------------------------------------------------------------
Conclusion - 
We reject the null hypothesis H0 of independence: reading preference and gender are associated at level of significance 0.05


------------------------------------------------------------------------------------------------------------
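
The manual loop above can be cross-checked against SciPy. The following is a minimal sketch, assuming `scipy.stats.chi2_contingency` is available in the installed SciPy version; the variable names (`observed`, `chi2_stat`, `p_value`, `reject_h0`) are chosen here for illustration. It also shows the equivalent decision rule based on the p-value rather than the critical value.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the survey: rows = fiction, non_fiction; columns = males, females
observed = np.array([[250, 200],
                     [50, 1000]])

# correction=False turns off Yates' continuity correction, which SciPy would
# otherwise apply to a 2x2 table, so the statistic matches the manual loop
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(f'Test statistic      : {chi2_stat}')
print(f'p-value             : {p_value}')
print(f'Degrees of freedom  : {dof}')
print('Expected frequencies:')
print(expected)

# Decision rule based on the p-value instead of the critical value
los = 0.05
reject_h0 = p_value < los
print(f'Reject H0 of independence at level {los}: {reject_h0}')
```

Comparing the p-value with the level of significance leads to the same decision as comparing the test statistic with the critical value, since both describe the same tail of the chi-square distribution.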


### Example: Vampires, Werewolves and Witches using the Chi-Square test

About 1000 downworlders (vampires, werewolves and witches) were interviewed, and their killings of humans and shadowhunters in the last 3 months were noted.
* 40 vampires killed humans, while 30 vampires were interested in killing shadowhunters
* About 15 werewolves killed humans compared to 20 killings of shadowhunters
* The witches killed about 3 humans and 2 shadowhunters.
<br>
One needs to find if there is an association between the killing preference and downworld category. 
<br>
We want to test the hypothesis that downworlders do not differentiate between humans and shadowhunters when killing them. 

In [64]:
attribute1 = 'downworld_category'
index      = ['vampire','werewolves','witches']

attribute2 = 'killing_preference'
columns    = ['humans','shadowhunters']

los        = 0.05

print('---------------------------------------------------------------------------------------------')
print(f'We have attributes - {attribute1} and {attribute2} both of which are nominal')
print('\n')
print(f'The values corresponding to {attribute1} is {index}')
print('\n')
print(f'The values corresponding to {attribute2} is {columns}')
print('\n')
print(f'The level of significance for the test is {los}')
print('---------------------------------------------------------------------------------------------')

---------------------------------------------------------------------------------------------
We have attributes - downworld_category and killing_preference both of which are nominal


The values corresponding to downworld_category is ['vampire', 'werewolves', 'witches']


The values corresponding to killing_preference is ['humans', 'shadowhunters']


The level of significance for the test is 0.05
---------------------------------------------------------------------------------------------


In [65]:
values                                      = np.array([[40,30],[15,20],[3,2]])

# Build the contingency table directly from the observed counts
contingency_df                              = pd.DataFrame(values,columns = columns,index=index)


In [66]:
print('-----------------------------------------------------------------------------------------------------------')
print('We have contingency matrix defined as ')
display(contingency_df)
print('\n')
observed_matrix                             = contingency_df.values

row_sum                                     = contingency_df.sum(axis=1).values
col_sum                                     = contingency_df.sum(axis=0).values
expected_matrix                             = np.zeros((contingency_df.shape[0],contingency_df.shape[1]))
for i in range(expected_matrix.shape[0]):
  for j in range(expected_matrix.shape[1]):
    expected_matrix[i][j]                   = (row_sum[i]*col_sum[j])/np.sum(row_sum)

print('-----------------------------------------------------------------------------------------------------------')
print('The matrix with expected frequencies is given as ')
display(pd.DataFrame(expected_matrix,columns = columns,index=index))
print('\n')

print('-----------------------------------------------------------------------------------------------------------')
test_statistic                               = 0
for i in range(expected_matrix.shape[0]):
  for j in range(expected_matrix.shape[1]):
    test_statistic                          += ((observed_matrix[i][j]-expected_matrix[i][j])**2)/expected_matrix[i][j]
print(f'The value of test statistic is {test_statistic}')

degree_of_freedom                            = (len(index)-1)*(len(columns)-1)
chisq_val                                    = scipy.stats.chi2.ppf(1-los, df=degree_of_freedom)
print(f'The value of chisquare at {degree_of_freedom} degrees of freedom is {chisq_val}')
print('\n')
print('------------------------------------------------------------------------------------------------------------')

print('Conclusion - ')
if test_statistic<chisq_val:
  print(f'We fail to reject the null hypothesis H0: {attribute1} and {attribute2} are independent at level of significance {los}')
else:
  print(f'We reject the null hypothesis H0 of independence: {attribute1} and {attribute2} are associated at level of significance {los}')
print('\n')
print('------------------------------------------------------------------------------------------------------------')

-----------------------------------------------------------------------------------------------------------
We have contingency matrix defined as 


| | humans | shadowhunters |
|---|---|---|
| vampire | 40 | 30 |
| werewolves | 15 | 20 |
| witches | 3 | 2 |




-----------------------------------------------------------------------------------------------------------
The matrix with expected frequencies is given as 


| | humans | shadowhunters |
|---|---|---|
| vampire | 36.909091 | 33.090909 |
| werewolves | 18.454545 | 16.545455 |
| witches | 2.636364 | 2.363636 |




-----------------------------------------------------------------------------------------------------------
The value of test statistic is 2.021599090564608
The value of chisquare at 2 degrees of freedom is 5.991464547107979


------------------------------------------------------------------------------------------------------------
Conclusion - 
We fail to reject the null hypothesis H0: downworld_category and killing_preference are independent at level of significance 0.05


------------------------------------------------------------------------------------------------------------
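
The nested loops used for the expected frequencies and the test statistic can also be written with array operations. The sketch below is one possible vectorized version, not a replacement for the cells above; the names `expected`, `p_value` and `small_cells` are chosen here for illustration. The final check reflects a commonly cited rule of thumb (expected counts of at least 5), which is an assumption added here and not something stated in the example.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts: rows = vampire, werewolves, witches; columns = humans, shadowhunters
observed = np.array([[40, 30],
                     [15, 20],
                     [3, 2]], dtype=float)

row_sum = observed.sum(axis=1)
col_sum = observed.sum(axis=0)
n = observed.sum()

# Expected count for cell (i, j) is row_sum[i] * col_sum[j] / n
expected = np.outer(row_sum, col_sum) / n

# Chi-square statistic summed over all cells
test_statistic = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Survival function: probability of a value at least this large under H0
p_value = chi2.sf(test_statistic, df=dof)

# Rule-of-thumb check (assumption): count cells with expected frequency below 5
small_cells = int((expected < 5).sum())

print(f'Test statistic: {test_statistic}, degrees of freedom: {dof}')
print(f'p-value: {p_value}')
print(f'Cells with expected frequency below 5: {small_cells}')
```

Here both cells in the witches row have expected frequencies below 5, so the chi-square approximation may be less reliable for this table than for the survey example.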


# Correlation coefficient for Numeric data
* We calculate the correlation coefficient for numeric data. 
* The correlation coefficient is also called Pearson's product-moment coefficient
* Scatter plots are usually used to view the correlation between attributes
* The value of the correlation coefficient always lies between -1 and 1
* A value close to 1 means a strong positive linear relationship
* A value close to -1 means a strong negative (inverse) linear relationship
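
The two quantities computed by hand in the next cell can be written as follows, using the population form (division by $n$) that the cell uses. Here $\bar{x}$ and $\bar{y}$ are the means and $\sigma_X$ and $\sigma_Y$ are the population standard deviations; this notation is introduced here for illustration.

$$
\mathrm{Cov}(X,Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), \qquad
r_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y}
$$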

<br>
We will try to understand the relationship between correlation and covariance using an example from the textbook

In [67]:
allelec  = [6,5,4,3,2]
hightech = [20,10,14,5,5]

df                            = pd.DataFrame(columns = ['AllElectronics(X)','HighTech(Y)'])
df['AllElectronics(X)']       = allelec
df['HighTech(Y)']             = hightech
df['X-mean(X)']               = df['AllElectronics(X)']-df['AllElectronics(X)'].mean()
df['Y-mean(Y)']               = df['HighTech(Y)']-df['HighTech(Y)'].mean()
df['(X-mean(X))*(Y-mean(Y))'] = df['X-mean(X)']*df['Y-mean(Y)']
display(df)
print('\n')
cov                           = df['(X-mean(X))*(Y-mean(Y))'].mean()
print(f'The covariance for the given data is {cov}')


df['(X-mean(X))^2']           = (df['AllElectronics(X)']-df['AllElectronics(X)'].mean())**2
df['(Y-mean(Y))^2']           = (df['HighTech(Y)']-df['HighTech(Y)'].mean())**2
print('\n')
display(df)

var1                         = df['(X-mean(X))^2'].mean()
var2                         = df['(Y-mean(Y))^2'].mean()

correl                       = cov/(np.sqrt(var1*var2))
print('\n')
print(f'The correlation between the two variables is {correl}')
print('\n')

| | AllElectronics(X) | HighTech(Y) | X-mean(X) | Y-mean(Y) | (X-mean(X))*(Y-mean(Y)) |
|---|---|---|---|---|---|
| 0 | 6 | 20 | 2.0 | 9.2 | 18.4 |
| 1 | 5 | 10 | 1.0 | -0.8 | -0.8 |
| 2 | 4 | 14 | 0.0 | 3.2 | 0.0 |
| 3 | 3 | 5 | -1.0 | -5.8 | 5.8 |
| 4 | 2 | 5 | -2.0 | -5.8 | 11.6 |




The covariance for the given data is 7.0




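| | AllElectronics(X) | HighTech(Y) | X-mean(X) | Y-mean(Y) | (X-mean(X))*(Y-mean(Y)) | (X-mean(X))^2 | (Y-mean(Y))^2 |
|---|---|---|---|---|---|---|---|
| 0 | 6 | 20 | 2.0 | 9.2 | 18.4 | 4.0 | 84.64 |
| 1 | 5 | 10 | 1.0 | -0.8 | -0.8 | 1.0 | 0.64 |
| 2 | 4 | 14 | 0.0 | 3.2 | 0.0 | 0.0 | 10.24 |
| 3 | 3 | 5 | -1.0 | -5.8 | 5.8 | 1.0 | 33.64 |
| 4 | 2 | 5 | -2.0 | -5.8 | 11.6 | 4.0 | 33.64 |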




The correlation between the two variables is 0.867442794919067




#### Calculating covariance using built-in functions

In [68]:
from numpy import cov

In [69]:
column1            = 'AllElectronics(X)'
column2            = 'HighTech(Y)'

print('-------------------------------------------------------------------------------------')
covariance_matrix  = cov(df[column1],df[column2])
print('The variance-covariance matrix between the two numerical attributes is given as - ')
display(covariance_matrix)
print('\n')
print(f'The covariance between {column1} and {column2} is {covariance_matrix[0,1]}')
print('\n')
print(f'The Variance of {column1} is {covariance_matrix[0,0]}')
print('\n')
print(f'The Variance of {column2} is {covariance_matrix[1,1]}')
print('\n')
correlation        = covariance_matrix[0,1]/np.sqrt(covariance_matrix[0,0]*covariance_matrix[1,1])
print(f'Correlation is {correlation} which is exactly the same as we obtained before')


-------------------------------------------------------------------------------------
The variance-covariance matrix between the two numerical attributes is given as - 


array([[ 2.5 ,  8.75],
       [ 8.75, 40.7 ]])



The covariance between AllElectronics(X) and HighTech(Y) is 8.75


The Variance of AllElectronics(X) is 2.5


The Variance of HighTech(Y) is 40.7


Correlation is 0.867442794919067 which is exactly the same as we obtained before
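
The manual calculation gave a covariance of 7.0, while the matrix above shows 8.75. The difference comes from normalization: the manual cell divides by n, whereas `np.cov` divides by n-1 by default. The sketch below illustrates this with the same data; the variable names `sample_cov`, `population_cov` and `r` are chosen here for illustration.

```python
import numpy as np

allelec  = [6, 5, 4, 3, 2]
hightech = [20, 10, 14, 5, 5]

# Default ddof=1: sample covariance, divides by n-1
sample_cov = np.cov(allelec, hightech)[0, 1]

# ddof=0: population covariance, divides by n (as the manual calculation does)
population_cov = np.cov(allelec, hightech, ddof=0)[0, 1]

# The normalization factor cancels in the correlation coefficient,
# so both conventions lead to the same correlation
r = np.corrcoef(allelec, hightech)[0, 1]

print(f'Sample covariance (ddof=1)     : {sample_cov}')
print(f'Population covariance (ddof=0) : {population_cov}')
print(f'Correlation                    : {r}')
```

Because the same factor appears in the covariance and in both variances, the correlation is unaffected by the choice of n or n-1.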


In [71]:
correlation,_ = pearsonr(df[column1], df[column2])
print(f'Correlation calculated with scipy.stats.pearsonr is {correlation}, which matches the two values obtained above')

Correlation calculated with scipy.stats.pearsonr is 0.867442794919067, which matches the two values obtained above


# Important note
* If two attributes are independent, then their covariance is 0
* However, a covariance of 0 does not necessarily mean independence
<br>
Let us demonstrate with the following example. <br>
Let x be a vector whose values are symmetric about 0 (so its mean is 0). Let y be a vector such that y=x^2.
<br>
Clearly x and y are dependent, since a quadratic relationship exists between them. However, the correlation between the two is 0, because correlation only measures linear association.

In [82]:
x = [-1,-3,-2,1,2,3]
y = [val**2 for val in x]
print(f'The two vectors are {x} and {y}')
correlation,_ = pearsonr(x,y)
print(f'The correlation between both the vectors is {round(correlation,3)}')

The two vectors are [-1, -3, -2, 1, 2, 3] and [1, 9, 4, 1, 4, 9]
The correlation between both the vectors is 0.0
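
The zero correlation above relies on more than the mean of x being 0. With a zero mean, the covariance between x and x^2 equals the mean of x^3, which vanishes for the vector above because its values are symmetric about 0. The sketch below contrasts that vector with a zero-mean vector that is not symmetric; `pearson_with_square` is a hypothetical helper written for this illustration, and `asymmetric_x` is made-up data.

```python
import numpy as np

def pearson_with_square(x):
    """Pearson correlation between x and x**2 (hypothetical helper for this sketch)."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x, x ** 2)[0, 1]

symmetric_x  = [-1, -3, -2, 1, 2, 3]   # mean 0 and symmetric about 0
asymmetric_x = [-3, -1, 0, 4]          # mean 0 but not symmetric about 0

r_symmetric  = pearson_with_square(symmetric_x)
r_asymmetric = pearson_with_square(asymmetric_x)

print(f'Symmetric x : correlation with x^2 = {round(r_symmetric, 3)}')
print(f'Asymmetric x: correlation with x^2 = {round(r_asymmetric, 3)}')
```

In both cases y is completely determined by x, yet only the asymmetric vector shows a nonzero correlation, which is why a correlation of 0 cannot be read as evidence of independence.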
