# Chi-Squared test of independence

## Michael NANA KAMENI

Following on from our earlier example: is the way a business collects marketing data independent of the marketing strategies it uses to target customers? We can use a chi-squared test to test for independence. We use data from [THE INFLUENCE OF MARKETING INTELLIGENCE ON PERFORMANCES OF ROMANIAN RETAILERS](http://conferinta.management.ase.ro/archives/2014/pdf/32.pdf). Romanian retailers want to promote a new eco-label product and are considering three marketing strategies: strategic, tactical and operational. Is the choice of strategy independent of the source from which the marketing information was collected?

### The data
The data comes from the [paper](http://conferinta.management.ase.ro/archives/2014/pdf/32.pdf). There are three different marketing strategies, namely strategic, tactical and operational. There are three sources of marketing intelligence, namely the retail sector, promotion events and information from competitors. 

In [1]:
import pandas as pd
import scipy.stats as stats

In [2]:
%matplotlib inline

In [3]:
raw_data = {'intelligence': ['retail', 'retail', 'retail', 'promotion', 'promotion', 'promotion', 'competitors', 'competitors', 'competitors'], 
        'strategy': ['strategic', 'tactical', 'operational', 'strategic', 'tactical', 'operational', 'strategic', 'tactical', 'operational'], 
        'scores':[13,9,17,8,12,15,3,7,6]}
data = pd.DataFrame(raw_data, columns = ['intelligence', 'strategy', 'scores'])
data

| | intelligence | strategy | scores |
|---|---|---|---|
| 0 | retail | strategic | 13 |
| 1 | retail | tactical | 9 |
| 2 | retail | operational | 17 |
| 3 | promotion | strategic | 8 |
| 4 | promotion | tactical | 12 |
| 5 | promotion | operational | 15 |
| 6 | competitors | strategic | 3 |
| 7 | competitors | tactical | 7 |
| 8 | competitors | operational | 6 |


Create a pivot table of the entries. The pivot table is the contingency table to which we apply the chi-squared test.

In [4]:
observed = data.pivot(index='intelligence', columns='strategy')
observed

| scores by intelligence / strategy | operational | strategic | tactical |
|---|---|---|---|
| competitors | 6 | 3 | 7 |
| promotion | 15 | 8 | 12 |
| retail | 17 | 13 | 9 |
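
The two-level column header in the output above appears because `pivot` was not told which column holds the values. As a minimal sketch, assuming the same counts as in the earlier cell, passing `values='scores'` yields a plain 3 x 3 table. The names `observed_flat`, `row_totals`, `col_totals` and `grand_total` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same counts as in the cell above, rebuilt so the sketch is self-contained.
raw_data = {
    'intelligence': ['retail'] * 3 + ['promotion'] * 3 + ['competitors'] * 3,
    'strategy': ['strategic', 'tactical', 'operational'] * 3,
    'scores': [13, 9, 17, 8, 12, 15, 3, 7, 6],
}
data = pd.DataFrame(raw_data)

# Naming the value column gives a table with single-level column labels.
observed_flat = data.pivot(index='intelligence', columns='strategy', values='scores')

# The margins (row and column totals) are what the expected counts are built from.
row_totals = observed_flat.sum(axis=1)
col_totals = observed_flat.sum(axis=0)
grand_total = int(observed_flat.to_numpy().sum())

print(observed_flat)
print(row_totals.to_dict(), col_totals.to_dict(), grand_total)
```

Either form of the table can be passed to `stats.chi2_contingency`; the flat version is simply easier to read and index.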


### The hypothesis

We state the null and alternative hypotheses as follows:

$H_0:$ the marketing strategy is independent of the source of marketing intelligence.

$H_1:$ the marketing strategy is not independent of the source of marketing intelligence.

We use a 5% significance level to test the null hypothesis.
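
For reference, the quantity computed by the test can be written out. The symbols are introduced here for illustration and are not taken from the paper: $O_{ij}$ is the observed count in row $i$ and column $j$, $E_{ij}$ the count expected under independence, $n$ the grand total, and $r$ and $c$ the number of rows and columns.

$$
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

Under $H_0$ the statistic approximately follows a chi-squared distribution with $(r-1)(c-1)$ degrees of freedom, which is $(3-1)(3-1) = 4$ for the 3 x 3 table above.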

In [5]:
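# chi2_contingency returns the statistic, p-value, degrees of freedom and expected counts
chi2, p, dof, expected = stats.chi2_contingency(observed=observed)
print('Test statistic', chi2)
print('p-value', p)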

Test statistic 3.0657003945302814
p-value 0.5468914357195986
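
To see where these numbers come from, the calculation can be reproduced by hand. This is a minimal sketch in plain Python, assuming the counts from the pivot table above. The variable names are illustrative, and the closed-form p-value is a shortcut that holds only for exactly 4 degrees of freedom.

```python
import math

# Observed counts from the pivot table above
# (rows: competitors, promotion, retail; columns: operational, strategic, tactical).
observed_counts = [
    [6, 3, 7],
    [15, 8, 12],
    [17, 13, 9],
]

row_totals = [sum(row) for row in observed_counts]
col_totals = [sum(col) for col in zip(*observed_counts)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected_counts = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson's statistic: sum of (observed - expected)^2 / expected over all cells.
chi2_by_hand = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed_counts, expected_counts)
    for o, e in zip(obs_row, exp_row)
)

dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# For exactly 4 degrees of freedom the chi-squared upper tail has the closed
# form exp(-x/2) * (1 + x/2); this shortcut does not hold for other dof.
assert dof == 4
p_by_hand = math.exp(-chi2_by_hand / 2) * (1 + chi2_by_hand / 2)

print('Test statistic', chi2_by_hand)
print('p-value', p_by_hand)
print('Smallest expected count', min(min(row) for row in expected_counts))
```

Both values should agree with the SciPy output above up to floating-point rounding. The smallest expected count is about 4.27, slightly below the common rule of thumb of 5 per cell, so the chi-squared approximation is somewhat rough for this table.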


### The results

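With a p-value of about 0.547, well above the 5% significance level, we fail to reject the null hypothesis. The data provide no evidence that the marketing strategy depends on the source of marketing intelligence. Note that this is an absence of evidence against independence rather than strong support for it.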

### Exercise

Now that you have seen a chi-square test for independence applied to the marketing strategy example, try another example to check your understanding. In this exercise, two candidates stood for election. The table shows the number of people who voted for each candidate, by age group. Answer the questions that follow to apply the chi-square test for independence yourself.

| Age group | CANDIDATE A | CANDIDATE B |
|---|---|---|
| 18-25 | 2670 | 1560 |
| 25-40 | 13578 | 4121 |
| 40-60 | 13578 | 4121 |
| OVER 60 | 13578 | 4121 |

#### Question 1
Determine whether the two categorical variables are independent or whether there is an association.

#### Question 1.1
State the null and alternative hypotheses.

We state the null and alternative hypotheses as follows:
- $H_0$: The candidate voted for is independent of the voter's age group.
- $H_1$: The candidate voted for is not independent of the voter's age group.

We use a 5% significance level to test the null hypothesis.

#### Question 1.2
Conduct the hypothesis test and interpret your results.

In [6]:
# Load the vote counts into a DataFrame
raw_data = {'Interval of electors Age': ['18-25', '18-25', '25-40', '25-40', '40-60', '40-60', 'OVER 60', 'OVER 60'], 
        'Candidates': ['CANDIDATE A', 'CANDIDATE B', 'CANDIDATE A', 'CANDIDATE B', 'CANDIDATE A', 'CANDIDATE B', 'CANDIDATE A', 'CANDIDATE B'], 
        'scores':[2670,1560,13578,4121,13578,4121,13578,4121]}
data = pd.DataFrame(raw_data, columns = ['Interval of electors Age', 'Candidates', 'scores'])
data

| | Interval of electors Age | Candidates | scores |
|---|---|---|---|
| 0 | 18-25 | CANDIDATE A | 2670 |
| 1 | 18-25 | CANDIDATE B | 1560 |
| 2 | 25-40 | CANDIDATE A | 13578 |
| 3 | 25-40 | CANDIDATE B | 4121 |
| 4 | 40-60 | CANDIDATE A | 13578 |
| 5 | 40-60 | CANDIDATE B | 4121 |
| 6 | OVER 60 | CANDIDATE A | 13578 |
| 7 | OVER 60 | CANDIDATE B | 4121 |


Next, create a pivot table. The pivot table is the contingency table to which we apply the chi-square test.

In [7]:
observed = data.pivot(index='Interval of electors Age', columns='Candidates')
observed

| scores by Interval of electors Age / Candidates | CANDIDATE A | CANDIDATE B |
|---|---|---|
| 18-25 | 2670 | 1560 |
| 25-40 | 13578 | 4121 |
| 40-60 | 13578 | 4121 |
| OVER 60 | 13578 | 4121 |


In [8]:
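# chi2_contingency returns the statistic, p-value, degrees of freedom and expected counts
chi2, p, dof, expected = stats.chi2_contingency(observed=observed)
print('Test statistic', chi2)
print('p-value', p)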

Test statistic 393.82723297578605
p-value 4.810547035582015e-85
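
As a cross-check, the statistic for the exercise can also be computed by hand. This is a sketch in plain Python, assuming the vote counts from the exercise table. The names `votes` and `share_for_a` are illustrative, and the per-group vote shares are printed only to show where the association comes from.

```python
# Vote counts from the exercise table (columns: candidate A, candidate B).
votes = {
    '18-25': (2670, 1560),
    '25-40': (13578, 4121),
    '40-60': (13578, 4121),
    'OVER 60': (13578, 4121),
}

col_totals = [sum(col) for col in zip(*votes.values())]
n = sum(col_totals)

chi2_by_hand = 0.0
share_for_a = {}
for age_group, counts in votes.items():
    row_total = sum(counts)
    # Share of this age group that voted for candidate A.
    share_for_a[age_group] = counts[0] / row_total
    for observed_count, col_total in zip(counts, col_totals):
        # Expected count under independence: row total * column total / grand total.
        expected_count = row_total * col_total / n
        chi2_by_hand += (observed_count - expected_count) ** 2 / expected_count

dof = (len(votes) - 1) * (len(col_totals) - 1)

print('Test statistic', chi2_by_hand)
print('Degrees of freedom', dof)
for age_group, share in share_for_a.items():
    print(age_group, round(share, 3))
```

The 18-25 group gives candidate A about 63% of its votes, against about 77% in each of the other groups. That gap, together with the very large sample, is what drives the large statistic.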


The result supports rejecting the null hypothesis, since $p \approx 4.81 \times 10^{-85}$ is far below 0.05. This means that the candidate voted for and the voter's age group are associated.

Now that you have completed this exercise, what are some applications in your work in which this test would be useful? Discuss below.

## [Chi-square test and its application in hypothesis testing](https://www.researchgate.net/publication/277935900_Chi-square_test_and_its_application_in_hypothesis_testing)

In medical research, there are studies which often collect data on categorical variables that can be summarized as a series of counts. These counts are commonly arranged in a tabular format known as a contingency table. The chi-square test statistic can be used to evaluate whether there is an association between the rows and columns in a contingency table. More specifically, this statistic can be used to determine whether there is any difference between the study groups in the proportions of the risk factor of interest. Chi-square test and the logic of hypothesis testing were developed by Karl Pearson. This article describes in detail what is a chi-square test, on which type of data it is used, the assumptions associated with its application, how to manually calculate it and how to make use of an online calculator for calculating the Chi-square statistics and its associated P-value.

## [Chi-Square Test for Goodness of Fit in a Plant Breeding](https://passel2.unl.edu/view/lesson/9beaa382bf7e)
In this lesson's case study example, tomato breeders are trying to incorporate genes which code for resistance to bacterial spot disease into cultivated tomato lines. One gene which has been characterized is Rx-4. A tomato line, 6.8068, was chosen as one parent for use in a crossing scheme because it is resistant to bacterial spot and contains this gene. This line also closely resembles cultivated tomato, therefore not as much backcrossing would be needed to remove unwanted traits. OH88119 was chosen as the other parent because even though susceptible to bacterial spot, it was a suitable parent for commercial hybrids due to other desirable traits in its genome such as fruit size, fruit color, time to maturity,...