<a href="https://colab.research.google.com/github/sergey-jr/Interactive-Statistics-Notebooks/blob/master/Chi_square_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# METADATA

- [Fastpages](https://fastpages.fast.ai/fastpages/jupyter/2020/02/21/introducing-fastpages.html) - a serving solution that can present/serve notebooks as blog posts, with code highlighting, visualization support, etc. (e.g. [1](https://drscotthawley.github.io/devblog3/2019/02/08/My-1st-NN-Part-3-Multi-Layer-and-Backprop.html)) - maybe too complex for our needs so far

- [This blog post](https://towardsdatascience.com/interactive-controls-for-jupyter-notebooks-f5c94829aee6) explains how to use IPyWidgets for interactive controls

- Awesome visualization library [Altair](https://altair-viz.github.io/gallery/index.html)

In [0]:
! pip install ipywidgets
! jupyter nbextension enable --py widgetsnbextension

import ipywidgets as widgets
from ipywidgets import interact, interact_manual

# Chi-square test

### Table of contents
1.   Intro
2.   Setup
3.   Examples
4.   Applications
5.   Conclusion

## Intro

The term "chi-squared test," also written as χ<sup>2</sup> test, refers to certain types of statistical hypothesis tests that are valid to perform when the test statistic is chi-squared distributed under the null hypothesis. Often, however, the term is used to refer to Pearson's chi-squared test and variants thereof. Pearson's chi-squared test is used to determine whether there is a statistically significant difference (i.e., a magnitude of difference that is unlikely to be due to chance alone) between the expected frequencies and the observed frequencies in one or more categories of a so-called contingency table.
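
For reference, Pearson's statistic can be written as follows. The symbols are notation chosen for this sketch: $O_i$ is the observed count in category $i$, $E_i$ is the count expected under the null hypothesis, and $k$ is the number of categories.

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

Under the null hypothesis, and when the expected counts are large enough, this statistic approximately follows a chi-squared distribution with $k - 1 - c$ degrees of freedom, where $c$ is the number of distribution parameters estimated from the data.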

There are two main types of chi-square tests. Both use the chi-square statistic and distribution, but for different purposes:

- A chi-square goodness of fit test determines whether sample data match a hypothesized population distribution.
- A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether distributions of categorical variables differ from one another.
 - A very small chi-square test statistic means that the observed data fit the expected data well. Because the expected counts are computed under the null hypothesis (independence, or the hypothesized distribution), a small statistic gives no evidence against it.
 - A very large chi-square test statistic means that the observed data do not fit the expected data well. In a test for independence, this is evidence that the variables are related.
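
The test for independence can be sketched with `scipy.stats.chi2_contingency`. The 2x2 table below is made up purely for illustration, and `correction=False` is chosen here so that the result matches the plain Pearson formula rather than the Yates-corrected one.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table (made-up counts for illustration):
# rows = group A / group B, columns = outcome yes / outcome no
observed = np.array([[20, 30],
                     [30, 20]])

# correction=False disables Yates' continuity correction, so the statistic
# is the plain Pearson chi-square
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

print("statistic:", chi2_stat)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected counts under independence:")
print(expected)
```

Every expected count is 25 here, so the statistic is 4 · (5² / 25) = 4.0 with one degree of freedom, which is just large enough to reject independence at the 5% level.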

## Setup


In [0]:
import numpy as np
import scipy.stats as stats
import pandas as pd
import json
import matplotlib.pyplot as plt


## Examples

Example 1.

256 visual artists were surveyed to find out their zodiac sign. The results were: Aries (29), Taurus (24), Gemini (22), Cancer (19), Leo (21), Virgo (18), Libra (19), Scorpio (20), Sagittarius (23), Capricorn (18), Aquarius (20), Pisces (23). Test the hypothesis that zodiac signs are evenly distributed across visual artists.

In [0]:
expect = 256 / 12  # expected count per sign if the 12 signs are evenly distributed
# one row per sign: [name, observed count, expected count]
data = [["Aries", 29, expect], ["Taurus", 24, expect], ["Gemini", 22, expect], ["Cancer", 19, expect], ["Leo", 21, expect], 
        ["Virgo", 18, expect], ["Libra", 19, expect], ["Scorpio", 20, expect], ["Sagittarius", 23, expect], 
        ["Capricorn", 18, expect], ["Aquarius", 20, expect], ["Pisces", 23, expect]] 

In [0]:
df = pd.DataFrame(data, columns=["zodiac sign", "Observed", "Expected"])
df["Observed"] = df["Observed"].astype(np.float32)
df["Expected"] = df["Expected"].astype(np.float32)

In [17]:
@interact
def check_chisquare(alpha=np.arange(0, 0.101, 0.001)):
  # The expected counts are fixed in advance (no parameters are estimated
  # from the data), so the degrees of freedom are (number of categories - 1).
  degrees_freedom = df.shape[0] - 1
  chi_calc = (((df["Observed"].values - df["Expected"].values)**2)/df["Expected"].values).sum()
  # critical value of the chi-square distribution at significance level alpha
  krit = stats.chi2.ppf(1 - alpha, degrees_freedom)
  if chi_calc < krit:
    return "Failed to reject the null hypothesis"
  return "Null hypothesis can be rejected"

interactive(children=(Dropdown(description='alpha', options=(0.0, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.…
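
As a cross-check of Example 1, the same test can be run with `scipy.stats.chisquare`. This is a minimal sketch rather than part of the original notebook; the 5% significance level is an arbitrary choice for illustration.

```python
from scipy import stats

# Observed counts for the 12 zodiac signs, Aries through Pisces
observed = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]

# With f_exp omitted, chisquare assumes all categories are equally likely
stat, p_value = stats.chisquare(observed)

dof = len(observed) - 1                   # 12 categories, nothing estimated
critical = stats.chi2.ppf(1 - 0.05, dof)  # critical value at alpha = 0.05

print("statistic:", stat)
print("p-value:", p_value)
print("critical value:", critical)
```

The statistic is about 5.09, well below the 5% critical value of about 19.68 for 11 degrees of freedom, so the hypothesis of evenly distributed signs is not rejected.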

Example 2. 

We have several datasets, each consisting of about 250 elements, and we want to check which distribution each one follows. The code below tests each dataset against the distribution named in its file: log-normal, exponential, or uniform.

In [18]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1DDUaP-_eHH5a1P2AcCpjqO61oFT_kHzn' -O datasets.zip
!unzip datasets.zip -d ./datasets

--2020-03-31 16:48:19--  https://docs.google.com/uc?export=download&id=1DDUaP-_eHH5a1P2AcCpjqO61oFT_kHzn
Resolving docs.google.com (docs.google.com)... 74.125.203.100, 74.125.203.138, 74.125.203.102, ...
Connecting to docs.google.com (docs.google.com)|74.125.203.100|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-10-a8-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/mc9ud51egioq44c5845pd6ov1dop4qnc/1585673250000/02153115373879107722/*/1DDUaP-_eHH5a1P2AcCpjqO61oFT_kHzn?e=download [following]
--2020-03-31 16:48:20--  https://doc-10-a8-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/mc9ud51egioq44c5845pd6ov1dop4qnc/1585673250000/02153115373879107722/*/1DDUaP-_eHH5a1P2AcCpjqO61oFT_kHzn?e=download
Resolving doc-10-a8-docs.googleusercontent.com (doc-10-a8-docs.googleusercontent.com)... 108.177.125.132, 2404:6800:4008:c01::84
Connecting to doc-10-a8-docs.googleusercontent.com (doc-10-a8-

In [0]:
def set_narrow(arr, arr1, min_count=5):
    """Merge histogram bins until every bin holds at least min_count points.

    arr  - bin counts
    arr1 - bin edges (one element longer than arr)
    """
    counts = list(arr)
    edges = list(arr1)
    i = 0
    while i < len(counts) and len(counts) > 1:
        if counts[i] >= min_count:
            i += 1
        elif i == 0:
            # merge the first bin into its right neighbour
            first = counts.pop(0)
            counts[0] += first
            edges.pop(1)  # drop the edge the two bins shared
        else:
            # merge bin i into its left neighbour
            small = counts.pop(i)
            counts[i - 1] += small
            edges.pop(i)  # drop the edge the two bins shared
    return np.array(counts), np.array(edges)

In [20]:
@interact
def check_distribution(variant=range(1, 16), alpha=np.arange(0.001, 0.101, 0.001)):
  with open(f'./datasets/var{variant}.json') as f:
    data = json.load(f)
  x = np.array(data['x'], dtype=float)
  n = len(x)  # number of points
  m = int(round(3.32 * np.log10(n) + 1))  # number of bins (Sturges' rule)
  p, intervals = np.histogram(x, m)  # p - number of points that fall into each bin
  a, b = x.min(), x.max()
  print("The original set:")
  print("p=", p)
  print("delta=", intervals)
  print("m=", m)
  p, delta = set_narrow(p, intervals)  # merge bins that hold fewer than 5 points
  m1 = len(p)
  print("Narrowed set:")
  print("p=", p)
  print("delta=", delta)
  print("m1=", m1)

  # plot the histogram as a count density (count / bin width)
  widths = np.diff(delta)
  X = (delta[:-1] + delta[1:]) / 2
  Y = p / widths
  fig = plt.figure(dpi=100)
  plt.bar(X, Y, widths)

  # extend the outer bins so that together the bins cover the whole real line
  delta[0] = -np.inf
  delta[-1] = np.inf
  # choose the hypothesized distribution and estimate its parameters from the data
  if data['low'] == 'lognorm':
      mu, sigma = np.log(x).mean(), np.sqrt(np.log(x).var())
      dist = stats.lognorm(sigma, scale=np.exp(mu))
      n_params = 2
      print(mu, sigma)
  elif data['low'] == 'exp':
      la = 1 / x.mean()
      dist = stats.expon(scale=1 / la)
      n_params = 1
      print(la)
  else:
      # scipy's uniform takes (loc, scale), i.e. the interval [a, a + (b - a)]
      dist = stats.uniform(a, b - a)
      n_params = 2
      print(a, b)
  # expected counts: probability of each bin under dist, multiplied by n
  nt = np.diff(dist.cdf(delta)) * n

  # chi-square statistic over all m1 bins
  chi = ((p - nt) ** 2 / nt).sum()
  # critical value; degrees of freedom = bins - 1 - number of estimated parameters
  krit = stats.chi2.ppf(1 - alpha, m1 - 1 - n_params)
  # overlay the fitted density, scaled to counts
  h = 10 ** -3
  r = np.arange(a - h * 2, b + h * 2, h)
  y1 = dist.pdf(r) * n
  plt.plot(r, y1, linewidth=2, color='y')
  plt.show()
  
  if chi < krit:
    print("Data are consistent with the {} distribution".format(data['low']))
  else:
    print("Data are not consistent with the {} distribution".format(data['low']))

interactive(children=(Dropdown(description='variant', options=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
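
The same goodness-of-fit procedure can be tried without the downloaded datasets. The sketch below is an illustration under stated assumptions, not the notebook's method: the sample is synthetic (drawn from an exponential distribution with an arbitrary seed and scale), and it uses equal-probability bins instead of merged histogram bins, so every expected count is well above 5.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)             # fixed seed, arbitrary choice
x = rng.exponential(scale=2.0, size=250)   # synthetic sample, not the notebook's data

# Fit the exponential distribution by its sample mean (one estimated parameter)
dist = stats.expon(scale=x.mean())

# Equal-probability bins: the interior edges are quantiles of the fitted distribution
k = 8
interior_edges = dist.ppf(np.arange(1, k) / k)
observed = np.bincount(np.searchsorted(interior_edges, x), minlength=k)
expected = np.full(k, len(x) / k)

# ddof=1 removes one extra degree of freedom for the estimated scale parameter
stat, p_value = stats.chisquare(observed, expected, ddof=1)
dof = k - 1 - 1

print("observed:", observed)
print("expected:", expected)
print("statistic:", stat, "p-value:", p_value, "dof:", dof)
```

Because the bins are defined by quantiles, each expected count is simply `n / k`, and the degrees of freedom are the number of bins minus one, minus the number of fitted parameters.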

## Applications

In cryptanalysis, the chi-squared test is used to compare the letter distribution of plaintext with that of (possibly) decrypted ciphertext. The candidate decryption with the lowest value of the test statistic is the one most likely to be correct. This method can be generalized for solving modern cryptographic problems.
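
A minimal sketch of this idea for a Caesar (shift) cipher: try all 26 shifts and keep the one whose letter counts give the smallest chi-squared statistic against typical English letter frequencies. The frequency table holds approximate, commonly quoted values, and the function names and sample sentence are hypothetical, chosen only for this illustration.

```python
import string

# Approximate relative frequencies (percent) of the letters a-z in English text
ENGLISH_FREQ = [8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
                0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
                6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074]

def shift_text(text, shift):
    """Shift lowercase letters by `shift` positions; leave other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord('a') + shift) % 26 + ord('a')))
        else:
            out.append(ch)
    return "".join(out)

def chi_squared_vs_english(text):
    """Chi-squared statistic of the text's letter counts against English frequencies."""
    letters = [ch for ch in text if ch in string.ascii_lowercase]
    n = len(letters)
    total = 0.0
    for i, letter in enumerate(string.ascii_lowercase):
        expected = ENGLISH_FREQ[i] / 100 * n
        observed = letters.count(letter)
        total += (observed - expected) ** 2 / expected
    return total

def crack_caesar(ciphertext):
    """Return the shift whose decryption looks most like English."""
    return min(range(26),
               key=lambda s: chi_squared_vs_english(shift_text(ciphertext, -s)))

plaintext = ("statistics is the science of learning from data and of measuring "
             "controlling and communicating uncertainty")
ciphertext = shift_text(plaintext, 3)
recovered = crack_caesar(ciphertext)
print(recovered, shift_text(ciphertext, -recovered))
```

Wrong shifts move common letters onto rare ones such as `q`, `x`, or `z`, whose expected counts are tiny, so their statistics are much larger than that of the correct shift.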

In bioinformatics, the chi-squared test is used to compare the distribution of certain properties of genes (e.g., genomic content, mutation rate, interaction network clustering, etc.) belonging to different categories (e.g., disease genes, essential genes, genes on a certain chromosome, etc.).

## Conclusion

Two potential disadvantages of chi square are:

- The chi square test can only be used for data put into classes (bins). If you have non-binned data you’ll need to make a frequency table or histogram before performing the test.
- Another disadvantage of the chi-square test is that it requires a sufficient sample size in order for the chi-square approximation to be valid.
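
One way to check the sample-size condition before running a test is to look at the smallest expected count. The sketch below uses `scipy.stats.contingency.expected_freq`. The helper name and both tables are hypothetical, and the threshold of 5 is the common rule of thumb, the same one used for merging bins in Example 2.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def smallest_expected_count(table):
    """Return the smallest expected cell count under independence."""
    return expected_freq(np.asarray(table)).min()

small_table = [[3, 7], [2, 8]]       # hypothetical counts, 20 observations
large_table = [[30, 70], [20, 80]]   # same proportions, 200 observations

small_min = smallest_expected_count(small_table)
large_min = smallest_expected_count(large_table)

for name, value in [("small", small_min), ("large", large_min)]:
    verdict = "ok" if value >= 5 else "too small for the chi-square approximation"
    print(name, value, verdict)
```

The smaller table has expected counts as low as 2.5, so the chi-square approximation is doubtful there; with ten times the data the smallest expected count is 25.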

## References / Acknowledgements

1. https://en.wikipedia.org/wiki/Chi-squared_test
2. https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/chi-square/
3. https://www.statisticshowto.datasciencecentral.com/goodness-of-fit-test/