<img src="http://i.imgur.com/7fYntuW.jpg">

# 1. <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared tests</a>, <a href="https://ko.wikipedia.org/wiki/%EC%B9%B4%EC%9D%B4%EC%A0%9C%EA%B3%B1_%EA%B2%80%EC%A0%95#cite_note-sugeun-1">Chi-squared test (Korean Wikipedia)</a>

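The chi-squared test is a statistical method based on the chi-squared distribution. <br>
It is used to <U>test whether observed frequencies differ significantly from expected frequencies</U>. <br>
It applies when the data are given as frequencies, and is used in particular for analyzing nominal-scale data. <br>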
***
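The chi-squared value is computed as $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ is the observed value and $E$ is the expected value of each category.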
***
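As a minimal sketch of the formula above, the statistic can be computed with a small helper. The function name `chi_squared_statistic` and the example counts are illustrative assumptions, not part of the original notebook.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical example: two categories with 8 and 2 observations,
# where 5 observations are expected in each.
example_stat = chi_squared_statistic([8, 2], [5, 5])
print(example_stat)
```

Each category contributes its own squared, scaled difference, so categories that deviate more from their expected count dominate the total.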
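### Types of test
* **Homogeneity (goodness-of-fit) test**: sets up the hypothesis that <U>'the distribution of the variable is the same as a binomial or normal distribution'</U>. <br>
    It is used to test whether a sample drawn from a population is representative of that population.
* **Independence test**: used when there are two or more variables. <br>
    The expected frequencies are those expected if <U>'the two variables are unrelated and independent'</U>, <br>
    and the difference from the observed frequencies is used to judge whether that expectation holds.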
***
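### Basic assumptions
* The dependent variable must be a <U>qualitative or categorical variable</U> measured on a nominal scale.
* The sample must be <U>drawn at random</U> from the population.
* The frequency expected to fall in each category is called the expected frequency, and this <U>expected frequency must be at least 5</U>. If it is below 5, the number of cases should be increased.
* The <U>frequency in each cell must be independent</U> of the cases in the other cells.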
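
The expected-frequency assumption is easy to check before running a test. The sketch below is only an illustration: the helper name `meets_expected_frequency_rule`, its `minimum` parameter, and the example frequencies are all assumptions.

```python
def meets_expected_frequency_rule(expected, minimum=5):
    """Return True when every expected frequency is at least `minimum`."""
    return all(e >= minimum for e in expected)


# Hypothetical expected frequencies for two small tables.
all_cells_ok = meets_expected_frequency_rule([12.5, 7.5, 30.0])
one_cell_low = meets_expected_frequency_rule([12.5, 2.5, 35.0])
print(all_cells_ok, one_cell_low)
```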

In [30]:
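import pandas

# Note: if the CSV file has no header row, read_csv uses the first record as
# the header; passing header=None (with names=...) would keep that record.
usincome = pandas.read_csv("adult.data.csv")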
usincome.columns = ["age","workclass","fnlwgt","education","education_num","marital_status","occupation","relationship","race","sex","capital_gain","capital_loss","hours_per_week","native_country","high_income"]
usincome.index +=1
usincome.head(3)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K


### 1-1. Data and variables used
1994 US Census, US income and demographic statistics <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/">(data)</a> <br>
32,561 rows in total

* sex -- sex
***
* race -- race
***
* high_income -- whether the person earns more than 50K

### 1-2. Homogeneity (goodness-of-fit) test
<img src="http://i.imgur.com/EvjJqzF.jpg" title="source: imgur.com" />
<img src="http://i.imgur.com/hkSlLTB.jpg" title="source: imgur.com" />
<img src="http://i.imgur.com/kRvHSAs.jpg" title="source: imgur.com" />

In [40]:
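# Relative differences between observed and expected counts.
# They cancel out when summed, which is why the differences are squared below.
female_diff = (10771 - 16280.5) / 16280.5
male_diff = (21790 - 16280.5) / 16280.5

# Squared differences scaled by the expected count (16280.5 = 32561 / 2).
female_diff = (10771 - 16280.5) ** 2 / 16280.5
male_diff = (21790 - 16280.5) ** 2 / 16280.5
gender_chisq = female_diff + male_diff
gender_chisq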

3728.950615767329

#### Generating A Distribution
* Randomly generate 32561 numbers that range from 0-1.
* Based on the expected probabilities, assign Male or Female to each number.
* Compute the observed frequencies of Male and Female.
* Compute the chi-squared value and save it.
* Repeat several times.
* Create a histogram of all the chi-squared values.

In [107]:
from numpy.random import random
import matplotlib.pyplot as plt
%matplotlib inline

chi_squared_values = []
for i in range(1000):
    # Draw 32561 uniform numbers in [0, 1) and split them 50/50 into two groups.
    sequence = random((32561,))
    sequence[sequence < .5] = 0   # Male
    sequence[sequence >= .5] = 1  # Female
    male_count = len(sequence[sequence == 0])
    female_count = len(sequence[sequence == 1])
    # Chi-squared value of this simulated sample against the 50/50 expectation.
    male_diff = (male_count - 16280.5) ** 2 / 16280.5
    female_diff = (female_count - 16280.5) ** 2 / 16280.5
    chi_squared = male_diff + female_diff
    chi_squared_values.append(chi_squared)

#plt.hist(chi_squared_values)
chi_squared  # chi-squared value of the last simulated sample

0.962163324222229

<img src="http://i.imgur.com/WO05Ows.jpg" title="source: imgur.com" />

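The maximum value in the simulated chi-squared distribution is about 12. <br>
The chi-squared value we computed earlier, 3728, lies far outside this distribution, so the result is statistically significant.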
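
The comparison can also be expressed as an empirical p-value, that is, the share of simulated values at least as large as the observed one. The sketch below is an assumption-laden variant of the simulation: it draws the counts with `numpy`'s binomial generator instead of thresholding uniform numbers, and the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed so the sketch is repeatable
n_rows = 32561
expected_count = n_rows / 2          # 16280.5 per category under a 50/50 split
observed_chisq = 3728.950615767329   # value computed from the real counts

# Draw the simulated Female count for 1000 samples at once; the rest are Male.
female_counts = rng.binomial(n_rows, 0.5, size=1000)
male_counts = n_rows - female_counts
simulated = ((female_counts - expected_count) ** 2 / expected_count
             + (male_counts - expected_count) ** 2 / expected_count)

# Empirical p-value: share of simulated statistics at least as extreme as ours.
empirical_p = np.mean(simulated >= observed_chisq)
print(simulated.max(), empirical_p)
```

With only 1000 simulated samples, an empirical p-value of 0 means "smaller than roughly 1/1000", not exactly zero.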

#### Smaller Samples

32,561 rows -> 100 rows (keeping the same ratio of observed to expected values)

<img src="http://i.imgur.com/9y59Ls1.jpg">
<img src="http://i.imgur.com/PasKXIl.jpg" title="source: imgur.com" />

Dividing 32,561 rows by 100 gives 325.61.<br>
Multiplying the chi-squared value from the 100-row sample, 11.4522, by 325.61 <br>
gives back the chi-squared value computed originally, 3728.95.
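
The arithmetic can be checked with a few lines of code. This is a sketch that scales the Female and Male counts used earlier (10771 and 21790) down to a 100-row sample; the helper `chisq` and the variable names are illustrative assumptions.

```python
n_full = 32561
observed_full = [10771, 21790]            # Female, Male counts used earlier
expected_full = [n_full / 2, n_full / 2]  # 50/50 split

scale = n_full / 100                      # 325.61
observed_small = [o / scale for o in observed_full]
expected_small = [e / scale for e in expected_full]


def chisq(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


chisq_full = chisq(observed_full, expected_full)
chisq_small = chisq(observed_small, expected_small)
print(round(chisq_small, 4), round(chisq_small * scale, 2), round(chisq_full, 2))
```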

<img src="http://i.imgur.com/MgitBxa.jpg" title="source: imgur.com" />
<img src="http://i.imgur.com/40jKG4x.jpg" title="source: imgur.com" />
***
The chi-squared value is proportional to the sample size when the observed and expected proportions stay the same.

In [114]:
HT1 = (8-5)**2/5 + (2-5)**2/5
HT2 = (800-500)**2/500 + (200-500)**2/500
print(HT1, HT2)

3.6 360.0


#### Degrees of freedom

<img src="http://i.imgur.com/YWVT5kX.jpg" title="source: imgur.com" />
***
<img src="http://i.imgur.com/9y59Ls1.jpg">
<img src="http://i.imgur.com/rDRQ2uj.jpg" title="source: imgur.com" />
***

Rather than constructing another chi-squared sampling distribution for 4 degrees of freedom, we can use a function from the SciPy library to do it more quickly.

In [54]:
diffs = []
observed = [27816, 3124, 1039, 311, 271]
expected = [26146.5, 3939.9, 944.3, 260.5, 1269.8]

for i, obs in enumerate(observed):
    exp = expected[i]
    diff = (obs - exp) ** 2 / exp
    diffs.append(diff)
    
race_chisq = sum(diffs)
race_chisq

1080.485936593381

In [57]:
# Q) How can the execution times be compared?
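
One possible way to compare execution times is the standard-library `timeit` module (in a notebook, the `%timeit` magic serves the same purpose). The sketch below is an assumption about how such a comparison might look; the function names and the repeat count are illustrative, and the timings depend on the machine.

```python
import timeit

import numpy as np
from scipy.stats import chisquare

observed = [27816, 3124, 1039, 311, 271]
expected = [26146.5, 3939.9, 944.3, 260.5, 1269.8]
observed_arr = np.array(observed)
expected_arr = np.array(expected)


def manual_chisq():
    """Chi-squared statistic computed with a plain Python loop."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def scipy_chisq():
    """Chi-squared statistic computed by scipy.stats.chisquare."""
    return chisquare(observed_arr, expected_arr).statistic


# Run each version many times and report the total elapsed seconds.
manual_seconds = timeit.timeit(manual_chisq, number=1000)
scipy_seconds = timeit.timeit(scipy_chisq, number=1000)
print(f"manual loop: {manual_seconds:.4f}s, scipy: {scipy_seconds:.4f}s")
```

Note that `chisquare` also computes a p-value, so the two functions do not do exactly the same amount of work.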

In [69]:
from scipy.stats import chisquare
import numpy as np
observed = np.array([27816, 3124, 1039, 311, 271])
expected = np.array([26146.5, 3939.9, 944.3, 260.5, 1269.8])

chisquare_value, race_pvalue = chisquare(observed, expected)
[chisquare_value, race_pvalue]
chisquare(observed, expected)

Power_divergenceResult(statistic=1080.485936593381, pvalue=1.2848494674873035e-232)
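
To make the role of the degrees of freedom explicit, the p-value can be recomputed from the chi-squared distribution itself. This is a sketch using `scipy.stats.chi2.sf` (the survival function); the variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([27816, 3124, 1039, 311, 271])
expected = np.array([26146.5, 3939.9, 944.3, 260.5, 1269.8])

result = chisquare(observed, expected)
degrees_of_freedom = len(observed) - 1  # 5 categories -> 4 degrees of freedom

# Survival function: P(chi-squared variable >= statistic) under the null hypothesis.
manual_pvalue = chi2.sf(result.statistic, degrees_of_freedom)
print(degrees_of_freedom, manual_pvalue, result.pvalue)
```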

## 1-3. Independence test

Let's look at two categorical variables, sex and income, through a cross table.
***
<img src="http://i.imgur.com/rXBsUNI.jpg" title="source: imgur.com" />
***
Let's use an independence test to find out whether the two categorical variables are associated. <br>
We can apply the chi-squared test <br>
(also known as the chi-squared test of association)
***
<img src="http://i.imgur.com/oknJepd.jpg" title="source: imgur.com" />
***
-> Compute the expected frequencies, then sum
(observed - expected) ** 2 / expected over all cells

In [59]:
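# Expected counts under independence: P(sex) * P(income) * number of rows.
# .669 / .331 are the shares of males / females (21790 and 10771 of 32561);
# .241 / .759 are the shares earning >50K / <=50K.
males_over50k = .669 * .241 * 32561
males_under50k = .669 * .759 * 32561
females_over50k = .331 * .241 * 32561
females_under50k = .331 * .759 * 32561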

In [118]:
observed = [6662, 1179, 15128, 9592]
expected = [5249.8, 2597.4, 16533.5, 8180.3]
values = []

for i, obs in enumerate(observed):
    exp = expected[i]
    value = (obs - exp) ** 2 / exp
    values.append(value)

chisq_gender_income = sum(values)
chisq_gender_income

1517.5510981525103

In [119]:
import numpy as np
from scipy.stats import chisquare

observed = np.array([6662, 1179, 15128, 9592])
expected = np.array([5249.8, 2597.4, 16533.5, 8180.3])

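# A 2x2 table has (2-1)*(2-1) = 1 degree of freedom, but chisquare assumes k-1 = 3.
# ddof=2 adjusts the p-value accordingly; the statistic itself is unchanged.
chisq_value, pvalue_gender_income = chisquare(observed, expected, ddof=2)
chisquare(observed, expected, ddof=2)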

Power_divergenceResult(statistic=1517.5510981525103, pvalue=0.0)
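
As a cross-check, `scipy.stats.chi2_contingency` derives the expected frequencies and the degrees of freedom directly from a contingency table. The sketch below assumes the four observed counts form a 2x2 table with rows Male/Female and columns >50K/<=50K; that arrangement is inferred from the order of the expected values above. Because the function works from exact marginal totals rather than rounded proportions, its statistic is close to, but not identical with, the value computed by hand.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Assumed layout: rows are Male, Female; columns are >50K, <=50K.
table = np.array([[6662, 15128],
                  [1179, 9592]])

# correction=False turns off the Yates continuity correction so the statistic
# is the plain sum of (observed - expected)^2 / expected.
statistic, pvalue, dof, expected_freq = chi2_contingency(table, correction=False)
print(statistic, pvalue, dof)
print(expected_freq)
```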

<h2> 1-4. Jeopardy Questions </h2>

In [79]:
import pandas
import csv
jeopardy = pandas.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [80]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [81]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [82]:
import re

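def normalize_text(text):
    """Lowercase the text and strip everything except letters, digits, and whitespace."""
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

def normalize_values(text):
    """Strip punctuation such as "$" and "," and convert to int; return 0 on failure."""
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        text = 0
    return text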

In [83]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

In [84]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [85]:
jeopardy["Air Date"] = pandas.to_datetime(jeopardy["Air Date"])

In [86]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

In [87]:
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [88]:
jeopardy["answer_in_question"].mean()

0.060493257069335872

<h3> Answer terms in the question </h3>
<p> On average, only about 6% of the terms in an answer also appear in its question. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.</p>

In [89]:
question_overlap = []
terms_used = set()
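for i, row in jeopardy.iterrows():
    # Keep only terms longer than 5 characters to skip common short words.
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    # Count how many of these terms already appeared in earlier questions.
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    # Convert the count into a proportion of the question's terms.
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)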
jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

0.69087373156719623

<h3> Question overlap </h3>
<p> There is about 70% overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it doesn't look at phrases, it looks at single terms. This makes it relatively insignificant, but it does mean that it's worth looking more into the recycling of questions. </p>

In [90]:
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [91]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

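# Sets are unordered, so these five terms are an arbitrary sample.
comparison_terms = list(terms_used)[:5]
# Each entry is a (high_count, low_count) tuple of observed counts for one term.
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))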

observed_expected

[(0, 3), (2, 2), (0, 1), (1, 1), (1, 0)]

In [92]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=1.2058885383806519, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.88975496332255899, pvalue=0.34554371914834681),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047)]

<h3>Chi-squared results</h3>
<p>None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.</p>
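
One way to restrict the test to higher-frequency terms is sketched below. It is only an illustration: the toy question list stands in for `jeopardy["clean_question"]`, and the names `term_counts`, `min_questions`, and `frequent_terms` as well as the threshold are assumptions. On the real data the threshold would need to be high enough for the expected counts in both the high-value and low-value cells to reach 5.

```python
from collections import Counter

# Hypothetical cleaned questions standing in for jeopardy["clean_question"].
clean_questions = [
    "this country borders france and spain",
    "this country is home to the andes",
    "the capital of this country is paris",
    "this president signed the treaty",
    "this country exports coffee",
    "this president served two terms",
]

# Count in how many questions each term appears; set() avoids double counting
# a term that is repeated within one question.
term_counts = Counter()
for question in clean_questions:
    term_counts.update(set(question.split(" ")))

# Keep only longer terms that occur often enough; the threshold is arbitrary here.
min_questions = 2
frequent_terms = sorted(
    term for term, count in term_counts.items()
    if len(term) > 5 and count >= min_questions
)
print(frequent_terms)
```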