<a href="https://colab.research.google.com/github/Lord-of-Lite/6_Nations_Data_Challenge/blob/main/Scotland_Chi_Square.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 1. England v Scotland
## Colab, Github and a basic prediction

### Import libraries, load the data and clean it up

In [34]:
# WHAT THIS DOES
# Loads in the Pandas library and names it pd

import pandas as pd


In [35]:
# WHAT THIS DOES
# Reads the match data from the CSV file on GitHub and stores it in a dataframe called 'scotland'

address = 'https://raw.githubusercontent.com/Lord-of-Lite/6_Nations_Data_Challenge/main/England_Scotland_2021.csv'
scotland = pd.read_csv(address)

In [None]:
# WHAT THIS DOES
# Lets you have a look at the first 5 rows of the spreadsheet

scotland.head(5)


In [37]:
# WHAT THIS DOES
# Creates a dataframe called 'test' holding only the 'Result' and 'Ground' columns, which we'll use for our analysis

test = scotland[['Result','Ground']]

In [38]:
# WHAT THIS DOES
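# Keeps only the matches played at Twickenham or Murrayfield, using an 'or' condition (written as '|'),
# then counts how many matches were played at each ground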

test = test.loc[(test['Ground'] == 'Twickenham') | (test['Ground'] == 'Murrayfield')]
test.Ground.value_counts()

Twickenham     49
Murrayfield    47
Name: Ground, dtype: int64
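
The same filter can also be written with `isin`, which stays readable when there are more than two grounds to keep. This is a minimal sketch on a small made-up dataframe; the rows and the 'Other ground' label are hypothetical and are not taken from the match data.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only (not the real match data)
sample = pd.DataFrame({
    'Result': ['won', 'lost', 'draw', 'won'],
    'Ground': ['Twickenham', 'Murrayfield', 'Other ground', 'Twickenham'],
})

grounds = ['Twickenham', 'Murrayfield']

# isin() is True for every row whose Ground appears in the list
with_isin = sample.loc[sample['Ground'].isin(grounds)]

# The 'or' version used in the cell above, for comparison
with_or = sample.loc[(sample['Ground'] == 'Twickenham') | (sample['Ground'] == 'Murrayfield')]

# Both filters keep exactly the same rows
print(with_isin.equals(with_or))
```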

### Looking at summary statistics

In [40]:
# WHAT THIS DOES
# Creates a crosstab counting the number of matches for each result at each ground

number = pd.crosstab(test.Result, test.Ground)
print(number.round(2))


Ground  Murrayfield  Twickenham
Result                         
draw              4           6
lost             22           4
won              21          39


In [41]:
# WHAT THIS DOES
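# Creates a crosstab of row percentages: for each result, the share of matches played at each ground
# The 'All' row shows the share of all matches played at each ground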

percentage = pd.crosstab(test.Result, test.Ground, margins=True, normalize='index')
print((percentage*100).round(2))

Ground  Murrayfield  Twickenham
Result                         
draw          40.00       60.00
lost          84.62       15.38
won           35.00       65.00
All           48.96       51.04
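
The row percentages above answer "of all the draws, how many were played at each ground?". To compare the two grounds directly, it can help to take percentages by column instead. A minimal sketch, assuming the counts table is rebuilt by hand from the output above rather than loaded from the CSV:

```python
import pandas as pd

# Counts copied from the crosstab output above
number = pd.DataFrame(
    {'Murrayfield': [4, 22, 21], 'Twickenham': [6, 4, 39]},
    index=['draw', 'lost', 'won'],
)

# Divide each column by its total to get the share of each result at that ground
by_ground = number / number.sum(axis=0) * 100
print(by_ground.round(2))
```

With the real `test` dataframe, `pd.crosstab(test.Result, test.Ground, normalize='columns')` should give the same proportions.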


### Hypothesis test
#### Chi-squared test of independence

In [42]:
# WHAT THIS DOES
# Imports the chi2_contingency function from SciPy's stats module

from scipy.stats import chi2_contingency

In [43]:
# WHAT THIS DOES
# Runs the chi-squared test on the table of counts
# SciPy's chi2_contingency() returns four values: the χ² statistic, the p-value,
# the degrees of freedom and the table of expected counts

chi2_contingency(number)

(18.22778315909015,
 0.00011012531872014555,
 2,
 array([[ 4.89583333,  5.10416667],
        [12.72916667, 13.27083333],
        [29.375     , 30.625     ]]))
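
To see where these numbers come from, the expected counts and the χ² statistic can be reproduced by hand. The sketch below assumes the observed counts are typed in from the crosstab above. Each expected count is (row total × column total) / grand total, and the statistic is the sum of (observed − expected)² / expected over all cells. As far as I know, `chi2_contingency` only applies its continuity correction when there is one degree of freedom, so with two degrees of freedom the plain formula should match.

```python
import numpy as np

# Observed counts from the crosstab (rows: draw, lost, won; columns: Murrayfield, Twickenham)
observed = np.array([[4, 6], [22, 4], [21, 39]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()                                # grand total of matches

# Expected count in each cell if Result and Ground were independent
expected = row_totals * col_totals / n

# Chi-squared statistic: sum over all cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected.round(4))
print(round(chi2_stat, 4), dof)
```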

In [44]:
# WHAT THIS DOES
# Puts the expected counts into a labelled dataframe and rounds them to whole numbers

expected = chi2_contingency(number)[3]
pd.DataFrame(
    data=expected,
    index=['draw', 'lost', 'won'],
    columns=['Murrayfield', 'Twickenham']
).round(0)

      Murrayfield  Twickenham
draw          5.0         5.0
lost         13.0        13.0
won          29.0        31.0


In [45]:
# WHAT THIS DOES
# Finds the critical value of the chi-squared distribution
# The critical value depends on the level of significance and the degrees of freedom
# To find it, import chi2 from scipy.stats and turn the level of significance into a probability

from scipy.stats import chi2
significance = 0.01
p = 1 - significance
dof = chi2_contingency(number)[2]
critical_value = chi2.ppf(p, dof)
critical_value.round(2)

9.21
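
Comparing the statistic with the critical value and comparing the p-value with the level of significance are two ways of making the same decision. A minimal sketch, assuming the statistic and degrees of freedom are copied from the `chi2_contingency` output above; `chi2.sf` is the survival function (1 − CDF) of the chi-squared distribution.

```python
from scipy.stats import chi2

chi2_stat = 18.22778315909015  # statistic copied from the chi2_contingency output above
dof = 2                        # degrees of freedom from the same output
significance = 0.01

# p-value: probability of a statistic at least this large if H0 were true
p_value = chi2.sf(chi2_stat, dof)

# Critical value: the statistic that leaves 'significance' in the upper tail
critical_value = chi2.ppf(1 - significance, dof)

# The two decision rules agree
reject_by_p = p_value < significance
reject_by_critical = chi2_stat > critical_value
print(p_value, reject_by_p, reject_by_critical)
```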

In [46]:
# WHAT THIS DOES
# Tests the null hypothesis (H0: Result and Ground are independent)
# against the alternative hypothesis (H1: Result and Ground are not independent)

chi, pval, dof, exp = chi2_contingency(number)
print('p-value is: ', pval)
significance = 0.01
p = 1 - significance
critical_value = chi2.ppf(p, dof)
print('chi=%.6f, critical value=%.6f\n' % (chi, critical_value))
if chi > critical_value:
    print("""At %.2f level of significance, we reject the null hypothesis and accept H1.
Result and Ground are not independent.""" % (significance))
else:
    print("""At %.2f level of significance, we fail to reject the null hypothesis.
There is not enough evidence that Result and Ground are related.""" % (significance))


p-value is:  0.00011012531872014555
chi=18.227783, critical value=9.210340

At 0.01 level of significance, we reject the null hypothesis and accept H1.
Result and Ground are not independent.


## Useful links

If you'd like to explore further what we've learned this week, here are some useful links:

- [Gentle Introduction to Chi-Square Test for Independence](https://towardsdatascience.com/gentle-introduction-to-chi-square-test-for-independence-7182a7414a95#92ef)

- [Comparison of the Chi-Square Tests](https://philschatz.com/statistics-book/contents/m47082.html)

- [Pandas](https://pandas.pydata.org/pandas-docs/stable/)