# Hypothesis test
Author: Myron Kukhta (xkukht01)

### Dependency import

In [175]:
import pandas as pd
from scipy import stats

# Hypothesis test 1
### Task  
*"On first-class roads, accidents with personal injury consequences were equally likely as on the highways."*  

### Method:
*Chi-square test*

First, we load the statistical data on road accidents.

In [176]:
df = pd.read_pickle("accidents.pkl.gz")

Let's familiarise ourselves with the data structure and keep only the columns we are interested in. In this case:

- **p9** - Effect on:
  - `1` - health
  - `2` - property (material damage only)

- **p36** - Type of road:
  - `0` - Highway
  - `1` - First-class road
  - Other values will be removed.

For convenience, let's convert the data into a human-readable format.

In [177]:
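# Keep only highways (p36 == 0) and first-class roads (p36 == 1).
# .copy() avoids pandas' SettingWithCopyWarning on the assignments below.
df1 = df.loc[df['p36'].isin([0, 1]), ['p36', 'p9']].copy()
df1['Road type'] = df1['p36'].map({0: 'highways', 1: 'first class road'})
df1['With a health effect'] = (df1['p9'] == 1)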

To run the test, we must first transform our table into a contingency table of road type against health effect.


In [178]:
statistic_tab = pd.crosstab(index=df1['Road type'], columns=df1['With a health effect'])
statistic_tab

| Road type | With a health effect: False | With a health effect: True |
|---|---|---|
| first class road | 14773 | 7059 |
| highways | 6674 | 1247 |
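To make the `pd.crosstab` step easier to follow, here is a minimal sketch on a tiny, hypothetical set of records. The `toy` frame and its five rows are invented for illustration and are not taken from `accidents.pkl.gz`; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy records (NOT real accident data), one row per accident.
toy = pd.DataFrame({
    "Road type": ["highways", "highways", "first class road",
                  "first class road", "first class road"],
    "With a health effect": [False, True, True, False, True],
})

# Count how many records fall into each (road type, health effect) cell.
toy_tab = pd.crosstab(index=toy["Road type"],
                      columns=toy["With a health effect"])
print(toy_tab)
```

Each cell of the result is simply the number of rows sharing that combination of values, which is the input format a chi-square test on counts expects.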


The chi-square test of independence compares the observed counts in a contingency table of two categorical variables with the counts that would be expected if the two variables were independent.
In our case, the variables are the type of road and whether the accident had an effect on human health.
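For reference, the standard Pearson form of the statistic can be written as below. The symbols are notation introduced here, not names from the notebook: $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, $N$ is the grand total, and $r$ and $c$ are the numbers of rows and columns.

$$
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i,j} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}}, \qquad
\text{dof} = (r - 1)(c - 1)
$$

For a 2×2 table such as ours, $\text{dof} = 1$, and `stats.chi2_contingency` by default (`correction=True`) applies Yates' continuity correction, which replaces $\left|O_{ij} - E_{ij}\right|$ with $\left|O_{ij} - E_{ij}\right| - 0.5$ before squaring.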

In [179]:
test_statistic, p, dof, expect = stats.chi2_contingency(statistic_tab)
output_of_test = pd.DataFrame({
    'Metrics':['statistic', 'p-value', 'Degree of freedom'],
    'Value': [test_statistic, p, dof]
})
output_of_test

| | Metrics | Value |
|---|---|---|
| 0 | statistic | 794.1533 |
| 1 | p-value | 1.0075e-174 |
| 2 | Degree of freedom | 1.0 |
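As a sanity check, the reported statistic can be reproduced by hand. The sketch below is an illustration rather than part of the original analysis: it copies the four counts from the contingency table above into a NumPy array (the variable names are my own) and assumes SciPy's default behaviour of applying Yates' continuity correction to 2×2 tables.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table in this notebook:
# rows = (first class road, highways), columns = (False, True).
observed = np.array([[14773, 7059],
                     [6674, 1247]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts if road type and health effect were independent.
expected = row_totals * col_totals / total

# Yates' continuity correction: shrink each |O - E| by 0.5 before squaring.
# SciPy applies it by default (correction=True) when there is 1 degree of freedom.
chi2_manual = (((np.abs(observed - expected) - 0.5) ** 2) / expected).sum()

chi2_scipy, p_value, dof, expected_scipy = stats.chi2_contingency(observed)
print(chi2_manual, chi2_scipy, p_value, dof)
```

The manual value and the SciPy value should agree, which confirms that the statistic in the table above is the continuity-corrected one.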


As we can observe, the p-value (about 1.0e-174) is far below the 0.05 significance level, so we reject the null hypothesis that accidents with health consequences are equally likely on both road types. There is a statistically significant association between road type and whether an accident affects health.

In [180]:
if p > 0.05:
    print('On first-class roads, accidents with personal injury consequences were EQUALLY likely as on the highways.')
else:
    print('On first-class roads, accidents with personal injury consequences were NOT EQUALLY likely as on the highways.')



On first-class roads, accidents with personal injury consequences were NOT EQUALLY likely as on the highways.


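Comparing the observed counts with the expected ones (the difference is computed below) shows that first-class roads had about 964 more accidents with a health effect than independence would predict, and highways correspondingly fewer. In relative terms, about 32.3% of accidents on first-class roads affected health (7059 of 21832), against about 15.7% on highways (1247 of 7921). So *accidents on first-class roads are more often harmful to health* than accidents on highways. This is a surprising fact to me!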

In [181]:
statistic_tab - expect

| Road type | With a health effect: False | With a health effect: True |
|---|---|---|
| first class road | -964.266965 | 964.266965 |
| highways | 964.266965 | -964.266965 |
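The differences above are absolute counts, so they depend on how many accidents each road type has in total. A complementary view, sketched here as an illustration, is the share of accidents with a health effect within each road type. The counts are copied from the contingency table in this notebook; the frame name `counts` and the column labels are my own.

```python
import pandas as pd

# Counts copied from the contingency table in this notebook.
# Column labels are illustrative, not the notebook's own.
counts = pd.DataFrame(
    {"no health effect": [14773, 6674], "health effect": [7059, 1247]},
    index=["first class road", "highways"],
)

# Divide every row by its row total to get per-road-type shares.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(3))
```

Each row of `shares` sums to 1, so the `health effect` column can be compared directly between the two road types.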


# Hypothesis test 2
### Task  
*"On first-class roads, accidents with personal injury consequences were equally likely as on the highways."*  

### Method:
*Chi-square test*