# Hypothesis testing

In [None]:
import numpy as np
import pandas as pd

## Test description

We take the number of seconds a player spends on specific move intervals in blitz chess games (3 minutes per game). Because both intervals are measured within the same game, the data are paired. Using hypothesis testing, we want to determine on which interval the player spends more time and give a hint about which part of the game the player should work on more thoroughly.

## Entering data

The first_8 variable holds the time spent on moves 15-22, one value per game.

In [None]:
first_8 = [42,79,11,69,53,62,30,33,79,42,42,70,23,17,52,66,29,54,33,
     42,59,44,33,77,73,93,29,51,65,27,29,61,70,25,42,33,27,29]

first_8 = np.array(first_8)

The second_8 variable holds the time spent on moves 23-30 in the same games.

In [None]:
second_8 = [23,15,69,33,49,29,54,79,20,38,64,22,24,42,58,32,31,76,31,
           30,54,53,58,42,33,38,14,57,41,16,41,35,22,57,29,21,33,30]
second_8 = np.array(second_8)

The times are given in seconds, for example:

42: 42 seconds

79: 1 minute 19 seconds
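
To make the unit convention concrete, the small helper below converts a raw number of seconds into the minutes-and-seconds reading used in the two examples above. The function name `format_seconds` is hypothetical and not part of the original notebook; it is only a sketch.

```python
def format_seconds(total_seconds):
    """Render a duration in seconds as text such as '1 minute 19 seconds'."""
    minutes, seconds = divmod(int(total_seconds), 60)
    if minutes == 0:
        return f"{seconds} seconds"
    unit = "minute" if minutes == 1 else "minutes"
    return f"{minutes} {unit} {seconds} seconds"


# The two example values from the text
print(format_seconds(42))
print(format_seconds(79))
```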

### Constructing a pandas dataframe

The 15_22 column contains the time spent on moves 15-22 in a game, and the 23_30 column contains the time spent on moves 23-30 **in the same game**.
Thus our data is paired, which lets us analyze the per-game differences directly and reach useful conclusions.

In [None]:
time_dict = {'15_22':first_8,
             '23_30':second_8}

data = pd.DataFrame(time_dict)

data['diff'] = data['15_22'] - data['23_30']
data.head()

| | 15_22 | 23_30 | diff |
|---|---|---|---|
| 0 | 42 | 23 | 19 |
| 1 | 79 | 15 | 64 |
| 2 | 11 | 69 | -58 |
| 3 | 69 | 33 | 36 |
| 4 | 53 | 49 | 4 |


## Formulating the null and alternative hypotheses

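The null hypothesis (H0): the population mean of the paired differences, estimated by data['diff'].mean(), is less than or equal to 0.

The alternative hypothesis (H1): the population mean of the paired differences is greater than 0.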
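
Written formally, the hypotheses and the standard paired t statistic look as follows. The symbols are notation introduced here for illustration and do not appear in the original notebook: $\mu_d$ is the population mean of the differences (moves 15-22 minus moves 23-30), $\bar{d}$ is the sample mean of the differences, $s_d$ is their sample standard deviation (with $n - 1$ in the denominator), and $n$ is the number of games.

$$
H_0: \mu_d \le 0, \qquad H_1: \mu_d > 0, \qquad t = \frac{\bar{d} - 0}{s_d / \sqrt{n}}
$$

In the standard paired test, this statistic is compared against a Student t distribution with $n - 1$ degrees of freedom.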

## Calculating necessary test statistics

In [None]:
pop_mean = 0                              # hypothesized mean difference under H0
sample_mean = np.mean(data['diff'])
n = len(data['diff'])
sample_sd = np.std(data['diff'], ddof=1)  # sample standard deviation (n - 1 in the denominator)

We do not know the population standard deviation, so we estimate it with the sample standard deviation and use Student's t distribution instead of the Z distribution. With 38 observations, which is greater than 30, the t distribution is already close to the normal one, but the t test remains the appropriate choice when the population standard deviation is unknown.

## Conducting t test

In [None]:
t = (sample_mean - pop_mean)/(sample_sd/np.sqrt(n))
df = n - 1       # df - degrees of freedom of a paired (one-sample) t test
alpha = 0.05     # the maximum probability of type 1 error
print(f'{t:.4f}')

1.6847


In [None]:
from scipy import stats
# Student t distribution, n = 38, 1-tail
# deg_freedom = n - 1 = 38 - 1 = 37
critical_t = stats.t.ppf(1-alpha, df)

In [None]:
if t > critical_t:
    print('We reject the null hypothesis which is: mean difference <= 0')
    print('Hence we say that the mean time spent on moves 15-22 is greater\
 than the mean time spent on moves 23-30.')
else:
    print('We cannot reject the null hypothesis which is: mean difference <= 0')
    print('Thus we cannot say that the mean time spent on moves 15-22 is greater\
 than the mean time spent on moves 23-30.')

We cannot reject the null hypothesis which is: mean difference <= 0
Thus we cannot say that the mean time spent on moves 15-22 is greater than the mean time spent on moves 23-30.
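
As a cross-check, the same one-sided paired test can be run with `scipy.stats.ttest_rel`, which also reports a p-value. The sketch below assumes SciPy 1.6 or later, where the `alternative` argument is available; the names `diff`, `t_manual`, `p_manual` and `result` are introduced here for illustration. Note that `ttest_rel` uses the sample standard deviation (`ddof=1`) and n - 1 degrees of freedom, so a manual calculation agrees with it only when it follows the same conventions.

```python
import numpy as np
from scipy import stats

first_8 = np.array([42,79,11,69,53,62,30,33,79,42,42,70,23,17,52,66,29,54,33,
                    42,59,44,33,77,73,93,29,51,65,27,29,61,70,25,42,33,27,29])
second_8 = np.array([23,15,69,33,49,29,54,79,20,38,64,22,24,42,58,32,31,76,31,
                     30,54,53,58,42,33,38,14,57,41,16,41,35,22,57,29,21,33,30])
diff = first_8 - second_8
n = len(diff)

# Manual statistic: sample standard deviation (ddof=1), n - 1 degrees of freedom
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
p_manual = stats.t.sf(t_manual, df=n - 1)  # one-sided p-value, P(T >= t)

# Library version: paired t test with H1 "mean difference > 0"
result = stats.ttest_rel(first_8, second_8, alternative='greater')

print(f'manual: t = {t_manual:.4f}, p = {p_manual:.4f}')
print(f'scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}')
```

Comparing the p-value with alpha is equivalent to comparing t with the critical value, and it additionally shows how close the result is to the decision threshold.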


## Conclusions and recommendations

We can recommend a training program according to the interval on which the player spends more time. If the player spends more time on moves 15-22, they should probably review their early middlegame skills or try to play faster in that phase; if the player spends more time on moves 23-30, they should study late middlegames more.
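
The recommendation covers both directions, while the test in this notebook is one-sided. One possible way to express the decision in code is a two-sided paired test followed by a look at the sign of the mean difference. The function name `recommend` and its messages are hypothetical, and the use of a two-sided test is an assumption made for this sketch rather than the notebook's method.

```python
import numpy as np
from scipy import stats


def recommend(first, second, alpha=0.05):
    """Map a two-sided paired t test onto one of three training hints."""
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    result = stats.ttest_rel(first, second)  # two-sided by default
    if result.pvalue >= alpha:
        return 'no significant difference between the two intervals'
    if np.mean(first - second) > 0:
        return 'more time on moves 15-22: review early middlegame play'
    return 'more time on moves 23-30: review late middlegame play'


# Illustration on the five games shown by data.head()
print(recommend([42, 79, 11, 69, 53], [23, 15, 69, 33, 49]))
```

In the notebook itself the call would be `recommend(first_8, second_8)`.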