# 08.02.20

**Author:** Miron Rogovets

---

### Task 1. Chi-square test. Use data_games.dta file.

**1.1.** Analyze the relationship between **payment_type** and **payment_method** using the Chi-square statistical test. Is the Chi-square test applicable to this pair of variables? If yes, formulate the hypotheses, interpret the results of the analysis and draw conclusions. Create a suitable graph to demonstrate the relationship between these two variables.

In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

In [32]:
df = pd.read_stata('data/data_games.dta')
df.head()

|   | id | pack_id | crystalls_balance_before_buy | crystalls_bought | country | payment | utc_timestamp | payment_type | payment_method |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 2052791000.0 | 3.0 | 0.0 | 41.0 | GB | 644.0 | 1414842000.0 | offer | general |
| 1 | 1275033000.0 | 1.0 | 10.0 | 7.0 | US | 205.0 | 1414814000.0 | offer | general |
| 2 | 200001500000000.0 | 2.0 | 2.0 | 14.0 | US | 514.0 | 1414866000.0 | regular | general |
| 3 | 1119068000.0 | 4.0 | 0.0 | 70.0 | GB | 1289.0 | 1414917000.0 | regular | general |
| 4 | 200002800000000.0 | 3.0 | 0.0 | 30.0 | US | 1029.0 | 1414946000.0 | regular | general |


To test whether the variables _payment_type_ and _payment_method_ are associated, we define the following hypotheses:

**H0:** The variables _payment_type_ and _payment_method_ are **independent**

**H1:** The variables _payment_type_ and _payment_method_ are **not independent**

We also choose a significance level $\alpha$ **= 0.05**

In [33]:
tab = pd.crosstab(df.payment_type, df.payment_method)
tab

| payment_type / payment_method | fb_promotion | general | giftcard | mobile |
|:---|:---:|:---:|:---:|:---:|
| offer | 0 | 13538 | 0 | 301 |
| regular | 498 | 89740 | 58 | 1372 |


In [34]:
alpha = 0.05
stat, p, dof, expected = chi2_contingency(tab)

In [35]:
print('significance=%f, p=%f' % (alpha, p))
if p <= alpha:
    print('Variables are associated (reject H0)')
else:
    print('Variables are not associated (fail to reject H0)')

significance=0.050000, p=0.000000
Variables are associated (reject H0)
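
The task also asks whether the Chi-square test is applicable here. A minimal sketch of one way to check this is shown below: it re-enters the observed counts from the crosstab above and inspects the expected frequencies returned by `chi2_contingency`. The rule of thumb used (no expected count below 1, at most 20% of cells below 5) is an assumption brought in for illustration, not something stated in the task.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the crosstab above.
# Rows: offer, regular. Columns: fb_promotion, general, giftcard, mobile.
observed = np.array([
    [0, 13538, 0, 301],
    [498, 89740, 58, 1372],
])

stat, p, dof, expected = chi2_contingency(observed)

# Assumed rule of thumb: every expected count is at least 1
# and no more than 20% of the cells have an expected count below 5.
share_below_5 = (expected < 5).mean()
applicable = expected.min() >= 1 and share_below_5 <= 0.2

print(np.round(expected, 1))
print('dof=%d, min expected=%.1f, applicable=%s' % (dof, expected.min(), applicable))
```

Note that the check is made on the expected counts, not the observed ones, so the two observed zeros in the `offer` row do not by themselves rule the test out.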


In [36]:
stacked = tab.stack().reset_index().rename(columns={0:'value'})
stacked

|   | payment_type | payment_method | value |
|:---|:---:|:---:|:---:|
| 0 | offer | fb_promotion | 0 |
| 1 | offer | general | 13538 |
| 2 | offer | giftcard | 0 |
| 3 | offer | mobile | 301 |
| 4 | regular | fb_promotion | 498 |
| 5 | regular | general | 89740 |
| 6 | regular | giftcard | 58 |
| 7 | regular | mobile | 1372 |
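
The task asks for a graph of the relationship, and the notebook export does not contain one. The sketch below is one possible option, not necessarily the plot the author intended: a grouped bar chart of the share of each payment method within each payment type. Shares are used instead of raw counts because the `general` method dominates the counts; that is a design choice made here. The counts are re-entered from the crosstab above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Counts copied from the crosstab above
tab = pd.DataFrame(
    {
        'fb_promotion': [0, 498],
        'general': [13538, 89740],
        'giftcard': [0, 58],
        'mobile': [301, 1372],
    },
    index=pd.Index(['offer', 'regular'], name='payment_type'),
)
tab.columns.name = 'payment_method'

# Row-wise shares: the methods of each payment type sum to 1
shares = tab.div(tab.sum(axis=1), axis=0)

fig, ax = plt.subplots(figsize=(8, 4))
shares.plot(kind='bar', ax=ax, rot=0)
ax.set_ylabel('share within payment_type')
ax.set_title('Payment method by payment type')
fig.tight_layout()
```

With the notebook's own data, `tab` can be taken directly from `pd.crosstab(df.payment_type, df.payment_method)` instead of being typed in.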


**1.2.** Analyze the relationship between **payment_type** and **crystalls_balance_before_buy** using the Chi-square statistical test. Is the Chi-square test applicable to this pair of variables? If yes, formulate the hypotheses, interpret the results of the analysis and draw conclusions. Create a suitable graph to demonstrate the relationship between these two variables.

In [15]:
tab = pd.crosstab(df.crystalls_balance_before_buy, df.payment_type)
tab

| crystalls_balance_before_buy / payment_type | offer | regular |
|:---|:---:|:---:|
| 0.0 | 4047 | 28833 |
| 1.0 | 2250 | 13951 |
| 2.0 | 1287 | 9337 |
| 3.0 | 919 | 6221 |
| 4.0 | 649 | 5159 |
| ... | ... | ... |
| 6439.0 | 1 | 0 |
| 6453.0 | 1 | 0 |
| 6528.0 | 0 | 1 |
| 6735.0 | 0 | 1 |
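
The crosstab above has one row per distinct balance value, and several of those rows hold a single observation (for example 6439.0 and 6453.0), so their expected counts are far below the usual thresholds and the test is not applicable to the raw values. One hedged workaround is to bin the balance first. In the sketch below the function name `binned_chi2` and the bin edges are assumptions chosen for illustration; they are not taken from the task.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def binned_chi2(frame, edges=(-np.inf, 0, 5, 20, 100, np.inf)):
    """Chi-square test of payment_type against a binned balance.

    The default edges are illustrative assumptions, not values from the task.
    """
    balance_bin = pd.cut(frame['crystalls_balance_before_buy'], bins=list(edges))
    tab = pd.crosstab(balance_bin, frame['payment_type'])
    # Drop empty rows and columns, which would give zero expected counts
    tab = tab.loc[tab.sum(axis=1) > 0, tab.sum(axis=0) > 0]
    stat, p, dof, expected = chi2_contingency(tab)
    return tab, stat, p, dof, expected

# Usage with the notebook's data frame:
# tab, stat, p, dof, expected = binned_chi2(df)
# print(tab, 'p=%f' % p, 'min expected=%.1f' % expected.min())
```

Whether binning is acceptable depends on the course requirements; a test designed for a numeric variable against a two-level factor may be a better fit for this pair.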


---

### Task 2. Scatterplot. Normality test. Correlation. Use data_games.dta file.

**2.1.** Create a scatterplot between **crystalls_balance_before_buy** and **payment**. Copy the scatterplot into this file.

**2.2.** Run a suitable normality test to determine whether the distribution of the **payment** variable differs significantly from the normal distribution. Formulate the hypotheses. Draw conclusions.

Calculate an appropriate correlation coefficient between three pairs of variables. Fill in the table below. Interpret the results.


| Variables | Type of the appropriate correlation coefficient | Hypotheses | Strength of the relationship | Direction of the relationship | Significance of the relationship |
|:---|:---:|:---:|:---:|:---:|:---:|
| crystalls_balance_before_buy and payment |   |   |   |   |   |
| crystalls_balance_before_buy and crystalls_bought |   |   |   |   |   |
| crystalls_bought and payment |   |   |   |   |   |
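
The table can be filled in by first testing normality and then picking the coefficient accordingly. The sketch below is one way to do that, under stated assumptions: it uses the D'Agostino-Pearson test (`scipy.stats.normaltest`, H0: the sample comes from a normal distribution) and falls back from Pearson to the rank-based Spearman coefficient when normality is rejected. The helper name `choose_and_correlate` is hypothetical.

```python
from scipy.stats import normaltest, pearsonr, spearmanr

def choose_and_correlate(frame, x, y, alpha=0.05):
    """Pick Pearson or Spearman based on a normality test, then correlate."""
    data = frame[[x, y]].dropna()
    # Normality is accepted only if H0 is not rejected for both variables
    normal = all(normaltest(data[col]).pvalue > alpha for col in (x, y))
    if normal:
        kind, (r, p) = 'pearson', pearsonr(data[x], data[y])
    else:
        kind, (r, p) = 'spearman', spearmanr(data[x], data[y])
    return kind, r, p

# Usage with the notebook's data frame:
# pairs = [('crystalls_balance_before_buy', 'payment'),
#          ('crystalls_balance_before_buy', 'crystalls_bought'),
#          ('crystalls_bought', 'payment')]
# for x, y in pairs:
#     print(x, y, choose_and_correlate(df, x, y))
```

The sign of `r` gives the direction, its absolute value the strength, and `p` compared with the chosen significance level gives the significance.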

---

### Task 3. Partial correlation. Use health_funding.dta file.

Calculate the pairwise correlation coefficient between the **funding** and **disease** variables. Then calculate the correlation coefficient between the same pair of variables while controlling for the number of visits (the **visits** variable). Interpret the results of the analysis.
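
A minimal sketch of a first-order partial correlation, assuming linear relationships: regress each variable on the control variable, then correlate the residuals. The function name `partial_corr` is hypothetical, and the file path in the usage comment is an assumption based on the `data/` folder used earlier in this notebook.

```python
import numpy as np

def partial_corr(x, y, z):
    """Pearson correlation of x and y after removing the linear effect of z."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    # Design matrix: intercept plus the control variable
    design = np.column_stack([np.ones_like(z), z])
    # Residuals of x and y after least-squares regression on z
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

# Usage (path assumed):
# hf = pd.read_stata('data/health_funding.dta')
# raw = np.corrcoef(hf['funding'], hf['disease'])[0, 1]
# controlled = partial_corr(hf['funding'], hf['disease'], hf['visits'])
```

Comparing `raw` with `controlled` shows how much of the pairwise correlation can be attributed to the number of visits.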

---

### Task 4. T-tests and Nonparametric tests.

**4.1.** Use the **auto.dta** file (example datasets). Select an appropriate test to check whether there is a difference in the mean length of foreign and non-foreign cars. Explain your selection. Formulate the hypotheses. Interpret the results of the analysis.

**4.2.** Use the **data_games.dta** file. Select an appropriate test to determine whether there is a difference in payments between people who used different payment types. Explain your selection. Formulate the hypotheses. Interpret the results of the analysis.

**4.3.** Use the **data_games.dta** file. Select an appropriate test to determine whether there is a difference in payments between people who used different payment methods. Explain your selection. Formulate the hypotheses. Interpret the results of the analysis.
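
The selection logic for the three subtasks can be sketched as follows. This is one possible decision rule, not a prescribed one: with two groups, use a t-test if both samples look normal (Student's or Welch's depending on Levene's test for equal variances) and the Mann-Whitney U test otherwise; with more than two groups, use one-way ANOVA or the Kruskal-Wallis test in the same way. The helper name `compare_groups` is hypothetical, and the normality check assumes every group has at least 8 observations.

```python
from scipy.stats import (normaltest, levene, ttest_ind,
                         mannwhitneyu, f_oneway, kruskal)

def compare_groups(frame, value, group, alpha=0.05):
    """Choose and run a test for a difference in `value` across `group`."""
    samples = [g[value].dropna().to_numpy()
               for _, g in frame.groupby(group, observed=True)]
    # H0 of normaltest: the sample comes from a normal distribution
    normal = all(normaltest(s).pvalue > alpha for s in samples)
    if len(samples) == 2:
        if normal:
            # Levene's H0: the groups have equal variances
            equal_var = levene(*samples).pvalue > alpha
            name = "Student's t-test" if equal_var else "Welch's t-test"
            result = ttest_ind(*samples, equal_var=equal_var)
        else:
            name = 'Mann-Whitney U'
            result = mannwhitneyu(*samples, alternative='two-sided')
    elif normal:
        # ANOVA additionally assumes similar group variances
        name, result = 'one-way ANOVA', f_oneway(*samples)
    else:
        name, result = 'Kruskal-Wallis', kruskal(*samples)
    return name, result.statistic, result.pvalue

# Usage with the notebook's data frame:
# print(compare_groups(df, 'payment', 'payment_type'))    # task 4.2
# print(compare_groups(df, 'payment', 'payment_method'))  # task 4.3
```

In each case H0 states that the groups do not differ (equal means for the parametric tests, equal distributions for the rank-based ones), and H0 is rejected when the returned p-value is at or below the chosen significance level.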