## A/B Testing

In modern data analytics, deciding whether two numerical samples come from the same
underlying distribution is called A/B testing. The name refers to the labels of the two
samples, A and B.

#### Smokers and Nonsmokers

The dataset "baby.csv" contains the following variables for 1,174 mother-baby pairs: 
the baby's birth weight in ounces, 
the number of gestational days, 
the mother's age in completed years,
the mother's height in inches, 
pregnancy weight in pounds, and 
whether or not the mother smoked during pregnancy.

One of the aims of the study was to see whether maternal smoking was associated with birth weight.

Start by selecting just `Birth Weight` and `Maternal Smoker`. There are 715 nonsmokers among the women in the sample, and 459 smokers.

In [None]:
import matplotlib
import matplotlib.pyplot as plots
%matplotlib inline
import numpy as np

In [None]:
import pandas as pd

In [None]:
baby = pd.read_csv('https://github.com/data-8/textbook/raw/gh-pages/data/baby.csv')

In [None]:
# or from 'https://raw.githubusercontent.com/data-8/textbook/gh-pages/data/baby.csv'

In [None]:
baby.to_csv('E:/2020/DS/19AI611/python_data_csv/baby.csv')

In [None]:
baby

In [None]:
smoking_and_birthweight = baby[['Maternal Smoker', 'Birth Weight']]
smoking_and_birthweight
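
The group sizes quoted earlier (715 nonsmokers, 459 smokers) can be checked with `value_counts`. The sketch below runs on a small made-up table; only the column names are taken from `baby.csv`, and the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-in for the smoking_and_birthweight table
toy = pd.DataFrame({
    'Maternal Smoker': [False, False, True, False, True],
    'Birth Weight': [120, 113, 128, 108, 136],
})

# Count how many rows carry each label
group_sizes = toy['Maternal Smoker'].value_counts()
print(group_sizes)
```

Running the same call on `smoking_and_birthweight['Maternal Smoker']` should report the two counts stated in the text.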

In [None]:
smoking_and_birthweight['Maternal Smoker'] == True

In [None]:
# Birth weights of the babies whose mothers smoked
smoker = smoking_and_birthweight.loc[smoking_and_birthweight['Maternal Smoker'] == True, 'Birth Weight']
smoker

In [None]:
# Birth weights of the babies whose mothers did not smoke
non_smoker = smoking_and_birthweight.loc[smoking_and_birthweight['Maternal Smoker'] == False, 'Birth Weight']
non_smoker

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
smoking_and_birthweight.hist(by ='Maternal Smoker')

In [None]:
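import seaborn as sns
# Shared bin edges so the two histograms are directly comparable
bins = np.histogram_bin_edges(smoking_and_birthweight['Birth Weight'], bins=20)
smoker.hist(histtype='stepfilled', alpha=.5, bins=bins, label='Smoker')
non_smoker.hist(histtype='stepfilled', alpha=.5, color=sns.desaturate("indianred", .75), bins=bins, label='Non-smoker')
plt.xlabel('Birth weight (ounces)', fontsize=15)
plt.ylabel('Number of babies', fontsize=15)
plt.legend()
plt.show()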

The distribution of the weights of the babies born to mothers who smoked appears to be
shifted slightly to the left of the distribution corresponding to non-smoking mothers. The
weights of the babies of the mothers who smoked seem lower, on average, than the weights
of the babies of the non-smokers. 

### The Hypotheses
We can try to answer this question by a test of hypotheses. The chance model that we will
test says that there is no underlying difference; the distributions in the samples are different
just due to chance. Formally, this is the null hypothesis.

Null hypothesis: In the population, the distribution of birth weights of babies is the same for
mothers who don't smoke as for mothers who do. The difference in the sample is due to chance.

Alternative hypothesis: In the population, the babies of the mothers who smoke have a
lower birth weight, on average, than the babies of the non-smokers.
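
In symbols, the two hypotheses can be sketched as follows. Here $F_S$ and $F_N$ stand for the population distributions of birth weight among smokers and non-smokers, and $\mu_S$ and $\mu_N$ for their means; this notation is introduced here for convenience and does not appear in the dataset.

```latex
H_0:\; F_S = F_N
\qquad\text{versus}\qquad
H_A:\; \mu_S < \mu_N
```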

### Test Statistic
The alternative hypothesis compares the average birth weights of the two groups and says
that the average for the mothers who smoke is smaller. Therefore it is reasonable for us to
use the difference between the two group means as our statistic.

We will do the subtraction in the order "average weight of the smoking group - average
weight of the non-smoking group". Small values (that is, large negative values) of this
statistic will favor the alternative hypothesis.

The observed value of the test statistic is about -9.3 ounces.

In [None]:
means_table = smoking_and_birthweight.groupby('Maternal Smoker').mean()
type(means_table)

In [None]:
means_table

In [None]:
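# Smokers' mean minus non-smokers' mean; the group labels are the booleans True and False
observed_difference = means_table.loc[True, 'Birth Weight'] - means_table.loc[False, 'Birth Weight']
observed_difference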
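
The statistic can also be wrapped in a small helper, so that the subtraction order "smoking group minus non-smoking group" is written down once. This is only a sketch: the function name `difference_of_means` and the toy table are hypothetical and not part of the original notebook.

```python
import pandas as pd

def difference_of_means(table, value_column, label_column):
    """Mean of the True group minus mean of the False group."""
    means = table.groupby(label_column)[value_column].mean()
    return means.loc[True] - means.loc[False]

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    'Maternal Smoker': [True, True, False, False],
    'Birth Weight': [100, 110, 120, 130],
})

toy_difference = difference_of_means(toy, 'Birth Weight', 'Maternal Smoker')
print(toy_difference)  # (100 + 110)/2 - (120 + 130)/2 = -20.0
```

A negative result means the group labelled `True` has the smaller average, which is the direction the alternative hypothesis predicts.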

### Predicting the Statistic Under the Null Hypothesis

To see how the statistic should vary under the null hypothesis, we have to figure out how to
simulate the statistic under that hypothesis. A clever method based on random permutations
does just that.

#### Random permutation.

If there were no difference between the two distributions in the underlying population, then whether a birth weight has the label True or False with respect to maternal smoking should make no difference to the average. The idea, then, is to shuffle all the birth weights randomly among the mothers. This is called random permutation.

Take the difference of the two new group means: the mean of the shuffled weights assigned to the smokers and the mean of the shuffled weights assigned to the non-smokers. This is a simulated value of the test statistic under the null hypothesis.
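
A single shuffle of this kind can be sketched with NumPy alone. The arrays below hold made-up values, and the names `rng`, `weights`, and `is_smoker` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the shuffle is reproducible

# Hypothetical birth weights and smoking labels for six mothers
weights = np.array([100.0, 110.0, 120.0, 130.0, 115.0, 125.0])
is_smoker = np.array([True, True, False, False, True, False])

# Shuffle the weights; the labels stay where they are
shuffled_weights = rng.permutation(weights)

# Simulated statistic: smokers' mean minus non-smokers' mean
simulated = shuffled_weights[is_smoker].mean() - shuffled_weights[~is_smoker].mean()
print(simulated)
```

The shuffled array contains exactly the same weights as the original; only their assignment to the labels changes.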

In [None]:
smoking_and_birthweight

There are 1,174 rows in the table. To shuffle all the birthweights, we will draw a random
sample of 1,174 rows without replacement. Then the sample will include all the rows of the
table, in random order.
We can use the method sample with the optional replace=False argument.


In [None]:
shuffled = smoking_and_birthweight.sample(1174,replace = False)
shuffled

In [None]:
shuffled_weights = shuffled['Birth Weight']
type(shuffled_weights)

In [None]:
original_and_shuffled= smoking_and_birthweight.assign(shuffled_weights=shuffled_weights.values )

In [None]:
original_and_shuffled
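
The `.values` in the `assign` call matters. `sample` keeps each row's original index label, and pandas aligns a Series on its index when it is added as a column, so passing the shuffled Series directly would put every weight back in its original row. A small sketch on hypothetical data:

```python
import pandas as pd

toy = pd.DataFrame({
    'Maternal Smoker': [True, False, True, False],
    'Birth Weight': [100, 110, 120, 130],
})

# Shuffle the rows; random_state is fixed only to make the sketch repeatable
shuffled = toy.sample(len(toy), replace=False, random_state=1)

# Assigning the Series aligns on the index, which undoes the shuffle
aligned = toy.assign(shuffled_weights=shuffled['Birth Weight'])

# Assigning the underlying array keeps the shuffled order
positional = toy.assign(shuffled_weights=shuffled['Birth Weight'].values)

print(aligned['shuffled_weights'].tolist() == toy['Birth Weight'].tolist())  # True
print(positional['shuffled_weights'].tolist())
```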

Each mother now has a random birth weight assigned to her. If the null hypothesis is true, all these random arrangements should be equally likely. See how different the average weights are in the two randomly selected groups. 

In [None]:
all_group_means= original_and_shuffled.groupby('Maternal Smoker').mean()
all_group_means

The averages of the two randomly selected groups are quite a bit closer than the averages of the two original groups.

In [None]:
# Same order as the observed statistic: smokers' mean minus non-smokers' mean
difference = all_group_means.loc[True, 'shuffled_weights'] - all_group_means.loc[False, 'shuffled_weights']
difference

#### But could a different shuffle have resulted in a larger difference between the group averages?

To get a sense of the variability, simulate the difference many times. 

##### One simulation

In [3]:
smoking_and_birthweight = baby[['Maternal Smoker', 'Birth Weight']]
shuffled = smoking_and_birthweight.sample(1174,replace = False)
shuffled_weights = shuffled['Birth Weight']
original_and_shuffled = smoking_and_birthweight.assign(shuffled_weights=shuffled_weights.values )
all_group_means= original_and_shuffled.groupby('Maternal Smoker').mean()
difference = all_group_means.loc[True, 'shuffled_weights'] - all_group_means.loc[False, 'shuffled_weights']
difference


#### Permutation Test

Tests based on random permutations of the data are called permutation tests. Simulate the test statistic – the
difference between the averages of the two groups – many times and collect the differences in an array. 

In [None]:
import numpy as np
differences = np.zeros(5000)  # one slot per simulated difference

In [None]:
for i in np.arange(5000):
    smoking_and_birthweight = baby[['Maternal Smoker', 'Birth Weight']]
    # Shuffle the rows, then attach the shuffled weights by position (not by index)
    shuffled = smoking_and_birthweight.sample(1174, replace=False)
    shuffled_weights = shuffled['Birth Weight']
    original_and_shuffled = smoking_and_birthweight.assign(shuffled_weights=shuffled_weights.values)
    all_group_means = original_and_shuffled.groupby('Maternal Smoker').mean()
    # Smokers' mean minus non-smokers' mean, matching the observed statistic
    difference = all_group_means.loc[True, 'shuffled_weights'] - all_group_means.loc[False, 'shuffled_weights']
    differences[i] = difference

In [None]:
differences

In [None]:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In [None]:
differences_df = pd.DataFrame(differences)
differences_df

In [None]:
differences_df.hist(bins = np.arange(-5,5,0.5))
plt.title('Prediction Under the Null Hypothesis');
plt.xlabel('Differences between Group Averages',fontsize=15)
plt.ylabel('Count',fontsize=15);
print('Observed Difference:', observed_difference)

Notice how the distribution is centered around 0. This makes sense, because under the null hypothesis the two groups should have roughly the same average. Therefore the difference between the group averages should be around 0.
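
The centering at 0 can also be seen directly. Write $\bar{w}$ for the mean of all the birth weights in the sample, and $\bar{W}^{*}_S$ and $\bar{W}^{*}_N$ for the means of the shuffled weights that land on the smokers and the non-smokers (symbols introduced here for illustration only). Under a uniformly random permutation every weight is equally likely to land on any mother, so each shuffled group mean has expectation $\bar{w}$:

```latex
\mathbb{E}\left[\bar{W}^{*}_S - \bar{W}^{*}_N\right]
= \mathbb{E}\left[\bar{W}^{*}_S\right] - \mathbb{E}\left[\bar{W}^{*}_N\right]
= \bar{w} - \bar{w}
= 0
```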

The observed difference in the original sample is about -9.27 ounces, which doesn't even appear on the horizontal scale of the histogram. The observed value of the statistic and the predicted behavior of the statistic under the null hypothesis are inconsistent.

The conclusion of the test is that the data support the alternative more than they support the null. The average birth weight of babies born to mothers who smoke is less than the average birth weight of babies born to non-smokers.

If you want to compute an empirical P-value, remember that low values of the statistic favor
the alternative hypothesis.

In [None]:
np.count_nonzero(differences <= observed_difference)/differences.size

The empirical P-value is 0, meaning that none of the 5,000 simulated shuffles resulted in a difference of -9.27 or lower. This is an approximation; the exact chance of getting a difference in that range is not 0 but it is vanishingly small.
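
A finite simulation cannot resolve a chance smaller than one in the number of repetitions, so some analysts report a value that cannot be exactly 0 by counting the observed statistic as one of the arrangements. The sketch below shows both versions. The function names and the toy array are hypothetical, and the add-one adjustment is a common convention rather than something this notebook uses.

```python
import numpy as np

def empirical_p_value(simulated, observed):
    """Proportion of simulated statistics at or below the observed one."""
    simulated = np.asarray(simulated)
    return np.count_nonzero(simulated <= observed) / simulated.size

def adjusted_p_value(simulated, observed):
    """Same count, with the observed value included as one arrangement."""
    simulated = np.asarray(simulated)
    return (np.count_nonzero(simulated <= observed) + 1) / (simulated.size + 1)

# Hypothetical simulated differences, none as low as the observed value
toy_differences = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
toy_observed = -9.27

print(empirical_p_value(toy_differences, toy_observed))  # 0.0
print(adjusted_p_value(toy_differences, toy_observed))   # 1/6, about 0.167
```

With 5,000 repetitions and no simulated value at or below the observed one, the adjusted version would give 1/5001, roughly 0.0002.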

#### Assignment - Write a Function to Simulate the Differences Under the Null Hypothesis and test whether there was any difference in the ages of the smoking and non-smoking mothers.