## Assignment

To help you find your bearings with regard to t-tests, calculate the t-values for the following numbers:

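1. $\bar{y}_1 = 5$, $\bar{y}_2 = 8$, $s_1 = 1$, $s_2 = 3$, $N_1 = 200$, $N_2 = 500$
1. $\bar{y}_1 = 1090$, $\bar{y}_2 = 999$, $s_1 = 400$, $s_2 = 30$, $N_1 = 900$, $N_2 = 100$
1. $\bar{y}_1 = 45$, $\bar{y}_2 = 40$, $s_1 = 45$, $s_2 = 40$, $N_1 = 2000$, $N_2 = 2000$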
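
Since each problem supplies only summary statistics, the t-values can also be computed directly, without simulating any data. The sketch below assumes the unpooled two-sample formula $t = (\bar{y}_2 - \bar{y}_1) / \sqrt{s_1^2/N_1 + s_2^2/N_2}$ and takes the difference as group 2 minus group 1, matching the argument order passed to `stats.ttest_ind` later in this notebook. The helper name `t_from_summary` is hypothetical.

```python
import math

def t_from_summary(mean1, mean2, s1, s2, n1, n2):
    """Two-sample t-value from summary statistics (unpooled standard error)."""
    standard_error = math.sqrt(s1**2 / n1 + s2**2 / n2)
    # Difference taken as group 2 minus group 1 (an assumed sign convention).
    return (mean2 - mean1) / standard_error

# (mean1, mean2, s1, s2, n1, n2) for the three problems above
problems = [
    (5, 8, 1, 3, 200, 500),
    (1090, 999, 400, 30, 900, 100),
    (45, 40, 45, 40, 2000, 2000),
]

t_values = [t_from_summary(*p) for p in problems]
for i, t in enumerate(t_values, start=1):
    print(f"Problem {i}: t = {t:.2f}")
# Problem 1: t = 19.78
# Problem 2: t = -6.66
# Problem 3: t = -3.71
```

These values come straight from the stated means, standard deviations, and sample sizes, so they do not depend on a random seed.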

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
%matplotlib inline


In [2]:
#Keeping the random data the same over multiple runs of the code.
np.random.seed(42)


In [3]:
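def data_maker(x, y):
    """Simulate two groups from x and y, each a [std, sample size, mean] list.

    The means x[2] and y[2] belong to group1 and group2. The stds and sizes
    are crossed into a grid, and each combination is applied to both groups.
    """
    # Storing our randomly generated data and labels.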
    data = []
    groups = []
    labels_std = []
    labels_sizes = []

#The values we use for the standard deviations and the sample sizes.
    std = [x[0], y[0]]
    sample_sizes = [x[1], y[1]]
    
#Generating data for each group for each combination of variability and sample size.
    for var in std:
        for size in sample_sizes:
            data.extend(np.random.normal(x[2],var,size))
            data.extend(np.random.normal(y[2],var,size))
            labels_std.extend([var]*size*2)
            labels_sizes.extend([size]*size*2)
            groups.extend(['group1']*size)
            groups.extend(['group2']*size)

#Putting the data together in a data frame and returning it
    data = pd.DataFrame({'data': data, 
                         'groups' : groups,
                        'variability':labels_std,
                        'size':labels_sizes})
    return data
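
Note that `data_maker` applies every std and sample-size combination to both groups, so none of the simulated pairs has, say, $s_1 = 1$, $N_1 = 200$ alongside $s_2 = 3$, $N_2 = 500$. A minimal sketch of the alternative, where each group is drawn with its own parameters, might look like the following. It assumes the standard-library `random` module instead of NumPy, and `simulate_group` is a hypothetical helper.

```python
import math
import random
import statistics

def simulate_group(mean, std, size, rng):
    """Draw `size` normally distributed values with the given mean and std."""
    return [rng.gauss(mean, std) for _ in range(size)]

rng = random.Random(42)  # fixed seed so repeated runs give the same data

# Problem 1 parameters: each group keeps its own mean, std, and size.
group1 = simulate_group(5, 1, 200, rng)
group2 = simulate_group(8, 3, 500, rng)

# Unpooled standard error, since the two groups have different variances.
standard_error = math.sqrt(
    statistics.variance(group1) / len(group1)
    + statistics.variance(group2) / len(group2)
)
t_value = (statistics.mean(group2) - statistics.mean(group1)) / standard_error
print(round(t_value, 2))
```

Because the data are random draws, the printed value should land near the t-value computed from the summary statistics but will not match it exactly.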


In [4]:
def t_tester(data, x, y):

# Setting the three non-data columns to work as multi-indices. 
# This makes it much easier to get subsections of stacked data.
    data_test = data.set_index(['groups','size','variability'])
    
#The values we use for the standard deviations and the sample sizes.
    std = [x[0], y[0]]
    sample_sizes = [x[1], y[1]]

# Storing our t-values and p-values (we'll get to p-values in a sec).
    tvalues=[]
    pvalues=[]

#For each combination of sample size and variability, compare the two groups using a t-test
    for size in sample_sizes:
        for var in std:
            a = data_test['data'].xs(('group1',size,var),level=('groups','size','variability'))
            b = data_test['data'].xs(('group2',size,var),level=('groups','size','variability'))
            tval,pval=stats.ttest_ind(b, a,equal_var=True)
            tvalues.append(tval)
            pvalues.append(pval)
    print("The tvalues are as follows", tvalues)
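
With `equal_var=True`, `stats.ttest_ind` assumes both groups share one population variance and uses a pooled estimate of it. The sketch below spells out that pooled calculation by hand on a tiny made-up sample, using only the standard library. The function name `pooled_t` and the example lists are illustrative assumptions, not part of the assignment.

```python
import math
import statistics

def pooled_t(a, b):
    """Pooled-variance t-value for mean(b) - mean(a)."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    # Weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(b) - statistics.mean(a)) / standard_error

# Illustrative data only
a = [4.0, 5.0, 6.0]
b = [7.0, 8.0, 9.0]
t_pooled = pooled_t(a, b)
print(round(t_pooled, 4))
```

When the two groups have clearly different standard deviations and sizes, as in Problems 1 and 2, the pooled and unpooled standard errors can differ noticeably.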


### Problem #1

1. $\bar{y}_1 = 5$, $\bar{y}_2 = 8$, $s_1 = 1$, $s_2 = 3$, $N_1 = 200$, $N_2 = 500$

In [5]:
# [std, sample size, sample mean] for each group, used when creating our datasets
DATA_1X = [1, 200, 5]
DATA_1Y = [3, 500, 8]


In [6]:
# using data_maker to create a dataframe based on previous lists
P1_DF = data_maker(DATA_1X, DATA_1Y)
P1_DF.head()


| | data | groups | variability | size |
|---|---|---|---|---|
| 0 | 5.496714 | group1 | 1 | 200 |
| 1 | 4.861736 | group1 | 1 | 200 |
| 2 | 5.647689 | group1 | 1 | 200 |
| 3 | 6.52303 | group1 | 1 | 200 |
| 4 | 4.765847 | group1 | 1 | 200 |


In [7]:
# using t_tester to get one t-value per (sample size, std) combination
t_tester(P1_DF, DATA_1X, DATA_1Y)


The tvalues are as follows [32.58910059852413, 10.37464123986258, 48.1992971776322, 17.26981222125739]
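
The four values follow the loop order in `t_tester`: sample size 200 and then 500, and within each size, std 1 and then 3. As a rough sanity check, and assuming both groups in a subset share the same std and size (which is how `data_maker` builds them), the expected t-value is about $(\bar{y}_2 - \bar{y}_1) / (s\sqrt{2/N})$. The dictionary name `expected` is hypothetical.

```python
import math

mean_diff = 8 - 5  # group2 mean minus group1 mean for Problem 1

expected = {}
for size in [200, 500]:      # same order as the outer loop in t_tester
    for std in [1, 3]:       # same order as the inner loop in t_tester
        expected[(size, std)] = mean_diff / (std * math.sqrt(2 / size))

for (size, std), t in expected.items():
    print(f"size={size}, std={std}: expected t of about {t:.2f}")
# size=200, std=1: expected t of about 30.00
# size=200, std=3: expected t of about 10.00
# size=500, std=1: expected t of about 47.43
# size=500, std=3: expected t of about 15.81
```

The simulated values (32.59, 10.37, 48.20, 17.27) sit close to these, with the gap attributable to sampling noise.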


### Problem #2

1. $\bar{y}_1 = 1090$, $\bar{y}_2 = 999$, $s_1 = 400$, $s_2 = 30$, $N_1 = 900$, $N_2 = 100$

In [8]:
#std, sample size, and sample mean
DATA_2X = [400, 900, 1090]
DATA_2Y = [30, 100, 999]


In [9]:
P2_DF = data_maker(DATA_2X, DATA_2Y)
P2_DF.head()


| | data | groups | variability | size |
|---|---|---|---|---|
| 0 | 1163.533807 | group1 | 400 | 900 |
| 1 | 2167.213466 | group1 | 400 | 900 |
| 2 | 1229.92001 | group1 | 400 | 900 |
| 3 | 688.378161 | group1 | 400 | 900 |
| 4 | 1051.814301 | group1 | 400 | 900 |


In [10]:
t_tester(P2_DF, DATA_2X, DATA_2Y)


The tvalues are as follows [-6.622557090008956, -64.7139612214455, -1.772836309857788, -19.38769160154602]


### Problem #3

1. $\bar{y}_1 = 45$, $\bar{y}_2 = 40$, $s_1 = 45$, $s_2 = 40$, $N_1 = 2000$, $N_2 = 2000$

In [11]:
#std, sample size, and sample mean
DATA_3X = [45, 2000, 45]
DATA_3Y = [40, 2000, 40]


In [12]:
P3_DF = data_maker(DATA_3X, DATA_3Y)
P3_DF.head()


| | data | groups | variability | size |
|---|---|---|---|---|
| 0 | 2.297123 | group1 | 45 | 2000 |
| 1 | 4.189371 | group1 | 45 | 2000 |
| 2 | -60.770513 | group1 | 45 | 2000 |
| 3 | 120.279069 | group1 | 45 | 2000 |
| 4 | 6.120907 | group1 | 45 | 2000 |


In [13]:
t_tester(P3_DF, DATA_3X, DATA_3Y)


The tvalues are as follows [-4.944695381160954, -5.372905703529876, -4.944695381160954, -5.372905703529876]
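
The repeated pairs here appear to follow from both sample sizes being 2000. The size loop visits the same value twice, so `t_tester` tests each (size, std) combination twice, and `data_maker` writes two batches of 2000 rows per group under the same labels. The sketch below traces that bookkeeping without generating any data. `rows_per_group` and `approx_t` are hypothetical names, and the approximate t-values assume 4000 rows per group with a shared std.

```python
import math
from itertools import product

sample_sizes = [2000, 2000]  # N1 and N2 from Problem 3
std = [45, 40]               # s1 and s2 from Problem 3

# Order in which t_tester visits the combinations (size outer, std inner).
combos = list(product(sample_sizes, std))
print(combos)  # the first two pairs repeat as the last two

# Rows per group that end up under each (size, std) label in data_maker.
rows_per_group = {}
for var in std:
    for size in sample_sizes:
        rows_per_group[(size, var)] = rows_per_group.get((size, var), 0) + size

# Rough expected t-values for a mean difference of 40 - 45 = -5.
approx_t = {
    key: -5 / (key[1] * math.sqrt(2 / n)) for key, n in rows_per_group.items()
}
print(rows_per_group)
print(approx_t)
```

Under these assumptions each subset holds 4000 rows per group, which is consistent with the simulated t-values being larger in magnitude than the roughly -3.71 implied by the stated summary statistics.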
