# Independent T-Test Baseball Teams

## Pandas

The goal is to perform a t-test on baseball data.  To start, we need to learn a bit about pandas, a data analysis library.  You can read more about it in Chapter 4 of [Python Data Analysis](http://www.amazon.com/Python-Data-Analysis-Ivan-Idris/dp/1783553359).  We will go over pandas in more detail next class.

In [1]:
import pandas as pd

# we also import a special IPython function that allows us to print nice tables
from IPython.display import display

To read CSV files, our best friend is the pandas [read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function.

In [2]:
mlb = pd.read_csv("mlb2015teams.csv")
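
As a minimal sketch of what read_csv does, the example below parses a tiny CSV held in memory instead of a file on disk. The three rows and the name `sample` are made up for illustration and are not taken from mlb2015teams.csv.

```python
import io

import pandas as pd

# Hypothetical three-row sample; these values are invented for the demo.
csv_text = """Tm,wins,R
AAA,90,750
BBB,81,700
CCC,70,650
"""

# read_csv accepts a path or a file-like object, so StringIO stands in for a file.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))   # the object returned is a DataFrame
print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types are inferred from the text
```

The first line of the text becomes the column names, and each later line becomes one row.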

What kind of object is mlb? (Use type to find out.)

In [3]:
type(mlb)

pandas.core.frame.DataFrame

It looks like we have a DataFrame.  Let's look at its shape, its head() (first few rows), and its tail() (last few rows).  Inspecting the data this way is a fairly standard first step.

In [4]:
print(mlb.shape)

(15, 32)


In [5]:
display(mlb.head())

Unnamed: 0,Tm,#Bat,BatAge,wins,R/G,G,PA,AB,R,H,...,OPS+,TB,GDP,HBP,SH,SF,IBB,LOB,old_team,winning_team
0,BAL,48,27.9,81,4.4,162,6007,5485,713,1370,...,96,2307,127,51,20,32,23,990,False,False
1,BOS,51,28.3,78,4.62,162,6237,5640,748,1495,...,98,2338,127,46,30,42,28,1142,False,False
2,CHW,40,28.2,76,3.84,162,6070,5533,622,1381,...,91,2103,125,65,30,37,22,1065,False,False
3,CLE,49,27.9,81,4.16,161,6109,5439,669,1395,...,94,2179,134,39,47,50,34,1147,False,False
4,DET,47,28.3,74,4.28,161,6159,5605,689,1515,...,106,2355,152,41,23,35,36,1111,False,False


In [6]:
display(mlb.tail())

Unnamed: 0,Tm,#Bat,BatAge,wins,R/G,G,PA,AB,R,H,...,OPS+,TB,GDP,HBP,SH,SF,IBB,LOB,old_team,winning_team
10,OAK,52,27.9,68,4.28,162,6171,5600,694,1405,...,94,2212,124,40,14,38,21,1102,False,False
11,SEA,51,28.6,76,4.05,162,6131,5544,656,1379,...,102,2279,123,36,38,35,31,1080,True,False
12,TBR,51,28.4,80,3.98,162,6071,5485,644,1383,...,99,2226,121,84,19,47,22,1075,True,False
13,TEX,57,28.6,88,4.64,162,6187,5511,751,1419,...,98,2278,99,76,43,54,32,1130,True,True
14,TOR,52,29.4,93,5.5,162,6232,5509,891,1480,...,118,2518,140,54,36,62,12,1057,True,True


Inspect the tail: if the last entry is a summary of all the data rather than a team, we only want the first 15 rows.  (In the output above the shape is already (15, 32) and the last row is TOR, so the slice below leaves the data unchanged.)  To get the first 15 rows we use a technique called slicing.  This [StackOverflow post](http://stackoverflow.com/questions/509211/explain-pythons-slice-notation) explains slicing a list.  When we slice a pandas DataFrame with [a:b] we get the rows at positions a up to, but not including, b.
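
To make the slicing rule concrete, here is a small sketch on a plain list and on a made-up DataFrame. The names `letters`, `df`, and `first_three` are hypothetical and exist only for this demo.

```python
import pandas as pd

letters = ["a", "b", "c", "d", "e"]
print(letters[0:3])  # ['a', 'b', 'c']: positions 0, 1, 2; position 3 is excluded

# Hypothetical frame with five rows, labelled 0..4 by default.
df = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

first_three = df[0:3]            # rows at positions 0, 1, 2
also_first_three = df.iloc[0:3]  # same rows, with positional indexing spelled out

# A slice that runs past the end is clipped rather than raising an error.
everything = df[0:15]
print(len(everything))  # 5
```

Because the end of a slice is clipped, asking for 15 rows from a frame that has 15 or fewer rows returns all of them.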

In [7]:
mlb = mlb[0:15]
display(mlb.tail())

Unnamed: 0,Tm,#Bat,BatAge,wins,R/G,G,PA,AB,R,H,...,OPS+,TB,GDP,HBP,SH,SF,IBB,LOB,old_team,winning_team
10,OAK,52,27.9,68,4.28,162,6171,5600,694,1405,...,94,2212,124,40,14,38,21,1102,False,False
11,SEA,51,28.6,76,4.05,162,6131,5544,656,1379,...,102,2279,123,36,38,35,31,1080,True,False
12,TBR,51,28.4,80,3.98,162,6071,5485,644,1383,...,99,2226,121,84,19,47,22,1075,True,False
13,TEX,57,28.6,88,4.64,162,6187,5511,751,1419,...,98,2278,99,76,43,54,32,1130,True,True
14,TOR,52,29.4,93,5.5,162,6232,5509,891,1480,...,118,2518,140,54,36,62,12,1057,True,True


Calculate Min and Max

The next step divides the data into two segments: winners (more than 81 wins) and losers (81 wins or fewer).

In [8]:
winners = mlb[mlb.winning_team == True]
losers = mlb[mlb.winning_team == False]
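
The split above relies on boolean indexing: a column of True/False values selects the rows where it is True. The sketch below shows the idea on a made-up four-team frame. The team codes, win totals, and the `_demo` names are hypothetical, and the flag is built here from the "more than 81 wins" rule.

```python
import pandas as pd

# Hypothetical data, not taken from the MLB file.
teams = pd.DataFrame({
    "Tm": ["AAA", "BBB", "CCC", "DDD"],
    "wins": [90, 81, 70, 85],
})

# Build the flag: True only for teams with more than 81 wins.
teams["winning_team"] = teams["wins"] > 81

mask = teams["winning_team"]
winners_demo = teams[mask]   # rows where the flag is True
losers_demo = teams[~mask]   # ~ inverts the mask, giving the remaining rows

print(winners_demo["Tm"].tolist())
print(losers_demo["Tm"].tolist())
```

When the column already holds booleans, using the mask and its inverse directly selects the same rows as comparing against True and False.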

In [9]:
winners.tail()

Unnamed: 0,Tm,#Bat,BatAge,wins,R/G,G,PA,AB,R,H,...,OPS+,TB,GDP,HBP,SH,SF,IBB,LOB,old_team,winning_team
7,LAA,51,28.7,85,4.08,162,5990,5417,661,1331,...,97,2144,116,58,37,40,34,1013,True,True
8,MIN,44,28.3,83,4.3,162,6017,5467,696,1349,...,90,2182,133,40,30,41,31,993,False,True
9,NYY,56,31.1,87,4.72,162,6268,5567,764,1397,...,105,2343,105,63,24,54,23,1151,True,True
13,TEX,57,28.6,88,4.64,162,6187,5511,751,1419,...,98,2278,99,76,43,54,32,1130,True,True
14,TOR,52,29.4,93,5.5,162,6232,5509,891,1480,...,118,2518,140,54,36,62,12,1057,True,True


In [10]:
losers.wins.mean()

76.75

In [11]:
print(losers.shape)
print(winners.shape)

(8, 32)
(7, 32)


Finally we import scipy and use the independent t-test to see whether the number of runs scored (column R) differs between the two groups.

In [12]:
import scipy.stats as stats

In [13]:
stats.ttest_ind(losers["R"], winners["R"])

Ttest_indResult(statistic=-2.204133724396779, pvalue=0.046145779112762554)
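
To show what the statistic and p-value represent, the following sketch recomputes the default (equal-variance) independent t-test by hand and compares it with ttest_ind. The arrays `a` and `b` are made-up numbers, not the runs from the MLB file, and the pooled-variance formula is the standard textbook one.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical samples chosen only for illustration.
a = np.array([640.0, 655.0, 690.0, 700.0, 715.0])
b = np.array([660.0, 700.0, 750.0, 765.0, 890.0])

n_a, n_b = len(a), len(b)
dof = n_a + n_b - 2  # degrees of freedom for the pooled test

# Pooled variance: a weighted average of the two sample variances.
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof

# t = difference in means divided by its standard error.
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

result = stats.ttest_ind(a, b)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

The sign of the statistic follows the argument order: it is negative when the first sample has the smaller mean, as with losers versus winners above.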

In [14]:
print(winners.columns)

Index(['Tm', '#Bat', 'BatAge', 'wins', 'R/G', 'G', 'PA', 'AB', 'R', 'H', '2B',
       '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS',
       'OPS+', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB', 'LOB', 'old_team',
       'winning_team'],
      dtype='object')


In [15]:
help(stats.ttest_ind)

Help on function ttest_ind in module scipy.stats.stats:

ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate')
    Calculate the T-test for the means of *two independent* samples of scores.
    
    This is a two-sided test for the null hypothesis that 2 independent samples
    have identical average (expected) values. This test assumes that the
    populations have identical variances by default.
    
    Parameters
    ----------
    a, b : array_like
        The arrays must have the same shape, except in the dimension
        corresponding to `axis` (the first, by default).
    axis : int or None, optional
        Axis along which to compute test. If None, compute over the whole
        arrays, `a`, and `b`.
    equal_var : bool, optional
        If True (default), perform a standard independent 2 sample test
        that assumes equal population variances [1]_.
        If False, perform Welch's t-test, which does not assume equal
        population variance [2]_.
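
The equal_var parameter described in the help text can be tried out as follows. This is a sketch on made-up samples with visibly different spreads and unequal sizes. The names `small_spread` and `large_spread` are hypothetical, and nothing here says which variant is appropriate for the baseball data.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical samples: one tightly clustered, one widely spread.
small_spread = np.array([690.0, 700.0, 705.0, 710.0, 695.0, 700.0, 702.0, 698.0])
large_spread = np.array([650.0, 720.0, 800.0, 760.0, 690.0, 830.0, 750.0])

# Default: pooled-variance test, which assumes equal population variances.
pooled = stats.ttest_ind(small_spread, large_spread)

# Welch's t-test: drops the equal-variance assumption.
welch = stats.ttest_ind(small_spread, large_spread, equal_var=False)

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

With unequal group sizes and unequal sample variances the two variants give different statistics, so it is worth stating which one a reported p-value comes from.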
    
   

In [16]:
stats.ttest_ind(losers.wins, winners.wins)

Ttest_indResult(statistic=-5.056734273759606, pvalue=0.00021982855596534845)

In [20]:
losers.to_csv('losers.csv')

In [19]:
winners.to_csv('winners.csv')
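
One detail worth knowing about to_csv: by default it also writes the row index as an unnamed first column. The sketch below uses a made-up two-row frame and in-memory text instead of files. The name `demo` and its values are hypothetical. Passing index=False is one option if the index is not needed when the file is read back.

```python
import io

import pandas as pd

# Hypothetical frame with a non-default index, as a filtered subset would have.
demo = pd.DataFrame({"Tm": ["AAA", "BBB"], "wins": [90, 70]}, index=[3, 7])

with_index = demo.to_csv()                # no path given: returns the CSV text
without_index = demo.to_csv(index=False)  # leave the index out

reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))

print(reread_with.columns.tolist())     # ['Unnamed: 0', 'Tm', 'wins']
print(reread_without.columns.tolist())  # ['Tm', 'wins']
```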