# Statistical Analysis using Python Numpy

## Introduction

The T test is one of the best-known tests in statistics.
It is used to test a hypothesis about a population using
sample data drawn from that population. To perform a two-sample
T test, we need the size, the mean (average), and the
standard deviation of each sample.

## Task 1: Analyze the T-Test problem and use Python Pandas to read from the CSV into a Data Frame

So in this first task, we're going to describe the problem
that we're trying to solve.
We're going to state a null hypothesis, which is the claim
we will test and then either reject or fail to reject.
For our statistical analysis, we will use the Python Pandas
package to read from the CSV file into a data frame.
We will observe the dimensions of the data frame and we
will print the data frame.


Our data was downloaded from Kaggle.com.
You'll see that it's a comma-separated file with a header
row that defines the columns: __gender__, __race/ethnicity__,
__parental level of education__, __lunch__, __test preparation course__,
__math score__, and so on.

So to begin, we will form a null hypothesis.
We're going to state that there is no difference in mean __math test
scores__ between males and females.
The alternative hypothesis states that there is a difference
in mean math scores between males and females.


In [1]:
# Import package
import pandas as pd
import numpy as np
from scipy import stats
import os

In [2]:
# Read from the scores.csv using pandas
df = pd.read_csv('scores.csv')

In [3]:
# print the shape
print(df.shape)

(1000, 8)


In [8]:
df.columns

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

In [4]:
# print the df
print(df)

     gender race/ethnicity parental level of education         lunch  \
0    female        group B           bachelor's degree      standard   
1    female        group C                some college      standard   
2    female        group B             master's degree      standard   
3      male        group A          associate's degree  free/reduced   
4      male        group C                some college      standard   
..      ...            ...                         ...           ...   
995  female        group E             master's degree      standard   
996    male        group C                 high school  free/reduced   
997  female        group C                 high school  free/reduced   
998  female        group D                some college      standard   
999  female        group D                some college  free/reduced   

    test preparation course  math score  reading score  writing score  
0                      none          72             72         

And we can see that the contents of the comma-separated file
were printed out from the data frame, so we can be
confident that we have our data ready to process. Great!
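
Before moving on, it can help to check the loaded frame for missing values and to see how many rows fall into each group. The sketch below uses a tiny made-up frame (the name `toy` and its values are hypothetical, not the Kaggle data) with the same two column names, so it only illustrates the calls.

```python
import pandas as pd

# Hypothetical miniature stand-in for scores.csv (values are made up).
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female"],
    "math score": [72, 69, 47, 76, 90],
})

# Count missing values per column; all zeros means nothing needs cleaning.
missing = toy.isna().sum()
print(missing)

# Number of rows per gender, which shows whether the groups are balanced.
counts = toy["gender"].value_counts()
print(counts)
```

On the real data frame the same two calls would be `df.isna().sum()` and `df['gender'].value_counts()`.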

In this task, we analyzed the T-test problem and used
Python Pandas to read from the CSV file into a data frame.
In the next task we will obtain two Numpy arrays from the data
frame column to represent female student scores and male
student scores.

## Task 2: Obtain two Numpy arrays from the DataFrame column to represent Female student scores and Male Student scores.

In this task, we will obtain two Numpy arrays from the data
frame columns to represent female student scores and male
student scores.

So we're working with two data sets:
the __female student math scores__
and the __male student math scores__. Because these are two
separate groups of students, we're going to use an independent
two-sample T test (not a paired one) to decide whether or not
to reject the null hypothesis.

We will need an array to contain the data from the __math score__
column for each category of student.
Since our null hypothesis states that there is
no difference in math scores between female and male students,
we need to filter on the __gender__ column and gather the scores
from the __math score__ column for each gender.


In [9]:
# Create an array called FM female math scores.
# We're only going to select the math score column.
# We just want the values in the array.
# We don't want any indices, so we are going to get the raw values
# from that dataframe to our array.

fm = df.loc[df['gender'] == 'female', 'math score'].values

In [10]:
# We're assuming that this is a numpy array.
# So if we just want to verify this, we can do that
# with type, then pass the array name to it.
# We should be able to get our type.

print(type(fm))

<class 'numpy.ndarray'>


In [12]:
# We can also print out the array itself just as a sanity check.

print(fm)

[ 72  69  90  71  88  38  65  50  69  18  54  65  69  67  62  69  63  56
  74  50  75  58  53  50  55  66  57  71  33  82   0  69  59  60  39  58
  41  61  62  47  73  76  71  58  73  65  79  63  58  65  85  58  87  52
  70  77  51  99  75  78  51  55  79  88  87  51  75  59  76  59  42  22
  68  59  70  66  61  75  81  96  58  68  67  79  63  43  81  46  71  52
  97  46  50  65  77  66  62  69  45  78  67  65  57  74  58  42  83  34
  56  55  52  45  72  88  67  64  80  56  58  65  71  60  62  64  70  65
  64  44  99  63  69  88  71  47  65  85  59  65  73  70  37  67  65  67
  74  53  49  73  68  59  77  56  67  75  71  43  41  82  41  83  61  24
  35  61  69  72  77  52  63  46  59  61  42  80  58  52  27  44  73  45
  82  59  80  85  71  80  38  38  67  64  57  62  77  65  50  85  59  49
  88  54  63  82  52  70  51  82  57  47  59  62  61  47  67  78  59  73
  67  64 100  65  53  79 100  71  77  84  26  77  83  63  68  59  90  71
  55  52  49  60  65  64  83  81  54  54  59  66  7

In [13]:
# Get the male math scores
mm = df.loc[df['gender'] == 'male', 'math score'].values
mm

array([ 47,  76,  40,  64,  58,  40,  78,  88,  46,  66,  44,  74,  73,
        69,  70,  40,  97,  81,  57,  55,  59,  65,  82,  53,  77,  53,
        88,  52,  58,  79,  39,  62,  67,  45,  61,  63,  61,  49,  44,
        30,  80,  49,  50,  72,  42,  27,  71,  43,  78,  65,  79,  68,
        60,  98,  66,  62,  54,  84,  91,  63,  83,  72,  65,  82,  89,
        53,  87,  74,  58,  51,  70,  71,  57,  88,  88,  73, 100,  62,
        77,  54,  62,  60,  66,  82,  49,  52,  53,  72,  94,  62,  45,
        65,  80,  62,  48,  76,  77,  61,  59,  55,  69,  59,  74,  82,
        81,  80,  35,  60,  87,  84,  66,  61,  87,  86,  57,  68,  76,
        46,  92,  83,  80,  63,  54,  84,  73,  59,  75,  85,  89,  68,
        47,  80,  54,  78,  79,  76,  59,  69,  58,  88,  83,  73,  53,
        45,  81,  97,  88,  77,  76,  86,  63,  78,  67,  46,  71,  40,
        90,  81,  56,  80,  69,  99,  51,  66,  67,  71,  83,  63,  61,
        28,  82,  71,  47,  62,  90,  76,  49,  58,  67,  79,  6

In [15]:
# quick check of the sizes

print('Male Math Score:',mm.size, ' ', 'Female Math Score:',fm.size)

Male Math Score: 482   Female Math Score: 518


We have a slightly larger sample of female math
scores than of male math scores.
So we need to take that into account when we do the
statistics on these two samples.

Great. So in this task we obtained two Numpy arrays from
the data frame column to represent female student scores and
male student scores.
In the next task we compute the variance of the two arrays using
the standard deviation from each array.
This variance is called the pooled variance.
Because these arrays are two different sizes,
we need to take that size difference into account.

## Task 3: Compute the variance of the two arrays using the standard deviation from each array.

In the last task, we obtained two Numpy arrays
from the data frame column to represent female student scores
and male student scores.

We discovered that these two samples had slightly different
sizes, so we need to take that into account in this task when
we compute the variance of the two arrays using the standard
deviation from each array.

Standard deviation is a measure of how much the values
in the sample vary from the mean of the sample, that is, the average
value of the sample.
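
One detail worth knowing is that NumPy's `np.std` divides by n by default (`ddof=0`, the population form), while the sample standard deviation divides by n - 1 (`ddof=1`). The sketch below shows both on a small made-up array; the name `scores` and its values are hypothetical.

```python
import numpy as np

# Hypothetical sample of five scores (made up for illustration).
scores = np.array([60, 70, 80, 90, 100])

# Population standard deviation: divides by n (NumPy's default, ddof=0).
pop_std = np.std(scores)

# Sample standard deviation: divides by n - 1 (ddof=1).
sample_std = np.std(scores, ddof=1)

print(pop_std, sample_std)
```

For samples of several hundred values, as in this notebook, the two results differ only slightly.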

So we will implement the function that will eventually
return the T value and P value given the two arrays
as inputs. We will compute the mean of each sample using Numpy.
We will then compute the pooled variance,
since the sample sizes are different, by analyzing
the formula and converting it into Python code.

![image.png](attachment:65fd7d2f-557b-4fbb-ab47-3594a5787d62.png)

Above is the formula for standard
error. The highlighted portion shows the pooled variance
calculation that we're going to be implementing here. This is the portion that
we're going to compute in this particular task.

Note: The formula came from the University of Michigan website.
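
As a rough, standalone sketch of the pooled variance idea, the helper below weights each sample's variance by its degrees of freedom (size minus one). The name `pooled_variance` and the two small arrays are hypothetical, and the sketch assumes sample variances (`ddof=1`), which is the usual textbook form.

```python
import numpy as np

def pooled_variance(a, b):
    """Pooled variance of two samples, weighting each by its degrees of freedom."""
    n1, n2 = a.size, b.size
    v1, v2 = np.var(a, ddof=1), np.var(b, ddof=1)  # sample variances
    return ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)

# Two made-up samples of different sizes.
a = np.array([1.0, 2.0, 3.0])              # sample variance 1.0
b = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # sample variance 10.0

pv = pooled_variance(a, b)
print(pv)
```

Because `b` is larger, its variance pulls the pooled value closer to 10 than to 1, which is exactly how the size difference is taken into account.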

In [16]:
# Create a function that will compute the T value and P value.
# NumPy provides functions to compute the mean and the standard
# deviation of the two arrays, which are both NumPy arrays as we
# verified last time.
# Note: np.std defaults to the population standard deviation (ddof=0);
# pass ddof=1 if you want the sample standard deviation instead.

def computeTandP(arr1, arr2):
    t,p = 0,0 # placeholders until the next task
    mean1, mean2 = np.mean(arr1), np.mean(arr2)
    
    s1, s2 = np.std(arr1), np.std(arr2)
    
    # pooled variance: each variance is weighted by (sample size - 1)
    var = (((arr1.size-1)*(s1**2)) + ((arr2.size - 1)*(s2**2)))/(arr1.size+arr2.size-2)
    return t,p

## Task 4: Add the Numpy code to compute the pooled Variance and standard deviation and determine the T-value and P-value of the data sets.

In the last task we computed the variance of the two arrays using the standard
deviation from each array.

In this task we will add the numpy code to the function
to compute the pooled variance and standard deviation
and determine the T value and P value of the data sets.

So we will use the formula again to create the standard
error that we need in order to compute the two-sample T test.

![image.png](attachment:50b42b05-50f7-4bcc-a389-bbcbbc4815e2.png)

So we will use the pooled standard deviation to compute
the standard error.

We will use the difference between the means and the standard
error to compute the T value.

We will compute the degrees of freedom and then compute the P
value using the stats package.


So the first thing we're going to do is go back into our
`computeTandP` function and add the formula
for the pooled standard deviation, which we'll call `sp`.
`sp` is simply the square root of the
pooled variance that we calculated last time.

In [19]:
def computeTandP(arr1, arr2):
    t,p = 0,0
    mean1, mean2 = np.mean(arr1), np.mean(arr2)
    
    s1, s2 = np.std(arr1), np.std(arr2)
    
    # pooled variance: each variance is weighted by (sample size - 1)
    var = (((arr1.size-1)*(s1**2)) + ((arr2.size - 1)*(s2**2)))/(arr1.size+arr2.size-2)
    
    sp = np.sqrt(var) # pooled standard deviation
    
    # The standard error estimates how much the difference between the
    # two sample means would vary from one sample to the next. A smaller
    # standard error means the sample means are more precise estimates.
    ste = sp*np.sqrt(1/arr1.size + 1/arr2.size) # standard error
    
    t = (mean1-mean2) / ste # then we'll compute the t value.
    
    # Degrees of freedom: the number of values that are free to vary once
    # the mean is fixed. For example, in an array of size 100 with a known
    # mean, 99 values can vary freely, but the last one must make the
    # total divided by 100 equal the mean, so there are 99 degrees of
    # freedom. With two samples we fix two means, so we subtract 2.
    dof = arr1.size + arr2.size - 2
    
    # Two-sided P value: the probability of a T value at least this far
    # from zero in either direction. abs(t) keeps the result correct
    # when t is negative.
    p = 2*(1-stats.t.cdf(abs(t),dof))
  
    return t,p # Finally return the T value and P value.

In [20]:
# So finally, we can go down into our main program and actually call the function.

t, p = computeTandP(mm, fm)

In [21]:
# Then we will just simply print out the T value and P value.

print("T value: ", t)
print("P value: ", p)

T value:  5.388622466450282
P value:  8.859047007092613e-08


So we can see now that the T value is relatively large, approximately 5.389.
And the P value is incredibly small, roughly 8.86 times 10 to the minus eighth.
This tells us to reject the null hypothesis that the math scores for males and females are the same.
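
As a cross-check on a hand-written T test, SciPy's `stats.ttest_ind` with `equal_var=True` performs the same pooled-variance test. The sketch below compares the two on synthetic data (the arrays `a` and `b` are randomly generated stand-ins, not the Kaggle scores). It assumes sample variances (`ddof=1`), which is what SciPy uses; `computeTandP` above calls `np.std` with its default `ddof=0`, so its numbers would differ very slightly from SciPy's.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (not the Kaggle scores); seeded for repeatability.
rng = np.random.default_rng(0)
a = rng.normal(loc=68, scale=14, size=482)
b = rng.normal(loc=64, scale=15, size=518)

# Manual pooled-variance t statistic using sample variances (ddof=1).
n1, n2 = a.size, b.size
var = ((n1 - 1) * np.var(a, ddof=1) + (n2 - 1) * np.var(b, ddof=1)) / (n1 + n2 - 2)
ste = np.sqrt(var) * np.sqrt(1 / n1 + 1 / n2)
t_manual = (a.mean() - b.mean()) / ste
p_manual = 2 * stats.t.sf(abs(t_manual), n1 + n2 - 2)  # two-sided P value

# SciPy's independent two-sample t-test with equal variances assumed.
res = stats.ttest_ind(a, b, equal_var=True)
t_scipy, p_scipy = res.statistic, res.pvalue

print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

If the manual and SciPy values agree, the hand-written formula is very likely implemented correctly.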

In the next task we will add a function to remove outliers from each set of data,
then re-compute the T value and P value.


## Task 5: Add a function to remove outliers from each set of data, then re-compute the T-value and P-value.

In the last task we added the numpy code to the function to compute
the pooled variance, the standard deviation and determined
the T value and P value from the data sets.

In this task, we add a function to remove outliers
from each set of data, then re-compute the T value and P
value. We shouldn't normally remove outliers from the source
data itself, at least not before a thorough
investigation and possibly a conversation with the people
who gathered the data.

Instead, we can do this programmatically in Python.
We will compute the lower and upper quartiles.
Then we'll use the interquartile range (IQR) calculated
from them to remove the outliers from the array.

So, given the output of the T value and P value from the last
task, we see a very small P value and a relatively large T
value. This P value is less than 0.05, the conventional
significance level; in fact, a lot less than 0.05.
So this tells us to reject the null hypothesis that there is
no difference in math test scores between males and females,
based on the samples of the population we have obtained
from the CSV file.

It is not our job to make judgements from the statistics;
but only to provide the information so more research
may be done and perhaps more statistical tests based
on other parameters from our gathered data.

Another thing we can do is remove outliers in the data to see
if removing the extreme values has any effect on the T value
and P value.

So let's go ahead and create another function here and let's
call it `removeOutliers`.


In [22]:
def removeOutliers(arr):
    q3,q1 = np.percentile(arr, [75, 25]) # get the percentiles. q3 is 75, q1 is 25.
    iqr = q3 - q1 # calculate interquartile range
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # So once we have the lower and upper limits then we
    # can remove the elements below the lower and above the upper
    # by keeping the middle and return the resulting array.
    arr = arr[(arr >= lower) & (arr <= upper)]
    return arr

Since we already know our T value and P value with the outliers included,
let's go ahead and remove the outliers.
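
To see how the 1.5 × IQR rule behaves, here is a small standalone sketch on a made-up array with one obviously low value. The name `scores` and its values are hypothetical and are not taken from the data set.

```python
import numpy as np

# Hypothetical scores with one obvious low value (made up for illustration).
scores = np.array([0, 55, 60, 62, 65, 68, 70, 72, 75, 80])

# Quartiles and interquartile range.
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values inside the fences.
kept = scores[(scores >= lower) & (scores <= upper)]
print(lower, upper)
print(kept)
```

Only the score of 0 falls outside the fences here, so it is the only value removed.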


In [23]:
# remove the outliers and store the results under the same array names

fm = removeOutliers(fm)
mm = removeOutliers(mm)

In [24]:
# Computing the T and P over again.

t, p = computeTandP(mm, fm)

In [25]:
# print out the T value and P value again.

print("T value: ", t)
print("P value: ", p)

T value:  5.142350675301441
P value:  3.269459232235761e-07


Well, it made our P value a little bit larger and our T value
a little bit smaller, but the T value is still large, and the
P value is still very, very small.
So this still tells us to reject the null hypothesis that
female and male math scores are the same.