# Lambda School Data Science - Unit 1 Sprint 2 Module 1

---

## Assignment:  Hypothesis Testing (t-tests)

# Objectives

* Objective 01 - explain the purpose of a t-test and identify applications
* Objective 02 - set up and run a one-sample t-test
* Objective 03 - set up and run a two-sample independent t-test


# Use the following to answer questions 1) - 5)

Adapted from: A. Bhatnagar and V.K. Mehta (2007). "Efficacy of Deltamethrin and Cyfluthrin Impregnated Cloth Over Uniform Against Mosquito Bites," Medical Journal Armed Forces India, Vol. 63, pp. 120-122.

Mosquito nets have traditionally been an important tool to prevent mosquito bites in parts of the world where malaria is endemic. However, it may not be practical for an army that is on the move to set up and carry mosquito nets each night and day. Impregnating soldiers’ uniforms with insect repellant solves the mobility problem but also has drawbacks. First, the insect repellant quickly becomes ineffective with repeated washing and ironing and must be frequently reapplied. Second, in hot and humid climates the insect repellant can be absorbed through the skin, and the long-term effects of this exposure are unknown. One compromise is to have soldiers apply patches treated with insect repellant to their clothing. These patches would last longer because they would not be washed or ironed and would not expose the entire body to the insect repellant.

### Experiment description:

The `Mosquito.csv` dataset contains data recorded in an experiment conducted on male soldiers in the Indian Army who were stationed in the Tezpur/Solmara garrison in Northeast India.

Thirty soldiers were randomly selected to receive one of three types of single mosquito repellant patch. After giving informed consent, the study participants affixed the patches at predetermined points on their uniforms. Research assistants (who were blinded to the type of repellant used) counted the number of times a mosquito landed on each individual in an hour.

Medical officers with the Indian Army have recorded data on mosquito bites and related illness for many years and can say with authority that the mean number of mosquito touches for soldiers not wearing any mosquito repellant is **8.2 per hour**.

**We wish to determine if wearing a single repellant patch changes the mean number of mosquito touches for soldiers compared to not wearing any mosquito repellant.**
 

In [1]:
# Run this cell to load the dataset in a DataFrame
import pandas as pd

mosquito_url = 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Mosquito.csv'

mosquito_df = pd.read_csv(mosquito_url, skipinitialspace=True, header=0)

In [2]:
# Print out the shape and look at the DataFrame

print(mosquito_df.shape)
mosquito_df.head()

(90, 2)


Unnamed: 0,ID,Mosq_count
0,1,4
1,2,10
2,3,13
3,4,0
4,5,11


### Answer checks

We're going to start using the `assert` statements we learned about earlier to check our work. The cells with these `assert` statements can be skipped or deleted, but you should try to leave them in. They are a way to check your work as you go through this Module Project and to get some feedback if you have an error.

In [3]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check if the DataFrame was created
assert not mosquito_df.empty, 'Make sure to run the cell to load in your dataset.'
# check the shape of the DataFrame
assert mosquito_df.shape == (90,2), 'Is your data loaded correctly?'
print('Correct! Continue to the next question.')

Correct! Continue to the next question.


1)  Write the null and alternative hypotheses for this scenario in words and symbols.

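Null hypothesis: The mean number of mosquito touches per hour for soldiers wearing a single repellant patch is 8.2, the same as for soldiers not wearing any repellant.

Alternative hypothesis: The mean number of mosquito touches per hour for soldiers wearing a single repellant patch is different from 8.2.

$H_0: \mu = 8.2$

$H_a: \mu \neq 8.2$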

2) Calculate the mean number of mosquito touches in the sample. Assign your answer to the variable `mosquito_touch_mean`.

In [4]:
# mean number of mosquito touches

mosquito_touch_mean = mosquito_df['Mosq_count'].mean()
print(mosquito_touch_mean)


8.011111111111111


In [5]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check the value of the mean
assert round(mosquito_touch_mean) == 8, 'Did you use the `.mean()` method?'
print('Correct! Continue to the next question.')

Correct! Continue to the next question.


3) Calculate the standard deviation of the number of mosquito touches in the sample. Assign your answer to `mosquito_touch_std`.

In [7]:
#### YOUR CODE HERE ####
mosquito_touch_std = mosquito_df['Mosq_count'].std()
print(mosquito_touch_std)

3.2825532828777257


In [8]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check the value of the standard deviation
assert round(mosquito_touch_std) == 3, 'Did you use the .std() method?'
print('Correct! Continue to the next question.')

Correct! Continue to the next question.


4) Conduct a 1-sample t-test to test your hypotheses. Assign your t-test result to the variable `mosquito_pval`.

In [12]:
# Use the 'ttest_1samp' from the stats package
from scipy import stats

#### YOUR CODE HERE ####
# ttest_1samp returns (t-statistic, p-value); index 1 is the p-value
mosquito_pval = stats.ttest_1samp(mosquito_df['Mosq_count'], 8.2)
print(mosquito_pval[1])


0.5864980356272131


In [13]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check the p-value
assert round(mosquito_pval[1], 3) == 0.586, 'Did you use the correct population mean?'
print('Correct! Continue to the next question.')

Correct! Continue to the next question.
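To see what `ttest_1samp` computes, the statistic can be rebuilt by hand from the summary values printed in questions 2) and 3). The sketch below is only an illustration and not part of the assignment. It assumes n = 90 (the row count reported by `mosquito_df.shape`), and the variable names are made up for the example.

```python
import math
from scipy import stats

# Summary statistics copied from the outputs of questions 2) and 3)
sample_mean = 8.011111111111111
sample_std = 3.2825532828777257   # sample standard deviation (ddof=1, the pandas default)
n = 90                            # number of rows in mosquito_df
pop_mean = 8.2                    # hypothesized mean under the null hypothesis

# t = (sample mean - hypothesized mean) / standard error of the mean
std_error = sample_std / math.sqrt(n)
t_stat = (sample_mean - pop_mean) / std_error

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(t_stat, p_value)
```

The p-value printed here should agree with the value reported by `ttest_1samp` in question 4), up to floating-point rounding.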


5) Report your conclusion at the 0.05 significance level. Write your answer in the cell below.

Conclusion: The p-value (0.586) is greater than 0.05, so we fail to reject the null hypothesis. The data do not provide evidence that wearing a single repellant patch changes the mean number of mosquito touches from 8.2 per hour.

# Use the following information to answer questions 6) - 11)



More than 14,000 people finished the 2020 Disney Marathon held on January 12. 
The results by age and gender group are included in the Disney.csv dataset. 


**We wish to determine if the mean finishing time for male and female marathon runners is the same or if there is a difference in the mean finishing time between male and female marathon runners.**


Source: Track Shack. 2020. [2020 Disney Marathon Race Results](https://www.trackshackresults.com/disneysports/results/wdw/wdw20/mar_results.php)

In [14]:
# Run this cell to load the dataset into a DataFrame
import pandas as pd

disney_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Disney_Marathon/Disney.csv'

disney_df = pd.read_csv(disney_url, skipinitialspace=True, header=0)

Take a look at your DataFrame. 

In [17]:
#### YOUR CODE HERE ####
disney_mean = disney_df['time'].mean()
print(disney_mean)
disney_df.head()

6.0621262031894485


Unnamed: 0,ID,gender,age,group,time
0,1,M,30,M30-34,2.371944
1,2,M,26,M25-29,2.450556
2,3,M,32,M30-34,2.457778
3,4,M,35,M35-39,2.655833
4,5,M,26,M25-29,2.736111


6)  Write the null and alternative hypotheses for this scenario in words and symbols.

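Null hypothesis: The mean finishing time for male marathon runners is the same as the mean finishing time for female marathon runners.

Alternative hypothesis: The mean finishing time for male marathon runners is different from the mean finishing time for female marathon runners.

$H_0: \mu_M = \mu_F$

$H_a: \mu_M \neq \mu_F$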

7) Create two separate Series (a pandas DataFrame column is a Series):
* one containing finishing times for male participants (`male_finish`)
* one containing finishing times for female participants (`female_finish`)

In [25]:
#### YOUR CODE HERE ####
male_times = disney_df[disney_df['gender']=='M']
female_times = disney_df[disney_df['gender']=='F']
male_finish = male_times['time']
female_finish = female_times['time']

In [26]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check the length of each Series
assert  male_finish.shape == (6577,), 'Make sure you selected M and only have a single column.'
assert  female_finish.shape == (7529,), 'Make sure you selected F and only have a single column'
print('Correct! Continue to the next question')

Correct! Continue to the next question


8) Calculate the mean finishing time for male and female participants separately. Name your variables `male_finish_mean` and `female_finish_mean`.

In [29]:
#### YOUR CODE HERE ####
male_finish_mean = male_finish.mean()
female_finish_mean = female_finish.mean()
print(male_finish_mean)
print(female_finish_mean)

5.799159782400031
6.291841988756132


In [28]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check the values of the means
assert  round(male_finish_mean, 2) == 5.80, 'Did you use the .mean() method?'
assert  round(female_finish_mean, 2) == 6.29, 'Did you use the .mean() method?'
print('Correct! Continue to the next question')

Correct! Continue to the next question


9) Calculate the standard deviation of the finishing times for male and female participants separately. Name your variables `male_finish_std` and `female_finish_std`.

In [30]:
#### YOUR CODE HERE ####
male_finish_std = male_finish.std()
female_finish_std = female_finish.std()
print(male_finish_std)
print(female_finish_std)

1.100676340530379
0.896690100351361


In [31]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check the values of the standard deviations
assert  round(male_finish_std, 2) == 1.10, 'Did you use the .std() method?'
assert  round(female_finish_std, 2) == 0.90, 'Did you use the .std() method?'
print('Correct! Continue to the next question')

Correct! Continue to the next question
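Questions 7) to 9) compute the count, mean, and standard deviation one group at a time. As a cross-check, the same summary can be produced in one step with `groupby`. The sketch below uses a small hypothetical DataFrame (`toy_df`, made-up values) that only borrows the `gender` and `time` column names from `disney_df`; it is not the marathon data.

```python
import pandas as pd

# Hypothetical toy data with the same column names as disney_df
toy_df = pd.DataFrame({
    'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
    'time':   [3.0, 4.0, 5.0, 4.0, 5.0, 6.0],
})

# Count, mean, and sample standard deviation (ddof=1) of 'time' per gender
summary = toy_df.groupby('gender')['time'].agg(['count', 'mean', 'std'])
print(summary)
```

Running the same `groupby(...).agg(...)` call on `disney_df` should give one row per gender, which can be compared against the values checked in the answer cells.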


10) Conduct a 2-sample t-test to test your hypotheses.
* Assign the t-statistic to a variable called `disney_tval`
* Assign the p-value to a variable called `disney_pval`

Hint: The function returns two values and you can assign them with one line (example):

`variable1, variable2 = some.function(arguments)`

In [32]:
# Use the 'ttest_ind' from the stats package
from scipy import stats

#### YOUR CODE HERE ####
disney_tval, disney_pval = stats.ttest_ind(male_finish, female_finish)
print(disney_tval)
print(disney_pval)

-29.27857393997243
5.485138013952879e-183
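By default `ttest_ind` pools the two variances (`equal_var=True`). Because the two groups here have different sizes and somewhat different standard deviations, it can be worth checking whether Welch's version, which does not assume equal variances, leads to the same conclusion. The sketch below is a supplementary check, not the assignment's required method. It rebuilds both tests from the summary statistics printed in questions 7) to 9) using `scipy.stats.ttest_ind_from_stats`; the variable names are made up for the example.

```python
from scipy import stats

# Summary statistics copied from the outputs of questions 7) to 9)
male_mean, male_std, male_n = 5.799159782400031, 1.100676340530379, 6577
female_mean, female_std, female_n = 6.291841988756132, 0.896690100351361, 7529

# Pooled-variance test: same assumption as the default ttest_ind call
pooled = stats.ttest_ind_from_stats(
    male_mean, male_std, male_n,
    female_mean, female_std, female_n,
    equal_var=True,
)

# Welch's test: does not assume the two groups share a variance
welch = stats.ttest_ind_from_stats(
    male_mean, male_std, male_n,
    female_mean, female_std, female_n,
    equal_var=False,
)

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

With the raw Series, the Welch variant can be requested directly with `stats.ttest_ind(male_finish, female_finish, equal_var=False)`.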


11) Report your conclusion at the 0.05 significance level.

The p-value (about 5.5e-183) is far below 0.05, so we reject the null hypothesis that male and female runners have the same mean finishing time. The sample means (5.80 for males and 6.29 for females) indicate that male finishers were faster on average.



---



In your own words: 

12) Explain the Central Limit Theorem.
-  No matter the shape of the distribution the samples are drawn from, if the sample size is large enough, the distribution of the sample means will be approximately Normal and centered on the population mean.

13) Describe the Normal Distribution.
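-  It is a symmetric, bell-shaped distribution described by its mean and standard deviation. Most observations fall near the mean (about 68% within one standard deviation and about 95% within two), and few are far away.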

14) Describe the relationship between the Normal distribution and the t-distribution.
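-  The t-distribution is also symmetric and bell-shaped, but it has heavier tails than the Normal distribution because the standard deviation is estimated from the sample. As the sample size (degrees of freedom) grows, the t-distribution gets closer and closer to the standard Normal distribution.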

15) Write about who William Sealy Gosset was.
-  Gosset worked as head brewer and in quality assurance at Guinness. Publishing under the pseudonym "Student", he developed the t-distribution as a way to draw reliable conclusions from small samples.
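
The ideas in questions 12) to 14) can be checked with a short simulation. The sketch below is an illustration under assumed settings: the exponential distribution, the sample size of 50, the 5000 repetitions, and the seed are all arbitrary choices made for the example. The first part draws many samples from a skewed distribution and looks at their means; the second part compares two-sided 5% critical values of the t-distribution with the Normal value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # fixed seed so the sketch is repeatable

# --- Central Limit Theorem ---
# Draw 5000 samples of size 50 from a skewed (exponential) population
# whose mean and standard deviation are both 1.0
sample_size = 50
n_samples = 5000
draws = rng.exponential(scale=1.0, size=(n_samples, sample_size))
sample_means = draws.mean(axis=1)

print(sample_means.mean())        # should be close to the population mean, 1.0
print(sample_means.std(ddof=1))   # should be close to 1 / sqrt(50)
print(stats.skew(sample_means))   # far less skewed than the population itself

# --- t-distribution vs. Normal distribution ---
# Two-sided 5% critical values: the t values shrink toward the Normal
# value as the degrees of freedom increase
z_crit = stats.norm.ppf(0.975)
t_crits = {df: stats.t.ppf(0.975, df) for df in (5, 30, 89, 1000)}

print(z_crit)
print(t_crits)
```

A histogram of `sample_means` would look roughly bell-shaped even though individual exponential draws are strongly right-skewed.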

# Portfolio Project Milestone

Write the first draft of your research question.  This is a question that should be answerable with two visualizations and a blog post.  

You should have selected a dataset and have a good idea of what your research question is by the end of the day. If you don't, please ask your instructor, track team, or mentor for help.