### FetchMaker

This project was provided by Codecademy:

*Congratulations! You’ve just started working at the hottest new tech startup, FetchMaker. FetchMaker’s mission is to match up prospective dog owners with their perfect pet. FetchMaker has been collecting data on their adoptable dogs, and it’s your job to analyze some of that data.*

In [37]:
# Importing our modules
import numpy as np
import pandas as pd
from scipy.stats import binom_test
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import chi2_contingency

# Loading in our data
data = pd.read_csv('Dog Data.csv')

data.head()

Unnamed: 0,is_rescue,weight,tail_length,age,color,likes_children,is_hypoallergenic,name,breed
0,0,6,2.25,2,black,1,0,Huey,chihuahua
1,0,4,5.36,4,black,0,0,Cherish,chihuahua
2,0,7,3.63,3,black,0,1,Becka,chihuahua
3,0,5,0.19,2,black,0,0,Addie,chihuahua
4,0,5,0.37,1,black,1,1,Beverlee,chihuahua


*FetchMaker estimates (based on historical data for all dogs) that 8% of dogs in their system are rescues.*

*They would like to know if whippets are significantly more or less likely than other dogs to be a rescue.*

In [18]:
# Getting the is_rescue column for the whippet breed
whippet_rescue = data.is_rescue[data.breed == 'whippet']

# number of whippet rescue dogs
num_whippet_rescues = np.sum(whippet_rescue == 1)

# number of whippets
num_whippets = len(whippet_rescue)

We will use a binomial test with the following null and alternative hypotheses:
- Null: 8% of whippets are rescues
- Alternative: more or less than 8% of whippets are rescues
- Significance threshold of 0.05
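For intuition, here is a minimal sketch of what a two-sided exact binomial test computes: the total probability, under the null, of every outcome that is no more likely than the one observed. The helper names and the counts (6 rescues out of 100 dogs) are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_binom_pvalue(x, n, p):
    """Sum the probabilities of all outcomes at most as likely as x."""
    observed = binom_pmf(x, n, p)
    total = 0.0
    for k in range(n + 1):
        pmf = binom_pmf(k, n, p)
        # Small relative tolerance guards against floating-point ties
        if pmf <= observed * (1 + 1e-7):
            total += pmf
    return min(1.0, total)

# Hypothetical counts, for illustration only
example_pval = two_sided_binom_pvalue(x=6, n=100, p=0.08)
print("P-value: {:.2f}".format(example_pval))
```

A count close to the expected 8% yields a large p-value, while a count far from it (say 30 out of 100) yields a p-value near zero.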

In [19]:
# Hypothesis test
pval = binom_test(x = num_whippet_rescues, n = num_whippets, p = 0.08)

print('P-value:',"{:.2f}".format(pval))
print('Significance threshold: 0.05')
print('P-value is not statistically significant, so we fail to reject the null hypothesis.')

P-value: 0.58
Significance threshold: 0.05
P-value is not statistically significant, so we fail to reject the null hypothesis.


### Mid-Sized Dog Weights

*Three of FetchMaker’s most popular mid-sized dog breeds are whippets, terriers, and pitbulls. Is there a significant difference in the average weights of these three dog breeds?*

We will run a single hypothesis test to address the following null and alternative hypotheses:
- Null: whippets, terriers, and pitbulls all weigh the same amount on average
- Alternative: whippets, terriers, and pitbulls do not all weigh the same amount on average (at least one pair of breeds has differing average weights)

This test addresses an association between two variables: a non-binary categorical variable (breed, with three possible options) and a quantitative variable (weight). It is not a good idea to run three separate two-sample t-tests here, because running multiple t-tests increases our chances of a type I error, or a false positive. In order to run a single hypothesis test with three categories, we should use an ANOVA.
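To make the type I error concern concrete, the arithmetic below estimates the chance of at least one false positive across three tests at a 0.05 threshold. It assumes the three tests are independent, which is a simplification (pairwise t-tests on shared groups are not fully independent), so treat the number as a rough illustration.

```python
alpha = 0.05      # per-test significance threshold
num_tests = 3     # whippet-terrier, whippet-pitbull, terrier-pitbull

# Probability that no test produces a false positive, assuming independence
prob_no_false_positive = (1 - alpha) ** num_tests

# Probability of at least one false positive across all tests
familywise_error = 1 - prob_no_false_positive
print("Familywise error rate: {:.4f}".format(familywise_error))
```

Under this assumption the overall false positive rate is roughly 14%, well above the 5% intended for a single test.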



In [24]:
# Weights for each breed
wt_whippets = data.weight[data.breed == 'whippet']
wt_terriers = data.weight[data.breed == 'terrier']
wt_pitbulls = data.weight[data.breed == 'pitbull']

Fstat, pval = f_oneway(wt_whippets, wt_terriers, wt_pitbulls)
print('P-value:',"{:.2f}".format(pval))
print('Significance threshold: 0.05')
print('P-value is statistically significant and indicates strong evidence for the alternative hypothesis.')

P-value: 0.00
Significance threshold: 0.05
P-value is statistically significant and indicates strong evidence for the alternative hypothesis.
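For intuition about the F statistic behind a one-way ANOVA, the sketch below computes it by hand as the ratio of between-group to within-group variance. The function name and the weights are hypothetical and are not taken from the dataset.

```python
def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of individual values around their own group mean
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical weights for three breeds, for illustration only
example_groups = [[20, 22, 24], [30, 32, 34], [26, 28, 30]]
example_f = one_way_f(example_groups)
print("F statistic: {:.2f}".format(example_f))
```

A large F means the group means are spread out relative to the scatter inside each group, which is what drives a small p-value.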


In [31]:
# Running Tukey's range test to determine which pairs of breeds differ in average weight
data_wtp = data[data.breed.isin(['whippet', 'terrier', 'pitbull'])]
tukey = pairwise_tukeyhsd(data_wtp.weight, data_wtp.breed, alpha=0.05)
print(tukey)

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
 group1  group2 meandiff p-adj   lower  upper  reject
-----------------------------------------------------
pitbull terrier   -13.24  0.001 -16.728 -9.752   True
pitbull whippet    -3.34 0.0639  -6.828  0.148  False
terrier whippet      9.9  0.001   6.412 13.388   True
-----------------------------------------------------
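One way to read the table: a pair of breeds is flagged as different when the confidence interval for its mean difference excludes zero. The sketch below applies that rule to the interval bounds copied from the output above; the variable names are hypothetical.

```python
# (group1, group2, lower, upper) copied from the Tukey output above
tukey_rows = [
    ("pitbull", "terrier", -16.728, -9.752),
    ("pitbull", "whippet", -6.828, 0.148),
    ("terrier", "whippet", 6.412, 13.388),
]

# A pair differs significantly when zero lies outside its interval
differs = {
    (g1, g2): not (lower <= 0 <= upper)
    for g1, g2, lower, upper in tukey_rows
}

for pair, is_different in differs.items():
    print(pair, "differ" if is_different else "no significant difference")
```

This matches the reject column: pitbulls vs. terriers and terriers vs. whippets differ in average weight, while pitbulls vs. whippets do not (adjusted p-value 0.0639).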


In [36]:
# FetchMaker wants to know if poodles and shihtzus come in different colors

# Subsetting to poodles and shihtzus (the breed is spelled 'shihtzu' in the data)
data_ps = data[data.breed.isin(['poodle', 'shihtzu'])]

# Creating a contingency table of color vs. breed for these two breeds only
Xtab = pd.crosstab(data_ps.color, data_ps.breed)

print(Xtab)

breed  poodle  shihtzu
color                 
black      17       10
brown      13       36
gold        8        6
grey       52       41
white      10        7


We will run a hypothesis test with the following null and alternative hypotheses:
- Null: There is no association between breed (poodle vs. shihtzu) and color.
- Alternative: There is an association between breed (poodle vs. shihtzu) and color.

This test investigates an association between two categorical variables, so we can use a Chi-Square test.

In [40]:
chi2, pval, dof, exp = chi2_contingency(Xtab)
print('P-value:',"{:.2f}".format(pval))
print('Significance threshold: 0.05')
print('P-value is statistically significant, so we reject the null hypothesis: breed and color are associated.')

P-value: 0.01
Significance threshold: 0.05
P-value is statistically significant, so we reject the null hypothesis: breed and color are associated.
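As a cross-check, the Chi-Square statistic can be sketched by hand for just the two breeds named in the question. The poodle and shihtzu counts below are read off the contingency table printed above; the variable names are hypothetical, and the closed-form p-value shown is valid only because this table has 4 degrees of freedom.

```python
from math import exp

# Counts per color (black, brown, gold, grey, white), read off the table above
poodle = [17, 13, 8, 52, 10]
shihtzu = [10, 36, 6, 41, 7]

grand_total = sum(poodle) + sum(shihtzu)
chi2_stat = 0.0
for p_count, s_count in zip(poodle, shihtzu):
    row_total = p_count + s_count
    for observed, col_total in ((p_count, sum(poodle)), (s_count, sum(shihtzu))):
        # Expected count if breed and color were unrelated
        expected = row_total * col_total / grand_total
        chi2_stat += (observed - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1) = (5 - 1) * (2 - 1) = 4
# For 4 degrees of freedom the upper-tail probability has a closed form
chi2_pval = exp(-chi2_stat / 2) * (1 + chi2_stat / 2)

print("Chi-square: {:.2f}".format(chi2_stat))
print("P-value: {:.4f}".format(chi2_pval))
```

The statistic comes out near 14.7 with a p-value of about 0.005, so the two-breed comparison on its own is significant at the 0.05 threshold. The brown row contributes most of the statistic.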
