<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Hypothesis-Testing---App-startup" data-toc-modified-id="Hypothesis-Testing---App-startup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Hypothesis Testing - App startup</a></span></li><li><span><a href="#Whippet-rescues---binomial-test" data-toc-modified-id="Whippet-rescues---binomial-test-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Whippet rescues - binomial test</a></span></li><li><span><a href="#Significance-of-weights---ANOVA" data-toc-modified-id="Significance-of-weights---ANOVA-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Significance of weights - ANOVA</a></span></li><li><span><a href="#Which-breed-is-significantly-different?---Tukey's-Pairwise-test" data-toc-modified-id="Which-breed-is-significantly-different?---Tukey's-Pairwise-test-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Which breed is significantly different? - Tukey's Pairwise test</a></span></li><li><span><a href="#Significance-of-colour-with-each-dog-breed---Chi-Squared-test" data-toc-modified-id="Significance-of-colour-with-each-dog-breed---Chi-Squared-test-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Significance of colour with each dog breed - Chi-Squared test</a></span></li></ul></div>

In [1]:
import numpy as np
import pandas as pd
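try:
    from scipy.stats import binom_test
except ImportError:  # binom_test was removed in SciPy 1.12; wrap its replacement
    from scipy.stats import binomtest
    def binom_test(x, n, p=0.5):
        return binomtest(x, n, p).pvalue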
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import chi2_contingency

# Hypothesis Testing - App startup

This project uses data from a tech startup that looks to match dogs with prospective owners.

The aim is to do some general analysis on the data for the dogs in the system, and put into practice some hypothesis tests.

In [3]:
dogs = pd.read_csv('dog_data.csv')
dogs.head()

   is_rescue  weight  tail_length  age  color  likes_children  is_hypoallergenic      name      breed
0          0       6         2.25    2  black               1                  0      Huey  chihuahua
1          0       4         5.36    4  black               0                  0   Cherish  chihuahua
2          0       7         3.63    3  black               0                  1     Becka  chihuahua
3          0       5         0.19    2  black               0                  0     Addie  chihuahua
4          0       5         0.37    1  black               1                  1  Beverlee  chihuahua


In [4]:
len(dogs)

800

# Whippet rescues - binomial test

If we typically expect 8% of the dogs on the platform to be rescues, we can test whether a particular breed's rescue rate differs significantly from that average.

Looking at whippets:

In [5]:
whippet_rescue = dogs.is_rescue[dogs.breed == 'whippet']
num_whippet_rescues = np.sum(whippet_rescue == 1)
print(num_whippet_rescues)

6


In [6]:
num_whippets = len(whippet_rescue)
print(num_whippets)

100


In [7]:
# running a binomial test on the number of whippets being rescues
pval = binom_test(num_whippet_rescues, num_whippets, 0.08)
print(pval)

0.5811780106238098


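The p-value is 0.58, well above the significance threshold of 0.05, so we cannot reject the null hypothesis that whippets are rescues at the expected 8% rate. The observed 6 rescues out of 100 whippets is consistent with the platform-wide average.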
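
To make the p-value less of a black box, here is a minimal sketch of an exact two-sided binomial test using only the standard library. The helper names `binom_pmf` and `two_sided_binom_pvalue` are illustrative, not part of SciPy, and the approach (summing the probability of every outcome no more likely than the observed one) is assumed to mirror what the SciPy call does by default.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_binom_pvalue(k, n, p):
    """Sum the probability of every outcome that is no more likely than
    the observed count k under the null hypothesis."""
    # Small relative tolerance so floating-point ties still count as extreme.
    cutoff = binom_pmf(k, n, p) * (1 + 1e-7)
    pmfs = (binom_pmf(i, n, p) for i in range(n + 1))
    return sum(prob for prob in pmfs if prob <= cutoff)

# 6 rescues among 100 whippets, tested against the expected 8% rate.
p_value = two_sided_binom_pvalue(6, 100, 0.08)
print(p_value)
```

With the whippet counts this should reproduce the p-value of roughly 0.58 printed above.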

# Significance of weights - ANOVA

We can run an analysis of variance (ANOVA) on the weights of several dog breeds to test whether mean weight differs significantly among them.

In [8]:
wt_whippets = dogs.weight[dogs.breed == 'whippet']
wt_terriers = dogs.weight[dogs.breed == 'terrier']
wt_pitbulls = dogs.weight[dogs.breed == 'pitbull']

In [9]:
# running ANOVA on the dog weights
Fstat, pval = f_oneway(wt_whippets, wt_terriers, wt_pitbulls)
print(pval)

3.276415588274815e-17
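
As a rough illustration of what the ANOVA computes, the sketch below derives the F statistic by hand: the variance between group means divided by the variance within groups. The function name `one_way_f_statistic` and the three small samples are hypothetical and are not taken from `dog_data.csv`; turning F into a p-value additionally needs the F distribution, which `f_oneway` handles.

```python
from statistics import mean

def one_way_f_statistic(*groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of each group mean around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical weights for three tiny groups, for illustration only.
f_stat = one_way_f_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0])
print(f_stat)
```

A large F means the group means are spread out far more than the scatter within each group would explain, which is what drives a very small p-value like the one above.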


In [10]:
# Subset to just whippets, terriers, and pitbulls
dogs_wtp = dogs[dogs.breed.isin(['whippet', 'terrier', 'pitbull'])]

# Which breed is significantly different? - Tukey's Pairwise test

The ANOVA above returned a p-value far below 0.05, which indicates that at least one of the three breeds differs in mean weight but does not say which. Tukey's range test expands on this by comparing every pair of breeds and reporting which pairs differ significantly.

In [11]:
# Run a Tukey's Range test
output = pairwise_tukeyhsd(dogs_wtp.weight, dogs_wtp.breed)
print(output)

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
 group1  group2 meandiff p-adj   lower  upper  reject
-----------------------------------------------------
pitbull terrier   -13.24  0.001 -16.728 -9.752   True
pitbull whippet    -3.34 0.0639  -6.828  0.148  False
terrier whippet      9.9  0.001   6.412 13.388   True
-----------------------------------------------------


The pitbull and whippet weights are not significantly different (p-adj of 0.0639, with a confidence interval for the mean difference that includes zero), whereas pitbull vs terrier and terrier vs whippet both differ significantly (p-adj of 0.001).
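
One way to read the table is that a pair is flagged `reject = True` exactly when its confidence interval for the mean difference excludes zero. The short sketch below checks that against the interval bounds copied from the output above; the helper name `excludes_zero` is illustrative.

```python
# (group1, group2, lower, upper) copied from the Tukey HSD output above.
intervals = [
    ("pitbull", "terrier", -16.728, -9.752),
    ("pitbull", "whippet", -6.828, 0.148),
    ("terrier", "whippet", 6.412, 13.388),
]

def excludes_zero(lower, upper):
    """True when the interval lies entirely above or entirely below zero."""
    return not (lower <= 0 <= upper)

significant_pairs = [
    (g1, g2) for g1, g2, lower, upper in intervals if excludes_zero(lower, upper)
]
print(significant_pairs)
# → [('pitbull', 'terrier'), ('terrier', 'whippet')]
```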

# Significance of colour with each dog breed - Chi-Squared test

In [12]:
# Subset to just poodles and shihtzus
dogs_ps = dogs[dogs.breed.isin(['poodle', 'shihtzu'])]

In [14]:
# Create a contingency table of color vs. breed
Xtab = pd.crosstab(dogs_ps.color, dogs_ps.breed)
Xtab

breed  poodle  shihtzu
color
black      17       10
brown      13       36
gold        8        6
grey       52       41
white      10        7


In [15]:
# Run a Chi-Square test
chi2, pval, dof, exp = chi2_contingency(Xtab)
print(pval)

0.005302408293244593


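This p-value (about 0.005) is below the 0.05 threshold, so the result is statistically significant. In other words, among poodles and shihtzus, colour and breed are not independent: the distribution of colours differs between the two breeds. The test shows that an association exists, not how strong it is.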
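
For intuition, the chi-squared statistic can be reproduced by hand from the contingency table: each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The sketch below uses only the standard library and assumes no continuity correction is applied, which is the case here because the table has more than one degree of freedom. The closed-form p-value on the last lines is valid only for 4 degrees of freedom.

```python
from math import exp

# Observed counts from the crosstab above: colour -> (poodle, shihtzu).
observed = {
    "black": (17, 10),
    "brown": (13, 36),
    "gold": (8, 6),
    "grey": (52, 41),
    "white": (10, 7),
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(row_totals)

chi2_stat = 0.0
for r, row in enumerate(rows):
    for c, count in enumerate(row):
        # Count expected in this cell if colour and breed were independent.
        expected = row_totals[r] * col_totals[c] / grand_total
        chi2_stat += (count - expected) ** 2 / expected

dof = (len(rows) - 1) * (len(col_totals) - 1)

# Chi-squared survival function in closed form; this formula holds for dof == 4 only.
p_value = exp(-chi2_stat / 2) * (1 + chi2_stat / 2)
print(chi2_stat, dof, p_value)
```

The largest contribution comes from the brown row (13 poodles against 36 shihtzus, where about 24.5 of each would be expected), so that colour drives most of the significance.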