We would like to know whether the effects we see in the sample (observed data) are likely to occur in the population.

The way classical hypothesis testing works is by conducting a statistical test to answer the following question:

Given the sample and an effect, what is the probability of seeing that effect just by chance?

Here are the steps we would follow:

1. Compute the test statistic
2. Define the null hypothesis
3. Compute the p-value
4. Interpret the result

If the p-value is very low (more often than not, below 0.05), the effect is considered statistically significant. That means the effect is unlikely to have occurred by chance. The inference? The effect is likely to be seen in the population too.

This process is very similar to the proof-by-contradiction paradigm. We first assume that the effect is not real. That's the null hypothesis. The next step is to compute the probability of seeing an effect at least as large as the observed one under that assumption (the p-value). If the p-value is very low (<0.05 as a rule of thumb), we reject the null hypothesis.
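To make "the probability of seeing that effect just by chance" concrete, here is a minimal sketch of the four steps as a permutation test. It is an illustration only: the two samples are synthetic (not the weed price data), and the names `group_a`, `group_b` and `n_shuffles` are made up for this example. The notebook itself uses a t-test rather than this approach.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic samples for illustration only (NOT the weed price data)
group_a = rng.normal(loc=250.0, scale=2.0, size=31)
group_b = rng.normal(loc=245.0, scale=2.0, size=31)

# Step 1: test statistic = observed difference in means
observed_diff = group_a.mean() - group_b.mean()

# Step 2: null hypothesis = both groups come from the same distribution,
# so the group labels are interchangeable
pooled = np.concatenate([group_a, group_b])

# Step 3: p-value = share of label shuffles that give a difference
# at least as extreme as the observed one
n_shuffles = 10_000
count = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:len(group_a)].mean() - shuffled[len(group_a):].mean()
    if abs(diff) >= abs(observed_diff):
        count += 1
p_value = count / n_shuffles

# Step 4: interpret the result
print("Observed difference:", observed_diff)
print("p-value:", p_value)
print("reject null hypothesis" if p_value < 0.05 else "do not reject null hypothesis")
```

The shuffling simulates "chance": if the labels did not matter, a difference as large as the observed one would show up often among the shuffles.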

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib as mpl
%matplotlib inline

In [2]:
import seaborn as sns
sns.set(color_codes=True)

In [5]:
weed_pd = pd.read_csv("Weed_Price.csv", parse_dates=[-1])

# parse_dates=[-1] means that the last column is parsed as a date column while reading the csv file

In [6]:
weed_pd["month"] = weed_pd.date.dt.month
weed_pd["year"] = weed_pd.date.dt.year

In [7]:
weed_pd.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,month,year
0,Alabama,339.06,1042,198.64,933,149.49,123,2014-01-01,1,2014
1,Alaska,288.75,252,260.6,297,388.58,26,2014-01-01,1,2014
2,Arizona,303.31,1941,209.35,1625,189.45,222,2014-01-01,1,2014
3,Arkansas,361.85,576,185.62,544,125.87,112,2014-01-01,1,2014
4,California,248.78,12096,193.56,12812,192.92,778,2014-01-01,1,2014


# Let's work on weed prices in California in 2014


In [8]:
weed_ca_2014 = weed_pd[(weed_pd.State=="California") & (weed_pd.year==2014)]

In [10]:
#Mean and standard deviation of high quality weed's price
print( "Mean:", weed_ca_2014.HighQ.mean())
print( "Standard Deviation:", weed_ca_2014.HighQ.std())

Mean: 245.89423076923077
Standard Deviation: 1.289907939371412


In [11]:

#Confidence interval on the mean
stats.norm.interval(0.95, loc=weed_ca_2014.HighQ.mean(), scale = weed_ca_2014.HighQ.std()/np.sqrt(len(weed_ca_2014)))


(245.7617184927259, 246.02674304573566)
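The interval above uses the normal distribution. For small samples, a common alternative is an interval based on the t-distribution with `n - 1` degrees of freedom. The sketch below compares the two on synthetic data; the `prices` array is an assumption made for illustration and does not come from `Weed_Price.csv`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic prices for illustration only (NOT the weed price data)
prices = rng.normal(loc=246.0, scale=1.3, size=30)

n = len(prices)
mean = prices.mean()
# ddof=1 gives the sample standard deviation, matching pandas' default .std()
sem = prices.std(ddof=1) / np.sqrt(n)

# Normal-based interval, as in the cell above
normal_ci = stats.norm.interval(0.95, loc=mean, scale=sem)

# t-based interval with n - 1 degrees of freedom
t_ci = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

print("Normal CI:", normal_ci)
print("t CI:     ", t_ci)
```

The t-based interval is always somewhat wider than the normal one; the gap shrinks as the sample grows.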

# Question: Are high-quality weed prices in Jan 2014 significantly higher than in Jan 2015?

In [12]:

#Get the data
weed_ca_jan2014 = np.array(weed_pd[(weed_pd.State=="California") & (weed_pd.year==2014) & (weed_pd.month==1)].HighQ)
weed_ca_jan2015 = np.array(weed_pd[(weed_pd.State=="California") & (weed_pd.year==2015) & (weed_pd.month==1)].HighQ)

In [14]:
print( "Mean-2014 Jan:", weed_ca_jan2014.mean())
print( "Mean-2015 Jan:", weed_ca_jan2015.mean())

Mean-2014 Jan: 248.4454838709677
Mean-2015 Jan: 243.60225806451612


In [15]:
print( "Effect size:", weed_ca_jan2014.mean() - weed_ca_jan2015.mean())

Effect size: 4.843225806451585


Null Hypothesis: The mean prices in Jan 2014 and Jan 2015 are the same.

Perform t-test and determine the p-value.

In [21]:
statistic, pval = stats.ttest_ind(weed_ca_jan2014, weed_ca_jan2015, equal_var=True)

In [22]:
if pval<0.05:
    print("reject null hypothesis")
else:
    print("do not reject null hypothesis")

reject null hypothesis



The p-value is the probability of seeing an effect size at least this large by chance alone, that is, if the null hypothesis were true. And here, the p-value is almost 0.

Conclusion: The price difference is statistically significant. But is a price difference of $4.84 a big deal? The price decreased in 2015 by almost 2%. Always remember to look at the effect size.
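One way to look at effect size is to report it next to the p-value, both as a relative change and as a standardized difference. The sketch below is an assumption-laden illustration: `cohens_d` is a hypothetical helper (Cohen's d is a standard measure, but it is not used elsewhere in this notebook), and the two arrays are synthetic stand-ins for the January prices.

```python
import numpy as np

def cohens_d(a, b):
    """Hypothetical helper: difference in means divided by the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)

# Synthetic stand-ins for illustration only (NOT the weed price data)
jan2014 = rng.normal(loc=248.4, scale=1.0, size=31)
jan2015 = rng.normal(loc=243.6, scale=1.0, size=31)

# Raw difference, relative change, and standardized difference
raw_diff = jan2014.mean() - jan2015.mean()
relative_change = (jan2015.mean() - jan2014.mean()) / jan2014.mean()

print("Raw difference:", raw_diff)
print("Relative change:", relative_change)
print("Cohen's d:", cohens_d(jan2014, jan2015))
```

A difference can be large in standardized terms (because day-to-day prices barely vary) and still small in practical terms, which is why the relative change is worth reporting too.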

Problem: Determine if prices of medium-quality weed in New York are significantly different between Jan 2015 and Feb 2015.

 

## Assumption of t-test


One assumption is that the data used came from a normal distribution.


There's a [Shapiro-Wilk test](https://en.wikipedia.org/wiki/Shapiro-Wilk) to test for normality. Its null hypothesis is that the data came from a normal distribution, so if the p-value is less than 0.05, we reject normality.

In [17]:
stats.shapiro(weed_ca_jan2015)


ShapiroResult(statistic=0.9469061493873596, pvalue=0.1281934529542923)

In [18]:
stats.shapiro(weed_ca_jan2014)

ShapiroResult(statistic=0.9353516101837158, pvalue=0.06142302602529526)

In [19]:
#Both p-values are above 0.05, so we do not reject normality. We seem to be good.
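To see what a failed normality check looks like, the sketch below runs the same `stats.shapiro` call on two synthetic samples: one drawn from a normal distribution and one from a strongly skewed (exponential) distribution. Both samples are assumptions made for illustration and are unrelated to the weed price data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic samples for illustration only (NOT the weed price data)
normal_sample = rng.normal(loc=245.0, scale=2.0, size=200)
skewed_sample = rng.exponential(scale=2.0, size=200)

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    result = stats.shapiro(sample)
    # Null hypothesis of Shapiro-Wilk: the sample comes from a normal distribution
    verdict = "reject normality" if result.pvalue < 0.05 else "do not reject normality"
    print(name, result.statistic, result.pvalue, verdict)
```

Note that a p-value above 0.05 does not prove normality; it only means the test found no strong evidence against it.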

## Exercise: Impact of regulation and deregulation.

Information on regulation of Weed in the US by State [wiki](Impact of regulation and deregulation on a couple of states )

Alaska legalized it on 4th Nov 2014. Find if prices significantly changed in Dec 2014 compared to Oct 2014.

Maryland decriminalized possession of weed from Oct 1, 2014. Find if prices of weed changed significantly in Oct 2014 compared to Sep 2014.

## Chi-Square test for goodness of fit

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where

O is the observed frequency

E is the expected frequency

$\chi^2$ is the chi-square statistic
 
Let's take the proportions of people who bought High, Medium and Low quality weed in Jan 2014 as the expected proportions. Find if the proportions of people who bought weed in Jan 2015 conformed to that norm.

In [24]:

weed_jan2014 = weed_pd[(weed_pd.year==2014) & (weed_pd.month==1)][["HighQN", "MedQN", "LowQN"]]
weed_jan2015 = weed_pd[(weed_pd.year==2015) & (weed_pd.month==1)][["HighQN", "MedQN", "LowQN"]]

In [25]:
Expected = np.array(weed_jan2014.sum(axis=0))
Observed = np.array(weed_jan2015.sum(axis=0))

In [27]:
print( "Expected:", Expected, "\n" , "Observed:", Observed)

Expected: [2918004 2644757  263958] 
 Observed: [4057716 4035049  358088]


In [28]:

print( "Expected:", Expected/np.sum(Expected.astype(float)), "\n" , "Observed:", Observed/np.sum(Observed.astype(float)))

Expected: [0.5007971  0.45390159 0.04530131] 
 Observed: [0.48015461 0.47747239 0.042373  ]


In [32]:
from scipy.stats import chisquare

In [33]:
stats.chisquare(Observed,Expected)

ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
0.45036220212438594

In [36]:
statistic = ((Observed - Expected)**2/Expected).sum()

In [37]:
statistic

1209562.2775169075
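The `ValueError` above arises because `stats.chisquare` requires the observed and expected frequencies to have the same total, and the Jan 2015 total is about 45% larger than the Jan 2014 total. The manual statistic mixes that difference in totals with the difference in proportions. One possible fix, sketched below, is to rescale the expected counts to the observed total so that only the proportions are compared. The counts are copied from the printed output above; `expected_scaled` is a name introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Counts copied from the printed output above
expected = np.array([2918004, 2644757, 263958], dtype=float)  # Jan 2014
observed = np.array([4057716, 4035049, 358088], dtype=float)  # Jan 2015

# Rescale the expected counts so that they sum to the observed total,
# keeping the Jan 2014 proportions unchanged
expected_scaled = expected / expected.sum() * observed.sum()

result = stats.chisquare(observed, f_exp=expected_scaled)
print("Chi-square statistic:", result.statistic)
print("p-value:", result.pvalue)

if result.pvalue < 0.05:
    print("reject null hypothesis")
else:
    print("do not reject null hypothesis")
```

With counts in the millions, even small shifts in proportion give a very low p-value, so it is worth comparing the proportions themselves (printed earlier) before calling the change important.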