<a href="https://colab.research.google.com/github/AilingLiu/Inferential_Statistics/blob/master/Inferential_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook summarizes the testing methods from <b>Inferential Statitistics </b> course taught by <u>University of Amsterdam</u> on [Coursera](https://www.coursera.org/learn/inferential-statistics). The course had taught how to conduct statistical test usng R. Here, I am using Python to do the test. All the formulas used in this document can be found [here](https://github.com/AilingLiu/Inferential_Statistics/blob/master/FormulasTables.pdf).

In [0]:
import numpy as np
import pandas as pd
import scipy.stats as st

# Compare Two Groups

<b>Construct Hypotheses</b>

When we are testing between two competing hypotheses, a null hypothesis $H_0$ and an alternative hypothesis $H_1$, we generally assume that the null hypothesis is true unless the data shows a strong indication that this is not the case. 

By doing hypotheses testing, we <u>test the probability of finding a sample statistic given that the null hypothesis is true</u>. If the null hypothesis is true, the difference between a sample statistics and the population parameter is <b>due to sampling error</b>, that is, fluctuations in the sample from the population. However, **if the probability of finding a sample statistic as extreme as ours under the null hypothesis is very small, we generally reject the null hypothesis**.

> Test your understanding:


1.   Imagine we have found a p value of 0.30 called p1 and another p value of 0.02 called p2, do these p values indicate strong evidence or weak evidence in favour of the null hypothesis? 
>> Answer: p1 indicates strong evidence in favour of the null hypothesis; p2 indicates weak evidence in favour of the null hypothesis.
2.   What does a p value of 0.20 mean?
>> Answer: A p value of 0.20 means that there's a probability of 20% of obtaining a similar result or more extreme given that the null hypothesis is true



## Z test to compare two proportions from independent samples


We usually calculate two things:

1.   The difference between two sample proportions
2.   The standard error

> Example
<br>In this exercise we have a sample of 100 males with a proportion of left wing voters of 0.6 and a sample of 150 females with a proportion of left-wing voters of 0.42. 

In [0]:
nmale=100
nfemale=150
malep=0.6
femalep=0.42

#pooled proportion
poolp=(nmale*malep+nfemale*femalep)/(nmale+nfemale)

#standard error under the null hypothesis
se=np.sqrt(poolp*(1-poolp)*(1/nmale + 1/nfemale))

#z calculated value
z_val = (malep - femalep)/se

#corresponding p value
p_val = (1-st.norm.cdf(z_val))*2

sig = 0.05
if p_val <=sig:
  conclusion='Rejected'
else:
  conclusion='Not enough evidence to reject'

print(f'Calculated Z value: {z_val:.4f}\nPvalue is: {p_val:.4f} \nConclusion on Null Hypothesis given {sig} significance level: {conclusion}')

Calculated Z value: 2.7889
Pvalue is: 0.0053 
Conclusion on Null Hypothesis given 0.05 significance level: Rejected


Another way to conduct the test is to get the confidence interval of the difference from the two proportions. If 0 (null hypothesis) falls inside the interval, we will reject null hypotheseis. We need two parameters to conduct this test:

1.   The z score corresponding to the selected confidence level: $(1-conf)/2$.
2.   The standarad error for the difference between two proportions

In [0]:
#z score under given confidence level
sig=0.01
z_score = np.abs(st.norm.ppf(sig/2))

# standard error for the difference
sep=np.sqrt(malep*(1-malep)/nmale + femalep*(1-femalep)/nfemale)

#lower bound of confidence interval
lb = (malep-femalep) - z_score*sep
#upper bound of confidence interval
ub = (malep-femalep) + z_score*sep

print(f'{(1-sig)*100} percent confidence interval:\n[{lb:.4f}, {ub:.4f}]')

99.0 percent confidence interval:
[0.0166, 0.3434]


There are differences and these differences are significant because the 99% confidence interval does not contain 0.

The equivalent z test for two independent proportions is [proportions_ztest](https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportions_ztest.html).

In [0]:
#equivalent in python

from statsmodels.stats.proportion import proportions_ztest
x_success = np.array([nmale*malep, nfemale*femalep])
n_total = np.array([nmale, nfemale])
z_val, p_val = proportions_ztest(count=x_success, nobs=n_total, alternative='two-sided')

#make a function to give conclusion directly based on pvalue and significance level
def testeval(sig, pval):
  
  """
  Conclusion of rejection status on null hypothesis given significance level
  and the pvalue corresponding to calculated test statistic.

  Parameters
  ----------
  sig: float
    Significance level. Governs the chance of a false positive.
      A significance level of 0.05 means that there is a 5% chance of
      a false positive.
  
  pval: float
    Calculated p value. The probability of obtaining a similar results
    or more extreme given null hypothesis is true.
  
  Returns:
  --------
  result: string
    Conclusion of rejection status on null hypothesis.
  """

  if pval <=sig:
    result = 'Rejected'
  else:
    result = 'Not enough evidence to reject'
  return result

siglevel = 0.05
conclusion = testeval(siglevel, p_val)
print(f'Calculated Z value: {z_val:.4f}\nPvalue is: {p_val:.4f} \nConclusion on Null Hypothesis given {siglevel} significance level: {conclusion}\n')

siglevel = 0.01
conclusion = testeval(siglevel, p_val)
print(f'Calculated Z value: {z_val:.4f}\nPvalue is: {p_val:.4f} \nConclusion on Null Hypothesis given {siglevel} significance level: {conclusion}\n')

def prop_confint_2ind(count, nobs, alpha=0.05):
  
  """
  A/B test for two proportions;
  given a success a trial size of group A and B compute
  its confidence interval;
  resulting confidence interval matches R's prop.test function

  Parameters
  ----------
  count: array
      Number of successes in each group

  nobs: array
      Size, or number of observations in each group

  alpha : float, default 0.05
      Significance level. Governs the chance of a false positive.
      A significance level of 0.05 means that there is a 5% chance of
      a false positive. In other words, our confidence level is
      1 - 0.05 = 0.95

  Returns
  -------
  prop_diff : float
      Difference between the two proportion

  confint : 1d ndarray
      Confidence interval of the two proportion test
  """  

  a_success, b_success = count[0], count[1]
  a_size, b_size = nobs[0], nobs[1]
  a_prop, b_prop = a_success/a_size, b_success/b_size
  prop_diff = a_prop-b_prop

  #z score under given confidence level
  z_score = np.abs(st.norm.ppf(alpha/2))

  # standard error for the difference
  sep=np.sqrt(a_prop*(1-a_prop)/a_size + b_prop*(1-b_prop)/b_size)

  #lower bound of confidence interval
  lb = prop_diff - z_score*sep
  #upper bound of confidence interval
  ub = prop_diff + z_score*sep
  return prop_diff, [lb, ub]

sig=0.01
diff, [lowerb, upperb] = prop_confint_2ind(count=x_success, nobs=n_total, alpha=sig)
print(f'{(1-sig)*100} percent confidence interval:\n[{lowerb:.4f}, {upperb:.4f}]')

Calculated Z value: 2.7889
Pvalue is: 0.0053 
Conclusion on Null Hypothesis given 0.05 significance level: Rejected

Calculated Z value: 2.7889
Pvalue is: 0.0053 
Conclusion on Null Hypothesis given 0.01 significance level: Rejected

99.0 percent confidence interval:
[0.0166, 0.3434]


## T test to compare compare two means from independent samples

we usually calculate 2 other things first

1.   The difference between two independent sample means
2.   The standard error of the difference between two independent sample means

> Example
<br>In this exercise we have a sample of 100 males that do sports on average 4.2 hours per week and a sample of 150 females that do sports on average 5.8 hours per week. 
*  Case a: the population variances are unequal in two groups
*  Case b: the populatin variances are equal in two groups

In [0]:
#Case a: the population variances are unequal
nmale=100
malemean=4.2
stdmale=2.3
nfemale=150
femalemean=5.8
stdfemale=3.1

#standard eror for the difference between two means
se=np.sqrt(stdmale**2/nmale+stdfemale**2/nfemale)

#mean difference
diff=malemean-femalemean

#t value
t_val=diff/se

#degree of freedom
df=se**2/((1/(nmale-1)*(stdmale**2/nmale)**2)+(1/(nfemale-1)*(stdfemale**2/nfemale)**2))

#calculate the p value
pval=(1-st.t.cdf(np.abs(t_val), df))*2
siglevel=0.01
conclusion=testeval(siglevel, pval)
print(f'Calculated T value: {t_val:.4f}\nPvalue is: {pval:.4f} \nConclusion on Null Hypothesis given {siglevel} significance level: {conclusion}\n')

# calculate the 99% confidence interval
t_score=np.abs(st.t.ppf(siglevel/2, df))
lb = diff-t_score*(se)
ub = diff+t_score*(se)
print(f'{(1-siglevel)*100} percent confidence interval:\n[{lb:.4f}, {ub:.4f}]')


Calculated T value: -4.6783
Pvalue is: 0.0000 
Conclusion on Null Hypothesis given 0.01 significance level: Rejected

99.0 percent confidence interval:
[-2.4817, -0.7183]


In [0]:
#Case b: the population variances are equal
nmale=100
malemean=4.2
nfemale=150
femalemean=5.8
std=2.8

#mean difference
diff=malemean-femalemean

#pooled standard deviation
s=np.sqrt(((nmale-1)*std**2 + (nfemale-1)*std**2)/(nmale-1+nfemale-1))

#standard eror for the difference between two means
se=s*np.sqrt(1/nmale+1/nfemale)

#t value
t_val=diff/se

#degree of freedom
df=nmale+nfemale-2

#calculate the p value
pval=(1-st.t.cdf(np.abs(t_val), df))*2
siglevel=0.01
conclusion=testeval(siglevel, pval)
print(f'Calculated T value: {t_val:.4f}\nPvalue is: {pval:.4f} \nConclusion on Null Hypothesis given {siglevel} significance level: {conclusion}\n')

# calculate the 99% confidence interval
t_score=np.abs(st.t.ppf(siglevel/2, df))
lb = diff-t_score*(se)
ub = diff+t_score*(se)
print(f'{(1-siglevel)*100} percent confidence interval:\n[{lb:.4f}, {ub:.4f}]')


Calculated T value: -4.4263
Pvalue is: 0.0000 
Conclusion on Null Hypothesis given 0.01 significance level: Rejected

99.0 percent confidence interval:
[-2.5383, -0.6617]


Equivalent t test for two independent is [ttest_ind](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.ttest_ind.html) from scipy or [ttest_ind](https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ttest_ind.html) from statsmodels. Both methods take data points as array directly, without specifically giving mean, standard deviation, or size.

In [0]:
from statsmodels.stats.weightstats import ttest_ind

#generate random data with mean, std, size as above sample.
## equal variance
rvmale=np.random.normal(loc=malemean, scale=std, size=nmale)
rvmale_fix = (rvmale - np.mean(rvmale)) * (std / np.std(rvmale)) + malemean #fix mean problem
rvfemale=np.random.normal(loc=femalemean, scale=std, size=nfemale)
rvfemale_fix = (rvfemale - np.mean(rvfemale)) * (std / np.std(rvfemale)) + femalemean #fix mean problem

t_val, pval, df=ttest_ind(rvmale_fix, rvfemale_fix, alternative='two-sided', usevar='pooled', value=0)
conclusion=testeval(0.01, pval)
print(f'Calculated T value: {t_val:.4f}\nPvalue is: {pval:.4f} \nConclusion on Null Hypothesis given {0.01} significance level: {conclusion}\n')

Calculated T value: -4.4085
Pvalue is: 0.0000 
Conclusion on Null Hypothesis given 0.01 significance level: Rejected



In [0]:
## unequal variance
rvmale=np.random.normal(loc=malemean, scale=stdmale, size=nmale)
rvmale_fix = (rvmale - np.mean(rvmale)) * (stdmale / np.std(rvmale)) + malemean #fix mean problem
rvfemale=np.random.normal(loc=femalemean, scale=stdfemale, size=nfemale)
rvfemale_fix = (rvfemale - np.mean(rvfemale)) * (stdfemale / np.std(rvfemale)) + femalemean #fix mean problem

t_val, pval, df=ttest_ind(rvmale_fix, rvfemale_fix, alternative='two-sided', usevar='unequal', value=0)
conclusion=testeval(0.01, pval)
print(f'Calculated T value: {t_val:.4f}\nPvalue is: {pval:.4f} \nConclusion on Null Hypothesis given {0.01} significance level: {conclusion}\n')

Calculated T value: -4.6591
Pvalue is: 0.0000 
Conclusion on Null Hypothesis given 0.01 significance level: Rejected



How to interpret the result?

Given that the null hypothesis is true, there is a probability of 0.000005 (5.21345e-06) of obtaining a result equally or more extreme. We are 99% confident that the population difference in hours of sport per week between males and females is between -2.4817 and -0.7183 hours per week.

## Comparing two proportions for paired sample - McNemar's Test

Working with dependent data, such as twins, couples, same subject from different time, we will need to use different methods from above.

> Example
<br> Our research question here is whether there is a difference between the proportion of surveyed individuals that approve of the European union and the proportion of their partners that approve of the European union. What would be a good pair of hypotheses?
<br>Answer
<br>$H_0$: The proportion of EU approval is not different in surveyed individuals and their partners. $H_1$: The proportion of EU approval is different in surveyed individuals and their partners

In [0]:
import pandas as pd

col_index=pd.MultiIndex.from_tuples([('Partner Approves of the EU', 'Yes'), ('Partner Approves of the EU', 'No')])
row_index=pd.MultiIndex.from_tuples([('Survey Individuals that approve of the EU', 'Yes'),('Survey Individuals that approve of the EU', 'No')])
survey = pd.DataFrame(np.array([[150, 50], [35, 100]]), index=row_index, columns=col_index)
survey['row_totals'] = survey.sum(axis=1)
s=survey.sum(axis=0)
s.name='column_totals'
survey = survey.append(s)
display(survey)

Unnamed: 0_level_0,Partner Approves of the EU,Partner Approves of the EU,row_totals
Unnamed: 0_level_1,Yes,No,Unnamed: 3_level_1
"(Survey Individuals that approve of the EU, Yes)",150,50,200
"(Survey Individuals that approve of the EU, No)",35,100,135
column_totals,185,150,335


In [0]:
#calculate z value
z_val=(50-35)/np.sqrt(50+35)

#get pvalue
pval=(1-st.norm.cdf(np.abs(z_val)))*2
siglevel=0.05
conclusion=testeval(siglevel, pval)
print(f'Calculated Z value: {z_val:.4f}\nPvalue is: {pval:.4f} \nConclusion on Null Hypothesis given {siglevel} significance level: {conclusion}\n')

Calculated Z value: 1.6270
Pvalue is: 0.1037 
Conclusion on Null Hypothesis given 0.05 significance level: Not enough evidence to reject



The equivalent [mcnemar's test](http://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html) in statsmodelss.

In [0]:
from statsmodels.stats.contingency_tables import mcnemar
result = mcnemar(survey.iloc[:2, :2].to_numpy(), exact=False, correction=False)
print(result)

pvalue      0.1037416782365415
statistic   2.6470588235294117


## Compare two means for paired samples

> Example
<br>An example when we would do this is if we would want to know the effectiveness of a diet on people's weight. Our research question here is whether the diet that we have invented leads to a reduction in weight. As such our research question is directional. What would be a good set of hypotheses?
<br>Answer: 
<br>$H_0$: There is no difference in people's weight before and after the diet. $H_1$: There is a reduction in weight after taking the diet.

In [0]:
#generate data
pre_weight=np.random.normal(loc=81.53587, scale=8.113578, size=100)
pre_weight_fix=(pre_weight-np.mean(pre_weight))*(8.113578/np.std(pre_weight))+81.5358

post_weight=np.random.normal(loc=78.20945, scale=9.223542, size=100)
post_weight_fix=(post_weight-np.mean(post_weight))*(9.223542/np.std(pre_weight))+78.20945

# get the difference of the two means
diff = pre_weight_fix.mean()-post_weight_fix.mean()

#standard deviation of the differences
stddiff = np.std(pre_weight_fix-post_weight_fix)

#standard error of the difference
se=stddiff/np.sqrt(100)

tval=diff/se 
pval=(1-st.t.cdf(np.abs(tval), 100-1))*2
siglevel=0.05
conclusion=testeval(siglevel, pval)
print(f'Calculated t value: {tval:.4f}\nPvalue is: {pval:.4f} \nConclusion on Null Hypothesis given {siglevel} significance level: {conclusion}\n')

Calculated t value: 2.3788
Pvalue is: 0.0193 
Conclusion on Null Hypothesis given 0.05 significance level: Rejected



# Categorical Association

## Chi-square

We would like to find the association between two categorical varianbes. In below example, we are an advertisement company that have collected data coming from three different groups: Student, Parent, and Corporate. We are interested to know which ad type interests which group, so we can invest corresponding ads in those groups. Below data shows votes from different audience groups on their favorite ads: Party, Child, Office. 

To be specific, we need to find out two things:
1.   Is the ad type has any association with audience group?
2.   If there's association, which audience is in favor of which ad type?

In [2]:
data = pd.DataFrame(np.array([[12, 5, 6],[7, 15, 7],[5, 5, 14]]), columns=['Party', 'Child', 'Office'], index=['Student', 'Parent', 'Corporate'])
display(data)

Unnamed: 0,Party,Child,Office
Student,12,5,6
Parent,7,15,7
Corporate,5,5,14


We can conduct chi-square test to test the association between two categorical variables using scipy modules.

In [3]:
from scipy.stats import chi2_contingency

c_stat, pval, df, expected_val = chi2_contingency(data)
expected_val = expected_val.round(1)

print('Expected Value if these two categorical varianbles are not related:')
display(expected_val)

print('Calculated P value: {}'.format(pval))

Expected Value if these two categorical varianbles are not related:


array([[ 7.3,  7.6,  8.2],
       [ 9.2,  9.5, 10.3],
       [ 7.6,  7.9,  8.5]])

Calculated P value: 0.005408290803578588


Here we can see that the expected values are far off from the observed data. Moreover, there are only 0.5% chance to observe such data if audience group has nothing to do with advertisement type. Hene, we rejected the null hypothesis in favor of the alternative hypothesis, i.e. the audience group has preference on their ad type. 

But how strong is this association? We can use Cramer's V to check its strength, where 0 means no association, and 1 is perfect. 

In [4]:
n=data.sum().sum() #the total number of observation
m=min(data.shape)-1 #either the number of rows or columns whichever the smallest - 1
cramerV=np.sqrt(c_stat/(n*m))
print(cramerV)

0.3107928316933293


Cramers V is about 0.31, which is pretty modest.

But which ads is preferred or least preferred by each audience group? We can use standard residusals to see where there is the most deviation from the expected values.

To standardise our residuals we need to divide each residual value by its standard error. We can directly get this value by using statsmodel api.

In [7]:
import statsmodels.api as sm
# need to make a contingency table format
table = sm.stats.Table(data)

#standardized residuals
table.standardized_resids

Unnamed: 0,Party,Child,Office
Student,2.544487,-1.363592,-1.132685
Parent,-1.096214,2.744428,-1.629494
Corporate,-1.369141,-1.520432,2.822363


From this you can see that the biggest values are for the student + party cell, parent + child cell and the corporate + office cell.

## Chi-square goodness of fit

Your null hypothesis is that 60% of college students go to parties regularly, 30% go occaisionally and 10% never go. You want to test if your observed data matched this proportions. The solution of this question will be very similar to the above question. The difference will be you will need to calculate the expected counts based on the expected proportions listed in null hypothesis.

In [17]:
exp_p=np.array([0.6, 0.3, 0.1]) #expected proportions
exp_count=data.loc['Student'].sum()*exp_p
display(pd.DataFrame([exp_count, data.loc['Student']], index=['expected', 'observed'], columns=data.columns))


Unnamed: 0,Party,Child,Office
expected,13.8,6.9,2.3
observed,12.0,5.0,6.0


We can see that `Office` group deviates the most from expected values. We will be using [chisqure](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html) to get test statistic and p value.

In [19]:
from scipy.stats import chisquare
c_stat, pval = chisquare(data.loc['Student'], f_exp=exp_count)
print(f'chi square: {c_stat:.4f}\n p value: {pval:.4f}')

chi square: 6.7101
 p value: 0.0349


As the calculated p value is smaller than 0.05 significance level, we rejected the null hypothesis. It means college students in our sample differed from the expected distribution.

## Fisher's Exact Test

One of the assumption for chi-squre test is the minimum count in each cell is 5. When this assumption is not met, we can use Fisher's Exact Test to check two categorical variables' association.

Fisher's exact test compares the observed values to a probability distribution. We find this comparison distribution by examining all possible rearrangements of our table. The restrictions are that the marginal frequencies must be the same.

> Example
<br>You had expected that parents would like the ad with a child in it because you thought that people with children like children more.
To investigate this further, you took a sample of 15 adults, asked them whether or not they have children, and whether or not they like children. The results are saved in your console as a 2x2 table named child.

We can perform Fisher's exact test using the function [fisher_exact](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html).

> Null Hpothesis:
<br> $H_0$: The two variables are independent.

In [20]:
child = pd.DataFrame(np.array([[7, 10], [1, 9]]), columns=['like', 'dislike'], index=['children', 'nochildren'])
display(child)

Unnamed: 0,like,dislike
children,7,10
nochildren,1,9


In [22]:
from scipy.stats import fisher_exact
odds_ratio, pval = fisher_exact(child, alternative='two-sided')
print(f'Odds ratio: {odds_ratio:.4f}\n p value: {pval:.4f}')

Odds ratio: 6.3000
 p value: 0.1895


The probability that we would observe this or an even more imbalanced ratio by chance is about 18.95%. Using significance level at 5%, we cannot conclude that our observed imbalance is statistically significant; 
there is probably not an association between having children and liking them.

# Regression

## Simple Regression

For simple linear regresion, we focus on two parameters: intercept, and slope. The intercept is the value of the response variable when the predictor is 0. If we do not have any predictors(i.e. no other clues), the average of response variable is a common choise for estimation.

There are many ways to get these two parameters in Python. You can check this [blog](https://www.freecodecamp.org/news/data-science-with-python-8-ways-to-do-linear-regression-and-measure-their-speed-b5577d75f8b/) to choose your favorite. Here I will use [linregress](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html) function from scipy, and [OLS](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html) function from statsmodel.

> Example:
Does people like you because you give them money?

In [39]:
#from scipy
money  = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) #predictor
liking = np.array([2.2, 2.8, 4.5, 3.1, 8.7, 5.0, 4.5, 8.8, 9.0, 9.2]) #response

slope, intercept, correlation, pval, stderror=st.linregress(money, liking)
rsquared = correlation**2

result = pd.DataFrame([slope, intercept, correlation, pval, stderror, rsquared], index='slope, intercept, correlation, pval, stderror, rsquared'.split(', '), columns=['values']).round(4)
display(result)

Unnamed: 0,values
slope,0.7782
intercept,1.5
correlation,0.8303
pval,0.0029
stderror,0.1847
rsquared,0.6893


From correlation of 0.8303, there is a strong positive correlation: the more money you give someone, the more they like you. But even if we do not give them any money, i.e. predictor =0, people like you an amount of 1.50 any way. Furthermore, the r-squared explains how well the predictor describes the response variable. And the p value of 0.0029 shows this model is significant.

In [44]:
#from statsmodel
import statsmodels.api as sm
money = sm.add_constant(money)
results = sm.OLS(liking, money).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.689
Model:                            OLS   Adj. R-squared:                  0.650
Method:                 Least Squares   F-statistic:                     17.75
Date:                Thu, 13 Feb 2020   Prob (F-statistic):            0.00294
Time:                        10:31:20   Log-Likelihood:                -18.248
No. Observations:                  10   AIC:                             40.50
Df Residuals:                       8   BIC:                             41.10
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.5000      1.146      1.309      0.2

  "anyway, n=%i" % int(n))


From statsmodel, Statsmodel has a comprehensive summary that is useful for analysis. `R-squared:` tells us how much of the variance in the response variable (liking) is explained by the predictor variable (money); `Prob (F-statistic)` gives the probability of observations given all regression coefficients equal zero is true,i.e. there will be no relationship between money and liking. Under 0.05 significance level, we will reject the null hypothesis and conclude this model is significant. The parameters are returned with P value indicating their significance as well(ho: each parameter is zero). If we have more predictors, this table will be useful for variable selection.



## Testing Assumptions