In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import scipy.stats as stats

#### 1. A physician is evaluating a new diet for her patients with a family history of heart disease. To test the effectiveness of this diet, 16 patients are placed on the diet for 6 months. Their weights and triglyceride levels are measured before and after the study, and the physician wants to know if either set of measurements has changed. (Data set: dietstudy.csv)

In [2]:
diet = pd.read_csv("dietstudy.csv")
diet

Unnamed: 0,patid,age,gender,tg0,tg1,tg2,tg3,tg4,wgt0,wgt1,wgt2,wgt3,wgt4
0,1,45,Male,180,148,106,113,100,198,196,193,188,192
1,2,56,Male,139,94,119,75,92,237,233,232,228,225
2,3,50,Male,152,185,86,149,118,233,231,229,228,226
3,4,46,Female,112,145,136,149,82,179,181,177,174,172
4,5,64,Male,156,104,157,79,97,219,217,215,213,214
5,6,49,Female,167,138,88,107,171,169,166,165,162,161
6,7,63,Male,138,132,146,143,132,222,219,215,215,210
7,8,63,Female,160,128,150,118,123,167,167,166,162,161
8,9,52,Male,107,120,129,195,174,199,200,196,196,193
9,10,45,Male,156,103,126,135,92,233,229,229,229,226


In [3]:
print("Average of Triglyceride levels - Before: {}".format(diet.tg0.mean()))
print("Average of Triglyceride levels - After: {}".format(diet.tg4.mean()))
print("=======================================")
print("Average of Patients weight - Before: {}".format(diet.wgt0.mean()))
print("Average of Patients weight - After: {}".format(diet.wgt4.mean()))

Average of Triglyceride levels - Before: 138.4375
Average of Triglyceride levels - After: 124.375
Average of Patients weight - Before: 198.375
Average of Patients weight - After: 190.3125


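###### Step: 1 -> Null hypothesis -: H0: No change in mean weight or mean triglyceride level between the start and the end of the study
###### Step: 2 -> Alternate hypothesis -: Ha: Mean weight or mean triglyceride level changed between the start and the end of the study
###### Step: 3 -> Significance Level -: 0.05
###### Step: 4 -> Decision Rule -: Reject H0 if the p-value is below 0.05 (95% confidence level)
###### Step: 5 -> Test statistic -: Paired (dependent) sample T-test, run separately for each measurement, since the same patients are measured before and after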

In [5]:
ht_trig = stats.ttest_rel(diet.tg0, diet.tg4)

In [18]:
ht_trig.pvalue

0.24874946576903698

In [19]:
ht_trig.pvalue < 0.05

False

In [23]:
ht_weight = stats.ttest_rel(diet.wgt0, diet.wgt4)

In [24]:
ht_weight.pvalue

1.137689414996614e-08

In [25]:
ht_weight.pvalue < 0.05

True

##### Step: 6a -> Decision for Triglyceride levels: As the p-value (0.249) is greater than 0.05, we fail to reject the Null Hypothesis. The data do not show a significant change in triglyceride levels between the start and the end of the study.

##### Step: 6b -> Decision for Patients' weights: As the p-value (1.14e-08) is less than 0.05, we reject the Null Hypothesis and conclude that there was a significant change in weight between the start and the end of the study (mean weight fell from 198.375 to 190.3125).
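
A paired T-test is the same as a one-sample T-test on the per-patient differences against 0. The sketch below checks that equivalence. It uses only the first six displayed rows of `tg0` and `tg4` as illustration, not the full 16-patient data set, so its numbers will not match the results above.

```python
import numpy as np
import scipy.stats as stats

# Illustrative subset: first six displayed rows of tg0 / tg4 (not the full data)
before = np.array([180.0, 139.0, 152.0, 112.0, 156.0, 167.0])
after = np.array([100.0, 92.0, 118.0, 82.0, 97.0, 171.0])

# Paired t-test, the same call used in the cells above
paired = stats.ttest_rel(before, after)

# Equivalent formulation: one-sample t-test of the differences against 0
diff = before - after
one_sample = stats.ttest_1samp(diff, 0.0)

# Manual statistic: t = mean(d) / (sd(d) / sqrt(n)), using the sample sd (ddof=1)
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

print(paired.statistic, one_sample.statistic, t_manual)
```

All three printed values should agree, which is why the paired test is the appropriate choice when each patient serves as their own control.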

# =============================================================

#### 2. An analyst at a department store wants to evaluate a recent credit card promotion. To this end, 500 cardholders were randomly selected. Half received an ad promoting a reduced interest rate on purchases made over the next three months, and half received a standard seasonal ad. Is the promotion effective to increase sales? (Data set: creditpromo.csv)

In [11]:
credit = pd.read_csv('creditpromo.csv')

In [12]:
credit.head(5)

Unnamed: 0,id,insert,dollars
0,148,Standard,2232.771979
1,572,New Promotion,1403.807542
2,973,Standard,2327.092181
3,1096,Standard,1280.030541
4,1541,New Promotion,1513.5632


In [13]:
stan = credit[credit['insert'] == 'Standard']
promo = credit[credit['insert'] == 'New Promotion']

In [14]:
print("Average sales without promotion: {}".format(stan.dollars.mean()))
print("Average sales with promotion: {}".format(promo.dollars.mean()))

Average sales without promotion: 1566.3890309659348
Average sales with promotion: 1637.4999830647992


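##### Step: 1 -> Null hypothesis -: H0: Mean sales with the promotional ad equal mean sales with the standard seasonal ad
##### Step: 2 -> Alternate hypothesis -: Ha: Mean sales with the promotional ad are greater than mean sales with the standard seasonal ad
##### Step: 3 -> Significance Level -: 0.05
##### Step: 4 -> Decision Rule -: Reject H0 if the p-value is below 0.05 (95% confidence level)
##### Step: 5 -> Test statistic -: F test (one-way ANOVA) on the two independent groups. With two groups this is equivalent to a two-sided independent-samples T-test, so the reported p-value is two-sided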

In [15]:
promo_test = stats.f_oneway(promo.dollars, stan.dollars)

In [16]:
promo_test

F_onewayResult(statistic=5.109510902319474, pvalue=0.024225996894148064)

In [44]:
promo_test.pvalue < 0.05

True

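##### Step: 6 -> Decision: As the p-value (0.024) is less than 0.05, the Null hypothesis is rejected. Average sales were also higher with the promotion (1637.50 vs 1566.39), so the data support the conclusion that the promotion increased sales.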
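
The alternate hypothesis here is one-sided, while the F test is two-sided. A minimal sketch of how the two relate, using synthetic stand-ins for `promo.dollars` and `stan.dollars` (the group sizes, means and spread below are assumptions, not values from creditpromo.csv):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two groups (assumed values, for illustration only)
promo_sales = rng.normal(1640, 350, size=250)
stan_sales = rng.normal(1570, 350, size=250)

# One-way ANOVA on two groups, as in the cell above
f_res = stats.f_oneway(promo_sales, stan_sales)

# Pooled-variance independent t-test (two-sided by default)
t_res = stats.ttest_ind(promo_sales, stan_sales)

# With two groups, F equals t squared and the two-sided p-values coincide
print(f_res.statistic, t_res.statistic ** 2)
print(f_res.pvalue, t_res.pvalue)

# One-sided p-value for Ha: promo mean > standard mean
if t_res.statistic > 0:
    p_one_sided = t_res.pvalue / 2
else:
    p_one_sided = 1 - t_res.pvalue / 2
print(p_one_sided)
```

Applied to the result above, where the promotion group has the higher mean, halving the two-sided p-value of 0.0242 gives a one-sided p-value of about 0.012, so the decision is unchanged. Recent SciPy releases also accept `alternative="greater"` in `stats.ttest_ind` to get the one-sided value directly.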

# ====================================================

#### 3. An experiment is conducted to study the hybrid seed production of bottle gourd under open field conditions. The main aim of the investigation is to compare natural pollination and hand pollination. The data are collected on 10 randomly selected plants from each of natural pollination and hand pollination. The data are collected on fruit weight (kg), seed yield/plant (g) and seedling length (cm). (Data set: pollination.csv)

In [47]:
pollination = pd.read_csv('pollination.csv')
pollination.head(5)

Unnamed: 0,Group,Fruit_Wt,Seed_Yield_Plant,Seedling_length
0,Natural,1.85,147.7,16.86
1,Natural,1.86,136.86,16.77
2,Natural,1.83,149.97,16.35
3,Natural,1.89,172.33,18.26
4,Natural,1.8,144.46,17.9


##### a. Is the overall population mean of seed yield/plant (g) equal to 200?

##### Step: 1 -> Null hypothesis -: H0: Population mean of seed yield/plant (g) == 200
##### Step: 2 -> Alternate hypothesis -: Ha: Population mean of seed yield/plant (g) != 200
##### Step: 3 -> Significance Level -: 0.05
##### Step: 4 -> Decision Rule -: Reject H0 if the p-value is below 0.05 (95% confidence level)
##### Step: 5 -> Test statistic -: 1 sample T-Test

In [54]:
pollination.Seed_Yield_Plant.mean()

180.8035

In [51]:
seed_yield = stats.ttest_1samp(pollination.Seed_Yield_Plant, 200)

In [55]:
seed_yield.statistic

-2.3009121248548645

In [52]:
seed_yield.pvalue

0.032891040921283025

In [53]:
seed_yield.pvalue < 0.05

True

##### Step: 6 -> Decision -: As the p-value (0.033) is lower than the significance level (0.05), the Null Hypothesis can be rejected. The population mean of seed yield/plant is not equal to 200 (the sample mean is 180.80).
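
The two-sided one-sample T-test and the 95% confidence interval for the mean always give the same decision: H0 is rejected exactly when 200 falls outside the interval. A small sketch, using only the five displayed natural-pollination yields as an illustrative sample (not the full 20-plant data set):

```python
import numpy as np
import scipy.stats as stats

# Illustrative sample: the five displayed Seed_Yield_Plant values
yields = np.array([147.7, 136.86, 149.97, 172.33, 144.46])
mu0 = 200.0
alpha = 0.05

n = len(yields)
mean = yields.mean()
se = yields.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# 95% confidence interval from the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci_low, ci_high = mean - t_crit * se, mean + t_crit * se

res = stats.ttest_1samp(yields, mu0)
outside = not (ci_low <= mu0 <= ci_high)

print("95% CI:", (ci_low, ci_high))
print("200 outside CI:", outside, "| p < 0.05:", res.pvalue < alpha)
```

Reporting the interval alongside the p-value also shows how far the plausible range of means sits from 200.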

# -----------------------------------------------------------------------------------------------------

#### b. Test whether the natural pollination and hand pollination under open field conditions are equally effective or are significantly different.

##### Step: 1 -> Null hypothesis -: H0: Natural pollination and hand pollination under open field conditions are equally effective (equal group means)
##### Step: 2 -> Alternate hypothesis -: Ha: Natural pollination and hand pollination under open field conditions are significantly different (unequal group means)
##### Step: 3 -> Significance Level -: 0.05
##### Step: 4 -> Decision Rule -: Reject H0 if the p-value is below 0.05 (95% confidence level)
##### Step: 5 -> Test statistic -: F test (one-way ANOVA), applied separately to fruit weight, seed yield/plant and seedling length

In [59]:
natural_pol = pollination[pollination['Group'] == 'Natural']
hand_pol = pollination[pollination['Group'] == 'Hand']

In [60]:
print("Natural pollination and Fruit_Wt: {}".format(natural_pol.Fruit_Wt.mean()))
print("Hand pollination and Fruit_Wt: {}".format(hand_pol.Fruit_Wt.mean()))
print('\n')
print("Natural pollination and Seed_Yield_Plant: {}".format(natural_pol.Seed_Yield_Plant.mean()))
print("Hand pollination and Seed_Yield_Plant: {}".format(hand_pol.Seed_Yield_Plant.mean()))
print('\n')
print("Natural pollination and Seedling_length: {}".format(natural_pol.Seedling_length.mean()))
print("Hand pollination and Seedling_length: {}".format(hand_pol.Seedling_length.mean()))

Natural pollination and Fruit_Wt: 1.848
Hand pollination and Fruit_Wt: 2.5660000000000003


Natural pollination and Seed_Yield_Plant: 146.009
Hand pollination and Seed_Yield_Plant: 215.598


Natural pollination and Seedling_length: 17.706999999999994
Hand pollination and Seedling_length: 18.589999999999996


In [70]:
Fruit_Wt_test = stats.f_oneway(hand_pol.Fruit_Wt, natural_pol.Fruit_Wt)
Seed_Yield_Plant_test = stats.f_oneway(hand_pol.Seed_Yield_Plant, natural_pol.Seed_Yield_Plant)
Seedling_length_test = stats.f_oneway(hand_pol.Seedling_length, natural_pol.Seedling_length)

In [80]:
print('Fvalue for Fruit_Wt: {}'.format(Fruit_Wt_test.statistic))
print('Pvalue for Fruit_Wt: {}'.format(Fruit_Wt_test.pvalue))
p = Fruit_Wt_test.pvalue

if( p < 0.05):
    print('We reject Null hypothesis')
else:
    print('We fail to reject null hypothesis')

Fvalue for Fruit_Wt: 312.2285329744269
Pvalue for Fruit_Wt: 8.078362076486376e-13
We reject Null hypothesis


In [81]:
print('Fvalue for Seed_Yield_Plant: {}'.format(Seed_Yield_Plant_test.statistic))
print('Pvalue for Seed_Yield_Plant: {}'.format(Seed_Yield_Plant_test.pvalue))
p = Seed_Yield_Plant_test.pvalue

if p < 0.05:
    print('We reject Null hypothesis')
else:
    print('We fail to reject null hypothesis')

Fvalue for Seed_Yield_Plant: 194.8330366298043
Pvalue for Seed_Yield_Plant: 4.27148158548435e-11
We reject Null hypothesis


In [82]:
print('Fvalue for Seedling_length: {}'.format(Seedling_length_test.statistic))
print('Pvalue for Seedling_length: {}'.format(Seedling_length_test.pvalue))

p = Seedling_length_test.pvalue

if( p < 0.05):
    print('We reject Null hypothesis')
else:
    print('We fail to reject null hypothesis')

Fvalue for Seedling_length: 6.46293337115627
Pvalue for Seedling_length: 0.020428817064110556
We reject Null hypothesis


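##### Step: 6 -> Decision -: At the 0.05 level the Null hypothesis was rejected for all 3 measurements (fruit weight, seed yield/plant and seedling length). Therefore, natural pollination and hand pollination under open field conditions are significantly different, with hand pollination giving the higher mean in each case.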
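
Because three separate tests are run on the same plants, one optional extra check is a Bonferroni correction, which compares each p-value with 0.05 / 3 instead of 0.05. This is an addition for illustration, not part of the original steps. The p-values below are copied from the outputs above.

```python
# p-values copied from the three ANOVA outputs above
p_values = {
    "Fruit_Wt": 8.078362076486376e-13,
    "Seed_Yield_Plant": 4.27148158548435e-11,
    "Seedling_length": 0.020428817064110556,
}

alpha = 0.05
adjusted_alpha = alpha / len(p_values)  # Bonferroni: divide alpha by the number of tests

# True means H0 is still rejected after the correction
decisions = {name: p < adjusted_alpha for name, p in p_values.items()}

print("Adjusted alpha:", adjusted_alpha)
for name, rejected in decisions.items():
    print(name, "-> reject H0" if rejected else "-> fail to reject H0")
```

Under this stricter threshold, fruit weight and seed yield/plant remain clearly significant, while seedling length (p = 0.0204) no longer clears the adjusted threshold, so that result is best read as the weakest of the three.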

# ============================================================

#### 4. An electronics firm is developing a new DVD player in response to customer requests. Using a prototype, the marketing team has collected focus data for different age groups viz. Under 25; 25-34; 35-44; 45-54; 55-64; 65 and above. Do you think that consumers of various ages rated the design differently? (Data set: dvdplayer.csv).

##### Step: 1 -> Null hypothesis -: H0: The mean design rating is the same across all age groups
##### Step: 2 -> Alternate hypothesis -: Ha: At least one age group has a different mean design rating
##### Step: 3 -> Significance Level -: 0.05
##### Step: 4 -> Decision Rule -: Reject H0 if the p-value is below 0.05 (95% confidence level)
##### Step: 5 -> Test statistic -: F test (one-way ANOVA)

In [95]:
dvd = pd.read_csv('dvdplayer.csv')
dvd.head()

Unnamed: 0,agegroup,dvdscore
0,65 and over,38.454803
1,55-64,17.669677
2,65 and over,31.704307
3,65 and over,25.92446
4,Under 25,30.450007


In [84]:
dvd.agegroup.unique()

array(['65 and over', '55-64', 'Under 25', '25-34', '45-54', '35-44'],
      dtype=object)

In [85]:
d1 = dvd[dvd['agegroup'] == 'Under 25']
d2 = dvd[dvd['agegroup'] == '25-34']
d3 = dvd[dvd['agegroup'] == '35-44']
d4 = dvd[dvd['agegroup'] == '45-54']
d5 = dvd[dvd['agegroup'] == '55-64']
d6 = dvd[dvd['agegroup'] == '65 and over']

In [86]:
print("Average rating given by Age group - 'Under 25': {}".format(d1.dvdscore.mean()))
print("Average rating given by Age group - '25-34': {}".format(d2.dvdscore.mean()))
print("Average rating given by Age group - '35-44': {}".format(d3.dvdscore.mean()))
print("Average rating given by Age group - '45-54': {}".format(d4.dvdscore.mean()))
print("Average rating given by Age group - '55-64': {}".format(d5.dvdscore.mean()))
print("Average rating given by Age group - '65 and over': {}".format(d6.dvdscore.mean()))

Average rating given by Age group - 'Under 25': 28.749228425584857
Average rating given by Age group - '25-34': 31.67800994228487
Average rating given by Age group - '35-44': 37.018058228022774
Average rating given by Age group - '45-54': 39.1183687318563
Average rating given by Age group - '55-64': 28.447335692086785
Average rating given by Age group - '65 and over': 28.00279130552502


In [92]:
dvd_test = stats.f_oneway(d1.dvdscore, d2.dvdscore, d3.dvdscore, d4.dvdscore, d5.dvdscore, d6.dvdscore)

In [96]:
print('Fvalue for DVD Score: {}'.format(dvd_test.statistic))
print('Pvalue for DVD Score: {}'.format(dvd_test.pvalue))

p = dvd_test.pvalue

if( p < 0.05):
    print('We reject Null hypothesis')
else:
    print('We fail to reject null hypothesis')

Fvalue for DVD Score: 6.992526962676518
Pvalue for DVD Score: 3.087324905679639e-05
We reject Null hypothesis


##### Step: 6 -> Decision -: As the p-value (3.09e-05) is less than the significance level (0.05), the Null hypothesis is rejected: consumers of various ages rated the design differently.
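
The six manual slices `d1` to `d6` can also be produced with `groupby`, which avoids retyping each label. A minimal sketch on a small synthetic frame (the scores below are randomly generated assumptions; only the column names `agegroup` and `dvdscore` come from the data set):

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(1)
labels = ["Under 25", "25-34", "35-44"]

# Synthetic frame with the same column names as dvdplayer.csv
toy = pd.DataFrame({
    "agegroup": np.repeat(labels, 8),
    "dvdscore": np.concatenate([rng.normal(m, 5, 8) for m in (29, 32, 37)]),
})

# One array of scores per age group, whatever labels are present
samples = [g["dvdscore"].to_numpy() for _, g in toy.groupby("agegroup")]
grouped = stats.f_oneway(*samples)

# Same test with explicit slicing, as done in the cells above
manual = stats.f_oneway(
    *(toy.loc[toy["agegroup"] == lab, "dvdscore"] for lab in labels)
)

print(grouped.statistic, grouped.pvalue)
```

Note that a significant F test only says that some group means differ. It does not say which ones, so a post-hoc comparison would be needed to single out specific age groups.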

# =======================================================

#### 5. A survey was conducted among 2800 customers on several demographic characteristics. Working status, sex, age, age-group, race, happiness, no. of child, marital status, educational qualifications, income group etc. had been captured for that purpose. (Data set: sample_survey.csv).

In [94]:
sample = pd.read_csv('sample_survey.csv')
sample.head(10)

Unnamed: 0,id,wrkstat,marital,childs,age,educ,paeduc,maeduc,speduc,degree,...,agecat,childcat,news1,news2,news3,news4,news5,car1,car2,car3
0,1,Working full time,Divorced,2.0,60.0,12.0,12.0,12.0,,High school,...,55 to 64,1-2,No,No,No,No,No,American,Japanese,Japanese
1,2,Working part-time,Never married,0.0,27.0,17.0,20.0,,,Junior college,...,25 to 34,,No,No,Yes,No,No,American,German,Japanese
2,3,Working full time,Married,2.0,36.0,12.0,12.0,12.0,16.0,High school,...,35 to 44,1-2,No,No,No,Yes,Yes,American,American,
3,4,Working full time,Never married,0.0,21.0,13.0,,12.0,,High school,...,Less than 25,,No,No,No,Yes,Yes,American,Other,
4,5,Working full time,Never married,0.0,35.0,16.0,,12.0,,Bachelor,...,35 to 44,,No,No,No,No,No,American,American,Korean
5,6,Working full time,Divorced,1.0,33.0,16.0,9.0,6.0,,Bachelor,...,25 to 34,1-2,No,Yes,No,Yes,Yes,American,Korean,Japanese
6,7,Working full time,Separated,0.0,43.0,12.0,14.0,12.0,,High school,...,35 to 44,,No,No,No,Yes,Yes,Korean,American,American
7,8,Working full time,Never married,0.0,29.0,13.0,16.0,12.0,,High school,...,25 to 34,,No,No,No,Yes,Yes,Other,Japanese,
8,9,Working part-time,Married,2.0,39.0,18.0,16.0,12.0,13.0,Bachelor,...,35 to 44,1-2,Yes,No,Yes,No,No,American,Korean,
9,10,Working full time,Divorced,0.0,45.0,15.0,16.0,12.0,,Junior college,...,45 to 54,,No,Yes,No,Yes,Yes,Korean,Japanese,Korean


#### a. Is there any relationship between work force status and marital status?

##### Step: 1 -> Null hypothesis -: H0: There is no relationship between work force status and marital status
##### Step: 2 -> Alternate hypothesis -: Ha: There is a relationship between work force status and marital status
##### Step: 3 -> Significance Level -: 0.05
##### Step: 4 -> Decision Rule -: Reject H0 if the p-value is below 0.05 (95% confidence level)
##### Step: 5 -> Test statistic -: Chi-square test of independence

In [104]:
sample_new = pd.crosstab(sample.wrkstat, sample.marital)
chi2_stat, p_value, dof, ex = stats.chi2_contingency(sample_new)

print("===Chi2 Stat===")
print(chi2_stat)
print("\n")

print("===Degrees of Freedom===")
print(dof)
print("\n")

print("===P-Value===")
print(p_value)
print("\n")

print("===Expected Frequencies===")
print(ex)

===Chi2 Stat===
729.2421426572284


===Degrees of Freedom===
28


===P-Value===
1.4875268409067568e-135


===Expected Frequencies===
[[ 51.69187279 155.8869258   76.84240283  10.77879859  32.8       ]
 [  8.51024735  25.66431095  12.65088339   1.7745583    5.4       ]
 [ 62.09328622 187.25441696  92.30459364  12.94770318  39.4       ]
 [ 12.45017668  37.5459364   18.50777385   2.59611307   7.9       ]
 [  7.24946996  21.86219081  10.77667845   1.51166078   4.6       ]
 [  9.14063604  27.56537102  13.58798587   1.90600707   5.8       ]
 [246.95477032 744.74028269 367.10989399  51.495053   156.7       ]
 [ 47.90954064 144.48056537  71.21978799   9.99010601  30.4       ]]


In [105]:
if( p_value < 0.05):
    print('We reject Null hypothesis')
else:
    print('We fail to reject null hypothesis')

We reject Null hypothesis


##### Step: 6 -> Decision -: The Null hypothesis can be rejected: there is a relationship between work force status and marital status.
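
The fourth value returned by `stats.chi2_contingency` is the table of expected frequencies under independence, not the observed contingency table. The sketch below rebuilds those expected counts, the statistic and the degrees of freedom by hand. The observed counts are hypothetical and chosen only for illustration.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical 2 x 3 table of observed counts
observed = np.array([[30, 10, 20],
                     [20, 25, 15]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Expected count for each cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals @ col_totals / observed.sum()

# Chi-square statistic = sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom = (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected_manual)
print(chi2_manual, dof_manual)
```

The same formula explains the output above: the expected table has 8 rows and 5 columns, so the degrees of freedom are (8 - 1) * (5 - 1) = 28.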

# ========================================================

#### b. Do you think educational qualification is somehow controlling the marital status?

##### Step: 1 -> Null hypothesis -: H0: There is no relationship between educational qualification and marital status
##### Step: 2 -> Alternate hypothesis -: Ha: There is a relationship between educational qualification and marital status
##### Step: 3 -> Significance Level -: 0.05
##### Step: 4 -> Decision Rule -: Reject H0 if the p-value is below 0.05 (95% confidence level)
##### Step: 5 -> Test statistic -: Chi-square test of independence

In [106]:
sample_new1 = pd.crosstab(sample.degree, sample.marital)
chi2_stat, p_value, dof, ex = stats.chi2_contingency(sample_new1)

print("===Chi2 Stat===")
print(chi2_stat)
print("\n")

print("===Degrees of Freedom===")
print(dof)
print("\n")

print("===P-Value===")
print(p_value)
print("\n")

print("===Expected Frequencies===")
print(ex)

===Chi2 Stat===
122.68449020508541


===Degrees of Freedom===
16


===P-Value===
1.6707923432360119e-18


===Expected Frequencies===
[[ 75.06345268 227.39312301 111.83268345  15.75824176  47.95249911]
 [ 32.19248493  97.52215526  47.9617157    6.75824176  20.56540234]
 [235.55476781 713.57674583 350.9393832   49.45054945 150.4785537 ]
 [ 32.66359447  98.94930876  48.66359447   6.85714286  20.86635945]
 [ 67.52570011 204.55866714 100.60262318  14.17582418  43.1371854 ]]


In [107]:
if( p_value < 0.05):
    print('We reject Null hypothesis')
else:
    print('We fail to reject null hypothesis')

We reject Null hypothesis


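##### Step: 6 -> Decision -: The Null hypothesis can be rejected: there is a significant association between educational qualification and marital status. The test shows association only; it does not establish that education controls marital status.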
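
With a sample this large, even a weak association gives a tiny p-value, so it can help to report an effect size as well. One common choice is Cramér's V, sketched below on a hypothetical 3 x 3 table. The counts are assumptions, not values from sample_survey.csv.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical 3 x 3 table of observed counts
observed = np.array([[40, 30, 10],
                     [25, 35, 20],
                     [10, 25, 45]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

n = observed.sum()            # total number of observations
k = min(observed.shape) - 1   # smaller of (rows - 1) and (columns - 1)

# Cramér's V = sqrt(chi2 / (n * k)); 0 means no association, 1 a perfect one
cramers_v = np.sqrt(chi2 / (n * k))

print(chi2, p, cramers_v)
```

To apply this to the test above, pass `sample_new1` in place of `observed`.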

# ===========================================================

##### c. Is happiness driven by earnings or marital status?

##### Step: 1 -> Null hypothesis -: H0: There is no relationship between happiness and earnings, and none between happiness and marital status
##### Step: 2 -> Alternate hypothesis -: Ha: There is a relationship between happiness and earnings, or between happiness and marital status
##### Step: 3 -> Significance Level -: 0.05
##### Step: 4 -> Decision Rule -: Reject H0 if the p-value is below 0.05 (95% confidence level)
##### Step: 5 -> Test statistic -: Chi-square test of independence, run once for marital status and once for income

In [113]:
sample_new2 = pd.crosstab(sample.happy, sample.marital)
chi2_stat, p_value, dof, ex = stats.chi2_contingency(sample_new2)

print("===Chi2 Stat===")
print(chi2_stat)
print("\n")

print("===Degrees of Freedom===")
print(dof)
print("\n")

print("===P-Value===")
print(p_value)
print("\n")

print("===Expected Frequencies===")
print(ex)

===Chi2 Stat===
260.6894389418282


===Degrees of Freedom===
8


===P-Value===
9.3147261197964e-52


===Expected Frequencies===
[[ 53.6969697  162.06060606  79.27272727  11.15151515  33.81818182]
 [248.58538324 750.24527629 366.98609626  51.62495544 156.55828877]
 [140.71764706 424.69411765 207.74117647  29.22352941  88.62352941]]


In [114]:
if( p_value < 0.05):
    print('We reject Null hypothesis')
else:
    print('We fail to reject null hypothesis')

We reject Null hypothesis


In [115]:
sample_new3 = pd.crosstab(sample.happy, sample.income)
chi2_stat, p_value, dof, ex = stats.chi2_contingency(sample_new3)

print("===Chi2 Stat===")
print(chi2_stat)
print("\n")

print("===Degrees of Freedom===")
print(dof)
print("\n")

print("===P-Value===")
print(p_value)
print("\n")

print("===Expected Frequencies===")
print(ex)

===Chi2 Stat===
178.9505306121643


===Degrees of Freedom===
22


===P-Value===
1.4107677273473057e-26


===Expected Frequencies===
[[  3.89520355  23.12777106  21.66706973  29.82265216 191.35187424
    2.92140266   3.89520355   4.26037888   4.01692866   5.72108021
    7.06005643   4.26037888]
 [ 18.16041919 107.82748892 101.01733172 139.04070939 892.1305925
   13.62031439  18.16041919  19.86295848  18.72793229  26.67311568
   32.91575977  19.86295848]
 [  9.94437727  59.04474002  55.31559855  76.13663845 488.51753325
    7.45828295   9.94437727  10.87666264  10.25513906  14.60580411
   18.0241838   10.87666264]]


In [116]:
if( p_value < 0.05):
    print('We reject Null hypothesis')
else:
    print('We fail to reject null hypothesis')

We reject Null hypothesis


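##### Step: 6 -> Decision -: The Null hypothesis is rejected in both tests, so happiness is significantly associated with both marital status and earnings (income group). The Chi-square test shows association only; it does not prove that either factor drives happiness.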
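
The four Chi-square cells in this question repeat the same steps, so they could be wrapped in one function. A possible sketch follows; the helper name `chi2_independence` and the toy data are hypothetical, and only the column names `happy` and `marital` come from the survey.

```python
import pandas as pd
import scipy.stats as stats


def chi2_independence(df, row, col, alpha=0.05):
    """Hypothetical helper: cross-tabulate two columns and test independence."""
    table = pd.crosstab(df[row], df[col])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    return {
        "chi2": chi2,
        "p_value": p,
        "dof": dof,
        "reject_h0": bool(p < alpha),  # True when the association is significant
    }


# Toy data with the same column names as the survey (values are made up)
toy = pd.DataFrame({
    "happy": ["Very happy"] * 20 + ["Not too happy"] * 20,
    "marital": ["Married"] * 15 + ["Divorced"] * 5
               + ["Married"] * 5 + ["Divorced"] * 15,
})

result = chi2_independence(toy, "happy", "marital")
print(result)
```

On the real data the calls would be `chi2_independence(sample, "happy", "marital")` and `chi2_independence(sample, "happy", "income")`.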