**A physician is evaluating a new diet for her patients with a family history of heart disease. To test the effectiveness of this diet, 16 patients are placed on the diet for 6 months. Their weights and triglyceride levels are measured before and after the study, and the physician wants to know if either set of measurements has changed. (Data set: dietstudy.csv)**

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats

In [2]:
dietstudy=pd.read_csv('dietstudy.csv')
dietstudy

Unnamed: 0,patid,age,gender,tg0,tg1,tg2,tg3,tg4,wgt0,wgt1,wgt2,wgt3,wgt4
0,1,45,Male,180,148,106,113,100,198,196,193,188,192
1,2,56,Male,139,94,119,75,92,237,233,232,228,225
2,3,50,Male,152,185,86,149,118,233,231,229,228,226
3,4,46,Female,112,145,136,149,82,179,181,177,174,172
4,5,64,Male,156,104,157,79,97,219,217,215,213,214
5,6,49,Female,167,138,88,107,171,169,166,165,162,161
6,7,63,Male,138,132,146,143,132,222,219,215,215,210
7,8,63,Female,160,128,150,118,123,167,167,166,162,161
8,9,52,Male,107,120,129,195,174,199,200,196,196,193
9,10,45,Male,156,103,126,135,92,233,229,229,229,226


### H0: Before Triglyceride levels == After Triglyceride levels
### HA: Before Triglyceride levels != After Triglyceride levels

In [3]:
print(stats.ttest_rel(a=dietstudy.tg0,b=dietstudy.tg1))
print(stats.ttest_rel(a=dietstudy.tg0,b=dietstudy.tg2))
print(stats.ttest_rel(a=dietstudy.tg0,b=dietstudy.tg3))
print(stats.ttest_rel(a=dietstudy.tg0,b=dietstudy.tg4))

Ttest_relResult(statistic=1.708339375326079, pvalue=0.10817790329711016)
Ttest_relResult(statistic=1.4652897466766746, pvalue=0.16348782284970503)
Ttest_relResult(statistic=1.6460302971284584, pvalue=0.12054358887600601)
Ttest_relResult(statistic=1.2000008533342437, pvalue=0.24874946576903698)


In [4]:
print(dietstudy.tg0.mean())
print(dietstudy.tg1.mean())
print(dietstudy.tg2.mean())
print(dietstudy.tg3.mean())
print(dietstudy.tg4.mean())

138.4375
124.5625
124.375
118.8125
124.375


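**All four p-values are above 0.05, so we fail to reject H0 at the 5% level. The sample mean dropped from 138.4 at baseline to between 118.8 and 124.6 at the later measurements, but with only 16 patients this decrease is not statistically significant. Failing to reject H0 does not show that the diet has no effect; a larger sample might be needed to detect a change of this size.**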
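
A paired t-test is equivalent to a one-sample t-test on the per-patient differences. The sketch below illustrates this using only the ten rows displayed in the table above (not the full 16-patient file), so its numbers will differ from the results printed earlier; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# tg0 and tg4 for the ten patients shown in the table above (a subset of the data)
tg0 = np.array([180, 139, 152, 112, 156, 167, 138, 160, 107, 156])
tg4 = np.array([100, 92, 118, 82, 97, 171, 132, 123, 174, 92])

# Per-patient change; the paired test asks whether its mean differs from 0
diff = tg0 - tg4

paired = stats.ttest_rel(tg0, tg4)
one_sample = stats.ttest_1samp(diff, popmean=0)
print(paired.statistic, one_sample.statistic)
```

The two statistics agree, which is why the test depends on the spread of the individual changes rather than on the spread of the raw triglyceride levels.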

### H0: Before Weight == After Weight
### HA: Before Weight != After Weight

In [5]:
print(stats.ttest_rel(a=dietstudy.wgt0,b=dietstudy.wgt1))
print(stats.ttest_rel(a=dietstudy.wgt0,b=dietstudy.wgt2))
print(stats.ttest_rel(a=dietstudy.wgt0,b=dietstudy.wgt3))
print(stats.ttest_rel(a=dietstudy.wgt0,b=dietstudy.wgt4))

Ttest_relResult(statistic=3.69481872265182, pvalue=0.0021619105203089487)
Ttest_relResult(statistic=9.405816697932298, pvalue=1.1117975587210374e-07)
Ttest_relResult(statistic=10.2633853406995, pvalue=3.545567049017907e-08)
Ttest_relResult(statistic=11.174521688532522, pvalue=1.137689414996614e-08)


In [6]:
print(dietstudy.wgt0.mean())
print(dietstudy.wgt1.mean())
print(dietstudy.wgt2.mean())
print(dietstudy.wgt3.mean())
print(dietstudy.wgt4.mean())

198.375
196.125
194.125
192.125
190.3125


**All four p-values are below 0.05, so we reject H0 in favour of the alternative hypothesis: mean weight changed during the diet. The sample means fall steadily from 198.4 at baseline to 190.3 at the final measurement, so the change is a decrease.**
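
Because the baseline weight is compared against four follow-up measurements, one optional check is whether the results survive a multiple-comparison adjustment. The sketch below applies a Bonferroni correction to the four weight p-values printed above; the choice of Bonferroni is an assumption for illustration, not part of the original analysis.

```python
# p-values reported above for wgt0 vs wgt1, wgt2, wgt3, wgt4
weight_pvalues = [
    0.0021619105203089487,
    1.1117975587210374e-07,
    3.545567049017907e-08,
    1.137689414996614e-08,
]

alpha = 0.05
# Bonferroni: divide the significance level by the number of tests
bonferroni_alpha = alpha / len(weight_pvalues)

still_significant = [p < bonferroni_alpha for p in weight_pvalues]
print(bonferroni_alpha, still_significant)
```

All four comparisons stay below the adjusted threshold, so the conclusion for weight does not depend on the adjustment.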

**An analyst at a department store wants to evaluate a recent credit card promotion. To this end, 500 cardholders were randomly selected. Half received an ad promoting a reduced interest rate on purchases made over the next three months, and half received a standard seasonal ad. Is the promotion effective to increase sales? (Data set: creditpromo.csv)**

In [7]:
creditpromo=pd.read_csv('creditpromo.csv')
creditpromo

Unnamed: 0,id,insert,dollars
0,148,Standard,2232.771979
1,572,New Promotion,1403.807542
2,973,Standard,2327.092181
3,1096,Standard,1280.030541
4,1541,New Promotion,1513.563200
5,1947,New Promotion,1729.627996
6,2001,New Promotion,1609.705918
7,2130,Standard,1476.624884
8,2616,Standard,1460.769753
9,2886,New Promotion,1854.489028


#### H0: The new promo sales = Standard seasonal ad sales 
#### H1: The new promo sales > Standard seasonal ad sales

In [8]:
seasonal_ad=creditpromo["dollars"][creditpromo["insert"]=="Standard"]
new_promo=creditpromo["dollars"][creditpromo["insert"]=="New Promotion"]

In [9]:
print(seasonal_ad.mean())
print(new_promo.mean())

1566.3890309659348
1637.4999830647992


In [10]:
print(seasonal_ad.std())
print(new_promo.std())

346.67304707417804
356.7031686883037


In [11]:
stats.ttest_ind(a=seasonal_ad, b=new_promo, equal_var=False)

Ttest_indResult(statistic=-2.260422726464996, pvalue=0.024226348191648994)

In [12]:
stats.ttest_ind(a=seasonal_ad, b=new_promo, equal_var=True)

Ttest_indResult(statistic=-2.2604227264649963, pvalue=0.024225996894147814)

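### The two-sided p-value (about 0.024) is below 0.05, and the new promotion's mean sales (1637.50) are higher than the standard seasonal ad's (1566.39). Because the observed difference is in the direction of H1, the one-sided p-value is half the two-sided value (about 0.012), so we reject the null hypothesis and conclude that the new promotion increased average sales.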
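
H1 is one-sided, while `ttest_ind` as called above reports a two-sided p-value. The sketch below reproduces the Welch test from the summary statistics printed earlier and converts the result to a one-sided p-value. The group sizes of 250 each are an assumption taken from the problem statement ("500 cardholders ... half received"), not read from the file.

```python
from scipy import stats

# Means and standard deviations printed above; 250 per group is assumed
res = stats.ttest_ind_from_stats(
    mean1=1637.4999830647992, std1=356.7031686883037, nobs1=250,   # new promotion
    mean2=1566.3890309659348, std2=346.67304707417804, nobs2=250,  # standard ad
    equal_var=False,  # Welch's t-test
)

# H1: new promotion > standard. With the promotion listed first, a positive
# statistic supports H1, so the one-sided p-value is half the two-sided one.
p_one_sided = res.pvalue / 2 if res.statistic > 0 else 1 - res.pvalue / 2
print(res.statistic, res.pvalue, p_one_sided)
```

The statistic has the opposite sign from the notebook output only because the groups are passed in the opposite order.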

**An experiment is conducted to study the hybrid seed production of bottle gourd under open field conditions. The main aim of the investigation is to compare natural pollination and hand pollination. The data are collected on 10 randomly selected plants from each of natural pollination and hand pollination. The data are collected on fruit weight (kg), seed yield/plant (g) and seedling length (cm). (Data set: pollination.csv)
a. Is the overall population mean of seed yield/plant (g) equal to 200?
b. Test whether the natural pollination and hand pollination under open field conditions are equally effective or are significantly different.**

In [13]:
pollination=pd.read_csv('pollination.csv')
pollination

Unnamed: 0,Group,Fruit_Wt,Seed_Yield_Plant,Seedling_length
0,Natural,1.85,147.7,16.86
1,Natural,1.86,136.86,16.77
2,Natural,1.83,149.97,16.35
3,Natural,1.89,172.33,18.26
4,Natural,1.8,144.46,17.9
5,Natural,1.88,138.3,16.95
6,Natural,1.89,150.58,18.15
7,Natural,1.79,140.99,18.86
8,Natural,1.85,140.57,18.39
9,Natural,1.84,138.33,18.58


In [14]:
pollination.columns

Index(['Group', 'Fruit_Wt', 'Seed_Yield_Plant', 'Seedling_length'], dtype='object')

### a
### H0: Overall population of Seed_Yield_Plant ==200
### H1: Overall population of Seed_Yield_Plant !=200

In [15]:
stats.ttest_1samp(a=pollination["Seed_Yield_Plant"],popmean=200)

Ttest_1sampResult(statistic=-2.3009121248548645, pvalue=0.032891040921283025)

In [16]:
pollination["Seed_Yield_Plant"].mean()

180.8035

**The p-value is below 0.05, so we reject the null hypothesis; the sample mean (180.80) suggests that the overall population mean is below 200. Note that this sample pools the natural and hand pollination groups.**
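
The one-sample statistic is t = (sample mean - 200) / (s / sqrt(n)). As a sketch of that formula, the code below computes it by hand for the ten Natural-pollination rows displayed in the table above and compares it with `ttest_1samp`. It covers one group only, so it is a different test from the pooled one in the notebook; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Seed_Yield_Plant for the ten Natural-pollination plants shown above
natural = np.array([147.7, 136.86, 149.97, 172.33, 144.46,
                    138.3, 150.58, 140.99, 140.57, 138.33])

mu0 = 200          # hypothesised population mean
n = natural.size

# t = (sample mean - mu0) / (sample std / sqrt(n)), using the n-1 denominator
t_manual = (natural.mean() - mu0) / (natural.std(ddof=1) / np.sqrt(n))
t_scipy = stats.ttest_1samp(natural, popmean=mu0).statistic
print(natural.mean(), t_manual, t_scipy)
```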

### b
### H0: Natural pollination is as effective as hand pollination
### H1: Natural pollination is significantly different from hand pollination

In [17]:
Natural=pollination['Seed_Yield_Plant'][pollination['Group']=="Natural"]
Hand=pollination['Seed_Yield_Plant'][pollination['Group']=="Hand"]
Natural_wt=pollination['Fruit_Wt'][pollination['Group']=="Natural"]
Hand_wt=pollination['Fruit_Wt'][pollination['Group']=="Hand"]

In [18]:
stats.ttest_ind(a=Natural,b=Hand,equal_var=True)

Ttest_indResult(statistic=-13.958260515902547, pvalue=4.2714815854843853e-11)

In [19]:
stats.ttest_ind(a=Natural,b=Hand,equal_var=False)

Ttest_indResult(statistic=-13.958260515902547, pvalue=5.136161282685624e-11)

In [20]:
stats.ttest_ind(a=Natural_wt,b=Hand_wt,equal_var=True)

Ttest_indResult(statistic=-17.669989614440286, pvalue=8.078362076486221e-13)

In [22]:
print(Natural.std())
print(Hand.std())
print(Natural_wt.std())
print(Hand_wt.std())

10.496086519375792
11.7637123770045
0.03457680661303979
0.12375603240066954


In [23]:
print(Natural.mean())
print(Hand.mean())
print(Natural_wt.mean())
print(Hand_wt.mean())

146.009
215.598
1.848
2.5660000000000003


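### The p-values for both fruit weight and seed yield per plant are far below 0.05, so we reject the null hypothesis: the two pollination methods differ. The group means show hand pollination giving the higher seed yield (215.6 g vs 146.0 g) and the higher fruit weight (2.57 kg vs 1.85 kg).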
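
The fruit-weight standard deviations differ by more than a factor of three between the groups (0.035 vs 0.124), so Welch's unequal-variance test is a reasonable cross-check for that variable. The sketch below runs it from the summary statistics printed above, taking n = 10 plants per group from the problem statement.

```python
from scipy import stats

# Means and standard deviations printed above; n = 10 per group from the problem statement
welch = stats.ttest_ind_from_stats(
    mean1=1.848, std1=0.03457680661303979, nobs1=10,  # Natural fruit weight
    mean2=2.566, std2=0.12375603240066954, nobs2=10,  # Hand fruit weight
    equal_var=False,  # Welch's t-test: does not assume equal variances
)
print(welch.statistic, welch.pvalue)
```

With equal group sizes the Welch statistic equals the pooled one; only the degrees of freedom, and therefore the p-value, change.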

**An electronics firm is developing a new DVD player in response to customer requests. Using a prototype, the marketing team has collected focus data for different age groups viz. Under 25; 25-34; 35-44; 45-54; 55-64; 65 and above. Do you think that consumers of various ages rated the design differently? (Data set: dvdplayer.csv).**

In [24]:
dvdplayer=pd.read_csv('dvdplayer.csv')
dvdplayer

Unnamed: 0,agegroup,dvdscore
0,65 and over,38.454803
1,55-64,17.669677
2,65 and over,31.704307
3,65 and over,25.924460
4,Under 25,30.450007
5,Under 25,35.609909
6,65 and over,29.677695
7,25-34,38.167369
8,65 and over,23.509700
9,45-54,26.051029


### H0: Customers of different age groups rated the design equally
### H1: Customers of different age groups rated the design differently

In [25]:
s1=dvdplayer["dvdscore"][dvdplayer["agegroup"]=="Under 25"]
s2=dvdplayer["dvdscore"][dvdplayer["agegroup"]=="25-34"]
s3=dvdplayer["dvdscore"][dvdplayer["agegroup"]=="35-44"]
s4=dvdplayer["dvdscore"][dvdplayer["agegroup"]=="45-54"]
s5=dvdplayer["dvdscore"][dvdplayer["agegroup"]=="55-64"]
s6=dvdplayer["dvdscore"][dvdplayer["agegroup"]=="65 and over"]

In [26]:
stats.f_oneway(s1,s2,s3,s4,s5,s6)

F_onewayResult(statistic=6.992526962676518, pvalue=3.087324905679639e-05)

In [27]:
print(s1.mean())
print(s2.mean())
print(s3.mean())
print(s4.mean())
print(s5.mean())
print(s6.mean())

28.749228425584857
31.67800994228487
37.018058228022774
39.1183687318563
28.447335692086785
28.00279130552502


**The p-value is low, so we reject the null hypothesis: customers of different age groups rated the design differently. The 35-44 and 45-54 groups have the highest mean scores.**
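
The F statistic returned by `f_oneway` is the ratio of between-group to within-group mean squares. The sketch below computes it by hand and checks it against SciPy. The scores are hypothetical toy values chosen for illustration and are not taken from dvdplayer.csv.

```python
import numpy as np
from scipy import stats

# Hypothetical toy scores for three groups (NOT from dvdplayer.csv)
groups = [
    np.array([28.0, 31.0, 27.0, 30.0]),
    np.array([36.0, 39.0, 35.0, 38.0]),
    np.array([29.0, 27.0, 30.0, 26.0]),
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
k, n = len(groups), all_scores.size

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
f_scipy = stats.f_oneway(*groups).statistic
print(f_manual, f_scipy)
```

A significant F only says that at least one group mean differs; it does not identify which groups differ from which.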

#### A survey was conducted among 2800 customers on several demographic characteristics. Working status, sex, age, age group, race, happiness, number of children, marital status, educational qualifications, income group, etc. were captured for that purpose. (Data set: sample_survey.csv).
#### a. Is there any relationship between labour force status and marital status?
#### b. Do you think educational qualification somehow influences marital status?
#### c. Is happiness driven by earnings or marital status?

In [28]:
sample_survey=pd.read_csv('sample_survey.csv')
sample_survey

Unnamed: 0,id,wrkstat,marital,childs,age,educ,paeduc,maeduc,speduc,degree,...,agecat,childcat,news1,news2,news3,news4,news5,car1,car2,car3
0,1,Working full time,Divorced,2.0,60.0,12.0,12.0,12.0,,High school,...,55 to 64,1-2,No,No,No,No,No,American,Japanese,Japanese
1,2,Working part-time,Never married,0.0,27.0,17.0,20.0,,,Junior college,...,25 to 34,,No,No,Yes,No,No,American,German,Japanese
2,3,Working full time,Married,2.0,36.0,12.0,12.0,12.0,16.0,High school,...,35 to 44,1-2,No,No,No,Yes,Yes,American,American,
3,4,Working full time,Never married,0.0,21.0,13.0,,12.0,,High school,...,Less than 25,,No,No,No,Yes,Yes,American,Other,
4,5,Working full time,Never married,0.0,35.0,16.0,,12.0,,Bachelor,...,35 to 44,,No,No,No,No,No,American,American,Korean
5,6,Working full time,Divorced,1.0,33.0,16.0,9.0,6.0,,Bachelor,...,25 to 34,1-2,No,Yes,No,Yes,Yes,American,Korean,Japanese
6,7,Working full time,Separated,0.0,43.0,12.0,14.0,12.0,,High school,...,35 to 44,,No,No,No,Yes,Yes,Korean,American,American
7,8,Working full time,Never married,0.0,29.0,13.0,16.0,12.0,,High school,...,25 to 34,,No,No,No,Yes,Yes,Other,Japanese,
8,9,Working part-time,Married,2.0,39.0,18.0,16.0,12.0,13.0,Bachelor,...,35 to 44,1-2,Yes,No,Yes,No,No,American,Korean,
9,10,Working full time,Divorced,0.0,45.0,15.0,16.0,12.0,,Junior college,...,45 to 54,,No,Yes,No,Yes,Yes,Korean,Japanese,Korean


In [29]:
sample_survey.isna().sum()

id             0
wrkstat        1
marital        1
childs         7
age            4
educ          12
paeduc       791
maeduc       433
speduc      1521
degree        10
sex            0
race           0
born          13
parborn       20
granborn     202
income       329
rincome      983
polviews     141
cappun       233
postlife     766
happy         26
hapmar      1495
owngun       963
news         962
tvhours      495
howpaid     1485
ethnic       594
eth1         275
eth2        1730
eth3        2388
confinan     968
conbus      1011
coneduc      951
conpress     970
conmedic     957
contv        956
agecat         4
childcat       7
news1          0
news2          0
news3          0
news4          0
news5          0
car1           0
car2         566
car3        1160
dtype: int64

In [30]:
sample_survey.columns

Index(['id', 'wrkstat', 'marital', 'childs', 'age', 'educ', 'paeduc', 'maeduc',
       'speduc', 'degree', 'sex', 'race', 'born', 'parborn', 'granborn',
       'income', 'rincome', 'polviews', 'cappun', 'postlife', 'happy',
       'hapmar', 'owngun', 'news', 'tvhours', 'howpaid', 'ethnic', 'eth1',
       'eth2', 'eth3', 'confinan', 'conbus', 'coneduc', 'conpress', 'conmedic',
       'contv', 'agecat', 'childcat', 'news1', 'news2', 'news3', 'news4',
       'news5', 'car1', 'car2', 'car3'],
      dtype='object')

#### a
#### H0: There is no relationship between labour force status and marital status.
#### H1: There is a relationship between labour force status and marital status.

In [31]:
t=pd.crosstab(sample_survey.wrkstat, sample_survey.marital, margins= True)
t

wrkstat \ marital,Divorced,Married,Never married,Separated,Widowed,All
Keeping house,25,200,35,13,55,328
Other,12,16,14,4,8,54
Retired,53,168,17,6,150,394
School,7,9,60,2,1,79
Temporarily not working,9,23,11,1,2,46
"Unemployed, laid off",10,13,32,0,3,58
Working full time,295,778,392,58,44,1567
Working part-time,35,138,102,9,20,304
All,446,1345,663,93,283,2830


In [32]:
stats.chi2_contingency(observed=t)

(729.2421426572284,
 1.820339965538765e-127,
 40,
 array([[5.16918728e+01, 1.55886926e+02, 7.68424028e+01, 1.07787986e+01,
         3.28000000e+01, 3.28000000e+02],
        [8.51024735e+00, 2.56643110e+01, 1.26508834e+01, 1.77455830e+00,
         5.40000000e+00, 5.40000000e+01],
        [6.20932862e+01, 1.87254417e+02, 9.23045936e+01, 1.29477032e+01,
         3.94000000e+01, 3.94000000e+02],
        [1.24501767e+01, 3.75459364e+01, 1.85077739e+01, 2.59611307e+00,
         7.90000000e+00, 7.90000000e+01],
        [7.24946996e+00, 2.18621908e+01, 1.07766784e+01, 1.51166078e+00,
         4.60000000e+00, 4.60000000e+01],
        [9.14063604e+00, 2.75653710e+01, 1.35879859e+01, 1.90600707e+00,
         5.80000000e+00, 5.80000000e+01],
        [2.46954770e+02, 7.44740283e+02, 3.67109894e+02, 5.14950530e+01,
         1.56700000e+02, 1.56700000e+03],
        [4.79095406e+01, 1.44480565e+02, 7.12197880e+01, 9.99010601e+00,
         3.04000000e+01, 3.04000000e+02],
        [4.46000000e+02, 1.345

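**The p-value is far below 0.05, so we reject the null hypothesis at the 5% significance level: there is a relationship between labour force status and marital status. Because the table passed to `chi2_contingency` includes the `All` margins, the reported 40 degrees of freedom are inflated; the 8 x 5 table without margins has 28. The margin cells contribute nothing to the statistic, so the conclusion is unchanged.**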
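
As a sketch of the test without the margins, the code below rebuilds the 8 x 5 table from the counts shown in the crosstab above and passes only the inner cells to `chi2_contingency`. The array name `observed` is illustrative.

```python
import numpy as np
from scipy import stats

# wrkstat x marital counts from the crosstab above, without the "All" row and column
# Columns: Divorced, Married, Never married, Separated, Widowed
observed = np.array([
    [25, 200, 35, 13, 55],     # Keeping house
    [12, 16, 14, 4, 8],        # Other
    [53, 168, 17, 6, 150],     # Retired
    [7, 9, 60, 2, 1],          # School
    [9, 23, 11, 1, 2],         # Temporarily not working
    [10, 13, 32, 0, 3],        # Unemployed, laid off
    [295, 778, 392, 58, 44],   # Working full time
    [35, 138, 102, 9, 20],     # Working part-time
])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, dof, p)
```

In the notebook itself, the same effect could be obtained by building the crosstab without `margins=True`, or by slicing off the last row and column with `t.iloc[:-1, :-1]`.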

#### b
#### H0: There is no relationship between educational status and marital status.
#### H1: There is a relationship between educational status and marital status.

In [33]:
b=pd.crosstab(sample_survey.degree, sample_survey.marital, margins= True)
b

degree \ marital,Divorced,Married,Never married,Separated,Widowed,All
Bachelor,58,251,129,12,28,478
Graduate,29,123,41,3,9,205
High school,241,686,367,58,148,1500
Junior college,45,108,46,3,6,208
LT High school,70,174,77,17,92,430
All,443,1342,660,93,283,2821


In [34]:
stats.chi2_contingency(observed=b)

(122.68449020508541,
 7.424404099753273e-15,
 25,
 array([[  75.06345268,  227.39312301,  111.83268345,   15.75824176,
           47.95249911,  478.        ],
        [  32.19248493,   97.52215526,   47.9617157 ,    6.75824176,
           20.56540234,  205.        ],
        [ 235.55476781,  713.57674583,  350.9393832 ,   49.45054945,
          150.4785537 , 1500.        ],
        [  32.66359447,   98.94930876,   48.66359447,    6.85714286,
           20.86635945,  208.        ],
        [  67.52570011,  204.55866714,  100.60262318,   14.17582418,
           43.1371854 ,  430.        ],
        [ 443.        , 1342.        ,  660.        ,   93.        ,
          283.        , 2821.        ]]))

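**The p-value is far below 0.05, so we reject the null hypothesis at the 5% significance level: there is a relationship between educational status and marital status. The reported 25 degrees of freedom include the `All` margins; the 5 x 5 table without margins has 16, and the statistic is the same either way.**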

#### c
#### H0: Happiness is independent of marital status, and independent of earnings.
#### H1: Happiness is related to marital status, or to earnings.

In [35]:
c1=pd.crosstab(sample_survey.happy, sample_survey.marital, margins= True)
c1

happy \ marital,Divorced,Married,Never married,Separated,Widowed,All
Not too happy,72,71,108,30,59,340
Pretty happy,278,684,426,49,137,1574
Very happy,93,582,120,13,83,891
All,443,1337,654,92,279,2805


In [36]:
stats.chi2_contingency(observed=c1)

(260.68943894182826,
 7.762777322980048e-47,
 15,
 array([[  53.6969697 ,  162.06060606,   79.27272727,   11.15151515,
           33.81818182,  340.        ],
        [ 248.58538324,  750.24527629,  366.98609626,   51.62495544,
          156.55828877, 1574.        ],
        [ 140.71764706,  424.69411765,  207.74117647,   29.22352941,
           88.62352941,  891.        ],
        [ 443.        , 1337.        ,  654.        ,   92.        ,
          279.        , 2805.        ]]))

In [37]:
c2=pd.crosstab(sample_survey.income, sample_survey.happy, margins= True)
c2

income \ happy,Not too happy,Pretty happy,Very happy,All
$1000 TO 2999,7,20,5,32
$10000 - 14999,39,107,44,190
$15000 - 19999,33,119,26,178
$20000 - 24999,40,155,50,245
$25000 or more,113,888,571,1572
$3000 TO 3999,9,11,4,24
$4000 TO 4999,9,13,10,32
$5000 TO 5999,6,18,11,35
$6000 TO 6999,14,13,6,33
$7000 TO 7999,12,21,14,47


In [38]:
stats.chi2_contingency(observed=c2)

(178.9505306121643,
 7.234749067043263e-21,
 36,
 array([[   3.89520355,   18.16041919,    9.94437727,   32.        ],
        [  23.12777106,  107.82748892,   59.04474002,  190.        ],
        [  21.66706973,  101.01733172,   55.31559855,  178.        ],
        [  29.82265216,  139.04070939,   76.13663845,  245.        ],
        [ 191.35187424,  892.1305925 ,  488.51753325, 1572.        ],
        [   2.92140266,   13.62031439,    7.45828295,   24.        ],
        [   3.89520355,   18.16041919,    9.94437727,   32.        ],
        [   4.26037888,   19.86295848,   10.87666264,   35.        ],
        [   4.01692866,   18.72793229,   10.25513906,   33.        ],
        [   5.72108021,   26.67311568,   14.60580411,   47.        ],
        [   7.06005643,   32.91575977,   18.0241838 ,   58.        ],
        [   4.26037888,   19.86295848,   10.87666264,   35.        ],
        [ 302.        , 1408.        ,  771.        , 2481.        ]]))

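**Both p-values are very low, so we reject the null hypothesis in each test: happiness is related to marital status and to earnings. Both tables include the `All` margins, so the reported degrees of freedom (15 and 36) are inflated; without margins they are 8 and 22, and the test statistics are the same. These tests show association only; they do not establish which factor drives happiness.**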
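
With more than 2,000 respondents in each table, even weak associations produce tiny p-values, so an effect size helps compare the two relationships. The sketch below computes Cramér's V from the chi-square statistics and totals reported above. Using Cramér's V is an assumption for illustration and not part of the original analysis, and `cramers_v` is a hypothetical helper name.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V effect size for an n_rows x n_cols contingency table."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))

# Statistics and grand totals reported above; table shapes exclude the "All" margins
v_marital = cramers_v(260.68943894182826, 2805, 3, 5)   # happy x marital
v_income = cramers_v(178.9505306121643, 2481, 12, 3)    # income x happy
print(round(v_marital, 3), round(v_income, 3))
```

Both values are of similar, modest size, with marital status slightly higher than income on this measure.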