In [3]:
import pandas as pd
import numpy as np
import os
import scipy.stats as stats
import matplotlib as mt
import matplotlib.pyplot as plt

## 1. A physician is evaluating a new diet for her patients with a family history of heart disease. To test the effectiveness of this diet, 16 patients are placed on the diet for 6 months. Their weights and triglyceride levels are measured before and after the study, and the physician wants to know if either set of measurements has changed - Paired-Sample T-Test

In [7]:
DietData = pd.read_csv("dietstudy.csv")
DietData

Unnamed: 0,patid,age,gender,tg0,tg1,tg2,tg3,tg4,wgt0,wgt1,wgt2,wgt3,wgt4
0,1,45,Male,180,148,106,113,100,198,196,193,188,192
1,2,56,Male,139,94,119,75,92,237,233,232,228,225
2,3,50,Male,152,185,86,149,118,233,231,229,228,226
3,4,46,Female,112,145,136,149,82,179,181,177,174,172
4,5,64,Male,156,104,157,79,97,219,217,215,213,214
5,6,49,Female,167,138,88,107,171,169,166,165,162,161
6,7,63,Male,138,132,146,143,132,222,219,215,215,210
7,8,63,Female,160,128,150,118,123,167,167,166,162,161
8,9,52,Male,107,120,129,195,174,199,200,196,196,193
9,10,45,Male,156,103,126,135,92,233,229,229,229,226


In [66]:
male = DietData.age[DietData.gender =="Male"]
male

0     45
1     56
2     50
4     64
6     63
8     52
9     45
13    59
14    52
Name: age, dtype: int64

### Paired-Sample T-Test
 - Let tg0 be the pre-treatment triglyceride level and wgt0 the pre-treatment weight
 - tg1, tg2, tg3, tg4 are the triglyceride levels for the respective months after starting the diet
 - wgt1, wgt2, wgt3, wgt4 are the weights for the respective months after starting the diet

#### So, in the first step we will check:
  - H0: Pre_treatment == After_first_month
  - H1: Pre_treatment <> After_first_month
- If H0 is rejected, we conclude that there is a change in the measurements; otherwise we go on to the 2nd month, and so on
- Taking alpha to be 0.05
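
A paired t-test is the same as a one-sample t-test on the per-patient differences against a mean of zero. The sketch below illustrates that equivalence using only the first six tg0/tg1 rows displayed in the table above, so that it runs without dietstudy.csv; the variable names are chosen for illustration.

```python
import numpy as np
import scipy.stats as stats

# First six tg0 / tg1 values as displayed above (a subset, for illustration only)
before = np.array([180.0, 139.0, 152.0, 112.0, 156.0, 167.0])
after = np.array([148.0, 94.0, 185.0, 145.0, 104.0, 138.0])

# Paired test on the two measurements
paired = stats.ttest_rel(a=before, b=after)

# Equivalent formulation: is the mean of the differences zero?
diff = before - after
one_sample = stats.ttest_1samp(a=diff, popmean=0)

print(paired.statistic, paired.pvalue)
print(one_sample.statistic, one_sample.pvalue)
```

Both calls should print the same statistic and p-value, which is why the pairing (same patient before and after) matters for this test.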

In [17]:
#a = Triglyceride_pre_treatment
#b = Triglyceride_after_first_month

In [20]:
month1 = stats.ttest_rel(a=DietData.tg0,b=DietData.tg1)
month1

Ttest_relResult(statistic=1.708339375326079, pvalue=0.10817790329711016)

In [29]:
Wgt_month1 = stats.ttest_rel(a=DietData.wgt0,b=DietData.wgt1)
Wgt_month1

Ttest_relResult(statistic=3.69481872265182, pvalue=0.0021619105203089487)

#### Since the p-value for triglyceride is 0.108 (> 0.05), we fail to reject H0, so we go on to the second month for triglyceride
 - H0 : Pre_treatment == After_Second_Month
 - H1 : Pre_treatment <> After_Second_Month
#### The p-value for weight is 0.002 (< 0.05), so we reject H0 and conclude that there is a change in weight; we still check the 2nd month to see whether the change persists

In [22]:
month2 = stats.ttest_rel(a=DietData.tg0,b=DietData.tg2)
month2

Ttest_relResult(statistic=1.4652897466766746, pvalue=0.16348782284970503)

In [31]:
Wgt2_month2 = stats.ttest_rel(a=DietData.wgt0,b=DietData.wgt2)
Wgt2_month2

Ttest_relResult(statistic=9.405816697932298, pvalue=1.1117975587210374e-07)

#### Again the p-value for triglyceride (0.163) is higher than 0.05, so we check the 3rd month
 - H0 : Pre_treatment == After_Third_Month
 - H1 : Pre_treatment <> After_Third_Month
#### For weight the p-value is again very low (1.1e-07), hence we can say that weight changed after treatment

In [24]:
month3 = stats.ttest_rel(a=DietData.tg0,b=DietData.tg3)
month3

Ttest_relResult(statistic=1.6460302971284584, pvalue=0.12054358887600601)

#### Since the p-value in the 3rd month (0.121) is also greater than 0.05, we go on to the 4th month

In [27]:
month4 = stats.ttest_rel(a=DietData.tg0,b=DietData.tg4)
month4

Ttest_relResult(statistic=1.2000008533342437, pvalue=0.24874946576903698)

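#### The p-value for the 4th month (0.249) is also greater than 0.05, so none of the four comparisons shows a statistically significant change in triglyceride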
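
Running four tests against the same baseline raises the chance of a false positive, so one common precaution is a Bonferroni correction (alpha divided by the number of tests). This is a minimal sketch applied to the four triglyceride p-values reported above; the correction itself is an addition, not part of the original analysis.

```python
# p-values reported above for tg0 vs tg1, tg2, tg3, tg4
p_values = {
    "month1": 0.10817790329711016,
    "month2": 0.16348782284970503,
    "month3": 0.12054358887600601,
    "month4": 0.24874946576903698,
}

alpha = 0.05
adjusted_alpha = alpha / len(p_values)  # Bonferroni threshold for 4 tests

# A comparison counts as significant only if p is below the adjusted threshold
significant = {month: p < adjusted_alpha for month, p in p_values.items()}
print(adjusted_alpha, significant)
```

Since none of the p-values is below even the uncorrected 0.05, the correction does not change the conclusion here.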

## 2. An analyst at a department store wants to evaluate a recent credit card promotion. To this end, 500 cardholders were randomly selected. Half received an ad promoting a reduced interest rate on purchases made over the next three months, and half received a standard seasonal ad. Is the promotion effective to increase sales?

In [157]:
CreditPromo = pd.read_csv("creditpromo.csv")
CreditPromo.isnull().sum()

id         0
insert     0
dollars    0
dtype: int64

In [158]:
CreditPromo.head()

Unnamed: 0,id,insert,dollars
0,148,Standard,2232.771979
1,572,New Promotion,1403.807542
2,973,Standard,2327.092181
3,1096,Standard,1280.030541
4,1541,New Promotion,1513.5632


In [159]:
CreditPromo.dtypes

id           int64
insert      object
dollars    float64
dtype: object

#### Independent-Samples T-Test
 - We need to check whether the mean spend of customers who received the promotion equals that of customers who did not receive it
 - H0 : Dollars_Spent_promotional == Dollars_spent_standard
 - H1 : Dollars_Spent_promotional <> Dollars_spent_standard

In [161]:
Standard = CreditPromo.dollars[CreditPromo["insert"]=="Standard"]

In [162]:
Promotional = CreditPromo.dollars[CreditPromo["insert"]=="New Promotion"]

In [166]:
ind_test_unequal = stats.ttest_ind(a=Standard,b=Promotional,equal_var=False)
ind_test_unequal.statistic

-2.260422726464996

In [167]:
ind_test_equal = stats.ttest_ind(a=Standard,b=Promotional,equal_var=True)
ind_test_equal.statistic

-2.2604227264649963

#### Since the t statistics with and without the equal-variance assumption are almost identical, we will use the results from ind_test_equal
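
Comparing the two t statistics is an indirect check; the equal-variance assumption can also be tested directly with Levene's test. The sketch below uses synthetic spend figures (not creditpromo.csv), and the 0.05 cut-off for choosing `equal_var` is an assumption made for illustration.

```python
import numpy as np
import scipy.stats as stats

# Synthetic spend figures for illustration only (not from creditpromo.csv)
rng = np.random.default_rng(0)
standard = rng.normal(loc=1600, scale=350, size=250)
promotional = rng.normal(loc=1680, scale=350, size=250)

# Levene's test: H0 is that both groups have equal variance
levene_result = stats.levene(standard, promotional)

# Keep the pooled (equal-variance) test unless Levene's H0 is rejected
equal_var = bool(levene_result.pvalue >= 0.05)

t_result = stats.ttest_ind(a=standard, b=promotional, equal_var=equal_var)
print(levene_result.pvalue, equal_var, t_result.pvalue)
```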

In [168]:
ind_test_equal.pvalue

0.024225996894147814

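#### As the p-value (0.024) is < 0.05, we reject H0. The t statistic is negative (Standard minus Promotional), so mean spend is higher in the New Promotion group, and we conclude that the promotion was effective in increasing sales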
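
Because the question is whether the promotion increases sales, a one-sided alternative is a natural variant of the test. The following is a sketch on made-up numbers, assuming SciPy 1.6 or later (where `ttest_ind` accepts the `alternative` argument); it is not the analysis run above.

```python
import numpy as np
import scipy.stats as stats

# Made-up spend figures for illustration only (not from creditpromo.csv)
standard = np.linspace(1000.0, 2000.0, 50)
promotional = standard + 150.0  # promotional group spends more by construction

# Two-sided test: are the means different?
two_sided = stats.ttest_ind(a=promotional, b=standard, equal_var=True)

# One-sided test: is the promotional mean greater than the standard mean?
one_sided = stats.ttest_ind(a=promotional, b=standard, equal_var=True,
                            alternative="greater")

print(two_sided.pvalue, one_sided.pvalue)
```

When the observed difference is in the hypothesized direction, the one-sided p-value is half the two-sided one.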

## 3. An experiment is conducted to study the hybrid seed production of bottle gourd under open field conditions. The main aim of the investigation is to compare natural pollination and hand pollination. The data are collected on 10 randomly selected plants from each of natural pollination and hand pollination. The data are collected on fruit weight (kg), seed yield/plant (g) and seedling length (cm). (Data set: pollination.csv)
 - a. Is the overall population mean of Seed yield/plant (g) equal to 200? - One-Sample T-Test
 - b. Test whether natural pollination and hand pollination under open field conditions are equally effective or significantly different. - ANOVA Test

In [169]:
pollination = pd.read_csv("pollination.csv")
pollination.head()

Unnamed: 0,Group,Fruit_Wt,Seed_Yield_Plant,Seedling_length
0,Natural,1.85,147.7,16.86
1,Natural,1.86,136.86,16.77
2,Natural,1.83,149.97,16.35
3,Natural,1.89,172.33,18.26
4,Natural,1.8,144.46,17.9


#### a. One-Sample T-Test
 - Hypothesized population mean : 200
 - Variable : Seed_Yield_Plant
 - H0 : Population mean == 200
 - i.e. the population mean of Seed_Yield_Plant equals the hypothesized value of 200
 - H1 : Population mean <> 200
 - i.e. the population mean of Seed_Yield_Plant is not equal to 200
 - Taking alpha to be 0.05
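
The one-sample statistic is t = (sample mean - 200) / (s / sqrt(n)), with n - 1 degrees of freedom. The sketch below computes it by hand and compares it with `stats.ttest_1samp`, using only the first five Seed_Yield_Plant rows displayed above so that it runs without pollination.csv.

```python
import numpy as np
import scipy.stats as stats

# First five Seed_Yield_Plant values as displayed above (a subset, for illustration)
sample = np.array([147.7, 136.86, 149.97, 172.33, 144.46])
popmean = 200

n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)  # sample std uses ddof=1
t_manual = (sample.mean() - popmean) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(a=sample, popmean=popmean)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```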

In [72]:
stats.ttest_1samp(a=pollination.Seed_Yield_Plant,popmean=200)

Ttest_1sampResult(statistic=-2.3009121248548645, pvalue=0.032891040921283025)

#### From the above test, the p-value (0.033) is < alpha, hence we reject the null hypothesis and conclude that the population mean of Seed_Yield_Plant is not equal to 200

#### b. ANOVA Test
 - Run for each continuous variable across the levels of Group
 - Assuming alpha to be 0.05
 - H0 : There is no influence of Group on Fruit Weight
 - H1 : There is an influence of Group on Fruit Weight

In [81]:
sample1 = pollination.Fruit_Wt[pollination.Group=="Natural"]
sample2 = pollination.Fruit_Wt[pollination.Group=="Hand"]

In [83]:
aov = stats.f_oneway(sample1, sample2)
aov

F_onewayResult(statistic=312.228532974426, pvalue=8.078362076486568e-13)

#### Since the p-value is far smaller than alpha, we reject H0; hence Fruit Weight differs between the pollination groups
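
With only two groups, one-way ANOVA is equivalent to the pooled-variance independent t-test: the F statistic equals the square of the t statistic and the p-values match. A minimal sketch, where the `natural` values are the first five rows displayed above and the `hand` values are hypothetical (the Hand rows are not displayed):

```python
import numpy as np
import scipy.stats as stats

natural = np.array([1.85, 1.86, 1.83, 1.89, 1.80])  # first five Natural rows shown above
hand = np.array([2.58, 2.25, 2.60, 2.49, 2.53])     # hypothetical values for illustration

aov_two_groups = stats.f_oneway(natural, hand)
t_two_groups = stats.ttest_ind(a=natural, b=hand, equal_var=True)

# For two groups, F == t**2 and both tests give the same p-value
print(aov_two_groups.statistic, t_two_groups.statistic ** 2)
print(aov_two_groups.pvalue, t_two_groups.pvalue)
```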

#### Next we check the influence of Group (Hand & Natural) on Seed_Yield_Plant and Seedling_length

In [89]:
sample_seed_1 = pollination.Seed_Yield_Plant[pollination.Group=="Natural"]
sample_seed_2 = pollination.Seed_Yield_Plant[pollination.Group=="Hand"]

In [91]:
aov1 = stats.f_oneway(sample_seed_1,sample_seed_2)
aov1.pvalue

4.271481585484407e-11

#### In this case too the p-value is very small, so Seed_Yield_Plant differs between Hand and Natural pollination

In [92]:
sample_seed_lenght_1 = pollination.Seedling_length[pollination.Group=="Natural"]
sample_seed_lenght_2 = pollination.Seedling_length[pollination.Group=="Hand"]
aov3 = stats.f_oneway(sample_seed_lenght_1,sample_seed_lenght_2)

In [93]:
aov3.pvalue

0.020428817064110556

#### The p-value (0.020) is < 0.05, so Seedling_length also differs between Hand and Natural pollination

## 4. An electronics firm is developing a new DVD player in response to customer requests. Using a prototype, the marketing team has collected focus data for different age groups viz. Under 25; 25-34; 35-44; 45-54; 55-64; 65 and above. Do you think that consumers of various ages rated the design differently?

 - H0 : Consumers of different ages have not rated the design differently
 - H1 : Consumers of different ages have rated the design differently
 - Taking alpha as 0.05
 - ANOVA Test

In [95]:
Dvd = pd.read_csv("dvdplayer.csv")
Dvd

Unnamed: 0,agegroup,dvdscore
0,65 and over,38.454803
1,55-64,17.669677
2,65 and over,31.704307
3,65 and over,25.924460
4,Under 25,30.450007
...,...,...
63,45-54,46.567682
64,65 and over,23.999491
65,Under 25,24.994419
66,65 and over,33.538502


In [98]:
Dvd.isnull().sum()
Dvd.agegroup[~Dvd.agegroup.duplicated()]

0     65 and over
1           55-64
4        Under 25
7           25-34
9           45-54
15          35-44
Name: agegroup, dtype: object

In [99]:
Dvd = Dvd.assign(age = np.where(Dvd.agegroup=="65 and over",1,
                               (np.where(Dvd.agegroup=="55-64",2,
                                        (np.where(Dvd.agegroup=="45-54",3,
                                                 (np.where(Dvd.agegroup=="35-44",4,
                                                          (np.where(Dvd.agegroup=="25-34",5,6))))))))))

In [100]:
Dvd

Unnamed: 0,agegroup,dvdscore,age
0,65 and over,38.454803,1
1,55-64,17.669677,2
2,65 and over,31.704307,1
3,65 and over,25.924460,1
4,Under 25,30.450007,6
...,...,...,...
63,45-54,46.567682,3
64,65 and over,23.999491,1
65,Under 25,24.994419,6
66,65 and over,33.538502,1


In [104]:
s1 = Dvd.dvdscore[Dvd.age == 1]
s2 = Dvd.dvdscore[Dvd.age == 2]
s3 = Dvd.dvdscore[Dvd.age == 3]
s4 = Dvd.dvdscore[Dvd.age == 4]
s5 = Dvd.dvdscore[Dvd.age == 5]
s6 = Dvd.dvdscore[Dvd.age == 6]

In [106]:
Aov = stats.f_oneway(s1,s2,s3,s4,s5,s6)
Aov.pvalue

3.087324905679639e-05

#### Considering the p-value from the test (3.09e-05 < 0.05), we reject H0 and accept H1: people in different age categories rated the design differently
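
The per-group samples can also be collected with `groupby` instead of numeric recoding and one variable per group. This is a sketch on a small made-up frame that reuses the column names agegroup and dvdscore; the values are not from dvdplayer.csv.

```python
import pandas as pd
import scipy.stats as stats

# Small made-up frame with the same column names as dvdplayer.csv
toy = pd.DataFrame({
    "agegroup": ["Under 25"] * 3 + ["25-34"] * 3 + ["65 and over"] * 3,
    "dvdscore": [30.5, 25.0, 28.1, 40.2, 44.9, 38.7, 38.5, 31.7, 25.9],
})

# One array of scores per age group, in whatever order groupby yields them
groups = [scores.to_numpy() for _, scores in toy.groupby("agegroup")["dvdscore"]]

# f_oneway accepts any number of samples, so unpack the list
result = stats.f_oneway(*groups)
print(len(groups), result.statistic, result.pvalue)
```

The F statistic does not depend on the order of the groups, so the result matches passing the samples one by one.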

## 5. A survey was conducted among 2800 customers on several demographic characteristics. Working status, sex, age, age-group, race, happiness, no. of children, marital status, educational qualifications, income group etc. were captured for that purpose. (Data set: sample_survey.csv).
 - a. Is there any relationship between labour force status and marital status?
 - b. Do you think educational qualification is somehow controlling the marital status?
 - c. Is happiness driven by earnings or marital status?

In [135]:
survey = pd.read_csv("sample_survey.csv")
survey
survey.columns

Index(['id', 'wrkstat', 'marital', 'childs', 'age', 'educ', 'paeduc', 'maeduc',
       'speduc', 'degree', 'sex', 'race', 'born', 'parborn', 'granborn',
       'income', 'rincome', 'polviews', 'cappun', 'postlife', 'happy',
       'hapmar', 'owngun', 'news', 'tvhours', 'howpaid', 'ethnic', 'eth1',
       'eth2', 'eth3', 'confinan', 'conbus', 'coneduc', 'conpress', 'conmedic',
       'contv', 'agecat', 'childcat', 'news1', 'news2', 'news3', 'news4',
       'news5', 'car1', 'car2', 'car3'],
      dtype='object')

#### a. Finding the Relationship between Labour Force Status and Marital Status
 - We need the columns wrkstat & marital for this
 - Since both are categorical, we will go ahead with a chi-square test
 - H0 : Work status does not affect marital status (the two are independent)
 - H1 : Work status affects marital status (the two are associated)

In [117]:
np.where(survey.wrkstat.isnull())
survey.wrkstat = survey.wrkstat.fillna(survey.wrkstat.mode()[0])

In [118]:
np.where(survey.wrkstat.isnull())

(array([], dtype=int64),)

In [119]:
np.where(survey.marital.isnull())
survey.marital = survey.marital.fillna(survey.marital.mode()[0])

In [120]:
np.where(survey.marital.isnull())

(array([], dtype=int64),)

In [121]:
Labour_Martial_crosstab = pd.crosstab(survey.wrkstat,survey.marital,margins=True)

In [122]:
Labour_Martial_crosstab

marital,Divorced,Married,Never married,Separated,Widowed,All
wrkstat,,,,,,
Keeping house,25,200,35,13,55,328
Other,12,16,14,4,8,54
Retired,53,168,17,6,150,394
School,7,9,60,2,1,79
Temporarily not working,9,23,11,1,2,46
"Unemployed, laid off",10,13,32,0,3,58
Working full time,295,780,392,58,44,1569
Working part-time,35,138,102,9,20,304
All,446,1347,663,93,283,2832


In [123]:
chi = stats.chi2_contingency(observed=Labour_Martial_crosstab)

In [124]:
chi

(729.7434446415614,
 1.435325149487234e-127,
 40,
 array([[5.16553672e+01, 1.56008475e+02, 7.67881356e+01, 1.07711864e+01,
         3.27768362e+01, 3.28000000e+02],
        [8.50423729e+00, 2.56843220e+01, 1.26419492e+01, 1.77330508e+00,
         5.39618644e+00, 5.40000000e+01],
        [6.20494350e+01, 1.87400424e+02, 9.22394068e+01, 1.29385593e+01,
         3.93721751e+01, 3.94000000e+02],
        [1.24413842e+01, 3.75752119e+01, 1.84947034e+01, 2.59427966e+00,
         7.89442090e+00, 7.90000000e+01],
        [7.24435028e+00, 2.18792373e+01, 1.07690678e+01, 1.51059322e+00,
         4.59675141e+00, 4.60000000e+01],
        [9.13418079e+00, 2.75868644e+01, 1.35783898e+01, 1.90466102e+00,
         5.79590395e+00, 5.80000000e+01],
        [2.47095339e+02, 7.46272246e+02, 3.67318856e+02, 5.15243644e+01,
         1.56789195e+02, 1.56900000e+03],
        [4.78757062e+01, 1.44593220e+02, 7.11694915e+01, 9.98305085e+00,
         3.03785311e+01, 3.04000000e+02],
        [4.46000000e+02, 1.347

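#### With a p-value of 1.435325149487234e-127, we reject H0 and conclude that there is clear evidence of an association between work status and marital status. Note that the crosstab was built with margins=True, so the "All" row and column were included in the test; this leaves the statistic unchanged but inflates the degrees of freedom from (8-1)*(5-1) = 28 to 40, so the test is better run on the table without margins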
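
The effect of passing a crosstab with margins to `chi2_contingency` can be seen on a toy example. The respondents below are made up (not from sample_survey.csv). The "All" cells have observed counts equal to their expected counts, so they add nothing to the statistic but do add to the degrees of freedom.

```python
import pandas as pd
import scipy.stats as stats

# Made-up respondents for illustration only (not from sample_survey.csv)
toy = pd.DataFrame({
    "wrkstat": ["Working"] * 8 + ["Retired"] * 6 + ["School"] * 6,
    "marital": (["Married"] * 5 + ["Divorced"] * 2 + ["Never married"] * 1
                + ["Married"] * 3 + ["Divorced"] * 2 + ["Never married"] * 1
                + ["Married"] * 1 + ["Divorced"] * 1 + ["Never married"] * 4),
})

plain = pd.crosstab(toy.wrkstat, toy.marital)                       # 3 x 3 table
with_margins = pd.crosstab(toy.wrkstat, toy.marital, margins=True)  # 4 x 4 with "All"

stat_plain, p_plain, dof_plain, _ = stats.chi2_contingency(plain)
stat_marg, p_marg, dof_marg, _ = stats.chi2_contingency(with_margins)

print(stat_plain, dof_plain, p_plain)
print(stat_marg, dof_marg, p_marg)
```

The statistic is identical in both calls, while the degrees of freedom grow from 4 to 9, which makes the p-value from the table with margins larger than it should be.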

#### b. Do you think educational qualification is somehow controlling the marital status?
 - H0 : Educational qualification has no effect on marital status
 - H1 : Educational qualification has an effect on marital status
 - Taking alpha to be 0.05

In [128]:
survey.degree.isnull().sum()

0

In [127]:
survey.degree = survey.degree.fillna(survey.degree.mode()[0])

In [129]:
degree_marital_crosstab = pd.crosstab(survey.degree,survey.marital,margins=True)

In [130]:
degree_marital_crosstab

marital,Divorced,Married,Never married,Separated,Widowed,All
degree,,,,,,
Bachelor,58,251,129,12,28,478
Graduate,29,123,41,3,9,205
High school,244,690,370,58,148,1510
Junior college,45,109,46,3,6,209
LT High school,70,174,77,17,92,430
All,446,1347,663,93,283,2832


In [131]:
chi2 = stats.chi2_contingency(observed=degree_marital_crosstab)

In [134]:
chi2

(123.18796051189827,
 6.0450187703046e-15,
 25,
 array([[  75.27824859,  227.35381356,  111.90466102,   15.6970339 ,
           47.76624294,  478.        ],
        [  32.28460452,   97.50529661,   47.99258475,    6.73199153,
           20.4855226 ,  205.        ],
        [ 237.80367232,  718.20974576,  353.50635593,   49.58686441,
          150.89336158, 1510.        ],
        [  32.91454802,   99.40783898,   48.92902542,    6.86334746,
           20.88524011,  209.        ],
        [  67.71892655,  204.52330508,  100.66737288,   14.12076271,
           42.96963277,  430.        ],
        [ 446.        , 1347.        ,  663.        ,   93.        ,
          283.        , 2832.        ]]))

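#### The p-value of 6.0450187703046e-15 (< 0.05) means we reject H0: educational qualification is associated with marital status (the degrees of freedom should be 16 rather than 25, because the "All" margins were included in the test)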

#### c. Is happiness driven by earnings or marital status?
 - We will conduct two separate chi-square tests: happiness with marital status, and happiness with earnings
 - Depending on the results we can conclude whether happiness is driven by earnings, marital status, or both


In [136]:
survey.happy

0        Pretty happy
1        Pretty happy
2          Very happy
3          Very happy
4        Pretty happy
            ...      
2827       Very happy
2828    Not too happy
2829     Pretty happy
2830     Pretty happy
2831     Pretty happy
Name: happy, Length: 2832, dtype: object

In [141]:
survey.happy.isnull().sum()

0

In [140]:
survey.happy = survey.happy.fillna(survey.happy.mode()[0])

In [145]:
survey.income[~survey.income.duplicated()]

0      $25000 or more
1      $15000 - 19999
5      $20000 - 24999
16     $10000 - 14999
18           LT $1000
38      $8000 TO 9999
48      $7000 TO 7999
63      $6000 TO 6999
114     $3000 TO 3999
116     $5000 TO 5999
230     $1000 TO 2999
300     $4000 TO 4999
Name: income, dtype: object

In [144]:
survey.income = survey.income.fillna(survey.income.mode()[0])

#### Conducting Chi-Square for Happiness with Marital Status
 - H0 : Happiness is not influenced by Marital Status
 - H1 : Happiness is influenced by Marital Status

In [149]:
Happy_Marital = pd.crosstab(survey.happy,survey.marital,margins=True)

In [150]:
Happy_Marital_chi = stats.chi2_contingency(observed=Happy_Marital)

In [151]:
Happy_Marital_chi

(261.19407805911874,
 6.107340101846463e-47,
 15,
 array([[  53.56411162,  161.6531261 ,   79.625574  ,   11.16919816,
           33.98799011,  340.        ],
        [ 252.06640763,  760.72059343,  374.70858354,   52.56093253,
          159.94348287, 1600.        ],
        [ 140.36948075,  423.62628047,  208.66584246,   29.2698693 ,
           89.06852702,  891.        ],
        [ 446.        , 1346.        ,  663.        ,   93.        ,
          283.        , 2831.        ]]))

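#### From the p-value (6.1e-47 < 0.05), we reject H0 and can say that happiness does depend on marital status (the degrees of freedom should be 8 rather than 15, because the "All" margins were included in the test)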

#### Conducting Chi-Square for Happiness with Income
 - H0 : Happiness is not influenced by Income
 - H1 : Happiness is influenced by Income

In [153]:
Happiness_Income_Cross = pd.crosstab(survey.happy,survey.income)
Happiness_Income_Cross

income,$1000 TO 2999,$10000 - 14999,$15000 - 19999,$20000 - 24999,$25000 or more,$3000 TO 3999,$4000 TO 4999,$5000 TO 5999,$6000 TO 6999,$7000 TO 7999,$8000 TO 9999,LT $1000
happy,,,,,,,,,,,,
Not too happy,7,39,33,40,151,9,9,6,14,12,9,11
Pretty happy,20,109,120,157,1074,11,13,18,13,21,31,14
Very happy,5,44,26,50,691,4,10,11,6,14,19,11


In [154]:
Happiness_Income_Cross_chi = stats.chi2_contingency(observed=Happiness_Income_Cross)

In [155]:
Happiness_Income_Cross_chi

(177.41659890776808,
 2.7900820621651447e-26,
 22,
 array([[   3.84180791,   23.05084746,   21.49011299,   29.6539548 ,
          230.02824859,    2.88135593,    3.84180791,    4.2019774 ,
            3.96186441,    5.64265537,    7.08333333,    4.3220339 ],
        [  18.09039548,  108.54237288,  101.19314972,  139.63524011,
         1083.16242938,   13.56779661,   18.09039548,   19.78637006,
           18.65572034,   26.57026836,   33.35416667,   20.35169492],
        [  10.06779661,   60.40677966,   56.31673729,   77.71080508,
          602.80932203,    7.55084746,   10.06779661,   11.01165254,
           10.38241525,   14.78707627,   18.5625    ,   11.32627119]]))

#### The p-value (2.79e-26 < 0.05) indicates that happiness is associated with income as well; therefore happiness is influenced by both income and marital status
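
Both p-values are tiny, so they do not say which factor is more strongly related to happiness. One way to compare the two is an effect size such as Cramér's V, computed as sqrt(chi2 / (n * min(rows - 1, cols - 1))). The helper `cramers_v` below is a hypothetical name introduced for this sketch, and the tables are toy data rather than the survey crosstabs.

```python
import numpy as np
import scipy.stats as stats

def cramers_v(table):
    """Cramér's V for a contingency table of counts (without margins)."""
    table = np.asarray(table, dtype=float)
    # correction=False so that 2 x 2 tables are not Yates-adjusted
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# Toy tables: perfect association and no association
perfect = [[10, 0], [0, 10]]
independent = [[5, 5], [5, 5]]

print(cramers_v(perfect), cramers_v(independent))
```

V ranges from 0 (no association) to 1 (perfect association). Applied to the happy-by-marital and happy-by-income crosstabs built without margins, the larger value would point to the stronger association.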