## 1. A physician is evaluating a new diet for her patients with a family history of heart disease. To test the effectiveness of this diet, 16 patients are placed on the diet for 6 months. Their weights and triglyceride levels are measured before and after the study, and the physician wants to know if either set of measurements has changed. (Data set: dietstudy.csv)

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
import scipy.stats as stats

In [4]:
import math

In [5]:
diet_study = pd.read_csv('dietstudy.csv')

In [6]:
diet_study.head(2)

Unnamed: 0,patid,age,gender,tg0,tg1,tg2,tg3,tg4,wgt0,wgt1,wgt2,wgt3,wgt4
0,1,45,Male,180,148,106,113,100,198,196,193,188,192
1,2,56,Male,139,94,119,75,92,237,233,232,228,225


### Considering tg0,wgt0 as initial & tg4,wgt4 as final readings

In [7]:
stats.ttest_rel(a=diet_study.tg0,b=diet_study.tg4)

Ttest_relResult(statistic=1.2000008533342437, pvalue=0.24874946576903698)

## Since the p-value (0.249) is above 0.05, we fail to reject the null hypothesis: the data show no significant change 
## in triglyceride levels after the diet.
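The paired test used above is equivalent to a one-sample t-test on the per-patient differences. The sketch below illustrates this on made-up before/after readings (the arrays `before` and `after` are hypothetical and are not taken from dietstudy.csv).

```python
import numpy as np
import scipy.stats as stats

# Made-up before/after readings for illustration only (not from dietstudy.csv).
before = np.array([180.0, 139.0, 152.0, 112.0, 156.0, 167.0, 138.0, 160.0])
after = np.array([100.0, 92.0, 118.0, 82.0, 97.0, 171.0, 132.0, 123.0])

diff = before - after
n = diff.size

# Paired t statistic: mean difference divided by its standard error.
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The library call should agree with the manual computation.
t_scipy, p_scipy = stats.ttest_rel(before, after)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

A positive statistic here means the first argument (the initial reading) is larger on average.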

In [8]:
stats.ttest_rel(a=diet_study.wgt0,b=diet_study.wgt4)

Ttest_relResult(statistic=11.174521688532522, pvalue=1.137689414996614e-08)

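## Since the p-value (1.14e-08) is far below 0.05, we reject the null hypothesis: weight changed significantly 
## after the diet. The positive t statistic means the initial weights (wgt0) were higher on average, i.e. patients lost weight.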

## 2. An analyst at a department store wants to evaluate a recent credit card promotion. To this end, 500 cardholders were randomly selected. Half received an ad promoting a reduced interest rate on purchases made over the next three months, and half received a standard seasonal ad. Is the promotion effective at increasing sales? (Data set: creditpromo.csv)

In [9]:
credit_promo = pd.read_csv('creditpromo.csv')


In [10]:
credit_promo.head(2)

Unnamed: 0,id,insert,dollars
0,148,Standard,2232.771979
1,572,New Promotion,1403.807542


In [11]:
credit_promo.columns

Index(['id', 'insert', 'dollars'], dtype='object')

In [12]:
credit_pro = credit_promo[credit_promo['insert'] =='New Promotion']

In [13]:
credit_pro.head(2)

Unnamed: 0,id,insert,dollars
1,572,New Promotion,1403.807542
4,1541,New Promotion,1513.5632


In [14]:
credit_std = credit_promo[(credit_promo['insert'] =='Standard')]

In [15]:
credit_std.head(2)

Unnamed: 0,id,insert,dollars
0,148,Standard,2232.771979
2,973,Standard,2327.092181


In [16]:
stats.ttest_ind(a=credit_pro.dollars,b=credit_std.dollars, equal_var = False)

Ttest_indResult(statistic=2.260422726464996, pvalue=0.024226348191648994)

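# Since the p-value (0.024) is below 0.05, we reject the null hypothesis of equal mean spending. The positive t statistic shows the New Promotion group spent more on average, so the promotion appears effective. The test above is two-sided; for the one-sided question (does the promotion increase sales?) the p-value is half of this, about 0.012.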
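Because the question is directional (does the promotion increase sales?), a one-sided p-value is the natural quantity to report. The sketch below derives it from the two-sided Welch result on synthetic data; the arrays `promo` and `standard` are hypothetical stand-ins, not values from creditpromo.csv.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
# Synthetic spending figures for illustration only (not from creditpromo.csv).
promo = rng.normal(loc=1650.0, scale=350.0, size=250)
standard = rng.normal(loc=1570.0, scale=340.0, size=250)

# Welch's t-test (unequal variances), two-sided by default.
t_stat, p_two_sided = stats.ttest_ind(promo, standard, equal_var=False)

# H1: mean(promo) > mean(standard). Halve the two-sided p-value when the
# statistic points in the hypothesised direction; otherwise take the complement.
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(t_stat, p_two_sided, p_one_sided)

# The same rule applied to the two-sided p-value reported above (t > 0):
print(0.024226348191648994 / 2)
```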

## 3. An experiment is conducted to study the hybrid seed production of bottle gourd under open field conditions. The main aim of the investigation is to compare natural pollination and hand pollination. The data are collected on 10 randomly selected plants from each of natural pollination and hand pollination. The data are collected on fruit weight (kg), seed yield/plant (g) and seedling length (cm). (Data set: pollination.csv) 

### a. Is the overall population mean of seed yield/plant (g) equal to 200? 

In [17]:
poll = pd.read_csv('pollination.csv')

In [18]:
poll.head(2)

Unnamed: 0,Group,Fruit_Wt,Seed_Yield_Plant,Seedling_length
0,Natural,1.85,147.7,16.86
1,Natural,1.86,136.86,16.77


In [19]:
stats.ttest_1samp(a = poll.Seed_Yield_Plant, popmean =200 )

Ttest_1sampResult(statistic=-2.3009121248548645, pvalue=0.032891040921283025)

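# Since the p-value (0.033) is below 0.05, we reject the null hypothesis: the population mean seed yield/plant (g) is significantly different from 200. The negative t statistic indicates the sample mean is below 200.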
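To make the decision rule concrete, the sketch below computes a one-sample t statistic by hand and checks it against `stats.ttest_1samp`, then builds the matching 95% confidence interval. The `sample` array is made up for illustration and is not the pollination data.

```python
import numpy as np
import scipy.stats as stats

# Made-up seed yields (g) for illustration only (not from pollination.csv).
sample = np.array([147.7, 136.9, 149.9, 174.4, 160.6,
                   224.3, 197.5, 230.1, 217.3, 183.5])
mu0 = 200.0  # hypothesised population mean

n = sample.size
se = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu0) / se
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# 95% confidence interval for the mean; H0 is rejected at the 5% level
# exactly when mu0 falls outside this interval.
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (sample.mean() - t_crit * se, sample.mean() + t_crit * se)

t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu0)
print(t_manual, p_manual, ci)
```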

### b. Test whether the natural pollination and hand pollination under open field conditions are equally effective or are significantly different. 

In [20]:
pol_nat = poll[(poll['Group'] =='Natural')]

In [21]:
pol_nat.head(2)

Unnamed: 0,Group,Fruit_Wt,Seed_Yield_Plant,Seedling_length
0,Natural,1.85,147.7,16.86
1,Natural,1.86,136.86,16.77


In [22]:
pol_han = poll[(poll['Group'] =='Hand')]

In [23]:
pol_han.head(2)

Unnamed: 0,Group,Fruit_Wt,Seed_Yield_Plant,Seedling_length
10,Hand,2.58,224.26,18.18
11,Hand,2.74,197.5,18.07


In [24]:
stats.ttest_ind(a=pol_han.Fruit_Wt,b=pol_nat.Fruit_Wt, equal_var = False)

Ttest_indResult(statistic=17.669989614440286, pvalue=4.306871213074868e-09)

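# Since the p-value (4.3e-09) is far below 0.05, we reject the null hypothesis: natural and hand pollination are not equally effective. Fruit weight is significantly higher under hand pollination (positive t statistic). Only Fruit_Wt is tested here; Seed_Yield_Plant and Seedling_length would need the same test.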
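The question covers three measurements, so the same Welch test can be repeated per column. The sketch below shows one way to do that; the `demo` frame reuses the column names from pollination.csv but is filled with synthetic values, so its output says nothing about the real data.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(1)
# Synthetic stand-in for pollination.csv: same column names, made-up values.
demo = pd.DataFrame({
    "Group": ["Natural"] * 10 + ["Hand"] * 10,
    "Fruit_Wt": np.concatenate([rng.normal(1.9, 0.1, 10), rng.normal(2.6, 0.1, 10)]),
    "Seed_Yield_Plant": np.concatenate([rng.normal(150, 20, 10), rng.normal(215, 20, 10)]),
    "Seedling_length": np.concatenate([rng.normal(16.8, 0.5, 10), rng.normal(18.2, 0.5, 10)]),
})

results = {}
for col in ["Fruit_Wt", "Seed_Yield_Plant", "Seedling_length"]:
    hand = demo.loc[demo["Group"] == "Hand", col]
    natural = demo.loc[demo["Group"] == "Natural", col]
    # Welch's t-test: does not assume equal variances in the two groups.
    t_stat, p_value = stats.ttest_ind(hand, natural, equal_var=False)
    results[col] = (t_stat, p_value)
    print(f"{col}: t = {t_stat:.3f}, p = {p_value:.4g}")
```

Running three tests raises the chance of a false positive, so a stricter threshold per test (for example 0.05 / 3) may be appropriate.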

## 4. An electronics firm is developing a new DVD player in response to customer requests. Using a prototype, the marketing team has collected focus data for different age groups viz. Under 25; 25-34; 35-44; 45-54; 55-64; 65 and above. Do you think that consumers of various ages rated the design differently? (Data set: dvdplayer.csv).

In [25]:
dvd =pd.read_csv('dvdplayer.csv')

In [26]:
dvd.head()

Unnamed: 0,agegroup,dvdscore
0,65 and over,38.454803
1,55-64,17.669677
2,65 and over,31.704307
3,65 and over,25.92446
4,Under 25,30.450007


# The two cells below build the same six age-group samples in two ways (Series indexing and DataFrame filtering) to check that both give the same result

In [27]:
a7 = dvd.dvdscore[dvd.agegroup=='Under 25']
a8 = dvd.dvdscore[dvd.agegroup=='25-34']
a9 = dvd.dvdscore[dvd.agegroup=='35-44']
a10 = dvd.dvdscore[dvd.agegroup=='45-54']
a11= dvd.dvdscore[dvd.agegroup=='55-64']
a12 = dvd.dvdscore[dvd.agegroup=='65 and over']

In [28]:
a1 = dvd[(dvd['agegroup'] =='Under 25')]
a2= dvd[(dvd['agegroup'] =='25-34')]
a3= dvd[(dvd['agegroup'] =='35-44')]
a4= dvd[(dvd['agegroup'] =='45-54')]
a5= dvd[(dvd['agegroup'] =='55-64')]
a6= dvd[(dvd['agegroup'] =='65 and over')]


In [29]:
stats.f_oneway(a7,a8,a9,a10,a11,a12)

F_onewayResult(statistic=6.992526962676518, pvalue=3.087324905679639e-05)

In [30]:
stats.f_oneway(a1.dvdscore,a2.dvdscore,a3.dvdscore,a4.dvdscore,a5.dvdscore,a6.dvdscore)

F_onewayResult(statistic=6.992526962676518, pvalue=3.087324905679639e-05)

# Since the p-value (3.1e-05) is far below 0.05, we reject the null hypothesis of equal mean scores: at least one age group rated the design differently from the others.
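The six per-group variables can also be built in one step with `groupby`. The sketch below shows this on a synthetic frame that reuses the `agegroup` and `dvdscore` column names; the scores are random and unrelated to dvdplayer.csv.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(2)
groups = ["Under 25", "25-34", "35-44", "45-54", "55-64", "65 and over"]
# Synthetic stand-in for dvdplayer.csv: same column names, made-up scores.
demo = pd.DataFrame({
    "agegroup": np.repeat(groups, 10),
    "dvdscore": rng.normal(30.0, 6.0, size=60),
})

# One array of scores per age group, without naming each group by hand.
samples = [scores.to_numpy() for _, scores in demo.groupby("agegroup")["dvdscore"]]
f_stat, p_value = stats.f_oneway(*samples)
print(f_stat, p_value)
```

A significant F statistic only says that some group means differ; a post-hoc comparison would be needed to identify which ones.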

## 5. A survey was conducted among 2800 customers on several demographic characteristics. Working status, sex, age, age-group, race, happiness, no. of child, marital status, educational qualifications, income group etc. had been captured for that purpose. (Data set: sample_survey.csv). 

### a. Is there any relationship between labour force status and marital status? 

In [31]:
sample = pd.read_csv('sample_survey.csv')

In [32]:
sample.head(2)

Unnamed: 0,id,wrkstat,marital,childs,age,educ,paeduc,maeduc,speduc,degree,...,agecat,childcat,news1,news2,news3,news4,news5,car1,car2,car3
0,1,Working full time,Divorced,2.0,60.0,12.0,12.0,12.0,,High school,...,55 to 64,1-2,No,No,No,No,No,American,Japanese,Japanese
1,2,Working part-time,Never married,0.0,27.0,17.0,20.0,,,Junior college,...,25 to 34,,No,No,Yes,No,No,American,German,Japanese


In [33]:
contingency_table = pd.crosstab(sample.wrkstat, sample.marital, margins = True) 
stats.chi2_contingency(observed= contingency_table) 

(729.2421426572284,
 1.820339965538765e-127,
 40,
 array([[5.16918728e+01, 1.55886926e+02, 7.68424028e+01, 1.07787986e+01,
         3.28000000e+01, 3.28000000e+02],
        [8.51024735e+00, 2.56643110e+01, 1.26508834e+01, 1.77455830e+00,
         5.40000000e+00, 5.40000000e+01],
        [6.20932862e+01, 1.87254417e+02, 9.23045936e+01, 1.29477032e+01,
         3.94000000e+01, 3.94000000e+02],
        [1.24501767e+01, 3.75459364e+01, 1.85077739e+01, 2.59611307e+00,
         7.90000000e+00, 7.90000000e+01],
        [7.24946996e+00, 2.18621908e+01, 1.07766784e+01, 1.51166078e+00,
         4.60000000e+00, 4.60000000e+01],
        [9.14063604e+00, 2.75653710e+01, 1.35879859e+01, 1.90600707e+00,
         5.80000000e+00, 5.80000000e+01],
        [2.46954770e+02, 7.44740283e+02, 3.67109894e+02, 5.14950530e+01,
         1.56700000e+02, 1.56700000e+03],
        [4.79095406e+01, 1.44480565e+02, 7.12197880e+01, 9.99010601e+00,
         3.04000000e+01, 3.04000000e+02],
        [4.46000000e+02, 1.345



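# The p-value is far below 0.05, so we reject the null hypothesis of independence

### There is a statistically significant association between labour force status and marital status at the 5% significance level. The chi-square test shows association only; it does not show that one influences the other. Note that margins = True adds an 'All' row and column to the tested table: the statistic is unaffected, but the degrees of freedom are overstated (40 instead of 28 for the 8 x 5 table), so the reported p-value is conservative.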
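The effect of `margins = True` on the test can be checked directly. The sketch below runs `chi2_contingency` on the same synthetic table with and without totals; the `demo` frame and its category labels are made up and are not from sample_survey.csv.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(3)
# Synthetic categorical data for illustration only (not from sample_survey.csv).
demo = pd.DataFrame({
    "wrkstat": rng.choice(["Full time", "Part time", "Retired"], size=300),
    "marital": rng.choice(["Married", "Divorced", "Never married", "Widowed"], size=300),
})

plain = pd.crosstab(demo["wrkstat"], demo["marital"])
with_totals = pd.crosstab(demo["wrkstat"], demo["marital"], margins=True)

chi2_plain, p_plain, dof_plain, _ = stats.chi2_contingency(plain)
chi2_tot, p_tot, dof_tot, _ = stats.chi2_contingency(with_totals)

# The 'All' cells have observed == expected, so they add nothing to the
# statistic, but they inflate the degrees of freedom and hence the p-value.
print(chi2_plain, dof_plain, p_plain)
print(chi2_tot, dof_tot, p_tot)
```

Passing the table without margins gives the degrees of freedom that match the actual number of categories, (rows - 1) x (columns - 1).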

### b. Do you think educational qualification is somehow controlling the marital status? 

In [34]:
contingency_table = pd.crosstab(sample.marital, sample.degree, margins = True) 
stats.chi2_contingency(observed= contingency_table) 

(122.68449020508541,
 7.424404099753273e-15,
 25,
 array([[  75.06345268,   32.19248493,  235.55476781,   32.66359447,
           67.52570011,  443.        ],
        [ 227.39312301,   97.52215526,  713.57674583,   98.94930876,
          204.55866714, 1342.        ],
        [ 111.83268345,   47.9617157 ,  350.9393832 ,   48.66359447,
          100.60262318,  660.        ],
        [  15.75824176,    6.75824176,   49.45054945,    6.85714286,
           14.17582418,   93.        ],
        [  47.95249911,   20.56540234,  150.4785537 ,   20.86635945,
           43.1371854 ,  283.        ],
        [ 478.        ,  205.        , 1500.        ,  208.        ,
          430.        , 2821.        ]]))



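# The p-value is far below 0.05, so we reject the null hypothesis of independence

### There is a statistically significant association between educational qualification and marital status at the 5% significance level. This is evidence of association, not that education controls marital status. Because margins = True was used, the reported degrees of freedom (25) are overstated; the 5 x 5 table without the 'All' row and column has 16.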

### c. Is happiness driven by earnings or marital status? 

In [35]:
sample.columns

Index(['id', 'wrkstat', 'marital', 'childs', 'age', 'educ', 'paeduc', 'maeduc',
       'speduc', 'degree', 'sex', 'race', 'born', 'parborn', 'granborn',
       'income', 'rincome', 'polviews', 'cappun', 'postlife', 'happy',
       'hapmar', 'owngun', 'news', 'tvhours', 'howpaid', 'ethnic', 'eth1',
       'eth2', 'eth3', 'confinan', 'conbus', 'coneduc', 'conpress', 'conmedic',
       'contv', 'agecat', 'childcat', 'news1', 'news2', 'news3', 'news4',
       'news5', 'car1', 'car2', 'car3'],
      dtype='object')

In [36]:
contingency_table = pd.crosstab(sample.marital, sample.happy, margins = True) 
stats.chi2_contingency(observed= contingency_table) 

(260.68943894182826,
 7.762777322980048e-47,
 15,
 array([[  53.6969697 ,  248.58538324,  140.71764706,  443.        ],
        [ 162.06060606,  750.24527629,  424.69411765, 1337.        ],
        [  79.27272727,  366.98609626,  207.74117647,  654.        ],
        [  11.15151515,   51.62495544,   29.22352941,   92.        ],
        [  33.81818182,  156.55828877,   88.62352941,  279.        ],
        [ 340.        , 1574.        ,  891.        , 2805.        ]]))

In [37]:
contingency_table = pd.crosstab(sample.income, sample.happy, margins = True) 
stats.chi2_contingency(observed= contingency_table) 

(178.9505306121643,
 7.234749067043263e-21,
 36,
 array([[   3.89520355,   18.16041919,    9.94437727,   32.        ],
        [  23.12777106,  107.82748892,   59.04474002,  190.        ],
        [  21.66706973,  101.01733172,   55.31559855,  178.        ],
        [  29.82265216,  139.04070939,   76.13663845,  245.        ],
        [ 191.35187424,  892.1305925 ,  488.51753325, 1572.        ],
        [   2.92140266,   13.62031439,    7.45828295,   24.        ],
        [   3.89520355,   18.16041919,    9.94437727,   32.        ],
        [   4.26037888,   19.86295848,   10.87666264,   35.        ],
        [   4.01692866,   18.72793229,   10.25513906,   33.        ],
        [   5.72108021,   26.67311568,   14.60580411,   47.        ],
        [   7.06005643,   32.91575977,   18.0241838 ,   58.        ],
        [   4.26037888,   19.86295848,   10.87666264,   35.        ],
        [ 302.        , 1408.        ,  771.        , 2481.        ]]))

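# Both p-values are far below 0.05, so happiness is significantly associated with both marital status and income. Raw chi-square values from tables of different sizes and sample counts are not directly comparable, so we compare effect sizes instead: Cramér's V is about 0.22 for happy vs marital (chi-square 260.7, n = 2805) and about 0.19 for happy vs income (chi-square 179.0, n = 2481). Marital status therefore shows the somewhat stronger association, though neither test establishes what drives happiness. Because margins = True was used, the reported degrees of freedom (15 and 36) are overstated; the tables without the 'All' row and column have 8 and 22.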
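One way to compare the strength of the two associations is Cramér's V, which rescales chi-square by sample size and table shape. The sketch below applies it to the statistics and grand totals printed above; the helper `cramers_v` is a hypothetical function written for this illustration, and the table shapes assume the 'All' row and column are excluded.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V for an n_rows x n_cols table (margins excluded)."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Statistics and grand totals copied from the outputs above.
# Shapes exclude the 'All' row and column added by margins=True:
# marital (5 categories) x happy (3), income (12 categories) x happy (3).
v_marital = cramers_v(260.68943894182826, 2805, 5, 3)
v_income = cramers_v(178.9505306121643, 2481, 12, 3)
print(round(v_marital, 3), round(v_income, 3))
```

Values near 0 indicate a weak association and values near 1 a strong one, so both associations here are modest.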