FEV, short for forced expiratory volume, is a measure of how much air a person can exhale (in litres) during a forced breath. In this dataset, the FEV of 654 children, between the ages of 6 and 17, was measured. The dataset also provides additional information on these children: their age, their height, their sex and, most importantly, whether the child is a smoker or a non-smoker.

Reference (research paper):
https://www.tandfonline.com/doi/full/10.1080/10691898.2005.11910559

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np
import seaborn as sns

In [2]:
# Load only the columns used in this analysis
df = pd.read_csv('fev.txt', usecols=['AGE', 'FEV', 'HEIGHT', 'SEX', 'SMOKE'])
df.head()

Unnamed: 0,AGE,FEV,HEIGHT,SEX,SMOKE
0,9,1.708,57.0,2,2
1,8,1.724,67.5,2,2
2,7,1.72,54.5,2,2
3,9,1.558,53.0,1,2
4,9,1.895,57.0,1,2


In [3]:

# Recode the value 2 to 0 so that both variables become 0/1 indicators
df['SMOKE'] = df['SMOKE'].replace({2: 0})
df['SEX'] = df['SEX'].replace({2: 0})
df.head()

Unnamed: 0,AGE,FEV,HEIGHT,SEX,SMOKE
0,9,1.708,57.0,0,0
1,8,1.724,67.5,0,0
2,7,1.72,54.5,0,0
3,9,1.558,53.0,1,0
4,9,1.895,57.0,1,0


Note that we have recoded the categorical variables: for SMOKE, nonsmoker is 0 and smoker is 1; for SEX, female is 0 and male is 1.

## Model 1

In [4]:
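import warnings
import statsmodels.api as sm

warnings.filterwarnings('ignore')  # suppress library warnings in the notebook output

# Simple linear regression of FEV on smoking status
X = df[['SMOKE']]
y = df[['FEV']]
X_constant = sm.add_constant(X)  # add the intercept column
lin_reg_smoke_fev = sm.OLS(y, X_constant).fit()
lin_reg_smoke_fev.summary()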

0,1,2,3
Dep. Variable:,FEV,R-squared:,0.06
Model:,OLS,Adj. R-squared:,0.059
Method:,Least Squares,F-statistic:,41.79
Date:,"Sun, 25 Oct 2020",Prob (F-statistic):,1.99e-10
Time:,09:41:11,Log-Likelihood:,-813.88
No. Observations:,654,AIC:,1632.0
Df Residuals:,652,BIC:,1641.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.5661,0.035,74.037,0.000,2.498,2.634
SMOKE,0.7107,0.110,6.464,0.000,0.495,0.927

0,1,2,3
Omnibus:,55.456,Durbin-Watson:,1.094
Prob(Omnibus):,0.0,Jarque-Bera (JB):,67.952
Skew:,0.736,Prob(JB):,1.76e-15
Kurtosis:,3.57,Cond. No.,3.38


We have fit a simple regression of FEV on SMOKE. The fitted equation is fev = 2.57 + 0.711 smoke. Taken at face value, this model indicates that, on average, smokers have an FEV about 0.71 litres larger than nonsmokers. The R-squared value is only 0.06, so smoking status alone explains about 6% of the variance in FEV.
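With a single 0/1 predictor, the least-squares intercept equals the mean of the 0 group and the slope equals the difference between the two group means, which is why the SMOKE coefficient can be read as a difference in average FEV. The sketch below illustrates this on a small made-up sample; the helper `simple_ols` and the toy values are assumptions for illustration and are not taken from the FEV data.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical toy data (not the FEV values): 0 = nonsmoker, 1 = smoker.
smoke = [0, 0, 0, 0, 1, 1]
fev = [2.0, 2.5, 3.0, 2.5, 3.0, 3.5]

intercept, slope = simple_ols(smoke, fev)
mean_nonsmokers = mean(f for s, f in zip(smoke, fev) if s == 0)
mean_smokers = mean(f for s, f in zip(smoke, fev) if s == 1)

print("intercept:", intercept, "nonsmoker mean:", mean_nonsmokers)
print("slope:", slope, "difference in means:", mean_smokers - mean_nonsmokers)
```

Read this way, the 0.711 in the summary above is simply the gap between the two group averages, with no adjustment for anything else.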

In [5]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm
X = df[['AGE']]
y = df[['FEV']]
X_constant = sm.add_constant(X)
lin_reg_age_fev = sm.OLS(y,X_constant).fit()
lin_reg_age_fev.summary()

0,1,2,3
Dep. Variable:,FEV,R-squared:,0.572
Model:,OLS,Adj. R-squared:,0.572
Method:,Least Squares,F-statistic:,872.2
Date:,"Sun, 25 Oct 2020",Prob (F-statistic):,2.45e-122
Time:,09:41:11,Log-Likelihood:,-556.51
No. Observations:,654,AIC:,1117.0
Df Residuals:,652,BIC:,1126.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.4316,0.078,5.541,0.000,0.279,0.585
AGE,0.2220,0.008,29.533,0.000,0.207,0.237

0,1,2,3
Omnibus:,35.892,Durbin-Watson:,1.608
Prob(Omnibus):,0.0,Jarque-Bera (JB):,51.347
Skew:,0.454,Prob(JB):,7.08e-12
Kurtosis:,4.03,Cond. No.,36.7


In [6]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm
X = df[['HEIGHT']]
y = df[['FEV']]
X_constant = sm.add_constant(X)
lin_reg_height_fev = sm.OLS(y,X_constant).fit()
lin_reg_height_fev.summary()

0,1,2,3
Dep. Variable:,FEV,R-squared:,0.754
Model:,OLS,Adj. R-squared:,0.753
Method:,Least Squares,F-statistic:,1995.0
Date:,"Sun, 25 Oct 2020",Prob (F-statistic):,1.57e-200
Time:,09:41:11,Log-Likelihood:,-376.05
No. Observations:,654,AIC:,756.1
Df Residuals:,652,BIC:,765.1
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-5.4327,0.181,-29.939,0.000,-5.789,-5.076
HEIGHT,0.1320,0.003,44.662,0.000,0.126,0.138

0,1,2,3
Omnibus:,37.49,Durbin-Watson:,1.608
Prob(Omnibus):,0.0,Jarque-Bera (JB):,91.349
Skew:,0.284,Prob(JB):,1.46e-20
Kurtosis:,4.74,Cond. No.,662.0


In [7]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm
X = df[['SEX']]
y = df[['FEV']]
X_constant = sm.add_constant(X)
lin_reg_sex_fev = sm.OLS(y,X_constant).fit()
lin_reg_sex_fev.summary()

0,1,2,3
Dep. Variable:,FEV,R-squared:,0.043
Model:,OLS,Adj. R-squared:,0.042
Method:,Least Squares,F-statistic:,29.61
Date:,"Sun, 25 Oct 2020",Prob (F-statistic):,7.5e-08
Time:,09:41:11,Log-Likelihood:,-819.67
No. Observations:,654,AIC:,1643.0
Df Residuals:,652,BIC:,1652.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.4512,0.048,51.505,0.000,2.358,2.545
SEX,0.3613,0.066,5.441,0.000,0.231,0.492

0,1,2,3
Omnibus:,20.399,Durbin-Watson:,0.945
Prob(Omnibus):,0.0,Jarque-Bera (JB):,21.851
Skew:,0.445,Prob(JB):,1.8e-05
Kurtosis:,2.905,Cond. No.,2.65


The R-squared value for each single-predictor fit is:

    FEV on SMOKE : 0.06
    FEV on AGE : 0.572
    FEV on HEIGHT : 0.754
    FEV on SEX : 0.043

It makes sense that height (a proxy for body size) is the predictor most strongly associated with FEV (lung capacity).
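For a regression with one predictor and an intercept, R-squared equals the squared Pearson correlation between the predictor and the response, so the ranking above is also a ranking by strength of linear association. The following is a minimal sketch of that calculation; the function `r_squared` and the toy lists are hypothetical and are not values from the dataset.

```python
def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (with intercept).

    For a single predictor this is the squared Pearson correlation.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical toy values, chosen only to show a strong and a weak predictor.
height = [50, 55, 60, 65, 70]
sex = [0, 1, 0, 1, 0]
fev = [1.5, 2.0, 2.6, 3.1, 3.9]

r2_height = r_squared(height, fev)
r2_sex = r_squared(sex, fev)
print("R-squared, height:", round(r2_height, 3))
print("R-squared, sex:", round(r2_sex, 3))
```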

## Model 2

The regression equation is fev = 0.367 + 0.231 age - 0.209 smoke

In [8]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm
X = df[['AGE', 'SMOKE']]
y = df[['FEV']]
X_constant = sm.add_constant(X)
lin_reg_sa = sm.OLS(y,X_constant).fit()
lin_reg_sa.summary()

0,1,2,3
Dep. Variable:,FEV,R-squared:,0.577
Model:,OLS,Adj. R-squared:,0.575
Method:,Least Squares,F-statistic:,443.3
Date:,"Sun, 25 Oct 2020",Prob (F-statistic):,3.25e-122
Time:,09:41:11,Log-Likelihood:,-553.17
No. Observations:,654,AIC:,1112.0
Df Residuals:,651,BIC:,1126.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.3674,0.081,4.511,0.000,0.207,0.527
AGE,0.2306,0.008,28.176,0.000,0.215,0.247
SMOKE,-0.2090,0.081,-2.588,0.010,-0.368,-0.050

0,1,2,3
Omnibus:,37.843,Durbin-Watson:,1.635
Prob(Omnibus):,0.0,Jarque-Bera (JB):,51.741
Skew:,0.49,Prob(JB):,5.82e-12
Kurtosis:,3.969,Cond. No.,43.7


This provides an estimate of the average difference between smokers' and nonsmokers' FEV conditional on AGE.
Once AGE is in the model, the SMOKE coefficient changes sign: at a given age, nonsmokers have on average about 0.21 litres more FEV than smokers (p = 0.010).
The R-squared value rises from 0.06 for the SMOKE-only model to 0.577, only slightly above the 0.572 of the AGE-only model.
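The sign change of the SMOKE coefficient is what one would expect if age is a confounder, for example if smokers tend to be older and older children have larger lungs. The sketch below shows the same pattern on made-up records, using simple stratification by age group rather than regression adjustment; the records, the two age groups, and the helper `smoker_gap` are all hypothetical.

```python
from statistics import mean

# Hypothetical records: (age_group, smoker flag, FEV in litres).
# Smokers are concentrated in the older group, as an assumed confounding pattern.
records = [
    ("young", 0, 2.0), ("young", 0, 2.1), ("young", 0, 2.2), ("young", 1, 1.9),
    ("old", 0, 3.5), ("old", 1, 3.2), ("old", 1, 3.3), ("old", 1, 3.4),
]

def smoker_gap(rows):
    """Mean FEV of smokers minus mean FEV of nonsmokers within rows."""
    smokers = [fev for _, s, fev in rows if s == 1]
    nonsmokers = [fev for _, s, fev in rows if s == 0]
    return mean(smokers) - mean(nonsmokers)

# Crude comparison ignores age; stratified comparison holds age group fixed.
crude_gap = smoker_gap(records)
gap_by_age = {
    group: smoker_gap([r for r in records if r[0] == group])
    for group in ("young", "old")
}

print("crude gap:", round(crude_gap, 2))
print("gap within age groups:", {g: round(v, 2) for g, v in gap_by_age.items()})
```

In this toy sample the crude gap favours smokers while the gap inside each age group favours nonsmokers, mirroring the change from Model 1 to the AGE and SMOKE model.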

Next we look at HEIGHT (a proxy for body size) together with SMOKE.

In [9]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm
X = df[['HEIGHT', 'SMOKE']]
y = df[['FEV']]
X_constant = sm.add_constant(X)
lin_reg_sh = sm.OLS(y,X_constant).fit()
lin_reg_sh.summary()

0,1,2,3
Dep. Variable:,FEV,R-squared:,0.754
Model:,OLS,Adj. R-squared:,0.753
Method:,Least Squares,F-statistic:,995.9
Date:,"Sun, 25 Oct 2020",Prob (F-statistic):,8.77e-199
Time:,09:41:11,Log-Likelihood:,-376.05
No. Observations:,654,AIC:,758.1
Df Residuals:,651,BIC:,771.5
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-5.4276,0.188,-28.935,0.000,-5.796,-5.059
HEIGHT,0.1319,0.003,42.808,0.000,0.126,0.138
SMOKE,0.0063,0.059,0.108,0.914,-0.109,0.122

0,1,2,3
Omnibus:,37.541,Durbin-Watson:,1.609
Prob(Omnibus):,0.0,Jarque-Bera (JB):,91.729
Skew:,0.284,Prob(JB):,1.21e-20
Kurtosis:,4.745,Cond. No.,686.0


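The R-squared value has increased to 0.754, compared with 0.577 for the AGE and SMOKE model, but it is no higher than for the HEIGHT-only model. The coefficient suggests that, at a given height, smokers have on average about 0.006 litres more FEV than nonsmokers, but the p-value (0.914) is far greater than 0.05, so we fail to reject the null hypothesis that the SMOKE coefficient is zero.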
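The same conclusion can be read from the confidence interval: a coefficient is not significant at the 5% level when its 95% interval contains zero. As a rough check, the interval can be approximated as the coefficient plus or minus 1.96 standard errors. This is a sketch under the assumption that the normal critical value 1.96 is close enough to the t critical value for roughly 650 residual degrees of freedom; the helper `normal_ci` is hypothetical, and the inputs are the rounded values printed in the summaries, so the results match the tables only to about the displayed precision.

```python
def normal_ci(coef, std_err, z=1.96):
    """Approximate 95% confidence interval: coef +/- z * std_err."""
    half_width = z * std_err
    return coef - half_width, coef + half_width

# SMOKE in the HEIGHT + SMOKE model (coef and std err as printed above).
low, high = normal_ci(0.0063, 0.059)
contains_zero = low <= 0.0 <= high
print("HEIGHT + SMOKE:", round(low, 3), round(high, 3), "contains zero:", contains_zero)

# SMOKE in the AGE + SMOKE model, for comparison.
low_age, high_age = normal_ci(-0.2090, 0.081)
contains_zero_age = low_age <= 0.0 <= high_age
print("AGE + SMOKE:", round(low_age, 3), round(high_age, 3), "contains zero:", contains_zero_age)
```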

We will now add a new feature, HEIGHT_SQ, containing the square of each height.

In [10]:
df['HEIGHT_SQ'] = df['HEIGHT'] ** 2
df.head()

Unnamed: 0,AGE,FEV,HEIGHT,SEX,SMOKE,HEIGHT_SQ
0,9,1.708,57.0,0,0,3249.0
1,8,1.724,67.5,0,0,4556.25
2,7,1.72,54.5,0,0,2970.25
3,9,1.558,53.0,1,0,2809.0
4,9,1.895,57.0,1,0,3249.0


In [11]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm
X = df[['SMOKE', 'HEIGHT_SQ']]
y = df[['FEV']]
X_constant = sm.add_constant(X)
lin_reg_shh = sm.OLS(y,X_constant).fit()
lin_reg_shh.summary()

0,1,2,3
Dep. Variable:,FEV,R-squared:,0.765
Model:,OLS,Adj. R-squared:,0.765
Method:,Least Squares,F-statistic:,1062.0
Date:,"Sun, 25 Oct 2020",Prob (F-statistic):,1.14e-205
Time:,09:41:11,Log-Likelihood:,-360.12
No. Observations:,654,AIC:,726.2
Df Residuals:,651,BIC:,739.7
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.5015,0.094,-16.045,0.000,-1.685,-1.318
SMOKE,-0.0070,0.057,-0.122,0.903,-0.120,0.106
HEIGHT_SQ,0.0011,2.48e-05,44.233,0.000,0.001,0.001

0,1,2,3
Omnibus:,34.577,Durbin-Watson:,1.607
Prob(Omnibus):,0.0,Jarque-Bera (JB):,98.115
Skew:,0.176,Prob(JB):,4.95e-22
Kurtosis:,4.865,Cond. No.,22100.0


In [12]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm
X = df[['AGE','HEIGHT', 'SMOKE']]
y = df[['FEV']]
X_constant = sm.add_constant(X)
lin_reg_sha = sm.OLS(y,X_constant).fit()
lin_reg_sha.summary()

0,1,2,3
Dep. Variable:,FEV,R-squared:,0.768
Model:,OLS,Adj. R-squared:,0.767
Method:,Least Squares,F-statistic:,715.7
Date:,"Sun, 25 Oct 2020",Prob (F-statistic):,1.86e-205
Time:,09:41:11,Log-Likelihood:,-356.98
No. Observations:,654,AIC:,722.0
Df Residuals:,650,BIC:,739.9
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-4.6160,0.224,-20.618,0.000,-5.056,-4.176
AGE,0.0597,0.010,6.247,0.000,0.041,0.079
HEIGHT,0.1091,0.005,23.115,0.000,0.100,0.118
SMOKE,-0.1102,0.060,-1.837,0.067,-0.228,0.008

0,1,2,3
Omnibus:,31.673,Durbin-Watson:,1.661
Prob(Omnibus):,0.0,Jarque-Bera (JB):,64.104
Skew:,0.292,Prob(JB):,1.2e-14
Kurtosis:,4.419,Cond. No.,851.0


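Conditional on age and height, nonsmokers have on average about 0.11 litres more FEV than smokers, although the SMOKE coefficient is not significant at the 0.05 level (p = 0.067). With an R-squared of 0.768, the model explains about 77% of the variance in FEV.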
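Because R-squared never decreases when a predictor is added, the summaries also report an adjusted R-squared that penalises the number of predictors. A minimal sketch of the standard formula is given below; the function name `adjusted_r2` is an assumption for illustration, and the inputs are the rounded values from the summary above, so agreement with the table is only to the displayed precision.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# AGE + HEIGHT + SMOKE model: R-squared 0.768, 654 observations, 3 predictors.
adj = adjusted_r2(0.768, 654, 3)
print("adjusted R-squared:", round(adj, 3))
```

With 654 observations and only a few predictors the penalty is small, which is why the adjusted value stays very close to the plain R-squared.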

In [13]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm
X = df[['AGE','HEIGHT', 'SMOKE', 'SEX']]
y = df[['FEV']]
X_constant = sm.add_constant(X)
lin_reg_shas = sm.OLS(y,X_constant).fit()
lin_reg_shas.summary()

0,1,2,3
Dep. Variable:,FEV,R-squared:,0.775
Model:,OLS,Adj. R-squared:,0.774
Method:,Least Squares,F-statistic:,560.0
Date:,"Sun, 25 Oct 2020",Prob (F-statistic):,9.1e-209
Time:,09:41:11,Log-Likelihood:,-345.9
No. Observations:,654,AIC:,701.8
Df Residuals:,649,BIC:,724.2
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-4.4570,0.223,-20.001,0.000,-4.895,-4.019
AGE,0.0655,0.009,6.904,0.000,0.047,0.084
HEIGHT,0.1042,0.005,21.901,0.000,0.095,0.114
SMOKE,-0.0872,0.059,-1.472,0.141,-0.204,0.029
SEX,0.1571,0.033,4.731,0.000,0.092,0.222

0,1,2,3
Omnibus:,22.758,Durbin-Watson:,1.645
Prob(Omnibus):,0.0,Jarque-Bera (JB):,43.271
Skew:,0.207,Prob(JB):,4.02e-10
Kurtosis:,4.19,Cond. No.,861.0


## Model 3