This is the Python Jupyter notebook for the birthweight and smoking analysis.

We will follow the exercise steps given below:

## Step 1 - importing the necessary data analysis libraries

We will start by importing the necessary libraries for our analysis. We will be using pandas, numpy, and odfpy libraries. We will import more libraries as we go along.

All used libraries can be found in the requirements.txt file.

In [1]:

import os
import pandas as pd


## Step 2 - Reading the Excel file

We will read the data file using the pandas library. We will be reading the file named "birthweight_smoking.ods" from our "Used data and given exercise" folder, which contains data on birthweight and on characteristics of the mothers, such as smoking. <br> <br>
The xlsx file given on [The Stock and Watson Website](https://media.pearsoncmg.com/ph/bp/bp_stock_econometrics_4_cw/content/datapages/stock04_data05.html) once again appears to be corrupted: the pandas and openpyxl libraries cannot read it. We will use the converted ods file instead, read with the odfpy library, which is installed on setup.

We will save this as a pandas dataframe named "df".

In [2]:
df = pd.read_excel('../Used data and given exercise/birthweight_smoking.ods')

## Step 3 - Exploring the data

Let's first take a look at the data. We will use the head() function to see the first few rows of the data.

In [16]:
df.head()

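| | nprevist | alcohol | tripre1 | tripre2 | tripre3 | tripre0 | birthweight | smoker | unmarried | educ | age | drinks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 0 | 1 | 0 | 0 | 0 | 4253 | 1 | 1 | 12 | 27 | 0 |
| 1 | 5 | 0 | 0 | 1 | 0 | 0 | 3459 | 0 | 0 | 16 | 24 | 0 |
| 2 | 12 | 0 | 1 | 0 | 0 | 0 | 2920 | 1 | 0 | 11 | 23 | 0 |
| 3 | 13 | 0 | 1 | 0 | 0 | 0 | 2600 | 0 | 0 | 17 | 28 | 0 |
| 4 | 9 | 0 | 1 | 0 | 0 | 0 | 3742 | 0 | 0 | 13 | 27 | 0 |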


In [15]:
# Summary of the data:
df.describe()

| | nprevist | alcohol | tripre1 | tripre2 | tripre3 | tripre0 | birthweight | smoker | unmarried | educ | age | drinks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 |
| mean | 10.991667 | 0.019333 | 0.804 | 0.153 | 0.033 | 0.01 | 3382.933667 | 0.194 | 0.226667 | 12.907 | 26.889 | 0.058333 |
| std | 3.672069 | 0.137717 | 0.397035 | 0.360048 | 0.178666 | 0.099515 | 592.162889 | 0.395495 | 0.418745 | 2.166699 | 5.362487 | 0.687814 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 425.0 | 0.0 | 0.0 | 0.0 | 14.0 | 0.0 |
| 25% | 9.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3062.0 | 0.0 | 0.0 | 12.0 | 23.0 | 0.0 |
| 50% | 12.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3420.0 | 0.0 | 0.0 | 12.0 | 27.0 | 0.0 |
| 75% | 13.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3750.0 | 0.0 | 0.0 | 14.0 | 31.0 | 0.0 |
| max | 35.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 5755.0 | 1.0 | 1.0 | 17.0 | 44.0 | 21.0 |


As we can see, the data set is neatly organized. It has 3000 rows and 12 columns, with a value for each variable. The variables describe the mothers' age and other characteristics, such as the number of prenatal visits, along with the birthweight of the child.
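Two quick consistency checks on the summary table can be sketched in plain Python. For a 0/1 dummy variable the mean is the share of observations equal to 1, so if the four `tripre` dummies are mutually exclusive categories (an assumption, based on their means), their means should sum to one. The sample standard deviation of a dummy should also equal sqrt(p(1-p)·n/(n-1)). The numbers are copied from the `describe()` output; the variable names are only illustrative.

```python
import math

n = 3000  # number of observations, from the "count" row

# Means of the four tripre dummies, copied from the describe() output.
tripre_means = {"tripre1": 0.804, "tripre2": 0.153, "tripre3": 0.033, "tripre0": 0.01}
share_total = sum(tripre_means.values())
print(f"Sum of tripre shares: {share_total:.3f}")

# For a dummy with mean p, the sample variance is p * (1 - p) * n / (n - 1).
p_smoker = 0.194
implied_std = math.sqrt(p_smoker * (1 - p_smoker) * n / (n - 1))
print(f"Implied std of smoker: {implied_std:.6f} (reported: 0.395495)")
```

If both checks line up with the table, the dummies are coded consistently and no rows are missing from the counts.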


## Step 4 - Regressions of variable effects on birthweight
We will now use the statsmodels library to perform three regressions to find the effects of other variables on birthweight. We will use the OLS (ordinary least squares) method to perform the regression.

In [26]:
import statsmodels.api as sm
from scipy import stats # for calculating t-values and confidence intervals

# Birthweight on Smoker

# Defining the independent and dependent variables:
X = df["smoker"]
y = df["birthweight"]

# Adding a constant to the independent variables:
X = sm.add_constant(X)

# Performing the regression:
model = sm.OLS(y, X).fit()

# Printing the summary of the regression:
print(model.summary())

### Calculating 95% CI, by hand:

SE = model.bse

# Accessing the coefficient and SE for smoker:

coef_smoker = model.params['smoker']
SE_smoker = SE['smoker']

# t-value for 95% CI, two-tailed:

alpha = 0.05  # 5%
n = len(y)
k = len(model.params) - 1  # Number of predictors excluding the intercept
t_value = stats.t.ppf(1 - alpha / 2, df=n - k - 1)

# Calculating the 95% confidence interval
lower_bound = coef_smoker - t_value * SE_smoker
upper_bound = coef_smoker + t_value * SE_smoker

print(f"95% Confidence Interval for smoker: ({lower_bound.round(3)}, {upper_bound.round(3)})")

                            OLS Regression Results                            
Dep. Variable:            birthweight   R-squared:                       0.029
Model:                            OLS   Adj. R-squared:                  0.028
Method:                 Least Squares   F-statistic:                     88.28
Date:                Mon, 17 Nov 2025   Prob (F-statistic):           1.09e-20
Time:                        19:54:09   Log-Likelihood:                -23364.
No. Observations:                3000   AIC:                         4.673e+04
Df Residuals:                    2998   BIC:                         4.674e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       3432.0600     11.871    289.115      0.0

### Quick explanation of the regression:

The regression shows that the effect of smoking on birthweight is negative, -253.223 grams on average, with a p-value of 0.000. This means that the effect of smoking on birthweight is statistically significant at the 5% level.
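Because `smoker` is a 0/1 dummy, this regression simply compares group means: the constant is the average birthweight of non-smokers, and the constant plus the coefficient is the average for smokers. As a rough check using only numbers printed above (the constant 3432.06 from the summary and the sample means from `describe()`), the weighted average of the two group means should reproduce the overall mean birthweight:

```python
const = 3432.06             # intercept from the regression summary
coef_smoker = -253.223      # coefficient on smoker
share_smokers = 0.194       # mean of smoker from df.describe()
overall_mean = 3382.933667  # mean of birthweight from df.describe()

mean_nonsmokers = const
mean_smokers = const + coef_smoker

# Weighted average of the two group means.
implied_mean = (1 - share_smokers) * mean_nonsmokers + share_smokers * mean_smokers

print(f"Mean birthweight, non-smokers: {mean_nonsmokers:.3f} g")
print(f"Mean birthweight, smokers:     {mean_smokers:.3f} g")
print(f"Implied overall mean:          {implied_mean:.3f} g (reported: {overall_mean:.3f} g)")
```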

The 95% CI is given to us in the summary: it is [-306.074, -200.383]. The hand calculation in the cell above reproduces it.
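The reported interval can also be read backwards. This is a minimal sketch, assuming the large-sample normal critical value (about 1.96) in place of the t quantile with 2998 degrees of freedom used in the cell above. Half the width of the interval divided by the critical value gives the implied standard error, and the coefficient divided by that standard error gives the t-statistic. In a regression with a single regressor, the squared t-statistic should be close to the reported F-statistic of 88.28.

```python
from statistics import NormalDist

coef_smoker = -253.223
lower, upper = -306.074, -200.383  # reported 95% CI for smoker
f_statistic = 88.28                # from the regression summary

# Normal approximation to the two-tailed 5% critical value (assumption).
z = NormalDist().inv_cdf(0.975)

implied_se = (upper - lower) / 2 / z
t_stat = coef_smoker / implied_se

print(f"Implied SE:  {implied_se:.3f}")
print(f"Implied t:   {t_stat:.3f}")
print(f"t squared:   {t_stat ** 2:.2f} (reported F: {f_statistic})")
```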

## Expanded regression with more variables:
We will now perform the regression of birthweight on smoker, alcohol and number of prenatal visits.

In [31]:

# Birthweight on Smoker, Alcohol, and Number of Prenatal Visits
print(df.head())
# Defining the independent and dependent variables:
X = df[["smoker", "alcohol", "nprevist"]]
y = df["birthweight"]

# Adding a constant to the independent variables:
X = sm.add_constant(X)

# Performing the regression:
model = sm.OLS(y, X).fit()

# Printing the summary of the regression:
print(model.summary())

### Calculating 95% CI, by hand:

SE = model.bse

# Accessing the coefficient and SE for smoker:

coef_smoker = model.params['smoker']
SE_smoker = SE['smoker']

# t-value for 95% CI, two-tailed:

alpha = 0.05  # 5%
n = len(y)
k = len(model.params) - 1  # Number of predictors excluding the intercept
t_value = stats.t.ppf(1 - alpha / 2, df=n - k - 1)

# Calculating the 95% confidence interval
lower_bound = coef_smoker - t_value * SE_smoker
upper_bound = coef_smoker + t_value * SE_smoker

print(f"95% Confidence Interval for smoker: ({lower_bound.round(3)}, {upper_bound.round(3)})")

   nprevist  alcohol  tripre1  tripre2  tripre3  tripre0  birthweight  smoker  \
0        12        0        1        0        0        0         4253       1   
1         5        0        0        1        0        0         3459       0   
2        12        0        1        0        0        0         2920       1   
3        13        0        1        0        0        0         2600       0   
4         9        0        1        0        0        0         3742       0   

   unmarried  educ  age  drinks  
0          1    12   27       0  
1          0    16   24       0  
2          0    11   23       0  
3          0    17   28       0  
4          0    13   27       0  
                            OLS Regression Results                            
Dep. Variable:            birthweight   R-squared:                       0.073
Model:                            OLS   Adj. R-squared:                  0.072
Method:                 Least Squares   F-statistic:                    

### Quick explanation of the regression:

The regression shows that the effect of smoking on birthweight is negative, -217.580 grams on average, somewhat smaller in magnitude than before, with a p-value of 0.000. This means that the effect of smoking on birthweight is statistically significant at the 5% level.

The 95% CI is given to us in the summary: it is [-269.892, -165.268]. The hand calculation in the cell above reproduces it.

## Expanded regression with even more variables:
We will now perform the regression of birthweight on smoker, alcohol, number of prenatal visits and marital status.

In [33]:

# Birthweight on Smoker, Alcohol, Number of Prenatal Visits and Marital Status
print(df.head())
# Defining the independent and dependent variables:
X = df[["smoker", "alcohol", "nprevist", "unmarried"]]
y = df["birthweight"]

# Adding a constant to the independent variables:
X = sm.add_constant(X)

# Performing the regression:
model = sm.OLS(y, X).fit()

# Printing the summary of the regression:
print(model.summary())

### Calculating 95% CI, by hand:

SE = model.bse

# Accessing the coefficient and SE for smoker:

coef_smoker = model.params['smoker']
SE_smoker = SE['smoker']

# t-value for 95% CI, two-tailed:

alpha = 0.05  # 5%
n = len(y)
k = len(model.params) - 1  # Number of predictors excluding the intercept
t_value = stats.t.ppf(1 - alpha / 2, df=n - k - 1)

# Calculating the 95% confidence interval
lower_bound = coef_smoker - t_value * SE_smoker
upper_bound = coef_smoker + t_value * SE_smoker

print(f"95% Confidence Interval for smoker: ({lower_bound.round(3)}, {upper_bound.round(3)})")

   nprevist  alcohol  tripre1  tripre2  tripre3  tripre0  birthweight  smoker  \
0        12        0        1        0        0        0         4253       1   
1         5        0        0        1        0        0         3459       0   
2        12        0        1        0        0        0         2920       1   
3        13        0        1        0        0        0         2600       0   
4         9        0        1        0        0        0         3742       0   

   unmarried  educ  age  drinks  
0          1    12   27       0  
1          0    16   24       0  
2          0    11   23       0  
3          0    17   28       0  
4          0    13   27       0  
                            OLS Regression Results                            
Dep. Variable:            birthweight   R-squared:                       0.089
Model:                            OLS   Adj. R-squared:                  0.087
Method:                 Least Squares   F-statistic:                    

### Quick explanation of the regression:

The regression shows that the effect of smoking on birthweight is negative, -175.377 grams on average, smaller in magnitude again, with a p-value of 0.000. This means that the effect of smoking on birthweight is statistically significant at the 5% level.

The 95% CI is given to us in the summary and is reproduced by the hand calculation in the cell above; it is centered on the new estimate of -175.377.

## Step 5 - Omitted variable bias explanations

Regression 1 seems to suffer the most from OVB. The smoker coefficient moves from -253.223 to -217.580 and then to -175.377 as controls are added, so we can guess that smoking correlates with drinking and being unmarried, which also affect birthweight negatively. We could also check the correlations between smoker and the omitted variables to confirm this, but we won't do that here.

Regression 2 seems to suffer less from OVB because it includes alcohol and the number of prenatal visits, but it still omits marital status.
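The mechanism can be illustrated on synthetic data. The sketch below does not use the real data set: all probabilities and coefficients are made up, and `ols_slope` and `residualize` are helpers written for this illustration. It checks the omitted variable bias formula, which says that the smoker coefficient from the short regression equals the coefficient from the long regression plus the coefficient on the omitted variable times the slope from regressing the omitted variable on smoker.

```python
import random
from statistics import mean

def ols_slope(x, y):
    """Slope from an OLS regression of y on x with an intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def residualize(x, z):
    """Residuals of x after an OLS regression of x on z with an intercept."""
    b = ols_slope(z, x)
    a = mean(x) - b * mean(z)
    return [xi - a - b * zi for xi, zi in zip(x, z)]

random.seed(0)
n = 3000
# Synthetic data: smoking is more common among unmarried mothers (made-up numbers).
unmarried = [1 if random.random() < 0.23 else 0 for _ in range(n)]
smoker = [1 if random.random() < (0.35 if u else 0.15) else 0 for u in unmarried]
birthweight = [3450 - 175 * s - 190 * u + random.gauss(0, 570)
               for s, u in zip(smoker, unmarried)]

# Short regression: birthweight on smoker only.
beta_short = ols_slope(smoker, birthweight)

# Long regression: birthweight on smoker and unmarried, via Frisch-Waugh-Lovell.
beta_long = ols_slope(residualize(smoker, unmarried), birthweight)
gamma_long = ols_slope(residualize(unmarried, smoker), birthweight)

# Auxiliary regression: unmarried on smoker.
delta = ols_slope(smoker, unmarried)

bias = gamma_long * delta
print(f"Short regression coefficient: {beta_short:.3f}")
print(f"Long regression coefficient:  {beta_long:.3f}")
print(f"Bias (gamma * delta):         {bias:.3f}")
print(f"Long + bias:                  {beta_long + bias:.3f}")
```

In this setup the bias is negative because the omitted variable lowers birthweight and is positively correlated with smoking, so the short regression overstates the magnitude of the smoking effect.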

## Step 6 - Interpretation of unmarried in Regression 3
Let's return to the third regression, which includes unmarried, and calculate the 95% CI for unmarried. It is given in the summary, but we will calculate it by hand. We will also calculate the t-value of unmarried by hand.

In [42]:

# Birthweight on Smoker, Alcohol, Number of Prenatal Visits and Marital Status
print(df.head())
# Defining the independent and dependent variables:
X = df[["smoker", "alcohol", "nprevist", "unmarried"]]
y = df["birthweight"]

# Adding a constant to the independent variables:
X = sm.add_constant(X)

# Performing the regression:
model = sm.OLS(y, X).fit()

# Printing the summary of the regression:
print(model.summary())

### Calculating 95% CI, by hand:

SE = model.bse

# Accessing the coefficient and SE for unmarried:

coef_unmarried = model.params['unmarried']
SE_unmarried = SE['unmarried']

# t-value for 95% CI, two-tailed:

alpha = 0.05  # 5%
n = len(y)
k = len(model.params) - 1  # Number of predictors excluding the intercept
t_value = stats.t.ppf(1 - alpha / 2, df=n - k - 1)

# Calculating the t value of unmarried:

t_value_unmarried = abs(coef_unmarried / SE_unmarried)

print(t_value)
print(t_value_unmarried)

# Calculating the 95% confidence interval
lower_bound = coef_unmarried - t_value * SE_unmarried
upper_bound = coef_unmarried + t_value * SE_unmarried

print(f"95% Confidence Interval for Unmarried: ({lower_bound.round(3)}, {upper_bound.round(3)})")

print(f"Is unmarried statistically significant at alpha = 0.05? {t_value_unmarried > t_value}")

   nprevist  alcohol  tripre1  tripre2  tripre3  tripre0  birthweight  smoker  \
0        12        0        1        0        0        0         4253       1   
1         5        0        0        1        0        0         3459       0   
2        12        0        1        0        0        0         2920       1   
3        13        0        1        0        0        0         2600       0   
4         9        0        1        0        0        0         3742       0   

   unmarried  educ  age  drinks  
0          1    12   27       0  
1          0    16   24       0  
2          0    11   23       0  
3          0    17   28       0  
4          0    13   27       0  
                            OLS Regression Results                            
Dep. Variable:            birthweight   R-squared:                       0.089
Model:                            OLS   Adj. R-squared:                  0.087
Method:                 Least Squares   F-statistic:                    

### Quick explanation of CI, statistical significance and implications of the policy suggestion:

The 95% CI for unmarried is [-238.128, -136.139]. The coefficient is statistically significant at the 5% level and has the largest negative impact on birthweight. Being unmarried might imply being subject to a variety of other socio-economic factors, such as poverty, low income or bad eating habits, which could affect birthweight negatively.

The suggested policy of increasing the marriage rate might lead to higher birthweight, but we cannot say for certain: the coefficient on unmarried likely picks up these omitted factors rather than a causal effect of marriage itself.
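The significance decision can be restated in three equivalent ways: the 95% CI excludes zero, the absolute t-value exceeds the critical value, and the two-sided p-value is below 0.05. The sketch below works only from the reported interval for unmarried. It assumes the normal approximation for the critical value, and it takes the coefficient to be the midpoint of the interval, since the coefficient itself is not quoted above.

```python
from statistics import NormalDist

lower, upper = -238.128, -136.139      # reported 95% CI for unmarried
coef_unmarried = (lower + upper) / 2   # midpoint of the interval (assumption)

std_normal = NormalDist()
z = std_normal.inv_cdf(0.975)          # normal approximation to the critical value

implied_se = (upper - lower) / 2 / z
t_stat = coef_unmarried / implied_se
p_value = 2 * (1 - std_normal.cdf(abs(t_stat)))

ci_excludes_zero = not (lower <= 0 <= upper)
t_exceeds_critical = abs(t_stat) > z

print(f"Implied t-value: {t_stat:.3f}")
print(f"Two-sided p-value: {p_value:.2e}")
print(f"CI excludes zero: {ci_excludes_zero}, |t| > critical value: {t_exceeds_critical}")
```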

## Step 7 - Additional regression with other variables

In [50]:

# Birthweight on Smoker, Alcohol, Number of Prenatal Visits, Marital Status and Education
print(df.head())
# Defining the independent and dependent variables:
X = df[["smoker", "alcohol", "nprevist", "educ", "unmarried"]]
y = df["birthweight"]

# Adding a constant to the independent variables:
X = sm.add_constant(X)

# Performing the regression:
model = sm.OLS(y, X).fit()

# Printing the summary of the regression:
print(model.summary())

### Calculating 95% CI, by hand:

SE = model.bse

# Accessing the coefficient and SE for smoker:

coef_smoker = model.params['smoker']
SE_smoker = SE['smoker']

# t-value for 95% CI, two-tailed:

alpha = 0.05  # 5%
n = len(y)
k = len(model.params) - 1  # Number of predictors excluding the intercept
t_value = stats.t.ppf(1 - alpha / 2, df=n - k - 1)

# Calculating the 95% confidence interval
lower_bound = coef_smoker - t_value * SE_smoker
upper_bound = coef_smoker + t_value * SE_smoker

print(f"95% Confidence Interval for smoker: ({lower_bound.round(3)}, {upper_bound.round(3)})")

   nprevist  alcohol  tripre1  tripre2  tripre3  tripre0  birthweight  smoker  \
0        12        0        1        0        0        0         4253       1   
1         5        0        0        1        0        0         3459       0   
2        12        0        1        0        0        0         2920       1   
3        13        0        1        0        0        0         2600       0   
4         9        0        1        0        0        0         3742       0   

   unmarried  educ  age  drinks  
0          1    12   27       0  
1          0    16   24       0  
2          0    11   23       0  
3          0    17   28       0  
4          0    13   27       0  
                            OLS Regression Results                            
Dep. Variable:            birthweight   R-squared:                       0.089
Model:                            OLS   Adj. R-squared:                  0.087
Method:                 Least Squares   F-statistic:                    

### Quick explanation of the regression:

Education seems to have little effect on birthweight once the other variables are included: adding it leaves the R-squared at 0.089. <br>
The CI for the smoker coefficient seems to be quite robust, since the intervals overlap from regression to regression. As more variables are added, the CI shifts but stays largely the same. <br>

We can conclude that the effect of smoking on birthweight is negative.
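The overlap can be checked mechanically. This is a small sketch using only the two smoker intervals and the three point estimates quoted earlier in this notebook; `overlap` is a helper defined here for illustration.

```python
def overlap(a, b):
    """Return the intersection of two closed intervals, or None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

ci_reg1 = (-306.074, -200.383)  # smoker CI, regression 1
ci_reg2 = (-269.892, -165.268)  # smoker CI, regression 2
coefs = {"regression 1": -253.223, "regression 2": -217.580, "regression 3": -175.377}

print("Overlap of CI 1 and CI 2:", overlap(ci_reg1, ci_reg2))

# Which point estimates fall inside which interval?
for name, b in coefs.items():
    in_ci1 = ci_reg1[0] <= b <= ci_reg1[1]
    in_ci2 = ci_reg2[0] <= b <= ci_reg2[1]
    print(f"{name}: {b} inside CI 1: {in_ci1}, inside CI 2: {in_ci2}")
```

Overlapping intervals are only an informal robustness check. Here the regression 3 point estimate falls outside the regression 1 interval, which fits the omitted variable bias story in Step 5.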