# Possible determinants of Birthweight

## Introduction

The dataset I will be working with describes birthweight and its possible association with a number of variables.

The variables I am most interested in are: **Birthweight** (babies' birthweight, in lbs); **smoker** (1 = yes, 0 = no); **motherage** (mothers' age, in years); **mppwt** (mothers' pre-pregnancy weight, in lbs); **Gestation** (gestational age of the pregnancy, in weeks); **length** (babies' length, in inches); and **catsmoke** (babies born to mothers who were categorised as non-smokers, as having smoked 'some' cigarettes, or as having smoked 'many' cigarettes).

## Script

### Reading in the csv file from my directory

In [18]:
import pandas as pd
import numpy as np

In [None]:
birthweight = pd.read_csv(r"C:\Users\cgonu20\Desktop\DATA SCIENCES\birthweight_win.csv")
birthweight.head()

In [None]:
#visualising whole dataset.
birthweight

In [None]:
#Counting the non-null values in each column (42 cases were obtained).
birthweight.count()

In [None]:
#Having a graphic look at the whole dataset to see what shape it has.
birthweight.plot()

### Simple calculations/ descriptive analysis

In [None]:
#This shows the average birthweight (in lbs) of all the babies in the dataset
birthweight["Birthweight"].mean()

In [None]:
#The average age of mothers in this study
birthweight["motherage"].mean()

In [None]:
#The average prepregnancy weight of mothers
birthweight["mppwt"].mean()

In [None]:
#Average gestational age of babies
birthweight["Gestation"].mean()
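The separate means above can also be obtained in one step. The following is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are hypothetical stand-ins, not the study data); it assumes only that the columns of interest are numeric.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the dataset (values are made up)
toy = pd.DataFrame({
    "Birthweight": [7.0, 8.0, 6.0, 7.0],
    "motherage": [24, 30, 21, 29],
    "mppwt": [120, 130, 110, 140],
    "Gestation": [39, 41, 37, 39],
})

# One call returns the mean of every selected column
means = toy[["Birthweight", "motherage", "mppwt", "Gestation"]].mean()
print(means)

# describe() reports count, mean, std, min, quartiles and max per column
summary = toy.describe()
print(summary.loc[["mean", "min", "max"]])
```

On the real data the same two calls would be made on `birthweight` instead of `toy`.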

##### Now that I have a glimpse of the basic look of the dataset, I want to find the maximum and minimum values of the babies' birthweight

In [None]:
#maximum
birthweight["Birthweight"].max()

In [None]:
#minimum
birthweight["Birthweight"].min()

#### I would then like to find the rows that contain the maximum and minimum birthweights
##### With this I am hoping to see the rows that recorded the heaviest and lightest babies at birth, and to make some assumptions about why this may be so

In [None]:
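#Selecting the row with the maximum birthweight (10.0 lbs)
birthweight.loc[birthweight['Birthweight'] == birthweight['Birthweight'].max()]

In [None]:
#Selecting the row with the minimum birthweight (4.2 lbs)
birthweight.loc[birthweight['Birthweight'] == birthweight['Birthweight'].min()]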

From the above output, one can quickly suspect an association between gestational age and the babies' birthweight. I will explore this further later on.
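As an alternative way of locating the extreme rows, `idxmax()` and `idxmin()` return the index label of the largest and smallest value in a column. The sketch below uses a made-up table: the birthweights 10.0 and 4.2 echo the values found above, while the `id` and `Gestation` values are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the 10.0 and 4.2 birthweights come from the notebook
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "Birthweight": [7.0, 10.0, 4.2, 8.1],
    "Gestation": [39, 45, 33, 40],
})

# idxmax()/idxmin() give the index label of the extreme value,
# and .loc[] then pulls out the whole row
heaviest = toy.loc[toy["Birthweight"].idxmax()]
lightest = toy.loc[toy["Birthweight"].idxmin()]

print(heaviest)
print(lightest)
```

Note that `idxmax()` and `idxmin()` return only the first matching row, so a boolean mask is still the better choice if ties are possible.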

## Manipulating the dataset

#### Collapsing the dataset to drop the columns that describe the fathers' attributes
This is because I am only interested in the babies and their mothers. I have also summed the variables by ID, so that each row represents an individual mother and baby.

In [None]:
#Notice the position of the 'id' now: it has become the index!

mother_baby_cols = ['Birthweight', 'Gestation', 'smoker', 'motherage', 'mnocig',
                    'catsmoke', 'mheight', 'mage35', 'mppwt', 'LowBirthWeight']
birthweight_collapsed = birthweight.groupby('id').aggregate({col: 'sum' for col in mother_baby_cols})
birthweight_collapsed
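To illustrate what this collapse does, here is a minimal sketch on a made-up table. It assumes there is exactly one row per `id` (which I have not verified for the real data); the `fage` column is a hypothetical father-related variable used only for illustration.

```python
import pandas as pd

# Hypothetical table with one row per id; 'fage' stands in for a father column
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "Birthweight": [7.0, 6.5, 8.0],
    "smoker": [0, 1, 0],
    "fage": [30, 28, 35],
})

# Only the columns named in the dictionary survive, and 'id' becomes the index
collapsed = toy.groupby("id").aggregate({"Birthweight": "sum", "smoker": "sum"})
print(collapsed)

# With one row per id, each 'sum' is simply the original value
print(collapsed.loc[2, "Birthweight"])
```

If an `id` appeared more than once, the sums would add those rows together, which would not be meaningful for variables such as age or gestation.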

## Visualisation of the relationships between the numerical variables

In [None]:
#This gives me the linear correlation between all these variables

%matplotlib inline
import seaborn as sns
corr = birthweight_collapsed.corr()

#plot the heatmap
sns.heatmap(corr, vmin=-1.0, vmax=1.0, square=True, cmap="RdBu")

In [None]:
from pandas.plotting import scatter_matrix

#Making a scatter matrix for the different variables and visualising how each is spread out
a = scatter_matrix(birthweight_collapsed, figsize=(16, 16))

#### Exploring further the correlations between birthweight and other variables

In [None]:
birthweight_collapsed.corr()

From the output above, I noticed that each variable is perfectly positively correlated with itself (the diagonal of the matrix is 1.0). In the next line of code I have replaced these diagonal values with NaN so that I can report the observed correlations accurately.

In [None]:
import numpy as np
corr = birthweight_collapsed.corr()
np.fill_diagonal(corr.values, np.nan)
corr

From the output above, I can see a positive correlation between **Birthweight** and **Gestation**, and between **mnocig** and **smoker** 

The strongest negative correlation was seen between **smoker** and **Birthweight**; however, the significance of this association was not tested and is therefore not known.
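One way the significance of a correlation could be checked is with `scipy.stats.pearsonr`, which returns the correlation coefficient together with a two-sided p-value. The sketch below is not run on the study data: the `smoker` and `birthweight_lbs` arrays are synthetic and the effect size is an arbitrary assumption.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the study data): 42 cases, 0/1 smoking status
smoker = rng.integers(0, 2, size=42)
# Assumed for illustration only: smokers' babies are lighter on average
birthweight_lbs = 7.5 - 0.8 * smoker + rng.normal(0, 1.0, size=42)

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(smoker, birthweight_lbs)
print(f"r = {r:.2f}, p = {p_value:.3f}")
```

A small p-value would suggest that the correlation is unlikely to be due to chance alone, although it says nothing about confounding.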

In [None]:
#I will be exploring the association between Birthweight and Gestation
from pandas.plotting import scatter_matrix
import numpy as np

In [None]:
#visualising the relationship between gestational age and babies weight at birth

sns.relplot(data=birthweight_collapsed, x="Birthweight", y="Gestation")

In [None]:
#Adding units of measurement and colouring the points by "LowBirthWeight" to make sense of the plot

g = sns.relplot(data=birthweight, x="Birthweight", y="Gestation", hue="LowBirthWeight", style="LowBirthWeight")
g.set_axis_labels("Birthweight (lbs)", "Gestation (wks)")

The first thing that I would conclude is that there is a clear association between gestational age and birthweight. However, gestational age may not be the only factor influencing babies' weight at birth. There seem to be other confounding variables **(e.g. whether or not the mother smoked during pregnancy, the mother's pre-pregnancy weight, etc.)** and I want to explore this further.

In [None]:
#Looking graphically at how maternal pre-pregnancy weight is distributed for low and normal birthweight babies
sns.displot(data=birthweight, x="mppwt", hue="LowBirthWeight", kind="kde", common_norm=False)

In [None]:
#Looking at how baby length (inches) is associated with birthweight graphically
birthweight.plot.scatter("length", "Birthweight")

### Plotting birthweight against mothers' smoking status (catsmoke: non-smokers, smoked some, smoked many)

In [None]:
sns.catplot(data=birthweight, x="catsmoke", y="Birthweight")

In [None]:
sns.catplot(data=birthweight, x="catsmoke", y="Birthweight", kind="box")

##### Plotting Birthweight against Gestation and assigning LowBirthWeight scale and smoking status to the variables

In [None]:
sns.relplot(data=birthweight, x="Gestation", y="Birthweight", hue="LowBirthWeight", size="smoker")

## Conclusion 1

I can see from the two catsmoke plots above that, in general, babies born to mothers who do not smoke have a slightly higher average birthweight than those in the other two categories.

The last plot indicates that babies born at full term (38 weeks of gestation or later) have a normal birthweight, unlike those born before this age.

Again, it seems that babies born to non-smoking mothers have a normal birthweight compared to the others. The significance of this negative association was, however, not determined.
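The comparison between smoking categories can also be put into numbers rather than read off the plots. This is a minimal sketch on made-up values; the category labels `"none"`, `"some"` and `"many"` are an assumption about how `catsmoke` is coded.

```python
import pandas as pd

# Made-up values; the category labels are assumed, not taken from the data
toy = pd.DataFrame({
    "catsmoke": ["none", "none", "some", "some", "many", "many"],
    "Birthweight": [8.0, 7.0, 7.0, 6.0, 6.5, 5.5],
})

# Count, mean and median birthweight within each smoking category
summary = toy.groupby("catsmoke")["Birthweight"].agg(["count", "mean", "median"])
print(summary)
```

The group counts matter here: with only 42 cases in total, a difference in means between small groups should be read cautiously.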

### Having a go with creating and training a model (machine learning)

In [None]:
#Narrowing the dataset to fewer variables for further exploration of the dataset
birthweight_target = birthweight[["Birthweight", "Gestation", "smoker"]]
birthweight_target

### Linear regression

In [None]:
from sklearn.linear_model import LinearRegression

#fit_intercept is True so that the fitted line is not forced through the origin.
model = LinearRegression(fit_intercept=True)
model

In [None]:
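#X is Birthweight and y is Gestation, so this model predicts gestational age from birthweight.
#To predict Birthweight from Gestation instead, the two columns would need to be swapped here and in the cells below.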
X = birthweight_target[["Birthweight"]]
y = birthweight_target["Gestation"]

In [None]:
model.fit(X, y)

In [None]:
x_fit = pd.DataFrame({"Birthweight": [X["Birthweight"].min(), X["Birthweight"].max()]})
y_pred = model.predict(x_fit)

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
birthweight_target.plot.scatter("Birthweight", "Gestation", ax=ax)
ax.plot(x_fit["Birthweight"], y_pred, linestyle=":")

In [None]:
#Inspecting the gradient and intercept of the fitted line
print("Model gradient: ", model.coef_[0])
print("Model intercept:", model.intercept_)
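The gradient and intercept describe a straight line, so a prediction is simply `gradient * x + intercept`. The sketch below checks this on synthetic points that lie exactly on the line y = 2x + 30 (made-up numbers, not the study data). It also shows that scikit-learn fixes the intercept at zero when `fit_intercept=False`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic points lying exactly on y = 2x + 30
X = pd.DataFrame({"x": [4.0, 6.0, 8.0, 10.0]})
y = 2.0 * X["x"] + 30.0

# With an intercept, the fit recovers gradient 2 and intercept 30
model = LinearRegression(fit_intercept=True).fit(X, y)
manual = model.coef_[0] * X["x"] + model.intercept_
print(np.allclose(manual, model.predict(X)))

# Without an intercept, the line is forced through the origin
origin_model = LinearRegression(fit_intercept=False).fit(X, y)
print(origin_model.intercept_)
```

Forcing the line through the origin changes the gradient as well, so the two models can give quite different predictions.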

In [None]:
#Redefining X and y variables
X = birthweight_target[["Birthweight"]]
y = birthweight_target["Gestation"]

In [None]:
#Splitting the data
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

In [None]:
#Plotting to see that train and test are from the same distribution
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

ax.scatter(train_X, train_y, color="red", marker="o", label="train")
ax.scatter(test_X, test_y, color="blue", marker="x", label="test")
ax.legend()

In [None]:
#Passing train to the fit function

from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)
model.fit(train_X, train_y)

In [None]:
#Calling the score method to check the fit

model.score(test_X, test_y)

Although the model scored about 0.7 (the R² returned by `score`) on the test set, it did not fit the data particularly well, so I decided to try fitting the data using **non-linear regression**.
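For a regression model, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The following sketch computes it by hand on synthetic data (the slope, intercept and noise level are arbitrary assumptions) and compares the result with `model.score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data for illustration only: 42 points scattered around a line
x = rng.uniform(4, 10, size=42).reshape(-1, 1)
y = 30 + 1.2 * x.ravel() + rng.normal(0, 1.5, size=42)

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(x, y))
```

An R² of 0.7 therefore means that roughly 70% of the variance in the target is accounted for by the model on the data it was scored on.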

### Polynomial regression

In [None]:
#importing 
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer
from sklearn.pipeline import make_pipeline


model = make_pipeline(PolynomialFeatures(degree=1), Ridge(alpha=1e-3))

In [None]:
#split the data into test and train
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

In [None]:
#Plotting to see that train and test are from the same distribution
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

ax.scatter(train_X, train_y, color="red", marker="o", label="train")
ax.scatter(test_X, test_y, color="blue", marker="x", label="test")
ax.legend()

In [None]:
#fitting the training data to the model
model.fit(train_X, train_y)

In [None]:
#score for the polynomial regression model 
model.score(test_X, test_y)

#I initially tried the model with degrees of 4, 3 and 2; a degree of 1 gave the best score.
#Note that degree=1 is equivalent to a straight-line fit.
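With a small dataset, the score from a single train/test split depends heavily on which rows land in the test set. A possible way of comparing polynomial degrees more fairly is cross-validation, sketched below with `cross_val_score` on synthetic data (the data-generating line and the choice of 5 folds are assumptions made for illustration).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)

# Synthetic data for illustration only: 42 points scattered around a line
x = rng.uniform(4, 10, size=42).reshape(-1, 1)
y = 30 + 1.2 * x.ravel() + rng.normal(0, 1.5, size=42)

# Mean R^2 over 5 folds for each candidate degree
mean_scores = {}
for degree in (1, 2, 3, 4):
    pipeline = make_pipeline(PolynomialFeatures(degree=degree), Ridge(alpha=1e-3))
    scores = cross_val_score(pipeline, x, y, cv=5)
    mean_scores[degree] = scores.mean()
    print(degree, round(scores.mean(), 3))
```

Each degree is scored on every fold in turn, so the comparison no longer rests on one particular split.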

In [None]:
#model prediction
import numpy as np
x_fit = pd.DataFrame({"Birthweight": np.linspace(X["Birthweight"].min(), X["Birthweight"].max())})

y_pred = model.predict(x_fit)

In [None]:
#visualising the model
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
birthweight_target.plot.scatter("Birthweight", "Gestation", ax=ax)
ax.plot(x_fit["Birthweight"], y_pred, linestyle=":")

## Conclusion 2

Neither the linear regression nor the polynomial regression gave a good fit for Birthweight and Gestation.

I repeated this for all of the confounding variables. Most of them are continuous predictor variables, and their association with the baby's birthweight (the output variable) is not very clear. This could explain the non-linear, wiggly relationship I observed.

For example, an abnormally high or an abnormally low maternal weight may each adversely affect the baby's weight at birth (birthweight). Whichever way one looks at it, no purely positive or purely negative relationship can be observed for these variables. I plan to study further how to fit a smoothing spline in Python as, to my understanding, this would be the way around it.

#### THE END

Code running time: 56 seconds