## Module 13 Exercise - Correlation and Linear Regression

### 1. Complete the code below to import the four libraries that we've primarily covered in the class. 

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.formula.api as sm

### 2. Import the "babies.csv" file and name it df. 

<b>BACKGROUND INFO</b>

    The Child Health and Development Studies considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The goal is to model the weight of the infants (bwt, in ounces) using variables including length of pregnancy in days (gestation), mother's age in years (age), mother's height in inches (height), whether the child was the first born (parity), mother's pregnancy weight in pounds (weight), and whether the mother was a smoker (smoke).

<b>VARIABLES</b>

    case - id number
    bwt - birthweight, in ounces
    gestation - length of gestation, in days
    parity - binary indicator for a first pregnancy (0=first pregnancy)
    age - mother's age in years
    height - mother's height in inches
    weight - mother's weight in pounds
    smoke - binary indicator for whether the mother smokes

In [2]:
df = pd.read_csv("babies.csv")

### 3. Check the shape of the dataset. How many columns are there? How many rows?

In [3]:
print(df.shape)

(1236, 8)


### 4. Check the first 10 rows and the last 10 rows of the dataset. Drop the column "case". 

In [4]:
df.head(10)

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,0.0
1,2,113,282.0,0,33.0,64.0,135.0,0.0
2,3,128,279.0,0,28.0,64.0,115.0,1.0
3,4,123,,0,36.0,69.0,190.0,0.0
4,5,108,282.0,0,23.0,67.0,125.0,1.0
5,6,136,286.0,0,25.0,62.0,93.0,0.0
6,7,138,244.0,0,33.0,62.0,178.0,0.0
7,8,132,245.0,0,23.0,65.0,140.0,0.0
8,9,120,289.0,0,25.0,62.0,125.0,0.0
9,10,143,299.0,0,30.0,66.0,136.0,1.0


In [5]:
df.tail(10)


In [6]:
df.drop(columns = "case", inplace = True)

### 5. Is there any missing data in the dataset? Use whichever code you like to check.

In [7]:
df.isnull().sum()

bwt           0
gestation    13
parity        0
age           2
height       22
weight       36
smoke        10
dtype: int64

### 6. The amount of missing data is small considering the size of our dataset. Drop all rows in the dataset that have missing data. Create a copy of your dataset without missing data. 

In [8]:
df.dropna(inplace = True)
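
Question 6 also asks for a copy of the data without missing values. The sketch below uses a hypothetical toy frame (the names `toy` and `toy_clean` and all values are made up for illustration). It counts the affected rows first, then keeps the cleaned result in a new DataFrame instead of modifying the original in place:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for babies.csv
toy = pd.DataFrame({
    "bwt": [120, 113, 128, 123],
    "gestation": [284.0, 282.0, np.nan, 279.0],
    "smoke": [0.0, np.nan, np.nan, 1.0],
})

# Number of rows that have at least one missing value
rows_with_na = int(toy.isna().any(axis=1).sum())
share_dropped = rows_with_na / len(toy)

# dropna() without inplace returns a new DataFrame; .copy() makes the
# independence from the original explicit
toy_clean = toy.dropna().copy()

print(rows_with_na, share_dropped, toy_clean.shape)
```

In the exercise itself the shape goes from 1236 rows to 1174 rows after dropping, so 62 rows (about 5%) are removed.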

### 7. Are there any duplicate rows in the dataset? Check, if there are, drop them. 

In [9]:
df.loc[df.duplicated()]

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke


In [10]:
df.shape

(1174, 7)

### 8. Check each of your numeric columns for outliers - pick one method and use it for all the columns. 

In [11]:
dfz = df.copy()

In [12]:
numeric_cols = ["bwt", "gestation", "age", "height", "weight"]

# absolute z-score for each numeric column, computed on the full dataset
for col in numeric_cols:
    dfz["zscore_" + col] = np.abs(stats.zscore(dfz[col]))

# drop every row whose z-score exceeds 3 in any numeric column
for col in numeric_cols:
    z_outliers = dfz.loc[dfz["zscore_" + col] > 3].index
    dfz = dfz.drop(z_outliers)

In [13]:
dfz.drop(columns =["zscore_weight", "zscore_height", "zscore_age", "zscore_gestation", "zscore_bwt"], inplace = True)
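
The same z-score rule can be applied without helper columns. This is a minimal sketch on hypothetical toy data (the `toy` frame and its planted outliers are assumptions for illustration). It computes the z-scores with the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`, and drops any row that exceeds 3 in at least one column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 30 rows with one extreme value planted in each column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "bwt": rng.normal(120, 5, 30),
    "age": rng.normal(27, 2, 30),
})
toy.loc[3, "bwt"] = 400.0  # planted outlier
toy.loc[7, "age"] = 90.0   # planted outlier

# Absolute z-score of every value, column by column (population std, ddof=0)
z = ((toy - toy.mean()) / toy.std(ddof=0)).abs()

# A row is an outlier if any of its columns has |z| > 3
is_outlier = (z > 3).any(axis=1)
toy_clean = toy.loc[~is_outlier]

print(toy_clean.shape)
```

Because all z-scores are computed before any row is removed, this gives the same rows as dropping column by column.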

In [14]:
dfz.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,0,27.0,62.0,100.0,0.0
1,113,282.0,0,33.0,64.0,135.0,0.0
2,128,279.0,0,28.0,64.0,115.0,1.0
4,108,282.0,0,23.0,67.0,125.0,1.0
5,136,286.0,0,25.0,62.0,93.0,0.0


In [15]:
print(dfz.shape)

(1134, 7)


### 9. Print the descriptive statistics for each numeric column. What is the average age of the mothers? What is the average gestation period?

In [16]:
dfz.describe()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
count,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0
mean,119.661376,279.42328,0.267196,27.206349,64.040564,127.218695,0.390653
std,17.860299,13.930125,0.442691,5.802403,2.468179,18.385539,0.488112
min,65.0,232.0,0.0,15.0,57.0,87.0,0.0
25%,109.0,272.0,0.0,23.0,62.0,114.0,0.0
50%,120.0,280.0,0.0,26.0,64.0,125.0,0.0
75%,131.0,288.0,1.0,31.0,66.0,137.0,1.0
max,174.0,324.0,1.0,44.0,71.0,190.0,1.0


In [17]:
# the average age of the mothers is about 27.2 years
# the average gestation period is about 279.4 days
# the median (50%) weight of the mothers is 125 lbs
# the medians of the other numeric columns are in the 50% row: bwt 120 oz, gestation 280 days, age 26 years, height 64 in

### 10. Let's model birthweight based on the characteristics of the mother. We want to distinguish between the numeric and categorical variables. Replace the values 0/1 in the parity and smoke column with meaningful labels (i.e. smokes, doesn't smoke).

In [18]:
# mother's characteristics here are: age, height, weight
# assign the result back instead of calling replace(..., inplace=True) on a column selection
dfz["parity"] = dfz["parity"].replace([0, 1], ["First Preg", "Not First Preg"])
dfz["smoke"] = dfz["smoke"].replace([0.0, 1.0], ["doesn't smoke", "smokes"])
dfz.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,First Preg,27.0,62.0,100.0,doesn't smoke
1,113,282.0,First Preg,33.0,64.0,135.0,doesn't smoke
2,128,279.0,First Preg,28.0,64.0,115.0,smokes
4,108,282.0,First Preg,23.0,67.0,125.0,smokes
5,136,286.0,First Preg,25.0,62.0,93.0,doesn't smoke


### 11. Run a correlation matrix with your dataset. Which variables are correlated with birthweight? Describe the strength of the correlation between all the continuous independent variables and birthweight. 

In [19]:
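# parity and smoke now hold text labels, so restrict the matrix to the numeric columns
dfz.corr(numeric_only=True)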

Unnamed: 0,bwt,gestation,age,height,weight
bwt,1.0,0.411186,0.026259,0.212844,0.166821
gestation,0.411186,1.0,-0.05402,0.072687,0.045045
age,0.026259,-0.05402,1.0,-0.001016,0.16076
height,0.212844,0.072687,-0.001016,1.0,0.46382
weight,0.166821,0.045045,0.16076,0.46382,1.0


In [20]:
# Gestation has a moderate positive linear relationship with birthweight (r = 0.41).
# Height (r = 0.21) and weight (r = 0.17) of the mother have weak positive linear relationships with birthweight.
# Age of the mother (r = 0.03) has essentially no linear relationship with birthweight.
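
To make the strength descriptions explicit, here is a small sketch that labels each correlation with birthweight. The cutoffs (0.1, 0.3, 0.5) are an assumed rule of thumb, not something defined in this exercise, and conventions differ between textbooks. The `strength` helper is hypothetical.

```python
# Pearson correlations with bwt, copied from the correlation matrix above
r_with_bwt = {
    "gestation": 0.411186,
    "age": 0.026259,
    "height": 0.212844,
    "weight": 0.166821,
}

def strength(r):
    """Label |r| using assumed rule-of-thumb cutoffs (conventions differ)."""
    size = abs(r)
    if size < 0.1:
        return "negligible"
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    return "strong"

labels = {name: strength(r) for name, r in r_with_bwt.items()}
for name, label in labels.items():
    direction = "positive" if r_with_bwt[name] > 0 else "negative"
    print(f"{name}: {label} {direction} (r = {r_with_bwt[name]:.2f})")
```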

### 12. Determine the relationship between birthweight and the categorical variables: parity and smoke. Use the groupby function to determine if there are any differences between birthweight and the different levels of the variables.  Does it seem like there is a relationship between these variables and birthweight?

In [21]:
dfz[["bwt"]].groupby(dfz["parity"]).mean()
## mothers who are having their first child have higher bwt's on average

parity,bwt
First Preg,120.174489
Not First Preg,118.254125


In [22]:
dfz[["bwt"]].groupby(dfz["smoke"]).mean()
## non-smokers have higher bwt's on average

smoke,bwt
doesn't smoke,123.337192
smokes,113.927765


In [23]:
# birthweight is noticeably higher on average when the mother doesn't smoke (about 9.4 oz)
# and slightly higher for first pregnancies (about 1.9 oz), so smoking looks like the stronger relationship
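
A difference in group means does not show by itself whether the gap is larger than chance would produce. One common follow-up is a two-sample t-test. The sketch below uses Welch's version (`equal_var=False`) on simulated data. The group sizes, means, and spread are assumptions chosen only to resemble the exercise, not the real babies data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical birthweights in ounces for two simulated groups
non_smokers = rng.normal(loc=123, scale=17, size=500)
smokers = rng.normal(loc=114, scale=17, size=500)

# Welch's t-test does not assume equal variances in the two groups
t_stat, p_value = stats.ttest_ind(non_smokers, smokers, equal_var=False)
mean_gap = non_smokers.mean() - smokers.mean()

print(f"mean gap = {mean_gap:.2f} oz, t = {t_stat:.2f}, p = {p_value:.3g}")
```

On the real data the two arrays would be the `bwt` values of each group, for example `dfz.loc[dfz["smoke"] == "smokes", "bwt"]`.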

### 13. Let's construct your regression model. Firstly, which variables do you plan to include in your model, and why? In the space below, write your justification for why you are including each variable. 

The regression model uses bwt as the dependent variable and includes all six candidate predictors: gestation, the mother's characteristics (age, height, weight), parity, and smoking status (smoke).

Genetics and habits can both influence an infant's weight. Gestation has the strongest correlation with birthweight (moderate, r = 0.41), so it is worth estimating how much each additional day of pregnancy adds to bwt.
The mother's age, height and weight have at most a weak linear relationship with bwt, but they are included as potential confounders so that each coefficient is estimated while holding the others constant. The groupby results also showed a small difference in average birthweight between first and later pregnancies and a larger one between smokers and non-smokers. Including parity and smoke in the model lets us check whether those differences hold up once the other variables are controlled for.

### 14. Construct your regression model and print the summary. Write out your full interpretation of the regression results. If you are not happy with the results, tweak your model and run it again. 

In [24]:
## create the regression model
result = sm.ols('bwt ~ gestation + age + height + weight + C(parity) + C(smoke)', data = dfz).fit()

## print the regression model summary
result.summary()

Dep. Variable:,bwt,R-squared:,0.264
Model:,OLS,Adj. R-squared:,0.26
Method:,Least Squares,F-statistic:,67.37
Date:,"Tue, 29 Nov 2022",Prob (F-statistic):,1.1e-71
Time:,17:16:53,Log-Likelihood:,-4703.6
No. Observations:,1134,AIC:,9421.0
Df Residuals:,1127,BIC:,9457.0
Df Model:,6,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-97.8230,15.101,-6.478,0.000,-127.452,-68.194
C(parity)[T.Not First Preg],-3.6400,1.113,-3.272,0.001,-5.823,-1.457
C(smoke)[T.smokes],-8.1849,0.944,-8.669,0.000,-10.037,-6.332
gestation,0.4926,0.033,14.867,0.000,0.428,0.558
age,-0.0249,0.085,-0.291,0.771,-0.193,0.143
height,1.2259,0.211,5.819,0.000,0.813,1.639
weight,0.0486,0.029,1.698,0.090,-0.008,0.105

Omnibus:,4.377,Durbin-Watson:,2.08
Prob(Omnibus):,0.112,Jarque-Bera (JB):,5.187
Skew:,0.007,Prob(JB):,0.0748
Kurtosis:,3.331,Cond. No.,10400.0


### Interpretation

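My adjusted R-squared is 0.260, meaning about 26% of the variation in bwt is explained by the independent variables in the model, after adjusting for the number of predictors. This is fairly low: most of the variation in birthweight comes from factors not captured here, so a better-fitting model would need additional predictors. The model as a whole is still statistically significant (F = 67.37, p = 1.1e-71).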
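
As a check on where the adjusted value comes from, the standard formula can be applied to the numbers in the summary table (R-squared, number of observations, and Df Model). The variable names are illustrative only.

```python
# Values copied from the regression summary above
r_squared = 0.264
n_obs = 1134        # No. Observations
n_predictors = 6    # Df Model

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(round(adj_r_squared, 3))
```

With this many observations and only six predictors the penalty is small, which is why the adjusted value stays close to the unadjusted 0.264.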



* Parity: 
   * Coef(-3.6) - On average, women who are not on their first pregnancy have babies with a birthweight 3.6 oz lower than women on their first pregnancy (when controlling for gestation length, age, height and weight of the mother, and smoking status.)
   * p-val(0.001) - Statistically significant

* Smoke: 
   * Coef(-8.2) - On average, women who smoke have babies with a birthweight 8.2 oz lower than women who don't smoke (when controlling for gestation length, age, height and weight of the mother, and parity.)
   * p-val(0.00) - Statistically significant

* Gestation: 
   * Coef(0.49) - For every 1-day increase in gestation period, birthweight increases by 0.49 oz (when controlling for smoking status, age, height and weight of the mother, and parity.)
   * p-val(0.00) - Statistically significant
             
* Age:
   * Coef(-0.02) - For every 1-year increase in age of the mother, birthweight decreases by 0.02 oz (when controlling for gestation length, smoking status, height and weight of the mother, and parity.)
   * p-val(0.77) - Not statistically significant at the 0.05 level. If age truly had no effect, we would see a coefficient at least this far from zero about 77% of the time.
       
* Height: 
   * Coef(1.2) - For every 1-inch increase in height of the mother, birthweight increases by 1.2 oz (when controlling for gestation length, smoking status, age and weight of the mother, and parity.)
   * p-val(0.00) - Statistically significant
          
* Weight: 
   * Coef(0.05) - For every 1-lb increase in weight of the mother, birthweight increases by 0.05 oz (when controlling for gestation length, smoking status, age and height of the mother, and parity.)
   * p-val(0.09) - Not statistically significant at the 0.05 level. If weight truly had no effect, we would see a coefficient at least this far from zero about 9% of the time.
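
The p-values in the summary can be reproduced from the reported t statistics. This sketch assumes the usual two-sided test against a true coefficient of zero, using a t distribution with the residual degrees of freedom from the table. The dictionary names are illustrative.

```python
from scipy import stats

df_resid = 1127  # "Df Residuals" from the summary

# t statistics as reported in the coefficient table
t_reported = {"age": -0.291, "weight": 1.698}

# Two-sided p-value: probability of a t statistic at least this far
# from zero if the true coefficient were zero
p_values = {
    name: 2 * stats.t.sf(abs(t), df_resid)
    for name, t in t_reported.items()
}

for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
```

Both values land above the conventional 0.05 cutoff, which is why age and weight are treated as not statistically significant in this model.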



### 15. Create three scenarios (i.e. make up specific values) and predict the birthweight given these factors using the information on the model you were most pleased with. 

### Scenario 1

    * 30 year old mom, non-smoker, first pregnancy, gestation period of 290 days, height 70 inches and 120 pounds. 

In [25]:
## We can use the predict function (from the statsmodels library) to predict the outcome for specific input values
## call predict on the fitted model (result) so that the appropriate coefficients are used

# model_name.predict({'variable1_name':value1, 'variable2_name':value2, ...})

result.predict({
    'gestation': 290, 
    'parity': "First Preg", 
    'age': 30, 
    'height': 70,
    'weight':120,
    'smoke':"doesn't smoke"})

0    135.92636
dtype: float64

### Scenario 2

    * 25 year old mom, smoker, third pregnancy, gestation period of 240 days, height 66 inches and 200 pounds. 

In [26]:
result.predict({
    'gestation': 240, 
    'parity': "Not First Preg", 
    'age': 25, 
    'height': 66,
    'weight':200,
    'smoke':"smokes"})

0    98.584757
dtype: float64

### Scenario 3

    * 45 year old mom, smoker, first pregnancy, gestation period of 300 days, height 60 inches and 199 pounds. 

In [27]:
result.predict({
    'gestation': 300, 
    'parity': "First Preg", 
    'age': 45, 
    'height': 60,
    'weight':199,
    'smoke':"smokes"})

0    123.877583
dtype: float64
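
The three predictions can be checked by hand from the coefficient table, since the fitted model is just the intercept plus each coefficient times its input. The sketch below uses the rounded coefficients printed in the summary, so it agrees with `result.predict` only to about two decimal places. The `predict_bwt` helper and the dictionary keys are hypothetical names for illustration.

```python
# Coefficients copied (rounded) from the regression summary
coef = {
    "Intercept": -97.8230,
    "not_first_preg": -3.6400,  # C(parity)[T.Not First Preg]
    "smokes": -8.1849,          # C(smoke)[T.smokes]
    "gestation": 0.4926,
    "age": -0.0249,
    "height": 1.2259,
    "weight": 0.0486,
}

def predict_bwt(gestation, age, height, weight, first_preg, smokes):
    """Predicted birthweight in ounces from the rounded coefficients."""
    total = coef["Intercept"]
    total += coef["gestation"] * gestation
    total += coef["age"] * age
    total += coef["height"] * height
    total += coef["weight"] * weight
    # The dummy terms apply only to the non-reference categories
    if not first_preg:
        total += coef["not_first_preg"]
    if smokes:
        total += coef["smokes"]
    return total

scenario_1 = predict_bwt(290, 30, 70, 120, first_preg=True, smokes=False)
scenario_2 = predict_bwt(240, 25, 66, 200, first_preg=False, smokes=True)
scenario_3 = predict_bwt(300, 45, 60, 199, first_preg=True, smokes=True)

print(round(scenario_1, 2), round(scenario_2, 2), round(scenario_3, 2))
```

"First Preg" and "doesn't smoke" are the reference categories, which is why they add nothing to the total.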

## Great Job!! Submit your assignment via Canvas. 