## Module 13 Exercise - Correlation and Linear Regression

### 1. Complete the code below to import the four libraries that we've primarily covered in the class. 

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.formula.api as sm

### 2. Import the "babies.csv" file and name it df. 

<b>BACKGROUND INFO</b>

    The Child Health and Development Studies considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The goal is to model the weight of the infants (bwt, in ounces) using variables including length of pregnancy in days (gestation), mother's age in years (age), mother's height in inches (height), whether the child was the first born (parity), mother's pregnancy weight in pounds (weight), and whether the mother was a smoker (smoke).

<b>VARIABLES</b>

    case - id number
    bwt - birthweight, in ounces
    gestation - length of gestation, in days
    parity - binary indicator for a first pregnancy (0=first pregnancy)
    age - mother's age in years
    height - mother's height in inches
    weight - mother's weight in pounds
    smoke - binary indicator for whether the mother smokes

In [None]:
df = pd.read_csv("babies.csv")

### 3. Check the shape of the dataset. How many columns are there? How many rows?

In [None]:
print(df.shape)

### 4. Check the first 10 rows and the last 10 rows of the dataset. Drop the column "case". 

In [None]:
df.head(10)

In [None]:
df.tail(10)

In [None]:
df.drop(columns = "case", inplace = True)

### 5. Is there any missing data in the dataset? Use whichever code you like to check.

In [None]:
df.isnull().sum()

### 6. The amount of missing data is small considering the size of our dataset. Drop all rows in the dataset that have missing data. Create a copy of your dataset without missing data. 

In [None]:
df = df.dropna().copy()

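By default `dropna()` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to keep it. A minimal sketch on a toy frame (the values are made up for illustration and are not from babies.csv):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these values are not from babies.csv.
toy = pd.DataFrame({
    "bwt": [120, 113, np.nan, 128],
    "age": [27, np.nan, 33, 28],
})

# Count missing values per column before dropping anything.
missing_before = toy.isnull().sum()

# dropna() returns a new DataFrame; .copy() makes the result independent.
toy_clean = toy.dropna().copy()

print(missing_before)
print(toy.shape, "->", toy_clean.shape)
```

The original `toy` frame still contains its missing values; only `toy_clean` is free of them.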
### 7. Are there any duplicate rows in the dataset? Check, and if there are any, drop them.

In [None]:
print(df.duplicated().sum())  # number of fully duplicated rows
df = df.drop_duplicates()

### 8. Check each of your numeric columns for outliers - pick one method and use it for all the columns. 

In [None]:
dfz = df.copy()

In [None]:
dfz["zscore_bwt"] = np.abs(stats.zscore(dfz["bwt"]))
dfz["zscore_gestation"] = np.abs(stats.zscore(dfz["gestation"]))
dfz["zscore_age"] = np.abs(stats.zscore(dfz["age"]))
dfz["zscore_height"] = np.abs(stats.zscore(dfz["height"]))
dfz["zscore_weight"] = np.abs(stats.zscore(dfz["weight"]))

z_outliers = dfz.loc[dfz["zscore_bwt"] > 3].index
z1_outliers = dfz.loc[dfz["zscore_gestation"] > 3].index
z2_outliers = dfz.loc[dfz["zscore_age"] > 3].index
z3_outliers = dfz.loc[dfz["zscore_height"] > 3].index
z4_outliers = dfz.loc[dfz["zscore_weight"] > 3].index

print(z_outliers)
print(z1_outliers)
print(z2_outliers)
print(z3_outliers)
print(z4_outliers)

In [None]:
dfz.loc[[632, 829, 912, 978, 1139]]

In [None]:
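# errors="ignore" skips rows already removed by an earlier drop
dfz = dfz.drop(z_outliers, errors="ignore")
dfz = dfz.drop(z1_outliers, errors="ignore")
dfz = dfz.drop(z2_outliers, errors="ignore")
dfz = dfz.drop(z3_outliers, errors="ignore")
dfz = dfz.drop(z4_outliers, errors="ignore")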

print(dfz.shape)

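The per-column z-score steps above can also be written as a single mask over all numeric columns. The sketch below uses synthetic data with one planted outlier; the column names mirror the exercise but the values are invented. The z-score is computed by hand with `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; not the babies.csv values.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "bwt": rng.normal(120, 18, 200),
    "gestation": rng.normal(279, 16, 200),
})
toy.loc[0, "bwt"] = 400  # plant one obvious outlier

numeric_cols = ["bwt", "gestation"]

# Absolute z-score per column (population std, like scipy.stats.zscore).
z = ((toy[numeric_cols] - toy[numeric_cols].mean())
     / toy[numeric_cols].std(ddof=0)).abs()

# Keep only rows where every column is within 3 standard deviations.
keep = (z <= 3).all(axis=1)
toy_no_outliers = toy.loc[keep]

print(toy.shape, "->", toy_no_outliers.shape)
```

Because the mask is built once, a row that is an outlier in several columns is removed only once and no label is dropped twice.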
### 9. Print the descriptive statistics for each numeric column. What is the average age of the mothers? What is the average gestation period?

In [None]:
dfz.head()

In [None]:
# Measures of central tendency for quantitative data

print("The average value of each numeric column")
print(dfz[["bwt"]].mean())  
print(dfz[["gestation"]].mean())
print(dfz[["age"]].mean())
print(dfz[["height"]].mean())
print(dfz[["weight"]].mean())
print(" ")

print("The most frequently occurring value in each numeric column")
print(dfz[["bwt"]].mode())  
print(dfz[["gestation"]].mode())
print(dfz[["age"]].mode())
print(dfz[["height"]].mode())
print(dfz[["weight"]].mode())
print(" ")

print("The median value of each numeric column")
print(dfz[["bwt"]].median())  
print(dfz[["gestation"]].median())
print(dfz[["age"]].median())
print(dfz[["height"]].median())
print(dfz[["weight"]].median())
print(" ")

In [None]:
dfz.describe()

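As a compact alternative to printing each statistic column by column, `DataFrame.agg` can compute several statistics at once. A small sketch on made-up values (the numbers are illustrative and are not from babies.csv):

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "age": [25, 27, 33, 35],
    "gestation": [270, 280, 284, 286],
})

# One row per statistic, one column per variable.
summary = toy.agg(["mean", "median"])
print(summary)
```

Here `summary.loc["mean", "age"]` gives the average age and `summary.loc["mean", "gestation"]` the average gestation period of the toy data.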
In [None]:
# The median weight of the mothers is 125 lbs.
# For the other variables, the median is the average of the two middle values.

### 10. Let's model birthweight based on the characteristics of the mother. We want to distinguish between the numeric and categorical variables. Replace the values 0/1 in the parity and smoke column with meaningful labels (i.e. smokes, doesn't smoke).

In [None]:
# mother's characteristics here are: age, height, weight
# assign the result back; in-place replace on a selected column may not update dfz
dfz["parity"] = dfz["parity"].replace({0: "First Preg", 1: "Not First Preg"})
dfz["smoke"] = dfz["smoke"].replace({0: "doesn't smoke", 1: "smokes"})
dfz.head()

### 11. Run a correlation matrix with your dataset. Which variables are correlated with birthweight? Describe the strength of the correlation between all the continuous independent variables and birthweight. 

In [None]:
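# restrict to the continuous columns; the labelled columns are no longer numeric
dfz[["bwt", "gestation", "age", "height", "weight"]].corr()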

In [None]:
# Gestation is the only continuous variable with a clear linear relationship
# to birthweight (moderate, positive). Height and weight are only weakly
# related to bwt, and age shows almost no linear relationship.

## birthweight is correlated with:
## gestation (0.4): moderate positive correlation
## age (0.03): very weak positive correlation
## height (0.2): weak positive correlation
## weight (0.2): weak positive correlation

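To read the strength of each relationship directly, the `bwt` column of the correlation matrix can be pulled out and ranked by absolute value. A minimal sketch on synthetic data, where `bwt` is constructed to depend on `gestation` and not on `age` (all values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data: bwt depends on gestation, age is independent noise.
rng = np.random.default_rng(42)
n = 300
gestation = rng.normal(279, 16, n)
toy = pd.DataFrame({
    "gestation": gestation,
    "age": rng.normal(27, 6, n),
    "bwt": 0.5 * gestation + rng.normal(0, 8, n),
})

# Pearson correlation of every other column with bwt.
corr_with_bwt = toy.corr()["bwt"].drop("bwt")

# Rank by strength (absolute value), strongest first.
order = corr_with_bwt.abs().sort_values(ascending=False).index
ranked = corr_with_bwt.reindex(order)
print(ranked.round(2))
```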
### 12. Determine the relationship between birthweight and the categorical variables: parity and smoke. Use the groupby function to determine if there are any differences between birthweight and the different levels of the variables.  Does it seem like there is a relationship between these variables and birthweight?

In [None]:
dfz[["bwt"]].groupby(dfz["parity"]).mean()
## mothers who are having their first child have higher bwt's on average

In [None]:
dfz[["bwt"]].groupby(dfz["smoke"]).mean()
## non-smokers have higher bwt's on average

In [None]:
# Birth weight tends to be higher when the mother doesn't smoke
# and when it is her first pregnancy (parity = 0)

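A group mean is easier to judge next to the group size, since a difference based on only a few rows is less convincing. The sketch below aggregates both at once; the data are made up and the labels simply mirror the ones used in this exercise.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "smoke": ["doesn't smoke", "smokes", "doesn't smoke",
              "smokes", "doesn't smoke", "smokes"],
    "bwt": [120, 110, 130, 118, 125, 114],
})

# Count and mean birthweight for each level of the categorical variable.
summary = toy.groupby("smoke")["bwt"].agg(["count", "mean"])
print(summary)

# Difference in mean birthweight between the two groups.
gap = summary.loc["doesn't smoke", "mean"] - summary.loc["smokes", "mean"]
print(gap)
```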
### 13. Let's construct your regression model. Firstly, which variables do you plan to include in your model, and why? In the space below, write your justification for why you are including each variable. 


The regression model will include the gestation period, the mother's characteristics (height, weight), and her habits (smoke).

Genetics and habits are factors that can influence the variation in an infant's weight.

### 14. Construct your regression model and print the summary. Write out your full interpretation of the regression results. If you are not happy with the results, tweak your model and run it again. 

In [None]:
## create the regression model
result = sm.ols('bwt ~ gestation + age + height + weight + C(parity) + C(smoke)', data = dfz).fit()

## print the regression model summary
result.summary()

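To make the coefficient table easier to interpret, it can help to see that a prediction is just the intercept plus each coefficient times its input. The sketch below fits a smaller model on synthetic data (the data-generating numbers are invented, and the alias `smf` is used here instead of this notebook's `sm`) and reproduces `predict` by hand. The categorical term appears as `C(smoke)[T.smokes]` because "doesn't smoke" sorts first and becomes the reference level.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration; not the babies.csv values.
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    "gestation": rng.normal(279, 16, n),
    "smoke": rng.choice(["doesn't smoke", "smokes"], n),
})
toy["bwt"] = (
    -10
    + 0.45 * toy["gestation"]
    - 8 * (toy["smoke"] == "smokes")
    + rng.normal(0, 10, n)
)

fit = smf.ols("bwt ~ gestation + C(smoke)", data=toy).fit()

# Prediction from the fitted model for one hypothetical mother.
new = pd.DataFrame({"gestation": [290], "smoke": ["smokes"]})
predicted = fit.predict(new).iloc[0]

# The same prediction assembled from the coefficients.
by_hand = (
    fit.params["Intercept"]
    + fit.params["gestation"] * 290
    + fit.params["C(smoke)[T.smokes]"] * 1
)
print(predicted, by_hand)
```

Passing a DataFrame with several rows to `predict` returns one prediction per row, which is convenient for comparing scenarios side by side.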
### 15. Create three scenarios (i.e. make up specific values) and predict the birthweight given these factors using the information on the model you were most pleased with. 

### Scenario 1

    * 40 year old mom, non-smoker, first pregnancy, gestation period of 290 days, height 70 inches and 120 pounds. 

In [None]:
## Use the predict method of the fitted model (statsmodels library) to predict the outcome for specific inputs.
## Call predict on the fitted result object so that the appropriate coefficients are used.

# model_name.predict({'variable1_name':value1, 'variable2_name':value2, ...})

result.predict({
    'gestation': 290, 
    'parity': "First Preg", 
    'age': 40, 
    'height': 70,
    'weight':120,
    'smoke':"doesn't smoke"})

### Scenario 2

    * 25 year old mom, smoker, third pregnancy, gestation period of 240 days, height 66 inches and 200 pounds. 

In [None]:
result.predict({
    'gestation': 240, 
    'parity': "Not First Preg", 
    'age': 25, 
    'height': 66,
    'weight':200,
    'smoke':"smokes"})

### Scenario 3

    * 45 year old mom, smoker, first pregnancy, gestation period of 300 days, height 60 inches and 199 pounds. 

In [None]:
result.predict({
    'gestation': 300, 
    'parity': "First Preg", 
    'age': 45, 
    'height': 60,
    'weight':199,
    'smoke':"smokes"})

## Great Job!! Submit your assignment via Canvas. 