## Module 11: Intro to Correlation and Linear Regression 

***

You've learned to explore your dataset features, clean up messy data, and use descriptive statistics to summarize the characteristics of your dataset. 

We are now moving forward with exploring the relationship <b>between</b> variables. Determining how your variables interact is one of the most useful skills you will learn while working with data. 

The first step is understanding the dependency of your variables. 

***

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns
from matplotlib import pyplot as plt

## <font color=DODGERBLUE>Independent and Dependent Variables</font>

When you start to look at the relationship between variables, there are two classifications of variables that need to be considered. 

***

### Independent Variables (IV)
An independent variable, also known as a predictor variable, is a variable that is independent of the other variables in your dataset. Consider this variable the "cause". Independent variables are commonly represented with "x".  

### Dependent Variables (DV)
A dependent variable, also known as an outcome variable, is a variable whose value is dependent on the values of the other variables in your dataset. Consider this variable the "effect". We assume that changes in the values of the independent variable(s) will result in changes to the dependent variable. Dependent variables are commonly represented with "y". 

***

### Examples 
The distinction between IVs and DVs is an important concept to master because several analyses require you to identify which variables are IVs and which are DVs. We treat the IV as causing a change in the DV, not the other way around.

#### Do flowers grow fastest under fluorescent or natural light?

    - IV: Type of light flowers are grown under
    - DV: Rate of flower growth

#### What is the effect of diet and regular soda on blood sugar levels?

    - IV: Type of soda
    - DV: Blood sugar levels

#### How does cellphone use before bedtime influence sleep?

    - IV: Amount of phone use before bedtime
    - DV: Hours of sleep, quality of sleep, restfulness, etc. 

In [2]:
df = pd.read_csv("EduGradeData.csv")

df.head()

## which variables can be considered independent and dependent?

Unnamed: 0,fname,lname,gender,age,exercise,level_of_fit,hours,level_of_study,grade,home_state
0,Marcia,Pugh,female,17,3,low,10,moderate,82.4,NJ
1,Kadeem,Morrison,male,18,4,low,4,low,78.2,MA
2,Nash,Powell,male,18,5,low,9,moderate,79.3,OH
3,Noelani,Wagner,female,14,2,high,7,moderate,83.2,FL
4,Noelani,Cherry,female,18,4,low,15,high,87.4,OH


## <font color=GOLDENROD>Your Turn</font>

    1. Import the "insurance.csv" file; name the dataset 'df2'. Preview the first 5 rows. 
    2. In the space below, determine what type of variable is in each column (qualitative/quantitative). 
    3. What variable(s) could be considered dependent?

In [3]:
df2 = pd.read_csv("insurance.csv")
df2.head()

Unnamed: 0,age,sex,bmi,children,smoker,drinker,comorbidities,prescription,provider_type,charges
0,54,female,47.41,0,yes,yes,5,yes,out-of-network,63770.42801
1,45,male,30.36,0,yes,yes,5,yes,out-of-network,62592.87309
2,52,male,34.485,3,yes,yes,2,yes,out-of-network,60021.39897
3,31,female,38.095,1,yes,yes,4,yes,out-of-network,58571.07448
4,33,female,35.53,0,yes,yes,5,yes,out-of-network,55135.40209


# <font color=DODGERBLUE>Introduction to Correlation & Association</font>

***

### Correlation
***
Correlation is a tool that can be used to determine the strength of a relationship between two numeric variables. It describes the direction (+/-) and magnitude (how large) of the <b>linear relationship</b> that exists between two numeric variables. This is the initial check of whether your numeric variables have any kind of meaningful relationship. 

Correlation values range from "-1" to "+1". Variables that are positively correlated are closer to "+1" and variables that are negatively correlated are closer to "-1". 

#### Positively correlated variables move in the same direction - as one variable increases, the other variable also increases (and as one decreases, so does the other).

        - Increase in study time >> increases in test score
        - Decrease in sugar consumption >> decrease in blood sugar levels
        
#### Negatively correlated variables move in the opposite direction - as one variable increases, the other variable decreases (or vice versa). 

        - Decrease in daily spending >> increase in total savings
        - Increase in weight >> decrease in mobility
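
As a quick illustration of direction, the sketch below computes the correlation coefficient for two small made-up datasets. The numbers are invented for this example and do not come from the course files; `np.corrcoef` is used here only as one convenient way to get the value.

```python
import numpy as np

# Made-up values, for illustration only
study_hours = np.array([1, 2, 3, 4, 5, 6])
test_score = np.array([62, 65, 71, 74, 80, 85])       # rises with study time

daily_spending = np.array([10, 20, 30, 40, 50, 60])
total_savings = np.array([95, 88, 70, 61, 48, 40])    # falls as spending rises

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r_positive = np.corrcoef(study_hours, test_score)[0, 1]
r_negative = np.corrcoef(daily_spending, total_savings)[0, 1]

print(f"study time vs. test score: r = {r_positive:.3f}")
print(f"spending vs. savings:      r = {r_negative:.3f}")
```

The first value lands close to +1 and the second close to -1, matching the two patterns described above.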

In [4]:
df2.head()

Unnamed: 0,age,sex,bmi,children,smoker,drinker,comorbidities,prescription,provider_type,charges
0,54,female,47.41,0,yes,yes,5,yes,out-of-network,63770.42801
1,45,male,30.36,0,yes,yes,5,yes,out-of-network,62592.87309
2,52,male,34.485,3,yes,yes,2,yes,out-of-network,60021.39897
3,31,female,38.095,1,yes,yes,4,yes,out-of-network,58571.07448
4,33,female,35.53,0,yes,yes,5,yes,out-of-network,55135.40209


In [5]:
## Create a correlation matrix

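## numeric_only=True skips text columns such as sex and smoker (needed in pandas 2.0+)
df2.corr(numeric_only=True)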

Unnamed: 0,age,bmi,children,comorbidities,charges
age,1.0,0.109272,0.042469,0.611903,0.299008
bmi,0.109272,1.0,0.012759,0.06546,0.198341
children,0.042469,0.012759,1.0,-0.282273,0.067998
comorbidities,0.611903,0.06546,-0.282273,1.0,0.505246
charges,0.299008,0.198341,0.067998,0.505246,1.0


### Strength of Relationship between Variables
***
The strength of the relationship between variables is determined by the value of the correlation coefficient. To interpret its value, see which of the following values your correlation coefficient is closest to:

- <b>Exactly –1</b>: A perfect downhill (negative) linear relationship
- <b>–0.70</b>: A strong downhill (negative) linear relationship
- <b>–0.50</b>: A moderate downhill (negative) linear relationship
- <b>–0.30</b>: A weak downhill (negative) linear relationship
- <b>0</b>: No linear relationship
- <b>+0.30</b>: A weak uphill (positive) linear relationship
- <b>+0.50</b>: A moderate uphill (positive) linear relationship
- <b>+0.70</b>: A strong uphill (positive) linear relationship
- <b>Exactly +1</b>: A perfect uphill (positive) linear relationship

#### It is important to understand that <u>correlation is not the same as causation</u>. Identifying a correlation between two factors does not automatically mean one factor causes another factor to occur.
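
The reference values listed above can be turned into a small helper. This is only a sketch: `describe_correlation` is a hypothetical function written for this example, and it applies the "closest reference value" rule literally.

```python
def describe_correlation(r):
    """Label a correlation coefficient by its nearest reference value."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must be between -1 and +1")

    size = abs(r)
    if size == 1:
        strength = "perfect"
    else:
        # Reference magnitudes from the list above
        anchors = {0.0: "no", 0.30: "weak", 0.50: "moderate", 0.70: "strong"}
        nearest = min(anchors, key=lambda a: abs(a - size))
        strength = anchors[nearest]

    if strength == "no":
        return "no linear relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear relationship"


# Values taken from the insurance correlation matrix shown earlier
print(describe_correlation(0.505246))   # comorbidities vs. charges
print(describe_correlation(0.299008))   # age vs. charges
print(describe_correlation(-0.282273))  # children vs. comorbidities
print(describe_correlation(0.067998))   # children vs. charges
```

These labels are rules of thumb; a label of "strong" still says nothing about whether one variable causes the other.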

***

## <font color=GOLDENROD>Your Turn</font>

    1. Create a correlation matrix with the insurance dataset. 
    2. What can you say about the relationship between the independent variables and the dependent variable?

In [6]:
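## numeric_only=True skips text columns such as sex and smoker (needed in pandas 2.0+)
df2.corr(numeric_only=True)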

Unnamed: 0,age,bmi,children,comorbidities,charges
age,1.0,0.109272,0.042469,0.611903,0.299008
bmi,0.109272,1.0,0.012759,0.06546,0.198341
children,0.042469,0.012759,1.0,-0.282273,0.067998
comorbidities,0.611903,0.06546,-0.282273,1.0,0.505246
charges,0.299008,0.198341,0.067998,0.505246,1.0
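
To answer the second question, it can help to pull out just the `charges` column and rank the other variables by the size of their correlation. The sketch below types in the values from the matrix above so that it runs on its own; with the real data loaded, the same Series could be obtained from the correlation matrix by selecting its `charges` column.

```python
import pandas as pd

# Correlations with "charges", copied from the matrix printed above
corr_with_charges = pd.Series(
    {"age": 0.299008, "bmi": 0.198341, "children": 0.067998, "comorbidities": 0.505246},
    name="charges",
)

# Sort by absolute value so strong negative correlations would also rank highly
ranked = corr_with_charges.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

In this dataset, comorbidities has the strongest linear relationship with charges and children the weakest.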


### Qualitative Variables and Association
***
The correlation matrix is great for examining the relationship between two numeric variables. However, this method does not work when you want to assess the relationship between two qualitative variables, or between one qualitative and one numeric variable. 

For these situations, we are interested in determining the differences between groups. These differences include variations in counts or means. If we identify notable differences between groups (for example, the average test score for male students is 89 and the average test score for female students is 92), this is evidence that there is some underlying relationship present that can be further explored. 

In [7]:
## Two qualitative variables

pd.crosstab(df["gender"], df["level_of_fit"])

gender,level_of_fit=high,level_of_fit=low
female,387,613
male,410,590


In [8]:
## One qualitative variable, one numeric

df["grade"].groupby(df["gender"]).mean()

gender
female    82.7173
male      82.3948
Name: grade, dtype: float64
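
Raw counts are easier to compare once they are turned into proportions within each group. The sketch below rebuilds the crosstab above from its printed counts (so it runs without the CSV file) and divides each row by its total. With the real data, passing `normalize="index"` to `pd.crosstab` should give the same result.

```python
import pandas as pd

# Counts copied from the gender x level_of_fit crosstab above
counts = pd.DataFrame(
    {"high": [387, 410], "low": [613, 590]},
    index=pd.Index(["female", "male"], name="gender"),
)

# Divide each row by its row total to get within-gender proportions
row_share = counts.div(counts.sum(axis=1), axis=0)
print(row_share.round(3))
```

About 39% of female students and 41% of male students fall in the "high" group, so the two genders look very similar on this variable, much like the near-identical mean grades in the groupby output.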

## <font color=GOLDENROD>Your Turn</font>

    1. Using the insurance dataset, determine if there is any association between the categorical independent variables and the dependent variable. 

### Variance and your Dependent Variable
***
Understanding which variables have a relationship with your dependent variable is the first step. The next step is to better understand what is influencing that relationship, or the variance seen in your dependent variable. <B>Variance</B> describes the variation in values for a specific variable. 

We know that some people are taller than others; not everyone is exactly the same height. If we asked 100 people their height, we would not receive 100 identical answers. Instead, we would receive a range of values (with some overlap). This variation in values is variance. When variance is observed, the next step is to determine which factors account for it. In other words, what factors are responsible for the difference in height between the 100 people? Gender is likely an important consideration, as males tend to be taller on average. But is that the only reason some people are taller or shorter than others? Ethnicity may also influence this variance, since average height differs between some ethnic groups. Genetics might also contribute, as taller parents typically have taller children.

Gender, ethnicity, genetics, etc - are all factors that may contribute to the variation in height. In this example, these factors are independent variables and the height of the individual is the dependent variable. The key question is: <b>how do changes in the independent variables explain the variance in the dependent variable?</b> An additional question is: <b>how much of the variation in the dependent variable can be attributed to each independent variable?</b>

Once we understand the factors that are influencing the dependent variable, we can purposefully manipulate the values of the independent variables to predict how that will change the dependent variable. This is where regression comes in handy! 
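
To make "how much of the variation can be attributed to a factor" concrete, here is a minimal sketch with made-up heights for two groups. It compares the total variation with the variation that is left once each person is measured against their own group's average. The data and the two-group setup are assumptions for illustration only.

```python
import numpy as np

# Made-up heights in inches, for illustration only
heights_a = np.array([70.0, 71.0, 69.0, 72.0, 68.0])   # group A
heights_b = np.array([64.0, 65.0, 63.0, 66.0, 62.0])   # group B
all_heights = np.concatenate([heights_a, heights_b])

# Total variation: squared distance of every height from the overall mean
total_ss = ((all_heights - all_heights.mean()) ** 2).sum()

# Leftover variation: squared distance of every height from its own group mean
within_ss = (
    ((heights_a - heights_a.mean()) ** 2).sum()
    + ((heights_b - heights_b.mean()) ** 2).sum()
)

share_explained = 1 - within_ss / total_ss
print(f"variance of all heights: {all_heights.var():.2f}")
print(f"share of variation explained by group: {share_explained:.1%}")
```

In this toy example, knowing the group accounts for most of the variation in height. Regression answers the same kind of question for several independent variables at once.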

In [9]:
df.head(10)

Unnamed: 0,fname,lname,gender,age,exercise,level_of_fit,hours,level_of_study,grade,home_state
0,Marcia,Pugh,female,17,3,low,10,moderate,82.4,NJ
1,Kadeem,Morrison,male,18,4,low,4,low,78.2,MA
2,Nash,Powell,male,18,5,low,9,moderate,79.3,OH
3,Noelani,Wagner,female,14,2,high,7,moderate,83.2,FL
4,Noelani,Cherry,female,18,4,low,15,high,87.4,OH
5,Neil,Whitley,male,16,5,low,16,high,88.7,NJ
6,Nelle,Golden,female,17,1,high,9,moderate,80.2,PA
7,Armando,Hoffman,male,17,5,low,18,high,95.1,MI
8,Illiana,Rojas,female,15,5,low,9,moderate,76.5,LA
9,Neil,Wooten,male,15,3,low,15,high,89.7,TN


# <font color=DODGERBLUE>Introduction to Linear Regression</font>
***

<b>Linear Regression</b> is a powerful statistical tool that allows for a closer examination of the linear relationship between <b>a continuous dependent variable</b> and various independent variables. 

Linear regression allows you to :
1. Determine if the relationship between the dependent and independent variables is statistically significant (i.e. unlikely to be the result of chance alone).
2. Identify how much of the variation in the dependent variable is explained by the selection of independent variables. 
3. Determine the direction and magnitude of the relationships between variables, and 
4. Predict what the value of the dependent variable would be given specific input from the independent variables. 

You can use linear regression to predict the salary of a lawyer (DV) based on the number of years they practiced law (IV). You could also determine just how much of the variation in salary is attributed to the number of years they have practiced law.

***

A simple linear regression models the relationship between a single dependent variable and a single independent variable, <b>where the independent variable is predicting the value of the dependent variable</b>. A linear regression model is mathematically represented by the formula of a line:
### \begin{align}  y = mx + b \end{align}
Where “y” is the value of the dependent variable, “m” is the slope (also known as the coefficient), “x” represents the value of the independent variable, and “b” is the y-intercept (also known as the constant), which is the value of “y” when “x” is equal to 0. 

<b>Linear regression models will determine the line-of-best-fit, also known as the regression line, which is the best fitting straight line through your data points.</b> Most commonly, the best fitting line is the line that minimizes the sum of the squared errors (the vertical distances between each data point and the line). The equation for the regression line is what is used to make predictions for your dependent variable. 
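
A minimal sketch of fitting a line of best fit, using the lawyer salary idea from above. The five data points are made up for illustration, and `np.polyfit` stands in here for the regression library used later in this module.

```python
import numpy as np

# Made-up data: years practicing law (x) and salary in $1000s (y)
x = np.array([1, 3, 5, 7, 9], dtype=float)
y = np.array([62, 75, 91, 100, 118], dtype=float)

# A degree-1 fit returns the slope (m) and intercept (b) of the least-squares line
m, b = np.polyfit(x, y, 1)

# Sum of squared errors for the fitted line
sse_best = ((y - (m * x + b)) ** 2).sum()

# Any other line, such as one with a slightly steeper slope, has a larger error
sse_other = ((y - ((m + 0.5) * x + b)) ** 2).sum()

print(f"y = {m:.2f}x + {b:.2f}")
print(f"squared error of best-fit line: {sse_best:.2f}")
print(f"squared error of steeper line:  {sse_other:.2f}")
```

Here the slope says that each additional year of practice is associated with roughly 6.85 thousand dollars more in salary for this invented sample.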

<center><img src='https://s3.amazonaws.com/stackabuse/media/linear-regression-python-scikit-learn-1.png'></center>

***
## Things to Consider when using Linear Regression

***

### Feature Selection

Once we get a sense of the relationship between our variables, we need to make some decisions on which variables to include in our regression analyses. <b>We only want to include the variables that have some kind of relationship with our dependent variable(s) of interest.</b> This allows us to cut back on the empty "noise" in our dataset and focus on the variables that are meaningful. 

### Redundancy
Be careful with including multiple variables that show similar information. For example, if you have a variable "Hours of Study" which represents the total hours that a student studied, and you've also binned that data and created a new variable "Study Level" which groups students into "high, med, low" - you now have two variables in your dataset that show similar information, although one variable is a lot more specific. When you have this situation, you should elect to keep the variable that has the more detailed information (i.e. the specific hours of study). 

### Multiple Groups
Variables with a lot of different levels (i.e. State of Residence) don't do well in regression models. If you have a variable that has a large number of different groups, unless this variable is vital to your analyses, you should work to reduce the number of overall groups (i.e State can become Region), or leave the variable out of the analyses. It is possible to include these complicated variables, but it isn't always the easiest interpretation.
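
One way to reduce the number of groups is to map each level onto a broader category. The sketch below uses the states that appear in the first ten rows of the grade dataset; the `state_to_region` dictionary and the `region` column name are assumptions made for this example, and a real mapping would need to cover every state in the data.

```python
import pandas as pd

# Hypothetical lookup covering only the states seen in the preview
state_to_region = {
    "NJ": "Northeast", "MA": "Northeast", "PA": "Northeast",
    "OH": "Midwest", "MI": "Midwest",
    "FL": "South", "LA": "South", "TN": "South",
}

students = pd.DataFrame(
    {"home_state": ["NJ", "MA", "OH", "FL", "OH", "NJ", "PA", "MI", "LA", "TN"]}
)

# Replace each state with its region; unmapped states would become NaN
students["region"] = students["home_state"].map(state_to_region)

print(students["home_state"].nunique(), "states ->", students["region"].nunique(), "regions")
print(students["region"].value_counts())
```

Checking for missing values in the new column afterwards is a simple way to catch states that were left out of the lookup.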

### Confounding Variables
When we perform regression analyses, we are interested in further understanding the relationship between the dependent and independent variable. Simply put, we are interested in investigating how the independent variable affects the dependent variable. However, it's rarely this simple and there are other factors that need to be understood and controlled. A confounding variable is a third variable that influences both the independent and dependent variable to some degree. It is important to acknowledge this third variable to ensure that the results of your analyses are valid. 

For example, you collect data on sunburns and ice cream consumption. You find that higher ice cream consumption is associated with a higher rate of sunburns. Does this mean that eating ice cream causes sunburns? Absolutely not, there are several other factors that can be attributed to this trend -- but most likely, the confounding variable is temperature. Hot temperatures cause people to eat more ice cream and result in people spending more time outdoors, which can result in more sunburns. Without accounting for the confounding variable(s), you may find relationships between variables that might not actually exist. To control for the potential effects of confounding variables, you simply have to include them in your regression model as another independent variable. 

When you include all of your assumed confounding variables in your regression model, you are controlling for the effects of all of them. If you find there is still a relationship between a specific independent variable and your dependent variable, you can be more confident that the relationship isn't simply the result of these other factors.
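
The ice cream example can be simulated to see what "controlling for" a confounder does. Everything in this sketch is an assumption: the data are randomly generated so that temperature drives both variables and ice cream has no direct effect on sunburns, and `ols_coefficients` is a small helper written for this example using NumPy's least-squares solver rather than the regression library used later.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated data: temperature influences both ice cream and sunburns
temperature = rng.normal(75, 10, n)
ice_cream = 0.5 * temperature + rng.normal(0, 3, n)
sunburns = 0.2 * temperature + rng.normal(0, 2, n)   # no ice cream term at all


def ols_coefficients(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


naive = ols_coefficients(sunburns, ice_cream)                  # confounder left out
adjusted = ols_coefficients(sunburns, ice_cream, temperature)  # confounder included

print(f"ice cream coefficient without temperature: {naive[1]:.3f}")
print(f"ice cream coefficient with temperature:    {adjusted[1]:.3f}")
print(f"temperature coefficient:                   {adjusted[2]:.3f}")
```

Without temperature in the model, ice cream appears to predict sunburns. Once temperature is included, the ice cream coefficient shrinks toward zero and temperature picks up the effect.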

In [10]:
## new library alert! ##

## Import the StatsModels library for our regression analyses

import statsmodels.formula.api as sm

### Creating a Linear Regression Model

* <b>result</b> is the name that we are assigning to the fitted regression model
* <b>sm</b> is the alias we gave the statsmodels formula module when we imported it
* <b>ols</b> is Ordinary Least Squares, the most common method of calculating the regression line
* the regression equation starts with the dependent variable on the left of the "~", followed by the independent variables on the right
* independent variables are separated by "+"
* categorical variables are wrapped in parentheses and annotated with a "C", as in C(gender)
* data = is where you specify the dataset that you're pulling variables from
* <b>.fit()</b> function uses the data to calculate the best-fitting regression line
* <b>.summary()</b> function will show the calculated values (slopes and y-intercept) for the linear regression formula

In [11]:
## create the regression model
result = sm.ols('grade ~ hours + age + exercise + C(gender)', data = df).fit()

## print the regression model summary
result.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.665
Model:,OLS,Adj. R-squared:,0.664
Method:,Least Squares,F-statistic:,988.1
Date:,"Tue, 02 May 2023",Prob (F-statistic):,0.0
Time:,01:21:00,Log-Likelihood:,-6299.1
No. Observations:,2000,AIC:,12610.0
Df Residuals:,1995,BIC:,12640.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,58.0874,1.326,43.804,0.000,55.487,60.688
C(gender)[T.male],-0.4485,0.253,-1.773,0.076,-0.944,0.047
hours,1.9173,0.031,61.617,0.000,1.856,1.978
age,0.0405,0.075,0.543,0.587,-0.106,0.187
exercise,0.9841,0.089,11.073,0.000,0.810,1.158

0,1,2,3
Omnibus:,325.522,Durbin-Watson:,2.048
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2284.723
Skew:,-0.569,Prob(JB):,0.0
Kurtosis:,8.111,Cond. No.,214.0


# <font color=DODGERBLUE>Interpreting Linear Regression Result</font>
***
### Determining Model Fit
***
Linear regression calculates the regression line (equation) that minimizes the overall distance (the sum of the squared vertical distances) between the regression line and the data points. If our data points fall close to the generated regression line, we consider the model to be a good fit. 

<img src='https://blog.minitab.com/hubfs/Imported_Blog_Media/residual_illustration-1.gif'>

But what does that mean non-graphically? Linear regression may not always be the right technique to use for the specific set of data. The fit of the model describes how well your variables explain the variance in the dependent variable. 

To assess the model fit, we look at the adjusted R-squared (Adj. R-squared). R-squared is a statistical measure of how close the data are to the fitted regression line: <B>it is the percentage of variation in the dependent variable that can be explained by all the independent variables included in the model.</B> The Adj. R-squared is the same measure with a small penalty for each additional independent variable, which makes it fairer when comparing models of different sizes. For example, how much of the variation in student grades (i.e. grades ranging from 32-100) can be explained by hours of study, age, hours of exercise, gender, etc? Values typically range from 0 to 1; the higher the value, the better the fit. 
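
The adjusted value can be reproduced from numbers already in the summary table. The sketch below uses the standard adjustment formula with the R-squared, number of observations, and number of model terms (Df Model) reported for our model; `adjusted_r_squared` is a helper written for this example.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Shrink R-squared slightly for every predictor included in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read from the regression summary: R-squared, No. Observations, Df Model
adj = adjusted_r_squared(r_squared=0.665, n_obs=2000, n_predictors=4)
print(round(adj, 3))
```

With 2,000 observations and only four model terms the penalty is tiny, which is why the two values in the summary are almost identical.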

***
### Intercept
***
The y-intercept, or constant, is the predicted value of the dependent variable when all independent variables are equal to 0 (and any categorical variable is at its reference category). In our model, when all independent variables are equal to 0, the expected student grade is approximately 58. Don't worry too much about this interpretation. Oftentimes it won't make sense (a student with an age of 0, for example), but it's important to know how to read it. 

***
### Coefficients (coef)
***
These values show how changes in the independent variable influence the dependent variable, and in what direction. 

#### <b> Interpreting Numeric Variable Coefficients </b>
When you are looking at numeric variables (i.e. age), each coefficient represents the numeric change in the dependent variable given a one-unit change in the independent variable. For example, for every one hour increase in study time (hours), grade increases by 1.9 points.

* <b>INTERCEPT</b>: when all other IV's are zero, expected grade is around 58

* <b>AGE</b>: for every one year increase in age, grade increases by 0.04 points (when controlling for hours of study, exercise, and gender)

* <b>EXERCISE</b>: for every one hour increase in exercise, grade increases by approximately 1 point (when controlling for age, hours of study, and gender)

* <b>HOURS</b>: for every one hour increase in study time, grade increases by 1.9 points (when controlling for age, hours of exercise, and gender)

If a variable has a negative coefficient, the dependent variable moves in the opposite direction: as the independent variable increases, the dependent variable decreases. For example, if the coefficient for age were <b>-0.0405</b>, you would interpret it in the following way: 

* <b>AGE</b>: for every one year increase in age, grade <u>decreases</u> by 0.04 points. 

#### <b> Interpreting Categorical Variable Coefficients </b>

When working with categorical variables, the model will automatically take one of the categories and use it as a reference (i.e. comparison category). This reference category will not show up in the listed coefficients, and that is how you will be able to identify which category is serving as the reference. In our model, gender is the only categorical variable, and can be interpreted: 

<b>GENDER</b>:
* Female is the reference category
* On average, Male students have a grade 0.45 points lower than Female students (when controlling for age, hours of study, and exercise) 

The grade is lower because the coefficient is negative in this model. If we have more than 2 categories for our categorical variable, we would still compare each level to the reference. For example, if we had a group of students who chose not to disclose their gender, the 'undisclosed' group would show up in our model with its own coefficient and we would compare those results to the Female category. 

#### Standard Error (std err)

The standard error reflects how precisely each coefficient has been estimated. The lower the value relative to the size of the coefficient, the more precise the estimate.
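
The standard error is also what produces the last two columns of the coefficient table, [0.025 and 0.975], which give a 95% range for each coefficient. As a rough sketch, that range is the coefficient plus or minus about 1.96 standard errors. The 1.96 multiplier is a large-sample approximation assumed here; the library uses a slightly more exact value, so the last digit can differ.

```python
# (coefficient, standard error) pairs copied from the first model's summary table
estimates = {
    "hours": (1.9173, 0.031),
    "age": (0.0405, 0.075),
}

intervals = {}
for name, (coef, std_err) in estimates.items():
    margin = 1.96 * std_err                    # approximate 95% margin of error
    intervals[name] = (coef - margin, coef + margin)
    low, high = intervals[name]
    print(f"{name}: {low:.3f} to {high:.3f}")
```

The range for hours stays well above zero, while the range for age includes zero, which is another way of seeing that the age coefficient could plausibly be no effect at all.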

***
### Statistical Significance 
***
#### p-value (P>|t|)
When trying to determine if the results we received are statistically significant, we need to consider the p-value. The p-value is the probability of seeing a relationship at least as strong as the one in our data solely by chance, if there were truly no relationship between the variables. 

Because the p-value represents how easily chance alone could produce our findings, we always want this value to be small. If the p-value was .5 (50%), that would mean that results like ours would show up about 50% of the time even if the variable had no real effect. The p-value reflects how confident we can be in our results and requires strict interpretation. A commonly used cut-off is 0.05 or below (a 5% chance of seeing results like these when there is no real relationship). 
* If the p-value is less than or equal to 0.05, we can deem our results to be statistically significant. 
* If the p-value is greater than 0.05, our results are not statistically significant. 
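
The p-values in the summary table come from the t column. As a sketch of that connection, the code below converts three of the reported t values into two-sided p-values using the t distribution from `scipy.stats`, with the residual degrees of freedom (Df Residuals) from the summary. Small differences in the last digit are possible because the printed t values are rounded.

```python
from scipy import stats

df_resid = 1995  # Df Residuals from the first model's summary

# t values copied from the coefficient table
t_values = {"C(gender)[T.male]": -1.773, "age": 0.543, "exercise": 11.073}

p_values = {}
for name, t in t_values.items():
    # Two-sided: probability of a t value at least this far from 0 in either direction
    p_values[name] = 2 * stats.t.sf(abs(t), df_resid)
    verdict = "significant" if p_values[name] <= 0.05 else "not significant"
    print(f"{name}: p = {p_values[name]:.3f} ({verdict})")
```

Gender and age come out above the 0.05 cut-off and exercise far below it, in line with the P>|t| column.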

***
### What do I do with non-significant variables?
***
Regression analyses are rarely a "one and done" situation. After you run your analyses, you should tweak your model (equation) and see if you can improve the model fit. A good place to start is by removing non-significant variables from our model. In this example, gender and age are not significant - let's try a model without them!

In [12]:
## Removing non-significant variables 

## "~" this sign is called a tilde

result2 = sm.ols('grade ~ hours + exercise', data = df).fit()

result2.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.664
Model:,OLS,Adj. R-squared:,0.664
Method:,Least Squares,F-statistic:,1973.0
Date:,"Tue, 02 May 2023",Prob (F-statistic):,0.0
Time:,01:21:00,Log-Likelihood:,-6300.8
No. Observations:,2000,AIC:,12610.0
Df Residuals:,1997,BIC:,12620.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,58.5316,0.447,130.828,0.000,57.654,59.409
hours,1.9162,0.031,61.575,0.000,1.855,1.977
exercise,0.9892,0.089,11.131,0.000,0.815,1.163

0,1,2,3
Omnibus:,318.721,Durbin-Watson:,2.048
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2158.0
Skew:,-0.564,Prob(JB):,0.0
Kurtosis:,7.962,Cond. No.,43.2


# <font color=DODGERBLUE>Comparing Models and Making Predictions</font>

When comparing two models, the first thing you want to check is if there are any changes in the adjusted R-squared. In this example, the adjusted R-squared has not changed at all between the two models. The next item you want to check is the p-value of the remaining variables in the model. Did removing variables increase/decrease the p-values for the remaining variables?

You do not need to continue running your model until all your variables are significant. You should focus on maximizing your adj r-square, regardless of the significance of your variables. 

### What can I do with this information?

Now that you have a better understanding of how your variables interact, you are in a better place to describe your data. You now know that an increase of just one hour of studying is associated, on average, with a significant increase in a student's final grade. You can also make predictions about future grades...

### Making Predictions based on the Regression Results

Recall that a simple linear regression model is mathematically represented by the formula of a line: <b> y = mx + b</b>. When you are creating a multiple linear regression equation, the formula is similar, with some added terms: <b> y = b + (m1 x X1) + (m2 x X2) + (m3 x X3) + ....etc. </b> where each "m" is one of the coefficients in your model, each "X" is the value of the matching independent variable, and "b" represents the intercept. Once you have your regression output, you can plug these values in to make predictions on your dependent variable given specific values for your independent variables.

    grade(y) = intercept(b) + [hours of study coef(m1) x hours of study(x1)] + [hours of exercise coef(m2) x hours exercise(x2)]

    grade(y) = 58.5316 + [1.9162 x hours of study(x1)] + [0.9892 x hours exercise(x2)]

### What grade can we expect from a student that studied 8 hours and exercised 4.3 hours?

    grade(y) = 58.5316 + [1.9162 x hours of study(8)] + [0.9892 x hours exercise(4.3)]

    grade(y) = 58.5316 + [1.9162 x (8)] + [0.9892 x (4.3)]

    grade(y) = 58.5316 + [15.3296] + [4.25356]

    grade = 78.11476
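
The same arithmetic can be wrapped in a small function so that other combinations of study and exercise hours are quick to try. This is a sketch: the coefficients are typed in from the second model's summary table, and `predict_grade` is a name chosen for this example.

```python
# Coefficients copied from the second model's summary table
INTERCEPT = 58.5316
COEF_HOURS = 1.9162
COEF_EXERCISE = 0.9892


def predict_grade(hours, exercise):
    """Plug study and exercise hours into the regression equation."""
    return INTERCEPT + COEF_HOURS * hours + COEF_EXERCISE * exercise


print(round(predict_grade(hours=8, exercise=4.3), 5))
```

Calling the function with 8 hours of study and 4.3 hours of exercise repeats the hand calculation above.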

## <font color=SALMON>Ta Da! You can now predict the future!</font>

<LEFT><img src='https://i0.wp.com/www.learning-mind.com/wp-content/uploads/2019/10/psychic-spiritual-energy.jpg?resize=768%2C512&ssl=1'></LEFT>

## Making Predictions: the simple way!

What grade can we expect from a 16 year old female student that studied 8 hours and exercised 5.7 hours? 

In [13]:
## We can use the predict function (from the statsmodels library) to predict the outcome given specific input
## call predict on the fitted model whose coefficients you want to use!

# model_name.predict({'variable1_name':value1, 'variable2_name':value2, ...})

result.predict({
    'hours': 8, 
    'age': 16, 
    'exercise': 5.7, 
    'gender': "female"})

0    79.683606
dtype: float64

In [14]:
## What about another scenario?

result.predict({
    'hours': 14, 
    'age': 18, 
    'exercise': 12, 
    'gender': "male"})

0    97.020102
dtype: float64
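
To see what the predict function is doing with the categorical variable, the two predictions can be rebuilt by hand from the first model's coefficient table. In this sketch, gender becomes a 0/1 indicator: 0 for female (the reference category) and 1 for male. The coefficients are the rounded values printed in the summary, so the results agree with the output above only to about two decimal places.

```python
# Coefficients copied from the first model's summary table
coefs = {
    "Intercept": 58.0874,
    "C(gender)[T.male]": -0.4485,
    "hours": 1.9173,
    "age": 0.0405,
    "exercise": 0.9841,
}


def predict_by_hand(hours, age, exercise, gender):
    """Apply the regression equation, coding gender as a 0/1 indicator."""
    is_male = 1 if gender == "male" else 0   # female is the reference category
    return (
        coefs["Intercept"]
        + coefs["C(gender)[T.male]"] * is_male
        + coefs["hours"] * hours
        + coefs["age"] * age
        + coefs["exercise"] * exercise
    )


first = predict_by_hand(hours=8, age=16, exercise=5.7, gender="female")
second = predict_by_hand(hours=14, age=18, exercise=12, gender="male")
print(round(first, 2), round(second, 2))
```

For the female student the gender term contributes nothing, while for the male student it subtracts about 0.45 points.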

# <font color=DODGERBLUE>Module 11 Exercises</font>

#### 1. Complete the code below to import the four libraries we've used most commonly.

In [15]:
dfbabies = pd.read_csv("babies.csv")
df

Unnamed: 0,fname,lname,gender,age,exercise,level_of_fit,hours,level_of_study,grade,home_state
0,Marcia,Pugh,female,17,3,low,10,moderate,82.4,NJ
1,Kadeem,Morrison,male,18,4,low,4,low,78.2,MA
2,Nash,Powell,male,18,5,low,9,moderate,79.3,OH
3,Noelani,Wagner,female,14,2,high,7,moderate,83.2,FL
4,Noelani,Cherry,female,18,4,low,15,high,87.4,OH
...,...,...,...,...,...,...,...,...,...,...
1995,Cody,Shepherd,male,19,1,high,8,moderate,80.1,VA
1996,Geraldine,Peterson,female,16,4,low,18,high,100.0,NY
1997,Mercedes,Leon,female,18,3,low,14,high,84.9,UT
1998,Lucius,Rowland,male,16,1,high,7,moderate,69.1,MT


#### 2. Import the "babies.csv" file and name it dfbabies. 

<b>Background Info</b>

    The Child Health and Development Studies considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The goal is to model the weight of the infants (bwt, in ounces) using variables including length of pregnancy in days (gestation), mother's age in years (age), mother's height in inches (height), whether the child was the first born (parity), mother's pregnancy weight in pounds (weight), and whether the mother was a smoker (smoke).

<b>Variables</b>

    case - id number
    bwt - birthweight, in ounces
    gestation - length of gestation, in days
    parity - binary indicator for a first pregnancy (0=first pregnancy)
    age - mother's age in years
    height - mother's height in inches
    weight - mother's weight in pounds
    smoke - binary indicator for whether the mother smokes

#### 3. Check the shape of the dataset. How many columns and rows are there?

In [16]:
dfbabies.shape

(1236, 8)

In [17]:
dfbabies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1236 entries, 0 to 1235
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   case       1236 non-null   int64  
 1   bwt        1236 non-null   int64  
 2   gestation  1223 non-null   float64
 3   parity     1236 non-null   int64  
 4   age        1234 non-null   float64
 5   height     1214 non-null   float64
 6   weight     1200 non-null   float64
 7   smoke      1226 non-null   float64
dtypes: float64(5), int64(3)
memory usage: 77.4 KB


In [18]:
dfbabies.describe()

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
count,1236.0,1236.0,1223.0,1236.0,1234.0,1214.0,1200.0,1226.0
mean,618.5,119.576861,279.338512,0.254854,27.255267,64.047776,128.625833,0.39478
std,356.946775,18.236452,16.027693,0.435956,5.781405,2.533409,20.971862,0.489003
min,1.0,55.0,148.0,0.0,15.0,53.0,87.0,0.0
25%,309.75,108.75,272.0,0.0,23.0,62.0,114.75,0.0
50%,618.5,120.0,280.0,0.0,26.0,64.0,125.0,0.0
75%,927.25,131.0,288.0,1.0,31.0,66.0,139.0,1.0
max,1236.0,176.0,353.0,1.0,45.0,72.0,250.0,1.0


#### 4. Check the first 10 rows and the last 10 rows. Drop the column "case". 

In [19]:
dfbabies.head(10)

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,0.0
1,2,113,282.0,0,33.0,64.0,135.0,0.0
2,3,128,279.0,0,28.0,64.0,115.0,1.0
3,4,123,,0,36.0,69.0,190.0,0.0
4,5,108,282.0,0,23.0,67.0,125.0,1.0
5,6,136,286.0,0,25.0,62.0,93.0,0.0
6,7,138,244.0,0,33.0,62.0,178.0,0.0
7,8,132,245.0,0,23.0,65.0,140.0,0.0
8,9,120,289.0,0,25.0,62.0,125.0,0.0
9,10,143,299.0,0,30.0,66.0,136.0,1.0


In [20]:
dfbabies.tail(10)

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
1226,1227,109,244.0,1,21.0,63.0,102.0,1.0
1227,1228,103,278.0,0,30.0,60.0,87.0,1.0
1228,1229,118,276.0,0,34.0,64.0,116.0,0.0
1229,1230,127,290.0,0,27.0,65.0,121.0,0.0
1230,1231,132,270.0,0,27.0,65.0,126.0,0.0
1231,1232,113,275.0,1,27.0,60.0,100.0,0.0
1232,1233,128,265.0,0,24.0,67.0,120.0,0.0
1233,1234,130,291.0,0,30.0,65.0,150.0,1.0
1234,1235,125,281.0,1,21.0,65.0,110.0,0.0
1235,1236,117,297.0,0,38.0,65.0,129.0,0.0


In [21]:
dfbabies.drop(columns = ["case"], inplace=True)
dfbabies

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,0,27.0,62.0,100.0,0.0
1,113,282.0,0,33.0,64.0,135.0,0.0
2,128,279.0,0,28.0,64.0,115.0,1.0
3,123,,0,36.0,69.0,190.0,0.0
4,108,282.0,0,23.0,67.0,125.0,1.0
...,...,...,...,...,...,...,...
1231,113,275.0,1,27.0,60.0,100.0,0.0
1232,128,265.0,0,24.0,67.0,120.0,0.0
1233,130,291.0,0,30.0,65.0,150.0,1.0
1234,125,281.0,1,21.0,65.0,110.0,0.0


#### 5. Is there any missing data? Check!

In [22]:
dfbabies.isnull().sum()

bwt           0
gestation    13
parity        0
age           2
height       22
weight       36
smoke        10
dtype: int64

#### 6. The amount of missing data is small considering the size of our dataset. Drop all rows in the dataset that have missing data.

In [23]:
dfbabies.dropna(inplace=True)

In [24]:
dfbabies

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,0,27.0,62.0,100.0,0.0
1,113,282.0,0,33.0,64.0,135.0,0.0
2,128,279.0,0,28.0,64.0,115.0,1.0
4,108,282.0,0,23.0,67.0,125.0,1.0
5,136,286.0,0,25.0,62.0,93.0,0.0
...,...,...,...,...,...,...,...
1231,113,275.0,1,27.0,60.0,100.0,0.0
1232,128,265.0,0,24.0,67.0,120.0,0.0
1233,130,291.0,0,30.0,65.0,150.0,1.0
1234,125,281.0,1,21.0,65.0,110.0,0.0


#### 7. Check each of your numeric columns for outliers - pick one method and use it for all the columns. 

In [25]:
dfz = dfbabies.copy()

In [26]:
print(dfz.shape)

(1174, 7)


#### 8. Print the descriptive statistics for each numeric column. What is the average age of the mothers? What is the average gestation period?

In [27]:
dfz["zscore_bwt"] = np.abs(stats.zscore(dfz["bwt"]))

In [28]:
dfz.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt
0,120,284.0,0,27.0,62.0,100.0,0.0,0.029337
1,113,282.0,0,33.0,64.0,135.0,0.0,0.352741
2,128,279.0,0,28.0,64.0,115.0,1.0,0.465998
4,108,282.0,0,23.0,67.0,125.0,1.0,0.625654
5,136,286.0,0,25.0,62.0,93.0,0.0,0.902658


In [29]:
z_outliers = dfz.loc[dfz["zscore_bwt"] > 3].index

In [30]:
print(z_outliers)

Int64Index([632, 829, 912, 978, 1139], dtype='int64')


In [31]:
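# z_outliers holds index labels, so dfz.loc[z_outliers] is the label-based way to inspect them.
# .iloc selects by position, which is why the rows returned below are not the flagged outliers.
dfz.iloc[[632, 829, 912, 978, 1139]]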

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt
676,100,275.0,0,26.0,60.0,115.0,0.0,1.062315
881,128,272.0,1,18.0,67.0,109.0,0.0,0.465998
967,119,273.0,0,35.0,65.0,125.0,1.0,0.025246
1035,65,237.0,0,31.0,67.0,130.0,0.0,2.972705
1200,97,255.0,1,22.0,63.0,107.0,1.0,1.226062
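
Once rows have been dropped, index labels and row positions no longer line up, so `.loc` and `.iloc` can return different rows for the same number. A minimal sketch of the difference, using a made-up `toy` frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame; dropping a row leaves a gap in the index labels
toy = pd.DataFrame({"bwt": [120, 113, 128, 65]}, index=[0, 1, 2, 3])
toy = toy.drop(index=1)          # remaining labels: 0, 2, 3

by_label = toy.loc[[3]]          # the row whose index label is 3
by_position = toy.iloc[[1]]      # the second row by position, which has label 2

print(by_label)
print(by_position)
```

Because `z_outliers` is built from `.index`, it contains labels, and `.loc` is the accessor that matches it.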


In [32]:
dfz = dfz.drop(z_outliers)

In [33]:
print(dfz.shape)

(1169, 8)


In [34]:
dfz["zscore_gestation"] = np.abs(stats.zscore(dfz["gestation"]))
dfz.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt,zscore_gestation
0,120,284.0,0,27.0,62.0,100.0,0.0,0.029337,0.301151
1,113,282.0,0,33.0,64.0,135.0,0.0,0.352741,0.17399
2,128,279.0,0,28.0,64.0,115.0,1.0,0.465998,0.016752
4,108,282.0,0,23.0,67.0,125.0,1.0,0.625654,0.17399
5,136,286.0,0,25.0,62.0,93.0,0.0,0.902658,0.428312


In [35]:
z_outliers = dfz.loc[dfz["zscore_gestation"] > 3].index

In [36]:
print(z_outliers)

Int64Index([  10,   59,  129,  240,  253,  260,  462,  500,  710,  761,  833,
             869,  969, 1065, 1152, 1172, 1199],
           dtype='int64')


In [37]:
dfz=dfz.drop(z_outliers)

In [38]:
print(dfz.shape)

(1152, 9)


In [39]:
dfz["zscore_parity"]=np.abs(stats.zscore(dfz["parity"]))
dfz.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt,zscore_gestation,zscore_parity
0,120,284.0,0,27.0,62.0,100.0,0.0,0.029337,0.301151,0.601417
1,113,282.0,0,33.0,64.0,135.0,0.0,0.352741,0.17399,0.601417
2,128,279.0,0,28.0,64.0,115.0,1.0,0.465998,0.016752,0.601417
4,108,282.0,0,23.0,67.0,125.0,1.0,0.625654,0.17399,0.601417
5,136,286.0,0,25.0,62.0,93.0,0.0,0.902658,0.428312,0.601417


In [40]:
z_outliers=dfz.loc[dfz["zscore_parity"]>3].index

In [41]:
print(z_outliers) #there are no outliers for the variable/column "parity"; therefore, no need to drop outliers.

Int64Index([], dtype='int64')


In [42]:
dfz["zscore_age"]=np.abs(stats.zscore(dfz["age"]))
dfz.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt,zscore_gestation,zscore_parity,zscore_age
0,120,284.0,0,27.0,62.0,100.0,0.0,0.029337,0.301151,0.601417,0.041566
1,113,282.0,0,33.0,64.0,135.0,0.0,0.352741,0.17399,0.601417,0.988196
2,128,279.0,0,28.0,64.0,115.0,1.0,0.465998,0.016752,0.601417,0.130061
4,108,282.0,0,23.0,67.0,125.0,1.0,0.625654,0.17399,0.601417,0.728074
5,136,286.0,0,25.0,62.0,93.0,0.0,0.902658,0.428312,0.601417,0.38482


In [43]:
z_outliers=dfz.loc[dfz["zscore_age"]>3].index

In [44]:
print(z_outliers)

Int64Index([1070], dtype='int64')


In [45]:
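# 1070 is an index label, so dfz.loc[[1070]] would show the flagged row.
# .iloc selects by position, so the row returned below (label 1147) is not the outlier.
dfz.iloc[[1070]]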

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt,zscore_gestation,zscore_parity,zscore_age
1147,156,300.0,0,27.0,65.0,120.0,1.0,1.99431,1.318439,0.601417,0.041566


In [46]:
dfz=dfz.drop(z_outliers)

In [47]:
dfz.shape

(1151, 11)

In [48]:
dfz["zscore_height"]=np.abs(stats.zscore(dfz["height"]))
dfz.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt,zscore_gestation,zscore_parity,zscore_age,zscore_height
0,120,284.0,0,27.0,62.0,100.0,0.0,0.029337,0.301151,0.601417,0.041566,0.812477
1,113,282.0,0,33.0,64.0,135.0,0.0,0.352741,0.17399,0.601417,0.988196,0.019632
2,128,279.0,0,28.0,64.0,115.0,1.0,0.465998,0.016752,0.601417,0.130061,0.019632
4,108,282.0,0,23.0,67.0,125.0,1.0,0.625654,0.17399,0.601417,0.728074,1.169636
5,136,286.0,0,25.0,62.0,93.0,0.0,0.902658,0.428312,0.601417,0.38482,0.812477


In [49]:
z_outliers=dfz.loc[dfz["zscore_height"]>3].index

In [50]:
print(z_outliers)

Int64Index([20, 175, 434, 1208], dtype='int64')


In [51]:
dfz.loc[[20, 175, 434, 1208]]

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt,zscore_gestation,zscore_parity,zscore_age,zscore_height
20,105,270.0,0,22.0,56.0,93.0,0.0,0.789402,0.588976,0.601417,0.899701,3.191011
175,122,278.0,0,31.0,72.0,155.0,1.0,0.138502,0.080332,0.601417,0.644942,3.151748
434,146,263.0,0,39.0,53.0,110.0,1.0,1.448484,1.03404,0.601417,2.017958,4.380279
1208,141,281.0,0,29.0,54.0,156.0,1.0,1.175571,0.110409,0.601417,0.301688,3.983856


In [52]:
dfz=dfz.drop(z_outliers)

In [53]:
dfz.shape

(1147, 12)

In [54]:
dfz["zscore_weight"]=np.abs(stats.zscore(dfz["weight"]))
dfz.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt,zscore_gestation,zscore_parity,zscore_age,zscore_height,zscore_weight
0,120,284.0,0,27.0,62.0,100.0,0.0,0.029337,0.301151,0.601417,0.041566,0.812477,1.375828
1,113,282.0,0,33.0,64.0,135.0,0.0,0.352741,0.17399,0.601417,0.988196,0.019632,0.320714
2,128,279.0,0,28.0,64.0,115.0,1.0,0.465998,0.016752,0.601417,0.130061,0.019632,0.648739
4,108,282.0,0,23.0,67.0,125.0,1.0,0.625654,0.17399,0.601417,0.728074,1.169636,0.164012
5,136,286.0,0,25.0,62.0,93.0,0.0,0.902658,0.428312,0.601417,0.38482,0.812477,1.715136


In [55]:
z_outliers=dfz.loc[dfz["zscore_weight"]>3].index

In [56]:
print(z_outliers)

Int64Index([88, 117, 149, 181, 183, 426, 522, 563, 608, 622, 723, 849, 858,
            865, 924, 1148],
           dtype='int64')


In [57]:
dfz.loc[[88, 117, 149, 181, 183, 426, 522, 563, 608, 622, 723, 849, 858,865, 924, 1148]]

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt,zscore_gestation,zscore_parity,zscore_age,zscore_height,zscore_weight
88,125,305.0,0,22.0,70.0,196.0,1.0,0.30225,1.636341,0.601417,0.899701,2.358903,3.277544
117,131,283.0,0,25.0,67.0,215.0,0.0,0.629745,0.23757,0.601417,0.38482,1.169636,4.198524
149,126,282.0,0,38.0,66.0,250.0,0.0,0.356832,0.17399,0.601417,1.846331,0.773213,5.895065
181,113,277.0,0,23.0,65.0,192.0,1.0,0.352741,0.143913,0.601417,0.728074,0.376791,3.083653
183,124,277.0,0,29.0,63.0,220.0,0.0,0.247667,0.143913,0.601417,0.301688,0.416054,4.440887
426,140,251.0,0,28.0,63.0,210.0,0.0,1.120989,1.797006,0.601417,0.130061,0.416054,3.95616
522,132,282.0,0,28.0,67.0,200.0,1.0,0.684328,0.17399,0.601417,0.130061,1.169636,3.471434
563,105,260.0,0,23.0,64.0,197.0,0.0,0.789402,1.224781,0.601417,0.728074,0.019632,3.326016
608,115,273.0,1,23.0,67.0,215.0,1.0,0.243576,0.398235,1.66274,0.728074,1.169636,4.198524
622,91,248.0,0,33.0,63.0,202.0,0.0,1.553558,1.987747,0.601417,0.988196,0.416054,3.56838


In [58]:
dfz=dfz.drop(z_outliers)

In [59]:
dfz.shape

(1131, 13)

In [60]:
dfz["zscore_smoke"]=np.abs(stats.zscore(dfz["smoke"]))
dfz.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt,zscore_gestation,zscore_parity,zscore_age,zscore_height,zscore_weight,zscore_smoke
0,120,284.0,0,27.0,62.0,100.0,0.0,0.029337,0.301151,0.601417,0.041566,0.812477,1.375828,0.799456
1,113,282.0,0,33.0,64.0,135.0,0.0,0.352741,0.17399,0.601417,0.988196,0.019632,0.320714,0.799456
2,128,279.0,0,28.0,64.0,115.0,1.0,0.465998,0.016752,0.601417,0.130061,0.019632,0.648739,1.25085
4,108,282.0,0,23.0,67.0,125.0,1.0,0.625654,0.17399,0.601417,0.728074,1.169636,0.164012,1.25085
5,136,286.0,0,25.0,62.0,93.0,0.0,0.902658,0.428312,0.601417,0.38482,0.812477,1.715136,0.799456


In [61]:
z_outliers=dfz.loc[dfz["zscore_smoke"]>3].index

In [62]:
print(z_outliers) #there are no outliers on the variable/column "smoke"

Int64Index([], dtype='int64')


In [63]:
dfz.shape

(1131, 14)
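
The column-by-column steps above repeat the same pattern, so they can be folded into a loop. The sketch below is one possible way to do that; `drop_z_outliers` is a hypothetical helper, not part of pandas or SciPy, and the `toy` data is made up. Like the cells above, it recomputes z-scores on the rows that remain after each column is processed.

```python
import numpy as np
import pandas as pd
from scipy import stats

def drop_z_outliers(df, columns, threshold=3.0):
    """Hypothetical helper: for each column in turn, drop rows with |z| > threshold."""
    out = df.copy()
    for col in columns:
        # z-scores are computed on the rows still present at this step
        z = np.abs(stats.zscore(out[col].to_numpy()))
        out = out.loc[z <= threshold]
    return out

# Made-up data: one extreme birthweight at index label 10
bwt = [120 + (i % 7) for i in range(50)]
bwt[10] = 300
toy = pd.DataFrame({"bwt": bwt, "height": [60 + (i % 8) for i in range(50)]})

cleaned = drop_z_outliers(toy, ["bwt", "height"])
print(toy.shape, cleaned.shape)
```

This version does not add `zscore_*` columns, which keeps them out of later correlation matrices.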

#### 9. Let's model birthweight based on the characteristics of the mother. But first... 

We want to easily distinguish between the numeric and categorical variables. Replace the values 0/1 in the "parity" and "smoke" columns with meaningful labels (e.g. smokes, doesn't smoke).

In [64]:
#parity - binary indicator for a first pregnancy (0=first pregnancy)

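# assign the result back instead of calling replace(..., inplace=True) on a column selection,
# which newer pandas versions may not apply to the original DataFrame
dfz["parity"] = dfz["parity"].replace([0,1],["first pregnancy","second pregnancy"])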
dfz

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt,zscore_gestation,zscore_parity,zscore_age,zscore_height,zscore_weight,zscore_smoke
0,120,284.0,first pregnancy,27.0,62.0,100.0,0.0,0.029337,0.301151,0.601417,0.041566,0.812477,1.375828,0.799456
1,113,282.0,first pregnancy,33.0,64.0,135.0,0.0,0.352741,0.173990,0.601417,0.988196,0.019632,0.320714,0.799456
2,128,279.0,first pregnancy,28.0,64.0,115.0,1.0,0.465998,0.016752,0.601417,0.130061,0.019632,0.648739,1.250850
4,108,282.0,first pregnancy,23.0,67.0,125.0,1.0,0.625654,0.173990,0.601417,0.728074,1.169636,0.164012,1.250850
5,136,286.0,first pregnancy,25.0,62.0,93.0,0.0,0.902658,0.428312,0.601417,0.384820,0.812477,1.715136,0.799456
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1231,113,275.0,second pregnancy,27.0,60.0,100.0,0.0,0.352741,0.271074,1.662740,0.041566,1.605321,1.375828,0.799456
1232,128,265.0,first pregnancy,24.0,67.0,120.0,0.0,0.465998,0.906879,0.601417,0.556447,1.169636,0.406376,0.799456
1233,130,291.0,first pregnancy,30.0,65.0,150.0,1.0,0.575163,0.746214,0.601417,0.473315,0.376791,1.047803,1.250850
1234,125,281.0,second pregnancy,21.0,65.0,110.0,0.0,0.302250,0.110409,1.662740,1.071328,0.376791,0.891102,0.799456


In [65]:
# assign the result back instead of calling replace(..., inplace=True) on a column selection
dfz["smoke"] = dfz["smoke"].replace([0,1],["non-smoker","smoker"])
dfz

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt,zscore_gestation,zscore_parity,zscore_age,zscore_height,zscore_weight,zscore_smoke
0,120,284.0,first pregnancy,27.0,62.0,100.0,non-smoker,0.029337,0.301151,0.601417,0.041566,0.812477,1.375828,0.799456
1,113,282.0,first pregnancy,33.0,64.0,135.0,non-smoker,0.352741,0.173990,0.601417,0.988196,0.019632,0.320714,0.799456
2,128,279.0,first pregnancy,28.0,64.0,115.0,smoker,0.465998,0.016752,0.601417,0.130061,0.019632,0.648739,1.250850
4,108,282.0,first pregnancy,23.0,67.0,125.0,smoker,0.625654,0.173990,0.601417,0.728074,1.169636,0.164012,1.250850
5,136,286.0,first pregnancy,25.0,62.0,93.0,non-smoker,0.902658,0.428312,0.601417,0.384820,0.812477,1.715136,0.799456
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1231,113,275.0,second pregnancy,27.0,60.0,100.0,non-smoker,0.352741,0.271074,1.662740,0.041566,1.605321,1.375828,0.799456
1232,128,265.0,first pregnancy,24.0,67.0,120.0,non-smoker,0.465998,0.906879,0.601417,0.556447,1.169636,0.406376,0.799456
1233,130,291.0,first pregnancy,30.0,65.0,150.0,smoker,0.575163,0.746214,0.601417,0.473315,0.376791,1.047803,1.250850
1234,125,281.0,second pregnancy,21.0,65.0,110.0,non-smoker,0.302250,0.110409,1.662740,1.071328,0.376791,0.891102,0.799456


#### 10. Run a correlation matrix with your dataset. Which variables are correlated with birthweight? 

Describe the strength of the correlation between all the numeric variables and birthweight. 

In [66]:
dfz.corr()

Unnamed: 0,bwt,gestation,age,height,weight,zscore_bwt,zscore_gestation,zscore_parity,zscore_age,zscore_height,zscore_weight,zscore_smoke
bwt,1.0,0.396133,0.028941,0.209931,0.163018,0.012658,-0.127979,-0.052664,-0.024005,0.002728,-0.058886,-0.255522
gestation,0.396133,1.0,-0.051545,0.066428,0.038533,-0.108397,-0.09216,0.084452,-0.025401,-0.005903,0.006369,-0.085732
age,0.028941,-0.051545,1.0,0.001643,0.162601,0.095982,0.020402,-0.359798,0.320417,0.037728,0.023386,-0.065449
height,0.209931,0.066428,0.001643,1.0,0.462335,0.064616,-0.021209,0.046251,0.020245,-0.005094,-0.131328,0.015987
weight,0.163018,0.038533,0.162601,0.462335,1.0,0.04117,0.021912,-0.091345,0.114725,0.005696,0.227078,-0.062083
zscore_bwt,0.012658,-0.108397,0.095982,0.064616,0.04117,1.0,0.191628,-0.043767,0.040179,0.03858,0.045744,0.06386
zscore_gestation,-0.127979,-0.09216,0.020402,-0.021209,0.021912,0.191628,1.0,0.032012,0.047037,-0.022182,0.008193,-0.017967
zscore_parity,-0.052664,0.084452,-0.359798,0.046251,-0.091345,-0.043767,0.032012,1.0,0.055211,0.002288,-0.102755,-0.008784
zscore_age,-0.024005,-0.025401,0.320417,0.020245,0.114725,0.040179,0.047037,0.055211,1.0,-0.022641,0.009806,-0.005522
zscore_height,0.002728,-0.005903,0.037728,-0.005094,0.005696,0.03858,-0.022182,0.002288,-0.022641,1.0,0.175102,-0.005776


In [67]:
dfbabies.corr()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
bwt,1.0,0.407543,-0.043908,0.026983,0.203704,0.155923,-0.2468
gestation,0.407543,1.0,0.080916,-0.053425,0.07047,0.023655,-0.060267
parity,-0.043908,0.080916,1.0,-0.351041,0.043543,-0.096362,-0.009599
age,0.026983,-0.053425,-0.351041,1.0,-0.006453,0.147322,-0.067772
height,0.203704,0.07047,0.043543,-0.006453,1.0,0.435287,0.017507
weight,0.155923,0.023655,-0.096362,0.147322,0.435287,1.0,-0.060281
smoke,-0.2468,-0.060267,-0.009599,-0.067772,0.017507,-0.060281,1.0


In [68]:
#bwt - birthweight, in ounces
#gestation - length of gestation, in days
#parity - binary indicator for a first pregnancy (0=first pregnancy)
#age - mother's age in years
#height - mother's height in inches
#weight - mother's weight in pounds
#smoke - binary indicator for whether the mother smokes

##bwt and gestation have a moderate positive correlation (r = 0.41): longer gestation goes with higher birthweight
##bwt and parity have a very weak negative correlation (r = -0.04): essentially no linear relationship
##bwt and age have almost no correlation (r = 0.03)
##bwt and height have a weak positive correlation (r = 0.20): taller mothers tend to have slightly heavier babies
##bwt and weight have a weak positive correlation (r = 0.16)
##bwt and smoke have a weak negative correlation (r = -0.25): babies of mothers who smoke tend to weigh less
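
To describe strength consistently, the correlations with `bwt` can be ranked by absolute value and labelled. The sketch below uses the values from the `dfbabies.corr()` output; the cut-offs in `strength` are a rule-of-thumb assumption, not a fixed standard, and the function name is hypothetical.

```python
import pandas as pd

# Correlations with bwt, copied from the dfbabies.corr() output
r_bwt = pd.Series({
    "gestation": 0.407543, "parity": -0.043908, "age": 0.026983,
    "height": 0.203704, "weight": 0.155923, "smoke": -0.2468,
})

def strength(r):
    """Rule-of-thumb labels; the thresholds are an assumption for illustration."""
    a = abs(r)
    if a < 0.1:
        return "negligible"
    if a < 0.3:
        return "weak"
    if a < 0.5:
        return "moderate"
    return "strong"

summary = pd.DataFrame({"r": r_bwt, "strength": r_bwt.map(strength)})
# order rows from the strongest to the weakest relationship, ignoring sign
summary = summary.reindex(r_bwt.abs().sort_values(ascending=False).index)
print(summary)
```

The sign of `r` gives the direction and the absolute value gives the strength, so a value such as -0.25 is weak and negative.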


#### 11. Determine the relationship between birthweight and the categorical variables: parity and smoke. 

Use the groupby function to determine if there are any differences between birthweight and the different groups.  Does it seem like there is a relationship between these variables and birthweight?

In [69]:
dfz["bwt"].groupby(dfz["smoke"]).mean()

smoke
non-smoker    123.407246
smoker        114.140590
Name: bwt, dtype: float64

In [70]:
dfz["bwt"].groupby(dfz["parity"]).mean()

parity
first pregnancy     120.357488
second pregnancy    118.254125
Name: bwt, dtype: float64

In [71]:
dfbabies["bwt"].groupby(dfbabies["smoke"]).mean()

smoke
0.0    123.085315
1.0    113.819172
Name: bwt, dtype: float64

In [72]:
dfbabies["bwt"].groupby(dfbabies["parity"]).mean()

parity
0    119.942263
1    118.113636
Name: bwt, dtype: float64

In [73]:
#looking at the mean bwt for each group of the categorical variables "smoke" and "parity", we can see the following:

#mean birthweight is higher for non-smokers than for smokers (about 123 oz vs. 114 oz), a difference of roughly 9 oz. This suggests a relationship between smoking and birthweight: infants of non-smoking mothers tend to weigh more.
#For "parity", mean birthweights are close (about 120 oz for first pregnancies vs. 118 oz otherwise), so there appears to be little to no relationship between parity and birthweight.
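
Comparing group means shows the size of a difference but not whether it could be due to chance. One common follow-up is a two-sample t-test; the sketch below uses `scipy.stats.ttest_ind` with Welch's correction on made-up birthweights, so the numbers are illustrative only and are not results from this dataset.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: birthweights (oz) for two groups of mothers
toy = pd.DataFrame({
    "bwt": [120, 125, 123, 127, 121, 124, 126, 122,
            112, 115, 113, 117, 111, 114, 116, 110],
    "smoke": ["non-smoker"] * 8 + ["smoker"] * 8,
})

non_smokers = toy.loc[toy["smoke"] == "non-smoker", "bwt"]
smokers = toy.loc[toy["smoke"] == "smoker", "bwt"]

# equal_var=False requests Welch's t-test, which does not assume equal variances
t_stat, p_value = stats.ttest_ind(non_smokers, smokers, equal_var=False)
print(t_stat, p_value)
```

A small p-value would suggest the difference in means is unlikely to be chance alone; it still would not establish that smoking causes the difference.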

#### 12. Let's construct your regression model. Firstly, which variables do you plan to include in your model, and why? 

In the space below, write your justification for why you are including each variable. 

In [74]:
dfz.corr()

Unnamed: 0,bwt,gestation,age,height,weight,zscore_bwt,zscore_gestation,zscore_parity,zscore_age,zscore_height,zscore_weight,zscore_smoke
bwt,1.0,0.396133,0.028941,0.209931,0.163018,0.012658,-0.127979,-0.052664,-0.024005,0.002728,-0.058886,-0.255522
gestation,0.396133,1.0,-0.051545,0.066428,0.038533,-0.108397,-0.09216,0.084452,-0.025401,-0.005903,0.006369,-0.085732
age,0.028941,-0.051545,1.0,0.001643,0.162601,0.095982,0.020402,-0.359798,0.320417,0.037728,0.023386,-0.065449
height,0.209931,0.066428,0.001643,1.0,0.462335,0.064616,-0.021209,0.046251,0.020245,-0.005094,-0.131328,0.015987
weight,0.163018,0.038533,0.162601,0.462335,1.0,0.04117,0.021912,-0.091345,0.114725,0.005696,0.227078,-0.062083
zscore_bwt,0.012658,-0.108397,0.095982,0.064616,0.04117,1.0,0.191628,-0.043767,0.040179,0.03858,0.045744,0.06386
zscore_gestation,-0.127979,-0.09216,0.020402,-0.021209,0.021912,0.191628,1.0,0.032012,0.047037,-0.022182,0.008193,-0.017967
zscore_parity,-0.052664,0.084452,-0.359798,0.046251,-0.091345,-0.043767,0.032012,1.0,0.055211,0.002288,-0.102755,-0.008784
zscore_age,-0.024005,-0.025401,0.320417,0.020245,0.114725,0.040179,0.047037,0.055211,1.0,-0.022641,0.009806,-0.005522
zscore_height,0.002728,-0.005903,0.037728,-0.005094,0.005696,0.03858,-0.022182,0.002288,-0.022641,1.0,0.175102,-0.005776


In [75]:
dfbabies.corr()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
bwt,1.0,0.407543,-0.043908,0.026983,0.203704,0.155923,-0.2468
gestation,0.407543,1.0,0.080916,-0.053425,0.07047,0.023655,-0.060267
parity,-0.043908,0.080916,1.0,-0.351041,0.043543,-0.096362,-0.009599
age,0.026983,-0.053425,-0.351041,1.0,-0.006453,0.147322,-0.067772
height,0.203704,0.07047,0.043543,-0.006453,1.0,0.435287,0.017507
weight,0.155923,0.023655,-0.096362,0.147322,0.435287,1.0,-0.060281
smoke,-0.2468,-0.060267,-0.009599,-0.067772,0.017507,-0.060281,1.0


In [76]:
#For the construction of a regression model, I would examine the following:
#1. bwt & gestation - to do a thorough analysis of the intercept, coefficients, and statistical significance (p-value, P>|t|)
#2. height & weight - to do a thorough analysis of the intercept, coefficients, and statistical significance (p-value, P>|t|)

#The y-intercept, or constant, is the value predicted for the dependent variable when all independent variables are equal to 0. For example, in a model of student grades with an intercept of about 58, the expected grade is approximately 58 when all independent variables equal 0. This interpretation often won't make sense in practice (for example, an age of 0), but it's important to know how to read it.

In [77]:
result = sm.ols("bwt ~ gestation + parity + age + height + weight + smoke", data = dfz).fit()

result.summary()

0,1,2,3
Dep. Variable:,bwt,R-squared:,0.253
Model:,OLS,Adj. R-squared:,0.249
Method:,Least Squares,F-statistic:,63.38
Date:,"Tue, 02 May 2023",Prob (F-statistic):,7.62e-68
Time:,01:21:01,Log-Likelihood:,-4689.2
No. Observations:,1131,AIC:,9392.0
Df Residuals:,1124,BIC:,9428.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-93.4588,15.204,-6.147,0.000,-123.290,-63.628
parity[T.second pregnancy],-3.6869,1.111,-3.319,0.001,-5.866,-1.507
smoke[T.smoker],-8.1507,0.944,-8.637,0.000,-10.002,-6.299
gestation,0.4781,0.034,14.248,0.000,0.412,0.544
age,-0.0246,0.085,-0.288,0.773,-0.192,0.143
height,1.2250,0.211,5.816,0.000,0.812,1.638
weight,0.0469,0.029,1.641,0.101,-0.009,0.103

0,1,2,3
Omnibus:,4.685,Durbin-Watson:,2.081
Prob(Omnibus):,0.096,Jarque-Bera (JB):,5.636
Skew:,0.008,Prob(JB):,0.0597
Kurtosis:,3.345,Cond. No.,10500.0


#### 13. Construct your regression model and print the summary. 

Write out your full interpretation of the regression results. If you are not happy with the results, tweak your model and run it again. 

In [78]:
result = sm.ols("bwt ~ gestation", data = dfz).fit()

result.summary()

0,1,2,3
Dep. Variable:,bwt,R-squared:,0.157
Model:,OLS,Adj. R-squared:,0.156
Method:,Least Squares,F-statistic:,210.1
Date:,"Tue, 02 May 2023",Prob (F-statistic):,8.47e-44
Time:,01:21:01,Log-Likelihood:,-4757.5
No. Observations:,1131,AIC:,9519.0
Df Residuals:,1129,BIC:,9529.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-22.9059,9.856,-2.324,0.020,-42.244,-3.568
gestation,0.5105,0.035,14.496,0.000,0.441,0.580

0,1,2,3
Omnibus:,2.143,Durbin-Watson:,2.044
Prob(Omnibus):,0.342,Jarque-Bera (JB):,2.026
Skew:,0.097,Prob(JB):,0.363
Kurtosis:,3.071,Cond. No.,5710.0
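
The fitted line from this summary can be written as bwt = -22.9059 + 0.5105 × gestation. As a rough sketch of how to read it, the snippet below plugs in a gestation of 280 days; `predict_bwt` is a hypothetical helper, and the coefficients are copied from the summary table rather than taken from the fitted `result` object.

```python
# Coefficients copied from the "bwt ~ gestation" summary
intercept = -22.9059
slope = 0.5105   # ounces of birthweight per additional day of gestation

def predict_bwt(gestation_days):
    """Hypothetical helper: evaluate the fitted line by hand."""
    return intercept + slope * gestation_days

print(round(predict_bwt(280), 2))                     # 120.03
print(round(predict_bwt(290) - predict_bwt(280), 3))  # 5.105
```

Each extra day of gestation is associated with about half an ounce more birthweight, so ten extra days correspond to roughly 5 oz. The R-squared of 0.157 means gestation alone explains about 16% of the variation in birthweight.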


In [79]:
result = sm.ols("height ~ weight", data = dfz).fit()

result.summary()

0,1,2,3
Dep. Variable:,height,R-squared:,0.214
Model:,OLS,Adj. R-squared:,0.213
Method:,Least Squares,F-statistic:,306.9
Date:,"Tue, 02 May 2023",Prob (F-statistic):,5.65e-61
Time:,01:21:01,Log-Likelihood:,-2488.5
No. Observations:,1131,AIC:,4981.0
Df Residuals:,1129,BIC:,4991.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,56.1605,0.455,123.493,0.000,55.268,57.053
weight,0.0620,0.004,17.520,0.000,0.055,0.069

0,1,2,3
Omnibus:,4.464,Durbin-Watson:,1.908
Prob(Omnibus):,0.107,Jarque-Bera (JB):,4.505
Skew:,-0.138,Prob(JB):,0.105
Kurtosis:,2.862,Cond. No.,900.0


#### 14. Create three scenarios (i.e. make up specific values) and predict the birthweight given these factors. Use the information in the model summary to make these predictions. 

In [80]:
dfz

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke,zscore_bwt,zscore_gestation,zscore_parity,zscore_age,zscore_height,zscore_weight,zscore_smoke
0,120,284.0,first pregnancy,27.0,62.0,100.0,non-smoker,0.029337,0.301151,0.601417,0.041566,0.812477,1.375828,0.799456
1,113,282.0,first pregnancy,33.0,64.0,135.0,non-smoker,0.352741,0.173990,0.601417,0.988196,0.019632,0.320714,0.799456
2,128,279.0,first pregnancy,28.0,64.0,115.0,smoker,0.465998,0.016752,0.601417,0.130061,0.019632,0.648739,1.250850
4,108,282.0,first pregnancy,23.0,67.0,125.0,smoker,0.625654,0.173990,0.601417,0.728074,1.169636,0.164012,1.250850
5,136,286.0,first pregnancy,25.0,62.0,93.0,non-smoker,0.902658,0.428312,0.601417,0.384820,0.812477,1.715136,0.799456
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1231,113,275.0,second pregnancy,27.0,60.0,100.0,non-smoker,0.352741,0.271074,1.662740,0.041566,1.605321,1.375828,0.799456
1232,128,265.0,first pregnancy,24.0,67.0,120.0,non-smoker,0.465998,0.906879,0.601417,0.556447,1.169636,0.406376,0.799456
1233,130,291.0,first pregnancy,30.0,65.0,150.0,smoker,0.575163,0.746214,0.601417,0.473315,0.376791,1.047803,1.250850
1234,125,281.0,second pregnancy,21.0,65.0,110.0,non-smoker,0.302250,0.110409,1.662740,1.071328,0.376791,0.891102,0.799456


In [81]:
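#scenario 1 - note: `result` is the last model fitted above (height ~ weight), so only "weight" is used
#and the value returned is the mother's predicted height in inches, not the infant's birthweight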
result.predict({"gestation" : 265, "parity" : "first pregnancy", "age" : 21, "height" : 66, "weight" : 112, "smoke" : "non-smoker"})

0    63.100557
dtype: float64

In [82]:
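#scenario 2 - `result` is still the height ~ weight model, so this also returns a predicted height in inches
#the gestation, parity, and age values below do not match this dataset's coding (gestation is in days,
#parity is a category, age is numeric); they are ignored here because the model only uses "weight"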
result.predict({ 
    'gestation': 16, 
    'parity': 5.7, 
    'age': "female",
    'height': 69,
    'weight': 115,
    'smoke': "non-smoker"})

0    63.286453
dtype: float64

In [83]:
#scenario 3 - `result` is still the height ~ weight model, so this returns a predicted height in inches for weight = 120
result.predict({"gestation": 270,
                "weight": 120,
                "height": 66})

0    63.596278
dtype: float64
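
Since the exercise asks for predictions based on the model summary, birthweight can also be worked out by hand from the coefficients of the full model (`bwt ~ gestation + parity + age + height + weight + smoke`). The sketch below copies those coefficients from the summary table; `predict_bwt` is a hypothetical helper written for illustration, and the scenario values are the made-up ones from scenario 1.

```python
# Coefficients copied from the full-model summary table
coef = {
    "Intercept": -93.4588,
    "parity[T.second pregnancy]": -3.6869,
    "smoke[T.smoker]": -8.1507,
    "gestation": 0.4781,
    "age": -0.0246,
    "height": 1.2250,
    "weight": 0.0469,
}

def predict_bwt(gestation, age, height, weight, second_pregnancy=False, smoker=False):
    """Hypothetical helper: sum the intercept and each coefficient times its value."""
    total = coef["Intercept"]
    total += coef["gestation"] * gestation
    total += coef["age"] * age
    total += coef["height"] * height
    total += coef["weight"] * weight
    # the categorical terms only apply to the non-reference category
    if second_pregnancy:
        total += coef["parity[T.second pregnancy]"]
    if smoker:
        total += coef["smoke[T.smoker]"]
    return total

# Scenario 1: first pregnancy, non-smoker
print(round(predict_bwt(265, 21, 66, 112), 2))               # 118.82
# Same mother, but a smoker: lower by the smoke coefficient
print(round(predict_bwt(265, 21, 66, 112, smoker=True), 2))  # 110.67
```

The two results differ by exactly 8.1507 oz, which is how the `smoke[T.smoker]` coefficient should be read: the expected difference in birthweight with all other variables held constant.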