# Exercise 11.1 
Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool.

In [12]:
import first
import statsmodels.formula.api as smf

In [20]:
live, firsts, others = first.MakeFrames()
# Keep only pregnancies that lasted longer than 30 weeks, since bets are placed during week 30.
live = live[live.prglngth>30] 

# Candidate predictors known before the birth: first baby, mother's race, and multiple birth.
# In the fit below, first baby and multiple birth are statistically significant; race is not (p = 0.156).
model = smf.ols('prglngth ~ birthord==1 + race + nbrnaliv>1', data=live)
results = model.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | prglngth | R-squared: | 0.01 |
| Model: | OLS | Adj. R-squared: | 0.01 |
| Method: | Least Squares | F-statistic: | 31.27 |
| Date: | Fri, 12 Nov 2021 | Prob (F-statistic): | 4.24e-20 |
| Time: | 15:32:51 | Log-Likelihood: | -18252.0 |
| No. Observations: | 8884 | AIC: | 36510.0 |
| Df Residuals: | 8880 | BIC: | 36540.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 38.7581 | 0.070 | 553.685 | 0.000 | 38.621 | 38.895 |
| birthord == 1[T.True] | 0.1054 | 0.040 | 2.625 | 0.009 | 0.027 | 0.184 |
| nbrnaliv > 1[T.True] | -1.4876 | 0.165 | -9.038 | 0.000 | -1.810 | -1.165 |
| race | 0.0502 | 0.035 | 1.420 | 0.156 | -0.019 | 0.120 |

| | | | |
|---|---|---|---|
| Omnibus: | 1574.506 | Durbin-Watson: | 1.619 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 6122.061 |
| Skew: | -0.844 | Prob(JB): | 0.0 |
| Kurtosis: | 6.7 | Cond. No. | 18.1 |
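
To turn the fitted coefficients into a concrete guess for the office pool, the linear predictor can be evaluated by hand. The sketch below hardcodes the coefficients from the summary above as printed; the helper name `predict_prglngth` and the example inputs (including race code 2) are illustrative assumptions, not part of the original solution.

```python
# Coefficients copied from the OLS summary above (rounded as printed).
INTERCEPT = 38.7581
B_FIRST = 0.1054      # birthord == 1
B_MULTIPLE = -1.4876  # nbrnaliv > 1
B_RACE = 0.0502       # race, entered in the formula as a numeric code


def predict_prglngth(first_baby, multiple_birth, race):
    """Return the predicted pregnancy length in weeks (hypothetical helper)."""
    return (INTERCEPT
            + B_FIRST * int(first_baby)
            + B_MULTIPLE * int(multiple_birth)
            + B_RACE * race)


# Example inputs are assumptions chosen for illustration.
single = predict_prglngth(first_baby=True, multiple_birth=False, race=2)
twins = predict_prglngth(first_baby=True, multiple_birth=True, race=2)
print(f"first baby, single birth:   {single:.2f} weeks")
print(f"first baby, multiple birth: {twins:.2f} weeks")
```

With an R-squared of about 0.01, these variables explain roughly 1% of the variation in pregnancy length, so the prediction moves noticeably only when a multiple birth is expected.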


# Exercise 11.3 
If the quantity you want to predict is a count, you can use Poisson regression, which is implemented in StatsModels with a function called poisson. It works the same way as ols and logit. As an exercise, let’s use it to predict how many children a woman has borne; in the NSFG dataset, this variable is called numbabes.
Suppose you meet a woman who is 35 years old, black, and a college graduate whose annual household income exceeds $75,000. How many children would you predict she has borne?

In [55]:
import nsfg
import numpy as np
import pandas as pd

In [56]:
resp = nsfg.ReadFemResp()

In [57]:
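# join() looks up live.caseid in the index of resp. If resp still has its default
# integer index, caseid values are matched against row numbers rather than
# respondent IDs; setting resp.index = resp.caseid beforehand avoids that.
join = live.join(resp, on='caseid', rsuffix='_r')
formula = 'numbabes ~ age_r + C(race) + totincr + educat'

# Fit a Poisson regression model.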
model = smf.poisson(formula, data=join)
results = model.fit()
results.summary() 

Optimization terminated successfully.
         Current function value: 1.393384
         Iterations 6


| | | | |
|---|---|---|---|
| Dep. Variable: | numbabes | No. Observations: | 5651.0 |
| Model: | Poisson | Df Residuals: | 5645.0 |
| Method: | MLE | Df Model: | 5.0 |
| Date: | Fri, 12 Nov 2021 | Pseudo R-squ.: | 0.1073 |
| Time: | 16:10:33 | Log-Likelihood: | -7874.0 |
| converged: | True | LL-Null: | -8820.6 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.2055 | 0.082 | -14.686 | 0.000 | -1.366 | -1.045 |
| C(race)[T.2] | -0.0256 | 0.028 | -0.921 | 0.357 | -0.080 | 0.029 |
| C(race)[T.3] | -0.1977 | 0.050 | -3.927 | 0.000 | -0.296 | -0.099 |
| age_r | 0.0616 | 0.002 | 40.209 | 0.000 | 0.059 | 0.065 |
| totincr | -0.0520 | 0.003 | -16.955 | 0.000 | -0.058 | -0.046 |
| educat | -0.0067 | 0.005 | -1.406 | 0.160 | -0.016 | 0.003 |


In [58]:
# Predict the number of children for a woman who is 35 years old, black, and a college graduate
# whose annual household income exceeds $75,000 (coded here as race=1, totincr=14, educat=16).
columns = ['age_r', 'race', 'totincr', 'educat']
new = pd.DataFrame([[35, 1, 14, 16]], columns=columns)
results.predict(new)

0    1.124008
dtype: float64
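
The prediction can be checked by hand: a Poisson regression models the log of the expected count, so the prediction is the exponential of the linear predictor. A minimal sketch using the coefficients from the summary above as printed; race code 1 is the baseline category of `C(race)`, so it contributes no term. Small differences from the value returned by `results.predict` come from the rounded coefficients.

```python
import math

# Coefficients copied from the Poisson summary above (rounded as printed).
coef = {
    "Intercept": -1.2055,
    "age_r": 0.0616,
    "totincr": -0.0520,
    "educat": -0.0067,
}

# Same inputs as the prediction cell: age 35, race code 1 (baseline, no term),
# totincr 14, educat 16.
log_mu = (coef["Intercept"]
          + coef["age_r"] * 35
          + coef["totincr"] * 14
          + coef["educat"] * 16)

# The model predicts log(expected count), so exponentiate to get the count.
mu = math.exp(log_mu)
print(f"linear predictor: {log_mu:.4f}")
print(f"expected number of children: {mu:.3f}")
```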

# Exercise 11.4 
If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called mnlogit. As an exercise, let’s use it to guess whether a woman is married, cohabitating, widowed, divorced, separated, or never married; in the NSFG dataset, marital status is encoded in a variable called rmarital.
Suppose you meet a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000. What is the probability that she is married, cohabitating, etc.?

In [61]:
# Creating the model using multinomial logistic regression.

formula='rmarital ~ age_r + C(race) + totincr + educat'
model = smf.mnlogit(formula, data=join)
results = model.fit()
results.summary() 

Optimization terminated successfully.
         Current function value: 1.188934
         Iterations 8


| | | | |
|---|---|---|---|
| Dep. Variable: | rmarital | No. Observations: | 5651.0 |
| Model: | MNLogit | Df Residuals: | 5621.0 |
| Method: | MLE | Df Model: | 25.0 |
| Date: | Fri, 12 Nov 2021 | Pseudo R-squ.: | 0.08755 |
| Time: | 16:17:49 | Log-Likelihood: | -6718.7 |
| converged: | True | LL-Null: | -7363.3 |
| Covariance Type: | nonrobust | LLR p-value: | 1.647e-256 |

| rmarital=2 | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.5003 | 0.324 | 7.709 | 0.000 | 1.865 | 3.136 |
| C(race)[T.2] | -1.1476 | 0.106 | -10.829 | 0.000 | -1.355 | -0.940 |
| C(race)[T.3] | -0.7534 | 0.170 | -4.426 | 0.000 | -1.087 | -0.420 |
| age_r | -0.0075 | 0.005 | -1.384 | 0.166 | -0.018 | 0.003 |
| totincr | 0.0347 | 0.012 | 2.889 | 0.004 | 0.011 | 0.058 |
| educat | -0.2789 | 0.021 | -13.298 | 0.000 | -0.320 | -0.238 |

| rmarital=3 | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.9400 | 0.955 | -0.984 | 0.325 | -2.813 | 0.933 |
| C(race)[T.2] | -0.8189 | 0.313 | -2.619 | 0.009 | -1.432 | -0.206 |
| C(race)[T.3] | -1.1772 | 0.638 | -1.844 | 0.065 | -2.428 | 0.074 |
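
Each block of `mnlogit` coefficients gives log-odds relative to a base category, which here is `rmarital=1` since the first block shown is `rmarital=2`. As a rough check, the sketch below evaluates the `rmarital=2` block for the 25-year-old respondent described in the exercise (race code 2, `totincr` 11, `educat` 12). The coefficients are copied from the summary as printed, so the result is approximate.

```python
import math

# rmarital=2 coefficients copied from the summary above (rounded as printed).
coef = {
    "Intercept": 2.5003,
    "C(race)[T.2]": -1.1476,
    "age_r": -0.0075,
    "totincr": 0.0347,
    "educat": -0.2789,
}

# Respondent from the exercise: age 25, race code 2, totincr 11, educat 12.
log_odds = (coef["Intercept"]
            + coef["C(race)[T.2]"]      # indicator for race == 2 equals 1
            + coef["age_r"] * 25
            + coef["totincr"] * 11
            + coef["educat"] * 12)

# exp(log-odds) is P(rmarital=2) / P(rmarital=1) for this respondent.
relative_odds = math.exp(log_odds)
print(f"log-odds of rmarital=2 vs rmarital=1: {log_odds:.4f}")
print(f"relative odds: {relative_odds:.3f}")
```

The relative odds come out near 0.165, so this respondent is about one sixth as likely to be in category 2 as in category 1. That agrees with the ratio of the first two predicted probabilities reported for her (0.104414 / 0.632409).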


In [62]:
# Predict for a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000.

# Per the output below, she has about a 63% chance of being currently married, a 10% chance of being "not married but living with opposite sex partner", and so on.

columns = ['age_r', 'race', 'totincr', 'educat']
new = pd.DataFrame([[25, 2, 11, 12]], columns=columns)
results.predict(new)

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.632409 | 0.104414 | 0.009325 | 0.107 | 0.061406 | 0.085446 |
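
The six columns of the prediction follow the order of the `rmarital` codes 1 through 6, which the exercise lists as married, cohabitating, widowed, divorced, separated, and never married. A small sketch that attaches those labels to the probabilities printed above, checks that they sum to 1, and picks the most likely status; the label strings are shorthand introduced here.

```python
# Probabilities copied from the prediction output above, in column order 0..5.
probs = [0.632409, 0.104414, 0.009325, 0.107, 0.061406, 0.085446]

# Labels in rmarital code order 1..6, as listed in the exercise statement.
labels = ["married", "cohabitating", "widowed",
          "divorced", "separated", "never married"]

by_status = dict(zip(labels, probs))
total = sum(probs)
most_likely = max(by_status, key=by_status.get)

for status, p in by_status.items():
    print(f"{status:>13}: {p:.1%}")
print(f"sum of probabilities: {total:.6f}")
print(f"most likely status: {most_likely}")
```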
