## Multiple Linear Regression

In [6]:
# import libraries 
import pandas as pd
import seaborn as sns 

In [7]:
# load data
penguins = sns.load_dataset("penguins", cache = False)
penguins.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


### Data Cleaning

In [8]:
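# keep only the columns used in the model and rename "sex" to "gender"
penguins = penguins[["body_mass_g", "bill_length_mm", "sex", "species"]].rename(
    columns={"sex": "gender"}
)

# drop rows with missing values, then rebuild a clean 0..n-1 index
# (reassigning instead of inplace=True avoids SettingWithCopyWarning on the subset)
penguins = penguins.dropna().reset_index(drop=True)

penguins.head()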

| | body_mass_g | bill_length_mm | gender | species |
|---|---|---|---|---|
| 0 | 3750.0 | 39.1 | Male | Adelie |
| 1 | 3800.0 | 39.5 | Female | Adelie |
| 2 | 3250.0 | 40.3 | Female | Adelie |
| 3 | 3450.0 | 36.7 | Female | Adelie |
| 4 | 3650.0 | 39.3 | Male | Adelie |


### Create Holdout Sample

In [9]:
penguins_x = penguins[['bill_length_mm', 'gender', 'species']]
penguins_y = penguins[['body_mass_g']]

In [11]:
from sklearn.model_selection import train_test_split

**reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html**

In [12]:
x_train, x_test, y_train, y_test = train_test_split(penguins_x, penguins_y, test_size = 0.2, random_state = 42)
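
To make the effect of `test_size = 0.2` concrete, here is a minimal sketch on a placeholder frame. The 333-row size and the `demo_*` / `dx_*` / `dy_*` names are assumptions for illustration only; the values are not the real penguins data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frames with 333 rows (made-up stand-ins, not the real data).
demo_x = pd.DataFrame({"bill_length_mm": range(333)})
demo_y = pd.DataFrame({"body_mass_g": range(333)})

dx_train, dx_test, dy_train, dy_test = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=42
)

# Roughly 80% of the rows go to training, the rest to the holdout set.
print(len(dx_train), len(dx_test))

# The feature and target frames keep matching row labels after the split.
print(dx_train.index.equals(dy_train.index))
```

Because both frames are shuffled together, their indexes stay aligned, which is what allows the features and target to be concatenated column-wise later on.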

### Model Construction

In [14]:
# OLS formula 
ols_formula = 'body_mass_g ~ bill_length_mm + C(gender) + C(species)'

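I wrap `gender` and `species` in **`C()`** to mark them as categorical variables. This tells the `ols()` function to dummy-encode them: each category gets its own 0/1 indicator column, except the first level of each variable (here `Female` and `Adelie`), which serves as the reference category.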
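
As a rough illustration of what this encoding looks like, the sketch below uses `pd.get_dummies` on a tiny made-up frame. This is only a pandas analogue and not what `ols()` runs internally; the `toy` and `encoded` names and the four rows are hypothetical.

```python
import pandas as pd

# Tiny made-up frame that contains every category level at least once.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0, 38.2],
    "gender": ["Male", "Female", "Male", "Female"],
    "species": ["Adelie", "Chinstrap", "Gentoo", "Adelie"],
})

# drop_first=True drops the first (alphabetical) level of each variable,
# so that level acts as the baseline the other levels are compared against.
encoded = pd.get_dummies(toy, columns=["gender", "species"], drop_first=True)

print(list(encoded.columns))
```

The remaining indicator columns correspond to `Male`, `Chinstrap`, and `Gentoo`, so any coefficient on them is read relative to the dropped levels.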

In [15]:
from statsmodels.formula.api import ols

**reference: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html**

In [17]:
# create OLS dataframe
ols_data = pd.concat([x_train, y_train], axis = 1)

# create OLS object and fit the model
OLS = ols(formula = ols_formula, data = ols_data)
model = OLS.fit()

### Model Evaluation

In [18]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | body_mass_g | R-squared: | 0.853 |
| Model: | OLS | Adj. R-squared: | 0.851 |
| Method: | Least Squares | F-statistic: | 378.7 |
| Date: | Tue, 06 Feb 2024 | Prob (F-statistic): | 2.37e-107 |
| Time: | 11:32:21 | Log-Likelihood: | -1902.6 |
| No. Observations: | 266 | AIC: | 3815.0 |
| Df Residuals: | 261 | BIC: | 3833.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2116.4221 | 311.271 | 6.799 | 0.000 | 1503.499 | 2729.345 |
| C(gender)[T.Male] | 535.4101 | 49.988 | 10.711 | 0.000 | 436.979 | 633.841 |
| C(species)[T.Chinstrap] | -289.0657 | 96.063 | -3.009 | 0.003 | -478.222 | -99.909 |
| C(species)[T.Gentoo] | 1096.4636 | 84.504 | 12.975 | 0.000 | 930.067 | 1262.860 |
| bill_length_mm | 33.6625 | 8.366 | 4.024 | 0.000 | 17.189 | 50.136 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.182 | Durbin-Watson: | 1.998 |
| Prob(Omnibus): | 0.913 | Jarque-Bera (JB): | 0.244 |
| Skew: | 0.061 | Prob(JB): | 0.885 |
| Kurtosis: | 2.915 | Cond. No. | 768.0 |
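
The summary above describes the fit on the training rows only. One possible way to use the holdout sample is to predict on it and compute error metrics, as in the sketch below. It runs on synthetic data so that it is self-contained: the `toy` frame, its made-up coefficients, and the names `fitted`, `pred`, `rmse`, and `r2_holdout` are all assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import ols

# Synthetic stand-in for the cleaned penguins data (all values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "bill_length_mm": rng.normal(44, 5, n),
    "gender": rng.choice(["Female", "Male"], n),
    "species": rng.choice(["Adelie", "Chinstrap", "Gentoo"], n),
})
toy["body_mass_g"] = (
    2000
    + 30 * toy["bill_length_mm"]
    + 500 * (toy["gender"] == "Male")
    + 1000 * (toy["species"] == "Gentoo")
    + rng.normal(0, 300, n)
)

# Same kind of 80/20 split and formula as in the notebook.
train, test = train_test_split(toy, test_size=0.2, random_state=42)
fitted = ols(
    "body_mass_g ~ bill_length_mm + C(gender) + C(species)", data=train
).fit()

# Predict on rows the model never saw and compare with the true values.
pred = fitted.predict(test)
resid = test["body_mass_g"] - pred
rmse = float(np.sqrt((resid ** 2).mean()))
total_ss = ((test["body_mass_g"] - test["body_mass_g"].mean()) ** 2).sum()
r2_holdout = float(1 - (resid ** 2).sum() / total_ss)

print(f"holdout RMSE: {rmse:.1f} g, holdout R^2: {r2_holdout:.3f}")
```

With the real data, the same pattern would apply to `model`, `x_test`, and `y_test`; a holdout R-squared far below the training value would suggest overfitting.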


#### Gender
Holding all other variables constant, we would expect a male penguin's body mass to be about 535.41 grams more than a female penguin's body mass.

#### Species
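If we compare an Adelie penguin and a Gentoo penguin that share the same gender and bill length, we would expect the Gentoo penguin to have a body mass about 1,096.46 grams greater than the Adelie penguin. By the same reasoning, we would expect a Chinstrap penguin to have a body mass about 289.07 grams less than an otherwise identical Adelie penguin.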


#### Bill Length
Bill length (mm) is a continuous variable, so its coefficient is a slope: if we compare two penguins of the same gender and species, where one penguin's bill is 1 millimeter longer, we would expect the penguin with the longer bill to have about 33.66 grams more body mass than the penguin with the shorter bill.
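
To see how these coefficients combine into a single prediction, the sketch below plugs the estimates from the summary table into the regression equation by hand. The helper `predict_body_mass` and the 45 mm example bill length are hypothetical and only serve to illustrate the arithmetic.

```python
# Coefficients copied from the model summary table.
coef = {
    "Intercept": 2116.4221,
    "C(gender)[T.Male]": 535.4101,
    "C(species)[T.Chinstrap]": -289.0657,
    "C(species)[T.Gentoo]": 1096.4636,
    "bill_length_mm": 33.6625,
}


def predict_body_mass(bill_length_mm, gender, species):
    """Hypothetical helper: evaluate the fitted equation for one penguin."""
    # Baseline: a female Adelie penguin (the reference categories).
    mass = coef["Intercept"] + coef["bill_length_mm"] * bill_length_mm
    # Add the offset for each non-reference category that applies.
    if gender == "Male":
        mass += coef["C(gender)[T.Male]"]
    if species == "Chinstrap":
        mass += coef["C(species)[T.Chinstrap]"]
    elif species == "Gentoo":
        mass += coef["C(species)[T.Gentoo]"]
    return mass


female_adelie = predict_body_mass(45.0, "Female", "Adelie")
male_gentoo = predict_body_mass(45.0, "Male", "Gentoo")

print(round(female_adelie, 2), round(male_gentoo, 2))
# The gap between the two is the sum of the Male and Gentoo offsets.
print(round(male_gentoo - female_adelie, 2))
```

At the same bill length, the two predictions differ by exactly the gender offset plus the species offset, which is the "all else equal" reading of the coefficients.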
