## Multiple Linear Regression Penguins



## Data Loading and Inspection

In [12]:
# Import packages
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import ols

In [5]:
# Load dataset
penguins = sns.load_dataset("penguins", cache=False)

# Examine first 5 rows of dataset
penguins.head()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [7]:
# Subset data
penguins = penguins[["body_mass_g", "bill_length_mm", "sex", "species"]]

# Rename the "sex" column to "gender"
penguins = penguins.rename(columns={"sex": "gender"})

# Drop rows with missing values (reassign instead of inplace=True to
# avoid modifying a slice of the original DataFrame)
penguins = penguins.dropna()

# Reset index
penguins = penguins.reset_index(drop=True)

In [8]:
penguins

,body_mass_g,bill_length_mm,gender,species
0,3750.0,39.1,Male,Adelie
1,3800.0,39.5,Female,Adelie
2,3250.0,40.3,Female,Adelie
3,3450.0,36.7,Female,Adelie
4,3650.0,39.3,Male,Adelie
...,...,...,...,...
328,4925.0,47.2,Female,Gentoo
329,4850.0,46.8,Female,Gentoo
330,5750.0,50.4,Male,Gentoo
331,5200.0,45.2,Female,Gentoo


In [9]:
# Subset X and y variables
penguins_X = penguins[["bill_length_mm", "gender", "species"]]
penguins_y = penguins[["body_mass_g"]]

In [11]:
# Create training and holdout (test) sets: 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(penguins_X, penguins_y,
                                                    test_size = 0.3, random_state = 42)

## Model Construction

First, we write out the formula as a string: the name of the y variable, followed by a tilde (~), and then each of the X variables separated by plus signs (+). Wrapping a variable in C() marks it as categorical, which tells the ols() function to encode it as a set of 0/1 indicator (dummy) variables, with one category left out as the reference level.
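To make the encoding concrete, here is a minimal pure-Python sketch of dummy coding with a dropped reference level. The helper `treatment_code` is hypothetical and written only for illustration; it is not how ols() is implemented, and it assumes the reference level is the first category in sorted order, which is consistent with the coefficient names in the summary further down.

```python
def treatment_code(values, name):
    """Dummy-code category labels, dropping the first (sorted) level.

    Hypothetical helper for illustration only.
    """
    levels = sorted(set(values))
    reference, others = levels[0], levels[1:]
    # One 0/1 column per non-reference level, named like the summary rows
    columns = {
        f"C({name})[T.{level}]": [1 if v == level else 0 for v in values]
        for level in others
    }
    return reference, columns


species = ["Adelie", "Gentoo", "Chinstrap", "Adelie"]
reference, columns = treatment_code(species, "species")

print("reference level:", reference)
for column_name, indicator in columns.items():
    print(column_name, indicator)
```

An Adelie penguin gets 0 in both columns, which is why no separate Adelie coefficient shows up in the model: its effect is absorbed into the intercept.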

In [13]:
# Write out OLS formula as a string
ols_formula = "body_mass_g ~ bill_length_mm + C(gender) + C(species)"

In [14]:
# Create OLS dataframe
ols_data = pd.concat([X_train, y_train], axis = 1)

# Create OLS object and fit the model
OLS = ols(formula = ols_formula, data = ols_data)
model = OLS.fit()

In [15]:
# Get model results
model.summary()

Dep. Variable:,body_mass_g,R-squared:,0.85
Model:,OLS,Adj. R-squared:,0.847
Method:,Least Squares,F-statistic:,322.6
Date:,"Mon, 06 May 2024",Prob (F-statistic):,1.31e-92
Time:,14:33:25,Log-Likelihood:,-1671.7
No. Observations:,233,AIC:,3353.0
Df Residuals:,228,BIC:,3371.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2032.2111,354.087,5.739,0.000,1334.510,2729.913
C(gender)[T.Male],528.9508,55.105,9.599,0.000,420.371,637.531
C(species)[T.Chinstrap],-285.3865,106.339,-2.684,0.008,-494.920,-75.853
C(species)[T.Gentoo],1081.6246,94.953,11.391,0.000,894.526,1268.723
bill_length_mm,35.5505,9.493,3.745,0.000,16.845,54.256

Omnibus:,0.339,Durbin-Watson:,1.948
Prob(Omnibus):,0.844,Jarque-Bera (JB):,0.436
Skew:,0.084,Prob(JB):,0.804
Kurtosis:,2.871,Cond. No.,798.0


C(gender) - Male
The variable name C(gender)[T.Male] tells us that the variable was encoded as Male = 1, Female = 0, so female penguins are the reference category. Holding all other variables constant, we would expect a male penguin's body mass to be about 528.95 grams greater than a female penguin's.
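As a quick check of this interpretation, the sketch below plugs the coefficients from the summary table into the fitted equation by hand. The function `predict_body_mass` and the 40 mm bill length are assumptions chosen for illustration; with a fitted model you would normally call `model.predict()` instead.

```python
# Coefficients copied from the model summary above
COEF = {
    "Intercept": 2032.2111,
    "bill_length_mm": 35.5505,
    "C(gender)[T.Male]": 528.9508,
    "C(species)[T.Chinstrap]": -285.3865,
    "C(species)[T.Gentoo]": 1081.6246,
}


def predict_body_mass(bill_length_mm, gender, species):
    """Hypothetical helper: evaluate the fitted equation by hand."""
    mass = COEF["Intercept"] + COEF["bill_length_mm"] * bill_length_mm
    # Reference levels (Female, Adelie) have no coefficient, so they add 0.
    # Note that a misspelled label would also silently add 0.
    mass += COEF.get(f"C(gender)[T.{gender}]", 0.0)
    mass += COEF.get(f"C(species)[T.{species}]", 0.0)
    return mass


# Two Adelie penguins with the same bill length, differing only in gender
female = predict_body_mass(40.0, "Female", "Adelie")
male = predict_body_mass(40.0, "Male", "Adelie")

print(round(male - female, 2))  # 528.95
```

The difference equals the C(gender)[T.Male] coefficient no matter which bill length or species is held fixed, because the model contains no interaction terms.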

C(species) - Chinstrap and Gentoo
The names of these two variables tell us that Adelie penguins are the reference category. If we compare an Adelie penguin and a Chinstrap penguin with the same bill length and gender, we would expect the Chinstrap penguin's body mass to be about 285.39 grams less than the Adelie penguin's. Likewise, if we compare an Adelie penguin and a Gentoo penguin with the same bill length and gender, we would expect the Gentoo penguin's body mass to be about 1,081.62 grams more than the Adelie penguin's.
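Because both species coefficients are measured against Adelie, the expected gap between two non-reference species is the difference of their coefficients. The sketch below works that out from the summary values; `species_contrast` is a hypothetical helper, and it gives only a point estimate, with no standard error or confidence interval for the contrast.

```python
# Species effects relative to the Adelie reference level (from the summary)
SPECIES_EFFECT = {
    "Adelie": 0.0,          # reference level, absorbed into the intercept
    "Chinstrap": -285.3865,
    "Gentoo": 1081.6246,
}


def species_contrast(species_a, species_b):
    """Expected body mass of species_a minus species_b, in grams,
    holding bill length and gender constant."""
    return SPECIES_EFFECT[species_a] - SPECIES_EFFECT[species_b]


print(round(species_contrast("Chinstrap", "Adelie"), 2))  # -285.39
print(round(species_contrast("Gentoo", "Adelie"), 2))     # 1081.62
print(round(species_contrast("Gentoo", "Chinstrap"), 2))  # 1367.01
```

So, under this model, a Gentoo penguin is expected to weigh about 1,367 grams more than a Chinstrap penguin with the same bill length and gender.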

