# Linear Regression 1

**Importing libraries**

In [56]:
from IPython.display import display

import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.linear_model import LinearRegression

import plotly.express as px
import plotly.graph_objects as go

### Baselines
- "rule of thumb" (using previous knowledge or commonly known information)
- descriptive statistics
- fitting a simple model

**Penguin example**

Since we're serving as (temporary) penguin researchers, we have some experience judging a penguin's weight from its flipper length. We know that, on average, every 20 mm increase in flipper length corresponds to an increase in weight of about 1000 g (1 kg), or 50 g per mm. One of our penguins has a flipper length of 220 mm and a known weight of 5000 g. We observe another penguin with a flipper length of 190 mm; what is the approximate weight of this second penguin? Its flippers are 30 mm shorter, so we subtract 30 mm × 50 g/mm = 1500 g from the reference weight: 5000 g - 1500 g = 3500 g.

We just used a baseline (1000g/20mm) and made a prediction based on that starting point.
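
The rule-of-thumb arithmetic above can be written as a small function. This is only a sketch: the name `baseline_weight` and its parameter names are made up for illustration, while the reference penguin (220 mm, 5000 g) and the rate of 1000 g per 20 mm come from the example.

```python
def baseline_weight(flipper_mm, ref_flipper_mm=220.0, ref_weight_g=5000.0,
                    grams_per_mm=1000.0 / 20.0):
    """Estimate body mass (g) by scaling from a reference penguin."""
    return ref_weight_g + grams_per_mm * (flipper_mm - ref_flipper_mm)


estimate = baseline_weight(190)
print(estimate)  # 3500.0
```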

**Graphing with penguin data**

We see in the graph below that our baseline estimate (the intersection of the red cross) is close to what the OLS trendline estimates.

In [46]:
df = sns.load_dataset("penguins")
display(df.head())

fig = px.scatter(df, x="flipper_length_mm", y="body_mass_g", trendline="ols", width=800)
fig.add_trace(
    go.Scatter(
        x=[190, 190],
        y=[3000, 4000],
        mode='lines',
        line=go.scatter.Line(color="red"),
        showlegend=False))
fig.add_trace(
    go.Scatter(
        x=[185, 195],
        y=[3500, 3500],
        mode='lines',
        line=go.scatter.Line(color="red"),
        showlegend=False))
fig.add_trace(
    go.Scatter(
        x=[190],
        y=[3500],
        mode='markers',
        marker=dict(color="red"),
        showlegend=False))
fig.show()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


### Linear Regression

Linear regression fits a line to data where the equation of the line is given by

$$y = \beta_0 + \beta_1x$$

When we fit a line, we’re trying to find the coefficients $\beta_0$ and $\beta_1$. The parameter $\beta_0$ is the intercept (the value of $y$ when $x = 0$) and $\beta_1$ is the slope (the change in $y$ for a one-unit increase in $x$). Fitting the model returns the slope and the intercept.
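
For a single feature, the least-squares values of $\beta_0$ and $\beta_1$ have a closed form: the slope is the covariance of $x$ and $y$ divided by the variance of $x$, and the intercept follows from the two means. A minimal NumPy sketch, using a few made-up points that lie exactly on a line (they are not penguin measurements):

```python
import numpy as np

# Made-up points lying exactly on y = 50x - 5800 (illustration only)
x = np.array([180.0, 190.0, 200.0, 210.0, 220.0])
y = 50.0 * x - 5800.0

# Closed-form ordinary least squares for one feature
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

print(beta_1, beta_0)
```

Because the points sit exactly on the line, the fit recovers a slope of 50 and an intercept of -5800. With noisy data the same formulas give the line that minimizes the sum of squared vertical distances.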

**Scikit-learn API**

- Load the data set and "clean" it if needed (not specifically part of scikit-learn, but important to do first)
- Create features and target(s) from the data
- Import the model and instantiate the class
- Fit the model
- Apply your model; use the model to predict new values
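
The steps above can be sketched end to end as follows. The data is a tiny made-up table (not the penguin data set) so the example runs without a download; the column names `x` and `y` and the variable names are placeholders.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1. Load and clean: made-up data with one missing value (illustration only)
toy = pd.DataFrame({"x": [180.0, 190.0, None, 210.0, 220.0],
                    "y": [3200.0, 3700.0, 4200.0, 4700.0, 5200.0]})
toy = toy.dropna()

# 2. Create the feature matrix (2D) and the target array (1D)
X_toy = toy[["x"]]
y_toy = toy["y"]

# 3. Import the model (done above) and instantiate the class
toy_model = LinearRegression()

# 4. Fit the model
toy_model.fit(X_toy, y_toy)

# 5. Apply the model to predict a new value
new_X = pd.DataFrame({"x": [200.0]})
prediction = toy_model.predict(new_X)
print(prediction)
```

The remaining points lie on the line y = 50x - 5800, so the prediction for `x = 200` should come out at about 4200.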

In [49]:
# Load the data into a DataFrame
df = sns.load_dataset("penguins")

# Print the shape of the DataFrame
print('Shape of the dataset (before removing NaNs): ', df.shape)

# Drop NaNs
df.dropna(inplace=True)

# Print the shape of the DataFrame
print('Shape of the dataset (after removing NaNs): ', df.shape)

# Display the first five rows
display(df.head())

Shape of the dataset (before removing NaNs):  (344, 7)
Shape of the dataset (after removing NaNs):  (333, 7)


| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |


**Representing Data**

Features
- species
- island
- bill_length_mm
- bill_depth_mm
- flipper_length_mm
- body_mass_g
- sex

**Feature Matrix and Target Array**

Feature:

- flipper_length_mm

Target:

- body_mass_g

### Fit a model with penguin weight and flipper length

In [110]:
# Feature X
X = df['flipper_length_mm']
# Target y
y = df['body_mass_g']

print('X shape ', X.shape)
print('y shape ', y.shape)

X shape  (333,)
y shape  (333,)


**Instantiate the class**

In [111]:
model = LinearRegression()
model

LinearRegression()

**Arrange Data**

In [112]:
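print('Original shape of X: ', X.shape)

# Add a second axis so X has shape (n_samples, 1), as scikit-learn expects.
# Convert to a NumPy array first: recent pandas versions reject X[:, np.newaxis] on a Series.
X = X.to_numpy()[:, np.newaxis]

print('new X shape: ', X.shape)

Original shape of X:  (333,)
new X shape:  (333, 1)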
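
scikit-learn estimators expect the feature matrix to be two-dimensional, with shape `(n_samples, n_features)`, which is why the single feature is given an extra axis. A short sketch of two equivalent ways to get that shape from a 1D array, using a small illustrative array named `lengths`:

```python
import numpy as np

lengths = np.array([181.0, 186.0, 195.0])  # a few sample 1D feature values

as_column_a = lengths[:, np.newaxis]   # add a new trailing axis
as_column_b = lengths.reshape(-1, 1)   # reshape into a single column

print(as_column_a.shape, as_column_b.shape)  # (3, 1) (3, 1)
```

With a DataFrame, selecting with a list of column names, such as `df[['flipper_length_mm']]`, also gives a 2D result without any reshaping.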


**Fit the model**

In [113]:
model.fit(X, y)

LinearRegression()

**Look at the coefficients**
- the fitted attributes `coef_` and `intercept_` hold the slope and the intercept

In [114]:
# Slope (also called the model coefficient)
print(model.coef_)

# Intercept
print(model.intercept_)

# In equation form
print(f'\nbody_mass_g = {model.coef_[0]} x flipper_length_mm + ({model.intercept_})')

[50.15326594]
-5872.092682842825

body_mass_g = 50.15326594224113 x flipper_length_mm + (-5872.092682842825)
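
With the slope and intercept printed above, the fitted line can be evaluated by hand and compared with the 3500 g baseline estimate for a 190 mm flipper. The sketch below copies the two numbers from the output instead of refitting; the variable names are illustrative.

```python
slope = 50.15326594224113         # model.coef_[0] from the output above
intercept = -5872.092682842825    # model.intercept_ from the output above

flipper_length_mm = 190
predicted_mass_g = slope * flipper_length_mm + intercept
print(round(predicted_mass_g))  # 3657
```

The fitted line puts this penguin roughly 157 g above the rule-of-thumb estimate. With the fitted model still in memory, `model.predict(np.array([[190]]))` should give the same value.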


In [115]:
x_line = np.linspace(170,240)
y_line = model.coef_*x_line + model.intercept_

fig = px.line(x=x_line, y=y_line, width=800)
fig.show()

fig = px.scatter(df, x="flipper_length_mm", y="body_mass_g", width=800)
fig.add_trace(go.Scatter(x=x_line, y=y_line, mode='lines', line=go.scatter.Line(color="red")))
fig.show()


### Fit a model with other data features

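With more than one feature, the intercept is the predicted body mass when every feature equals zero. It is still valid, but it cannot be used on its own as the y-intercept of a one-feature plot: a line drawn from a single coefficient plus the intercept leaves out the other feature's contribution, so it will not pass through the data (worth looking deeper into this).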

In [118]:
df.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 333.0 | 333.0 | 333.0 | 333.0 |
| mean | 43.992793 | 17.164865 | 200.966967 | 4207.057057 |
| std | 5.468668 | 1.969235 | 14.015765 | 805.215802 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.5 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.5 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.6 | 18.7 | 213.0 | 4775.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |


In [126]:
X = df[['bill_length_mm', 'bill_depth_mm']]
y = df['body_mass_g']

model = LinearRegression()

model.fit(X, y)

# Coefficients (one slope per feature, in column order)
print(model.coef_)

# Intercept
print(model.intercept_)

x_line = np.linspace(30,60)
y_line = model.coef_[0]*x_line + model.intercept_

fig = px.scatter(df, x="bill_length_mm", y="body_mass_g", width=800)
fig.add_trace(go.Scatter(x=x_line, y=y_line, mode='lines', line=go.scatter.Line(color="red")))
fig.show()

x_line = np.linspace(13,21)
y_line = model.coef_[1]*x_line + model.intercept_

fig = px.scatter(df, x="bill_depth_mm", y="body_mass_g", width=800)
fig.add_trace(go.Scatter(x=x_line, y=y_line, mode='lines', line=go.scatter.Line(color="red")))
fig.show()

[  74.81262567 -145.50718304]
3413.4518512859563
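
One way to draw a meaningful line on a one-feature plot from a two-feature model is to hold the other feature at a fixed value, for example its mean. The sketch below is an assumption about how one might do this, not part of the original analysis; it copies the coefficients and intercept from the output above and the means from `df.describe()`, and the variable names are illustrative.

```python
import numpy as np

coef_length, coef_depth = 74.81262567, -145.50718304   # from model.coef_
intercept = 3413.4518512859563                         # from model.intercept_
mean_length, mean_depth = 43.992793, 17.164865         # from df.describe()

# Line over bill length with bill depth held at its mean
x_line = np.linspace(30, 60)
y_line = coef_length * x_line + coef_depth * mean_depth + intercept

# Sanity check: at the mean of both features, a least-squares fit with an
# intercept predicts the mean of the target (about 4207 g here)
pred_at_means = coef_length * mean_length + coef_depth * mean_depth + intercept
print(pred_at_means)
```

Passing this `x_line` and `y_line` to `go.Scatter`, as in the cell above, would draw a line that passes through the centre of the bill length data instead of far above it.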
