# Simple Linear Regression

## Step 1: Reading and Understanding the Data

1. Importing data using the pandas library
2. Understanding the structure of the data

In [None]:
# Import the numpy and pandas packages

import numpy as np
import pandas as pd

In [None]:
# Read the given CSV file, and view some sample records

advertising = pd.read_csv(r"C:\Users\praveena\Documents\ML\modeling-regression-metrics\advertising.csv")
advertising.head()

Let's inspect the structure of our dataframe: its shape, its column types, and its summary statistics.

In [None]:
advertising.shape

In [None]:
advertising.info()

In [None]:
advertising.describe()

## Step 2: Visualising the Data

Let's now visualise our data using seaborn. We'll first make a pairplot of all the variables to see which ones are most correlated with `Sales`.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.pairplot(advertising, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4, kind='scatter')
plt.show()


In [None]:
print(advertising.corr())

In [None]:
sns.heatmap(advertising.corr(), cmap="YlGnBu", annot = True)
plt.show()

As is visible from the pairplot and the heatmap, the variable `TV` seems to be most correlated with `Sales`. So let's go ahead and perform simple linear regression using `TV` as our feature variable.
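The same choice can also be made programmatically instead of by reading the heatmap. The sketch below is only an illustration: it builds a small synthetic dataframe (`toy`, with made-up values and a made-up relationship between the columns) rather than loading `advertising.csv`, then ranks the features by their correlation with `Sales`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the advertising data (all values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
})
# By construction, Sales depends strongly on TV and only weakly on Radio.
toy["Sales"] = 7.0 + 0.05 * toy["TV"] + 0.02 * toy["Radio"] + rng.normal(0, 1.0, n)

# Correlation of every feature with Sales, strongest first.
corr_with_sales = toy.corr()["Sales"].drop("Sales").sort_values(ascending=False)
print(corr_with_sales)

# Name of the feature with the highest correlation.
best_feature = corr_with_sales.idxmax()
print("Most correlated feature:", best_feature)
```

On the real data, replacing `toy` with `advertising` should give the same ranking that the heatmap shows.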

---
## Step 3: Performing Simple Linear Regression

Equation of linear regression<br>
$y = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n$

-  $y$ is the response
-  $w_0$ is the intercept
-  $w_1$ is the coefficient for the first feature
-  $w_n$ is the coefficient for the nth feature<br>

In our case:

$y = w_0 + w_1 \times TV$

Here $w_0$ is the intercept and $w_1$ is the slope for `TV`. Together they are called the model **coefficients** or **model parameters**.
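To see where $w_0$ and $w_1$ come from, here is a minimal sketch of the closed-form least-squares solution for a single feature: $w_1$ is the covariance of $x$ and $y$ divided by the variance of $x$, and $w_0 = \bar{y} - w_1\bar{x}$. The data is synthetic (the arrays `x` and `y` are assumptions for illustration, not the advertising data), and `np.polyfit` is used only as a cross-check.

```python
import numpy as np

# Synthetic data: a hypothetical TV spend and a Sales value that follows
# a straight line plus noise (made-up numbers, not the real dataset).
rng = np.random.default_rng(1)
x = rng.uniform(0, 300, 100)
y = 7.0 + 0.05 * x + rng.normal(0, 1.0, 100)

# Closed-form least squares for one feature.
x_mean, y_mean = x.mean(), y.mean()
w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
w0 = y_mean - w1 * x_mean
print(f"w0 = {w0:.3f}, w1 = {w1:.3f}")

# Cross-check against NumPy's degree-1 polynomial fit.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"polyfit: intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Both approaches minimise the same sum of squared errors, so the two pairs of numbers should agree up to floating-point rounding.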

---

We first assign the feature variable, `TV`, in this case, to the variable `X` and the response variable, `Sales`, to the variable `y`.

In [None]:
X = advertising['TV']
y = advertising['Sales']

### Linear Regression using `linear_model` in `sklearn`

We will use the `linear_model` module from `sklearn` to build the model. Before fitting, we split the data into training and test sets so that the model can be evaluated on data it has not seen.

There's one small step that we need to add, though. Even with a single feature, scikit-learn expects the input as a 2D array of shape `(n_samples, n_features)`, so we need to reshape the train and test sets for the fit and the predictions to succeed.
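The difference between the 1D and 2D shapes is easy to check on a tiny example. The sketch below uses a made-up four-value Series named `tv` (an assumption for illustration) and shows two ways to get a single-column 2D input.

```python
import numpy as np
import pandas as pd

# A made-up Series standing in for the TV column.
tv = pd.Series([230.1, 44.5, 17.2, 151.5], name="TV")
print(tv.shape)       # 1D: (4,)

# Option 1: reshape the underlying NumPy array to one column.
tv_2d = tv.values.reshape(-1, 1)
print(tv_2d.shape)    # 2D: (4, 1)

# Option 2: select with a list of column names, which keeps a 2D DataFrame.
frame = pd.DataFrame({"TV": tv})
print(frame[["TV"]].shape)    # 2D: (4, 1)
```

Either form gives one row per sample and one column for the single feature.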

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 42)


# y_train, y_test, X_train, X_test  = train_test_split(y, X, train_size = 0.7, test_size = 0.3, random_state = 42)

`X_train` and `y_train` are used for training (`X` is the input, `y` is the output).

`X_test` and `y_test` are used for testing (`X` is the input, `y` is the output).

In [None]:
X_train.shape   # still 1D here: (n_samples,); scikit-learn needs 2D: (n_samples, n_features)

In [None]:
# Convert the Series to NumPy arrays and reshape
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)



# -1 means: infer the number of rows automatically from the data length
# 1 means: we want exactly one column (one feature)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
from sklearn.linear_model import LinearRegression

# Create a LinearRegression object named lm
lm = LinearRegression()

# Fit the model on the training data using lm.fit()
lm.fit(X_train, y_train)

In [None]:
print(lm.intercept_)
print(lm.coef_)

The equation we get is the same as what we got before!

$Sales = 7.206 + 0.054 \times TV$
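To make the equation concrete, here is a small sketch that plugs a few TV values into it. The intercept and slope are copied from the equation above; the helper `predict_sales` and the TV values are hypothetical, chosen only for illustration. For example, for TV = 100 the line gives 7.206 + 0.054 × 100 = 12.606.

```python
# Intercept and TV coefficient taken from the equation above.
w0 = 7.206
w1 = 0.054

def predict_sales(tv_spend):
    """Predicted Sales for a given TV value under the fitted line."""
    return w0 + w1 * tv_spend

# Hypothetical TV values, chosen only to illustrate the arithmetic.
for tv_spend in (0, 100, 200):
    print(tv_spend, round(predict_sales(tv_spend), 3))
```

At TV = 0 the prediction is just the intercept, and each additional unit of TV raises the predicted Sales by the slope, 0.054.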

In [None]:
y_pred = lm.predict(X_test)

In [None]:
y_test

In [None]:
y_pred

In [None]:
# Evaluate the model on the test set

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

print("R2 Score:", r2_score(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
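It can help to see what these three metrics compute. The sketch below works them out by hand with NumPy on a tiny made-up example (the arrays `y_true` and `y_hat` are assumptions, not the model's real output), following the standard definitions of MSE, MAE, and $R^2$.

```python
import numpy as np

# Made-up actual values and predictions, for illustration only.
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_hat = np.array([11.0, 12.0, 14.0, 18.0])

residuals = y_true - y_hat

# Mean Squared Error: average of the squared residuals.
mse = np.mean(residuals ** 2)

# Mean Absolute Error: average of the absolute residuals.
mae = np.mean(np.abs(residuals))

# R2: 1 minus (unexplained variation / total variation around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("Mean Squared Error:", mse)
print("Mean Absolute Error:", mae)
print("R2 Score:", round(r2, 4))
```

MSE penalises large errors more heavily than MAE because the residuals are squared, and an $R^2$ close to 1 means the predictions explain most of the variation in the actual values.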

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8,6))
plt.scatter(y_test, y_pred, color='blue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # Perfect prediction line (predicted == actual)
plt.xlabel("Actual Sales")
plt.ylabel("Predicted Sales")
plt.title("Actual vs Predicted Sales")
plt.grid(True)
plt.show()
