### 🧪 PROJECT: Predicting Sales using TV, Radio, and Newspaper Ads

##### ✅ STEP 1: Load the Data

In [None]:
import pandas as pd
df = pd.read_csv("../dataset/advertising.csv")

##### 🔍 STEP 2: Understand the Data

In [None]:
print(df.head())      # look at the first few rows
df.info()             # check data types and missing values (prints directly)
print(df.describe())  # summary stats

#### 📊 Project Overview: Advertising Dataset

We have 4 columns in this dataset:

- **TV**: Money spent on TV marketing (in thousands of dollars)
- **Radio**: Money spent on radio marketing (in thousands of dollars)
- **Newspaper**: Money spent on newspaper marketing (in thousands of dollars)
- **Sales**: Units sold (in thousands)

---

#### 🔍 Variable Types

- **TV, Radio, Newspaper** are **independent variables (X)**
- **Sales** is the **dependent variable (Y)** — our target

---

#### ❓ Key Questions

- Do we need multiple linear regression at all? Having several variables does not by itself justify it.
- Do **TV, Radio, and Newspaper** each have a relationship with **Sales**?
- Are any of them **useless** (i.e., have no correlation or effect)?
- Do any of them **overlap too much** (i.e., multicollinearity)?
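
These questions can be explored numerically with a correlation matrix. The sketch below is only an illustration: it uses synthetic stand-in data rather than the real `advertising.csv`, and the `demo` frame and the rule that generates its `Sales` column are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real advertising.csv), for illustration only
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
})
# Hypothetical rule: Sales depends on TV and Radio but not on Newspaper
demo["Sales"] = 5 + 0.05 * demo["TV"] + 0.1 * demo["Radio"] + rng.normal(0, 1, n)

corr = demo.corr()

# How strongly is each feature correlated with the target?
target_corr = corr["Sales"].drop("Sales").sort_values(ascending=False)
print(target_corr)

# How strongly are the features correlated with each other (overlap)?
feature_corr = corr.drop(index="Sales", columns="Sales")
print(feature_corr)
```

A feature whose correlation with `Sales` is near zero is a candidate to drop, and large off-diagonal values in `feature_corr` would hint at multicollinearity.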

---

#### 💡 Assumptions

- We are assuming **Sales** is measured in **thousands of units**, matching the column description above.
- The ad spending columns are in **thousands of USD**.
- Since it is an advertising dataset, it tracks how money spent on **TV, Radio, and Newspaper ads** affects **sales**.

---

#### 🎯 Objective

> To analyze how spending more on **TV, Radio, or Newspaper** ads affects how many **units we sell**.

---

#### ✅ Data Quality Check

- No null or missing values are present.
- Proceeding to **data visualization and exploration**.
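
A missing-value check like the one summarized above can be written in a couple of lines. This is a minimal sketch on a tiny made-up frame (named `demo` here, with one deliberate gap), not on the real dataset. On the notebook's `df` the same calls would apply.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with one deliberate gap (NOT the real dataset)
demo = pd.DataFrame({
    "TV": [100.0, 50.0, np.nan],
    "Sales": [20.0, 10.0, 5.0],
})

# Count missing values in each column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(demo.isnull().values.any())
print("Any missing values:", has_missing)
```

If every count is zero, no imputation or row dropping is needed before modeling.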


##### 📊 STEP 3: Visualize Relationships (EDA)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# sns.pairplot(df)  # scatter plots between each pair
sns.heatmap(df.corr(), annot=True)  # correlation matrix


##### From the heatmap above
- TV vs Sales --> ✅ Very strong positive relationship: higher TV ad spend goes with higher sales.
- Radio vs Sales --> ✅ Moderate positive relationship: radio also helps, but not as much as TV.
- Newspaper vs Sales --> ⚠️ Weak relationship: newspaper spend is only weakly correlated with sales.
- TV vs Radio/Newspaper --> 🔄 Very low: these ad budgets are not strongly correlated with each other.

##### 🎯 So what can you learn?
- TV ads are the most powerful predictor of Sales.

- Newspaper ads seem almost useless in comparison (and may not be needed in the model).

- No strong multicollinearity between features — which is good.
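
The heatmap only shows pairwise correlations. One common complementary check, which this notebook itself does not use, is the variance inflation factor (VIF). The sketch below computes it with plain NumPy on synthetic data. The helper name `vif` and both example arrays are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with an intercept)
        A = np.column_stack([np.ones(len(others)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residuals = target - A @ coef
        r2 = 1 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)

# Three unrelated synthetic features: VIF values should sit close to 1
independent = rng.normal(size=(200, 3))
vif_independent = vif(independent)
print(vif_independent)

# Make the third feature almost a copy of the first: its VIF becomes large
collinear = independent.copy()
collinear[:, 2] = collinear[:, 0] + 0.1 * rng.normal(size=200)
vif_collinear = vif(collinear)
print(vif_collinear)
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity, though the threshold is a convention rather than a hard rule.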

We keep only TV because it has the strongest relationship with Sales. With one independent variable and one dependent variable, this is a simple linear regression.

##### 🧮 STEP 4: Choosing Simple Linear Regression

In [None]:
X = df[['TV']]
y = df['Sales']


##### ⚙️ STEP 5: Split the Data



In [None]:
from sklearn.model_selection import train_test_split
# 70% of the rows go to training, 30% to testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)


##### 🧠 STEP 6: Train the Model

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)

##### 🧾 STEP 7: Check Model Coefficients


In [None]:
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
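
To see what the intercept and coefficient mean, it helps to rebuild a prediction by hand: in simple linear regression the prediction is `intercept + coefficient * TV`. The sketch below uses made-up numbers that lie exactly on a known line, and the names `tv_spend`, `sales`, and `demo_model` are assumptions for illustration, not values from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up numbers lying exactly on the line sales = 7 + 0.05 * tv_spend
tv_spend = np.array([[0.0], [100.0], [200.0], [300.0]])
sales = 7.0 + 0.05 * tv_spend.ravel()

demo_model = LinearRegression().fit(tv_spend, sales)
b0 = demo_model.intercept_   # predicted sales when TV spend is zero
b1 = demo_model.coef_[0]     # change in sales per one-unit increase in TV spend
print("Intercept:", b0, "Slope:", b1)

# Rebuild one prediction by hand and compare with model.predict
new_spend = 150.0
manual = b0 + b1 * new_spend
from_model = demo_model.predict(np.array([[new_spend]]))[0]
print(manual, from_model)
```

The two printed values should agree up to floating-point rounding, which confirms that `predict` is just evaluating the fitted line.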


##### 📈 STEP 8: Make Predictions



In [None]:
y_pred = model.predict(X_test)
y_pred


##### 📊 STEP 9: Evaluate the Model

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))

# MAE (mean absolute error) and RMSE are in the same units as Sales: use them to see how far off predictions are on average
# MSE is the average squared error, so it penalizes large misses more heavily than MAE
# R2 measures model performance: how much of the variance in Sales the model explains (closer to 1 = better)
# Strictly speaking, LinearRegression fits by ordinary least squares, i.e. it minimizes the squared errors
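
The metrics described in the comments above can be computed by hand, which makes their definitions concrete. This is a minimal sketch on four made-up values (`actual` and `predicted` are assumptions for illustration, not results from the model).

```python
import numpy as np

# Made-up values chosen so the arithmetic is easy to follow
actual = np.array([10.0, 12.0, 14.0, 16.0])
predicted = np.array([11.0, 11.0, 15.0, 15.0])

errors = actual - predicted                     # [-1, 1, -1, 1]

mae = np.mean(np.abs(errors))                   # average absolute miss
mse = np.mean(errors ** 2)                      # average squared miss
rmse = np.sqrt(mse)                             # back in the units of the target

ss_res = np.sum(errors ** 2)                    # unexplained variation: 4
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation: 20
r2 = 1 - ss_res / ss_tot                        # share of variation explained: 1 - 4/20

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R2:", r2)
```

On real data, `sklearn.metrics.mean_absolute_error`, `mean_squared_error`, and `r2_score` compute the same quantities.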

##### 📉 STEP 10: Visualize Predictions


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Sales")
plt.ylabel("Predicted Sales")
plt.title("Actual vs Predicted")
plt.grid(True)

# Save the plot before showing it
plt.savefig("../images/actual_vs_predicted.png")  # <-- saves the plot
plt.show()
