<a href="https://colab.research.google.com/github/Neavy1/AnyoneAI/blob/main/8_1_1_PRACTICE_Black_Friday_Sales_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning: Linear Regression

## Black Friday Sales Prediction

We are going to use a dataset of product purchases made during Black Friday in the US. The goal is to build a predictor for the `purchase amount`.

To build a good predictor, we must apply the concepts we have been learning:

* `Exploration`
* `Feature Engineering`
* `Modeling`
* `Evaluation`

The dataset is a sample of the transactions made in a retail store. The store wants to better understand customer `purchase` behaviour across different products. This is a `regression problem`: we are trying to predict the dependent variable (the purchase amount) from the information contained in the other variables.

### You can try different Scikit-Learn models from [Linear Models](https://scikit-learn.org/1.5/modules/linear_model.html)
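As a rough sketch of how several linear models could be compared, the snippet below scores `LinearRegression`, `Ridge`, and `Lasso` with 5-fold cross-validation and reports RMSE, which is expressed in the same units as the target. It runs on synthetic data from `make_regression`, not on the Black Friday file, and the `alpha` values and the `X_demo`/`y_demo` names are arbitrary placeholders rather than tuned or recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (assumption: not the Black Friday data)
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=42)

# alpha values are illustrative placeholders, not tuned
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}

cv_rmse = {}
for name, estimator in candidates.items():
    # The scaler sits inside the pipeline, so it is refit on each training fold only
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(
        pipeline, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()  # scikit-learn reports negated RMSE, so flip the sign
    print(f"{name}: mean CV RMSE = {cv_rmse[name]:.2f}")
```

Swapping `X_demo` and `y_demo` for the encoded Black Friday features and the `Purchase` column should give a like-for-like comparison of the same models on the real data.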

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

!gdown "1HZ_nk9Q0xp-qbAtXZvDxK5VNVfavt7ph"

data = pd.read_csv("BlackFriday.csv")
data.sample(5)

Downloading...
From: https://drive.google.com/uc?id=1HZ_nk9Q0xp-qbAtXZvDxK5VNVfavt7ph
To: /content/BlackFriday.csv
100% 25.0M/25.0M [00:00<00:00, 41.1MB/s]


| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 277275 | 1000765 | P00296042 | M | 26-35 | 17 | B | 1 | 0 | 8 | 13.0 | 16.0 | 2085 |
| 111658 | 1005214 | P00112142 | F | 36-45 | 9 | C | 1 | 0 | 1 | 2.0 | 14.0 | 4004 |
| 394497 | 1000737 | P00249642 | M | 0-17 | 19 | A | 2 | 0 | 3 | 5.0 | NaN | 13467 |
| 528921 | 1003511 | P00120042 | M | 51-55 | 0 | C | 2 | 1 | 1 | 2.0 | NaN | 15340 |
| 438050 | 1001426 | P00119142 | M | 18-25 | 14 | B | 3 | 0 | 3 | 4.0 | 12.0 | 8393 |
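A typical first exploration step is to compare the average `Purchase` across groups. The sketch below does this for the five sampled rows shown above, restricted to a few columns; the `sample_rows` name is only for illustration.

```python
import pandas as pd

# The five sampled rows displayed above, restricted to a few columns
sample_rows = pd.DataFrame({
    "Gender": ["M", "F", "M", "M", "M"],
    "Age": ["26-35", "36-45", "0-17", "51-55", "18-25"],
    "City_Category": ["B", "C", "A", "C", "B"],
    "Purchase": [2085, 4004, 13467, 15340, 8393],
})

# Average purchase amount per group
mean_by_gender = sample_rows.groupby("Gender")["Purchase"].mean()
mean_by_city = sample_rows.groupby("City_Category")["Purchase"].mean()

print(mean_by_gender)
print(mean_by_city)
```

Five rows are far too few to draw conclusions. On the full dataset the equivalent call would be `data.groupby("Gender")["Purchase"].mean()`.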


In [None]:
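# Libraries and data are reloaded here so this cell can also run on its own
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("BlackFriday.csv")

# Data Exploration
data.info()
# Only the last expression of a cell is displayed automatically, so print the rest
print(data.describe())

# Handling Missing Values
# Nulls occur only in Product_Category_2 and Product_Category_3; fill them with 0
print(data.isnull().sum())
data.fillna(0, inplace=True)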


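# Feature Engineering: Convert Categorical features to numerical using One-Hot Encoding
# User_ID and Product_ID are identifiers; one-hot encoding Product_ID would add one column
# per distinct product, so both are dropped before encoding
data = data.drop(columns=['User_ID', 'Product_ID'])
categorical_cols = data.select_dtypes(include=['object']).columns
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)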


# Define Features (X) and Target (y)
X = data.drop('Purchase', axis=1)
y = data['Purchase']

# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


# Model Training (Linear Regression as an example)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

# Model Evaluation
from sklearn.metrics import mean_squared_error, r2_score
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 537577 entries, 0 to 537576
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     537577 non-null  int64  
 1   Product_ID                  537577 non-null  object 
 2   Gender                      537577 non-null  object 
 3   Age                         537577 non-null  object 
 4   Occupation                  537577 non-null  int64  
 5   City_Category               537577 non-null  object 
 6   Stay_In_Current_City_Years  537577 non-null  object 
 7   Marital_Status              537577 non-null  int64  
 8   Product_Category_1          537577 non-null  int64  
 9   Product_Category_2          370591 non-null  float64
 10  Product_Category_3          164278 non-null  float64
 11  Purchase                    537577 non-null  int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 49.2+ MB
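The `data.info()` output shows nulls only in `Product_Category_2` and `Product_Category_3`. Filling them with 0 lets the model run, but the information about which rows were originally missing is lost. One possible alternative, sketched below on a tiny hand-made frame, keeps the fill and also adds a 0/1 flag column per filled feature. The `add_missing_flags` helper and the `_missing` suffix are hypothetical names, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd


def add_missing_flags(frame, columns, fill_value=0):
    """Return a copy where each listed column is filled and gets a 0/1 `<col>_missing` flag."""
    result = frame.copy()
    for col in columns:
        # Record where the value was missing before overwriting it
        result[f"{col}_missing"] = result[col].isna().astype(int)
        result[col] = result[col].fillna(fill_value)
    return result


# Tiny illustrative frame (made-up values, not rows from BlackFriday.csv)
demo = pd.DataFrame({
    "Product_Category_2": [13.0, np.nan, 5.0],
    "Product_Category_3": [16.0, np.nan, np.nan],
})

flagged = add_missing_flags(demo, ["Product_Category_2", "Product_Category_3"])
print(flagged)
```

Whether the extra flag columns actually help is an empirical question; comparing test-set scores with and without them would be one way to check.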
