# Sales Prediction using Linear Regression with Specified Categorical Features

This notebook builds a linear regression model that predicts sales from a fixed set of categorical features: shipping month, segment, state, category, sub-category, city, and region.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
import numpy as np

## Load the dataset

In [None]:
# Load the dataset
file_path = '/mnt/data/Clean_Data.csv'
data = pd.read_csv(file_path)
data.head()

## Data Preprocessing

In [None]:
# Convert 'Ship_Date' to datetime and extract the month
data['Ship_Date'] = pd.to_datetime(data['Ship_Date'])
data['Ship_Month'] = data['Ship_Date'].dt.month
data['Ship_Month'] = data['Ship_Month'].astype('category')

In [None]:
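# Encode all categorical variables as integer codes.
# Caveat: LabelEncoder assigns codes in sorted order, so a linear model
# will treat these unordered categories as if they were ordered numbers.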
categorical_columns = ['Ship_Month', 'Segment', 'State', 'Category', 'Sub_Category', 'City', 'Region']

label_encoders = {}
for column in categorical_columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

data.head()
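
Integer label codes impose an order on categories that have none, which matters for a linear model because it fits a single slope per column. The sketch below contrasts label encoding with one-hot encoding on a toy `Region` column. The values are made up for illustration and are not taken from `Clean_Data.csv`, and the names `toy`, `codes`, and `one_hot` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values, not from Clean_Data.csv)
toy = pd.DataFrame({"Region": ["West", "East", "South", "East", "Central"]})

# LabelEncoder sorts the categories and numbers them:
# Central=0, East=1, South=2, West=3
codes = LabelEncoder().fit_transform(toy["Region"])

# One-hot encoding creates one 0/1 column per category instead.
# drop_first=True drops the first category ("Central") to avoid
# perfectly collinear columns when the model also fits an intercept.
one_hot = pd.get_dummies(toy["Region"], prefix="Region", drop_first=True, dtype=int)

print(list(codes))              # [3, 1, 2, 1, 0]
print(list(one_hot.columns))    # ['Region_East', 'Region_South', 'Region_West']
```

With the integer codes, the model is forced to assume that moving from "Central" to "East" changes sales by the same amount as moving from "East" to "South". With one-hot columns, each category gets its own coefficient.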

## Feature Selection

In [None]:
# Select the final features
final_features = ['Ship_Month', 'Segment', 'State', 'Category', 'Sub_Category', 'City', 'Region']

X_final_specified = data[final_features]
y_final_specified = data['Sales']

X_final_specified.head()

## Splitting the Data

In [None]:
# Split the data into training and testing sets
X_train_final_specified, X_test_final_specified, y_train_final_specified, y_test_final_specified = train_test_split(
    X_final_specified,
    y_final_specified,
    test_size=0.2,
    random_state=42,
)

## Model Training

In [None]:
# Initialize and train the Linear Regression model
model_final_specified = LinearRegression()
model_final_specified.fit(X_train_final_specified, y_train_final_specified)
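
As an alternative to fitting on integer codes, the encoder and the regression can be wrapped in a single scikit-learn pipeline so that encoding is learned from the training rows only. The following is a minimal sketch on toy data, not the notebook's actual method. The rows, the `cat_cols` list, and the step names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows (hypothetical values, not from Clean_Data.csv)
train = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Consumer", "Home Office"],
    "Region": ["West", "East", "East", "South"],
    "Sales": [100.0, 250.0, 120.0, 300.0],
})
# "Central" never appears in the training rows
test = pd.DataFrame({"Segment": ["Consumer"], "Region": ["Central"]})

cat_cols = ["Segment", "Region"]

pipeline = Pipeline([
    # handle_unknown="ignore" encodes unseen categories as all zeros
    # instead of raising an error at prediction time
    ("encode", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])),
    ("regress", LinearRegression()),
])

pipeline.fit(train[cat_cols], train["Sales"])
pred = pipeline.predict(test)
print(pred.shape)  # (1,)
```

The `handle_unknown="ignore"` setting is relevant for high-cardinality columns such as `City`, where a random 80/20 split can easily place a city in the test set that never occurs in the training set.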

## Making Predictions and Evaluating the Model

In [None]:
# Make predictions
y_pred_final_specified = model_final_specified.predict(X_test_final_specified)

# Evaluate the model
mae_final_specified = mean_absolute_error(y_test_final_specified, y_pred_final_specified)
mse_final_specified = mean_squared_error(y_test_final_specified, y_pred_final_specified)
rmse_final_specified = np.sqrt(mse_final_specified)
r2_final_specified = r2_score(y_test_final_specified, y_pred_final_specified)

final_specified_evaluation_results = {
    "Mean Absolute Error": mae_final_specified,
    "Mean Squared Error": mse_final_specified,
    "Root Mean Squared Error": rmse_final_specified,
    "R^2 Score": r2_final_specified
}

final_specified_evaluation_results
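
Error metrics are easier to judge next to a trivial baseline. The sketch below scores a model that always predicts the training mean, using scikit-learn's `DummyRegressor`. The arrays are toy values chosen for illustration, not results from this dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy targets (hypothetical values); the features are irrelevant to the baseline
y_train = np.array([10.0, 20.0, 30.0, 40.0])
y_test = np.array([15.0, 25.0, 35.0])
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predicts the mean of y_train (25.0 here)
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
y_base = baseline.predict(X_test)

baseline_results = {
    "Mean Absolute Error": mean_absolute_error(y_test, y_base),
    "Root Mean Squared Error": np.sqrt(mean_squared_error(y_test, y_base)),
    "R^2 Score": r2_score(y_test, y_base),
}
print(baseline_results)
```

A regression model is only useful to the extent that its MAE and RMSE fall below the baseline's and its R² rises above it.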

## Interpretation of the R² Score

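The R² score is the proportion of the variance in test-set sales that the model explains using the specified categorical features. A value near 1 means the predictions track actual sales closely, a value near 0 means the model does little better than always predicting mean sales, and a negative value means it does worse than that. If the score is low, one likely contributor is the integer label encoding, which makes the linear model treat unordered categories such as State or City as ordered numeric values. One-hot encoding the features or trying a non-linear model may improve the fit.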
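
To make the definition concrete, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses toy arrays (hypothetical values) and checks the result against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy actual and predicted values (hypothetical)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 10.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model: 2.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean: 20.0
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                  # 0.875
print(r2_score(y_true, y_pred))   # same value from scikit-learn
```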