# Sales Prediction using Linear Regression

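This notebook fits a linear regression model to an enriched sales dataset. The target is a binary `HighSales` indicator (1 when `Sales` exceeds a threshold of 1000, 0 otherwise), so the model predicts a continuous score for that indicator rather than the raw sales amount.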

## 1. Import Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

## 2. Load and Prepare the Dataset

In [None]:
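file_path = '/mnt/data/Enriched_Clean_Data.csv'
# The file is semicolon-delimited and uses '.' as the thousands separator.
data = pd.read_csv(file_path, sep=';', thousands='.')

# Coerce Sales to numeric; values that cannot be parsed become NaN and are dropped.
data['Sales'] = pd.to_numeric(data['Sales'], errors='coerce')
# The explicit copy avoids a possible SettingWithCopyWarning on the assignment below.
data = data.dropna(subset=['Sales']).copy()

# Binary target: 1 if Sales exceeds the threshold, 0 otherwise.
threshold = 1000
data['HighSales'] = (data['Sales'] > threshold).astype(int)

# Candidate feature columns; only those named in the preprocessor (Section 3) reach the model.
features = [
    'Ship_Date', 'Ship_Mode', 'Customer_ID', 'Segment', 'City', 'State', 'Region',
    'Postal_Code', 'Product_ID', 'Category', 'Sub_Category', 'shipping_costs',
    'number_of_products_in_warehouse', 'production_costs_per_unit'
]
X = data[features]
y = data['HighSales']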
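The effect of the coercion and threshold steps can be checked on a small frame. The sketch below uses made-up `Sales` values (they are not taken from the dataset) and repeats the same three operations; the class-share check at the end is an addition for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({'Sales': ['1500', '250', 'n/a', '1000', '3200']})

# Same steps as in the cell above: coerce, drop unparseable rows, apply the threshold.
toy['Sales'] = pd.to_numeric(toy['Sales'], errors='coerce')
toy = toy.dropna(subset=['Sales']).copy()
threshold = 1000
toy['HighSales'] = (toy['Sales'] > threshold).astype(int)

high_sales_labels = toy['HighSales'].tolist()

# Share of each class; a strongly skewed share would make the metrics harder to read.
class_share = toy['HighSales'].value_counts(normalize=True)
print(toy)
print(class_share)
```

Note that the comparison is strict, so a row with `Sales` exactly equal to the threshold is labeled 0.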

## 3. Preprocess the Data

In [None]:
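# Columns to one-hot encode.
categorical_features = [
    'Ship_Mode', 'Segment', 'City', 'State', 'Region', 'Category', 'Sub_Category'
]
# Columns to standardize (zero mean, unit variance).
numerical_features = [
    'shipping_costs', 'number_of_products_in_warehouse', 'production_costs_per_unit'
]
# Columns in X that appear in neither list are dropped by ColumnTransformer's default remainder='drop'.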

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('num', StandardScaler(), numerical_features)
    ])
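To make the behavior of the preprocessor concrete, here is a minimal sketch on a hypothetical three-column frame. The column values and the names `toy_preprocessor` and `to_dense` are assumptions introduced for the example. It shows that unlisted columns are dropped and that a category unseen during fitting is encoded as all zeros because of `handle_unknown='ignore'`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one categorical, one numerical, one unlisted column.
toy = pd.DataFrame({
    'Ship_Mode': ['Standard', 'Express', 'Standard', 'Same Day'],
    'shipping_costs': [5.0, 12.0, 6.0, 20.0],
    'Customer_ID': ['A1', 'B2', 'C3', 'D4'],  # not passed to any transformer
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Ship_Mode']),
        ('num', StandardScaler(), ['shipping_costs']),
    ])  # remainder='drop' is the default, so Customer_ID is discarded


def to_dense(matrix):
    """Return a NumPy array whether the transformer produced sparse or dense output."""
    return matrix.toarray() if hasattr(matrix, 'toarray') else np.asarray(matrix)


encoded = to_dense(toy_preprocessor.fit_transform(toy))
print(encoded.shape)  # three one-hot columns plus one scaled column

# A category that was not seen during fit yields zeros in every one-hot column.
unseen = pd.DataFrame({
    'Ship_Mode': ['Overnight'],
    'shipping_costs': [10.0],
    'Customer_ID': ['E5'],
})
unseen_encoded = to_dense(toy_preprocessor.transform(unseen))
print(unseen_encoded)
```

With high-cardinality columns such as `City`, the one-hot block can become very wide, which is worth keeping in mind when reading the fitted coefficients.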

## 4. Train the Linear Regression Model

In [None]:
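# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Chain preprocessing and the regressor so both are fit on the training data only.
linear_model = Pipeline(steps=[('preprocessor', preprocessor),
                               ('regressor', LinearRegression())])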

linear_model.fit(X_train, y_train)

## 5. Evaluate the Model

In [None]:
# Predictions are continuous scores, even though the target is 0/1.
y_pred_linear = linear_model.predict(X_test)

mse_linear = mean_squared_error(y_test, y_pred_linear)
mae_linear = mean_absolute_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)

print(f"Mean Squared Error (MSE): {mse_linear:.4f}")
print(f"Mean Absolute Error (MAE): {mae_linear:.4f}")
print(f"R-squared (R²): {r2_linear:.4f}")
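Since the target is a 0/1 indicator, the continuous predictions can also be turned into class labels and scored with a classification metric. The sketch below uses hypothetical prediction values and an assumed cutoff of 0.5; neither comes from the notebook, and a different cutoff may suit the data better.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical true labels and continuous model outputs, invented for illustration.
y_true = np.array([0, 1, 1, 0, 1])
y_score = np.array([-0.1, 0.8, 0.4, 0.3, 1.2])  # linear outputs can fall outside [0, 1]

# Assumed cutoff: scores at or above 0.5 are labeled as high sales.
cutoff = 0.5
y_label = (y_score >= cutoff).astype(int)

accuracy = accuracy_score(y_true, y_label)
mse = mean_squared_error(y_true, y_score)

print(f"Accuracy at cutoff {cutoff}: {accuracy:.2f}")
print(f"MSE of the raw scores: {mse:.3f}")
```

Reporting accuracy alongside MSE, MAE, and R² makes the results easier to interpret for a binary target, because the regression metrics measure distance from 0 and 1 rather than the number of correct labels.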

## 6. Conclusion

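In this notebook, we built a pipeline that one-hot encodes the categorical features, standardizes the numerical features, and fits a linear regression model to the binary `HighSales` indicator. The model was evaluated on a held-out 20% test split using Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R²).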