# **Big Sales Prediction using Random Forest Regressor**

-------------


## **Objective**
This project builds a machine learning model with a Random Forest Regressor 
to predict sales from influencing factors such as store type, location, and product category.


## **Data Source**

## **Import Library**

## **Import Data**

## **Describe Data**

## **Data Visualization**

## **Data Preprocessing**

## **Define Target Variable (y) and Feature Variables (X)**

## **Train Test Split**

## **Modeling**

## **Model Evaluation**

## **Prediction**

## **Explanation**


## **Dataset Information**
The dataset contains sales-related information, with features such as 
store type, location, product category, and historical sales data.



## **Data Preprocessing**
1. Handling missing values in numeric columns by filling them with the column median.
2. Encoding categorical features using Label Encoding.
3. Splitting the dataset into training and testing sets.
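
The three steps above can be sketched on a toy frame. The column names (`Store_Type`, `Item_Weight`, `Sales`) and all values below are invented for illustration and are not taken from the project's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy data with hypothetical column names; values are made up.
toy = pd.DataFrame({
    "Store_Type": ["A", "B", "A", "C", "B"],
    "Item_Weight": [10.0, np.nan, 14.0, 12.0, np.nan],
    "Sales": [100.0, 150.0, 120.0, 90.0, 160.0],
})

# 1. Fill missing numeric values with each column's median.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# 2. Label-encode the categorical column (classes are sorted: A=0, B=1, C=2).
encoder = LabelEncoder()
toy["Store_Type"] = encoder.fit_transform(toy["Store_Type"])

# 3. Hold out 20% of the rows for testing.
X_toy = toy.drop(columns=["Sales"])
y_toy = toy["Sales"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
```

Label encoding assigns integer codes that imply an order between categories. Tree-based models such as random forests are usually tolerant of this, but other model families may call for one-hot encoding instead.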



## **Model Training and Evaluation**
- A Random Forest Regressor (100 trees, `random_state=42`) is trained on the training set.
- Evaluation metrics computed on the test set:
  - Mean Absolute Error (MAE)
  - Mean Squared Error (MSE)
  - Root Mean Squared Error (RMSE)
  - R-squared Score (R2)
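
For reference, the four metrics can be computed by hand. The sketch below uses made-up numbers and plain Python so the definitions are visible; the notebook itself relies on the `sklearn.metrics` functions.

```python
import math

# Invented values for illustration only.
y_true = [3.0, 5.0, 2.0, 7.0]
y_hat = [2.0, 5.0, 4.0, 8.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]  # [1.0, 0.0, -2.0, -1.0]

# MAE: mean of the absolute errors.
mae = sum(abs(e) for e in errors) / n  # (1 + 0 + 2 + 1) / 4 = 1.0

# MSE: mean of the squared errors.
mse = sum(e * e for e in errors) / n  # (1 + 0 + 4 + 1) / 4 = 1.5

# RMSE: square root of MSE, expressed in the same units as the target.
rmse = math.sqrt(mse)

# R2: 1 minus (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.4f}, R2={r2:.4f}")
```

MAE and RMSE are in the units of the target, which makes them easier to interpret than MSE. R2 is unitless, and a value of 1 corresponds to a perfect fit.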


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

# Load Dataset (Assuming dataset is uploaded)
df = pd.read_csv('big_sales_data.csv')  # Replace with actual dataset path

# Exploratory Data Analysis
print(df.head())
df.info()  # info() prints its summary directly and returns None
print(df.describe())

# Handling Missing Values (the median is defined for numeric columns only)
df.fillna(df.median(numeric_only=True), inplace=True)

# Encoding Categorical Variables (one LabelEncoder per column, kept for later decoding)
label_encoders = {}
for col in df.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Define Target Variable (y) and Feature Variables (X)
X = df.drop(columns=['Sales'])  # Replace 'Sales' with actual target variable name
y = df['Sales']

# Splitting Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Training
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Model Prediction
y_pred = rf_model.predict(X_test)

# Model Evaluation
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"R2 Score: {r2}")

# Feature Importance
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': rf_model.feature_importances_})
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10,6))
sns.barplot(x=feature_importances['Importance'], y=feature_importances['Feature'])
plt.title('Feature Importance')
plt.show()

# Save Model
joblib.dump(rf_model, 'big_sales_rf_model.pkl')
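
A saved model can later be reloaded with `joblib.load` and used for prediction. The sketch below is self-contained: it trains a throwaway model on random data and writes it to a temporary directory, so the file name `demo_rf_model.pkl` and the data are stand-ins rather than the project's own artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Random stand-in data; not related to the sales dataset.
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_rf_model.pkl")  # hypothetical file name
    joblib.dump(model, path)       # serialize the fitted model
    restored = joblib.load(path)   # read it back into memory

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```

If new raw data has to be encoded the same way as the training data, the fitted `label_encoders` dictionary would need to be saved alongside the model.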
