# Sales Prediction

## 1. Introduction
This notebook tackles the Rossmann Store Sales prediction challenge. The goal is to forecast daily sales for Rossmann stores. We will use a sample of the dataset to perform data analysis, feature engineering, and build a regression model to predict sales.

## 2. Data Loading and Initial Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('sales.csv')

# Display the first few rows
df.head()

In [None]:
# Get a summary of the dataframe
df.info()

## 3. Data Cleaning and Preprocessing

In [None]:
# Convert 'Date' column to datetime objects
df['Date'] = pd.to_datetime(df['Date'])

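# Encode 'StateHoliday' as integers ('0' -> 0, 'a' -> 1, 'b' -> 2, 'c' -> 3).
# Cast to str first so that integer 0 and string '0' share one code;
# any unexpected value becomes NaN and shows up in the missing-value check below.
df['StateHoliday'] = df['StateHoliday'].astype(str).map({'0': 0, 'a': 1, 'b': 2, 'c': 3})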

# Check for missing values
df.isnull().sum()
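
To see why the cast to string matters, here is a small sketch on a made-up column (the values are illustrative and not taken from `sales.csv`). When a CSV column mixes `0` with letters, pandas may read some zeros as integers and others as strings; casting first lets a single mapping cover both.

```python
import pandas as pd

# Toy column mixing integer 0, string '0', and letter codes (illustrative values only)
toy = pd.DataFrame({'StateHoliday': [0, '0', 'a', 'b', 'c']})

# Cast to string so 0 and '0' collapse to the same key, then map to integers
holiday_codes = {'0': 0, 'a': 1, 'b': 2, 'c': 3}
toy['StateHoliday'] = toy['StateHoliday'].astype(str).map(holiday_codes)

print(toy['StateHoliday'].tolist())
```

Without the cast, the integer `0` would not match the key `'0'` and would need its own entry in the mapping.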

## 4. Feature Engineering

In [None]:
# Extract time-based features from the 'Date' column
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['WeekOfYear'] = df['Date'].dt.isocalendar().week.astype(int)

# Drop the original 'Date' column as it's no longer needed
df.drop(columns=['Date'], inplace=True)

df.head()
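
One detail worth knowing about `isocalendar().week`: ISO weeks do not always fall in the calendar year given by `dt.year`. The sketch below uses two dates chosen for illustration and adds a hypothetical `IsoYear` column to show the mismatch near year boundaries.

```python
import pandas as pd

# Two dates near a year boundary (chosen for illustration)
dates = pd.DataFrame({'Date': pd.to_datetime(['2013-12-31', '2016-01-01'])})

iso = dates['Date'].dt.isocalendar()  # columns: year, week, day
dates['Year'] = dates['Date'].dt.year
dates['IsoYear'] = iso['year'].astype(int)      # hypothetical extra column
dates['WeekOfYear'] = iso['week'].astype(int)

print(dates[['Date', 'Year', 'IsoYear', 'WeekOfYear']])
```

Here 2013-12-31 lands in week 1 of ISO year 2014, and 2016-01-01 lands in week 53 of ISO year 2015. If `Year` and `WeekOfYear` are used together as features, it may be worth keeping this in mind when interpreting them.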

## 5. Exploratory Data Analysis (EDA)

In [None]:
# Sales distribution
sns.histplot(df['Sales'], kde=True)
plt.title('Distribution of Sales')
plt.show()

In [None]:
# Sales vs. DayOfWeek
sns.boxplot(x='DayOfWeek', y='Sales', data=df)
plt.title('Sales vs. Day of Week')
plt.show()
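
A numeric table can complement the boxplot when exact medians or counts are needed. The following is a minimal sketch on made-up figures; on the real data the same `groupby` call would be applied to `df`.

```python
import pandas as pd

# Made-up sales figures for illustration; real values come from the dataset
toy = pd.DataFrame({
    'DayOfWeek': [1, 1, 2, 2, 7, 7],
    'Sales': [5000, 7000, 4000, 6000, 0, 0],
})

# Count, median, and mean of sales for each day of the week
summary = toy.groupby('DayOfWeek')['Sales'].agg(['count', 'median', 'mean'])
print(summary)
```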

## 6. Model Building and Training

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

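# Define features and target.
# If the sample includes a 'Customers' column, it is excluded as well: customer
# counts are only known after the day is over, so using them would leak the target.
X = df.drop(columns=['Sales', 'Customers'], errors='ignore')
y = df['Sales']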

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
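
Because sales are a time series, a random split lets the model train on days that come after some of the test days. One possible alternative, sketched here under assumptions, is to hold out the most recent rows instead. The helper `chronological_split` is hypothetical (not part of scikit-learn) and relies on the `Year`, `Month`, and `Day` columns created in section 4.

```python
import pandas as pd

def chronological_split(frame, target='Sales', test_fraction=0.2):
    """Hypothetical helper: hold out the most recent rows as the test set."""
    ordered = frame.sort_values(['Year', 'Month', 'Day'], kind='stable')
    n_test = int(round(len(ordered) * test_fraction))
    cutoff = len(ordered) - n_test
    train, test = ordered.iloc[:cutoff], ordered.iloc[cutoff:]
    return (train.drop(columns=[target]), test.drop(columns=[target]),
            train[target], test[target])

# Toy frame with ten consecutive days in shuffled order (illustrative values only)
toy = pd.DataFrame({
    'Year': [2015] * 10,
    'Month': [7] * 10,
    'Day': [10, 3, 7, 1, 9, 4, 2, 8, 6, 5],
    'Sales': [100, 30, 70, 10, 90, 40, 20, 80, 60, 50],
})

X_tr, X_te, y_tr, y_te = chronological_split(toy)
print(X_te['Day'].tolist())  # the two latest days end up in the test set
```

With several stores per date, the cutoff can fall in the middle of a day; splitting on a date threshold rather than a row count would avoid that.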

## 7. Model Evaluation

In [None]:
# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
print(f'R-squared (R2): {r2:.2f}')
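
To make the two metrics concrete, the sketch below recomputes them by hand with NumPy on three made-up values, small enough to check the arithmetic manually. The variable names are illustrative and do not touch the notebook's `y_test` or `y_pred`.

```python
import numpy as np

# Tiny made-up example so the arithmetic can be checked by hand
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 310.0])

residuals = y_true - y_hat
rmse_manual = np.sqrt(np.mean(residuals ** 2))   # square root of the mean squared error

ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(f'RMSE: {rmse_manual:.2f}, R2: {r2_manual:.3f}')
```

Every prediction is off by 10, so the RMSE is 10 in the units of sales, and R2 is 1 - 300/20000 = 0.985.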

In [None]:
# Visualize predictions vs. actuals
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.title('Actual vs. Predicted Sales')
plt.show()

## 8. Conclusion
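The Random Forest model is evaluated on a held-out 20% of the data, and the quality of its fit should be judged from the two scores printed in section 7. R-squared is the share of the variance in sales that the model explains, and RMSE is the typical size of a prediction error in the same units as sales, with large errors weighted more heavily. Because the split is random rather than chronological, these scores may overstate how well the model forecasts future dates. With that in mind, the model can serve as a starting point for forecasting sales and supporting store operations planning.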