# 1. Import necessary libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# 2. Read the data from a CSV file named "train.csv" into a DataFrame and display the first 10 rows
store_sales = pd.read_csv("train.csv")
print(store_sales.head(10))

# 3. Display information about the DataFrame to check data types and missing values
store_sales.info()

# 4. Convert the 'date' column to datetime format
store_sales['date'] = pd.to_datetime(store_sales['date'])

# Convert the 'date' column to monthly periods
store_sales['date'] = store_sales['date'].dt.to_period("M")

# Aggregate monthly sales data
monthly_sales = store_sales.groupby('date').sum().reset_index()

# Convert the 'date' column back to datetime format
monthly_sales['date'] = monthly_sales['date'].dt.to_timestamp()

# Display the first 10 rows of the aggregated monthly sales data
print(monthly_sales.head(10))

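# 5. Drop the 'store' and 'item' columns from the monthly aggregate (their sums are
#    meaningless) and continue with the monthly data under the name store_sales
store_sales = monthly_sales.drop(['store', 'item'], axis=1)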

# 6. Plot monthly sales data
plt.figure(figsize=(15, 5))
plt.plot(store_sales['date'], store_sales['sales'])
plt.xlabel("Date")
plt.ylabel("Sales")
plt.title("Monthly Customer Sales")
plt.show()

# 7. Calculate monthly sales difference and plot the difference
store_sales['sales_diff'] = store_sales['sales'].diff()
store_sales = store_sales.dropna()

plt.figure(figsize=(15, 5))
plt.plot(store_sales['date'], store_sales['sales_diff'])
plt.xlabel("Date")
plt.ylabel("Sales Difference")
plt.title("Monthly Customers Sales Difference")
plt.show()

# 8. Prepare supervised learning data by creating lag features
supervised_data = store_sales.drop(['date', 'sales'], axis=1)
for i in range(1, 13):
    col_name = 'month_'+ str(i)
    supervised_data[col_name] = supervised_data['sales_diff'].shift(i)
supervised_data =  supervised_data.dropna().reset_index(drop=True)

# 9. Split the data into train and test sets
train_data = supervised_data[:-12]
test_data = supervised_data[-12:]

# 10. Scale the data using MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))
scaler.fit(train_data)
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)

# 11. Prepare input and output data for training and testing
x_train, y_train = train_data[:, 1:], train_data[:, 0:1]
x_test, y_test = test_data[:, 1:], test_data[:, 0:1]
y_train = y_train.ravel()
y_test = y_test.ravel()

# 12. Create a DataFrame for predictions
sales_dates = store_sales['date'][-12:].reset_index(drop=True)
predict_df = pd.DataFrame(sales_dates)

# 13. Fit a Linear Regression model and make predictions
lr_model = LinearRegression()
lr_model.fit(x_train, y_train)
lr_pre = lr_model.predict(x_test)

# 14. Inverse transform the scaled predictions
lr_pre = lr_pre.reshape(-1, 1)
lr_pre_test_set = np.concatenate([lr_pre, x_test], axis=1)
lr_pre_test_set = scaler.inverse_transform(lr_pre_test_set)

# 15. Combine the predictions with actual sales
act_sales = store_sales['sales'][-13:].tolist()
result_list = []
for index in range(0, len(lr_pre_test_set)):
    result_list.append(lr_pre_test_set[index][0] + act_sales[index])
lr_pre_series = pd.Series(result_list, name="Linear Prediction")
predict_df = predict_df.merge(lr_pre_series, left_index=True, right_index=True)

# 16. Calculate evaluation metrics for the Linear Regression model
#     (scikit-learn metrics take the actual values first, then the predictions)
actual_sales = store_sales['sales'][-12:].reset_index(drop=True)
lr_rmse = np.sqrt(mean_squared_error(actual_sales, predict_df['Linear Prediction']))
lr_mae = mean_absolute_error(actual_sales, predict_df['Linear Prediction'])
lr_r2 = r2_score(actual_sales, predict_df['Linear Prediction'])

# 17. Print evaluation metrics (the square root of the MSE is the RMSE)
print("Linear Regression RMSE: ", lr_rmse)
print("Linear Regression MAE: ", lr_mae)
print("Linear Regression R2: ", lr_r2)

# 18. Plot actual vs predicted sales
plt.figure(figsize=(15, 2))
plt.plot(store_sales['date'], store_sales['sales'])
plt.plot(predict_df['date'], predict_df['Linear Prediction'])
plt.title("Customer Sales Forecast using LR Model")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.legend(['Actual Sales', 'Predicted Sales'])
plt.show()



1. **Understanding Time Series Data**:
   Time series data is a sequence of observations collected at regular intervals over time. Each observation includes a timestamp (the date) and the corresponding value (sales). Mathematically, a time series dataset can be represented as:
   \[ \{(t_1, y_1), (t_2, y_2), \ldots, (t_n, y_n)\} \]
   where \( t_i \) represents the timestamp (date) and \( y_i \) represents the sales value at time \( t_i \).

2. **Data Preprocessing**:

   - **Checking for Missing Values**: Ensure there are no missing values in the dataset. If missing values exist, handle them using methods such as interpolation or deletion.
   - **Converting Data Types**: Convert the 'date' column to datetime format using the `pd.to_datetime()` function in pandas.
   - **Feature Engineering**: Create additional features that might be useful for modeling, such as the difference between consecutive sales values (sales_diff). This can be calculated using the `diff()` function in pandas.
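
The three preprocessing steps above can be sketched on a small example. The DataFrame `raw` and its values are made up for illustration and stand in for the contents of `train.csv`; linear interpolation is only one of several reasonable ways to fill a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv (values are made up).
raw = pd.DataFrame({
    "date": ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"],
    "sales": [100.0, np.nan, 130.0, 120.0],
})

# Count missing values per column before changing anything.
missing_counts = raw.isna().sum()

# Parse the date strings into datetime64 values.
raw["date"] = pd.to_datetime(raw["date"])

# Fill the gap by linear interpolation between the neighbouring rows.
raw["sales"] = raw["sales"].interpolate(method="linear")

# First difference: change in sales relative to the previous row.
# The first row has no predecessor, so its difference is NaN.
raw["sales_diff"] = raw["sales"].diff()

print(missing_counts)
print(raw)
```

Differencing removes the level of the series and leaves only month-to-month changes, which is why the first row is dropped afterwards in the notebook.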

3. **Exploratory Data Analysis (EDA)**:
   EDA involves visualizing and analyzing the data to understand its characteristics. Common EDA techniques include:

   - **Plotting Time Series Data**: Visualize the monthly sales data using line plots to identify trends, seasonality, and other patterns.
   - **Descriptive Statistics**: Compute summary statistics such as mean, median, and standard deviation to understand the central tendency and variability of the sales data.
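
As a rough illustration of the descriptive statistics mentioned above, the sketch below summarizes a made-up monthly series. The series `monthly` and its values are assumptions for the example, and the 3-month rolling mean is an extra smoothing step that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical monthly sales indexed by month-start timestamps.
monthly = pd.Series(
    [120, 135, 150, 160, 155, 170],
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
    name="sales",
)

# Central tendency and spread of the series.
summary = monthly.agg(["mean", "median", "std"])

# A rolling mean smooths short-term noise and makes the trend easier to see.
# The first two entries are NaN because a full 3-month window is not available.
rolling_mean = monthly.rolling(window=3).mean()

print(summary)
print(rolling_mean)
```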

4. **Feature Engineering**:

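   - **Creating Lag Features**: Generate lagged features to capture the relationship between current and past observations. In the notebook the lagged variable is the monthly sales difference (`sales_diff`), not the raw sales value: the `shift()` function in pandas moves the column down by 1 to 12 months, so each row holds the current difference together with the differences of the previous 12 months. Rows that lack a full history contain missing values and are dropped.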
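
A minimal sketch of lag-feature construction with `shift()`, using two lags instead of twelve so the result is easy to inspect. The column names `lag_1` and `lag_2` and the values are illustrative only.

```python
import pandas as pd

# Hypothetical month-to-month sales differences.
diffs = pd.DataFrame({"sales_diff": [5.0, -2.0, 3.0, 4.0, -1.0]})

n_lags = 2  # the notebook uses 12
for lag in range(1, n_lags + 1):
    # shift(lag) moves the column down by `lag` rows, so each row
    # sees the value observed `lag` months earlier.
    diffs[f"lag_{lag}"] = diffs["sales_diff"].shift(lag)

# The first n_lags rows have incomplete history and are removed.
supervised = diffs.dropna().reset_index(drop=True)
print(supervised)
```

Each remaining row is one supervised example: the first column is the target and the lag columns are the inputs.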

5. **Train-Test Split**:

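   - Split the dataset into training and testing sets to evaluate the model's performance. For time series the split must be chronological: the model is trained on earlier observations and tested on later ones. The notebook holds out the last 12 months for testing and trains on everything before them. A random split (for example 80% / 20%), common for non-temporal data, would let information from the future leak into training.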
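
One way to express a chronological hold-out is sketched below. The helper `chronological_split` is a hypothetical name introduced for this example, and it assumes the rows are already sorted by time.

```python
import pandas as pd

def chronological_split(frame, n_test):
    """Hold out the last n_test rows; frame must already be sorted by time."""
    if not 0 < n_test < len(frame):
        raise ValueError("n_test must be between 1 and len(frame) - 1")
    return frame.iloc[:-n_test], frame.iloc[-n_test:]

# Hypothetical supervised table with one row per month.
frame = pd.DataFrame({
    "month": pd.date_range("2020-01-01", periods=10, freq="MS"),
    "sales_diff": range(10),
})

train, test = chronological_split(frame, n_test=3)
print(len(train), len(test))
```

Every training row precedes every test row, which mirrors how the model would be used in practice: fitted on the past, evaluated on the future.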

6. **Data Scaling**:

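   - Scale the data so that all features share the same range. MinMaxScaler maps each column linearly to a specified range, such as (-1, 1). Ordinary least squares does not need scaling to find its solution, but scaling keeps the pipeline reusable with models that are sensitive to feature scale. Fit the scaler on the training set only and reuse it to transform the test set, so that no information from the test period leaks into training.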
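
The sketch below shows the fit-on-train, transform-both pattern on made-up numbers. For a target range \( (a, b) \), MinMaxScaler computes \( x' = a + (x - x_{\min})(b - a) / (x_{\max} - x_{\min}) \) per column, where the minimum and maximum come from the data passed to `fit`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data; min and max are learned from train only.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[12.0]])

scaler = MinMaxScaler(feature_range=(-1, 1))
train_scaled = scaler.fit_transform(train)   # learns min=0, max=10
test_scaled = scaler.transform(test)         # reuses the training min/max

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(test_scaled)

print(train_scaled.ravel())
print(test_scaled.ravel())
print(restored.ravel())
```

Test values outside the training range land outside (-1, 1), as the value 12 does here. That is expected and is not an error.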

7. **Model Training**:

   - Train a linear regression model to forecast future sales based on historical data. In linear regression, the relationship between the dependent variable and the independent variables is modeled using a linear equation:
     \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \epsilon \]
     where \( y \) is the observed target (in the notebook, the current month's sales difference), \( \beta_0 \) is the intercept, \( \beta_i \) are the coefficients (weights), \( x_i \) are the lag features (the previous \( n = 12 \) monthly differences), and \( \epsilon \) is the error term. The model's prediction is the same expression without \( \epsilon \).
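
To make the equation concrete, the sketch below fits `LinearRegression` to synthetic, noise-free data generated from known coefficients and checks that they are recovered. The data, the three features, and the coefficient values are all assumptions for the example and are unrelated to the sales dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic inputs: 50 samples, 3 features (stand-ins for lag features).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Known intercept and coefficients; no noise term, so epsilon = 0.
true_intercept = 1.5
true_coef = np.array([0.5, -1.0, 2.0])
y = true_intercept + X @ true_coef

# Ordinary least squares estimates beta_0 (intercept_) and beta_i (coef_).
model = LinearRegression().fit(X, y)

print(model.intercept_)
print(model.coef_)
```

With real data the error term is not zero, so the fitted coefficients are estimates rather than exact recoveries.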

8. **Model Evaluation**:

   - Evaluate the performance of the linear regression model using metrics such as:
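     - **Mean Squared Error (MSE)**: Measures the average squared difference between the predicted and actual sales values. The notebook takes the square root of this value, which gives the root mean squared error (RMSE), expressed in the same units as sales.
     - **Mean Absolute Error (MAE)**: Measures the average absolute difference between the predicted and actual sales values.
     - **R-squared (R2) Score**: Represents the proportion of the variance in the actual values that is explained by the model's predictions. Unlike MSE and MAE, R2 is not symmetric in its arguments, so the actual values must be passed to `r2_score` first and the predictions second.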
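
The sketch below computes the three metrics on a small made-up example and shows why argument order matters for R2. The arrays `y_true` and `y_pred` are illustrative values, not results from the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted sales.
y_true = np.array([100.0, 120.0, 140.0, 160.0])
y_pred = np.array([110.0, 115.0, 150.0, 155.0])

# scikit-learn metrics take (y_true, y_pred) in that order.
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Swapping the arguments leaves MSE and MAE unchanged but alters R2,
# because R2 normalizes by the variance of its first argument.
r2_swapped = r2_score(y_pred, y_true)

print(mse, rmse, mae, r2, r2_swapped)
```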

9. **Prediction**:

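   - Use the trained linear regression model to make predictions for the held-out months. Because the model is trained on differenced data, it predicts the month-to-month change in sales rather than the sales level. After the predictions are transformed back to the original scale, each predicted change is added to the previous month's actual sales to obtain the forecast for that month.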
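
Converting predicted differences back to sales levels can be sketched as follows. The arrays are made-up values; `pred_diff` stands in for the model output after inverse scaling.

```python
import numpy as np

# Hypothetical actual sales for four consecutive months.
actual_sales = np.array([200.0, 210.0, 205.0, 220.0])

# Hypothetical predicted changes for months 2, 3 and 4.
pred_diff = np.array([8.0, -3.0, 12.0])

# One-step-ahead forecast: previous month's actual sales plus predicted change.
pred_sales = actual_sales[:-1] + pred_diff

print(pred_sales)
```

Each forecast relies on the true value of the preceding month, so this is a one-step-ahead evaluation. Forecasting several months beyond the available data would require feeding predictions back in as inputs.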

10. **Visualization**:
    - Visualize the predicted sales values alongside the actual sales values using line plots. This allows stakeholders to assess the accuracy of the forecasting model and make informed decisions based on the forecasts.

Overall, time series forecasting using a linear regression model involves a systematic process of data preprocessing, model training, evaluation, and prediction to generate insights and inform decision-making.
