# <p style="font-family:Cursive; font-weight:bold; letter-spacing: 2px ; font-size:85%; text-align:center;padding: 10px; border-bottom: 5px solid #9447e6 ; background-color: #4abaed">Introduction:</p>

![](https://th.bing.com/th/id/OIP.utjtoPkFQU8W722O9DQClQHaEn?pid=ImgDet&rs=1)

<div style="border: 3px solid #f0ad4e; background-color: #fcf8e3; padding: 10px;">
    
Time series forecasting is a powerful technique for predicting future values of a target variable based on its past values. A time series dataset typically consists of a sequence of data points collected over time, with each data point representing a <b>measurement or observation</b> at a specific point in time.

Time series forecasting differs from other prediction problems because it involves analyzing the temporal patterns and trends in the data. The order of the data points matters, and there may be <b>seasonality, trends, residuals, and other patterns</b> that need to be accounted for in order to make accurate predictions.

<b>ML</b> models can be trained to automatically detect and account for variables that affect the target variable, such as promotions and special days, and can be used to generate accurate forecasts for a wide range of time series datasets.

<b>Aims</b>

We aim to develop a flexible and scalable framework that can be applied to a wide range of time series datasets, while accounting for variables that can affect the target variable, and examining the relationship between ensemble techniques and forecasting error.

By combining the strengths of multiple ML models and techniques, we aim to achieve better forecasting performance and gain deeper insights into the underlying patterns and trends in the data.
    </div>

# <p style="font-family:Cursive; font-weight:bold; letter-spacing: 2px ; font-size:85%; text-align:center;padding: 10px; border-bottom: 5px solid #9447e6 ; background-color: #4abaed">METHODOLOGY:</p>

<div style = 'border: 3px solid #f0ad4e; background-color: #fcf8e3; padding: 10px;'>
<ol>
    <li><b>Data Preprocessing</b>: The first step is to prepare the time series dataset for analysis.</li><br>
    <li><b>Stationarity Test</b>: The next step is to test the time series for stationarity.</li><br>
    <li><b>Determine Order of Differencing</b>: If the time series is found to be non-stationary, the next step is to determine the order of differencing required to achieve stationarity.</li><br>
    <li><b>Identify Order of AR and MA Terms</b>: Once the time series is stationary, the next step is to identify the order of the autoregressive (AR) and moving average (MA) terms in the ARIMA model. This can be done by analyzing the autocorrelation and partial autocorrelation functions of the time series.</li><br>
    <li><b>Fit ARIMA Model</b>: With the order of differencing, AR and MA terms identified, the next step is to fit the ARIMA model to the time series data.</li><br>
    <li><b>Model Diagnostics</b>: After fitting the model, the next step is to evaluate its performance using diagnostic checks.</li><br>
    <li><b>Forecasting</b>: Finally, the fitted model is used to generate forecasts, whose accuracy can be evaluated using measures such as the symmetric mean absolute percentage error (SMAPE).</li><br>
    <li><b>Conclusion</b></li>
    </ol>
    
</div>

# <p style="font-family:Cursive; font-weight:bold; letter-spacing: 2px ; font-size:85%; text-align:center;padding: 10px; border-bottom: 5px solid #9447e6 ; background-color: #4abaed">Importing Packages and Data</p>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from prettytable import PrettyTable
from colorama import Fore, Style
import pickle

from sklearn.preprocessing import LabelEncoder


#Ignore warnings
import warnings
warnings.filterwarnings('ignore')

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from statsmodels.tsa.arima.model import ARIMA

In [None]:
train = pd.read_csv('/kaggle/input/playground-series-s3e19/train.csv',index_col='id')
test = pd.read_csv('/kaggle/input/playground-series-s3e19/test.csv',index_col='id')
sub = pd.read_csv('/kaggle/input/playground-series-s3e19/sample_submission.csv')

In [None]:
train.head()

In [None]:
# Visualize the datatypes 
data_types = train.dtypes.value_counts()
labels = data_types.index.astype(str)
counts = data_types.values.tolist()

# Create the pie chart using plotly
fig = go.Figure(data=[go.Pie(labels=labels, values=counts)])

# Customize the layout of the pie chart
fig.update_layout(
    title='Data Types in train',
    width=600,
    height=450,
    showlegend=True,
    legend=dict(orientation="h")
)

# Display the pie chart
fig.show()

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
train.isna().sum()

In [None]:
# creating pie charts for visualizing the counts
def create_pie_chart(data, column):

    value_counts = data[column].value_counts()
    labels = value_counts.index.astype(str)
    counts = value_counts.values.tolist()

    fig = go.Figure(data=[go.Pie(labels=labels, values=counts)])

    fig.update_layout(
        title=f'Counts of {column}',
        width=600,
        height=450,
        showlegend=True,
        legend=dict(orientation="h")
    )

    fig.show()

# Create a pie chart for country
create_pie_chart(train, 'country')

# Create a pie chart for store
create_pie_chart(train, 'store')

# Create a pie chart for product
create_pie_chart(train, 'product')

In [None]:
def calculate_feature_counts(data, feature):
    
    feature_counts = data[feature].value_counts().reset_index()
    feature_counts.columns = [feature, 'Count']
    
    # Apply colors to the table
    feature_counts['Color'] = feature_counts.apply(lambda row: Fore.GREEN if row['Count'] > 100 else Fore.RED, axis=1)
    
    # Create the pretty table
    table = PrettyTable()
    table.field_names = [feature, 'Count']
    for _, row in feature_counts.iterrows():
        color = row['Color']
        feature_value = row[feature]
        count = row['Count']
        table.add_row([f'{color}{feature_value}{Style.RESET_ALL}', count])
    
    return str(table)

In [None]:
# Calculate the count of each country
country_counts_table = calculate_feature_counts(train, 'country')
print("Count of each country:")
print(country_counts_table)

In [None]:
# Calculate the count of each store
store_counts_table = calculate_feature_counts(train, 'store')
print("Count of each store:")
print(store_counts_table)

In [None]:
# Calculate the count of each product
product_counts_table = calculate_feature_counts(train, 'product')
print("Count of each product:")
print(product_counts_table)

In [None]:
# Named differently from the plotly-based create_pie_chart above so it does not overwrite it
def draw_pie_on_axis(ax, labels, counts, title):

    ax.pie(counts, labels=labels, autopct='%1.1f%%')
    ax.set_title(title)


# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
fig.suptitle('Distribution of Products Sold')

# 1) Bar chart for countries
ax1 = axes[0, 0]
country_counts = train['country'].value_counts()
ax1.bar(country_counts.index, country_counts)
ax1.set_xlabel('Country')
ax1.set_ylabel('Count')
ax1.set_title('Distribution of Products Sold by Country')

# 2) Pie chart for countries
ax2 = axes[0, 1]
draw_pie_on_axis(ax2, country_counts.index, country_counts, 'Distribution of Products Sold by Country')

# 3) Bar chart for stores
ax3 = axes[1, 0]
store_counts = train['store'].value_counts()
ax3.bar(store_counts.index, store_counts)
ax3.set_xlabel('Store')
ax3.set_ylabel('Count')
ax3.set_title('Distribution of Products Sold by Store')

# 4) Pie chart for stores
ax4 = axes[1, 1]
draw_pie_on_axis(ax4, store_counts.index, store_counts, 'Distribution of Products Sold by Store')

# Adjust spacing between subplots
plt.tight_layout()

# Display the subplots
plt.show()

In [None]:
# Convert the 'date' column to datetime format

train['date'] = pd.to_datetime(train['date'])

# Set the 'date' column as the index
train.set_index('date', inplace=True)


test['date'] = pd.to_datetime(test['date'])

# Set the 'date' column as the index
test.set_index('date', inplace=True)

In [None]:
# calculate the total trend for num_sold over time
plt.figure(figsize=(12, 6))
train['num_sold'].resample('M').sum().plot()
plt.title('Total trend of num_sold over time')
plt.xlabel('Date')
plt.ylabel('Total num_sold')
plt.show()

<div class="alert alert-block alert-warning" style="border: 3px solid #f0ad4e; background-color: #fcf8e3; padding: 10px;">
    <font color = 'black'>
👉 Total trend of num_sold over time: this plot shows the total number of products sold, aggregated by month. Sales fall after the first month of every year and fell again during 2020, and they rise sharply at the end of every year.</font>
</div>

In [None]:
# Country with the highest num_sold based on the year
country_num_sold = train.groupby([train.index.year, 'country'])['num_sold'].sum()
country_num_sold = country_num_sold.unstack()
country_num_sold.plot(kind='bar', figsize=(12, 6))
plt.title('Total num_sold by Country (Yearly)')
plt.xlabel('Year')
plt.ylabel('Total num_sold')
plt.legend(title='Country')
plt.show()

<div class="alert alert-block alert-warning" style="border: 3px solid #f0ad4e; background-color: #fcf8e3; padding: 10px;">
    <font color = 'black'>
👉 Country with the highest num_sold by year: this bar chart displays the total number of products sold by country, grouped by year. Canada has the highest sales in every year, followed by Japan; Argentina has the lowest.</font>
</div>


In [None]:
# Store with the highest sales based on the year
store_num_sold = train.groupby([train.index.year, 'store'])['num_sold'].sum()
store_num_sold = store_num_sold.unstack()
store_num_sold.plot(kind='bar', figsize=(12, 6))
plt.title('Total num_sold by Store (Yearly)')
plt.xlabel('Year')
plt.ylabel('Total num_sold')
plt.legend(title='Store')
plt.show()

<div class="alert alert-block alert-warning" style="border: 3px solid #f0ad4e; background-color: #fcf8e3; padding: 10px;">
    <font color = 'black'>
👉 Store with the highest sales by year: this bar chart displays the total number of products sold by store, grouped by year. Kagglazon tops the chart and Kaggle Learn has the lowest sales.</font>
</div>

In [None]:
# Country-wise sales based on product:
plt.figure(figsize = (12,6))
sns.barplot(data=train, x='country', y='num_sold', hue='product')
plt.xlabel('Country')
plt.ylabel('No. of products sold')
plt.title('country-wise sales analysis based on product')
plt.show()

In [None]:
# Product with the highest num_sold based on the year
product_num_sold = train.groupby([train.index.year, 'product'])['num_sold'].sum()
product_num_sold = product_num_sold.unstack()
product_num_sold.plot(kind='bar', figsize=(12, 6))
plt.title('Total num_sold by Product (Yearly)')
plt.xlabel('Year')
plt.ylabel('Total num_sold')
plt.legend(title='Product')
plt.show()

<div class="alert alert-block alert-warning" style="border: 3px solid #f0ad4e; background-color: #fcf8e3; padding: 10px;">
    <font color = 'black'>
👉 Product with the highest num_sold by year: this bar chart displays the total number of units sold per product, grouped by year. There is stiff competition between "Using LLMs to Improve Your Coding" and "Using LLMs to Train More LLMs".</font>
</div>

In [None]:
%%time
# Plot the trend for each country
sns.lineplot(data=train, x='date', y='num_sold', hue='country')

# Customize the plot
plt.xlabel('Date')
plt.ylabel('Number of Products Sold')
plt.title('Trend of Products Sold by Country')
plt.legend(title='Country')

# Adjust the layout so that all plot components fit properly
plt.tight_layout()

# Display the plot
plt.show()

In [None]:
# Count the occurrences where num_sold is 0
num_zero_sold = (train['num_sold'] == 0).sum()

print(f"Number of occurrences where num_sold is 0: {num_zero_sold}")

<div id = 'h4' class = 'alert alert-block alert-info' style="border-bottom: 5px solid #9370db; background-color: #f0f8ff; padding: 10px;">
    <h2>Stationarity testing and Determine Order of Differencing</h2>
    </div>

<div class="alert alert-block alert-warning" style="border: 3px solid #f0ad4e; background-color: #fcf8e3; padding: 10px;">
    <font color = 'black'>
👉 <b>Testing the stationarity of the time series with the Augmented Dickey-Fuller (ADF) test from the statsmodels library:</b></font>
</div>


In [None]:
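# Define the time series as a pandas Series.
# Note: this Series still holds one row per country/store/product combination for each date;
# aggregate it per date first if a single series is intended.
ts = train['num_sold']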

In [None]:
from statsmodels.tsa.stattools import adfuller

result = adfuller(ts)

print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

# <p style="font-family:Cursive; font-weight:bold; letter-spacing: 2px; color:black; background-color:#4abaed ; font-size:85%; text-align:center;padding: 10px; border-bottom: 5px solid #9447e6">ARIMA:</p>
<br>
<div style="border-radius:10px;border:#FFF722 solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">
<b><h1><span style='color:#85BB65'>|</span> Detailed Explanation of Stationary and Non-Stationary Data</h1></b> <br>
    <ol>
        <li><b>Stationary:</b> A time series is said to be <span style = 'background-color: #FFF722; color : black'><b>stationary</b></span> when its statistical properties, such as <span style = 'color : #85BB65'><b> mean, variance, and autocorrelation, remain constant over time</b></span>. To check for stationarity, you can visually inspect the time series plot or use statistical tests like the <span style = 'background-color:#F0AAE3'><b>Augmented Dickey-Fuller (ADF) test</b></span>. If the <span style = 'background-color:#F0E4D5'><b>p-value </b></span> obtained from the ADF test is less than a chosen significance level (e.g., 0.05), you can reject the null hypothesis that the series has a unit root and conclude that the series is stationary.</li><br>
        <li><b>Non-stationary:</b> A non-stationary time series exhibits a <span style = 'color:#4244F0'><b>trend, seasonality, or both, </b></span> which causes its statistical properties to change over time. In such cases, models that assume stationarity, such as ARMA, cannot be applied directly; the series is usually differenced first, which is the "integrated" part of ARIMA. Common reasons for non-stationarity include trends <span style = 'background-color:#F0E4D5'><b>(upward or downward movement over time) and seasonality (regular patterns that repeat at fixed intervals)</b></span>.</li>
    </ol>

In [None]:
train

In [None]:
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(train['num_sold'], alpha = 0.05);

<div id = 'h4' class = 'alert alert-block alert-info' style="border-bottom: 5px solid #9370db; background-color: #f0f8ff; padding: 10px;">
    <h1>Model Building</h1>
    </div>

<div  class = 'alert alert-block alert-info' style="border-bottom: 5px solid #9370db; background-color: #f0f8ff; padding: 10px;">
    <h3>Random Forest and Gradient Boosting Regressors</h3> 
</div>

In [None]:
# Create a label encoder
label_encoder = LabelEncoder()

# Encode categorical variables
train['country'] = label_encoder.fit_transform(train['country'])
train['store'] = label_encoder.fit_transform(train['store'])
train['product'] = label_encoder.fit_transform(train['product'])

# Split the data into train (first 80% of rows) and validation (last 20%) sets;
# the rows are assumed to be in date order, so the split is chronological
train_size = int(0.8 * len(train))
train_data = train[:train_size]
validation_data = train[train_size:]


# Random Forests
rf_model = RandomForestRegressor(n_estimators=100, random_state=0)
rf_model.fit(train_data.drop('num_sold', axis=1), train_data['num_sold'])
rf_forecast = rf_model.predict(validation_data.drop('num_sold', axis=1))

# Gradient Boosting
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=0)
gb_model.fit(train_data.drop('num_sold', axis=1), train_data['num_sold'])
gb_forecast = gb_model.predict(validation_data.drop('num_sold', axis=1))

# Evaluate the forecasts using MAPE
rf_mape = mean_absolute_percentage_error(validation_data['num_sold'], rf_forecast)
gb_mape = mean_absolute_percentage_error(validation_data['num_sold'], gb_forecast)

print(f"Random Forests MAPE: {rf_mape}")
print(f"Gradient Boosting MAPE: {gb_mape}")
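The methodology names SMAPE as the evaluation measure, while the cell above reports MAPE. A minimal sketch of SMAPE is given here, using the common definition 100 * mean(2|y - ŷ| / (|y| + |ŷ|)); the function name `smape` and the convention of scoring a 0/0 pair as zero error are assumptions of this example.

```python
import numpy as np


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, on a 0-200 scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    # Where actual and forecast are both 0 the term is counted as 0 (a convention)
    ratio = np.divide(
        2.0 * np.abs(y_pred - y_true),
        denom,
        out=np.zeros_like(denom),
        where=denom != 0,
    )
    return 100.0 * ratio.mean()


print(f"SMAPE: {smape([100, 200], [110, 180]):.3f}")
```

With the notebook's variables it could be called as `smape(validation_data['num_sold'], rf_forecast)`.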

In [None]:
# Additional step: Visualize the actual vs. predicted values for the validation data
plt.plot(validation_data.index, validation_data['num_sold'], label='Actual')
plt.plot(validation_data.index, rf_forecast, label='Random Forests Forecast')
plt.plot(validation_data.index, gb_forecast, label='Gradient Boosting Forecast')
plt.xlabel('Date')
plt.ylabel('num_sold')
plt.title('Actual vs. Forecasted num_sold (Validation Data)')
plt.legend()
plt.show()
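The introduction mentions combining several models. One simple way to combine two forecasts like the ones plotted above is a weighted average, sketched here. The numbers are made up purely to show the mechanics, the helper `blend` is hypothetical, and blending is not guaranteed to reduce the error on real data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up values (assumption): one forecast runs 10% high, the other 5% low
actual = np.array([100.0, 200.0, 150.0])
forecast_a = np.array([110.0, 220.0, 165.0])
forecast_b = np.array([95.0, 190.0, 142.5])


def blend(forecasts, weights=None):
    """Hypothetical helper: element-wise (weighted) average of several forecasts."""
    return np.average(np.vstack(forecasts), axis=0, weights=weights)


blended = blend([forecast_a, forecast_b])

for name, forecast in [("A", forecast_a), ("B", forecast_b), ("Blend", blended)]:
    print(f"{name}: MAPE = {mean_absolute_percentage_error(actual, forecast):.4f}")
```

In the notebook the same helper could be applied to `rf_forecast` and `gb_forecast`, with weights chosen on the validation set.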

In [None]:
test

In [None]:
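# Encode categorical variables.
# Note: refitting the encoder on the test set reproduces the training codes only if
# test contains exactly the same categories as train (LabelEncoder sorts classes alphabetically).
test['country'] = label_encoder.fit_transform(test['country'])
test['store'] = label_encoder.fit_transform(test['store'])
test['product'] = label_encoder.fit_transform(test['product'])

# Generate test predictions with the trained Random Forest and Gradient Boosting models
test_predictions = rf_model.predict(test)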
test_predictions_gb = gb_model.predict(test)

In [None]:
sub['num_sold']=test_predictions
sub

In [None]:
# index=False keeps the pandas row index out of the submission file
sub.to_csv('submission_rf(12-07-23).csv', index=False)

In [None]:
# Save the trained Random Forest model with pickle; the context manager closes the file
with open('/kaggle/working/SalesForecasting_pickle_file(RF)', 'wb') as model_file:
    pickle.dump(rf_model, model_file)

<div class="alert alert-block alert-warning" style="border: 3px solid #f0ad4e; background-color: #fcf8e3; padding: 10px;">
    <h4> Conclusion:</h4>
    <font color = 'black'>
👉 Today, my main focus was on analyzing the data. I explored the dataset, visualized trends, calculated counts, and built baseline predictive models with Random Forest and Gradient Boosting regressors. I plan to improve the forecasting part of the project by implementing the ARIMA model tomorrow.</font>
                      <h2>If you appreciate my work, please consider upvoting it 👍</h2>
    </div>