# <center> **Data Science Project** 

## <center> **Problem Definition**

### ***Business Understanding***
Walmart Inc. is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores in the United States. It is headquartered in Bentonville, Arkansas.

### ***This project focuses on answering the following questions***
1. Which store has minimum and maximum sales?
2. Which store has maximum standard deviation i.e., the sales vary a lot. Also, find out the coefficient of mean to standard deviation
3. Which store/s has good quarterly growth rate in Q3’2012
4. Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together
5. Provide a monthly and semester view of sales in units and give insights
6. Build prediction to forecast demand.

### ***Data Understanding***
Kaggle hosts weekly sales data for 45 Walmart stores, covering 2010-02-05 to 2012-10-26.
### ***The data contains these features***
* The file contains anonymized information about the 45 stores, plus additional data on store and regional activity for the given dates. It contains the following fields:
>__Walmart.csv__
* Store - the store number
* Date - the week of sales
* Weekly_Sales - sales for the given store in the given week
* Holiday_Flag - whether the week is a special holiday week (1 - Holiday week, 0 - Non-holiday week)
* Temperature - average temperature in the region (Fahrenheit)
* Fuel_Price - cost of fuel in the region
* CPI - the consumer price index
* Unemployment - the unemployment rate

*For convenience, the four holidays fall within the following weeks in the dataset.<br>(not all holidays are in the data):*
* *Super Bowl : 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13*
* *Labor Day : 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13*
* *Thanksgiving : 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13*
* *Christmas : 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13*
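
The field list above can be turned into a quick schema check before any analysis. This is a minimal sketch: `EXPECTED_COLUMNS` and `check_schema` are hypothetical names introduced here, and the sample row is made up rather than taken from the dataset.

```python
import pandas as pd

# Column names copied from the field list above
EXPECTED_COLUMNS = ['Store', 'Date', 'Weekly_Sales', 'Holiday_Flag',
                    'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']

def check_schema(frame, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are missing from `frame` (hypothetical helper)."""
    return [col for col in expected if col not in frame.columns]

# Hand-made single row (illustrative values only, not real Walmart data);
# 'Unemployment' is left out on purpose to show what the check reports
sample = pd.DataFrame({
    'Store': [1], 'Date': ['05-02-2010'], 'Weekly_Sales': [1000.0],
    'Holiday_Flag': [0], 'Temperature': [42.0], 'Fuel_Price': [2.5],
    'CPI': [211.0],
})
missing_columns = check_schema(sample)
print(missing_columns)
```

An empty list would mean every documented field is present in the loaded frame.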

## <center> **Data Acquisition**

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from scipy.stats import zscore
# Library for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
# Algorithms (linear models and random forest), preprocessing, and evaluation
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

: 

In [None]:
# Load dataset
missing_sign = ['n/a','na','nan','--','none']
data_set = pd.read_csv('Walmart.csv', na_values=missing_sign)
data_set.head()

: 

In [None]:
# Check information of data
data_set.info()

: 

## <center> **Data Preparation**

In [None]:
# Convert data type 
data_set['Date'] = pd.to_datetime(data_set['Date'], format='%d-%m-%Y')
data_set[['Store', 'Holiday_Flag']] = data_set[['Store', 'Holiday_Flag']].astype('object')

: 

In [None]:
# Determine variable type
cat_var = [column for column in data_set.columns if data_set[column].dtype == "object" and data_set[column].nunique() < 46]
num_var = [column for column in data_set.columns if data_set[column].dtype in ['int64', 'float64']]

: 

In [None]:
# Check the values in each categorical column
fig, ax = plt.subplots(ncols=len(cat_var), figsize=(20,7))
for idx,column in enumerate(cat_var):
  sns.countplot(data=data_set,
                x=column,
                ax=ax[idx]
                ).set_title(f"Quantities of {column}")
  print(f"Column : '{column}'\nUnique values : {data_set[column].unique()}\n")

: 

In [None]:
# Check the distribution of each numerical column
fig, (ax_box, ax_hist) = plt.subplots(ncols=len(num_var), nrows=2, gridspec_kw = {"height_ratios": (.15, .85)}, figsize=(20,7))
for idx,column in enumerate(num_var):
  sns.boxplot(data=data_set, 
              x=column, 
              ax=ax_box[idx]
              )
  sns.histplot(data=data_set,
               x=column,
               ax=ax_hist[idx],
               kde=True
               )
  if idx != 0:
    ax_hist[idx].set(ylabel="")
  ax_box[idx].set(xlabel="")

: 

In [None]:
data_set.head()

: 

In [None]:
# Check statistic of data (numerical type)
data_set.describe()

: 

### **Data Cleansing**

In [None]:
df = data_set.copy()

: 

#### *Data Missing*

In [None]:
# Check missing values
missing_count = df.isnull().sum()
total_cells = np.prod(df.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_count.sum()
percent_missing = total_missing*100/total_cells
print(f"Percentage of missing values : {percent_missing} %")

: 
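
The overall percentage hides which columns are affected. As a minimal sketch on a made-up frame (`toy` and its values are assumptions, not project data), the per-column share of missing cells can be read off the mean of the boolean mask:

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in both a numeric and a text column
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': ['x', 'y', None, None],
})

# isnull() gives True/False per cell; the column mean of that mask is the missing fraction
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```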

#### *Data Outlier*

In [None]:
# Remove outliers with the IQR (interquartile range) rule
def cleanOutlier_IQR(data, num_cols):
    n_before = data.shape[0]  # row count before any removal
    count = 1
    # Repeat until a full pass over the columns removes nothing
    while count != 0:
        count = 0
        for column in num_cols: 
            q1 = data[column].quantile(0.25)
            q3 = data[column].quantile(0.75)
            iqr = q3-q1
            upper_whisker = q3+(1.5*iqr)
            lower_whisker = q1-(1.5*iqr)
            idx_out = data.loc[(data[column]>upper_whisker) | (data[column]<lower_whisker)].index
            data.drop(idx_out, inplace=True)
            count += len(idx_out)
    # Report against the frame passed in, not the global data_set
    qty_outlier = n_before-data.shape[0]
    percent_outlier = (qty_outlier*100)/n_before
    print(f'Outlier : {qty_outlier} units ({percent_outlier:.2f}%)')
    data.reset_index(drop=True, inplace=True)

: 

In [None]:
cleanOutlier_IQR(df, num_var)
print(f'Samples before outlier removal : {data_set.shape[0]:,} units')
print(f'Samples after outlier removal : {df.shape[0]:,} units')

: 
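
To make the whisker rule concrete, here is a small worked example on made-up numbers (the series `values` is an assumption, not project data). It applies the 1.5 × IQR rule once to a single column.

```python
import pandas as pd

# Toy series (made-up numbers) with one obvious extreme value
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# Keep only the points that fall inside the whiskers
inside = values[(values >= lower_whisker) & (values <= upper_whisker)]
print(f"whiskers: [{lower_whisker}, {upper_whisker}], kept {len(inside)} of {len(values)}")
```

Note that removing a point shifts the quartiles, which is why the notebook's function repeats the pass until nothing more is dropped.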

In [None]:
# Plot the standardized distribution of each numerical variable
# (zscore is applied column by column so the result stays a DataFrame for pd.melt)
fig, ax = plt.subplots(ncols=2, figsize=(20,7))
sns.boxplot(data=pd.melt(data_set[num_var].apply(zscore)),
            x='variable',
            y='value',
            ax=ax[0]
            ).set_title('Data before outlier removal')
sns.boxplot(data=pd.melt(df[num_var].apply(zscore)),
            x='variable',
            y='value',
            ax=ax[1]
            ).set_title('Data after outlier removal');

: 

#### *Duplicate Data*

In [None]:
# Check duplicate data
def cleanDuplicated(data):
    count_duplicated = data.duplicated().sum()
    print(f"Duplicated : {count_duplicated} ea.")
    if count_duplicated != 0:
        data.drop_duplicates(inplace=True)
        print('Clean duplicate data complete!')

: 

In [None]:
cleanDuplicated(df)

: 

## <center> **Exploratory data analysis**

### **Question 1:** *Which store has minimum and maximum sales?*

In [None]:
# Total weekly sales for each store
total_sales = data_set.groupby('Store')['Weekly_Sales'].sum()
print(f'Store {total_sales.idxmax()} has maximum sales : {total_sales.max():,.2f} USD')
print(f'Store {total_sales.idxmin()} has minimum sales : {total_sales.min():,.2f} USD')

: 

In [None]:
# Plot properties (total sales)
# Configure the graph display 
plt.figure(figsize=(20,7))
ax = sns.barplot(x=total_sales.index,
                 y=total_sales.values,
                 order=total_sales.sort_values().index
                 )
# Configure the title text
ax.set_title("Total weekly sales for each Store", fontsize=15)
ax.set_xlabel("Store number", fontsize=13)
ax.set_ylabel("Total sales (USD)", fontsize=13);

: 

### **Question 2:** *Which store has maximum standard deviation i.e., the sales vary a lot. Also, find out the coefficient of mean to standard deviation*

In [None]:
# Combining statistics results in weekly sales for each store. (standard deviation and mean)
sales_stats = data_set.groupby('Store').agg({'Weekly_Sales':['std','mean']})
sales_stats['Coefficient'] = sales_stats['Weekly_Sales']['std']/sales_stats['Weekly_Sales']['mean']
sales_stats.sort_values(('Weekly_Sales','std'), ascending=False).head()

: 

In [None]:
# The highest standard deviation of sales
max_std = sales_stats['Weekly_Sales']['std'].max()
idx_max_std = sales_stats['Weekly_Sales']['std'].idxmax()
print(f"Store {idx_max_std} has the highest standard deviation of sales : {max_std:,.2f}")
print(f"Coefficient of mean to standard deviation : {sales_stats['Coefficient'].loc[idx_max_std]:.2f}")

: 
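
The "coefficient" used here is the standard deviation divided by the mean, commonly called the coefficient of variation. A minimal sketch on made-up sales (the frame `toy` is an assumption) shows why it is useful: it compares variability across stores of very different size.

```python
import pandas as pd

# Two made-up stores: store 2 sells ten times more and also swings more
toy = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2, 2],
    'Weekly_Sales': [100.0, 110.0, 90.0, 1000.0, 1500.0, 500.0],
})

stats = toy.groupby('Store')['Weekly_Sales'].agg(['mean', 'std'])
# Coefficient of variation: sample standard deviation relative to the mean
stats['cv'] = stats['std'] / stats['mean']
print(stats)
```

A store can have the largest standard deviation in absolute terms and still have a modest coefficient if its mean sales are also large.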

In [None]:
# Plot properties (distribution of sales)
data_plot = data_set.loc[data_set['Store']==idx_max_std, ['Weekly_Sales']]
# Configure the graph display 
fig, (ax_box, ax_hist) = plt.subplots(nrows=2, gridspec_kw = {"height_ratios": (.15, .85)}, figsize=(20,7))
sns.boxplot(data=data_plot, x='Weekly_Sales', ax=ax_box)
sns.histplot(data=data_plot, x='Weekly_Sales', kde=True, ax=ax_hist)
ax_box.set(xlabel='')
# Configure the title text
ax_box.set_title(f"Distribution of sales for store {idx_max_std}", fontsize=15)
ax_hist.set_xlabel("Weekly sales (USD)", fontsize=13)
ax_hist.set_ylabel("Count", fontsize=13);

: 

### **Question 3:** *Which store/s has good quarterly growth rate in Q3’2012*

In [None]:
col_use = ['Store','Date','Weekly_Sales']
df_growth = data_set[col_use].copy()
# Create new columns 'Year' and 'Quarter'
df_growth['Year'] = df_growth['Date'].dt.year
df_growth['Quarter'] = df_growth['Date'].dt.quarter

: 

In [None]:
# Keep only the 2012 data
# Quarter 2
quarter_2 = df_growth.loc[(df_growth['Year']==2012)&(df_growth['Quarter']==2)]
sales_Q2 = quarter_2.groupby('Store').agg({'Weekly_Sales':'sum'}).rename(columns={'Weekly_Sales':'Total_Sales_Q2'})
# Quarter 3
quarter_3 = df_growth.loc[(df_growth['Year']==2012)&(df_growth['Quarter']==3)]
sales_Q3 = quarter_3.groupby('Store').agg({'Weekly_Sales':'sum'}).rename(columns={'Weekly_Sales':'Total_Sales_Q3'})

: 

In [None]:
# Combining data from quarters 2 and 3
growth_rate = sales_Q2.merge(sales_Q3, on='Store')
growth_rate['Growth_Rate'] = (growth_rate['Total_Sales_Q3']-growth_rate['Total_Sales_Q2'])*100/growth_rate['Total_Sales_Q2']
growth_rate.sort_values('Growth_Rate', ascending=False).head()

: 

In [None]:
# The highest and the lowest growth rates in Q3 2012
print(f"Store {growth_rate['Growth_Rate'].idxmax()} has the highest growth rate of {growth_rate['Growth_Rate'].max():.2f}%.")
print(f"Store {growth_rate['Growth_Rate'].idxmin()} has the lowest growth rate of {growth_rate['Growth_Rate'].min():.2f}%.")

: 
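
The growth rate above is the percentage change of Q3 total sales relative to Q2 total sales. The sketch below reproduces that arithmetic on made-up figures with a single pivot instead of two filtered frames; `toy`, `by_quarter`, and `growth_pct` are names assumed for illustration.

```python
import pandas as pd

# Made-up weekly sales: one week per quarter and store, just to show the arithmetic
toy = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'Date': pd.to_datetime(['2012-05-04', '2012-08-03', '2012-05-04', '2012-08-03']),
    'Weekly_Sales': [200.0, 220.0, 100.0, 90.0],
})
toy['Quarter'] = toy['Date'].dt.quarter

# One row per store, one column per quarter
by_quarter = toy.pivot_table(index='Store', columns='Quarter',
                             values='Weekly_Sales', aggfunc='sum')

# Growth of Q3 over Q2, in percent
growth_pct = (by_quarter[3] - by_quarter[2]) * 100 / by_quarter[2]
print(growth_pct)
```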

In [None]:
# Plot properties (total sales and growth rate 2012 in quarter 2 and 3 for each store)
# Configure the graph display and title text
# Total sales
df_plot = pd.melt(growth_rate.drop('Growth_Rate',axis=1), ignore_index=False)
fig, ax = plt.subplots(nrows=2, figsize=(20,14))
sns.barplot(x=df_plot.index,
            y=df_plot['value'],
            hue=df_plot['variable'],
            ax=ax[0]
            )
ax[0].set_title("Total sales 2012 in quarter 2 and 3 for each store", fontsize=15)
ax[0].set_xlabel("Store number", fontsize=13)
ax[0].set_ylabel("Total sales (USD)", fontsize=13)
ax[0].legend(title='Quarter', fontsize=10)
# Growth rate
sns.barplot(x=growth_rate.index,
            y=growth_rate['Growth_Rate'],
            ax=ax[1]
            )
ax[1].set_title("Growth rate in the third quarter of 2012", fontsize=15)
ax[1].set_xlabel("Store number", fontsize=13)
ax[1].set_ylabel("Growth rate", fontsize=13);

: 

### **Question 4:** *Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together*
* Super Bowl : 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
* Labor Day : 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
* Thanksgiving : 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
* Christmas : 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

In [None]:
col_use = ['Date','Weekly_Sales','Holiday_Flag']
df_holiday = data_set.loc[data_set['Holiday_Flag']==1, col_use].copy()
holiday_list = {
    'Super Bowl' : pd.to_datetime(['2010-02-12','2011-02-11','2012-02-10']),
    'Labor Day' : pd.to_datetime(['2010-09-10','2011-09-09','2012-09-07']),
    'Thanksgiving' : pd.to_datetime(['2010-11-26','2011-11-25','2012-11-23']),
    'Christmas' : pd.to_datetime(['2010-12-31','2011-12-30','2012-12-28'])
}
# Create new columns 'Holiday'
for i in holiday_list:
  df_holiday.loc[df_holiday['Date'].isin(holiday_list[i]), 'Holiday'] = i

: 

In [None]:
# Mean sales for each holiday
# Holiday
mean_sales_holiday = df_holiday.groupby('Holiday')['Weekly_Sales'].mean().to_frame(name='Mean_Sales')
# Non-Holiday
non_holiday = {
    'Mean_Sales': data_set.loc[data_set['Holiday_Flag']==0, 'Weekly_Sales'].mean()
}
df_non_holiday = pd.DataFrame([non_holiday],index=['Normal Day'])
# Combining data from mean sales in holiday and non-holiday
mean_sales = pd.concat([mean_sales_holiday,df_non_holiday], axis=0)
mean_sales

: 

In [None]:
# The highest mean sales
print(f"'{mean_sales['Mean_Sales'].idxmax()}' has the highest mean weekly sales of all groups : {mean_sales['Mean_Sales'].max():,.2f} USD")

: 
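
The question asks for every holiday whose mean sales exceed the non-holiday mean, not only the single highest one. A minimal sketch of that comparison follows; the numbers in `toy_means` are invented for illustration and are not results from this dataset.

```python
import pandas as pd

# Invented mean sales per group, shaped like the mean_sales table above
toy_means = pd.Series(
    {'Super Bowl': 1080.0, 'Labor Day': 1040.0, 'Thanksgiving': 1470.0,
     'Christmas': 960.0, 'Normal Day': 1041.0},
    name='Mean_Sales',
)

baseline = toy_means['Normal Day']           # non-holiday mean
holiday_means = toy_means.drop('Normal Day')

# Holidays that beat the non-holiday baseline, largest first
above_baseline = holiday_means[holiday_means > baseline].sort_values(ascending=False)
print(above_baseline)
```

Applied to the real table, the same filter would use `mean_sales['Mean_Sales']` in place of `toy_means`.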

In [None]:
# Plot properties (mean sales for each holiday)
# Configure the graph display
plt.figure(figsize=(20,7))
ax = sns.barplot(x=mean_sales.index, 
                 y=mean_sales['Mean_Sales'], 
                 palette='pastel'
                 )
# Configure the title text
for index, row in mean_sales.reset_index().iterrows():
  ax.text(index, row['Mean_Sales']*1.01, '{:,.2f}'.format(row['Mean_Sales']), ha='center', color='black')
ax.set_title(f"Mean sales for each holiday", fontsize=15)
ax.set_xlabel("Holiday name", fontsize=13)
ax.set_ylabel("Mean sales (USD)", fontsize=13);

: 

### **Question 5:** *Provide a monthly and semester view of sales in units and give insights*

In [None]:
col_use = ['Weekly_Sales','Date']
df_sum = df[col_use].copy()
# Create new column 'Month','Year' and 'Semester'
df_sum['Month'] = df_sum['Date'].dt.month
df_sum['Year'] = df_sum['Date'].dt.year
year = df_sum['Year'].unique()
# Months 1-6 form the first semester, months 7-12 the second (no overlap in June)
Semester = {
    'Semester_1' : np.arange(1,7),
    'Semester_2' : np.arange(7,13)
    }
for idy,y in enumerate(year):
    for ids,s in enumerate(Semester):
        df_sum.loc[(df_sum['Month'].isin(Semester[s]))&(df_sum['Year']==y), 'Semester'] = ids+1+(idy*2)

: 
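
The running semester number (1 for the first half of the first year, 2 for its second half, 3 for the first half of the next year, and so on) can also be derived with integer arithmetic instead of nested loops. This is an alternative sketch on a few hand-picked dates; `frame`, `first_year`, and `half` are names assumed here.

```python
import pandas as pd

# A few dates spanning the period covered by the data
dates = pd.to_datetime(['2010-02-05', '2010-06-25', '2010-07-02',
                        '2011-01-07', '2012-10-26'])
frame = pd.DataFrame({'Date': dates})

first_year = frame['Date'].dt.year.min()
# Months 1-6 map to half 1, months 7-12 map to half 2
half = (frame['Date'].dt.month - 1) // 6 + 1
# Two semesters per year, counted from the first year in the data
frame['Semester'] = (frame['Date'].dt.year - first_year) * 2 + half
print(frame)
```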

In [None]:
df_sum.head()

: 

In [None]:
# Total monthly sales for each year
monthly_sales = df_sum.pivot_table(index='Month', values='Weekly_Sales', columns='Year', aggfunc='sum', fill_value=0, margins=True)
monthly_sales

: 

In [None]:
month = {
     1 : 'January',
     2 : 'February',
     3 : 'March',
     4 : 'April',
     5 : 'May',
     6 : 'June',
     7 : 'July',
     8 : 'August',
     9 : 'September',
    10 : 'October',
    11 : 'November',
    12 : 'December'
    }

: 

In [None]:
month.get(monthly_sales['All'][:-1].idxmax())

: 

In [None]:
print(f"The highest annual sales (2010-2012) were {monthly_sales.loc['All'][:-1].max():,.2f} USD in {monthly_sales.loc['All'][:-1].idxmax()}.")
print(f"The highest total monthly sales for 3 years (2010-2012) were {monthly_sales['All'][:-1].max():,.2f} USD in {month.get(monthly_sales['All'][:-1].idxmax())}.\n")
for i in year:
  print(f'------------------------- Year : {i} -------------------------')
  print(f"The highest sales in {i} were {monthly_sales[i][:-1].max():,.2f} USD in {month.get(monthly_sales[i][:-1].idxmax())}.")
  print(f"The lowest sales in {i} were {monthly_sales[i][:-1].min():,.2f} USD in {month.get(monthly_sales[i][:-1].idxmin())}.\n")

: 

In [None]:
# Total sales per semester
semester_sales = df_sum.pivot_table(index='Semester', values='Weekly_Sales', aggfunc='sum', margins=True)
semester_sales

: 

In [None]:
print(f"The highest sales in the semester were {semester_sales['Weekly_Sales'][:-1].max():,.2f} USD in semester {int(semester_sales['Weekly_Sales'][:-1].idxmax())}")
print(f"The lowest sales in the semester were {semester_sales['Weekly_Sales'][:-1].min():,.2f} USD in semester {int(semester_sales['Weekly_Sales'][:-1].idxmin())}")

: 

In [None]:
# Plot properties 
# Configure the graph display (total monthly sales)
fig, ax = plt.subplots(nrows=2, figsize = (20,14))
sns.lineplot(data=df_sum,
             x='Month',
             y='Weekly_Sales',
             hue='Year',
             estimator=np.sum,
             palette='pastel',
             errorbar=None,
             ax=ax[0]
             ).set_title('Total monthly sales for each year', fontsize=15)
ax[0].set_xlabel("Month", fontsize=13)
ax[0].set_xticks(range(1, 13))  # months run from 1 to 12
ax[0].set_ylabel("Total sales (USD)", fontsize=13)
ax[0].legend(title='Year', fontsize=10)
# Configure the graph display (total sales per semester)
sns.lineplot(data=df_sum,
             x='Semester',
             y='Weekly_Sales',
             estimator=np.sum,
             errorbar=None,
             ax=ax[1]
             ).set_title('Total sales per semester', fontsize=15)
ax[1].set_xlabel("Semester", fontsize=13)
ax[1].set_ylabel("Total sales (USD)", fontsize=13);

: 

## <center> **Modelling**

### **Question 6:** *Build prediction to forecast demand*

#### **Prepare data**

In [None]:
# Use the data after "Data Cleaning"
df_sales = df.copy()

: 

#### *Feature Engineering*

In [None]:
# Create new columns "Year", "Month" and "Holiday"
df_sales['Month'] = df_sales['Date'].dt.month.astype(object)
df_sales['Year'] = df_sales['Date'].dt.year.astype(object)
for event,date in holiday_list.items():
    df_sales.loc[df_sales['Date'].isin(date), 'Holiday'] = event
df_sales['Holiday'] = df_sales['Holiday'].fillna('Non-Holiday')  # assignment instead of chained inplace fill
df_sales.drop(columns=['Holiday_Flag'], inplace=True)
df_sales.set_index('Date', inplace=True)

: 

In [None]:
# Group temperature
bins = [0, 50, 77, 86, 104]
labels = [1, 2, 3, 4]
df_sales['Temp'] = pd.cut(df_sales['Temperature'], bins=bins, labels=labels, include_lowest=True).astype('object')
df_sales.drop('Temperature', axis=1, inplace=True)

: 
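
The grouping relies on how `pd.cut` treats bin edges: intervals are closed on the right, and `include_lowest=True` additionally admits the lowest edge. The sketch below applies the same bins and labels to a few hand-picked temperatures (the series `temps` is an assumption) to show where boundary values land and what happens outside the bins.

```python
import pandas as pd

# Hand-picked temperatures, including the bin edges and one value above the top edge
temps = pd.Series([0.0, 50.0, 50.1, 77.0, 86.0, 104.0, 110.0])

groups = pd.cut(temps, bins=[0, 50, 77, 86, 104], labels=[1, 2, 3, 4],
                include_lowest=True)
print(pd.DataFrame({'Temperature': temps, 'Temp': groups}))
```

Values outside the outermost edges become missing, so it is worth checking the minimum and maximum temperature before binning.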

In [None]:
df_sales.head()

: 

In [None]:
# Determine the category variable.
cat_var = [col for col in df_sales.columns if df_sales[col].dtypes=="object"]
print(f"Categorical columns : {cat_var}")

: 

##### *Dummies variable*

In [None]:
def createDummies(data, cat_col):
  # One-hot encode the listed columns; drop_first removes one redundant level per column
  return pd.get_dummies(data, columns=cat_col, drop_first=True)

: 

In [None]:
df_dum = createDummies(df_sales, cat_var)
df_dum.head()

: 
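
With `drop_first=True`, the first level of each encoded column (in sorted order) is dropped and becomes the implicit reference category. A minimal sketch on a made-up frame (`toy` is an assumption) shows which columns survive:

```python
import pandas as pd

# Made-up frame with one categorical and one numeric column
toy = pd.DataFrame({
    'Holiday': ['Non-Holiday', 'Christmas', 'Labor Day'],
    'CPI': [211.0, 212.0, 213.0],
})

# 'Christmas' sorts first, so it is the dropped reference level
encoded = pd.get_dummies(toy, columns=['Holiday'], drop_first=True)
print(encoded.columns.tolist())
```

A row with zeros (or `False`) in every `Holiday_*` column therefore stands for the dropped level.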

##### *Split data for train and test*


In [None]:
x = df_dum.drop(columns='Weekly_Sales')
y = df_dum['Weekly_Sales']
# Divide the data into 2 sets: a training set and a test set.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)
print(f"Training set : {x_train.shape[0]:,} samples ({x_train.shape[0]*100/df_dum.shape[0]:.2f}%)")
print(f"Test set : {x_test.shape[0]:,} samples ({x_test.shape[0]*100/df_dum.shape[0]:.2f}%)")

: 

#### **Feature Transformation**

##### *Feature Scaling*

In [None]:
# Feature scaling, use "StandardScaler"
sc = StandardScaler()
x_train_sc = sc.fit_transform(x_train)
df_train_sc = pd.DataFrame(x_train_sc, columns=x_train.columns)
df_train_sc.head()

: 

#### **Train Model**

##### *Algorithm Comparison*

In [None]:
# Candidate algorithms to compare
model_list = {
    'mlr' : LinearRegression(),
    'llr' : Lasso(tol = 1.275e+11),
    'rid' : Ridge(),
    'enr' : ElasticNet(),
    'rfr' : RandomForestRegressor()
}

: 

In [None]:
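def modelCompare(models, feature, target):
  # Mean 5-fold cross-validated R² (in %) for each candidate model
  r_score = dict()
  for name,model in models.items():
    # Scale inside a pipeline so each fold fits the scaler on its own training part
    pipe_cv = Pipeline([('sc', StandardScaler()), ('model', model)])
    cvs = cross_val_score(pipe_cv, feature, target, cv=5)
    r_score[name] = cvs.mean()*100
  return pd.Series(r_score, name='Score(%)')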

: 

In [None]:
score = modelCompare(model_list, x_train, y_train)
score

: 

In [None]:
# Choose the model with the highest mean cross-validated R²
model = model_list[score.idxmax()]
print(f"The algorithm '{model}' has the highest mean cross-validated R² ({score.max():.2f}%)")

: 

In [None]:
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

: 

In [None]:
# Evaluate the model.
R2 = r2_score(y_test, y_pred)
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)

: 

##### *Model Tuning*


In [None]:
# Model tuning with grid search (the grid below holds RandomForestRegressor parameters, so it assumes that model was selected above)
param_dist = {
    'criterion' : ['squared_error','absolute_error','friedman_mse','poisson'],
    'max_features' : [1.0,'sqrt','log2'],
}
grid_search = GridSearchCV(estimator=model, param_grid=param_dist, verbose=0, n_jobs=-1)
grid_result = grid_search.fit(x_train_sc, y_train)

: 

In [None]:
print(f"Parameters obtained from tuning : {grid_search.best_params_}")
print(f"Parameters of the tuned model : {grid_search.best_estimator_.get_params()}")

: 

##### *Predict values*


In [None]:
# Model parameterization
# Create pipeline
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('model', grid_search.best_estimator_),
])

: 

In [None]:
# Predict values
pipe.fit(x_train, y_train)
y_pred_tune = pipe.predict(x_test)

: 

In [None]:
# Evaluate the model.
R2_tune = r2_score(y_test, y_pred_tune)
MSE_tune = round(mean_squared_error(y_test, y_pred_tune), 2)
RMSE_tune = round(np.sqrt(MSE_tune), 2)

: 

In [None]:
# Conclude evaluation
print('+'+'-'*30+'+'+'-'*21+'+'+'-'*21+'+')
print(f"| Evaluate the model           |{' '*4}Before Tuning{' '*4}|{' '*4} After Tuning{' '*4}|")
print('+'+'-'*30+'+'+'-'*21+'+'+'-'*21+'+')
print(f"| Coefficient of Determination |{' '*7}{R2*100:.2f} %{' '*7}|{' '*7}{R2_tune*100:.2f} %{' '*7}|")
print(f"| Mean Square Error            |{' '*2}{MSE:,.2f}{' '*2}|{' '*2}{MSE_tune:,.2f}{' '*2}|") 
print(f"| Root Mean Square Error       |{' '*5}{RMSE:,.2f}{' '*5} |{' '*5}{RMSE_tune:,.2f} {' '*5}|")
print('+'+'-'*30+'+'+'-'*21+'+'+'-'*21+'+')

: 

In [None]:
# Create a dataframe to collect the predicted values.
result = y_test.to_frame()
result['Predict_Values'] = y_pred_tune  # tuned-pipeline predictions, matching R2_tune in the plot titles
result['Diff'] = abs(result['Weekly_Sales']-result['Predict_Values'])
result['Diff(%)'] = (result['Diff']*100)/result['Weekly_Sales']
result.head(5)

: 

In [None]:
# Configure the graph display (tuned-pipeline predictions against true values)
fig, ax = plt.subplots(figsize=(20,7))
sns.scatterplot(x=y_pred_tune,
                y=y_test,
                ax=ax
                )
# Reference line where predicted equals true
sns.lineplot(x=y_pred_tune,
             y=y_pred_tune,
             color='r',
             ax=ax
             )
# Configure the title text
ax.set_title(f"Result of weekly sales : R² {R2_tune*100:.2f} %", fontsize=15);
ax.set_xlabel("Predicted values", fontsize=13)
ax.set_ylabel("True values", fontsize=13);

: 

In [None]:
plt.figure(figsize=(20,7))
df_result = pd.melt(result[['Weekly_Sales', 'Predict_Values']], ignore_index=False)
ax = sns.lineplot(x=df_result.index, y=df_result.value, hue=df_result.variable)
ax.set_title(f"Result of weekly sales : R² {R2_tune*100:.2f} %", fontsize=15);

: 