# Intro
## General
Machine learning lets a user feed a large amount of data to an algorithm, which then analyzes it and makes data-driven recommendations and decisions based only on that input data.
In most situations we want a machine learning system to make **predictions**, and machine learning tasks fall into several categories depending on the type of prediction needed: **Classification, Regression, Clustering, Generation**, etc.

- **Classification** is the task of predicting the label of the class to which the input belongs (e.g., classifying images into two classes: cats and dogs).
- **Regression** is the task of predicting one or more numerical values related to the input (e.g., house rent prediction, estimated time of arrival).
- **Generation** is the task of creating something new related to the input (e.g., text translation, audio beat generation, image denoising).
- **Clustering** is the task of grouping a set of objects so that objects in the same group (called a **cluster**) are more similar, in some sense, to each other than to objects in other **clusters** (e.g., client clustering).
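
To make the difference between these task types concrete, here is a minimal sketch using scikit-learn on a made-up one-feature dataset. The data, the variable names, and the choice of `LogisticRegression`, `LinearRegression` and `KMeans` are assumptions for illustration only; they are not the models used later in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy one-feature dataset (made up for illustration only).
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Classification: predict a discrete class label.
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_class)
class_pred = clf.predict([[2.5], [10.5]])

# Regression: predict a numerical value.
y_value = 2.0 * X.ravel() + 1.0
reg = LinearRegression().fit(X, y_value)
value_pred = reg.predict([[4.0]])

# Clustering: group similar rows without using any labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print(class_pred, value_pred, cluster_ids)
```

Note that the classifier and the regressor both receive a target (`y_class`, `y_value`), while the clustering model only sees `X`.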

In machine learning, learning paradigms are distinguished by one aspect of the dataset: **the presence of the label to be predicted**.

- **Supervised Learning** applies when the dataset contains the label variables to be predicted, known as the `y` variables.
- **Unsupervised Learning** applies when the dataset does not contain the label variables to be predicted.
- **Self-supervised Learning** applies when part of the input data `X` is itself used as the label to be predicted (e.g., the dataset is made of texts and the model tries to predict the next word of each sentence).
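
The next-word example can be sketched in a few lines of plain Python: the labels are not annotated by hand but cut out of the input text itself. The helper name `next_word_pairs` and the sample sentence are hypothetical and only serve as an illustration.

```python
def next_word_pairs(sentence):
    """Turn one sentence into (context, next_word) training pairs."""
    words = sentence.split()
    # Each prefix of the sentence is an input; the word that follows is its label.
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


pairs = next_word_pairs("the store sells fresh bread")
for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```

Every pair comes from unlabeled text, which is what separates this paradigm from supervised learning, where the `y` variables must already exist in the dataset.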

## Notebook overview
This notebook walks step by step through building a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores. Favorita is a large grocery retailer based in Ecuador.

# Setup

## Installation
This section installs all the packages/libraries needed to tackle the challenge.

In [1]:
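# Uncomment the lines below to install any package that is missing.
# pip install pandas
# pip install numpy
# pip install scipy
# pip install matplotlib
# pip install seaborn
# pip install plotly
# pip install sweetviz
# pip install forex_python
# pip install babel
# pip install pandas-profiling
# pip install scikit-learn
# pip install imbalanced-learn
# pip install statsmodels
# pip install patool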
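
Before uncommenting the install commands, it can help to check which packages are already present. The following is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper name `missing_packages` is hypothetical, and the list holds a few of the pip names from the commands above.

```python
from importlib import metadata

# A few distribution names as they are passed to pip above.
required = ["pandas", "numpy", "matplotlib", "seaborn", "babel"]


def missing_packages(names):
    """Return the names for which no installed distribution is found."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


print(missing_packages(required))
```

An empty list means every listed package is installed; anything else names what still has to be installed.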

## Importation
This section imports all the packages/libraries used throughout this notebook.

In [2]:
# Data handling
import pandas as pd
import numpy as np
from statistics import mean
from forex_python.converter import CurrencyRates
from babel.numbers import format_currency
import datetime as dt

# Statistics
from scipy import stats
from scipy.stats import shapiro, trim_mean, mstats, mode
from scipy.stats import ttest_ind
from scipy.stats import skew


# Visualization (Matplotlib, Plotly, Seaborn, etc.)
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
from plotly.offline import plot
from IPython.display import display
import plotly.express as px
import plotly.graph_objects as go
import sweetviz as sv

# Data balancing (imbalanced-learn)
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Machine learning models and metrics
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report, accuracy_score, precision_score, recall_score,
    f1_score, roc_curve, auc, roc_auc_score,
)
from scipy.stats import chi2_contingency

# Feature processing (scikit-learn preprocessing, pipelines, etc.)
from sklearn.preprocessing import (
    StandardScaler, RobustScaler, OneHotEncoder, LabelEncoder, Binarizer,
    FunctionTransformer,
)
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.pipeline import Pipeline

# Model persistence
import joblib
import pickle

# Other packages
from statsmodels.tsa.seasonal import seasonal_decompose
import os
import warnings
warnings.filterwarnings('ignore')
import patoolib

# Data Loading
This section loads the datasets (train, eval, test) and the additional files.

In [3]:
holidaysurl="https://raw.githubusercontent.com/Gilbert-B/Regression-Project/main/data/holidays_events.csv"
oilurl="https://raw.githubusercontent.com/Gilbert-B/Regression-Project/main/data/oil.csv"
storesurl="https://raw.githubusercontent.com/Gilbert-B/Regression-Project/main/data/stores.csv"
testurl="https://raw.githubusercontent.com/Gilbert-B/Regression-Project/main/data/test.csv"
trainurl="https://github.com/Gilbert-B/Regression-Project/releases/download/data/train.csv"
transactionsurl="https://raw.githubusercontent.com/Gilbert-B/Regression-Project/main/data/transactions.csv"

In [4]:
# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on_bad_lines="skip" is the replacement and also skips malformed rows.
test_df = pd.read_csv(testurl, on_bad_lines="skip")
Train_df = pd.read_csv(trainurl, on_bad_lines="skip")
Holiday_df = pd.read_csv(holidaysurl, on_bad_lines="skip")
Transaction_df = pd.read_csv(transactionsurl, on_bad_lines="skip")
stores_df = pd.read_csv(storesurl, on_bad_lines="skip")
oil_df = pd.read_csv(oilurl, on_bad_lines="skip")
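
To see what skipping bad lines does in practice, here is a small self-contained sketch that reads an in-memory CSV instead of the remote files. The sample rows are made up and merely mimic the two columns of the oil file (`date`, `dcoilwtico`); `parse_dates` is an optional extra that this notebook's loading cell does not use. It assumes pandas 1.3 or newer.

```python
import io

import pandas as pd

# Hypothetical sample shaped like oil.csv; the values are made up.
raw = io.StringIO(
    "date,dcoilwtico\n"
    "2013-01-01,\n"
    "2013-01-02,93.14\n"
    "2013-01-03,92.97,unexpected_extra_field\n"
    "2013-01-04,93.12\n"
)

# The row with three fields is malformed and is dropped instead of raising an error.
sample_df = pd.read_csv(raw, on_bad_lines="skip", parse_dates=["date"])

print(sample_df.shape)
print(sample_df.dtypes)
```

Skipped rows disappear silently, so comparing the resulting row count with the expected one is a cheap sanity check after loading.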

# Exploratory Data Analysis: EDA

Details of the data cleaning process are provided in the EDA notebook, which can be found here:
https://github.com/Gilbert-B/Regression-Project/blob/main/Notebook/Regression-Project_EDA.ipynb

# Feature Processing & Engineering

## Drop Duplicates

In [5]:
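# Check each DataFrame for fully duplicated rows.
# Each line ending in a comma is a separate statement, so only the result
# of the last expression (Holiday_df) is displayed as the cell output.
Train_df.duplicated().any(),
stores_df.duplicated().any(),
oil_df.duplicated().any(),
test_df.duplicated().any(),
Holiday_df.duplicated().any()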

False
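
One way to get a result for every table at once is to collect the checks in a dictionary. This is a minimal sketch on toy DataFrames; the helper name `duplicate_report` is hypothetical, and in this notebook the dictionary would hold `Train_df`, `stores_df`, and the other frames instead.

```python
import pandas as pd


def duplicate_report(frames):
    """Return {name: number of fully duplicated rows} for each DataFrame."""
    return {name: int(df.duplicated().sum()) for name, df in frames.items()}


# Toy frames standing in for the real datasets.
toy_frames = {
    "with_dupes": pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]}),
    "clean": pd.DataFrame({"a": [1, 2, 3]}),
}

report = duplicate_report(toy_frames)
print(report)
```

Counting duplicates rather than calling `.any()` also shows how many rows `drop_duplicates()` would remove.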

In [6]:
# Count the missing values in each column of the oil data
null_values=pd.isnull(oil_df).sum()
null_values

date           0
dcoilwtico    43
dtype: int64

The oil data has 43 missing values, all in the `dcoilwtico` column.
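
A count alone does not say how large the gap is relative to the table. The sketch below adds the percentage of missing values per column; the helper name `missing_summary` and the toy values are assumptions for illustration, not figures from the real oil data.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Toy frame with the same two columns as the oil data; values are made up.
toy_oil = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=5, freq="D"),
    "dcoilwtico": [np.nan, 93.14, 92.97, np.nan, 93.12],
})

summary = missing_summary(toy_oil)
print(summary)
```

Calling `missing_summary(oil_df)` would give the same kind of table for the real data.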

In [None]:
## Impute Missing Values