# Problem Statement 

The BigMart Sales Prediction problem is to forecast the sales of each product at a specific store by understanding the characteristics of products and how they interact with factors unique to each store.

### Goal : 

Predict the sales of a product at a particular store in order to:

- improve inventory management
- increase sales
- inform marketing decisions

# Hypothesis Generation 

> Brainstorming the factors that affect the outcome.    

- consumer behaviour : 
    - age , income and family size
    - loyalty programs
    - marketing campaigns
    - online reviews

- product :
    - higher brand recognition ( higher sales )
    - near expiration date ( lower sales )
    - new product launches ( lower sales compared to established, well-known products )

- market (store) conditions : 
    - location ( traffic , income levels )
    - better placement , displays
    - shorter wait times 

- macro :
    - competitors' prices
    - inflation

### Loading packages

Load the essential packages to analyze, transform, and visualize the data.

In [1]:
import pandas as pd # data manipulation library
import numpy as np # scientific computing library
import matplotlib.pyplot as plt # basic visualization library
import seaborn as sns # advanced visualization library



### Data 

- Train.csv: this file includes "Item_Outlet_Sales" as the target variable, along with the product and store features used to train the model.

- Test.csv: the same features without the target variable. It is used to check whether the model's predictions generalize to unseen data and, more broadly, to identify issues during the model's evaluation phase.


### Data Structure and Content

I'll primarily use pandas, alongside numpy, to manipulate the data.

In [3]:
train_data = pd.read_csv('data/Train.csv')
test_data = pd.read_csv('data/Test.csv')

#### EDA - Exploratory Data Analysis

- I will explore the dimensions of the data, its features, and the target variable (columns).

In [10]:
train_data.shape

# 12 columns in the dataset
# 8523 rows in the dataset

test_data.shape

# 11 columns in the dataset ( Item_Outlet_Sales, the target variable, is absent )
# 5681 rows in the dataset ( about 2/3 the size of the train data )

(5681, 11)
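
One quick way to confirm that the two files differ only by the target column is to compare their column sets. The sketch below uses two tiny stand-in frames: the values are made up for illustration, and the helper name `column_difference` is hypothetical. With the real data you would pass `train_data` and `test_data` instead.

```python
import pandas as pd


def column_difference(train, test):
    """Return the columns present in `train` but absent from `test`, in train order."""
    return [col for col in train.columns if col not in test.columns]


# Stand-in frames with made-up values (assumption: not the real data).
toy_train = pd.DataFrame({
    "Item_Weight": [9.3, 5.92],
    "Outlet_Size": ["Medium", "Small"],
    "Item_Outlet_Sales": [3735.14, 443.42],
})
toy_test = toy_train.drop(columns=["Item_Outlet_Sales"])

print(column_difference(toy_train, toy_test))
```

If the list contains anything other than the target column, the two files have diverged in some other way and any later preprocessing should account for that.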

## Data Cleaning


In [None]:
# loading the data 
import pandas as pd

data = pd.read_csv('data/Train.csv')

In [None]:
# discovering the data

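print(data.head())      # first five rows
print(data.describe())  # summary statistics for the numeric columns
data.info()             # column dtypes and non-null counts (prints directly)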

In [None]:
# checking for missing values
data.isnull().sum()
# there are missing values in 'Item_Weight' and 'Outlet_Size' columns
# we will fill the missing values in 'Item_Weight' with the mean of the column
# and the missing values in 'Outlet_Size' with the mode of the column
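# assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and which may not modify `data`
data['Item_Weight'] = data['Item_Weight'].fillna(data['Item_Weight'].mean())
data['Outlet_Size'] = data['Outlet_Size'].fillna(data['Outlet_Size'].mode()[0])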
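
The mean and mode imputation described above can be sketched on a small frame. This is a minimal illustration with made-up values, and the helper name `impute_weight_and_size` is hypothetical. It also returns the fill values so they could be reused on the test set, which is a common practice but an assumption here, not something this notebook does.

```python
import numpy as np
import pandas as pd


def impute_weight_and_size(df, weight_fill=None, size_fill=None):
    """Fill Item_Weight with a mean and Outlet_Size with a mode.

    If no fill values are given, they are computed from `df` itself.
    Returns the filled copy together with the fill values that were used.
    """
    out = df.copy()  # leave the caller's frame untouched
    if weight_fill is None:
        weight_fill = out["Item_Weight"].mean()  # NaN values are skipped
    if size_fill is None:
        size_fill = out["Outlet_Size"].mode()[0]  # most frequent non-null value
    out["Item_Weight"] = out["Item_Weight"].fillna(weight_fill)
    out["Outlet_Size"] = out["Outlet_Size"].fillna(size_fill)
    return out, weight_fill, size_fill


# Made-up values for illustration only (assumption: not the real data).
toy = pd.DataFrame({
    "Item_Weight": [10.0, np.nan, 20.0],
    "Outlet_Size": ["Medium", "Medium", None],
})

filled, weight_fill, size_fill = impute_weight_and_size(toy)
print(filled)
print(filled.isnull().sum())
```

Passing the returned `weight_fill` and `size_fill` back into the function for another frame would fill that frame with the same values rather than recomputing them.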

In [None]:
# feature engineering 
