# Bigmart Sales Data Set: machine learning for beginners

Retail is another industry that uses analytics extensively to optimize business processes. Tasks such as product placement, inventory management, customized offers, and product bundling are handled with data science techniques. As the name suggests, this data set comprises transaction records from a retail store. Predicting sales is a regression problem. The data has 8523 rows and 12 variables.

### Problem: Predict the sales of a store.

Let’s have a look at the Big Mart Sales data and build a linear regression model.

1. import libraries
    * pandas and numpy to manipulate the data
    * matplotlib and seaborn for the visualization
2. load data
3. data exploration and analysis
4. data cleaning
5. data visualization
6. feature selection
7. feature transformation
8. split data
9. machine learning algorithms
![](https://miro.medium.com/max/499/1*LXEEUY5Vf3tTCMFC41llJQ.png)

# 1) import libraries

### A) data manipulation libraries

In [None]:
import pandas as pd 
import numpy as np


### B) data visualization libraries

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
sns.set_style('darkgrid')


In [None]:
matplotlib.rcParams['figure.figsize']=(10,10)
matplotlib.rcParams['font.size']=15

# 2) load data

In [None]:
mart = pd.read_csv('../input/big-mart-data/bigmart_data.csv')
mart.head(10)

# 3) exploring data

In [None]:
mart.shape

So we have 8523 rows. The train data set has both input and output variable(s); the goal is to predict the sales for the test data set.

In [None]:
mart.columns

In [None]:
mart.info()

In [None]:
mart.dtypes.value_counts().plot.pie(explode=[0.1,0.1,0.1],autopct='%1.1f%%',shadow=True)
plt.title('type of our data');

##### type : object
* Item_Identifier      : Unique product ID
* Item_Fat_Content     : Whether the product is low fat or not
* Item_Type            : The category to which the product belongs
* Outlet_Identifier    : Unique store ID
* Outlet_Size          : The size of the store in terms of ground area covered
* Outlet_Location_Type : The type of city in which the store is located
* Outlet_Type          : Whether the outlet is just a grocery store or some sort of supermarket

##### type : numerical
* Item_Weight                 : Weight of the product
* Item_Visibility             : The % of total display area of all products in a store allocated to the particular product
* Item_MRP                    : Maximum Retail Price (list price) of the product
* Outlet_Establishment_Year   : The year in which the store was established
* Item_Outlet_Sales  (target) : Sales of the product in the particular store. This is the outcome variable to be predicted.

In [None]:
mart.describe(include='all')

# 4) data cleaning

In [None]:
mart.isnull().sum()

In [None]:
mart.isnull().sum()/len(mart)*100

***So about 17% of the Item_Weight values and 28% of the Outlet_Size values are missing.***
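
The same percentages can be computed with `isnull().mean()`, because the mean of a boolean column is the fraction of `True` values. A minimal sketch on a made-up frame (the `toy` values are invented for illustration and are not taken from the Big Mart file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `mart`; values are invented for illustration
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2],
    "Outlet_Size": ["Medium", None, None, "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Mean of the boolean mask = share of missing values per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

Sorting the result with `sort_values(ascending=False)` puts the columns that need the most attention first.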

### A) Item_Weight missing value

In [None]:
mart.Item_Weight.describe()

In [None]:
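# Assign the result back; fillna(..., inplace=True) on a selected column may not update mart in recent pandas
mart['Item_Weight'] = mart['Item_Weight'].fillna(mart['Item_Weight'].mean())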
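
Filling with the global mean is the simplest option. One possible refinement, sketched below on invented data, is to fill each missing weight with the mean weight of the same product and fall back to the global mean only when a product has no known weight. This assumes that a product's weight does not change between outlets, which the notebook does not verify.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `mart`; values are invented for illustration
toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "B", "B", "C"],
    "Item_Weight": [9.0, np.nan, 12.0, 14.0, np.nan],
})

# Per-product mean, aligned row by row with the original frame
item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# First try the per-product mean, then fall back to the global mean
filled = toy["Item_Weight"].fillna(item_mean)
filled = filled.fillna(toy["Item_Weight"].mean())
print(filled.tolist())
```

Product `C` has no known weight at all, so it is the only row that receives the global mean.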

### B) Outlet_Size missing value

In [None]:
mart.Outlet_Size.value_counts()

In [None]:
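# Fill with 'Medium', the value chosen from the counts above; assign back instead of using inplace=True
mart['Outlet_Size'] = mart['Outlet_Size'].fillna('Medium')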
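
Instead of hard-coding the fill value, the most frequent category can be read from the column itself with `mode()`. A small sketch on an invented series (the `size` values are illustrative only):

```python
import pandas as pd

# Toy column standing in for mart.Outlet_Size; values are invented
size = pd.Series(["Medium", "Small", None, "Medium", "High", None])

# mode() ignores missing values and returns the most frequent label(s)
most_common = size.mode()[0]
size_filled = size.fillna(most_common)
print(most_common, size_filled.isnull().sum())
```

This keeps the cell correct even if the data file changes and a different size becomes the most frequent one.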

### C) verification 

In [None]:
mart.isnull().sum()

# 5) data visualization

In [None]:
mart.hist(figsize=(15,15),edgecolor='black');

### A) type : numerical

* Item_Weight
* Item_Visibility
* Item_MRP
* Outlet_Establishment_Year
* Item_Outlet_Sales  (target)

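#### a) Item_Weight

In [None]:
# histplot replaces the deprecated distplot; stat='density' keeps the normalized histogram
sns.histplot(mart.Item_Weight,kde=True,bins=20,stat='density')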

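#### b) Item_Visibility

In [None]:
# histplot replaces the deprecated distplot; stat='density' keeps the normalized histogram
sns.histplot(mart.Item_Visibility,kde=True,bins=25,stat='density')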

#### c) Item_MRP

In [None]:
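# histplot replaces the deprecated distplot; stat='density' keeps the normalized histogram
sns.histplot(mart.Item_MRP,kde=True,bins=25,stat='density')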

#### d) Outlet_Establishment_Year

In [None]:
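# histplot replaces the deprecated distplot; stat='density' keeps the normalized histogram
sns.histplot(mart.Outlet_Establishment_Year,kde=True,bins=25,stat='density')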

In [None]:
sns.countplot(x='Outlet_Establishment_Year',data=mart)

In [None]:
mart.Outlet_Establishment_Year.value_counts().plot.pie(autopct='%1.1f%%',shadow=True)

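#### e) Item_Outlet_Sales (target)

In [None]:
# histplot replaces the deprecated distplot; stat='density' keeps the normalized histogram
sns.histplot(mart.Item_Outlet_Sales,kde=True,bins=25,stat='density')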

In [None]:
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [12, 12]})
# Plain count histogram without a KDE curve; histplot replaces the deprecated distplot
sns.histplot(mart['Item_Outlet_Sales'], bins=20, alpha=1).set(
    xlabel='Item_Outlet_Sales', ylabel='Count'
);

In [None]:
# Total sales per establishment year (sum only the target column)
y = mart.groupby('Outlet_Establishment_Year')['Item_Outlet_Sales'].sum()
x = y.index.astype(int)

plt.figure(figsize=(12,8))
ax = sns.barplot(y = y, x = x)
ax.set_xlabel(xlabel='Year', fontsize=16)
# Style the existing tick labels instead of overwriting them
ax.tick_params(axis='x', labelsize=12, labelrotation=50)
ax.set_ylabel(ylabel='Sales', fontsize=16)
ax.set_title(label='Sales Per Year', fontsize=20)
plt.show()

### B) type : object
* Item_Identifier      : Unique product ID       ( will be dropped )
* Item_Fat_Content     : Whether the product is low fat or not         
* Item_Type            : 	The category to which the product belongs          
* Outlet_Identifier    : Unique store ID                          
* Outlet_Size          : 	The size of the store in terms of ground area covered              
* Outlet_Location_Type : The type of city in which the store is located                    
* Outlet_Type          : Whether the outlet is just a grocery store or some sort of supermarket 

#### a) Item_Fat_Content

In [None]:
sns.countplot(x='Item_Fat_Content',data=mart);

In [None]:
mart.Item_Fat_Content.value_counts().plot.pie(explode=[0.05,0.05,0.05,0.05,0.05],autopct='%1.1f%%',shadow=False);

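#### b) Item_Type

In [None]:
Item_Type1=mart.Item_Type.value_counts().head(20)
# Create the figure before plotting; x and y must be passed as keywords in recent seaborn
plt.figure(figsize=(15,15))
sns.barplot(x=Item_Type1.index,y=Item_Type1.values)
plt.xticks(rotation=90);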

In [None]:
mart.Item_Type.value_counts().plot.pie(autopct='%1.1f%%',shadow=False);

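#### c) Outlet_Size

In [None]:
Outlet_Size1=mart.Outlet_Size.value_counts().head(3)
# Create the figure before plotting; x and y must be passed as keywords in recent seaborn
plt.figure(figsize=(15,15))
sns.barplot(x=Outlet_Size1.index,y=Outlet_Size1.values)
plt.xticks(rotation=90);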

In [None]:
mart.Outlet_Size.value_counts().plot.pie(explode=[0.05,0.05,0.05],autopct='%1.1f%%',shadow=False);

#### d) Outlet_Location_Type

In [None]:
Outlet_Location_Type1=mart.Outlet_Location_Type.value_counts().head(3)
# Create the figure before plotting; x and y must be passed as keywords in recent seaborn
plt.figure(figsize=(15,15))
sns.barplot(x=Outlet_Location_Type1.index,y=Outlet_Location_Type1.values)
plt.xticks(rotation=90);

In [None]:
mart.Outlet_Location_Type.value_counts().plot.pie(explode=[0.05,0.05,0.05],autopct='%1.1f%%',shadow=False);

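#### e) Outlet_Type

In [None]:
Outlet_Type1=mart.Outlet_Type.value_counts().head(5)
# Create the figure before plotting; x and y must be passed as keywords in recent seaborn
plt.figure(figsize=(15,15))
sns.barplot(x=Outlet_Type1.index,y=Outlet_Type1.values)
plt.xticks(rotation=90);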

In [None]:
mart.Outlet_Type.value_counts().plot.pie(explode=[0.05,0.05,0.05,0.05],autopct='%1.1f%%',shadow=False);

# 6) feature selection

### columns that we are going to delete
* Item_Identifier      : Unique product ID  ==> delete
* Outlet_Identifier    : Unique store ID  ==> delete
      

In [None]:
mart=mart.drop('Item_Identifier',axis=1)
mart=mart.drop('Outlet_Identifier',axis=1)
mart

# 7) transform data from categorical to numerical

In [None]:
# Item_Fat_Content/Item_Type/Outlet_Size/Outlet_Location_Type/Outlet_Type

In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()

### Item_Fat_Content

In [None]:
mart.Item_Fat_Content.value_counts()

In [None]:
mart.Item_Fat_Content=le.fit_transform(mart.Item_Fat_Content)

In [None]:
mart.Item_Fat_Content.value_counts()

* 0 ==> LF          
* 1 ==> Low Fat
* 2 ==> Regular    
* 3 ==> low fat
* 4 ==> reg
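
`LabelEncoder` assigns codes in sorted label order, which is why the five spellings above receive five different codes even though `LF`, `Low Fat` and `low fat` look like the same category, as do `Regular` and `reg`. A possible refinement, sketched here on an invented series, is to map the spellings to two canonical labels before encoding. The `canonical` mapping is an assumption about what the abbreviations mean, not something the data set documents.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for mart.Item_Fat_Content; values are invented
fat = pd.Series(["Low Fat", "Regular", "LF", "reg", "low fat", "Low Fat"])

# Hypothetical mapping of alternative spellings to canonical labels
canonical = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
fat_clean = fat.replace(canonical)

# Encode the cleaned labels; classes_ lists them in sorted order
encoder = LabelEncoder()
codes = encoder.fit_transform(fat_clean)
classes = list(encoder.classes_)
print(classes, codes.tolist())
```

With the cleaned column, the model sees one binary feature instead of five arbitrary integer levels.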

### Outlet_Size

In [None]:
mart.Outlet_Size.value_counts()

In [None]:
mart.Outlet_Size=le.fit_transform(mart.Outlet_Size)

In [None]:
mart.Outlet_Size.value_counts()

### Outlet_Location_Type

In [None]:
mart.Outlet_Location_Type.value_counts()

In [None]:
mart.Outlet_Location_Type=le.fit_transform(mart.Outlet_Location_Type)

In [None]:
mart.Outlet_Location_Type.value_counts()

### Outlet_Type

In [None]:
mart.Outlet_Type.value_counts()

In [None]:
mart.Outlet_Type=le.fit_transform(mart.Outlet_Type)

In [None]:
mart.Outlet_Type.value_counts()

### Item_Type

In [None]:
mart.Item_Type.value_counts()

In [None]:
mart.Item_Type=le.fit_transform(mart.Item_Type)

In [None]:
mart.Item_Type.value_counts()

In [None]:
mart

# 8) split data 

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
# accuracy_score is a classification metric and is not used for this regression task

In [None]:
x=mart.drop('Item_Outlet_Sales',axis=1)
y=mart.Item_Outlet_Sales

In [None]:
print(x.shape)
print(y.shape)

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25)
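
Without a fixed seed, `train_test_split` shuffles differently on every run, so the scores reported further down change from run to run. A minimal sketch of a reproducible split on invented arrays (the seed value 42 and the `X_toy`/`y_toy` names are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and target standing in for x and y
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.arange(8)

# random_state fixes the shuffle, so both calls return the same split
split_a = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

X_train_toy, X_test_toy, y_train_toy, y_test_toy = split_a
print(X_train_toy.shape, X_test_toy.shape)
```

With 8 rows and `test_size=0.25`, 6 rows go to training and 2 to testing.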

### 1) Linear Regression

In [None]:
from sklearn import linear_model
#Create the model
model = linear_model.LinearRegression()
#Fit the model
model.fit(x_train, y_train)
#score() returns R^2 (coefficient of determination), not classification accuracy
print("R^2 score (%) --> ", model.score(x_test, y_test)*100)
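
For scikit-learn regressors, `score` returns the coefficient of determination R², not a classification accuracy. R² compares the model's squared error with the squared error of always predicting the mean: 1 is a perfect fit, 0 is no better than the mean, and negative values are worse than the mean. A minimal sketch of the formula in plain Python (the helper name `r2_score_manual` and the numbers are made up for illustration):

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # mean-baseline error
    return 1 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

r2 = r2_score_manual(y_true, y_pred)

# Predicting the mean for every row gives R^2 = 0 by construction
mean_value = sum(y_true) / len(y_true)
r2_baseline = r2_score_manual(y_true, [mean_value] * len(y_true))
print(round(r2, 4), r2_baseline)
```

Multiplying by 100, as the cells in this section do, turns the value into a percentage of explained variance.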


### 2) RandomForestRegressor

In [None]:
#Create the model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=1000)
#Fit
model.fit(x_train, y_train)
#score() returns R^2 (coefficient of determination), not classification accuracy
print("R^2 score (%) --> ", model.score(x_test, y_test)*100)


### 3) GradientBoostingRegressor

In [None]:
#Create the model
from sklearn.ensemble import GradientBoostingRegressor
GBR = GradientBoostingRegressor(n_estimators=100, max_depth=4)
#Fit
GBR.fit(x_train, y_train)
#score() returns R^2 (coefficient of determination), not classification accuracy
print("R^2 score (%) --> ", GBR.score(x_test, y_test)*100)


### 4) XGBoost Regressor

In [None]:
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(x_train,y_train)
predictions = my_model.predict(x_test)

#MAE is in the same units as Item_Outlet_Sales (lower is better)
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,predictions)))
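
The XGBoost cell reports mean absolute error (MAE), while the scikit-learn models above report the value returned by `score`, so the numbers are not directly comparable. MAE is the average absolute gap between prediction and truth, in the units of the target. RMSE is a related metric that penalizes large errors more heavily. A minimal sketch of both in plain Python (the helper names and sales figures are invented for illustration):

```python
import math

def mean_absolute_error_manual(y_true, y_pred):
    # Average of |truth - prediction|
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error_manual(y_true, y_pred):
    # Square root of the average squared error
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 300.0, 480.0]

mae = mean_absolute_error_manual(y_true, y_pred)
rmse = root_mean_squared_error_manual(y_true, y_pred)
print(mae, round(rmse, 2))
```

The single large miss (80) pulls RMSE well above MAE, which is why reporting both can reveal whether a model makes a few large errors. Computing the same metric for every model would make the comparison in this section fair.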