**PROBLEM STATEMENT**:
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store. 
Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

**HYPOTHESES GENERATION**:
>Product level hypotheses:
1. >Brand                : Branded products have more trust of the customers so they should have high sales.
2. >Visibility in Store  : Sales of a product also depend on where it is placed in the store.
3. >Display Area        : Products that are placed at an attention-catching place should have more sales.
4. >Utility             : Daily use products have a higher tendency to sell compared to other products.
5. >Packaging           : Quality packaging can attract customers and sell more.

>Store Level Hypotheses:
1. >City type         : Stores located in urban cities should have higher sales.
2. >Store Capacity    : One-stop shops are big in size so their sales should be high.
3. >Population density: Densely populated areas have high demands so the store located in these areas should have higher sales.
4. >Marketing         : Stores having a good marketing division can attract customers through the right offers.

In [None]:
#Loading packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
#Loading Data
#Raw strings keep the backslashes in the Windows paths from being read as escape sequences
train = pd.read_csv(r"D:\9961_14084_bundle_archive\Train.csv")
test = pd.read_csv(r"D:\9961_14084_bundle_archive\Test.csv")

In [None]:
train.shape ,test.shape

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.describe()

In [None]:
test.describe()

**Exploratory Data Analysis**

**Univariate**

In [None]:
sns.displot(train['Item_Outlet_Sales'])

**Bivariate**

In [None]:
plt.figure(figsize=(12,7))
plt.plot(train.Item_Weight,train['Item_Outlet_Sales'],'.',alpha=0.3)
plt.xlabel("Item_Weight")
plt.ylabel("Item_Outlet_Sales")
plt.title("Item_Weight and Item_Outlet_Sales Analysis")

In [None]:
plt.figure(figsize=(12,7))
plt.plot(train.Item_Visibility,train['Item_Outlet_Sales'],'.',alpha=0.3)
plt.xlabel("Item_Visibility")
plt.ylabel("Item_Outlet_Sales")
plt.title("Item_Visibility and Item_Outlet_Sales Analysis")

In [None]:
sns.scatterplot(x=train.Item_Fat_Content, y=train.Item_Outlet_Sales)


In [None]:
#numeric_only=True (pandas 1.5+) restricts the correlation to the numeric columns
corr_matrix=train.corr(numeric_only=True)
sns.heatmap(corr_matrix,annot=True,cmap='coolwarm')

**Multivariate**

In [None]:
sns.scatterplot(data=train, x="Item_MRP", y="Item_Outlet_Sales", hue="Outlet_Size")


In [None]:
# Create a bar plot of Outlet_Location_Type vs Item_Outlet_Sales
sns.barplot(data=train, x='Outlet_Location_Type', y='Item_Outlet_Sales', hue='Outlet_Size')

In [None]:
sns.scatterplot(data=train, x="Item_MRP", y="Item_Outlet_Sales", hue="Item_Weight", size="Item_Visibility")

In [None]:
sns.catplot(x="Item_Type", y="Item_Outlet_Sales", hue="Item_Fat_Content", col="Outlet_Identifier", data=train, kind="bar", height=4, aspect=0.7)

In [None]:
# join train and test dataset
train['source'] ='train'
test['source']='test'
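#ignore_index=True gives every row a unique label, which the label-based assignments below rely on
data = pd.concat([train,test],ignore_index=True)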
data.shape

In [None]:
data['Item_Outlet_Sales'].describe()

In [None]:
data.apply(lambda x: len(x.unique()))

This tells us that there are 1559 products and 10 outlets/stores (as mentioned in the problem statement), and that Item_Type has 16 unique values.

In [None]:
#look at categorical and numerical variables
data.dtypes

**Data Cleaning**

In [None]:
#Finding missing values
#Check whether train and test contain any missing values
train.isnull().values.any() ,test.isnull().values.any()

In [None]:
train.isnull().sum() ,test.isnull().sum()

In [None]:
data.isnull().sum()

In the test dataset, Item_Outlet_Sales is missing because it is the target column, which the test data doesn't contain.
In the train dataset, Item_Weight and Outlet_Size contain missing values. Item_Weight is a numerical feature, so we will fill its missing values with the mean. Outlet_Size is a categorical feature, so we will fill its missing values with the mode.

In [None]:
data["Item_Weight"].mean()

In [None]:
data["Item_Weight"]=data["Item_Weight"].fillna(np.mean(data["Item_Weight"]))

In [None]:
data["Item_Weight"].isnull().sum()

In [None]:
data["Outlet_Size"].mode()

In [None]:
data["Outlet_Size"]=data["Outlet_Size"].replace(np.nan, 'Medium')

In [None]:
data["Outlet_Size"].isnull().sum()

In [None]:
data.plot(kind='box',subplots=True,layout=(3,3),sharex=False,sharey=False,figsize=(15,15))
sns.set(font_scale=1.5)

In [None]:
data['Item_Visibility'].value_counts()

In [None]:
data['Item_Visibility'].describe()

Item_Visibility has a minimum value of 0. This makes no sense for a product that is sold in a store, so let's treat 0 as a missing value and impute it with the mean visibility of that product.

In [None]:
#Determine average visibility of a product
visibility_avg= data.pivot_table(values='Item_Visibility', index='Item_Identifier')
#Impute 0 values with mean visibility of that product
missing_values=(data['Item_Visibility']==0)
print('Number of 0 values initially: %d'%sum(missing_values))
data.loc[missing_values,'Item_Visibility']=data.loc[missing_values,'Item_Identifier'].apply(lambda x: visibility_avg.at[x,'Item_Visibility'])
print('Number of 0 values after modification: %d'%sum(data['Item_Visibility']==0))
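
The same per-product imputation can be sketched on a small made-up frame with `groupby(...).transform('mean')`, which avoids the row-wise `apply`. The values below are invented for illustration. As in the cell above, the product mean is computed with the zeros still included.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the BigMart files)
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "FDA15", "DRC01", "DRC01"],
    "Item_Visibility": [0.02, 0.0, 0.04, 0.0, 0.10],
})

# Mean visibility of each product, broadcast back to every row of that product
product_mean = toy.groupby("Item_Identifier")["Item_Visibility"].transform("mean")

# Replace only the zero entries with the mean of their product
is_zero = toy["Item_Visibility"] == 0
toy.loc[is_zero, "Item_Visibility"] = product_mean[is_zero]

print(toy)
```

If every row of a product were 0, its mean would also be 0 and the value would stay unchanged, so it is worth re-checking the count of zeros afterwards.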

**Feature Engineering**

In [None]:
#Modify categories of Item_Fat_Content
data['Item_Fat_Content'].value_counts()

In [None]:
data['Item_Fat_Content']=data['Item_Fat_Content'].replace({'LF':'Low Fat','low fat':'Low Fat','reg':'Regular'})
data['Item_Fat_Content'].value_counts()

In [None]:
data['Outlet_Establishment_Year'].value_counts()

In [None]:
#Create new column Outlet_Years. Remember the data is from 2013
data['Outlet_Years']=2013-data['Outlet_Establishment_Year']
data['Outlet_Years'].describe()

In [None]:
data['Item_Identifier'].value_counts()

In [None]:
#Create a broad category of Type of Item ID
#Item_Type has 16 categories, which may be too fine-grained to be useful in analysis, so it is a good idea to combine them.
#The first two characters of Item_Identifier (FD, NC, DR) give the broader category
data['Item_Type_Combined']=data['Item_Identifier'].apply(lambda x: x[0:2])
#Rename them to more intuitive categories
data['Item_Type_Combined']=data['Item_Type_Combined'].map({'FD':'Food','NC':'Non-Consumable','DR':'Drinks'})
data['Item_Type_Combined'].value_counts()

**Outliers Handling**

In [None]:
data['Item_Visibility']=np.log(data['Item_Visibility'])

In [None]:
data['Item_Outlet_Sales']=np.log(data['Item_Outlet_Sales'])
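
Because the target is now on a log scale, the predictions and error metrics of any model trained on it will be in log units as well. A minimal sketch of the round trip, using made-up sales figures: `np.exp` undoes `np.log`, so predictions can be converted back to the original sales units before they are reported.

```python
import numpy as np

# Hypothetical sales values in original units (invented for illustration)
sales = np.array([250.0, 1200.0, 3735.14])

log_sales = np.log(sales)       # scale the model is trained on
recovered = np.exp(log_sales)   # back-transform to original units

print(np.allclose(recovered, sales))
```

Note that `np.log` is only defined for positive values, which is why the zero visibilities had to be replaced before this step.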

In [None]:
#List the numerical features (the categorical ones are label encoded below)
numerical_columns = [col for col in data.columns if data.dtypes[col]!='object']
print('Numerical Features are:',numerical_columns)

In [None]:
categorical_columns = [col for col in data.columns if data.dtypes[col]=='object']
print('Categorical Features are:',categorical_columns)

In [None]:
from sklearn.preprocessing import LabelEncoder


In [None]:
le = LabelEncoder()
#New variable for Outlet
data['Outlet']= le.fit_transform(data['Outlet_Identifier'])

In [None]:
var_mod=['Item_Type_Combined','Item_Fat_Content', 'Outlet_Size','Outlet_Location_Type','Outlet_Type']

In [None]:
for i in var_mod:
    data[i]=le.fit_transform(data[i])

In [None]:
#One Hot Encoding
data = pd.get_dummies(data, columns=['Item_Fat_Content','Outlet_Size','Outlet_Location_Type','Outlet_Type','Outlet','Item_Type_Combined'])
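
A small sketch of what the two encoding steps above do to a single column, on invented values. To keep the example dependent on pandas only, `cat.codes` stands in for `LabelEncoder` here (an assumption of this sketch; both assign integer codes in sorted order of the labels).

```python
import pandas as pd

# Toy column (hypothetical values)
toy = pd.DataFrame({"Outlet_Size": ["Medium", "Small", "High", "Medium"]})

# Integer-code the labels in sorted order: High -> 0, Medium -> 1, Small -> 2
toy["Outlet_Size"] = toy["Outlet_Size"].astype("category").cat.codes

# One indicator column per code; the column names carry the code, not the label
dummies = pd.get_dummies(toy, columns=["Outlet_Size"])

print(list(dummies.columns))
```

Since the indicator columns are named after integer codes, it may help to keep a record of which code corresponds to which original label when interpreting the model later.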

In [None]:
# Drop columns that have already been converted into other features
data.drop(['Item_Type','Outlet_Establishment_Year'],axis=1,inplace=True)

# Divide data into train and test
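#.copy() makes independent frames so the in-place drops below do not act on a view of data
train= data.loc[data['source']=="train"].copy()
test=data.loc[data['source']=="test"].copy()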
#Drop unnecessary columns
test.drop(['Item_Outlet_Sales','source'],axis=1,inplace=True)
train.drop(['source'],axis=1,inplace=True)
#Export files as modified versions
train.to_csv("train_modified.csv",index=False)
test.to_csv("test_modified.csv",index=False)
X_test=test.drop(['Item_Identifier','Outlet_Identifier'],axis=1).copy()
X_train =train.drop(['Item_Outlet_Sales','Item_Identifier','Outlet_Identifier'],axis=1)
y_train =train['Item_Outlet_Sales']


**Data Modeling**

In [None]:
#Model Building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,Ridge
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
#Hold out 20% of the labelled training data for validation
#Note: this overwrites the earlier X_test that was built from the unlabelled test file
X_train,X_test,y_train,y_test=train_test_split(X_train,y_train,test_size=0.2,random_state=42)

In [None]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

In [None]:
model=LinearRegression()
model.fit(X_train,y_train)

In [None]:
model_pred= model.predict(X_test)
model_pred

In [None]:
from sklearn.metrics import mean_squared_error
model_rmse = np.sqrt(mean_squared_error(y_test, model_pred))
model_mean=mean_absolute_error(y_test,model_pred)
model_r2=r2_score(y_test,model_pred)
print("MAE of LR model is:",model_mean)
print("R2 score of LR model is:",model_r2)
print('Linear Regression RMSE:', model_rmse)

In [None]:
model1=Ridge(alpha=0.1)
model1.fit(X_train,y_train)

In [None]:
y_pred_model1= model1.predict(X_test)

In [None]:
model1_rmse = np.sqrt(mean_squared_error(y_test, y_pred_model1))
model1_mean=mean_absolute_error(y_test,y_pred_model1)
model1_r2=r2_score(y_test,y_pred_model1)
print("Ridge Regression MAE :",model1_mean)
print("Ridge Regression R2 score :",model1_r2)
print('Ridge Regression RMSE:', model1_rmse)

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
model2=RandomForestRegressor()
model2.fit(X_train,y_train)

In [None]:

rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)

rf_reg_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rf_reg_mean=mean_absolute_error(y_test,y_pred)
rf_reg_r2=r2_score(y_test,y_pred)
print("Random Forest Regression MAE :",rf_reg_mean)
print("Random Forest Regression R2 score :",rf_reg_r2)
print('Random Forest Regression RMSE:', rf_reg_rmse)

In [None]:
from xgboost import XGBRegressor


In [None]:
model4=XGBRegressor()
model4.fit(X_train,y_train)


In [None]:
y_pred_xgb=model4.predict(X_test)
model4_rmse = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
model4_mean=mean_absolute_error(y_test,y_pred_xgb)
model4_r2=r2_score(y_test,y_pred_xgb)
print("XGB Regression MAE :",model4_mean)
print("XGB Regression R2 score :",model4_r2)
print('XGB Regression RMSE:', model4_rmse)

The Linear Regression and Ridge Regression models both give MAE=0.41, RMSE=0.53, R2=0.72, while Random Forest gives MAE=0.42, RMSE=0.55, R2=0.71. These metrics are computed on the log-transformed Item_Outlet_Sales.
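
To compare models side by side, the three metrics can be collected into one table. The sketch below uses made-up predictions and plain NumPy formulas that mirror `mean_absolute_error`, `mean_squared_error` and `r2_score`; the helper name `regression_report` is hypothetical and the numbers are not results from this notebook.

```python
import numpy as np
import pandas as pd

def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # R2 = 1 - residual sum of squares / total sum of squares
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Invented values standing in for y_test and the predictions of two models
y_true = [1.0, 2.0, 3.0, 4.0]
preds = {
    "model_a": [1.0, 2.0, 3.0, 4.0],   # perfect fit
    "model_b": [2.5, 2.5, 2.5, 2.5],   # always predicts the mean of y_true
}

# One row per model, one column per metric
report = pd.DataFrame({name: regression_report(y_true, p) for name, p in preds.items()}).T
print(report)
```

In the notebook, the dictionary could instead hold `model_pred`, `y_pred_model1`, `y_pred` and `y_pred_xgb`, with `y_test` as the reference.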