# BigMart Data Analysis and Prediction

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim of this data science project is to build a predictive model that estimates the sales of each product at a particular store.
Using this model, BigMart will try to understand which properties of products and stores play a key role in increasing sales.
The data has missing values because some stores do not report all of their data due to technical glitches, so these values will need to be treated accordingly.

## Hypothesis Generation

### Item Features

Item features play a vital role in how much of an item is sold. The higher the quality and the more attractive the item, the more sales it generates.

- **Item_Identifier**: Some items may do exceptionally well with sales due to marketing reasons or other features that are not recorded here. Some items may perform much better in selected outlets.
- **Item_Weight**: The heavier the item, the cheaper it becomes per unit mass, so it may generate more sales.
- **Item_Fat_Content**: More fat means a tastier product, which can lead to higher sales. The majority prefer a tastier product over a healthier one. (This may change according to the <ins>*outlet location*</ins>: high-tier locations have educated customers who prefer healthier products.)
- **Item_Visibility**: There is a positive correlation between visibility and sales. Some <ins>*item types*</ins> may be more visible than others.
- **Item_Type**: Some types are must-have household staples, so they generate more sales. Also, some types tend to have a higher <ins>*MRP*</ins>.
- **Item_MRP**: The cheaper the item, the more sales it generates. However, some <ins>*types*</ins>, such as "household" and "health and hygiene", generate more sales when they are more expensive, because a higher price implies better quality. In addition, an item with high <ins>*visibility*</ins> (one that has a special shelf or is being advertised) and a high price may generate high sales.

### Outlet Features

Outlets use different strategies to generate more sales for their items, and each outlet has customers with different buying habits.

- **Outlet_Identifier**: Some outlets may do exceptionally well with sales due to marketing reasons or other features that are not recorded here.
- **Outlet_Establishment_Year**: Older outlets may have a better reputation with customers, so they generate higher sales. New outlets may perform better in high <ins>*tier locations*</ins>.
- **Outlet_Size**: The <ins>*visibility*</ins> of products may decrease in bigger outlets, generating fewer sales for each product.
- **Outlet_Location_Type**: The location may generate higher sales for some <ins>*item types*</ins> and may have higher prices (<ins>*MRP*</ins>).
- **Outlet_Type**: Some outlet types may stock more <ins>*item types*</ins> than others.
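
Once the data is loaded, hypotheses like these could be checked with a few simple summaries. The snippet below is only a minimal sketch: the helper names `numeric_correlations` and `mean_sales_by` are hypothetical, and the five rows are copied from the `df.head()` preview in the data-loading section, which is far too few to support any conclusion. With the real data, the full `df` would be passed instead of `sample`.

```python
import pandas as pd

# Five rows copied from the df.head() preview; illustrative only.
sample = pd.DataFrame({
    "Item_Visibility": [0.016047, 0.019278, 0.01676, 0.0, 0.0],
    "Item_MRP": [249.8092, 48.2692, 141.618, 182.095, 53.8614],
    "Outlet_Type": [
        "Supermarket Type1", "Supermarket Type2", "Supermarket Type1",
        "Grocery Store", "Supermarket Type1",
    ],
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27, 732.38, 994.7052],
})


def numeric_correlations(frame, target="Item_Outlet_Sales"):
    """Pearson correlation of every numeric column with the target column."""
    numeric = frame.select_dtypes(include="number")
    return numeric.corr()[target].drop(target)


def mean_sales_by(frame, column, target="Item_Outlet_Sales"):
    """Average target value per category, highest first."""
    return frame.groupby(column)[target].mean().sort_values(ascending=False)


# Item hypotheses: do visibility and MRP move together with sales?
correlations = numeric_correlations(sample)
print(correlations)

# Outlet hypotheses: do some outlet types sell more on average?
by_type = mean_sales_by(sample, "Outlet_Type")
print(by_type)
```

Correlation only captures linear relationships, so interaction hypotheses (for example, MRP behaving differently per item type) would need a grouped comparison such as `mean_sales_by` applied within each type.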

## Package and Data Loading

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv("./Train.csv")
df.head()

| Unnamed: 0 | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 |  | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |
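
The preview already shows a blank `Outlet_Size` for outlet `OUT010`, which matches the note in the introduction that some stores do not report all of their data. A quick way to see how widespread this is could be a per-column missing-value report. The sketch below assumes a hypothetical helper named `missing_value_report` and runs on a small stand-in frame built from the preview rows; on the real data it would be called as `missing_value_report(df)`.

```python
import numpy as np
import pandas as pd


def missing_value_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(frame)).round(2),
    })
    # Keep only the columns that actually have gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Stand-in built from three columns of the preview rows above.
sample = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15", "FDX07", "NCD19"],
    "Item_Weight": [9.3, 5.92, 17.5, 19.2, 8.93],
    "Outlet_Size": ["Medium", "Medium", "Medium", np.nan, "High"],
})

report = missing_value_report(sample)
print(report)
```

Knowing which columns are affected, and by how much, helps decide between dropping rows, imputing with a statistic such as the mean or mode, or treating "missing" as its own category.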
