# Prediction of sales

### Problem Statement
This dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Please note that the data may have missing values, because some stores might not report all the data due to technical glitches. These values will need to be treated accordingly.
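
As a rough illustration of treating missing values, here is a minimal sketch on a toy frame. Only the column names come from the table above; the values are hypothetical, and mean/mode filling is just one simple option, not necessarily the treatment used to prepare this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the table above.
toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "B", "B"],
    "Item_Weight": [9.3, np.nan, 17.5, 17.5],
    "Outlet_Size": ["Medium", None, "Small", "Small"],
})

# Count the missing values in each column before touching anything.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Numeric column: fill with the column mean (one simple option among many).
toy["Item_Weight"] = toy["Item_Weight"].fillna(toy["Item_Weight"].mean())

# Categorical column: fill with the most frequent value.
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Size"].mode()[0])

print(toy)
```

Filling per group (for example, the mean weight of the same `Item_Identifier`) is often a better choice than a global mean, but it depends on the data.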

---------------------

### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [1]:
import pandas as pd 
import numpy as np

In [4]:
df = pd.read_csv('regression_exercise_nums.csv', index_col=0)

In [5]:
df.head()

Unnamed: 0,item_weight,item_visibility,item_mrp,outlet_establishment_year,item_outlet_sales,years_operating,outlet_size_num,outtype_grocery_store,outtype_supermarket_type1,outtype_supermarket_type2,outtype_supermarket_type3,fatcont_low_fat,fatcont_non_consumable,fatcont_regular,itemcat_drinks,itemcat_food,itemcat_non_consumables,outlet_loctype_num
0,9.3,0.016047,249.8092,1999,3735.138,23,2,0,1,0,0,1,0,0,0,1,0,1
1,5.92,0.019278,48.2692,2009,443.4228,13,2,0,0,1,0,0,0,1,1,0,0,3
2,17.5,0.01676,141.618,1999,2097.27,23,2,0,1,0,0,1,0,0,0,1,0,1
3,19.2,0.0,182.095,1998,732.38,24,1,1,0,0,0,0,0,1,0,1,0,3
4,8.93,0.0,53.8614,1987,994.7052,35,3,0,1,0,0,0,1,0,0,0,1,3


In [10]:
y = df['item_outlet_sales']

In [7]:
X = df[['item_weight','item_visibility','item_mrp']]

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)

We covered dataset preparation and feature engineering two weeks ago, and we built Lasso and Ridge regressions on Monday. Today, we will work with ensemble methods.

-------------------------
### Model Building: Ensemble Models

Try out the different ensemble models (Random Forest Regressor, Gradient Boosting, XGBoost).
- **Note:** Spend some time on the documentation for each of these models.
- **Note:** As you spend time on this challenge, it is suggested to review how each of these models works and how they compare to each other.

Calculate the **mean squared error** on the test set. Explore how different parameters of the model affect the results and the performance of the model. (*Stretch: Create a visualization to display this information*)
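
One way to explore the effect of a parameter is to refit the model over a few values and record the training and test MSE for each. The sketch below does this for `max_depth` of a random forest. It uses synthetic data from `make_regression` as a stand-in for the sales features, and the names `X_demo`, `Xtr`, `Xte`, and `depth_results` are hypothetical, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sales data (assumption: 3 numeric features).
X_demo, y_demo = make_regression(n_samples=300, n_features=3, noise=20.0, random_state=15)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=15)

# Refit the forest for several tree depths and store (train MSE, test MSE).
depth_results = {}
for depth in [2, 5, 10]:
    rf = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=15)
    rf.fit(Xtr, ytr)
    depth_results[depth] = (
        mean_squared_error(ytr, rf.predict(Xtr)),
        mean_squared_error(yte, rf.predict(Xte)),
    )

for depth, (train_mse, test_mse) in depth_results.items():
    print(f"max_depth={depth:>2}  train MSE={train_mse:.1f}  test MSE={test_mse:.1f}")
```

A widening gap between training and test MSE as depth grows is a sign of overfitting. The stored pairs can be passed to a plotting library for the stretch visualization.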

- Use GridSearchCV to find optimal parameters of models.
- Compare against the Lasso and Ridge Regression models from Monday.
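
A grid search could be set up roughly as follows. This is a sketch on synthetic data, not the tuned configuration for this exercise: the grid values, the `cv=3` setting, and the names `X_demo`, `param_grid`, and `search` are assumptions chosen to keep the example small and fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the sales data (assumption: 3 numeric features).
X_demo, y_demo = make_regression(n_samples=300, n_features=3, noise=20.0, random_state=15)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=15)

# Hypothetical grid; widen or narrow it depending on the time available.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# scikit-learn maximizes scores, so MSE is exposed as its negative.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=15),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(Xtr, ytr)

print("Best parameters:", search.best_params_)
print("Cross-validated MSE:", -search.best_score_)
# The best estimator is refit on the full training split by default.
print("Test MSE:", mean_squared_error(yte, search.predict(Xte)))
```

The search only sees the training split, so the test MSE printed at the end remains an honest estimate for comparing against the Lasso and Ridge baselines.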

**Questions to answer:**
- Which ensemble model performed the best? 

In [37]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression

from sklearn.ensemble import VotingRegressor

from sklearn.linear_model import Ridge

In [18]:
models = {
    'linearReg':LinearRegression(),
    'rfRegressor':RandomForestRegressor(max_depth=10),
    'gBoost':GradientBoostingRegressor(),
    'xgBoost':XGBRegressor(max_depth=10)
}

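# Initialize a voting regressor that averages the predictions of all models.
# VotingRegressor expects a list of (name, estimator) tuples.
ensemble = VotingRegressor(list(models.items()), n_jobs=-1)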

In [19]:
ensemble.fit(X_train, y_train)

In [21]:
# score() returns R^2, so 1 - score is the fraction of variance left unexplained
print('Ensemble performance: \n')
print("Training error:   %.2f" % (1-ensemble.score(X_train, y_train)))
print("Validation error: %.2f" % (1-ensemble.score(X_test, y_test)))

Ensemble performance: 

Training error:   0.39
Validation error: 0.64


In [23]:
en_pred = ensemble.predict(X_test)

In [24]:
train_pred = ensemble.predict(X_train)

In [34]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

In [35]:
print(r2_score(y_test, en_pred))
print(mean_squared_error(y_test,en_pred))

0.3627715013077415
1875308.0356210654


In [36]:
r2_score(y_train, train_pred)

0.6120024108434671

In [30]:
lr = LinearRegression()

In [31]:
lr.fit(X_train, y_train)

In [32]:
lr_pred = lr.predict(X_test)

In [33]:
r2_score(y_test, lr_pred)

0.36162598759397524

In [38]:
ridge_reg = Ridge()

In [39]:
ridge_reg.fit(X_train, y_train)

In [42]:
rr_pred = ridge_reg.predict(X_test)

In [43]:
r2_score(y_test, rr_pred)

0.3621310307306038
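
To answer which model performs best, each candidate can be fit separately and ranked by test MSE. The sketch below shows the pattern on synthetic data; the names `candidates`, `test_mse`, and `best_name` are hypothetical, and XGBoost is left out so that the example only needs scikit-learn (an `XGBRegressor` could be added to the dictionary in the same way).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sales data (assumption: 3 numeric features).
X_demo, y_demo = make_regression(n_samples=300, n_features=3, noise=20.0, random_state=15)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=15)

candidates = {
    "lasso": Lasso(),
    "ridge": Ridge(),
    "rfRegressor": RandomForestRegressor(n_estimators=50, max_depth=10, random_state=15),
    "gBoost": GradientBoostingRegressor(random_state=15),
}

# Fit every model on the same split and record its test-set MSE.
test_mse = {}
for name, model in candidates.items():
    model.fit(Xtr, ytr)
    test_mse[name] = mean_squared_error(yte, model.predict(Xte))

# Print the models from lowest to highest test MSE.
for name, mse in sorted(test_mse.items(), key=lambda kv: kv[1]):
    print(f"{name:<12} test MSE = {mse:.1f}")

best_name = min(test_mse, key=test_mse.get)
print("Lowest test MSE:", best_name)
```

On the real data, the same loop can reuse `X_train`, `X_test`, `y_train`, and `y_test` from the split above in place of the synthetic arrays.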