# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint


## Objectives


Sales forecasting is the process of estimating future sales. Accurate sales forecasts enable companies to make sound business decisions and predict short-term or long-term performance. Forecasts could be based on data such as past sales, industry-wide comparisons, and economic trends.


A leading retailer in the USA wants to forecast sales for the product categories in its store, based on the sales history of each category. Sales or revenue forecasting is very important for retail operations: it helps the retailer plan budgets and investments for a period (monthly or yearly) across product categories such as women's clothing, men's clothing, and other clothing. It also lets the retailer reduce the revenue lost when products are unavailable, by investing accordingly.

**Note: This data is proprietary. Please DO NOT share the dataset with anyone. The solution Python notebook and test solution will not be provided.**

In [None]:
#@title Mini Hackathon Walkthrough Video
from IPython.display import HTML

HTML("""<video width="500" height="300" controls>
  <source src="https://cdn.exec.talentsprint.com/content/mini_hackathon_walkthrough.mp4" type="video/mp4">
</video>
""")

## Kaggle link and deadline:


### 1. Link to the Kaggle problem: https://www.kaggle.com/t/655beb4ca16149639522c5998d0a9770

### 2. Deadlines:
- **Competition closes at** 6:00 PM IST or 12:30 PM UTC, 18th Sep 2021
- **Submit this Colab file with code to aimlkaggle@gmail.com:** 7:00 PM IST or 1:30 PM UTC, 18th Sep 2021

## Instructions:

- Refer to the document **MiniHackathon- Kaggle Team Creation** to create a Kaggle account. After logging in to your Kaggle account, open the Kaggle problem and follow the steps for team creation in Kaggle.
- Under the 'Data' tab within the Kaggle competition page (link above), you can find four datasets. Their attributes are given in the "Attributes description".
- Follow **Stage 1** for downloading the data 
- Combine the datasets and apply data-preprocessing to obtain a clean training dataset
- Build your own model using any of the algorithms learned so far
- **Get the Sales predictions for 2015 month-wise and product-wise** (36 rows)
- Copy and paste the predictions into column B (Sales(In ThousandDollars)) of the **Sample_Submission csv file** (ignore the headers)
- Upload the Sample_Submission csv file to Kaggle by clicking Submit Predictions.
- The leaderboard reflects your best submission up to the specified deadline. A maximum of 20 submissions is accepted per day, with days counted from 0:00 UTC.
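
The copy-and-paste step above can also be done in code. The following is a minimal sketch, assuming the prediction column is named `Sales(In ThousandDollars)` as stated in the instructions. The helper `fill_submission` is hypothetical, and the demo frame (including its `Year` id column and its values) only stands in for the real Sample_Submission file.

```python
import pandas as pd

TARGET = "Sales(In ThousandDollars)"  # column name taken from the instructions


def fill_submission(sample: pd.DataFrame, predictions) -> pd.DataFrame:
    """Return a copy of the sample submission with the target column filled in."""
    predictions = list(predictions)
    if len(predictions) != len(sample):
        raise ValueError(f"expected {len(sample)} predictions, got {len(predictions)}")
    filled = sample.copy()
    filled[TARGET] = predictions
    return filled


# Stand-in for the real sample file (assumption: 36 rows, one per month and product)
sample = pd.DataFrame({"Year": range(1, 37), TARGET: [0.0] * 36})
submission = fill_submission(sample, [100.0 + i for i in range(36)])

# submission.to_csv("submission.csv", index=False)  # file to upload to Kaggle
```

Checking the row count before writing the file catches the most common mistake, a prediction array that does not line up with the 36 expected rows.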

### **Important: Only the Public Leaderboard rankings are valid, not the Private Leaderboard rankings.**

## Evaluation: 
Evaluation is based on each team's position on the Kaggle leaderboard.

**TotalMarks=20**

- The top 4 teams will be awarded 20 marks
- Teams ranked 5-8 will be awarded 18 marks
- Teams ranked 9-12 will be awarded 16 marks
- All remaining teams will be awarded 14 marks
- **0 Marks in case of 0 submissions**
 

 ## Finally...
    Don't cheat!
    Apply yourself!
    Have fun!


## **Stage 1:** Setting up Colab for Kaggle competitions
This setup lets you access the competition's datasets directly from Colab.

### 1. Create an API key in Kaggle.

To do this, go to kaggle.com and open your user settings page by clicking My Account.

![alt text](https://i.stack.imgur.com/jxGQv.png)



### 2. Next, scroll down to the API access section and click generate to download an API key. 
![alt text](https://i.stack.imgur.com/Hzlhp.png)

### 3. Upload your kaggle.json file using the following snippet in a code cell:



In [None]:
from google.colab import files
files.upload()

In [None]:
#If successfully uploaded in the above step, the 'ls' command here should display the kaggle.json file.
%ls

### 4. Install the Kaggle API using the following command


In [None]:
!pip install -q kaggle

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
#Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

In [None]:
!chmod 600 ~/.kaggle/kaggle.json  # restrict the token file to owner read/write so your Kaggle API token stays private on Colab
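
The shell commands in step 5 (create the folder, copy the token, restrict its permissions) can also be run from Python. This is a minimal sketch using only the standard library. The helper name `install_kaggle_token` is hypothetical and not part of the Kaggle API, and the demo uses a throwaway directory with a dummy token.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(src: Path, dest_dir: Path) -> Path:
    """Copy kaggle.json into dest_dir and make it readable by the owner only."""
    dest_dir.mkdir(parents=True, exist_ok=True)    # same effect as `mkdir -p`
    dest = dest_dir / "kaggle.json"
    shutil.copy(src, dest)                         # same effect as `cp`
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)    # same effect as `chmod 600`
    return dest


# Demo in a temporary directory with a dummy token (not a real key)
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "kaggle.json"
    src.write_text('{"username": "demo", "key": "not-a-real-key"}')
    dest = install_kaggle_token(src, Path(tmp) / ".kaggle")
    mode = stat.S_IMODE(dest.stat().st_mode)       # permission bits of the copy
```

In Colab the equivalent call would be `install_kaggle_token(Path("kaggle.json"), Path.home() / ".kaggle")`.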

### 6. Now download the data

In [None]:
!mkdir -p data

**NOTE: If you get a '404 - Not Found' error after running the cell below, it is most likely that the user (whose kaggle.json is uploaded above) has not 'accepted' the rules of the competition and therefore has 'not joined' the competition.**

In [None]:
#If you get a forbidden link, you have most likely not joined the competition.
!kaggle competitions download -c retail-case-study-batch17 -p data

## **Stage 2:** YOUR CODE to crack the Kaggle problem here. 

1.  Get the sales predictions for 2015, month-wise and product-wise (36 rows in total). The product order within each month can follow the test_kaggle.csv file.

2.  Copy and paste the predictions into Sample_Submission.csv (in the Sales(In ThousandDollars) column) and upload the file to Kaggle.

After uploading the predictions in Kaggle, the RMSE score will be displayed on the leaderboard.

Understand the RMSE score [here](https://medium.com/analytics-vidhya/forecast-kpi-rmse-mae-mape-bias-cdc5703d242d) with an example.
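
RMSE is the square root of the mean of the squared differences between actual and predicted values, so it is expressed in the same units as the sales figures and penalizes large errors more heavily. A minimal sketch with made-up numbers (the values are purely illustrative and not taken from the competition data):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

# Errors are -10, 10 and -30; squared they are 100, 100 and 900 (mean 1100 / 3)
score = rmse(actual, predicted)
print(round(score, 2))  # about 19.15
```

The single error of 30 contributes 900 of the 1100 total, which shows how one poor monthly forecast can dominate the score.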

**Note: It is best to write all the code here. (If for any reason you are using other Colab files, cut and paste the code from there into this notebook.)**

# **Weather Data**

In [None]:
import pandas as pd
import numpy as np

#READING ONE SHEET PER YEAR (2009-2016)
sheets = [pd.read_excel('data/WeatherData.xlsx', sheet_name=str(yr)) for yr in range(2009, 2017)]

#CONCATENATING ALL SHEETS
Wther = pd.concat(sheets, ignore_index=True)
  
#DROPPING COLUMNS
Weather=Wther.drop(['Temp high (°C)','Temp low (°C)','Dew Point high (°C)','Dew Point low (°C)','Humidity\xa0(%) high','Humidity\xa0(%) low','Sea Level Press.\xa0(hPa) high','Sea Level Press.\xa0(hPa) low','Visibility\xa0(km) high','Visibility\xa0(km) low','Wind\xa0(km/h) high','Wind\xa0(km/h) low','WeatherEvent','Precip.\xa0(mm) sum'],axis=1)

#DEALING WITH EMPTY VALUES
Weather=Weather.replace('-',0)
Weather=Weather.replace('avg',0)
Weather = Weather.dropna()
Weather['Avg_weather'] =  Weather.iloc[:, 3:9].mean(axis=1)
#Weather = Weather.drop(['Temp avg (°C)', 'Dew Point avg (°C)', 'Humidity (%) avg', 'Sea Level Press. (hPa) avg', 'Visibility (km) avg', 'Wind (km/h) avg'])

#YEAR & MONTH LABELS 
year={2009.0:1,2010.0:2,2011.0:3,2012.0:4,2013.0:5,2014.0:6,2015.0:7,2016.0:8}
month={'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7, 'Aug':8, 'Sep':9,'Oct':10, 'Nov':11, 'Dec':12}
Weather["Year"] = Weather["Year"].map(year)
Weather["Month"] = Weather["Month"].map(month)

#AVG_WEATHER
Weather_avg = Weather.groupby(['Year','Month'])['Avg_weather'].mean()
Weather_avg = Weather_avg.round(1)

#PRINTING DATA
print(Weather_avg.shape)

Weather_avg.head(2)

# **Macro_Economic_Data**

In [None]:
macroEco=pd.read_excel('data/MacroEconomicData.xlsx')
macroEco['Year'] = [each.split()[0] for each in macroEco['Year-Month']]
macroEco['Month'] = [each.split()[2] for each in macroEco['Year-Month']]
macroEco = macroEco.drop(['PartyInPower','Year-Month'],axis=1)

month={'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7, 'Aug':8, 'Sep':9,'Oct':10, 'Nov':11, 'Dec':12}
year={2009:1,2010:2,2011:3,2012:4,2013:5,2014:6,2015:7,2016:8}
macroEco["Month"] = macroEco["Month"].map(month)

minimumAdExpence = min([each for each in list(macroEco['AdvertisingExpenses (in Thousand Dollars)']) if type(each) == int])

macroEco=macroEco.replace('?',minimumAdExpence)
macroEco = macroEco.dropna()

macroEco["Year"] = [year[int(each)] for each in macroEco['Year']]

# Select the columns with a list (double brackets); tuple-style selection on a groupby is rejected by recent pandas versions
macroEco = macroEco.groupby(['Year','Month'])[['Monthly Nominal GDP Index (inMillion$)', 'Monthly Real GDP Index (inMillion$)', 'CPI', 'unemployment rate', 'CommercialBankInterestRateonCreditCardPlans', 'Finance Rate on Personal Loans at Commercial Banks, 24 Month Loan', 'Earnings or wages  in dollars per hour', 'AdvertisingExpenses (in Thousand Dollars)', 'Cotton Monthly Price - US cents per Pound(lbs)', 'Change(in%)', 'Average upland planted(million acres)', 'Average upland harvested(million acres)', 'yieldperharvested acre', 'Production (in  480-lb netweright in million bales)', 'Mill use  (in  480-lb netweright in million bales)', 'Exports']].mean()

print(macroEco.shape)
macroEco.head(2)

# **Holidays Data**

In [None]:
holi=pd.read_excel('data/Events_HolidaysData.xlsx')
year={2009.0:1,2010.0:2,2011.0:3,2012.0:4,2013.0:5,2014.0:6,2015.0:7,2016.0:8}

holi["Year"] = holi["Year"].map(year)
holi['Month'] = [each.month for each in holi['MonthDate']]
holi_fin=holi.groupby(['Year', 'Month']).count()[['Event']]
print(holi_fin.shape)
holi_fin.head()

# **Kaggle Train Data**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

data=pd.read_csv('data/Train_Kaggle.csv')
sns.catplot(x= "ProductCategory",y= "Sales(In ThousandDollars)", data=data)
plt.show()

data.head(2)

* *Individual Clothing*

In [None]:
data_Women=data[data['ProductCategory']=='WomenClothing']
data_WomenMean = int(data_Women['Sales(In ThousandDollars)'].mean())
data_Women=data_Women.fillna(value=data_WomenMean)

data_Other=data[data['ProductCategory']=='OtherClothing']
data_OtherMean = int(data_Other['Sales(In ThousandDollars)'].mean())
data_Other=data_Other.fillna(value=data_OtherMean)

data_Men=data[data['ProductCategory']=='MenClothing']
data_MenMean = int(data_Men['Sales(In ThousandDollars)'].mean())
data_Men=data_Men.fillna(value=data_MenMean)

* *Combined Final Training Data*

In [None]:
year={2009:1,2010:2,2011:3,2012:4,2013:5,2014:6}
dept={'MenClothing':1,'WomenClothing':2,'OtherClothing':3}

fin_data=pd.concat([data_Men,data_Women,data_Other])
fin_data["Year"] = fin_data["Year"].map(year)
fin_data["ProductCategory"] = fin_data["ProductCategory"].map(dept)

fin_data.head(2)

* *Duplicating The Training Data*

In [None]:
train_fin=fin_data.copy()  # explicit copy, so later changes do not affect fin_data
train_fin_Men=train_fin[train_fin['ProductCategory']==1].reset_index(drop=True)
train_fin_Women=train_fin[train_fin['ProductCategory']==2].reset_index(drop=True)
train_fin_Others=train_fin[train_fin['ProductCategory']==3].reset_index(drop=True)

# **COMBINING ALL DATASETS**

In [None]:
full_men1=pd.merge(train_fin_Men,holi_fin, on=['Year','Month'],how="left")
full_women1=pd.merge(train_fin_Women,holi_fin, on=['Year','Month'],how="left")
full_others1=pd.merge(train_fin_Others,holi_fin, on=['Year','Month'],how="left")

macro_weather=pd.merge(macroEco,Weather_avg, on=['Year','Month'],how="left")

full_men=pd.merge(full_men1,macro_weather, on=['Year','Month'],how="left")
full_women=pd.merge(full_women1,macro_weather, on=['Year','Month'],how="left")
full_others=pd.merge(full_others1,macro_weather, on=['Year','Month'],how="left")

full_men = full_men.fillna(full_men.mean())
full_women = full_women.fillna(full_women.mean())
full_others = full_others.fillna(full_others.mean())

Full_data=pd.concat([full_men,full_women,full_others],ignore_index=True)
Full_data = Full_data.fillna(Full_data.mean())  # fillna returns a new frame, so assign the result
df = Full_data.round(1)
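
The merges in this cell join on `Year` and `Month` with `how="left"`, which keeps every sales row even when the auxiliary table has no entry for that month. Unmatched cells become `NaN`, which is why the `fillna` calls follow. A toy sketch of that behaviour (all values are invented for illustration):

```python
import pandas as pd

# Sales rows for three months of one (encoded) year
sales = pd.DataFrame({
    "Year": [1, 1, 1],
    "Month": [1, 2, 3],
    "Sales(In ThousandDollars)": [500.0, 450.0, 520.0],
})

# Holiday counts indexed by (Year, Month), with no entry for month 3
holidays = pd.DataFrame(
    {"Year": [1, 1], "Month": [1, 2], "Event": [2, 1]}
).set_index(["Year", "Month"])

# Left merge: all three sales rows survive; month 3 gets NaN for Event
merged = pd.merge(sales, holidays, on=["Year", "Month"], how="left")
n_missing = int(merged["Event"].isna().sum())

# Fill the gap with the column mean, as the notebook does
filled = merged.fillna(merged.mean())
```

Counting the `NaN` values after each merge is a quick way to see how many months were unmatched before they are overwritten by the mean.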

In [None]:
#REMOVING HIGHLY CORRELATED FEATURES

# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))  # np.bool was removed from NumPy; use the built-in bool

# Find index of feature columns with correlation greater than 0.85
to_drop = [column for column in upper.columns if any(upper[column] > 0.85)]

# Drop features 
finalTrainData = df.drop(columns=to_drop)

finalTrainData.head(2)
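
The filter above keeps the first column of every highly correlated pair and drops the later one. A self-contained sketch of the same technique on synthetic data follows. The helper name `drop_correlated` and the toy columns are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def drop_correlated(frame: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    """Drop each column whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so every pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 1.0,      # perfectly correlated with "a"
    "b": rng.normal(size=200),    # independent noise
})

reduced = drop_correlated(toy)
print(list(reduced.columns))
```

When the frame still contains the target column, as `df` does here, it may be worth excluding it before computing the matrix so that a feature is not discarded merely for correlating with the label.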

In [None]:
# errors='ignore' avoids a KeyError if the correlation filter above already removed one of these columns
finalTrainData = finalTrainData.drop(['Finance Rate on Personal Loans at Commercial Banks, 24 Month Loan', 'Cotton Monthly Price - US cents per Pound(lbs)', 'Change(in%)','Average upland planted(million acres)', 'Average upland harvested(million acres)', 'yieldperharvested acre', 'Mill use  (in  480-lb netweright in million bales)'],axis=1,errors='ignore')
finalTrainData.head()

# **EXTRACT FEATURES AND LABELS**

In [None]:
# Keep only rows with a known target, then split them into features and labels
filtered_df = finalTrainData[finalTrainData['Sales(In ThousandDollars)'].notnull()]
features = filtered_df.loc[:, filtered_df.columns != 'Sales(In ThousandDollars)']
labels = filtered_df['Sales(In ThousandDollars)']

In [None]:
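from sklearn.preprocessing import MinMaxScaler
# Scale every feature to the range [0, 1].
# Note: the model below is trained on the unscaled `features`; tree-based
# ensembles such as gradient boosting do not depend on feature scale.
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(features)
scaled_features.shape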

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.3)
gbr = GradientBoostingRegressor(n_estimators=100)
gbr.fit(train_features, train_labels)
predictions = gbr.predict(test_features)
# RMSE on the hold-out split; taking the square root explicitly works across scikit-learn versions
np.sqrt(mean_squared_error(test_labels, predictions))

# **TESTING**

In [None]:
test_data_set=pd.read_csv('data/Test_Kaggle.csv')

year={2009:1,2010:2,2011:3,2012:4,2013:5,2014:6,2015:7}
dept={'MenClothing':1,'WomenClothing':2,'OtherClothing':3}
test_data_set["Year"] = test_data_set["Year"].map(year)
test_data_set["ProductCategory"] = test_data_set["ProductCategory"].map(dept)

In [None]:
my_list = [each for each in Weather_avg.index if each[0] == 7]
Weather_test = Weather_avg[Weather_avg.index.isin(my_list)]

my_list = [each for each in macroEco.index if each[0] == 7]
macroEco_test = macroEco[macroEco.index.isin(my_list)]

my_list = [each for each in holi_fin.index if each[0] == 7]
holi_fin_test = holi_fin[holi_fin.index.isin(my_list)]

dummy=pd.merge(Weather_test,macroEco_test, on=['Year','Month'],how="left")
support_data =pd.merge(dummy,holi_fin_test, on=['Year','Month'],how="left")

In [None]:
# Keep only the columns that the model was trained on
toDrop = [each for each in support_data.columns if each not in features.columns]

support_data = support_data.drop(toDrop,axis=1)

In [None]:
temp = []
indices = []
columns = list(test_data_set.columns) + (list(support_data.columns))

for i,j in zip(test_data_set['Year'], test_data_set['Month']):
  indices.append((i,j))

for i in range(len(test_data_set)):
  # `row` avoids overwriting the `data` DataFrame loaded earlier
  row = list(test_data_set.loc[i])+list(support_data.loc[indices[i]])
  temp.append(row)

df = pd.DataFrame(temp)
df.columns = columns

order = features.columns
df = df[order]
filtered_test_df_final = df.fillna(df.mean())
filtered_test_df_final.head()

In [None]:
predictions = gbr.predict(filtered_test_df_final)

finalTestDataset = pd.DataFrame(pd.Series(predictions))
finalTestDataset.index += 1
finalTestDataset = finalTestDataset.reset_index()
finalTestDataset.columns = ['Year','Sales(In ThousandDollars)']
finalTestDataset = finalTestDataset.set_index('Year')

finalTestDataset['Sales(In ThousandDollars)'] = finalTestDataset['Sales(In ThousandDollars)'].astype(int)
finalTestDataset.head()

In [None]:
finalTestDataset.to_csv("test_gbr.csv")

## **Stage 3:** Each time you submit to Kaggle, ensure that your code in Stage 2 reproduces the same result. Follow these validation steps:
### a) Enter your Kaggle RMSE in the form below 
### b) After entering RMSE below, go to File->'Save and pin revision' (To ensure you do so, you are asked to mark 'Yes' to the instruction asking the same)
**Note: The shortcut for 'Save and pin revision' is Ctrl+M+S.**

**Note: You can check whether the action succeeded by going to File->Revision History; the "PIN" checkbox will be checked if it did.**


- This action ensures there is 'proof of code' for each submission you make.
- If you submit your results to Kaggle and get a leaderboard RMSE score but do not follow the steps above, your **score will NOT be considered**, as we will have no proof of your code. (We match the 'proof of code' to your submission using your "RMSE+Time of save+pin".) In other words, if you want your RMSE score to be considered, you must follow the process.
- However, for trial submissions (RMSE scores you do not need considered because you are still experimenting in your initial attempts), you do not have to follow the process above.
- **One member of your team can collect the shared Colab links of all team members and email them to aimlkaggle@gmail.com by the deadlines.** Make sure to give edit access to aimlkaggle@gmail.com.
- **FINALLY: "Do NOT download and reupload this file as all the revision history will be lost"**


