# ETL, Analysis, and Visualization

**Satish Nalla**

The MRTS sales data, cleansed during data preparation, is used here to build visualizations for analysis.

# Index

- [Abstract](#Abstract)
- [1. Preproject Summary](#1.-Preproject-Summary)
- [2. Initialization](#2.-Initialization)
    - [2.1 Import the Python Modules](#2.1-Import-the-Modules)
    - [2.2 Establishing the Database Connections](#2.2-Establishing-the-database-connections)
    - [2.3 Preparing the SQL Commands](#2.3-Preparing-the-SQL-Commands)
    - [2.4 Executing the SQL Commands](#2.4-Executing-the-SQL-Commands)
    - [2.5 Building the Pandas Dataframes](#2.5-Building-the-Pandas-Dataframes)
- [3. The Data](#3.-The-Data)
    - [3.1 Exploring the Schema of Dataframes](#3.1-Exploring-the-Schema-of-Dataframes)
    - [3.2 Deepdiving the Total Data](#3.2-Deepdiving-the-Total-Data)
    - [3.3 Deepdiving the Details Data](#3.3-Deepdiving-the-Details-Data)
    - [3.4 Summary](#3.4-Summary)
- [4. Preparing the Data](#4.-Preparing-the-Data)
    - [4.1 Aggregating the Data](#4.1-Aggregating-the-Data)
    - [4.2 Addition of Metrics](#4.2-Addition-of-Metrics)
- [5. Data Visualizations](#5.-Data-Visualizations)
    - [5.1 Visualizing Total Data](#5.1-Visualizing-Total-Data)
    - [5.2 Visualizing Detail Data](#5.2-Visualizing-Detail-Data)
- [6. Rolling Window Analysis](#6.-Rolling-Window-Analysis)
    - [6.1 Rolling Window Details](#6.1-Rolling-Window-Details)
    - [6.2 Preparing Data for Rolling window](#6.2-Preparing-Data-for-Rolling-window)
    - [6.3 Visualizing the Rolling window data](#6.3-Visualizing-the-Rolling-window-data)
- [Conclusion](#Conclusion)
- [References](#References)

[Back to top](#Index)
## Abstract

This notebook imports the MRTS sales data and builds visualizations to answer questions about it.

[Back to top](#Index)

## 1. Preproject Summary

The raw data contains many total and subtotal records. Two database views were created, one holding only the totals and one holding only the subtotals for each kind-of-business subgroup in the Excel file, and both are used for the analysis in this project.

[Back to top](#Index)

## 2. Initialization

[Back to top](#Index)

### 2.1 Import the Modules

In [None]:
#Importing the Python modules used in this project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from dateutil.parser import parse
import math
import mysql.connector

[Back to top](#Index)

### 2.2 Establishing the database connections

In [None]:
#Establishing the MySQL database connection
connection = mysql.connector.connect(user='root',password='Aaira2020',host = '127.0.0.1',database = 'MRTSSALES', auth_plugin='mysql_native_password')
cursor = connection.cursor()

[Back to top](#Index)

### 2.3 Preparing the SQL Commands

In [None]:
#Preparing the SQL query statements used in this project
sqlStatementDetails = 'SELECT * FROM VW_MRTSALES_DATA;'
sqlStatementTotals = 'SELECT * FROM VW_MRTSALES_DATA_TOTAL;'

[Back to top](#Index)

### 2.4 Executing the SQL Commands

Executing the SQL commands and importing the data into this project.

In [None]:
#Executing the Detail SQL Statement created in 2.3
cursor.execute(sqlStatementDetails)
detailColumns = cursor.column_names
detailData = cursor.fetchall()

In [None]:
#Executing the Total SQL Statement created in 2.3
cursor.execute(sqlStatementTotals)
totalColumns = cursor.column_names
totalData = cursor.fetchall()

In [None]:
#Closing the Cursor and Connection to save the heap memory
cursor.close()
connection.close()
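
The fetch-then-close pattern used above can be sketched end to end without a running MySQL server. The snippet below is a minimal illustration that swaps in Python's built-in `sqlite3` module and an in-memory table as a stand-in; the table name `demo_sales`, its rows, and the helper `fetch_rows_and_columns` are hypothetical. Note that `sqlite3` exposes column names through `cursor.description`, whereas the MySQL connector cursor used in this notebook provides `cursor.column_names`.

```python
import sqlite3

def fetch_rows_and_columns(connection, sql):
    """Run a query and return (rows, column_names), always closing the cursor."""
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        # In sqlite3, the first item of each description entry is the column name
        columns = [col[0] for col in cursor.description]
        rows = cursor.fetchall()
    finally:
        # Runs even if the query fails, so the cursor is never left open
        cursor.close()
    return rows, columns

# Hypothetical in-memory table standing in for the MRTS views
demo_connection = sqlite3.connect(":memory:")
try:
    demo_connection.execute(
        "CREATE TABLE demo_sales (KIND_OF_BUSINESS TEXT, YEAR INTEGER, VALUE REAL)"
    )
    demo_connection.execute(
        "INSERT INTO demo_sales VALUES ('Book stores', 2020, 100.0)"
    )
    demo_rows, demo_columns = fetch_rows_and_columns(
        demo_connection, "SELECT * FROM demo_sales;"
    )
finally:
    demo_connection.close()

print(demo_columns)
print(demo_rows)
```

Wrapping the work in `try`/`finally` is one way to make sure the cursor and connection are released even when a query raises an error.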

[Back to top](#Index)

### 2.5 Building the Pandas Dataframes

Building the Pandas dataframes, loading the data into them, and setting up a few other attributes for further analysis.

In [None]:
#Converting the Query outputs to Pandas dataframes
DetailDataDf = pd.DataFrame(detailData)
DetailDataDf.columns = detailColumns

totalDataDf = pd.DataFrame(totalData)
totalDataDf.columns = totalColumns
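
As a small illustration of the conversion above, the rows and the column names can also be passed to the `pd.DataFrame` constructor in one step. The sample rows and column names below are made up for the example and are not taken from the MRTS views.

```python
import pandas as pd

# Hypothetical query output: a list of row tuples plus a tuple of column names
sample_rows = [("Book stores", 2020, 100.0), ("Book stores", 2021, 120.0)]
sample_columns = ("KIND_OF_BUSINESS", "YEAR", "VALUE")

# Passing columns= names the columns while the dataframe is being built
sample_df = pd.DataFrame(sample_rows, columns=list(sample_columns))
print(sample_df)
```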

In [None]:
#Assigning a color to each Kind of Business in both datasets
colorCodes = ['#FF4136','#0074D9','#2ECC40','#FF851B','#B10DC9','#FFDC00','#7FDBFF','#001f3f','#39CCCC','#F012BE','#3D9970','#FF69B4','#85144b','#AAAAAA','#FFC300','#00796B','#F37735','#008080','#7FDBFF','#6B8E23']

colorIter = 0
colors = {}

#The modulo wraps around the palette, so more kinds than colors cannot raise an IndexError
for i in totalDataDf.KIND_OF_BUSINESS.unique():
    colors[i]=colorCodes[colorIter % len(colorCodes)]
    colorIter += 1

for i in DetailDataDf.KIND_OF_BUSINESS.unique():
    colors[i]=colorCodes[colorIter % len(colorCodes)]
    colorIter += 1
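
The color assignment can also be written as a small reusable helper. The sketch below is one possible variant, not the notebook's own method: the function name `build_color_map`, the short palette, and the labels are hypothetical. It uses `itertools.cycle` so the palette repeats when there are more labels than colors, and, unlike the loops above, it keeps the first color given to a label that appears twice.

```python
from itertools import cycle

def build_color_map(labels, palette):
    """Map each distinct label to a palette color, reusing colors when the palette runs out."""
    color_cycle = cycle(palette)
    color_map = {}
    for label in labels:
        # Keep the first color assigned to a label that appears more than once
        if label not in color_map:
            color_map[label] = next(color_cycle)
    return color_map

# Hypothetical palette and labels; 'Kind A' is repeated on purpose
demo_palette = ['#FF4136', '#0074D9', '#2ECC40']
demo_labels = ['Kind A', 'Kind B', 'Kind C', 'Kind D', 'Kind A']
demo_colors = build_color_map(demo_labels, demo_palette)
print(demo_colors)
```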

[Back to top](#Index)

## 3. The Data

Exploring the data

[Back to top](#Index)

### 3.1 Exploring the Schema of Dataframes

Checking both the Totals and Details dataframes.

In [None]:
totalDataDf.info()

In [None]:
totalDataDf.describe()

In [None]:
totalDataDf.head()

In [None]:
totalDataDf.tail()

In [None]:
DetailDataDf.info()

In [None]:
DetailDataDf.describe()

In [None]:
DetailDataDf.head()

In [None]:
DetailDataDf.tail()

[Back to top](#Index)

### 3.2 Deepdiving the Total Data

In [None]:
#Plotting the Total data to Analyze if any further Aggregations needed
i = 0
j = 0

x = len(totalDataDf.YEAR.unique())
y = len(totalDataDf.KIND_OF_BUSINESS.unique())

print(x,y)

fig, axs = plt.subplots(x,y, figsize=(40, 130))
fig.suptitle(f'Plotting All the Total KPIs for each Year', fontsize = 16, y = 0.9)

for eachYear in totalDataDf.YEAR.unique():
    j = 0
    for eachKind in totalDataDf.KIND_OF_BUSINESS.unique():
        axs[i][j].plot(totalDataDf[(totalDataDf['KIND_OF_BUSINESS']==eachKind) & (totalDataDf['YEAR']==eachYear)]['DATE'], totalDataDf[(totalDataDf['KIND_OF_BUSINESS']==eachKind) & (totalDataDf['YEAR']==eachYear)]['VALUE'],color = colors[eachKind])
        axs[i][j].set_title(eachKind, fontsize = 10)
        axs[i][0].set_ylabel('Sales')
        j += 1
        #print(i,j,eachYear,eachKind)    
    i += 1

#print(i,j)
        
plt.show()
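
The plotting loop above builds the same boolean filter twice, once for the x values and once for the y values. A possible simplification, sketched here on made-up data, is to filter once and read both columns from the result. The helper name `select_series` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical monthly records shaped like the notebook's dataframes
demo_df = pd.DataFrame({
    "KIND_OF_BUSINESS": ["Kind A", "Kind A", "Kind B", "Kind A"],
    "YEAR": [2019, 2019, 2019, 2020],
    "DATE": pd.to_datetime(["2019-01-01", "2019-02-01", "2019-01-01", "2020-01-01"]),
    "VALUE": [10.0, 12.0, 7.0, 15.0],
})

def select_series(df, kind, year):
    """Return the DATE and VALUE columns for one kind of business in one year."""
    mask = (df["KIND_OF_BUSINESS"] == kind) & (df["YEAR"] == year)
    subset = df.loc[mask]  # the filter is evaluated a single time
    return subset["DATE"], subset["VALUE"]

dates, values = select_series(demo_df, "Kind A", 2019)
print(values.tolist())
```

The returned pair could then be passed directly to a call such as `axs[i][j].plot(dates, values)`.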

[Back to top](#Index)

### 3.3 Deepdiving the Details Data

In [None]:
#Plotting the Details data to Analyze if any further Aggregations needed
i = 0
j = 0

x = len(DetailDataDf.YEAR.unique())*2
y = math.ceil(len(DetailDataDf.KIND_OF_BUSINESS.unique())/2)

print(x,y)

fig, axs = plt.subplots(x,y, figsize=(40, 200))
fig.suptitle(f'Plotting All the Sub Total KPIs for Each Year', fontsize = 16, y = 0.9)

for eachYear in DetailDataDf.YEAR.unique():
    j = 0
    for eachKind in DetailDataDf.KIND_OF_BUSINESS.unique():
        if j == y:
            i += 1
            j = 0
        axs[i][j].plot(DetailDataDf[(DetailDataDf['KIND_OF_BUSINESS']==eachKind) & (DetailDataDf['YEAR']==eachYear)]['DATE'], DetailDataDf[(DetailDataDf['KIND_OF_BUSINESS']==eachKind) & (DetailDataDf['YEAR']==eachYear)]['VALUE'],color = colors[eachKind])
        axs[i][j].set_title(eachKind, fontsize = 10)
        axs[i][0].set_ylabel('Sales')
        j += 1
        #print(i,j,eachYear,eachKind)
        #print(i,j)

    i += 1

#print(i,j)
        
plt.show()

[Back to top](#Index)

### 3.4 Summary

#### Summary of Data Explorations

As the visualizations above show, the month-by-month trend within each year follows a largely consistent pattern, so the data is aggregated by year for further analysis.

[Back to top](#Index)

## 4. Preparing the Data

[Back to top](#Index)

### 4.1 Aggregating the Data

Aggregating the Totals and Details data to the Year for further analysis and usage

In [None]:
#As summarized above, the data is aggregated by year, creating aggregated dataframes for both the Total and Detail data
totalDataAggDf = totalDataDf.groupby(['NAICS_CODE','KIND_OF_BUSINESS','ADJUSTMENT_TYPE','YEAR'])['VALUE'].sum().reset_index()
DetailDataAggDf = DetailDataDf.groupby(['NAICS_CODE','KIND_OF_BUSINESS','ADJUSTMENT_TYPE','YEAR'])['VALUE'].sum().reset_index()
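
To make the aggregation step concrete, here is a minimal worked example on made-up monthly rows. It uses fewer grouping columns than the notebook, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical monthly rows: two months of 2019 and one of 2020 for Kind A, one month for Kind B
monthly_df = pd.DataFrame({
    "KIND_OF_BUSINESS": ["Kind A", "Kind A", "Kind A", "Kind B"],
    "YEAR": [2019, 2019, 2020, 2019],
    "VALUE": [10.0, 12.0, 15.0, 7.0],
})

# Sum VALUE within each (kind, year) pair, then turn the group keys back into columns
yearly_df = monthly_df.groupby(["KIND_OF_BUSINESS", "YEAR"])["VALUE"].sum().reset_index()
print(yearly_df)
```

The two 2019 rows for Kind A collapse into a single yearly row, which is the shape the later year-over-year calculations rely on.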

[Back to top](#Index)

### 4.2 Addition of Metrics

Adding prior-year values and year-over-year (YOY) values to the aggregated dataframes.

In [None]:
#Adding Prior Year Sales Values and YoY calculations data to the Aggregated Dataframes
#Float defaults; the column is named YOY because that is what the loop fills and the plots read
totalDataAggDf['YOY'] = 0.0
totalDataAggDf['PY_VALUE'] = 0.0
for index,row in totalDataAggDf.iterrows():
    #Find the same business, adjustment type and NAICS code in the previous year
    priorYearRow = totalDataAggDf[(totalDataAggDf['KIND_OF_BUSINESS']==row['KIND_OF_BUSINESS']) & (totalDataAggDf['ADJUSTMENT_TYPE']==row['ADJUSTMENT_TYPE']) & (totalDataAggDf['NAICS_CODE']==row['NAICS_CODE']) & (totalDataAggDf['YEAR']==row['YEAR']-1)]

    if len(priorYearRow) == 0:
        #No prior year is available, so both metrics stay at 0
        totalDataAggDf.loc[index,'PY_VALUE'] = 0
        totalDataAggDf.loc[index,'YOY'] = 0
    else:
        totalDataAggDf.loc[index,'PY_VALUE'] = priorYearRow['VALUE'].max()
        #YOY is stored as a fraction, for example 0.05 for a 5% increase
        totalDataAggDf.loc[index,'YOY'] = (row['VALUE']/totalDataAggDf.loc[index,'PY_VALUE'])-1
        
    #print(index,row['VALUE'],row['PY_VALUE'],row['YOY'])

In [None]:
#Adding Prior Year Sales Values and YoY calculations data to the Aggregated Dataframes
#Float defaults; the column is named YOY because that is what the loop fills and the plots read
DetailDataAggDf['YOY'] = 0.0
DetailDataAggDf['PY_VALUE'] = 0.0
for index,row in DetailDataAggDf.iterrows():
    #Find the same business, adjustment type and NAICS code in the previous year
    priorYearRow = DetailDataAggDf[(DetailDataAggDf['KIND_OF_BUSINESS']==row['KIND_OF_BUSINESS']) & (DetailDataAggDf['ADJUSTMENT_TYPE']==row['ADJUSTMENT_TYPE']) & (DetailDataAggDf['NAICS_CODE']==row['NAICS_CODE']) & (DetailDataAggDf['YEAR']==row['YEAR']-1)]

    if len(priorYearRow) == 0:
        #No prior year is available, so both metrics stay at 0
        DetailDataAggDf.loc[index,'PY_VALUE'] = 0
        DetailDataAggDf.loc[index,'YOY'] = 0
    else:
        DetailDataAggDf.loc[index,'PY_VALUE'] = priorYearRow['VALUE'].max()
        #YOY is stored as a fraction, for example 0.05 for a 5% increase
        DetailDataAggDf.loc[index,'YOY'] = (row['VALUE']/DetailDataAggDf.loc[index,'PY_VALUE'])-1
        
    #print(index,row['VALUE'],row['PY_VALUE'],row['YOY'])
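
The row-by-row loops above can also be expressed as a single merge. The following is a sketch of that alternative on made-up data, not the notebook's own method: the helper `add_prior_year_metrics` and the sample values are hypothetical, and it assumes the input has one row per key and year and no existing `PY_VALUE` column. Shifting each year forward by one lets every row line up with its prior-year value, matching the exact `YEAR - 1` lookup used in the loops.

```python
import pandas as pd

def add_prior_year_metrics(agg_df, keys, year_col="YEAR", value_col="VALUE"):
    """Attach PY_VALUE and YOY (as a fraction) by joining each row to the prior year."""
    prior = agg_df[keys + [year_col, value_col]].copy()
    # A value from year Y is the prior-year value for year Y + 1
    prior[year_col] = prior[year_col] + 1
    prior = prior.rename(columns={value_col: "PY_VALUE"})
    out = agg_df.merge(prior, on=keys + [year_col], how="left")
    out["YOY"] = out[value_col] / out["PY_VALUE"] - 1
    # Rows without a prior year get 0 for both metrics, as in the loops
    out["PY_VALUE"] = out["PY_VALUE"].fillna(0.0)
    out["YOY"] = out["YOY"].fillna(0.0)
    return out

# Hypothetical yearly totals
demo_agg = pd.DataFrame({
    "KIND_OF_BUSINESS": ["Kind A", "Kind A", "Kind B"],
    "YEAR": [2019, 2020, 2020],
    "VALUE": [100.0, 110.0, 50.0],
})
demo_result = add_prior_year_metrics(demo_agg, ["KIND_OF_BUSINESS"])
print(demo_result)
```

On large dataframes a merge like this generally avoids the cost of filtering the whole dataframe once per row.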

[Back to top](#Index)

## 5. Data Visualizations

[Back to top](#Index)

### 5.1 Visualizing Total Data

Visualizing Total Data by Time Trending and Year over Year of each Kind of Business.

In [None]:
#Visualizing the Total Aggregated Data
i = 0
j = 0

x = len(totalDataAggDf.KIND_OF_BUSINESS.unique())
y = 2

print(x,y)

fig, axs = plt.subplots(x,y, figsize=(20, 40))
fig.suptitle(f'Plotting All the Total KPIs', fontsize = 16, y = 0.9)

for eachKind in totalDataAggDf.KIND_OF_BUSINESS.unique():
    axs[i][0].plot(totalDataAggDf[(totalDataAggDf['KIND_OF_BUSINESS']==eachKind) & (totalDataAggDf['YEAR']!=2021) ]['YEAR'], totalDataAggDf[(totalDataAggDf['KIND_OF_BUSINESS']==eachKind) & (totalDataAggDf['YEAR']!=2021)]['VALUE'], color = colors[eachKind])
    axs[i][0].scatter(totalDataAggDf[(totalDataAggDf['KIND_OF_BUSINESS']==eachKind) ]['YEAR'], totalDataAggDf[(totalDataAggDf['KIND_OF_BUSINESS']==eachKind)]['VALUE'], s = 10, color = colors[eachKind])
    axs[i][0].set_title(eachKind, fontsize = 10)
    axs[i][0].set_ylabel('Sales')

    axs[i][1].plot(totalDataAggDf[(totalDataAggDf['KIND_OF_BUSINESS']==eachKind) & (totalDataAggDf['YEAR']!=2021) ]['YEAR'], totalDataAggDf[(totalDataAggDf['KIND_OF_BUSINESS']==eachKind) & (totalDataAggDf['YEAR']!=2021)]['YOY'], color = colors[eachKind])
    axs[i][1].set_title(eachKind, fontsize = 10)
    axs[i][1].set_ylabel('% Change')
    axs[i][1].set_xlabel('Year')
    axs[i][1].xaxis.set_major_formatter(plt.FuncFormatter('{:.0f}'.format))
    #YOY is stored as a fraction (0.05 means 5%), so format the ticks as percentages
    axs[i][1].yaxis.set_major_formatter(plt.FuncFormatter(lambda value, pos: f'{value:.0%}'))
    i += 1
    

#print(i,j)
        
plt.show()

[Back to top](#Index)

### 5.2 Visualizing Detail Data

Visualizing Detail Data by Time Trending and Year over Year of each Kind of Business.

In [None]:
#Visualizing the Detail Aggregated Data
i = 0
j = 0

x = len(DetailDataAggDf.KIND_OF_BUSINESS.unique())
y = 2

print(x,y)

fig, axs = plt.subplots(x,y, figsize=(20, 60))
fig.suptitle(f'Plotting All the Detail Kind of Business KPIs', fontsize = 16, y = 0.9)
l = {}
for eachKind in DetailDataAggDf.KIND_OF_BUSINESS.unique():
    
    axs[i][0].plot(DetailDataAggDf[(DetailDataAggDf['KIND_OF_BUSINESS']==eachKind) & (DetailDataAggDf['YEAR']!=2021) ]['YEAR'], DetailDataAggDf[(DetailDataAggDf['KIND_OF_BUSINESS']==eachKind) & (DetailDataAggDf['YEAR']!=2021)]['VALUE'],color = colors[eachKind])
    axs[i][0].scatter(DetailDataAggDf[(DetailDataAggDf['KIND_OF_BUSINESS']==eachKind) ]['YEAR'], DetailDataAggDf[(DetailDataAggDf['KIND_OF_BUSINESS']==eachKind)]['VALUE'], s = 10, color = colors[eachKind])
    axs[i][0].set_title(eachKind, fontsize = 10)
    axs[i][0].set_ylabel('Sales')

    axs[i][1].plot(DetailDataAggDf[(DetailDataAggDf['KIND_OF_BUSINESS']==eachKind) & (DetailDataAggDf['YEAR']!=2021) ]['YEAR'], DetailDataAggDf[(DetailDataAggDf['KIND_OF_BUSINESS']==eachKind) & (DetailDataAggDf['YEAR']!=2021)]['YOY'],color = colors[eachKind])
    axs[i][1].set_title(eachKind, fontsize = 10)
    axs[i][1].set_ylabel('% Change')
    axs[i][1].set_xlabel('Year')
    axs[i][1].xaxis.set_major_formatter(plt.FuncFormatter('{:.0f}'.format))
    #YOY is stored as a fraction (0.05 means 5%), so format the ticks as percentages
    axs[i][1].yaxis.set_major_formatter(plt.FuncFormatter(lambda value, pos: f'{value:.0%}'))
    
    
    i += 1
    

#print(i,j)

plt.show()

[Back to top](#Index)

## 6. Rolling Window Analysis

[Back to top](#Index)

### 6.1 Rolling Window Details

A rolling window computes an average over a fixed number of past periods: for every month, it reports the average of the last n months.

The dataframes are updated with rolling 3-month, 6-month, and 12-month calculations, as shown below.
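
As a small worked example of the idea, the snippet below applies a 3-period rolling mean to a made-up series of five monthly values. The numbers are hypothetical.

```python
import pandas as pd

# Hypothetical sales for five consecutive months
monthly_sales = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Each result is the mean of the current month and the two months before it
rolling_3 = monthly_sales.rolling(3).mean()
print(rolling_3)
```

The first two results are `NaN` because a full window of three months is not available yet; the third is the mean of 10, 20, and 30.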

[Back to top](#Index)

### 6.2 Preparing Data for Rolling window

Adding the rolling window calculations to the monthly dataset.

In [None]:
#Sorting the data, as the rolling window calculations depend on month order
totalDataDf = totalDataDf.sort_values(by=['ADJUSTMENT_TYPE','NAICS_CODE','KIND_OF_BUSINESS','DATE'], ascending=True)
#drop=True discards the old index instead of keeping it as an extra column
totalDataDf = totalDataDf.reset_index(drop=True)

#Adding rolling calculations to the Total dataframe, computed within each
#business series so a window never mixes values from two kinds of business
totalGroups = totalDataDf.groupby(['ADJUSTMENT_TYPE','KIND_OF_BUSINESS'])['VALUE']
totalDataDf['R3M'] = totalGroups.transform(lambda s: s.rolling(3).mean())
totalDataDf['R6M'] = totalGroups.transform(lambda s: s.rolling(6).mean())
totalDataDf['R12M'] = totalGroups.transform(lambda s: s.rolling(12).mean())

In [None]:
#Adding rolling calculations to the Detail dataframe.
DetailDataDf['R3M'] = 0.0
DetailDataDf['R6M'] = 0.0
DetailDataDf['R12M'] = 0.0

DetailDataDf = DetailDataDf.sort_values(by=['ADJUSTMENT_TYPE','NAICS_CODE','KIND_OF_BUSINESS','DATE'], ascending=True)
#drop=True discards the old index instead of keeping it as an extra column
DetailDataDf = DetailDataDf.reset_index(drop=True)

#Rolling means are computed within each business series so windows never mix two kinds of business
detailGroups = DetailDataDf.groupby(['ADJUSTMENT_TYPE','KIND_OF_BUSINESS'])['VALUE']
DetailDataDf['R3M'] = detailGroups.transform(lambda s: s.rolling(3).mean())
DetailDataDf['R6M'] = detailGroups.transform(lambda s: s.rolling(6).mean())
DetailDataDf['R12M'] = detailGroups.transform(lambda s: s.rolling(12).mean())
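
Because the dataframe stacks several business series one after another, it is worth seeing how a rolling mean behaves at the boundary between two series. The sketch below uses made-up values and a 2-period window to compare a rolling mean over the whole column with one computed per group through `groupby(...).transform(...)`. The column names `R2M_UNGROUPED` and `R2M_GROUPED` are hypothetical.

```python
import pandas as pd

# Hypothetical data: three months of Kind A followed by three months of Kind B
demo_df = pd.DataFrame({
    "KIND_OF_BUSINESS": ["Kind A"] * 3 + ["Kind B"] * 3,
    "VALUE": [1.0, 2.0, 3.0, 100.0, 200.0, 300.0],
})

# Rolling over the whole column: the first Kind B row averages in the last Kind A value
demo_df["R2M_UNGROUPED"] = demo_df["VALUE"].rolling(2).mean()

# Rolling within each group: every series starts its own window
demo_df["R2M_GROUPED"] = (
    demo_df.groupby("KIND_OF_BUSINESS")["VALUE"]
    .transform(lambda s: s.rolling(2).mean())
)
print(demo_df)
```

In the ungrouped column, the first Kind B row mixes 3.0 and 100.0, while the grouped column leaves it empty until Kind B has enough months of its own.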

[Back to top](#Index)

### 6.3 Visualizing the Rolling window data

Visualizing rolling window data for the Total Kind of Business KPIs

In [None]:
#Plotting the Rolling Window Metrics for Total data 
i = 0
j = 0

x = len(totalDataDf.KIND_OF_BUSINESS.unique())
y = 1

print(x,y)

fig, axs = plt.subplots(x,y, figsize=(40, 130))
fig.suptitle(f'Plotting All the Rolling window KPIs for Total Data', fontsize = 16, y = 0.9)

j = 0 
for eachKind in totalDataDf.KIND_OF_BUSINESS.unique():
    axs[i].plot(totalDataDf[(totalDataDf['KIND_OF_BUSINESS']==eachKind)]['DATE'], totalDataDf[(totalDataDf['KIND_OF_BUSINESS']==eachKind)]['R3M'], label = 'Rolling 3 Months')
    axs[i].plot(totalDataDf[(totalDataDf['KIND_OF_BUSINESS']==eachKind)]['DATE'], totalDataDf[(totalDataDf['KIND_OF_BUSINESS']==eachKind)]['R6M'], label = 'Rolling 6 Months')
    axs[i].plot(totalDataDf[(totalDataDf['KIND_OF_BUSINESS']==eachKind)]['DATE'], totalDataDf[(totalDataDf['KIND_OF_BUSINESS']==eachKind)]['R12M'], label = 'Rolling 12 Months')
    axs[i].set_title(eachKind, fontsize = 10)
    axs[i].set_ylabel('Sales')
    axs[i].legend()
    i += 1
    #print(i,j,eachYear,eachKind)    

#print(i,j)
        
plt.show()

In [None]:
#Scratch example: building a legend from explicit line handles
fig, ax = plt.subplots()
line1, = ax.plot([1, 2, 3], label='label1')
line2, = ax.plot([1, 2, 3], label='label2')
ax.legend(handles=[line1, line2])

Visualizing rolling window data for the Detail Kind of Business KPIs

In [None]:
#Plotting the Rolling Window Metrics for Detail data 
i = 0
j = 0

x = len(DetailDataDf.KIND_OF_BUSINESS.unique())
y = 1

print(x,y)

fig, axs = plt.subplots(x,y, figsize=(40, 130))
fig.suptitle(f'Plotting All the Rolling window KPIs for Detail Data', fontsize = 16, y = 0.9)

j = 0 
for eachKind in DetailDataDf.KIND_OF_BUSINESS.unique():
    axs[i].plot(DetailDataDf[(DetailDataDf['KIND_OF_BUSINESS']==eachKind)]['DATE'], DetailDataDf[(DetailDataDf['KIND_OF_BUSINESS']==eachKind)]['R3M'], label = 'Rolling 3 Months')
    axs[i].plot(DetailDataDf[(DetailDataDf['KIND_OF_BUSINESS']==eachKind)]['DATE'], DetailDataDf[(DetailDataDf['KIND_OF_BUSINESS']==eachKind)]['R6M'], label = 'Rolling 6 Months')
    axs[i].plot(DetailDataDf[(DetailDataDf['KIND_OF_BUSINESS']==eachKind)]['DATE'], DetailDataDf[(DetailDataDf['KIND_OF_BUSINESS']==eachKind)]['R12M'], label = 'Rolling 12 Months')
    axs[i].set_title(eachKind, fontsize = 10)
    axs[i].set_ylabel('Sales')
    axs[i].legend()
    i += 1
    #print(i,j,eachYear,eachKind)    

#print(i,j)
        
plt.show()

[Back to top](#Index)

## Conclusion

This project supports answering common sales questions: sales trends by month or by year, YOY percentages for each kind of business or for the overall business, and time-trend analysis over a period of months or years.

[Back to top](#Index)

## References

References used in this notebook:

- Christopher, Antony. "Python MYSQL Connector." https://medium.com/analytics-vidhya/importing-data-from-a-mysql-database-into-pandas-data-frame-a06e392d27d7
- "Using Matplotlib." https://pandas.pydata.org/pandas-docs/version/0.9.1/visualization.html
- "Pandas Data Aggregation." https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.aggregate.html
- "Pandas DataFrame Rolling window." https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
