# TPS Jan 2022 

---

This notebook is a basic exploratory data analysis (EDA) of the training and test datasets for the Tabular Playground Series (TPS), January 2022. It uses the pandas, NumPy and Plotly libraries. The next step after this notebook is to develop models with machine learning (ML) and deep learning (DL) libraries.


<div class="alert alert-block alert-info" style="margin-top: 20px">
    <h2>(1) Importing Libraries</h2>
</div>
<hr>


In [None]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go

import warnings
warnings.filterwarnings('ignore')

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <h2>(2) Importing Data</h2>
</div>
<hr>

In [None]:
train_data = pd.read_csv("../input/tabular-playground-series-jan-2022/train.csv")
train_data = train_data.sort_values(by=['date'])
train_data.head() 

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <h2>(3) Exploring Data</h2>
</div>
<hr>


In [None]:
train_data.info()

In [None]:
train_data.describe()

In [None]:
numerical_columns = [var for var in train_data.columns if train_data[var].dtype!='O']
print(" Numerical columns : ", len(numerical_columns),"\n Columns: ",numerical_columns)

In [None]:
categorical_columns = [var for var in train_data.columns if train_data[var].dtype=='O']
print(" Categorical columns : ", len(categorical_columns),"\n Columns: ",categorical_columns)
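
As a possible alternative to comparing dtypes by hand, the same split can be sketched with `DataFrame.select_dtypes`. The small frame below is a hypothetical stand-in for `train_data`; its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for train_data; all values are made up.
sample = pd.DataFrame({
    "date": ["2015-01-01", "2015-01-02"],
    "country": ["Country A", "Country B"],
    "store": ["Store A", "Store B"],
    "product": ["Product A", "Product B"],
    "num_sold": [100, 250],
})

# Columns with a numeric dtype (integers, floats).
numerical_columns = sample.select_dtypes(include="number").columns.tolist()
# Everything else (here: the string columns).
categorical_columns = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical columns:", numerical_columns)
print("Categorical columns:", categorical_columns)
```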

Create `year`, `month` and `day` columns from the `date` feature.

In [None]:
# Split the "YYYY-MM-DD" date string into year, month and day columns.
# The parts are kept as strings, so later filters compare against e.g. '2015'.
# A vectorized split avoids the slow row-by-row loop and chained assignment.
train_data[['year', 'month', 'day']] = train_data['date'].str.split('-', expand=True)

In [None]:
train_data.head()
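
Another option, sketched here under the assumption that the dates follow the `YYYY-MM-DD` pattern, is to parse the column with `pd.to_datetime` and read the parts through the `.dt` accessor. The `sample` frame and its dates are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the "date" column; the dates are made up.
sample = pd.DataFrame({"date": ["2015-01-01", "2016-07-15", "2018-12-31"]})

# Parse the strings once, then read each component through the .dt accessor.
parsed = pd.to_datetime(sample["date"], format="%Y-%m-%d")
sample["year"] = parsed.dt.year
sample["month"] = parsed.dt.month
sample["day"] = parsed.dt.day

print(sample)
```

With this approach the new columns hold integers rather than strings, so a later filter would compare against `2015` instead of `'2015'`.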

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <h2>(4) Cleaning Data</h2>
</div>
<hr>


In [None]:
train_data.isna().sum()

There are **no** missing or null values to clean.

In [None]:
train_data[train_data.duplicated()]

Also, there are **no** duplicated rows.
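
Both checks can also be reduced to single numbers, which is easier to scan than a per-column table or an empty frame. This is a minimal sketch on a hypothetical frame that deliberately contains one missing value and one duplicated row.

```python
import pandas as pd

# Hypothetical frame with one missing value and one duplicated row.
sample = pd.DataFrame({
    "date": ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-02"],
    "num_sold": [10.0, 10.0, None, 12.0],
})

# Missing cells summed over all columns.
missing_total = int(sample.isna().sum().sum())
# Rows that are identical to an earlier row.
duplicate_total = int(sample.duplicated().sum())

print(f"Missing values: {missing_total}, duplicated rows: {duplicate_total}")
```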

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <h2>(5) Visualizing Data</h2>
</div>
<hr>

Plotting the entire dataset in a single graph would be overwhelming and hard to read. Instead, we break it into smaller subsets: first by year, and then by the maximum `num_sold` value for each day.

The following cells look at the training data for 2015 only.
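
The filter-then-aggregate idea can be sketched on a toy frame. The rows, store names and values below are hypothetical; only the column names `date`, `year`, `store` and `num_sold` follow the notebook.

```python
import pandas as pd

# Hypothetical sales rows with several entries per date; values are made up.
sample = pd.DataFrame({
    "date": ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-02", "2016-01-01"],
    "store": ["Store A", "Store B", "Store A", "Store B", "Store A"],
    "num_sold": [100, 250, 90, 300, 120],
})
sample["year"] = sample["date"].str[:4]  # year kept as a string

# Keep one year, then reduce each day to its largest num_sold value.
sample_2015 = sample[sample["year"] == "2015"]
daily_max = sample_2015.groupby("date", sort=False)["num_sold"].max().reset_index()

print(daily_max)
```

Each date ends up with exactly one row, so a line chart of `daily_max` draws a single point per day.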


### Sales for a Year 

In [None]:
train_data_2015 = train_data[train_data['year']=='2015']

fig1 = px.line(train_data_2015, x='date', y="num_sold",title="Daily Sales for 2015")
fig1.show()

### Maximum sales for a Year 

In [None]:
train_data_2015_max = pd.DataFrame(train_data_2015.groupby(['date','year'], sort=False)['num_sold'].max()).reset_index()

fig2= px.line(train_data_2015_max, x='date', y="num_sold", title="Maximum Daily Sales for 2015")
fig2.show()

### Year-over-Year Maximum Sales

In [None]:
train_data_max = pd.DataFrame(train_data.groupby(['date','year','month','day'], sort=False)['num_sold'].max()).reset_index()

fig3 = px.line(train_data_max, x='date', y="num_sold", title="Maximum Daily Sales 2015-18")
fig3.update_xaxes(rangeslider_visible=True)
fig3.show()

### Month-over-Month Maximum Sales

In [None]:
train_data_max_monthly = train_data_max.groupby(['year','month'], sort=False)['num_sold'].max().reset_index()

fig4 = px.area(train_data_max_monthly, x='month', y='num_sold', facet_col="year", color ='year',facet_col_wrap=2)
fig4.show()

### Store vs Years

In [None]:
year_sold = train_data.groupby(['year','store'], sort=False)['num_sold'].max().reset_index()
fig5 = px.bar(year_sold, x='year',y='num_sold',barmode='group',color='store',title="Maximum Sales per Year by Store")
fig5.show()

### Product Sales

In [None]:
product_year_sold = train_data.groupby(['year','product'], sort=False)['num_sold'].max().reset_index()
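# Group the bars side by side: stacking per-product maxima would not give a meaningful total.
figure = px.bar(product_year_sold, x='year',y='num_sold',color='product',barmode='group',title="Maximum Sales per Year by Product")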
figure.show()

### Sales per Country

In [None]:
country_year_sold = train_data.groupby(['year','country'], sort=False)['num_sold'].max().reset_index()
figure_country = px.bar(country_year_sold, x='year',y='num_sold',color='country',
                        barmode='group',title="Maximum Sales per Year by Country")
figure_country.show()

# Please do upvote! 😃

<hr>

### References and Resources
 
 1. [Plotly](https://plotly.com/) 
 2. [Pandas](https://pandas.pydata.org/)


<div class="alert alert-block alert-warning" align="center" style="margin-top: 20px">
    <h2><i>Thank you, and I wish you a Happy New Year!</i> 🎉</h2>
</div>