# Exploratory Data Analysis
Exploratory data analysis (EDA) is the foundation of every meaningful machine learning project. It reveals the nature of the data and often informs how you go about modelling. A careful exploration covers every available feature, the interactions and correlations between features, and how each feature varies with respect to the target.

In this task, you explore the behaviour of customers across the various stores. The goal is to check how measures such as promos and the opening of new stores affect purchasing behaviour.
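As a first look at the promo question, one option is to compare average sales and customer counts on promo and non-promo days. The snippet below is a minimal sketch on a made-up toy frame. The column names `Promo`, `Sales` and `Customers` follow the training data, but the values are invented for illustration.

```python
import pandas as pd

# Toy data standing in for the real training set (values are made up).
toy = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3, 3],
    "Promo": [0, 1, 0, 1, 0, 1],
    "Sales": [4000, 6000, 5000, 7000, 3000, 5000],
    "Customers": [400, 550, 500, 600, 300, 450],
})

# Average sales and customer counts on non-promo (0) vs. promo (1) days.
promo_effect = toy.groupby("Promo")[["Sales", "Customers"]].mean()
print(promo_effect)
```

A difference in group means is only a starting point. It does not control for store type, weekday or holidays.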

To achieve this goal, you first need to clean the data. The cleaning process involves building pipelines to detect and handle outliers and missing data. This is particularly important because outliers and missing values can skew the analysis.
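Before handling anything, it helps to measure how much is missing and which values look extreme. The sketch below uses plain pandas on invented values. The 1.5 × IQR rule is one common convention for flagging outliers, not necessarily the rule used by the helper class in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one extreme sales value and some missing distances (made up).
toy = pd.DataFrame({
    "Sales": [5000, 5200, 4800, 5100, 40000, 4900],
    "CompetitionDistance": [1270.0, np.nan, 14130.0, 620.0, np.nan, 310.0],
})

# Share of missing values per column (0.0 means complete).
missing_share = toy.isna().mean()

# Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] as outliers.
q1, q3 = toy["Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["Sales"] < q1 - 1.5 * iqr) | (toy["Sales"] > q3 + 1.5 * iqr)

print(missing_share)
print(toy[is_outlier])
```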

Visualizing features and their interactions is necessary for clearly communicating your findings, and it is a powerful tool in the data science toolbox. Communicate the findings below with suitable plots.

You can use the following questions as a guide during your analysis. It is important to come up with more questions of your own to explore; this is part of the expectation for an excellent analysis.

In [3]:
# import libraries
import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import seaborn as sns
import os
import sys
# make the helper modules in ../scripts importable
sys.path.append(os.path.abspath('../scripts'))
from eda import EDA
from Clean import Clean
sns.set()
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.options.mode.chained_assignment = None  # default='warn'
plt.rcParams["figure.figsize"] = (12, 8)
pd.set_option('display.max_columns', None)

In [4]:
# read the CSV files
train = pd.read_csv("../data/train.csv", index_col=False)
test = pd.read_csv("../data/test.csv", index_col=False)
store = pd.read_csv("../data/store.csv", index_col=False)
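In the merged output further down, the `StateHoliday` column appears to hold both `0` and letters such as `a`. When a column mixes types like this, pandas may have to guess its type while reading. The sketch below shows one way to pin the types explicitly with `dtype` and `parse_dates`. The inline CSV text is an invented stand-in for `train.csv`.

```python
import io
import pandas as pd

# Tiny inline CSV standing in for train.csv (rows are illustrative only).
csv_text = (
    "Store,Date,Sales,StateHoliday\n"
    "1,2015-07-31,5263,0\n"
    "1111,2013-01-01,0,a\n"
)

# Read StateHoliday as text so "0" and "a" share one type,
# and parse Date into a datetime column.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"StateHoliday": str},
    parse_dates=["Date"],
)
print(df.dtypes)
```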


In [5]:
# clean the train dataset
clean_df = Clean(train)

[07/Sep/2022 03:37:55] INFO - Successfully initialized clean class


In [6]:
clean_df.merge_df(store, 'Store')  # merge the store data into the train dataset

[07/Sep/2022 03:37:55] INFO - Successfully merged the dataframe


Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,5,2015-07-31,5263,555,1,1,0,1,c,a,1270.000,9.000,2008.000,0,,,
1,2,5,2015-07-31,6064,625,1,1,0,1,a,a,570.000,11.000,2007.000,1,13.000,2010.000,"Jan,Apr,Jul,Oct"
2,3,5,2015-07-31,8314,821,1,1,0,1,a,a,14130.000,12.000,2006.000,1,14.000,2011.000,"Jan,Apr,Jul,Oct"
3,4,5,2015-07-31,13995,1498,1,1,0,1,c,c,620.000,9.000,2009.000,0,,,
4,5,5,2015-07-31,4822,559,1,1,0,1,a,a,29910.000,4.000,2015.000,0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1017204,1111,2,2013-01-01,0,0,0,0,a,1,a,a,1900.000,6.000,2014.000,1,31.000,2013.000,"Jan,Apr,Jul,Oct"
1017205,1112,2,2013-01-01,0,0,0,0,a,1,c,c,1880.000,4.000,2006.000,0,,,
1017206,1113,2,2013-01-01,0,0,0,0,a,1,a,c,9260.000,,,0,,,
1017207,1114,2,2013-01-01,0,0,0,0,a,1,a,c,870.000,,,0,,,
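The implementation of `merge_df` is not shown here. A plausible plain-pandas equivalent is a left join on `Store`, sketched below on toy frames. The `how="left"` and `validate="many_to_one"` arguments are assumptions: they keep every sales row and raise an error if a store appears more than once in the store table.

```python
import pandas as pd

# Toy stand-ins for the train and store tables (values are illustrative).
train_toy = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5263, 5020, 6064]})
store_toy = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# Left join: every sales row is kept and gains its store's attributes.
merged = train_toy.merge(
    store_toy, on="Store", how="left", validate="many_to_one"
)
print(merged)
```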


In [7]:
clean_df.drop_missing_values()

[07/Sep/2022 03:37:56] INFO - NumExpr defaulting to 8 threads.
[07/Sep/2022 03:37:56] INFO - Successfully dropped the columns with missing values
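The log says that columns with missing values were dropped, but not by which rule. One possible rule, sketched below, drops a column when its share of missing values exceeds a threshold. The function `drop_sparse_columns` and the 30% threshold are hypothetical and not taken from the `Clean` class.

```python
import numpy as np
import pandas as pd

# Toy frame: one mostly complete column and one mostly empty column.
toy = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [1270.0, 570.0, np.nan, 620.0],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, np.nan],
})

def drop_sparse_columns(df, max_missing_share=0.3):
    """Keep only columns whose share of missing values is at most the threshold."""
    missing_share = df.isna().mean()
    keep = missing_share[missing_share <= max_missing_share].index
    return df[keep]

cleaned = drop_sparse_columns(toy)
print(cleaned.columns.tolist())
```

Columns that survive with a few gaps, such as `CompetitionDistance` here, still need imputation or row-level handling afterwards.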


In [8]:
clean_df.fix_outliers('Sales',25000)

[07/Sep/2022 03:37:56] INFO - Successfully stored the features
[07/Sep/2022 03:37:56] INFO - Successfully handled outliers
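How `fix_outliers('Sales', 25000)` treats values above 25000 is not visible from the log. One common choice is to cap them at the threshold, as in the sketch below. This is an assumption; replacing them with the median or dropping the rows are equally plausible.

```python
import pandas as pd

# Toy sales values, one of them above the 25000 threshold (made up).
sales = pd.Series([5263, 6064, 41551, 13995, 0], name="Sales")

# Cap values above the threshold; values at or below it are unchanged.
capped = sales.clip(upper=25000)
print(capped.tolist())
```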


In [9]:
clean_df.remove_unnamed_cols()

[07/Sep/2022 03:37:56] INFO - Successfully removed columns with head unnamed


In [10]:
clean_df.transfrom_time_series("Store","Date")

[07/Sep/2022 03:37:56] INFO - Successfully transformed data to time series data
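The time series transformation takes the store and date columns as arguments. A reasonable guess at what it does is sketched below: parse the dates, sort by store and date, and index the rows by date. The actual behaviour of the helper may differ.

```python
import pandas as pd

# Toy rows in arbitrary order (values are illustrative).
toy = pd.DataFrame({
    "Store": [2, 1, 1],
    "Date": ["2015-07-31", "2015-07-31", "2015-07-30"],
    "Sales": [6064, 5263, 5020],
})

# Parse dates, order rows chronologically within each store,
# then use the date as the index.
toy["Date"] = pd.to_datetime(toy["Date"])
ts = toy.sort_values(["Store", "Date"]).set_index("Date")
print(ts)
```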


In [11]:
clean_df.save(name="../data/trainstore.csv")

[07/Sep/2022 03:38:00] INFO - Successfully saved the dataframe


In [12]:
# clean the test dataset with the same steps
clean_df = Clean(test)

[07/Sep/2022 03:38:00] INFO - Successfully initialized clean class


In [13]:
clean_df.merge_df(store,'Store')

[07/Sep/2022 03:38:00] INFO - Successfully merged the dataframe


Unnamed: 0,Id,Store,DayOfWeek,Date,Open,Promo,StateHoliday,SchoolHoliday,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,1,4,2015-09-17,1.000,1,0,0,c,a,1270.000,9.000,2008.000,0,,,
1,2,3,4,2015-09-17,1.000,1,0,0,a,a,14130.000,12.000,2006.000,1,14.000,2011.000,"Jan,Apr,Jul,Oct"
2,3,7,4,2015-09-17,1.000,1,0,0,a,c,24000.000,4.000,2013.000,0,,,
3,4,8,4,2015-09-17,1.000,1,0,0,a,a,7520.000,10.000,2014.000,0,,,
4,5,9,4,2015-09-17,1.000,1,0,0,a,c,2030.000,8.000,2000.000,0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41083,41084,1111,6,2015-08-01,1.000,0,0,0,a,a,1900.000,6.000,2014.000,1,31.000,2013.000,"Jan,Apr,Jul,Oct"
41084,41085,1112,6,2015-08-01,1.000,0,0,0,c,c,1880.000,4.000,2006.000,0,,,
41085,41086,1113,6,2015-08-01,1.000,0,0,0,a,c,9260.000,,,0,,,
41086,41087,1114,6,2015-08-01,1.000,0,0,0,a,c,870.000,,,0,,,


In [14]:
clean_df.drop_missing_values()

[07/Sep/2022 03:38:01] INFO - Successfully dropped the columns with missing values


In [15]:
# same outlier step as for train; note that the merged test frame above has no 'Sales' column
clean_df.fix_outliers('Sales', 25000)

[07/Sep/2022 03:38:01] INFO - Successfully stored the features
[07/Sep/2022 03:38:01] INFO - Successfully handled outliers


In [16]:
clean_df.remove_unnamed_cols()

[07/Sep/2022 03:38:01] INFO - Successfully removed columns with head unnamed


In [17]:
clean_df.transfrom_time_series("Store","Date")

[07/Sep/2022 03:38:01] INFO - Successfully transformed data to time series data


In [18]:
# drop the Id column, which exists only in the test set
clean_df.get_df().drop('Id', axis=1, inplace=True)

In [19]:
clean_df.save(name="../data/teststore.csv")

[07/Sep/2022 03:38:01] INFO - Successfully saved the dataframe
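The train and test sets go through the same sequence of steps, so the sequence could be wrapped in one function and applied to both. The sketch below uses `DataFrame.pipe` with plain-pandas stand-ins. The helpers `drop_columns_with_missing`, `cap_column` and `clean` are hypothetical and are not the methods of the `Clean` class.

```python
import numpy as np
import pandas as pd

def drop_columns_with_missing(df):
    """Drop every column that contains at least one missing value."""
    return df.dropna(axis=1)

def cap_column(df, column, upper):
    """Cap `column` at `upper`; return df unchanged if the column is absent."""
    if column not in df.columns:
        return df
    return df.assign(**{column: df[column].clip(upper=upper)})

def clean(df):
    """Apply the same cleaning steps to any frame."""
    return df.pipe(drop_columns_with_missing).pipe(cap_column, "Sales", 25000)

# Toy stand-ins for the merged train and test frames (values are made up).
train_toy = pd.DataFrame(
    {"Store": [1, 2], "Sales": [5263, 30000], "Promo2SinceWeek": [np.nan, 13.0]}
)
test_toy = pd.DataFrame(
    {"Id": [1, 2], "Store": [1, 3], "Promo2SinceWeek": [np.nan, 14.0]}
)

train_clean = clean(train_toy)
test_clean = clean(test_toy).drop(columns="Id")
print(train_clean)
print(test_clean)
```

Sharing one function keeps the two datasets consistent. The guard in `cap_column` also makes explicit what happens when the test set lacks the `Sales` column.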
