# Transactions data

Let's break down the transactions dataset.

Its structure is simple:
 - date: day
 - store_nbr: which store?
 - transactions: number of transactions by that store on that day.

First we check the values in the data, then restructure the dataset to give each store number its own column. Then we do a basic analysis.

In [1]:
# import libs
import pandas as pd
from matplotlib import pyplot as plt
from pandas_profiling import ProfileReport

from EDA.analysis import describe_dataset, check_missing_dates

from definitions import TRANSACTIONS_FILE

# load data
df = pd.read_csv(TRANSACTIONS_FILE)
describe_dataset(df)

# show histogram
df.transactions.hist(bins=100)
plt.show()

print(f'has nans: {df.transactions.hasnans}')

['date', 'store_nbr', 'transactions']

length 83437
          store_nbr  transactions
count  83437.000000  83437.000000
mean      26.939296   1694.694536
std       15.608269    963.380084
min        1.000000      5.000000
25%       13.000000   1046.000000
50%       27.000000   1393.000000
75%       40.000000   2079.000000
max       54.000000   8359.000000
has nans: False




Nothing surprising: no NaNs, transactions range from 5 to 8359, and the distribution is positively skewed.
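The skew can also be checked numerically instead of being read off the histogram. The following is a minimal sketch on made-up values (only the minimum of 5 and the maximum of 8359 come from the output above; `toy` is a hypothetical stand-in for `df.transactions`). A positive skewness, together with a mean above the median, points to a long right tail.

```python
import pandas as pd

# Made-up sample standing in for df.transactions (not the real data).
toy = pd.Series([5, 800, 1000, 1200, 1400, 1700, 2100, 3500, 8359], name="transactions")

# Sample skewness: positive values indicate a long right tail.
skewness = toy.skew()

print(f"skewness: {skewness:.2f}")
print(f"mean: {toy.mean():.1f}, median: {toy.median():.1f}")
```

On the real data the same call would be `df.transactions.skew()`.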

In [6]:
transactions_by_store = []
for store_nbr, transactions in df.groupby('store_nbr'):
    transactions = transactions.set_index('date')
    transactions = transactions[['transactions']].rename(columns={'transactions': 'transactions_' + str(store_nbr)})
    transactions_by_store.append(transactions)

grouped_by_store = pd.concat(transactions_by_store, axis=1)

# create minimal (descriptives only) profile report.
report = ProfileReport(grouped_by_store, title="Transactions dataset report.", minimal=True)
# manually enable matrix to visualise missing data.
report.config.missing_diagrams = {'bar': False, 'matrix': True, 'heatmap': False, 'dendrogram': False}

# uncomment to save as an HTML file if you would rather look at the report in your browser.
# report.to_file('transactions_report.html')
report.to_notebook_iframe()
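The groupby loop above reshapes the data from long to wide format. As an alternative sketch, not the approach used in this notebook, the same layout can probably be produced with `DataFrame.pivot`, assuming every (date, store_nbr) pair occurs at most once. The `toy` frame below is made-up data standing in for `df`.

```python
import pandas as pd

# Tiny synthetic stand-in for the transactions file (values are made up).
toy = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-02", "2013-01-03"],
    "store_nbr": [25, 1, 25, 1],
    "transactions": [770, 2111, 1038, 1833],
})

# One row per date, one column per store; absent (date, store) pairs become NaN.
wide = toy.pivot(index="date", columns="store_nbr", values="transactions")

# Match the transactions_<store_nbr> column naming used in the loop.
wide.columns = [f"transactions_{store_nbr}" for store_nbr in wide.columns]

print(wide)
```

`pivot` raises a `ValueError` when a (date, store_nbr) pair is duplicated, which doubles as a cheap sanity check on the input.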


Observations:
- Some stores have a lot of missing values.
- Most histograms show distributions that are roughly normal or positively skewed.
- Some stores are heavily positively skewed.
- Some stores have bimodal transaction distributions.

There is no analysis of relatedness or correlations because the dataset is too large: correlating all 55 variables and building the resulting graphs is too much for pandas_profiling.
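If the correlations are still of interest, one option is to compute them with pandas directly, outside the profile report. This is a sketch only and was not run as part of this notebook; `toy_wide` is synthetic data standing in for `grouped_by_store`, and the `min_periods` threshold of 30 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for grouped_by_store (values are made up).
rng = np.random.default_rng(0)
dates = pd.date_range("2013-01-01", periods=60, freq="D")
base = rng.normal(1500, 200, size=len(dates))
toy_wide = pd.DataFrame(
    {
        "transactions_1": base + rng.normal(0, 20, size=len(dates)),
        "transactions_2": base * 0.5 + rng.normal(0, 20, size=len(dates)),
        "transactions_3": rng.normal(900, 100, size=len(dates)),
    },
    index=dates,
)
toy_wide.iloc[:10, 2] = np.nan  # mimic a store that opened later

# Pairwise Pearson correlation; NaNs are dropped per column pair, and pairs
# with fewer than min_periods overlapping observations yield NaN.
corr = toy_wide.corr(method="pearson", min_periods=30)

print(corr.round(2))
```

The result is a plain square DataFrame, so it can be inspected or plotted selectively instead of rendering every pairwise graph at once.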

## Missing values
The missing values matrix shows that some stores have no data entries before a certain date. We can probably attribute that to new stores being opened during the timespan of the dataset.
There are also gaps, likely due to stores being closed for an extended period.
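The two patterns described above (late starts and gaps) can be separated per store with a few lines of pandas. The sketch below uses a made-up `toy_wide` frame in place of `grouped_by_store`; the column names `first_date`, `leading_missing` and `gap_missing` are hypothetical and chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for grouped_by_store (values are made up).
dates = pd.date_range("2013-01-01", periods=10, freq="D")
toy_wide = pd.DataFrame(
    {
        "transactions_1": [100.0] * 10,                               # complete
        "transactions_2": [np.nan] * 4 + [200.0] * 6,                 # opened later
        "transactions_3": [300.0] * 3 + [np.nan] * 3 + [300.0] * 4,   # closed for a while
    },
    index=dates,
)

rows = {}
for col in toy_wide.columns:
    series = toy_wide[col]
    first, last = series.first_valid_index(), series.last_valid_index()
    rows[col] = {
        "first_date": first,
        # NaNs before the first observation: the store had probably not opened yet.
        "leading_missing": int(series.loc[:first].isna().sum()),
        # NaNs between first and last observation: gaps while the store existed.
        "gap_missing": int(series.loc[first:last].isna().sum()),
    }

summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary)
```

Note that this only counts dates present in the index; a date missing for every store would not show up as a NaN at all.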
