# Exploratory Data Analysis

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# local packages
from utils import Utils
from data_utils import Data

## Getting to know the data

In [None]:
# load data:
data = Data(Utils.read_config_for_env(config_path='../config/config.yml'))

In [None]:
data.shop_list.info()

In [None]:
data.item_list.info()

In [None]:
data.category_list.info()

In [None]:
data.transactions.info(show_counts=True)

What does price mean: is it the unit price or the total price (= unit price * amount)? A high correlation between amount and price would indicate that price is the total price.

In [None]:
plt.plot(data.transactions.amount, data.transactions.price, 'o')

The plot shows no visible correlation between amount and price, which suggests that price is the unit price. It would still be good to double-check this with the customer.
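
The visual impression can be backed by a number. The following is a minimal sketch on made-up toy data, not on the real transactions: it builds a unit price and a total price for the same rows and compares how strongly each correlates with the amount. The column names `unit_price` and `total_price` are assumptions for illustration only; in the notebook one would compute `data.transactions.amount.corr(data.transactions.price)` instead.

```python
import pandas as pd

# Toy data (made up): two items with different unit prices, bought in
# quantities of 1 to 3. This is NOT the real transactions table.
toy = pd.DataFrame({
    "amount": [1, 2, 3, 1, 2, 3],
    "unit_price": [10.0, 10.0, 10.0, 20.0, 20.0, 20.0],
})
toy["total_price"] = toy["unit_price"] * toy["amount"]

# Pearson correlation of amount with each price interpretation.
corr_unit = toy["amount"].corr(toy["unit_price"])
corr_total = toy["amount"].corr(toy["total_price"])

print(f"corr(amount, unit price):  {corr_unit:.2f}")
print(f"corr(amount, total price): {corr_total:.2f}")
```

A correlation near zero is what we would expect for a unit price, while a total price grows with the amount by construction.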

In [None]:
print("Shapes of Each Table:")
print(f"Items table: {data.item_list.shape}")
print(f"Shops table: {data.shop_list.shape}")
print(f"Categories table: {data.category_list.shape}")
print(f"Transactions Table: {data.transactions.shape}")

In [None]:
print("Unique count of items, shops, categories:")
print(f"Unique item count: {data.item_list.item_id.unique().shape[0]}")
print(f"Unique shop count: {data.shop_list.shop_id.unique().shape[0]}")
print(f"Unique category count: {data.category_list.item_category_id.unique().shape[0]}")

In [None]:
print("Representation of the transaction data:")
print(f'Unique items in transactions: {data.transactions.item_id.unique().shape[0]}')
print(f'Unique shops in transactions: {data.transactions.shop_id.unique().shape[0]}')

First insights:
- No missing values at first sight; however, they may be encoded as sentinel values such as -99.
- Some items are not represented in the transactions data, which means they were never bought. This is entirely plausible, but it means we cannot make predictions for these items unless we get another data source on which items are available in which shops.
- 1 shop (out of 60) is not represented in the transactions data. Without transaction data for this shop we cannot make predictions for it, since we don't know which items are available there.
- Item categories (from the item_list table) seem like a potential feature for predictions.
- The month, extracted from the transaction dates, should be used as a predictive feature. We have to look closer to decide whether the year is a useful feature as well.
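
Two of these points can be checked programmatically. The sketch below uses small made-up frames in place of `data.item_list` and `data.transactions`; the list of sentinel codes is a guess, not something confirmed by the customer.

```python
import pandas as pd

# Toy stand-ins for the item list and the transactions (made-up values).
items = pd.DataFrame({"item_id": [1, 2, 3, 4]})
transactions = pd.DataFrame({
    "item_id": [1, 1, 3],
    "price": [10.0, -99.0, 5.0],
    "amount": [1, 2, 1],
})

# Items that never occur in the transactions (anti-join via isin).
never_sold = items.loc[~items["item_id"].isin(transactions["item_id"]), "item_id"]
print(f"Items without transactions: {never_sold.tolist()}")

# Count occurrences of typical sentinel codes per numeric column.
# These candidate codes are assumptions and should be adapted to the data.
SENTINELS = [-99, -999, 9999]
numeric = transactions.select_dtypes(include="number")
sentinel_counts = numeric.isin(SENTINELS).sum()
print(sentinel_counts)
```

The same `isin` pattern works for the shop check by swapping `item_id` for `shop_id`.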

## Merge tables to ease analysis and visualisation

In [None]:
data_merged = data.merge_data()
data_merged.info(show_counts=True)

## Identifying implausible values and outliers

### Date

In [None]:
print(f"Date range: {data_merged.date.min()}, {data_merged.date.max()}")

Let's look at the distribution of transactions over time, to see if there is any anomaly:

In [None]:
datecounts = data_merged.groupby('date').size()

In [None]:
# reset_index(name='count') already yields the columns 'date' and 'count',
# so no renaming is needed
datecounts = datecounts.reset_index(name='count')

In [None]:
plt.bar(datecounts['date'], datecounts['count'], color='skyblue')

# Adding title and labels
plt.title('Count of Transactions for Each Date')
plt.xlabel('Date')
plt.ylabel('Count')

# Display the plot
plt.show()

### Price

In [None]:
data_merged.price.describe()

In [None]:
data_merged[data_merged.price<0]

For price, -1 is certainly an implausible value. We should remove (or possibly impute) the rows where price < 0.
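
Both options can be sketched on toy data. The imputation strategy shown here (median price of the same item) is only one possible choice and an assumption on our side; it is not necessarily what `data.clean_data` does.

```python
import pandas as pd

# Toy data (made up): item 1 has one implausible negative price.
toy = pd.DataFrame({
    "item_id": [1, 1, 1, 2, 2],
    "price": [100.0, -1.0, 120.0, 50.0, 50.0],
})

# Option 1: drop the rows with a negative price.
dropped = toy[toy["price"] >= 0]

# Option 2: treat negative prices as missing and fill them with the
# median price of the same item.
imputed = toy.copy()
imputed["price"] = imputed["price"].where(imputed["price"] >= 0)  # negatives -> NaN
item_median = imputed.groupby("item_id")["price"].transform("median")
imputed["price"] = imputed["price"].fillna(item_median)

print(dropped)
print(imputed)
```

Dropping is simpler and loses little if only a handful of rows are affected; imputing keeps the amount information of those rows.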

In [None]:
plt.boxplot(data_merged.price)

One point in the boxplot stands out as an outlier. Let's take a closer look at transactions where price > 100K.

In [None]:
pd.set_option('display.max_rows', None)
data_merged[data_merged.price>100000]

Prices above 100000 (Euros?) for software like Photoshop, or for items like Xbox and PS (amount=1), look suspicious. However, given that these shops seem to be located in Russia, the currency may well be Rubles (this is something we should clarify with the customer). Assuming 1 Ruble = 0.010 Euros, these prices correspond to about 1000 Euros, which starts to become plausible. But what about the most expensive item shown in the boxplot?

In [None]:
data_merged[data_merged.price>200000]

It turns out that Radmin is also some software, for which a price of presumably 10.000 Euros seems unreasonable, but the '522 persons' in the item name makes it difficult to judge. There can be outliers in the lower range too.
Let's eliminate the rows with significantly deviating prices (e.g., outside mean +/- 3*std). At a later stage, we can try to eliminate outliers based on statistics of the categories or individual items.
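
A minimal sketch of the mean +/- 3*std rule on toy data follows. The helper name `remove_outliers_3sigma` is hypothetical and only illustrates the idea; the actual logic lives in `data.clean_data` and may differ.

```python
import pandas as pd


def remove_outliers_3sigma(df, column, n_std=3.0):
    """Keep rows whose `column` lies within mean +/- n_std * std."""
    mean = df[column].mean()
    std = df[column].std()
    lower, upper = mean - n_std * std, mean + n_std * std
    return df[df[column].between(lower, upper)]


# Toy data (made up): 30 ordinary prices and one extreme value.
toy = pd.DataFrame({"price": [100.0] * 30 + [100000.0]})
filtered = remove_outliers_3sigma(toy, "price")

print(f"Rows before: {len(toy)}, rows after: {len(filtered)}")
```

Note that mean and std are themselves pulled by extreme values, so on heavily skewed data this rule can be too lenient; a quantile-based threshold would be an alternative.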

### Amount

In [None]:
data_merged.amount.describe()

In [None]:
data_merged[data_merged.amount<-5]

In [None]:
data_merged[data_merged.amount<0].shape

Here, negative values might be plausible, as they may represent returns. This is something to check with the customer. For now, we assume that we are not interested in these rows and will remove them; predicting returns is probably outside the scope of the current project anyway.
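
Before dropping these rows it helps to know how many are affected. A small sketch on made-up amounts, assuming that a negative amount marks a return:

```python
import pandas as pd

# Toy data (made up): two of six rows have a negative amount.
toy = pd.DataFrame({"amount": [1, 2, -1, 3, -2, 1]})

# Boolean mask of presumed returns (assumption: negative amount = return).
is_return = toy["amount"] < 0
print(f"Share of rows with negative amount: {is_return.mean():.1%}")

# Keep only the rows that are not presumed returns.
sales_only = toy[~is_return]
print(sales_only)
```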

In [None]:
plt.boxplot(data_merged.amount)

In [None]:
data_merged[data_merged.amount>500]

Some of these seem okay (e.g., Ticket), but some look definitely strange (e.g., Grand Theft Auto).
Similar to the price outlier logic, let's eliminate the rows with significantly deviating amounts. At a later stage, we can try to eliminate outliers based on statistics of the categories or individual items.
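
The per-item variant mentioned for a later stage could look roughly like this. It is a sketch under assumptions: the helper `remove_group_outliers` and the column name `item_name` are hypothetical, and the toy values are made up.

```python
import pandas as pd


def remove_group_outliers(df, column, group_col, n_std=3.0):
    """Keep rows within mean +/- n_std * std of their own group."""
    grouped = df.groupby(group_col)[column]
    mean = grouped.transform("mean")
    std = grouped.transform("std")
    within = (df[column] - mean).abs() <= n_std * std
    # Groups with a single row have no std (NaN); keep those rows.
    return df[within | std.isna()]


# Toy data (made up): large amounts are normal for tickets but not for games.
toy = pd.DataFrame({
    "item_name": ["game"] * 21 + ["ticket"] * 5,
    "amount": [1] * 20 + [1000] + [500, 510, 490, 505, 495],
})
cleaned = remove_group_outliers(toy, "amount", "item_name")

print(cleaned.groupby("item_name")["amount"].agg(["count", "max"]))
```

Judging each row against its own item keeps the large but typical ticket amounts while removing the single extreme game amount, which a global threshold could not distinguish.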

## Data cleaning

In [None]:
data_cleaned = data.clean_data(data_merged)

In [None]:
data_cleaned.describe()

In [None]:
plt.boxplot(data_cleaned.price)

In [None]:
plt.boxplot(data_cleaned.amount)