#### Exploratory Data Analysis (EDA) - Initial Data Visualization: Dispersion and Outliers

Dataset: 

- _fs_feature.csv_

Author: Luis Sergio Pastrana Lemus  
Date: 2025-09-09

# Exploratory Data Analysis – Food supplier Dataset

## __1. Libraries__.

In [1]:
from pathlib import Path
import sys

# Define the project root dynamically: take the notebook's working directory and move one level up
project_root = Path.cwd().parent

# Add the project root to sys.path if it is not already there, so the src package can be imported
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import the project's helper functions (wildcard import; explicit names would give tighter control)
from src import *


from IPython.display import display, HTML
import os
import pandas as pd
import numpy as np

## __2. Path to Data file__.

In [2]:
# Build the path to the data file and load it
data_file_path = project_root / "data" / "processed" / "feature"

df_fs = load_dataset_from_csv(data_file_path, "fs_feature.csv", header='infer', parse_dates=['datetime'])


In [3]:
# Format notebook output
format_notebook()

## __3. Exploratory Data Analysis__.

### 3.0 Casting Data types.

In [4]:
# Casting dtypes
# df_fs 'eventname' to category (plain column assignment, so the new dtype is kept)
df_fs['eventname'] = df_fs['eventname'].astype('category')

# df_fs 'date' and 'time' to Python date and time objects
df_fs['date'] = pd.to_datetime(df_fs['date']).dt.date
df_fs['time'] = pd.to_datetime(df_fs['time'], format='%H:%M:%S').dt.time
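
As a minimal illustration of the casts above, the following sketch applies the same conversions to a small hypothetical frame. The rows are made up; only the column names come from the dataset. It assumes plain column assignment, which replaces the column and so takes on the new dtype.

```python
import datetime as dt

import pandas as pd

# Hypothetical sample rows; only the column names mirror fs_feature.csv
sample = pd.DataFrame({
    "eventname": ["mainscreenappear", "cartscreenappear", "mainscreenappear"],
    "date": ["2019-07-25", "2019-07-25", "2019-07-26"],
    "time": ["04:43:36", "11:28:47", "00:00:01"],
})

# Assigning to the column replaces it, so the categorical dtype is kept
sample["eventname"] = sample["eventname"].astype("category")

# .dt.date and .dt.time yield Python date and time objects (object dtype)
sample["date"] = pd.to_datetime(sample["date"]).dt.date
sample["time"] = pd.to_datetime(sample["time"], format="%H:%M:%S").dt.time

print(sample.dtypes)
```

Because `date` and `time` end up as plain Python objects, vectorized datetime operations are only available on the `datetime` column.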

In [5]:
df_fs

Unnamed: 0,eventname,deviceidhash,datetime,expid,date,time
0,mainscreenappear,4575588528974610257,2019-07-25 04:43:36+00:00,246,2019-07-25,04:43:36
1,mainscreenappear,7416695313311560658,2019-07-25 11:11:42+00:00,246,2019-07-25,11:11:42
2,paymentscreensuccessful,3518123091307005509,2019-07-25 11:28:47+00:00,248,2019-07-25,11:28:47
3,cartscreenappear,3518123091307005509,2019-07-25 11:28:47+00:00,248,2019-07-25,11:28:47
4,paymentscreensuccessful,6217807653094995999,2019-07-25 11:48:42+00:00,248,2019-07-25,11:48:42
...,...,...,...,...,...,...
243708,mainscreenappear,4599628364049201812,2019-08-07 21:12:25+00:00,247,2019-08-07,21:12:25
243709,mainscreenappear,5849806612437486590,2019-08-07 21:13:59+00:00,246,2019-08-07,21:13:59
243710,mainscreenappear,5746969938801999050,2019-08-07 21:14:43+00:00,246,2019-08-07,21:14:43
243711,mainscreenappear,5746969938801999050,2019-08-07 21:14:58+00:00,246,2019-08-07,21:14:58


### 3.3 Data Visualization: Data Analysis.

3.3.1 How many events are in the logs?

In [6]:
# Show total number of events within the records
display(HTML(f"> <b>Total</b> number of <b>events</b> within the records: <i>{df_fs.shape[0]}</i> (events)"))

3.3.2 How many users are there in the records?

In [7]:
# Show total number of users within the records
display(HTML(f"> <b>Total</b> number of <b>users</b> within the records: <i>{df_fs['deviceidhash'].nunique()}</i> (users)"))

3.3.3 What is the average number of events per user?

In [8]:
# Show the average number of events per user
display(HTML(f"> <b>Average</b> number of <b>events</b> per user: <i>{df_fs.groupby('deviceidhash')['eventname'].count().mean():.3f}</i> (events)"))
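
The three figures above are related: when `eventname` has no missing values, the mean of the per-user counts equals total events divided by total users. The sketch below checks this on a small hypothetical log (the device ids and rows are invented for illustration).

```python
import pandas as pd

# Hypothetical event log: 5 events from 2 devices
log = pd.DataFrame({
    "deviceidhash": [11, 11, 11, 22, 22],
    "eventname": ["mainscreenappear", "cartscreenappear", "tutorial",
                  "mainscreenappear", "offersscreenappear"],
})

total_events = len(log)                        # one row per event
total_users = log["deviceidhash"].nunique()    # distinct devices
events_per_user = log.groupby("deviceidhash")["eventname"].count()

# Mean of per-user counts vs. total events / total users
print(events_per_user.mean(), total_events / total_users)  # prints: 2.5 2.5
```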

3.3.4 What time period does the data cover? Find the maximum and minimum dates. Plot a histogram by date and time. What period do the data actually represent?

In [9]:
# Show the time period
display(HTML(f"> <b>Earliest date</b> within the records: <i>{df_fs['date'].min()}</i>"))
display(HTML(f"> <b>Latest date</b> within the records: <i>{df_fs['date'].max()}</i>"))

In [10]:
# Plot a histogram by date and time.
plotly_frequency_datetime_plotlypx(df_fs['datetime'], bins=500, density=False, color='grey', title='Event Distribution with Full Timestamp Precision', xlabel='Date-Time', ylabel='Frequency')
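
A per-day count table can complement the histogram when looking for the date at which the log becomes dense. This is a minimal sketch on hypothetical timestamps, using only pandas; it is not the plotting helper used above.

```python
import pandas as pd

# Hypothetical timestamps: sparse early days, denser later days
stamps = pd.Series(pd.to_datetime([
    "2019-07-25 04:43:36", "2019-07-28 07:24:12",
    "2019-08-01 00:07:28", "2019-08-01 00:08:00", "2019-08-01 00:08:55",
    "2019-08-02 10:00:00", "2019-08-02 10:05:00", "2019-08-02 11:00:00",
], utc=True), name="datetime")

# Truncate each timestamp to its calendar day (UTC) and count events per day
daily_counts = stamps.dt.floor("D").value_counts().sort_index()
print(daily_counts)
```

Days with very few events at the start of the index are candidates for the incomplete period.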

In [None]:
# Find the point at which the data becomes complete and ignore the earlier section. What period does the data actually represent?
# To prepare for outlier detection, drop the timezone and convert 'datetime' to Unix seconds
df_fs["datetime"] = df_fs["datetime"].dt.tz_convert(None)
df_fs["datetime"] = df_fs["datetime"].astype("int64") // 10**9
df_fs


Unnamed: 0,eventname,deviceidhash,datetime,expid,date,time
0,mainscreenappear,4575588528974610257,1564029816,246,2019-07-25,04:43:36
1,mainscreenappear,7416695313311560658,1564053102,246,2019-07-25,11:11:42
2,paymentscreensuccessful,3518123091307005509,1564054127,248,2019-07-25,11:28:47
3,cartscreenappear,3518123091307005509,1564054127,248,2019-07-25,11:28:47
4,paymentscreensuccessful,6217807653094995999,1564055322,248,2019-07-25,11:48:42
...,...,...,...,...,...,...
243708,mainscreenappear,4599628364049201812,1565212345,247,2019-08-07,21:12:25
243709,mainscreenappear,5849806612437486590,1565212439,246,2019-08-07,21:13:59
243710,mainscreenappear,5746969938801999050,1565212483,246,2019-08-07,21:14:43
243711,mainscreenappear,5746969938801999050,1565212498,246,2019-08-07,21:14:58
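
The cell above divides integer timestamps by `10**9`, which assumes the column is stored at nanosecond resolution. As a hedged alternative, the sketch below derives Unix seconds by subtracting the epoch, which does not depend on the stored resolution. It uses the first timestamp from the table as a check value.

```python
import pandas as pd

# First timestamp from the table above, as a tz-aware UTC Series
ts = pd.Series(pd.to_datetime(["2019-07-25 04:43:36"], utc=True))

# Whole seconds since the Unix epoch, independent of the stored resolution
epoch = pd.Timestamp("1970-01-01", tz="UTC")
unix_seconds = (ts - epoch) // pd.Timedelta(seconds=1)

# Converting back recovers the original timestamp
restored = pd.to_datetime(unix_seconds, unit="s", utc=True)
print(unix_seconds.iloc[0], restored.iloc[0])
```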


In [12]:
# Analyzing outliers
outlier_limit_bounds(df_fs, column='datetime', bound='both', clamp_zero=False)

Unnamed: 0,eventname,deviceidhash,datetime,expid,date,time
0,mainscreenappear,4575588528974610257,1564029816,246,2019-07-25,04:43:36
1,mainscreenappear,7416695313311560658,1564053102,246,2019-07-25,11:11:42
2,paymentscreensuccessful,3518123091307005509,1564054127,248,2019-07-25,11:28:47
3,cartscreenappear,3518123091307005509,1564054127,248,2019-07-25,11:28:47
4,paymentscreensuccessful,6217807653094995999,1564055322,248,2019-07-25,11:48:42
...,...,...,...,...,...,...
90,mainscreenappear,2793988848638831992,1564252142,247,2019-07-27,18:29:02
91,offersscreenappear,4284716907364183621,1564253762,247,2019-07-27,18:56:02
92,mainscreenappear,5218365729556903805,1564255514,247,2019-07-27,19:25:14
93,mainscreenappear,5218365729556903805,1564255522,247,2019-07-27,19:25:22
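
The implementation of `outlier_limit_bounds` is not shown in this notebook. A common way to obtain such bounds is the 1.5 × IQR rule (Tukey fences); the sketch below assumes that rule and uses a hypothetical helper named `iqr_bounds` on invented values, so it may not match the project's function.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical Unix timestamps: one early straggler, then a dense block
seconds = pd.Series([1_000, 5_000, 5_100, 5_200, 5_300, 5_400])

lower, upper = iqr_bounds(seconds)
outliers = seconds[(seconds < lower) | (seconds > upper)]
print(lower, upper, outliers.tolist())
```

Applied to timestamps, values below the lower fence correspond to the sparse early records rather than to measurement errors.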


In [13]:
display(HTML(f"> Lower = {pd.to_datetime(1564279636.500, unit='s', utc=True)}"))
display(HTML(f"> Upper = {pd.to_datetime(1565551552.500, unit='s', utc=True)}"))

In [14]:
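# Keep only the records between the lower and upper thresholds, then restore 'datetime' from Unix seconds
# Note: these thresholds sit 18,000 s (5 h) after the bounds displayed in the previous cell
df_fs = df_fs.loc[(df_fs['datetime'] > 1564297636.500) & (df_fs['datetime'] < 1565569552.500), :].copy()
df_fs = normalize_datetime(df_fs, include=['datetime'], unix_unit='s')
df_fs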

Unnamed: 0,eventname,deviceidhash,datetime,expid,date,time
110,offersscreenappear,5287660122200561379,2019-07-28 07:24:12+00:00,247,2019-07-28,07:24:12
111,mainscreenappear,4601930136642303959,2019-07-28 07:25:42+00:00,246,2019-07-28,07:25:42
112,mainscreenappear,4078637421796153763,2019-07-28 07:31:58+00:00,248,2019-07-28,07:31:58
113,mainscreenappear,2851587046453391812,2019-07-28 07:40:40+00:00,246,2019-07-28,07:40:40
114,mainscreenappear,9111985047183779356,2019-07-28 07:48:19+00:00,247,2019-07-28,07:48:19
...,...,...,...,...,...,...
243708,mainscreenappear,4599628364049201812,2019-08-07 21:12:25+00:00,247,2019-08-07,21:12:25
243709,mainscreenappear,5849806612437486590,2019-08-07 21:13:59+00:00,246,2019-08-07,21:13:59
243710,mainscreenappear,5746969938801999050,2019-08-07 21:14:43+00:00,246,2019-08-07,21:14:43
243711,mainscreenappear,5746969938801999050,2019-08-07 21:14:58+00:00,246,2019-08-07,21:14:58


In [15]:
# Plot a histogram by date and time, no outliers.
plotly_frequency_datetime_plotlypx(df_fs['datetime'], bins=500, density=False, color='grey', title='Event Distribution with Full Timestamp Precision', xlabel='Date-Time', ylabel='Frequency')

In [23]:
# Number and proportion of records that would be lost by excluding records before 2019-08-01
records = df_fs.shape[0]
excluded = df_fs.loc[(df_fs['datetime'] < '2019-08-01 00:00:00+00:00'), :].shape[0]
display(HTML(f"> Total number of records: {records}"))
display(HTML(f"> Number of records to be excluded: {excluded}"))
display(HTML(f"> Proportion of records to be excluded: {(excluded * 100) / records:.3f} %"))
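
The same count and proportion can be read directly from a boolean mask, since `True` counts as 1. The sketch below uses hypothetical timestamps around the cutoff and also shows that the complement of `<` is `>=`, so a record stamped exactly at midnight is kept.

```python
import pandas as pd

# Hypothetical tz-aware timestamps around the 2019-08-01 cutoff
stamps = pd.Series(pd.to_datetime([
    "2019-07-28 07:24:12", "2019-07-31 23:59:59",
    "2019-08-01 00:00:00", "2019-08-01 00:07:28", "2019-08-07 21:14:58",
], utc=True))

cutoff = pd.Timestamp("2019-08-01", tz="UTC")
is_old = stamps < cutoff          # records before the cutoff

excluded = int(is_old.sum())      # number of records before the cutoff
share = is_old.mean() * 100       # percentage of records before the cutoff
kept = stamps[~is_old]            # complement: at or after the cutoff

print(f"{excluded} of {len(stamps)} records excluded ({share:.1f} %)")
```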

In [24]:
# Records before August 1, 2019 are incomplete, so they are excluded from the analysis to avoid bias.
df_fs = df_fs.loc[(df_fs['datetime'] >= '2019-08-01 00:00:00+00:00'), :]
df_fs

Unnamed: 0,eventname,deviceidhash,datetime,expid,date,time
2826,tutorial,3737462046622621720,2019-08-01 00:07:28+00:00,246,2019-08-01,00:07:28
2827,mainscreenappear,3737462046622621720,2019-08-01 00:08:00+00:00,246,2019-08-01,00:08:00
2828,mainscreenappear,3737462046622621720,2019-08-01 00:08:55+00:00,246,2019-08-01,00:08:55
2829,offersscreenappear,3737462046622621720,2019-08-01 00:08:58+00:00,246,2019-08-01,00:08:58
2830,mainscreenappear,1433840883824088890,2019-08-01 00:08:59+00:00,247,2019-08-01,00:08:59
...,...,...,...,...,...,...
243708,mainscreenappear,4599628364049201812,2019-08-07 21:12:25+00:00,247,2019-08-07,21:12:25
243709,mainscreenappear,5849806612437486590,2019-08-07 21:13:59+00:00,246,2019-08-07,21:13:59
243710,mainscreenappear,5746969938801999050,2019-08-07 21:14:43+00:00,246,2019-08-07,21:14:43
243711,mainscreenappear,5746969938801999050,2019-08-07 21:14:58+00:00,246,2019-08-07,21:14:58


In [25]:
# Plot a histogram by date and time, complete records only.
plotly_frequency_datetime_plotlypx(df_fs['datetime'], bins=500, density=False, color='grey', title='Event Distribution with Full Timestamp Precision', xlabel='Date-Time', ylabel='Frequency')

In [26]:
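# Save the cleaned dataset, creating the output directory if it does not exist yet
project_root = Path.cwd().parent
processed_path = project_root / "data" / "processed" / "clean" / "fs_norm.csv"
processed_path.parent.mkdir(parents=True, exist_ok=True)

df_fs.to_csv(processed_path, index=False)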
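
CSV stores plain text, so the categorical and datetime dtypes set in this notebook are not preserved in the saved file and need to be restored when it is loaded. The sketch below shows one way to do that with `pd.read_csv` on a hypothetical two-row frame written to a temporary directory; the file name `example.csv` is illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame with the dtypes used in this notebook
frame = pd.DataFrame({
    "eventname": pd.Categorical(["tutorial", "mainscreenappear"]),
    "datetime": pd.to_datetime(
        ["2019-08-01 00:07:28", "2019-08-01 00:08:00"], utc=True
    ),
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "clean" / "example.csv"  # hypothetical file name
    out_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(out_path, index=False)

    # Restore dtypes explicitly while reading the file back
    reloaded = pd.read_csv(
        out_path, parse_dates=["datetime"], dtype={"eventname": "category"}
    )

print(reloaded.dtypes)
```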