# Introduction
- This notebook covers:
    - Loading `toolwindow_data.csv`
    - Exploratory Data Analysis (EDA)
    - Cleaning

# Imports

In [None]:
import pandas as pd
import plotly.express as px

# Load Dataset

In [None]:
df = pd.read_csv(r'data/toolwindow_data.csv')

In [None]:
df

# Exploratory Data Analysis (EDA)

## 1. Basic Information
In this section, we check dataset structure, columns, and a quick sample of records.

In [None]:
df.shape

In [None]:
print("Columns:", ", ".join(df.columns))

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.tail()

## 2. Checks
We check for nulls and duplicate rows, then count the values of `event` and `open_type` to understand the overall balance.

In [None]:
print("Nulls per column:\n", df.isnull().sum())

In [None]:
print("Duplicate rows:", df.duplicated().sum())

In [None]:
df['event'].value_counts()

In [None]:
df['open_type'].value_counts(dropna=False)

## 3. Time Range and Event Distribution
We check the time coverage and visualize how events occur over time.

In [None]:
# The timestamps are in epoch milliseconds, so a new 'datetime' column is added with them converted to datetime format
df['datetime'] = pd.to_datetime(df['timestamp'], unit='ms')
df['datetime']
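
As a small illustration of the conversion above: `pd.to_datetime(..., unit='ms')` counts milliseconds from the Unix epoch and returns timezone-naive values that correspond to UTC. The sketch below uses hypothetical sample values (not taken from the dataset) and also shows the `utc=True` variant, in case timezone-aware timestamps are preferred.

```python
import pandas as pd

# Hypothetical epoch-millisecond values, for illustration only
sample_ms = pd.Series([0, 1_700_000_000_000])

# Timezone-naive result; the values correspond to UTC wall-clock time
naive = pd.to_datetime(sample_ms, unit='ms')

# Timezone-aware result, explicitly labelled as UTC
aware = pd.to_datetime(sample_ms, unit='ms', utc=True)

print(naive)
print(aware)
```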

In [None]:
print(f"Min time: {df['datetime'].min()}\nMax time: {df['datetime'].max()}"
      f"\nTime range: {df['datetime'].max() - df['datetime'].min()}")

In [None]:
fig = px.histogram(
    df,
    x='datetime',
    color='event',
    nbins=50,
    title='Event Frequency Over Time',
    opacity=0.75
)

fig.update_layout(
    xaxis_title='Datetime',
    yaxis_title='Count',
    bargap=0.1,
    legend_title='Event'
)

fig.show()
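
A tabular companion to the histogram can make the same distribution easier to inspect numerically. The sketch below is one possible approach, assuming a `datetime` column and an `event` column as in this notebook; the helper name `daily_event_counts` and the sample rows are hypothetical.

```python
import pandas as pd

def daily_event_counts(frame: pd.DataFrame) -> pd.DataFrame:
    """Count events per calendar day, with one column per event type."""
    return (
        frame
        .groupby([pd.Grouper(key='datetime', freq='D'), 'event'])
        .size()
        .unstack(fill_value=0)
    )

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2024-01-01 09:00', '2024-01-01 17:30', '2024-01-02 10:15',
    ]),
    'event': ['open', 'close', 'open'],
})

counts = daily_event_counts(sample)
print(counts)
```

On the real data this would be called as `daily_event_counts(df)` once the `datetime` column exists.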

# Cleaning
- Apart from adding the 'datetime' column, **no cleaning is required**.
- The 'open_type' column is populated only on 'open' events (with values 'auto' and 'manual'), and nulls appear solely on 'close' events, so the nulls do not affect session reconstruction or its analysis in later stages.
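
The claim that nulls in 'open_type' sit only on 'close' events can be checked directly. The following is a minimal sketch on hypothetical sample rows; the helper name `null_open_type_by_event` is an assumption, and on the real data the same calls would be applied to `df`.

```python
import pandas as pd

def null_open_type_by_event(frame: pd.DataFrame) -> pd.DataFrame:
    """Cross-tabulate event type against whether open_type is null."""
    is_null = frame['open_type'].isna().rename('open_type_is_null')
    return pd.crosstab(frame['event'], is_null)

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    'event': ['open', 'close', 'open', 'close'],
    'open_type': ['auto', None, 'manual', None],
})

table = null_open_type_by_event(sample)
print(table)

# True when a row has a null open_type exactly when it is a 'close' event
nulls_only_on_close = (
    sample['open_type'].isna() == (sample['event'] == 'close')
).all()
print(nulls_only_on_close)
```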