In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

### Retailrocket recommender system dataset

Source: https://www.kaggle.com/retailrocket/ecommerce-dataset

Load the dataset files.

In [None]:
events = pd.read_csv(os.path.join('data', 'events.csv'))
item_properties = pd.read_csv(os.path.join('data', 'item_properties_part1.csv'))
category_tree = pd.read_csv(os.path.join('data', 'category_tree.csv'))

In [None]:
events

In [None]:
item_properties

In [None]:
category_tree

## Exploratory Data Analysis (EDA)

EDA is about understanding the data and forming hypotheses about it. Typical activities include:

- Visualizing Data: Histograms, scatter plots, box plots, etc., to understand distributions and relationships.
- Summary Statistics: Calculating mean, median, mode, standard deviation, and correlation to gain insights into the dataset.
- Detecting Outliers: Identifying values that deviate significantly from the rest of the data.
- Assessing Data Types and Structure: Checking data types, unique values, and identifying missing values.

#### Q1: Convert timestamp into corresponding date.
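
One possible approach, sketched on a toy frame rather than the real file: the column names (`timestamp`, `visitorid`, `event`, `itemid`) and the assumption that timestamps are Unix epoch milliseconds are not stated in this notebook, so check them against the loaded `events` first.

```python
import pandas as pd

# Toy stand-in for `events`; column names and millisecond timestamps are assumptions.
events = pd.DataFrame({
    "timestamp": [1433221332117, 1433224214164, 1433307600000],
    "visitorid": [1, 2, 1],
    "event": ["view", "view", "addtocart"],
    "itemid": [10, 11, 10],
})

# unit="ms" tells pandas the integers count milliseconds since 1970-01-01 UTC.
events["datetime"] = pd.to_datetime(events["timestamp"], unit="ms")
# Truncate to midnight so each row carries its calendar date (dtype stays datetime64).
events["date"] = events["datetime"].dt.normalize()
print(events[["timestamp", "datetime", "date"]])
```

Keeping `date` as a `datetime64` column (rather than Python `date` objects) makes later grouping and resampling easier.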

In [None]:
# Convert timestamp to datetime (milliseconds to seconds)


#### Q2: Compute the total count of each event type (`view`, `addtocart`, `transaction`) per item per day.
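
A minimal sketch of one way to do this, assuming a `date` column has already been derived and that the columns are named `itemid` and `event` (both names are assumptions here). It counts rows per group and pivots the event types into columns.

```python
import pandas as pd

# Toy stand-in for `events`; column names are assumptions.
events = pd.DataFrame({
    "date": pd.to_datetime(["2015-06-02"] * 4 + ["2015-06-03"] * 2),
    "itemid": [10, 10, 10, 11, 10, 10],
    "event": ["view", "view", "addtocart", "view", "view", "transaction"],
})

daily_counts = (
    events.groupby(["date", "itemid", "event"])
    .size()                              # rows per (date, item, event type)
    .unstack("event", fill_value=0)      # one column per event type
    .reindex(columns=["view", "addtocart", "transaction"], fill_value=0)
    .reset_index()
)
print(daily_counts)
```

The `reindex` call guarantees all three columns exist even if one event type never occurs in the data.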

#### Q3: Compute top 10 items with the highest number of `view` events.
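
As a sketch (toy data, assumed column names `itemid` and `event`), filtering to `view` rows and calling `value_counts` gives the items already sorted by view count:

```python
import pandas as pd

# Toy stand-in for `events`; column names are assumptions.
events = pd.DataFrame({
    "itemid": [10, 10, 10, 11, 11, 12, 10],
    "event": ["view", "view", "view", "view", "view", "view", "addtocart"],
})

top_viewed = (
    events.loc[events["event"] == "view", "itemid"]
    .value_counts()   # counts per item, sorted in descending order
    .head(10)         # keep at most the ten most-viewed items
)
print(top_viewed)
```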

#### Q4: What is the distribution of event types (`view`, `addtocart`, `transaction`) in the events dataset?
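
One way to look at this, sketched on made-up data with an assumed `event` column, is to compute both absolute counts and relative shares:

```python
import pandas as pd

# Toy stand-in for `events`; the proportions are invented for illustration.
events = pd.DataFrame({
    "event": ["view"] * 8 + ["addtocart"] + ["transaction"],
})

counts = events["event"].value_counts()                 # absolute frequencies
shares = events["event"].value_counts(normalize=True)   # fractions summing to 1
print(counts)
print(shares)
```

On the real data, `counts.plot(kind="bar")` would give a quick visual of the same distribution.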

#### Q5: How many different items are in the dataset?

#### Q6: What is the average number of events per visitor?
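
A possible sketch, assuming the visitor identifier column is called `visitorid`: count rows per visitor, then average those counts.

```python
import pandas as pd

# Toy stand-in for `events`; `visitorid` is an assumed column name.
events = pd.DataFrame({
    "visitorid": [1, 1, 1, 2, 3, 3],
    "event": ["view", "view", "addtocart", "view", "view", "transaction"],
})

events_per_visitor = events.groupby("visitorid").size()
avg_events_per_visitor = events_per_visitor.mean()
print(avg_events_per_visitor)
```

This equals `len(events) / events["visitorid"].nunique()`; keeping the per-visitor series around also lets you inspect the median or the tail.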

#### Q7: How many unique transactions are in the dataset?

#### Q8: What is the distribution of transactions?

#### Q9: How many events happen on average per day?
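
A sketch under the same assumptions as before (a `timestamp` column in epoch milliseconds): derive the calendar date, count events per date, then take the mean.

```python
import pandas as pd

# Toy stand-in for `events`; millisecond timestamps are an assumption.
events = pd.DataFrame({
    "timestamp": [
        1433221332117, 1433224214164,                 # two events on day 1
        1433307600000, 1433311200000, 1433314800000,  # three events on day 2
        1433394000000,                                # one event on day 3
    ],
})

events["date"] = pd.to_datetime(events["timestamp"], unit="ms").dt.normalize()
events_per_day = events.groupby("date").size()
avg_events_per_day = events_per_day.mean()
print(events_per_day)
print(avg_events_per_day)
```

Note that `groupby` only produces rows for days with at least one event, so days without any activity do not pull the average down.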

#### Q10: Left join `events` with `item_properties`.
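
A minimal sketch of the join, assuming both tables share an `itemid` key and each has its own `timestamp` column (these column names are assumptions, as are the toy values):

```python
import pandas as pd

# Toy stand-ins; column names and values are assumptions for illustration.
events = pd.DataFrame({
    "timestamp": [1433221332117, 1433224214164, 1433307600000],
    "visitorid": [1, 2, 1],
    "event": ["view", "view", "addtocart"],
    "itemid": [10, 11, 12],
})
item_properties = pd.DataFrame({
    "timestamp": [1433041200000, 1433041200000, 1433041200000],
    "itemid": [10, 10, 11],
    "property": ["categoryid", "available", "categoryid"],
    "value": ["1016", "1", "809"],
})

merged = events.merge(
    item_properties,
    on="itemid",
    how="left",                          # keep every event, matched or not
    suffixes=("_event", "_property"),    # disambiguate the two timestamp columns
)
print(merged)
```

Because an item can have several property rows, a left join can return more rows than `events` has; events whose item has no properties keep `NaN` in the property columns.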

#### Q11: Convert `parentid` column to `int32` type.

In [None]:
category_tree

In [None]:
category_tree_v2 = 
category_tree_v3 = 
category_tree_v4 = 

## Data imputation

Data imputation is the process of replacing missing values in a dataset with estimated values.

### Handling missing values with `category_tree`

#### Option 1: Fill `NaN` with a placeholder value (e.g., -1 or another integer)

#### Option 2: Drop rows with `NaN` in the `parentid` column

#### Option 3: Use `Int32` (nullable integer type)
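
The three options can be sketched side by side on a toy frame. The `categoryid` column name and the values are assumptions; only `parentid` is named in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `category_tree`; a root category has no parent (NaN).
category_tree = pd.DataFrame({
    "categoryid": [1016, 809, 231],
    "parentid": [213.0, 169.0, np.nan],
})

# Option 1: replace NaN with a sentinel, then cast to a plain integer dtype.
filled = category_tree.assign(
    parentid=category_tree["parentid"].fillna(-1).astype("int32")
)

# Option 2: drop rows without a parent, then cast.
dropped = category_tree.dropna(subset=["parentid"]).astype({"parentid": "int32"})

# Option 3: pandas' nullable integer dtype keeps the missing value as <NA>.
nullable = category_tree.astype({"parentid": "Int32"})

print(filled, dropped, nullable, sep="\n\n")
```

Option 1 keeps every row but requires remembering that -1 means "no parent"; Option 2 loses the root categories; Option 3 keeps both the rows and the missingness, at the cost of a pandas-specific dtype.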

### Interpolation

Interpolation is a technique that can be useful for handling missing values, particularly when the missing data is assumed to follow a pattern or trend based on the existing values in the dataset. This is often the case with time series or ordered data, where the missing values are assumed to lie between known values. Interpolation fills in these gaps by estimating the missing data points using existing values.

When **NOT** to Use Interpolation:
- Large gaps: If the data has large gaps between observations, interpolation might not provide meaningful or reliable estimates.
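- No structure between neighbouring values: If consecutive observations are unrelated (for example, pure noise or rows with no meaningful order), interpolation may not be appropriate, as it assumes a relationship between neighbouring values. Note that this concerns the values themselves: data that is Missing Completely at Random (MCAR) can still be interpolated when the underlying series is smooth.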
- Categorical or non-numeric data: Interpolation is typically used for continuous numerical data. For categorical or binary data, interpolation is not suitable.

**TopHat question**

In [None]:
np.random.seed(639)

date_range = pd.date_range(start='2024-01-01', periods=60, freq='D')
sales_data = np.random.normal(loc=200, scale=20, size=len(date_range))
sales_data[::5] = np.nan  # missing value every 5th day

df = pd.DataFrame({
    'date': date_range,
    'sales': sales_data
})
print(df.head())

plt.figure(figsize=(10, 6))
plt.plot(df['date'], df['sales'], marker='o', linestyle='-', color='black')
plt.xticks(rotation=45)
plt.ylabel('Sales')
plt.grid(True)
plt.show()

### Linear interpolation

- Assumption: the missing data points lie along a straight line between the known data points.
- Linear interpolation is commonly used for time series where changes are expected to be linear between data points.
- When to use: when the relationship between consecutive values is roughly linear or changes gradually.
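
A small worked example, using a made-up series rather than the notebook's data, shows what the straight-line assumption means in practice:

```python
import numpy as np
import pandas as pd

# Made-up series with one single-point gap and one two-point gap.
sales = pd.Series([100.0, np.nan, 120.0, np.nan, np.nan, 150.0])

# method="linear" treats the rows as equally spaced and fills each gap
# with evenly stepped values between the neighbouring known points.
linear = sales.interpolate(method="linear")
print(linear.tolist())  # [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]
```

By default, `interpolate` fills forward only, so a missing value at the very start of a series stays `NaN`.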

In [None]:


plt.figure(figsize=(10, 6))
plt.plot(df['date'], df['sales'], marker='o', linestyle='-', label='Interpolated Sales', color='orange')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)
plt.show()

### Polynomial interpolation

- Polynomial interpolation fits a polynomial curve through the known data points and uses it to estimate the missing values.
- When to use: when the data shows a nonlinear relationship between points (for example, seasonal effects or periodic patterns).
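
To illustrate the difference on data with real curvature, the sketch below removes two points from a toy series following y = x² and fills them both ways. It assumes SciPy is installed, since pandas delegates `method="polynomial"` to `scipy.interpolate.interp1d`.

```python
import numpy as np
import pandas as pd

# Toy series following y = x**2, with two interior points removed.
x = np.arange(6)
curve = pd.Series(x.astype(float) ** 2)   # 0, 1, 4, 9, 16, 25
curve.iloc[[2, 4]] = np.nan

linear = curve.interpolate(method="linear")
quadratic = curve.interpolate(method="polynomial", order=2)  # needs SciPy

print(linear.tolist())     # straight-line estimates overshoot the curve
print(quadratic.round(6).tolist())
```

Linear interpolation fills the gaps with 5 and 17 (the midpoints of the neighbours), while the order-2 fit recovers values close to the true 4 and 16 because the underlying data really is quadratic.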

In [None]:

plt.figure(figsize=(10, 6))
plt.plot(df['date'], df['sales'], marker='o', linestyle='-', label='Interpolated Sales', color='orange')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)
plt.show()

#### Why did both methods produce the same result?

- Polynomial interpolation of order 2 uses quadratic pieces (parabolas). A parabola has three coefficients, so two points alone do not determine it: infinitely many parabolas pass through the same two points, and the fit needs at least three known values.
- pandas passes `method='polynomial'` to SciPy's `interp1d`, which builds a piecewise quadratic (a spline of the given order) through all known points, not only the two neighbours of a gap. A quadratic can curve, but when each gap is a single point and the surrounding values show little consistent curvature, the fitted curve can stay close to the straight line between the neighbours, so the two methods give similar estimates.
- Also check the order of operations: if the linear step already wrote its result back into `df['sales']`, there are no missing values left for the polynomial step to fill, and the second plot will be identical to the first.
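
The effect of curvature can be checked numerically. This sketch (toy series, SciPy assumed to be installed) measures the largest gap between the linear and order-2 estimates on a straight trend and on a sine wave:

```python
import numpy as np
import pandas as pd

x = np.arange(20, dtype=float)
gaps = [4, 9, 14]  # isolated interior gaps, chosen for illustration

def max_method_gap(values):
    """Largest absolute difference between linear and order-2 estimates."""
    s = pd.Series(values.copy())
    s.iloc[gaps] = np.nan
    lin = s.interpolate(method="linear")
    quad = s.interpolate(method="polynomial", order=2)  # needs SciPy
    return float((lin - quad).abs().max())

diff_straight = max_method_gap(3 * x + 5)           # no curvature at all
diff_curved = max_method_gap(50 * np.sin(x / 3))    # pronounced curvature

print(diff_straight, diff_curved)
```

On the straight trend the two methods agree up to floating-point error; on the sine wave they diverge, which is why a difference only becomes visible once the data bends noticeably across a gap.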

In [None]:
# Non-linear timeseries - sine curve
np.random.seed(42)
date_range = pd.date_range(start='2024-01-01', periods=60, freq='D')
sales_data = 100 + 50 * np.sin(np.linspace(0, 3 * np.pi, len(date_range)))

# Introduce missing values randomly
missing_indices = np.random.choice(range(len(sales_data)), size=18, replace=False)
sales_data[missing_indices] = np.nan

df = pd.DataFrame({
    'date': date_range,
    'sales': sales_data
})

plt.figure(figsize=(10, 6))
plt.plot(df['date'], df['sales'], marker='o', linestyle='-', label='Original Sales with Missing Values')
plt.xticks(rotation=45)
plt.ylabel('Sales')
plt.grid(True)

In [None]:
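# Interpolate using linear method
df['sales_linear'] = df['sales'].interpolate(method='linear')

# Interpolate using polynomial method (order 2); requires SciPy
df['sales_polynomial'] = df['sales'].interpolate(method='polynomial', order=2)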

plt.figure(figsize=(10, 6))
plt.plot(df['date'], df['sales_linear'], marker='o', linestyle='-', label='Interpolated Sales (Linear)', color='orange')
plt.plot(df['date'], df['sales_polynomial'], marker='x', linestyle='-', label='Interpolated Sales (Polynomial)', color='green')
plt.legend()
plt.title('Linear vs Polynomial Interpolation on Nonlinear Data')
plt.grid(True)
plt.show()

### Spline interpolation

- Spline interpolation fits a smooth curve (a piecewise polynomial, typically cubic) through the known data points and estimates the missing values.
- When to use: when the data exhibits a smooth, nonlinear trend (often used for time series with cycles or seasonal patterns).
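
A hedged sketch on a synthetic sine wave (gap positions chosen by hand, SciPy assumed to be installed). pandas hands `method="spline"` to SciPy's `UnivariateSpline`, which applies some smoothing by default, so the filled values approximate the curve rather than matching it exactly.

```python
import numpy as np
import pandas as pd

# Synthetic smooth signal; amplitude and gap positions are illustrative choices.
true_values = 100 * np.sin(np.linspace(0, 3 * np.pi, 60))
series = pd.Series(true_values.copy())
missing = [5, 12, 20, 33, 47]
series.iloc[missing] = np.nan

# order=3 requests a cubic spline; only the NaN positions are replaced.
spline_filled = series.interpolate(method="spline", order=3)  # needs SciPy

max_error = float(np.abs(spline_filled.iloc[missing] - true_values[missing]).max())
print(max_error)
```

Comparing the filled values with the known true signal, as done here, is only possible on synthetic data, but it is a useful sanity check before applying the same call to real gaps.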

### Spline vs Polynomial

- When to use spline interpolation: when you need smooth, piecewise fits, especially when the data is non-linear or has noise. It's ideal for smooth, continuous data that needs to be modeled accurately across a range of values.
- When to use polynomial interpolation: when you have a simple, small dataset, and you want a single polynomial that fits all points exactly. Avoid polynomial interpolation with large or noisy datasets because it can cause overfitting and oscillations.

In [None]:
np.random.seed(639)
date_range = pd.date_range(start='2024-01-01', periods=60, freq='D')
sales_data = 100 * np.sin(np.linspace(0, 3 * np.pi, len(date_range))) 

# Introduce missing values randomly
missing_indices = np.random.choice(range(5, len(sales_data), 5), size=10, replace=False)
sales_data[missing_indices] = np.nan

df = pd.DataFrame({
    'date': date_range,
    'sales': sales_data
})

plt.figure(figsize=(10, 6))
plt.plot(df['date'], df['sales'], marker='o', linestyle='-', label='Original Sales with Missing Values', color='blue')
plt.xticks(rotation=45)
plt.ylabel('Sales')
plt.grid(True)

# Interpolate using cubic spline method (order 3); requires SciPy
df['sales_spline'] = df['sales'].interpolate(method='spline', order=3)

plt.plot(df['date'], df['sales_spline'], marker='x', linestyle='-', label='Interpolated Sales (Spline)', color='green')
plt.legend()
plt.title('Spline Interpolation on Sinusoidal Data')
plt.show()