# Product Data Analytics

This notebook performs an exploratory data analysis (EDA) on the `intern_data_ikarus.csv` dataset. The goal is to understand the distribution of the data, identify key characteristics, and extract insights that could be useful for the recommendation model and for the analytics dashboard.

### 1. Load Libraries and Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ast # For safely evaluating string-formatted lists

# Set visualization style
sns.set_style("whitegrid")

# Load the dataset
file_path = '../backend/data/intern_data_ikarus.csv'
df = pd.read_csv(file_path)

### 2. Initial Data Exploration

In [None]:
print("Dataset Shape:")
print(df.shape)
print("\nFirst 5 Rows:")
print(df.head())
print("\nData Types and Missing Values:")
df.info()  # info() prints its report directly and returns None

### 3. Price Analysis

The `price` column has the `object` dtype and its values contain a '$' sign. Before analysis, we clean it by removing the dollar sign (and any comma separators) and converting the result to a numeric type.

In [None]:
# Clean the price column
df['price_numeric'] = df['price'].replace(r'[\$,]', '', regex=True).astype(float)

# Plot the distribution of prices
plt.figure(figsize=(12, 6))
sns.histplot(df['price_numeric'], bins=50, kde=True)
plt.title('Distribution of Product Prices', fontsize=16)
plt.xlabel('Price ($)', fontsize=12)
plt.ylabel('Number of Products', fontsize=12)
plt.show()

# Display summary statistics for price
print("Price Statistics:")
print(df['price_numeric'].describe())
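
The `astype(float)` conversion above raises an error if any price string is not a valid number after stripping. As a minimal sketch of a more defensive variant, the helper below (the name `clean_price` and the sample values are hypothetical, not taken from the dataset) uses `pd.to_numeric` with `errors="coerce"` so that unparseable entries become `NaN` instead of stopping the notebook.

```python
import pandas as pd


def clean_price(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from price strings and convert to float.

    Values that still cannot be parsed (including missing ones) become NaN.
    """
    stripped = series.astype(str).str.replace(r"[\$,]", "", regex=True).str.strip()
    return pd.to_numeric(stripped, errors="coerce")


# Hypothetical sample values, for illustration only
sample = pd.Series(["$24.99", "$1,299.00", None, "N/A"])
cleaned = clean_price(sample)
print(cleaned)
print("Unparseable or missing prices:", cleaned.isna().sum())
```

Counting the resulting `NaN` values is a quick way to see how many rows would be excluded from the histogram and summary statistics.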

### 4. Category Analysis

The `categories` column is a string representation of a list. We will parse this column and then count the occurrences of each category to find the most common ones.

In [None]:
# Safely parse the 'categories' column from string to list
df['categories_list'] = df['categories'].apply(ast.literal_eval)

# Explode the dataframe to have one category per row and count them
all_categories = df.explode('categories_list')['categories_list']
category_counts = all_categories.value_counts()

# Get the top 10 most common categories
top_10_categories = category_counts.head(10)

# Plot the top 10 categories
plt.figure(figsize=(12, 8))
sns.barplot(x=top_10_categories.values, y=top_10_categories.index, palette='viridis')
plt.title('Top 10 Most Common Product Categories', fontsize=16)
plt.xlabel('Number of Products', fontsize=12)
plt.ylabel('Category', fontsize=12)
plt.show()

print("Top 10 Categories:")
print(top_10_categories)
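
Calling `ast.literal_eval` directly fails on missing values or on strings that are not valid Python literals. If the `categories` column turns out to contain such rows, a small wrapper can fall back to an empty list. The sketch below assumes that behavior is acceptable; the helper name `parse_list` and the sample rows are hypothetical.

```python
import ast

import pandas as pd


def parse_list(value):
    """Parse a string-formatted list; return [] for missing or malformed values."""
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []


# Hypothetical sample rows, for illustration only
sample = pd.DataFrame(
    {"categories": ["['Home', 'Furniture']", None, "not a list", "['Home']"]}
)
sample["categories_list"] = sample["categories"].apply(parse_list)

# Empty lists explode to NaN, which value_counts() ignores by default
counts = sample.explode("categories_list")["categories_list"].value_counts()
print(counts)
```

Rows that fall back to an empty list simply contribute nothing to the category counts, so the top-10 ranking is computed only from rows that parsed cleanly.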

### 5. Material Analysis

In [None]:
# Analyze the 'material' column
# Report missing values, then count the most common materials
print("Missing material values:", df['material'].isna().sum())
material_counts = df['material'].value_counts()  # NaN is excluded by default

top_10_materials = material_counts.head(10)

# Plot the top 10 materials
plt.figure(figsize=(12, 8))
sns.barplot(x=top_10_materials.values, y=top_10_materials.index, palette='mako')
plt.title('Top 10 Most Common Product Materials', fontsize=16)
plt.xlabel('Number of Products', fontsize=12)
plt.ylabel('Material', fontsize=12)
plt.show()

print("Top 10 Materials:")
print(top_10_materials)
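
Because `value_counts()` leaves out missing values by default, the ranking alone does not show how much of the column is empty. The following sketch, run on a hypothetical sample frame rather than the real dataset, shows one way to quantify that gap and to include missing values in the counts with `dropna=False`.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({"material": ["Wood", "Metal", None, "Wood", None]})

missing = sample["material"].isna().sum()   # number of missing entries
share = sample["material"].isna().mean()    # fraction of rows that are missing
counts = sample["material"].value_counts(dropna=False)  # keeps NaN as its own row

print(f"Missing materials: {missing} ({share:.0%} of rows)")
print(counts)
```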