# The Data Analysis Process

This notebook demonstrates a five-step data analysis pipeline (Collect → Clean → Explore → Analyze → Interpret) on a small simulated e-commerce dataset of 12 orders. Run the cells in order to see each step in action.

## 1) Collect — create a small example dataset

In [None]:
import pandas as pd
import numpy as np

# Simulate a small e-commerce dataset
df = pd.DataFrame({
    'order_id': range(1,13),
    'customer_id': [101,102,103,101,104,105,106,107,108,106,109,110],
    'order_value': [120.0, 35.5, 250.0, np.nan, 80.0, 15.0, 200.0, 40.0, 60.0, 195.0, 5.0, 99.0],
    'discount_applied': [False, True, False, False, True, False, True, False, False, True, False, False],
    'order_date': pd.to_datetime(['2021-01-03','2021-01-07','2021-01-12','2021-01-15','2021-01-20','2021-01-24','2021-01-30','2021-02-02','2021-02-07','2021-02-14','2021-02-20','2021-02-25'])
})

df.head()

## 2) Clean — handle missing values and inspect duplicates/outliers

In [None]:
# Check dtypes and non-null counts (order_value has one missing entry)
df.info()

# Simple cleaning: fill missing order_value with median
df['order_value'] = df['order_value'].fillna(df['order_value'].median())

# Check duplicates by order_id
df[df.duplicated(subset=['order_id'])]  # should be empty in this simulated data
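
The heading mentions outliers, but the cell above only covers missing values and duplicates. One common way to flag candidate outliers is Tukey's 1.5 × IQR rule. The sketch below assumes that rule is acceptable here; `iqr_outlier_mask` is a hypothetical helper (not a pandas function), and the `order_value` column is rebuilt so the cell runs on its own.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Same order values as the simulated dataset, with the NaN filled by the median
order_value = pd.Series(
    [120.0, 35.5, 250.0, np.nan, 80.0, 15.0, 200.0, 40.0, 60.0, 195.0, 5.0, 99.0],
    name="order_value",
)
order_value = order_value.fillna(order_value.median())

outlier_mask = iqr_outlier_mask(order_value)
print(order_value[outlier_mask])  # rows flagged as candidate outliers
```

With these 12 values no order falls outside the fences, so the printed Series is empty. A flagged value is only a candidate for review, not automatically an error.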

## 3) Explore — summary statistics and simple visual checks

In [None]:
# Summary stats
df.describe(include='all')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,4))
sns.histplot(df['order_value'], bins=8, kde=False)
plt.title('Distribution of Order Value')
plt.xlabel('Order value')
plt.show()
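
Besides the histogram, a couple of grouped counts can help during exploration. The sketch below looks at orders per calendar month and at repeat customers. It rebuilds the relevant columns so it runs standalone, and it assumes the missing `order_value` has already been filled with the median (80.0); the variable names `orders`, `monthly`, and `repeat_customers` are illustrative.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [101, 102, 103, 101, 104, 105, 106, 107, 108, 106, 109, 110],
    # Fourth value is the median-filled replacement for the original NaN
    "order_value": [120.0, 35.5, 250.0, 80.0, 80.0, 15.0, 200.0, 40.0, 60.0, 195.0, 5.0, 99.0],
    "order_date": pd.to_datetime([
        "2021-01-03", "2021-01-07", "2021-01-12", "2021-01-15", "2021-01-20", "2021-01-24",
        "2021-01-30", "2021-02-02", "2021-02-07", "2021-02-14", "2021-02-20", "2021-02-25",
    ]),
})

# Number of orders and total order value per calendar month
monthly = (
    orders.groupby(orders["order_date"].dt.to_period("M"))["order_value"]
    .agg(["count", "sum"])
)
print(monthly)

# Customers who placed more than one order
orders_per_customer = orders["customer_id"].value_counts()
repeat_customers = orders_per_customer[orders_per_customer > 1].index.sort_values().tolist()
print(repeat_customers)
```

In this sample January has 7 orders and February has 5, and customers 101 and 106 each ordered twice.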

## 4) Analyze — look for relationships (e.g., discount vs order value)

In [None]:
# Compare average order value when discount was applied vs not
# (printed explicitly: a notebook cell only displays its last expression)
print(df.groupby('discount_applied')['order_value'].mean().reset_index())

# Simple correlation (discount_applied as int)
df['discount_int'] = df['discount_applied'].astype(int)
df[['order_value','discount_int']].corr()
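
A group mean says little without the number of rows behind it. As a small sketch, the comparison can report count, mean, and median side by side. The data is repeated here so the cell is self-contained, with the missing value already replaced by the median (80.0); `orders` and `summary` are illustrative names.

```python
import pandas as pd

orders = pd.DataFrame({
    # Fourth value is the median-filled replacement for the original NaN
    "order_value": [120.0, 35.5, 250.0, 80.0, 80.0, 15.0, 200.0, 40.0, 60.0, 195.0, 5.0, 99.0],
    "discount_applied": [False, True, False, False, True, False, True, False, False, True, False, False],
})

# Group size, mean, and median of order_value per discount flag
summary = orders.groupby("discount_applied")["order_value"].agg(["count", "mean", "median"])
print(summary)
```

Here the discounted group has 4 orders (mean 127.625) and the non-discounted group has 8 (mean 83.625), which is too few to read much into the gap.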

## 5) Interpret & Communicate — write a brief finding

**Example interpretation:** In this small sample, the average order value is higher when a discount was applied (about 127.6 vs 83.6, with the one missing value filled by the median). With only 4 discounted and 8 non-discounted orders, the gap could easily be noise. It might mean that discounts are mostly attached to larger baskets, or that discounts encourage customers to spend more per order; further analysis with more data is required.

**Next steps:** run statistical tests on a larger dataset, segment by customer cohorts, and measure conversion/retention impact.
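
As one possible form of the statistical test mentioned above, here is a minimal sketch of a two-sided permutation test for a difference in group means, using only NumPy. The function name `permutation_test_mean_diff`, the number of permutations, and the seed are all assumptions made for illustration; other tests (for example Welch's t-test) may suit a larger dataset better.

```python
import numpy as np


def permutation_test_mean_diff(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns (observed_diff, p_value), where observed_diff = mean(a) - mean(b).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)

    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])

    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled values to two groups of the original sizes
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        # Small tolerance guards against floating-point ties
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Order values from the sample, split by discount flag
discounted = [35.5, 80.0, 200.0, 195.0]
not_discounted = [120.0, 250.0, 80.0, 15.0, 40.0, 60.0, 5.0, 99.0]  # 80.0 is the median-filled value

observed, p_value = permutation_test_mean_diff(discounted, not_discounted)
print(f"observed difference: {observed:.3f}, p-value: {p_value:.3f}")
```

The p-value estimates how often a random relabeling of the orders produces a gap at least as large as the observed one. With 12 orders the test has very little power, so the sketch is mainly useful as a template for a larger dataset.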

In [None]:
# Save the sample dataset to CSV and read it back (example)
csv_path = 'the_data_analysis_process_sample.csv'
df.to_csv(csv_path, index=False)
print('Saved CSV ->', csv_path)
# Read back and parse dates
df_loaded = pd.read_csv(csv_path, parse_dates=['order_date'])
df_loaded.head()
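
To see why `parse_dates` matters when reading the file back, the sketch below round-trips a three-row frame and compares the result with and without that argument. It writes to an in-memory `io.StringIO` buffer instead of a file on disk, which is an assumption made to keep the example self-contained; the names `original`, `naive`, and `parsed` are illustrative.

```python
import io

import numpy as np
import pandas as pd

original = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_value": [120.0, 35.5, 250.0],
    "discount_applied": [False, True, False],
    "order_date": pd.to_datetime(["2021-01-03", "2021-01-07", "2021-01-12"]),
})

# Write to an in-memory buffer so the sketch leaves nothing on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)

buffer.seek(0)
naive = pd.read_csv(buffer)  # no parse_dates: order_date comes back as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["order_date"])

print(pd.api.types.is_datetime64_any_dtype(naive["order_date"]))   # False
print(pd.api.types.is_datetime64_any_dtype(parsed["order_date"]))  # True

# Check that column names and values survive the round trip
same_columns = list(parsed.columns) == list(original.columns)
same_values = bool(np.allclose(parsed["order_value"], original["order_value"]))
same_dates = bool(
    (parsed["order_date"].dt.strftime("%Y-%m-%d")
     == original["order_date"].dt.strftime("%Y-%m-%d")).all()
)
print(same_columns, same_values, same_dates)
```

CSV stores no type information, so date columns have to be named explicitly on read (or converted afterwards with `pd.to_datetime`).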