# Interactive Exploration of ASOS Product Returns

**Objective:** This notebook provides an interactive exploratory data analysis (EDA) of the prepared ASOS dataset. We will investigate the key factors influencing product returns through interactive visualizations.

## 1. Setup and Data Loading

First, we import the necessary libraries for data manipulation (`pandas`) and interactive plotting (`plotly`). We then load the cleaned dataset prepared in the previous stage.

In [None]:
# 1. Setup and Data Loading
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Load the prepared data
prepared_data_path = 'C:/Users/ADMIN/ET6-CDSP-group-17-repo/2_data_preparation/ASOS_GraphReturns/prepared_asos_data.csv'
try:
    df = pd.read_csv(prepared_data_path)
    print(f"Successfully loaded prepared data from {prepared_data_path}")
except FileNotFoundError:
    print(f"Error: {prepared_data_path} not found. Please ensure the data preparation step was completed.")
    raise  # Stop here: every later cell depends on df
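
The loading step above can also be wrapped in a small function, so that a missing file fails immediately with a clear message. The following is a minimal sketch under stated assumptions: the helper name `load_prepared_data` and the temporary demo file are hypothetical and are not part of the project.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_prepared_data(csv_path):
    """Load the prepared CSV, failing early with a clear message if it is missing."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found. Please ensure the data preparation step was completed."
        )
    return pd.read_csv(path)


# Demo with a throwaway file (hypothetical data, not the ASOS dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "prepared_demo.csv"
    pd.DataFrame({"price": [10.0, 20.0], "isReturned": [0, 1]}).to_csv(demo_path, index=False)
    demo_df = load_prepared_data(demo_path)

    try:
        load_prepared_data(Path(tmp_dir) / "missing.csv")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(demo_df.shape, missing_raised)
```

Building the path with `pathlib.Path` also makes it easier to switch to a path relative to the repository root, so the notebook does not depend on one machine's directory layout.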

## 2. Initial Data Inspection

Here we take a first look at the dataset's structure, including its shape, columns, data types, and missing values.

In [None]:
print("--- Initial Data Overview ---")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

print("\n--- Data Info ---")
df.info()

print("\n--- Missing Values ---")
missing_counts = df.isnull().sum()
print(missing_counts[missing_counts > 0])

print("\n--- First 5 Rows ---")
df.head()
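
The missing-value check can be extended to report the share of missing rows and the data type alongside the raw count. This is a minimal sketch: the helper name `summarize_missing` and the toy frame are hypothetical, and the values are made up for illustration.

```python
import pandas as pd


def summarize_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column that has missing values, with count, share and dtype."""
    missing_count = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": missing_count,
        "missing_pct": missing_count / len(frame) * 100,
        "dtype": frame.dtypes.astype(str),
    })
    # Keep only columns with at least one missing value, largest gaps first.
    return summary[summary["missing_count"] > 0].sort_values("missing_count", ascending=False)


# Toy frame standing in for the prepared data (values are made up).
toy = pd.DataFrame({
    "price": [10.0, None, 30.0, None],
    "isReturned": [0, 1, 0, 1],
    "brand": ["a", "b", None, "d"],
})
missing_summary = summarize_missing(toy)
print(missing_summary)
```

In the notebook, `summarize_missing(df)` would give the same information as the printed counts, in a form that is easier to sort and scan.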

## 3. Overall Product Return Distribution

This section gives a high-level view of returns in the ASOS dataset. A pie chart of returned versus non-returned items shows the baseline return rate at a glance. A strong imbalance between the two classes would call for a closer look at the factors driving returns or non-returns.

In [None]:
# 3. Overall Return Share
return_counts = df['isReturned'].value_counts()
return_percentage = df['isReturned'].value_counts(normalize=True) * 100

print(f"Value Counts for isReturned:\n{return_counts}")
print(f"\nPercentage of Returns:\n{return_percentage}")

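# Label each class by its value so names line up with the counts,
# whichever class value_counts() lists first.
label_map = {0: 'Not Returned (0)', 1: 'Returned (1)'}
fig = px.pie(names=[label_map[value] for value in return_counts.index],
             values=return_counts.values,
             title='Overall Product Return Distribution',
             hole=0.3)
fig.update_traces(textinfo='percent+label', pull=[0, 0.1])
fig.show()
# Static image export via write_image requires the kaleido package.
fig.write_image("C:/Users/ADMIN/ET6-CDSP-group-17-repo/3_data_exploration/ASOS_GraphReturns/images/return_distribution.png")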

## 4. Return Rates by Categorical Features

Now, let's dive deeper and analyze how return rates vary across different categorical features like gender, country, and product type.

### 4.1 Return Rate by Gender

This section explores the influence of gender on product return rates. By analyzing the return rates for male and female customers, we can identify if there are significant differences in return behavior between these groups. This insight can be valuable for targeted marketing strategies or product development.

In [None]:
# 4.1 Return Rate by Gender
if 'isMale' in df.columns:
    gender_return_rate = df.groupby('isMale')['isReturned'].mean().reset_index()
    gender_return_rate['isMale'] = gender_return_rate['isMale'].map({0: 'Female', 1: 'Male'})
    
    # Pass the numeric rate as text; texttemplate below formats it as a percentage.
    fig = px.bar(gender_return_rate, x='isMale', y='isReturned',
                 title='Return Rate by Gender',
                 labels={'isMale': 'Gender', 'isReturned': 'Average Return Rate'},
                 text='isReturned')
    fig.update_traces(texttemplate='%{text:.2%}', textposition='outside')
    fig.update_layout(xaxis_tickangle=-45)
    fig.show()
    fig.write_image("C:/Users/ADMIN/ET6-CDSP-group-17-repo/3_data_exploration/ASOS_GraphReturns/images/gender_return_rate.png")
else:
    print("'isMale' column not found for gender analysis.")
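
Whether a gap between the two bars is statistically meaningful can be checked with a two-proportion z-test. This is a sketch under stated assumptions, not part of the original analysis: the function name `two_proportion_z_test` is hypothetical, the counts are made up, and the test treats rows as independent observations.

```python
import math


def two_proportion_z_test(returns_a, n_a, returns_b, n_b):
    """Two-sided z-test for the difference between two return rates."""
    p_a = returns_a / n_a
    p_b = returns_b / n_b
    # Pooled rate under the null hypothesis that both groups share one return rate.
    pooled = (returns_a + returns_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution via the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts for illustration only, not results from the ASOS data.
z_stat, p_value = two_proportion_z_test(returns_a=300, n_a=1000, returns_b=250, n_b=1000)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
```

In the notebook, the four inputs could be taken from `df.groupby('isMale')['isReturned'].agg(['sum', 'count'])`. With a dataset of this kind even small differences can come out as significant, so the size of the gap matters as much as the p-value.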

### 4.2 Return Rate by Shipping Country (Top 10)

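Geographical location can play a significant role in return behavior. This section computes the return rate for each shipping country encoded as a `Country_*` indicator column (excluding `Country_Other`) and plots the ten countries with the highest rates. The ranking is by return rate, not order volume, so a country with few orders can appear near the top. Understanding these variations can help in optimizing logistics, understanding regional preferences, or identifying issues related to shipping or product fit in specific countries.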

In [None]:
# 4.2 Return Rate by Shipping Country (Top 10)
country_cols = [col for col in df.columns if col.startswith('Country_') and col != 'Country_Other']
if country_cols:
    country_return_rates = {col.replace('Country_', ''): df[df[col] == 1]['isReturned'].mean() for col in country_cols}
    country_return_rates_series = pd.Series(country_return_rates).sort_values(ascending=False).head(10)
    
    fig = px.bar(x=country_return_rates_series.index, y=country_return_rates_series.values,
                 title='Return Rate by Top 10 Shipping Countries',
                 labels={'x': 'Country', 'y': 'Average Return Rate'},
                 text=country_return_rates_series.values)
    fig.update_traces(texttemplate='%{text:.2%}', textposition='outside')
    fig.update_layout(xaxis_tickangle=-45)
    fig.show()
    fig.write_image("C:/Users/ADMIN/ET6-CDSP-group-17-repo/3_data_exploration/ASOS_GraphReturns/images/country_return_rate.png")
else:
    print("No 'Country_X' columns found for country analysis.")
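
Because the chart ranks countries by return rate alone, it helps to see how many rows sit behind each rate. The following is a minimal sketch, assuming the `Country_*` columns are 0/1 indicators as the cell above implies. The helper name `one_hot_return_summary` and the toy data are hypothetical.

```python
import pandas as pd


def one_hot_return_summary(frame, prefix, target="isReturned", exclude=()):
    """Return rate and row count for each indicator column starting with `prefix`."""
    rows = []
    for col in frame.columns:
        if not col.startswith(prefix) or col in exclude:
            continue
        mask = frame[col] == 1
        rows.append({
            "category": col[len(prefix):],
            "n_rows": int(mask.sum()),
            "return_rate": frame.loc[mask, target].mean(),
        })
    # Explicit columns keep the result well-formed even when nothing matches.
    summary = pd.DataFrame(rows, columns=["category", "n_rows", "return_rate"])
    return summary.set_index("category")


# Toy data standing in for the prepared dataset (values are made up).
toy = pd.DataFrame({
    "Country_UK": [1, 1, 1, 0, 0, 0],
    "Country_FR": [0, 0, 0, 1, 1, 0],
    "Country_Other": [0, 0, 0, 0, 0, 1],
    "isReturned": [1, 0, 0, 1, 1, 0],
})
summary = one_hot_return_summary(toy, "Country_", exclude=("Country_Other",))
top_by_volume = summary.sort_values("n_rows", ascending=False).head(10)
print(top_by_volume)
```

Sorting by `n_rows` gives the ten highest-volume countries, while sorting by `return_rate` reproduces the ranking used in the chart. The same helper would work for the `productType_*` columns by changing the prefix.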

### 4.3 Return Rate by Product Type (Top 10)

Different product types may have varying return rates due to factors like fit, material, or customer expectations. This section computes the return rate for each product type encoded as a `productType_*` indicator column (excluding `productType_Other`) and plots the ten types with the highest rates, showing which categories are more prone to returns. This information is useful for inventory management, product design, and understanding customer satisfaction across product lines.

In [None]:
# 4.3 Return Rate by Product Type (Top 10)
product_type_cols = [col for col in df.columns if col.startswith('productType_') and col != 'productType_Other']
if product_type_cols:
    product_type_return_rates = {col.replace('productType_', ''): df[df[col] == 1]['isReturned'].mean() for col in product_type_cols}
    product_type_return_rates_series = pd.Series(product_type_return_rates).sort_values(ascending=False).head(10)
    
    fig = px.bar(x=product_type_return_rates_series.index, y=product_type_return_rates_series.values,
                 title='Return Rate by Top 10 Product Types',
                 labels={'x': 'Product Type', 'y': 'Average Return Rate'},
                 text=product_type_return_rates_series.values)
    fig.update_traces(texttemplate='%{text:.2%}', textposition='outside')
    fig.update_layout(xaxis_tickangle=-45)
    fig.show()
    fig.write_image("C:/Users/ADMIN/ET6-CDSP-group-17-repo/3_data_exploration/ASOS_GraphReturns/images/product_type_return_rate.png")
else:
    print("No 'productType_X' columns found for product type analysis.")

## 5. Numeric Feature Distributions

This section examines how the distribution of a key numerical feature, `price`, differs between returned and non-returned items. An overlaid histogram with marginal box plots helps reveal price ranges that are more associated with returns. For instance, are higher-priced items returned more often than lower-priced ones, or vice versa?

In [None]:
# 5. Distribution of Price for Returned vs. Non-Returned Items
if 'price' in df.columns:
    fig = px.histogram(df, x='price', color='isReturned',
                      marginal='box', 
                      barmode='overlay',
                      title='Distribution of Price for Returned vs. Non-Returned Items',
                      labels={'price': 'Price', 'isReturned': 'Is Returned'})
    fig.show()
    fig.write_image("C:/Users/ADMIN/ET6-CDSP-group-17-repo/3_data_exploration/ASOS_GraphReturns/images/price_distribution.png")
else:
    print("'price' column not found.")
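
The price question can also be answered numerically by grouping prices into quantile bins and comparing the return rate in each bin. This is a minimal sketch with made-up values: the helper name `return_rate_by_price_bin` and the choice of four bins are assumptions for illustration.

```python
import pandas as pd


def return_rate_by_price_bin(frame, price_col="price", target="isReturned", n_bins=4):
    """Split prices into quantile bins and compute the return rate within each bin."""
    # duplicates="drop" merges bins when many rows share the same price.
    bins = pd.qcut(frame[price_col], q=n_bins, duplicates="drop")
    grouped = frame.groupby(bins, observed=True)[target]
    return grouped.agg(n_rows="count", return_rate="mean")


# Toy data standing in for the prepared dataset (values are made up).
toy = pd.DataFrame({
    "price": [5, 10, 15, 20, 25, 30, 35, 40],
    "isReturned": [0, 0, 0, 1, 0, 1, 1, 1],
})
rates = return_rate_by_price_bin(toy)
print(rates)
```

A return rate that rises or falls steadily across the bins would support a price effect, while a flat profile would suggest price alone explains little.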

## 6. Correlation Analysis

This section examines the linear relationships between the numerical features in the dataset, including the `isReturned` target variable. A correlation heatmap makes strong positive or negative correlations easy to spot. It can highlight features that are linearly associated with returns and reveal multicollinearity among independent variables, which matters for subsequent modeling. Correlation captures only linear association, so a weak correlation with `isReturned` does not rule out a feature being useful.

In [None]:
# 6. Correlation Analysis
numerical_cols = df.select_dtypes(include=['number']).columns.tolist()
correlation_matrix = df[numerical_cols].corr()

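# A diverging colorscale pinned to [-1, 1] keeps zero correlation at the
# midpoint, so positive and negative values are easy to tell apart.
fig = go.Figure(data=go.Heatmap(
                    z=correlation_matrix.values,
                    x=correlation_matrix.columns,
                    y=correlation_matrix.columns,
                    colorscale='RdBu',
                    zmin=-1,
                    zmax=1,
                    colorbar=dict(title='Correlation')))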

fig.update_layout(
    title='Correlation Matrix of Numerical Features',
    xaxis_tickangle=-45
)
fig.show()
fig.write_image("C:/Users/ADMIN/ET6-CDSP-group-17-repo/3_data_exploration/ASOS_GraphReturns/images/correlation_matrix.png")
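
With many columns the heatmap can be hard to read, so it can help to list each feature's correlation with the target, ordered by absolute strength. This is a minimal sketch on made-up data. The helper name `rank_correlations_with_target` and the toy column names are hypothetical.

```python
import pandas as pd


def rank_correlations_with_target(frame, target="isReturned"):
    """Pearson correlation of each numeric feature with the target, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute value but keep the sign in the result.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data standing in for the prepared dataset (values are made up).
toy = pd.DataFrame({
    "isReturned": [0, 0, 1, 1],
    "price": [1, 2, 3, 4],
    "rating": [5, 3, 4, 1],
    "noise": [1, 2, 1, 2],
    "brand": ["a", "b", "c", "d"],  # non-numeric, ignored by the helper
})
ranked = rank_correlations_with_target(toy)
print(ranked)
```

In the notebook, `rank_correlations_with_target(df)` would give a short list to read alongside the heatmap.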

## 7. Summary of Findings

This section would be filled in after running the notebook and analyzing the plots. Key takeaways might include:

- The overall return rate is X%.
- [Gender/Country/Product Type] has the most significant impact on returns.
- Higher/lower priced items are more likely to be returned.
- Feature A and Feature B are highly correlated.