# Nike Sales Data Exploratory Data Analysis (EDA) Project

This project walks through an exploratory data analysis (EDA) workflow on a real-world Nike sales dataset.
It covers data exploration, cleaning, analysis, and insight generation, and is intended to showcase Python, pandas, and analytical skills to employers.

**Author:** Samrah Far

## 📌 Table of Contents
1. [Project Overview](#project-overview)
2. [Importing Libraries](#importing-libraries)
3. [Loading the Dataset](#loading-the-dataset)
4. [Initial Data Exploration](#initial-data-exploration)
5. [Data Cleaning](#data-cleaning)
6. [Univariate Analysis](#univariate-analysis)
7. [Bivariate Analysis](#bivariate-analysis)
8. [Handling Outliers](#handling-outliers)
9. [Feature Scaling / Transformation](#feature-scaling--transformation)
10. [Insights & Business Recommendations](#insights--business-recommendations)
11. [Conclusion](#conclusion)


## 1. Project Overview <a name="project-overview"></a>

The analysis works on `Nike_Sales_Uncleaned.csv`, a sales dataset that contains missing values, inconsistent date formats, and inconsistently cased region names.

**Objectives:**
- Uncover trends and data quality issues
- Extract actionable insights
- Support business decisions with analytics

## 2. Importing Libraries <a name="importing-libraries"></a>

In [None]:
import pandas as pd                # Data manipulation and analysis
import matplotlib.pyplot as plt    # Data visualization
import seaborn as sns              # Enhanced statistical plots
import scipy.stats as stats        # Statistical tools
import numpy as np                 # Numerical computing

sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

## 3. Loading the Dataset <a name="loading-the-dataset"></a>

In [None]:
# Load the sales data
df = pd.read_csv('Nike_Sales_Uncleaned.csv')

In [None]:
# Preview the first few rows
df.head()

In [None]:
# Preview the last few rows
df.tail()

## 4. Initial Data Exploration <a name="initial-data-exploration"></a>

In [None]:
# Inspect data types, missing values, and overall structure
df.info()

In [None]:
# Generate summary statistics for numeric columns
df.describe()

In [None]:
# Missing values per column
df.isnull().sum()
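
Raw missing-value counts are easier to judge next to percentages. The following is a minimal sketch, not part of the original notebook: it uses a small made-up DataFrame (`toy`) in place of the real CSV and a hypothetical helper, `missing_summary`, to report both figures and to count fully duplicated rows.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)


# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Units_Sold": [3, np.nan, 5, 2],
    "MRP": [100.0, 120.0, np.nan, np.nan],
    "Region": ["north", "South", "south", "North"],
})

summary = missing_summary(toy)
print(summary)
print("Fully duplicated rows:", toy.duplicated().sum())
```

On the real data, the same helper could be called as `missing_summary(df)` to decide which columns to drop, fill, or leave alone.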

## 5. Data Cleaning <a name="data-cleaning"></a>

In [None]:
# Remove rows where critical numeric data is missing (Units_Sold or MRP)
df_clean = df.dropna(subset=['Units_Sold', 'MRP']).copy()

In [None]:
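# Fix inconsistent date formats in 'Order_Date'.
# Formats are tried in order, so an ambiguous string such as "2024-03-04"
# is always read with the first format that matches (year-month-day here).
DATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%Y/%m/%d", "%d/%m/%Y", "%d-%m-%y", "%Y-%d-%m")

def parse_date(date):
    if pd.isnull(date):
        return pd.NaT  # keep missing dates as a datetime-compatible null
    for fmt in DATE_FORMATS:
        try:
            return pd.to_datetime(date, format=fmt)
        except (ValueError, TypeError):  # this format did not match; try the next one
            continue
    # Fall back to pandas' own inference; unparseable values become NaT
    return pd.to_datetime(date, errors='coerce')

df_clean['Order_Date'] = df_clean['Order_Date'].apply(parse_date)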
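
After parsing mixed formats, it helps to check which format each value matched and how many values could not be parsed at all. The sketch below is illustrative only: it uses the standard library's strict `datetime.strptime`, a subset of the formats, a hypothetical helper `parse_with_format`, and made-up sample strings rather than the real `Order_Date` column.

```python
from datetime import datetime

import pandas as pd

# Subset of formats, tried in order (assumption: year-first takes priority)
FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%Y/%m/%d", "%d/%m/%Y")


def parse_with_format(text):
    """Return (timestamp, matched_format), or (NaT, None) if nothing matches."""
    if pd.isnull(text):
        return pd.NaT, None
    for fmt in FORMATS:
        try:
            return pd.Timestamp(datetime.strptime(str(text).strip(), fmt)), fmt
        except ValueError:
            continue
    return pd.NaT, None


# Hypothetical sample values, not taken from the dataset
toy_dates = pd.Series(["2024-03-04", "04-03-2024", "2024/03/04", "not a date", None])

pairs = toy_dates.apply(parse_with_format).tolist()
result = pd.DataFrame(pairs, columns=["parsed", "format"])

print(result)
print("Unparsed values:", result["parsed"].isna().sum())
print(result["format"].value_counts())
```

A high count for an unexpected format, or many unparsed values, is a signal to review the format list before trusting any time-based analysis.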

In [None]:
# Fill missing Discount_Applied with 0 (assume no discount)
df_clean['Discount_Applied'] = df_clean['Discount_Applied'].fillna(0)

# Fill missing Size with 'Unknown'
df_clean['Size'] = df_clean['Size'].fillna('Unknown')

# Standardize region names to title case
df_clean['Region'] = df_clean['Region'].str.title()
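
Title-casing alone does not merge labels that differ only by stray whitespace. Whether the real `Region` column has that problem is not known from the notebook, so the following is a sketch on made-up values showing how `str.strip()` could be combined with `str.title()` and how to verify the result.

```python
import pandas as pd

# Hypothetical region labels with mixed case and stray spaces
regions = pd.Series(["north", " North", "NORTH ", "south", None])

title_only = regions.str.title()
stripped = regions.str.strip().str.title()

# Compare the number of distinct labels produced by each approach
print("Title case only:", title_only.dropna().unique().tolist())
print("Strip + title:  ", stripped.dropna().unique().tolist())
print(stripped.value_counts(dropna=False))
```

Comparing `nunique()` before and after cleaning is a quick way to confirm that the categories collapsed as intended.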

## 6. Univariate Analysis <a name="univariate-analysis"></a>

In [None]:
# Distribution of Units_Sold
sns.histplot(df_clean['Units_Sold'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of Units Sold')
plt.xlabel('Units Sold')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Distribution of Profit
sns.histplot(df_clean['Profit'], bins=20, kde=True, color='salmon')
plt.title('Distribution of Profit')
plt.xlabel('Profit')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Top 10 products by units sold
top_products = df_clean.groupby('Product_Name')['Units_Sold'].sum().sort_values(ascending=False).head(10)
print("Top 10 Products by Units Sold:\n", top_products)

In [None]:
# Count of sales by region
region_counts = df_clean['Region'].value_counts()
sns.barplot(x=region_counts.index, y=region_counts.values, palette="viridis")
plt.title('Number of Sales by Region')
plt.xlabel('Region')
plt.ylabel('Number of Sales')
plt.xticks(rotation=45)
plt.show()

## 7. Bivariate Analysis <a name="bivariate-analysis"></a>

In [None]:
# Correlation heatmap
corr = df_clean[['Units_Sold', 'MRP', 'Discount_Applied', 'Revenue', 'Profit']].corr()
sns.heatmap(corr, annot=True, cmap='Blues')
plt.title('Correlation Heatmap')
plt.show()

In [None]:
# Profit by Product Line
sns.boxplot(x='Product_Line', y='Profit', data=df_clean)
plt.title('Profit by Product Line')
plt.xlabel('Product Line')
plt.ylabel('Profit')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Revenue by Region
sns.boxplot(x='Region', y='Revenue', data=df_clean)
plt.title('Revenue by Region')
plt.xlabel('Region')
plt.ylabel('Revenue')
plt.xticks(rotation=45)
plt.show()

In [None]:
# MRP vs Profit scatter plot
sns.scatterplot(x='MRP', y='Profit', data=df_clean, alpha=0.6)
plt.title('MRP vs Profit')
plt.xlabel('MRP')
plt.ylabel('Profit')
plt.show()

## 8. Handling Outliers <a name="handling-outliers"></a>

In [None]:
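# Detect outliers in Profit using Z-score.
# nan_policy='omit' keeps a single missing Profit value from turning every score into NaN.
z_scores = np.abs(stats.zscore(df_clean['Profit'], nan_policy='omit'))
outlier_threshold = 3
outliers = df_clean[z_scores > outlier_threshold]
print(f"Detected {len(outliers)} outlier rows in Profit (Z-score > {outlier_threshold})")

# Remove the outliers and show the resulting distribution.
# Rows with missing Profit are dropped as well, because NaN <= threshold is False.
# .copy() avoids SettingWithCopyWarning when columns are added to this frame later.
df_no_outliers = df_clean[z_scores <= outlier_threshold].copy()

sns.histplot(df_no_outliers['Profit'], bins=20, kde=True, color='green')
plt.title('Profit Distribution (No Outliers)')
plt.xlabel('Profit')
plt.ylabel('Frequency')
plt.show()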
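
A Z-score rule can miss extreme values in small or heavily skewed samples, because the extreme value itself inflates the mean and standard deviation. As a cross-check, the interquartile range (IQR) rule is a common alternative. The sketch below is not the notebook's method: it uses a hypothetical helper `iqr_bounds`, the conventional multiplier of 1.5, and made-up profit values.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical profit values with one obvious extreme
profit = pd.Series([10, 12, 11, 13, 12, 11, 10, 500], dtype=float, name="Profit")

low, high = iqr_bounds(profit)
is_outlier = (profit < low) | (profit > high)

# Z-scores on the same data, for comparison (population standard deviation)
z = (profit - profit.mean()) / profit.std(ddof=0)

print("IQR fences:", low, high)
print("Flagged by IQR rule:", profit[is_outlier].tolist())
print("Flagged by |z| > 3:", profit[z.abs() > 3].tolist())
```

In this toy sample the IQR rule flags the extreme value while the Z-score rule does not, which is why comparing both methods before dropping rows can be worthwhile.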

## 9. Feature Scaling / Transformation <a name="feature-scaling--transformation"></a>

In [None]:
# Log transformation of Revenue to address skewness
df_no_outliers['Log_Revenue'] = np.log1p(df_no_outliers['Revenue'])
sns.histplot(df_no_outliers['Log_Revenue'], bins=20, kde=True, color='purple')
plt.title('Log-Transformed Revenue Distribution')
plt.xlabel('Log(Revenue + 1)')
plt.ylabel('Frequency')
plt.show()
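
`np.log1p(x)` computes `log(1 + x)`, so it handles zero revenue but is only defined for values greater than -1, and `np.expm1` reverses it. The following sketch, on made-up revenue figures, checks that the transformation reduces skewness and that values can be mapped back to the original scale.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed revenue values, including a zero
revenue = pd.Series([0.0, 99.0, 999.0, 9999.0], name="Revenue")

log_revenue = np.log1p(revenue)   # log(1 + x), safe at x = 0
restored = np.expm1(log_revenue)  # inverse transform: exp(y) - 1

print("Skewness before:", round(revenue.skew(), 2))
print("Skewness after: ", round(log_revenue.skew(), 2))
print("Round trip matches:", np.allclose(restored, revenue))
```

If the real `Revenue` column contains negative values (for example, from returns), those rows would need separate handling before applying the transform.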

In [None]:
# Standardization of MRP
mrp_mean = df_no_outliers['MRP'].mean()
mrp_std = df_no_outliers['MRP'].std()
df_no_outliers['MRP_zscore'] = (df_no_outliers['MRP'] - mrp_mean) / mrp_std
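
One detail worth knowing: pandas' `Series.std()` uses the sample standard deviation (`ddof=1`) by default, whereas `scipy.stats.zscore` defaults to the population version (`ddof=0`), so the two give slightly different scores. The sketch below, on made-up MRP values, shows both variants and the sanity check that a standardized column has mean 0 and standard deviation 1.

```python
import pandas as pd

# Hypothetical MRP values
mrp = pd.Series([80.0, 100.0, 120.0, 150.0, 200.0], name="MRP")

# Sample standard deviation (pandas default, ddof=1)
z_sample = (mrp - mrp.mean()) / mrp.std()

# Population standard deviation (ddof=0, the scipy.stats.zscore default)
z_population = (mrp - mrp.mean()) / mrp.std(ddof=0)

print("Mean after scaling:", round(z_sample.mean(), 10))
print("Std after scaling (ddof=1):", round(z_sample.std(), 10))
print("Largest difference between variants:",
      (z_sample - z_population).abs().max())
```

The difference shrinks as the number of rows grows, so on a full sales dataset it rarely matters, but it explains small mismatches between tools.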

## 10. Insights & Business Recommendations <a name="insights--business-recommendations"></a>

In [None]:
print("--- Key Insights ---")
# Best-selling product by total units sold
print(f"Top selling product: {top_products.index[0]} with {top_products.iloc[0]} units.")

# Region with the most sales records (row count, not revenue)
top_region = region_counts.idxmax()
print(f"Region with the most sales records: {top_region}")

# Average profit per product line, computed after outlier removal
profit_per_line = df_no_outliers.groupby('Product_Line')['Profit'].mean().sort_values(ascending=False)
print("Average profit by product line:\n", profit_per_line)

# Correlation between discount and units sold; check its sign and size before interpreting
discount_corr = corr.loc['Units_Sold', 'Discount_Applied']
print(f"Correlation between Units Sold and Discount Applied: {discount_corr:.2f}")

# Recommendations
print("\n--- Business Recommendations ---")
print("- Focus marketing on top-selling products and top-performing regions.")
print("- Analyze underperforming product lines for improvement or discontinuation.")
print("- Consider targeted discount strategies if the correlation above shows that discounts lift volume.")

## 11. Conclusion <a name="conclusion"></a>

In [None]:
print("--- Conclusion ---")
print("This Nike Sales EDA project demonstrates data cleaning, advanced analysis, and insight generation.")
print("The workflow and insights are applicable for real-world business decision support and showcase core data science skills.")