# 📓 Lesson 11: Data Visualization with Pandas, Matplotlib, and Seaborn
📘 What you will learn:

In this lesson, you’ll learn how to:
1. Create basic charts with Pandas (`.plot()`)
2. Use Matplotlib to customize plots
3. Use Seaborn for cleaner and more advanced visuals
4. Choose the right chart for your data
5. Add labels, titles, colors, and grid lines

🧠 Why is this useful?

Visualization helps you:
- Understand patterns and trends faster
- Communicate results clearly to others
- Detect outliers and problems
- Make better decisions based on visual evidence

## 🧪 Step 1: Load and prepare the data

In [None]:
import pandas as pd

df = pd.read_csv('../data/Sales_January_2019.csv')

# Clean and prepare data
df['Order Date'] = pd.to_datetime(df['Order Date'], format="%m/%d/%y %H:%M", errors='coerce')
df['Quantity Ordered'] = pd.to_numeric(df['Quantity Ordered'], errors='coerce')
df['Price Each'] = pd.to_numeric(df['Price Each'], errors='coerce')
df = df.dropna(subset=['Order Date', 'Quantity Ordered', 'Price Each'])
df['Total Price'] = df['Quantity Ordered'] * df['Price Each']
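
To see what `errors='coerce'` does in the cleaning step, here is a minimal sketch on a made-up three-row sample (the values are hypothetical, not taken from the real CSV). Cells that cannot be parsed become `NaT`/`NaN`, so `dropna` can remove those rows afterwards.

```python
import pandas as pd

# Hypothetical mini-sample: the middle row is malformed on purpose.
raw = pd.DataFrame({
    "Order Date": ["01/22/19 21:25", "not a date", "01/28/19 14:15"],
    "Quantity Ordered": ["1", "two", "2"],
    "Price Each": ["700", "n/a", "14.95"],
})

clean = raw.copy()
clean["Order Date"] = pd.to_datetime(
    clean["Order Date"], format="%m/%d/%y %H:%M", errors="coerce"
)
clean["Quantity Ordered"] = pd.to_numeric(clean["Quantity Ordered"], errors="coerce")
clean["Price Each"] = pd.to_numeric(clean["Price Each"], errors="coerce")

# Unparseable cells are now NaT/NaN, so they can be counted before dropping.
required = ["Order Date", "Quantity Ordered", "Price Each"]
bad_rows = clean[required].isna().any(axis=1).sum()

clean = clean.dropna(subset=required).copy()
clean["Total Price"] = clean["Quantity Ordered"] * clean["Price Each"]
print(f"Dropped {bad_rows} malformed row(s); {len(clean)} remain")
```

Counting the bad rows before dropping them is a cheap sanity check: a large number usually means the date format or column names do not match the file.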

## 📈 Step 2: Line Plot – Total Sales by Day

[Matplotlib docs here.](https://matplotlib.org/)

In [None]:
%pip install matplotlib

In [None]:
import matplotlib.pyplot as plt

# Use the order date as the index so resample() can group rows by calendar day
df.set_index('Order Date', inplace=True)
daily_sales = df['Total Price'].resample('D').sum()

# Plot it
plt.figure(figsize=(12, 4))
daily_sales.plot()
plt.title('Total Sales Per Day')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.grid(True)
plt.show()


📌 Use a line chart when you want to show a trend over time.
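
If `resample('D').sum()` feels like a black box, this small sketch may help. It uses three hypothetical orders (not from the dataset) and shows that a day without orders still appears in the result, with a total of 0.

```python
import pandas as pd

# Hypothetical orders: two on 1 January, none on 2 January, one on 3 January.
orders = pd.DataFrame(
    {"Total Price": [10.0, 5.0, 20.0]},
    index=pd.to_datetime(
        ["2019-01-01 09:00", "2019-01-01 18:30", "2019-01-03 12:00"]
    ),
)

# 'D' groups rows into calendar-day bins; sum() adds up the values in each bin.
daily = orders["Total Price"].resample("D").sum()
print(daily)
```

Because empty days are filled with 0, the line in the chart drops to zero on those days instead of skipping them.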

## 📊 Step 3: Bar Plot – Sales by Product

In [None]:
product_sales = df.groupby('Product')['Total Price'].sum().sort_values()

product_sales.plot(kind='barh', figsize=(10, 6), color='skyblue')
plt.title('Total Sales by Product')
plt.xlabel('Sales ($)')
plt.ylabel('Product')
plt.grid(axis='x')
plt.show()

📌 Use a bar chart when comparing categories like products or cities.
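
The bar chart above sorts the totals before plotting. The sketch below, with placeholder product names and made-up numbers, shows why: a horizontal bar chart draws the first row at the bottom, so ascending order places the largest bar at the top. It also uses Matplotlib's `fig, ax` style as one alternative to the `plt.` calls.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical totals; the product names are placeholders.
sales = pd.Series({"Product A": 300.0, "Product B": 120.0, "Product C": 450.0})

# barh draws the first row at the bottom, so ascending order puts the
# largest bar at the top of the chart.
ordered = sales.sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ordered.plot(kind="barh", ax=ax, color="skyblue")
ax.set_title("Total Sales by Product")
ax.set_xlabel("Sales ($)")
ax.grid(axis="x")

# Bar lengths, listed from the bottom of the chart to the top.
bar_widths = [bar.get_width() for bar in ax.patches]
print(bar_widths)
```

In a notebook the figure is rendered at the end of the cell; in a script you would still call `plt.show()`.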

## 🥧 Step 4: Pie Chart – Share of Total

In [None]:
# Share of sales by product (top 5 only)
top5 = df.groupby('Product')['Total Price'].sum().nlargest(5)
top5.plot(kind='pie', autopct='%1.1f%%', startangle=90, figsize=(6, 6), title='Top 5 Products')
plt.ylabel('')  # hide the default 'Total Price' axis label
plt.show()


📌 What it shows:
- Each category's proportion of the whole
- Which categories account for the largest share of revenue

⚠️ Not recommended for more than 4–6 categories
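
One detail worth checking: when you plot only the top 5, the percentages printed by `autopct` are shares of those five products, not of all revenue. The sketch below uses hypothetical revenue figures with placeholder names to compare the two.

```python
import pandas as pd

# Hypothetical revenue per product (7 products in total).
revenue = pd.Series({
    "A": 800.0, "B": 400.0, "C": 300.0, "D": 250.0,
    "E": 150.0, "F": 60.0, "G": 40.0,
})

top5 = revenue.nlargest(5)

# What the pie chart prints: each slice divided by the top-5 total.
pie_share = (top5 / top5.sum() * 100).round(1)

# Share of ALL revenue, including the products left out of the chart.
overall_share = (top5 / revenue.sum() * 100).round(1)

print(pd.DataFrame({"pie %": pie_share, "overall %": overall_share}))
```

If the difference matters for your audience, one option is to say so in the chart title or to add an "Other" slice.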

## 📍 Step 5: Scatter Plot – Show Relationship Between Two Variables

Use case: Explore correlation or trend between two values

Example: Quantity Ordered vs. Total Sales

In [None]:
summary = df.groupby('Product').agg({
    'Quantity Ordered': 'sum',
    'Total Price': 'sum'
}).reset_index()

plt.figure(figsize=(10, 6))
plt.scatter(summary['Quantity Ordered'], summary['Total Price'], s=100)

# Add product labels to each point
for i, txt in enumerate(summary['Product']):
    plt.annotate(txt, (summary['Quantity Ordered'][i], summary['Total Price'][i]))

plt.xlabel("Quantity Ordered")
plt.ylabel("Total Revenue")
plt.title("Quantity vs Revenue per Product")
plt.grid(True)
plt.show()


📌 What it shows:
- Products with high quantity but low price → bottom right (e.g., cables)
- Products with low quantity but high price → top left (e.g., Macbook)
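
A point's position in this chart is driven by the average price per unit: revenue divided by quantity. Here is a rough sketch with invented numbers (the three products and their totals are hypothetical). It also builds the label/position pairs with `zip`, which is one way to prepare the arguments for `plt.annotate` without relying on the row index.

```python
import pandas as pd

# Hypothetical per-product summary, shaped like the `summary` table above.
summary = pd.DataFrame({
    "Product": ["Cable", "Headphones", "Laptop"],
    "Quantity Ordered": [2000, 800, 200],
    "Total Price": [24000.0, 80000.0, 340000.0],
})

# Average revenue per unit explains where each point lands:
# low values sit toward the bottom right, high values toward the top left.
summary["Avg Unit Price"] = summary["Total Price"] / summary["Quantity Ordered"]

# (label, (x, y)) pairs, ready to pass to plt.annotate(label, xy).
labels = [
    (name, (qty, revenue))
    for name, qty, revenue in zip(
        summary["Product"], summary["Quantity Ordered"], summary["Total Price"]
    )
]
print(summary)
```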

## 📉 Step 6: Histogram – Distribution of Order Values

Use case: Show how values are distributed over ranges (bins)

In [None]:
df['Total Price'].plot(kind='hist', bins=50, color='orange', figsize=(10, 4))
plt.title('Distribution of Total Order Values')
plt.xlabel('Total Price ($)')
plt.grid(True)
plt.show()

📌 Use a histogram to understand the distribution of numeric data:
- View the distribution of prices, incomes, or scores
- Identify outliers
- See where the values are concentrated
- Check whether the data looks roughly normally distributed
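
To see the numbers behind a histogram, you can compute the bins yourself. This is a small sketch on hypothetical values using `np.histogram`, which splits the data range into equal-width bins (the same idea as passing an integer `bins` to the plot).

```python
import numpy as np
import pandas as pd

# Hypothetical order values with one very large order.
values = pd.Series([5, 8, 11, 12, 12, 14, 15, 150], dtype=float)

# Five equal-width bins between the smallest and largest value.
counts, edges = np.histogram(values, bins=5)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {count}")

# A mean far above the median hints at a long right tail (not a normal shape).
print("mean:", values.mean(), "median:", values.median())
```

Here almost everything falls into the first bin and the single large order sits alone in the last one, which is exactly the pattern that points to an outlier.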

## 🎨 Step 7: Use Seaborn for nicer plots

[Seaborn docs here.](https://seaborn.pydata.org/)

In [None]:
%pip install seaborn

📌 Seaborn makes it easier to build attractive plots, often with a single function call.

### 📦 Box Plot – Compare Distributions & Spot Outliers
Use case: Compare spread of values across categories, and detect outliers

In [None]:
import seaborn as sns

# Example: Box plot of order values by product
plt.figure(figsize=(12, 5))
sns.boxplot(data=df, x='Product', y='Total Price')
plt.xticks(rotation=90)
plt.title('Order Value Distribution per Product')
plt.show()


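📌 What it shows:
- Box = the middle 50% of values (the interquartile range, from the 25th to the 75th percentile)
- Line inside the box = median
- Whiskers = the range of values within 1.5 × IQR of the box
- Dots = outliers beyond the whiskers (unusually high or low orders)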
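
The elements of a box plot can be reproduced by hand. The sketch below uses hypothetical values and the 1.5 × IQR rule, which is the default whisker setting in Matplotlib and Seaborn, to find which points would be drawn as outlier dots.

```python
import pandas as pd

# Hypothetical order values; 60 is deliberately far from the rest.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 60], dtype=float)

# Quartiles: the box spans q1 to q3, with the median drawn as a line inside it.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points outside these limits are plotted as individual dots.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"box: {q1} to {q3}, median: {median}")
print(f"outlier limits: {lower} to {upper}")
print("outliers:", outliers.tolist())
```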

## 🔥 Step 8: Heatmap – Correlation Between Variables
Use case: Show how multiple numeric columns relate to each other

In [None]:
correlation = df[['Quantity Ordered', 'Price Each', 'Total Price']].corr()

sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()


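📌 What it shows:
- How strongly columns are related (from -1 to +1)
- With the `coolwarm` colormap, dark red = strong positive correlation and dark blue = strong negative correlation; pale colors = weak correlation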
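
To get a feel for the values in a correlation matrix, here is a tiny sketch on made-up data (the numbers are chosen for illustration, not taken from the sales file). It produces one perfect negative correlation and one correlation of zero.

```python
import pandas as pd

# Hypothetical data: price falls by 10 each time quantity rises by 1.
demo = pd.DataFrame({
    "Quantity Ordered": [1, 2, 3, 4],
    "Price Each": [40.0, 30.0, 20.0, 10.0],
})
demo["Total Price"] = demo["Quantity Ordered"] * demo["Price Each"]

# Pearson correlation (the default of DataFrame.corr) for every column pair.
corr = demo.corr()
print(corr.round(2))
```

Quantity and price form a perfect straight line going down, so their correlation is -1. Total price first rises and then falls, so its correlation with quantity comes out as 0 even though the columns are clearly related: the coefficient only measures linear relationships. When drawing such a matrix, `sns.heatmap` accepts `vmin=-1, vmax=1` if you want the color scale to cover the full range.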

## 🧠 Practice Exercises
1. Line chart of daily total sales
2. Bar chart of total quantity sold by product
3. Histogram of Price Each
4. Boxplot of total price by city (extract city from address)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

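# 1
daily_sales = df['Total Price'].resample('D').sum()
daily_sales.plot(figsize=(12, 4), title="Daily Total Sales")
plt.show()  # finish this figure so the next chart is not drawn on top of it

# 2
df.groupby('Product')['Quantity Ordered'].sum().plot(kind='bar', color='lightgreen')
plt.show()

# 3
df['Price Each'].plot(kind='hist', bins=40, title="Price Distribution")
plt.show()

# 4
def extract_city(address):
    # The city is the second comma-separated field of the address
    try:
        return address.split(',')[1].strip()
    except (AttributeError, IndexError):
        # Missing value (not a string) or an address without a comma
        return None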

df['City'] = df['Purchase Address'].apply(extract_city)
df = df.dropna(subset=['City'])

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='City', y='Total Price')
plt.xticks(rotation=45)
plt.title("Order Value by City")
plt.grid(True)
plt.show()
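
As an alternative to `apply` with a helper function, the city can also be extracted with Pandas string methods. This is only a sketch: the addresses below are invented, and it assumes the same layout as the solution above, where the city is the second comma-separated field.

```python
import pandas as pd

# Hypothetical addresses, including two rows that cannot be parsed.
addresses = pd.Series([
    "123 Main St, Springfield, IL 62701",
    "45 Oak Ave, Portland, OR 97035",
    "no commas here",
    None,
])

# Split on commas, take the second piece, and trim the surrounding spaces.
# Rows without a second piece (or with a missing address) become NaN.
city = addresses.str.split(",").str[1].str.strip()
print(city)
```

The unparseable rows end up as missing values, so the same `dropna(subset=['City'])` step still applies afterwards.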

## 📌 Summary
In this lesson, you learned:
- How to use Pandas and Matplotlib to create simple visualizations
- How to use Seaborn for advanced, beautiful plots
- The best chart types for trend, comparison, and distribution
- How to customize titles, labels, and layouts

👉 In the next module, we’ll move on to real-world projects, starting with Sales Data Analysis from all 12 months.