# Basic Data Reporting with Matplotlib & Seaborn

Now that you can load and clean data with Pandas, let's visualize it! Visualization helps you:   
- Understand data distributions  
- Spot outliers  
- Discover relationships between features  
- Communicate insights
     


In [None]:
import pandas as pd
df = pd.read_csv('data/housing.csv')
df.head()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-v0_8')  # Modern look
sns.set_palette("husl")     

## Distribution Plots 
Check how values are spread before modeling: skew and extreme values affect many ML models.

In [None]:
# Histogram
plt.figure(figsize=(8, 5))
plt.hist(df['median_house_value'], bins=50, color='skyblue', edgecolor='black')
plt.title('Distribution of House Values')
plt.xlabel('Median House Value ($)')
plt.ylabel('Frequency')
plt.show()

In [None]:
import warnings
warnings.simplefilter("ignore", FutureWarning)

# Density plot
plt.figure(figsize=(8, 5))
sns.kdeplot(df['median_income'], fill=True, color='purple')
plt.title('Income Distribution')
plt.xlabel('Median Income (x10k $)')
plt.show()

What to look for:
- Skewed distributions? → May need a log-transform for ML!
- Outliers? → May need clipping!
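
Both checks can be sketched in a few lines. This is a minimal illustration on synthetic data, not on the housing file: the `values` Series is a hypothetical stand-in for a column such as `df['median_income']`, and the 1st/99th percentile bounds are an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for a real column.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=1.0, sigma=0.8, size=5_000), name="income")

# Skewness near 0 means roughly symmetric; a large positive value
# means a long right tail.
skew_before = values.skew()

# log1p computes log(1 + x), so zeros are handled safely.
log_values = np.log1p(values)
skew_after = log_values.skew()

# Clip everything outside the 1st-99th percentile range (assumed bounds).
lower, upper = values.quantile([0.01, 0.99])
clipped = values.clip(lower=lower, upper=upper)

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
print(f"max before: {values.max():.1f}, after clipping: {clipped.max():.1f}")
```

Clipping keeps every row and only caps extreme values, which is usually gentler than dropping rows.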
     

## Categorical Analysis

In [None]:
# Compare groups
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='ocean_proximity', y='median_house_value')
plt.title('House Value by Ocean Proximity')
plt.xticks(rotation=15)
plt.show()

In [None]:
# count groups
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='ocean_proximity')
plt.title('Number of Samples per Ocean Proximity Category')
plt.xticks(rotation=15)
plt.show()

What these plots reveal:
- Data imbalance (e.g., few "ISLAND" samples)
- Whether location affects price (spoiler: it does!)
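
The same two questions can be answered numerically. The sketch below uses a tiny made-up table (the `sample` DataFrame and its values are hypothetical) with the column names from this notebook, so it runs without the CSV.

```python
import pandas as pd

# Hypothetical miniature of the housing data; values are invented.
sample = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "ISLAND"],
    "median_house_value": [120_000, 150_000, 135_000, 260_000, 300_000, 410_000],
})

# Share of rows per category: very small shares flag imbalance.
shares = sample["ocean_proximity"].value_counts(normalize=True)

# Row count and median value per category, most expensive first.
summary = (
    sample.groupby("ocean_proximity")["median_house_value"]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)

print(shares.round(2))
print(summary)
```

On the real data, replace `sample` with `df`. A category with only a handful of rows gives an unreliable median, so read the count and the median together.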

## Relationships Between Variables

Look for correlations between variables (key for feature selection).

In [None]:
# Scatter plot: income vs. house value
plt.figure(figsize=(8, 6))
plt.scatter(df['median_income'], df['median_house_value'], alpha=0.6)
plt.title('Income vs. House Value')
plt.xlabel('Median Income (x10k $)')
plt.ylabel('Median House Value ($)')
plt.show()

In [None]:
# Pair Plot: Multiple Relationships (Seaborn)
# Focus on key numeric columns
cols = ['median_income', 'housing_median_age', 'median_house_value']
sns.pairplot(df[cols], diag_kind='kde', plot_kws={'alpha': 0.1, 's': 5})
plt.show()

What to look for:
- Strong correlation with the target? → Good predictor for ML!
- Strong correlation between two features? → One of them may be redundant
- Non-linear patterns? → May need feature engineering!
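
These patterns can also be checked with numbers. The sketch below is an illustration on synthetic data (all column names are made up): it compares Pearson correlation, which measures linear association, with Spearman rank correlation, which also picks up curved but monotonic relationships.

```python
import numpy as np
import pandas as pd

# Synthetic data, not the housing set: 'x' drives the target non-linearly,
# 'x_noisy' is a near-copy of 'x', and 'noise' is unrelated.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=500)
demo = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(0, 0.1, size=500),
    "noise": rng.normal(0, 1, size=500),
    "target": np.exp(x / 2),
})

# Correlation of every feature with the target, under both methods.
pearson = demo.corr(method="pearson")["target"].drop("target")
spearman = demo.corr(method="spearman")["target"].drop("target")
print(pd.DataFrame({"pearson": pearson, "spearman": spearman}).round(2))

# A feature-to-feature correlation close to 1 flags a redundant pair.
redundancy = demo["x"].corr(demo["x_noisy"])
print(f"x vs x_noisy: {redundancy:.3f}")
```

When Spearman is clearly higher than Pearson for a feature, the relationship is probably monotonic but not linear, which hints that a transform of that feature could help a linear model.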

## Geospatial Visualization

In [None]:
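import contextily as ctx  # separate install (pip install contextily); basemap tiles are downloaded, so internet access is needed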
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 10))
ax = plt.gca()

# Create the scatter plot
scatter = plt.scatter(df['longitude'], df['latitude'], 
                     c=df['median_house_value'], cmap='viridis', 
                     alpha=0.6, s=10)
plt.colorbar(scatter, label='Median House Value ($)')
plt.title('California Housing Prices with Map Background')

# Add basemap
ctx.add_basemap(ax, crs='EPSG:4326')  # WGS84 coordinate system
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
ax = plt.gca()
sns.scatterplot(data=df, x='longitude', y='latitude', 
                hue='ocean_proximity', alpha=0.6)
plt.title('Housing Locations by Ocean Proximity')
ctx.add_basemap(ax, crs='EPSG:4326')  # WGS84 coordinate system
plt.show()

## Correlation Heatmap

In [None]:
# Compute correlation matrix
numeric_cols = df.select_dtypes(include='number').columns
corr_matrix = df[numeric_cols].corr()

# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()
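
With many features, the heatmap gets hard to read, so it can help to list the strongest pairs directly. The helper below is a sketch: `strong_pairs` is a hypothetical function (not part of pandas or seaborn), the 0.8 threshold is an arbitrary default, and the toy columns are synthetic.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute correlation is at least `threshold`."""
    # Keep only the upper triangle (k=1 excludes the diagonal),
    # so each pair appears once and self-correlations are dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)

# Synthetic demo: bedrooms are built from rooms, age is independent.
rng = np.random.default_rng(1)
rooms = rng.normal(2000, 500, size=300)
toy = pd.DataFrame({
    "total_rooms": rooms,
    "total_bedrooms": rooms * 0.2 + rng.normal(0, 20, size=300),
    "housing_median_age": rng.uniform(1, 52, size=300),
})

result = strong_pairs(toy.corr())
print(result)
```

On the notebook's data, the equivalent call would be `strong_pairs(corr_matrix)`.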

## Tips
1. Always visualize before modeling:
   - Fix skewness (log-transform)
   - Handle outliers (clip or remove)
   - Address imbalanced categories
2. Use seaborn for speed: `sns.histplot()`, `sns.boxplot()`, and `sns.scatterplot()` cover most everyday plotting needs
  
**A good plot is worth 1,000 lines of descriptive statistics!**

# Exercise
Add one new analysis to each section of this notebook (distributions, categorical analysis, relationships, geospatial, correlation) and explain the results.