# Exploratory Data Analysis (EDA) & Extracting Insights
Once the data is clean, the goal is to understand its underlying patterns, relationships, and anomalies. EDA can be viewed as detective work: dusting for fingerprints, finding out-of-place clues, and seeing how they relate before model building.

## 1. Univariate Analysis & The Shape of Data 📊
Univariate analysis looks at one variable at a time to understand its distribution.

- **Histograms**: Group continuous data into bins to show the shape, central tendency, spread, and skewness of the distribution.
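- **Box Plots**: Visualize the spread and highlight outliers. A box plot shows:
  - Median (middle value)
  - Interquartile range, or IQR (the middle 50% of the data)
  - Whiskers (by the common convention, reaching the most extreme points within 1.5 × IQR of the box)
  - Outliers (points plotted beyond the whiskers)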
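
As a rough sketch of the two plots described above, the snippet below draws a histogram and a box plot side by side with matplotlib. The `order_values` data is synthetic and right-skewed, generated only for illustration; the bin count of 30 is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, right-skewed "order value" data; purely illustrative.
rng = np.random.default_rng(seed=0)
order_values = rng.lognormal(mean=2.0, sigma=0.5, size=500)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: bin counts reveal shape, center, spread, and skew.
ax_hist.hist(order_values, bins=30, edgecolor="black")
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("Order value")
ax_hist.set_ylabel("Count")

# Box plot: median line, IQR box, whiskers, and individual outlier points.
ax_box.boxplot(order_values)
ax_box.set_title("Box plot")
ax_box.set_ylabel("Order value")

plt.tight_layout()
plt.show()
```

With skewed data like this, the histogram shows a long right tail and the box plot marks the same tail as points above the upper whisker.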

*Example*: If someone buys a \$1,500 espresso machine while typical transactions are \$4-\$12, the mean is pulled far above what a typical customer spends, making the median a much more reliable summary of a typical transaction.
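
The effect can be checked numerically. The sketch below uses a made-up list of ten transactions (nine small purchases plus the espresso machine) and the common 1.5 × IQR rule to flag outliers; both the data and the choice of the `inclusive` quantile method are assumptions for illustration.

```python
import statistics

# Hypothetical transactions in dollars; the last one is the espresso machine.
transactions = [4, 5, 5, 6, 7, 8, 9, 10, 12, 1500]

mean_value = statistics.mean(transactions)
median_value = statistics.median(transactions)

# Quartiles, then the 1.5 * IQR fences that box plots commonly use.
q1, _, q3 = statistics.quantiles(transactions, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [t for t in transactions if t < lower_fence or t > upper_fence]

print(f"mean={mean_value:.1f}, median={median_value}")
print(f"IQR={iqr}, outliers={outliers}")
```

A single extreme purchase moves the mean well outside the range of every ordinary transaction, while the median stays inside it.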

## 2. Multivariate Mysteries & Paradoxes 🕸️
Exploring relationships between two or more variables. Watch out for these traps:

- **Correlation vs. Causation 📈**: Two variables moving together doesn't imply one causes the other (e.g., ice cream sales and sunburns are both driven by hot weather).
- **Multicollinearity 👯**: Input variables are highly correlated with each other, making it hard for models to isolate their individual impacts.
- **Simpson's Paradox 🤯**: A trend appears in different groups but disappears or reverses when groups are combined (e.g., a treatment can have the higher success rate among both mild and severe cases, yet the lower rate overall, if it is given mostly to severe cases).
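
Simpson's paradox is easiest to believe after seeing the arithmetic. The sketch below uses made-up patient counts (not real clinical data) in which treatment A is given mostly to severe cases; the column names are hypothetical.

```python
import pandas as pd

# Made-up counts for illustration only (not real clinical data).
records = pd.DataFrame(
    {
        "treatment": ["A", "A", "B", "B"],
        "severity": ["mild", "severe", "mild", "severe"],
        "patients": [20, 80, 80, 20],
        "successes": [18, 56, 68, 12],
    }
)

# Success rate within each severity group.
records["rate"] = records["successes"] / records["patients"]
by_group = records.pivot(index="severity", columns="treatment", values="rate")

# Success rate after pooling the severity groups together.
totals = records.groupby("treatment")[["successes", "patients"]].sum()
overall = totals["successes"] / totals["patients"]

print(by_group)
print(overall)
```

Within each severity group A has the higher rate, but pooled together B looks better, because most of A's patients come from the harder group.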

## 3. Verifying Correlation
To verify if two features correlate, we use visual and mathematical tools.

### Visual Verification 👁️
- **Scatter Plots**: Plot features on X and Y axes.
  - Points trend up and to the right → Positive correlation.
  - Points trend down and to the right → Negative correlation.
  - Random cloud → No correlation.
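
The three patterns can be reproduced with synthetic data. This is a minimal sketch: the slopes, noise levels, and seed are arbitrary assumptions, and `np.corrcoef` is used only to label each panel.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data; slopes and noise levels are arbitrary choices.
rng = np.random.default_rng(seed=42)
x = rng.uniform(0, 10, size=200)
noise = rng.normal(0, 2, size=200)

patterns = {
    "Positive": 2 * x + noise,
    "Negative": -2 * x + noise,
    "None": rng.normal(0, 5, size=200),
}

corr = {}
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True)
for ax, (label, y) in zip(axes, patterns.items()):
    ax.scatter(x, y, alpha=0.6)
    corr[label] = np.corrcoef(x, y)[0, 1]  # Pearson r for the panel title
    ax.set_title(f"{label} (r = {corr[label]:.2f})")
    ax.set_xlabel("x")
    ax.set_ylabel("y")

plt.tight_layout()
plt.show()
```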

### Mathematical Verification 🧮
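Both coefficients below range from -1.0 to +1.0; values near 0 indicate little or no relationship of the kind being measured.
- **Pearson Correlation ($r$)**: Measures the strength and direction of a linear relationship. Best for continuous, roughly normally distributed data without extreme outliers.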

$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}$$

  - $x_i$ and $y_i$ are individual data points.
  - $\bar{x}$ and $\bar{y}$ are the means of variables $x$ and $y$.
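
To make the formula concrete, here is a small sketch that computes $r$ term by term and compares it with NumPy's `np.corrcoef`. The helper name `pearson_r` and the `hours`/`scores` values are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r computed directly from the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # x_i - x_bar
    dy = y - y.mean()  # y_i - y_bar
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical data: hours studied vs. test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r_manual = pearson_r(hours, scores)
r_numpy = np.corrcoef(hours, scores)[0, 1]
print(r_manual, r_numpy)
```

The two values should agree up to floating-point rounding.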

- **Spearman Rank Correlation ($\rho$)**: Measures monotonic relationships by comparing ranks of values instead of raw data. Robust to extreme outliers.

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

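  - $d_i$ is the difference between the ranks of corresponding values for $x$ and $y$.
  - $n$ is the number of observations.
  - This shortcut form is exact only when there are no tied ranks; with ties, compute Pearson's $r$ on the ranks instead.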
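
The rank-based formula can be sketched the same way. The helper `spearman_rho_no_ties` below is a hypothetical name and assumes the data has no tied values; the result is compared with Pearson's $r$ computed on the ranks, and with Pearson's $r$ on the raw values to show the effect of one extreme point.

```python
import pandas as pd

def spearman_rho_no_ties(x, y):
    """Spearman rho via the d_i shortcut; assumes no tied values."""
    d = pd.Series(x).rank() - pd.Series(y).rank()
    n = len(d)
    return 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

# Hypothetical data: mostly increasing, with one extreme y value.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 500]

rho_formula = spearman_rho_no_ties(x, y)
# Spearman is Pearson applied to ranks (Series.corr defaults to Pearson).
rho_ranks = pd.Series(x).rank().corr(pd.Series(y).rank())
# Pearson on the raw values is dragged down by the extreme point.
r_raw = pd.Series(x).corr(pd.Series(y))

print(rho_formula, rho_ranks, r_raw)
```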

### Code Example: Correlation Matrix
We can use pandas and seaborn to compute and visualize correlations across multiple features.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'df' is your pandas DataFrame with multiple columns.
# numeric_only=True skips text/categorical columns, which would
# otherwise raise an error in recent pandas versions.

# 1. Calculate the Pearson correlation matrix
pearson_matrix = df.corr(method='pearson', numeric_only=True)

# 2. Calculate the Spearman correlation matrix
spearman_matrix = df.corr(method='spearman', numeric_only=True)

# 3. Visualize the Pearson matrix using a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(pearson_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Pearson Correlation Heatmap")
plt.show()
```
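
A correlation matrix can also serve as a first screen for the multicollinearity trap from section 2. The sketch below lists feature pairs whose absolute correlation exceeds a cutoff; the helper name `highly_correlated_pairs`, the 0.8 threshold, and the `demo` columns are all assumptions for illustration, not a standard rule.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.8, method="pearson"):
    """Return feature pairs whose |correlation| is at least `threshold`."""
    corr = frame.corr(method=method, numeric_only=True)
    # Keep the upper triangle only: each pair once, diagonal dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]  # also discards any NaN entries
    return pairs.sort_values(key=np.abs, ascending=False)

# Synthetic demo data: two near-duplicate features plus unrelated noise.
rng = np.random.default_rng(seed=1)
height_cm = rng.normal(170, 10, size=300)
demo = pd.DataFrame(
    {
        "height_cm": height_cm,
        "height_in": height_cm / 2.54 + rng.normal(0, 0.2, size=300),
        "unrelated": rng.normal(0, 1, size=300),
    }
)

flagged = highly_correlated_pairs(demo, threshold=0.8)
print(flagged)
```

Pairwise correlation only catches two-variable redundancy; a feature that is a combination of several others can slip through this check.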


## 4. Translating EDA into Business Insights 💡
EDA helps frame ambiguous case studies logically by uncovering hidden clues.

- **Segmentation & Cohort Analysis 🍰**: Averages hide the truth. Segmenting data reveals targeted insights (e.g., an AI tool's usage might be 10% overall but 60% within a specific medical specialty).
- **Funnel & Bottleneck Analysis ⏳**: Map the user journey step by step to find exact drop-off points (e.g., 100% of users log in, 90% submit a prompt, but only 15% click a reference link).
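
Both ideas reduce to a few lines of pandas. The sketch below uses made-up data shaped to match the examples above: a hypothetical `users` table where overall usage is 10% but one specialty (labelled `radiology` purely for illustration) sits at 60%, and a hypothetical `funnel` of user counts per step.

```python
import pandas as pd

# --- Segmentation: hypothetical per-user adoption flags ---
users = pd.DataFrame(
    {
        "specialty": ["radiology"] * 10 + ["other"] * 90,
        "used_tool": [True] * 6 + [False] * 4 + [True] * 4 + [False] * 86,
    }
)
overall_rate = users["used_tool"].mean()
rate_by_specialty = users.groupby("specialty")["used_tool"].mean()

# --- Funnel: hypothetical user counts at each step, in journey order ---
funnel = pd.Series({"login": 1000, "prompt": 900, "click_reference": 150})
share_of_start = funnel / funnel.iloc[0]    # fraction of all users reaching each step
step_conversion = funnel / funnel.shift(1)  # fraction retained from the previous step
biggest_drop = step_conversion.idxmin()     # step with the worst retention

print(overall_rate, rate_by_specialty, sep="\n")
print(share_of_start, step_conversion, biggest_drop, sep="\n")
```

Reporting both `share_of_start` and `step_conversion` is a deliberate choice: the first matches how funnels are usually quoted, while the second points at the specific step where users are lost.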

These techniques allow you to form precise, actionable hypotheses for real-world business problems.