# Exploratory Data Analysis 
Project: Legal Document Importance Prediction   
Goal: Understand document metadata & text patterns influencing Importance Score

NOTE: This notebook is for experimentation.  
Production code lives in the src/ directory.

## Imports & Visualization Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("ggplot")
# sns.set_theme() overrides most style settings, so "whitegrid" takes precedence over "ggplot"
sns.set_theme(style="whitegrid", font_scale=1.1)

## 1. Load Dataset

In [None]:
df = pd.read_csv("../data/raw/train.csv")

# Standardize column names
df.columns = df.columns.str.strip().str.replace(" ", "_")

df.head()


### Observations 
- The dataset contains structured legal document metadata along with rich textual fields such as Headline, Key Insights, and Reasoning.

- The target variable **Importance_Score** is present only in the training data, confirming this as a supervised regression problem.

## 2. Dataset Structure & Overview

In [None]:
df.shape

In [None]:
df.info()

- The dataset contains **20,624 documents with 10 columns**.
- Most columns are text-based (`object`), indicating a strong **NLP-driven modeling requirement**.
- Numeric columns are limited to `id` and the target `Importance_Score`.

## 3. Statistical Summary

In [None]:
df.describe(include="all").T

- The **Importance_Score** ranges from **0 to 92**, with a median of **12**, indicating most documents are low to moderately important.
- Text fields such as Headline, Reasoning, and Key Insights have very high uniqueness, confirming that documents are largely non-duplicative.
- Some categorical fields (Lead_Types, Power_Mentions) show repeated themes, suggesting potential for strong feature signals.

## 4. Missing Value Analysis

### 4.1 Count of Missing Values

In [None]:
df.isnull().sum()

- Metadata fields (`Lead_Types`, `Power_Mentions`, `Agencies`) contain substantial missing values.
- Core textual fields (`Headline`, `Reasoning`) are fully populated, which is ideal for NLP-based modeling.

### 4.2 Percentage of Missing Values

In [None]:
df.isnull().mean() * 100  # share of missing values per column, in percent

- `Agencies` is missing in **~68%** of records, indicating it should be treated as **optional contextual metadata** rather than a primary feature.
- `Key_Insights` and `Tags` have minimal missing values (<1%), making them reliable for text feature extraction.
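
The two views above can be combined into one table, and a sparse column such as `Agencies` can still contribute a simple "was it filled in?" flag. The sketch below uses a small made-up frame rather than the real data, and the `has_agencies` column name is an assumption for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Headline": ["a", "b", "c", "d"],
    "Agencies": ["FBI", None, None, None],
    "Tags": ["x; y", "z", None, "w"],
})

# Count and percentage side by side, sorted by the share of missing rows.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("pct_missing", ascending=False)

# Hypothetical indicator: keeps the presence/absence signal of a sparse column.
toy["has_agencies"] = toy["Agencies"].notna().astype(int)

print(missing)
```

Whether such a flag actually helps would need to be checked against the target; it is only one possible way to use a mostly empty column.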

## 5. Column-wise Understanding & Inspection

In [None]:
df.sample(5)

In [None]:
df.nunique().sort_values(ascending=False)

**Text Columns Inspection**

In [None]:
# Only the last expression in a cell is displayed, so print each sample explicitly
for col in ["Headline", "Key_Insights", "Reasoning"]:
    print(col, df[col].sample(5).tolist(), sep="\n", end="\n\n")

**List-like Columns Inspection**

In [None]:
# Print each sample; a bare expression is only displayed when it is last in the cell
for col in ["Lead_Types", "Power_Mentions", "Agencies", "Tags"]:
    print(col, df[col].dropna().sample(5).tolist(), sep="\n", end="\n\n")

- Sample documents range from **benign, boilerplate communications** to **highly actionable allegations involving powerful individuals**.
- The dataset mixes **low-value noise** and **high-impact investigative leads**, reinforcing the need for automated importance scoring.
- List-like fields are stored as **semicolon-separated strings**, requiring parsing during preprocessing.
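
A minimal sketch of the parsing step mentioned in the last observation, assuming items are separated by `;` with optional surrounding whitespace. The helper name `split_list_field` and the sample values are hypothetical.

```python
import pandas as pd

def split_list_field(series: pd.Series, sep: str = ";") -> pd.Series:
    """Turn 'a; b; c' strings into ['a', 'b', 'c']; missing values become []."""
    def parse(value):
        if pd.isna(value):
            return []
        # Strip whitespace around each item and drop empty fragments.
        return [item.strip() for item in str(value).split(sep) if item.strip()]
    return series.apply(parse)

# Made-up values covering the usual edge cases: missing, trailing separator, padding.
toy = pd.Series(["FBI; DOJ", None, "SEC;", " FBI "])
parsed = split_list_field(toy)
print(parsed.tolist())
```

Returning an empty list for missing values keeps later steps such as counting items free of special cases.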

## 6. Distribution and Correlation of Numerical Columns

### 6.1 Create Numerical Features (for EDA only)

In [None]:
df["headline_len"] = df["Headline"].fillna("").str.len()
df["insight_len"] = df["Key_Insights"].fillna("").str.len()

Text length features (`headline_len`, `insight_len`) were created to quantify document verbosity, which may correlate with investigative importance.

### 6.2 Histogram of Numerical Features

In [None]:
num_cols = ["Importance_Score", "headline_len", "insight_len"]

df[num_cols].hist(bins=30, figsize=(12,6))
plt.suptitle("Distribution of Numerical Features")
plt.show()

- Importance Score distribution is **right-skewed**, with most documents clustered at lower scores.
- Headline lengths are relatively compact, while Key Insights vary widely, suggesting richer information density in longer insights.

### 6.3 Boxplots of Numerical Features

In [None]:
plt.figure(figsize=(12,5))
sns.boxplot(data=df[num_cols])
plt.title("Boxplot of Numerical Features")
plt.show()

- Significant outliers exist for text length, especially in `insight_len`, which is expected in legal and investigative documents.
- Importance Score outliers represent **high-value documents**, which are crucial and should not be removed.

### 6.4 Correlation Heatmap

In [None]:
plt.figure(figsize=(6,4))
sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

- Importance Score shows a **strong positive Pearson correlation** with `insight_len` (≈0.74) and `headline_len` (≈0.63).
- This suggests that **longer, more detailed documents tend to be more investigatively valuable**, although correlation alone does not show that length drives importance.
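
Because the target is skewed, it can be worth comparing the default Pearson coefficient with a rank-based one. The sketch below uses made-up numbers (not the real data) to show how `DataFrame.corr(method="spearman")` responds to a monotonic but non-linear relationship.

```python
import pandas as pd

# Made-up values: the score rises with length, but not linearly.
toy = pd.DataFrame({
    "Importance_Score": [0, 5, 10, 12, 40, 92],
    "insight_len": [20, 60, 90, 150, 400, 2000],
})

pearson = toy.corr(method="pearson")    # measures linear association
spearman = toy.corr(method="spearman")  # measures monotonic (rank) association

print(pearson.loc["Importance_Score", "insight_len"])
print(spearman.loc["Importance_Score", "insight_len"])
```

If the two coefficients differ noticeably on the real data, that would hint at a non-linear relationship or at a few extreme documents dominating the Pearson value.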

## 7. Distribution of Categorical Columns

In [None]:
df["Lead_Types"].value_counts().head(10)

In [None]:
df["Power_Mentions"].value_counts().head(10)

In [None]:
df["Agencies"].value_counts().head(10)

In [None]:
df["Tags"].value_counts().head(10)

- Common Lead Types such as **legal exposure, financial flow, and sexual misconduct** dominate the dataset.
- Power Mentions are heavily skewed toward **high-profile individuals**, indicating strong predictive potential.
- Agency involvement (e.g., FBI, DOJ) appears frequently in high-importance contexts.
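
Note that `value_counts()` on a semicolon-separated column counts each full combination string, not each individual item. A sketch of counting individual items instead, using a made-up frame and assuming `;` is the separator:

```python
import pandas as pd

# Made-up values standing in for the real column.
toy = pd.DataFrame({
    "Lead_Types": [
        "legal exposure; financial flow",
        "financial flow",
        None,
        "legal exposure",
    ]
})

item_counts = (
    toy["Lead_Types"]
    .dropna()
    .str.split(";")              # one list of items per document
    .explode()                   # one row per item
    .str.strip()                 # remove padding around each item
    .loc[lambda s: s != ""]      # drop fragments left by trailing separators
    .value_counts()
)

print(item_counts)
```

On this toy data the raw column has three distinct strings, while the item-level view shows that each of the two themes appears twice.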

## 8. Target Variable Analysis

In [None]:
sns.histplot(df["Importance_Score"], bins=30, kde=True)
plt.title("Distribution of Importance Score")
plt.show()

In [None]:
df["Importance_Score"].describe()

- Importance Score is **not normally distributed** and heavily concentrated at lower values.
- High scores are rare, which aligns with real-world investigative settings where only a few documents are critical.
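
The skew can be quantified, and a `log1p` transform is one common option for compressing a long right tail. This is a sketch on made-up scores; whether a transformed target improves the final model is an assumption that would need validation.

```python
import numpy as np
import pandas as pd

# Made-up scores with a long right tail, loosely imitating the shape described above.
scores = pd.Series([0, 2, 5, 8, 12, 12, 15, 30, 60, 92], name="Importance_Score")

raw_skew = scores.skew()        # positive for a right-skewed distribution
log_scores = np.log1p(scores)   # log(1 + x) is defined at zero
log_skew = log_scores.skew()

# expm1 inverts log1p, which is needed to report errors on the original scale.
restored = np.expm1(log_scores)

print(f"skew raw={raw_skew:.2f}, log1p={log_skew:.2f}")
```

If a model is trained on the transformed target, predictions have to be mapped back with `np.expm1` before computing RMSE on the original scale.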

## 9. Bivariate Analysis

### 9.1 Numerical Features vs Target

In [None]:
sns.scatterplot(x="headline_len", y="Importance_Score", data=df)
plt.title("Headline Length vs Importance Score")
plt.show()

In [None]:
sns.scatterplot(x="insight_len", y="Importance_Score", data=df)
plt.title("Key Insights Length vs Importance Score")
plt.show()


- Both headline length and key insight length show **clear upward trends** with Importance Score.
- This validates the inclusion of text-length and NLP-derived features.

### 9.2 Categorical Features vs Target

In [None]:
df.groupby("Lead_Types")["Importance_Score"].mean().sort_values(ascending=False).head(10)

In [None]:
df.groupby("Agencies")["Importance_Score"].mean().sort_values(ascending=False).head(10)

- Documents involving **sexual misconduct, human trafficking, obstruction of justice, and law enforcement agencies** consistently have the highest average Importance Scores.
- This confirms that **content type and institutional involvement are strong relevance signals**.
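
Group means can be misleading when a category appears only a handful of times, and grouping on the raw string treats every combination as its own category. The sketch below, on made-up data, computes per-item means together with counts; `MIN_COUNT` is a hypothetical threshold.

```python
import pandas as pd

# Made-up data: "rare theme" appears once with a very high score.
toy = pd.DataFrame({
    "Lead_Types": ["obstruction; financial flow", "financial flow",
                   "obstruction", None, "rare theme"],
    "Importance_Score": [80, 10, 60, 5, 90],
})

exploded = (
    toy.dropna(subset=["Lead_Types"])
    .assign(Lead_Types=lambda d: d["Lead_Types"].str.split(";"))
    .explode("Lead_Types", ignore_index=True)   # one row per (document, item)
    .assign(Lead_Types=lambda d: d["Lead_Types"].str.strip())
)

summary = (
    exploded.groupby("Lead_Types")["Importance_Score"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

MIN_COUNT = 2  # hypothetical minimum support before trusting a mean
reliable = summary[summary["count"] >= MIN_COUNT]

print(summary)
print(reliable)
```

Here the single-document category tops the unfiltered ranking but drops out once a minimum count is required.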

## 10. Correlation Analysis for Numerical Features

In [None]:
df[["Importance_Score", "headline_len", "insight_len"]].corr()

- `insight_len` is the strongest numerical correlate of Importance Score, outperforming headline length.
- Combining textual richness with semantic understanding is likely to yield the best model performance.

## 11. Outlier Check

In [None]:
sns.boxplot(x=df["Importance_Score"])
plt.title("Outlier Check – Importance Score")
plt.show()

- High Importance Score outliers represent **critical investigative documents** and should be preserved.
- No corrective outlier treatment is applied, as removing them would harm the model’s objective.
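
To put a number on what the boxplot shows, the points beyond the upper whisker can be counted with the 1.5 × IQR rule, which is the default whisker convention for boxplots. A sketch on made-up scores that flags outliers without removing them:

```python
import pandas as pd

# Made-up scores; the real column would be df["Importance_Score"].
scores = pd.Series([0, 2, 5, 8, 12, 12, 15, 30, 60, 92], name="Importance_Score")

q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Flag, do not drop: high scores are the documents of interest.
is_high_outlier = scores > upper_fence

print(f"upper fence = {upper_fence}, flagged = {int(is_high_outlier.sum())}")
```

Keeping the flag as a boolean mask makes it easy to inspect these documents separately while leaving the training data intact.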

## 12. Feature Engineering Opportunities Identified

**Textual Features:**
Text length alone (`headline_len`, `insight_len`) already correlates strongly with Importance Score, so richer **NLP-based representations** (TF-IDF, n-grams, or embeddings) of Headline, Key Insights, and Reasoning are likely to be informative.

**Entity-Based Signals:**
Columns such as Power Mentions and Agencies frequently reference high-profile individuals and institutions. Features that could enhance predictive power include:
- Count of mentioned entities
- Presence of specific high-risk actors (e.g., politicians, law enforcement)
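
A sketch of both entity-based ideas on a made-up column. The names in the data, the `WATCHLIST` keywords, and the new column names are all hypothetical; a real list of high-risk actors would have to come from domain knowledge.

```python
import re
import pandas as pd

# Made-up values standing in for the real Power_Mentions column.
toy = pd.DataFrame({"Power_Mentions": ["Person A; Senator B", None, "Person C"]})

mentions = toy["Power_Mentions"].fillna("")

# Count of non-empty items in the semicolon-separated string.
toy["n_power_mentions"] = mentions.apply(
    lambda s: sum(1 for part in s.split(";") if part.strip())
)

# Hypothetical keywords; escaped so they are matched literally.
WATCHLIST = ["senator", "governor"]
pattern = "|".join(re.escape(word) for word in WATCHLIST)
toy["mentions_watchlist"] = (
    mentions.str.contains(pattern, case=False, regex=True).astype(int)
)

print(toy)
```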

**Categorical Semantics:**
Lead Types encode investigative themes such as **sexual misconduct, financial flow, and obstruction of justice**, which consistently align with higher Importance Scores.
These can be transformed into:
- Multi-label binary indicators
- Weighted category scores
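
For the multi-label indicators, pandas offers `Series.str.get_dummies`, which splits on a separator and returns one 0/1 column per distinct item. The sketch below uses made-up values and a hypothetical `lead_` prefix; whitespace around the separator is normalized first so that `"a; b"` and `"a;b"` map to the same columns.

```python
import pandas as pd

# Made-up values standing in for the real Lead_Types column.
toy = pd.Series(
    ["legal exposure; financial flow", "financial flow", None],
    name="Lead_Types",
)

# Normalize "a ; b" to "a;b" so items match regardless of padding.
cleaned = toy.fillna("").str.replace(r"\s*;\s*", ";", regex=True).str.strip()

# One indicator column per distinct item; missing rows become all zeros.
indicators = cleaned.str.get_dummies(sep=";").astype(int).add_prefix("lead_")

print(indicators)
```

With many rare lead types, this matrix can become wide, so grouping infrequent items into an "other" column may be worth considering.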

**Structural Metadata:**
Simple counts can act as proxies for document complexity and relevance:
- Number of tags
- Number of agencies involved
- Combined count of lead types

**Interaction Features:**
Combining text richness with metadata (e.g., long insights + multiple agencies) may capture nonlinear importance patterns that linear models miss.
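
One way to express the "long insights + multiple agencies" idea is a product term plus a thresholded flag. This is a sketch on made-up data; the `LONG_INSIGHT` cutoff and the new column names are hypothetical and would need tuning on the real distribution.

```python
import pandas as pd

long_text = "a much longer and more detailed insight text"

# Made-up rows standing in for the real columns.
toy = pd.DataFrame({
    "Key_Insights": ["short note", long_text, None],
    "Agencies": ["FBI", "FBI; DOJ", None],
})

toy["insight_len"] = toy["Key_Insights"].fillna("").str.len()
toy["n_agencies"] = toy["Agencies"].fillna("").apply(
    lambda s: sum(1 for part in s.split(";") if part.strip())
)

# Product term: grows only when both text length and agency count are large.
toy["len_x_agencies"] = toy["insight_len"] * toy["n_agencies"]

LONG_INSIGHT = 20  # hypothetical character cutoff
toy["long_and_multi_agency"] = (
    (toy["insight_len"] > LONG_INSIGHT) & (toy["n_agencies"] >= 2)
).astype(int)

print(toy[["insight_len", "n_agencies", "len_x_agencies", "long_and_multi_agency"]])
```

Explicit interaction terms mainly benefit linear models; tree-based models can often learn similar splits on their own.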

## 13. Summary

This exploratory analysis reveals that the dataset effectively mirrors real-world investigative workflows, where a small fraction of documents contain high-value evidence amid substantial background noise.

Key findings include:

- **Importance Score is highly right-skewed**, with most documents being low value and a minority carrying critical investigative relevance.
- **Textual depth is a strong predictor** of importance, particularly within the Key Insights field.
- **Contextual metadata** (lead types, agency involvement, and named entities) provides meaningful signals that amplify textual importance.
- Outliers in Importance Score represent **genuinely critical documents** and must be preserved during modeling.

Overall, the analysis confirms that a **hybrid approach combining NLP techniques with structured metadata features** is best suited to predict document importance accurately. This foundation sets the stage for effective preprocessing, feature extraction, and regression modeling aimed at minimizing RMSE.