# 01. Data Loading and Setup

This notebook handles:
1. Loading the raw dataset
2. Data quality checks (missing values, duplicates, outliers)
3. Data cleaning
4. Feature engineering (temporal features, headline metrics)
5. Saving processed data for use in other notebooks


In [2]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

# Add utils to path
sys.path.append('.')
from utils import (
    load_data, 
    check_data_quality, 
    clean_data, 
    preprocess_data,
    save_processed_data,
    setup_plotting_style
)

# Setup plotting style
setup_plotting_style()

print("Libraries imported successfully!")


Plotting style configured
Libraries imported successfully!


## 1.1 Data Loading


In [3]:
# Load the raw dataset
file_path = '../data/raw_analyst_ratings.csv'

# Option 1: Load full dataset
df = load_data(file_path)

# Option 2: Load sample for faster initial exploration (uncomment to use)
# df = load_data(file_path, sample_size=100000)

print("\nDataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")


Loading dataset...
Loaded 1,407,328 rows
Initial shape: (1407328, 6)

Dataset loaded successfully!
Shape: (1407328, 6)
Columns: ['Unnamed: 0', 'headline', 'url', 'publisher', 'date', 'stock']
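
The `Unnamed: 0` column in the output above is what pandas typically produces when a CSV was written with its index and then read back without `index_col`. The sketch below shows one way a loader could avoid that and support a row limit. `load_csv_sketch` is a hypothetical helper written for illustration, not the `load_data` function from `utils`, whose implementation is not shown in this notebook, and the in-memory CSV is made-up data.

```python
import io
import pandas as pd

def load_csv_sketch(source, sample_size=None):
    """Hypothetical loader: read a CSV and use its first column as the index."""
    # index_col=0 prevents a stray 'Unnamed: 0' column when the file was
    # saved with its index; nrows reads only the first N rows (not a random sample).
    return pd.read_csv(source, index_col=0, nrows=sample_size)

# Tiny in-memory stand-in for the real file (illustrative data only)
csv_text = (
    ",headline,stock\n"
    "0,Example headline one,A\n"
    "1,Example headline two,A\n"
    "2,Example headline three,B\n"
)
sample = load_csv_sketch(io.StringIO(csv_text), sample_size=2)
print(sample.shape)
print(list(sample.columns))
```

Note that `nrows` takes the head of the file, so a sample loaded this way may cover only a few tickers if the file is sorted by `stock`.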


In [4]:
# Display first few rows
print("First 5 rows:")
df.head()

First 5 rows:


| | Unnamed: 0 | headline | url | publisher | date | stock |
|---|---|---|---|---|---|---|
| 0 | 0 | Stocks That Hit 52-Week Highs On Friday | https://www.benzinga.com/news/20/06/16190091/s... | Benzinga Insights | 2020-06-05 10:30:54-04:00 | A |
| 1 | 1 | Stocks That Hit 52-Week Highs On Wednesday | https://www.benzinga.com/news/20/06/16170189/s... | Benzinga Insights | 2020-06-03 10:45:20-04:00 | A |
| 2 | 2 | 71 Biggest Movers From Friday | https://www.benzinga.com/news/20/05/16103463/7... | Lisa Levin | 2020-05-26 04:30:07-04:00 | A |
| 3 | 3 | 46 Stocks Moving In Friday's Mid-Day Session | https://www.benzinga.com/news/20/05/16095921/4... | Lisa Levin | 2020-05-22 12:45:06-04:00 | A |
| 4 | 4 | B of A Securities Maintains Neutral on Agilent... | https://www.benzinga.com/news/20/05/16095304/b... | Vick Meyer | 2020-05-22 11:38:59-04:00 | A |


In [5]:
# Basic info about the dataset
print("Dataset Info:")
df.info()


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1407328 entries, 0 to 1407327
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Unnamed: 0  1407328 non-null  int64 
 1   headline    1407328 non-null  object
 2   url         1407328 non-null  object
 3   publisher   1407328 non-null  object
 4   date        1407328 non-null  object
 5   stock       1407328 non-null  object
dtypes: int64(1), object(5)
memory usage: 64.4+ MB


## 1.2 Data Quality Checks


In [6]:
# Perform comprehensive data quality checks
quality_report = check_data_quality(df)


DATA QUALITY CHECKS

1. MISSING VALUES:
✓ No missing values found!

2. DUPLICATE RECORDS:
  • Duplicate rows: 0 (0.00%)
  • Duplicate headlines: 561,558 (39.90%)
  • Duplicate URLs: 523,899 (37.23%)

3. DATA TYPES:
Unnamed: 0     int64
headline      object
url           object
publisher     object
date          object
stock         object

4. DATE RANGE:
  • Valid dates: 55,987 (3.98%)
  • Invalid dates: 1,351,341 (96.02%)
  • Date range: 2011-04-27 21:01:48-04:00 to 2020-06-11 17:12:35-04:00

5. OUTLIER DETECTION (IQR Method):
  ✓ No significant outliers detected in numeric columns
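
For reference, the duplicate and IQR checks reported above can be sketched as follows. This is an assumption about how such checks are commonly written, not the code inside `check_data_quality`; `iqr_outlier_mask` and the toy data are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data for illustration only
toy = pd.DataFrame({
    "headline": ["a", "a", "b", "c", "d", "e"],
    "value": [10, 11, 12, 13, 14, 100],
})

# duplicated() marks every repeat after the first occurrence
dup_headlines = int(toy["headline"].duplicated().sum())

# Values flagged by the IQR rule
outliers = toy.loc[iqr_outlier_mask(toy["value"]), "value"].tolist()

print(dup_headlines, outliers)
```

In this dataset the only numeric column is `Unnamed: 0`, so the outlier check says little about the data itself.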


## 1.3 Data Cleaning

**Cleaning Strategy:**
- **Missing Values:** Choose one:
  - `'report'` - Only report (default, recommended for EDA)
  - `'drop'` - Drop rows with missing values
  - `'fill'` - Fill missing values (numeric: median, text: 'Unknown')
- **Duplicates:** Remove duplicate rows
- **Dates:** Convert to datetime; values that fail to parse become `NaT`
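
The three missing-value strategies listed above could look roughly like this. `handle_missing_sketch` is a hypothetical stand-in that mirrors the described behavior; the actual logic inside `clean_data` may differ.

```python
import numpy as np
import pandas as pd

def handle_missing_sketch(df, strategy="report"):
    """Hypothetical version of the 'report' / 'drop' / 'fill' strategies."""
    if strategy == "report":
        # Only print per-column counts; the data is returned unchanged
        print(df.isna().sum())
        return df
    if strategy == "drop":
        return df.dropna()
    if strategy == "fill":
        out = df.copy()
        for col in out.columns:
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                # Text columns only; datetime columns would need separate handling
                out[col] = out[col].fillna("Unknown")
        return out
    raise ValueError(f"Unknown strategy: {strategy!r}")

# Toy data for illustration only
toy = pd.DataFrame({"score": [1.0, np.nan, 3.0], "publisher": ["X", None, "Y"]})
filled = handle_missing_sketch(toy, "fill")
dropped = handle_missing_sketch(toy, "drop")
```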


In [7]:
# Clean the data
# Adjust parameters based on your data quality findings:
# - handle_missing: 'report', 'drop', or 'fill'
# - remove_duplicates: True or False

df_clean = clean_data(
    df, 
    remove_duplicates=True,  # Remove duplicate rows
    handle_missing='report',  # Options: 'report', 'drop', 'fill'
    fix_dates=True           # Convert dates and handle invalid ones
)


DATA CLEANING

1. Converting date column...
   ⚠ Found 1,351,341 invalid dates (converted to NaT)

3. Removing duplicates...
   → Removed 0 duplicate rows

CLEANING SUMMARY:
  Original shape: (1407328, 6)
  Cleaned shape:  (1407328, 6)
  Rows removed:   0
  Columns:        6 → 6


## 1.4 Feature Engineering


In [8]:
# Create derived features (temporal features, headline metrics, etc.)
df_processed = preprocess_data(df_clean)



FEATURE ENGINEERING

2. Calculating headline metrics...
   ✓ Headline length metrics created

Final shape: (1407328, 8)
Date range: 2011-04-27 21:01:48-04:00 to 2020-06-11 17:12:35-04:00
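
The two headline metrics can be reproduced with pandas string methods. This is a sketch that is consistent with the sample values shown in this notebook, not necessarily the exact code in `preprocess_data`.

```python
import pandas as pd

headlines = pd.DataFrame({
    "headline": [
        "Stocks That Hit 52-Week Highs On Friday",
        "71 Biggest Movers From Friday",
    ]
})

# Character count, including spaces and punctuation
headlines["headline_length"] = headlines["headline"].str.len()

# Whitespace-separated token count
headlines["headline_word_count"] = headlines["headline"].str.split().str.len()

print(headlines[["headline_length", "headline_word_count"]])
```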


In [9]:
# Verify the processed data
print("Processed Data Summary:")
print(f"Shape: {df_processed.shape}")
print("\nNew columns created:")
new_cols = [col for col in df_processed.columns if col not in df.columns]
print(new_cols)
print("\nSample of processed data:")
df_processed.head()


Processed Data Summary:
Shape: (1407328, 8)

New columns created:
['headline_length', 'headline_word_count']

Sample of processed data:


| | Unnamed: 0 | headline | url | publisher | date | stock | headline_length | headline_word_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Stocks That Hit 52-Week Highs On Friday | https://www.benzinga.com/news/20/06/16190091/s... | Benzinga Insights | 2020-06-05 10:30:54-04:00 | A | 39 | 7 |
| 1 | 1 | Stocks That Hit 52-Week Highs On Wednesday | https://www.benzinga.com/news/20/06/16170189/s... | Benzinga Insights | 2020-06-03 10:45:20-04:00 | A | 42 | 7 |
| 2 | 2 | 71 Biggest Movers From Friday | https://www.benzinga.com/news/20/05/16103463/7... | Lisa Levin | 2020-05-26 04:30:07-04:00 | A | 29 | 5 |
| 3 | 3 | 46 Stocks Moving In Friday's Mid-Day Session | https://www.benzinga.com/news/20/05/16095921/4... | Lisa Levin | 2020-05-22 12:45:06-04:00 | A | 44 | 7 |
| 4 | 4 | B of A Securities Maintains Neutral on Agilent... | https://www.benzinga.com/news/20/05/16095304/b... | Vick Meyer | 2020-05-22 11:38:59-04:00 | A | 87 | 14 |
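
Only the two headline columns were added in this run; no temporal columns appear in the output. Once the `date` column parses, temporal features could be derived along these lines. The feature names `hour`, `day_of_week`, and `year` are illustrative assumptions, not columns produced by `preprocess_data`.

```python
import pandas as pd

# Two timestamps taken from the sample rows, parsed to UTC
dates = pd.to_datetime(
    pd.Series(["2020-06-05 10:30:54-04:00", "2020-05-26 04:30:07-04:00"]),
    utc=True,
)

# Features are computed in UTC here; dates.dt.tz_convert(...) could be used
# first if market-local hours are wanted.
features = pd.DataFrame({
    "hour": dates.dt.hour,
    "day_of_week": dates.dt.day_name(),
    "year": dates.dt.year,
})
print(features)
```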


## 1.5 Save Processed Data

Save the cleaned and preprocessed data for use in other notebooks.


In [11]:
# Save processed data
# This allows other notebooks to load preprocessed data quickly
save_processed_data(df_processed, file_path='../data/processed/df_processed.pkl')

# Alternative: save as Parquet (compact, readable from other tools; requires pyarrow or fastparquet)
# df_processed.to_parquet('../data/processed/df_processed.parquet')
# print("Saved as parquet file")


Saved processed data to ../data/processed/df_processed.pkl
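
A quick round-trip check can confirm that a saved file restores the same values and dtypes. The sketch below uses pandas' own `to_pickle` and `read_pickle` on a small made-up frame and a temporary directory; it does not call `save_processed_data`, whose internals are not shown here.

```python
import os
import tempfile
import pandas as pd

# Small illustrative frame with text, datetime, and integer columns
df_demo = pd.DataFrame({
    "headline": ["Example headline"],
    "date": pd.to_datetime(["2020-06-05 10:30:54-04:00"], utc=True),
    "headline_length": [16],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df_demo.pkl")
    df_demo.to_pickle(path)
    restored = pd.read_pickle(path)

# Values and dtypes should survive the round trip
same_values = restored.equals(df_demo)
same_dtypes = restored.dtypes.equals(df_demo.dtypes)
print(same_values, same_dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.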
