# GDELT Data Validation

This notebook validates the GDELT news articles data ingested for MAG7 companies.

In [None]:
## What You'll Learn

**Data validation** = checking your data for errors before using it.

Think of it like spell-checking an essay. We want to catch:
- **Missing values** (blank cells)
- **Duplicates** (same data repeated)
- **Invalid data** (wrong formats, weird values)
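
As a rough illustration of these three checks, here is a minimal sketch on a tiny made-up table (not the GDELT data). The helper name `validation_summary` and the rule "a valid URL starts with `http://` or `https://`" are assumptions chosen for this example only.

```python
import pandas as pd


def validation_summary(frame: pd.DataFrame) -> dict:
    """Count missing cells, duplicate rows, and malformed URLs (hypothetical helper)."""
    urls = frame["url"].dropna()
    # Assumption for this sketch: a valid URL starts with http:// or https://
    looks_valid = urls.str.match(r"https?://")
    return {
        "rows": len(frame),
        "missing_cells": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "invalid_urls": int((~looks_valid).sum()),
    }


# Toy data for illustration only
toy = pd.DataFrame({
    "url": ["https://a.com/1", "https://a.com/1", "not-a-url", None],
    "title": ["x", "x", "y", "z"],
})

summary = validation_summary(toy)
print(summary)
```

Each count maps to one bullet above: one blank cell, one repeated row, and one value in the wrong format.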

Let's start by loading our data and taking a first look!

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

# Load the GDELT articles data
DATA_PATH = Path("../data/raw/gdelt_articles.csv")
df = pd.read_csv(DATA_PATH, parse_dates=["seendate"])
print(f"Loaded {len(df):,} rows from {DATA_PATH}")

Loaded 1,400 rows from ../data/raw/gdelt_articles.csv


## 1. Basic Data Inspection

**RULE #1: Always look at your data first!**

Before doing any analysis, you need to understand:
1. **How big** is your data? (rows and columns)
2. **What columns** do you have?
3. **What types** of data are in each column? (numbers, text, dates)
4. **What does the data look like?** (preview some rows)

In [2]:
# Preview first few rows
df.head()

Unnamed: 0,query,seendate,url,title,description,language,domain,sourceCountry,socialimage,company,ticker
0,"(""Apple"" OR AAPL) (stock OR shares OR earnings...",2026-01-16 21:00:00+00:00,https://www.cosmeticsandtoiletries.com/cosmeti...,The Longevity and Slow - Aging Movement Gets a...,,English,cosmeticsandtoiletries.com,,https://img.cosmeticsandtoiletries.com/mindful...,Apple,AAPL
1,"(""Apple"" OR AAPL) (stock OR shares OR earnings...",2026-01-16 20:45:00+00:00,https://www.businessinsider.com/apple-losing-g...,Apple Is Losing Its Grip on the World Tech Sup...,,English,businessinsider.com,,https://i.insider.com/696a837aa645d11881878256...,Apple,AAPL
2,"(""Apple"" OR AAPL) (stock OR shares OR earnings...",2026-01-16 20:45:00+00:00,https://finance.yahoo.com/news/asml-soars-abov...,ASML Soars Above $500 Billion Value on TSMC Up...,,English,finance.yahoo.com,,https://s.yimg.com/ny/api/res/1.2/2pyxQMi5YKpX...,Apple,AAPL
3,"(""Apple"" OR AAPL) (stock OR shares OR earnings...",2026-01-16 20:30:00+00:00,https://www.androidpolice.com/the-excellent-ga...,The excellent Galaxy Buds3 Pro deserve way mor...,,English,androidpolice.com,,https://static0.anpoimages.com/wordpress/wp-co...,Apple,AAPL
4,"(""Apple"" OR AAPL) (stock OR shares OR earnings...",2026-01-16 20:00:00+00:00,https://markets.financialcontent.com/stocks/ar...,FinancialContent - The Great Rebalancing : Sma...,,English,markets.financialcontent.com,,https://marketminute.ghost.io/content/images/s...,Apple,AAPL


In [None]:
df.shape  # How many rows and columns we have

(1400, 11)

In [5]:
# Check all column names
df.columns

Index(['query', 'seendate', 'url', 'title', 'description', 'language',
       'domain', 'sourceCountry', 'socialimage', 'company', 'ticker'],
      dtype='object')

In [6]:
# Check each column's non-null count and dtype to spot null or missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype              
---  ------         --------------  -----              
 0   query          1400 non-null   object             
 1   seendate       1400 non-null   datetime64[ns, UTC]
 2   url            1400 non-null   object             
 3   title          1400 non-null   object             
 4   description    0 non-null      float64            
 5   language       1400 non-null   object             
 6   domain         1400 non-null   object             
 7   sourceCountry  0 non-null      float64            
 8   socialimage    1200 non-null   object             
 9   company        1400 non-null   object             
 10  ticker         1400 non-null   object             
dtypes: datetime64[ns, UTC](1), float64(2), object(8)
memory usage: 120.4+ KB


In [7]:
df.isnull().sum()  # Count the missing values in each column

query               0
seendate            0
url                 0
title               0
description      1400
language            0
domain              0
sourceCountry    1400
socialimage       200
company             0
ticker              0
dtype: int64

In [9]:
# Percentage of missing values in each column, relative to the total number of rows

(df.isnull().sum()/len(df)) * 100

query              0.000000
seendate           0.000000
url                0.000000
title              0.000000
description      100.000000
language           0.000000
domain             0.000000
sourceCountry    100.000000
socialimage       14.285714
company            0.000000
ticker             0.000000
dtype: float64

*Every `description` and `sourceCountry` value is missing, and `socialimage` is missing in about 14% of rows (200 of 1,400). The two fully empty columns carry no information, so we drop them next.*
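
Fully empty columns can also be found programmatically instead of being named by hand. The sketch below shows one way to do that on a small made-up frame (the values are illustrative, not GDELT data); the variable names are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "description": [np.nan, np.nan, np.nan],  # entirely missing
    "socialimage": ["img1", None, "img3"],    # partly missing
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean() * 100

# Columns where every value is missing
empty_cols = missing_pct[missing_pct == 100].index.tolist()

toy_clean = toy.drop(columns=empty_cols)
print(empty_cols)
print(list(toy_clean.columns))
```

`toy.dropna(axis=1, how="all")` would give the same result in one call; the longer form keeps the list of dropped columns around for logging.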

In [11]:
# Drop the two columns that are entirely empty
df_clean = df.drop(columns=['description', 'sourceCountry'])
print(f"Before: {df.shape}")
print(f"After: {df_clean.shape}")

Before: (1400, 11)
After: (1400, 9)


In [16]:

# Count rows that are exact copies of an earlier row
print(f"Duplicate rows: {df_clean.duplicated().sum()}")

Duplicate rows: 1050
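
Because `duplicated()` keeps the first occurrence by default, a count of 1,050 means only 350 of the 1,400 rows are unique. Before dropping anything, it helps to look at the repeated rows and to decide which columns define "the same article". The sketch below uses a small made-up frame (not the GDELT data), and treating `url` as the article key is an assumption for this example.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "url": ["u1", "u1", "u2", "u2", "u3"],
    "ticker": ["AAPL", "AAPL", "AAPL", "MSFT", "MSFT"],
})

# Rows identical to an earlier row across every column
full_dupes = int(toy.duplicated().sum())

# Rows whose url already appeared, even if other columns differ
url_dupes = int(toy.duplicated(subset=["url"]).sum())

# Show every member of each duplicate group, not just the repeats
dupe_groups = toy[toy.duplicated(keep=False)]

# Keep the first copy of each fully identical row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(full_dupes, url_dupes, len(deduped))
```

Here the same URL appears under two tickers, so deduplicating on `url` alone would discard a row that a full-row comparison keeps. Which rule is right depends on how the data will be used.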
