# MScFE 600 Financial Data - Task 1: Data Quality Analysis

**Course**: MScFE 600 Financial Data  
**Institution**: WorldQuant University  
**Date**: September 2025

---

This notebook demonstrates examples of poor quality financial data, examining both structured and unstructured datasets to understand how they fail to meet data quality standards. The analysis employs KYC (Know Your Customer) data as the primary example for structured data examination, whilst exploring financial news and social media content to illustrate unstructured data quality challenges.

The investigation focuses on identifying characteristics of poor quality data across different formats, recognising common issues that compromise data integrity, and understanding the broader implications for financial decision-making processes. Using synthetic examples modelled on real-world scenarios, we examine how data quality failures can cascade through financial institutions, affecting everything from regulatory compliance to algorithmic trading systems.

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set display options for better output formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Poor Quality Structured Data: KYC Dataset Analysis

Financial institutions rely heavily on structured KYC datasets for regulatory compliance and risk management. These datasets must maintain the highest standards of accuracy, completeness, and consistency to support critical business decisions and meet regulatory requirements.

In [None]:
# Create a poor quality KYC dataset with multiple data quality issues
def create_poor_kyc_data():
    """
    Creates a KYC dataset with intentional data quality issues to demonstrate
    poor data practices in financial institutions.
    """
    
    # Intentionally problematic data
    kyc_data = {
        'customer_id': [1001, 1002, 1002, 1004, '', 1006, 1007, None, 1009, 1010],  # Duplicates, missing
        'first_name': ['John', 'JANE', 'jane', 'Bob', '123Invalid', '', 'Alice', 'Carol', 'Dave', 'Frank'],
        'last_name': ['Smith', 'DOE', 'doe', 'Johnson', 'Name456', 'Missing', '', 'Brown', 'Wilson', 'Taylor'],
        'date_of_birth': ['1985-01-15', '1990/02/20', '32-12-1988', '1975-13-40', 'Invalid', 
                         '2025-01-01', '1950-01-01', '', '1980-06-15', '1992-03-10'],
        'email': ['john@email.com', 'JANE@GMAIL.COM', 'invalid-email', 'bob@', '', 
                 'alice@bank.com', 'carol@email', 'duplicate@test.com', 'dave@email.com', 'frank@email.com'],
        'phone': ['123-456-7890', '+1-234-567-8901', '12345', '999-999-9999', '', 
                 '111-111-1111', 'INVALID', '123-456-7890', '+44-20-1234-5678', '555-0123'],
        'annual_income': [50000, 150000, -25000, 999999999, 0, None, '', 'High', 75000, 120000],
        'country': ['USA', 'usa', 'United States', 'US', '', 'Canada', 'UK', 'UNKNOWN', 'Japan', 'Germany'],
        'risk_score': [3.5, 7.2, 15.5, -2.0, None, '', 'Low', 8.9, 4.1, 6.7],  # Out of range values
        'kyc_status': ['Verified', 'verified', 'PENDING', 'Rejected', '', 'Unknown', 'verified', 'Pending', 'Approved', 'verified']
    }
    
    return pd.DataFrame(kyc_data)

# Create the problematic dataset
poor_kyc_df = create_poor_kyc_data()

print("Poor Quality KYC Dataset:")
print("=" * 50)
print(poor_kyc_df)
print("\nDataset Info:")
poor_kyc_df.info()

## Analysis of Poor Quality Structured Data

The KYC dataset above demonstrates multiple data quality failures that violate fundamental properties required for reliable financial data management.

**Accuracy Failures** pervade the dataset. Dates such as '32-12-1988' and '1975-13-40' violate basic calendar constraints, whichever day/month ordering is assumed. A birth date of '2025-01-01' is equally invalid: depending on when the record is read, it either lies in the future or describes an infant, and neither is plausible for a KYC customer. A negative income of -25000 is unrealistic for KYC purposes and suggests a data entry error or system malfunction, whilst 999999999 looks like a placeholder rather than a real figure. Risk scores of 15.5 and -2.0 fall outside the range that a typical 1-10 scale allows, which indicates that validation controls are missing or not enforced.
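
The accuracy checks described above can be sketched with pandas coercion. This is a minimal illustration rather than a production rule set: the `as_of` reference date, the minimum age of 18, and the 1-10 risk scale are assumptions made for the example, and the sample frame only mirrors a few of the problem values.

```python
import pandas as pd

# Small sample mirroring problem values from the KYC dataset
sample = pd.DataFrame({
    "date_of_birth": ["1985-01-15", "32-12-1988", "1975-13-40", "2025-01-01"],
    "annual_income": [50000, -25000, 999999999, "High"],
    "risk_score": [3.5, 15.5, -2.0, "Low"],
})

# Strict ISO parsing: anything that is not a real YYYY-MM-DD date becomes NaT
dob = pd.to_datetime(sample["date_of_birth"], format="%Y-%m-%d", errors="coerce")

# Non-numeric entries such as 'High' or 'Low' become NaN
income = pd.to_numeric(sample["annual_income"], errors="coerce")
risk = pd.to_numeric(sample["risk_score"], errors="coerce")

# Assumed reference date and minimum customer age (both hypothetical)
as_of = pd.Timestamp("2025-09-30")
latest_valid_dob = as_of - pd.DateOffset(years=18)

accuracy_flags = pd.DataFrame({
    "dob_unparseable": dob.isna(),
    "dob_too_recent": dob > latest_valid_dob,
    "income_invalid": income.isna() | (income < 0),
    "risk_out_of_range": risk.isna() | ~risk.between(1, 10),
})
print(accuracy_flags)
```

No upper bound is applied to income here, so 999999999 passes this check; a real rule set would need an agreed plausibility ceiling.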

**Completeness Problems** manifest through missing customer IDs, the fundamental identifier that enables record linkage and customer identification. Empty strings and null values in critical fields such as names and contact information render customer identification impossible, whilst missing values in regulatory fields like income and risk_score compromise compliance reporting capabilities.
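
One way to measure completeness is sketched below. It assumes that empty strings and placeholder tokens should count as missing alongside `None`; the `PLACEHOLDERS` set and the `is_missing` helper are hypothetical names introduced for this example, not part of any library.

```python
import pandas as pd

sample = pd.DataFrame({
    "customer_id": [1001, "", None, 1009],
    "first_name": ["John", "", "Carol", "Dave"],
    "country": ["USA", "", "UNKNOWN", "Japan"],
})

# Assumed list of tokens that carry no information (hypothetical)
PLACEHOLDERS = {"", "unknown", "missing", "invalid", "n/a"}

def is_missing(value):
    """True for None/NaN, empty strings, and placeholder tokens."""
    if not isinstance(value, str):
        return bool(pd.isna(value))
    return value.strip().lower() in PLACEHOLDERS

# Apply the check cell by cell, one column at a time
missing_mask = sample.apply(lambda column: column.map(is_missing).astype(bool))

# isnull() alone only sees None/NaN, so it overstates completeness
naive_completeness = 1 - sample.isnull().mean()
completeness = 1 - missing_mask.mean()

print(pd.DataFrame({"naive": naive_completeness, "with_placeholders": completeness}))
```

The gap between the two columns shows how much a plain null count can hide when source systems write empty strings or sentinel text instead of nulls.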

**Consistency Violations** appear throughout the dataset. Customer names mix upper and lower case arbitrarily ('JANE DOE' versus 'jane doe'). Country representations vary widely, with 'USA', 'usa', 'United States', and 'US' all denoting the same entity, which complicates aggregation and reporting. Status values show the same problem: 'Verified' and 'verified', or 'PENDING' and 'Pending', are different capitalisations of what should be a single category, whilst 'Approved' and 'Unknown' do not clearly belong to the same vocabulary at all.
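
A minimal sketch of how such values might be normalised follows. The alias table and the set of valid statuses are assumptions chosen for illustration; in practice they would come from a reference standard (for example ISO country codes) and the institution's own KYC workflow.

```python
import pandas as pd

sample = pd.DataFrame({
    "country": ["USA", "usa", "United States", "US", "UK", "Canada"],
    "kyc_status": ["Verified", "verified", "PENDING", "Pending", "Approved", "Unknown"],
})

# Assumed alias table mapping lower-cased variants to two-letter codes
COUNTRY_ALIASES = {
    "usa": "US", "us": "US", "united states": "US",
    "uk": "GB", "canada": "CA",
}
# Assumed controlled vocabulary for the status field
VALID_STATUSES = {"verified", "pending", "rejected"}

# Case-fold and trim first, then map; unmapped values become NaN
country_clean = sample["country"].str.strip().str.lower().map(COUNTRY_ALIASES)

# Keep only statuses in the vocabulary; everything else is left for review
status_clean = sample["kyc_status"].str.strip().str.lower()
status_valid = status_clean.where(status_clean.isin(VALID_STATUSES))

print("Country variants before/after:", sample["country"].nunique(), country_clean.nunique())
print(status_valid.tolist())
```

'Approved' is deliberately not mapped to 'verified': whether the two mean the same thing is a business decision, so the sketch flags it rather than guessing.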

**Uniqueness Failures** occur because customer_id 1002 appears twice, attached to 'JANE DOE' and 'jane doe' records whose other fields disagree. Such duplicates are a fundamental data integrity issue that would corrupt customer relationship management systems and risk assessment calculations.
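
The duplicate-identifier check can be sketched as follows. Passing `keep=False` to `duplicated` returns every row that shares an ID rather than only the later repeats, which makes the conflicting records easier to compare side by side. The case-folded name comparison is an illustrative heuristic, not a full entity-resolution method.

```python
import pandas as pd

sample = pd.DataFrame({
    "customer_id": [1001, 1002, 1002, 1004],
    "first_name": ["John", "JANE", "jane", "Bob"],
    "last_name": ["Smith", "DOE", "doe", "Johnson"],
})

# keep=False marks all rows involved in a collision, not just the second one
shared_id = sample[sample["customer_id"].duplicated(keep=False)]

# Heuristic: do the colliding rows agree on the name once case is ignored?
name_key = shared_id["first_name"].str.lower() + " " + shared_id["last_name"].str.lower()
same_name = name_key.groupby(shared_id["customer_id"]).nunique().eq(1)

print(shared_id)
print(same_name)
```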

This data fails comprehensively to meet regulatory requirements for KYC compliance, creating significant operational and compliance risks that could result in regulatory penalties, operational failures, and complete breakdown of customer identification processes.

In [None]:
# Data Quality Assessment - Quantitative Analysis
def analyze_data_quality(df):
    """
    Performs comprehensive data quality analysis on the KYC dataset
    """
    print("DATA QUALITY ASSESSMENT REPORT")
    print("=" * 60)
    
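    # 1. Missing Values Analysis
    # Treat empty or whitespace-only strings as missing; isnull() alone ignores them
    print("\n1. MISSING VALUES ANALYSIS:")
    normalised = df.replace(r'^\s*$', np.nan, regex=True)
    missing_stats = normalised.isnull().sum()
    missing_pct = (missing_stats / len(df)) * 100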
    missing_df = pd.DataFrame({
        'Missing_Count': missing_stats,
        'Missing_Percentage': missing_pct
    })
    print(missing_df[missing_df['Missing_Count'] > 0])
    
    # 2. Duplicate Analysis
    print("\n2. DUPLICATE ANALYSIS:")
    duplicates = df.duplicated().sum()
    print(f"Total duplicate rows: {duplicates}")
    
    # Check for duplicate customer IDs
    id_duplicates = df['customer_id'].duplicated().sum()
    print(f"Duplicate customer IDs: {id_duplicates}")
    
    # 3. Data Type Issues
    print("\n3. DATA TYPE ISSUES:")
    print("Current data types:")
    print(df.dtypes)
    
    # 4. Inconsistency Analysis
    print("\n4. INCONSISTENCY ANALYSIS:")
    
    # Country field inconsistencies
    country_values = df['country'].value_counts()
    print(f"Country field unique values: {len(country_values)}")
    print("Country variations:", country_values.index.tolist())
    
    # Status field inconsistencies  
    status_values = df['kyc_status'].value_counts()
    print(f"Status field unique values: {len(status_values)}")
    print("Status variations:", status_values.index.tolist())
    
    return missing_df

# Run the analysis
quality_report = analyze_data_quality(poor_kyc_df)

## Poor Quality Unstructured Data: Financial News and Social Media

Unstructured financial data presents unique challenges for quality assessment, particularly when sourcing information from diverse channels including traditional financial media, social platforms, and user-generated content for sentiment analysis applications.

In [None]:
# Create poor quality unstructured financial news data
def create_poor_financial_news_data():
    """
    Creates financial news/social media dataset with quality issues
    for sentiment analysis applications
    """
    
    poor_news_data = [
        {
            'source': 'Twitter',
            'timestamp': '2024-15-45 25:99:99',  # Invalid timestamp
            'content': 'AAPL to the moon!!! 🚀🚀🚀 #stocks #yolo BUY BUY BUY!!!',
            'author': '@anonymous123',
            'sentiment_label': 'VERY_POSITIVE'
        },
        {
            'source': 'Bloomberg',
            'timestamp': '',  # Missing timestamp
            'content': 'The Federal Reserve announced... [CONTENT TRUNCATED] ...rate decisions will impact market volatility significantly.',
            'author': 'Unknown Author',
            'sentiment_label': None
        },
        {
            'source': 'Reddit',
            'timestamp': 'yesterday',  # Ambiguous timestamp
            'content': 'Tesla stock is going to crash because Elon Musk tweeted something about aliens 👽. My neighbor told me he heard from his friend that works at Tesla that they are going to announce bankruptcy next week. This is not financial advice but you should probably sell everything now!!!',
            'author': 'deleted_user',
            'sentiment_label': 'negative'
        },
        {
            'source': 'WSJ',
            'timestamp': '2024-09-30T14:30:00Z',
            'content': '<html><body><div class="article">Market analysis shows that Q3 earnings for tech sector have exceeded expectations by 15% on average. However, concerns about inflation persist among investors.</div></body></html>',
            'author': 'Jane Smith, Financial Analyst',
            'sentiment_label': 'NEUTRAL'
        },
        {
            'source': 'Discord',
            'timestamp': '2024-09-30T15:45:00Z',
            'content': 'guys i put my entire life savings into crypto and now im broke 😭 dont do what i did my wife left me and took the kids',
            'author': 'crypto_king_2024',
            'sentiment_label': 'EXTREMELY_NEGATIVE'
        },
        {
            'source': 'Financial Times',
            'timestamp': '2024-09-30T16:00:00Z',
            'content': 'The S&P 500 index closed higher today, driven by strong performance in the technology sector. Analysts remain cautiously optimistic about Q4 prospects despite ongoing geopolitical tensions.',
            'author': 'Dr. Michael Chen, Senior Market Strategist',
            'sentiment_label': 'positive'
        },
        {
            'source': 'TikTok',
            'timestamp': '2024-09-30T17:15:00Z',
            'content': 'OMG guys!! I just discovered this AMAZING trading strategy that made me $1000 in 5 minutes!!! Link in bio!!! #daytrading #getrich #financialfreedom #notascam',
            'author': '@trading_guru_18',
            'sentiment_label': '🚀🚀🚀'
        },
        {
            'source': '',  # Missing source
            'timestamp': '2024-09-30T18:00:00Z',
            'content': '',  # Empty content
            'author': '',
            'sentiment_label': ''
        }
    ]
    
    return poor_news_data

# Create and display the poor quality unstructured data
poor_news = create_poor_financial_news_data()

print("Poor Quality Financial News/Social Media Dataset:")
print("=" * 60)

for i, article in enumerate(poor_news, 1):
    print(f"\nArticle {i}:")
    print(f"Source: {article['source']}")
    print(f"Timestamp: {article['timestamp']}")
    print(f"Author: {article['author']}")
    print(f"Content: {article['content'][:100]}{'...' if len(article['content']) > 100 else ''}")
    print(f"Sentiment: {article['sentiment_label']}")
    print("-" * 40)

## Analysis of Poor Quality Unstructured Data

The financial news and social media dataset demonstrates critical failures in unstructured data quality that would severely compromise sentiment analysis and algorithmic trading decisions.

**Reliability and Source Credibility Issues** emerge from mixing highly credible sources such as Bloomberg, WSJ, and Financial Times with unreliable social media accounts including anonymous Twitter users, TikTok influencers, and deleted Reddit accounts. This creates a false equivalency where unreliable speculation carries the same analytical weight as professional financial analysis, leading to skewed sentiment scores and potentially catastrophic trading decisions. The credibility spectrum ranges from verified financial journalists with established track records to anonymous accounts promoting questionable trading strategies.
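
The effect of treating all sources as equally credible can be illustrated with a weighted average. Everything numeric below is hypothetical: the credibility weights and the sentiment scores are invented for the example and are not derived from the dataset or from any established rating scheme.

```python
# Assumed credibility weights per source (hypothetical values)
SOURCE_WEIGHTS = {
    "Bloomberg": 1.0, "WSJ": 1.0, "Financial Times": 1.0,
    "Twitter": 0.2, "Reddit": 0.2, "TikTok": 0.1, "Discord": 0.1,
}
DEFAULT_WEIGHT = 0.0  # unknown or missing sources contribute nothing

def weighted_sentiment(items):
    """Credibility-weighted mean of (source, score) pairs, or None if no weight."""
    total_weight = 0.0
    weighted_sum = 0.0
    for source, score in items:
        weight = SOURCE_WEIGHTS.get(source, DEFAULT_WEIGHT)
        total_weight += weight
        weighted_sum += weight * score
    return weighted_sum / total_weight if total_weight else None

# Hypothetical numeric scores in [-1, 1] for four items
items = [("Twitter", 1.0), ("TikTok", 1.0), ("Financial Times", 0.5), ("WSJ", 0.0)]

unweighted = sum(score for _, score in items) / len(items)
weighted = weighted_sentiment(items)

print(f"Unweighted mean: {unweighted:.3f}")
print(f"Credibility-weighted mean: {weighted:.3f}")
```

With these assumed numbers, two enthusiastic social media posts pull the unweighted mean well above the weighted one, which is the false equivalency described above.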

**Temporal Inconsistency and Context Loss** manifest through missing, invalid, or ambiguous timestamp information. Timestamps appear as impossible values like '2024-15-45 25:99:99', as empty fields, or as vague references such as 'yesterday' that provide no actionable timing information. Without precise timestamps, the data cannot support time-sensitive financial applications, because content cannot be correlated with market events or ordered relative to them, and causal relationships cannot be established.
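
A strict timestamp check along these lines is sketched below. It assumes that the pipeline's accepted format is ISO 8601 in UTC with a trailing `Z`, as used by the well-formed records in this dataset; `parse_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def parse_timestamp(raw):
    """Return a UTC-aware datetime, or None if raw is not 'YYYY-MM-DDTHH:MM:SSZ'."""
    try:
        parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers malformed or impossible values
        return None
    return parsed.replace(tzinfo=timezone.utc)

raw_values = ["2024-15-45 25:99:99", "", "yesterday", "2024-09-30T14:30:00Z"]
parsed = [parse_timestamp(value) for value in raw_values]

for raw, result in zip(raw_values, parsed):
    print(f"{raw!r:28} -> {result}")
```

Rejecting 'yesterday' outright, rather than resolving it against the ingestion time, is a design choice: a guessed timestamp would look valid downstream while still being unreliable.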

**Content Integrity and Processing Challenges** appear throughout the dataset with truncated articles missing crucial context, HTML markup contaminating text content, and emoji symbols replacing standardised sentiment classifications. Empty records provide no informational value whilst creating processing overhead. This inconsistency makes automated processing extremely difficult and prone to errors, as sentiment analysis algorithms cannot reliably extract meaningful signals from corrupted or incomplete text data.
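
The content problems above can be detected and partly repaired before any sentiment model sees the text. The sketch below uses a simple regular expression to strip tags, which is an assumption that holds for markup like the WSJ record here but would also remove legitimate text containing angle brackets; a proper HTML parser would be safer for real feeds. `clean_content` is a hypothetical helper.

```python
import re
from html import unescape

TAG_PATTERN = re.compile(r"<[^>]+>")
TRUNCATION_MARKER = "[CONTENT TRUNCATED]"  # marker used in this dataset

def clean_content(raw):
    """Strip tags, decode entities, collapse whitespace, and report what was found."""
    without_tags = TAG_PATTERN.sub(" ", raw)
    text = re.sub(r"\s+", " ", unescape(without_tags)).strip()
    return {
        "text": text,
        "had_html": bool(TAG_PATTERN.search(raw)),
        "truncated": TRUNCATION_MARKER in raw,
        "empty": not text,
    }

html_record = '<html><body><div class="article">Q3 earnings exceeded expectations.</div></body></html>'
truncated_record = "The Federal Reserve announced... [CONTENT TRUNCATED] ...rate decisions"

for record in (html_record, truncated_record, ""):
    print(clean_content(record))
```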

**Standardisation and Classification Failures** occur when sentiment labels employ inconsistent formats, including 'VERY_POSITIVE', 'negative', '🚀🚀🚀', and empty values, that cannot be systematically processed or compared. Without normalisation to a controlled vocabulary, the labels cannot feed into machine learning models or quantitative analysis pipelines, which leaves the dataset unusable for its intended purpose in financial decision-making systems. The same gap prevents cross-platform comparison and meaningful aggregation of sentiment indicators.
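
A controlled vocabulary for the sentiment labels might look like the sketch below. The three-class target scheme and the decision to collapse intensity modifiers ('VERY_', 'EXTREMELY_') into their base class are assumptions made for illustration; `LABEL_MAP` and `normalise_label` are hypothetical names.

```python
# Assumed mapping from observed label variants to a three-class vocabulary
LABEL_MAP = {
    "very_positive": "positive",
    "positive": "positive",
    "neutral": "neutral",
    "negative": "negative",
    "extremely_negative": "negative",
}

def normalise_label(raw):
    """Map a raw label to the controlled vocabulary, or None if it cannot be mapped."""
    if not isinstance(raw, str):
        return None
    return LABEL_MAP.get(raw.strip().lower())

# Labels as they appear in the dataset above
raw_labels = ["VERY_POSITIVE", None, "negative", "NEUTRAL",
              "EXTREMELY_NEGATIVE", "positive", "🚀🚀🚀", ""]
normalised = [normalise_label(label) for label in raw_labels]

unmapped = sum(1 for label in normalised if label is None)
print(normalised)
print(f"Unmapped labels: {unmapped} of {len(raw_labels)}")
```

Labels that cannot be mapped are returned as `None` rather than guessed, so they can be counted and reviewed instead of silently entering a model.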

In [None]:
# Unstructured Data Quality Assessment
def analyze_unstructured_data_quality(news_data):
    """
    Analyzes quality issues in unstructured financial news data
    """
    print("UNSTRUCTURED DATA QUALITY ASSESSMENT")
    print("=" * 60)
    
    total_articles = len(news_data)
    
    # 1. Missing/Empty Content Analysis
    empty_content = sum(1 for article in news_data if not article['content'].strip())
    empty_source = sum(1 for article in news_data if not article['source'].strip())
    empty_timestamp = sum(1 for article in news_data if not article['timestamp'].strip())
    empty_author = sum(1 for article in news_data if not article['author'].strip())
    
    print(f"\n1. COMPLETENESS ANALYSIS:")
    print(f"Total articles: {total_articles}")
    print(f"Empty content: {empty_content} ({empty_content/total_articles*100:.1f}%)")
    print(f"Missing source: {empty_source} ({empty_source/total_articles*100:.1f}%)")
    print(f"Missing timestamp: {empty_timestamp} ({empty_timestamp/total_articles*100:.1f}%)")
    print(f"Missing author: {empty_author} ({empty_author/total_articles*100:.1f}%)")
    
    # 2. Source Reliability Analysis
    sources = [article['source'] for article in news_data if article['source']]
    reliable_sources = ['Bloomberg', 'WSJ', 'Financial Times', 'Reuters']
    unreliable_sources = ['Twitter', 'Reddit', 'TikTok', 'Discord']
    
    reliable_count = sum(1 for source in sources if source in reliable_sources)
    unreliable_count = sum(1 for source in sources if source in unreliable_sources)
    
    print(f"\n2. SOURCE RELIABILITY ANALYSIS:")
    print(f"Reliable sources: {reliable_count} ({reliable_count/len(sources)*100:.1f}%)")
    print(f"Unreliable sources: {unreliable_count} ({unreliable_count/len(sources)*100:.1f}%)")
    
    # 3. Content Quality Issues
    # Match opening tags by prefix so that tags with attributes (e.g. <div class=...>) are caught
    html_contaminated = sum(1 for article in news_data if '<html' in article['content'] or '<div' in article['content'])
    emoji_heavy = sum(1 for article in news_data if '🚀' in article['content'] or '😭' in article['content'] or '👽' in article['content'])
    truncated = sum(1 for article in news_data if '[CONTENT TRUNCATED]' in article['content'])
    
    print(f"\n3. CONTENT QUALITY ISSUES:")
    print(f"HTML contaminated: {html_contaminated} ({html_contaminated/total_articles*100:.1f}%)")
    print(f"Emoji-heavy content: {emoji_heavy} ({emoji_heavy/total_articles*100:.1f}%)")
    print(f"Truncated content: {truncated} ({truncated/total_articles*100:.1f}%)")
    
    # 4. Sentiment Label Consistency
    sentiment_labels = [article['sentiment_label'] for article in news_data if article['sentiment_label']]
    unique_formats = set(sentiment_labels)
    
    print(f"\n4. SENTIMENT LABEL ANALYSIS:")
    print(f"Unique sentiment formats: {len(unique_formats)}")
    print(f"Sentiment label formats: {sorted(unique_formats)}")
    print(f"Standardization level: {'Poor - Multiple inconsistent formats' if len(unique_formats) > 3 else 'Good'}")

# Run the unstructured data analysis
analyze_unstructured_data_quality(poor_news)

## Summary and Conclusions

The structured KYC example showed how fundamental data management failures can compromise regulatory compliance and operational effectiveness in financial institutions. Missing values, duplicate records, inconsistent formatting, and out-of-range values create cascading problems throughout customer relationship management and risk assessment systems.

Unstructured data quality challenges in financial news and social media present unique difficulties including source reliability assessment, content contamination, temporal inconsistencies, and standardisation issues. These problems can lead to flawed sentiment analysis, poor trading decisions, and systematic misinterpretation of market signals.

The broader implications for financial data quality extend beyond immediate operational concerns to encompass regulatory compliance failures, inaccurate risk assessments, compromised algorithmic trading decisions, and potential loss of customer trust. Effective data quality management requires robust validation rules, clear governance policies, reliable source verification, complete audit trails, and continuous monitoring procedures.

Organisations must implement comprehensive data quality frameworks that address both structured and unstructured data challenges whilst maintaining the flexibility to adapt to evolving data sources and analytical requirements. The cost of poor data quality in financial services extends far beyond immediate operational inconvenience to encompass significant regulatory, reputational, and financial risks.
