# Community Notes Study

This notebook analyzes Community Notes data from the file `notes-00000-2.tsv`.

Community Notes (formerly known as Birdwatch) is a collaborative system on X (formerly Twitter) that lets contributors add context to potentially misleading posts.

## Setup and Data Loading

First, let's import the necessary libraries and load the data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
# Load the data
data_file = Path('notes-00000-2.tsv')

# Stop here if the file is missing; every later cell depends on `df`
if not data_file.exists():
    raise FileNotFoundError(
        f"{data_file} not found. Place the file in the same directory as this notebook."
    )

# Read the tab-separated file
df = pd.read_csv(data_file, sep='\t')
print(f"Successfully loaded {len(df):,} rows from {data_file}")
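Identifier columns in a file like this are often very long integers. Reading them as strings keeps them out of numeric summaries such as `describe()`. The sketch below shows the idea on an in-memory toy TSV; the column names (`noteId`, `tweetId`, `classification`, `summary`) and the sample values are assumptions for illustration, not taken from the file.

```python
import io

import pandas as pd

# Toy TSV standing in for the real file. Column names and values are
# assumptions for illustration only.
sample_tsv = (
    "noteId\ttweetId\tclassification\tsummary\n"
    "1783159712986382830\t1783000000000000001\tNOT_MISLEADING\tContext is accurate.\n"
    "1783159712986382831\t1783000000000000002\tMISINFORMED_OR_POTENTIALLY_MISLEADING\t\n"
)

# Read identifier columns as strings so they are treated as labels, not numbers.
id_columns = ["noteId", "tweetId"]
toy_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    dtype={col: "string" for col in id_columns},
)

print(toy_df.dtypes)
print(toy_df["summary"].isna().sum(), "missing summary value(s)")
```

The same `dtype` mapping could be passed to the `pd.read_csv` call above once the real column names are known.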

## Data Overview

Let's examine the structure and contents of the dataset.

In [None]:
# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nColumn Names:")
print(df.columns.tolist())

In [None]:
# Display first few rows
df.head()

In [None]:
# Data types and missing values
print("Data Types and Missing Values:")
info_df = pd.DataFrame({
    'Data Type': df.dtypes,
    'Non-Null Count': df.count(),
    'Null Count': df.isnull().sum(),
    'Null Percentage': (df.isnull().sum() / len(df) * 100).round(2)
})
display(info_df)

## Descriptive Statistics

Let's compute basic statistics for numerical columns.

In [None]:
# Descriptive statistics for numerical columns
df.describe()

## Data Analysis

### Analysis by Classification

If the data contains classification information, let's analyze the distribution.

In [None]:
# Find columns whose name contains 'classification'
classification_cols = [col for col in df.columns if 'classification' in col.lower()]

if classification_cols:
    for col in classification_cols:
        print(f"\nDistribution of {col}:")
        print(df[col].value_counts())
        # Percentages are computed over non-null rows only
        print("\nPercentage distribution:")
        print((df[col].value_counts(normalize=True) * 100).round(2))
else:
    print("No classification columns found.")
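By default, `value_counts` drops missing values, so the percentages are relative to labeled rows only. A minimal sketch on toy data shows how counts and percentages can be combined in one table with missing labels included; the label values here are illustrative assumptions.

```python
import pandas as pd

# Toy labels; the values are illustrative, not taken from the dataset.
labels = pd.Series(
    ["NOT_MISLEADING", "NOT_MISLEADING", "MISLEADING", None, "NOT_MISLEADING"],
    name="classification",
)

# dropna=False keeps missing labels as their own row in the counts.
counts = labels.value_counts(dropna=False)
summary = pd.DataFrame({
    "count": counts,
    "percent": (counts / counts.sum() * 100).round(1),
})
print(summary)
```

Including the missing row makes it clear whether the percentages describe all notes or only the labeled ones.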

### Temporal Analysis

If timestamp data is available, let's analyze trends over time.

In [None]:
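# Look for timestamp columns by name (skip columns created by a previous run of this cell)
time_cols = [col for col in df.columns
             if any(x in col.lower() for x in ['time', 'date', 'created'])
             and not col.endswith('_datetime')]

if time_cols:
    print("Time-related columns found:")
    for col in time_cols:
        print(f"- {col}")
        # Assumption: numeric timestamp columns hold epoch milliseconds;
        # anything else is parsed as a date string
        try:
            if pd.api.types.is_numeric_dtype(df[col]):
                converted = pd.to_datetime(df[col], unit='ms', errors='coerce')
            else:
                converted = pd.to_datetime(df[col], errors='coerce')
        except (ValueError, TypeError, OverflowError) as exc:
            print(f"  Could not convert to datetime: {exc}")
            continue
        df[col + '_datetime'] = converted
        print(f"  Converted {converted.notna().sum():,} of {len(df):,} values to datetime")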
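Once a timestamp column has been converted, trends over time can be summarized by counting notes per calendar month. The sketch below assumes a column of epoch milliseconds named `createdAtMillis`; both the name and the values are hypothetical.

```python
import pandas as pd

# Toy frame; `createdAtMillis` is an assumed column name holding epoch milliseconds.
toy = pd.DataFrame({
    "createdAtMillis": [1611852295807, 1611900000000, 1614600000000],
})

# Interpret the integers as milliseconds since the Unix epoch (UTC).
toy["created_at"] = pd.to_datetime(toy["createdAtMillis"], unit="ms", utc=True)

# Count notes per calendar month using a "YYYY-MM" key.
notes_per_month = toy.groupby(toy["created_at"].dt.strftime("%Y-%m")).size()
print(notes_per_month)
```

Grouping on a formatted string keeps the example simple; for larger analyses, resampling on a datetime index is a common alternative.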

## Visualizations

Let's create some visualizations to better understand the data.

In [None]:
# Visualization 1: Distribution of classifications (if available)
if classification_cols:
    fig, axes = plt.subplots(1, len(classification_cols), figsize=(6*len(classification_cols), 6))
    if len(classification_cols) == 1:
        axes = [axes]
    
    for idx, col in enumerate(classification_cols):
        df[col].value_counts().plot(kind='bar', ax=axes[idx])
        axes[idx].set_title(f'Distribution of {col}')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Count')
        axes[idx].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()

In [None]:
# Visualization 2: Missing data heatmap
if len(df.columns) <= 50:  # Only if we have a reasonable number of columns
    plt.figure(figsize=(12, 8))
    sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='viridis')
    plt.title('Missing Data Pattern')
    plt.tight_layout()
    plt.show()

In [None]:
# Visualization 3: Numerical columns distribution
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

if numerical_cols and len(numerical_cols) <= 10:
    n_cols = min(3, len(numerical_cols))
    n_rows = (len(numerical_cols) + n_cols - 1) // n_cols
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(6*n_cols, 5*n_rows))
    axes = axes.flatten() if len(numerical_cols) > 1 else [axes]
    
    for idx, col in enumerate(numerical_cols):
        df[col].hist(bins=30, ax=axes[idx], edgecolor='black')
        axes[idx].set_title(f'Distribution of {col}')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Frequency')
    
    # Hide extra subplots
    for idx in range(len(numerical_cols), len(axes)):
        axes[idx].axis('off')
    
    plt.tight_layout()
    plt.show()

## Text Analysis

If the dataset contains text fields (like note summaries), let's analyze them.

In [None]:
# Find text columns (object type with long strings)
text_cols = []
avg_lengths = {}
for col in df.select_dtypes(include=['object']).columns:
    lengths = df[col].dropna().astype(str).str.len()
    if not lengths.empty and lengths.mean() > 20:  # Likely a free-text field
        text_cols.append(col)
        avg_lengths[col] = lengths.mean()

if text_cols:
    print("Text columns found:")
    for col in text_cols:
        print(f"\n{col}:")
        print(f"  Average length: {avg_lengths[col]:.2f} characters")
        # str() guards against non-string values in object columns
        print(f"  Sample: {str(df[col].dropna().iloc[0])[:100]}...")
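The average-length heuristic can also be wrapped in a small reusable function. This is a sketch under stated assumptions: `find_text_columns` and its `min_avg_len` parameter are hypothetical names, and the toy columns are invented for illustration.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def find_text_columns(frame, min_avg_len=20):
    """Return string-like columns whose average non-null length exceeds min_avg_len."""
    text_columns = []
    for col in frame.columns:
        series = frame[col]
        # Only consider object or string columns.
        if not (is_object_dtype(series) or is_string_dtype(series)):
            continue
        non_null = series.dropna()
        if non_null.empty:
            continue
        if non_null.astype(str).str.len().mean() > min_avg_len:
            text_columns.append(col)
    return text_columns


# Toy frame: a short label column, a free-text column, and a numeric column.
toy = pd.DataFrame({
    "label": ["A", "B"],
    "summary": ["This post is missing important context about the date.", None],
    "score": [1, 2],
})
print(find_text_columns(toy))
```

The threshold of 20 characters mirrors the cell above; it is a rough cutoff and may need tuning for columns that hold short codes or URLs.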

In [None]:
# Word count distribution for text columns
if text_cols:
    for col in text_cols[:3]:  # Limit to first 3 text columns
        word_counts = df[col].dropna().astype(str).str.split().str.len()
        
        plt.figure(figsize=(10, 5))
        plt.hist(word_counts, bins=30, edgecolor='black')
        plt.title(f'Word Count Distribution in {col}')
        plt.xlabel('Number of Words')
        plt.ylabel('Frequency')
        plt.show()
        
        print(f"\nStatistics for {col}:")
        print(f"  Mean word count: {word_counts.mean():.2f}")
        print(f"  Median word count: {word_counts.median():.2f}")
        print(f"  Max word count: {word_counts.max():.0f}")

## Correlation Analysis

Let's examine correlations between numerical variables.

In [None]:
# Correlation matrix for numerical columns
if len(numerical_cols) > 1:
    correlation_matrix = df[numerical_cols].corr()
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
                center=0, square=True, linewidths=1)
    plt.title('Correlation Matrix of Numerical Features')
    plt.tight_layout()
    plt.show()
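`DataFrame.corr` computes Pearson correlation by default, which measures linear association. For skewed variables, a rank-based measure such as Spearman can be a useful complement. A minimal sketch on toy data (the column names `x` and `y` are hypothetical):

```python
import pandas as pd

# Toy data: y increases with x, but not linearly.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 100],
})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because the relationship is perfectly monotonic, the Spearman coefficient is 1 while the Pearson coefficient is lower.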

## Summary Statistics

Let's create a comprehensive summary of the dataset.

In [None]:
print("="*60)
print("COMMUNITY NOTES DATASET SUMMARY")
print("="*60)
print(f"\nTotal number of notes: {len(df):,}")
print(f"Number of features: {len(df.columns)}")
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"\nData completeness: {(1 - df.isnull().sum().sum() / (len(df) * len(df.columns))) * 100:.2f}%")

if classification_cols:
    print(f"\nClassification columns: {', '.join(classification_cols)}")

if text_cols:
    print(f"\nText columns: {', '.join(text_cols)}")

print(f"\nNumerical columns: {len(numerical_cols)}")
print(f"Object (string) columns: {len(df.select_dtypes(include=['object']).columns)}")
print("="*60)

## Key Insights

The cells above do not hard-code any findings. After running them, review the following:

1. **Data Volume**: The summary cell reports the total number of notes, the number of columns, and the memory footprint.
2. **Data Quality**: The null-percentage table and the missing-data heatmap show which columns are incomplete.
3. **Classifications**: If classification columns are present, their distributions show how notes are categorized.
4. **Text Content**: The text analysis reports the character and word lengths of the note text.

### Next Steps

Further analysis could include:
- Sentiment analysis on text fields
- Topic modeling to identify common themes
- Time series analysis if temporal data is available
- Network analysis of note relationships
- Comparison with rating data if available
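
As one illustration of the last item, rating data could be aggregated per note and joined back onto the notes table. The sketch below uses two invented toy frames; the column names `noteId` and `helpful` and the layout of the ratings table are assumptions, not the actual schema.

```python
import pandas as pd

# Toy notes and ratings; all names and values are hypothetical.
notes = pd.DataFrame({
    "noteId": ["n1", "n2", "n3"],
    "classification": ["MISLEADING", "NOT_MISLEADING", "MISLEADING"],
})
ratings = pd.DataFrame({
    "noteId": ["n1", "n1", "n2"],
    "helpful": [1, 0, 1],
})

# Aggregate ratings to one row per note.
rating_summary = (
    ratings.groupby("noteId")["helpful"]
    .agg(rating_count="count", helpful_share="mean")
    .reset_index()
)

# A left join keeps notes that have no ratings (their rating columns are NaN).
merged = notes.merge(rating_summary, on="noteId", how="left")
print(merged)
```

A left join is used so that unrated notes remain visible in the result rather than being dropped.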

## Export Results

Optionally, save processed data or summary statistics.

In [None]:
# Example: Save summary statistics to CSV
# summary = df.describe(include='all')
# summary.to_csv('community_notes_summary.csv')
# print("Summary statistics saved to community_notes_summary.csv")