# Movie Subtitles Analysis: Benford's Law and Zipf's Law

This notebook analyzes the `movies_subtitles.csv` dataset to examine the applicability of two statistical patterns:
1. **Benford's Law** - Applied to the start times of subtitles
2. **Zipf's Law** - Applied to word frequencies in subtitle text

**Team Name:** The Procrastinators' Club

## Members
1. **Naman Gupta**
   - Email: naman.gupta@adypu.edu.in
   - Batch: B
2. **Vivek Wagadare**
   - Email: vivek.wagadare@adypu.edu.in
   - Batch: C
3. **Devendra Mishra**
   - Email: devendra.mishra@adypu.edu.in
   - Batch: C
4. **Aryan Rana**
   - Email: aryan.rana@adypu.edu.in
   - Batch: A

## Data Loading Options

You can upload the dataset directly, read it from a mounted Google Drive, or fetch it through a Google Drive shared link.
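
The loading helpers in this notebook rely on `google.colab`, which exists only on Colab runtimes. As a minimal sketch (not part of the original workflow), the check below shows one way to detect the environment before choosing a loading route. The function names and the `local_path` default are hypothetical.

```python
def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only present on Colab runtimes)
    except ImportError:
        return False
    return True


def pick_loader(local_path: str = "movies_subtitles.csv") -> str:
    """Describe which loading route applies; local_path is a hypothetical default."""
    if running_in_colab():
        return "colab: use the upload or Drive options"
    return f"local: read {local_path} with pd.read_csv"


print(pick_loader())
```

Outside Colab, the upload and Drive cells would fail at import time, so a plain `pd.read_csv` on a local copy is the likely fallback.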

In [1]:
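# Install NLTK first so that the import below succeeds on a fresh runtime
!pip install nltk

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from collections import Counter
import math
from google.colab import files  # only available inside Google Colab
import io

# Download the tokenizer models and stopword list used in the Zipf analysis
nltk.download('punkt')
nltk.download('punkt_tab')  # required by word_tokenize in newer NLTK releases
nltk.download('stopwords')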

ModuleNotFoundError: No module named 'pandas'

### Option 1: Upload Dataset Directly

In [None]:
def load_from_upload():
    uploaded = files.upload()  # This will prompt the user to upload a file
    for filename in uploaded.keys():
        print(f'Uploaded file: {filename}')
        if filename.endswith('.csv'):
            return pd.read_csv(io.BytesIO(uploaded[filename]))
    raise ValueError("Please upload a CSV file.")

### Option 2: Load from Google Drive

In [None]:
from google.colab import drive

def mount_drive():
    drive.mount('/content/drive')
    print("Drive mounted successfully.")
    
def load_from_drive(file_path=None):
    if file_path is None:
        file_path = input("Enter the path to the CSV file in your Google Drive (e.g., 'My Drive/movies_subtitles.csv'): ")
    
    full_path = f"/content/drive/{file_path}"
    try:
        return pd.read_csv(full_path)
    except FileNotFoundError:
        print(f"File not found at {full_path}")
        return None

In [None]:
def load_from_drive_link():
    drive_link = input("Enter the Google Drive shared link for the CSV file: ")
    file_id = None
    
    # Extract file ID from drive link
    if 'drive.google.com/file/d/' in drive_link:
        file_id = drive_link.split('drive.google.com/file/d/')[1].split('/')[0]
    elif 'drive.google.com/open?id=' in drive_link:
        file_id = drive_link.split('drive.google.com/open?id=')[1].split('&')[0]
    
    if file_id:
        download_link = f'https://drive.google.com/uc?id={file_id}'
        return pd.read_csv(download_link)
    else:
        print("Invalid Google Drive link format.")
        return None

### Choose a Data Loading Method

In [None]:
print("Choose how to load the dataset:")
print("1. Upload CSV file directly")
print("2. Access from mounted Google Drive")
print("3. Use Google Drive shared link")

choice = input("Enter your choice (1, 2, or 3): ")

if choice == '1':
    df = load_from_upload()
elif choice == '2':
    mount_drive()
    df = load_from_drive()
elif choice == '3':
    df = load_from_drive_link()
else:
    print("Invalid choice. Please run the cell again and choose 1, 2, or 3.")
    df = None

## Data Exploration

In [None]:
if df is not None:
    print(f"Dataset loaded successfully with {df.shape[0]} rows and {df.shape[1]} columns.")
    
    print("\nColumn names:")
    print(df.columns.tolist())
    
    print("\nFirst few rows:")
    display(df.head())
    
    print("\nData summary:")
    display(df.describe())
    
    print("\nMissing values:")
    display(df.isnull().sum())
else:
    print("Dataset not loaded. Please run the data loading cells again.")

## Benford's Law Analysis

Benford's Law predicts that in many naturally occurring collections of numbers, the first significant digit is likely to be small. Specifically:
- The digit '1' appears as the leading digit about 30.1% of the time
- The digit '2' about 17.6% of the time
- The digit '3' about 12.5% of the time
- And so on...

Formula: P(d) = log10(1 + 1/d) for d ∈ {1, 2, ..., 9}

We'll analyze the `start_time` column to see if it follows Benford's Law.
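
To make the formula concrete, here is a small standalone sketch that tabulates P(d) for every leading digit and checks that the nine probabilities sum to 1. The name `benford_probability` is introduced only for this illustration.

```python
import math


def benford_probability(d: int) -> float:
    """P(d) = log10(1 + 1/d) for a leading digit d in 1..9."""
    return math.log10(1 + 1 / d)


expected = {d: benford_probability(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}: {p * 100:.1f}%")

# The nine terms telescope to log10(10), so the total is 1
total = sum(expected.values())
print(f"sum = {total:.6f}")
```

The sum is 1 because log10(2/1) + log10(3/2) + ... + log10(10/9) collapses to log10(10).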

In [None]:
def extract_first_digit(number):
    # Convert to string and find first non-zero digit
    str_num = str(number)
    for char in str_num:
        if char.isdigit() and char != '0':
            return int(char)
    return None

def benford_expected_distribution():
    return {d: math.log10(1 + 1/d) for d in range(1, 10)}

def analyze_benford_law(df, column_name):
    if column_name not in df.columns:
        print(f"Column '{column_name}' not found in the dataset.")
        # Return a pair so that callers can always unpack two values
        return None, None
    
    # Drop missing values
    numeric_data = df[column_name].dropna()
    
    # Extract first digits; values without a non-zero digit (e.g. 0) are dropped
    first_digits = numeric_data.apply(extract_first_digit).dropna().astype(int)
    print(f"Analyzed {len(first_digits)} valid numbers from '{column_name}' column.")
    
    # Calculate observed frequencies (%); digits that never occur are reported as 0
    digit_counts = first_digits.value_counts().reindex(range(1, 10), fill_value=0)
    observed_distribution = (digit_counts / len(first_digits)) * 100
    
    # Calculate expected Benford's Law distribution (%)
    expected_dist = benford_expected_distribution()
    expected_distribution = pd.Series({d: expected_dist[d] * 100 for d in range(1, 10)})
    
    return observed_distribution, expected_distribution

In [None]:
if df is not None:
    observed_dist, expected_dist = analyze_benford_law(df, 'start_time')
    
    if observed_dist is not None and expected_dist is not None:
        # Create a DataFrame for comparison
        benford_df = pd.DataFrame({
            'Observed (%)': observed_dist,
            'Expected (%)': expected_dist
        }).round(1)
        
        print("\nObserved vs. Expected Distribution of First Digits:")
        display(benford_df)
        
        # Calculate the difference
        benford_df['Difference'] = (benford_df['Observed (%)'] - benford_df['Expected (%)']).round(1)
        print("\nDifference between Observed and Expected:")
        display(benford_df[['Difference']])

In [None]:
# Visualize Benford's Law comparison
if df is not None and 'benford_df' in locals():
    plt.figure(figsize=(12, 7))
    
    # Bar chart
    bar_width = 0.35
    indices = np.arange(len(benford_df.index))
    
    plt.bar(indices - bar_width/2, benford_df['Observed (%)'], 
            width=bar_width, label='Observed', color='skyblue', alpha=0.8)
    plt.bar(indices + bar_width/2, benford_df['Expected (%)'], 
            width=bar_width, label='Expected (Benford\'s Law)', color='orange', alpha=0.8)
    
    # Customization
    plt.xlabel('First Digit', fontsize=14)
    plt.ylabel('Frequency (%)', fontsize=14)
    plt.title("Benford's Law: Expected vs. Observed Frequency of First Digits in start_time", fontsize=16)
    plt.xticks(indices, benford_df.index)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.legend(fontsize=12)
    
    # Add data labels
    for i, v in enumerate(benford_df['Observed (%)']):
        plt.text(i - bar_width/2, v + 0.5, f"{v:.1f}%", ha='center', fontsize=10)
        
    for i, v in enumerate(benford_df['Expected (%)']):
        plt.text(i + bar_width/2, v + 0.5, f"{v:.1f}%", ha='center', fontsize=10)
    
    plt.tight_layout()
    plt.savefig('benford_plot.png', dpi=300)
    plt.show()

## Zipf's Law Analysis

Zipf's Law states that the frequency of any word is inversely proportional to its rank in the frequency table. That is, the most common word occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent, and so on.

Formula: Frequency ≈ Constant / Rank

We'll analyze the `text` column to see if word frequencies follow Zipf's Law.
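
One consequence of Frequency ≈ Constant / Rank is that rank multiplied by frequency stays roughly constant. The sketch below illustrates this on a hypothetical toy text whose word counts were chosen to be exactly Zipfian; real subtitle data will only approximate the pattern.

```python
from collections import Counter

# Hypothetical toy text, built so the counts are exactly 12, 6, 4 and 3
toy_text = " ".join(["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3)

counts = Counter(toy_text.split())
ranked = counts.most_common()  # (word, frequency) pairs, most frequent first

products = []
for rank, (word, freq) in enumerate(ranked, start=1):
    products.append(rank * freq)
    print(f"rank {rank}: {word!r} freq={freq} rank*freq={rank * freq}")
```

Every product equals 12 here, which is the constant in the formula for this toy text.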

In [None]:
def preprocess_text(text):
    if not isinstance(text, str):
        return ""
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters, numbers and multiple spaces
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\d+', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

def tokenize_and_filter(text, stop_words=None):
    if stop_words is None:
        stop_words = stopwords.words('english')
    
    # Add custom stopwords like subtitle artifacts; build a new set so that a
    # caller-supplied collection is not modified in place
    custom_stopwords = {'sighs', 'laughs', 'coughs', 'gasps', 'inaudible', 'music'}
    stop_words = set(stop_words) | custom_stopwords
    
    # Tokenize
    tokens = nltk.word_tokenize(text)
    
    # Filter out stopwords and single-character tokens
    filtered_tokens = [token for token in tokens if token not in stop_words and len(token) > 1]
    
    return filtered_tokens

def analyze_zipf_law(df, column_name, top_n=1000):
    if column_name not in df.columns:
        print(f"Column '{column_name}' not found in the dataset.")
        return None
    
    # Preprocess all text
    preprocessed_texts = df[column_name].apply(preprocess_text)
    
    # Combine all text
    all_text = " ".join(preprocessed_texts)
    
    # Tokenize and filter
    tokens = tokenize_and_filter(all_text)
    print(f"Analyzed {len(tokens)} words after filtering.")
    
    # Count word frequencies
    word_counts = Counter(tokens)
    print(f"Found {len(word_counts)} unique words.")
    
    # Create dataframe with word frequencies and ranks
    word_df = pd.DataFrame(word_counts.most_common(top_n), columns=['Word', 'Frequency'])
    word_df['Rank'] = range(1, len(word_df) + 1)
    word_df['Log_Rank'] = np.log10(word_df['Rank'])
    word_df['Log_Frequency'] = np.log10(word_df['Frequency'])
    
    # Calculate expected Zipf's Law frequencies
    c = word_df.iloc[0]['Frequency']  # Constant based on most frequent word
    word_df['Expected_Frequency'] = c / word_df['Rank']
    word_df['Log_Expected_Frequency'] = np.log10(word_df['Expected_Frequency'])
    
    return word_df

In [None]:
if df is not None:
    zipf_df = analyze_zipf_law(df, 'text')
    
    if zipf_df is not None:
        print("\nTop 10 Most Frequent Words:")
        display(zipf_df[['Rank', 'Word', 'Frequency']].head(10))

In [None]:
if df is not None and 'zipf_df' in locals() and zipf_df is not None:
    # Create pie chart for top 10 words
    plt.figure(figsize=(10, 8))
    top10_df = zipf_df.head(10)
    # One explode offset per slice, so the chart also works with fewer than 10 words
    plt.pie(top10_df['Frequency'], labels=[f"{w} ({f:,})" for w, f in zip(top10_df['Word'], top10_df['Frequency'])],
            autopct='%1.1f%%', shadow=True, startangle=140, explode=[0.05] * len(top10_df))
    plt.axis('equal')
    plt.title('Frequency Share of Top 10 Words', fontsize=16)
    plt.tight_layout()
    plt.show()

In [None]:
if df is not None and 'zipf_df' in locals():
    # Zipf's Law visualization - log-log plot
    plt.figure(figsize=(12, 8))
    
    # Scatter plot of actual data
    plt.scatter(zipf_df['Log_Rank'], zipf_df['Log_Frequency'], 
                alpha=0.6, color='blue', label='Observed')
    
    # Line plot of expected Zipf distribution
    plt.plot(zipf_df['Log_Rank'], zipf_df['Log_Expected_Frequency'], 
             color='red', linestyle='-', linewidth=2, label="Expected (Zipf's Law)")
    
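    # Linear regression on the log-log data; the slope estimates the Zipf exponent
    from scipy import stats
    slope, intercept, r_value, p_value, std_err = stats.linregress(
        zipf_df['Log_Rank'], zipf_df['Log_Frequency'])
    print(f"Fitted slope: {slope:.2f}, R^2: {r_value**2:.3f}")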
    
    # Add regression line
    plt.plot(zipf_df['Log_Rank'], intercept + slope * zipf_df['Log_Rank'], 
             color='green', linestyle='--', linewidth=2, 
             label=f'Fitted Line (slope={slope:.2f})')
    
    # Annotate top words
    for i in range(min(10, len(zipf_df))):
        plt.annotate(zipf_df.iloc[i]['Word'], 
                     (zipf_df.iloc[i]['Log_Rank'], zipf_df.iloc[i]['Log_Frequency']),
                     xytext=(5, 5), textcoords='offset points', fontsize=9)
    
    # Customize plot
    plt.xlabel('Log(Rank)', fontsize=14)
    plt.ylabel('Log(Frequency)', fontsize=14)
    plt.title("Zipf's Law: Word Frequency vs. Rank (Log-Log Scale)", fontsize=16)
    plt.grid(True, alpha=0.3)
    plt.legend(fontsize=12)
    
    # Add annotation about Zipf's Law
    zipf_note = "Zipf's Law predicts: Frequency ∝ 1/Rank\nIdeal slope = -1.0"
    plt.annotate(zipf_note, xy=(0.02, 0.02), xycoords='axes fraction', 
                 bbox=dict(boxstyle="round,pad=0.3", fc="yellow", alpha=0.3),
                 fontsize=12)
    
    plt.tight_layout()
    plt.savefig('zipf_plot.png', dpi=300)
    plt.show()

## Conclusion

### Summary of Findings

**Benford's Law Analysis**
- We examined the distribution of leading digits in the `start_time` column
- The observed distribution deviates noticeably from Benford's Law expectations
- Digit 1 appears less frequently than expected (~20.1% vs. expected 30.1%)
- Digits 2-5 are overrepresented compared to their theoretical frequencies
- This suggests that subtitle timings do not follow the logarithmic first-digit distribution that Benford's Law describes
- A likely reason is that start times are bounded by each film's running time and therefore span only a few orders of magnitude, whereas Benford's Law tends to hold for data spread across many orders of magnitude
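
One way to put a number on such a deviation is the mean absolute deviation (MAD) between observed first-digit shares and Benford's probabilities. The sketch below is an illustration on synthetic data, not a rerun of the notebook: it assumes timestamps spread uniformly over a 7200-second film and contrasts them with values spread evenly on a log scale. The helper names are hypothetical.

```python
import math
import random
from collections import Counter


def first_digit(x: float) -> int:
    """Leading significant digit of a positive number, read from scientific notation."""
    return int(f"{x:e}"[0])


def benford_mad(values) -> float:
    """Mean absolute deviation between observed first-digit shares and Benford's Law."""
    digits = [first_digit(v) for v in values if v > 0]
    counts = Counter(digits)
    n = len(digits)
    return sum(
        abs(counts.get(d, 0) / n - math.log10(1 + 1 / d)) for d in range(1, 10)
    ) / 9


random.seed(0)
# Assumption: timestamps spread uniformly over a 2-hour (7200 s) film
uniform_times = [random.uniform(0, 7200) for _ in range(50_000)]
# Contrast: values spread evenly on a log scale across six orders of magnitude
log_uniform = [10 ** random.uniform(0, 6) for _ in range(50_000)]

mad_uniform = benford_mad(uniform_times)
mad_log = benford_mad(log_uniform)
print(f"uniform times MAD: {mad_uniform:.4f}")
print(f"log-uniform MAD:   {mad_log:.4f}")
```

Under these assumptions the bounded, uniform timestamps sit much further from Benford's Law than the log-uniform values, which matches the direction of the deviation reported above.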

**Zipf's Law Analysis**
- We analyzed word frequencies in the `text` column after preprocessing and filtering
- The word distribution follows the Zipf's Law pattern quite closely
- The log-log plot shows a strong linear relationship between word rank and frequency
- The slope of this relationship is close to the theoretical value of -1
- This is consistent with natural language in movie subtitles following Zipf's characteristic distribution, in which a few words (like "dont", "im", "know") dominate and frequency drops rapidly with rank
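
The slope claim can be checked numerically. The notebook's plotting cell uses `scipy.stats.linregress`; the sketch below is a dependency-free equivalent that returns the log-log slope and R², applied to hypothetical ideal Zipfian counts rather than the subtitle data.

```python
import math


def loglog_fit(frequencies):
    """Least-squares slope and R^2 of log10(frequency) against log10(rank)."""
    xs = [math.log10(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log10(f) for f in frequencies]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, r_squared


# Hypothetical ideal Zipfian counts: frequency = 10000 / rank
ideal = [10_000 / rank for rank in range(1, 1001)]
slope, r_squared = loglog_fit(ideal)
print(f"slope={slope:.3f}, R^2={r_squared:.3f}")
```

For real word counts, a slope near -1 together with an R² near 1 would support the Zipf's Law reading; passing the `Frequency` column of `zipf_df` as a list should give the same slope as the fitted line in the plot.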

Thank you 