# Berkshire Hathaway Earnings Report Activity Analyzer with PDF Summary Download

## Overview
This tool downloads Berkshire Hathaway's quarterly earnings reports, extracts key investment activities from them, and compares those activities across three time periods. It is aimed at financial analysts, investors, and fintech professionals who need to quickly digest and compare Berkshire Hathaway's investment strategies and activities.

## Features
- Automatically downloads PDF reports for three time periods:
  1. Current quarter
  2. Previous quarter
  3. Same quarter from the previous year
- Extracts text from the PDF reports page by page using PyPDF2
- Analyzes the text to identify key investment activities, including:
  - Acquisitions
  - Purchases
  - Sales
  - Divestitures
  - Investments
  - Other significant financial activities
- Generates a comparative analysis across the three time periods
- Creates a downloadable PDF summary with a structured table of activities

## How It Works
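1. The user enters the quarter number (1, 2, 3, or 4) and the last two digits of the year
2. The script generates URLs for the relevant Berkshire Hathaway quarterly reports (for example, `https://berkshirehathaway.com/qtrly/2ndqtr24.pdf`)
3. PDFs are downloaded from Berkshire Hathaway's website using `requests`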
4. Text is extracted from the PDFs using PyPDF2
5. The extracted text is analyzed using regular expressions to identify various investment activities
6. A comparison table is created, showing activities for each time period
7. A summary PDF is generated using ReportLab, featuring the comparison table
8. The summary PDF is downloaded through the browser via `google.colab.files.download`, so the notebook is expected to run in Google Colab
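
The URL-generation step (step 2) can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: the names `report_url`, `comparison_urls`, and `ORDINAL_SUFFIX` are hypothetical, and the filename pattern is inferred from URLs such as `.../qtrly/2ndqtr24.pdf`. The sketch only builds strings; it does not check that a given report actually exists on the site.

```python
BASE_URL = "https://berkshirehathaway.com/qtrly/"

# Ordinal suffix that follows the quarter number in the filename (assumed pattern)
ORDINAL_SUFFIX = {1: "st", 2: "nd", 3: "rd", 4: "th"}


def report_url(quarter: int, year: int) -> str:
    """Build the report URL for a quarter (1-4) and a two-digit year."""
    return f"{BASE_URL}{quarter}{ORDINAL_SUFFIX[quarter]}qtr{year % 100:02d}.pdf"


def comparison_urls(quarter: int, year: int) -> dict:
    """Return URLs for the current, previous, and year-ago quarters."""
    # Q1 rolls back to Q4 of the previous year
    prev_quarter, prev_year = (quarter - 1, year) if quarter > 1 else (4, year - 1)
    return {
        "current": report_url(quarter, year),
        "previous": report_url(prev_quarter, prev_year),
        "last_year": report_url(quarter, year - 1),
    }


urls = comparison_urls(2, 24)
print(urls["previous"])
```

Keeping the suffix lookup in one table means every URL goes through the same formatting path, so the current, previous, and year-ago URLs cannot disagree about the suffix for a given quarter.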

## Usage
1. Run the notebook cells in order
2. When prompted, enter the number of the current quarter (1, 2, 3, or 4) and the last two digits of the current year
3. The script will process the reports and generate a summary PDF
4. The summary PDF will be automatically downloaded to your local machine
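
Because the quarter and year are typed in by hand, it may help to validate them before any download is attempted. The helper below is a hypothetical sketch (the name `parse_quarter_year` is not part of the notebook) that accepts the two raw input strings and either returns integers or raises a `ValueError`.

```python
import re


def parse_quarter_year(quarter_text: str, year_text: str) -> tuple:
    """Validate raw user input and return (quarter, two_digit_year) as ints."""
    quarter_text = quarter_text.strip()
    year_text = year_text.strip()
    if quarter_text not in {"1", "2", "3", "4"}:
        raise ValueError(f"Quarter must be 1, 2, 3, or 4, got {quarter_text!r}")
    # Exactly two ASCII digits, e.g. "24" for 2024
    if not re.fullmatch(r"[0-9]{2}", year_text):
        raise ValueError(f"Year must be two digits (e.g. 24), got {year_text!r}")
    return int(quarter_text), int(year_text)


print(parse_quarter_year(" 2 ", "24"))
```

In a notebook, the two arguments would come from `input()` calls, and the `ValueError` could be caught to ask the user again.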

## Applications in Fintech
This tool significantly enhances financial analysis processes by:
- Automating the extraction of key investment data from lengthy reports
- Providing a quick comparison of Berkshire Hathaway's activities over time
- Enabling rapid identification of trends in Berkshire Hathaway's investment strategies
- Facilitating data-driven decision-making for investment analysis
- Streamlining the creation of investment research reports

## Technical Highlights
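- Downloads financial reports over HTTP with `requests`, using URLs constructed from the quarter and year
- Implements PDF text extraction using PyPDF2
- Uses keyword-based regular expressions to pull out sentences that mention dollar amounts or percentage stakes; the matching is heuristic and can miss or over-match activities
- Generates professional-looking PDF reports with ReportLab
- Handles tables that span multiple pages by repeating the header row, and truncates long cell text to 1,000 characters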
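
The regular-expression approach can be illustrated on a made-up sentence. The pattern below is a simplified, hypothetical variant of the ones used in the notebook, and the sample text is invented for the example; it is not taken from any Berkshire Hathaway report.

```python
import re

# A dollar amount such as "$2.5 billion" or "$300 million"
AMOUNT = r"\$\d+(?:\.\d+)?\s(?:billion|million)"

# A purchase verb, then (within the same sentence) an amount, "of"/"in", and a target
PURCHASE_PATTERN = re.compile(
    rf"(?:purchased|acquired|invested in)[^.]*?({AMOUNT})\s(?:of|in)\s([^.]+)\.",
    re.IGNORECASE,
)


def find_purchases(text: str) -> list:
    """Return (amount, target) pairs for purchase-like sentences in text."""
    return [(m.group(1), m.group(2).strip()) for m in PURCHASE_PATTERN.finditer(text)]


sample = (
    "During the quarter, we purchased $2.5 billion of Example Corp common stock. "
    "We also sold $1.0 billion of other securities."
)
print(find_purchases(sample))  # [('$2.5 billion', 'Example Corp common stock')]
```

The sketch also shows a limitation of this style of matching: because the target is captured up to the next period, a name containing a period (such as "Inc.") would be cut short.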

## Future Enhancements
- Integration with financial databases for additional context and data verification
- Implementation of natural language processing for more nuanced text analysis
- Addition of data visualization components (e.g., charts, graphs) in the summary PDF
- Expansion to analyze reports from other major investment firms
- Development of a user interface for easier parameter input and customization

## Note
This tool is designed for educational and research purposes. Always refer to official financial reports and seek professional advice for investment decisions. Ensure compliance with Berkshire Hathaway's terms of service when accessing their reports.

This project demonstrates proficiency in financial data analysis, PDF processing, text mining, and automated report generation - key skills highly valued in the fintech industry.

import io
import os
import re
import requests
import PyPDF2
from google.colab import files
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib import colors
from reportlab.lib.units import inch

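def download_pdf(url):
    # Return the PDF as an in-memory file, or None if the download fails
    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException as exc:
        print(f"Failed to download PDF from {url}: {exc}")
        return None
    if response.status_code == 200:
        return io.BytesIO(response.content)
    print(f"Failed to download PDF from {url} (HTTP {response.status_code})")
    return None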

def extract_text_from_pdf(pdf_file):
    if pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        # Guard against pages that yield no text (e.g. image-only pages)
        return "".join(page.extract_text() or "" for page in reader.pages)
    return ""

def extract_investment_activities(text):
    activities = {
        'Acquisitions': [],
        'Purchases': [],
        'Sales': [],
        'Divestitures': [],
        'Investments': [],
        'Other Activities': []
    }

    # Extract detailed acquisitions
    acquisitions = re.findall(r'(?:On|In) (?:[\w\s,]+), (?:we|Berkshire) acquired .*?(?=\n\n|\Z)', text, re.DOTALL | re.IGNORECASE)
    activities['Acquisitions'] = [a.strip() for a in acquisitions if a.strip()]

    # Extract purchases and investments
    purchases = re.findall(r'(?:purchased|acquired|increased.*?stake in|invested in|made an investment of).*?((?:[\$\d.]+ (?:billion|million)|(?:\d+(?:,\d+)?(?:\.\d+)?)%)(?: interest)? (?:of|in) [^.]+)\.', text, re.IGNORECASE)
    activities['Purchases'] = [p.strip() for p in purchases if p.strip()]

    # Extract sales and divestitures
    sales = re.findall(r'(?:sold|divested|reduced.*?stake in|disposed of).*?((?:[\$\d.]+ (?:billion|million)|(?:\d+(?:,\d+)?(?:\.\d+)?)%)(?: interest)? (?:of|in) [^.]+)\.', text, re.IGNORECASE)
    activities['Sales'] = [s.strip() for s in sales if s.strip()]

    # Extract specific divestitures
    divestitures = re.findall(r'(?:divested|spun off|split off).*?((?:[\$\d.]+ (?:billion|million)|(?:\d+(?:,\d+)?(?:\.\d+)?)%)(?: interest)? (?:of|in) [^.]+)\.', text, re.IGNORECASE)
    activities['Divestitures'] = [d.strip() for d in divestitures if d.strip()]

    # Extract general investments
    investments = re.findall(r'(?:invested|made an investment|increased our investment).*?((?:[\$\d.]+ (?:billion|million)|(?:\d+(?:,\d+)?(?:\.\d+)?)%) (?:in|to) [^.]+)\.', text, re.IGNORECASE)
    activities['Investments'] = [i.strip() for i in investments if i.strip()]

    # Extract bond activities and other significant financial activities
    other_activities = re.findall(r'(?:purchased|sold|acquired|divested).*?((?:[\$\d.]+ (?:billion|million)) (?:of )?(?:Treasury|corporate|government) bonds?[^.]+)\.', text, re.IGNORECASE)
    other_activities += re.findall(r'(?:Our|We|Berkshire).*?(?:strategy|investment|holding|portfolio).*?(?:\$[\d.]+ (?:billion|million)|(?:\d+(?:,\d+)?(?:\.\d+)?)%).*?\.', text)
    other_activities += re.findall(r'(?:entered into|terminated|modified).*?(?:agreement|contract|derivative|swap|option).*?(?:\$[\d.]+ (?:billion|million)|(?:\d+(?:,\d+)?(?:\.\d+)?)%).*?\.', text, re.IGNORECASE)
    activities['Other Activities'] = [o.strip() for o in other_activities if o.strip()]

    return activities

def truncate_text(text, max_length=1000):
    if len(text) <= max_length:
        return text
    return text[:max_length] + "... (truncated)"

def create_activity_table(current_activities, previous_activities, last_year_activities):
    # The last column holds the same quarter from the previous year
    table_data = [['Activity Type', 'Current Quarter', 'Previous Quarter', 'Same Quarter Last Year']]
    for activity_type in ['Acquisitions', 'Purchases', 'Sales', 'Divestitures', 'Investments', 'Other Activities']:
        current = truncate_text('\n\n'.join(current_activities.get(activity_type, [])))
        previous = truncate_text('\n\n'.join(previous_activities.get(activity_type, [])))
        last_year = truncate_text('\n\n'.join(last_year_activities.get(activity_type, [])))
        table_data.append([activity_type, current, previous, last_year])
    return table_data

def create_pdf_summary(activity_table, filename):
    doc = SimpleDocTemplate(filename, pagesize=letter, topMargin=0.5*inch, bottomMargin=0.5*inch, leftMargin=0.5*inch, rightMargin=0.5*inch)
    styles = getSampleStyleSheet()
    story = []

    # Custom styles
    styles.add(ParagraphStyle(name='TableHeader', parent=styles['Heading2'], fontSize=10, alignment=1))
    styles.add(ParagraphStyle(name='TableCell', parent=styles['Normal'], fontSize=8, leading=10))

    # Title
    story.append(Paragraph("Berkshire Hathaway Investment Activities Comparison", styles['Title']))
    story.append(Spacer(1, 12))

    # Activity Table
    story.append(Paragraph("Investment Activities Comparison", styles['Heading2']))
    story.append(Spacer(1, 6))

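    # Convert table data to Paragraphs for better text wrapping.
    # Paragraph parses XML-like markup, so escape the extracted text and
    # turn newlines into <br/> tags to keep entries visually separated.
    from xml.sax.saxutils import escape

    def to_markup(cell):
        return escape(cell).replace('\n', '<br/>')

    wrapped_data = [[Paragraph(to_markup(cell), styles['TableHeader' if i == 0 else 'TableCell']) for cell in row] for i, row in enumerate(activity_table)]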

    t = Table(wrapped_data, colWidths=[1.5*inch, 1.75*inch, 1.75*inch, 1.75*inch], repeatRows=1)
    t.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('ALIGN', (0, 0), (-1, -1), 'LEFT'),
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('FONTSIZE', (0, 0), (-1, 0), 10),
        ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
        ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
        ('GRID', (0, 0), (-1, -1), 1, colors.black)
    ]))
    story.append(t)

    doc.build(story)

def get_quarter_year():
    quarter = input("Enter the number of the latest quarter (1, 2, 3, or 4): ")
    year = input("Enter the last two digits of the current year (e.g., 24 for 2024): ")
    return quarter, year

def generate_urls(quarter, year):
    base_url = "https://berkshirehathaway.com/qtrly/"
    # Ordinal suffix used in the report filenames (e.g. 1stqtr24.pdf, 2ndqtr24.pdf)
    suffixes = {'1': 'st', '2': 'nd', '3': 'rd', '4': 'th'}

    def report_url(q, y):
        return f"{base_url}{q}{suffixes[q]}qtr{y}.pdf"

    current_url = report_url(quarter, year)

    # Previous quarter (Q1 rolls back to Q4 of the previous year);
    # if that report is not available, download_pdf() reports the failure
    prev_quarter = str(int(quarter) - 1) if int(quarter) > 1 else "4"
    last_year = f"{(int(year) - 1) % 100:02d}"
    prev_year = year if int(quarter) > 1 else last_year
    previous_url = report_url(prev_quarter, prev_year)

    # Last year same quarter
    last_year_url = report_url(quarter, last_year)

    return current_url, previous_url, last_year_url

# Main execution
quarter, year = get_quarter_year()
current_url, previous_url, last_year_url = generate_urls(quarter, year)

print(f"Downloading current quarter report: {current_url}")
current_quarter_pdf = download_pdf(current_url)
print(f"Downloading previous quarter report: {previous_url}")
previous_quarter_pdf = download_pdf(previous_url)
print(f"Downloading last year's quarter report: {last_year_url}")
last_year_quarter_pdf = download_pdf(last_year_url)

print("Extracting text from PDFs...")
current_quarter_text = extract_text_from_pdf(current_quarter_pdf)
previous_quarter_text = extract_text_from_pdf(previous_quarter_pdf)
last_year_quarter_text = extract_text_from_pdf(last_year_quarter_pdf)

print("Extracting investment activities...")
current_activities = extract_investment_activities(current_quarter_text)
previous_activities = extract_investment_activities(previous_quarter_text)
last_year_activities = extract_investment_activities(last_year_quarter_text)

print("Creating activity table...")
activity_table = create_activity_table(current_activities, previous_activities, last_year_activities)

print("Generating PDF summary...")
output_filename = f"Berkshire_Hathaway_investment_activities_comparison_Q{quarter}_{year}.pdf"
create_pdf_summary(activity_table, output_filename)
files.download(output_filename)

print(f"\nComparison PDF file '{output_filename}' created and downloaded successfully.")
print("Current working directory:", os.getcwd())
print("Files in directory:", os.listdir())

Enter the number of the latest quarter (1, 2, 3, or 4): 2
Enter the last two digits of the current year (e.g., 24 for 2024): 24
Downloading current quarter report: https://berkshirehathaway.com/qtrly/2ndqtr24.pdf
Downloading previous quarter report: https://berkshirehathaway.com/qtrly/1stqtr24.pdf
Downloading last year's quarter report: https://berkshirehathaway.com/qtrly/2ndqtr23.pdf
Extracting text from PDFs...
Extracting investment activities...
Creating activity table...
Generating PDF summary...


Comparison PDF file 'Berkshire_Hathaway_investment_activities_comparison_Q2_24.pdf' created and downloaded successfully.
Current working directory: /content
Files in directory: ['.config', 'Berkshire_Hathaway_investment_activities_comparison_Q2_10.pdf', 'Berkshire_Hathaway_investment_activities_comparison_Q2_24.pdf', '.ipynb_checkpoints', 'sample_data']
