Ocean Beach Events Web Data Pipeline Project <br>
Christopher Luke <br>
2025 

Background: This project uses Python and several of its libraries to extract community event data from the Ocean Beach MainStreet Association website. The pipeline parses raw HTML into a structured dataset, organizing event details such as dates, times, and titles into a Pandas DataFrame for analysis and visualization.

Libraries used: 
<ins>*requests*</ins> - for sending HTTP requests and retrieving HTML information,
<ins>*BeautifulSoup*</ins> - for navigating through HTML information,
<ins>*datetime*</ins> - for handling date and time related information,
<ins>*pandas*</ins> - for organizing tabular data  



In [2]:
import requests 
from bs4 import BeautifulSoup as bs
import pandas as pd 
from datetime import datetime, timedelta

This project is my first attempt to apply what I’ve been learning in Python towards a real-world problem: finding information online, understanding how to extract and organize it, and transforming it into a structured, analyzable dataset. I wanted to start with something small, preferably a single website, yet still challenging and personally useful. With this in mind, I decided to build a pipeline to capture and analyze local events in my community of Ocean Beach, San Diego. I will walk through my thought process during the entire creation cycle, explaining the reasoning behind my decisions.  

To start, I needed a reliable data source. After searching online, I found [Ocean Beach San Diego](https://oceanbeachsandiego.com), which provides event information about Newport Avenue in a monthly calendar format. The URL for the calendar changes each month following a predictable pattern (`/calendar/month/year-month/`). Knowing this, I used list comprehensions to generate one URL per month for a sequence of years (I settled on 2022 through 2025). Combining these lists with the Python libraries `requests` and `BeautifulSoup` gave me a systematic way to iterate through each page, retrieve the raw HTML, and parse it into a structure I could actually work with.  
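
As a minimal sketch of the URL pattern described above, the snippet below builds one calendar URL per month. The helper name `build_calendar_urls` is hypothetical, and the 2022 to 2025 range is an assumption taken from the year range used in the scraper in this notebook.

```python
# Base path of the monthly calendar; each page is addressed as <base>/YYYY-MM
BASE_URL = "https://oceanbeachsandiego.com/calendar/month/"


def build_calendar_urls(first_year, last_year):
    """Return one calendar URL per month, in chronological order.

    Hypothetical helper for illustration only.
    """
    return [
        f"{BASE_URL}{year}-{month:02d}"   # zero-pad the month: 1 -> "01"
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


urls = build_calendar_urls(2022, 2025)
print(len(urls))   # 4 years x 12 months
print(urls[0])
print(urls[-1])
```

Keeping the year as an integer until the final f-string avoids string-concatenating year prefixes by hand.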

Inspecting the HTML revealed that the calendar is structured as a `<table>` containing a `<tbody>`. Each week is represented by a `<tr>` (row), which can contain information about numerical dates and event types for that week, both single-day (`class="single-day"`) and multi-day (`class="multi-day"`) events. Inside each `<tr>` are `<td>` elements representing individual days, which may be empty or contain events - when an event exists, it is enclosed in a `<div class="item">` element, containing the event title (inside an `<a>` tag) and times (inside `<span>` tags).  
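
To make that structure concrete, here is a small sketch that parses a hand-written snippet shaped like the description above. The markup in `SAMPLE_HTML` is a simplified assumption, not a copy of the real page, which contains more attributes and wrapper elements.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup following the structure described above
SAMPLE_HTML = """
<table><tbody>
  <tr class="single-day">
    <td class="single-day" data-date="2025-08-13">
      <div class="item">
        <a href="#">OB Farmers Market</a>
        <span class="date-display-single">4:00 PM</span>
      </div>
    </td>
    <td class="single-day" data-date="2025-08-14"></td>
  </tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
events = []
for td in soup.find("tbody").find_all("td"):
    # Empty day cells have no <div class="item">, so the inner loop is skipped
    for item in td.find_all("div", class_="item"):
        events.append({
            "date": td.get("data-date"),
            "event": item.find("a").get_text(strip=True),
            "start time": item.find("span").get_text(strip=True),
        })

print(events)
```

Only the first cell yields an event; the second `<td>` is an empty day and contributes nothing.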

From this analysis, I developed my initial strategy: request the HTML for each calendar URL, iterate through the `<tbody>` to find `<tr>` elements for single- and multi-day events, then parse their `<td>` elements if they contained event details. Each event, once extracted, could be stored in a dictionary, and the collection of dictionaries would ultimately be converted into a Pandas DataFrame for analysis.  

This approach worked well for single-day events, but I quickly noticed a problem with multi-day events: they were being recorded multiple times, all with the same start date. To troubleshoot, I went back and inspected the HTML more carefully, and that’s when I noticed the role of the `colspan` attribute. Multi-day events weren’t listed with separate dates; instead, each was represented by a single start date, with the `colspan` value indicating how many days the event lasted. For example, `colspan=3 data-date=2025-08-12` meant the event ran from August 12th through the 14th. 
At first, my code could not handle this and only captured the given start date. Once I understood the purpose of `colspan`, however, I had the key to fixing it. My solution was to generate the full date range by creating a sequence matching the `colspan` value and offsetting the start date by each value in it using Python’s `datetime.timedelta`. With this approach, multi-day events are recorded with the correct span of dates rather than repeated entries for the same date. 
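
The date expansion can be sketched in isolation as follows. The helper name `expand_dates` is hypothetical; it assumes `data-date` is formatted as `YYYY-MM-DD` and that `colspan` counts the days covered, as in the example above.

```python
from datetime import datetime, timedelta


def expand_dates(start_date, colspan):
    """Return one datetime per day covered by a calendar cell.

    Hypothetical helper: start_date is a 'YYYY-MM-DD' string (the cell's
    data-date) and colspan is the number of days the cell spans.
    """
    start = datetime.strptime(start_date, "%Y-%m-%d")
    return [start + timedelta(days=offset) for offset in range(colspan)]


dates = expand_dates("2025-08-12", 3)
print([d.strftime("%Y-%m-%d") for d in dates])
# ['2025-08-12', '2025-08-13', '2025-08-14']
```

Because `timedelta` does the arithmetic, spans that cross a month or year boundary need no special handling.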

This process also revealed a classification issue: if an event placed in the multi-day `<tr>` lasts only one day, is it technically still a multi-day event? This prompted me to reinspect the HTML, and in doing so I noticed inconsistencies in the HTML itself! For instance, a committee event might be listed under the single-day `<tr>` one month but under the multi-day `<tr>` the following month, while still appearing on the calendar for only one day. To resolve this, I relied on the class given on the `<td>` element rather than the `<tr>` element to decide each event’s classification and thus its date handling.    
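
A rough sketch of that decision is shown below, using made-up markup in which a one-day event sits inside a multi-day row. `SAMPLE_ROW`, the "Sample Festival" event, and the `classify_cell` helper are all hypothetical and exist only to illustrate reading the class from the `<td>`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the row is tagged multi-day, but its first cell is not
SAMPLE_ROW = """
<table><tbody>
  <tr class="multi-day">
    <td class="single-day" data-date="2025-08-01">
      <div class="item"><a href="#">Promotion Committee</a></div>
    </td>
    <td class="multi-day" colspan="3" data-date="2025-08-12">
      <div class="item"><a href="#">Sample Festival</a></div>
    </td>
  </tr>
</tbody></table>
"""


def classify_cell(td):
    """Classify an event by its own <td> class, ignoring the parent <tr>."""
    classes = td.get("class") or []   # BeautifulSoup returns class as a list
    if "multi-day" in classes:
        return "multi-day"
    if "single-day" in classes:
        return "single-day"
    return None


row_soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
labels = [
    (td.find("a").get_text(strip=True), classify_cell(td))
    for td in row_soup.find_all("td")
]
print(labels)
```

Here the committee meeting is labeled single-day even though its row is tagged multi-day, which is the behavior the paragraph above describes.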

The results of this process are illustrated in the code below, which generates the complete event DataFrame.

In [3]:
def scrape_ob_calendar():
    base_url = 'https://oceanbeachsandiego.com/calendar/month/'
    yearlist = ['202' + f'{i}' + '-' for i in range(2,6)]
    eventlist = []
    for year_prefix in yearlist:
        # Named year_prefix so it is not shadowed by the per-event `year` assigned below
        urllist  = [base_url + year_prefix + f"{i:02d}" for i in range(1,13)]
        for url in urllist:
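            response = requests.get(url, timeout=30)
            soup = bs(response.text, 'html.parser')
            tbody = soup.find('tbody')
            # Skip pages without a calendar table (e.g. a failed or empty response)
            if tbody is None:
                continue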
            for tr in tbody.find_all('tr', class_=['single-day','multi-day']):
                for td in tr.find_all('td'):
                    etype = td.get("class")
                    if td.find('div', class_="item") and 'single-day' in etype:
                        date = td.get('data-date')
                        date = datetime.strptime(date,"%Y-%m-%d")
                        month = date.strftime("%B")
                        day = date.strftime("%A")
                        year = date.strftime("%Y")
                        items = td.find_all('div', class_='item')
                        for item in items:
                            d = {}
                            event = item.find('a').get_text(strip=True)
                            start = item.find('span', class_='date-display-start')
                            if start:
                                start = start.get_text(strip=True)
                            else:
                                start = item.find('span', class_='date-display-single').get_text(strip=True)
                            end = item.find('span', class_='date-display-end')
                            if end:
                                end = end.get_text(strip=True)
                            else:
                                end = None
                            d['event type'] = etype[0]
                            d['day'] = day
                            d['month'] = month
                            d['year'] = year
                            d['date'] = date
                            d['event'] = event
                            d['start time'] = start
                            d['end time'] = end
                            eventlist.append(d)
                    elif td.find('div', class_="item") and 'multi-day' in td.get("class"):
                        datespan = int(td.get('colspan'))
                        date = td.get('data-date')
                        date = datetime.strptime(date, "%Y-%m-%d")
                        items = td.find_all('div', class_='item')
                        for dayi in range(datespan):
                            multidate = date + timedelta(days = dayi)
                            month = multidate.strftime("%B")
                            day = multidate.strftime("%A")
                            year = multidate.strftime("%Y")
                            for item in items:
                                d = {}
                                event = item.find('a').get_text(strip=True)
                                start = item.find('span', class_='date-display-start')
                                if start:
                                    start = start.get_text(strip=True)
                                else:
                                    start = item.find('span', class_='date-display-single').get_text(strip=True)
                                end = item.find('span', class_='date-display-end')
                                if end:
                                    end = end.get_text(strip=True)
                                else:
                                    end = None
                                d['event type'] = etype[0]
                                d['day'] = day
                                d['month'] = month
                                d['year'] = year
                                d['date'] = multidate
                                d['event'] = event
                                d['start time'] = start
                                d['end time'] = end
                                eventlist.append(d)
    df = pd.DataFrame(eventlist)       
    return df
                           
    

In [4]:
df = scrape_ob_calendar()
display(df)

| | event type | day | month | year | date | event | start time | end time |
|---|---|---|---|---|---|---|---|---|
| 0 | single-day | Friday | January | 2022 | 2022-01-07 | Promotion Committee | 9:00 AM | |
| 1 | single-day | Friday | February | 2022 | 2022-02-04 | Promotion Committee | 9:00 AM | |
| 2 | single-day | Friday | February | 2022 | 2022-02-18 | OB Community Cleanup - Feb. 18th | 8:00 AM | 10:00 AM |
| 3 | single-day | Friday | March | 2022 | 2022-03-04 | Promotion Committee | 9:00 AM | |
| 4 | single-day | Friday | April | 2022 | 2022-04-01 | Promotion Committee | 9:00 AM | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 584 | multi-day | Thursday | December | 2025 | 2025-12-18 | Shop Small, Shop Local Ocean Beach | 12:00 AM | 11:45 PM |
| 585 | multi-day | Friday | December | 2025 | 2025-12-19 | Shop Small, Shop Local Ocean Beach | 12:00 AM | 11:45 PM |
| 586 | single-day | Wednesday | December | 2025 | 2025-12-17 | OB Farmers Market | 4:00 PM | |
| 587 | single-day | Wednesday | December | 2025 | 2025-12-24 | OB Farmers Market | 4:00 PM | |


With this organized DataFrame, I can start performing some analysis. For example, I might want to quantify how frequently each unique event occurs each year. These counts might reveal which events are most popular or meaningful in my community, and comparing them over multiple years might reveal shifts in interest. To preserve the original data, I use the DataFrame’s `copy()` method to create a separate version that I can manipulate safely.

Next, I use Pandas’ `groupby` with `transform('count')` to add a new column, `count`, representing the number of occurrences of each event per year. I then remove duplicate rows so each event-year combination appears once, and sort the results by year and event frequency. For ease of comparison, I limit the output to the 10 most frequently occurring events per year.
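
On a small made-up frame, the counting steps look roughly like this. The `toy` data is invented for illustration and only mimics the `year` and `event` columns of the scraped DataFrame (with `year` stored as a string, as it is there).

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "year":  ["2024", "2024", "2024", "2025", "2025"],
    "event": ["OB Farmers Market", "OB Farmers Market", "Board of Directors",
              "OB Farmers Market", "Board of Directors"],
})

counts = toy.copy()   # leave the original frame untouched
# transform('count') broadcasts each group's size back onto every row of the group
counts["count"] = counts.groupby(["year", "event"])["event"].transform("count")
counts = (
    counts.drop_duplicates(subset=["year", "event"])   # one row per event-year
          .sort_values(by=["year", "count"], ascending=False)
          [["event", "count", "year"]]
          .reset_index(drop=True)
)

top_2024 = counts.loc[counts["year"] == "2024"].head(10).reset_index(drop=True)
print(top_2024)
```

Unlike a plain aggregation, `transform` keeps the original row count, which is why the `drop_duplicates` step is needed afterwards.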

In [5]:
import copy  # not actually required: df.copy() below is the pandas DataFrame.copy() method

In [6]:
df2 = df.copy()
df2['count'] = df2.groupby(['year','event'])['event'].transform('count')
df2 = df2.drop_duplicates(subset=['year','event'])
df2 = df2.sort_values(by = ['year','count'], ascending = False)[['event','count','year']].reset_index(drop=True)

In [7]:
display(df2.loc[df2['year'] == '2023'].reset_index(drop=True).head(10))
display(df2.loc[df2['year'] == '2024'].reset_index(drop=True).head(10))
display(df2.loc[df2['year'] == '2025'].reset_index(drop=True).head(10))


| | event | count | year |
|---|---|---|---|
| 0 | OB Farmers Market | 43 | 2023 |
| 1 | Promotion Committee | 12 | 2023 |
| 2 | Board of Directors | 9 | 2023 |
| 3 | Cymbiotika San Diego Open | 8 | 2023 |
| 4 | Finance Task Force | 4 | 2023 |
| 5 | Made in PL + OB Holiday Market | 2 | 2023 |
| 6 | Free Presentation "New Laws for 2023" at Newbr... | 1 | 2023 |
| 7 | FY23 OBMA Annual Meeting | 1 | 2023 |
| 8 | Event: TEA The Eternal Art – January 15th | 1 | 2023 |
| 9 | 2023 OBMA Annual Awards Celebration | 1 | 2023 |


| | event | count | year |
|---|---|---|---|
| 0 | OB Farmers Market | 52 | 2024 |
| 1 | Shop Small, Shop Local Ocean Beach | 23 | 2024 |
| 2 | Trivia Nights at Dirty Birds OB | 16 | 2024 |
| 3 | Lizzie: The Musical at Wildsong Theater and Ar... | 13 | 2024 |
| 4 | Promotion Committee | 12 | 2024 |
| 5 | Finance Task Force | 12 | 2024 |
| 6 | Board of Directors | 12 | 2024 |
| 7 | OB Planning Board | 10 | 2024 |
| 8 | Footloose, the Musical at OB Playhouse | 6 | 2024 |
| 9 | The San Diego Audubon Society Bird Festival: B... | 5 | 2024 |


| | event | count | year |
|---|---|---|---|
| 0 | OB Farmers Market | 53 | 2025 |
| 1 | Shop Small, Shop Local Ocean Beach | 23 | 2025 |
| 2 | OB Planning Board | 12 | 2025 |
| 3 | Finance Task Force | 12 | 2025 |
| 4 | Board of Directors | 12 | 2025 |
| 5 | Economic Vitality Committee | 11 | 2025 |
| 6 | Promotion Committee | 11 | 2025 |
| 7 | Design Committee | 11 | 2025 |
| 8 | Clean & Safe Committee | 11 | 2025 |
| 9 | OB Church Camp for Kids: K - 5th Grade | 5 | 2025 |


From this analysis, it is clear that one event takes precedence over all others: the OB Farmers Market, which occurs almost every week throughout the year. Over the past two years, a new promotional event has also emerged during the holiday season, aimed at encouraging shopping at locally owned businesses. Consistently present across all years are committee and board-led events, which involve decision-making, planning, and logistical support for community development, programs, and maintenance. Overall, it is evident that Ocean Beach is a highly active community, committed to supporting itself through grassroots initiatives and local engagement.

Reflections and Future Plans: This project gave me valuable hands-on experience working with real-world community data and highlighted that even well-structured sources can contain hidden inconsistencies. The analysis revealed the OB Farmers Market as a dominant recurring event, the rise of seasonal promotional activities, and the steady role of committee and board-led initiatives in supporting local governance. Future improvements could include adding attendance or economic impact metrics, expanding the dataset across more years or neighborhoods, and creating visual dashboards to make trends and patterns more accessible and actionable.