# Doctor Who - Actor Timeline

## Topics & Techniques Covered

* Extracting text data from a Wikipedia table
* Pandas `.read_html()`
* Regular expressions (Python `re` module)
* Getting time series data out of text data
* Python's `datetime` module and the `dateutil` package
* Timeline visualization

In this example, we will be creating a barplot that shows which actors played The Doctor on Doctor Who at which times. There is a Wikipedia article that shows this in a table, but we'd like to visualize the data.

https://en.wikipedia.org/wiki/List_of_actors_who_have_played_the_Doctor

## Imports

In [1]:
import re
import datetime
import requests

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### `requests`

We will be using the `requests` module to perform a "get" HTTP request to the aforementioned Wikipedia page.

The most commonly used HTTP request methods are `get`, `put`, `post`, and `delete` (others, such as `head` and `patch`, also exist). We only need to read the page, not change it, so we'll just be using a `get` request.

For a more extensive tutorial on the `requests` module and on web-scraping, please see the archived "Practical Python" workshop materials on the Library's "[Introduction to Python](https://libguides.libraries.claremont.edu/intro-to-python)" Research Guide.

Once we get the page's data, we will use pandas to convert the table in question to a DataFrame, and then use regular expressions to clean the data and make it usable.

In [2]:
#This next line may be commented out after running to avoid sending
#more requests than necessary; websites typically rate-limit requests
r = requests.get('https://en.wikipedia.org/wiki/List_of_actors_who_have_played_the_Doctor')
page_data = r.content
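One way to avoid repeated requests without commenting lines in and out is to keep a local copy of the page. The following is a minimal sketch, not part of the original workshop code: the names `cached_fetch`, `cache_file`, and `download` are hypothetical, and the 30-second timeout is an arbitrary choice.

```python
from pathlib import Path


def cached_fetch(url, cache_file, download=None):
    """Return the page bytes, downloading only when no cached copy exists."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        # A copy is already on disk, so no request is sent
        return cache_file.read_bytes()

    if download is None:
        import requests  # imported lazily so cached reads work offline

        def download(target):
            response = requests.get(target, timeout=30)
            response.raise_for_status()  # fail loudly on 4xx/5xx responses
            return response.content

    data = download(url)
    cache_file.write_bytes(data)
    return data
```

Under these assumptions, `page_data = cached_fetch(url, 'doctor_page.html')` would contact the server the first time only; deleting the file forces a fresh download.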

### pandas read_html (The Easy Way)

In [3]:
wiki_data = pd.read_html(page_data)

The `pd.read_html()` function grabs *all* the tables it finds in the HTML and returns them as a list of DataFrames. Luckily for us, the table we want is the first one on the page, so we can look at the table at index `[0]` of that list.

In [None]:
wiki_data[0]

We want to create a visualization of how long each actor portrayed The Doctor. The chronological data in this dataset is all in the "Tenure" column. However, both the start and end dates are in the same field among other text and punctuation.

Fortunately for our purposes, all the Doctors' dates are listed in a predictable pattern: a one-or-two-digit number, the full name of the month, then the four-digit year. We can use the Python regular expressions module `re` to extract their starting and ending years.

### Regular Expressions

What are regular expressions?

[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) (or regex for short) are sequences of characters that can be used to find segments of text that match specified characteristics. They are used in search engines and in-page/in-document searches.

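Regex is not a programming language unto itself, but rather a syntax for pattern-recognition. The core syntax is largely shared across implementations in different programming languages and software systems, though details vary between regex "flavors", so it is worth testing a pattern in the environment where it will run.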

There are many free resources for learning regex patterns; you will not need to memorize them in order to use them, but you will need to understand how they are assembled.

Both regexr and regex101 offer live platforms to test out patterns on sample text; this can be a very effective means of developing new regex patterns. 

[regexr](https://regexr.com/)

[regex101](https://regex101.com/)

If you *really* want to test your understanding of regex syntax, you can do so with regex crossword puzzles:

[Regex Crossword](https://regexcrossword.com/)

In [None]:
#Select every other row of the original DataFrame by using a stride of 2
doctors = list(wiki_data[0]['Actor (role)'][::2])

In [None]:
doctors

In [None]:
#Use a list comprehension, splitting on ' (' and removing the closing parenthesis:
doctor_numbers = [doctor.split(' (')[1].replace(')','') for doctor in doctors]
doctor_numbers

In [None]:
doctors = [doctor.split(' (')[0] for doctor in doctors]
doctors
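The same split can be done in one pass with a regular expression and capture groups. This is a sketch that assumes every cell follows a "Name (role)" layout; the sample strings below are illustrative rather than scraped, and the names `pattern`, `names`, and `roles` are made up for this example.

```python
import re

# Illustrative cells in the assumed "Name (role)" layout
samples = ['William Hartnell (First Doctor)', 'David Tennant (Tenth Doctor)']

# Group 1: everything before the parenthesis; group 2: the text inside it
pattern = re.compile(r'^(.+?)\s*\((.+)\)$')

pairs = [pattern.match(sample).groups() for sample in samples]
names = [name for name, _ in pairs]
roles = [role for _, role in pairs]

print(names)  # ['William Hartnell', 'David Tennant']
print(roles)  # ['First Doctor', 'Tenth Doctor']
```

If a cell did not fit the pattern, `pattern.match()` would return `None` and the `.groups()` call would raise an error, which makes unexpected formats easy to spot.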

In [None]:
#Original DataFrame has date information in every other row
#Select every other row of the original DataFrame:
tenure = list(wiki_data[0]['Tenure'][::2])

# The `re` Module

Regular expressions in Python are handled using the `re` module (part of the Python Standard Library), which contains several functions for matching strings. `re.findall()` returns all non-overlapping matches of a pattern in a given string as a list, scanned left-to-right.
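As a quick illustration of `re.findall()`, here is a small sketch on a made-up string in the style of a Tenure cell (the sample text is an assumption, not pulled from the table):

```python
import re

sample = '23 November 1963 – 29 October 1966'  # illustrative Tenure-style text

# Every run of exactly four digits, in left-to-right order
years = re.findall(r'\d{4}', sample)
print(years)  # ['1963', '1966']

# Matches never overlap: once '12' is consumed, scanning resumes at '3'
pairs = re.findall(r'\d\d', '12345')
print(pairs)  # ['12', '34']
```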

In [None]:
# Create empty lists to store actors' start and end dates as Doctor

start_years = []
end_years = []

# Iterate through table, pulling first and optionally second years from each Tenure cell:
for entry in tenure:
    years = re.findall(r'\d\d\d\d', entry)
    present = re.findall(r'present', entry)
    start_years.append(years[0])
    
    if len(years) < 2:
        # If the Doctor is the current Doctor, append current year
        # (as a string, to match the years returned by re.findall):
        if present:
            end_years.append(str(datetime.date.today().year))
        # If there is no second year but the Doctor is not
        # the current Doctor, append first year to both lists
        else:
            end_years.append(years[0])
    else:
        end_years.append(years[1])

But can we get it to be even more granular? What if we wanted the full date, not just the year?

How can we construct a regular expression that captures both "1 January 2010" and "25 December 2013" but which excludes "3 years, 11 months"?

## CODING EXERCISE 

Try it yourself on Regexr or regex101! Both of these sites have a "cheat sheet"/"quick reference" section that shows you components of regex patterns.





In [None]:
print(wiki_data[0].iloc[20]['Tenure'])

In [None]:
wiki_data[0].iloc[20]['Tenure'].replace(u'\xa0',' ')

`\xa0` is the Unicode character for a [non-breaking space (NBSP)](https://en.wikipedia.org/wiki/Non-breaking_space). This is a special character that prevents a line break between two words, so "11 months" won't end up split across two lines. We're going to replace all instances of this character with a regular space so the text is easier to read and to paste into a regex tester. In Python, `\s` matches a non-breaking space in ordinary (`str`) patterns, so the regex pattern will work regardless of its presence.

Once you have found a pattern that matches dates in the given format, try pasting it into the raw string assigned to `search_string` in the cell below:

In [None]:
################################################################################
################################################################################

entry = wiki_data[0].iloc[20]['Tenure'].replace(u'\xa0',' ')

search_string = r'____'

result = re.findall(search_string, entry)

result

################################################################################
################################################################################

### Solution

There are several ways to get the correct matches using regex, but this is one that works reliably in this particular case:

We start with `\d+` which means "at least one digit", then we add a `\s+` (at least one whitespace character).

There are several options for what we could add next to capture this pattern, but we're going to go with `\D+` here, which means "at least one character that *isn't* numeric."

Then we'll round it out with another "some amount of whitespace", and finally `\d{4}` (exactly four numeric digits).

This will only match text that meets all the criteria above. A duration such as "3 years, 11 months" has no four-digit number following the word, so it won't be matched accidentally.

In [None]:
################################################################################
################################################################################

entry = wiki_data[0].iloc[20]['Tenure'].replace(u'\xa0',' ')

search_string = r'\d+\s+\D+\s+\d{4}'

result = re.findall(search_string, entry)

result

################################################################################
################################################################################
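A slightly stricter variation is sketched below, under the assumption that month names contain only unaccented letters. It replaces `\D+` with `[A-Za-z]+` so that punctuation cannot be absorbed into the "month" part, and uses `re.VERBOSE` to document each piece. The sample string is made up for illustration, and `date_pattern` is a name introduced here.

```python
import re

date_pattern = re.compile(r'''
    \d{1,2}      # day: one or two digits
    \s+          # at least one whitespace character
    [A-Za-z]+    # month name: letters only
    \s+          # at least one whitespace character
    \d{4}        # four-digit year
''', re.VERBOSE)

# Illustrative text; the non-breaking spaces are left in on purpose
sample = '1\xa0January 2010 – 25\xa0December 2013 (3 years, 11 months)'

matches = date_pattern.findall(sample)
print(matches)
```

Both dates are found even though the non-breaking spaces were not replaced, and the duration in parentheses is skipped.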

## Extracting the Dates

In [None]:
# Create empty lists to store actors' start and end dates as Doctor

start_dates = []
end_dates = []

# Iterate through table, pulling first and (if present) second dates from each Tenure cell:
for entry in tenure:

    #Replace all non-breaking spaces with regular spaces
    entry = entry.replace(u'\xa0',' ')

    #Look for instances of predetermined regex pattern
    dates = re.findall(search_string, entry)
    
    print(dates)
    present = re.findall(r'present', entry)
    start_dates.append(dates[0])

    #If only one date listed:
    #For "present" use today's date
    #For a single date not "present", use for both start and end date:
    
    if len(dates) < 2:
        # If the Doctor is the current Doctor, append today's date
        # in the same "day month year" format as the table:
        if present:
            end_dates.append(datetime.date.today().strftime('%d %B %Y'))
        # If there is no second date but the Doctor is not
        # the current Doctor, append first date to both lists
        else:
            end_dates.append(dates[0])
    else:
        end_dates.append(dates[1])

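For a reference to the format codes accepted by `strftime()` (such as the `%d %B %Y` used above), see https://strftime.org/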

In [None]:
#Create empty DataFrame
doctor_tenures = pd.DataFrame()

#Create columns for Doctor, Doctor Number, Start and End dates
doctor_tenures['Doctor'] = doctors
doctor_tenures['Doctor Number'] = doctor_numbers
doctor_tenures['Start Date'] = start_dates
doctor_tenures['End Date'] = end_dates

doctor_tenures

## QUESTION - Best Practices for Date Formatting

What do you suppose the most useful date format is from a programming perspective?

#### (Click to reveal)

YYYYMMDD has the advantage of being chronologically sortable when read as a simple integer value, since its time increments are represented largest-to-smallest.

In [None]:
#MMDDYYYY
12132024 < 12112025

In [None]:
#YYYYMMDD
20241213 < 20251211
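The same point holds when dates are stored as strings. The sketch below uses three arbitrary dates (chosen only for illustration) and sorts them as plain text in both layouts:

```python
import datetime

dates = [
    datetime.date(2025, 12, 11),
    datetime.date(2024, 12, 13),
    datetime.date(2025, 1, 5),
]

ymd = [d.strftime('%Y%m%d') for d in dates]  # YYYYMMDD
mdy = [d.strftime('%m%d%Y') for d in dates]  # MMDDYYYY

print(sorted(ymd))  # ['20241213', '20250105', '20251211'] -> chronological
print(sorted(mdy))  # ['01052025', '12112025', '12132024'] -> not chronological
```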

## A Brief History of Datetime (and Dateutil)

Python's built-in `datetime` module is a powerful tool for reformatting dates and times so they can be applied across different timestamp paradigms, whether to make them more human-readable or more machine-readable. `datetime` is host to a number of useful classes. These include the redundantly named `datetime`, which combines the `date` class and the `time` class; `timedelta`, which expresses the difference between two `date` or `datetime` objects; and `timezone` and `tzinfo`, which handle time zones.

However, `datetime` leaves something to be desired in ease of use. The `dateutil` package is an extension to `datetime` that makes parsing and formatting dates easier. It adds capabilities for calculating time deltas in different units, more precise handling of time zones, and, most importantly for our purposes, robust support for parsing differently formatted dates.
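To see what the standard library offers on its own, here is a brief sketch using only `datetime`. The two dates are illustrative, and note that `strptime()` must be told the exact layout of the string (the `%B` code also assumes English month names in the current locale):

```python
import datetime

# strptime needs the format spelled out with format codes
start = datetime.datetime.strptime('23 November 1963', '%d %B %Y')
end = datetime.datetime.strptime('29 October 1966', '%d %B %Y')

# Subtracting two datetime objects gives a timedelta
tenure = end - start
print(type(tenure).__name__)  # timedelta
print(tenure.days)            # 1071

# strftime goes the other way: datetime -> formatted string
print(start.strftime('%Y-%m-%d'))  # 1963-11-23

# A datetime bundles a date part and a time part
print(start.date(), start.time())  # 1963-11-23 00:00:00
```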

In [None]:
from dateutil import parser

The dateutil parser can interpret dates in a number of different formats. It defaults to reading dates with the US convention of MM-DD-YYYY formatting, but it can be set to parse [different date formats](https://en.wikipedia.org/wiki/List_of_date_formats_by_country). 

It can also interpret months written as words and day numbers written as ordinals ("1st", "2nd", "3rd", etc.), though it cannot parse ordinals written out as words (e.g. "June third, 2002").

In [None]:
parser.parse('6 December 1989')

In [None]:
parser.parse('December 6 1989')

In [None]:
parser.parse('12 6 1989')

In [None]:
#Dateutil's parser defaults to reading a date as MM-DD-YYYY when it's ambiguous
parser.parse('6/12/1989')

In [None]:
#The "dayfirst" parameter lets you change the default behavior
parser.parse('6/12/1989', dayfirst=True)

In [None]:
#The parser will read DD-MM-YYYY dates correctly when the date cannot be in MM-DD-YYYY format,
#even without the "dayfirst" parameter (there is no 13th month)
parser.parse('13/12/1989')

In [None]:
#The parser works with non-numeric date indicators as well.
parser.parse('June 3rd, 2002')

In [None]:
parser.parse('3rd June, 2002')

### CODING EXERCISE - `dateutil` Parser
Try passing a date into the dateutil parser. See for yourself which formats work and which do not.

In [None]:
################################################################################
################################################################################

parser.parse('3rd June, 2002')

################################################################################
################################################################################
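When trying many formats, it can help to wrap the parser so that a failure does not stop the cell. The helper below is a sketch; `try_parse` is a name invented for this example, and it relies on `dateutil` reporting unparseable strings with an exception derived from `ValueError`.

```python
from dateutil import parser


def try_parse(text, **kwargs):
    """Return a datetime, or None when dateutil cannot interpret the text."""
    try:
        return parser.parse(text, **kwargs)
    except (ValueError, OverflowError):
        return None


for candidate in ['3rd June, 2002', '2002-06-03', 'June third, 2002']:
    print(repr(candidate), '->', try_parse(candidate))
```

Keyword arguments are passed straight through, so `try_parse('6/12/1989', dayfirst=True)` behaves like the `dayfirst` example earlier.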

We'll be using the dateutil parser within an `apply(lambda)` statement, as seen in `PPW1 - ManagerSalary.ipynb`.

In [None]:
doctor_tenures['Start Date'] = doctor_tenures['Start Date'].apply(lambda x: parser.parse(x))
doctor_tenures['End Date'] = doctor_tenures['End Date'].apply(lambda x: parser.parse(x))

In [None]:
doctor_tenures

Note that the parser normally returns a `datetime.datetime` object, such as `datetime.datetime(1989, 12, 6, 0, 0)`. Once stored in a pandas DataFrame column, the values are converted to pandas `Timestamp` objects and displayed in YYYY-MM-DD format. Timestamps support comparison and arithmetic, so we can sort them and subtract one from another.

In [None]:
for x in doctor_tenures['Start Date']:
    print(x, type(x))

In [None]:
for x in doctor_tenures['End Date']:
    print(x, type(x))

In [None]:
doctor_tenures

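Next, we can create a new column that holds each pair of start/end dates as a tuple wrapped in a one-element list. Wrapping each tuple in a list lets us concatenate an actor's stints later on.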

In [None]:
doctor_tenures['StartEnd'] = [(start, end) for start, end in zip(doctor_tenures['Start Date'], doctor_tenures['End Date'])]
doctor_tenures['StartEnd'] = [[entry] for entry in doctor_tenures['StartEnd']]

In [None]:
doctor_tenures

We can create a dictionary that stores the start date of each performer.

In [None]:
#Create empty dictionary
start_date_timeline = {}

#Store each actor's FIRST start date as the value, with the actor as the key:
for key, value in zip(doctor_tenures['Doctor'], doctor_tenures['Start Date']):
    #Skip actors already stored, so a later stint does not overwrite the first:
    if key not in start_date_timeline:
        start_date_timeline[key] = value

In [None]:
start_date_timeline

In [None]:
doctor_tenures

Next we can get the durations of each performer's stint as The Doctor by subtracting the start date timestamp from the end date timestamp, in our StartEnd column.

In [None]:
doctor_tenures['timediff'] = doctor_tenures['StartEnd'].apply(lambda x: [y[1] - y[0] for y in x])

## What About David Tennant? - Pandas `.groupby()`

As both the Tenth and Fourteenth Doctor, David Tennant is an anomaly among Doctors. If we want to avoid redundant entries, we need to use Pandas `.groupby()` to combine his entries.

This is a fairly complex sequence of nested object methods: 

First, we use `.groupby()` to merge rows with the same value in the specified column ("Doctor").

Then we specify the columns to be combined ("StartEnd", "timediff") and use `.sum()` to add the values (or in this case, concatenate them, since adding two lists joins them end to end).

Next, we use `.sort_values()` to reorder the entries based on which actor started first (following the order specified in `start_date_timeline`).

The `key` parameter of `.sort_values()`, combined with `.map()`, lets us sort by each actor's first start date from our dictionary rather than by the default alphabetical order of the names.

In [None]:
doc_tenures_graph = doctor_tenures.groupby('Doctor', as_index=False)[['StartEnd', 'timediff']].sum().sort_values(by=['Doctor'], key=lambda x: x.map(start_date_timeline))

doc_tenures_graph.reset_index(inplace=True, drop=True)
doc_tenures_graph
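Because the chained call above does a lot at once, here is the same idea on a toy DataFrame. Everything in this sketch is hypothetical (the actor names, the year pairs, and the `first_start` dictionary); it only illustrates how `.sum()` concatenates lists within a group and how `key` with `.map()` imposes a custom order.

```python
import pandas as pd

# Toy data (not scraped): 'Actor B' appears twice, like David Tennant
toy = pd.DataFrame({
    'Doctor': ['Actor B', 'Actor A', 'Actor B'],
    'StartEnd': [[(2005, 2010)], [(1963, 1966)], [(2022, 2023)]],
})

# First start year per actor, used as the custom sort order
first_start = {'Actor A': 1963, 'Actor B': 2005}

merged = (
    toy.groupby('Doctor', as_index=False)[['StartEnd']]
       .sum()  # list + list joins the lists, so stints are concatenated
       .sort_values(by='Doctor', key=lambda col: col.map(first_start))
       .reset_index(drop=True)
)

print(merged)
```

After the merge, 'Actor B' occupies a single row whose `StartEnd` list holds both stints.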

# Visualization

[Example of Timeline Plot](https://deparkes.co.uk/2021/09/05/python-timeline-plot/) for period settings of television shows set in the UK. We can base our visualization on this example, but we will need to alter some lines of code to better suit our purposes.

Please note: the UK television link contains Plotly examples that refer to this type of chart as a "Gantt chart"; while there are similarities between Gantt charts and this kind of visualization, [Gantt charts](https://en.wikipedia.org/wiki/Gantt_chart) are specifically designed for coordinating tasks for project scheduling.

![image.png](attachment:82e15824-fd41-4dd3-95f9-0fa064147c08.png)

## Doctor Timeline

Following the template in the example above, we can construct our own broken horizontal bar graph to show the durations of each actor's tenure. We're going to try to get a working prototype before we consider how to deal with David Tennant's particular situation.

### Graph 1

In [None]:
#"padding" will give some dimension to extremely short entries
#Paul McGann's stint as the Doctor has 0 length, so it would be invisible without padding
padding = datetime.timedelta(days=31)

#Create the figure and axes for the timeline:
fig, gnt = plt.subplots(figsize = (14, 10))

#Set labels for Y-Axis
y_tick_labels = doctor_tenures.sort_values(by='Start Date')['Doctor']
y_pos = np.arange(len(y_tick_labels))
gnt.set_yticks(y_pos)
gnt.set_yticklabels(y_tick_labels)

#Plot the actors' stints, sorting by their start dates
for index, row in doctor_tenures.sort_values(by='Start Date').reset_index().iterrows():
    start_date = row['Start Date']
    duration = row['timediff'][0]
    
    gnt.broken_barh([(start_date-padding, duration+padding)], 
                    (index-0.5,0.8), 
                    facecolors =('cornflowerblue'),
                   label=row['Doctor'])
    gnt.text(start_date+padding, index-0.2, row['Doctor'])

### Graph 2

There are two David Tennants... while in the Whoniverse, the Tenth and Fourteenth Doctors are separate regenerations of the same entity, the real David Tennant has gone through no such transformative process (at least, not that we know of). Let's tidy this up by aligning both his stints as The Doctor in the same row.

We can accomplish this by using list comprehensions for each actor's start dates and the durations of their tenures, and then adding features to the graph based on the contents of those freshly-generated lists.

In [None]:
#"padding" will give some dimension to extremely short entries
#Paul McGann's stint as the Doctor has 0 length, so it would be invisible without padding
padding = datetime.timedelta(days=31)

#Create the figure and axes for the timeline:
fig, gnt = plt.subplots(figsize = (14, 10))

#Set labels for Y-Axis
y_tick_labels = doc_tenures_graph['Doctor']
y_pos = np.arange(len(y_tick_labels))
gnt.set_yticks(y_pos)
gnt.set_yticklabels(y_tick_labels)

#Plot the actors' stints, sorting by their INITIAL start dates
for index, row in doc_tenures_graph.iterrows():

    #Use list comprehensions to create sequence of start/end dates for each performer:
    start_dates = [x[0] for x in row['StartEnd']]
    durations = [y for y in row['timediff']]

    gnt.broken_barh([(start_date-padding, duration+padding) for start_date, duration in zip(start_dates, durations)], 
                    (index-0.5,0.8), 
                    facecolors =('cornflowerblue'),
                   label=row['Doctor'])
    gnt.text(start_dates[0]+padding*10, index-0.2, row['Doctor'])
    if len(start_dates) > 1:
        gnt.text(start_dates[1]+padding*10, index-0.2, row['Doctor'])

plt.title('Doctor Who - Actor Timeline')
plt.ylabel('Performer')
plt.xlabel('Year')

plt.show()
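For reference, the core of `broken_barh()` can be seen in a stripped-down sketch that uses plain year numbers instead of timestamps. The performers and years below are hypothetical, and the `Agg` backend line is only there so the sketch runs without a display; inside a notebook it can be omitted.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical stints as (start_year, end_year) pairs; one performer has two
stints = {
    'Actor A': [(1963, 1966)],
    'Actor B': [(2005, 2010), (2022, 2023)],
}

fig, ax = plt.subplots(figsize=(8, 2))
xranges_by_actor = {}
bars = {}

for row, (name, spans) in enumerate(stints.items()):
    # broken_barh expects (start, width) pairs, not (start, end) pairs
    xranges = [(start, end - start) for start, end in spans]
    xranges_by_actor[name] = xranges
    # All of one performer's segments share a row: (y position, bar height)
    bars[name] = ax.broken_barh(xranges, (row - 0.4, 0.8),
                                facecolors='cornflowerblue')

ax.set_yticks(range(len(stints)))
ax.set_yticklabels(list(stints))
plt.close(fig)
```

Each call draws every segment for one performer on the same row, which is how both of an actor's stints end up aligned in Graph 2.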

# End of Module 2

*© 2025. This work is openly licensed via CC BY 4.0*