# Doctor Who Data Visualization

This notebook explores the Doctor Who datasets to identify anomalies and missing values that need cleaning, along with potential correlations that a machine learning model could be built around.

In [1]:
# Imports
import pandas as pd

In [4]:
# Load Episodes
df_eps = pd.read_csv('../Data/all-detailsepisodes.csv')

# first_diffusion appears to hold the air date, so rename it for clarity
df_eps.rename(columns={'first_diffusion': 'air_date'}, inplace=True)

# Preview
df_eps.head()

Unnamed: 0,episodeid,title,air_date,doctorid
0,4-3,The Power of the Daleks,"5 Nov, 1966",2
1,4-4,The Highlanders,"17 Dec, 1966",2
2,4-5,The Underwater Menace,"Jan 14, 1967",2
3,4-6,The Moonbase,"11 Feb, 1967",2
4,4-7,The Macra Terror,"11 Mar, 1967",2
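
The preview shows `air_date` in at least two layouts: day-first (`5 Nov, 1966`) and month-first (`Jan 14, 1967`). Below is a minimal sketch of one way to normalize them, assuming these two layouts are the only ones present (the full file may contain others). The names `AIR_DATE_FORMATS` and `parse_air_date` are hypothetical and not part of the notebook.

```python
from datetime import datetime

# Layouts seen in the preview; extend this tuple if the full file has others (assumption).
AIR_DATE_FORMATS = ("%d %b, %Y", "%b %d, %Y")


def parse_air_date(text):
    """Return a date for a known layout, or None if nothing matches."""
    if not isinstance(text, str):
        return None  # e.g. NaN read by pandas for a missing value
    for fmt in AIR_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # try the next layout
    return None


print(parse_air_date("5 Nov, 1966"))   # 1966-11-05
print(parse_air_date("Jan 14, 1967"))  # 1967-01-14
```

Applied to the notebook's frame, this could be used as `df_eps['air_date'].map(parse_air_date)`; rows that come back as `None` flag layouts the sketch does not handle yet.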


In [5]:
# Load Doctor Who Guide
df_guide = pd.read_csv('../Data/dwguide.csv')

# Column notes:
# - episodenbr may correspond to episodeid in df_eps, but the name and format would need to be standardized
# - views is a string (e.g. "7.20m") and needs cleaning to be numeric
# - share has missing values (empty in every previewed row)
# - AI is unclear in meaning
# - chart is unclear in meaning
# - cast is a JSON array stored as a string
# - crew is a JSON array stored as a string
# - summary has missing values

# Preview
df_guide.head()

Unnamed: 0,episodenbr,title,weekday,broadcastdate,broadcasthour,duration,views,share,AI,chart,cast,crew,summary
0,601,The King's Demons: Part Two,Wed,16 Mar 1983,6:47pm,00:24:27,7.20m,,63.0,66,"[{""role"":""The Doctor"",""name"":""Peter Davison""},...","[{""role"":""Writer"",""name"":""Terence Dudley""},{""r...",The Master and the Doctor fight for control ov...
1,602,The Five Doctors,Fri,25 Nov 1983,7:20pm,01:30:23,7.70m,,75.0,54,"[{""role"":""The Doctor"",""name"":""Peter Davison""},...","[{""role"":""Writer"",""name"":""Terrance Dicks""},{""r...",
2,603,Warriors of the Deep: Part One,Thu,5 Jan 1984,6:41pm,00:24:48,7.60m,,65.0,51,"[{""role"":""The Doctor"",""name"":""Peter Davison""},...","[{""role"":""Writer"",""name"":""Johnny Byrne""},{""rol...",Seabase 4 is on alert as a strange craft appro...
3,604,Warriors of the Deep: Part Two,Fri,6 Jan 1984,6:41pm,00:24:04,7.50m,,64.0,52,"[{""role"":""The Doctor"",""name"":""Peter Davison""},...","[{""role"":""Writer"",""name"":""Johnny Byrne""},{""rol...",The Silurians and Sea Devils send in the Myrka...
4,605,Warriors of the Deep: Part Three,Thu,12 Jan 1984,6:41pm,00:24:02,7.30m,,62.0,74,"[{""role"":""The Doctor"",""name"":""Peter Davison""},...","[{""role"":""Writer"",""name"":""Johnny Byrne""},{""rol...",As Vorshak's crew are cut down by Sauvix's Sea...
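
Two of the column notes for this dataset lend themselves to small helpers: `views` arrives as text such as `7.20m`, and `cast` / `crew` hold JSON arrays of `role` / `name` objects. The sketch below assumes that every `views` value ends in `m` (millions) and that the full, untruncated cell values are valid JSON; neither is confirmed by the preview. `parse_views_millions` and `parse_people` are hypothetical names.

```python
import json


def parse_views_millions(text):
    """Convert a string like '7.20m' to a float number of millions, or None."""
    if not isinstance(text, str):
        return None  # missing value
    cleaned = text.strip().lower()
    if not cleaned.endswith("m"):
        return None  # unexpected layout; inspect it rather than guess
    try:
        return float(cleaned[:-1])
    except ValueError:
        return None


def parse_people(text):
    """Decode a JSON array of role/name objects into (role, name) tuples."""
    if not isinstance(text, str):
        return []
    try:
        entries = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(entries, list):
        return []
    return [(e.get("role"), e.get("name")) for e in entries if isinstance(e, dict)]


print(parse_views_millions("7.20m"))  # 7.2
sample = '[{"role": "The Doctor", "name": "Peter Davison"}]'
print(parse_people(sample))  # [('The Doctor', 'Peter Davison')]
```

Both helpers return an empty result instead of raising, so they could be applied with `df_guide['views'].map(parse_views_millions)` or `df_guide['cast'].map(parse_people)` and the failures counted afterwards.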


In [7]:
# Load IMDB Details
df_imdb = pd.read_csv('../Data/imdb_details.csv')

# Simplify column name
df_imdb.rename(columns={'nbr_votes': 'votes'}, inplace=True)

# TODO: season and number are both available, so it should be possible to standardize an identifier across all 3 datasets

# Preview
df_imdb.head()

Unnamed: 0,number,title,rating,votes,description,season
0,1,Rose,7.6,6504,When ordinary shop-worker Rose Tyler meets a m...,1
1,2,The End of the World,7.6,5684,The Doctor takes Rose to the year 5 billion to...,1
2,3,The Unquiet Dead,7.6,5326,The Doctor has great expectations for his late...,1
3,4,Aliens of London,7.0,5116,The Doctor returns Rose to her own time - well...,1
4,5,World War Three,7.1,4943,The Slitheen have infiltrated Parliament and h...,1
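
The TODO about a shared identifier could start from something like the sketch below. It assumes that `episodeid` in `df_eps` follows a `<season>-<number>` pattern (as `4-3` suggests) and that `season` and `number` in `df_imdb` can be combined the same way. Neither assumption is confirmed by the previews, and the datasets may number seasons differently. `make_episode_id` and `split_episode_id` are hypothetical helpers.

```python
def make_episode_id(season, number):
    """Build a '<season>-<number>' key, e.g. (1, 5) -> '1-5'."""
    return f"{int(season)}-{int(number)}"


def split_episode_id(episode_id):
    """Split '<season>-<number>' into two ints, or return None if it does not fit."""
    parts = str(episode_id).split("-")
    if len(parts) != 2:
        return None
    try:
        return int(parts[0]), int(parts[1])
    except ValueError:
        return None


print(make_episode_id(1, 5))    # 1-5
print(split_episode_id("4-3"))  # (4, 3)
print(split_episode_id("601"))  # None
```

Values such as `601` in `episodenbr` do not fit this pattern, so `df_guide` would need a separate mapping before all three datasets could share one key.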
