# 1 Data Collection

The data for analyzing social graphs and interactions was collected by scraping the Fandom wiki of the Grey's Anatomy universe in the following three main steps.

1. Scraping the list of Grey's Anatomy characters

2. Scraping and cleaning the character pages of every character in the character list

3. Scraping the episode and season summaries

The scraping covers all characters of all three shows in the Grey's Anatomy universe: Grey's Anatomy as the main show, and Private Practice and Station 19 as the spin-offs.


In [1]:
!pip install networkx

import warnings
warnings.filterwarnings("ignore")

import json
import pandas as pd
import urllib.request
import re
import networkx as nx
import numpy as np
import requests
from sklearn import preprocessing
import matplotlib.pyplot as plt
%matplotlib inline


## 1.1 Scraping the character list

The first step of the data gathering is to get the list of all characters from the Fandom page.

In [9]:
session = requests.Session()

url = "https://greysanatomy.fandom.com/api.php"

In [10]:
params = {
    "format": "json",
    "list": "categorymembers",
    "action": "query",
    "cmtitle": "Category:Characters",
    "cmlimit": "500",
    "cmcontinue": ""
}
request = session.get(url=url, params=params)
data = request.json()

pages = data["query"]["categorymembers"]

while 'continue' in data.keys():
    params["cmcontinue"] = data["continue"]["cmcontinue"]
    request = session.get(url=url, params=params)
    data = request.json()
    pages.extend(data["query"]["categorymembers"])

with open('data/characters.json', 'w', encoding='utf-8') as file:
    json.dump(pages, file, ensure_ascii=False, indent=4)
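
The continuation loop above can also be written as a small helper. The following is a minimal sketch, not the notebook's own code: `fetch_all_category_members` and the `fetch` callable are hypothetical names, and `fetch` stands in for `session.get(url=url, params=params).json()` so that the paging logic can be tried without network access. The two fake result pages are made-up data.

```python
from typing import Callable, Dict, List


def fetch_all_category_members(fetch: Callable[[Dict[str, str]], dict],
                               category: str,
                               limit: int = 500) -> List[dict]:
    """Collect every member of a category by following `cmcontinue` tokens.

    `fetch` is a hypothetical stand-in for
    `session.get(url=url, params=params).json()`.
    """
    params = {
        "format": "json",
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": str(limit),
    }
    members: List[dict] = []
    while True:
        data = fetch(dict(params))
        members.extend(data["query"]["categorymembers"])
        if "continue" not in data:
            return members
        # the response carries the token needed to request the next batch
        params["cmcontinue"] = data["continue"]["cmcontinue"]


# Offline demonstration with two fake result pages (made-up data).
fake_pages = {
    None: {"query": {"categorymembers": [{"pageid": 1, "title": "A"}]},
           "continue": {"cmcontinue": "page|2"}},
    "page|2": {"query": {"categorymembers": [{"pageid": 2, "title": "B"}]}},
}


def fake_fetch(params: Dict[str, str]) -> dict:
    return fake_pages[params.get("cmcontinue")]


all_members = fetch_all_category_members(fake_fetch, "Category:Characters")
print([m["title"] for m in all_members])  # ['A', 'B']
```

Passing the fetch function in as an argument keeps the paging logic separate from the HTTP session, which makes it easy to check with fake responses.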

## 1.2 Scraping and cleaning the character pages

Scraping and cleaning the character pages involves the following main steps.
1. Scrape the character wiki pages and save them in a folder, based on the character list
2. Count the occurrences of each character in every show (Grey's Anatomy, Private Practice, Station 19), normalized by the number of seasons
3. Define the main universe of each character based on the occurrences
4. Define categories of characters (Doctors, Firefighters, Nurses, Patients, or Other, such as family or friends)
5. Extract the list of aliases and the status (alive or deceased) of each character from the infobox
6. Scrape the clean description of each character
7. Get the history of each character



In [11]:
# open the list of characters and load it into a dataframe
with open('data/characters.json', encoding='utf-8') as f:
    data = json.load(f)
df = pd.DataFrame(data)
df.sample(10)

Unnamed: 0,pageid,ns,title
2798,50318,0,Dr. Simcox
241,47482,0,Rachel Bishop
484,34824,0,Chris (Season 11)
3356,51013,14,Category:Patients (Endocrinology)
3365,8218,14,Category:Patients (Oncology)
2833,29861,0,Leanne Smith
2749,48863,0,Sharon (Stand By Me)
2750,86226,0,Sharon Peters
496,50776,0,Claire (Private Practice)
2256,48070,0,Mr. Nyles


In [12]:
# drop the ns column as it does not contain any valuable information
df.drop(columns = ['ns'], inplace=True)

In [13]:
# create a column for the file name to store the character page based on the character name in title
df['file'] = df['title']
df = df.replace({"file": {" ": "_", "/": "_", "\"": "", r"\?": ""}}, regex=True)
df = df[~df.file.str.contains("Unnamed_Characters")]
df = df[~df.file.str.contains("Unseen")]
df = df[~df.file.str.contains("User")]
df = df[~df.file.str.contains("Category")]
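
To make the effect of the replacements and filters above easier to see, here is a minimal sketch that applies the same rules to single titles. `title_to_filename` and `is_character_page` are hypothetical helper names that do not appear in the notebook, and the second example title is made up.

```python
import re

# Same substitutions as the replace() call above: spaces and slashes become
# underscores, double quotes and question marks are dropped.
SUBSTITUTIONS = [(r" ", "_"), (r"/", "_"), (r'"', ""), (r"\?", "")]
# Pages whose file name contains one of these markers are filtered out.
EXCLUDED_MARKERS = ("Unnamed_Characters", "Unseen", "User", "Category")


def title_to_filename(title: str) -> str:
    """Turn a page title into a file-system friendly name."""
    for pattern, replacement in SUBSTITUTIONS:
        title = re.sub(pattern, replacement, title)
    return title


def is_character_page(filename: str) -> bool:
    """Return False for category, user and similar non-character pages."""
    return not any(marker in filename for marker in EXCLUDED_MARKERS)


print(title_to_filename("Chris (Season 11)"))               # Chris_(Season_11)
print(title_to_filename('Who Is "He"?'))                    # Who_Is_He
print(is_character_page("Category:Patients_(Oncology)"))    # False
print(is_character_page("Rachel_Bishop"))                   # True
```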

In [14]:
# scrape the character pages and save them in a folder called character wikipages
for index, row in df.iterrows():
    pageid = row['pageid']
    title = row['file']
    
    query = "https://greysanatomy.fandom.com/api.php?action=query&pageids={}&prop=revisions&rvprop=content&format=json".format(pageid)
    
    wikiresponse = urllib.request.urlopen(query)
    wikidata = wikiresponse.read()
    wikitext = wikidata.decode('utf-8')
    data_json = json.loads(wikitext)
    
    with open("data/character wikipages/" + title+'.json', 'w') as f:
        json.dump(data_json, f)

In [15]:
# get the outlinks of each character page for the universe assignment and the connections between characters
df['outlinks_clean'] = ""
df['outlinks_clean'] = df['outlinks_clean'].astype('object')

for index, row in df.iterrows():
    title = row['file']
    pageid = str(row['pageid'])
    
    with open("data/character wikipages/" + title + ".json") as f:
        data = json.load(f)
    text = json.dumps(data)

    # capture the text between "[[" and "]]" for every wiki link
    res_clean = re.findall(r"\[\[(.*?)\]\]", text)
    
    df.at[index, 'outlinks_clean'] = res_clean
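
The collected links still carry display labels (for example `Private Practice|PP`) and mix page links with category links. The sketch below shows one possible way to separate them. It is an illustration only: `extract_links` is a hypothetical helper, and the sample wikitext, including the category names, is made up.

```python
import re

LINK_PATTERN = re.compile(r"\[\[(.*?)\]\]")


def extract_links(wikitext):
    """Return (page_links, categories) found in a piece of wikitext."""
    page_links, categories = [], []
    for inner in LINK_PATTERN.findall(wikitext):
        # "Target|label" links keep only the target page name
        target = inner.split("|", 1)[0].strip()
        if target.startswith("Category:"):
            categories.append(target)
        else:
            page_links.append(target)
    return page_links, categories


# Made-up wikitext, used only to illustrate the link formats.
sample = (
    "'''Taryn Helm''' is a [[General Surgery|surgical]] [[resident]] "
    "who works with [[Meredith Grey]]. "
    "[[Category:Doctors]] [[Category:GA S14 Characters]]"
)

links, cats = extract_links(sample)
print(links)  # ['General Surgery', 'resident', 'Meredith Grey']
print(cats)   # ['Category:Doctors', 'Category:GA S14 Characters']
```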

In [16]:
# count the occurrences of a character in the seasons of each show and normalize by the number of seasons
for index, row in df.iterrows():
    ga_count = 0
    pp_count = 0
    s19_count = 0
    for element in row['outlinks_clean']:
        if element.startswith("Category"):
            ga_count += element.count("GA S")
            pp_count += element.count("PP S")
            s19_count += element.count("S19 S")
        
    df.at[index, "ga_occurences"] = ga_count/19
    df.at[index, "pp_occurences"] = pp_count/6
    df.at[index, "s19_occurences"] = s19_count/6
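
As a worked example of the normalization above, the following sketch counts the season markers in a short list of outlinks and divides by the number of seasons of each show. `normalized_occurrences` is a hypothetical helper, and the category names in the sample are assumptions made for illustration; the real category titles on the wiki may differ.

```python
# marker searched in category links -> number of seasons of that show
SEASON_COUNTS = {"GA S": 19, "PP S": 6, "S19 S": 6}


def normalized_occurrences(outlinks):
    """Return the share of seasons per show marker for one character."""
    shares = {}
    for marker, n_seasons in SEASON_COUNTS.items():
        hits = sum(link.count(marker)
                   for link in outlinks
                   if link.startswith("Category"))
        shares[marker] = hits / n_seasons
    return shares


# Made-up outlinks of a character seen in two GA seasons and one S19 season.
sample_outlinks = [
    "Meredith Grey",
    "Category:GA S1 Characters",
    "Category:GA S2 Characters",
    "Category:S19 S3 Characters",
]

shares = normalized_occurrences(sample_outlinks)
print(shares)  # GA: 2/19, PP: 0.0, S19: 1/6
```

Dividing by the number of seasons keeps the long-running main show from dominating a character who appears in only a few seasons of a shorter spin-off.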

In [17]:
# define the main universe based on the occurrences and drop characters without a main universe
for index, row in df.iterrows():
    if row['ga_occurences'] >= row['pp_occurences'] and row['ga_occurences'] >= row['s19_occurences'] and row['ga_occurences'] > 0:
        df.at[index, 'main_universe'] = "Grey's Anatomy"
    elif row['pp_occurences'] > row['ga_occurences'] and row['pp_occurences'] >= row['s19_occurences'] and row['pp_occurences'] > 0:
        df.at[index, 'main_universe'] = "Private Practice"
    elif row['s19_occurences'] > row['ga_occurences'] and row['s19_occurences'] > row['pp_occurences'] and row['s19_occurences'] > 0:
        df.at[index, 'main_universe'] = "Station 19"

df = df[~df.main_universe.isnull()]
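
The decision rule above can be restated compactly as "pick the show with the highest share of seasons, and break ties in the order Grey's Anatomy, Private Practice, Station 19". The sketch below is one way to express that reading. `main_universe` is a hypothetical function, and the raw counts in the examples are chosen for illustration.

```python
SEASONS = {"Grey's Anatomy": 19, "Private Practice": 6, "Station 19": 6}


def main_universe(ga_count, pp_count, s19_count):
    """Return the show with the highest share of seasons, or None."""
    shares = {
        "Grey's Anatomy": ga_count / SEASONS["Grey's Anatomy"],
        "Private Practice": pp_count / SEASONS["Private Practice"],
        "Station 19": s19_count / SEASONS["Station 19"],
    }
    # max() returns the first maximal key, so the dict order encodes the
    # tie-breaking priority of the if/elif chain.
    best = max(shares, key=shares.get)
    return best if shares[best] > 0 else None


print(main_universe(6, 0, 3))  # Station 19 (6/19 is smaller than 3/6)
print(main_universe(1, 0, 0))  # Grey's Anatomy
print(main_universe(0, 1, 1))  # Private Practice (tie with Station 19)
print(main_universe(0, 0, 0))  # None
```

The first call shows why normalization matters: six seasons of Grey's Anatomy is a smaller share (about 0.32) than three seasons of Station 19 (0.5).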

In [18]:
# create a column with the encoded universe
le = preprocessing.LabelEncoder()
df['universe_encoded'] = le.fit_transform(df['main_universe'])

In [19]:
# define categories of characters (Doctor, Firefighter, Nurses, Patient, or Other such as family or friends)
for index, row in df.iterrows():
    if 'Category:Doctors' in row['outlinks_clean'] :
        df.at[index, 'category'] = "Doctor"
    elif 'Category:Firefighters' in row['outlinks_clean']:
        df.at[index, 'category'] = "Firefighter"
    elif 'Category:Nurses' in row['outlinks_clean']:
        df.at[index, 'category'] = "Nurses"
    elif 'Category:Patients' in row['outlinks_clean']:
        df.at[index, 'category'] = "Patient"
    else: 
        df.at[index, 'category'] = "Other"
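
Because the checks above run in a fixed order, a character who belongs to several categories receives the first matching label. A minimal sketch of the same priority rule as a lookup table follows; `CATEGORY_PRIORITY` and `assign_category` are hypothetical names, and the labels are the ones used in the cell above.

```python
# (category link, label) pairs in the order in which they are checked
CATEGORY_PRIORITY = [
    ("Category:Doctors", "Doctor"),
    ("Category:Firefighters", "Firefighter"),
    ("Category:Nurses", "Nurses"),
    ("Category:Patients", "Patient"),
]


def assign_category(outlinks):
    """Return the label of the first matching category, else "Other"."""
    for link, label in CATEGORY_PRIORITY:
        if link in outlinks:
            return label
    return "Other"


# A doctor who was also a patient is labelled as a doctor.
print(assign_category(["Category:Patients", "Category:Doctors"]))  # Doctor
print(assign_category(["Meredith Grey"]))                          # Other
```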

In [20]:
# get the status of each character whether the character is alive or dead and the list of aliases
for index, row in df.iterrows():
    title = row['file']

    with open("data/character wikipages/" + title + ".json") as f:
        data = json.load(f)
    text = json.dumps(data)
    
    pattern_infobox = r'\{\{\w+\sInfobox.*?\}\}'
    infobox = re.findall(pattern_infobox, text)
    if len(infobox) >0:
        infobox = infobox[0]
        infobox = infobox.strip("{{").strip("}}").replace('\\n|', '\n')
        character_info = {}
        for line in infobox.split('\n'):
            if 'Infobox' not in line:
                info = line.split(' = ')
                if len(info)>1:
                    character_info[info[0]] = info[1]
        
        if "status" in character_info.keys():
            df.at[index, 'status'] = character_info['status']
            
        if "alias" in character_info.keys():
            alias_list = []
            alias_list = character_info['alias'].split("\\n")
    
            df.at[index, 'alias_list'] = alias_list
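
The infobox parsing above works on the JSON-encoded page, where line breaks appear as the two characters `\n`. For illustration, the sketch below applies the same idea to raw wikitext with real line breaks. `parse_infobox` is a hypothetical helper, and the sample infobox (template name, fields, and character) is made up.

```python
import re


def parse_infobox(wikitext):
    """Parse the `|key = value` fields of the first infobox into a dict."""
    match = re.search(r"\{\{\w+\sInfobox(.*?)\}\}", wikitext, flags=re.DOTALL)
    if match is None:
        return {}
    info = {}
    # every field starts with "|" at the beginning of a line
    for field in match.group(1).split("\n|"):
        if "=" in field:
            key, value = field.split("=", 1)
            info[key.strip()] = value.strip()
    return info


# Made-up wikitext: the alias field spans two lines, one alias per line.
sample = (
    "{{Character Infobox\n"
    "|name = Jane Doe\n"
    "|status = Alive\n"
    "|alias = JD\n"
    "Doc\n"
    "}}\n"
    "'''Jane Doe''' is a made-up character."
)

info = parse_infobox(sample)
aliases = info.get("alias", "").split("\n")
print(info["status"])  # Alive
print(aliases)         # ['JD', 'Doc']
```

Like the pattern in the cell above, this non-greedy match stops at the first `}}`, so an infobox that contains nested templates would be cut short.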

In [21]:
# get a description of each character
for index, row in df.iterrows():
    pageid = row['pageid']
    title = row['file']
    
    query = "https://greysanatomy.fandom.com/api.php?action=query&pageids={}&prop=pageprops&format=json".format(pageid)
    
    wikiresponse = urllib.request.urlopen(query)
    wikidata = wikiresponse.read()
    wikitext = wikidata.decode('utf-8')
    data_json = json.loads(wikitext)
    
    # pages without page properties are skipped instead of raising a KeyError
    pageprops = data_json['query']['pages'][str(pageid)].get('pageprops', {})
    if "fandomdescription" in pageprops:
        df.at[index, 'description'] = pageprops['fandomdescription']

In [22]:
# get an overview of the scraped and cleaned data
df.sample(10)

Unnamed: 0,pageid,title,file,outlinks_clean,ga_occurences,pp_occurences,s19_occurences,main_universe,universe_encoded,category,status,alias_list,description
2335,31512,Chuckie Patel,Chuckie_Patel,"[Isaac Patel, Rina Patel, Tina Patel, Love the...",0.052632,0.0,0.0,Grey's Anatomy,0,Other,Alive,,
89,49781,Andrea,Andrea,"[Dell Parker, Naomi Bennett, In Which Addison ...",0.0,0.166667,0.0,Private Practice,1,Patient,Alive,,
1949,50802,Kevin Mason,Kevin_Mason,"[Ryan Mason, Take Two, Private Practice|PP, Se...",0.0,0.166667,0.0,Private Practice,1,Patient,Alive,,
1074,60932,Alison Goodman,Alison_Goodman,"[April Kepner, Miranda Bailey, Maggie Pierce, ...",0.052632,0.0,0.0,Grey's Anatomy,0,Patient,Alive,,
798,48153,Rhada Douglas,Rhada_Douglas,"[Heather Douglas, Six Days, Part 1, Six Days, ...",0.052632,0.0,0.0,Grey's Anatomy,0,Other,Alive,,Rhada Douglas is the mother of Heather Douglas...
1241,63946,Taryn Helm,Taryn_Helm,"[General Surgery|Surgical, Resident, Emerald C...",0.315789,0.0,0.5,Station 19,2,Doctor,,,Taryn Helm was a surgical resident at Grey Slo...
1966,83548,Max (Too Darn Hot),Max_(Too_Darn_Hot),"[Too Darn Hot, Station 19|S19, Season 5 (Stati...",0.0,0.0,0.166667,Station 19,2,Other,Alive,,
440,43394,Paul Castello,Paul_Castello,"[Attending, Trauma Surgery|Trauma Surgeon, Dil...",0.105263,0.0,0.0,Grey's Anatomy,0,Doctor,Deceased,,
2751,48532,Otis Sharon,Otis_Sharon,"[Callie Torres, Izzie Stevens, Where the Wild ...",0.052632,0.0,0.0,Grey's Anatomy,0,Patient,Alive,,
3035,81919,Nell Timms,Nell_Timms,"[Richard Webber, Jackson Avery, Cormac Hayes, ...",0.052632,0.0,0.0,Grey's Anatomy,0,Patient,Alive,,


In [23]:
# define helper functions to get a character's history and clean it

def clean_text(text):
    text = re.sub(r"\[\[(?:[^\]\]|:]+?)\|([^(\|)]+?)\]\]", r"\1", text)
    text = re.sub(r"\[\[([^(\|)]+?)\]\]", r"\1", text)
    text = re.sub(r"\[\[(?:Image|File).+?\|([^\|]+?)\]\]", r"", text)
    text = re.sub(r"==.+?==", r"", text)
    text = re.sub(r"<ref>.*?<\/ref>", r"", text)
    return text.replace("\n"," ").replace("*","").replace("=","")

def get_character_history(title):
    character_params = {
        "format": "json",
        "page": title,
        "action": "parse",
        "prop": "wikitext",
        "section": 1,
        "disabletoc": 1
    }
    request = session.get(url=url, params=character_params)
    if 'parse' in request.json().keys():
        return clean_text(request.json()['parse']['wikitext']['*'])
    else:
        return ""
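
To show what `clean_text` does to typical wiki markup, the following sketch repeats the function unchanged and runs it on a short sample. The sample wikitext is made up for illustration. The whitespace is collapsed only for printing, because removed markup leaves extra spaces behind.

```python
import re


def clean_text(text):
    # copied unchanged from the helper cell above
    text = re.sub(r"\[\[(?:[^\]\]|:]+?)\|([^(\|)]+?)\]\]", r"\1", text)
    text = re.sub(r"\[\[([^(\|)]+?)\]\]", r"\1", text)
    text = re.sub(r"\[\[(?:Image|File).+?\|([^\|]+?)\]\]", r"", text)
    text = re.sub(r"==.+?==", r"", text)
    text = re.sub(r"<ref>.*?<\/ref>", r"", text)
    return text.replace("\n", " ").replace("*", "").replace("=", "")


# Made-up wikitext with a heading, two link styles, a reference and a bullet.
sample = (
    "==History==\n"
    "Meredith worked with [[Derek Shepherd|Derek]] at "
    "[[Seattle Grace Hospital]].<ref>Pilot</ref>\n"
    "* She became a resident."
)

cleaned = clean_text(sample)
# collapse the whitespace left behind by the removed markup
normalized = " ".join(cleaned.split())
print(normalized)
# Meredith worked with Derek at Seattle Grace Hospital. She became a resident.
```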

In [24]:
# get character history
character_history = {}
for index, row in df.iterrows():
    title = row['file']
    history = get_character_history(title)  # query the API only once per character
    character_history[title] = history
    df.at[index, 'history'] = history

# save history in the character dataframe but also a separate json for easier processing 
with open('data/characters_history.json', 'w', encoding='utf-8') as file:
    json.dump(character_history, file, ensure_ascii=False, indent=4)

In [25]:
# save the data to a characters file
df.to_csv("data/"+"characters.csv")

## 1.3 Scraping the episode and season summaries

The last step of the data scraping is getting the episode and season data from Fandom.

In [26]:
# create a dictionary of the shows 
shows_dict = {
    'GA': {'seasons': 19, 'name': "Grey's Anatomy"},
    'S19': {'seasons': 6, 'name': 'Station 19'},
    'PP': {'seasons': 6, 'name': 'Private Practice'}
}

# function to get the season data, i.e. the summary and plots for each show and season
def get_season_data(show, season):
    season_params = {
        "format": "json",
        "page": "Season {} ({})".format(season, shows_dict[show]['name']),
        "action": "parse",
        "prop": "wikitext",
        "section": 1,
        "disabletoc": 1
    }
    
    episodes_params = {
    "format": "json",
    "list": "categorymembers",
    "action": "query",
    "cmtitle": "Category:{} S{} Episodes".format(show, season),
    "cmlimit": 50
    }
    
    season_data = {"nr": season, "show": show}
    request = session.get(url=url, params=season_params)
    summary = request.json()['parse']['wikitext']['*']
    season_params['section'] = 2
    request = session.get(url=url, params=season_params)
    plots = request.json()['parse']['wikitext']['*']
    season_data['summary_and_plots'] = clean_text(summary + plots)
    
    request = session.get(url=url, params=episodes_params)
    episodes = request.json()['query']['categorymembers']
    for episode in episodes:
        episode['summary'] = clean_text(get_episode_summary(episode['title']))
        episode['nr'] = get_episode_number(episode['title'])
        del episode['ns']
        del episode['pageid']
    season_data['episodes'] = sorted(episodes, key=lambda d: d['nr'])
    return season_data

# function to get the detailed episode summaries
def get_episode_summary(title):
    episode_params = {
        "format": "json",
        "prop": "wikitext",
        "action": "parse",
        "page": title,
        "section": 2,
        "disabletoc": 1
    }
    request = session.get(url=url, params=episode_params)
    wikitext = request.json()['parse']['wikitext']['*']
    # fall back to section 1 when section 2 appears to hold only placeholder text
    if "Episode in detail." in wikitext or wikitext == "==Full Summary==":
        episode_params['section'] = 1
        request = session.get(url=url, params=episode_params)
        wikitext = request.json()['parse']['wikitext']['*']
    return clean_text(wikitext)

# function to get the episode number from the lead section (section 0) of an episode page
def get_episode_number(title):
    episode_params = {
        "format": "json",
        "prop": "wikitext",
        "action": "parse",
        "page": title,
        "section": 0,
        "disabletoc": 1
    }
    request = session.get(url=url, params=episode_params)
    return int(re.search(r"episode\s*=\s*(\d+)", request.json()['parse']['wikitext']['*']).group(1))
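
The episode number is read with a regular expression, and `re.search` returns `None` when a page has no `episode = <number>` field. The sketch below shows a variant that returns `None` instead of raising an error in that case. `parse_episode_number` is a hypothetical helper that works on wikitext already downloaded, and the sample infobox is made up.

```python
import re
from typing import Optional

EPISODE_PATTERN = re.compile(r"episode\s*=\s*(\d+)")


def parse_episode_number(wikitext: str) -> Optional[int]:
    """Return the episode number found in the wikitext, or None."""
    match = EPISODE_PATTERN.search(wikitext)
    return int(match.group(1)) if match else None


# Made-up lead section of an episode page.
sample = "{{Episode Infobox\n|season = 2\n|episode = 17\n}}"

print(parse_episode_number(sample))             # 17
print(parse_episode_number("no infobox here"))  # None
```

Episodes without a number would then need a fallback value before sorting, since `None` cannot be compared with integers.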

In [27]:
# function to get all data and apply it to all three shows
def get_all_data(show):
    seasons = []
    for i in range(shows_dict[show]['seasons']):
        seasons.append(get_season_data(show, i+1))
    return seasons

seasons_data = get_all_data('GA')
seasons_data.extend(get_all_data('PP'))
seasons_data.extend(get_all_data('S19'))

In [28]:
# save the data in a json 
with open('data/episodes.json', 'w', encoding='utf-8') as file:
    json.dump(seasons_data, file, ensure_ascii=False, indent=4)
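
Each entry of the saved list describes one season and nests its episodes. As a minimal sketch of how this structure could be flattened into one row per episode for later analysis, consider the following. `flatten_episodes` is a hypothetical helper, and the sample record uses made-up titles and placeholder summaries; only the key names come from the functions above.

```python
# One made-up season record with the keys produced by get_season_data().
sample_seasons = [
    {
        "nr": 1,
        "show": "GA",
        "summary_and_plots": "placeholder season summary",
        "episodes": [
            {"title": "Episode A", "summary": "placeholder summary", "nr": 1},
            {"title": "Episode B", "summary": "placeholder summary", "nr": 2},
        ],
    },
]


def flatten_episodes(seasons):
    """Return one flat dict per episode, annotated with show and season."""
    rows = []
    for season in seasons:
        for episode in season["episodes"]:
            rows.append({
                "show": season["show"],
                "season": season["nr"],
                "episode": episode["nr"],
                "title": episode["title"],
                "summary": episode["summary"],
            })
    return rows


rows = flatten_episodes(sample_seasons)
print([(r["show"], r["season"], r["episode"], r["title"]) for r in rows])
# [('GA', 1, 1, 'Episode A'), ('GA', 1, 2, 'Episode B')]
```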

## Main result of the data collection

The data collection results in
* a clean dataframe of the Grey's Anatomy characters with their main information, such as main universe, links to other characters, and description
* a clean JSON file with the summaries of all seasons and detailed descriptions of the episodes
* a clean JSON file with the characters' history and development
