# Acquire data

In this notebook we scrape the name of every South Park character, along with each character's category or type. We can then use this list to match names in the text and identify named entities, which makes building our network much easier. The output we want is a pandas dataframe with the following columns:

* Category - The category or type of the character
* Character - The name of the character

To do this I'll use the requests module to download the page, BeautifulSoup to parse it and extract the data, and pandas to transform the scraped content into a dataframe.
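
As a rough illustration of the target shape, the sketch below builds a small dataframe from a list of dictionaries. The two records are taken from the character page; `records` and `example_df` are illustrative names, and the `Character` key matches the column name used when the dataframe is built later in this notebook.

```python
import pandas as pd

# Illustrative records: one dictionary per scraped character
records = [
    {"Category": "The Four Boys", "Character": "Stan Marsh"},
    {"Category": "4th Graders", "Character": "Butters Stotch"},
]

# Each dictionary becomes a row; the keys become the column names
example_df = pd.DataFrame(records, columns=["Category", "Character"])
print(example_df)
```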

## Import necessary libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Start scraping

Luckily, the South Park Studios website has a page that contains a full list of characters.

In [2]:
url = 'https://www.southparkstudios.co.uk/w/index.php?title=List_of_Characters&oldid=14766'
reqs = requests.get(url)

In [3]:
# Success
reqs

<Response [200]>
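
Displaying the response is a manual check. The same check can be done in code; the following is a minimal sketch, assuming a hypothetical helper named `fetch_html` and an arbitrary 10 second timeout, neither of which is part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, failing loudly on HTTP errors.

    `fetch_html` and the default timeout are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the URL above would return the same text as `reqs.text`, but would raise an exception instead of silently continuing if the page could not be retrieved.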

In [4]:
soup = BeautifulSoup(reqs.text, 'html.parser')

## Try scraping all character names

In [5]:
characters = []

for div in soup.find_all('div', {'class': 'character'}):
    a = div.find('a')
    name = a.text.strip()
    characters.append(name)
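
The loop above assumes every `character` div contains a link. If one did not, `div.find('a')` would return `None` and `.text` would raise an `AttributeError`. The sketch below shows one way to guard against that on a tiny hand-written snippet; the helper name `extract_names` and the link-less div are hypothetical and only there to exercise the guard.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the second div (without a link) is a made-up edge case
sample_html = """
<div class="character"><a href="/w/index.php/Stan_Marsh">Stan Marsh</a></div>
<div class="character">No link here</div>
<div class="character"><a href="/w/index.php/Kyle_Broflovski">Kyle Broflovski</a></div>
"""


def extract_names(soup):
    """Return the link text of every character div that contains a link."""
    names = []
    for div in soup.find_all("div", class_="character"):
        link = div.find("a")
        if link is None:
            # Skip divs without a link instead of raising AttributeError
            continue
        names.append(link.get_text(strip=True))
    return names


sample_names = extract_names(BeautifulSoup(sample_html, "html.parser"))
print(sample_names)
```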

In [8]:
for i in characters:
    print(i)

Stan Marsh
Kyle Broflovski
Eric Cartman
Kenny McCormick
Butters Stotch
Wendy Testaburger
Annie Knitts
Bebe Stevens
Bradley Biggle
Craig Tucker
Clyde Donovan
David Rodriguez
Dog Poo Petuski
Esther
Francis
Heidi Turner
Jason
Jenny Simons
Jimmy Valmer
Kevin Stoley
Lola
Millie Larsen
Nelly
Nichole Daniels
Pip Pirrup
Powder Turner
Red
Scott Malkinson
Timmy Burch
Token Black
Tweek Tweak
Alex Glick
Allen Varcas
Allie Nelson
Annie (Butters' Bottom Bitch)
Apple Replacement Friend
Ashley
Baahir Hakeem
Beth
Bill Allen
Billy Circlovich
Billy Miller
Bloodrayne
Bobby
Bradley (Cartman Sucks)
Greeley Batter (Brian)
Carlos (My Future Self N' Me)
Charlotte
Chris Donnely
Christophe The Mole
Cosette
Courtney
Damien Thorn
David Weatherhead
Davin Miller
Douglas
Emily Marx
Emmett Hollis
Estella Havesham
Felipe (My Future Self N' Me)
Ferrari
Fosse McDonald
Francis (Special Ed)
Frederick Smith
Gary Harrison
Gavin Throttle
Gregory of Yardale
Herbert Pocket
Isiah Jordan
Isla
Jake
Jessica Rodriguez
Jessie
Jessie 

## Acquire main categories

In [7]:
h2_headings = []
h3_headings = []

for h2 in soup.findAll('h2'):
    for i in h2.findAll('span', {'class': 'mw-headline'}):
        print("MAIN CATEGORY", i.text)
        h2_headings.append(i.text)
        
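        # NOTE: find_next returns a single tag or None, so this loop walks the
        # children of the next h3 found anywhere later in the page, and it
        # raises a TypeError once no h3 follows (see the traceback below).
        for h3 in h2.find_next("h3"):
            if h3 is not None:
                print("SUB-CATEGORY", h3.text)
            else:
                print("oops")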

            
#         for h3 in soup.findAll('h3'):
#             for i in h3.findAll('span', {'class': 'mw-headline'}):
#                 print(i.text)
    

MAIN CATEGORY The Four Boys
SUB-CATEGORY Featured 4th Graders
MAIN CATEGORY 4th Graders
SUB-CATEGORY Featured 4th Graders
MAIN CATEGORY School Characters
SUB-CATEGORY Preschoolers
MAIN CATEGORY Parents
SUB-CATEGORY Marsh Family
MAIN CATEGORY Adults
SUB-CATEGORY Townsfolk
MAIN CATEGORY Federal Government
SUB-CATEGORY Military
MAIN CATEGORY Canadians
SUB-CATEGORY Canadian Celebrities
MAIN CATEGORY Non-Human
SUB-CATEGORY Aliens
MAIN CATEGORY Religious
SUB-CATEGORY Holiday
MAIN CATEGORY Celebrities
SUB-CATEGORY Fictional
MAIN CATEGORY Alter Egos
SUB-CATEGORY Stan
MAIN CATEGORY Groups
SUB-CATEGORY 
SUB-CATEGORY Boys' Identities
MAIN CATEGORY Pilot Characters


TypeError: 'NoneType' object is not iterable
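
The error comes from the last heading: no `h3` follows "Pilot Characters", so `find_next("h3")` returns `None`, which cannot be iterated. The output also shows a second problem: `find_next` is not limited to the current section, which is why "The Four Boys" is paired with a sub-category that belongs to "4th Graders". A minimal sketch of one way to handle both, assuming headings and characters are siblings as in the page source; `subcategories_of` and the sample HTML are illustrative.

```python
from bs4 import BeautifulSoup

# Cut-down sample with the same heading layout as the character page
sample_html = """
<h2><span class="mw-headline">4th Graders</span></h2>
<h3><span class="mw-headline">Featured 4th Graders</span></h3>
<h3><span class="mw-headline">Other 4th Graders</span></h3>
<h2><span class="mw-headline">Pilot Characters</span></h2>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")


def subcategories_of(h2):
    """Collect the h3 headings between this h2 and the next h2."""
    subs = []
    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            break  # the next main category starts here
        if sibling.name == "h3":
            subs.append(sibling.get_text(strip=True))
    return subs


headings = {
    h2.get_text(strip=True): subcategories_of(h2)
    for h2 in sample_soup.find_all("h2")
}
print(headings)
```

A main category without sub-categories simply gets an empty list, so no `None` check is needed.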

In [None]:
n = -1
h2_headings = []
h3_headings = []

h2 = soup.find('h2')
if h2:
    for i in h2.findAll('span', {'class': 'mw-headline'}):
        print("MAIN CATEGORY", i.text)
        h2_headings.append(i.text)
        h3_headings.append("None")
        
        # Find all h3 tags following this h2 tag
        next_siblings = h2.find_next_siblings()
        for next_sibling in next_siblings:
#            print("NEXT SIBLING", next_sibling)
            if next_sibling.name == 'h2':
                h2_span = next_sibling.find('span', {'class': 'mw-headline'})
                if h2_span:
                    print("MAIN CATEGORY", h2_span.text)
                    h2_headings.append(h2_span.text)
                    h3_headings.append("None")
            if next_sibling.name == 'h3':
                h3_span = next_sibling.find('span', {'class': 'mw-headline'})
                if h3_span:
                    print("SUB-CATEGORY", h3_span.text)
                    h3_headings.append(h3_span.text)
                    h2_headings.append("None")
            if next_sibling.name == 'div' and 'character' in next_sibling.get('class', []):
                n += 1
                print(next_sibling.text.strip())


In [None]:
h2_headings

In [None]:
h3_headings
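
An alternative to padding two parallel lists with "None" is to walk the page once in document order and remember the most recent `h2` and `h3`. The sketch below is one possible version, not the approach used in this notebook; `collect_records` and the sample HTML are illustrative, and on the real page any `h2` that is not a character category would need to be filtered out.

```python
from bs4 import BeautifulSoup

sample_html = """
<h2><span class="mw-headline">The Four Boys</span></h2>
<div class="character"><a href="/w/index.php/Stan_Marsh">Stan Marsh</a></div>
<h2><span class="mw-headline">4th Graders</span></h2>
<h3><span class="mw-headline">Featured 4th Graders</span></h3>
<div class="character"><a href="/w/index.php/Butters_Stotch">Butters Stotch</a></div>
<h3><span class="mw-headline">Other 4th Graders</span></h3>
<div class="character"><a href="/w/index.php/Alex_Glick">Alex Glick</a></div>
"""


def collect_records(soup):
    """Return one dict per character with its current main and sub-category."""
    records = []
    main_category = None
    sub_category = None
    # find_all with a list of names yields matching tags in document order
    for element in soup.find_all(["h2", "h3", "div"]):
        if element.name == "h2":
            main_category = element.get_text(strip=True)
            sub_category = None  # a new main category resets the sub-category
        elif element.name == "h3":
            sub_category = element.get_text(strip=True)
        elif "character" in element.get("class", []):
            records.append({
                "Category": main_category,
                "Sub-category": sub_category,
                "Character": element.get_text(strip=True),
            })
    return records


sample_records = collect_records(BeautifulSoup(sample_html, "html.parser"))
for record in sample_records:
    print(record)
```

Because every character produces exactly one record, the category and character values cannot drift out of alignment.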

## Test code

In [None]:
html_content = """<h2><span class="mw-headline" id="The_Four_Boys">The Four Boys</span></h2>
<div class="character"><a href="/w/index.php/Stan_Marsh"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/stan-marsh.png?height=98"><br>Stan Marsh</a></div>
<div class="character"><a href="/w/index.php/Kyle_Broflovski"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/kyle-broflovski.png?height=98"><br>Kyle Broflovski</a></div>
<div class="character"><a href="/w/index.php/Eric_Cartman"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/eric-cartman.png?height=98"><br>Eric Cartman</a></div>
<div class="character"><a href="/w/index.php/Kenny_McCormick"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/kenny-mccormick.png?height=98"><br>Kenny McCormick</a></div>
<h2><span class="mw-headline" id="4th_Graders">4th Graders</span></h2>
<h3><span class="mw-headline" id="Featured_4th_Graders">Featured 4th Graders</span></h3>
<div class="character"><a href="/w/index.php/Butters_Stotch"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/butters-stotch.png?height=98"><br>Butters Stotch</a></div>
<div class="character"><a href="/w/index.php/Wendy_Testaburger"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/wendy-testaburger.png?height=98"><br>Wendy Testaburger</a></div>
<div class="character"><a href="/w/index.php/Annie_Knitts"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/annie.png?height=98"><br>Annie Knitts</a></div>
<div class="character"><a href="/w/index.php/Bebe_Stevens"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/bebe-stevens.png?height=98"><br>Bebe Stevens</a></div>
<h3><span class="mw-headline" id="Other_4th_Graders">Other 4th Graders</span></h3>
<div class="character"><a href="/w/index.php/Alex_Glick"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/alex.png?height=98"><br>Alex Glick</a></div>
<div class="character"><a href="/w/index.php/Allen_Varcas"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/other-4th-graders-allen-varcas-conifer-batter.png?height=98"><br>Allen Varcas</a></div>
<div class="character"><a href="/w/index.php/Allie_Nelson"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/4th-grader-allie-nelson.png?height=98"><br>Allie Nelson</a></div>
<div class="character"><a href="/w/index.php/Annie_(Butters%27_Bottom_Bitch)"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/unamed-3rd-4th-graders-annie.png?height=98"><br>Annie (Butters' Bottom Bitch)</a></div>
<div class="character"><a href="/w/index.php/Apple_Replacement_Friend"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/other-4th-graders-apple-replacement-friend.png?height=98"><br>Apple Replacement Friend</a></div>
<h3><span id="Unnamed_3rd.2F4th_Graders"></span><span class="mw-headline" id="Unnamed_3rd/4th_Graders">Unnamed 3rd/4th Graders</span></h3>
<div class="character"><a href="/w/index.php/Annie_Knitts%27_Boyfriend_(Skank_Hunt)"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/default/wiki_characters.jpg?height=98"><br>Annie Knitts' Boyfriend (Skank Hunt)</a></div>
<div class="character"><a href="/w/index.php/Black_Haired_Canadian_Girl_(Where_My_Country_Gone%3F)"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/unnamed-3rd-4th-graders-blackhair-pinkbow.png?height=98"><br>Black Haired Canadian Girl (Where My Country Gone?)</a></div>
<div class="character"><a href="/w/index.php/Blonde_Girl"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/4th-graders-blonde-girl.png?height=98"><br>Blonde Girl</a></div>
<h2><span class="mw-headline" id="Adults">Adults</span></h2>
<h3><span class="mw-headline" id="Townsfolk">Townsfolk</span></h3>
<div class="character"><a href="/w/index.php/Mr._Anders"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/adults/adults-townsfolk-mr-anders.png?height=98"><br>Mr. Anders</a></div>
<div class="character"><a href="/w/index.php/Ben_and_Girlfriend"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/adults/adults-townsfolk-ben-and-girlfriend.png?height=98"><br>Ben and Girlfriend</a></div>
<div class="character"><a href="/w/index.php/Ms._Campbell"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/adults/miss-campbell.png?height=98"><br>Ms. Campbell</a></div>
<div class="character"><a href="Carrie%20Ayers%20(A%20Ladder%20to%20Heaven)"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/adults/townsfolk-carrie-ayers.png?height=98"><br>Carrie Ayers</a></div>
<div class="character"><a href="/w/index.php/Chase"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/adults/adults-townsfolk-chase.png?height=98"><br>Chase</a></div>
<div class="character"><a href="/w/index.php/Charlie,_the_DYNO-MIGHT_firework_company_owner"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/adults/miscellaneous-charlie-dynomight-firework-company.png?height=98"><br>Charlie, the DYNO-MIGHT firework company owner</a></div>
"""



In [None]:
soup = BeautifulSoup(html_content, 'html.parser')

In [None]:
h2_headings = []
h3_headings = []
chars = []

for x in soup.findAll('div', {'class': 'character'}):
    #print(x.text)
    chars.append(x.text)


    prev_siblings = x.find_previous_siblings()
    
    for prev in prev_siblings:
#        print(prev)

        
        if prev.name == 'h2':
            h2_span = prev.find('span', {'class': 'mw-headline'})
            h2_headings.append(h2_span.text)
            h3_headings.append("None")
            print(h2_headings)
        
        if prev.name == 'h3':
            h3_span = prev.find('span', {'class': 'mw-headline'})
            h3_headings.append(h3_span.text)
           
            print(h3_headings)
            
            sec_siblings = prev.find_previous_siblings()
#            print("SECOND PREVIOUS", sec_siblings)
            
            for sec_prev in sec_siblings:
                # print("SEC PREV", sec_prev)
                
                if sec_prev.name == 'h2':
                    # Read the heading from the h2 itself (sec_prev), not from the h3 (prev)
                    h2_span = sec_prev.find('span', {'class': 'mw-headline'})
                    h2_headings.append(h2_span.text)
                    print(h2_headings)
#                 elif sec_prev.name != 'div' and 'character' in next_sibling.get('class', []):
#                     pass
#                 else:
#                     h2_headings.append("None")
          

        

In [None]:
chars

In [None]:
h2_headings

In [None]:
len(h3_headings)

## Just scrape characters and main headings

In [None]:
main_cat = []
chars = []

soup = BeautifulSoup(reqs.text, 'html.parser')
for div in soup.find_all('div', {'class': 'character'}):
    a = div.find('a')
    name = a.text.strip()
    chars.append(name)
    # The nearest preceding h2 sibling is the character's main category
    category = div.find_previous_sibling('h2').text
    print(category)
    main_cat.append(category)

In [None]:
len(chars)

## Create new df using lists

In [None]:
df = pd.DataFrame({'Character': chars, 'Category': main_cat})

In [None]:
df
    

In [None]:
df[df.Category == 'Non-Human']
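
Two quick checks are often useful at this point: how many characters fall into each category, and which characters a given category contains. The sketch below runs both on a four-row sample taken from the page; `sample_df`, `counts`, and `adults` are illustrative names.

```python
import pandas as pd

# Small sample with the same columns as df
sample_df = pd.DataFrame({
    "Character": ["Stan Marsh", "Kyle Broflovski", "Butters Stotch", "Mr. Anders"],
    "Category": ["The Four Boys", "The Four Boys", "4th Graders", "Adults"],
})

# Number of characters per category
counts = sample_df["Category"].value_counts()
print(counts)

# Names of all characters in one category
adults = sample_df.loc[sample_df["Category"] == "Adults", "Character"].tolist()
print(adults)
```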

In [None]:
chars = []

soup = BeautifulSoup(reqs.text, 'html.parser')
for div in soup.find_all('div', {'class': 'character'}):
    a = div.find('a')
    name = a.text.strip()
    chars.append(name)


In [None]:
chars

## Export main_cat csv

In [None]:
import os
os.getcwd()

In [None]:
df.to_csv('/Users/loucap/Documents/GitWork/SNA/Data/main_cat.csv')
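
The absolute path above only exists on one machine, and `to_csv` writes the dataframe index as an extra unnamed column by default. A possible variant is sketched below, assuming a hypothetical helper `export_characters` and a `Data` folder relative to the working directory; whether to drop the index depends on how the CSV is read later.

```python
from pathlib import Path

import pandas as pd


def export_characters(frame, out_dir="Data", filename="main_cat.csv"):
    """Write the dataframe to out_dir/filename and return the path.

    The helper name and default folder are illustrative assumptions.
    """
    out_path = Path(out_dir) / filename
    # Create the target folder if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False leaves out the row index column
    frame.to_csv(out_path, index=False)
    return out_path
```

With this helper, `export_characters(df)` would write `Data/main_cat.csv` under whatever directory `os.getcwd()` reports.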

## ChatGPT's solution

In [None]:
from bs4 import BeautifulSoup

html_content = """
<h2><span class="mw-headline" id="The_Four_Boys">The Four Boys</span></h2>
<div class="character"><a href="/w/index.php/Stan_Marsh"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/stan-marsh.png?height=98"><br>Stan Marsh</a></div>
<div class="character"><a href="/w/index.php/Kyle_Broflovski"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/kyle-broflovski.png?height=98"><br>Kyle Broflovski</a></div>
<div class="character"><a href="/w/index.php/Eric_Cartman"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/eric-cartman.png?height=98"><br>Eric Cartman</a></div>
<div class="character"><a href="/w/index.php/Kenny_McCormick"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/kenny-mccormick.png?height=98"><br>Kenny McCormick</a></div>
<h2><span class="mw-headline" id="4th_Graders">4th Graders</span></h2>
<h3><span class="mw-headline" id="Featured_4th_Graders">Featured 4th Graders</span></h3>
<div class="character"><a href="/w/index.php/Butters_Stotch"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/butters-stotch.png?height=98"><br>Butters Stotch</a></div>
<div class="character"><a href="/w/index.php/Wendy_Testaburger"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/wendy-testaburger.png?height=98"><br>Wendy Testaburger</a></div>
<div class="character"><a href="/w/index.php/Annie_Knitts"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/annie.png?height=98"><br>Annie Knitts</a></div>
<div class="character"><a href="/w/index.php/Bebe_Stevens"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/bebe-stevens.png?height=98"><br>Bebe Stevens</a></div>
<h3><span class="mw-headline" id="Other_4th_Graders">Other 4th Graders</span></h3>
<div class="character"><a href="/w/index.php/Alex_Glick"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/alex.png?height=98"><br>Alex Glick</a></div>
<div class="character"><a href="/w/index.php/Allen_Varcas"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/other-4th-graders-allen-varcas-conifer-batter.png?height=98"><br>Allen Varcas</a></div>
<div class="character"><a href="/w/index.php/Allie_Nelson"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/4th-grader-allie-nelson.png?height=98"><br>Allie Nelson</a></div>
<div class="character"><a href="/w/index.php/Annie_(Butters%27_Bottom_Bitch)"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/unamed-3rd-4th-graders-annie.png?height=98"><br>Annie (Butters' Bottom Bitch)</a></div>
<div class="character"><a href="/w/index.php/Apple_Replacement_Friend"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/other-4th-graders-apple-replacement-friend.png?height=98"><br>Apple Replacement Friend</a></div>
<h3><span id="Unnamed_3rd.2F4th_Graders"></span><span class="mw-headline" id="Unnamed_3rd/4th_Graders">Unnamed 3rd/4th Graders</span></h3>
<div class="character"><a href="/w/index.php/Annie_Knitts%27_Boyfriend_(Skank_Hunt)"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/default/wiki_characters.jpg?height=98"><br>Annie Knitts' Boyfriend (Skank Hunt)</a></div>
<div class="character"><a href="/w/index.php/Black_Haired_Canadian_Girl_(Where_My_Country_Gone%3F)"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/unnamed-3rd-4th-graders-blackhair-pinkbow.png?height=98"><br>Black Haired Canadian Girl (Where My Country Gone?)</a></div>
<div class="character"><a href="/w/index.php/Blonde_Girl"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/kids/4th-graders-blonde-girl.png?height=98"><br>Blonde Girl</a></div>
<h2><span class="mw-headline" id="Adults">Adults</span></h2>
<h3><span class="mw-headline" id="Townsfolk">Townsfolk</span></h3>
<div class="character"><a href="/w/index.php/Mr._Anders"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/adults/adults-townsfolk-mr-anders.png?height=98"><br>Mr. Anders</a></div>
<div class="character"><a href="/w/index.php/Ben_and_Girlfriend"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/adults/adults-townsfolk-ben-and-girlfriend.png?height=98"><br>Ben and Girlfriend</a></div>
<div class="character"><a href="/w/index.php/Ms._Campbell"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/adults/miss-campbell.png?height=98"><br>Ms. Campbell</a></div>
<div class="character"><a href="Carrie%20Ayers%20(A%20Ladder%20to%20Heaven)"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/adults/townsfolk-carrie-ayers.png?height=98"><br>Carrie Ayers</a></div>
<div class="character"><a href="/w/index.php/Chase"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/adults/adults-townsfolk-chase.png?height=98"><br>Chase</a></div>
<div class="character"><a href="/w/index.php/Charlie,_the_DYNO-MIGHT_firework_company_owner"><img src="http://images.paramount.tech/path/mgid:file:gsp:entertainment-assets:/sps/shared/characters/adults/miscellaneous-charlie-dynomight-firework-company.png?height=98"><br>Charlie, the DYNO-MIGHT firework company owner</a></div>
"""

soup = BeautifulSoup(html_content, 'html.parser')

character_names = []
main_cat = []
sub_cat = []

for character_div in soup.find_all('div', class_='character'):
    # Extract the character name
    character_name = character_div.get_text(strip=True)
    character_names.append(character_name)

    # Find the main category (h2 heading)
    main_category = None
    sub_category = "None"
    
    # Traverse previous siblings (nearest first) to find the h3 and h2
    for prev_sibling in character_div.find_previous_siblings():
        if prev_sibling.name == 'h2':
            main_category = prev_sibling.get_text(strip=True)
            break
        elif prev_sibling.name == 'h3' and sub_category == "None":
            # Keep only the nearest h3; earlier h3s belong to other sub-categories
            sub_category = prev_sibling.get_text(strip=True)

    # Always append so both lists stay aligned with character_names
    main_cat.append(main_category)
    sub_cat.append(sub_category)


In [None]:
for name, main, sub in zip(character_names, main_cat, sub_cat):
    print(f"Character: {name}, Main Category: {main}, Sub Category: {sub}")


In [None]:
character_names

In [None]:
main_cat

In [None]:
sub_cat
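
The three lists can then be combined into a single dataframe. The sketch below is one way to do it, assuming a hypothetical helper `build_character_frame` and a `Sub-category` column name that does not appear elsewhere in this notebook. It checks that the lists have equal length and turns the placeholder string "None" into a real missing value.

```python
import pandas as pd


def build_character_frame(names, main_categories, sub_categories):
    """Combine three parallel lists into one dataframe.

    The helper and the 'Sub-category' column name are illustrative assumptions.
    """
    if not (len(names) == len(main_categories) == len(sub_categories)):
        raise ValueError("The three lists must have the same length")
    frame = pd.DataFrame({
        "Character": names,
        "Category": main_categories,
        "Sub-category": sub_categories,
    })
    # Replace the placeholder string "None" with a missing value
    is_placeholder = frame["Sub-category"] == "None"
    frame["Sub-category"] = frame["Sub-category"].mask(is_placeholder)
    return frame


example_frame = build_character_frame(
    ["Stan Marsh", "Butters Stotch"],
    ["The Four Boys", "4th Graders"],
    ["None", "Featured 4th Graders"],
)
print(example_frame)
```

Using a real missing value means `isna()` and `dropna()` work on the sub-category column without special-casing a string.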