# Part A

#### Go to the page https://en.wikipedia.org/wiki/List_of_country_music_performers and extract all of the links using your regular expressions from above.

In [8]:
import re

# performers.txt is expected to hold the wikitext of the list page
with open('performers.txt', 'r', encoding='utf-8') as file:
    text = file.read()

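# Match [[Target]] and [[Target|Label]]: group 1 is the link target, group 2 the optional label
pattern = r'\[\[([^\]|]+)(?:\|([^\]]+))?\]\]'
performers = re.findall(pattern, text)

# findall returns (target, label) tuples; only the target is printed
for performer in performers:
    link = performer[0]
    print(f"Link: {link}")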

Link: The Abrams Brothers
Link: Ace in the Hole Band
Link: Roy Acuff
Link: Kay Adams (singer)
Link: Ryan Adams
Link: Doug Adkins
Link: Trace Adkins
Link: David "Stringbean" Akeman
Link: Rhett Akins
Link: Alabama (band)
Link: Lauren Alaina
Link: Jason Aldean
Link: Alee (singer)
Link: Daniele Alexander
Link: Jessi Alexander
Link: Gary Allan
Link: Susie Allanson
Link: Deborah Allen
Link: Duane Allen
Link: Harley Allen
Link: Jimmie Allen
Link: Rex Allen
Link: Terry Allen (country singer)
Link: Allman Brothers Band
Link: Gregg Allman
Link: Tommy Alverson
Link: Dave Alvin
Link: Amazing Rhythm Aces
Link: American Young
Link: Don Amero
Link: Colin Amey
Link: Al Anderson (NRBQ)
Link: Bill Anderson (singer)
Link: Brent Anderson (singer)
Link: Coffey Anderson
Link: John Anderson (singer)
Link: Keith Anderson
Link: Liz Anderson
Link: Lynn Anderson
Link: Sharon Anderson (singer)
Link: Elisabeth Andreassen
Link: Ingrid Andress
Link: Courtney Marie Andrews
Link: Jessica Andrews
Link: Sheila Andrews
...
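
The pattern above returns every `[[...]]` target it finds, so piped links, section anchors, and non-article links can end up in the list. The following is a minimal sketch of one way to tidy the matches. It assumes that anchors such as `#Career` should be stripped and that `File:`, `Image:`, and `Category:` targets should be dropped. The helper name `extract_article_titles` and the sample wikitext are hypothetical and used only for illustration.

```python
import re

# Same pattern as in the cell above: group 1 is the target, group 2 the optional label
LINK_PATTERN = re.compile(r'\[\[([^\]|]+)(?:\|([^\]]+))?\]\]')

# Assumption: these prefixes mark links that are not performer articles
NON_ARTICLE_PREFIXES = ("File:", "Image:", "Category:")


def extract_article_titles(wikitext):
    """Return unique link targets in order of first appearance."""
    titles = []
    seen = set()
    for target, _label in LINK_PATTERN.findall(wikitext):
        # Drop a section anchor such as "#Career" and surrounding whitespace
        target = target.split('#', 1)[0].strip()
        if not target or target.startswith(NON_ARTICLE_PREFIXES):
            continue
        if target not in seen:
            seen.add(target)
            titles.append(target)
    return titles


# Hypothetical sample wikitext covering plain, piped, anchored, and category links
sample = (
    "* [[Roy Acuff]]\n"
    "* [[Alabama (band)|Alabama]]\n"
    "* [[Roy Acuff#Career|career section]]\n"
    "[[Category:Country music]]\n"
)

print(extract_article_titles(sample))  # ['Roy Acuff', 'Alabama (band)']
```

Keeping first-appearance order (rather than converting to a `set`) makes repeated runs produce the same sequence, which can help when comparing outputs.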

#### Use your knowledge of APIs and the list of all the wiki-pages to download all the text on the pages of the country performers.
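
The cell below requests rendered HTML with `action=parse` and strips the markup with BeautifulSoup. As a possible alternative, the MediaWiki TextExtracts module (`action=query` with `prop=extracts` and `explaintext`) can return plain text directly. The sketch below only builds the request parameters and reads a response that has already been decoded from JSON. The helper names and the sample responses are hypothetical, and the response layout shown assumes the default JSON format version.

```python
def build_extract_params(title):
    """Query parameters for a plain-text extract of a single page."""
    return {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,   # plain text instead of HTML
        "redirects": 1,     # follow redirects to the target article
        "titles": title,
        "format": "json",
    }


def text_from_extract_response(data):
    """Return the extract text from a decoded response, or None if absent."""
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        if "missing" in page:
            # The API flags nonexistent pages with a "missing" key
            continue
        extract = page.get("extract")
        if extract:
            return extract
    return None


# Made-up responses that mimic the assumed layout; the extract text is a placeholder
sample_ok = {
    "query": {"pages": {"123": {"pageid": 123, "title": "Roy Acuff",
                                "extract": "Placeholder extract text."}}}
}
sample_missing = {
    "query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}
}

print(text_from_extract_response(sample_ok))
print(text_from_extract_response(sample_missing))
```

The parameters would be passed to `requests.get` in the same way as in the cell below. Whether the extract or the parsed HTML is preferable depends on how much of the page structure the later analysis needs.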

In [11]:
import os
import re
import requests
from bs4 import BeautifulSoup
from time import sleep

output_dir = "performer_files"
os.makedirs(output_dir, exist_ok=True)  # no error if the folder already exists

# Deduplicate the link targets; underscores keep the titles usable as file names
page_titles = list(set(performer[0] for performer in performers))
page_titles = [title.strip().replace(' ', '_') for title in page_titles]


WIKIPEDIA_API_URL = "https://en.wikipedia.org/w/api.php"
# Placeholder values: replace with a real tool name and contact address
headers = {
    'User-Agent': 'YourAppName/1.0 (your_email@example.com)'
}

# Process each page individually
for title in page_titles:
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",
        "redirects": 1,  # resolve redirects instead of fetching the redirect stub
        "format": "json"
    }

    try:
        response = requests.get(WIKIPEDIA_API_URL, params=params, headers=headers, timeout=30)
        response.raise_for_status()
        data = response.json()

        # Check for errors in the response
        if 'error' in data:
            error_code = data['error'].get('code', '')
            if error_code == 'missingtitle':
                print(f"Page '{title}' is missing. Skipping.")
            else:
                print(f"An error occurred with page '{title}': {data['error'].get('info', '')}")
            continue

        if 'parse' in data:
            html_content = data['parse']['text']['*']

            # Use BeautifulSoup to parse HTML content
            soup = BeautifulSoup(html_content, 'html.parser')
            text = soup.get_text()

            # Save the text in the performer_files folder; a '/' in a title
            # would otherwise be read as a directory separator
            safe_title = title.replace('/', '_')
            filename = os.path.join(output_dir, f"{safe_title}_plain.txt")
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(text)
            print(f"Downloaded plain text for {title}")
        else:
            print(f"No content found for page '{title}'. Skipping.")

        # Respectful crawling
        sleep(0.5)  # Sleep for half a second between requests
    except requests.exceptions.RequestException as e:
        print(f"An error occurred while fetching '{title}': {e}")
    except Exception as e:
        print(f"An unexpected error occurred with '{title}': {e}")


Downloaded plain text for The_Sky_Kings
Downloaded plain text for Charlie_Rich
Downloaded plain text for Cactus_Choir_(band)
Downloaded plain text for Buddy_Brown_(musician)
Downloaded plain text for Caitlin_Cary
Downloaded plain text for Kolby_Cooper
Downloaded plain text for Curtis_Wright
Downloaded plain text for Drake_Jensen
Downloaded plain text for Roy_Acuff
Downloaded plain text for Daron_Norwood
Downloaded plain text for Johnny_Russell_(singer)
Downloaded plain text for Porter_Wagoner
Downloaded plain text for First_Aid_Kit_(band)
Downloaded plain text for Jerry_Salley
Downloaded plain text for Lauren-Ashley
Downloaded plain text for Dick_Feller
Downloaded plain text for Ryan_Laird
Downloaded plain text for Pam_Tillis
Downloaded plain text for Moonshine_Bandits
Downloaded plain text for Heather_Myles
Downloaded plain text for Kid_Rock
Downloaded plain text for Karen_Brooks_(singer)
Downloaded plain text for Nashville_Bluegrass_Band
Downloaded plain text for Tom_Hambridge
Downlo