# Web Scraping in Python

The first objective of this notebook is to discover the `request` and `BeautifulSoup` libraries by scraping a table from a Wikipedia page, building a data frame, and creating a map.

*   `request` and [urllib](https://docs.python.org/3/library/urllib.html#module-urllib) for querying REST APIs
*   [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to inspect webpages

Please note that this part serves as an introduction to web scraping, but you will need to learn more on your own to complete the project. It is inspired by notebooks published by Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676. The exercises target different sources of information, so the code needs to be adapted.

In [14]:
# !pip install -q lxml # Needed in Colab

import bs4
import lxml
import pandas
import urllib

from urllib import request

## Exercise 1: Scrape a table on a Wikipedia page

We would like to display on a map the locations of the Summer Olympic Games since 1896. We will use a [Wikipedia page](https://fr.wikipedia.org/wiki/Jeux_olympiques) and scrape the associated table.
Below is the code to extract the content of the page using `request` and display its title using `BeautifulSoup`.

In [15]:
jo = "https://en.wikipedia.org/wiki/List_of_record_charts"
req = request.Request(jo, headers={"User-Agent": "Mozilla/5.0"})

request_text = request.urlopen(req).read()
print(request_text[:1000])
page = bs4.BeautifulSoup(request_text, "lxml")
print(page.find("title"))
     

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>List of record charts - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabl

Our objective here is to extract the information in the first table, "Jeux olympiques d'été" (Summer Olympic Games), and to build a data frame.

To proceed, you will have to follow these steps:
*   Find the list of charts around the world
*   Collect each charting organization
*   Retrieve the different columns and convert them to text. Use `strip` to clean each value (e.g., remove surrounding whitespace). Store these rows (each one a list of column values) in a table.
* Collect headers of the HTML table
* Build a data frame from the result table and the headers
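
The steps above can be sketched on a small inline table before applying them to a real page. This is a minimal illustration, not the solution to the exercise: the HTML string, the `wikitable` class lookup, and the names `table_headers` and `df_demo` are assumptions made for the example, and the built-in `html.parser` is used so that it runs without `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page (not the real Wikipedia markup).
HTML = """
<table class="wikitable">
  <tr><th> City </th><th> Country </th><th> Year </th></tr>
  <tr><td> Athens </td><td> Greece </td><td> 1896 </td></tr>
  <tr><td> Paris </td><td> France </td><td> 1900 </td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Step 1: find the table of interest.
table = soup.find("table", class_="wikitable")

# Step 2: collect the headers from the first row.
table_headers = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]

# Step 3: collect each data row as a list of cleaned strings.
rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:  # the header row has <th> cells only
        continue
    rows.append([td.get_text(strip=True) for td in cells])

# Step 4: build the data frame from the rows and the headers.
df_demo = pd.DataFrame(rows, columns=table_headers)
print(df_demo)
```

Here `get_text(strip=True)` plays the role of calling `strip` on each cell. All values remain strings, so a column such as `Year` would still need an explicit conversion before any numeric use.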








In [16]:
import requests
from bs4 import BeautifulSoup, Tag
import pandas as pd
from urllib.parse import urljoin

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_record_charts"
BASE_URL = "https://en.wikipedia.org"

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/127.0.0.0 Safari/537.36"
    )
}

KNOWN_CONTINENTS = {
    "Africa", "Asia", "Europe", "North_America", "Oceania", "South_America", "Central_America"
}

def extract_text(el: Tag | None) -> str | None:
    if el is None:
        return None
    a = el.find("a", recursive=False)
    if a and a.get_text(strip=True):
        return a.get_text(strip=True)
    return el.get_text(strip=True)

def _collect_consecutive_uls(start: Tag) -> list[Tag]:
    """Collect consecutive <ul> siblings immediately following `start` within the same parent level."""
    uls = []
    sib = start.next_sibling
    # skip whitespace
    while sib is not None and not isinstance(sib, Tag):
        sib = sib.next_sibling
    # collect consecutive <ul> siblings
    while isinstance(sib, Tag) and sib.name == "ul":
        uls.append(sib)
        sib = sib.next_sibling
        while sib is not None and not isinstance(sib, Tag):
            sib = sib.next_sibling
    return uls

def _parse_ul_block(ul: Tag) -> list[tuple[str | None, str | None]]:
    """Extract (chart_name, chart_url) pairs from a <ul>, including a single nested <ul>."""
    out = []
    for li in ul.find_all("li", recursive=False):
        a = li.find("a", recursive=False)
        if a is not None:
            name = a.get_text(strip=True)
            href = a.get("href")
            url = urljoin(BASE_URL, href) if href else None
            out.append((name, url))
        nested = li.find("ul", recursive=False)
        if nested:
            for nli in nested.find_all("li", recursive=False):
                na = nli.find("a", recursive=False)
                name = na.get_text(strip=True) if na else nli.get_text(strip=True)
                href = na.get("href") if na else None
                url = urljoin(BASE_URL, href) if href else None
                out.append((name, url))
    return out

def _section_fragment_soup(h2_tag: Tag) -> BeautifulSoup:
    """
    Build a mini-soup for the section by taking the heading's parent wrapper
    (e.g., <div class="mw-heading mw-heading2">...</div>) and concatenating its
    following siblings until the next sibling that contains an <h2>.
    """
    wrapper = h2_tag.parent if isinstance(h2_tag.parent, Tag) else h2_tag
    parts = []
    sib = wrapper.next_sibling
    while sib is not None:
        if isinstance(sib, Tag):
            # stop at the next heading wrapper (or a raw h2 sibling, just in case)
            if sib.name == "h2" or (sib.name in {"div", "section"} and sib.find("h2")):
                break
        parts.append(str(sib))
        sib = sib.next_sibling
    return BeautifulSoup("".join(parts), "lxml")

def parse_continent_section(h2_tag: Tag) -> list[tuple[str, str | None, str | None]]:
    """
    Return rows: (country, chart_name, chart_url) for the continent section.
    Works when <dl>/<ul> live directly under the section or inside tables/columns.
    """
    frag = _section_fragment_soup(h2_tag)
    results: list[tuple[str, str | None, str | None]] = []

    for dl in frag.find_all("dl"):
        dts = dl.find_all("dt", recursive=False)
        if not dts:
            continue

        ul_list = _collect_consecutive_uls(dl)
        if not ul_list:
            continue

        charts: list[tuple[str | None, str | None]] = []
        for ul in ul_list:
            charts.extend(_parse_ul_block(ul))

        for dt in dts:
            country = extract_text(dt)
            if not country:
                continue
            for chart_name, chart_url in charts:
                if chart_name:
                    results.append((country, chart_name, chart_url))

    return results

def _continent_key_from_h2(h2: Tag) -> str | None:
    """
    Normalize the continent identifier to match KNOWN_CONTINENTS.
    Prefer explicit ids; fall back to text with spaces->underscores.
    """
    # Some pages put the id on the <h2> directly
    if h2.has_attr("id"):
        return h2["id"]
    # Classic pattern: <h2><span class="mw-headline" id="Europe">Europe</span></h2>
    span = h2.find("span", class_="mw-headline")
    if span and span.get("id"):
        return span["id"]
    # Fallback: normalize text
    text = h2.get_text(strip=True)
    if text:
        return text.replace(" ", "_")
    return None

def parse_record_charts(html: str) -> pd.DataFrame:
    soup = BeautifulSoup(html, "lxml")
    data = []

    for h2 in soup.find_all("h2"):
        key = _continent_key_from_h2(h2)
        if key not in KNOWN_CONTINENTS:
            continue

        rows = parse_continent_section(h2)
        for country, chart_name, chart_url in rows:
            data.append({
                "continent": key.replace("_", " "),
                "country": country,
                "chart": chart_name,
                "url": chart_url,
            })

    if not data:
        return pd.DataFrame(columns=["continent", "country", "chart", "url"])

    return pd.DataFrame(data).drop_duplicates().reset_index(drop=True)

def scrape_record_charts(url: str = WIKI_URL) -> pd.DataFrame:
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    return parse_record_charts(resp.text)



In [17]:

df = scrape_record_charts(WIKI_URL)
print(df.shape)                 # e.g., (>0, 4)
print(df.head(12).to_string())  # quick peek
print(df.dtypes)
print(df.describe())
# df.to_csv("record_charts_by_country_continent.csv", index=False)


(131, 4)
   continent       country                               chart                                                               url
0     Africa         Egypt                                IFPI                                https://en.wikipedia.org/wiki/IFPI
1     Africa       Nigeria                       TurnTable Top                       https://en.wikipedia.org/wiki/TurnTable_Top
2     Africa       Nigeria            TurnTable Top  100 chart             https://en.wikipedia.org/wiki/TurnTable_Top_100_Songs
3     Africa       Nigeria             TurnTable Top 100 Album            https://en.wikipedia.org/wiki/TurnTable_Top_100_Albums
4     Africa  North Africa                                IFPI                                https://en.wikipedia.org/wiki/IFPI
5     Africa  South Africa   The Official South African Charts   https://en.wikipedia.org/wiki/The_Official_South_African_Charts
6       Asia         China                       China Top 100             https://en.wi

In [18]:
# Save to CSV in project root
output_path = "../record_charts.csv"
df.to_csv(output_path, index=False)
print(f"Saved to {output_path}")


Saved to ../record_charts.csv


## Map of Record Charts

In [48]:
# Map countries with any chart using GeoPandas + Folium
import geopandas as gpd
import folium
from shapely.geometry import Point

# Unique countries from scrape
countries = (
    df['country']
    .dropna()
    .str.strip()
    .drop_duplicates()
    .tolist()
)

# Load Natural Earth country boundaries
# Download Natural Earth data (GeoPandas 1.0+ requires manual download)
from pathlib import Path
import urllib.request
import zipfile

# Prefer the S3 mirror; fallback to Natural Earth site
NE_URL_PRIMARY = "https://naturalearth.s3.amazonaws.com/110m_cultural/ne_110m_admin_0_countries.zip"
NE_URL_FALLBACK = "https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_admin_0_countries.zip"
cache_dir = Path(".cache")
cache_dir.mkdir(exist_ok=True)
ne_zip_path = cache_dir / "ne_110m_admin_0_countries.zip"
ne_shp_path = cache_dir / "ne_110m_admin_0_countries.shp"

if not ne_shp_path.exists():
    print("Downloading Natural Earth data...")
    import requests
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"}
    url = NE_URL_PRIMARY
    try:
        resp = requests.get(url, headers=headers, timeout=60)
        resp.raise_for_status()
    except Exception:
        url = NE_URL_FALLBACK
        resp = requests.get(url, headers=headers, timeout=60)
        resp.raise_for_status()
    with open(ne_zip_path, "wb") as f:
        f.write(resp.content)
    with zipfile.ZipFile(ne_zip_path, 'r') as zip_ref:
        zip_ref.extractall(cache_dir)
    print("Download complete.")

world = gpd.read_file(ne_shp_path)
world = world[['NAME', 'geometry']].rename(columns={'NAME': 'name'})

# Manual remaps for known mismatches
remap = {
    'Czech Republic': 'Czechia',
    'United States': 'United States of America',
    'DR Congo': 'Democratic Republic of the Congo',
    'Congo': 'Republic of the Congo',
    # Match on the prefix only, so the apostrophe character used by the dataset does not matter
    'Ivory Coast':
        next((n for n in world['name'] if 'Côte d' in n or 'Cote d' in n), "Côte d'Ivoire"),
    'Swaziland': 'Eswatini',
    'North Macedonia': 'Macedonia',  # Natural Earth sometimes lists as Macedonia
}

# Non-country labels to ignore (regions, groupings that appear on the page)
non_countries = {
    'North Africa', 'Southeast Asia', 'Continental Europe', 'Central America',
}

# Build a mapping DataFrame
name_map = []
for c in countries:
    if c in non_countries:
        continue
    target = remap.get(c, c)
    name_map.append((c, target))

map_df = pd.DataFrame(name_map, columns=['country', 'ne_name'])

# Merge with world geometries
merged = map_df.merge(world, left_on='ne_name', right_on='name', how='left')
merged = merged.drop(columns=['name']).dropna(subset=['geometry'])

# Compute country chart counts
counts = df.groupby('country', as_index=False).agg(num_charts=('chart', 'nunique'))
merged = merged.merge(counts, on='country', how='left')

# Prepare Folium map centered on global mean of representative points
gdf_countries = gpd.GeoDataFrame(merged, geometry='geometry', crs='EPSG:4326')
reps = gdf_countries.representative_point()  # safe in WGS84 for display
center_lat = reps.y.mean()
center_lng = reps.x.mean()
chart_map = folium.Map(location=[center_lat, center_lng], tiles='openstreetmap', zoom_start=2)

# Add one marker per country with a popup listing up to 10 charts
# Pair each row with its representative point directly, without relying on the index values
for (_, row), pt in zip(gdf_countries.iterrows(), reps):
    lat = pt.y
    lng = pt.x
    country_name = row['country']
    subset = df[df['country'] == country_name].dropna(subset=['chart']).head(10)
    charts_html = '<br>'.join(sorted(subset['chart'].unique())[:10])
    popup_html = f"""
    <b>{country_name}</b><br/>
    Charts (up to 10):<br/>
    {charts_html}
    """
    folium.CircleMarker(
        location=[lat, lng],
        radius=5 + float(row.get('num_charts', 1))**0.5,
        color='#2A93D5',
        fill=True,
        fill_opacity=0.7,
        popup=folium.Popup(popup_html, max_width=300)
    ).add_to(chart_map)

# Save and show
chart_map.save('charts_world_map.html')
print('Saved interactive map to charts_world_map.html')
chart_map

Saved interactive map to charts_world_map.html
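
The merge in the cell above drops every country whose scraped name has no match in the boundary table, because rows without a geometry are removed. A quick way to see which names still need an entry in `remap` is a left merge with `indicator=True`. The sketch below uses toy data frames; `scraped`, `boundaries`, `remap_demo`, and `unmatched` are hypothetical names introduced for the illustration.

```python
import pandas as pd

# Toy stand-ins for the scraped names and the boundary table (hypothetical values).
scraped = pd.DataFrame({"country": ["France", "Czech Republic", "Atlantis"]})
boundaries = pd.DataFrame({"name": ["France", "Czechia", "Spain"]})

# Manual remap, in the same spirit as the `remap` dict in the cell above.
remap_demo = {"Czech Republic": "Czechia"}
scraped["ne_name"] = scraped["country"].replace(remap_demo)

# indicator=True adds a `_merge` column telling where each row was found.
check = scraped.merge(
    boundaries, left_on="ne_name", right_on="name", how="left", indicator=True
)

# Rows marked "left_only" have no matching boundary and would vanish from the map.
unmatched = check.loc[check["_merge"] == "left_only", "country"].tolist()
print(unmatched)  # ['Atlantis']
```

Running the same check on the real `map_df` and `world` tables would list the labels that are either regions to ignore or names that need a new remap entry.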


In [51]:
# Extract tables from PDF and save to CSV files
# Requires: pip install pdfplumber
import pdfplumber
import pandas as pd
from pathlib import Path

pdf_path = Path("../data/MEEC_ACCI_Analysis_of_weather_and_music_preference.pdf")
out_dir = Path("data/extracted_tables")
out_dir.mkdir(parents=True, exist_ok=True)

extracted = []
with pdfplumber.open(pdf_path) as pdf:
    for page_idx, page in enumerate(pdf.pages, start=1):
        tables = page.extract_tables()
        for t_idx, table in enumerate(tables, start=1):
            if not table or len(table) < 2:
                continue
            # Assume first row is header if all entries are strings
            header = table[0]
            body = table[1:]
            try:
                df_tbl = pd.DataFrame(body, columns=header)
            except Exception:
                # Fallback without headers
                df_tbl = pd.DataFrame(table)
            out_path = out_dir / f"page_{page_idx:03d}_table_{t_idx:02d}.csv"
            df_tbl.to_csv(out_path.as_posix(), index=False)
            extracted.append((page_idx, t_idx, out_path.name, df_tbl.shape))

print(f"Saved {len(extracted)} tables to {out_dir}")
for page_idx, t_idx, name, shape in extracted[:10]:
    print(f"- Page {page_idx}, Table {t_idx}: {name} {shape}")
if len(extracted) > 10:
    print(f"... and {len(extracted) - 10} more")
    
# pip install pdf2image opencv-python pillow geopandas shapely
from pdf2image import convert_from_path
import cv2, numpy as np, pandas as pd
from pathlib import Path

# 1) Rasterize page 5 to image
img = np.array(convert_from_path("../data/MEEC_ACCI_Analysis_of_weather_and_music_preference.pdf",
                                 dpi=300, first_page=5, last_page=5,
                                 poppler_path=r"C:\Program Files\poppler-25.07.0\Library\bin")[0])[:,:,::-1]  # RGB->BGR

# 2) Color masks (tune thresholds to your image)
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
masks = {
    "cluster_1_grey":  cv2.inRange(hsv, (0,0,120), (180,40,230)),
    "cluster_2_green": cv2.inRange(hsv, (35,40,40), (85,255,255)),
    "cluster_3_yellow":cv2.inRange(hsv, (20,70,70), (35,255,255)),
}

def centers(mask):
    cnts,_ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    pts=[]
    for c in cnts:
        if cv2.contourArea(c) < 20: 
            continue
        (x,y),r = cv2.minEnclosingCircle(c)
        pts.append((float(x), float(y)))
    return pts

dots=[]
for label, m in masks.items():
    for x,y in centers(m):
        dots.append({"x":x,"y":y,"cluster":label})
dots = pd.DataFrame(dots)

# 3) Calibrate pixel->lon/lat (click or hand-enter two map corners)
# Example: bounding box (lon_min, lat_max) at (x0,y0), (lon_max, lat_min) at (x1,y1)
# You can read these off the map edges or set manually.
lon_min, lat_max = -180.0, 85.0
lon_max, lat_min =  180.0, -60.0
x0,y0 = 150,150    # set from your image
x1,y1 = 2400,1300  # set from your image

def px_to_lonlat(x,y):
    lon = lon_min + (x - x0) * (lon_max - lon_min) / (x1 - x0)
    lat = lat_max - (y - y0) * (lat_max - lat_min) / (y1 - y0)
    return lon, lat

lon, lat = zip(*[px_to_lonlat(x,y) for x,y in zip(dots.x, dots.y)])
dots["lon"], dots["lat"] = lon, lat
dots.to_csv("../data/map_clusters_extracted.csv", index=False)

# 4) Rasterize pages 4-5 to PNG files (reuses `pdf_path` and `convert_from_path` from above)
images = convert_from_path(
    pdf_path,
    dpi=300,
    first_page=4,      # choose the page(s)
    last_page=5,
    poppler_path=r"C:\Program Files\poppler-25.07.0\Library\bin"  # Poppler bin directory
)

# Save or process the images
out_dir = Path("../data/pdf_pages")
out_dir.mkdir(parents=True, exist_ok=True)

for i, img in enumerate(images, start=1):
    out_file = out_dir / f"page_{i}.png"
    img.save(out_file, "PNG")
    print(f"Saved {out_file}")

Saved 5 tables to data\extracted_tables
- Page 1, Table 1: page_001_table_01.csv (1, 2)
- Page 3, Table 1: page_003_table_01.csv (11, 2)
- Page 3, Table 2: page_003_table_02.csv (5, 2)
- Page 5, Table 1: page_005_table_01.csv (3, 2)
- Page 5, Table 2: page_005_table_02.csv (3, 6)
Saved ..\data\pdf_pages\page_1.png
Saved ..\data\pdf_pages\page_2.png
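
The pixel-to-coordinate calibration in the cell above is a linear interpolation between two reference corners. The sketch below isolates that step so it can be checked on known points; it reuses the placeholder corner values from the cell, and the helper name `make_px_to_lonlat` is an assumption introduced here. A linear mapping is only exact if the map is drawn in an equirectangular projection, which the PDF does not state, so the result should be treated as approximate.

```python
def make_px_to_lonlat(x0, y0, x1, y1, lon_min, lat_max, lon_max, lat_min):
    """Return a function mapping pixel (x, y) to (lon, lat) by linear interpolation.

    (x0, y0) is the pixel of the top-left corner (lon_min, lat_max);
    (x1, y1) is the pixel of the bottom-right corner (lon_max, lat_min).
    """
    if x1 == x0 or y1 == y0:
        raise ValueError("calibration corners must differ on both axes")

    def convert(x, y):
        lon = lon_min + (x - x0) * (lon_max - lon_min) / (x1 - x0)
        # Pixel y grows downward while latitude grows upward, hence the subtraction.
        lat = lat_max - (y - y0) * (lat_max - lat_min) / (y1 - y0)
        return lon, lat

    return convert


# Placeholder calibration values copied from the cell above (to be set from the real image).
to_lonlat = make_px_to_lonlat(150, 150, 2400, 1300, -180.0, 85.0, 180.0, -60.0)

top_left = to_lonlat(150, 150)        # should return the first reference corner
bottom_right = to_lonlat(2400, 1300)  # should return the second reference corner
centre = to_lonlat(1275, 725)         # midpoint of the calibrated rectangle
print(top_left, bottom_right, centre)
```

Checking that the two calibration pixels map back to their own coordinates is a cheap sanity test before converting the detected dots.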


In [55]:
# Extract line chart data from page 4 - Convert to PNG for digitization
from pdf2image import convert_from_path
from pathlib import Path

pdf_path = Path("../data/MEEC_ACCI_Analysis_of_weather_and_music_preference.pdf")
out_dir = Path("../data/pdf_pages")
out_dir.mkdir(parents=True, exist_ok=True)

# 1) Convert page 4 to high-resolution PNG (simpler and more reliable)
print("Converting page 4 to PNG...")
images = convert_from_path(
    pdf_path,
    dpi=300,  # High resolution for better digitization
    first_page=4,
    last_page=4,
    poppler_path=r"C:\Program Files\poppler-25.07.0\Library\bin"
)

# Save the image
page4_img = out_dir / "page4_figure.png"
images[0].save(page4_img, "PNG")
print(f"Saved page 4 image to {page4_img}")
print(f"\nImage saved! You can now:")
print("1. Open the image in WebPlotDigitizer: https://automeris.io/WebPlotDigitizer/")
print("2. For each subplot (Energy, Tempo, Valence, Loudness, Danceability):")
print("   - Select '2D (X-Y) Plot'")
print("   - Calibrate axes (pick two dates on x-axis, two values on y-axis)")
print("   - Use 'Automatic Extraction > Multiple Lines' for each colored city line")
print("   - Export as CSV")
print("3. Combine all CSV files into one DataFrame")

# The PNG image has been saved. 
# Next: Use WebPlotDigitizer (https://automeris.io/WebPlotDigitizer/) to digitize the line charts.
# After digitizing, you can load the CSV files and combine them:

# Example: Load digitized data (after using WebPlotDigitizer)
# import pandas as pd
# energy_data = pd.read_csv("energy_digitized.csv")
# tempo_data = pd.read_csv("tempo_digitized.csv")
# # ... etc for other metrics
# 
# # Combine all metrics
# all_data = pd.concat([energy_data, tempo_data, ...], ignore_index=True)
# all_data.to_csv("../data/page4_digitized.csv", index=False)

Converting page 4 to PNG...
Saved page 4 image to ..\data\pdf_pages\page4_figure.png

Image saved! You can now:
1. Open the image in WebPlotDigitizer: https://automeris.io/WebPlotDigitizer/
2. For each subplot (Energy, Tempo, Valence, Loudness, Danceability):
   - Select '2D (X-Y) Plot'
   - Calibrate axes (pick two dates on x-axis, two values on y-axis)
   - Use 'Automatic Extraction > Multiple Lines' for each colored city line
   - Export as CSV
3. Combine all CSV files into one DataFrame
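
For step 3, a plain concatenation loses track of which subplot each row came from, so it helps to tag every file with its metric first. The sketch below assumes each digitized export is a header-less two-column CSV (x, y); that layout, the sample values, and the names `fake_exports` and `all_metrics` are assumptions for the illustration, and in-memory strings stand in for the exported files.

```python
import io

import pandas as pd

# Hypothetical digitized exports: two columns (x, y) without a header row.
fake_exports = {
    "energy": "2020.0,0.61\n2021.0,0.64\n",
    "tempo": "2020.0,118.2\n2021.0,120.5\n",
}

frames = []
for metric, text in fake_exports.items():
    # With real files, replace io.StringIO(text) with the path of the exported CSV.
    part = pd.read_csv(io.StringIO(text), header=None, names=["x", "value"])
    part["metric"] = metric  # remember which subplot the rows came from
    frames.append(part)

# One long table: one row per digitized point, labelled by metric.
all_metrics = pd.concat(frames, ignore_index=True)
print(all_metrics)
```

If each subplot contains several city lines, a second label column for the city can be added in the same way before concatenating.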


In [40]:
import time
import pandas as pd  # used by load_urls below
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

HEADERS = {"User-Agent": "Mozilla/5.0 ..."}
DOMAIN_SCRAPERS = {}

def scraper(domain):
    def _register(func):
        DOMAIN_SCRAPERS[domain] = func
        return func
    return _register

def load_urls(csv_path="../data/chart_official_sites.csv"):
    df = pd.read_csv(csv_path)
    if "official_site" in df.columns:
        df["url"] = df["official_site"].fillna(df["url"])
    df = df.dropna(subset=["url"])
    return df

def scrape_top_songs(url, limit=10):
    domain = urlparse(url).netloc.replace("www.", "")
    handler = DOMAIN_SCRAPERS.get(domain)
    if not handler:
        raise NotImplementedError(f"No scraper for {domain}")
    return handler(url, limit=limit)

def fetch_soup(url):
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "lxml")
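
The `scraper` decorator above implements a small registry: each handler is stored under its domain, and `scrape_top_songs` looks the handler up from the URL. The sketch below shows the same pattern end to end without any network access. The names `REGISTRY`, `register`, `normalize_domain`, `dispatch`, and the `example.com` handler are hypothetical; the handler returns made-up entries instead of parsing a page.

```python
from urllib.parse import urlparse

REGISTRY = {}  # domain -> handler, same idea as DOMAIN_SCRAPERS


def register(domain):
    """Decorator factory: store the decorated function under `domain`."""
    def _wrap(func):
        REGISTRY[domain] = func
        return func  # return the function unchanged so it stays callable by name
    return _wrap


def normalize_domain(url):
    """Lower-case the host and drop a leading 'www.' only (not one inside the name)."""
    return urlparse(url).netloc.lower().removeprefix("www.")


@register("example.com")
def scrape_example(url, limit=10):
    # Hypothetical handler: fabricates entries instead of fetching and parsing the page.
    return [{"position": i, "title": f"Song {i}"} for i in range(1, limit + 1)]


def dispatch(url, limit=10):
    """Route the URL to the handler registered for its domain."""
    domain = normalize_domain(url)
    handler = REGISTRY.get(domain)
    if handler is None:
        raise NotImplementedError(f"No scraper for {domain}")
    return handler(url, limit=limit)


entries = dispatch("https://www.example.com/charts/top", limit=3)
print(entries)
```

Adding support for a new site then only requires writing one decorated function; the dispatch code does not change. Note that `str.removeprefix` needs Python 3.9 or later.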

In [41]:
# Domain scrapers and batch runner for top songs
import re
import time
import pandas as pd

# Utilities to robustly extract text

def _first_text(el, selectors):
    for sel in selectors:
        node = el.select_one(sel)
        if node:
            text = node.get_text(strip=True)
            if text:
                return text
    return None


def _digits(text):
    if not text:
        return None
    m = re.findall(r"\d+", text)
    return int(m[0]) if m else None


@scraper("officialcharts.com")
def scrape_officialcharts(url, limit=10):
    soup = fetch_soup(url)
    # Two common layouts: cards or table
    rows = soup.select("div.chart-positions__item")
    if not rows:
        rows = soup.select("table.chart tbody tr")
    out = []
    for row in rows:
        rank = _first_text(row, [
            "div.chart-positions__position",
            "td.position",
            "td.chart-position",
            "td:nth-child(1)",
        ])
        title = _first_text(row, [
            "div.chart-positions__track a",
            "td.title a",
            "td.title",
            "td:nth-child(2)",
        ])
        artist = _first_text(row, [
            "div.chart-positions__artist",
            "td.artist",
            "td:nth-child(3)",
        ])
        if title and artist and rank:
            out.append({
                "position": _digits(rank) or rank,
                "title": title,
                "artist": artist,
            })
        if len(out) >= limit:
            break
    return out


@scraper("billboard.com")
def scrape_billboard(url, limit=10):
    soup = fetch_soup(url)
    # Modern layout
    rows = soup.select("li.o-chart-results-list__item")
    if not rows:
        rows = soup.select("ul.o-chart-results-list > li")
    out = []
    for row in rows:
        rank = _first_text(row, [
            "span.c-label.a-font-primary-bold-l",
            "span.c-label.a-font-primary-bold",
            "span.c-label.a-font-primary-bold-m",
        ])
        title = _first_text(row, [
            "h3.c-title",
            "h3#title-of-a-story",
        ])
        artist = _first_text(row, [
            "span.c-label.a-no-trucate",
            "span.c-label",
            "span.a-no-trucate",
        ])
        if title and artist and rank:
            out.append({
                "position": _digits(rank) or rank,
                "title": title,
                "artist": artist,
            })
        if len(out) >= limit:
            break
    return out


def scrape_many(url_df, limit=10, sleep_seconds=1.5):
    work = url_df.copy()
    work["domain"] = work["url"].apply(lambda u: urlparse(u).netloc.replace("www.", ""))
    work = work[work["domain"].isin(DOMAIN_SCRAPERS.keys())].reset_index(drop=True)

    results = []
    for _, row in work.iterrows():
        url = row["url"]
        chart = row.get("chart")
        domain = row["domain"]
        try:
            entries = scrape_top_songs(url, limit=limit)
            for e in entries:
                results.append({
                    "domain": domain,
                    "chart": chart,
                    "url": url,
                    "position": e.get("position"),
                    "title": e.get("title"),
                    "artist": e.get("artist"),
                    # Timestamp.utcnow() is deprecated in recent pandas releases
                    "scraped_at": pd.Timestamp.now(tz="UTC").isoformat(),
                })
        except NotImplementedError:
            pass
        except Exception as exc:
            print(f"[ERROR] {domain} {url}: {exc}")
        time.sleep(sleep_seconds)
    return pd.DataFrame(results)


# Run on available URLs and save
# Look for the CSV in the project data folder first, then in the working directory
try:
    urls_df = load_urls("../data/chart_official_sites.csv")
except FileNotFoundError:
    urls_df = load_urls("chart_official_sites.csv")

urls_df["domain"] = urls_df["url"].apply(
    lambda u: urlparse(str(u)).netloc.replace("www.", "")
)

supported_domains = set(DOMAIN_SCRAPERS.keys())
print("Supported:", supported_domains)
print("Domain counts:\n", urls_df["domain"].value_counts())

missing = urls_df[~urls_df["domain"].isin(supported_domains)]
print(f"\nUnsupported domains ({missing['domain'].nunique()}):")
print(missing["domain"].value_counts().head(20))
missing_rows = missing[["chart", "domain", "url"]].head(10)
display(missing_rows)

top_df = scrape_many(urls_df, limit=10, sleep_seconds=1.2)
print(top_df.head(10).to_string(index=False))
out_path = "../data/top_songs.csv"
try:
    top_df.to_csv(out_path, index=False)
    print(f"Saved to {out_path} ({len(top_df)} rows)")
except Exception:
    alt = "top_songs.csv"
    top_df.to_csv(alt, index=False)
    print(f"Saved to {alt} ({len(top_df)} rows)")


Supported: {'officialcharts.com', 'billboard.com'}
Domain counts:
 domain
en.wikipedia.org             86
turntablecharts.com           3
fhf.is                        3
ifpi.org                      2
mediaforest.biz               1
oricon.jp                     1
rim.org.my                    1
rias.org.sg                   1
top-lista.hr                  1
snepmusique.com               1
gfk-entertainment.com         1
theofficialsacharts.co.za     1
mahasz.hu                     1
irma.ie                       1
fimi.it                       1
zpav.pl                       1
slotop50.si                   1
sverigesradio.se              1
officialcharts.com            1
bigtop40.com                  1
collectionscanada.gc.ca       1
monitorlatino.com             1
billboard.com                 1
mediabase.com                 1
recordedmusic.co.nz           1
capif.org.ar                  1
national-report.com           1
recordreport.com.ve           1
Name: count, dtype: int64

Uns

Unnamed: 0,chart,domain,url
0,IFPI,ifpi.org,https://www.ifpi.org/
1,TurnTable Top,turntablecharts.com,https://turntablecharts.com/charts/top100
2,TurnTable Top 100 chart,turntablecharts.com,https://turntablecharts.com/charts/top100
3,TurnTable Top 100 Album,turntablecharts.com,https://turntablecharts.com/charts/2
4,The Official South African Charts,theofficialsacharts.co.za,https://theofficialsacharts.co.za/
5,China Top 100,en.wikipedia.org,https://en.wikipedia.org/wiki/Billboard_China_...
6,China Airplay/FL,en.wikipedia.org,https://en.wikipedia.org/wiki/Billboard_China_...
7,Music Radio China Top Chart Awards,en.wikipedia.org,https://en.wikipedia.org/wiki/Music_Radio_Chin...
8,IMI International Top 20 Singles,en.wikipedia.org,https://en.wikipedia.org/wiki/IMI_Internationa...
9,Billboard IndonesiaTop 100,en.wikipedia.org,https://en.wikipedia.org/wiki/Billboard_Indone...


Empty DataFrame
Columns: []
Index: []
Saved to ../data/top_songs.csv (0 rows)


In [42]:
# Resolve Wikipedia URLs to official chart endpoints and rerun
from urllib.parse import urlparse

SUPPORTED_DOMAINS = {"officialcharts.com", "billboard.com"}


def infer_billboard_endpoint(row):
    chart = str(row.get("chart", "")).lower()
    country = str(row.get("country", "")).lower()
    # Global first
    if "global" in chart:
        return "https://www.billboard.com/charts/billboard-global-200/"
    # Canada
    if ("canadian" in chart and "hot 100" in chart) or ("canada" in country and "hot 100" in chart):
        return "https://www.billboard.com/charts/canadian-hot-100/"
    # Argentina
    if ("argentina" in chart or "argentina" in country) and "hot 100" in chart:
        return "https://www.billboard.com/charts/billboard-argentina-hot-100/"
    # Billboard 200 (albums)
    if ("billboard 200" in chart) or ("billboard200" in chart) or ("billboard" in chart and "200" in chart):
        return "https://www.billboard.com/charts/billboard-200/"
    # US Hot 100 default
    if "hot 100" in chart or ("united states" in country and "billboard" in chart):
        return "https://www.billboard.com/charts/hot-100/"
    return None


def infer_officialcharts_endpoint(row):
    # UK Official Singles Chart (most commonly needed top songs endpoint)
    return "https://www.officialcharts.com/charts/singles-chart/"


def resolve_target_url(row):
    url = row["url"]
    dom = urlparse(url).netloc.replace("www.", "")
    if dom in SUPPORTED_DOMAINS:
        return url
    # Wikipedia heuristics → infer official chart page
    if dom.endswith("wikipedia.org"):
        chart = str(row.get("chart", "")).lower()
        country = str(row.get("country", "")).lower()
        # UK / Official Charts Company
        if "united kingdom" in country or "uk " in chart or "official charts" in chart:
            return infer_officialcharts_endpoint(row)
        # Billboard family
        bb = infer_billboard_endpoint(row)
        if bb:
            return bb
    return None


def scrape_many2(url_df, limit=10, sleep_seconds=1.2):
    import time
    results = []
    for _, r in url_df.iterrows():
        target = resolve_target_url(r)
        if not target:
            continue
        domain = urlparse(target).netloc.replace("www.", "")
        if domain not in DOMAIN_SCRAPERS:
            continue
        try:
            entries = scrape_top_songs(target, limit=limit)
            for e in entries:
                results.append({
                    "domain": domain,
                    "chart": r.get("chart"),
                    "country": r.get("country"),
                    "url": target,
                    "position": e.get("position"),
                    "title": e.get("title"),
                    "artist": e.get("artist"),
                    "scraped_at": pd.Timestamp.utcnow().isoformat(),
                })
        except Exception as exc:
            print(f"[ERROR] {domain} {target}: {exc}")
        time.sleep(sleep_seconds)
    return pd.DataFrame(results)


# Run resolver-based scraping
try:
    urls_df
except NameError:
    try:
        urls_df = load_urls("../data/chart_official_sites.csv")
    except Exception:
        urls_df = load_urls("chart_official_sites.csv")

resolved_top_df = scrape_many2(urls_df, limit=10, sleep_seconds=1.0)
print(resolved_top_df.head(10).to_string(index=False))
out_path = "../data/top_songs.csv"
try:
    resolved_top_df.to_csv(out_path, index=False)
    print(f"Saved to {out_path} ({len(resolved_top_df)} rows)")
except Exception:
    resolved_top_df.to_csv("top_songs.csv", index=False)
    print(f"Saved to top_songs.csv ({len(resolved_top_df)} rows)")


Empty DataFrame
Columns: []
Index: []
Saved to ../data/top_songs.csv (0 rows)
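
Both `continue` branches in `scrape_many2` skip a row silently, so the empty dataframe above does not say whether no URL resolved, no scraper was registered for the resolved domain, or the scrapers returned nothing. A minimal sketch of a diagnostic pass, assuming rows are plain dictionaries; `diagnose_rows`, the toy `resolver` and the sample rows are hypothetical and only illustrate the bookkeeping:

```python
from collections import Counter
from urllib.parse import urlparse


def diagnose_rows(rows, resolver, supported_domains):
    """Count why each row would be scraped or skipped (hypothetical helper)."""
    reasons = Counter()
    for row in rows:
        target = resolver(row)
        if not target:
            reasons["no target url"] += 1
            continue
        domain = urlparse(target).netloc.replace("www.", "")
        if domain not in supported_domains:
            reasons[f"no scraper for {domain}"] += 1
            continue
        reasons["ready to scrape"] += 1
    return reasons


# Toy resolver for the illustration: keep the URL only if it has a /charts/ path.
def resolver(row):
    url = row.get("url", "")
    return url if "/charts/" in urlparse(url).path else None


# Made-up rows, one per possible outcome
sample_rows = [
    {"chart": "Hot 100", "url": "https://www.billboard.com/charts/hot-100/"},
    {"chart": "Some chart", "url": "https://en.wikipedia.org/wiki/Some_chart"},
    {"chart": "Other", "url": "https://example.org/charts/top/"},
]
report = diagnose_rows(sample_rows, resolver, {"billboard.com"})
for reason, count in report.items():
    print(f"{count:3d}  {reason}")
```

With the notebook's own objects, a call such as `diagnose_rows(urls_df.to_dict("records"), resolve_target_url, DOMAIN_SCRAPERS)` should give the same kind of breakdown, assuming those names are defined as in the cells above.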


In [43]:
# New domain scrapers: TurnTable (Nigeria) and Official South African Charts (TOSAC)
from bs4 import BeautifulSoup

@scraper("turntablecharts.com")
def scrape_turntable(url, limit=10):
    soup = fetch_soup(url)
    out = []
    # Try common structures first
    candidates = [
        "div.chart__row", "div.song-card", "div.songItem", "div.chart-list-item",
        "table tbody tr", "table tr", "li",
    ]
    rows = []
    for sel in candidates:
        rows = soup.select(sel)
        if rows:
            break
    for row in rows:
        rank = _first_text(row, [".rank", ".position", "td:nth-child(1)", "span.rank", "div.rank"]) or _digits(row.get_text(" ", strip=True))
        title = _first_text(row, [
            ".title a", ".title", ".song a", ".song", ".track a", ".track",
            "td:nth-child(2)", "h3", "h4",
        ])
        artist = _first_text(row, [
            ".artist a", ".artist", ".singer", ".performer", ".by",
            "td:nth-child(3)", "p", "small",
        ])
        if title and artist and rank:
            out.append({
                "position": _digits(rank) or rank,
                "title": title,
                "artist": artist,
            })
        if len(out) >= limit:
            break
    return out


@scraper("theofficialsacharts.co.za")
def scrape_tosac(url, limit=10):
    soup = fetch_soup(url)
    out = []
    # Prefer table rows if available
    rows = soup.select("table tbody tr")
    if not rows:
        rows = soup.select("div.chart__table-row, li.chart__list-item, div.chart-card, div.chart-row")
    for row in rows:
        rank = _first_text(row, [
            ".chart__position", ".position", "td:nth-child(1)", "span.position", "div.position"
        ]) or _digits(row.get_text(" ", strip=True))
        title = _first_text(row, [
            ".chart__track a", ".chart__track", ".title a", ".title", "td:nth-child(2)", "h3", "h4"
        ])
        artist = _first_text(row, [
            ".chart__artist a", ".chart__artist", ".artist a", ".artist", "td:nth-child(3)", "p", "small"
        ])
        if title and artist and rank:
            out.append({
                "position": _digits(rank) or rank,
                "title": title,
                "artist": artist,
            })
        if len(out) >= limit:
            break
    return out


In [44]:
# Enhanced resolver for supported official chart sites
import time  # used by scrape_supported_charts below
from urllib.parse import urlparse

_OFFICIALCHARTS_PATTERNS = [
    ("vinyl albums", "https://www.officialcharts.com/charts/vinyl-albums-chart/"),
    ("vinyl singles", "https://www.officialcharts.com/charts/vinyl-singles-chart/"),
    ("record store", "https://www.officialcharts.com/charts/record-store-chart/"),
    ("compilation", "https://www.officialcharts.com/charts/compilations-chart/"),
    ("album downloads", "https://www.officialcharts.com/charts/album-downloads-chart/"),
    ("albums streaming", "https://www.officialcharts.com/charts/albums-streaming-chart/"),
    ("audio streaming", "https://www.officialcharts.com/charts/audio-streaming-chart/"),
    ("albums", "https://www.officialcharts.com/charts/albums-chart/"),
    ("singles downloads", "https://www.officialcharts.com/charts/singles-downloads-chart/"),
    ("singles", "https://www.officialcharts.com/charts/singles-chart/"),
]

_BILLBOARD_PATTERNS = [
    ("argentina hot 100", "https://www.billboard.com/charts/billboard-argentina-hot-100/"),
    ("canadian hot 100", "https://www.billboard.com/charts/canadian-hot-100/"),
    ("brazil hot 100", "https://www.billboard.com/charts/billboard-brazil-hot-100/"),
    ("mexico airplay", "https://www.billboard.com/charts/mexico-airplay/"),
    ("global 200", "https://www.billboard.com/charts/billboard-global-200/"),
    ("global excl", "https://www.billboard.com/charts/billboard-global-excl-us/"),
    ("billboard 200", "https://www.billboard.com/charts/billboard-200/"),
    ("hot 100", "https://www.billboard.com/charts/hot-100/"),
]


def _resolve_chart_url(row):
    url = str(row.get("url", ""))
    if not url:
        return None
    parsed = urlparse(url)
    domain = parsed.netloc.replace("www.", "")
    chart_name = str(row.get("chart", "")).lower()

    if domain == "officialcharts.com":
        if "/charts/" in parsed.path:
            return url
        for key, target in _OFFICIALCHARTS_PATTERNS:
            if key in chart_name:
                return target
        return "https://www.officialcharts.com/charts/singles-chart/"

    if domain == "billboard.com":
        if "/charts/" in parsed.path:
            return url
        for key, target in _BILLBOARD_PATTERNS:
            if key in chart_name:
                return target
        return "https://www.billboard.com/charts/hot-100/"

    return None


def scrape_supported_charts(url_df, limit=10, sleep_seconds=1.0):
    work = url_df.copy()
    work["target_url"] = work.apply(_resolve_chart_url, axis=1)
    work = work.dropna(subset=["target_url"])
    work["domain"] = work["target_url"].apply(lambda u: urlparse(u).netloc.replace("www.", ""))
    work = work[work["domain"].isin(DOMAIN_SCRAPERS.keys())].reset_index(drop=True)

    results = []
    for _, row in work.iterrows():
        target = row["target_url"]
        domain = row["domain"]
        chart = row.get("chart")
        try:
            entries = scrape_top_songs(target, limit=limit)
            for e in entries:
                results.append({
                    "domain": domain,
                    "chart": chart,
                    "url": target,
                    "position": e.get("position"),
                    "title": e.get("title"),
                    "artist": e.get("artist"),
                    "scraped_at": pd.Timestamp.utcnow().isoformat(),
                })
        except Exception as exc:
            print(f"[ERROR] {domain} {target}: {exc}")
        time.sleep(sleep_seconds)
    return pd.DataFrame(results)


try:
    urls_df
except NameError:
    urls_df = load_urls("../data/chart_official_sites.csv")

enhanced_df = scrape_supported_charts(urls_df, limit=10, sleep_seconds=1.0)
print(enhanced_df.head(10).to_string(index=False))
out_path = "../data/top_songs.csv"
try:
    enhanced_df.to_csv(out_path, index=False)
    print(f"Saved to {out_path} ({len(enhanced_df)} rows)")
except Exception:
    enhanced_df.to_csv("top_songs.csv", index=False)
    print(f"Saved to top_songs.csv ({len(enhanced_df)} rows)")


Empty DataFrame
Columns: []
Index: []
Saved to ../data/top_songs.csv (0 rows)
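
The resolvers normalise hosts with `netloc.replace("www.", "")`. That call removes every occurrence of `www.` in the host, not only a leading one, and it keeps any port or upper-case letters, so a lookup in `DOMAIN_SCRAPERS` can miss. A minimal sketch of a stricter normalisation; `normalize_domain` is a hypothetical helper, not part of the notebook:

```python
from urllib.parse import urlparse


def normalize_domain(url):
    """Return the lower-cased host without a leading 'www.' (hypothetical helper)."""
    # .hostname is already lower-cased and excludes any port or credentials
    host = urlparse(url).hostname or ""
    # Strip the prefix only when it is at the start of the host
    return host[4:] if host.startswith("www.") else host


print(normalize_domain("https://www.billboard.com/charts/hot-100/"))
print(normalize_domain("https://WWW.OfficialCharts.com:443/charts/"))
print(normalize_domain("not a url"))
```

Strings without a host yield an empty string, which simply fails the membership test instead of raising.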


In [45]:
# Override resolver to include TurnTable and TOSAC mappings, then rerun scrape
from urllib.parse import urlparse, urljoin

# Keep previously defined pattern maps for OfficialCharts and Billboard if present.

_DEF_TURNTABLE = {
    "songs": "https://turntablecharts.com/charts/top100",
    "albums": "https://turntablecharts.com/charts/2",
}

_DEF_TOSAC_FALLBACK = "https://theofficialsacharts.co.za/charts/"


def _resolve_chart_url(row):
    url = str(row.get("url", ""))
    if not url:
        return None
    parsed = urlparse(url)
    domain = parsed.netloc.replace("www.", "")
    chart_name = str(row.get("chart", "")).lower()

    # Existing supported providers
    if domain == "officialcharts.com":
        if "/charts/" in parsed.path:
            return url
        # Map by chart name keywords
        for key, target in _OFFICIALCHARTS_PATTERNS:
            if key in chart_name:
                return target
        return "https://www.officialcharts.com/charts/singles-chart/"

    if domain == "billboard.com":
        if "/charts/" in parsed.path:
            return url
        for key, target in _BILLBOARD_PATTERNS:
            if key in chart_name:
                return target
        return "https://www.billboard.com/charts/hot-100/"

    # New providers
    if domain == "turntablecharts.com":
        if "/charts/" in parsed.path:
            return url
        if "album" in chart_name:
            return _DEF_TURNTABLE["albums"]
        return _DEF_TURNTABLE["songs"]

    if domain == "theofficialsacharts.co.za":
        # If already on a charts page, use it
        if "/charts" in parsed.path:
            return url
        # Try to discover a Top 100 link
        try:
            soup = fetch_soup(url)
            # Prefer explicit "Top 100" links
            link = soup.find("a", string=lambda s: isinstance(s, str) and "top 100" in s.lower())
            if not link:
                # Any charts link
                link = soup.find("a", href=lambda h: isinstance(h, str) and "/charts" in h)
            if link:
                href = link.get("href")
                return urljoin(url, href)
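        except Exception as exc:
            # Link discovery is best-effort: report the failure, then use the fallback below
            print(f"[WARN] {domain} {url}: {exc}")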
        return _DEF_TOSAC_FALLBACK

    return None


# Re-run using the enhanced resolver
try:
    urls_df
except NameError:
    urls_df = load_urls("../data/chart_official_sites.csv")

run_df = scrape_supported_charts(urls_df, limit=10, sleep_seconds=1.0)
print(run_df.head(10).to_string(index=False))
print(f"Scraped rows: {len(run_df)} from domains: {sorted(run_df['domain'].unique()) if len(run_df) else []}")
out_path = "../data/top_songs.csv"
try:
    run_df.to_csv(out_path, index=False)
    print(f"Saved to {out_path} ({len(run_df)} rows)")
except Exception:
    run_df.to_csv("top_songs.csv", index=False)
    print(f"Saved to top_songs.csv ({len(run_df)} rows)")


             domain         chart                                       url  position                title artist                       scraped_at
turntablecharts.com TurnTable Top https://turntablecharts.com/charts/top100         1                  FUN      1 2025-11-10T07:36:11.281806+00:00
turntablecharts.com TurnTable Top https://turntablecharts.com/charts/top100         2       Who's Dat Girl      2 2025-11-10T07:36:11.291050+00:00
turntablecharts.com TurnTable Top https://turntablecharts.com/charts/top100         3 Shakabulizzy (Remix)     25 2025-11-10T07:36:11.291163+00:00
turntablecharts.com TurnTable Top https://turntablecharts.com/charts/top100         4          Body (danz)      4 2025-11-10T07:36:11.291213+00:00
turntablecharts.com TurnTable Top https://turntablecharts.com/charts/top100         5       MONEY CONSTANT      5 2025-11-10T07:36:11.291269+00:00
turntablecharts.com TurnTable Top https://turntablecharts.com/charts/top100         6                  you      4 2025
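
In the output above the `artist` column contains numbers such as `1`, `2` and `25`, which suggests that one of the generic fallback selectors in `scrape_turntable` may have matched a rank or movement element instead of the artist name. A minimal sketch of a sanity filter that could run before rows are appended, assuming entries are the `position`/`title`/`artist` dictionaries built above; `is_plausible_entry` is a hypothetical helper, and the last sample row is made up:

```python
def is_plausible_entry(entry):
    """Reject rows whose title or artist is empty or purely numeric (hypothetical check)."""
    for key in ("title", "artist"):
        value = str(entry.get(key) or "").strip()
        if not value or value.isdigit():
            return False
    return True


entries = [
    {"position": "1", "title": "FUN", "artist": "1"},
    {"position": "3", "title": "Shakabulizzy (Remix)", "artist": "25"},
    {"position": "9", "title": "Some title", "artist": "Some Artist"},  # made-up row
]
kept = [e for e in entries if is_plausible_entry(e)]
print(f"kept {len(kept)} of {len(entries)} entries")
```

A filter like this only hides the symptom; the selectors still need to be checked against the live page markup.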

## Exercise 2: Locate places on a map

The objective here is to locate the host cities on a map.
You will have to code the following steps:
* Collect the URL of each city's page
* Go to this page using `urllib.request`
* Find the coordinates
* Store the cities and their coordinates in a data frame
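
The last two steps can be sketched as follows. This is an illustration under stated assumptions, not the expected solution: it assumes each city page exposes its coordinates in a `<span class="geo">latitude; longitude</span>` element (check the actual page source), `extract_coordinates` is a hypothetical helper, and the HTML snippets are hand-written stand-ins for pages that would be downloaded with `urllib.request`.

```python
import re

# Matches e.g. <span class="geo">48.8567; 2.3508</span> (assumed markup)
GEO_PATTERN = re.compile(
    r'<span class="geo">\s*(-?\d+(?:\.\d+)?)\s*;\s*(-?\d+(?:\.\d+)?)\s*</span>'
)


def extract_coordinates(html):
    """Return (latitude, longitude) as floats, or None if no match is found."""
    match = GEO_PATTERN.search(html)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Hand-written stand-ins for downloaded pages (values are illustrative)
pages = {
    "Paris": '<p><span class="geo">48.8567; 2.3508</span></p>',
    "Rio de Janeiro": '<p><span class="geo">-22.9111; -43.2056</span></p>',
    "Nowhere": "<p>no coordinates here</p>",
}

records = []
for city, html in pages.items():
    coords = extract_coordinates(html)
    if coords is not None:  # skip pages without coordinates
        records.append({"city": city, "latitude": coords[0], "longitude": coords[1]})

print(records)
```

`pandas.DataFrame(records)` then gives the data frame asked for in the last step. With BeautifulSoup, `page.select_one("span.geo")` would be the equivalent lookup under the same markup assumption.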

The map can be obtained with the following code:

# Project

The objective of this project is to practice all the concepts taught in the main lecture. You will therefore have to collect data around a theme, define a research question, clean and format the data, provide some exploratory analysis and visualization, and build and evaluate models. You will also have to design a dashboard and tell the story of the whole pipeline.


The project has two deliverables, each describing the methodology and the results:
* A technical report aimed at your data scientist colleagues
* An oral presentation aimed at your CEO, manager, or client. Here we assume that the audience is not specialized in data science, but you still need to present the methodology to convince them to trust the results.


**Requirements:**
* Work in teams of two (same team throughout the semester)
* All your work should be stored on a git repo: [tutorial](https://github.com/baskiotisn/2IN013robot2023/blob/d979333fb80c9b6acd9515aaec040943d10d365c/docs/tutoriel_git.pdf)

**Remarks:**
You will run into other libraries and issues when collecting data. Some are listed below, but do not hesitate to use a search engine to solve your own problems!
* [Regular expressions](https://docs.python.org/3/howto/regex.html) might be useful!
* [API with authentication](https://www.geeksforgeeks.org/authentication-using-python-requests/)
* [Selenium](https://selenium-python.readthedocs.io/) when pages are generated by JavaScript
* [Playwright](https://playwright.dev/), which looks easier to use and better suited than Selenium
* [Scrapy](https://scrapy.org/) for web crawling, or when you don't know the URLs in advance. [Tutorial here](https://doc.scrapy.org/en/latest/intro/tutorial.html)
* [Summary of some difficulties in scraping data from the web](https://www.zenrows.com/blog/web-scraping-challenges#page-structure-changes)

## Your daily task

No problem can be solved if it has not been clearly stated! Defining the problem is tricky, however, as it may depend on the available data and their quality.
You should first try to find a well-defined question to answer and check whether any data sources are available.

Collect data from the web and from open data portals. Open data may include CSV files; you can use them, but scraping should be your main activity when gathering the dataset.
You can begin formatting the data and merging them into a dataframe that will be analyzed next week.