<a href="https://colab.research.google.com/github/KGzB/CAS-Applied-Data-Science/blob/master/Module-1/M1-D3-WWW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notebook 1, Module 1, Data and Data Management, CAS Applied Data Science, 2024-08-23, A. Mühlemann, University of Bern. (based on the template by S. Haug)


# 3. Data acquisition on the world wide web

**Learning outcomes:**

Participants will be able to collect data from web sources. Examples are provided and exercised. We have about 1.5 hours for this tutorial.

**Table of Contents**
- 3.1 Read JSON from the web
- 3.2 Retrieve and display pictures and files from the web
- 3.3 Scraping webpages (HTML scraping)
- 3.4 Cron jobs and Scheduled tasks
- Some notes and links concerning social media

**Further sources**
- Examples all over the internet
- A book: https://www.packtpub.com/big-data-and-business-intelligence/mastering-social-media-mining-python


## 3.1 Analyse Aare with data from https://aareguru.existenz.ch/



Get the data from the website, bring it into a format that can be imported into a data frame, and plot the time series and the histograms.

In [0]:
import requests
import pandas as pd
from datetime import datetime

# URL of the Aare Guru API, requesting data for Bern only
url = 'https://aareguru.existenz.ch/v2018/current?city=bern'

# Send GET request to the API
response = requests.get(url)
response.raise_for_status()  # Stop early if the request failed
data = response.json()  # Parse the JSON response
df = pd.DataFrame(data['aarepast']) # Past flow and water temperature as a data frame
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s') # Convert Unix seconds to datetime (UTC)
df
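
The `unit='s'` argument above tells pandas that the timestamps are seconds since 1970-01-01 00:00 UTC. As a minimal sketch of the same conversion with the standard library only, assuming records shaped like the `aarepast` entries (the values below are made up for illustration):

```python
from datetime import datetime, timezone

# Hypothetical records in the shape of the 'aarepast' entries (values are made up)
records = [
    {"timestamp": 1724400000, "flow": 180.0, "temperature": 19.8},
    {"timestamp": 1724400600, "flow": 181.0, "temperature": 19.9},
]

# Interpret each timestamp as seconds since the Unix epoch, in UTC
times = [datetime.fromtimestamp(r["timestamp"], tz=timezone.utc) for r in records]

for t, r in zip(times, records):
    print(t.isoformat(), r["temperature"])
```

Keeping the times in UTC matters later on, when this series is combined with data from another API.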

Get a rough overview of the data set we obtained.

In [0]:
df.describe()

Plot water temperature.

In [0]:
import matplotlib.pyplot as plt
df.plot(x='timestamp', y='temperature', kind='line')
plt.show()
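
A histogram shows how often each temperature range occurs. The following sketch computes the counts behind such a histogram; the temperatures are made up and stand in for `df['temperature']`, and the choice of 3 bins is arbitrary:

```python
import numpy as np
import pandas as pd

# Made-up water temperatures in °C, standing in for df['temperature']
temperature = pd.Series([18.0, 18.2, 18.9, 19.4, 19.5, 20.1, 20.8, 21.0])

# Count the values in 3 equally wide bins between the minimum and the maximum
counts, edges = np.histogram(temperature, bins=3)
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f} °C: {n} values")

# In the notebook, the bar chart itself can be drawn with:
# temperature.plot(kind='hist', bins=3)
```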

Let us now compare the Aare temperature in Bern with the air temperature and precipitation. You can find the documentation of this API here: https://open-meteo.com/en/docs

In [0]:
# URL of the Open-Meteo weather API; latitude and longitude are set to Bern
url = 'https://api.open-meteo.com/v1/forecast?latitude=46.95&longitude=7.45&past_days=10&hourly=temperature_2m,precipitation'

# Send GET request to the API
response = requests.get(url)
data = response.json()  # Parse the JSON response
df2 = pd.DataFrame(data['hourly']) # Hourly air temperature and precipitation as a data frame
df2['time'] = pd.to_datetime(df2['time']) # Convert ISO time strings to datetime
df2
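
Long query strings are easy to mistype. As a sketch, the same kind of URL can be assembled from a dictionary of parameters; the coordinates below are an assumption (roughly Bern), the other parameter names are those used in the URL above:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.open-meteo.com/v1/forecast"

params = {
    "latitude": 46.95,    # assumed coordinates of Bern
    "longitude": 7.45,
    "past_days": 10,
    "hourly": "temperature_2m,precipitation",
}

# safe="," keeps the comma in the list of hourly variables readable
url = BASE_URL + "?" + urlencode(params, safe=",")
print(url)
```

With `requests`, passing the dictionary as `requests.get(BASE_URL, params=params)` builds the query string for you.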


Let's also plot the temperature.

In [0]:
df2.plot(x='time', y='temperature_2m', kind='line')
plt.show()

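We can see that the two time series cover different time ranges and have different sampling intervals. Let us therefore merge them on the timestamp, keeping the fine time grid of the Aare data and adding the weather values wherever the timestamps match.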

In [0]:
df2.rename(columns={'time': 'timestamp'}, inplace=True)
df_new = pd.merge(df, df2, on='timestamp', how='left') # Left join keeps the fine time grid of the Aare data
df_new
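
To see what the `how` argument does, here is a small sketch on two made-up tables: a left join keeps every row of the first table and fills missing matches with `NaN`, while an inner join keeps only timestamps present in both.

```python
import pandas as pd

# Made-up water data on a fine time grid
water = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-08-23 08:00", "2024-08-23 08:10", "2024-08-23 09:00"]),
    "temperature": [19.8, 19.9, 20.1],
})
# Made-up hourly air data
air = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-08-23 08:00", "2024-08-23 09:00", "2024-08-23 10:00"]),
    "temperature_2m": [17.0, 18.5, 20.0],
})

merged_left = pd.merge(water, air, on="timestamp", how="left")    # all water rows
merged_inner = pd.merge(water, air, on="timestamp", how="inner")  # common timestamps only

print(merged_left)
print(merged_inner)
```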

Now let us plot the air and water temperature to see whether there is a correlation.

In [0]:
plt.figure(figsize=(10, 6))  # Optional: Adjust the figure size
plt.plot(df_new['timestamp'], df_new['temperature'], label="water temperature")
plt.plot(df_new['timestamp'], df_new['temperature_2m'], label="air temperature", marker='o')  # markers, since only hourly values exist
# Customize the plot
plt.xlabel('datetime')
plt.ylabel('Temperature (°C)')
plt.title('Aare and Air Temperature in Bern')
plt.legend()
plt.show()
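
A plot gives a visual impression; a correlation coefficient puts a number on it. The sketch below uses made-up values standing in for `df_new` and computes Pearson's r on the rows where both temperatures are available:

```python
import pandas as pd

# Made-up values standing in for df_new; None marks a timestamp without an air temperature
merged = pd.DataFrame({
    "temperature":    [19.0, 19.2, 19.5, 20.0, 20.5],
    "temperature_2m": [15.0, None, 17.5, 18.5, 21.0],
})

# Keep only rows where both temperatures are available, then compute Pearson's r
both = merged.dropna(subset=["temperature", "temperature_2m"])
r = both["temperature"].corr(both["temperature_2m"])
print(f"Pearson correlation on {len(both)} rows: {r:.2f}")
```

On the real data, expect a different value; water usually reacts to air temperature with a delay, so comparing shifted series may also be worth a try.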

**Possible further exercise or project for Module 1 and 2**

Find a colleague who knows how to use the API and can get the historical data out of https://aareguru.existenz.ch/. Bring all data into one data frame. Look for correlations and averages (per month, per year, ...). Combine the data with weather data, e.g. the wind on Lake Thun. For the Module 2 project, try to build a linear regression model predicting the Aare temperature.

## 3.2 Get pictures (or files) from webpages

Get three pictures from a web server and display them directly with the `Image` class from IPython.

In [0]:
import requests
import pandas as pd
from datetime import datetime
from IPython.display import Image, display

# URL of the Dog CEO API (images of hounds)
url = 'https://dog.ceo/api/breed/hound/images'

# Send GET request to the API
response = requests.get(url)
data = response.json()
dog_images = pd.DataFrame(data)
for i in range(1,4):
    url = dog_images.iat[i,0]
    img = Image(url)
    display(img)

### Exercise
There is an API giving some cat pictures (https://api.thecatapi.com/v1/images/search?limit=10). Can you adapt the above code to obtain pictures of cats?

In [0]:
import requests
import pandas as pd
from datetime import datetime
from IPython.display import Image, display

# URL of the Cat API (image search, limited to 10 results)
url = 'https://api.thecatapi.com/v1/images/search?limit=10'

# Send GET request to the API
response = requests.get(url)
data = response.json()
cat_images = pd.DataFrame([item['url'] for item in data])
for i in range(1,4):
    url = cat_images.iat[i,0]
    img = Image(url)
    display(img)
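
The two code cells differ because the two APIs structure their JSON differently. The sketch below contrasts the two shapes; they are assumptions reduced to the fields used above, and the URLs are placeholders:

```python
import json

# Assumed response shapes, reduced to the fields used above; the URLs are placeholders
dog_response = json.loads(
    '{"message": ["https://example.org/dog1.jpg", "https://example.org/dog2.jpg"],'
    ' "status": "success"}'
)
cat_response = json.loads(
    '[{"id": "a1", "url": "https://example.org/cat1.jpg"},'
    ' {"id": "b2", "url": "https://example.org/cat2.jpg"}]'
)

# A dict holding one list of URLs versus a list of dicts with one URL each
dog_urls = dog_response["message"]
cat_urls = [item["url"] for item in cat_response]
print(dog_urls)
print(cat_urls)
```

Printing `data` or `type(data)` once is usually the quickest way to find out which shape a new API returns.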

## 3.3 Scrape Webpages (html scraping)

There are billions of web pages online. With Python you can easily read and parse them if you have the links. Since pages link to each other, one can in principle crawl a large part of the web by following these links.

In Python, the library Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) pulls data out of HTML and XML pages. We do not practise that library here; however, if you deal with a lot of HTML at some point, you may want to use it.
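
To give an idea of what an HTML parser does, here is a minimal sketch that uses `html.parser` from the standard library rather than Beautiful Soup. The class name `MailtoParser` and the HTML snippet are made up for illustration:

```python
from html.parser import HTMLParser

class MailtoParser(HTMLParser):
    """Collect the addresses of all <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith("mailto:"):
                # Drop the 'mailto:' prefix and any '?subject=...' part
                self.addresses.append(value[len("mailto:"):].split("?")[0])

# Made-up HTML snippet for illustration
sample_html = """
<p>Contact: <a href="mailto:info@example.org">Office</a>,
<a href="mailto:jane.doe@example.org?subject=Hello">Jane</a>,
<a href="https://example.org/team">Team</a></p>
"""

parser = MailtoParser()
parser.feed(sample_html)
print(parser.addresses)
```

Because the parser understands tags and attributes, it does not depend on where spaces or line breaks happen to fall in the page.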

Here we get the email addresses from the contact page of a website.


In [0]:
from urllib.request import urlopen
import pandas as pd

startlink = "https://www.zeilenwerk.ch/agentur"
f = urlopen(startlink)
myfile = f.read()
html = myfile.decode('utf-8', errors='ignore')  # Convert the bytes to text
tokens = html.split()  # Split the page at whitespace
addresses = []
for token in tokens:
    if 'mailto' in token:
        # A token such as href="mailto:name@example.org" splits into 3 parts at the quotes
        parts = token.split('"')
        if len(parts) == 3:
            # Drop the 'mailto:' prefix and keep the address
            addresses.append(parts[1].split(':')[1])
df_addrs = pd.DataFrame(addresses, columns=['Addresses'])
df_addrs

The above code is not optimal, as you have probably seen. Let's use regular expressions instead (the pattern is taken from Stack Overflow). Regular expressions are a bit geeky, but very powerful and great fun. If you don't want to learn them, you can usually find the expression you need by searching the web.

In [0]:
import re # the regular expression module
startlink = "https://www.zeilenwerk.ch/agentur"
f = urlopen(startlink)
html = f.read()
# Extract email addresses
reobj = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
print(re.findall(reobj, html.decode('utf-8')))
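
Patterns are easiest to understand when tried on a short text. The sketch below runs the pattern from above on a made-up sentence and compares it with a relaxed variant (an assumption of ours, not part of the original pattern) that drops the upper limit of 6 letters on the last part of the domain:

```python
import re

# Same pattern as above, plus a relaxed variant without the upper limit on the last part
strict = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
relaxed = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)

# Made-up text for illustration
text = "Write to info@example.org or Jane.Doe@uni.example.ch, not to team@example.technology"

print(strict.findall(text))
print(relaxed.findall(text))
```

The strict pattern skips the address with the long ending, so it may miss addresses under newer top-level domains.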

### Exercise

Here is a nice little challenge for you. Use the code above (together with a for loop or two) to scrape all web pages of your employer for publicly available email addresses and put them into a data frame :)

In [0]:
import pandas as pd

INDEX_URL = "https://www.swissinfo.ch/eng/sitemap-2023.xml"

def read_sitemap(url: str) -> pd.DataFrame:
    """Read a (partial) sitemap with <url><loc>…</loc><lastmod>…</lastmod> entries."""
    # Some sitemaps contain a <urlset> directly, others are again indexes
    # that only list further sitemaps in 'loc'. In both cases we return the parsed table.
    return pd.read_xml(url)

# 1) Read the index -> list of the linked (daily) sitemaps
index_df = read_sitemap(INDEX_URL)
if "loc" not in index_df.columns:
    raise ValueError("No 'loc' entries were found in the index.")

sitemap_links = index_df["loc"].dropna().unique().tolist()

# 2) Read all linked sitemaps and combine them
all_urls = []
for sm in sitemap_links:
    try:
        part = read_sitemap(sm)
        # Keep only tables that contain actual page URLs (i.e. a <urlset> with <loc>)
        if "loc" in part.columns:
            # Some sitemaps contain more fields (e.g. lastmod). We keep the essentials.
            keep = [c for c in part.columns if c in {"loc", "lastmod", "changefreq", "priority"}]
            all_urls.append(part[keep])
    except Exception as e:
        print(f"Skipped because of an error at {sm}: {e}")

urls_df = pd.concat(all_urls, ignore_index=True).drop_duplicates(subset=["loc"])
# Optional: keep only the site's own domain (already the case here)
# urls_df = urls_df[urls_df["loc"].str.startswith("https://www.swissinfo.ch/")]

print(urls_df.shape)
print(urls_df.head())

# Optional: save the result
# urls_df.to_csv("swissinfo_sitemap_2023_urls.csv", index=False)


In [0]:
import requests, pandas as pd
from lxml import etree
from concurrent.futures import ThreadPoolExecutor, as_completed

INDEX_URL = "https://www.watson.ch/sitemap.xml"
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; sitemap-loader/1.0)"}
TIMEOUT = 15
MAX_SITEMAPS = 10   # adjust here: number of sitemaps to load
MAX_WORKERS = 8     # number of parallel download threads

def fetch(url):
    r = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.content

def parse_urlset(xml_bytes):
    # Extract <loc> and, if present, <lastmod> from a <urlset>
    root = etree.fromstring(xml_bytes)
    ns = root.nsmap.get(None, "")  # default namespace
    nsmap = {"ns": ns} if ns else {}
    locs = root.xpath("//ns:url/ns:loc/text()" if ns else "//url/loc/text()", namespaces=nsmap)
    lastmods = root.xpath("//ns:url/ns:lastmod/text()" if ns else "//url/lastmod/text()", namespaces=nsmap)
    if lastmods and len(locs) == len(lastmods):
        return pd.DataFrame({"loc": locs, "lastmod": lastmods})
    return pd.DataFrame({"loc": locs})

# 1) Fetch the index and limit it to the first N sitemaps
index_xml = fetch(INDEX_URL)
root = etree.fromstring(index_xml)
ns = root.nsmap.get(None, "")
nsmap = {"ns": ns} if ns else {}
sitemap_links = root.xpath("//ns:sitemap/ns:loc/text()" if ns else "//sitemap/loc/text()", namespaces=nsmap)
sitemap_links = sitemap_links[:MAX_SITEMAPS]

# 2) Download and parse all sitemaps in parallel
parts = []
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as ex:
    futs = {ex.submit(fetch, sm): sm for sm in sitemap_links}
    for fut in as_completed(futs):
        sm = futs[fut]
        try:
            xml_bytes = fut.result()
            df = parse_urlset(xml_bytes)
            if not df.empty:
                parts.append(df[["loc"] + ([c for c in df.columns if c == "lastmod"])])
        except Exception as e:
            print(f"Error at {sm}: {e}")

urls_df = pd.concat(parts, ignore_index=True).drop_duplicates("loc")
print(urls_df.shape)
print(urls_df.head())
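
Sitemaps declare an XML namespace, which is why the XPath expressions above carry the `ns:` prefix. As a sketch with the standard library only, the same idea looks as follows on a made-up sitemap; the prefix `sm` is an arbitrary name chosen here:

```python
import xml.etree.ElementTree as ET

# Made-up sitemap in the standard sitemap namespace
sample_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/a</loc><lastmod>2024-08-01</lastmod></url>
  <url><loc>https://example.org/b</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sample_xml)

# Read loc and lastmod per <url> element, so the two stay aligned
rows = []
for url in root.findall("sm:url", NS):
    rows.append({
        "loc": url.findtext("sm:loc", namespaces=NS),
        "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
    })
print(rows)
```

Reading both fields per `<url>` element keeps them paired even when some entries have no `<lastmod>`.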


In [0]:
print(urls_df["loc"][0])

In [0]:
import re
from urllib.request import urlopen
import pandas as pd
from tqdm import tqdm   # progress bar

# Regex for email addresses
reobj = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)

all_emails = []

# Loop over the first 2 URLs with a progress bar (increase the number to scrape more pages)
for link in tqdm(urls_df["loc"].head(2), desc="Scraping", unit="url"):
    try:
        f = urlopen(link)
        html = f.read().decode("utf-8", errors="ignore")

        # Store one record per email address found on the page
        for email in re.findall(reobj, html):
            all_emails.append({"url": link, "email": email})

    except Exception as e:
        print(f"Error at {link}: {e}")

# Collect the results in a data frame (fixed columns, so it also works if nothing was found)
emails_df = pd.DataFrame(all_emails, columns=["url", "email"])

print(pd.unique(emails_df['email']))


### Tables from webpages

If you or someone else publishes data in HTML tables, it can be collected quite easily with pandas, directly and without the urllib module.

In [0]:
import pandas as pd
link = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(link)
df = tables[0]
df.head()

Which countries have the largest population?

In [0]:
df.info()

In [0]:
s_df = df.iloc[1:,1:3]
s_df.sort_values('Population', ascending=False)
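
Sorting gives the expected order only if the column holds numbers. Depending on how the table is parsed, figures may arrive as text such as `'8,960,800'`. Here is a sketch of the conversion on a made-up table (placeholder names and values, not the Wikipedia data):

```python
import pandas as pd

# Made-up table with numbers stored as text, as can happen after scraping
table = pd.DataFrame({
    "Location": ["Country A", "Country B", "Country C"],
    "Population": ["1,409,670,000", "8,960,800", "83,445,000"],
})

# Remove the thousands separators and convert the text to numbers
table["Population"] = pd.to_numeric(table["Population"].str.replace(",", "", regex=False))

top = table.sort_values("Population", ascending=False).reset_index(drop=True)
print(top)
```

Checking the column types with `df.info()`, as done above, shows whether such a conversion is needed.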

### Exercise
Find another interesting table to scrape from Wikipedia and look at it more closely.

In [0]:
import pandas as pd
link = "https://de.wikipedia.org/wiki/Liste_der_meistverkauften_Rapalben_in_Deutschland"
tables = pd.read_html(link)
df = tables[1]
df.head()

## Some notes and links concerning social media

### 1. Google search

There are APIs for running Google searches from Python. Here is one explained:

https://stackoverflow.com/questions/37083058/programmatically-searching-google-in-python-using-custom-search


### 2. Twitter

Twitter generates about 500M tweets per day. Thus, data mining on Twitter can be interesting.

Note: there are rate limits in the use of the Twitter API, as well as limitations in case you want to provide a downloadable data-set, see:

https://dev.twitter.com/overview/terms/agreement-and-policy

https://dev.twitter.com/rest/public/rate-limiting

Tweepy is one Python module with clients for the Twitter API.

- https://www.tweepy.org/


### 3. Instagram

Instagram is among the largest photo-sharing social media platforms, with reportedly 500 million monthly active users and 95 million pictures and videos uploaded daily in 2018 (figures unverified).

https://stackoverflow.com/questions/61010431/how-to-start-with-the-instagramapi-in-python