# Data Scraping: weather records 🌡️
*** 

## Step 1 : Explore the web page 🌐
***

[Wikipedia link: List of weather records](https://en.wikipedia.org/wiki/List_of_weather_records)



**This is the table we are interested in**

<div>
    <img src='https://raw.githubusercontent.com/Selimmmm/spe1/1eb8695ee9f14d62b127789817a86889db14aa34/projets/projet_III/images/9_img_table_countries.png' width="800"/>
</div>


<br><br><br>

## Step 2 : Explore the page source 🖥️
***


**This is the table and its HTML code (file available at: `code/table_countries.html`)**


<div>
    <img src='https://raw.githubusercontent.com/Selimmmm/spe1/1eb8695ee9f14d62b127789817a86889db14aa34/projets/projet_III/images/code_img_table_countries.png' width="800"/>
</div>


## Step 3 : Locate useful data 💾
***


#### Page source of a row from the table of interest

```html
<tr>
    <td><span class="flagicon"><span class="mw-image-border"
                  typeof="mw:File"><span><img alt=""
                         src="//upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Flag_of_Panama.svg/23px-Flag_of_Panama.svg.png"
                         decoding="async"
                         width="23"
                         height="15"
                         class="mw-file-element"
                         srcset="//upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Flag_of_Panama.svg/35px-Flag_of_Panama.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Flag_of_Panama.svg/45px-Flag_of_Panama.svg.png 2x"
                         data-file-width="900"
                         data-file-height="600"></span></span>&nbsp;</span><a href="/wiki/Panama"
           title="Panama">Panama</a>
    </td>
    <td style="background: #FF0A00; color:#FFFFFF; font-size:85%;">40.0&nbsp;°C (104.0&nbsp;°F)
    </td>
    <td><a href="/wiki/San_Francisco,_Panam%C3%A1"
           title="San Francisco, Panamá">San Francisco</a>
    </td>
    <td><span data-sort-value="000000001998-03-20-0000"
              style="white-space:nowrap">20 March 1998</span><sup id="cite_ref-ETESA_168-0"
             class="reference"><a href="#cite_note-ETESA-168">[161]</a></sup>
    </td>
</tr>
```

#### Selection:

- The second `<td>` tag contains:
    - the temperature data point ✴️ (as the text of the tag)
    - an interesting color we can reuse! 🎨 (as the `background` value in the tag's `style` attribute)


- The third `<td>` tag contains the URL of the Wikipedia page of the location
    - We'll need the GPS coordinates stored on that page

## Step 4 : Code 🤖
***


### A. Obtain the page source
- `requests` is a library: code that is already written and that we can reuse
- We use `requests` to send an HTTP request to Wikipedia
- The response contains the page source of the web page




In [148]:
import requests

url = "https://en.wikipedia.org/wiki/List_of_weather_records"

response = requests.get(url)
status_code = response.status_code
print(f"Status code is: {status_code}")

if status_code == 200:
    print("As of now, everything is working.")
else:
    print("Some debugging has to be done.")

Status code is: 200
As of now, everything is working.


**The `text` attribute of the `response` object contains the page source.**<br>
**We display its first 100 characters:**

In [149]:
print(response.text[:100])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la



### B. Parse the page source
***
- An HTML page source has a tree structure: tags are nested inside other tags
- A tree is a specific kind of data structure, so we need a tool that knows how to navigate it
- The library `bs4` and its wonderful `BeautifulSoup` class let us extract the data we need from the HTML source: the table
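
To make the tree structure concrete, here is a minimal sketch that prints the nesting of tags in a simplified table row. It uses only the standard library's `html.parser` rather than `bs4`, and the `TagOutline` class, the `outline` helper, and the sample HTML are illustrative assumptions, not part of the notebook.

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Record each opening tag, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Simplification: assumes every tag is explicitly closed
        # (void elements such as <img> would need special handling).
        self.depth -= 1


def outline(html):
    """Return one indented line per opening tag found in `html`."""
    parser = TagOutline()
    parser.feed(html)
    parser.close()
    return parser.lines


# Hypothetical, simplified version of a row from the table of interest
sample = '<table><tr><td><a href="/wiki/Panama">Panama</a></td><td>40.0 °C</td></tr></table>'
print("\n".join(outline(sample)))
```

Each level of indentation in the printed outline corresponds to one level of nesting, which is the structure `BeautifulSoup` lets us navigate with `find` and `find_all`.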

**<div style="color:red">There are multiple tables with the classes `wikitable sortable`! Be sure to select the right one (`soup.find` returns the first match).</div>**


In [150]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly
print(len(soup.find_all("table", class_="wikitable sortable")), "tables with `wikitable sortable` class")
table = soup.find("table", class_="wikitable sortable")  # `find` returns the first match

3 tables with `wikitable sortable` class


### C. Extract the data needed
***

[Some useful information about HTML tables](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/table)<br>
***(developer.mozilla.org is one of the most useful websites for front-end questions)***


In [171]:
# List of all `tr` tags
# (except the first one, which is the header row)
rows = table.find_all("tr")[1:]

**Example: get temperature data**

In [152]:
row_example = rows[12]

tds_example = row_example.find_all("td")

print('### Td ###')
print(tds_example[1])


print('\n### Td -> Text ###')
print(tds_example[1].text)

print('\n### Td -> Style ###')
print(tds_example[1].get("style"))

### Td ###
<td style="background: #840000; color:#FFFFFF; font-size:85%;">50.4 °C (122.7 °F)
</td>

### Td -> Text ###
50.4 °C (122.7 °F)


### Td -> Style ###
background: #840000; color:#FFFFFF; font-size:85%;


**Example: get the URL of the location's page**

In [172]:
print('### Td -> a ###')

print(tds_example[2].find("a"))


print('\n### Td -> a -> href ###')

print(tds_example[2].find("a").get("href"))

### Td -> a ###
<a href="/wiki/Agadir" title="Agadir">Agadir</a>

### Td -> a -> href ###
/wiki/Agadir


In [173]:
raw_data = []
for row in rows:

    # All cells
    tds = row.find_all("td")

    # Temperature
    temperature_text = tds[1].text

    # Style to get background
    style = tds[1].get("style")

    # Color (background)
    color_code = style.split(";")[0].replace("background: ", "")

    # Url page of location
    a = tds[2].find("a")
    url_suffix = a.get("href") if a is not None else None

    raw_data.append(
        {
            "temperature_text": temperature_text,
            "color_code": color_code,
            "url_suffix": url_suffix,
        }
    )
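
The color extraction in the loop relies on `background: ` being the first declaration of the `style` attribute, written with exactly one space. As a hedged alternative, the sketch below pulls the hex color out with a regular expression, so it tolerates other orderings and spacing and returns `None` when no style is present. The helper name `extract_background_color` and the pattern are assumptions made for illustration.

```python
import re

# Matches `background: #RRGGBB` or `background-color: #RRGGBB`, with flexible spacing
BACKGROUND_RE = re.compile(r"background(?:-color)?\s*:\s*(#[0-9A-Fa-f]{3,8})")


def extract_background_color(style):
    """Return the hex background color found in an inline style, or None."""
    if not style:
        return None
    match = BACKGROUND_RE.search(style)
    return match.group(1) if match else None


print(extract_background_color("background: #840000; color:#FFFFFF; font-size:85%;"))
```

Inside the loop, this could stand in for the `split`/`replace` line, for example `color_code = extract_background_color(tds[1].get("style"))`.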

### D. Clean the data
***

- We rely on `pandas`, which gives us a high-level API to manipulate tabular data <br>

**(high-level means it gives us the right to be lazy: complex operations can often be handled in a few lines of code)**


In [155]:
import pandas as pd
df = pd.DataFrame(raw_data)

In [156]:
def extract_temperature_in_celsius(temp_text):
    """Extract the temperature in degrees Celsius and convert it to a float."""
    temp = temp_text.split("°")[0]
    temp = temp.strip()
    return float(temp)

In [157]:
df["temperature"] = df.temperature_text.map(extract_temperature_in_celsius)
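
The split-based function works for the values in this table. A slightly more defensive variant, sketched below, uses a regular expression so that it also copes with non-breaking spaces, returns `None` instead of raising when no value is found, and accepts the Unicode minus sign. The name `parse_celsius` is hypothetical, and the minus-sign handling is an assumption about how negative temperatures might be written elsewhere; the table used here does not need it.

```python
import re

# Optional sign (ASCII hyphen or Unicode minus U+2212), digits, optional decimals, then "°C"
CELSIUS_RE = re.compile(r"([-\u2212]?\d+(?:\.\d+)?)\s*°C")


def parse_celsius(temp_text):
    """Return the Celsius value as a float, or None if no value is found."""
    match = CELSIUS_RE.search(temp_text)
    if match is None:
        return None
    # float() does not understand the Unicode minus sign, so normalize it first
    return float(match.group(1).replace("\u2212", "-"))


print(parse_celsius("50.4\xa0°C (122.7\xa0°F)\n"))
```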

### E. Coordinates scraping 
***

**Keep only the URL suffixes that point to an existing page (those starting with `/wiki`)**

In [None]:
url_suffixes = df.url_suffix.tolist()

url_suffixes_page_exist = []
for url_suffix in url_suffixes:
    # Keep article links only; a missing link or any other kind of URL becomes None
    if url_suffix is not None and url_suffix.startswith("/wiki"):
        url_suffixes_page_exist.append(url_suffix)
    else:
        url_suffixes_page_exist.append(None)
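
As a small illustration of how a suffix becomes a full address, the sketch below combines the same filter with `urllib.parse.urljoin` from the standard library. The helper name `to_absolute_url` and the constant `WIKIPEDIA_BASE` are assumptions for this example, and the check uses `/wiki/` with a trailing slash, which is slightly stricter than the filter in the notebook.

```python
from urllib.parse import urljoin

WIKIPEDIA_BASE = "https://en.wikipedia.org"


def to_absolute_url(url_suffix):
    """Return the full article URL, or None when the suffix is not an article link."""
    if url_suffix is None or not url_suffix.startswith("/wiki/"):
        return None
    return urljoin(WIKIPEDIA_BASE, url_suffix)


print(to_absolute_url("/wiki/Agadir"))
```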

In [1]:
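import time

url_prefix = "https://en.wikipedia.org"

coordinates = []
for url_suffix in url_suffixes_page_exist:
    print(url_suffix)
    lat, long = None, None

    # No request (and no pause) when there is no page to visit
    if url_suffix is None:
        coordinates.append((lat, long))
        continue

    try:
        response_location = requests.get(url_prefix + url_suffix, timeout=10)
        coords = BeautifulSoup(response_location.text, "html.parser").find(id="coordinates")
        if coords is not None:
            lat = coords.find("span", class_="latitude").text
            long = coords.find("span", class_="longitude").text
    except (requests.RequestException, AttributeError):
        # Network failure, or a coordinates block without latitude/longitude spans
        lat, long = None, None

    coordinates.append((lat, long))
    time.sleep(1)  # be polite: at most one request per second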

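
Several rows may link to the same location page, and each request costs at least one second. The sketch below shows one possible way to avoid repeated requests: results are cached per suffix, and the output stays aligned with the input list. The function `scrape_with_cache` and its `fetch` parameter are hypothetical. `fetch` stands for the request-and-parse step of the loop, and a fake version with placeholder values is used here so the example runs offline.

```python
import time


def scrape_with_cache(url_suffixes, fetch, delay_seconds=1.0):
    """Call `fetch` once per distinct suffix and return results aligned with the input."""
    cache = {}
    results = []
    for url_suffix in url_suffixes:
        if url_suffix is None:
            results.append((None, None))
            continue
        if url_suffix not in cache:
            cache[url_suffix] = fetch(url_suffix)
            time.sleep(delay_seconds)  # pause only after a real request
        results.append(cache[url_suffix])
    return results


def fake_fetch(url_suffix):
    """Offline stand-in for the real request; returns placeholder coordinates."""
    return ("31°57′N", "5°19′E")


print(scrape_with_cache(["/wiki/Example", None, "/wiki/Example"], fake_fetch, delay_seconds=0))
```

Passing the fetching step as a parameter also makes the loop easy to test without touching the network.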

In [102]:
# One (lat, long) pair per row, so the list stays aligned with our DataFrame
assert len(coordinates) == len(df)

df["coordinates_str"] = coordinates

# df.to_pickle("data/df_with_coordinates_str.pk")
# df.to_csv("data/df_with_coordinates_str.csv")

In [85]:
# Kudos to ChatGPT 
# Explain from showing onto the page

### F. Ask ChatGPT to finish the job
***
- The scraped coordinates are strings in degrees/minutes/seconds (DMS) notation, such as `31°57′N`; plotting requires decimal degrees
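
The conversion itself is plain arithmetic: one minute is 1/60 of a degree, one second is 1/3600 of a degree, and South and West are negative by convention. Here is a minimal worked example on already-separated components; the function name `to_decimal_degrees` is an assumption, and parsing the string is a separate step.

```python
def to_decimal_degrees(degrees, minutes=0.0, seconds=0.0, direction="N"):
    """Combine DMS components into signed decimal degrees."""
    decimal = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention
    return -decimal if direction in ("S", "W") else decimal


latitude = to_decimal_degrees(31, 57, direction="N")   # 31 + 57/60
longitude = to_decimal_degrees(5, 19, direction="E")   # 5 + 19/60
print(round(latitude, 4), round(longitude, 4))
```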


In [174]:
# df = pd.read_pickle("data/df_with_coordinates_str.pk")
# df

In [175]:
### For asking nicely to ChatGPT : 
### print("\n".join(df.coordinates_str.astype(str).tolist()))

In [163]:
import re

# Degrees, then optional minutes and optional seconds (both may be fractional), then N, S, E, or W
DMS_PATTERN = re.compile(r'(\d{1,3})°(?:(\d{1,2}(?:\.\d+)?)′)?(?:(\d{1,2}(?:\.\d+)?)″)?([NSEW])')


def dms_to_decimal(dms):
    """
    Convert DMS (degrees, minutes, seconds) string to decimal degrees. Minutes and seconds are optional and may include decimals.
    """
    match = DMS_PATTERN.search(dms)

    if not match:
        raise ValueError(f"Input does not match DMS format: {dms!r}")

    degrees, minutes, seconds, direction = match.groups()

    # Convert strings to numbers (missing minutes or seconds count as zero)
    degrees = int(degrees)
    minutes = float(minutes) if minutes else 0.0
    seconds = float(seconds) if seconds else 0.0

    # Convert to decimal degrees
    decimal = degrees + minutes / 60 + seconds / 3600

    # Account for direction South or West
    if direction in 'SW':
        decimal = -decimal

    return decimal

In [167]:
df["lat"] = df["coordinates_str"].map(
    lambda lat_long: dms_to_decimal(lat_long[0]) if lat_long[0] is not None else None
)
df["long"] = df["coordinates_str"].map(
    lambda lat_long: dms_to_decimal(lat_long[1]) if lat_long[1] is not None else None
)
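
Before plotting, it can be worth checking that the converted values are plausible. The sketch below is one possible sanity check, with the hypothetical name `is_valid_coordinate`: latitude must lie in [-90, 90] and longitude in [-180, 180]. Missing values are rejected, whether they appear as `None` or as `NaN`.

```python
def is_valid_coordinate(lat, long):
    """Return True when both values are present and within geographic bounds."""
    if lat is None or long is None:
        return False
    # Comparisons with NaN are always False, so NaN values are rejected here too
    return -90 <= lat <= 90 and -180 <= long <= 180


print(is_valid_coordinate(31.95, 5.3167))
print(is_valid_coordinate(None, 5.3167))
```

On the DataFrame, this could be applied row by row, for example with `df.apply(lambda row: is_valid_coordinate(row["lat"], row["long"]), axis=1)`.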

In [169]:
# df.to_pickle("data/df_with_coordinates_cleaned.pk")
# df.to_csv("data/df_with_coordinates_cleaned.csv")

***
***
***
# Draft

**Using pandas (but we lose the color from the style)**

In [144]:
from io import StringIO

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_weather_records"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table", class_="wikitable sortable")

# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(table)), flavor="lxml")[0]


In [33]:
df

| | Country/Region | Temperature | Town/Location | Date |
|---|---|---|---|---|
| 0 | Algeria | 51.3 °C (124.3 °F) | Ouargla, Ouargla Province | 5 July 2018[22] |
| 1 | Botswana | 44.0 °C (111.2 °F) | Maun | 7 January 2016[23][24] |
| 2 | Burkina Faso | 47.2 °C (117.0 °F) | Dori | 1984[25] |
| 3 | Chad | 48.0 °C (118.4 °F) | Faya-Largeau | 25 May 2023[26] |
| 4 | Comoros | 36.0 °C (96.8 °F) | Hahaya International Airport | 15 November 2017[27] |
| ... | ... | ... | ... | ... |
| 141 | French Guiana | 38.0 °C (100.4 °F) | Saint-Laurent-du-Maroni | 27 September 2016[145] |
| 142 | Paraguay | 46.2 °C (115.2 °F) | Las Palmas | 10 December 2022[180] |
| 143 | Peru | 41.6 °C (106.9 °F) | Iñapari | 7 October 2023[181] |
| 144 | Uruguay | 44.0 °C (111.2 °F) | Paysandú, Paysandú Department Florida, Florida... | 20 January 1943[182][183] 14 January 2022[184] |


In [145]:
import re

text = "The coordinates are 31°57′N 5°19′E."
pattern = re.compile(r'(\d{1,3})°(\d{1,2})′([NS]) (\d{1,3})°(\d{1,2})′([EW])')

match = pattern.search(text)
if match:
    print("Found coordinates:", match.group())
else:
    print("No coordinates found.")

Found coordinates: 31°57′N 5°19′E


### No fractional seconds

In [131]:
def dms_to_decimal(dms):
    """
    Convert DMS (degrees, minutes, seconds) string to decimal degrees. Seconds are optional.
    """
    import re
    
    # Match the degrees, minutes, and optional seconds with optional leading zeros and N, S, E, or W
    pattern = re.compile(r'(\d{1,3})°(\d{1,2})′(?:(\d{1,2})″)?([NSEW])')
    match = pattern.search(dms)
    
    if not match:
        print(dms)
        raise ValueError("Input does not match DMS format") 
        # return None
    
    degrees, minutes, seconds, direction = match.groups()
    
    # Convert strings to integers
    degrees = int(degrees)
    minutes = int(minutes)
    seconds = int(seconds) if seconds else 0
    
    # Convert to decimal degrees
    decimal = degrees + minutes / 60 + seconds / 3600
    
    # Account for direction South or West
    if direction in 'SW':
        decimal = -decimal
    
    return decimal


### Fractional seconds, but minutes still required

In [132]:
def dms_to_decimal(dms):
    """
    Convert DMS (degrees, minutes, seconds) string to decimal degrees. Seconds are optional and may include decimals.
    """
    import re
    
    # Match the degrees, minutes, and optional fractional seconds with optional leading zeros and N, S, E, or W
    pattern = re.compile(r'(\d{1,3})°(\d{1,2})′(?:(\d{1,2}(?:\.\d+)?)″)?([NSEW])')
    match = pattern.search(dms)
    
    if not match:
        print(dms)
        raise ValueError("Input does not match DMS format")
    
    degrees, minutes, seconds, direction = match.groups()
    
    # Convert strings to integers or floats
    degrees = int(degrees)
    minutes = int(minutes)
    seconds = float(seconds) if seconds else 0.0
    
    # Convert to decimal degrees
    decimal = degrees + minutes / 60 + seconds / 3600
    
    # Account for direction South or West
    if direction in 'SW':
        decimal = -decimal
    
    return decimal
