## ETLGanz Get Weather
### 1. Objective
Scrape city data from Wikipedia, fetch weather data from OpenWeatherMap, load both into pandas DataFrames, and store them in a SQL database for table queries. An OpenWeatherMap API key is used to fetch up-to-date weather data as needed.



### 2. Global Configuration

In [4]:
# Global Configuration
# Import necessary libraries

import pandas as pd
import numpy as np
import requests as req
import sqlalchemy as sa
import pymysql as pms
import re
import myconfig as cfg

from bs4 import BeautifulSoup
from datetime import datetime

import warnings

warnings.filterwarnings('ignore')

#Default path for data
# path = "..\Data\\"

### 3. Third Party Data
The Wikipedia URLs for the cities are:
```markdown
- [Berlin](https://en.wikipedia.org/wiki/Berlin)
- [Hamburg](https://en.wikipedia.org/wiki/Hamburg)
- [Munich](https://en.wikipedia.org/wiki/Munich)
```
Weather source: http://api.openweathermap.org

In [5]:
url1 ="https://en.wikipedia.org/wiki/Berlin"
url2 ="https://en.wikipedia.org/wiki/Hamburg"
url3 ="https://en.wikipedia.org/wiki/Munich"

### 4. Cities
### 4.1 Initial Checks
##### Parameters
- **Site response**: expect HTTP status code 200; any other code means the request did not succeed.
- **Data Extraction Checks**: for city name, country, latitude, longitude, population etc.

In [3]:
response1 = req.get(url1)
response2 = req.get(url2)
response3 = req.get(url3)

response1.status_code,response2.status_code,response3.status_code

(200, 200, 200)

In [6]:
list_of_cities = ['Berlin', 'Hamburg', 'Munich']

for city in list_of_cities:

    url = "https://en.wikipedia.org/wiki/" + city
    headers = {'Accept-Language': 'en-US,en;q=0.8'}
    response = req.get(url, headers = headers)
    if response.status_code != 200: break
    soup = BeautifulSoup(response.content, "html.parser")



city_names = []
country_names = []
latitudes = []
longitudes = []
populations = []

In [7]:
for city in list_of_cities:

    # Request the English-language Wikipedia page for the city
    url = "https://en.wikipedia.org/wiki/" + city
    headers = {'Accept-Language': 'en-US,en;q=0.8'}
    response = req.get(url, headers = headers)
    if response.status_code != 200: break  # stop at the first failed request
    soup = BeautifulSoup(response.content, "html.parser")
    
    # The page title holds the city name
    city_name = soup.select('span.mw-page-title-main')[0].get_text()
    city_names.append(city_name)
    # The first infobox data cell holds the country
    country_name = soup.select('td.infobox-data')[0].get_text()
    country_names.append(country_name)
    # Coordinates are kept as degree-minute-second strings
    latitude = soup.select('span.latitude')[0].get_text()
    latitudes.append(latitude)
    longitude = soup.select('span.longitude')[0].get_text()
    longitudes.append(longitude)
    # First number in the row that follows the infobox "Population" header
    population = soup.select('th.infobox-header:-soup-contains("Population")')[0].parent.find_next_sibling().find(string=re.compile(r'\d+'))
    populations.append(population)

In [8]:
cities_df = pd.DataFrame(
    {"City": city_names,
     "Country": country_names,
     "Latitude": latitudes,
     "Longitude": longitudes,
     "Population": populations
    }
)

cities_df['Population'] = cities_df['Population'].str.replace(',', '').astype(int)
cities_df

| | City | Country | Latitude | Longitude | Population |
|---|---|---|---|---|---|
| 0 | Berlin | Germany | 52°31′12″N | 13°24′18″E | 3878100 |
| 1 | Hamburg | Germany | 53°33′N | 10°00′E | 1964021 |
| 2 | Munich | Germany | 48°08′15″N | 11°34′30″E | 1510378 |
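
The `.str.replace(',', '').astype(int)` step assumes every scraped value is a clean, comma-separated number. A minimal sketch of a more defensive variant follows; the helper name `parse_population` is hypothetical and not part of the original notebook. It returns `None` for missing or malformed values instead of raising.

```python
import re


def parse_population(raw):
    """Convert a scraped population string such as '3,878,100' to an int.

    Hypothetical helper: returns None when the value is missing or
    contains no digits, so one bad page does not abort the whole run.
    """
    if raw is None:
        return None
    # Take the first run of digits, allowing comma thousands separators
    match = re.search(r"\d[\d,]*", str(raw))
    if match is None:
        return None
    return int(match.group().replace(",", ""))


print(parse_population("3,878,100"))  # 3878100
print(parse_population(None))         # None
```

It could be applied with `cities_df['Population'].map(parse_population)` before the frame is pushed to SQL.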


In [9]:
cities_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   City        3 non-null      object
 1   Country     3 non-null      object
 2   Latitude    3 non-null      object
 3   Longitude   3 non-null      object
 4   Population  3 non-null      int64 
dtypes: int64(1), object(4)
memory usage: 252.0+ bytes
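
`Latitude` and `Longitude` are stored as `object` columns because the scraped values are degree-minute-second strings, some with seconds (`52°31′12″N`) and some without (`53°33′N`). The sketch below shows one way to convert them to decimal degrees. The function name `dms_to_decimal` is an assumption for illustration, and the pattern only covers the formats visible in the table above.

```python
import re

# Matches e.g. 52°31′12″N or 53°33′N (minutes and seconds are optional)
_DMS_PATTERN = re.compile(
    r"(\d+)°(?:(\d+)′)?(?:(\d+(?:\.\d+)?)″)?\s*([NSEW])"
)


def dms_to_decimal(dms):
    """Convert a degree-minute-second string to signed decimal degrees."""
    match = _DMS_PATTERN.fullmatch(dms.strip())
    if match is None:
        raise ValueError(f"Unrecognised coordinate: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes or 0) / 60 + float(seconds or 0) / 3600
    # South and West are negative by convention
    return -value if hemisphere in "SW" else value


print(round(dms_to_decimal("52°31′12″N"), 4))  # 52.52
print(round(dms_to_decimal("10°00′E"), 4))     # 10.0
```

Applied as `cities_df['Latitude'].map(dms_to_decimal)`, this would turn the column into floats that can be compared with the `coord` values returned by the weather API.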


In [10]:
cities_df_final=cities_df.copy()
cities_df_final

| | City | Country | Latitude | Longitude | Population |
|---|---|---|---|---|---|
| 0 | Berlin | Germany | 52°31′12″N | 13°24′18″E | 3878100 |
| 1 | Hamburg | Germany | 53°33′N | 10°00′E | 1964021 |
| 2 | Munich | Germany | 48°08′15″N | 11°34′30″E | 1510378 |


### 4.2 Cities Dataframe Creation
#### 4.2.1 Config & Dataframe
You may also use the config file provided in the GitHub repository to run this (**Option 1**).

In [11]:
# Option 1: Using the Alternate Method to Get Config Data
from safe import safe
safe.get('SERVER_HOST')
safe.get('SERVER_USER')
safe.get('SERVER_PASSWORD')
safe.get('SERVER_PORT')

3306

In [None]:
# Option 2: Without safe.py file
schema = "input"    # Name of the schema in the database
host = "input"      # IP address or host name of the database server
user = "input"      # User name to connect to the database
password = "input"  # Password to connect to the database
port = 3306         # Port to connect to the database (3306 is the MySQL default)

connection_string = f'mysql+pymysql://{user}:{password}@{host}:{port}/{schema}'
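
A password that contains characters such as `@`, `:` or `/` will break a connection string built by plain interpolation, because those characters are read as URL delimiters. The sketch below percent-encodes the credentials first. The function name and the sample values (`etl_user`, `gans`) are made up for illustration.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host, port, schema):
    """Build a mysql+pymysql URL with percent-encoded credentials."""
    # Escape characters that would otherwise be parsed as URL delimiters
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{schema}"
    )


# Hypothetical credentials, shown only to illustrate the escaping
print(build_connection_string("etl_user", "p@ss/word", "127.0.0.1", 3306, "gans"))
# mysql+pymysql://etl_user:p%40ss%2Fword@127.0.0.1:3306/gans
```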

#### 4.2.2 SQL Push-Pull Request

In [11]:
cities_df_final.to_sql('city',
                  if_exists='append',
                  con=connection_string,
                  index=False)

3

In [12]:
cities_from_sql = pd.read_sql("city", con=connection_string)
cities_from_sql

| | City | Country | Latitude | Longitude | Population |
|---|---|---|---|---|---|
| 0 | Berlin | Germany | 52°31′12″N | 13°24′18″E | 3878100 |
| 1 | Hamburg | Germany | 53°33′N | 10°00′E | 1964021 |
| 2 | Munich | Germany | 48°08′15″N | 11°34′30″E | 1510378 |


### 5. Weather
### 5.1 Initial Checks
##### Parameters
- **Data Extraction Checks**: using API from http://api.openweathermap.org
- **Site Info Checks**: status checks, .json() keys data exploration

In [None]:
# API Key Assignment
API_key = "input" # Insert your API Key here
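
Typing the key into a cell makes it easy to commit it by accident. As an alternative sketch, the key can be read from an environment variable. The variable name `OPENWEATHER_API_KEY` and the helper `load_api_key` are assumptions, not something the notebook or OpenWeatherMap defines.

```python
import os


def load_api_key(env_var="OPENWEATHER_API_KEY"):
    """Read the API key from the environment (variable name is hypothetical)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With the variable set in the shell that launches Jupyter, `API_key = load_api_key()` would replace the hard-coded assignment.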
city = "Berlin"

In [13]:
# Berlin_json
berlin = req.get(f"http://api.openweathermap.org/data/2.5/forecast?q={city}&appid={API_key}&units=metric")
berlin_json = berlin.json()
print("response: ", berlin.status_code)

response:  200
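
Interpolating the city name straight into the URL works for `Berlin`, but a name containing spaces or non-ASCII characters should be URL-encoded. The sketch below builds the same forecast URL with the standard library; `build_forecast_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

FORECAST_ENDPOINT = "http://api.openweathermap.org/data/2.5/forecast"


def build_forecast_url(city, api_key, units="metric"):
    """Return the forecast URL with every query parameter URL-encoded."""
    query = urlencode({"q": city, "appid": api_key, "units": units})
    return f"{FORECAST_ENDPOINT}?{query}"


print(build_forecast_url("Frankfurt am Main", "KEY"))
# http://api.openweathermap.org/data/2.5/forecast?q=Frankfurt+am+Main&appid=KEY&units=metric
```

Passing the same dictionary through the `params` argument of `req.get` should give an equivalent result without building the URL by hand.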


In [14]:
# Explore Data
berlin_json.keys()

dict_keys(['cod', 'message', 'cnt', 'list', 'city'])

In [15]:
# Part 1: Dive Deeper in 3 parts
berlin_json['city']

{'id': 2950159,
 'name': 'Berlin',
 'coord': {'lat': 52.5244, 'lon': 13.4105},
 'country': 'DE',
 'population': 1000000,
 'timezone': 3600,
 'sunrise': 1733209073,
 'sunset': 1733237703}
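
`sunrise` and `sunset` are Unix timestamps in UTC, and `timezone` is the city's offset from UTC in seconds. A small sketch of converting them to local time, using the values from the output above (the helper name `to_local_time` is an assumption):

```python
from datetime import datetime, timedelta, timezone


def to_local_time(unix_ts, offset_seconds):
    """Convert a UTC Unix timestamp to a datetime in the city's local offset."""
    local_tz = timezone(timedelta(seconds=offset_seconds))
    return datetime.fromtimestamp(unix_ts, tz=local_tz)


# Values taken from berlin_json['city'] above
print(to_local_time(1733209073, 3600).isoformat())  # 2024-12-03T07:57:53+01:00
print(to_local_time(1733237703, 3600).isoformat())  # 2024-12-03T15:55:03+01:00
```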

In [16]:
# Part 2: Dive Deeper in 3 parts
pd.json_normalize(berlin_json).head()

| | cod | message | cnt | list | city.id | city.name | city.coord.lat | city.coord.lon | city.country | city.population | city.timezone | city.sunrise | city.sunset |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200 | 0 | 40 | [{'dt': 1733216400, 'main': {'temp': 5.51, 'fe... | 2950159 | Berlin | 52.5244 | 13.4105 | DE | 1000000 | 3600 | 1733209073 | 1733237703 |


In [17]:
# Part 3: Dive Deeper in 3 parts
# What I am looking to explore: City, Country, Date_Time, Weather, Temperature, Wind_Speed

#Key Areas
berlin_json['city']
berlin_json['city']['country']
berlin_json['list'][0]['dt_txt']
berlin_json['list'][0]['weather'][0]['description']
berlin_json['list'][0]['main']['temp']
berlin_json['list'][0]['wind']['speed']

2.43

In [18]:
# For Loop Check
list_of_cities = ['Berlin', 'Hamburg', 'Munich']

weather_data = []

for city in list_of_cities:
    response = req.get(f"http://api.openweathermap.org/data/2.5/forecast?q={city}&appid={API_key}&units=metric")
    city_json = response.json()

    for entry in city_json['list']:
        weather_data.append({
            "City": city_json['city']['name'],
            "Country": city_json['city']['country'],
            "Date_Time": entry['dt_txt'],
            "Weather": entry["weather"][0]["description"],
            "Temperature": entry['main']['temp'],
            "Wind_Speed": entry['wind']['speed']
        })

weather_df = pd.DataFrame(weather_data)
weather_df.head()

| | City | Country | Date_Time | Weather | Temperature | Wind_Speed |
|---|---|---|---|---|---|---|
| 0 | Berlin | DE | 2024-12-03 09:00:00 | clear sky | 5.51 | 2.43 |
| 1 | Berlin | DE | 2024-12-03 12:00:00 | few clouds | 6.19 | 3.05 |
| 2 | Berlin | DE | 2024-12-03 15:00:00 | light rain | 5.96 | 2.87 |
| 3 | Berlin | DE | 2024-12-03 18:00:00 | light rain | 5.55 | 5.04 |
| 4 | Berlin | DE | 2024-12-03 21:00:00 | overcast clouds | 5.29 | 2.4 |
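
The loop above indexes `city_json['list']` directly, so a response without that key (for example an error payload for an unknown city) would raise a `KeyError`. The sketch below moves the flattening into a function that returns an empty list in that case. `flatten_forecast` is a hypothetical name, and the sample payload is a hand-trimmed copy of the Berlin response explored earlier.

```python
def flatten_forecast(city_json):
    """Turn one forecast response into a list of flat row dictionaries."""
    city_info = city_json.get("city", {})
    rows = []
    # A payload without 'list' yields no rows instead of raising
    for entry in city_json.get("list", []):
        rows.append({
            "City": city_info.get("name"),
            "Country": city_info.get("country"),
            "Date_Time": entry.get("dt_txt"),
            "Weather": entry["weather"][0]["description"],
            "Temperature": entry["main"]["temp"],
            "Wind_Speed": entry["wind"]["speed"],
        })
    return rows


# Hand-trimmed sample mirroring the structure shown earlier
sample = {
    "city": {"name": "Berlin", "country": "DE"},
    "list": [{
        "dt_txt": "2024-12-03 09:00:00",
        "weather": [{"description": "clear sky"}],
        "main": {"temp": 5.51},
        "wind": {"speed": 2.43},
    }],
}
print(flatten_forecast(sample))
```

Inside the loop, `weather_data.extend(flatten_forecast(city_json))` would then replace the inner `for` block.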


### 5.2 Weather Dataframe Creation
#### 5.2.1 Config & Dataframe
You may also use the config file provided in the GitHub repository to run this (**Option 1**).

In [None]:
# Option 1: Using the Alternate Method to Get Config Data
from safe import safe
safe.get('SERVER_HOST')
safe.get('SERVER_USER')
safe.get('SERVER_PASSWORD')
safe.get('SERVER_PORT')
safe.get('WEATHER_API_KEY')

'<your OpenWeatherMap API key>'

In [None]:
# Option 2: Without safe.py file
schema = "input"    # Insert your schema name
host = "input"      # Insert your host name
user = "input"      # Insert your user name
password = "input"  # Insert your password
port = 3306         # Insert your port number (3306 is the MySQL default)

connection_string = f'mysql+pymysql://{user}:{password}@{host}:{port}/{schema}'

#### 5.2.2 SQL Push-Pull Request

In [20]:
weather_df_final=weather_df.copy()
weather_df_final

| | City | Country | Date_Time | Weather | Temperature | Wind_Speed |
|---|---|---|---|---|---|---|
| 0 | Berlin | DE | 2024-12-03 09:00:00 | clear sky | 5.51 | 2.43 |
| 1 | Berlin | DE | 2024-12-03 12:00:00 | few clouds | 6.19 | 3.05 |
| 2 | Berlin | DE | 2024-12-03 15:00:00 | light rain | 5.96 | 2.87 |
| 3 | Berlin | DE | 2024-12-03 18:00:00 | light rain | 5.55 | 5.04 |
| 4 | Berlin | DE | 2024-12-03 21:00:00 | overcast clouds | 5.29 | 2.40 |
| ... | ... | ... | ... | ... | ... | ... |
| 115 | Munich | DE | 2024-12-07 18:00:00 | light rain | 4.67 | 5.65 |
| 116 | Munich | DE | 2024-12-07 21:00:00 | snow | 2.88 | 6.27 |
| 117 | Munich | DE | 2024-12-08 00:00:00 | light snow | 2.33 | 6.93 |
| 118 | Munich | DE | 2024-12-08 03:00:00 | light snow | 0.96 | 6.56 |


In [21]:
weather_df_final.to_sql('weathers',
                  if_exists='append',
                  con=connection_string,
                  index=False)

120

In [22]:
weather_from_sql = pd.read_sql("weathers", con=connection_string)
weather_from_sql

| | weathers_id | City | Country | Date_Time | Weather | Temperature | Wind_Speed | city_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Berlin | DE | 2024-12-03 09:00:00 | clear sky | 5.51 | 2.43 | |
| 1 | 2 | Berlin | DE | 2024-12-03 12:00:00 | few clouds | 6.19 | 3.05 | |
| 2 | 3 | Berlin | DE | 2024-12-03 15:00:00 | light rain | 5.96 | 2.87 | |
| 3 | 4 | Berlin | DE | 2024-12-03 18:00:00 | light rain | 5.55 | 5.04 | |
| 4 | 5 | Berlin | DE | 2024-12-03 21:00:00 | overcast clouds | 5.29 | 2.40 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 115 | 116 | Munich | DE | 2024-12-07 18:00:00 | light rain | 4.67 | 5.65 | |
| 116 | 117 | Munich | DE | 2024-12-07 21:00:00 | snow | 2.88 | 6.27 | |
| 117 | 118 | Munich | DE | 2024-12-08 00:00:00 | light snow | 2.33 | 6.93 | |
| 118 | 119 | Munich | DE | 2024-12-08 03:00:00 | light snow | 0.96 | 6.56 | |
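
The `city_id` column comes back empty because the DataFrame pushed with `to_sql` has no such column. The sketch below shows one way it could be filled before the push, assuming the `city` table exposes an integer key per city name. The ids and the helper name `attach_city_ids` are made up for illustration, since the actual schema is not shown here.

```python
def attach_city_ids(weather_rows, city_ids):
    """Return copies of the weather rows with a city_id looked up by name."""
    enriched = []
    for row in weather_rows:
        # Unknown cities get None, which is stored as NULL in SQL
        enriched.append({**row, "city_id": city_ids.get(row["City"])})
    return enriched


# Hypothetical ids; in practice they would be read from the city table
city_ids = {"Berlin": 1, "Hamburg": 2, "Munich": 3}
rows = [{"City": "Munich", "Temperature": 0.96}]
print(attach_city_ids(rows, city_ids))
# [{'City': 'Munich', 'Temperature': 0.96, 'city_id': 3}]
```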


### 6. Retrospection
##### 1. Cities
- **Challenges**: 
    - Extracting accurate geographical data from Wikipedia required careful parsing of HTML content.
    - Handling different formats of latitude and longitude data was tricky and required additional string manipulation.
    - Ensuring data consistency and handling missing or malformed data during scraping took extra care.

- **Highlights**:
    - Successfully scraped and compiled city data into a structured DataFrame.
    - Efficiently used BeautifulSoup for web scraping and pandas for data manipulation.
    - Stored the cleaned and processed data into a SQL database for further analysis and querying.

##### 2. Weather
- **Challenges**:
    - Managing API rate limits and ensuring reliable data retrieval from the OpenWeatherMap API.
    - Parsing nested JSON responses to extract relevant weather information.
    - Handling different weather conditions and ensuring the data is stored in a consistent format.
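
One common way to cope with rate limits and transient failures is to retry with exponential backoff. The sketch below is a generic illustration rather than the approach used in this notebook. The helper name `call_with_retries` and its parameters are assumptions, and `sleep` is injectable so the logic can be tried without waiting.

```python
import time


def call_with_retries(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Wait 1x, 2x, 4x ... the base delay between attempts
            sleep(base_delay * 2 ** (attempt - 1))
```

In the weather loop it could wrap the request, for example `call_with_retries(lambda: req.get(url, timeout=10))`, where `url` is the forecast URL for the current city.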

- **Highlights**:
    - Successfully fetched and processed weather data for multiple cities.
    - Created a comprehensive DataFrame with detailed weather information.
    - Integrated the weather data into the SQL database, enabling complex queries and analysis.