## Web Scraping Historical Weather Data with Selenium WebDriver

In this notebook, we collect weather data from https://www.wunderground.com/ for New York City, JFK airport.
We scrape the historical daily data from the "Daily Observations" table, from January 1, 2017 to August 31, 2022.
For that, we use the Selenium library with a WebDriver (chromedriver).



First, we need to install the Selenium library; see the Selenium documentation at https://pypi.org/project/selenium/ and https://github.com/SeleniumHQ/selenium/

Selenium requires a driver to interact with the Chrome browser: install chromedriver.exe with the same version as your browser from https://chromedriver.chromium.org/downloads . We use Chrome, but other browsers can also be used; see https://github.com/SeleniumHQ/selenium/tree/trunk/py


In [3]:
# Import the basic packages we will use
import numpy as np
import pandas as pd
from datetime import date, timedelta, datetime



In [4]:
# Import the Selenium package
# Selenium documentation: https://www.selenium.dev/selenium/docs/api/py/api.html


from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait 
import selenium.webdriver.support.expected_conditions as EC


In [5]:
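# Function: iterate through the cells of one column of the "Daily Observations" table
# and return the text of the element inside each cell.
# Arguments:
#   driver    : webdriver.Chrome instance
#   tdclass   : XPath of the cells (td) of one column of the "Daily Observations" table
#   spanClass : XPath, relative to each cell, of the span that holds the value

def scraping_rows_element(driver, tdclass, spanClass):
    elements_text = []
    # Wait up to 10 seconds for the cells of the column to become visible
    elements = WebDriverWait(driver, 10).until(
        EC.visibility_of_all_elements_located((By.XPATH, tdclass)))
    for element in elements:
        # find_element(By.XPATH, ...) replaces find_element_by_xpath,
        # which is no longer available in recent Selenium 4 releases
        elements_text.append(element.find_element(By.XPATH, spanClass).text)
    return elements_text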
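
The column XPaths passed to `scraping_rows_element` later in this notebook differ only in the column name embedded in the class attribute. As a minimal sketch, a hypothetical helper `column_xpath` (not part of the original notebook) could build them from that name, assuming the class pattern used in this notebook still matches the page:

```python
# Hypothetical helper, not part of the original notebook.
def column_xpath(column):
    """Return the XPath of the <td> cells for one column of "Daily Observations"."""
    return ('//td[@class="mat-cell cdk-cell '
            f'cdk-column-{column} mat-column-{column} ng-star-inserted"]')

# Column names as they appear in the class attributes used in this notebook
for name in ["dateString", "temperature", "dewPoint", "humidity"]:
    print(column_xpath(name))
```

Matching on the full class string is brittle: if the site adds or reorders a class, the XPath stops matching, so the pattern may need adjusting over time.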

In [None]:


# Create empty DataFrames
# df holds the data of a single day
df = pd.DataFrame(columns = ["Day", 'time', 'Temperature_F','Dew_Point_F', 'Humidity_%','Wind','Wind_Speed_mph',
                            'Wind_Gust_mph','Pressure_in' , 'Precip_in', 'Condition' ])

# df2 holds all the data: each daily df is appended to it
df2 = pd.DataFrame(columns = ["Day", 'time', 'Temperature_F','Dew_Point_F', 'Humidity_%','Wind','Wind_Speed_mph',
                            'Wind_Gust_mph','Pressure_in' , 'Precip_in', 'Condition' ])






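options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window

driver_path = 'C:/Users/olandoul/Downloads/chromedriver'
# Selenium 4 takes the driver path through a Service object
# (the executable_path argument is no longer accepted in recent Selenium 4 releases)
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(driver_path), options=options)

# Explicit wait with a 10-second timeout
wait = WebDriverWait(driver, 10)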





# Date range to scrape
Start_date = datetime(2022,1,1)
End_Date = datetime(2022,1,3)
Curr_Date = Start_date

# URL of www.wunderground.com for New York City, JFK airport (station KJFK)
URL = 'https://www.wunderground.com/history/daily/us/ny/new-york-city/KJFK/date/{}-{}-{}.html'

while (Curr_Date <= End_Date):

    updated_URL = URL.format(Curr_Date.year, Curr_Date.month, Curr_Date.day)

    driver.get(updated_URL)

    print('Gathering weather data for KJFK from wunderground for:', Curr_Date)
    
    
    # time 
    times =  scraping_rows_element(driver,'//td[@class="mat-cell cdk-cell cdk-column-dateString mat-column-dateString ng-star-inserted"]','.//span[@class="ng-star-inserted"]')      
    df['time']= times
    df['Day'] = Curr_Date.date() 
    
    # Temperature
    Temperatures = scraping_rows_element(driver, '//td[@class="mat-cell cdk-cell cdk-column-temperature mat-column-temperature ng-star-inserted"]','.//span[@class="wu-value wu-value-to"]')
    df['Temperature_F'] = Temperatures
   
    #  Dew Point
    Dew_Points = scraping_rows_element(driver,'//td[@class="mat-cell cdk-cell cdk-column-dewPoint mat-column-dewPoint ng-star-inserted"]','.//span[@class="wu-value wu-value-to"]')  
    df['Dew_Point_F'] =Dew_Points
    
    #  Humidities
    Humidities =  scraping_rows_element(driver,'//td[@class="mat-cell cdk-cell cdk-column-humidity mat-column-humidity ng-star-inserted"]','.//span[@class="wu-value wu-value-to"]')
    df['Humidity_%'] = Humidities
    
    
    
    # Winds
    Winds = scraping_rows_element(driver, '//td[@class="mat-cell cdk-cell cdk-column-windcardinal mat-column-windcardinal ng-star-inserted"]','.//span[@class="ng-star-inserted"]')   
    df['Wind'] =Winds
    
    #  Wind_Speeds
    Wind_Speeds = scraping_rows_element(driver,'//td[@class="mat-cell cdk-cell cdk-column-windSpeed mat-column-windSpeed ng-star-inserted"]','.//span[@class="wu-value wu-value-to"]') 
    df['Wind_Speed_mph'] =Wind_Speeds
    
    #  Wind_Gusts
    Wind_Gusts= scraping_rows_element(driver,'//td[@class="mat-cell cdk-cell cdk-column-windGust mat-column-windGust ng-star-inserted"]' ,'.//span[@class="wu-value wu-value-to"]')  
    df['Wind_Gust_mph'] =Wind_Gusts 
    
    # Pressure
    Pressures = scraping_rows_element(driver,'//td[@class="mat-cell cdk-cell cdk-column-pressure mat-column-pressure ng-star-inserted"]' , './/span[@class="wu-value wu-value-to"]')   
    df['Pressure_in'] =   Pressures
    
    # precipitations
    precipitations = scraping_rows_element(driver,'//td[@class="mat-cell cdk-cell cdk-column-precipRate mat-column-precipRate ng-star-inserted"]','.//span[@class="wu-value wu-value-to"]')   
    df['Precip_in'] =precipitations
    
    # Conditions
    Conditions = scraping_rows_element(driver,'//td[@class="mat-cell cdk-cell cdk-column-condition mat-column-condition ng-star-inserted"]','.//span[@class="ng-star-inserted"]')  
    df['Condition'] =  Conditions
    
    
    
    
    
    
    
    
    
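    # Append this day's rows to the full DataFrame
    # (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
    df2 = pd.concat([df2, df], ignore_index=True)
    # Reset the daily DataFrame for the next date
    df = pd.DataFrame(columns = ["Day", 'time', 'Temperature_F','Dew_Point_F', 'Humidity_%','Wind','Wind_Speed_mph',
                            'Wind_Gust_mph','Pressure_in' , 'Precip_in', 'Condition' ])

    Curr_Date = Curr_Date + timedelta(days=1)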
    
    
    
      
   
# Close the browser once all dates have been scraped
driver.quit()

# Save the data
#df2.to_csv(path_or_buf ='Path/...../Weather-Data-2022/data-January-2022.csv', index = False)

# 
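
The loop above fills the URL template with the year, month and day of each date in the range. As a rough sketch of that step in isolation, the hypothetical generator `daily_urls` below (an illustrative name, not part of the original notebook) yields one URL per day using only the standard library, which makes it easy to check the URLs before starting the browser:

```python
from datetime import date, timedelta

# Same URL template as in the notebook
URL = 'https://www.wunderground.com/history/daily/us/ny/new-york-city/KJFK/date/{}-{}-{}.html'

# Hypothetical helper, not part of the original notebook.
def daily_urls(start, end):
    """Yield (day, url) for every day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day, URL.format(day.year, day.month, day.day)
        day += timedelta(days=1)

urls = list(daily_urls(date(2022, 1, 1), date(2022, 1, 3)))
for day, url in urls:
    print(day, url)
```

Month and day are inserted without zero padding (for example `2022-1-1`), which is the form the notebook's own loop produces.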

In [7]:
df2 

Unnamed: 0,Day,time,Temperature_F,Dew_Point_F,Humidity_%,Wind,Wind_Speed_mph,Wind_Gust_mph,Pressure_in,Precip_in,Condition
0,2022-01-01,12:51 AM,49,48,97,S,8,0,29.91,0.0,Cloudy
1,2022-01-01,1:18 AM,48,48,100,S,6,0,29.91,0.0,Cloudy
2,2022-01-01,1:51 AM,49,49,100,SSE,5,0,29.91,0.0,Fog
3,2022-01-01,2:09 AM,49,48,97,SE,3,0,29.90,0.0,Cloudy
4,2022-01-01,2:51 AM,49,48,97,SE,6,0,29.90,0.0,Cloudy
...,...,...,...,...,...,...,...,...,...,...,...
99,2022-01-03,7:51 PM,26,3,37,NNW,24,30,30.16,0.0,Mostly Cloudy / Windy
100,2022-01-03,8:51 PM,25,2,37,N,28,36,30.18,0.0,Partly Cloudy / Windy
101,2022-01-03,9:51 PM,24,8,51,N,17,30,30.21,0.0,Fair
102,2022-01-03,10:51 PM,23,11,60,N,17,28,30.22,0.0,Fair
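
The `Day` and `time` columns are stored separately, and `time` is scraped as 12-hour text such as `12:51 AM`. For time-series work it can help to combine them into a single timestamp. A minimal sketch, assuming the formats shown in the output above and an English locale for the AM/PM marker; `to_timestamp` is a hypothetical helper, not part of the original notebook:

```python
from datetime import datetime

# Hypothetical helper, not part of the original notebook.
def to_timestamp(day, time_text):
    """Combine a 'YYYY-MM-DD' day and a 12-hour time such as '12:51 AM'."""
    return datetime.strptime(f"{day} {time_text}", "%Y-%m-%d %I:%M %p")

ts = to_timestamp("2022-01-01", "12:51 AM")
print(ts)
```

Applied to the DataFrame, the same function could be called row by row on the `Day` and `time` columns.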
