# Crawling (First Dataset):

##### In this project, we explore global volcanoes with eruptions during the Holocene period (approximately the last 10,000 years).
##### We use scraping and crawling to collect the data from the [Global Volcanism Program website](https://volcano.si.edu/).
##### The Smithsonian Institution's Global Volcanism Program (GVP) is housed in the Department of Mineral Sciences, National Museum of Natural History, in Washington, D.C.

### Import modules:

In [1]:
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
from tqdm import tqdm
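
The cells shown here import `time` but never call it. One common use in a crawler is to pause between requests so the target site is not hit in a tight loop. The sketch below assumes that intent; `fetch_all`, its `fetch` argument, and `delay_seconds` are hypothetical names that do not appear in the notebook.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results

# Stand-in fetch so the sketch runs offline; a real crawl might pass
# a wrapper around requests.get instead.
pages = fetch_all(["page-a", "page-b"], fetch=str.upper, delay_seconds=0)
print(pages)
```

Passing the fetch function as an argument keeps the delay logic separate from the HTTP call, which also makes it easy to try out without network access.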

### Crawling

##### The get_eruptions_data function crawls a webpage specified by a given URL to extract eruption data. It uses the requests library to make an HTTP GET request and BeautifulSoup to parse the HTML content.

##### The function initializes lists to store eruption attributes. It searches for HTML elements with the class name EruptionAccordionHeader and splits each header's text into tokens to extract the VEI, the start date, whether the eruption is still continuing, and the eruption status.

##### The extracted data is appended to the respective lists. If no elements are found, a single placeholder row is added with None for the VEI, start date, and eruption status, and "0" for the currently-erupting flag.

##### Finally, the function creates a dictionary with the extracted data and returns it.

In [2]:
def get_eruptions_data(url1):
    """Collect one entry per eruption listed on the volcano page at url1."""
    vei = []
    start_date = []
    currently_erupting = []
    eruption_status = []
    all_p = []

    response1 = requests.get(url1)
    if response1.ok:
        soup1 = BeautifulSoup(response1.content, "html.parser")
        all_p = soup1.find_all(class_='EruptionAccordionHeader')

    for v in all_p:
        tokens = v.text.split()

        # The 8th token is '(continuing)' when the eruption is still ongoing
        if len(tokens) >= 8 and tokens[7] == '(continuing)':
            currently_erupting.append("1")
        else:
            currently_erupting.append("0")

        date = tokens[0]  # start year
        if tokens[1] == 'BCE':  # store BCE years as negative numbers
            date = f"-{date}"

        confirm = tokens[-2]  # eruption status
        if confirm == 'VEI:':  # header ends with a VEI value, so the status sits further left
            confirm = tokens[-5]
        if confirm == 'Uncertain':
            date = tokens[1]
        if confirm == 'Discredited':
            date = None

        vei1 = tokens[-1]  # VEI
        if vei1 == 'Eruption':  # header carries no VEI value
            vei1 = None

        start_date.append(date)
        eruption_status.append(confirm)
        vei.append(vei1)

    # No eruptions listed (or the request failed): keep one placeholder row
    if len(all_p) == 0:
        start_date.append(None)
        eruption_status.append(None)
        vei.append(None)
        currently_erupting.append("0")

    i = max(len(all_p), 1)  # number of rows produced for this volcano
    df = {"i": i, "vei": vei, "start_date": start_date,
          "currently_erupting": currently_erupting,
          "eruption_status": eruption_status}
    return df
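
The token positions used by get_eruptions_data are easier to follow on a plain string. The sketch below applies the same rules without any HTTP request. The sample header strings are assumptions, written only to match the token positions the function relies on; the real header text on the site may differ. `parse_header` is a hypothetical helper, not part of the notebook.

```python
def parse_header(text):
    """Apply the notebook's token rules to one eruption header string."""
    tokens = text.split()
    continuing = len(tokens) >= 8 and tokens[7] == "(continuing)"

    date = tokens[0]
    if tokens[1] == "BCE":          # BCE years are stored as negative numbers
        date = f"-{date}"

    status = tokens[-2]
    if status == "VEI:":            # a trailing VEI value shifts the status left
        status = tokens[-5]
    if status == "Uncertain":
        date = tokens[1]
    if status == "Discredited":
        date = None

    vei = tokens[-1]
    if vei == "Eruption":           # no VEI value in this header
        vei = None

    return {"start_date": date, "status": status, "vei": vei,
            "currently_erupting": "1" if continuing else "0"}

# Assumed sample headers (not copied from the site)
samples = [
    "1972 Nov 12 - 1972 Dec 16 Confirmed Eruption Max VEI: 1",
    "6850 BCE (?) Confirmed Eruption",
    "2021 Sep 29 - 2023 Jun 8 (continuing) Confirmed Eruption Max VEI: 2",
]
for sample in samples:
    print(parse_header(sample))
```

Because the rules depend on fixed token positions, any change in the header wording on the site would silently shift the extracted values, so spot-checking a few parsed rows against the page is worthwhile.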

##### The get_volcano_data function crawls a volcano page specified by a URL. It retrieves the eruption data with the get_eruptions_data function and reads the volcano-level attributes (volcano number, latitude, longitude, summit elevation, and country) from list items on the page.

##### Each volcano-level attribute is appended once per eruption, so every returned list has one entry per eruption row.

In [3]:
def get_volcano_data(url1):
    """Collect volcano-level fields and repeat them once per eruption."""
    i = 0
    volcano_numbers = []
    volcano_countries = []
    latitude = []
    longitude = []
    summit_elevation = []
    vei1 = []
    start_date1 = []
    currently_erupting1 = []
    eruption_status1 = []

    response1 = requests.get(url1)
    if response1.ok:
        soup1 = BeautifulSoup(response1.content, "html.parser")
        df = get_eruptions_data(url1)  # requests the same page again to read its eruption list
        i, vei, start_date, currently_erupting, eruption_status = df.values()
        vei1.extend(vei)
        start_date1.extend(start_date)
        currently_erupting1.extend(currently_erupting)
        eruption_status1.extend(eruption_status)

        temp = [t.text for t in soup1.find_all("li", class_="clear")]
        temp2 = [tt.text for tt in soup1.find_all("li", class_="shaded")]
        # Repeat the volcano-level fields once per eruption row
        for j in range(i):
            volcano_numbers.append(temp[5])
            latitude.append(temp[0])
            longitude.append(temp[1])
            summit_elevation.append(temp[3].split(' m')[0])
            volcano_countries.append(temp2[0])

    # If the request failed, i stays 0 and every list stays empty
    df2 = {"i": i, "volcano_numbers": volcano_numbers, "latitude": latitude,
           "longitude": longitude, "summit_elevation": summit_elevation,
           "volcano_countries": volcano_countries, "vei1": vei1, "start_date1": start_date1,
           "currently_erupting1": currently_erupting1, "eruption_status1": eruption_status1}
    return df2
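
The effect of repeating the volcano-level attributes can be seen on a small in-memory example. This is a minimal sketch using dictionaries instead of the notebook's parallel lists; `expand_rows` and the dictionary keys are hypothetical names, and the sample values are taken from the Acatenango rows in the output further down.

```python
def expand_rows(volcano, eruptions):
    """Return one flat row per eruption, each carrying the volcano-level fields."""
    # A volcano with no listed eruptions still gets one placeholder row
    if not eruptions:
        eruptions = [{"vei": None, "start_date": None, "status": None}]
    return [{**volcano, **eruption} for eruption in eruptions]

volcano = {"name": "Acatenango", "number": "342080", "country": "Guatemala"}
eruptions = [
    {"vei": "1", "start_date": "1972", "status": "Confirmed"},
    {"vei": "2", "start_date": "1926", "status": "Confirmed"},
]

rows = expand_rows(volcano, eruptions)
for row in rows:
    print(row)
```

This is why the resulting table contains several rows for the same volcano: each row is one eruption, with the volcano's own fields copied alongside it.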
            

##### The next cell is a script rather than a function. It requests the Holocene volcano list page and uses BeautifulSoup to parse the results table.

##### It builds a link to each individual volcano page from the table and initializes lists to store the extracted attributes of the volcanoes.

##### A loop iterates over the links to the volcano pages. For each page, it retrieves the data with the get_volcano_data function and appends the returned values to the respective lists.

##### The volcano's name, region, and type, taken from the matching row of the list table, are repeated once per eruption so that every column has the same length.

##### Finally, the script creates a pandas DataFrame from the collected data, prints it, and saves it as a CSV file.
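
The link-building step can be sketched with the standard library's `urljoin`, which is an alternative to plain string concatenation rather than what the cell below uses. The two `href` values are assumptions about what the table might contain (one without and one with a leading slash); the volcano numbers come from the commented-out test list in the cell.

```python
from urllib.parse import urljoin

BASE_URL = "https://volcano.si.edu/"

# Assumed href values; the real table may use a different form
hrefs = ["volcano.cfm?vn=283001", "/volcano.cfm?vn=261140"]

links = [urljoin(BASE_URL, href) for href in hrefs]
for link in links:
    print(link)
```

Unlike concatenation, `urljoin` does not produce a double slash when the href already starts with `/`, and it leaves absolute URLs untouched.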

In [4]:
url_main = "https://volcano.si.edu/volcanolist_holocene.cfm"
response_main = requests.get(url_main)
# print(response_main.ok)

soup_main = BeautifulSoup(response_main.content, "html.parser")
mtag = soup_main.find("div", attrs= {"class" : "TableSearchResults"})

linksToPages = ["https://volcano.si.edu/" + t['href'] for t in mtag.find_all("a")]
#linksToPages = ['https://volcano.si.edu/volcano.cfm?vn=283001', 'https://volcano.si.edu/volcano.cfm?vn=261140']

list_temp = [t.text.split(sep='\n') for t in mtag.find_all("tr")[1:]]


volcano_names = []
volcano_regions = []
volcano_types = []
volcano_numbers1 = []
volcano_countries1 = []
latitude1 = []
longitude1 = []
summit_elevation1 = []
vei2 = []
start_date2 = []
currently_erupting2 = []
eruption_status2 = []
    

for r in range(len(linksToPages)):
    my_df = get_volcano_data(linksToPages[r])
    # Parentheses let the unpacking target span two lines
    (j, volcano_number, latitude, longitude, summit_elevation, volcano_country,
     vei, start_date, currently_erupting, eruption_status) = my_df.values()
    volcano_numbers1.extend(volcano_number)
    latitude1.extend(latitude)
    longitude1.extend(longitude)
    summit_elevation1.extend(summit_elevation)
    volcano_countries1.extend(volcano_country)
    vei2.extend(vei)
    start_date2.extend(start_date)
    currently_erupting2.extend(currently_erupting)
    eruption_status2.extend(eruption_status)

    # Repeat the name, region, and type from the list table once per eruption row (j rows)
    volcano_names.extend([list_temp[r][1]] * j)
    volcano_regions.extend([list_temp[r][2]] * j)
    volcano_types.extend([list_temp[r][3]] * j)
                
data_final = pd.DataFrame({"Volcano name" : volcano_names,
                           "Volcano number" : volcano_numbers1,
                           "Volcano country" : volcano_countries1,
                           "Volcano region" : volcano_regions,
                           "Volcano type" : volcano_types,
                           "Latitude" : latitude1,
                           "Longitude" : longitude1,
                           "Summit elevation" : summit_elevation1,
                           "VEI" : vei2,
                           "Start date" : start_date2,
                           "Currently erupting (Y/N)" : currently_erupting2,
                           "Eruption status" : eruption_status2})
print(data_final)
data_final.to_csv('volcano_crawling_df.csv', index=True)


       Volcano name Volcano number Volcano country  \
0               Abu         283001           Japan   
1        Acamarachi         355096           Chile   
2        Acatenango         342080       Guatemala   
3        Acatenango         342080       Guatemala   
4        Acatenango         342080       Guatemala   
...             ...            ...             ...   
11521  Zubair Group         221020           Yemen   
11522  Zubair Group         221020           Yemen   
11523         Zukur         221021           Yemen   
11524  Zuni-Bandera         327120   United States   
11525  Zuni-Bandera         327120   United States   

                              Volcano region       Volcano type  Latitude  \
0                                     Honshu          Shield(s)    34.5°N   
1      Northern Chile, Bolivia and Argentina      Stratovolcano  23.292°S   
2                                  Guatemala  Stratovolcano(es)  14.501°N   
3                                  Guatemal

##### First few rows of the DataFrame data_final, providing a quick overview of the extracted volcano-related data in a tabular format.

In [5]:
data_final.head()

| Unnamed: 0 | Volcano name | Volcano number | Volcano country | Volcano region | Volcano type | Latitude | Longitude | Summit elevation | VEI | Start date | Currently erupting (Y/N) | Eruption status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abu | 283001 | Japan | Honshu | Shield(s) | 34.5°N | 131.6°E | 641 |  | -6850.0 | 0 | Confirmed |
| 1 | Acamarachi | 355096 | Chile | Northern Chile, Bolivia and Argentina | Stratovolcano | 23.292°S | 67.618°W | 6023 |  |  | 0 |  |
| 2 | Acatenango | 342080 | Guatemala | Guatemala | Stratovolcano(es) | 14.501°N | 90.876°W | 3976 | 1.0 | 1972.0 | 0 | Confirmed |
| 3 | Acatenango | 342080 | Guatemala | Guatemala | Stratovolcano(es) | 14.501°N | 90.876°W | 3976 | 2.0 | 1926.0 | 0 | Confirmed |
| 4 | Acatenango | 342080 | Guatemala | Guatemala | Stratovolcano(es) | 14.501°N | 90.876°W | 3976 | 3.0 | 1924.0 | 0 | Confirmed |


In [10]:
data_final

| Unnamed: 0 | Volcano name | Volcano number | Volcano country | Volcano region | Volcano type | Latitude | Longitude | Summit elevation | VEI | Start date | Currently erupting (Y/N) | Eruption status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abu | 283001 | Japan | Honshu | Shield(s) | 34.5°N | 131.6°E | 641 |  | -6850 | 0 | Confirmed |
| 1 | Acamarachi | 355096 | Chile | Northern Chile, Bolivia and Argentina | Stratovolcano | 23.292°S | 67.618°W | 6023 |  |  | 0 |  |
| 2 | Acatenango | 342080 | Guatemala | Guatemala | Stratovolcano(es) | 14.501°N | 90.876°W | 3976 | 1 | 1972 | 0 | Confirmed |
| 3 | Acatenango | 342080 | Guatemala | Guatemala | Stratovolcano(es) | 14.501°N | 90.876°W | 3976 | 2 | 1926 | 0 | Confirmed |
| 4 | Acatenango | 342080 | Guatemala | Guatemala | Stratovolcano(es) | 14.501°N | 90.876°W | 3976 | 3 | 1924 | 0 | Confirmed |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11521 | Zubair Group | 221020 | Yemen | Africa (northeastern) and Red Sea | Shield | 15.05°N | 42.18°E | 191 |  | 1846 | 0 | Uncertain |
| 11522 | Zubair Group | 221020 | Yemen | Africa (northeastern) and Red Sea | Shield | 15.05°N | 42.18°E | 191 | 2 | 1824 | 0 | Confirmed |
| 11523 | Zukur | 221021 | Yemen | Africa (northeastern) and Red Sea | Shield | 14.02°N | 42.75°E | 624 |  |  | 0 |  |
| 11524 | Zuni-Bandera | 327120 | United States | USA (New Mexico) | Volcanic field | 34.8°N | 108°W | 2550 | 0 | -1170 | 0 | Confirmed |


##### Explore the data and gather information in the following cells:

In [7]:
data_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11526 entries, 0 to 11525
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Volcano name              11526 non-null  object
 1   Volcano number            11526 non-null  object
 2   Volcano country           11526 non-null  object
 3   Volcano region            11526 non-null  object
 4   Volcano type              11526 non-null  object
 5   Latitude                  11526 non-null  object
 6   Longitude                 11526 non-null  object
 7   Summit elevation          11526 non-null  object
 8   VEI                       7627 non-null   object
 9   Start date                10948 non-null  object
 10  Currently erupting (Y/N)  11526 non-null  object
 11  Eruption status           11117 non-null  object
dtypes: object(12)
memory usage: 1.1+ MB
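
The `info()` output shows every column stored as `object`, including coordinates such as `23.292°S`. If numeric coordinates are needed later, one possible conversion is sketched below. `parse_coordinate` is a hypothetical helper, and the sketch assumes every value has the form number, degree sign, hemisphere letter, as in the rows shown above.

```python
def parse_coordinate(text):
    """Convert a string such as '23.292°S' to a signed float (-23.292)."""
    value, hemisphere = text.strip().split("°")
    # South and west are negative by the usual convention
    sign = -1.0 if hemisphere.strip().upper() in ("S", "W") else 1.0
    return sign * float(value)

for raw in ["34.5°N", "23.292°S", "131.6°E", "108°W"]:
    print(raw, "->", parse_coordinate(raw))
```

With pandas, a helper like this could be applied to the Latitude and Longitude columns with `Series.map`; values that do not fit the assumed form would raise an error rather than pass through silently.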


In [9]:
data_final.describe()

| Unnamed: 0 | Volcano name | Volcano number | Volcano country | Volcano region | Volcano type | Latitude | Longitude | Summit elevation | VEI | Start date | Currently erupting (Y/N) | Eruption status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11526 | 11526 | 11526 | 11526 | 11526 | 11526 | 11526 | 11526 | 7627 | 10948 | 11526 | 11117 |
| unique | 1307 | 1324 | 93 | 105 | 32 | 1286 | 1296 | 1108 | 9 | 1635 | 2 | 4 |
| top | Fournaise, Piton de la | 233020 | Japan | Kamchatka Peninsula | Stratovolcano | 21.244°S | 55.708°E | 2632 | 2 | 2004 | 0 | Confirmed |
| freq | 202 | 202 | 1746 | 798 | 5429 | 202 | 202 | 202 | 3725 | 54 | 11489 | 9831 |
