# Scraping Seismic SEGY Datasets on U.S. Geological Survey (USGS)


### Sharing Geosciences Data for a changing world.

## Introduction

### Problem Statement:
Datasets are essential for scientific research, especially for geoscientists who code. A massive amount of open data is available online. However, it is often challenging for students and researchers to find and access it, mainly because data discoverability is poor, documentation is sometimes lacking, and licences can be unclear. With this project, I hope to contribute to solving these problems by exploring all the seismic surveys available on the USGS website and providing 600 seismic surveys in SEGY format, ready to download with all the necessary navigation metadata.

### Example of Seismic line 

![Example of Seismic line.png](https://i.imgur.com/gdbNa4V.png)

### Example of 3D seismic cube

![Example of 3D seismic cube](https://i.imgur.com/0Wl1vJN.gif)

### Overview:
The page https://walrus.wr.usgs.gov/namss/ provides a map with search filters that lists 600 seismic surveys. In this project, we will retrieve survey information and seismic zip file links from this page using web scraping: the process of extracting information from a website in an automated fashion using code. We will use the Python libraries `Requests` (https://pypi.org/project/requests/) and `Beautiful Soup` (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to scrape data from this page.

### Seismic data:
Seismic data is essentially a large ultrasound of the subsurface. Geophysical imaging (also known as geophysical tomography) is a minimally destructive geophysical technique that investigates the subsurface of a terrestrial planet. It is a noninvasive imaging technique with high parametric and spatio-temporal resolution. Geophysical imaging has evolved over the last 30 years due to advances in computing power and speed. It can be used to model a surface or object under study in 2D or 3D, as well as to monitor changes.
For more info refer to the link: https://en.wikipedia.org/wiki/Geophysical_imaging

### USGS:
USGS is the sole science agency for the Department of the Interior. It is sought out by thousands of partners and customers for its natural science expertise and its vast earth and biological data holdings. As the Nation's largest water, earth, and biological science and civilian mapping agency, USGS collects, monitors, analyzes, and provides science about natural resource conditions, issues, and problems. For more info refer to the link: https://www.usgs.gov/

### Pacific Coastal and Marine Science Center - The National Archive of Marine Seismic Surveys (NAMSS)
The National Archive of Marine Seismic Surveys (NAMSS) is a marine seismic reflection data archive consisting of data acquired by or contributed to U.S. Department of the Interior agencies. The USGS is committed to preserving these data on behalf of the academic community and the nation. Data are provided with free and open access. For more information regarding NAMSS, see the link: https://walrus.wr.usgs.gov/namss/. NAMSS is a massive website for open-source 2D and 3D seismic reflection data.

### What is web scraping?
Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, engineer, or anybody who analyzes large amounts of datasets, the ability to scrape data from the web is a useful skill to have.

### Why is python used for scraping?
Automated web scraping can speed up the data collection process. You write your code once, and it gets the information you want many times and from many pages. Python is a popular language for web scraping and handles multiple crawling or scraping tasks comfortably. `Requests` and `BeautifulSoup` are among the most widely used Python libraries for this purpose.

### Project workflow:
1. Choose a website and describe the project objective
2. Create a list of all the seismic survey URLs using the characters from a to z.
3. Download the webpage using requests.
4. Parse the HTML source code using Beautiful Soup.
5. Extract survey names, information and URLs from the page.
6. Compile extracted information into Python lists and dictionaries
7. Extract and combine data from multiple survey pages
8. Save the extracted information to a CSV file.

### Expected results:
By the end of the project, we will create a CSV file in the following format:

Survey name,Operator,Dates,Data type,Datum,North lat,South lat,East long,West long,SEGY size,Navigation size,SEGY zip,Navigation zip,url
B-00-95-LA,Bureau of Ocean Energy Management,1995,3D Multichannel Seismic,North American Datum 1927 (NAD27),28.00779,27.94530,-92.09660,-92.18261,(248.5 MB),(1.4 KB),https://walrus.wr.usgs.gov/namss/data/1995/namss.B-00-95-LA.mcs3d.airgun.zip,https://walrus.wr.usgs.gov/namss/media/navigation/2021/03/16/114121798695/B-00-95-LA.zip,https://walrus.wr.usgs.gov/namss/survey/b-00-95-la
B-01-75-AT,Bureau of Ocean Energy Management,1975,2D Multichannel Seismic,North American Datum 1983 (NAD83),28.32871,28.19741,-90.11074,-90.29505,(2.4 GB),(564.6 KB),https://walrus.wr.usgs.gov/namss/data/1975/namss.B-01-75-AT.mcs.airgun.zip,https://walrus.wr.usgs.gov/namss/media/navigation/2015/08/31/b-01-75-at.segp1,https://walrus.wr.usgs.gov/namss/survey/b-01-75-at
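
Once the file exists, it can be read back with Python's standard `csv` module. The following is a minimal sketch, assuming the column names shown above; `sample_csv` and `read_surveys` are hypothetical names used only for illustration, and the single data row is copied from the sample.

```python
import csv
import io

# One header line and one survey row, copied from the sample above.
sample_csv = (
    "Survey name,Operator,Dates,Data type,Datum,North lat,South lat,"
    "East long,West long,SEGY size,Navigation size,SEGY zip,Navigation zip,url\n"
    "B-00-95-LA,Bureau of Ocean Energy Management,1995,3D Multichannel Seismic,"
    "North American Datum 1927 (NAD27),28.00779,27.94530,-92.09660,-92.18261,"
    "(248.5 MB),(1.4 KB),"
    "https://walrus.wr.usgs.gov/namss/data/1995/namss.B-00-95-LA.mcs3d.airgun.zip,"
    "https://walrus.wr.usgs.gov/namss/media/navigation/2021/03/16/114121798695/B-00-95-LA.zip,"
    "https://walrus.wr.usgs.gov/namss/survey/b-00-95-la\n"
)


def read_surveys(text):
    """Parse CSV text into a list of dictionaries, one per survey."""
    return list(csv.DictReader(io.StringIO(text)))


surveys = read_surveys(sample_csv)
first = surveys[0]

# Coordinates are stored as text in the CSV, so convert them before use.
bounds = {
    key: float(first[key])
    for key in ("North lat", "South lat", "East long", "West long")
}
print(first["Survey name"], first["Data type"], bounds)
```

To read the real file instead of the sample string, the same `csv.DictReader` can be given an open file handle.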

![example.png](https://i.imgur.com/PCctt2O.png)

### Running the code:
You can execute the code using the "Run" button at the top of this page. You can make changes and save your own version of the notebook to [Jovian](https://www.jovian.ai) by executing the following cells. You can then run it on Binder, Colab (Google's cloud infrastructure), or Kaggle.

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [3]:
# Execute this to save new versions of the notebook
jovian.commit(filename="scraping_seismic_segy_datasets_on_us_geological_survey_usgsramysaleem")

[jovian] Attempting to save notebook..
[jovian] Updating notebook "ramysaleem/scraping-seismic-segy-datasets-on-us-geological-survey-ramysaleem" on https://jovian.ai/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/ramysaleem/scraping-seismic-segy-datasets-on-us-geological-survey-ramysaleem


'https://jovian.ai/ramysaleem/scraping-seismic-segy-datasets-on-us-geological-survey-ramysaleem'

## Project Method:

### 1. USGS website and NAMSS web page detailed workflow

1. The first step of the project is to scrape the USGS Science Explorer data website https://www.usgs.gov/science-explorer-results?es=seismic+reflection&classification=data

2. Second, this gave us a list of geoscience websites hosting the available data, from which we selected NAMSS (https://walrus.wr.usgs.gov/namss/), which contains the seismic reflection data we are interested in.

3. Third, we scraped the NAMSS website to collect 600 2D and 3D seismic surveys.

4. For each survey, we collected the survey name, the operator that acquired the seismic data, the acquisition dates, the datum, the survey coordinates (latitude and longitude), the size of the zipped SEGY files, the size of the navigation files, the links to the SEGY and navigation zip files, and a URL for additional information.

5. The NAMSS website has a search map icon that opens the https://walrus.wr.usgs.gov/namss/search/ web page. The map has multiple filters, and the seismic surveys are hidden behind them.

6. We used the browser's inspect tool (developer tools) to identify the base URL and appended the characters "a" to "z" to it to get all the survey names. We then combined the names into one list, cleaned it, and added each survey name to the base URL to get all the survey web pages.

7. From the survey web pages (for example, https://walrus.wr.usgs.gov/namss/survey/b-49-95-la/), we collected the needed information.

8. We have created a function to get the survey name using the `h1` tag.

9. Another function was implemented to get the survey information such as operator, dates, data type, datum using the `div` tag.

10. Also, we have created a function that collects the size of the seismic SEGY file and the size of the navigation file using the `span` tag.
    
11. Moreover, we have scraped the navigation metadata XML web page to collect the coordinates as latitude and longitude using the `northlat`, `southlat`, `eastlong` and `westlong` classes.

12. In total, we created nine functions that collect all the required survey information.

13. We then used a for loop to pass the lists of 600 survey URLs and XML pages to these functions, each of which returns a list. We processed the 600 surveys in three batches of 200 to avoid being blocked when sending requests.

14. Finally, we created a function that writes all the information to a CSV file, which we later open as a data frame using the pandas library.
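
The batching idea in step 13 can be sketched as follows. This is a minimal illustration and not the notebook's own code: `batches`, `process_in_batches`, the `pause_seconds` parameter and the `example.org` URLs are all hypothetical names chosen for the example, and the pause between batches is an assumption about how one might space out requests.

```python
import time


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(urls, handler, size=200, pause_seconds=0.0):
    """Apply `handler` to every URL, pausing between batches."""
    results = []
    for index, batch in enumerate(batches(urls, size)):
        if index > 0 and pause_seconds > 0:
            time.sleep(pause_seconds)  # give the server a rest between batches
        results.extend(handler(url) for url in batch)
    return results


# Hypothetical URLs, used only to show the batch sizes.
demo_urls = ['https://example.org/survey/{}'.format(i) for i in range(600)]
print([len(batch) for batch in batches(demo_urls, 200)])
```

In practice, `handler` would be one of the scraping functions described above, and `pause_seconds` would be tuned to whatever the server tolerates.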

### 2. Download the webpage using `requests`

We installed and imported the requests library to download the web page.
The library can be installed using `pip`.

In [4]:
!pip install requests --upgrade --quiet

In [5]:
import requests

To download a page, we can use the `get` function from requests, which returns a response object.

In [6]:
sciences_data_url = 'https://www.usgs.gov/science-explorer-results?es=seismic+reflection&classification=data'

In [7]:
responses = requests.get(sciences_data_url)

`requests.get` returns a response object containing the data from the web page and some other information.

The `.status_code` property can be used to check if the request was successful. A successful response will have an [HTTP status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) between 200 and 299.

In [8]:
responses.status_code

200

In [9]:
# or we can use the boolean `ok` property
responses.ok

True

The request was successful. We can get the contents of the page using `responses.text`.

In [10]:
page_content = responses.text

We have checked the number of characters on the page.

In [11]:
len(page_content)

92192

The page contains over 92,000 characters! Here are the first 500 characters of the page:

In [12]:
page_content[:500]

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"\n  "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">\n<html dir="ltr" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">\n<head profile="http://www.w3.org/1999/xhtml/vocab">\n  <meta charset="utf-8">\n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  <meta http-equiv="X-UA-Compatible" content="IE=Edge" />\n  \n  \n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<script type="text/x-mathjax-c'

We can also save it to a file and view the page locally within Jupyter using "file>open".

In [13]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_content)

### 3. Used `Beautiful Soup` to parse the USGS web site and select the web site that contains the seismic data

## [USGS](https://www.usgs.gov/science-explorer-results?es=3D+Seismic+data&classification=data)

We scraped the Science Explorer web page to explore the results and find the website that hosts the 2D and 3D seismic reflection data.

We installed and imported the `beautifulsoup4` library to parse the web page.
The library can be installed using `pip`.

In [14]:
!pip install beautifulsoup4 --upgrade --quiet

In [15]:
from bs4 import BeautifulSoup

Then we parsed the web page and saved the output into `doc`, which is a Beautiful Soup object.

In [16]:
doc = BeautifulSoup(page_content, 'html.parser')

In [17]:
type(doc)

bs4.BeautifulSoup

### 3.1 Data Titles

We extracted the data titles using the `<h3>` tags.

In [18]:
data_title_tags = doc.find_all('h3')

In [19]:
data_title_tags[:1]

[<h3 class="list-title h4">National Archive of Marine Seismic Surveys (NAMSS)</h3>]

In [20]:
len(data_title_tags)

20

In [21]:
data_title_tags[:5]

[<h3 class="list-title h4">National Archive of Marine Seismic Surveys (NAMSS)</h3>,
 <h3 class="list-title h4">Location of Seismic Reflection Line CRmv</h3>,
 <h3 class="list-title h4">Bathymetry, acoustic backscatter, and minisparker seismic-reflection datasets collected southwest of Montague Island and southwest of Chenega, Alaska during field activity 2014-622-FA</h3>,
 <h3 class="list-title h4">Marine geophysical data—Point Sal to Refugio State Beach, southern California</h3>,
 <h3 class="list-title h4">Data report for line 8 of the 2011 USGS seismic imaging survey at San Andreas Lake, San Mateo County, California</h3>]

We have created a for loop to collect all the titles available in the USGS web site.

In [22]:
data_title = []
for tag in data_title_tags:
    data_title.append(tag.text.strip())

data_title[:2]

['National Archive of Marine Seismic Surveys (NAMSS)',
 'Location of Seismic Reflection Line CRmv']

#### Data title function:

Let us now create a function that gets the titles from the whole document.

In [23]:
def get_data_title(tags):

    # Loop over the tags passed in and collect the text of each title
    data_title = []
    for tag in tags:
        data_title.append(tag.text.strip())

    return data_title

In [24]:
data_title_tags_all = get_data_title(data_title_tags)

In [25]:
data_title_tags_all[:2]

['National Archive of Marine Seismic Surveys (NAMSS)',
 'Location of Seismic Reflection Line CRmv']

### 3.2 Data Description

We used the `<div>` tag and the class stored in `desc_selector` to collect the data descriptions.

In [26]:
desc_selector = "views-field views-field-drupal-contentfield-intro"
desc_tags = doc.find_all('div', {'class': desc_selector})

In [27]:
len(desc_tags)

20

In [28]:
desc_tags[:1]

[<div class="views-field views-field-drupal-contentfield-intro"> <span class="field-content">The National Archive of Marine Seismic Surveys (NAMSS) is a marine seismic reflection data archive consisting of data acquired by or contributed to U.S. Department of the Interior agencies. The USGS is committed to preserving these data on behalf of the academic community and the nation. Data are provided with free and open access.
 </span> </div>]

In [29]:
desc_tags[1:2]

[<div class="views-field views-field-drupal-contentfield-intro"> <span class="field-content">This dataset provides location information for the seismic reflection line CRmv across Crowleys Ridge in the New Madrid seismic zone, central US. The seismic reflection data are interpreted and discussed in the associated publication
 </span> </div>]

We created a for loop to get all the descriptions from the web page.

In [30]:
data_desc= []
for tag in desc_tags:
    data_desc.append(tag.text.strip())
    
data_desc[5:6]

['This data release includes chirp seismic-reflection data collected in 2014 aboard the USGS\xa0R/V Snavely\xa0in San Pablo Bay, part of northern San Francisco Bay.\xa0The\xa0data were collected as part of USGS efforts to better understand the fault geometry of the Hayward and Rodgers Creek faults beneath the bay.']

Let us do a sanity check to make sure that the number of descriptions matches the number of titles.

In [31]:
len(data_desc)

20
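
This kind of check can also be automated before the lists are combined into a table. The following is a small sketch under assumed names: `check_equal_lengths` is a hypothetical helper, and `demo_columns` stands in for the scraped lists.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists, or raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return next(iter(lengths.values()), 0)


# Hypothetical stand-ins for the scraped lists.
demo_columns = {
    'Data title': ['title 1', 'title 2'],
    'Data Description': ['description 1', 'description 2'],
}
print(check_equal_lengths(demo_columns))
```

Failing early with a clear message is easier to debug than a length-mismatch error raised later while building a data frame.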

#### Data description function:

Let us now create a function that gets all the data descriptions from the whole document.

In [32]:
def get_data_desc(tags):

    # Loop over the tags passed in and collect the text of each description
    data_desc = []
    for tag in tags:
        data_desc.append(tag.text.strip())

    return data_desc

In [33]:
data_desc_all =  get_data_desc(desc_tags)

In [34]:
data_desc_all[:1]

['The National Archive of Marine Seismic Surveys (NAMSS) is a marine seismic reflection data archive consisting of data acquired by or contributed to U.S. Department of the Interior agencies. The USGS is committed to preserving these data on behalf of the academic community and the nation. Data are provided with free and open access.']

### 3.3 Data page url

We used the `<div>` tag with the class stored in `data_link_selector`, which wraps the `<a href>` link, to collect the needed URLs.

In [35]:
data_link_selector = "views-field green-link"
data_link_tags = doc.find_all('div', {'class': data_link_selector})

In [36]:
data_link_tags[0].text

'https://walrus.wr.usgs.gov/namss/'

In [37]:
len(data_link_tags)

20

In [38]:
data_link_tags[:2]

[<div class="views-field green-link"><a href="https://walrus.wr.usgs.gov/namss/">https://walrus.wr.usgs.gov/namss/</a></div>,
 <div class="views-field green-link"><a href="https://doi.org/10.5066/P9TFRP5D">https://doi.org/10.5066/P9TFRP5D</a></div>]
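
In this page the link text happens to equal the link target, so reading `.text` works. As a hedged alternative, the target can be read from the `href` attribute of the inner `<a>` tag, which stays correct even when the visible text differs. The sketch below runs on two result blocks copied from the output above; `extract_links` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Two result blocks copied from the output above.
sample_html = (
    '<div class="views-field green-link">'
    '<a href="https://walrus.wr.usgs.gov/namss/">https://walrus.wr.usgs.gov/namss/</a></div>'
    '<div class="views-field green-link">'
    '<a href="https://doi.org/10.5066/P9TFRP5D">https://doi.org/10.5066/P9TFRP5D</a></div>'
)


def extract_links(doc):
    """Return the href of the first <a> inside each 'green-link' div."""
    links = []
    for div in doc.find_all('div', class_='green-link'):
        anchor = div.find('a', href=True)
        if anchor is not None:  # skip blocks without a link
            links.append(anchor['href'])
    return links


sample_doc = BeautifulSoup(sample_html, 'html.parser')
print(extract_links(sample_doc))
```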

In [39]:
len(data_link_tags)

20

In [40]:
data_link_tags[5].text

'https://doi.org/10.5066/F74T6GF1'

In [41]:
data_links = []

for tag in data_link_tags:
    data_links.append(tag.text.strip())
    
data_links[:2]

['https://walrus.wr.usgs.gov/namss/', 'https://doi.org/10.5066/P9TFRP5D']

#### Data URLs function:

Let us now create a function that gets all the data URLs from the whole document.

In [42]:
def get_data_link_tags(tags):

    # Loop over the tags passed in and collect the URL text of each one
    data_links = []
    for tag in tags:
        data_links.append(tag.text.strip())

    return data_links

In [43]:
data_links_all = get_data_link_tags(data_link_tags)

In [44]:
data_links_all[1:2]

['https://doi.org/10.5066/P9TFRP5D']

### Page url function

We created a function that wraps the steps above into one: it downloads a web page and parses it. It will be reused later for the survey web pages.

In [45]:
def get_url_page(survey_url):
    
    # Download the url survey page
    response = requests.get(survey_url) 
    
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(survey_url))
    
    # Parse using Beautiful soup
    page_url_doc = BeautifulSoup(response.text, 'html.parser')
    
    return page_url_doc

In [46]:
sciences_doc = get_url_page(sciences_data_url)

In [47]:
sciences_doc.find('title')

<title>Science Explorer</title>

We can now use the function `get_url_page` to download any web page and parse it using beautiful soup.
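
A slightly more defensive variant is sketched below. It is an illustration rather than the notebook's method: `parse_response` and `fetch_page` are hypothetical names, and the 30-second default timeout is an arbitrary assumption. Note one behavioural difference: `raise_for_status()` raises only for 4xx and 5xx responses, whereas the function above rejects every status other than 200.

```python
import requests
from bs4 import BeautifulSoup


def parse_response(response):
    """Raise requests.HTTPError on a 4xx/5xx status, otherwise parse the HTML."""
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')


def fetch_page(url, timeout=30):
    """Download `url` with a timeout (in seconds) and return the parsed document."""
    return parse_response(requests.get(url, timeout=timeout))
```

Without a `timeout`, `requests.get` can wait indefinitely on an unresponsive server, which matters when looping over hundreds of survey pages.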

### 3.4 USGS web pages Data frame 

After collecting all the required information, we used the `pandas` library to create a data frame.

To install the library inside the notebook, use `pip`; then import it as `pd` for short.

In [48]:
!pip install pandas --quiet

In [49]:
import pandas as pd


#### Data Dictionary:

Now let us create a dictionary to store all the collected data, so that it is easy to convert into a data frame.

In [50]:
data_dict = {
    'Data title' : data_title,
    'Data Description' : data_desc,
    'url' : data_links
}

#### Seismic survey data frame:

Finally, let us store our data in a data frame.

In [51]:
data_df = pd.DataFrame(data_dict)

In [52]:
data_df

| | Data title | Data Description | url |
|---|---|---|---|
| 0 | National Archive of Marine Seismic Surveys (NA... | The National Archive of Marine Seismic Surveys... | https://walrus.wr.usgs.gov/namss/ |
| 1 | Location of Seismic Reflection Line CRmv | This dataset provides location information for... | https://doi.org/10.5066/P9TFRP5D |
| 2 | Bathymetry, acoustic backscatter, and minispar... | High-resolution acoustic backscatter data, bat... | https://doi.org/10.5066/P9K1YQ35 |
| 3 | Marine geophysical data—Point Sal to Refugio S... | This data release includes approximately 1,032... | https://doi.org/10.5066/F7SX6BCD |
| 4 | Data report for line 8 of the 2011 USGS seismi... | In June of 2011, the U.S. Geological Survey ac... | https://doi.org/10.5066/P9FX66OZ |
| 5 | Chirp seismic-reflection data: San Pablo Bay, ... | This data release includes chirp seismic-refle... | https://doi.org/10.5066/F74T6GF1 |
| 6 | High-resolution seismic imaging of the West Na... | In November 2016, the U.S. Geological Survey a... | https://doi.org/10.5066/P9UREVME |
| 7 | High-resolution seismic imaging of the West Na... | In November 2016, the U.S. Geological Survey a... | https://doi.org/10.5066/P92UWULX |
| 8 | Multichannel sparker seismic-reflection data o... | This data release contains high-resolution mul... | https://doi.org/10.5066/F7KP81BQ |
| 9 | 2015 High Resolution Seismic Data Recorded at ... | In May 2015, we acquired high-resolution seism... | https://doi.org/10.5066/P9F4IAAL |


Here we have scraped the first page, which contains 20 seismic data websites. However, this website contains several pages.

### 3.5 USGS web pages CSV file

We created a CSV file with the extracted information to store the data.

In [53]:
data_df.to_csv('data.csv', index=None)

# 4. Getting seismic survey data out of seismic data page

## 4.1 [NAMSS](https://walrus.wr.usgs.gov/namss/search/)

We scraped the NAMSS web page, which is the first site in the USGS results, to collect the required information about the 2D and 3D seismic reflection surveys.

#### Location map showing the U.S. coastal margins and gulfs. The pink lines show the 2D seismic surveys, and the pink boxes show the 3D surveys.

![Location map.png](https://i.imgur.com/8sLmBfu.png)

This website is the first entry in the list we collected in the previous step.

In [54]:
data_page_url = data_links[0]

In [55]:
data_page_url

'https://walrus.wr.usgs.gov/namss/'

We used the browser's inspect tool (developer tools) to identify the base URL and appended the characters "a" to "z" to it to get all the survey names. We then combined the names into one list, cleaned it, and added each survey name to the base URL to get all the survey web pages.

To reach the survey data, we need to extend the URL. Inspecting the page with the developer tools shows that we need to add the `filter/autocomplete/` path and the `name=identifier` parameter, followed by a `term` parameter set to each letter of the alphabet (from "a" to "z"), to get the pages where the data and information are hosted.

![Inspect.png](https://i.imgur.com/2xGF15F.png)

As a test, we use the character "a" and check that the URL gives the expected result.

In [56]:
seismic_data_page_url = data_links[0] + 'filter/autocomplete/?name=identifier&term=a'

In [57]:
seismic_data_page_url

'https://walrus.wr.usgs.gov/namss/filter/autocomplete/?name=identifier&term=a'

##### Used Beautiful Soup to parse the NAMSS web site and scrape the web site that contains the seismic data

Let us use `requests` and `BeautifulSoup` to download and parse the page.

In [58]:
response = requests.get(seismic_data_page_url)

Getting the status code

In [59]:
response.status_code

200

In [60]:
len(response.text)

13135

In [61]:
seismic_data_doc = BeautifulSoup(response.text, 'html.parser')

## 4.2 Characters list from a to z:

After figuring out the base URL and extension, we can create a loop to iterate over the alphabetic characters to add them to the base URL.

In [62]:
# create a list of the alphabet characters

characs = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

We created a loop to iterate over the characters list and add each character to the end of the base URL to create the survey search URLs.

In [63]:
surveys_url = []

for charac in characs:
    surveys_url.append(data_links[0] + 'filter/autocomplete/?name=identifier&term=' + charac)

print(surveys_url[:3])

['https://walrus.wr.usgs.gov/namss/filter/autocomplete/?name=identifier&term=a', 'https://walrus.wr.usgs.gov/namss/filter/autocomplete/?name=identifier&term=b', 'https://walrus.wr.usgs.gov/namss/filter/autocomplete/?name=identifier&term=c']
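
The same list can be built with the standard library, which avoids typing the alphabet by hand and takes care of query-string encoding. This is a minimal sketch; `BASE_URL` and `autocomplete_urls` are hypothetical names, while the path and parameters are the ones used above.

```python
import string
from urllib.parse import urlencode, urljoin

BASE_URL = 'https://walrus.wr.usgs.gov/namss/'


def autocomplete_urls(base_url=BASE_URL):
    """Build one autocomplete URL per lowercase letter, 'a' to 'z'."""
    endpoint = urljoin(base_url, 'filter/autocomplete/')
    return [
        endpoint + '?' + urlencode({'name': 'identifier', 'term': letter})
        for letter in string.ascii_lowercase
    ]


urls = autocomplete_urls()
print(len(urls), urls[0])
```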


#### Survey url function:

We created a function that loops over all 26 URLs and collects their responses, so that we can apply Beautiful Soup and scrape the pages.

In [64]:
def get_survey_names(urls):

    # Loop over the URL list passed in and collect all the responses
    survey_names = []

    for survey in urls:
        survey_names.append(requests.get(survey))

    return survey_names

In [65]:
survey_names_all = get_survey_names(surveys_url)

In [66]:
print(survey_names_all)

[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]


In [67]:
seismic_survey_names = BeautifulSoup(response.text.strip(), 'html.parser')

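Parse the response text into one Beautiful Soup object that holds all the collected survey names. Note that the cell above parses `response`, the result of the earlier request for the term "a", not the 26 responses stored in `survey_names_all`.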

In [68]:
len(seismic_survey_names)

1

## 4.3 Clean survey URL list:

Now that we have the survey names, we need to clean them and convert them to lowercase.

Let us import the `glob` and `re` libraries to clean our survey list.

In [69]:
import glob
import re

In [70]:
surveys_cleaned = [re.sub(r" ", '', file) for file in seismic_survey_names]

If you like to track the cleaning steps, please uncomment the following cell to see the result.

In [71]:
# surveys_cleaned

We need to remove the `\n`, `"` and `[]` characters; then we split the string on commas to get the list items.

In [72]:
#surveys_cleaned

converted_list = []

for element in surveys_cleaned:
    converted_list.append(element.replace('\n', '').replace('"', '').strip('[]').strip().split(','))

If you like to track the cleaning steps, please uncomment the following cell to see the result.

In [73]:
# converted_list

Now we need to remove the outer list to convert it from 2D list to 1D list.

In [74]:
def strip_outer(x):
    while len(x) == 1 and isinstance(x[0], list):
        x = x[0]
    return x

In [75]:
clean_list = strip_outer(converted_list)

In [76]:
clean_list[0]

'2014-645-FA_c'

In [77]:
len(clean_list)

612
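
The characters stripped above (`[`, `]`, `"` and `,`) suggest that the autocomplete endpoint returns a JSON array of strings. Under that assumption, which is not confirmed by the page itself, the standard `json` module can parse each response directly, and the results for several search terms can be merged without duplicates. `merge_autocomplete_payloads` and `demo_payloads` are hypothetical names used for illustration.

```python
import json


def merge_autocomplete_payloads(payloads):
    """Parse each JSON array of names and merge them, dropping duplicates."""
    seen = set()
    merged = []
    for payload in payloads:
        for name in json.loads(payload):
            if name not in seen:  # keep the first occurrence only
                seen.add(name)
                merged.append(name)
    return merged


# Hypothetical response bodies for two search terms; real payloads would
# come from the `.text` of each response.
demo_payloads = [
    '["2014-645-FA_c", "2014-645-FA_s", "B-00-95-LA"]',
    '["B-00-95-LA", "B-01-75-AT"]',
]
print(merge_autocomplete_payloads(demo_payloads))
```

Deduplication matters when combining letters, because a survey name such as `B-00-95-LA` matches more than one search term.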

After finishing the cleaning, we need to convert all the uppercase characters to lowercase and then add the names to the base URL.

In [78]:
small_charac = []

for survey in clean_list:
    small_charac.append(survey.lower())

In [79]:
# print(small_charac)

In [80]:
len(small_charac)

612

In [81]:
data_links[0]

'https://walrus.wr.usgs.gov/namss/'

Now let us create the survey data URLs and start collecting our information.

In [82]:
surveys_data_link = []

for survey in small_charac:
    surveys_data_link.append(data_links[0] + 'survey/' + survey)

In [83]:
# print(surveys_data_link)

In [84]:
len(surveys_data_link)

612

#### Remove corrupted web site:

Some of the survey URLs and XML pages are broken, so we remove them using the `del` statement and the `remove` method.

In [85]:
surveys_data_link[:10]

['https://walrus.wr.usgs.gov/namss/survey/2014-645-fa_c',
 'https://walrus.wr.usgs.gov/namss/survey/2014-645-fa_s',
 'https://walrus.wr.usgs.gov/namss/survey/a-1-00-scmultichannel',
 'https://walrus.wr.usgs.gov/namss/survey/a-1-00-scsinglechannel',
 'https://walrus.wr.usgs.gov/namss/survey/a-1-02-sc_chirp',
 'https://walrus.wr.usgs.gov/namss/survey/a-1-02-sc_gimcs',
 'https://walrus.wr.usgs.gov/namss/survey/a-1-02-sc_htspark',
 'https://walrus.wr.usgs.gov/namss/survey/a-1-02-sc_huntec',
 'https://walrus.wr.usgs.gov/namss/survey/a-1-02-sc_msmcs',
 'https://walrus.wr.usgs.gov/namss/survey/a-1-02-sc_scag']

Let us use the `del` statement to remove entries from the list by index.

In [86]:
del surveys_data_link[2:10]

Let us use the `remove` method to remove entries from the list by value.

In [87]:
surveys_data_link.remove('https://walrus.wr.usgs.gov/namss/survey/l-09-11-gamcs')
surveys_data_link.remove('https://walrus.wr.usgs.gov/namss/survey/p1-13-lagreencanyon')
surveys_data_link.remove('https://walrus.wr.usgs.gov/namss/survey/p1-13-lawalkerridge')

In [88]:
len(surveys_data_link)

601
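
Deleting by index depends on the exact order of the list, so it can silently remove the wrong entries if the archive changes. As a hedged alternative, the same eleven surveys can be filtered out by name. `BROKEN_SURVEYS` and `drop_broken` are hypothetical names; the identifiers are the ones removed in the cells above.

```python
BASE_URL = 'https://walrus.wr.usgs.gov/namss/survey/'

# Survey identifiers removed in the cells above.
BROKEN_SURVEYS = {
    'a-1-00-scmultichannel', 'a-1-00-scsinglechannel', 'a-1-02-sc_chirp',
    'a-1-02-sc_gimcs', 'a-1-02-sc_htspark', 'a-1-02-sc_huntec',
    'a-1-02-sc_msmcs', 'a-1-02-sc_scag',
    'l-09-11-gamcs', 'p1-13-lagreencanyon', 'p1-13-lawalkerridge',
}


def drop_broken(urls, broken=BROKEN_SURVEYS):
    """Keep only URLs whose last path segment is not a known broken survey."""
    return [
        url for url in urls
        if url.rstrip('/').rsplit('/', 1)[-1] not in broken
    ]


demo = [BASE_URL + name for name in ('2014-645-fa_c', 'a-1-02-sc_chirp', 'b-00-95-la')]
print(drop_broken(demo))
```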

#### Survey navigation metadata:

We have used a for loop to create an XML list from the URL list.

In [89]:
survey_data_xml = []
for i in surveys_data_link:
    survey_data_xml.append(i + '/metadata/seismic/')

In [90]:
survey_data_xml[:10]

['https://walrus.wr.usgs.gov/namss/survey/2014-645-fa_c/metadata/seismic/',
 'https://walrus.wr.usgs.gov/namss/survey/2014-645-fa_s/metadata/seismic/',
 'https://walrus.wr.usgs.gov/namss/survey/a-1-02-sc_scms/metadata/seismic/',
 'https://walrus.wr.usgs.gov/namss/survey/b-00-79-la/metadata/seismic/',
 'https://walrus.wr.usgs.gov/namss/survey/b-00-95-la/metadata/seismic/',
 'https://walrus.wr.usgs.gov/namss/survey/b-01-75-at/metadata/seismic/',
 'https://walrus.wr.usgs.gov/namss/survey/b-01-77-la/metadata/seismic/',
 'https://walrus.wr.usgs.gov/namss/survey/b-01-78-at/metadata/seismic/',
 'https://walrus.wr.usgs.gov/namss/survey/b-01-80-at/metadata/seismic/',
 'https://walrus.wr.usgs.gov/namss/survey/b-01-81-at/metadata/seismic/']

In [91]:
len(survey_data_xml)

601

## 4.4 Collecting seismic survey data out of seismic data NAMSS web site

Let us start with the first five survey pages to collect the needed seismic information.

In [92]:
surveys_data_link[:5]

['https://walrus.wr.usgs.gov/namss/survey/2014-645-fa_c',
 'https://walrus.wr.usgs.gov/namss/survey/2014-645-fa_s',
 'https://walrus.wr.usgs.gov/namss/survey/a-1-02-sc_scms',
 'https://walrus.wr.usgs.gov/namss/survey/b-00-79-la',
 'https://walrus.wr.usgs.gov/namss/survey/b-00-95-la']

In [93]:
seismic_survey_page_url_1 = surveys_data_link[0]

In [94]:
seismic_survey_page_url_1

'https://walrus.wr.usgs.gov/namss/survey/2014-645-fa_c'

Here we applied the same approach as in the previous steps to request and parse the web page.

In [95]:
responses_1 = requests.get(seismic_survey_page_url_1)

In [96]:
responses_1.status_code

200

In [97]:
responses_1.text[:200]

'\n\n<!DOCTYPE html>\n<html lang="en">\n<head>\n  <meta charset="utf-8">\n  <meta http-equiv="X-UA-Compatible" content="IE=edge">\n  <meta name="viewport" content="width=device-width, initial-scale=1">\n  <tit'

In [98]:
len(responses_1.text)

10210

In [99]:
first_seismic_survey_data_page = responses_1.text

In [100]:
first_seismic_survey_data_page[:200]

'\n\n<!DOCTYPE html>\n<html lang="en">\n<head>\n  <meta charset="utf-8">\n  <meta http-equiv="X-UA-Compatible" content="IE=edge">\n  <meta name="viewport" content="width=device-width, initial-scale=1">\n  <tit'

In [101]:
with open('first_seismic_survey_page.html', 'w', encoding='utf-8') as f:
    f.write(first_seismic_survey_data_page)

## 4.5 Seismic data information:

Let us start collecting our seismic reflection data information now. We start by using `BeautifulSoup` to parse the web page and save it as a document.

In [102]:
doc_seismic_1 = BeautifulSoup(first_seismic_survey_data_page, 'html.parser')

### 4.5.1 Survey names

The figure shows a ship towing seismic acquisition recording cables (streamers).

![seismic acquisition.png](https://i.imgur.com/WbNxcKK.png)

We extracted the survey name using the `<h1>` tag.

In [103]:
survey_names_tags = doc_seismic_1.find_all('h1')

In [104]:
survey_names_tags

[<h1>2014-645-FA_c</h1>]

In [105]:
survey_names = survey_names_tags[0].text
survey_names

'2014-645-FA_c'

### 4.5.2 Operator

We have collected the contributor of the seismic data set.

The figure shows the major "seven sisters" operators in the USA.

![Operator.png](https://i.imgur.com/mToqDb9.png)

We used the `<div>` tag with the class `col-xs-12 col-sm-9` to collect the operator name.

In [106]:
operator_tags = doc_seismic_1.find_all('div', {'class':'col-xs-12 col-sm-9'})
operator_tags[1]

<div class="col-xs-12 col-sm-9">
<p>Chirp</p>
</div>

In [107]:
operator = operator_tags[2].text.strip()
operator

'Pacific Coastal and Marine Science Center'

### 4.5.3 Dates

We used the `<div>` tag with the class `col-xs-12 col-sm-9` to collect the dates of the data acquisition. We then take the fourth element (index 3), which always holds the dates.

In [108]:
dates_tags = doc_seismic_1.find_all('div', {'class':'col-xs-12 col-sm-9'})
dates_tags[3]

<div class="col-xs-12 col-sm-9">
<p>Started on April 20, 2017, and ended on April 20, 2017.</p>
</div>

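Let us keep only the year. The slice `[-5:-1]` takes the four characters before the final full stop, which is the year in which the survey ended.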

In [109]:
dates_type = dates_tags[3].text.strip()[-5:-1]
dates_type

'2017'
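The slice above only works when the sentence ends with a four-digit year followed by a full stop. As a slightly more defensive sketch, the standard-library `re` module can pull every four-digit year out of the sentence. The helper name `extract_years` is hypothetical and not part of the original notebook.

```python
import re

def extract_years(date_sentence):
    """Return all four-digit years (19xx or 20xx) found in the sentence, in order."""
    return re.findall(r'\b(?:19|20)\d{2}\b', date_sentence)

sentence = 'Started on April 20, 2017, and ended on April 20, 2017.'
years = extract_years(sentence)

# The first match is the start year, the last match is the end year
start_year, end_year = years[0], years[-1]
print(start_year, end_year)
```

Keeping both years makes it possible to spot surveys that span more than one calendar year, which the slice silently ignores.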

### 4.5.4 Data type

Seismic data are either single-channel or multi-channel. Multi-channel data are either 2D or 3D.

The figure shows seismic data acquisition at the surface, with 2D seismic lines and a 3D seismic cube in the subsurface.

![Datatype.png](https://i.imgur.com/yEKtHQP.png)

We index the first position to collect the data type.

In [110]:
data_type_tags = doc_seismic_1.find_all('div', {'class':'col-xs-12 col-sm-9'})
data_type_tags[0]

<div class="col-xs-12 col-sm-9">
<p>Singlechannel Seismic</p>
</div>

In [111]:
data_type = data_type_tags[0].text.strip()
data_type

'Singlechannel Seismic'

### 4.5.5 Geographic Coordinate System (GCS)

A coordinate system is any measuring system used to locate positions on 2D or 3D surfaces. The datum is the part of the GCS that determines which model (spheroid) represents the earth's surface and where it is positioned relative to that surface. In the following code we refer to the GCS as the datum, since the datum is the part most relevant to the seismic data.

![GCS.png](https://i.imgur.com/9kQUbtL.png)

Let us collect the datum information, which is essential for loading the seismic data.

In [112]:
datum_tags = doc_seismic_1.find_all('div', {'class':'col-xs-12 col-sm-9'})
datum_tags[-1]

<div class="col-xs-12 col-sm-9">
<p>World Geodetic System 1984 (WGS84)</p>
</div>

In [113]:
datum = datum_tags[-1].text.strip()
datum

'World Geodetic System 1984 (WGS84)'

The GCS navigation information is essential for loading the seismic data into other geological software.

### 4.5.6 Location

![Location.png](https://i.imgur.com/NfSXRHK.png)

In this step, we have used the navigation metadata web page to get the information about the location (latitude and longitude). 

In [114]:
nav_page_xml = 'https://walrus.wr.usgs.gov/namss/survey/t-06-12-at/metadata/seismic/'

In [115]:
response_nav_xml = requests.get(nav_page_xml)

In [116]:
response_nav_xml.status_code

200

In [117]:
nav_data_doc = BeautifulSoup(response_nav_xml.text, 'html.parser')

We use the tag names `'northlat'`, `'southlat'`, `'eastlong'` and `'westlong'` to collect the bounding coordinates of the survey.

In [118]:
northlat_names_tags = nav_data_doc.find_all('northlat')
nlat = northlat_names_tags[0].text

In [119]:
southlat_names_tags = nav_data_doc.find_all('southlat')
slat = southlat_names_tags[0].text

In [120]:
eastlong_names_tags = nav_data_doc.find_all('eastlong')
elong = eastlong_names_tags[0].text

In [121]:
westlong_names_tags = nav_data_doc.find_all('westlong')
wlong = westlong_names_tags[0].text

In [122]:
lat_long = nlat + ', ' + slat + ', ' + elong + ', ' + wlong
lat_long

'36.92747, 36.12804, -74.33366, -74.90127'
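The coordinates are collected as strings. For mapping or filtering they are usually more useful as numbers, so the following sketch converts them to floats and adds the centre of the bounding box. The function name `to_bounding_box` and the dictionary keys are hypothetical, and the simple midpoint assumes the box does not cross the 180° meridian.

```python
def to_bounding_box(nlat, slat, elong, wlong):
    """Convert the four coordinate strings to floats and add the box centre."""
    north, south = float(nlat), float(slat)
    east, west = float(elong), float(wlong)
    return {
        'north': north,
        'south': south,
        'east': east,
        'west': west,
        # Midpoint of the box; not valid for boxes crossing the antimeridian
        'centre_lat': (north + south) / 2,
        'centre_lon': (east + west) / 2,
    }

bbox = to_bounding_box('36.92747', '36.12804', '-74.33366', '-74.90127')
print(round(bbox['centre_lat'], 5), round(bbox['centre_lon'], 5))
```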

### 4.5.7 Seismic data zip SEGY files

Now let us grab the zip files of the data. There are two files to collect:
1. SEGY seismic data file. 
2. Navigation file. 

In [123]:
seis_zip_files_tags = doc_seismic_1.find_all('a')
seis_zip_files_tags[10]

<a download="" href="https://walrus.wr.usgs.gov/namss/data/2014/namss.2014-645-FA.scs.chirp.zip">Download</a>

In [124]:
seis_zip_files = seis_zip_files_tags[10]['href']
seis_zip_files

'https://walrus.wr.usgs.gov/namss/data/2014/namss.2014-645-FA.scs.chirp.zip'
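Picking the link by a fixed position (index 10) depends on every survey page having the same number of links before the download link. Both download links shown in this notebook carry a `download` attribute, so one alternative is to select links by that attribute instead. The sketch below uses the standard-library `html.parser` so that it runs without extra dependencies; with Beautiful Soup the same idea would be `find_all('a', download=True)`. The class name `DownloadLinkCollector` and the first link in `sample_html` are hypothetical, and whether every survey page marks its files this way is an assumption.

```python
from html.parser import HTMLParser

class DownloadLinkCollector(HTMLParser):
    """Collect the href of every <a> tag that carries a download attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == 'a' and 'download' in attr_map and attr_map.get('href'):
            self.links.append(attr_map['href'])

# Two download links copied from the notebook output, plus one ordinary link
sample_html = '''
<a href="/namss/">Home</a>
<a download="" href="https://walrus.wr.usgs.gov/namss/data/2014/namss.2014-645-FA.scs.chirp.zip">Download</a>
<a download="" href="/namss/media/navigation/2017/04/20/102230311158/2014-645-FA_chirp_nav.zip">Download</a>
'''

collector = DownloadLinkCollector()
collector.feed(sample_html)
print(collector.links)
```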

### 4.5.8 Size of the seismic file.

Let us now get the size of the seismic file.

We use the `<span>` tag and index the first element to get the size of the file. The text contains a non-breaking space (`\xa0`), which we replace with a regular space using the `replace` method.

In [125]:
seis_files_size_tags = doc_seismic_1.find_all('span')
size_f = seis_files_size_tags[0].text
size_seis = size_f.replace('\xa0', ' ')

In [126]:
size_seis

'(20.5 MB)'

### 4.5.9 Navigation data zip files

We have used the `<a>` tag to collect the navigation data zip files.

In [127]:
nav_zip_files_tags = doc_seismic_1.find_all('a')
nav_zip_files_tags[12]

<a download="" href="/namss/media/navigation/2017/04/20/102230311158/2014-645-FA_chirp_nav.zip">Download</a>

Here we need to add our base url to get the full link.

In [128]:
base_url_zip = 'https://walrus.wr.usgs.gov'
nav_zip_files = nav_zip_files_tags[12]
full_navnav_zip = base_url_zip + nav_zip_files['href']
full_navnav_zip

'https://walrus.wr.usgs.gov/namss/media/navigation/2017/04/20/102230311158/2014-645-FA_chirp_nav.zip'
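The seismic link above is already absolute, while the navigation link is relative, so the base URL must only be added to one of them. The standard-library function `urllib.parse.urljoin` handles both cases with the same call. This is a minimal sketch; the helper name `absolute_link` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://walrus.wr.usgs.gov'

def absolute_link(href, base=BASE_URL):
    """Resolve href against base; hrefs that are already absolute are unchanged."""
    return urljoin(base, href)

relative = '/namss/media/navigation/2017/04/20/102230311158/2014-645-FA_chirp_nav.zip'
absolute = 'https://walrus.wr.usgs.gov/namss/data/2014/namss.2014-645-FA.scs.chirp.zip'

print(absolute_link(relative))
print(absolute_link(absolute))
```

With this approach the same code can be applied to both zip links without knowing in advance which form the page uses.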

### 4.5.10 Size of the navigation zip file

Let us now get the size of the navigation file.

We use the `<span>` tag again, this time indexing the second element, and clean the string with the `replace` method.

In [129]:
files_size_tags = doc_seismic_1.find_all('span')
size_n = files_size_tags[1].text
size_nav = size_n.replace('\xa0', ' ')

In [130]:
size_nav

'(3.6 KB)'
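The sizes are stored as text such as `'(20.5 MB)'`, which cannot be sorted or summed directly. The following sketch converts such a string to an approximate number of bytes. The helper name `size_to_bytes` is hypothetical, and the use of 1024-based units is an assumption, since the page does not say how it rounds sizes.

```python
import re

# Assumption: the site reports binary (1024-based) units
UNIT_FACTORS = {'B': 1, 'KB': 1024, 'MB': 1024 ** 2, 'GB': 1024 ** 3}

def size_to_bytes(size_text):
    """Convert text such as '(20.5 MB)' to an approximate number of bytes.

    Returns None when no size can be recognised.
    """
    cleaned = size_text.replace('\xa0', ' ')
    match = re.search(r'(\d+(?:\.\d+)?)\s*(B|KB|MB|GB)', cleaned)
    if match is None:
        return None
    value, unit = match.groups()
    return int(float(value) * UNIT_FACTORS[unit])

print(size_to_bytes('(20.5 MB)'))
print(size_to_bytes('(3.6 KB)'))
```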

## 5. Example on survey page number 20:

Let us choose survey number 20 and collect the information again, to check that the workflow also works on another web page.

In [131]:
seismic_survey_page_url_20 = surveys_data_link[19]
seismic_survey_page_url_20

'https://walrus.wr.usgs.gov/namss/survey/b-01-95-la'

In [132]:
responses_20 = requests.get(seismic_survey_page_url_20)
responses_20

<Response [200]>

In [133]:
twentieth_seismic_survey_page = responses_20.text

In [134]:
with open('twentieth_seismic_survey_page.html', 'w') as f:
    f.write(twentieth_seismic_survey_page)

In [135]:
doc_seismic_20 = BeautifulSoup(twentieth_seismic_survey_page, 'html.parser')

In [136]:
survey20_names_tags = doc_seismic_20.find_all('h1')

In [137]:
survey20_names_tags

[<h1>B-01-95-LA</h1>]

In [138]:
survey20_names = survey20_names_tags[0].text
survey20_names

'B-01-95-LA'

In [139]:
operator_tags20 = doc_seismic_20.find_all('div', {'class':'col-xs-12 col-sm-9'})
operator_tags20[2]

<div class="col-xs-12 col-sm-9">
<p>Bureau of Ocean Energy Management</p>
</div>

In [140]:
operator20 = operator_tags20[2].text.strip()
operator20

'Bureau of Ocean Energy Management'

In [141]:
dates_tags20 = doc_seismic_20.find_all('div', {'class':'col-xs-12 col-sm-9'})
dates_tags20[3]

<div class="col-xs-12 col-sm-9">
<p>Started on Jan. 1, 1995, and ended on Jan. 1, 1995.</p>
</div>

In [142]:
dates20 = dates_tags20[3].text.strip()[-5:-1]
dates20

'1995'

In [143]:
data_type_tags20 = doc_seismic_20.find_all('div', {'class':'col-xs-12 col-sm-9'})
data_type_tags20[0]

<div class="col-xs-12 col-sm-9">
<p>2D Multichannel Seismic</p>
</div>

In [144]:
data_type20 = data_type_tags20[0].text.strip()
data_type20

'2D Multichannel Seismic'

In [145]:
datum_tags20 = doc_seismic_20.find_all('div', {'class':'col-xs-12 col-sm-9'})
datum_tags20[-1]

<div class="col-xs-12 col-sm-9">
<p>North American Datum 1927 (NAD27)</p>
</div>

In [146]:
datum20 = datum_tags20[-1].text.strip()
datum20

'North American Datum 1927 (NAD27)'
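Checking a second page by eye works for one survey, but the same checks can be written down as a small function and later applied to every record. The sketch below flags values that do not look like the ones seen so far. The function name `check_record` and the specific rules are assumptions based only on the two pages inspected in this notebook.

```python
import re

def check_record(record):
    """Return a list of human-readable problems found in one survey record."""
    problems = []
    if not re.fullmatch(r'(19|20)\d{2}', record.get('Dates', '')):
        problems.append('Dates is not a four-digit year')
    if 'Seismic' not in record.get('Data type', ''):
        problems.append('Data type does not mention Seismic')
    # Datums seen so far end with a code in brackets, e.g. (NAD27) or (WGS84)
    if not re.search(r'\([A-Z]+\d+\)$', record.get('Datum', '')):
        problems.append('Datum does not end with a code such as (NAD27)')
    return problems

survey_20 = {
    'Survey name': 'B-01-95-LA',
    'Operator': 'Bureau of Ocean Energy Management',
    'Dates': '1995',
    'Data type': '2D Multichannel Seismic',
    'Datum': 'North American Datum 1927 (NAD27)',
}
print(check_record(survey_20))
```

An empty list means the record passed all checks; any other result points to a page whose layout differs from the ones examined here.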

## 6. Define helper functions:

### 6.1 Getting the survey page

This function downloads a survey page and parses it with Beautiful Soup, returning a document that can be searched in the following steps.

In [147]:
def get_surveys_url_doc(survey_url):
    
    # Download the url survey page
    response = requests.get(survey_url) 
    
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(survey_url))
    
    # Parse using Beautiful soup
    survey_url_doc = BeautifulSoup(response.text, 'html.parser')
    
    return survey_url_doc

In [148]:
def surveys_final_docs(survey_url):
    survey_docs_lists = []
    for i in range(len(survey_url)):
        survey_docs_lists.append(get_surveys_url_doc(survey_url[i]))
    return survey_docs_lists

### 6.2 Getting the navigation page:

This function downloads the metadata page and parses it with Beautiful Soup, returning a document that can be searched in the following steps.

In [149]:
def get_xml_servey_doc(survey_xml):
    
    # Download the xml survey page
    response = requests.get(survey_xml) 
    
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(survey_xml))
    
    # parse using Beautiful soup
    survey_xml_doc = BeautifulSoup(response.text, 'html.parser')
    
    return survey_xml_doc

### 6.3 Survey name

This function will get the survey name.

In [150]:
def get_survey_name(survey_url_doc):
    
    # Get the survey name using the h1 tag
    survey_names_tags = survey_url_doc.find_all('h1')
    # indexing the 1st position to get the name
    surveyname = survey_names_tags[0].text
    
    return surveyname

### 6.4 Survey information

This function will get the survey info such as operator, dates, datatype, and datum.

In [151]:
def get_survey_info(survey_url_doc):
    
    # Get the operator name, dates, datum, using the div tag
    info_tags = survey_url_doc.find_all('div', {'class':'col-xs-12 col-sm-9'})
    # indexing the 3rd position to get the operator
    operator = info_tags[2].text.strip()
    # indexing the 4th position and cleaning to get the dates 
    dates = info_tags[3].text.strip()[-5:-1]
    # indexing the 1st position to get the data type
    datatype = info_tags[0].text.strip()
    # indexing the last position to get the datum
    datum = info_tags[-1].text.strip()
    
    return operator, dates, datatype, datum

### 6.5 Survey Latitude & Longitude

This function will get the survey latitude and longitude.

In [152]:
def get_lat_lnog(survey_xml_doc):
    
    # Get the lat and long from the xml page
    northlat_names_tags = survey_xml_doc.find_all('northlat')
    nlat = northlat_names_tags[0].text
    southlat_names_tags = survey_xml_doc.find_all('southlat')
    slat = southlat_names_tags[0].text
    eastlong_names_tags = survey_xml_doc.find_all('eastlong')
    elong = eastlong_names_tags[0].text
    westlong_names_tags = survey_xml_doc.find_all('westlong')
    wlong = westlong_names_tags[0].text
    lat_long = nlat + ', ' + slat + ', ' + elong + ', ' + wlong
    
    return lat_long

### 6.6 Size of the zip files

This function will get the size of the survey seismic and navigation zip files.

In [153]:
def get_size_zip(survey_url_doc):
    
    # Get the size of the seismic segy and navigation data
    seis_files_size_tags = survey_url_doc.find_all('span')
    size_sf = seis_files_size_tags[0].text
    size_nf = seis_files_size_tags[1].text
    size_seisfile = size_sf.replace('\xa0', ' ')
    size_navfile = size_nf.replace('\xa0', ' ')
    
    return size_seisfile, size_navfile

### 6.7 Zip files

This function will get the seismic and navigation zip files.

In [154]:
def get_zip_files(survey_url_doc):
        
    # Get the seismic and navigation zip files
    seis_zip_files_tags = survey_url_doc.find_all('a')
    seiszip = seis_zip_files_tags[10]['href']
    
    # Get the metadata navigation zip file
    base_url_zip = 'https://walrus.wr.usgs.gov'
    nav_zip_files = seis_zip_files_tags[12]
    navzip = base_url_zip + nav_zip_files['href']
    
    return seiszip, navzip    

The helper functions in Section 6 cover all the steps of the project.

## 7. Final Functions

Here is the complete code of the project, split into several functions that collect the information from all 600 surveys.

These functions take the survey page URLs and the metadata XML pages and collect all the information needed for the project. The final output is a list of dictionaries holding the seismic survey information and zip file links, which is saved as a CSV file.

### 7.1 Surveys URL pages as documents

The first function takes a URL, checks the `response`, and returns the parsed page for further work.

In [155]:
def create_survey_url_docs(survey_url):
    
        # Download the url survey page
        response = requests.get(survey_url) 

        # Check successful response
        if response.status_code != 200:
            raise Exception('Failed to load page {}'.format(survey_url))

        # Parse using Beautiful soup
        survey_url_doc = BeautifulSoup(response.text, 'html.parser')
        
        return survey_url_doc

The following function will take a list of the 600 surveys and iterate over them using the `for` loop to create a document for each URL survey.

In [156]:
def get_surveys_url_docs(survey_url_list):
    surveys_url_docs = []
    for i in range(len(survey_url_list)):
        surveys_url_docs.append(create_survey_url_docs(survey_url_list[i]))
    return surveys_url_docs
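The loop above stops at the first page that fails to load and sends requests back to back. As a hedged sketch of one alternative, the helper below pauses between requests, retries a failed page a few times, and records the URLs that still fail instead of raising. The name `fetch_all` and its parameters are hypothetical; `fetch` stands for any function that takes a URL, such as `create_survey_url_docs`, and `fake_fetch` is only a stand-in so the example runs without network access.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0, max_attempts=3):
    """Call fetch(url) for each URL with pauses and retries.

    Returns (results, failed): a dict of url -> fetched value and a list of
    URLs for which every attempt raised an exception.
    """
    results, failed = {}, []
    for url in urls:
        for _ in range(max_attempts):
            try:
                results[url] = fetch(url)
                break
            except Exception:
                time.sleep(delay_seconds)  # back off before the next attempt
        else:
            failed.append(url)  # the loop never reached break
        time.sleep(delay_seconds)  # pause between surveys
    return results, failed

def fake_fetch(url):
    """Stand-in for a real download: fails for one URL, succeeds otherwise."""
    if url.endswith('missing'):
        raise RuntimeError('Failed to load page {}'.format(url))
    return 'document for ' + url

docs, failed_urls = fetch_all(['survey/a', 'survey/missing'], fake_fetch,
                              delay_seconds=0, max_attempts=2)
print(sorted(docs), failed_urls)
```

Returning the failed URLs lets a long run finish and be repaired afterwards, rather than restarting all 200 requests of a batch.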

### 7.2 Surveys XML page as documents

A similar function takes the XML page URL, checks the `response`, and returns the parsed page for further work.

In [157]:
def create_survey_xml_docs(survey_xml):
            
        # Download the xml survey page
        response_1 = requests.get(survey_xml) 

        # Check successful response
        if response_1.status_code != 200:
            raise Exception('Failed to load page {}'.format(survey_xml))

        # parse using Beautiful soup
        survey_xml_doc = BeautifulSoup(response_1.text, 'html.parser')
        
        return survey_xml_doc

This function will take a list of the 600 surveys and iterate over them using the `for` loop to create a document for each XML survey.

In [158]:
def get_surveys_xml_docs(survey_xml_list):
    surveys_xml_docs = []
    for i in range(len(survey_xml_list)):
        surveys_xml_docs.append(create_survey_xml_docs(survey_xml_list[i]))
    return surveys_xml_docs

### 7.3 Surveys information

Now let us use the URL document to collect the needed survey information.

In [159]:
def create_survey_infos(survey_url_doc):  
    
        # Get the operator name, dates, datum, 
        survey_names_tags = survey_url_doc.find_all('h1')
        surveyname = survey_names_tags[0].text
        info_tags = survey_url_doc.find_all('div', {'class':'col-xs-12 col-sm-9'})
        operator = info_tags[2].text.strip()
        dates = info_tags[3].text.strip()[-5:-1]
        datatype = info_tags[0].text.strip()
        datum = info_tags[-1].text.strip()
        
        survey_infos_ls = [surveyname, operator, dates, datatype, datum]
        
        return survey_infos_ls

A similar function takes the list of survey documents and iterates over it with a `for` loop to collect the information for each survey page.

In [160]:
def get_survey_infos(survey_url_doc_list):
    survey_infos = []
    for i in range(len(survey_url_doc_list)):
        survey_infos.append(create_survey_infos(survey_url_doc_list[i]))
        
    return survey_infos

### 7.4 Surveys coordinates

Now let us collect the coordinates (latitude and longitude) for the surveys.

In [161]:
def create_lat_long(survey_xml_doc):
    
        # Get the lat and long from the xml page
        northlat_names_tags = survey_xml_doc.find_all('northlat')
        nlat = northlat_names_tags[0].text
        southlat_names_tags = survey_xml_doc.find_all('southlat')
        slat = southlat_names_tags[0].text
        eastlong_names_tags = survey_xml_doc.find_all('eastlong')
        elong = eastlong_names_tags[0].text
        westlong_names_tags = survey_xml_doc.find_all('westlong')
        wlong = westlong_names_tags[0].text
        latlong_ls = [nlat, slat, elong, wlong]
        #latlong_dict = {'North lat' : nlat, 'South lat' : slat, 'East long' : elong, 'West long' : wlong}
        
        return latlong_ls

A similar function takes the list of XML documents and iterates over it with a `for` loop to collect the latitude and longitude for each survey.

In [162]:
def get_lat_long(survey_xml_doc_list):
    lat_long_ls = []
    for i in range(len(survey_xml_doc_list)):
        lat_long_ls.append(create_lat_long(survey_xml_doc_list[i]))
    return lat_long_ls

### 7.5 Surveys files size

Here let us collect the sizes of the seismic and navigation files.

In [163]:
def create_size_files(survey_url_doc):
        # Get the size of the seismic segy and navigation data
        seis_files_size_tags = survey_url_doc.find_all('span')
        size_sf = seis_files_size_tags[0].text
        size_nf = seis_files_size_tags[1].text
        size_seisfile = size_sf.replace('\xa0', ' ')
        size_navfile = size_nf.replace('\xa0', ' ')
        #size_files_dict = {'SEGY size' : size_seisfile, 'Navigation size' : size_navfile}
        size_files_ls = [size_seisfile, size_navfile]
        return size_files_ls

This function takes the list of survey documents and iterates over it with a `for` loop to collect the zip file sizes for each survey.

In [164]:
def get_size_files(survey_url_doc_list):
    size_files_ls = []
    for i in range(len(survey_url_doc_list)):
        size_files_ls.append(create_size_files(survey_url_doc_list[i]))
    return size_files_ls

### 7.6 Surveys zip files

The final step is to collect the zip file links which will directly download the seismic and navigation zip files.

In [165]:
def create_zip_files(survey_url_doc):
    
        # Get the seismic and navigation zip files
        seis_zip_files_tags = survey_url_doc.find_all('a')
        seiszip = seis_zip_files_tags[10]['href']

        # Get the metadata navigation zip file
        base_url_zip = 'https://walrus.wr.usgs.gov'
        nav_zip_files = seis_zip_files_tags[12]
        navzip = base_url_zip + nav_zip_files['href']
        
        #zip_files_dict = {'SEGY zip': seiszip, 'Navigation zip' : navzip, 'url' : survey_url}
        zip_files_ls = [seiszip, navzip]
        return zip_files_ls

A similar function takes the list of survey documents and iterates over it with a `for` loop to collect the seismic and navigation zip file links for each survey.

In [166]:
def get_zip_files(survey_url_doc_list):
    zip_files_ls = []
    for i in range(len(survey_url_doc_list)):
        zip_files_ls.append(create_zip_files(survey_url_doc_list[i]))
    return zip_files_ls

### 7.7 Surveys final data List

Now let us combine all the collected information into one list, using `append` to add one row per survey.

In [167]:
def final_data_list(survey_infos, survey_latlong, survey_sizefiles, survey_zip):

    data_infos_ls = []

    for i in range(len(survey_infos)):
        # Concatenate the four per-survey lists into one flat row of 13 values
        data_infos_row = survey_infos[i] + survey_latlong[i] + survey_sizefiles[i] + survey_zip[i]
        data_infos_ls.append(data_infos_row)

    return data_infos_ls
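The index-based loop assumes that all four input lists have the same length and would raise an `IndexError` part-way through if one were shorter. As a sketch of an alternative, `zip` pairs up the lists and an explicit length check reports the problem before any row is built. The function name `merge_rows` is hypothetical.

```python
from itertools import chain

def merge_rows(*per_survey_lists):
    """Concatenate the i-th entry of every input into one flat row per survey."""
    lengths = {len(values) for values in per_survey_lists}
    if len(lengths) > 1:
        raise ValueError('inputs describe different numbers of surveys')
    return [list(chain.from_iterable(parts)) for parts in zip(*per_survey_lists)]

# One survey, using values shown in the results of this notebook
infos = [['B-00-79-LA', 'Bureau of Ocean Energy Management', '1979',
          '2D Multichannel Seismic', 'North American Datum 1927 (NAD27)']]
latlong = [['28.00779', '27.94530', '-92.09660', '-92.18261']]
sizes = [['(7.5 MB)', '(13.1 KB)']]

rows = merge_rows(infos, latlong, sizes)
print(len(rows), len(rows[0]))
```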

### 7.8 Surveys final data Dictionary 

Now let us create a dictionary for each survey to store the collected data, which makes it easy to convert the result to a data frame.

In [168]:
def survey_data_dict(data_infos_ls):
    survey_info_flist = []

    for i in range(len(data_infos_ls)):

            survey_info_dict ={
                'Survey name' : data_infos_ls[i][0],
                'Operator' : data_infos_ls[i][1],
                'Dates' : data_infos_ls[i][2],
                'Data type' : data_infos_ls[i][3], 
                'Datum' : data_infos_ls[i][4],
                'North lat' : data_infos_ls[i][5],
                'South lat' : data_infos_ls[i][6],
                'East long' : data_infos_ls[i][7],
                'West long' : data_infos_ls[i][8],
                'SEGY size' : data_infos_ls[i][9],
                'Navigation size' : data_infos_ls[i][10],
                'SEGY zip': data_infos_ls[i][11],
                'Navigation zip' : data_infos_ls[i][12],
                            }
            survey_info_flist.append(survey_info_dict)
            
    return survey_info_flist
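The dictionary above maps each key to a hard-coded position in the row. A more compact sketch keeps the column names in one list and pairs them with the row values using `dict(zip(...))`, so adding a column needs a change in only one place. The names `FIELDS` and `rows_to_dicts` are hypothetical; the column names are the ones used above.

```python
FIELDS = [
    'Survey name', 'Operator', 'Dates', 'Data type', 'Datum',
    'North lat', 'South lat', 'East long', 'West long',
    'SEGY size', 'Navigation size', 'SEGY zip', 'Navigation zip',
]

def rows_to_dicts(rows, fields=FIELDS):
    """Turn each flat row into a dictionary keyed by the column names."""
    records = []
    for row in rows:
        if len(row) != len(fields):
            raise ValueError('expected {} values, got {}'.format(len(fields), len(row)))
        records.append(dict(zip(fields, row)))
    return records

row = ['B-00-79-LA', 'Bureau of Ocean Energy Management', '1979',
       '2D Multichannel Seismic', 'North American Datum 1927 (NAD27)',
       '28.00779', '27.94530', '-92.09660', '-92.18261',
       '(7.5 MB)', '(13.1 KB)',
       'https://walrus.wr.usgs.gov/namss/data/1979/namss.B-00-79-LA.mcs.vaporchoc.zip',
       'https://walrus.wr.usgs.gov/namss/media/navigation/2015/08/31/b-00-79-la.segp1']

record = rows_to_dicts([row])[0]
print(record['Survey name'], record['Datum'])
```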

### 7.9 Surveys Data CSV

We have created a CSV function that takes the list of dictionaries returned by `survey_data_dict` and writes the extracted information to a file.

In [169]:
# data in csv

import csv

def write_csv(items, path):
    
    # newline='' stops the csv module from writing blank rows on Windows
    with open(path, 'w', encoding="utf-8", newline='') as csvfile:
        writer = csv.writer(csvfile)

        # Write the headers in the first line
        headers = list(items[0].keys())
        writer.writerow(headers)
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            writer.writerow(values)
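Since every item is already a dictionary, the standard library's `csv.DictWriter` can write the header and the rows without the inner loop over headers. The sketch below writes to any open text handle, so it can be tried with an in-memory buffer; the function name `write_records` is hypothetical.

```python
import csv
import io

def write_records(records, handle):
    """Write a list of dictionaries to an open text handle as CSV."""
    fieldnames = list(records[0].keys())
    # restval fills in an empty string when a record lacks one of the keys
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(records)

buffer = io.StringIO()
write_records(
    [{'Survey name': 'B-00-79-LA', 'Dates': '1979'},
     {'Survey name': 'B-22-75-LA'}],
    buffer,
)
print(buffer.getvalue())
```

To write to disk, the handle would come from `open(path, 'w', encoding='utf-8', newline='')`.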

### 7.10 Section summary

Section 7 defines the functions that collect the information for all 600 surveys. Nine of them are called directly in the next section; each list-level function calls its per-survey counterpart.

Together, these functions take the survey URL list and the XML list, iterate over them, pass each page to the extraction functions, and append the results. The outcome is a list of dictionaries with the required information for all 600 surveys.

## 8. Results

In this section, we present the project results: the seismic and navigation information and the zip file links for the 600 surveys.

We split the survey URL and XML lists into three batches of about 200 surveys each and pass each batch through the functions from Section 7, to avoid being blocked when sending many requests.

### 8.1 Generate the Seismic Surveys data 

We create the URL and XML lists for surveys 0 to 200 by slicing the main lists `surveys_data_link` and `survey_data_xml`.

In [170]:
survey_url_0_200 = surveys_data_link[0:200]
survey_xml_0_200 = survey_data_xml[0:200]
len(survey_url_0_200)

200
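The batches in this section are sliced by hand. As a small sketch, a generator can produce the same kind of batches for a list of any length. The name `batches` is hypothetical, and `demo_links` is a stand-in list of 601 entries, the length implied by the slices used in this section.

```python
def batches(items, batch_size=200):
    """Yield consecutive slices of items, each with at most batch_size entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in for the real surveys_data_link list
demo_links = ['survey-{}'.format(i) for i in range(601)]

print([len(batch) for batch in batches(demo_links)])
# → [200, 200, 200, 1]
```

Unlike the manual slices, this puts the final survey in a batch of its own; a larger `batch_size` avoids that if it matters.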

We then pass the newly created lists into the nine functions to collect all the required information.

#### Generated data for surveys 0 - 200:

Getting the URL document. Notice that we have recorded the time needed to get the required information.

In [171]:
%%time
survey_data_url_doc_200 = get_surveys_url_docs(survey_url_0_200)

Wall time: 4min 21s


Getting the XML document.

In [172]:
%%time
survey_data_xml_doc_200 = get_surveys_xml_docs(survey_xml_0_200)

Wall time: 2min 53s


Getting the survey information.

In [173]:
survey_data_infos_200 = get_survey_infos(survey_data_url_doc_200)

Getting the coordinates.

In [174]:
survey_data_latlong_200 = get_lat_long(survey_data_xml_doc_200)

Getting the file sizes.

In [175]:
survey_data_sizefiles_200 = get_size_files(survey_data_url_doc_200)

Getting the zip files.

In [176]:
survey_data_zipfiles_200 = get_zip_files(survey_data_url_doc_200)

Generate data list.

In [177]:
survey_data_flist_200 = final_data_list(
    survey_data_infos_200,
    survey_data_latlong_200,
    survey_data_sizefiles_200,
    survey_data_zipfiles_200
)

Generate a dictionary.

In [178]:
survey_final_dict_200 = survey_data_dict(survey_data_flist_200)

In [179]:
survey_final_dict_200[3]

{'Survey name': 'B-00-79-LA',
 'Operator': 'Bureau of Ocean Energy Management',
 'Dates': '1979',
 'Data type': '2D Multichannel Seismic',
 'Datum': 'North American Datum 1927 (NAD27)',
 'North lat': '28.00779',
 'South lat': '27.94530',
 'East long': '-92.09660',
 'West long': '-92.18261',
 'SEGY size': '(7.5 MB)',
 'Navigation size': '(13.1 KB)',
 'SEGY zip': 'https://walrus.wr.usgs.gov/namss/data/1979/namss.B-00-79-LA.mcs.vaporchoc.zip',
 'Navigation zip': 'https://walrus.wr.usgs.gov/namss/media/navigation/2015/08/31/b-00-79-la.segp1'}

#### Generated data for surveys 200 - 400:

We have created a list for the URL and XML surveys from 200 to 400 by indexing the main list `surveys_data_link`.

In [180]:
survey_url_200_400 = surveys_data_link[200:400]
survey_xml_200_400 = survey_data_xml[200:400]
len(survey_url_200_400)

200

In [181]:
%%time
survey_data_url_doc_200_400 = get_surveys_url_docs(survey_url_200_400)

Wall time: 4min 10s


In [182]:
%%time
survey_data_xml_doc_200_400 = get_surveys_xml_docs(survey_xml_200_400)

Wall time: 2min 51s


In [183]:
survey_data_infos_200_400 = get_survey_infos(survey_data_url_doc_200_400)

In [184]:
survey_data_latlong_200_400 = get_lat_long(survey_data_xml_doc_200_400)

In [185]:
survey_data_sizefiles_200_400 = get_size_files(survey_data_url_doc_200_400)

In [186]:
survey_data_zipfiles_200_400 = get_zip_files(survey_data_url_doc_200_400)

In [187]:
survey_data_flist_200_400 = final_data_list(
    survey_data_infos_200_400,
    survey_data_latlong_200_400,
    survey_data_sizefiles_200_400,
    survey_data_zipfiles_200_400
)

In [188]:
survey_final_dict_200_400 = survey_data_dict(survey_data_flist_200_400)

In [189]:
survey_final_dict_200_400[3]

{'Survey name': 'B-22-75-LA',
 'Operator': 'Bureau of Ocean Energy Management',
 'Dates': '1975',
 'Data type': '2D Multichannel Seismic',
 'Datum': 'North American Datum 1927 (NAD27)',
 'North lat': '28.77294',
 'South lat': '28.65648',
 'East long': '-91.01254',
 'West long': '-91.15930',
 'SEGY size': '(41.4 MB)',
 'Navigation size': '(8.1 KB)',
 'SEGY zip': 'https://walrus.wr.usgs.gov/namss/data/1975/namss.B-22-75-LA.mcs.aquapulse.zip',
 'Navigation zip': 'https://walrus.wr.usgs.gov/namss/media/navigation/2015/09/01/b-22-75-la.segp1'}

#### Generated data for surveys 400 - 600:

We have created a list for the remaining URL and XML surveys by slicing the main lists from 400 to 602. The slice end lies past the end of the list, so it returns the remaining 201 surveys.

In [190]:
survey_url_400_600 = surveys_data_link[400:602]
survey_xml_400_600 = survey_data_xml[400:602]
len(survey_url_400_600)

201

In [191]:
%%time
survey_data_url_doc_400_600 = get_surveys_url_docs(survey_url_400_600)

Wall time: 4min 14s


In [192]:
%%time
survey_data_xml_doc_400_600 = get_surveys_xml_docs(survey_xml_400_600)

Wall time: 2min 54s


In [193]:
survey_data_infos_400_600 = get_survey_infos(survey_data_url_doc_400_600)

In [194]:
survey_data_latlong_400_600 = get_lat_long(survey_data_xml_doc_400_600)

In [195]:
survey_data_sizefiles_400_600 = get_size_files(survey_data_url_doc_400_600)

In [196]:
survey_data_zipfiles_400_600 = get_zip_files(survey_data_url_doc_400_600)

In [197]:
survey_data_flist_400_600 = final_data_list(
    survey_data_infos_400_600,
    survey_data_latlong_400_600,
    survey_data_sizefiles_400_600,
    survey_data_zipfiles_400_600
)

In [198]:
survey_final_dict_400_600 = survey_data_dict(survey_data_flist_400_600)

In [199]:
survey_final_dict_400_600[3]

{'Survey name': 'B-58-91-LA',
 'Operator': 'Bureau of Ocean Energy Management',
 'Dates': '1991',
 'Data type': '3D Multichannel Seismic',
 'Datum': 'North American Datum 1927 (NAD27)',
 'North lat': '27.86193',
 'South lat': '27.56530',
 'East long': '-92.80660',
 'West long': '-93.24867',
 'SEGY size': '(27.0 GB)',
 'Navigation size': '(2.6 KB)',
 'SEGY zip': 'https://walrus.wr.usgs.gov/namss/data/1991/namss.B-58-91-LA.mcs3d.airgun.zip',
 'Navigation zip': 'https://walrus.wr.usgs.gov/namss/media/navigation/2016/11/15/103155531173/B-58-91-LA.zip'}

### 8.2 Survey data CSV files

We create CSV files with the extracted survey information to store the data.

In [200]:
write_csv(survey_final_dict_200, 'surveydata_0_200.csv')

In [201]:
write_csv(survey_final_dict_200_400, 'surveydata_200_400.csv')

In [202]:
write_csv(survey_final_dict_400_600, 'surveydata_400_600.csv')

### 8.3 Seismic Surveys Data frames

After we have created and saved the CSV files, we can access and present them using the `Pandas` library.

In [203]:
# let's just view three rows
df_200 = pd.read_csv('surveydata_0_200.csv')
df_200[3:6]

| | Survey name | Operator | Dates | Data type | Datum | North lat | South lat | East long | West long | SEGY size | Navigation size | SEGY zip | Navigation zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | B-00-79-LA | Bureau of Ocean Energy Management | 1979 | 2D Multichannel Seismic | North American Datum 1927 (NAD27) | 28.00779 | 27.9453 | -92.0966 | -92.18261 | (7.5 MB) | (13.1 KB) | https://walrus.wr.usgs.gov/namss/data/1979/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |
| 4 | B-00-95-LA | Bureau of Ocean Energy Management | 1995 | 3D Multichannel Seismic | North American Datum 1927 (NAD27) | 28.32871 | 28.19741 | -90.11074 | -90.29505 | (248.5 MB) | (1.4 KB) | https://walrus.wr.usgs.gov/namss/data/1995/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |
| 5 | B-01-75-AT | Bureau of Ocean Energy Management | 1975 | 2D Multichannel Seismic | North American Datum 1983 (NAD83) | 40.5882 | 37.85201 | -70.39941 | -74.59488 | (2.4 GB) | (564.6 KB) | https://walrus.wr.usgs.gov/namss/data/1975/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |


In [204]:
df_200_400 = pd.read_csv('surveydata_200_400.csv')
df_200_400[0:3]

| | Survey name | Operator | Dates | Data type | Datum | North lat | South lat | East long | West long | SEGY size | Navigation size | SEGY zip | Navigation zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B-21-83-LA | Bureau of Ocean Energy Management | 1983 | 2D Multichannel Seismic | North American Datum 1927 (NAD27) | 29.17737 | 26.09115 | -87.55914 | -93.41922 | (3.6 GB) | (2.9 MB) | https://walrus.wr.usgs.gov/namss/data/1983/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |
| 1 | B-21-88-LA | Bureau of Ocean Energy Management | 1988 | 3D Multichannel Seismic | North American Datum 1927 (NAD27) | 28.21156 | 28.07931 | -90.07547 | -90.22775 | (1.2 GB) | (1.4 KB) | https://walrus.wr.usgs.gov/namss/data/1988/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |
| 2 | B-22-75-AT | Bureau of Ocean Energy Management | 1975 | 2D Multichannel Seismic | World Geodetic System 1984 (WGS84) | 39.47816 | 37.27581 | -72.48754 | -74.97537 | (1.0 GB) | (182.0 KB) | https://walrus.wr.usgs.gov/namss/data/1975/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |


In [205]:
df_400_600 = pd.read_csv('surveydata_400_600.csv')
df_400_600[0:3]

| | Survey name | Operator | Dates | Data type | Datum | North lat | South lat | East long | West long | SEGY size | Navigation size | SEGY zip | Navigation zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B-58-79-LA | Bureau of Ocean Energy Management | 1979 | 2D Multichannel Seismic | North American Datum 1927 (NAD27) | 29.16374 | 27.84176 | -89.76667 | -93.4616 | (61.9 MB) | (168.1 KB) | https://walrus.wr.usgs.gov/namss/data/1979/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |
| 1 | B-58-83-LA | Bureau of Ocean Energy Management | 1983 | 2D Multichannel Seismic | North American Datum 1927 (NAD27) | 29.11513 | 28.11947 | -91.81355 | -93.2832 | (44.0 MB) | (318.9 KB) | https://walrus.wr.usgs.gov/namss/data/1983/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |
| 2 | B-58-84-LA | Bureau of Ocean Energy Management | 1984 | 2D Multichannel Seismic | North American Datum 1927 (NAD27) | 28.50671 | 28.42055 | -92.11479 | -92.26291 | (11.7 MB) | (7.3 KB) | https://walrus.wr.usgs.gov/namss/data/1984/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |


### 8.4 Merge Dataframes in one CSV file

We also combine the three data frames into one covering all 600 surveys, which can be used for further analysis.

In [206]:
df_all_600 = pd.concat([df_200, df_200_400, df_400_600])
df_all_600[5:10]

| | Survey name | Operator | Dates | Data type | Datum | North lat | South lat | East long | West long | SEGY size | Navigation size | SEGY zip | Navigation zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | B-01-75-AT | Bureau of Ocean Energy Management | 1975 | 2D Multichannel Seismic | North American Datum 1983 (NAD83) | 40.5882 | 37.85201 | -70.39941 | -74.59488 | (2.4 GB) | (564.6 KB) | https://walrus.wr.usgs.gov/namss/data/1975/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |
| 6 | B-01-77-LA | Bureau of Ocean Energy Management | 1977 | 2D Multichannel Seismic | North American Datum 1927 (NAD27) | 28.95265 | 28.57102 | -89.17095 | -89.37113 | (15.4 MB) | (11.3 KB) | https://walrus.wr.usgs.gov/namss/data/1977/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |
| 7 | B-01-78-AT | Bureau of Ocean Energy Management | 1978 | 2D Multichannel Seismic | North American Datum 1983 (NAD83) | 39.8712 | 39.59591 | -72.03049 | -72.47954 | (16.9 MB) | (9.2 KB) | https://walrus.wr.usgs.gov/namss/data/1978/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |
| 8 | B-01-80-AT | Bureau of Ocean Energy Management | 1980 | 2D Multichannel Seismic | North American Datum 1983 (NAD83) | 36.35034 | 30.44691 | -74.22113 | -80.29978 | (2.4 GB) | (542.7 KB) | https://walrus.wr.usgs.gov/namss/data/1980/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |
| 9 | B-01-81-AT | Bureau of Ocean Energy Management | 1981 | 2D Multichannel Seismic | North American Datum 1983 (NAD83) | 41.01726 | 30.18925 | -66.81061 | -79.63947 | (4.4 GB) | (970.9 KB) | https://walrus.wr.usgs.gov/namss/data/1981/nam... | https://walrus.wr.usgs.gov/namss/media/navigat... |


Here we save the final survey data file for all 600 surveys, which is the main result of the project.

In [207]:
df_all_600.to_csv('surveydata_all_600.csv', index=False)
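
The preview above still carries an `Unnamed: 0` column and repeated row labels, which typically appear when partial CSV files saved with their index are read back and concatenated. The following is a minimal sketch of one way to tidy this up before saving, not the code used in the notebook. The toy frames `part_a`, `part_b`, `part_c` and the helper `merge_parts` are hypothetical names with placeholder values, standing in for `df_200`, `df_200_400` and `df_400_600`.

```python
import pandas as pd

# Toy stand-ins for the three partial tables. The values are illustrative
# placeholders, not real NAMSS records.
part_a = pd.DataFrame({"Unnamed: 0": [0, 1], "Survey name": ["S-1", "S-2"], "Dates": [1975, 1977]})
part_b = pd.DataFrame({"Unnamed: 0": [0, 1], "Survey name": ["S-3", "S-4"], "Dates": [1978, 1980]})
part_c = pd.DataFrame({"Unnamed: 0": [0], "Survey name": ["S-5"], "Dates": [1981]})


def merge_parts(parts):
    """Stack partial survey tables into one frame with a clean 0..n-1 index."""
    # ignore_index=True renumbers the rows instead of repeating each part's labels.
    merged = pd.concat(parts, ignore_index=True)
    # Drop the leftover index column if it exists; do nothing otherwise.
    return merged.drop(columns=["Unnamed: 0"], errors="ignore")


merged = merge_parts([part_a, part_b, part_c])

# Basic sanity checks before writing the file: row count and unique survey names.
assert len(merged) == 5
assert merged["Survey name"].is_unique
```

Checking the row count and the uniqueness of the survey names before calling `to_csv` is a cheap way to catch a part that was loaded twice or skipped.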

## 9. Sharing results as CSV files

The resulting CSV files will be available alongside the notebook, so that students, researchers and industry geoscientists can use them to find and interpret 2D and 3D seismic data.

If you interpret seismic data and want to share your results with the broader geoscientist community, please use the [Virtual Seismic Atlas](https://www.seismicatlas.org/) (VSA) web site by [Professor Robert Butler](https://www.abdn.ac.uk/people/rob.butler/).

## 10. Project Summary

In this project, we have presented a novel, code-driven Python approach that scrapes the USGS website to collect information about 2D and 3D subsurface seismic reflection datasets. The survey data on the USGS website were retrieved, parsed and tabulated using two well-known Python libraries: `Requests` and `BeautifulSoup`.


Applying the web scraping workflow to the NAMSS seismic survey dataset showed that it is effective at collecting, simplifying and visualising the most critical data, and it highlighted discrepancies between serial surveys with 2D and 3D subsurface datasets.


The workflow has been automated using Python functions, which collect the survey information and categorise the zip files of seismic data for future research and interpretation.

The main benefits of the project include:



1. Web scraping for data collection can be done before any data analysis or interpretation, allowing the interpreter to understand the limitations and uncertainty in the data, or to expand the dataset they are interpreting.


2. Tabulated data in data frame format allow numerical analysis of the survey information, which makes it possible to assess interpretation efficacy and to highlight anomalies between serial survey datasets and their seismic or navigation data files.


3. Web scraping fields can be user-defined and used to develop data acquisition strategies that improve data sharing and reduce data inaccessibility.



4. The automated web scraping approach provides user-controlled, quick and easy collection of subsurface data and associated geological models. It could be developed further to refine the data collection and to support broader automated analysis such as machine learning.
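
As an illustration of the numerical analysis mentioned in point 2, the size columns are stored as text such as `(11.7 MB)` and need converting before they can be sorted or summed. The sketch below shows one possible conversion using only the standard library. The helper name `size_to_bytes` is hypothetical, and the use of decimal units (1 KB = 1000 bytes) is an assumption, since the scraped strings do not say which convention applies.

```python
import re

# Assumed decimal multipliers (1 KB = 1000 bytes); the source strings do not
# state whether decimal or binary units are meant.
_UNITS = {"B": 1, "KB": 1e3, "MB": 1e6, "GB": 1e9, "TB": 1e12}

# Matches strings like "(11.7 MB)" or "564.6 KB", with optional parentheses.
_SIZE_RE = re.compile(r"\(?\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*\)?", re.IGNORECASE)


def size_to_bytes(text):
    """Convert a size string such as '(11.7 MB)' to bytes, or None if unparseable."""
    match = _SIZE_RE.fullmatch(text.strip())
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * _UNITS[unit.upper()]


# Example: rank a few size strings taken from the table preview, largest first.
sizes = ["(2.4 GB)", "(564.6 KB)", "(15.4 MB)"]
ranked = sorted(sizes, key=size_to_bytes, reverse=True)
```

With a data frame, the same helper could be applied to a whole column, for example `df_all_600["SEGY size"].map(size_to_bytes)`, to flag surveys whose SEGY archive is unusually small or large relative to its navigation file.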



Our automated web scraping project on Jovian is open source and freely available. 

All the code, functions, results and project workflow can be effectively applied to other websites and subsurface geological datasets to provide insight into data availability from a data-accessibility viewpoint. 

The project workflow could also be integrated as a back end for other data analysis methods, e.g. machine learning, multiple interpretation analysis and stochastic methods, which could further help to address the problem of data limitation.

![ramysaleem](https://i.imgur.com/D6HESuc.gif)

## 11. Future Work

This project has successfully collected and explored 600 seismic reflection surveys from the USGS website, covering the coast of the USA. However, many more open-source geoscience datasets still need to be explored, collected and made more accessible to researchers, such as the well and seismic data held by the geological surveys of the Netherlands, the UK, Germany and Norway.


Here are a few websites that could be **potential science web scraping projects** to make open-source science data more accessible.


1. Well data on the Bureau of Safety and Environmental Enforcement (BSEE) data center (https://www.data.bsee.gov/).



2. Geosciences data on the Geological Survey of the Netherlands (https://www.nlog.nl/datacenter/).



3. Geosciences data on the British Geological Survey (https://www.bgs.ac.uk/geological-data/opengeoscience/).


4. Geosciences data on the Geological Survey of Germany (https://www.bgr.bund.de/EN/Home/homepage_node_en.html).



5. Geosciences data on the Geological Survey of Norway (https://www.ngu.no/en/topic/datasets).

## 12. Acknowledgement & References

I would like to express my special gratitude to my supervisors [Dr Clare Bond](https://www.abdn.ac.uk/people/clare.bond/) and [Professor Rob Butler](https://www.abdn.ac.uk/people/rob.butler), along with [Mr. Aakash N S](https://aakashns.medium.com/?source=collection_about-------------------------------------) and the [Jovian team](https://blog.jovian.ai/about), who gave me the opportunity to do this project on web scraping. It also helped me carry out research that will be part of my final PhD thesis.

This project was carried out as part of the data collection for my machine learning research, within a PhD at the University of Aberdeen supported by the NERC Centre for Doctoral Training in Oil & Gas.

1. Web scraping notebook by [Aakash N S](https://aakashns.medium.com/?source=collection_about-------------------------------------).



2. Workshop on `Web-Scraping` by Aakash N S, [Let's Build a Python Web Scraping Project from Scratch | Hands-On Tutorial](https://www.youtube.com/watch?v=RKsLLG-bzEY&t=6677s).



3. [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library documentation.



4. [`Scrapy`](https://docs.scrapy.org/en/latest/) library documentation.



5. [USGS science explorer](https://www.usgs.gov/science/science-explorer/Geology). 



6. [USGS seismic reflection data](https://www.usgs.gov/science-explorer-results?es=3D+Seismic+data&classification=data).



7. [The National Archive of Marine Seismic Surveys NAMSS](https://walrus.wr.usgs.gov/namss/).



8. [3D seismic profile animation](https://www.usgs.gov/media/images/3d-seismic-profile-animation).



9. [How to Purchase Property using Web Scraping - Using Python, Beautiful Soup and Pandas by Pritesh Patel](https://blog.jovian.ai/how-to-purchase-property-using-web-scraping-1d7448ef6c2d). 



10. [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).



11. [Geophysical imaging](https://en.wikipedia.org/wiki/Geophysical_imaging).



12. [Coordinate Systems: What's the Difference?](https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/coordinate-systems-difference/#:~:text=A%20datum%20is%20one%20parameter,positioned%20relative%20to%20the%20surface.).

![footer.png](https://i.imgur.com/T1d4vMI.png)