## Project and Data Overview
The ultimate goal of this project is to prevent whale-vessel collisions by establishing shipping corridors and alerting vessels to whale locations near ports. The first step is to build a dataset of coastal ports around the world and their locations. The dataset is created by scraping the [World Port Source](http://www.worldportsource.com/) (WPS) website, which contains the necessary information. The scraped fields include country, port name, port location name, location coordinates, address, code, port type, and port size. This dataset will eventually be used together with a worldwide whale dataset to study where whales and coastal ports share space.

Example of website information to be scraped:

![image.png](attachment:08483e7a-7e27-498f-8c32-8727ec37969b.png)

In [1]:
#import webscraping libraries 
from bs4 import BeautifulSoup #import BeautifulSoup to parse html and extract data 
import requests #open URL, download html and pass it to BeautifulSoup.
import csv #import to write to file
import pandas as pd #import to build and save the dataframe
import matplotlib.pyplot as plt #import for visualizations
import seaborn as sns #import for visualizations
import lxml #parser used by BeautifulSoup(..., 'lxml')

In [2]:
#resources: https://www.topcoder.com/thrive/articles/web-crawler-in-python
            #https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/

In [3]:
#create and access url
url = 'http://www.worldportsource.com/countries.php' #ports -> country list 
response = requests.get(url, headers={'User-Agent': 'Safari'}) #request html from url

In [4]:
#get name and link of all countries with ports - make list
soup = BeautifulSoup(response.content, 'lxml')
#soup

In [5]:
countries = soup.find_all('table')[1].find_all('a')
#countries

In [6]:
#get the link of each country that goes to link of ports per country, and add to country_list
#ex: http://www.worldportsource.com/ports/index/ALB.php
num = 0
country_list = []
for anchor in countries:
    #print(anchor)
    if anchor.has_attr("href"):  #only select the anchor if it has an 'href' attribute. #removes '<a name="A">A</a>'
        #print(anchor)
        if 'php' in anchor['href']: #only select the anchor if 'php' is in its href. #removes '<a href="#K">K</a>'
            #print(anchor['href'])
            relative_url = anchor['href'] #ex: '/ports/index/ALB.php'
            #print(relative_url)
            absolute_url = 'http://www.worldportsource.com' + relative_url
            country_list.append(absolute_url)
            num += 1 #count each url added to country_list

In [7]:
print(num)
#country_list

196
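
The link-building loop above can also be written with `urllib.parse.urljoin`, which resolves a relative `href` against the page it was read from instead of concatenating strings. The sketch below is only an illustration, not the notebook's code: the helper name `country_links` is hypothetical, and it tests for a `.php` suffix rather than the substring `'php'`, which is a slightly stricter filter than the original.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.worldportsource.com/countries.php'  # page the anchors were read from

def country_links(hrefs, base_url=BASE_URL):
    """Return absolute URLs for the hrefs that point to a .php page.

    `hrefs` is an iterable of href values; None stands for an anchor
    without an href attribute (e.g. '<a name="A">A</a>').
    """
    links = []
    for href in hrefs:
        if href and href.endswith('.php'):  # skips None and in-page links such as '#K'
            links.append(urljoin(base_url, href))
    return links

example = country_links([None, '#K', '/ports/index/ALB.php'])
print(example)
```

With the notebook's variables, the input would be `[a.get('href') for a in countries]`.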


In [205]:
##TEST
'''
testlist = ['http://www.worldportsource.com/ports/index/ALB.php',
            'http://www.worldportsource.com/ports/index/DZA.php',
            'http://www.worldportsource.com/ports/index/ASM.php',
            'http://www.worldportsource.com/ports/index/AGO.php']
            
port_list = []
for link in testlist:
    response = requests.get(link, headers={'User-Agent': 'Safari'}) #request html from url
    soup = BeautifulSoup(response.content, 'lxml')
    ports = soup.find_all('table')[1].find_all('a')
    
    for p in ports:
        if p.has_attr("href"):
            if 'review' in p['href']:
                relative_url = p['href'].split('review')[-1]
                absolute_url = 'http://www.worldportsource.com' + '/ports' + relative_url
                port_list.append(absolute_url)
                num += 1
                
            elif 'php' in p['href']:
                relative_url = p['href']
                absolute_url = 'http://www.worldportsource.com' + relative_url
                port_list.append(absolute_url)
                num += 1
print(num)
port_list
'''


In [9]:
#enter each port
##some ports go straight to a 'review' page instead of the contact info that we need. handle in loop
#takes some time to run
#link = 'http://www.worldportsource.com/ports/index/ALB.php'
#http://www.worldportsource.com/ports/review/ARG_Port_of_Buenos_Aires_47.php
#http://www.worldportsource.com/ports/ARG_Port_of_Buenos_Aires_47.php

num = 0
port_list = []

for link in country_list:
    response = requests.get(link, headers={'User-Agent': 'Safari'}) #request html from url
    soup = BeautifulSoup(response.content, 'lxml')
    ports = soup.find_all('table')[1].find_all('a') #find each port in html 
    
    for anchor in ports:
        #print(anchor) # ex: '<a href="#D">D</a>', '<a href="/ports/ALB_Port_of_Durres_2181.php">Port of Durres</a>'
        if anchor.has_attr("href"): #only select the anchor if it has an 'href' attribute.
            if 'review' in anchor['href']: #if anchor contains 'review', 
                relative_url = anchor['href'].split('review')[-1] #then remove 'review' and build correct url
                absolute_url = 'http://www.worldportsource.com' + '/ports' + relative_url
                port_list.append(absolute_url) #add correct url to port_list
                num += 1 #count each url added to port_list
                
            elif 'php' in anchor['href']: #only select the anchor if 'php' is in its href. #removes '<a href="#K">K</a>'
                relative_url = anchor['href'] #ex: '/ports/ALB_Port_of_Vlore_2184.php' 
                #print(relative_url)
                absolute_url = 'http://www.worldportsource.com' + relative_url #build absolute url
                port_list.append(absolute_url) #add to port_list
                num += 1 #count each url added to port_list

In [10]:
print(num)
port_list[:5] #first 5 ports
#port_list

6290


['http://www.worldportsource.com/ports/ALB_Port_of_Durres_2181.php',
 'http://www.worldportsource.com/ports/ALB_Port_of_Sarande_2182.php',
 'http://www.worldportsource.com/ports/ALB_Port_of_Shengjin_2183.php',
 'http://www.worldportsource.com/ports/ALB_Port_of_Vlore_2184.php',
 'http://www.worldportsource.com/ports/DZA_Port_of_Algiers_1419.php']
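
The 'review' handling in the loop above can be isolated in a small function, which makes it easy to check against the two Buenos Aires URLs quoted in the cell comments. This is a sketch under assumptions: the names `to_contact_url` and `unique_in_order` are hypothetical, and whether the real country pages ever list the same port under both forms has not been checked here. The de-duplication step is only a safeguard.

```python
from urllib.parse import urljoin

SITE = 'http://www.worldportsource.com'

def to_contact_url(href, site=SITE):
    """Map a port href to its contact-info URL, dropping a 'review' path segment if present."""
    path = href.replace('/ports/review/', '/ports/', 1)
    return urljoin(site, path)

def unique_in_order(urls):
    """Remove repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))

hrefs = ['/ports/review/ARG_Port_of_Buenos_Aires_47.php',
         '/ports/ARG_Port_of_Buenos_Aires_47.php',
         '/ports/ALB_Port_of_Vlore_2184.php']
port_urls = unique_in_order(to_contact_url(h) for h in hrefs)
print(port_urls)
```

Replacing the exact `/ports/review/` prefix, rather than splitting on the word 'review', avoids mangling a port whose own name happens to contain that word.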

In [12]:
#Extract relevant information

In [341]:
testlist = ['http://www.worldportsource.com/ports/ALB_Port_of_Durres_2181.php', 
'http://www.worldportsource.com/ports/ALB_Port_of_Sarande_2182.php']
#build empty dataframe - name columns
port_df = pd.DataFrame(columns=['Port Location', 'Port Name', 'Port Authority', 'Address',
                                'Phone', 'Fax', '800 Number', 'Email', 'Web Site', 'Latitude', 
                                'Longitude', 'UN/LOCODE', 'Port Type', 'Port Size'])
port_df

Unnamed: 0,Port Location,Port Name,Port Authority,Address,Phone,Fax,800 Number,Email,Web Site,Latitude,Longitude,UN/LOCODE,Port Type,Port Size


In [343]:
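#extract data from the detail table of each port
#only the first 5 ports are processed here as a test; loop over port_list for the full run
#re-running this cell without rebuilding port_df appends the same rows again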

for url in port_list[0:5]:
    response = requests.get(url, headers={'User-Agent': 'Safari'}) #request html from url
    soup = BeautifulSoup(response.content, 'lxml')
    
    table = soup.find_all('table')[1]
    #print(table)
    
    port_dict = {} 
    
    #find headers
    for tr in table.find_all('tr'):
        #print(tr)
        column_data = tr.find('th')
        #print(column_data)
        #print(type(header))
        row_data = tr.find_all('td')
        #print(row_data)
        
        #extract the "key" e.g. "Port Location"
        if column_data is not None: 
            #print(column_data)
            column_data = column_data.string
            column_data = column_data.strip(':')
            #print(column_data)
            row_value = None
            
            if row_data is not None and len(row_data) > 1:
                #print(row_data)
                if column_data == 'Address':
                    #join the text pieces of the address cell, turning each <br> into a newline
                    address = ""
                    for each in row_data[1]:
                        address = address + ((each.string or "") if each.name != "br" else "\n")
                    if address == "":
                        address = None
                    row_value = address
                elif column_data == 'Email':
                    email = row_data[1].get_text()
                    if email == "":
                        email = None
                    row_value = email
                elif column_data == 'Web Site': #the header on the page reads 'Web Site', matching the dataframe column
                    website = row_data[1].get_text()
                    if website == "":
                        website = None
                    row_value = website
                else:
                    row_data = row_data[1].string
                    if row_data is None or row_data == "":
                        row_data = None
                    row_value = row_data
                       
            #print(column_data + ': ' + (row_value if row_value is not None else 'None'))
            
            port_dict[column_data] = row_value
            
    #print(port_dict)    
    port_df = pd.concat([port_df, pd.DataFrame([port_dict])], ignore_index=True)
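
The row-by-row extraction above can be condensed into one function that is testable without a network request. The sketch below is not the notebook's implementation: `SAMPLE_HTML` is hypothetical markup shaped like the `th`/`td` rows the loop expects (including an empty first `td`, since the loop reads `row_data[1]`), `parse_port_table` is a hypothetical name, and `get_text(separator='\n', strip=True)` is swapped in for the per-field branches. It uses `'html.parser'` so that only `bs4` is required; `'lxml'` should behave the same way on markup this simple.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real WPS pages may differ in detail.
SAMPLE_HTML = """
<table><tr><td>layout</td></tr></table>
<table>
  <tr><th>Port Name:</th><td></td><td>Port of Sarande</td></tr>
  <tr><th>Address:</th><td></td><td>Port Office<br/>Sarande Port<br/>Albania</td></tr>
  <tr><th>Email:</th><td></td><td></td></tr>
  <tr><th>Web Site:</th><td></td><td><a href="http://www.example.com">www.example.com</a></td></tr>
</table>
"""

def parse_port_table(html):
    """Return {header: value} for the second table in the page; empty cells become None."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all('table')[1]
    record = {}
    for tr in table.find_all('tr'):
        th = tr.find('th')
        if th is None:          # rows without a header cell carry no field
            continue
        key = th.get_text(strip=True).rstrip(':')
        cells = tr.find_all('td')
        value = None
        if len(cells) > 1:
            # separator='\n' puts a newline between text pieces split by <br>
            text = cells[1].get_text(separator='\n', strip=True)
            value = text or None
        record[key] = value
    return record

record = parse_port_table(SAMPLE_HTML)
print(record)
```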

In [344]:
port_df

Unnamed: 0,Port Location,Port Name,Port Authority,Address,Phone,Fax,800 Number,Email,Web Site,Latitude,Longitude,UN/LOCODE,Port Type,Port Size
0,Durres,Port of Durres,Durres Port Authority,Kapitenerija e Portit\nL Nr 1 Rruga Tregtare\n...,355 52 23115,355 52 22028,,apd@san.com.al,,"41° 18' 28"" N","19° 27' 17"" E",ALDRZ,Seaport,Small
1,Sarande,Port of Sarande,Sarande Port Authority,Port Office\nSarande Port\nSarande\nAlbania,355 85 25827,355 85 25827,,,,"39° 52' 19"" N","20° 0' 20"" E",ALSAR,"Pier, Jetty or Wharf",Small
2,Shengjin,Port of Shengjin,Shengjin Port Authority,Shengjin\nAlbania,,,,,,"41° 48' 41"" N","19° 35' 17"" E",ALSHG,Harbor,Small
3,Vlore,Port of Vlore,Vlore Drejtoria e Portit Detar,Albania,,,,porti-vlore@aul.com.al,,"40° 28' 7"" N","19° 27' 36"" E",ALVOA,Seaport,Small
4,Algiers,Port of Algiers,Entreprise Portuaire d'Alger,"02 Rue d'Angkor\nBP 259\nAlgiers, Gare 259\nAl...",213 21 423614,213 21 423603,,epal@portalger.com.dz,www.portalger.com.dz,"36° 46' 25"" N","3° 4' 2"" E",DZALG,Deepwater Seaport,Large
5,Durres,Port of Durres,Durres Port Authority,Kapitenerija e Portit\nL Nr 1 Rruga Tregtare\n...,355 52 23115,355 52 22028,,apd@san.com.al,,"41° 18' 28"" N","19° 27' 17"" E",ALDRZ,Seaport,Small
6,Sarande,Port of Sarande,Sarande Port Authority,Port Office\nSarande Port\nSarande\nAlbania,355 85 25827,355 85 25827,,,,"39° 52' 19"" N","20° 0' 20"" E",ALSAR,"Pier, Jetty or Wharf",Small
7,Shengjin,Port of Shengjin,Shengjin Port Authority,Shengjin\nAlbania,,,,,,"41° 48' 41"" N","19° 35' 17"" E",ALSHG,Harbor,Small
8,Vlore,Port of Vlore,Vlore Drejtoria e Portit Detar,Albania,,,,porti-vlore@aul.com.al,,"40° 28' 7"" N","19° 27' 36"" E",ALVOA,Seaport,Small
9,Algiers,Port of Algiers,Entreprise Portuaire d'Alger,"02 Rue d'Angkor\nBP 259\nAlgiers, Gare 259\nAl...",213 21 423614,213 21 423603,,epal@portalger.com.dz,www.portalger.com.dz,"36° 46' 25"" N","3° 4' 2"" E",DZALG,Deepwater Seaport,Large
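
Rows 5 to 9 of the output above repeat rows 0 to 4, which is consistent with the extraction cell having been run more than once against the same `port_df`. One way to make the cell safe to re-run is to collect the per-port dictionaries in a list and build the dataframe in a single step. The sketch below assumes a hypothetical helper `build_port_df` and uses a shortened column list with three of the rows shown above.

```python
import pandas as pd

COLUMNS = ['Port Location', 'Port Name', 'UN/LOCODE']  # shortened column list for illustration

def build_port_df(records, columns=COLUMNS):
    """Build the dataframe in one step from a list of per-port dicts."""
    df = pd.DataFrame(records, columns=columns)
    # drop fully identical rows, then renumber the index from 0
    return df.drop_duplicates().reset_index(drop=True)

records = [
    {'Port Location': 'Durres', 'Port Name': 'Port of Durres', 'UN/LOCODE': 'ALDRZ'},
    {'Port Location': 'Sarande', 'Port Name': 'Port of Sarande', 'UN/LOCODE': 'ALSAR'},
    {'Port Location': 'Durres', 'Port Name': 'Port of Durres', 'UN/LOCODE': 'ALDRZ'},  # repeated row
]
clean_df = build_port_df(records)
print(clean_df)
```

Because the list is rebuilt on every run, the result no longer depends on how many times the cell has been executed.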


In [337]:
port_df['Address']

0    Kapitenerija e Portit\nL Nr 1 Rruga Tregtare\n...
0          Port Office\nSarande Port\nSarande\nAlbania
0                                    Shengjin\nAlbania
0                                              Albania
0    02 Rue d'Angkor\nBP 259\nAlgiers, Gare 259\nAl...
Name: Address, dtype: object
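
The addresses above keep the line breaks of the web page as embedded `\n` characters. If a single-line form is preferred, a sketch such as the following could be used. The helper name `flatten_address` is hypothetical, and the choice of `', '` as the separator is an assumption.

```python
import pandas as pd

def flatten_address(address):
    """Turn a multi-line address into one comma-separated line; pass missing values through."""
    if not isinstance(address, str):
        return address
    parts = [part.strip() for part in address.split('\n')]
    return ', '.join(part for part in parts if part)

addresses = pd.Series(['Port Office\nSarande Port\nSarande\nAlbania', 'Shengjin\nAlbania', None])
flat = addresses.map(flatten_address)
print(flat.tolist())
```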

In [None]:
#test = pd.DataFrame(columns=['Location', 'Name', 'Authority'])
#row = {'Location': 'Durres', 'Name': 'Port of Durres', 'Authority': 'Port Authority of Durres', }
#row
#test = pd.concat([test, pd.DataFrame([row])])
#test

Unnamed: 0,Location,Name,Authority


In [346]:
#save work to csv
port_df.to_csv('WorldPortData.csv', index=False)
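
Column names such as 'UN/LOCODE' and '800 Number' contain characters that are awkward in later code. A possible normalization step before saving is sketched below; the function name `snake_case` and the naming scheme itself are assumptions, not something the notebook prescribes.

```python
import re

def snake_case(name):
    """Lower-case a column name and replace runs of other characters with '_'."""
    return re.sub(r'[^0-9a-z]+', '_', name.lower()).strip('_')

renamed = [snake_case(c) for c in ['Port Location', 'Web Site', '800 Number', 'UN/LOCODE']]
print(renamed)
```

With pandas, the same function can be passed directly as `port_df.rename(columns=snake_case)`.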

#### TO DO
- clean up addresses
- clean up column names 
- fix indexing 
- create a df of ports by sizes -> df of largest ports, df of mid-sized ports, df of small ports 
- map largest ports 
- eliminate ports that are not coastal (search for port types containing 'seaport')
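
Three of the items above (filtering on 'seaport', splitting by size, and preparing coordinates for mapping) can be sketched with a few small helpers. This is an illustration under assumptions: the function names are hypothetical, the coordinate pattern is based only on the rows shown in this notebook, and the sample dataframe reuses four of those rows. Note that filtering on 'seaport' literally also drops types such as 'Harbor' and 'Pier, Jetty or Wharf', which may still be coastal.

```python
import re
import pandas as pd

# Matches strings such as: 41° 18' 28" N
DMS_PATTERN = re.compile(r"""(\d+)°\s*(\d+)'\s*(\d+(?:\.\d+)?)"\s*([NSEW])""")

def dms_to_decimal(text):
    """Convert a degrees/minutes/seconds string to signed decimal degrees, or None if it does not parse."""
    match = DMS_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    return -value if hemisphere in 'SW' else value  # south and west are negative

def seaports_only(df):
    """Keep rows whose 'Port Type' contains 'seaport' (case-insensitive)."""
    return df[df['Port Type'].str.contains('seaport', case=False, na=False)]

def split_by_size(df):
    """Return a dict mapping each 'Port Size' value to its own dataframe."""
    return {size: group.reset_index(drop=True) for size, group in df.groupby('Port Size')}

sample = pd.DataFrame({
    'Port Name': ['Port of Durres', 'Port of Sarande', 'Port of Shengjin', 'Port of Algiers'],
    'Latitude': ['41° 18\' 28" N', '39° 52\' 19" N', '41° 48\' 41" N', '36° 46\' 25" N'],
    'Longitude': ['19° 27\' 17" E', '20° 0\' 20" E', '19° 35\' 17" E', '3° 4\' 2" E'],
    'Port Type': ['Seaport', 'Pier, Jetty or Wharf', 'Harbor', 'Deepwater Seaport'],
    'Port Size': ['Small', 'Small', 'Small', 'Large'],
})

seaports = seaports_only(sample)
seaports = seaports.assign(lat=seaports['Latitude'].map(dms_to_decimal),
                           lon=seaports['Longitude'].map(dms_to_decimal))
by_size = split_by_size(seaports)
print(sorted(by_size))
```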