# Tracing Clues - The Orca Project

The goal of this project is to collect the whale sighting data published by the [Orca Network](http://www.orcanetwork.org/Main/) and build a data model from it that supports analysis and, eventually, a knowledge graph.

The project has several stages, but the first is getting the data. The website hard-codes the sightings into nested HTML tables, so the current plan is to scrape them with the Beautiful Soup library.

## Getting the Website

In [16]:
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
page = requests.get("http://www.orcanetwork.org/Archives/index.php?categories_file=Sightings%20Archive%20-%20Dec%2005", headers=headers)
page

<Response [200]>
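
The request above targets a single month. A minimal sketch of how the same call could be wrapped for reuse follows. The names `archive_url` and `fetch_archive` are hypothetical, and it is an assumption that other months follow the same `Sightings Archive - <Mon YY>` label as the December 2005 page.

```python
from urllib.parse import quote

ARCHIVE_BASE = "http://www.orcanetwork.org/Archives/index.php"


def archive_url(month_label):
    """Build the archive URL for a label such as "Dec 05".

    Assumption: every month uses the "Sightings Archive - <label>" naming
    seen in the December 2005 URL.
    """
    categories_file = f"Sightings Archive - {month_label}"
    # quote() percent-encodes the spaces as %20, matching the original URL.
    return f"{ARCHIVE_BASE}?categories_file={quote(categories_file)}"


def fetch_archive(month_label, headers=None, timeout=30):
    """Download one archive page and raise on an HTTP error status."""
    import requests  # imported here so archive_url works without requests installed

    response = requests.get(archive_url(month_label), headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    return response


print(archive_url("Dec 05"))
```

Calling `fetch_archive("Dec 05", headers=headers)` should then behave like the cell above, with a timeout and an explicit status check added.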

## Beautiful Soup

In [17]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')


In [18]:
soup

<!DOCTYPE doctype html public "-//w3c//dtd html 3.2//en">

<html>
<head>
<title>
Welcome to Orca Network - Sightings Archive - Dec 05</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Orca Network: Enhancing awareness of the Southern Resident Orca (killer whale) community to foster a stewardship ethic to protect and restore orca habitat." name="description"/>
<meta content="Orcinus orca, orca, killer whale, Puget Sound, Salish Sea, salmon, Orca Network, Southern Resident community, J pod, K pod, L pod, orca gifts, orca, donate, gray whale, Baja, San Ignacio, orca protection, orca habitat, whale research, San Juan Island, Whidbey Island" name="keywords"/>
<meta content="Howard Garrett" name="author"/>
<meta content="Welcome to Orca Network" property="og:title">
<meta content="http://www.orcanetwork.org/Archives" property="og:url">
<meta content="http://www.orcanetwork.org/Archives/Images/FB.jpg" property="og:image">
<meta content="Welcome to Orc

In [32]:
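from bs4 import Comment
# HTML comments are Comment strings, not tags, so filter text nodes by type.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    print(comment, comment.next_sibling)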

In [36]:
soup.find_all('h3')[0].getText()

'December 2005 Whale Sightings'

In [54]:
divTag = soup.find("div")
children = divTag.findChildren()
for child in children:
    print(child)

<a name="top"></a>
<h3><center>December 2005 Whale Sightings</center></h3>
<center>December 2005 Whale Sightings</center>
<b>December 27, 2005</b>
<br/>
<br/>
<b>Humpback</b>
<b>1500</b>
<br/>
<br/>
<br/>
<br/>
<b>humpback</b>
<br/>
<br/>
<br/>
<b>orcas</b>
<b>8:45 am</b>
<br/>
<br/>
<b>December 24, 2005</b>
<br/>
<br/>
<b>orcas</b>
<b>8:30 am</b>
<br/>
<br/>
<b>December 23, 2005</b>
<br/>
<br/>
<b>noon</b>
<b>orcas</b>
<b>J pod</b>
<br/>
<br/>
<b>12:00 PM</b>
<b>J Pod</b>
<br/>
<br/>
<br/>
<br/>
<b>December 22, 2005</b>
<br/>
<br/>
<b>orcas</b>
<b>12:15 pm</b>
<br/>
<br/>
<br/>
<br/>
<b>December 18, 2005</b>
<br/>
<br/>
<b>orcas</b>
<b>9 am</b>
<br/>
<br/>
<b>December 17, 2005</b>
<br/>
<br/>
<b>orcas</b>
<b>11:50 am</b>
<br/>
<br/>
<b>orcas</b>
<b>1:25 pm</b>
<br/>
<br/>
<b>orcas</b>
<b>9 am</b>
<b>11:40 am</b>
<br/>
<br/>
<b>orcas</b>
<b>9 am</b>
<br/>
<br/>
<b>December 16, 2005</b>
<br/>
<br/>
<b>3:30 p.m</b>
<b>orcas</b>
<br/>
<br/>
<br/>
<b>4:30pm</b>
<b>orca</b>
<br/>
<br/>
<br/

bs4.element.Tag
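
The child listing suggests that the bold tags carry the structure of the page: a date heading, followed by species names and times for that day. As a rough sketch, the bold text could be grouped under its date heading. The helper `group_bold_by_date` and the pattern `DATE_RE` are hypothetical names, and the sketch assumes every date heading appears in bold in the form `Month D, YYYY`. The sample list is copied from the output above.

```python
import re

# Matches headings such as "December 27, 2005" (assumed format).
DATE_RE = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")


def group_bold_by_date(bold_texts):
    """Group bold-tag text under the most recent date heading."""
    grouped = {}
    current_date = None
    for text in bold_texts:
        text = text.strip()
        if DATE_RE.fullmatch(text):
            current_date = text
            grouped.setdefault(current_date, [])
        elif current_date is not None:
            # Species names and times that follow a date heading.
            grouped[current_date].append(text)
    return grouped


# In the notebook this list would come from:
#   [b.get_text() for b in divTag.find_all("b")]
sample = [
    "December 27, 2005", "Humpback", "1500", "humpback", "orcas", "8:45 am",
    "December 24, 2005", "orcas", "8:30 am",
]
grouped = group_bold_by_date(sample)
print(grouped)
```

The grouping keeps the page order, so species and times stay next to each other, but it does not yet pair a time with a specific report.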

In [85]:
import re

# Strip any literal <br>, <br/> or </br> left in the text (getText() already drops real tags).
pattern = r"</?br\s*/?>"
textOnly = re.sub(pattern, "", divTag.getText())
textOnly = re.sub('\n', ' ', textOnly)
textOnly = re.sub('Clip Map to enlarge    Map © 2005 used with permission byAdvanced Satellite Productions, Inc.', '', textOnly)
textOnly = textOnly.strip()
textOnly

'December 2005 Whale Sightings December 27, 2005  A new Humpback whale was spotted South of Race Rocks this afternoon about 1500. New in the fact its photo id matches nothing taken at least this year. Ron Bates MMRG, Victoria * I am a driver with Prince of Whales. We had a single humpback 3 miles south of Race Rocks. No Orca. Beemer * Marty Tilley called to report many, many orcas off N. Pender Island, traveling south in Swanson Channel at 8:45 am.  December 24, 2005  Mark from the Victoria Clipper called to report a pod of orcas off Pt. Wells near Edmonds, mid-channel heading south at 8:30 am.  December 23, 2005  Tom McMillen called at noon to report a pod of orcas resting just off Shilshole, heading south. At around 1 pm he called to confirm it was J pod, they were at West Pt. heading south slowly. At 1:30 they were off Magnolia, heading south into Elliott Bay. * 12:00 PM, On our Seatlle whale watch trip today the Island Explorer II sighted and is presently with J Pod, south bound of
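
The cleanup above hard-codes the 2005 map caption, which may not match pages from other years. One possible generalization, offered only as a sketch: the helper `clean_sightings_text` and the pattern `CAPTION_RE` are hypothetical, and it is an assumption that the caption wording stays the same apart from the year.

```python
import re

# Assumption: only the year changes in the map caption between archive pages.
CAPTION_RE = re.compile(
    r"Clip Map to enlarge\s+Map © \d{4} used with permission by\s*"
    r"Advanced Satellite Productions, Inc\."
)


def clean_sightings_text(raw):
    """Remove the map caption and collapse all whitespace runs to one space."""
    text = CAPTION_RE.sub("", raw)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = (
    "December 2005 Whale Sightings\nDecember 27, 2005\n\nA new Humpback\n"
    "Clip Map to enlarge    Map © 2005 used with permission by"
    "Advanced Satellite Productions, Inc.\n"
)
print(clean_sightings_text(raw))
```

Collapsing whitespace also removes the double spaces visible in the output above, so date headings such as `December 27, 2005` remain matchable with a single-space pattern.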

In [124]:
searchPattern = r"December \d{1,2}, \d{4}"
splitList = re.split(searchPattern, textOnly)
splitList = splitList[1:]
print(len(splitList))
mapFromDateToReport = dict()
for date, text in zip(re.findall(searchPattern, textOnly), splitList):
    mapFromDateToReport[date] = [i.strip() for i in text.split('*')]
    
mapFromDateToReport

16


{'December 27, 2005': ['A new Humpback whale was spotted South of Race Rocks this afternoon about 1500. New in the fact its photo id matches nothing taken at least this year. Ron Bates MMRG, Victoria',
  'I am a driver with Prince of Whales. We had a single humpback 3 miles south of Race Rocks. No Orca. Beemer',
  'Marty Tilley called to report many, many orcas off N. Pender Island, traveling south in Swanson Channel at 8:45 am.'],
 'December 24, 2005': ['Mark from the Victoria Clipper called to report a pod of orcas off Pt. Wells near Edmonds, mid-channel heading south at 8:30 am.'],
 'December 23, 2005': ['Tom McMillen called at noon to report a pod of orcas resting just off Shilshole, heading south. At around 1 pm he called to confirm it was J pod, they were at West Pt. heading south slowly. At 1:30 they were off Magnolia, heading south into Elliott Bay.',
  '12:00 PM, On our Seatlle whale watch trip today the Island Explorer II sighted and is presently with J Pod, south bound off S