# Abstract

*A problem well stated is a problem half-solved.*

*This is your space to describe your intentions for the project, before writing a single line of code. What are you studying? What are you hoping to build? If you can't explain that clearly before you start digging into the data, you're going to have a hard time planning where to go with this.*

# Obtain the Data

*Describe your data sources here and explain why they are relevant to the problem you are trying to solve.*

*Your code should download the data and save it in data/raw. If you've got the data from an offline source, describe where it came from and what the files look like. Don't do anything to the raw data files just yet; that comes in the next step.*

*After completing this step, be sure to edit `references/data_dictionary` to include descriptions of where you obtained your data and what information it contains.*
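
As a minimal sketch of the "save it in data/raw" step, the helper below writes downloaded bytes to a local path that mirrors the source key. `save_raw` and its `root` parameter are hypothetical names used for illustration only; the notebook's own code uploads to S3 instead of writing to disk.

```python
from pathlib import Path


def save_raw(data: bytes, filepath: str, root: str = "data/raw") -> Path:
    """Write raw bytes to root/filepath, creating parent folders as needed."""
    target = Path(root) / filepath
    target.parent.mkdir(parents=True, exist_ok=True)
    # Leave the payload untouched; cleaning belongs to the next step
    target.write_bytes(data)
    return target
```

Keeping the on-disk layout identical to the remote key makes it easy to tell which files have already been fetched.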

In [8]:
## %%writefile ../src/data/make_dataset.py

# Imports
%run ../src/utils/helpers.py

import boto3
import requests
from datetime import date, timedelta
from IPython.display import clear_output
from bs4 import BeautifulSoup


# Helper functions go here
def s3_file_exists(filename, s3, bucket_name):
    bucket = s3.Bucket(bucket_name)
    key = filename
    objs = list(bucket.objects.filter(Prefix=key))
    return len(objs) > 0 and objs[0].key == key


def save_to_s3(url, filepath, s3, bucket_name):
    """
    Given an Internet-accessible URL, download the image and upload it to S3
    without persisting it to disk locally. Return True if an image was uploaded.
    """
    # Quick and easy check to make sure your S3 access is OK
    bucket_found = bucket_name in [b.name for b in s3.buckets.all()]
    if not bucket_found:
        print('Not seeing your S3 bucket, might want to double check permissions in IAM')
        return False

    r = requests.get(url, stream=True)
    # .get avoids a KeyError when the server omits the Content-Type header
    if r.headers.get('Content-Type') != 'image/jpeg':
        print(f"No image found at {url}")
        return False

    # Do the actual upload to S3
    s3.Bucket(bucket_name).put_object(Key=filepath, Body=r.raw.read())
    return True


def get_soup(url):
    """
    Given url, return soup. Return None if host unresponsive
    """
    soup = None
    try:
        r = requests.get(url, timeout=5)
        if r.ok:
            soup = BeautifulSoup(r.content, 'html.parser')
    except requests.RequestException:
        print(f'Failed to get {url}')
    return soup


def get_thumbnail_paths(day):
    """
    Given a date, return (url, filepath) pairs for every thumbnail listed in
    that day's directory index. Return an empty list if the index is unavailable.
    """
    base = "https://spaceneedledev.com/panocam/assets/"
    tail = "thumbnail.jpg"

    day_slug = day.strftime('%Y/%m/%d/')
    soup = get_soup(f'{base}{day_slug}')
    if soup is None:
        # get_soup returns None when the request fails or the page is missing
        return []

    # Capture folders look like '2019_0528_041000/'
    time_slugs = [link['href'] for link in soup.find_all('a', href=True) if '_' in link['href']]

    paths = []
    for t in time_slugs:
        filepath = f'{day_slug}{t}{tail}'
        paths.append((f'{base}{filepath}', filepath))

    return paths


def get_thumbnails():
    s3 = boto3.resource('s3')
    bucket_name = 'eye-of-the-needle'
    bucket = s3.Bucket(bucket_name)
    # A set keeps the "already uploaded?" check fast for a large bucket
    files = {file.key for file in bucket.objects.all()}
    print('Got list of files in bucket')
    
    start_date = date(2015, 1, 1)
    end_date = date(2018, 6, 1)
    print(f'Searching for files between {start_date} and {end_date}')
    
    num_days = (end_date - start_date).days + 1
    date_list = [end_date - timedelta(days=x) for x in range(0, num_days)]
    for day in date_list:
        clear_output()
        paths = get_thumbnail_paths(day)
        print(f'Got thumbnail paths for {day}')
        for path in paths:
            if path[1] not in files:
                save_to_s3(path[0], path[1], s3, bucket_name)
                print(f'Saved {path[1]} to S3')
        print(f'Got all thumbnails for {day}')
    
    return


def run():
    """
    Executes a set of helper functions that download data from one or more sources
    and save it as raw data (here, thumbnails uploaded to the S3 bucket).
    """
    get_thumbnails()
    return
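
The `+ 1` in `num_days` above makes the date range include `start_date` as well as `end_date`. A small sketch of that calculation in isolation, with `days_descending` as a hypothetical helper name that does not appear in the notebook:

```python
from datetime import date, timedelta


def days_descending(start_date, end_date):
    """Return every date from end_date back to start_date, both inclusive."""
    num_days = (end_date - start_date).days + 1
    return [end_date - timedelta(days=x) for x in range(num_days)]


days = days_descending(date(2017, 6, 28), date(2017, 6, 30))
print(days[0], days[-1], len(days))
```

Without the `+ 1`, the list would stop one day short and never reach `start_date`.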

In [9]:
run()

Got thumbnail paths for 2015-01-01
Saved 2015/01/01/2015_0101_000921/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_072000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_074000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_080000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_081000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_082000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_083000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_084000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_085000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_090000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_091000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_092000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_093000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_094000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_095000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_100000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_101000/thumbnail.jpg to S3
Saved 2015/01/01/2015_0101_102000/thumbnail.jpg

In [152]:
good_url = "https://spaceneedledev.com/panocam/assets/2019/05/30/2019_0530_093000/thumbnail.jpg"
good_fp = "2019/05/30/2019_0530_093000/thumbnail.jpg"
save_to_s3(good_url, good_fp, s3, bucket_name)

In [None]:
files = [file.key for file in bucket.objects.filter()]
print(len(files))
files

In [149]:
s3 = boto3.resource('s3')

In [140]:
good_url = "https://spaceneedledev.com/panocam/assets/2019/05/31/2019_0530_093000/thumbnail.jpg"
good_fp = "2019/05/31/2019_0531_093000/thumbnail.jpg"
save_to_s3(good_url, good_fp, s3, bucket_name)

Good to go. Found the bucket to upload the image into.
No image found at https://spaceneedledev.com/panocam/assets/2019/05/31/2019_0530_093000/thumbnail.jpg


In [85]:
r = requests.get("https://spaceneedledev.com/panocam/assets/2019/05/31/2019_0531_093000/thumbnail.jpg")
g = requests.get("https://spaceneedledev.com/panocam/assets/2019/05/30/2019_0530_093000/thumbnail.jpg")

In [2]:
r = requests.get("https://spaceneedledev.com/panocam/assets/2019/05/28/")

In [17]:
def get_thumbnail_paths(start_date, end_date):
    base = "https://spaceneedledev.com/panocam/assets/"
    tail = "thumbnail.jpg"
    get_url = lambda x: f'https://spaceneedledev.com/panocam/assets/{x}'

    num_days = (end_date - start_date).days
    date_list = [end_date - timedelta(days=x) for x in range(0, num_days)]
    times = [str(t).zfill(6) for t in range(221000, 41000-1, -1000) if int(str(t)[-4]) < 6]
    
    paths = []
    for day in date_list:
        y = day.year
        m = str(day.month).zfill(2)
        d = str(day.day).zfill(2)
        
        for t in times:
            filepath = f'{y}/{m}/{d}/{y}_{m}{d}_{t}/{tail}'
            url = get_url(filepath)
            paths.append((url, filepath))
    return paths


s3 = boto3.resource('s3')
bucket_name = 'eye-of-the-needle'
bucket = s3.Bucket(bucket_name)
files = [file.key for file in bucket.objects.filter()]

start_date = date(2015, 1, 1)
end_date = date(2017, 6, 30)
paths = get_thumbnail_paths(start_date, end_date)
paths[:100]    

[('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_221000/thumbnail.jpg',
  '2017/06/30/2017_0630_221000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_220000/thumbnail.jpg',
  '2017/06/30/2017_0630_220000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_215000/thumbnail.jpg',
  '2017/06/30/2017_0630_215000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_214000/thumbnail.jpg',
  '2017/06/30/2017_0630_214000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_213000/thumbnail.jpg',
  '2017/06/30/2017_0630_213000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_212000/thumbnail.jpg',
  '2017/06/30/2017_0630_212000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_211000/thumbnail.jpg',
  '2017/06/30/2017_0630_211000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/ass

In [18]:
start_date = date(2017, 6, 28)
end_date = date(2017, 6, 30)
paths2 = get_thumbnail_paths2(start_date, end_date)
paths2[:100]    

[('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_041000/thumbnail.jpg',
  '2017/06/30/2017_0630_041000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_042000/thumbnail.jpg',
  '2017/06/30/2017_0630_042000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_043000/thumbnail.jpg',
  '2017/06/30/2017_0630_043000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_044000/thumbnail.jpg',
  '2017/06/30/2017_0630_044000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_045000/thumbnail.jpg',
  '2017/06/30/2017_0630_045000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_050000/thumbnail.jpg',
  '2017/06/30/2017_0630_050000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/assets/2017/06/30/2017_0630_051000/thumbnail.jpg',
  '2017/06/30/2017_0630_051000/thumbnail.jpg'),
 ('https://spaceneedledev.com/panocam/ass

In [21]:
set(paths2).difference(set(paths))

{('https://spaceneedledev.com/panocam/assets/2017/06/29/2017_0629_100500/thumbnail.jpg',
  '2017/06/29/2017_0629_100500/thumbnail.jpg')}

In [14]:
from bs4 import BeautifulSoup


def get_soup(url):
    """
    Given url, return soup. Return None if host unresponsive
    """
    soup = None
    r = requests.get(url, timeout=5)
    if r.ok:
        soup = BeautifulSoup(r.content, 'html.parser')
    return soup


def get_thumbnail_paths2(start_date, end_date):
    base = "https://spaceneedledev.com/panocam/assets/"
    tail = "thumbnail.jpg"
    get_url = lambda x: f'https://spaceneedledev.com/panocam/assets/{x}'

    num_days = (end_date - start_date).days
    date_list = [end_date - timedelta(days=x) for x in range(0, num_days)]
#     times = [str(t).zfill(6) for t in range(221000, 41000-1, -1000) if int(str(t)[-4]) < 6]
    
    paths = []
    for day in date_list:
        y = day.year
        m = str(day.month).zfill(2)
        d = str(day.day).zfill(2)
        day_slug = f'{y}/{m}/{d}/'
        day_url = f'{base}{day_slug}'
        soup = get_soup(day_url)
        time_slugs = [link['href'] for link in soup.find_all(name='a') if '_' in link['href']]
        
        for t in time_slugs:
            filepath = f'{day_slug}{t}{tail}'
            url = get_url(filepath)
            paths.append((url, filepath))
    return paths

In [12]:
[link['href'] for link in soup.find_all(name='a') if '_' in link['href']]

['2019_0528_041000/',
 '2019_0528_042000/',
 '2019_0528_043000/',
 '2019_0528_044000/',
 '2019_0528_045000/',
 '2019_0528_050000/',
 '2019_0528_051000/',
 '2019_0528_052000/',
 '2019_0528_053500/',
 '2019_0528_055000/',
 '2019_0528_060200/',
 '2019_0528_061300/',
 '2019_0528_062200/',
 '2019_0528_063000/',
 '2019_0528_064000/',
 '2019_0528_065000/',
 '2019_0528_070000/',
 '2019_0528_071000/',
 '2019_0528_072000/',
 '2019_0528_073000/',
 '2019_0528_074000/',
 '2019_0528_075000/',
 '2019_0528_080000/',
 '2019_0528_081000/',
 '2019_0528_082000/',
 '2019_0528_083000/',
 '2019_0528_084000/',
 '2019_0528_085000/',
 '2019_0528_090000/',
 '2019_0528_091000/',
 '2019_0528_092000/',
 '2019_0528_093000/',
 '2019_0528_094000/',
 '2019_0528_095000/',
 '2019_0528_100000/',
 '2019_0528_101000/',
 '2019_0528_102000/',
 '2019_0528_103000/',
 '2019_0528_104000/',
 '2019_0528_105000/',
 '2019_0528_110000/',
 '2019_0528_111000/',
 '2019_0528_112000/',
 '2019_0528_113000/',
 '2019_0528_114000/',
 '2019_052

In [179]:
list(r.iter_lines())

[b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">',
 b'<html>',
 b' <head>',
 b'  <title>Index of /panocam/assets/2019/05/28</title>',
 b' </head>',
 b' <body>',
 b'<h1>Index of /panocam/assets/2019/05/28</h1>',
 b'<pre>      <a href="?C=N;O=D">Name</a>                    <a href="?C=M;O=A">Last modified</a>      <a href="?C=S;O=A">Size</a>  <a href="?C=D;O=A">Description</a><hr>      <a href="/panocam/assets/2019/05/">Parent Directory</a>                             -   ',
 b'      <a href="2019_0528_041000/">2019_0528_041000/</a>       28-May-2019 04:42    -   ',
 b'      <a href="2019_0528_042000/">2019_0528_042000/</a>       28-May-2019 04:42    -   ',
 b'      <a href="2019_0528_043000/">2019_0528_043000/</a>       28-May-2019 05:03    -   ',
 b'      <a href="2019_0528_044000/">2019_0528_044000/</a>       28-May-2019 05:03    -   ',
 b'      <a href="2019_0528_045000/">2019_0528_045000/</a>       28-May-2019 05:19    -   ',
 b'      <a href="2019_0528_050000/">2019_0528_
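
The index page above is simple enough that the capture folders can also be pulled out with the standard library alone. This is a minimal sketch, assuming (as in the listing above) that only capture-folder links contain an underscore in their `href`; `SlugParser` and `extract_time_slugs` are hypothetical names, not part of the notebook.

```python
from html.parser import HTMLParser


class SlugParser(HTMLParser):
    """Collect href values of <a> tags that look like capture folders."""

    def __init__(self):
        super().__init__()
        self.slugs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same filter as the notebook: capture folders contain an underscore
        if href and "_" in href:
            self.slugs.append(href)


def extract_time_slugs(html: str) -> list:
    parser = SlugParser()
    parser.feed(html)
    return parser.slugs


# A shortened copy of the links shown in the index listing
sample = (
    '<a href="?C=N;O=D">Name</a>'
    '<a href="/panocam/assets/2019/05/">Parent Directory</a>'
    '<a href="2019_0528_041000/">2019_0528_041000/</a>'
    '<a href="2019_0528_042000/">2019_0528_042000/</a>'
)
print(extract_time_slugs(sample))
```

The sort links and the parent-directory link are dropped because neither contains an underscore.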

In [75]:
def get_thumbnail_urls(start_date, end_date):
    base = "https://spaceneedledev.com/panocam/assets/"
    tail = "thumbnail.jpg"
    get_url = lambda x: f'https://spaceneedledev.com/panocam/assets/{x}'

    bucket_name = 'eye-of-the-needle'
    
    num_days = (end_date - start_date).days
    date_list = [end_date - timedelta(days=x) for x in range(0, num_days)]
    times = [str(t).zfill(6) for t in range(211000, 50000-1, -1000) if int(str(t)[-4]) < 6]
    
    for day in date_list:
        y = day.year
        m = str(day.month).zfill(2)
        d = str(day.day).zfill(2)
        
        for t in times:
            filepath = f'{y}/{m}/{d}/{y}_{m}{d}_{t}/{tail}'
            url = get_url(filepath)
            #save_to_s3(url, filepath, bucket_name)
            print(f'Id save {url} to {filepath}')
    return

In [76]:
start_date = date(2019, 5, 1)
end_date = date.today()
get_thumbnail_urls(start_date, end_date)

Id save https://spaceneedledev.com/panocam/assets/2019/05/31/2019_0531_211000/thumbnail.jpg to 2019/05/31/2019_0531_211000/thumbnail.jpg
Id save https://spaceneedledev.com/panocam/assets/2019/05/31/2019_0531_210000/thumbnail.jpg to 2019/05/31/2019_0531_210000/thumbnail.jpg
Id save https://spaceneedledev.com/panocam/assets/2019/05/31/2019_0531_205000/thumbnail.jpg to 2019/05/31/2019_0531_205000/thumbnail.jpg
Id save https://spaceneedledev.com/panocam/assets/2019/05/31/2019_0531_204000/thumbnail.jpg to 2019/05/31/2019_0531_204000/thumbnail.jpg
Id save https://spaceneedledev.com/panocam/assets/2019/05/31/2019_0531_203000/thumbnail.jpg to 2019/05/31/2019_0531_203000/thumbnail.jpg
Id save https://spaceneedledev.com/panocam/assets/2019/05/31/2019_0531_202000/thumbnail.jpg to 2019/05/31/2019_0531_202000/thumbnail.jpg
Id save https://spaceneedledev.com/panocam/assets/2019/05/31/2019_0531_201000/thumbnail.jpg to 2019/05/31/2019_0531_201000/thumbnail.jpg
Id save https://spaceneedledev.com/panoca

In [46]:
from datetime import date, datetime, timedelta
start_date = date(2015, 1, 1)
end_date = date.today()
num_days = (end_date - start_date).days + 1
get_thumbnail_urls(start_date, end_date)

2019 5 31


In [49]:
date_list = [end_date - timedelta(days=x) for x in range(0, num_days + 1)]
str(date_list[3].month).zfill(2)


'05'

In [67]:
[str(t).zfill(6) for t in range(211000, 50000-1, -1000)][-1]

'050000'

In [65]:
[t for t in range(211000, 50000-1, -1000)][-1]

50000
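
The integer range above works because each step of 1000 advances the HHMMSS value by ten minutes, and the digit filter used in `get_thumbnail_urls` drops impossible minute values such as `209000`. The same grid can be sketched with datetime arithmetic; `time_slots` is a hypothetical helper, and its default bounds are taken from the range above.

```python
from datetime import datetime, timedelta


def time_slots(first="050000", last="211000", step_minutes=10):
    """Return HHMMSS strings from first to last, inclusive."""
    fmt = "%H%M%S"
    current = datetime.strptime(first, fmt)
    end = datetime.strptime(last, fmt)
    slots = []
    while current <= end:
        slots.append(current.strftime(fmt))
        current += timedelta(minutes=step_minutes)
    return slots


slots = time_slots()
print(slots[0], slots[-1], len(slots))
```

A fixed grid like this cannot produce off-grid captures such as `2017_0629_100500`, which is the one path the set difference in cell `In [21]` reported. That is a reason to prefer reading the directory index over guessing timestamps.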

In [12]:
import re

# Escape the literal parentheses and capture the quoted argument of fetch("...")
re.match(r'fetch\("(.*)"\)', fetches)

In [53]:
bucket_name = 'eye-of-the-needle'
filepath = 'test_s3_image.jpg'
url = 'https://spaceneedledev.com/panocam/assets/2019/05/28/2019_0528_120000/thumbnail.jpg'
save_to_s3(url, filepath, s3, bucket_name)

Good to go. Found the bucket to upload the image into.


# Scrub the Data

*Look through the raw data files and see what you will need to do to them in order to have a workable data set. If your source data is already well-formatted, you may want to ask yourself why it hasn't already been analyzed and what other people may have overlooked when they were working on it. Are there other data sources that might give you more insights on some of the data you have here?*

*The end goal of this step is to produce a [design matrix](https://en.wikipedia.org/wiki/Design_matrix), containing one column for every variable that you are modeling, including a column for the outputs, and one row for every observation in your data set. It needs to be in a format that won't cause any problems as you visualize and model your data.*
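
As one possible starting point for this project, each S3 key already encodes its capture time, so a first set of columns can be derived without opening any image. This is a minimal sketch: `parse_key` and the chosen columns are illustrative assumptions, and the output column is left out because the notebook does not yet define a target.

```python
from datetime import datetime


def parse_key(key: str) -> dict:
    """Turn '2015/01/01/2015_0101_072000/thumbnail.jpg' into a row of columns."""
    folder = key.split("/")[3]  # e.g. '2015_0101_072000'
    taken = datetime.strptime(folder, "%Y_%m%d_%H%M%S")
    return {
        "key": key,
        "timestamp": taken.isoformat(),
        "month": taken.month,
        "hour": taken.hour,
        "minute": taken.minute,
    }


keys = [
    "2015/01/01/2015_0101_072000/thumbnail.jpg",
    "2015/01/01/2015_0101_074000/thumbnail.jpg",
]
rows = [parse_key(k) for k in keys]
print(rows[0])
```

Each dictionary is one observation; writing the list out with `csv.DictWriter` would give one row per thumbnail and one column per variable.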

In [8]:
## %%writefile ../src/features/build_features.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that read files from data/raw, clean them,
    and convert the data into a design matrix that is ready for modeling.
    """
    # clean_dataset_1('data/raw', filename)
    # clean_dataset_2('data/raw', filename)
    # save_cleaned_data_1('data/interim', filename)
    # save_cleaned_data_2('data/interim', filename)
    # build_features()
    # save_features('data/processed')
    pass


*Before moving on to exploratory analysis, write down some notes about challenges encountered while working with this data that might be helpful for anyone else (including yourself) who may work through this later on.*

# Explore the Data

*Before you start exploring the data, write out your thought process about what you're looking for and what you expect to find. Take a minute to confirm that your plan actually makes sense.*

*Calculate summary statistics and plot some charts to give you an idea of what types of useful relationships might be in your dataset. Use these insights to go back and download additional data or engineer new features if necessary. Not now though... remember we're still just trying to finish the MVP!*
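
A first summary statistic for this dataset could be how many thumbnails exist per hour of day, which shows gaps in coverage before any image is loaded. This is a minimal sketch using a few keys from the download log above; `captures_per_hour` is a hypothetical helper name.

```python
from collections import Counter

keys = [
    "2015/01/01/2015_0101_000921/thumbnail.jpg",
    "2015/01/01/2015_0101_072000/thumbnail.jpg",
    "2015/01/01/2015_0101_074000/thumbnail.jpg",
    "2015/01/01/2015_0101_080000/thumbnail.jpg",
]


def captures_per_hour(keys):
    """Count thumbnails by hour of day, read from the HHMMSS part of each key."""
    hours = (key.split("/")[3].split("_")[2][:2] for key in keys)
    return Counter(hours)


counts = captures_per_hour(keys)
for hour, n in sorted(counts.items()):
    print(hour, n)
```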

In [None]:
## %%writefile ../src/visualization/visualize.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that read files from data/processed,
    calculate descriptive statistics for the population, and plot charts
    that visualize interesting relationships between features.
    """
    # data = load_features('data/processed')
    # describe_features(data, 'reports/')
    # generate_charts(data, 'reports/figures/')
    pass


*What did you learn? What relationships do you think will be most helpful as you build your model?*

# Model the Data

*Describe the algorithm or algorithms that you plan to use to train with your data. How do these algorithms work? Why are they good choices for this data and problem space?*

In [None]:
## %%writefile ../src/models/train_model.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that read files from data/processed,
    split the data into training and test sets, fit a model on the training
    set, and save the trained model to models/.
    """
    # data = load_features('data/processed/')
    # train, test = train_test_split(data)
    # save_train_test(train, test, 'data/processed/')
    # model = build_model()
    # model.fit(train)
    # save_model(model, 'models/')
    pass


In [None]:
## %%writefile ../src/models/predict_model.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that load the test data from
    data/processed and the trained model from models/, generate predictions,
    and save evaluation metrics to reports/.
    """
    # test_X, test_y = load_test_data('data/processed')
    # trained_model = load_model('models/')
    # predictions = trained_model.predict(test_X)
    # metrics = evaluate(test_y, predictions)
    # save_metrics('reports/')
    pass



_Write down any thoughts you may have about working with these algorithms on this data. What other ideas do you want to try out as you iterate on this pipeline?_

# Interpret the Model

_Write up the things you learned and how well your model performed. Be sure to address the model's strengths and weaknesses. What types of data does it handle well? What types of observations tend to give it a hard time? What future work might you, or someone reading this, want to do, building on the lessons learned and tools developed in this project?_