# Working with images, audio, and other assets

## **Introduction**

A common practice in scraping is to download, store, and further process media content (anything other than web pages or data files), such as images, audio, and video.  To store the content correctly, whether locally or in a service like S3, we need to know its media type, and it's not enough to trust the file extension in the URL.  We will learn how to download media and determine its type from information supplied by the web server.

Another common task is generating thumbnails of images, videos, or even a page of a website.  We will examine several techniques for generating thumbnails and taking screenshots of website pages.  These are often used on a new website as thumbnail links to the scraped media that is now stored locally.

Finally, we often need to transcode media, such as converting non-MP4 videos to MP4, or changing the bit rate or resolution of a video.  Another scenario is extracting only the audio from a video file.  We won't look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg.  It's a simple step from there to also transcode video with ffmpeg.

## **Downloading media content from the web**

Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content.

### **Getting ready**

There is a class named `URLUtility` in the `urls.py` module in the `util` folder of the solution.  This class handles several of the scenarios in this chapter that involve downloading and parsing URLs.  We will use this class in this recipe and a few others.

### **How to do it**

In [1]:
import const
from urls import URLUtility

util = URLUtility(const.ApodEclipseImage())
print(len(util.data))

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
171014
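
The source of `URLUtility` is not reproduced in this recipe, so the following is only a minimal sketch of the kind of read it might perform, written with the standard library's `urllib.request`.  The function name `read_url` is hypothetical, and a `data:` URL stands in for the image URL so that the example runs without network access; an `http` or `https` URL would be passed in the same way.

```python
import base64
import urllib.request


def read_url(url):
    """Fetch a URL and return its raw bytes and content-type header.

    Hypothetical helper for illustration; not the chapter's URLUtility.
    """
    with urllib.request.urlopen(url) as response:
        data = response.read()
        content_type = response.headers["content-type"]
    return data, content_type


# A data: URL keeps the demonstration offline (an assumption made for
# this sketch); the bytes below are placeholder content, not a real image.
payload = base64.b64encode(b"\x89PNG fake image bytes").decode("ascii")
data, content_type = read_url("data:image/png;base64," + payload)

print("Read {0} bytes of type {1}".format(len(data), content_type))
```

Because media is binary, the sketch returns the bytes untouched rather than decoding them as text, which is the main difference from handling HTML.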


## **Parsing a URL with urllib to get the filename**

When downloading content from a URL, we often want to save it to a file, and it is often good enough to name that file after the filename found in the URL.  But a URL consists of several components, so how do we find the actual filename, especially when query parameters follow it?

### **How to do it**

In [2]:
util = URLUtility(const.ApodEclipseImage())
print(util.filename_without_ext)

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
BT5643s


### **How it works**

In the constructor for `URLUtility`, there is a call to `urllib.parse.urlparse`.  The following demonstrates using the function interactively:

In [3]:
from urllib.parse import urlparse

parsed = urlparse(const.ApodEclipseImage())
parsed

ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='')

The `ParseResult` object contains the various components of the URL.  The `path` element contains the path and the filename.  The call to the `.filename_without_ext` property returns just the filename without the extension:

```py
@property
def filename_without_ext(self):
    filename = os.path.splitext(os.path.basename(self._parsed.path))[0]
    return filename
```
The call to `os.path.basename()` returns only the filename portion of the path (including the extension). `os.path.splitext()` then separates the filename and the extension, and the function returns the first element of that tuple (the filename).
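
To illustrate the case raised at the start of this recipe, where parameters follow the filename, here is a small sketch that applies the same steps to a URL with a query string and a fragment.  The function name `split_filename` and the `example.com` URL are made up for the illustration.

```python
import os
from urllib.parse import urlparse


def split_filename(url):
    """Return (name, extension) for the last path segment of a URL."""
    # urlparse keeps the query string and fragment out of .path,
    # so they never reach basename/splitext.
    path = urlparse(url).path
    return os.path.splitext(os.path.basename(path))


name, ext = split_filename(
    "https://example.com/media/photo.jpg?size=large&dl=1#top")
print(name, ext)
```

Working on `parsed.path` rather than on the whole URL string is what keeps `?size=large&dl=1` from ending up in the extension.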

## **Determining the type of content for a URL**

When performing a `GET` request for content from a web server, the web server will return a number of headers, one of which identifies the type of the content from the perspective of the web server.  In this recipe we learn to use that header to determine what the web server considers the type of the content to be.

### **How to do it**

We proceed as follows:

In [4]:
util = URLUtility(const.ApodEclipseImage())
print("The content type is: " + util.contenttype)

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The content type is: image/jpeg


### **How it works**

The `.contenttype` property is implemented as follows:
```py
@property
def contenttype(self):
    self.ensure_response()
    return self._response.headers['content-type']
```
The `.headers` property of the `_response` object is a dictionary-like collection of headers.  The `content-type` key retrieves the content type specified by the server.  The call to the `ensure_response()` method simply ensures that the `.read()` function has been executed.
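
The headers object returned by `urllib` is an `http.client.HTTPMessage`, which looks keys up without regard to case.  The sketch below builds such an object by hand, with an invented header value, to show that behaviour and to show how the inherited `get_content_type()` method drops parameters such as `charset`.  The helper name `media_type` is hypothetical.

```python
from http.client import HTTPMessage


def media_type(headers):
    """Return the bare, lower-cased media type without any parameters."""
    return headers.get_content_type()


# Build a headers object by hand; the value is invented for illustration.
headers = HTTPMessage()
headers["Content-Type"] = "image/JPEG; charset=binary"

# Header lookup ignores case, so both spellings return the same string.
print(headers["content-type"])
print(headers["CONTENT-TYPE"])

# get_content_type() strips the parameters and lower-cases the result.
print(media_type(headers))
```

This matters when the header value is used as a dictionary key: a value carrying a `; charset=...` suffix would not match a plain `image/jpeg` entry unless it is normalized first.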

### **There's more**

The headers in a response contain a wealth of information.  If we look more closely at the `headers` property of the response, we can see that the following headers are returned:

In [5]:
import urllib.request

response = urllib.request.urlopen(const.ApodEclipseImage())

for header in response.headers: print(header)

Date
Server
X-Frame-Options
Last-Modified
ETag
Accept-Ranges
Content-Length
Connection
Content-Type
Strict-Transport-Security
Content-Security-Policy


We can also print the value of each of these headers:

In [6]:
for header in response.headers: print(header + " ==> " + response.headers[header])

Date ==> Tue, 10 Jan 2023 14:13:15 GMT
Server ==> WebServer/1.0
X-Frame-Options ==> sameorigin
Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT
ETag ==> "547bb44-29c06-5581275ce2b86"
Accept-Ranges ==> bytes
Content-Length ==> 171014
Connection ==> close
Content-Type ==> image/jpeg
Strict-Transport-Security ==> max-age=31536000; includeSubDomains
Content-Security-Policy ==> upgrade-insecure-requests


## **Determining the file extension from a content type**

It is good practice to use the `content-type` header to determine the type of content, and to determine the extension to use for storing the content as a file.

### **How to do it**

In [7]:
util = URLUtility(const.ApodEclipseImage())
print("Extension from content-type : " + util.extension_from_contenttype)
print("Extension from url : " + util.extension_from_url)

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
Extension from content-type : .jpg
Extension from url : .jpg


This reports both the extension determined from the content type and the one taken from the URL.  These can differ, but in this case they are the same.

### **How it works**

The following is the implementation of the `.extension_from_contenttype` property:
```python
@property
def extension_from_contenttype(self):
    self.ensure_response()

    map = const.ContentTypeToExtensions()
    if self.contenttype in map:
        return map[self.contenttype]
    return None
```

The first line ensures that we have read the response from the URL.  The function then looks up the content type in a Python dictionary, defined in the `const` module, that maps content types to extensions:

```py
def ContentTypeToExtensions():
    return {
        "image/jpeg": ".jpg",
        "image/jpg": ".jpg",
        "image/png": ".png"
    }
```
If the content type is in the dictionary, then the corresponding value is returned.  Otherwise, `None` is returned.

Note the corresponding property, `.extension_from_url`:

```py
@property
def extension_from_url(self):
    ext = os.path.splitext(os.path.basename(self._parsed.path))[1]
    return ext
```
This uses the same technique as the `.filename_without_ext` property to parse the URL, but instead returns the `[1]` element, which represents the extension instead of the base filename.

### **There's more**

As stated, it's best to use the content-type header to determine an extension for storing the file locally.  There are other techniques besides the one shown here, but this is the easiest.
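
One such alternative is the standard library's `mimetypes` module.  The sketch below is not part of the chapter's `URLUtility`; it keeps the explicit mapping shown above as the first choice and falls back to `mimetypes.guess_extension()` for types the mapping does not cover.  The names `KNOWN_EXTENSIONS` and `extension_for` are hypothetical.

```python
import mimetypes

# Same entries as the chapter's ContentTypeToExtensions() dictionary.
KNOWN_EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/jpg": ".jpg",
    "image/png": ".png",
}


def extension_for(content_type):
    """Map a content-type header value to a file extension, or None."""
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in KNOWN_EXTENSIONS:
        return KNOWN_EXTENSIONS[media_type]
    # Fall back to the standard library's table for everything else.
    return mimetypes.guess_extension(media_type)


print(extension_for("image/jpeg"))
print(extension_for("application/pdf"))
```

Keeping the explicit dictionary first means the extensions you care most about stay under your control, since the value returned by `mimetypes` for a type with several registered extensions can vary between Python versions and systems.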

## **Downloading and saving images to the local file system**

Sometimes when scraping we just download and parse data, such as HTML, to extract some data, and then throw out what we read.  Other times, we want to keep the downloaded content by storing it as a file.

### **How to do it**

In [15]:
from os.path import expanduser
from core.file_blob_writer import FileBlobWriter

# download the image
item = URLUtility(const.ApodEclipseImage())

# create a file writer to write the data
FileBlobWriter(expanduser("./")).write(item.filename, item.data)

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
Attempting to write 171014 bytes to BT5643s.jpg:
The write was successful


### **How it works**

The sample simply writes the data to a file using standard Python file access functions.  It does so in an object-oriented manner, using a standard interface for writing data and a file-based implementation of it in the `FileBlobWriter` class:

```py
""" Implements the IBlobWriter interface to write the blob to a file """

from core.i_blob_writer import IBlobWriter

class FileBlobWriter(IBlobWriter):
    def __init__(self, location):
        self._location = location

    def write(self, filename, contents):
        full_filename = self._location + "/" + filename
        print("Attempting to write {0} bytes to {1}:".format(len(contents), filename))

        with open(full_filename, 'wb') as outfile:
            outfile.write(contents)

        print("The write was successful")
```
The class is passed a string representing the directory where the file should be placed.  The data is actually written during a later call to the `.write()` method.  This method joins the directory (`_location`) and the filename, and then opens/creates the file and writes the bytes.  The `with` statement ensures that the file is closed.

### **There's more**

This write could simply have been handled by a function that wraps the code.  However, this object will be reused throughout this chapter.  We could rely on Python's duck typing, or on a plain function, but an explicit interface makes the intent clearer.  The following is the definition of this interface:

```py
""" Defines the interface for writing a blob of data to storage """

from interface import Interface

class IBlobWriter(Interface):
    def write(self, filename, contents):
        pass
```
We will also see another implementation of this interface that lets us store files in S3.  Because both classes implement the same interface, we can easily substitute one implementation for the other.
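
The substitution idea can be sketched with the standard library alone.  The example below uses `abc` instead of the `interface` package used in this chapter, and every class and function name in it (`BlobWriter`, `FileWriter`, `MemoryWriter`, `store`) is hypothetical; the in-memory writer merely stands in for a second backend such as S3.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class BlobWriter(ABC):
    """Stand-in for IBlobWriter, built on abc (an assumption of this sketch)."""

    @abstractmethod
    def write(self, filename, contents):
        """Store contents under the given filename."""


class FileWriter(BlobWriter):
    def __init__(self, location):
        self._location = location

    def write(self, filename, contents):
        # os.path.join avoids hand-building the path with "/".
        with open(os.path.join(self._location, filename), "wb") as outfile:
            outfile.write(contents)


class MemoryWriter(BlobWriter):
    """Keeps blobs in a dictionary; useful in tests."""

    def __init__(self):
        self.blobs = {}

    def write(self, filename, contents):
        self.blobs[filename] = bytes(contents)


def store(writer, filename, contents):
    # The caller depends only on the write() method, not on the backend.
    writer.write(filename, contents)


memory = MemoryWriter()
store(memory, "a.bin", b"abc")

with tempfile.TemporaryDirectory() as folder:
    store(FileWriter(folder), "a.bin", b"abc")
    with open(os.path.join(folder, "a.bin"), "rb") as infile:
        on_disk = infile.read()
```

The `store()` function never changes when a new backend is added, which is the benefit the interface is meant to provide.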

## **Taking a screenshot of a website**

### **How to do it**

In [16]:
from core.website_screenshot_generator import WebsiteScreenshotGenerator
from core.file_blob_writer import FileBlobWriter
from os.path import expanduser

# get the screenshot
image_bytes = WebsiteScreenshotGenerator().capture("http://espn.go.com", 500, 500).image_bytes

# save it to a file
FileBlobWriter(expanduser("./")).write("website_screenshot.png", image_bytes)

Capturing website screenshot of: http://espn.go.com


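AttributeError: module 'selenium.webdriver' has no attribute 'PhantomJS'

The call fails because the screenshot generator relies on `webdriver.PhantomJS`, which was removed from Selenium in version 4.  Running this recipe requires either an older Selenium release or a generator rewritten to use a headless browser such as Chrome or Firefox.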