# Chapter 6 Storing Data

## Media Files

It's easy to download a file, such as a picture, with the `urllib` library.

In [1]:
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

http_res = urlopen("http://www.pythonscraping.com/")
bs = BeautifulSoup(http_res, "html.parser")
image_location = bs.find("a", {"id": "logo"}).find("img")["src"]

In [2]:
urlretrieve(image_location, "download/logo.jpg")

('download/logo.jpg', <http.client.HTTPMessage at 0x7fde1b0607f0>)

Executing `urlretrieve()` above saves the image as `logo.jpg` in the `download` directory. The directory must already exist, because `urlretrieve()` does not create it. Since we're using Jupyter Notebook, we can view the picture directly here using the path `download/logo.jpg`:

![logo](download/logo.jpg)
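
Because `urlretrieve()` writes to the given path without creating missing directories, a small wrapper can make the download step more robust. The following is a minimal sketch, not part of the original notebook; the helper name `download_to` is hypothetical.

```python
import os
from urllib.request import urlretrieve


def download_to(url, target_path):
    """Download `url` to `target_path`, creating parent directories first.

    `download_to` is a hypothetical helper used for illustration only.
    """
    directory = os.path.dirname(target_path)
    if directory:
        # exist_ok=True makes this a no-op when the directory is already there
        os.makedirs(directory, exist_ok=True)
    # urlretrieve returns a (local filename, headers) tuple
    saved_path, _headers = urlretrieve(url, target_path)
    return saved_path
```

With this helper, `download_to(image_location, "download/logo.jpg")` would behave like the cell above while also working on a fresh checkout where `download` does not exist yet.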

Web scraping is rarely used to download a single picture; more often it is used to greedily download every available picture. Let's write code to do so:

In [3]:
import os
import re
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

downloadDirectory = 'download'
baseUrl = 'http://pythonscraping.com'

def getAbsoluteURL(baseUrl, source):
    if source.startswith('http://www.'):
        url = 'http://{}'.format(source[11:])
    elif source.startswith('http://'):
        url = source
    elif source.startswith('www.'):
        # Drop the leading 'www.' and add the scheme
        url = 'http://{}'.format(source[4:])
    else:
        url = '{}/{}'.format(baseUrl, source)
    if baseUrl not in url:
        return None
    return url

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace('www.', '')
    path = path.replace(baseUrl, '')
    path = downloadDirectory+path
    directory = os.path.dirname(path)

    if not os.path.exists(directory):
        os.makedirs(directory)

    return path

html = urlopen('http://www.pythonscraping.com')
bs = BeautifulSoup(html, 'html.parser')
downloadList = bs.findAll(src=True)

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download['src'])
    if fileUrl is not None:
        print(fileUrl)
        if re.match(r"^.*(\.jpg)$", fileUrl): # end with ".jpg"
            urlretrieve(fileUrl,
                        getDownloadPath(baseUrl, fileUrl, downloadDirectory))

http://pythonscraping.com/misc/jquery.js?v=1.4.4
http://pythonscraping.com/misc/jquery.once.js?v=1.2
http://pythonscraping.com/misc/drupal.js?pa2nir
http://pythonscraping.com/sites/all/themes/skeletontheme/js/jquery.mobilemenu.js?pa2nir
http://pythonscraping.com/sites/all/modules/google_analytics/googleanalytics.js?pa2nir
http://pythonscraping.com/sites/default/files/lrg_0.jpg
http://pythonscraping.com/img/lrg%20(1).jpg
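
The prefix checks in `getAbsoluteURL()` can also be expressed with the standard `urllib.parse` module. The sketch below is an alternative, not the notebook's method: it resolves a `src` value against the page URL with `urljoin()` and compares hosts instead of matching string prefixes. The names `absolute_internal_url` and `_host` are hypothetical, and unlike `getAbsoluteURL()` this version leaves any `www.` in the returned URL.

```python
from urllib.parse import urljoin, urlparse


def _host(url):
    """Return the host of `url`, lowercased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith('www.') else host


def absolute_internal_url(page_url, source):
    """Resolve `source` against `page_url`; return None for external hosts.

    Hypothetical helper for illustration; it handles relative paths,
    root-relative paths, and protocol-relative ('//host/...') sources.
    """
    url = urljoin(page_url, source)
    if _host(url) != _host(page_url):
        return None
    return url


page = 'http://www.pythonscraping.com/'
print(absolute_internal_url(page, 'img/lrg%20(1).jpg'))
print(absolute_internal_url(page, 'http://example.com/logo.jpg'))
```

The first call prints an absolute URL on the same site, while the second prints `None` because the host differs.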


Again, we can use Markdown to check the following two files in the `download` directory:

- `sites/default/files/lrg_0.jpg`
- `img/lrg%20(1).jpg`

Note that `getDownloadPath()` creates the corresponding subdirectories inside `download`, so each file keeps the path it has on the site.

![lrg_0.jpg](download/sites/default/files/lrg_0.jpg)
![lrg%20(1).jpg](download/img/lrg%2520(1).jpg)
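
The second image link uses `%2520` because the file was saved under the literal name `lrg%20(1).jpg`, so the `%` itself has to be percent-encoded in the Markdown path. The sketch below illustrates this with `urllib.parse.quote()` and shows `unquote()` as one possible way to save files under decoded names instead; the variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# File name exactly as it appears at the end of the scraped URL
# (the "%20" is a percent-encoded space).
name_in_url = 'lrg%20(1).jpg'

# Option 1: keep the name as-is on disk. A Markdown link then has to
# encode the literal "%" as "%25" to point at that file.
markdown_target = quote(name_in_url, safe='()')

# Option 2: decode the name before saving, so the file on disk would
# be called "lrg (1).jpg".
decoded_name = unquote(name_in_url)

print(markdown_target)  # lrg%2520(1).jpg
print(decoded_name)     # lrg (1).jpg
```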

## Storing Data to CSV

Working with CSV files is easy. Have a look at this example:

In [4]:
import csv

# newline='' lets the csv module control line endings itself
with open('download/test.csv', 'w', newline='') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerow(('number', 'number plus 2', 'number times 2'))
    for i in range(10):
        writer.writerow((i, i+2, i*2))

To get a better view of `test.csv`, we use the `pandas` library.

In [5]:
import pandas as pd

df = pd.read_csv("download/test.csv")
df

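| | number | number plus 2 | number times 2 |
|---|---|---|---|
| 0 | 0 | 2 | 0 |
| 1 | 1 | 3 | 2 |
| 2 | 2 | 4 | 4 |
| 3 | 3 | 5 | 6 |
| 4 | 4 | 6 | 8 |
| 5 | 5 | 7 | 10 |
| 6 | 6 | 8 | 12 |
| 7 | 7 | 9 | 14 |
| 8 | 8 | 10 | 16 |
| 9 | 9 | 11 | 18 |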
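
Scraped records are often dictionaries rather than tuples. In that case `csv.DictWriter` can write the header and rows from the keys. The following is a minimal sketch under that assumption; it writes to an in-memory buffer instead of `download/test.csv`, and the helper name `rows_to_csv_text` is hypothetical.

```python
import csv
import io


def rows_to_csv_text(rows, fieldnames):
    """Serialize a list of dicts to CSV text (hypothetical helper)."""
    # newline='' keeps the csv module's own line endings untouched
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


fieldnames = ['number', 'number plus 2', 'number times 2']
rows = [
    {'number': i, 'number plus 2': i + 2, 'number times 2': i * 2}
    for i in range(3)
]
csv_text = rows_to_csv_text(rows, fieldnames)
print(csv_text)
```

The same writer calls work unchanged on a file object opened with `open(path, 'w', newline='')`.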


Now let's combine web scraping and CSV files.

In [6]:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Comparison_of_text_editors')
bs = BeautifulSoup(html, 'html.parser')
# The main comparison table is currently the first table on the page
table = bs.findAll('table',{'class':'wikitable'})[0]
rows = table.findAll('tr')

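# newline='' lets the csv module control line endings itself;
# utf-8 keeps non-ASCII characters from the page intact
with open('download/editors.csv', 'w', newline='', encoding='utf-8') as csvFile:
    writer = csv.writer(csvFile)
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text().strip())
        writer.writerow(csvRow)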

Again, let's check the first 10 rows.

In [7]:
df = pd.read_csv("download/editors.csv")
df[:10]

| | Name | Creator | First public release | Latest stable version | Latest Release Date | Programming language | Cost (US$) | Software license | Open source | CLI available | Minimum installed size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Acme | Rob Pike | 1993 | Plan 9 and Inferno | | C | Free | LPL (OSI approved) | Yes | | |
| 1 | AkelPad | Alexey Kuznetsov, Alexander Shengalts | 2003 | 4.9.8 | 2016-07-18 | C | Free | BSD | Yes | | |
| 2 | Alphatk | Vince Darley | 1999 | 8.3.3 | 2004-12-10 | | $40 | Proprietary, with BSD components | No | | |
| 3 | Aquamacs | David Reitter | 2005 | 3.3 | 2016-09-20 | C, Emacs Lisp | Free | GPL | Yes | | |
| 4 | Atom | GitHub | 2014 | 1.31.1 | 2018-09-28 | HTML, CSS, JavaScript, C++ | Free | MIT | Yes | No | ~ 150 MB |
| 5 | BBEdit | Rich Siegel | 1992 | 12.1.3 | 2018-04-11 | Objective-C, Objective-C++ | $49.99 | Proprietary | No | | |
| 6 | Bluefish | Bluefish Development Team | 1999 | 2.2.10 | 2017-01-27 | C | Free | GPL | Yes | | |
| 7 | Brackets | Adobe Systems | 2012 | 1.12 | 2018-02-05 | HTML, CSS, JavaScript, C++ | Free | MIT | Yes | | |
| 8 | Coda | Panic | 2007 | 2.6.6 | 2017-06-05 | Objective-C | $99 | Proprietary | No | | |
| 9 | ConTEXT | ConTEXT Project Ltd | 1999 | 0.98.6 | 2009-08-14 | Object Pascal (Delphi) | Free | BSD | Yes | | |
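
Several cells in this table contain commas, for example the Creator of AkelPad. `csv.writer` wraps such fields in double quotes so that they survive a round trip. The sketch below demonstrates this with an in-memory buffer and three values taken from the AkelPad row; the variable names are illustrative only.

```python
import csv
import io

# Three cells from the AkelPad row; the second one contains a comma
row = ['AkelPad', 'Alexey Kuznetsov, Alexander Shengalts', '2003']

# Write the row to an in-memory buffer instead of a file
buffer = io.StringIO(newline='')
csv.writer(buffer).writerow(row)
csv_line = buffer.getvalue()

# Read it back: the quoted field comes out as a single cell again
buffer.seek(0)
parsed_row = next(csv.reader(buffer))

print(csv_line.strip())   # AkelPad,"Alexey Kuznetsov, Alexander Shengalts",2003
print(parsed_row == row)  # True
```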
