# Chapter 7 Reading Documents

## Text

Have you ever wondered how to read a plain text file from a website? `BeautifulSoup` parses HTML, so for plain text `urlopen` alone is enough. Let's look at the example of [http://www.pythonscraping.com/pages/warandpeace/chapter1.txt](http://www.pythonscraping.com/pages/warandpeace/chapter1.txt):

In [1]:
from bs4 import BeautifulSoup

In [2]:
from urllib.request import urlopen
textPage = urlopen("http://www.pythonscraping.com/pages/warandpeace/chapter1.txt")
result = textPage.read()
print(result[:100]) # first 100 characters

b'CHAPTER I\n\n"Well, Prince, so Genoa and Lucca are now just family estates of theBuonapartes. But I wa'


Well, not too bad. The `b` at the beginning of the output means Python is printing a `bytes` object rather than a normal string. One easy way to convert it to a normal string (`str`) is to call `decode()`.

In [3]:
print(result.decode("ascii")[:100]) # first 100 characters

CHAPTER I

"Well, Prince, so Genoa and Lucca are now just family estates of theBuonapartes. But I wa


However, not every language can be decoded this way. Text with non-ASCII characters, such as Russian, needs a different decoder.

In [4]:
textPage = urlopen(
    'http://www.pythonscraping.com/pages/warandpeace/chapter1-ru.txt')
result = textPage.read()
print(result[:100])

b'\xd0\xa7\xd0\x90\xd0\xa1\xd0\xa2\xd0\xac \xd0\x9f\xd0\x95\xd0\xa0\xd0\x92\xd0\x90\xd0\xaf\n\nI\n\n\xe2\x80\x94 Eh bien, mon prince. G\xc3\xaanes et Lucques ne sont plus que des apanages'


In [5]:
print(result.decode("utf-8")[:100])

ЧАСТЬ ПЕРВАЯ

I

— Eh bien, mon prince. Gênes et Lucques ne sont plus que des apanages, des поместья
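
Guessing the codec by hand gets tedious once several pages are involved. The following is a minimal sketch, assuming a hypothetical helper named `decode_with_fallback` (it is not part of the textbook or of `urllib`), that tries a list of candidate encodings in order and returns the first one that works. The sample bytes are the first Russian word copied from the byte output above.

```python
def decode_with_fallback(raw, encodings=("ascii", "utf-8")):
    """Hypothetical helper: return (text, encoding) for the first codec that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this codec cannot decode the bytes; try the next one
    raise ValueError("none of the candidate encodings worked")


# First word of chapter1-ru.txt, copied from the byte output above.
sample = b"\xd0\xa7\xd0\x90\xd0\xa1\xd0\xa2\xd0\xac"
text, used = decode_with_fallback(sample)
print(text, used)
```

If the server declares a charset in its `Content-Type` header, `textPage.headers.get_content_charset()` can supply the first candidate. It returns `None` when no charset is declared, so a fallback list is still useful.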


## CSV

In the textbook, we use Python's built-in `csv` library to read CSV files from web pages.

In [6]:
from urllib.request import urlopen
from io import StringIO
import csv

data = (urlopen('http://pythonscraping.com/files/MontyPythonAlbums.csv')
        .read().decode('ascii', 'ignore'))
dataFile = StringIO(data)
csvReader = csv.reader(dataFile)

for row in csvReader:  # the reader is iterable, so no list() is needed
    print(row)

['Name', 'Year']
["Monty Python's Flying Circus", '1970']
['Another Monty Python Record', '1971']
["Monty Python's Previous Record", '1972']
['The Monty Python Matching Tie and Handkerchief', '1973']
['Monty Python Live at Drury Lane', '1974']
['An Album of the Soundtrack of the Trailer of the Film of Monty Python and the Holy Grail', '1975']
['Monty Python Live at City Center', '1977']
['The Monty Python Instant Record Collection', '1977']
["Monty Python's Life of Brian", '1979']
["Monty Python's Cotractual Obligation Album", '1980']
["Monty Python's The Meaning of Life", '1983']
['The Final Rip Off', '1987']
['Monty Python Sings', '1989']
['The Ultimate Monty Python Rip Off', '1994']
['Monty Python Sings Again', '2014']
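
Every value in the output above is a string, including the years. A small sketch of separating the header row and converting the year column to `int`, using three lines typed in from that output so that it runs without a network call. The variable names are illustrative only.

```python
import csv
from io import StringIO

# Header plus two data rows copied from the output above.
sample = (
    "Name,Year\n"
    "Monty Python's Flying Circus,1970\n"
    "Another Monty Python Record,1971\n"
)

csv_reader = csv.reader(StringIO(sample))
header = next(csv_reader)  # the first row holds the column names
albums = [(name, int(year)) for name, year in csv_reader]  # csv yields strings

print(header)
print(albums)
```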


But a better, more modern way to process CSV files on websites is to use `pandas`. It's a one-liner.

In [7]:
import pandas as pd

df = pd.read_csv("http://pythonscraping.com/files/MontyPythonAlbums.csv")
df

,Name,Year
0,Monty Python's Flying Circus,1970
1,Another Monty Python Record,1971
2,Monty Python's Previous Record,1972
3,The Monty Python Matching Tie and Handkerchief,1973
4,Monty Python Live at Drury Lane,1974
5,An Album of the Soundtrack of the Trailer of t...,1975
6,Monty Python Live at City Center,1977
7,The Monty Python Instant Record Collection,1977
8,Monty Python's Life of Brian,1979
9,Monty Python's Cotractual Obligation Album,1980


It's also easy to convert each row of a CSV file into a `dict` using the `csv` library.

In [8]:
data = (urlopen('http://pythonscraping.com/files/MontyPythonAlbums.csv')
        .read().decode('ascii', 'ignore'))
dataFile = StringIO(data)
dictReader = csv.DictReader(dataFile)
print(dictReader.fieldnames)

list_dict = list(dictReader)
for row in list_dict:
    print(row)

['Name', 'Year']
OrderedDict([('Name', "Monty Python's Flying Circus"), ('Year', '1970')])
OrderedDict([('Name', 'Another Monty Python Record'), ('Year', '1971')])
OrderedDict([('Name', "Monty Python's Previous Record"), ('Year', '1972')])
OrderedDict([('Name', 'The Monty Python Matching Tie and Handkerchief'), ('Year', '1973')])
OrderedDict([('Name', 'Monty Python Live at Drury Lane'), ('Year', '1974')])
OrderedDict([('Name', 'An Album of the Soundtrack of the Trailer of the Film of Monty Python and the Holy Grail'), ('Year', '1975')])
OrderedDict([('Name', 'Monty Python Live at City Center'), ('Year', '1977')])
OrderedDict([('Name', 'The Monty Python Instant Record Collection'), ('Year', '1977')])
OrderedDict([('Name', "Monty Python's Life of Brian"), ('Year', '1979')])
OrderedDict([('Name', "Monty Python's Cotractual Obligation Album"), ('Year', '1980')])
OrderedDict([('Name', "Monty Python's The Meaning of Life"), ('Year', '1983')])
OrderedDict([('Name', 'The Final Rip Off'), ('Yea

Here is an example of how to use `list_dict`. Basically, `list_dict` is a list of independent `OrderedDict` objects, each of which has two keys, "Name" and "Year".

In [9]:
list_dict[0].get("Name")

"Monty Python's Flying Circus"

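Because every row is a mapping keyed by column name, filtering and lookups are short comprehensions. A minimal sketch, assuming three rows typed in from the output above in place of the downloaded file. The names `rows`, `from_1977` and `year_by_name` are illustrative only.

```python
import csv
from io import StringIO

# Header plus three data rows copied from the output above.
sample = (
    "Name,Year\n"
    "Monty Python Live at City Center,1977\n"
    "The Monty Python Instant Record Collection,1977\n"
    "Monty Python's Life of Brian,1979\n"
)
rows = list(csv.DictReader(StringIO(sample)))

# Filter by a column value; "Year" is still a string at this point.
from_1977 = [row["Name"] for row in rows if row["Year"] == "1977"]

# Build a name -> year lookup table, converting the year to int.
year_by_name = {row["Name"]: int(row["Year"]) for row in rows}

print(from_1977)
print(year_by_name["Monty Python's Life of Brian"])
```
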
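Again, it's easier to use `pandas` to create a "records"-style list of dictionaries. Note that each element of `pd_list_dict` is a plain Python `dict` rather than an `OrderedDict`, and that `pandas` parses "Year" as an integer while `csv` keeps it as a string. The differing "Year" values are what matter here, since an `OrderedDict` and a `dict` with the same items compare equal. Therefore, `list_dict` $\ne$ `pd_list_dict`.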

In [10]:
pd_list_dict = df.to_dict("records")
pd_list_dict

[{'Name': "Monty Python's Flying Circus", 'Year': 1970},
 {'Name': 'Another Monty Python Record', 'Year': 1971},
 {'Name': "Monty Python's Previous Record", 'Year': 1972},
 {'Name': 'The Monty Python Matching Tie and Handkerchief', 'Year': 1973},
 {'Name': 'Monty Python Live at Drury Lane', 'Year': 1974},
 {'Name': 'An Album of the Soundtrack of the Trailer of the Film of Monty Python and the Holy Grail',
  'Year': 1975},
 {'Name': 'Monty Python Live at City Center', 'Year': 1977},
 {'Name': 'The Monty Python Instant Record Collection', 'Year': 1977},
 {'Name': "Monty Python's Life of Brian", 'Year': 1979},
 {'Name': "Monty Python's Cotractual Obligation Album", 'Year': 1980},
 {'Name': "Monty Python's The Meaning of Life", 'Year': 1983},
 {'Name': 'The Final Rip Off', 'Year': 1987},
 {'Name': 'Monty Python Sings', 'Year': 1989},
 {'Name': 'The Ultimate Monty Python Rip Off', 'Year': 1994},
 {'Name': 'Monty Python Sings Again', 'Year': 2014}]
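
To see where the two lists actually differ, it helps to compare a single row from each. The sketch below uses one row typed in by hand from each of the outputs above (no download, no `pandas` import). The names `csv_row`, `pd_row` and `converted` are illustrative only.

```python
from collections import OrderedDict

# One row as produced by csv.DictReader in the output above.
csv_row = OrderedDict([("Name", "Monty Python's Flying Circus"), ("Year", "1970")])
# The same row as it appears in pd_list_dict.
pd_row = {"Name": "Monty Python's Flying Circus", "Year": 1970}

# An OrderedDict equals a dict with the same items, so the container type is not the issue.
print(OrderedDict(pd_row) == pd_row)

# The rows differ because of the Year values: '1970' (str) versus 1970 (int).
print(csv_row == pd_row)

# Converting Year to int makes the two rows equal.
converted = {**csv_row, "Year": int(csv_row["Year"])}
print(converted == pd_row)
```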