conda install lxml <br>
conda install beautifulsoup4


# WEB SCRAPING 
(thanks to Cristobal Donoso)

In [1]:
import requests # to obtain html data
import bs4      # beautifulsoup4 to parse the html content

The first step is to get the content of a specific web page. Before scraping, you should read the site's privacy policy section to learn how its data may be used. In this tutorial we are going to use the page of the Chilean Chamber of Deputies.

![hola](./images/privacypolicies.png)
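
Alongside the written policy, many sites publish a ```robots.txt``` file that states which paths automated clients may visit. The following is a minimal sketch of how such rules could be checked with the standard library; the rules and the ```example.org``` URLs are made up for illustration and are not taken from the page used in this tutorial.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules from a list of lines instead of downloading them

# Ask whether a generic client ('*') may fetch each URL
allowed = parser.can_fetch('*', 'https://example.org/public/page.html')
blocked = parser.can_fetch('*', 'https://example.org/private/page.html')
print(allowed, blocked)
```

For a real site, ```set_url()``` followed by ```read()``` downloads and parses the file before calling ```can_fetch()```.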

In [2]:
res = requests.get('https://www.camara.cl/trabajamos/sala_votaciones.aspx')

In [3]:
res

<Response [200]>

Once we have obtained the content of the page, we can access the raw HTML as plain text through the ```text``` attribute

In [4]:
res.text

'\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<head id="ctl00_Head1"><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><title>\r\n\tCámara de Diputados de Chile\r\n</title><link rel="stylesheet" type="text/css" href="/common/styles/main.css" media="screen" /><link rel="stylesheet" type="text/css" href="/common/styles/print.css" media="print" />\r\n\r\n    <script type="text/javascript" src="/common/scripts/jquery-1.2.6.min.js"></script>\r\n\r\n    <script type="text/javascript" src="/common/scripts/main.js"></script>\r\n\r\n   \r\n\r\n    <link rel="shortcut icon" href="/media/images/favicon.ico" /><link href="/WebResource.axd?d=pPzMpXLOsVTlHg-dfa2oezlyxtxfyJankOAeiJ8eRUQd5BFIUN4aXG0dZiMTW-Yw2byc2_X86KJrRXtd5TCIrDvH5YM7Njz5P0X40whzQ-np6-cOtlR5C2kcYbmtRSYYpkEcVu5qnoKJJ2ihbz8ba02ZmBt_hpYh2AuzTUMGzl41&amp;t=633398852440000000" type="text/css"

You can see the entire HTML content (as plain text) returned by the request. To parse this document and extract information from it, we use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).<br>
Calling ```bs4.BeautifulSoup(text, parser)``` creates an object that gives us access to all the tools BeautifulSoup provides. The second argument names the parser: ```'lxml'``` is an easy-to-use library for processing XML and HTML in Python.

In [5]:
soup = bs4.BeautifulSoup(res.text, 'lxml')

It is important to know the HTML syntax. To explore the structure, we need to **inspect** the web page using the browser's developer tools. A direct way is to right-click the item and choose Inspect.
![a](./images/navigation.png)

Now we can select whatever we need from the HTML tags. For example, we can select **the table** section. The ```select()``` method takes a CSS selector and returns a list of the elements that match it.

We use:
+ ```.``` to refer to **class names**
+ ```#``` for **ids**
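
A minimal sketch of both selector forms, using a made-up HTML fragment (the ```id="main"``` and the ```other``` class are assumptions for illustration) and the built-in ```'html.parser'``` so that it runs without lxml:

```python
import bs4

# Hypothetical HTML fragment, only for illustration
html = """
<div id="main">
  <table class="tabla"><tr><td>first</td></tr></table>
  <table class="other"><tr><td>second</td></tr></table>
</div>
"""
soup_demo = bs4.BeautifulSoup(html, 'html.parser')

by_class = soup_demo.select('.tabla')           # list of elements with class "tabla"
by_id = soup_demo.select('#main')               # list of elements with id "main"
first_cell = soup_demo.select_one('.tabla td')  # first match only, or None if there is none

print(len(by_class), by_id[0].name, first_cell.text)
```

```select()``` always returns a list, even for a single match, while ```select_one()``` returns the element itself.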

In [6]:
table = soup.select('.tabla') # From the HTML we selected the class name "tabla" 
print(table)

[<table class="tabla">
<thead>
<tr>
<th style="width:10%">Fecha</th>
<th>Documento</th>
<th>Materia</th>
<th>Artículo</th>
<th>Tipo</th>
<th>Resultado</th>
<th>Afir.</th>
<th>Neg.</th>
<th>Abst.</th>
<th>Detalle</th>
</tr>
</thead>
<tbody>
<tr>
<td>22 de ago de 2019 - 12:9</td>
<td>Boletín N° 12043-05</td>
<td>Moderniza la legislación tributaria</td>
<td>Artículo cuadragésimo segundo transitorio, cuya votación...</td>
<td>PARTICULAR</td>
<td>APROBADO</td>
<td>93</td>
<td>53</td>
<td>0</td>
<td><a href="sala_votacion_detalle.aspx?prmID=31491">Ver</a></td>
</tr>
<tr>
<td>22 de ago de 2019 - 12:8</td>
<td>Boletín N° 12043-05</td>
<td>Moderniza la legislación tributaria</td>
<td>artículo cuadragésimo primero transitorio, cuya votación...</td>
<td>PARTICULAR</td>
<td>APROBADO</td>
<td>146</td>
<td>0</td>
<td>0</td>
<td><a href="sala_votacion_detalle.aspx?prmID=31494">Ver</a></td>
</tr>
<tr>
<td>22 de ago de 2019 - 12:8</td>
<td>P. Resolución N° 715</td>
<td>Solicita a S. E. el Presidente de

Alternatively, we can reach a tag by accessing it as an attribute of the soup, which returns the first tag with that name (a single HTML object, not a list as above). To strip the tags and recover the text of such an object, use the ```.get_text()``` method or the ```.text``` attribute

In [7]:
header = soup.thead
print(header.text)



Fecha
Documento
Materia
Artículo
Tipo
Resultado
Afir.
Neg.
Abst.
Detalle




The complete documentation of **BeautifulSoup** can be found [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Feel free to explore and use all of its powerful methods!

### Extracting Data from Chilean Deputies Web Page

The ```find_all()``` method scans the entire document looking for results, but sometimes you only want to find one result; in those cases it is convenient to use the ```find()``` method
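
The difference can be sketched on a tiny, made-up HTML fragment (the list items below are illustrative only). Note what each method returns when nothing matches:

```python
import bs4

# Hypothetical HTML fragment, only for illustration
html = '<ul><li class="a">one</li><li class="a">two</li></ul>'
soup_demo = bs4.BeautifulSoup(html, 'html.parser')

all_items = soup_demo.find_all('li', attrs={'class': 'a'})  # list-like result with every match
first_item = soup_demo.find('li', attrs={'class': 'a'})     # only the first match
missing_one = soup_demo.find('table')                        # None when nothing matches
missing_all = soup_demo.find_all('table')                    # empty list when nothing matches

texts = [li.text for li in all_items]
print(texts, first_item.text, missing_one, len(missing_all))
```

Because ```find()``` can return ```None```, it is worth checking the result before calling methods on it.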

In [8]:
table = soup.find('table', attrs={'class': 'tabla'})
print(type(table))

<class 'bs4.element.Tag'>


Now we can iterate over table elements.
![a](./images/tablehtml.png)

In [9]:
for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.text)
    print('-'*20)

--------------------
22 de ago de 2019 - 12:9
Boletín N° 12043-05
Moderniza la legislación tributaria
Artículo cuadragésimo segundo transitorio, cuya votación...
PARTICULAR
APROBADO
93
53
0
Ver
--------------------
22 de ago de 2019 - 12:8
Boletín N° 12043-05
Moderniza la legislación tributaria
artículo cuadragésimo primero transitorio, cuya votación...
PARTICULAR
APROBADO
146
0
0
Ver
--------------------
22 de ago de 2019 - 12:8
P. Resolución N° 715
Solicita a S. E. el Presidente de la República que...


UNANIME
146
0
0
Ver
--------------------
22 de ago de 2019 - 12:7
Boletín N° 12043-05
Moderniza la legislación tributaria
Artículo cuadragésimo transitorio, cuya votación separada...
PARTICULAR
APROBADO
144
0
1
Ver
--------------------
22 de ago de 2019 - 11:54
Boletín N° 12043-05
Moderniza la legislación tributaria
Artículo trigésimo octavo transitorio, cuya votación...
PARTICULAR
APROBADO
142
4
1
Ver
--------------------
22 de ago de 2019 - 11:53
Boletín N° 12043-05
Moderniza la leg

We also need to keep the table header

In [10]:
header_raw = table.thead.text
print(header_raw)
print(type(header_raw))



Fecha
Documento
Materia
Artículo
Tipo
Resultado
Afir.
Neg.
Abst.
Detalle


<class 'str'>


In [11]:
header = header_raw.split('\n')[2:-2]
print(header)

['Fecha', 'Documento', 'Materia', 'Artículo', 'Tipo', 'Resultado', 'Afir.', 'Neg.', 'Abst.', 'Detalle']
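
Slicing the split text with ```[2:-2]``` depends on the exact number of blank lines around the header. As an alternative sketch (not the approach used in this notebook), the header cells can be read one by one; the two-column fragment below is a shortened, hypothetical copy of the real header:

```python
import bs4

# Shortened, hypothetical copy of the table header
html = """<table><thead>
<tr>
<th>Fecha</th>
<th>Documento</th>
</tr>
</thead></table>"""
table_demo = bs4.BeautifulSoup(html, 'html.parser').table

# One entry per <th>, with surrounding whitespace stripped
header_demo = [th.get_text(strip=True) for th in table_demo.thead.find_all('th')]
print(header_demo)
```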


### Pre-processing text 

At this point, we have collected a lot of unstructured data (plain text). To turn this free text into structured fields and relations, we need to parse it. In particular, we need a Python module for working with regular expressions: [re](https://docs.python.org/3/library/re.html).<br><br>
```re``` is part of the Python standard library, so there is nothing to install. We can use it with ```import re```

In [12]:
import re
re.__version__

'2.2.1'

![](./images/regex_reference.png)

We will write general regex patterns to find substrings inside the text. To create a pattern object, use ```re.compile(S)```, where S is a string written in regex syntax (see the Regex quick reference image).

First, we save the strings in python lists

In [13]:
table_list = []
for bar in table.find_all('tr'):
    row = []
    for foo in bar.find_all('td'):
        row.append(foo.text)
    table_list.append(row)
table_list = table_list[1:] # the first <tr> is the header row; it has no <td> cells, so its list is empty
print(len(table_list))

20


Next, we will split the first element of each row (the schedule) into date and time

In [14]:
table_list

[['22 de ago de 2019 - 12:9',
  'Boletín N° 12043-05',
  'Moderniza la legislación tributaria',
  'Artículo cuadragésimo segundo transitorio, cuya votación...',
  'PARTICULAR',
  'APROBADO',
  '93',
  '53',
  '0',
  'Ver'],
 ['22 de ago de 2019 - 12:8',
  'Boletín N° 12043-05',
  'Moderniza la legislación tributaria',
  'artículo cuadragésimo primero transitorio, cuya votación...',
  'PARTICULAR',
  'APROBADO',
  '146',
  '0',
  '0',
  'Ver'],
 ['22 de ago de 2019 - 12:8',
  'P. Resolución N° 715',
  'Solicita a S. E. el Presidente de la República que...',
  '',
  '',
  'UNANIME',
  '146',
  '0',
  '0',
  'Ver'],
 ['22 de ago de 2019 - 12:7',
  'Boletín N° 12043-05',
  'Moderniza la legislación tributaria',
  'Artículo cuadragésimo transitorio, cuya votación separada...',
  'PARTICULAR',
  'APROBADO',
  '144',
  '0',
  '1',
  'Ver'],
 ['22 de ago de 2019 - 11:54',
  'Boletín N° 12043-05',
  'Moderniza la legislación tributaria',
  'Artículo trigésimo octavo transitorio, cuya votación..

In [15]:
# sche = []
# for h in table_list:
#     sche.append(h[0])
#=======================================
schedules = [h[0] for h in table_list]

In [16]:
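# raw strings (r'...') keep the regex backslashes intact
date_pattern = re.compile(r'\d\d?\s.*(20)\d{2}')
# time_pattern = re.compile(r'\d{2}:\d{2}')
# the minutes may have a single digit (e.g. 12:9), so we accept one or more
time_pattern = re.compile(r'\d{2}:\d{1,}')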


There are two main methods to find substrings, and both return a match object (or ```None``` when nothing matches):
- ```re.match()``` checks for a match only at the beginning of the string
- ```re.search()``` checks for a match anywhere in the string

On the resulting match object, ```group()``` returns the entire match.
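
A short sketch of the difference, using one of the schedule strings from the table. The time pattern here is a slightly relaxed variant written for this example, not the one used in the notebook:

```python
import re

sch = '22 de ago de 2019 - 12:9'
date_pattern_demo = re.compile(r'\d\d?\s.*(20)\d{2}')
time_pattern_demo = re.compile(r'\d{1,2}:\d{1,2}')  # illustrative variant

date_match = date_pattern_demo.match(sch)   # the date sits at the start, so match() finds it
time_at_start = time_pattern_demo.match(sch)  # the time is not at the start: None
time_anywhere = time_pattern_demo.search(sch) # search() scans the whole string

print(date_match.group())
print(time_at_start)
print(time_anywhere.group())
```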

In [17]:
dates = []
times = []
for sch in schedules:
#     print(schedules)
    date = date_pattern.match(sch).group()
    time = time_pattern.search(sch)
    if time:
        time = time.group()

    dates.append(date)
    times.append(time)

In [18]:
dates

['22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019',
 '22 de ago de 2019']

The [re module](https://docs.python.org/3/library/re.html) offers many more tools for this kind of text processing.
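
As one possible next step, the date and time pieces could be combined into a ```datetime``` object using named groups. This is only a sketch: the helper ```parse_schedule``` and the ```MONTHS``` table are hypothetical, and the month abbreviations other than ```ago``` are assumed rather than observed in the scraped data.

```python
import re
from datetime import datetime

# Assumed Spanish month abbreviations; only 'ago' appears in the scraped data
MONTHS = {'ene': 1, 'feb': 2, 'mar': 3, 'abr': 4, 'may': 5, 'jun': 6,
          'jul': 7, 'ago': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dic': 12}

schedule_pattern = re.compile(
    r'(?P<day>\d{1,2}) de (?P<month>\w{3}) de (?P<year>\d{4})'
    r' - (?P<hour>\d{1,2}):(?P<minute>\d{1,2})'
)

def parse_schedule(text):
    """Return a datetime for strings like '22 de ago de 2019 - 12:9', or None."""
    m = schedule_pattern.match(text)
    if m is None or m.group('month') not in MONTHS:
        return None
    return datetime(int(m.group('year')), MONTHS[m.group('month')],
                    int(m.group('day')), int(m.group('hour')),
                    int(m.group('minute')))

parsed = parse_schedule('22 de ago de 2019 - 12:9')
print(parsed)
```

Returning ```None``` for unrecognized strings keeps the loop over schedules from stopping on a malformed row.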

### Formatting pandas tables

In [19]:
import pandas as pd

In [20]:
header

['Fecha',
 'Documento',
 'Materia',
 'Artículo',
 'Tipo',
 'Resultado',
 'Afir.',
 'Neg.',
 'Abst.',
 'Detalle']

In [21]:
new_table = pd.DataFrame()
new_table['Fecha'] = dates
new_table['Hora'] = times
for i, he in enumerate(header[1:-1]):
    new_table[he] = [h[i+1] for h in table_list]

In [22]:
new_table.head(10)

| | Fecha | Hora | Documento | Materia | Artículo | Tipo | Resultado | Afir. | Neg. | Abst. |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 de ago de 2019 | 12:9 | Boletín N° 12043-05 | Moderniza la legislación tributaria | Artículo cuadragésimo segundo transitorio, cuy... | PARTICULAR | APROBADO | 93 | 53 | 0 |
| 1 | 22 de ago de 2019 | 12:8 | Boletín N° 12043-05 | Moderniza la legislación tributaria | artículo cuadragésimo primero transitorio, cuy... | PARTICULAR | APROBADO | 146 | 0 | 0 |
| 2 | 22 de ago de 2019 | 12:8 | P. Resolución N° 715 | Solicita a S. E. el Presidente de la República... | | | UNANIME | 146 | 0 | 0 |
| 3 | 22 de ago de 2019 | 12:7 | Boletín N° 12043-05 | Moderniza la legislación tributaria | Artículo cuadragésimo transitorio, cuya votaci... | PARTICULAR | APROBADO | 144 | 0 | 1 |
| 4 | 22 de ago de 2019 | 11:54 | Boletín N° 12043-05 | Moderniza la legislación tributaria | Artículo trigésimo octavo transitorio, cuya vo... | PARTICULAR | APROBADO | 142 | 4 | 1 |
| 5 | 22 de ago de 2019 | 11:53 | Boletín N° 12043-05 | Moderniza la legislación tributaria | Artículo trigésimo primero transitorio, cuya v... | PARTICULAR | APROBADO | 91 | 56 | 0 |
| 6 | 22 de ago de 2019 | 11:52 | Boletín N° 12043-05 | Moderniza la legislación tributaria | Artículo vigésimo séptimo transitorio, cuya vo... | PARTICULAR | APROBADO | 84 | 63 | 0 |
| 7 | 22 de ago de 2019 | 11:51 | Boletín N° 12043-05 | Moderniza la legislación tributaria | Artículo décimo noveno transitorio, cuya votac... | PARTICULAR | APROBADO | 91 | 56 | 0 |
| 8 | 22 de ago de 2019 | 11:50 | Boletín N° 12043-05 | Moderniza la legislación tributaria | Artículo décimo octavo transitorio, cuya votac... | PARTICULAR | APROBADO | 90 | 57 | 0 |
| 9 | 22 de ago de 2019 | 11:49 | Boletín N° 12043-05 | Moderniza la legislación tributaria | Artículo décimo séptimo transitorio, cuya vota... | PARTICULAR | APROBADO | 91 | 56 | 0 |


In [23]:
len(new_table)

20
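
The vote counts were scraped as strings, so they need converting before any arithmetic. A minimal sketch on a two-row sample copied from the table above (the ```Total``` column is an addition made for this example):

```python
import pandas as pd

# Two-row sample with the vote counts still stored as strings
demo = pd.DataFrame({'Afir.': ['93', '146'],
                     'Neg.': ['53', '0'],
                     'Abst.': ['0', '0']})

vote_cols = ['Afir.', 'Neg.', 'Abst.']
demo[vote_cols] = demo[vote_cols].apply(pd.to_numeric)  # convert each column to numbers
demo['Total'] = demo[vote_cols].sum(axis=1)             # hypothetical extra column

print(demo.dtypes)
print(demo)
```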

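The last column holds a link to the detail of each vote, so we can loop over those links and open each one. First, we load the page that lists the deputies, so that we can match the names that appear in each vote.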

In [24]:
res = requests.get('https://www.camara.cl/camara/diputados.aspx#tab')
soup2 = bs4.BeautifulSoup(res.text, 'lxml')

In [25]:
soup2

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head id="ctl00_Head1"><meta content="text/html;charset=utf-8" http-equiv="Content-Type"/><title>
	Cámara de Diputados
</title><link href="/common/styles/main.css" media="screen" rel="stylesheet" type="text/css"/><link href="/common/styles/print.css" media="print" rel="stylesheet" type="text/css"/>
<script src="/common/scripts/jquery-1.2.6.min.js" type="text/javascript"></script>
<script src="/common/scripts/main.js" type="text/javascript"></script>
<link href="/media/images/favicon.ico" rel="shortcut icon"/></head>
<script type="text/javascript">

    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-34027487-1']);
    _gaq.push(['_trackPageview']);

    (function () {
        var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
        ga.src = ('https:' == document.location.pro

We search for the names of deputies

In [26]:
foo = soup2.find('ul', {'class':'diputados'})
pattern_dip = re.compile(r'^\w*')
names_dip = {} # we store the deputies' names in a dict, to have (name, index) pairs
for index, value in enumerate(foo.find_all('h5')):
    name = value.find('a').text
    # drop the first word and normalize the whitespace between the rest
    name = " ".join(name.split()[1:])
    names_dip[name] = index

In [27]:
names_dip

{'Florcita Alarcón': 0,
 'Jorge Alessandri': 1,
 'René Alinco': 2,
 'Sebastián Álvarez': 3,
 'Jenny Álvarez': 4,
 'Pedro Pablo Alvarez-Salamanca': 5,
 'Sandra Amar': 6,
 'Gabriel Ascencio': 7,
 'Pepe Auth': 8,
 'Nino Baltolu': 9,
 'Boris Barrera': 10,
 'Ramón Barros': 11,
 'Jaime Bellolio': 12,
 'Bernardo Berger': 13,
 'Alejandro Bernales': 14,
 'Karim Bianchi': 15,
 'Sergio Bobadilla': 16,
 'Gabriel Boric': 17,
 'Jorge Brito': 18,
 'Miguel Ángel Calisto': 19,
 'Karol Cariola': 20,
 'Álvaro Carter': 21,
 'Loreto Carvajal': 22,
 'Natalia Castillo': 23,
 'José Miguel Castro': 24,
 'Juan Luis Castro': 25,
 'Ricardo Celis': 26,
 'Andrés Celis': 27,
 'Daniella Cicardini': 28,
 'Sofía Cid': 29,
 'Juan Antonio Coloma': 30,
 'Miguel Crispi': 31,
 'Luciano Cruz-Coke': 32,
 'Catalina Del Real': 33,
 'Mario Desbordes': 34,
 'Marcelo Díaz': 35,
 'Jorge Durán': 36,
 'Eduardo Durán': 37,
 'Francisco Eguiguren': 38,
 'Fidel Espinoza': 39,
 'Maya Fernández': 40,
 'Iván Flores': 41,
 'Camila Flores': 4

In [28]:
keys = list(names_dip.values()) # the position assigned to each deputy (the dict values)

In [29]:
import numpy as np

In [32]:
# tabla2[3]

In [33]:
base = 'https://www.camara.cl/trabajamos/'
favor = []
contra = []
abstencion = []
for index, a in enumerate(table.find_all('a')):
    print(a)
    url = base+a['href']
    res = requests.get(url)
    soup2 = bs4.BeautifulSoup(res.text, 'lxml')
    tabla2 = soup2.find_all('div', {'class':'stress'})
    partial_vote = []
    for t in tabla2[1:-2]:
        # one 0/1 vector per block of names (in favor, against, abstention)
        votes_partial = np.zeros(len(names_dip)) # a vector with 0 for each diputado
        for aa in t.find_all('a'):
            list_name = aa.text.split()
            # rebuild the name in the order used for the keys of names_dip
            if list_name[0] == 'Del':
                m = list_name[-1]+' '+list_name[0]+' '+list_name[1]
            else:
                m = list_name[-1]+' '+list_name[0]
            try:
                iloc = names_dip[m]
                votes_partial[iloc] = 1
            except KeyError:
                # skip names that are not keys of names_dip
                continue
        partial_vote.append(votes_partial)
    favor.append(partial_vote[0])
    contra.append(partial_vote[1])
    abstencion.append(partial_vote[2])

<a href="sala_votacion_detalle.aspx?prmID=31491">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31494">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31490">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31489">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31488">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31487">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31486">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31485">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31484">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31483">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31482">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31481">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31480">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31479">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31478">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31477">Ver</a>
<a href="sala_votacion_detalle.aspx?prmID=31476">Ver</a>
<a href="sala_votacion_detalle.
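
The detail links are relative, which is why the loop prepends a base URL. As an alternative sketch, ```urllib.parse.urljoin``` resolves a relative ```href``` against the page it came from, and ```parse_qs``` can pull out the ```prmID``` value; the variable names below are chosen for this example.

```python
from urllib.parse import urljoin, urlparse, parse_qs

page_url = 'https://www.camara.cl/trabajamos/sala_votaciones.aspx'
href = 'sala_votacion_detalle.aspx?prmID=31491'

# Resolve the relative link against the URL of the page that contains it
detail_url = urljoin(page_url, href)

# Read the vote identifier from the query string
vote_id = parse_qs(urlparse(detail_url).query)['prmID'][0]

print(detail_url)
print(vote_id)
```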

In [34]:
votes_contra = pd.DataFrame(index=names_dip.keys())
for label, votes in zip(new_table['Documento'],contra):
    votes_contra[label] = votes
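
One thing to watch in this step: several votes share the same ```Documento``` label, and assigning a column with a label that already exists replaces the earlier column. The sketch below reproduces the effect on made-up data (the deputy names and vote vectors are hypothetical) and shows one possible way to keep every vote, by prefixing the label with its position.

```python
import pandas as pd

names = ['Deputy A', 'Deputy B']   # hypothetical index
labels = ['Boletín N° 12043-05', 'Boletín N° 12043-05', 'P. Resolución N° 715']
votes = [[1, 0], [0, 0], [0, 1]]   # hypothetical 0/1 vectors, one per vote

overwritten = pd.DataFrame(index=names)
for label, v in zip(labels, votes):
    overwritten[label] = v         # a repeated label replaces the earlier column

unique = pd.DataFrame(index=names)
for i, (label, v) in enumerate(zip(labels, votes)):
    unique[f'{i}: {label}'] = v    # the position prefix makes every label distinct

print(overwritten.shape, unique.shape)
```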

In [36]:
# votes_contra["Boletín N° 12043-05"].min()

In [37]:
votes_contra

# Another Example

In [38]:
import time

In [39]:
html = requests.get('https://en.wikipedia.org/wiki/Lists_of_science_fiction_films')
b = bs4.BeautifulSoup(html.text, 'lxml')

In [40]:
links = []
# in this case, all of the links were inside '<li>' tags.
for i in b.find_all(name = 'li'):
    # pull the actual link for each one
    for link in i.find_all('a', href=True):
        links.append(link['href'])
# select only the ten decade pages (1920s to 2010s)
links = links[1:11]
# each link only returns something like '/wiki/List_of_science_fiction_films_of_the_1920s'
# so I add the other part of the URL to each.
decade_links = ['https://en.wikipedia.org' + i for i in links]

In [41]:
decade_links

['https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_1920s',
 'https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_1930s',
 'https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_1940s',
 'https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_1950s',
 'https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_1960s',
 'https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_1970s',
 'https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_1980s',
 'https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_1990s',
 'https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_2000s',
 'https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_2010s']
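
Selecting by position with ```links[1:11]``` works only while the page layout stays the same. As an alternative sketch, the decade links could be picked by their URL pattern; the ```hrefs``` list below is a small made-up sample, not the real scraped list.

```python
import re

# Small hypothetical sample of scraped hrefs
hrefs = ['/wiki/Science_fiction_film',
         '/wiki/List_of_science_fiction_films_of_the_1920s',
         '/wiki/List_of_science_fiction_films_of_the_1930s',
         '/wiki/List_of_science_fiction_films_of_the_1920s',  # repeated link
         '/wiki/Main_Page']

decade_pattern = re.compile(r'^/wiki/List_of_science_fiction_films_of_the_\d{4}s$')

seen = set()
decade_hrefs = []
for h in hrefs:
    # keep only decade pages, and only the first time each one appears
    if decade_pattern.match(h) and h not in seen:
        seen.add(h)
        decade_hrefs.append(h)

decade_urls = ['https://en.wikipedia.org' + h for h in decade_hrefs]
print(decade_urls)
```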

In [42]:
# create two new lists, one for the title of the page, 
# and one for the link to the page
film_titles = []
film_links = []
# for loop to pull from each decade page with list of films.
# look at https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_1920s
# to follow along as an example
for decade in decade_links[-2:]:
    print(f'Collecting films from {decade}')
    html = requests.get(decade)
    b = bs4.BeautifulSoup(html.text, 'lxml')
    # get to the table on the page
    for i in b.find_all(name='table', class_='wikitable'):
        # get to the row of each film
        for j in i.find_all(name='tr'):
            #get just the title cell for each row.
            # contains the title and the URL
            for k in j.find_all(name='i'):
                # get within that cell to just get the words
                for link in k.find_all('a', href=True):
                    # get the title and add to the list
                    film_titles.append(link['title'])
                    # get the link and add to that list
                    film_links.append(link['href'])
    #be a conscientious scraper and pause between scrapes
    time.sleep(1)
print(f'Number of Film Links Collected: {len(film_links)}')
print(f'Number of Film Titles Collected: {len(film_titles)}')
# remove film links that don't have a description page on Wikipedia
new_film_links = [i for i in film_links if 'redlink' not in i]
# same goes for titles
new_film_titles = [i for i in film_titles if '(page does not exist)' not in i]
print(f'Number of Film Links with Wikipedia Pages: {len(new_film_links)}')
print(f'Number of Film Titles with Wikipedia Pages: {len(new_film_titles)}')
#use this list to fetch from the API
title_links = list(zip(new_film_titles, new_film_links))

Collecting films from https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_2000s
Collecting films from https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_2010s
Number of Film Links Collected: 664
Number of Film Titles Collected: 664
Number of Film Links with Wikipedia Pages: 653
Number of Film Titles with Wikipedia Pages: 653
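
Filtering the links and the titles in two separate passes keeps them aligned only if both filters drop exactly the same entries. A sketch of a variant that filters the pairs together, on made-up sample data (the ```_demo``` names are hypothetical):

```python
# Hypothetical sample data with one missing page
film_titles_demo = ['Film A', 'Film B (page does not exist)', 'Film C']
film_links_demo = ['/wiki/Film_A',
                   '/w/index.php?title=Film_B&action=edit&redlink=1',
                   '/wiki/Film_C']

# Keep a (title, link) pair only when neither part marks a missing page
title_links_demo = [
    (title, link)
    for title, link in zip(film_titles_demo, film_links_demo)
    if 'redlink' not in link and '(page does not exist)' not in title
]
print(title_links_demo)
```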


In [43]:
title = []
director = []
actors = []
country = []
notes = []

In [44]:
for decade in decade_links[-2:]:
    print(f'Collecting films from {decade}')
    html = requests.get(decade)
    b = bs4.BeautifulSoup(html.text, 'lxml')
    # get to the table on the page
    for i in b.find_all(name='table', class_='wikitable'):
        for row in i.find_all('tr'):
#             print(row)
            cells=row.find_all('td')
            if len(cells)==5:
                title.append(cells[0].find(text=True))
                director.append(cells[1].find(text=True))
                x = cells[2].find_all(text=True)
                print(x)
                y = [s for s in x if len(s) > 2]
                print(y)
                actors.append(y)
                country.append(cells[3].find("a",href=True)['title'])
                print(cells[3].find("a",href=True))
                notes.append(cells[4].find(text=True).rstrip('\n'))
#                 break

Collecting films from https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_2000s
['Arnold Schwarzenegger', ', ', 'Tony Goldwyn', ', ', 'Michael Rapaport']
['Arnold Schwarzenegger', 'Tony Goldwyn', 'Michael Rapaport']
<a href="/wiki/United_States" title="United States"><img alt="United States" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a>
['Jean-Paul Belmondo', ', ', 'Arielle Dombasle', ', ', 'Patrick Bouchitey']
['Jean-Paul Belmondo', 'Arielle Dombasle', 'Patrick Bouchitey']
<a href="/wiki/France" title="France"><img alt="France" class="thumb

In [45]:
cells[3].find("a",href=True)["title"]

'Hong Kong'

In [46]:
header = b.find(name='table', class_='wikitable').find_all('th')[0:5]

In [47]:
header[0].text

'Title'

In [48]:
df=pd.DataFrame(title,columns=[header[0].text])
df[header[1].text]=director
df[header[2].text]=actors
df[header[3].text]=country
df[header[4].text]=notes

In [49]:
df

| | Title | Director | Cast | Country | Subgenre/Notes |
|---|---|---|---|---|---|
| 0 | The 6th Day | Roger Spottiswoode | [Arnold Schwarzenegger, Tony Goldwyn, Michael ... | United States | Action |
| 1 | Amazone | Philippe de Broca | [Jean-Paul Belmondo, Arielle Dombasle, Patrick... | France | Adventure |
| 2 | Battlefield Earth | Roger Christian | [John Travolta, Barry Pepper, Forest Whitaker] | United States | Action |
| 3 | The Cell | Tarsem Singh | [Jennifer Lopez, Vince Vaughn, Vincent D'Onofrio] | United States | Psychological thriller |
| 4 | Escaflowne | Kazuki Akane | [] | Japan | Action |
| 5 | Godzilla vs. Megaguirus | Masaaki Tezuka | [Misako Tanaka, Shosuke Tanihara, Masato Ibu] | Japan | Monster movie |
| 6 | Happy Accidents | Brad Anderson | [Marisa Tomei, Vincent D'Onofrio] | United States | Romantic |
| 7 | Hollow Man | Paul Verhoeven | [Kevin Bacon, Elisabeth Shue, Josh Brolin] | United States | Action thriller |
| 8 | I.K.U. | Shu Lea Cheang | [Aja] | Japan | Adult film |
| 9 | The Last Man | Harry Ralston | [David Arnott, Jeri Ryan, Dan Montgomery] | United States | Post apocalyptic |


In [52]:
df.to_csv("peliculas.csv",index=False)

In [53]:
df_nuevo = pd.read_csv("peliculas.csv")

In [54]:
df_nuevo

| | Title | Director | Cast | Country | Subgenre/Notes |
|---|---|---|---|---|---|
| 0 | The 6th Day | Roger Spottiswoode | ['Arnold Schwarzenegger', 'Tony Goldwyn', 'Mic... | United States | Action |
| 1 | Amazone | Philippe de Broca | ['Jean-Paul Belmondo', 'Arielle Dombasle', 'Pa... | France | Adventure |
| 2 | Battlefield Earth | Roger Christian | ['John Travolta', 'Barry Pepper', 'Forest Whit... | United States | Action |
| 3 | The Cell | Tarsem Singh | ['Jennifer Lopez', 'Vince Vaughn', "Vincent D'... | United States | Psychological thriller |
| 4 | Escaflowne | Kazuki Akane | [] | Japan | Action |
| 5 | Godzilla vs. Megaguirus | Masaaki Tezuka | ['Misako Tanaka', 'Shosuke Tanihara', 'Masato ... | Japan | Monster movie |
| 6 | Happy Accidents | Brad Anderson | ['Marisa Tomei', "Vincent D'Onofrio"] | United States | Romantic |
| 7 | Hollow Man | Paul Verhoeven | ['Kevin Bacon', 'Elisabeth Shue', 'Josh Brolin'] | United States | Action thriller |
| 8 | I.K.U. | Shu Lea Cheang | ['Aja'] | Japan | Adult film |
| 9 | The Last Man | Harry Ralston | ['David Arnott', 'Jeri Ryan', 'Dan Montgomery'] | United States | Post apocalyptic |
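
Notice that after the round trip through CSV the ```Cast``` values are strings that only look like lists. A minimal sketch of how they could be turned back into real lists with ```ast.literal_eval```, using the standard ```csv``` module and an in-memory file so that it runs on its own (the one-row sample is for illustration):

```python
import ast
import csv
import io

# Write one sample row to an in-memory CSV; the list is stored as its text form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Title', 'Cast'])
writer.writerow(['The 6th Day', ['Arnold Schwarzenegger', 'Tony Goldwyn']])

# Read it back: every field comes back as a string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_cast = rows[0]['Cast']

# literal_eval safely evaluates the text of a Python literal
cast = ast.literal_eval(raw_cast)
print(type(raw_cast).__name__, type(cast).__name__, cast)
```

With pandas, the same function can be passed to ```read_csv``` through its ```converters``` parameter, for example ```converters={'Cast': ast.literal_eval}```.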
