# <span style="color:orange">Scraping the web with Python </span>

## Why should we scrape the web? There are around 6 billion indexed web pages, and they are full of data that can be turned into an innovative business. So let's scrape the web!

### In these tutorials, we will scrape the web with Python and a few of its packages. Let's program!

In [44]:
# Import packages
from bs4 import BeautifulSoup
import requests
import pandas as pd

### As I am a sci-fi aficionado, let's scrape a Wikipedia list of movies from the 1960s and put it into a pandas DataFrame! I am sure we can get some information out of it. For example: why were more movies produced in a specific year?
<img src="https://media.giphy.com/media/3o85xp0wP6CJarLIti/source.gif"
     alt="2020"
     style="float: left; margin-right: 10px;" />
     

### It is good practice to download the web page and keep a local copy, especially when you work with dynamic pages, whose content may change between visits.

In [45]:
# Set the URL and Download it
scifi_url = 'https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_1960s'
download_url = requests.get(scifi_url)
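The download above assumes the request succeeds. A slightly more defensive version could look like the sketch below. The helper name `fetch_html` and its `get` parameter are hypothetical and not part of the original notebook; `timeout` and `raise_for_status()` are standard `requests` features.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Download a page and return its HTML text.

    `fetch_html` and its `get` parameter are hypothetical names for this
    sketch; `get` defaults to requests.get and exists only so the function
    can be tried without network access.
    """
    # Without a timeout, a stalled server could block the notebook indefinitely.
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# html = fetch_html("https://en.wikipedia.org/wiki/List_of_science_fiction_films_of_the_1960s")
```

With this helper, the text passed to Beautiful Soup in the next step would be `html` instead of `download_url.text`.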

### Beautiful Soup works with objects, so we need to create one from the web page we just downloaded. Let's parse it and create the object.

In [46]:
# Creating the object and the local copy.
soup = BeautifulSoup(download_url.text, features="html.parser")

with open("download_scifi.html", "w", encoding="utf-8") as file:
    file.write(soup.prettify())
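Once the local copy exists, later sessions can parse it without downloading the page again. The sketch below shows one way to do that; `load_local_soup` is a hypothetical helper name, not something from the original notebook.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def load_local_soup(path):
    """Parse a previously saved HTML file (hypothetical helper for this sketch)."""
    html = Path(path).read_text(encoding="utf-8")
    return BeautifulSoup(html, features="html.parser")


# Usage, assuming the file written in the cell above exists:
# soup = load_local_soup("download_scifi.html")
```

Because the file was saved with `prettify()`, text nodes may carry extra whitespace and line breaks, so stripping the text when reading cells remains useful.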

In [47]:
print(soup)  # Let's print our page! We also have a local copy of it.

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of science fiction films of the 1960s - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YCWaOvTxzvMjnv6SL8D82QAAAMY","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_science_fiction_films_of_the_1960s","wgTitle":"List of science fiction films of the 1960s","wgCurRevisionId":999500803,"wgRevisionId":999500803,"wgArticleId":13247069,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: long volume value","CS1 errors: missing periodical","CS1 maint: multiple names: auth

### The information we need is located inside a table, which on this page is an element with the class "wikitable". We have to filter the page down to that table to get cleaner results.

In [48]:
# Select the body of the first wikitable
table_60 = soup.select("table.wikitable tbody")[0]
print(table_60)

<tbody><tr>
<td colspan="7" scope="row" style="text-align:left; background:#e9e9e9"><a href="/wiki/1960_in_film" title="1960 in film"><span id="1960">1960</span></a>
</td></tr>
<tr>
<th>Title</th>
<th>Director</th>
<th>Cast</th>
<th>Country</th>
<th>Subgenre/Notes
</th></tr>
<tr>
<td><i><a href="/wiki/12_to_the_Moon" title="12 to the Moon">12 to the Moon</a></i></td>
<td><a href="/wiki/David_Bradley_(director)" title="David Bradley (director)">David Bradley</a></td>
<td><a class="new" href="/w/index.php?title=Ken_CHans_Bendriklark_(actor)&amp;action=edit&amp;redlink=1" title="Ken CHans Bendriklark (actor) (page does not exist)">Ken Clark</a>, <a href="/wiki/Michi_Kobi" title="Michi Kobi">Michi Kobi</a>, <a href="/wiki/Tom_Conway" title="Tom Conway">Tom Conway</a></td>
<td>United States</td>
<td><sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup>
</td></tr>
<tr>
<td><i><a href="/wiki/The_A
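Indexing with `[0]` raises an `IndexError` when the selector matches nothing, for instance if the page layout changes. As a hedged alternative, `select_one` returns the first match or `None`. The sketch below uses a small made-up HTML snippet (`sample_html`) rather than the real page.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the downloaded page (illustrative only).
sample_html = """
<table class="wikitable">
  <tbody>
    <tr><th>Title</th><th>Director</th></tr>
    <tr><td>12 to the Moon</td><td>David Bradley</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, features="html.parser")

# select_one returns the first match, or None when nothing matches,
# whereas select(...)[0] raises IndexError on an empty result.
body = sample_soup.select_one("table.wikitable tbody")
missing = sample_soup.select_one("table.infobox tbody")

if body is None:
    raise ValueError("no wikitable found on the page")

print(len(body.select("tr")))  # number of rows in the sample table
print(missing)                 # no table with class "infobox" in the sample
```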

### The information is getting better: we have the body of the wikitable. From this table we first need just the headers, so we extract the "th" elements.

In [49]:
# Extract the header cells of our table
head_table = table_60.select("tr th")
print(head_table)

[<th>Title</th>, <th>Director</th>, <th>Cast</th>, <th>Country</th>, <th>Subgenre/Notes
</th>, <th>Title</th>, <th>Director</th>, <th>Cast</th>, <th>Country</th>, <th>Subgenre/Notes
</th>, <th>Title</th>, <th>Director</th>, <th>Cast</th>, <th>Country</th>, <th>Subgenre/Notes
</th>, <th>Title</th>, <th>Director</th>, <th>Cast</th>, <th>Country</th>, <th>Subgenre/Notes
</th>, <th>Title</th>, <th>Director</th>, <th>Cast</th>, <th>Country</th>, <th>Subgenre/Notes
</th>, <th>Title</th>, <th>Director</th>, <th>Cast</th>, <th>Country</th>, <th>Subgenre/Notes
</th>, <th>Title</th>, <th>Director</th>, <th>Cast</th>, <th>Country</th>, <th>Subgenre/Notes
</th>, <th>Title</th>, <th>Director</th>, <th>Cast</th>, <th>Country</th>, <th>Subgenre/Notes
</th>, <th>Title</th>, <th>Director</th>, <th>Cast</th>, <th>Country</th>, <th>Subgenre/Notes
</th>, <th>Title</th>, <th>Director</th>, <th>Cast</th>, <th>Country</th>, <th>Subgenre/Notes
</th>]


### Eureka! We got what we need. Maybe you ask yourself why the same five "th" elements appear ten times. This is because the main table contains ten sub-tables, one per year of the decade, and each one repeats the header row. We only need the first five elements, which will become the columns of our pandas table.

In [50]:
# Table columns creation: keep the first five header elements
table_columns = []
for header in head_table[0:5]:
    table_columns.append(header)
print(table_columns)

[<th>Title</th>, <th>Director</th>, <th>Cast</th>, <th>Country</th>, <th>Subgenre/Notes
</th>]


### Better! We still need to clean the headers so that only their text remains.

In [51]:
# Table columns creation: keep only the text of each header
table_columns = []
for header in head_table[0:5]:
    column_header = header.get_text(separator=" ", strip=True)
    table_columns.append(column_header)
print(table_columns)

['Title', 'Director', 'Cast', 'Country', 'Subgenre/Notes']
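Slicing with `[0 : 5]` relies on knowing in advance that the table has five columns. As an alternative sketch, the repeated header texts can be de-duplicated while keeping their order, so the column count does not need to be hard-coded. The list `header_texts` below is a shortened stand-in for the text of every element in `head_table`.

```python
# Header texts as they would look after get_text(); shortened to two repeats here.
header_texts = [
    "Title", "Director", "Cast", "Country", "Subgenre/Notes",
    "Title", "Director", "Cast", "Country", "Subgenre/Notes",
]

# dict.fromkeys keeps the first occurrence of each key and preserves insertion order.
unique_columns = list(dict.fromkeys(header_texts))
print(unique_columns)
```

This only works when every sub-table uses the same header names, which appears to be the case in the output above.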


### Great, we got them. Now we need the data for our table. The data is within the "tr" elements. We have to extract it and append it to a list. Let's do it!

In [52]:
# Select every row of the table
rows = table_60.select("tr")
print(rows)

[<tr>
<td colspan="7" scope="row" style="text-align:left; background:#e9e9e9"><a href="/wiki/1960_in_film" title="1960 in film"><span id="1960">1960</span></a>
</td></tr>, <tr>
<th>Title</th>
<th>Director</th>
<th>Cast</th>
<th>Country</th>
<th>Subgenre/Notes
</th></tr>, <tr>
<td><i><a href="/wiki/12_to_the_Moon" title="12 to the Moon">12 to the Moon</a></i></td>
<td><a href="/wiki/David_Bradley_(director)" title="David Bradley (director)">David Bradley</a></td>
<td><a class="new" href="/w/index.php?title=Ken_CHans_Bendriklark_(actor)&amp;action=edit&amp;redlink=1" title="Ken CHans Bendriklark (actor) (page does not exist)">Ken Clark</a>, <a href="/wiki/Michi_Kobi" title="Michi Kobi">Michi Kobi</a>, <a href="/wiki/Tom_Conway" title="Tom Conway">Tom Conway</a></td>
<td>United States</td>
<td><sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup>
</td></tr>, <tr>
<td><i><a href="/wiki/The_Amaz

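### The information we need is contained in the "td" elements. Index 0 is the row that only holds the year, so we skip it. The header row at index 1 has no "td" elements and therefore produces an empty list. Let's write a loop and append the values to a list.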

In [53]:
table_data = []
for index, element in enumerate(rows):
    if index > 0:  # skip the first row, which only holds the year
        row_list = []
        values = element.select("td")
        for value in values:
            row_list.append(value.text.strip())
        table_data.append(row_list)  # list of lists, ready for pandas
print(table_data)

[[], ['12 to the Moon', 'David Bradley', 'Ken Clark, Michi Kobi, Tom Conway', 'United States', '[1][2]'], ['The Amazing Transparent Man', 'Edgar G. Ulmer', 'Marguerite Chapman, Douglas Kennedy, James Griffith', 'United States', ''], ['Atomic War Bride', 'Veljko Bulajić', 'Antun Vrdoljak, Zlatko Madunić, Ljubiša Jovanović', 'Yugoslavia', '[3]'], ['Beyond the Time Barrier', 'Edgar G. Ulmer', 'Robert Clarke, Darlene Tompkins, Arianne Ulmer', 'United States', '[4][5]'], ['The Cape Canaveral Monsters', 'Phil Tucker', 'Scott Peters, Linda Connell, Jason Johnson, Katherine Victor', 'United States', ''], ['Dinosaurus!', 'Irvin Yeaworth', 'Ward Ramsey, Paul Lukather, Kristina Hanson', 'United States', ''], ['Horrors of Spider Island (a.k.a. Ein Toter hing im Netz)', 'Fritz Böttger', "Harald Maresch, Helga Franck, Alexander D'Arcy", 'West Germany', ''], ['The Human Vapor', 'Ishirō Honda', 'Tatsuya Mihashi, Kaoru Yachigusa, Yoshio Tsuchiya', 'Japan', ''], ['Last Woman on Earth', 'Roger Corman', '
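The output starts with an empty list, which comes from the header row. One possible way to drop such rows before building the DataFrame is to keep only rows whose length matches the number of columns. The sketch below uses a short stand-in for `table_data`; the single-cell `["1961"]` row is an assumption about how later year rows would appear, based on the shape of the first year row.

```python
table_columns = ["Title", "Director", "Cast", "Country", "Subgenre/Notes"]

# A short stand-in for the scraped list of lists (film rows taken from the output above).
table_data = [
    [],  # header row: no "td" cells
    ["12 to the Moon", "David Bradley", "Ken Clark, Michi Kobi, Tom Conway",
     "United States", "[1][2]"],
    ["1961"],  # assumed shape of a year separator row (one cell only)
    ["The Amazing Transparent Man", "Edgar G. Ulmer",
     "Marguerite Chapman, Douglas Kennedy, James Griffith", "United States", ""],
]

# Keep only rows that have exactly one value per column.
complete_rows = [row for row in table_data if len(row) == len(table_columns)]
print(len(complete_rows))
```

Note that this would also discard genuine rows that happen to have a different number of cells, so it is worth inspecting what gets dropped before relying on it.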

### We have a list of lists. Now we create our pandas DataFrame from the elements we collected.

In [58]:
df_movies = pd.DataFrame(data=table_data, columns=table_columns)
df_movies  # Display the table

Unnamed: 0,Title,Director,Cast,Country,Subgenre/Notes
0,,,,,
1,12 to the Moon,David Bradley,"Ken Clark, Michi Kobi, Tom Conway",United States,[1][2]
2,The Amazing Transparent Man,Edgar G. Ulmer,"Marguerite Chapman, Douglas Kennedy, James Gri...",United States,
3,Atomic War Bride,Veljko Bulajić,"Antun Vrdoljak, Zlatko Madunić, Ljubiša Jovanović",Yugoslavia,[3]
4,Beyond the Time Barrier,Edgar G. Ulmer,"Robert Clarke, Darlene Tompkins, Arianne Ulmer",United States,[4][5]
...,...,...,...,...,...
204,Night of the Bloody Apes,René Cardona,"José Moreno, Armando Silvestre",Mexico,
205,Stereo,David Cronenberg,"Jack Messinger, Iain Ewing, Clara Mayer",Canada,
206,A Time of Roses,Risto Jarva,"Arto Tuominen, Ritva Vepsä, Tarja Markus",Finland,
207,The Valley of Gwangi,Jim O'Connolly,"James Franciscus, Gila Golan, Laurence Naismith",United States,
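Row 0 of the table is empty and the notes column still contains citation markers such as `[1][2]`. A minimal cleanup sketch with pandas follows. It works on a small stand-in DataFrame (`sample`) built from rows shown above, not on the full `df_movies`.

```python
import pandas as pd

columns = ["Title", "Director", "Cast", "Country", "Subgenre/Notes"]
sample = pd.DataFrame(
    [
        [None, None, None, None, None],  # stand-in for the empty first row
        ["12 to the Moon", "David Bradley",
         "Ken Clark, Michi Kobi, Tom Conway", "United States", "[1][2]"],
        ["Atomic War Bride", "Veljko Bulajić",
         "Antun Vrdoljak, Zlatko Madunić, Ljubiša Jovanović", "Yugoslavia", "[3]"],
        ["Beyond the Time Barrier", "Edgar G. Ulmer",
         "Robert Clarke, Darlene Tompkins, Arianne Ulmer", "United States", "[4][5]"],
    ],
    columns=columns,
)

# Drop rows where every column is missing, then renumber the index.
clean = sample.dropna(how="all").reset_index(drop=True)

# Remove citation markers such as "[1]" from the notes column.
clean["Subgenre/Notes"] = clean["Subgenre/Notes"].str.replace(r"\[\d+\]", "", regex=True)

# A first bit of information: how many films per country.
country_counts = clean["Country"].value_counts()
print(clean)
print(country_counts)
```

The same three steps could be applied to `df_movies`, assuming its empty rows contain only missing values.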


## We have finished our data extraction and cleaned the table. Congratulations!
## Some suggestions:
* Extract and clean the data systematically. Sometimes that means one element at a time.
* Fancier techniques can be added once the essential extraction works.
* Keeping a local copy of the web page is recommended, because working with dynamic pages is more complicated.
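As one example of a further technique, the year rows could be kept instead of skipped, so that each film carries its year and the opening question about a specific year becomes answerable. The sketch below is only an illustration on a made-up snippet: it assumes every year row is a single "td" with a `colspan` attribute, as the first one is, and the 1961 film and director are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up snippet mirroring the table layout; the 1961 entry is a placeholder.
sample_html = """
<table class="wikitable"><tbody>
<tr><td colspan="7">1960</td></tr>
<tr><th>Title</th><th>Director</th></tr>
<tr><td>12 to the Moon</td><td>David Bradley</td></tr>
<tr><td colspan="7">1961</td></tr>
<tr><th>Title</th><th>Director</th></tr>
<tr><td>Placeholder Film</td><td>Placeholder Director</td></tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, features="html.parser")

rows_with_year = []
current_year = None
for tr in sample_soup.select("table.wikitable tbody tr"):
    cells = tr.select("td")
    if len(cells) == 1 and cells[0].has_attr("colspan"):
        # Year separator row: remember the year for the rows that follow.
        current_year = cells[0].get_text(strip=True)
        continue
    if not cells:
        # Header row: it only contains "th" elements.
        continue
    rows_with_year.append([current_year] + [c.get_text(strip=True) for c in cells])

print(rows_with_year)
```

With a "Year" column in front of the others, a DataFrame built from `rows_with_year` could then be grouped by year to count films.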

# <span style="color:orange">Check the next Binder and find more insights!</span>