# Web Scraping from Wikipedia

### List of best-selling books _Between 20 million and 50 million copies_


In [3]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_best-selling_books'
page = requests.get(url).text  # raw HTML of the page, as a string

In [5]:
soup = BeautifulSoup(page, 'html.parser')
# soup
# Output too large. Uncomment the line above to see the output

In [None]:
# print(soup.prettify())

# Output too large. Uncomment the line above to see the output

In [None]:
table = soup.findAll('table')[0]
table

In [None]:
table = soup.findAll('table')[2]
table

We can see from the first lines (`The_Tale_of_Peter_Rabbit` and `Beatrix Potter`) that this is the table we are after, so we keep it assigned to the variable `table`.
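
Trying positional indices works, but the position of a table can change whenever the page is edited. As a hedged alternative, the sketch below looks the table up by a title known to be in it. It runs on a small hypothetical HTML snippet rather than the real page, the helper name `find_table_containing` is an assumption of this sketch, and it uses `find_all`, the current name of the `findAll` method used in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: two tables, only the second holds our book.
sample_html = """
<table><tr><th>Book</th></tr><tr><td>Some other book</td></tr></table>
<table><tr><th>Book</th></tr><tr><td>The Tale of Peter Rabbit</td></tr></table>
"""

def find_table_containing(soup, needle):
    """Return the first <table> whose text contains `needle`, or None."""
    for candidate in soup.find_all('table'):
        if needle in candidate.get_text():
            return candidate
    return None

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = find_table_containing(sample_soup, 'The Tale of Peter Rabbit')
print(sample_table.find('td').get_text(strip=True))
```

On the real page the same helper could be called with `soup` in place of `sample_soup`, assuming the title still appears in exactly the table we want.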

Next, we want only the headers of our table, so we search for the `th` (table header) tags:

In [9]:
html_headers = table.findAll('th')
html_headers

[<th>Book</th>,
 <th>Author(s)</th>,
 <th>Original language</th>,
 <th>First published</th>,
 <th>Approximate sales</th>,
 <th>Genre
 </th>]

The `.text` attribute returns only the text between the `th` tags, and `.strip()` removes the surrounding whitespace. We apply both to every element of the `html_headers` we found in the cell above, using a list comprehension. The result is:

In [87]:
raw_headers = [title.text.strip() for title in html_headers]

raw_headers

['Book',
 'Author(s)',
 'Original language',
 'First published',
 'Approximate sales',
 'Genre']
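
For simple cells like these headers, `get_text(strip=True)` is an equivalent way to write `.text.strip()`. The sketch below compares the two on a hypothetical header row (the variable names are assumptions of this sketch). For cells with nested tags the two can differ, because `get_text(strip=True)` strips each text fragment before joining them.

```python
from bs4 import BeautifulSoup

# Hypothetical header row, including the trailing newline seen in the real 'Genre' cell.
header_html = "<table><tr><th>Book</th><th>Author(s)</th><th>Genre\n</th></tr></table>"
header_cells = BeautifulSoup(header_html, 'html.parser').find_all('th')

# Same idea as in the notebook: take the text, then strip it.
with_text_strip = [cell.text.strip() for cell in header_cells]
# Alternative spelling using get_text with its strip option.
with_get_text = [cell.get_text(strip=True) for cell in header_cells]

print(with_text_strip)
print(with_get_text)
```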

Then we define an empty dataframe whose columns are the `raw_headers` we extracted above:

In [160]:
df_bs_books = pd.DataFrame( columns = raw_headers )
df_bs_books

Unnamed: 0,Book,Author(s),Original language,First published,Approximate sales,Genre


In [None]:
# find all the table row (`tr`) elements and assign them to the variable `target_table`
target_table = table.findAll('tr')
target_table

This `target_table` contains all the data we want to fetch. We can check this by looking at its first row (the headers) and its last row:

In [79]:
target_table[0]
# headers


<tr>
<th>Book</th>
<th>Author(s)</th>
<th>Original language</th>
<th>First published</th>
<th>Approximate sales</th>
<th>Genre
</th></tr>

These are the headers of our table. If we fetch the last element of `target_table`, we get:

In [None]:
target_table[-1]
# last row


<tr style="background:lavender;">
<td><i><a href="/wiki/The_Naked_Ape" title="The Naked Ape">The Naked Ape</a></i>
</td>
<td><a href="/wiki/Desmond_Morris" title="Desmond Morris">Desmond Morris</a>
</td>
<td>English
</td>
<td>1968
</td>
<td>20 million<sup class="reference" id="cite_ref-128"><a href="#cite_note-128">[128]</a></sup></td>
<td><a class="mw-redirect" href="/wiki/Social_Science" title="Social Science">Social Science</a>, <a href="/wiki/Anthropology" title="Anthropology">Anthropology</a>, <a href="/wiki/Psychology" title="Psychology">Psychology</a>
</td></tr>

We see that the last row of the table is `The Naked Ape`, so we are on the right table.

In [None]:
for row in target_table[1:]:
    html_data = row.findAll('td')
    print(html_data)  

In the code above, we start from the second row because the first one is the header row: it holds `th` cells rather than `td` cells, so searching it for `td` tags returns an empty list.
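
To make the reason for skipping the first row concrete, here is a minimal sketch on a hypothetical two-row table (the names `tiny_html` and `tiny_rows` are assumptions of this sketch). The header row yields no `td` cells, while the data row yields one per column.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table: one header row, one data row.
tiny_html = """
<table>
<tr><th>Book</th><th>Author(s)</th></tr>
<tr><td>The Naked Ape</td><td>Desmond Morris</td></tr>
</table>
"""
tiny_rows = BeautifulSoup(tiny_html, 'html.parser').find_all('tr')

print(tiny_rows[0].find_all('td'))       # header row: no <td> cells
print(len(tiny_rows[1].find_all('td')))  # data row: two <td> cells
```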

Moving forward, we can also print every observation of the table once we have removed the HTML code we are not interested in. We do this by isolating the text of each value in `html_data`, stripping it, and assigning the result to a new variable `cell_row_data`.

In other words, we apply `.text` and then `.strip()` to every element of `html_data`. We print the new variable `cell_row_data` for demonstration purposes:

In [130]:
for row in target_table[1:]:
    html_data = row.findAll('td')
    cell_row_data = [data.text.strip() for data in html_data]
    print(cell_row_data)

['The Tale of Peter Rabbit', 'Beatrix Potter', 'English', '1902', '45 million[56]', "Children's Literature"]
['Jonathan Livingston Seagull', 'Richard Bach', 'English', '1970', '44 million[57]', 'Novella, Self-help']
['The Very Hungry Caterpillar', 'Eric Carle', 'English', '1969', '43 million[58]', "Children's Literature, picture book"]
['A Message to Garcia', 'Elbert Hubbard', 'English', '1899', '40 million[47]', 'Essay/Literature']
['To Kill a Mockingbird', 'Harper Lee', 'English', '1960', '40 million[59]', 'Southern Gothic, Bildungsroman']
['Flowers in the Attic', 'V. C. Andrews', 'English', '1979', '40 million[60]', 'Gothic horror, Family saga']
['Cosmos', 'Carl Sagan', 'English', '1980', '40 million[61]', 'Popular science, Anthropology, Astrophysics, Cosmology, Philosophy, History']
["Sophie's World (Sofies verden)", 'Jostein Gaarder', 'Norwegian', '1991', '40 million[62]', 'Philosophical novel, Young adult']
['Angels & Demons', 'Dan Brown', 'English', '2000', '39 million[63]', 'My

Removing the intermediate prints and adding just two more lines, the code that creates the final dataframe reads as follows:

In [161]:
for row in target_table[1:]:
    html_data = row.findAll('td')
    cell_row_data = [data.text.strip() for data in html_data]

    length = len(df_bs_books)
    df_bs_books.loc[length] = cell_row_data

These two new lines work as follows: on every iteration of the for-loop, `len(df_bs_books)` gives the number of rows already stored, which is also the next unused integer label. Assigning to `df_bs_books.loc[length]` therefore appends the new row, and the dataframe grows by one row on each pass.
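
Appending with `.loc` one row at a time is easy to follow. A common alternative is to collect the rows in a list and build the dataframe once. The sketch below compares both variants on hypothetical stand-ins for `raw_headers` and the per-row `cell_row_data` lists; it is an illustration, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical stand-ins for `raw_headers` and the per-row `cell_row_data` lists.
example_headers = ['Book', 'Author(s)']
example_rows = [
    ['The Tale of Peter Rabbit', 'Beatrix Potter'],
    ['Jonathan Livingston Seagull', 'Richard Bach'],
]

# Variant 1: append one row at a time, as in the notebook.
df_loop = pd.DataFrame(columns=example_headers)
for cells in example_rows:
    df_loop.loc[len(df_loop)] = cells  # len() is the next free integer label

# Variant 2: collect the rows first, then build the dataframe once.
df_once = pd.DataFrame(example_rows, columns=example_headers)

# Both variants hold the same cell values.
print(df_loop.values.tolist() == df_once.values.tolist())
```

For a table of this size the difference is negligible; building the dataframe once mainly helps with much larger tables.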

Now we just need to shift the index so that the observations are numbered from 1 instead of 0:

In [163]:
df_bs_books.index = df_bs_books.index + 1
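
As a small illustration of what this shift does, the sketch below applies it to a toy dataframe with placeholder titles (the name `toy` is an assumption of this sketch). Only the labels change; the number of rows stays the same.

```python
import pandas as pd

toy = pd.DataFrame({'Book': ['A', 'B', 'C']})  # placeholder titles
print(list(toy.index))  # default labels start at 0

toy.index = toy.index + 1
print(list(toy.index))  # labels now start at 1
```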

Finally, we display our dataframe before storing it as a CSV file:

In [164]:
df_bs_books

Unnamed: 0,Book,Author(s),Original language,First published,Approximate sales,Genre
1,The Tale of Peter Rabbit,Beatrix Potter,English,1902,45 million[56],Children's Literature
2,Jonathan Livingston Seagull,Richard Bach,English,1970,44 million[57],"Novella, Self-help"
3,The Very Hungry Caterpillar,Eric Carle,English,1969,43 million[58],"Children's Literature, picture book"
4,A Message to Garcia,Elbert Hubbard,English,1899,40 million[47],Essay/Literature
5,To Kill a Mockingbird,Harper Lee,English,1960,40 million[59],"Southern Gothic, Bildungsroman"
...,...,...,...,...,...,...
70,The Secret,Rhonda Byrne,English,2006,20 million[124],Self-help
71,Fear of Flying,Erica Jong,English,1973,20 million[125],Romantic novel
72,Dune,Frank Herbert,English,1965,20 million[126],Science fiction novel
73,Charlie and the Chocolate Factory,Roald Dahl,English,1964,20 million[127],Children's fantasy novel
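
The `Approximate sales` values still carry citation markers such as `[56]` and are stored as text. If numeric values are wanted, one possible clean-up is sketched below on three values copied from the output above. The column name `sales_copies` is hypothetical, and the sketch assumes every value has the form "N million".

```python
import pandas as pd

# Values copied from the output above; the new column name is hypothetical.
sales = pd.DataFrame(
    {'Approximate sales': ['45 million[56]', '44 million[57]', '20 million[124]']}
)

# Drop bracketed citation markers such as "[56]".
no_refs = sales['Approximate sales'].str.replace(r'\[\d+\]', '', regex=True)

# Keep the leading number and scale it from millions to single copies.
sales['sales_copies'] = (
    no_refs.str.extract(r'(\d+(?:\.\d+)?)', expand=False).astype(float) * 1_000_000
).astype(int)

print(sales['sales_copies'].tolist())
```

Values in any other format would need extra handling before the conversion to integers.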


In [165]:
df_bs_books.to_csv('20_to_50M_best_selling_books.csv')
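
By default `to_csv` also writes the index as an unnamed first column, so reading the file back without `index_col=0` yields an extra `Unnamed: 0` column. The sketch below shows this round trip on a toy dataframe, writing to a string instead of a file; the names `toy_books`, `naive` and `restored` are assumptions of this sketch.

```python
import io
import pandas as pd

toy_books = pd.DataFrame({'Book': ['The Secret', 'Dune']}, index=[1, 2])

# With no path given, to_csv returns the CSV content as a string.
csv_text = toy_books.to_csv()

# Read back naively: the index comes back as a column named 'Unnamed: 0'.
naive = pd.read_csv(io.StringIO(csv_text))
# Read back with index_col=0: the first column becomes the index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(restored.index))
```

If the index is not needed in the file at all, passing `index=False` to `to_csv` leaves it out.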