## Extraction de données

L'objectif de cette partie,c'est d'extraire les données relatives aux livres disponibles sur le site 'https://manybooks.net/categories' que les utilisateurs ont deja évalués afin de construire notre systeme de recommendation

### Importation des librairies necessaires

In [57]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from tqdm import tqdm
from numpy as np

### Extraire les genres et leurs references du web

In [38]:
url = 'https://manybooks.net/categories'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc,'html.parser')

Afin de connaitre comment extraire les données ne devons tout d'abord explorer le contenu de soup

In [39]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   (window.NREUM||(NREUM={})).init={ajax:{deny_list:["bam.nr-data.net"]}};(window.NREUM||(NREUM={})).loader_config={licenseKey:"fd2d2a57b6",applicationID:"135398014"};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var i=e[n]={exports:{}};t[n][0].call(i.exports,function(e){var i=t[n][1][e];return r(i||e)},i,i.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(t,e,n){function r(){}function i(t

- Extraction des genres dans la liste cat et des references qui leur correspondent dans ref

In [64]:
cat=[]
ref=[]
for el in soup.find_all('div',class_='views-field views-field-name'):
    cat.append(el.text)
    ref.append(el.a.get('href'))

In [65]:
#Creer un dictionnaire contennant les genres et leurs references
dic={'Genre':cat,'Reference':ref}

On va les rassembler dans un meme dataframe

In [66]:
genrerefdf=pd.DataFrame(dic)
genrerefdf.drop_duplicates(inplace=True)

In [67]:
genrerefdf.head()

Unnamed: 0,Genre,Reference
0,Adventure,/categories/ADV
1,African-American Studies,/categories/AFR
2,Art,/categories/ART
3,Banned Books,/categories/BAN
4,Biography,/categories/BIO


### Extraire les references des livres de chaque genre

On commence par un petit exemple juste pour comprendre comment on va extraire les données

In [10]:
url='https://manybooks.net/categories/ADV'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc,'html.parser')

In [11]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   (window.NREUM||(NREUM={})).init={ajax:{deny_list:["bam.nr-data.net"]}};(window.NREUM||(NREUM={})).loader_config={licenseKey:"fd2d2a57b6",applicationID:"135398014"};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var i=e[n]={exports:{}};t[n][0].call(i.exports,function(e){var i=t[n][1][e];return r(i||e)},i,i.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(t,e,n){function r(){}function i(t

On passe maintenent a l'extraction des references des livres dans un dataframe qu'on nomme data :

In [12]:
data=pd.DataFrame()
for i in range(genrerefdf.shape[0]):
    url='https://manybooks.net'+genrerefdf['Reference'][i]
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc,'html.parser')
    L=[]
    for link in soup.find_all('div',class_='field field--name-field-title field--type-string field--label-hidden field--item'):
        L.append(link.a.get('href'))
    for j in tqdm(range(1,100)):
        url=url+'?language=All&sort_by=field_downloads&page='+str(i)
        r = requests.get(url)
        html_doc = r.text
        soup = BeautifulSoup(html_doc,'html.parser')
        for link in soup.find_all('div',class_='field field--name-field-title field--type-string field--label-hidden field--item'):
            L.append(link.a.get('href'))
    L=set(L)
    L=list(L)
    for j in range(len(L)):
        row = {'Genre':genrerefdf['Genre'][i], 'Reference': genrerefdf['Reference'][i], 'Book Ref': L[j]}
        data = data.append(row, ignore_index = True)

    

100%|██████████| 99/99 [01:27<00:00,  1.13it/s]
100%|██████████| 99/99 [01:44<00:00,  1.05s/it]
100%|██████████| 99/99 [01:47<00:00,  1.09s/it]
100%|██████████| 99/99 [01:09<00:00,  1.43it/s]
100%|██████████| 99/99 [01:44<00:00,  1.06s/it]
100%|██████████| 99/99 [01:22<00:00,  1.20it/s]
100%|██████████| 99/99 [01:13<00:00,  1.35it/s]
100%|██████████| 99/99 [01:08<00:00,  1.45it/s]
100%|██████████| 99/99 [01:30<00:00,  1.10it/s]
100%|██████████| 99/99 [01:16<00:00,  1.30it/s]
100%|██████████| 99/99 [01:30<00:00,  1.10it/s]
100%|██████████| 99/99 [01:59<00:00,  1.20s/it]
100%|██████████| 99/99 [01:30<00:00,  1.09it/s]
100%|██████████| 99/99 [01:34<00:00,  1.04it/s]
100%|██████████| 99/99 [01:15<00:00,  1.31it/s]
100%|██████████| 99/99 [01:55<00:00,  1.16s/it]
100%|██████████| 99/99 [01:32<00:00,  1.07it/s]
100%|██████████| 99/99 [01:23<00:00,  1.18it/s]
100%|██████████| 99/99 [01:30<00:00,  1.10it/s]
100%|██████████| 99/99 [01:20<00:00,  1.23it/s]
100%|██████████| 99/99 [01:56<00:00,  1.

ConnectionError: ('Connection aborted.', ConnectionAbortedError(10053, 'An established connection was aborted by the software in your host machine', None, 10053, None))

Vu des problemes de connexion, le telechargement s'est arreté.
On le reprend alors depuis l'iteration ou il s'est arreté

In [15]:
for i in range(24,genrerefdf.shape[0]):
    url='https://manybooks.net'+genrerefdf['Reference'][i]
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc,'html.parser')
    L=[]
    for link in soup.find_all('div',class_='field field--name-field-title field--type-string field--label-hidden field--item'):
        L.append(link.a.get('href'))
    for j in tqdm(range(1,100)):
        url=url+'?language=All&sort_by=field_downloads&page='+str(i)
        r = requests.get(url)
        html_doc = r.text
        soup = BeautifulSoup(html_doc,'html.parser')
        for link in soup.find_all('div',class_='field field--name-field-title field--type-string field--label-hidden field--item'):
            L.append(link.a.get('href'))
    L=set(L)
    L=list(L)
    for j in range(len(L)):
        row = {'Genre':genrerefdf['Genre'][i], 'Reference': genrerefdf['Reference'][i], 'Book Ref': L[j]}
        data = data.append(row, ignore_index = True)


100%|██████████| 99/99 [01:33<00:00,  1.06it/s]
100%|██████████| 99/99 [01:47<00:00,  1.08s/it]
100%|██████████| 99/99 [02:11<00:00,  1.32s/it]
100%|██████████| 99/99 [01:15<00:00,  1.32it/s]
100%|██████████| 99/99 [01:20<00:00,  1.23it/s]
100%|██████████| 99/99 [01:12<00:00,  1.36it/s]
100%|██████████| 99/99 [01:23<00:00,  1.19it/s]
100%|██████████| 99/99 [02:53<00:00,  1.75s/it]
100%|██████████| 99/99 [01:38<00:00,  1.01it/s]
100%|██████████| 99/99 [02:06<00:00,  1.28s/it]
100%|██████████| 99/99 [02:29<00:00,  1.51s/it]
100%|██████████| 99/99 [02:55<00:00,  1.78s/it]
100%|██████████| 99/99 [03:53<00:00,  2.36s/it]
 13%|█▎        | 13/99 [01:05<07:13,  5.04s/it]


ConnectionError: ('Connection aborted.', ConnectionAbortedError(10053, 'An established connection was aborted by the software in your host machine', None, 10053, None))

In [17]:
for i in range(37,genrerefdf.shape[0]):
    url='https://manybooks.net'+genrerefdf['Reference'][i]
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc,'html.parser')
    L=[]
    for link in soup.find_all('div',class_='field field--name-field-title field--type-string field--label-hidden field--item'):
        L.append(link.a.get('href'))
    for j in tqdm(range(1,100)):
        url=url+'?language=All&sort_by=field_downloads&page='+str(i)
        r = requests.get(url)
        html_doc = r.text
        soup = BeautifulSoup(html_doc,'html.parser')
        for link in soup.find_all('div',class_='field field--name-field-title field--type-string field--label-hidden field--item'):
            L.append(link.a.get('href'))
    L=set(L)
    L=list(L)
    for j in range(len(L)):
        row = {'Genre':genrerefdf['Genre'][i], 'Reference': genrerefdf['Reference'][i], 'Book Ref': L[j]}
        data = data.append(row, ignore_index = True)


100%|██████████| 99/99 [02:47<00:00,  1.69s/it]
100%|██████████| 99/99 [03:14<00:00,  1.97s/it]
100%|██████████| 99/99 [02:56<00:00,  1.78s/it]
100%|██████████| 99/99 [02:17<00:00,  1.38s/it]
100%|██████████| 99/99 [02:20<00:00,  1.42s/it]
100%|██████████| 99/99 [01:52<00:00,  1.14s/it]
100%|██████████| 99/99 [02:18<00:00,  1.40s/it]
100%|██████████| 99/99 [01:52<00:00,  1.14s/it]
100%|██████████| 99/99 [01:57<00:00,  1.19s/it]
100%|██████████| 99/99 [01:51<00:00,  1.12s/it]
100%|██████████| 99/99 [01:48<00:00,  1.09s/it]
100%|██████████| 99/99 [02:08<00:00,  1.30s/it]
100%|██████████| 99/99 [02:25<00:00,  1.47s/it]
100%|██████████| 99/99 [01:52<00:00,  1.14s/it]
100%|██████████| 99/99 [01:58<00:00,  1.20s/it]
100%|██████████| 99/99 [02:32<00:00,  1.54s/it]
100%|██████████| 99/99 [01:56<00:00,  1.18s/it]
100%|██████████| 99/99 [03:06<00:00,  1.88s/it]
100%|██████████| 99/99 [02:30<00:00,  1.52s/it]
100%|██████████| 99/99 [02:03<00:00,  1.25s/it]
 22%|██▏       | 22/99 [00:57<03:20,  2.

ChunkedEncodingError: ("Connection broken: ConnectionAbortedError(10053, 'An established connection was aborted by the software in your host machine', None, 10053, None)", ConnectionAbortedError(10053, 'An established connection was aborted by the software in your host machine', None, 10053, None))

In [19]:
for i in range(57,genrerefdf.shape[0]):
    url='https://manybooks.net'+genrerefdf['Reference'][i]
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc,'html.parser')
    L=[]
    for link in soup.find_all('div',class_='field field--name-field-title field--type-string field--label-hidden field--item'):
        L.append(link.a.get('href'))
    for j in tqdm(range(1,100)):
        url=url+'?language=All&sort_by=field_downloads&page='+str(i)
        r = requests.get(url)
        html_doc = r.text
        soup = BeautifulSoup(html_doc,'html.parser')
        for link in soup.find_all('div',class_='field field--name-field-title field--type-string field--label-hidden field--item'):
            L.append(link.a.get('href'))
    L=set(L)
    L=list(L)
    for j in range(len(L)):
        row = {'Genre':genrerefdf['Genre'][i], 'Reference': genrerefdf['Reference'][i], 'Book Ref': L[j]}
        data = data.append(row, ignore_index = True)


100%|██████████| 99/99 [02:12<00:00,  1.34s/it]
100%|██████████| 99/99 [02:36<00:00,  1.58s/it]
100%|██████████| 99/99 [02:02<00:00,  1.24s/it]
100%|██████████| 99/99 [02:01<00:00,  1.22s/it]
100%|██████████| 99/99 [02:37<00:00,  1.59s/it]


In [22]:
data.drop_duplicates(inplace=True)

In [23]:
data.shape

(2358, 3)

In [69]:
data.head()

Unnamed: 0,Genre,Reference,Book Ref
0,Adventure,/categories/ADV,/titles/londonjaetext97wtfng10.html
1,Adventure,/categories/ADV,/titles/vernejuletext04820kc10.html
2,Adventure,/categories/ADV,/titles/marchmonta3582835828.html
3,Adventure,/categories/ADV,/titles/doyleartetext94lostw10.html
4,Adventure,/categories/ADV,/titles/londonjaetext95callw10.html


### Extraire les caracteristiques de chaque livre

On commence d'abord par analyser un petit exemple afin d'en deduire comment extraire les données voulues

In [33]:
url='https://manybooks.net/titles/vernejuletext942000010.html'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc,'html.parser')

In [34]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   (window.NREUM||(NREUM={})).init={ajax:{deny_list:["bam.nr-data.net"]}};(window.NREUM||(NREUM={})).loader_config={licenseKey:"fd2d2a57b6",applicationID:"135398014"};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var i=e[n]={exports:{}};t[n][0].call(i.exports,function(e){var i=t[n][1][e];return r(i||e)},i,i.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(t,e,n){function r(){}function i(t

On va extraire les caracteristiques des livres dans un dataframe qu'on nomme books :

In [24]:
books=pd.DataFrame()
for i in tqdm(range(data.shape[0])):
    url='https://manybooks.net'+data['Book Ref'][i]
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc,'html.parser')
    info=[0 for i in range(6)]
    if soup.find('div',class_='field field--name-field-title field--type-string field--label-hidden field--item')==None:
        info[0]=np.nan
    else : 
        info[0]=soup.find('div',class_='field field--name-field-title field--type-string field--label-hidden field--item').text
    if soup.find('div',class_='field field--name-field-author-er field--type-entity-reference field--label-hidden field--items')==None:
        info[1]=np.nan
    else : 
        info[1]=soup.find('div',class_='field field--name-field-author-er field--type-entity-reference field--label-hidden field--items').text
    if soup.find('div',class_='field field--name-field-pages field--type-integer field--label-hidden field--item')==None:
        info[2]=np.nan
    else : 
        info[2]=soup.find('div',class_='field field--name-field-pages field--type-integer field--label-hidden field--item').text
    if soup.find('div',class_='field field--name-field-published-year field--type-integer field--label-hidden field--item')==None:
        info[3]=np.nan
    else : 
        info[3]=soup.find('div',class_='field field--name-field-published-year field--type-integer field--label-hidden field--item').text
    if soup.find('div',class_='field field--name-field-downloads field--type-integer field--label-hidden field--item')==None:
        info[4]=np.nan
    else : 
        info[4]=soup.find('div',class_='field field--name-field-downloads field--type-integer field--label-hidden field--item').text
    if soup.find('div',class_='fivestar-widget-static fivestar-widget-static-vote fivestar-widget-static-5 clearfix')==None:
        info[5]=np.nan
    else : 
        info[5]=soup.find('div',class_='fivestar-widget-static fivestar-widget-static-vote fivestar-widget-static-5 clearfix').text
    row={'Title':info[0],
        'Author':info[1],
        'Genre':data['Genre'][i],
        'Pages':info[2],
        'Year':info[3],
        'Downloads':info[4],
        'Rating':info[5]}
    books = books.append(row, ignore_index = True)

100%|██████████| 2358/2358 [48:05<00:00,  1.22s/it] 


In [26]:
books.head()

Unnamed: 0,Title,Author,Genre,Pages,Year,Downloads,Rating
0,White Fang,\nJack London\n,Adventure,176,1906,41412,5.0
1,20000 Lieues sous les mers,\nJules Verne\n,Adventure,449,1870,31688,4.0
2,By Wit of Woman,\nArthur W. Marchmont\n,Adventure,268,1905,32146,0.0
3,The Lost World,\nArthur Conan Doyle\n,Adventure,198,1912,32490,4.625
4,The Call of the Wild,\nJack London\n,Adventure,86,1903,86550,4.0833333333333


### Téléchargement du dataset books dans un fichier csv qu'on va nommer 'books.csv'

In [31]:
books.to_csv('books.csv',index=False)

In [6]:
L=[1,2]
L[0:0]

[]