# Web scraping time

The web scraping will be done, if possible, in two parts:
* Retrieving the links to the video games
* Following those links to retrieve the data

With more than 1,000,000 data points to collect, the code must be optimized so that it takes as little time as possible. Browser-driven navigation with Selenium should therefore be avoided.

In [1]:
from bs4 import BeautifulSoup as BS
from rich import print
from requests import get
import re
from time import sleep, time
import json

## Retrieving the links

### Test on the first page

In [15]:
page = get('https://www.mobygames.com/browse/games/offset,25/so,0a/list-games/')

In [16]:
code = page.content.decode("utf8")
soupe = BS(code, "lxml")

In [19]:
soupe

<!DOCTYPE html>
<html>
<head>
<title>MobyGames: Game Browser</title>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="MobyGames game browser game All Games" name="description"/>
<meta content="MobyGames" property="og:site_name"/>
<meta content="website" property="og:type"/>
<meta content="https://www.mobygames.com/images/mobygames-logo-bg.png" property="og:image"/>
<meta content="MobyGames: Game Browser" property="og:title"/>
<meta content="MobyGames game browser game All Games" property="og:description"/>
<meta content="tJt3KCZCKYjBQCF3Fi55jOw6LD2AiPzSrlP6E-mZvSs" name="google-site-verification"/>
<meta content="aeca3386c9fda93d0a64006ba1249e4c" name="p:domain_verify"/>
<link href="https://www.mobygames.com/favicon.ico" rel="shortcut icon"/>
<link href="https://www.mobygames.com/images/moby300x300.png" rel="apple-touch-icon image_src"/>
<link href="https://www.

In [38]:
tablo = soupe.find_all(attrs={'id' : ['mof_object_list']})

In [39]:
len(tablo)

1

In [41]:
print(tablo[0].prettify())

In [57]:
elements = tablo[0].find_all(name = 'tr')

In [58]:
elements.pop(0)

<tr><td><b><a class="windowTitle" href="https://www.mobygames.com/browse/games/offset,25/so,0d/list-games/" title="Sort by Game Title"><img alt="sorted in ascending order" border="0" height="7" hspace="3" src="/images/asc.gif" width="7"/>Game Title</a></b></td><td><b><a class="windowTitle" href="https://www.mobygames.com/browse/games/offset,25/so,1d/list-games/" title="Sort by Year">Year</a></b></td><td><b><a class="windowTitle" href="https://www.mobygames.com/browse/games/offset,25/so,2a/list-games/" title="Sort by Publisher">Publisher</a></b></td><td><b>Genre</b></td><td><b>Platform</b></td></tr>

In [61]:
elements[1].find_all(href = True)

[<a href="https://www.mobygames.com/game/007-quantum-of-solace__">007: Quantum of Solace</a>,
 <a href="https://www.mobygames.com/browse/games/2008/">2008</a>,
 <a href="https://www.mobygames.com/company/activision-publishing-inc">Activision Publishing, Inc.</a>,
 <a href="https://www.mobygames.com/genre/sheet/action/">Action</a>,
 <a href="https://www.mobygames.com/browse/games/ps2/">PlayStation 2</a>]

In [66]:
# Keep, for each row, the first link that points to a game page
motif = re.compile(r'https://www\.mobygames\.com/game/')
for element in elements:
    liste_href = element.find_all(href = True)
    for href in liste_href:
        if motif.match(href['href']):
            print(href['href'])
            break
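
The filter in this cell boils down to a prefix test on each `href`. A minimal sketch of that test, using only the `re` module and the five links shown in the earlier output; `first_game_link` is a hypothetical helper name, not part of the notebook:

```python
import re

# Escaped dots and no trailing group: the pattern only has to match the prefix.
GAME_URL = re.compile(r"https://www\.mobygames\.com/game/")

def first_game_link(hrefs):
    """Return the first href that points to a game page, or None."""
    for href in hrefs:
        if GAME_URL.match(href):
            return href
    return None

# The five links of one table row, copied from the output above.
row_links = [
    "https://www.mobygames.com/game/007-quantum-of-solace__",
    "https://www.mobygames.com/browse/games/2008/",
    "https://www.mobygames.com/company/activision-publishing-inc",
    "https://www.mobygames.com/genre/sheet/action/",
    "https://www.mobygames.com/browse/games/ps2/",
]
print(first_game_link(row_links))
```

Working on plain strings keeps the filter easy to check without fetching any page.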

### Test on several pages

In [5]:
# Same extraction as above, looped over listing pages (the offset grows in steps of 25)
motif = re.compile(r'https://www\.mobygames\.com/game/')
for i in range(140225, 140250, 25):
    page = get('https://www.mobygames.com/browse/games/offset,' + str(i) + '/so,0a/list-games/')
    code = page.content.decode("utf8")
    soupe = BS(code, "lxml")
    tablo, *_ = soupe.find_all(attrs={'id' : ['mof_object_list']})
    elements = tablo.find_all(name = 'tr')
    elements.pop(0)  # drop the header row
    # Open the output file once per page rather than once per link
    with open(r'C:\Users\Lucas\Documents/M2/S2/Big data/Shiny_VG/Lucas/scrapping/url.json', 'a') as f:
        for element in elements:
            liste_href = element.find_all(href = True)
            for href in liste_href:
                if motif.match(href['href']):
                    f.write(json.dumps(href['href']) + ',')
                    f.write('\n')
                    break  # only the first game link of each row is kept
    print(i)  # progress indicator
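
Appending each URL followed by a comma gives a file that `json.load` cannot read back directly. One possible alternative, sketched here under assumptions, is to write one JSON value per line. The helper names `page_urls`, `save_urls` and `load_urls` and the temporary file are hypothetical and only serve the illustration:

```python
import json
import os
import tempfile

BASE = "https://www.mobygames.com/browse/games/offset,{}/so,0a/list-games/"

def page_urls(start, stop, step=25):
    """Build the listing URLs for offsets start, start + step, ... below stop."""
    return [BASE.format(offset) for offset in range(start, stop, step)]

def save_urls(path, urls):
    """Append one JSON-encoded URL per line, opening the file only once."""
    with open(path, "a", encoding="utf8") as f:
        for url in urls:
            f.write(json.dumps(url) + "\n")

def load_urls(path):
    """Read the file back, decoding each non-empty line separately."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]

pages = page_urls(0, 75)
path = os.path.join(tempfile.mkdtemp(), "url.jsonl")  # throwaway location
save_urls(path, pages)
print(load_urls(path) == pages)
```

The same two functions would work for the game URLs themselves; listing URLs are used here only because they can be generated without any request.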

In [76]:
160000/200

800.0

In [77]:
800*85

68000

So it would take roughly 19 hours (68,000 seconds) to retrieve all the URLs.
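
The estimate can be reproduced in one place. The sketch below assumes, as the two cells above suggest but do not state, that one timed batch covers 200 games and takes 85 seconds; both interpretations are assumptions:

```python
total_games = 160_000    # approximate number of entries, from the cell above
batch_size = 200         # assumption: games covered by one timed batch
seconds_per_batch = 85   # assumption: measured duration of one batch

batches = total_games / batch_size
total_seconds = batches * seconds_per_batch
print(f"{total_seconds:.0f} s = {total_seconds / 3600:.1f} h")
```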

## Retrieving the information from each URL

### Preliminary

This part only serves to check whether I can retrieve all the information just by fetching the HTML with requests and BeautifulSoup.

In [20]:
page = get('https://www.mobygames.com/game/007-quantum-of-solace__')

In [21]:
code = page.content.decode("utf8")
soupe = BS(code, "lxml")

In [26]:
truc, *_ = soupe.find_all(attrs = {'class' : ['niceHeaderTitle']})

In [32]:
enfant = truc.text

In [33]:
enfant

'007: Quantum of Solace (PlayStation 2)Discuss Review + Want + Have Contribute '

In [36]:
truc, *_ = soupe.find_all(attrs = {'id' : ['coreGameGenre']})

In [38]:
truc.text

'ESRB RatingTeenGenreActionPerspectiveBehind\xa0viewGameplayShooterInterfaceDirect\xa0controlNarrativeSpy\xa0/\xa0espionageMiscLicensed'

We do get all the elements using only BeautifulSoup.

## Working with multiple threads

Indeed, 140,000 HTTP requests take a long time, a very long time. I will therefore try to make Python work across several threads.

##### Execution time comparison

In [385]:
liste_test=["https://www.mobygames.com/game/teenage-mutant-ninja-turtles____",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles___",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles__",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles_____",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles-",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles_",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles-2-battle-nexus_",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles-2-battle-nexus",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles-3-mutant-nightmare_",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles-3-mutant-nightmare",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles-3-shredders-last-stand",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles-arcade-attack",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles-basketball",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles-cowabunga",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles-danger-of-the-ooze",
"https://www.mobygames.com/game/teenage-mutant-ninja-turtles-double-pack"]

In [8]:
## TEST 1 without threads
debut = time()
for url in liste_test:
    page = get(url)
    code = page.content.decode("utf8")
    soupe = BS(code, "lxml")
print(f'It takes {time() - debut} seconds to make {len(liste_test)} HTTP requests')

In [55]:
import urllib.request # for downloading data
from tqdm import tqdm # for displaying a smart progress meter in loops
from bs4 import BeautifulSoup # for XML parsing & searching
import concurrent.futures # for multi-threading
import pickle # to save Python objects to the disk as files

In [46]:
## TEST 2 with threads and requests

def obtient_url(url):
    page = get(url)
    code = page.content.decode("utf8")
    soupe = BS(code, "lxml")
    
no_threads = 10

debut = time()
# Leaving the `with` block waits for every submitted task to finish
with concurrent.futures.ThreadPoolExecutor(max_workers=no_threads) as executor:
    for url in tqdm(liste_test):
        print(f"Submitting task for: {url}")
        executor.submit(obtient_url, url)
        
print(f'It takes {time() - debut} seconds to make {len(liste_test)} HTTP requests')

  0%|          | 0/17 [00:00<?, ?it/s]

 53%|█████▎    | 9/17 [00:00<00:00, 84.34it/s]

100%|██████████| 17/17 [00:00<00:00, 109.97it/s]


In [14]:
## TEST 3 with threads and urllib

def obtient_url(url):
    req = urllib.request.Request(url=url)
    with urllib.request.urlopen(req) as f:
        s = f.read().decode('utf-8')
        soup = BeautifulSoup(s, 'html.parser')

no_threads = 10

debut = time()
# Leaving the `with` block waits for every submitted task to finish
with concurrent.futures.ThreadPoolExecutor(max_workers=no_threads) as executor:
    for url in tqdm(liste_test):
        print(f"Submitting task for: {url}")
        executor.submit(obtient_url, url)
        
print(f'It takes {time() - debut} seconds to make {len(liste_test)} HTTP requests')

  0%|          | 0/17 [00:00<?, ?it/s]

 29%|██▉       | 5/17 [00:00<00:00, 36.59it/s]

 59%|█████▉    | 10/17 [00:00<00:00, 41.80it/s]

100%|██████████| 17/17 [00:00<00:00, 59.39it/s]


In [53]:
4.50/17

0.2647058823529412

#### Total time including saving to disk

In [44]:
## TEST 3 with threads and urllib, saving the result

def obtient_url(url):
    req = urllib.request.Request(url=url)
    with urllib.request.urlopen(req) as reponse:
        s = reponse.read().decode('utf-8')
    soup = BeautifulSoup(s, 'html.parser')
    truc, *_ = soup.find_all(attrs = {'class' : ['niceHeaderTitle']})
    truc2, *_ = soup.find_all(attrs = {'id' : ['coreGameGenre']})
    # Tag objects are not JSON-serializable, so their text is stored instead
    with open(r'C:\Users\Lucas\Documents/M2/S2/Big data/Shiny_VG/Lucas/scrapping/data.json', 'a') as f:
        f.write(json.dumps({truc.text: truc2.text}) + ',')
        f.write('\n')

no_threads = 10

debut = time()
with concurrent.futures.ThreadPoolExecutor(max_workers=no_threads) as executor:
    for url in tqdm(liste_test):
        print(f"Submitting task for: {url}")
        executor.submit(obtient_url, url)
        
print(f'It takes {time() - debut} seconds to make {len(liste_test)} HTTP requests')

  0%|          | 0/17 [00:00<?, ?it/s]

 29%|██▉       | 5/17 [00:00<00:00, 48.15it/s]

 59%|█████▉    | 10/17 [00:00<00:00, 32.56it/s]

 94%|█████████▍| 16/17 [00:00<00:00, 41.72it/s]

100%|██████████| 17/17 [00:00<00:00, 41.88it/s]
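
Several worker threads appending to the same file can interleave their writes. One common precaution, sketched here with the standard `threading.Lock`, is to serialize the writes. `save_record`, the fake records and the temporary file are hypothetical stand-ins for the scraping function and `data.json`:

```python
import concurrent.futures
import json
import os
import tempfile
import threading

write_lock = threading.Lock()
path = os.path.join(tempfile.mkdtemp(), "data.jsonl")  # throwaway location

def save_record(record):
    """Serialize first, then hold the lock only for the actual write."""
    line = json.dumps(record) + "\n"
    with write_lock:
        with open(path, "a", encoding="utf8") as f:
            f.write(line)

# Made-up records, only there to exercise the writer.
records = [{"title": f"game {i}", "genre": "Action"} for i in range(50)]
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    # list() consumes the results, so an exception raised in a worker surfaces here
    list(executor.map(save_record, records))

with open(path, encoding="utf8") as f:
    saved = [json.loads(line) for line in f]
print(len(saved))
```

The order of the lines in the file is not guaranteed, only that each line is written whole.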


In [386]:
## TEST 2 with threads and requests, saving the result

def obtient_url(url):
    page = get(url)
    code = page.content.decode("utf8")
    soupe = BS(code, "lxml")
    truc, *_ = soupe.find_all(attrs = {'class' : ['niceHeaderTitle']})
    truc2, *_ = soupe.find_all(attrs = {'id' : ['coreGameGenre']})
    with open(r'C:\Users\Lucas\Documents/M2/S2/Big data/Shiny_VG/Lucas/scrapping/data.json', 'a') as f:
        f.write(json.dumps({truc.text: truc2.text}) +',')
        f.write('\n')
    # Reached only on success: URLs still in liste_test afterwards belong to tasks that raised
    liste_test.remove(url)
    
no_threads = 10
nb_urls = len(liste_test)  # count taken before the workers start removing entries

debut = time()
with concurrent.futures.ThreadPoolExecutor(max_workers=no_threads) as executor:
    # Iterate over a copy, since the workers modify liste_test concurrently
    for url in tqdm(list(liste_test)):
        executor.submit(obtient_url, url)
        
print(f'It takes {time() - debut} seconds to make {nb_urls} HTTP requests')

100%|██████████| 17/17 [00:00<00:00, 681.82it/s]


In [381]:
liste_test

['https://www.mobygames.com/game/teenage-mutant-ninja-turtles-2-battle-nexus_',
 'https://www.mobygames.com/game/teenage-mutant-ninja-turtles-danger-of-the-ooze']
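
The two URLs left in `liste_test` are presumably those whose task raised an exception before reaching `liste_test.remove(url)`: `executor.submit` stores the exception in the returned future, and nothing is shown unless the result is requested. A minimal sketch of collecting failures explicitly follows. `fake_scrape` is a hypothetical stand-in that does no network access, and its failure rule (URLs ending in `_`) is arbitrary, not a claim about the real site:

```python
import concurrent.futures

def fake_scrape(url):
    """Hypothetical stand-in for obtient_url: fails on purpose for URLs ending in '_'."""
    if url.endswith("_"):
        raise ValueError("expected element not found")
    return url.rsplit("/", 1)[-1]

urls = [
    "https://www.mobygames.com/game/teenage-mutant-ninja-turtles",
    "https://www.mobygames.com/game/teenage-mutant-ninja-turtles-2-battle-nexus_",
    "https://www.mobygames.com/game/teenage-mutant-ninja-turtles-basketball",
]

succeeded, failed = [], {}
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    # Map each future back to its URL so that failures can be reported by URL.
    futures = {executor.submit(fake_scrape, url): url for url in urls}
    for future in concurrent.futures.as_completed(futures):
        url = futures[future]
        try:
            succeeded.append(future.result())  # re-raises the worker's exception
        except Exception as exc:
            failed[url] = repr(exc)

print(sorted(succeeded))
print(failed)
```

Keeping the failures in a dictionary avoids mutating a shared list from several threads and records why each URL failed.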

## Extracting the information

In [277]:
page = get('https://www.mobygames.com/game/007-quantum-of-solace_')
code = page.content.decode("utf8")
soupe = BS(code, "lxml")

### Title

In [278]:
balise_titre, *_ = soupe.find_all(attrs = {'class': ['niceHeaderTitle']})

In [279]:
for enfant in balise_titre.children:
    print(enfant)

In [280]:
balise_titre.childGenerator()

<list_iterator at 0x2f2144ac3a0>

In [281]:
balise_titre.findChildren()[0].text

'007: Quantum of Solace'

## Description 

In [282]:
fenetre, *_ = soupe.find_all(attrs={'class': ['col-md-8 col-lg-8']})

In [283]:
fenetre.find_all('p')

[<p style="margin-top: 0">
       James Bond is back to settle the score in the Quantum of Solace™ game.
     </p>,
 <p style="margin-top: 0">
       James Bond is back to settle the score in the Quantum of Solace™ game. 
       Introducing a more lethal and cunningly efficient Bond, the game blends 
       intense first-person action with a unique third-person cover combat 
       system that allows players to truly feel what it is like to be the 
       ultimate secret agent as they use their stealth, precision shooting and 
       lethal combat skills to progress through missions. Seamlessly blending 
       the heart-pounding action and excitement of the upcoming “Quantum of 
       Solace” feature film with the “Casino Royale” movie, the title propels 
       players into the cinematic experience of international espionage.
     </p>,
 <p>
       Innovative Touch Screen Control – The Quantum of Solace game for the 
       Nintendo DS introduces streamlined controls, with players s

In [284]:
truc = fenetre.find_all(attrs={'style': ["margin-top: 0"]})

In [285]:
# find_all returns a list (possibly empty): concatenate the text of every match
text = ''
for machin in truc:
    text += machin.text
text

'\n      James Bond is back to settle the score in the Quantum of Solace™ game.\n    \n      James Bond is back to settle the score in the Quantum of Solace™ game. \n      Introducing a more lethal and cunningly efficient Bond, the game blends \n      intense first-person action with a unique third-person cover combat \n      system that allows players to truly feel what it is like to be the \n      ultimate secret agent as they use their stealth, precision shooting and \n      lethal combat skills to progress through missions. Seamlessly blending \n      the heart-pounding action and excitement of the upcoming “Quantum of \n      Solace” feature film with the “Casino Royale” movie, the title propels \n      players into the cinematic experience of international espionage.\n    '

## Description 2

In [286]:
fenetre, *_ = soupe.find_all(attrs={'class': ['col-md-8 col-lg-8']})

In [287]:
text = ''
for enfant in fenetre.children:
    try:
        d = str(enfant.text.lower())
        if d == 'screenshots':
            break
    except AttributeError:
        text += enfant

In [288]:
print(text)

### Description

In [289]:
desc = ''
boolean = False  # becomes True once the 'description' heading has been seen
fenetre, *_ = soupe.find_all(attrs={'class': ['col-md-8 col-lg-8']})
balise_desc = fenetre.find_all(attrs={'style': ["margin-top: 0"]})
if balise_desc:
    # Preferred case: the description paragraphs carry the "margin-top: 0" style
    for texte in balise_desc:
        desc += texte.text
else:
    # Fallback: walk the children and stop at the 'screenshots' heading
    for enfant in fenetre.children:
        try:
            d = str(enfant.text.lower())
            if d == 'screenshots' or d == '[':
                break
            if boolean:
                desc += d
            if d == 'description':
                boolean = True
        except AttributeError:
            # Nodes that raise AttributeError on .text are appended as-is
            desc += enfant
desc = desc.replace("[edit description | view history]","")

In [290]:
desc

'\n      James Bond is back to settle the score in the Quantum of Solace™ game.\n    \n      James Bond is back to settle the score in the Quantum of Solace™ game. \n      Introducing a more lethal and cunningly efficient Bond, the game blends \n      intense first-person action with a unique third-person cover combat \n      system that allows players to truly feel what it is like to be the \n      ultimate secret agent as they use their stealth, precision shooting and \n      lethal combat skills to progress through missions. Seamlessly blending \n      the heart-pounding action and excitement of the upcoming “Quantum of \n      Solace” feature film with the “Casino Royale” movie, the title propels \n      players into the cinematic experience of international espionage.\n    '
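
The extracted description still carries the indentation and line breaks of the HTML source. A small sketch of one way to tidy it, relying on `str.split()` with no argument, which splits on any run of whitespace; `normalize_whitespace` is a hypothetical helper name:

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, newlines and non-breaking spaces into single spaces."""
    return " ".join(text.split())

# First paragraph of the output above, with its original indentation.
raw = "\n      James Bond is back to settle the score in the Quantum of Solace™ game.\n    "
print(normalize_whitespace(raw))
```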

### Critics reviews

In [296]:
fenetre = soupe.find_all(attrs={'class': ['reviewList table table-striped table-condensed table-hover']})

In [324]:
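# A table cell counts as a score when it is an integer between 0 and 100
motif_re = re.compile("^([0-9]|[1-9][0-9]|100)$")
note = 0
n = 0
if fenetre:
    for critique in fenetre[0].find_all('td'):
        if motif_re.findall(critique.text):
            note = note + int(critique.text)
            n = n + 1
# Also covers a review table without any numeric score (avoids dividing by zero)
moyenne = note / n if n else 'NaN'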

In [325]:
moyenne

64.77777777777777
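
The averaging logic can also be isolated from the HTML so that it is easy to check. This is a sketch under assumptions: `average_score` is a hypothetical helper, the sample cells are invented, and `math.nan` replaces the `'NaN'` string used in the cell above:

```python
import math
import re

SCORE = re.compile(r"^(100|[1-9]?[0-9])$")  # integers from 0 to 100

def average_score(cells):
    """Average the cells that look like a 0-100 score; NaN when none do."""
    scores = [int(c) for c in cells if SCORE.match(c.strip())]
    return sum(scores) / len(scores) if scores else math.nan

# Invented cell texts: names and scores alternate, plus one non-numeric entry.
print(average_score(["Critic A", "70", "Critic B", "65", "n/a"]))
print(average_score([]))
```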

### The rest

In [329]:
fenetre, *_ = soupe.find_all(attrs={'id': ['floatholder coreGameInfo']})

In [330]:
len(fenetre)

1

In [333]:
left_part, *_ = fenetre.find_all(attrs={'id': ['coreGameRelease']})
right_part, *_ = fenetre.find_all(attrs={'id': ['coreGameGenre']})

In [339]:
for information in left_part.find_all('div'):
    print(information.text)
    print('------')

In [355]:
informations = right_part.find_all('div')

In [356]:
len(informations)

18

In [357]:
dictionnaire = dict()
for i in range(0, len(informations)- 1, 2):
    dictionnaire[informations[i].text] = informations[i + 1].text

In [358]:
dictionnaire

{'': "ESRB RatingTeenGenreActionPerspectiveBehind\xa0view, Diagonal-downVisual2D\xa0scrollingGameplayBeat\xa0'em\xa0up\xa0/\xa0brawler, ShooterInterfacePoint\xa0and\xa0selectNarrativeSpy\xa0/\xa0espionageMiscLicensed",
 'ESRB Rating': 'Teen',
 'Genre': 'Action',
 'Perspective': 'Behind\xa0view, Diagonal-down',
 'Visual': '2D\xa0scrolling',
 'Gameplay': "Beat\xa0'em\xa0up\xa0/\xa0brawler, Shooter",
 'Interface': 'Point\xa0and\xa0select',
 'Narrative': 'Spy\xa0/\xa0espionage',
 'Misc': 'Licensed'}

In [364]:
informations[1]

<div><div style="font-size: 100%; font-weight: bold;">ESRB Rating</div><div style="font-size: 90%; padding-left: 1em; padding-bottom: 0.25em;"><a href="https://www.mobygames.com/attribute/sheet/attributeId,92/">Teen</a></div><div style="font-size: 100%; font-weight: bold;">Genre</div><div style="font-size: 90%; padding-left: 1em; padding-bottom: 0.25em;"><a href="https://www.mobygames.com/genre/sheet/action/">Action</a></div><div style="font-size: 100%; font-weight: bold;">Perspective</div><div style="font-size: 90%; padding-left: 1em; padding-bottom: 0.25em;"><a href="https://www.mobygames.com/genre/sheet/behind-view/">Behind view</a>, <a href="https://www.mobygames.com/genre/sheet/diagonal-down/">Diagonal-down</a></div><div style="font-size: 100%; font-weight: bold;">Visual</div><div style="font-size: 90%; padding-left: 1em; padding-bottom: 0.25em;"><a href="https://www.mobygames.com/genre/sheet/2d-scrolling/">2D scrolling</a></div><div style="font-size: 100%; font-weight: bold;">Gam

In [366]:
dictionnaire.keys()

dict_keys(['', 'ESRB Rating', 'Genre', 'Perspective', 'Visual', 'Gameplay', 'Interface', 'Narrative', 'Misc'])
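
The empty key comes from the wrapper `div`, whose text is the concatenation of every label and value, and the values still contain non-breaking spaces (`\xa0`). A possible cleanup, sketched on plain strings; `pair_labels` is a hypothetical helper and the sample list is a shortened, hand-written stand-in for the `.text` of each `div`:

```python
def pair_labels(texts):
    """Pair label/value strings, skipping empty labels and replacing non-breaking spaces."""
    result = {}
    for label, value in zip(texts[0::2], texts[1::2]):
        if label:  # the wrapper div yields an empty label; drop it
            result[label] = value.replace("\xa0", " ")
    return result

texts = [
    "", "wrapper text",
    "ESRB Rating", "Teen",
    "Perspective", "Behind\xa0view, Diagonal-down",
]
print(pair_labels(texts))
```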