# Wikipedia Scrape Notebook

Here we will test out grabbing sections of text from Wikipedia articles, using some cobbled-together code and the `wikipedia` module.

In [1]:
import pandas as pd
import numpy as np
import wikipedia

So first we'll import the CSV that we exported from the vgchartz scraping notebook.

These are the entries that have a value for 'critic_score' and that are not themselves series (i.e. we removed the Pokemon Series entry and kept just the individual games).

In [2]:
df = pd.read_csv('../data/critic_scores.csv')
df.sample(5)

Unnamed: 0.1,Unnamed: 0,position,game,console,publisher,developer,vgchart_score,critic_score,user_score,total_shipped,total_sales,na_sales,pal_sales,japan_sales,other_sales,release_date,last_update
4867,1659,21660,X-Blades,PC,SouthPeak Interactive,Gaijin Entertainment,,6.4,,,0.00m,,0.00m,,,10th Feb 09,
5793,2831,37832,Medieval: Total War - Viking Invasion,PC,Activision,The Creative Assembly,,8.2,,,,,,,,07th May 03,
2972,2978,7979,Pro Evolution Soccer 2014,X360,Konami Digital Entertainment,Konami,,7.9,,,0.25m,0.08m,0.16m,,0.02m,24th Sep 13,02nd Mar 18
4670,4571,19572,Karnaaj Rally,GBA,Jaleco,Paragon 5,,7.9,,,0.01m,0.01m,0.00m,,0.00m,02nd Jan 03,
3468,240,10241,Spider-Man: Web of Shadows - Amazing Allies Ed...,PSP,Activision,Shaba Games,,7.7,,,0.16m,0.12m,0.02m,,0.02m,21st Oct 08,


In [138]:
df['plots'] = ''
df.shape

(6544, 18)

**Scraping time**

So this block is going to go through every game name in our list, find the corresponding Wikipedia page, and attach its 'plot' or 'story' section to our DataFrame.

The name has to match the Wikipedia page title exactly, so we're naturally going to lose quite a few rows...hopefully not too many.
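
As a minimal sketch of that lookup, the helper below returns the first section that exists from an ordered list of names. `first_section` and `FakePage` are hypothetical names used only for illustration; the stub stands in for a real `wikipedia.WikipediaPage`, assuming (as the scraping cell does) that `section()` gives back `None` when a page has no section by that name.

```python
def first_section(page, section_names):
    """Return the text of the first section found on the page, or None."""
    for name in section_names:
        text = page.section(name)
        # a missing section comes back as None, so keep looking
        if text is not None:
            return text.replace('\n', ' ')
    return None


class FakePage:
    """Hypothetical stand-in for a Wikipedia page, so no network is needed."""

    def __init__(self, sections):
        self._sections = sections

    def section(self, title):
        return self._sections.get(title)


page = FakePage({'Gameplay': 'Run and jump.', 'Plot': 'A hero\nsaves the day.'})
result = first_section(page, ['Plot', 'Gameplay'])
missing = first_section(page, ['Story', 'Premise'])
```

Order matters here: a page with both a 'Plot' and a 'Gameplay' section gets its 'Plot' text, because that name comes first in the list.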

In [4]:
# create a list of all the names you think/know the section might be called
possibles = ['Plot', 'Synopsis', 'Plot synopsis', 'Plot summary',
             'Story', 'Plotline', 'Gameplay', 'Summary',
             'Content', 'Premise']

# sometimes those names have 'Edit' stuck on the end by accident (e.g. 'PlotEdit')
possibles_edit = [i + 'Edit' for i in possibles]

# then merge those two lists together
all_possibles = possibles + possibles_edit



# now for the actual fetching!
for idx, row in df.iterrows():

    # load the page once and save it as a variable,
    # retrying with a ' (video game)' suffix if the plain title fails
    try:
        wik = wikipedia.WikipediaPage(row['game'])
    except Exception:
        try:
            wik = wikipedia.WikipediaPage(row['game'] + ' (video game)')
        except Exception:
            wik = None
            print('no page found')

    # a new try, except for the plot
    try:
        # for all possible titles in all_possibles list
        for j in all_possibles:
            # fetch the section once; it is None if that section doesn't exist
            section = wik.section(j)
            if section is not None:
                # then that's what the plot is! Otherwise try the next one!
                plot = section.replace('\n', '').replace("'", "")
                df.at[idx, 'plots'] = plot
                break

    # if the page didn't load from above (wik is None) or the lookup fails,
    # mark the row as 'N/A'; a page with no matching section keeps plots == ''
    except Exception:
        df.at[idx, 'plots'] = 'N/A'
        print(row['game'])
        continue
        

no page found
Pokémon Red / Green / Blue Version
no page found
Pokémon Gold / Silver Version
no page found
Pokémon Diamond / Pearl Version
no page found
Pokémon Ruby / Sapphire Version
no page found
Pokémon Sun/Moon
no page found
Pokémon Black / White Version




  lis = BeautifulSoup(html).find_all('li')


no page found
Pokémon Heart Gold / Soul Silver Version
no page found
Pokémon FireRed / LeafGreen Version
no page found
Gran Turismo
no page found
Pokémon: Ultra Sun and Ultra Moon
no page found
GoldenEye 007
no page found
Pokémon Platinum Version
no page found
Pokémon Crystal Version
no page found
FIFA Soccer 11
no page found
Pokémon Mystery Dungeon: Explorers of Time / Darkness
no page found
Gran Turismo
no page found
God of War
no page found
Nintendogs + cats
no page found
Need for Speed: Most Wanted
no page found
FIFA 07 Soccer
no page found
EyeToy Play
no page found
Star Wars Battlefront II
no page found
Cooking Mama 2: Dinner With Friends
no page found
FIFA Soccer 11
no page found
Need for Speed: Hot Pursuit
no page found
Dragon Ball: Xenoverse 2
no page found
Rockstar Games Double Pack: Grand Theft Auto III & Grand Theft Auto Vice City
no page found
Classic NES Series: Super Mario Bros.
no page found
Super Mario All-Stars: Limited Edition
no page found
Need for Speed: Hot Pursuit

Alright, we got it! The output of the cell above lists every title that we couldn't find a page for. We'll worry about sorting through those later.
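
For that later pass, one option is to generate a few alternative titles per miss and try each in turn. This is only a sketch: `candidate_titles` is a hypothetical helper, the trailing ' Version' rule is a guess based on the Pokémon entries in the output above, and none of the generated titles are verified to exist on Wikipedia.

```python
def candidate_titles(name):
    """Return page titles to try for a game name, in order (hypothetical helper)."""
    bases = [name]
    # guess: vgchartz appends ' Version' to some names; try without it too
    if name.endswith(' Version'):
        bases.append(name[:-len(' Version')])

    titles = []
    for base in bases:
        # same ' (video game)' fallback the scraping loop uses
        for title in (base, base + ' (video game)'):
            if title not in titles:
                titles.append(title)
    return titles


crystal = candidate_titles('Pokémon Crystal Version')
gow = candidate_titles('God of War')
```

Each candidate could then be passed to `wikipedia.WikipediaPage` inside the same try/except pattern as before, stopping at the first title that loads.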

In [141]:
df.sample(5)

Unnamed: 0.1,Unnamed: 0,position,game,console,publisher,developer,vgchart_score,critic_score,user_score,total_shipped,total_sales,na_sales,pal_sales,japan_sales,other_sales,release_date,last_update,plots
4604,3591,18592,Disney Sports Basketball,GBA,Konami,Konami,,6.7,,,0.02m,0.01m,0.00m,,0.00m,23rd Nov 02,,
6282,2608,47609,Swamp Buggy Racing,PC,WizardWorks,Daylight Productions,,1.9,,,,,,,,21st Jan 00,,
917,1824,1825,Minecraft,WiiU,Mojang,4J Studios,,5.5,,,1.47m,0.50m,0.49m,0.38m,0.09m,17th Jun 16,05th Aug 18,
5809,3238,38239,MicroBot,PSN,Electronic Arts,Naked Sky Entertainment,,6.4,,,,,,,,04th Jan 11,,
2054,4639,4640,Sacred 2: Fallen Angel,X360,CDV Software Entertainment,Ascaron Entertainment,,7.0,,,0.54m,0.29m,0.16m,0.04m,0.05m,11th May 09,,


**SAVE YOUR WORK**

Let's save the raw output of the scrape, and then save just the rows we actually got text for, as separate CSV files.

In [9]:
df.to_csv('../data/wikipedia_results.csv')
df[(df.plots != '') & (df.plots != 'N/A')].to_csv('../data/games_with_plots.csv')

In [140]:
games_with_plots = pd.read_csv('../data/games_with_plots.csv')
games_with_plots.shape

(4196, 19)

So after scraping, we got text for 4196 of our 6544 games (a plot section, or one of the fallbacks like 'Gameplay')...that'll do for now.
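
One thing to watch when reading these files back: with its default settings, `pd.read_csv` parses both empty fields and the string 'N/A' as NaN, so the two markers used in the `plots` column become indistinguishable after a round trip. The sketch below uses a made-up three-row frame and an in-memory buffer rather than the real files. It also passes `index=False`, on the assumption that the `Unnamed: 0` columns in the samples come from indexes saved in earlier exports.

```python
import io

import pandas as pd

# made-up rows: one real plot, one page with no matching section, one missing page
demo = pd.DataFrame({
    'game': ['A', 'B', 'C'],
    'plots': ['Some plot text', '', 'N/A'],
})

# index=False keeps the row index out of the file
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# default read: '' and 'N/A' both come back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False reads the column back as literal strings
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)
```

With the default read, filtering on `plots.notna()` is the simplest way to recover the rows that have text; with `keep_default_na=False`, the original `!= ''` and `!= 'N/A'` comparisons still work.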