# Web scraping the PS4 store with Selenium

This notebook shows how to use Selenium to scrape data from the PlayStation Store (store.playstation.com).
Its only purpose is to explore what web scraping can do and to prepare a dataset for academic purposes.



<a href="https://colab.research.google.com/drive/1_WyM24eXWf-pdcqJKCcsb1pWwnbEWRi8?authuser=2#scrollTo=VYfg3I_fSt6e"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>


In [None]:
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

In [None]:
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
import pandas
import json
import pprint

In [None]:
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

In [None]:
wd = webdriver.Chrome(options=chrome_options)  # chromedriver was copied to /usr/bin, so it is found on PATH
wd.get("https://store.playstation.com/it-it/grid/STORE-MSF75508-FULLGAMES/1?direction=desc&platform=ps4&sort=release_date")

In [None]:
wd.save_screenshot('screenshot.png')

%pylab inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img=mpimg.imread('/content/screenshot.png')
imgplot = plt.imshow(img)
plt.show()

# Game List from PS4

Let's start by downloading the list of games using CSS selectors.





In [None]:
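from selenium.webdriver.common.by import By
# Each game tile on the grid page is a div with class "grid-cell__body"
list_games = wd.find_elements(By.CSS_SELECTOR, "div.grid-cell__body")
print(len(list_games))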


In [None]:
import pprint
import time
from selenium.webdriver.common.by import By
detail_games = []
for game in list_games:
    # Take the first match of each selector inside the current tile
    title = game.find_elements(By.CSS_SELECTOR, "a>div.grid-cell__title>span")[0].text
    price = game.find_elements(By.CSS_SELECTOR, "h3.price-display__price")[0].text
    url = game.find_elements(By.CSS_SELECTOR, "div.grid-cell__body>a.internal-app-link.ember-view")[0].get_attribute("href")
    detail_games.append({'url': url,
                            'title': title,
                            'price': price,
                            })

    time.sleep(1.1)

pprint.pprint(detail_games)
len(detail_games)

In [None]:
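from selenium.webdriver.common.by import By

def parse_game(game):
  # Extract title, price and url from one grid tile; fields stay empty if a lookup fails
  title = ""
  price = ""
  url = ""
  try:
    title = game.find_elements(By.CSS_SELECTOR, "a>div.grid-cell__title>span")[0].text
    price = game.find_elements(By.CSS_SELECTOR, "h3.price-display__price")[0].text
    url = game.find_elements(By.CSS_SELECTOR, "div.grid-cell__body>a.internal-app-link.ember-view")[0].get_attribute("href")
  except Exception:
    # Any failed lookup leaves the remaining fields at their defaults
    pass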
  return {'title': title,
          'price': price,
          'url': url}
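
Because the three lookups in `parse_game` share a single `try` block, a tile without a title also loses its price and URL. One possible refinement is to read each field independently. The sketch below uses a hypothetical helper, `first_text`, which is not part of Selenium; it only assumes that the elements expose a `.text` attribute, as Selenium's `WebElement` does. The sample objects are stand-ins, not real page elements.

```python
from types import SimpleNamespace

def first_text(elements, default=""):
    """Return the text of the first element, or `default` if the list is empty.

    Hypothetical helper written for this sketch, not a Selenium function.
    """
    return elements[0].text if elements else default

# Stand-ins for the lists returned by find_elements (invented sample data)
found = [SimpleNamespace(text="Sample Game")]
missing = []

title = first_text(found)
price = first_text(missing, default="n/a")
print(title, price)
```

With this approach an empty result for one selector does not prevent the other fields from being read.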


Here's how to download the first two pages of games...

In [None]:
from selenium.webdriver.common.by import By
detail_games = []
for num in tqdm(range(1,3)):  # range(1,3) covers pages 1 and 2
  wd.get(f"https://store.playstation.com/it-it/grid/STORE-MSF75508-FULLGAMES/{num}?direction=desc&platform=ps4&sort=release_date")
  wd.save_screenshot(f'screenshot_{num}.png')
  list_games = wd.find_elements(By.CSS_SELECTOR, "div.grid-cell__body")
  for game in list_games:
    detail_games.append(parse_game(game))

print(len(detail_games))

Did you notice the **tqdm** library?
Its documentation is available at **https://github.com/tqdm/tqdm**.
It adds progress bars to loops, which makes the notebook much nicer to follow.

### How to end scraping?
Let's scrape the first 7 pages.

In [None]:
import numpy as np
h = np.random.random(1)  # one random float in [0, 1), returned as a 1-element array
print(h)

In [None]:
import time
from selenium.webdriver.common.by import By
detail_games = []
for num in tqdm(range(1,8)):  # pages 1 to 7
  time.sleep(1.2)  # short pause between page requests
  wd.get(f"https://store.playstation.com/it-it/grid/STORE-MSF75508-FULLGAMES/{num}?direction=desc&platform=ps4&sort=release_date")
  #wd.save_screenshot(f'screenshot_{num}.png')
  list_games = wd.find_elements(By.CSS_SELECTOR, "div.grid-cell__body")
  for game in list_games:
    detail_games.append(parse_game(game))

print(len(detail_games))
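
The loop above pauses for a fixed 1.2 seconds before each page. A common variation is to add a small random extra to each pause so that requests are not perfectly regular. The following is a minimal sketch of that idea; `polite_delay` and its parameters are hypothetical names chosen for illustration, and the `sleep` argument is injectable only so the function can be tried without actually waiting.

```python
import random
import time

def polite_delay(base=1.2, jitter=0.5, sleep=time.sleep):
    """Sleep for `base` seconds plus a random extra of up to `jitter` seconds.

    Hypothetical helper; returns the delay that was requested.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay

# Example: collect the delays in a list instead of sleeping
waited = []
for _ in range(3):
    polite_delay(sleep=waited.append)
print(waited)
```

Inside the scraping loop, `polite_delay()` could take the place of `time.sleep(1.2)`.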

# Pandas and data processing

Let's create a pandas DataFrame and rearrange its columns.


In [None]:
import pandas as pd
df = pd.DataFrame(detail_games)
df["ID"] = df.index + 1  # 1-based game identifier
df = df[["ID", "title", "url", "price"]]  # put ID first, then title, url and price
df.head()

The `.info()` method provides an indication of the structure and data of the `DataFrame`.

In [None]:
df.info()

In [None]:
df.to_csv('ds_games.csv')
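
The `price` column holds the text exactly as it was scraped, so it cannot be used for numeric analysis yet. The sketch below shows one way to convert it. It rests on an assumption that is not verified here: that prices follow Italian formatting such as "€19,99" and that free titles carry a label such as "Gratis". The helper name `parse_price` and the sample strings are invented for illustration.

```python
def parse_price(text):
    """Convert a scraped price string to a float, or None if it is not numeric.

    Hypothetical helper. Assumes Italian formatting such as "€19,99"
    (comma as decimal separator); labels like "Gratis" give None.
    """
    cleaned = text.replace("€", "").replace("\xa0", "").strip()
    # Italian notation: "." groups thousands, "," marks decimals
    cleaned = cleaned.replace(".", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None

# Invented sample strings
samples = ["€19,99", "€ 1.299,00", "Gratis", ""]
print([parse_price(s) for s in samples])
```

If the assumption holds, the conversion could be applied to the whole column with `df["price"].map(parse_price)`.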

### PS4 pages

Now the goal is to visit each game's page and download its details; the game images are collected afterwards.

**Pandas** provides the *read_csv* function, which loads a CSV file into a DataFrame.

In [None]:
# open csv file
import pandas as pd
ds_detail_games = pd.read_csv("ds_games.csv", index_col=[0])
ds_detail_games.head()

***First 5 games, to see how it works and whether it works***


In [None]:
from selenium.webdriver.common.by import By
import pprint
import time
details = []
wd.set_window_size(1920, 1080)
for ID, game in ds_detail_games.head().iterrows():
    link = game["url"]
    print(link)
    wd.get(link)
    time.sleep(16)  # wait for the JavaScript-rendered page before reading any element

    title = wd.find_elements(By.CSS_SELECTOR, "h2.pdp__title")[0].text
    wd.save_screenshot(f'screenshot_{ID}.png')  # the ID is safer in a file name than the title
    # "Valutazioni" (ratings) and "Pubblicato" (published) are labels of the Italian store
    Val = wd.find_elements(By.CSS_SELECTOR, "div.provider-info__rating-count")[0].text.replace(" Valutazioni","")
    Genre = wd.find_elements(By.CSS_SELECTOR, "li.tech-specs__menu-items")[0].text
    Pub = wd.find_elements(By.CSS_SELECTOR, "span.provider-info__list-item")[1].text.replace("Pubblicato ","")
    details.append({'ID': ID+1,
                    'title': title,
                    'Val': Val,
                    'Genre': Genre,
                    'Pub': Pub})

pprint.pprint(details)
print(len(details))
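
A fixed `time.sleep(16)` waits the full time even when the page is ready sooner, and may still be too short on a slow connection. The idea behind an explicit wait can be sketched in plain Python: poll a condition until it holds or a timeout expires. The `wait_until` helper below is hypothetical and is not Selenium's own `WebDriverWait`; the `fake_find_elements` function stands in for a lookup on a page whose content appears after a few checks.

```python
import time

def wait_until(condition, timeout=15.0, poll=0.5):
    """Call `condition()` every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    Hypothetical helper written for this sketch.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll)

# Stand-in for a page whose elements appear on the third check
attempts = {"n": 0}

def fake_find_elements():
    attempts["n"] += 1
    return ["title element"] if attempts["n"] >= 3 else []

elements = wait_until(fake_find_elements, timeout=2.0, poll=0.01)
print(len(elements), attempts["n"])
```

With the notebook's driver, the condition could be something like `lambda: wd.find_elements(By.CSS_SELECTOR, "h2.pdp__title")`, which returns an empty (falsy) list until the title is rendered.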




Now, for every link in the list built earlier, we extract the fields of interest for each game and collect them in a pandas DataFrame.

In [None]:
from selenium.webdriver.common.by import By
import time
details = []
wd.set_window_size(1920, 1080)
for ID, game in tqdm(ds_detail_games.iterrows(), total=ds_detail_games.shape[0]):
    time.sleep(1)
    link = game["url"]
    # Defaults used when an element is missing or the page fails to load
    genre = 0
    Pub = 0
    Title = 0
    Val = 0
    Age = 0
    Star = 0

    #print(link)
    try:
      wd.get(link)
      time.sleep(15)  # wait for the JavaScript-rendered page
      wd.save_screenshot(f'screenshot_{ID}.png')
      genre_list = wd.find_elements(By.CSS_SELECTOR, "li.tech-specs__menu-items")
      Pub_list = wd.find_elements(By.CSS_SELECTOR, "span.provider-info__list-item")
      Title_list = wd.find_elements(By.CSS_SELECTOR, "h2.pdp__title")
      Val_list = wd.find_elements(By.CSS_SELECTOR, "div.provider-info__rating-count")
      Age_list = wd.find_elements(By.CSS_SELECTOR, "img.content-rating__rating-img")
      N_full_star = wd.find_elements(By.CSS_SELECTOR, "div i.star-rating__star.fa.fa-star")
      N_half_stars = wd.find_elements(By.CSS_SELECTOR, "div i.star-rating__star.fa.fa-star-half-o")
      if len(genre_list) > 0:
        genre = genre_list[0].text
      if len(Pub_list) > 1:  # the publication date is the second item
        Pub = Pub_list[1].text.replace("Pubblicato ","")
      if len(Title_list) > 0:
        Title = Title_list[0].text
      if len(Val_list) > 0:
        Val = Val_list[0].text.replace(" Valutazioni","")
      if len(Age_list) > 0:
        # Keep only the digits of the rating image file name
        Age = "".join(filter(str.isdigit, Age_list[0].get_attribute("src").split("/")[-1]))
      # Each full star counts 1, each half star 0.5
      Star = len(N_full_star) + len(N_half_stars) / 2
    except Exception as e:
      print(e)
    details.append({'ID': ID+1,
                    'genre': genre,
                    'Pub': Pub,
                    'Title': Title,
                    'Val': Val,
                    'Age': Age,
                    'Star': Star,
                    'Url':link})
  
print(len(details))
#pprint.pprint(details)
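
The two derived fields in the loop above, the age rating and the star score, can be tried in isolation. The sketch below uses hypothetical helper names and an invented image URL; only the logic (keep the digits of the image file name, count each half star as 0.5) comes from the notebook.

```python
def age_from_src(src):
    """Keep only the digits found in the last path segment of an image URL."""
    filename = src.split("/")[-1]
    return "".join(ch for ch in filename if ch.isdigit())

def star_score(n_full, n_half):
    """Each full star counts 1, each half star 0.5."""
    return n_full + n_half / 2

# Invented URL, used only to exercise the function
print(age_from_src("https://example.com/ratings/pegi_18.png"))
print(star_score(4, 1))
```

Note that `age_from_src` returns a string and would also pick up any other digits present in the file name.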


In [None]:
len(details)

Store the data with *pandas*

In [None]:
import pandas as pd
ds_details = pd.DataFrame(details)
ds_details = ds_details.set_index("ID")  # set_index returns a new DataFrame, so keep the result
ds_details.head()

In [None]:
ds_details.info()

In [None]:
ds_details.to_csv("ds_project_details.csv")

# PS4 Game images




Our goal is to create a dataset of images and a `DataFrame` with the columns:
- `game_id`
- `img_file`


In [None]:

wd.get(f"https://store.playstation.com/it-it/grid/STORE-MSF75508-FULLGAMES/1?direction=desc&platform=ps4&sort=release_date")

In [None]:
import os
import requests
from selenium.webdriver.common.by import By

img_dir = "/content/immagini"
os.makedirs(img_dir, exist_ok=True)  # create the target folder before writing into it

list_images = []
ID = 0
for num in tqdm(range(1,8)):
  wd.get(f"https://store.playstation.com/it-it/grid/STORE-MSF75508-FULLGAMES/{num}?direction=desc&platform=ps4&sort=release_date")
  list_games = wd.find_elements(By.CSS_SELECTOR, "div.grid-cell.grid-cell--game")
  for game in list_games:
    try:
      ID = ID+1
      src = game.find_element(By.CSS_SELECTOR, "div.product-image__img.product-image__img--main>img").get_attribute("src")
      img_file = requests.get(src)
      if img_file.status_code == 200:
        with open(os.path.join(img_dir, "img_" + str(ID) + ".jpg"), 'wb') as f:
          f.write(img_file.content)
        # Record the image only after it has been saved
        list_images.append({"game_id": ID,
                            "img_file": "img_" + str(ID) + ".jpg"})
    except Exception as e:
      print(e)
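
Writing into a folder that does not exist raises `FileNotFoundError`, so the target directory has to be there before the first download. The following is a minimal sketch of that step in isolation, using a temporary directory and placeholder bytes instead of a real image; the helper name `save_image` is hypothetical.

```python
import os
import tempfile

def save_image(content, directory, game_id):
    """Write raw image bytes to <directory>/img_<game_id>.jpg and return the path.

    Hypothetical helper; the directory is created if it is missing.
    """
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"img_{game_id}.jpg")
    with open(path, "wb") as f:
        f.write(content)
    return path

# Placeholder bytes stand in for the body of a downloaded image
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "immagini")
    saved = save_image(b"\xff\xd8\xff", target, 1)
    print(os.path.basename(saved), os.path.getsize(saved))
```

Returning the path makes it easy to store the file name next to the `game_id` once the write has succeeded.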
  



In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
print(len(list_images))

In [None]:
%pylab inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img=mpimg.imread('/content/immagini/img_174.jpg')  # same folder and extension used when saving
imgplot = plt.imshow(img)
plt.show()


In [None]:
import pandas as pd
ds_images = pd.DataFrame(list_images)
ds_images = ds_images.set_index("game_id")  # set_index returns a new DataFrame, so keep the result
ds_images.head()

In [None]:
ds_images.info()

In [None]:
ds_images.to_csv("ds_images.csv")

In [None]:
!zip -r "/content/images.zip" "/content/immagini/"


# API

Let's see how to call APIs provided by our suppliers or colleagues, here through the Clarifai Python client.

## Clarifai

Now we use Clarifai to identify 10 concepts for each image; they will be used in the tkinter GUI planned for the project.


```
pip install clarifai
```






In [None]:
!pip install clarifai

In [None]:
from clarifai.rest import ClarifaiApp

# set up your key: paste your own Clarifai API key here and never publish a real one
clarifai_key = "YOUR_CLARIFAI_API_KEY"
app = ClarifaiApp(api_key=clarifai_key)

# and use the general model
model = app.public_models.general_model

Now let's identify the concepts for the image of each game.

In [None]:
import pandas as pd
ds_images = pd.read_csv("ds_images.csv", index_col="game_id")

img_details = []
for game_id, image in tqdm(ds_images.iterrows(), total=ds_images.shape[0]):
  try:
    response = model.predict_by_filename("/content/immagini/" + image['img_file'])
    if(response['status']['description'] == "Ok"):
      # Keep at most the first 10 concepts of each image
      for concept in response["outputs"][0]["data"]["concepts"][:10]:
          img_details.append({
            "game_id": game_id,
            "image": image['img_file'],
            "name": concept["name"],
            "value": concept["value"]
          })
  except Exception as e:
    print(e)

print(len(img_details))
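
The concept extraction can be tried without calling the API. The sketch below uses an invented sample dictionary that has the same nesting the loop above reads (`status`, `outputs`, `data`, `concepts`); the concept names and values are made up, and `top_concepts` is a hypothetical helper. It sorts by `value` explicitly rather than relying on the order in which concepts arrive.

```python
def top_concepts(response, n=10):
    """Return up to `n` (name, value) pairs with the highest confidence values.

    Hypothetical helper; it reads only the keys used in the notebook's loop.
    """
    if response["status"]["description"] != "Ok":
        return []
    concepts = response["outputs"][0]["data"]["concepts"]
    ranked = sorted(concepts, key=lambda c: c["value"], reverse=True)
    return [(c["name"], c["value"]) for c in ranked[:n]]

# Invented sample with the same nesting the notebook accesses
sample = {
    "status": {"description": "Ok"},
    "outputs": [{"data": {"concepts": [
        {"name": "car", "value": 0.91},
        {"name": "people", "value": 0.98},
        {"name": "city", "value": 0.75},
    ]}}],
}
print(top_concepts(sample, n=2))
```

Limiting the list per response keeps the count independent from one image to the next.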

In [None]:
pprint.pprint(img_details)

In [None]:
import pandas as pd
ds_img_details = pd.DataFrame(img_details)
ds_img_details = ds_img_details.set_index("game_id")  # set_index returns a new DataFrame, so keep the result
ds_img_details.head()

In [None]:
ds_img_details.info()

In [None]:
ds_img_details.to_csv('ds_img_details.csv')