# Scraping Planet Zoo Wiki Data

This project builds a list of all animals currently featured in [Planet Zoo](https://www.planetzoogame.com/). The data is scraped from the [Fandom wiki page](https://planetzoo.fandom.com/wiki/List_of_Animals).
The main purpose is to keep my own record of which animals I have already researched and to check which ones are housed in a zoo in franchise mode.

There is also a [Guest View Radius](https://steamcommunity.com/sharedfiles/filedetails/?id=2638946337) resource published in the Steam Community that helps determine the best radius.

The ETL process is done with PySpark. It would have been easier to collect all the data in a spreadsheet, as many others have done, but I tried Spark just for fun.

Data is stored locally in Parquet format.

In [1]:
import requests

import pandas as pd

from bs4 import BeautifulSoup
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, when

## Extraction

### Animal list from the Wiki

In [2]:
wiki_url = 'https://planetzoo.fandom.com/wiki/'
list_url = 'List_of_Animals'
req = requests.get(wiki_url + list_url)
req.raise_for_status()  # fail early if the wiki page could not be fetched
soup = BeautifulSoup(req.text, 'html.parser')

In [3]:
zoo_soups = {}
for row in soup.find('table').find_all('tr'):
    cols = row.find_all('td')
    if len(cols) != 0:
        href = cols[0].find('a')['href'].split('/')[-1]
        req2 = requests.get(wiki_url + href)
        soup2 = BeautifulSoup(req2.text, 'html.parser')
        zoo_soups[cols[0].get_text().strip()] = soup2

In [4]:
biomes = {}
for k, v in zoo_soups.items():
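    # The infobox keeps the biome icons inside the div tagged data-source="biome"
    biomes_tags = v.find('div', {'data-source': 'biome'})
    tmp = []
    for biome in biomes_tags.find_all('img'):
        # Some alt texts carry an 'Icon.png' suffix; keep only the part before it
        extra_string = biome['alt'].find('Icon.png')
        if extra_string > -1:
            tmp.append(biome['alt'][:extra_string])
        else:
            tmp.append(biome['alt'])

    biomes[k] = tmp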
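
The alt-text cleanup in the loop above can be checked in isolation. The following is a minimal sketch, not part of the original notebook: `clean_biome_alt` is a hypothetical helper, the sample strings are made up, and the trailing `.strip()` is an addition that the cell above does not perform.

```python
def clean_biome_alt(alt: str) -> str:
    """Drop the 'Icon.png' suffix from an image alt text, if present."""
    marker = 'Icon.png'
    pos = alt.find(marker)
    # str.find returns -1 when the marker is absent
    name = alt[:pos] if pos > -1 else alt
    # Assumption: surrounding whitespace is not meaningful, so trim it
    return name.strip()


# Hypothetical alt texts, not copied from the wiki
samples = ['Tundra Icon.png', 'Desert']
cleaned = [clean_biome_alt(s) for s in samples]
print(cleaned)
```

If the real alt texts have a space before `Icon.png`, the slice alone would leave a trailing space in the biome name, which is why the sketch trims the result.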

In [5]:
animal_list = []
for row in soup.find('table').find_all('tr'):
    cols = row.find_all('td')
    if len(cols) != 0:
        animal_list.append((
            cols[0].get_text().strip(),
            cols[1].find('img')['alt'].strip(),
            cols[2].get_text().strip(),
            cols[3].get_text().strip(),
            biomes[cols[0].get_text().strip()]
        ))

In [7]:
# Every row of the wiki table should have a matching entry in the biomes dict
len(animal_list) == len(biomes)

True

In [8]:
headers = [elem.text.strip() for elem in soup.find('table').find_all('th')]

### Radius Data

In [9]:
rad_df = pd.read_excel('radius.xlsx')

In [10]:
rad_df

Unnamed: 0,Species,Good,Neutral,Bad
0,Indian Peafowl,<12m,12-24m,>24m
1,Japanese Macaque,<12m,12-24m,>24m
2,Koala,<12m,12-24m,>24m
3,Meerkat,<12m,16-24m,>24m
4,Red Panda,<12m,12-24m,>24m
...,...,...,...,...
82,Giant Panda,<20m,20-36m,>36m
83,Gray Seal,<20m,20-36m,>36m
84,Indian Elephant,<24m,24-48m,>48m
85,Polar Bear,<24m,24-48m,>48m
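
The `Good`, `Neutral` and `Bad` columns hold ranges as text (`<12m`, `12-24m`, `>24m`), which is fine for display but awkward for numeric comparison. As a possible follow-up, here is a minimal sketch of turning those strings into numeric bounds. `parse_radius` is a hypothetical helper that is not used elsewhere in this notebook, and it assumes every value follows one of the three shapes shown above.

```python
import re
from typing import Optional, Tuple


def parse_radius(text: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (lower, upper) bounds in metres; None means unbounded."""
    value = text.strip().rstrip('m')
    if value.startswith('<'):
        return (None, float(value[1:]))
    if value.startswith('>'):
        return (float(value[1:]), None)
    match = re.fullmatch(r'(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)', value)
    if match is None:
        raise ValueError(f'Unrecognised radius format: {text!r}')
    return (float(match.group(1)), float(match.group(2)))


print(parse_radius('<12m'))
print(parse_radius('12-24m'))
print(parse_radius('>24m'))
```

Raising on unknown shapes, rather than returning `None`, makes unexpected spreadsheet values visible instead of silently dropping them.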


## Transform

In [11]:
headers[-2] = 'Enclosure'
headers[-1] = 'Package'
headers.append('Biomes')
headers

['Species', 'Status', 'Enclosure', 'Package', 'Biomes']

In [12]:
spark = (SparkSession.builder.appName('PlanetZooAnimals').getOrCreate())

In [13]:
df = spark.createDataFrame(animal_list, headers)

In [14]:
df2 = df.select(*df.columns[:-1], explode(df.Biomes).alias('Biome'))
# df2.show()
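
To make the effect of `explode` concrete, here is a plain-Python sketch of the same reshaping: one output row per element of the list column. It is an illustration only, not Spark code; the rows are made-up placeholders with fewer columns than the real DataFrame, and `explode_rows` is a hypothetical name.

```python
# Placeholder rows: (Species, Enclosure, Biomes); values are invented
rows = [
    ('Animal A', 'Habitat', ['Biome 1', 'Biome 2']),
    ('Animal B', 'Exhibit', ['Biome 3']),
    ('Animal C', 'Habitat', []),
]


def explode_rows(rows):
    """Repeat the leading columns once per element of the trailing list."""
    return [
        (*rest, biome)
        for *rest, biomes in rows
        for biome in biomes
    ]


exploded = explode_rows(rows)
for row in exploded:
    print(row)
```

As with Spark's `explode`, a row whose list is empty produces no output rows, so an animal without any scraped biome would disappear from `df2`.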

In [15]:
# Normalize labels: 'Full' -> 'Habitat', 'Standard' -> 'Base'
df2 = df2.withColumn('Enclosure', when(df2.Enclosure == 'Full', 'Habitat').otherwise(df2.Enclosure))
df2 = df2.withColumn('Package', when(df2.Package == 'Standard', 'Base').otherwise(df2.Package))

In [16]:
df_rad = spark.createDataFrame(rad_df)

In [17]:
df2.createOrReplaceTempView('ANIMAL_LIST')
df_rad.createOrReplaceTempView('RADIUS_VIEW')

In [18]:
query = """
SELECT A.Species, Status, Enclosure, Package, Biome, Good, Neutral, Bad
FROM ANIMAL_LIST A
LEFT JOIN RADIUS_VIEW R
ON A.Species = R.Species
ORDER BY A.Species
"""

In [19]:
df3 = spark.sql(query)
# df3.show(df2.count())
# df3.show()
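
The `LEFT JOIN` keeps every animal even when the radius table has no matching species, leaving `Good`, `Neutral` and `Bad` empty for that row. The sketch below shows this behaviour with the standard-library `sqlite3` module instead of Spark, so it runs without a cluster. The table contents are toy data: the Koala radius values come from the table shown earlier, while the biome names and `Example Animal` are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE ANIMAL_LIST (Species TEXT, Biome TEXT)')
conn.execute(
    'CREATE TABLE RADIUS_VIEW (Species TEXT, Good TEXT, Neutral TEXT, Bad TEXT)'
)

# Toy rows; 'Example Animal' deliberately has no radius entry
conn.executemany(
    'INSERT INTO ANIMAL_LIST VALUES (?, ?)',
    [('Koala', 'Placeholder Biome'), ('Example Animal', 'Placeholder Biome')],
)
conn.executemany(
    'INSERT INTO RADIUS_VIEW VALUES (?, ?, ?, ?)',
    [('Koala', '<12m', '12-24m', '>24m')],
)

joined = conn.execute("""
SELECT A.Species, Biome, Good, Neutral, Bad
FROM ANIMAL_LIST A
LEFT JOIN RADIUS_VIEW R
ON A.Species = R.Species
ORDER BY A.Species
""").fetchall()
conn.close()

for row in joined:
    print(row)
```

Rows with `None` in the radius columns point to species whose names differ between the wiki and the spreadsheet, or that are missing from the spreadsheet altogether.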

## Loading

In [20]:
# Overwrite any previous output so the notebook can be re-run
df3.write.mode('overwrite').parquet('zoopedia.parquet')

In [21]:
spark.stop()