# Gathering the Data
- **Name: Andrew Angulo**
- **Data Systems Project**

The purpose of this project is to acquire data by scraping it from websites. Doing this will allow us to work with the data in SQL and assess it properly.

---
### Website URLs
To acquire the data, we first have to either save the HTML of each website's current page or simply retrieve it using **requests**.
- [Arknights Operator Table](https://gamepress.gg/arknights/tools/interactive-operator-list#tags=null##cn##stats)
    - **Description:** This website includes tabular data on each character's rarity and stats
    - **Columns:** 2
    - **Rows:** 269 (as of November 2022)
    - **Data Format:** Long
    - **Created By:** Gamepress
- [Arknights Banner Table](https://gamepress.gg/arknights/database/banner-list-gacha)
    - **Description:** This website includes tabular data on the upcoming banners that players can look forward to, along with the specific characters featured on each banner
    - **Columns:** 5
    - **Rows:** 9 (as of November 2022)
    - **Data Format:** Wide
    - **Created By:** Daniel O'Brien, NorseFTX
    
### Goal of this project:
- The goal of this project is to answer the question: **Are some characters better than others based on their stats, and how much would the player have to invest in the game to acquire that character?**

### Permission to Scrape Website:
- [Evidence 1](https://i.imgur.com/uSV9kiN.png)
    - **NorseFTX:** the person who created these datasets and publishes them on **gamepress** within the Arknights section
- [Evidence 2](https://i.imgur.com/ObkCO0d.png)
    - **ChibiChu:** a staff member who works at **gamepress**
    
<u><i>**NOTE:** All information and permission to scrape the websites were requested from them directly through their [Discord](https://discord.com/invite/yq8D9GX), as **gamepress** includes a link to the Discord server where their staff reside.</i></u>

---
### Description of Imports:

- We use **pandas** so that once we grab our data and store it in a dictionary of lists, we can transfer that data into a DataFrame


- We use **requests** so we can grab all of the HTML from just the URL


- We use **from lxml import html** because, working alongside **requests**, it lets us use XPath to parse and traverse the HTML contents and return the data we need

In [1]:
import pandas as pd
import requests
from lxml import html

---
### Why are we scraping these websites?

We are scraping these websites because they provide the data we need to answer our central question: **Are some characters better than others based on their stats, and how much would the player have to invest in the game to collect that character?**

##### Variables:
- **OperatorURL:** Stores a string of the link to the website that contains the table of all of the operators
- **BannerURL:** Stores a string of the link to the website that contains the table of all of the upcoming banners
- **OperatorPage/BannerPage:** Stores the response retrieved from the given URL
- **OperatorTree/BannerTree:** Parses and stores the HTML from the given website

##### Why do we need this?
- We need these variables because without them we would have no HTML, and therefore no data, to work with. With these variables we are one step closer to answering our proposal question

##### How do we grab the data?
- We grab the data by using **requests.get()**, which retrieves all of the HTML from the specified URL. After that we parse it with **html.fromstring(<u>insert_variable</u>.content)**, which allows us to read it and go through all of its contents using XPath and other methods
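
One detail worth keeping in mind: everything after the first `#` in the operator URL is a URL fragment, which is handled on the client side and is not sent to the server as part of the HTTP request. The sketch below uses only the standard library to show which part is actually requested; the variable names are illustrative and not part of the notebook.

```python
from urllib.parse import urldefrag

operator_url = (
    "https://gamepress.gg/arknights/tools/"
    "interactive-operator-list#tags=null##cn##stats"
)

# urldefrag splits a URL into the part that is requested and the fragment.
requested_url, fragment = urldefrag(operator_url)

print(requested_url)
print(fragment)
```

When fetching for real, it may also help to pass a `timeout` to `requests.get()` and call `raise_for_status()` on the response, so that a failed download is noticed before parsing.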

In [2]:
# Retrieving the URL of the website
OperatorURL = "https://gamepress.gg/arknights/tools/interactive-operator-list#tags=null##cn##stats"
BannerURL = "https://gamepress.gg/arknights/database/banner-list-gacha"

#Retrieving the page
OperatorPage = requests.get(OperatorURL)
BannerPage = requests.get(BannerURL)

# Parsing the page
OperatorTree = html.fromstring(OperatorPage.content) 
BannerTree = html.fromstring(BannerPage.content) 

---
### Grabbing the Operator Data:

##### Variables:
- **Operators:** Grabs the names of all the operators
- **HP:** Grabs the health of all the operators
- **ATK:** Grabs the attack of all the operators
- **DEF:** Grabs the defense of all the operators
- **COST:** Grabs the cost to deploy the operator
- **RES:** Grabs the resistance of all the operators
- **BLOCK:** Grabs the number of enemies an operator can block
- **REDEPLOY:** Grabs the amount needed to redeploy the operator
- **INTERVAL:** Grabs the seconds it takes for the operator's next attack to be processed
- **TARGET:** Grabs the number of enemies the operator can target
- **DAMAGE_TYPE:** Grabs the type of damage the operator deals
- **ROLE:** Grabs the role the operator belongs to

##### Why do we need this?
- We need these variables to set up our dictionary of lists, and because answering our proposal question requires all of this data so we can finally see whether **some characters are better than others and whether it's worth investing in them**

##### How do we grab the data?
- To grab the data we use **list comprehensions** and **XPath** to store all of the data from the HTML in lists, while also cleaning up the values by turning them into an integer or a float, or by stripping them to remove '\n' and other stray whitespace.
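
As a minimal sketch of this pattern, the example below runs XPath queries and list comprehensions against a small hand-written HTML snippet. The markup is an assumption that only mimics the class names used in this notebook, and `to_int_or_keep` is a hypothetical helper name.

```python
from lxml import html

# Hand-written sample markup (an assumption, not copied from the real page).
SAMPLE_HTML = """<div>
  <table>
    <tr class="trustStat"><td> 1755 </td><td>542</td></tr>
    <tr class="trustStat"><td>
      4655</td><td>916 </td></tr>
  </table>
  <div class="target-cell">Target: 1</div>
  <div class="target-cell">Target: AoE</div>
</div>"""

tree = html.fromstring(SAMPLE_HTML)

# XPath selects the text nodes; the comprehension strips and converts them.
hp = [int(t.strip()) for t in tree.xpath("//tr[@class='trustStat']/td[1]/text()")]
atk = [int(t.strip()) for t in tree.xpath("//tr[@class='trustStat']/td[2]/text()")]


def to_int_or_keep(value):
    """Return value as an int when it is numeric, otherwise leave it unchanged."""
    try:
        return int(value)
    except ValueError:
        return value


target = [
    to_int_or_keep(t.strip().replace("Target: ", ""))
    for t in tree.xpath("//div[@class='target-cell']/text()")
]
print(hp, atk, target)
```

Catching only `ValueError` keeps non-numeric targets such as `AoE` as strings without hiding unrelated errors.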

In [3]:
#Grabbing the Operators
Operators = OperatorTree.xpath("/.//td[@class = 'operator-cell']/div[@class = 'operator-title']/a/text()")

In [4]:
HP = [int(i.strip()) for i in OperatorTree.xpath("/.//tr[@class = 'trustStat']/td[position() = 1]/text()")]
ATK = [int(i.strip()) for i in OperatorTree.xpath("/.//tr[@class = 'trustStat']/td[position() = 2]/text()")]
DEF = [int(i.strip()) for i in OperatorTree.xpath("/.//tr[@class = 'trustStat']/td[position() = 3]/text()")]
COST = [int(i.strip()) for i in OperatorTree.xpath("/.//tr[@class = 'trustStat']/td[position() = 4]/text()")]
RES = [int(i.strip()) for i in OperatorTree.xpath("/.//div[@class = 'stats-table-cell']/table[position() = 2]/tbody/tr[position() =1]/td[position() = 1]/text()")]
BLOCK = [int(i.strip()) for i in OperatorTree.xpath("/.//div[@class = 'stats-table-cell']/table[position() = 2]/tbody/tr[position() = 1]/td[position() = 2]/text()")]
REDEPLOY = [int(i.strip()) for i in OperatorTree.xpath("/.//div[@class = 'stats-table-cell']/table[position() = 2]/tbody/tr[position() = 1]/td[position() = 3]/span[position() =1]/text()")]
INTERVAL = [float(i.strip().replace('s', '')) for i in OperatorTree.xpath("/.//div[@class = 'stats-table-cell']/table[position() = 2]/tbody/tr[position() = 1]/td[position() = 4]/text()")]
TARGET = [i.strip().replace('Target: ', '').replace(' (Block #)', '') for i in OperatorTree.xpath("/.//div[@class = 'target-cell']/text()")]
DAMAGE_TYPE = [i.strip() for i in OperatorTree.xpath("/.//div[@class = 'target-damage-type stats-section tab-section']/div/a/text()")]
ROLE = [i.strip() for i in OperatorTree.xpath("/.//div[@class = 'info-div']/span[position() = 1]/text()")]

for i in range(len(TARGET)): # Converts the value to an int when it is numeric; other values (e.g. 'AoE') stay strings
    try:
        TARGET[i] = int(TARGET[i])
    except ValueError:
        continue

---
### Creating the Dictionary:

When we create the dictionary you might be wondering **"why aren't we using the headers within the table as the keys of the dictionary?"** The reason is that, in relation to our proposal question, we are mainly seeking the stats of the operators so we can eventually compare them with other operators and see whether they are better or worse.

In [5]:
OperatorDoL = {'Operators': Operators, 'HP': HP, 'ATK': ATK, 'DEF': DEF, 'COST': COST, 'RES': RES, 'BLOCK': BLOCK
      , 'REDEPLOY': REDEPLOY, 'INTERVAL': INTERVAL, 'TARGET': TARGET, 'DMG': DAMAGE_TYPE, 'ROLE': ROLE}

---
### Assembling the Dataframe:

Now that we have our data in a dictionary, we can convert it into a pandas DataFrame. By doing this we can further visualize our dataset and eventually move it into SQL
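
pandas raises a `ValueError` when the lists in a dictionary differ in length, which can happen if one XPath query misses a cell. A small check like the following sketch can report the lengths first; `check_equal_lengths` is a hypothetical helper, and the sample uses only two rows for illustration.

```python
def check_equal_lengths(dict_of_lists):
    """Raise ValueError listing each column's length if the lists differ in length."""
    lengths = {key: len(values) for key, values in dict_of_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Small illustrative sample using the first two rows shown in this notebook.
sample = {"Operators": ["Vigil", "Penance"], "HP": [1755, 4655], "ATK": [542, 916]}
print(check_equal_lengths(sample))
```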

In [6]:
OperatorDF = pd.DataFrame.from_dict(OperatorDoL)
OperatorDF

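| | Operators | HP | ATK | DEF | COST | RES | BLOCK | REDEPLOY | INTERVAL | TARGET | DMG | ROLE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Vigil | 1755 | 542 | 154 | 17 | 0 | 1 | 70 | 1.00 | 1 | Physical | Vanguard |
| 1 | Penance | 4655 | 916 | 616 | 36 | 10 | 3 | 70 | 1.60 | 1 | Physical | Defender |
| 2 | Texas the Omertosa | 1598 | 659 | 320 | 10 | 0 | 1 | 18 | 0.93 | 1 | Physical | Specialist |
| 3 | Stainless | 2723 | 633 | 461 | 19 | 0 | 2 | 70 | 1.50 | 1 | Physical | Supporter |
| 4 | Młynar | 4266 | 385 | 502 | 12 | 15 | 3 | 70 | 1.20 | 1 | Physical | Guard |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 264 | Yato | 1030 | 262 | 192 | 7 | 0 | 2 | 70 | 1.05 | 1 | Physical | Vanguard |
| 265 | 'Justice Knight' | 595 | 217 | 41 | 3 | 0 | 1 | 200 | 1.00 | 1 | Physical | Sniper |
| 266 | THRM-EX | 1443 | 350 | 443 | 3 | 50 | 0 | 200 | 0.93 | 1 | Physical | Specialist |
| 267 | Castle-3 | 1391 | 413 | 90 | 3 | 0 | 1 | 200 | 1.50 | 1 | Physical | Guard |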


---
### Tidying the Dataframe:
After inspecting **OperatorDF** we can clearly see that this DataFrame is not tidy at all: there is no meaningful index, and some of the columns are unclear. To fix this we can do a simple pivot and tidy up our data so that the index leads us to the data we want for a specific character


- **Independent Variables:** Role, Operators
- **Dependent Variables:** ATK, BLOCK, COST, DEF, DMG, HP, INTERVAL, REDEPLOY, RES, TARGET

The independent variables are **Role and Operators** because within the game Arknights each operator has a distinct name and is assigned to a role, e.g. Caster or Vanguard. With these two keys we can access the dependent variables, which give us the operator's stats within the game
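
A pivot is one way to build this index. Keep in mind that a `pivot_table` call whose `aggfunc` joins values into a string produces string columns. As an alternative sketch (not the approach this notebook takes), `set_index` moves the two key columns into the index without aggregating, so numeric columns stay numeric. The sample rows come from the OperatorDF output, trimmed to a few columns; the variable names are illustrative.

```python
import pandas as pd

# Three rows copied from the OperatorDF output, trimmed to a few columns.
sample_df = pd.DataFrame(
    {
        "Operators": ["Vigil", "Penance", "Yato"],
        "HP": [1755, 4655, 1030],
        "ATK": [542, 916, 262],
        "ROLE": ["Vanguard", "Defender", "Vanguard"],
    }
)

# set_index moves ROLE and Operators into a MultiIndex without aggregating,
# so the numeric columns keep their integer dtype.
sample_tidy = sample_df.set_index(["ROLE", "Operators"]).sort_index()

print(sample_tidy)
print(sample_tidy.loc[("Vanguard", "Vigil"), "HP"])
```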

In [7]:
OperatorDFTidy = OperatorDF.pivot_table(index=['ROLE', 'Operators'],aggfunc=lambda x: ''.join(str(v) for v in x))
OperatorDFTidy

| ROLE | Operators | ATK | BLOCK | COST | DEF | DMG | HP | INTERVAL | REDEPLOY | RES | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Caster | 12F | 482 | 1 | 24 | 50 | Arts | 1461 | 2.9 | 70 | 10 | AoE |
| Caster | Absinthe | 703 | 1 | 22 | 124 | Arts | 1420 | 1.6 | 80 | 20 | 1 |
| Caster | Amiya | 682 | 1 | 20 | 121 | Arts | 1680 | 1.6 | 70 | 20 | 1 |
| Caster | Astgenne | 705 | 1 | 34 | 122 | Arts | 1440 | 2.3 | 80 | 20 | 1 |
| Caster | Beeswax | 805 | 1 | 23 | 225 | Arts | 2005 | 2.0 | 70 | 15 | AoE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Vanguard | Vigil | 542 | 1 | 17 | 154 | Physical | 1755 | 1.0 | 70 | 0 | 1 |
| Vanguard | Vigna | 618 | 1 | 11 | 351 | Physical | 1845 | 1.0 | 70 | 0 | 1 |
| Vanguard | Wild Mane | 628 | 1 | 14 | 372 | Physical | 2225 | 1.0 | 80 | 0 | 1 |
| Vanguard | Yato | 262 | 2 | 7 | 192 | Physical | 1030 | 1.05 | 70 | 0 | 1 |


---
### Getting the Banner Data:

##### Variables:
- **BannerHeaders:** Grabs the header from the table within the HTML of the website using xpath
- **BannerName:** Grabs the name of the banners within the HTML of the website using xpath
- **CNDate:** Grabs the dates when each banner was released in China from the HTML using xpath
- **ENDate:** Grabs the dates when each banner was released globally from the HTML using xpath
    - **NOTE:** Only works if there are values within the table. Currently the table doesn't contain global dates for upcoming banners, so the code is commented out
- **FinalOperators:** Stores all of the characters from the list of lists, converted into a list of strings.
    - **NOTE:** This variable is related to **FeaturedOperators**, which contains the list of lists of all the character names gathered by looping over the xpath results.
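
The conversion from a list of lists to a list of strings can be sketched on its own. In lxml an element's `.text` is `None` when the element has no text, and `', '.join()` raises a `TypeError` on `None`, so this sketch skips missing entries. The sample values and lowercase variable names are illustrative assumptions.

```python
# Illustrative nested list; None stands in for an element without text.
featured_operators = [
    ["Franka", "Proviso", "Młynar"],
    ["Cliffheart", None, " Totter "],
]

# One comma-separated string per banner, skipping missing entries.
final_operators = [
    ", ".join(name.strip() for name in banner if name)
    for banner in featured_operators
]
print(final_operators)
```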
    
##### Why do we need this?
- We need all of these variables to help us answer our proposal question **Are some characters better than others based on their stats, and how much would the player have to invest in the game to collect that character?** To determine how much the player has to invest in the game to obtain a specific operator, we have to obtain the dates and work out the number of days or the in-game currency they have to spend to obtain this character.

##### How do we grab the data?
- We grab this data by obtaining the URL and parsing it. For the headers we can use a single xpath query to retrieve the data. For everything else, we use list comprehensions and xpath to store the data in lists while also cleaning it if necessary.
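
Once the dates are stored as a range string such as `2022-11-01 - 2022-11-15`, turning them into a number of days is a short step. The following is a minimal sketch using only the standard library; `parse_banner_range` is a hypothetical helper and assumes the `start - end` format with ISO dates.

```python
from datetime import date


def parse_banner_range(text):
    """Split 'YYYY-MM-DD - YYYY-MM-DD' into (start, end) date objects."""
    start_text, end_text = text.split(" - ")
    return date.fromisoformat(start_text), date.fromisoformat(end_text)


start, end = parse_banner_range("2022-11-01 - 2022-11-15")
banner_days = (end - start).days
print(start, end, banner_days)
```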

In [8]:
# Grabbing the headers
BannerHeaders = BannerTree.xpath("//table[@class = 'views-table views-view-table cols-4']/thead/tr/th/text()")
print(BannerHeaders)

['Event Banner', 'Banner Date (CN)', 'Banner Date (NA)', 'Featured Characters']


In [9]:
BannerName = [i.strip() for i in BannerTree.xpath("/.//td[@headers = 'view-field-event-banner-table-column']/a/div/text()") if i != '\n' ]
CNDate = [i for i in BannerTree.xpath("/.//td[@headers = 'view-field-cn-end-date-table-column']/time/text()")]
CNDate = [' - '.join(CNDate[i : i+2]) for i in range(0, len(CNDate), +2)]

# ENDate won't work yet because there are no values under the NA date column
# It will work once dates are added
#ENDate = [i for i in BannerTree.xpath("/.//td[@headers = 'view-field-start-time-table-column']/time/text()")]
#ENDate = [' - '.join(ENDate[i : i+2]) for i in range(0, len(ENDate), 2)]

PlaceHolder = []
FeaturedOperators = []
FinalOperators = []
for child in BannerTree.xpath("/.//td[@headers = 'view-field-featured-characters-table-column']"):
    for descendants in child:
        PlaceHolder.append(descendants.text)
    FeaturedOperators.append(PlaceHolder)
    PlaceHolder = []

for i in FeaturedOperators: #Converts LoL to a List that contains strings
    pH = ', '.join(i)
    FinalOperators.append(pH)
print(FinalOperators)

["Lappland, Liskarm, Provence, FEater, Exusiai, Skadi, Tsukinogi, Mountain, Akafuyu, Kal'tsit", 'Blue Poison, Hibiscus the Purifier, Ebenholz', 'Ptilopsis, Greyy the Lightningbearer, Dorothy', 'Gavial the Invincible, Pozyomka (Позёмка), Cantabile', 'Skyfire, Magallan, Bagpipe, Shamare, Mr. Nothing, Toddifons, Passenger, Carnelian, La Pluma, Mulberry', 'Franka, Proviso, Młynar', 'Jackie, Mudrock, Whisperain, Roberta, Mulberry, Saileach, Chestnut, Rockrock, Horn', 'Cliffheart, Totter, Paprika, Stainless', 'Texas the Omertosa, Penance, Lunacub']


---
### Creating the Dictionary & Dataframe:

When creating this dictionary we can refer to the headers we grabbed earlier and use them as our keys. However, if you look at our dictionary you might wonder where **Banner Date (NA)** is. Since the website has not entered any values for that column, we simply leave it out. So for the keys we use the headers, and for the values we use BannerName, CNDate, and FinalOperators, as those variables contain lists of the data we are looking for
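
The same idea can be sketched without hard-coding the header positions: pair each header with its column and keep only the columns that contain values. This is an illustrative alternative, not the notebook's own code, and the two sample rows are a trimmed assumption.

```python
headers = ["Event Banner", "Banner Date (CN)", "Banner Date (NA)", "Featured Characters"]

# Illustrative values; the empty list mimics the NA column having no dates yet.
columns = [
    ["Never Vowed", "Bearing and Sparks"],
    ["2022-09-08 - 2022-09-22", "2022-10-11 - 2022-10-25"],
    [],
    ["Franka, Proviso, Młynar", "Cliffheart, Totter, Paprika, Stainless"],
]

# Pair each header with its column and keep only the non-empty columns.
banner_dict = {header: values for header, values in zip(headers, columns) if values}
print(list(banner_dict))
```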

In [10]:
bannerDict = {BannerHeaders[0]: BannerName, BannerHeaders[1]: CNDate,BannerHeaders[3]: FinalOperators}

bannerDF = pd.DataFrame.from_dict(bannerDict)

bannerDF

| | Event Banner | Banner Date (CN) | Featured Characters |
|---|---|---|---|
| 0 | Joint Operation 6 | 2022-05-19 - 2022-06-02 | Lappland, Liskarm, Provence, FEater, Exusiai, ... |
| 1 | Dissonanzen | 2022-06-09 - 2022-06-23 | Blue Poison, Hibiscus the Purifier, Ebenholz |
| 2 | Pathfinder of Sands | 2022-07-05 - 2022-07-19 | Ptilopsis, Greyy the Lightningbearer, Dorothy |
| 3 | Great Axe and Pen Nib - [Summer] Series Limite... | 2022-08-11 - 2022-08-25 | Gavial the Invincible, Pozyomka (Позёмка), Can... |
| 4 | Joint Operation 7 | 2022-08-25 - 2022-09-08 | Skyfire, Magallan, Bagpipe, Shamare, Mr. Nothi... |
| 5 | Never Vowed | 2022-09-08 - 2022-09-22 | Franka, Proviso, Młynar |
| 6 | The Front That Was | 2022-09-27 - 2022-10-11 | Jackie, Mudrock, Whisperain, Roberta, Mulberry... |
| 7 | Bearing and Sparks | 2022-10-11 - 2022-10-25 | Cliffheart, Totter, Paprika, Stainless |
| 8 | Chop the Thorns: Open Circuits - Celebration S... | 2022-11-01 - 2022-11-15 | Texas the Omertosa, Penance, Lunacub |


---
### Tidying the Dataframe:
After inspecting **bannerDF** we can see that this DataFrame is not tidy either: its index is just a row number. To fix this we set **Event Banner** as the index and sort the banners by date, newest first, so that each banner name leads us to its dates and featured characters


- **Independent Variable:** Event Banner
- **Dependent Variables:** Banner Date (CN), Featured Characters

The independent variable is **Event Banner**, because through it we can access all the data of that banner: the dates it ran and the featured operators that came with it.
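
Because the dates are stored as ISO-formatted strings, they sort chronologically even as plain text. As a hedged sketch, a single descending sort gives the newest-first order; the three sample rows are taken from the banner table, and the variable names are illustrative.

```python
import pandas as pd

# Three rows taken from the banner table in this notebook.
sample_banners = pd.DataFrame(
    {
        "Event Banner": ["Dissonanzen", "Never Vowed", "Joint Operation 6"],
        "Banner Date (CN)": [
            "2022-06-09 - 2022-06-23",
            "2022-09-08 - 2022-09-22",
            "2022-05-19 - 2022-06-02",
        ],
    }
)

# ISO dates sort chronologically as plain strings, so one descending sort suffices.
newest_first = sample_banners.set_index("Event Banner").sort_values(
    by="Banner Date (CN)", ascending=False
)
print(newest_first)
```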

In [11]:
bannerDFTidy = bannerDF.copy()
bannerDFTidy = bannerDFTidy.set_index(['Event Banner'])
pd.set_option('display.max_colwidth', None)
bannerDFTidy = bannerDFTidy.sort_values(by='Banner Date (CN)')
bannerDFTidy = bannerDFTidy.reindex(index=bannerDFTidy.index[::-1])
bannerDFTidy

| Event Banner | Banner Date (CN) | Featured Characters |
|---|---|---|
| Chop the Thorns: Open Circuits - Celebration Series Limited Headhunting | 2022-11-01 - 2022-11-15 | Texas the Omertosa, Penance, Lunacub |
| Bearing and Sparks | 2022-10-11 - 2022-10-25 | Cliffheart, Totter, Paprika, Stainless |
| The Front That Was | 2022-09-27 - 2022-10-11 | Jackie, Mudrock, Whisperain, Roberta, Mulberry, Saileach, Chestnut, Rockrock, Horn |
| Never Vowed | 2022-09-08 - 2022-09-22 | Franka, Proviso, Młynar |
| Joint Operation 7 | 2022-08-25 - 2022-09-08 | Skyfire, Magallan, Bagpipe, Shamare, Mr. Nothing, Toddifons, Passenger, Carnelian, La Pluma, Mulberry |
| Great Axe and Pen Nib - [Summer] Series Limited Headhunting | 2022-08-11 - 2022-08-25 | Gavial the Invincible, Pozyomka (Позёмка), Cantabile |
| Pathfinder of Sands | 2022-07-05 - 2022-07-19 | Ptilopsis, Greyy the Lightningbearer, Dorothy |
| Dissonanzen | 2022-06-09 - 2022-06-23 | Blue Poison, Hibiscus the Purifier, Ebenholz |
| Joint Operation 6 | 2022-05-19 - 2022-06-02 | Lappland, Liskarm, Provence, FEater, Exusiai, Skadi, Tsukinogi, Mountain, Akafuyu, Kal'tsit |


---
### Getting the Popular Characters Data:

##### Variables:
- **PopularChar:** Grabs the popular characters from the HTML and stores it into a List
    
##### Why do we need this?
- We are grabbing this data because a user may want to look at the popular characters of the day and make it their mission to unlock one of these operators. Relating this back to our proposal question, **Are some characters better than others based on their stats, and how much would the player have to invest in the game to collect that character?**, this data may help the user make a better judgment about whom to invest in

##### How do we grab the data?
- We grab this data by using xpath and list comprehensions so we can store all of the data in a list. We also use .strip() within the list comprehension to remove any stray leading or trailing whitespace that would hinder our data

In [12]:
PopularChar = [i.strip() for i in BannerTree.xpath("/.//div[@class = 'popular-items-block popular-block']/ul/li/a/span[@class = 'pages-ranking-title']/text()")]
PopularChar[0], PopularChar[1] = PopularChar[1], PopularChar[0] # Swaps the first two entries
print(PopularChar)

['Irene', 'Specter the Unchained', 'Lumen', 'Texas the Omertosa', 'Skadi the Corrupting Heart', 'Surtr', 'Mudrock', 'Gladiia', 'Specter', 'Skadi']


---
### Assembling the Dataframe:

Since this DataFrame isn't as complex as the ones we made before, all we really need to do is call **pd.DataFrame** with our list and shift the index up by 1 so the ranking of the top 10 operators of the day starts at 1. After that we can simply display the data in our DataFrame. This DataFrame doesn't need any tidying because all we're representing is one column.

In [13]:
PopularOperatorDF = pd.DataFrame(PopularChar, columns=['Popular Operators Today'])
PopularOperatorDF.index += 1 
PopularOperatorDF

| | Popular Operators Today |
|---|---|
| 1 | Irene |
| 2 | Specter the Unchained |
| 3 | Lumen |
| 4 | Texas the Omertosa |
| 5 | Skadi the Corrupting Heart |
| 6 | Surtr |
| 7 | Mudrock |
| 8 | Gladiia |
| 9 | Specter |
| 10 | Skadi |


### My intention with all of this data

With all of this data I intend first to answer the question: **Are some characters better than others based on their stats, and how much would the player have to invest in the game to acquire that character?** I also intend to create functions that lay the foundation needed to make sure the proposal question is answered. Representing this data within a SQL database will also make it easier to see whether the datasets relate to one another.
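
As a rough sketch of that SQL step, the example below loads a few operator rows into an in-memory SQLite database using the standard library and runs one comparison query. The table name, column names, and schema are assumptions for illustration only; the three rows are copied from the operator table earlier in this notebook.

```python
import sqlite3

# Rows copied from the OperatorDF output; the table and column names are assumptions.
rows = [
    ("Vigil", "Vanguard", 1755, 542),
    ("Penance", "Defender", 4655, 916),
    ("Yato", "Vanguard", 1030, 262),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute(
    "CREATE TABLE operators (name TEXT PRIMARY KEY, role TEXT, hp INTEGER, atk INTEGER)"
)
conn.executemany("INSERT INTO operators VALUES (?, ?, ?, ?)", rows)

# Which operator in the sample has the highest attack?
strongest = conn.execute(
    "SELECT name, atk FROM operators ORDER BY atk DESC LIMIT 1"
).fetchone()
print(strongest)
conn.close()
```

pandas also provides `DataFrame.to_sql`, which could write a DataFrame to such a connection directly instead of inserting rows by hand.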