# DS 3000 HW 5 

Due: Sunday July 20th @ 11:59 PM EST

### Submission Instructions
Submit this `ipynb` file and a `PDF` file that includes the code results to Gradescope (this can also be done via the assignment on Canvas). To ensure that your submitted files represent your latest code, run a fresh `Kernel > Restart & Run All` just before uploading the files to Gradescope.

**Note that this is a group assignment. Each group only needs to submit one copy; when you submit the work, please include everyone in your group.**

### Tips for success
- Start early
- Make use of Piazza
- Make use of office hours
- Remember to use cells and headings to make the notebook easy to read (if a grader cannot find the answer to a problem, you will receive no points for it)
- Under no circumstances may one student view or share their ungraded homework or quiz with another student [(see also)](http://www.northeastern.edu/osccr/academic-integrity), though you are welcome to **talk about** (not show each other) the problems.

## Project proposal

For this course, we aim to complete a data analysis project about the game [Palworld](https://en.wikipedia.org/wiki/Palworld). To help you start the project, here are a couple of things you need to consider and work on to get clean data for later analysis.

To start the project, please take some time to get familiar with the game. You don't need to play it, but please at least learn the basic terminology, such as what a Pal is. (And if you do play it, please do not spend too much time on it.)

The two recommended databases are [https://palworld.gg/](https://palworld.gg/) and [https://paldb.cc/en/](https://paldb.cc/en/). You can use either, both, or some other Palworld database.

### Part 1.1 (10 points)

Please list 2-3 questions you may be interested in studying with the Palworld database. They can be about anything in the game, such as the Pals, items, or constructions. Some potential question structures are:
- Are `A` and `B` related? How are they related?
- Which features may affect changes in `C`?
- If I need a higher `D`, which features may have a lower/higher value?
- Based on `E` and `F`, which items/pals are similar?
- I need to predict the value of `G`; which features do I need to consider?

#### Question 1
Are a Pal's stamina and running speed related? If so, how?

#### Question 2
Which features affect a Pal's rank in the tier list? (Pals would be grouped by rank.)

#### Question 3
Which features should be considered when predicting the price of a Pal? (Pals would be divided into price ranges.)
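
Questions 2 and 3 both rely on dividing Pals into groups. A minimal sketch of binning prices into ranges with `pandas.cut` follows; the `Price` column, the bin edges, and the labels are assumptions chosen for illustration, and the prices are placeholder values rather than scraped data.

```python
import pandas as pd

# Placeholder prices for illustration only; these are NOT real Palworld values
df_prices = pd.DataFrame({'Name': ['Pal A', 'Pal B', 'Pal C', 'Pal D'],
                          'Price': [500, 1000, 3000, 9000]})

# Hypothetical bin edges: (0, 1000], (1000, 5000], (5000, inf)
bins = [0, 1000, 5000, float('inf')]
labels = ['low', 'mid', 'high']

# pd.cut assigns each price to the bin whose right edge includes it
df_prices['Price Range'] = pd.cut(df_prices['Price'], bins=bins, labels=labels)
```

The resulting `Price Range` column is categorical, so it could serve as the grouping variable for a plot or as a prediction target.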

### Part 1.2 (20 points)

Based on the questions we proposed in Part 1.1, what features may we need to include in the analysis? Check the websites: which website has that information? **You need to pick at least 8 features for analysis.** We recommend a mix of numerical (numbers etc.) and categorical (level etc.) features. Are there any other features that you think may be important but are hard to extract or find on the website (they can be something in or not in the game)?

#### Features 1 and 2 (for Question 1)
`Stamina` and `Running Speed`

#### Features 3 - 5 (for Question 2)
`HP`, `Shot Attack`, and `Crafting Speed`

#### Features 6 - 8 (for Question 3)
`Rarity`, `Type`, `Work Suitability`

### Part 1.3 (20 points)

Suppose you do have all the features you mentioned in Part 1.2. List 3-4 data visualizations you can make with those features. You do not need to make those visualizations here. Just describe the type of each visualization (histogram, scatter plot, etc.), which features are involved, whether any hover data or color will be added, and **discuss how these data visualizations may relate to (or even answer) your questions in Part 1.1**.

1. Scatterplot: `Stamina` vs. `Running Speed`
- **Features Involved:** `Stamina` (x-axis), `Running Speed` (y-axis)
- **Color:** Use colors to indicate the `Rarity`
- **Purpose:** Addresses Question 1 by showing whether stamina and speed are correlated; a clear upward or downward trend would suggest how Pals balance endurance and mobility.

2. Histogram: Distribution of `Crafting Speed` by Tier
- **Features Involved:** `Crafting Speed`, grouped by `Tier Rank`
- **Approach:** Plot multiple histograms overlaid or side-by-side for all tiers (S, A, B, C, D)
- **Color:** Use different colors for each tier group
- **Purpose:** Supports Question 2 by showing how crafting speed is distributed across ranks. Higher-tier Pals may cluster toward higher values.

3. Box Plot: `HP`, `Shot Attack`, and `Crafting Speed`
- **Features Involved:** `HP`, `Shot Attack`, `Crafting Speed`
- **Approach:** Side-by-side box plots to compare distributions of key stats
- **X-axis:** Feature names (3 categories)
- **Y-axis:** Value of each stat
- **Purpose:** This will also help answer Question 2 by giving a clean summary of the central tendency and spread of the key stats that may affect rank.

4. Bar Chart: Average Pal Price by `Work Suitability`
- **Features Involved:** `Work Suitability` (x-axis), `Average Price` (y-axis)
- **Approach:** Grouped bar chart, with one group of bars per Work Suitability type (e.g., Mining, Farming)
- **Color:** Bars will be colored by `Rarity`
- **Purpose:** Helps answer Question 3 by visualizing how price differs by utility role. If some work roles consistently show higher prices, those features may prove important for price prediction.
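
The numbers behind visualizations 1 and 4 can be computed before any plotting. The sketch below assumes the scraped data ends up in a DataFrame with `Stamina`, `Running Speed`, `Work Suitability`, and `Price` columns; it computes the Pearson correlation and the average price per group. All values are placeholders, not real Palworld stats.

```python
import pandas as pd

# Placeholder rows for illustration only; these are NOT real Palworld values
df_demo = pd.DataFrame({
    'Stamina': [100, 120, 150, 200],
    'Running Speed': [300, 350, 420, 600],
    'Work Suitability': ['Mining', 'Farming', 'Mining', 'Farming'],
    'Price': [1000, 2000, 3000, 4000],
})

# Pearson correlation between the two numeric stats (the trend in the scatter plot)
r = df_demo['Stamina'].corr(df_demo['Running Speed'])

# Mean price per work suitability (the bar heights in the bar chart)
avg_price = df_demo.groupby('Work Suitability')['Price'].mean()
```

A value of `r` near 0 would suggest no linear relationship, while values near 1 or -1 would suggest a strong one.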

### Part 1.4 (50 points)

Now, go ahead and try to scrape the features you need. 

Please show all the code you have for web scraping. Your current output data frame should include at least 4 features. (You do not need to scrape all features at this moment, although it is recommended to start early. Also, you can choose not to use the ones you have scraped in the later analysis. No need to worry if you need to change anything later.) **Please design your code as a pipeline and clearly document each function.** See the Python Style Guide in Week 1 for proper documentation. It is also recommended to save the data you have scraped.
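
One way to read the pipeline requirement is as a chain of small documented functions, each consuming the previous one's output. The sketch below is an assumption about that structure, not the required solution; `run_pipeline` and the stub stages are hypothetical names, and the stubs stand in for real fetch and parse steps so the example runs offline.

```python
def run_pipeline(url, fetch, parse, save):
    """
    Runs a scraping pipeline: fetch -> parse -> save

    Args:
        url (str): page to scrape
        fetch (callable): takes a URL, returns the page content as a string
        parse (callable): takes the page content, returns a list of row dictionaries
        save (callable): takes the rows and stores them somewhere

    Returns:
        rows (list): the parsed rows, so they can be inspected afterwards
    """
    html_str = fetch(url)
    rows = parse(html_str)
    save(rows)
    return rows


# Stub stages (hypothetical) used in place of real network and parsing code
def fake_fetch(url):
    return 'Anubis,Epic\nArsox,Common'


def fake_parse(text):
    return [dict(zip(('Name', 'Rarity'), line.split(','))) for line in text.splitlines()]


saved = []
rows = run_pipeline('https://palworld.gg/pals', fake_fetch, fake_parse, saved.extend)
```

Passing the stages in as arguments keeps each one testable on its own; in the notebook, the real fetch and parse functions would take the place of the stubs.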

In [15]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import requests
from bs4 import BeautifulSoup
import pandas as pd

In [120]:
url = 'https://palworld.gg/pals'

def get_html(url):
    """ 
    Retrieves the HTML content of a given URL
    
    Args:
        url (str): URL of the PalWorld database page
        
    Returns:
        html_str (str): Raw HTML content of the page as a long string
    """
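    # A timeout keeps the request from hanging; raise_for_status surfaces HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    html_str = response.text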

    return html_str

In [175]:
html = get_html(url)

def build_pal_df(html):
    """ 
    Extracts Pal data from the HTML content via scraping to return a DataFrame with the selected features
    
    Args:
        html (str): Raw HTML content of the PalWorld Pals page
        
    Returns:
        df_pals (pd.DataFrame): dataframe with one row per Pal and the columns Name and Rarity
    """
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(html, 'html.parser')

    unfiltered_list = []
    
    # Elements with class 'name' are assumed to alternate: Pal name, then rarity
    for pal in soup.find_all(class_='name'):
        # Keep only the text before any '#'
        unfiltered_list.append(pal.text.split('#')[0])

    # Even positions hold names, odd positions hold rarities
    name_list = unfiltered_list[::2]
    rarity_list = unfiltered_list[1::2]
    
    df_pals = pd.DataFrame({'Name' : name_list,
                            'Rarity' : rarity_list})
    
    return df_pals
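
The slicing step in `build_pal_df` relies on the scraped strings alternating between a name and a rarity. A small sketch of that step on a hand-written list shows what each slice selects; the `#001`-style suffixes are hypothetical stand-ins for whatever follows `#` on the page, and the length check is an added safeguard, not part of the original function.

```python
# Hand-written stand-in for the text of the elements with class 'name';
# the '#...' suffixes are hypothetical
scraped = ['Anubis#001', 'Epic', 'Arsox#002', 'Common']

# Keep only the text before any '#'
unfiltered_list = [text.split('#')[0] for text in scraped]

# Even positions hold names, odd positions hold rarities
name_list = unfiltered_list[::2]
rarity_list = unfiltered_list[1::2]

# Safeguard: mismatched lengths would mean the alternating assumption is broken
if len(name_list) != len(rarity_list):
    raise ValueError('names and rarities do not pair up')

pairs = list(zip(name_list, rarity_list))
```

If the page layout changes so that the elements no longer alternate, the check fails loudly instead of silently pairing names with the wrong rarities.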

In [176]:
build_pal_df(html).head()

|   | Name | Rarity |
|---|---|---|
| 0 | Anubis | Epic |
| 1 | Arsox | Common |
| 2 | Astegon | Epic |
| 3 | Azurmane | Rare |
| 4 | Azurobe | Rare |
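
Since the assignment recommends saving the scraped data, here is a minimal sketch of a CSV round trip with pandas. It writes to an in-memory buffer so it runs without touching the disk; in practice a file path (for example a hypothetical `pals.csv`) would take the buffer's place. Passing `index=False` keeps the row index from being written as an extra unnamed column.

```python
import io

import pandas as pd

# First rows of the scraped output shown above
df_pals = pd.DataFrame({'Name': ['Anubis', 'Arsox', 'Astegon'],
                        'Rarity': ['Epic', 'Common', 'Epic']})

# Write the data as CSV text without the row index
buffer = io.StringIO()
df_pals.to_csv(buffer, index=False)

# Rewind the buffer and read the data back
buffer.seek(0)
df_loaded = pd.read_csv(buffer)
```

Reloading from the saved file avoids scraping the site again every time the analysis is rerun.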
