___
# Step 2 - Prepare Phase
**Author: Alexandru Nitulescu**
____

### Table of Contents
* [Introduction](#section-one)
* [Setup](#section-two)
    - [Installation of packages](#subsection-one)
    - [Importing packages](#subsection-two)
* [Webscrape data](#section-three)
* [Create the Dataframe and save it](#section-four)

<a id="section-one"></a>
### Introduction
In the preparation phase of this project, we use Python and several packages to scrape data from [www.nba.com](https://www.nba.com/). Before collecting and processing the data, we need to determine what information the project requires and how it should be organized. Asking key questions, such as what data to retrieve, what metrics to display on future dashboards, and how the data will be visualized, lets us define the scope and requirements of the project. Identifying any prior knowledge needed and deciding how to proceed also helps ensure that we collect and analyze the data the solution actually needs. This phase lays the foundation for the subsequent data processing and database management phases.

<a id="section-two"></a>
### Setup
Before we begin, we need to make sure that all the required libraries are installed. In this section, we install and import the libraries needed for the web scraping task:

* **BeautifulSoup (bs4)** - a library used for web scraping and parsing HTML and XML documents.
* **Pandas** - a library used for data manipulation, analysis, and cleaning.
* **Selenium** - a web testing framework that allows automated browser actions. Its `webdriver` module controls the browser, and the `By` class specifies how elements on a web page are located (for example, by XPath).

<a id="subsection-one"></a>
#### Installation of packages
To install all the packages listed in **requirements.txt**, use the following command in your terminal:

`pip install -r requirements.txt`

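
After installing, it can be useful to confirm that the three libraries are actually available in the active environment. The snippet below is a minimal sketch using only the standard library. The helper name `installed_version` is hypothetical, and the distribution names are the ones published on PyPI, which are assumed to match the entries in **requirements.txt**.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# PyPI distribution names (assumed to match the entries in requirements.txt)
for name in ("beautifulsoup4", "pandas", "selenium"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```
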
<a id="subsection-two"></a>
#### Importing packages

In [2]:
# Import necessary packages
import pandas as pd 
from bs4 import BeautifulSoup 
from selenium.webdriver.common.by import By 
from selenium import webdriver 

<a id="section-three"></a>
### Webscrape data

In [3]:
# Create a new web browser instance of Chrome
driver = webdriver.Chrome()

In [4]:
# The url we want to visit
url = 'https://www.nba.com/stats/teams/boxscores-traditional'

In [5]:
# Open the webpage in our instance
driver.get(url=url)

In [6]:
# Locate the "all" option in the dropdown menu and click it
selection = driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[3]/section[2]/div/div[2]/div[2]/div[1]/div[3]/div/label/div/select/option[1]')
selection.click()

In [7]:
# Get the page source and store it in variable src
src = driver.page_source

We have now written code that automates visiting the NBA website and extracting data from it. The script uses Selenium to control a browser instance and navigate to the desired page. Once the page has loaded, the `find_element` method locates the "all" option in the dropdown menu, and the option is then clicked. Finally, the page source is stored in the `src` variable for further processing.

In [8]:
# Parse the HTML source code
parser = BeautifulSoup(src, "html.parser")

In [9]:
# Find the div element containing the table
table = parser.find("div", attrs={
    "class": "Crom_container__C45Ti crom-container"
})

We then parse the HTML source stored in the `src` variable with BeautifulSoup. The next cell searches for the `div` element with the class `Crom_container__C45Ti crom-container`, which contains the table we are interested in, and stores it in the `table` variable. Such a specific match may not be strictly necessary to find the table, but being explicit reduces the risk of selecting the wrong element.
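
The suffix in `Crom_container__C45Ti` looks auto-generated, so it may change when the site is updated (this is an assumption, not something the page documents). The sketch below, run on a small hand-written HTML sample rather than the real page, compares the exact match used above with two looser lookups that rely only on the `crom-container` class.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the container markup (not the real page)
sample_html = """
<div class="Crom_container__C45Ti crom-container">
  <table><tr><th>TEAM</th><th>MATCH UP</th></tr></table>
</div>
"""
parser = BeautifulSoup(sample_html, "html.parser")

# Exact match on the full class string, as in the notebook
exact = parser.find("div", attrs={"class": "Crom_container__C45Ti crom-container"})

# Match on a single class, independent of the generated-looking suffix
loose = parser.find("div", class_="crom-container")

# The same lookup written as a CSS selector
css = parser.select_one("div.crom-container")

print(exact is not None, loose is not None, css is not None)
```

All three lookups find the container in this sample. Which one is the safest on the live page depends on how stable each class name turns out to be.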

In [20]:
# Create an empty list for rows
rows = []

# Find all th elements within the table
headers = table.find_all('th')
print(headers)
# Extract the text of each header, skipping the first one (TEAM)
header_list = [header.text.strip() for header in headers[1:]]

[<th class="Crom_text__NpR1_ Crom_sticky__uYvkp" field="TEAM_ABBREVIATION" sort="true">TEAM</th>, <th class="Crom_text__NpR1_" field="MATCHUP" sort="true">MATCH UP</th>, <th class="Crom_text__NpR1_" dir="D" field="GDATE" sort="true">GAME DATE</th>, <th dir="D" field="WL" sort="true" title="Win/Loss">W/L</th>, <th dir="D" field="MIN" sort="true" title="Minutes Played">MIN</th>, <th dir="D" field="PTS" sort="true" title="Points">PTS</th>, <th dir="D" field="FGM" sort="true" title="Field Goals Made">FGM</th>, <th dir="D" field="FGA" sort="true" title="Field Goals Attempted">FGA</th>, <th dir="D" field="FG_PCT" sort="true" title="Field Goal Percentage">FG%</th>, <th dir="D" field="FG3M" sort="true" title="3 Point Field Goals Made">3PM</th>, <th dir="D" field="FG3A" sort="true" title="3 Point Field Goals Attempted">3PA</th>, <th dir="D" field="FG3_PCT" sort="true" title="3 Point Field Goal Percentage">3P%</th>, <th dir="D" field="FTM" sort="true" title="Free Throws Made">FTM</th>, <th dir="

In [24]:
# Find all tr elements within the table except for the first one (the header row)
rows = table.find_all('tr')[1:]
print(rows[0])

<tr><td class="Crom_text__NpR1_ Crom_sticky__uYvkp"><a class="Anchor_anchor__cSc3P" data-has-children="true" data-has-more="false" data-is-external="false" href="/stats/team/1610612758">SAC</a></td><td class="Crom_text__NpR1_"><a class="Anchor_anchor__cSc3P" data-has-children="true" data-has-more="false" data-is-external="false" href="/game/0022201195">SAC @ DAL</a></td><td class="Crom_text__NpR1_"><a class="Anchor_anchor__cSc3P" data-has-children="true" data-has-more="false" data-is-external="false" href="/games?date=04/05/2023">04/05/2023</a></td><td>L</td><td>48</td><td>119</td><td><a class="StatEventLink_sel__pAwmA" data-id="nba:games:game-details-box-score:video-box-score" data-pos="" data-premium="false" data-track="video" href="/stats/events/?CFID=&amp;CFPARAMS=&amp;ContextMeasure=FGM&amp;GameID=0022201195&amp;PlayerID=0&amp;Season=2022-23&amp;SeasonType=Regular%20Season&amp;TeamID=1610612758&amp;flag=3&amp;sct=plot&amp;section=game">46</a></td><td><a class="StatEventLink_sel__p

In [12]:
# Verify the amount of rows scraped
len(rows)

2392

In [25]:
# List comprehension that collects the text of every td in each row, skipping the first one (TEAM)
team_stats = [[td.get_text().strip() for td in row.find_all('td')[1:]] for row in rows]
print(team_stats[0])

['SAC @ DAL', '04/05/2023', 'L', '48', '119', '46', '106', '43.4', '12', '37', '32.4', '15', '19', '78.9', '22', '35', '57', '31', '9', '7', '1', '19', '-4']


An empty list, `rows`, is created first; it is later overwritten with the scraped rows. All `th` elements within the table are found, and the text of each one except the first is stored in `header_list`. All `tr` elements except the first (the header row) are then stored in `rows`. Finally, a list comprehension iterates over the rows and, for each one, builds a list of the text of its `td` elements, again excluding the first. These lists are collected in `team_stats`.
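
The extraction steps above can be tried on a small table without launching a browser. The following sketch uses hand-written sample HTML (a few values borrowed from the output above) and adds a check that every row has as many values as there are headers, since a mismatch would make the dataframe creation in the next section fail. The variable `mismatched` is introduced here for illustration only.

```python
from bs4 import BeautifulSoup

# Hand-written sample table (illustrative, not the full scraped page)
sample_html = """
<table>
  <tr><th>TEAM</th><th>MATCH UP</th><th>W/L</th><th>PTS</th></tr>
  <tr><td>SAC</td><td>SAC @ DAL</td><td>L</td><td>119</td></tr>
  <tr><td>NOP</td><td>NOP vs. MEM</td><td>W</td><td>138</td></tr>
</table>
"""
table = BeautifulSoup(sample_html, "html.parser")

# Header texts, skipping the first column (TEAM)
header_list = [th.get_text(strip=True) for th in table.find_all("th")[1:]]

# Data rows, skipping the header row
rows = table.find_all("tr")[1:]

# Cell texts per row, skipping the first column (TEAM)
team_stats = [
    [td.get_text(strip=True) for td in row.find_all("td")[1:]]
    for row in rows
]

# Rows whose length does not match the number of headers
mismatched = [row for row in team_stats if len(row) != len(header_list)]

print(header_list)
print(team_stats)
print(f"{len(mismatched)} mismatched rows")
```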

<a id="section-four"></a>
### Create the Dataframe and save it
We will now create a Pandas dataframe using the data from the two lists called `team_stats` and `header_list`. 
* `team_stats` contains the data 
* `header_list` contains the column names

Afterwards, we will preview the first five rows of the dataframe to check that the columns and values line up as expected.
Finally, we will save the dataframe as a `.csv` file in the `data` directory. The `sep` parameter specifies the separator between data values, which in our case is a semicolon.

In [15]:
# Create a dataframe from the 'team_stats' list and the 'header_list'
df = pd.DataFrame(team_stats, columns=header_list)

In [16]:
# Preview the dataframe
df.head()

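| | MATCH UP | GAME DATE | W/L | MIN | PTS | FGM | FGA | FG% | 3PM | 3PA | ... | FT% | OREB | DREB | REB | AST | TOV | STL | BLK | PF | +/- |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SAC @ DAL | 04/05/2023 | L | 48 | 119 | 46 | 106 | 43.4 | 12 | 37 | ... | 78.9 | 22 | 35 | 57 | 31 | 9 | 7 | 1 | 19 | -4 |
| 1 | NOP vs. MEM | 04/05/2023 | W | 53 | 138 | 44 | 88 | 50.0 | 21 | 39 | ... | 74.4 | 8 | 36 | 44 | 35 | 16 | 9 | 8 | 22 | 7 |
| 2 | CHI @ MIL | 04/05/2023 | L | 48 | 92 | 36 | 83 | 43.4 | 11 | 33 | ... | 81.8 | 5 | 33 | 38 | 28 | 15 | 10 | 3 | 11 | -13 |
| 3 | BKN @ DET | 04/05/2023 | W | 48 | 123 | 45 | 86 | 52.3 | 17 | 43 | ... | 69.6 | 11 | 27 | 38 | 36 | 13 | 6 | 7 | 12 | 15 |
| 4 | WAS @ ATL | 04/05/2023 | L | 48 | 116 | 45 | 94 | 47.9 | 6 | 28 | ... | 76.9 | 12 | 27 | 39 | 23 | 14 | 10 | 9 | 26 | -18 |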


In [17]:
# Save the dataframe as a CSV file in the './data' directory.
# The sep parameter is used to specify that the separator between the data values should be a semicolon.
#df.to_csv("./data/raw_data.csv", sep=";")

**Don't forget to uncomment the `df.to_csv` line before running the cell!**
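
To see what the semicolon separator does in practice, the sketch below writes a two-row sample dataframe to a temporary directory and reads it back. It differs from the cell above in two ways, both of which are choices made for this example: it creates the `data` directory first, since `to_csv` does not create missing directories, and it passes `index=False` so the row index is not written as an extra column.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small sample in the same shape as header_list and team_stats (illustrative)
header_list = ["MATCH UP", "GAME DATE", "W/L", "PTS"]
team_stats = [
    ["SAC @ DAL", "04/05/2023", "L", "119"],
    ["NOP vs. MEM", "04/05/2023", "W", "138"],
]
df = pd.DataFrame(team_stats, columns=header_list)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data" / "raw_data.csv"
    # to_csv does not create missing directories, so create it first
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    # Semicolon-separated, without the row index
    df.to_csv(csv_path, sep=";", index=False)
    # Read everything back as text, using the same separator
    restored = pd.read_csv(csv_path, sep=";", dtype=str)

print(restored)
```

Reading with `dtype=str` keeps every value as text, matching what the scraper produced. Converting columns to numbers or dates is left to the later processing phase.
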
___

### Further reading
* [pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) - Documentation for pandas.
* [Selenium documentation](https://www.selenium.dev/documentation/) - Documentation for Selenium.