# Step 3 - Prepare Phase
by Alexandru Nitulescu
____

### Introduction
In the Prepare phase of this project, we use Python and several packages (pandas, BeautifulSoup, and Selenium) to scrape team box-score data from www.nba.com. Before collecting and processing the data, we first had to decide what information the project needs and how it should be organized. Asking key questions, such as what data to retrieve, what metrics to display on the dashboard, and how the data will be visualized, let us define the scope and requirements of the project. Identifying the prior knowledge required and deciding how to proceed helped ensure that we collect the data needed to build a dynamic and informative dashboard. This phase lays the foundation for the subsequent data processing and database management phases. The end goal of the project is to showcase my skills in data analysis and database management for potential job opportunities in the field.


#### 3.1 - Setup

Before we begin, we need to make sure that all the required libraries are installed. In this section, we import the libraries used for the web scraping task: pandas for tabular data, BeautifulSoup for parsing HTML, and Selenium for driving the browser.

In [1]:
# Import necessary packages
import pandas as pd  # tabular data handling
from bs4 import BeautifulSoup  # HTML parsing
from selenium import webdriver  # browser automation
from selenium.webdriver.common.by import By  # element locator strategies
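
Since the imports above fail with an `ImportError` when a package is missing, a quick availability check can be run first. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` list are illustrative and not part of the original notebook.

```python
from importlib.util import find_spec

# Import names used in this notebook (BeautifulSoup is imported as "bs4").
REQUIRED = ["pandas", "bs4", "selenium"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Note that the name used with pip can differ from the import name: BeautifulSoup is installed as `beautifulsoup4` but imported as `bs4`.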

#### 3.2 - Scrape the Data

In [2]:
# Create a new web browser instance of Chrome
driver = webdriver.Chrome()

In [3]:
# The url we want to visit
url = 'https://www.nba.com/stats/teams/boxscores-traditional'

In [4]:
# Open the webpage in our instance
driver.get(url=url)

In [5]:
# Locate the "All" option in the dropdown menu and click it so that all rows are displayed
selection = driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[3]/section[2]/div/div[2]/div[2]/div[1]/div[3]/div/label/div/select/option[1]')
selection.click()

In [6]:
# Get the page source and store it in variable src
src = driver.page_source

In [7]:
# Parse the HTML source code
parser = BeautifulSoup(src, "html.parser")

In [8]:
# Find the div element containing the table
table = parser.find("div", attrs={
    "class": "Crom_container__C45Ti crom-container"
})
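
The lookup above matches the full class string. The `Crom_container__C45Ti` part looks like a build-generated name, so it could change when the site is updated (this is an assumption, not something the page documents). A sketch that matches only the `crom-container` class through a CSS selector and fails loudly when nothing is found is shown below; `find_table_container` and `sample_src` are hypothetical names, and the HTML is a heavily shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened stand-in for the real page source.
sample_src = """
<div class="Crom_container__C45Ti crom-container">
  <table><tr><th>TEAM</th></tr></table>
</div>
"""


def find_table_container(html):
    """Return the div carrying the 'crom-container' class, or raise if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one("div.crom-container")
    if container is None:
        raise ValueError("No div with class 'crom-container' found")
    return container


container = find_table_container(sample_src)
print(container.find("th").text)
```

Raising an error here avoids a less obvious `AttributeError` later, when methods are called on a `None` result.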

In [9]:
# Find all th elements within the table and extract their text,
# skipping the first header cell
headers = table.find_all('th')
header_list = [header.text.strip() for header in headers[1:]]

In [10]:
# Find all tr elements within the table except for the first one (the header row)
rows = table.find_all('tr')[1:]

In [11]:
# Verify the number of rows scraped
len(rows)

2350

In [12]:
# Extract the stripped text of every cell in each row, skipping the first cell
team_stats = [[td.get_text().strip() for td in row.find_all('td')[1:]] for row in rows]

In [13]:
# Show the structure of the first row
print(team_stats[0])

['ATL vs. DAL',
 '04/02/2023',
 'W',
 '53',
 '132',
 '51',
 '108',
 '47.2',
 '12',
 '35',
 '34.3',
 '18',
 '22',
 '81.8',
 '16',
 '37',
 '53',
 '28',
 '11',
 '10',
 '3',
 '22',
 '2']
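
Before building a dataframe from these lists, it can help to confirm that every row has as many cells as there are headers, because a mismatch can surface as an error or as silently missing values. The following is a small sketch of such a check; the function name and the miniature sample data are hypothetical.

```python
def rows_with_wrong_width(header_list, team_stats):
    """Return the indices of rows whose cell count differs from the header count."""
    expected = len(header_list)
    return [i for i, row in enumerate(team_stats) if len(row) != expected]


# Hypothetical miniature data: three headers, and the second row is one cell short.
sample_headers = ["MATCH UP", "GAME DATE", "W/L"]
sample_rows = [["ATL vs. DAL", "04/02/2023", "W"], ["CHA vs. TOR", "04/02/2023"]]

bad = rows_with_wrong_width(sample_headers, sample_rows)
print(bad)
```

On the scraped data the call would be `rows_with_wrong_width(header_list, team_stats)`, and an empty list means all rows line up with the headers.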

In [14]:
# Create a dataframe from the 'team_stats' list and the 'header_list'
df = pd.DataFrame(team_stats, columns=header_list)

In [15]:
# Show the first five rows
df.head()

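| | MATCH UP | GAME DATE | W/L | MIN | PTS | FGM | FGA | FG% | 3PM | 3PA | ... | FT% | OREB | DREB | REB | AST | TOV | STL | BLK | PF | +/- |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ATL vs. DAL | 04/02/2023 | W | 53 | 132 | 51 | 108 | 47.2 | 12 | 35 | ... | 81.8 | 16 | 37 | 53 | 28 | 11 | 10 | 3 | 22 | 2 |
| 1 | CHA vs. TOR | 04/02/2023 | L | 48 | 108 | 42 | 85 | 49.4 | 15 | 31 | ... | 69.2 | 10 | 27 | 37 | 26 | 18 | 3 | 4 | 11 | -20 |
| 2 | PHI @ MIL | 04/02/2023 | L | 48 | 104 | 40 | 87 | 46.0 | 12 | 36 | ... | 92.3 | 11 | 25 | 36 | 19 | 11 | 3 | 2 | 17 | -13 |
| 3 | POR @ MIN | 04/02/2023 | W | 48 | 107 | 43 | 93 | 46.2 | 9 | 30 | ... | 60.0 | 11 | 31 | 42 | 29 | 10 | 12 | 3 | 26 | 2 |
| 4 | MIL vs. PHI | 04/02/2023 | W | 48 | 117 | 46 | 80 | 57.5 | 10 | 28 | ... | 71.4 | 7 | 35 | 42 | 28 | 12 | 8 | 5 | 17 | 13 |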
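
Every value scraped this way is text, so the dataframe holds strings even where the rows above look numeric. The sketch below shows one way the conversion could be done later, on a few values copied from those rows. The function name `convert_types`, the choice of columns, and the month/day/year date format are assumptions for illustration.

```python
import pandas as pd

# A few values copied from the rows shown above; all other columns are omitted.
sample = pd.DataFrame(
    {"GAME DATE": ["04/02/2023", "04/02/2023"], "PTS": ["132", "108"], "+/-": ["2", "-20"]}
)


def convert_types(frame, date_col="GAME DATE", numeric_cols=("PTS", "+/-")):
    """Return a copy with the date column parsed and the numeric columns converted."""
    out = frame.copy()
    # Assumption: dates are written month/day/year.
    out[date_col] = pd.to_datetime(out[date_col], format="%m/%d/%Y")
    for col in numeric_cols:
        out[col] = pd.to_numeric(out[col])
    return out


typed = convert_types(sample)
print(typed.dtypes)
```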


In [16]:
# Save the dataframe as a CSV file in the './data' directory (the directory must already exist).
# sep=";" makes the semicolon the field separator instead of the default comma.
df.to_csv("./data/raw_data.csv", sep=";")
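
By default, `to_csv` also writes the dataframe index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. Whether that column is wanted in later phases is not stated, so the variant below, which drops the index and creates the target directory if needed, is only a sketch. The helper `save_raw` is a hypothetical name, and the demonstration uses a tiny frame in a temporary directory rather than the real data.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_raw(frame, path, sep=";"):
    """Create the parent directory if needed and write `frame` without its index."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, sep=sep, index=False)
    return path


# Demonstration on a tiny frame in a temporary directory (not the real data).
demo = pd.DataFrame({"MATCH UP": ["ATL vs. DAL"], "PTS": [132]})
with tempfile.TemporaryDirectory() as tmp:
    saved = save_raw(demo, Path(tmp) / "data" / "raw_data.csv")
    reloaded = pd.read_csv(saved, sep=";")

print(list(reloaded.columns))
```

For the scraped data the call would be `save_raw(df, "./data/raw_data.csv")`. The same separator has to be passed to `pd.read_csv` when loading the file again.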