**Process of Data Collection & Conversion to .csv File**

This notebook outlines the code steps taken to collect the data and convert it to .csv files.

In [9]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import duckdb
import numpy as np
import time

The resulting .csv files were then read, cleaned, and manipulated in the phase 2 submission file.

**Player Stats**
After the imports above, I built the request URL for each season from 2003-04 through 2022-23 and collected the top player stats, which rank the players who led each season along with their stats. A for loop collects one dataframe per season, and these are then combined into one main dataframe. The data is requested by average points per game for each player per season (`StatCategory=PTS`, `PerMode=PerGame` in the URL).

In [10]:
# Create a list of all of the seasons to collect data from
# (2011-12 was missing from the original list and is included here)
season_years = ["2003-04", "2004-05", "2005-06", "2006-07", "2007-08", "2008-09", "2009-10", "2010-11", "2011-12",
    "2012-13", "2013-14", "2014-15", "2015-16", "2016-17", "2017-18", "2018-19", "2019-20", "2020-21", "2021-22", "2022-23"]

# Initialize an empty list to hold one dataframe per season
dataframes = []

# Loop through the season years and request the league leaders for each one
for season_year in season_years:

    # Insert season_year into the API URL so each iteration requests a different season
    url = f"https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season={season_year}&SeasonType=Regular%20Season&StatCategory=PTS"
    response = requests.get(url).json()

    table_headers = response['resultSet']['headers']
    season_data = pd.DataFrame(response['resultSet']['rowSet'], columns=table_headers)

    # Tag every row with its season (the new column is added last)
    season_data['Year'] = season_year

    # Append the dataframe to the list
    dataframes.append(season_data)

# Concatenate all dataframes into one giant dataframe
player_stat_df = pd.concat(dataframes, ignore_index=True)
player_stat_df.to_csv('player_stats.csv')

# Display the combined dataframe (only the last expression in a cell is shown)
player_stat_df
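
The season list above is typed by hand, which makes it easy to drop a season by accident. As a minimal sketch, assuming the labels always follow the pattern of a four-digit start year plus a two-digit end year, they could be generated instead. The helper name `season_labels` and the variable `generated_seasons` are hypothetical and not part of the original notebook.

```python
def season_labels(first_start_year, last_start_year):
    """Return season labels such as '2003-04' for each starting year, inclusive."""
    labels = []
    for start in range(first_start_year, last_start_year + 1):
        # The suffix is the last two digits of the following calendar year
        end_suffix = str(start + 1)[-2:]
        labels.append(f"{start}-{end_suffix}")
    return labels


# Hypothetical usage: the twenty seasons covered by this notebook
generated_seasons = season_labels(2003, 2022)
print(len(generated_seasons))
print(generated_seasons[0], generated_seasons[-1])
```

Comparing the length of the generated list against the expected number of seasons is a quick way to catch a gap before any requests are sent.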


**Team Stat Data Collection**
@Akhil add in process here

In [11]:
#Add here
# Existing method (returns incorrect data; kept as a placeholder)

data=[]

# One snapshot date per season to collect
years = ['2004-06-16', '2005-06-16', '2006-06-16', '2007-06-16', '2008-06-16', '2009-06-16', '2010-06-16', \
    '2011-06-16', '2012-06-16', '2013-06-16', '2014-06-16', '2015-06-16', '2016-06-16', '2017-06-16', '2018-06-16', \
    '2019-06-16', '2020-06-16', '2021-06-16', '2022-06-16', '2023-06-16'] 

#Loop through the different years and collect the data about team name and average points per game
for year in years:
    url = f"https://www.teamrankings.com/nba/stat/points-per-game?date={year}"

    response = requests.get(url)

    if response.status_code != 200:
        print("Something went wrong:", response.status_code, response.reason)
        continue

    page = BeautifulSoup(response.content, 'html.parser')

    table_body = page.find('tbody')

    rows = table_body.find_all('tr')

    year_num = int(year[:4])
    season_str = str((year_num-1))+"-"+str(year_num)[2:4]

    for row in rows:
        team_name = row.find('td', {"class": "text-left nowrap"}).text
        PTG = row.find('td', {"class" :'text-right'}).text
        data.append([season_str, team_name, PTG])

team_stats_df = pd.DataFrame(data, columns=['Year', 'Team', 'PTG'])
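
The scraped cell text arrives as strings, possibly with surrounding whitespace. A minimal sketch of cleaning each row before it goes into the dataframe might look like the following. The helper `clean_team_row` and the sample values are hypothetical, made up for illustration rather than taken from the site.

```python
def clean_team_row(season, team_text, ppg_text):
    """Strip whitespace from scraped cells and convert points per game to a float.

    The numeric field becomes None when the cell does not hold a number.
    """
    team = team_text.strip()
    try:
        ppg = float(ppg_text.strip())
    except ValueError:
        ppg = None
    return [season, team, ppg]


# Made-up sample cells, not real scraped values
cleaned_rows = [
    clean_team_row("2003-04", " Example Team A ", "101.5"),
    clean_team_row("2003-04", "Example Team B", "--"),
]
print(cleaned_rows)
```

Converting at collection time keeps the `PTG` column numeric, so later averages or sorts do not operate on text.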

**Historical MVP Data Collection**
This cell collects the historical MVP winner for each of the seasons being analyzed. It goes through the list of target seasons, finds the paragraphs on the page that mention each season, and splits that text to extract the winner's name before building a dataframe. The page text had to be parsed this way because the other sources could not be scraped.

In [12]:
url = "https://www.nba.com/news/history-mvp-award-winners"

response = requests.get(url)

if response.status_code != 200:
    print("Something went wrong:", response.status_code, response.reason)

page = BeautifulSoup(response.content, 'html.parser')


target_seasons = ["2003-04", "2004-05", "2005-06", "2006-07", "2007-08", "2008-09", "2009-10", "2010-11", "2011-12", \
    "2012-13", "2013-14", "2014-15", "2015-16", "2016-17", "2017-18", "2018-19", "2019-20", "2020-21", "2021-22", "2022-23"]

data = {} 

# Loop through the target seasons
for target_season in target_seasons:
    for p_tag in page.find_all('p'):
        if target_season in p_tag.text:
            split_data = p_tag.text.split(' — ')
            if len(split_data) > 1:
                winner = split_data[1].split(',')[0]
                data[target_season] = winner

# Create a DataFrame from the data dictionary
mvp_df = pd.DataFrame(list(data.items()), columns=["Year", "MVP_Name"])
mvp_df.to_csv('mvp_historical.csv')

# Display the MVP dataframe (only the last expression in a cell is shown)
mvp_df
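
The text-splitting step can be tried on its own, without fetching the page. This is a minimal sketch, assuming each paragraph is shaped like a season label, then the dash separator used in the loop above, then "Name, Team". The function `parse_mvp_line` and the sample string are hypothetical stand-ins for the real page text.

```python
# Space, em dash, space: the same delimiter the scraping loop splits on
SEPARATOR = " \u2014 "


def parse_mvp_line(text):
    """Return (season, winner) from one line of text, or None if it has no separator."""
    parts = text.split(SEPARATOR)
    if len(parts) < 2:
        return None
    season = parts[0].strip()
    # The winner's name is everything before the first comma
    winner = parts[1].split(",")[0].strip()
    return season, winner


# Assumed line format, written by hand for illustration
sample = "2022-23" + SEPARATOR + "Joel Embiid, Philadelphia 76ers"
print(parse_mvp_line(sample))
print(parse_mvp_line("no separator here"))
```

Isolating the parsing in a small function makes it easier to check the split logic against a few known lines before running it over the whole page.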