### Obtaining player data per season
To build the player evolution model, let's first get data about NBA players and their recent performance.
We are going to use the [Basketball Reference](https://www.basketball-reference.com/) website for this.
First, we downloaded the table from [this page](https://www.basketball-reference.com/leagues/NBA_2022_totals.html), which contains each player's per-game statistics for the 2021/2022 season.

In [1]:
import pandas as pd
import os
pd.options.display.max_colwidth = 100
pd.options.display.max_rows = 100

In [2]:
df = pd.read_csv('season_21_22.csv')
# 'Player' holds "Name\id"; split it once and reuse both parts
player_parts = df['Player'].str.split(pat = "\\", expand = True)
df['Player Name'] = player_parts[0]
# Player pages live under /players/<first letter of the id>/<id>.html
df['Player Link'] = 'https://www.basketball-reference.com/players/' + player_parts[1].str[0] + '/' + player_parts[1] + '.html'

In [3]:
df.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player Name,Player Link
0,1,Precious Achiuwa\achiupr01,C,22,TOR,73,28,23.6,3.6,8.3,...,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1,Precious Achiuwa,https://www.basketball-reference.com/players/a/achiupr01.html
1,2,Steven Adams\adamsst01,C,28,MEM,76,75,26.3,2.8,5.1,...,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9,Steven Adams,https://www.basketball-reference.com/players/a/adamsst01.html
2,3,Bam Adebayo\adebaba01,C,24,MIA,56,56,32.6,7.3,13.0,...,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1,Bam Adebayo,https://www.basketball-reference.com/players/a/adebaba01.html
3,4,Santi Aldama\aldamsa01,PF,21,MEM,32,0,11.3,1.7,4.1,...,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1,Santi Aldama,https://www.basketball-reference.com/players/a/aldamsa01.html
4,5,LaMarcus Aldridge\aldrila01,C,36,BRK,47,12,22.3,5.4,9.7,...,3.9,5.5,0.9,0.3,1.0,0.9,1.7,12.9,LaMarcus Aldridge,https://www.basketball-reference.com/players/a/aldrila01.html
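
The link construction above can be checked on a single value. The following is a minimal sketch of the same logic on a plain string; `BASE_URL` and `split_player_field` are hypothetical names introduced only for illustration, and the function assumes the field contains exactly one backslash.

```python
BASE_URL = "https://www.basketball-reference.com/players/"


def split_player_field(player_field):
    """Split a 'Name\\id' field into (name, link); assumes exactly one backslash."""
    name, player_id = player_field.split("\\")
    # Player pages are grouped by the first letter of the player id
    link = f"{BASE_URL}{player_id[0]}/{player_id}.html"
    return name, link


name, link = split_player_field("Precious Achiuwa\\achiupr01")
print(name)  # Precious Achiuwa
print(link)  # https://www.basketball-reference.com/players/a/achiupr01.html
```

The printed link matches the `Player Link` value shown for the first row of the table above.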


### Obtaining more player info
The data gathered above contains a link to each player's main page on Basketball Reference. We are going to scrape these pages to collect statistics from previous seasons for every player who played in the 2021/22 NBA season.
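
Because this step requests one page per player, a failed or throttled request can interrupt a long run. The sketch below shows one possible way to pause between requests and retry on failure. `fetch_html` is a hypothetical helper, and the retry count and delay are assumptions rather than values recommended by the site; check the site's terms of use before choosing them.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen


def fetch_html(url, opener=urlopen, retries=3, delay=3.0, sleep=time.sleep):
    """Return the raw bytes of `url`, pausing before each request and retrying on failure.

    `opener` and `sleep` can be replaced so the function can be exercised offline.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        sleep(delay * attempt)  # wait longer on each attempt (linear backoff)
        try:
            with opener(url) as response:
                return response.read()
        except URLError as error:  # HTTPError is a subclass of URLError
            last_error = error
    raise RuntimeError(f"could not fetch {url} after {retries} attempts") from last_error
```

Since `BeautifulSoup` accepts bytes as input, a call such as `fetch_html(player)` could stand in for `urlopen(player)` in the scraping loop, at the cost of a longer total run time.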

In [4]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [5]:
# Row ids of the per-game table, one per season ending in 1997 through 2022
all_years = [f'per_game.{year}' for year in range(1997, 2023)]

In [6]:
final_df = pd.DataFrame(columns=['season', 'height', 'weight', 'name', 'age', 'team_id', 'lg_id', 'pos',
       'g', 'gs', 'mp_per_g', 'fg_per_g', 'fga_per_g', 'fg_pct', 'fg3_per_g',
       'fg3a_per_g', 'fg3_pct', 'fg2_per_g', 'fg2a_per_g', 'fg2_pct',
       'efg_pct', 'ft_per_g', 'fta_per_g', 'ft_pct', 'orb_per_g', 'drb_per_g',
       'trb_per_g', 'ast_per_g', 'stl_per_g', 'blk_per_g', 'tov_per_g',
       'pf_per_g', 'pts_per_g'])

In [7]:
%%time
for player in df['Player Link'].unique():
    html = urlopen(player)
    bs = BeautifulSoup(html, 'html.parser')
    # Season rows of the per-game table; search the page once and reuse the result
    rows = bs.find_all(attrs={'class':'full_table','id':all_years})
    if not rows:
        continue
    # Player-level fields are the same for every season, so parse them once per page
    size = bs.find(attrs={"itemprop":"height"}).find_parent().contents[3].string
    height = int(size.split('(')[1].split('c')[0])       # centimetres
    weight = int(size.split(',\xa0')[1].split('k')[0])   # kilograms
    name = bs.find('h1',attrs={'itemprop':'name'}).contents[1].string
    test_up = {}
    for year, row in enumerate(rows):
        test = {}
        for child in row.children:
            # Use the link text when the cell contains a link, otherwise the cell text
            if child.a is None:
                test[child['data-stat']] = child.string
            else:
                test[child['data-stat']] = child.a.string
        test['height'] = height
        test['weight'] = weight
        test['name'] = name
        test_up[year] = test
    player_df = pd.DataFrame.from_dict(test_up, orient = 'index')
    final_df = pd.concat([final_df, player_df])

Wall time: 18min 26s
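
The height and weight in the loop above are extracted with chained `split` calls, which raise an error as soon as the text deviates from the expected shape. As an alternative sketch, a regular expression can do the same extraction and return `None` when the pattern is missing. `parse_metric_size` is a hypothetical helper, and the sample string is made up for illustration; only its `(...cm, ...kg)` part is inferred from the `split` calls.

```python
import re

# Metric part of the size text, e.g. "(203cm, 102kg)"; \s also matches a non-breaking space
METRIC_SIZE = re.compile(r"\((\d+)cm,\s*(\d+)kg\)")


def parse_metric_size(text):
    """Return (height_cm, weight_kg) from the size text, or None if the pattern is absent."""
    match = METRIC_SIZE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical sample string, not copied from the site
sample = "6-8,\xa0225lb\xa0(203cm,\xa0102kg)"
print(parse_metric_size(sample))  # (203, 102)
```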


In [8]:
final_df.head()

Unnamed: 0,season,height,weight,name,age,team_id,lg_id,pos,g,gs,...,ft_pct,orb_per_g,drb_per_g,trb_per_g,ast_per_g,stl_per_g,blk_per_g,tov_per_g,pf_per_g,pts_per_g
0,2020-21,203,102,Precious Achiuwa,21,MIA,NBA,PF,61,4,...,0.509,1.2,2.2,3.4,0.5,0.3,0.5,0.7,1.5,5.0
1,2021-22,203,102,Precious Achiuwa,22,TOR,NBA,C,73,28,...,0.595,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1
0,2013-14,211,120,Steven Adams,20,OKC,NBA,C,81,20,...,0.581,1.8,2.3,4.1,0.5,0.5,0.7,0.9,2.5,3.3
1,2014-15,211,120,Steven Adams,21,OKC,NBA,C,70,67,...,0.502,2.8,4.6,7.5,0.9,0.5,1.2,1.4,3.2,7.7
2,2015-16,211,120,Steven Adams,22,OKC,NBA,C,80,80,...,0.582,2.7,3.9,6.7,0.8,0.5,1.1,1.1,2.8,8.0
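
The repeated index values in the output above (0, 1, 0, 1, 2) appear because each player's frame keeps its own index when it is concatenated. The scraped cells are also text, not numbers. The following sketch, using made-up frames in place of the scraped `player_df` objects, shows one way to collect the frames in a list, concatenate them once with a fresh index, and convert a statistic to a numeric type.

```python
import pandas as pd

# Hypothetical per-player frames standing in for the scraped `player_df` objects
frames = [
    pd.DataFrame({"name": ["Player A", "Player A"],
                  "season": ["2020-21", "2021-22"],
                  "pts_per_g": ["5.0", "9.1"]}),
    pd.DataFrame({"name": ["Player B"],
                  "season": ["2021-22"],
                  "pts_per_g": ["6.9"]}),
]

# A single concat at the end; ignore_index=True assigns a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)

# Scraped cell text is a string, so convert numeric columns explicitly
combined["pts_per_g"] = pd.to_numeric(combined["pts_per_g"], errors="coerce")

print(combined.index.tolist())  # [0, 1, 2]
print(combined.dtypes)
```

Concatenating once also avoids copying the growing `final_df` on every iteration of the loop.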


Let's save the result to a CSV file so that it is easier to access later.

In [9]:
final_df.to_csv('players_stats_by_season_v2.csv', index = False)