### Obtaining draft data
To build the college evolution model, we first need data on recent drafts.
We use the [Basketball Reference](https://www.basketball-reference.com/) website as the source.
First, we downloaded the content of pages such as [this one](https://www.basketball-reference.com/draft/NBA_2021.html), each of which lists all the draft picks from a given year, and saved each page's table to a CSV file.

In [1]:
import pandas as pd
import os
from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment

In [2]:
df_list = []
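for file in os.listdir():
    # Pick up the per-year draft files (e.g. draft_2011.csv)
    if 'draft_2' in file:
        df = pd.read_csv(file)
        # The year is the part between the underscore and the extension
        df['draft_year'] = file.split('_')[1].split('.')[0]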
        df_list.append(df)
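The substring test and the chained `split` calls above work for names like `draft_2011.csv`, which is the naming the loop implies. A stricter variant is sketched below; the helper `draft_year_from_filename` and the exact file-name pattern are assumptions made for illustration, not part of the original notebook.

```python
import re

# Assumed naming scheme: draft_<4-digit year>.csv (e.g. draft_2011.csv)
DRAFT_FILE_PATTERN = re.compile(r'^draft_(\d{4})\.csv$')

def draft_year_from_filename(filename):
    """Return the draft year as a string, or None if the name does not match."""
    match = DRAFT_FILE_PATTERN.match(filename)
    return match.group(1) if match else None

print(draft_year_from_filename('draft_2011.csv'))
print(draft_year_from_filename('draft_data_college_history.csv'))
```

Anchoring the pattern at both ends keeps other CSV files in the same folder, such as the output files written later, from being picked up by accident.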

In [3]:
df = pd.concat(df_list)
df = df[['draft_year','Pk','Tm','Player','College','Yrs','G','MP','PTS']]

In [4]:
df.head()

| | draft_year | Pk | Tm | Player | College | Yrs | G | MP | PTS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011 | 1 | CLE | Kyrie Irving | Duke | 11.0 | 611.0 | 20804.0 | 14089.0 |
| 1 | 2011 | 2 | MIN | Derrick Williams | Arizona | 7.0 | 428.0 | 8864.0 | 3809.0 |
| 2 | 2011 | 3 | UTA | Enes Freedom | | 11.0 | 748.0 | 16101.0 | 8349.0 |
| 3 | 2011 | 4 | CLE | Tristan Thompson | Texas | 11.0 | 730.0 | 19556.0 | 6592.0 |
| 4 | 2011 | 5 | TOR | Jonas Valančiūnas | | 10.0 | 695.0 | 18141.0 | 9319.0 |


As a second step, we gathered conference information for the college of each drafted player. It was retrieved from the [Sports Reference](https://www.sports-reference.com/cbb/) website, from pages such as [this one](https://www.sports-reference.com/cbb/conferences/big-12/).

In [5]:
df_conf = pd.read_csv('College conferences.csv')

In [6]:
df = df.merge(df_conf, how = 'left', left_on = 'College', right_on = 'School').drop(columns = ['School'])

In [7]:
df['Big Six Conference'] = 0
df.loc[df.Conference.notna(),'Big Six Conference'] = 1
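Because the merge is a left join and the flag is derived from `Conference.notna()`, any player whose college is missing from `College conferences.csv`, or is spelled differently there, ends up with a flag of 0. The sketch below uses two small made-up tables (not the real data) to show how the `indicator=True` option of `DataFrame.merge` can surface colleges that found no match.

```python
import pandas as pd

# Toy data for illustration only; the real tables come from the CSV files.
picks = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player C'],
    'College': ['Duke', 'Some Small College', None],
})
conferences = pd.DataFrame({
    'School': ['Duke', 'Texas'],
    'Conference': ['ACC', 'Big 12'],
})

merged = picks.merge(conferences, how='left', left_on='College',
                     right_on='School', indicator=True).drop(columns=['School'])

# Same flag as in the notebook, written as a boolean cast
merged['Big Six Conference'] = merged['Conference'].notna().astype(int)

# Rows that have a college but found no match in the conference table
unmatched = merged[(merged['_merge'] == 'left_only') & merged['College'].notna()]
print(unmatched['College'].tolist())
```

Reviewing the unmatched list is a quick way to tell genuine non-members apart from spelling mismatches between the two sources.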

In [8]:
df.head()

| | draft_year | Pk | Tm | Player | College | Yrs | G | MP | PTS | Conference | Big Six Conference |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011 | 1 | CLE | Kyrie Irving | Duke | 11.0 | 611.0 | 20804.0 | 14089.0 | ACC | 1 |
| 1 | 2011 | 2 | MIN | Derrick Williams | Arizona | 7.0 | 428.0 | 8864.0 | 3809.0 | PAC-12 | 1 |
| 2 | 2011 | 3 | UTA | Enes Freedom | | 11.0 | 748.0 | 16101.0 | 8349.0 | | 0 |
| 3 | 2011 | 4 | CLE | Tristan Thompson | Texas | 11.0 | 730.0 | 19556.0 | 6592.0 | Big 12 | 1 |
| 4 | 2011 | 5 | TOR | Jonas Valančiūnas | | 10.0 | 695.0 | 18141.0 | 9319.0 | | 0 |


Next, we scrape the Basketball Reference draft pages to obtain the link to each drafted player's main page, which contains valuable information for our model. Players with no college listed are skipped.

In [9]:
years = ['2011','2012','2013','2014','2015','2016','2017','2018','2019','2020','2021']

In [10]:
link_list = []

In [11]:
%%time
for year in years:
    html = urlopen('https://www.basketball-reference.com/draft/NBA_' + year + '.html')
    bs = BeautifulSoup(html, 'html.parser')
    # Query each column once per page instead of once per row
    player_cells = bs.find_all('td', attrs = {'data-stat':'player','class':'left'})
    college_cells = bs.find_all('td', attrs = {'data-stat':'college_name','class':'left'})
    for player_cell, college_cell in zip(player_cells, college_cells):
        # Keep only players whose college cell does not carry the 'Zzz' sort key
        if college_cell['csk'] != 'Zzz':
            link_list.append(player_cell.contents[0]['href'])

Wall time: 34.5 s
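The loop above requests eleven pages back to back, and the next step requests more than five hundred. Many sites throttle or block rapid automated requests, so it may be worth pausing between calls. The helper below is a hypothetical sketch: the name `fetch_all`, the `delay_seconds` default and the injectable `fetch` argument are assumptions, not part of the original notebook, and the site's own terms should be checked for its actual limits.

```python
import time
from urllib.request import urlopen

def fetch_all(urls, delay_seconds=3.0, fetch=None):
    """Fetch each URL in order, sleeping between consecutive requests."""
    if fetch is None:
        # Default fetcher: download the raw page bytes
        fetch = lambda url: urlopen(url).read()
    pages = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        pages.append(fetch(url))
    return pages

# Build the draft page URLs the same way the notebook does (no request is sent here)
years = ['2011', '2012', '2013']
draft_urls = ['https://www.basketball-reference.com/draft/NBA_' + year + '.html'
              for year in years]
print(draft_urls[0])
```

Passing the fetcher as an argument also makes the loop easy to try out offline with a stand-in function.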


In [12]:
link_list[:5]

['/players/i/irvinky01.html',
 '/players/w/willide02.html',
 '/players/t/thomptr01.html',
 '/players/k/knighbr03.html',
 '/players/w/walkeke02.html']

In [13]:
len(link_list)

533

Finally, we scrape each of these player pages to get data on the player's first NBA season and on his college history.
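The cell that follows searches for `Comment` nodes because, judging from how it reads `all_all_college_stats`, the college table is delivered inside an HTML comment rather than as regular markup. The snippet below reproduces that idea on a tiny hand-written HTML string (made up for illustration, not copied from the site).

```python
from bs4 import BeautifulSoup, Comment

# Hand-written stand-in for a player page; the real pages are far larger.
html = """
<div id="all_all_college_stats">
<!--
<table><tbody>
<tr><th data-stat="season">2010-11</th><td data-stat="g">11</td><td data-stat="pts_per_g">17.5</td></tr>
</tbody></table>
-->
</div>
"""

page = BeautifulSoup(html, 'html.parser')
wrapper = page.find('div', {'id': 'all_all_college_stats'})

# The comment's text is itself HTML, so it needs a second parse
comment = wrapper.find(string=lambda text: isinstance(text, Comment))
table = BeautifulSoup(comment, 'html.parser')

row = table.find('tbody').find('tr')
stats = {cell['data-stat']: cell.string for cell in row.find_all('td')}
print(row.find('th').string, stats)
```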

In [14]:
%%time
player_df_college = pd.DataFrame()
player_df_first_year = pd.DataFrame()
for link in link_list:
    html = urlopen('https://www.basketball-reference.com' + link)
    bs = BeautifulSoup(html, 'html.parser')
    # College stats: the table markup sits inside an HTML comment, so we pull
    # the comment text out and parse it as a second document
    if len(bs.find_all("div", {"id": "all_all_college_stats"})) > 0:
        data = bs.find_all("div", {"id": "all_all_college_stats"})[0].find_all(string=lambda text: isinstance(text, Comment))[0]
        bs2 = BeautifulSoup(data, "html.parser")
        # One row per college season
        for season in bs2.find('tbody').find_all('tr'):
            player_dict_college = {}
            player_dict_college['season'] = season.find('th').string
            for element in season.find_all('td'):
                if element.string is None:
                    # Empty cells are stored as 0
                    player_dict_college[element['data-stat']] = 0
                else:
                    # Whole numbers become int; everything else stays a string
                    if element.string.isdecimal():
                        player_dict_college[element['data-stat']] = int(element.string)
                    else:
                        player_dict_college[element['data-stat']] = element.string

            player_dict_college['name'] = bs.find('h1', itemprop ='name').span.string
            # Note: DataFrame.append was removed in pandas 2.0, so this needs an older pandas
            player_df_college = player_df_college.append(player_dict_college, ignore_index=True)

    # First NBA season: the first 'full_table' row of the career totals table
    if len(bs.find_all("table", {"id": "totals"})) > 0:
        player_dict_first_year = {}
        for element in bs.find("table", {"id": "totals"}).find_all(class_ = 'full_table')[0].find_all('td'):
            # Unlike the college table, empty cells are skipped here
            if element.string is not None:
                if element.string.isdecimal():
                    player_dict_first_year[element['data-stat']] = int(element.string)
                else:
                    player_dict_first_year[element['data-stat']] = element.string

        player_dict_first_year['name'] = bs.find('h1', itemprop ='name').span.string
        player_dict_first_year['first_season'] = bs.find("table", {"id": "totals"}).find_all(class_ = 'full_table')[0].find_all('th')[0].string
        player_df_first_year = player_df_first_year.append(player_dict_first_year, ignore_index=True)

Wall time: 7min 27s
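`DataFrame.append`, used in the loop above, was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, a common replacement is to collect the row dictionaries in a list and build the DataFrame once at the end, which also avoids copying the frame on every iteration. A minimal sketch with made-up rows (the values are placeholders, not scraped data):

```python
import pandas as pd

# Placeholder rows standing in for the dictionaries built while scraping
scraped_rows = [
    {'name': 'Player A', 'season': '2010-11', 'g': 11, 'pts': 192},
    {'name': 'Player B', 'season': '2009-10', 'g': 31},  # 'pts' missing on purpose
]

rows = []
for player_dict in scraped_rows:
    rows.append(player_dict)  # plain list append, no DataFrame copy

# Build the DataFrame once; keys missing from a row become NaN
player_df = pd.DataFrame(rows)
print(player_df.shape)
```

Column order and integer versus float dtypes can differ from what `append` produced, so the output of such a variant may not match the tables shown here exactly.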


Let's now merge the draft table with the college history and save the results to CSV files, so they are easy to access when we need them.

In [15]:
df_final = df.merge(player_df_college, left_on = 'Player', right_on = 'name')
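Joining on the player's name assumes that each name appears only once in the draft table and is spelled the same way on both sides. One way to guard the first assumption is the `validate` argument of `DataFrame.merge`, sketched here on made-up tables (illustrative values, not the real data):

```python
import pandas as pd

draft = pd.DataFrame({'Player': ['Player A', 'Player B'], 'Pk': [1, 2]})
college = pd.DataFrame({
    'name': ['Player A', 'Player B', 'Player B'],
    'season': ['2010-11', '2009-10', '2010-11'],
})

# One draft row may match many college seasons, but each name should
# appear only once on the draft side; otherwise pandas raises MergeError.
merged = draft.merge(college, left_on='Player', right_on='name',
                     validate='one_to_many')
print(len(merged))
```

Since the default merge is an inner join, players whose names do not match between the two tables are dropped silently, so comparing row counts before and after can also be informative.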

In [16]:
df_final.head()

| | draft_year | Pk | Tm | Player | College | Yrs | G | MP | PTS | Conference | ... | name | orb | pf | pts | pts_per_g | season | stl | tov | trb | trb_per_g |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011 | 1 | CLE | Kyrie Irving | Duke | 11.0 | 611.0 | 20804.0 | 14089.0 | ACC | ... | Kyrie Irving | 6.0 | 23.0 | 192.0 | 17.5 | 2010-11 | 16.0 | 27.0 | 37.0 | 3.4 |
| 1 | 2011 | 2 | MIN | Derrick Williams | Arizona | 7.0 | 428.0 | 8864.0 | 3809.0 | PAC-12 | ... | Derrick Williams | 68.0 | 78.0 | 486.0 | 15.7 | 2009-10 | 19.0 | 60.0 | 219.0 | 7.1 |
| 2 | 2011 | 2 | MIN | Derrick Williams | Arizona | 7.0 | 428.0 | 8864.0 | 3809.0 | PAC-12 | ... | Derrick Williams | 105.0 | 106.0 | 741.0 | 19.5 | 2010-11 | 37.0 | 100.0 | 314.0 | 8.3 |
| 3 | 2011 | 4 | CLE | Tristan Thompson | Texas | 11.0 | 730.0 | 19556.0 | 6592.0 | Big 12 | ... | Tristan Thompson | 138.0 | 100.0 | 471.0 | 13.1 | 2010-11 | 34.0 | 64.0 | 282.0 | 7.8 |
| 4 | 2011 | 8 | DET | Brandon Knight | Kentucky | 9.0 | 451.0 | 13202.0 | 6301.0 | SEC | ... | Brandon Knight | 25.0 | 83.0 | 657.0 | 17.3 | 2010-11 | 25.0 | 120.0 | 153.0 | 4.0 |


In [17]:
df_final.to_csv('draft_data_college_history.csv', index = False)

In [18]:
player_df_first_year.head()

| | age | ast | blk | drb | efg_pct | fg | fg2 | fg2_pct | fg2a | fg3 | ... | name | orb | pf | pos | pts | stl | team_id | tov | trb | trp_dbl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.0 | 275.0 | 20.0 | 147.0 | 0.517 | 350.0 | 277.0 | 0.491 | 564.0 | 73.0 | ... | Kyrie Irving | 44.0 | 110.0 | PG | 944.0 | 54.0 | CLE | 160.0 | 191.0 | 0.0 |
| 1 | 20.0 | 38.0 | 31.0 | 234.0 | 0.449 | 205.0 | 168.0 | 0.467 | 360.0 | 37.0 | ... | Derrick Williams | 77.0 | 95.0 | PF | 583.0 | 30.0 | MIN | 77.0 | 311.0 | |
| 2 | 20.0 | 27.0 | 62.0 | 202.0 | 0.439 | 194.0 | 194.0 | 0.441 | 440.0 | 0.0 | ... | Tristan Thompson | 187.0 | 134.0 | PF | 494.0 | 27.0 | CLE | 81.0 | 389.0 | |
| 3 | 20.0 | 251.0 | 10.0 | 178.0 | 0.483 | 319.0 | 214.0 | 0.434 | 493.0 | 105.0 | ... | Brandon Knight | 33.0 | 154.0 | PG | 847.0 | 49.0 | DET | 171.0 | 211.0 | 0.0 |
| 4 | 21.0 | 289.0 | 20.0 | 204.0 | 0.411 | 281.0 | 212.0 | 0.392 | 541.0 | 69.0 | ... | Kemba Walker | 30.0 | 79.0 | PG | 799.0 | 60.0 | CHA | 119.0 | 234.0 | 1.0 |


In [19]:
player_df_first_year.to_csv('draft_data_first_year_performance.csv', index = False)