# Data and Analysis Plan: College Basketball to NBA: Star Prediction
## Team-13

- Amaan Bhojani (bhojani.a@northeastern.edu)
- Jirawat Zhou (zhou.jir@northeastern.edu)
- Joshua Newstadt (newstadt.j@northeastern.edu)
- Jarrett Anderson (anderson.jar@northeastern.edu)


## Project Goal:
Our motivation for this project is to see whether we can predict which current college basketball players will have successful NBA careers. Even today, some college players who look like future stars end up being busts. We aim to shed some light on whose skills will translate to the next level.

## Overview:
We will scrape our data from two separate websites. 

#### Basketball Reference
The first site we will scrape is basketball-reference.com which will give us a [list of all Players](https://www.basketball-reference.com/leagues/NBA_2021_totals.html) for a given season which we can reference to find their [college stats](https://www.basketball-reference.com/players).

<img src="https://lh3.googleusercontent.com/5eDIHwyT_y_jIQPsINtZy1EzIt0prGIfR-vWGhHylJTOeai_-MBZhBw-8ucdvPPdm63z8ynS9NOCP2znIPPWm5Ae_XHsLoLZOcBv77Y_kXGZEjl4ZLOBiHo_zPneCTA6bT7pEh-uRmM=w2400?source=screenshot.guru" width=800px>

From the list of players we can obtain:
- name
- url to specific player page

<img src="https://lh3.googleusercontent.com/LvO1eGKEqjfSLe_OZeXCA4zjDx6-ifVXGq9YeXSAc7b_Qf-lee-QMWrM-3RZpS_h1a0u8u-BXKGzrPFbtciSqm-YnrVbUHtJLk6yb7yLUMSkaaZsSxDPey71r_qjr3TqyyZWiZ-7EWM=w2400" width=800px>

Upon visiting an individual player's page we can find their college stats.

<img src="https://lh3.googleusercontent.com/pk68MH39TLKlwvPJQTZnZPVHbiWC55S3LNfZrivqDA0RZQSDGYliGO-xOIfUpN6D4Er3Yz9a6B2-zgLVHbmA_XG4FukDrPaAjzzzcggQWBralbFjgkl7CEMersioOxbPrY-S4Z4ZplQ=w2400" width=675px>

This is the table we are targeting as it contains the:
- season
- age
- college
- total stats for the season

Some players will have played multiple college seasons while others played only one. To simplify our analysis, each player's college stats will be reduced to their college career averages across every category.
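As a rough illustration of that reduction, the sketch below averages each stat across a player's seasons using plain Python. The player names, the values, and the `career_averages` helper are hypothetical and are not part of the pipeline; the pipeline itself does this step with pandas.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-season rows; names and values are made up for illustration
seasons = [
    {"player_name": "Player A", "season": "2011-12", "pts_per_g": 15.6, "trb_per_g": 6.6},
    {"player_name": "Player A", "season": "2012-13", "pts_per_g": 16.5, "trb_per_g": 8.0},
    {"player_name": "Player B", "season": "2017-18", "pts_per_g": 27.4, "trb_per_g": 3.9},
]

def career_averages(rows, stat_keys):
    """Average each stat over all of a player's seasons (hypothetical helper)."""
    by_player = defaultdict(list)
    for row in rows:
        by_player[row["player_name"]].append(row)

    averages = {}
    for name, player_rows in by_player.items():
        averages[name] = {
            key: round(mean(r[key] for r in player_rows), 2) for key in stat_keys
        }
    return averages

averages = career_averages(seasons, ["pts_per_g", "trb_per_g"])
```

A single-season player keeps their one season's values, since the average of one number is that number.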


#### Fantasy Pros
The second site that we will scrape is fantasypros.com which will give us a [list of all Players including their Fantasy Points](https://www.fantasypros.com/nba/stats/overall.php) for a given season.

<img src="https://lh3.googleusercontent.com/kwNP7yD1RCmmBpr7ws4jNgE_AY1JV4TFnacpKohGnlt9bn-JlL5ujDX0aEWubOrXgDeP1wcfTuhaZ0QVuuAKJkeOkSe4Yud2Z-yszUzAfvOhWvsC_uVS4ahGFjO-p8A6-dxa5Lx9py4=w2400" width=800px>

This table is targeted as it provides each NBA player's total accrued Fantasy Points for a given season.

## Pipeline Overview
We will accomplish this task by creating various functions. 

#### Basketball Reference
To get each player's college career stats we will use seven functions:
- `get_nba_player_html(year)`
    - returns the raw HTML scraped from [basketball-reference.com](https://www.basketball-reference.com/leagues/NBA_2021_totals.html)
- `extract_nba_player(html_player)` 
    - returns a DataFrame of the scraped NBA players and their respective NBA stat page URLs
- `get_nba_collegestat(player_url)`
    - returns the raw HTML scraped from the NBA player's stat page
- `extract_nba_collegestat(html_stat)`
    - returns a dictionary of the NBA player's college stats by season
- `build_nba_college(df_player)`
    - returns a DataFrame of all the NBA players' college stats by season
- `get_multiple_years(df_college_stats)`
    - returns a GroupBy object that groups each multi-season NBA player with their individual seasons of college play
- `avg_college_stats(df_multiples)`
    - returns a DataFrame of each multi-season NBA player and their average college career stats
    
#### Fantasy Pros
To get each player's total fantasy points we will use two functions:
- `get_fantasy_stat(year)`
    - returns the raw HTML scraped from [fantasypros.com](https://www.fantasypros.com/nba/stats/overall.php)
- `extract_nba_stat(nba_html)`
    - returns a DataFrame of NBA players' total fantasy points and season stats

As well as four scripts:
- **Scrape List of Players and their College Stats:** use `get_nba_player_html(year)` and `extract_nba_player(html_player)` to get all the NBA players and their player page URLs, populating a DataFrame `df_player`. Then call `build_nba_college(df_player)`, which populates a DataFrame `df_college_stats`.
- **Clean the College Data:** create a Series `multiples` that flags the rows with a duplicated `player_name`. Then call `get_multiple_years(df_college_stats)`, which returns a GroupBy object `df_multiples`, and pass it to `avg_college_stats(df_multiples)` to populate a DataFrame `df_college_avg`. Next, use `multiples` to drop the multi-season rows from `df_college_stats` and save the reduced DataFrame as `df_college`. Finally, combine `df_college` and `df_college_avg` into one DataFrame `df_final_college`.
- **Scrape List of Players and their Fantasy Points:** use `get_fantasy_stat(year)` and `extract_nba_stat(nba_html)`, populating a DataFrame `df_nba_stat`.
- **Combine Players' Fantasy Points and their Average College Stats:** initialize `points_dict`, `fantasy_dict`, `missing_pts_list`, and `final_stats_list`. In a first loop, fill `points_dict` with each NBA player's fantasy points from `df_nba_stat`. In a second loop, fill `fantasy_dict` with each NBA player and their fantasy points from `points_dict` if that player is contained in `df_final_college`. Then use `set_index('player_name')` on `df_final_college` to set the players' names as the index values, populating a DataFrame `df_dropped`. In a third loop, iterate over `df_dropped` and drop any rows whose index is **not in** `points_dict`. Finally, in a fourth loop, create a Series `player_series` for each index in `df_dropped` and add the player's fantasy points from `fantasy_dict` to it. Append each updated `player_series` to `final_stats_list` and build a DataFrame `df_all_stats` from that list.
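
The combine step amounts to an inner join on player name: only players found in both sources are kept. A minimal sketch of that idea with plain dictionaries is shown below. The `combine` helper, the player names, and all numbers are hypothetical; the pipeline itself works on DataFrames.

```python
# Hypothetical inputs: names and numbers are made up for illustration
college_avg = {
    "Player A": {"pts_per_g": 16.05, "trb_per_g": 7.3},
    "Player B": {"pts_per_g": 27.4, "trb_per_g": 3.9},
    "Player C": {"pts_per_g": 9.1, "trb_per_g": 4.4},
}
fantasy_points = {"Player A": 904, "Player B": 1929, "Player D": 511}

def combine(college, points):
    """Keep only players present in both mappings and attach f_PTS."""
    combined = {}
    for name, stats in college.items():
        if name in points:
            row = dict(stats)            # copy so the input is not modified
            row["f_PTS"] = points[name]
            combined[name] = row
    return combined

all_stats = combine(college_avg, fantasy_points)

# College players with no fantasy entry, kept for inspection
unmatched = sorted(set(college_avg) - set(fantasy_points))
```

Keeping the list of unmatched names makes it easy to check whether a player was dropped because they have no NBA entry or because the two sites spell the name differently.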

## Pipeline

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np

### Basketball Reference

In [2]:
def get_nba_player_html(year):
    """ Web scraping https://www.basketball-reference.com/leagues/NBA_2021_totals.html
        to retrieve the list of NBA players for a season
        
        Args:
            year (int) : Represent year in yyyy format (e.g. 2021)
        
        Return:
            html_player (String) : raw HTML scraped from basketball-reference.com,
                                   or None if the page did not load
    """
    nba_player = f'https://www.basketball-reference.com/leagues/NBA_{year}_totals.html'
    html_player = requests.get(nba_player)
    status = html_player.status_code
    
    # Return the page text only if the page loaded correctly
    if status == 200:
        return html_player.text

In [3]:
def extract_nba_player(html_player):
    """ Extract the NBA players from the crawled webpage in order to retrieve stats
    
        Args:
            html_player (String) : Represent the crawled league webpage
            
        Return:
            df_player (DataFrame) : Represent list of players and corresponding stat URLs
    """
    base_url = 'https://www.basketball-reference.com'
    soup = BeautifulSoup(html_player)
    player_rows = []
    
    # Find every player cell that links to a player page
    for player in soup.find_all('td', {'data-stat': 'player'}):
        link = player.find_all('a')[0]
        
        # Record the player name and the URL of the player's stat page
        player_rows.append({'player_name' : player.text,
                            'url' : base_url + link.attrs['href']})
    
    # Build the DataFrame once; DataFrame.append was removed in pandas 2.0
    df_player = pd.DataFrame(player_rows, columns=['player_name', 'url'])
    
    # Drop any duplicates and retain the first entry
    df_player.drop_duplicates(subset='player_name', keep='first', inplace=True)
        
    return df_player

In [4]:
def get_nba_collegestat(player_url):
    """ Web scraping https://www.basketball-reference.com/players to retrieve a
        player's college statistics
        
        Args:
            player_url (String) : Represent URL of the player's stat page
            
        Return:
            html_stat (String) : Represent the player's stat page in HTML,
                                 or None if the page did not load
    """
    html_stat = requests.get(player_url)
    status = html_stat.status_code
    
    # Return the page text only if the page loaded correctly
    if status == 200:
        return html_stat.text

In [5]:
def extract_nba_collegestat(html_stat):
    """ Extract the college stats from an NBA player's page
    
        Args:
            html_stat (String) : Represent the crawled player stat webpage
            
        Return:
            dict_stat (Dictionary) : Maps each stat name to a list of values,
                                     one entry per college season; empty if
                                     the page has no college stats section
    """
    soup = BeautifulSoup(html_stat)
    
    # The college stats table is wrapped in an HTML comment inside this div
    stat_wcomment = soup.find_all('div', {'id':'all_all_college_stats'})
    
    # No college stats section: return an empty dict
    if not stat_wcomment:
        return {}
    
    # Strip the comment markers so the table can be parsed
    str_stat = str(stat_wcomment[0]).replace('<!--','').replace('-->','')

    soup = BeautifulSoup(str_stat)
    stat_body = soup.find('tbody')

    # Initialize Dict
    dict_stat = {}

    # Add the season label from each row header
    for year in stat_body.find_all('th'):
        dict_stat.setdefault(year.get('data-stat'), []).append(year.text)

    # Add the stat values from each data cell
    for stat in stat_body.find_all('td'):
        dict_stat.setdefault(stat.get('data-stat'), []).append(stat.text)

    return dict_stat
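
To see why the comment markers have to be stripped, here is a small standard-library sketch. It assumes a hand-written HTML fragment shaped like the `all_all_college_stats` div; the fragment and the `CellCollector` class are hypothetical and only mimic the structure the function above expects.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collects (data-stat, text) pairs for every <th> or <td> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._current = None   # data-stat of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._current = dict(attrs).get("data-stat")

    def handle_data(self, data):
        if self._current is not None:
            self.cells.append((self._current, data.strip()))
            self._current = None

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._current = None

def collect_cells(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells

# Hypothetical fragment: the table is hidden inside an HTML comment
wrapped = (
    '<div id="all_all_college_stats"><!--<table><tbody><tr>'
    '<th data-stat="season">2017-18</th><td data-stat="pts_per_g">27.4</td>'
    '</tr></tbody></table>--></div>'
)

hidden = collect_cells(wrapped)                      # parser sees only a comment
unwrapped = wrapped.replace('<!--', '').replace('-->', '')
visible = collect_cells(unwrapped)                   # cells are now real tags
```

With the markers in place the parser treats the whole table as comment text, so no cells are found; after removing them the same cells appear as normal elements.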

In [6]:
def build_nba_college(df_player):
    """ Build the NBA college Data given list of player and URL
    
        Args:
            df_player (DataFrame) : Represent list of Player and corresponding stat URL
            
        Return:
            df_nba_collegestat (DataFrame) : Represent the NBA Player college stat
    """
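    player_frames = []
    
    for idx, row in df_player.iterrows():
        player = row['player_name']
        url = row['url']
        
        # Download the player's page, then pull out the college stats table
        html_stat = get_nba_collegestat(url)
        dict_player_stat = extract_nba_collegestat(html_stat)
        
        df_player_temp = pd.DataFrame(dict_player_stat)
        df_player_temp['player_name'] = player
        
        player_frames.append(df_player_temp)
    
    # Nothing scraped: return an empty DataFrame
    if not player_frames:
        return pd.DataFrame()
    
    # Concatenate once at the end; DataFrame.append was removed in pandas 2.0
    df_nba_collegestat = pd.concat(player_frames)
        
    return df_nba_collegestat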

In [7]:
def get_multiple_years(df_college_stats):
    """ Groups the seasons of players who played multiple college years

        Args:
            df_college_stats (pd.DataFrame): all college player statistics
        
        Returns:
            df_multiples (DataFrameGroupBy): all statistics of multiple year players
                                            grouped by player
    """
    # True for every row whose player_name appears more than once
    is_multi_year = df_college_stats.duplicated(['player_name'],keep=False)
    df_multiples = df_college_stats[is_multi_year].groupby('player_name',sort=False)
    
    return df_multiples

In [8]:
def avg_college_stats(df_multiples):
    """ Gets average of college career stats
    
    Args:
        df_multiples (pd.GroupBy): all statistics of multiple year players
                                            grouped by player
    
    Returns:
        df_college_avg (pd.DataFrame): average of each multi-year player's college career
    """
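    all_college_stats = []
    
    for player in df_multiples:
        
        name, stats = player[0], player[1]
        
        # just the columns of averageable stats: skips season, age and college_id
        # at the front and player_name at the end. Taken from the group itself
        # rather than from the global df_college_stats.
        mean_column_list = list(stats.columns)[3:-1]
        
        stats_to_avg = stats.loc[:,mean_column_list]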
        
        stats_to_avg.replace('',0.0,inplace=True)
        
        for column in stats_to_avg.columns:
            for idx in stats_to_avg[column].index:  
                stats_to_avg[column][idx] = float(stats_to_avg[column][idx])
        
        avg_stats = round(stats_to_avg.mean(axis=0), 2)

        career_season = stats['season'].iloc[0][:4]+'-'+stats['season'].iloc[-1][:4]
        
        avg_stats['season'] = career_season
        avg_stats['player_name'] = name
        
        all_college_stats.append(avg_stats)
        
    df_college_avg = pd.DataFrame(all_college_stats)
    return(df_college_avg)
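
The function above turns blank cells into `0.0` before averaging. The sketch below, with hypothetical values and helper names, shows how that choice compares with skipping blanks when a percentage is missing for one season; it is an illustration of the trade-off, not a change to the pipeline.

```python
from statistics import mean

def mean_zero_filled(values):
    """Average after turning blank strings into 0.0, as the function above does."""
    return round(mean(float(v) if v != "" else 0.0 for v in values), 2)

def mean_skipping_blanks(values):
    """Average over the non-blank entries only; None if every entry is blank."""
    numbers = [float(v) for v in values if v != ""]
    return round(mean(numbers), 2) if numbers else None

# Hypothetical three-season 3-point percentage with one blank season
fg3_pct_by_season = ["0.419", "", "0.36"]

zero_filled = mean_zero_filled(fg3_pct_by_season)
skipped = mean_skipping_blanks(fg3_pct_by_season)
```

Zero-filling pulls the average down for every blank season, while skipping blanks averages only the seasons that have a value. Which behavior is preferable depends on what a blank means for a given stat.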

#### Scrape List of Players and their College Stats:

In [9]:
html_player = get_nba_player_html(2021)
df_player = extract_nba_player(html_player)

In [10]:
df_player.head()

Unnamed: 0,player_name,url
0,Precious Achiuwa,https://www.basketball-reference.com/players/a...
1,Jaylen Adams,https://www.basketball-reference.com/players/a...
2,Steven Adams,https://www.basketball-reference.com/players/a...
3,Bam Adebayo,https://www.basketball-reference.com/players/a...
4,LaMarcus Aldridge,https://www.basketball-reference.com/players/a...


In [11]:
df_college_stats = build_nba_college(df_player)

In [12]:
df_college_stats.tail()

Unnamed: 0,season,age,college_id,g,mp,fg,fga,fg3,fg3a,ft,...,pf,pts,fg_pct,fg3_pct,ft_pct,mp_per_g,pts_per_g,trb_per_g,ast_per_g,player_name
1,2014-15,22,UTAH,35,1165,165,324,26,73,153,...,49.0,509,0.509,0.356,0.836,33.3,14.5,4.9,5.1,Delon Wright
0,2006-07,18,GATECH,31,917,177,370,39,93,52,...,,445,0.478,0.419,0.743,29.6,14.4,4.9,2.0,Thaddeus Young
0,2017-18,19,OKLAHOMA,32,1133,261,618,118,328,236,...,57.0,876,0.422,0.36,0.861,35.4,27.4,3.9,8.7,Trae Young
0,2011-12,19,INDIANA,36,1025,200,321,0,0,163,...,97.0,563,0.623,,0.755,28.5,15.6,6.6,1.3,Cody Zeller
1,2012-13,20,INDIANA,36,1062,199,353,0,2,196,...,80.0,594,0.564,0.0,0.757,29.5,16.5,8.0,1.3,Cody Zeller


#### Clean the College Data:

In [13]:
df_multiples = get_multiple_years(df_college_stats)
df_multiples

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000011308B511C0>

In [14]:
multiples = df_college_stats.duplicated(['player_name'],keep=False)

df_multiples = get_multiple_years(df_college_stats)

df_college_avg = avg_college_stats(df_multiples)

df_college = df_college_stats[~multiples].drop(['college_id','age'],axis=1)

df_final_college = pd.concat([df_college, df_college_avg])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  stats_to_avg[column][idx] = float(stats_to_avg[column][idx])


In [15]:
df_final_college.head()

Unnamed: 0,season,g,mp,fg,fga,fg3,fg3a,ft,fta,orb,...,pf,pts,fg_pct,fg3_pct,ft_pct,mp_per_g,pts_per_g,trb_per_g,ast_per_g,player_name
0,2019-20,31,943,182,369,13,40,112,187,93,...,73,489,0.493,0.325,0.599,30.4,15.8,10.8,1.0,Precious Achiuwa
0,2012-13,32,749,100,175,0,0,31,70,90,...,52,231,0.571,,0.443,23.4,7.2,6.3,0.6,Steven Adams
0,2016-17,38,1145,170,284,0,0,154,236,118,...,99,494,0.599,,0.653,30.1,13.0,8.0,0.8,Bam Adebayo
0,2016-17,33,1061,179,316,0,7,84,149,100,...,68,442,0.566,0.0,0.564,32.2,13.4,8.4,0.8,Jarrett Allen
0,2017-18,29,438,58,101,2,15,33,64,24,...,66,151,0.574,0.133,0.516,15.1,5.2,2.9,0.4,Kostas Antetokounmpo


### Fantasy Pros

In [16]:
def get_fantasy_stat(year):
    """ Get all NBA Player Stat from https://www.fantasypros.com/nba/stats/overall.php
    
        Args:
            year (int) : Represent year in yyyy which NBA data will be extracted 
            
        Return:
            nba_html (string) : Represent the NBA stat in HTML Representation
    """
    nba_url = f'https://www.fantasypros.com/nba/stats/overall.php?year={year}'
    nba_html = requests.get(nba_url)
    status = nba_html.status_code
    
    # Return the page text only if the page loaded correctly
    if status == 200:
        return nba_html.text

In [17]:
def extract_nba_stat(nba_html):
    """ Extract NBA Stat from the fantasypros website
    
        Args:
            nba_html (string) : Represent the NBA stat in HTML Representation
            
        Return:
            df_nba_stat (DataFrame) : Represent the dataframe contain NBA stat
    
    """
    dict_nba_stat = {'Player' : [], 
                    'PTS' : [],
                    'REB' : [],
                    'AST' : [],
                    'BLK' : [],
                    'STL' : [],
                    'FG%' : [],
                    'FT%' : [],
                    '3PM' : [],
                    'TO' : [],
                    'GP' : [],
                    'MIN' : [],
                    'FTM' : [],
                    '2PM' :[],
                    'A/TO' : [],
                    'PF':[]}
    columns = list(dict_nba_stat)
    soup = BeautifulSoup(nba_html)
    
    # Collect the cell text of every table row
    rows = []
    for row in soup.find('tbody').find_all('tr'):
        rows.append([i.text for i in row.find_all('td')])
    
    # Build the DataFrame once; DataFrame.append was removed in pandas 2.0
    df_nba_stat = pd.DataFrame(rows, columns=columns)
        
    return df_nba_stat

#### Scrape List of Players and their Fantasy Points:

In [18]:
nba_html = get_fantasy_stat(2021)
df_nba_stat = extract_nba_stat(nba_html)
df_nba_stat.head()

Unnamed: 0,Player,PTS,REB,AST,BLK,STL,FG%,FT%,3PM,TO,GP,MIN,FTM,2PM,A/TO,PF
0,"DeMar DeRozan (CHI - SF,PF,SG) DTD",1937,372,350,24,61,0.504,0.871,45,166,70,2526,472,665,2.11,161
1,Trae Young (ATL - PG) DTD,1929,261,666,7,71,0.458,0.902,208,276,69,2414,443,431,2.41,114
2,"Jayson Tatum (BOS - SF,PF) DTD",1923,573,304,45,67,0.451,0.86,219,203,71,2569,376,445,1.5,164
3,"Joel Embiid (PHI - PF,C) DTD",1824,692,257,89,71,0.489,0.818,81,183,61,2039,587,497,1.4,165
4,"Nikola Jokic (DEN - PF,C) DTD",1815,935,551,60,97,0.577,0.811,95,257,69,2288,348,591,2.14,178


#### Combine Players' Fantasy Points and their Average College Stats:

In [19]:
points_dict = {}
for idx in df_nba_stat.index:
    name = df_nba_stat.loc[idx,'Player'].split('(')[0].strip()
    points = df_nba_stat.loc[idx,'PTS']
    
    points_dict[name] = points
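
The `Player` cell packs the name, team, positions, and status into one string, and the loop above keeps only the part before the first parenthesis. A hedged sketch of a fuller parse is shown below; the regular expression and the `parse_player` helper are assumptions based on the rows shown in the table output, not on any documented format.

```python
import re

# Assumed cell shape: "Name (TEAM - POS1,POS2) STATUS"
PLAYER_PATTERN = re.compile(
    r"^(?P<name>[^(]+?)\s*\((?P<team>[^)-]+?)\s*-\s*(?P<positions>[^)]*)\)"
)

def parse_player(cell_text):
    """Split a Player cell into (name, team, positions); hypothetical helper."""
    text = cell_text.strip()
    match = PLAYER_PATTERN.match(text)
    if match is None:
        # No parenthesised team/position part: treat the whole cell as the name
        return text, None, []
    positions = [p.strip() for p in match.group("positions").split(",") if p.strip()]
    return match.group("name"), match.group("team"), positions

name, team, positions = parse_player("DeMar DeRozan (CHI - SF,PF,SG) DTD")
```

For the purposes of `points_dict` only the name is needed, so the simpler `split('(')[0].strip()` is sufficient; the fuller parse is useful only if team or position is wanted later.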

In [20]:
fantasy_dict = {}
for name in points_dict.keys():
    if name in list(df_final_college['player_name']):
        fantasy_dict[name] = points_dict[name]
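
The match above requires the two sites to spell a player's name identically. If they differ in accents, punctuation, or spacing, the player is silently dropped. The sketch below shows one possible normalization to apply to both sides before comparing; `normalize_name` is a hypothetical helper, and whether the sites actually differ for a given player is an assumption to verify against the data.

```python
import unicodedata

def normalize_name(name):
    """Lowercase a name and remove accents, periods, apostrophes, extra spaces."""
    # NFKD splits accented letters into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = without_marks.replace(".", "").replace("'", "")
    return " ".join(cleaned.lower().split())

same_player = normalize_name("Nikola Jokić") == normalize_name("Nikola Jokic")
```

Normalizing can also merge two different players who share a normalized name, so any collisions should be checked by hand.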

In [21]:
missing_pts_list = []
df_dropped = df_final_college.set_index('player_name')

for name in df_dropped.index:
    if name not in list(points_dict.keys()):
        df_dropped = df_dropped.drop(name,axis=0)
        
df_dropped.head()

player_name,season,g,mp,fg,fga,fg3,fg3a,ft,fta,orb,...,tov,pf,pts,fg_pct,fg3_pct,ft_pct,mp_per_g,pts_per_g,trb_per_g,ast_per_g
Precious Achiuwa,2019-20,31,943,182,369,13,40,112,187,93.0,...,87,73,489,0.493,0.325,0.599,30.4,15.8,10.8,1.0
Steven Adams,2012-13,32,749,100,175,0,0,31,70,90.0,...,35,52,231,0.571,,0.443,23.4,7.2,6.3,0.6
Bam Adebayo,2016-17,38,1145,170,284,0,0,154,236,118.0,...,64,99,494,0.599,,0.653,30.1,13.0,8.0,0.8
Jarrett Allen,2016-17,33,1061,179,316,0,7,84,149,100.0,...,84,68,442,0.566,0.0,0.564,32.2,13.4,8.4,0.8
Carmelo Anthony,2002-03,35,1274,277,612,56,166,168,238,,...,77,77,778,0.453,0.337,0.706,36.4,22.2,10.0,2.2


In [22]:
final_stats_list = []
for name in df_dropped.index:
    player_series = df_dropped.loc[name,:]
    player_series['f_PTS'] = fantasy_dict[name]
    final_stats_list.append(player_series)

In [23]:
df_all_stats = pd.DataFrame(final_stats_list)
df_all_stats.replace('','NaN').to_csv('nba_player_stats.csv')

In [24]:
df_all_stats.head()

Unnamed: 0,season,g,mp,fg,fga,fg3,fg3a,ft,fta,orb,...,pf,pts,fg_pct,fg3_pct,ft_pct,mp_per_g,pts_per_g,trb_per_g,ast_per_g,f_PTS
Precious Achiuwa,2019-20,31,943,182,369,13,40,112,187,93.0,...,73,489,0.493,0.325,0.599,30.4,15.8,10.8,1.0,589
Steven Adams,2012-13,32,749,100,175,0,0,31,70,90.0,...,52,231,0.571,,0.443,23.4,7.2,6.3,0.6,511
Bam Adebayo,2016-17,38,1145,170,284,0,0,154,236,118.0,...,99,494,0.599,,0.653,30.1,13.0,8.0,0.8,973
Jarrett Allen,2016-17,33,1061,179,316,0,7,84,149,100.0,...,68,442,0.566,0.0,0.564,32.2,13.4,8.4,0.8,904
Carmelo Anthony,2002-03,35,1274,277,612,56,166,168,238,,...,77,778,0.453,0.337,0.706,36.4,22.2,10.0,2.2,876


## Visualizations
#### Current NBA Players and Their College Stats

<img src="https://lh3.googleusercontent.com/uzzGQLVkKpgItQLyAUFSJSx2E1aGN0u5KEAlM9F044JgsbTUgUI3ZE-xeqXsmgtdNDOJRW16pvC3NBE36Syg3y32Hl-G3Pb9pWADPt2Lz975ovS9sjMfApX8RccfU_k6_lFNRm9lQKY=w2400" width=800px>

The graph above shows a sample of current NBA players and the three main stats tracked in college: points, rebounds, and assists (all per game).

#### Current Season NBA Performance per Draft Class
<img src="https://lh3.googleusercontent.com/KKuTl7txziLdD--A7sgcC2RQVw7Nk5vPhLCTP9sAyL7r6XqUmuO0WCPliWVEi4efcqBGPoumVV764GFpWkBxLQGwM07iYMn30r22D0Dv6nZk7wD2AkH0TfLm6fcAwfUHx1KCk1vtmcc=w2400" width=800px>

The graph above shows how each draft class is performing as a whole this NBA season. Performance is measured as the average of the season's total fantasy points across the players in each class.



## Analysis Plan