# Week 2: Data Extraction

### By: Calvin Chen and Matt Hashimoto

Hey everyone, and welcome to Week 2 of the `Balling with Data` project! We're excited to get started with the project, so let's get underway! First, a table of contents about what we'll be covering in this notebook today.

In [2]:
# Standard imports
# If any of these don't work, try doing `pip install _____`, or try looking up the error message.
import numpy as np
import pandas as pd
import json
import time
import os.path
from os import path
import math
import datetime
import unidecode
import requests
from bs4 import BeautifulSoup

# Table of Contents
* [Introduction to web-scraping](#section1)
* [What is `sportsreference`?](#section2)
* [Let's get our data!](#section3)
    * [Potentially Useful Classes](#section3a)
    * [Important Things to Know](#section3b)
    * [Sandbox Area](#section3c)

<a id='section1'></a>
# Introduction to Web-Scraping!

Now that we've discussed the project objectives and the kind of data we plan on getting, we can look into different methods of extracting this data from the internet. There are a couple of ways we could go about doing this:

1. Web-scraping
2. API endpoint/Package

Between these two methods, the main difference is how much someone has prepared the data for us beforehand. In many starter data science projects, you'll be able to find the data you need from different free online sources/APIs, making it easier for you to get started. However, on some occasions you won't be able to find any reliable database/data source that has all the components of the data you're looking for. When this happens, you need to be able to find the data yourself. **How would we go about doing that? Let's try web-scraping for [Stephen Curry college stats](https://www.sports-reference.com/cbb/players/stephen-curry-1.html).**

In [None]:
steph_url = 'https://www.sports-reference.com/cbb/players/stephen-curry-1.html'
req = requests.get(steph_url) # This will make a request to steph_url for us!

In [None]:
# Now, we sift through the request's content with an HTML parser.
soup = BeautifulSoup(req.content, 'html.parser')
soup.prettify()

In [None]:
# Now, we can use the .find method for BeautifulSoup objects to find the data we need from Steph Curry's stats.
# We've done the following below for you because we won't be going too in-depth into this for the project, but
# it's nice/important to know how to do.
table = soup.find('table', {'id': 'players_per_game'})
stats = table.find_all('td')
# This assumes each table row holds 28 <td> cells, so we split the flat list of cells into rows of 28.
row_stats = [stats[i:i+28] for i in range(0, len(stats), 28)]
# The second-to-last row should be the latest yearly averages for the player (right before career stats).
last_year = ['Steph Curry'] + [stat.get_text() for stat in row_stats[-2]]
last_year = np.reshape(np.array(last_year), (-1, 29))
last_year
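
The row-splitting step in the cell above can be tried on toy data without touching the website. This is only a sketch: the cell values and the row width of 3 are made up for illustration (the real table uses 28 cells per row), and `chunk_rows` is a hypothetical helper name.

```python
# Made-up stand-in for the flat list of cell texts pulled out of the table.
flat_cells = [
    "season-1", "10", "1.0",
    "season-2", "20", "2.0",
    "season-3", "30", "3.0",
    "career", "60", "2.0",   # the last row plays the role of the career summary
]
ROW_WIDTH = 3  # made up for this example; the notebook's table uses 28


def chunk_rows(cells, width):
    """Split a flat list into consecutive rows of `width` items."""
    return [cells[i:i + width] for i in range(0, len(cells), width)]


rows = chunk_rows(flat_cells, ROW_WIDTH)
final_season = rows[-2]  # the row just before the career summary
print(final_season)
```

If the number of cells per row ever changes on the site, the slices will drift out of alignment, so it's worth checking that `len(flat_cells)` is a multiple of the row width before trusting `rows[-2]`.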


Now, we've gotten Steph Curry's final year stats at Davidson, but what do the different stats mean? Let's find their headers so we can make sense of these numbers.

In [None]:
# Find column headers
cols = table.find_all('th')[1:29] # Column headers
col_headers = ['Name'] + [col.get_text() for col in cols]
col_headers

In [None]:
# Now, let's make a pandas dataframe from this data.
curry = pd.DataFrame(data=last_year, columns=np.array(col_headers))
curry
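
One thing to keep in mind: `get_text()` returns strings, so every value in the dataframe above is text, not a number. Here's a small sketch of pairing headers with values and converting what can be converted. The headers and values below are made up, and `to_number` is a hypothetical helper, not part of any library.

```python
def to_number(text):
    """Return float(text) when possible, otherwise the original string."""
    try:
        return float(text)
    except ValueError:
        return text


headers = ["Name", "Season", "G", "PTS"]          # made-up subset of columns
values = ["Example Player", "2008-09", "34", ""]  # made-up cell texts, one left blank

# Pair each header with its (converted) value
record = {header: to_number(value) for header, value in zip(headers, values)}
print(record)
```

Blank cells and labels such as the season stay as strings here, so you can decide later how to treat them.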

Congrats! You've successfully scraped together a dataframe for us to analyze about Steph Curry's basketball stats in his final year at Davidson College. The same logic can easily be applied to a variety of different NCAA players, and may still be quite useful when we come across **international players**. Unfortunately, for the scope of this project, we won't get into analyzing international players' stats, but you can imagine it'd be a similar process to how we analyzed Steph Curry above.

Now, let's get into a free sports API that'll abstract all this scraping away for all the different types of websites we might encounter, and allow us to access all the different player data in a friendly format. **Let's get into what `sportsreference` can do for us!**

<a id='section2'></a>
# What is `sportsreference`?

Now that we've seen how web-scraping works fundamentally, let's work with an API that will abstract that all away for us and give us the ability to easily query for different players' stats we're interested in!

**Let's visit the [sportsreference documentation](https://sportsreference.readthedocs.io/en/stable/).**

Read through the documentation and get a handle on how the API is structured. Afterwards, we'll get into a couple of different exercises, and then leave the rest for you guys to handle! Feel free to make as many cells as you'd like to help with your development!

Things that might help with running Jupyter Notebooks: Go to Help -> Keyboard Shortcuts (they will help immensely in the long term with saving time!)

Alternatively, check out this [link](https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330) for other shortcuts!

In [3]:
# Modules from sportsreference.ncaab for college basketball
from sportsreference.ncaab.boxscore import Boxscore as NCAAB_Boxscore
from sportsreference.ncaab.conferences import Conferences as NCAAB_Conferences
from sportsreference.ncaab.rankings import Rankings as NCAAB_Rankings
from sportsreference.ncaab.roster import Player as NCAAB_Player
from sportsreference.ncaab.roster import Roster as NCAAB_Roster
from sportsreference.ncaab.schedule import Schedule as NCAAB_Schedule
from sportsreference.ncaab.teams import Teams as NCAAB_Teams

# Modules from sportsreference.nba for NBA basketball
from sportsreference.nba.boxscore import Boxscore as NBA_Boxscore
from sportsreference.nba.roster import Player as NBA_Player
from sportsreference.nba.roster import Roster as NBA_Roster
from sportsreference.nba.schedule import Schedule as NBA_Schedule
from sportsreference.nba.teams import Teams as NBA_Teams

If you're unsure of what different attributes an object has, feel free to take a look at its `__dict__` attribute! This is a great way to remove an abstraction barrier and see what you can really mess with!

**Example:**

In [None]:
curry = NBA_Player('curryst01')

In [None]:
curry.__dict__
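
To see what `__dict__` does and doesn't show, here's a sketch on a toy class. `ToyPlayer` is a hypothetical stand-in we made up for illustration; it is not the real `sportsreference` class, and the real one may store things differently.

```python
class ToyPlayer:
    """Hypothetical stand-in for a player object; not the real sportsreference class."""

    def __init__(self, player_id, name):
        # Values stored on the instance show up in __dict__
        self._player_id = player_id
        self._name = name

    @property
    def name(self):
        # Properties live on the class, so they do NOT show up in the instance __dict__
        return self._name


player = ToyPlayer("toy-player-1", "Toy Player")

instance_attrs = vars(player)  # vars(obj) returns the same mapping as obj.__dict__
public_names = [n for n in dir(player) if not n.startswith("_")]

print(instance_attrs)
print(public_names)
```

So if an attribute you expect is missing from `__dict__`, try `dir()` as well, since it also lists properties and methods defined on the class.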

**Exercise 1:** Find all the different teams' abbreviations in the NBA in 2011. (Note `NOH` and `NJN` when you query. Look them up in a quick Google search-- do they still exist today?)

In [6]:
abbrev = []
for team in NBA_Teams(2011):
    abbrev.append(team.abbreviation)

In [5]:
abbrev

['DEN',
 'NYK',
 'HOU',
 'PHO',
 'OKC',
 'SAS',
 'GSW',
 'MIA',
 'LAL',
 'MIN',
 'DAL',
 'MEM',
 'IND',
 'UTA',
 'SAC',
 'ORL',
 'TOR',
 'PHI',
 'LAC',
 'CHI',
 'WAS',
 'DET',
 'BOS',
 'POR',
 'CLE',
 'ATL',
 'NOH',
 'NJN',
 'CHA',
 'MIL']

**Exercise 2:** Get all the unique players' names that played for the Golden State Warriors in the past 3 years.

In [22]:
years = [2018, 2019, 2020]
player_names = set()
for year in years:
    players = NBA_Roster('GSW', year).players
    player_names.update(player.name for player in players)

In [23]:
player_names

{'Alen Smailagić',
 'Alfonzo McKinnie',
 'Andre Iguodala',
 'Andrew Bogut',
 'Andrew Wiggins',
 'Chris Boucher',
 'Damian Jones',
 'Damion Lee',
 'David West',
 'DeMarcus Cousins',
 'Dragan Bender',
 'Draymond Green',
 'Eric Paschall',
 'JaVale McGee',
 'Jacob Evans',
 'Jeremy Pargo',
 'Jonas Jerebko',
 'Jordan Bell',
 'Jordan Poole',
 'Juan Toscano-Anderson',
 'Kevin Durant',
 'Kevon Looney',
 'Klay Thompson',
 'Ky Bowman',
 'Marcus Derrickson',
 'Marquese Chriss',
 'Mychal Mulder',
 'Nick Young',
 'Omri Casspi',
 'Patrick McCaw',
 'Quinn Cook',
 'Shaun Livingston',
 'Stephen Curry',
 'Zach Norvell',
 'Zaza Pachulia'}

**Exercise 3:** Return all the **unique player objects** that played for Cal Basketball and UCLA from 2014-2015 to 2017-2018.

In [29]:
years = [2015, 2016, 2017, 2018]
players = set()
for year in years:
    players.update(NCAAB_Roster('california', year=year).players)
    players.update(NCAAB_Roster('ucla', year=year).players)

In [33]:
players

{<sportsreference.ncaab.roster.Player at 0x1167d0fd0>,
 <sportsreference.ncaab.roster.Player at 0x1167d1fd0>,
 <sportsreference.ncaab.roster.Player at 0x1167d23c8>,
 <sportsreference.ncaab.roster.Player at 0x1167d8f28>,
 <sportsreference.ncaab.roster.Player at 0x1167e0198>,
 <sportsreference.ncaab.roster.Player at 0x1167e60b8>,
 <sportsreference.ncaab.roster.Player at 0x1167f0080>,
 <sportsreference.ncaab.roster.Player at 0x1167f1f98>,
 <sportsreference.ncaab.roster.Player at 0x1167f5f98>,
 <sportsreference.ncaab.roster.Player at 0x1167fd278>,
 <sportsreference.ncaab.roster.Player at 0x1167fe438>,
 <sportsreference.ncaab.roster.Player at 0x116801f98>,
 <sportsreference.ncaab.roster.Player at 0x116808668>,
 <sportsreference.ncaab.roster.Player at 0x11680d1d0>,
 <sportsreference.ncaab.roster.Player at 0x119fe0fd0>,
 <sportsreference.ncaab.roster.Player at 0x119fe2f98>,
 <sportsreference.ncaab.roster.Player at 0x119fe5dd8>,
 <sportsreference.ncaab.roster.Player at 0x119feefd0>,
 <sportsre

Nice! Now that you've been able to get a feel for how the packages work, let's get into the data problem at hand.
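
One caveat about Exercise 3: a `set` of objects only removes duplicates if the objects know how to compare themselves. If a class defines neither `__eq__` nor `__hash__`, Python compares by identity, so two objects built for the same returning player count as different. We're assuming here that this applies to the roster objects (the default-looking output above hints at it, but we haven't verified it). The sketch below uses a made-up `ToyPlayer` class and made-up ids to show the effect and one way around it.

```python
class ToyPlayer:
    """Hypothetical stand-in; defines no __eq__/__hash__, so sets compare by identity."""

    def __init__(self, player_id):
        self.player_id = player_id


# Two roster pulls that each build a fresh object for the same returning player.
roster_2015 = [ToyPlayer("toy-guard-1"), ToyPlayer("toy-center-1")]
roster_2016 = [ToyPlayer("toy-guard-1"), ToyPlayer("toy-forward-1")]

# Deduplicating by identity keeps both copies of toy-guard-1
by_identity = set(roster_2015) | set(roster_2016)

# Deduplicating by id keeps the first object seen for each player_id
unique_by_id = {}
for player in roster_2015 + roster_2016:
    unique_by_id.setdefault(player.player_id, player)

print(len(by_identity), len(unique_by_id))
```

Keying on the id keeps one object per player while still giving you the full objects through `unique_by_id.values()`.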

<a id='section3'></a>
# Let's Get Our Data!

Now that you've been able to tinker around with a little bit of the package, try and figure out how you might be able to get the data we need for the project! We've provided the following classes below to help with what we're trying to find, but tinker around and see what kind of things you come across!

To reiterate our project objective, and in turn, what we need from our data, we want to:

**Predict the 2019-2020 NBA Rookie statlines and compare those to their current statlines, given the past 20 years' worth of NBA rookie + NCAA basketball data.**

<a id='section3b'></a>
## Important Things To Know

1. The last digit on the `player_id` tag relates to which number instance they are of that name. For example, `stephen-curry-2` would be the second player with the same name `Stephen Curry`. This can get incredibly annoying when trying to translate player data from the NBA to the NCAA, as there are a lot more players (and more possible name collisions) in the NCAA than in the NBA.

2. Datetime objects are comparable. Let's see how this ties in with what we know above about the new classes. (Hint: Can a player play in NCAA basketball after playing in the NBA?)

In [None]:
year_2009 = datetime.datetime.strptime('2009', '%Y').date()
year_2009

In [None]:
year_2008 = datetime.datetime.strptime('2008', '%Y').date()
year_2008

In [None]:
year_2009 > year_2008
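
Here's a sketch of how that comparison could answer the hint. We're assuming season labels look like `'2008-09'` (a four-digit starting year first); `season_start` and `plausible_match` are hypothetical helper names, not part of the package.

```python
import datetime


def season_start(season):
    """Turn a season label such as '2008-09' into a date for its starting year."""
    return datetime.date(int(season[:4]), 1, 1)


def plausible_match(last_college_season, first_nba_season):
    """True when the last college season starts before the first NBA season."""
    return season_start(last_college_season) < season_start(first_nba_season)


print(plausible_match("2008-09", "2009-10"))
print(plausible_match("2012-13", "2009-10"))
```

A `False` here suggests the college profile belongs to a different person with the same name, since nobody goes back to the NCAA after playing in the NBA.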

3. It may be easier for you to start from the NBA players and find their respective NCAA stats than the other way around (there are fewer NBA players than NCAA players, so potentially fewer queries to make to find all the data).

4. Take a look at what happens when you try to query into the NCAAB_Player class with an invalid `player_id` and see how you can use this to your advantage!

In [None]:
NCAAB_Player('lebron-james-1') # LeBron never went to college.

5. One of the key components of this data extraction portion is matching the player correctly (from NBA to NCAA). Since there's no 1-to-1 mapping within the API itself, it'll be important to think about what verifications we need to make when determining how NBA players can be mapped to the NCAAB.

6. Teams change over time (location or name)! Keep that in mind.

7. This method may take a long time to run (for us, it takes us ~5.5 hrs to fully extract all the data from the past 20 years)! Do start + finish this as early as you can so that you have enough time to let the function run.

8. There will be many different cases where this function won't work, and it's up to you to decide what to do about them (e.g. international players didn't play in the NCAA, and players aren't necessarily guaranteed to be the first instance of a player with their name). Feel free to ask us about what you should do in order to deal with these cases, but we mention this to highlight how what you choose to do here can alter how your project fundamentally behaves later on. This doesn't mean any way is necessarily right (we haven't gone through all the different combinations), but this gives you more free rein to take this project into your own hands and determine **what you want your data to be like, and where to get the data from.**
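
Points 1 and 8 together suggest one possible strategy: instead of always assuming the `-1` suffix, walk through a few suffixes and keep the first id that passes your own check. This is just a sketch; `candidate_ids`, `first_match`, and the toy ids are all made up for illustration, and the check you plug in is up to you.

```python
def candidate_ids(base_slug, max_suffix=3):
    """Yield 'slug-1', 'slug-2', ... up to max_suffix."""
    for n in range(1, max_suffix + 1):
        yield f"{base_slug}-{n}"


def first_match(base_slug, is_right_player, max_suffix=3):
    """Return the first candidate id accepted by `is_right_player`, else None."""
    for player_id in candidate_ids(base_slug, max_suffix):
        if is_right_player(player_id):
            return player_id
    return None


# Toy check: pretend only the second 'toy-player' is the one we want.
known_good = {"toy-player-2"}
print(first_match("toy-player", lambda pid: pid in known_good))
```

Keep in mind that every extra suffix you try is another query, so a larger `max_suffix` trades runtime for fewer missed players.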

## Helper Methods

Here's where you'll be extracting all the data you might need for the project. We've also provided some helper functions and how they're used, just to help out with development-- feel free to go in different directions if you'd like/not use these functions if you don't need to!

Feel free to tinker around however you please and ask us any questions you might have about anything-- we're more than happy to help you out!

In [1]:
# This method should hopefully reduce the number of failure cases.
def convert_nba_ncaa_name(name):
    """
    Converts an NBA player's name into a guess at their NCAA player_id
    (lowercase, hyphen-separated, accents removed, with a "-1" suffix).
    
    You may want to elaborate on the logic of this function to reduce the number of failure cases later.
    """
    return unidecode.unidecode(name.lower().replace(" ", "-") + "-1")

In [None]:
# Example usage
convert_nba_ncaa_name("Stephen Curry")
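
If you do want to elaborate on the name logic, punctuation is one place to look. The sketch below assumes (we haven't verified this against the site) that NCAA ids drop apostrophes and periods. It also swaps `unidecode` for the standard library's `unicodedata`, which strips accents but won't transliterate letters that have no accent-free decomposition (for example `ø`). `slugify_name` is a hypothetical helper name, and it leaves the numeric suffix off.

```python
import re
import unicodedata


def slugify_name(name):
    """Lowercase, strip accents, drop apostrophes/periods, and join words with hyphens."""
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove apostrophes and periods outright
    cleaned = re.sub(r"[.']", "", ascii_name.lower())
    # Collapse whitespace runs into single hyphens (existing hyphens are kept)
    return re.sub(r"\s+", "-", cleaned.strip())


print(slugify_name("Alen Smailagić"))
print(slugify_name("Juan Toscano-Anderson"))
```

You'd still need to append a suffix such as `-1` to turn the slug into a full id.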

In [None]:
def format_df(player_name, df, is_college):
    """
    Formats the dataframe of an `NBA_Player` or `NCAAB_Player` object into the single-row
    dataframe that we want to return later.

    Set `is_college=True` for NCAAB dataframes and `is_college=False` for NBA dataframes.
    """
    # Easier toggling into different functions for later, instead of having to remember how they work
    is_college_types = {
        True: lambda x: 'NCAAB_' + x,
        False: lambda x: 'NBA_' + x,
    }
    
    # Takes a function and renames the column names using that function
    # (rename returns a new dataframe, so the caller's dataframe is left untouched)
    df = df.rename(columns=is_college_types[is_college])
    col_names = ['name'] + list(df.columns)
    df['name'] = player_name
    df = df[col_names]
    
    # This is for whether or not to format the dataframe by looking at the last year college stats (the
    # second-to-last row, right before the career row), or first year NBA stats (the first row). Feel free
    # to tinker with these couple of lines outside this function as well-- they're incredibly key for this
    # part of the project!
    if is_college:
        return df.iloc[[df.shape[0] - 2]]
    else:
        return df.iloc[[0]]

In [None]:
# Example usage (NBA dataframe, so is_college=False)
format_df('Stephen Curry', NBA_Player('curryst01').dataframe, is_college=False)

In [None]:
# Example usage (NCAAB dataframe, so is_college=True)
format_df('Stephen Curry', NCAAB_Player('stephen-curry-1').dataframe, is_college=True)

In [None]:
def convert_year_to_date(year: int):
    """
    Converts a passed in year (int or string) into a `date` object that can be compared with other `date` objects.
    """
    return datetime.datetime.strptime(str(year), '%Y').date()

In [None]:
# Example usage
convert_year_to_date(2008)

## Sandbox Area

In [34]:
def get_nba_ncaa_10_years(set_players, first_year, one_loop=True):
    """
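    Gets the college basketball data for every NBA player on a roster from `first_year` through 2019.

    `set_players` is currently unused. With `one_loop=True`, the function stops after the first
    team and prints a rough estimate of the full runtime.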
    """
    # Generating columns for combined dataframe
    # format_df takes three arguments and adds the column prefix itself
    nba_cols = format_df('Stephen Curry', NBA_Player('curryst01').dataframe, False)
    ncaa_cols = format_df('Stephen Curry', NCAAB_Player('stephen-curry-1').dataframe, True)
    all_cols = nba_cols.merge(ncaa_cols).columns

    combined = pd.DataFrame(columns=all_cols)
    seen = set() # To keep track of seen NBA players
    failed = dict()

    for year in range(first_year, 2020):
        
        sub_year = pd.DataFrame(columns=all_cols)
        
        teams = NBA_Teams(year=year)
        for team in teams:
            
            start = time.time() # For time measuring purposes
            
            players = NBA_Roster(team.abbreviation, year).players
            for player in players:
                if player.name in seen:
                    continue
                seen.add(player.name)
#                 unaccented_name = unidecode.unidecode(player.name) # We use this because maybe some of NBA players played in the NCAA with an accented name
                ncaab_player_id = convert_nba_ncaa_name(player.name) 
                
                try:
                    college_stats = NCAAB_Player(ncaab_player_id)
                except TypeError: # Player doesn't exist
                    print("Couldn't find NCAA player data for", player.name, ". Moving on.")
                    if 'lost' not in failed:
                        failed['lost'] = [player.name]
                    else:
                        failed['lost'].append(player.name)
                    continue
                
#                 if ncaab_player_id in set_players:
                last_college_date = convert_year_to_date(college_stats._most_recent_season[0:4])
                first_nba_date = convert_year_to_date(player._season[0][0:4])
    
                if last_college_date > convert_year_to_date(str(first_year)):
                    
                    # Confirming that the college player we find for the given NBA player has indeed played in college before the NBA
                    # (verifying that they are the same person, as you can't play in the NBA and then play in the NCAA)
#                     last_college_date = datetime.datetime.strptime(college_stats._most_recent_season[0:4], '%Y').date()
#                     first_nba_date = datetime.datetime.strptime(player._season[0][0:4], '%Y').date()
                   
                    if last_college_date < first_nba_date:
#                     if New_NBA_Player('fiewjf')._first_year > fjewifojwf:
                        
                        # Generating properly formatted dataframes for college and NBA stats
                        new_college = format_df(player.name, college_stats.dataframe, True)
                        new_nba = format_df(player.name, player.dataframe, False)

                        merged = new_nba.merge(new_college)
                        sub_year = pd.concat([sub_year, merged])  # DataFrame.append was removed in pandas 2.0
#                         combined = combined.append(merged)
                    else:
                        print("NBA Date before college date for", player.name, ". Moving on.")
                        if 'invalid-date' not in failed:
                            failed['invalid-date'] = [player.name]
                        else:
                            failed['invalid-date'].append(player.name)
                        continue
                
                else:
                    print("Last college season is too early (not after", first_year, ") for", player.name, ". Moving on.")
                    failed.setdefault('old-player', []).append(player.name)
                    continue

            print("\n")
            print("Looked at", team.name, "on year", year, ". Moving to the next team.")
            print("\n")

            if one_loop:
                end = time.time()
                print("One iteration for one team and one year would take", end - start, "seconds to run.")
                print("Would take", (end - start) * 600 / 60 / 60, "hours to find all players that played in the NBA in the past 20 years and their respective college stats.")
                # `combined` is still empty at this point, so return the rows gathered for this team
                return sub_year.reset_index(drop=True), failed
        
        cleaned_year = sub_year.reset_index(drop=True)
        cleaned_year.to_csv("{}_Player_Data.csv".format(year))  # Save each year as we go
        
        combined = pd.concat([combined, cleaned_year])  # DataFrame.append was removed in pandas 2.0
    
    return combined.reset_index().drop(columns=['index']), failed
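
The function above repeats the same "create the list if it's missing, then append" pattern for every failure reason. One way to shorten that bookkeeping is `collections.defaultdict`. The sketch below uses made-up player names, and `record_failure` is a hypothetical helper.

```python
from collections import defaultdict

failed = defaultdict(list)  # missing keys start out as empty lists


def record_failure(reason, player_name):
    """File a player under the reason their lookup failed."""
    failed[reason].append(player_name)


record_failure("lost", "Toy Player A")
record_failure("lost", "Toy Player B")
record_failure("invalid-date", "Toy Player C")

# Count how many players fell into each failure bucket
summary = {reason: len(names) for reason, names in failed.items()}
print(summary)
```

A summary like this makes it easier to see which failure case is worth fixing first.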

To guide your development, here's a snippet of a dataframe that we'd like to have constructed by the end of this!

In [None]:
pd.read_csv('example_data.csv')

**Running methods and saving CSV here**

In [None]:
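# `set_players` is unused by the function, so we pass None. first_year=2000 assumes the
# "past 20 years" window described above; adjust it if you want a different range.
data, failed = get_nba_ncaa_10_years(None, first_year=2000, one_loop=False)
data.to_csv('all_player_data.csv')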
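
Since the full run takes hours, it may help to skip years that already have a saved file if you need to restart. This is a sketch under the assumption that the per-year files follow the `{year}_Player_Data.csv` naming used in the function above; `years_remaining` is a hypothetical helper, and the demo uses a temporary folder with an empty placeholder file.

```python
import os
import tempfile


def years_remaining(first_year, last_year, directory="."):
    """Return the years in [first_year, last_year) with no saved per-year CSV yet."""
    return [
        year for year in range(first_year, last_year)
        if not os.path.exists(os.path.join(directory, f"{year}_Player_Data.csv"))
    ]


# Demo: pretend 2001 has already been saved
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "2001_Player_Data.csv"), "w").close()
    todo = years_remaining(2000, 2003, directory=tmp)

print(todo)
```

Because each per-year file is only written after every team for that year has been processed, an existing file is a reasonable sign that the year finished (unless the run crashed mid-write).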

**Congrats! You've gotten all your data!** This is definitely not an easy task, so congratulate yourself on figuring out how `sportsreference` works and getting all the data we need for the project! Next week, we'll get into analyzing the different features of the data, and doing some [data analysis](https://en.wikipedia.org/wiki/Data_analysis) and [feature engineering](https://en.wikipedia.org/wiki/Feature_engineering) to determine which features will be best to use for our project. Stay tuned for more :D