# Introduction

This tutorial introduces some basic analysis of player ratings from the "NBA 2K" basketball video game series. We will discuss how to scrape ratings from various websites, visualize year-to-year rating trends for each player, and predict a player's NBA 2K rating from their statistics in a given NBA season. In data science, the data you want to analyze is not always available as a neatly structured CSV file; often it is unstructured information on a website that is hard to collect without tediously copying and pasting or entering it by hand. Through data scraping, we can turn that information into a form that is easier to work with and to analyze statistically.

For many casual basketball fans, an NBA player's 2K rating is a fair estimate of that player's overall skill level. Players themselves can be competitive about these ratings, as even a one-point rating difference can have a tremendous impact on the player's reputation around the league. While these ratings are technically on a 40-99 scale, the worst players in the NBA typically don't have a rating lower than 60. The list below outlines how a player's NBA 2K rating indicates their skill level:

- 95+: Hall of Famer
- 90-94: Superstar
- 85-89: All-Star
- 80-84: High-Level Starter
- 75-79: High-Level Role Player
- 70-74: Low-Level Role Player
- 65-69: End-of-Bench Player
- 64-: NBA G-League Fodder

## Tutorial Content

In this tutorial, we will show how to do some basic data scraping, visualizing, merging, and modeling. We'll mostly be using the [Pandas](https://pandas.pydata.org/), [Numpy](https://numpy.org/), [Plotly Express](https://plot.ly/python/plotly-express/), and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) libraries.

We'll be using data from a variety of sources. Ratings from NBA 2K20, the most recent NBA 2K edition, can be found on https://www.2kratings.com. NBA 2K19 ratings are available on https://nba2k19.2kratings.com, and NBA 2K18 ratings are available on the corresponding version of that URL. To acquire data from games in years past, we'll use the database of ratings dating back to NBA 2K14 at https://hoopshype.com/nba2k. There are some inconsistencies in this data: the 2K ratings websites show ratings from each game's most recent roster update, whereas HoopsHype ratings are from when the game was initially released. Ratings from both websites are used anyway to demonstrate how to scrape data from various sources.

We will cover the following topics in this tutorial:

- [Installing the libraries](#Installing-the-libraries)
- [Scraping this year's 2K ratings website](#Scraping-this-year's-2K-ratings-website)
- [Scraping previous years' 2K ratings websites](#Scraping-previous-years'-2K-ratings-websites)
- [Scraping previous ratings from Hoopshype](#Scraping-previous-ratings-from-Hoopshype)
- [Gathering player ratings for each year](#Gathering-player-ratings-for-each-year)
- [Visualizing rating trends](#Visualizing-rating-trends)
- [Merging Basketball-Reference data](#Merging-Basketball-Reference-data)
- [Building a model to predict 2K ratings](#Building-a-model-to-predict-2K-ratings)
- [Summary and references](#Summary-and-references)

# Installing the libraries

Before getting started, you'll need to install the libraries that we will use. You can install Requests (which we'll need to fetch HTML from websites), Pandas, BeautifulSoup, lxml (the parser used for the HoopsHype pages), Plotly (which includes Plotly Express), Unidecode (which we'll need to remove accent marks from characters in strings; you'll see why this is relevant later), and NumPy using pip:

```bash
$ pip install --upgrade requests pandas beautifulsoup4 lxml plotly unidecode numpy
```

After you run all the installs, make sure the following imports run without errors:

In [18]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import plotly.express as px
import unidecode
import math
import numpy as np
import warnings

# Scraping this year's 2K ratings website

Now that we've installed and loaded the libraries, let's scrape some ratings from NBA 2K20. As you can see from the format of ratings in this URL (https://www.2kratings.com/nba2k20-team/philadelphia-76ers, provided as an example), each player has 3 ratings listed: their overall rating, 3-point rating, and dunk rating. For now, we're only interested in their overall rating so we'll need to be conscious of that in the scraping process.

Because there isn't one page with all players and their ratings on this website, we'll need to loop through URLs team by team, without forgetting free agents. We'll eventually scrape the ratings for each player all the way back to NBA 2K14, so the easiest approach is to store the ratings in nested dictionaries: an outer dictionary keyed by player name, where each player maps to an inner dictionary of their ratings by year.

The process of scraping the ratings is pretty dense, so let's go over it step by step. After adding the team name to the URL, we can use requests.get() to fetch the page at that URL. Since the response is raw HTML text, calling BeautifulSoup() parses it into a tree that is easier to search. At this point, we need to look at the webpage itself to see how player names and ratings are marked up. An easy way to do this is to open the example URL above, right-click the page, and choose "Inspect". To find a player's name, you can press Control-F and search for it (say, Joel Embiid) in the HTML elements. Doing this shows that his name, like those of the other players on the roster, is found under the "td" tag with the "roster-entry" class. The same process for ratings leads you to the "span" tag with the "roster-rating" class; however, there is no way to select only the players' overall ratings. Since finding all elements with that tag and class also returns the 3-point and dunk ratings, we have to take every third rating.
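
The two mechanical pieces of that walkthrough, building a team URL and keeping every third rating, can be tried offline. This is a minimal sketch under stated assumptions: `team_slug` and `overall_only` are hypothetical helper names, the lowercase slug mirrors the example URL above, and the rating list is made-up sample data rather than scraped values.

```python
def team_slug(team):
    """Turn 'Philadelphia 76ers' into 'philadelphia-76ers'."""
    # Lowercase to mirror the example URL; split() also collapses extra spaces.
    return "-".join(team.lower().split())


def overall_only(ratings, stride=3):
    """Keep the first rating of every group of `stride` ratings."""
    return ratings[::stride]


base_url = "https://www.2kratings.com/nba2k20-team/"
team_url = base_url + team_slug("Philadelphia 76ers")

# Made-up ratings in page order: overall, 3-point, dunk for each player.
sample_ratings = [91, 75, 60, 87, 70, 85]
overalls = overall_only(sample_ratings)

print(team_url)
print(overalls)
```

Slicing with a stride is equivalent to indexing `ratings[3*i]` for each player `i`, but it fails less abruptly if the page ever lists fewer ratings than expected.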

Once the players and ratings are in a relatively structured form, we can loop through the players and add them and their 2K20 ratings to the dictionary. While doing this, we'll need to keep a few things in mind. First of all, rookies have their rookie status included in their name (e.g. Matisse Thybulle is a rookie, so his name string from this website is "Matisse Thybulle ROOKIE DRAFTED #20"), so we'll have to get rid of this. Another tedious pattern you'll see a lot more of throughout this tutorial is that different datasets spell some players' names differently. Since we're not working with a unique, unchanging player ID for each player, we'll eventually merge datasets using player names, so every player's name must be exactly the same in each dataset we use. As the example below shows, some datasets add the Sr. suffix to Marcus Morris's name and some add the III suffix to James Ennis's name. For the sake of consistency, they will be Marcus Morris and James Ennis in all datasets.
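
One way to make the rookie clean-up insensitive to capitalization is a case-insensitive regular expression. This is a sketch rather than the approach used in the cell below: `strip_rookie_tag` is a hypothetical helper, and it assumes the annotation always starts with the word "rookie" (in any case) preceded by whitespace, as in the Matisse Thybulle example.

```python
import re

# Matches whitespace + "rookie" (any case) and everything after it.
ROOKIE_TAG = re.compile(r"\s+rookie\b.*$", flags=re.IGNORECASE)


def strip_rookie_tag(raw_name):
    """Remove a trailing rookie annotation and surrounding whitespace."""
    return ROOKIE_TAG.sub("", raw_name.strip())


print(strip_rookie_tag("Matisse Thybulle ROOKIE DRAFTED #20"))
print(strip_rookie_tag("Joel Embiid"))
```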

In [5]:
nba_teams = ["Atlanta Hawks", "Boston Celtics", "Brooklyn Nets", 
             "Charlotte Hornets", "Chicago Bulls", "Cleveland Cavaliers", 
             "Dallas Mavericks", "Denver Nuggets", "Detroit Pistons", 
             "Golden State Warriors", "Houston Rockets", "Indiana Pacers",
             "Los Angeles Clippers", "Los Angeles Lakers", "Memphis Grizzlies",
             "Miami Heat", "Milwaukee Bucks", "Minnesota Timberwolves",
             "New Orleans Pelicans", "New York Knicks", "Oklahoma City Thunder",
             "Orlando Magic", "Philadelphia 76ers", "Phoenix Suns",
             "Portland Trail Blazers", "Sacramento Kings", "San Antonio Spurs",
             "Toronto Raptors", "Utah Jazz", "Washington Wizards"]
nba_ratings_dict = dict()

# Scraping NBA 2k20 Ratings

url = "https://www.2kratings.com/nba2k20-team/"
for team in nba_teams + ["Free Agents"]:
    # Add team name to URL
    team_split = team.split(" ")
    team_url = url
    for name in team_split:
        team_url = team_url + name + "-"
    team_url = team_url[:-1] # remove the last -
    
    # Get HTML from that page
    response = requests.get(team_url)
    root = BeautifulSoup(response.text, "html.parser")
    players = root.find_all("td", {'class': ['roster-entry']})
    ratings = root.find_all("span", {'class': ['roster-rating']})
    
    # Parse through players to add to dictionary
    for i in range(len(players)):
        # For rookies, it will indicate that they are rookies. Let's remove this:
        player = players[i].text.strip().split(" Rookie ")[0]
        # Keep consistency in player names
        if (player == "Marcus Morris Sr"): player = "Marcus Morris"
        elif (player == "James Ennis III"): player = "James Ennis"
        # Overall ratings are listed once every 3 ratings (others are 3PT and Dunk)
        rating = float(ratings[3*i].text.strip()) # Other data types will be floats, so let's keep consistency
        if(player not in nba_ratings_dict):
            nba_ratings_dict[player] = dict()
        nba_ratings_dict[player][float(2020)] = rating

# Scraping previous years' 2K ratings websites

To see an example of how this version of the website works, refer to https://nba2k19.2kratings.com/team/philadelphia-76ers. The main difference between this URL and NBA 2K20's URL is that this one does not include 3-point or dunk ratings, which should make the scraping process easier. Otherwise, this code chunk isn't too different from the previous chunk. A few more names have been added to the consistency check; this is a tedious process that requires no more than looking through the data, seeing how a player's name is correctly spelled, and making necessary changes to keep consistency.

In [7]:
# Scraping NBA 2k19 and 2k18 Ratings is slightly different
years = [19,18]
for year in years:
    url = "https://nba2k" + str(year) + ".2kratings.com/team/"
    for team in nba_teams + ["Free Agents"]:
        team_split = team.split(" ")
        team_url = url
        for name in team_split:
            team_url = team_url + name + "-"
        team_url = team_url[:-1] # remove the last -
        response = requests.get(team_url)
        root = BeautifulSoup(response.text, "html.parser")
        players = root.find_all("td", {'class': ['roster-entry']})
        ratings = root.find_all("span", {'class': ['roster-rating']})
        for i in range(len(players)):
            # For rookies, it will indicate that they are rookies. Let's remove this:
            player = players[i].text.strip().split(" Rookie")[0]
            # Keep consistency in player names and correct misspellings
            if (player == "Marcus Morris Sr"): player = "Marcus Morris"
            elif (player == "James Ennis III"): player = "James Ennis"
            elif (player == "Greivis Varquez"): player = "Greivis Vasquez"
            elif (player == "Jordan Mcrae"): player = "Jordan McRae"
            elif (player == "Wade Baldwin IV"): player = "Wade Baldwin"
            # Ratings do not include 3PT and dunks for past years
            rating = float(ratings[i].text.strip())
            if(player not in nba_ratings_dict):
                nba_ratings_dict[player] = dict()
            nba_ratings_dict[player][float(2000 + year)] = rating

# Scraping previous ratings from Hoopshype

The 2KRatings websites don't include ratings from NBA 2K17 and earlier, so we'll have to turn to HoopsHype's database to fill in the gaps. This database includes ratings dating back to NBA 2K14, so we'll scrape these to add more players and ratings to our dictionary. To see what this website looks like, take a look at this URL: https://hoopshype.com/nba2k/2016-2017/.

By inspecting HoopsHype's HTML, you can see that player names and ratings are both stored under the "td" tag, with player names under the "name" class and ratings under the "value" class with a data-value attribute corresponding to the player's rating. The data on HoopsHype is much more structured; however, because HoopsHype and 2KRatings spell some players' names differently, we have to fix the spellings of several players. Otherwise, the dictionary will treat J.R. Smith as a player with ratings dating back to NBA 2K18 and JR Smith as a player with ratings from no later than NBA 2K17, even though J.R. Smith and JR Smith are the same player with names spelled differently by different data sources.
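
The same name fixes can also be expressed as a lookup table instead of a chain of comparisons. The sketch below is an alternative, not the method used in the next cell: `NAME_ALIASES` and `canonical_name` are hypothetical names, the alias entries are copied from this tutorial's consistency checks, and the two ratings are made-up values for illustration.

```python
# Hypothetical alias table: source spelling -> canonical spelling.
NAME_ALIASES = {
    "JR Smith": "J.R. Smith",
    "JJ Redick": "J.J. Redick",
    "Patrick Mills": "Patty Mills",
    "Ishmael Smith": "Ish Smith",
}


def canonical_name(raw_name):
    """Return the canonical spelling, or the cleaned name if no alias exists."""
    name = raw_name.strip()
    return NAME_ALIASES.get(name, name)


ratings = {}
# Made-up ratings: the same player spelled two ways by two sources.
for raw_name, year, rating in [("J.R. Smith", 2018.0, 76.0),
                               ("JR Smith", 2017.0, 77.0)]:
    ratings.setdefault(canonical_name(raw_name), {})[year] = rating

print(ratings)
```

Both spellings end up under a single key, which is exactly what the merge by player name later in the tutorial relies on.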

In [8]:
# Scraping from HoopsHype for 2k14-2k17

for year in range(2017, 2013, -1):
    url = "https://hoopshype.com/nba2k/" + str(year - 1) + "-" + str(year) + "/"
    response = requests.get(url)
    root = BeautifulSoup(response.text, "lxml")
    players = root.find_all("td", {'class': ['name'], 'data-value': ['']})
    ratings = root.find_all("td", {'class': ['value'], 
                                   'data-value' : [str(x) for x in list(range(99,0,-1))]})
    for i in range(len(players)):
        player = players[i].text.strip()
        # Keep consistency in player names
        if (player == "AJ Hammons"): player = "A.J. Hammons"
        for cj in ["McCollum", "Miles", "Watson", "Wilcox"]:
            if (player == "CJ " + cj): player = "C.J. " + cj
        if (player == "DJ Augustin"): player = "D.J. Augustin"
        elif (player == "DeAndre Bembry"): player = "DeAndre’ Bembry"
        elif (player == "Dennis Schroeder"): player = "Dennis Schroder"
        elif (player == "D'Angelo Russell"): player = "D’Angelo Russell"
        elif (player == "E'Twaun Moore"): player = "E’Twaun Moore"
        elif (player == "Ishmael Smith"): player = "Ish Smith"
        elif (player == "JJ Redick"): player = "J.J. Redick"
        elif (player == "JR Smith"): player = "J.R. Smith"
        elif (player == "John Lucas"): player = "John Lucas III"
        elif (player == "Johnny O'Bryant"): player = "Johnny O’Bryant III"
        elif (player == "Jose Juan Barea"): player = "J.J. Barea"
        elif (player == "Joseph Young"): player = "Joe Young"
        elif (player == "KJ McDaniels"): player = "K.J. McDaniels"
        elif (player == "Kelly Oubre"): player = "Kelly Oubre Jr."
        elif (player == "Kyle O'Quinn"): player = "Kyle O’Quinn"
        elif (player == "Larry Nance Jr"): player = "Larry Nance Jr."
        elif (player == "Nenê"): player = "Nene" 
        elif (player == "Otto Porter"): player = "Otto Porter Jr." 
        elif (player == "PJ Tucker"): player = "P.J. Tucker"
        elif (player == "Patrick Mills"): player = "Patty Mills" 
        elif (player == "Perry Jones"): player = "Perry Jones III" 
        elif (player == "RJ Hunter"): player = "R.J. Hunter"
        for tj in ["McConnell", "Warren"]:
            if (player == "TJ " + tj): player = "T.J. " + tj
        if (player == "Tim Hardaway Jr"): player = "Tim Hardaway Jr."
        elif (player == "Timothe Luwawu"): player = "Timothe Luwawu-Cabarrot"
        rating = float(ratings[i].text.strip())
        if(player not in nba_ratings_dict):
            nba_ratings_dict[player] = dict()
        nba_ratings_dict[player][float(year)] = rating

To demonstrate how the dictionary works, I've selected a player who retired before 2020 (Kobe Bryant), a player who started his career after 2014 (Ben Simmons), and a player who has been active for all of these seasons (Kevin Durant). Each player has their own dictionary in which the year corresponding to a specific NBA 2K game maps to their rating from that game. Note that years are stored as floats for now to make modeling easier later on.

In [9]:
for player in ["Kobe Bryant", "Ben Simmons", "Kevin Durant"]:
    print(player + ":", nba_ratings_dict[player])

Kobe Bryant: {2016.0: 85.0, 2015.0: 89.0, 2014.0: 93.0}
Ben Simmons: {2020.0: 87.0, 2019.0: 88.0, 2018.0: 85.0, 2017.0: 79.0}
Kevin Durant: {2020.0: 96.0, 2019.0: 95.0, 2018.0: 96.0, 2017.0: 93.0, 2016.0: 91.0, 2015.0: 95.0, 2014.0: 95.0}
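
Because each player's ratings are keyed by year, simple questions can be answered straight from the nested dictionary. As a small sketch, the hypothetical helper `yearly_changes` below computes each year's change from the previous recorded year, using Ben Simmons's ratings copied from the printout above.

```python
# Ratings copied from the printed output above.
ben_simmons = {2020.0: 87.0, 2019.0: 88.0, 2018.0: 85.0, 2017.0: 79.0}


def yearly_changes(ratings_by_year):
    """Map each year to its rating change from the previous recorded year."""
    years = sorted(ratings_by_year)
    return {
        later: ratings_by_year[later] - ratings_by_year[earlier]
        for earlier, later in zip(years, years[1:])
    }


changes = yearly_changes(ben_simmons)
print(changes)  # {2018.0: 6.0, 2019.0: 3.0, 2020.0: -1.0}
```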


# Gathering player ratings for each year

Now that we have our dictionary, getting each player's rating for each year is an easy task: we convert the dictionary into a Pandas data frame. Note that we'll need to set orient = "index" to make sure that player names are the rows and years are the columns. If a player was not in the game for a particular year, their rating will be missing in the dataset, as it should be.

In [10]:
nba_ratings_df = pd.DataFrame.from_dict(nba_ratings_dict, orient = "index")
nba_ratings_df.index.name = "Player"
nba_ratings_df.sort_index().head(10)


| Player | 2020.0 | 2019.0 | 2018.0 | 2017.0 | 2016.0 | 2015.0 | 2014.0 |
|---|---|---|---|---|---|---|---|
| A.J. Hammons | | 68.0 | 68.0 | 69.0 | | | |
| AJ Price | | | | | | | 56.0 |
| Aaron Brooks | | 72.0 | 72.0 | 73.0 | 75.0 | 73.0 | 76.0 |
| Aaron Gordon | 81.0 | 80.0 | 81.0 | 76.0 | 74.0 | 74.0 | |
| Aaron Gray | | | | | | 67.0 | 56.0 |
| Aaron Harrison | | 65.0 | 65.0 | 65.0 | 66.0 | | |
| Aaron Holiday | 73.0 | 73.0 | | | | | |
| Aaron Jackson | | 65.0 | 65.0 | | | | |
| Abdel Nader | 70.0 | 69.0 | 67.0 | | | | |
| Admiral Schofield | 71.0 | | | | | | |


By melting the data for each player, we can transform the data to only include three variables: the player, the year, and the rating. This simple manipulation will make it much easier to visualize rating trends as we'll see in the next step.

In [11]:
nba_ratings_melt = nba_ratings_df.reset_index().melt(id_vars='Player')
nba_ratings_melt = nba_ratings_melt.rename(columns={"value": "Rating",
                                                    "variable": "Year"})
nba_ratings_melt.head(10)

| | Player | Year | Rating |
|---|---|---|---|
| 0 | John Collins | 2020 | 84.0 |
| 1 | DeAndre’ Bembry | 2020 | 73.0 |
| 2 | Trae Young | 2020 | 85.0 |
| 3 | Kevin Huerter | 2020 | 76.0 |
| 4 | Vince Carter | 2020 | 73.0 |
| 5 | Alex Len | 2020 | 77.0 |
| 6 | De’Andre Hunter | 2020 | 77.0 |
| 7 | Cam Reddish | 2020 | 75.0 |
| 8 | Bruno Fernando | 2020 | 71.0 |
| 9 | Allen Crabbe | 2020 | 73.0 |
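
To see the wide-to-long reshaping on something small enough to check by hand, here is a toy sketch using two players copied from the dictionary printout earlier. The names `toy_dict`, `wide_df`, `long_df`, and `observed` are illustrative only, and passing `var_name`/`value_name` to `melt` is an alternative to the rename step used above.

```python
import pandas as pd

# Two players copied from the dictionary printout earlier in the tutorial.
toy_dict = {
    "Kobe Bryant": {2016.0: 85.0, 2015.0: 89.0, 2014.0: 93.0},
    "Ben Simmons": {2020.0: 87.0, 2019.0: 88.0, 2018.0: 85.0, 2017.0: 79.0},
}

# Wide form: one row per player, one column per year, NaN where no rating exists.
wide_df = pd.DataFrame.from_dict(toy_dict, orient="index")
wide_df.index.name = "Player"

# Long form: one row per (player, year) pair.
long_df = wide_df.reset_index().melt(id_vars="Player",
                                     var_name="Year",
                                     value_name="Rating")

# Dropping the NaN rows leaves only the ratings that actually exist.
observed = long_df.dropna(subset=["Rating"])
print(wide_df.shape, long_df.shape, observed.shape)
```

The long frame has one row for every player-year combination, including the years a player was not in the game; only the seven observed ratings survive the `dropna`.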


# Visualizing rating trends

Using Plotly Express, we can create a line graph to visualize each player's year-to-year change in NBA 2K ratings. Plotly has several neat features, including the ability to zoom in on the graph and to hover over any point to see its information. The legend lists every player, and you can use it to choose whose trendlines are shown. While the plot below is very cluttered, making it hard to infer anything about any one player, you can isolate a single player's plot by double-clicking his name in the legend. To see the trends of all other players again, double-click the legend again. Single-clicking a player's name adds or removes his data from the plot.

In [21]:
nba_ratings_melt_sorted = nba_ratings_melt.sort_values(by = ["Player", "Year"])

fig = px.line(nba_ratings_melt_sorted,
              x="Year", y="Rating", 
              color="Player")
fig.show()

To provide an easier graph to look at, I selected twenty NBA players from the last few years whose rating trends are interesting for a variety of reasons. Feel free to play around with these graphs as you wish:

In [22]:
subset_names = ["LeBron James","Giannis Antetokounmpo",
                "Stephen Curry","Kevin Durant",
                "Anthony Davis","Joel Embiid",
                "Dwyane Wade","Dirk Nowitzki",
                "Kawhi Leonard","James Harden",
                "DeMar DeRozan", "Derrick Rose",
                "Carmelo Anthony", "Kobe Bryant",
                "Hassan Whiteside", "Tim Duncan",
                "Kevin Love", "Kyrie Irving",
                "Dwight Howard", "DeMarcus Cousins"]
# Build the mask from the sorted frame so its index lines up with the frame it filters
subset_players = nba_ratings_melt_sorted.Player.isin(subset_names)

warnings.simplefilter("ignore")
fig_spec = px.line(nba_ratings_melt_sorted[subset_players],
                   x="Year", y="Rating", color="Player")
fig_spec.show()

# Merging Basketball-Reference data

For our final step in this tutorial, let's build a basic model to predict a player's 2K rating. Before we do this, however, we need each player's statistics from each season, which we can find on Basketball-Reference.com. The code chunk below creates a dataset with per-game stats for each player in each year. There's a lot more player name manipulation going on here than in previous scraping jobs. Let's go through this step by step.

By inspecting the Basketball-Reference page (you can use https://www.basketball-reference.com/leagues/NBA_2019_per_game.html as a reference), we can see that the individual rows are found in the "full_table" class under the "tr" tag, and the data in each column is found under the "td" tag. Some columns are numerical and some are categorical, so we add each value to the data frame as the appropriate type. After manipulating player names (including using Unidecode to remove accent marks from the names of some players), we must also make sure that percentages for players with no shot attempts are coded as 0, because leaving them blank would cause problems once we build our model.
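
The per-cell logic described here (numbers become floats, text stays text, blank percentages become 0, accents are removed) can be sketched in isolation. The helpers `parse_cell` and `strip_accents` are hypothetical, and `strip_accents` uses the standard library's `unicodedata` as a stand-in for Unidecode; it only handles characters that decompose into a base letter plus an accent, so it is not a full replacement.

```python
import unicodedata

PERCENT_COLUMNS = {"FG%", "3P%", "2P%", "eFG%", "FT%"}


def strip_accents(name):
    """Drop combining accent marks, e.g. 'Nenê' -> 'Nene'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def parse_cell(column, text):
    """Convert a table cell to float when possible; blank percentages become 0."""
    text = text.strip()
    if text == "" and column in PERCENT_COLUMNS:
        return 0.0
    try:
        return float(text)
    except ValueError:
        return text  # categorical value such as a name, position, or team


print(strip_accents("Nenê"))
print(parse_cell("FG%", ""), parse_cell("PTS", "13.9"), parse_cell("Pos", "SG"))
```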

In [14]:
# Adding Basketball-Reference Data

for year in range(2019, 2013, -1):
    url = "https://www.basketball-reference.com/leagues/NBA_" + str(year) + "_per_game.html"
    response = requests.get(url)
    root = BeautifulSoup(response.text, "html.parser")
    # Get all the column names
    columns = root.find_all("tr", limit = 1)[0].text.strip().split("\n")[1:]
    
    # Start building a data frame for this year, we'll eventually append to a master nba_stats data frame
    year_df = pd.DataFrame(columns = columns)
    for row in root.find_all("tr", class_="full_table"):
        data = []
        for col in row.find_all("td"):
            try: data.append(float(col.getText()))
            except ValueError: data.append(col.getText()) # keep non-numeric cells as text
        player_data = pd.DataFrame(data).T # Need to transpose to add to data
        player_data.columns = year_df.columns
        
        player_name = player_data["Player"].values[0]
        # Get rid of accent marks in names of some players
        player_data["Player"].values[0] = unidecode.unidecode(player_name)
        # Check name consistency. For example, some players have ' rather than ’ in their names
        # (split the accent-free name so the Unidecode fix above is not overwritten)
        split_name = player_data["Player"].values[0].split("'")
        if(len(split_name) == 2):
            player_data["Player"].values[0] = split_name[0] + "’" + split_name[1]
            
        # Get rid of initials for some players
        initials = ["A.J. Price","D.J. White","J.J. Hickson","O.J. Mayo","P.J. Hairston"]
        if(player_data["Player"].values[0] in initials):
            player_data["Player"].values[0] = player_data["Player"].values[0].replace(".","")
            
        # For other players, do the opposite
        no_initials = ["CJ McCollum", "PJ Dozier"]
        if(player_data["Player"].values[0] in no_initials):
            player_data["Player"].values[0] = player_name[0] + "." + player_name[1] + "." + player_name[2:]
        
        # Some players have a Jr. at the end of their name
        jr = ["Dennis Smith", "Derrick Jones", "Derrick Walton", "Gary Trent",
              "Glen Rice", "Jaren Jackson", "Kelly Oubre", "Larry Nance",
              "Otto Porter", "Tim Hardaway", "Tony Bradley", "Troy Brown",
              "Wayne Selden", "Wendell Carter"]
        if(player_data["Player"].values[0] in jr):
            player_data["Player"].values[0] = player_data["Player"].values[0] + " Jr."
            if(player_name == "Glen Rice"): 
                player_data["Player"].values[0] = player_data["Player"].values[0][:-1]
                
        # Others have a II at the end of their name
        ii = ["Gary Payton", "Larry Drew"]
        if(player_data["Player"].values[0] in ii):
            player_data["Player"].values[0] = player_data["Player"].values[0] + " II"
            
        # Others have a III at the end of their name
        iii = ["Andrew White", "Glenn Robinson", "James Webb", 
               "Johnny O’Bryant", "Marvin Bagley", "Perry Jones"]
        if(player_data["Player"].values[0] in iii):
            player_data["Player"].values[0] = player_data["Player"].values[0] + " III"
            
        # Let's change a few specific names
        if(player_data["Player"].values[0] == "Amar’e Stoudemire"): 
            player_data["Player"].values[0] = "Amare Stoudemire"
        elif(player_data["Player"].values[0] == "Byron Mullens"): 
            player_data["Player"].values[0] = "BJ Mullens"
        elif(player_data["Player"].values[0] == "Chris Johnson"): 
            player_data["Player"].values[0] = "Christapher Johnson"
        elif(player_data["Player"].values[0] == "Hedo Turkoglu"): 
            player_data["Player"].values[0] = "Hidayet Turkoglu"
        elif(player_data["Player"].values[0] == "Jakob Poltl"): 
            player_data["Player"].values[0] = "Jakob Poeltl"
        elif(player_data["Player"].values[0] == "Lonnie Walker"): 
            player_data["Player"].values[0] = "Lonnie Walker IV"
        elif(player_data["Player"].values[0] == "Lou Amundson"): 
            player_data["Player"].values[0] = "Louis Amundson"
        elif(player_data["Player"].values[0] == "Lou Williams" and year < 2018): 
            player_data["Player"].values[0] = "Louis Williams" # also double conditional
        elif(player_data["Player"].values[0] == "Mo Williams"): 
            player_data["Player"].values[0] = "Maurice Williams"
        elif(player_data["Player"].values[0] == "Maurice Harkless" and year < 2018):  
            player_data["Player"].values[0] = "Moe Harkless" # also double conditional
        elif(player_data["Player"].values[0] == "Mo Bamba"): 
            player_data["Player"].values[0] = "Mohamed Bamba"
        elif(player_data["Player"].values[0] == "Nene Hilario"): 
            player_data["Player"].values[0] = "Nene"
        elif(player_data["Player"].values[0] == "Zhou Qi"): 
            player_data["Player"].values[0] = "Qi Zhou"
        elif(player_data["Player"].values[0] == "Taurean Waller-Prince"): 
            player_data["Player"].values[0] = "Taurean Prince"
        elif(player_data["Player"].values[0] == "Walt Lemon"): 
            player_data["Player"].values[0] = "Walter Lemon Jr."
        elif(player_data["Player"].values[0] == "Edy Tavares"): 
            player_data["Player"].values[0] = "Walter Tavares"
            
        # Some values are blank due to divide by 0 error. Let's fix this:
        if(player_data["FG%"].values[0] == ""): player_data["FG%"].values[0] = 0
        if(player_data["3P%"].values[0] == ""): player_data["3P%"].values[0] = 0
        if(player_data["2P%"].values[0] == ""): player_data["2P%"].values[0] = 0
        if(player_data["eFG%"].values[0] == ""): player_data["eFG%"].values[0] = 0
        if(player_data["FT%"].values[0] == ""): player_data["FT%"].values[0] = 0

        # Finally, let's add this data into our data frame
        # (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
        year_df = pd.concat([year_df, player_data], ignore_index = True)
    rows = year_df.shape[0]
    year_df["Year"] = [year] * rows
    if(year == 2019): nba_stats = year_df # Initialize full dataset
    else: nba_stats = pd.concat([nba_stats, year_df])
    
nba_stats.head(10)

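| | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | ... | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alex Abrines | SG | 25 | OKC | 31 | 2 | 19.0 | 1.8 | 5.1 | 0.357 | ... | 0.2 | 1.4 | 1.5 | 0.6 | 0.5 | 0.2 | 0.5 | 1.7 | 5.3 | 2019 |
| 1 | Quincy Acy | PF | 28 | PHO | 10 | 0 | 12.3 | 0.4 | 1.8 | 0.222 | ... | 0.3 | 2.2 | 2.5 | 0.8 | 0.1 | 0.4 | 0.4 | 2.4 | 1.7 | 2019 |
| 2 | Jaylen Adams | PG | 22 | ATL | 34 | 1 | 12.6 | 1.1 | 3.2 | 0.345 | ... | 0.3 | 1.4 | 1.8 | 1.9 | 0.4 | 0.1 | 0.8 | 1.3 | 3.2 | 2019 |
| 3 | Steven Adams | C | 25 | OKC | 80 | 80 | 33.4 | 6.0 | 10.1 | 0.595 | ... | 4.9 | 4.6 | 9.5 | 1.6 | 1.5 | 1.0 | 1.7 | 2.6 | 13.9 | 2019 |
| 4 | Bam Adebayo | C | 21 | MIA | 82 | 28 | 23.3 | 3.4 | 5.9 | 0.576 | ... | 2.0 | 5.3 | 7.3 | 2.2 | 0.9 | 0.8 | 1.5 | 2.5 | 8.9 | 2019 |
| 5 | Deng Adel | SF | 21 | CLE | 19 | 3 | 10.2 | 0.6 | 1.9 | 0.306 | ... | 0.2 | 0.8 | 1.0 | 0.3 | 0.1 | 0.2 | 0.3 | 0.7 | 1.7 | 2019 |
| 6 | DeVaughn Akoon-Purcell | SG | 25 | DEN | 7 | 0 | 3.1 | 0.4 | 1.4 | 0.3 | ... | 0.1 | 0.4 | 0.6 | 0.9 | 0.3 | 0.0 | 0.3 | 0.6 | 1.0 | 2019 |
| 7 | LaMarcus Aldridge | C | 33 | SAS | 81 | 81 | 33.2 | 8.4 | 16.3 | 0.519 | ... | 3.1 | 6.1 | 9.2 | 2.4 | 0.5 | 1.3 | 1.8 | 2.2 | 21.3 | 2019 |
| 8 | Rawle Alkins | SG | 21 | CHI | 10 | 1 | 12.0 | 1.3 | 3.9 | 0.333 | ... | 1.1 | 1.5 | 2.6 | 1.3 | 0.1 | 0.0 | 0.8 | 0.7 | 3.7 | 2019 |
| 9 | Grayson Allen | SG | 23 | UTA | 38 | 2 | 10.9 | 1.8 | 4.7 | 0.376 | ... | 0.1 | 0.5 | 0.6 | 0.7 | 0.2 | 0.2 | 0.9 | 1.2 | 5.6 | 2019 |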


Above, you can see the data frame that results from scraping Basketball-Reference. Below, we'll merge it with the nba_ratings_melt dataset containing each player's NBA 2K rating for each year. After the merge, each row will hold a player's rating and their per-game statistics from that season.

In [15]:
nba_ratings_and_stats = nba_ratings_melt.merge(nba_stats, 
                                               how = "left", 
                                               on = ["Player", "Year"])
nba_ratings_and_stats.sort_values(by = ["Year", "PTS"], ascending = [True, False]).head(10)

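| | Player | Year | Rating | Pos | Age | Tm | G | GS | MP | FG | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6159 | Kevin Durant | 2014 | 95.0 | SF | 25 | OKC | 81 | 81 | 38.5 | 10.5 | ... | 0.873 | 0.7 | 6.7 | 7.4 | 5.5 | 1.3 | 0.7 | 3.5 | 2.1 | 32.0 |
| 6612 | Carmelo Anthony | 2014 | 93.0 | PF | 29 | NYK | 77 | 77 | 38.7 | 9.6 | ... | 0.848 | 1.9 | 6.2 | 8.1 | 3.1 | 1.2 | 0.7 | 2.6 | 2.9 | 27.4 |
| 6327 | LeBron James | 2014 | 99.0 | PF | 29 | MIA | 77 | 77 | 37.7 | 10.0 | ... | 0.75 | 1.1 | 5.9 | 6.9 | 6.3 | 1.6 | 0.3 | 3.5 | 1.6 | 27.1 |
| 6200 | Kevin Love | 2014 | 88.0 | PF | 25 | MIN | 77 | 77 | 36.3 | 8.4 | ... | 0.821 | 2.9 | 9.6 | 12.5 | 4.4 | 0.8 | 0.5 | 2.5 | 1.8 | 26.1 |
| 6280 | James Harden | 2014 | 88.0 | SG | 24 | HOU | 73 | 73 | 38.0 | 7.5 | ... | 0.866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
| 6251 | Blake Griffin | 2014 | 86.0 | PF | 24 | LAC | 80 | 80 | 35.8 | 9.0 | ... | 0.715 | 2.4 | 7.1 | 9.5 | 3.9 | 1.2 | 0.6 | 2.8 | 3.3 | 24.1 |
| 6266 | Stephen Curry | 2014 | 88.0 | PG | 25 | GSW | 78 | 78 | 36.5 | 8.4 | ... | 0.885 | 0.6 | 3.7 | 4.3 | 8.5 | 1.6 | 0.2 | 3.8 | 2.5 | 24.0 |
| 6530 | LaMarcus Aldridge | 2014 | 86.0 | PF | 28 | POR | 69 | 69 | 36.2 | 9.4 | ... | 0.822 | 2.4 | 8.7 | 11.1 | 2.6 | 0.9 | 1.0 | 1.8 | 2.1 | 23.2 |
| 6336 | DeMarcus Cousins | 2014 | 82.0 | C | 23 | SAC | 71 | 71 | 32.4 | 8.3 | ... | 0.726 | 3.1 | 8.6 | 11.7 | 2.9 | 1.5 | 1.3 | 3.5 | 3.8 | 22.7 |
| 6538 | DeMar DeRozan | 2014 | 77.0 | SG | 24 | TOR | 79 | 79 | 38.2 | 7.6 | ... | 0.824 | 0.6 | 3.7 | 4.3 | 4.0 | 1.1 | 0.4 | 2.2 | 2.5 | 22.7 |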
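
A left merge keeps every rating row and silently leaves the stats blank when a name fails to match, so it can be worth checking which names did not line up. The toy sketch below uses made-up rows and the `indicator=True` option of `DataFrame.merge`; the frames `toy_ratings` and `toy_stats` are illustrative, with one name deliberately spelled differently in each.

```python
import pandas as pd

# Made-up rows; the second player is deliberately spelled differently in each frame.
toy_ratings = pd.DataFrame({
    "Player": ["Kevin Durant", "Taurean Prince"],
    "Year": [2019, 2019],
    "Rating": [95.0, 77.0],
})
toy_stats = pd.DataFrame({
    "Player": ["Kevin Durant", "Taurean Waller-Prince"],
    "Year": [2019, 2019],
    "PTS": [26.0, 13.5],
})

merged = toy_ratings.merge(toy_stats, how="left",
                           on=["Player", "Year"], indicator=True)

# Rows tagged "left_only" had a rating but no matching stats row.
unmatched = merged.loc[merged["_merge"] == "left_only", "Player"].tolist()
print(unmatched)
```

Running the same check on the real merge would list candidate names for the consistency fixes, keeping in mind that some unmatched rows are expected (for example, 2020 ratings have no corresponding season of stats in this tutorial).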


# Building a model to predict 2K ratings

Now that we have the data in the form we want, let's build a model using the variables selected below to predict NBA 2K rating. Intuitively, the relationship between age and player skill is non-linear, since one should expect a young player to get better and an older player to get worse. For that reason, I've added an Age$^2$ variable to model 2K ratings as a nonlinear function of age, controlling for the other variables too.

Converting the data to NumPy arrays will allow us to perform linear regression. We're only looking to fit a basic model here, so let's do that and extract the coefficients:

In [16]:
X_stats = nba_ratings_and_stats.dropna()[["Age", "G", "GS", "MP", 
                                          "FG", "FGA", "FG%", "3P", "3PA", "3P%",
                                          "eFG%", "FT", "FTA", "FT%", "ORB", "DRB", 
                                          "AST", "STL", "BLK", "TOV", "PF", "PTS"]].copy()
# .copy() above lets us add a column without writing into a slice of the merged frame
X_stats["Age^2"] = X_stats["Age"] ** 2
y_ratings = nba_ratings_and_stats.loc[X_stats.index, "Rating"].to_numpy(dtype = "float")
X_stats_as_np = X_stats.to_numpy(dtype = "float")

coefficients = np.linalg.solve(X_stats_as_np.T @ X_stats_as_np, 
                               X_stats_as_np.T @ y_ratings)

coefficients_df = pd.DataFrame(coefficients)
coefficients_df.index = X_stats.columns
coefficients_df.columns = ["Coefficient"]
coefficients_df.sort_values(by = "Coefficient", ascending = False)

| | Coefficient |
|---|---|
| FG% | 7.970027 |
| Age | 4.531645 |
| 3P% | 1.725854 |
| FT% | 1.602426 |
| PTS | 1.431579 |
| FTA | 1.199014 |
| STL | 1.077389 |
| 3PA | 1.035149 |
| BLK | 0.907262 |
| DRB | 0.761083 |
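
The cell above fits the model by solving the normal equations, $(X^\top X)\,\beta = X^\top y$. The sketch below repeats that calculation on synthetic data (not the tutorial's data) and compares it with `np.linalg.lstsq`, a swapped-in alternative that avoids forming $X^\top X$ and tends to be more numerically robust when predictors are highly collinear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix (200 rows, 3 predictors) and known coefficients.
X = rng.normal(size=(200, 3))
true_coefficients = np.array([2.0, -1.0, 0.5])
y = X @ true_coefficients + rng.normal(scale=0.01, size=200)

# Normal equations, as in the cell above: solve (X^T X) b = X^T y.
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver: same answer here, without forming X^T X.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(b_normal, b_lstsq))
```

On well-conditioned data like this the two approaches agree; the difference only starts to matter when columns are nearly linear combinations of each other.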


As you can see above, field goal percentage has the largest positive coefficient in this model. Effective field goal percentage has a strongly negative coefficient, though given the presence of other field goal metrics in the model, this is to be expected, as there's a good amount of multicollinearity here. The high positive coefficient for Age and slightly negative Age$^2$ coefficient imply that the benefit of age levels off at a certain point, after which additional years progressively lower the predicted rating. We won't get into significance testing in this tutorial, though that would be the natural next step.
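
To make the age argument concrete, write the age-related part of the prediction as a quadratic in age $a$, holding the other variables fixed. The symbols $\beta_{\text{Age}}$ and $\beta_{\text{Age}^2}$ are introduced here for illustration and stand for the two fitted coefficients:

$$
f(a) = \beta_{\text{Age}}\, a + \beta_{\text{Age}^2}\, a^2,
\qquad
f'(a) = \beta_{\text{Age}} + 2\,\beta_{\text{Age}^2}\, a = 0
\;\Longrightarrow\;
a^{*} = -\frac{\beta_{\text{Age}}}{2\,\beta_{\text{Age}^2}}.
$$

With $\beta_{\text{Age}} > 0$ and $\beta_{\text{Age}^2} < 0$, the turning point $a^{*}$ is positive and marks the age at which the modeled contribution of age peaks before declining.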

Below, you can see the five best and worst rating predictions from this model. A few observations in the bottom five are particularly notable: both Michael Carter-Williams and [Hassan Whiteside](https://www.youtube.com/watch?v=XVxyBoYHexQ) had unexpected breakout seasons in the years in question, and Gordon Hayward suffered a freak season-ending injury after playing only five minutes in the first game of the season.

In [17]:
coefficients_as_np = coefficients_df.to_numpy()

predictions_data = nba_ratings_and_stats.loc[X_stats.index, ["Player", "Year", "Rating"]]
predictions_data["Predicted Rating"] = X_stats_as_np @ coefficients_as_np
predictions_data["Difference"] = abs(predictions_data["Predicted Rating"] - 
                                     predictions_data["Rating"])
predictions_data.sort_values(by = "Difference")

| | Player | Year | Rating | Predicted Rating | Difference |
|---|---|---|---|---|---|
| 2222 | T.J. Warren | 2018 | 81.0 | 81.001324 | 0.001324 |
| 1477 | Ian Mahinmi | 2019 | 72.0 | 72.003129 | 0.003129 |
| 3102 | DeAndre Jordan | 2017 | 85.0 | 84.995534 | 0.004466 |
| 3242 | T.J. Warren | 2017 | 76.0 | 75.993822 | 0.006178 |
| 2262 | Kyle Anderson | 2018 | 76.0 | 76.006237 | 0.006237 |
| ... | ... | ... | ... | ... | ... |
| 2057 | Gordon Hayward | 2018 | 88.0 | 70.111373 | 17.888627 |
| 6926 | Steve Novak | 2014 | 53.0 | 70.907822 | 17.907822 |
| 6353 | Miles Plumlee | 2014 | 56.0 | 75.829713 | 19.829713 |
| 6466 | Michael Carter-Williams | 2014 | 61.0 | 81.414428 | 20.414428 |


# Summary and references

This tutorial highlighted just a few elements of what is possible with scraping, merging, visualizing, and modeling data related to NBA 2K ratings. Much more detail about the libraries and the data used in this tutorial is available from the following links.

1. Requests: https://realpython.com/python-requests/
2. Pandas: https://pandas.pydata.org/
3. BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
4. Plotly Express: https://plot.ly/python/plotly-express/
5. Unidecode: https://pypi.org/project/Unidecode/
6. Numpy: https://numpy.org/
7. Warnings: https://docs.python.org/3/library/warnings.html
8. NBA 2K Ratings: https://www.2kratings.com
9. Basketball-Reference: https://www.basketball-reference.com/