# INFO 2950 Phase 2: Group Project
__Group Members__: Adya Bhargava (ab2446), Akhil Damani (ad674), Madeline Demers (mkd79)


## Research Question
Can we calculate the probability that a player is selected MVP for a season based on their stats for the season? 

## Data Cleaning & Collection
A separate notebook in this repository, 'data_collection.ipynb', contains and explains the code we used to collect our data by scraping the NBA site, ESPN, and another site. To do this, we had to find data that was legal to scrape and not locked, which proved to be a difficult process. For example, the NBA site lets users access the player data, but not the team data. Similar sites were also hard to access, so to get the list of MVPs for the past 20 years we ended up scraping a text article. Below, we show the code and explanation for how we cleaned this data after saving it to .csv files. Please refer to 'data_collection.ipynb' for more detail about the data collection process.



In [102]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import duckdb
import numpy as np
import time

**Player Stat Data Cleaning** 
In the 'data_collection.ipynb' file we collected this data from the NBA stats site. Reading the resulting .csv file gives us a large, good-quality data set documenting the regular-season statistics for all of the players in the league over the past 20 seasons (2003-04 through 2022-23). Below, we drop some unnecessary columns that are irrelevant to our analysis and rename the 'Year' column to 'SEASON' so that headers are consistent across our data frames. A preview of the data frame is shown below.

In [103]:
player_stats_df = pd.read_csv('player_stats.csv')
# Drop the saved index column and the ID/rank columns we do not analyze
player_stats_df = player_stats_df.drop(columns=["Unnamed: 0", "PLAYER_ID", "TEAM_ID", "RANK"])
# Rename 'Year' to 'SEASON' to match the headers used in our other data frames
player_stats_df = player_stats_df.rename(columns={'Year': 'SEASON'})
player_stats_df



Unnamed: 0,PLAYER,TEAM,GP,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,...,OREB,DREB,REB,AST,STL,BLK,TOV,PTS,EFF,SEASON
0,Tracy McGrady,ORL,67,39.9,9.7,23.4,0.417,2.6,7.7,0.339,...,1.4,4.6,6.0,5.5,1.4,0.6,2.7,28.0,23.7,2003-04
1,Peja Stojakovic,SAC,81,40.3,8.2,17.1,0.480,3.0,6.8,0.433,...,1.1,5.1,6.3,2.1,1.3,0.2,1.9,24.2,23.0,2003-04
2,Kevin Garnett,MIN,82,39.4,9.8,19.6,0.499,0.1,0.5,0.256,...,3.0,10.9,13.9,5.0,1.5,2.2,2.6,24.2,33.1,2003-04
3,Kobe Bryant,LAL,65,37.7,7.9,18.1,0.438,1.1,3.3,0.327,...,1.6,3.9,5.5,5.1,1.7,0.4,2.6,24.0,22.7,2003-04
4,Paul Pierce,BOS,80,38.8,7.5,18.7,0.402,1.4,4.8,0.299,...,0.9,5.7,6.5,5.1,1.6,0.7,3.8,23.0,20.5,2003-04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4381,P.J. Tucker,PHI,75,25.6,1.3,3.0,0.427,0.7,1.9,0.393,...,1.3,2.7,3.9,0.8,0.5,0.2,0.6,3.5,6.6,2022-23
4382,Miles McBride,NYK,64,11.9,1.2,3.4,0.358,0.6,2.1,0.299,...,0.2,0.6,0.8,1.1,0.6,0.1,0.4,3.5,3.3,2022-23
4383,Anthony Gill,WAS,59,10.6,1.2,2.2,0.538,0.1,0.5,0.138,...,0.6,1.1,1.7,0.6,0.1,0.2,0.3,3.3,4.3,2022-23
4384,Christian Koloko,TOR,58,13.8,1.2,2.6,0.480,0.0,0.2,0.083,...,1.4,1.5,2.9,0.5,0.4,1.0,0.3,3.1,5.9,2022-23


To make merging, sorting, and analyzing our data easier, we wanted to be sure that the team abbreviations in the original file would not impede our ability to merge with our other data frames, such as the team statistics. So, we added a column that maps each abbreviation to the full team name. The mapping also covers teams that changed locations or names over the 20-year span, so that all relevant names and abbreviations in the data set are included. The new column, 'FULL_NAME', was then moved next to the 'TEAM' column it corresponds to, for visual ease of use.

In [104]:
team_name_mapping = {
    'ORL': 'Orlando Magic',
    'SAC': 'Sacramento Kings',
    'MIN': 'Minnesota Timberwolves',
    'LAL': 'Los Angeles Lakers',
    'BOS': 'Boston Celtics',
    'NOH': 'New Orleans Hornets',
    'TOR': 'Toronto Raptors',
    'SAS': 'San Antonio Spurs',
    'DAL': 'Dallas Mavericks',
    'MIL': 'Milwaukee Bucks',
    'DEN': 'Denver Nuggets',
    'CLE': 'Cleveland Cavaliers',
    'LAC': 'LA Clippers',
    'NYK': 'New York Knicks',
    'IND': 'Indiana Pacers',
    'POR': 'Portland Trail Blazers',
    'PHX': 'Phoenix Suns',
    'GSW': 'Golden State Warriors',
    'NJN': 'New Jersey Nets',
    'ATL': 'Atlanta Hawks',
    'SEA': 'Seattle SuperSonics',
    'MEM': 'Memphis Grizzlies',
    'DET': 'Detroit Pistons',
    'HOU': 'Houston Rockets',
    'MIA': 'Miami Heat',
    'CHI': 'Chicago Bulls',
    'UTA': 'Utah Jazz',
    'PHI': 'Philadelphia 76ers',
    'WAS': 'Washington Wizards',
    'CHA': 'Charlotte Bobcats',
    'NOK': 'New Orleans/Oklahoma City Hornets',
    'OKC': 'Oklahoma City Thunder',
    'BKN': 'Brooklyn Nets',
    'NOP': 'New Orleans Pelicans'
}

player_stats_df['FULL_NAME'] = player_stats_df['TEAM'].map(team_name_mapping)
full_name_column = player_stats_df.pop("FULL_NAME")
player_stats_df.insert(player_stats_df.columns.get_loc("TEAM") + 1, "FULL_NAME", full_name_column)

player_stats_df


Unnamed: 0,PLAYER,TEAM,FULL_NAME,GP,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,...,OREB,DREB,REB,AST,STL,BLK,TOV,PTS,EFF,SEASON
0,Tracy McGrady,ORL,Orlando Magic,67,39.9,9.7,23.4,0.417,2.6,7.7,...,1.4,4.6,6.0,5.5,1.4,0.6,2.7,28.0,23.7,2003-04
1,Peja Stojakovic,SAC,Sacramento Kings,81,40.3,8.2,17.1,0.480,3.0,6.8,...,1.1,5.1,6.3,2.1,1.3,0.2,1.9,24.2,23.0,2003-04
2,Kevin Garnett,MIN,Minnesota Timberwolves,82,39.4,9.8,19.6,0.499,0.1,0.5,...,3.0,10.9,13.9,5.0,1.5,2.2,2.6,24.2,33.1,2003-04
3,Kobe Bryant,LAL,Los Angeles Lakers,65,37.7,7.9,18.1,0.438,1.1,3.3,...,1.6,3.9,5.5,5.1,1.7,0.4,2.6,24.0,22.7,2003-04
4,Paul Pierce,BOS,Boston Celtics,80,38.8,7.5,18.7,0.402,1.4,4.8,...,0.9,5.7,6.5,5.1,1.6,0.7,3.8,23.0,20.5,2003-04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4381,P.J. Tucker,PHI,Philadelphia 76ers,75,25.6,1.3,3.0,0.427,0.7,1.9,...,1.3,2.7,3.9,0.8,0.5,0.2,0.6,3.5,6.6,2022-23
4382,Miles McBride,NYK,New York Knicks,64,11.9,1.2,3.4,0.358,0.6,2.1,...,0.2,0.6,0.8,1.1,0.6,0.1,0.4,3.5,3.3,2022-23
4383,Anthony Gill,WAS,Washington Wizards,59,10.6,1.2,2.2,0.538,0.1,0.5,...,0.6,1.1,1.7,0.6,0.1,0.2,0.3,3.3,4.3,2022-23
4384,Christian Koloko,TOR,Toronto Raptors,58,13.8,1.2,2.6,0.480,0.0,0.2,...,1.4,1.5,2.9,0.5,0.4,1.0,0.3,3.1,5.9,2022-23
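
Because `Series.map` returns `NaN` for any key that is missing from the dictionary, one way to confirm that the mapping covers every abbreviation is to look for missing values in the mapped column. The following is a minimal sketch on a small made-up data frame. The abbreviation 'XYZ' and the helper name `unmapped_abbreviations` are hypothetical and only serve to illustrate the check.

```python
import pandas as pd

# Toy stand-in for player_stats_df; 'XYZ' is a made-up abbreviation
toy_df = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C"],
    "TEAM": ["ORL", "SAC", "XYZ"],
})

# Small subset of the mapping used above
toy_mapping = {"ORL": "Orlando Magic", "SAC": "Sacramento Kings"}


def unmapped_abbreviations(df, mapping, column="TEAM"):
    """Return the sorted abbreviations in `column` that have no entry in `mapping`."""
    mapped = df[column].map(mapping)
    return sorted(df.loc[mapped.isna(), column].unique())


missing = unmapped_abbreviations(toy_df, toy_mapping)
print(missing)
```

Running the same function on `player_stats_df` with `team_name_mapping` should return an empty list if every abbreviation is covered. Note that this check cannot detect an abbreviation that refers to different team names in different seasons.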


The "SEASON" column was moved next to "FULL_NAME" so that each row's season is visible at a glance when viewing the wide data frame, as shown below.

In [105]:
season_column = player_stats_df.pop("SEASON")
player_stats_df.insert(player_stats_df.columns.get_loc("FULL_NAME") + 1, "SEASON", season_column)
player_stats_df


Unnamed: 0,PLAYER,TEAM,FULL_NAME,SEASON,GP,MIN,FGM,FGA,FG_PCT,FG3M,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PTS,EFF
0,Tracy McGrady,ORL,Orlando Magic,2003-04,67,39.9,9.7,23.4,0.417,2.6,...,0.796,1.4,4.6,6.0,5.5,1.4,0.6,2.7,28.0,23.7
1,Peja Stojakovic,SAC,Sacramento Kings,2003-04,81,40.3,8.2,17.1,0.480,3.0,...,0.927,1.1,5.1,6.3,2.1,1.3,0.2,1.9,24.2,23.0
2,Kevin Garnett,MIN,Minnesota Timberwolves,2003-04,82,39.4,9.8,19.6,0.499,0.1,...,0.791,3.0,10.9,13.9,5.0,1.5,2.2,2.6,24.2,33.1
3,Kobe Bryant,LAL,Los Angeles Lakers,2003-04,65,37.7,7.9,18.1,0.438,1.1,...,0.852,1.6,3.9,5.5,5.1,1.7,0.4,2.6,24.0,22.7
4,Paul Pierce,BOS,Boston Celtics,2003-04,80,38.8,7.5,18.7,0.402,1.4,...,0.819,0.9,5.7,6.5,5.1,1.6,0.7,3.8,23.0,20.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4381,P.J. Tucker,PHI,Philadelphia 76ers,2022-23,75,25.6,1.3,3.0,0.427,0.7,...,0.826,1.3,2.7,3.9,0.8,0.5,0.2,0.6,3.5,6.6
4382,Miles McBride,NYK,New York Knicks,2022-23,64,11.9,1.2,3.4,0.358,0.6,...,0.667,0.2,0.6,0.8,1.1,0.6,0.1,0.4,3.5,3.3
4383,Anthony Gill,WAS,Washington Wizards,2022-23,59,10.6,1.2,2.2,0.538,0.1,...,0.731,0.6,1.1,1.7,0.6,0.1,0.2,0.3,3.3,4.3
4384,Christian Koloko,TOR,Toronto Raptors,2022-23,58,13.8,1.2,2.6,0.480,0.0,...,0.627,1.4,1.5,2.9,0.5,0.4,1.0,0.3,3.1,5.9


**Team Stats Data Cleaning**
Once this data was collected and saved to a .csv file, we renamed columns to match the desired headers and removed string oddities, such as the '*' that marked a playoff team on the original site. We removed this character so that the team names are consistent with the other data frame, which makes it easier to merge and compare the data properly.

In [116]:

team_stats_df = pd.read_csv('team_data.csv')
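# Team names were saved in the unnamed first column; remove the literal '*' playoff marker
team_stats_df['Team'] = team_stats_df["Unnamed: 0"].str.replace('*', '', regex=False)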
team_stats_df = team_stats_df.drop("Unnamed: 0", axis=1)

team_stats_df

#Clean this data

Unnamed: 0,w,l,w/l %,avg pts,opp avg pts,srs,Team
0,49,33,0.598,95.4,90.1,4.42,New Jersey Nets
1,48,34,0.585,96.8,94.5,1.76,Philadelphia 76ers
2,44,38,0.537,92.7,93.1,-0.75,Boston Celtics
3,42,40,0.512,98.5,98.4,-0.39,Orlando Magic
4,37,45,0.451,91.5,92.5,-1.47,Washington Wizards
5,37,45,0.451,95.9,97.2,-1.61,New York Knicks
6,25,57,0.305,85.6,90.6,-5.13,Miami Heat
7,50,32,0.61,91.4,87.7,2.97,Detroit Pistons
8,48,34,0.585,96.8,93.3,2.79,Indiana Pacers
9,47,35,0.573,93.9,91.8,1.52,New Orleans Hornets
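
The `'*'` character is a regular-expression metacharacter, and the default value of the `regex` argument of `Series.str.replace` has differed between pandas versions, so passing `regex=False` makes the literal replacement explicit. Below is a minimal sketch on made-up values (the names are illustrative and are not read from 'team_data.csv'). The extra `str.strip()` is an assumption in case stray whitespace is left behind.

```python
import pandas as pd

# Illustrative names; a trailing '*' marks a playoff team
raw_names = pd.Series(["New Jersey Nets*", "Miami Heat", "Detroit Pistons*"])

# regex=False treats '*' as a plain character; strip() removes stray whitespace
clean_names = raw_names.str.replace("*", "", regex=False).str.strip()

print(clean_names.tolist())
```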


**Cleaning Data about Historical MVPs**
This data is the list of MVPs for the seasons that we collected data on. We joined it with player_stats_df to attach each MVP's team (abbreviation and full name) as a reference point. Columns were renamed as appropriate so that headers are consistent across the data frames.

In [107]:

mvp_df = pd.read_csv('mvp_historical.csv')

mvp_team_df = duckdb.sql("""
SELECT mvp_df.Year, mvp_df.MVP_Name, player_stats_df.TEAM, player_stats_df.FULL_NAME
FROM mvp_df
LEFT JOIN player_stats_df ON mvp_df.Year = player_stats_df.SEASON AND mvp_df.MVP_Name = player_stats_df.PLAYER
ORDER BY mvp_df.Year
""").df()

mvp_team_df = mvp_team_df.rename(columns={'Year': 'SEASON', 'MVP_Name': 'PLAYER'})

mvp_team_df


Unnamed: 0,SEASON,PLAYER,TEAM,FULL_NAME
0,2003-04,Kevin Garnett,MIN,Minnesota Timberwolves
1,2004-05,Steve Nash,PHX,Phoenix Suns
2,2005-06,Steve Nash,PHX,Phoenix Suns
3,2006-07,Dirk Nowitzki,DAL,Dallas Mavericks
4,2007-08,Kobe Bryant,LAL,Los Angeles Lakers
5,2008-09,LeBron James,CLE,Cleveland Cavaliers
6,2009-10,LeBron James,CLE,Cleveland Cavaliers
7,2010-11,Derrick Rose,CHI,Chicago Bulls
8,2011-12,LeBron James,MIA,Miami Heat
9,2012-13,LeBron James,MIA,Miami Heat
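
A left join keeps every MVP row even when no matching player row exists, in which case TEAM and FULL_NAME come back empty. The following is a rough sketch of how such mismatches (for example, from differently spelled names) could be detected with the `indicator` option of `DataFrame.merge`. The toy frames and the deliberately mismatched name are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the MVP list and the player stats
toy_mvp = pd.DataFrame({
    "SEASON": ["2003-04", "2004-05"],
    "PLAYER": ["Kevin Garnett", "Steve Nash Jr."],  # second name deliberately mismatched
})
toy_players = pd.DataFrame({
    "SEASON": ["2003-04", "2004-05"],
    "PLAYER": ["Kevin Garnett", "Steve Nash"],
    "TEAM": ["MIN", "PHX"],
})

# indicator=True adds a '_merge' column that says where each row was found
merged = toy_mvp.merge(toy_players, on=["SEASON", "PLAYER"], how="left", indicator=True)

# Rows found only in the MVP list had no match in the player stats
unmatched = merged.loc[merged["_merge"] == "left_only", ["SEASON", "PLAYER"]]
print(unmatched)
```

An empty `unmatched` frame would suggest that every MVP was matched to a player row for that season.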


**Merging Data About Players' Rank Within Their Team**
TO DO: Once we have team data alongside player data, we can merge these data frames and attempt to rank the players within their teams.
Describe process here


In [108]:

# Merge dataframes on 'SEASON' and 'FULL_NAME' columns
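
One possible way to rank players within their teams, sketched here on made-up numbers, is to group by season and team and then rank by a chosen statistic. Using 'EFF' as the ranking statistic and the column name 'TEAM_RANK' are assumptions for illustration, not decisions we have made.

```python
import pandas as pd

# Toy stand-in for player_stats_df with made-up players and values
toy_stats = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C", "Player D"],
    "FULL_NAME": ["Orlando Magic", "Orlando Magic", "Boston Celtics", "Boston Celtics"],
    "SEASON": ["2003-04"] * 4,
    "EFF": [23.7, 10.2, 20.5, 25.1],
})

# Rank 1 = highest EFF on that team in that season; tied players share the lowest rank
toy_stats["TEAM_RANK"] = (
    toy_stats.groupby(["SEASON", "FULL_NAME"])["EFF"]
    .rank(method="min", ascending=False)
    .astype(int)
)
print(toy_stats)
```

Grouping on both 'SEASON' and 'FULL_NAME' keeps the ranking separate for each team in each season, which matches the keys we plan to merge on.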


## Data Description

- __What are the observations (rows) and the attributes (columns)?__
- __Why was this dataset created?__
- __Who funded the creation of the dataset?__
- __What processes might have influenced what data was observed and recorded and what was not?__
- __What preprocessing was done, and how did the data come to be in the form that you are using?__
- __If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?__
- __Where can your raw source data be found, if applicable? Provide a link to the raw data (hosted on Github, in a Cornell Google Drive or Cornell Box).__


## Data Limitations

Fill in here!

## Exploratory Data Analysis

TO DO 

In [109]:
#Data Analysis

## Questions for Reviewers
1. Is our research question sufficiently in-depth for the scope of the project assignment? If not, how would you suggest we modify it to better reflect the project requirements?
2. TO DO
3. TO DO
