<h1>NBA 2K20 Database</h1>
<em>Aaron Wollman, Kelsey Richardson Blackwell, Will Huang</em>
<hr>

This project creates a production database that contains both real-life and in-game data for the players in NBA 2K20.

In this notebook, the extract, transform, and load (ETL) process is carried out for two CSV files, whose data is placed into a database.

## Prerequisites

Before running this notebook, make sure to run the Prerequisites section in the <a href="README.md" target="_blank">README</a> for this project. 

Following those instructions creates a config.py file and the production database used in this notebook.

## Setup

In order for the code in this notebook to run, the dependencies in the next cell are required.

<em>Note that a config.py file is <b>required</b> for the next cell to run. 
    Follow the directions in the Prerequisites section to create this file.</em>

In [1]:
import pandas as pd
# TODO Other dependencies
# from config import username, password

In [2]:
csv_files = {
    "nba2k" : "data/nba2k20.csv",
    "player_stats" : "data/players_stats.csv"
}

In [3]:
import re 

## Extract

Once the dependencies are set up, the code imports the data to be worked on. Pandas is used to read this data into DataFrames, which are cleaned up in the next section. Both files are CSV files, which makes this step fairly easy.

### NBA 2K20 Statistics

This data contains player statistics from the video game NBA 2K20. The game only contains data from the 2019 - 2020 NBA season.

In [4]:
nba2k = pd.read_csv(csv_files['nba2k'])
nba2k.head()

Unnamed: 0,full_name,rating,jersey,team,position,b_day,height,weight,salary,country,draft_year,draft_round,draft_peak,college
0,LeBron James,97,#23,Los Angeles Lakers,F,12/30/84,6-9 / 2.06,250 lbs. / 113.4 kg.,$37436858,USA,2003,1,1,
1,Kawhi Leonard,97,#2,Los Angeles Clippers,F,06/29/91,6-7 / 2.01,225 lbs. / 102.1 kg.,$32742000,USA,2011,1,15,San Diego State
2,Giannis Antetokounmpo,96,#34,Milwaukee Bucks,F-G,12/06/94,6-11 / 2.11,242 lbs. / 109.8 kg.,$25842697,Greece,2013,1,15,
3,Kevin Durant,96,#7,Brooklyn Nets,F,09/29/88,6-10 / 2.08,230 lbs. / 104.3 kg.,$37199000,USA,2007,1,2,Texas
4,James Harden,96,#13,Houston Rockets,G,08/26/89,6-5 / 1.96,220 lbs. / 99.8 kg.,$38199000,USA,2009,1,3,Arizona State


### NBA Player Statistics

This data contains real-life player statistics for many seasons and leagues. The file is read into a DataFrame.

In [5]:
NBA_player_stats = pd.read_csv(csv_files['player_stats'])
NBA_player_stats.head()

Unnamed: 0,League,Season,Stage,Player,Team,GP,MIN,FGM,FGA,3PM,...,PTS,birth_year,birth_month,birth_date,height,height_cm,weight,weight_kg,nationality,high_school
0,NBA,2009 - 2010,Regular_Season,Kevin Durant,OKC,82,3239.3,794,1668,128,...,2472,1988.0,Sep,"Sep 29, 1988",06-Sep,206.0,240.0,109.0,United States,Montrose Christian School
1,NBA,2009 - 2010,Regular_Season,LeBron James,CLE,76,2965.6,768,1528,129,...,2258,1984.0,Dec,"Dec 30, 1984",06-Aug,203.0,250.0,113.0,United States,St. Vincent St. Mary High School
2,NBA,2009 - 2010,Regular_Season,Dwyane Wade,MIA,77,2792.4,719,1511,73,...,2045,1982.0,Jan,"Jan 17, 1982",06-Apr,193.0,220.0,100.0,United States,Harold L. Richards High School
3,NBA,2009 - 2010,Regular_Season,Dirk Nowitzki,DAL,81,3038.8,720,1496,51,...,2027,1978.0,Jun,"Jun 19, 1978",7-0,213.0,245.0,111.0,Germany,
4,NBA,2009 - 2010,Regular_Season,Kobe Bryant,LAL,73,2835.4,716,1569,99,...,1970,1978.0,Aug,"Aug 23, 1978",06-Jun,198.0,212.0,96.0,United States,Lower Merion High School


## Transform

Now that the data has been extracted, it needs to be cleaned up before it is loaded into the database.

This is done by first transforming each dataset individually. Afterward, both datasets are merged into one. Merging them makes it easier to reorganize the data into separate tables to be placed into the database.

For the merge to work correctly, the player names need to match exactly. All punctuation and spaces are removed and the names are converted to uppercase, which removes formatting differences that could affect the merge. The following function does this:

In [6]:
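def format_names(dataframe, name_column):
    """Return the names in name_column, uppercased and stripped of non-alphanumeric characters."""
    names = dataframe[name_column]
    # Remove spaces and punctuation so that formatting differences do not break the merge
    names = [re.sub('[^A-Za-z0-9]+', '', name).upper() for name in names]
    return names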
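
One caveat with this pattern is that `[^A-Za-z0-9]` also deletes accented letters instead of mapping them to their base letter, so a name spelled with diacritics in one file and without them in the other would produce different merge keys. Whether either CSV actually contains such names is not shown here, so the following is only a sketch under that assumption. `normalize_name` is a hypothetical helper, not part of this notebook, and the sample name is purely illustrative; the helper uses the standard-library `unicodedata` module to fold accents before applying the same regular expression.

```python
import re
import unicodedata


def normalize_name(name):
    """Hypothetical variant of the name cleanup that folds accents first."""
    # NFKD splits an accented letter into its base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters
    base_letters = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Same rule as format_names: remove non-alphanumerics, then uppercase
    return re.sub(r"[^A-Za-z0-9]+", "", base_letters).upper()


# Illustrative name written with explicit escapes for the two accented letters
sample = "Luka Don\u010di\u0107"

# Behaviour of the original pattern on its own, for comparison
original = re.sub(r"[^A-Za-z0-9]+", "", sample).upper()

print(original)                        # LUKADONI
print(normalize_name(sample))          # LUKADONCIC
print(normalize_name("LeBron James"))  # LEBRONJAMES
```

Names without diacritics come out the same as with `format_names`, so a change like this would only affect keys that currently lose letters.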

### NBA 2K20 Statistics

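For the NBA 2K20 Statistics, the `full_name` column is renamed to `Player`, the `#` prefix is removed from the jersey number, the height is reduced to its feet-inches part, the dollar sign is removed from the salary, and a `merge_name` column is added for the merge.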

In [7]:
nba2k.rename(columns = {'full_name':'Player'}, inplace=True)

In [8]:
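# Keep only the number from the jersey, e.g. '#23' -> '23'
nba2k['jersey'] = nba2k['jersey'].apply(lambda x: x.split('#')[-1])
# Keep only the feet-inches part of the height, e.g. '6-9 / 2.06' -> '6-9'
nba2k['height'] = nba2k['height'].apply(lambda x: x.split('/')[0].strip())
# Remove the dollar sign from the salary (the value remains a string)
nba2k['salary'] = nba2k['salary'].apply(lambda x: x.replace('$', ''))
# Add the normalized name used as the merge key
nba2k["merge_name"] = format_names(nba2k, "Player")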

In [9]:
nba2k.head()

Unnamed: 0,Player,rating,jersey,team,position,b_day,height,weight,salary,country,draft_year,draft_round,draft_peak,college,merge_name
0,LeBron James,97,23,Los Angeles Lakers,F,12/30/84,6-9,250 lbs. / 113.4 kg.,37436858,USA,2003,1,1,,LEBRONJAMES
1,Kawhi Leonard,97,2,Los Angeles Clippers,F,06/29/91,6-7,225 lbs. / 102.1 kg.,32742000,USA,2011,1,15,San Diego State,KAWHILEONARD
2,Giannis Antetokounmpo,96,34,Milwaukee Bucks,F-G,12/06/94,6-11,242 lbs. / 109.8 kg.,25842697,Greece,2013,1,15,,GIANNISANTETOKOUNMPO
3,Kevin Durant,96,7,Brooklyn Nets,F,09/29/88,6-10,230 lbs. / 104.3 kg.,37199000,USA,2007,1,2,Texas,KEVINDURANT
4,James Harden,96,13,Houston Rockets,G,08/26/89,6-5,220 lbs. / 99.8 kg.,38199000,USA,2009,1,3,Arizona State,JAMESHARDEN
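
After these cells, `height`, `weight`, and `salary` are still strings. If numeric values are wanted before loading, the raw formats shown in the Extract output (`6-9 / 2.06`, `250 lbs. / 113.4 kg.`, `$37436858`) could be parsed along the following lines. The helper names are hypothetical, and the target units (inches, pounds, whole dollars) are assumptions rather than the schema of the production database.

```python
import re


def parse_height_inches(raw):
    """Convert '6-9 / 2.06' (or the already trimmed '6-9') to total inches."""
    feet_inches = raw.split("/")[0].strip()
    feet, inches = feet_inches.split("-")
    return int(feet) * 12 + int(inches)


def parse_weight_lbs(raw):
    """Extract the pounds value from a string such as '250 lbs. / 113.4 kg.'."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*lbs", raw)
    if match is None:
        raise ValueError(f"Unrecognized weight format: {raw!r}")
    return float(match.group(1))


def parse_salary(raw):
    """Convert '$37436858' to an integer number of dollars."""
    return int(raw.replace("$", "").replace(",", ""))


print(parse_height_inches("6-9 / 2.06"))         # 81
print(parse_weight_lbs("250 lbs. / 113.4 kg."))  # 250.0
print(parse_salary("$37436858"))                 # 37436858
```

Each helper works on a single value, so it could be passed to `Series.apply` in the same way as the lambdas above.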


### NBA Player Statistics

Before we merge the NBA Player Statistics with the NBA 2K20 data, we need to do a little cleaning.

Only rows for the NBA league in the 2019 - 2020 season are kept; all other leagues and seasons are dropped from the table. Columns that are already in the NBA 2K20 table are also dropped. 

Finally, to match the database, the height is converted from centimeters to feet.

In [10]:
# Drop all other leagues besides NBA and all years except 2019-2020
NBA = NBA_player_stats["League"] == "NBA"
Season = NBA_player_stats["Season"] == "2019 - 2020"
NBA_players = NBA_player_stats[NBA & Season]

In [11]:
# Drop unnecessary columns
NBA_players_clean = NBA_players.drop(columns=["birth_year", "birth_month", "birth_date", "height", "weight_kg"])

In [12]:
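# Convert height from cm to feet (1 ft = 30.48 cm)
# assign() returns a new DataFrame, which avoids writing into a filtered slice
NBA_players_clean = NBA_players_clean.assign(height_ft=NBA_players_clean["height_cm"] / 30.48)
height_NBA_players = NBA_players_clean.drop(columns=["height_cm"])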
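
As a check on the arithmetic: one foot is 30.48 cm, so 203 cm is 203 / 30.48 ≈ 6.66 ft. Decimal feet are not the same as the `6-9` feet-inches notation used in the NBA 2K20 data: 6.66 ft is about 6 ft 8 in, not 6 ft 6 in. A minimal sketch of both conversions, using hypothetical helper names that are not part of this notebook:

```python
CM_PER_FOOT = 30.48
CM_PER_INCH = 2.54


def cm_to_feet(height_cm):
    """Convert centimeters to decimal feet, as done for the height_ft column."""
    return height_cm / CM_PER_FOOT


def cm_to_feet_inches(height_cm):
    """Convert centimeters to a 'feet-inches' string, rounded to the nearest inch."""
    # Round the total inches first so that 11.6 in never becomes 'x-12'
    total_inches = round(height_cm / CM_PER_INCH)
    feet, inches = divmod(total_inches, 12)
    return f"{feet}-{inches}"


print(round(cm_to_feet(203.0), 6))  # 6.660105
print(cm_to_feet_inches(203.0))     # 6-8
print(cm_to_feet_inches(213.0))     # 7-0
```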

In [13]:
# Further cleaning: state the unit in the weight column name and add the merge key
final_NBA_players = height_NBA_players.rename(columns={"weight": "weight_lbs"})
final_NBA_players["merge_name"] = format_names(final_NBA_players, "Player")
final_NBA_players.head()

Unnamed: 0,League,Season,Stage,Player,Team,GP,MIN,FGM,FGA,3PM,...,REB,AST,STL,BLK,PTS,weight_lbs,nationality,high_school,height_ft,merge_name
36950,NBA,2019 - 2020,Regular_Season,James Harden,HOU,68,2482.6,672,1514,299,...,446,512,125,60,2335,220.0,United States,Artesia High School,6.430446,JAMESHARDEN
36951,NBA,2019 - 2020,Regular_Season,Damian Lillard,POR,66,2473.7,624,1349,270,...,284,530,70,22,1978,195.0,United States,Oakland High School,6.266404,DAMIANLILLARD
36952,NBA,2019 - 2020,Regular_Season,Devin Booker,PHX,70,2511.8,627,1283,141,...,297,456,49,18,1863,206.0,United States,Moss Point High School,6.496063,DEVINBOOKER
36953,NBA,2019 - 2020,Regular_Season,Giannis Antetokounmpo,MIL,63,1916.9,685,1238,89,...,856,354,61,66,1857,242.0,Greece / Nigeria,,6.922572,GIANNISANTETOKOUNMPO
36954,NBA,2019 - 2020,Regular_Season,Trae Young,ATL,60,2120.1,546,1249,205,...,255,560,65,8,1778,180.0,United States,Norman High School,6.167979,TRAEYOUNG


### Merge & Reorganize Statistics

Now that the datasets are cleaned up, the tables need to be merged. The merge is done on the `merge_name` column that was added to both datasets.

In [14]:
nba_combined_df = nba2k.merge(final_NBA_players, on="merge_name")
nba_combined_df.head()

Unnamed: 0,Player_x,rating,jersey,team,position,b_day,height,weight,salary,country,...,DRB,REB,AST,STL,BLK,PTS,weight_lbs,nationality,high_school,height_ft
0,LeBron James,97,23,Los Angeles Lakers,F,12/30/84,6-9,250 lbs. / 113.4 kg.,37436858,USA,...,459,524,684,78,36,1698,250.0,United States,St. Vincent St. Mary High School,6.660105
1,Kawhi Leonard,97,2,Los Angeles Clippers,F,06/29/91,6-7,225 lbs. / 102.1 kg.,32742000,USA,...,348,402,280,103,33,1543,230.0,United States,Martin Luther King High School,6.594488
2,Giannis Antetokounmpo,96,34,Milwaukee Bucks,F-G,12/06/94,6-11,242 lbs. / 109.8 kg.,25842697,Greece,...,716,856,354,61,66,1857,242.0,Greece / Nigeria,,6.922572
3,James Harden,96,13,Houston Rockets,G,08/26/89,6-5,220 lbs. / 99.8 kg.,38199000,USA,...,376,446,512,125,60,2335,220.0,United States,Artesia High School,6.430446
4,Anthony Davis,94,3,Los Angeles Lakers,F-C,03/11/93,6-10,222 lbs. / 100.7 kg.,27093019,USA,...,435,578,200,91,143,1618,253.0,United States,Perspectives Charter Academy,6.824147
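
`merge` performs an inner join by default, so players whose `merge_name` appears in only one dataset are dropped without warning, and a key that occurs more than once on one side yields more than one combined row. pandas can report unmatched rows through `merge(..., how="outer", indicator=True)`; the library-independent sketch below shows the same bookkeeping on plain lists of keys. `merge_report` and the sample keys are hypothetical and chosen only for illustration.

```python
from collections import Counter


def merge_report(left_keys, right_keys):
    """Summarize how two lists of merge keys would line up in an inner join."""
    left, right = set(left_keys), set(right_keys)
    right_counts = Counter(right_keys)
    return {
        # Keys present on both sides survive an inner join
        "matched": sorted(left & right),
        # Keys present on one side only are dropped by an inner join
        "left_only": sorted(left - right),
        "right_only": sorted(right - left),
        # Keys repeated on the right produce several rows per left row
        "duplicated_right": sorted(k for k, n in right_counts.items() if n > 1),
    }


# Illustrative keys only, not taken from the full datasets
game_keys = ["LEBRONJAMES", "KEVINDURANT", "JAMESHARDEN"]
stats_keys = ["JAMESHARDEN", "LEBRONJAMES", "LEBRONJAMES", "TRAEYOUNG"]

report = merge_report(game_keys, stats_keys)
for label, keys in report.items():
    print(label, keys)
```

In this notebook the two lists would be `nba2k["merge_name"]` and `final_NBA_players["merge_name"]`. A non-empty `duplicated_right` would be expected if the statistics file holds more than one `Stage` row per player, which is an assumption worth checking against the data.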


## Load

Finally, the data can be loaded into the production database for clients to use. The production database is a relational SQL database with the following tables:
<ul>
    <li><em>Table</em> - Description</li>
</ul>
The database is structured in this way because...
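
Since the schema is not spelled out here, the following is only an illustrative sketch of how merged rows could be split across a players table, a teams table, and a junction table linking the two, in the spirit of the Players, Teams, and Team_Players sections below. It uses an in-memory SQLite database from the standard library as a stand-in for the production database, and every table and column definition in it is an assumption.

```python
import sqlite3

# (player, team, rating) tuples taken from the sample output earlier in this notebook
rows = [
    ("LeBron James", "Los Angeles Lakers", 97),
    ("Anthony Davis", "Los Angeles Lakers", 94),
    ("James Harden", "Houston Rockets", 96),
]

# In-memory SQLite stands in for the real production database (assumption)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teams (
    team_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE players (
    player_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    rating INTEGER
);
CREATE TABLE team_players (
    team_id INTEGER NOT NULL REFERENCES teams(team_id),
    player_id INTEGER NOT NULL REFERENCES players(player_id),
    PRIMARY KEY (team_id, player_id)
);
""")

for name, team, rating in rows:
    # Insert each team once; repeated names are skipped thanks to the UNIQUE constraint
    conn.execute("INSERT OR IGNORE INTO teams (name) VALUES (?)", (team,))
    team_id = conn.execute(
        "SELECT team_id FROM teams WHERE name = ?", (team,)
    ).fetchone()[0]
    player_id = conn.execute(
        "INSERT INTO players (name, rating) VALUES (?, ?)", (name, rating)
    ).lastrowid
    # Link the player to the team through the junction table
    conn.execute(
        "INSERT INTO team_players (team_id, player_id) VALUES (?, ?)",
        (team_id, player_id),
    )
conn.commit()

# Join the three tables back together to list one team's players by rating
lakers = conn.execute(
    """
    SELECT p.name
    FROM players AS p
    JOIN team_players AS tp ON tp.player_id = p.player_id
    JOIN teams AS t ON t.team_id = tp.team_id
    WHERE t.name = ?
    ORDER BY p.rating DESC
    """,
    ("Los Angeles Lakers",),
).fetchall()
team_count = conn.execute("SELECT COUNT(*) FROM teams").fetchone()[0]

print(lakers)      # [('LeBron James',), ('Anthony Davis',)]
print(team_count)  # 2
```

Keeping teams in their own table avoids repeating the team name on every player row, and the junction table lets the join at the end reconstruct the relationship.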

### Players Table

In [None]:
# TODO: Load up data into the production database.

### Teams Table

In [None]:
# TODO: Load up data into the production database.

### Team_Players Table

In [None]:
# TODO: Load up data into the production database.

### Statistics Table

In [None]:
# TODO: Load up data into the production database.

## Production

To verify that this ETL project works correctly, run database/queries.sql. The queries in this file check that the data was cleaned up correctly, so that joins between tables work.