# NBA Data: Data Wrangling

## About [Dataset](https://www.kaggle.com/datasets/nathanlauga/nba-games?select=games.csv) (copied from Kaggle)

### Context
This dataset was collected to work on NBA games data. I used the [nba stats website](https://stats.nba.com/) to create this dataset.

You can find more details about data collection in my GitHub repo here : [nba predictor repo](https://github.com/Nathanlauga/nba-predictor).

If you want more information about this API endpoint, feel free to visit the nba_api GitHub repo, which documents each endpoint: [link here](https://github.com/swar/nba_api/blob/master/docs/table_of_contents.md)

### Content
You can find 5 datasets :

* **games.csv :** all games from 2004 season to last update with the date, teams and some details like number of points, etc.
* **games_details.csv :** details of games dataset, all statistics of players for a given game
* **players.csv :** players details (name)
* **ranking.csv :** ranking of NBA given a day (split into west and east on CONFERENCE column)
* **teams.csv :** all teams of NBA

# Imports

In [28]:
import numpy as np
import pandas as pd
#from library.sb_utils import save_file
# ! pip install library --user

# Load Data

In [29]:
games = pd.read_csv('data/games.csv')
games_details = pd.read_csv('data/games_details.csv')
players = pd.read_csv('data/players.csv')
ranking = pd.read_csv('data/ranking.csv')
teams = pd.read_csv('data/teams.csv')



# Data Cleaning
There are 5 datasets, so this section has 5 subsections, one per dataset.

## Games

In [30]:
games.head()

Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,AST_home,REB_home,TEAM_ID_away,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS
0,2022-03-12,22101005,Final,1610612748,1610612750,2021,1610612748,104.0,0.398,0.76,...,23.0,53.0,1610612750,113.0,0.422,0.875,0.357,21.0,46.0,0
1,2022-03-12,22101006,Final,1610612741,1610612739,2021,1610612741,101.0,0.443,0.933,...,20.0,46.0,1610612739,91.0,0.419,0.824,0.208,19.0,40.0,1
2,2022-03-12,22101007,Final,1610612759,1610612754,2021,1610612759,108.0,0.412,0.813,...,28.0,52.0,1610612754,119.0,0.489,1.0,0.389,23.0,47.0,0
3,2022-03-12,22101008,Final,1610612744,1610612749,2021,1610612744,122.0,0.484,0.933,...,33.0,55.0,1610612749,109.0,0.413,0.696,0.386,27.0,39.0,1
4,2022-03-12,22101009,Final,1610612743,1610612761,2021,1610612743,115.0,0.551,0.75,...,32.0,39.0,1610612761,127.0,0.471,0.76,0.387,28.0,50.0,0


I prefer my column names in lower case with underscores, *like_this*.

In [31]:
games.columns = games.columns.str.replace(" ", "_").str.lower()
games

Unnamed: 0,game_date_est,game_id,game_status_text,home_team_id,visitor_team_id,season,team_id_home,pts_home,fg_pct_home,ft_pct_home,...,ast_home,reb_home,team_id_away,pts_away,fg_pct_away,ft_pct_away,fg3_pct_away,ast_away,reb_away,home_team_wins
0,2022-03-12,22101005,Final,1610612748,1610612750,2021,1610612748,104.0,0.398,0.760,...,23.0,53.0,1610612750,113.0,0.422,0.875,0.357,21.0,46.0,0
1,2022-03-12,22101006,Final,1610612741,1610612739,2021,1610612741,101.0,0.443,0.933,...,20.0,46.0,1610612739,91.0,0.419,0.824,0.208,19.0,40.0,1
2,2022-03-12,22101007,Final,1610612759,1610612754,2021,1610612759,108.0,0.412,0.813,...,28.0,52.0,1610612754,119.0,0.489,1.000,0.389,23.0,47.0,0
3,2022-03-12,22101008,Final,1610612744,1610612749,2021,1610612744,122.0,0.484,0.933,...,33.0,55.0,1610612749,109.0,0.413,0.696,0.386,27.0,39.0,1
4,2022-03-12,22101009,Final,1610612743,1610612761,2021,1610612743,115.0,0.551,0.750,...,32.0,39.0,1610612761,127.0,0.471,0.760,0.387,28.0,50.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25791,2014-10-06,11400007,Final,1610612737,1610612740,2014,1610612737,93.0,0.419,0.821,...,24.0,50.0,1610612740,87.0,0.366,0.643,0.375,17.0,43.0,1
25792,2014-10-06,11400004,Final,1610612741,1610612764,2014,1610612741,81.0,0.338,0.719,...,18.0,40.0,1610612764,85.0,0.411,0.636,0.267,17.0,47.0,0
25793,2014-10-06,11400005,Final,1610612747,1610612743,2014,1610612747,98.0,0.448,0.682,...,29.0,45.0,1610612743,95.0,0.387,0.659,0.500,19.0,43.0,1
25794,2014-10-05,11400002,Final,1610612761,1610612758,2014,1610612761,99.0,0.440,0.771,...,21.0,30.0,1610612758,94.0,0.469,0.725,0.385,18.0,45.0,1
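
The same renaming line is repeated for each of the other datasets below. A minimal sketch of a small helper that could wrap it; `snake_case_columns` is a hypothetical name, not something from the original notebook, and the toy frame is made up for illustration:

```python
import pandas as pd

def snake_case_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with lower-case, underscore-separated column names."""
    out = df.copy()
    out.columns = out.columns.str.replace(" ", "_").str.lower()
    return out

# Toy frame standing in for one of the raw CSVs (values are made up).
demo = pd.DataFrame({"GAME_ID": [1], "Home Team": ["A"]})
demo = snake_case_columns(demo)
print(list(demo.columns))
```

Returning a copy rather than renaming in place keeps the raw frame available for comparison.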


In [32]:
games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25796 entries, 0 to 25795
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   game_date_est     25796 non-null  object 
 1   game_id           25796 non-null  int64  
 2   game_status_text  25796 non-null  object 
 3   home_team_id      25796 non-null  int64  
 4   visitor_team_id   25796 non-null  int64  
 5   season            25796 non-null  int64  
 6   team_id_home      25796 non-null  int64  
 7   pts_home          25697 non-null  float64
 8   fg_pct_home       25697 non-null  float64
 9   ft_pct_home       25697 non-null  float64
 10  fg3_pct_home      25697 non-null  float64
 11  ast_home          25697 non-null  float64
 12  reb_home          25697 non-null  float64
 13  team_id_away      25796 non-null  int64  
 14  pts_away          25697 non-null  float64
 15  fg_pct_away       25697 non-null  float64
 16  ft_pct_away       25697 non-null  float6

In [33]:
games = games.assign(game_date_est = lambda x: pd.to_datetime(x.game_date_est), 
                     game_id = lambda x: x.game_id.astype(str),
                     home_team_id = lambda x: x.home_team_id.astype(str),
                     visitor_team_id = lambda x: x.visitor_team_id.astype(str),
                     team_id_home = lambda x: x.team_id_home.astype(str),
                     team_id_away = lambda x: x.team_id_away.astype(str))

games

Unnamed: 0,game_date_est,game_id,game_status_text,home_team_id,visitor_team_id,season,team_id_home,pts_home,fg_pct_home,ft_pct_home,...,ast_home,reb_home,team_id_away,pts_away,fg_pct_away,ft_pct_away,fg3_pct_away,ast_away,reb_away,home_team_wins
0,2022-03-12,22101005,Final,1610612748,1610612750,2021,1610612748,104.0,0.398,0.760,...,23.0,53.0,1610612750,113.0,0.422,0.875,0.357,21.0,46.0,0
1,2022-03-12,22101006,Final,1610612741,1610612739,2021,1610612741,101.0,0.443,0.933,...,20.0,46.0,1610612739,91.0,0.419,0.824,0.208,19.0,40.0,1
2,2022-03-12,22101007,Final,1610612759,1610612754,2021,1610612759,108.0,0.412,0.813,...,28.0,52.0,1610612754,119.0,0.489,1.000,0.389,23.0,47.0,0
3,2022-03-12,22101008,Final,1610612744,1610612749,2021,1610612744,122.0,0.484,0.933,...,33.0,55.0,1610612749,109.0,0.413,0.696,0.386,27.0,39.0,1
4,2022-03-12,22101009,Final,1610612743,1610612761,2021,1610612743,115.0,0.551,0.750,...,32.0,39.0,1610612761,127.0,0.471,0.760,0.387,28.0,50.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25791,2014-10-06,11400007,Final,1610612737,1610612740,2014,1610612737,93.0,0.419,0.821,...,24.0,50.0,1610612740,87.0,0.366,0.643,0.375,17.0,43.0,1
25792,2014-10-06,11400004,Final,1610612741,1610612764,2014,1610612741,81.0,0.338,0.719,...,18.0,40.0,1610612764,85.0,0.411,0.636,0.267,17.0,47.0,0
25793,2014-10-06,11400005,Final,1610612747,1610612743,2014,1610612747,98.0,0.448,0.682,...,29.0,45.0,1610612743,95.0,0.387,0.659,0.500,19.0,43.0,1
25794,2014-10-05,11400002,Final,1610612761,1610612758,2014,1610612761,99.0,0.440,0.771,...,21.0,30.0,1610612758,94.0,0.469,0.725,0.385,18.0,45.0,1


In [34]:
games.describe()

Unnamed: 0,season,pts_home,fg_pct_home,ft_pct_home,fg3_pct_home,ast_home,reb_home,pts_away,fg_pct_away,ft_pct_away,fg3_pct_away,ast_away,reb_away,home_team_wins
count,25796.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25796.0
mean,2011.798341,103.106044,0.460313,0.759705,0.355896,22.736779,43.345799,100.29412,0.449265,0.758082,0.349413,21.403899,42.085146,0.587494
std,5.397985,13.174726,0.056629,0.100692,0.11194,5.177566,6.621832,13.343016,0.055528,0.103418,0.110194,5.140897,6.526951,0.492295
min,2003.0,36.0,0.25,0.143,0.0,6.0,15.0,33.0,0.244,0.143,0.0,4.0,19.0,0.0
25%,2007.0,94.0,0.421,0.696,0.286,19.0,39.0,91.0,0.412,0.692,0.278,18.0,38.0,0.0
50%,2012.0,103.0,0.459,0.765,0.355,23.0,43.0,100.0,0.448,0.765,0.35,21.0,42.0,1.0
75%,2016.0,112.0,0.5,0.829,0.429,26.0,48.0,109.0,0.487,0.833,0.42,25.0,46.0,1.0
max,2021.0,168.0,0.684,1.0,1.0,50.0,72.0,168.0,0.687,1.0,1.0,46.0,81.0,1.0


In [35]:
missing = pd.concat([games.isnull().sum(), 100 * games.isnull().mean()], axis=1)
missing.columns=["count", "percent"]
missing.sort_values(by="count", ascending = False)

Unnamed: 0,count,percent
fg3_pct_home,99,0.38378
reb_away,99,0.38378
ast_away,99,0.38378
fg3_pct_away,99,0.38378
ft_pct_away,99,0.38378
fg_pct_away,99,0.38378
pts_home,99,0.38378
fg_pct_home,99,0.38378
ft_pct_home,99,0.38378
pts_away,99,0.38378


In [36]:
games[games.fg3_pct_home.isnull()]

Unnamed: 0,game_date_est,game_id,game_status_text,home_team_id,visitor_team_id,season,team_id_home,pts_home,fg_pct_home,ft_pct_home,...,ast_home,reb_home,team_id_away,pts_away,fg_pct_away,ft_pct_away,fg3_pct_away,ast_away,reb_away,home_team_wins
18320,2003-10-24,10300116,Final,1610612753,1610612762,2003,1610612753,,,,...,,,1610612762,,,,,,,0
18321,2003-10-24,10300108,Final,1610612737,1610612764,2003,1610612737,,,,...,,,1610612764,,,,,,,0
18322,2003-10-24,10300109,Final,1610612738,1610612751,2003,1610612738,,,,...,,,1610612751,,,,,,,0
18323,2003-10-24,10300113,Final,1610612759,1610612745,2003,1610612759,,,,...,,,1610612745,,,,,,,0
18324,2003-10-24,10300112,Final,1610612749,1610612765,2003,1610612749,,,,...,,,1610612765,,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18414,2003-10-09,10300019,Final,1610612743,1610612756,2003,1610612743,,,,...,,,1610612756,,,,,,,0
18415,2003-10-09,10300022,Final,1610612757,1610612758,2003,1610612757,,,,...,,,1610612758,,,,,,,0
18416,2003-10-08,10300013,Final,1610612759,1610612763,2003,1610612759,,,,...,,,1610612763,,,,,,,0
18423,2003-10-08,10300015,Final,1610612747,1610612744,2003,1610612747,,,,...,,,1610612744,,,,,,,0


The stats for games from 2003-10-07 to 2003-10-24 are missing (99 rows, about 0.38% of the data). Not a big deal: I could filter those rows out, or even drop that entire season. There's plenty of data otherwise.
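
The two filtering options mentioned above could be sketched as follows. This is a minimal illustration on a made-up toy frame, not the notebook's actual cleaning step; the column names match the cleaned `games` frame, but `games_demo` and `stat_cols` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned `games` frame (values are made up).
games_demo = pd.DataFrame({
    "game_id": ["10300116", "22101005", "22101006"],
    "season": [2003, 2021, 2021],
    "pts_home": [np.nan, 104.0, 101.0],
    "pts_away": [np.nan, 113.0, 91.0],
})

stat_cols = ["pts_home", "pts_away"]

# Option 1: drop only the games whose stats are missing.
games_with_stats = games_demo.dropna(subset=stat_cols)

# Option 2: drop every season that contains at least one such game.
has_missing = games_demo[stat_cols].isna().any(axis=1)
incomplete_seasons = games_demo.loc[has_missing, "season"].unique()
games_full_seasons = games_demo[~games_demo["season"].isin(incomplete_seasons)]

print(len(games_with_stats), len(games_full_seasons))
```

Option 1 keeps the rest of the 2003 season; option 2 is stricter and only makes sense if a partially covered season would distort later analysis.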

## Game Details

In [37]:
games_details.columns = games_details.columns.str.replace(" ", "_").str.lower()
games_details.head()

Unnamed: 0,game_id,team_id,team_abbreviation,team_city,player_id,player_name,nickname,start_position,comment,min,...,oreb,dreb,reb,ast,stl,blk,to,pf,pts,plus_minus
0,22101005,1610612750,MIN,Minnesota,1630162,Anthony Edwards,Anthony,F,,36:22,...,0.0,8.0,8.0,5.0,3.0,1.0,1.0,1.0,15.0,5.0
1,22101005,1610612750,MIN,Minnesota,1630183,Jaden McDaniels,Jaden,F,,23:54,...,2.0,4.0,6.0,0.0,0.0,2.0,2.0,6.0,14.0,10.0
2,22101005,1610612750,MIN,Minnesota,1626157,Karl-Anthony Towns,Karl-Anthony,C,,25:17,...,1.0,9.0,10.0,0.0,0.0,0.0,3.0,4.0,15.0,14.0
3,22101005,1610612750,MIN,Minnesota,1627736,Malik Beasley,Malik,G,,30:52,...,0.0,3.0,3.0,1.0,1.0,0.0,1.0,4.0,12.0,20.0
4,22101005,1610612750,MIN,Minnesota,1626156,D'Angelo Russell,D'Angelo,G,,33:46,...,0.0,6.0,6.0,9.0,1.0,0.0,5.0,0.0,14.0,17.0


In [38]:
games_details.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 645953 entries, 0 to 645952
Data columns (total 29 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   game_id            645953 non-null  int64  
 1   team_id            645953 non-null  int64  
 2   team_abbreviation  645953 non-null  object 
 3   team_city          645953 non-null  object 
 4   player_id          645953 non-null  int64  
 5   player_name        645953 non-null  object 
 6   nickname           30362 non-null   object 
 7   start_position     247215 non-null  object 
 8   comment            105602 non-null  object 
 9   min                540350 non-null  object 
 10  fgm                540350 non-null  float64
 11  fga                540350 non-null  float64
 12  fg_pct             540350 non-null  float64
 13  fg3m               540350 non-null  float64
 14  fg3a               540350 non-null  float64
 15  fg3_pct            540350 non-null  float64
 16  ft

In [39]:
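# Cast the ID columns to str and convert the "mm:ss" clock string in `min` to decimal minutes.
# Rows with a missing `min` value, or with no seconds part, get 0 for that part.
games_details = (
    games_details
    .assign(game_id = lambda x: x.game_id.astype(str),
            team_id = lambda x: x.team_id.astype(str),
            player_id = lambda x: x.player_id.astype(str),
            min_min = lambda x: pd.to_numeric(x['min'].str.split(':').str[0], errors='coerce').fillna(0),
            min_sec = lambda x: pd.to_numeric(x['min'].str.split(':').str[1], errors='coerce').fillna(0),
            min = lambda x: x.min_min + (x.min_sec / 60))
    .drop(['min_min', 'min_sec'], axis=1)  # the helper columns are no longer needed
)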

games_details

Unnamed: 0,game_id,team_id,team_abbreviation,team_city,player_id,player_name,nickname,start_position,comment,min,...,oreb,dreb,reb,ast,stl,blk,to,pf,pts,plus_minus
0,22101005,1610612750,MIN,Minnesota,1630162,Anthony Edwards,Anthony,F,,36.366667,...,0.0,8.0,8.0,5.0,3.0,1.0,1.0,1.0,15.0,5.0
1,22101005,1610612750,MIN,Minnesota,1630183,Jaden McDaniels,Jaden,F,,23.900000,...,2.0,4.0,6.0,0.0,0.0,2.0,2.0,6.0,14.0,10.0
2,22101005,1610612750,MIN,Minnesota,1626157,Karl-Anthony Towns,Karl-Anthony,C,,25.283333,...,1.0,9.0,10.0,0.0,0.0,0.0,3.0,4.0,15.0,14.0
3,22101005,1610612750,MIN,Minnesota,1627736,Malik Beasley,Malik,G,,30.866667,...,0.0,3.0,3.0,1.0,1.0,0.0,1.0,4.0,12.0,20.0
4,22101005,1610612750,MIN,Minnesota,1626156,D'Angelo Russell,D'Angelo,G,,33.766667,...,0.0,6.0,6.0,9.0,1.0,0.0,5.0,0.0,14.0,17.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
645948,11200005,1610612743,DEN,Denver,202706,Jordan Hamilton,,,,19.000000,...,0.0,2.0,2.0,0.0,2.0,0.0,1.0,3.0,17.0,
645949,11200005,1610612743,DEN,Denver,202702,Kenneth Faried,,,,23.000000,...,1.0,0.0,1.0,1.0,1.0,0.0,3.0,3.0,18.0,
645950,11200005,1610612743,DEN,Denver,201585,Kosta Koufos,,,,15.000000,...,3.0,5.0,8.0,0.0,1.0,0.0,0.0,3.0,6.0,
645951,11200005,1610612743,DEN,Denver,202389,Timofey Mozgov,,,,19.000000,...,1.0,2.0,3.0,1.0,0.0,0.0,4.0,2.0,2.0,


A quick explanation of how I cleaned the **min** (minutes) column. It was stored in clock format (mm:ss), so it was a str type instead of a numeric type. I split each value on the ":" separator into two temporary columns, minutes and seconds. I left the minutes as is, divided the seconds by 60, and added the two back together, so **10:45** becomes **10.75**. Missing values are filled with 0. Now it can be used as a proper numeric column.
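
The conversion described above can be tried in isolation. A minimal sketch on a toy Series; `clock_to_minutes` is a hypothetical helper name, and the sample values are made up:

```python
import pandas as pd

def clock_to_minutes(clock: pd.Series) -> pd.Series:
    """Convert 'mm:ss' strings to decimal minutes; missing parts count as 0."""
    parts = clock.str.split(":")
    minutes = pd.to_numeric(parts.str[0], errors="coerce").fillna(0)
    # .str[1] yields NaN when there is no seconds part (e.g. "19"), hence fillna(0).
    seconds = pd.to_numeric(parts.str[1], errors="coerce").fillna(0)
    return minutes + seconds / 60

demo_clock = pd.Series(["10:45", "36:22", "19", None])
demo_minutes = clock_to_minutes(demo_clock)
print(demo_minutes.tolist())
```

Filling missing values with 0 treats a player with no recorded time as having played zero minutes; keeping them as NaN would be the alternative if that distinction matters later.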

## Players

In [40]:
players.columns = players.columns.str.replace(" ", "_").str.lower()
players.head()

Unnamed: 0,player_name,team_id,player_id,season
0,Royce O'Neale,1610612762,1626220,2019
1,Bojan Bogdanovic,1610612762,202711,2019
2,Rudy Gobert,1610612762,203497,2019
3,Donovan Mitchell,1610612762,1628378,2019
4,Mike Conley,1610612762,201144,2019


In [41]:
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7228 entries, 0 to 7227
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   player_name  7228 non-null   object
 1   team_id      7228 non-null   int64 
 2   player_id    7228 non-null   int64 
 3   season       7228 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 226.0+ KB


In [42]:
players = players.assign(team_id = lambda x: x.team_id.astype(str), 
                         player_id = lambda x: x.player_id.astype(str))

players

Unnamed: 0,player_name,team_id,player_id,season
0,Royce O'Neale,1610612762,1626220,2019
1,Bojan Bogdanovic,1610612762,202711,2019
2,Rudy Gobert,1610612762,203497,2019
3,Donovan Mitchell,1610612762,1628378,2019
4,Mike Conley,1610612762,201144,2019
...,...,...,...,...
7223,Lanny Smith,1610612758,201831,2009
7224,Warren Carter,1610612752,201999,2009
7225,Bennet Davis,1610612751,201834,2009
7226,Brian Hamilton,1610612751,201646,2009


## Ranking

In [43]:
ranking.columns = ranking.columns.str.replace(" ", "_").str.lower()
ranking.head()

Unnamed: 0,team_id,league_id,season_id,standingsdate,conference,team,g,w,l,w_pct,home_record,road_record,returntoplay
0,1610612756,0,22021,2022-03-12,West,Phoenix,67,53,14,0.791,28-8,25-6,
1,1610612744,0,22021,2022-03-12,West,Golden State,68,46,22,0.676,28-7,18-15,
2,1610612763,0,22021,2022-03-12,West,Memphis,68,46,22,0.676,24-10,22-12,
3,1610612762,0,22021,2022-03-12,West,Utah,67,42,25,0.627,24-10,18-15,
4,1610612742,0,22021,2022-03-12,West,Dallas,67,41,26,0.612,23-12,18-14,


In [44]:
ranking.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201792 entries, 0 to 201791
Data columns (total 13 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   team_id        201792 non-null  int64  
 1   league_id      201792 non-null  int64  
 2   season_id      201792 non-null  int64  
 3   standingsdate  201792 non-null  object 
 4   conference     201792 non-null  object 
 5   team           201792 non-null  object 
 6   g              201792 non-null  int64  
 7   w              201792 non-null  int64  
 8   l              201792 non-null  int64  
 9   w_pct          201792 non-null  float64
 10  home_record    201792 non-null  object 
 11  road_record    201792 non-null  object 
 12  returntoplay   3990 non-null    float64
dtypes: float64(2), int64(6), object(5)
memory usage: 20.0+ MB


In [45]:
ranking = ranking.assign(team_id = lambda x: x.team_id.astype(str), 
                         league_id = lambda x: x.league_id.astype(str), 
                         standingsdate = lambda x: pd.to_datetime(x.standingsdate))

ranking

Unnamed: 0,team_id,league_id,season_id,standingsdate,conference,team,g,w,l,w_pct,home_record,road_record,returntoplay
0,1610612756,0,22021,2022-03-12,West,Phoenix,67,53,14,0.791,28-8,25-6,
1,1610612744,0,22021,2022-03-12,West,Golden State,68,46,22,0.676,28-7,18-15,
2,1610612763,0,22021,2022-03-12,West,Memphis,68,46,22,0.676,24-10,22-12,
3,1610612762,0,22021,2022-03-12,West,Utah,67,42,25,0.627,24-10,18-15,
4,1610612742,0,22021,2022-03-12,West,Dallas,67,41,26,0.612,23-12,18-14,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
201787,1610612765,0,22013,2014-09-01,East,Detroit,82,29,53,0.354,17-24,12-29,
201788,1610612738,0,22013,2014-09-01,East,Boston,82,25,57,0.305,16-25,9-32,
201789,1610612753,0,22013,2014-09-01,East,Orlando,82,23,59,0.280,19-22,4-37,
201790,1610612755,0,22013,2014-09-01,East,Philadelphia,82,19,63,0.232,10-31,9-32,


In [46]:
# I'm not sure why season_id is formatted like **22021** (year 2021?) while some are **12014** (2014?). 
# I'm not sure why some start with 1 and others with 2.
# I'll go back and add additional lines of code to clean this if necessary
ranking.season_id.value_counts()

22010    12480
22019    12240
22018    10470
22017    10380
22004    10260
22009    10230
22007    10230
22012    10200
22015    10200
22006    10200
22013    10200
22016    10200
22014    10170
22003    10150
22005    10140
22008    10140
22020     8573
22011     8550
22021     4350
12019      900
12006      780
12009      780
12012      750
12015      750
12016      720
12007      720
12013      720
12014      720
12008      690
12010      690
12003      667
12005      660
12004      630
12018      540
12017      510
12021      480
12020      336
12011      270
22002      116
Name: season_id, dtype: int64
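
One way to investigate the format is to split `season_id` into its leading digit and the remaining four-digit year. The sketch below does that on a few values taken from the counts above. The mapping of the leading digit to a season type (1 for preseason, 2 for regular season) is an assumption based on how NBA stats data is commonly described, and this notebook does not confirm it; the much smaller row counts for the `1xxxx` values are at least consistent with it. The names `season_split` and `assumed_types` are hypothetical.

```python
import pandas as pd

# A few season_id values taken from the value_counts output above.
season_id = pd.Series([22021, 12014, 22002, 12019])

as_text = season_id.astype(str)
season_split = pd.DataFrame({
    "season_id": season_id,
    "season_prefix": as_text.str[0].astype(int),  # leading digit
    "season_year": as_text.str[1:].astype(int),   # remaining four digits
})

# ASSUMPTION: prefix -> season type; verify against the data before relying on it.
assumed_types = {1: "preseason", 2: "regular season"}
season_split["season_type_assumed"] = season_split["season_prefix"].map(assumed_types)

print(season_split)
```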

## Teams

I doubt this dataset will be of any use for what I'm trying to do, but I'll clean it anyway.

In [47]:
teams.columns = teams.columns.str.replace(" ", "_").str.lower()
teams.head()

Unnamed: 0,league_id,team_id,min_year,max_year,abbreviation,nickname,yearfounded,city,arena,arenacapacity,owner,generalmanager,headcoach,dleagueaffiliation
0,0,1610612737,1949,2019,ATL,Hawks,1949,Atlanta,State Farm Arena,18729.0,Tony Ressler,Travis Schlenk,Lloyd Pierce,Erie Bayhawks
1,0,1610612738,1946,2019,BOS,Celtics,1946,Boston,TD Garden,18624.0,Wyc Grousbeck,Danny Ainge,Brad Stevens,Maine Red Claws
2,0,1610612740,2002,2019,NOP,Pelicans,2002,New Orleans,Smoothie King Center,,Tom Benson,Trajan Langdon,Alvin Gentry,No Affiliate
3,0,1610612741,1966,2019,CHI,Bulls,1966,Chicago,United Center,21711.0,Jerry Reinsdorf,Gar Forman,Jim Boylen,Windy City Bulls
4,0,1610612742,1980,2019,DAL,Mavericks,1980,Dallas,American Airlines Center,19200.0,Mark Cuban,Donnie Nelson,Rick Carlisle,Texas Legends


In [48]:
teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   league_id           30 non-null     int64  
 1   team_id             30 non-null     int64  
 2   min_year            30 non-null     int64  
 3   max_year            30 non-null     int64  
 4   abbreviation        30 non-null     object 
 5   nickname            30 non-null     object 
 6   yearfounded         30 non-null     int64  
 7   city                30 non-null     object 
 8   arena               30 non-null     object 
 9   arenacapacity       26 non-null     float64
 10  owner               30 non-null     object 
 11  generalmanager      30 non-null     object 
 12  headcoach           30 non-null     object 
 13  dleagueaffiliation  30 non-null     object 
dtypes: float64(1), int64(5), object(8)
memory usage: 3.4+ KB


In [49]:
teams = teams.assign(league_id = lambda x: x.league_id.astype(str),  
                     team_id = lambda x: x.team_id.astype(str))

teams

Unnamed: 0,league_id,team_id,min_year,max_year,abbreviation,nickname,yearfounded,city,arena,arenacapacity,owner,generalmanager,headcoach,dleagueaffiliation
0,0,1610612737,1949,2019,ATL,Hawks,1949,Atlanta,State Farm Arena,18729.0,Tony Ressler,Travis Schlenk,Lloyd Pierce,Erie Bayhawks
1,0,1610612738,1946,2019,BOS,Celtics,1946,Boston,TD Garden,18624.0,Wyc Grousbeck,Danny Ainge,Brad Stevens,Maine Red Claws
2,0,1610612740,2002,2019,NOP,Pelicans,2002,New Orleans,Smoothie King Center,,Tom Benson,Trajan Langdon,Alvin Gentry,No Affiliate
3,0,1610612741,1966,2019,CHI,Bulls,1966,Chicago,United Center,21711.0,Jerry Reinsdorf,Gar Forman,Jim Boylen,Windy City Bulls
4,0,1610612742,1980,2019,DAL,Mavericks,1980,Dallas,American Airlines Center,19200.0,Mark Cuban,Donnie Nelson,Rick Carlisle,Texas Legends
5,0,1610612743,1976,2019,DEN,Nuggets,1976,Denver,Pepsi Center,19099.0,Stan Kroenke,Tim Connelly,Michael Malone,No Affiliate
6,0,1610612745,1967,2019,HOU,Rockets,1967,Houston,Toyota Center,18104.0,Tilman Fertitta,Daryl Morey,Mike D'Antoni,Rio Grande Valley Vipers
7,0,1610612746,1970,2019,LAC,Clippers,1970,Los Angeles,Staples Center,19060.0,Steve Ballmer,Michael Winger,Doc Rivers,Agua Caliente Clippers of Ontario
8,0,1610612747,1948,2019,LAL,Lakers,1948,Los Angeles,Staples Center,19060.0,Jerry Buss Family Trust,Rob Pelinka,Frank Vogel,South Bay Lakers
9,0,1610612748,1988,2019,MIA,Heat,1988,Miami,AmericanAirlines Arena,19600.0,Micky Arison,Pat Riley,Erik Spoelstra,Sioux Falls Skyforce


# Save Cleaned Data

In [50]:
#save_file(games, games_clean.csv, "/data")
#save_file(games_details, games_details_clean.csv, "/data")
#save_file(players, players_clean.csv, "/data")
#save_file(ranking, ranking_clean.csv, "/data")
#save_file(teams, teams_clean.csv, "/data")

games.to_csv("data_clean/games_clean.csv")
games_details.to_csv("data_clean/games_details_clean.csv")
players.to_csv("data_clean/players_clean.csv")
ranking.to_csv("data_clean/ranking_clean.csv")
teams.to_csv("data_clean/teams_clean.csv")
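
Two things are worth keeping in mind when these files are read back. By default `to_csv` writes the row index as an extra unnamed first column, and ID columns cast to str are inferred as integers again by `read_csv` unless a `dtype` is passed. A minimal sketch of a round trip that avoids both, using a made-up toy frame and a temporary directory in place of `data_clean/`:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned `players` frame.
players_demo = pd.DataFrame({
    "player_name": ["Rudy Gobert"],
    "team_id": ["1610612762"],
    "player_id": ["203497"],
    "season": [2019],
})

# A temporary directory stands in for data_clean/ so the sketch runs anywhere.
out_dir = Path(tempfile.mkdtemp()) / "data_clean"
out_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing folders
out_path = out_dir / "players_clean.csv"

# index=False keeps the row index out of the file.
players_demo.to_csv(out_path, index=False)

# dtype restores the string IDs, which would otherwise be read back as integers.
reloaded = pd.read_csv(out_path, dtype={"team_id": str, "player_id": str})
print(reloaded.dtypes)
```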
