## Imports

In [14]:
import pandas as pd
import numpy as np

## Data Loading

In [15]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


# Load the three source tables (paths are relative to the notebook's location).
awards = pd.read_csv('../../datasets/awards_players.csv')
player_teams = pd.read_csv('../../datasets/players_teams.csv')
players = pd.read_csv('../../datasets/players.csv')

---

### Players_Teams

We start with the `players_teams.csv` file, loaded above as `player_teams`. It records each player's statistics for every team they played for.

In [16]:
player_teams.head()

Unnamed: 0,playerID,year,stint,tmID,lgID,GP,GS,minutes,points,oRebounds,dRebounds,rebounds,assists,steals,blocks,turnovers,PF,fgAttempted,fgMade,ftAttempted,ftMade,threeAttempted,threeMade,dq,PostGP,PostGS,PostMinutes,PostPoints,PostoRebounds,PostdRebounds,PostRebounds,PostAssists,PostSteals,PostBlocks,PostTurnovers,PostPF,PostfgAttempted,PostfgMade,PostftAttempted,PostftMade,PostthreeAttempted,PostthreeMade,PostDQ
0,abrossv01w,2,0,MIN,WNBA,26,23,846,343,43,131,174,53,42,9,85,70,293,114,132,96,76,19,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,abrossv01w,3,0,MIN,WNBA,27,27,805,314,45,101,146,60,42,10,92,73,316,119,116,56,60,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,abrossv01w,4,0,MIN,WNBA,30,25,792,318,44,97,141,82,44,11,90,79,285,112,98,69,82,25,0,3,3,69,23,1,4,5,4,4,1,8,8,22,6,8,8,7,3,0
3,abrossv01w,5,0,MIN,WNBA,22,11,462,146,17,57,74,45,30,2,43,42,139,49,46,28,53,20,0,2,2,67,20,3,6,9,3,1,2,3,7,23,8,4,2,8,2,0
4,abrossv01w,6,0,MIN,WNBA,31,31,777,304,29,78,107,60,48,6,80,86,276,109,73,53,82,33,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


This file contains the following columns:

- "playerID" - The player's unique identifier, also present in the 'players.csv' file.
- "year" - The year in which the player played.
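- "stint" - If the player played for more than one team in the same year, this column indicates the order in which the player played for the teams, starting from 1. In the rows shown above the value is 0, which appears to mark a player who stayed with a single team that year.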
- "tmID" - The team's unique identifier, also present in the 'teams.csv' file.
- "lgID" - The league's unique identifier, also present in the 'leagues.csv' file. It is always 'WNBA', since the dataset only contains WNBA data, so it is not useful for our analysis.
- "GP" - Games played.
- "GS" - Games started.
- "minutes" - Minutes played.
- "points" - Points scored.
- "oRebounds" - Offensive rebounds.
- "dRebounds" - Defensive rebounds.
- "rebounds" - Total rebounds.
- "assists" - Assists.
- "steals" - Steals.
- "blocks" - Blocks.
- "turnovers" - Turnovers.
- "PF" - Personal fouls.
- "fgAttempted" - Field goals attempted.
- "fgMade" - Field goals made.
- "ftAttempted" - Free throws attempted.
- "ftMade" - Free throws made.
- "threeAttempted" - Three-point field goals attempted.
- "threeMade" - Three-point field goals made.
- "dq" - Disqualifications.
- "PostGP" - Postseason games played.
- "PostGS" - Postseason games started.
- "PostMinutes" - Postseason minutes played.
- "PostPoints" - Postseason points scored.
- "PostoRebounds" - Postseason offensive rebounds.
- "PostdRebounds" - Postseason defensive rebounds.
- "PostRebounds" - Postseason total rebounds.
- "PostAssists" - Postseason assists.
- "PostSteals" - Postseason steals.
- "PostBlocks" - Postseason blocks.
- "PostTurnovers" - Postseason turnovers.
- "PostPF" - Postseason personal fouls.
- "PostfgAttempted" - Postseason field goals attempted.
- "PostfgMade" - Postseason field goals made.
- "PostftAttempted" - Postseason free throws attempted.
- "PostftMade" - Postseason free throws made.
- "PostthreeAttempted" - Postseason three-point field goals attempted.
- "PostthreeMade" - Postseason three-point field goals made.
- "PostDQ" - Postseason disqualifications.


In [17]:
player_teams.describe()

Unnamed: 0,year,stint,GP,GS,minutes,points,oRebounds,dRebounds,rebounds,assists,steals,blocks,turnovers,PF,fgAttempted,fgMade,ftAttempted,ftMade,threeAttempted,threeMade,dq,PostGP,PostGS,PostMinutes,PostPoints,PostoRebounds,PostdRebounds,PostRebounds,PostAssists,PostSteals,PostBlocks,PostTurnovers,PostPF,PostfgAttempted,PostfgMade,PostftAttempted,PostftMade,PostthreeAttempted,PostthreeMade,PostDQ
count,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0,1876.0
mean,5.326226,0.113539,24.320896,12.438166,501.26919,176.261727,24.38806,54.334755,78.722814,39.031983,19.600746,9.065032,36.480277,48.596482,152.122068,64.071962,48.376866,36.358742,34.659382,11.759062,0.415778,1.828358,0.990405,39.950959,14.140725,1.90032,4.413646,6.313966,3.126333,1.420043,0.759062,2.623134,3.735075,12.282516,5.149254,3.672708,2.822495,2.924307,1.019723,0.026652
std,2.905475,0.422574,10.460614,13.641697,359.566117,161.983839,23.325974,48.347088,69.210226,40.147037,17.542694,13.497853,27.956998,34.158825,132.153836,58.914688,48.238245,37.86826,46.189357,17.023107,0.888352,2.659597,2.215079,71.565062,29.55186,4.154121,9.121491,12.881782,7.081885,2.992881,2.280011,5.037807,6.697874,24.313379,10.726421,8.463917,6.72317,7.751034,2.992637,0.170751
min,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,0.0,17.0,0.0,165.0,41.0,6.0,15.0,21.0,8.0,5.0,1.0,11.0,18.0,42.0,15.0,10.0,7.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,5.0,0.0,29.0,5.0,459.0,129.0,18.0,42.0,63.0,26.0,16.0,4.0,32.0,47.0,118.0,47.0,33.0,23.5,12.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,8.0,0.0,32.0,29.0,826.25,284.0,36.0,83.0,119.0,57.25,30.0,11.0,57.0,74.0,243.0,104.0,73.0,55.0,54.0,18.0,0.25,3.0,0.0,54.25,15.25,2.0,5.0,7.0,3.0,1.0,0.0,3.0,5.0,15.0,6.0,4.0,2.0,1.0,0.0,0.0
max,10.0,3.0,34.0,34.0,1234.0,860.0,162.0,276.0,363.0,236.0,99.0,113.0,126.0,143.0,660.0,298.0,275.0,246.0,305.0,121.0,7.0,11.0,11.0,412.0,245.0,33.0,80.0,104.0,71.0,33.0,31.0,34.0,43.0,188.0,82.0,68.0,62.0,85.0,32.0,2.0
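
The summary reports `stint` values from 0 to 3, with 0 up to the 75th percentile. One way to see how those values relate to team changes is to count the distinct teams per player and year and compare that count with `stint`. The sketch below is only an illustration on made-up rows (the player IDs and the team IDs `AAA` and `BBB` are hypothetical); on the real data, `toy` would be replaced by `player_teams`.

```python
import pandas as pd

# Illustrative rows only: the player IDs and the team IDs "AAA"/"BBB" are made up.
toy = pd.DataFrame({
    "playerID": ["p1", "p2", "p2", "p3"],
    "year": [1, 1, 1, 2],
    "stint": [0, 1, 2, 0],
    "tmID": ["MIN", "AAA", "BBB", "MIN"],
})

# Number of distinct teams per player and year.
n_teams = (
    toy.groupby(["playerID", "year"])["tmID"]
    .nunique()
    .rename("n_teams")
    .reset_index()
)

# Attach that count to every row, then list which team counts occur per stint value.
check = toy.merge(n_teams, on=["playerID", "year"])
summary = check.groupby("stint")["n_teams"].agg(["min", "max"])
print(summary)
```

If `stint == 0` only ever pairs with a team count of 1, that supports reading 0 as "single team that year".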


Taking a look at the dataset, all the columns seem to be useful for our analysis, except for the `lgID` column, which is always 'WNBA'.
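
Before dropping `lgID`, it may be worth confirming programmatically that it really holds a single value. The following is a minimal sketch on a toy frame (the third row and the helper name `constant_columns` are assumptions for illustration); on the real data, `sample` would be replaced by `player_teams`.

```python
import pandas as pd


def constant_columns(df):
    """Return the names of columns that hold at most one distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


# Toy stand-in for player_teams; the last row is made up for illustration.
sample = pd.DataFrame({
    "playerID": ["abrossv01w", "abrossv01w", "made_up_id"],
    "year": [2, 3, 2],
    "lgID": ["WNBA", "WNBA", "WNBA"],
    "points": [343, 314, 100],
})

to_drop = constant_columns(sample)
cleaned = sample.drop(columns=to_drop)
print(to_drop)
print(cleaned.columns.tolist())
```

Using `nunique(dropna=False)` counts missing values as a distinct value, so a column that mixes 'WNBA' with blanks would not be dropped silently.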

---

### PLAYERS

Next, we examine the `players.csv` file, loaded above as `players`. This file contains biographical information about the players.

In [18]:
players.head()

Unnamed: 0,bioID,pos,firstseason,lastseason,height,weight,college,collegeOther,birthDate,deathDate
0,abrahta01w,C,0,0,74.0,190,George Washington,,1975-09-27,0000-00-00
1,abrossv01w,F,0,0,74.0,169,Connecticut,,1980-07-09,0000-00-00
2,adairje01w,C,0,0,76.0,197,George Washington,,1986-12-19,0000-00-00
3,adamsda01w,F-C,0,0,73.0,239,Texas A&M,Jefferson College (JC),1989-02-19,0000-00-00
4,adamsjo01w,C,0,0,75.0,180,New Mexico,,1981-05-24,0000-00-00


This file contains the following columns:

- "bioID" - The player's unique identifier. It corresponds to `playerID` in the other files.
- "pos" - The player's position.
- "firstseason" - If it's the player's first season, this column is '1', otherwise '0'.
- "lastseason" - If it's the player's last season, this column is '1', otherwise '0'.
- "height" - The player's height.
- "weight" - The player's weight.
- "college" - The player's college.
- "collegeOther" - If the player played for more than one college, this column contains the other colleges, otherwise it's empty.
- "birthDate" - The player's birth date.
- "deathDate" - The player's death date. 

In [19]:
players.describe()

Unnamed: 0,firstseason,lastseason,height,weight
count,893.0,893.0,893.0,893.0
mean,0.0,0.0,65.50056,145.415454
std,0.0,0.0,20.940425,61.275703
min,0.0,0.0,0.0,0.0
25%,0.0,0.0,68.0,140.0
50%,0.0,0.0,72.0,162.0
75%,0.0,0.0,75.0,180.0
max,0.0,0.0,80.0,254.0
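
The summary reports a minimum of 0 for both `height` and `weight`, which is not a plausible measurement and probably marks missing values. A minimal sketch of one way to handle this, assuming zeros should be treated as missing (the third row is made up for illustration; on the real data, `toy_players` would be replaced by `players`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for players; the last row is hypothetical.
toy_players = pd.DataFrame({
    "bioID": ["abrahta01w", "abrossv01w", "hypothetical01w"],
    "height": [74.0, 74.0, 0.0],
    "weight": [190, 169, 0],
})

measure_cols = ["height", "weight"]

# Count zero entries per column before changing anything.
n_zero = (toy_players[measure_cols] == 0).sum()
print(n_zero)

# Assumption: a zero means "not recorded", so replace it with NaN.
cleaned = toy_players.copy()
cleaned[measure_cols] = cleaned[measure_cols].replace(0, np.nan)
print(cleaned[measure_cols].describe())
```

With the zeros converted to `NaN`, `describe()` ignores them, so the mean and quartiles are no longer pulled toward 0.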


From this overview, some columns are clearly not useful for our analysis as they stand: `firstseason`, `lastseason`, `birthDate`, and `deathDate`.

- `firstseason` and `lastseason` are not useful because their values are always '0'.
- `deathDate` is not useful because no player in the dataset has a death date.
- `birthDate` is not useful in this format, but we can extract the player's age from it, which might be useful for our analysis. However, there are players with missing birth dates, so we will have to deal with that.
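
Extracting an age from `birthDate` could look like the sketch below. It rests on two assumptions that the data does not confirm: missing birth dates are encoded as `0000-00-00` (as `deathDate` is in the rows shown above) or left blank, and the reference date is chosen by the analyst, since `year` in the other files is a season index rather than a calendar year. The function name `age_on` and the reference date are hypothetical.

```python
import pandas as pd


def age_on(birth_dates, reference):
    """Age in whole years on `reference`; NaN where the birth date is unparseable."""
    ref = pd.Timestamp(reference)
    # Invalid strings such as "0000-00-00" become NaT instead of raising.
    born = pd.to_datetime(birth_dates, format="%Y-%m-%d", errors="coerce")
    # True where the birthday has already occurred in the reference year.
    had_birthday = (born.dt.month < ref.month) | (
        (born.dt.month == ref.month) & (born.dt.day <= ref.day)
    )
    age = ref.year - born.dt.year - (~had_birthday).astype(int)
    # Keep NaN for rows whose birth date could not be parsed.
    return age.where(born.notna())


births = pd.Series(["1980-07-09", "0000-00-00", "1975-09-27"])
ages = age_on(births, "2000-07-01")  # hypothetical reference date
print(ages)
```

Rows with an unparseable birth date come back as `NaN`, which leaves the choice between dropping and imputing them for a later step.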

---

### AWARDS

Next, we will analyze the `awards_players.csv` file. This file contains the information about the awards won by the players.

In [21]:
awards.head()

Unnamed: 0,playerID,award,year,lgID
0,thompti01w,All-Star Game Most Valuable Player,1,WNBA
1,leslili01w,All-Star Game Most Valuable Player,2,WNBA
2,leslili01w,All-Star Game Most Valuable Player,3,WNBA
3,teaslni01w,All-Star Game Most Valuable Player,4,WNBA
4,swoopsh01w,All-Star Game Most Valuable Player,6,WNBA


This file contains the following columns:

- "playerID" - The unique identifier of the award winner. Note that this column also contains coaches' identifiers, not only players'.
- "award" - The award won by the player.
- "year" - The year in which the award was won.
- "lgID" - The league's unique identifier. It's always 'WNBA', since the dataset only contains WNBA data, and it's not useful for our analysis.
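
Because `playerID` mixes players and coaches, it can help to separate the rows whose identifier appears in `players.csv` from those that do not. A minimal sketch on toy data follows; the identifier `coach_x`, the award name "Coach of the Year", and the contents of `toy_players` are made up for illustration. On the real data, `toy_awards` and `toy_players` would be replaced by `awards` and `players`.

```python
import pandas as pd

# Toy stand-ins; "coach_x" and "Coach of the Year" are hypothetical values.
toy_players = pd.DataFrame({"bioID": ["leslili01w", "thompti01w"]})
toy_awards = pd.DataFrame({
    "playerID": ["thompti01w", "leslili01w", "coach_x"],
    "award": [
        "All-Star Game Most Valuable Player",
        "All-Star Game Most Valuable Player",
        "Coach of the Year",
    ],
    "year": [1, 2, 1],
})

# True where the award winner is a known player.
is_player = toy_awards["playerID"].isin(toy_players["bioID"])

player_awards = toy_awards[is_player]
other_awards = toy_awards[~is_player]

print(player_awards)
print(other_awards)
```

Inspecting `other_awards` on the real data would show which identifiers and award names belong to non-players.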

In [None]:
awards.describe()