## IMPORTS

In [3]:
import pandas as pd
import numpy as np

## DATA LOADING

In [4]:
# Set pandas options to display all columns
pd.set_option('display.max_columns', None)
# Set pandas options to display all rows
pd.set_option('display.max_rows', None)

teams_post = pd.read_csv("../../datasets/teams_post.csv")
teams = pd.read_csv("../../datasets/teams.csv")
series = pd.read_csv("../../datasets/series_post.csv")

---

## TEAMS

We will start by taking a look at the `teams.csv` dataset.

In [5]:
teams.head()

Unnamed: 0,year,lgID,tmID,franchID,confID,divID,rank,playoff,seeded,firstRound,semis,finals,name,o_fgm,o_fga,o_ftm,o_fta,o_3pm,o_3pa,o_oreb,o_dreb,o_reb,o_asts,o_pf,o_stl,o_to,o_blk,o_pts,d_fgm,d_fga,d_ftm,d_fta,d_3pm,d_3pa,d_oreb,d_dreb,d_reb,d_asts,d_pf,d_stl,d_to,d_blk,d_pts,tmORB,tmDRB,tmTRB,opptmORB,opptmDRB,opptmTRB,won,lost,GP,homeW,homeL,awayW,awayL,confW,confL,min,attend,arena
0,9,WNBA,ATL,ATL,EA,,7,N,0,,,,Atlanta Dream,895,2258,542,725,202,598,340,737,1077,492,796,285,593,142,2534,1014,2254,679,918,172,502,401,864,1265,684,726,310,561,134,2879,0,0,0,0,0,0,4,30,34,1,16,3,14,2,18,6825,141379,Philips Arena
1,10,WNBA,ATL,ATL,EA,,2,Y,0,L,,,Atlanta Dream,1089,2428,569,755,114,374,404,855,1259,547,741,329,590,121,2861,996,2363,624,807,181,530,353,821,1174,615,700,347,601,133,2797,0,0,0,0,0,0,18,16,34,12,5,6,11,10,12,6950,120737,Philips Arena
2,1,WNBA,CHA,CHA,EA,,8,N,0,,,,Charlotte Sting,812,1903,431,577,131,386,305,630,935,551,713,222,496,90,2186,879,1930,533,716,138,423,326,664,990,596,596,259,426,123,2429,0,0,0,0,0,0,8,24,32,5,11,3,13,5,16,6475,90963,Charlotte Coliseum
3,2,WNBA,CHA,CHA,EA,,4,Y,0,W,W,L,Charlotte Sting,746,1780,410,528,153,428,309,639,948,467,605,217,474,114,2055,732,1846,431,562,114,369,344,567,911,443,579,257,447,124,2009,0,0,0,0,0,0,18,14,32,11,5,7,9,15,6,6500,105525,Charlotte Coliseum
4,3,WNBA,CHA,CHA,EA,,2,Y,0,L,,,Charlotte Sting,770,1790,490,663,211,527,302,653,955,496,647,241,408,105,2241,778,1807,444,598,133,372,295,620,915,489,600,208,424,103,2133,0,0,0,0,0,0,18,14,32,11,5,7,9,12,9,6450,106670,Charlotte Coliseum



The features available in the `teams.csv` dataset are:

- `year` - The season index (1 to 10 in this data); important because the data for a given year will be used to predict the next year's results;
- `lgID` - The league the team is part of (doesn't impact the model, as all teams are from the same league);
- `tmID` - The team ID (identifies the team and can be used to merge with other datasets);
- `franchID` - The franchise ID (basically the same as the team ID);
- `confID` - The conference ID (since four teams from each conference make the playoffs, this feature can be important);
- `divID` - The division ID (this feature will be removed, as all rows are NaN);
- `rank` - The team's regular-season rank (presumably within its conference, since `divID` is empty);
- `playoff` - Whether the team made the playoffs (`Y`) or not (`N`);
- `seeded` - Whether the team was seeded or not (always 0 in this dataset);
- `firstRound`, `semis`, `finals` - The team's result (`W` or `L`) in the first round, semifinals and finals of the playoffs; empty if the team did not play that round;
- `name` - The team's name;
- Team stats (important, as they directly reflect team performance and therefore playoff qualification):
    - `o_fgm` - Offensive field goals made (by the team);
    - `o_fga` - Offensive field goals attempted;
    - `o_ftm` - Offensive free throws made;
    - `o_fta` - Offensive free throws attempted;
    - `o_3pm` - Offensive three-pointers made;
    - `o_3pa` - Offensive three-pointers attempted;
    - `o_oreb` - Offensive rebounds by the team;
    - `o_dreb` - Defensive rebounds by the team;
    - `o_reb` - Total rebounds by the team;
    - `o_asts` - Assists by the team;
    - `o_pf` - Personal fouls by the team;
    - `o_stl` - Steals by the team;
    - `o_to` - Turnovers by the team;
    - `o_blk` - Blocks by the team;
    - `o_pts` - Points scored by the team;
    - `d_fgm` - Field goals made by opponents (allowed on defense);
    - `d_fga` - Field goals attempted by opponents;
    - `d_ftm` - Free throws made by opponents;
    - `d_fta` - Free throws attempted by opponents;
    - `d_3pm` - Three-pointers made by opponents;
    - `d_3pa` - Three-pointers attempted by opponents;
    - `d_oreb` - Offensive rebounds by opponents;
    - `d_dreb` - Defensive rebounds by opponents;
    - `d_reb` - Total rebounds by opponents;
    - `d_asts` - Assists by opponents;
    - `d_pf` - Personal fouls by opponents;
    - `d_stl` - Steals by opponents;
    - `d_to` - Turnovers by opponents;
    - `d_blk` - Blocks by opponents;
    - `d_pts` - Points scored by opponents (points allowed);
    - `tmORB` - Team offensive rebounds (maintains value 0 for all rows, will be removed);
    - `tmDRB` - Team defensive rebounds (maintains value 0 for all rows, will be removed);
    - `tmTRB` - Team total rebounds (maintains value 0 for all rows, will be removed);
    - `opptmORB` - Opponent team offensive rebounds (maintains value 0 for all rows, will be removed);
    - `opptmDRB` - Opponent team defensive rebounds (maintains value 0 for all rows, will be removed);
    - `opptmTRB` - Opponent team total rebounds (maintains value 0 for all rows, will be removed);
    - `won` - Number of games won;
    - `lost` - Number of games lost;
    - `GP` - Games played;
    - `homeW` - Home wins;
    - `homeL` - Home losses;
    - `awayW` - Away wins;
    - `awayL` - Away losses;
    - `confW` - Conference wins;
    - `confL` - Conference losses;
    - `min` - Minutes played;
- `attend` - Attendance (probably not relevant for the model);
- `arena` - Arena name (categorical feature that will probably not be impactful for the model).
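Since the data for one year is meant to predict the next year's results, each row eventually needs the following season's `playoff` flag attached to it. The block below is a minimal sketch of one way to do that, not necessarily the approach used later in this project. It uses a toy frame built from the five rows shown above, and `playoff_next` is a hypothetical column name introduced only for illustration.

```python
import pandas as pd

# Toy frame: (year, tmID, playoff) copied from the five rows displayed above.
# In the notebook this would be the full `teams` DataFrame.
toy_teams = pd.DataFrame(
    {
        "year": [9, 10, 1, 2, 3],
        "tmID": ["ATL", "ATL", "CHA", "CHA", "CHA"],
        "playoff": ["N", "Y", "N", "Y", "Y"],
    }
)

# Shift the labels back by one year so that the label of year t+1
# lines up with the features of year t.
labels = toy_teams[["year", "tmID", "playoff"]].copy()
labels["year"] = labels["year"] - 1
labels = labels.rename(columns={"playoff": "playoff_next"})  # hypothetical name

# A left merge keeps every feature row; rows without a following
# season (e.g. the last year of each team) get NaN as the label.
paired = toy_teams.merge(labels, on=["year", "tmID"], how="left")
print(paired)
```

Merging on `year` and `tmID` rather than shifting rows within each team avoids pairing non-consecutive seasons if a team is missing a year.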

In [6]:
teams.describe()

Unnamed: 0,year,divID,rank,seeded,o_fgm,o_fga,o_ftm,o_fta,o_3pm,o_3pa,o_oreb,o_dreb,o_reb,o_asts,o_pf,o_stl,o_to,o_blk,o_pts,d_fgm,d_fga,d_ftm,d_fta,d_3pm,d_3pa,d_oreb,d_dreb,d_reb,d_asts,d_pf,d_stl,d_to,d_blk,d_pts,tmORB,tmDRB,tmTRB,opptmORB,opptmDRB,opptmTRB,won,lost,GP,homeW,homeL,awayW,awayL,confW,confL,min,attend
count,142.0,0.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0
mean,5.302817,,4.084507,0.0,860.387324,2039.683099,488.338028,651.366197,157.161972,463.014085,330.5,730.929577,1061.429577,520.830986,653.929577,263.112676,510.450704,122.077465,2366.274648,860.380282,2039.676056,488.338028,651.366197,157.161972,463.014085,330.514085,730.922535,1061.43662,520.830986,653.915493,263.112676,510.450704,122.070423,2366.260563,0.0,0.0,0.0,0.0,0.0,0.0,16.661972,16.661972,33.323944,10.169014,6.492958,6.492958,10.169014,10.56338,10.56338,6735.683099,141050.253521
std,2.917274,,2.095226,0.0,86.998969,176.879707,70.749372,86.035246,43.73658,116.166119,41.191432,83.378114,105.393245,54.625738,60.978039,34.91288,50.873585,26.136324,243.15486,82.547277,183.678935,70.896377,90.051245,30.482093,82.147246,33.743659,78.264153,100.099983,50.782106,51.319814,26.644521,54.038019,20.658537,234.615384,0.0,0.0,0.0,0.0,0.0,0.0,4.999131,4.999131,0.949425,2.994017,2.967308,2.702104,2.731409,3.485461,3.485461,197.851093,34714.358519
min,1.0,,1.0,0.0,647.0,1740.0,333.0,469.0,62.0,205.0,242.0,537.0,793.0,390.0,467.0,187.0,408.0,63.0,1822.0,664.0,1676.0,325.0,444.0,92.0,315.0,267.0,567.0,873.0,388.0,538.0,197.0,390.0,71.0,1788.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,32.0,1.0,0.0,1.0,3.0,2.0,2.0,6400.0,57635.0
25%,3.0,,2.0,0.0,794.5,1908.5,435.25,582.75,128.25,389.0,301.25,653.25,969.25,478.5,617.0,241.25,470.5,101.25,2185.25,800.75,1889.0,442.25,591.25,135.0,401.75,307.25,676.25,983.0,491.25,616.0,244.25,470.25,109.0,2196.75,0.0,0.0,0.0,0.0,0.0,0.0,13.0,14.0,32.0,8.0,4.25,5.0,9.0,8.0,8.0,6500.0,120897.5
50%,5.0,,4.0,0.0,864.0,2025.0,483.5,650.0,157.0,459.0,333.5,724.0,1069.5,520.0,648.0,261.5,508.0,121.0,2340.0,856.5,2016.0,480.0,643.5,153.5,456.0,327.0,720.0,1048.0,516.0,649.0,262.0,503.0,123.0,2339.5,0.0,0.0,0.0,0.0,0.0,0.0,17.0,16.0,34.0,11.0,6.0,6.0,10.0,11.0,10.0,6825.0,135895.5
75%,8.0,,6.0,0.0,915.0,2177.5,539.0,716.5,180.75,528.0,356.75,788.0,1142.0,556.75,696.5,284.0,540.75,138.0,2531.5,909.0,2184.0,530.5,708.5,173.75,517.0,350.0,786.0,1117.25,545.75,686.5,277.0,545.5,136.75,2522.75,0.0,0.0,0.0,0.0,0.0,0.0,20.0,20.0,34.0,12.0,8.0,8.0,12.0,13.0,13.0,6875.0,150441.5
max,10.0,,8.0,0.0,1128.0,2485.0,668.0,882.0,283.0,802.0,452.0,931.0,1311.0,683.0,796.0,373.0,637.0,216.0,3156.0,1094.0,2582.0,694.0,932.0,290.0,762.0,445.0,945.0,1390.0,684.0,778.0,347.0,649.0,206.0,3031.0,0.0,0.0,0.0,0.0,0.0,0.0,28.0,30.0,34.0,16.0,16.0,13.0,16.0,19.0,19.0,7025.0,259237.0


Looking at the column information, we can see that some columns will not be useful for the model:

- `lgID` -> the league is the same for all teams;
- `franchID` -> the franchise ID is the same as the team ID;
- `divID` -> all rows are NaN;
- `name`, `arena` -> these are categorical features that will not be used to predict the playoffs, since `tmID` is enough to identify the team;
- `firstRound`, `semis`, `finals` -> these columns are the result of the playoffs and will not be used to predict the playoffs;
- `seeded`, `tmORB`, `tmDRB`, `tmTRB`, `opptmORB`, `opptmDRB`, `opptmTRB` -> these columns are not useful for the model, since their values are always 0.
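The constant and empty columns listed above can also be found programmatically instead of by reading the `describe()` output. The following is a small sketch on a toy frame whose values are copied from the rows shown earlier, with only a subset of the columns. The variable names are illustrative, and the check that `franchID` duplicates `tmID` is run explicitly rather than assumed.

```python
import numpy as np
import pandas as pd

# Toy subset of `teams` (values taken from the first rows displayed above).
toy_teams = pd.DataFrame(
    {
        "year": [9, 10, 1],
        "lgID": ["WNBA", "WNBA", "WNBA"],
        "tmID": ["ATL", "ATL", "CHA"],
        "franchID": ["ATL", "ATL", "CHA"],
        "divID": [np.nan, np.nan, np.nan],
        "seeded": [0, 0, 0],
        "tmORB": [0, 0, 0],
        "won": [4, 18, 8],
    }
)

# nunique() ignores NaN by default: 0 means all-NaN, 1 means constant.
n_unique = toy_teams.nunique()
uninformative = n_unique[n_unique <= 1].index.tolist()

# Only treat franchID as redundant if it really matches tmID on every row.
franch_is_duplicate = bool((toy_teams["franchID"] == toy_teams["tmID"]).all())

to_drop = uninformative + (["franchID"] if franch_is_duplicate else [])
reduced = toy_teams.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

On the full dataset the same check would report whether `franchID` differs from `tmID` anywhere before the column is discarded.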

--- 

## TEAMS_POST

We will now take a look at the `teams_post.csv` dataset.

In [7]:
teams_post.head()

Unnamed: 0,year,tmID,lgID,W,L
0,1,HOU,WNBA,6,0
1,1,ORL,WNBA,1,2
2,1,CLE,WNBA,3,3
3,1,WAS,WNBA,0,2
4,1,NYL,WNBA,4,3


The `teams_post.csv` dataset is a simple one: it contains the playoff win-loss record of each team that took part in the playoffs in a given year.

The features available in the `teams_post.csv` dataset are:

- `year` - The year the data was collected, which is important as the data for a certain year will be used to predict the next year's results;
- `tmID` - The team ID, used to identify the team;
- `lgID` - The league ID, which is the same for all teams;
- `W` - Number of wins in the playoffs;
- `L` - Number of losses in the playoffs.

In [8]:
teams_post.describe()

Unnamed: 0,year,W,L
count,80.0,80.0,80.0
mean,5.5,2.35,2.35
std,2.890403,2.228129,0.843441
min,1.0,0.0,0.0
25%,3.0,1.0,2.0
50%,5.5,1.5,2.0
75%,8.0,3.25,3.0
max,10.0,7.0,5.0


In this dataset, `lgID` is the only column that is not useful for the model, since it is the same for all teams.
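Because `tmID` and `year` identify a team season in both files, the playoff record can be attached to the regular-season rows with a merge. The sketch below assumes toy data: the `teams_post` rows are the five shown above, the `teams` side keeps only the key columns, and `post_W` / `post_L` are hypothetical names chosen to avoid clashing with other columns.

```python
import pandas as pd

# Key columns only; in the notebook this would be the full `teams` frame.
toy_teams = pd.DataFrame({"year": [1, 1, 1], "tmID": ["CHA", "HOU", "CLE"]})

# The five `teams_post` rows displayed above.
toy_post = pd.DataFrame(
    {
        "year": [1, 1, 1, 1, 1],
        "tmID": ["HOU", "ORL", "CLE", "WAS", "NYL"],
        "lgID": ["WNBA"] * 5,
        "W": [6, 1, 3, 0, 4],
        "L": [0, 2, 3, 2, 3],
    }
)

# Drop the constant league column and give W/L unambiguous names.
post = toy_post.drop(columns="lgID").rename(columns={"W": "post_W", "L": "post_L"})

# Left merge: teams that missed the playoffs have no match and get NaN,
# which is replaced by 0 playoff wins and 0 playoff losses.
merged = toy_teams.merge(post, on=["year", "tmID"], how="left")
merged[["post_W", "post_L"]] = merged[["post_W", "post_L"]].fillna(0).astype(int)
print(merged)
```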


---

## SERIES_POST

Finally, we will study the `series_post.csv` dataset.

In [9]:
series.head()

Unnamed: 0,year,round,series,tmIDWinner,lgIDWinner,tmIDLoser,lgIDLoser,W,L
0,1,FR,A,CLE,WNBA,ORL,WNBA,2,1
1,1,FR,B,NYL,WNBA,WAS,WNBA,2,0
2,1,FR,C,LAS,WNBA,PHO,WNBA,2,0
3,1,FR,D,HOU,WNBA,SAC,WNBA,2,0
4,1,CF,E,HOU,WNBA,LAS,WNBA,2,0


The available features in the `series_post.csv` dataset are:

- `year` - The year the data was collected, which is important as the data for a certain year will be used to predict the next year's results;
- `round` - The round of the playoffs. It can have the following values:
  - `FR` - First Round;
  - `CF` - Conference Finals (the semifinals);
  - `F` - Finals.
- `series` - The series identifier, a letter from A to G that restarts every year;
- `tmIDWinner` - The team ID of the winner;
- `lgIDWinner` - The league ID of the winner;
- `tmIDLoser` - The team ID of the loser;
- `lgIDLoser` - The league ID of the loser;
- `W` - Number of games won by the series winner;
- `L` - Number of games lost by the series winner (that is, games won by the loser).

The `series_post.csv` dataset will be useful when building the target variable for the model, which will be the playoff qualification. From this dataset we can also derive a column describing how far each team went in the playoffs, which gives some insight into its playoff performance.
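One possible way to build such a column is to record the furthest round each team reached in a given year. This is only a sketch on the five rows displayed above; the numeric encoding of the rounds and the `furthest_round` column name are assumptions made for illustration.

```python
import pandas as pd

# The five `series_post` rows displayed above (league columns omitted).
toy_series = pd.DataFrame(
    {
        "year": [1, 1, 1, 1, 1],
        "round": ["FR", "FR", "FR", "FR", "CF"],
        "tmIDWinner": ["CLE", "NYL", "LAS", "HOU", "HOU"],
        "tmIDLoser": ["ORL", "WAS", "PHO", "SAC", "LAS"],
    }
)

# Assumed ordinal encoding: a higher number means a later round.
round_order = {"FR": 1, "CF": 2, "F": 3}

# Stack winners and losers so every participating team has one row per series.
long = pd.concat(
    [
        toy_series[["year", "round", "tmIDWinner"]].rename(columns={"tmIDWinner": "tmID"}),
        toy_series[["year", "round", "tmIDLoser"]].rename(columns={"tmIDLoser": "tmID"}),
    ],
    ignore_index=True,
)
long["round_num"] = long["round"].map(round_order)

# Furthest round played by each team in each year.
furthest = (
    long.groupby(["year", "tmID"])["round_num"]
    .max()
    .reset_index(name="furthest_round")
)
print(furthest)
```

Teams that never appear in `series_post.csv` for a given year did not play in the playoffs, so after merging this frame with `teams` they could be given a value of 0.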

In [11]:
series.describe()

Unnamed: 0,year,W,L
count,70.0,70.0,70.0
mean,5.5,2.071429,0.614286
std,2.89302,0.259399,0.572127
min,1.0,2.0,0.0
25%,3.0,2.0,0.0
50%,5.5,2.0,1.0
75%,8.0,2.0,1.0
max,10.0,3.0,2.0


There are no columns with NaN values in this dataset, which is good for the model.
However, there are some useless columns that will be removed:

- `lgIDWinner`, `lgIDLoser` -> the league ID is the same for all teams;
- `series` -> this column is not useful for the model, since it is just an identifier for the series and repeats itself every year. Also, we already have the `round` column to identify the round of the playoffs.

Depending on the model, the `round` column can also be removed, since we can focus on the number of wins and losses to predict the playoff qualification. 
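As `describe()` only summarises numeric columns, the NaN check can be confirmed across every column with `isna()`, and the columns listed above can then be dropped. The block below is a hedged sketch on the five rows displayed earlier, not the project's actual cleaning code; `cleaned` is an illustrative variable name.

```python
import pandas as pd

# The five `series_post` rows displayed above.
toy_series = pd.DataFrame(
    {
        "year": [1, 1, 1, 1, 1],
        "round": ["FR", "FR", "FR", "FR", "CF"],
        "series": ["A", "B", "C", "D", "E"],
        "tmIDWinner": ["CLE", "NYL", "LAS", "HOU", "HOU"],
        "lgIDWinner": ["WNBA"] * 5,
        "tmIDLoser": ["ORL", "WAS", "PHO", "SAC", "LAS"],
        "lgIDLoser": ["WNBA"] * 5,
        "W": [2, 2, 2, 2, 2],
        "L": [1, 0, 0, 0, 0],
    }
)

# Count missing values per column, including the non-numeric ones.
nan_counts = toy_series.isna().sum()
print(nan_counts)

# Drop the constant league columns and the per-year series identifier.
cleaned = toy_series.drop(columns=["lgIDWinner", "lgIDLoser", "series"])
print(cleaned.columns.tolist())
```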