# AC Project - Basketball Playoffs Qualification

This notebook analyses basketball data from the past 10 seasons.

## 1. Data Collection and Understanding

The datasets contain the following records:

* awards_players (95 records) - awards and prizes received by players across 10 seasons
* coaches (162 records) - the coaches who managed each team during the period
* players (893 records) - details of each player
* players_teams (1876 records) - each player's performance for each team they played for
* series_post (70 records) - the result of each post-season series
* teams (142 records) - each team's performance in each season
* teams_post (80 records) - each team's results in the post-season

The goal of this project is to predict which teams will qualify for the playoffs in the next season, using data on players, teams, coaches, awards and post-season results.
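One way to frame this goal as a supervised problem is to pair each team-season with the playoff outcome of the same team in the following season. The sketch below assumes the `year`, `tmID` and `playoff` columns shown for teams.csv. The helper name `add_next_season_target`, the `playoff_next` column and the toy rows are illustrative and not part of the notebook.

```python
import pandas as pd


def add_next_season_target(teams):
    """Attach each team's playoff result from the following season.

    Hypothetical helper: assumes `teams` has `tmID`, `year` and `playoff`.
    """
    # Copy the labels and shift them back one year so that the label of
    # season t+1 lines up with the features of season t.
    nxt = teams[["tmID", "year", "playoff"]].copy()
    nxt["year"] = nxt["year"] - 1
    nxt = nxt.rename(columns={"playoff": "playoff_next"})
    # A left join keeps every team-season; rows without a following
    # season (for example the last year) get NaN as their target.
    return teams.merge(nxt, on=["tmID", "year"], how="left")


# Toy rows (not real data) to show the behaviour.
toy = pd.DataFrame({
    "year": [1, 2, 3, 1, 2],
    "tmID": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "playoff": ["N", "Y", "Y", "Y", "N"],
})
labelled = add_next_season_target(toy)
print(labelled)
```

Joining on `tmID` and `year` rather than shifting rows within a group avoids mislabelling a team that is missing a season in the data.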

**Required Libraries**

In [30]:
%matplotlib inline
import bz2
import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn import metrics, tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_graphviz

**Read the Datasets**

In [31]:
df = pd.read_csv('awards_players.csv', header=0)
df

Unnamed: 0,playerID,award,year,lgID
0,thompti01w,All-Star Game Most Valuable Player,1,WNBA
1,leslili01w,All-Star Game Most Valuable Player,2,WNBA
2,leslili01w,All-Star Game Most Valuable Player,3,WNBA
3,teaslni01w,All-Star Game Most Valuable Player,4,WNBA
4,swoopsh01w,All-Star Game Most Valuable Player,6,WNBA
...,...,...,...,...
90,boltoru01w,WNBA All Decade Team Honorable Mention,7,WNBA
91,holdsch01w,WNBA All Decade Team Honorable Mention,7,WNBA
92,penicti01w,WNBA All Decade Team Honorable Mention,7,WNBA
93,tauradi01w,WNBA All Decade Team Honorable Mention,7,WNBA


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95 entries, 0 to 94
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   playerID  95 non-null     object
 1   award     95 non-null     object
 2   year      95 non-null     int64 
 3   lgID      95 non-null     object
dtypes: int64(1), object(3)
memory usage: 3.1+ KB


In [16]:
df1 = pd.read_csv('coaches.csv', header=0)
df1

Unnamed: 0,coachID,year,tmID,lgID,stint,won,lost,post_wins,post_losses
0,adamsmi01w,5,WAS,WNBA,0,17,17,1,2
1,adubari99w,1,NYL,WNBA,0,20,12,4,3
2,adubari99w,2,NYL,WNBA,0,21,11,3,3
3,adubari99w,3,NYL,WNBA,0,18,14,4,4
4,adubari99w,4,NYL,WNBA,0,16,18,0,0
...,...,...,...,...,...,...,...,...,...
157,wintebr01w,6,IND,WNBA,0,21,13,2,2
158,wintebr01w,7,IND,WNBA,0,21,13,0,2
159,wintebr01w,8,IND,WNBA,0,21,13,3,3
160,zierddo99w,8,MIN,WNBA,0,10,24,0,0


In [23]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162 entries, 0 to 161
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   coachID      162 non-null    object
 1   year         162 non-null    int64 
 2   tmID         162 non-null    object
 3   lgID         162 non-null    object
 4   stint        162 non-null    int64 
 5   won          162 non-null    int64 
 6   lost         162 non-null    int64 
 7   post_wins    162 non-null    int64 
 8   post_losses  162 non-null    int64 
dtypes: int64(6), object(3)
memory usage: 11.5+ KB


In [17]:
df2 = pd.read_csv('players.csv', header=0)
df2

Unnamed: 0,bioID,pos,firstseason,lastseason,height,weight,college,collegeOther,birthDate,deathDate
0,abrahta01w,C,0,0,74.0,190,George Washington,,1975-09-27,0000-00-00
1,abrossv01w,F,0,0,74.0,169,Connecticut,,1980-07-09,0000-00-00
2,adairje01w,C,0,0,76.0,197,George Washington,,1986-12-19,0000-00-00
3,adamsda01w,F-C,0,0,73.0,239,Texas A&M,Jefferson College (JC),1989-02-19,0000-00-00
4,adamsjo01w,C,0,0,75.0,180,New Mexico,,1981-05-24,0000-00-00
...,...,...,...,...,...,...,...,...,...,...
888,zellosh01w,G,0,0,70.0,155,Pittsburgh,,1986-08-28,0000-00-00
889,zhengha01w,C,0,0,80.0,254,,,1967-03-07,0000-00-00
890,zierddo99w,,0,0,0.0,0,,,0000-00-00,0000-00-00
891,zirkozu01w,G,0,0,69.0,145,,,1980-06-06,0000-00-00


In [24]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   bioID         893 non-null    object 
 1   pos           815 non-null    object 
 2   firstseason   893 non-null    int64  
 3   lastseason    893 non-null    int64  
 4   height        893 non-null    float64
 5   weight        893 non-null    int64  
 6   college       726 non-null    object 
 7   collegeOther  11 non-null     object 
 8   birthDate     893 non-null    object 
 9   deathDate     893 non-null    object 
dtypes: float64(1), int64(3), object(6)
memory usage: 69.9+ KB
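The `info()` output reports no missing values for `height`, `weight`, `birthDate` or `deathDate`, yet the preview shows placeholder entries such as `0.0`, `0` and `0000-00-00` (for example the `zierddo99w` row, whose ID also appears in coaches.csv). The following sketch shows one possible way to turn those placeholders into real missing values. The helper name `mask_placeholders` is hypothetical, and treating `0` and `0000-00-00` as "unknown" is an assumption about the data.

```python
import pandas as pd


def mask_placeholders(players):
    """Replace assumed placeholder values with NaN/NaT (hypothetical helper)."""
    out = players.copy()
    # Assumption: a height or weight of 0 means "not recorded".
    for col in ("height", "weight"):
        out[col] = out[col].mask(out[col] == 0)
    # Assumption: the date string 0000-00-00 means "no date".
    for col in ("birthDate", "deathDate"):
        cleaned = out[col].mask(out[col] == "0000-00-00")
        out[col] = pd.to_datetime(cleaned, errors="coerce")
    return out


# Two rows copied from the preview above, restricted to the relevant columns.
toy = pd.DataFrame({
    "bioID": ["abrahta01w", "zierddo99w"],
    "height": [74.0, 0.0],
    "weight": [190, 0],
    "birthDate": ["1975-09-27", "0000-00-00"],
    "deathDate": ["0000-00-00", "0000-00-00"],
})
clean = mask_placeholders(toy)
print(clean.isna().sum())
```

In the notebook this could be applied as `mask_placeholders(df2)` before computing any statistics on player height, weight or age.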


In [18]:
df3 = pd.read_csv('players_teams.csv', header=0)
df3

Unnamed: 0,playerID,year,stint,tmID,lgID,GP,GS,minutes,points,oRebounds,...,PostBlocks,PostTurnovers,PostPF,PostfgAttempted,PostfgMade,PostftAttempted,PostftMade,PostthreeAttempted,PostthreeMade,PostDQ
0,abrossv01w,2,0,MIN,WNBA,26,23,846,343,43,...,0,0,0,0,0,0,0,0,0,0
1,abrossv01w,3,0,MIN,WNBA,27,27,805,314,45,...,0,0,0,0,0,0,0,0,0,0
2,abrossv01w,4,0,MIN,WNBA,30,25,792,318,44,...,1,8,8,22,6,8,8,7,3,0
3,abrossv01w,5,0,MIN,WNBA,22,11,462,146,17,...,2,3,7,23,8,4,2,8,2,0
4,abrossv01w,6,0,MIN,WNBA,31,31,777,304,29,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1871,zakalok01w,3,2,PHO,WNBA,5,0,37,6,0,...,0,0,0,0,0,0,0,0,0,0
1872,zarafr01w,6,0,SEA,WNBA,34,4,413,90,11,...,0,5,0,6,4,2,2,1,1,0
1873,zellosh01w,10,0,DET,WNBA,34,4,802,406,25,...,3,7,15,68,24,27,23,17,7,0
1874,zirkozu01w,4,0,WAS,WNBA,6,0,30,11,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1876 entries, 0 to 1875
Data columns (total 43 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   playerID            1876 non-null   object
 1   year                1876 non-null   int64 
 2   stint               1876 non-null   int64 
 3   tmID                1876 non-null   object
 4   lgID                1876 non-null   object
 5   GP                  1876 non-null   int64 
 6   GS                  1876 non-null   int64 
 7   minutes             1876 non-null   int64 
 8   points              1876 non-null   int64 
 9   oRebounds           1876 non-null   int64 
 10  dRebounds           1876 non-null   int64 
 11  rebounds            1876 non-null   int64 
 12  assists             1876 non-null   int64 
 13  steals              1876 non-null   int64 
 14  blocks              1876 non-null   int64 
 15  turnovers           1876 non-null   int64 
 16  PF                  1876

In [19]:
df4 = pd.read_csv('series_post.csv', header=0)
df4

Unnamed: 0,year,round,series,tmIDWinner,lgIDWinner,tmIDLoser,lgIDLoser,W,L
0,1,FR,A,CLE,WNBA,ORL,WNBA,2,1
1,1,FR,B,NYL,WNBA,WAS,WNBA,2,0
2,1,FR,C,LAS,WNBA,PHO,WNBA,2,0
3,1,FR,D,HOU,WNBA,SAC,WNBA,2,0
4,1,CF,E,HOU,WNBA,LAS,WNBA,2,0
...,...,...,...,...,...,...,...,...,...
65,10,FR,C,IND,WNBA,WAS,WNBA,2,0
66,10,FR,D,DET,WNBA,ATL,WNBA,2,0
67,10,CF,E,PHO,WNBA,LAS,WNBA,2,1
68,10,CF,F,IND,WNBA,DET,WNBA,2,1


In [26]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   year        70 non-null     int64 
 1   round       70 non-null     object
 2   series      70 non-null     object
 3   tmIDWinner  70 non-null     object
 4   lgIDWinner  70 non-null     object
 5   tmIDLoser   70 non-null     object
 6   lgIDLoser   70 non-null     object
 7   W           70 non-null     int64 
 8   L           70 non-null     int64 
dtypes: int64(3), object(6)
memory usage: 5.0+ KB


In [20]:
df5 = pd.read_csv('teams.csv', header=0)
df5

Unnamed: 0,year,lgID,tmID,franchID,confID,divID,rank,playoff,seeded,firstRound,...,GP,homeW,homeL,awayW,awayL,confW,confL,min,attend,arena
0,9,WNBA,ATL,ATL,EA,,7,N,0,,...,34,1,16,3,14,2,18,6825,141379,Philips Arena
1,10,WNBA,ATL,ATL,EA,,2,Y,0,L,...,34,12,5,6,11,10,12,6950,120737,Philips Arena
2,1,WNBA,CHA,CHA,EA,,8,N,0,,...,32,5,11,3,13,5,16,6475,90963,Charlotte Coliseum
3,2,WNBA,CHA,CHA,EA,,4,Y,0,W,...,32,11,5,7,9,15,6,6500,105525,Charlotte Coliseum
4,3,WNBA,CHA,CHA,EA,,2,Y,0,L,...,32,11,5,7,9,12,9,6450,106670,Charlotte Coliseum
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137,6,WNBA,WAS,WAS,EA,,5,N,0,,...,34,10,7,6,11,9,11,6900,171501,Verizon Center
138,7,WNBA,WAS,WAS,EA,,4,Y,0,L,...,34,13,4,5,12,12,8,6850,133255,Verizon Center
139,8,WNBA,WAS,WAS,EA,,5,N,0,,...,34,8,9,8,9,8,12,6900,133255,Verizon Center
140,9,WNBA,WAS,WAS,EA,,6,N,0,,...,34,6,11,4,13,6,14,6825,154637,Verizon Center


In [27]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 61 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   year        142 non-null    int64  
 1   lgID        142 non-null    object 
 2   tmID        142 non-null    object 
 3   franchID    142 non-null    object 
 4   confID      142 non-null    object 
 5   divID       0 non-null      float64
 6   rank        142 non-null    int64  
 7   playoff     142 non-null    object 
 8   seeded      142 non-null    int64  
 9   firstRound  80 non-null     object 
 10  semis       38 non-null     object 
 11  finals      20 non-null     object 
 12  name        142 non-null    object 
 13  o_fgm       142 non-null    int64  
 14  o_fga       142 non-null    int64  
 15  o_ftm       142 non-null    int64  
 16  o_fta       142 non-null    int64  
 17  o_3pm       142 non-null    int64  
 18  o_3pa       142 non-null    int64  
 19  o_oreb      142 non-null    i

In [21]:
df6 = pd.read_csv('teams_post.csv', header=0)
df6

Unnamed: 0,year,tmID,lgID,W,L
0,1,HOU,WNBA,6,0
1,1,ORL,WNBA,1,2
2,1,CLE,WNBA,3,3
3,1,WAS,WNBA,0,2
4,1,NYL,WNBA,4,3
...,...,...,...,...,...
75,10,SAS,WNBA,1,2
76,10,PHO,WNBA,7,4
77,10,SEA,WNBA,1,2
78,10,LAS,WNBA,3,3


In [28]:
df6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   year    80 non-null     int64 
 1   tmID    80 non-null     object
 2   lgID    80 non-null     object
 3   W       80 non-null     int64 
 4   L       80 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 3.2+ KB
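The seven `info()` calls above can be condensed into a single overview table, which makes it easier to compare row counts and spot columns with missing cells. This is a minimal sketch: the helper name `summarize` is hypothetical, and the toy frames stand in for the real DataFrames so the block runs without the CSV files.

```python
import pandas as pd


def summarize(frames):
    """Return one row per dataset: row count, column count, missing cells."""
    rows = [
        {
            "dataset": name,
            "rows": len(frame),
            "columns": frame.shape[1],
            "missing_cells": int(frame.isna().sum().sum()),
        }
        for name, frame in frames.items()
    ]
    return pd.DataFrame(rows, columns=["dataset", "rows", "columns", "missing_cells"])


# Toy stand-ins (a few values copied from the previews above).
toy_frames = {
    "teams_post": pd.DataFrame(
        {"year": [1, 1], "tmID": ["HOU", "ORL"], "W": [6, 1], "L": [0, 2]}
    ),
    "players": pd.DataFrame({"bioID": ["abrahta01w", "zierddo99w"], "pos": ["C", None]}),
}
summary = summarize(toy_frames)
print(summary)
```

With the notebook's variables, the call would look like `summarize({"awards_players": df, "coaches": df1, "players": df2, "players_teams": df3, "series_post": df4, "teams": df5, "teams_post": df6})`.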


## 2. Data Analysis

**Separate Dataset into Train and Test**

This experimental pipeline uses only teams.csv (df5).
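Because the data is organised by season, one option is a chronological split: train on earlier seasons and test on a later one. The sketch below illustrates that idea under stated assumptions. It is not necessarily the split used in this notebook, which imports `train_test_split` (a random split by default). The helper name `split_by_season`, the choice of `test_year` and the toy rows are hypothetical; only the `year` and `playoff` columns come from teams.csv.

```python
import pandas as pd


def split_by_season(teams, test_year, target="playoff"):
    """Train on seasons before `test_year`, test on `test_year` itself."""
    data = teams.copy()
    # Encode the Y/N playoff flag as 1/0 for scikit-learn classifiers.
    data[target] = data[target].map({"Y": 1, "N": 0})
    train = data[data["year"] < test_year]
    test = data[data["year"] == test_year]
    X_train, y_train = train.drop(columns=[target]), train[target]
    X_test, y_test = test.drop(columns=[target]), test[target]
    return X_train, X_test, y_train, y_test


# Toy rows (not real data) with a subset of the teams.csv columns.
toy = pd.DataFrame({
    "year": [8, 8, 9, 9, 10, 10],
    "tmID": ["AAA", "BBB", "AAA", "BBB", "AAA", "BBB"],
    "rank": [1, 5, 2, 6, 3, 4],
    "playoff": ["Y", "N", "Y", "N", "Y", "N"],
})
X_train, X_test, y_train, y_test = split_by_season(toy, test_year=10)
print(X_train.shape, X_test.shape)
```

With the real data the call would be along the lines of `split_by_season(df5, test_year=10)`. Columns that directly encode the outcome of the same season, such as `firstRound`, `semis` and `finals`, would likely need to be dropped before training to avoid leaking the label.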