# Introduction
The NBA games dataset contains detailed information about players, teams, and results that can be combined to build a predictive model for future games. This notebook first builds a dataset for predictive modelling and then trims it down to a few features to improve interpretability. Let's get started!

We have 5 datasets at our disposal:

- Games -> Information about each game and the stats of the teams playing
- Games details -> More detailed information about the individual player stats
- Players -> Information about the name of the player and his team
- Ranking -> Information about the standings of each team on individual days throughout the season
- Teams -> Information about each team, including ownership, arena, and when it was established


# Import Libraries and Data

In [34]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd


# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Display floats with 3 decimal places (affects display only, not the stored values)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [35]:
games = pd.read_csv('games.csv')
games_details = pd.read_csv('games_details.csv')
players = pd.read_csv('players.csv')
ranking = pd.read_csv("ranking.csv")
teams = pd.read_csv("teams.csv")


# Data Check

#### **POINTS: 15**
- View the first five entries of each dataframe
- Determine the number of entries in each dataframe
- Check the data types for each dataframe
- Check for missing values
- Create a statistical summary of the numerical data
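
The five checks above could be bundled into a small helper and applied to each dataframe in turn. The sketch below is one possible approach: `quick_check` is a hypothetical helper name, and the toy `demo` frame (made-up values) stands in for the real CSVs so the block runs on its own.

```python
import numpy as np
import pandas as pd


def quick_check(df: pd.DataFrame, name: str) -> dict:
    """Print the five data checks for one dataframe and return the key numbers."""
    print(f"--- {name} ---")
    print(df.head())                      # first five entries
    print("Entries:", df.shape[0])        # number of rows
    print(df.dtypes)                      # data type of each column
    missing = df.isna().sum()             # missing values per column
    print(missing[missing > 0])           # show only columns with gaps
    print(df.describe())                  # summary of the numerical columns
    return {"rows": df.shape[0], "missing": int(missing.sum())}


# Toy stand-in for one of the real dataframes (values are made up)
demo = pd.DataFrame({
    "GAME_ID": [1, 2, 3],
    "PTS_home": [126.0, np.nan, 114.0],
    "HOME_TEAM_WINS": [1, 1, 0],
})
result = quick_check(demo, "demo")
```

In the notebook, the same call would be repeated for `games`, `games_details`, `players`, `ranking`, and `teams`.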



**Games:**

In [36]:
games.head()

Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,AST_home,REB_home,TEAM_ID_away,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS
0,2022-12-22,22200477,Final,1610612740,1610612759,2022,1610612740,126.0,0.484,0.926,...,25.0,46.0,1610612759,117.0,0.478,0.815,0.321,23.0,44.0,1
1,2022-12-22,22200478,Final,1610612762,1610612764,2022,1610612762,120.0,0.488,0.952,...,16.0,40.0,1610612764,112.0,0.561,0.765,0.333,20.0,37.0,1
2,2022-12-21,22200466,Final,1610612739,1610612749,2022,1610612739,114.0,0.482,0.786,...,22.0,37.0,1610612749,106.0,0.47,0.682,0.433,20.0,46.0,1
3,2022-12-21,22200467,Final,1610612755,1610612765,2022,1610612755,113.0,0.441,0.909,...,27.0,49.0,1610612765,93.0,0.392,0.735,0.261,15.0,46.0,1
4,2022-12-21,22200468,Final,1610612737,1610612741,2022,1610612737,108.0,0.429,1.0,...,22.0,47.0,1610612741,110.0,0.5,0.773,0.292,20.0,47.0,0


**Games Details:**

In [37]:
games_details.head()

Unnamed: 0,GAME_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_CITY,PLAYER_ID,PLAYER_NAME,NICKNAME,START_POSITION,COMMENT,MIN,...,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS
0,22200477,1610612759,SAS,San Antonio,1629641,Romeo Langford,Romeo,F,,18:06,...,1.0,1.0,2.0,0.0,1.0,0.0,2.0,5.0,2.0,-2.0
1,22200477,1610612759,SAS,San Antonio,1631110,Jeremy Sochan,Jeremy,F,,31:01,...,6.0,3.0,9.0,6.0,1.0,0.0,2.0,1.0,23.0,-14.0
2,22200477,1610612759,SAS,San Antonio,1627751,Jakob Poeltl,Jakob,C,,21:42,...,1.0,3.0,4.0,1.0,1.0,0.0,2.0,4.0,13.0,-4.0
3,22200477,1610612759,SAS,San Antonio,1630170,Devin Vassell,Devin,G,,30:20,...,0.0,9.0,9.0,5.0,3.0,0.0,2.0,1.0,10.0,-18.0
4,22200477,1610612759,SAS,San Antonio,1630200,Tre Jones,Tre,G,,27:44,...,0.0,2.0,2.0,3.0,0.0,0.0,2.0,2.0,19.0,0.0


**Players:**

**Ranking:**

**Teams:**

In [38]:
teams.columns

Index(['LEAGUE_ID', 'TEAM_ID', 'MIN_YEAR', 'MAX_YEAR', 'ABBREVIATION',
       'NICKNAME', 'YEARFOUNDED', 'CITY', 'ARENA', 'ARENACAPACITY', 'OWNER',
       'GENERALMANAGER', 'HEADCOACH', 'DLEAGUEAFFILIATION'],
      dtype='object')

In [39]:
teams.shape

(30, 14)


# Identifying And Treating Erroneous Data

We won't need to handle erroneous data, as the NBA and its analysts have already cleaned and verified the values.

# Imputing Missing Data

In contrast to erroneous data, the NBA has deliberately chosen not to impute missing values, for several reasons:

- Nature of Missing Data: Missing data is sometimes random or unavoidable, and imputing it relies on assumptions that may not hold in reality. The NBA chooses not to impute when it believes the missing data does not significantly affect the overall analysis or conclusions.
- Data Integrity: Imputing missing values can introduce inaccuracies or bias into the dataset. The NBA might prioritize data integrity so that any analysis or decision rests on the most accurate information available.
- Statistical Impact: Imputation methods can skew statistical analyses or models. The NBA prefers transparent reporting and does not want to fill in missing data in ways that could alter the representation of actual trends or patterns.
- Ethical and Policy Reasons: Ethical considerations or policy guidelines within the NBA discourage manipulating or estimating data, in order to maintain transparency and trust among stakeholders, fans, and analysts.

#### **POINTS: 25**
We will either impute the missing values or remove entries containing missing values to enhance the accuracy of our prediction models. Describe the rationale behind your choice to impute or remove these values. If imputation is chosen, explain the method employed for filling in the missing data.
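
As an illustration of the two options, the sketch below first drops rows that lack a key field and then fills a remaining numeric gap with the column median. The toy frame (made-up values) and the choice of median imputation are assumptions for demonstration, not the method this notebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names mirror the games data
toy = pd.DataFrame({
    "PTS_home": [126.0, np.nan, 114.0, 108.0],
    "FG_PCT_home": [0.484, 0.488, np.nan, 0.429],
    "HOME_TEAM_WINS": [1, 1, 1, 0],
})

# Option 1: remove entries where a key field is missing
dropped = toy.dropna(subset=["PTS_home"])

# Option 2: impute remaining numeric gaps with the column median,
# which is less sensitive to outliers than the mean
imputed = dropped.copy()
imputed["FG_PCT_home"] = imputed["FG_PCT_home"].fillna(
    imputed["FG_PCT_home"].median()
)

print(len(toy), "->", len(dropped), "rows after dropping")
print(imputed.isna().sum().sum(), "missing values remain")
```

Dropping is simplest when only a small share of rows is affected, while imputation keeps more rows at the cost of introducing estimated values.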



# Merge games and games_details dataframes into a single dataframe

Run this cell only after you have handled the missing values in the games and games_details dataframes.

In [40]:
data = pd.merge(games, games_details, on='GAME_ID', how='inner')
data.head()

Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS
0,2022-12-22,22200477,Final,1610612740,1610612759,2022,1610612740,126.0,0.484,0.926,...,1.0,1.0,2.0,0.0,1.0,0.0,2.0,5.0,2.0,-2.0
1,2022-12-22,22200477,Final,1610612740,1610612759,2022,1610612740,126.0,0.484,0.926,...,6.0,3.0,9.0,6.0,1.0,0.0,2.0,1.0,23.0,-14.0
2,2022-12-22,22200477,Final,1610612740,1610612759,2022,1610612740,126.0,0.484,0.926,...,1.0,3.0,4.0,1.0,1.0,0.0,2.0,4.0,13.0,-4.0
3,2022-12-22,22200477,Final,1610612740,1610612759,2022,1610612740,126.0,0.484,0.926,...,0.0,9.0,9.0,5.0,3.0,0.0,2.0,1.0,10.0,-18.0
4,2022-12-22,22200477,Final,1610612740,1610612759,2022,1610612740,126.0,0.484,0.926,...,0.0,2.0,2.0,3.0,0.0,0.0,2.0,2.0,19.0,0.0
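
Because each game has many player rows, the merge repeats the game-level columns on every player row, as the output shows. A quick sanity check on the join can catch duplicated game IDs or unmatched rows. The sketch below uses toy frames with made-up values; `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins: one row per game, several player rows per game
games_toy = pd.DataFrame({"GAME_ID": [1, 2], "HOME_TEAM_WINS": [1, 0]})
details_toy = pd.DataFrame({
    "GAME_ID": [1, 1, 2, 3],            # game 3 has no match in games_toy
    "PLAYER_ID": [10, 11, 12, 13],
    "PTS": [2.0, 23.0, 13.0, 10.0],
})

# validate raises MergeError if GAME_ID is not unique on the left side
merged = pd.merge(games_toy, details_toy, on="GAME_ID",
                  how="inner", validate="one_to_many")

# An outer merge with indicator=True shows which rows the inner join dropped
audit = pd.merge(games_toy, details_toy, on="GAME_ID",
                 how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(merged.shape)
print(unmatched[["GAME_ID", "_merge"]])
```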


# Exploratory Data Analysis
Exploratory Data Analysis (EDA) serves several crucial purposes in data analysis:

- EDA helps in understanding the structure, distribution, and nature of the dataset. It involves summarizing the main characteristics of the data and revealing underlying patterns or trends.
- EDA techniques uncover patterns, trends, correlations, and outliers within the data. This exploration helps identify potential relationships between variables and anomalies that might require further investigation.
- By exploring relationships between variables, EDA assists in selecting relevant features for analysis or modeling. It helps in understanding which variables might be more influential or predictive.
- EDA helps in generating and refining hypotheses about the data, guiding subsequent statistical tests or modeling. It allows analysts to formulate informed questions about the dataset for further investigation.
- EDA aids in visually and qualitatively presenting insights to stakeholders. Visualizations and summary statistics from EDA can effectively communicate complex findings in an accessible manner.

#### **POINTS: 40**
Your exploratory data analysis (EDA) will involve several key steps:

- Questioning the Dataframe: Begin by querying your dataset, posing specific questions, and seeking answers within the data.

- Visual Representation: Utilize a range of graphs and visualizations to explore various aspects of your data:
  
  - Univariate Analysis: Visualize distributions and counts for individual fields using histograms, boxplots, and countplots to understand their characteristics and variations.
  
  - Multivariate Analysis: Examine relationships and correlations between multiple fields through countplots, lmplots (for regression-based insights), and heatmaps to uncover connections and dependencies within the dataset.

By combining these approaches, you can gain a deeper understanding of your data, revealing patterns, outliers, relationships, and key insights that pave the way for informed decision-making and further analysis.
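
As a starting point for the univariate analysis described above, here is a minimal histogram sketch. It assumes synthetic scores rather than the real data, and the bin count is an illustrative choice only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic scores standing in for games['PTS_home'] (not real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"PTS_home": rng.normal(105, 12, size=500).round()})

# Histogram of a single numeric field
fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(toy["PTS_home"], bins=20, edgecolor="black")
ax.set_xlabel("Home team points")
ax.set_ylabel("Number of games")
ax.set_title("Distribution of home team points (synthetic)")
```

In a notebook the figure renders at the end of the cell. seaborn's `sns.histplot` would be an alternative for the same plot.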

Your analysis needs to pose and answer a minimum of **eight questions** that aid in comprehending the factors influencing a team's victory in a game. Additionally, create **eight graphs** that visually represent and elucidate the data, each with a brief caption explaining what the graph shows.
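
One way to pose and answer a question in code is sketched below, assuming a toy frame with made-up values: "How often does the home team win, and does shooting better than the visitor go along with winning?"

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the games data
toy = pd.DataFrame({
    "FG_PCT_home": [0.50, 0.45, 0.48, 0.44, 0.42, 0.40],
    "FG_PCT_away": [0.47, 0.52, 0.46, 0.39, 0.50, 0.45],
    "HOME_TEAM_WINS": [1, 1, 1, 1, 0, 0],
})

# Question 1: what share of games does the home team win?
home_win_rate = toy["HOME_TEAM_WINS"].mean()

# Question 2: how does the win rate differ when the home team
# shoots better than the visitor?
toy["home_shot_better"] = toy["FG_PCT_home"] > toy["FG_PCT_away"]
win_rate_by_shooting = toy.groupby("home_shot_better")["HOME_TEAM_WINS"].mean()

print(f"Home win rate: {home_win_rate:.3f}")
print(win_rate_by_shooting)
```

Each answer of this kind can then be paired with a graph and a one-sentence caption.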


# Predictive Classification Model

#### **POINTS: 20**
Use the PyCaret library to generate a binary classification model to predict the outcome of games.

[PyCaret Binary Classification Tutorial](https://colab.research.google.com/github/pycaret/pycaret/blob/master/tutorials/Tutorial%20-%20Binary%20Classification.ipynb)

Explain which metric one should use to evaluate the prediction model (accuracy, precision, recall, or F1-score).
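
To make the comparison concrete, the sketch below computes all four metrics from the confusion-matrix counts for a handful of made-up predictions, where 1 means the home team wins. The labels are invented for illustration; which metric matters most depends on how balanced the classes are and how costly each kind of error is.

```python
import numpy as np

# Made-up labels: 1 = home team wins, 0 = home team loses
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 1, 1, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # correctly predicted home wins
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # correctly predicted home losses
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted win, actual loss
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted loss, actual win

accuracy = (tp + tn) / len(y_true)    # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted wins that were wins
recall = tp / (tp + fn)               # share of actual wins that were predicted
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```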