# Version 2 EDA

## TODO - Questions

1. **Data Cleaning and Preprocessing**: Handle missing values, check for data inconsistencies, and prepare your data for further analysis.
2. **Descriptive Statistics**: Calculate measures of central tendency (mean, median, mode) and dispersion (range, variance, standard deviation) for player performance metrics like points, assists, rebounds, etc.
3. **Probability Distributions**: Analyze the distribution of player performance metrics across the NBA. Are they normally distributed or do they follow some other distribution? Are there noticeable skews or outliers?
4. **Correlation and Covariance Analysis**: Determine how different variables in your dataset are related. For example, is there a correlation between the number of assists a player has and the number of points they score?
5. **Hypothesis Testing - Home vs Away**: Perform a statistical test to determine if there's a significant difference in teams' performance when playing home versus away games.
6. **Inferential Statistics for Player Performance**: Use hypothesis testing to make inferences about player performance. For example, are there statistically significant differences in player performance between different positions?
7. **Compare Player Per Game Stats vs Boxscore Stats**: Investigate how often players perform at, above, or below their average stats. This could help identify players that are particularly consistent or inconsistent in their performances.
8. **Compare Player Boxscores vs Game Outcome**: Investigate which player stats have the most impact on the game outcome.
9. **Regression Analysis**: Predict an outcome based on one or more predictors. For instance, you could use linear regression to predict the number of points scored by a player based on their other stats (like assists, rebounds, etc.) or use logistic regression to predict the outcome of a game (win/lose) based on team stats.
10. **Clustering of Players**: Implement a clustering analysis to group players based on their performance metrics. This can reveal interesting patterns and groupings in your data.
11. **Bayesian Statistics**: Update the probability of a hypothesis as more evidence or information becomes available. For example, you can use a Bayesian approach to update the probability of a team winning a game based on new player stats.
12. **Time Series Analysis**: If your data is sequential over time, you could consider running time series analysis, such as identifying trends or seasonality in player performance or game outcomes.
13. **Multivariate Analysis**: Look at the relationship between more than two variables. For instance, how do player performance, team performance, and game location interact to affect game outcomes?
14. **Automate EDA process**: With your EDA process in place, consider how you can automate as much of it as possible. This will make it easier to incorporate new data as you continue your analysis.
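
As a starting point for item 7, here is a minimal sketch of how "at, above, or below average" could be tallied per player. The `player` and `pts` columns and the toy values are hypothetical stand-ins, not the schema of the box-score tables loaded later; the average here is taken over the sample itself rather than from the per-game stats tables.

```python
import pandas as pd

# Hypothetical box-score rows; the real tables may use different column names.
box = pd.DataFrame({
    "player": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "pts":    [20, 22, 18, 20, 5, 35, 10, 30],
})

# Each player's average over the games in this sample.
box["avg_pts"] = box.groupby("player")["pts"].transform("mean")

# Label every game relative to that player's average.
box["vs_avg"] = "at"
box.loc[box["pts"] > box["avg_pts"], "vs_avg"] = "above"
box.loc[box["pts"] < box["avg_pts"], "vs_avg"] = "below"

# Share of games in each bucket (rows sum to 1 per player).
shares = pd.crosstab(box["player"], box["vs_avg"], normalize="index")

# Per-player standard deviation as a simple consistency measure.
spread = box.groupby("player")["pts"].std()

print(shares)
print(spread)
```

Both toy players average 20 points, but player B has the larger standard deviation, which is the kind of inconsistency this question is meant to surface.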


## Imports and Global Settings

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from sqlalchemy import create_engine

sys.path.append('../')
from passkeys import RDS_ENDPOINT, RDS_PASSWORD

# Pandas Settings
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)
pd.options.display.max_info_columns = 1000
pd.options.display.precision = 5

# Graphing Settings
sns.set_theme()

## Data Inbound

### Database Connection

In [2]:
username = 'postgres'
password = RDS_PASSWORD
endpoint = RDS_ENDPOINT
database = 'nba_betting'
port = '5432'

# Create the connection string
connection_string = f'postgresql+psycopg2://{username}:{password}@{endpoint}:{port}/{database}'
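
One caveat with the f-string above: if the password contains characters such as `@`, `/`, or `:`, the URL can be parsed incorrectly. A minimal sketch of percent-encoding the credentials with the standard library follows. The helper name `build_connection_string` and the example values are hypothetical; SQLAlchemy's `URL.create` is an alternative that avoids manual escaping.

```python
from urllib.parse import quote


def build_connection_string(username, password, endpoint, port, database):
    """Build a PostgreSQL URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" forces every reserved character, including "/", to be encoded.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"postgresql+psycopg2://{user}:{pwd}@{endpoint}:{port}/{database}"


# Example values are made up for illustration only.
example = build_connection_string(
    "postgres", "p@ss/word", "example-host", "5432", "nba_betting"
)
print(example)
```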

### Database Queries

In [3]:
start_date = "2010-09-01" # September 1st, 2010 

# Game Results
nbastats_game_results_query = f"SELECT * FROM ibd_nba_stats_game_results WHERE game_date >= '{start_date}';"

# Betting Data
covers_query = f"SELECT * FROM covers WHERE date >= '{start_date}';"

# NBA Stats Player Boxscores
nbastats_player_boxscores_traditional_query = f"SELECT * FROM ibd_nba_stats_boxscores_traditional WHERE game_date >= '{start_date}';"
nbastats_player_boxscores_adv_traditional_query = f"SELECT * FROM ibd_nba_stats_boxscores_adv_traditional WHERE game_date >= '{start_date}';"
nbastats_player_boxscores_adv_advanced_query = f"SELECT * FROM ibd_nba_stats_boxscores_adv_advanced WHERE game_date >= '{start_date}';"

# NBA Stats Player Stats
nbastats_player_general_traditional_query = f"SELECT * FROM ibd_nba_stats_player_general_traditional WHERE to_date >= '{start_date}';"
nbastats_player_general_advanced_query = f"SELECT * FROM ibd_nba_stats_player_general_advanced WHERE to_date >= '{start_date}';"

# FiveThirtyEight Player Data
fivethirtyeight_player_query = f"SELECT * FROM ibd_fivethirtyeight_player WHERE to_date >= '{start_date}';"
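
The queries above interpolate `start_date` directly into the SQL text. That is fine for a hard-coded constant, but bound parameters are safer if the date ever comes from outside the notebook. The sketch below shows the idea with an in-memory SQLite database so it runs without the RDS instance; the `game_results` table and its rows are made up. With pandas, `read_sql_query` accepts a `params` argument for the same purpose, and the placeholder syntax depends on the driver (psycopg2 uses `%(name)s` rather than `?`).

```python
import sqlite3

# In-memory stand-in for the real database; table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game_results (game_id TEXT, game_date TEXT)")
conn.executemany(
    "INSERT INTO game_results VALUES (?, ?)",
    [("g1", "2009-11-01"), ("g2", "2010-10-26"), ("g3", "2015-01-15")],
)

start_date = "2010-09-01"

# The driver binds start_date to the "?" placeholder; no string formatting involved.
rows = conn.execute(
    "SELECT game_id FROM game_results WHERE game_date >= ? ORDER BY game_date",
    (start_date,),
).fetchall()
conn.close()

print(rows)
```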

### Base Tables

In [None]:
with create_engine(connection_string).connect() as connection:
    # Game Results
    nbastats_game_results = pd.read_sql_query(nbastats_game_results_query, connection, parse_dates=['game_date'])

    # Betting Data
    covers_betting_data = pd.read_sql_query(covers_query, connection, parse_dates=['date'])

    # NBA Stats Player Boxscores
    nbastats_player_boxscores_traditional = pd.read_sql_query(nbastats_player_boxscores_traditional_query,
                                                           connection, parse_dates=['game_date'])
    nbastats_player_boxscores_adv_traditional = pd.read_sql_query(nbastats_player_boxscores_adv_traditional_query,
                                                               connection, parse_dates=['game_date'])
    nbastats_player_boxscores_adv_advanced = pd.read_sql_query(nbastats_player_boxscores_adv_advanced_query,
                                                            connection, parse_dates=['game_date'])

    # NBA Stats Player Stats
    nbastats_player_general_traditional = pd.read_sql_query(nbastats_player_general_traditional_query,
                                                            connection, parse_dates=['to_date'])
    nbastats_player_general_advanced = pd.read_sql_query(nbastats_player_general_advanced_query,
                                                            connection, parse_dates=['to_date'])

    # FiveThirtyEight Player Data
    fivethirtyeight_player = pd.read_sql_query(fivethirtyeight_player_query,
                                            connection, parse_dates=['to_date'])

### Working Tables - **Restart From Here**

In [None]:
# Game Results
games_df = nbastats_game_results.copy()

# Betting Data
betting_df = covers_betting_data.copy()

# NBA Stats Player Boxscores
trad_box_df = nbastats_player_boxscores_traditional.copy()
tradadv_box_df = nbastats_player_boxscores_adv_traditional.copy()
adv_box_df = nbastats_player_boxscores_adv_advanced.copy()

# NBA Stats Player Stats
stats_trad_df = nbastats_player_general_traditional.copy()
stats_adv_df = nbastats_player_general_advanced.copy()

# FiveThirtyEight Player Data
five38_df = fivethirtyeight_player.copy()

In [None]:
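def print_table_info(df):
    """Print the first 10 rows and the full column summary of a DataFrame."""
    print("Head (first 10 rows):")
    print(df.head(10))

    # df.info() prints its report directly and returns None,
    # so wrapping it in print() would add a stray "None" line.
    print("\nInfo:")
    df.info(verbose=True, show_counts=True)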

In [None]:
print_table_info(games_df)
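
For item 1 of the TODO list (missing values), a per-column summary is often easier to scan than the `info()` output. The following is a minimal sketch; `missing_value_summary` is a hypothetical helper and the `demo` frame is made-up data, not rows from `games_df`.

```python
import pandas as pd


def missing_value_summary(df):
    """Return per-column missing counts and percentages, worst first."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "dtype": df.dtypes.astype(str),
    })
    return summary.sort_values("missing", ascending=False)


# Made-up scores with one missing value, for illustration only.
demo = pd.DataFrame({
    "home_score": [101, None, 99, 110],
    "away_score": [95, 102, 98, 120],
})
report = missing_value_summary(demo)
print(report)
```

The same call could then be applied to each working table, for example `missing_value_summary(games_df)`.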