# **Exploratory data analysis**

EDA represents various fixture-level college basketball (NCAA Basketball / which is added to the database (mydatabase.db)) statistics from the past 2 years. The objective is to build a model that is able to predict the winners for the upcoming fixtures.
The Raw data contains > FixtureKey,Team,X2PM,X2PA,X3PM,X3PA,FTM,FTA,ORB,DRB,AST,STL,BLK,TOV,PF 

The EDA itself is an exploratory section in the project, hence the simplistic regression model is adopted in the process. Not much effort is taken into evaluate the model or calibrate the model as the focus of the assessment is something else. The winner prediction model here extremely confident in its predictions, this can be due to overfitting or certain features heavily influencing the model's decision boundary. One of the main reason is the base score calcutaion itself is based on the variables that have low confidence. 

## Data preparation
Please refer to section 2.a, titled 'Data preprocessing,' in the README.MD for the code related to transferring data into the database

## Data Preprocessing
Please refer to section 2.b in the README.MD for the code where extensive data cleaning is performed using SQL queries instead of Python. In Python, we solely rely on SQL queries to retrieve the already cleaned data.

### Libraries/Main Query 
Performing extensive data cleaning within SQL queries focuses on data preprocessing. Specifically, it addresses the task of making the initial calculations to obtain the necessary variables to cacluate the base score of each team within the fixture.


In [1]:
#Libraries
import sqlite3
import pandas as pd
import os
import numpy as np
import matplotlib
import datetime
import calendar

In [2]:
con = sqlite3.connect('mydatabase.db')  #conneting to database 
df_box_score_query = """
WITH TeamAvTeamB AS (
    SELECT
        bs.FixtureKey,
        SUBSTR(bs.FixtureKey, 1, LENGTH(bs.FixtureKey) - 12) AS TeamAvTeamB
    FROM
        box_scores AS bs
)

-- Query 1 as a CTE
, Query1 AS (
    SELECT DISTINCT
        TRIM(CASE
                WHEN bs.Team = 1 THEN SUBSTR(TeamAvTeamB, 1, INSTR(TeamAvTeamB, 'v') - 1)
                WHEN bs.Team = 2 THEN SUBSTR(TeamAvTeamB, INSTR(TeamAvTeamB, 'v') + 1)
                ELSE NULL
            END) AS TeamName,  
        AVG(bs.ORB) AS ORBAvg,
        AVG(bs.DRB) AS DRBAvg,
        AVG(bs.STL) AS STLAvg,
        AVG(bs.BLK) AS BLKAvg,
        AVG(bs.PF) AS PFAvg
        
    FROM
        TeamAvTeamB AS ta
    JOIN box_scores AS bs ON ta.FixtureKey = bs.FixtureKey
    GROUP BY
        TeamName
)

-- Query 2 as a CTE
, Query2 AS (
    SELECT
        bs.FixtureKey,
        TRIM(CASE
                WHEN bs.Team = 1 THEN SUBSTR(TeamAvTeamB, 1, INSTR(TeamAvTeamB, 'v') - 1)
                WHEN bs.Team = 2 THEN SUBSTR(TeamAvTeamB, INSTR(TeamAvTeamB, 'v') + 1)
                ELSE NULL
            END) AS TeamName,
        TRIM(CASE
                WHEN bs.Team = 2 THEN SUBSTR(TeamAvTeamB, 1, INSTR(TeamAvTeamB, 'v') - 1)
                WHEN bs.Team = 1 THEN SUBSTR(TeamAvTeamB, INSTR(TeamAvTeamB, 'v') + 1)
                ELSE NULL
            END) AS Oppnent,
            CASE
              WHEN bs.Team = 1 THEN 'Yes'
              WHEN bs.TeaM = 2 THEN 'No'
              ELSE NULL
              END AS HomeTeamAdv,
        bs.Team, bs.X2PM, bs.X2PA, bs.X3PM, bs.X3PA, bs.FTM, bs.FTA, bs.ORB, bs.DRB, bs.AST, bs.STL, bs.BLK, bs.TOV,
        bs.PF,
        ROUND((CAST(bs.X2PM AS REAL) + CAST(bs.X3PM AS REAL)) / (CAST(bs.X2PA AS REAL) + CAST(bs.X3PA AS REAL))*100,2) AS 'FG%',
        ROUND((CAST(bs.X2PM AS REAL) / CAST(bs.X3PA AS REAL))*100,2)  AS '3P%',
        ROUND((CAST(bs.FTM AS REAL) / CAST(bs.FTA AS REAL))*100,2)  AS 'FT%',
        ROUND((CAST(bs.AST AS REAL) / CAST(bs.TOV AS REAL))*100,2)  AS 'ASTtoTOV%'
    FROM
        TeamAvTeamB AS ta
    JOIN box_scores AS bs ON ta.FixtureKey = bs.FixtureKey
)

-- CTE3  Query joining Query1 and Query2 using TeamName
, FinalCTE AS (SELECT DISTINCT
    Query2.FixtureKey, Query2.TeamName,Query2.Oppnent, Query2.HomeTeamAdv, Query2.Team,Query2.X2PM, Query2.X2PA, Query2.X3PM, Query2.X3PA,Query2.FTM, Query2.FTA,
    Query2.ORB,Query2.DRB,Query2.AST,Query2.STL,Query2.BLK,Query2.TOV, Query2.PF,Query2.'FG%', Query2.'3P%',Query2.'FT%',Query2.'ASTtoTOV%',
    ROUND(Query1.ORBAvg, 2) AS ORBAvg,
    ROUND(Query3.ORBAvg, 2 ) AS OppORBAvg,
    ROUND(Query1.DRBAvg, 2) AS DRBAvg,
    ROUND(Query3.DRBAvg, 2) AS OppDRBAvg,
    ROUND(Query1.STLAvg, 2) AS STLAvg,
    ROUND(Query1.BLKAvg, 2) AS BLKAvg,
    ROUND(Query1.PFAvg, 2) AS PFAvg,
    ROUND(Query3.PFAvg, 2) AS OppPFAvg
    
FROM
    Query1
JOIN Query2 ON Query1.TeamName = Query2.TeamName
JOIN Query1 AS Query3 ON Query2.Oppnent = Query3.TeamName

)

--Joining Final CTE from BOX score table with fixture_information

 SELECT FinalCTE.FixtureKey, FinalCTE.TeamName ,FinalCTE.Oppnent, FinalCTE.'FG%',FinalCTE.'3P%',
 FinalCTE.'FT%', FinalCTE.'ASTtoTOV%',
 (FinalCTE.ORBAvg-FinalCTE.OppORBAvg) AS ORD,
 (FinalCTE.DRBAvg-FinalCTE.OppDRBAvg) AS DRD,
 FinalCTE.STLAvg ,FinalCTE.BLKAvg,
 (FinalCTE.PFAvg-FinalCTE.OppPFAvg) AS DiffPFAvg, 
 FinalCTE.HomeTeamAdv,
 fixture_information.TipOff AS TipOff,             
 fixture_information.GameType AS GameType, 
 fixture_information.IsNeutralSite AS IsNeutralSite,
 fixture_information.Attendance AS Attendance,
 fixture_information.Season AS Season,
 fixture_information.Team1Conference AS Team1Conference,
 fixture_information.Team2Conference AS Team2Conference
 FROM
     FinalCTE
 JOIN fixture_information ON FinalCTE.FixtureKey = fixture_information.FixtureKey"""
query = "SELECT * FROM box_scores"
df_box_scores = pd.read_sql_query(df_box_score_query, con)
df = pd.read_sql_query(query,con)
df_box_scores

Unnamed: 0,FixtureKey,TeamName,Oppnent,FG%,3P%,FT%,ASTtoTOV%,ORD,DRD,STLAvg,BLKAvg,DiffPFAvg,HomeTeamAdv,TipOff,GameType,IsNeutralSite,Attendance,Season,Team1Conference,Team2Conference
0,LIPSCO v A PEAY 14-Jan-2023,A PEAY,LIPSCO,40.32,85.71,93.75,108.33,2.38,-5.55,6.11,2.60,4.13,No,17:00,RegularSeason,0,0,2023,A-Sun,A-Sun
1,LIPSCO v A PEAY 14-Jan-2023,LIPSCO,A PEAY,50.98,72.73,82.76,130.77,-2.38,5.55,4.82,2.47,-4.13,Yes,17:00,RegularSeason,0,0,2023,A-Sun,A-Sun
2,QUENNC v A PEAY 29-Dec-2022,A PEAY,QUENNC,45.16,111.11,86.67,180.00,-0.01,-2.71,6.11,2.60,0.14,No,18:00,RegularSeason,0,552,2023,A-Sun,A-Sun
3,QUENNC v A PEAY 29-Dec-2022,QUENNC,A PEAY,49.02,94.74,66.67,107.69,0.01,2.71,5.65,2.45,-0.14,Yes,18:00,RegularSeason,0,552,2023,A-Sun,A-Sun
4,FGCU v A PEAY 24-Feb-2023,A PEAY,FGCU,45.16,60.61,63.64,190.00,0.64,-3.74,6.11,2.60,2.14,No,19:00,RegularSeason,0,2291,2023,A-Sun,A-Sun
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13317,YSU v SIUE 20-Nov-2021,YSU,SIUE,47.92,105.88,65.22,50.00,-0.73,-0.51,6.68,2.53,-1.29,Yes,14:00,RegularSeason,0,1309,2022,Horizon,OVC
13318,YSU v OAK 09-Feb-2022,OAK,YSU,41.82,78.26,80.00,92.86,-1.02,-1.43,8.54,2.44,-3.23,No,19:00,RegularSeason,0,1580,2022,Horizon,Horizon
13319,YSU v OAK 09-Feb-2022,YSU,OAK,44.12,63.64,81.82,222.22,1.02,1.43,6.68,2.53,3.23,Yes,19:00,RegularSeason,0,1580,2022,Horizon,Horizon
13320,DET v YSU 12-Jan-2023,DET,YSU,47.76,113.04,75.00,240.00,0.50,-1.10,6.03,1.69,-1.68,Yes,19:00,RegularSeason,0,1333,2023,Horizon,Horizon


In [4]:
df_box_scores_2 = pd.DataFrame()
df_box_scores_2['TeamName'] = df['FixtureKey']
df_box_scores_2['Opponent'] = df['FixtureKey']

condlist = [df['Team'] == 1, df['Team'] == 2]
choicelist = [df_box_scores_2['TeamName'].apply(lambda x:x.split(" v ")[0]),df_box_scores_2['TeamName'].apply(lambda x:x.split(" v ")[1])]
df_box_scores_2['TeamName'] = np.select(condlist,choicelist)
df_box_scores_2['TeamName'] = df_box_scores_2['TeamName'].apply(lambda x:x.replace(x[-11:],"") if len(x) > 11
                                                        else x)

condlist1 = [df['Team'] == 2, df['Team'] == 1]
choicelist1 = [df_box_scores_2['Opponent'].apply(lambda x:x.split(" v ")[0]),df_box_scores_2['Opponent'].apply(lambda x:x.split(" v ")[1])]
df_box_scores_2['Opponent'] = np.select(condlist1,choicelist1)
df_box_scores_2['Opponent'] = df_box_scores_2['Opponent'].apply(lambda x:x.replace(x[-11:],"") if len(x) > 11
                                                        else x)

'''for i in np.arange(2,len(df.columns),1):
    df_box_scores_2[df.columns[i]] = df[df.columns[i]].tolist()'''

def figl(dataframe,dataframe2):
    dataframe2['FG%'] = round((dataframe['X2PM'] + dataframe['X3PM'])/(dataframe['X2PA'] + dataframe['X3PA'])*100,2)
    return dataframe2

def thpts(dataframe,dataframe2):
    dataframe2['3P%'] = round((dataframe['X3PM']/dataframe['X3PA'])*100,2)
    return dataframe2

def ftpts(dataframe,dataframe2):
    dataframe2['FT%'] = round((dataframe['FTM']/dataframe['FTA'])*100,2)
    return dataframe2

def ord(dataframe,dataframe2):
    dataframe2['ORD'] = round((dataframe['ORB'].groupby(dataframe2['TeamName']).transform('mean')) - (dataframe['ORB'].groupby(dataframe2['Opponent']).transform('mean')),2)
    return dataframe2 

def drd(dataframe,dataframe2):
    dataframe2['DRD'] = round((dataframe['DRB'].groupby(dataframe2['TeamName']).transform('mean')) - (dataframe['DRB'].groupby(dataframe2['Opponent']).transform('mean')),2)
    return dataframe2 

def stl(dataframe,dataframe2):
    dataframe2['STLAvg'] = round(dataframe['STL'].groupby(dataframe2['TeamName']).transform('mean'),2)
    return dataframe2

def blk(dataframe,dataframe2):
    dataframe2['BLKAvg'] = round(dataframe['BLK'].groupby(dataframe2['TeamName']).transform('mean'),2)
    return dataframe2

def foul(dataframe,dataframe2):
    dataframe2['PF%'] = round((dataframe['PF'].groupby(dataframe2['TeamName']).transform('mean')) - (dataframe['PF'].groupby(dataframe2['Opponent']).transform('mean')),2)
    return dataframe2

def findDay(date):
    born = datetime.datetime.strptime(date, '%d-%b-%Y').weekday()
    return (calendar.day_name[born])

In [5]:
df_box_scores_2 = figl(df,df_box_scores_2)
df_box_scores_2 = thpts(df,df_box_scores_2)
df_box_scores_2 = ftpts(df,df_box_scores_2)
df_box_scores_2 = ord(df, df_box_scores_2)
df_box_scores_2 = drd(df, df_box_scores_2)
df_box_scores_2 = stl(df, df_box_scores_2)
df_box_scores_2 = blk(df, df_box_scores_2)
df_box_scores_2 = foul(df, df_box_scores_2)
df_box_scores_2

Unnamed: 0,TeamName,Opponent,FG%,3P%,FT%,ORD,DRD,STLAvg,BLKAvg,PF%
0,A PEAY,LIPSCO,40.32,33.33,93.75,0.12,-3.79,5.52,2.52,1.85
1,LIPSCO,A PEAY,50.98,45.45,82.76,-1.06,4.21,4.26,3.21,-4.01
2,A PEAY,QUENNC,45.16,44.44,86.67,2.71,-3.08,5.52,2.52,-1.42
3,QUENNC,A PEAY,49.02,36.84,66.67,0.85,-0.38,5.62,2.00,0.97
4,A PEAY,FGCU,45.16,24.24,63.64,1.33,-3.95,5.52,2.52,1.91
...,...,...,...,...,...,...,...,...,...,...
13317,YSU,SIUE,47.92,29.41,65.22,0.32,-0.39,6.61,2.78,-2.44
13318,OAK,YSU,41.82,21.74,80.00,0.23,-0.18,7.35,1.83,-2.21
13319,YSU,OAK,44.12,27.27,81.82,-1.73,-1.58,6.61,2.78,1.38
13320,DET,YSU,47.76,26.09,75.00,2.15,1.01,6.12,2.00,-1.26


In [3]:
con.close()  #close the connnection

Alternate work

In [4]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df_box_score = pd.read_sql_query(df_box_score_query, con)
df_box_score.info()

In [None]:
%matplotlib inline

### Data Preprocessing (dplyr)
Further the table is cleaned using dplyr. Task of splitting the columns in the data obtained from the initial data preparation is done here, which are initially in the format TeamA_v_TeamB Date' into separate team names. This separation allows for more efficient and structured analysis of each team's aggregate statistics

In [None]:
class BoxScoreCalculator:
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def extract_date_and_day(self):
        self.df['Date'] = self.df['FixtureKey'].str.split(" ").apply(lambda x: x[-1])
        self.df['Day'] = pd.to_datetime(self.df['Date'], format='%d-%b-%Y').dt.day_name()
        self.df = self.df.drop('Date', axis='columns')

    def manipulate_hometeam_adv(self):
        conditions = [
            (self.df['HomeTeamAdv'] == "Yes") & (self.df['IsNeutralSite'] == 0),
            (self.df['HomeTeamAdv'] == "No") & (self.df['IsNeutralSite'] == 1),
            (self.df['HomeTeamAdv'] == "Yes") & (self.df['IsNeutralSite'] == 1)
        ]
        choices = ['Yes', 'No', 'No']
        self.df['HomeTeamAdv'] = pd.np.select(conditions, choices, default='No')
        self.df = self.df.drop('IsNeutralSite', axis=1)
        self.df['HomeTeamAdv'] = self.df['HomeTeamAdv'].apply(lambda x: 0.05 if x == "Yes" else 0)

    def calculate_base_score(self):
        self.df['Base_score'] = (self.df['FG%']*0.3 + self.df['3P%']*0.2 + self.df['FT%']*0.1 + 
                                 self.df['ASTtoTOV%']*0.2 + self.df['ORD']*0.05 + self.df['DRD']*0.05 + 
                                 self.df['STLAvg']*0.03 + self.df['BLKAvg']*0.02 + self.df['DiffPFAvg']*0.03 + 
                                 self.df['HomeTeamAdv'])

    def adjust_for_game_type(self):
        game_type_conditions = [
            self.df["GameType"] == "RegularSeason",
            self.df["GameType"] == "ConferenceChampionship",
            self.df["GameType"] == "NIT"
        ]
        game_type_choices = [
            self.df['Base_score']*1,
            self.df['Base_score']*1.1,
            self.df['Base_score']*1.2
        ]
        self.df['Base_score'] = pd.np.select(game_type_conditions, game_type_choices, default=self.df['Base_score']*1)

    def adjust_for_tipoff_time(self):
        self.df['Time_multiplier'] = self.df['TipOff'].str.split(":").apply(lambda x: x[0]).astype(int)
        time_conditions = [
            (self.df["Time_multiplier"] >=6) & (self.df["Time_multiplier"] < 12),
            (self.df["Time_multiplier"] >=12) & (self.df["Time_multiplier"] < 17),
            (self.df["Time_multiplier"] >=17) & (self.df["Time_multiplier"] < 21)
        ]
        time_choices = [
            self.df['Base_score']*0.98,
            self.df['Base_score']*1,
            self.df['Base_score']*1.02
        ]
        self.df['Base_score'] = pd.np.select(time_conditions, time_choices, default=self.df['Base_score']*1.01)

    def adjust_for_day(self):
        day_conditions = [
            self.df["Day"] == day for day in ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
        ]
        day_choices = [
            self.df['Base_score']*1,
            self.df['Base_score']*1,
            self.df['Base_score']*1,
            self.df['Base_score']*1,
            self.df['Base_score']*1.01,
            self.df['Base_score']*1.02,
            self.df['Base_score']*1.01
        ]
        self.df['Base_score'] = pd.np.select(day_conditions, day_choices, default=self.df['Base_score']*1)

    def adjust_for_attendance(self):
        max_attendance = self.df['Attendance'].max()
        self.df['Base_score'] = self.df.apply(lambda row: row['Base_score'] if row['Attendance'] == 0 else
                                    row['Base_score'] + 0.05 * (row['Attendance'] / max_attendance), axis=1)

    def get_score(self):
        self.extract_date_and_day()
        self.manipulate_hometeam_adv()
        self.calculate_base_score()
        self.adjust_for_game_type()
        self.adjust_for_tipoff_time()
        self.adjust_for_day()
        self.adjust_for_attendance()
        return self.df[['FixtureKey', 'TeamName', 'Oppnent', 'Base_score']]

In [21]:
def fiter_box_scoreFun(df: pd.DataFrame) -> pd.DataFrame:
    # Extracting Date and Day
    df['Date'] = df['FixtureKey'].str.split(" ").apply(lambda x: x[-1])
    df['Day'] = pd.to_datetime(df['Date'], format='%d-%b-%Y').dt.day_name()
    df = df.drop('Date', axis='columns')
    
    # HomeTeamAdv Manipulation
    conditions = [
        (df['HomeTeamAdv'] == "Yes") & (df['IsNeutralSite'] == 0),
        (df['HomeTeamAdv'] == "No") & (df['IsNeutralSite'] == 1),
        (df['HomeTeamAdv'] == "Yes") & (df['IsNeutralSite'] == 1),
    ]
    choices = ['Yes', 'No', 'No']
    df['HomeTeamAdv'] = np.select(conditions, choices, default='No')
    df = df.drop('IsNeutralSite', axis=1)
    df['HomeTeamAdv'] = df['HomeTeamAdv'].apply(lambda x: 0.05 if x == "Yes" else 0)
    
    # Base_score Calculation
    df['Base_score'] = (df['FG%']*0.3 + df['3P%']*0.2 + df['FT%']*0.1 + df['ASTtoTOV%']*0.2 + df['ORD']*0.05 + 
                        df['DRD']*0.05 + df['STLAvg']*0.03 + df['BLKAvg']*0.02 + df['DiffPFAvg']*0.03 + 
                        df['HomeTeamAdv'])
    
    # GameType factor
    game_type_conditions = [
        df["GameType"] == "RegularSeason",
        df["GameType"] == "ConferenceChampionship",
        df["GameType"] == "NIT"
    ]
    game_type_choices = [
        df['Base_score']*1,
        df['Base_score']*1.1,
        df['Base_score']*1.2
    ]
    df['Base_score'] = np.select(game_type_conditions, game_type_choices, default=df['Base_score']*1)
    
    # Time_multiplier and its factor
    df['Time_multiplier'] = df['TipOff'].str.split(":").apply(lambda x: x[0]).astype(int)
    time_conditions = [
        (df["Time_multiplier"] >=6) & (df["Time_multiplier"] < 12),
        (df["Time_multiplier"] >=12) & (df["Time_multiplier"] < 17),
        (df["Time_multiplier"] >=17) & (df["Time_multiplier"] < 21)
    ]
    time_choices = [
        df['Base_score']*0.98,
        df['Base_score']*1,
        df['Base_score']*1.02
    ]
    df['Base_score'] = np.select(time_conditions, time_choices, default=df['Base_score']*1.01)
    
    # Day factor
    day_conditions = [
        df["Day"] == day for day in ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    ]
    day_choices = [
        df['Base_score']*1,
        df['Base_score']*1,
        df['Base_score']*1,
        df['Base_score']*1,
        df['Base_score']*1.01,
        df['Base_score']*1.02,
        df['Base_score']*1.01
    ]
    df['Base_score'] = np.select(day_conditions, day_choices, default=df['Base_score']*1)
    
    # Attendance factor
    max_attendance = df['Attendance'].max()
    df['Base_score'] = df.apply(lambda row: row['Base_score'] if row['Attendance'] == 0 else 
                                row['Base_score'] + 0.05 * (row['Attendance'] / max_attendance), axis=1)
    
    return df[['FixtureKey', 'TeamName', 'Oppnent', 'Base_score']]



In [22]:
warnings.filterwarnings('ignore')
dfBoxScore = fiter_box_scoreFun(df_box_score)

In [23]:
warnings.filterwarnings('ignore')
dfBoxScoresHome = fiter_box_scoreFun(df_box_score[df_box_score['HomeTeamAdv']=="Yes"])
dfBoxScoresHome

Unnamed: 0,FixtureKey,TeamName,Oppnent,Base_score
1,LIPSCO v A PEAY 14-Jan-2023,LIPSCO,A PEAY,67.156363
3,QUENNC v A PEAY 29-Dec-2022,QUENNC,A PEAY,63.504884
5,FGCU v A PEAY 24-Feb-2023,FGCU,A PEAY,72.461076
6,A PEAY v NO FLA 18-Feb-2023,A PEAY,NO FLA,83.014326
8,A PEAY v JVILLE 16-Feb-2023,A PEAY,JVILLE,39.893686
...,...,...,...,...
13312,GRNBAY v YSU 19-Jan-2023,GRNBAY,YSU,54.798674
13315,YSU v NIAGRA 21-Nov-2021,YSU,NIAGRA,43.554265
13317,YSU v SIUE 20-Nov-2021,YSU,SIUE,53.320729
13319,YSU v OAK 09-Feb-2022,YSU,OAK,80.693746


In [24]:
warnings.filterwarnings('ignore')
dfBoxScoresAway = fiter_box_scoreFun(df_box_score[df_box_score['HomeTeamAdv']=="No"])
dfBoxScoresAway

Unnamed: 0,FixtureKey,TeamName,Oppnent,Base_score
0,LIPSCO v A PEAY 14-Jan-2023,A PEAY,LIPSCO,62.923080
2,QUENNC v A PEAY 29-Dec-2022,A PEAY,QUENNC,82.151708
4,FGCU v A PEAY 24-Feb-2023,A PEAY,FGCU,72.299541
7,A PEAY v NO FLA 18-Feb-2023,NO FLA,A PEAY,43.276452
9,A PEAY v JVILLE 16-Feb-2023,JVILLE,A PEAY,58.834066
...,...,...,...,...
13313,GRNBAY v YSU 19-Jan-2023,YSU,GRNBAY,81.901196
13314,YSU v NIAGRA 21-Nov-2021,NIAGRA,YSU,49.272885
13316,YSU v SIUE 20-Nov-2021,SIUE,YSU,53.366527
13318,YSU v OAK 09-Feb-2022,OAK,YSU,55.953850


In [25]:
dfBoxScoresHome = fiter_box_scoreFun(df_box_score[df_box_score['HomeTeamAdv']=="Yes"])
dfBoxScoresAway = fiter_box_scoreFun(df_box_score[df_box_score['HomeTeamAdv']=="No"])
dfBoxScoresHome.rename(columns={'TeamName':"Home", 'Oppnent':'Away', 'Base_score':'Home_score'}, inplace=True)
dfBoxScoresAway.rename(columns={'TeamName':"Away", 'Oppnent':'Home', 'Base_score':'Away_score'}, inplace=True)
Final_score = pd.merge(dfBoxScoresHome, dfBoxScoresAway, how='inner', on=['FixtureKey','Home','Away' ])
Final_score[['Home_score', 'Away_score']] = Final_score[['Home_score', 'Away_score']].round(2)
Final_score

Unnamed: 0,FixtureKey,Home,Away,Home_score,Away_score
0,LIPSCO v A PEAY 14-Jan-2023,LIPSCO,A PEAY,67.16,62.92
1,QUENNC v A PEAY 29-Dec-2022,QUENNC,A PEAY,63.50,82.15
2,FGCU v A PEAY 24-Feb-2023,FGCU,A PEAY,72.46,72.30
3,A PEAY v NO FLA 18-Feb-2023,A PEAY,NO FLA,83.01,43.28
4,A PEAY v JVILLE 16-Feb-2023,A PEAY,JVILLE,39.89,58.83
...,...,...,...,...,...
6656,GRNBAY v YSU 19-Jan-2023,GRNBAY,YSU,54.80,81.90
6657,YSU v NIAGRA 21-Nov-2021,YSU,NIAGRA,43.55,49.27
6658,YSU v SIUE 20-Nov-2021,YSU,SIUE,53.32,53.37
6659,YSU v OAK 09-Feb-2022,YSU,OAK,80.69,55.95


In [26]:
dfboxscoresMean = pd.concat([dfBoxScoresHome.drop(['Away', 'FixtureKey'], axis=1).rename(columns={'Home':'Team', 'Home_score':'Score'}).round(2),dfBoxScoresAway.drop(['Home', 'FixtureKey'], axis=1).rename(columns={'Away':'Team', 'Away_score':'Score'}).round(2)]).groupby('Team').mean()
dfboxscoresMedian = pd.concat([dfBoxScoresHome.drop(['Away', 'FixtureKey'], axis=1).rename(columns={'Home':'Team', 'Home_score':'Score'}).round(2),dfBoxScoresAway.drop(['Home', 'FixtureKey'], axis=1).rename(columns={'Away':'Team', 'Away_score':'Score'}).round(2)]).groupby('Team').median()
dfboxscoresMean

Unnamed: 0_level_0,Score
Team,Unnamed: 1_level_1
A PEAY,59.014000
ABILCH,69.003750
AIRFOR,58.170278
AKRON,57.664000
AL A&M,60.488889
...,...
WVU,63.592941
WYO,60.217368
XAVIER,76.215789
YALE,67.514706



####Acronyms</br>
X2PM = 2 pointer made, X2PA=2 pointer attemped, X3PM = 3 pointer made, X3PA = 2 pointer attempted,FTM = free throws made,
FTA = free throws attempted, ORB = offensive rebounds, DRB = deffensive rebounds, AST = assist, STL= steal, BLK = block, TOV = turnovers, PF = personal fouls
FG% = ratio of successful field goals (2PM + 3PM) to total field goal attempts (2PA + 3PA), </br>
3P% = ratio of successful three-pointers (3PM) to total three-point attempts (3PA),</br>
FT% = ratio of successful free throws (FTM) to total free throw attempts (FTA),</br>
ASTtoTOV%  = This ratio measures a team's ball-handling efficiency. It's the ratio of assists (AST) to turnovers (TOV). Assist-to-Turnover Ratio (AST/TOV),</br>
ORD = difference between the average offensive rebounds (ORB) a team secures and the average offensive rebounds their opponents secure,</br>
DRD =  the difference between the average defensive rebounds (DRB) a team secures and the average defensive rebounds their opponents secure,</br>
STLAvg = SteaL average of a team , BLKAvg = Block average of a team,</br>
DiffPFAvg = the average difference in the number of fouls committed (PF) between a team and its opponents,</br>
HomeTeamAdv = Home Court Advantage
