# DA BALLERS - Predicting The Results of NFL Games

#### Andree Makahinda (UNI: abm2203)
#### Harshal Popat (UNI: hsp2120)
#### Adhya Rajesh (UNI: ar4279)
#### Varun Jadcherla (UNI: vrj2110)
#### Eric Loconto (UNI: el3152)

## 1. Introduction to sportsipy

#### Installing open source modules (please activate any modules that are needed to be installed)

In [None]:
!pip install sportsipy
# !pip install matplotlib
# !pip install pandas
# !pip install numpy
# !pip install datetime
# !pip install seaborn
# !pip install geojsonio --upgrade
# !pip install folium --upgrade
# !pip install ipython
# !pip install branca
# !pip install scipy
# !pip install --user decorator==4.3.0
# !pip install networkx
# !pip install scikit-learn
# !pip install xgboost
# !pip install tensorflow
# !pip install yellowbrick

# !pip install nltk
# import nltk
# nltk.download('all-corpora')

# !pip install requests
# !pip install tweepy
# !pip install gensim==3.8.3

#### Importing some of the necessary packages

In [1]:
# sportsipy is an open source library to extract the game statistics
from sportsipy.nfl.boxscore import Boxscores as game_info
from sportsipy.nfl.boxscore import Boxscore as game_stats_info

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime           
%matplotlib inline

In [3]:
# To hide all the warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Increase resolutions of all the graphs in the notebook
# plt.rcParams['figure.dpi'] = 80

#### Game information and stats methods from sportsipy
1. game_info(week_number, year).games['string'] takes argument as string as 'week_number-year' format and returns a json object (having all games in that week of that year)
2. We then pull out game_id from the json object
3. game_stats_info() takes game_id (here game_str) as an argument and returns the game information

In [15]:
game_str = game_info(5,2022).games['5-2022'][15]['boxscore']
one_game_stats = game_stats_info(game_str)
one_game_stats.dataframe

Unnamed: 0,attendance,away_first_downs,away_fourth_down_attempts,away_fourth_down_conversions,away_fumbles,away_fumbles_lost,away_interceptions,away_net_pass_yards,away_pass_attempts,away_pass_completions,...,roof,stadium,surface,time,vegas_line,weather,winner,winning_abbr,winning_name,won_toss
202210100kan,,18,2,1,1,0,0,223,30,19,...,Outdoors,GEHA Field at Arrowhead Stadium,Grass,8:15pm,Kansas City Chiefs -7.0,,Home,KAN,Kansas City Chiefs,Chiefs (deferred)


## 2. Data Gathering and Exploration

### 2.1. Extracting the schedule of past 11 years (2010 - 2021)

Building a function using the open source sportsipy package that loops through each week and each game inside each week

In [11]:
def schedule(year):
    weeks_list = list(range(1,19))
    schedule_df = pd.DataFrame()
    for w in range(len(weeks_list)):
        date = '{}-{}'.format(weeks_list[w], year)
        w_scores = game_info(weeks_list[w],year)
        w_games_df = pd.DataFrame()
        for g in range(len(w_scores.games[date])):
            game_df = pd.DataFrame(w_scores.games[date][g], index = [0])[['away_name',
                                                                          'away_abbr',
                                                                          'home_name',
                                                                          'home_abbr',
                                                                          'winning_name',
                                                                          'winning_abbr']]
            game_df['week'] = weeks_list[w]
            w_games_df = pd.concat([w_games_df,game_df])
        schedule_df = pd.concat([schedule_df, w_games_df]).reset_index().drop(columns = 'index') 
    return schedule_df

In [2]:
schedule(2022)

NameError: name 'schedule' is not defined

1. Importing the schedule from the year 2010 to 2020 and exporting it to a csv file (so that we don't need to extract the data from the open source, hence increasing the speed)
2. Changing names of the teams that have changed their name in the last 10 years to maintain homogeneity 
3. Concating the schedule of current year (2021) to the schedule from 2010 to 2020

In [13]:
# Setting up the current year variable and passing the number of weeks to a list
current_y = 2022
weeks_list = list(range(1,19))

In [None]:
full_schedule = pd.DataFrame()
for n in range(2019, 2022):
    schedule_ = schedule(n)
    schedule_['year'] = n
    full_schedule = pd.concat([full_schedule, schedule_])
full_schedule.reset_index(drop = True, inplace = True)

In [10]:
full_schedule = full_schedule.replace('Oakland Raiders', 'Las Vegas Raiders')
full_schedule = full_schedule.replace('San Diego Chargers', 'Los Angeles Chargers')
full_schedule = full_schedule.replace('St. Louis Rams', 'Los Angeles Rams')
full_schedule = full_schedule.replace('Washington Football Team', 'Washington Commanders')

In [8]:
full_schedule = pd.read_csv('full_schedule3.csv')
full_schedule = full_schedule.loc[:, ~full_schedule.columns.str.contains('^Unnamed')]

In [14]:
full_schedule_1 = schedule(current_y)
full_schedule_1['year'] = current_y
full_schedule = pd.concat([full_schedule, full_schedule_1])
full_schedule.reset_index(drop = True, inplace = True)

In [9]:
full_schedule = full_schedule.loc[full_schedule['year']!=2022]

In [None]:
full_schedule

### 2.2. Setting up the current week
1. Current week (current_w) is the most important argument (parameter) in the functions
2. We should make sure that the current_w really matches the real current week of the NFL season while also considering whether or not the library has updated its database
3. We have run our code and built our model in the week 14 of 2021 NFL season (Dec 12, 09.00 AM)
4. In other words, all the visualizations, analysis, model accuracy, model selection, presentation, and report all have been produced from running the code on that time 
5. Our model is an adapting model, meaning that it will adapt and take into accounts the new available data (the new games that just been played during the current week)
6. Please note that the results may be changed and different from our submission (presentation and report) if the code is being run again after our submission time since the current week and data may be updated by the library developer
7. The results may also be changed even when the current week is still the same if there are new game data that are being updated in the database

In [None]:
# Using full_schedule method to determine the correct current week as an input
def determine_current_week(full_schedule, current_y):
    current_w = int()
    completed_games_df = full_schedule[(full_schedule['winning_name'].notna()) &
                                       (full_schedule['year'] == current_y)]
    
    if len(completed_games_df) == 0:
        current_w = 1
    else:
        latest_completed_w = completed_games_df['week'].values.max()
        scheduled_games_df = full_schedule[(full_schedule['winning_name'].isnull()) &
                                           (full_schedule['year'] == current_y) &
                                           (full_schedule['week'] >= latest_completed_w)]
        earliest_scheduled_w = scheduled_games_df['week'].values.min()
        current_w = earliest_scheduled_w
    
    print('The current week is: ', current_w)
    
    return current_w

In [None]:
# Setting up the current week
current_w = determine_current_week(full_schedule, current_y)
print('The current week value being assigned: ', current_w)

### 2.3. Full schedule dataframe exploration

In [None]:
full_schedule.head()

In [None]:
full_schedule.tail()

In [None]:
full_schedule.info()

In [None]:
full_schedule_null = full_schedule[full_schedule['winning_name'].isnull()]
len(full_schedule_null)

1. We can see that there are two columns that contain null values: winning_name and winning_abbr
2. There are 87 null values
3. The null values consists of NaN and None
4. NaN = The result of the game is a draw -> to be handled (excluded) later
5. None = The match (or game) has not completed yet, only scheduled

In [None]:
# We can find which team recorded the most wins from 2010 up until now
def plot_team_rank_based_on_winning_count(full_schedule):
    winner_group = full_schedule.groupby('winning_name')
    from matplotlib import cm
    plt.rcParams["figure.dpi"] = 80
    colors = cm.inferno_r(np.linspace(.2, .8, 32))
    winner_group.size().sort_values().plot(kind='barh', figsize=(6,8), xlabel='Team Name',
        title='Team Ranking Based on Total Number of Games Won 2010-2021', color=colors)

plot_team_rank_based_on_winning_count(full_schedule)

1. Team with the most wins is New England Patriots, followed by Green Bay Packers and Pittsburgh Steelers
2. On the other hand, team with the least wins is Jacksonville Jaguars, followed by Cleveland Browns and New York Jets
3. We can create the same chart but only for a specific year (cross-section analysis)
4. For example, in 2020, the team with the most wins is Kansas City Chiefs, followed by Buffalo Bills and Green Bay Packers

In [None]:
# Find the team with the most wins in a given year
def plot_team_rank_for_specific_year(full_schedule, year):
    full_schedule_filtered = full_schedule[full_schedule['year'] == year]
    winner_group = full_schedule_filtered.groupby('winning_name')
    from matplotlib import cm
    plt.rcParams["figure.dpi"] = 80
    colors = cm.inferno_r(np.linspace(.2, .8, 32))
    winner_group.size().sort_values().plot(kind='barh', figsize=(6,8), x='Games Won', xlabel='Team Name',
        title=f'Team Ranking Based on Total Number of Games Won in {year}', color=colors)

plot_team_rank_for_specific_year(full_schedule, 2020)

In [None]:
# We can also find out how each team performs each year (time-series)
def plot_team_performance(full_schedule, team_names):
    team_performance_df = full_schedule.groupby(['winning_name','year']).size().unstack().T
    team_performance_df = team_performance_df.fillna(0)
    team_performance_df.columns.name = 'Team'
    plt.rcParams["figure.dpi"] = 80
    team_performance_df.loc[:,team_names].plot(figsize=(10,5), grid=False, xlabel='Year',
        ylabel='Games Won', xticks = team_performance_df.index, colormap='coolwarm',
        lw=4, title='Number of Games Won by Each Team Annually')

team_names = ['New England Patriots']
plot_team_performance(full_schedule, team_names)

1. As seen above, even though New England Patriots recorded the most wins in the last 10 years, its performance was not that good last year
2. Using the same function, we can compare the trend for several teams in one graph like the one below

In [None]:
team_names = ['Arizona Cardinals','Detroit Lions','Miami Dolphins','New York Jets','Atlanta Falcons']
plot_team_performance(full_schedule, team_names)

1. Another time-series analysis we can do with the full schedule dataframe is comparing the number of wins between away and home team
2. From the graph below, we can see that home team scored more wins from 2010 to 2019
3. In 2020, however, away team managed to record more wins
4. The season is still on-going for the 2021 NFL but away team is currently leading the metric

In [None]:
# We can check how many times a home team wins vs an away team wins each year
def plot_home_vs_away(full_schedule):
    home_team_wins_df = full_schedule[full_schedule['winning_name']==full_schedule['home_name']]
    home_team_wins_df = pd.DataFrame(home_team_wins_df.groupby('year').size())
    away_team_wins_df = full_schedule[full_schedule['winning_name']==full_schedule['away_name']]
    away_team_wins_df = pd.DataFrame(away_team_wins_df.groupby('year').size())
    combined_df = pd.merge(home_team_wins_df, away_team_wins_df, left_index=True, right_index=True)
    combined_df = combined_df.rename(columns = {'0_x':'Home','0_y':'Away'})
    plt.rcParams["figure.dpi"] = 80
    combined_df.plot(kind ='bar',figsize=(8,4), grid=False, xlabel='Year', ylabel='Games Won',
                    title='Total Number of Games Won by Home Team vs. Away Team',
                    colormap='coolwarm')

plot_home_vs_away(full_schedule)

### 2.4. Extracting the game data (stats) of past 11 years (2010 - 2021)

#### Column manipulation functions

Creating the supporting functions that will help the main functions for ease of processing

In [16]:
# 'column_name_manipulation' function changes the feature names (it removes home/away from the start of the feature name)
def column_name_manipulation(obj):
    if type(obj) is not list:
        columns =  list(obj.columns)
    else:
        columns =  obj
    new_columns_list = []

    for column_name in columns:
        if 'away' in column_name:
            column_name = column_name.split('_')
            column_name.remove('away')
            column_name = column_name = '_'.join(column_name)
            new_columns_list.append(column_name)
            
        elif 'home' in column_name:
            column_name = column_name.split('_')
            column_name.remove('home')
            column_name = column_name = '_'.join(column_name)
            new_columns_list.append(column_name)
            
        else:
            new_columns_list.append(column_name)
    if type(obj) is not list:
        obj.columns = new_columns_list
    else:       
        obj = new_columns_list
        
    return obj

In [17]:
# 'column_name_manipulation' function changes the feature names (it adds home/away from the start of the feature name)
def column_name_manipulation_reverse(obj, type_ = 'home'):
    if type(obj) is not list:
        columns =  list(obj.columns)
    else:
        columns =  obj
        
    new_columns_list = []

    for column_name in columns:
        column_name = type_ + '_' + column_name 
        new_columns_list.append(column_name)
            
    if type(obj) is not list:
        obj.columns = new_columns_list
    else:       
        obj = new_columns_list
        
    return obj

1. Dropping the redundant columns
2. Listing the required columns

In [18]:
columns_required = ['first_downs', 'fourth_down_attempts', 'fourth_down_conversions', 
                    'fumbles', 'fumbles_lost', 'interceptions','net_pass_yards',
                    'pass_attempts','pass_completions','pass_touchdowns','pass_yards',
                    'penalties','points','rush_attempts','rush_touchdowns','rush_yards',
                    'third_down_attempts','third_down_conversions','time_of_possession',
                    'times_sacked','total_yards','turnovers','yards_from_penalties','yards_lost_from_sacks']

#### Function to extract each game info

1. 'g_data' function takes 2 arguments as inputs, schedule and game info
2. It returns 2 dataframes, home_df and away_df
3. home_df and away_df contain statistics of home team and away team for that particular game
4. It also adds 2 more columns to the dataframes, game_won and game_lost, which are binary. If score of away team is greater than home team, that means away team won and hence away_df would have game_won flagged as 1 and game_lost flagged as 0. Same for home team.
5. The prefixes home/away are removed from each of the dataframe features
6. We do (5) to build past records of each team for each game and use it for modelling and analyses
7. The function also transforms the value of time_of_possession from a time duration format (%M:%S) to integer duration (expressed in seconds)

In [19]:
def g_data(g_df,one_game_stats):

    columns_required_home = column_name_manipulation_reverse(columns_required, 'home')
    columns_required_away = column_name_manipulation_reverse(columns_required, 'away')
    
    try:
        a_team_df = column_name_manipulation(g_df[['away_name', 'away_abbr', 'away_score']]).rename(columns = {
            'name' : 'team_name', 'abbr': 'team_abbr'})
        h_team_df = column_name_manipulation(g_df[['home_name','home_abbr', 'home_score']]).rename(columns = {
            'name' : 'team_name', 'abbr': 'team_abbr'})

        try:
            if g_df.loc[0,'away_score'] != g_df.loc[0,'home_score']:
                a_team_df['game_won'] = int(g_df.loc[0,'away_score'] > g_df.loc[0,'home_score'])
                a_team_df['game_lost'] = 1- a_team_df['game_won']
                h_team_df['game_won'] = int(g_df.loc[0,'away_score'] < g_df.loc[0,'home_score'])
                h_team_df['game_lost'] = 1- h_team_df['game_won']

            else:
                a_team_df['game_won'] = a_team_df['game_lost'] = h_team_df['game_won'] = h_team_df['game_lost'] = 0

        except TypeError:
                a_team_df['game_won'] = a_team_df['game_lost'] = h_team_df['game_won'] = h_team_df['game_lost'] = np.nan
                
        a_stats_df = one_game_stats.dataframe[columns_required_away].reset_index().drop(columns ='index')
        a_stats_df = column_name_manipulation(a_stats_df)
        
        h_stats_df = one_game_stats.dataframe[columns_required_home].reset_index().drop(columns = 'index')
        h_stats_df = column_name_manipulation(h_stats_df)

        a_team_df = pd.merge(a_team_df, a_stats_df,left_index = True, right_index = True)
        h_team_df = pd.merge(h_team_df, h_stats_df,left_index = True, right_index = True)
        
        try:
            time_a_team = datetime.datetime.strptime(a_team_df['time_of_possession'][0],'%M:%S')
            time_h_team = datetime.datetime.strptime(h_team_df['time_of_possession'][0],'%M:%S')
            a_team_df['time_of_possession'] = int(time_a_team.minute* 60) + int(time_a_team.second)
            h_team_df['time_of_possession'] = int(time_h_team.minute* 60) + int(time_h_team.second)
        except TypeError:
            a_team_df['time_of_possession'] = np.nan
            h_team_df['time_of_possession'] = np.nan
            
    except TypeError:
        a_team_df = pd.DataFrame()
        h_team_df = pd.DataFrame()
    return a_team_df, h_team_df

In [28]:
week_scores = game_info(2,2020)
game_str = week_scores.games['2-2020'][0]['boxscore']
one_game_stats = game_stats_info(game_str)
g_df = pd.DataFrame(week_scores.games['2-2020'][0], index = [0])
g_data(g_df,one_game_stats)[0]

NameError: name 'column_name_manipulation_reverse' is not defined

#### Apply g_data function in for loop to gather all the game statistics given number of weeks and year number

1. g_data_till_week function takes 2 arguments : weeks_list and year (weeks_list is the list of all the weeks for which we want data) 
2. It returns team statistics for each game in that particular week for that particular year
3. In our case, we take data for all the weeks in a season 

In [34]:
def g_data_till_week(weeks_list, year):
    weeksgames_df = pd.DataFrame()
    for w in range(len(weeks_list)):
        date = '{}-{}'.format(weeks_list[w], year)
        w_scores = game_info(weeks_list[w],year)
        w_games_df = pd.DataFrame()
        for g in range(len(w_scores.games[date])):
            game_string = w_scores.games[date][g]['boxscore']
            one_game_stats = game_stats_info(game_string)
            g_df = pd.DataFrame(w_scores.games[date][g], index = [0])
            a_team_df, h_team_df = g_data(g_df,one_game_stats)
            a_team_df['week'] = h_team_df['week'] = weeks_list[w]
            w_games_df = pd.concat([w_games_df,a_team_df])
            w_games_df = pd.concat([w_games_df,h_team_df])
            print(weeks_list[w])
        weeksgames_df = pd.concat([weeksgames_df,w_games_df])
        
    return weeksgames_df

In [106]:
one_game_stats = game_stats_info(game_info(weeks_list[0],2022).games['{}-{}'.format(weeks_list[0], 2022)][15]['boxscore'])

ParserError: Document is empty

In [20]:
def g_data_till_week(weeks_list, year):
    weeksgames_df = pd.DataFrame()
    for w in range(len(weeks_list)):
        date = '{}-{}'.format(weeks_list[w], year)
        w_scores = game_info(weeks_list[w],year)
        w_games_df = pd.DataFrame()
        for g in range(len(w_scores.games[date])):
            game_string = w_scores.games[date][g]['boxscore']
            try:
                one_game_stats = game_stats_info(game_string)
                g_df = pd.DataFrame(w_scores.games[date][g], index = [0])
                a_team_df, h_team_df = g_data(g_df,one_game_stats)
                a_team_df['week'] = h_team_df['week'] = weeks_list[w]
                w_games_df = pd.concat([w_games_df,a_team_df])
                w_games_df = pd.concat([w_games_df,h_team_df])
            except:
                one_game_stats = None
        weeksgames_df = pd.concat([weeksgames_df,w_games_df])
       
    return weeksgames_df







In [107]:
one_game_stats

Boxscore for Tampa Bay Buccaneers at Dallas Cowboys (Sunday Sep 11, 2022)

In [33]:
aggregate_games_df.tail(15)

Unnamed: 0,away_name,away_abbr,home_name,home_abbr,week,year,win_perc_dif,first_downs_dif,fumbles_dif,interceptions_dif,...,rush_yards_dif,time_of_possession_dif,times_sacked_dif,total_yards_dif,turnovers_dif,yards_from_penalties_dif,yards_lost_from_sacks_dif,fourth_down_perc_dif,third_down_perc_dif,result
2863,New York Jets,nyj,Cleveland Browns,cle,2,2022,-0.929846,0.796308,0.872615,0.974769,...,-126.097538,-339.140615,1.779692,20.731692,1.880462,7.877692,3.034,0.234853,-0.270844,
2864,Washington Commanders,was,Detroit Lions,det,2,2022,0.930957,2.964,0.036,0.916,...,-85.432,290.952,0.028,6.472,1.812,-2.408,-1.832,-0.900943,0.056544,
2865,Indianapolis Colts,clt,Jacksonville Jaguars,jax,2,2022,0.0,8.419692,3.595077,-0.033846,...,52.845692,737.512462,-0.026769,125.731538,0.843846,-1.632769,-2.802615,0.011784,0.141466,
2866,Tampa Bay Buccaneers,tam,New Orleans Saints,nor,2,2022,0.024,0.536,-0.92,0.888,...,-0.492,352.94,-1.872,-24.788,-0.02,-66.3,-16.652,0.00728,0.056441,
2867,Carolina Panthers,car,New York Giants,nyg,2,2022,-0.897231,-3.462154,2.675385,0.010923,...,-164.468769,-382.739385,-0.831692,-119.539846,-0.898769,47.161846,-3.411385,-0.898191,0.147588,
2868,New England Patriots,nwe,Pittsburgh Steelers,pit,2,2022,-0.886789,3.759538,0.876308,0.916769,...,5.586154,62.041846,0.864615,6.021231,2.706769,-39.667385,16.350769,0.024041,0.16167,
2869,Miami Dolphins,mia,Baltimore Ravens,rav,2,2022,-0.017692,4.023231,0.976769,-0.923385,...,-3.909385,249.172154,0.802308,22.299385,-0.874615,-7.771077,23.260154,0.885714,0.044254,
2870,Atlanta Falcons,atl,Los Angeles Rams,ram,2,2022,-0.024,5.972,1.864,-2.696,...,132.0,261.756,-6.26,148.908,-0.864,22.78,-44.076,-0.606042,-0.07145,
2871,Seattle Seahawks,sea,San Francisco 49ers,sfo,2,2022,0.88,1.428,0.868,-0.936,...,-91.432,-418.508,0.092,-75.664,-0.968,-21.488,9.364,-0.023529,0.062631,
2872,Cincinnati Bengals,cin,Dallas Cowboys,dal,2,2022,-0.02,17.672,1.756,2.732,...,52.36,889.056,2.792,164.224,3.608,-44.24,13.62,-0.054399,0.268485,


In [22]:
current_week=1

In [83]:
weeks = list(range(1,current_week + 1))
date_string = str(weeks[0]) + '-' + str(2022)
week_scores = game_info(weeks[0],2022)
week_games_df = pd.DataFrame()
weeks_games_df = pd.DataFrame()
game_str = week_scores.games[date_string][0]['boxscore']
game_stats = game_stats_info(game_str)
game_df = pd.DataFrame(week_scores.games[date_string][0], index = [0])
away_team_df, home_team_df = g_data(game_df,game_stats)
away_team_df['week'] = weeks[0]
home_team_df['week'] = weeks[0]
week_games_df = pd.concat([week_games_df,away_team_df])
week_games_df = pd.concat([week_games_df,home_team_df])
weeks_games_df = pd.concat([weeks_games_df,week_games_df])

In [39]:
g_data_till_week(weeks_list, current_y)

Unnamed: 0,team_name,team_abbr,score,game_won,game_lost,first_downs,fourth_down_attempts,fourth_down_conversions,fumbles,fumbles_lost,...,rush_yards,third_down_attempts,third_down_conversions,time_of_possession,times_sacked,total_yards,turnovers,yards_from_penalties,yards_lost_from_sacks,week
0,Buffalo Bills,buf,31,1,0,23,0,0,2,2,...,121,10,9,1874,2,413,4,35,5,1
0,Los Angeles Rams,ram,10,0,1,19,3,2,1,0,...,52,13,6,1726,7,243,3,30,49,1
0,New Orleans Saints,nor,27,1,0,18,0,0,1,1,...,151,13,4,1576,4,385,1,99,35,1
0,Atlanta Falcons,atl,26,0,1,26,0,0,3,2,...,201,13,5,2024,0,416,2,55,0,1
0,Cleveland Browns,cle,26,1,0,23,2,1,1,0,...,217,18,8,2306,1,355,0,71,9,1
0,Carolina Panthers,car,24,0,1,15,0,0,5,0,...,54,11,4,1294,4,261,1,96,28,1
0,San Francisco 49ers,sfo,10,0,1,17,2,0,2,1,...,176,17,8,2008,2,331,2,99,9,1
0,Chicago Bears,chi,19,1,0,15,0,0,0,0,...,99,14,5,1592,2,204,1,24,16,1
0,Pittsburgh Steelers,pit,23,1,0,13,0,0,1,0,...,75,15,4,1577,1,267,0,59,2,1
0,Cincinnati Bengals,cin,20,0,1,32,3,1,2,1,...,133,16,8,2623,7,432,5,27,39,1


1. Importing the team statistics for each game from the year 2010 to 2020 using the g_data function and exporting it to a csv file (so that we don't need to extract the data from the open source, hence increasing the speed)
2. Changing names of the teams that have changed their name in the last 10 years to maintain homogeneity 
3. Concating the team statistics for each game of the current year (2021) to the similar team stats from 2010 to 2020

In [65]:
current_y = 2022

In [None]:
full_game_data = pd.DataFrame()
for n in range(2019, 2022):
    game_data_ = g_data_till_week(list(range(1, 18)), n)
    game_data_['year'] = n
    full_game_data = pd.concat([full_game_data, game_data_])
full_game_data.reset_index(drop = True, inplace = True)

In [19]:
full_game_data = full_game_data.replace('Oakland Raiders', 'Las Vegas Raiders')
full_game_data = full_game_data.replace('San Diego Chargers', 'Los Angeles Chargers')
full_game_data = full_game_data.replace('St. Louis Rams', 'Los Angeles Rams')
full_game_data = full_game_data.replace('Washington Football Team', 'Washington Commanders')

In [21]:
full_game_data = pd.read_csv('full_game_data2.csv')
full_game_data = full_game_data.loc[:, ~full_game_data.columns.str.contains('^Unnamed')]

In [49]:
full_game_data_1 = pd.read_csv('2022_wk1_data.csv')
full_game_data_1['year'] = current_y
full_game_data = pd.concat([full_game_data, full_game_data_1])
full_game_data.reset_index(drop = True, inplace = True)

In [22]:
full_game_data_1 = g_data_till_week(weeks_list, current_y)
full_game_data_1['year'] = current_y
full_game_data = pd.concat([full_game_data, full_game_data_1])
full_game_data.reset_index(drop = True, inplace = True)

In [23]:
full_game_data = full_game_data.loc[:, ~full_game_data.columns.str.contains('^Unnamed')]

In [18]:
full_game_data.to_csv('full_game_data_for_spreads_wk2.csv')

In [75]:
full_game_data.tail(15)

Unnamed: 0,team_name,team_abbr,score,game_won,game_lost,first_downs,fourth_down_attempts,fourth_down_conversions,fumbles,fumbles_lost,interceptions,net_pass_yards,pass_attempts,pass_completions,pass_touchdowns,pass_yards,penalties,points,rush_attempts,rush_touchdowns,rush_yards,third_down_attempts,third_down_conversions,time_of_possession,times_sacked,total_yards,turnovers,yards_from_penalties,yards_lost_from_sacks,week,year
6491,Miami Dolphins,mia,33.0,1.0,0.0,23.0,1.0,0.0,0.0,0.0,0.0,103.0,22.0,15.0,1.0,109.0,5.0,33.0,43.0,1.0,195.0,15.0,7.0,2016.0,1.0,298.0,0.0,33.0,6.0,18,2021
6492,Chicago Bears,chi,17.0,0.0,1.0,24.0,6.0,1.0,0.0,0.0,2.0,266.0,48.0,33.0,1.0,325.0,2.0,17.0,25.0,0.0,90.0,15.0,5.0,2207.0,7.0,356.0,2.0,10.0,59.0,18,2021
6493,Minnesota Vikings,min,31.0,1.0,0.0,11.0,0.0,0.0,1.0,0.0,0.0,227.0,22.0,14.0,3.0,250.0,4.0,31.0,22.0,0.0,104.0,13.0,7.0,1393.0,3.0,331.0,0.0,47.0,23.0,18,2021
6494,Washington Football Team,was,22.0,1.0,0.0,16.0,1.0,1.0,2.0,0.0,0.0,99.0,18.0,9.0,0.0,120.0,3.0,22.0,37.0,1.0,226.0,13.0,3.0,1937.0,3.0,325.0,0.0,29.0,21.0,18,2021
6495,New York Giants,nyg,7.0,0.0,1.0,10.0,4.0,2.0,1.0,1.0,2.0,83.0,31.0,15.0,1.0,103.0,3.0,7.0,25.0,0.0,94.0,17.0,6.0,1663.0,3.0,177.0,3.0,21.0,20.0,18,2021
6496,Pittsburgh Steelers,pit,16.0,1.0,0.0,19.0,1.0,1.0,0.0,0.0,1.0,235.0,44.0,30.0,1.0,244.0,6.0,16.0,30.0,0.0,79.0,17.0,6.0,2208.0,1.0,314.0,1.0,43.0,9.0,18,2021
6497,Baltimore Ravens,rav,13.0,0.0,1.0,20.0,2.0,1.0,2.0,1.0,2.0,132.0,32.0,16.0,0.0,141.0,1.0,13.0,36.0,1.0,249.0,14.0,3.0,1876.0,3.0,381.0,3.0,5.0,9.0,18,2021
6498,Carolina Panthers,car,17.0,0.0,1.0,18.0,6.0,2.0,1.0,1.0,1.0,207.0,43.0,29.0,2.0,219.0,1.0,17.0,26.0,0.0,110.0,14.0,4.0,2105.0,2.0,317.0,2.0,10.0,12.0,18,2021
6499,Tampa Bay Buccaneers,tam,41.0,1.0,0.0,21.0,1.0,1.0,0.0,0.0,0.0,324.0,39.0,29.0,3.0,326.0,2.0,41.0,20.0,2.0,85.0,11.0,4.0,1495.0,1.0,409.0,0.0,10.0,2.0,18,2021
6500,Seattle Seahawks,sea,38.0,1.0,0.0,19.0,0.0,0.0,1.0,1.0,1.0,229.0,26.0,15.0,3.0,238.0,4.0,38.0,30.0,2.0,202.0,12.0,8.0,1451.0,1.0,431.0,2.0,30.0,9.0,18,2021


### 2.5. Full game data (stats) dataframe exploration

In [None]:
full_game_data.head()

In [None]:
full_game_data.tail()

In [None]:
full_game_data.info()

1. We can see that the team stats data from sportsipy NFL package are complete (no null/missing values)
2. There are also no duplicate values (as indicated below) since each row is attributed to each team's performance for a specific match
2. This dataframe is not the final dataframe that will be used in the machine learning application
3. However, we can analyze and extract some insights from this dataframe before creating the final dataframe

In [None]:
full_game_data.duplicated().unique()

In [None]:
# We first can do a single variable (univariat) analysis by plotting its histogram
# Since the game is played with each team trying to score to win, we'll use this variable
def create_histogram_for_one_variable(full_game_data, variable_name):
    full_game_data[variable_name].hist(bins=15, figsize=(5,4), grid=False, color='midnightblue')
    plt.xlabel(variable_name)
    plt.ylabel('Frequency')
    plt.title('Histogram of ' + variable_name)

variable_name = 'score'
create_histogram_for_one_variable(full_game_data, variable_name)

1. We can see that it's not that balanced and there are some extremes to the right side
3. From the histogram, we can conclude that the teams usually score between 15-30
2. The data also shows that there are less and less observations when a team scores more than 40 in a match

In [None]:
# We can further stack the histogram by the year
def create_stacked_histogram_for_one_variable(full_game_data, variable_name):
    from matplotlib import cm
    colors = cm.inferno_r(np.linspace(.3, .7, 12))
    full_game_data.pivot(columns='year')['score'].plot(kind = 'hist', bins = 15,
                                                       figsize=(5,4), grid=False,
                                                       stacked=True, color=colors)
    plt.xlabel(variable_name)
    plt.ylabel('Frequency')
    plt.title('Histogram of ' + variable_name + ' Grouped by Year')

variable_name = 'score'
create_stacked_histogram_for_one_variable(full_game_data, variable_name)

1. From the graph above, we see that there is no significant difference between each year in each bar
2. The frequency tends to be divided fairly every season
3. It may be easier to see the cumulative histogram to see that the height of each bar is fairly divided

In [None]:
# We can plot the cumulative histogram of the above graph
def create_stacked_histogram_for_one_variable_cumulative(full_game_data, variable_name):
    from matplotlib import cm
    colors = cm.inferno_r(np.linspace(.3, .7, 12))
    full_game_data.pivot(columns='year')['score'].plot(kind = 'hist', bins = 15,
                                                       figsize=(5,5), grid=False,
                                                       stacked=True, color=colors,
                                                       cumulative=True)
    plt.xlabel(variable_name)
    plt.ylabel('Frequency')
    plt.title('Cumulative Histogram of ' + variable_name + ' Grouped by Year')

variable_name = 'score'
create_stacked_histogram_for_one_variable_cumulative(full_game_data, variable_name)

In [None]:
# We are interested in the relationship between the score and all other team stats
def plot_scatter_for_score_vs_other_stats(full_game_data, no_of_cols):
    team_stats = full_game_data.drop(columns = ['team_name', 'team_abbr', 'game_won',
                                            'game_lost', 'week', 'year'])
    no_of_rows = (len(team_stats.columns)//no_of_cols)+1
    fig = plt.figure(figsize=(20,30))
    for i, col in enumerate(team_stats.iloc[:,1:].columns):
        ax = fig.add_subplot(no_of_rows,no_of_cols, i+1)
        ax.scatter(team_stats[col], team_stats['score'], c='dimgrey')
        ax.set_ylabel('score')
        ax.set_xlabel(col)
        ax.set_title('{} vs. {}'.format(col, 'score'), color='firebrick')
    fig.tight_layout()  
    plt.show()

plot_scatter_for_score_vs_other_stats(full_game_data, 4)

1. From the scatter plots above, we see that there are some stats that are positively correlated with score (first_downs, net_pass_yards, pass_touchdowns, pass_yards, points, rush_attempts, rush_touchdowns, rush_yards, time_of_possession, and total_yards)
2. However, there are also stats that are negatively correlated with score (fourth_down_attempts, fumbles_lost, interceptions, times_scaked, turnovers, and yards_lost_from_sacks)
3. There are also stats that seem to be inconclusive based on these charts, in other words, the dots are too scattered (fourth_down_conversions, fumbles, pass_attempts, pass_completions, penalties, third_down_attempts, third_down_conversions, and yards_from_penalties)
4. We can then further investigate the relationship between other variables

In [None]:
# Plotting scatter plots between several variables in a list
def plot_scatter_by_choosing_variables_no1(full_game_data, list_of_var):
    import seaborn as sns
    sns.set_style("whitegrid", {'axes.grid' : False})
    sns.pairplot(full_game_data[list_of_var])

list_of_var = ['first_downs', 'rush_attempts', 'rush_yards', 'total_yards']
plot_scatter_by_choosing_variables_no1(full_game_data, list_of_var)

1. We suspect that variables that are positively correlated to the score will also have the same behavior between each other
2. The plot above shows that there are indeed a positive correlation between variables that contribute to scoring higher
3. It signals multicolinearity and we will confirm again and handle it later in our final dataframe
4. For the graph below, we plot the same thing, but now each dot represents the season of NFL

In [None]:
# Plotting scatter plots between several variables in a list and differentiate each point based on the season (year)
def plot_scatter_by_choosing_variables_no2(full_game_data, list_of_var):
    full_game_data['season'] = full_game_data['year'].apply(lambda x: str(x))
    list_of_var.append('season')
    import seaborn as sns
    sns.set_style("whitegrid", {'axes.grid' : False})
    sns.pairplot(full_game_data[list_of_var], hue='season', palette='icefire', corner=True)

list_of_var = ['first_downs', 'rush_attempts', 'rush_yards', 'total_yards']
plot_scatter_by_choosing_variables_no2(full_game_data, list_of_var)

In [None]:
# Plotting scatter plots between several variables in a list and differentiate each point based on the team name
def plot_scatter_by_choosing_variables_no3(full_game_data, list_of_var):
    list_of_var.append('team_name')
    import seaborn as sns
    sns.set_style("whitegrid", {'axes.grid' : False})
    sns.pairplot(full_game_data[list_of_var], hue='team_name', palette='coolwarm', corner=True)

list_of_var = ['interceptions', 'times_sacked', 'turnovers', 'yards_lost_from_sacks']
plot_scatter_by_choosing_variables_no3(full_game_data, list_of_var)

1. For the variables that are negatively correlated with the score, we can also see that some of them have linear relationship between each other based on the graphs above (turnovers and interceptions, yards_lost_from_sacks and times_sacked)
2. It further confirms that multicolinearity exists
3. For these above graphs, each point represents each team

In [None]:
# Plotting histograms of a variable grouped by the winning and losing team
def create_feature_histogram_winning_vs_losing_team(full_game_data, variable_name):
    full_game_data.groupby('game_won')[variable_name].hist(bins=10, figsize=(5,3),
                                                           histtype='stepfilled',
                                                           alpha=0.4, grid=False)
    plt.xlabel(variable_name)
    plt.ylabel('Frequency')
    plt.title('Histogram of ' + variable_name + ' Grouped by Winning and Losing Team')
    plt.legend(['0.0: Losing Team','1.0: Winning Team'])

variable_name = 'first_downs'
create_feature_histogram_winning_vs_losing_team(full_game_data, variable_name)

In [None]:
variable_name = 'rush_attempts'
create_feature_histogram_winning_vs_losing_team(full_game_data, variable_name)

In [None]:
variable_name = 'rush_touchdowns'
create_feature_histogram_winning_vs_losing_team(full_game_data, variable_name)

1. The three histograms comparison above confirms that for a variable that is correlated positively to the score, the winning team histogram will be more on the right side
2. On the other hand, for the negative correlated variables (with score), as shown below, the losing team histogram tends to be more on the right side

In [None]:
variable_name = 'turnovers'
create_feature_histogram_winning_vs_losing_team(full_game_data, variable_name)

In [None]:
variable_name = 'times_sacked'
create_feature_histogram_winning_vs_losing_team(full_game_data, variable_name)

1. Based on some of the findings above and information that we have about the NFL team statistics, we believe that there are two types of team statistics: the good stats (stats that will increase the chance of winning) and the bad ones (stats that don't help the team)
2. Intuitively, we think that the teams that won the game should have better good stats compared to the teams that lost the game
3. On the contrary, the winning team would have worse bad stats compared to the losing team
4. In other words, if we take the stats difference (average of all macthes) between the winning team and the losing team, the good stats difference should have positive values (first graph below), and the bad stats difference should be negative (second graph below)
5. Later on, after performing the machine learning model, we can compare this hypothesis to the decomposed features' importance graph

#### Hypothesized good features (stats):
1. first_downs
2. fourth_down_conversions
3. time_of_possession
4. third_down_attempts
5. net_pass_yards
6. pass_attempts
7. pass_completions
8. pass_touchdowns
9. pass_yards
10. points
11. rush_attempts
12. rush_touchdowns
13. rush_yards
14. third_down_conversions
15. total_yards

#### Hypothesized bad features (stats):
1. fourth_down_attempts
2. fumbles
3. fumbles_lost
4. interceptions
5. penalties
6. times_sacked
7. turnovers
8. yards_from_penalties
9. yards_lost_from_sacks

In [None]:
# Creating the good features plot between winning team vs. losing team: expected to be positive,
# and the bad features plot: expected to be negative
def plot_good_and_bad_features_comparison(full_game_data):
    winning_df = full_game_data[full_game_data["game_won"] == 1].drop(
        columns = ['team_name', 'team_abbr', 'score', 'game_won', 'game_lost',
        'week', 'year'])
    losing_df = full_game_data[full_game_data["game_lost"] == 1].drop(
        columns = ['team_name', 'team_abbr', 'score', 'game_won', 'game_lost',
        'week', 'year'])

    winning_good_df = winning_df[['first_downs','fourth_down_conversions','time_of_possession',
                         'third_down_attempts','net_pass_yards','pass_attempts',
                         'pass_completions','pass_touchdowns','pass_yards','points',
                         'rush_attempts','rush_touchdowns','rush_yards',
                         'third_down_conversions','total_yards']]
    winning_bad_df = winning_df[['fourth_down_attempts','fumbles','fumbles_lost','interceptions',
                          'penalties','times_sacked','turnovers','yards_from_penalties',
                          'yards_lost_from_sacks']]
    losing_good_df = losing_df[['first_downs','fourth_down_conversions','time_of_possession',
                         'third_down_attempts','net_pass_yards','pass_attempts',
                         'pass_completions','pass_touchdowns','pass_yards','points',
                         'rush_attempts','rush_touchdowns','rush_yards',
                         'third_down_conversions','total_yards']]
    losing_bad_df = losing_df[['fourth_down_attempts','fumbles','fumbles_lost','interceptions',
                          'penalties','times_sacked','turnovers','yards_from_penalties',
                          'yards_lost_from_sacks']]

    good_features = winning_good_df.mean() - losing_good_df.mean()
    bad_features = winning_bad_df.mean() - losing_bad_df.mean()
    
    plt.rcParams["figure.dpi"] = 80
    
    good_features.plot(kind='barh', figsize = (10, 6), color='midnightblue')
    plt.ylabel('Features')
    plt.legend(['Average Good Features of Winning Team - Average Good Features of Losing Team'], loc='best')
    plt.title('Good Features Comparison')
    plt.show()
    
    bad_features.plot(kind='barh', figsize = (10, 4), color='firebrick')
    plt.ylabel('Features')
    plt.legend(['Average Bad Features of Winning Team - Average Bad Features of Losing Team'], loc='best')
    plt.title('Bad Features Comparison')
    plt.show()

plot_good_and_bad_features_comparison(full_game_data)

### 2.6. More on visualization: folium map

1. We are also interested in creating folium map to summarize each team's performance (team stats)
2. To complete this, we create two maps: map that can show the summary from 2010 until the current week of current season and map that can show the summary for a specific year

In [None]:
# Creating a folium map summarizing team stats from 2010 to 2021 (up until current week)
def create_team_stats_map(full_game_data):
    import folium
    import geojsonio
    import json
    import branca
    from IPython.display import IFrame
            
    team_loc = pd.read_csv('team_information.csv')
    team_stats = full_game_data.drop(columns=['team_abbr', 'game_won', 'game_lost', 'week',
                                            'year']).groupby('team_name').mean().reset_index()
    team_won_and_lost = full_game_data[['team_name', 'game_won', 'game_lost'
                                       ]].groupby('team_name').sum().reset_index()
    temp_df = pd.merge(team_loc, team_won_and_lost, left_on='team_name',
                       right_on='team_name')
    df = pd.merge(temp_df, team_stats, left_on='team_name',
                  right_on='team_name').rename(columns = {'game_won': 'total_game_won',
                                                 'game_lost': 'total_game_lost',
                                                 'score': 'average_score',
                                                 'first_downs': 'average_first_downs',
                                                 'fourth_down_attempts': 'average_fourth_down_attempts',
                                                 'fourth_down_conversions': 'average_fourth_down_conversions',
                                                 'fumbles': 'average_fumbles',
                                                 'fumbles_lost': 'average_fumbles_lost',
                                                 'interceptions': 'average_interceptions',
                                                 'net_pass_yards': 'average_net_pass_yards',
                                                 'pass_attempts': 'average_pass_attempts',
                                                 'pass_completions': 'average_pass_completions',
                                                 'pass_touchdowns': 'average_pass_touchdowns',
                                                 'pass_yards': 'average_pass_yards',
                                                 'penalties': 'average_penalties',
                                                 'points': 'average_points',
                                                 'rush_attempts': 'average_rush_attempts',
                                                 'rush_touchdowns': 'average_rush_touchdowns',
                                                 'rush_yards': 'average_rush_yards',
                                                 'third_down_attempts': 'average_third_down_attempts',
                                                 'third_down_conversions': 'average_third_down_conversions',
                                                 'time_of_possession': 'average_time_of_possession',
                                                 'times_sacked': 'average_times_sacked',
                                                 'total_yards': 'average_total_yards',
                                                 'turnovers': 'average_turnovers',
                                                 'yards_from_penalties': 'average_yards_from_penalties',
                                                 'yards_lost_from_sacks': 'average_yards_lost_from_sacks'})    
    
    usa_center = (37.0902, -95.7129)
    team_map = folium.Map(location=usa_center, zoom_start=4)
    
    for i in df.index:
        popup_dict = dict()
        for var in df.iloc[:,10:].columns:
            popup_dict[var] = f'{df.iloc[:,10:].loc[i, var]:.2f}'
            popup_df = pd.DataFrame(list(popup_dict.items()), columns=[['Attributes', 'Values']])
            popup_html = popup_df.to_html(index=False)
            iframe = branca.element.IFrame(html=popup_html, width=400, height=300)
            popup = folium.Popup(iframe, max_width=2650)
        
        icon = folium.features.CustomIcon(df.loc[i,'team_logo_wikipedia'], icon_size=(36, 36))
        folium.Marker(location=[df.loc[i,'lat'], df.loc[i,'long']],
                      popup=popup,
                      tooltip=df.loc[i, 'team_name'],
                      icon=icon).add_to(team_map)
    
    title = 'Summary of Team Stats 2010-2021'
    title_html = '''
             <h3 align="center" style="font-size:20px"><b>{title}</b></h3>
             '''.format(title=title)
    team_map.get_root().html.add_child(folium.Element(title_html))
    
    return team_map

create_team_stats_map(full_game_data)

1. With the map, we can just click the team logo to get the summarized stats
2. Below is the function to create the same map but only the stats of a given year will be shown

In [None]:
# Creating a folium map summarizing team stats in a given (specific) year
def create_yearly_team_stats_map(full_game_data, year):
    import folium
    import geojsonio
    import json
    import branca
    from IPython.display import IFrame
    
    full_game_data_year = full_game_data[full_game_data['year'] == year]
            
    team_loc = pd.read_csv('team_information.csv')
    team_stats = full_game_data_year.drop(columns=['team_abbr', 'game_won', 'game_lost', 'week',
                                            'year']).groupby('team_name').mean().reset_index()
    team_won_and_lost = full_game_data_year[['team_name', 'game_won', 'game_lost'
                                       ]].groupby('team_name').sum().reset_index()
    temp_df = pd.merge(team_loc, team_won_and_lost, left_on='team_name',
                       right_on='team_name')
    df = pd.merge(temp_df, team_stats, left_on='team_name',
                  right_on='team_name').rename(columns = {'game_won': 'total_game_won',
                                                 'game_lost': 'total_game_lost',
                                                 'score': 'average_score',
                                                 'first_downs': 'average_first_downs',
                                                 'fourth_down_attempts': 'average_fourth_down_attempts',
                                                 'fourth_down_conversions': 'average_fourth_down_conversions',
                                                 'fumbles': 'average_fumbles',
                                                 'fumbles_lost': 'average_fumbles_lost',
                                                 'interceptions': 'average_interceptions',
                                                 'net_pass_yards': 'average_net_pass_yards',
                                                 'pass_attempts': 'average_pass_attempts',
                                                 'pass_completions': 'average_pass_completions',
                                                 'pass_touchdowns': 'average_pass_touchdowns',
                                                 'pass_yards': 'average_pass_yards',
                                                 'penalties': 'average_penalties',
                                                 'points': 'average_points',
                                                 'rush_attempts': 'average_rush_attempts',
                                                 'rush_touchdowns': 'average_rush_touchdowns',
                                                 'rush_yards': 'average_rush_yards',
                                                 'third_down_attempts': 'average_third_down_attempts',
                                                 'third_down_conversions': 'average_third_down_conversions',
                                                 'time_of_possession': 'average_time_of_possession',
                                                 'times_sacked': 'average_times_sacked',
                                                 'total_yards': 'average_total_yards',
                                                 'turnovers': 'average_turnovers',
                                                 'yards_from_penalties': 'average_yards_from_penalties',
                                                 'yards_lost_from_sacks': 'average_yards_lost_from_sacks'})    
    
    usa_center = (37.0902, -95.7129)
    team_map = folium.Map(location=usa_center, zoom_start=4, tiles='cartodbpositron')
    
    for i in df.index:
        popup_dict = dict()
        for var in df.iloc[:,10:].columns:
            popup_dict[var] = f'{df.iloc[:,10:].loc[i, var]:.2f}'
            popup_df = pd.DataFrame(list(popup_dict.items()), columns=[['Attributes', 'Values']])
            popup_html = popup_df.to_html(index=False)
            iframe = branca.element.IFrame(html=popup_html, width=400, height=300)
            popup = folium.Popup(iframe, max_width=2650)
        
        icon = folium.features.CustomIcon(df.loc[i,'team_logo_wikipedia'], icon_size=(36, 36))
        folium.Marker(location=[df.loc[i,'lat'], df.loc[i,'long']],
                      popup=popup,
                      tooltip=df.loc[i, 'team_name'],
                      icon=icon).add_to(team_map)
    
    title = 'Summary of Team Stats in ' + str(year)
    title_html = '''
             <h3 align="center" style="font-size:20px"><b>{title}</b></h3>
             '''.format(title=title)
    team_map.get_root().html.add_child(folium.Element(title_html))
    
    return team_map

create_yearly_team_stats_map(full_game_data, 2020)

1. Having the team stats performace from 2010-2021 and plot them on a folium map is a great thing
2. With the map and stats, we can directly compare the teams to identify how strong a team is compared to the others (relative team strength based on the team stats)
3. However, one might be curious about which teams have the same overall strength based on the team stats and how 'close' each team is with the other teams based on their summarized stats
4. This is where we can apply network analysis

### 2.7. More on visualization: network

1. Using the overall stats performance of each team like what we have used in the folium map, we will calculate the 'distance' between each team
2. Distance, in this case, is calculated by using the euclidian method
3. Distance represents how close or how far each team is, based on the overall team stats performance
4. In other words, two teams that are close in distance will have similar team stats performance from 2010-2021
5. On the other hand, two teams that are far from each other will have opposite team stats performance
6. However, using distance can be a bit confusing since the 'smaller' the distance actually means the 'greater' the similarity between two teams
7. Therefore, for easier interpretation, we'll create similarity_distance which equals to '1 divided by the distance'
8. Hence, the smaller the distance -> the greater the similarity_distance, and the greater the similarity (in terms of performance stats) between the teams

In [None]:
def create_euclidean_distance_df(full_game_data):
    team_stats = full_game_data.drop(columns=['game_won', 'game_lost', 'week',
        'year']).groupby(['team_name', 'team_abbr']).mean().reset_index()
    team_won_and_lost = full_game_data[['team_name', 'team_abbr', 'game_won', 'game_lost'
        ]].groupby(['team_name', 'team_abbr']).sum().reset_index()

    df = pd.merge(team_won_and_lost, team_stats, left_on=['team_name','team_abbr'], right_on=['team_name',
        'team_abbr']).rename(columns = {'game_won': 'total_game_won',
                                        'game_lost': 'total_game_lost',
                                        'score': 'average_score',
                                        'first_downs': 'average_first_downs',
                                        'fourth_down_attempts': 'average_fourth_down_attempts',
                                        'fourth_down_conversions': 'average_fourth_down_conversions',
                                        'fumbles': 'average_fumbles',
                                        'fumbles_lost': 'average_fumbles_lost',
                                        'interceptions': 'average_interceptions',
                                        'net_pass_yards': 'average_net_pass_yards',
                                        'pass_attempts': 'average_pass_attempts',
                                        'pass_completions': 'average_pass_completions',
                                        'pass_touchdowns': 'average_pass_touchdowns',
                                        'pass_yards': 'average_pass_yards',
                                        'penalties': 'average_penalties',
                                        'points': 'average_points',
                                        'rush_attempts': 'average_rush_attempts',
                                        'rush_touchdowns': 'average_rush_touchdowns',
                                        'rush_yards': 'average_rush_yards',
                                        'third_down_attempts': 'average_third_down_attempts',
                                        'third_down_conversions': 'average_third_down_conversions',
                                        'time_of_possession': 'average_time_of_possession',
                                        'times_sacked': 'average_times_sacked',
                                        'total_yards': 'average_total_yards',
                                        'turnovers': 'average_turnovers',
                                        'yards_from_penalties': 'average_yards_from_penalties',
                                        'yards_lost_from_sacks': 'average_yards_lost_from_sacks'})
    
    from scipy.spatial.distance import pdist
    import itertools

    distance_df = pd.DataFrame(itertools.combinations(df['team_abbr'].values, 2), columns=['team_1','team_2'])
    distance_df['distance'] = pdist(df.iloc[:,2:].values, 'euclid')
    distance_df['similarity_distance'] = 1 / distance_df['distance']
    
    return distance_df

In [None]:
distance_df = create_euclidean_distance_df(full_game_data)

In [None]:
distance_df

1. The distance_df has 496 rows because there are 32 teams and a distance represents a pair of 2 teams' distance (32 choose 2 combinations)
2. The descriptive stats of distance is shown below

In [None]:
distance_df.describe()

1. Before drawing the network, we want to make sure that the network won't be too cluttered
2. Therefore, we will have a threshold to remove the edges (pair of two teams) where the similarity_distance is below this threshold number (less significant)
3. If we don't remove edges with large distance, every node will have 31 edges and it will be cluttered
4. We also want to identify the relative closeness of distance by giving thicker edges to smaller distance between two nodes (scaling the edge thickness to indicate the closeness in distance: thicker means closer)
5. We also want to identify the most connected nodes by adjusting the size of nodes (scale the size of nodes based on its degree to indicate which teams have the greatest number of closeness in distance with other teams)
6. We'll use the 'Fruchterman-Reingold' layout algorithm which will set the positions of each node by minimizing the distance between the nodes with high similarity_distance

In [None]:
# Define the network by previously removing edges that are below a threshold
def create_network_and_remove_edges_below_mean(distance_df):
    
    distance_df = distance_df[['team_1', 'team_2', 'similarity_distance']]
    
    # Set the threshold to be equal to the mean
    threshold_distance = distance_df.describe().loc['mean', 'similarity_distance']
    
    import networkx as nx
    # Create a graph from edge list
    G_team = nx.from_pandas_edgelist(distance_df, 'team_1', 'team_2', edge_attr=['similarity_distance'])

    # List to store edges to remove
    edges_removed_list = []

    # Loop through edges in G and find distance which are below the threshold
    for team_1, team_2 in G_team.edges():
        edge_distance = G_team[team_1][team_2]['similarity_distance']
        if edge_distance < threshold_distance:
            edges_removed_list.append((team_1, team_2))

    # Remove edges contained in the remove list
    G_team.remove_edges_from(edges_removed_list)
    
    print(str(len(edges_removed_list)) + " edges have been removed")
    
    return G_team

In [None]:
# Define the G_team and get information on how many edges have been removed
G_team = create_network_and_remove_edges_below_mean(distance_df)

In [None]:
# Draw the network with Fruchterman Reingold layout
def draw_network_of_nfl_team(G_team):
    import networkx as nx
    
    # Set node size
    def assign_node_size(value, scaling_factor=2):
        return value**2 * scaling_factor
    
    node_size = []
    for key, value in dict(G_team.degree).items():
        node_size.append(assign_node_size(value))
    
    # Set edge thickness
    def assign_edge_thickness(value, scaling_factor=10):
        return (value*100)**2 / scaling_factor

    edge_width = []
    for key, value in nx.get_edge_attributes(G_team, 'similarity_distance').items():
        edge_width.append(assign_edge_thickness(value))
        
    import seaborn as sns
    sns.set(rc={'figure.figsize': (9, 9)})
    font_dict = {'fontsize': 18}
    
    nx.draw(G_team, pos=nx.fruchterman_reingold_layout(G_team),
            with_labels=True, node_size=node_size,
            node_color="#e1575c", edge_color='#363847',
            width=edge_width)

    plt.title("NFL Team Network Based on Team Stats Euclidian Distance", fontdict=font_dict)
    plt.show()

draw_network_of_nfl_team(G_team)

1. Based on the network above, we first can easily identify the teams with bigger size of node, which means they are the most connected teams
2. These teams, for instance, are 'clt' (Indianapolis Colts), 'cin' (Cincinnati Bengals), and 'min' (Minnesota Vikings)
3. Nodes that are positioned closer to each other also show thicker edges, meaning that we manage to signify the closeness of these nodes (teams)
4. For example, 'crd' (Arizona Cardinals) is closer to 'ram' (Los Angeles Rams), rather than to 'mia' (Miami Dolphins), and it is shown by the position and also the thickness level of their edges

### Team abbreviation
1. crd: Arizona Cardinals
2. atl: Atlanta Falcons
3. rav: Baltimore Ravens
4. buf: Buffalo Bills
5. car: Carolina Panthers
6. chi: Chicago Bears
7. cin: Cincinnati Bengals
8. cle: Cleveland Browns
9. dal: Dallas Cowboys
10. den: Denver Broncos
11. det: Detroit Lions
12. gnb: Green Bay Packers
13. htx: Houston Texans
14. clt: Indianapolis Colts
15. jax: Jacksonville Jaguars
16. kan: Kansas City Chiefs
17. rai: Las Vegas Raiders
18. sdg: Los Angeles Chargers
19. ram: Los Angeles Rams
20. mia: Miami Dolphins
21. min: Minnesota Vikings
22. nwe: New England Patriots
23. nor: New Orleans Saints
24. nyg: New York Giants
25. nyj: New York Jets
26. phi: Philadelphia Eagles
27. pit: Pittsburgh Steelers
28. sfo: San Francisco 49ers
29. sea: Seattle Seahawks
30. tam: Tampa Bay Buccaneers
31.	oti: Tennessee Titans
32. was: Washington Football Team

#### Drawing minimum spanning tree (MST)
1. Since the layout already positions the teams based on their closeness, we don't need the edges to represent closeness as well
2. We'll now use the MST to reduce the edges down to those necessary to connect all the teams
3. By doing this, we will have a better clustered network
4. We don't need the size of the node to be scaled right now since it will be meaningless in MST

In [None]:
# Draw minimum spanning tree (MST) network
def draw_minimum_spanning_tree_of_nfl_team(G_team):
    import networkx as nx
    G_team_mst = nx.minimum_spanning_tree(G_team)
    
    import seaborn as sns
    sns.set(rc={'figure.figsize': (9, 9)})
    font_dict = {'fontsize': 18}
        
    # Draw minimum spanning tree, but we have to set node size and width to constant
    nx.draw(G_team_mst, with_labels=True,
            pos=nx.fruchterman_reingold_layout(G_team_mst),
            node_size=500, node_color="#e1575c",
            edge_color='#363847', width = 1.2)

    plt.title("NFL Team Network Based on Team Stats Euclidian Distance - Minimum Spanning Tree",
              fontdict=font_dict)
    plt.show()

draw_minimum_spanning_tree_of_nfl_team(G_team)

In [None]:
full_game_data.head(100)

1. With the MST network above, we can better see how the teams are clustered
2. We can identify the centroids of the clusters: 'buf' (Buffalo Bills), 'crd' (Arizona Cardinals), and 'atl' (Atlanta Falcons)
3. There is also 'clt' (Indianapolis Colts) that connects two clusters

## 3. Data Finalization

#### Creating Dataframe to build and train the model

1. For schedule of each game from (2011-2021), we aggregate the statistics of each team playing the game and determine the impact of features on the result of the game
2. Statistics are calculated using the weighted average of statistics of games played by a team in that season and the previous season. For a given season, statistics up to the game played are taken into account.
3. For each game, we take statistics from past 17 weeks of team's performance
4. For example, for game in week 7 of a particular season, we use the statistics from the first 6 weeks of that season and last 11 weeks from the previous season
5. Weighted average is assigned to account for team changes in each season, momentum in each season, hence giving more weightage to performance of the current season than the last season
6. The differential statistics between each team are then calculated
7. We add a 'nan' value for the result if the game hasn't been played yet

'aggregate_weekly_data' function takes 5 arguments as inputs:
1. schedule dataframe -> full_schedule dataframe
2. weeksgames_df (game statistics of given week of a given year) -> full_game_data dataframe
3. current week
4. current year
5. weeks list
6. m (weightage given to statistics of past season)

In [24]:
def aggregate_weekly_data(schedule_df, weeksgames_df, current_w, current_y, weeks_list, m):
    aggregate_games_df = pd.DataFrame()
    for n_year in range(2011, current_y + 1):
        weeksgames_df_1 = weeksgames_df[weeksgames_df.year == n_year]
        weeksgames_df_0 = weeksgames_df[weeksgames_df.year == n_year - 1]
        if n_year != current_y:
            schedule_df_1 = schedule_df[schedule_df.year == n_year]
            schedule_df_0 = schedule_df[schedule_df.year == n_year - 1]
        else:
            schedule_df_1 = schedule_df[schedule_df.year == n_year]
            schedule_df_1 = schedule_df_1[schedule_df_1.week <= current_w]
            schedule_df_0 = schedule_df[schedule_df.year == n_year - 1]
            
        for w in range(0, len(weeks_list)):
            games_df = schedule_df_1[schedule_df_1.week == weeks_list[w]]
            games_df = games_df.drop('year', axis = 1)

            if w == 0:
                aggregate_weekly_df = weeksgames_df_0[weeksgames_df_0.week >= weeks_list[w]].drop(columns = [
                    'score','week','game_won', 'game_lost']).groupby(by=["team_name", "team_abbr"]).mean().reset_index()      
                win_loss_df = weeksgames_df_0[weeksgames_df_0.week >= weeks_list[w]][["team_name",
                    "team_abbr",'game_won', 'game_lost']].groupby(by=["team_name", "team_abbr"]).sum().reset_index()
                win_loss_df['win_perc'] = win_loss_df['game_won'] / (win_loss_df['game_won'] + win_loss_df['game_lost'])
                win_loss_df = win_loss_df.drop(columns = ['game_won', 'game_lost'])
                try:
                    aggregate_weekly_df['fourth_down_perc'] = aggregate_weekly_df[
                        'fourth_down_conversions'] / agg_weekly_df['fourth_down_attempts']
                except:
                    aggregate_weekly_df['fourth_down_perc'] = 0
                aggregate_weekly_df['fourth_down_perc'] = aggregate_weekly_df['fourth_down_perc'].fillna(0)
                
                try:
                    aggregate_weekly_df['third_down_perc'] = aggregate_weekly_df[
                        'third_down_conversions'] / aggregate_weekly_df['third_down_attempts']
                except:
                    aggregate_weekly_df['third_down_perc'] = 0
                aggregate_weekly_df['third_down_perc'] = aggregate_weekly_df['third_down_perc'].fillna(0)
                
                aggregate_weekly_df = aggregate_weekly_df.drop(columns = ['fourth_down_attempts',
                                                                          'fourth_down_conversions',
                                                                          'third_down_attempts',
                                                                          'third_down_conversions'])
                
            else:
                aggregate_weekly_df_1 = weeksgames_df_1[weeksgames_df_1.week < weeks_list[w]].drop(columns = [
                    'score','week','game_won', 'game_lost']).groupby(by=["team_name", "team_abbr"]).mean().reset_index()       
                win_loss_df_1 = weeksgames_df_1[weeksgames_df_1.week < weeks_list[w]][["team_name",
                    "team_abbr",'game_won', 'game_lost']].groupby(by=["team_name", "team_abbr"]).sum().reset_index()
                win_loss_df_1['win_perc'] = win_loss_df_1['game_won'] / (win_loss_df_1['game_won'] + win_loss_df_1[
                    'game_lost'])
                win_loss_df_1 = win_loss_df_1.drop(columns = ['game_won', 'game_lost'])
                
                try:
                    aggregate_weekly_df_1['fourth_down_perc'] = aggregate_weekly_df_1[
                        'fourth_down_conversions'] / aggregate_weekly_df_1['fourth_down_attempts']
                except:
                    aggregate_weekly_df_1['fourth_down_perc'] = 0
                aggregate_weekly_df_1['fourth_down_perc'] = aggregate_weekly_df_1['fourth_down_perc'].fillna(0)
                
                try:
                    aggregate_weekly_df_1['third_down_perc'] = aggregate_weekly_df_1[
                        'third_down_conversions'] / aggregate_weekly_df_1['third_down_attempts']
                except:
                    aggregate_weekly_df_1['third_down_perc'] = 0
                aggregate_weekly_df_1['third_down_perc'] = aggregate_weekly_df_1['third_down_perc'].fillna(0)
                
                aggregate_weekly_df_1 = aggregate_weekly_df_1.drop(columns = ['fourth_down_attempts',
                                                                              'fourth_down_conversions',
                                                                              'third_down_attempts',
                                                                              'third_down_conversions'])
                
                aggregate_weekly_df_0 = weeksgames_df_0[weeksgames_df_0.week >= weeks_list[w]].drop(columns = [
                    'score','week','game_won', 'game_lost']).groupby(by=["team_name", "team_abbr"]).mean().reset_index()      
                win_loss_df_0 = weeksgames_df_0[weeksgames_df_0.week >= weeks_list[w]][["team_name",
                    "team_abbr",'game_won', 'game_lost']].groupby(by=["team_name", "team_abbr"]).sum().reset_index()
                win_loss_df_0['win_perc'] = win_loss_df_0['game_won'] / (win_loss_df_0['game_won'] + win_loss_df_0[
                    'game_lost'])
                win_loss_df_0 = win_loss_df_0.drop(columns = ['game_won', 'game_lost'])
                
                try:
                    aggregate_weekly_df_0['fourth_down_perc'] = aggregate_weekly_df_0[
                        'fourth_down_conversions'] / aggregate_weekly_df_0['fourth_down_attempts']
                except:
                    aggregate_weekly_df_0['fourth_down_perc'] = 0
                aggregate_weekly_df_0['fourth_down_perc'] = aggregate_weekly_df_0['fourth_down_perc'].fillna(0)

                try:
                    aggregate_weekly_df_0['third_down_perc'] = aggregate_weekly_df_0[
                        'third_down_conversions'] / aggregate_weekly_df_0['third_down_attempts']
                except:
                    aggregate_weekly_df_0['third_down_perc'] = 0
                aggregate_weekly_df_0['third_down_perc'] = aggregate_weekly_df_0['third_down_perc'].fillna(0)
                
                aggregate_weekly_df_0 = aggregate_weekly_df_0.drop(columns = ['fourth_down_attempts',
                                                                              'fourth_down_conversions',
                                                                              'third_down_attempts',
                                                                              'third_down_conversions'])
                
                name_abb_df = aggregate_weekly_df_1[['team_name', 'team_abbr']]
                aggregate_weekly_df = aggregate_weekly_df_1.select_dtypes(exclude=['object',
                    'datetime']) * (1-m) + aggregate_weekly_df_0.select_dtypes(exclude=['object', 'datetime']) * m
                win_loss_df = win_loss_df_1.select_dtypes(exclude=['object',
                    'datetime']) * (1-m) + win_loss_df_0.select_dtypes(exclude=['object', 'datetime']) * m
                
                aggregate_weekly_df = pd.concat([name_abb_df, aggregate_weekly_df], axis=1)
                win_loss_df = pd.concat([name_abb_df, win_loss_df], axis=1)
            
            aggregate_weekly_df = aggregate_weekly_df.drop('year', axis=1)
            aggregate_weekly_df = pd.merge(win_loss_df,aggregate_weekly_df,left_on = ['team_name',
                'team_abbr'], right_on = ['team_name', 'team_abbr'])
            away_df = pd.merge(games_df,aggregate_weekly_df,how = 'inner', left_on = ['away_name',
                'away_abbr'], right_on = ['team_name', 'team_abbr']).drop(columns = ['team_name', 'team_abbr'])
            
            list_column_change = list(away_df.columns[7:])
            list_new_away = column_name_manipulation_reverse(list_column_change, 'away')
            list_new_home = column_name_manipulation_reverse(list_column_change, 'home')

            away_df.columns = list(away_df.columns)[:7] + list_new_away
            
            home_df = pd.merge(games_df,aggregate_weekly_df,how = 'inner', left_on = ['home_name',
                'home_abbr'], right_on = ['team_name', 'team_abbr']).drop(columns = ['team_name', 'team_abbr'])
            home_df.columns = list(home_df.columns)[:7] + list_new_home
            
            aggregate_weekly_df = pd.merge(away_df,home_df,left_on = ['away_name', 'away_abbr', 'home_name',
                'home_abbr', 'winning_name', 'winning_abbr', 'week'], right_on = ['away_name', 'away_abbr',
                'home_name', 'home_abbr', 'winning_name', 'winning_abbr', 'week'])
            
            for n in range(len(list_column_change)):
                column_new = list_column_change[n] + '_' + 'dif'
                aggregate_weekly_df[column_new] = aggregate_weekly_df[list_new_away[n]] - aggregate_weekly_df[
                    list_new_home[n]]
                aggregate_weekly_df[column_new] = aggregate_weekly_df[column_new].fillna(0)
            
            aggregate_weekly_df = aggregate_weekly_df.drop(columns = list_new_away + list_new_home + [
                'fumbles_lost_dif'])
            
            if (aggregate_weekly_df['winning_name'].isnull().values.any() and n_year == current_y and weeks_list[w
                ] == current_w):
                conditions = [aggregate_weekly_df['winning_name'] == aggregate_weekly_df['away_name'],
                              aggregate_weekly_df['winning_name'] == aggregate_weekly_df['home_name']]
                choices = [1,0]
                aggregate_weekly_df['result'] = np.select(conditions,choices,default=np.nan)
                    
            elif aggregate_weekly_df['winning_name'].isnull().values.any():
                aggregate_weekly_df = aggregate_weekly_df.dropna()
                aggregate_weekly_df['result'] = (aggregate_weekly_df['winning_name'] == aggregate_weekly_df['away_name'])
                aggregate_weekly_df['result'] = aggregate_weekly_df['result'].astype('int')
                
            else:
                aggregate_weekly_df['result'] = (aggregate_weekly_df['winning_name'] == aggregate_weekly_df['away_name'])
                aggregate_weekly_df['result'] = aggregate_weekly_df['result'].astype('int')
            
            aggregate_weekly_df = aggregate_weekly_df.drop(columns = ['winning_name', 'winning_abbr'])
            aggregate_weekly_df['year'] = [n_year for n in range(len(aggregate_weekly_df))]
            
            new_columns_arrangement = list(aggregate_weekly_df.columns)[:5] + [
                list(aggregate_weekly_df.columns)[-1]] + list(aggregate_weekly_df.columns)[5:-1]
            aggregate_weekly_df = aggregate_weekly_df.reindex(columns=new_columns_arrangement)
            aggregate_weekly_df['year'] = aggregate_weekly_df.year.astype('int64')
            aggregate_games_df = pd.concat([aggregate_games_df, aggregate_weekly_df], axis = 0)
            aggregate_games_df = aggregate_games_df.reset_index().drop(columns = 'index')
            
    return aggregate_games_df

#### Transformation and feature engineering in the aggregate_weekly_data function
1. With this function, we create new columns and in the process of that, we also drop the unnecessary columns
2. We create 'win_perc' column from 'game_won' and 'game_lost' columns
3. We create 'fourth_down_perc' column from 'fourth_down_attempts' and 'fourth_down_conversions' columns
4. Similarly, we create 'third_down_perc' column from 'third_down_attempts' and 'third_down_conversions' columns
5. We create the target variable called 'result' column, which is coded to 1 if away team wins and coded to 0 if home team wins

#### Calling the aggregate_weekly_data function
1. We rename the full_schedule df to schedule_df
2. We rename the full_game_data df to weeksgames_df
3. We'll have a quick look on the data information and descriptive stats

In [25]:
schedule_df = full_schedule
weeksgames_df = full_game_data
current_w=6
current_y=2022
weeks_list = list(range(1,19))

In [None]:
schedule_df

In [26]:
# Calling the aggregate games function
# Value of 'm' is taken to be '0.1' as from analyses (trial and error), we determined that it gives us highest accuracy
aggregate_games_df = aggregate_weekly_data(schedule_df, weeksgames_df, current_w, current_y, weeks_list, 0.1)

In [23]:
pd.set_option("display.max_rows", None, "display.max_columns", None)

In [None]:
aggregate_games_df= aggregate_games_df.drop_duplicates(drop = True)

In [None]:
aggregate_games_df_test = aggregate_games_df.loc[aggregate_games_df.week != 18]
aggregate_games_df_test = aggregate_games_df_test.loc[aggregate_games_df_test.year == 2021]
len(aggregate_games_df_test)

In [98]:
aggregate_games_df= aggregate_games_df.loc[aggregate_games_df.week != 17]

In [26]:
aggregate_games_df.tail(60)

Unnamed: 0,away_name,away_abbr,home_name,home_abbr,week,year,win_perc_dif,first_downs_dif,fumbles_dif,interceptions_dif,net_pass_yards_dif,pass_attempts_dif,pass_completions_dif,pass_touchdowns_dif,pass_yards_dif,penalties_dif,points_dif,rush_attempts_dif,rush_touchdowns_dif,rush_yards_dif,time_of_possession_dif,times_sacked_dif,total_yards_dif,turnovers_dif,yards_from_penalties_dif,yards_lost_from_sacks_dif,fourth_down_perc_dif,third_down_perc_dif,result
2864,Cleveland Browns,cle,Carolina Panthers,car,1,2022,0.142857,0.964286,0.071429,-0.464286,3.857143,-4.892857,-1.892857,0.285714,4.107143,0.142857,2.25,1.571429,0.285714,38.678571,-5.571429,0.071429,42.535714,-0.464286,8.607143,0.25,0.0,0.028819,1.0
2865,San Francisco 49ers,sfo,Chicago Bears,chi,1,2022,0.222222,2.259259,-0.111111,-0.333333,67.296296,0.111111,1.592593,0.666667,53.185185,-0.37037,7.37037,1.074074,0.407407,2.185185,67.185185,-1.555556,69.481481,-0.185185,7.888889,-14.111111,0.0,0.052333,0.0
2866,Pittsburgh Steelers,pit,Cincinnati Bengals,cin,1,2022,-0.032593,-0.259259,0.074074,-0.333333,-28.148148,6.814815,2.851852,-0.777778,-36.740741,2.222222,-6.333333,-1.62963,-0.333333,-10.555556,-59.62963,-0.888889,-38.703704,-0.148148,12.555556,-8.592593,0.0,0.00318,1.0
2867,Philadelphia Eagles,phi,Detroit Lions,det,1,2022,0.38,1.271164,0.322751,-0.279101,-6.589947,-6.043651,-5.335979,0.030423,-14.65873,0.128307,8.501323,6.825397,0.722222,45.066138,15.720899,-0.511905,38.47619,-0.440476,-0.199735,-8.068783,0.0,0.12132,1.0
2868,New England Patriots,nwe,Miami Dolphins,mia,1,2022,0.142857,2.357143,-0.535714,-0.107143,8.321429,-5.357143,-2.321429,0.178571,6.5,-0.785714,7.857143,3.928571,0.642857,36.25,47.892857,-0.607143,44.428571,-0.25,1.285714,-1.821429,0.0,0.031912,0.0
2869,Baltimore Ravens,rav,New York Jets,nyj,1,2022,0.333333,4.851852,0.37037,-0.407407,12.888889,-1.333333,0.777778,0.0,8.148148,-0.518519,5.407407,9.037037,0.37037,56.777778,335.111111,0.407407,69.666667,-0.407407,-2.259259,-4.740741,0.0,-0.027672,1.0
2870,Jacksonville Jaguars,jax,Washington Commanders,was,1,2022,-0.222222,-2.62963,-0.259259,0.074074,-6.851852,2.888889,-0.185185,-0.592593,-11.148148,1.296296,-5.0,-5.148148,0.111111,-15.185185,-216.111111,-0.555556,-22.037037,0.259259,6.0,-4.296296,0.0,-0.046457,0.0
2871,Kansas City Chiefs,kan,Arizona Cardinals,crd,1,2022,-0.035714,3.142857,-0.392857,0.178571,31.964286,6.928571,2.785714,0.535714,20.964286,0.035714,0.071429,-4.714286,-0.571429,-8.678571,-37.357143,-0.607143,23.285714,0.75,-2.928571,-11.0,0.0,0.07296,1.0
2872,Green Bay Packers,gnb,Minnesota Vikings,min,1,2022,0.268519,1.457672,-0.187831,0.095238,-6.57672,-1.939153,-1.186508,0.141534,-5.321429,-2.492063,-0.292328,-1.0,0.15873,-5.343915,148.90873,0.334656,-11.920635,0.082011,-21.724868,1.255291,0.0,0.058314,0.0
2873,New York Giants,nyg,Tennessee Titans,oti,1,2022,-0.455026,-3.420635,0.084656,0.145503,-6.981481,2.939153,-0.087302,-0.359788,-10.633598,-0.694444,-8.801587,-7.67328,-0.837302,-39.609788,-274.0,-0.563492,-46.59127,0.12963,-15.417989,-3.652116,0.0,-0.056502,1.0


In [None]:
aggregate_games_df_null = aggregate_games_df[aggregate_games_df['result'].isnull()]
len(aggregate_games_df_null)

In [None]:
# Getting the descriptive stats for all feature columns
aggregate_games_df.iloc[:,6:-1].describe()

1. We can see that all the features of the final dataframe doesn't have any null/missing values
2. It means, the function works correctly to aggregate the values of the features (dependent variables)
3. However, the column result, which is the target variable, has 13 NaN values (13 rows)
4. These rows are actually rows of current week games that have not been played yet so there is no result for these games
5. We will need these rows with NaN result as our prediction dataset
6. From here, the final dataframe is good to go to machine learning application

### PRedict wk 1 2022

In [None]:
aggregate_games_df= pd.read_csv('agg_df_working_on_9_9_2022.csv')

In [None]:
aggregate_games_df = aggregate_games_df.loc[aggregate_games_df.week != 18]

In [None]:
aggregate_games_df = aggregate_games_df.loc[aggregate_games_df.year >= 2020]

In [None]:
aggregate_games_df= aggregate_games_df[aggregate_games_df.filter(regex='^(?!Unnamed)').columns]

In [None]:
aggregate_games_df.tail(50)

In [23]:
model_finding_df = aggregate_games_df[aggregate_games_df.result.notna()]


#model_finding_df = model_finding_df.drop(columns = ['away_name', 'away_abbr', 'home_name', 'home_abbr', 'week','year'])

prediction_df=aggregate_games_df[aggregate_games_df.result.isnull()]

In [None]:
model_finding_df

### Predict wk 2 2022

In [41]:
list1 = y_pred_prob.tolist()
list1 = [round(list1[x], 3) for x in range(len(list1))]
list1 = [float(list1[x]) for x in range(len(list1))]

In [37]:
model_finding_df = aggregate_games_df[aggregate_games_df.result.notna()]


#model_finding_df = model_finding_df.drop(columns = ['away_name', 'away_abbr', 'home_name', 'home_abbr', 'week','year'])

prediction_df=aggregate_games_df[aggregate_games_df.result.isnull()]

In [38]:
# STEP 1
X = model_finding_df.iloc[:, 6:-1].values
y = model_finding_df.iloc[:, -1].values

# STEP 2
from sklearn.model_selection import train_test_split
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# STEP 3
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_stdz = sc.fit_transform(X_train_raw)
X_test_stdz = sc.transform(X_test_raw)

from sklearn.decomposition import PCA
dimension = 7 
pca = PCA(n_components = dimension)
X_train = pca.fit_transform(X_train_stdz)
X_test = pca.transform(X_test_stdz)

# STEP 4
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [39]:
# Define and transform the dataset
X_pred_raw = prediction_df.iloc[:, 6:-1].values
X_pred_stdz = sc.transform(X_pred_raw)
X_pred = pca.transform(X_pred_stdz)

In [40]:
# Predict the results (winning probability)
y_pred_prob = classifier.predict_proba(X_pred)

y_pred_prob = y_pred_prob[:,1]

In [43]:
def display_prediction_for_current_week_games2(y_pred_prob, prediction_df):
    for t in range(len(y_pred_prob)):
        win_prob = round(y_pred_prob[t], 3)
        away_team = prediction_df.reset_index().drop(columns = 'index').loc[t,'away_name']
        home_team = prediction_df.reset_index().drop(columns = 'index').loc[t,'home_name']
        print('The {} are predicted to be {} against the spread versus the {}.'.format(away_team, win_prob, home_team))

In [44]:
display_prediction_for_current_week_games2(list2, prediction_df)

The Washington Commanders are predicted to be -0.5 against the spread versus the Chicago Bears.
The San Francisco 49ers are predicted to be -0.5 against the spread versus the Atlanta Falcons.
The New England Patriots are predicted to be 4.5 against the spread versus the Cleveland Browns.
The Jacksonville Jaguars are predicted to be -1 against the spread versus the Indianapolis Colts.
The New York Jets are predicted to be 6 against the spread versus the Green Bay Packers.
The Minnesota Vikings are predicted to be 2.5 against the spread versus the Miami Dolphins.
The Cincinnati Bengals are predicted to be -2.5 against the spread versus the New Orleans Saints.
The Baltimore Ravens are predicted to be -1.5 against the spread versus the New York Giants.
The Tampa Bay Buccaneers are predicted to be -3 against the spread versus the Pittsburgh Steelers.
The Carolina Panthers are predicted to be 6.5 against the spread versus the Los Angeles Rams.
The Arizona Cardinals are predicted to be 1.5 ag

In [38]:
display_prediction_for_current_week_games2(list2, prediction_df)

The Washington Commanders are predicted to be 1 against the spread versus the Chicago Bears.
The San Francisco 49ers are predicted to be 1.5 against the spread versus the Atlanta Falcons.
The New England Patriots are predicted to be 5.5 against the spread versus the Cleveland Browns.
The Jacksonville Jaguars are predicted to be -2.5 against the spread versus the Indianapolis Colts.
The New York Jets are predicted to be 6 against the spread versus the Green Bay Packers.
The Minnesota Vikings are predicted to be 1.5 against the spread versus the Miami Dolphins.
The Cincinnati Bengals are predicted to be -2.5 against the spread versus the New Orleans Saints.
The Baltimore Ravens are predicted to be -2.5 against the spread versus the New York Giants.
The Tampa Bay Buccaneers are predicted to be -3 against the spread versus the Pittsburgh Steelers.
The Carolina Panthers are predicted to be 6.5 against the spread versus the Los Angeles Rams.
The Arizona Cardinals are predicted to be -0.5 aga

In [83]:
display_prediction_for_current_week_games2(list2, prediction_df)

The Indianapolis Colts are predicted to be 4.5 against the spread versus the Denver Broncos.
The New York Giants are predicted to be 4.5 against the spread versus the Green Bay Packers.
The Pittsburgh Steelers are predicted to be 8 against the spread versus the Buffalo Bills.
The Los Angeles Chargers are predicted to be 3 against the spread versus the Cleveland Browns.
The Houston Texans are predicted to be 6 against the spread versus the Jacksonville Jaguars.
The Chicago Bears are predicted to be 7 against the spread versus the Minnesota Vikings.
The Seattle Seahawks are predicted to be -2.5 against the spread versus the New Orleans Saints.
The Detroit Lions are predicted to be -2.5 against the spread versus the New England Patriots.
The Miami Dolphins are predicted to be -3.5 against the spread versus the New York Jets.
The Atlanta Falcons are predicted to be 6 against the spread versus the Tampa Bay Buccaneers.
The Tennessee Titans are predicted to be 3 against the spread versus the

In [44]:
display_prediction_for_current_week_games2(list2, prediction_df)

The Indianapolis Colts are predicted to be 4.5 against the spread versus the Denver Broncos.
The New York Giants are predicted to be 4 against the spread versus the Green Bay Packers.
The Pittsburgh Steelers are predicted to be 7 against the spread versus the Buffalo Bills.
The Los Angeles Chargers are predicted to be 3 against the spread versus the Cleveland Browns.
The Houston Texans are predicted to be 6 against the spread versus the Jacksonville Jaguars.
The Chicago Bears are predicted to be 7.5 against the spread versus the Minnesota Vikings.
The Seattle Seahawks are predicted to be -3 against the spread versus the New Orleans Saints.
The Detroit Lions are predicted to be -2.5 against the spread versus the New England Patriots.
The Miami Dolphins are predicted to be -3.5 against the spread versus the New York Jets.
The Atlanta Falcons are predicted to be 4.5 against the spread versus the Tampa Bay Buccaneers.
The Tennessee Titans are predicted to be 1.5 against the spread versus t

In [94]:
display_prediction_for_current_week_games2(list2, prediction_df)

The Miami Dolphins are predicted to be -3 against the spread versus the Cincinnati Bengals.
The Minnesota Vikings are predicted to be -3.5 against the spread versus the New Orleans Saints.
The Cleveland Browns are predicted to be -5.5 against the spread versus the Atlanta Falcons.
The Tennessee Titans are predicted to be -0.5 against the spread versus the Indianapolis Colts.
The Washington Commanders are predicted to be 3.5 against the spread versus the Dallas Cowboys.
The Seattle Seahawks are predicted to be 6 against the spread versus the Detroit Lions.
The Los Angeles Chargers are predicted to be -3.5 against the spread versus the Houston Texans.
The Chicago Bears are predicted to be 4 against the spread versus the New York Giants.
The Jacksonville Jaguars are predicted to be 3.5 against the spread versus the Philadelphia Eagles.
The New York Jets are predicted to be 3.5 against the spread versus the Pittsburgh Steelers.
The Buffalo Bills are predicted to be -1.5 against the spread 

In [51]:
display_prediction_for_current_week_games2(list2, prediction_df)

The Miami Dolphins are predicted to be -2.5 against the spread versus the Cincinnati Bengals.
The Minnesota Vikings are predicted to be -3 against the spread versus the New Orleans Saints.
The Cleveland Browns are predicted to be -3 against the spread versus the Atlanta Falcons.
The Tennessee Titans are predicted to be 2.5 against the spread versus the Indianapolis Colts.
The Washington Commanders are predicted to be -1 against the spread versus the Dallas Cowboys.
The Seattle Seahawks are predicted to be 6 against the spread versus the Detroit Lions.
The Los Angeles Chargers are predicted to be -3 against the spread versus the Houston Texans.
The Chicago Bears are predicted to be 7 against the spread versus the New York Giants.
The Jacksonville Jaguars are predicted to be 3.5 against the spread versus the Philadelphia Eagles.
The New York Jets are predicted to be 3.5 against the spread versus the Pittsburgh Steelers.
The Buffalo Bills are predicted to be -2.5 against the spread versus

In [34]:
display_prediction_for_current_week_games2(list2, prediction_df)

The Miami Dolphins are predicted to be -2.5 against the spread versus the Cincinnati Bengals.
The Minnesota Vikings are predicted to be -3 against the spread versus the New Orleans Saints.
The Cleveland Browns are predicted to be -3 against the spread versus the Atlanta Falcons.
The Tennessee Titans are predicted to be 2.5 against the spread versus the Indianapolis Colts.
The Washington Commanders are predicted to be -1 against the spread versus the Dallas Cowboys.
The Seattle Seahawks are predicted to be 7.5 against the spread versus the Detroit Lions.
The Los Angeles Chargers are predicted to be -3 against the spread versus the Houston Texans.
The Chicago Bears are predicted to be 6.5 against the spread versus the New York Giants.
The Jacksonville Jaguars are predicted to be 3.5 against the spread versus the Philadelphia Eagles.
The New York Jets are predicted to be 3 against the spread versus the Pittsburgh Steelers.
The Buffalo Bills are predicted to be 2 against the spread versus 

In [44]:
display_prediction_for_current_week_games2(list2, prediction_df)

The Pittsburgh Steelers are predicted to be 6 against the spread versus the Cleveland Browns.
The New Orleans Saints are predicted to be 3 against the spread versus the Carolina Panthers.
The Houston Texans are predicted to be 2.5 against the spread versus the Chicago Bears.
The Kansas City Chiefs are predicted to be -12.5 against the spread versus the Indianapolis Colts.
The Buffalo Bills are predicted to be 3 against the spread versus the Miami Dolphins.
The Detroit Lions are predicted to be 1.5 against the spread versus the Minnesota Vikings.
The Baltimore Ravens are predicted to be -3 against the spread versus the New England Patriots.
The Cincinnati Bengals are predicted to be 4 against the spread versus the New York Jets.
The Las Vegas Raiders are predicted to be 3 against the spread versus the Tennessee Titans.
The Philadelphia Eagles are predicted to be 2 against the spread versus the Washington Commanders.
The Jacksonville Jaguars are predicted to be 7 against the spread versu

In [106]:
display_prediction_for_current_week_games2(list2, prediction_df)

The Pittsburgh Steelers are predicted to be 7 against the spread versus the Cleveland Browns.
The New Orleans Saints are predicted to be -1 against the spread versus the Carolina Panthers.
The Houston Texans are predicted to be 3 against the spread versus the Chicago Bears.
The Kansas City Chiefs are predicted to be -7.5 against the spread versus the Indianapolis Colts.
The Buffalo Bills are predicted to be 5 against the spread versus the Miami Dolphins.
The Detroit Lions are predicted to be 6.5 against the spread versus the Minnesota Vikings.
The Baltimore Ravens are predicted to be -3 against the spread versus the New England Patriots.
The Cincinnati Bengals are predicted to be 4 against the spread versus the New York Jets.
The Las Vegas Raiders are predicted to be 7 against the spread versus the Tennessee Titans.
The Philadelphia Eagles are predicted to be -0.5 against the spread versus the Washington Commanders.
The Jacksonville Jaguars are predicted to be 6.5 against the spread ve

In [97]:
display_prediction_for_current_week_games2(list2, prediction_df)

The Pittsburgh Steelers are predicted to be 7 against the spread versus the Cleveland Browns.
The New Orleans Saints are predicted to be 1 against the spread versus the Carolina Panthers.
The Houston Texans are predicted to be 3 against the spread versus the Chicago Bears.
The Kansas City Chiefs are predicted to be -9.5 against the spread versus the Indianapolis Colts.
The Buffalo Bills are predicted to be 5 against the spread versus the Miami Dolphins.
The Detroit Lions are predicted to be 6 against the spread versus the Minnesota Vikings.
The Baltimore Ravens are predicted to be -3 against the spread versus the New England Patriots.
The Cincinnati Bengals are predicted to be 4 against the spread versus the New York Jets.
The Las Vegas Raiders are predicted to be 7 against the spread versus the Tennessee Titans.
The Philadelphia Eagles are predicted to be -2.5 against the spread versus the Washington Commanders.
The Jacksonville Jaguars are predicted to be 7 against the spread versus 

In [89]:
display_prediction_for_current_week_games2(list2, prediction_df)

The Pittsburgh Steelers are predicted to be 7 against the spread versus the Cleveland Browns.
The New Orleans Saints are predicted to be -0.5 against the spread versus the Carolina Panthers.
The Houston Texans are predicted to be 2.5 against the spread versus the Chicago Bears.
The Kansas City Chiefs are predicted to be -8 against the spread versus the Indianapolis Colts.
The Buffalo Bills are predicted to be 5.5 against the spread versus the Miami Dolphins.
The Detroit Lions are predicted to be 5 against the spread versus the Minnesota Vikings.
The Baltimore Ravens are predicted to be -3 against the spread versus the New England Patriots.
The Cincinnati Bengals are predicted to be 3.5 against the spread versus the New York Jets.
The Las Vegas Raiders are predicted to be 6 against the spread versus the Tennessee Titans.
The Philadelphia Eagles are predicted to be -2.5 against the spread versus the Washington Commanders.
The Jacksonville Jaguars are predicted to be 7 against the spread 

In [30]:
display_prediction_for_current_week_games2(list2, prediction_df)

The Los Angeles Chargers are predicted to be 6 against the spread versus the Kansas City Chiefs.
The New York Jets are predicted to be 10.5 against the spread versus the Cleveland Browns.
The Washington Commanders are predicted to be -0.5 against the spread versus the Detroit Lions.
The Indianapolis Colts are predicted to be -2.5 against the spread versus the Jacksonville Jaguars.
The Tampa Bay Buccaneers are predicted to be -2 against the spread versus the New Orleans Saints.
The Carolina Panthers are predicted to be 8.5 against the spread versus the New York Giants.
The New England Patriots are predicted to be 7 against the spread versus the Pittsburgh Steelers.
The Miami Dolphins are predicted to be 1.5 against the spread versus the Baltimore Ravens.
The Atlanta Falcons are predicted to be -6 against the spread versus the Los Angeles Rams.
The Seattle Seahawks are predicted to be -2.5 against the spread versus the San Francisco 49ers.
The Cincinnati Bengals are predicted to be -2 ag

### WITH ELO

In [31]:
def get_elo():
    elo_df = pd.read_csv('nfl_elo3.csv')
    elo_df = elo_df.drop(columns = ['neutral' ,'playoff', 'elo_prob1', 'elo_prob2', 'elo1_post', 'elo2_post',
           'qbelo1_pre', 'qbelo2_pre', 'qb1', 'qb2', 'qb1_adj', 'qb2_adj', 'qbelo_prob1', 'qbelo_prob2',
           'qb1_game_value', 'qb2_game_value', 'qb1_value_post', 'qb2_value_post',
           'qbelo1_post', 'qbelo2_post', 'score1', 'score2'])
    elo_df.date = pd.to_datetime(elo_df.date)
    elo_df = elo_df[elo_df.date >= '09-8-2011']
    
    elo_df['team1'] = elo_df['team1'].replace(['KC', 'JAX', 'CAR', 'BAL', 'BUF', 'MIN', 'DET', 'ATL', 'NE', 'WSH',
           'CIN', 'NO', 'SF', 'LAR', 'NYG', 'DEN', 'CLE', 'IND', 'TEN', 'NYJ',
           'TB', 'MIA', 'PIT', 'PHI', 'GB', 'CHI', 'DAL', 'ARI', 'LAC', 'HOU',
           'SEA', 'OAK'],
            ['kan','jax','car', 'rav', 'buf', 'min', 'det', 'atl', 'nwe', 'was', 
            'cin', 'nor', 'sfo', 'ram', 'nyg', 'den', 'cle', 'clt', 'oti', 'nyj', 
             'tam','mia', 'pit', 'phi', 'gnb', 'chi', 'dal', 'crd', 'sdg', 'htx', 'sea', 'rai' ])
    elo_df['team2'] = elo_df['team2'].replace(['KC', 'JAX', 'CAR', 'BAL', 'BUF', 'MIN', 'DET', 'ATL', 'NE', 'WSH',
           'CIN', 'NO', 'SF', 'LAR', 'NYG', 'DEN', 'CLE', 'IND', 'TEN', 'NYJ',
           'TB', 'MIA', 'PIT', 'PHI', 'GB', 'CHI', 'DAL', 'ARI', 'LAC', 'HOU',
           'SEA', 'OAK'],
            ['kan','jax','car', 'rav', 'buf', 'min', 'det', 'atl', 'nwe', 'was', 
            'cin', 'nor', 'sfo', 'ram', 'nyg', 'den', 'cle', 'clt', 'oti', 'nyj', 
             'tam','mia', 'pit', 'phi', 'gnb', 'chi', 'dal', 'crd', 'sdg', 'htx', 'sea', 'rai' ])
    return elo_df

In [32]:
elo_df = get_elo()

In [33]:
def merge_rankings(agg_games_df,elo_df):
    agg_games_df = pd.merge(agg_games_df, elo_df, how = 'inner', left_on = ['home_abbr', 'away_abbr','year'], right_on = ['team1', 'team2','season']).drop(columns = ['date','team1', 'team2','season'])
    agg_games_df['elo_dif'] = agg_games_df['elo2_pre'] - agg_games_df['elo1_pre']
    agg_games_df['qb_dif'] = agg_games_df['qb2_value_pre'] - agg_games_df['qb1_value_pre']
    agg_games_df = agg_games_df.drop(columns = ['elo1_pre', 'elo2_pre', 'qb1_value_pre', 'qb2_value_pre','quality','importance','total_rating'])
    return agg_games_df

In [34]:
aggregate_games_df = merge_rankings(aggregate_games_df, elo_df)

In [34]:
aggregate_games_df.tail(10)

Unnamed: 0,away_name,away_abbr,home_name,home_abbr,week,year,win_perc_dif,first_downs_dif,fumbles_dif,interceptions_dif,net_pass_yards_dif,pass_attempts_dif,pass_completions_dif,pass_touchdowns_dif,pass_yards_dif,penalties_dif,points_dif,rush_attempts_dif,rush_touchdowns_dif,rush_yards_dif,time_of_possession_dif,times_sacked_dif,total_yards_dif,turnovers_dif,yards_from_penalties_dif,yards_lost_from_sacks_dif,fourth_down_perc_dif,third_down_perc_dif,result,elo_dif,qb_dif
2717,Seattle Seahawks,sea,New Orleans Saints,nor,5,2022,0.203947,1.835526,-1.146053,-0.460526,-8.902632,-2.639474,1.993421,0.182895,-14.614474,-0.518421,4.043421,-0.138158,-0.209211,3.419737,-42.539474,-1.485526,-5.482895,-1.156579,-3.540789,-5.711842,-0.027536,0.188617,,-19.426865,11.351064
2718,Detroit Lions,det,New England Patriots,nwe,5,2022,-0.057353,3.142895,-0.915789,-0.420263,50.279474,8.1,3.368421,1.729474,43.917895,0.792368,13.45,-1.118158,0.363947,29.192632,-103.034737,-0.823684,79.472105,-1.108158,6.729737,-6.361579,0.292112,-0.024712,,-115.942036,13.73746
2719,Miami Dolphins,mia,New York Jets,nyj,5,2022,0.258947,-0.466579,-1.054211,-0.240789,13.552632,-12.020789,-3.692632,0.452895,12.27,-0.666053,4.966316,-0.959737,0.195263,-17.165263,-20.398947,-0.722632,-3.612632,-1.138947,-27.914474,-1.282632,0.181429,0.046148,,222.091855,79.409017
2720,Atlanta Falcons,atl,Tampa Bay Buccaneers,tam,5,2022,-0.026316,1.372368,0.55,0.706579,-72.143421,-13.777632,-11.788158,-0.848684,-70.469737,-1.831579,3.456579,10.731579,1.088158,90.038158,50.925,0.115789,17.894737,0.518421,-21.163158,1.673684,-0.012061,0.022782,,-237.958405,-87.478466
2721,Tennessee Titans,oti,Washington Commanders,was,5,2022,0.268158,-3.490263,0.191842,-0.444737,-40.154737,-15.412368,-8.851842,-0.650263,-54.761053,1.542895,1.234211,2.429211,0.521316,1.171579,-228.932105,-2.335263,-38.983158,-0.237632,1.608947,-14.606316,0.329174,-0.051791,,75.460251,-10.407719
2722,San Francisco 49ers,sfo,Carolina Panthers,car,5,2022,0.112895,4.202105,-0.154211,-0.125789,13.472105,-2.612368,-0.599737,-0.238421,6.624474,0.724474,-2.714211,11.206842,0.501316,51.073158,448.625263,-0.455,64.545263,0.096842,-6.730526,-6.847632,0.000641,0.097766,,180.22736,45.79022
2723,Philadelphia Eagles,phi,Arizona Cardinals,crd,5,2022,0.45,1.95,-0.765,0.0,23.18,-12.92,-8.215,-0.275,18.655,-1.065,6.445,10.73,1.4,53.34,3.92,0.38,76.52,0.015,-21.01,-4.525,0.013763,0.112547,,39.901575,0.725603
2724,Dallas Cowboys,dal,Los Angeles Rams,ram,5,2022,0.075,-1.796053,-0.559211,-1.317105,-20.335526,-0.851316,-4.921053,-0.305263,-30.635526,2.438158,-1.956579,4.447368,-0.444737,29.621053,-0.697368,-1.360526,9.285526,-1.634211,11.268421,-10.3,0.064985,-0.161206,,-40.551845,-118.120859
2725,Cincinnati Bengals,cin,Baltimore Ravens,rav,5,2022,0.005263,1.860526,0.476316,-0.021053,39.463158,8.689474,5.467105,-0.606579,55.293421,0.893421,-5.668421,2.257895,-0.428947,-50.839474,255.778947,1.757895,-11.376316,0.203947,10.178947,15.830263,-0.060516,0.060524,,15.886173,-52.835507
2726,Las Vegas Raiders,rai,Kansas City Chiefs,kan,5,2022,-0.477632,-2.298421,-0.232632,0.459474,-24.537368,0.989211,-0.752632,-1.173947,-16.810789,0.946579,-7.842632,-3.303947,-0.455789,-4.793684,-56.373158,1.135,-29.331053,0.217895,-0.985789,7.726579,0.274615,-0.118096,,-204.663389,-137.895492


In [35]:
df1 = aggregate_games_df.pop('result')

In [36]:
aggregate_games_df['result']=df1

## 4. Model Selection

### 4.1. Checking the balance of the data and existence of multicolinearity

#### Preparing the data before applying the machine learning (ML) models
1. Before going to the steps of building model, we have to make sure our data is ready
2. We split the data into two datasets: data for the model (that later will be split again into test and train dataset), and data to be predicted
3. We'll plot the histogram of each dependent variables to see the overall distribution of each feature
4. We'll check the balance of the target variable
5. We'll confirm again the existence of multicolinearity

In [None]:
pd.set_option("display.max_rows", None, "display.max_columns", None)
#model_finding_df = aggregate_games_df
aggregate_games_df['result'] = np.where(aggregate_games_df.year ==2021, np.nan, aggregate_games_df['result'])

In [None]:
model_finding_df.tail(500)

In [57]:
pd.set_option("display.max_rows", None, "display.max_columns", None)

In [None]:
# We first take the dataframe where the results are not null (NaN): model_finding_df
# In other words, we only use the dataframe with completed games
model_finding_df = model_finding_df[model_finding_df.result.notna()]


# Then we select the dataset to be predicted by the best model (games that have not been played): prediction_df
prediction_df_all = aggregate_games_df
prediction_df_all = prediction_df_all[prediction_df_all.year ==2021]
prediction_df_all = prediction_df_all.loc[prediction_df_all.week != 18]
prediction_df_all['result']=np.nan



In [None]:
model_finding_df.tail(100)

In [None]:
# Creating histogram of the features to visualize the distribution of the data
def create_histograms(model_finding_df, no_of_cols):
    features_df = model_finding_df.iloc[:,6:-1]
    no_of_rows = (len(features_df.columns)//no_of_cols)+1
    fig = plt.figure(figsize=(20,25))
    for i, col in enumerate(features_df.columns):
        ax = fig.add_subplot(no_of_rows,no_of_cols, i+1)
        features_df[col].hist(bins=50, ax=ax, facecolor='midnightblue', grid=False)
        ax.set_title('Distribution of '+col, color='firebrick')
        ax.set_ylabel('Counts')
    fig.tight_layout()  
    plt.show()

create_histograms(model_finding_df, 4)

1. We can see that almost all the data are looking like normal distribution
2. There are some extreme values, but since these numbers are game stats (facts from every game), we believe that we should include all the data to build the models
3. We'll now check the balance of the data because we are building a classification model

In [None]:
# Checking the balance of the target variable
def create_result_frequency(model_finding_df):
    import seaborn as sns
    fig,axes=plt.subplots(1 , 2, figsize=(12,6), dpi=80)
    model_finding_df['result'].value_counts().plot.pie(explode=[0,0.1], autopct='%1.1f%%', ax=axes[0], colormap='Paired')
    axes[0].set_title('Result Frequency (in Percentage)')
    axes[0].set_ylabel('')
    sns.countplot('result', data=model_finding_df, ax=axes[1], palette=['#A6CEE3', '#B15928'])
    axes[1].set_title('Result Frequency')
    axes[1].set_ylabel('Counts')
    axes[1].set_xlabel('Target')
    axes[1].set_xticklabels(['0.0: Home Team Wins','1.0: Away Team Wins'])
    plt.show()

create_result_frequency(model_finding_df)

1. The ratio of the target variable (result) is 55.6 : 44.4
2. We think this is a balanced dataset
2. Since it's balanced, we can proceed to check the multicolinearity

In [None]:
# Checking the correlation matrix
def create_correlation_matrix(model_finding_df):
    features_df = model_finding_df.iloc[:, 6:-1]
    import seaborn as sns
    fig, ax = plt.subplots(figsize=(10, 8), dpi=80)  
    corr_mat = features_df.corr()
    sns.heatmap(corr_mat[(corr_mat >= 0.5) | (corr_mat <= -0.5)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True)
    ax.set_title('Features Matrix Correlation', fontsize=15)

create_correlation_matrix(model_finding_df)

1. We can see that there are some features that are highly correlated with each other
2. We will perform PCA to reduce the dimensions (data reductions) to handle this

### 4.2. Building the ML models

#### Steps to build the classification model
1. Separate the data into independent variables (X) and dependent variables (y). Here, 'result' is the dependent variable and rest are independent variables.
2. Splitting the data into train set and test set. Model will learn from the train set and its performance and effectiveness will be tested on the test set.
3. Using StandardScaler to standardise the values corresponding to each independent variable and bring them all in a particular range. We'll perform PCA inside this step too.
4. Importing the model and training the model on the test set
5. Testing the model built on the test set
6. We will be using cross validation to test the ability of our machine learning model to predict new data. It can also help to flag problems like overfitting or selection bias and gives insights on how the model will generalize to an independent dataset.
7. Printing the report (overall effectiveness) including recall, precision, etc of the model

In [None]:
model_finding_df

In [61]:
# STEP 1
# Separating the model_finding_df to X and y

X = model_finding_df.iloc[:, 6:-1].values 
y = model_finding_df.iloc[:, -1].values

In [62]:
# STEP 2 
# Splitting the data into train and test

from sklearn.model_selection import train_test_split
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [63]:
# STEP 3
# Standardizing the data and performing PCA

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_stdz = sc.fit_transform(X_train_raw)
X_test_stdz = sc.transform(X_test_raw)

print(X_train_stdz.shape)
print(X_test_stdz.shape)

(2113, 22)
(705, 22)


In [64]:
# Performing PCA

from sklearn.decomposition import PCA
dimension = 7 
pca = PCA(n_components = dimension)
X_train = pca.fit_transform(X_train_stdz)
X_test = pca.transform(X_test_stdz)
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
print('Total variance explained: {}'.format(pca.explained_variance_ratio_.sum()))

Explained variation per principal component: [0.29858815 0.18880579 0.08959149 0.08300362 0.06843132 0.04508895
 0.04264432]
Total variance explained: 0.8161536364562751


In [None]:
# Checking the eigenvalues

eigenvalues = pca.explained_variance_
eigenvalues

1. The number of components that we have chosen is 7
2. The reason is that the eigenvalues for 7 components are all > 0.9 (which is close to 1.0)
3. Also, we want to make sure that the total variance explained is at least 80%

In [None]:
# Defining a function to highlight large correlation in the factor loadings table
def highlight_background(val):
    threshold = 0.35
    color = ''
    if (val > threshold) or (val < -1*threshold):
        color = 'wheat'
    return 'background-color: %s' % color

def highlight_font(val):
    threshold = 0.35
    color = ''
    if (val > threshold) or (val < -1*threshold):
        if val > 0:
            color = 'royalblue'
        else:
            color = 'firebrick'
    return 'color: %s' % color

In [None]:
# Using the functions above, we can check which variables explain each component
# Usually, correlation larger than 0.3 (absolute value) is large enough, but we use 0.35 as our threshold
# Column = PCA, Row = Original features

components_df = pd.DataFrame(pca.components_, index=[
    'PCA%i' % i for i in range(dimension)])
components_df = components_df.T.set_index(model_finding_df.iloc[:, 6:-1].columns)
components_style = components_df.style.applymap(highlight_background).applymap(highlight_font)
components_style

1. Component 1 (PCA0): total_yards (+)
2. Component 2 (PCA1): pass_attempts (-), rush_attempts (+), rush_yards (+)
3. Component 3 (PCA2): penalties (-), yards_from_penalties (-)
4. Component 4 (PCA3): fumbles (-), interceptions (-), turnovers (-)
5. Component 5 (PCA4): times_sacked (+), yards_lost_from_sacks (+)
6. Component 6 (PCA5): fourth_down_perc (+)
7. Component 7 (PCA6): rush_touchdowns (+), time_of_possession (-)

In [None]:
# Creating the dataframe for the PCA of training dataset to check the matrix correlation again

pca_X_train_df = pd.DataFrame(data = X_train, columns=[
    'PCA%i' % i for i in range(dimension)])

In [None]:
# Creating the correlation matrix for the principal components
def create_pca_correlation_matrix(pca_X_train_df):
    import seaborn as sns
    fig, ax = plt.subplots(figsize=(10, 8), dpi=80)  
    corr_mat = pca_X_train_df.corr()
    sns.heatmap(corr_mat, mask=np.zeros_like(corr_mat, dtype=np.bool),
                cmap='viridis', square=True, ax=ax)
    ax.set_title('Components Matrix Correlation', fontsize=15)

create_pca_correlation_matrix(pca_X_train_df)

1. We can see that we have handled the multicollinearity
2. We can continue with applying several ML models to the datasets (steps 4 to 7 for each  model)

#### Now we are using several machine learning algorithms such as Decision Tree, Logistic Regression , Random Forest Classifier, etc

Steps 1 to 3 (preprocessing steps) as seen above remain same for all the models and we will be building all the models on that dataset

#### Model 1 : Decision Tree

In [None]:
# STEP 4
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

In [None]:
# STEP 5 & STEP 6
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report as report
y_pred = classifier.predict(X_test)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

In [None]:
# STEP 7
print(report(y_test, y_pred))

#### Model 2 : Logistic Regression

In [65]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [66]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report as report
y_pred = classifier.predict(X_test)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 60.29 %
Standard Deviation: 2.77 %


In [None]:
print(report(y_test, y_pred))

#### Model 3 : Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report as report
y_pred = classifier.predict(X_test)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

In [None]:
print(report(y_test, y_pred))

#### Model 4 : Kernel SVM (Support Vector Machine)

In [None]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report as report
y_pred = classifier.predict(X_test)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

In [None]:
print(report(y_test, y_pred))

#### Model 5 : Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report as report
y_pred = classifier.predict(X_test)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

In [None]:
print(report(y_test, y_pred))

#### Model 6 : KNN (K Nearest Neighbours)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report as report
y_pred = classifier.predict(X_test)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

In [None]:
print(report(y_test, y_pred))

#### Model 7 : XGBoost

In [None]:
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report as report
y_pred = classifier.predict(X_test)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

In [None]:
print(report(y_test, y_pred))

#### Model 8 : Artifical Neural Networks

In [None]:
import tensorflow as tf
ann = tf.keras.models.Sequential()
ann.add(tf.keras.layers.Dense(units=7, activation='relu'))
for n in range(2):
    ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
ann.fit(X_train, y_train, batch_size = 32, epochs = 300)

In [None]:
y_pred = ann.predict(X_test)
y_pred = (y_pred > 0.5)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

In [None]:
print(report(y_test, y_pred))

### 4.3. Final model selection

#### We choose Logistic Regression Classifier because it has high accuracy and comparable to our models with high accuracy
1. Based on the results above, we can see that Logistic Regression, Kernel SVM, Naive Bayes models and Neural Network have recorded the highest accuracy
2. These four models are also comparable to each other based on the other metrics (precision, recall, etc)
3. We choose Logistic Regression as our model for further analysis as Logistic Regression model gives us accuracy comparable/higher to others. It is also faster to run, efficient to implement compared to other models.
4. With the selected model, we will now take a look at several things: feature importance, confusion matrix, and ROC-AUC
5. Before that, we will run and define the classifier again to make sure we are using the right model

In [None]:
model_finding_df.iloc[:, 6:-1]

In [67]:
# STEP 1
X = model_finding_df.iloc[:, 6:-1].values
y = model_finding_df.iloc[:, -1].values

# STEP 2
from sklearn.model_selection import train_test_split
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# STEP 3
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_stdz = sc.fit_transform(X_train_raw)
X_test_stdz = sc.transform(X_test_raw)

from sklearn.decomposition import PCA
dimension = 7 
pca = PCA(n_components = dimension)
X_train = pca.fit_transform(X_train_stdz)
X_test = pca.transform(X_test_stdz)

# STEP 4
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [None]:
def show_feature_importances(X_test, y_test):
    from yellowbrick.model_selection import feature_importances
    X_test_df = pd.DataFrame(X_test, columns = ['PCA%i' % i for i in range(dimension)])
    y_test_df = pd.DataFrame(y_test, columns = [list(model_finding_df.columns)[-1]])
    classifier = LogisticRegression(C = 0.01, penalty = 'l2', solver = 'liblinear', random_state = 0)
    feature_importances(classifier, X_test_df, y_test_df)
    plt.rcParams["figure.dpi"] = 80
    plt.show()

show_feature_importances(X_test, y_test)

#### Feature importance after PCA
1. As we can see above, PCA0, PCA1, and PCA3 are the top 3 in the feature importance graph
2. If we take a look again at the factor loadings, we know that PCA0 is highly correlated with total_yards (+)
3. PCA1 is highly correlated with pass_attempts (-), rush_attempts (+), and rush_yards (+)
4. And last but not least, PCA3 is highly explained by fumbles (-), interceptions (-), and turnovers (-)
5. The threshold number for absolute correlation between original varibales and the PCA components is 0.35

In [None]:
y_test_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_test_pred)
print('Confusion Matrix:')
print(cm)
print('\nAccuracy: ' + f'{100 * accuracy_score(y_test, y_test_pred):.2f}%')

In [None]:
def show_confusion_matrix(classifier, X_test, y_test):
    from sklearn.metrics import plot_confusion_matrix
    matrix = plot_confusion_matrix(classifier, X_test, y_test,
                                 cmap=plt.cm.Blues,
                                 normalize=None)
    plt.grid(visible=False)
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show(matrix)
    plt.rcParams["figure.dpi"] = 80
    plt.show()

show_confusion_matrix(classifier, X_test, y_test)

#### Interpretation of the confusion matrix
1. True negative (tn): model predicts away team lost (0) and away team lost = 295
2. False positive (fp): model predicts away team won (1) but away team actually lost = 92
3. False negative (fn): model predicts away team lost but away team actually won = 171
4. True positive (tp): model predicts away team won and away team won = 128

In [None]:
def show_confusion_matrix_analysis(classifier, X_test, y_test):
    y_test_pred = classifier.predict(X_test)
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_test_pred)
    tn = cm[0][0]
    fp = cm[0][1]
    fn = cm[1][0]
    tp = cm[1][1]
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)
    fpr = fp/(fp+tn)
    f_score = 2*precision*recall/(precision+recall)
    accuracy = (tp+tn)/(tp+tn+fp+fn)
    
    print("Precision:\t\t\t%1.2f"%(100*precision) + "% identified as away teams' victory are away teams' victory")
    print("Recall/TPR:\t\t\t%1.2f"%(100*recall) + "% of away teams' victory are identified")
    print("False Positive Rate:\t\t%1.2f"%(100*fpr) + "% of away team's defeat identified as away team's victory")
    print("f-score:\t\t\t%1.2f"%(100*f_score) + "% tradeoff between precision and recall")
    print("Accuracy:\t\t\t%1.2f"%(100*accuracy) + "% how well the model has classified")

show_confusion_matrix_analysis(classifier, X_test, y_test)

In [None]:
def show_roc_curve(classifier, X_test, y_test):
    import sklearn.metrics as metrics
    y_test_pred_prob = classifier.predict_proba(X_test)
    y_test_pred_prob = y_test_pred_prob[:,1]
    fpr, tpr, threshold = metrics.roc_curve(y_test, y_test_pred_prob)
    roc_auc = metrics.auc(fpr, tpr)

    import matplotlib.pyplot as plt
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.rcParams["figure.dpi"] = 80
    plt.show()

show_roc_curve(classifier, X_test, y_test)

1. The area under curve (AUC) is 0.68
2. The AUC is a measure of how stable the model is
3. If the AUC is higher, then the better the model is at predicting 0 classes as 0 and 1 classes as 1
4. We consider the 0.68 value is good enough for this use-case

#### Feature importance with original full features
1. We have obtained the feature importance from the model after PCA
2. However, to be able to compare it the previous hypothesis of good features and bad features, we need the feature importance with original full features instead
3. To do that, we run the best model (Logistic Regression) on the dataset without performing PCA so the dimension is not reduced
4. When building this particular model, we add 'F' or 'f' to the variable or model name to indicate 'full features', and so that it won't change the value of pre-defined variables that have been used before

In [None]:
# STEP 1
XF = model_finding_df.iloc[:, 6:-1].values
yf = model_finding_df.iloc[:, -1].values

# STEP 2
from sklearn.model_selection import train_test_split
XF_train_raw, XF_test_raw, yf_train, yf_test = train_test_split(XF, yf, test_size = 0.25, random_state = 0)

# STEP 3
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
XF_train = sc.fit_transform(XF_train_raw)
XF_test = sc.transform(XF_test_raw)

# STEP 4
from sklearn.linear_model import LogisticRegression
classifier_f = LogisticRegression(random_state = 0)
classifier_f.fit(XF_train, yf_train)

#### GridSearchCV method is used for hyperparameter tuning so that the best parameters are selected for the Logistic Regression

In [None]:
from sklearn.model_selection import GridSearchCV
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
# define grid search
parameters = dict(solver=solvers,penalty=penalty,C=c_values)
# parameters = {'var_smoothing': np.logspace(0,-9, num=100)}
grid_search = GridSearchCV(estimator = classifier_f,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(XF_train, yf_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Parameters:", best_parameters)

In [None]:
def show_feature_importances_full_features(XF_test, yf_test):
    from yellowbrick.model_selection import feature_importances
    XF_test_df = pd.DataFrame(XF_test, columns = list(model_finding_df.columns)[6:-1])
    yf_test_df = pd.DataFrame(yf_test, columns = [list(model_finding_df.columns)[-1]])
    classifier_f = LogisticRegression(C = 0.01, penalty = 'l2', solver = 'liblinear', random_state = 0)
    feature_importances(classifier_f, XF_test_df, yf_test_df)
    plt.rcParams["figure.dpi"] = 80
    plt.show()

show_feature_importances_full_features(XF_test, yf_test)

1. If we compare with our previous hypothesis, there are five features where feature importance shows different results
2. Feature importance shows that penalties has positive importance while we hypothesized that this feature is a bad feature instead
3. Feature importance shows that total_yards, net_pass_yards, pass_yards, and rush_attempts has negative importance while we hypothesized that they are good features instead
4. Based on these results, we believe that we still can accept our hypothesis
5. We can say that our intuition about good attributes (features) and bad attributes is in line with the feature importance results, even though there are some exceptions

## 5. NFL Games Prediction

### 5.1. Predicting probabilities for the current week games that have not been played yet

1. Using the selected classifier, we can now predict the outcome of forthcoming games in the current week that have not been played
2. We first define a function to display the prediction results
3. After that, we define the dataset that will be used from the prediction dataframe we have created before
4. The dataset is then standardized and transformed according to the PCA results
5. Finally, we can compute the winning probability of the away team and display the resultsprediction_df.iloc[:, 6:-1].values

In [None]:
prediction_df.iloc[:, 6:-1]

In [None]:
# Function to display the prediction results
def display_prediction_for_current_week_games(y_pred_prob, prediction_df):
    for t in range(len(y_pred_prob)):
        win_prob = round(y_pred_prob[t], 3)
        away_team = prediction_df.reset_index().drop(columns = 'index').loc[t,'away_name']
        home_team = prediction_df.reset_index().drop(columns = 'index').loc[t,'home_name']
        print('The {} have a probability of {} of defeating the {}.'.format(away_team, win_prob, home_team))

In [68]:
# Define and transform the dataset
X_pred_raw = prediction_df.iloc[:, 6:-1].values
X_pred_stdz = sc.transform(X_pred_raw)
X_pred = pca.transform(X_pred_stdz)

In [69]:
# Predict the results (winning probability)
y_pred_prob = classifier.predict_proba(X_pred)
y_pred_prob = y_pred_prob[:,1]

In [None]:
y_pred_prob

In [None]:
y_pred_prob

In [None]:
display_prediction_for_current_week_games2(list2, prediction_df)

In [None]:
list1=[]
list1.append(y_pred_prob)

In [None]:
list1.append(y_pred_prob)

In [None]:
prediction_df_all=prediction_df_all.reset_index()

In [None]:
prediction_df_all=prediction_df_all.drop(columns= ['index'])

### RUNNING MODEL FOR SPREAD(converting from probability to spread)(do your own math for this eventually)

In [None]:
def display_prediction_for_current_week_games2(y_pred_prob, prediction_df_all):
    for t in range(len(y_pred_prob)):
        win_prob = round(y_pred_prob[t], 3)
        away_team = prediction_df_all.reset_index().drop(columns = 'index').loc[t,'away_name']
        home_team = prediction_df_all.reset_index().drop(columns = 'index').loc[t,'home_name']
        print('The {} are predicted to be {} against the spread versus the {}.'.format(away_team, win_prob, home_team))

In [None]:
# Define and transform the dataset
X_pred_raw = prediction_df_all.iloc[:, 6:-1].values
X_pred_stdz = sc.transform(X_pred_raw)
X_pred = pca.transform(X_pred_stdz)

In [None]:
# Predict the results (winning probability)
y_pred_prob = classifier.predict_proba(X_pred)
y_pred_prob = y_pred_prob[:,1]

In [113]:
model_finding_df.to_csv('model_finding_df_most_recent_full.csv')

### WEEK 1 GO BACK AND FIX THE MODEL TRAIN SET AND RE RUN TO SEE IF DIFFERENCE

In [None]:
aggregate_games_df = pd.read_csv('agg_df_2021_final_minus_wk18_and_2022.csv')

In [None]:
model_finding_df = aggregate_games_df[aggregate_games_df.result.notna()]


#model_finding_df = model_finding_df.drop(columns = ['away_name', 'away_abbr', 'home_name', 'home_abbr', 'week','year'])

prediction_df=aggregate_games_df[aggregate_games_df.result.isnull()]

In [None]:
aggregate_games_df = aggregate_games_df[aggregate_games_df.filter(regex='^(?!Unnamed)').columns]

In [None]:
aggregate_games_df = aggregate_games_df[aggregate_games_df.year ==2021]

In [None]:
wk2

In [None]:
list1[15]

In [92]:
aggregate_games_df = pd.read_csv('agg_df_most_recent_full.csv')

### Run for 2021 and get probabilities one by one for each week and combine into 1 list list1

In [198]:
list1 = wk1+list1[0] + list1[1]+list1[2]+list1[3]+list1[4]+list1[5]+list1[6]+wk10+list1[8]+list1[9]+list1[10]+list1[11]+list1[12]+list1[13]+list1[14]+list1[15]

ValueError: Unable to coerce to Series, length must be 29: given 15

In [197]:
wk10 =(list1[7])

In [156]:
list1=list1 + wk2

In [149]:
del wk10[-8]

In [181]:
wk10

[0.31,
 0.536,
 0.452,
 0.345,
 0.796,
 0.473,
 0.646,
 0.503,
 0.682,
 0.542,
 0.748,
 0.531,
 0.469]

In [199]:
list1

[[0.957,
  0.772,
  0.884,
  0.869,
  0.808,
  0.715,
  0.856,
  0.266,
  0.487,
  0.373,
  0.188,
  0.352,
  0.134,
  0.738,
  0.691],
 [0.494,
  0.329,
  0.777,
  0.818,
  0.626,
  0.5,
  0.629,
  0.521,
  0.543,
  0.145,
  0.239,
  0.303,
  0.725,
  0.229,
  0.469],
 [0.49,
  0.243,
  0.787,
  0.393,
  0.554,
  0.354,
  0.245,
  0.85,
  0.715,
  0.702,
  0.193,
  0.399,
  0.252,
  0.762,
  0.405],
 [0.797,
  0.397,
  0.246,
  0.554,
  0.649,
  0.729,
  0.335,
  0.9,
  0.274,
  0.225,
  0.364,
  0.312,
  0.212,
  0.671,
  0.323],
 [0.797,
  0.493,
  0.582,
  0.764,
  0.544,
  0.742,
  0.721,
  0.614,
  0.691,
  0.513,
  0.821,
  0.554,
  0.712],
 [0.475,
  0.681,
  0.318,
  0.708,
  0.56,
  0.341,
  0.369,
  0.226,
  0.159,
  0.169,
  0.468,
  0.74],
 [0.331,
  0.454,
  0.259,
  0.711,
  0.352,
  0.587,
  0.561,
  0.765,
  0.81,
  0.467,
  0.671,
  0.6,
  0.641,
  0.262],
 [0.31,
  0.536,
  0.452,
  0.345,
  0.796,
  0.473,
  0.385,
  0.646,
  0.503,
  0.682,
  0.542,
  0.748,
  0.53

In [180]:
wk1


[0.288,
 0.316,
 0.424,
 0.359,
 0.637,
 0.422,
 0.527,
 0.469,
 0.401,
 0.39,
 0.601,
 0.643,
 0.555,
 0.408,
 0.568]

In [196]:
list1=[]
for i in range(2,len(weeks_list)):

    list1.append(run_for_each_individual_wk(i,2021))


In [None]:
wk1=run_for_each_individual_wk(1,2021)
wk2= run_for_each_individual_wk(2,2021)

In [None]:
for i in range(len(list1)):

    print(len(list1[i]))


In [None]:
run_for_each_individual_wk(0,2021)

In [24]:
wk1=run_for_each_individual_wk(1,2021)

In [25]:
wk2= run_for_each_individual_wk(2,2021)

In [None]:
run_for_each_individual_wk(17,2021)

In [None]:
wk3 = run_for_each_individual_wk(3,2021)

In [None]:
list(range(1,19))

In [None]:
run_for_each_individual_wk(10,2021)

In [None]:
full_game_data.head(10
                   )

In [94]:
full_game_data=full_game_data[full_game_data.year>=2016]


In [141]:
run_for_each_individual_wk(2,2021)

[0.997,
 0.0,
 0.034,
 0.075,
 0.0,
 0.999,
 1.0,
 1.0,
 0.001,
 1.0,
 0.0,
 0.0,
 0.003,
 0.986,
 0.878,
 1.0]

In [145]:
full_game_data=pd.read_csv('full_game_data2.csv')

In [None]:
run_for_each_individual_wk(17,2021)

In [195]:
def run_for_each_individual_wk(current_w,current_y):
    schedule_df = full_schedule
    weeksgames_df = full_game_data
    weeks_list = list(range(1,19))

    aggregate_games_df = aggregate_weekly_data(schedule_df, weeksgames_df, current_w, current_y, weeks_list, 0.1)
    aggregate_games_df = aggregate_games_df[aggregate_games_df.week !=18]
    aggregate_games_df = aggregate_games_df[aggregate_games_df.year >=current_y-1]
    aggregate_games_df = aggregate_games_df[aggregate_games_df.filter(regex='^(?!Unnamed)').columns]
    aggregate_games_df['result'] = np.where((aggregate_games_df.year ==current_y)& (aggregate_games_df.week ==current_w), np.nan, aggregate_games_df['result'])
    model_finding_df = aggregate_games_df[aggregate_games_df.result.notna()]
    prediction_df=aggregate_games_df[aggregate_games_df.result.isnull()]
    
    X = model_finding_df.iloc[:, 6:-1].values
    y = model_finding_df.iloc[:, -1].values

    # STEP 2
    from sklearn.model_selection import train_test_split
    X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

    # STEP 3
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train_stdz = sc.fit_transform(X_train_raw)
    X_test_stdz = sc.transform(X_test_raw)

    from sklearn.decomposition import PCA
    dimension = 7 
    pca = PCA(n_components = dimension)
    X_train = pca.fit_transform(X_train_stdz)
    X_test = pca.transform(X_test_stdz)

    # STEP 4
    from sklearn.linear_model import LogisticRegression
    classifier = LogisticRegression(random_state = 0)
    classifier.fit(X_train, y_train)
    
    X_pred_raw = prediction_df.iloc[:, 6:-1].values
    X_pred_stdz = sc.transform(X_pred_raw)
    X_pred = pca.transform(X_pred_stdz)
    
    y_pred_prob = classifier.predict_proba(X_pred)
    y_pred_prob = y_pred_prob[:,1]
    list1 = y_pred_prob.tolist()
    list1 = [round(list1[x], 3) for x in range(len(list1))]

    #model_finding_df = model_finding_df.drop(columns = ['away_name', 'away_abbr', 'home_name', 'home_abbr', 'week','year'])

    return list1

In [None]:
model_finding_df = model_finding_df[model_finding_df.result.notna()]


# Then we select the dataset to be predicted by the best model (games that have not been played): prediction_df
prediction_df_all = aggregate_games_df
prediction_df_all = prediction_df_all[prediction_df_all.year ==2021]
prediction_df_all = prediction_df_all.loc[prediction_df_all.week != 18]
prediction_df_all['result']=np.nan


In [None]:
pd.set_option("display.max_rows", None, "display.max_columns", None)
#model_finding_df = aggregate_games_df
aggregate_games_df['result'] = np.where((aggregate_games_df.year ==2021)& (aggregate_games_df.week ==3), np.nan, aggregate_games_df['result'])

In [None]:
def display_prediction_for_current_week_games2(y_pred_prob, prediction_df_wk101):
    for t in range(len(y_pred_prob)):
        win_prob = round(y_pred_prob[t], 3)
        away_team = prediction_df_wk101.reset_index().drop(columns = 'index').loc[t,'away_name']
        home_team = prediction_df_wk101.reset_index().drop(columns = 'index').loc[t,'home_name']
        print('The {} are predicted to be {} against the spread versus the {}.'.format(away_team, win_prob, home_team))

In [None]:
# Define and transform the dataset
X_pred_raw = prediction_df_wk101.iloc[:, 6:-1].values
X_pred_stdz = sc.transform(X_pred_raw)
X_pred = pca.transform(X_pred_stdz)

In [None]:
# Predict the results (winning probability)
y_pred_prob = classifier.predict_proba(X_pred)
y_pred_prob = y_pred_prob[:,1]

In [None]:
y_pred_prob

In [None]:
list1 = y_pred_prob.tolist()
list1 = [round(list1[x], 3) for x in range(len(list1))]
list1 = [float(list1[x]) for x in range(len(list1))]

In [None]:
import math

In [None]:
del list1

In [None]:
list1 = [round(list1[x], 3) for x in range(len(list1))]

In [None]:
list1 = [float(list1[x]) for x in range(len(list1))]

In [None]:
list1 = [list1[x]**1.0 for x in range(len(list1))]

In [None]:
list1

In [None]:
list1[1]

In [None]:
if list1[195] >= .406 and list1[195]< .455:
    print('yes')

In [None]:
x = np.arange(start=1, stop=17, step=.5)

In [None]:
y = np.array([51.3,52.5,53.5,54.5,59.4,64.3,65.8,67.3,68.1,69.0,70.7,72.4,75.2,78.1,79.1,80.2,80.7,81.1,83.6,86.0,87.1,88.2,88.5,88.7,89.3,90.0,92.4,94.9,95.6,96.3,98.1,99.8])

In [None]:
y

In [None]:
x

In [None]:
fig, ax = plt.subplots()

plt.plot(x, y)
plt.show()

In [154]:
len(list1)

239

In [None]:
aggregate_games_df.head(10)

In [None]:
if (list1[8])  >= .406 and (list1[195])< .455:
    print('yes')

In [None]:
for i in range(len(list1)):
#     if list1[i] == .500:
#         list5.append('no')
    if list1[i] < .500:
        if (list1[i]) >= .406 and (list1[i])< .455:
            list5.append('yes')
        
#     if list1[i] > .500:
#         list5.append('maybe')

In [42]:
list2=[]
for i in range(len(list1)):
    old_len = len(list2)
    if list1[i] == .500:
        list2.append(0)
    if list1[i] < .500:
        if list1[i] >= 0.0 and list1[i]< .0020:
            list2.append(+17)
        elif list1[i] >= .002 and list1[i]< .019:
            list2.append(+16.5)
        elif list1[i] >= .019 and list1[i]< .037:
            list2.append(+16)
        elif list1[i] >= .037 and list1[i]< .044:
            list2.append(+15.5)
        elif list1[i] >= .044 and list1[i]< .051:
            list2.append(+15)
        elif list1[i] >= .051 and list1[i]< .076:
            list2.append(+14.5)
        elif list1[i] >= .076 and list1[i]< .10:
            list2.append(+14)
        elif list1[i] >= .10 and list1[i]< .107:
            list2.append(+13.5)
        elif list1[i] >= .107 and list1[i]< .113:
            list2.append(+13)
        elif list1[i] >= .113 and list1[i] < .116:
            list2.append(+12.5)
        elif list1[i] >= .116 and list1[i]< .118:
            list2.append(+12)
        elif list1[i] >= .118 and list1[i]< .129:
            list2.append(+11.5)
        elif list1[i] >= .129 and list1[i]< .14:
            list2.append(+11)
        elif list1[i] >= .14 and list1[i]< .164:
            list2.append(+10.5)
        elif list1[i] >= .164 and list1[i] < .189:
            list2.append(+10)
        elif list1[i] >= .189 and list1[i]< .193:
            list2.append(+9.5)
        elif list1[i] >= .193 and list1[i]< .198:
            list2.append(+9)
        elif list1[i] >= .198 and list1[i]< .209:
            list2.append(+8.5)
        elif list1[i] >= .209 and list1[i]< .219:
            list2.append(+8)
        elif list1[i] >= .219 and list1[i]< .248:
            list2.append(+7.5)
        elif list1[i] >= .248 and list1[i]< .277:
            list2.append(+7)
        elif list1[i] >= .277 and list1[i]< .294:
            list2.append(+6.5)
        elif list1[i] >= .294 and list1[i]< .311:
            list2.append(+6)
        elif list1[i] >= .311 and list1[i]< .319:
            list2.append(+5.5)
        elif list1[i] >= .319 and list1[i] < .327:
            list2.append(+5)
        elif list1[i] >= .327 and list1[i]< .342:
            list2.append(+4.5)
        elif list1[i] >= .342 and list1[i]< .357:
            list2.append(+4)
        elif list1[i] >= .357 and list1[i]< .406:
            list2.append(+3.5)
        elif list1[i] >= .406 and list1[i]< .455:
            list2.append(+3)
        elif list1[i] >= .455 and list1[i]< .465:
            list2.append(+2.5)
        elif list1[i] >= .465 and list1[i]< .475:
            list2.append(+2)
        elif list1[i] >= .475 and list1[i]< .488:
            list2.append(+1.5)
        elif list1[i] >= .488 and list1[i]< .500:
            list2.append(+1)
    
    elif list1[i] > .500:
        #
        if list1[i] >= .501 and list1[i]< .513:
            list2.append(-.5)
        if list1[i] >= .513 and list1[i]< .525:
            list2.append(-1)
        elif list1[i] >= .525 and list1[i]< .535:
            list2.append(-1.5)
        elif list1[i] >= .535 and list1[i]< .545:
            list2.append(-2)
        elif list1[i] >= .545 and list1[i]< .594:
            list2.append(-2.5)
        elif list1[i] >= .594 and list1[i]< .643:
            list2.append(-3)
        elif list1[i] >= .643 and list1[i]< .658:
            list2.append(-3.5)
        elif list1[i] >= .658 and list1[i]< .673:
            list2.append(-4)
        elif list1[i] >= .673 and list1[i]< .681:
            list2.append(-4.5)
        elif list1[i] >= .681 and list1[i]< .690:
            list2.append(-5)
        elif list1[i] >= .690 and list1[i]< .707:
            list2.append(-5.5)
        elif list1[i] >= .707 and list1[i]< .724:
            list2.append(-6)
        elif list1[i] >= .724 and list1[i]< .752:
            list2.append(-6.5)
        elif list1[i] >= .752 and list1[i]< .781:
            list2.append(-7)
        elif list1[i] >= .781 and list1[i]< .791:
            list2.append(-7.5)
        elif list1[i] >= .791 and list1[i]< .802:
            list2.append(-8)
        elif list1[i] >= .802 and list1[i]< .807:
            list2.append(-8.5)
        elif list1[i] >= .807 and list1[i] < .811:
            list2.append(-9)
        elif list1[i] >= .811 and list1[i]< .836:
            list2.append(-9.5)
        elif list1[i] >= .836 and list1[i]< .860:
            list2.append(-10)
        elif list1[i] >= .860 and list1[i]< .871:
            list2.append(-10.5)
        elif list1[i] >= .871 and list1[i]< .882:
            list2.append(-11)
        elif list1[i] >= .882 and list1[i]< .885:
            list2.append(-11.5)
        elif list1[i] >= .885 and list1[i]< .887:
            list2.append(-12)
        elif list1[i] >= .887 and list1[i]< .893:
            list2.append(-12.5)
        elif list1[i] >= .893 and list1[i] < .900:
            list2.append(-13)
        elif list1[i] >= .900 and list1[i]< .924:
            list2.append(-13.5)
        elif list1[i] >= .924 and list1[i]< .949:
            list2.append(-14)
        elif list1[i] >= .949 and list1[i]< .956:
            list2.append(-14.5)
        elif list1[i] >= .956 and list1[i]< .963:
            list2.append(-15)
        elif list1[i] >= .963 and list1[i]< .981:
            list2.append(-15.5)
        elif list1[i] >= .981 and list1[i]< .998:
            list2.append(-16)
        elif list1[i] >= .998:
            list2.append(-17)
            
    if len(list2) == old_len:
        list2.append('OH NAH J')
    
            
list2        

[-0.5, -0.5, 4.5, -1, 6, 2.5, -2.5, -1.5, -3, 6.5, 1.5, 3.5, 4.5, 5.5]

In [110]:
aggregate_weekly_games

NameError: name 'aggregate_weekly_games' is not defined

In [None]:
ohnah = np.where(np.array(list2)=='OH NAH J')
list1= np.array(list1)
list1[ohnah]

In [31]:
spreads_df_covers = pd.read_csv('who_covers_2021.csv') 

In [None]:
index = list2.index('OH NAH J')
index

In [None]:
list2[195]

In [None]:
len(prediction_df_all)

In [None]:
prediction_df_all = prediction_df_all.loc[prediction_df_all.week != 18]

In [None]:
spreads_df_covers = spreads_df_covers[spreads_df_covers.filter(regex='^(?!Unnamed)').columns]

In [None]:
spreads_df_covers

In [157]:
len(list1)

254

### plug predicted values into the 2021 dataframe

In [152]:
spreads_df_covers2['pred_probs']=list1

ValueError: Length of values (239) does not match length of index (255)

In [130]:
spreads_df_covers2['pred_spreads']=list2

In [33]:
spreads_df_covers=spreads_df_covers.drop(columns= ['win_perc_dif','first_downs_dif','interceptions_dif','net_pass_yards_dif', 'pass_attempts_dif', 'pass_completions_dif','pass_touchdowns_dif','pass_yards_dif','penalties_dif', 'points_dif','rush_attempts_dif','rush_touchdowns_dif','rush_yards_dif', 
                                 'time_of_possession_dif','times_sacked_dif','total_yards_dif','turnovers_dif', 'yards_from_penalties_dif','fourth_down_perc_dif','third_down_perc_dif'])

In [34]:
mlb_team_abbrev = {'crd': 'Arizona Cardinals',
'atl': 'Atlanta Falcons',
'rav': 'Baltimore Ravens',
'buf': 'Buffalo Bills',
'car': 'Carolina Panthers',
'chi': 'Chicago Bears',
'cin': 'Cincinnati Bengals',
'cle': 'Cleveland Browns',
'dal': 'Dallas Cowboys',
'den': 'Denver Broncos',
'det': 'Detroit Lions',
'gnb': 'Green Bay Packers',
'htx': 'Houston Texans',
'clt': 'Indianapolis Colts',
'jax': 'Jacksonville Jaguars',
'kan': 'Kansas City Chiefs',
'rai': 'Las Vegas Raiders',
'sdg': 'Los Angeles Chargers',
'ram': 'Los Angeles Rams',
'mia': 'Miami Dolphins',
'min':'Minnesota Vikings',
'nwe': 'New England Patriots',
'nor': 'New Orleans Saints',
'nyg': 'New York Giants',
'nyj': 'New York Jets',
'phi': 'Philadelphia Eagles',
'pit': 'Pittsburgh Steelers',
'sfo': 'San Francisco 49ers',
'sea': 'Seattle Seahawks',
'tam': 'Tampa Bay Buccaneers',
'oti': 'Tennessee Titans',
'was': 'Washington Football Team'}
#weeksgames_df2['Team full name']= weeksgames_df2['team_abbr'].map(mlb_team_abbrev).fillna(weeksgames_df2['team_abbr'])

In [None]:
res

In [35]:
test_keys=list(mlb_team_abbrev.values())

In [36]:
test_values=list(mlb_team_abbrev.keys())

In [37]:
res = {test_keys[i]: test_values[i] for i in range(len(test_keys))}

In [38]:
spreads_df_covers['favorite abbr']= spreads_df_covers['favorite'].map(res).fillna(spreads_df_covers['favorite'])

In [39]:
spreads_df_covers2 = spreads_df_covers.applymap(lambda x: x.strip() if isinstance(x, str) else x)

In [40]:
spreads_df_covers2['favorite abbr']= spreads_df_covers2['favorite'].map(res).fillna(spreads_df_covers2['favorite'])

In [60]:
lst=[]
for i in range(len(spreads_df_covers)):
    lst.append(spreads_df_covers['favorite'][i].replace(' ',''))

AttributeError: 'float' object has no attribute 'replace'

In [None]:
for i in range(len(spreads_df_covers)):
    if spreads_df_covers['pred_spreads'][i]>0:
        if spreads_df_covers['away_abbr'][i]== spreads_df_covers['favorite abbr'][i] and 

### Preprocessing spread and results

In [42]:
lst=[]
for i in range(len(spreads_df_covers2)):
    if spreads_df_covers2['spread'][i]< 0:
        lst.append(spreads_df_covers2['spread'][i])
    elif spreads_df_covers2['spread'][i]> 0:
        lst.append(spreads_df_covers2['spread'][i]*-1)
    else:
        lst.append(spreads_df_covers2['spread'][i])
spreads_df_covers2['spread']=lst

In [43]:
lst=[]
for i in range(len(spreads_df_covers2)):
    
    if spreads_df_covers2['away_abbr'][i]== spreads_df_covers2['favorite abbr'][i]:
            lst.append(spreads_df_covers2['spread'][i])

    elif spreads_df_covers2['away_abbr'][i]!= spreads_df_covers2['favorite abbr'][i]:
        lst.append(spreads_df_covers2['spread'][i]*-1)
            
    else:
        lst.append(spreads_df_covers2['spread'])
spreads_df_covers2['away_spread']=lst

In [44]:
spreads_df_covers2['away_spread']=lst

In [45]:
spreads_df_covers2.loc[spreads_df_covers2['favorite abbr'].isna(),'favorite abbr'].values[0]==np.nan

False

In [131]:
lst2=[]
for i in range(len(spreads_df_covers2)):
    if abs(spreads_df_covers2['away_spread'][i]-spreads_df_covers2['pred_spreads'][i])>=3:
        if spreads_df_covers2['pred_spreads'][i]>0:
            if spreads_df_covers2['away_spread'][i]-spreads_df_covers2['pred_spreads'][i]>0:
                if spreads_df_covers2['away_abbr'][i]== spreads_df_covers2['covers_name_list'][i]:
                    if spreads_df_covers2['winner'][i]== spreads_df_covers2['away_abbr'][i]:
                        lst2.append(2)
                    else:
                        lst2.append(1)
                elif spreads_df_covers2['home_abbr'][i]== spreads_df_covers2['covers_name_list'][i]:
                    lst2.append(-1)
                else:
                    lst2.append('error')

            elif spreads_df_covers2['away_spread'][i]-spreads_df_covers2['pred_spreads'][i]<0:
                if spreads_df_covers2['home_abbr'][i]== spreads_df_covers2['covers_name_list'][i]:
                    if spreads_df_covers2['winner'][i]== spreads_df_covers2['home_abbr'][i]:
                        lst2.append(2)
                    else:
                        lst2.append(1)
                elif spreads_df_covers2['away_abbr'][i]== spreads_df_covers2['covers_name_list'][i]:
                    lst2.append(-1)
                else:
                    lst2.append('error')
            else:
                lst2.append(0)
                
        elif spreads_df_covers2['pred_spreads'][i]<0:
            if spreads_df_covers2['away_spread'][i]-spreads_df_covers2['pred_spreads'][i]>0:
                if spreads_df_covers2['away_abbr'][i]== spreads_df_covers2['covers_name_list'][i]:
                    if spreads_df_covers2['winner'][i]== spreads_df_covers2['away_abbr'][i]:
                        lst2.append(2)
                    else:
                        lst2.append(1)
                elif spreads_df_covers2['home_abbr'][i]== spreads_df_covers2['covers_name_list'][i]:
                    lst2.append(-1)
                else:
                    lst2.append('error')


            elif spreads_df_covers2['away_spread'][i]-spreads_df_covers2['pred_spreads'][i]<0:
                if spreads_df_covers2['home_abbr'][i]== spreads_df_covers2['covers_name_list'][i]:
                    if spreads_df_covers2['winner'][i]== spreads_df_covers2['home_abbr'][i]:
                        lst2.append(2)
                    else:
                        lst2.append(1)
                elif spreads_df_covers2['home_abbr'][i]== spreads_df_covers2['covers_name_list'][i]:
                    lst2.append(-1)
                else:
                    lst2.append('error')
            else:
                lst2.append(0)
        else:
            lst2.append(np.nan)
                
    else:
        lst2.append(np.nan)
         



In [47]:
covers_name_list=[]
covers_diff=[]

for i in range(len(spreads_df_covers2)):
    if (spreads_df_covers2['favorite'][i]== spreads_df_covers2['winner full'][i]) and (abs(spreads_df_covers2['Actual Score Differential'][i]) >= abs(spreads_df_covers2['spread'][i])):
        covers_name_list.append(spreads_df_covers2['winner'][i])
        covers_diff.append(spreads_df_covers2['Actual Score Differential'][i]-spreads_df_covers2['new spread'][i])

    elif (spreads_df_covers2['favorite'][i]!= spreads_df_covers2['winner full'][i]):
        covers_name_list.append(spreads_df_covers2['winner'][i])
        covers_diff.append(spreads_df_covers2['Actual Score Differential'][i]-spreads_df_covers2['new spread'][i])
    else:
        covers_name_list.append(spreads_df_covers2['loser'][i])
        covers_diff.append(spreads_df_covers2['new spread'][i]-spreads_df_covers2['Actual Score Differential'][i])
spreads_df_covers2['covers_name_list']  = covers_name_list
spreads_df_covers2['covers_diff']  = covers_diff

In [48]:
spreads_df_covers2['away_spread']=lst

In [87]:
len(lst2)

254

In [132]:
spreads_df_covers2['pred_results']=lst2

In [133]:
spreads_df_covers2.groupby(by=["pred_results"]).count()

Unnamed: 0_level_0,Unnamed: 0,away_name,away_abbr,home_name,home_abbr,week,year,fumbles_dif,yards_lost_from_sacks_dif,result,spread,Overs,Actual Total,Actual Score Differential,winner,team_abbr,new spread,vegas_odds,Team full name,favorite,winner full,loser,home or away,covers_name_list,covers_diff,favorite abbr,away_spread,pred_probs,pred_spreads
pred_results,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1
-1,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73
1,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28
2,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,69,70,70,70,70,70,69,70,70,70
error,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10


In [71]:
spreads_df_covers2['pred_probs'].mean()

0.506486274509804

In [134]:
spreads_df_covers2

Unnamed: 0.1,Unnamed: 0,away_name,away_abbr,home_name,home_abbr,week,year,fumbles_dif,yards_lost_from_sacks_dif,result,spread,Overs,Actual Total,Actual Score Differential,winner,team_abbr,new spread,vegas_odds,Team full name,favorite,winner full,loser,home or away,covers_name_list,covers_diff,favorite abbr,away_spread,pred_probs,pred_spreads,pred_results
0,0,Dallas Cowboys,dal,Tampa Bay Buccaneers,tam,1,2021,0.4375,12.5,60,-8.5,1,60,2,tam,dal,8.5,Tampa Bay Buccaneers -8.5,Dallas Cowboys,Tampa Bay Buccaneers,Tampa Bay Buccaneers,dal,0,dal,6.5,tam,8.5,0.486,1.5,1
1,1,Philadelphia Eagles,phi,Atlanta Falcons,atl,1,2021,0.9375,9.0,38,-3.5,0,38,-26,phi,phi,3.5,Atlanta Falcons -3.5,Philadelphia Eagles,Atlanta Falcons,Philadelphia Eagles,atl,0,phi,-29.5,atl,3.5,0.486,1.5,
2,2,Pittsburgh Steelers,pit,Buffalo Bills,buf,1,2021,0.25,-2.5,39,-6.5,0,39,-7,pit,pit,6.5,Buffalo Bills -6.5,Pittsburgh Steelers,Buffalo Bills,Pittsburgh Steelers,buf,0,pit,-13.5,buf,6.5,0.486,1.5,2
3,3,New York Jets,nyj,Carolina Panthers,car,1,2021,-0.3125,4.875,33,-4.0,0,33,5,car,nyj,4.0,Carolina Panthers -4.0,New York Jets,Carolina Panthers,Carolina Panthers,nyj,0,car,1.0,car,4.0,0.486,1.5,
4,4,Minnesota Vikings,min,Cincinnati Bengals,cin,1,2021,0.125,-5.5625,51,-3.0,1,51,3,cin,min,-3.0,Minnesota Vikings -3.0,Minnesota Vikings,Minnesota Vikings,Cincinnati Bengals,min,0,cin,6.0,min,-3.0,0.486,1.5,2
5,5,Seattle Seahawks,sea,Indianapolis Colts,clt,1,2021,0.375,10.6875,44,-3.0,0,44,-12,sea,sea,-3.0,Seattle Seahawks -3.0,Seattle Seahawks,Seattle Seahawks,Seattle Seahawks,clt,0,sea,-9.0,sea,-3.0,0.486,1.5,-1
6,6,San Francisco 49ers,sfo,Detroit Lions,det,1,2021,0.4375,-0.375,74,-9.0,1,74,-8,sfo,sfo,-9.0,San Francisco 49ers -9.0,San Francisco 49ers,San Francisco 49ers,San Francisco 49ers,det,0,det,-1.0,sfo,-9.0,0.486,1.5,1
7,7,Jacksonville Jaguars,jax,Houston Texans,htx,1,2021,-0.3125,-3.0625,58,-3.0,1,58,16,htx,jax,-3.0,Jacksonville Jaguars -3.0,Jacksonville Jaguars,Jacksonville Jaguars,Houston Texans,jax,0,htx,19.0,jax,-3.0,0.486,1.5,2
8,8,Arizona Cardinals,crd,Tennessee Titans,oti,1,2021,0.4375,0.8125,51,-3.0,0,51,-25,crd,crd,3.0,Tennessee Titans -3.0,Arizona Cardinals,Tennessee Titans,Arizona Cardinals,oti,0,crd,-28.0,oti,3.0,0.486,1.5,
9,9,Los Angeles Chargers,sdg,Washington Commanders,was,1,2021,-0.5,-7.0,36,-1.5,0,36,-4,sdg,sdg,1.5,Washington Football Team -1.5,Los Angeles Chargers,Washington Football Team,Los Angeles Chargers,was,0,sdg,-5.5,was,1.5,0.486,1.5,


In [59]:
spreads_df_covers2.to_csv('spreads_df_covers2.csv')

In [None]:
display_prediction_for_current_week_games2(y_pred_prob, prediction_df_wk101)

In [None]:
display_prediction_for_current_week_games2(list2, prediction_df_wk101)

### 5.2. Simple text mining (sentiment analysis) vs. machine learning model
1. We know that NFL is the number 1 sport in the US
2. Fans, sports analysts, or media in general will report a news, share an analysis, or just give a support about the NFL teams
3. One medium where they'll do it is Twitter
4. We will gather the recent tweets from Twitter using each team's name as the search term
5. After that we will perform sentiment analysis on each team's search results
6. For each pair of two teams that will play each other in the current week, we'll compute the net sentiment score (net_sentiment_dif), which is the difference of the net sentiment of the away team minus that of the home team
7. If the net sentiment is greater than 0, then we'll just simply conclude that the away team is more favoured, and vice versa
8. We can then compare it with the probability model results since we can know which team is more favoured to win the game based on the winning probability of the away team

In [None]:
# Function to do the positive vs. negative sentiment analysis
def do_pos_neg_sentiment_analysis(text_list, debug=False):
    import nltk
    def get_pos_neg_words():
        
        def get_words(url):
            import requests
            words = requests.get(url).content.decode('latin-1')
            word_list = words.split('\n')
            index = 0
            while index < len(word_list):
                word = word_list[index]
                if ';' in word or not word:
                    word_list.pop(index)
                else:
                    index+=1
            return word_list

        p_url = 'http://ptrckprry.com/course/ssd/data/positive-words.txt'
        n_url = 'http://ptrckprry.com/course/ssd/data/negative-words.txt'
        positive_words = get_words(p_url)
        negative_words = get_words(n_url)
        return positive_words, negative_words
    
    positive_words, negative_words = get_pos_neg_words()
    from nltk import word_tokenize
    sentiment_results = list()
    for text in text_list:
        cpos = cneg = lpos = lneg = 0
        for word in word_tokenize(text[1]):
            if word in positive_words:
                if debug:
                    print("Positive", word)
                cpos+=1
            if word in negative_words:
                if debug:
                    print("Negative", word)
                cneg+=1
        sentiment_results.append((text[0], cpos/len(word_tokenize(text[1])),
                                  cneg/len(word_tokenize(text[1]))))
    return sentiment_results

In [None]:
# Function to get the recent tweets that will be used as an input to the sentiment analysis function
def get_recent_tweets(search_term):
    
    # Change the twitter API input
    consumer_key = 'nfTyujbaEl1fDF22xVlMYz6ga'
    consumer_secret = 'NinKRYzhB5LQevBg7a5H38SZIoDdDFHsvqgQWFdBi5Ir2Bz6qP'
    access_token = '1456748024533856263-3TENqcDwMAamVNd4wEKpa8jxNgEsEf'
    access_token_secret = 'u8F5lNFuPfqQ2gZtS81mLSLkoAQikBeFUA7JNy4Gfqvg9'
    
    # Change this to your computer directory
    file_location = 'C:/Data/03_MSBA/05_Courses_Term/01_Fall_2021/IEOR_E_4523_DA/Group_Project/Notebook_Files/Tweets/'
    
    import tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    tweet_results = api.search_tweets(q=search_term,
                                      lang='en',
                                      result_type='recent',
                                      count=1000)

    for i in range(len(tweet_results)):
        filename = search_term+'.'+str(len(tweet_results)-i)
        with open(file_location+filename, 'w', encoding='utf-8') as f:
            f.write(tweet_results[i]._json['text']+'\n')

In [None]:
def compare_probability_model_and_sentiment(prediction_df, y_pred_prob):
    prediction_df = prediction_df.reset_index().drop(columns='index')

    probability_final_list = list()
    
    for i in prediction_df.index:
        if y_pred_prob[i] > 0.5:
            probability_winner = prediction_df['away_name'][i]
        elif y_pred_prob[i] < 0.5:
            probability_winner = prediction_df['home_name'][i]
        else:
            probability_winner = 'Draw'

        probability_final_list.append([prediction_df['away_name'][i],
                                   prediction_df['home_name'][i],
                                   f'{y_pred_prob[i]:.2f}',
                                   probability_winner])
    
    sentiment_final_list = list()

    for i in prediction_df.index:
        get_recent_tweets(prediction_df['away_name'][i])
        get_recent_tweets(prediction_df['home_name'][i])
    
        # Change this to your computer directory where the tweets data are stored
        tweets_root = 'C:/Data/03_MSBA/05_Courses_Term/01_Fall_2021/IEOR_E_4523_DA/Group_Project/Notebook_Files/Tweets'
    
        import nltk
        from nltk.corpus import PlaintextCorpusReader
        
        away_files = prediction_df['away_name'][i]+'.*'
        home_files = prediction_df['home_name'][i]+'.*'
        away_data = PlaintextCorpusReader(tweets_root, away_files)
        home_data = PlaintextCorpusReader(tweets_root, home_files)
    
        sentiment_list = do_pos_neg_sentiment_analysis([[prediction_df['away_name'][i],
                                                         away_data.raw()],
                                                        [prediction_df['home_name'][i],
                                                         home_data.raw()]])
    
        away_net = (sentiment_list[0][1] - sentiment_list[0][2])
        home_net = (sentiment_list[1][1] - sentiment_list[1][2])
        net_sentiment_dif = 100 * (away_net - home_net)
    
        if net_sentiment_dif > 0:
            sentiment_winner = prediction_df['away_name'][i]
        elif net_sentiment_dif < 0:
            sentiment_winner = prediction_df['home_name'][i]
        else:
            sentiment_winner = 'Draw'
    
        sentiment_final_list.append([sentiment_list[0][0],
                                     sentiment_list[1][0],
                                     f'{net_sentiment_dif:.2f}%',
                                     sentiment_winner])
    
    probability_df = pd.DataFrame(probability_final_list, columns = ['away_name',
                                                                     'home_name',
                                                                     'away_win_prob',
                                                                     'favourable_team_prob'])
    
    sentiment_df = pd.DataFrame(sentiment_final_list, columns = ['away_name',
                                                                 'home_name',
                                                                 'away_minus_home_net_sentiment',
                                                                 'favourable_team_sentiment'])
    
    prob_model_vs_sentiment_df = pd.merge(probability_df, sentiment_df,
                                          left_on=['away_name', 'home_name'],
                                          right_on=['away_name', 'home_name'])
   
    return prob_model_vs_sentiment_df

In [None]:
# Calling the main function to create the comparison dataframe
prob_model_vs_sentiment_df = compare_probability_model_and_sentiment(prediction_df, y_pred_prob)

1. The favourable_team_prob is equal to away_name if the away_win_prob is greater than 0.5
2. The favourable_team_prob is equal to home_name if the away_win_prob is smaller than 0.5
3. The favourable_team_sentiment is equal to away_name if the away_minus_home_net_sentiment is greater than 0%
4. The favourable_team_sentiment is equal to home_name if the away_minus_home_net_sentiment is smaller than 0%

In [None]:
prob_model_vs_sentiment_df

#### In this analysis and comparison, we have some assumptions
1. We assume the recent tweets will reflect the upcoming game that a particular team will play, meaning that the tweets leading up to the match will be about the match: who will win and who will lose (while in reality, the tweets are not always about the upcoming game)
2. We assume that greater net sentiment means that there are more favourable tweets on that team to win the game against the other team, even though both teams have recorded positive net sentiment

In [None]:
prob_model_vs_sentiment_df[prob_model_vs_sentiment_df['favourable_team_prob']
                           == prob_model_vs_sentiment_df['favourable_team_sentiment']]

1. We can filter the dataframe to select all the rows where both the favourable teams (one is based on the model and the other one is based on the sentiment analysis) are equal
2. It is also important to note that the tweets data are always being updated since new tweets appear from time to time, and thus the sentiment analysis results may change accordingly