<a href="https://colab.research.google.com/github/Nelsontorresjr330/CS5265-Repo/blob/main/Main_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 1 Assignment


## Proposed Project - NFL Data to predict future games

## Background 
### I am a huge sports fan (mostly NFL and soccer), and luckily for me there is a ton of readily available data online for fans to use. The dataset I chose covers football to start, but I could see myself dabbling in other sports later on.
### There are plenty of scholarly articles and professionals dedicated to predicting game results. It's ultimately impossible to predict every game perfectly, since there are essentially infinite variables to consider, but I believe there is some fun and opportunity in trying to get as accurate as possible. One example of someone using data to predict results is https://www.activestate.com/blog/how-to-predict-nfl-winners-with-python/ . I will most likely reference it throughout the semester, as it's broken down clearly and easy to digest.


## Project Description
### The final goal for my project is to reach around 60% accuracy on game predictions. The dataset I plan on using is provided by FiveThirtyEight (https://data.fivethirtyeight.com/). Their nfl-elo dataset has 17380 rows and 33 columns with date, boolean, integer, string, and float data types. To fit the assignment description a bit better, I can whittle down the number of columns, since I don't think they will all be necessary.
### My plan in a nutshell is to weigh the columns initially by how much I believe they contribute to the final score, then let the modeling algorithm grow from there and discover the weights on its own.
### To me, the most important stats are: 
- playoff (BOOL : Whether or not the game was a playoff game)
- elo_pre scores (FLOAT : Overall team rating going into the games)
- elo_prob scores (FLOAT : A team's chance of winning based solely on their elo)
- qbelo_pre scores (FLOAT : Overall QB rating going into the games)
- qbelo_prob scores (FLOAT : A team's chance of winning based solely on their QB Elo)
- qbelo_post scores (FLOAT : A QB rating after the given games) 
- quality (INT : A game's quality score based on the teams' pregame elo ratings, scaled from 0-100)
- importance (INT : Rating of a game's importance based on how the result would affect the model's forecasted playoff odds, scaled from 0-100)
- total_rating (INT : The average of quality and importance)

### For a description of all the columns -> https://github.com/fivethirtyeight/data/tree/master/nfl-elo

## Performance Metric 
### The dataset provides data through February 2023, the end of the most recent NFL season. To determine how well my model works, I will compare the model's predicted score to the true end result: the closer the prediction is to the actual score, the better the model. Ideally, my model will be able to predict around 60% of the games correctly.
### Initially, the goal is just to get a win/loss model working and compare its predicted results to the actual ones. Once I'm satisfied with those, I will move on to predicting scores outright.
# Total Accuracy Percentage = Σ(games_predicted_correctly) / Σ(all_games_predicted) 
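
The accuracy formula above can be sketched in a few lines of Python. This is only an illustration: the function name `total_accuracy` and the sample picks are hypothetical and do not come from the dataset.

```python
def total_accuracy(predicted_winners, actual_winners):
    """Share of games where the predicted winner matches the actual winner."""
    if len(predicted_winners) != len(actual_winners):
        raise ValueError("Both sequences must have one entry per game.")
    if not predicted_winners:
        raise ValueError("At least one game is required.")
    # Count the games predicted correctly, then divide by all games predicted
    correct = sum(p == a for p, a in zip(predicted_winners, actual_winners))
    return correct / len(predicted_winners)


# Made-up picks for five games (illustrative only)
predicted = ["KC", "SF", "BUF", "PHI", "DAL"]
actual = ["KC", "PHI", "CIN", "PHI", "DAL"]

accuracy = total_accuracy(predicted, actual)
print(f"{accuracy:.0%}")  # 3 of 5 picks match, so this prints 60%
```

With three of the five made-up picks correct, the result lands exactly on the 60% target.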


# Week 3 Assignment

## EDA



### Questions


*   Which teams are most consistent in terms of following their predicted scores?
*   How well can someone get their algorithm to predict games?
*   At which point is data no longer useful for current teams?
*   Are win / loss streaks a major factor in a team's performance? 
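
To explore the streak question, one option is to give each game the streak a team carried into it. The sketch below assumes a team's results are already available as a chronological list of Booleans (True for a win); the helper name `streak_entering_game` is hypothetical, and reshaping the dataset into per-team result lists is not shown.

```python
def streak_entering_game(results):
    """Signed streak before each game: +n after n straight wins, -n after n straight losses."""
    streaks = []
    run = 0  # 0 means no previous games
    for won in results:
        streaks.append(run)  # record the streak before this game is played
        if won:
            run = run + 1 if run > 0 else 1
        else:
            run = run - 1 if run < 0 else -1
    return streaks


# Made-up sequence: win, win, loss, win
example_streaks = streak_entering_game([True, True, False, True])
print(example_streaks)  # [0, 1, 2, -1]
```

Recording the streak before each game, rather than after it, keeps the current game's result out of its own feature.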

In [1]:
#Tables and Visualizations
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [78]:
#Initial DF
url = 'https://raw.githubusercontent.com/Nelsontorresjr330/CS5265-Repo/main/nfl_elo.csv'
raw_df = pd.read_csv(url, index_col=0)
raw_df.columns

Index(['season', 'neutral', 'playoff', 'team1', 'team2', 'elo1_pre',
       'elo2_pre', 'elo_prob1', 'elo_prob2', 'elo1_post', 'elo2_post',
       'qbelo1_pre', 'qbelo2_pre', 'qb1', 'qb2', 'qb1_value_pre',
       'qb2_value_pre', 'qb1_adj', 'qb2_adj', 'qbelo_prob1', 'qbelo_prob2',
       'qb1_game_value', 'qb2_game_value', 'qb1_value_post', 'qb2_value_post',
       'qbelo1_post', 'qbelo2_post', 'score1', 'score2', 'quality',
       'importance', 'total_rating'],
      dtype='object')

In [21]:
#Elo-Based Favorites
#Since the data doesn't provide explicit favorites as a column,
#I'll add one based on each team's elo win probability: the team with
#the greater elo_prob is the favorite (ties go to team1)

df_w_fav = raw_df.copy()
df_w_fav['Favorite'] = np.where(raw_df['elo_prob1'] >= raw_df['elo_prob2'], raw_df['team1'], raw_df['team2'])
print(df_w_fav[['team1','team2','elo_prob1','elo_prob2','Favorite']])

           team1 team2  elo_prob1  elo_prob2 Favorite
date                                                 
1920-09-26   RII   STP   0.824651   0.175349      RII
1920-10-03   AKR   WHE   0.824212   0.175788      AKR
1920-10-03   BFF   WBU   0.802000   0.198000      BFF
1920-10-03   DAY   COL   0.575819   0.424181      DAY
1920-10-03   RII   MUN   0.644171   0.355829      RII
...          ...   ...        ...        ...      ...
2023-01-22   BUF   CIN   0.648277   0.351723      BUF
2023-01-22    SF   DAL   0.683613   0.316387       SF
2023-01-29   PHI    SF   0.450672   0.549328       SF
2023-01-29    KC   CIN   0.600966   0.399034       KC
2023-02-12   PHI    KC   0.375176   0.624824       KC

[17379 rows x 5 columns]


In [29]:
#Favored Team Win
#Another additional column for if the favored team won, as a Boolean

df_w_fav_win = df_w_fav.copy()
#The favorite wins only if it outscores the other team, so ties count as False
team1_fav_won = (df_w_fav['team1'] == df_w_fav['Favorite']) & (df_w_fav['score1'] > df_w_fav['score2'])
team2_fav_won = (df_w_fav['team2'] == df_w_fav['Favorite']) & (df_w_fav['score2'] > df_w_fav['score1'])
df_w_fav_win['Favorite_won'] = team1_fav_won | team2_fav_won
print(df_w_fav_win[['team1','team2','Favorite','score1','score2','Favorite_won']])

           team1 team2 Favorite  score1  score2  Favorite_won
date                                                         
1920-09-26   RII   STP      RII      48       0          True
1920-10-03   AKR   WHE      AKR      43       0          True
1920-10-03   BFF   WBU      BFF      32       6          True
1920-10-03   DAY   COL      DAY      14       0          True
1920-10-03   RII   MUN      RII      45       0          True
...          ...   ...      ...     ...     ...           ...
2023-01-22   BUF   CIN      BUF      10      27         False
2023-01-22    SF   DAL       SF      19      12          True
2023-01-29   PHI    SF       SF      31       7         False
2023-01-29    KC   CIN       KC      23      20          True
2023-02-12   PHI    KC       KC      35      38          True

[17379 rows x 6 columns]


In [80]:
#Next, count how many times each team was favored and won, then divide by the total number of times it was favored

total_favorites = df_w_fav_win['Favorite'].value_counts()
favorites_won = df_w_fav_win.loc[df_w_fav_win['Favorite_won']==True].groupby('Favorite')['Favorite_won'].count()
favs_percents = favorites_won / total_favorites
favs_percents.dropna().sort_values()#['TEN'] #You can index by any team you'd like to see

LOU    0.250000
NYA    0.285714
TOR    0.285714
CRA    0.307692
DHR    0.333333
         ...   
DWL    1.000000
LAB    1.000000
KCB    1.000000
SLA    1.000000
CHT    1.000000
Length: 74, dtype: float64
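
One thing to watch in the calculation above: `favorites_won` only contains teams with at least one win as the favorite, so a team that never won as the favorite becomes NaN and is removed by `dropna()` instead of showing up at 0. The teams at the extremes may also rest on very few games. A minimal sketch of an alternative, using a tiny made-up frame in place of `df_w_fav_win` and a hypothetical `min_games` threshold:

```python
import pandas as pd

# Tiny made-up stand-in for df_w_fav_win (values are illustrative only)
games = pd.DataFrame({
    "Favorite": ["KC", "KC", "KC", "SF", "SF", "LOU"],
    "Favorite_won": [True, True, False, True, False, False],
})

# The mean of a Boolean column is the win rate; count gives the sample size
summary = games.groupby("Favorite")["Favorite_won"].agg(
    win_rate="mean", games_favored="count"
)

# min_games is a hypothetical cutoff; it would need tuning on the real data
min_games = 2
reliable = summary[summary["games_favored"] >= min_games].sort_values("win_rate")

print(summary)
print(reliable)
```

Here LOU keeps a win rate of 0.0 in `summary` rather than disappearing, and it is filtered out of `reliable` only because it was favored in a single game.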

### Future Engineering
Even this quick check of the most consistent teams shows a few columns I need to add. I have already added some (Favorite, Favorite_won, and the favs_percents series), but I may also need something along the lines of "QB still on team", "Team currently in NFL" or "Team still mostly in NFL". These are mainly meant to keep the predictions for future games accurate, though they might not be entirely meaningful, since when making predictions I can basically assume they're all true. Still, I imagine many more columns will come up along the way.

### Train/Test Split
I'm not entirely certain of my ideal train/test split, but to start off I would like to train my model on every season prior to 2020 and test it on the 2021 and 2022 seasons. Each NFL season has about 250 or so games, so this should provide about 16,500 games for training and 500 for testing. This probably isn't the greatest split, but it's where I would like to start, since I'm only really interested in predicting recent/upcoming games.
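
A season-based split like this could be sketched as follows. The helper `split_by_season` and the demo rows are hypothetical; only the `season` and `elo_prob1` column names come from the dataset. Read literally, "prior to 2020" leaves the 2020 season in neither set, and the sketch reflects that.

```python
import pandas as pd


def split_by_season(df, last_train_season, test_seasons):
    """Train on seasons up to last_train_season; test on the listed seasons."""
    train = df[df["season"] <= last_train_season]
    test = df[df["season"].isin(test_seasons)]
    return train, test


# Made-up rows, one per season (illustrative only)
demo = pd.DataFrame({
    "season": [2018, 2019, 2020, 2021, 2022],
    "elo_prob1": [0.61, 0.45, 0.70, 0.52, 0.38],
})

train_df, test_df = split_by_season(
    demo, last_train_season=2019, test_seasons=[2021, 2022]
)
print(train_df["season"].tolist())  # [2018, 2019]
print(test_df["season"].tolist())   # [2021, 2022]
```

Splitting by season instead of shuffling rows keeps later games out of the training set, which matters here because the elo columns are built up from earlier results.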

### Initial Pipeline
I am also not certain of the exact specifics of the pipeline, but I do know a handful of operations will be necessary; I'm just not sure how to go about implementing them. For example, the older NFL data is incomplete: there are no QB elo stats prior to 1950, so if I intend to use the data prior to 1950, all of those columns must be run through imputers. As mentioned before, there are also a handful of calculations and features I've already implemented that will need to go through column transformers. There may be other steps I need to add to the pipeline later on, but as of now this is all I can think of, followed by the final model-fitting step.
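
A minimal sketch of what such a pipeline could look like, assuming scikit-learn. The rows and labels are made up, median imputation is just one possible strategy, and `LogisticRegression` is a stand-in for the final estimator (chosen here because the first target is win/loss), not a decision made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

elo_cols = ["elo1_pre", "elo2_pre"]
qb_cols = ["qbelo1_pre", "qbelo2_pre"]

# Made-up rows: the first two mimic older games with no QB elo values
X = pd.DataFrame({
    "elo1_pre": [1500.0, 1620.0, 1480.0, 1550.0, 1700.0, 1430.0],
    "elo2_pre": [1450.0, 1500.0, 1600.0, 1490.0, 1510.0, 1580.0],
    "qbelo1_pre": [np.nan, np.nan, 1475.0, 1560.0, 1690.0, 1440.0],
    "qbelo2_pre": [np.nan, np.nan, 1610.0, 1480.0, 1500.0, 1570.0],
})
y = np.array([1, 1, 0, 1, 1, 0])  # 1 = team1 won (illustrative labels)

# QB columns are imputed and then scaled; elo columns are only scaled
qb_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("elo", StandardScaler(), elo_cols),
    ("qb", qb_steps, qb_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model.fit(X, y)

predictions = model.predict(X)
transformed = model.named_steps["preprocess"].transform(X)
print(predictions)
print(np.isnan(transformed).any())  # False: the imputer filled the gaps
```

Keeping the imputer inside the pipeline means its fill values are learned from the training data only, and the same values are reused when predicting on the test seasons.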

### Model Fitting and Evaluation
After going through this assignment, I feel very confident in my future model. I believe I'll be able to hit my personal goals for the model. Some assumptions I have about the features are:

*   The model will rely heavily on the pre-game elo values (elo1_pre, elo2_pre) instead of features added onto it
*   The generic linear regression model will perform best

