# Naive Linear Regression of Score Without Opponents
---
Fitting a regression to a team's score across its games, without considering who the opponent is, gives a sort of blind score projection. It feels close to useless: an NFL game has two teams, and the points one team scores depend directly on the other team's ability to defend, so this regression is unlikely to be very informative. Ideally, though, the model would at least lend some support to the idea that the predicted score also depends on the opposing team.

In [None]:
import pandas as pd #DataFrames

In [2]:
df = pd.read_csv('../data/external/nfl_games.csv')
df.head()

In [3]:
df['date'] =  pd.to_datetime(df['date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16274 entries, 0 to 16273
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       16274 non-null  datetime64[ns]
 1   season     16274 non-null  int64         
 2   neutral    16274 non-null  int64         
 3   playoff    16274 non-null  int64         
 4   team1      16274 non-null  object        
 5   team2      16274 non-null  object        
 6   elo1       16274 non-null  float64       
 7   elo2       16274 non-null  float64       
 8   elo_prob1  16274 non-null  float64       
 9   score1     16274 non-null  int64         
 10  score2     16274 non-null  int64         
 11  result1    16274 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(5), object(2)
memory usage: 1.5+ MB


We want to convert the datetime column to a numeric value that linear regression can use. To do this we can use the datetime library, mapping each date to its ordinal (a day count in the proleptic Gregorian calendar).

In [4]:
import datetime as dt #datetime
df['date'] = df['date'].map(dt.datetime.toordinal)
df.head()

Unnamed: 0,date,season,neutral,playoff,team1,team2,elo1,elo2,elo_prob1,score1,score2,result1
0,701169,1920,0,0,RII,STP,1503.947,1300.0,0.824651,48,0,1.0
1,701176,1920,0,0,AKR,WHE,1503.42,1300.0,0.824212,43,0,1.0
2,701176,1920,0,0,RCH,ABU,1503.42,1300.0,0.824212,10,0,1.0
3,701176,1920,0,0,DAY,COL,1493.002,1504.908,0.575819,14,0,1.0
4,701176,1920,0,0,RII,MUN,1516.108,1478.004,0.644171,45,0,1.0
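
As a quick sanity check on the ordinal conversion, the sketch below maps the first value in the `date` column back to a calendar date. It uses only the standard-library `datetime` module; the variable names are my own and do not come from the notebook.

```python
import datetime as dt

# 701169 is the ordinal shown in the first row of the output above
first_game_ordinal = 701169
first_game_date = dt.date.fromordinal(first_game_ordinal)
print(first_game_date)  # 1920-09-26

# Consecutive ordinals are one day apart, so differences are day counts
days_to_next_week = 701176 - first_game_ordinal
print(days_to_next_week)  # 7
```

Because the ordinal is just a day count, the slope of any regression fitted against it is expressed in points per day.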


Next, drop the columns we don't need, since we only care about the date, the teams, and the scores. Then I will create a linear regression for each team and store the models in a dictionary keyed by team.

In [5]:
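# Pass the labels via columns=; the positional axis argument is no longer accepted in pandas 2.0+
df = df.drop(columns=['season', 'neutral', 'playoff', 'elo1', 'elo2', 'elo_prob1', 'result1'])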
names = df['team1'].unique()
names

array(['RII', 'AKR', 'RCH', 'DAY', 'CHI', 'CBD', 'BFF', 'DHR', 'CHT',
       'ARI', 'FTW', 'LOG', 'RCK', 'CTI', 'PUL', 'ZAN', 'CHL', 'THO',
       'MNM', 'GAR', 'ELY', 'CHB', 'ROS', 'COL', 'WGC', 'UAP', 'RIC',
       'CST', 'ECG', 'DTI', 'MUN', 'GB', 'MNN', 'SEN', 'NG1', 'LOU',
       'RAC', 'TOL', 'OOR', 'MIL', 'DUL', 'HAM', 'SLA', 'CIB', 'FYJ',
       'KEN', 'KCB', 'DPN', 'PTB', 'PRV', 'NYG', 'HRT', 'BRL', 'NYA',
       'DWL', 'TOR', 'SIS', 'DET', 'BKN', 'CLI', 'WSH', 'PIT', 'RED',
       'PHI', 'GUN', 'LAR', 'STG', 'CRP', 'BYK', 'CLE', 'BBA', 'SF',
       'CRA', 'LDA', 'NAA', 'MSA', 'BDA', 'BCL', 'NYY', 'DTX', 'IND',
       'NE', 'LAC', 'OAK', 'NYJ', 'TEN', 'BUF', 'DAL', 'KC', 'DEN', 'MIN',
       'MIA', 'ATL', 'NO', 'CIN', 'SEA', 'TB', 'JAX', 'CAR', 'BAL', 'HOU'],
      dtype=object)

Now we will iterate over the array of names and, for each team, pull its score data from the DataFrame.

In [6]:
import numpy as np #Array
from sklearn.linear_model import LinearRegression #Linear Regression
from sklearn.model_selection import train_test_split #Training and Testing

In [7]:
models = {}
x_tests = {}
y_actuals = {}
for team in names:
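    # Home and away appearances, reduced to the same three columns: date, team, score
    home_team = df[df.team1 == team].drop(columns=['team2', 'score2'])
    away_team = df[df.team2 == team].drop(columns=['team1', 'score1'])
    away_team.columns = home_team.columns
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    combined = pd.concat([home_team, away_team], ignore_index=True).sort_values('date')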
    x = combined.iloc[:, 0].values
    y = combined.iloc[:, 2].values
    print(y)
    x_train, x_test, y_train, y_actual = train_test_split(x, y, test_size=0.1, random_state=0)
    x_train= x_train.reshape(-1, 1)
    y_train= y_train.reshape(-1, 1)
    x_test = x_test.reshape(-1, 1)
    models[team] = LinearRegression().fit(x_train,y_train)
    x_tests[team] = x_test
    y_actuals[team] = y_actual
    

[48 45 26  0  7 20  0  7  0 48  0 10 14 14 13 14  0 19  6 60 26  0 43  0
  3  0 56  3  6  3  6  7  0  9 26 20  7  3  6 17  0  0  0  3 12  0  3  0
 35 40  6  0]
[43 37 13  7 10  7 13  7 14  0  0 14 41 23 20  3 19 21  0  0  0  0  7 36
 13  0 62 22  0  3 10  0  0  7  0  0  7  3  6  2  3 14 13  0  0  7  0 22
  7 14  0 20  0 17  7  0  6  0 17  0  0  0  0  0]
[10 66  0 21  6 27  0 16  3  7  0 13  0  0 45 27 13  0  0  0  0  0  0  6
  0  0  0  7  0  0  0  0  7  0 13  0  0  6  0]
[14  0 44 20 23 21  0 28  0 42  7 14  0  3  0 27  3  0 36  0 17 20  0  0
  0  7  7  0  0  3  3  0  0  3 19  7  0  6  0  0  6  7  0  0  0  0  3  0
  0  0  3  6  6  0  0  0  0  6  3  0  0  0  6  0  0  0  0  9  0  0  0  0
  7  0  0  0  0]
[20 25  7 ... 14 24 15]
[48 42  7 20  0 18 21  3  0  3  0  7 39  7 14  0 14  7  7 14 15  0 28 38
  0 14 22  7  0  3  7 20 14 40 19 17 37 30  6  7  7  3 41 46 28 14 10 14
 14  7  0  3  6  6  0  2 13  0 13  0  0  7  7  2  0  2  0  0]
[32 51 28 38 17 35 43  0  7  7  0 17 38 55 28 21 10  0  

ValueError: With n_samples=1, test_size=0.1 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.

From this ValueError I realized that some teams have played only a single game in this dataset: with `test_size=0.1`, `train_test_split` has nothing left to train on when a team has one sample. Modelling each team separately is probably a dead end here, as there wouldn't be enough data points for a lot of teams. A good number of teams have 50+ entries, but for an overall score prediction model I would assert that an individual single-feature linear regression per team isn't ideal.
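
One way to see the problem before fitting anything is to count how many games each team has and set aside the teams below some cutoff. The sketch below does this on a made-up toy table, not the real `nfl_games.csv`; the team codes, the scores, and the `MIN_GAMES` threshold are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the games table; these rows are invented for illustration
games = pd.DataFrame({
    "team1":  ["AAA", "AAA", "BBB", "AAA", "CCC"],
    "team2":  ["BBB", "CCC", "AAA", "BBB", "DDD"],
    "score1": [21, 14, 7, 30, 10],
    "score2": [7, 3, 17, 24, 0],
})

# Count appearances as either home (team1) or away (team2)
games_played = pd.concat([games["team1"], games["team2"]]).value_counts()

MIN_GAMES = 3  # hypothetical cutoff, not a value taken from the notebook
eligible = games_played[games_played >= MIN_GAMES].index.tolist()
skipped = games_played[games_played < MIN_GAMES].index.tolist()

print(games_played.to_dict())
print("fit:", eligible)
print("skip:", skipped)
```

Filtering like this would let the loop run to completion, but it does not address the underlying issue that the date alone says nothing about the opponent.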

# Conclusion
---
Don't do this. No, seriously: don't try to predict individual team scores from past results with no consideration for opponents or other factors.