# Using the _between_ parameter in API calls

This notebook tests how to use the _between_ argument in API calls to the Head2Head endpoint.

In [9]:
import yaml
import pandas as pd
from datetime import date, timedelta
import variables_n_functions as vnf

%load_ext autoreload
%autoreload 2


Import the API token and the teams.

In [7]:
### Open the config file in a context manager so it is closed after reading
with open('dags_config.yaml', 'r') as key_file:
    config = yaml.safe_load(key_file)

sports_key = config['sports_token']

We will use the following 2 teams because they played each other within the last 7 days (on April 26, counting back from today, May 3) and will play each other again within the next 7 days (on May 4).

In [117]:
teams = {3468 : 'Real Madrid', 9 : 'Manchester City'}

### These 2 were used to test whether a match occurring later on the same day
### would be retrieved as a historical or as a future match.
### It came back as a future match, which indicates we can make predictions
### for the day's matches even if we pull the info earlier that same day.
# teams = {3477: 'Villarreal', 8: 'Liverpool'} 

### Before adding _between_
This is the structure as we have it now (May 3), without using _between_:

In [118]:
### Predefine the DF to store variables
df = pd.DataFrame(columns = vnf.columnas_df)

### The following auxiliary variable avoids duplicate requests
teams_aux = list(teams.keys())

### We recover the match history between every unique team - team combination, and store it in the df
for team_1 in teams.keys():
    
    teams_aux.remove(team_1)
    
    for team_2 in teams_aux:
        
        h2h = vnf.head2head(team_1, team_2, sports_key)
        
        if h2h is not None: 
        
            df = pd.concat([df] + [pd.DataFrame(pd.Series(h2h[k])).transpose() for k in range(len(h2h))])

Observe how it pulls every match ever played between any 2 given teams, with dates going back as far as 2005.

In [119]:
dates = df['time'].apply(lambda x : x['starting_at']['date'])
dates = pd.to_datetime(pd.DataFrame(dates)["time"])

print('Earliest match played on: ', dates.min())
print('Latest match played on: ', dates.max())
print('Total matches: ', len(dates))

Earliest match played on:  2005-08-19 00:00:00
Latest match played on:  2022-05-01 00:00:00
Total matches:  84
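
The nested loop in this section removes each team from `teams_aux` before pairing it, so that every unordered pair of teams is requested only once. As a minimal sketch of the same idea (not the code the notebook actually runs), the pairs could also be generated with `itertools.combinations`. The helper name `unique_pairs` is hypothetical.

```python
from itertools import combinations

# Same two teams as in the notebook; any dict of id -> name would work.
teams = {3468: 'Real Madrid', 9: 'Manchester City'}


def unique_pairs(team_ids):
    """Return every unordered pair of team ids exactly once (hypothetical helper)."""
    return list(combinations(team_ids, 2))


pairs = unique_pairs(teams.keys())
print(pairs)  # [(3468, 9)]
```

With more teams, each `(team_1, team_2)` tuple would correspond to one Head2Head request, so the number of requests grows as n * (n - 1) / 2.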


### Adding _between_
To get the structure we want, we need to define the _between_ parameter, which takes a _start_ and an _end_ date. We will have 3 use cases for the _between_ functionality:

1. Initialize the DB with data from roughly the last 5 to 10 years.
2. Update the DB weekly with data from only the last 7 days.
3. Obtain the future matches for which we will make predictions, that is, those scheduled within the next 7 days.
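
The three use cases differ only in how far the window reaches back or forward from today. As a rough sketch, they could share one helper that builds the `start,end` string. The function `build_between` and its parameters are hypothetical and are not part of `variables_n_functions`. The reference date is fixed to May 3, 2022 here only to make the output reproducible.

```python
from datetime import date, timedelta


def build_between(days_back=0, days_forward=0, today=None):
    """Return 'start,end' in YYYY-MM-DD format around a reference day.

    Hypothetical helper: `today` defaults to the current date.
    """
    today = today or date.today()
    start = today - timedelta(days=days_back)
    end = today + timedelta(days=days_forward)
    return start.strftime('%Y-%m-%d') + ',' + end.strftime('%Y-%m-%d')


ref = date(2022, 5, 3)  # the day this notebook was run

print(build_between(days_back=5 * 365, today=ref))  # 1) initialize: 2017-05-04,2022-05-03
print(build_between(days_back=7, today=ref))        # 2) update:     2022-04-26,2022-05-03
print(build_between(days_forward=7, today=ref))     # 3) future:     2022-05-03,2022-05-10
```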

#### 1) Initialize DB

In [120]:
### Define end and start dates
end = date.today()
start = end - timedelta(5 * 365) # 5 years worth of data

### Convert to the format we need
end = end.strftime('%Y-%m-%d')
start = start.strftime('%Y-%m-%d')

### Define the between variable
between = start + ',' + end

In [121]:
between

'2017-05-04,2022-05-03'
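
Note that the start date is May 4 rather than May 3: `timedelta(5 * 365)` counts 1825 days and ignores the leap day in 2020. For initializing the DB this one-day difference is probably irrelevant, but if an exact calendar window were needed, a sketch along these lines could be used. The helper `years_ago` is hypothetical.

```python
from datetime import date


def years_ago(day, years):
    """Return the same calendar day `years` earlier (hypothetical helper).

    Feb 29 falls back to Feb 28 when the target year is not a leap year.
    """
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        # Only reachable for Feb 29 in a non-leap target year
        return day.replace(year=day.year - years, day=28)


print(years_ago(date(2022, 5, 3), 5))  # 2017-05-03
```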

In [122]:
### Predefine the DF to store variables
df = pd.DataFrame(columns = vnf.columnas_df)

### The following auxiliary variable avoids duplicate requests
teams_aux = list(teams.keys())

### We recover the match history between every unique team - team combination, and store it in the df
for team_1 in teams.keys():
    
    teams_aux.remove(team_1)
    
    for team_2 in teams_aux:
        
        h2h = vnf.head2head(team_1, team_2, sports_key, between) ### We add here the between parameter !!!
        
        if h2h is not None: 
        
            df = pd.concat([df] + [pd.DataFrame(pd.Series(h2h[k])).transpose() for k in range(len(h2h))])

Observe how it pulls only the matches between any 2 given teams that fall within the specified date range, which considerably reduces the number of matches that the DB will handle.

In [123]:
dates = df['time'].apply(lambda x : x['starting_at']['date'])
dates = pd.to_datetime(pd.DataFrame(dates)["time"])

print('Earliest match played on: ', dates.min())
print('Latest match played on: ', dates.max())
print('Total matches: ', len(dates))

Earliest match played on:  2020-02-26 00:00:00
Latest match played on:  2022-04-26 00:00:00
Total matches:  3


#### 2) Update DB

In [124]:
### Define end and start dates
end = date.today()
start = end - timedelta(7) # the last 7 days

### Convert to the format we need
end = end.strftime('%Y-%m-%d')
start = start.strftime('%Y-%m-%d')

### Define the between variable
between = start + ',' + end

In [125]:
between

'2022-04-26,2022-05-03'

In [126]:
### Predefine the DF to store variables
df = pd.DataFrame(columns = vnf.columnas_df)

### The following auxiliary variable avoids duplicate requests
teams_aux = list(teams.keys())

### We recover the match history between every unique team - team combination, and store it in the df
for team_1 in teams.keys():
    
    teams_aux.remove(team_1)
    
    for team_2 in teams_aux:
        
        h2h = vnf.head2head(team_1, team_2, sports_key, between) ### We add here the between parameter !!!
        
        if h2h is not None: 
        
            df = pd.concat([df] + [pd.DataFrame(pd.Series(h2h[k])).transpose() for k in range(len(h2h))])

Observe how it pulls every match between any 2 given teams only in the range of dates specified:

In [127]:
dates = df['time'].apply(lambda x : x['starting_at']['date'])
dates = pd.to_datetime(pd.DataFrame(dates)["time"])

print('Earliest match played on: ', dates.min())
print('Latest match played on: ', dates.max())
print('Total matches: ', len(dates))

Earliest match played on:  2022-04-26 00:00:00
Latest match played on:  2022-04-26 00:00:00
Total matches:  1


When this was run (May 3), only 1 match had been played during the previous week. So only that one record would be added, instead of having the DB check every other historical match.

In [128]:
df['time'].iloc[0]['starting_at']['date']

'2022-04-26'
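
The check above inspects a single date by hand. With more teams, a small sanity check could confirm that every returned date falls inside the requested window. This is only a sketch: the helper `dates_outside_window` is hypothetical, and it assumes both bounds of _between_ are inclusive. The result above shows a match dated exactly on the start date, but the behaviour at the end date was not tested in this notebook.

```python
from datetime import date


def dates_outside_window(match_dates, between):
    """Return the 'YYYY-MM-DD' strings that fall outside a 'start,end' window.

    Hypothetical helper; assumes both bounds are inclusive.
    """
    start_str, end_str = between.split(',')
    start = date.fromisoformat(start_str)
    end = date.fromisoformat(end_str)
    return [d for d in match_dates if not start <= date.fromisoformat(d) <= end]


print(dates_outside_window(['2022-04-26'], '2022-04-26,2022-05-03'))  # []
```

An empty list means every match lies within the window.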

#### 3) Get Future Matches

Notice how we switched the roles of the start and end dates: _start_ is now today, and _end_ is computed with a positive _timedelta_ value.

In [129]:
### Define end and start dates
start = date.today()
end = start + timedelta(7) # the next 7 days

### Convert to the format we need
end = end.strftime('%Y-%m-%d')
start = start.strftime('%Y-%m-%d')

### Define the between variable
between = start + ',' + end

In [130]:
between

'2022-05-03,2022-05-10'

In [131]:
### Predefine the DF to store variables
df = pd.DataFrame(columns = vnf.columnas_df)

### The following auxiliary variable avoids duplicate requests
teams_aux = list(teams.keys())

### We recover the match history between every unique team - team combination, and store it in the df
for team_1 in teams.keys():
    
    teams_aux.remove(team_1)
    
    for team_2 in teams_aux:
        
        h2h = vnf.head2head(team_1, team_2, sports_key, between) ### We add here the between parameter !!!
        
        if h2h is not None: 
        
            df = pd.concat([df] + [pd.DataFrame(pd.Series(h2h[k])).transpose() for k in range(len(h2h))])

Observe how it pulls every match between any 2 given teams only in the range of dates specified:

In [132]:
dates = df['time'].apply(lambda x : x['starting_at']['date'])
dates = pd.to_datetime(pd.DataFrame(dates)["time"])

print('Earliest match played on: ', dates.min())
print('Latest match played on: ', dates.max())
print('Total matches: ', len(dates))

Earliest match played on:  2022-05-04 00:00:00
Latest match played on:  2022-05-04 00:00:00
Total matches:  1


When this was run (May 3), only 1 match was scheduled within the next 7 days.

In [133]:
df['time'].iloc[0]['starting_at']['date']

'2022-05-04'

We take this opportunity to see if all the variables that we will need for prediction are retrieved for future matches, and indeed they are:

In [151]:
model_cols = [              
              'league_id', 
              'season_id', 
              'venue_id', 
              'referee_id',
              'localteam_id',
              'visitorteam_id',
              'standings'
              ]


to_predict = df.copy().set_index(['id'])[model_cols]

to_predict['localteam_position'] = to_predict['standings'].apply(lambda x : x['localteam_position'])
to_predict['visitorteam_position'] = to_predict['standings'].apply(lambda x : x['visitorteam_position'])
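### Pass the column by keyword; a positional axis argument is not accepted by recent pandas versions
to_predict.drop(columns = 'standings', inplace = True)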

to_predict

| id | league_id | season_id | venue_id | referee_id | localteam_id | visitorteam_id | localteam_position | visitorteam_position |
|---|---|---|---|---|---|---|---|---|
| 18509783 | 2 | 18346 | 2020 | 16780 | 3468 | 9 | 1 | 1 |
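
The two position columns above are read out of the nested `standings` dictionary of each match. As an illustration of that flattening step outside pandas, here is a minimal sketch on a plain dictionary. The helper `flatten_standings` is hypothetical, and the sample record contains only a few of the fields shown in the result above, not the full API response.

```python
def flatten_standings(match):
    """Copy a match dict, replacing 'standings' with two flat position fields.

    Hypothetical helper; a missing 'standings' entry yields None positions.
    """
    flat = {key: value for key, value in match.items() if key != 'standings'}
    standings = match.get('standings') or {}
    flat['localteam_position'] = standings.get('localteam_position')
    flat['visitorteam_position'] = standings.get('visitorteam_position')
    return flat


# Partial sample record using values from the result above
match = {
    'id': 18509783,
    'localteam_id': 3468,
    'visitorteam_id': 9,
    'standings': {'localteam_position': 1, 'visitorteam_position': 1},
}

print(flatten_standings(match))
```

Returning `None` for missing positions, instead of raising, is a design choice for this sketch: it would let a future match without standings data still reach the prediction step, where the gap can be handled explicitly.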
