# NFL Fantasy Football Data Preprocessing

The goal of this notebook is to import the data collected by https://github.com/amcheste/fball_data_collection and preprocess it. Preprocessing removes data that is outside the scope of this project, sets null values to zero, and creates data frames that can be used for data analysis and feature engineering.

In [166]:
import pandas as pd
import requests
import os

## Download Data
Because of the volume of data collected by https://github.com/amcheste/fball_data_collection, the output csv files are large by git standards. The csv files are therefore stored in an [Object Storage](https://docs.oracle.com/en-us/iaas/Content/Object/Concepts/objectstorageoverview.htm) bucket in [Oracle Cloud Infrastructure (OCI)](https://docs.oracle.com/en-us/iaas/Content/GSG/Concepts/baremetalintro.htm).

Since the collected data contains nothing sensitive and is publicly available on the internet, I created a [Pre-Authenticated Request (PAR)](https://docs.oracle.com/en-us/iaas/Content/Object/Tasks/usingpreauthenticatedrequests.htm) that lets unauthenticated users download the files stored in the `DSC-412` bucket. The bucket holds the csv files containing data collected from ESPN's APIs. This [PAR URL](https://objectstorage.us-ashburn-1.oraclecloud.com/p/gGzdsEKSIArLMAV1cP7SUkd6jSGF-P5wFn5kENCtQaABjvsLJkgJZ_vPi-27a8NL/n/id8zuxg6euyj/b/DSC-412/o/) will be valid for the duration of this class.

The downloaded csv files are stored in directories under the `tmp` directory in this project. If the `tmp` directory does not exist, it is created. The `tmp` directory is listed in `.gitignore` so that these large files are not checked into the git repository.

The raw csv files are stored under `tmp/00`, and the preprocessed data is stored in `tmp/01` for use in data analysis and feature engineering in the `02_feature_engineering.ipynb` notebook.

In [167]:
# Create the working directories; exist_ok avoids failing on a partial layout
for subdir in ('00', '01', '02'):
    os.makedirs(f'../tmp/{subdir}', exist_ok=True)

BUCKET_URL = 'https://objectstorage.us-ashburn-1.oraclecloud.com/p/gGzdsEKSIArLMAV1cP7SUkd6jSGF-P5wFn5kENCtQaABjvsLJkgJZ_vPi-27a8NL/n/id8zuxg6euyj/b/DSC-412/o/'

# List the objects in the bucket, then download each one into tmp/00
listing = requests.get(BUCKET_URL)
listing.raise_for_status()  # stop early if the listing request failed
for obj in listing.json()['objects']:
    name = obj['name']
    response = requests.get(f'{BUCKET_URL}{name}')
    response.raise_for_status()
    # Write bytes so the csv content is saved exactly as served
    with open(f'../tmp/00/{name}', 'wb') as file:
        file.write(response.content)
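
The loop above lists the bucket and then requests each object by name. As a minimal sketch of that name-to-URL step, the function below turns a listing payload into download URLs without touching the network. The function name `object_urls`, the sample payload, and the `example.invalid` base URL are hypothetical; only the `objects` and `name` keys come from the notebook code.

```python
from urllib.parse import quote


def object_urls(listing: dict, bucket_url: str) -> dict[str, str]:
    """Map each object name in a listing payload to its download URL."""
    urls = {}
    for entry in listing.get("objects", []):
        name = entry["name"]
        # Percent-encode characters such as spaces so the URL stays valid.
        urls[name] = f"{bucket_url}{quote(name)}"
    return urls


# Hypothetical listing payload and base URL, for illustration only.
sample_listing = {"objects": [{"name": "teams.csv"}, {"name": "player points.csv"}]}
sample_urls = object_urls(sample_listing, "https://example.invalid/o/")
print(sample_urls)
```

Keeping the URL construction separate from the download makes it possible to check the mapping before any files are fetched.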

## NFL Positions

In [168]:
#
# Import positions raw data
positions_df = pd.read_csv('../tmp/00/positions.csv')

#
# remove url column
positions_df = positions_df.drop(columns=['url'])
positions_df.set_index('id', inplace=True)

#fantasy_positions = ['WR','TE','QB','FB','HB','TB','LHB','RHB','PK']
fantasy_positions = ['QB']
fantasy_positions_df = positions_df[positions_df['abbreviation'].isin(fantasy_positions)]

# TODO: remove non fantasy positions
#fantasy_positions_df.to_csv('../tmp/01/positions.csv')
fantasy_positions_df

| id | name | abbreviation |
|---|---|---|
| 8 | Quarterback | QB |


## NFL Teams

In [169]:
teams_df = pd.read_csv('../tmp/00/teams.csv')  
teams_df = teams_df.drop(columns=['url'])
teams_df.set_index('id', inplace=True)
teams_df.fillna(0, inplace=True)
teams_df
#teams_df.to_csv('../tmp/01/teams.csv')

#TODO update data collection for team stats
#team_stats_df = pd.read_csv('../tmp/00/team_stats.csv')
#team_stats_df.set_index('id', inplace=True)
#team_stats_df.fillna(0, inplace=True)
#team_stats_df.to_csv('../tmp/01/team_stats.csv')

| id | name | location | abbreviation |
|---|---|---|---|
| 22 | Arizona Cardinals | Arizona | ARI |
| 21 | Philadelphia Eagles | Philadelphia | PHI |
| 23 | Pittsburgh Steelers | Pittsburgh | PIT |
| 26 | Seattle Seahawks | Seattle | SEA |
| 25 | San Francisco 49ers | San Francisco | SF |
| 27 | Tampa Bay Buccaneers | Tampa Bay | TB |
| 10 | Tennessee Titans | Tennessee | TEN |
| 28 | Washington Commanders | Washington | WSH |
| 1 | Atlanta Falcons | Atlanta | ATL |
| 33 | Baltimore Ravens | Baltimore | BAL |
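
The positions and teams cells repeat the same steps: read a raw csv, drop the `url` column, index by `id`, and zero-fill nulls. A minimal sketch of that pattern as one function follows. The helper `load_raw` and the sample data, including the `games_played` column, are hypothetical and not part of the notebook.

```python
import io

import pandas as pd


def load_raw(source, drop_columns=("url",), index_col="id"):
    """Read a raw csv, drop unneeded columns, index by id, and zero-fill nulls."""
    df = pd.read_csv(source)
    # errors="ignore" skips columns that are absent from a given file.
    df = df.drop(columns=list(drop_columns), errors="ignore")
    df = df.set_index(index_col)
    return df.fillna(0)


# Hypothetical two-row sample; games_played is an invented numeric column.
sample_csv = io.StringIO(
    "id,abbreviation,games_played,url\n"
    "22,ARI,17,http://example.invalid/22\n"
    "21,PHI,,http://example.invalid/21\n"
)
sample_teams = load_raw(sample_csv)
print(sample_teams)
```

Returning a new frame instead of using `inplace=True` keeps each step chainable, which is a style choice and not a requirement.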


## NFL Games

In [170]:
#games_df = pd.read_csv('../tmp/00/games.csv')
#games_df = games_df.drop(columns=['url', 'pbp_url'])
#games_df.set_index('id', inplace=True)
#games_df.fillna(0, inplace=True)
#games_df.to_csv('../tmp/01/games.csv')

#game_stats_df = pd.read_csv('../tmp/00/game_stats.csv')
#game_stats_df.fillna(0, inplace=True)
#game_stats_df = game_stats_df.drop(columns=['id'])
#game_stats_df
# TODO remove future games
#game_stats_df.to_csv('../tmp/01/game_stats.csv')


#pbp_df = pd.read_csv('../tmp/00/pbp_2024.csv')
#pbp_df.head()
#cols = pbp_df.columns
#print(cols.tolist())
#print(pbp_df['fantasy_player_name'])
#print(pbp_df['passer_id'])

#pbp_df = pbp_df[['fantasy_player_name', 'passer_id','pass_length','air_yards', 'yards_after_catch',]].copy()
#pbp_df.shape
#'qb_dropback', 'qb_kneel', 'qb_spike', 'qb_scramble', 'pass_length',

## NFL Players

In [171]:
# TODO address null
players_df = pd.read_csv('../tmp/00/players.csv')
players_df = players_df.drop(columns=['url', 'stats_log'])
players_df.set_index('id', inplace=True)
players_df.fillna(0, inplace=True)


player_stats_general = pd.read_csv('../tmp/00/player_general_stats.csv')
player_stats_general = player_stats_general.drop(columns=['id', 'net_total_yards', 'net_yards_per_game'])
player_stats_general.fillna(0, inplace=True)

result = pd.merge(players_df, player_stats_general, left_on='id', right_on='player_id', how='inner')

player_stats_passing = pd.read_csv('../tmp/00/player_passing_stats.csv')
player_stats_passing = player_stats_passing.drop(columns=['id'])
player_stats_passing.fillna(0, inplace=True)

result = pd.merge(result, player_stats_passing, on='player_id', how='inner')

player_stats_rushing = pd.read_csv('../tmp/00/player_rushing_stats.csv')
player_stats_rushing = player_stats_rushing.drop(columns=['id'])
player_stats_rushing.fillna(0, inplace=True)

result = pd.merge(result, player_stats_rushing, on='player_id', how='inner')


# Keep only active quarterbacks (position id 8 is Quarterback in positions.csv)
is_active_qb = (
    (result['active'] == True)
    & (result['status'] == 'Active')
    & (result['position'] == 8)
)
result = result[is_active_qb]

result.to_csv('../tmp/01/player_stats.csv')
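
Each merge above uses `how='inner'`, so a player without a row in every stats file disappears from `result`. The sketch below shows that effect on hypothetical toy data, and uses an outer merge with `indicator=True` to list the rows an inner merge would discard. The frames and the `passing_yards` column are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: three players, but passing stats for only two of them.
players = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["A", "B", "C"]}
).set_index("id")
passing = pd.DataFrame({"player_id": [1, 3], "passing_yards": [4000, 250]})

# left_on may name an index level, so merging on 'id' works after set_index('id').
inner = pd.merge(players, passing, left_on="id", right_on="player_id", how="inner")

# An outer merge with indicator=True marks the rows the inner merge discarded.
audit = pd.merge(
    players.reset_index(), passing,
    left_on="id", right_on="player_id", how="outer", indicator=True,
)
dropped = audit.loc[audit["_merge"] == "left_only", "name"].tolist()
print(len(inner), dropped)
```

Running a check like this before the final filter would show whether the inner merges silently remove any quarterbacks.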



## Player Fantasy Points

In [172]:
fpts_df = pd.read_csv('../tmp/00/player_points.csv')

index_to_drop = fpts_df[fpts_df['points'] == 0.00].index

# Drop rows by index
fpts_df = fpts_df.drop(index_to_drop)
fpts_df.to_csv('../tmp/01/player_points.csv', index=False)
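
The zero-point filter above can also be written as a boolean mask, which makes it easy to count how many rows were removed. This is a minimal sketch on hypothetical sample data. The real `player_points.csv` may contain other columns.

```python
import pandas as pd

# Hypothetical sample; only the 'points' column name comes from the notebook.
fpts = pd.DataFrame({"player_id": [1, 2, 3, 4], "points": [18.4, 0.0, 7.1, 0.0]})

# Boolean-mask equivalent of collecting the zero-point index and dropping it.
scored = fpts[fpts["points"] != 0]
removed = len(fpts) - len(scored)
print(f"kept {len(scored)} rows, removed {removed}")
```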