# Football Assignment
In this project, you will use the skills and concepts we discussed this semester to ingest, manipulate, analyze, and report data using Python.

Some of the concepts that may help you complete this notebook:
* basic syntax, len() function, variables
* conditionals
* looping
* data structures: lists, dictionaries, and sets
* pandas
* regex - this is helpful to get text patterns
* JSON - reading and writing JSON files
* pathlib - accessing the files

You have been provided a set of JSON files describing football games from the 2017 season. The files may or may not include all the games from that season. If a statistic in the provided data conflicts with *actual* real-world data, the correct answer is in the *provided* data. 

Use only the JSON files contained in the 'Full' folders (not 'Flattended').
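
As a minimal sketch of restricting the search to one folder, the snippet below builds a small throwaway directory tree and keeps only the JSON files that sit under a folder named `full`. The folder and file names are made up for illustration, and the case-insensitive comparison is an assumption about how the provided folders are named.

```python
import tempfile
from pathlib import Path

# Build a throwaway tree so the sketch runs anywhere (names are hypothetical)
root = Path(tempfile.mkdtemp())
for rel in [
    "full/Week 1/1001 - Team A vs Team B.json",
    "full/Bowl/1002 - Team C vs Team D.json",
    "flattened/Week 1/1001 - Team A vs Team B.json",
]:
    target = root / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("{}", encoding="utf-8")

# Keep only JSON files that have a 'full' folder somewhere in their relative path
full_files = sorted(
    path
    for path in root.rglob("*.json")
    if "full" in (part.lower() for part in path.relative_to(root).parts)
)
print([path.name for path in full_files])
```

Filtering on the path parts, rather than on a substring of the whole path, avoids accidentally matching a file whose name merely contains the word "full".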

The objective of this project is to answer the set of questions below. Your project's output is a JSON file containing the question (key) and the answer (value). The keys must be in the format qn, and the answer must be a value appropriate for the question.

The 'season' includes all games provided (including bowl games).

In [1]:
# Example of how to answer a question
answer_file = {} # create blank dictionary
answer_file['q1'] = 'yes' # Answer 'yes' to Question 1
print(answer_file)

{'q1': 'yes'}


> You must name the file 'mis501_python_project_*netid*.json', for example, mis501_python_project_gjbott.json.
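
A minimal sketch of producing a file with that naming pattern, assuming a placeholder NetID (`your_netid` is hypothetical) and writing into a temporary directory so that nothing in the project folder is touched:

```python
import json
import tempfile
from pathlib import Path

netid = "your_netid"  # hypothetical placeholder; substitute your own NetID
answers = {"q1": "yes"}  # example answers keyed as q1, q2, ...

# Written to a temporary directory here so the sketch has no side effects
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / f"mis501_python_project_{netid}.json"

with out_path.open("w", encoding="utf-8") as fout:
    json.dump(answers, fout, indent=4)

# Reading the file back confirms it is valid JSON with the expected keys
with out_path.open(encoding="utf-8") as fin:
    round_trip = json.load(fin)
print(out_path.name, round_trip)
```

Reading the file back is a cheap check that every value in the dictionary was JSON-serializable before the file is submitted.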

In [38]:
#Import statements
import json
import pathlib
import os
import re

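#define a function to load the contents of a json file
def load_json(json_path):
    payload = None
    #return None (instead of raising) if the file cannot be read or parsed
    try:
        with open(json_path) as fin:
            payload = json.load(fin)
    except (OSError, ValueError) as e:
        print(f"Could not load {json_path}: {e}")
    
    return payload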

### All Games

In [3]:
#Create a Path object for the current working directory
cwd_path = pathlib.Path()
#Path to all regular season games
#path = "**/full/Week*/*json"
#Path to all games in the season
path = "**/full/**/*json"
#create a blank list to store the file path of all games
all_game_files = []

#find all games and add them to a list
new_files = list(cwd_path.rglob(path))
all_game_files.extend(new_files)

#create a dictionary with the file name as the key and file path as the value
all_game_dict = {file.parts[2]: file for file in all_game_files}

In [4]:
#list of the file name of every game found
all_game_list = [file.parts[2] for file in all_game_files]
all_game_list


['400953322 - North Texas vs Troy.json',
 '400953323 - Oregon vs Boise State.json',
 '400953324 - Colorado State vs Marshall.json',
 '400953325 - Arkansas State vs Middle Tennessee.json',
 '400953386 - Florida Atlantic vs Akron.json',
 '400953387 - Florida Intl vs Temple.json',
 '400953388 - Ohio vs UAB.json',
 '400953389 - Wyoming vs Central Michigan.json',
 '400953390 - South Florida vs Texas Tech.json',
 '400953391 - Army vs San Diego State.json',
 '400953392 - Toledo vs Appalachian State.json',
 '400953393 - Houston vs Fresno State.json',
 '400953394 - Northern Illinois vs Duke.json',
 '400953395 - West Virginia vs Utah.json',
 '400953396 - UCLA vs Kansas State.json',
 '400953397 - Florida State vs Southern Mississippi.json',
 '400953398 - Boston College vs Iowa.json',
 '400953399 - Missouri vs Texas.json',
 '400953400 - Navy vs Virginia.json',
 '400953401 - TCU vs Stanford.json',
 '400953402 - Texas A_M vs Wake Forest.json',
 '400953403 - Northwestern vs Kentucky.json',
 '40095340

### Alabama Games

In [5]:
#glob patterns that match Alabama games, whether Alabama is listed first or second in the file name
bama_cases = [
    '**/full/**/*- Alabama *json',
    '**/full/**/*vs Alabama.*json',
]

#create a blank list to store the file path of alabama games
bama_game_files = []

#find all alabama games and add them to a list
for case in bama_cases:
    new_files = list(cwd_path.rglob(case))
    bama_game_files.extend(new_files)

#bama_game_files = [file for file in bama_game_files if 'Bowl' not in file.parts]
#create a dictionary with the file name as the key and file path as the value
bama_game_files = {file.parts[2]: file for file in bama_game_files}

In [6]:
# Code used to remove the 3 unwanted games from the folder of all games
# Create a list of paths to the games I want to remove
# invalid_game_path = ['full/Week 1/400953746 - SOUTH vs NORTH.json', 'full/Week 1/400953747 - WEST vs EAST.json', 'full/Week 1/401003756 - National vs American.json']
# for loop to iterate through the list and remove each game file if it exists
# for game in invalid_game_path:
#     if os.path.exists(game):
#         os.remove(game)
# 
all_game_dict

{'400953322 - North Texas vs Troy.json': WindowsPath('full/Bowl/400953322 - North Texas vs Troy.json'),
 '400953323 - Oregon vs Boise State.json': WindowsPath('full/Bowl/400953323 - Oregon vs Boise State.json'),
 '400953324 - Colorado State vs Marshall.json': WindowsPath('full/Bowl/400953324 - Colorado State vs Marshall.json'),
 '400953325 - Arkansas State vs Middle Tennessee.json': WindowsPath('full/Bowl/400953325 - Arkansas State vs Middle Tennessee.json'),
 '400953386 - Florida Atlantic vs Akron.json': WindowsPath('full/Bowl/400953386 - Florida Atlantic vs Akron.json'),
 '400953387 - Florida Intl vs Temple.json': WindowsPath('full/Bowl/400953387 - Florida Intl vs Temple.json'),
 '400953388 - Ohio vs UAB.json': WindowsPath('full/Bowl/400953388 - Ohio vs UAB.json'),
 '400953389 - Wyoming vs Central Michigan.json': WindowsPath('full/Bowl/400953389 - Wyoming vs Central Michigan.json'),
 '400953390 - South Florida vs Texas Tech.json': WindowsPath('full/Bowl/400953390 - South Florida vs T

## Question 1
How many games are in the data set?

In [7]:
answer_dict = {}
answer_dict["q1"] = len(all_game_dict)
#872
answer_dict

{'q1': 872}

## Question 2
What are the topmost keys for each game file?

In [52]:
game = load_json(all_game_dict['400933827 - Alabama vs Florida State.json'])
top_keys = game.keys()

answer_dict["q2.a"] = list(top_keys)

{'q1': 872,
 'q2.a': ['scoringPlays',
  'videos',
  'drives',
  'teams',
  'id',
  'competitions',
  'season',
  'week'],
 'q2.b': 'Teams appear to be referenced consistently and there are no duplicate games in the list. There are a couple of games that seem to be mixed up in there where it shows EAST vs. WEST, National vs. American, and NORTH vs. SOUTH which are not college football teams. so I decided to remove those games.',
 'q3': 'yes',
 'q3.1': ['Abilene Christian',
  'Air Force',
  'Akron',
  'Alabama',
  'Alabama A_M',
  'Alabama State',
  'Albany',
  'Alcorn State',
  'Appalachian State',
  'Arizona',
  'Arizona State',
  'Arkansas',
  'Arkansas State',
  'Army',
  'Auburn',
  'Austin Peay',
  'BYU',
  'Ball State',
  'Baylor',
  'Boise State',
  'Boston College',
  'Bowling Green',
  'Buffalo',
  'California',
  'Central Arkansas',
  'Central Connecticut',
  'Central Michigan',
  'Charleston Southern',
  'Charlotte',
  'Chattanooga',
  'Cincinnati',
  'Clemson',
  'Coastal Ca

One of the challenges in data analysis is that the data being analyzed may have irregularities or errors that impact the accuracy of the results. For example, does the data set you've been given represent ALL games in the 2017 season? (This is not a question I need you to answer. It's just an example.) Although verifying the accuracy of the data is an important step, we will limit our scope to the titles of the files.


Within the data set you've been given, are all teams referenced the same way (e.g., Texas A&M, Texas A and M, Texas A & M)? Are teams or competitions referenced more than once? To help answer this question, provide a Python list of the teams represented in this data set, sorted alphabetically. Examine the file names to determine if a football game (i.e., competition) is duplicated.
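
One possible way to check both points from file names alone is sketched below. The file names are invented for illustration, and the `normalize` helper is a hypothetical rule (lowercase, letters and digits only) that would catch spellings such as `A_M` versus `A&M` but not `A and M`.

```python
import re
from collections import Counter
from pathlib import PurePath

# Hypothetical file names, for illustration only
file_names = [
    "1001 - Texas A_M vs Alpha State.json",
    "1002 - Texas A&M vs Beta Tech.json",
    "1001 - Texas A_M vs Alpha State.json",
]

# A file name that appears more than once points to a duplicated game
name_counts = Counter(file_names)
duplicates = [name for name, count in name_counts.items() if count > 1]

def normalize(team):
    """Lowercase a team name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", team.lower())

# Group the spellings that collapse to the same normalized key
variants = {}
for name in set(file_names):
    matchup = PurePath(name).stem.split(" - ", 1)[1]
    for team in matchup.split(" vs "):
        variants.setdefault(normalize(team), set()).add(team)

# Any key with more than one spelling is a candidate inconsistency
inconsistent = {key: names for key, names in variants.items() if len(names) > 1}
print(duplicates)
print(inconsistent)
```

A count-based check reports which names repeat, which is more useful than only comparing the length of a list with the length of its set.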

In [9]:
# Testing for duplicate games by first creating a set of the list with all games
unique_game_list = set(all_game_list)
# If the unique game list and full game list are equal length then there would be no duplicates
if len(all_game_list) == len(unique_game_list):
    print("No duplicate games found")
else:
    print("Duplicate games found")

Duplicate games found


In [10]:
answer_dict["q2.b"] = "Teams appear to be referenced consistently and there are no duplicate games in the list. There are a couple of games that seem to be mixed up in there where it shows EAST vs. WEST, National vs. American, and NORTH vs. SOUTH which are not college football teams. so I decided to remove those games."

## Question 3
Are all teams referenced consistently? (yes/no)

In [11]:
answer_dict["q3"] = "yes"

### Question 3.1
Provide a Python list of all the teams represented in the files, sorted alphabetically.

In [12]:
# Create an empty team list
team_list = []
#regex that captures both teams from a file name like '<id> - <team 1> vs <team 2>.json'
pattern_teams = re.compile(r'^\d+\s*-\s*([\w\s()]+)\s+vs\s+([\w\s()]+)\.json$')
# loop over the file names collected earlier and pull both team names out of each one
for game in all_game_list:
    teams = pattern_teams.match(game)
    #add both teams to the team list if the file name matched
    if teams:
        team_list.extend(teams.groups())
#use set to remove duplicates
unique_team_list = set(team_list)
answer_dict["q3.1"] = sorted(unique_team_list)

## Question 4
Does the data seem reliable? 
* 'yes' or 'no'

In [13]:
answer_dict["q4"] = "yes"

### Question 4.1 
Write a sentence or two in support of how you answered question four. It must be based on quantifiable reasons obtained from the data set. If you fixed anything in the data set, explain what you did and why.

In [14]:
#Removing unnecessary games. This could only be run once, since it deletes the files and they no longer exist afterwards
#games_to_remove = ['full/Week 1/400953746 - SOUTH vs NORTH.json',
# 'full/Week 1/400953747 - WEST vs EAST.json',
# 'full/Week 1/401003756 - National vs American.json',]
#for i in games_to_remove:
#    if os.path.exists(i):
#       os.remove(i)
#    else:
#       print("file path doesn't exist")

In [15]:
answer_dict["q4.1"] = "I looked at a few things to confirm. First skimming over the teams list, all the team names were familiar. There were 6 team names that were unusual (EAST, WEST, SOUTH, NORTH, American, and National). They were all in week one and were playing one another which made me think they could be a non-season all-star game so I chose to remove those 3 games since the focus seems to be the college football season games. I checked to see if there were duplicate games but there weren't any. The number of games (871) and teams (206) also appears to be reasonable as I assume some of the non-conference games a FBS school plays will be against non-FBS opponents who only appear once in the dataset and some smaller schools may not always have their game on ESPN so that may explain why the ratio of games to teams is lower than the actual number of games a team plays in a season."  

In [16]:
try:
    if len(all_game_list) == len(unique_game_list):
        print("No duplicate games found")
    else:
        print("Duplicate games found")
except Exception as e:
    print(f"An error occurred: {e}")

Duplicate games found


## Question 5
How many unique teams are represented in the data?

In [17]:
answer_dict["q5"] = len(unique_team_list)
answer_dict

{'q1': 872,
 'q2.a': dict_keys(['scoringPlays', 'videos', 'drives', 'teams', 'id', 'competitions', 'season', 'week']),
 'q2.b': 'Teams appear to be referenced consistently and there are no duplicate games in the list. There are a couple of games that seem to be mixed up in there where it shows EAST vs. WEST, National vs. American, and NORTH vs. SOUTH which are not college football teams. so I decided to remove those games.',
 'q3': 'yes',
 'q3.1': ['Abilene Christian',
  'Air Force',
  'Akron',
  'Alabama',
  'Alabama A_M',
  'Alabama State',
  'Albany',
  'Alcorn State',
  'Appalachian State',
  'Arizona',
  'Arizona State',
  'Arkansas',
  'Arkansas State',
  'Army',
  'Auburn',
  'Austin Peay',
  'BYU',
  'Ball State',
  'Baylor',
  'Boise State',
  'Boston College',
  'Bowling Green',
  'Buffalo',
  'California',
  'Central Arkansas',
  'Central Connecticut',
  'Central Michigan',
  'Charleston Southern',
  'Charlotte',
  'Chattanooga',
  'Cincinnati',
  'Clemson',
  'Coastal Carol

## Question 6
Alabama has not always been blessed with strong placekickers. Is there evidence in the 2017 season that Alabama misses field goals more often than other teams nationwide? 
qn = 'yes' or 'no'
qn+1 = Write a sentence or two supporting how you answered qn. It must include quantifiable reasons obtained from the data set.
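
One way to make the comparison quantifiable is a miss rate (misses divided by attempts), which stays defined for teams that never missed. The sketch below uses invented counts and team names; the pooled national rate is one possible baseline, not the only reasonable one.

```python
# Hypothetical made/missed counts, for illustration only
made = {"Team A": 18, "Team B": 15, "Team C": 10}
missed = {"Team A": 8, "Team B": 13}  # Team C has no misses

def miss_rate(team):
    """Return missed / attempted field goals, or None with no attempts."""
    attempts = made.get(team, 0) + missed.get(team, 0)
    if attempts == 0:
        return None
    return missed.get(team, 0) / attempts

teams = set(made) | set(missed)
rates = {team: miss_rate(team) for team in teams}

# Pooled national rate: all misses divided by all attempts
total_missed = sum(missed.values())
total_attempts = sum(made.values()) + total_missed
national_rate = total_missed / total_attempts

print(round(rates["Team A"], 3), round(national_rate, 3))
```

Pooling attempts weights every kick equally, so a team that appears in only one game has little influence on the baseline.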

# Get Missed FGs

In [18]:
#Dictionary to hold missed FGs
missed_fg_count = {}

try:
    # iterate through each file name and file path
    for game, game_path in all_game_dict.items():
        # get the contents from game file
        current_data = load_json(game_path)
        # go to the drives data so that we can get information from the drives
        current_drive_data = current_data['drives']['previous']
        #iterate through the drives in a game
        for drive in current_drive_data:
            # Check if the drive ended in a missed field goal
            if 'Missed FG' in drive.get('displayResult', ''):
                # Get the team that missed the field goal
                team_responsible = drive.get('team', {}).get('displayName', 'Unknown')
            
                # Update the missed field goal count for the team
                missed_fg_count[team_responsible] = missed_fg_count.get(team_responsible, 0) + 1
except Exception as e:
    print(f"An error occurred: {e}")

print(missed_fg_count['Alabama Crimson Tide'])

8


# Get Made FGs

In [19]:
# create dict to hold team name and number of made field goals
made_fg_count = {}
#error handling
try:
    # iterate through each file name and file path
    for game, game_path in all_game_dict.items():
        # get the contents from game file
        current_data = load_json(game_path)
        # go to the drives data so that we can get information from the drives
        current_drive_data = current_data['drives']['previous']
        #iterate through the drives in a game
        for drive in current_drive_data:
            # Check if the drive ended in a made field goal
            if 'Field Goal' in drive.get('displayResult', ''):
                # Get the team that made the field goal
                team_responsible = drive.get('team', {}).get('displayName', 'Unknown')
            
                # Update the made field goal count for the team
                made_fg_count[team_responsible] = made_fg_count.get(team_responsible, 0) + 1
except Exception as e:
    print(f"An error occurred: {e}")
    
#print Alabama's made fg
print(made_fg_count['Alabama Crimson Tide'])

18


# Ratio of Made FGs to Missed FGs

In [20]:
fg_ratio = {}
try:
    # iterate through teams
    for team in made_fg_count:
        #See if they are also in the missed_fg dict and make sure it does not equal zero to avoid division by zero
        if team in missed_fg_count and missed_fg_count[team]!=0:
            #get a ratio of made to missed fgs
            fg_ratio[team] = made_fg_count[team]/missed_fg_count[team]
        #otherwise just return a none value if they aren't in there
        else:
            fg_ratio[team] = None
    #return teams that have a value to a dictionary
    fg_ratio_filtered = {team: ratio for team, ratio in fg_ratio.items() if ratio is not None}
    #print Alabama's ratio
    print(fg_ratio_filtered['Alabama Crimson Tide'])
except Exception as e:
    print(f"An error occurred: {e}")

2.25


In [21]:
#Get an average ratio across all teams to compare Alabama against by summing the values in fg_ratio_filtered
total_ratio = sum(fg_ratio_filtered.values())
#divide by length of dictionary to get an average
average_ratio = total_ratio/len(fg_ratio_filtered)
#print the average
print(average_ratio)

3.223379246324452


In [22]:
answer_dict['q6'] = "Yes"
answer_dict['q6+1'] = "Alabama wasn't the worst team but their ratio of made field goals to missed field goals was below average. Alabama made 18 and missed 8 field goals which is better than a school like South Carolina who only made 15 but missed 13. In the regular season, Alabama's ratio was 2.25 which is less than the average ratio which was 3.22 though that may include some teams that were only represented in the list once because they are a lower level school. According to the data, Alabama did miss more often than the average school."

## Question 7
A *safety* in football refers to when the offensive player who has possession of the football is tackled or willingly downs the ball in their own end zone. Two points are awarded to the defensive team. The offensive team loses possession of the ball.

In how many games did a safety occur?
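
Because the question counts games rather than safeties, a game with two safeties should contribute only once. A minimal sketch, using made-up games shaped like the `drives` → `previous` → `displayResult` structure of the files:

```python
# Hypothetical games, for illustration only
games = [
    {"drives": {"previous": [{"displayResult": "Punt"}, {"displayResult": "Safety"}]}},
    {"drives": {"previous": [{"displayResult": "Touchdown"}]}},
    {"drives": {"previous": [{"displayResult": "Safety"}, {"displayResult": "Safety"}]}},
]

def has_safety(game):
    """Return True if any drive in the game ended in a safety."""
    drives = game.get("drives", {}).get("previous", [])
    return any("Safety" in drive.get("displayResult", "") for drive in drives)

# Booleans sum as 0/1, so this counts games rather than individual safeties
safety_games = sum(has_safety(game) for game in games)
print(safety_games)
```

`any()` stops at the first matching drive, which plays the same role as a flag variable followed by `break`.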

In [23]:
safety_game_count = 0
try:
    #iterate through each game
    for game, game_path in all_game_dict.items():
        #load games
        current_data=load_json(game_path)
        #go to the drives
        current_drive_data = current_data['drives']['previous']
        #set safety value to False
        safety = False
        #iterate through drives
        for drive in current_drive_data:
            #If the result of the drive is a safety, change safety to True and break the for loop
            if 'Safety' in drive.get('displayResult', ''):
                safety = True
                break
        #add one to the counter if a safety occurred
        if safety:
            safety_game_count+=1
except Exception as e:
    print(f"An error occurred: {e}")

print(safety_game_count)

46


In [24]:
answer_dict['q7'] = safety_game_count

## Question 8
Which team scored the most safeties (include all teams with the same number if tied)?
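
Handling ties means returning every team that shares the maximum, not just the first one found. The sketch below uses invented (offense, defense) pairs, one per drive that ended in a safety, and credits the safety to the defense; the `teams_with_max` helper is hypothetical.

```python
from collections import Counter

# Hypothetical (offense, defense) pairs, one per drive that ended in a safety
safety_drives = [
    ("Team A", "Team B"),
    ("Team C", "Team B"),
    ("Team A", "Team D"),
    ("Team B", "Team D"),
]

def teams_with_max(counts):
    """Return every team tied for the highest count, sorted by name."""
    if not counts:
        return []
    top = max(counts.values())
    return sorted(team for team, n in counts.items() if n == top)

# The defense is the side that scores on a safety
scored = Counter(defense for _offense, defense in safety_drives)
print(teams_with_max(scored))
```

The empty-input guard matters because `max()` raises a `ValueError` when no safeties were found at all.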

In [25]:
#dictionary that keeps track of each team and the number of safeties it scored
safety_count = {}
#iterate through games and their files
try:
    for game, game_path in all_game_dict.items():
        #load game
        current_data = load_json(game_path)
        #go to drive information
        current_drive_data = current_data['drives']['previous']
        #iterate through drives in game
        for drive in current_drive_data:
            # Check if the drive resulted in a safety
            if 'Safety' in drive.get('displayResult', ''):
                # The team with the ball on this drive gave up the safety
                team_against = drive.get('team', {}).get('displayName', 'Unknown')
                # The other team in the game is the one that scored it
                if team_against != current_data['teams'][0]['team']['displayName']:
                    team_for = current_data['teams'][0]['team']['displayName']
                else:
                    team_for = current_data['teams'][1]['team']['displayName']
                # Update the safeties-scored count for that team
                safety_count[team_for] = safety_count.get(team_for, 0) + 1
except Exception as e:
    print(f"An error occurred: {e}")

#get the maximum value for safeties across all teams
max_safeties = max(safety_count.values())
#create a list to hold teams
most_safeties_for = []
# get key and value pair from safety_count
for key, value in safety_count.items():
    #keep every team tied for the most safeties scored
    if value == max_safeties:
        most_safeties_for.append(key)

In [26]:
answer_dict["q8"] = most_safeties_for

## Question 9
Which teams (include all, if tied) gave up the most safeties?

In [27]:
#dictionary that keeps track of team and number of safeties against
team_safety_count = {}
#iterate through games and their files
try:
    for game, game_path in all_game_dict.items():
        #load game
        current_data = load_json(game_path)
        #go to drive information
        current_drive_data = current_data['drives']['previous']
        #iterate through drives in game
        for drive in current_drive_data:
            # Check if the drive resulted in a safety
            if 'Safety' in drive.get('displayResult', ''):
                # Get the team responsible for the safety
                team_responsible = drive.get('team', {}).get('displayName', 'Unknown')
            
                # Update the safety count for the team
                team_safety_count[team_responsible] = team_safety_count.get(team_responsible, 0) + 1
except Exception as e:
    print(f"An error occurred: {e}")
#get the maximum value for safeties across all teams
max_safeties = max(team_safety_count.values())
#create a list to hold teams
most_safeties_teams = []
# get key and value pair from team_safety_count
for key, value in team_safety_count.items():
    #keep every team tied for the most safeties given up
    if value == max_safeties:
        most_safeties_teams.append(key)

In [55]:
answer_dict["q9"] = most_safeties_teams

## Question 10
Find the longest play for the 2017 season (e.g., a 99-yard interception return). If there are several of the same length, show them all. Show the team matchup, quarter, clock time, and play text for each of the plays.
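
The maximum and all plays tied for it can be collected in a single pass. The sketch below uses invented, pre-flattened play records; the key names and the `MAX_PLAUSIBLE_YARDS` cap for filtering out suspect values are assumptions for illustration.

```python
# Hypothetical play records, for illustration only
plays = [
    {"matchup": "Team A vs Team B", "quarter": 1, "clock": "12:30", "text": "Pass for 45 yds", "yards": 45},
    {"matchup": "Team C vs Team D", "quarter": 3, "clock": "4:10", "text": "Run for 99 yds", "yards": 99},
    {"matchup": "Team E vs Team F", "quarter": 2, "clock": "9:00", "text": "Bad record", "yards": 250},
    {"matchup": "Team G vs Team H", "quarter": 4, "clock": "0:45", "text": "Return for 99 yds", "yards": 99},
]

MAX_PLAUSIBLE_YARDS = 101  # assumed cap; larger values are treated as data errors

best = None
longest = []
for play in plays:
    yards = play["yards"]
    if yards > MAX_PLAUSIBLE_YARDS:
        continue  # skip implausible records
    if best is None or yards > best:
        best, longest = yards, [play]  # new maximum: start the list over
    elif yards == best:
        longest.append(play)  # tie with the current maximum

for play in longest:
    print(play["matchup"], play["quarter"], play["clock"], play["text"])
```

Resetting the list whenever a new maximum appears avoids a second loop over every game file.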

In [29]:
max_yardage = 0
#iterate through files
for game, game_path in all_game_dict.items():
    current_data = load_json(game_path)
    current_drive_data = current_data['drives']['previous']
    #iterate through the drives in a game
    for drive in current_drive_data:
        #iterate through the plays in a drive
        for play in drive['plays']:
            #skip yardages above 101 (presumably data errors) and keep the largest value seen so far
            if play["statYardage"] <= 101 and play["statYardage"] > max_yardage:
                max_yardage = play["statYardage"]

In [30]:
longest_play_info = []
# iterate through games
for game, game_path in all_game_dict.items():
    current_data = load_json(game_path)
    current_drive_data = current_data['drives']['previous']
    #iterate through the drives in a game
    for drive in current_drive_data:
        #iterate through the plays in a drive
        for play in drive['plays']:
            #if the play matches max_yardage, record team, opponent, quarter, time, and description
            if play["statYardage"] == max_yardage:
                team = drive['team']['displayName']
                #the opponent is whichever of the two teams did not have the ball
                if team != current_data['teams'][0]['team']['displayName']:
                    opponent = current_data['teams'][0]['team']['displayName']
                else:
                    opponent = current_data['teams'][1]['team']['displayName']
                period = play["period"]["number"]
                clock = play["clock"]["displayValue"]
                text = play["text"]
                info = f"Team: {team}, Opponent: {opponent}, Quarter: {period}, Time: {clock}, Description: {text}"
                longest_play_info.append(info)

In [31]:
answer_dict["q10"] = longest_play_info

## Question 11
How long were Alabama's FIRST and LAST offensive plays of the season? Provide the description of each play including the yardage.
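
Rather than hard-coding drive and play positions, the first and last offensive plays could be located by filtering. The sketch below runs on invented drives; the `kind` key and the `NON_OFFENSIVE` labels are hypothetical stand-ins for whatever field in the real files identifies the play type, and which play kinds count as offensive is a judgment call.

```python
# Hypothetical drives in chronological order, for illustration only
drives = [
    {"team": {"displayName": "Team B"},
     "plays": [{"text": "Team B run for 3 yds", "statYardage": 3, "kind": "rush"}]},
    {"team": {"displayName": "Team A"},
     "plays": [
         {"text": "Team B kickoff for 60 yds", "statYardage": 0, "kind": "kickoff"},
         {"text": "Team A pass for 7 yds", "statYardage": 7, "kind": "pass"},
     ]},
    {"team": {"displayName": "Team A"},
     "plays": [
         {"text": "Team A run for 2 yds", "statYardage": 2, "kind": "rush"},
         {"text": "Team A punt for 40 yds", "statYardage": 0, "kind": "punt"},
     ]},
]

NON_OFFENSIVE = {"kickoff", "punt"}  # assumed labels, not taken from the data

def offensive_plays(drives, team):
    """Yield the team's plays, in order, skipping non-offensive kinds."""
    for drive in drives:
        if drive.get("team", {}).get("displayName") != team:
            continue
        for play in drive.get("plays", []):
            if play.get("kind") not in NON_OFFENSIVE:
                yield play

team_a_plays = list(offensive_plays(drives, "Team A"))
first_play, last_play = team_a_plays[0], team_a_plays[-1]
print(first_play["text"], first_play["statYardage"])
print(last_play["text"], last_play["statYardage"])
```

For a whole season, the drives from the team's earliest and latest games would need to be supplied in date order before applying the same filter.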

In [32]:
#Load first game file
UA_FSU_data = load_json("full/Week 1/400933827 - Alabama vs Florida State.json")
#Load last game file
UA_UGA_data = load_json("full/Bowl/400953415 - Georgia vs Alabama.json")
# get text and yardage from the first offensive play of the first game and last offensive play of their last game
text_1 = UA_FSU_data['drives']['previous'][0]['plays'][1]['text']
yardage_1 = UA_FSU_data['drives']['previous'][0]['plays'][1]['statYardage']
text_2 = UA_UGA_data['drives']['previous'][28]['plays'][1]['text']
yardage_2 = UA_UGA_data['drives']['previous'][28]['plays'][1]['statYardage']

In [33]:
answer_dict["q11"] = f"Alabama's first offensive play of the season went for {yardage_1} yards. {text_1}. Alabama's last offensive play of the season went for {yardage_2} yards. {text_2}."

## Question 12
How many times did Alabama punt in the 2017 season?
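
Counting one drive result for one team is a special case of tallying every (team, result) pair. The sketch below does that on invented games shaped like the `drives` → `previous` structure; note that it matches `displayResult` exactly, so a result string that merely contains the word "Punt" would be tallied under its own key.

```python
from collections import Counter

# Hypothetical games, for illustration only
games = [
    {"drives": {"previous": [
        {"team": {"displayName": "Team A"}, "displayResult": "Punt"},
        {"team": {"displayName": "Team B"}, "displayResult": "Touchdown"},
        {"team": {"displayName": "Team A"}, "displayResult": "Punt"},
    ]}},
    {"drives": {"previous": [
        {"team": {"displayName": "Team A"}, "displayResult": "Field Goal"},
        {"team": {"displayName": "Team C"}, "displayResult": "Punt"},
        {"team": {"displayName": "Team A"}, "displayResult": "Punt"},
    ]}},
]

def iter_drives(games):
    """Yield every drive from every game, tolerating missing keys."""
    for game in games:
        yield from game.get("drives", {}).get("previous", [])

# Tally drives by (team, result) so any team/result pair can be looked up
results = Counter(
    (drive.get("team", {}).get("displayName", "Unknown"), drive.get("displayResult", ""))
    for drive in iter_drives(games)
)
team_a_punts = results[("Team A", "Punt")]
print(team_a_punts)
```

A shared `iter_drives` generator also removes the need to repeat the nested game and drive loops for each question.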

In [34]:
# set a punt counter
punt_count = 0
#error handling
try:
    #iterate through Alabama's game files
    for game, game_path in bama_game_files.items():
        #load json file iteratively
        current_data = load_json(game_path)
        #go to drives
        current_drive_data = current_data['drives']['previous']
        #iterate through the drives in each game
        for drive in current_drive_data:
            #count the drive if it ended in a punt and Alabama had the ball
            if 'Punt' in drive.get('displayResult', '') and drive.get('team', {}).get('displayName', '') == 'Alabama Crimson Tide':
                punt_count += 1

except Exception as e:
    print(f"Error {e} occurred")
#print punt_count
print("Alabama's punt count during the season:", punt_count)

Alabama's punt count during the season: 54


In [35]:
answer_dict["q12"] = punt_count

In [53]:
try:
    #create json file from dict
    with open("mis501_python_project_asingh26.json", "w") as json_file:
        json.dump(answer_dict, json_file, indent=4)
except Exception as e:
    print(f"Error {e} occurred")
