**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, and make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Alex Russell
- Varun Dinesh
- Jamal Karim
- Haohan Zou
- Sydnie Schlagheck

# Research Question


What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)

Is there a relationship between whether an NFL team plays on their home field and the margin by which the team wins or loses?

## Background and Prior Work


Background: The National Football League (NFL) is the top professional American football league, with 32 teams spread across the United States. In this analysis we examine how much home-field advantage affects whether the home team wins, if at all, and the margin by which the home team wins or loses. Home-field advantage is a common concept across sports: many fans believe the home team benefits in several ways. One factor is that the home team avoids long-distance travel in the day or two before the game and is familiar with its home environment. Another is the size and support of the crowd. For example, when the home team is on offense, fans lower their voices so the players can hear their coaches and on-field leaders; when the home team is on defense, fans become extremely loud so the visiting team cannot communicate as easily. Crowd pressure may also affect the referees, who can feel pushed toward calls that the home fans want.

Prior Work: One piece of work estimated the value of home-field advantage and predicted its size for the 2023 season. It found that home-field advantage has been trending downward in recent years: the baseline advantage is usually taken to be 3 points, but the study predicted it would fall to 1.4 points for the 2023 season. https://www.covers.com/nfl/home-field-advantage

Another piece of work looked at why home-field advantage is less effective than it used to be. The two main factors it examined were travel and the fans. Travel takes a heavy physical and mental toll on players, but advances in medicine have helped players reduce its effect on their bodies. The study also found that fans had a significant impact on the players. While positive energy from the home crowd did help players, the study noted that players can become so worried about letting down the fans that they end up costing themselves the win. According to this study, fans also had an effect through referees, who would more often make calls in favor of the home team. However, recent rule changes such as the ability to review calls have reduced this effect. https://www.discovermagazine.com/mind/why-the-home-field-advantage-is-on-the-decline

A third piece of work examined home-field advantage at the level of individual NFL teams. It considers confounding variables such as weather, noise, and time zone. It also cites specific teams' records and argues that some teams do have a home-field advantage while others do not necessarily have one. Possible explanations it offers include that certain teams adapt more easily, that travel across time zones could hurt West Coast teams, and that traction differs between fields. https://www.the33rdteam.com/category/analysis/which-nfl-teams-actually-have-home-field-advantage/


# Hypothesis



- Include your team's hypothesis
- Ensure that this hypothesis is clear to readers
- Explain why you think this will be the outcome (what was your thinking?)

What is your main hypothesis/predictions about what the answer to your question is? Briefly explain your thinking. (2-3 sentences)
- We believe that there will be a statistically significant advantage for the home team. We estimate that teams will score an average of 5 more points at home than away.
- We came to this hypothesis because we believe that familiarity with the home conditions and stadium could improve performance. Also, more fans could attend home games, elevating the morale of the home team in significant ways. We chose 5 because we think that the home team could be improved by almost a touchdown (6 points).
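
One way to make this hypothesis testable is a one-sample t statistic on the home margin. The sketch below is only an illustration: it assumes the margin is defined as home score minus away score, and it uses the five `result` values from the first rows of the games data displayed later in this notebook, which is far too small a sample to support any conclusion.

```python
import numpy as np
import pandas as pd

# Home-team margins (home_score - away_score) for five sample games;
# illustrative only, far too few games to draw any conclusion.
margins = pd.Series([-3.0, 3.0, -43.0, 4.0, 17.0], name="result")

n = len(margins)
mean_margin = margins.mean()
# Standard error of the mean uses the sample standard deviation (ddof=1)
std_err = margins.std(ddof=1) / np.sqrt(n)
# One-sample t statistic for H0: mean home margin == 0
t_stat = mean_margin / std_err

print(f"n={n}, mean margin={mean_margin:.2f}, t={t_stat:.2f}")
```

On the full dataset the same calculation would be run on every game with a true home team, and the resulting mean could be compared with our 5-point estimate.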

# Data

The data that we want should include the game history with variables that include the name of the home team, away team, the score of the game, the date of the game, and a column for which team won the game. Potentially we want to broaden our analysis on this topic to include other environmental factors such as the weather conditions (is it raining, snowing, or fair?) and the type of stadium that the game was played in (open, closed, occupancy). 

We found a starting dataset that covers game history from 1999 onward. Its columns include the home team name, away team name, the date of the game, the score of the game, and an indicator of which team won, along with other potentially important features such as the time of the game, the location, moneylines, and stadium name. The dataset comes from the GitHub repository of nflverse, a collection of data and R packages for NFL analytics. It contains 6,693 observations (about 5,000 after the filtering below), where each observation represents a game between two teams. We may also look for another dataset to explore other environmental factors; for example, a dataset with game or season attendance that we can merge into the dataset we already have. 



In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## NFL Data Dataset

Dataset 1: https://github.com/nflverse/nfldata/blob/master/data/games.csv

Number of observations: 6693 / Number of variables: 46

This dataset contains info about every NFL game since 1999. Some important variables include what the home and away teams are, location, final scores, and even stadium/weather information. Some data types included are strings, dates, and numerical values (ints and floats). The concepts that the data are proxies for include team performance, game outcome, etc.
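
To make the meaning of the score columns concrete, the sketch below checks that `result` is the home margin (home score minus away score) and that `total` is the combined score. It uses only the five sample rows displayed in this notebook, so it confirms the pattern for those rows rather than for the whole file.

```python
import pandas as pd

# Five sample games copied from the first rows of the dataset
sample = pd.DataFrame({
    "away_team": ["MIN", "KC", "PIT", "OAK", "BUF"],
    "away_score": [17.0, 17.0, 43.0, 24.0, 14.0],
    "home_team": ["ATL", "CHI", "CLE", "GB", "IND"],
    "home_score": [14.0, 20.0, 0.0, 28.0, 31.0],
    "result": [-3.0, 3.0, -43.0, 4.0, 17.0],
    "total": [31.0, 37.0, 43.0, 52.0, 45.0],
})

# `result` should be the home margin, and `total` the combined score
margin_ok = bool(((sample["home_score"] - sample["away_score"]) == sample["result"]).all())
total_ok = bool(((sample["home_score"] + sample["away_score"]) == sample["total"]).all())
print(margin_ok, total_ok)
```

A positive `result` therefore means the home team won, which is the quantity our research question is about.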



In [3]:
df = pd.read_csv('Datasets/games1.csv')
print(df.shape)
pd.set_option("display.max_columns", 46)
df.head()
#df1['overtime'].value_counts()

(6693, 46)


Unnamed: 0,game_id,season,game_type,week,gameday,weekday,gametime,away_team,away_score,home_team,home_score,location,result,total,overtime,old_game_id,gsis,nfl_detail_id,pfr,pff,espn,ftn,away_rest,home_rest,away_moneyline,home_moneyline,spread_line,away_spread_odds,home_spread_odds,total_line,under_odds,over_odds,div_game,roof,surface,temp,wind,away_qb_id,home_qb_id,away_qb_name,home_qb_name,away_coach,home_coach,referee,stadium_id,stadium
0,1999_01_MIN_ATL,1999,REG,1,1999-09-12,Sunday,,MIN,17.0,ATL,14.0,Home,-3.0,31.0,0.0,1999091210,598.0,,199909120atl,,190912001,,7,7,,,-4.0,,,49.0,,,0,dome,astroturf,,,00-0003761,00-0002876,Randall Cunningham,Chris Chandler,Dennis Green,Dan Reeves,Gerry Austin,ATL00,Georgia Dome
1,1999_01_KC_CHI,1999,REG,1,1999-09-12,Sunday,,KC,17.0,CHI,20.0,Home,3.0,37.0,0.0,1999091206,597.0,,199909120chi,,190912003,,7,7,,,-3.0,,,38.0,,,0,outdoors,grass,80.0,12.0,00-0006300,00-0010560,Elvis Grbac,Shane Matthews,Gunther Cunningham,Dick Jauron,Phil Luckett,CHI98,Soldier Field
2,1999_01_PIT_CLE,1999,REG,1,1999-09-12,Sunday,,PIT,43.0,CLE,0.0,Home,-43.0,43.0,0.0,1999091213,604.0,,199909120cle,,190912005,,7,7,,,-6.0,,,37.0,,,1,outdoors,grass,78.0,12.0,00-0015700,00-0004230,Kordell Stewart,Ty Detmer,Bill Cowher,Chris Palmer,Bob McElwee,CLE00,Cleveland Browns Stadium
3,1999_01_OAK_GB,1999,REG,1,1999-09-12,Sunday,,OAK,24.0,GB,28.0,Home,4.0,52.0,0.0,1999091208,602.0,,199909120gnb,,190912009,,7,7,,,9.0,,,43.0,,,0,outdoors,grass,67.0,10.0,00-0005741,00-0005106,Rich Gannon,Brett Favre,Jon Gruden,Ray Rhodes,Tony Corrente,GNB00,Lambeau Field
4,1999_01_BUF_IND,1999,REG,1,1999-09-12,Sunday,,BUF,14.0,IND,31.0,Home,17.0,45.0,0.0,1999091202,591.0,,199909120clt,,190912011,,7,7,,,-3.0,,,45.5,,,1,dome,astroturf,,,00-0005363,00-0010346,Doug Flutie,Peyton Manning,Wade Phillips,Jim Mora,Ron Blum,IND99,RCA Dome


In [4]:
print("Data1: ", df.columns)

Data1:  Index(['game_id', 'season', 'game_type', 'week', 'gameday', 'weekday',
       'gametime', 'away_team', 'away_score', 'home_team', 'home_score',
       'location', 'result', 'total', 'overtime', 'old_game_id', 'gsis',
       'nfl_detail_id', 'pfr', 'pff', 'espn', 'ftn', 'away_rest', 'home_rest',
       'away_moneyline', 'home_moneyline', 'spread_line', 'away_spread_odds',
       'home_spread_odds', 'total_line', 'under_odds', 'over_odds', 'div_game',
       'roof', 'surface', 'temp', 'wind', 'away_qb_id', 'home_qb_id',
       'away_qb_name', 'home_qb_name', 'away_coach', 'home_coach', 'referee',
       'stadium_id', 'stadium'],
      dtype='object')


In [5]:
# Keep only the columns relevant to our analysis (copy to avoid modifying a view of the original)
df = df[['game_id', 'season', 'game_type', 'gameday', 'away_team', 'away_score', 'home_team', 'home_score',
       'location', 'result', 'total', 'overtime', 'spread_line','div_game',
       'roof', 'temp', 'wind', 'stadium_id', 'stadium']].copy()
print(df.shape)
df.head()

(6693, 19)


Unnamed: 0,game_id,season,game_type,gameday,away_team,away_score,home_team,home_score,location,result,total,overtime,spread_line,div_game,roof,temp,wind,stadium_id,stadium
0,1999_01_MIN_ATL,1999,REG,1999-09-12,MIN,17.0,ATL,14.0,Home,-3.0,31.0,0.0,-4.0,0,dome,,,ATL00,Georgia Dome
1,1999_01_KC_CHI,1999,REG,1999-09-12,KC,17.0,CHI,20.0,Home,3.0,37.0,0.0,-3.0,0,outdoors,80.0,12.0,CHI98,Soldier Field
2,1999_01_PIT_CLE,1999,REG,1999-09-12,PIT,43.0,CLE,0.0,Home,-43.0,43.0,0.0,-6.0,1,outdoors,78.0,12.0,CLE00,Cleveland Browns Stadium
3,1999_01_OAK_GB,1999,REG,1999-09-12,OAK,24.0,GB,28.0,Home,4.0,52.0,0.0,9.0,0,outdoors,67.0,10.0,GNB00,Lambeau Field
4,1999_01_BUF_IND,1999,REG,1999-09-12,BUF,14.0,IND,31.0,Home,17.0,45.0,0.0,-3.0,1,dome,,,IND99,RCA Dome


In [6]:
# Look into invalid/missing values and address them
# Exclude the 2023 season
df = df[df['season'] < 2023]
# Keep only games with a true home team (drops neutral-site games)
df = df[df['location'] == 'Home']
# Check which columns still contain missing values
df.isna().any()


game_id        False
season         False
game_type      False
gameday        False
away_team      False
away_score     False
home_team      False
home_score     False
location       False
result         False
total          False
overtime       False
spread_line    False
div_game       False
roof           False
temp            True
wind            True
stadium_id     False
stadium        False
dtype: bool
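
The missing `temp` and `wind` values seem to be tied at least partly to roof type: in the sample rows above, both dome games have no weather readings. A minimal sketch of how to check this, using only those five sample rows (the full data also has some outdoor games with missing temperature, such as 2019_19_HOU_KC):

```python
import numpy as np
import pandas as pd

# Roof type and weather readings copied from the five sample rows
sample = pd.DataFrame({
    "roof": ["dome", "outdoors", "outdoors", "outdoors", "dome"],
    "temp": [np.nan, 80.0, 78.0, 67.0, np.nan],
    "wind": [np.nan, 12.0, 12.0, 10.0, np.nan],
})

# Share of games with missing weather values, per roof type
missing_by_roof = sample[["temp", "wind"]].isna().groupby(sample["roof"]).mean()
print(missing_by_roof)
```

If the pattern holds on the full data, missing weather is mostly structural rather than random, which matters for how we treat those rows in any weather analysis.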

In [7]:
# Restrict to the 2000-2019 seasons so the games line up with the years in the attendance dataset
df = df[(df['season']<2020) & (df['season']>1999)]
print(df.shape)
df

(5267, 19)


Unnamed: 0,game_id,season,game_type,gameday,away_team,away_score,home_team,home_score,location,result,total,overtime,spread_line,div_game,roof,temp,wind,stadium_id,stadium
259,2000_01_SF_ATL,2000,REG,2000-09-03,SF,28.0,ATL,36.0,Home,8.0,64.0,0.0,6.5,1,dome,,,ATL00,Georgia Dome
260,2000_01_JAX_CLE,2000,REG,2000-09-03,JAX,27.0,CLE,7.0,Home,-20.0,34.0,0.0,-10.5,1,outdoors,78.0,6.0,CLE00,Cleveland Browns Stadium
261,2000_01_IND_KC,2000,REG,2000-09-03,IND,27.0,KC,14.0,Home,-13.0,41.0,0.0,-3.5,0,outdoors,90.0,5.0,KAN00,Arrowhead Stadium
262,2000_01_CHI_MIN,2000,REG,2000-09-03,CHI,27.0,MIN,30.0,Home,3.0,57.0,0.0,4.5,1,dome,,,MIN00,Hubert H. Humphrey Metrodome
263,2000_01_TB_NE,2000,REG,2000-09-03,TB,21.0,NE,16.0,Home,-5.0,37.0,0.0,-3.0,0,outdoors,71.0,5.0,BOS99,Foxboro Stadium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5577,2019_19_TEN_BAL,2019,DIV,2020-01-11,TEN,28.0,BAL,12.0,Home,-16.0,40.0,0.0,10.0,0,outdoors,69.0,12.0,BAL00,M&T Bank Stadium
5578,2019_19_HOU_KC,2019,DIV,2020-01-12,HOU,31.0,KC,51.0,Home,20.0,82.0,0.0,10.0,0,outdoors,,,KAN00,Arrowhead Stadium
5579,2019_19_SEA_GB,2019,DIV,2020-01-12,SEA,23.0,GB,28.0,Home,5.0,51.0,0.0,4.5,0,outdoors,23.0,8.0,GNB00,Lambeau Field
5580,2019_20_TEN_KC,2019,CON,2020-01-19,TEN,24.0,KC,35.0,Home,11.0,59.0,0.0,7.5,0,outdoors,,,KAN00,Arrowhead Stadium


In [8]:
#any other stuff we want: maybe graph/look at dist, remove obvious outliers, etc
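
One possible approach to the outlier check noted in the cell above is the 1.5 x IQR rule applied to the home margin. This is a sketch under assumptions: the rule itself is our choice here, and the values are the `result` entries from the sample rows displayed in this notebook, not the full dataset.

```python
import pandas as pd

# Home margins (`result`) copied from the sample rows displayed in this notebook
margins = pd.Series([-3.0, 3.0, -43.0, 4.0, 17.0, 8.0, -20.0, -13.0, 3.0,
                     -5.0, -16.0, 20.0, 5.0, 11.0])

q1, q3 = margins.quantile(0.25), margins.quantile(0.75)
iqr = q3 - q1
# Flag values more than 1.5 * IQR outside the middle 50% of the data
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = margins[(margins < lower) | (margins > upper)]
print(outliers.tolist())
```

A flagged value is not necessarily an error: the 43-0 game is a real result, so flagged games should be inspected before any are removed.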

## Stadium Dataset 

Dataset 2: https://www.kaggle.com/datasets/sujaykapadnis/nfl-stadium-attendance-dataset 

Number of observations: 10846 / Number of variables: 8 

This dataset provides info about attendance at NFL games. Some important variables include team, home and away attendance, and weekly attendance. The teams and team home locations are strings, and the other variables like year and total are integers. The integer values for attendance are proxies for engagement and how many fans support either side.
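
The `total`, `home`, and `away` columns appear to be season-level figures repeated on every weekly row. A quick sanity check, sketched on three team-seasons whose values are shown in this notebook, is that home and away attendance add up to the total:

```python
import pandas as pd

# Season-level attendance figures copied from rows shown in this notebook
sample = pd.DataFrame({
    "team_name": ["Cardinals", "Vikings", "Chiefs"],
    "year": [2000, 2000, 2000],
    "total": [893926, 1029262, 1115272],
    "home": [387475, 513322, 631365],
    "away": [506451, 515940, 483907],
})

# If `total` is the season sum, home + away should reproduce it exactly
consistent = bool((sample["home"] + sample["away"] == sample["total"]).all())
print(consistent)
```

Because these columns do not vary by week, `weekly_attendance` is the per-game measure to use when relating attendance to individual game results.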

How we plan to combine these datasets: The correlation between score and home-field advantage in the first dataset will serve as a baseline, which we will compare against the correlation between score and stadium attendance in the second dataset. This will let us investigate whether stadium attendance is a confounding variable. 


In [4]:
# Import three datasets

stadium = pd.read_csv("Datasets/attendance.csv")
print(stadium.shape)
stadium.head()

(10846, 8)


Unnamed: 0,team,team_name,year,total,home,away,week,weekly_attendance
0,Arizona,Cardinals,2000,893926,387475,506451,1,77434.0
1,Arizona,Cardinals,2000,893926,387475,506451,2,66009.0
2,Arizona,Cardinals,2000,893926,387475,506451,3,
3,Arizona,Cardinals,2000,893926,387475,506451,4,71801.0
4,Arizona,Cardinals,2000,893926,387475,506451,5,66985.0


In [6]:
games = pd.read_csv("Datasets/games.csv")
print(games.shape)
games.head()

(5324, 19)


Unnamed: 0,year,week,home_team,away_team,winner,tie,day,date,time,pts_win,pts_loss,yds_win,turnovers_win,yds_loss,turnovers_loss,home_team_name,home_team_city,away_team_name,away_team_city
0,2000,1,Minnesota Vikings,Chicago Bears,Minnesota Vikings,,Sun,September 3,1:00PM,30,27,374,1,425,1,Vikings,Minnesota,Bears,Chicago
1,2000,1,Kansas City Chiefs,Indianapolis Colts,Indianapolis Colts,,Sun,September 3,1:00PM,27,14,386,2,280,1,Chiefs,Kansas City,Colts,Indianapolis
2,2000,1,Washington Redskins,Carolina Panthers,Washington Redskins,,Sun,September 3,1:01PM,20,17,396,0,236,1,Redskins,Washington,Panthers,Carolina
3,2000,1,Atlanta Falcons,San Francisco 49ers,Atlanta Falcons,,Sun,September 3,1:02PM,36,28,359,1,339,1,Falcons,Atlanta,49ers,San Francisco
4,2000,1,Pittsburgh Steelers,Baltimore Ravens,Baltimore Ravens,,Sun,September 3,1:02PM,16,0,336,0,223,1,Steelers,Pittsburgh,Ravens,Baltimore


In [7]:
standings = pd.read_csv('Datasets/standings.csv')
print(standings.shape)
standings.head()

(638, 15)


Unnamed: 0,team,team_name,year,wins,loss,points_for,points_against,points_differential,margin_of_victory,strength_of_schedule,simple_rating,offensive_ranking,defensive_ranking,playoffs,sb_winner
0,Miami,Dolphins,2000,11,5,323,226,97,6.1,1.0,7.1,0.0,7.1,Playoffs,No Superbowl
1,Indianapolis,Colts,2000,10,6,429,326,103,6.4,1.5,7.9,7.1,0.8,Playoffs,No Superbowl
2,New York,Jets,2000,9,7,321,321,0,0.0,3.5,3.5,1.4,2.2,No Playoffs,No Superbowl
3,Buffalo,Bills,2000,8,8,315,350,-35,-2.2,2.2,0.0,0.5,-0.5,No Playoffs,No Superbowl
4,New England,Patriots,2000,5,11,276,338,-62,-3.9,1.4,-2.5,-2.7,0.2,No Playoffs,No Superbowl


In [8]:
# See missing values in the dataset
stadium.isna().sum()

team                   0
team_name              0
year                   0
total                  0
home                   0
away                   0
week                   0
weekly_attendance    638
dtype: int64

In [9]:
games.isna().sum()

year                 0
week                 0
home_team            0
away_team            0
winner               0
tie               5314
day                  0
date                 0
time                 0
pts_win              0
pts_loss             0
yds_win              0
turnovers_win        0
yds_loss             0
turnovers_loss       0
home_team_name       0
home_team_city       0
away_team_name       0
away_team_city       0
dtype: int64

In [10]:
standings.isna().sum()

team                    0
team_name               0
year                    0
wins                    0
loss                    0
points_for              0
points_against          0
points_differential     0
margin_of_victory       0
strength_of_schedule    0
simple_rating           0
offensive_ranking       0
defensive_ranking       0
playoffs                0
sb_winner               0
dtype: int64

In the games dataset, the `tie` column uses NaN for games that did not end in a tie. Here we convert that column to a boolean, where True means the game ended in a tie and False otherwise. 

In [11]:
# True where the raw column has a value (the game was a tie), False where it was NaN
games['tie'] = games['tie'].notna()
games.isna().sum()

year              0
week              0
home_team         0
away_team         0
winner            0
tie               0
day               0
date              0
time              0
pts_win           0
pts_loss          0
yds_win           0
turnovers_win     0
yds_loss          0
turnovers_loss    0
home_team_name    0
home_team_city    0
away_team_name    0
away_team_city    0
dtype: int64

In [12]:
# Drop weeks with no recorded attendance
stadium = stadium.dropna(subset=['weekly_attendance'])
stadium.isna().sum()

team                 0
team_name            0
year                 0
total                0
home                 0
away                 0
week                 0
weekly_attendance    0
dtype: int64

In the games dataset, we do not need the detailed game statistics, so we keep only the columns about the score and the final result.

In [13]:
# Drop detailed game statistics and the redundant team name/city columns
games_1 = games.drop(columns = ['time', 'turnovers_win', 'turnovers_loss', 'yds_win', 'yds_loss', 'home_team_name', 'home_team_city', 'away_team_name','away_team_city'])
games_cleaned = games_1.rename(columns = {"day": "day_of_week"})
# Signed margin from the home team's perspective: positive if the home team won, negative if it lost
sign = (games_cleaned['home_team'] == games_cleaned['winner']).apply(lambda x: 1 if x else -1)
games_cleaned['points_diff'] = (games_cleaned['pts_win'] - games_cleaned['pts_loss']) * sign
games_cleaned.head()

Unnamed: 0,year,week,home_team,away_team,winner,tie,day_of_week,date,pts_win,pts_loss,points_diff
0,2000,1,Minnesota Vikings,Chicago Bears,Minnesota Vikings,False,Sun,September 3,30,27,3
1,2000,1,Kansas City Chiefs,Indianapolis Colts,Indianapolis Colts,False,Sun,September 3,27,14,-13
2,2000,1,Washington Redskins,Carolina Panthers,Washington Redskins,False,Sun,September 3,20,17,3
3,2000,1,Atlanta Falcons,San Francisco 49ers,Atlanta Falcons,False,Sun,September 3,36,28,8
4,2000,1,Pittsburgh Steelers,Baltimore Ravens,Baltimore Ravens,False,Sun,September 3,16,0,-16
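
The signed margin above can be sanity-checked on a few rows. The sketch below recomputes it for three of the displayed games using `np.where` as a vectorized alternative to `apply`; the column names follow the games data, and the small frame itself is just an illustration.

```python
import numpy as np
import pandas as pd

# Three games copied from the output above
sample = pd.DataFrame({
    "home_team": ["Minnesota Vikings", "Kansas City Chiefs", "Pittsburgh Steelers"],
    "winner": ["Minnesota Vikings", "Indianapolis Colts", "Baltimore Ravens"],
    "pts_win": [30, 27, 16],
    "pts_loss": [27, 14, 0],
})

# +1 when the home team won, -1 otherwise (vectorized alternative to apply)
sign = np.where(sample["home_team"] == sample["winner"], 1, -1)
sample["points_diff"] = (sample["pts_win"] - sample["pts_loss"]) * sign
print(sample["points_diff"].tolist())  # [3, -13, -16]
```

For a tie, `pts_win` equals `pts_loss`, so the margin is 0 regardless of the sign.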


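In the stadium dataset, we combine the team city and team name into a single full team name, matching the format used in the games dataset, and drop the separate city column.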

In [18]:
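stadium_cleaned = stadium.copy()
# Build the full team name, e.g. "Arizona" + "Cardinals" -> "Arizona Cardinals"
stadium_cleaned['team_name'] = stadium_cleaned['team'] + " " + stadium_cleaned['team_name']
# Drop the city column from the modified copy (not from the original frame)
stadium_cleaned = stadium_cleaned.drop(columns='team')
stadium_cleaned.head()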

Unnamed: 0,team_name,year,total,home,away,week,weekly_attendance
0,Arizona Cardinals,2000,893926,387475,506451,1,77434.0
1,Arizona Cardinals,2000,893926,387475,506451,2,66009.0
3,Arizona Cardinals,2000,893926,387475,506451,4,71801.0
4,Arizona Cardinals,2000,893926,387475,506451,5,66985.0
5,Arizona Cardinals,2000,893926,387475,506451,6,44296.0


In [17]:
# Remove any non-regular season games
games_modified = games_cleaned[games_cleaned['week'].str.isdigit()]
games_modified.shape

(5104, 11)
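
The filter above relies on regular-season weeks being stored as purely numeric strings. A minimal sketch of the idea follows; the playoff labels in it are hypothetical stand-ins, since the actual non-numeric labels in the file are not shown in this notebook.

```python
import pandas as pd

# Hypothetical week labels: the playoff names here are assumed for illustration
weeks = pd.Series(["1", "17", "WildCard", "SuperBowl"], name="week")

# Regular-season weeks are purely numeric strings
is_regular = weeks.str.isdigit()
print(weeks[is_regular].tolist())  # ['1', '17']
```

Note that `.str.isdigit()` only works on string columns, so this approach depends on `week` having been read in as text.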

Next, we link these two datasets so that a single dataset contains both the attendance and the final result of each game.

In [27]:
games_temp = games_modified.copy()
games_temp['key'] = games_temp['home_team'] + ' ' + games_temp['year'].astype(str) + ' ' + games_temp['week'].astype(str)

stadium_temp = stadium_cleaned.copy()
stadium_temp['key'] = stadium_temp['team_name'] + ' ' + stadium_temp['year'].astype(str) + ' ' + stadium_temp['week'].astype(str)

complete_df = games_temp.merge(stadium_temp, left_on = 'key', right_on = 'key')
complete_df = complete_df.drop(columns = ['key', 'week_y', 'year_y', 'day_of_week', 'team_name'])
complete_df = complete_df.rename(columns = {'year_x': 'year', 'week_x': 'week'})
complete_df

Unnamed: 0,year,week,home_team,away_team,winner,tie,date,pts_win,pts_loss,points_diff,total,home,away,weekly_attendance
0,2000,1,Minnesota Vikings,Chicago Bears,Minnesota Vikings,False,September 3,30,27,3,1029262,513322,515940,64104.0
1,2000,1,Kansas City Chiefs,Indianapolis Colts,Indianapolis Colts,False,September 3,27,14,-13,1115272,631365,483907,78357.0
2,2000,1,Washington Redskins,Carolina Panthers,Washington Redskins,False,September 3,20,17,3,1174332,647424,526908,80257.0
3,2000,1,Atlanta Falcons,San Francisco 49ers,Atlanta Falcons,False,September 3,36,28,8,964579,422814,541765,54626.0
4,2000,1,Pittsburgh Steelers,Baltimore Ravens,Baltimore Ravens,False,September 3,16,0,-16,987037,440426,546611,55049.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5099,2019,17,New York Giants,Philadelphia Eagles,Philadelphia Eagles,False,December 29,34,17,-17,1143109,597316,545793,75029.0
5100,2019,17,Dallas Cowboys,Washington Redskins,Dallas Cowboys,False,December 29,47,16,31,1289027,727432,561595,90646.0
5101,2019,17,Baltimore Ravens,Pittsburgh Steelers,Baltimore Ravens,False,December 29,28,10,18,1091363,565020,526343,70695.0
5102,2019,17,Los Angeles Rams,Arizona Cardinals,Los Angeles Rams,False,December 29,31,24,7,1147715,582325,565390,68665.0
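
The merged frame's index ends at 5102, which suggests 5,103 rows against the 5,104 regular-season games, so at least one game may not have found a matching attendance row. One way to audit this is a left merge on the key columns with `indicator=True`. The sketch below is an illustration on made-up small frames: the values come from rows shown above, but the Atlanta attendance row is deliberately omitted to produce an unmatched game.

```python
import pandas as pd

games_small = pd.DataFrame({
    "home_team": ["Minnesota Vikings", "Kansas City Chiefs", "Atlanta Falcons"],
    "year": [2000, 2000, 2000],
    "week": ["1", "1", "1"],
    "points_diff": [3, -13, 8],
})
# Atlanta is left out on purpose so the audit has something to find
attendance_small = pd.DataFrame({
    "home_team": ["Minnesota Vikings", "Kansas City Chiefs"],
    "year": [2000, 2000],
    "week": [1, 1],
    "weekly_attendance": [64104.0, 78357.0],
})

# Align dtypes first: week is a string in the games data but an integer here
attendance_small["week"] = attendance_small["week"].astype(str)

merged = games_small.merge(
    attendance_small,
    on=["home_team", "year", "week"],
    how="left",             # keep every game, matched or not
    indicator=True,         # adds a `_merge` column flagging unmatched rows
    validate="one_to_one",  # raise if a key is duplicated on either side
)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["home_team"].tolist())  # ['Atlanta Falcons']
```

Merging on the columns directly also avoids building a concatenated string key, and the `validate` argument guards against accidental row duplication.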


# Ethics & Privacy

Although the concept of home-field advantage is relevant across sports and skill levels, we decided to study the highly visible NFL to get the most comprehensive, most reputable, and least biased data possible. There are few privacy or terms-of-use issues in our data because NFL match and player data are publicly available. In the second dataset, fans did not necessarily know that their attendance was being recorded, but attendance counts are not attached to any identifying information, so this should not be a major ethical issue. A potential unintended consequence is that if a definitive or significant home-field advantage becomes common knowledge, teams could try to exploit it.

A potential bias in our first dataset is that it includes neutral-site games where neither team had home-field advantage; these could be irrelevant or distracting to our question, so we remove these observations from the data. Additionally, there may be sources of bias in measuring home-field advantage from NFL data because home stadiums differ in size, hold different mixes of home and visiting fans, and vary in other ways that are difficult to measure or account for. In terms of equitable impact, the reach of our analysis may be limited by the significant differences between NFL games and games in other sports or leagues, so our results are unlikely to apply directly to those other populations.

We can guard against specific biases before analysis by choosing broad datasets that include every game and as many statistics as possible. During and after analysis, we can detect biases by checking the data for outliers, unexpected patterns, and confounding variables that indicate significant bias, and then addressing those issues as they arise. We addressed other potential bias issues by targeting data sources that contain comprehensive match results (they leave nothing out for the seasons they cover) and are unaffiliated with a specific team or stadium.


# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* In our team, each group member will show respect to every other member, no matter what conflicts arise. 
* Each person in our group should be responsible and diligent in tackling the tasks they are assigned. 
* We shall assign each member tasks that best match their background and strengths, with equal workloads. If anyone has difficulties with their work and can't resolve them alone, they will reach out to others for help as soon as possible. 
* When conflict arises, we will work as a team, analyzing the pros and cons of each strategy and choosing the best solution. 
* All members in our group should actively participate in discussions, though absence due to a time conflict or emergency is understandable. In discussions, everyone should listen carefully when other members are sharing their ideas. 
* Each member in our group should finish their tasks on time. If anyone can't finish the work before the deadline we agreed on, they must notify the group and explain in advance. 
* Before each deadline (proposal, check-in, and final project), each member is responsible for looking over our overall work and fixing any typos, inappropriate expressions, and coding errors. 


# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/30  |  7 PM | Brainstorm topics + questions  | Decide on final project topic, question(s), and hypothesis | 
| 10/31  |  4 PM |  Prepare drafts of each section - each person does one | Go over drafts, edit, finalize, and submit proposal (due 11/1) | 
| 11/7  | 4:15 PM  | Look for datasets - 1+ each | Discuss wrangling and potential analytical approaches, assign group members to lead certain pieces   |
| 11/13  | 7 PM  | Import data, remove unnecessary data | Review/edit datasets, submit data checkpoint (due 11/15)   |
| 11/20  | 7 PM  | Finalize wrangling/EDA, start analyzing | Discuss and edit analysis |
| 11/27  | 7 PM  | Edit analysis independently | Submit EDA checkpoint (due 11/29) |
| 12/4  | 7 PM  | Finish analysis, draft results/conclusion/discussion | Discuss + edit full project |
| 12/11  | 7 PM  | Final edits | Turn in final project, video, + last group progress survey (due 12/13) |