# The Baseball Parity Project: Methodology

In this project, I examine parity in Major League Baseball during the divisional era (1995-2021) by visualizing the standard deviation of team win totals on each day of the regular season, year by year.

The last few full seasons have shown an increased disparity in team wins across the league. As of the 2021 All-Star Break, the annual pause in the schedule that marks the unofficial halfway point of the season, the current season is trending toward another highly disparate distribution.

This matters because a higher deviation suggests that a larger share of games are being played non-competitively, especially toward the end of the season, when some teams have been effectively eliminated from playoff contention.
 

### Getting Started

This project began with the idea of visualizing the standings on each day of the schedule during Major League Baseball's divisional era (1995-2021). To do that, I had to build a scraper for the daily standings page on Baseball-Reference.com (https://www.baseball-reference.com/boxes/), which shows the standings as of any given date.

With the scraped standings collected in a dataframe, I could then compute the daily standard deviation in wins and build my first visualization from it.

### Building the Scraper

I built the scraper using Beautiful Soup in a file found in the repo called 'Baseball Parity Calendar-Creating a DataFrame with Wins.ipynb'.

In the file, I start by parsing the standings from a specific day in July 2019, which can be found at https://www.baseball-reference.com/boxes/?month=7&day=1&year=2019. The URL encodes the month, day, and year as query parameters, so once the scraper worked for one day, I could generate a list of dates and loop over them.
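As a minimal sketch of that idea, the snippet below builds one URL per calendar date using the query format shown above. The helper names `standings_url` and `date_range` are hypothetical, and the three-day window is only an example; the notebook may generate its dates differently.

```python
from datetime import date, timedelta

BASE_URL = "https://www.baseball-reference.com/boxes/"


def standings_url(day):
    """Build the daily standings URL from the month/day/year query format."""
    return f"{BASE_URL}?month={day.month}&day={day.day}&year={day.year}"


def date_range(start, end):
    """Return every calendar date from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


# Hypothetical window; actual season start and end dates differ by year.
urls = [standings_url(d) for d in date_range(date(2019, 7, 1), date(2019, 7, 3))]
```

Using `datetime.date` rather than hand-built strings avoids generating dates that do not exist, such as a 31st day in a 30-day month.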

I began by calling all the divisional standings tables on the page and drilling down until I could isolate the specific statistics I wanted to analyze. For each division, I created a dataframe with the columns of team, wins, winning percentage (wp), and games back from first place (gb).
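The drilling-down step might look roughly like the sketch below. It runs against a small invented HTML sample, not the real page, so the tag structure and column order are assumptions, and `parse_standings` is a hypothetical helper name. For brevity it collects all divisions into one dataframe rather than one per division.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented markup for illustration only; the real page's tags and
# column order may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Tm</th><th>W</th><th>L</th><th>W-L%</th><th>GB</th></tr>
  <tr><td>Team A</td><td>50</td><td>30</td><td>.625</td><td>--</td></tr>
  <tr><td>Team B</td><td>40</td><td>40</td><td>.500</td><td>10.0</td></tr>
</table>
<table>
  <tr><th>Tm</th><th>W</th><th>L</th><th>W-L%</th><th>GB</th></tr>
  <tr><td>Team C</td><td>45</td><td>35</td><td>.563</td><td>--</td></tr>
</table>
"""


def parse_standings(html):
    """Collect team, wins, wp, and gb from every table in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for table in soup.find_all("table"):
        for row in table.find_all("tr")[1:]:  # skip the header row
            cells = [c.get_text(strip=True) for c in row.find_all("td")]
            if len(cells) < 5:
                continue  # ignore rows that are not full team lines
            team, wins, _losses, wp, gb = cells[:5]
            records.append(
                {"team": team, "wins": int(wins), "wp": float(wp), "gb": gb}
            )
    return pd.DataFrame(records, columns=["team", "wins", "wp", "gb"])


standings = parse_standings(SAMPLE_HTML)
```

The `gb` column is left as text here because the division leader's entry is not a number in this sample.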

Once I had this code, I built a list of all the dates I wanted to include, grouped into one list per year. From each list, I built the URL for every day and then wrote a for loop to run my scraper over all of the dates in that list.

### Year by Year

Now I had a list of dates and a scraper, and I wanted to run the scraper across all of the years to get a dataframe for each day in my planned visualization. Scraping every season in a single pass was very slow, and I wanted more control when debugging, so I decided to run the scraper one year at a time to:
1. Generate dataframes for all days,
2. Analyze each dataframe for the standard deviation in win totals across the league, and
3. Add those daily standard deviations to their own dataframe, leaving us with a dataframe of daily standard deviation values where the rows are dates of the calendar and the columns are years.
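The three steps above can be sketched as follows. The win totals are placeholders rather than scraped standings, the `wins_by_day` layout is a hypothetical stand-in for the per-day dataframes, and the choice of sample standard deviation (the pandas default, `ddof=1`) is an assumption; the notebook may use the population version instead.

```python
import pandas as pd

# Placeholder win totals, not real standings:
# {year: {"MM-DD": [wins for each team on that day]}}
wins_by_day = {
    2018: {"07-01": [50, 40, 45], "07-02": [51, 40, 46]},
    2019: {"07-01": [48, 44, 40], "07-02": [49, 44, 41]},
}


def daily_std(wins):
    """Standard deviation of team wins for one day (pandas default, ddof=1)."""
    return pd.Series(wins).std()


# Outer keys become columns (years); inner keys become rows (calendar dates).
std_by_year = pd.DataFrame(
    {
        year: {day: daily_std(wins) for day, wins in days.items()}
        for year, days in wins_by_day.items()
    }
)
std_by_year.index.name = "date"
```

Keying rows by month and day rather than by full date is what lets different seasons line up in the same row.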

### Creating a Second DataFrame

Once I had this dataframe, I did some exploration of what a visualization might look like and settled on a horizontal calendar heatmap with the years stacked vertically. Even at this stage, I could see that the last few full seasons showed a trend of increasing disparity.

But because of the nature of this visualization, the pace of the deviation in the shortened 60-game 2020 season (due to coronavirus) and the in-progress 2021 season didn't come through clearly. So I decided to create a second dataframe to show how the values in the last two partial seasons would compare to all seasons if adjusted for a full, 162-game season.

For this dataframe, I needed the end-of-season standard deviation and the total number of games scheduled for each season. Every season in this span was scheduled for 162 games except 1995 (because of a player strike), 2020, and the current season (still in progress), so I was able to create this dataframe easily using data from my previous dataframe and these season game totals.
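One simple way to make that adjustment is to scale each season's final standard deviation linearly by 162 divided by the games played. This is an assumption for illustration, since the exact formula is not stated here, and the `final_std` values below are placeholders rather than project results. The 144-game figure is the length of the strike-shortened 1995 schedule.

```python
import pandas as pd

FULL_SEASON_GAMES = 162

# Games per season: 144 in 1995, 60 in 2020, 162 in a full season.
# The final_std values are placeholders, not results from the project.
seasons = pd.DataFrame(
    {"games": [144, 162, 60], "final_std": [10.0, 12.0, 5.0]},
    index=pd.Index([1995, 2019, 2020], name="year"),
)

# Assumed adjustment: project each season's spread onto a 162-game pace.
seasons["adjusted_std"] = (
    seasons["final_std"] * FULL_SEASON_GAMES / seasons["games"]
)
```

Linear scaling treats every team as continuing at its current pace, so it is a pace projection rather than a forecast.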

### Melting

For the first visual, I was hoping to use Flourish's heatmap template, but I quickly realized I would need to change the format of my data. Although I wanted to display years vertically and days horizontally, the template required the two identifiers, date and year, to each sit in their own column, with one row per value. So I used pd.melt to reshape my first dataframe from wide to long format.
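A minimal sketch of that reshape is below, using placeholder values. The column names `date`, `year`, and `std_wins` are assumptions chosen for illustration, not necessarily the names used in the notebook.

```python
import pandas as pd

# Wide layout with placeholder values:
# one row per calendar date, one column per year.
wide = pd.DataFrame(
    {"date": ["07-01", "07-02"], 2018: [5.0, 5.5], 2019: [4.0, 4.2]}
)

# Long layout: one row per (date, year) pair with its value.
long = pd.melt(wide, id_vars="date", var_name="year", value_name="std_wins")
```

After melting, each row holds a single heatmap cell, which is the shape that long-format chart templates generally expect.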

### Visualization

From here, I was able to visualize my first dataframe as a heatmap in Flourish and my second dataframe as a column chart in Datawrapper. The column chart showed that, when adjusted for a 162-game season, both 2020 and the first half of 2021 continued the trend toward a highly disparate distribution.