# Homework: Testing Knowledge of the Pandas Data Analysis Library

You are provided with a dataset containing information on the results of the 2016 baseball season.

Your task is to load the dataset (it is included in the archive with this homework assignment), clean it if necessary, and answer the given analytical questions.

List of columns presented in the dataset:

---
* attendance – number of spectators at the game
* away_team – name of the away team
* away_team_errors – number of errors made by the away team
* away_team_hits – number of hits by the away team (a hit in baseball is when the batter successfully reaches first base)
* away_team_runs – number of runs scored by the away team (a run is a point scored by the offensive player)
* date – date of the game
* field_type – type of playing field
* game_type – type of game
* home_team – name of the home team
* home_team_errors – number of errors made by the home team
* home_team_hits – number of hits by the home team
* home_team_runs – number of runs scored by the home team
* start_time – start time of the game
* venue – name of the venue (stadium, field, arena)
* day_of_week – day of the week the game was played
* temperature – air temperature on the day of the game (in Fahrenheit)
* wind_speed – wind speed on the day of the game
* wind_direction – wind direction
* sky – cloudiness
* total_runs – total number of runs scored by both teams
* game_hours_dec – game duration in hours (decimal format)
* season – type of game season
* home_team_win – result of the home team (1 – win)
* home_team_loss – result of the home team (0 – loss)
* home_team_outcome – outcome of the game
---

There are a total of 20 questions in the assignment. Each correct answer is worth 5 points. Therefore, if all answers are correct, you will receive 100 points. The final score will then be scaled to a 10-point grading system.

---

Good luck on your quest for the truth :)

In [1]:
# Import all necessary libraries and modules you need
import pandas as pd
import numpy as np
from datetime import datetime

In [2]:
# Load the dataset and perform cleaning and formatting if necessary

In [3]:
# Let's load the dataset using the read_csv() function, which reads delimited text files.
df = pd.read_csv('baseball_games.csv', index_col=0)

In [4]:
# Let's take a look at the first few (5) rows of the table we loaded.
# We'll use the head() method. It returns the first n rows of the DataFrame (by default, n=5).
df.head()

Unnamed: 0,attendance,away_team,away_team_errors,away_team_hits,away_team_runs,date,field_type,game_type,home_team,home_team_errors,...,temperature,wind_speed,wind_direction,sky,total_runs,game_hours_dec,season,home_team_win,home_team_loss,home_team_outcome
0,40030.0,New York Mets,1,7,3,2016-04-03,on grass,Night Game,Kansas City Royals,0,...,74.0,14.0,from Right to Left,Sunny,7,3.216667,regular season,1,0,Win
1,21621.0,Philadelphia Phillies,0,5,2,2016-04-06,on grass,Night Game,Cincinnati Reds,0,...,55.0,24.0,from Right to Left,Overcast,5,2.383333,regular season,1,0,Win
2,12622.0,Minnesota Twins,0,5,2,2016-04-06,on grass,Night Game,Baltimore Orioles,0,...,48.0,7.0,out to Leftfield,Unknown,6,3.183333,regular season,1,0,Win
3,18531.0,Washington Nationals,0,8,3,2016-04-06,on grass,Night Game,Atlanta Braves,1,...,65.0,10.0,from Right to Left,Cloudy,4,2.883333,regular season,0,1,Loss
4,18572.0,Colorado Rockies,1,8,4,2016-04-06,on grass,Day Game,Arizona Diamondbacks,0,...,77.0,0.0,in unknown direction,In Dome,7,2.65,regular season,0,1,Loss


In [5]:
# Let's find out what data types are stored in the table, 
# as well as the number of non-null values per column 
# and the amount of memory used, using the info() method
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2463 entries, 0 to 2462
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   attendance         2460 non-null   float64
 1   away_team          2463 non-null   object 
 2   away_team_errors   2463 non-null   int64  
 3   away_team_hits     2463 non-null   int64  
 4   away_team_runs     2463 non-null   int64  
 5   date               2463 non-null   object 
 6   field_type         2463 non-null   object 
 7   game_type          2463 non-null   object 
 8   home_team          2463 non-null   object 
 9   home_team_errors   2463 non-null   int64  
 10  home_team_hits     2463 non-null   int64  
 11  home_team_runs     2463 non-null   int64  
 12  start_time         2463 non-null   object 
 13  venue              2463 non-null   object 
 14  day_of_week        2463 non-null   object 
 15  temperature        2463 non-null   float64
 16  wind_speed         2463 non-n

In [6]:
# Check the number of duplicate rows
sum(df.duplicated())

0

<font color='darkblue'>
Our DataFrame has 25 columns (the number matches the textual description) and 2,463 rows.
<br>4 columns are of type float64, 9 are int64, and 12 are object.
<br>Memory usage is 500.3+ KB.
<br>All columns except for 'attendance' have no null values.
<br>There are no duplicates.
<br>
<br>Let's take a closer look at several columns to check if their data types match the textual description and whether they contain null or non-null values.
</font>

<div style="background-color: lightblue; padding: 10px; border-radius: 10px;">
    <font color='darkblue'>
        <b>attendance</b> – number of spectators at the game
    </font>
</div>

In [7]:
# Use isnull() to identify null values and sum() to count them
sum(df['attendance'].isnull())

3

In [8]:
# Calculate the percentage of null values in this column
print(f"{(sum(df['attendance'].isnull()) / len(df['attendance'])):.2%}")

0.12%


<font color='darkblue'>
We can see that the <b>attendance</b> column has 3 null values, which makes up 0.12% of the total.  
We decide to fill these three missing values with the median number of spectators.
</font>

In [9]:
# Use median() to calculate the median
# Replace missing values using fillna()
df['attendance'] = df['attendance'].fillna(df['attendance'].median())
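A small sketch of why the median is a reasonable fill value here. The numbers below are made up for illustration and are not taken from the dataset; the point is only that a single unusually large crowd pulls the mean far more than the median.

```python
import numpy as np
import pandas as pd

# Hypothetical attendance values: one missing entry and one outlier
attendance = pd.Series([12000.0, 15000.0, np.nan, 18000.0, 90000.0])

# Both statistics skip NaN by default
median_value = attendance.median()
mean_value = attendance.mean()

# Fill the gap with the median, as in the cell above
filled = attendance.fillna(median_value)

print(f"median: {median_value}, mean: {mean_value}")
print(filled.isnull().sum())
```

With only 3 missing values out of 2,463, either choice would barely change the column, so the median is simply the more cautious default.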

<font color='darkblue'>
    The <b>attendance</b> column has the type <b>float64</b> (floating-point numbers) and provides information about the number of spectators at the game.  
    Since the number of spectators should be an integer, the <b>int</b> type is more appropriate.
</font>

In [10]:
# Convert the attendance column to int using astype(int)
df['attendance'] = df['attendance'].astype(int)

<div style="background-color: lightblue; padding: 10px; border-radius: 10px;">
    <font color='darkblue'>
        <b>date</b> – date of the game. The column has 2,463 non-null values of type object.
    </font> 
</div>

In [11]:
# Use to_datetime() to convert the data type
df['date'] = pd.to_datetime(df['date'])

<div style="background-color: lightblue; padding: 10px; border-radius: 10px;">
    <font color='darkblue'>
        <b>start_time</b> – start time of the game. Let's check the format of the data and convert it to datetime.
    </font>
</div>

In [12]:
# Check the format in which the time is recorded
df['start_time'].head(3)

0    7:38 p.m. Local
1    7:11 p.m. Local
2    7:07 p.m. Local
Name: start_time, dtype: object

In [13]:
# Use str.replace() to remove ' Local' and to replace 'p.m.' with 'PM' and 'a.m.' with 'AM'
# regex=False makes the dots match literally regardless of the pandas version
df['start_time'] = df['start_time'].str.replace(' Local', '', regex=False)
df['start_time'] = df['start_time'].str.replace('p.m.', 'PM', regex=False)
df['start_time'] = df['start_time'].str.replace('a.m.', 'AM', regex=False)
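The replacements above depend on how the dots in 'p.m.' are interpreted. The default of the `regex` argument of `Series.str.replace` changed from True to False in pandas 2.0, so the same call can behave differently across versions. A minimal sketch, where the second input value is a made-up edge case rather than something from the dataset:

```python
import pandas as pd

# The second value is a made-up edge case to expose the difference
times = pd.Series(['7:38 p.m. Local', '1:05 pXmY Local'])

# Literal matching: only the exact text 'p.m.' is replaced
literal = times.str.replace('p.m.', 'PM', regex=False)

# Regex matching: each '.' matches any character, so 'pXmY' is replaced too
pattern = times.str.replace('p.m.', 'PM', regex=True)

print(literal.tolist())
print(pattern.tolist())
```

For this dataset both modes most likely give the same result, but passing `regex` explicitly removes the dependence on the installed version.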

In [14]:
# Let's see the result
df['start_time'].head(3)

0    7:38 PM
1    7:11 PM
2    7:07 PM
Name: start_time, dtype: object

In [15]:
# Parse each string into a datetime.time object (the column dtype stays object)
df['start_time'] = df['start_time'].apply(lambda x: datetime.strptime(x, "%I:%M %p").time())

<div style="background-color: lightblue; padding: 10px; border-radius: 10px;">
   <font color='darkblue'>
    <b>temperature</b> – air temperature on the day of the game, in Fahrenheit. The column contains 2,463 non-null values of type float64.  
    Question 20 defines cold weather as a temperature below 0 degrees Celsius, so we convert this column to Celsius.
   </font>
</div>

In [16]:
# Convert degrees from Fahrenheit to Celsius
df['temperature'] = df['temperature'].apply(lambda x: round((x - 32) * 5 / 9, 2))
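The conversion formula is C = (F - 32) * 5 / 9. The same result can be obtained without `apply`, using vectorized arithmetic on the whole column. A minimal sketch on sample values (74.0 appears in the head() output above; 31.0 and 101.0 are illustrative inputs):

```python
import pandas as pd

# Sample Fahrenheit values; 31.0 and 101.0 are illustrative
temps_f = pd.Series([31.0, 74.0, 101.0])

# Vectorized conversion: C = (F - 32) * 5 / 9, rounded to 2 decimals
temps_c = ((temps_f - 32) * 5 / 9).round(2)

print(temps_c.tolist())
```

Note that only 31°F falls below the freezing point, which is 32°F or 0°C.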

<div style="background-color: lightblue; padding: 10px; border-radius: 10px;">
   <font color='darkblue'>
    <b>home_team_outcome</b> – outcome of the game. The column contains 2,463 non-null values of type object.
   </font>
</div>

In [17]:
# Check which unique values are present in the home_team_outcome column
df['home_team_outcome'].unique()

array(['Win', 'Loss'], dtype=object)

In [18]:
# Replace values: Loss → 0, Win → 1
df['home_team_outcome'] = df.home_team_outcome.map({'Loss': 0, 'Win': 1})

# 1. Which game had the highest number of spectators during the entire season?

In [19]:
# Find the highest number of spectators during the entire season
# To do this, select the games that took place in the regular season
mask_season = (df['season'] == 'regular season')

<font color='darkblue'>
    Since we have several similar questions, let's write a helper function to avoid code duplication.
</font>

In [20]:
def select_by_condition(data, col, func):
    '''
    Returns rows from data where the condition data[col] == func(data[col]) is met.
    
    Args:
        data: the dataset from which to select rows
        col: the column to which the selection condition is applied
        func: the function defining the selection condition (e.g., max, min)
        
    Returns:
        Rows from data that satisfy the condition
    '''
    cond_val = func(data[col])
    print(f'{func.__name__} for {col} : {cond_val}')
    return data[data[col] == cond_val]
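The helper compares every row with the extreme value, so it returns all tied rows. A shortcut such as `idxmax()` would return only the first of them. A minimal sketch on made-up data (the `game` labels are hypothetical):

```python
import pandas as pd

# Made-up data: games A and C share the maximum wind speed
games = pd.DataFrame({
    'game': ['A', 'B', 'C'],
    'wind_speed': [25.0, 10.0, 25.0],
})

# Equality with the maximum keeps every tied row (what the helper does)
all_max_rows = games[games['wind_speed'] == games['wind_speed'].max()]

# idxmax() returns the label of the first maximum only
first_max_row = games.loc[[games['wind_speed'].idxmax()]]

print(all_max_rows['game'].tolist())
print(first_max_row['game'].tolist())
```

This matters for questions where several games share the same extreme value.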

In [21]:
# Find the highest number of spectators during the entire season
# Use max() on 'attendance' for the games played in the regular season (mask_season)
select_by_condition(df.loc[mask_season], 'attendance', max)

max for attendance : 53621


Unnamed: 0,attendance,away_team,away_team_errors,away_team_hits,away_team_runs,date,field_type,game_type,home_team,home_team_errors,...,temperature,wind_speed,wind_direction,sky,total_runs,game_hours_dec,season,home_team_win,home_team_loss,home_team_outcome
534,53621,San Francisco Giants,1,11,2,2016-09-20,on grass,Night Game,Los Angeles Dodgers,0,...,25.0,6.0,out to Rightfield,Cloudy,2,3.6,regular season,0,1,0


<font color='darkblue'>
    The game <b>San Francisco Giants vs Los Angeles Dodgers</b> had the highest number of spectators (53,621) during the entire season.
</font>

# 2. Which game was the coldest (based on temperature) during the entire season?

In [22]:
# Find the lowest temperature during the entire season
# Use min() on 'temperature' for the games played in the regular season (mask_season)
select_by_condition(df.loc[mask_season], 'temperature', min)

min for temperature : -0.56


Unnamed: 0,attendance,away_team,away_team_errors,away_team_hits,away_team_runs,date,field_type,game_type,home_team,home_team_errors,...,temperature,wind_speed,wind_direction,sky,total_runs,game_hours_dec,season,home_team_win,home_team_loss,home_team_outcome
2409,32419,New York Yankees,1,13,8,2016-04-09,on grass,Day Game,Detroit Tigers,1,...,-0.56,18.0,from Left to Right,Cloudy,12,3.333333,regular season,0,1,0


<font color='darkblue'>
    The game <b>New York Yankees vs Detroit Tigers</b> was played at -0.56°C and was the coldest game of the entire season.
</font>

# 3. Which game was the warmest during the entire season?

In [23]:
# Find the highest temperature during the entire season
# Use max() on 'temperature' for the games played in the regular season (mask_season)
select_by_condition(df.loc[mask_season], 'temperature', max)

max for temperature : 38.33


Unnamed: 0,attendance,away_team,away_team_errors,away_team_hits,away_team_runs,date,field_type,game_type,home_team,home_team_errors,...,temperature,wind_speed,wind_direction,sky,total_runs,game_hours_dec,season,home_team_win,home_team_loss,home_team_outcome
2026,21753,San Francisco Giants,0,8,3,2016-05-13,on grass,Night Game,Arizona Diamondbacks,0,...,38.33,9.0,in unknown direction,Sunny,4,3.0,regular season,0,1,0


<font color='darkblue'>
The game <b>San Francisco Giants vs Arizona Diamondbacks</b> was played at a temperature of 38.33°C and was the warmest game of the entire season.
</font>

# 4. Which game in the season had the longest duration?

In [24]:
# Find the longest game of the entire season
# Use max() on 'game_hours_dec' for the games played in the regular season (mask_season)
select_by_condition(df.loc[mask_season], 'game_hours_dec', max)

max for game_hours_dec : 6.216666666666667


Unnamed: 0,attendance,away_team,away_team_errors,away_team_hits,away_team_runs,date,field_type,game_type,home_team,home_team_errors,...,temperature,wind_speed,wind_direction,sky,total_runs,game_hours_dec,season,home_team_win,home_team_loss,home_team_outcome
1445,45825,Cleveland Indians,0,15,2,2016-07-01,on turf,Day Game,Toronto Blue Jays,2,...,20.0,0.0,in unknown direction,In Dome,3,6.216667,regular season,0,1,0


<font color='darkblue'>
The game <b>Cleveland Indians vs Toronto Blue Jays</b> was the longest in duration, lasting 6.22 hours.
</font>

# 5. Which game in the season had the shortest duration?

In [25]:
# Find the shortest game of the entire season
# Use min() on 'game_hours_dec' for the games played in the regular season (mask_season)
select_by_condition(df.loc[mask_season], 'game_hours_dec', min)

min for game_hours_dec : 1.25


Unnamed: 0,attendance,away_team,away_team_errors,away_team_hits,away_team_runs,date,field_type,game_type,home_team,home_team_errors,...,temperature,wind_speed,wind_direction,sky,total_runs,game_hours_dec,season,home_team_win,home_team_loss,home_team_outcome
423,19991,Chicago Cubs,1,4,1,2016-09-29,on grass,Night Game,Pittsburgh Pirates,0,...,17.22,12.0,in from Leftfield,Overcast,2,1.25,regular season,0,0,0


<font color='darkblue'>
The game <b>Chicago Cubs vs Pittsburgh Pirates</b> was the shortest in duration, lasting 1.25 hours.
</font>

# 6. How many games in the season ended in a tie?

<font color='darkblue'>
    In baseball, there are <b>NO</b> ties.
</font>

In [26]:
# But let's double-check just in case :)
# First, select the games played in the regular season (mask_season)
# Then use value_counts() to see the number of outcomes for home teams
df.loc[mask_season, 'home_team_outcome'].value_counts()

home_team_outcome
1    1287
0    1141
Name: count, dtype: int64

<font color='darkblue'>
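    Thus, in the 2016 regular season, the <b>home_team_outcome</b> column records 1,287 wins, 1,141 losses, and <b>0 ties</b> for the home teams. Note that this column only takes the values Win and Loss, so by itself it cannot record a tie.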
</font>

In [27]:
# Find the number of games where the home team neither won nor lost
sum((df['home_team_win'] == 0) & (df['home_team_loss'] == 0))

1

<font color='darkblue'>
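    Thus, in the 2016 season, there was <b>1</b> unusual game in which the home team neither won nor lost: the 1.25-hour Chicago Cubs vs Pittsburgh Pirates game from question 5, where both flags are 0 and the score was level (1 away run out of 2 total runs). This game was stopped early due to weather conditions.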
</font>
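Another way to look for ties is to derive the outcome directly from the run columns instead of relying on the outcome flags. A minimal sketch on made-up scores; the column name `derived_outcome` is hypothetical and not part of the dataset:

```python
import numpy as np
import pandas as pd

# Made-up scores: a home win, a level score, and a home loss
games = pd.DataFrame({
    'home_team_runs': [4, 1, 2],
    'away_team_runs': [3, 1, 5],
})

diff = games['home_team_runs'] - games['away_team_runs']

# Positive difference: home win; negative: home loss; zero: tie
games['derived_outcome'] = np.select(
    [diff > 0, diff < 0],
    ['Win', 'Loss'],
    default='Tie',
)

n_ties = (games['derived_outcome'] == 'Tie').sum()
print(games['derived_outcome'].tolist(), n_ties)
```

Applied to the real data, this check would not depend on how the source encoded an unfinished game.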

# 7. Which game was the last one of the season?

In [28]:
# To do this, select the games played in the regular season (mask_season)
# Sort by 'date' and then by 'start_time' in descending order using sort_values() and take the first row
df.loc[mask_season].sort_values(['date', 'start_time'], ascending=[False, False]).head(1)

Unnamed: 0,attendance,away_team,away_team_errors,away_team_hits,away_team_runs,date,field_type,game_type,home_team,home_team_errors,...,temperature,wind_speed,wind_direction,sky,total_runs,game_hours_dec,season,home_team_win,home_team_loss,home_team_outcome
395,36787,Toronto Blue Jays,0,9,2,2016-10-02,on grass,Day Game,Boston Red Sox,1,...,13.33,6.0,out to Leftfield,Unknown,3,3.233333,regular season,0,1,0


<font color='darkblue'>
    The game <b>Toronto Blue Jays vs Boston Red Sox</b> was played on 2016-10-02 at 15:17:00 and  
    was the last game of the season.
</font>

# 8. Which game had the lowest number of spectators?

In [29]:
# Select games with the lowest number of spectators
# Use min() on 'attendance' across all games
select_by_condition(df, 'attendance', min)

min for attendance : 8766


Unnamed: 0,attendance,away_team,away_team_errors,away_team_hits,away_team_runs,date,field_type,game_type,home_team,home_team_errors,...,temperature,wind_speed,wind_direction,sky,total_runs,game_hours_dec,season,home_team_win,home_team_loss,home_team_outcome
2130,8766,Detroit Tigers,0,5,0,2016-05-04,on grass,Night Game,Cleveland Indians,0,...,12.22,11.0,from Left to Right,Overcast,4,2.316667,regular season,1,0,1


<font color='darkblue'>
    The game <b>Detroit Tigers vs Cleveland Indians</b> had the lowest number of spectators (8,766).
</font>

# 9. Which game in the season was the windiest?

In [30]:
# Find max() for 'wind_speed' in the games played during the regular season (mask_season)
select_by_condition(df.loc[mask_season], 'wind_speed', max)

max for wind_speed : 25.0


Unnamed: 0,attendance,away_team,away_team_errors,away_team_hits,away_team_runs,date,field_type,game_type,home_team,home_team_errors,...,temperature,wind_speed,wind_direction,sky,total_runs,game_hours_dec,season,home_team_win,home_team_loss,home_team_outcome
1655,41543,Milwaukee Brewers,1,11,5,2016-06-13,on grass,Night Game,San Francisco Giants,0,...,14.44,25.0,out to Centerfield,Cloudy,16,3.633333,regular season,1,0,1
2005,35736,Houston Astros,2,8,9,2016-05-15,on grass,Day Game,Boston Red Sox,3,...,14.44,25.0,out to Rightfield,Cloudy,19,3.666667,regular season,1,0,1


<font color='darkblue'>
    The maximum wind speed during the entire season was 25.0.  
    <br>Two games were played under such conditions:  
    <br><b>Milwaukee Brewers vs San Francisco Giants</b>  
    <br><b>Houston Astros vs Boston Red Sox</b>
</font>

# 10. In which game was the highest number of total runs scored?

In [31]:
# Points in baseball are called runs, so we sum away_team_runs and home_team_runs
# to get the total number of runs per game, then select the max of that sum
# This new column will also be useful for question 20
df['all_runs'] = df['away_team_runs'] + df['home_team_runs']
# Find max() for 'all_runs' across all games
select_by_condition(df, 'all_runs', max)

max for all_runs : 29


Unnamed: 0,attendance,away_team,away_team_errors,away_team_hits,away_team_runs,date,field_type,game_type,home_team,home_team_errors,...,wind_speed,wind_direction,sky,total_runs,game_hours_dec,season,home_team_win,home_team_loss,home_team_outcome,all_runs
1788,22588,Seattle Mariners,1,16,16,2016-06-02,on grass,Night Game,San Diego Padres,1,...,10.0,out to Rightfield,Sunny,29,3.833333,regular season,0,1,0,29


<font color='darkblue'>
    The game <b>Seattle Mariners vs San Diego Padres</b> had the highest number of total runs scored (29).
</font>

# 11. Which game had the highest number of errors made by the home team?

In [32]:
# Find max() for 'home_team_errors' across all games
select_by_condition(df, 'home_team_errors', max)

max for home_team_errors : 5


Unnamed: 0,attendance,away_team,away_team_errors,away_team_hits,away_team_runs,date,field_type,game_type,home_team,home_team_errors,...,wind_speed,wind_direction,sky,total_runs,game_hours_dec,season,home_team_win,home_team_loss,home_team_outcome,all_runs
1178,22581,Arizona Diamondbacks,1,11,8,2016-07-27,on grass,Night Game,Milwaukee Brewers,5,...,9.0,in from Leftfield,Cloudy,9,2.933333,regular season,0,1,0,9


<font color='darkblue'>
    The game <b>Arizona Diamondbacks vs Milwaukee Brewers</b> had the highest number of errors made by the home team (5).
</font>

# 12. In which game was the highest number of hits?

In [33]:
# Calculate the sum of away_team_hits and home_team_hits (without storing it as a new column)
# Then select the rows where that sum equals its max
df[(df['away_team_hits'] + df['home_team_hits']) == (df['away_team_hits'] + df['home_team_hits']).max()]

Unnamed: 0,attendance,away_team,away_team_errors,away_team_hits,away_team_runs,date,field_type,game_type,home_team,home_team_errors,...,wind_speed,wind_direction,sky,total_runs,game_hours_dec,season,home_team_win,home_team_loss,home_team_outcome,all_runs
1413,36253,Texas Rangers,0,16,5,2016-07-04,on grass,Day Game,Boston Red Sox,2,...,11.0,out to Centerfield,Cloudy,17,3.666667,regular season,1,0,1,17


<font color='darkblue'>
    The game <b>Texas Rangers vs Boston Red Sox</b> had the highest number of hits (37).
</font>

# 13. Display the number of games each team played during the regular season

In [34]:
# Select games played in the regular season (mask_season)
# Count the number of games for 'away_team' and 'home_team' using value_counts
# Sum the results using sum(axis=1)
df.loc[mask_season][['away_team', 'home_team']].apply(pd.Series.value_counts).sum(axis=1)

Arizona Diamondbacks             162
Atlanta Braves                   161
Baltimore Orioles                162
Boston Red Sox                   162
Chicago Cubs                     162
Chicago White Sox                162
Cincinnati Reds                  162
Cleveland Indians                161
Colorado Rockies                 162
Detroit Tigers                   161
Houston Astros                   162
Kansas City Royals               162
Los Angeles Angels of Anaheim    162
Los Angeles Dodgers              162
Miami Marlins                    161
Milwaukee Brewers                162
Minnesota Twins                  162
New York Mets                    162
New York Yankees                 162
Oakland Athletics                162
Philadelphia Phillies            162
Pittsburgh Pirates               162
San Diego Padres                 162
San Francisco Giants             162
Seattle Mariners                 162
St. Louis Cardinals              162
Tampa Bay Rays                   162
T
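The same count can also be obtained by stacking the two team columns into one Series and counting once. A minimal sketch on made-up games with shortened, hypothetical team names:

```python
import pandas as pd

# Made-up schedule with shortened team names
games = pd.DataFrame({
    'away_team': ['Cubs', 'Mets', 'Cubs'],
    'home_team': ['Mets', 'Cubs', 'Reds'],
})

# Stack both columns into one Series, then count appearances per team
games_per_team = pd.concat([games['away_team'], games['home_team']]).value_counts()

print(games_per_team.to_dict())
```

One possible advantage of this variant is that the counts stay integers even when a team appears in only one of the two columns.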

# 14. Which team won the highest number of games during the season? (Be careful to consider only regular season games)

In [35]:
# Create an array to store which team won each game
# If 'home_team_outcome' is 0, then the away team won
# If 'home_team_outcome' is 1, then the home team won
conditions = [
    (df['home_team_outcome'] == 0),
    (df['home_team_outcome'] == 1),
]
choices = [df['away_team'], df['home_team']]

# Use np.select to create the array
win_arr = np.select(conditions, choices)

# Create a new DataFrame, filter by regular season (mask_season), and count occurrences using value_counts
pd.DataFrame(win_arr, columns=['team'])[mask_season].value_counts(sort=True).head(1)

team        
Chicago Cubs    104
Name: count, dtype: int64

<font color='darkblue'>
The <b>Chicago Cubs</b> won the highest number of games in the season (104). They went on to win the World Series that year, thus breaking the Curse of the Billy Goat.
</font>
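The winner of each game can also be picked with `Series.where`, which keeps the index of the original frame, so a mask like `mask_season` can be applied without building a new DataFrame. A minimal sketch on made-up games; like the cell above, it assumes `home_team_outcome` only takes the values 0 and 1:

```python
import pandas as pd

# Made-up games with shortened team names
games = pd.DataFrame({
    'away_team': ['Cubs', 'Mets', 'Cubs'],
    'home_team': ['Mets', 'Cubs', 'Reds'],
    'home_team_outcome': [1, 1, 0],
})

# Keep home_team where the home team won, otherwise take away_team
winner = games['home_team'].where(games['home_team_outcome'] == 1, games['away_team'])

wins = winner.value_counts()
print(wins.to_dict())
```

Swapping the two columns in the call gives the loser of each game in the same way.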

# 15. Which team won the highest number of home games during the season?

In [36]:
# Select all games in the regular season (mask_season) where the home team won (home_team_outcome == 1),
# then count the number of wins per home_team using value_counts()
df[mask_season & (df['home_team_outcome'] == 1)]['home_team'].value_counts(sort=True).head(1)

home_team
Chicago Cubs    57
Name: count, dtype: int64

<font color='darkblue'>
The <b>Chicago Cubs</b> won the highest number of home games during the season (57).
</font>

# 16. Which team won the highest number of away games during the season?

In [37]:
# Select all games in the regular season where the home team lost (home_team_outcome == 0),
# meaning the away team won (an away victory),
# then count the number of away wins per away_team using value_counts()
df[mask_season & (df['home_team_outcome'] == 0)]['away_team'].value_counts(sort=True).head(1)

away_team
St. Louis Cardinals    48
Name: count, dtype: int64

<font color='darkblue'>
The <b>St. Louis Cardinals</b> won the highest number of away games during the season (48).
</font>

# 17. Which team lost the highest number of games during the season?

In [38]:
# Create an array to store which team lost each game
# If 'home_team_outcome' is 0, the home team lost
# If 'home_team_outcome' is 1, the away team lost
conditions = [
    (df['home_team_outcome'] == 0),
    (df['home_team_outcome'] == 1)
]
choices = [df['home_team'], df['away_team']]

# Use np.select to create the array
loss_arr = np.select(conditions, choices)

# Create a new DataFrame, filter by regular season (mask_season), and count losses with value_counts
pd.DataFrame(loss_arr, columns=['team'])[mask_season].value_counts(sort=True).head(1)

team           
Minnesota Twins    103
Name: count, dtype: int64

<font color='darkblue'>
The <b>Minnesota Twins</b> lost the highest number of games during the season (103).
</font>

# 18. Does the outcome of a game (win) depend on the number of spectators?  
# (see pandas.DataFrame.cov for covariance calculation)

In [39]:
# Build the covariance matrix for 'attendance' and 'home_team_outcome'
cov_matrix = df[['attendance', 'home_team_outcome']].cov(numeric_only=True)
display(cov_matrix)

Unnamed: 0,attendance,home_team_outcome
attendance,97406070.0,224.736752
home_team_outcome,224.7368,0.249235


<font color='darkblue'>
The covariance (224.736752) shows that the values in the columns tend to change in the same direction,  
but the magnitude of this number alone is not informative without normalization, as it depends on the units of measurement.
</font>

In [40]:
# Build the correlation matrix for 'attendance' and 'home_team_outcome'
cor_matrix = df[['attendance', 'home_team_outcome']].corr(numeric_only=True)
display(cor_matrix)

Unnamed: 0,attendance,home_team_outcome
attendance,1.0,0.045612
home_team_outcome,0.045612,1.0


<font color='darkblue'>
The correlation of 0.045612 indicates a very weak positive linear relationship.  
A correlation value close to 0 means that the values in one column are almost linearly unrelated to the values in the other; it says nothing about cause and effect.  
<br><b>Conclusion</b>  
<br>  
Although there may be a slight tendency for 'attendance' and 'home_team_outcome' to change in the same direction (positive covariance),  
the correlation magnitude (close to 0) shows that a linear relationship is practically absent.
</font>
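The two matrices are linked by the formula r = cov(X, Y) / sqrt(var(X) * var(Y)), where the variances are the diagonal entries of the covariance matrix. A quick check using the rounded values displayed in the covariance matrix above:

```python
import math

# Values copied from the covariance matrix output above (rounded as displayed)
cov_xy = 224.736752          # cov(attendance, home_team_outcome)
var_attendance = 97406070.0  # variance of attendance
var_outcome = 0.249235       # variance of home_team_outcome

# Pearson correlation is covariance divided by the product of standard deviations
r = cov_xy / math.sqrt(var_attendance * var_outcome)

print(round(r, 4))
```

Up to rounding, this reproduces the off-diagonal entry of the correlation matrix, which shows that the large covariance is simply a consequence of the large variance of attendance.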

# 19. Is it true that most home losses occur on Saturdays and Sundays?

In [41]:
# Check home losses (where 'home_team_outcome' == 0)
# and count occurrences by 'day_of_week' using value_counts()
df[df['home_team_outcome'] == 0]['day_of_week'].value_counts()

day_of_week
Sunday       190
Tuesday      187
Friday       184
Saturday     175
Wednesday    167
Monday       131
Thursday     125
Name: count, dtype: int64

<font color='darkblue'>
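No. The two days with the most <b>home</b> losses are <b>Sunday</b> (190) and <b>Tuesday</b> (187), while Saturday (175) ranks only fourth. Together, Saturday and Sunday account for 365 of the 1,159 home losses (about 31%), which is far from a majority.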
</font>
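The weekend share can be computed directly from the counts shown in the output above. A minimal sketch that re-enters those counts by hand:

```python
import pandas as pd

# Home losses per day, copied from the value_counts() output above
home_losses = pd.Series({
    'Sunday': 190, 'Tuesday': 187, 'Friday': 184, 'Saturday': 175,
    'Wednesday': 167, 'Monday': 131, 'Thursday': 125,
})

# Share of all home losses that fall on Saturday or Sunday
weekend_share = home_losses[['Saturday', 'Sunday']].sum() / home_losses.sum()

print(f"{weekend_share:.1%}")
```

A fuller answer would probably also divide by the number of games played on each day, since the days do not necessarily have equal numbers of games; that normalization is not shown here.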

# 20. Is it true that the highest number of runs occurs in cold weather?  
# (Cold weather is defined as temperatures below 0 degrees Celsius)

In [42]:
# Sort by total number of runs and check the corresponding temperatures
df.sort_values('all_runs', ascending=False).head(5)[['all_runs', 'temperature']]

Unnamed: 0,all_runs,temperature
1788,29,24.44
881,27,24.44
1475,24,15.0
2052,24,29.44
1562,24,26.11


In [43]:
# Look at the number of runs during low temperatures
df.sort_values('temperature', ascending=True).head(5)[['all_runs', 'temperature']]

Unnamed: 0,all_runs,temperature
2409,12,-0.56
2412,10,0.0
2445,8,1.11
10,8,2.22
2422,4,3.33


<font color='darkblue'>
No, that is not true. Based on the available data, the highest number of runs <b>does not occur</b> in cold weather.  
<br>Moreover, there is only one game in the data that took place at a temperature below zero.
</font>
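The comparison can be made more systematic by labeling each game as cold or not cold and summarizing runs per group. A minimal sketch using five rows copied from the two outputs above; the `weather` column is a hypothetical helper, and five rows are of course not the full dataset:

```python
import numpy as np
import pandas as pd

# Rows 2409, 2412, 1475, 1788 and 2052 from the outputs above
games = pd.DataFrame({
    'temperature': [-0.56, 0.0, 15.0, 24.44, 29.44],
    'all_runs': [12, 10, 24, 29, 24],
})

# Cold weather is defined as a temperature strictly below 0 degrees Celsius
games['weather'] = np.where(games['temperature'] < 0, 'cold', 'not cold')

# Number of games, mean runs and max runs per group
summary = games.groupby('weather')['all_runs'].agg(['count', 'mean', 'max'])
print(summary)
```

With a single cold game in the data, any per-group statistic for cold weather rests on one observation, so the conclusion should be read with that in mind.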