<a href="https://colab.research.google.com/github/SrishtiTyagii/coursework/blob/main/NHL_Win_Loss_Ratio_Correlation_with_Population_(2018).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NHL Win/Loss Ratio Correlation with Population (2018)
This code calculates the Pearson correlation coefficient between the win/loss ratio of NHL (National Hockey League) teams in 2018 and the population of the metropolitan area where the teams are based. The goal is to determine if there is any correlation between the size of a city's population and the performance of its NHL team(s).

# 1. Importing Required Libraries

In [4]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import re

* pandas: For data manipulation and reading data from files.
* numpy: For numerical operations, particularly handling NaN values.
* scipy.stats: To calculate the Pearson correlation coefficient.
* re: To perform regular expression operations for string cleaning.

# 2. Loading and Cleaning City Data

In [5]:
cities = pd.read_html("/content/wikipedia_data.html")[1]
cities = cities.iloc[:-1, [0, 3, 5, 6, 7, 8]]
cities.rename(columns={"Population (2016 est.)[8]": "Population"}, inplace=True)
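# regex=True is required: pandas 2.0+ treats the pattern as a literal string by default.
cities['NFL'] = cities['NFL'].str.replace(r"\[.*\]", "", regex=True)
cities['MLB'] = cities['MLB'].str.replace(r"\[.*\]", "", regex=True)
cities['NBA'] = cities['NBA'].str.replace(r"\[.*\]", "", regex=True)
cities['NHL'] = cities['NHL'].str.replace(r"\[.*\]", "", regex=True)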


* pd.read_html reads the Wikipedia table containing metropolitan areas and their associated sports teams.
* The iloc indexer drops the table's last row and selects the relevant columns: metropolitan area, population, and team names for the NFL, MLB, NBA, and NHL. The population column is then renamed to Population.
* We clean up the team names by removing any references or footnotes (denoted by square brackets) using str.replace().
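
The footnote-stripping step can be tried on a toy Series. This is a minimal sketch: the sample strings are made up rather than taken from the Wikipedia table, and it assumes a pandas version that accepts the `regex` keyword. Note that the pattern is greedy, so it removes everything from the first `[` to the last `]` in a cell.

```python
import pandas as pd

# Made-up cell values in the style of the table (not real data).
sample = pd.Series(["Rangers Islanders Devils[note 1]", "Kings Ducks[note 2]"])

# Strip bracketed footnote markers; regex=True makes the pattern a regular expression.
cleaned = sample.str.replace(r"\[.*\]", "", regex=True)

print(cleaned.tolist())
```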

# 3. Extracting NHL Team Names

In [6]:
Big4 = 'NHL'
team = cities[Big4].str.extract(r'([A-Z]{0,2}[a-z0-9]*\ [A-Z]{0,2}[a-z0-9]*|[A-Z]{0,2}[a-z0-9]*)')
team['Metropolitan area'] = cities['Metropolitan area']
team = pd.melt(team, id_vars=['Metropolitan area']).drop(columns=['variable']).replace("", np.nan).replace("—", np.nan).dropna().reset_index().rename(columns={"value": "team"})
team = pd.merge(team, cities, how='left', on='Metropolitan area').iloc[:, 1:4]
team = team.astype({'Metropolitan area': str, 'team': str, 'Population': int})
team['team'] = team['team'].str.replace(r'[\w.]*\ ', '', regex=True)

* We extract the NHL team names for each city using a regular expression, which captures team names with or without spaces.
* This handles cases where multiple teams are associated with a single metropolitan area.
* The pd.melt() function transforms the DataFrame so that each row corresponds to a single team and its corresponding city.
* The resulting DataFrame is cleaned, merged with the cities DataFrame to include population data, and the team names are cleaned further.
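
To illustrate the reshaping that pd.melt() performs, here is a small sketch on a hypothetical wide table. The column names `team_1` and `team_2` and the rows are assumptions made for illustration; they are not the layout produced by the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one column per team slot (illustrative layout only).
wide = pd.DataFrame({
    "Metropolitan area": ["New York City", "Boston"],
    "team_1": ["Rangers", "Bruins"],
    "team_2": ["Islanders", np.nan],
})

# Melt to one row per (area, team), then drop the empty slots.
tidy = (
    pd.melt(wide, id_vars=["Metropolitan area"])
    .drop(columns=["variable"])
    .dropna()
    .rename(columns={"value": "team"})
    .reset_index(drop=True)
)

print(tidy)
```

Each metropolitan area now appears once per team, which is the shape needed to merge against per-team records.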

# 4. Loading and Cleaning NHL Win/Loss Data

In [8]:
_df = pd.read_csv("/content/nhl.csv")
_df = _df[_df['year'] == 2018]
_df['team'] = _df['team'].str.replace(r'\*', "", regex=True)
_df = _df[['team', 'W', 'L']]

* We load the NHL win/loss data from a CSV file and filter for the 2018 season.
* Asterisks are removed from the team names, and only the relevant columns (team, W, L) are kept.

# 5. Handling Invalid Rows

In [9]:
dropList = []
for i in range(_df.shape[0]):
    row = _df.iloc[i]
    if row['team'] == row['W'] and row['L'] == row['W']:
        # drop() works on index labels, so record the label rather than the position
        dropList.append(_df.index[i])
_df = _df.drop(dropList)

* This section removes rows in which the team, W, and L fields all contain the same value. Such rows are not team records (for example, a section header repeated across every column) and would break the integer conversion in the next step.
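
The same filter can also be written without a loop. The sketch below uses a boolean mask on made-up rows; the values and the non-default index labels are illustrative assumptions, not the contents of nhl.csv.

```python
import pandas as pd

# Made-up rows: the first mimics a header repeated across every column.
df = pd.DataFrame(
    {
        "team": ["Atlantic Division", "Tampa Bay Lightning", "Boston Bruins"],
        "W": ["Atlantic Division", "54", "50"],
        "L": ["Atlantic Division", "23", "20"],
    },
    index=[10, 11, 12],  # non-default labels, as filtering can leave behind
)

# True where team, W, and L all hold the same value.
is_header = (df["team"] == df["W"]) & (df["W"] == df["L"])

# Keep only the rows that are not headers; index labels are preserved.
df = df[~is_header]

print(df)
```

Because the mask is aligned on the index, this version does not depend on whether positions and labels happen to coincide.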

# 6. Cleaning Team Names and Calculating Win/Loss Ratios

In [10]:
_df['team'] = _df['team'].str.replace(r'[\w.]* ', '', regex=True)
_df = _df.astype({'team': str, 'W': int, 'L': int})
_df['W/L%'] = _df['W'] / (_df['W'] + _df['L'])

* The team names are reduced to their final word (for example, "Tampa Bay Lightning" becomes "Lightning") so that they match the names extracted from the city table, and the W (wins) and L (losses) columns are converted to integers.
* The W/L% column is the fraction of decided games that were won: wins divided by the sum of wins and losses.

# 7. Merging City and Team Data

In [11]:
merge = pd.merge(team, _df, how='outer', on='team')
# The string "mean" skips NaN values like np.nanmean, without the pandas FutureWarning.
merge = merge.groupby('Metropolitan area').agg({'W/L%': 'mean', 'Population': 'mean'})


* We merge the city population data (team) with the NHL win/loss data (_df) based on the team column.
* After merging, we group by Metropolitan area to calculate the average win/loss ratio and population for cities with multiple teams.
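
The merge-then-average step can be checked on a toy example. Everything here is hypothetical: the area names, team names, populations, and win fractions are invented so the result is easy to verify by hand.

```python
import pandas as pd

# Hypothetical teams: Metro A hosts two teams, Metro B hosts one.
teams = pd.DataFrame({
    "Metropolitan area": ["Metro A", "Metro A", "Metro B"],
    "team": ["Alphas", "Betas", "Gammas"],
    "Population": [1000, 1000, 500],
})
records = pd.DataFrame({
    "team": ["Alphas", "Betas", "Gammas"],
    "W/L%": [0.6, 0.4, 0.7],
})

# Join on the team name, then average per metropolitan area.
merged = pd.merge(teams, records, how="outer", on="team")
by_area = merged.groupby("Metropolitan area").agg({"W/L%": "mean", "Population": "mean"})

print(by_area)
```

Metro A ends up with the mean of its two teams' win fractions, while its population is unchanged because both rows carry the same value.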

# 8. Correlation Calculation

In [15]:
def calculate_pearson_correlation():
    population_by_region = merge['Population']
    win_loss_by_region = merge['W/L%']

    assert len(population_by_region) == len(win_loss_by_region), "Your lists must be the same length"
    assert len(population_by_region) == 28, "There should be 28 teams being analyzed for NHL"

    # pearsonr returns (coefficient, p-value); only the coefficient is needed
    return stats.pearsonr(population_by_region, win_loss_by_region)[0]

correlation = calculate_pearson_correlation()
print(correlation)

* The function checks that the population and win/loss series have the same length and the expected 28 entries, one per metropolitan area, then returns the Pearson correlation coefficient between them.
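
For reference, the Pearson coefficient is the covariance of the two series divided by the product of their standard deviations. The sketch below computes it in plain Python on made-up numbers; `pearson_r` is a hypothetical helper written for illustration, not part of scipy, and the sample values are not NHL data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up populations (in millions) and win fractions.
populations = [1.0, 2.0, 3.0, 4.0]
win_fractions = [0.40, 0.50, 0.45, 0.60]

r = pearson_r(populations, win_fractions)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or none.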