# Data Reading Section
##### In this section, we read the data we need from the main reference file and convert it to CSV format for use in the later steps. The code is explained cell by cell below.

---

# ⚠️ IMPORTANT: READ BEFORE RUNNING

This notebook expects the raw dataset to be available **before execution**.  
If the required ZIP file is not placed in the correct path, the ETL pipeline will fail or generate incomplete/duplicated outputs.

---

## ✅ Required Action

Please make sure the following file exists **before running the notebook**:

`..\data\raw\tennis_data.zip`

> 📌 The path is already configured inside the notebook's Python extraction script. Do **not** change it unless necessary.

If the ZIP file has a different name, please rename it or update the code accordingly.
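
A quick pre-flight check can catch a missing archive before the ETL cells run. This is a minimal sketch, not part of the original notebook: the helper name `raw_zip_ready` is hypothetical, and the path is the one stated above.

```python
from pathlib import Path
import zipfile


def raw_zip_ready(zip_path):
    """Return True when zip_path exists and is a readable ZIP archive."""
    path = Path(zip_path)
    return path.is_file() and zipfile.is_zipfile(path)


# Path taken from the requirement above; adjust only if your layout differs.
RAW_ZIP = Path("..") / "data" / "raw" / "tennis_data.zip"

if not raw_zip_ready(RAW_ZIP):
    print(f"Missing or invalid archive: {RAW_ZIP}")
```

Running this first makes the failure explicit instead of surfacing later as an empty or partial CSV.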

---

# Part 1: Importing the required libraries, defining the paths, and creating the required directories
In this part, we import the libraries used later in the notebook. We then define the main paths (the main ZIP file, the output directory, and the temp directory) as relative paths inside this project's `data` folder, and create any directory that does not already exist.

In [None]:
import os, zipfile
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
from io import BytesIO
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns

print(os.getcwd())


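# Define paths. All of them share the same ..\data root, so the extraction
# output (output_dir) is the folder the cleaning stage reads from (base_path).
main_zip = r"..\data\raw\tennis_data.zip"
output_dir = r"..\data\processed"
temp_dir = r"..\data\raw\temp"
base_path = r"..\data\processed"
clean_path = r"..\data\processed\clean"
etl_files_path = r"..\data\processed\clean"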

os.makedirs(clean_path, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)
os.makedirs(temp_dir, exist_ok=True)

# Part 2: Creating a CSV table generator function from Parquet files
In this part, we define a function that builds one CSV table from the raw archive. Given a keyword from the Parquet file names (for example `event_`), it opens the main ZIP file, walks through the daily ZIP files inside it, reads every Parquet file whose name contains the keyword, and keeps only the requested columns. The `dedup_on` argument controls uniqueness: with the default `match_id`, the table keeps one row per match; with `None`, the table may hold multiple rows per match and only exact duplicate rows are dropped.

In [None]:
def build_table(table_keyword, needed_cols, output_name, dedup_on="match_id"):
    """
    table_keyword: like 'event_' or 'home_team_'
    needed_cols: list of needed columns
    output_name: name of output CSV file
    dedup_on: unique column for deduplication (default is 'match_id')
    """
    csv_path = os.path.join(output_dir, output_name)
    if os.path.exists(csv_path):
        os.remove(csv_path)

    all_dfs = []
    row_counter = 0

    with zipfile.ZipFile(main_zip, "r") as main_zip_ref:
        daily_zips = main_zip_ref.namelist()
        print(f"📦 Count of daily zips: {len(daily_zips)}")

        for i, daily_zip_name in enumerate(daily_zips, start=1):
            print(f"🔹 ({i}/{len(daily_zips)}) processing {daily_zip_name} ...")
            main_zip_ref.extract(daily_zip_name, temp_dir)
            daily_zip_path = os.path.join(temp_dir, daily_zip_name)

            with zipfile.ZipFile(daily_zip_path, "r") as daily_zip_ref:
                parquet_files = [f for f in daily_zip_ref.namelist() if f.endswith(".parquet") and table_keyword in f]
                for f in parquet_files:
                    with daily_zip_ref.open(f) as pf:
                        table = pq.read_table(BytesIO(pf.read()))
                        df = table.to_pandas()
                        # copy() so the date_source column below is added to an independent frame
                        df = df[[c for c in needed_cols if c in df.columns]].copy()
                        df["date_source"] = daily_zip_name.replace(".zip", "")
                        all_dfs.append(df)
                        row_counter += len(df)

            os.remove(daily_zip_path)

    if all_dfs:
        df_all = pd.concat(all_dfs, ignore_index=True)
        print(f"✅ Shape: {df_all.shape}")
        if dedup_on and dedup_on in df_all.columns:
            df_all = df_all.drop_duplicates(subset=dedup_on)
        else:
            df_all = df_all.drop_duplicates()
        print(f"🧹 After removing duplicated rows: {df_all.shape}")

        df_all.to_csv(csv_path, index=False)
        print(f"💾 Saved: {csv_path}")
        print(f"📊 Count of all rows: {len(df_all)}")
    else:
        print(f"⚠️ There is no file for {table_keyword}")
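
The nested-archive traversal used by `build_table` can also be done fully in memory, without extracting each daily ZIP to a temp folder. The sketch below is an illustration under assumptions, not the notebook's method: `list_nested_members` and `_make_zip` are hypothetical helpers, and the file names in the toy archive are made up.

```python
import io
import zipfile


def list_nested_members(outer_bytes, keyword, suffix=".parquet"):
    """Yield (daily_zip_name, member_name) for matching files in nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(outer_bytes)) as outer:
        for daily_name in outer.namelist():
            if not daily_name.endswith(".zip"):
                continue  # skip directory entries and non-ZIP members
            # Read the inner archive into memory instead of extracting to disk.
            with zipfile.ZipFile(io.BytesIO(outer.read(daily_name))) as inner:
                for member in inner.namelist():
                    if member.endswith(suffix) and keyword in member:
                        yield daily_name, member


def _make_zip(files):
    """Build an in-memory ZIP from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


# Toy archive with made-up file names, only to exercise the traversal.
daily = _make_zip({"event_1.parquet": b"x", "time_1.parquet": b"y"})
outer = _make_zip({"20240101.zip": daily})
matches = list(list_nested_members(outer, "event_"))
print(matches)
```

Because the filter is a substring test, a keyword such as `team_` would match both `home_team_` and `away_team_` files, so keywords should be as specific as the file naming allows.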

# Part 3: Using the function above to create the CSV tables required for the analysis
Based on our initial review of the 17 analysis questions and the data they require, we selected a subset of the 15 available tables, and from each of them only the columns needed to answer those questions. Here we extract them from the raw ZIP file and save them as CSV files for use in the later analysis.

In [None]:
build_table(
    table_keyword="event_",
    needed_cols=["match_id", "first_to_serve", "winner_code", "default_period_count", "start_datetime", "match_slug"],
    output_name="event.csv",
    dedup_on="match_id"
)

build_table(
    table_keyword="home_team_",
    needed_cols=["match_id", "player_id", "full_name", "gender", "height", "weight", "plays", "current_rank", "country"],
    output_name="home_team.csv",
    dedup_on="match_id"
)

build_table(
    table_keyword="away_team_",
    needed_cols=["match_id", "player_id", "full_name", "gender", "height", "weight", "plays", "current_rank", "country"],
    output_name="away_team.csv",
    dedup_on="match_id"
)

build_table(
    table_keyword="tournament_",
    needed_cols=["match_id", "tournament_id", "tournament_name", "ground_type", "tennis_points", "start_datetime"],
    output_name="tournament.csv",
    dedup_on="match_id"
)

build_table(
    table_keyword="time_",
    needed_cols=["match_id", "period_1", "period_2", "period_3", "period_4", "period_5", "current_period_start_timestamp"],
    output_name="time.csv",
    dedup_on="match_id"
)

build_table(
    table_keyword="statistics_",
    needed_cols=["match_id", "statistic_name", "home_value", "away_value"],
    output_name="statistics.csv",
    dedup_on=None  # No deduplication because we have multiple rows per match_id in statistics
)

build_table(
    table_keyword="power_",
    needed_cols=["match_id", "set_num", "game_num", "value", "break_occurred"],
    output_name="power.csv",
    dedup_on=None # No deduplication because we have multiple rows per match_id in power
)

build_table(
    table_keyword="pbp_",
    needed_cols=["match_id", "set_id", "game_id", "point_id", "home_point", "away_point", "home_score"],
    output_name="pbp.csv",
    dedup_on=None # No deduplication because we have multiple rows per match_id in pbp
)
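
The effect of the `dedup_on` argument in the calls above can be seen on a small example. This is a sketch with made-up rows; it only mirrors the two `drop_duplicates` branches inside `build_table`.

```python
import pandas as pd

# Toy rows with made-up values: match 1 has two distinct statistic rows.
df = pd.DataFrame({
    "match_id": [1, 1, 2, 2],
    "statistic_name": ["aces", "double_faults", "aces", "aces"],
    "home_value": [5, 2, 7, 7],
})

per_match = df.drop_duplicates(subset="match_id")  # like dedup_on="match_id"
exact_only = df.drop_duplicates()                  # like dedup_on=None

print(len(per_match))   # one row per match_id
print(len(exact_only))  # only the fully identical row is dropped
```

Note that `build_table` adds a `date_source` column, so a row repeated in two different daily files differs in that column and survives exact-row deduplication.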


## Part 4: Data Cleaning Stage
#### In this section, we will clean the extracted CSV files created in the previous section.

### **Goal:**
- Remove duplicate rows
- Handle missing values (`NaN`)
- Standardize data types

The cleaned outputs will be stored in `../data/processed/clean` for the next normalization phase.


###  Cleaning: Event Table

In [None]:
df_event = pd.read_csv(os.path.join(base_path, "event.csv"))
df_event.drop_duplicates(inplace=True)

for col in df_event.columns:
    if df_event[col].dtype == 'object':
        df_event[col] = df_event[col].fillna("Unknown")
    else:
        df_event[col] = df_event[col].fillna(0)

if "match_id" in df_event.columns:
    df_event["match_id"] = df_event["match_id"].astype(str)

df_event.to_csv(os.path.join(clean_path, "event_clean.csv"), index=False)
print("✅ event_clean.csv created successfully!")

###  Cleaning: Home Team Table

In [None]:
df_home = pd.read_csv(os.path.join(base_path, "home_team.csv"))
df_home = df_home.drop_duplicates()

string_cols = ["full_name", "gender", "plays", "country"]
numeric_cols = ["height", "weight", "current_rank"]

for col in string_cols:
    if col in df_home.columns:
        df_home[col] = df_home[col].fillna("Unknown")

for col in numeric_cols:
    if col in df_home.columns:
        df_home[col] = df_home[col].fillna(0)

if "match_id" in df_home.columns:
    df_home["match_id"] = df_home["match_id"].astype(str)

df_home.to_csv(os.path.join(clean_path, "home_team_clean.csv"), index=False)
print("✅ home_team_clean.csv created successfully!")

###  Cleaning: Away Team Table

In [None]:
df_away = pd.read_csv(os.path.join(base_path, "away_team.csv"))
df_away = df_away.drop_duplicates()

string_cols = ["full_name", "gender", "plays", "country"]
numeric_cols = ["height", "weight", "current_rank"]

for col in string_cols:
    if col in df_away.columns:
        df_away[col] = df_away[col].fillna("Unknown")

for col in numeric_cols:
    if col in df_away.columns:
        df_away[col] = df_away[col].fillna(0)

if "match_id" in df_away.columns:
    df_away["match_id"] = df_away["match_id"].astype(str)

df_away.to_csv(os.path.join(clean_path, "away_team_clean.csv"), index=False)
print("✅ away_team_clean.csv created successfully!")

### Cleaning: Tournament Table

In [None]:
df_tournament = pd.read_csv(os.path.join(base_path, "tournament.csv"))
df_tournament = df_tournament.drop_duplicates()

mask = df_tournament['ground_type'].isnull() | (df_tournament['ground_type'].str.strip() == '')
df_tournament.loc[mask, 'ground_type'] = 'Unknown'

df_tournament.to_csv(os.path.join(clean_path, "tournament_clean.csv"), index=False)
print("✅ tournament_clean.csv created successfully!")

### Cleaning: Statistics Table

In [None]:
df_statistics = pd.read_csv(os.path.join(base_path, "statistics.csv"))
df_statistics = df_statistics.drop_duplicates()

# This data doesn't have any nulls to clean; we display the null counts for verification
display(df_statistics.isnull().sum())

if "match_id" in df_statistics.columns:
    df_statistics["match_id"] = df_statistics["match_id"].astype(str)

df_statistics.to_csv(os.path.join(clean_path, "statistics_clean.csv"), index=False)
print("✅ statistics_clean.csv created successfully!")

### Cleaning: Time Table

In [None]:
df_time = pd.read_csv(os.path.join(base_path, "time.csv"))
df_time = df_time.drop_duplicates()

df_time.drop(columns=["period_4", "period_5"], inplace=True)  # assumes every match in this dataset is best of 3 sets

if "match_id" in df_time.columns:
    df_time["match_id"] = df_time["match_id"].astype(int)

df_time.to_csv(os.path.join(clean_path, "time_clean.csv"), index=False)
print("✅ time_clean.csv created successfully!")

### Cleaning: Point By Point Table

In [None]:
df_pbp = pd.read_csv(os.path.join(base_path, "pbp.csv"))
df_pbp = df_pbp.drop_duplicates()

# 'A' (advantage) is mapped to 1 so that both point columns can be cast to int
df_pbp['home_point'] = df_pbp['home_point'].replace('A', 1).astype(int)
df_pbp['away_point'] = df_pbp['away_point'].replace('A', 1).astype(int)

# This data doesn't have any nulls to clean; we display the null counts for verification
display(df_pbp.isnull().sum())

if "match_id" in df_pbp.columns:
    df_pbp["match_id"] = df_pbp["match_id"].astype(str)


df_pbp.to_csv(os.path.join(clean_path, "pbp_clean.csv"), index=False)
print("✅ pbp_clean.csv created successfully!")

### Cleaning: Power Table

In [None]:
df_power = pd.read_csv(os.path.join(base_path, "power.csv"))
df_power = df_power.drop_duplicates()

# No nulls to clean; the checks below confirm that the other columns are valid before saving
display("Sum of nulls:", df_power.isnull().sum())
# Count the rows with an invalid game number (the original summed match_id values instead)
display("Count of invalid game_num entries:", (df_power['game_num'] < 1).sum())
display("DataFrame dtypes for verification:", df_power.dtypes)

df_power.to_csv(os.path.join(clean_path, "power_clean.csv"), index=False)
print("✅ power_clean.csv created successfully!")

## Part 5: Normalization Stage
#### Now that we have clean CSVs, in this part we will:

- Convert data types (e.g., timestamps to datetime)
- Standardize text (e.g., capitalization, spacing)
- Fill remaining missing values (using mean, median, or mode)

The normalized final datasets will be saved in `../data/processed/clean` as `_final.csv` files.
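
The two date formats that the normalization cells convert can be illustrated with the standard library alone. This is a sketch with made-up example values; it assumes `date_source` holds the daily ZIP name in `YYYYMMDD` form and that the timestamp columns hold Unix epoch seconds, as the `format='%Y%m%d'` and `unit='s'` arguments in the cells suggest.

```python
from datetime import datetime, timezone

# Made-up example values in the two formats used by the tables.
date_source = "20240115"        # daily ZIP name, YYYYMMDD
start_timestamp = 1705312800    # Unix epoch seconds

parsed_date = datetime.strptime(date_source, "%Y%m%d").date()
parsed_start = datetime.fromtimestamp(start_timestamp, tz=timezone.utc)

print(parsed_date)
print(parsed_start.isoformat())
```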


### Normalization: Event Table

In [None]:
input_path = os.path.join(clean_path, "event_clean.csv")
output_path = os.path.join(clean_path, "event_final.csv")

df_event = pd.read_csv(input_path)

df_event["match_id"] = df_event["match_id"].astype(int)
df_event["default_period_count"] = df_event["default_period_count"].astype(int)
df_event["date_source"] = df_event["date_source"].astype(int)

if np.issubdtype(df_event["start_datetime"].dtype, np.number):
    df_event["start_datetime"] = pd.to_datetime(df_event["start_datetime"], unit="s", errors="coerce")

df_event["winner_code"] = df_event["winner_code"].fillna(df_event["winner_code"].mode()[0])
df_event["first_to_serve"] = df_event["first_to_serve"].fillna(df_event["first_to_serve"].mode()[0])

df_event.to_csv(output_path, index=False)
print("✅ event_final.csv created successfully!")
print(df_event.info())
print(df_event.isna().sum())

### Normalization: Home Team Table

In [None]:
input_path = os.path.join(clean_path, "home_team_clean.csv")
output_path = os.path.join(clean_path, "home_team_final.csv")

df_home = pd.read_csv(input_path)

numeric_cols = ["height", "weight", "current_rank"]
for col in numeric_cols:
    if col in df_home.columns:
        df_home[col] = pd.to_numeric(df_home[col], errors="coerce")

if "gender" in df_home.columns:
    df_home["gender"] = df_home["gender"].astype(str).str.strip().str.title().replace({"Nan":"Unknown"})
if "plays" in df_home.columns:
    df_home["plays"] = df_home["plays"].astype(str).str.strip().str.lower().replace({"nan":"unknown"})
for col in ["full_name", "country"]:
    if col in df_home.columns:
        df_home[col] = df_home[col].astype(str).str.strip()

if "height" in df_home.columns:
    df_home["height"] = df_home["height"].fillna(df_home["height"].mean(skipna=True))
if "weight" in df_home.columns:
    df_home["weight"] = df_home["weight"].fillna(df_home["weight"].mean(skipna=True))
if "current_rank" in df_home.columns:
    df_home["current_rank"] = df_home["current_rank"].fillna(df_home["current_rank"].median(skipna=True))

for col in ["gender", "plays"]:
    if col in df_home.columns:
        mode_val = df_home[col].mode(dropna=True)
        if not mode_val.empty:
            df_home[col] = df_home[col].fillna(mode_val.iloc[0])
        else:
            df_home[col] = df_home[col].fillna("Unknown")

for col in ["player_id", "full_name", "country"]:
    if col in df_home.columns:
        df_home[col] = df_home[col].fillna("Unknown")

if "match_id" in df_home.columns:
    df_home["match_id"] = df_home["match_id"].astype(int)

df_home.to_csv(output_path, index=False)
print("✅ home_team_final.csv created successfully!")
print(df_home.info())
print(df_home.isna().sum())

### Normalization: Away Team Table

In [None]:
input_path = os.path.join(clean_path, "away_team_clean.csv")
output_path = os.path.join(clean_path, "away_team_final.csv")

df_away = pd.read_csv(input_path)

numeric_cols = ["height", "weight", "current_rank"]
for col in numeric_cols:
    if col in df_away.columns:
        df_away[col] = pd.to_numeric(df_away[col], errors="coerce")

if "gender" in df_away.columns:
    df_away["gender"] = df_away["gender"].astype(str).str.strip().str.title().replace({"Nan":"Unknown"})
if "plays" in df_away.columns:
    df_away["plays"] = df_away["plays"].astype(str).str.strip().str.lower().replace({"nan":"unknown"})
for col in ["full_name", "country"]:
    if col in df_away.columns:
        df_away[col] = df_away[col].astype(str).str.strip()

if "height" in df_away.columns:
    df_away["height"] = df_away["height"].fillna(df_away["height"].mean(skipna=True))
if "weight" in df_away.columns:
    df_away["weight"] = df_away["weight"].fillna(df_away["weight"].mean(skipna=True))
if "current_rank" in df_away.columns:
    df_away["current_rank"] = df_away["current_rank"].fillna(df_away["current_rank"].median(skipna=True))

for col in ["gender", "plays"]:
    if col in df_away.columns:
        mode_val = df_away[col].mode(dropna=True)
        if not mode_val.empty:
            df_away[col] = df_away[col].fillna(mode_val.iloc[0])
        else:
            df_away[col] = df_away[col].fillna("Unknown")

for col in ["player_id", "full_name", "country"]:
    if col in df_away.columns:
        df_away[col] = df_away[col].fillna("Unknown")

if "match_id" in df_away.columns:
    df_away["match_id"] = df_away["match_id"].astype(int)

df_away.to_csv(output_path, index=False)
print("✅ away_team_final.csv created successfully!")
print(df_away.info())
print(df_away.isna().sum())

### Normalization: Tournament Table

In [None]:
input_path = os.path.join(clean_path, "tournament_clean.csv")
output_path = os.path.join(clean_path, "tournament_final.csv")

df_tournament = pd.read_csv(input_path)

df_tournament['match_id'] = df_tournament['match_id'].astype(int)

df_tournament.to_csv(output_path, index=False)
print("✅ tournament_final.csv created successfully!")
print(df_tournament.info())
print(df_tournament.isna().sum())

### Normalization: Statistics Table

In [None]:
input_path = os.path.join(clean_path, "statistics_clean.csv")
output_path = os.path.join(clean_path, "statistics_final.csv")

df_statistics = pd.read_csv(input_path)

df_statistics['date_source'] = pd.to_datetime(df_statistics['date_source'], format='%Y%m%d')
df_statistics['home_value'] = pd.to_numeric(df_statistics['home_value'], errors='coerce')
df_statistics['away_value'] = pd.to_numeric(df_statistics['away_value'], errors='coerce')
df_statistics['statistic_name'] = df_statistics['statistic_name'].astype(str).str.replace(" ", "_").str.lower()
df_statistics['match_id'] = df_statistics['match_id'].astype(int)



df_statistics.to_csv(output_path, index=False)
print("✅ statistics_final.csv created successfully!")
print(df_statistics.info())
print(df_statistics.isna().sum())

### Normalization: Time Table

In [None]:
input_path = os.path.join(clean_path, "time_clean.csv")
output_path = os.path.join(clean_path, "time_final.csv")

df_time = pd.read_csv(input_path)

periods = ["period_1", "period_2", "period_3"]

df_time["match_id"] = df_time["match_id"].astype(int)
df_time['date_source'] = pd.to_datetime(df_time['date_source'], format='%Y%m%d')
df_time['current_period_start_timestamp'] = pd.to_datetime(df_time['current_period_start_timestamp'], unit='s', errors='coerce')

# Values above this cutoff are assumed to be milliseconds (100,000 ms = 100 s)
MS_THRESHOLD = 100_000

for period in periods:
    df_time[period] = pd.to_numeric(df_time[period], errors='coerce').abs()
    mask = df_time[period] > MS_THRESHOLD
    df_time.loc[mask, period] = df_time.loc[mask, period] / 1000 # convert milliseconds to seconds


df_time.to_csv(output_path, index=False)
print("✅ time_final.csv created successfully!")
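
The unit heuristic in the cell above can be checked on single values. The sketch below restates it as a plain function; `normalize_period` is a hypothetical helper, and the sample values are made up.

```python
CUTOFF = 100_000  # same cutoff as the cell above


def normalize_period(value):
    """Return a period duration in seconds, or None if it is missing."""
    if value is None:
        return None
    value = abs(float(value))
    # Values above the cutoff are assumed to be milliseconds.
    return value / 1000 if value > CUTOFF else value


print(normalize_period(2_400))       # plausible seconds, kept as is
print(normalize_period(2_400_000))   # treated as milliseconds
print(normalize_period(-1_800))      # sign removed
```

The heuristic has a blind spot worth keeping in mind: a period recorded in milliseconds but shorter than 100 seconds stays below the cutoff and would be read as seconds.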

### Normalization: Point By Point Table

In [None]:
input_path = os.path.join(clean_path, "pbp_clean.csv")
output_path = os.path.join(clean_path, "pbp_final.csv")

df_pbp = pd.read_csv(input_path)

df_pbp['date_source'] = pd.to_datetime(df_pbp['date_source'], format='%Y%m%d')
df_pbp['match_id'] = df_pbp['match_id'].astype(int)
df_pbp['home_point'] = df_pbp['home_point'].astype(int)
df_pbp['away_point'] = df_pbp['away_point'].astype(int)
df_pbp['home_score'] = df_pbp['home_score'].astype(int)
df_pbp['set_id'] = df_pbp['set_id'].astype(int)
df_pbp['game_id'] = df_pbp['game_id'].astype(int)
df_pbp['point_id'] = df_pbp['point_id'].astype(int)

df_pbp.to_csv(output_path, index=False)
print("✅ pbp_final.csv created successfully!")

### Normalization: Power Table

In [None]:
# The original cell repeated the point-by-point code; this version reads and writes the power table.
input_path = os.path.join(clean_path, "power_clean.csv")
output_path = os.path.join(clean_path, "power_final.csv")

df_power = pd.read_csv(input_path)

df_power['date_source'] = pd.to_datetime(df_power['date_source'], format='%Y%m%d')
df_power['match_id'] = df_power['match_id'].astype(int)
df_power['set_num'] = df_power['set_num'].astype(int)
df_power['game_num'] = df_power['game_num'].astype(int)

df_power.to_csv(output_path, index=False)
print("✅ power_final.csv created successfully!")

### Question 1: How many tennis players are included in the dataset?

To find the number of unique tennis players, we combine the home and away team tables, remove duplicate players based on `player_id`, and count how many unique players remain.
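
The difference between counting rows and counting distinct players shows up on a small example. This is a sketch with made-up IDs, not data from the project.

```python
import pandas as pd

# Made-up toy rows: player 11 appears once as home and once as away.
home = pd.DataFrame({"match_id": [1, 2], "player_id": [11, 12]})
away = pd.DataFrame({"match_id": [1, 2], "player_id": [13, 11]})

players = pd.concat([home, away], ignore_index=True)

naive_count = len(players)                      # counts appearances, not players
unique_count = players["player_id"].nunique()   # counts distinct players

print(naive_count, unique_count)
```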


In [None]:
import pandas as pd
import numpy as np
import os

# Relative path to the clean-data folder, instead of a machine-specific absolute path
base = os.path.join("..", "data", "processed", "clean")

# Load cleaned & normalized player data
df_home = pd.read_csv(os.path.join(base, "home_team_final.csv"))
df_away = pd.read_csv(os.path.join(base, "away_team_final.csv"))

# Combine home and away players
players = pd.concat([df_home, df_away], ignore_index=True)

# Keep only unique players
unique_players = players.drop_duplicates(subset="player_id")

# Count unique players
total_unique_players = unique_players.shape[0]

print("total unique players =",total_unique_players)

### Question 2: What is the average height of the players?
The goal of this question is to calculate the average height of the tennis players in the dataset.
To do this, we first build a table of unique players, so that a player who appears in both the home and away tables is counted only once.
We then treat invalid heights (0 or NaN) as missing, so that placeholder values do not distort the average.
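
Why the zero placeholders must be excluded is easy to see with a few numbers. The sketch below uses made-up heights and only the standard library.

```python
from statistics import mean

# Made-up heights in cm; 0 stands in for "missing".
heights = [185, 178, 0, 191, 0]

naive_mean = mean(heights)                 # zeros drag the average down
valid = [h for h in heights if h > 0]      # treat 0 as missing
true_mean = mean(valid)

print(naive_mean)
print(true_mean)
```

Filling the missing entries with the mean of the valid values afterwards leaves that mean unchanged, so the imputation step does not alter the reported average.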

In [None]:
# Relative path to the clean-data folder, instead of a machine-specific absolute path
base = os.path.join("..", "data", "processed", "clean")
# load data
df_home = pd.read_csv(os.path.join(base, "home_team_final.csv"))
df_away = pd.read_csv(os.path.join(base, "away_team_final.csv"))

# combine & remove duplicate players
players = pd.concat([df_home , df_away], ignore_index=True)
players_unique = players.drop_duplicates(subset="player_id").copy()

# replace zeros with NaN
players_unique.loc[:, "height"] = players_unique["height"].replace(0, np.nan)

# fill missing heights with mean
mean_height = players_unique["height"].mean(skipna=True)
players_unique.loc[:, "height"] = players_unique["height"].fillna(mean_height)

# Calculate average height
average_height = players_unique["height"].mean()
print("Average height =" , average_height)

The histogram + KDE curve helps us observe:  
- The central height tendency (mean of about 182 cm)  
- Spread of heights  
- Possible outliers  
- Whether the distribution is normal or skewed

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10,6))
plt.title("Player Height Distribution")
sns.histplot(players_unique["height"], kde=True, bins=30)
plt.xlabel("Height (cm)")
plt.ylabel("Count")
plt.grid(True, alpha=0.3)
plt.show()

### Insight
The average tennis player height is around **182 cm**.  
The distribution appears slightly right-skewed, meaning a small number of players are significantly taller than the average.

Most players fall between **175–190 cm**, which aligns with typical professional tennis standards.

### Question 10: Correlation Between Player Height and Ranking

We want to check whether taller players tend to have higher or lower rankings.
We combine the home and away players, remove duplicates, clean invalid values, and compute the Pearson correlation.

In [None]:
# Paths
# Relative path to the clean-data folder, instead of a machine-specific absolute path
base = os.path.join("..", "data", "processed", "clean")

# Load datasets
df_home = pd.read_csv(os.path.join(base, "home_team_final.csv"))
df_away = pd.read_csv(os.path.join(base, "away_team_final.csv"))

# Merge home + away
players = pd.concat([df_home, df_away], ignore_index=True)

# Keep unique players (copy so the assignments below do not write to a view)
players_unique = players.drop_duplicates(subset="player_id").copy()

# Replace invalid zeros with NaN
players_unique.loc[:, "height"] = players_unique["height"].replace(0, np.nan)
players_unique.loc[:, "current_rank"] = players_unique["current_rank"].replace(0, np.nan)

# Drop rows with missing required values
clean_players = players_unique.dropna(subset=["height", "current_rank"])

# Compute correlation
correlation = clean_players["height"].corr(clean_players["current_rank"], method="pearson")

print("correlation =" , correlation)


Correlation = 0.10355

Conclusion:

There is no meaningful correlation between player height and ranking.
Height does not significantly influence global ranking.

In [None]:
plt.figure(figsize=(10,6))
sns.regplot(
    x="height",
    y="current_rank",
    data=clean_players,
    scatter_kws={"alpha":0.4},
    line_kws={"color":"red"}
)

plt.title("Height vs Ranking")
plt.xlabel("Height (cm)")
plt.ylabel("Ranking (Lower is Better)")
plt.grid(True, alpha=0.3)
plt.show()

###  Insight
The correlation coefficient was approximately **0.10**, indicating a **very weak positive correlation**.

This means **taller players tend to have slightly worse rankings**, but the effect is extremely small.  

Height does NOT strongly predict performance or ranking in professional tennis.
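
For reference, the Pearson coefficient reported above is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up toy values, which are not taken from the dataset; `pearson` is a hypothetical helper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up toy values, not taken from the dataset.
heights = [170, 180, 190, 200]
ranks = [40, 10, 30, 20]
r = pearson(heights, ranks)
print(round(r, 3))
```

Squaring the reported coefficient gives 0.10355² ≈ 0.011, so under a linear model height accounts for roughly 1% of the variance in rank.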

--------------------

### Question 11: What is the average duration of matches?

To calculate the average match duration, we used the `time_final.csv` dataset, which contains
the duration of each period inside a tennis match:

- `period_1`
- `period_2`
- `period_3`

These periods contain the **duration in seconds**.

#### **Steps**
1. Replace NaN values in period columns with 0.  
2. Compute total match duration:  `duration_seconds = period_1 + period_2 + period_3`  
3. Keep only matches where duration > 0.  
4. Compute the mean duration.


In [None]:
# Relative path to the clean-data folder, instead of a machine-specific absolute path
base = os.path.join("..", "data", "processed", "clean")

df_time = pd.read_csv(os.path.join(base, "time_final.csv"))

# Replace NaN periods with 0
for col in ["period_1", "period_2", "period_3"]:
    df_time[col] = df_time[col].fillna(0)

# Compute duration in seconds
df_time.loc[:, "duration_seconds"] = df_time["period_1"] + df_time["period_2"] + df_time["period_3"]

# Filter out zeros (invalid matches)
df_valid = df_time[df_time["duration_seconds"] > 0].copy()
df_valid["duration_minutes"] = df_valid["duration_seconds"] / 60

# Average duration
avg_sec = df_valid["duration_seconds"].mean()
avg_minutes = avg_sec / 60
avg_hours = avg_minutes / 60

print("Average duration (seconds)=", avg_sec)
print("Average duration (minutes)=", avg_minutes)
print("Average duration (hours)=", avg_hours)


### **Final Answer**
- **Average duration (seconds):** 6705.87  
- **Average duration (minutes):** 111.76  
- **Average duration (hours):** 1.86  

#### **Average tennis match duration ≈ 1 hour and 52 minutes**
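
The unit conversions behind these figures can be reproduced from the reported average in seconds. This is a short arithmetic sketch; only the value 6705.87 comes from the result above.

```python
avg_sec = 6705.87  # average duration in seconds, from the result above

avg_minutes = avg_sec / 60
avg_hours = avg_sec / 3600

# Split into whole hours and remaining minutes for a readable label.
hours, remainder = divmod(avg_sec, 3600)
minutes = round(remainder / 60)

print(round(avg_minutes, 2), round(avg_hours, 2))
print(f"{int(hours)} h {minutes} min")
```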

In [None]:
plt.figure(figsize=(10,6))
plt.title("Distribution of Match Duration")
sns.histplot(df_valid["duration_minutes"], kde=True, bins=40)
plt.xlabel("Duration (minutes)")
plt.ylabel("Count")
plt.grid(True, alpha=0.3)
plt.show()

###  Insight

The distribution shows:  
- Most matches are between **60 and 130 minutes**  
- A small number of matches last more than **180 minutes**  
- Very short or extremely long matches are rare  

### Question 12: What is the average number of games per set in men's matches compared to women's matches?

To calculate the average number of games per set for men's and women's matches, we used three datasets: `home_team_final.csv`, `away_team_final.csv`, and `pbp_final.csv`.

The `gender` column in `home_team_final.csv` and `away_team_final.csv` gives the **gender** of each player, and `pbp_final.csv` lists the points of every game, which lets us count the games in each set of a match.

#### **Steps**
1. Read the datasets into dataframes.
2. Create a `df_gender` dataframe that holds the `match_id` and `gender` of the home and away players.
3. Assign one gender to each match by grouping on `match_id` and taking the mode of the players' genders: `df_gender = df_gender.groupby("match_id")['gender'].apply(lambda x: x.mode()[0]).reset_index()`
4. Count the distinct games in each set of each match and store them in `df_games`.
5. Merge `df_gender` and `df_games` for the final calculation.
6. Group `df_merged` by gender and average the games per set for each gender (ignoring rows where `gender` is unknown).
7. Compute the share of gender-unknown rows in the merged data.
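
The steps above can be traced on a toy example before running them on the full data. This is a sketch with made-up rows; the gender labels `"M"` and `"F"` are assumptions about how the column is encoded.

```python
import pandas as pd

# Made-up toy data: match 1 is a men's match, match 2 a women's match.
players = pd.DataFrame({
    "match_id": [1, 1, 2, 2],
    "gender": ["M", "M", "F", "F"],
})
pbp = pd.DataFrame({
    "match_id": [1, 1, 1, 2, 2],
    "set_id":   [1, 1, 1, 1, 1],
    "game_id":  [1, 1, 2, 1, 1],   # several points share one game_id
})

# One gender label per match: the most frequent value among its players.
match_gender = (
    players.groupby("match_id")["gender"]
    .agg(lambda s: s.mode().iloc[0])
    .reset_index()
)

# Distinct games per (match, set), so repeated points are not double counted.
games = (
    pbp.groupby(["match_id", "set_id"])["game_id"]
    .nunique()
    .reset_index(name="games_in_set")
)

merged = games.merge(match_gender, on="match_id", how="inner")
avg_games = merged.groupby("gender")["games_in_set"].mean()
print(avg_games)
```

Using `nunique` on `game_id` is what turns point-level rows into a game count; a plain row count would measure points per set instead.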

In [None]:
df_home = pd.read_csv(os.path.join(etl_files_path, "home_team_final.csv"))
df_away = pd.read_csv(os.path.join(etl_files_path, "away_team_final.csv"))
df_pbp = pd.read_csv(os.path.join(etl_files_path, "pbp_final.csv"))

df_gender = pd.concat(
    [df_home[['match_id', 'gender']],
     df_away[['match_id', 'gender']]],
    ignore_index=True)

df_gender = df_gender.groupby("match_id")['gender'].apply(lambda x: x.mode()[0]).reset_index()

df_games = df_pbp.groupby(['match_id', 'set_id',])['game_id'].nunique().reset_index(name='games_in_set')

df_merged = df_games.merge(df_gender, on='match_id', how='inner')

# errors='ignore' keeps this from failing when no match has an unknown gender
result = df_merged.groupby('gender')['games_in_set'].mean().drop('Unknown', errors='ignore')

stat = df_merged.groupby('gender')['games_in_set']

ratio_of_unknown = (df_merged['gender'] == 'Unknown').sum() / len(df_merged) * 100

print(result)
print(ratio_of_unknown)
print('maximum games in a set', stat.max())

### **Final Answer**
- The average number of games per set in men's matches is approximately **9.27**
- The average number of games per set in women's matches is approximately **8.90**
- The highest number of games in a set for both women and men was **13**.

#### Sets from matches with unknown gender make up approximately **5.1%** of the merged data

In [None]:
fig, ax = plt.subplots(figsize=(16, 7))
sns.barplot(
    y=list(result.index),  # take the labels from the result so each bar matches its value
    x=result.values,
    ax=ax,
    color='skyblue',
    edgecolor='gray',
    alpha=0.9
)
ax.set_title('Average Number of Games per Set by Gender', fontsize=16, loc='left')
ax.set_ylabel('Gender', fontsize=14, loc='bottom', color='gray')
ax.set_xlabel('Games per Set', fontsize=14, loc="left", color='gray')
ax.tick_params(axis='both', which='major', colors='gray', labelsize=10)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_color('gray')
ax.grid(axis='x', linestyle='--', alpha=0.7)
ax.set_xticks(np.arange(0, 10, 1))
plt.show()

### Insight

The statistics show:
- The average number of games per set differs only slightly between men's and women's matches **(about 0.4 games per set)**.
- About **5.1%** of the sets belong to matches with unknown gender and were not included in this statistic.