# Tennis Ace - Multiple Linear Regression

This project aims to create a linear regression model that predicts the outcomes for a tennis player based on their playing habits.

Data from the Association of Tennis Professionals (ATP), https://en.wikipedia.org/wiki/Association_of_Tennis_Professionals, will be analysed and modelled.

# Data dictionary

    Player: name of the tennis player
    Year: year data was recorded

Service Game Columns (Offensive)

    Aces: number of serves by the player where the receiver does not touch the ball
    DoubleFaults: number of times player missed both first and second serve attempts
    FirstServe: % of first-serve attempts made
    FirstServePointsWon: % of first-serve attempt points won by the player
    SecondServePointsWon: % of second-serve attempt points won by the player
    BreakPointsFaced: number of times where the receiver could have won service game of the player
    BreakPointsSaved: % of the time the player was able to stop the receiver from winning service game when they had the chance
    ServiceGamesPlayed: total number of games where the player served
    ServiceGamesWon: total number of games where the player served and won
    TotalServicePointsWon: % of points in games where the player served that they won

Return Game Columns (Defensive)

    FirstServeReturnPointsWon: % of opponent’s first-serve points the player was able to win
    SecondServeReturnPointsWon: % of opponent’s second-serve points the player was able to win
    BreakPointsOpportunities: number of times where the player could have won the service game of the opponent
    BreakPointsConverted: % of the time the player was able to win their opponent’s service game when they had the chance
    ReturnGamesPlayed: total number of games where the player’s opponent served
    ReturnGamesWon: total number of games where the player’s opponent served and the player won
    ReturnPointsWon: total number of points where the player’s opponent served and the player won
    TotalPointsWon: % of points won by the player

Outcomes

    Wins: number of matches won in a year
    Losses: number of matches lost in a year
    Winnings: total winnings in USD($) in a year
    Ranking: ranking at the end of year

In [126]:
# importing the relevant libraries
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Data Processing

In [127]:
# loading in the csv file and looking at the top 5 rows
tennis_stats = pd.read_csv("tennis_stats.csv")
tennis_stats.head()

In [128]:
# looking at the bottom 5 rows
tennis_stats.tail()

In [129]:
# looking at the shape of the dataset, i.e. how many rows and columns there are
# there are 1721 rows and 24 columns
# also checking that the datatypes are correct
tennis_stats.info()

In [130]:
#Working out how many unique players are in the dataset
tennis_stats.Player.nunique()


In [131]:
# checking for null values
tennis_stats.isna().sum()

In [132]:
#dropping duplicate rows
tennis_stats.drop_duplicates(inplace=True)

In [133]:
tennis_stats.describe(include = "all")

# EDA


To see the relationship between the quantitative variables, scatter plots will be used.
To quantify each relationship, either a Pearson or a Spearman correlation coefficient may be used, depending on how linear the plot looks.
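
As a rough illustration of the difference between the two coefficients, here is a minimal sketch. The helper name `pearson_and_spearman` and the `demo` data are made up for this example and are not part of the tennis dataset. Spearman's coefficient is computed here as Pearson's coefficient on the ranks, so it stays at 1 for any strictly increasing relationship, while Pearson's drops when the relationship is not a straight line.

```python
import numpy as np
import pandas as pd


def pearson_and_spearman(x, y):
    """Return (Pearson r, Spearman rho) for two numeric pandas Series."""
    # Series.corr uses the Pearson method by default
    pearson = x.corr(y)
    # Spearman's rho is Pearson's r computed on the ranks of the values
    spearman = x.rank().corr(y.rank())
    return pearson, spearman


# Made-up data (assumption): a monotonic but non-linear relationship
demo = pd.DataFrame({"x": np.arange(1, 11, dtype=float)})
demo["y"] = demo["x"] ** 3

pearson_r, spearman_rho = pearson_and_spearman(demo["x"], demo["y"])
print(f"Pearson: {pearson_r:.3f}, Spearman: {spearman_rho:.3f}")
```

On the real data the same helper could be called as `pearson_and_spearman(tennis_stats["Aces"], tennis_stats["Wins"])`.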


In [134]:
Y = tennis_stats["Wins"]
X = [tennis_stats["Aces"], tennis_stats["DoubleFaults"], tennis_stats["FirstServe"],
     tennis_stats["FirstServePointsWon"], tennis_stats["SecondServePointsWon"],
     tennis_stats["BreakPointsFaced"], tennis_stats["BreakPointsSaved"],
     tennis_stats["ServiceGamesPlayed"], tennis_stats["ServiceGamesWon"],
     tennis_stats["TotalServicePointsWon"], tennis_stats["FirstServeReturnPointsWon"],
     tennis_stats["SecondServeReturnPointsWon"], tennis_stats["BreakPointsOpportunities"],
     tennis_stats["BreakPointsConverted"], tennis_stats["ReturnGamesPlayed"],
     tennis_stats["ReturnGamesWon"], tennis_stats["ReturnPointsWon"], tennis_stats["TotalPointsWon"]]

X_name = ["Aces", "DoubleFaults", "FirstServe", "FirstServePointsWon",
          "SecondServePointsWon", "BreakPointsFaced", "BreakPointsSaved",
          "ServiceGamesPlayed", "ServiceGamesWon", "TotalServicePointsWon",
          "FirstServeReturnPointsWon", "SecondServeReturnPointsWon", "BreakPointsOpportunities",
          "BreakPointsConverted", "ReturnGamesPlayed", "ReturnGamesWon", "ReturnPointsWon",
          "TotalPointsWon"]

In [135]:
# Create the subplots
fig, axes = plt.subplots(nrows=6, ncols=3, figsize=(20, 20))

# Flatten the axes array to loop through it with a single index
axes = axes.flatten()

# Loop through the X variables and plot Y vs. each X variable
for i, ax in enumerate(axes):
        ax.scatter(X[i], Y, alpha=0.5)
        ax.set_xlabel(X_name[i])
        ax.set_ylabel("Wins")
        ax.set_title(f'Wins vs. {X_name[i]}')

        # Calculate Pearson correlation coefficient
        correlation_coefficient = np.corrcoef(X[i], Y)[0, 1]

        # Display the correlation coefficient on the plot
        ax.text(0.1, 0.9, f"Pearson Corr: {correlation_coefficient:.2f}", transform=ax.transAxes, fontsize=10, fontweight='bold')

# Adjust the layout and add space between subplots
plt.tight_layout(pad=1.5)

# Show the plot
plt.show()


Let's build a single-feature linear regression model. Based on the graphs, the number of aces seems to correlate well with the number of wins, so let's start with this.

In [136]:
features = np.array(tennis_stats["Aces"]).reshape(-1,1)
outcome = np.array(tennis_stats["Wins"]).reshape(-1,1)

features_train, features_test, outcome_train, outcome_test = train_test_split(features, outcome, train_size=0.8)

model = LinearRegression()
model.fit(features_train, outcome_train)

# Calculate the coefficient of determination (R^2) on the test set
r_squared = model.score(features_test, outcome_test)
print("R-squared:", r_squared)

# Make predictions on the test set
prediction = model.predict(features_test)

# Plot the scatter plot
plt.scatter(outcome_test, prediction)
plt.xlabel("Actual Wins")
plt.ylabel("Predicted Wins")
plt.title("Actual Wins vs. Predicted Wins")
plt.show()


In [137]:
features = np.array(tennis_stats["ServiceGamesPlayed"]).reshape(-1,1)
outcome = np.array(tennis_stats["Wins"]).reshape(-1,1)

features_train, features_test, outcome_train, outcome_test = train_test_split(features, outcome, train_size=0.8)

model = LinearRegression()
model.fit(features_train, outcome_train)

# Calculate the coefficient of determination (R^2) on the test set
r_squared = model.score(features_test, outcome_test)
print("R-squared:", r_squared)

# Make predictions on the test set
prediction = model.predict(features_test)

# Plot the scatter plot
plt.scatter(outcome_test, prediction)
plt.xlabel("Actual Wins")
plt.ylabel("Predicted Wins")
plt.title("Actual Wins vs. Predicted Wins")
plt.show()


Next, a multiple linear regression model is built from the features that correlate well with the number of wins.
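
One possible way to shortlist such features is to rank the candidate columns by the absolute value of their Pearson correlation with the outcome. The sketch below assumes this approach. The helper name `rank_features_by_correlation` and the tiny `demo` frame are hypothetical, with invented values that only borrow column names from the dataset.

```python
import pandas as pd


def rank_features_by_correlation(df, target, candidates):
    """Return Pearson correlations with `target`, strongest first."""
    # Correlate every candidate column with the target column
    correlations = df[candidates].corrwith(df[target])
    # Order by absolute strength so strong negative correlations rank high too
    order = correlations.abs().sort_values(ascending=False).index
    return correlations.reindex(order)


# Made-up values (assumption), not taken from tennis_stats.csv
demo = pd.DataFrame({
    "Wins": [1, 2, 3, 4, 5, 6],
    "Aces": [2, 4, 6, 8, 10, 12],
    "FirstServe": [0.60, 0.50, 0.62, 0.55, 0.61, 0.58],
})

ranked = rank_features_by_correlation(demo, "Wins", ["FirstServe", "Aces"])
print(ranked)
```

With the real data, `candidates` could be the `X_name` list defined earlier and `target` would be `"Wins"`.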

In [138]:
# Create a list of the features you want to include in the array
feature_columns = [
    "ServiceGamesPlayed",
    "Aces",
    "DoubleFaults",
    "BreakPointsFaced",
    "BreakPointsOpportunities",
    "ReturnGamesPlayed"
]

# Use the list of feature columns to extract the corresponding data from tennis_stats
features = np.array(tennis_stats[feature_columns])
outcome = np.array(tennis_stats["Wins"]).reshape(-1, 1)

features_train, features_test, outcome_train, outcome_test = train_test_split(features, outcome, train_size=0.8)

model = LinearRegression()
model.fit(features_train, outcome_train)

# Calculate the coefficient of determination (R^2) on the test set
r_squared = model.score(features_test, outcome_test)
print("R-squared:", r_squared)

# Make predictions on the test set
prediction = model.predict(features_test)

# Plot the scatter plot
plt.scatter(outcome_test, prediction)
plt.xlabel("Actual Wins")
plt.ylabel("Predicted Wins")
plt.title("Actual Wins vs. Predicted Wins")
plt.show()

print(features)

In this section we will now build a multiple regression model to predict a player's winnings based on the features.
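
A multiple regression of this kind can be sketched as follows. This is only an illustration: the helper `fit_multiple_regression`, its `seed` parameter and the synthetic `demo` data are assumptions made for the example. Fixing `random_state` makes the train/test split repeatable, and pairing `model.coef_` with the column names shows how much each feature contributes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def fit_multiple_regression(df, feature_columns, target, seed=0):
    """Fit a linear model and return (test R^2, coefficients by feature)."""
    X = df[feature_columns]
    y = df[target]
    # A fixed random_state gives the same split on every run
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.8, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    # Label each coefficient with the feature it belongs to
    coefficients = pd.Series(model.coef_, index=feature_columns)
    return model.score(X_test, y_test), coefficients


# Synthetic data (assumption): winnings are an exact linear function
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Aces": rng.integers(0, 100, size=50),
    "DoubleFaults": rng.integers(0, 50, size=50),
})
demo["Winnings"] = 1000 * demo["Aces"] - 500 * demo["DoubleFaults"] + 20000

r_squared, coefficients = fit_multiple_regression(
    demo, ["Aces", "DoubleFaults"], "Winnings"
)
print("R-squared:", round(r_squared, 3))
print(coefficients)
```

Because the synthetic target is exactly linear, the fitted coefficients should come out close to 1000 and -500. Real data will be noisier.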

In [139]:
Y = tennis_stats["Winnings"]
X = [tennis_stats["Aces"], tennis_stats["DoubleFaults"], tennis_stats["FirstServe"],
     tennis_stats["FirstServePointsWon"], tennis_stats["SecondServePointsWon"],
     tennis_stats["BreakPointsFaced"], tennis_stats["BreakPointsSaved"],
     tennis_stats["ServiceGamesPlayed"], tennis_stats["ServiceGamesWon"],
     tennis_stats["TotalServicePointsWon"], tennis_stats["FirstServeReturnPointsWon"],
     tennis_stats["SecondServeReturnPointsWon"], tennis_stats["BreakPointsOpportunities"],
     tennis_stats["BreakPointsConverted"], tennis_stats["ReturnGamesPlayed"],
     tennis_stats["ReturnGamesWon"], tennis_stats["ReturnPointsWon"], tennis_stats["TotalPointsWon"]]

X_name = ["Aces", "DoubleFaults", "FirstServe", "FirstServePointsWon",
          "SecondServePointsWon", "BreakPointsFaced", "BreakPointsSaved",
          "ServiceGamesPlayed", "ServiceGamesWon", "TotalServicePointsWon",
          "FirstServeReturnPointsWon", "SecondServeReturnPointsWon", "BreakPointsOpportunities",
          "BreakPointsConverted", "ReturnGamesPlayed", "ReturnGamesWon", "ReturnPointsWon",
          "TotalPointsWon"]

In [140]:
# Create the subplots
fig, axes = plt.subplots(nrows=6, ncols=3, figsize=(20, 20))

# Flatten the axes array to loop through it with a single index
axes = axes.flatten()

# Loop through the X variables and plot Y vs. each X variable
for i, ax in enumerate(axes):
        ax.scatter(X[i], Y, alpha=0.5)
        ax.set_xlabel(X_name[i])
        ax.set_ylabel("Winnings")
        ax.set_title(f'Winnings vs. {X_name[i]}')

        # Calculate Pearson correlation coefficient
        correlation_coefficient = np.corrcoef(X[i], Y)[0, 1]

        # Display the correlation coefficient on the plot
        ax.text(0.1, 0.9, f"Pearson Corr: {correlation_coefficient:.2f}", transform=ax.transAxes, fontsize=10, fontweight='bold')

# Adjust the layout and add space between subplots
plt.tight_layout(pad=1.5)

# Show the plot
plt.show()


In [141]:
# Create a list of the features you want to include in the array
feature_columns = [
    "ServiceGamesPlayed",
    "Aces",
    "DoubleFaults",
    "BreakPointsFaced",
    "BreakPointsOpportunities",
    "ReturnGamesPlayed"
]

# Use the list of feature columns to extract the corresponding data from tennis_stats
features = np.array(tennis_stats[feature_columns])
outcome = np.array(tennis_stats["Winnings"]).reshape(-1, 1)

features_train, features_test, outcome_train, outcome_test = train_test_split(features, outcome, train_size=0.8)

model = LinearRegression()
model.fit(features_train, outcome_train)

# Calculate the coefficient of determination (R^2) on the test set
r_squared = model.score(features_test, outcome_test)
print("R-squared:", r_squared)

# Make predictions on the test set
prediction = model.predict(features_test)

# Plot the scatter plot
plt.scatter(outcome_test, prediction)
plt.xlabel("Actual Winnings")
plt.ylabel("Predicted Winnings")
plt.title("Actual Winnings vs. Predicted Winnings")
plt.show()

print(features)