
#Exploratory Data Analysis of Tennis Rankings
—Eric Ashby—

##Introduction
This project analyzes data on the 2022 tennis rankings. The data set includes the name, country, points, and rank of tennis players from around the world as of 2022. The project aims to compare points and rankings across the players in the data set, determine how one relates to the other, and demonstrate the capabilities of dictionaries, lists, and for loops in Python.

##Analysis Purpose
This project endeavors to
*   Determine the best ranking player
*   Determine the highest scoring player
*   Determine how points affect rank
*   Determine the points scored by the player in rank 100
*   Create a list of the top 5 best ranking players
*   Determine the average points scored by the top 3 best ranking players
*   Demonstrate the capabilities of dictionaries, lists, and for loops in Python



In [None]:
import pandas as pd

df = pd.read_csv("Rankings_2022.csv")

##Overview
The following code shows the first 5 rows of the data set to give us an idea of how the data is laid out.


In [None]:
df.head()

Unnamed: 0,rank,name,country_name,country_id,points,bestRank,bestRankDate,rankDiff,pointsDiff,bestPoints
0,1,Novak Djokovic,Serbia,SRB,11015,1,2011-07-04,0.0,0.0,16950
1,2,Daniil Medvedev,Russian Federation,RUS,10125,2,2021-03-15,0.0,1190.0,10780
2,3,Alexander Zverev,Germany,GER,7780,3,2017-11-06,0.0,-190.0,8240
3,4,Stefanos Tsitsipas,Greece,GRE,7170,3,2021-08-09,0.0,630.0,8350
4,5,Rafael Nadal,Spain,ESP,6875,1,2008-08-18,0.0,2000.0,15390


The code below prints the dataframe meta info. The data set has 10 columns and 200 entries, one per tennis player, with 4 missing values in each of the `rankDiff` and `pointsDiff` columns.

In [None]:
print("Dataframe Meta Info:")
df.info()

Dataframe Meta Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   rank          200 non-null    int64  
 1   name          200 non-null    object 
 2   country_name  200 non-null    object 
 3   country_id    200 non-null    object 
 4   points        200 non-null    int64  
 5   bestRank      200 non-null    int64  
 6   bestRankDate  200 non-null    object 
 7   rankDiff      196 non-null    float64
 8   pointsDiff    196 non-null    float64
 9   bestPoints    200 non-null    int64  
dtypes: float64(2), int64(4), object(4)
memory usage: 15.8+ KB


Shown here are the four entries for which there are no values in `rankDiff` and `pointsDiff`.

In [None]:
df[df["rankDiff"].isna()]

Unnamed: 0,rank,name,country_name,country_id,points,bestRank,bestRankDate,rankDiff,pointsDiff,bestPoints
177,178,Camilo Ugo Carabelli,Argentina,ARG,358,178,2022-01-31,,,358
180,181,Pavel Kotov,Russian Federation,RUS,354,181,2022-01-31,,,354
194,195,Thomas Fabbiano,Italy,ITA,332,70,2017-09-18,,,715
197,198,Facundo Mena,Argentina,ARG,327,195,2019-09-16,,,327
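
The missing values can also be counted directly instead of being read off the `df.info()` output. The following is a minimal sketch, assuming a small hand-built `sample` dataframe (a hypothetical stand-in with two rows and five columns copied from the outputs above) in place of the full CSV.

```python
import pandas as pd

# Hypothetical stand-in for the full data set: two rows copied from the
# outputs above, one complete and one with missing values.
sample = pd.DataFrame(
    {
        "rank": [1, 178],
        "name": ["Novak Djokovic", "Camilo Ugo Carabelli"],
        "points": [11015, 358],
        "rankDiff": [0.0, None],
        "pointsDiff": [0.0, None],
    }
)

# isna() marks each missing cell as True; sum() then counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Keep only the rows that have a missing value in at least one column.
rows_with_gaps = sample[sample.isna().any(axis=1)]
print(rows_with_gaps["name"].tolist())
```

Run against the full `df`, the same two calls should report the 4 missing entries per column described above.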


Below, we have the descriptive statistics for the data set. Notice that the median is below the mean for most columns (`points`, `bestRank`, `pointsDiff`, and `bestPoints`), which suggests right skew. In particular, note that the data for `points` and `pointsDiff` are especially right skewed.

In [None]:
df.describe()

Unnamed: 0,rank,points,bestRank,rankDiff,pointsDiff,bestPoints
count,200.0,200.0,200.0,196.0,196.0,200.0
mean,100.5,1142.55,64.725,-0.178571,19.183673,1907.095
std,57.879185,1460.045558,53.173929,10.119985,213.437788,2450.270334
min,1.0,325.0,1.0,-71.0,-1020.0,327.0
25%,50.75,455.5,19.0,-2.0,-20.0,692.5
50%,100.5,719.0,52.5,1.0,0.0,1062.0
75%,150.25,1149.0,99.75,4.0,29.25,2115.75
max,200.0,11015.0,195.0,28.0,2000.0,16950.0
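
Skew can be checked numerically as well as by comparing the mean and median. The sketch below assumes a hypothetical `points` series holding only the 15 point totals that appear in this notebook's outputs, so its numbers will differ from those of the full 200-row data set.

```python
import pandas as pd

# Hypothetical sample: the points values visible in this notebook's outputs,
# not the full 200-row column.
points = pd.Series(
    [11015, 10125, 7780, 7170, 6875, 1259, 1160, 1049,
     719, 504, 483, 358, 354, 332, 327],
    name="points",
)

mean_points = points.mean()
median_points = points.median()
skewness = points.skew()  # sample skewness; positive values indicate right skew

print(f"mean={mean_points:.2f}, median={median_points:.2f}, skew={skewness:.2f}")
```

A mean above the median together with a positive skewness value is what we would expect from a few very high scorers pulling the average upward.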


##Analysis
###Who is the best ranked player?
This code sorts the data by `rank` and prints the entry for the best ranked player, which we find to be Novak Djokovic from Serbia.


In [None]:
df.sort_values(by = "rank").head(1)

Unnamed: 0,rank,name,country_name,country_id,points,bestRank,bestRankDate,rankDiff,pointsDiff,bestPoints
0,1,Novak Djokovic,Serbia,SRB,11015,1,2011-07-04,0.0,0.0,16950


###Who is the highest scoring player?
The code below sorts the data by `points` in descending order and prints the entry for the highest scoring player, which we find to also be Novak Djokovic of Serbia.

In [None]:
df.sort_values(by = "points", ascending = False).head(1)

Unnamed: 0,rank,name,country_name,country_id,points,bestRank,bestRankDate,rankDiff,pointsDiff,bestPoints
0,1,Novak Djokovic,Serbia,SRB,11015,1,2011-07-04,0.0,0.0,16950


###How do points affect rank?
Here, we have code that determines the correlation coefficient between `points` and `rank`. We find that `rank` and `points` are moderately negatively correlated, meaning that as points increase, rank number decreases (improves). More succinctly: high-scoring players tend to rank better.

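It is interesting that the correlation is not stronger. This may imply that rank is determined by more factors than just points, although the Pearson coefficient computed here only measures linear association, so a strongly curved relationship between `points` and `rank` would also lower it.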

In [None]:
df[['rank' , 'points']].corr()

Unnamed: 0,rank,points
rank,1.0,-0.631281
points,-0.631281,1.0
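
One way to separate "not linear" from "not related" is to compare the Pearson coefficient with a rank-based (Spearman) coefficient, which is the Pearson correlation of the rank-transformed values. The sketch below is an illustration only: `sample` is a hypothetical dataframe built from the 15 players whose rank and points appear in this notebook's outputs, not the full data set.

```python
import pandas as pd

# Hypothetical sample: rank and points for the players shown in this notebook.
sample = pd.DataFrame(
    {
        "rank": [1, 2, 3, 4, 5, 44, 47, 58, 100, 140, 144, 178, 181, 195, 198],
        "points": [11015, 10125, 7780, 7170, 6875, 1259, 1160, 1049,
                   719, 504, 483, 358, 354, 332, 327],
    }
)

# Pearson: strength of the straight-line relationship.
pearson = sample["rank"].corr(sample["points"])

# Spearman: Pearson correlation applied to the rank-transformed columns,
# so it measures whether the relationship is consistently decreasing.
spearman = sample["rank"].rank().corr(sample["points"].rank())

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

In this sample, points fall every time rank number rises, so the rank-based coefficient reaches -1 while the Pearson coefficient stays weaker. Whether the same holds for all 200 players would need to be checked on the full `df`.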


###How many points were scored by the player in rank 100?

This code determines the player in rank 100 and creates an associated dictionary called `middle_rank_player` containing their rank, name, country, and points.

In [None]:
#written as a for loop because .iterrows() lets us read each value as a plain scalar, without index data attached
middle_rank_player = None  #stays None if no player holds rank 100
for index, player in df.iterrows():
  if player['rank'] == 100:
    middle_rank_player = {"rank" : player['rank'] , "name" : player['name'] , "country name" : player['country_name'] , "points" : player['points']}
    break  #stop once the player is found
middle_rank_player

{'rank': 100,
 'name': 'Steve Johnson',
 'country name': 'United States',
 'points': 719}

From our dictionary, we can pull the points scored:

In [None]:
middle_rank_player['points']

719
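
For comparison, the same lookup can be written without a loop by using a boolean mask. This is a sketch under assumptions: `sample` is a hypothetical three-row dataframe copied from rows shown in this notebook, and the resulting dictionary uses the column name `country_name` as its key rather than `"country name"`.

```python
import pandas as pd

# Hypothetical three-row sample copied from outputs shown in this notebook.
sample = pd.DataFrame(
    {
        "rank": [58, 100, 140],
        "name": ["Adrian Mannarino", "Steve Johnson", "Aleksandar Vukic"],
        "country_name": ["France", "United States", "Australia"],
        "points": [1049, 719, 504],
    }
)

# The boolean mask keeps only the rows where rank equals 100.
match = sample.loc[sample["rank"] == 100, ["rank", "name", "country_name", "points"]]

# Take the first (and here only) matching row and convert it to a dictionary.
middle_rank_player_alt = match.iloc[0].to_dict()
print(middle_rank_player_alt["points"])
```

The loop version makes the row-by-row logic explicit, while the mask version is shorter and lets pandas do the comparison across the whole column at once.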

###Create a list of the top 5 best ranking players
The following code provides the top 5 ranking players in the data set. We may note that this is the same as the first five rows of the unsorted data set. This is because the data set comes already sorted by `rank`.

In [None]:
dfTop5 = df.sort_values(by = 'rank').head()
dfTop5

Unnamed: 0,rank,name,country_name,country_id,points,bestRank,bestRankDate,rankDiff,pointsDiff,bestPoints
0,1,Novak Djokovic,Serbia,SRB,11015,1,2011-07-04,0.0,0.0,16950
1,2,Daniil Medvedev,Russian Federation,RUS,10125,2,2021-03-15,0.0,1190.0,10780
2,3,Alexander Zverev,Germany,GER,7780,3,2017-11-06,0.0,-190.0,8240
3,4,Stefanos Tsitsipas,Greece,GRE,7170,3,2021-08-09,0.0,630.0,8350
4,5,Rafael Nadal,Spain,ESP,6875,1,2008-08-18,0.0,2000.0,15390


Here, we create a list containing each of the top 5 players, called `the_greats`.

In [None]:
#both of these work but the second returns a list of dictionaries as opposed to a list of dataframe rows

# the_greats = []  #initialize as empty list
# for index , player in dfTop5.iterrows():
#   the_greats.append(player)

columnTitles = dfTop5.columns.values
the_greats = []  #initialize as empty list
for index , playerRow in dfTop5.iterrows():
  playerDict = {}  #initialize as empty dictionary
  for key in columnTitles:
    playerDict[key] = playerRow[key]  #copy each column value into the dictionary
  the_greats.append(playerDict)

If we want information on the fifth ranked player, Rafael Nadal, we can pull his information with this code:

In [None]:
the_greats[4]

{'rank': 5,
 'name': 'Rafael Nadal',
 'country_name': 'Spain',
 'country_id': 'ESP',
 'points': 6875,
 'bestRank': 1,
 'bestRankDate': '2008-08-18',
 'rankDiff': 0.0,
 'pointsDiff': 2000.0,
 'bestPoints': 15390}
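
The nested loop above builds one dictionary per row by hand. pandas can produce the same list-of-dictionaries shape with `DataFrame.to_dict(orient="records")`. The sketch below assumes a hypothetical `top5_sample` dataframe holding three of the columns from the top 5 rows shown earlier.

```python
import pandas as pd

# Hypothetical sample: three columns from the top 5 rows shown earlier.
top5_sample = pd.DataFrame(
    {
        "rank": [1, 2, 3, 4, 5],
        "name": ["Novak Djokovic", "Daniil Medvedev", "Alexander Zverev",
                 "Stefanos Tsitsipas", "Rafael Nadal"],
        "points": [11015, 10125, 7780, 7170, 6875],
    }
)

# orient="records" returns one dictionary per row, keyed by column name.
the_greats_alt = top5_sample.to_dict(orient="records")

# Lists are zero-indexed, so position 4 holds the fifth ranked player.
print(the_greats_alt[4])
```

Writing the loop by hand is still useful for seeing how a list of dictionaries is assembled one entry at a time.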

###What is the average of the points scored by the top 3 best ranking players?

The following code selects the top 3 ranking players from `the_greats`, keeping only the `name` and `points` of each.

In [None]:
top_3_greats = []
keys = ['name' , 'points']
for index in range(3):
  player = {}
  for key in keys:
    player[key]=the_greats[index][key]
  top_3_greats.append(player)

top_3_greats

[{'name': 'Novak Djokovic', 'points': 11015},
 {'name': 'Daniil Medvedev', 'points': 10125},
 {'name': 'Alexander Zverev', 'points': 7780}]

We find the average, as calculated below, to be 9640 points.

In [None]:
pointSum = 0
for player in top_3_greats:
  pointSum += player['points']

pointMean = pointSum / len(top_3_greats)
print(pointMean)

9640.0
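
The same average can be written more compactly with a list comprehension and the built-in `sum` and `len` functions. This sketch redefines `top_3_greats` from the output above so that it runs on its own; the name `points_list` is introduced here for illustration.

```python
# The top 3 players, copied from the output above.
top_3_greats = [
    {"name": "Novak Djokovic", "points": 11015},
    {"name": "Daniil Medvedev", "points": 10125},
    {"name": "Alexander Zverev", "points": 7780},
]

# Collect the points into a plain list, then average them.
points_list = [player["points"] for player in top_3_greats]
point_mean = sum(points_list) / len(points_list)

print(point_mean)  # (11015 + 10125 + 7780) / 3 = 9640.0
```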


###Demonstration of dictionary and list capabilities

The code below sorts the data alphabetically by name, returning the first 5 entries.

In [None]:
dfAlpha5 = df.sort_values(by = "name").head()
dfAlpha5

Unnamed: 0,rank,name,country_name,country_id,points,bestRank,bestRankDate,rankDiff,pointsDiff,bestPoints
57,58,Adrian Mannarino,France,FRA,1049,22,2018-03-19,11.0,170.0,1735
43,44,Albert Ramos,Spain,ESP,1259,17,2017-05-08,0.0,0.0,2180
46,47,Alejandro Davidovich Fokina,Spain,ESP,1160,32,2021-08-30,3.0,0.0,1723
143,144,Alejandro Tabilo,Chile,CHI,483,135,2021-11-29,-9.0,-35.0,518
139,140,Aleksandar Vukic,Australia,AUS,504,140,2022-01-31,4.0,27.0,504


Here, we create a dictionary containing `name` and `rank` data for the first 5 players in alphabetical order. We see that Albert Ramos had the best ranking among these five players.

In [None]:
playersAlpha5 = {}  #initialize an empty dictionary
for index, player in dfAlpha5.iterrows():
  playersAlpha5[player['name']] = player['rank']  #fill the dictionary through iteration: name is the key, rank is the value

playersAlpha5

{'Adrian Mannarino': 58,
 'Albert Ramos': 44,
 'Alejandro Davidovich Fokina': 47,
 'Alejandro Tabilo': 144,
 'Aleksandar Vukic': 140}
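
Rather than reading the best ranking off the printed dictionary, we can ask Python for it. The sketch below redefines `playersAlpha5` from the output above so that it runs on its own; `best_name` and `sorted_by_rank` are names introduced here for illustration.

```python
# Name-to-rank dictionary, copied from the output above.
playersAlpha5 = {
    "Adrian Mannarino": 58,
    "Albert Ramos": 44,
    "Alejandro Davidovich Fokina": 47,
    "Alejandro Tabilo": 144,
    "Aleksandar Vukic": 140,
}

# min() iterates over the keys; key=playersAlpha5.get compares them by rank.
best_name = min(playersAlpha5, key=playersAlpha5.get)
print(best_name, playersAlpha5[best_name])  # Albert Ramos 44

# Sort the (name, rank) pairs from best rank to worst.
sorted_by_rank = sorted(playersAlpha5.items(), key=lambda item: item[1])
for player_name, player_rank in sorted_by_rank:
    print(player_rank, player_name)
```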

The following code iterates through each letter in the name 'Daniil Medvedev' and prints each on a separate line.

In [None]:
name = 'Daniil Medvedev'
for letter in name:
  print(letter)

D
a
n
i
i
l
 
M
e
d
v
e
d
e
v
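
Combining this loop with a dictionary lets us count how often each letter appears. This is a small illustrative sketch; `letter_counts` is a name introduced here, and the count ignores case and skips the space.

```python
name = 'Daniil Medvedev'

letter_counts = {}  # maps each letter to the number of times it appears
for letter in name.lower():
    if letter == ' ':
        continue  # skip the space between first and last name
    # .get() returns 0 the first time a letter is seen
    letter_counts[letter] = letter_counts.get(letter, 0) + 1

print(letter_counts)
```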


##Results
Our analysis finds that the top-ranked and top-scoring player are the same person, Novak Djokovic of Serbia. Higher-scoring players tend to rank better: `rank` and `points` have a correlation coefficient of about -0.63. The player ranked 100th scored 719 points, and the top 3 ranked players scored an average of 9640 points.