# Intro to Python
## Solution Sheet

## Section 0: Setting up

Python is great.
But it's a little bare bones if you use it in its original form.
We need to import some libraries if we want to do data science.

*Libraries?*  
Consider Python as a workbench that came with a few basic tools.
You could build anything, but it would be hard work.
Libraries are like lovingly crafted little kits with specific tools that allow you to create beautiful things on your workbench and significantly speed up your work. 

Typically, you need to install libraries on your machine, but as we're using Colab, the ones we need today come pre-installed.
We only need to tell Python we want to use them, which is what we mean by 'importing'.
To stick with the analogy: we already bought our special tools (the 'installation'), but we need to take them out of the storage cabinet to use them ('importing').
We will cover installation of libraries in a separate session. 
R users will recognise the concept of libraries but might think of them as 'packages'.

A very commonly used library in data science is [pandas](https://pandas.pydata.org/). 
It provides easy-to-use data structures and data analysis tools for the Python programming language.
We will use it a lot in this workshop.

Other libraries often used are:

- numpy helps with doing maths on collections of numbers (like a matrix).
- scikit-learn & statsmodels fit machine learning and statistical models.
- matplotlib, seaborn, and plotly are libraries used for data viz.

For this workshop we will at least need pandas and a specific module that comes with plotly, named graph_objects. 
We import them in the following cell.

Put your cursor in the cell and press `ctrl+enter` to execute it.

In [113]:
import pandas as pd
import plotly.graph_objects as go

We will be using pandas a lot in this session.
Because we don't want to type 'pandas' every time we use one of its tools, we say 'as pd'. 
It means that we can now refer to pandas as pd, which saves a bit of typing. 
You can see we've done the same for plotly.graph_objects.
It is not essential to do this, but it's handy.

As a next step, we will import the data we'll be playing around with today.
Again, put your cursor in the cell, and press `ctrl+enter`.
This is going to take a bit of time, because it's a large-ish data file.


In [127]:
# import the data
#df = pd.read_csv('https://raw.githubusercontent.com/BobbyGlennS/bit-python-intro/main/data/combined_matches.csv')
df = pd.read_csv('data/combined_matches.csv')


Let's dissect this piece of code a little bit:

`# import the data`
This line does not actually do anything. It is a *code comment*. It's there to help the human reader understand what the code does. You can include comments in your code by preceding them with a hash `#`. This tells Python to ignore this line when executing the code. It's handy when writing lots of code, or to tell an AI what type of code you want it to generate for you (more on that in another session).

`df`    
We don't just want to load the data, we also need to tell Python to store it somewhere. We are creating a variable named `df`. It will serve as a pointer to the data that we are about to load into memory. This way Python knows to keep hold of it, and lets us bring it back up every time we type `df`.

`=`  
We're telling python that this new variable should point to the result of the code that follows the equals sign (`=`).  

`pd.read_csv`  
Use the function read_csv from the pandas library (which we said we would refer to as pd).  

`('https://....')`  
Between the parentheses we provide an 'argument'. It's an instruction to the function `read_csv()`. In this case the instruction is: 'go here to retrieve the file'. Note that we don't need to have the file on our laptop! You can point to any accessible location, including a file on your laptop but also, as is the case here, an online location.

Great! 
If all went well we now have a dataset loaded.
It's stored in Python's memory in an object called a pandas DataFrame, which we can retrieve by typing `df`.
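
To see the same pieces at work on a smaller scale, here is a minimal sketch that reads a made-up, two-row CSV from a text string instead of a file. The table contents and the names `csv_text` and `toy_df` are assumptions for illustration only; they are not part of the workshop data.

```python
import io

import pandas as pd

# Hypothetical miniature CSV, standing in for a file on disk or at a URL.
csv_text = """tourney_name,surface,winner_name
Dublin,Grass,Doug Smith
Dublin,Grass,Tom Okker
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # the column names taken from the first line
```

The variable on the left of `=` now points to a DataFrame, exactly as `df` does for the full dataset.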

#### A dataset of tennis
Before diving in, let's first quickly say something about the data itself.

The data we will be using contains all WTA (women's) and ATP (men's) tennis data since 1968.
It is carefully assembled by Jeff Sackmann and is freely available on the internet in the form of github repositories, links for which you can find here:

    https://github.com/JeffSackmann/tennis_atp
    https://github.com/JeffSackmann/tennis_wta

The dataset loaded here has been slightly preprocessed for the purposes of this workshop, by combining some datasets and leaving out some columns, but the data contained in the set we are working with has been left unaltered otherwise.

Ok, now we're all set, let's play!

First we have a little look.
Let's print the dataframe with a function named `print()`.

Like with `read_csv()`, we write the function and between the parentheses we provide an argument that contains an instruction for the function to work with. In this case the instruction is to print whatever we provide as an argument.

In [128]:
print(df)


       Tour          tourney_id      tourney_name surface tourney_level  \
0       ATP           1968-2029            Dublin   Grass             A   
1       ATP           1968-2029            Dublin   Grass             A   
2       ATP           1968-2029            Dublin   Grass             A   
3       ATP           1968-2029            Dublin   Grass             A   
4       ATP           1968-2029            Dublin   Grass             A   
...     ...                 ...               ...     ...           ...   
347318  WTA  2023-W-FC-2023-POS  BJK Cup Playoffs    Hard             D   
347319  WTA  2023-W-FC-2023-POS  BJK Cup Playoffs    Hard             D   
347320  WTA  2023-W-FC-2023-POS  BJK Cup Playoffs    Hard             D   
347321  WTA  2023-W-FC-2023-POS  BJK Cup Playoffs    Hard             D   
347322  WTA  2023-W-FC-2023-POS  BJK Cup Playoffs    Hard             D   

        tourney_date  match_num  winner_id         winner_name  winner_ht  \
0           19680708  

We can also use the *method* head() on a pandas DataFrame to get the first few rows, which looks slightly nicer.

The technical distinction between methods and functions is something for another time. What is useful to know is that methods are like little appendices that come with certain objects. The pandas DataFrame object has a set of methods that you can use on it, for example to explore the dataframe, which is what we will be doing in the next few code cells.

We call them slightly differently than functions: we write the name of the object we want to use a method on, and then after that we put a dot and we write the name of the method.
See below.


In [129]:
df.head()

Unnamed: 0,Tour,tourney_id,tourney_name,surface,tourney_level,tourney_date,match_num,winner_id,winner_name,winner_ht,winner_ioc,winner_age,loser_id,loser_name,loser_ht,loser_ioc,loser_age,score,best_of,round
0,ATP,1968-2029,Dublin,Grass,A,19680708,270,112411,Doug Smith,,AUS,,110196,Peter Ledbetter,,IRL,24.0,6-1 7-5,3,R32
1,ATP,1968-2029,Dublin,Grass,A,19680708,271,126914,Louis Pretorius,,RSA,,209536,Maurice Pollock,,IRL,,6-1 6-1,3,R32
2,ATP,1968-2029,Dublin,Grass,A,19680708,272,209523,Cecil Pedlow,,IRL,,209535,John Mulvey,,IRL,,6-2 6-2,3,R32
3,ATP,1968-2029,Dublin,Grass,A,19680708,273,100084,Tom Okker,178.0,NED,24.3,209534,Unknown Fearmon,,,,6-1 6-1,3,R32
4,ATP,1968-2029,Dublin,Grass,A,19680708,274,100132,Armistead Neely,,USA,21.3,209533,Harry Sheridan,,IRL,,6-2 6-4,3,R32


We can also give methods specific instructions. That's what the parentheses are for! E.g. you can ask for a specific number of rows.

In [130]:
df.head(10)


Unnamed: 0,Tour,tourney_id,tourney_name,surface,tourney_level,tourney_date,match_num,winner_id,winner_name,winner_ht,winner_ioc,winner_age,loser_id,loser_name,loser_ht,loser_ioc,loser_age,score,best_of,round
0,ATP,1968-2029,Dublin,Grass,A,19680708,270,112411,Doug Smith,,AUS,,110196,Peter Ledbetter,,IRL,24.0,6-1 7-5,3,R32
1,ATP,1968-2029,Dublin,Grass,A,19680708,271,126914,Louis Pretorius,,RSA,,209536,Maurice Pollock,,IRL,,6-1 6-1,3,R32
2,ATP,1968-2029,Dublin,Grass,A,19680708,272,209523,Cecil Pedlow,,IRL,,209535,John Mulvey,,IRL,,6-2 6-2,3,R32
3,ATP,1968-2029,Dublin,Grass,A,19680708,273,100084,Tom Okker,178.0,NED,24.3,209534,Unknown Fearmon,,,,6-1 6-1,3,R32
4,ATP,1968-2029,Dublin,Grass,A,19680708,274,100132,Armistead Neely,,USA,21.3,209533,Harry Sheridan,,IRL,,6-2 6-4,3,R32
5,ATP,1968-2029,Dublin,Grass,A,19680708,275,207073,John Mcgrath,,IRL,,209532,Brendan Kelly,,IRL,,4-6 8-6 6-2,3,R32
6,ATP,1968-2029,Dublin,Grass,A,19680708,276,109783,Bob Howe,183.0,AUS,42.9,125672,Kenneth Reid,,IRL,,6-0 9-7,3,R32
7,ATP,1968-2029,Dublin,Grass,A,19680708,277,109745,Lew Hoad,179.0,AUS,33.6,125716,Jim Buckley,,IRL,31.2,6-1 6-1,3,R32
8,ATP,1968-2029,Dublin,Grass,A,19680708,278,201749,Graydon Garner,,RSA,23.0,113401,Vivian Gotto,,IRL,46.8,6-1 6-0,3,R32
9,ATP,1968-2029,Dublin,Grass,A,19680708,279,209525,Des Foley,,IRL,,110045,Peter Mockler,,IRL,,4-6 6-4 6-0,3,R32


We can also ask pandas to display specific columns.

In [6]:
# view tourney date, tourney name, and winner name columns
df[['tourney_date', 'tourney_name', 'winner_name']]

Unnamed: 0,tourney_date,tourney_name,winner_name
0,20181231,Brisbane,Kei Nishikori
1,20181231,Brisbane,Daniil Medvedev
2,20181231,Brisbane,Kei Nishikori
3,20181231,Brisbane,Jo-Wilfried Tsonga
4,20181231,Brisbane,Daniil Medvedev
...,...,...,...
347318,19781103,Wightman Cup,Chris Evert
347319,19781103,Wightman Cup,Michelle Tyler
347320,19781103,Wightman Cup,Virginia Wade
347321,19781103,Wightman Cup,Chris Evert


You can combine different operations.
We call this 'chaining'.
The operations are executed in sequence, from left to right.
Before running it, try to picture in your head what it will do.

In [132]:
df[['tourney_date', 'tourney_name', 'winner_name']].head(5)

Unnamed: 0,tourney_date,tourney_name,winner_name
0,19680708,Dublin,Doug Smith
1,19680708,Dublin,Louis Pretorius
2,19680708,Dublin,Cecil Pedlow
3,19680708,Dublin,Tom Okker
4,19680708,Dublin,Armistead Neely
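
As a small sketch of what chaining does, the example below runs the same two operations once step by step and once chained, and checks that the results match. The toy table and the player names 'Player A' and 'Player B' are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the workshop dataset.
toy_df = pd.DataFrame({
    "tourney_name": ["Dublin", "Dublin", "Bastad", "Toronto"],
    "surface": ["Grass", "Grass", "Clay", "Hard"],
    "winner_name": ["Doug Smith", "Tom Okker", "Player A", "Player B"],
})

# Step by step: first select the columns, then take the first two rows.
subset = toy_df[["tourney_name", "winner_name"]]
first_two = subset.head(2)

# Chained: the same two operations in one line, read from left to right.
chained = toy_df[["tourney_name", "winner_name"]].head(2)

print(chained.equals(first_two))
```

Chaining saves you from naming every intermediate result, at the cost of slightly denser lines.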


Here are some other operations that you may find useful:

In [133]:
#get column names
df.columns

Index(['Tour', 'tourney_id', 'tourney_name', 'surface', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_name', 'winner_ht',
       'winner_ioc', 'winner_age', 'loser_id', 'loser_name', 'loser_ht',
       'loser_ioc', 'loser_age', 'score', 'best_of', 'round'],
      dtype='object')

In [134]:
# how many rows and columns?
df.shape


(347323, 20)
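
You may have noticed that `df.columns` and `df.shape` are written without parentheses, unlike `df.head()`. The sketch below puts the three call styles side by side on a made-up table. The term 'attribute' for the parenthesis-free form is standard Python vocabulary rather than something introduced earlier in this sheet.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy_df = pd.DataFrame({"round": ["F", "SF", "QF"], "best_of": [5, 5, 3]})

n_rows = len(toy_df)   # function: name first, object between the parentheses
top = toy_df.head(2)   # method: object, dot, method name, parentheses
dims = toy_df.shape    # attribute: object, dot, name, no parentheses

print(n_rows, dims)
```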

## Exercise 0.1

You've now seen a few ways to explore a dataframe.

In the next cell, have a go yourself.

- Can you create a subset of data by *sampling* 20 rows from a select set of *columns* of your choice? *Note: don't just select the top rows, but sample some at random.*
- Can you store this subset in a new pandas DataFrame?
- Can you check the dimensions of this new dataframe?

For this, you will need to use at least one new method.

In [135]:
df2 = df[['tourney_date', 'tourney_name', 'winner_name']].sample(20)

df2.shape

# do the same as above, but with an independent copy of the data. We'll cover this another time.
df3 = df[['tourney_date', 'tourney_name', 'winner_name']].sample(20).copy()
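
Because `sample()` picks rows at random, you will get different rows every time you run the cell. If you want a sample you can reproduce, one option is the `random_state` parameter of `sample()`. The sketch below uses a made-up table; the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical toy data: 100 rows with a running match number.
toy_df = pd.DataFrame({"match_num": range(100), "round": ["R32"] * 100})

# The same seed gives the same random rows on every run.
sample_a = toy_df.sample(5, random_state=42)
sample_b = toy_df.sample(5, random_state=42)

print(sample_a.equals(sample_b))
print(sample_a.shape)
```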

## Section 1: Diving in

Now that we are set up, we will ask you to do some exercises to answer two questions:

- Who won the most Grand Slam Tournaments between 2000 and 2019? 
- Who won the most Australian Opens in the same period? 

For this, we will have to transform some of the data from one form into another. 
We call this process *Data Wrangling*.

Let's have a go.

Here are some references to the documentation of functions that we will use: \
\
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html \
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html \
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html \
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

Read about using plotly here: \
\
https://plotly.com/python-api-reference/generated/plotly.graph_objects.Bar.html

### Exercise 1.1: Sort out the match dates

So that we can select tournaments from the right time period, we need to make sure the date of the matches is in the right format.

Find the column that stores the match dates and reformat it so that it is a date, stored as YYYY-MM-DD.

For this exercise, consider using the pandas [to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) function. 
You can also start with a google or a GPT question on how to turn a pandas column into the format YYYY-MM-DD.

In [137]:
# Inspecting a sample of 5 rows from the 'tourney_date' column
df['tourney_date'].sample(5)

115575   1998-08-03
262218   1993-06-21
94179    1993-01-04
294612   2004-03-10
32404    1976-01-26
Name: tourney_date, dtype: datetime64[ns]

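In the raw data, the date is stored as a whole number in the form YYYYMMDD (for example `19680708`), which is difficult to read and cannot be used directly for date comparisons. The output above already shows converted dates because this cell was re-run after the conversion below.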

Let's change the format of this column using the 'to_datetime' function in pandas. See https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html for documentation. 

In [136]:
# Changing the date format
df['tourney_date'] = pd.to_datetime(df['tourney_date'], format='%Y%m%d')

Let's check that it has worked as intended.

In [138]:
# Checking the new 'tourney_date' column
df['tourney_date'].sample(5)

68725    1985-12-20
264606   1993-04-22
37514    1977-05-23
150992   2009-02-16
189180   2023-01-16
Name: tourney_date, dtype: datetime64[ns]

This looks good! 
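
To make the role of the `format` argument concrete, here is a minimal sketch on three made-up values in the same YYYYMMDD form. Converting the numbers to text first is a choice made here to keep the parsing explicit; the names `raw` and `dates` are for illustration only.

```python
import pandas as pd

# Hypothetical raw values in the YYYYMMDD form used by the dataset.
raw = pd.Series([19680708, 20230116, 20090216], name="tourney_date")

# %Y = four-digit year, %m = two-digit month, %d = two-digit day.
dates = pd.to_datetime(raw.astype(str), format="%Y%m%d")

# Once the column holds real dates, the .dt accessor exposes their parts.
print(dates.dt.year.tolist())
print(dates.dt.strftime("%Y-%m-%d").tolist())
```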

### Exercise 1.2: Identify finals matches from Grand Slam Tournaments

We only want to include final matches from grand slam tournaments in the dataframe, as they will tell us who won the relevant tournaments.
For this we need to work out where we get that information from.

- First work out which column will tell us what round a match was part of, and how you might identify final rounds.
- Then do the same for tournaments, find the column that tells us the level of a tournament, and how we can identify grand slam tournaments.

From the column names we can derive that 'round' may be the relevant column. Let's work out which matches are finals:

In [14]:
df['round'].value_counts()

round
R32     110623
R16      62374
R64      55736
QF       32921
RR       29925
R128     28956
SF       17355
F         9124
BR         277
ER          32
Name: count, dtype: int64

A good guess here would be 'F' for final. You can check this by picking a specific row with 'F' in the 'round' column and verifying online that the match details are correct. 

Let's find the rows that correspond to grand slams.

In [15]:
# Inspecting the 'tourney_level' column
df['tourney_level'].value_counts()


tourney_level
A     126189
W      70090
G      51766
D      26625
M      23608
I      14054
P       9903
T1      4778
T3      4562
T2      4329
PM      3663
T4      2912
CC      1913
F       1067
T5       968
O        608
E        282
J          6
Name: count, dtype: int64

It is unclear which code corresponds to grand slams. Let's have a look at the tournament name and level for some rows. 

In [139]:
df[['tourney_name', 'tourney_level']].sample(20)

Unnamed: 0,tourney_name,tourney_level
246941,Bastad,W
45400,Sydney Indoor,A
204853,Toronto,W
46477,Baltimore WCT,A
218671,Oldsmar,W
42224,Sarasota,A
110566,Davis Cup WG QF: USA vs NED,D
104481,Kitzbuhel,A
17179,South Orange,A
958,Davis Cup EUR QF: BEL vs TCH,D


It looks like 'G' is the level for grand slams. Let's check that all grand slams are included in this level.

In [141]:
df[df['tourney_level'] == 'G']['tourney_name'].value_counts()

tourney_name
Wimbledon                   13490
Roland Garros               13304
US Open                     12382
Australian Open             11231
Us Open                      1143
Australian Open-2              63
Australian Chps.               61
Australian Championships       61
Australian Open 2              31
Name: count, dtype: int64

This looks good, except for the fact that there are multiple formats for some of the tournaments, which may cause issues when analysing the Australian Open tournaments. 

Let's rename some of them so that each grand slam only has one format. 

In [142]:
# Setting up a dictionary of desired replacements
replacements = {
    'Us Open' : 'US Open',
    'Australian Open-2' : 'Australian Open',
    'Australian Chps.' : 'Australian Open',
    'Australian Open 2' : 'Australian Open',
    'Australian Championships' : 'Australian Open'
}

# Doing the replacements
df['tourney_name'] = df['tourney_name'].replace(replacements)

# Checking the replacements
df[df['tourney_level'] == 'G']['tourney_name'].value_counts()

tourney_name
US Open            13525
Wimbledon          13490
Roland Garros      13304
Australian Open    11447
Name: count, dtype: int64

Great! This has worked and we are ready to make our filtered dataframe.
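
As a smaller illustration of how `replace` works with a dictionary, the sketch below cleans up a made-up list of four names. Each dictionary key is matched against the whole cell value, so values that are not listed, such as 'Wimbledon', are left alone.

```python
import pandas as pd

# Hypothetical list of tournament names with two inconsistent spellings.
names = pd.Series(["US Open", "Us Open", "Wimbledon", "Australian Chps."])

# Keys are the values to look for, values are what to put in their place.
cleaned = names.replace({
    "Us Open": "US Open",
    "Australian Chps.": "Australian Open",
})

print(cleaned.value_counts())
```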

### Exercise 1.3: Filter the data 

Next we want to create a dataset that only contains the data we are interested in. 
This is known as filtering.

As a reminder the data should only contain:
- Data from grand slam tournaments
- The final rounds of those tournaments
- Only between 2000 and 2019

First let's filter to only include grand slam finals:

In [149]:
# Making a dataframe that only includes grandslams
gs_df = df[(df['tourney_level'] == 'G')] 

# only include final rounds
gs_df = gs_df[(gs_df['round'] == 'F')] 

Next we filter the dataframe to only include matches that happened between 2000 and 2019:

In [150]:
gs_df = gs_df[(gs_df['tourney_date'].dt.year >= 2000) & (gs_df['tourney_date'].dt.year < 2020)]
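
The filters above all follow the same pattern: build a column of True/False values, then use it to select rows. The sketch below spells this out on a made-up table and combines the three conditions with `&`. Using `between` for the year range is an alternative chosen here for brevity; it includes both end points.

```python
import pandas as pd

# Hypothetical toy data with one row that satisfies all three conditions.
toy_df = pd.DataFrame({
    "tourney_level": ["G", "G", "A", "G"],
    "round": ["F", "SF", "F", "F"],
    "tourney_date": pd.to_datetime(
        ["2005-01-17", "2005-01-17", "2010-06-21", "2021-08-30"]
    ),
})

# Each comparison produces one True/False value per row.
is_slam = toy_df["tourney_level"] == "G"
is_final = toy_df["round"] == "F"
in_period = toy_df["tourney_date"].dt.year.between(2000, 2019)

# & keeps only the rows where all three conditions are True.
filtered = toy_df[is_slam & is_final & in_period]

print(filtered)
```

When you write the conditions inline rather than naming them first, each one needs its own parentheses, as in the cell above.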

### Exercise 1.4 Find who won the most grand slam tournaments between 2000 and 2019. Plot a bar chart with this information.


In [151]:
gs_top_5 = gs_df['winner_name'].value_counts().head(5)
gs_top_5

winner_name
Serena Williams    22
Roger Federer      20
Rafael Nadal       19
Novak Djokovic     16
Justine Henin       7
Name: count, dtype: int64

Now let's extract the names and number of wins for each player.

In [152]:
gs_top_players = list(gs_top_5.index)
gs_top_nums = gs_top_5.values
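
The reason this works is that `value_counts()` returns a pandas Series in which the labels (the index) are the distinct values and the data are the counts, sorted from most to least frequent. A minimal sketch with made-up winners:

```python
import pandas as pd

# Hypothetical winners of six matches.
winners = pd.Series(["Player A", "Player B", "Player A",
                     "Player C", "Player A", "Player B"])

counts = winners.value_counts()

labels = list(counts.index)     # the distinct names, most frequent first
numbers = list(counts.values)   # how often each name occurs

print(labels)
print(numbers)
```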

In [153]:
fig = go.Figure([go.Bar(x=gs_top_players, y=gs_top_nums)])

fig.update_layout(
    title='Grand Slam Wins by Top 5 Tennis Players',
    xaxis_title='Players',
    yaxis_title='Grand Slam Wins between 2000-2019',
    template='plotly_dark'  
)

fig.show()

### Exercise 1.5: Make a bar chart showing which 5 players won the most Australian Opens between 2000 and 2019.

The solution is very similar. Note that we have to add the condition that the tournament name is `Australian Open`. Here, we just filter inline.

In [156]:
# Making the grouped series
most_aus = gs_df[(gs_df['tourney_name'] == 'Australian Open')].value_counts('winner_name').head(5)

# Getting the players and number of wins 
players_aus = list(most_aus.index)
num_aus = most_aus.values

# Plotting the bar chart
fig = go.Figure([go.Bar(x=players_aus, y=num_aus)])

fig.update_layout(
    title='Australian Open Wins by Top 5 Tennis Players',
    xaxis_title='Players',
    yaxis_title='Australian Open Wins between 2000-2019',
    template='plotly_dark'  
)

fig.show()

Serena Williams comes out on top again! It's interesting to note that while Federer won more grand slams than Djokovic, he won fewer Australian Opens. It could be interesting to investigate what factors might cause this, for example court type. 

### Exercise 1.6: How young were the top winners when they won a grand slam for the first time?

Recall the list of top players (by number of grand slams won between 2000 and 2019), and the number of grand slams they won in that period.

In [157]:
print(gs_top_players)
print(gs_top_nums)

['Serena Williams', 'Roger Federer', 'Rafael Nadal', 'Novak Djokovic', 'Justine Henin']
[22 20 19 16  7]


Let's create a dataset with only the top 5.

In [163]:
gs_df_top_5 = gs_df[gs_df['winner_name'].isin(gs_top_players)]

With the `groupby` method, we can perform operations on subgroups of the full data.
For example, if we wanted to know the minimum age for each player separately, we could do:

In [165]:
gs_df_top_5.groupby('winner_name')['winner_age'].min()

winner_name
Justine Henin      20.9
Novak Djokovic     20.6
Rafael Nadal       18.9
Roger Federer      21.8
Serena Williams    20.6
Name: winner_age, dtype: float64

*Fancier solution*  
We can also take a slightly more advanced approach, by first getting the index of the row of each player's youngest win.
We can then use those indexes to retrieve the relevant rows from the dataframe and present some nice additional information.

In [166]:
first_win_indices = gs_df_top_5.groupby('winner_name')['winner_age'].idxmin()


Now we fetch the rows from the `gs_df` dataframe corresponding to these indices:

In [167]:
first_win_dates = gs_df.loc[first_win_indices]

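Now we can select the desired columns and see when the players' first grand slam wins were. Note that `gs_df` only covers 2000 to 2019, so these are the youngest wins within that period, not necessarily each player's first ever grand slam: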

In [168]:
first_win_dates[['winner_name', 'tourney_date', 'winner_age']]

Unnamed: 0,winner_name,tourney_date,winner_age
291310,Justine Henin,2003-05-26,20.9
148716,Novak Djokovic,2008-01-14,20.6
138696,Rafael Nadal,2005-05-23,18.9
132331,Roger Federer,2003-06-23,21.8
288214,Serena Williams,2002-05-27,20.6
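
The difference between `min()` and `idxmin()` is easiest to see on a small table. In the sketch below, which uses made-up players, ages, and row labels, `min()` returns the youngest age per player, while `idxmin()` returns the row label where that age occurs, which `.loc` can then use to fetch the full row.

```python
import pandas as pd

# Hypothetical toy data with explicit row labels 10 to 13.
toy_df = pd.DataFrame(
    {
        "winner_name": ["Player A", "Player B", "Player A", "Player B"],
        "winner_age": [22.1, 19.5, 20.3, 24.0],
        "tourney_name": ["Wimbledon", "US Open", "Roland Garros", "Wimbledon"],
    },
    index=[10, 11, 12, 13],
)

# The youngest age per player: only the value itself.
youngest_age = toy_df.groupby("winner_name")["winner_age"].min()

# The row label of the youngest win per player.
youngest_idx = toy_df.groupby("winner_name")["winner_age"].idxmin()

# Use those labels to look up the complete rows.
youngest_rows = toy_df.loc[youngest_idx]

print(youngest_rows)
```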


## Section 2: BONUS: Analysing Streaks
Through this section, we will be analysing grand slam winning streaks from the men's dataframe. (You can try it for yourself on the women's data later).

Useful functions: 

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html \
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eq.html \
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cumsum.html \
https://pandas.pydata.org/docs/reference/api/pandas.Index.get_level_values.html

## Question 1: Which Male Players had the Longest Grand Slam Winning Streaks?

### Exercise 3.1: Filter the grand slams dataframe to only include male players

In [169]:
# Keeping only the men's (ATP) matches from the grand slams dataframe
gs_df_men = gs_df[gs_df['Tour'] == 'ATP']

### Exercise 3.1: Shift the 'winner_name' column in the men's dataframe down by one using the shift function.

Let's see what the shift function does to a column. First we check what the column looks like before applying the shift function.

In [170]:
# Checking the 'winner_name' column before applying the shift function
gs_df_men['winner_name']

122346    Gustavo Kuerten
122473       Pete Sampras
122600        Marat Safin
122789       Andre Agassi
125724    Gustavo Kuerten
               ...       
178509     Novak Djokovic
179155     Novak Djokovic
180347       Rafael Nadal
180644     Novak Djokovic
181195       Rafael Nadal
Name: winner_name, Length: 80, dtype: object

In [171]:
# Shifting the 'winner_name' column down by 1
gs_df_men['winner_name'].shift()

122346               None
122473    Gustavo Kuerten
122600       Pete Sampras
122789        Marat Safin
125724       Andre Agassi
               ...       
178509     Novak Djokovic
179155     Novak Djokovic
180347     Novak Djokovic
180644       Rafael Nadal
181195     Novak Djokovic
Name: winner_name, Length: 80, dtype: object

It seems to shift the whole column down by one. Notice that the index remains the same. This is important for comparing the shifted column to the original. 

Now let's see what the eq function does. 

### Exercise 3.2: Use the `eq` function to compare the shifted 'winner_name' to the original. What is the result? 

In [172]:
gs_df_men['winner_name'].eq(gs_df_men['winner_name'].shift())

122346    False
122473    False
122600    False
122789    False
125724    False
          ...  
178509     True
179155     True
180347    False
180644    False
181195    False
Name: winner_name, Length: 80, dtype: bool

The result is a boolean array indicating when the two columns are the same (True), and when they are not (False).
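
Here is the same comparison on a made-up sequence of four winners, small enough to check by hand. The row labels 101 to 104 are arbitrary and only there to show that the index is preserved.

```python
import pandas as pd

# Hypothetical sequence of winners, in date order.
winners = pd.Series(["A", "A", "B", "A"], index=[101, 102, 103, 104])

# Every value moves down one row; the first row has no predecessor.
previous = winners.shift()

# True where a winner is the same as the winner in the row before.
same_as_previous = winners.eq(previous)

print(same_as_previous.tolist())
```

The first row is always False, because there is no earlier row to compare it with.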

### Exercise 3.3: How can we use these functions to determine grand slam winning streaks for each player?

Hint: you might find it helpful to use the `cumsum` function, with which you can sum values in a boolean array (https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.cumsum.html). 

Hint: also see https://joshdevlin.com/blog/calculate-streaks-in-pandas/#:~:text=The%20first%20step%20in%20calculating,us%20which%20are%20not%20equal for a blog post on how to find streaks in a dataframe. 

#### Solution:

We will solve this problem by first adding a `streak_indicator_bool` column, a boolean (as above) that indicates when the winning player is the same as in the shifted `winner_name` column. Then we will negate this column, so that `True` marks the start of a new streak and `False` indicates that the `winner_name` is the same as the previous `winner_name`. We can now apply the `cumsum` function and create a new column, `streak_indicator_num`, whose entries are numbers that increment by 1 every time a new streak starts. 
\
\
Finally we will group the dataframe by `winner_name` and `streak_indicator_num` (in that order), aggregating using the `size` function, to see how large each group (i.e. each streak) is.
\
\
After sorting these values, we can pick the top 5 to get the top 5 streaks and the players associated with them.
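
The steps described above can be sketched on a made-up sequence of six winners. The column name `streak_id` and the single-letter player names are placeholders for illustration; the full solution on the real data follows in the next exercise.

```python
import pandas as pd

# Hypothetical sequence of winners, already sorted by date.
toy = pd.DataFrame({"winner_name": ["A", "A", "B", "A", "A", "A"]})

# True where the winner is the same as in the previous row.
same_as_previous = toy["winner_name"].eq(toy["winner_name"].shift())

# Negating gives True at the start of each streak; the running total
# then assigns every streak its own number.
toy["streak_id"] = (~same_as_previous).cumsum()

# Counting the rows per (player, streak) pair gives the streak lengths.
streaks = (
    toy.groupby(["winner_name", "streak_id"])
    .size()
    .sort_values(ascending=False)
)

print(toy["streak_id"].tolist())
print(streaks)
```

Grouping by the streak number as well as the name is what keeps player A's first streak of 2 separate from the later streak of 3.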

### Exercise 3.4: Apply your method to find the players with the 5 longest streaks, and the length of their streaks

In [173]:
# Sorting the rows by 'tourney_date'
gs_df_men = gs_df_men.sort_values(by='tourney_date')

In [176]:
# Creating a streak indicator column in the men's dataframe
gs_df_men['streak_indicator_bool'] = gs_df_men['winner_name'].eq(gs_df_men['winner_name'].shift())

# Creating a streak indicator column that contains numbers that indicate different streaks
gs_df_men['streak_indicator_num'] = (~gs_df_men['streak_indicator_bool']).cumsum()

# Grouping the dataframe by 'winner_name' and 'streak_indicator_num' 
streaks_men = gs_df_men.groupby(['winner_name', 'streak_indicator_num']).size()

# Sorting the streaks object to find 5 longest streaks 
highest_streaks = streaks_men.sort_values(ascending=False)

# Getting the players who got the longest streaks
players_with_highest_streak = list(highest_streaks.index.get_level_values('winner_name'))

highest_streaks = list(highest_streaks.values)

# Making a dictionary of the players with their streaks
unique_highest_streaks = {
    'player' : players_with_highest_streak,
    'streak' : highest_streaks
}

# Making the dictionary into a dataframe
unique_highest_streaks_df = pd.DataFrame(unique_highest_streaks)

# Dropping rows that have the same pair of entries in the 'player' and 'streak' column 
unique_highest_streaks_df.drop_duplicates(subset=['player', 'streak'], inplace=True)

# Getting the top 5 players and streaks
top_5_players = list(unique_highest_streaks_df['player'].head(5))
top_5_streaks = list(unique_highest_streaks_df['streak'].head(5))

# Printing the results
print(f'Players with highest streaks are {top_5_players}, with streak(s) of {top_5_streaks}, respectively.')

Players with highest streaks are ['Novak Djokovic', 'Novak Djokovic', 'Rafael Nadal', 'Roger Federer', 'Roger Federer'], with streak(s) of [4, 3, 3, 3, 2], respectively.
