# Intro to Python - Worksheet

## Section 0: Setting up

Python is great.
But it's a little bare bones.
We need to import some libraries if we want to do data science.

Libraries?
Consider Python as your workbench.
Libraries are like little lovingly crafted bags (independent shop vibes) with specific tools that allow you to create beautiful things on your workbench. 
Typically, you need to install libraries on your machine, but as we're using Colab, they come pre-installed.
We will cover installation of libraries in a separate session. 
R users will recognise the concept of libraries but might think of them as 'packages'.

A very commonly used library in data science is [pandas](https://pandas.pydata.org/). 
It provides easy-to-use data structures and data analysis tools for the Python programming language.
We will use it a lot in this workshop.

Other libraries often used are:

- numpy: helps with doing maths on collections of numbers (like a matrix).
- scikit-learn & statsmodels: fit machine learning and statistical models.
- matplotlib, seaborn, and plotly are libraries used for data viz.

For this workshop we will at least need pandas and a specific module that comes with plotly, named graph_objects. 
We import them in the following cell.

Put your cursor in the cell and press `ctrl+enter` to execute it.

In [None]:
import pandas as pd
import plotly.graph_objects as go

> Note: moving forward we will call a lot of the tools (called methods and functions) that come with pandas. 
Because we don't want to type 'pandas' every time, we say 'as pd'. 
It means that we can now refer to pandas as pd, which saves a bit of typing. 
It is not essential.
You can see we've done the same for plotly.graph_objects.
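
As a small sketch of what the alias does: after both imports below, `pd` and `pandas` are simply two names for the same library, so either can be used to reach the same tools. The toy numbers are made up for illustration.

```python
import pandas
import pandas as pd

# Both names point at the very same module object
same_module = pd is pandas

# So these two calls build identical objects
a = pandas.Series([1, 2, 3])
b = pd.Series([1, 2, 3])

print(same_module, a.equals(b))
```

In practice almost everyone writes `pd`, so sticking with it makes examples you find online easier to follow.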

As a next step, we will import the data we'll be playing around with today.
Again, put your cursor in the cell, and press `ctrl+enter`.


In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/BobbyGlennS/bit-python-intro/main/data/combined_matches.csv')

Let's dissect this piece of code a little bit:

`df`    
We don't just want to load the data, we also need to tell Python to store it somewhere. We are creating a variable named `df`. It will serve as a pointer to the data that we are about to load into memory. This way Python knows to keep hold of it, and lets us bring it back up every time we type `df`.


`=`  
We're telling python that this new variable should point to the result of the code that follows the equals sign (`=`).  

`pd.read_csv`  
Use the function read_csv from the pandas library (which we said we would refer to as pd).  

`('https://....')`  
Between the parentheses we provide an 'argument'. It's an instruction to the function. In this case the instruction is: 'go here to retrieve the file'.
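
To see the same four pieces (variable, `=`, function, argument) at a smaller scale, here is a minimal sketch that reads a tiny made-up CSV from memory instead of from a URL. The names `csv_text` and `toy_df` and the data itself are invented for illustration.

```python
import io

import pandas as pd

# A tiny made-up CSV, held in memory as text
csv_text = "player,wins\nAda,3\nGrace,5\n"

# variable = function(argument)
# read_csv accepts file-like objects as well as file paths and URLs
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df)
```

The first line of the CSV becomes the column names, and every following line becomes a row.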

Great! 
If all went well we now have a dataset loaded.
It's stored in Python's memory in an object called a pandas DataFrame, which we can retrieve by typing `df`.

We can start playing.
Let's first have a little look.
Let's print the dataframe with a function named `print()`.

Like with `read_csv()`, we write the function and between the parentheses we provide an argument that contains an instruction for the function to work with. In this case the instruction is to print whatever we provide as an argument.

In [None]:
print(df)


We can also use the *method* `head()` on a pandas DataFrame to get the first few rows, which looks slightly nicer.

The technical distinction between methods and functions is something for another time. But what is useful to know is that methods are like little appendices that come with certain objects. A pandas DataFrame has a set of methods that you can use on it. For example to explore the dataframe, which is what we will be doing in the next few code cells.

We call them slightly differently than functions: we write the object we want to use a method with, and then after the dot we write the name of the method.
See below.


In [None]:
df.head()

We can also give methods specific instructions. That's what the parentheses are for! E.g. you can ask for a specific number of rows.

In [None]:
df.head(10)
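
The difference in how functions and methods are called can be sketched on a small made-up dataframe (the name `toy_df` and its contents are invented for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "player": ["Ada", "Grace", "Alan", "Edsger"],
    "wins": [3, 5, 2, 4],
})

# Function: the object goes inside the parentheses
n_rows = len(toy_df)

# Method: the object comes first, then a dot, then the method name
first_two = toy_df.head(2)   # the argument 2 asks for two rows
last_one = toy_df.tail(1)    # tail() is the mirror image of head()

print(n_rows)
print(first_two)
print(last_one)
```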


Leaving methods behind now for a second.

We can also ask pandas to display specific columns.

In [None]:
# view tourney date, tourney name, and winner name columns
df[['tourney_date', 'tourney_name', 'winner_name']]

You can combine different operations.
We call this 'chaining'.
This code is executed sequentially.

In [None]:
df[['tourney_date', 'tourney_name', 'winner_name']].head(5)
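
A chained expression gives the same result as doing the steps one at a time and storing each intermediate result. A minimal sketch on made-up data (all names here are invented for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "player": ["Ada", "Grace", "Alan"],
    "wins": [3, 5, 2],
    "country": ["UK", "US", "UK"],
})

# Chained: select two columns, then take the first two rows
chained = toy_df[["player", "wins"]].head(2)

# Step by step: the same two operations with an intermediate variable
selected = toy_df[["player", "wins"]]
stepwise = selected.head(2)

print(chained.equals(stepwise))
```

Chaining saves you from inventing names for intermediate results, at the cost of slightly longer lines.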

Here are some other operations that you may find useful:

In [None]:
#get column names
df.columns

In [None]:
# how many rows and columns?
df.shape


### Exercise 0.1

You've now seen a few ways to explore a dataframe.

In the next cell, have a go yourself.

- Can you create a subset of data by *sampling* 20 rows from a select set of *columns* of your choice? *Note: don't just select the top 5 rows, but sample some at random.*
- Can you store this subset in a new pandas DataFrame, named `df2`?
- Can you check the dimensions of this new dataframe?

For this, you will need to use at least one new method.

## Section 1: Diving in

Now that we are set up, we will ask you to do some exercises to answer two questions:

- Who won the most Grand Slam Tournaments between 2000 and 2019? 
- Who won the most Australian Opens in the same period? 

For this, we will have to transform some of the data from one form into another. 
We call this process *Data Wrangling*.

Let's have a go.

Useful functions: \
\
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html \
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html \
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html \
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

Read about using plotly here: \
\
https://plotly.com/python-api-reference/generated/plotly.graph_objects.Bar.html

### Exercise 1.1: Sort out the match dates

So that we can select tournaments from the right time period, we need to make sure the date of the matches is in the right format.

Find the column that stores the match dates and reformat it so that it is a date, stored as YYYY-MM-DD.

For this exercise, consider using the pandas function [`to_datetime`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html). 
You can also start with a google or a GPT question on how to turn a pandas column into the format YYYY-MM-DD.
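
As a hint, here is a minimal sketch of `pd.to_datetime` on a made-up column. It assumes the dates arrive as numbers such as `20000117`; the column name `match_day` is invented, and the real column in the dataset may be stored differently, so check it first.

```python
import pandas as pd

# Made-up data: dates stored as plain numbers (an assumption for this sketch)
toy = pd.DataFrame({"match_day": [20000117, 20190909]})

# Turn the numbers into text, then tell pandas how that text is laid out:
# %Y = four-digit year, %m = month, %d = day
toy["match_day"] = pd.to_datetime(toy["match_day"].astype(str), format="%Y%m%d")

print(toy)
print(toy["match_day"].dt.year)
```

Once a column holds real dates, the `.dt` accessor gives access to parts such as the year, which comes in handy when filtering on a time period.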


### Exercise 1.2: Identify finals matches from Grand Slam Tournaments

We only want to include final matches from grand slam tournaments in the dataframe, as they will tell us who won the relevant tournaments.
For this we need to work out where we get that information from.

- First work out which column will tell us what round a match was part of, and how you might identify final rounds.
- Then do the same for tournaments, find the column that tells us the level of a tournament, and how we can identify grand slam tournaments.

During your exploration, you may have noticed that Australian Open and US Open tournaments are recorded in a variety of ways in the dataset.
Inconsistencies in data are something that often happens!
Below is included a code cell that takes care of this.
Henry or Bobby can explain to you how this works.


In [None]:
# Setting up a dictionary of desired replacements
replacements = {
    'Us Open' : 'US Open',
    'Australian Open-2' : 'Australian Open',
    'Australian Chps.' : 'Australian Open',
    'Australian Open 2' : 'Australian Open',
    'Australian Championships' : 'Australian Open'
}

# Doing the replacements
df['tourney_name'] = df['tourney_name'].replace(replacements)

# Checking the replacements
df[df['tourney_level'] == 'G']['tourney_name'].value_counts()

Great! This has worked and we are ready to make our filtered dataframe.
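
To see how the replacement cell works on a smaller scale, here is a sketch on a handful of made-up tournament names. The dictionary maps each unwanted spelling (left) to the spelling we want (right); anything not listed is left alone.

```python
import pandas as pd

# Made-up column with inconsistent spellings
names = pd.Series(["Us Open", "US Open", "Wimbledon", "Australian Chps."])

# old spelling : new spelling
replacements = {
    "Us Open": "US Open",
    "Australian Chps.": "Australian Open",
}

# replace() returns a new Series; the original is not changed
cleaned = names.replace(replacements)

# value_counts() tallies how often each name now occurs
counts = cleaned.value_counts()

print(cleaned.tolist())
print(counts)
```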

### Exercise 1.4: Filter the data 

Next we want to create a dataset that only contains the data we are interested in. 
This is known as filtering.

As a reminder the data should only contain:
- Data from grand slam tournaments
- The final rounds of those tournaments
- Only between 2000 and 2019
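
One common way to filter is to build a True/False condition per requirement and combine them with `&` ("and"). The sketch below uses made-up data: the column names and the code `'F'` for a final are assumptions for illustration, so check what the real dataset uses.

```python
import pandas as pd

toy = pd.DataFrame({
    "level": ["G", "A", "G", "G"],
    "round": ["F", "F", "SF", "F"],
    "date": pd.to_datetime(["2001-01-28", "2005-06-05", "2010-07-04", "2021-09-12"]),
})

# One condition per requirement; each is a column of True/False values
is_slam = toy["level"] == "G"
is_final = toy["round"] == "F"
in_period = toy["date"].dt.year.between(2000, 2019)  # both ends included

# Keep only the rows where all three conditions are True
filtered = toy[is_slam & is_final & in_period]

print(filtered)
```

Giving each condition its own name keeps the final line readable and makes it easy to check the conditions one at a time.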


### Exercise 1.5: Find who won the most grand slam tournaments between 2000 and 2019. Plot a bar chart with this information.

### Exercise 1.6: Make a bar chart showing which 5 players won the most Australian Opens between 2000 and 2019.

The solution is very similar. Note that we have to add the condition that the tournament name is `Australian Open`.

### Exercise 1.7: How young were the top winners when they won a grand slam for the first time?

## Section 2: BONUS: Analysing Streaks

Through this section, we will be analysing grand slam winning streaks from the men's dataframe.

Useful functions: 

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html \
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eq.html \
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cumsum.html \
https://pandas.pydata.org/docs/reference/api/pandas.Index.get_level_values.html

## Question 1: Which Male Players had the Longest Grand Slam Winning Streaks?

### Exercise 3.1: Filter the grand slams dataframe to only include male players

In [None]:
# Keeping only the men's (ATP) matches; .copy() gives us an independent dataframe that we can safely add columns to later
gs_df_men = gs_df[gs_df['Tour'] == 'ATP'].copy()

### Exercise 3.1: Shift the 'winner_name' column in the men's dataframe down by one using the shift function.

Let's see what the shift function does to a column. First we check what the column looks like before applying the shift function.

In [None]:
# Checking the 'winner_name' column before applying the shift function
gs_df_men['winner_name']

In [None]:
# Shifting the 'winner_name' column down by 1
gs_df_men['winner_name'].shift()

It seems to shift the whole column down by one. Notice that the index remains the same. This is important for comparing the shifted row to the original. 
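
The effect is easier to see with the original and the shifted column side by side. A minimal sketch on made-up winners (the names are invented for illustration):

```python
import pandas as pd

winners = pd.Series(["Ada", "Ada", "Grace", "Ada"])

# shift() moves every value down one row; the first row has no predecessor
shifted = winners.shift()

side_by_side = pd.DataFrame({"winner": winners, "previous_winner": shifted})
print(side_by_side)
```

The first entry of the shifted column is a missing value (`NaN`), because there is nothing above the first row to move down.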

Now let's see what the eq function does. 

### Exercise 3.2: Use the `eq` function to compare the shifted 'winner_name' to the original. What is the result? 

In [None]:
gs_df_men['winner_name'].eq(gs_df_men['winner_name'].shift())

The result is a boolean array indicating when the two columns are the same (True), and when they are not (False).

### Exercise 3.3: How can we use these functions to determine grand slam winning streaks for each player?

Hint: you might find it helpful to use the `cumsum` function, with which you can sum values in a boolean array (https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.cumsum.html). 

Hint: also see https://joshdevlin.com/blog/calculate-streaks-in-pandas/#:~:text=The%20first%20step%20in%20calculating,us%20which%20are%20not%20equal for a blog post on how to find streaks in a dataframe. 

#### Solution:

We will solve this problem by first adding a `streak_indicator` column, that will be a boolean (as above) that indicates when the winning player is the same as in the shifted `winner_name` column. Then we will negate this column so that `False` indicates that the `winner_name` is the same as the previous `winner_name`, and `True` marks the start of a new streak. We can now apply the `cumsum` function and create a new streak indicator column whose entries will be numbers that increment by 1 every time a new streak is started. 
\
\
Finally we will group the dataframe by `winner_name` and `streak_indicator_num` (in that order), aggregating using the `size` function, to see how large each group (i.e. each streak) is.
\
\
After sorting these values, we can pick the top 5 to get the top 5 streaks and the players associated with them.
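
Before applying this to the real data, the steps above can be traced on a tiny made-up list of winners. This is a sketch under the assumption that the rows are already in date order; the names and the column `streak_id` are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"winner_name": ["Ada", "Ada", "Grace", "Ada", "Ada", "Ada"]})

# True where the winner is the same as in the row above
same_as_previous = toy["winner_name"].eq(toy["winner_name"].shift())

# Negate: True now marks the first row of each new streak.
# cumsum counts True as 1, so the number goes up by one at every new streak.
toy["streak_id"] = (~same_as_previous).cumsum()

# Count the rows in each (player, streak) group and sort longest first
streak_lengths = (
    toy.groupby(["winner_name", "streak_id"])
    .size()
    .sort_values(ascending=False)
)

print(toy)
print(streak_lengths)
```

Note that Ada appears with two separate streak numbers: grouping by both the name and the streak number is what keeps her two runs of wins apart.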

### Exercise 3.4: Apply your method to find the players with the 5 longest streaks, and the length of their streaks

In [None]:
# Sorting the rows by 'tourney_date'
gs_df_men = gs_df_men.sort_values(by='tourney_date')

In [None]:
# Creating a streak indicator column in the men's dataframe
gs_df_men['streak_indicator_bool'] = gs_df_men['winner_name'].eq(gs_df_men['winner_name'].shift())

# Creating a streak indicator column that contains numbers that indicate different streaks
gs_df_men['streak_indicator_num'] = (~gs_df_men['streak_indicator_bool']).cumsum()

# Grouping the dataframe by 'winner_name' and 'streak_indicator_num' 
streaks_men = gs_df_men.groupby(['winner_name', 'streak_indicator_num']).size()

# Sorting the streaks object to find 5 longest streaks 
highest_streaks = streaks_men.sort_values(ascending=False)

# Getting the players who got the longest streaks
players_with_highest_streak = list(highest_streaks.index.get_level_values('winner_name'))

highest_streaks = list(highest_streaks.values)

# Making a dictionary of the players with their streaks
unique_highest_streaks = {
    'player' : players_with_highest_streak,
    'streak' : highest_streaks
}

# Making the dictionary into a dataframe
unique_highest_streaks_df = pd.DataFrame(unique_highest_streaks)

# Dropping rows that have the same pair of entries in the 'player' and 'streak' column 
unique_highest_streaks_df.drop_duplicates(subset=['player', 'streak'], inplace=True)

# Getting the top 5 players and streaks
top_5_players = list(unique_highest_streaks_df['player'].head(5))
top_5_streaks = list(unique_highest_streaks_df['streak'].head(5))

# Printing the results
print(f'Players with highest streaks are {top_5_players}, with streak(s) of {top_5_streaks}, respectively.')