# Intro to Python - Worksheet

## Section 0: Setting up

Python is great.
But it's a little bare bones if you use it in its original form.
We need to import some libraries if we want to do data science.

*Libraries?*  
Consider Python as a workbench that came with a few basic tools.
You could build anything, but it would be hard work.
Libraries are like lovingly crafted little kits with specific tools that allow you to create beautiful things on your workbench and significantly speed up your work. 

Typically, you need to install libraries on your machine, but as we're using Colab, the ones we need today come pre-installed.
We only need to tell Python we want to use them, which is what we mean by 'importing'.
To stick with the analogy: we already bought our special tools (the 'installation'), but we need to take them out of the storage cabinet to use them ('importing').
We will cover installation of libraries in a separate session. 


*Note*. R users will recognise the concept of libraries but might think of them as 'packages'.

A very commonly used library in data science is [pandas](https://pandas.pydata.org/). 
It provides easy-to-use data structures and data analysis tools for the Python programming language.
We will use it a lot in this workshop.

Other libraries often used are:

- numpy helps with doing maths on collections of numbers (like a matrix).
- scikit-learn & statsmodels fit machine learning and statistical models.
- matplotlib, seaborn, and plotly are libraries used for data viz.

For this workshop we will at least need pandas and a specific module that comes with plotly, named graph_objects. 
We import them in the following cell.

Put your cursor in the cell and press `ctrl+enter` to execute it.

In [None]:
import pandas as pd
import plotly.graph_objects as go

We will be using pandas a lot in this session.
Because we don't want to type 'pandas' every time we use one of its tools, we say 'as pd'. 
It means that we can now refer to pandas as pd, which saves a bit of typing. 
You can see we've done the same for plotly.graph_objects.
It is not essential to do this, but it's handy.
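
As a small illustration of what the alias does (a sketch with made-up values, not part of the workshop data): `pd` is simply a second name for the same library, so both spellings behave identically.

```python
import pandas
import pandas as pd

# The alias and the full name refer to the very same library
same_library = pd is pandas
print(same_library)

# So these two calls do exactly the same thing (made-up example values)
short_way = pd.DataFrame({"player": ["A", "B"], "titles": [3, 5]})
long_way = pandas.DataFrame({"player": ["A", "B"], "titles": [3, 5]})
print(short_way.equals(long_way))
```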

As a next step, we will import the data we'll be playing around with today.
Again, put your cursor in the cell, and press `ctrl+enter`.
This is going to take a bit of time, as it's a large-ish data file.


In [None]:
# import the data
df = pd.read_csv('https://raw.githubusercontent.com/BobbyGlennS/bit-python-intro/main/data/combined_matches.csv')

Let's dissect this piece of code a little bit:

`# import the data`
This line does not actually do anything. It is a *code comment*. It's there to help the human reader understand what the code does. You can include comments in your code by preceding them with a hash `#`. This tells Python to ignore everything after the hash on that line when executing the code. It's handy when writing lots of code, or to tell an AI what type of code you want it to generate for you (more on that in another session).

`df`    
We don't just want to load the data, we also need to tell Python to store it somewhere. We are creating a variable named `df`. It will serve as a pointer to the data that we are about to load into memory. This way Python knows to keep hold of it, and lets us bring it back up every time we type `df`.

`=`  
We're telling Python that this new variable should point to the result of the code that follows the equals sign (`=`).  

`pd.read_csv`  
Use the function `read_csv` from the pandas library (which we said we would refer to as `pd`).  

`('https://....')`  
Between the parentheses we provide an 'argument'. It's an instruction to the function `read_csv()`. In this case the instruction is: 'go here to retrieve the file'. Note that we don't need to have the file on our laptop! You can point to any accessible location, including a file on your laptop but also, as is the case here, to an online location.
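
To see the same pieces at work on something small, here is a minimal sketch that needs no internet connection. The CSV text, the column names, and the variable name `toy_df` are made up for illustration; the point is only that `read_csv()` accepts different kinds of locations, including a file-like object held in memory.

```python
import io
import pandas as pd

# A tiny made-up CSV, held in memory so that no file or internet access is needed
csv_text = "player,titles\nA,3\nB,5\n"

# read_csv accepts a path, a URL, or a file-like object such as this one
toy_df = pd.read_csv(io.StringIO(csv_text))

# The name toy_df now points to the loaded data
print(toy_df)
```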

Great! 
If all went well we now have a dataset loaded.
It's stored in Python's memory in an object called a pandas DataFrame, which we can retrieve by typing `df`.

#### A dataset of tennis
Before diving in, let's first quickly say something about the data itself.

The data we will be using contains all WTA (women's) and ATP (men's) tennis data since 1968.
It was carefully assembled by Jeff Sackmann and is freely available on the internet in the form of GitHub repositories, which you can find here:

[https://github.com/JeffSackmann/tennis_atp](https://github.com/JeffSackmann/tennis_atp)  
[https://github.com/JeffSackmann/tennis_wta](https://github.com/JeffSackmann/tennis_wta)

The dataset used in this workshop has been lightly preprocessed: some separate data files were combined and some columns were left out. The values themselves have otherwise been left unaltered.

Ok, now we're all set, let's play!

First we have a little look.
Let's print the dataframe with a function named `print()`.

Like with `read_csv()`, we write the function and between the parentheses we provide an argument for the function to work with. In this case the function prints whatever we provide as an argument.

In [None]:
print(df)


We can also use the *method* `head()` on a pandas DataFrame to get the first few rows, which looks slightly nicer.

The technical distinction between methods and functions is something for another time. What is useful to know is that methods are like little appendices that come with certain objects. The pandas DataFrame object has a set of methods that you can use on it, for example to explore the dataframe, which is what we will be doing in the next few code cells.

We call them slightly differently than functions: we write the name of the object we want to use a method on, and then after that we put a dot and we write the name of the method.
See below.


In [None]:
df.head()

We can also give methods specific instructions. That's what the parentheses are for! E.g. you can ask for a specific number of rows.

In [None]:
df.head(10)
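
The difference between calling a function and calling a method is not specific to pandas. As a rough sketch using an ordinary piece of text (the variable names are just examples):

```python
name = "serena williams"

# A function: the object goes between the parentheses
n_characters = len(name)

# A method: it comes after the object and a dot
title_case = name.title()

# Methods can take arguments too, just like functions
parts = name.split(" ")

print(n_characters, title_case, parts)
```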


We can also ask pandas to display specific columns.

In [None]:
# view tourney date, tourney name, and winner name columns
df[['tourney_date', 'tourney_name', 'winner_name']]

You can combine different operations.
We call this 'chaining'.
The operations are executed from left to right: first the column selection, then `head(5)` on its result.

In [None]:
df[['tourney_date', 'tourney_name', 'winner_name']].head(5)
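
To show that chaining is only a shorthand, the sketch below does the same thing twice on a small made-up DataFrame: once step by step with an intermediate variable, and once chained. The data and the names `toy_df`, `selected`, `step_by_step`, and `chained` are invented for this example.

```python
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    "tourney_name": ["Open A", "Open B", "Open C"],
    "winner_name": ["Player 1", "Player 2", "Player 1"],
    "minutes": [95, 120, 80],
})

# Step by step: store the intermediate result, then call the method on it
selected = toy_df[["tourney_name", "winner_name"]]
step_by_step = selected.head(2)

# Chained: the same two operations in one line, read from left to right
chained = toy_df[["tourney_name", "winner_name"]].head(2)

# Both routes give the same result
print(chained.equals(step_by_step))
```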

Here are some other operations that you may find useful:

In [None]:
# get column names
df.columns

In [None]:
# how many rows and columns?
df.shape
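
Note that `columns` and `shape` are written without parentheses: they are stored properties of the DataFrame rather than methods. A small sketch on made-up data (all names here are examples) shows what they hold:

```python
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    "winner_name": ["Player 1", "Player 2", "Player 1"],
    "minutes": [95, 120, 80],
})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = toy_df.shape

# columns can be turned into a plain Python list
column_names = list(toy_df.columns)

print(n_rows, n_cols, column_names)
```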


### Exercise 0.1

You've now seen a few ways to explore a dataframe.

In the next cell, have a go yourself.

- Can you create a subset of data by *sampling* 20 rows from a select set of *columns* of your choice? *Note: don't just select the top rows, but sample some at random.*
- Can you store this subset in a new pandas DataFrame, named `df2`?
- Can you check the dimensions of this new dataframe?

For this, you will need to use at least one new method.

## Section 1: Diving in

Now that we are set up, we will ask you to do some exercises to answer two questions:

- Who won the most Grand Slam Tournaments between 2000 and 2019? 
- Who won the most Australian Opens in the same period? 

For this, we will have to transform some of the data from one form into another. 
We call this process *Data Wrangling*.

Let's have a go.

Here are some references to the documentation of functions that we will use: \
\
[sample](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) \
[to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) \
[value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) \
[replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) \
[groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)

Read about using plotly [here](https://plotly.com/python-api-reference/generated/plotly.graph_objects.Bar.html) 

### Exercise 1.1: Sort out the match dates

So that we can select tournaments from the right time period, we need to make sure the date of the matches is in the right format.

Find the column that stores the match dates and reformat it so that it is a date, stored as YYYY-MM-DD.

For this exercise, consider using the [to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) function. 
You can also start with a Google or a GPT question on how to turn a pandas column into the format YYYY-MM-DD.

### Exercise 1.2: Identify finals matches from Grand Slam Tournaments

We only want to include final matches from grand slam tournaments in the dataframe, as they will tell us who won the relevant tournaments.
For this we need to work out where we get that information from.

- First work out which column will tell us what round a match was part of, and how you might identify final rounds.
- Then do the same for tournaments, find the column that tells us the level of a tournament, and how we can identify grand slam tournaments.

During your exploration, you may have noticed that the Australian Open and US Open tournaments are recorded in a variety of ways in the dataset.
Inconsistencies like this are common in real-world data!
The code cell below takes care of this.
Henry or Bobby can explain to you how this works.


In [None]:
# Setting up a dictionary of desired replacements
replacements = {
    'Us Open' : 'US Open',
    'Australian Open-2' : 'Australian Open',
    'Australian Chps.' : 'Australian Open',
    'Australian Open 2' : 'Australian Open',
    'Australian Championships' : 'Australian Open'
}

# Doing the replacements
df['tourney_name'] = df['tourney_name'].replace(replacements)

# Checking the replacements
df[df['tourney_level'] == 'G']['tourney_name'].value_counts()
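
As a rough walk-through of what the cell above does, here is the same pattern on a tiny made-up DataFrame, split into separate steps. Only the column names mirror the workshop dataset; the rows, the level `'A'`, and the names `toy_df`, `is_grand_slam`, `grand_slams`, and `counts` are invented for illustration.

```python
import pandas as pd

# Made-up data; only the column names mirror the workshop dataset
toy_df = pd.DataFrame({
    "tourney_name": ["Us Open", "US Open", "Wimbledon", "Some Cup"],
    "tourney_level": ["G", "G", "G", "A"],
})

# replace() swaps every value that matches a dictionary key for the dictionary value
toy_df["tourney_name"] = toy_df["tourney_name"].replace({"Us Open": "US Open"})

# The comparison gives one True/False per row ...
is_grand_slam = toy_df["tourney_level"] == "G"

# ... and using it inside the square brackets keeps only the True rows
grand_slams = toy_df[is_grand_slam]

# value_counts() counts how often each name occurs
counts = grand_slams["tourney_name"].value_counts()
print(counts)
```

Storing the True/False column in its own variable is optional; the workshop cell writes the comparison directly inside the square brackets, which does the same thing in one line.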

Now we are ready to make our filtered dataframe.

### Exercise 1.3: Filter the data 

Next we want to create a dataset that only contains the data we are interested in. 
This is known as filtering.

As a reminder the data should only contain:
- Data from grand slam tournaments
- The final rounds of those tournaments
- Only between 2000 and 2019


### Exercise 1.4: Find who won the most grand slam tournaments between 2000 and 2019. Plot a bar chart with this information.

You can use the plotly library for plotting.
If you Google or use GPT, you will find that there are also other options, for example using the libraries seaborn or matplotlib. 
If your solution uses one of these, don't forget to import it!

### Exercise 1.5: Make a bar chart showing which 5 players won the most Australian Opens between 2000 and 2019.

The solution should be very similar. Note that we have to add the condition that the tournament name is `Australian Open`.

### Exercise 1.6: How young were the top winners when they won a grand slam for the first time?

## Section 2: BONUS: Analysing Streaks

### Which Male Players had the Longest Grand Slam Winning Streaks?

**This is a really tough one!!!** We included this as a challenge for workshop participants who are already a bit more familiar with coding and who may run through the preceding exercises rather quickly. Have a go at it. Don't feel discouraged if this is all a bit tricky and feels next level. It is meant to be!

The solution file contains the answer for the men's data. It's a good idea to try to solve it for the women's data yourself. 

Useful functions: 

[shift](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html) \
[eq](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eq.html) \
[cumsum](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cumsum.html) \
[get_level_values](https://pandas.pydata.org/docs/reference/api/pandas.Index.get_level_values.html)

