# Intro to Python
## Worksheet

Python is great.
But it's a little bare bones.
We need to import some libraries if we want to do data science.

Libraries?
Consider Python as your workbench.
Libraries are like little lovingly crafted bags (independent shop vibes) with specific tools that allow you to create beautiful things on your workbench. 
Typically, you need to install libraries on your machine, but as we're using Colab, they come pre-installed.
We will cover installation of libraries in a separate session. 
R users will recognise the concept of libraries but might think of them as 'packages'.

A very commonly used library in data science is [pandas](https://pandas.pydata.org/). 
It provides easy-to-use data structures and data analysis tools for the Python programming language.
We will use it a lot in this workshop.

Other libraries often used are:

- numpy: helps with doing maths on collections of numbers (like a matrix).
- scikit-learn and statsmodels: fit machine learning and statistical models.
- matplotlib, seaborn, and plotly are libraries used for data viz.

For this workshop we will at least need pandas and a specific module that comes with plotly, named graph_objects. 
We import them in the following cell.

Put your cursor in the cell and press `ctrl+enter` to execute it.

In [None]:
import pandas as pd
import plotly.graph_objects as go

> Note: moving forward we will call a lot of the tools (called methods and functions) that come with pandas. 
Because we don't want to type 'pandas' every time, we say 'as pd'. 
It means that we can now refer to pandas as pd, which saves a bit of typing. 
It is not essential.
You can see we've done the same for plotly.graph_objects.

As a next step, we will import the data we'll be playing around with today.
Again, put your cursor in the cell, and press `ctrl+enter`.


In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/BobbyGlennS/bit-python-intro/main/data/combined_matches.csv')

Let's dissect this piece of code a little bit:

`df`    
We don't just want to load the data, we also need to tell Python to store it somewhere. We are creating a variable named `df`. It will serve as a pointer to the data that we are about to load into memory. This way Python knows to keep hold of it, and lets us bring it back up every time we type `df`.


`=`  
We're telling python that this new variable should point to the result of the code that follows the equals sign (`=`).  

`pd.read_csv`  
Use the function read_csv from the pandas library (which we said we would refer to as pd).  

`('https://....')`  
Between the parentheses we provide an 'argument'. It's an instruction to the function. In this case the instruction is: 'go here to retrieve the file'.
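
The same pattern can be tried on a much smaller scale. The sketch below is only an illustration: the CSV text and the names `csv_text` and `toy_df` are made up, and `io.StringIO` is used so that `read_csv` can read from a piece of text instead of a web address.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a real file (illustrative data only).
csv_text = "player,wins\nAlice,3\nBob,1\n"

# Left of '=': the variable name we choose.
# Right of '=': the function call whose result the variable will point to.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df)
```

Here the argument between the parentheses tells `read_csv` where to find the data, just as the URL does in the cell above.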

Great! 
If all went well we now have a dataset loaded.
It's stored in Python's memory in an object called a pandas DataFrame, which we can retrieve by typing `df`.

We can start playing.
Let's first have a little look.
Let's print the dataframe with a function named `print()`.

Like with `read_csv()`, we write the function and between the parentheses we provide an argument that contains an instruction for the function to work with. In this case the instruction is to print whatever we provide as an argument.

In [None]:
print(df)


We can also use the *method* `head()` on a pandas DataFrame to get the first few rows, which looks slightly nicer.

The technical distinction between methods and functions is something for another time. But what is useful to know is that methods are like little appendices that come with certain objects. A pandas DataFrame has a set of methods that you can use on it. For example to explore the dataframe, which is what we will be doing in the next few code cells.

We call them slightly differently than functions: we write the object we want to use a method with, and then after the dot we write the name of the method.
See below.


In [None]:
df.head()

We can also give methods specific instructions. That's what the parentheses are for! E.g. you can ask for a specific number of rows.

In [None]:
df.head(10)
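
To make the difference in how functions and methods are called a bit more concrete, here is a small sketch on a made-up dataframe (the name `toy_df` and its contents are invented for illustration).

```python
import pandas as pd

# A tiny, made-up dataframe (illustrative data only).
toy_df = pd.DataFrame({"player": ["Alice", "Bob", "Cara"], "wins": [3, 1, 2]})

# A function stands on its own; the object goes between the parentheses.
n_rows = len(toy_df)

# A method belongs to the object; we write the object, a dot, then the method.
first_two = toy_df.head(2)

print(n_rows)
print(first_two)
```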


Leaving methods behind now for a second.

We can also ask pandas to display specific columns.

In [None]:
# view tourney date, tourney name, and winner name columns
df[['tourney_date', 'tourney_name', 'winner_name']]

You can combine different operations.
We call this 'chaining'.
This code is executed sequentially.

In [None]:
df[['tourney_date', 'tourney_name', 'winner_name']].head(5)
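
One way to convince yourself that a chain runs step by step is to write the steps out separately and compare the results. This is a sketch on made-up data; `toy_df`, `selected`, `step_by_step` and `chained` are illustrative names.

```python
import pandas as pd

# A tiny, made-up dataframe (illustrative data only).
toy_df = pd.DataFrame(
    {
        "player": ["Alice", "Bob", "Cara", "Dana"],
        "wins": [3, 1, 2, 5],
        "country": ["NZ", "UK", "US", "FR"],
    }
)

# Step by step: first select the columns, then take the first two rows.
selected = toy_df[["player", "wins"]]
step_by_step = selected.head(2)

# Chained: the same two operations written on one line.
chained = toy_df[["player", "wins"]].head(2)

# Both approaches give the same dataframe.
print(chained.equals(step_by_step))
```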

Here are some other operations that you may find useful:

In [None]:
#get column names
df.columns

In [None]:
# how many rows and columns?
df.shape


## Exercise 0.1

You've now seen a few ways to explore a dataframe.

In the next cell, have a go yourself.

- Can you create a subset of data by *sampling* 20 rows from a select set of *columns* of your choice? *Note: don't just select the top rows, but sample some at random.*
- Can you store this subset in a new pandas DataFrame?
- Can you check the dimensions of this new dataframe?

For this, you will need to use at least one new method.

# Section 1: Data Wrangling

Data wrangling is the process of transforming data from one form into another.

Let's have a go.

## Question: Who won the most Grand Slams between 2000 and 2019? Who won the most Australian Opens in the same period? 

Useful functions: \
\
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html \
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html \
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html \
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html \
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html \
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

Read about using plotly here: \
\
https://plotly.com/python-api-reference/generated/plotly.graph_objects.Bar.html

First we might want to have a look at the date data for each match, and make sure that it is in the correct format.

### Exercise 1.1: Have a look at the date of the matches and make sure that it is in the format YYYY-MM-DD

We can start by inspecting the 'tourney_date' column.

In [None]:
# Inspecting a sample of 5 rows from the 'tourney_date' column
df['tourney_date'].sample(5)

The date is in an unusual format that may be difficult to read. 

Let's change the format of this column using the 'to_datetime' function in pandas. See https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html for documentation. 

In [None]:
# Changing the date format
df['tourney_date'] = pd.to_datetime(df['tourney_date'].astype(str), format='%Y%m%d')

Note that we first change the date to a string, so as to.......
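
The conversion can be tried out on a couple of made-up values. The sketch assumes the dates are stored as whole numbers such as 20190114, which is what the format string `'%Y%m%d'` suggests; `raw_dates` and `parsed` are illustrative names.

```python
import pandas as pd

# Made-up dates stored as whole numbers in YYYYMMDD form (an assumption).
raw_dates = pd.Series([20190114, 20000703])

# Turn the numbers into text, then parse them with an explicit format.
parsed = pd.to_datetime(raw_dates.astype(str), format="%Y%m%d")

print(parsed)

# Once the column holds real dates, parts such as the year are easy to get.
print(parsed.dt.year.tolist())
```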

Let's check that it has worked as intended.

In [None]:
# Checking the new 'tourney_date' column
df['tourney_date'].sample(5)

This looks good! 

Next we might want to only include relevant matches in the dataframe. This is known as filtering.

### Exercise 1.2: Filter the dataframe to only include finals matches from Grand Slam Tournaments

Using ChatGPT, StackOverflow, and your own brilliance, can you write some code that:
- Keeps only rows where the value on the column `round` is `"F"`
- Keeps only rows where the value on the column `tourney_level` is `"G"`

First let's work out which matches are finals:

In [None]:
df['round'].value_counts()

A good guess here would be 'F' for final. We can verify this by picking a specific row with 'F' in the 'round' column and checking online that the match details are correct.

Let's find the rows that correspond to grand slams.

In [None]:
# Inspecting the 'tourney_level' column
df['tourney_level'].value_counts()


It is unclear which code corresponds to grand slams. Let's have a look at the tournament name and level for some rows. 

In [None]:
df[['tourney_name', 'tourney_level']].sample(20)

It looks like 'G' is the level for grand slams. Let's check that all grand slams are included in this level.

In [None]:
df[df['tourney_level'] == 'G']['tourney_name'].value_counts()

This looks good, except for the fact that there are multiple formats for some of the tournaments, which may cause issues when analysing the Australian Open tournaments. 

Let's rename some of them so that each grand slam only has one format. 

In [None]:
# Setting up a dictionary of desired replacements
replacements = {
    'Us Open' : 'US Open',
    'Australian Open-2' : 'Australian Open',
    'Australian Chps.' : 'Australian Open',
    'Australian Open 2' : 'Australian Open',
    'Australian Championships' : 'Australian Open'
}

# Doing the replacements
df['tourney_name'] = df['tourney_name'].replace(replacements)

# Checking the replacements
df[df['tourney_level'] == 'G']['tourney_name'].value_counts()

Great! This has worked and we are ready to make our filtered dataframe.

In [None]:
# Making a dataframe that only includes grandslam finals
gs_df = df[(df['tourney_level'] == 'G') & (df['round'] == 'F')]
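
To see what the two conditions do, it can help to build them separately on a made-up dataframe. In this sketch the column names mirror the real data, but the rows and the names `toy`, `is_slam`, `is_final` and `slam_finals` are invented for illustration.

```python
import pandas as pd

# Made-up matches (illustrative data only).
toy = pd.DataFrame(
    {
        "tourney_level": ["G", "G", "A", "G"],
        "round": ["F", "SF", "F", "F"],
        "winner_name": ["Alice", "Bob", "Cara", "Dana"],
    }
)

# Each comparison gives one True/False value per row.
is_slam = toy["tourney_level"] == "G"
is_final = toy["round"] == "F"

# '&' keeps only the rows where both conditions are True.
slam_finals = toy[is_slam & is_final]

print(slam_finals)
```

When the conditions are written inline, as in the cell above, each one needs its own parentheses because `&` is evaluated before `==` in Python.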

### Exercise 1.3: Find who won the most grand slam tournaments/Australian Opens between 2000 and 2019. Plot two bar charts showing this information.

We will use the `groupby` function to do this. `groupby` groups data in a dataframe and applies some aggregate function to data within each group. For example, if we had data on 

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html.
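
As a rough illustration of grouping and aggregating, here is a sketch on a made-up list of match winners. The data and the names `toy` and `wins` are hypothetical; `size` simply counts the rows in each group.

```python
import pandas as pd

# Made-up list of match winners (illustrative data only).
toy = pd.DataFrame(
    {"winner_name": ["Alice", "Bob", "Alice", "Cara", "Alice", "Bob"]}
)

# Group the rows by player, then count how many rows fall in each group.
wins = toy.groupby("winner_name").size()

print(wins)
```

The result is a pandas Series with one entry per player, where the player names form the index.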

Our first step is to filter the dataframe to only include matches that happened between 2000 and 2019.

In [None]:
gs_df_time_res = gs_df[(gs_df['tourney_date'].dt.year >= 2000) & (gs_df['tourney_date'].dt.year < 2020)]

Next we want to group this dataframe by `winner_name`. We will use the aggregate function `size` as we wish to count the number of appearances of each player.

In [None]:
gs_wins_grouped = gs_df_time_res.groupby('winner_name').size()

Now we will order the series in descending order, and take the top 5 entries.

In [None]:
gs_top_5 = gs_wins_grouped.sort_values(ascending=False).head(5)

Note that you could do this in one line of code: 

In [None]:
gs_top_5 = gs_df[(gs_df['tourney_date'].dt.year >= 2000) & (gs_df['tourney_date'].dt.year < 2020)].groupby('winner_name').size().sort_values(ascending=False).head(5)
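
A long chain like this can be hard to read on one line. One option, sketched here on made-up data, is to wrap the whole expression in parentheses so that each step sits on its own line. The `year` column and the names `toy` and `top_2` are invented for illustration.

```python
import pandas as pd

# Made-up finals with the year they were played (illustrative data only).
toy = pd.DataFrame(
    {
        "winner_name": ["Alice", "Alice", "Bob", "Alice", "Bob", "Bob", "Cara", "Alice"],
        "year": [1999, 2001, 2005, 2010, 2012, 2021, 2015, 2019],
    }
)

# The outer parentheses let us break the chain over several lines.
top_2 = (
    toy[(toy["year"] >= 2000) & (toy["year"] < 2020)]  # keep 2000 to 2019
    .groupby("winner_name")                            # one group per player
    .size()                                            # count wins per player
    .sort_values(ascending=False)                      # most wins first
    .head(2)                                           # keep the top 2
)

print(top_2)
```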

Now let's extract the names and number of wins for each player.

In [None]:
gs_top_players = list(gs_top_5.index)
gs_top_nums = gs_top_5.values

In [None]:
fig = go.Figure([go.Bar(x=gs_top_players, y=gs_top_nums)])

fig.update_layout(
    title='Grand Slam Wins by Top 5 Tennis Players',
    xaxis_title='Players',
    yaxis_title='Grand Slam Wins between 2000-2019',
    template='plotly_dark'  
)

fig.show()

### Exercise 1.4: Make a bar chart showing which 5 players won the most Australian Opens between 2000 and 2019.

The solution is very similar. Note that we have to add the condition that the tournament name is `Australian Open`.

In [None]:
# Making the grouped series
most_aus = gs_df[(gs_df['tourney_date'].dt.year >= 2000) & (gs_df['tourney_date'].dt.year < 2020) & (gs_df['tourney_name'] == 'Australian Open')].groupby('winner_name').size().sort_values(ascending=False).head(5)

# Getting the players and number of wins 
players_aus = list(most_aus.index)
num_aus = most_aus.values

# Plotting the bar chart
fig = go.Figure([go.Bar(x=players_aus, y=num_aus)])

fig.update_layout(
    title='Australian Open Wins by Top 5 Tennis Players',
    xaxis_title='Players',
    yaxis_title='Australian Open Wins between 2000-2019',
    template='plotly_dark'  
)

fig.show()

Serena Williams comes out on top again! It's interesting to note that while Federer won more grand slams than Djokovic, he won fewer Australian Opens. It could be interesting to investigate what factors might cause this, for example court type. 

# Section 3: Analysing Streaks

Throughout this section, we will be analysing grand slam winning streaks from the men's dataframe.

Useful functions: 

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html \
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eq.html \
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cumsum.html \
https://pandas.pydata.org/docs/reference/api/pandas.Index.get_level_values.html

## Question 1: Which Male Players had the Longest Grand Slam Winning Streaks?

### Exercise 3.1: Filter the grand slams dataframe to only include male players

In [None]:
# Keeping only matches from the men's (ATP) tour
gs_df_men = gs_df[gs_df['Tour'] == 'ATP']

### Exercise 3.1: Shift the 'winner_name' column in the men's dataframe down by one using the shift function.

Let's see what the shift function does to a column. First we check what the column looks like before applying the shift function.

In [None]:
# Checking the 'winner_name' column before applying the shift function
gs_df_men['winner_name']

In [None]:
# Shifting the 'winner_name' column down by 1
gs_df_men['winner_name'].shift()

It seems to shift the whole column down by one. Notice that the index remains the same. This is important for comparing the shifted row to the original. 
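
The effect is easier to see on a short, made-up series where the original and shifted values sit side by side. The names `winners` and `shifted` and the index values are invented for illustration.

```python
import pandas as pd

# Made-up winners with a non-default index (illustrative data only).
winners = pd.Series(["Alice", "Alice", "Bob", "Alice"], index=[10, 20, 30, 40])

# Every value moves down one row; the first row has nothing above it,
# so it becomes a missing value.
shifted = winners.shift()

# Put both versions next to each other to compare them row by row.
print(pd.DataFrame({"original": winners, "shifted": shifted}))
```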

Now let's see what the eq function does. 

### Exercise 3.2: Use the `eq` function to compare the shifted 'winner_name' to the original. What is the result? 

In [None]:
gs_df_men['winner_name'].eq(gs_df_men['winner_name'].shift())

The result is a boolean array indicating when the two columns are the same (True), and when they are not (False).
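
Here is the same comparison on a short, made-up series (the names `winners` and `same_as_previous` are illustrative). The first row is compared with a missing value, so it comes out as False.

```python
import pandas as pd

# Made-up sequence of winners (illustrative data only).
winners = pd.Series(["Alice", "Alice", "Bob", "Bob", "Bob", "Alice"])

# True where the winner is the same as in the row above.
same_as_previous = winners.eq(winners.shift())

print(same_as_previous.tolist())
# → [False, True, False, True, True, False]
```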

### Exercise 3.3: How can we use these functions to determine grand slam winning streaks for each player?

Hint: you might find it helpful to use the `cumsum` function, with which you can sum values in a boolean array (https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.cumsum.html). 

Hint: also see https://joshdevlin.com/blog/calculate-streaks-in-pandas/#:~:text=The%20first%20step%20in%20calculating,us%20which%20are%20not%20equal for a blog post on how to find streaks in a dataframe. 

#### Solution:

We will solve this problem by first adding a `streak_indicator_bool` column, which will be a boolean (as above) that indicates when the winning player is the same as in the shifted `winner_name` column. Then we will negate this column so that `False` indicates that the `winner_name` is the same as the previous `winner_name`, and `True` marks the start of a new streak. We can now apply the `cumsum` function and create a new `streak_indicator_num` column whose entries will be numbers that increment by 1 every time a new streak is started. 
\
\
Finally we will group the dataframe by `winner_name` and `streak_indicator_num` (in that order), aggregating using the `size` function, to see how large each group (i.e. each streak) is.
\
\
After sorting these values, we can pick the top 5 to get the top 5 streaks and the players associated with them.
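
The steps described above can be sketched on a short, made-up sequence of winners before applying them to the real data. The data and the names `toy`, `same` and `streak_id` are invented for illustration, and the rows are assumed to be in date order already.

```python
import pandas as pd

# Made-up winners, assumed to be in date order (illustrative data only).
toy = pd.DataFrame(
    {"winner_name": ["Alice", "Alice", "Bob", "Bob", "Bob", "Alice"]}
)

# True where the winner is the same as in the previous row.
same = toy["winner_name"].eq(toy["winner_name"].shift())

# Negating gives True at the start of each new streak; the running total
# of those True values gives every streak its own number.
toy["streak_id"] = (~same).cumsum()

# Count the rows in each (player, streak) group, longest streak first.
streaks = (
    toy.groupby(["winner_name", "streak_id"])
    .size()
    .sort_values(ascending=False)
)

print(toy)
print(streaks)
```

Grouping by the streak number as well as the player keeps Alice's two separate streaks apart instead of adding them together.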

### Exercise 3.4: Apply your method to find the players with the 5 longest streaks, and the length of their streaks

In [None]:
# Sorting the rows by 'tourney_date'
gs_df_men = gs_df_men.sort_values(by='tourney_date')

In [None]:
# Creating a streak indicator column in the men's dataframe
gs_df_men['streak_indicator_bool'] = gs_df_men['winner_name'].eq(gs_df_men['winner_name'].shift())

# Creating a streak indicator column that contains numbers that indicate different streaks
gs_df_men['streak_indicator_num'] = (~gs_df_men['streak_indicator_bool']).cumsum()

# Grouping the dataframe by 'winner_name' and 'streak_indicator_num' 
streaks_men = gs_df_men.groupby(['winner_name', 'streak_indicator_num']).size()

# Sorting the streaks object to find 5 longest streaks 
highest_streaks = streaks_men.sort_values(ascending=False)

# Getting the players who got the longest streaks
players_with_highest_streak = list(highest_streaks.index.get_level_values('winner_name'))

highest_streaks = list(highest_streaks.values)

# Making a dictionary of the players with their streaks
unique_highest_streaks = {
    'player' : players_with_highest_streak,
    'streak' : highest_streaks
}

# Making the dictionary into a dataframe
unique_highest_streaks_df = pd.DataFrame(unique_highest_streaks)

# Dropping rows that have the same pair of entries in the 'player' and 'streak' column 
unique_highest_streaks_df.drop_duplicates(subset=['player', 'streak'], inplace=True)

# Getting the top 5 players and streaks
top_5_players = list(unique_highest_streaks_df['player'].head(5))
top_5_streaks = list(unique_highest_streaks_df['streak'].head(5))

# Printing the results
print(f'Players with highest streaks are {top_5_players}, with streak(s) of {top_5_streaks}, respectively.')