# Intro to Python
## Solution Sheet

As a first step, we need to import some libraries.

Libraries?
Consider Python as your workbench.
Libraries are like little lovingly crafted bags (independent shop vibes) with specific tools that allow you to create beautiful things on your workbench. 

Two very commonly used libraries in data science are numpy and pandas.
- pandas provides easy-to-use data structures and data analysis tools for the Python programming language
- numpy can be used to perform a wide variety of mathematical operations on arrays (collections of data)

Here we will import pandas. 


In [1]:
import pandas as pd

> Moving forward we will call a lot of the tools (called methods and functions) that come with pandas. 
Because we don't want to type 'pandas' every time, we say 'as pd'. 
It means that we can now refer to pandas as pd, which saves a bit of typing. 
The alias is not essential, but pd is the name you will see in most pandas code.

As a next step, we will import the data we'll be playing around with today.


In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/BobbyGlennS/bit-python-intro/main/data/combined_matches.csv')

Let's dissect this a little bit:

- `df`: We don't just want to load the data; we also need to tell Python to store it somewhere. We create a variable called df to hold the dataframe.
- `=`: We're telling Python "hey Python buddy, remember this thing that I just brought into existence from out of nothing? Called 'df'? Yes that one :) Well, can you turn it into the following:"
- `pd.read_csv`: Use the function read_csv from the pandas library (which we said we would refer to as pd).
- `('https://....')`: Between the parentheses we provide an 'argument'. It's an instruction to the function. In this case the instruction is: 'go here to retrieve the file'.
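
To see the same pieces at work without a download, here is a minimal sketch that reads a tiny CSV held in memory. The CSV text, the name `toy_df`, and the player names are made up for illustration; only `pd.read_csv` comes from the cell above.

```python
import io

import pandas as pd

# A tiny, invented CSV held in memory so nothing needs to be downloaded.
csv_text = """tourney_name,surface,winner_name
Toy Open,Hard,Player A
Toy Classic,Clay,Player B
"""

# read_csv accepts a URL, a file path, or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(type(toy_df))  # the kind of object that read_csv hands back
print(toy_df.shape)  # (number of rows, number of columns)
```

The argument between the parentheses changes, but the pattern `name = pd.read_csv(source)` stays the same.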

Great! 
If all went well we now have a dataset loaded.
It's stored in Python's memory in an object called a pandas DataFrame.
We can start playing.
Let's first have a little look.
We can use the *method* head() to get the first few rows (five by default).

In [3]:
df.head()


Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points,Tour
0,2019-M020,Brisbane,Hard,32.0,A,20181231,300,105453,2.0,,...,34.0,20.0,14.0,10.0,15.0,9.0,3590.0,16.0,1977.0,ATP
1,2019-M020,Brisbane,Hard,32.0,A,20181231,299,106421,4.0,,...,36.0,7.0,10.0,10.0,13.0,16.0,1977.0,239.0,200.0,ATP
2,2019-M020,Brisbane,Hard,32.0,A,20181231,298,105453,2.0,,...,15.0,6.0,8.0,1.0,5.0,9.0,3590.0,40.0,1050.0,ATP
3,2019-M020,Brisbane,Hard,32.0,A,20181231,297,104542,,PR,...,38.0,9.0,11.0,4.0,6.0,239.0,200.0,31.0,1298.0,ATP
4,2019-M020,Brisbane,Hard,32.0,A,20181231,296,106421,4.0,,...,46.0,19.0,15.0,2.0,4.0,16.0,1977.0,18.0,1855.0,ATP


You can also ask for a specific number of rows.

In [4]:
df.head(10)


Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points,Tour
0,2019-M020,Brisbane,Hard,32.0,A,20181231,300,105453,2.0,,...,34.0,20.0,14.0,10.0,15.0,9.0,3590.0,16.0,1977.0,ATP
1,2019-M020,Brisbane,Hard,32.0,A,20181231,299,106421,4.0,,...,36.0,7.0,10.0,10.0,13.0,16.0,1977.0,239.0,200.0,ATP
2,2019-M020,Brisbane,Hard,32.0,A,20181231,298,105453,2.0,,...,15.0,6.0,8.0,1.0,5.0,9.0,3590.0,40.0,1050.0,ATP
3,2019-M020,Brisbane,Hard,32.0,A,20181231,297,104542,,PR,...,38.0,9.0,11.0,4.0,6.0,239.0,200.0,31.0,1298.0,ATP
4,2019-M020,Brisbane,Hard,32.0,A,20181231,296,106421,4.0,,...,46.0,19.0,15.0,2.0,4.0,16.0,1977.0,18.0,1855.0,ATP
5,2019-M020,Brisbane,Hard,32.0,A,20181231,295,104871,,,...,40.0,18.0,15.0,6.0,9.0,40.0,1050.0,185.0,275.0,ATP
6,2019-M020,Brisbane,Hard,32.0,A,20181231,294,105453,2.0,,...,37.0,13.0,12.0,6.0,9.0,9.0,3590.0,19.0,1835.0,ATP
7,2019-M020,Brisbane,Hard,32.0,A,20181231,293,104542,,PR,...,34.0,11.0,11.0,6.0,11.0,239.0,200.0,77.0,691.0,ATP
8,2019-M020,Brisbane,Hard,32.0,A,20181231,292,200282,7.0,,...,30.0,3.0,9.0,3.0,6.0,31.0,1298.0,72.0,715.0,ATP
9,2019-M020,Brisbane,Hard,32.0,A,20181231,291,106421,4.0,,...,27.0,7.0,10.0,2.0,6.0,16.0,1977.0,240.0,200.0,ATP


You can also ask for specific columns only.

In [5]:
# view tourney date, tourney name, and winner name columns
df[['tourney_date', 'tourney_name', 'winner_name']]

Unnamed: 0,tourney_date,tourney_name,winner_name
0,20181231,Brisbane,Kei Nishikori
1,20181231,Brisbane,Daniil Medvedev
2,20181231,Brisbane,Kei Nishikori
3,20181231,Brisbane,Jo-Wilfried Tsonga
4,20181231,Brisbane,Daniil Medvedev
...,...,...,...
191915,20141109,Tour Finals,Novak Djokovic
191916,20141109,Tour Finals,Novak Djokovic
191917,20141121,Davis Cup WG F: FRA vs SUI,Stan Wawrinka
191918,20141121,Davis Cup WG F: FRA vs SUI,Gael Monfils


... and combine the two.

In [6]:
df[['tourney_date', 'tourney_name', 'winner_name']].head(5)

Unnamed: 0,tourney_date,tourney_name,winner_name
0,20181231,Brisbane,Kei Nishikori
1,20181231,Brisbane,Daniil Medvedev
2,20181231,Brisbane,Kei Nishikori
3,20181231,Brisbane,Jo-Wilfried Tsonga
4,20181231,Brisbane,Daniil Medvedev
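
A small sketch of why the double square brackets matter, using an invented three-row dataframe (`toy_df` and its values are assumptions, not part of the tennis data). Single brackets with one name give back a single column; a list of names gives back a smaller dataframe, which is why `.head()` can be chained onto it.

```python
import pandas as pd

# Invented data, only to illustrate column selection.
toy_df = pd.DataFrame({
    "tourney_name": ["Toy Open", "Toy Classic", "Toy Masters"],
    "surface": ["Hard", "Clay", "Grass"],
    "winner_name": ["Player A", "Player B", "Player A"],
})

one_column = toy_df["winner_name"]                     # one name: a Series
two_columns = toy_df[["tourney_name", "winner_name"]]  # list of names: a DataFrame

# The selection is itself a dataframe, so methods such as head() can follow it.
first_two = toy_df[["tourney_name", "winner_name"]].head(2)
print(first_two)
```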


What columns do we actually have?

In [7]:
# get column names
df.columns

Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'loser_id', 'loser_seed', 'loser_entry', 'loser_name', 'loser_hand',
       'loser_ht', 'loser_ioc', 'loser_age', 'score', 'best_of', 'round',
       'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon',
       'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df', 'l_svpt',
       'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced',
       'winner_rank', 'winner_rank_points', 'loser_rank', 'loser_rank_points',
       'Tour'],
      dtype='object')

In [8]:
# how many rows and columns?
df.shape


(191920, 50)

# Section 1: Data Wrangling

Data wrangling is the process of cleaning, reshaping, and filtering raw data so that it is ready for analysis.

Let's have a go. We can start by inspecting the 'tourney_date' column.

In [13]:
# Inspecting a sample of 5 rows from the 'tourney_date' column
df['tourney_date'].sample(5)

90534     19910422
77912     19860526
59445     19970908
166662    20060911
106231    19760524
Name: tourney_date, dtype: int64

## Exercise 1.1

The date is stored as an integer such as 19910422 (year, month, and day run together), which may be difficult to work with. Change the format of this column using the 'to_datetime' function in pandas. See https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html for documentation. 

In [14]:
# Changing the date format
df['tourney_date'] = pd.to_datetime(df['tourney_date'].astype(str), format='%Y%m%d')

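Note that we first change the date to a string, so that each value is text such as '20181231' that the format string '%Y%m%d' (four-digit year, two-digit month, two-digit day) can be matched against. If bare integers were passed without a format, pandas would treat them as a count of time units since 1 January 1970 (nanoseconds by default) rather than as calendar dates.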
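
The two steps in the cell above can be tried in isolation. This is a minimal sketch on two integer dates taken from the sample output earlier; the variable names are made up for illustration.

```python
import pandas as pd

raw_dates = pd.Series([20181231, 19910422])  # integers, as in the original column

# Step 1: turn each integer into text, e.g. 20181231 -> '20181231'.
as_text = raw_dates.astype(str)

# Step 2: parse the text with an explicit format (year, month, day, no separators).
parsed = pd.to_datetime(as_text, format="%Y%m%d")

print(parsed)
print(parsed.dt.year.tolist())  # the .dt accessor exposes parts of each date
```

Once the column holds real dates, parts such as the year or month are available through `.dt`, which is not possible with the plain integers.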

Let's check that it has worked as intended.

In [17]:
# Checking the new 'tourney_date' column
df['tourney_date'].sample(5)

146309   2004-05-10
69551    1978-07-23
114746   1974-06-24
2662     2019-09-13
159996   2013-07-29
Name: tourney_date, dtype: datetime64[ns]

This looks good! 

Now let's try to filter the dataframe.

We wish to filter the dataframe so that it only contains rows corresponding to grand slam finals.

## Exercise 1.2

Using ChatGPT, StackOverflow, and your own brilliance, can you write some code that:
- Keeps only rows where the value in the column `round` is `"F"`
- Keeps only rows where the value in the column `tourney_level` is `"G"`



In [11]:
# filter df so that it only contains rows where round == "F" and tourney_level == "G"
df = df[(df['round'] == 'F') & (df['tourney_level'] == 'G')]

df.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points,Tour
139,2019-580,Australian Open,Hard,128.0,G,20190114,226,104925,1.0,,...,24.0,16.0,13.0,3.0,8.0,1.0,9135.0,2.0,7480.0,ATP
1331,2019-520,Roland Garros,Clay,128.0,G,20190527,1701,104745,2.0,,...,37.0,14.0,17.0,6.0,13.0,2.0,7945.0,4.0,4685.0,ATP
1628,2019-540,Wimbledon,Grass,128.0,G,20190701,226,104925,1.0,,...,100.0,39.0,34.0,5.0,8.0,1.0,12415.0,3.0,6620.0,ATP
2179,2019-560,US Open,Hard,128.0,G,20190826,226,104745,2.0,,...,76.0,35.0,26.0,15.0,21.0,2.0,7945.0,5.0,4125.0,ATP
3071,2018-580,Australian Open,Hard,128.0,G,20180115,701,103819,2.0,,...,61.0,27.0,22.0,7.0,13.0,2.0,9605.0,6.0,3805.0,ATP
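
To unpack the filter, here is a sketch that builds the two conditions separately on an invented four-row dataframe (the data and the names `is_final`, `is_grand_slam`, and `both` are assumptions for illustration). Each comparison produces one True/False value per row, and `&` keeps a row only when both are True.

```python
import pandas as pd

# Invented data, only to illustrate boolean filtering.
toy_df = pd.DataFrame({
    "round": ["F", "SF", "F", "F"],
    "tourney_level": ["G", "G", "A", "G"],
    "winner_name": ["Player A", "Player B", "Player C", "Player D"],
})

is_final = toy_df["round"] == "F"               # one True/False per row
is_grand_slam = toy_df["tourney_level"] == "G"  # one True/False per row

# & combines the two masks row by row.
both = is_final & is_grand_slam

# Indexing with a boolean mask keeps only the rows marked True.
finals = toy_df[both]
print(finals)
```

When the conditions are written on a single line, each one needs its own parentheses, because `&` binds more tightly than `==` in Python.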


Great! Now let's split the dataframe into two new ones, one for the men's tour ('ATP'), and one for the women's tour ('WTA').

## Exercise 1.3

Create two new dataframes, one showing men's grand slam finals, and one showing women's.

In [18]:
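# Splitting the dataframe by tour: 'ATP' is the men's tour, 'WTA' the women's
# .copy() makes each result an independent dataframe, so adding columns to it
# later does not raise a SettingWithCopyWarning
df_men = df[df['Tour'] == 'ATP'].copy()
df_women = df[df['Tour'] == 'WTA'].copy()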

# Section 2: Grouping Data

How does the groupby function work?

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
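
As a rough sketch of the idea: groupby splits the rows into groups that share a value, applies a calculation to each group, and combines the results into one object. The dataframe below is invented for illustration; the column names merely echo the tennis data.

```python
import pandas as pd

# Invented data, only to illustrate grouping.
toy_df = pd.DataFrame({
    "surface": ["Hard", "Clay", "Hard", "Grass", "Clay", "Hard"],
    "minutes": [90, 120, 100, 80, 140, 110],
})

# How many rows fall into each surface group?
matches_per_surface = toy_df.groupby("surface").size()

# What is the average match length within each surface group?
mean_minutes = toy_df.groupby("surface")["minutes"].mean()

print(matches_per_surface)
print(mean_minutes)
```

Passing a list of column names, as in `groupby(['a', 'b'])`, makes one group per combination of values, which is what the streak analysis in Section 3 relies on.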

# Section 3: Analysing Streaks

In this section, we will be analysing grand slam winning streaks from the men's dataframe.

We will use the 'shift' and 'eq' methods to analyse players' streaks. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html and https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eq.html for documentation.

## Exercise 3.1

Let's see what the shift function does to a column. 

Shift the 'winner_name' column in the men's dataframe down by one using the shift function.

In [19]:
# Shifting the 'winner_name' column down by 1
df_men['winner_name'].shift()

0                       None
1              Kei Nishikori
2            Daniil Medvedev
3              Kei Nishikori
4         Jo-Wilfried Tsonga
                 ...        
191915         Roger Federer
191916        Novak Djokovic
191917        Novak Djokovic
191918         Stan Wawrinka
191919          Gael Monfils
Name: winner_name, Length: 191920, dtype: object

Notice that the index remains the same. This is important for comparing the shifted row to the original. 
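
A tiny sketch makes the behaviour easier to see. The Series below is invented, with a deliberately unusual index so that it is clear the labels stay put while the values move down.

```python
import pandas as pd

# Invented winners with a non-default index to show that labels do not move.
winners = pd.Series(["A", "A", "B", "A"], index=[10, 20, 30, 40])

# shift() moves every value down one row; the first row has no predecessor.
shifted = winners.shift()

print(pd.DataFrame({"original": winners, "shifted": shifted}))
```

Because both columns share the same index, each row now holds a winner next to the winner of the row before it.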

Now let's see what the eq function does. 

## Exercise 3.2

Use the eq function to compare the shifted 'winner_name' to the original. What is the result? 

In [20]:
df_men['winner_name'].eq(df_men['winner_name'].shift())

0         False
1         False
2         False
3         False
4         False
          ...  
191915    False
191916     True
191917    False
191918    False
191919    False
Name: winner_name, Length: 191920, dtype: bool

The result is a boolean Series that is True where a row's winner is the same as the winner in the row before it, and False otherwise.

## Exercise 3.3

How can we use this method to determine grand slam winning streaks for each player?

Hint: you can use the cumsum function to sum values in a boolean array (https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.cumsum.html). 

Hint: see https://joshdevlin.com/blog/calculate-streaks-in-pandas/#:~:text=The%20first%20step%20in%20calculating,us%20which%20are%20not%20equal.

In [79]:
# Creating a streak indicator column in the men's dataframe
df_men['streak_indicator_bool'] = df_men['winner_name'].eq(df_men['winner_name'].shift())

# Creating a streak indicator column that contains numbers that indicate different streaks
df_men['streak_indicator_num'] = (~df_men['streak_indicator_bool']).cumsum()

# Grouping the dataframe by 'winner_name' and 'streak_indicator_num' 
streaks_men = df_men.groupby(['winner_name', 'streak_indicator_num']).size()

# Sorting the streaks object to find 5 longest streaks 
highest_streaks = streaks_men.sort_values().tail(5)

# Getting the players who got the longest streaks
players_with_highest_streak = highest_streaks.index.get_level_values('winner_name')

# Printing the results
print(f'Player(s) with highest streaks are {list(players_with_highest_streak)}, with streak(s) of {list(highest_streaks.values)}, (respectively).')

Player(s) with highest streaks are ['Ilie Nastase', 'Ivan Lendl', 'Roger Federer', 'Ken Rosewall', 'Rod Laver'], with streak(s) of [6, 7, 7, 7, 9], (respectively).
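
To follow the logic step by step, here is the same technique on six invented rows (the dataframe `toy` and the column name `streak_id` are assumptions for illustration). The sketch assumes the rows are already in the order in which the matches were played; otherwise consecutive rows do not mean consecutive wins.

```python
import pandas as pd

# Invented sequence of winners, assumed to be in chronological order.
toy = pd.DataFrame({"winner_name": ["A", "A", "B", "A", "A", "A"]})

# True where the winner is the same as in the previous row.
same_as_previous = toy["winner_name"].eq(toy["winner_name"].shift())

# Flip the booleans so that True marks the START of a new streak, then take a
# running total: every row in the same streak ends up with the same number.
toy["streak_id"] = (~same_as_previous).cumsum()

# Count the rows in each (player, streak) pair to get the streak lengths.
streak_lengths = toy.groupby(["winner_name", "streak_id"]).size()

print(toy)
print(streak_lengths.sort_values().tail(1))  # the longest streak
```

Player A appears in two separate streaks here (rows 0 to 1 and rows 3 to 5), which is why the grouping uses the streak number as well as the name.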


Now let's adapt our method to find which women players are best on the clay court. 

## Exercise 3.4

Find the women players with the longest streaks on the clay court (in grand slam tournaments). 
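
One possible approach, sketched on invented data rather than the real `df_women`: filter to clay, make sure the rows are in date order, then reuse the shift, eq, and cumsum steps from Exercise 3.3. The dataframe `toy_women`, the player names, and the column name `streak_id` are assumptions for illustration; the same steps should carry over to the real dataframe.

```python
import pandas as pd

# Invented finals, only to illustrate the steps.
toy_women = pd.DataFrame({
    "tourney_date": pd.to_datetime([
        "2001-06-01", "2001-09-01", "2002-06-01",
        "2003-06-01", "2004-06-01", "2005-06-01",
    ]),
    "surface": ["Clay", "Hard", "Clay", "Clay", "Clay", "Clay"],
    "winner_name": ["Player A", "Player B", "Player A",
                    "Player B", "Player B", "Player B"],
})

# Keep clay finals only, oldest first; .copy() lets us add a column safely.
clay = toy_women[toy_women["surface"] == "Clay"].sort_values("tourney_date").copy()

# Same streak logic as before: a new streak starts whenever the winner changes.
same_as_previous = clay["winner_name"].eq(clay["winner_name"].shift())
clay["streak_id"] = (~same_as_previous).cumsum()

# Streak lengths per player, shortest to longest.
clay_streaks = clay.groupby(["winner_name", "streak_id"]).size().sort_values()
print(clay_streaks.tail(5))
```

Filtering before computing the streaks matters: a hard-court final between two clay finals should not break a clay-court streak.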