# Intro to Python
## Solution Sheet

As a first step, we need to import some libraries.

Libraries?
Consider Python as your workbench.
Libraries are like little bags with specific tools that allow you to craft beautiful things on your workbench. 

Two very commonly used libraries in data science are numpy and pandas.
- pandas provides easy-to-use data structures and data analysis tools for the Python programming language
- numpy can be used to perform a wide variety of mathematical operations on arrays (collections of data)
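
To make the two descriptions above a little more concrete, here is a minimal sketch with made-up numbers (the `aces` values and player labels are invented for illustration, not taken from the dataset). The `import ... as ...` lines are explained in the next step.

```python
import numpy as np
import pandas as pd

# numpy: do maths on a whole array at once (made-up numbers)
aces = np.array([4, 12, 8])
print(aces * 2)      # every element is doubled
print(aces.mean())   # the average of the array

# pandas: a labelled table built from the same numbers
table = pd.DataFrame({"player": ["A", "B", "C"], "aces": aces})
print(table)
```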

Here we will import pandas. 


In [2]:
import pandas as pd

> Moving forward we will call a lot of the tools (called methods and functions) that come with pandas. 
Because we don't want to type 'pandas' every time, we say 'as pd'. 
It means that we can now refer to pandas as pd, which saves a bit of typing. 
It is not essential, but it is the convention you will see in almost all pandas code.
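
If you want to convince yourself that `pd` is nothing more than a nickname, a small sketch like the following should do it. The variable names `long_way` and `short_way` are just for illustration.

```python
import pandas
import pandas as pd

# Both names point at the very same library.
print(pd is pandas)

# So these two lines build exactly the same thing.
long_way = pandas.Series([1, 2, 3])
short_way = pd.Series([1, 2, 3])
print(long_way.equals(short_way))
```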

As a next step, we will import the data we'll be playing around with today.


In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/BobbyGlennS/bit-python-intro/main/data/combined_matches.csv')

Let's dissect this a little bit:

- `df`: We don't just want to load the data, we also need to tell Python to store it somewhere. We create a dataframe that we will call `df`.
- `=`: We're telling Python "hey Python buddy, remember this thing that I just brought into existence called 'df'? Yes? Well, you should turn it into the following."
- `pd.read_csv`: Use the function `read_csv` from the pandas library (which we said we would refer to as `pd`).
- `('https://....')`: Between the parentheses we provide an 'argument'. It's an instruction to the function. In this case the instruction is: 'go here to retrieve the file'.
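
The same four pieces can be tried without an internet connection. The sketch below assumes a tiny made-up CSV held in a string (the rows and the name `matches` are hypothetical); `io.StringIO` simply lets `read_csv` treat that string as if it were a file.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for the real file (hypothetical data)
csv_text = """tourney_name,surface,winner_name
Brisbane,Hard,Player A
Wimbledon,Grass,Player B
"""

# name   =   function      (argument: where to find the data)
matches = pd.read_csv(io.StringIO(csv_text))

print(type(matches))  # a pandas DataFrame
print(matches)
```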

Great! 
If all went well we now have a dataset loaded.
It's stored in Python's memory in an object called a pandas DataFrame.
We can start playing.
Let's first have a little look.
We can use the *method* `head()` to get the first few rows (five by default).

In [4]:
df.head()


| | tourney_id | tourney_name | surface | draw_size | tourney_level | tourney_date | match_num | winner_id | winner_seed | winner_entry | ... | l_1stWon | l_2ndWon | l_SvGms | l_bpSaved | l_bpFaced | winner_rank | winner_rank_points | loser_rank | loser_rank_points | Tour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-M020 | Brisbane | Hard | 32.0 | A | 20181231 | 300 | 105453 | 2.0 | | ... | 34.0 | 20.0 | 14.0 | 10.0 | 15.0 | 9.0 | 3590.0 | 16.0 | 1977.0 | ATP |
| 1 | 2019-M020 | Brisbane | Hard | 32.0 | A | 20181231 | 299 | 106421 | 4.0 | | ... | 36.0 | 7.0 | 10.0 | 10.0 | 13.0 | 16.0 | 1977.0 | 239.0 | 200.0 | ATP |
| 2 | 2019-M020 | Brisbane | Hard | 32.0 | A | 20181231 | 298 | 105453 | 2.0 | | ... | 15.0 | 6.0 | 8.0 | 1.0 | 5.0 | 9.0 | 3590.0 | 40.0 | 1050.0 | ATP |
| 3 | 2019-M020 | Brisbane | Hard | 32.0 | A | 20181231 | 297 | 104542 | | PR | ... | 38.0 | 9.0 | 11.0 | 4.0 | 6.0 | 239.0 | 200.0 | 31.0 | 1298.0 | ATP |
| 4 | 2019-M020 | Brisbane | Hard | 32.0 | A | 20181231 | 296 | 106421 | 4.0 | | ... | 46.0 | 19.0 | 15.0 | 2.0 | 4.0 | 16.0 | 1977.0 | 18.0 | 1855.0 | ATP |


You can also ask for a specific number of rows.

In [None]:
df.head(10)
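
Here is a small sketch of how the number between the parentheses changes what you get back. It uses a made-up dataframe called `toy` (not the tennis data) so you can count the rows yourself; `tail()` is the mirror image of `head()` and looks at the bottom of the table.

```python
import pandas as pd

# Hypothetical toy data: 12 numbered matches
toy = pd.DataFrame({"match_num": range(1, 13)})

first_five = toy.head()     # no argument: pandas falls back to 5 rows
first_three = toy.head(3)   # ask for exactly 3 rows
last_two = toy.tail(2)      # tail() does the same from the bottom

print(len(first_five), len(first_three), len(last_two))
```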


You can also ask for specific columns only.

In [6]:
# view tourney date, tourney name, and winner name columns
df[['tourney_date', 'tourney_name', 'winner_name']]

| | tourney_date | tourney_name | winner_name |
|---|---|---|---|
| 0 | 20181231 | Brisbane | Kei Nishikori |
| 1 | 20181231 | Brisbane | Daniil Medvedev |
| 2 | 20181231 | Brisbane | Kei Nishikori |
| 3 | 20181231 | Brisbane | Jo-Wilfried Tsonga |
| 4 | 20181231 | Brisbane | Daniil Medvedev |
| ... | ... | ... | ... |
| 191915 | 20141109 | Tour Finals | Novak Djokovic |
| 191916 | 20141109 | Tour Finals | Novak Djokovic |
| 191917 | 20141121 | Davis Cup WG F: FRA vs SUI | Stan Wawrinka |
| 191918 | 20141121 | Davis Cup WG F: FRA vs SUI | Gael Monfils |


... and you can combine the two.

In [7]:
df[['tourney_date', 'tourney_name', 'winner_name']].head(5)

| | tourney_date | tourney_name | winner_name |
|---|---|---|---|
| 0 | 20181231 | Brisbane | Kei Nishikori |
| 1 | 20181231 | Brisbane | Daniil Medvedev |
| 2 | 20181231 | Brisbane | Kei Nishikori |
| 3 | 20181231 | Brisbane | Jo-Wilfried Tsonga |
| 4 | 20181231 | Brisbane | Daniil Medvedev |
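
You may wonder why there are two pairs of square brackets. A rough sketch, again on a made-up `toy` dataframe with invented player names: one pair with a single column name gives you back one column (a pandas Series), while a list of names inside the brackets gives you back a smaller DataFrame, which you can then chain `head()` onto.

```python
import pandas as pd

# Hypothetical toy data, not the real matches
toy = pd.DataFrame({
    "tourney_name": ["Brisbane", "Brisbane", "Wimbledon"],
    "surface": ["Hard", "Hard", "Grass"],
    "winner_name": ["Player A", "Player B", "Player A"],
})

one_column = toy["winner_name"]                     # single name -> Series
two_columns = toy[["tourney_name", "winner_name"]]  # list of names -> DataFrame

# Chaining: select the columns first, then keep the first 2 rows
top_two = toy[["tourney_name", "winner_name"]].head(2)

print(type(one_column))
print(type(two_columns))
print(top_two)
```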


What columns do we actually have?

In [9]:
# get column names
df.columns

Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'loser_id', 'loser_seed', 'loser_entry', 'loser_name', 'loser_hand',
       'loser_ht', 'loser_ioc', 'loser_age', 'score', 'best_of', 'round',
       'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon',
       'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df', 'l_svpt',
       'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced',
       'winner_rank', 'winner_rank_points', 'loser_rank', 'loser_rank_points',
       'Tour'],
      dtype='object')

In [10]:
# how many rows and columns?
df.shape


(191920, 50)
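
Note that `shape` has no parentheses: it is an attribute (a fact about the dataframe) rather than a method. It gives back a pair of numbers, rows first and columns second, which you can unpack into two names. A minimal sketch on a made-up `toy` dataframe:

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns
toy = pd.DataFrame({
    "round": ["F", "SF", "F"],
    "tourney_level": ["G", "G", "A"],
})

n_rows, n_cols = toy.shape   # rows first, columns second
print(n_rows, n_cols)

# len() on a dataframe counts rows only
print(len(toy))
```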

## Question 1: Who won the most Grand Slams?

### Exercise

We need to filter the dataset so that it only contains the finals from Grand Slam tournaments.
Using ChatGPT, StackOverflow, and your own brilliance, can you write some code that:
- Keeps only rows where the value in the column `round` is `"F"`
- Keeps only rows where the value in the column `tourney_level` is `"G"`



In [11]:
# filter df so that it only contains rows where round == "F" and tourney_level == "G"
df = df[(df['round'] == 'F') & (df['tourney_level'] == 'G')]

df.head()

| | tourney_id | tourney_name | surface | draw_size | tourney_level | tourney_date | match_num | winner_id | winner_seed | winner_entry | ... | l_1stWon | l_2ndWon | l_SvGms | l_bpSaved | l_bpFaced | winner_rank | winner_rank_points | loser_rank | loser_rank_points | Tour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 139 | 2019-580 | Australian Open | Hard | 128.0 | G | 20190114 | 226 | 104925 | 1.0 | | ... | 24.0 | 16.0 | 13.0 | 3.0 | 8.0 | 1.0 | 9135.0 | 2.0 | 7480.0 | ATP |
| 1331 | 2019-520 | Roland Garros | Clay | 128.0 | G | 20190527 | 1701 | 104745 | 2.0 | | ... | 37.0 | 14.0 | 17.0 | 6.0 | 13.0 | 2.0 | 7945.0 | 4.0 | 4685.0 | ATP |
| 1628 | 2019-540 | Wimbledon | Grass | 128.0 | G | 20190701 | 226 | 104925 | 1.0 | | ... | 100.0 | 39.0 | 34.0 | 5.0 | 8.0 | 1.0 | 12415.0 | 3.0 | 6620.0 | ATP |
| 2179 | 2019-560 | US Open | Hard | 128.0 | G | 20190826 | 226 | 104745 | 2.0 | | ... | 76.0 | 35.0 | 26.0 | 15.0 | 21.0 | 2.0 | 7945.0 | 5.0 | 4125.0 | ATP |
| 3071 | 2018-580 | Australian Open | Hard | 128.0 | G | 20180115 | 701 | 103819 | 2.0 | | ... | 61.0 | 27.0 | 22.0 | 7.0 | 13.0 | 2.0 | 9605.0 | 6.0 | 3805.0 | ATP |
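
If the one-line filter feels like a lot at once, it may help to see it taken apart. The sketch below uses a made-up `toy` dataframe with invented player names: each comparison produces a column of True/False values, `&` keeps only the rows where both are True, and putting that inside the square brackets keeps just those rows. When both comparisons are written on one line, each needs its own parentheses, as in the solution above.

```python
import pandas as pd

# Hypothetical toy data, not the real matches
toy = pd.DataFrame({
    "round": ["F", "SF", "F", "F"],
    "tourney_level": ["G", "G", "A", "G"],
    "winner_name": ["Player A", "Player B", "Player C", "Player D"],
})

is_final = toy["round"] == "F"                # True/False for every row
is_grand_slam = toy["tourney_level"] == "G"   # True/False for every row

# & means "and": a row is kept only if both conditions are True
finals = toy[is_final & is_grand_slam]

print(is_final.tolist())
print(finals)
```

The rows that survive keep their original row labels (here 0 and 3), which is why the index in the filtered tennis data jumps from 139 to 1331 rather than counting up from 0.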
