# 01. Introduction to pandas

In this part we explain some basic concepts in pandas and show how to work with pandas on simple datasets.

## 1. Import the Pandas library in a virtual environment

To use this pandas notebook it is good practice to build a virtual environment first and then activate it.

To do this go to the directory in the terminal where you cloned this repository and type: `python3 -m venv venv`.

This should create an environment that you can activate by typing: `source ./venv/bin/activate`.
Now the prompt in your terminal should start with `(venv)` and you can start installing libraries without messing with the rest of your system.

For a detailed tutorial on how to use virtual environments, https://realpython.com/python-virtual-environments-a-primer/ is a good, but somewhat outdated, start. I am planning on creating my own one soon.

NB: If you are using source control you should also create a file called `.gitignore` in the root directory of this project, add a line containing `venv/` and save it. Otherwise the environment is stored in your repository.

Now we are ready to install the pandas library by running `!pip install pandas` in the notebook, and then import it in our notebook.

We use `as pd` as a convention to alias the library.


In [3]:
# installing pandas
!pip install pandas

Defaulting to user installation because normal site-packages is not writeable


In [4]:
# importing pandas in the project. As a convention we import the library with the alias pd
import pandas as pd

## 1. Introduction to dataframes

The 3 components of pandas are the "Index", "Series" and "DataFrame".

A Series is one column, and a DataFrame is one or more columns combined, together with a shared index.
Series is something we will touch on later in this course. Let's first explore the main component, namely the dataframe.

Let's see how this works:

In [5]:
# List of first names
firstnames = ['John', 'Peter', 'Francis', 'Bob', 'Mary']
# turn it into a dataframe
df = pd.DataFrame(firstnames)
# and inspect what we have
df

Unnamed: 0,0
0,John
1,Peter
2,Francis
3,Bob
4,Mary


As we can see we now have something that resembles a spreadsheet, although there are some notable differences:
* The rows start with a '0' instead of '1'.
* The column also has a '0' as a header and not 'firstnames' or an 'A' as you may be familiar with in a spreadsheet.

Let's add another column.

In [6]:
# list of last names
lastnames = ['Doe','Parker','Bacon','Smith','Poppins']
# make a list of both the lists that we have
names = [firstnames,lastnames]
# turn it into a dataframe again
df = pd.DataFrame(names)
# and inspect what we have
df

Unnamed: 0,0,1,2,3,4
0,John,Peter,Francis,Bob,Mary
1,Doe,Parker,Bacon,Smith,Poppins


Clearly not the intended result! We wanted the first names and the last names in columns, not rows. To accomplish this we need to arrange the lists differently: each inner list has to describe one row, so we specify the rows that we want.

In [7]:
# list of names
names = [['John','Doe'],['Peter','Parker'],['Frances','Bacon'],['Bob','Smith'],['Mary','Poppins']]
# turn it into a dataframe again
df = pd.DataFrame(names)
# and inspect what we have
df

Unnamed: 0,0,1
0,John,Doe
1,Peter,Parker
2,Frances,Bacon
3,Bob,Smith
4,Mary,Poppins
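
As a side note, the row lists do not have to be typed by hand. The following is a minimal sketch of two ways to derive them from the original `firstnames` and `lastnames` lists; the variable names `df_rows` and `df_transposed` are only used for this illustration.

```python
import pandas as pd

firstnames = ['John', 'Peter', 'Francis', 'Bob', 'Mary']
lastnames = ['Doe', 'Parker', 'Bacon', 'Smith', 'Poppins']

# zip pairs the n-th first name with the n-th last name: one tuple per row
rows = list(zip(firstnames, lastnames))
df_rows = pd.DataFrame(rows)

# alternatively, build the row-wise frame and transpose it (swap rows and columns)
df_transposed = pd.DataFrame([firstnames, lastnames]).T

print(df_rows)
print(df_transposed)
```

Both frames have five rows and two columns, still with the default `0` and `1` column headers.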


This looks better, but it is very tedious, as we want to add the data as lists of first names and last names. To accomplish that we need to put the data in a Python dictionary.

In [8]:
names = {"A":firstnames,"B":lastnames}
# turn it into a dataframe again
df = pd.DataFrame(names)
# and inspect what we have
df

Unnamed: 0,A,B
0,John,Doe
1,Peter,Parker
2,Francis,Bacon
3,Bob,Smith
4,Mary,Poppins


Voila! A spreadsheet, albeit with the rows starting at 0. And as you may have noticed, we can now change the column names to something more useful.

In [9]:
names = {"First names":firstnames,"Last names":lastnames}
# turn it into a dataframe again
df = pd.DataFrame(names)
# and inspect what we have
df

Unnamed: 0,First names,Last names
0,John,Doe
1,Peter,Parker
2,Francis,Bacon
3,Bob,Smith
4,Mary,Poppins
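
Rebuilding the dataframe is not the only way to change column names. As a small sketch, assuming a made-up two-row frame called `df_names`, the `rename` method maps old column names to new ones and returns a new dataframe:

```python
import pandas as pd

# small illustrative frame with the generic headers "A" and "B"
df_names = pd.DataFrame({"A": ["John", "Peter"], "B": ["Doe", "Parker"]})

# rename returns a new frame; df_names itself keeps its old headers
renamed = df_names.rename(columns={"A": "First names", "B": "Last names"})

print(renamed.columns.tolist())
print(df_names.columns.tolist())
```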


We are still left with the rows starting at "0". We can change that by passing our own index:

In [10]:
# index = [1,2,3,4,5]
# or better
index = list(range(1,len(firstnames)+1))
# turn it into a dataframe again but now with index
df = pd.DataFrame(names, index=index)
# and inspect what we have
df



Unnamed: 0,First names,Last names
1,John,Doe
2,Peter,Parker
3,Francis,Bacon
4,Bob,Smith
5,Mary,Poppins


That looks like what we are after!

Or not. Unless you are sure, you should keep the index as is. We will see some situations where adding a custom index is a smart idea later in this course.

In [11]:
df["First names"]

1       John
2      Peter
3    Francis
4        Bob
5       Mary
Name: First names, dtype: object
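
Selecting a single column with square brackets, as in the cell above, gives back a Series rather than a dataframe. The sketch below illustrates this on a small made-up frame (`df_demo` is a name chosen for this example only):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"First names": ["John", "Peter"], "Last names": ["Doe", "Parker"]},
    index=[1, 2],
)

col = df_demo["First names"]

print(type(col))                         # a pandas Series
print(col.name)                          # the Series keeps the column name
print(col.index.equals(df_demo.index))   # and shares the index of the frame
```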

## 2. Importing data from a file

Typing in a complete dataset is seldom a good idea. More often you are provided with a dataset to work on. That is what this part is about. We will explore importing datasets in very basic formats, like Excel and .csv, and look at how this works. Later parts will touch on several other formats, as well as connecting to databases, that you might encounter in a more real-world scenario.

Let's first import a simple csv-file. CSV stands for "comma-separated values" and is a widely used format to export data from one application to another.

I compiled a csv-file of the best tennis players of all time in another course about webscraping. Let's use this dataset for our purpose. You can find the source at this link: https://howtheyplay.com/individual-sports/Top-10-Greatest-Male-Tennis-Players-of-All-Time. I have taken a sports dataset specifically because it is a well-known subject and the datasets around it are very detailed and rich. That is very handy later on in this course to explain all concepts of Pandas, data engineering and data science. Tennis is not my favorite sport, but after looking at several other sport datasets, I found this one to be very useful to show the different scenarios that anyone may encounter in data.\
NB: please do not bug me with questions about why Boris Becker or some other person is not included. We will come to that later in the course when we touch on unstructured data.

First we will import a simple dataset in CSV.

I compiled the webpage into a csv file that looks like this:

`id,name,dob,pob,cob,por,cor,pro,retired,price,wins,aus,fra,usa,eng,olympic,thof`\
`11,Ken Rosewall,1934-11-02,Sydney,Australia,Sydney,Australia,1957,1980,1602700,133,4,10,4,5,0,1980`\
`10,Andre Agassi,1970-04-29,Las Vegas,United States,Las Vegas,United States,1986,2006,315275,61,4,1,2,1,1,2011`\
`9,John McEnroe,1959-02-16,Wiesbaden,West Germany,New York City,United States,1978,1992,12547797,105,0,0,4,3,0,1999`\
`8,Jimmy Connors,1952-09-02,East St-Louis,United States,Santa Barbara,United States,1972,1996,8641040,147,1,0,2,5,0,1998`\
`7,Ivan Lendl,1960-03-07,Ostrava,Czechoslovakia,Goshen,United States,1978,1994,21262417,144,2,3,3,0,0,2001`\
`6,Bjorn Borg,1956-06-06,Sodertalje,Sweden,Stockholm,Sweden,1973,1983,3655751,101,0,6,0,5,0,1987`\
`5,Pete Sampras,1971-07-12,Potomac,United States,Lake Sherwood,United States,1988,2002,43280489,64,2,0,5,7,0,2007`\
`4,Rod Laver,1938-07-08,Rockhampton,Australia,Carlsbad,United States,1962,1979,1565413,200,3,3,5,9,0,1981`\
`3,Roger Federer,1981-07-08,Basel,Switzerland,Bottmingen,Switzerland,1998,,130594339,103,6,1,5,8,0,`\
`2,Rafael Nadal,1986-06-03,Manacor,Spain,Manacor,Spain,2001,,134529921,92,14,2,4,2,1,2022`\
`1,Novak Djokovic,1987-05-22,Belgrade,Serbia,Monte Carlo,Monaco,2003,,164786653,93,10,2,3,7,0,`

In columnar format this looks like:

| id | name           | dob        | pob           | cob            | por           | cor           | pro  | retired | price     | wins | aus | fra | usa | eng | olympic | thof |
|----|----------------|------------|---------------|----------------|---------------|---------------|------|---------|-----------|------|-----|-----|-----|-----|---------|------|
| 11 | Ken Rosewall   | 1934-11-02 | Sydney        | Australia      | Sydney        | Australia     | 1957 | 1980    | 1602700   | 133  | 4   | 10  | 4   | 5   | 0       | 1980 |
| 10 | Andre Agassi   | 1970-04-29 | Las Vegas     | United States  | Las Vegas     | United States | 1986 | 2006    | 315275    | 61   | 4   | 1   | 2   | 1   | 1       | 2011 |
| 9  | John McEnroe   | 1959-02-16 | Wiesbaden     | West Germany   | New York City | United States | 1978 | 1992    | 12547797  | 105  | 0   | 0   | 4   | 3   | 0       | 1999 |
| 8  | Jimmy Connors  | 1952-09-02 | East St-Louis | United States  | Santa Barbara | United States | 1972 | 1996    | 8641040   | 147  | 1   | 0   | 2   | 5   | 0       | 1998 |
| 7  | Ivan Lendl     | 1960-03-07 | Ostrava       | Czechoslovakia | Goshen        | United States | 1978 | 1994    | 21262417  | 144  | 2   | 3   | 3   | 0   | 0       | 2001 |
| 6  | Bjorn Borg     | 1956-06-06 | Sodertalje    | Sweden         | Stockholm     | Sweden        | 1973 | 1983    | 3655751   | 101  | 0   | 6   | 0   | 5   | 0       | 1987 |
| 5  | Pete Sampras   | 1971-07-12 | Potomac       | United States  | Lake Sherwood | United States | 1988 | 2002    | 43280489  | 64   | 2   | 0   | 5   | 7   | 0       | 2007 |
| 4  | Rod Laver      | 1938-07-08 | Rockhampton   | Australia      | Carlsbad      | United States | 1962 | 1979    | 1565413   | 200  | 3   | 3   | 5   | 9   | 0       | 1981 |
| 3  | Roger Federer  | 1981-07-08 | Basel         | Switzerland    | Bottmingen    | Switzerland   | 1998 |         | 130594339 | 103  | 6   | 1   | 5   | 8   | 0       |      |
| 2  | Rafael Nadal   | 1986-06-03 | Manacor       | Spain          | Manacor       | Spain         | 2001 |         | 134529921 | 92   | 14  | 2   | 4   | 2   | 1       |      |
| 1  | Novak Djokovic | 1987-05-22 | Belgrade      | Serbia         | Monte Carlo   | Monaco        | 2003 |         | 164786653 | 93   | 10  | 2   | 3   | 7   | 0       |      |

Creating a dataframe out of this in Pandas is very easy. Pass the path to the file to the `read_csv` function like so:

In [12]:
#Read csv file and inspect
df = pd.read_csv("../tennis-data/10tennisplayer.csv")
df

Unnamed: 0,Place,Name,Birth date,Birth Place,cob,por,cor,pro,Retired,Price money,Pro wins,Aus,Fra,Usa,Eng,Olympic,Hall of Fame
0,11,Ken Rosewall,1934-11-02,Sydney,Australia,Sydney,Australia,1957,1980.0,1602700,133,4,2,2,0,0,1980.0
1,10,Andre Agassi,1970-04-29,Las Vegas,United States,Las Vegas,United States,1986,2006.0,315275,61,4,1,2,1,1,2011.0
2,9,John McEnroe,1959-02-16,Wiesbaden,West Germany,New York City,United States,1978,1992.0,12547797,105,0,0,4,3,0,1999.0
3,8,Jimmy Connors,1952-09-02,East St-Louis,United States,Santa Barbara,United States,1972,1996.0,8641040,147,1,0,2,5,0,1998.0
4,7,Ivan Lendl,1960-03-07,Ostrava,Czechoslovakia,Goshen,United States,1978,1994.0,21262417,144,2,3,3,0,0,2001.0
5,6,Bjorn Borg,1956-06-06,Sodertalje,Sweden,Stockholm,Sweden,1973,1983.0,3655751,101,0,6,0,5,0,1987.0
6,5,Pete Sampras,1971-07-12,Potomac,United States,Lake Sherwood,United States,1988,2002.0,43280489,64,2,0,5,7,0,2007.0
7,4,Rod Laver,1938-07-08,Rockhampton,Australia,Carlsbad,United States,1962,1979.0,1565413,200,3,2,2,4,0,1981.0
8,3,Roger Federer,1981-07-08,Basel,Switzerland,Bottmingen,Switzerland,1998,2022.0,130594339,103,6,1,5,8,0,
9,2,Rafael Nadal,1986-06-03,Manacor,Spain,Manacor,Spain,2001,,134529921,92,14,2,4,2,1,


As we look at this result we can make a few observations:
* As seen previously, the index is set by Pandas. We might want to change that to the 'Place' column.
* The 'Place' column is in descending order.
* The turned-pro column ('pro') looks good, but the 'Retired' and 'Hall of Fame' columns hold years that should be integers, yet they are shown as floating points.
* Also, these columns have NaN (which stands for "Not a Number") in the places where a value is missing.
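
The last two observations are related: NaN is itself a floating point value, so a column of whole numbers that contains a gap is read as `float64` by default. A minimal sketch, using two tiny made-up CSV snippets read from memory with `io.StringIO` instead of the tennis file:

```python
import io
import pandas as pd

# every row has a retirement year
complete = pd.read_csv(io.StringIO("name,retired\nA,1980\nB,1992\n"))

# the second row has no retirement year
with_gap = pd.read_csv(io.StringIO("name,retired\nA,1980\nB,\n"))

print(complete["retired"].dtype)   # int64
print(with_gap["retired"].dtype)   # float64
```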

Let's tackle the first two issues first.

In [13]:
# set the column that we want to use as index
df = pd.read_csv("../tennis-data/10tennisplayer.csv",index_col="Place")
# sort them in ascending order
df.sort_index(inplace=True)
# display
df

Unnamed: 0_level_0,Name,Birth date,Birth Place,cob,por,cor,pro,Retired,Price money,Pro wins,Aus,Fra,Usa,Eng,Olympic,Hall of Fame
Place,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1,Novak Djokovic,1987-05-22,Belgrade,Serbia,Monte Carlo,Monaco,2003,,164786653,93,10,2,3,7,0,
2,Rafael Nadal,1986-06-03,Manacor,Spain,Manacor,Spain,2001,,134529921,92,14,2,4,2,1,
3,Roger Federer,1981-07-08,Basel,Switzerland,Bottmingen,Switzerland,1998,2022.0,130594339,103,6,1,5,8,0,
4,Rod Laver,1938-07-08,Rockhampton,Australia,Carlsbad,United States,1962,1979.0,1565413,200,3,2,2,4,0,1981.0
5,Pete Sampras,1971-07-12,Potomac,United States,Lake Sherwood,United States,1988,2002.0,43280489,64,2,0,5,7,0,2007.0
6,Bjorn Borg,1956-06-06,Sodertalje,Sweden,Stockholm,Sweden,1973,1983.0,3655751,101,0,6,0,5,0,1987.0
7,Ivan Lendl,1960-03-07,Ostrava,Czechoslovakia,Goshen,United States,1978,1994.0,21262417,144,2,3,3,0,0,2001.0
8,Jimmy Connors,1952-09-02,East St-Louis,United States,Santa Barbara,United States,1972,1996.0,8641040,147,1,0,2,5,0,1998.0
9,John McEnroe,1959-02-16,Wiesbaden,West Germany,New York City,United States,1978,1992.0,12547797,105,0,0,4,3,0,1999.0
10,Andre Agassi,1970-04-29,Las Vegas,United States,Las Vegas,United States,1986,2006.0,315275,61,4,1,2,1,1,2011.0
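
If the file has already been loaded, the same result can be reached without reading it again. This is a sketch on a made-up three-row CSV held in memory, not on the tennis file; `set_index` and `sort_index` both return new dataframes, so the calls can be chained:

```python
import io
import pandas as pd

csv_text = "Place,Name\n3,C\n1,A\n2,B\n"
df_demo = pd.read_csv(io.StringIO(csv_text))

# move the 'Place' column into the index, then sort by that index
df_demo = df_demo.set_index("Place").sort_index()

print(df_demo)
```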


Voila, we have our dataframe. We can do the same with an Excel sheet.
For this pandas depends on openpyxl, so we will have to install that first:

In [14]:
!pip install openpyxl

Defaulting to user installation because normal site-packages is not writeable


Now we are ready to create a dataframe from an Excel sheet. Beware that the sheet needs to be a contiguous set of rows and columns. Otherwise you will run into errors. You can pass a sheet name to identify the worksheet that contains the data.

In a later tutorial we will go into detail about working with Excel sheets and see how we can work with formulas, tables, named ranges and the like.

In [15]:
df_excel = pd.read_excel("../tennis-data/top10tennisplayers.xlsx", index_col="Place", sheet_name="Sheet1")
df_excel

Unnamed: 0_level_0,Name,Birth date,Birth Place,cob,por,cor,pro,Retired,Price money,Pro wins,Aus,Fra,Usa,Eng,Olympic,Hall of Fame
Place,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
11,Ken Rosewall,1934-11-02,Sydney,Australia,Sydney,Australia,1957,1980.0,1602700,133,4,2,2,0,0,1980.0
10,Andre Agassi,1970-04-29,Las Vegas,United States,Las Vegas,United States,1986,2006.0,315275,61,4,1,2,1,1,2011.0
9,John McEnroe,1959-02-16,Wiesbaden,West Germany,New York City,United States,1978,1992.0,12547797,105,0,0,4,3,0,1999.0
8,Jimmy Connors,1952-09-02,East St-Louis,United States,Santa Barbara,United States,1972,1996.0,8641040,147,1,0,2,5,0,1998.0
7,Ivan Lendl,1960-03-07,Ostrava,Czechoslovakia,Goshen,United States,1978,1994.0,21262417,144,2,3,3,0,0,2001.0
6,Bjorn Borg,1956-06-06,Sodertalje,Sweden,Stockholm,Sweden,1973,1983.0,3655751,101,0,6,0,5,0,1987.0
5,Pete Sampras,1971-07-12,Potomac,United States,Lake Sherwood,United States,1988,2002.0,43280489,64,2,0,5,7,0,2007.0
4,Rod Laver,1938-07-08,Rockhampton,Australia,Carlsbad,United States,1962,1979.0,1565413,200,3,2,2,4,0,1981.0
3,Roger Federer,1981-07-08,Basel,Switzerland,Bottmingen,Switzerland,1998,2022.0,130594339,103,6,1,5,8,0,
2,Rafael Nadal,1986-06-03,Manacor,Spain,Manacor,Spain,2001,,134529921,92,14,2,4,2,1,2022.0


As you can see, it works just like it does with a regular csv file.

## 3. Inspecting a dataframe

We now have a dataframe created, albeit a very small one. By typing `df` in Jupyter we can inspect its content, as if we were doing a print statement. A dataframe is a very useful format to work with columnar data, as we will see later in this tutorial.

In this section we are going to explain how to inspect this data format and the functions that help us see what the data is, so we can later enrich it and transform it into useful information.

Printing `df` is fine for a small dataset, but when we have thousands of rows, this is not what we desire. If we just want to inspect the first rows, other functions are more commonly used and recommended.

The first ones are `head` and `tail`, to view either the first or the last rows.

In [16]:
df.head(2)

Unnamed: 0_level_0,Name,Birth date,Birth Place,cob,por,cor,pro,Retired,Price money,Pro wins,Aus,Fra,Usa,Eng,Olympic,Hall of Fame
Place,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1,Novak Djokovic,1987-05-22,Belgrade,Serbia,Monte Carlo,Monaco,2003,,164786653,93,10,2,3,7,0,
2,Rafael Nadal,1986-06-03,Manacor,Spain,Manacor,Spain,2001,,134529921,92,14,2,4,2,1,


In [17]:
df.tail(3)

Unnamed: 0_level_0,Name,Birth date,Birth Place,cob,por,cor,pro,Retired,Price money,Pro wins,Aus,Fra,Usa,Eng,Olympic,Hall of Fame
Place,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
9,John McEnroe,1959-02-16,Wiesbaden,West Germany,New York City,United States,1978,1992.0,12547797,105,0,0,4,3,0,1999.0
10,Andre Agassi,1970-04-29,Las Vegas,United States,Las Vegas,United States,1986,2006.0,315275,61,4,1,2,1,1,2011.0
11,Ken Rosewall,1934-11-02,Sydney,Australia,Sydney,Australia,1957,1980.0,1602700,133,4,2,2,0,0,1980.0


In [21]:
df.take([0,1,2])

Unnamed: 0_level_0,Name,Birth date,Birth Place,cob,por,cor,pro,Retired,Price money,Pro wins,Aus,Fra,Usa,Eng,Olympic,Hall of Fame
Place,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1,Novak Djokovic,1987-05-22,Belgrade,Serbia,Monte Carlo,Monaco,2003,,164786653,93,10,2,3,7,0,
2,Rafael Nadal,1986-06-03,Manacor,Spain,Manacor,Spain,2001,,134529921,92,14,2,4,2,1,
3,Roger Federer,1981-07-08,Basel,Switzerland,Bottmingen,Switzerland,1998,2022.0,130594339,103,6,1,5,8,0,
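
To make the difference between these three calls explicit, here is a small sketch on a made-up ten-row frame (`df_demo` is not the tennis data). `head` and `tail` return 5 rows when no number is given, and `take` selects rows by position, not by index label:

```python
import pandas as pd

# ten rows, labelled 1..10, with values 10..19
df_demo = pd.DataFrame({"value": range(10, 20)}, index=range(1, 11))

first = df_demo.head()            # no argument: the first 5 rows
last_two = df_demo.tail(2)        # the last 2 rows
picked = df_demo.take([0, 2, 4])  # rows at positions 0, 2 and 4

print(len(first))
print(last_two.index.tolist())
print(picked.index.tolist())
```

Note that positions 0, 2 and 4 correspond to the index labels 1, 3 and 5 here, because the index starts at 1.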


In [31]:
df.size


176

In [32]:
df.shape

(11, 16)
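
The two numbers above are related: `shape` gives (rows, columns) and `size` is the total number of cells, so 11 x 16 = 176. A quick sketch on a made-up 3 by 2 frame:

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = df_demo.shape

print(rows, cols)                   # 3 2
print(df_demo.size == rows * cols)  # True
print(len(df_demo))                 # 3, the number of rows
```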

In [33]:
df.count()

Name            11
Birth date      11
Birth Place     11
cob             11
por             11
cor             11
pro             11
Retired          9
Price money     11
Pro wins        11
Aus             11
Fra             11
Usa             11
Eng             11
Olympic         11
Hall of Fame     8
dtype: int64

In [35]:
df.values

array([['Novak Djokovic', '1987-05-22', 'Belgrade', 'Serbia',
        'Monte Carlo', 'Monaco', 2003, nan, 164786653, 93, 10, 2, 3, 7,
        0, nan],
       ['Rafael Nadal', '1986-06-03', 'Manacor', 'Spain', 'Manacor',
        'Spain', 2001, nan, 134529921, 92, 14, 2, 4, 2, 1, nan],
       ['Roger Federer', '1981-07-08', 'Basel', 'Switzerland',
        'Bottmingen', 'Switzerland', 1998, 2022.0, 130594339, 103, 6, 1,
        5, 8, 0, nan],
       ['Rod Laver', '1938-07-08', 'Rockhampton', 'Australia',
        'Carlsbad', 'United States', 1962, 1979.0, 1565413, 200, 3, 2, 2,
        4, 0, 1981.0],
       ['Pete Sampras', '1971-07-12', 'Potomac', 'United States',
        'Lake Sherwood', 'United States', 1988, 2002.0, 43280489, 64, 2,
        0, 5, 7, 0, 2007.0],
       ['Bjorn Borg', '1956-06-06', 'Sodertalje', 'Sweden', 'Stockholm',
        'Sweden', 1973, 1983.0, 3655751, 101, 0, 6, 0, 5, 0, 1987.0],
       ['Ivan Lendl', '1960-03-07', 'Ostrava', 'Czechoslovakia',
        'Goshen', 'U

In [36]:
df.dtypes

Name             object
Birth date       object
Birth Place      object
cob              object
por              object
cor              object
pro               int64
Retired         float64
Price money       int64
Pro wins          int64
Aus               int64
Fra               int64
Usa               int64
Eng               int64
Olympic           int64
Hall of Fame    float64
dtype: object

In [38]:
df.columns

Index(['Name', 'Birth date', 'Birth Place', 'cob', 'por', 'cor', 'pro',
       'Retired', 'Price money', 'Pro wins', 'Aus', 'Fra', 'Usa', 'Eng',
       'Olympic', 'Hall of Fame'],
      dtype='object')

In [45]:
df.at[1,"Name"]

'Novak Djokovic'

In [49]:
df.loc[1]

Name            Novak Djokovic
Birth date          1987-05-22
Birth Place           Belgrade
cob                     Serbia
por                Monte Carlo
cor                     Monaco
pro                       2003
Retired                    NaN
Price money          164786653
Pro wins                    93
Aus                         10
Fra                          2
Usa                          3
Eng                          7
Olympic                      0
Hall of Fame               NaN
Name: 1, dtype: object

In [50]:
df.iloc[1]

Name            Rafael Nadal
Birth date        1986-06-03
Birth Place          Manacor
cob                    Spain
por                  Manacor
cor                    Spain
pro                     2001
Retired                  NaN
Price money        134529921
Pro wins                  92
Aus                       14
Fra                        2
Usa                        4
Eng                        2
Olympic                    1
Hall of Fame             NaN
Name: 2, dtype: object
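
The last three cells look similar but address rows in different ways: `loc` and `at` use the index label, while `iloc` uses the position counted from 0. That is why `df.loc[1]` returns the first player and `df.iloc[1]` the second. A minimal sketch on a made-up three-row frame with an index starting at 1:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["Novak Djokovic", "Rafael Nadal", "Roger Federer"]},
    index=[1, 2, 3],
)

by_label = df_demo.loc[1]       # row with index label 1 (the first row)
by_position = df_demo.iloc[1]   # row at position 1 (the second row)
single = df_demo.at[1, "Name"]  # one cell: label 1, column 'Name'

print(by_label["Name"])
print(by_position["Name"])
print(single)
```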

In [51]:
df.keys()

Index(['Name', 'Birth date', 'Birth Place', 'cob', 'por', 'cor', 'pro',
       'Retired', 'Price money', 'Pro wins', 'Aus', 'Fra', 'Usa', 'Eng',
       'Olympic', 'Hall of Fame'],
      dtype='object')

In [54]:
df.get(["Name","Retired"])

Unnamed: 0_level_0,Name,Retired
Place,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Novak Djokovic,
2,Rafael Nadal,
3,Roger Federer,2022.0
4,Rod Laver,1979.0
5,Pete Sampras,2002.0
6,Bjorn Borg,1983.0
7,Ivan Lendl,1994.0
8,Jimmy Connors,1996.0
9,John McEnroe,1992.0
10,Andre Agassi,2006.0


In [53]:
df.Name

Place
1     Novak Djokovic
2       Rafael Nadal
3      Roger Federer
4          Rod Laver
5       Pete Sampras
6         Bjorn Borg
7         Ivan Lendl
8      Jimmy Connors
9       John McEnroe
10      Andre Agassi
11      Ken Rosewall
Name: Name, dtype: object
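
Attribute access such as `df.Name` is a shortcut that only works when the column name is a valid Python identifier. For a column like 'Birth date', which contains a space, square brackets or `get` are needed. A sketch on a made-up one-row frame; the column 'Nickname' is hypothetical and deliberately absent:

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Rod Laver"], "Birth date": ["1938-07-08"]})

# attribute access and bracket access return the same column
same = df_demo.Name.equals(df_demo["Name"])

# df_demo.Birth date would be a SyntaxError, so use brackets instead
birth = df_demo["Birth date"]

# get returns None (or a default you pass) when the column does not exist
missing = df_demo.get("Nickname")

print(same, birth.iloc[0], missing)
```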

In [55]:
df.describe()

Unnamed: 0,pro,Retired,Price money,Pro wins,Aus,Fra,Usa,Eng,Olympic,Hall of Fame
count,11.0,9.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,8.0
mean,1981.454545,1994.888889,47525620.0,113.0,4.181818,1.727273,2.909091,3.818182,0.181818,1995.5
std,15.312502,13.851394,63275730.0,40.422766,4.354726,1.737292,1.513575,2.857208,0.40452,11.612801
min,1957.0,1979.0,315275.0,61.0,0.0,0.0,0.0,0.0,0.0,1980.0
25%,1972.5,1983.0,2629226.0,92.5,1.5,0.5,2.0,1.5,0.0,1985.5
50%,1978.0,1994.0,12547800.0,103.0,3.0,2.0,3.0,4.0,0.0,1998.5
75%,1993.0,2002.0,86937410.0,138.5,5.0,2.0,4.0,6.0,0.0,2002.5
max,2003.0,2022.0,164786700.0,200.0,14.0,6.0,5.0,8.0,1.0,2011.0


In [56]:
df.describe(include="all")

Unnamed: 0,Name,Birth date,Birth Place,cob,por,cor,pro,Retired,Price money,Pro wins,Aus,Fra,Usa,Eng,Olympic,Hall of Fame
count,11,11,11,11,11,11,11.0,9.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,8.0
unique,11,11,11,8,11,6,,,,,,,,,,
top,Novak Djokovic,1987-05-22,Belgrade,United States,Monte Carlo,United States,,,,,,,,,,
freq,1,1,1,3,1,6,,,,,,,,,,
mean,,,,,,,1981.454545,1994.888889,47525620.0,113.0,4.181818,1.727273,2.909091,3.818182,0.181818,1995.5
std,,,,,,,15.312502,13.851394,63275730.0,40.422766,4.354726,1.737292,1.513575,2.857208,0.40452,11.612801
min,,,,,,,1957.0,1979.0,315275.0,61.0,0.0,0.0,0.0,0.0,0.0,1980.0
25%,,,,,,,1972.5,1983.0,2629226.0,92.5,1.5,0.5,2.0,1.5,0.0,1985.5
50%,,,,,,,1978.0,1994.0,12547800.0,103.0,3.0,2.0,3.0,4.0,0.0,1998.5
75%,,,,,,,1993.0,2002.0,86937410.0,138.5,5.0,2.0,4.0,6.0,0.0,2002.5


In [57]:
df.isna()

Unnamed: 0_level_0,Name,Birth date,Birth Place,cob,por,cor,pro,Retired,Price money,Pro wins,Aus,Fra,Usa,Eng,Olympic,Hall of Fame
Place,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True
2,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
10,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
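
A table full of True and False values is hard to read on larger datasets. Since True counts as 1, summing the result gives the number of missing values per column, which complements the `count` output shown earlier. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Retired": [1980.0, np.nan, np.nan]}
)

missing = df_demo.isna().sum()  # missing values per column
present = df_demo.count()       # non-missing values per column

print(missing)
print(present)
```

For every column, `missing` plus `present` adds up to the number of rows.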


In [72]:
df.dtypes

Name             object
Birth date       object
Birth Place      object
cob              object
por              object
cor              object
pro               int64
Retired         float64
Price money       int64
Pro wins          int64
Aus               int64
Fra               int64
Usa               int64
Eng               int64
Olympic           int64
Hall of Fame    float64
dtype: object

In [83]:
df.astype({"Retired":"Int64"})

Unnamed: 0_level_0,Name,Birth date,Birth Place,cob,por,cor,pro,Retired,Price money,Pro wins,Aus,Fra,Usa,Eng,Olympic,Hall of Fame
Place,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1,Novak Djokovic,1987-05-22,Belgrade,Serbia,Monte Carlo,Monaco,2003,,164786653,93,10,2,3,7,0,
2,Rafael Nadal,1986-06-03,Manacor,Spain,Manacor,Spain,2001,,134529921,92,14,2,4,2,1,
3,Roger Federer,1981-07-08,Basel,Switzerland,Bottmingen,Switzerland,1998,2022.0,130594339,103,6,1,5,8,0,
4,Rod Laver,1938-07-08,Rockhampton,Australia,Carlsbad,United States,1962,1979.0,1565413,200,3,2,2,4,0,1981.0
5,Pete Sampras,1971-07-12,Potomac,United States,Lake Sherwood,United States,1988,2002.0,43280489,64,2,0,5,7,0,2007.0
6,Bjorn Borg,1956-06-06,Sodertalje,Sweden,Stockholm,Sweden,1973,1983.0,3655751,101,0,6,0,5,0,1987.0
7,Ivan Lendl,1960-03-07,Ostrava,Czechoslovakia,Goshen,United States,1978,1994.0,21262417,144,2,3,3,0,0,2001.0
8,Jimmy Connors,1952-09-02,East St-Louis,United States,Santa Barbara,United States,1972,1996.0,8641040,147,1,0,2,5,0,1998.0
9,John McEnroe,1959-02-16,Wiesbaden,West Germany,New York City,United States,1978,1992.0,12547797,105,0,0,4,3,0,1999.0
10,Andre Agassi,1970-04-29,Las Vegas,United States,Las Vegas,United States,1986,2006.0,315275,61,4,1,2,1,1,2011.0
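
Note that `astype` returns a new dataframe and leaves `df` untouched, which is consistent with the `df.info()` output further down still reporting `float64` for 'Retired'. To keep the conversion, the result has to be assigned. The sketch below uses a made-up single-column frame; `"Int64"` with a capital I is the pandas nullable integer type, which can hold missing values:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"Retired": [1980.0, np.nan, 2006.0]})

# astype returns a converted copy; assign it to keep the result
converted = df_demo.astype({"Retired": "Int64"})

print(df_demo["Retired"].dtype)    # float64, the original is unchanged
print(converted["Retired"].dtype)  # Int64
print(converted)
```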


In [84]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11 entries, 1 to 11
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          11 non-null     object 
 1   Birth date    11 non-null     object 
 2   Birth Place   11 non-null     object 
 3   cob           11 non-null     object 
 4   por           11 non-null     object 
 5   cor           11 non-null     object 
 6   pro           11 non-null     int64  
 7   Retired       9 non-null      float64
 8   Price money   11 non-null     int64  
 9   Pro wins      11 non-null     int64  
 10  Aus           11 non-null     int64  
 11  Fra           11 non-null     int64  
 12  Usa           11 non-null     int64  
 13  Eng           11 non-null     int64  
 14  Olympic       11 non-null     int64  
 15  Hall of Fame  8 non-null      float64
dtypes: float64(2), int64(8), object(6)
memory usage: 1.8+ KB
