# <span style="font-family: Times New Roman;">Table of Contents</span>
- <span style="font-family: Times New Roman;">[What is Pandas?](#1)</span>
- <span style="font-family: Times New Roman;">[Loading data](#2)</span> 
- <span style="font-family: Times New Roman;">[Viewing and understanding DataFrames](#3)</span> 
- <span style="font-family: Times New Roman;">[Slicing and Extracting Data](#4)</span>

## <span style="font-family: Times New Roman;">What is Pandas?</span> <a id='1'></a>

<p style="font-family: Times New Roman"> In pandas, a <strong>DataFrame</strong> is a two-dimensional labeled data structure, similar to a table or spreadsheet. It consists of rows and columns, where each column can have a different type of data (integer, float, string, etc.).
<br><br>Pandas’ functionality ranges from data transformations, such as sorting rows and taking subsets, to calculating summary statistics such as the mean, reshaping DataFrames, and joining DataFrames together.</p>
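
<p style="font-family: Times New Roman"> The operations listed above can be sketched on a small scale as follows. This is only a minimal illustration: the DataFrames <strong>people</strong> and <strong>cities</strong> and all of their values are made up for the example.</p>

```python
import pandas as pd

# Toy data, invented for illustration only
people = pd.DataFrame({
    'Name': ['John', 'Emma', 'Alex'],
    'Age': [28, 24, 32],
    'City': ['New York', 'San Francisco', 'Seattle'],
})
cities = pd.DataFrame({
    'City': ['New York', 'San Francisco', 'Seattle'],
    'State': ['NY', 'CA', 'WA'],
})

# Sort rows by a column
sorted_people = people.sort_values('Age')

# Take a subset of rows with a condition
over_25 = people[people['Age'] > 25]

# Calculate a summary statistic
mean_age = people['Age'].mean()

# Join two DataFrames on a shared column
joined = people.merge(cities, on='City', how='left')
print(joined)
```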

In [1]:
# Importing pandas
import pandas as pd

<p style="font-family: Times New Roman"><a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#">Creating a basic DataFrame</a></p>

In [2]:
data = {
    'Name': ['John', 'Emma', 'Alex'],
    'Age': [28, 24, 32],
    'City': ['New York', 'San Francisco', 'Seattle']
}

df = pd.DataFrame(data=data)
df

Unnamed: 0,Name,Age,City
0,John,28,New York
1,Emma,24,San Francisco
2,Alex,32,Seattle


In [3]:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   columns=['a', 'b', 'c'])
df

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9
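
<p style="font-family: Times New Roman"> Because each column can hold a different type of data, it can help to check the type pandas inferred for every column. The sketch below uses a hypothetical DataFrame <strong>df_example</strong> with invented values; <strong>.dtypes</strong> and <strong>.shape</strong> are standard DataFrame attributes.</p>

```python
import pandas as pd

df_example = pd.DataFrame({
    'Name': ['John', 'Emma', 'Alex'],   # text
    'Age': [28, 24, 32],                # integers
    'Height': [1.80, 1.65, 1.75],       # floats (invented values)
})

# One dtype per column
print(df_example.dtypes)

# Shape is (number of rows, number of columns)
print(df_example.shape)
```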


## <span style="font-family: Times New Roman;">Loading data</span> <a id='2'></a>

<p style="font-family: Times New Roman"> Files are read with functions named after the file format, following the pattern "pandas.read_*()", for example pandas.read_csv() for CSV files. <a href="https://pandas.pydata.org/pandas-docs/stable/reference/io.html#">Click for more information</a> </p>

In [4]:
# read CSV from data file
path_of_file = '../data/raw/athlete_events.csv'
data = pd.read_csv(path_of_file)
data.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,
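
<p style="font-family: Times New Roman"> When only part of a file is needed, read_csv() can limit what it loads. The sketch below is an illustration under assumptions: instead of the real file it reads a tiny in-memory CSV (<strong>csv_text</strong>, a shortened copy of the first rows shown above), and it uses the <strong>usecols</strong> and <strong>nrows</strong> parameters of read_csv().</p>

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a real file (illustration only)
csv_text = """ID,Name,Sex,Age,Team,Year
1,A Dijiang,M,24,China,1992
2,A Lamusi,M,23,China,2012
3,Gunnar Nielsen Aaby,M,24,Denmark,1920
"""

# usecols keeps only the listed columns; nrows limits how many rows are read
subset = pd.read_csv(io.StringIO(csv_text), usecols=['ID', 'Name', 'Year'], nrows=2)
print(subset)
```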


In [5]:
# Save the DataFrame to a CSV file with to_csv() (the index is written as the first column by default)
data.to_csv('test_file.csv')
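
<p style="font-family: Times New Roman"> By default to_csv() also writes the row index, which shows up as an extra "Unnamed: 0" column when the file is read back. The sketch below illustrates this with a small hypothetical DataFrame <strong>df_small</strong>; it writes to a string instead of a file so that nothing is created on disk.</p>

```python
import io
import pandas as pd

df_small = pd.DataFrame({'Name': ['John', 'Emma'], 'Age': [28, 24]})

# Default: the index is written as an unnamed first column
with_index = df_small.to_csv()
# index=False: only the data columns are written
without_index = df_small.to_csv(index=False)

# Reading the default output back turns the old index into an 'Unnamed: 0' column
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.to_list())

# index_col=0 restores the first column as the index instead
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored.columns.to_list())
```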

## <span style="font-family: Times New Roman;">Viewing and understanding DataFrames</span> <a id='3'></a> 

<p style="font-family: Times New Roman">Viewing data using <strong>.head()</strong> and <strong>.tail()</strong><br>
You can view the first or last few rows of a DataFrame using the .head() or .tail() methods, respectively. The n argument sets the number of rows to return (the default value is 5).</p>

In [6]:
data.head(2)

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,


In [7]:
data.tail(2)

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
271114,135571,Tomasz Ireneusz ya,M,30.0,185.0,96.0,Poland,POL,1998 Winter,1998,Winter,Nagano,Bobsleigh,Bobsleigh Men's Four,
271115,135571,Tomasz Ireneusz ya,M,34.0,185.0,96.0,Poland,POL,2002 Winter,2002,Winter,Salt Lake City,Bobsleigh,Bobsleigh Men's Four,
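
<p style="font-family: Times New Roman"> A short sketch of how the n argument behaves, using a hypothetical ten-row DataFrame <strong>numbers</strong>. A negative n is also accepted: head(-2) returns every row except the last two.</p>

```python
import pandas as pd

numbers = pd.DataFrame({'value': range(10)})

first_three = numbers.head(3)        # rows 0, 1, 2
last_three = numbers.tail(3)         # rows 7, 8, 9
all_but_last_two = numbers.head(-2)  # negative n drops rows from the end

print(first_three.index.to_list())
print(last_three.index.to_list())
print(len(all_but_last_two))
```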


<p style="font-family: Times New Roman"> The <strong>.sample()</strong> method returns randomly selected rows from the DataFrame.</p>

In [8]:
data.sample(2)

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
235169,117922,Kiyomi Takahashi,F,22.0,157.0,52.0,Japan,JPN,1988 Summer,1988,Summer,Seoul,Swimming,Swimming Women's 100 metres Butterfly,
116403,58908,Jukka Pekka Sakari Keskisalo,M,31.0,185.0,66.0,Finland,FIN,2012 Summer,2012,Summer,London,Athletics,"Athletics Men's 3,000 metres Steeplechase",
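
<p style="font-family: Times New Roman"> Because the rows are drawn at random, the output changes on every run. If a repeatable sample is needed, the <strong>random_state</strong> parameter can be fixed, and <strong>frac</strong> draws a fraction of the rows instead of a fixed number. The DataFrame <strong>numbers</strong> below is a made-up example.</p>

```python
import pandas as pd

numbers = pd.DataFrame({'value': range(100)})

# The same random_state gives the same rows on every run
sample_a = numbers.sample(n=5, random_state=42)
sample_b = numbers.sample(n=5, random_state=42)
print(sample_a.equals(sample_b))

# frac draws a fraction of the rows instead of a fixed number
ten_percent = numbers.sample(frac=0.1, random_state=42)
print(len(ten_percent))
```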


<p style="font-family: Times New Roman"> Understanding data using <strong>.describe()</strong><br> The .describe() method returns summary statistics for all numeric columns: count, mean, standard deviation, minimum, maximum, and quartiles. In the cell below, .T transposes the result so that each column of the data becomes a row.</p>


In [9]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,271116.0,68248.954396,39022.286345,1.0,34643.0,68205.0,102097.25,135571.0
Age,261642.0,25.556898,6.393561,10.0,21.0,24.0,28.0,97.0
Height,210945.0,175.33897,10.518462,127.0,168.0,175.0,183.0,226.0
Weight,208241.0,70.702393,14.34802,25.0,60.0,70.0,79.0,214.0
Year,271116.0,1978.37848,29.877632,1896.0,1960.0,1988.0,2002.0,2016.0
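
<p style="font-family: Times New Roman"> Text columns are left out of the default summary. As a sketch on a small invented DataFrame <strong>teams</strong>, calling .describe() on a text column reports count, unique, top and freq, and <strong>include='all'</strong> summarises numeric and non-numeric columns together.</p>

```python
import pandas as pd

teams = pd.DataFrame({
    'Team': ['China', 'China', 'Denmark', 'Netherlands'],
    'Age': [24, 23, 24, 21],
})

# On a text column, describe() reports count, unique, top and freq
team_summary = teams['Team'].describe()
print(team_summary)

# include='all' summarises numeric and non-numeric columns together
print(teams.describe(include='all'))
```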


<p style="font-family: Times New Roman"> Understanding data using <strong>.info()</strong><br> The .info() method is a quick way to look at the data types, missing values, and data size of a DataFrame. The optional show_counts argument controls whether the number of non-missing values in each column is displayed; with the default settings the counts appear for a DataFrame of this size, as the output below shows.</p>


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
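
<p style="font-family: Times New Roman"> A minimal sketch of passing show_counts explicitly, on a hypothetical DataFrame <strong>df_info</strong> with a few missing values. The report is written to a text buffer here only so that it can be inspected as a string; .count() returns the same non-null counts as a Series.</p>

```python
import io
import pandas as pd

df_info = pd.DataFrame({'Age': [24.0, None, 34.0], 'Medal': [None, None, 'Gold']})

# Write the info() report into a buffer instead of printing it directly
buffer = io.StringIO()
df_info.info(show_counts=True, buf=buffer)
print(buffer.getvalue())

# count() gives the non-null count of every column
print(df_info.count())
```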


<p style="font-family: Times New Roman"> Missing values can be found with the <strong>.isnull()</strong> method, which marks every missing cell as True. Chaining .sum() then counts the missing values in each column.</p>

In [11]:
data.isnull().sum()

ID             0
Name           0
Sex            0
Age         9474
Height     60171
Weight     62875
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
dtype: int64
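
<p style="font-family: Times New Roman"> The same idea can give the share of missing values, since the mean of a True/False column is the fraction of True values. The sketch below uses an invented DataFrame <strong>df_missing</strong>; <strong>.notna()</strong> is the opposite of .isnull() and can be used to keep only the rows that have a value.</p>

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'Age': [24.0, np.nan, 34.0, 21.0],
    'Medal': [np.nan, np.nan, 'Gold', np.nan],
})

missing_count = df_missing.isnull().sum()    # number of missing values per column
missing_share = df_missing.isnull().mean()   # fraction of missing values per column

# Keep only the rows where 'Medal' is present
with_medal = df_missing[df_missing['Medal'].notna()]
print(missing_share)
```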

<p style="font-family: Times New Roman">Getting the column names and the row index</p>


In [12]:
data.columns
# for list -> data.columns.to_list()

Index(['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
       'Year', 'Season', 'City', 'Sport', 'Event', 'Medal'],
      dtype='object')

In [13]:
data.index

RangeIndex(start=0, stop=271116, step=1)

## <span style="font-family: Times New Roman;">Slicing and Extracting Data</span> <a id='4'></a> 

<p style="font-family: Times New Roman"> The pandas package offers several ways to subset, filter, and isolate data in your DataFrames. Here, we'll see the most common ways. <a href = "https://pandas.pydata.org/docs/user_guide/indexing.html#"> Click to see all methods </a></p>


<p style="font-family: Times New Roman"> Isolating one column using <strong>[ ]</strong><br> You can isolate a single column using square brackets [ ] with a column name inside. The output is a pandas Series object. A pandas Series is a one-dimensional array containing data of any type, including integer, float, string, boolean, Python objects, etc. A DataFrame is composed of many Series that act as columns.</p>

In [14]:
data['City'].head()

0    Barcelona
1       London
2    Antwerpen
3        Paris
4      Calgary
Name: City, dtype: object

<p style="font-family: Times New Roman"> Isolating two or more columns using <strong> [[ ]]</strong><br> You can also provide a list of column names inside the square brackets to fetch more than one column. Here, square brackets are used in two different ways. We use the outer square brackets to indicate a subset of a DataFrame, and the inner square brackets to create a list.</p>

In [15]:
data[['Name', 'Age']].head()

Unnamed: 0,Name,Age
0,A Dijiang,24.0
1,A Lamusi,23.0
2,Gunnar Nielsen Aaby,24.0
3,Edgar Lindenau Aabye,34.0
4,Christine Jacoba Aaftink,21.0
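
<p style="font-family: Times New Roman"> The difference between single and double brackets shows in the type of the result. A small sketch on a made-up DataFrame <strong>people</strong>: a single column name returns a Series, while a list of names returns a DataFrame, even if the list has only one name.</p>

```python
import pandas as pd

people = pd.DataFrame({'Name': ['John', 'Emma'], 'Age': [28, 24]})

one_column = people['Age']            # single brackets -> Series
one_column_frame = people[['Age']]    # list with one name -> DataFrame
two_columns = people[['Name', 'Age']]

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
```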


<p style="font-family: Times New Roman"> Isolating one row using <strong>[ ]</strong><br> A single row can be fetched by passing in a boolean series with one True value. In the example below, the second row with index = 1 is returned. Here, .index returns the row labels of the DataFrame, and the comparison turns that into a Boolean one-dimensional array.</p>

In [16]:
data[data.index==1]

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,


In [17]:
data[data.City == 'London'].head(2)

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
41,17,Paavo Johannes Aaltonen,M,28.0,175.0,64.0,Finland,FIN,1948 Summer,1948,Summer,London,Gymnastics,Gymnastics Men's Individual All-Around,Bronze
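
<p style="font-family: Times New Roman"> To make the mechanism visible, the sketch below builds the boolean mask as a separate step before using it. The DataFrame <strong>events</strong> is a shortened, illustrative stand-in for the real data.</p>

```python
import pandas as pd

events = pd.DataFrame({
    'Name': ['A Dijiang', 'A Lamusi', 'Paavo Johannes Aaltonen'],
    'City': ['Barcelona', 'London', 'London'],
})

# The comparison returns one True/False value per row
mask = events['City'] == 'London'
print(mask.to_list())

# Passing the mask inside [ ] keeps only the rows marked True
london = events[mask]
print(london)
```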


<p style="font-family: Times New Roman"> Isolating two or more rows using <strong>[ ]</strong><br> Similarly, two or more rows can be returned using the <strong>.isin()</strong> method instead of a == operator.</p>

In [18]:
data[data.index.isin(range(2,4))]

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold


In [19]:
data[data.Season.isin(['Summer','Winter'])].head(2)

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,


In [20]:
data[data.Season.isin(['Summer','Winter'])].tail(2)

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
271114,135571,Tomasz Ireneusz ya,M,30.0,185.0,96.0,Poland,POL,1998 Winter,1998,Winter,Nagano,Bobsleigh,Bobsleigh Men's Four,
271115,135571,Tomasz Ireneusz ya,M,34.0,185.0,96.0,Poland,POL,2002 Winter,2002,Winter,Salt Lake City,Bobsleigh,Bobsleigh Men's Four,
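
<p style="font-family: Times New Roman"> An .isin() mask can also be inverted with the <strong>~</strong> operator to keep the rows that are not in the list. This is a minimal sketch on an invented DataFrame <strong>events</strong>; the list <strong>wanted</strong> is an assumption made for the example.</p>

```python
import pandas as pd

events = pd.DataFrame({
    'Name': ['A Dijiang', 'A Lamusi', 'Gunnar Nielsen Aaby', 'Christine Jacoba Aaftink'],
    'Sport': ['Basketball', 'Judo', 'Football', 'Speed Skating'],
})

wanted = ['Judo', 'Football']

# Rows whose Sport is in the list
selected = events[events['Sport'].isin(wanted)]

# ~ negates the mask: rows whose Sport is NOT in the list
others = events[~events['Sport'].isin(wanted)]
print(others)
```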


<p style="font-family: Times New Roman"> Using <strong>.loc[]</strong> and <strong>.iloc[]</strong> to fetch rows <br> You can fetch specific rows by labels or conditions using <strong>.loc[]</strong> and <strong>.iloc[]</strong> ("location" and "integer location"). .loc[] uses a label to point to a row, column or cell, whereas .iloc[] uses the numeric position.<br>
Syntax: df.loc[row_label, column_label]<br>
Syntax: df.iloc[row_position, column_position]
 </p>

In [21]:
# Fetch the first 5 rows (labels 0 to 4; .loc includes the end label) with three chosen columns
data.loc[:4, ['Name', 'Age', 'Height']]

Unnamed: 0,Name,Age,Height
0,A Dijiang,24.0,180.0
1,A Lamusi,23.0,170.0
2,Gunnar Nielsen Aaby,24.0,
3,Edgar Lindenau Aabye,34.0,
4,Christine Jacoba Aaftink,21.0,185.0


In [22]:
data.iloc[:5, [1, 3, 4]]

Unnamed: 0,Name,Age,Height
0,A Dijiang,24.0,180.0
1,A Lamusi,23.0,170.0
2,Gunnar Nielsen Aaby,24.0,
3,Edgar Lindenau Aabye,34.0,
4,Christine Jacoba Aaftink,21.0,185.0


<p style="font-family: Times New Roman"> Note: <strong>.iloc</strong> is primarily integer position based (from 0 to length-1 of the axis) </p>
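
<p style="font-family: Times New Roman"> In the two cells above, data.loc[:4] and data.iloc[:5] return the same five rows because a label slice includes its end label while a position slice excludes its end position. The sketch below makes the difference easier to see with a hypothetical DataFrame <strong>scores</strong> whose index labels are letters rather than numbers.</p>

```python
import pandas as pd

scores = pd.DataFrame(
    {'score': [10, 20, 30, 40]},
    index=['a', 'b', 'c', 'd'],
)

by_label = scores.loc['a':'c']    # label slice: the end label 'c' is included
by_position = scores.iloc[0:3]    # position slice: position 3 is excluded
first_two = scores.iloc[0:2]      # positions 0 and 1 only

print(by_label.index.to_list())
print(first_two.index.to_list())
```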

In [23]:
# Select the rows where 'Weight' is 100 and 'Height' is 185 using .loc, keeping three columns.
data.loc[(data.Weight == 100) & (data.Height == 185), ['Name','Weight','Height']].head()

Unnamed: 0,Name,Weight,Height
4857,"Michael Charles ""Mike"" Aljoe",100.0,185.0
10320,Plamen Asparukhov,100.0,185.0
10924,Karl Magnus Augustson,100.0,185.0
12032,Gerd Bachmann,100.0,185.0
12033,Gerd Bachmann,100.0,185.0
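
<p style="font-family: Times New Roman"> Besides selecting, .loc[] can also assign values to the rows that match a condition. The sketch below is an illustration on an invented DataFrame <strong>athletes</strong>; it works on a copy so that the original values stay unchanged.</p>

```python
import pandas as pd

athletes = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Height': [185.0, 170.0, 185.0],
    'Weight': [90.0, 60.0, 95.0],
})

# Work on a copy so the original DataFrame stays unchanged
updated = athletes.copy()

# Set 'Weight' to 100 in every row where 'Height' is 185
updated.loc[updated['Height'] == 185, 'Weight'] = 100
print(updated)
```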


In [24]:
# Select all rows where the 'Age' is greater than 30 using .iloc.
data.iloc[data.index[data.Age > 30].to_list(), [1, 2, 3]].sample(5)

Unnamed: 0,Name,Sex,Age
56498,Sloan Doak,M,34.0
85401,Aim Gruet-Masson,M,31.0
21803,Poul Bille-Holst,M,42.0
77634,Georg Hermann Gelbke,M,49.0
62993,Alexander Eliraz,M,37.0


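<p style="font-family: Times New Roman"> Note: .iloc does not accept a boolean Series as a mask. For this reason we first collected the index labels of the matching rows; this works here because the labels of the default RangeIndex are the same as the integer positions.</p> 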
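
<p style="font-family: Times New Roman"> When the index labels differ from the positions, one possible alternative is to convert the mask to a plain boolean array, which .iloc does accept. The sketch below assumes a made-up DataFrame <strong>athletes</strong> with the labels 10, 20, 30, 40 and compares the result with .loc, which takes the boolean Series directly.</p>

```python
import pandas as pd

athletes = pd.DataFrame(
    {'Name': ['A', 'B', 'C', 'D'], 'Age': [34, 22, 41, 29]},
    index=[10, 20, 30, 40],   # labels deliberately differ from positions
)

mask = athletes['Age'] > 30

# .loc accepts the boolean Series directly
by_loc = athletes.loc[mask, ['Name', 'Age']]

# .iloc needs a plain boolean array, so convert the Series first
by_iloc = athletes.iloc[mask.to_numpy(), [0, 1]]

print(by_loc.equals(by_iloc))
```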

<p style="font-family: Times New Roman"> DataFrame objects have a <strong> query() </strong> method that allows selection using an expression.
 </p>

In [25]:
# Find the athletes whose weight is greater than their height
data.query('Weight > Height')

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
23155,12177,"Ricardo Blas, Jr.",M,21.0,183.0,214.0,Guam,GUM,2008 Summer,2008,Summer,Beijing,Judo,Judo Men's Heavyweight,
23156,12177,"Ricardo Blas, Jr.",M,25.0,183.0,214.0,Guam,GUM,2012 Summer,2012,Summer,London,Judo,Judo Men's Heavyweight,


In [26]:
# Find the athletes older than 30 who won a Gold medal in Tennis at the 2012 Games.
age_condition = 30
year_condition = 2012
medal_condition = 'Gold'
sport_condition = 'Tennis'

data.query('Age > @age_condition \
            & Year == @year_condition \
            & Medal == @medal_condition \
            & Sport == @sport_condition ')

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
30824,15880,"Robert Charles ""Bob"" Bryan",M,34.0,193.0,92.0,United States-1,USA,2012 Summer,2012,Summer,London,Tennis,Tennis Men's Doubles,Gold
30833,15882,"Michael Carl ""Mike"" Bryan",M,34.0,191.0,87.0,United States-1,USA,2012 Summer,2012,Summer,London,Tennis,Tennis Men's Doubles,Gold
159975,80274,"Maksim Nikolayevich ""Max"" Mirnyi",M,35.0,196.0,90.0,Belarus,BLR,2012 Summer,2012,Summer,London,Tennis,Tennis Mixed Doubles,Gold
261165,130696,Venus Ebony Starr Williams,F,32.0,185.0,75.0,United States-1,USA,2012 Summer,2012,Summer,London,Tennis,Tennis Women's Doubles,Gold
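
<p style="font-family: Times New Roman"> A query() expression selects the same rows as the equivalent boolean mask. The sketch below compares the two on a small invented DataFrame <strong>athletes</strong>; the variable <strong>age_limit</strong> is an assumption for the example and is referenced inside the expression with @, as in the cell above.</p>

```python
import pandas as pd

athletes = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Age': [34, 25, 32],
    'Sport': ['Tennis', 'Tennis', 'Judo'],
})

age_limit = 30  # local variable, referenced in query() with @

# Selection written as a query expression
with_query = athletes.query('Age > @age_limit and Sport == "Tennis"')

# The same selection written as a boolean mask
with_mask = athletes[(athletes['Age'] > age_limit) & (athletes['Sport'] == 'Tennis')]

print(with_query.equals(with_mask))
```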
