# Pandas Tutorial

In [2]:
import pandas as pd

## Creating DataFrame with Demo Data

We can set the column names by passing them as a list to the `columns` argument. Similarly, we can set the row labels with the `index` argument.

In [13]:
# df = pd.DataFrame([[1,2,3], [4,5,6],[7,8,9]])
df = pd.DataFrame([[1,2,3], [4,5,6],[7,8,9]], columns = ['A', 'B', 'C']) # Adding the column names
# df = pd.DataFrame([[1,2,3], [4,5,6],[7,8,9]], columns = ['A', 'B', 'C'], index=['x', 'y','z']) # Adding the index
df

Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9
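As a small sketch of the `index` argument described above, the same demo values can be given row labels. The labels `'x'`, `'y'`, `'z'` and the name `labeled` are only illustrative.

```python
import pandas as pd

# Same demo values, now with illustrative row labels 'x', 'y', 'z'.
labeled = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=['A', 'B', 'C'],
    index=['x', 'y', 'z'],
)

print(labeled)
print(labeled.loc['y', 'B'])  # value at row label 'y', column 'B'
```

With a custom index, rows are looked up by these labels rather than by 0, 1, 2.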


## DataFrame Basics

### df.head()
- It shows the first 5 rows of the DataFrame by default.
- We can specify the number of rows we want to see, e.g. `df.head(3)` or `df.head(2)`.

In [4]:
df.head(2)

Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6


### df.tail()
- It shows the last 5 rows of the DataFrame by default.
- We can specify the number of rows we want to see, e.g. `df.tail(3)` or `df.tail(2)`.

In [5]:
df.tail(2)

Unnamed: 0,A,B,C
1,4,5,6
2,7,8,9
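The demo DataFrame has only 3 rows, so the default of 5 never becomes visible. A minimal sketch with a hypothetical 8-row frame (`numbers` is not part of the tutorial data) may make the defaults clearer:

```python
import pandas as pd

# Hypothetical 8-row frame so the default of 5 rows is visible.
numbers = pd.DataFrame({'n': range(8)})

first_default = numbers.head()   # first 5 rows
last_default = numbers.tail()    # last 5 rows
first_two = numbers.head(2)      # first 2 rows

print(len(first_default), len(last_default), len(first_two))
```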


### df.columns
It shows the column names (headers). Remember that this is an attribute, not a method, so it is used without parentheses.

In [6]:
df.columns

Index(['A', 'B', 'C'], dtype='object')

### df.index
It gives us the row labels (the index) of the DataFrame. When no index is specified, this is a `RangeIndex`.

In [9]:
print(df.index)

# To see the list of indexes
print(df.index.to_list())

RangeIndex(start=0, stop=3, step=1)
[0, 1, 2]


### df.info()
> Gives us an overview of the DataFrame.
1. Index: the number of records and the range of the index.
2. Data columns: the number of columns.
3. Per-column non-null counts and data types.
4. A summary of the data types in this DataFrame.
5. The memory usage of the DataFrame.

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
 2   C       3 non-null      int64
dtypes: int64(3)
memory usage: 204.0 bytes


### df.describe()
> Gives us summary statistics for each numeric column of our DataFrame.
1. **count**: Number of non-null records in each column.
2. **mean**: The mean of each column.
3. **std**: The standard deviation of each column.
4. **min**: The minimum value in the column.
5. **25%** (1st quartile): The value below which 25% of the data falls.
6. **50%** (Median): The middle value that splits the data into two halves.
7. **75%** (3rd quartile): The value below which 75% of the data falls.
8. **max**: The maximum value in the column.

In [12]:
df.describe()

Unnamed: 0,A,B,C
count,3.0,3.0,3.0
mean,4.0,5.0,6.0
std,3.0,3.0,3.0
min,1.0,2.0,3.0
25%,2.5,3.5,4.5
50%,4.0,5.0,6.0
75%,5.5,6.5,7.5
max,7.0,8.0,9.0
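To connect the rows of the `describe()` output to individual calculations, here is a rough sketch that recomputes the statistics for column `A` by hand. The dictionary name `summary` is illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['A', 'B', 'C'])

col = df['A']
summary = {
    'count': col.count(),         # number of non-null values
    'mean': col.mean(),
    'std': col.std(),             # sample standard deviation (ddof=1)
    'min': col.min(),
    '25%': col.quantile(0.25),    # linear interpolation between data points
    '50%': col.median(),
    '75%': col.quantile(0.75),
    'max': col.max(),
}
print(summary)
```

Note that `std` is the sample standard deviation, and that the quartiles are interpolated, which is why 2.5 and 5.5 appear even though they are not values in the column.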


### df.nunique()
> To see the number of unique values in each column

In [14]:
df.nunique()

A    3
B    3
C    3
dtype: int64

### df['col'].unique()
> This will give us the unique values in a column, returned as a NumPy array

In [15]:
df['A'].unique()

array([1, 4, 7])
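Every value in the demo DataFrame is distinct, so the difference between the two calls is easy to miss. A small sketch with a hypothetical frame `dup` that contains repeats:

```python
import pandas as pd

# Hypothetical frame with repeated values.
dup = pd.DataFrame({'A': [1, 1, 4], 'B': ['x', 'y', 'y']})

counts = dup.nunique()       # Series: number of distinct values per column
values = dup['A'].unique()   # NumPy array of distinct values, in order of appearance

print(counts)
print(values)
```

`nunique()` answers "how many distinct values?", while `unique()` answers "which ones?".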

### df.shape
> Tells us the number of rows and columns in the dataframe.

In [16]:
df.shape

(3, 3)

### df.size
> Returns the total number of data points (cells) in the DataFrame, which is the number of rows multiplied by the number of columns.

In [17]:
df.size

9
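The relationship between `shape` and `size` can be checked directly. This sketch uses a hypothetical non-square frame (`wide`) so that rows and columns are not confused:

```python
import pandas as pd

# Hypothetical frame with 2 rows and 3 columns.
wide = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])

rows, cols = wide.shape   # shape is a (rows, columns) tuple
print(rows, cols, wide.size)
print(wide.size == rows * cols)
```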

### df.sample()
> Returns one random row by default. We can specify how many rows we want with `df.sample(3)`.

In [26]:
df.sample(3)

Unnamed: 0,A,B,C
2,7,8,9
1,4,5,6
0,1,2,3
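Because `sample()` is random, repeated runs usually give different rows. If a repeatable result is needed, the `random_state` parameter can be set; the seed value `0` below is arbitrary.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['A', 'B', 'C'])

one_row = df.sample()                  # one random row
fixed = df.sample(2, random_state=0)   # two rows, reproducible for a given seed
again = df.sample(2, random_state=0)   # same seed, same rows

print(fixed.equals(again))
```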


## Creating DataFrame from online Data

In [19]:
coffee = pd.read_csv('https://raw.githubusercontent.com/KeithGalli/complete-pandas-tutorial/refs/heads/master/warmup-data/coffee.csv')

coffee.head()

Unnamed: 0,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30
3,Tuesday,Latte,20
4,Wednesday,Espresso,35


## Accessing the Data from a DataFrame


### df.loc[#rows, #columns]
> We can access the values of specific row(s) and column(s). With `loc` we use row labels and column names, whereas with `iloc` we use integer positions.

In [29]:
# Accessing the specific row by index
coffee.loc[3]

Day            Tuesday
Coffee Type      Latte
Units Sold          20
Name: 3, dtype: object

In [30]:
# Accessing multiple rows
coffee.loc[[1,6,8]]

Unnamed: 0,Day,Coffee Type,Units Sold
1,Monday,Latte,15
6,Thursday,Espresso,40
8,Friday,Espresso,45


In [31]:
# Perform slicing
coffee.loc[5:8]

Unnamed: 0,Day,Coffee Type,Units Sold
5,Wednesday,Latte,25
6,Thursday,Espresso,40
7,Thursday,Latte,30
8,Friday,Espresso,45


In [34]:
# Specific column
coffee.loc[5:8, ['Day']]

Unnamed: 0,Day
5,Wednesday
6,Thursday
7,Thursday
8,Friday


In [35]:
# Specific columns
coffee.loc[5:8, ['Day', 'Units Sold']]

Unnamed: 0,Day,Units Sold
5,Wednesday,25
6,Thursday,40
7,Thursday,30
8,Friday,45


In [38]:
# For all the rows and specific columns
coffee.loc[:, ['Day', 'Units Sold']]

Unnamed: 0,Day,Units Sold
0,Monday,25
1,Monday,15
2,Tuesday,30
3,Tuesday,20
4,Wednesday,35
5,Wednesday,25
6,Thursday,40
7,Thursday,30
8,Friday,45
9,Friday,35


### df.iloc[#rows, #column_index]
> This does the same thing, but instead of the column name we access columns by their integer position. <br>
Remember that when slicing, `loc` includes the upper limit, whereas `iloc` does not.

In [39]:
# Accessing the specific row by index
coffee.iloc[3]

Day            Tuesday
Coffee Type      Latte
Units Sold          20
Name: 3, dtype: object

In [42]:
# Accessing multiple rows
coffee.iloc[[1,6,8]]

Unnamed: 0,Day,Coffee Type,Units Sold
1,Monday,Latte,15
6,Thursday,Espresso,40
8,Friday,Espresso,45


In [44]:
# Perform slicing
coffee.iloc[5:8]

# Note that the upper limit is not included

Unnamed: 0,Day,Coffee Type,Units Sold
5,Wednesday,Latte,25
6,Thursday,Espresso,40
7,Thursday,Latte,30


In [47]:
# Specific column
coffee.iloc[5:8, [0]] # Instead of the name of the column, we are passing the index

Unnamed: 0,Day
5,Wednesday
6,Thursday
7,Thursday


In [48]:
# Specific columns
coffee.iloc[5:8, [0,2]]

Unnamed: 0,Day,Units Sold
5,Wednesday,25
6,Thursday,40
7,Thursday,30


In [49]:
# For all the rows and specific columns
coffee.iloc[:, [0, 2]]

Unnamed: 0,Day,Units Sold
0,Monday,25
1,Monday,15
2,Tuesday,30
3,Tuesday,20
4,Wednesday,35
5,Wednesday,25
6,Thursday,40
7,Thursday,30
8,Friday,45
9,Friday,35
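In the coffee data the row labels happen to equal the row positions, which hides the difference between `loc` and `iloc`. The sketch below uses a hypothetical frame (`sales`) whose labels are 10, 20, 30, 40 to separate the two ideas.

```python
import pandas as pd

# Hypothetical frame whose row labels differ from row positions.
sales = pd.DataFrame(
    {'Day': ['Mon', 'Tue', 'Wed', 'Thu'], 'Units Sold': [25, 30, 35, 40]},
    index=[10, 20, 30, 40],
)

by_label = sales.loc[20:30, ['Day']]   # labels 20 through 30, end label included
by_position = sales.iloc[1:3, [0]]     # positions 1 and 2, end position excluded

print(by_label)
print(by_position)
```

Both selections return the rows labeled 20 and 30: `loc` reaches them by label with an inclusive end, `iloc` by position with an exclusive end.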


### Grab specific column
> We can grab all the information of a specific column

In [50]:
coffee.Day

0        Monday
1        Monday
2       Tuesday
3       Tuesday
4     Wednesday
5     Wednesday
6      Thursday
7      Thursday
8        Friday
9        Friday
10     Saturday
11     Saturday
12       Sunday
13       Sunday
Name: Day, dtype: object

In [52]:
# Another way, and my favorite, is
print(coffee[['Day']]) # Double square brackets return a DataFrame.
print(coffee['Day']) # Single square brackets return a Series.

          Day
0      Monday
1      Monday
2     Tuesday
3     Tuesday
4   Wednesday
5   Wednesday
6    Thursday
7    Thursday
8      Friday
9      Friday
10   Saturday
11   Saturday
12     Sunday
13     Sunday
0        Monday
1        Monday
2       Tuesday
3       Tuesday
4     Wednesday
5     Wednesday
6      Thursday
7      Thursday
8        Friday
9        Friday
10     Saturday
11     Saturday
12       Sunday
13       Sunday
Name: Day, dtype: object
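The return types of the two bracket styles can be confirmed with `type()`. This is a minimal sketch on a hypothetical two-row stand-in for the coffee data (`coffee_demo`), so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for the coffee data.
coffee_demo = pd.DataFrame({'Day': ['Monday', 'Tuesday'], 'Units Sold': [25, 30]})

as_series = coffee_demo['Day']     # single brackets
as_frame = coffee_demo[['Day']]    # double brackets: a list of column names

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

One practical reason to prefer brackets over attribute access such as `coffee.Day` is that a column name containing a space, like `Units Sold`, cannot be written as an attribute.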


## Sorting Values in a Dataframe
> We can sort the values of a Dataframe

### df.sort_values('column_name')
> This sorts the DataFrame by the given column in ascending order

In [54]:
coffee.sort_values('Units Sold')

Unnamed: 0,Day,Coffee Type,Units Sold
1,Monday,Latte,15
3,Tuesday,Latte,20
0,Monday,Espresso,25
5,Wednesday,Latte,25
2,Tuesday,Espresso,30
7,Thursday,Latte,30
4,Wednesday,Espresso,35
9,Friday,Latte,35
13,Sunday,Latte,35
11,Saturday,Latte,35


### ascending = False
> Sort in descending order

In [56]:
coffee.sort_values('Units Sold', ascending=False)

Unnamed: 0,Day,Coffee Type,Units Sold
10,Saturday,Espresso,45
8,Friday,Espresso,45
12,Sunday,Espresso,45
6,Thursday,Espresso,40
4,Wednesday,Espresso,35
11,Saturday,Latte,35
13,Sunday,Latte,35
9,Friday,Latte,35
2,Tuesday,Espresso,30
7,Thursday,Latte,30


### Sort Multiple Columns
1. We need to pass the columns as a list.
2. The `ascending` argument is then also a list, with one entry per column. We can use 0s and 1s (0 = False, 1 = True) or booleans.

In [59]:
coffee.sort_values(['Day', 'Units Sold', 'Coffee Type'], ascending=[1,0,1])

Unnamed: 0,Day,Coffee Type,Units Sold
8,Friday,Espresso,45
9,Friday,Latte,35
0,Monday,Espresso,25
1,Monday,Latte,15
10,Saturday,Espresso,45
11,Saturday,Latte,35
12,Sunday,Espresso,45
13,Sunday,Latte,35
6,Thursday,Espresso,40
7,Thursday,Latte,30
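The two rules above can be sketched on a small hypothetical frame (`demo`), here with booleans in the `ascending` list: `Day` is sorted A to Z, and within each day `Units Sold` is sorted from high to low.

```python
import pandas as pd

# Hypothetical four-row frame.
demo = pd.DataFrame({
    'Day': ['Monday', 'Friday', 'Monday', 'Friday'],
    'Units Sold': [15, 35, 25, 45],
})

# One ascending flag per sort column.
ordered = demo.sort_values(['Day', 'Units Sold'], ascending=[True, False])

print(ordered)
print(demo)   # the original frame is unchanged
```

`sort_values` returns a new DataFrame, so the result has to be assigned to a name if it should be kept.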


## Filtering Data

We will use bios.csv to learn filtering.

In [60]:
bios = pd.read_csv('https://raw.githubusercontent.com/KeithGalli/complete-pandas-tutorial/refs/heads/master/data/bios.csv')

bios.head(3)

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
0,1,Jean-François Blanchy,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02
1,2,Arnaud Boetsch,1969-04-01,Meulan,Yvelines,FRA,France,183.0,76.0,
2,3,Jean Borotra,1898-08-13,Biarritz,Pyrénées-Atlantiques,FRA,France,183.0,76.0,1994-07-17


We will filter this CSV on the `height_cm` column and look at the athletes who are taller than 215 cm.

In [None]:
# Conditional filtering with a boolean mask
bios.loc[bios['height_cm']>215]

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
5089,5108,Viktor Pankrashkin,1957-06-19,Moskva (Moscow),Moskva,RUS,Soviet Union,220.0,112.0,1993-07-24
5583,5606,Paulinho Villas Boas,1963-01-26,São Paulo,São Paulo,BRA,Brazil,217.0,106.0,
5673,5696,Gunther Behnke,1963-01-19,Leverkusen,Nordrhein-Westfalen,GER,Germany,221.0,114.0,
5716,5739,Uwe Blab,1962-03-26,München (Munich),Bayern,GER,Germany West Germany,218.0,110.0,
5781,5804,Tommy Burleson,1952-02-24,Crossnore,North Carolina,USA,United States,223.0,102.0,
5796,5819,Andy Campbell,1956-07-21,Melbourne,Victoria,AUS,Australia,218.0,93.0,
6223,6250,Lars Hansen,1954-09-27,København (Copenhagen),Hovedstaden,DEN,Canada,216.0,105.0,
6270,6298,Hu Zhangbao,1963-04-05,,,,People's Republic of China,216.0,135.0,
6409,6440,Sergey Kovalenko,1947-08-11,,,,Soviet Union,216.0,111.0,2004-11-18
6420,6451,Jānis Krūmiņš,1930-01-30,Cēsis,Cēsu novads,LAT,Soviet Union,218.0,141.0,1994-11-20


In [64]:
# Conditional filtering, keeping only selected columns
bios.loc[bios['height_cm']>215, ['name', 'height_cm']]

Unnamed: 0,name,height_cm
5089,Viktor Pankrashkin,220.0
5583,Paulinho Villas Boas,217.0
5673,Gunther Behnke,221.0
5716,Uwe Blab,218.0
5781,Tommy Burleson,223.0
5796,Andy Campbell,218.0
6223,Lars Hansen,216.0
6270,Hu Zhangbao,216.0
6409,Sergey Kovalenko,216.0
6420,Jānis Krūmiņš,218.0


In [67]:
# Using multiple conditions:
# We have to wrap each condition in parentheses.

bios.loc[(bios['height_cm'] > 215) & (bios['born_country'] == 'USA'), ['name', 'height_cm']]

Unnamed: 0,name,height_cm
5781,Tommy Burleson,223.0
6722,Shaquille O'Neal,216.0
6937,David Robinson,216.0
123850,Tyson Chandler,216.0
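It may help to see that each condition is just a boolean Series, and that `&` (and) and `|` (or) combine them element by element. The frame `athletes` below is a made-up stand-in for bios.csv, not real data.

```python
import pandas as pd

# Made-up stand-in for bios.csv; None becomes NaN in the float column.
athletes = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'height_cm': [220.0, 180.0, 216.0, None],
    'born_country': ['USA', 'USA', 'GER', 'USA'],
})

tall = athletes['height_cm'] > 215        # a missing height compares as False
usa = athletes['born_country'] == 'USA'

both = athletes.loc[tall & usa, ['name', 'height_cm']]   # tall AND born in the USA
either = athletes.loc[tall | usa, 'name']                # tall OR born in the USA

print(both)
print(either)
```

The parentheses in the one-line form are needed because `&` and `|` bind more tightly than comparison operators such as `>` and `==`.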


## Filtering Data with String operations

These methods work on the string columns of our DataFrame.

### str.contains()
> Check whether a substring is in a string

In [68]:
bios.loc[bios['name'].str.contains('keith', case=False)] # case=False makes the match case-insensitive

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
1897,1907,Keith Hanlon,1966-09-01,,,,Ireland,,,
3505,3517,Keith Wallace,1961-03-29,Preston,England,GBR,Great Britain,165.0,51.0,1999-12-31
6228,6255,Keith Hartley,1940-10-15,Vancouver,British Columbia,CAN,Canada,200.0,85.0,
8898,8946,Keith Mwila,1966-01-01,,,,Zambia,,,1993-01-09
12053,12118,Keith Hervey,1898-11-03,Fulham,England,GBR,Great Britain,,,1973-02-22
...,...,...,...,...,...,...,...,...,...,...
109900,111105,Keith Cumberpatch,1927-08-25,Christchurch,Canterbury,NZL,New Zealand,,,2013-11-15
115973,117348,Keith Sanderson,1975-02-02,Plymouth,Massachusetts,USA,United States,183.0,95.0,
117676,119195,Duncan Keith,1983-07-16,Winnipeg,Manitoba,CAN,Canada,185.0,88.0,
122121,124176,Keith Ferguson,1979-09-07,Sale,Victoria,AUS,Australia,176.0,78.0,


In [69]:
# What if we want names containing keith or patrick? The pattern is a regular expression, so | means "or".
bios.loc[bios['name'].str.contains('keith|patrick', case=False)]

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
6,7,Patrick Chila,1969-11-27,Ris-Orangis,Essonne,FRA,France,180.0,73.0,
119,120,Patrick Wheatley,1899-01-20,Vryheid,KwaZulu-Natal,RSA,Great Britain,,,1967-11-05
319,320,Patrick De Koning,1961-04-23,Dendermonde,Oost-Vlaanderen,BEL,Belgium,178.0,92.0,
1897,1907,Keith Hanlon,1966-09-01,,,,Ireland,,,
2115,2125,Patrick Jopp,1962-01-08,,,,Switzerland,176.0,67.0,
...,...,...,...,...,...,...,...,...,...,...
143975,147633,Patrick Chinyemba,2001-01-03,,,,Zambia,,,
144172,147850,Patrick Jakob,1996-10-17,Sankt Johann in Tirol,Tirol,AUT,Austria,,,
144547,148239,Patrick Galbraith,1986-03-11,Haderslev,Syddanmark,DEN,Denmark,,,
144565,148257,Patrick Russell,1993-01-04,Gentofte,Hovedstaden,DEN,Denmark,186.0,93.0,
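A minimal sketch of the same pattern on a hypothetical Series (`names`) that also contains a missing value. The `na=False` argument is an addition not used in the cells above; it treats missing names as non-matches so the mask is purely boolean.

```python
import pandas as pd

# Hypothetical names, including one missing value.
names = pd.Series(['Keith Hanlon', 'Patrick Chila', 'Duncan Keith', 'Jean Borotra', None])

# Regex alternation, case-insensitive; missing values count as "no match".
mask = names.str.contains('keith|patrick', case=False, na=False)
matches = names[mask]

print(matches)
```

Note that `contains` matches anywhere in the string, so 'Duncan Keith' is included as well.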


### isin() and str.startswith()
> `isin()` checks whether each value is in a given list.<br>
> `str.startswith()` checks whether the string begins with the given substring.

In [73]:
bios.loc[(bios['born_country'].isin(['USA', 'FRA', 'GBR'])) & (bios['name'].str.contains('Keith'))]

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
3505,3517,Keith Wallace,1961-03-29,Preston,England,GBR,Great Britain,165.0,51.0,1999-12-31
12053,12118,Keith Hervey,1898-11-03,Fulham,England,GBR,Great Britain,,,1973-02-22
14577,14674,Keith Harrison,1933-03-28,Birmingham,England,GBR,Great Britain,,,
16166,16281,Keith Reynolds,1963-12-25,Solihull,England,GBR,Great Britain,173.0,68.0,
18734,18862,Keith Sinclair,1945-06-26,Sunderland,England,GBR,Great Britain,190.0,79.0,
29897,30123,Keith Langley,1961-06-03,Aldershot,England,GBR,Great Britain,173.0,70.0,
34011,34275,Keith Remfry,1947-11-17,Ealing,England,GBR,Great Britain,193.0,114.0,2015-09-16
46885,47234,Keith Collin,1937-01-18,Marylebone,England,GBR,Great Britain,168.0,63.0,1991-03-06
50929,51288,Keith Carter,1924-08-30,Akron,Ohio,USA,United States,,,2013-05-03
51185,51544,Keith Russell,1948-01-15,Mesa,Arizona,USA,United States,188.0,73.0,
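The cell above combines `isin()` with `str.contains()`. As a sketch of how `str.startswith()` differs, here is the same idea on a made-up frame (`people`); the names and countries are illustrative only.

```python
import pandas as pd

# Made-up stand-in for bios.csv.
people = pd.DataFrame({
    'name': ['Keith Carter', 'Duncan Keith', 'Keith Wallace', 'Keith Hartley'],
    'born_country': ['USA', 'CAN', 'GBR', 'CAN'],
})

in_countries = people['born_country'].isin(['USA', 'FRA', 'GBR'])
starts_with_keith = people['name'].str.startswith('Keith')   # prefix match only

result = people.loc[in_countries & starts_with_keith]
print(result)
```

Unlike `str.contains('Keith')`, the prefix check does not match 'Duncan Keith'.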


## Adding and Removing a column

### Adding a column

In [74]:
coffee.head()

Unnamed: 0,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30
3,Tuesday,Latte,20
4,Wednesday,Espresso,35


In [77]:
# Let's say we want to add a Price column to this DataFrame

coffee['Price'] = 4.99
coffee.head()

Unnamed: 0,Day,Coffee Type,Units Sold,Price
0,Monday,Espresso,25,4.99
1,Monday,Latte,15,4.99
2,Tuesday,Espresso,30,4.99
3,Tuesday,Latte,20,4.99
4,Wednesday,Espresso,35,4.99
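Assigning a single value, as above, fills every row with it. A column can also be computed from other columns, and removed again. The sketch below uses a hypothetical two-row stand-in (`coffee_demo`); the `Revenue` column and the use of `drop(columns=...)` for removal are assumptions about where this section is heading, not something shown in the notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in for the coffee data.
coffee_demo = pd.DataFrame({
    'Coffee Type': ['Espresso', 'Latte'],
    'Units Sold': [25, 15],
})

coffee_demo['Price'] = 4.99   # a scalar is repeated for every row
coffee_demo['Revenue'] = coffee_demo['Units Sold'] * coffee_demo['Price']   # computed row by row

# drop returns a new DataFrame; coffee_demo itself keeps the Price column.
without_price = coffee_demo.drop(columns=['Price'])

print(coffee_demo)
print(without_price)
```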


