# Introduction to Pandas

<img src="https://raw.githubusercontent.com/fralfaro/DS-Cheat-Sheets/main/docs/examples/pandas/pandas.png" alt="pandas logo" width="300">

[Pandas](https://pandas.pydata.org/) is built on NumPy and provides easy-to-use
data structures and data analysis tools for the Python
programming language.

## Install and import Pandas

```
$ pip install pandas
```

In [48]:
# Conventional import alias for pandas
import pandas as pd

## Pandas Data Structures

### Series

<img src="https://raw.githubusercontent.com/fralfaro/DS-Cheat-Sheets/main/docs/examples/pandas/serie.png" alt="pandas Series diagram" >

A **one-dimensional** labeled array capable of holding any data type.

In [49]:
# Import pandas
import pandas as pd

# Create a pandas Series representing monthly sales data
sales_data = pd.Series(
    [1500, 1200, 1800, 1600, 1300, 1700, 1400, 1500, 1600, 1800],
    index=['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct']
)

# Print the pandas Series
print("Monthly Sales Data:")
print(sales_data)

Monthly Sales Data:
jan    1500
feb    1200
mar    1800
apr    1600
may    1300
jun    1700
jul    1400
aug    1500
sep    1600
oct    1800
dtype: int64
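
As a minimal sketch of the "any data type" point, the snippet below builds two small Series. The names `prices` and `mixed` and their values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Illustrative data (not from the original notebook)
prices = pd.Series([1.5, 2.0, 3.25], index=["a", "b", "c"])
mixed = pd.Series([1, "two", 3.0])

# A homogeneous Series gets a specific NumPy dtype
print(prices.dtype)   # float64

# Mixed Python objects fall back to the generic object dtype
print(mixed.dtype)    # object

# Labels, not positions, identify each value
print(prices["b"])    # 2.0
```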


### DataFrame

<img src="https://raw.githubusercontent.com/fralfaro/DS-Cheat-Sheets/main/docs/examples/pandas/df.png" alt="pandas DataFrame diagram" >

A **two-dimensional** labeled data structure with columns of potentially different types.

In [50]:
# Create a pandas DataFrame from a dictionary of columns
data = {
    'country': ['United States', 'China', 'Japan', 'Germany', 'United Kingdom', 'India', 'France', 'Italy', 'Brazil', 'Canada'],
    'capital': ['Washington, D.C.', 'Beijing', 'Tokyo', 'Berlin', 'London', 'New Delhi', 'Paris', 'Rome', 'Brasília', 'Ottawa'],
    'population': [331449281, 1393000000, 126476461, 83783945, 67886011, 1303171035, 67186600, 60277900, 211050000, 37742154],
    'GDP': [21.44, 14.34, 5.07, 4.01, 2.99, 3.11, 2.78, 2.15, 1.77, 1.73]
}
df = pd.DataFrame(
    data,
    columns=['country', 'capital', 'population', 'GDP']
)

# Print the DataFrame 'df'
print("\ndf:")
df


df:


Unnamed: 0,country,capital,population,GDP
0,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73
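
To make "columns of potentially different types" concrete, here is a small sketch. `small_df` is a hypothetical two-row frame that reuses values from the table above.

```python
import pandas as pd

# Hypothetical two-row frame reusing values from the table above
small_df = pd.DataFrame({
    "country": ["Japan", "Germany"],
    "population": [126476461, 83783945],
    "GDP": [5.07, 4.01],
})

# Each column keeps its own dtype
print(small_df.dtypes)

# Selecting a single column returns a Series
population = small_df["population"]
print(type(population).__name__)  # Series
```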


## Read csv files

In [51]:
covid_df = pd.read_csv('./data/covid19-og.csv')
covid_df

Unnamed: 0,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018
0,14/04/2020,14,4,2020,58,3,Afghanistan,AF,AFG,37172386.0
1,13/04/2020,13,4,2020,52,0,Afghanistan,AF,AFG,37172386.0
2,12/04/2020,12,4,2020,34,3,Afghanistan,AF,AFG,37172386.0
3,11/04/2020,11,4,2020,37,0,Afghanistan,AF,AFG,37172386.0
4,10/04/2020,10,4,2020,61,1,Afghanistan,AF,AFG,37172386.0
...,...,...,...,...,...,...,...,...,...,...
10737,25/03/2020,25,3,2020,0,0,Zimbabwe,ZW,ZWE,14439018.0
10738,24/03/2020,24,3,2020,0,1,Zimbabwe,ZW,ZWE,14439018.0
10739,23/03/2020,23,3,2020,0,0,Zimbabwe,ZW,ZWE,14439018.0
10740,22/03/2020,22,3,2020,1,0,Zimbabwe,ZW,ZWE,14439018.0
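
The `dateRep` column above is stored day-first (`14/04/2020`). The sketch below shows one way to turn such strings into real dates. It inlines two rows from the output so that it runs without the CSV file; the variable names are illustrative.

```python
import io
import pandas as pd

# Two rows copied from the output above, inlined so the example is self-contained
csv_text = """dateRep,day,month,year,cases,deaths
14/04/2020,14,4,2020,58,3
13/04/2020,13,4,2020,52,0
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Without parse options, read_csv keeps these dates as text;
# convert them with an explicit day/month/year format
sample["dateRep"] = pd.to_datetime(sample["dateRep"], format="%d/%m/%Y")

print(sample["dateRep"].dt.month.tolist())  # [4, 4]
```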


In [52]:
# Load the first 10 rows of the AirBnb NYC 2019 dataset for quick inspection
nyc_df = pd.read_csv('./data/AirBnb_NYC_2019.csv', index_col=0, nrows=10)
nyc_df

id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,74,2019-06-22,0.59,1,129
5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60,45,49,2017-10-05,0.4,1,0
5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Hell's Kitchen,40.76489,-73.98493,Private room,79,2,430,2019-06-24,3.47,1,220
5203,Cozy Clean Guest Room - Family Apt,7490,MaryEllen,Manhattan,Upper West Side,40.80178,-73.96723,Private room,79,2,118,2017-07-21,0.99,1,0
5238,Cute & Cozy Lower East Side 1 bdrm,7549,Ben,Manhattan,Chinatown,40.71344,-73.99037,Entire home/apt,150,1,160,2019-06-09,1.33,4,188


In [53]:
# Load the dataset with multi-level indices and headers
warehouse_df = pd.read_csv('./data/multi_index_warehouses.csv', index_col=[0,1,2], header=[0,1])
warehouse_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,2010,2010,2011,2011,2012,2012,2013,2013,2014,2014,2015,2015,2016,2016,2017,2017,2018,2018,2019,2019
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec
NY Warehouses,Buffalo,Mobile,26,12,10,23,18,10,10,26,16,18,20,21,20,26,29,20,11,21,25,16
NY Warehouses,Buffalo,TV,19,22,27,19,27,12,24,28,27,28,10,16,25,26,20,25,10,27,20,20
NY Warehouses,Buffalo,AC,16,24,20,23,29,15,10,20,16,16,25,21,19,12,21,19,11,28,19,19
NY Warehouses,Ithaca,Mobile,10,28,27,22,18,14,22,12,14,16,21,13,27,17,15,19,21,15,29,14
NY Warehouses,Ithaca,TV,13,13,11,15,12,15,27,17,10,25,20,27,17,16,13,23,15,26,15,28
NY Warehouses,Ithaca,AC,18,17,19,28,28,14,21,18,25,17,18,27,23,24,22,12,11,12,19,22
NY Warehouses,Beacon,Mobile,10,17,24,27,25,11,22,26,10,13,25,13,29,23,28,22,22,26,20,18
NY Warehouses,Beacon,TV,12,22,26,26,15,28,15,26,12,18,17,10,21,19,24,23,26,19,19,13
NY Warehouses,Beacon,AC,13,11,23,13,20,26,10,12,14,10,28,26,21,26,29,12,29,18,20,24
CA Warehouses,San Francisco,Mobile,15,25,18,16,13,13,19,15,21,23,11,26,27,16,16,18,29,22,25,20
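
The following sketch illustrates how `index_col` and `header` lists map onto a multi-level layout. It does not use the original file: it builds a small frame (the level names `city`, `product`, `year`, and `half` are assumptions made for this example), writes it to an in-memory CSV, and reads it back.

```python
import io
import pandas as pd

# Small illustrative frame shaped like the warehouse data
columns = pd.MultiIndex.from_product(
    [["2010", "2011"], ["Jan-Jun", "Jul-Dec"]], names=["year", "half"]
)
index = pd.MultiIndex.from_tuples(
    [("Buffalo", "Mobile"), ("Buffalo", "TV"), ("Ithaca", "Mobile")],
    names=["city", "product"],
)
stock = pd.DataFrame(
    [[26, 12, 10, 23], [19, 22, 27, 19], [10, 28, 27, 22]],
    index=index,
    columns=columns,
)

# Write to an in-memory CSV, then read it back with
# two index columns and two header rows
buffer = io.StringIO(stock.to_csv())
restored = pd.read_csv(buffer, index_col=[0, 1], header=[0, 1])

# One cell: a tuple key for the row and a tuple key for the column
print(restored.loc[("Buffalo", "TV"), ("2011", "Jul-Dec")])

# All products stored in Buffalo (partial selection on the first index level)
print(restored.loc["Buffalo"])
```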


## Read Excel files

In [54]:
pd.read_excel('./data/covid19.xlsx')

Unnamed: 0,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018
0,2020-04-14,14,4,2020,58,3,Afghanistan,AF,AFG,37172386.0
1,2020-04-13,13,4,2020,52,0,Afghanistan,AF,AFG,37172386.0
2,2020-04-12,12,4,2020,34,3,Afghanistan,AF,AFG,37172386.0
3,2020-04-11,11,4,2020,37,0,Afghanistan,AF,AFG,37172386.0
4,2020-04-10,10,4,2020,61,1,Afghanistan,AF,AFG,37172386.0
...,...,...,...,...,...,...,...,...,...,...
10737,2020-03-25,25,3,2020,0,0,Zimbabwe,ZW,ZWE,14439018.0
10738,2020-03-24,24,3,2020,0,1,Zimbabwe,ZW,ZWE,14439018.0
10739,2020-03-23,23,3,2020,0,0,Zimbabwe,ZW,ZWE,14439018.0
10740,2020-03-22,22,3,2020,1,0,Zimbabwe,ZW,ZWE,14439018.0


## Read JSON files

In [55]:
pd.read_json('./data/admits.json').sort_index()

Unnamed: 0,gpa,gre,toefl,workex,research,admit
0,6.80,326.0,106,0,3+,0
1,8.24,305.0,114,3.5,0,1
2,6.56,312.0,116,1,2,1
3,7.62,326.0,107,3,3,1
4,6.01,314.0,87,2,2,1
...,...,...,...,...,...,...
95,4.96,303.0,92,0.5,2,0
96,6.13,323.0,119,5+,1,1
97,8.65,333.0,119,2.5,2,1
98,5.79,303.0,91,5,none,1


In [56]:
admits = pd.read_csv('./data/admits.csv')
admits

Unnamed: 0,gpa,gre,toefl,workex,research,admit
0,6.80,326.0,106,0,3+,0
1,8.24,305.0,114,3.5,0,1
2,6.56,312.0,116,1,2,1
3,7.62,326.0,107,3,3,1
4,6.01,314.0,87,2,2,1
...,...,...,...,...,...,...
95,4.96,303.0,92,0.5,2,0
96,6.13,323.0,119,5+,1,1
97,8.65,333.0,119,2.5,2,1
98,5.79,303.0,91,5,none,1
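
For experimenting without the `admits.json` file, a JSON string can be read through `io.StringIO`. The records below are hypothetical and only mimic a few columns of the admits data.

```python
import io
import pandas as pd

# Hypothetical records that mimic a few columns of the admits data
json_text = '[{"gpa": 6.80, "gre": 326, "admit": 0}, {"gpa": 8.24, "gre": 305, "admit": 1}]'

# Wrap the string in StringIO; recent pandas versions deprecate
# passing literal JSON text directly to read_json
records = pd.read_json(io.StringIO(json_text))
print(records)

# to_json with orient="records" writes the same list-of-objects layout back out
print(records.to_json(orient="records"))
```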


## Renaming indices/columns

In [57]:
df.rename({0: "zero"}, axis=0)

Unnamed: 0,country,capital,population,GDP
zero,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73


In [58]:
df.rename({"population": "Population_number"}, axis=1)

Unnamed: 0,country,capital,Population_number,GDP
0,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73
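
Two details of `rename` are easy to miss: it returns a new object, and labels that do not exist are silently skipped. A minimal sketch with a hypothetical two-row `frame`:

```python
import pandas as pd

# Hypothetical two-row frame for illustration
frame = pd.DataFrame({"country": ["Japan", "Italy"], "population": [126476461, 60277900]})

# rename returns a new DataFrame; the original keeps its labels
renamed = frame.rename(columns={"population": "Population_number"}, index={0: "zero"})

print(list(renamed.columns))  # ['country', 'Population_number']
print(list(renamed.index))    # ['zero', 1]
print(list(frame.columns))    # ['country', 'population']

# Labels that are not present are ignored unless errors="raise" is passed
unchanged = frame.rename(index={"US": "zero"})
print(list(unchanged.index))  # [0, 1]
```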


## Getting Elements


In [59]:
# Get one element from a Series
sales_data['jan']

# Attribute access also works when the label is a valid Python identifier
sales_data.jan

1500

In [60]:
# Get a subset of a DataFrame: every row from position 1 onward
df[1:]

Unnamed: 0,country,capital,population,GDP
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73
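
Besides square brackets, pandas offers `.loc` (label-based) and `.iloc` (position-based) selection. The sketch below uses shortened data; the row labels `CN`, `JP`, and `DE` are assumptions added for illustration.

```python
import pandas as pd

sales = pd.Series([1500, 1200, 1800], index=["jan", "feb", "mar"])
frame = pd.DataFrame(
    {"country": ["China", "Japan", "Germany"], "GDP": [14.34, 5.07, 4.01]},
    index=["CN", "JP", "DE"],  # hypothetical labels, for illustration only
)

# .loc selects by label; label slices include the end point
print(sales.loc["jan":"feb"].tolist())   # [1500, 1200]

# .iloc selects by integer position; the end point is excluded
print(sales.iloc[0:2].tolist())          # [1500, 1200]

# Row label plus column label returns a single value
print(frame.loc["JP", "GDP"])            # 5.07

# The same cell by position
print(frame.iloc[1, 1])                  # 5.07
```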


## Dropping


In [61]:
# Drop values from rows (axis=0)
sales_data.drop(['may', 'mar'])

jan    1500
feb    1200
apr    1600
jun    1700
jul    1400
aug    1500
sep    1600
oct    1800
dtype: int64

In [62]:
# Drop values from columns (axis=1)
df.drop('country', axis=1)

Unnamed: 0,capital,population,GDP
0,"Washington, D.C.",331449281,21.44
1,Beijing,1393000000,14.34
2,Tokyo,126476461,5.07
3,Berlin,83783945,4.01
4,London,67886011,2.99
5,New Delhi,1303171035,3.11
6,Paris,67186600,2.78
7,Rome,60277900,2.15
8,Brasília,211050000,1.77
9,Ottawa,37742154,1.73
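
Note that `drop` does not modify the object it is called on. A short sketch with a hypothetical `frame`, also showing the `errors="ignore"` option for labels that may be absent:

```python
import pandas as pd

# Hypothetical two-row frame for illustration
frame = pd.DataFrame(
    {"country": ["France", "Italy"], "capital": ["Paris", "Rome"], "GDP": [2.78, 2.15]}
)

# drop returns a new object; frame itself still has three columns
trimmed = frame.drop(columns=["capital"])
print(list(trimmed.columns))  # ['country', 'GDP']
print(list(frame.columns))    # ['country', 'capital', 'GDP']

# errors="ignore" skips labels that do not exist instead of raising KeyError
safe = frame.drop(index=[5], errors="ignore")
print(len(safe))  # 2
```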


## Applying Functions


In [63]:
# Define a function
f = lambda x: x*2

In [64]:
# Apply function to DataFrame
df.apply(f)

Unnamed: 0,country,capital,population,GDP
0,United StatesUnited States,"Washington, D.C.Washington, D.C.",662898562,42.88
1,ChinaChina,BeijingBeijing,2786000000,28.68
2,JapanJapan,TokyoTokyo,252952922,10.14
3,GermanyGermany,BerlinBerlin,167567890,8.02
4,United KingdomUnited Kingdom,LondonLondon,135772022,5.98
5,IndiaIndia,New DelhiNew Delhi,2606342070,6.22
6,FranceFrance,ParisParis,134373200,5.56
7,ItalyItaly,RomeRome,120555800,4.3
8,BrazilBrazil,BrasíliaBrasília,422100000,3.54
9,CanadaCanada,OttawaOttawa,75484308,3.46


In [65]:
# Apply function element-wise
# (DataFrame.applymap is deprecated since pandas 2.1; DataFrame.map replaces it)
df.map(f)


Unnamed: 0,country,capital,population,GDP
0,United StatesUnited States,"Washington, D.C.Washington, D.C.",662898562,42.88
1,ChinaChina,BeijingBeijing,2786000000,28.68
2,JapanJapan,TokyoTokyo,252952922,10.14
3,GermanyGermany,BerlinBerlin,167567890,8.02
4,United KingdomUnited Kingdom,LondonLondon,135772022,5.98
5,IndiaIndia,New DelhiNew Delhi,2606342070,6.22
6,FranceFrance,ParisParis,134373200,5.56
7,ItalyItaly,RomeRome,120555800,4.3
8,BrazilBrazil,BrasíliaBrasília,422100000,3.54
9,CanadaCanada,OttawaOttawa,75484308,3.46


In [66]:

# Apply a function to a single column (a Series), value by value
df["country"] = df["country"].apply(lambda x: x.upper())

df

Unnamed: 0,country,capital,population,GDP
0,UNITED STATES,"Washington, D.C.",331449281,21.44
1,CHINA,Beijing,1393000000,14.34
2,JAPAN,Tokyo,126476461,5.07
3,GERMANY,Berlin,83783945,4.01
4,UNITED KINGDOM,London,67886011,2.99
5,INDIA,New Delhi,1303171035,3.11
6,FRANCE,Paris,67186600,2.78
7,ITALY,Rome,60277900,2.15
8,BRAZIL,Brasília,211050000,1.77
9,CANADA,Ottawa,37742154,1.73
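
Doubling gives the same result whether the function receives a whole column or a single value, which hides the difference between the two approaches. The sketch below, on a hypothetical `numbers` frame, uses functions where the distinction matters.

```python
import pandas as pd

# Hypothetical numeric frame for illustration
numbers = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 40]})

# DataFrame.apply passes each column (a Series) to the function
column_ranges = numbers.apply(lambda col: col.max() - col.min())
print(column_ranges.tolist())  # [2, 30]

# axis=1 passes each row instead
row_sums = numbers.apply(lambda row: row["a"] + row["b"], axis=1)
print(row_sums.tolist())       # [11, 22, 43]

# Series.map works value by value
doubled = numbers["a"].map(lambda x: x * 2)
print(doubled.tolist())        # [2, 4, 6]
```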


## TQDM with pandas

In [67]:
import time
def placeholder_function(x):
    time.sleep(0.5)
    return x.upper()

In [68]:
from tqdm import tqdm
# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()

df["country"] = df["country"].progress_apply(placeholder_function)

df

100%|██████████| 10/10 [00:05<00:00,  1.99it/s]


Unnamed: 0,country,capital,population,GDP
0,UNITED STATES,"Washington, D.C.",331449281,21.44
1,CHINA,Beijing,1393000000,14.34
2,JAPAN,Tokyo,126476461,5.07
3,GERMANY,Berlin,83783945,4.01
4,UNITED KINGDOM,London,67886011,2.99
5,INDIA,New Delhi,1303171035,3.11
6,FRANCE,Paris,67186600,2.78
7,ITALY,Rome,60277900,2.15
8,BRAZIL,Brasília,211050000,1.77
9,CANADA,Ottawa,37742154,1.73


In [69]:
# tqdm.auto picks a notebook widget or a console bar depending on the environment
from tqdm.auto import tqdm
# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()

df["country"] = df["country"].progress_apply(placeholder_function)

df

100%|██████████| 10/10 [00:05<00:00,  1.99it/s]


Unnamed: 0,country,capital,population,GDP
0,UNITED STATES,"Washington, D.C.",331449281,21.44
1,CHINA,Beijing,1393000000,14.34
2,JAPAN,Tokyo,126476461,5.07
3,GERMANY,Berlin,83783945,4.01
4,UNITED KINGDOM,London,67886011,2.99
5,INDIA,New Delhi,1303171035,3.11
6,FRANCE,Paris,67186600,2.78
7,ITALY,Rome,60277900,2.15
8,BRAZIL,Brasília,211050000,1.77
9,CANADA,Ottawa,37742154,1.73


## Basic Information


In [70]:
# Get the shape (rows, columns)
df.shape

(10, 4)

In [71]:
# Describe index
df.index

RangeIndex(start=0, stop=10, step=1)

In [72]:
# Describe DataFrame columns
df.columns

Index(['country', 'capital', 'population', 'GDP'], dtype='object')

In [73]:
# Info on DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     10 non-null     object 
 1   capital     10 non-null     object 
 2   population  10 non-null     int64  
 3   GDP         10 non-null     float64
dtypes: float64(1), int64(1), object(2)
memory usage: 448.0+ bytes


In [74]:
# Number of non-NA values
df.count()

country       10
capital       10
population    10
GDP           10
dtype: int64

In [75]:
# Count how often each distinct value occurs
df["country"].value_counts()

country
UNITED STATES     1
CHINA             1
JAPAN             1
GERMANY           1
UNITED KINGDOM    1
INDIA             1
FRANCE            1
ITALY             1
BRAZIL            1
CANADA            1
Name: count, dtype: int64
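
Every country appears once above, so the counts are all 1. The sketch below uses a hypothetical Series with repeated values to show what `value_counts` and `count` report in a less uniform case.

```python
import pandas as pd

# Hypothetical data with repeated values
room_types = pd.Series(["Private room", "Entire home/apt", "Private room", "Private room"])

counts = room_types.value_counts()
print(counts["Private room"])     # 3

# normalize=True returns proportions instead of counts
shares = room_types.value_counts(normalize=True)
print(shares["Entire home/apt"])  # 0.25

# count() skips missing values, while len() does not
with_gap = pd.Series([1.0, None, 3.0])
print(with_gap.count(), len(with_gap))  # 2 3
```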

## Summary

In [76]:
# Sum of values
sum_values = df['population'].sum()

# Cumulative sum of values
cumulative_sum_values = df['population'].cumsum()

# Minimum/maximum values
min_values = df['population'].min()
max_values = df['population'].max()

# Index of minimum/maximum values
idx_min_values = df['population'].idxmin()
idx_max_values = df['population'].idxmax()

# Summary statistics
summary_stats = df['population'].describe()

# Mean of values
mean_values = df['population'].mean()

# Median of values
median_values = df['population'].median()

print("Example DataFrame:")
print(df)

print("\nSum of values:")
print(sum_values)

print("\nCumulative sum of values:")
print(cumulative_sum_values)

print("\nMinimum values:")
print(min_values)

print("\nMaximum values:")
print(max_values)

print("\nIndex of minimum values:")
print(idx_min_values)

print("\nIndex of maximum values:")
print(idx_max_values)

print("\nSummary statistics:")
print(summary_stats)

print("\nMean values:")
print(mean_values)

print("\nMedian values:")
print(median_values)

Example DataFrame:
          country           capital  population    GDP
0   UNITED STATES  Washington, D.C.   331449281  21.44
1           CHINA           Beijing  1393000000  14.34
2           JAPAN             Tokyo   126476461   5.07
3         GERMANY            Berlin    83783945   4.01
4  UNITED KINGDOM            London    67886011   2.99
5           INDIA         New Delhi  1303171035   3.11
6          FRANCE             Paris    67186600   2.78
7           ITALY              Rome    60277900   2.15
8          BRAZIL          Brasília   211050000   1.77
9          CANADA            Ottawa    37742154   1.73

Sum of values:
3682023387

Cumulative sum of values:
0     331449281
1    1724449281
2    1850925742
3    1934709687
4    2002595698
5    3305766733
6    3372953333
7    3433231233
8    3644281233
9    3682023387
Name: population, dtype: int64

Minimum values:
37742154

Maximum values:
1393000000

Index of minimum values:
9

Index of maximum values:
1

Summary statistics:


In [77]:
# Quicker way to get summary statistics for every numeric column
df.describe()
# df.describe().T  # transposed view: one row per column

Unnamed: 0,population,GDP
count,10.0,10.0
mean,368202300.0,5.939
std,524359000.0,6.595166
min,37742150.0,1.73
25%,67361450.0,2.3075
50%,105130200.0,3.05
75%,301349500.0,4.805
max,1393000000.0,21.44
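
The percentage rows of `describe()` are quantiles, and the `50%` row is the median. A small sketch that checks this on the population values from the table above (the variable names are illustrative):

```python
import pandas as pd

# Population values copied from the DataFrame above
population = pd.Series(
    [331449281, 1393000000, 126476461, 83783945, 67886011,
     1303171035, 67186600, 60277900, 211050000, 37742154]
)
stats = population.describe()

# The 50% row matches the median; the 25% row matches quantile(0.25)
print(stats["50%"], population.median())
print(stats["25%"], population.quantile(0.25))

# Sum and position of the extremes, as computed earlier in this section
print(population.sum())                          # 3682023387
print(population.idxmin(), population.idxmax())  # 9 1
```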


## Introduction to data profiling

In [78]:
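from ydata_profiling import ProfileReport

# Build an exploratory profiling report for the DataFrame
profile = ProfileReport(df, title="Profiling Report", explorative=True)

# Uncomment one of the following to render or export the report
# profile.to_widgets()
# profile.to_notebook_iframe()
# profile.to_file("your_report.html")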