 # Introduction to Pandas

 <img src="https://raw.githubusercontent.com/fralfaro/DS-Cheat-Sheets/main/docs/examples/pandas/pandas.png" alt="pandas logo" width="300">

 [Pandas](https://pandas.pydata.org/) is built on NumPy and provides easy-to-use
 data structures and data analysis tools for the Python
 programming language.

 ## Install and import Pandas

 ```
 $ pip install pandas
 ```

In [1]:
# Import Pandas convention
import pandas as pd


 ## Pandas Data Structures

 ### Series

 <img src="https://raw.githubusercontent.com/fralfaro/DS-Cheat-Sheets/main/docs/examples/pandas/serie.png" alt="pandas Series diagram">

 A **one-dimensional** labeled array capable of holding any data type.

In [2]:
# Import pandas
import pandas as pd

# Create a pandas Series representing monthly sales data
sales_data = pd.Series(
    [1500, 1200, 1800, 1600, 1300, 1700, 1400, 1500, 1600, 1800],
    index=['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct']
)

# Print the pandas Series
print("Monthly Sales Data:")
print(sales_data)
print(type(sales_data))


Monthly Sales Data:
jan    1500
feb    1200
mar    1800
apr    1600
may    1300
jun    1700
jul    1400
aug    1500
sep    1600
oct    1800
dtype: int64
<class 'pandas.core.series.Series'>
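
Two properties from the definition above are easy to check in a short sketch: the labels drive arithmetic between Series, and the values do not have to share a type. The refund figures and the `mixed` Series are invented for illustration.

```python
import pandas as pd

# Building a Series from a dict: the keys become the index labels
q1_sales = pd.Series({"jan": 1500, "feb": 1200, "mar": 1800})

# Hypothetical refunds, listed in a different order and missing "feb"
q1_refunds = pd.Series({"mar": 100, "jan": 50})

# Arithmetic aligns on labels, not on position;
# a label missing on one side produces NaN in the result
net = q1_sales - q1_refunds
print(net)

# A Series can hold mixed Python objects; the dtype then becomes object
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```

Because alignment happens on labels, the order in which the months were entered does not affect the result.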


 ### DataFrame

 <img src="https://raw.githubusercontent.com/fralfaro/DS-Cheat-Sheets/main/docs/examples/pandas/df.png" alt="pandas DataFrame diagram">

 A **two-dimensional** labeled data structure with columns of potentially different types.

In [3]:
# Create a pandas DataFrame with more instances
data = {
    'country': ['United States', 'China', 'Japan', 'Germany', 'United Kingdom', 'India', 'France', 'Italy', 'Brazil', 'Canada'],
    'capital': ['Washington, D.C.', 'Beijing', 'Tokyo', 'Berlin', 'London', 'New Delhi', 'Paris', 'Rome', 'Brasília', 'Ottawa'],
    'population': [331449281, 1393000000, 126476461, 83783945, 67886011, 1303171035, 67186600, 60277900, 211050000, 37742154],
    'GDP': [21.44, 14.34, 5.07, 4.01, 2.99, 3.11, 2.78, 2.15, 1.77, 1.73]
}
df = pd.DataFrame(
    data,
    columns=['country', 'capital', 'population', 'GDP']
)

# Print the DataFrame 'df'
print("\ndf:")
df



df:


Unnamed: 0,country,capital,population,GDP
0,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73


In [4]:
import pandas as pd

# Original data structure
data_list_of_dicts = [
    {"country": "United States","capital": "Washington, D.C.","population": 331449281,"GDP": 21.44,},
    {"country": "China", "capital": "Beijing", "population": 1393000000, "GDP": 14.34},
    {"country": "Japan", "capital": "Tokyo", "population": 126476461, "GDP": 5.07},
    {"country": "Germany", "capital": "Berlin", "population": 83783945, "GDP": 4.01},
    {"country": "United Kingdom","capital": "London","population": 67886011,"GDP": 2.99},
    {"country": "India", "capital": "New Delhi", "population": 1303171035, "GDP": 3.11},
    {"country": "France", "capital": "Paris", "population": 67186600, "GDP": 2.78},
    {"country": "Italy", "capital": "Rome", "population": 60277900, "GDP": 2.15},
    {"country": "Brazil", "capital": "Brasília", "population": 211050000, "GDP": 1.77},
    {"country": "Canada", "capital": "Ottawa", "population": 37742154, "GDP": 1.73},
]

# Creating DataFrame from list of dictionaries
df = pd.DataFrame(
    data_list_of_dicts, columns=["country", "capital", "population", "GDP"]
)
df.sample()


Unnamed: 0,country,capital,population,GDP
6,France,Paris,67186600,2.78


 # How to read Data

 ## Read CSV files

In [5]:
import pandas as pd


In [6]:
covid_df = pd.read_csv('./pandas_data/covid19-og.csv')
covid_df


Unnamed: 0,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018
0,14/04/2020,14,4,2020,58,3,Afghanistan,AF,AFG,37172386.0
1,13/04/2020,13,4,2020,52,0,Afghanistan,AF,AFG,37172386.0
2,12/04/2020,12,4,2020,34,3,Afghanistan,AF,AFG,37172386.0
3,11/04/2020,11,4,2020,37,0,Afghanistan,AF,AFG,37172386.0
4,10/04/2020,10,4,2020,61,1,Afghanistan,AF,AFG,37172386.0
...,...,...,...,...,...,...,...,...,...,...
10737,25/03/2020,25,3,2020,0,0,Zimbabwe,ZW,ZWE,14439018.0
10738,24/03/2020,24,3,2020,0,1,Zimbabwe,ZW,ZWE,14439018.0
10739,23/03/2020,23,3,2020,0,0,Zimbabwe,ZW,ZWE,14439018.0
10740,22/03/2020,22,3,2020,1,0,Zimbabwe,ZW,ZWE,14439018.0
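
`read_csv` accepts options that narrow what is loaded and how it is typed. The sketch below inlines three rows copied from the output above so it runs without the CSV file; the choice of `usecols`, `nrows`, `parse_dates`, and `dayfirst` is just one plausible setup for this data, where `dateRep` is written day first.

```python
import io

import pandas as pd

# A few rows copied from the output above, inlined so the sketch is self-contained
csv_text = """dateRep,day,month,year,cases,deaths,countriesAndTerritories
14/04/2020,14,4,2020,58,3,Afghanistan
13/04/2020,13,4,2020,52,0,Afghanistan
12/04/2020,12,4,2020,34,3,Afghanistan
"""

sample_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["dateRep", "cases", "deaths", "countriesAndTerritories"],  # keep only these columns
    nrows=2,                  # read only the first two data rows
    parse_dates=["dateRep"],  # convert this column to datetime
    dayfirst=True,            # dates are written as DD/MM/YYYY
)
print(sample_df.dtypes)
print(sample_df)
```

With a real file, the path string simply replaces the `io.StringIO(...)` argument.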


In [7]:
# Load the AirBnb NYC 2019 dataset, using the first column (id) as the index
nyc_df = pd.read_csv("./pandas_data/AirBnb_NYC_2019.csv", index_col=0)
nyc_df


Unnamed: 0_level_0,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365
3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2


In [8]:
# Load the dataset with multi-level indices and headers
warehouse_df = pd.read_csv(
    "./pandas_data/multi_index_warehouses.csv", index_col=[0, 1, 2], header=[0, 1]
)
warehouse_df


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,2010,2010,2011,2011,2012,2012,2013,2013,2014,2014,2015,2015,2016,2016,2017,2017,2018,2018,2019,2019
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec,Jan-Jun,Jul-Dec
NY Warehouses,Buffalo,Mobile,26,12,10,23,18,10,10,26,16,18,20,21,20,26,29,20,11,21,25,16
NY Warehouses,Buffalo,TV,19,22,27,19,27,12,24,28,27,28,10,16,25,26,20,25,10,27,20,20
NY Warehouses,Buffalo,AC,16,24,20,23,29,15,10,20,16,16,25,21,19,12,21,19,11,28,19,19
NY Warehouses,Ithaca,Mobile,10,28,27,22,18,14,22,12,14,16,21,13,27,17,15,19,21,15,29,14
NY Warehouses,Ithaca,TV,13,13,11,15,12,15,27,17,10,25,20,27,17,16,13,23,15,26,15,28
NY Warehouses,Ithaca,AC,18,17,19,28,28,14,21,18,25,17,18,27,23,24,22,12,11,12,19,22
NY Warehouses,Beacon,Mobile,10,17,24,27,25,11,22,26,10,13,25,13,29,23,28,22,22,26,20,18
NY Warehouses,Beacon,TV,12,22,26,26,15,28,15,26,12,18,17,10,21,19,24,23,26,19,19,13
NY Warehouses,Beacon,AC,13,11,23,13,20,26,10,12,14,10,28,26,21,26,29,12,29,18,20,24
CA Warehouses,San Francisco,Mobile,15,25,18,16,13,13,19,15,21,23,11,26,27,16,16,18,29,22,25,20
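
Once a frame has multi-level rows and columns, selection works with tuples. The sketch below builds a small stand-in for `warehouse_df` by hand (values copied from the rows above, reordered so the index is sorted) rather than reading the CSV; `mini_df` is a hypothetical name.

```python
import pandas as pd

# Row labels: (warehouse group, city, product), listed in sorted order
index = pd.MultiIndex.from_tuples(
    [
        ("CA Warehouses", "San Francisco", "Mobile"),
        ("NY Warehouses", "Buffalo", "Mobile"),
        ("NY Warehouses", "Buffalo", "TV"),
        ("NY Warehouses", "Ithaca", "Mobile"),
    ]
)
# Column labels: (year, half-year)
columns = pd.MultiIndex.from_product([["2010", "2011"], ["Jan-Jun", "Jul-Dec"]])

mini_df = pd.DataFrame(
    [[15, 25, 18, 16], [26, 12, 10, 23], [19, 22, 27, 19], [10, 28, 27, 22]],
    index=index,
    columns=columns,
)

# One column: pass the full (year, half) tuple
first_half_2010 = mini_df[("2010", "Jan-Jun")]

# All rows for one city: pass a partial row tuple to .loc
buffalo = mini_df.loc[("NY Warehouses", "Buffalo"), :]

# All rows for one product, whatever the warehouse: cross-section on the third level
mobile = mini_df.xs("Mobile", level=2)

print(buffalo)
print(mobile)
```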


 ## Read Excel files

In [9]:
pd.read_excel("./pandas_data/covid19.xlsx", sheet_name="gre")


Unnamed: 0,name,gre
0,jack,325
1,anna,329
2,,300
3,jasmine,338


 ## Read JSON files

In [10]:
pd.read_json("./pandas_data/admits.json")


Unnamed: 0,gpa,gre,toefl,workex,research,admit
0,6.80,326.0,106,0,3+,0
1,8.24,305.0,114,3.5,0,1
2,6.56,312.0,116,1,2,1
3,7.62,326.0,107,3,3,1
4,6.01,314.0,87,2,2,1
...,...,...,...,...,...,...
95,4.96,303.0,92,0.5,2,0
96,6.13,323.0,119,5+,1,1
97,8.65,333.0,119,2.5,2,1
98,5.79,303.0,91,5,none,1


 # Editing the DataFrame

 ## Renaming indices/columns

In [11]:
import pandas as pd

# Original data structure
data_list_of_dicts = [
    {"country": "United States","capital": "Washington, D.C.","population": 331449281,"GDP": 21.44},
    {"country": "China", "capital": "Beijing", "population": 1393000000, "GDP": 14.34},
    {"country": "Japan", "capital": "Tokyo", "population": 126476461, "GDP": 5.07},
    {"country": "Germany", "capital": "Berlin", "population": 83783945, "GDP": 4.01},
    {"country": "United Kingdom","capital": "London","population": 67886011,"GDP": 2.99},
    {"country": "India", "capital": "New Delhi", "population": 1303171035, "GDP": 3.11},
    {"country": "France", "capital": "Paris", "population": 67186600, "GDP": 2.78},
    {"country": "Italy", "capital": "Rome", "population": 60277900, "GDP": 2.15},
    {"country": "Brazil", "capital": "Brasília", "population": 211050000, "GDP": 1.77},
    {"country": "Canada", "capital": "Ottawa", "population": 37742154, "GDP": 1.73},
]

# Creating DataFrame from list of dictionaries
df = pd.DataFrame(
    data_list_of_dicts, columns=["country", "capital", "population", "GDP"]
)
df.head(10)


Unnamed: 0,country,capital,population,GDP
0,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73


In [12]:
new_df = df.set_index("country")
new_df.rename({"United States": "us", "United Kingdom": "uk"}, axis=0, inplace=True)
new_df


Unnamed: 0_level_0,capital,population,GDP
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
us,"Washington, D.C.",331449281,21.44
China,Beijing,1393000000,14.34
Japan,Tokyo,126476461,5.07
Germany,Berlin,83783945,4.01
uk,London,67886011,2.99
India,New Delhi,1303171035,3.11
France,Paris,67186600,2.78
Italy,Rome,60277900,2.15
Brazil,Brasília,211050000,1.77
Canada,Ottawa,37742154,1.73
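
By default `rename` silently skips labels it cannot find, so a misspelled key leaves the row untouched without any warning. A minimal sketch of that behaviour, using a two-row stand-in frame (`countries` is a hypothetical name), and of the `errors="raise"` option that surfaces the problem:

```python
import pandas as pd

countries = pd.DataFrame(
    {"capital": ["Washington, D.C.", "London"]},
    index=["United States", "United Kingdom"],
)

# The first key is misspelled (lowercase "u"), so only the second row is renamed
renamed = countries.rename({"united States": "us", "United Kingdom": "uk"}, axis=0)
print(renamed.index.tolist())

# errors="raise" turns the silent no-op into a KeyError
rename_error = None
try:
    countries.rename({"united States": "us"}, axis=0, errors="raise")
except KeyError as exc:
    rename_error = str(exc)
print("rename failed:", rename_error)
```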


In [13]:
df.rename({0: "zero"}, axis=0)


Unnamed: 0,country,capital,population,GDP
zero,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73


In [14]:
df.rename({"population": "population_number","GDP" : "gross_domestic_product"}, axis=1)


Unnamed: 0,country,capital,population_number,gross_domestic_product
0,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73


In [15]:
import pandas as pd

# Original data structure
data_list_of_dicts = [
    {"Country": "United States","Capital": "Washington, D.C.","Population Number": 331449281,"Gross Domestic Product": 21.44},
    {"Country": "China", "Capital": "Beijing", "Population Number": 1393000000, "Gross Domestic Product": 14.34},
    {"Country": "Japan", "Capital": "Tokyo", "Population Number": 126476461, "Gross Domestic Product": 5.07},
    {"Country": "Germany", "Capital": "Berlin", "Population Number": 83783945, "Gross Domestic Product": 4.01},
    {"Country": "United Kingdom","Capital": "London","Population Number": 67886011,"Gross Domestic Product": 2.99},
    {"Country": "India", "Capital": "New Delhi", "Population Number": 1303171035, "Gross Domestic Product": 3.11},
    {"Country": "France", "Capital": "Paris", "Population Number": 67186600, "Gross Domestic Product": 2.78},
    {"Country": "Italy", "Capital": "Rome", "Population Number": 60277900, "Gross Domestic Product": 2.15},
    {"Country": "Brazil", "Capital": "Brasília", "Population Number": 211050000, "Gross Domestic Product": 1.77},
    {"Country": "Canada", "Capital": "Ottawa", "Population Number": 37742154, "Gross Domestic Product": 1.73},
]

# Creating DataFrame from list of dictionaries
df = pd.DataFrame(
    data_list_of_dicts, columns=["Country", "Capital", "Population Number", "Gross Domestic Product"]
)
df.head(10)


Unnamed: 0,Country,Capital,Population Number,Gross Domestic Product
0,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73


In [16]:
df.columns = df.columns.str.lower().str.replace(" ","_")
df


Unnamed: 0,country,capital,population_number,gross_domestic_product
0,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73


In [17]:
df.columns = [col.lower().replace(" ", "_") for col in df.columns]
df


Unnamed: 0,country,capital,population_number,gross_domestic_product
0,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73


 ## Getting Elements


In [18]:
# Import pandas
import pandas as pd

# Create a pandas Series representing monthly sales data
sales_data = pd.Series(
    [1500, 1200, 1800, 1600, 1300, 1700, 1400, 1500, 1600, 1800],
    index=["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct"],
)
sales_data


jan    1500
feb    1200
mar    1800
apr    1600
may    1300
jun    1700
jul    1400
aug    1500
sep    1600
oct    1800
dtype: int64

In [19]:
# Get one element from a Series
sales_data["jan"]

# Another way to do it (attribute access):
# sales_data.jan


1500

In [20]:
sales_data[["jan", "apr"]]


jan    1500
apr    1600
dtype: int64

In [21]:
# Get a subset of a Series by position (start:stop:step)
sales_data[1:6:2]


feb    1200
apr    1600
jun    1700
dtype: int64
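
Plain integer slices on a Series work by position. The explicit accessors make the intent clearer: `.iloc` is positional and excludes the stop, while `.loc` is label-based and includes it. A short sketch on a four-month stand-in Series:

```python
import pandas as pd

sales = pd.Series([1500, 1200, 1800, 1600], index=["jan", "feb", "mar", "apr"])

# Positions 1 and 2; the stop position (3) is excluded
by_position = sales.iloc[1:3]

# Labels "feb" through "apr"; the stop label is included
by_label = sales.loc["feb":"apr"]

print(by_position)
print(by_label)
```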

In [22]:
sales_data[sales_data > 1500]


mar    1800
apr    1600
jun    1700
sep    1600
oct    1800
dtype: int64

In [23]:
import pandas as pd

# Original data structure
data_list_of_dicts = [
    {
        "country": "United States",
        "capital": "Washington, D.C.",
        "population": 331449281,
        "GDP": 21.44,
    },
    {"country": "China", "capital": "Beijing", "population": 1393000000, "GDP": 14.34},
    {"country": "Japan", "capital": "Tokyo", "population": 126476461, "GDP": 5.07},
    {"country": "Germany", "capital": "Berlin", "population": 83783945, "GDP": 4.01},
    {
        "country": "United Kingdom",
        "capital": "London",
        "population": 67886011,
        "GDP": 2.99,
    },
    {"country": "India", "capital": "New Delhi", "population": 1303171035, "GDP": 3.11},
    {"country": "France", "capital": "Paris", "population": 67186600, "GDP": 2.78},
    {"country": "Italy", "capital": "Rome", "population": 60277900, "GDP": 2.15},
    {"country": "Brazil", "capital": "Brasília", "population": 211050000, "GDP": 1.77},
    {"country": "Canada", "capital": "Ottawa", "population": 37742154, "GDP": 1.73},
]

# Creating DataFrame from list of dictionaries
df = pd.DataFrame(
    data_list_of_dicts, columns=["country", "capital", "population", "GDP"]
)
df.head(10)


Unnamed: 0,country,capital,population,GDP
0,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73


In [24]:
df[(df["GDP"] > 10) | (df["population"] > 331449281)]


Unnamed: 0,country,capital,population,GDP
0,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
5,India,New Delhi,1303171035,3.11
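
The filter above works because each comparison returns a boolean Series, and `|` combines them element by element. The sketch below spells that out on a four-row stand-in (`df_small` is a hypothetical name) and shows two equivalent spellings, `.loc` and `query`.

```python
import pandas as pd

df_small = pd.DataFrame(
    {
        "country": ["United States", "China", "Japan", "India"],
        "population": [331449281, 1393000000, 126476461, 1303171035],
        "GDP": [21.44, 14.34, 5.07, 3.11],
    }
)

# Each comparison yields a boolean Series; combine with | (or), & (and), ~ (not).
# The parentheses are required because | binds tighter than >.
mask = (df_small["GDP"] > 10) | (df_small["population"] > 331449281)

# .loc takes the mask for the rows and a list of labels for the columns
selected = df_small.loc[mask, ["country", "GDP"]]

# The same filter written as a query string
queried = df_small.query("GDP > 10 or population > 331449281")

print(selected)
```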


 ## Dropping


In [25]:
# Import pandas
import pandas as pd

# Create a pandas Series representing monthly sales data
sales_data = pd.Series(
    [1500, 1200, 1800, 1600, 1300, 1700, 1400, 1500, 1600, 1800],
    index=["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct"],
)
sales_data


jan    1500
feb    1200
mar    1800
apr    1600
may    1300
jun    1700
jul    1400
aug    1500
sep    1600
oct    1800
dtype: int64

In [26]:
# Drop values from rows (axis=0)
sales_data.drop(['may', 'mar'])


jan    1500
feb    1200
apr    1600
jun    1700
jul    1400
aug    1500
sep    1600
oct    1800
dtype: int64

In [27]:
import pandas as pd

# Original data structure
data_list_of_dicts = [
    {"country": "United States","capital": "Washington, D.C.","population": 331449281,"GDP": 21.44},
    {"country": "China", "capital": "Beijing", "population": 1393000000, "GDP": 14.34},
    {"country": "Japan", "capital": "Tokyo", "population": 126476461, "GDP": 5.07},
    {"country": "Germany", "capital": "Berlin", "population": 83783945, "GDP": 4.01},
    {"country": "United Kingdom","capital": "London","population": 67886011,"GDP": 2.99},
    {"country": "India", "capital": "New Delhi", "population": 1303171035, "GDP": 3.11},
    {"country": "France", "capital": "Paris", "population": 67186600, "GDP": 2.78},
    {"country": "Italy", "capital": "Rome", "population": 60277900, "GDP": 2.15},
    {"country": "Brazil", "capital": "Brasília", "population": 211050000, "GDP": 1.77},
    {"country": "Canada", "capital": "Ottawa", "population": 37742154, "GDP": 1.73},
]

# Creating DataFrame from list of dictionaries
df = pd.DataFrame(
    data_list_of_dicts, columns=["country", "capital", "population", "GDP"]
)
df.head(20)


Unnamed: 0,country,capital,population,GDP
0,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73


In [28]:
df.drop_duplicates(subset=["country", "capital"],keep="last")


Unnamed: 0,country,capital,population,GDP
0,United States,"Washington, D.C.",331449281,21.44
1,China,Beijing,1393000000,14.34
2,Japan,Tokyo,126476461,5.07
3,Germany,Berlin,83783945,4.01
4,United Kingdom,London,67886011,2.99
5,India,New Delhi,1303171035,3.11
6,France,Paris,67186600,2.78
7,Italy,Rome,60277900,2.15
8,Brazil,Brasília,211050000,1.77
9,Canada,Ottawa,37742154,1.73


In [29]:
df.duplicated(subset=["country", "capital"]).sum()


0
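
The country table has no repeated rows, so the two calls above leave it unchanged. To see the `keep` argument in action, here is a sketch on a small frame that does contain a duplicate; the second GDP figure for France is made up for illustration.

```python
import pandas as pd

dupes_df = pd.DataFrame(
    {
        "country": ["France", "Italy", "France"],
        "capital": ["Paris", "Rome", "Paris"],
        "GDP": [2.78, 2.15, 2.80],  # 2.80 is an invented value
    }
)

# Rows 0 and 2 share the same (country, capital) pair
n_dupes = dupes_df.duplicated(subset=["country", "capital"]).sum()

# keep="first" retains row 0; keep="last" retains row 2
keep_first = dupes_df.drop_duplicates(subset=["country", "capital"], keep="first")
keep_last = dupes_df.drop_duplicates(subset=["country", "capital"], keep="last")

print(n_dupes)
print(keep_last)
```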

In [30]:
import pandas as pd

# Same data as before, but with some values deliberately left out
data_list_of_dicts = [
    {"country": "United States","capital": "Washington, D.C.","population": 331449281,"GDP": 21.44},
    {"country": "China", "capital": "Beijing", "population": 1393000000, "GDP": 14.34},
    {"country": "Japan", "capital": "Tokyo", "population": 126476461, "GDP": 5.07},
    {"country": "Germany", "capital": "Berlin", "population": 83783945, "GDP": 4.01},
    {"country": "United Kingdom","capital": "London","population": 67886011,"GDP": 2.99},
    {"capital": "New Delhi", "population": 1303171035, "GDP": 3.11},
    {"country": "France", "capital": "Paris", "population": 67186600},
    {"country": "Italy", "capital": "Rome", "population": 60277900, "GDP": 2.15},
    {"country": "Brazil", "capital": "Brasília", "GDP": 1.77},
    {"country": "Canada", "capital": "Ottawa", "population": 37742154, "GDP": 1.73},
]

df = pd.DataFrame(
    data_list_of_dicts, columns=["country", "capital", "population", "GDP"]
)
df




Unnamed: 0,country,capital,population,GDP
0,United States,"Washington, D.C.",331449300.0,21.44
1,China,Beijing,1393000000.0,14.34
2,Japan,Tokyo,126476500.0,5.07
3,Germany,Berlin,83783940.0,4.01
4,United Kingdom,London,67886010.0,2.99
5,,New Delhi,1303171000.0,3.11
6,France,Paris,67186600.0,
7,Italy,Rome,60277900.0,2.15
8,Brazil,Brasília,,1.77
9,Canada,Ottawa,37742150.0,1.73


In [31]:
na_counts = df.isna().sum()
print(na_counts)


country       1
capital       0
population    1
GDP           1
dtype: int64


In [32]:
# Dropping rows with any NaN values
df.dropna()


Unnamed: 0,country,capital,population,GDP
0,United States,"Washington, D.C.",331449300.0,21.44
1,China,Beijing,1393000000.0,14.34
2,Japan,Tokyo,126476500.0,5.07
3,Germany,Berlin,83783940.0,4.01
4,United Kingdom,London,67886010.0,2.99
7,Italy,Rome,60277900.0,2.15
9,Canada,Ottawa,37742150.0,1.73
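
Dropping every row that has any gap is often too aggressive. The sketch below shows a few gentler options on a hypothetical four-row frame (`gaps_df`): restricting the check to some columns, requiring a minimum number of non-missing values, and filling instead of dropping. Filling GDP with the column mean is only an example strategy.

```python
import numpy as np
import pandas as pd

gaps_df = pd.DataFrame(
    {
        "country": ["Japan", np.nan, "France", "Brazil"],
        "population": [126476461, 1303171035, 67186600, np.nan],
        "GDP": [5.07, 3.11, np.nan, 1.77],
    }
)

# Drop a row only when the listed column is missing
has_country = gaps_df.dropna(subset=["country"])

# Keep rows that have at least 3 non-missing values
complete_enough = gaps_df.dropna(thresh=3)

# Fill instead of dropping, with a different value per column
filled = gaps_df.fillna({"country": "unknown", "GDP": gaps_df["GDP"].mean()})

print(has_country)
print(filled)
```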


 ## Applying Functions


In [33]:
df


Unnamed: 0,country,capital,population,GDP
0,United States,"Washington, D.C.",331449300.0,21.44
1,China,Beijing,1393000000.0,14.34
2,Japan,Tokyo,126476500.0,5.07
3,Germany,Berlin,83783940.0,4.01
4,United Kingdom,London,67886010.0,2.99
5,,New Delhi,1303171000.0,3.11
6,France,Paris,67186600.0,
7,Italy,Rome,60277900.0,2.15
8,Brazil,Brasília,,1.77
9,Canada,Ottawa,37742150.0,1.73


In [34]:
# Apply a function to the numeric columns only; on the full DataFrame,
# x / 2 raises a TypeError for the string columns
df[["population", "GDP"]].apply(lambda x: x / 2)

In [None]:
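# Guard against NaN: missing values are floats and have no .upper()/.lower()
df["country"] = df["country"].apply(lambda x: x.upper() if isinstance(x, str) else x)
df["capital"] = df["capital"].apply(lambda x: x.lower() if isinstance(x, str) else x)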

df
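
`apply` behaves differently depending on what it is called on: element by element on a Series, column by column on a DataFrame, and row by row with `axis=1`. A sketch on a small hypothetical frame (`cities_df`), alongside the vectorized `.str` accessor, which skips missing values on its own:

```python
import numpy as np
import pandas as pd

cities_df = pd.DataFrame(
    {
        "country": ["Japan", np.nan, "France"],
        "population": [126476461, 1303171035, 67186600],
        "GDP": [5.07, 3.11, 2.78],
    }
)

# Vectorized string method: NaN stays NaN instead of raising an error
upper = cities_df["country"].str.upper()

# DataFrame.apply with the default axis=0 passes each column to the function
ranges = cities_df[["population", "GDP"]].apply(lambda col: col.max() - col.min())

# axis=1 passes each row to the function instead
missing_per_row = cities_df.apply(lambda row: row.isna().sum(), axis=1)

print(upper)
print(ranges)
print(missing_per_row)
```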


 ## TQDM with pandas

In [None]:
import time

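def placeholder_function(x):
    # Simulate a slow operation so the progress bar is visible
    time.sleep(0.5)
    # Leave missing values (NaN) untouched
    return x.upper() if isinstance(x, str) else x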





In [None]:
from tqdm import tqdm
# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()

df["country"] = df["country"].progress_apply(placeholder_function)

df


In [None]:
# Even better progress bar
from tqdm.auto import tqdm
# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()

df["country"] = df["country"].progress_apply(placeholder_function)

df


 ## Basic Information


In [None]:
import pandas as pd

# Original data structure
data_list_of_dicts = [
    {"country": "United States","capital": "Washington, D.C.","population": 331449281,"GDP": 21.44},
    {"country": "China", "capital": "Beijing", "population": 1393000000, "GDP": 14.34},
    {"country": "Japan", "capital": "Tokyo", "population": 126476461, "GDP": 5.07},
    {"country": "Germany", "capital": "Berlin", "population": 83783945, "GDP": 4.01},
    {"country": "United Kingdom","capital": "London","population": 67886011,"GDP": 2.99},
    {"country": "India", "capital": "New Delhi", "population": 1303171035, "GDP": 3.11},
    {"country": "France", "capital": "Paris", "population": 67186600, "GDP": 2.78},
    {"country": "Italy", "capital": "Rome", "population": 60277900, "GDP": 2.15},
    {"country": "Brazil", "capital": "Brasília", "population": 211050000, "GDP": 1.77},
    {"country": "Canada", "capital": "Ottawa", "population": 37742154, "GDP": 1.73},
]

# Creating DataFrame from list of dictionaries
df = pd.DataFrame(
    data_list_of_dicts, columns=["country", "capital", "population", "GDP"]
)
df.head(20)


In [None]:
df


In [None]:
# Get the shape (rows, columns)
df.shape


In [None]:
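# Uncomment to use the country names as the index:
# df = df.set_index("country")
# reset_index() moves the current index back into a regular column
df = df.reset_index()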


In [None]:
# Describe index
df.index


In [None]:
# Describe DataFrame columns
df.columns


In [None]:
# Info on DataFrame
df.info()


In [None]:
# Number of non-NA values
df.count()


In [None]:
df["country"].value_counts()


In [None]:
df["country"].hist(xrot=40)


 ## Summary

In [None]:
# Sum of values
df['population'].sum()

# Cumulative sum of values
df['population'].cumsum()

# Minimum/maximum values
df['population'].min()
df['population'].max()

# Index of minimum/maximum values
df['population'].idxmin()
df['population'].idxmax()

# Mean of values
df['population'].mean()

# Median of values
df['population'].median()

# Summary statistics
df['population'].describe()


In [None]:
# easier way to get the summaries
# df.describe()
df.describe().T
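
A notebook cell only displays its last expression, so computing several statistics in one cell shows just the final one. One way to see a chosen set together is `agg` with a list of function names; the three-country Series below is a stand-in for `df['population']`.

```python
import pandas as pd

population = pd.Series(
    [331449281, 1393000000, 126476461],
    index=["United States", "China", "Japan"],
    name="population",
)

# Several statistics at once, returned as one labeled Series
stats = population.agg(["sum", "min", "max", "mean", "median"])

# idxmax returns the index label of the largest value, not the value itself
largest = population.idxmax()

print(stats)
print(largest)
```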


 ## Introduction to data profiling

In [None]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Profiling Report",explorative=True)
# profile.to_widgets()
# profile.to_notebook_iframe()
profile.to_file("your_report.html")


 ## Pivot Table

 Pivot tables allow you to transform and summarize data, similar to the pivot table feature in Excel.

In [1]:
import pandas as pd

# Create pandas dataframe with sample data
data = {
    "date": [
        "2023-01-01",
        "2023-01-02",
        "2023-01-03",
        "2023-01-01",
        "2023-01-02",
        "2023-01-03",
        "2023-01-01",
        "2023-01-02",
        "2023-01-03",
        "2023-01-01",
        "2023-01-02",
        "2023-01-03",
        "2023-01-01",
        "2023-01-02",
        "2023-01-03",
        "2023-01-01",
        "2023-01-02",
        "2023-01-03",
    ],
    "city": [
        "New York",
        "New York",
        "New York",
        "Los Angeles",
        "Los Angeles",
        "Los Angeles",
        "Chicago",
        "Chicago",
        "Chicago",
        "San Francisco",
        "San Francisco",
        "San Francisco",
        "Houston",
        "Houston",
        "Houston",
        "Seattle",
        "Seattle",
        "Seattle",
    ],
    "sales": [
        100,
        150,
        200,
        50,
        60,
        70,
        110,
        120,
        130,
        80,
        90,
        100,
        125,
        135,
        145,
        95,
        105,
        115,
    ],
}

sales_df = pd.DataFrame(data)

# Ensure no NaNs by filling with 0
sales_df.fillna(0, inplace=True)

sales_df

Unnamed: 0,date,city,sales
0,2023-01-01,New York,100
1,2023-01-02,New York,150
2,2023-01-03,New York,200
3,2023-01-01,Los Angeles,50
4,2023-01-02,Los Angeles,60
5,2023-01-03,Los Angeles,70
6,2023-01-01,Chicago,110
7,2023-01-02,Chicago,120
8,2023-01-03,Chicago,130
9,2023-01-01,San Francisco,80


In [46]:
# Pivot the data to show sales for each city by date
pivot_df = sales_df.pivot(index='date', columns='city', values='sales')
pivot_df


city,Chicago,Houston,Los Angeles,New York,San Francisco,Seattle
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2023-01-01,110,125,50,100,80,95
2023-01-02,120,135,60,150,90,105
2023-01-03,130,145,70,200,100,115


In [47]:
# Pivot the other way: one row per city, one column per date
pivot_df = sales_df.pivot(index="city", columns="date", values="sales")
pivot_df

date,2023-01-01,2023-01-02,2023-01-03
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Chicago,110,120,130
Houston,125,135,145
Los Angeles,50,60,70
New York,100,150,200
San Francisco,80,90,100
Seattle,95,105,115
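
`pivot` only reshapes; it needs each (index, column) pair to appear once. When pairs repeat, `pivot_table` aggregates them instead. The sketch below uses a hypothetical `orders_df` in which Chicago has two invented entries on the same day.

```python
import pandas as pd

orders_df = pd.DataFrame(
    {
        "date": ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02"],
        "city": ["Chicago", "Chicago", "Chicago", "Seattle"],
        "sales": [110, 40, 120, 105],  # two Chicago rows on 2023-01-01 (invented)
    }
)

# pivot() cannot reshape duplicated (date, city) pairs and raises ValueError
pivot_error = None
try:
    orders_df.pivot(index="date", columns="city", values="sales")
except ValueError as exc:
    pivot_error = str(exc)
print("pivot failed:", pivot_error)

# pivot_table() aggregates the duplicates; fill_value replaces missing combinations
table = orders_df.pivot_table(
    index="date", columns="city", values="sales", aggfunc="sum", fill_value=0
)
print(table)
```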


 ## GroupBy

 Grouping data in pandas can help you to aggregate and summarize your data in powerful ways.

In [2]:
import pandas as pd

# Create pandas dataframe with sample data
data = {
    "date": [
        "2023-01-01",
        "2023-01-02",
        "2023-01-03",
        "2023-01-01",
        "2023-01-02",
        "2023-01-03",
        "2023-01-01",
        "2023-01-02",
        "2023-01-03",
        "2023-01-01",
        "2023-01-02",
        "2023-01-03",
        "2023-01-01",
        "2023-01-02",
        "2023-01-03",
        "2023-01-01",
        "2023-01-02",
        "2023-01-03",
    ],
    "city": [
        "New York",
        "New York",
        "New York",
        "Los Angeles",
        "Los Angeles",
        "Los Angeles",
        "Chicago",
        "Chicago",
        "Chicago",
        "San Francisco",
        "San Francisco",
        "San Francisco",
        "Houston",
        "Houston",
        "Houston",
        "Seattle",
        "Seattle",
        "Seattle",
    ],
    "sales": [
        100,
        150,
        200,
        50,
        60,
        70,
        110,
        120,
        130,
        80,
        90,
        100,
        125,
        135,
        145,
        95,
        105,
        115,
    ],
}

sales_df = pd.DataFrame(data)

# Ensure no NaNs by filling with 0
sales_df.fillna(0, inplace=True)

sales_df

Unnamed: 0,date,city,sales
0,2023-01-01,New York,100
1,2023-01-02,New York,150
2,2023-01-03,New York,200
3,2023-01-01,Los Angeles,50
4,2023-01-02,Los Angeles,60
5,2023-01-03,Los Angeles,70
6,2023-01-01,Chicago,110
7,2023-01-02,Chicago,120
8,2023-01-03,Chicago,130
9,2023-01-01,San Francisco,80


In [11]:
# Group by date and calculate the total sales for each date
# (sum() also concatenates the strings in the non-numeric city column)
grouped_sales_df = sales_df.groupby("date").sum()
grouped_sales_df

Unnamed: 0_level_0,city,sales
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2023-01-01,New YorkLos AngelesChicagoSan FranciscoHouston...,560
2023-01-02,New YorkLos AngelesChicagoSan FranciscoHouston...,660
2023-01-03,New YorkLos AngelesChicagoSan FranciscoHouston...,760


In [49]:
# Group by city and get descriptive statistics for sales
grouped_sales_stats = sales_df.groupby('city').describe()
grouped_sales_stats


Unnamed: 0_level_0,sales,sales,sales,sales,sales,sales,sales,sales
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
city,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Chicago,3.0,120.0,10.0,110.0,115.0,120.0,125.0,130.0
Houston,3.0,135.0,10.0,125.0,130.0,135.0,140.0,145.0
Los Angeles,3.0,60.0,10.0,50.0,55.0,60.0,65.0,70.0
New York,3.0,150.0,50.0,100.0,125.0,150.0,175.0,200.0
San Francisco,3.0,90.0,10.0,80.0,85.0,90.0,95.0,100.0
Seattle,3.0,105.0,10.0,95.0,100.0,105.0,110.0,115.0
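
Selecting the column of interest before aggregating keeps string columns out of the result. A minimal sketch on a four-row subset of the sales data (`sales_small` is a hypothetical name):

```python
import pandas as pd

sales_small = pd.DataFrame(
    {
        "date": ["2023-01-01", "2023-01-02", "2023-01-01", "2023-01-02"],
        "city": ["New York", "New York", "Chicago", "Chicago"],
        "sales": [100, 150, 110, 120],
    }
)

# Select the column first, then aggregate: the result is a Series indexed by date
per_date = sales_small.groupby("date")["sales"].sum()

# as_index=False keeps the group key as a regular column instead of the index
per_city = sales_small.groupby("city", as_index=False)["sales"].sum()

print(per_date)
print(per_city)
```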


 ## Aggregate (agg)

 The `agg` method allows you to apply multiple aggregation functions to your grouped data.

In [21]:
aggregated_sales_df = sales_df.groupby("city").agg(
    {
        "sales": ["sum", "mean", "max", "min", "count"],
        "date": ["count", ("aggregate_string", lambda x: " /// ".join(x))],
    }
)
aggregated_sales_df

Unnamed: 0_level_0,sales,sales,sales,sales,sales,date,date
Unnamed: 0_level_1,sum,mean,max,min,count,count,aggregate_string
city,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Chicago,360,120.0,130,110,3,3,2023-01-01 /// 2023-01-02 /// 2023-01-03
Houston,405,135.0,145,125,3,3,2023-01-01 /// 2023-01-02 /// 2023-01-03
Los Angeles,180,60.0,70,50,3,3,2023-01-01 /// 2023-01-02 /// 2023-01-03
New York,450,150.0,200,100,3,3,2023-01-01 /// 2023-01-02 /// 2023-01-03
San Francisco,270,90.0,100,80,3,3,2023-01-01 /// 2023-01-02 /// 2023-01-03
Seattle,315,105.0,115,95,3,3,2023-01-01 /// 2023-01-02 /// 2023-01-03


In [22]:
# Group by city and apply custom aggregation functions
custom_agg_sales_df = sales_df.groupby('city').agg({
    'sales': lambda x: x.max() - x.min()  # Range of sales
})
custom_agg_sales_df

Unnamed: 0_level_0,sales
city,Unnamed: 1_level_1
Chicago,20
Houston,20
Los Angeles,20
New York,100
San Francisco,20
Seattle,20
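
The dictionary form of `agg` produces two-level column headers. Named aggregation is an alternative that yields flat, self-describing column names: each keyword is the output column, and each value is a `(column, function)` pair. The output names below are arbitrary choices for this sketch.

```python
import pandas as pd

sales_small = pd.DataFrame(
    {
        "city": ["New York", "New York", "Chicago", "Chicago"],
        "sales": [100, 150, 110, 120],
    }
)

# keyword = output column name, value = (input column, aggregation)
named = sales_small.groupby("city").agg(
    total_sales=("sales", "sum"),
    mean_sales=("sales", "mean"),
    sales_range=("sales", lambda x: x.max() - x.min()),
)
print(named)
```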
