![pandas](images/pandas.png)

## 1 Setup

### 1.1 Download Library

Install the library using one of the commands below:

#### Inside Jupyter Notebook/Lab

```sh
!pip install pandas
```

#### From terminal

```sh
pip install pandas
```

or

```sh
pip3 install pandas
```

In [1]:
# !pip install pandas

### 1.2 Import Library

Import the pandas library under the alias `pd`.

In [2]:
from datetime import datetime

import pandas as pd

print("Pandas version:", pd.__version__)

Pandas version: 2.3.0


### 1.3 Introduction

#### Why Pandas?

1. A pandas DataFrame can hold **heterogeneous data**.
2. Although a DataFrame is heterogeneous, each **column is homogeneous** (i.e., all values in a column share the same datatype).
3. It provides facilities for **data manipulation** such as
    1. Data import
    2. Conditional filtering
    3. Aggregation
    4. Querying
    5. Data export
4. It **uses NumPy internally**, so NumPy arrays can be passed directly to pandas.
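
The column-wise typing and the NumPy point above can be sketched as follows. This is a minimal illustration only; the names `scores`, `scores_df`, and the column labels are made up for this example.

```python
import numpy as np
import pandas as pd

# A 2-D NumPy array can be passed directly as the data.
scores = np.array([[90, 85], [70, 95]])
scores_df = pd.DataFrame(scores, columns=["Maths", "Physics"])

# Adding a text column makes the DataFrame heterogeneous,
# while each individual column keeps a single dtype.
scores_df["Student"] = ["Alex", "Jane"]
print(scores_df.dtypes)
```

Printing `dtypes` shows one datatype per column, which is what "each column is homogeneous" means in practice.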

## 2 Create `DataFrame`

Let's create a sample Employee DataFrame.

### 2.1 Using Python `dict`

In [3]:
employee_data = {
    "Name": ["Alex", "Ajax", "Jane", "John", "Anna"],
    "Age": [31, 31, 28, 35, 40],
    "Role": ["Senior SD", "Associate Architect", "Junior SD", "Architect", "V.P."],
    "DOJ": ["01-06-2021", "01-01-2025", "01-03-2023", "01-12-2022", "01-08-2000"],
}

emp_df = pd.DataFrame(data=employee_data)
emp_df

Unnamed: 0,Name,Age,Role,DOJ
0,Alex,31,Senior SD,01-06-2021
1,Ajax,31,Associate Architect,01-01-2025
2,Jane,28,Junior SD,01-03-2023
3,John,35,Architect,01-12-2022
4,Anna,40,V.P.,01-08-2000


### 2.2 Using Python `list`

In [4]:
employee_rows = [
    ["Alex", 31, "Senior SD", "01-06-2021"],
    ["Ajax", 31, "Associate Architect", "01-01-2025"],
    ["Jane", 28, "Junior SD", "01-03-2023"],
    ["John", 35, "Architect", "01-12-2022"],
    ["Anna", 40, "V.P.", "01-08-2000"],
]
column_names = ["EmpName", "EmpAge", "EmpRole", "D.O.J"]

emp_df = pd.DataFrame(data=employee_rows, columns=column_names)
emp_df

Unnamed: 0,EmpName,EmpAge,EmpRole,D.O.J
0,Alex,31,Senior SD,01-06-2021
1,Ajax,31,Associate Architect,01-01-2025
2,Jane,28,Junior SD,01-03-2023
3,John,35,Architect,01-12-2022
4,Anna,40,V.P.,01-08-2000


> **Note**:
>
> Passing values to the `columns` parameter is **NOT** mandatory. If no values are passed, the columns are labeled with integers starting from 0 (0 to 3 in this case).
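
A short sketch of the default labeling described in the note, assuming a single made-up row and the hypothetical name `unnamed_df`:

```python
import pandas as pd

rows = [["Alex", 31, "Senior SD", "01-06-2021"]]

# No `columns` argument: pandas labels the columns 0, 1, 2, 3.
unnamed_df = pd.DataFrame(data=rows)
print(unnamed_df.columns)
```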

### 2.3 `DataFrame`

#### What is a `DataFrame`?

1. A DataFrame is the tabular (two-dimensional, labeled) representation of data in pandas.
2. A DataFrame can contain heterogeneous data, with a different datatype in each column.

In [5]:
type(emp_df)

pandas.core.frame.DataFrame

#### `DataFrame` Metadata

##### `info()` Method

In [6]:
emp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   EmpName  5 non-null      object
 1   EmpAge   5 non-null      int64 
 2   EmpRole  5 non-null      object
 3   D.O.J    5 non-null      object
dtypes: int64(1), object(3)
memory usage: 292.0+ bytes


##### `shape` Attribute

In [7]:
emp_df.shape

(5, 4)

### 2.4 `Series`

#### What is a `Series`?

1. A Series is a one-dimensional labeled array; a single column extracted from a DataFrame is a Series.
2. A Series contains homogeneous data and is typically backed by a NumPy array.

In [8]:
emp_df["EmpName"]  # emp_df.EmpName

0    Alex
1    Ajax
2    Jane
3    John
4    Anna
Name: EmpName, dtype: object

In [9]:
type(emp_df.EmpName)

pandas.core.series.Series

#### `Series` Metadata

##### `info()` Method

In [10]:
emp_df["EmpName"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 5 entries, 0 to 4
Series name: EmpName
Non-Null Count  Dtype 
--------------  ----- 
5 non-null      object
dtypes: object(1)
memory usage: 172.0+ bytes


##### `shape` Attribute

In [11]:
emp_df["EmpName"].shape

(5,)

When multiple Series that share an index are placed side by side, they form a DataFrame; in other words, a DataFrame is made up of multiple Series.

In [12]:
temp_df = emp_df[["EmpName", "EmpRole"]]
type(temp_df)

pandas.core.frame.DataFrame
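
The relationship also works in the other direction. The sketch below, using the made-up Series `names` and `ages`, assembles a DataFrame from individual Series with `pd.concat`:

```python
import pandas as pd

names = pd.Series(["Alex", "Jane"], name="EmpName")
ages = pd.Series([31, 28], name="EmpAge")

# Each Series becomes one column; the Series names become column labels.
combined_df = pd.concat([names, ages], axis=1)

# Selecting a single column gives a Series back.
print(type(combined_df["EmpAge"]))
```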

### 2.5 Import data

DataFrames are generally not created by hand; instead, data is loaded into a DataFrame from a file or from database tables.

Let's create a DataFrame using this [CSV data file][1] downloaded from Kaggle.

[1]: https://www.kaggle.com/datasets/anshumrankawat/mckinsey-csv

In [13]:
mk_df = pd.read_csv("data/mckinsey.csv")
mk_df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


> **Note**:
>
> 1. Along with `.csv`, pandas supports data import from other file types such as `.txt` and `.xlsx`.
> 2. Data can also be imported from SQL tables; refer to the [docs][2] for more details.
> 3. Full list of supported formats: https://pandas.pydata.org/docs/reference/io.html

[2]: https://pandas.pydata.org/docs/reference/io.html#sql
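
`read_csv` is not limited to paths on disk. As a minimal sketch, the snippet below parses CSV text held in memory through `io.StringIO`; the two rows are copied from the dataset preview above, and `csv_text` and `demo_df` are names invented for this example.

```python
import io

import pandas as pd

# CSV content held in memory instead of a file on disk.
csv_text = (
    "country,year,population\n"
    "Afghanistan,1952,8425333\n"
    "Zimbabwe,2007,12311143\n"
)

# `read_csv` accepts a path or any file-like object.
demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df.dtypes)
```

This is handy for trying out parsing options without needing the data file.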

## 3 Helper Methods

### 3.1 Preview data

The three most frequently used DataFrame methods for previewing data are:

1. `head()` method
2. `tail()` method
3. `sample()` method

#### `head()` Method

##### Default

In [14]:
mk_df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


##### Custom

In [15]:
mk_df.head(7)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
5,Afghanistan,1977,14880372,Asia,38.438,786.11336
6,Afghanistan,1982,12881816,Asia,39.854,978.011439


#### `tail()` Method

##### Default

In [16]:
mk_df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298


##### Custom

In [17]:
mk_df.tail(3)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298


#### `sample()` Method

##### Default

In [18]:
mk_df.sample()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1574,Turkey,1962,29788695,Europe,52.098,2322.869908


##### Custom

In [19]:
mk_df.sample(5)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
137,Bolivia,1977,5079716,Americas,50.023,3548.097832
824,Kenya,1992,25020539,Africa,59.285,1341.921721
485,Equatorial Guinea,1977,192675,Africa,42.024,958.566812
1192,Paraguay,1972,2614104,Americas,65.815,2523.337977
785,Jamaica,1977,2156814,Americas,70.11,6650.195573
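
Because `sample()` picks rows at random, the rows shown above change on every run. If a repeatable preview is needed, the `random_state` parameter can be fixed, as in this sketch (the DataFrame `numbers_df` and the seed `42` are arbitrary choices for illustration):

```python
import pandas as pd

numbers_df = pd.DataFrame({"value": range(100)})

# The same `random_state` yields the same rows on every call.
first_draw = numbers_df.sample(5, random_state=42)
second_draw = numbers_df.sample(5, random_state=42)
print(first_draw.equals(second_draw))
```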


#### `describe()` Method

In [20]:
mk_df.describe()

Unnamed: 0,year,population,life_exp,gdp_cap
count,1704.0,1704.0,1704.0,1704.0
mean,1979.5,29601210.0,59.474439,7215.327081
std,17.26533,106157900.0,12.917107,9857.454543
min,1952.0,60011.0,23.599,241.165876
25%,1965.75,2793664.0,48.198,1202.060309
50%,1979.5,7023596.0,60.7125,3531.846988
75%,1993.25,19585220.0,70.8455,9325.462346
max,2007.0,1318683000.0,82.603,113523.1329


### 3.2 DataFrame Helper Methods

#### `info()` Method

Calling `info()` on a DataFrame gives the following details:

1. Column and row indices.
2. Name and datatype of each column.
3. Count of non-null values for each column.
4. Memory occupied by the DataFrame in RAM.

In [21]:
mk_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     1704 non-null   object 
 1   year        1704 non-null   int64  
 2   population  1704 non-null   int64  
 3   continent   1704 non-null   object 
 4   life_exp    1704 non-null   float64
 5   gdp_cap     1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB


#### `keys()` Method

In [22]:
mk_df.keys()

Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')

In [23]:
mk_df.columns

Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')

#### `nunique()` Method

In [24]:
mk_df.nunique()

country        142
year            12
population    1704
continent        5
life_exp      1626
gdp_cap       1704
dtype: int64

#### `value_counts()` Method

In [25]:
emp_df.value_counts()

EmpName  EmpAge  EmpRole              D.O.J     
Ajax     31      Associate Architect  01-01-2025    1
Alex     31      Senior SD            01-06-2021    1
Anna     40      V.P.                 01-08-2000    1
Jane     28      Junior SD            01-03-2023    1
John     35      Architect            01-12-2022    1
Name: count, dtype: int64

### 3.3 Series Helper Methods

#### `info()` Method

In [26]:
mk_df.population.info()

<class 'pandas.core.series.Series'>
RangeIndex: 1704 entries, 0 to 1703
Series name: population
Non-Null Count  Dtype
--------------  -----
1704 non-null   int64
dtypes: int64(1)
memory usage: 13.4 KB


#### `unique()` Method

In [27]:
mk_df["continent"].unique()

array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)

#### `value_counts()` Method

In [28]:
mk_df["continent"].value_counts()

continent
Africa      624
Asia        396
Europe      360
Americas    300
Oceania      24
Name: count, dtype: int64

## 4 Index

### 4.1 Column Index

#### Explicit Indexes

##### Display column explicit indices

###### Example 1

In [29]:
tmp_df = pd.DataFrame([["Alex", 31, "Senior SD"]])
tmp_df

Unnamed: 0,0,1,2
0,Alex,31,Senior SD


In [30]:
print("Different ways of accessing explicit indices:")

display(tmp_df.keys())
display(tmp_df.columns)
display(tmp_df.columns.values)

Different ways of accessing explicit indices:


RangeIndex(start=0, stop=3, step=1)

RangeIndex(start=0, stop=3, step=1)

array([0, 1, 2])

###### Example 2

In [31]:
print("Column explicit indices:")
emp_df.columns

Column explicit indices:


Index(['EmpName', 'EmpAge', 'EmpRole', 'D.O.J'], dtype='object')

##### Update column explicit indices

###### Before

In [32]:
emp_df

Unnamed: 0,EmpName,EmpAge,EmpRole,D.O.J
0,Alex,31,Senior SD,01-06-2021
1,Ajax,31,Associate Architect,01-01-2025
2,Jane,28,Junior SD,01-03-2023
3,John,35,Architect,01-12-2022
4,Anna,40,V.P.,01-08-2000


In [33]:
emp_df.columns = ["Name", "Age", "Role", "DOJ"]
emp_df.columns

Index(['Name', 'Age', 'Role', 'DOJ'], dtype='object')

###### After

In [34]:
emp_df

Unnamed: 0,Name,Age,Role,DOJ
0,Alex,31,Senior SD,01-06-2021
1,Ajax,31,Associate Architect,01-01-2025
2,Jane,28,Junior SD,01-03-2023
3,John,35,Architect,01-12-2022
4,Anna,40,V.P.,01-08-2000


#### Implicit Indexes

1. Implicit indices can only be accessed, not modified.
2. Implicit indices are sliced the Pythonic way (the end index is exclusive). Discussed in `iloc`.

##### Display column implicit indices

###### Example 1

In [35]:
print("Explicit column indices:")
display(tmp_df.columns)

print("\nImplicit column indices:")
[tmp_df.columns.get_loc(c_name) for c_name in tmp_df.columns]

Explicit column indices:


RangeIndex(start=0, stop=3, step=1)


Implicit column indices:


[0, 1, 2]

###### Example 2

In [36]:
print("Explicit column indices:")
display(emp_df.columns)

print("\nImplicit column indices:")
[emp_df.columns.get_loc(c_name) for c_name in emp_df.columns]

Explicit column indices:


Index(['Name', 'Age', 'Role', 'DOJ'], dtype='object')


Implicit column indices:


[0, 1, 2, 3]

### 4.2 Row Index

#### Explicit Indexes

##### Display row explicit indices

###### Example 1

In [37]:
emp_df.index.values

array([0, 1, 2, 3, 4])

###### Example 2

In [38]:
mk_df.index.values

array([   0,    1,    2, ..., 1701, 1702, 1703], shape=(1704,))

##### Update row explicit indices

###### Before

In [39]:
emp_df

Unnamed: 0,Name,Age,Role,DOJ
0,Alex,31,Senior SD,01-06-2021
1,Ajax,31,Associate Architect,01-01-2025
2,Jane,28,Junior SD,01-03-2023
3,John,35,Architect,01-12-2022
4,Anna,40,V.P.,01-08-2000


In [40]:
emp_df.index = ["E01", "E02", "E03", "E04", "E05"]
emp_df.index.values

array(['E01', 'E02', 'E03', 'E04', 'E05'], dtype=object)

###### After

In [41]:
emp_df

Unnamed: 0,Name,Age,Role,DOJ
E01,Alex,31,Senior SD,01-06-2021
E02,Ajax,31,Associate Architect,01-01-2025
E03,Jane,28,Junior SD,01-03-2023
E04,John,35,Architect,01-12-2022
E05,Anna,40,V.P.,01-08-2000


#### Implicit Indexes

1. Implicit indices can only be accessed, not modified.
2. Implicit indices are sliced the Pythonic way (the end index is exclusive). Discussed in detail in `iloc`.

##### Display Indices

In [42]:
print("Explicit row indices:")
display(emp_df.index.values)

print("\nImplicit row indices:")
[emp_df.index.get_loc(exp_idx) for exp_idx in emp_df.index]

Explicit row indices:


array(['E01', 'E02', 'E03', 'E04', 'E05'], dtype=object)


Implicit row indices:


[0, 1, 2, 3, 4]

## 5 Rename

Behavior of the `rename` method:

1. `rename` changes existing column names or explicit row indices; labels missing from the mapping are left unchanged.
2. Passing a `dict` (or a function) as the first argument with `axis=1` renames columns.
3. Passing a `dict` (or a function) as the first argument with `axis=0` renames explicit row indices.
4. Passing a `dict` to the `columns` parameter renames columns.
5. Passing a `dict` to the `index` parameter renames explicit row indices.
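
The keyword forms from the list above can be sketched as follows. The DataFrame `staff_df` and the new labels are invented for this illustration.

```python
import pandas as pd

staff_df = pd.DataFrame(
    {"Name": ["Alex", "Jane"], "Age": [31, 28]},
    index=["E01", "E02"],
)

# `columns=` and `index=` can be combined in a single call.
renamed_df = staff_df.rename(columns={"Age": "AgeInYears"}, index={"E02": "X02"})

# A function is applied to every label on the chosen axis.
lower_df = staff_df.rename(str.lower, axis=1)

print(list(renamed_df.columns), list(renamed_df.index))
print(list(lower_df.columns))
```

The keyword forms tend to read more clearly than `axis=0`/`axis=1` because the call itself says which labels are being renamed.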

### 5.1 Rename Column Index

In [43]:
emp_df.rename(
    {
        "Name": "First name",
        "Age": "Age in Years",
        "Role": "Position",
        "DOJ": "Date of Joining",
    },
    axis=1,
)

Unnamed: 0,First name,Age in Years,Position,Date of Joining
E01,Alex,31,Senior SD,01-06-2021
E02,Ajax,31,Associate Architect,01-01-2025
E03,Jane,28,Junior SD,01-03-2023
E04,John,35,Architect,01-12-2022
E05,Anna,40,V.P.,01-08-2000


#### Importance of `inplace`

1. The call above did not change `emp_df`, because `rename()` returns a new DataFrame unless its `inplace` parameter is set to `True`.
2. Setting `inplace=True` applies the change to the DataFrame itself.
3. Many DataFrame methods that modify data behave the same way: either pass `inplace=True` where the method supports it, or assign the returned object back to a variable.
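
As an alternative to `inplace=True`, the result of `rename` can be assigned back to the same variable. A minimal sketch, using a throwaway DataFrame named `people_df`:

```python
import pandas as pd

people_df = pd.DataFrame({"Name": ["Alex"], "DOJ": ["01-06-2021"]})

# Without `inplace=True`, `rename` returns a new DataFrame ...
people_df.rename(columns={"DOJ": "DateOfJoining"})
print(list(people_df.columns))  # the original is unchanged

# ... so assigning the result back keeps the change.
people_df = people_df.rename(columns={"DOJ": "DateOfJoining"})
print(list(people_df.columns))
```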

##### Original DataFrame

In [44]:
emp_df

Unnamed: 0,Name,Age,Role,DOJ
E01,Alex,31,Senior SD,01-06-2021
E02,Ajax,31,Associate Architect,01-01-2025
E03,Jane,28,Junior SD,01-03-2023
E04,John,35,Architect,01-12-2022
E05,Anna,40,V.P.,01-08-2000


In [45]:
emp_df.rename({"DOJ": "DateOfJoining"}, axis=1, inplace=True)

##### Modified DataFrame

In [46]:
emp_df

Unnamed: 0,Name,Age,Role,DateOfJoining
E01,Alex,31,Senior SD,01-06-2021
E02,Ajax,31,Associate Architect,01-01-2025
E03,Jane,28,Junior SD,01-03-2023
E04,John,35,Architect,01-12-2022
E05,Anna,40,V.P.,01-08-2000


### 5.2 Rename Row Index

In [47]:
emp_df.rename({"E02": "A65", "E05": "VP01"}, axis=0)

Unnamed: 0,Name,Age,Role,DateOfJoining
E01,Alex,31,Senior SD,01-06-2021
A65,Ajax,31,Associate Architect,01-01-2025
E03,Jane,28,Junior SD,01-03-2023
E04,John,35,Architect,01-12-2022
VP01,Anna,40,V.P.,01-08-2000


> **Note**:
>
> Original DataFrame is not updated since `inplace` is not set to `True`.

## 6 Typecasting

### 6.1 Typecasting `DataFrame`

In [48]:
temp = pd.DataFrame(
    data=[
        [1, 3.14, True, "Abc"],
        [2, 0.5, False, "Xyz"],
        [3, 0, 0, None],
    ],
    dtype=object,
)

display(temp)
temp.info()

Unnamed: 0,0,1,2,3
0,1,3.14,True,Abc
1,2,0.5,False,Xyz
2,3,0.0,0,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3 non-null      object
 1   1       3 non-null      object
 2   2       3 non-null      object
 3   3       2 non-null      object
dtypes: object(4)
memory usage: 228.0+ bytes


### 6.2 Typecasting `Series`

In pandas, a column (Series) can be typecast using `astype`.

In [49]:
# Typecast each column.
temp[0] = temp[0].astype(int)
temp[1] = temp[1].astype(float)
temp[2] = temp[2].astype(bool)
temp[3] = temp[3].astype(str)

display(temp)
temp.info()

Unnamed: 0,0,1,2,3
0,1,3.14,True,Abc
1,2,0.5,False,Xyz
2,3,0.0,False,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       3 non-null      int64  
 1   1       3 non-null      float64
 2   2       3 non-null      bool   
 3   3       3 non-null      object 
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 207.0+ bytes


> **Note**:
>
> Typecasting reduced the size of DataFrame from ~228 bytes to ~207 bytes
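
Several columns can also be cast in one call by passing `astype` a `dict` that maps column names to target dtypes. The sketch below uses a made-up DataFrame `raw_df` whose values start out as strings:

```python
import pandas as pd

raw_df = pd.DataFrame(
    {"id": ["1", "2"], "price": ["3.14", "0.5"]},
    dtype=object,
)

# Map each column name to its target dtype.
typed_df = raw_df.astype({"id": "int64", "price": "float64"})
print(typed_df.dtypes)
```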

#### Handling Dates

In [50]:
emp_df["DateOfJoining"] = pd.to_datetime(emp_df["DateOfJoining"], format="%d-%m-%Y")
emp_df["DateOfJoining"].info()

<class 'pandas.core.series.Series'>
Index: 5 entries, E01 to E05
Series name: DateOfJoining
Non-Null Count  Dtype         
--------------  -----         
5 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 252.0+ bytes


`DateOfJoining` was converted from `object` to `datetime64[ns]`.
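
The `format` string tells pandas how to read each date: `%d-%m-%Y` means day, month, then four-digit year. When a column may contain values that do not match, `errors="coerce"` turns them into `NaT` instead of raising. A small sketch with an invented Series `raw_dates`:

```python
import pandas as pd

raw_dates = pd.Series(["01-06-2021", "not a date"])

# %d-%m-%Y reads day, month, year; unparseable values become NaT.
parsed_dates = pd.to_datetime(raw_dates, format="%d-%m-%Y", errors="coerce")
print(parsed_dates)
```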

## 7 Read

Before accessing rows by index, let's update the index of the DataFrame to make the distinction between explicit and implicit row indices visible.

For the `mk_df` DataFrame, both the implicit and the explicit row indices initially range from 0 to 1703 (1704 is exclusive), as seen below.

##### Old Explicit Indices

In [51]:
mk_df.index.values

array([   0,    1,    2, ..., 1701, 1702, 1703], shape=(1704,))

In [52]:
mk_df.index = list(range(1, len(mk_df) + 1))

##### New Explicit Indices

In [53]:
mk_df.index.values

array([   1,    2,    3, ..., 1702, 1703, 1704], shape=(1704,))

### 7.1 Access Rows

#### Direct Access

##### Example #1

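Inside `[]`, a DataFrame looks up a single label or a tuple among its column labels, not its row labels, so passing an explicit row index raises a `KeyError`. A slice is the exception: it selects rows, and an integer slice such as `100:115:5` works on implicit (positional) indices.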

In [54]:
try:
    mk_df[1]
except KeyError as err:
    print("KeyError:", err)

KeyError: 1


In [55]:
try:
    mk_df[1, 2, 3]
except KeyError as err:
    print("KeyError:", err)

KeyError: (1, 2, 3)


In [56]:
mk_df[100:115:5]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
101,Bangladesh,1972,70759295,Asia,45.252,630.233627
106,Bangladesh,1997,123315288,Asia,59.412,972.770035
111,Belgium,1962,9218400,Europe,70.25,10991.20676


##### Example #2

As before, a single explicit row index inside `[]` is looked up among the column labels and raises a `KeyError`.

In [57]:
try:
    emp_df["E05"]
except KeyError as err:
    print("KeyError:", err)

KeyError: 'E05'


##### Example #3

In [58]:
try:
    emp_df["E03", "E04", "E05"]
except KeyError as err:
    print("KeyError:", err)

KeyError: ('E03', 'E04', 'E05')


In [59]:
emp_df["E03":"E05"]

Unnamed: 0,Name,Age,Role,DateOfJoining
E03,Jane,28,Junior SD,2023-03-01
E04,John,35,Architect,2022-12-01
E05,Anna,40,V.P.,2000-08-01


#### Using `loc`

##### Example #1

In [60]:
mk_df.loc[1]

country       Afghanistan
year                 1952
population        8425333
continent            Asia
life_exp           28.801
gdp_cap        779.445314
Name: 1, dtype: object

##### Example #2

In [61]:
emp_df.loc["E05"]

Name                            Anna
Age                               40
Role                            V.P.
DateOfJoining    2000-08-01 00:00:00
Name: E05, dtype: object

#### Using `iloc`

In [62]:
mk_df.iloc[0]

country       Afghanistan
year                 1952
population        8425333
continent            Asia
life_exp           28.801
gdp_cap        779.445314
Name: 1, dtype: object

In [63]:
emp_df.iloc[4]

Name                            Anna
Age                               40
Role                            V.P.
DateOfJoining    2000-08-01 00:00:00
Name: E05, dtype: object

### 7.2 Access Columns

##### Example #1

In [64]:
emp_df["Name"]

E01    Alex
E02    Ajax
E03    Jane
E04    John
E05    Anna
Name: Name, dtype: object

Explicit row index can be used with Series.

In [65]:
emp_df["Name"]["E05"]

'Anna'

##### Example #2

In [66]:
mk_df["continent"].tail(3)

1702    Africa
1703    Africa
1704    Africa
Name: continent, dtype: object

Explicit row index can be used with Series.

In [67]:
mk_df["continent"][1702]

'Africa'

In [68]:
mk_df.head(3)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
3,Afghanistan,1962,10267083,Asia,31.997,853.10071


In [69]:
mk_df.loc[:2, "year":"continent"]

Unnamed: 0,year,population,continent
1,1952,8425333,Asia
2,1957,9240934,Asia


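```python
import pandas as pd

a = pd.Series(["a", "b", "c"], index=[1, 2, 2])
# Label 2 is duplicated, so the lookup returns a Series holding both matches.
print(a[2])
```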

In [70]:
mk_df["country"][1]

'Afghanistan'

#### Using `loc` and `iloc`

In [71]:
# df.index = list(range(1, len(df)+1))
# df.head(10)

In [72]:
# df.loc[1:3]

In [73]:
# df.iloc[1:3]

`iloc` works on implicit (positional) indices and excludes the end position of a slice, whereas `loc` works on explicit labels and includes the end label.
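
The difference in slicing behavior can be sketched on a small DataFrame whose explicit index starts at 1 (the name `letters_df` is made up for this example):

```python
import pandas as pd

letters_df = pd.DataFrame({"letter": list("abcde")}, index=[1, 2, 3, 4, 5])

# `loc` slices by explicit label and includes the end label: labels 1, 2, 3.
by_label = letters_df.loc[1:3]

# `iloc` slices by position and excludes the end position: positions 1, 2.
by_position = letters_df.iloc[1:3]

print(by_label["letter"].tolist())
print(by_position["letter"].tolist())
```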

In [74]:
# temp["column name 1"]

In [75]:
# temp.loc["a"]

In [76]:
# temp.loc[["a"]]

In [77]:
# temp.loc[["a", "c"]]

In [78]:
# temp.reset_index(drop=True)

In [79]:
# df.loc[9::-3]

### 7.3 Access Index

In [80]:
# df.head(10)

In [81]:
# pd.DataFrame(df, columns=['country', 'year'])
# df.iloc[:, 0:2]

In [82]:
# df.nunique()

In [83]:
# df.iloc[:20]

## 8 Insert

### 8.1 Insert Row

#### Using `concat` Method

##### Insert single row

In [84]:
# New employee record.
new_emp = [
    {
        "Name": "Bob",
        "Age": 30,
        "Role": "Junior SD",
        "DateOfJoining": "01-07-2024",
    }
]

# Create new DataFrame for new employee record.
new_emp_df1 = pd.DataFrame(data=new_emp)
new_emp_df1

Unnamed: 0,Name,Age,Role,DateOfJoining
0,Bob,30,Junior SD,01-07-2024


Observe that the new DataFrame has the explicit index `0`.

In [85]:
# Concatenate new DataFrame to main DataFrame.
emp_df = pd.concat([emp_df, new_emp_df1])
emp_df

Unnamed: 0,Name,Age,Role,DateOfJoining
E01,Alex,31,Senior SD,2021-06-01 00:00:00
E02,Ajax,31,Associate Architect,2025-01-01 00:00:00
E03,Jane,28,Junior SD,2023-03-01 00:00:00
E04,John,35,Architect,2022-12-01 00:00:00
E05,Anna,40,V.P.,2000-08-01 00:00:00
0,Bob,30,Junior SD,01-07-2024


> **Note**:
>
> If the `ignore_index` parameter of `concat` is not set to `True`, the explicit indices of both DataFrames are retained as they are.

Check explicit indices

In [86]:
emp_df.index

Index(['E01', 'E02', 'E03', 'E04', 'E05', 0], dtype='object')

Check implicit indices

In [87]:
[emp_df.index.get_loc(exp_idx) for exp_idx in emp_df.index]

[0, 1, 2, 3, 4, 5]

##### Insert multiple rows

In [88]:
new_emps = [
    ["David", 30, "Junior SD", "01-07-2024"],
    ["Karen", 40, "Associate Architect", "01-07-2024"],
]
column_names = ["Name", "Age", "Role", "DateOfJoining"]

new_emp_df2 = pd.DataFrame(data=new_emps, columns=column_names)
new_emp_df2

Unnamed: 0,Name,Age,Role,DateOfJoining
0,David,30,Junior SD,01-07-2024
1,Karen,40,Associate Architect,01-07-2024


In [89]:
# Concatenate new DataFrame to main DataFrame.
emp_df = pd.concat([emp_df, new_emp_df2], axis=0, ignore_index=True)
emp_df

Unnamed: 0,Name,Age,Role,DateOfJoining
0,Alex,31,Senior SD,2021-06-01 00:00:00
1,Ajax,31,Associate Architect,2025-01-01 00:00:00
2,Jane,28,Junior SD,2023-03-01 00:00:00
3,John,35,Architect,2022-12-01 00:00:00
4,Anna,40,V.P.,2000-08-01 00:00:00
5,Bob,30,Junior SD,01-07-2024
6,David,30,Junior SD,01-07-2024
7,Karen,40,Associate Architect,01-07-2024


> **Note**:
>
> `concat()` method does not have `inplace` parameter.

In [90]:
emp_df["DateOfJoining"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 8 entries, 0 to 7
Series name: DateOfJoining
Non-Null Count  Dtype 
--------------  ----- 
8 non-null      object
dtypes: object(1)
memory usage: 196.0+ bytes


In [91]:
emp_df["DateOfJoining"] = pd.to_datetime(emp_df["DateOfJoining"], format="%d-%m-%Y")
emp_df["DateOfJoining"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 8 entries, 0 to 7
Series name: DateOfJoining
Non-Null Count  Dtype         
--------------  -----         
8 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 196.0 bytes


#### Using `loc` and `iloc`
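
One possible way to fill in this section, offered as a sketch rather than the author's intended example: assigning to a label that does not exist yet through `loc` appends a row, while `iloc` can only address existing positions. The DataFrame `team_df` and its values are invented.

```python
import pandas as pd

team_df = pd.DataFrame(
    {"Name": ["Alex", "Jane"], "Age": [31, 28]},
    index=["E01", "E02"],
)

# Assigning to a label that does not exist yet appends a new row.
team_df.loc["E03"] = ["Bob", 30]

# `iloc` only addresses existing positions, so it cannot add a row.
try:
    team_df.iloc[len(team_df)] = ["Eve", 25]
except IndexError as err:
    print("IndexError:", err)

print(team_df)
```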

### 8.2 Insert Column

Columns can be added directly to a DataFrame, as shown in the examples below:

#### Using direct assignment

##### Example #1

Let's insert a new column called `Experinace` into `emp_df`.

In [92]:
emp_df["Experinace"] = datetime(2025, 1, 1) - emp_df["DateOfJoining"]
emp_df

Unnamed: 0,Name,Age,Role,DateOfJoining,Experinace
0,Alex,31,Senior SD,2021-06-01,1310 days
1,Ajax,31,Associate Architect,2025-01-01,0 days
2,Jane,28,Junior SD,2023-03-01,672 days
3,John,35,Architect,2022-12-01,762 days
4,Anna,40,V.P.,2000-08-01,8919 days
5,Bob,30,Junior SD,2024-07-01,184 days
6,David,30,Junior SD,2024-07-01,184 days
7,Karen,40,Associate Architect,2024-07-01,184 days


##### Example #2

In [93]:
mk_df["gdp"] = mk_df["gdp_cap"] * mk_df["population"]
mk_df.head(3)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,gdp
1,Afghanistan,1952,8425333,Asia,28.801,779.445314,6567086000.0
2,Afghanistan,1957,9240934,Asia,30.332,820.85303,7585449000.0
3,Afghanistan,1962,10267083,Asia,31.997,853.10071,8758856000.0


#### Using `concat`

In [94]:
emp_gender = pd.DataFrame(
    data=["M", "M", "F", "M", "F", "M", "M", "F"],
    columns=["Gender"],
)
emp_gender.head(3)

Unnamed: 0,Gender
0,M
1,M
2,F


In [95]:
pd.concat([emp_df, emp_gender], axis=1)

Unnamed: 0,Name,Age,Role,DateOfJoining,Experinace,Gender
0,Alex,31,Senior SD,2021-06-01,1310 days,M
1,Ajax,31,Associate Architect,2025-01-01,0 days,M
2,Jane,28,Junior SD,2023-03-01,672 days,F
3,John,35,Architect,2022-12-01,762 days,M
4,Anna,40,V.P.,2000-08-01,8919 days,F
5,Bob,30,Junior SD,2024-07-01,184 days,M
6,David,30,Junior SD,2024-07-01,184 days,M
7,Karen,40,Associate Architect,2024-07-01,184 days,F


In [96]:
# Concatenate column-wise (axis=1): rows are aligned on the index and gaps are filled with NaN.
emp_df = pd.concat([emp_df, new_emp_df2], axis=1)
emp_df

Unnamed: 0,Name,Age,Role,DateOfJoining,Experinace,Name.1,Age.1,Role.1,DateOfJoining.1
0,Alex,31,Senior SD,2021-06-01,1310 days,David,30.0,Junior SD,01-07-2024
1,Ajax,31,Associate Architect,2025-01-01,0 days,Karen,40.0,Associate Architect,01-07-2024
2,Jane,28,Junior SD,2023-03-01,672 days,,,,
3,John,35,Architect,2022-12-01,762 days,,,,
4,Anna,40,V.P.,2000-08-01,8919 days,,,,
5,Bob,30,Junior SD,2024-07-01,184 days,,,,
6,David,30,Junior SD,2024-07-01,184 days,,,,
7,Karen,40,Associate Architect,2024-07-01,184 days,,,,


#### Using `loc` and `iloc`

## 9 Update

### 9.1 Update Index

#### Using `index` Attribute

For the `mk_df` DataFrame, implicit row indices always range from 0 to 1703 (1704 is exclusive), while the explicit indices currently run from 1 to 1704, as seen below.

##### Before update

In [97]:
mk_df.index.values

array([   1,    2,    3, ..., 1702, 1703, 1704], shape=(1704,))

By default, explicit indices are the same as implicit indices. Explicit indices can be updated by assigning to the `index` attribute, as shown below:

In [98]:
mk_df.index = list(range(1, 1705))

##### After update

In [99]:
mk_df.index.values

array([   1,    2,    3, ..., 1702, 1703, 1704], shape=(1704,))

#### Reset Index

##### Without `drop` parameter

In [100]:
mk_df.reset_index().head(3)

Unnamed: 0,index,country,year,population,continent,life_exp,gdp_cap,gdp
0,1,Afghanistan,1952,8425333,Asia,28.801,779.445314,6567086000.0
1,2,Afghanistan,1957,9240934,Asia,30.332,820.85303,7585449000.0
2,3,Afghanistan,1962,10267083,Asia,31.997,853.10071,8758856000.0


> **Note**:
>
> In the call to `reset_index` above, `inplace` was deliberately not set to `True`, so that the `drop` parameter can be tested next.

##### With `drop` parameter

In [101]:
mk_df.reset_index(drop=True, inplace=True)
mk_df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,gdp
0,Afghanistan,1952,8425333,Asia,28.801,779.445314,6567086000.0
1,Afghanistan,1957,9240934,Asia,30.332,820.85303,7585449000.0
2,Afghanistan,1962,10267083,Asia,31.997,853.10071,8758856000.0
3,Afghanistan,1967,11537966,Asia,34.02,836.197138,9648014000.0
4,Afghanistan,1972,13079460,Asia,36.088,739.981106,9678553000.0


> **Note**:
>
> The `drop` parameter of `reset_index` is `False` by default, which is why the first call kept the old index as a column named `index`.

#### Set Index

##### Before set index

In [102]:
emp_df

Unnamed: 0,Name,Age,Role,DateOfJoining,Experinace,Name.1,Age.1,Role.1,DateOfJoining.1
0,Alex,31,Senior SD,2021-06-01,1310 days,David,30.0,Junior SD,01-07-2024
1,Ajax,31,Associate Architect,2025-01-01,0 days,Karen,40.0,Associate Architect,01-07-2024
2,Jane,28,Junior SD,2023-03-01,672 days,,,,
3,John,35,Architect,2022-12-01,762 days,,,,
4,Anna,40,V.P.,2000-08-01,8919 days,,,,
5,Bob,30,Junior SD,2024-07-01,184 days,,,,
6,David,30,Junior SD,2024-07-01,184 days,,,,
7,Karen,40,Associate Architect,2024-07-01,184 days,,,,


In [103]:
emp_df.set_index(keys="Name", inplace=True)

##### After set index

In [104]:
emp_df

Unnamed: 0_level_0,Age,Role,DateOfJoining,Experinace,Age,Role,DateOfJoining
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"(Alex, David)",31,Senior SD,2021-06-01,1310 days,30.0,Junior SD,01-07-2024
"(Ajax, Karen)",31,Associate Architect,2025-01-01,0 days,40.0,Associate Architect,01-07-2024
"(Jane, nan)",28,Junior SD,2023-03-01,672 days,,,
"(John, nan)",35,Architect,2022-12-01,762 days,,,
"(Anna, nan)",40,V.P.,2000-08-01,8919 days,,,
"(Bob, nan)",30,Junior SD,2024-07-01,184 days,,,
"(David, nan)",30,Junior SD,2024-07-01,184 days,,,
"(Karen, nan)",40,Associate Architect,2024-07-01,184 days,,,


> **Note**:
>
> `set_index` also has a `drop` parameter, but it is `True` by default, so the column used as the index is removed from the columns.
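
The effect of `drop` in `set_index` can be sketched on a small DataFrame (`crew_df` is a made-up name for this example):

```python
import pandas as pd

crew_df = pd.DataFrame({"Name": ["Alex", "Jane"], "Age": [31, 28]})

# Default drop=True: "Name" moves to the index and leaves the columns.
moved = crew_df.set_index("Name")

# drop=False: "Name" becomes the index and also stays as a column.
kept = crew_df.set_index("Name", drop=False)

print(list(moved.columns))
print(list(kept.columns))
```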

### 9.2 Update Column

In [105]:
# df.index = list(range(1, len(df)+1))
# df

### 9.3 Update Row

In [106]:
# test_df.iloc[3] = ["Row 4 Column 1", "Row 4 Column 2", "Row 4 Column 3"]

## 10 Drop

### 10.1 Deleting Rows (Explicit Indices)

#### Using `axis`

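`mk_df` was reset with `reset_index(drop=True, inplace=True)` in section 9.1, so at this point:

1. Explicit row indices range from 0 to 1703.
2. Implicit row indices range from 0 to 1703.

`drop` removes rows by explicit index.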

##### Before delete

In [107]:
mk_df.loc[801:805]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,gdp
801,Japan,1997,125956499,Asia,80.69,28816.58499,3629636000000.0
802,Japan,2002,127065841,Asia,82.0,28604.5919,3634667000000.0
803,Japan,2007,127467972,Asia,82.603,31656.06806,4035135000000.0
804,Jordan,1952,607914,Asia,43.158,1546.907807,940386900.0
805,Jordan,1957,746559,Asia,45.669,1886.080591,1408070000.0


##### Single row delete

In [108]:
try:
    mk_df.drop(802, axis=0, inplace=True)
except KeyError as err:
    print(err)

##### Multi row delete

In [109]:
try:
    mk_df.drop([803, 804], axis=0, inplace=True)
except KeyError as err:
    print(err)

##### After delete

In [110]:
mk_df.loc[801:805]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,gdp
801,Japan,1997,125956499,Asia,80.69,28816.58499,3629636000000.0
805,Jordan,1957,746559,Asia,45.669,1886.080591,1408070000.0
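
The cells above wrap `drop` in `try`/`except`, presumably so that rerunning a cell after the rows are already gone does not stop the notebook with a `KeyError`. An alternative sketch uses the `errors="ignore"` parameter of `drop`, which skips labels that do not exist; `items_df` and its values are invented.

```python
import pandas as pd

items_df = pd.DataFrame({"value": [10, 20, 30]}, index=[801, 802, 803])

# Label 999 does not exist; with errors="ignore" it is skipped silently.
trimmed_df = items_df.drop([802, 999], errors="ignore")
print(list(trimmed_df.index))
```

The trade-off is that a mistyped label also passes silently, so the `try`/`except` form may be preferable when a missing label should at least be reported.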


#### Using `index` parameter

##### Before delete

In [111]:
mk_df.loc[601:605]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,gdp
601,Guatemala,1957,3640876,Americas,44.142,2617.155967,9528740000.0
602,Guatemala,1962,4208858,Americas,46.954,2750.364446,11575890000.0
603,Guatemala,1967,4690773,Americas,50.016,3242.531147,15209980000.0
604,Guatemala,1972,5149581,Americas,53.738,4031.408271,20760060000.0
605,Guatemala,1977,5703430,Americas,56.029,4879.992748,27832700000.0


##### Single row delete

In [112]:
try:
    mk_df.drop(index=602, inplace=True)
except KeyError as err:
    print(err)

##### Multi row delete

In [113]:
try:
    mk_df.drop(index=range(603, 605), inplace=True)
except KeyError as err:
    print(err)

##### After delete

In [114]:
mk_df.loc[601:605]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,gdp
601,Guatemala,1957,3640876,Americas,44.142,2617.155967,9528740000.0
605,Guatemala,1977,5703430,Americas,56.029,4879.992748,27832700000.0


### 10.2 Deleting Columns

#### Using `axis`

##### Before delete

In [115]:
mk_df.head(2)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,gdp
0,Afghanistan,1952,8425333,Asia,28.801,779.445314,6567086000.0
1,Afghanistan,1957,9240934,Asia,30.332,820.85303,7585449000.0


##### Single column delete

In [116]:
mk_df.drop("population", axis=1, inplace=True)

##### Multi column delete

In [117]:
mk_df.drop(["life_exp", "country"], axis=1, inplace=True)

##### After delete

In [118]:
mk_df.head(2)

Unnamed: 0,year,continent,gdp_cap,gdp
0,1952,Asia,779.445314,6567086000.0
1,1957,Asia,820.85303,7585449000.0


#### Using `columns` parameter

##### Before delete

In [119]:
mk_df.head(2)

Unnamed: 0,year,continent,gdp_cap,gdp
0,1952,Asia,779.445314,6567086000.0
1,1957,Asia,820.85303,7585449000.0


##### Single column delete

In [120]:
mk_df.drop(columns="continent", inplace=True)

##### Multi column delete

In [121]:
mk_df.drop(columns=["year", "gdp_cap"], inplace=True)

##### After delete

In [122]:
mk_df.head(2)

Unnamed: 0,gdp
0,6567086000.0
1,7585449000.0
