## **Pandas - DataFrames**
Probably the most important data structure of pandas is the DataFrame. It's a tabular structure tightly integrated with Series.

In [298]:
# importing required modules / libraries
import numpy as np
import pandas as pd
print(pd.__version__)

2.0.0


### What is a DataFrame?
##### **A DataFrame in Pandas is a 2D, tabular data structure that is similar to an Excel spreadsheet or an SQL table. It consists of rows and columns, where:**
- Columns represent different attributes/features.
- Rows represent individual records/entries.
##### A DataFrame is the most commonly used data structure in Pandas for data analysis.
##### Note that each column of a DataFrame is a Series, so a DataFrame can also be seen as a collection of Series that share the same index.

Creating DataFrames manually can be tedious. Most of the time you'll be pulling the data from a database, a CSV file or the web. Still, you can create a DataFrame by specifying the columns and values.
A DataFrame resembles a 2D NumPy array, with two additions: rows and columns carry labels, and each column can have its own data type.
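
As a minimal sketch of the two ideas above (a DataFrame as a collection of Series, and its relation to a 2D NumPy array), the example below builds one small DataFrame from two Series and another from an array. The names `from_series`, `from_array`, `r0`, `r1`, `a` and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A DataFrame built from two Series that share an index: each Series becomes a column.
population = pd.Series([35.467, 63.951], index=["Canada", "France"])
continent = pd.Series(["America", "Europe"], index=["Canada", "France"])
from_series = pd.DataFrame({"Population": population, "Continent": continent})

# Selecting a column gives back a Series.
print(type(from_series["Population"]))

# A DataFrame built from a 2D NumPy array: the array holds a single dtype,
# and the row and column labels are added on top of it.
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
from_array = pd.DataFrame(arr, index=["r0", "r1"], columns=["a", "b"])
print(from_array)
```

The first DataFrame mixes a numeric and a text column, which a single NumPy array cannot do without falling back to a generic object dtype.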

In [299]:
# create a DataFrame with pd.DataFrame; the name is written in PascalCase because DataFrame is a pandas class
df = pd.DataFrame({
    'Population': [
        35.467, 
        63.951, 
        80.94 , 
        60.665, 
        127.061, 
        64.511, 
        318.523
    ],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

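(The columns argument passed to pd.DataFrame above is optional; when the data is a dictionary, it only selects and orders the columns.)
The .columns attribute of a pandas DataFrame, on the other hand, stores the column names as an Index object. It helps in: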
- Retrieving column names
- Renaming columns
- Modifying column labels dynamically
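
The three uses listed above could look roughly like this. It is only a sketch on a throwaway DataFrame called `demo`; the replacement labels `pop` and `gdp` are invented for the example.

```python
import pandas as pd

demo = pd.DataFrame({"Population": [35.467], "GDP": [1785387]})

# Retrieve the column labels as a plain Python list.
print(demo.columns.tolist())

# Rename every column at once by assigning a list of the same length.
demo.columns = ["pop", "gdp"]

# Modify the labels dynamically, here by upper-casing each one.
demo.columns = demo.columns.str.upper()
print(demo.columns.tolist())
```

Assigning to `.columns` changes the DataFrame itself, so it is best suited to cases where every label should be replaced.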

In [300]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
0,35.467,1785387,9984670,0.913,America
1,63.951,2833687,640679,0.888,Europe
2,80.94,3874437,357114,0.916,Europe
3,60.665,2167744,301336,0.873,Europe
4,127.061,4602367,377930,0.891,Asia
5,64.511,2950039,242495,0.907,Europe
6,318.523,17348075,9525067,0.915,America


DataFrames also have indexes. As you can see in the "table" above, pandas has assigned a numeric, autoincremental index automatically to each "row" in our DataFrame. In our case, we know that each row represents a country, so we'll just reassign the index:

In [301]:
# assign a custom index through .index, just as with a pandas Series
df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States'
]

In [302]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [303]:
# get the columns of the dataframe using .columns
df.columns

Index(['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], dtype='object')

In [304]:
# get the index of the dataframe using .index
df.index

Index(['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
      dtype='object')

In [305]:
# get the size of the DataFrame using .size
# .size is the total number of elements (rows x columns, here 7 x 5)
df.size

35

In [306]:
# get the info about the dataframe using .info
# .info provides a summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Canada to United States
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Population    7 non-null      float64
 1   GDP           7 non-null      int64  
 2   Surface Area  7 non-null      int64  
 3   HDI           7 non-null      float64
 4   Continent     7 non-null      object 
dtypes: float64(2), int64(2), object(1)
memory usage: 336.0+ bytes


In [307]:
df.info

<bound method DataFrame.info of                 Population       GDP  Surface Area    HDI Continent
Canada              35.467   1785387       9984670  0.913   America
France              63.951   2833687        640679  0.888    Europe
Germany             80.940   3874437        357114  0.916    Europe
Italy               60.665   2167744        301336  0.873    Europe
Japan              127.061   4602367        377930  0.891      Asia
United Kingdom      64.511   2950039        242495  0.907    Europe
United States      318.523  17348075       9525067  0.915   America>

Note that there is a difference between df.info and df.info():
##### **df.info**
- It returns the method object (<bound method DataFrame.info>)
- It does NOT execute the function.
##### **df.info()**
- Executes the method and displays DataFrame details.
- This is what you should use to check DataFrame info.

In [308]:
# get the dimensions of the DataFrame using .shape, returned as (rows, columns)
df.shape

(7, 5)

In [309]:
# The .describe() method generates summary statistics for the columns of a DataFrame.
# By default it only covers numeric columns, so the Continent column (object dtype) is not shown.
# As with .info and .info(), .describe without parentheses only returns the method object.
df.describe()

Unnamed: 0,Population,GDP,Surface Area,HDI
count,7.0,7.0,7.0,7.0
mean,107.302571,5080248.0,3061327.0,0.900429
std,97.24997,5494020.0,4576187.0,0.016592
min,35.467,1785387.0,242495.0,0.873
25%,62.308,2500716.0,329225.0,0.8895
50%,64.511,2950039.0,377930.0,0.907
75%,104.0005,4238402.0,5082873.0,0.914
max,318.523,17348080.0,9984670.0,0.916
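
The default output skips text columns. A small sketch of how the `include` parameter of `describe()` changes that, using a throwaway DataFrame named `demo` (not the notebook's `df`):

```python
import pandas as pd

demo = pd.DataFrame({
    "Population": [35.467, 63.951, 80.94],
    "Continent": ["America", "Europe", "Europe"],
})

# Default: only the numeric column is summarised.
numeric_summary = demo.describe()

# include="all" also summarises the text column,
# adding count/unique/top/freq rows for it.
full_summary = demo.describe(include="all")
print(full_summary)
```

Statistics that do not apply to a column, such as the mean of `Continent`, show up as NaN in the combined table.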


In [310]:
# get the datatype of each columns using .dtypes
df.dtypes

Population      float64
GDP               int64
Surface Area      int64
HDI             float64
Continent        object
dtype: object

In [311]:
# get the no. of each datatype column present using .dtypes.value_counts()
df.dtypes.value_counts()

float64    2
int64      2
object     1
Name: count, dtype: int64

### Indexing, Selection and Slicing
Individual columns in the DataFrame can be selected with regular indexing. Each column is represented as a Series:

In [312]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [313]:
# .loc lets you select an entire row by its index label
df.loc["Canada"]

Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object

##### syntax of .loc is df.loc[row_selection, column_selection]

In [314]:
# .loc can also be used to select the entire column by label/custom index name in the dataframe
df.loc[:,"Population"]

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64

In [315]:
# .iloc lets you select an entire row by its integer position (0, 1, 2, ...); -1 is the last row
df.iloc[-1]

Population       318.523
GDP             17348075
Surface Area     9525067
HDI                0.915
Continent        America
Name: United States, dtype: object

In [316]:
# note that we can select each columns by their name as an index
df["Population"]

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64

So **iloc** returns a row (horizontal data) and **df[ column name ]** returns a column (vertical data).

**Note that**: when a single row or column is extracted, the result is a pandas Series. For a row, the column names become the index of that Series, so the row is displayed vertically as label/value pairs.
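
To make the row/column distinction concrete, here is a small sketch on a two-row DataFrame called `demo` (a stand-in for illustration, not the notebook's `df`). It compares the index and dtype of an extracted column with those of an extracted row.

```python
import pandas as pd

demo = pd.DataFrame(
    {"Population": [35.467, 63.951], "Continent": ["America", "Europe"]},
    index=["Canada", "France"],
)

column = demo["Population"]  # one column: all values share a dtype
row = demo.loc["Canada"]     # one row: values come from different columns

print(column.index.tolist())  # the DataFrame's row labels
print(row.index.tolist())     # the DataFrame's column labels
print(column.dtype, row.dtype)
```

Because a row mixes numbers and text, the resulting Series falls back to the generic object dtype, which matches the `dtype: object` shown for `df.loc["Canada"]` earlier.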

Note that the index of the returned Series is the same as the DataFrame one. And its name is the name of the column. If you're working on a notebook and want to see a more DataFrame-like format you can use the to_frame method:

In [317]:
# here we can use the .to_frame() method to display the extracted column or row in a dataframe like format
df['Population'].to_frame()

Unnamed: 0,Population
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,318.523


In [318]:
# note that rows can also be changed to dataframe like format
df.loc["Canada"].to_frame()

Unnamed: 0,Canada
Population,35.467
GDP,1785387
Surface Area,9984670
HDI,0.913
Continent,America


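- Slicing a DataFrame follows the same rules as NumPy arrays and pandas Series, except that label-based slices include the end label.
- A slice inside plain brackets, such as df["Canada":"Japan"] or df[1:3], always selects rows; df["Population":"HDI"] does not slice columns.
- Columns must be sliced with .loc[] or .iloc[], or selected with a list of column names.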

#### **.loc syntax**
- df.loc[:, "start_column":"end_column"]

In [319]:
# slicing the columns using df.loc[:, "start_column":"end_column"]
df.loc[:, "Population":"HDI"]

Unnamed: 0,Population,GDP,Surface Area,HDI
Canada,35.467,1785387,9984670,0.913
France,63.951,2833687,640679,0.888
Germany,80.94,3874437,357114,0.916
Italy,60.665,2167744,301336,0.873
Japan,127.061,4602367,377930,0.891
United Kingdom,64.511,2950039,242495,0.907
United States,318.523,17348075,9525067,0.915


In [320]:
# this is used to select specific columns using their label/custom index name
df[["Population","HDI"]]

Unnamed: 0,Population,HDI
Canada,35.467,0.913
France,63.951,0.888
Germany,80.94,0.916
Italy,60.665,0.873
Japan,127.061,0.891
United Kingdom,64.511,0.907
United States,318.523,0.915


In [321]:
# slicing the rows using the row index names
df["Canada":"Japan"]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia


In [322]:
# slicing the rows using their integer positions
# this form can't be used to slice columns
df[1:3]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe


#### **.iloc syntax**
- df.iloc[row_selection, column_selection]

In [323]:
# slicing using iloc on row index
df.iloc[1:3]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe


In [324]:
# slicing using iloc on column index
df.iloc[:,1:3]

Unnamed: 0,GDP,Surface Area
Canada,1785387,9984670
France,2833687,640679
Germany,3874437,357114
Italy,2167744,301336
Japan,4602367,377930
United Kingdom,2950039,242495
United States,17348075,9525067


In [325]:
# row wise slicing using .loc
df.loc["France":"Italy"]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe


In [326]:
# here slicing the rows and selecting specific columns 
df.loc["France":"Italy","Population"]

France     63.951
Germany    80.940
Italy      60.665
Name: Population, dtype: float64

In [327]:
df.loc["France":"Italy",["Population","GDP","Continent"]]

Unnamed: 0,Population,GDP,Continent
France,63.951,2833687,Europe
Germany,80.94,3874437,Europe
Italy,60.665,2167744,Europe


In [328]:
# slicing both rows and columns using .loc
# note that here Z is also a DataFrame
Z = df.loc["France":"Italy","Population":"HDI"]
Z

Unnamed: 0,Population,GDP,Surface Area,HDI
France,63.951,2833687,640679,0.888
Germany,80.94,3874437,357114,0.916
Italy,60.665,2167744,301336,0.873


In [329]:
# selecting both rows and columns using .iloc
# here slicing both 
df.iloc[1:3, 2:]

Unnamed: 0,Surface Area,HDI,Continent
France,640679,0.888,Europe
Germany,357114,0.916,Europe


In [330]:
# selecting both rows and columns using .iloc
# here slicing rows and selecting columns
df.iloc[1:3,[1,3]]

Unnamed: 0,GDP,HDI
France,2833687,0.888
Germany,3874437,0.916


##### **RECOMMENDED**: Always use loc and iloc to reduce ambiguity, especially with DataFrames that have numeric indexes.
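
The ambiguity is easiest to see on a DataFrame whose index labels are integers that do not match the row positions. The sketch below uses a hypothetical three-row DataFrame `demo` built for this purpose.

```python
import pandas as pd

# Hypothetical frame: the integer labels 2, 0, 1 differ from the positions 0, 1, 2.
demo = pd.DataFrame({"value": ["a", "b", "c"]}, index=[2, 0, 1])

by_label = demo.loc[0, "value"]  # row whose label is 0
by_position = demo.iloc[0, 0]    # first row, whatever its label

# Slices differ too: .loc includes the end label, .iloc excludes the end position.
label_slice = demo.loc[0:1, "value"].tolist()
position_slice = demo.iloc[0:1, 0].tolist()

print(by_label, by_position, label_slice, position_slice)
```

Writing `.loc` or `.iloc` explicitly states whether `0` means a label or a position, so the reader does not have to guess.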

### **Conditional Selection** (Boolean array)
We saw conditional selection applied to Series and it'll work in the same way for DataFrames. After all, a DataFrame is a collection of Series:

In [331]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


##### Remember:
- A condition on its own, such as df["Population"] > 70, returns a Series of booleans.
- Passing that condition inside df[...] or df.loc[...] returns only the rows where it is True.
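
A short sketch of both points, plus a way to combine several conditions, on a small stand-in DataFrame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame(
    {"Population": [35.467, 80.94, 127.061], "Continent": ["America", "Europe", "Asia"]},
    index=["Canada", "Germany", "Japan"],
)

mask = demo["Population"] > 70  # boolean Series, one value per row
selected = demo[mask]           # only the rows where the mask is True

# Combine conditions with & (and), | (or) and ~ (not);
# each condition needs its own parentheses.
both = demo[(demo["Population"] > 70) & (demo["Continent"] == "Asia")]

print(selected.index.tolist(), both.index.tolist())
```

The Python keywords `and` and `or` do not work element-wise on Series, which is why the operators `&` and `|` are used here.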

In [332]:
df["Population"] > 70

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: Population, dtype: bool

In [333]:
# the result contains only the rows whose boolean value is True
df.loc[df["Population"] > 70]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Germany,80.94,3874437,357114,0.916,Europe
Japan,127.061,4602367,377930,0.891,Asia
United States,318.523,17348075,9525067,0.915,America


In [334]:
# the cell above returns every column for the selected rows
# but we can choose which columns to display for those rows
df.loc[df["Population"] > 70 , "Population"]

Germany           80.940
Japan            127.061
United States    318.523
Name: Population, dtype: float64

In [335]:
df.loc[df["Population"] > 70 , ["Population","GDP"]]

Unnamed: 0,Population,GDP
Germany,80.94,3874437
Japan,127.061,4602367
United States,318.523,17348075


### **Dropping stuff**
In contrast to selection, we have "dropping". Instead of pointing out which values you'd like to select, you point out which ones you'd like to drop.  
Note that this operation does not mutate the original DataFrame: it returns a new DataFrame with the changes applied.

In [336]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [337]:
# we can use the .drop function to just remove the rows or columns from the dataframe
# here removing the row
df.drop("Canada")

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


**Note that**: dropping does not delete rows or columns from the original DataFrame. The drop function returns a new DataFrame with the specified changes, leaving the original untouched.

In [338]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [339]:
# here dropping more rows
df.drop(["Canada","Japan"])

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [340]:
# now dropping the columns
df.drop(columns = ["HDI","GDP"])

Unnamed: 0,Population,Surface Area,Continent
Canada,35.467,9984670,America
France,63.951,640679,Europe
Germany,80.94,357114,Europe
Italy,60.665,301336,Europe
Japan,127.061,377930,Asia
United Kingdom,64.511,242495,Europe
United States,318.523,9525067,America


In [341]:
# dropping along a specified axis
# here the columns keyword is not used; the axis argument says what the labels refer to
# axis = 1 means columns, axis = 0 means rows
df.drop(["Population","HDI"],axis = 1)

Unnamed: 0,GDP,Surface Area,Continent
Canada,1785387,9984670,America
France,2833687,640679,Europe
Germany,3874437,357114,Europe
Italy,2167744,301336,Europe
Japan,4602367,377930,Asia
United Kingdom,2950039,242495,Europe
United States,17348075,9525067,America


In [342]:
df.drop(["Canada","Japan"],axis = 0)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [343]:
# axis also accepts the strings "columns" and "rows" ("index" is the documented name for rows)
df.drop(["Canada","Japan"],axis = "rows")

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [344]:
df.drop(["Population","HDI"],axis = "columns")

Unnamed: 0,GDP,Surface Area,Continent
Canada,1785387,9984670,America
France,2833687,640679,Europe
Germany,3874437,357114,Europe
Italy,2167744,301336,Europe
Japan,4602367,377930,Asia
United Kingdom,2950039,242495,Europe
United States,17348075,9525067,America


All these drop methods return a new DataFrame. If you'd like to modify it "in place", you can use the inplace parameter (there's an example below).
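
As a rough illustration of the two ways to keep the result of a drop, the sketch below works on a throwaway DataFrame called `demo`; the variable names `without_hdi` and `result` are arbitrary.

```python
import pandas as pd

demo = pd.DataFrame({"GDP": [1, 2], "HDI": [0.9, 0.8]}, index=["Canada", "Japan"])

# Option 1: keep the returned DataFrame by binding it to a name.
without_hdi = demo.drop(columns=["HDI"])

# Option 2: modify demo itself; with inplace=True the call returns None.
result = demo.drop("Japan", inplace=True)

print(without_hdi.columns.tolist(), demo.index.tolist(), result)
```

Option 1 leaves `demo` untouched, which tends to be easier to reason about when cells are re-run in a notebook.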

### **Operations**
Some basic operations on pandas DataFrames:

In [345]:
df[["Population","HDI"]]

Unnamed: 0,Population,HDI
Canada,35.467,0.913
France,63.951,0.888
Germany,80.94,0.916
Italy,60.665,0.873
Japan,127.061,0.891
United Kingdom,64.511,0.907
United States,318.523,0.915


In [346]:
# here we are dividing the elements of the dataframe using a scalar value
df[["Population","HDI"]] / 100

Unnamed: 0,Population,HDI
Canada,0.35467,0.00913
France,0.63951,0.00888
Germany,0.8094,0.00916
Italy,0.60665,0.00873
Japan,1.27061,0.00891
United Kingdom,0.64511,0.00907
United States,3.18523,0.00915


**Operations with Series** work at a column level, broadcasting down the rows (which can be counter intuitive).  
**Broadcasting** is when a smaller array (Series, scalar, or NumPy array) is automatically expanded to match the shape of a larger DataFrame during operations.

After pandas has aligned the labels, it follows NumPy broadcasting rules to perform element-wise operations efficiently.

In [347]:
# create a Series to add to the DataFrame, to demonstrate broadcasting
crisis = pd.Series([-1_000_000,-0.3],index = ["GDP","HDI"])
crisis

GDP   -1000000.0
HDI         -0.3
dtype: float64

In [348]:
# add crisis to the selected columns of the DataFrame
df[["GDP","HDI"]] + crisis

Unnamed: 0,GDP,HDI
Canada,785387.0,0.613
France,1833687.0,0.588
Germany,2874437.0,0.616
Italy,1167744.0,0.573
Japan,3602367.0,0.591
United Kingdom,1950039.0,0.607
United States,16348075.0,0.615


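What is happening here: the index of the Series (GDP, HDI) is matched against the column names of the DataFrame, and each Series value is added to every row of its column. Because both values are negative, the addition lowers every GDP by 1,000,000 and every HDI by 0.3.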
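
The default alignment is on column labels. If the Series is indexed by row labels instead, the method form `DataFrame.add` with `axis="index"` can be used. The following is a sketch only: `demo` is a two-row stand-in and the `per_row` values are invented.

```python
import pandas as pd

demo = pd.DataFrame(
    {"GDP": [1785387, 2833687], "HDI": [0.913, 0.888]},
    index=["Canada", "France"],
)

# Series indexed by column labels: aligned on columns, applied to every row.
per_column = pd.Series([-1_000_000, -0.3], index=["GDP", "HDI"])
by_column = demo + per_column

# Series indexed by row labels: axis="index" aligns on the row labels instead,
# so each value is applied across all columns of its row.
per_row = pd.Series([10, 20], index=["Canada", "France"])
by_row = demo.add(per_row, axis="index")

print(by_column)
print(by_row)
```

Using `demo + per_row` directly would try to match `Canada` and `France` against the column names and produce only NaN values.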

### **Modifying DataFrames**
It's simple and intuitive: you can add columns, or replace values for columns, without issues.  
These operations modify the DataFrame in place.

#### Adding a column

In [349]:
langs = pd.Series(
    ["French","German","Italian"],
    index = ["France","Germany","Italy"],
    name = "Language"
)
langs
# note: when langs is assigned as a column, pandas matches the index of the Series with the index of the DataFrame
# and fills only the matching rows; the other rows get NaN

France      French
Germany     German
Italy      Italian
Name: Language, dtype: object

In [350]:
# add a column to the DataFrame
df["Language"] = langs

In [351]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,
France,63.951,2833687,640679,0.888,Europe,French
Germany,80.94,3874437,357114,0.916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,
United Kingdom,64.511,2950039,242495,0.907,Europe,
United States,318.523,17348075,9525067,0.915,America,


#### Replacing the value per column

In [352]:
# assigning a single value instead of a Series sets that value for every row of the column
df["Language"] = "English"

In [353]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,English
Germany,80.94,3874437,357114,0.916,Europe,English
Italy,60.665,2167744,301336,0.873,Europe,English
Japan,127.061,4602367,377930,0.891,Asia,English
United Kingdom,64.511,2950039,242495,0.907,Europe,English
United States,318.523,17348075,9525067,0.915,America,English


Note that an assignment with = on df[...] modifies the DataFrame itself.

#### Renaming columns

In [354]:
# rename columns and index labels
df.rename(
    columns = {
        "HDI" : "Human Development Index",
        "Annual Popcorn Consumption" : "APC"
    },index={
        "United States" : "USA",
        "United Kingdom" : "UK",
        "Argentina" : "AR"
    }
)
# note that the DataFrame has no index label "Argentina" and no column "Annual Popcorn Consumption",
# yet both are mentioned in rename() in this cell
# this does not cause an error: labels that are not found are ignored
# note that this operation does not modify the original df either; it returns a new DataFrame

Unnamed: 0,Population,GDP,Surface Area,Human Development Index,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,English
Germany,80.94,3874437,357114,0.916,Europe,English
Italy,60.665,2167744,301336,0.873,Europe,English
Japan,127.061,4602367,377930,0.891,Asia,English
UK,64.511,2950039,242495,0.907,Europe,English
USA,318.523,17348075,9525067,0.915,America,English


In [355]:
# original dataframe is not changed
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,English
Germany,80.94,3874437,357114,0.916,Europe,English
Italy,60.665,2167744,301336,0.873,Europe,English
Japan,127.061,4602367,377930,0.891,Asia,English
United Kingdom,64.511,2950039,242495,0.907,Europe,English
United States,318.523,17348075,9525067,0.915,America,English


In [356]:
# we can use some inbuilt function along with rename()
df.rename(index = str.upper)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
CANADA,35.467,1785387,9984670,0.913,America,English
FRANCE,63.951,2833687,640679,0.888,Europe,English
GERMANY,80.94,3874437,357114,0.916,Europe,English
ITALY,60.665,2167744,301336,0.873,Europe,English
JAPAN,127.061,4602367,377930,0.891,Asia,English
UNITED KINGDOM,64.511,2950039,242495,0.907,Europe,English
UNITED STATES,318.523,17348075,9525067,0.915,America,English


In [357]:
# using lambda function using .rename()
df.rename(index = lambda x : x.lower())

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
canada,35.467,1785387,9984670,0.913,America,English
france,63.951,2833687,640679,0.888,Europe,English
germany,80.94,3874437,357114,0.916,Europe,English
italy,60.665,2167744,301336,0.873,Europe,English
japan,127.061,4602367,377930,0.891,Asia,English
united kingdom,64.511,2950039,242495,0.907,Europe,English
united states,318.523,17348075,9525067,0.915,America,English


#### 'inplace' parameter in the drop function
The inplace parameter in df.drop() controls whether changes are applied directly to the DataFrame or if a new modified DataFrame is returned.

In [358]:
# using the inplace parameter
df.drop(columns = ["Language"], inplace = True)

In [359]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


#### Adding values

In [360]:
# note that DataFrame.append() was deprecated in pandas 1.4.0 and removed in 2.0.0, so use pd.concat instead
new_row = pd.Series({
    "Population" : 3,
    "GDP" : 5
},name = "China")
pd.concat([df,new_row.to_frame().T])

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670.0,0.913,America
France,63.951,2833687,640679.0,0.888,Europe
Germany,80.94,3874437,357114.0,0.916,Europe
Italy,60.665,2167744,301336.0,0.873,Europe
Japan,127.061,4602367,377930.0,0.891,Asia
United Kingdom,64.511,2950039,242495.0,0.907,Europe
United States,318.523,17348075,9525067.0,0.915,America
China,3.0,5,,,


🔹 Breakdown of new_row.to_frame().T:
- .to_frame() → Converts new_row from a pd.Series to a single-column DataFrame.
- .T (Transpose) → Converts it into a single-row DataFrame.
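
The two steps can be checked by looking at the shape and labels after each one. This sketch reuses the same `new_row` Series as the cell above; `as_column` and `as_row` are names chosen for the example.

```python
import pandas as pd

new_row = pd.Series({"Population": 3, "GDP": 5}, name="China")

as_column = new_row.to_frame()  # index: Population, GDP; one column named "China"
as_row = as_column.T            # index: "China"; columns: Population, GDP

print(as_column.shape, as_row.shape)
print(as_row)
```

Only the transposed form has column labels that match those of `df`, which is what `pd.concat` needs in order to stack it as a new row.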

In [361]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [362]:
# we can just add a row using .loc
df.loc["India"] = pd.Series(
    {
        "Population" : 956.854,
        "GDP" : 5681289,
        "Surface Area" : 545624,
        "Continent" : "Asia"
    }
)
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America
India,956.854,5681289,545624,,Asia


#### More radical index changes

In [363]:
# df.reset_index() resets the index of a DataFrame,
# converting the index into a regular column and replacing it with a default integer index (0, 1, 2, ...).
df.reset_index()

Unnamed: 0,index,Population,GDP,Surface Area,HDI,Continent
0,Canada,35.467,1785387,9984670,0.913,America
1,France,63.951,2833687,640679,0.888,Europe
2,Germany,80.94,3874437,357114,0.916,Europe
3,Italy,60.665,2167744,301336,0.873,Europe
4,Japan,127.061,4602367,377930,0.891,Asia
5,United Kingdom,64.511,2950039,242495,0.907,Europe
6,United States,318.523,17348075,9525067,0.915,America
7,India,956.854,5681289,545624,,Asia


In [364]:
# df.set_index(keys, drop=True, inplace=False) sets the given column as the index of the DataFrame, replacing the existing index (here the country names).
df.set_index('Population')

Unnamed: 0_level_0,GDP,Surface Area,HDI,Continent
Population,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
35.467,1785387,9984670,0.913,America
63.951,2833687,640679,0.888,Europe
80.94,3874437,357114,0.916,Europe
60.665,2167744,301336,0.873,Europe
127.061,4602367,377930,0.891,Asia
64.511,2950039,242495,0.907,Europe
318.523,17348075,9525067,0.915,America
956.854,5681289,545624,,Asia
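
In the output above the country names are gone, because `set_index` replaces the existing index. One possible way to keep them, sketched on a two-row stand-in called `demo`, is to turn the index into a column first. The index name `Country` is an assumption added for the example; the notebook's index has no name.

```python
import pandas as pd

demo = pd.DataFrame(
    {"Population": [35.467, 63.951], "GDP": [1785387, 2833687]},
    index=["Canada", "France"],
)
demo.index.name = "Country"  # hypothetical name for the index

# reset_index() turns the named index into a "Country" column,
# so it survives when another column becomes the index.
by_population = demo.reset_index().set_index("Population")
print(by_population)
```

Both calls return new DataFrames, so `demo` still has the country names as its index afterwards.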


#### Creating columns from other columns
Altering a DataFrame often involves combining different columns into another. For example, in our Countries analysis, we could try to calculate the "GDP per capita", which is just, GDP / Population.

In [365]:
df[['Population', 'GDP']]

Unnamed: 0,Population,GDP
Canada,35.467,1785387
France,63.951,2833687
Germany,80.94,3874437
Italy,60.665,2167744
Japan,127.061,4602367
United Kingdom,64.511,2950039
United States,318.523,17348075
India,956.854,5681289


In [366]:
# note: this is Population / GDP, the inverse of GDP per capita
df["Population"] / df["GDP"]

Canada            0.000020
France            0.000023
Germany           0.000021
Italy             0.000028
Japan             0.000028
United Kingdom    0.000022
United States     0.000018
India             0.000168
dtype: float64

In [367]:
# to get the GDP per capita, divide the GDP column by the Population column
df["GDP per capita"] = df["GDP"] / df["Population"]
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,GDP per capita
Canada,35.467,1785387,9984670,0.913,America,50339.385908
France,63.951,2833687,640679,0.888,Europe,44310.284437
Germany,80.94,3874437,357114,0.916,Europe,47868.013343
Italy,60.665,2167744,301336,0.873,Europe,35733.025633
Japan,127.061,4602367,377930,0.891,Asia,36221.712406
United Kingdom,64.511,2950039,242495,0.907,Europe,45729.239975
United States,318.523,17348075,9525067,0.915,America,54464.12033
India,956.854,5681289,545624,,Asia,5937.466949


Note that:  
df["value"] with a single label always refers to a column, not a row.  
To work with rows by label, use .loc[].  
You can also access or add a column with .loc[:, "column name"].  
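
A compact sketch of these rules on a two-row stand-in named `demo`. The `Rank` column and its values are invented purely to show column creation through `.loc`.

```python
import pandas as pd

demo = pd.DataFrame(
    {"Population": [35.467, 63.951], "GDP": [1785387, 2833687]},
    index=["Canada", "France"],
)

# Column access and creation with plain brackets.
demo["GDP per capita"] = demo["GDP"] / demo["Population"]

# The same idea with .loc: ":" selects every row, the label names the column.
demo.loc[:, "Rank"] = [1, 2]  # hypothetical values for illustration

# Row access needs .loc (or .iloc); demo["Canada"] would raise a KeyError.
canada = demo.loc["Canada"]
print(canada)
```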

### **Statistical info**
You've already seen the describe method, which gives you a good "summary" of the DataFrame. Let's explore other methods in more detail:

In [368]:
#✔ Returns the first 5 rows of the DataFrame (default).
#✔ You can specify how many rows you want, e.g., df.head(3) for 3 rows.
df.head()

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,GDP per capita
Canada,35.467,1785387,9984670,0.913,America,50339.385908
France,63.951,2833687,640679,0.888,Europe,44310.284437
Germany,80.94,3874437,357114,0.916,Europe,47868.013343
Italy,60.665,2167744,301336,0.873,Europe,35733.025633
Japan,127.061,4602367,377930,0.891,Asia,36221.712406


In [369]:
#✔ Generates summary statistics for numeric columns.
#✔ Includes count, mean, standard deviation, min, max, and quartiles (25%, 50%, 75%).
df.describe()

Unnamed: 0,Population,GDP,Surface Area,HDI,GDP per capita
count,8.0,8.0,8.0,7.0,8.0
mean,213.4965,5155378.0,2746864.0,0.900429,40075.406123
std,313.566071,5090911.0,4329081.0,0.016592,15222.705086
min,35.467,1785387.0,242495.0,0.873,5937.466949
25%,63.1295,2667201.0,343169.5,0.8895,36099.540713
50%,72.7255,3412238.0,461777.0,0.907,45019.762206
75%,174.9265,4872098.0,2861776.0,0.914,48485.856484
max,956.854,17348080.0,9984670.0,0.916,54464.12033


In [370]:
#✔ Extracts the "Population" column as a Pandas Series.
#✔ Now, population can be used for analysis.
population = df['Population']

In [371]:
population

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
India             956.854
Name: Population, dtype: float64

In [372]:
#✔ Finds the minimum and maximum values in the "Population" column.
population.min(), population.max()

(35.467, 956.854)

In [373]:
# ✔ Adds up all values in the "Population" column.
population.sum()

1707.972

In [374]:
# ✔ Manually computes the mean (average).
population.sum() / len(population)

213.4965

In [376]:
# ✔ Computes the mean (average) of the "Population" column.
population.mean()

213.4965

In [377]:
# ✔ Computes the standard deviation (how much values deviate from the mean).
population.std()

313.5660709378943
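
The value above is the sample standard deviation: pandas divides by n - 1 by default (`ddof=1`), whereas NumPy's `np.std` divides by n. The sketch below rebuilds the same eight population values as a standalone Series to compare the two conventions.

```python
import numpy as np
import pandas as pd

population = pd.Series(
    [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523, 956.854]
)

sample_std = population.std()              # pandas default: ddof=1, divide by n - 1
population_std = population.std(ddof=0)    # divide by n
numpy_std = np.std(population.to_numpy())  # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```

The difference matters mostly for small samples like this one, where n - 1 and n are far apart in relative terms.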

In [379]:
# ✔ Finds the median (middle value when sorted).
population.median()

72.7255

In [380]:
# ✔ Similar to df.describe() but for a single column.
# ✔ Shows count, mean, std, min, max, and quartiles.
population.describe()

count      8.000000
mean     213.496500
std      313.566071
min       35.467000
25%       63.129500
50%       72.725500
75%      174.926500
max      956.854000
Name: Population, dtype: float64

In [381]:
# ✔ Finds the 25th percentile (Q1, lower quartile).
# ✔ 25% of values are below this number.
population.quantile(.25)

63.1295
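
The result is not one of the original values because `quantile` interpolates linearly between neighbouring sorted values by default. A manual version of that calculation, written as a sketch on a standalone copy of the eight population values:

```python
import pandas as pd

population = pd.Series(
    [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523, 956.854]
)

ordered = population.sort_values().to_list()
position = (len(ordered) - 1) * 0.25  # 1.75: between sorted items 1 and 2
lower = ordered[int(position)]        # 60.665
upper = ordered[int(position) + 1]    # 63.951
manual = lower + (upper - lower) * (position - int(position))

print(manual, population.quantile(0.25))
```

Here the quantile sits three quarters of the way from 60.665 to 63.951, which gives 63.1295.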

In [382]:
# ✔ Finds multiple quantiles at 20%, 40%, 60%, 80%, and 100% (max value).
population.quantile([.2, .4, .6, .8, 1])

0.2     61.9794
0.4     64.3990
0.6     90.1642
0.8    241.9382
1.0    956.8540
Name: Population, dtype: float64