<a href="https://colab.research.google.com/github/SaiArjunSairamje/Python-Scaler/blob/main/Pandas_1%20(Dataset%3A%20McKinsey_LifeExpentancy_Insights)%20.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing Libraries and Dataset "McKinsey LifeExpentancy Insights"
***

In [2]:
import numpy as np
import pandas as pd

In [3]:
# wget is not a Python command; it is a non-interactive command-line utility on Unix-like systems (such as Linux or macOS)
# that downloads files from web servers over the HTTP, HTTPS, or FTP protocols.
# In a notebook, the leading "!" runs the rest of the line as a shell command.
# The -O option sets the name of the local file the download is saved to.

!wget "https://drive.google.com/uc?export=download&id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_" -O mckinsey.csv

--2023-07-10 06:05:15--  https://drive.google.com/uc?export=download&id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_
Resolving drive.google.com (drive.google.com)... 74.125.137.100, 74.125.137.102, 74.125.137.113, ...
Connecting to drive.google.com (drive.google.com)|74.125.137.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0s-68-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/27e7lejmmq9arh2nnl7n0mg9pnm0cni2/1688969100000/14302370361230157278/*/1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_?e=download&uuid=a7342132-db0d-4317-bda8-d14225ddb75e [following]
--2023-07-10 06:05:16--  https://doc-0s-68-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/27e7lejmmq9arh2nnl7n0mg9pnm0cni2/1688969100000/14302370361230157278/*/1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_?e=download&uuid=a7342132-db0d-4317-bda8-d14225ddb75e
Resolving doc-0s-68-docs.googleusercontent.com (doc-0s-68-docs.googleusercontent.com)... 142.250.141.132, 2607

# How to read data from a CSV (Comma-Separated Values) file into a DataFrame?
***
 By default, pd.read_csv() assumes that the CSV file has a header row with column names. If your CSV file doesn't have a header, you can specify it using the header parameter.

 For example: **df = pd.read_csv('filename.csv', header=None)**
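As a small illustration of the header parameter, the sketch below reads a header-less CSV from an in-memory string rather than a file. The sample rows and the column names passed to `names` are assumptions made up for this example; they are not part of the dataset file.

```python
import io
import pandas as pd

# Hypothetical header-less CSV content, kept in memory for the demo
raw = "Afghanistan,1952,28.801\nAlbania,1952,55.230\n"

# header=None: the first line is treated as data; columns get integer labels
no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: supply the column names yourself for a header-less file
named = pd.read_csv(io.StringIO(raw), header=None,
                    names=["country", "year", "life_exp"])

print(no_header.columns.tolist())
print(named.columns.tolist())
```

Without `header=None`, the first data row would be consumed as column names, so one record would silently go missing.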

In [4]:
df = pd.read_csv("mckinsey.csv")

**Why do we need Pandas when we have NumPy ???**

While NumPy is a powerful library for numerical computations and handling multi-dimensional arrays, it has certain limitations that led to the development of Pandas. Here are some disadvantages of NumPy that Pandas addresses:

> **Lack of Native Support for Tabular Data**:          
**NumPy** does not have a built-in data structure specifically designed for tabular data. It primarily focuses on n-dimensional arrays, which are more suitable for **homogeneous numerical data**. This limitation makes it less intuitive and convenient to work with structured datasets, such as those found in relational databases or spreadsheet-like formats.

> **Limited Labeling and Column Handling**:           
**NumPy** arrays do not provide built-in labeling for columns and have limited capabilities for handling **heterogeneous** or **mixed data types**. This can make it challenging to work with datasets that require column-specific operations, handling missing values, or dealing with different data types within a single array.

Hence, we make use of **Pandas** which is suitable for **heterogeneous data**.
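The difference can be seen on a single made-up record. This is only a sketch with assumed toy values: the NumPy array is forced into one data type, while the DataFrame keeps a separate data type per column.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixing text and numbers
# coerces every element to a string
arr = np.array([["Afghanistan", 1952, 28.801]])
print(arr.dtype)   # a Unicode string dtype such as <U32

# A DataFrame stores each column with its own dtype
frame = pd.DataFrame({"country": ["Afghanistan"],
                      "year": [1952],
                      "life_exp": [28.801]})
print(frame.dtypes)
```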

In [5]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [6]:
type(df)

pandas.core.frame.DataFrame

# Data Structure:
***
Pandas provides two main data structures: **Series** and **DataFrame**.

**(1) Series:**   
A Series is a one-dimensional labeled array that can hold any data type (integers, floats, strings, etc.). It is similar to a column in a spreadsheet or a dictionary in Python. Each element in a Series has a label, and together these labels form the **index**.

**(2) DataFrame:**               
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a table in a relational database or a spreadsheet. You can think of it as a collection of Series objects that share the same index.
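A minimal sketch of the two structures, using assumed toy values and hypothetical variable names (`life_exp`, `population`, `frame`), shows how two Series that share an index combine into one DataFrame:

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
life_exp = pd.Series([28.801, 30.332], index=[1952, 1957], name="life_exp")
population = pd.Series([8425333, 9240934], index=[1952, 1957], name="population")

# A DataFrame: several Series side by side that share the same index
frame = pd.DataFrame({"life_exp": life_exp, "population": population})

print(life_exp.loc[1957])          # look up one value by its label
print(type(frame["population"]))   # each column of the DataFrame is a Series
print(frame.index.tolist())
```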








In [7]:
# How to access a particular column in Pandas ???
# Series

df["country"]

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

In [8]:
type(df["country"])

pandas.core.series.Series

In [9]:
# How to access multiple columns in Pandas ???
# DataFrame

df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [10]:
type(df)

pandas.core.frame.DataFrame

# Exploring and Understanding Data with Pandas
***
Exploring Data:

> **df.info()**: Understanding DataFrame structure and summary         
> **df.head()**: Viewing the first few rows of a DataFrame             
> **df.tail()**: Viewing the last few rows of a DataFrame

In [11]:
# df.info() provides a summary of the DataFrame, including the number of rows & columns, data types of each column, non-null counts, and memory usage.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     1704 non-null   object 
 1   year        1704 non-null   int64  
 2   population  1704 non-null   int64  
 3   continent   1704 non-null   object 
 4   life_exp    1704 non-null   float64
 5   gdp_cap     1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB


In [12]:
# How to access multiple rows in Pandas ???
# df.head() displays the first few rows of the DataFrame (by default, the first five rows).

df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [13]:
# df.tail() displays the last few rows of the DataFrame (again, by default, the last five rows).

df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298


# How can we create a Dataframe from scratch?
***
There are multiple ways to create a DataFrame from scratch in Pandas. Here are 2 commonly used approaches:

> **`Approach (1):` Creating a DataFrame from a List of Lists**  (**`NOTE:`** Widely used approach)         

> **`Approach (2):` Creating a DataFrame from a Dictionary**

**Approach (1): Creating a DataFrame from a List of Lists:**  (**`NOTE:`** Widely used approach)

You can create a DataFrame by passing a list of lists to the **pd.DataFrame()** constructor. Each inner list represents a row of data, and the outer list contains these rows.

In this approach, you explicitly specify the column names using the columns parameter. This is useful when you want to control the column names or if the data does not contain column names by default.

In [14]:
pd.DataFrame([['Afghanistan',1952, 8425333, 'Asia', 28.801, 779.445314 ],
              ['Afghanistan',1957, 9240934, 'Asia', 30.332, 820.853030 ],
              ['Afghanistan',1962, 10267083, 'Asia', 31.997, 853.100710 ]],
             columns = ['country','year','population','continent','life_exp','gdp_cap'])

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071


**`NOTE:`** We always need **two sets of brackets** (a list of lists), even for a single row; otherwise Pandas raises an error, as shown below.


In [15]:
pd.DataFrame(['Afghanistan',1952, 8425333, 'Asia', 28.801, 779.445314 ],
             columns = ['country','year','population','continent','life_exp','gdp_cap'])

ValueError: ignored
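The note about double brackets can be checked with a single row. In this sketch the names `columns`, `row`, and `one_row` are introduced only for illustration:

```python
import pandas as pd

columns = ["country", "year", "population", "continent", "life_exp", "gdp_cap"]
row = ["Afghanistan", 1952, 8425333, "Asia", 28.801, 779.445314]

# Wrapping the row in an outer list makes it one row of six values
one_row = pd.DataFrame([row], columns=columns)
print(one_row.shape)

# A flat list is read as six rows of one value each,
# which cannot match six column names
try:
    pd.DataFrame(row, columns=columns)
except ValueError as exc:
    print("ValueError:", exc)
```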

**Approach (2): Creating a DataFrame from a Dictionary:**

You can create a DataFrame by passing a dictionary to the pd.DataFrame() constructor. The keys of the dictionary represent the column names, and the values can be lists, arrays, or Series containing the data.

In [17]:
pd.DataFrame({'country':['Afghanistan', 'Afghanistan'],
              'year':[1952,1957],
              'population':[8425333, 9240934],
              'continent':['Asia', 'Asia'],
              'life_exp':[28.801, 30.332],
              'gdp_cap':[779.445314, 820.853030]})

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303


# Some operations you can perform on "Columns" of a Pandas DataFrame
***
**How to access only the Column names ???**   

There are 2 ways and they are as shown below:   
> **`Approach (1):` df.columns**     
      
> **`Approach (2):` df.keys()**

**Approach (1): df.columns** returns an Index object containing the column names of the DataFrame. It provides a list of all the column names present in the DataFrame.


In [18]:
df.columns

Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')

**Approach (2): df.keys()** is an alternative way to retrieve the column names of a DataFrame. It also returns an Index object with the column names.


In [19]:
df.keys()

Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')

**How to convert a single column from "Series" data type into "DataFrame" data type ???**
> **df["country"]** -> Series

> **df[["country"]]** -> DataFrame

**df["country"]** retrieves the column named "country" from the DataFrame and returns it as a **Series**. This allows you to access and work with the data specifically in that column.


In [20]:
df["country"]

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

In [21]:
type(df["country"])

pandas.core.series.Series

**df[["country"]]** retrieves the column named "country" from the DataFrame and returns it as a **DataFrame (with a single column)**. The double brackets are used to create a DataFrame instead of a Series.


In [22]:
df[["country"]]

Unnamed: 0,country
0,Afghanistan
1,Afghanistan
2,Afghanistan
3,Afghanistan
4,Afghanistan
...,...
1699,Zimbabwe
1700,Zimbabwe
1701,Zimbabwe
1702,Zimbabwe


In [23]:
type(df[["country"]])

pandas.core.frame.DataFrame

**How to access multiple columns in a table ???**

The code **df[["country", "life_exp"]]** retrieves a subset of the DataFrame df containing the columns "country" and "life_exp". It returns a new DataFrame with only those two columns. (`NOTE`: This is similar to selecting several elements at once with a list of indices in NumPy.)


In [24]:
df[["country" , "life_exp"]]

Unnamed: 0,country,life_exp
0,Afghanistan,28.801
1,Afghanistan,30.332
2,Afghanistan,31.997
3,Afghanistan,34.020
4,Afghanistan,36.088
...,...,...
1699,Zimbabwe,62.351
1700,Zimbabwe,60.377
1701,Zimbabwe,46.809
1702,Zimbabwe,39.989


**How to access the different/unique values from a Column (i.e. in a Series) ???**

> **df["country"].unique()**   

> **df["country"].nunique()**   
         
> **df["country"].value_counts()**

**df["country"].unique()** returns an array of unique values present in the "country" column of the DataFrame df. It provides a list of all the distinct country names found in that column.


In [25]:
df["country"].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
       'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
       'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
       'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',
       'Korea, Rep.', 'Kuwait', 'Leba

**df["country"].nunique()** returns the count of distinct (unique) values in the "country" column of the DataFrame df. It provides the number of different countries present in that column.


In [26]:
df["country"].nunique()

142

In [27]:
df["year"].nunique()

12

**df["country"].value_counts()** returns a **Series** with the unique values in the "country" column as **indices** and their corresponding **counts** as values. It gives you the frequency distribution of each unique country in that column.



In [28]:
# Very IMPORTANT to know...

df["country"].value_counts()

Afghanistan          12
Pakistan             12
New Zealand          12
Nicaragua            12
Niger                12
                     ..
Eritrea              12
Equatorial Guinea    12
El Salvador          12
Egypt                12
Zimbabwe             12
Name: country, Length: 142, dtype: int64
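The three methods are easier to compare on a column small enough to count by hand. The Series below is a made-up toy column, not data from the dataset:

```python
import pandas as pd

# Hypothetical toy column
countries = pd.Series(["India", "Chad", "India", "Peru", "India", "Chad"])

print(countries.unique())    # distinct values, in order of first appearance
print(countries.nunique())   # number of distinct values

counts = countries.value_counts()   # frequency of each value, most frequent first
print(counts)
```

The counts always add up to the number of non-missing rows, which makes **value_counts()** a quick sanity check on a categorical column.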

In [29]:
df.head(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138


### **How to Change or Rename a particular Column ???**

When using the **df.rename()** method in Pandas, you can rename columns using 2 approaches:

> **`Approach (1):` df.rename({"key1":"value1", "key2":"value2"}, axis = 1/0, inplace = True/False)**    

> **`Approach (2):` df.rename(columns = {"key1":"value1", "key2":"value2"}, inplace = True/False)**

Let's explore this step-by-step:

You can rename columns by providing a dictionary mapping the old column names to the new column names. The method returns a new DataFrame with the updated column names.

> **Approach (1): df.rename({"population": "Population", "country": "Country"})** is a first attempt at renaming the columns "population" and "country" to "Population" and "Country", respectively. Like every call to **df.rename()** without **inplace=True**, it returns a new DataFrame.

In this example, the mapping is passed without an **axis** argument, so **df.rename()** uses the default **axis=0**.


In [30]:
# Why haven't the changes been reflected in the output ???

df.rename({"population": "Population", "country": "Country"})  # default axis = 0

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


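**Why haven't the changes been reflected in the output ???**

This is where the **axis** parameter comes into the picture. With the default **axis=0**, the dictionary keys are matched against the row index labels; no row is labelled "population" or "country", so nothing is renamed and no error is raised. There are 2 ways to specify the axis parameter in the **df.rename()** method: **axis=0** (default) and **axis=1**.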

> **axis=0** (default): The operation is applied along the **rows (vertically)**. It means that the method will work on the rows of the DataFrame, such as **renaming index labels**.      

> **axis=1**: The operation is applied along the **columns (horizontally)**. It means that the method will work on the columns of the DataFrame, such as **renaming column names**.      

By specifying **axis=1**, you can target column-related operations, while **axis=0** targets row-related operations.
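A toy frame makes the effect of each axis visible. This is only a sketch: the frame `toy` and its labels are hypothetical, chosen so that the label "year" exists both as a row label and as a column name.

```python
import pandas as pd

# Hypothetical toy frame: "year" is both a row label and a column name
toy = pd.DataFrame({"year": [1952, 1957], "pop": [10, 20]},
                   index=["year", "other"])

by_rows = toy.rename({"year": "YEAR"})           # default axis=0: row labels
by_cols = toy.rename({"year": "YEAR"}, axis=1)   # axis=1: column names

print(by_rows.index.tolist(), by_rows.columns.tolist())
print(by_cols.index.tolist(), by_cols.columns.tolist())
```

Keys that match no label on the chosen axis are skipped silently by default, which is why a rename on the wrong axis produces no error.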

You can now notice the difference as shown below:

> **df.rename({"population": "Population", "country": "Country"}, axis=1)** renames the columns "population" and "country" to "Population" and "Country", respectively, with **axis=1**. This operation modifies the column names in the DataFrame and returns a new DataFrame with the updated column names.

In [31]:
df.rename({"population": "Population", "country": "Country"}, axis = 1)   # returns a renamed copy; df itself is unchanged

Unnamed: 0,Country,year,Population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [32]:
# Again, why hasn't the change been reflected in the original DataFrame ???

df.head(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138


**Again, why hasn't the change been reflected in the original DataFrame ???**

It's because the change was temporary: the call only **returned a renamed copy of the DataFrame** and did not change the **original DataFrame**. We can fix this as shown below:

> **inplace = True/False**                                    

When using the **inplace** parameter in Pandas, it controls whether the modifications should be made directly to the existing DataFrame or if a new DataFrame should be returned.

When **inplace=True** is set, the modification is made directly to the existing DataFrame, and the method does not return a new DataFrame. It modifies the DataFrame in-place, meaning the changes are made to the **original DataFrame** itself.
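The sketch below contrasts the two styles on a hypothetical one-row frame named `toy`. It also shows that a call with **inplace=True** returns `None`, so its result should not be assigned back to the variable.

```python
import pandas as pd

# Hypothetical one-row frame
toy = pd.DataFrame({"country": ["Chad"], "population": [2682462]})

# Without inplace, rename returns a new DataFrame; keep it by assigning it
renamed = toy.rename(columns={"country": "Country"})

# With inplace=True, the original object is changed and the call returns None
result = toy.rename(columns={"population": "Population"}, inplace=True)

print(renamed.columns.tolist())
print(toy.columns.tolist())
print(result)
```

Reassigning the returned copy (`toy = toy.rename(...)`) is an equally valid way to make the change stick.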



In [33]:
df.rename({"population": "Population", "country": "Country"}, axis = 1, inplace = True)

In [34]:
df.head(4)

Unnamed: 0,Country,year,Population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138


When **inplace=False** (or not specified), the modification does not affect the original DataFrame. Instead, a **new DataFrame** is returned with the modifications applied. The original DataFrame remains unchanged.


In [35]:
df.rename({"Population": "population", "Country": "country"}, axis = 1, inplace = False)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [36]:
df.head(4)

Unnamed: 0,Country,year,Population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138


**If remembering the axis is difficult, there is a 2nd approach for renaming the column name in which we don't have to write the "axis" parameter as shown below:**

> **Approach (2): df.rename(columns={"Country": "country"})**  This line of code renames the column "Country" to "country" in the DataFrame **df** and returns a new DataFrame with the modified column name. However, the original DataFrame **df** remains unchanged unless you assign the result back to **df** or another variable.

In [37]:
df.rename(columns = {"Country" : "country"})

Unnamed: 0,country,year,Population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [38]:
df.head(4)  # the change was not permanent

Unnamed: 0,Country,year,Population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138


**As the change was not permanent, let's fix it.**

> **df.rename(columns={"Country": "country", "Population": "population"}, inplace=True)**:

This line of code renames the columns "Country" and "Population" back to "country" and "population" in the DataFrame **df** and modifies the original DataFrame directly. The **inplace=True** parameter ensures that the changes are made in place, and no new DataFrame is returned.

In [39]:
df.rename(columns = {"Country" : "country", "Population" : "population"}, inplace = True)

In [40]:
df.head(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138


### **How to Create a new column ???**

To create a new column in a Pandas DataFrame, assign a value or a calculated expression to a new column name using square-bracket notation. The example below creates a new column named **"year + 7"** that adds **7** to the values in the **"year"** column.

The statement **df["year + 7"] = df["year"] + 7** creates the column "year + 7" in the DataFrame **df** and fills it with each value of the "year" column plus 7. The DataFrame keeps all of its original columns and gains the newly created column "year + 7" with the calculated values.


In [41]:
# Header = Value

df["year + 7"] = df["year"] + 7

In [42]:
df.head(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year + 7
0,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959
1,Afghanistan,1957,9240934,Asia,30.332,820.85303,1964
2,Afghanistan,1962,10267083,Asia,31.997,853.10071,1969
3,Afghanistan,1967,11537966,Asia,34.02,836.197138,1974


In [43]:
df["gdp"] = df["gdp_cap"] * df["population"]

In [44]:
df.head(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year + 7,gdp
0,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6567086000.0
1,Afghanistan,1957,9240934,Asia,30.332,820.85303,1964,7585449000.0
2,Afghanistan,1962,10267083,Asia,31.997,853.10071,1969,8758856000.0
3,Afghanistan,1967,11537966,Asia,34.02,836.197138,1974,9648014000.0


In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     1704 non-null   object 
 1   year        1704 non-null   int64  
 2   population  1704 non-null   int64  
 3   continent   1704 non-null   object 
 4   life_exp    1704 non-null   float64
 5   gdp_cap     1704 non-null   float64
 6   year + 7    1704 non-null   int64  
 7   gdp         1704 non-null   float64
dtypes: float64(3), int64(3), object(2)
memory usage: 106.6+ KB


In [46]:
df.columns

Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap',
       'year + 7', 'gdp'],
      dtype='object')

In [47]:
df.keys()

Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap',
       'year + 7', 'gdp'],
      dtype='object')

### **How to Remove/Delete one or more Columns ???**

To delete one or more columns from a Pandas DataFrame, you can use the **df.drop()** method and specify the columns to be removed. The **drop()** method returns a new DataFrame with the specified columns dropped unless you set the **inplace** parameter to **True**, which modifies the DataFrame directly.



In [48]:
df.drop(columns = ["year + 7", "gdp"])   # returns a copy without these columns; df itself is unchanged

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [49]:
df.head(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year + 7,gdp
0,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6567086000.0
1,Afghanistan,1957,9240934,Asia,30.332,820.85303,1964,7585449000.0
2,Afghanistan,1962,10267083,Asia,31.997,853.10071,1969,8758856000.0
3,Afghanistan,1967,11537966,Asia,34.02,836.197138,1974,9648014000.0


In [50]:
df.drop(columns = ["year + 7", "gdp"], inplace = True)

In [51]:
df.head(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138


# Some operations you can perform on "Rows" of a Pandas DataFrame
***
> **df.index**      

> **df.index.values**    
         
> **df.shape**

In [52]:
df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


The **df.index** attribute returns the index labels of the DataFrame **df**. It provides access to the index, which represents the labels or values associated with the rows of the DataFrame. The index can be a numerical range, dates, or custom labels.


In [53]:
df.index

RangeIndex(start=0, stop=1704, step=1)

The **df.index.values** attribute returns the values of the index as an array. It gives you access to the actual values of the index labels associated with the rows of the DataFrame. The resulting output is an array containing the index values.


In [54]:
df.index.values

array([   0,    1,    2, ..., 1701, 1702, 1703])

The **df.shape** attribute in Pandas returns a tuple representing the dimensions of a DataFrame. It provides the number of rows and columns in the DataFrame.



In [55]:
df.shape

(1704, 6)

**What if I want to access the number of rows specifically ???**

To access the number of rows specifically, you can use **df.shape[0]**, which retrieves the first element of the shape tuple. Similarly, to access the number of columns, you can use **df.shape[1]**, which retrieves the second element of the shape tuple.



In [56]:
df.shape[0]

1704

In [57]:
df.shape[1] # number of columns

6

**How to change the values of the "Index" ???**

The line of code **df.index = list(range(1, n+1))** sets the index of the DataFrame **df** to a new range of values starting from **1** up to **n**.

In this example, **n** is set to the number of rows using **df.shape[0]** (the same value as **len(df)**). The **range(1, n+1)** generates a range of values from **1** to **n**, which corresponds to the desired index labels. By assigning this new range of values to **df.index**, the index of the DataFrame is updated accordingly.
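The same re-labelling can be tried on a small hypothetical frame. The second assignment is an assumed alternative, not the approach used in this notebook: **pd.RangeIndex** produces the same labels without building a Python list.

```python
import pandas as pd

# Hypothetical three-row frame
toy = pd.DataFrame({"country": ["Chad", "Peru", "India"]})

n = len(toy)                        # same value as toy.shape[0]
toy.index = list(range(1, n + 1))   # labels 1..n from a plain list
print(toy.index.tolist())

# Assumed alternative: a RangeIndex with the same labels
toy.index = pd.RangeIndex(start=1, stop=n + 1)
print(toy.index.tolist())
```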




In [58]:
df.head(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138


In [59]:
n = df.shape[0]
n

1704

In [60]:
df.index = list(range(1, n+1))

In [61]:
df.head(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
3,Afghanistan,1962,10267083,Asia,31.997,853.10071
4,Afghanistan,1967,11537966,Asia,34.02,836.197138


In [62]:
sample = df.head()

In [63]:
sample

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
3,Afghanistan,1962,10267083,Asia,31.997,853.10071
4,Afghanistan,1967,11537966,Asia,34.02,836.197138
5,Afghanistan,1972,13079460,Asia,36.088,739.981106


**How can I check the new Index value ???**

To access the values of the **index** of a Pandas **DataFrame** or **Series**, you can use the **"variable".index.values** attribute. It returns an array containing the index values.


In [64]:
sample.index.values

array([1, 2, 3, 4, 5])

In [65]:
# How to update this again ???

sample.index = ["a", "b", "c", "d", "e"]

In [66]:
sample

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
a,Afghanistan,1952,8425333,Asia,28.801,779.445314
b,Afghanistan,1957,9240934,Asia,30.332,820.85303
c,Afghanistan,1962,10267083,Asia,31.997,853.10071
d,Afghanistan,1967,11537966,Asia,34.02,836.197138
e,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [67]:
# Going back to the original DataFrame

df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
3,Afghanistan,1962,10267083,Asia,31.997,853.10071
4,Afghanistan,1967,11537966,Asia,34.02,836.197138
5,Afghanistan,1972,13079460,Asia,36.088,739.981106


### **How to do Indexing on a Series ???**

In [68]:
ser = df["country"]

In [69]:
ser.head(20)

1     Afghanistan
2     Afghanistan
3     Afghanistan
4     Afghanistan
5     Afghanistan
6     Afghanistan
7     Afghanistan
8     Afghanistan
9     Afghanistan
10    Afghanistan
11    Afghanistan
12    Afghanistan
13        Albania
14        Albania
15        Albania
16        Albania
17        Albania
18        Albania
19        Albania
20        Albania
Name: country, dtype: object

**What is the o/p value of "ser[12]" ??? Is it "Afghanistan" or "Albania" ???**

In [70]:
ser[12] #confusing - the index could start with 0-index or with 1-index as shown in previous codes

'Afghanistan'

Here **ser[12]** returned 'Afghanistan' because plain square brackets on a **Series** with an integer index look up the **label** 12, and the labels now start at 1. To avoid this ambiguity, Pandas provides two dedicated indexers: **.iloc** for **implicit (position-based) indexing** and **.loc** for **explicit (label-based) indexing**.

> **Implicit Indexing (.iloc):**
Implicit indexing refers to the zero-based integer position of each element in a Series, independent of its labels. It allows accessing elements based on their positional index.

> **Explicit Indexing (.loc):**
Explicit indexing refers to the labels stored in the index of a Series, whether default or custom. It allows accessing elements based on these labels.

In [71]:
# implicit index : starts from 0: iloc
# explicit index: loc
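
To make the difference concrete, here is a minimal sketch on a hypothetical three-element Series (`toy` is not part of the dataset) whose labels start at 1, like `ser` above.

```python
import pandas as pd

# Hypothetical Series with a 1-based integer index, like `ser` above
toy = pd.Series(["Afghanistan", "Afghanistan", "Albania"], index=[1, 2, 3])

by_label = toy.loc[2]      # label 2 is the second element
by_position = toy.iloc[2]  # position 2 is the third element

print(by_label)     # Afghanistan
print(by_position)  # Albania
```

Using `.loc` or `.iloc` explicitly makes it clear to the reader which kind of lookup is intended.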

In [72]:
df.head(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
3,Afghanistan,1962,10267083,Asia,31.997,853.10071
4,Afghanistan,1967,11537966,Asia,34.02,836.197138


In [73]:
df.loc[1]

country       Afghanistan
year                 1952
population        8425333
continent            Asia
life_exp           28.801
gdp_cap        779.445314
Name: 1, dtype: object

### **How to do Slicing with Indexing on Series dataset ???**

> **Slicing with Explicit Indexing**  
   
> **Slicing with Implicit Indexing**

**Slicing with Explicit Indexing:**     
When using explicit indexing, you can slice a **Series** based on the specified **index** labels. The resulting subset includes the elements corresponding to the specified labels.

When slicing with **explicit index** labels, the **range is inclusive of both the start and end** labels. The resulting subset retains the original index labels, allowing you to maintain the association between the data and its respective labels.


In [74]:
df.head(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
3,Afghanistan,1962,10267083,Asia,31.997,853.10071
4,Afghanistan,1967,11537966,Asia,34.02,836.197138


In [75]:
print(df.loc[1:3]) #slicing in explicit index: the end index is inclusive

       country  year  population continent  life_exp     gdp_cap
1  Afghanistan  1952     8425333      Asia    28.801  779.445314
2  Afghanistan  1957     9240934      Asia    30.332  820.853030
3  Afghanistan  1962    10267083      Asia    31.997  853.100710


**Slicing with Implicit Indexing:**          
When using **implicit indexing**, you can slice a **Series** or **DataFrame** based on the **positional indices** of the elements. As with Python lists, the range **includes the start position and excludes the end position**.

In [76]:
df.head(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
3,Afghanistan,1962,10267083,Asia,31.997,853.10071
4,Afghanistan,1967,11537966,Asia,34.02,836.197138


In [77]:
print(df.iloc[1])

country       Afghanistan
year                 1957
population        9240934
continent            Asia
life_exp           30.332
gdp_cap         820.85303
Name: 2, dtype: object


In [78]:
print(df.iloc[0:2])  #slicing in implicit index: the end index is exclusive

       country  year  population continent  life_exp     gdp_cap
1  Afghanistan  1952     8425333      Asia    28.801  779.445314
2  Afghanistan  1957     9240934      Asia    30.332  820.853030
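
The two slicing rules can be compared side by side. This sketch uses a hypothetical four-row frame `toy` with a 1-based index, mirroring the shape of `df` above.

```python
import pandas as pd

# Hypothetical frame with a 1-based index
toy = pd.DataFrame({"year": [1952, 1957, 1962, 1967]}, index=[1, 2, 3, 4])

label_slice = toy.loc[1:3]      # labels 1, 2, 3: the end label is included
position_slice = toy.iloc[1:3]  # positions 1, 2: the end position is excluded

print(label_slice.index.to_list())     # [1, 2, 3]
print(position_slice.index.to_list())  # [2, 3]
```

The same slice bounds give different rows here because the labels and the positions are offset by one.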


**How to select multiple rows at once with the iloc indexer ???**

Passing a **list of positions** to the **iloc** indexer selects several rows in one call, similar to **fancy indexing** in **NumPy**. This is different from a Pandas **MultiIndex**, which refers to an index with multiple levels.

In [79]:
df.head(12)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
3,Afghanistan,1962,10267083,Asia,31.997,853.10071
4,Afghanistan,1967,11537966,Asia,34.02,836.197138
5,Afghanistan,1972,13079460,Asia,36.088,739.981106
6,Afghanistan,1977,14880372,Asia,38.438,786.11336
7,Afghanistan,1982,12881816,Asia,39.854,978.011439
8,Afghanistan,1987,13867957,Asia,40.822,852.395945
9,Afghanistan,1992,16317921,Asia,41.674,649.341395
10,Afghanistan,1997,22227415,Asia,41.763,635.341351


In [80]:
print(df.iloc[[1,10]])

        country  year  population continent  life_exp     gdp_cap
2   Afghanistan  1957     9240934      Asia    30.332  820.853030
11  Afghanistan  2002    25268405      Asia    42.129  726.734055
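
A list of positions can also be combined with a list of column positions. The following is a small sketch on a hypothetical frame `toy`; the column names are only illustrative.

```python
import pandas as pd

# Hypothetical frame with a 1-based index
toy = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "year": [1952, 1957, 1962, 1967],
        "life_exp": [28.8, 30.3, 32.0, 34.0],
    },
    index=[1, 2, 3, 4],
)

rows = toy.iloc[[1, 3]]            # rows at positions 1 and 3 (labels 2 and 4)
block = toy.iloc[[0, 2], [0, 1]]   # rows at positions 0 and 2, first two columns

print(rows.index.to_list())     # [2, 4]
print(block.columns.to_list())  # ['country', 'year']
```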


**Finding the last Row in Pandas DataFrame ???**

> **df.loc[]**     
      
> **df.iloc[]**

In [81]:
df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1700,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1701,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1702,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1703,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1704,Zimbabwe,2007,12311143,Africa,43.487,469.709298


When using the **df.loc[]** indexer in Pandas, it is generally used to access rows or subsets of data based on the label-based index rather than positional indices.

If you try to access a row with a label that does not exist in the index, such as **-1**, it will raise a **KeyError** because **-1** is not a valid label in the index.

In [82]:
df.loc[-1]

KeyError: ignored

If you want to access a row by its position, you can use **df.iloc[]** instead. For example, **df.iloc[-1]** will access the last row of the DataFrame based on its positional index.



In [83]:
df.iloc[-1]

country         Zimbabwe
year                2007
population      12311143
continent         Africa
life_exp          43.487
gdp_cap       469.709298
Name: 1704, dtype: object
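
If the last row is needed through **loc**, one option is to look up the last label first. This is a minimal sketch on a hypothetical frame `toy` with arbitrary labels.

```python
import pandas as pd

# Hypothetical frame whose labels are not 0, 1, 2
toy = pd.DataFrame({"country": ["X", "Y", "Z"]}, index=[10, 20, 30])

last_by_position = toy.iloc[-1]      # last row, whatever its label is
last_label = toy.index[-1]           # the label of the last row
last_by_label = toy.loc[last_label]  # the same row, fetched by label

print(last_label)  # 30
```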

**How to set a particular column as an Index Column ???**

To set a **particular column** as the **index column** in a Pandas DataFrame, you can use the **set_index()** method. This method allows you to specify a column to use as the new index column, and it returns a new DataFrame with the updated index.



In [84]:
df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
3,Afghanistan,1962,10267083,Asia,31.997,853.10071
4,Afghanistan,1967,11537966,Asia,34.02,836.197138
5,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [85]:
temp = df.set_index("country")

In [86]:
temp.head()

country,year,population,continent,life_exp,gdp_cap
Afghanistan,1952,8425333,Asia,28.801,779.445314
Afghanistan,1957,9240934,Asia,30.332,820.85303
Afghanistan,1962,10267083,Asia,31.997,853.10071
Afghanistan,1967,11537966,Asia,34.02,836.197138
Afghanistan,1972,13079460,Asia,36.088,739.981106


**Now give me all the Rows for "Afghanistan".**

> Is "Afghanistan" an **implicit index** or **explicit index** ???

> Should we use **loc[]** or **iloc[]** ???

In [87]:
temp.loc["Afghanistan"]  # explicit index # loc[]

country,year,population,continent,life_exp,gdp_cap
Afghanistan,1952,8425333,Asia,28.801,779.445314
Afghanistan,1957,9240934,Asia,30.332,820.85303
Afghanistan,1962,10267083,Asia,31.997,853.10071
Afghanistan,1967,11537966,Asia,34.02,836.197138
Afghanistan,1972,13079460,Asia,36.088,739.981106
Afghanistan,1977,14880372,Asia,38.438,786.11336
Afghanistan,1982,12881816,Asia,39.854,978.011439
Afghanistan,1987,13867957,Asia,40.822,852.395945
Afghanistan,1992,16317921,Asia,41.674,649.341395
Afghanistan,1997,22227415,Asia,41.763,635.341351


In [88]:
temp.loc["India"]

country,year,population,continent,life_exp,gdp_cap
India,1952,372000000,Asia,37.373,546.565749
India,1957,409000000,Asia,40.249,590.061996
India,1962,454000000,Asia,43.605,658.347151
India,1967,506000000,Asia,47.193,700.770611
India,1972,567000000,Asia,50.651,724.032527
India,1977,634000000,Asia,54.208,813.337323
India,1982,708000000,Asia,56.596,855.723538
India,1987,788000000,Asia,58.553,976.512676
India,1992,872000000,Asia,60.223,1164.406809
India,1997,959000000,Asia,61.765,1458.817442


In [89]:
temp.iloc["Afghanistan"]  # implicit index # iloc[] will throw error message

TypeError: ignored
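
The behaviour above can be reproduced on a tiny example. The frame `toy` is hypothetical; it only mimics an index in which a country name repeats.

```python
import pandas as pd

# Hypothetical frame indexed by a column with repeated values
toy = pd.DataFrame(
    {"country": ["India", "India", "Nepal"], "year": [1952, 1957, 1952]}
).set_index("country")

india_rows = toy.loc["India"]  # the label occurs twice, so two rows come back

try:
    toy.iloc["India"]          # iloc accepts positions only, not labels
except TypeError as err:
    iloc_error = err

print(len(india_rows))  # 2
```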

**How to get back to the original Index value since we have made few changes now ???**

To discard a custom index and go back to the default zero-based integer index, you can use the **reset_index()** method in Pandas. By default it returns a new DataFrame with the default index and moves the old index values into a new column.

If you don't want to keep that additional column, pass **drop=True** to **reset_index()**. The old index values are then discarded instead of being stored as a column.

Like most DataFrame methods, **reset_index()** leaves the original DataFrame unchanged unless you assign the result back or pass **inplace=True**. Calling **reset_index(drop=True, inplace=True)** therefore replaces the index of **df** itself with the default 0, 1, 2, ... index.

In [90]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.853030
3,Afghanistan,1962,10267083,Asia,31.997,853.100710
4,Afghanistan,1967,11537966,Asia,34.020,836.197138
5,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1700,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1701,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1702,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1703,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [91]:
df.reset_index()  # this also creates an unnecessary "index" column that needs to be removed

Unnamed: 0,index,country,year,population,continent,life_exp,gdp_cap
0,1,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,2,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,3,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,4,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,5,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...,...
1699,1700,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,1701,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,1702,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,1703,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [92]:
df.reset_index(drop = True)  # returns a new DataFrame; df itself is not modified

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [93]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.853030
3,Afghanistan,1962,10267083,Asia,31.997,853.100710
4,Afghanistan,1967,11537966,Asia,34.020,836.197138
5,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1700,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1701,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1702,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1703,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [94]:
df.reset_index(drop = True, inplace = True)

In [95]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [96]:
df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
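
The three variants of **reset_index()** shown above can be summarised in a short sketch. The frame `toy` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical frame with a 1-based index
toy = pd.DataFrame({"country": ["A", "B", "C"]}, index=[1, 2, 3])

kept = toy.reset_index()              # old labels saved in a column named "index"
dropped = toy.reset_index(drop=True)  # old labels discarded

print(kept.columns.to_list())   # ['index', 'country']
print(dropped.index.to_list())  # [0, 1, 2]
print(toy.index.to_list())      # [1, 2, 3]
```

Neither call changed `toy` itself, which is why the notebook passes `inplace=True` in the final step.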


### **How to add new Row in Pandas DataFrame ???**

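To add a new row to a Pandas DataFrame using the **df.append()** method, you can provide the row data as a dictionary or a Series object. The **append()** method does not modify the DataFrame in place; it returns a new DataFrame with the added row. Note that **df.append()** was deprecated in Pandas 1.4 and removed in Pandas 2.0, so the cells below only run on older versions; on newer versions **pd.concat()** is the replacement.

To add a new row, we create a dictionary **new_row** with the values for each column. Then, we use **df.append(new_row, ignore_index=True)** to append the new row to the DataFrame. A dictionary carries no index label of its own, so **append()** raises a **TypeError** unless **ignore_index=True** is passed, which discards the existing labels and renumbers the rows from 0 so that the new row lands at the end.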

In [97]:
new_row = {'country': 'India', 'year': 2023, 'population':13500000, 'continent' : "Asia", 'life_exp':37.08,'gdp_cap':900.23}

In [98]:
df = df.append(new_row) # the append() method didn't work & why's that ???


TypeError: ignored

In [99]:
df = df.append(new_row, ignore_index=True)


In [100]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298
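
On Pandas versions where **append()** is no longer available, the same row can be added with **pd.concat()**. This is a minimal sketch on a hypothetical two-column frame `toy`; the column names and values are only illustrative.

```python
import pandas as pd

# Hypothetical frame and row, with fewer columns than the real dataset
toy = pd.DataFrame({"country": ["A", "B"], "year": [1952, 1957]})
new_row = {"country": "India", "year": 2023}

# Wrap the dict in a one-row DataFrame, then concatenate and renumber the rows
toy = pd.concat([toy, pd.DataFrame([new_row])], ignore_index=True)

print(toy.index.to_list())  # [0, 1, 2]
```

Here `ignore_index=True` plays the same role as in `append()`: the combined rows are renumbered from 0.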


**What to do if we want to modify the last added Row ???**
> **`Approach (1):` df.loc[n-1]** or **df.loc[-1]** (**`NOTE:`** Notice the o/p of **df.loc[-1]**)

> **`Approach (2):` df.iloc[n-1]** or **df.iloc[-1]**

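To modify the last added row in a Pandas DataFrame, you can use either **explicit indexing** with **loc** or **implicit indexing** with **iloc**. Here both **loc[n-1]** and **iloc[n-1]** point at the same row because the index was reset to the default 0-based labels.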

In [101]:
n = len(df.index)

In [102]:
print(n)

1705


In [103]:
df.loc[n-1] = ["India", 2022, 1232, "Asia", 38, 1000]

In [104]:
df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1704,India,2022,1232,Asia,38.0,1000.0


In [105]:
# We can do the same using the implicit index as well...
# Approach (2): df.iloc[n-1]

df.iloc[n-1] = ["India", 2021, 1232, "Asia", 38, 1000]

In [106]:
df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1704,India,2021,1232,Asia,38.0,1000.0


In [107]:
# Approach (2), alternative: df.iloc[-1]

df.iloc[-1] = ["India", 2020, 1232, "Asia", 38, 1000]

In [108]:
df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1704,India,2020,1232,Asia,38.0,1000.0


**Will the -1 index of last row work for explicit index as well ???**

No. **df.loc[]** treats **-1** as a label, not as "the last position". Reading **df.loc[-1]** raises a **KeyError** when no such label exists, but assigning to it is allowed: if the label **-1** does not already exist in the DataFrame, Pandas creates a new row with the index **-1** and assigns the provided values, leaving the existing last row untouched.


In [109]:
df.loc[-1] = ["India", 2019, 1232, "Asia", 38, 1000]

In [110]:
df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1704,India,2020,1232,Asia,38.0,1000.0
-1,India,2019,1232,Asia,38.0,1000.0
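
To update the last row through **loc** without creating a new one, one option is to use the last label explicitly. The sketch below contrasts both cases on a hypothetical two-row frame `toy`.

```python
import pandas as pd

# Hypothetical frame with the default 0-based index
toy = pd.DataFrame({"country": ["A", "B"], "year": [1952, 1957]})

toy.loc[toy.index[-1]] = ["C", 2000]  # overwrites the existing last row (label 1)
rows_after_update = len(toy)

toy.loc[-1] = ["D", 2019]             # label -1 does not exist, so a row is added
rows_after_enlarge = len(toy)

print(rows_after_update, rows_after_enlarge)  # 2 3
print(toy.index.to_list())                    # [0, 1, -1]
```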


**We have learnt how to delete a Column previously. Likewise, how do we delete a Row ???**

In [111]:
sample

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
a,Afghanistan,1952,8425333,Asia,28.801,779.445314
b,Afghanistan,1957,9240934,Asia,30.332,820.85303
c,Afghanistan,1962,10267083,Asia,31.997,853.10071
d,Afghanistan,1967,11537966,Asia,34.02,836.197138
e,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [112]:
# Let's change back the index to numbers

sample.index = [1,2,3,4,5]

In [113]:
sample

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
3,Afghanistan,1962,10267083,Asia,31.997,853.10071
4,Afghanistan,1967,11537966,Asia,34.02,836.197138
5,Afghanistan,1972,13079460,Asia,36.088,739.981106


The code **sample.drop(3, axis=0)** removes the row with index label **3** from the DataFrame **sample** along the row axis (**axis=0**).

By default, the **drop()** method returns a new DataFrame with the specified row(s) removed and leaves the original DataFrame unchanged. Passing **inplace=True** modifies **sample** itself instead.

In [114]:
sample.drop(3, axis = 0) # returns a new DataFrame; sample itself is unchanged

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
4,Afghanistan,1967,11537966,Asia,34.02,836.197138
5,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [115]:
sample

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
3,Afghanistan,1962,10267083,Asia,31.997,853.10071
4,Afghanistan,1967,11537966,Asia,34.02,836.197138
5,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [116]:
sample.drop(3, axis = 0, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sample.drop(3, axis = 0, inplace = True)


In [117]:
sample

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
4,Afghanistan,1967,11537966,Asia,34.02,836.197138
5,Afghanistan,1972,13079460,Asia,36.088,739.981106
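
The warning printed by the in-place drop most likely appears because `sample` was created with `df.head()`, which older Pandas versions track as a slice of `df`. A common way to avoid it, sketched here on hypothetical frames `df_toy` and `sample_toy`, is to take an explicit copy first.

```python
import pandas as pd

# Hypothetical stand-in for the full DataFrame
df_toy = pd.DataFrame(
    {"country": ["A", "B", "C", "D", "E", "F"]}, index=[1, 2, 3, 4, 5, 6]
)

sample_toy = df_toy.head().copy()         # explicit, independent copy
sample_toy.drop(3, axis=0, inplace=True)  # modifies only the copy

print(sample_toy.index.to_list())  # [1, 2, 4, 5]
print(len(df_toy))                 # 6
```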


In [118]:
# What will be the o/p of this code ???

sample.drop([1,2], axis = 0)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
4,Afghanistan,1967,11537966,Asia,34.02,836.197138
5,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [119]:
sample.drop("country", axis = 1)

Unnamed: 0,year,population,continent,life_exp,gdp_cap
1,1952,8425333,Asia,28.801,779.445314
2,1957,9240934,Asia,30.332,820.85303
4,1967,11537966,Asia,34.02,836.197138
5,1972,13079460,Asia,36.088,739.981106
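
Instead of the **axis** argument, **drop()** also accepts the keywords **index=** and **columns=**, which some readers find easier to follow. This is a small sketch on a hypothetical frame `toy`.

```python
import pandas as pd

# Hypothetical frame with a gap in its labels, like `sample` after the drop
toy = pd.DataFrame(
    {"country": ["A", "B", "C"], "year": [1952, 1957, 1962]}, index=[1, 2, 4]
)

fewer_rows = toy.drop(index=[1, 2])       # same as toy.drop([1, 2], axis=0)
fewer_cols = toy.drop(columns="country")  # same as toy.drop("country", axis=1)

print(fewer_rows.index.to_list())    # [4]
print(fewer_cols.columns.to_list())  # ['year']
```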


In [120]:
# NOTE: continue with duplicate row concept from next lecture