# Introduction

1. Why do we need Pandas?
2. How to Access Pandas?
3. Basic Data Structures in Pandas
4. How to Access a File in Pandas?
5. DataFrame Structure
6. How to Extract one particular Column from a DataFrame?
7. df.info()
8. df.head()
9. df.tail()
10. df.shape
11. Working on Columns
    1. df.columns
    2. df.keys()
    3. How to access top rows or bottom rows of a specific column (using df['column'].head() and df['column'].tail())
    4. df["Column"].unique()
    5. df["Column"].value_counts()
    6. Renaming a column
    7. Dropping a column
    8. Creating a new series using the available series
    9. Creating a completely new series
12. Working with rows
    1. df.index.values
    2. df.index
    3. Explicit & Implicit Indexing
    4. df.index[a] #a is the implicit index (row position)
    5. Indexing and Slicing with loc and iloc
    6. set_index
    7. reset_index

## 1. Why do we need Pandas?

==> In NumPy, an array can hold only one data type (homogeneous)

==> Example: In industry, real-world tables almost never contain a single data type

![image.png](attachment:image.png)

==> The columns above have different data types, so a plain NumPy array cannot hold them with their own types

==> Pandas was developed to overcome this homogeneity restriction of NumPy

==> In the above example, each column has a different data type, and Pandas allows this

==> However, all values within a single column should share the same data type (just like a table in SQL)

==> NumPy and Pandas are partners and work together
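The difference can be seen in a minimal sketch. The names and values below are made up for illustration and are not part of the dataset used in these notes.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype, so mixed values are coerced to one type.
# Here the integers are converted to strings.
arr = np.array([["Asha", 31], ["Ravi", 28]])
print(arr.dtype.kind)  # 'U' means every element is stored as a Unicode string

# A DataFrame keeps one dtype per column, so "age" stays an integer column.
people = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [31, 28]})
print(people.dtypes)
```

The integer column survives in the DataFrame, while the NumPy array had to pick one type for everything.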

## 2. How to access Pandas?

__To install:__ pip install pandas

__To access:__  import pandas as pd

In [2]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [29]:
import pandas as pd

## 3. Basic Data Structures in Pandas:

Pandas provides two types of classes for handling data:

1. Series
2. DataFrames


<br>

1. __Series:__ 

    1. Series is a one-dimensional (1D) labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)
    2. However, a Series has a single dtype, so it holds only one data type at a time
    3. Series is essentially a single column of a DataFrame
<br>

2. __DataFrames:__

    1. DataFrame is a 2-dimensional (2D) labeled data structure with columns of different data types
    2. DataFrame is just like a spreadsheet (or) a SQL table, (or) a dictionary of series objects
    3. DataFrame is the primary data structure used for data manipulation and analysis in Pandas
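As a small sketch of the two structures (the column names and values are hypothetical), a DataFrame can be built from a dictionary of columns, and selecting one column gives back a Series:

```python
import pandas as pd

# A Series: one labeled column of values with a single dtype
ages = pd.Series([31, 28, 45], name="age")

# A DataFrame: several columns, each of which may have its own dtype
table = pd.DataFrame({"name": ["Asha", "Ravi", "Meena"], "age": ages})

# Selecting a single column returns a Series again
column = table["age"]
print(type(column).__name__)
print(ages.ndim, table.ndim)  # number of dimensions of each structure
```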

## 4. How to access a csv file in Pandas?

1. __pd.read_csv("file_name.csv")__

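    The __pd.read_csv("file_name.csv")__ function reads data from a csv file and creates a dataframe
<br>

2. __pd.read_excel("file_name.xlsx")__

    The __pd.read_excel("file_name.xlsx")__ function does the same for an Excel file

    __Note:__ A bare file name is looked up in the current working directory, so keep the Jupyter notebook and the data file at the same location, or pass the full path to the file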
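To try `pd.read_csv` without any file on disk, the sketch below reads csv text from an in-memory buffer. The use of `io.StringIO` is an assumption made for this illustration; the two data rows are copied from the dataset shown in these notes.

```python
import io
import pandas as pd

# csv content held in a string instead of a file (illustration only)
csv_text = (
    "country,year,population\n"
    "Afghanistan,1952,8425333\n"
    "Zimbabwe,2007,12311143\n"
)

# read_csv accepts a path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))
print(demo)
print(type(demo).__name__)
```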

In [30]:
df = pd.read_csv("mckinsey.csv")
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [31]:
type(df)

pandas.core.frame.DataFrame

In [32]:
import numpy as np

## 5. DataFrame Structure:

![image-2.png](attachment:image-2.png)

==> Each column is a Series. A Series is also referred to as a column or 1D array

==> A DataFrame is also referred to as a table or 2D array

==> The positional column indices are inbuilt and are not displayed

==> Any missing values in the data are shown as "NaN"

==> The row labels are referred to as the Index

#### Important points

__Features of a DataFrame:__

1. __Tabular Structure:__ Data is organised in a table with rows and columns
2. __Labeled Axes:__ Both rows and columns have labels, making it easy for referencing and manipulating the data based on the labels
3. __Heterogeneous Data Types:__ Different columns can have different data types
4. __Index:__ DataFrames have an index, which is a label for each row. The index allows for easy retrieval and manipulation of data based on row labels

__Series:__
1. Series is essentially a single column of a DataFrame
2. Series shares many characteristics with a DataFrame, such as labeled axes. A Series can hold values of any data type, but only one data type at a time
3. The primary difference is that a Series has only one dimension (1D) whereas a DataFrame has two dimensions - rows and columns (2D)

## 6. How to extract one particular column from a DataFrame?

Example: Let 'abc' be a dataframe with the following columns
1. Name
2. Country
3. Age
4. DoB

To extract information from the Country column
    
   __abc['Country']__
    
   ==> Country is the label of a specific column
   
   ==> The column name must be written as a string inside quotation marks (either single or double)

In [33]:
#Current DataFrame

df['country']

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

In [34]:
type(df['country'])

pandas.core.series.Series

## 7. DataFrame_name.info() or df_name.info()

==> DataFrame_name.info() or df_name.info() is used to print a concise summary of a dataframe, including information about the data types, number of non-null values, and memory usage

==> The function is used when we want to get a quick review of the structure and content of a dataframe

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     1704 non-null   object 
 1   year        1704 non-null   int64  
 2   population  1704 non-null   int64  
 3   continent   1704 non-null   object 
 4   life_exp    1704 non-null   float64
 5   gdp_cap     1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB


## 8. DataFrame_name.head()

==> __DataFrame_name.head()__ is used to display the first few rows of a DataFrame

==> By default, it shows the first 5 rows, but we can specify the number of rows we want to display by providing an argument

==> DataFrame_name.head() ==> Provides first five rows (Default)

==> DataFrame_name.head(10) ==> Provides first 10 rows

In [36]:
df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [37]:
df.head(10)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
5,Afghanistan,1977,14880372,Asia,38.438,786.11336
6,Afghanistan,1982,12881816,Asia,39.854,978.011439
7,Afghanistan,1987,13867957,Asia,40.822,852.395945
8,Afghanistan,1992,16317921,Asia,41.674,649.341395
9,Afghanistan,1997,22227415,Asia,41.763,635.341351


## 9. DataFrame_name.tail()

==> __DataFrame_name.tail()__ is used to display the last few rows of a DataFrame

==> By default, it shows the last 5 rows, but we can specify the number of rows we want to display by providing an argument

==> DataFrame_name.tail() ==> Provides last five rows (Default)

==> DataFrame_name.tail(10) ==> Provides last 10 rows

In [38]:
df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298


In [39]:
df.tail(10)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1694,Zimbabwe,1962,4277736,Africa,52.358,527.272182
1695,Zimbabwe,1967,4995432,Africa,53.995,569.795071
1696,Zimbabwe,1972,5861135,Africa,55.635,799.362176
1697,Zimbabwe,1977,6642107,Africa,57.674,685.587682
1698,Zimbabwe,1982,7636524,Africa,60.363,788.855041
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298


## 10. DataFrame_name.shape

==> Provides a tuple containing the number of rows and columns (rows, columns)

==> Works in the same way as in NumPy. shape is an attribute, not a method, so no brackets () are needed

In [40]:
df.shape

(1704, 6)

#### Important Points

==> Pandas is built on top of Numpy and often leverages Numpy arrays for its underlying data structures and operations

==> Pandas provides high-level data structures like series and dataframe, however, it relies on Numpy for efficient numerical calculations

==> Pandas series are similar to 1D arrays in Numpy

==> Dataframes are similar to 2D arrays in Numpy
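A short sketch of this relationship, using a made-up two-column table: `to_numpy()` exposes the values of a Series or DataFrame as a NumPy array, which NumPy functions can then work on.

```python
import numpy as np
import pandas as pd

# Hypothetical marks table, used only for this illustration
scores = pd.DataFrame({"math": [70, 85], "physics": [64, 91]})

col_values = scores["math"].to_numpy()  # Series -> 1D NumPy array
all_values = scores.to_numpy()          # DataFrame -> 2D NumPy array

print(col_values.shape, all_values.shape)

# NumPy functions operate on the extracted arrays (mean of each column)
print(np.mean(all_values, axis=0))
```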

## 11. Working on Columns

## A. Accessing Columns Names

### 1. df.columns

==> df.columns is used to access the column labels / names of columns in a dataframe

==> It will return an index object containing the column names

In [41]:
df.columns #dataframes have an attribute columns

Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')

### 2. df.keys()

==> df.keys() returns the same column labels as df.columns

==> keys() and columns can be used interchangeably to access the column labels of a dataframe

In [42]:
df.keys() #dataframes have a method ==> keys()

Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')

#### Note: Index is a special pandas object which stores immutable data

### 3. How to access top rows or bottom rows of a specific column (using df['column'].head() and df['column'].tail())

In [43]:
#To extract data in the form of a Series

df["country"].head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

In [44]:
#To extract data in the form of a Dataframe

df[["country"]].head()

Unnamed: 0,country
0,Afghanistan
1,Afghanistan
2,Afghanistan
3,Afghanistan
4,Afghanistan


#### Note: It is not possible to extract more than one column as a Series. To extract more than one column, we need to extract them as a DataFrame (using double brackets, i.e. a list of column names)

__Because one column at a time is called a Series. When two or more Series are combined they become a DataFrame__

In [45]:
df[["country", 'year']].head() #Works

Unnamed: 0,country,year
0,Afghanistan,1952
1,Afghanistan,1957
2,Afghanistan,1962
3,Afghanistan,1967
4,Afghanistan,1972


In [46]:
df["country", 'year'].head() #Results in error

KeyError: ('country', 'year')

In [47]:
#To extract data in the form of a Series

df["country"].tail()

1699    Zimbabwe
1700    Zimbabwe
1701    Zimbabwe
1702    Zimbabwe
1703    Zimbabwe
Name: country, dtype: object

In [48]:
#To extract data in the form of a Dataframe

df[["country"]].tail()

Unnamed: 0,country
1699,Zimbabwe
1700,Zimbabwe
1701,Zimbabwe
1702,Zimbabwe
1703,Zimbabwe


In [49]:
df[["country", 'year']].tail() #Works

Unnamed: 0,country,year
1699,Zimbabwe,1987
1700,Zimbabwe,1992
1701,Zimbabwe,1997
1702,Zimbabwe,2002
1703,Zimbabwe,2007


### 4. df['Column'].unique():

The __df['Column'].unique()__ returns an array of unique elements in a series

==> Output will be provided in a numpy array

In [50]:
df['country'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
       'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
       'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
       'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',
       'Korea, Rep.', 'Kuwait', 'Leba

In [51]:
type(df['country'].unique()) #Proves Numpy and Pandas work together

numpy.ndarray

### 5. df['column'].value_counts()

The __df['column'].value_counts()__ provides the count of how many times every unique element in the specified column is repeated

==> Output will be provided in a Pandas Series

In [52]:
df['country'].value_counts()

#Count of every unique element in the country column

country
Afghanistan          12
Pakistan             12
New Zealand          12
Nicaragua            12
Niger                12
                     ..
Eritrea              12
Equatorial Guinea    12
El Salvador          12
Egypt                12
Zimbabwe             12
Name: count, Length: 142, dtype: int64

### 6. Renaming columns

__Method -1__

Syntax ==> df.rename({"old_name" : "new_name"}, axis = 1)

The __df.rename({"old_name" : "new_name"}, axis = 1)__ call takes a dictionary mapping old column names to new column names. axis = 1 (or axis = "columns") applies the mapping to the column labels; axis = 0 would apply it to the row labels instead

==> A complete new dataframe will be created. Original dataframe remains the same

In [53]:
df.rename({'population' :'POPULATION', 'country' : 'NATION'}, axis = 1).head()

Unnamed: 0,NATION,year,POPULATION,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


__Method - 2 (This is simple with no confusion)__

Syntax ==> df.rename(columns = {"old_name" : "new_name"})

In [54]:
df.rename(columns = {'population' :'POPULATION', 'country' : 'JAGAN'}).head()

Unnamed: 0,JAGAN,year,POPULATION,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [55]:
df.head() #However original dataframe remains the same

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


#### How to rename columns in original dataframe?

For updating the original dataframe, we need to use inplace in the code

==> df.rename({"old_name" : "new_name"}, axis = 1, inplace = True)

==> df.rename(columns = {"old_name" : "new_name"}, inplace = True)

__inplace = True will ask Pandas to modify the original dataframe__

In [56]:
df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [57]:
df.rename({'country' : 'NATION'}, axis = 1, inplace = True)

df.head()

Unnamed: 0,NATION,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [58]:
df.rename(columns = {'Population' : 'POPULATION'}, inplace = True) #Column names are case-sensitive: 'Population' does not match 'population', so nothing is renamed

df.head()

Unnamed: 0,NATION,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
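The last output still shows `population` because the key `'Population'` does not match any column. The sketch below reproduces this on a tiny hypothetical dataframe and shows the `errors="raise"` option of `rename`, which reports a missing label instead of silently ignoring it.

```python
import pandas as pd

# One-row dataframe used only for this illustration
demo = pd.DataFrame({"country": ["Afghanistan"], "population": [8425333]})

# 'Population' is not a column, so by default rename() changes nothing
unchanged = demo.rename(columns={"Population": "POPULATION"})
print(list(unchanged.columns))

# With errors="raise", the unmatched key produces a KeyError
try:
    demo.rename(columns={"Population": "POPULATION"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)

# Using the exact (lowercase) label performs the rename
renamed = demo.rename(columns={"population": "POPULATION"})
print(list(renamed.columns))
```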


### 7. Dropping a Column

We can use either of the following two forms to drop columns

==> df.drop("column_name", axis = 1)

==> df.drop(columns = ["column_name"])

__Provides a new dataframe by dropping the specified columns__

In [59]:
df.drop("continent", axis = 1)

Unnamed: 0,NATION,year,population,life_exp,gdp_cap
0,Afghanistan,1952,8425333,28.801,779.445314
1,Afghanistan,1957,9240934,30.332,820.853030
2,Afghanistan,1962,10267083,31.997,853.100710
3,Afghanistan,1967,11537966,34.020,836.197138
4,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,62.351,706.157306
1700,Zimbabwe,1992,10704340,60.377,693.420786
1701,Zimbabwe,1997,11404948,46.809,792.449960
1702,Zimbabwe,2002,11926563,39.989,672.038623


In [60]:
df.head()

Unnamed: 0,NATION,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [61]:
df.drop(columns = ["continent"]).head()

Unnamed: 0,NATION,year,population,life_exp,gdp_cap
0,Afghanistan,1952,8425333,28.801,779.445314
1,Afghanistan,1957,9240934,30.332,820.85303
2,Afghanistan,1962,10267083,31.997,853.10071
3,Afghanistan,1967,11537966,34.02,836.197138
4,Afghanistan,1972,13079460,36.088,739.981106


#### How to drop columns in original dataframe?

For updating the original dataframe, we need to use inplace in the code

==> df.drop("column", axis = 1, inplace = True)

==> df.drop(columns = "column", inplace = True)

__inplace = True will ask Pandas to modify the original dataframe__

In [62]:
import pandas as pd

df = pd.read_csv("mckinsey.csv")

df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


In [63]:
df.drop("continent", axis = 1, inplace = True) #Won't provide any output but the column is dropped

In [64]:
df.head()

Unnamed: 0,country,year,population,life_exp,gdp_cap
0,Afghanistan,1952,8425333,28.801,779.445314
1,Afghanistan,1957,9240934,30.332,820.85303
2,Afghanistan,1962,10267083,31.997,853.10071
3,Afghanistan,1967,11537966,34.02,836.197138
4,Afghanistan,1972,13079460,36.088,739.981106


### 8. Creating a new series using the available Series

We can create a new Series by assigning an available Series or by performing arithmetic operations on the available Series

==> __Assigning same data to new Series__

   df["new_column_name"] = df["available_column_name"]
   
==> __Creating a new Series through arithmetic operations__

   df["new_column_name"] = df["available_column_name"] + 7 (Example)

In [65]:
df.head()

Unnamed: 0,country,year,population,life_exp,gdp_cap
0,Afghanistan,1952,8425333,28.801,779.445314
1,Afghanistan,1957,9240934,30.332,820.85303
2,Afghanistan,1962,10267083,31.997,853.10071
3,Afghanistan,1967,11537966,34.02,836.197138
4,Afghanistan,1972,13079460,36.088,739.981106


In [66]:
#Assigning same data to new Series

df["nation"] = df["country"]

In [67]:
df.head()

Unnamed: 0,country,year,population,life_exp,gdp_cap,nation
0,Afghanistan,1952,8425333,28.801,779.445314,Afghanistan
1,Afghanistan,1957,9240934,30.332,820.85303,Afghanistan
2,Afghanistan,1962,10267083,31.997,853.10071,Afghanistan
3,Afghanistan,1967,11537966,34.02,836.197138,Afghanistan
4,Afghanistan,1972,13079460,36.088,739.981106,Afghanistan


In [68]:
#Creating a new Series through arithmetic operation

df["year + 7"] = df["year"] + 7

In [69]:
df.head()

Unnamed: 0,country,year,population,life_exp,gdp_cap,nation,year + 7
0,Afghanistan,1952,8425333,28.801,779.445314,Afghanistan,1959
1,Afghanistan,1957,9240934,30.332,820.85303,Afghanistan,1964
2,Afghanistan,1962,10267083,31.997,853.10071,Afghanistan,1969
3,Afghanistan,1967,11537966,34.02,836.197138,Afghanistan,1974
4,Afghanistan,1972,13079460,36.088,739.981106,Afghanistan,1979


In [70]:
df.drop(columns = "nation" , inplace = True)

In [71]:
df.drop(columns = "year + 7" , inplace = True)

In [72]:
df.head()

Unnamed: 0,country,year,population,life_exp,gdp_cap
0,Afghanistan,1952,8425333,28.801,779.445314
1,Afghanistan,1957,9240934,30.332,820.85303
2,Afghanistan,1962,10267083,31.997,853.10071
3,Afghanistan,1967,11537966,34.02,836.197138
4,Afghanistan,1972,13079460,36.088,739.981106


In [73]:
df["gdp"] = df["gdp_cap"] * df["population"]
df.head()

Unnamed: 0,country,year,population,life_exp,gdp_cap,gdp
0,Afghanistan,1952,8425333,28.801,779.445314,6567086000.0
1,Afghanistan,1957,9240934,30.332,820.85303,7585449000.0
2,Afghanistan,1962,10267083,31.997,853.10071,8758856000.0
3,Afghanistan,1967,11537966,34.02,836.197138,9648014000.0
4,Afghanistan,1972,13079460,36.088,739.981106,9678553000.0


### 9. Creating a completely new Series

While creating a new Series in a dataframe, the number of rows in the new Series must match the number of rows in the dataframe, otherwise it will result in an error.
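A minimal sketch of this rule on a hypothetical three-row dataframe: a list of the right length is accepted, while a shorter list raises a `ValueError`.

```python
import pandas as pd

# Three-row dataframe used only for this illustration
demo = pd.DataFrame({"year": [1952, 1957, 1962]})

# Length matches the number of rows, so the new column is created
demo["cube"] = [i**3 for i in range(len(demo))]

# Length 2 does not match the 3 rows, so pandas raises a ValueError
try:
    demo["too_short"] = [1, 2]
    error_message = ""
except ValueError as exc:
    error_message = str(exc)

print(demo)
print(error_message)
```

Using `len(demo)` or `df.shape[0]` instead of a hard-coded number such as 1704 keeps the new column the right size if the data changes.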

In [74]:
df.tail()

Unnamed: 0,country,year,population,life_exp,gdp_cap,gdp
1699,Zimbabwe,1987,9216418,62.351,706.157306,6508241000.0
1700,Zimbabwe,1992,10704340,60.377,693.420786,7422612000.0
1701,Zimbabwe,1997,11404948,46.809,792.44996,9037851000.0
1702,Zimbabwe,2002,11926563,39.989,672.038623,8015111000.0
1703,Zimbabwe,2007,12311143,43.487,469.709298,5782658000.0


In [75]:
df["random"] = [i**3 for i in range(1704)]

#The square brackets [] here form a list comprehension, a concise way to create a list

In [76]:
df.head()

Unnamed: 0,country,year,population,life_exp,gdp_cap,gdp,random
0,Afghanistan,1952,8425333,28.801,779.445314,6567086000.0,0
1,Afghanistan,1957,9240934,30.332,820.85303,7585449000.0,1
2,Afghanistan,1962,10267083,31.997,853.10071,8758856000.0,8
3,Afghanistan,1967,11537966,34.02,836.197138,9648014000.0,27
4,Afghanistan,1972,13079460,36.088,739.981106,9678553000.0,64


In [77]:
df.drop(columns = ["gdp", "random"], inplace =True)

In [78]:
df.head()

Unnamed: 0,country,year,population,life_exp,gdp_cap
0,Afghanistan,1952,8425333,28.801,779.445314
1,Afghanistan,1957,9240934,30.332,820.85303
2,Afghanistan,1962,10267083,31.997,853.10071
3,Afghanistan,1967,11537966,34.02,836.197138
4,Afghanistan,1972,13079460,36.088,739.981106


In [79]:
df.drop(columns = "continent", inplace =True) #Results in error: "continent" was already dropped in In [63]

KeyError: "['continent'] not found in axis"
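The error above appears because the column no longer exists. As a hedged sketch on a small hypothetical dataframe, the `errors="ignore"` option of `drop` skips labels that are not present, which helps when a notebook cell may be run more than once.

```python
import pandas as pd

# Dataframe without a "continent" column, used only for this illustration
demo = pd.DataFrame({"country": ["Afghanistan"], "year": [1952]})

# Dropping a missing column raises a KeyError by default
try:
    demo.drop(columns="continent")
    raised = False
except KeyError:
    raised = True
print(raised)

# errors="ignore" returns the dataframe unchanged when the label is absent
same = demo.drop(columns="continent", errors="ignore")
print(list(same.columns))
```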

In [None]:
df.head()

In [83]:
df

Unnamed: 0,country,year,population,life_exp,gdp_cap
0,Afghanistan,1952,8425333,28.801,779.445314
1,Afghanistan,1957,9240934,30.332,820.853030
2,Afghanistan,1962,10267083,31.997,853.100710
3,Afghanistan,1967,11537966,34.020,836.197138
4,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,62.351,706.157306
1700,Zimbabwe,1992,10704340,60.377,693.420786
1701,Zimbabwe,1997,11404948,46.809,792.449960
1702,Zimbabwe,2002,11926563,39.989,672.038623


## 12. Working with Rows

1. df.index.values
2. df.index
3. Explicit & Implicit Indexing
4. df.index[a] #a is the implicit index (row position)
5. Indexing and Slicing with loc and iloc
6. set_index
7. reset_index

__1. df.index.values:__ The __df.index.values__ returns the values of the index of a DataFrame. This is useful when you want to access or manipulate the index values directly, for example, to perform calculations or comparisons.

In [80]:
df.index.values

array([   0,    1,    2, ..., 1701, 1702, 1703], dtype=int64)

__2. df.index:__ The __df.index__ returns the index labels of a DataFrame df. In pandas, the index is a structure that labels each row in the DataFrame, allowing for efficient data retrieval and manipulation.

When you call df.index, it will return the index labels of the DataFrame df, which could be integers, strings, or any other data type depending on how the DataFrame was created or modified.

In [81]:
df.index

RangeIndex(start=0, stop=1704, step=1)

This indicates that the DataFrame has integer index labels starting from 0 and ending at 1704 (exclusive), with a step size of 1.

__3. Explicit & Implicit Indexing__

__Explicit Indexing:__

==> Explicit indexing refers to when you use specific labels or names to access data in a pandas DataFrame or Series.

==> In explicit indexing, you access data using labels that you assign to the rows and columns of your DataFrame or Series.

==> For example, if you have a DataFrame with rows labeled as 'A', 'B', 'C', etc., and columns labeled as 'X', 'Y', 'Z', etc., you explicitly access data using these labels like df.loc['A', 'X'], where loc is the explicit indexing accessor.

__Implicit Indexing:__

==> Implicit indexing, also known as positional or integer-based indexing, refers to when you use the default numerical positions of rows and columns to access data.

==> In implicit indexing, you access data based on the numerical positions of rows and columns, starting from 0.

==> For example, if you have a DataFrame with default integer index labels (0, 1, 2, etc.), you implicitly access data using these numerical positions like df.iloc[0, 1], where iloc is the implicit indexing accessor.

__In simple terms:__

==> Explicit indexing means you directly refer to data using the labels or names you've assigned to rows and columns.

==> Implicit indexing means you refer to data using the default numerical positions of rows and columns. You don't use specific labels; instead, you use the order in which the data appears.
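The labels 'A', 'B', 'C' and 'X', 'Y' from the description above can be turned into a small runnable sketch. The values in the table are made up.

```python
import pandas as pd

# Rows labeled A, B, C and columns labeled X, Y (illustration only)
grid = pd.DataFrame({"X": [1, 2, 3], "Y": [4, 5, 6]}, index=["A", "B", "C"])

# Explicit indexing: row label "A", column label "X"
by_label = grid.loc["A", "X"]

# Implicit indexing: first row (position 0), second column (position 1)
by_position = grid.iloc[0, 1]

print(by_label, by_position)
```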

__4. df.index[a]__ #a is the implicit index (row position)

__The df.index[1]__ means you're asking for the label of the row at the position number 1 in your DataFrame's index. For example, if your DataFrame has rows labeled with numbers or names, it would give you the label of the second row.

In [86]:
df.index[1] # implicit index 1 gave explicit index 1

# df.index[implicit_index] -> Explicit Index

1

__df.index__ also enables us to change the explicit indexing

In [87]:
#Lets change the indices of the dataframe
df.index = np.arange(10, df.shape[0] + 10, dtype = "float") #df.shape[0] = 1704

In [88]:
df

Unnamed: 0,country,year,population,life_exp,gdp_cap
10.0,Afghanistan,1952,8425333,28.801,779.445314
11.0,Afghanistan,1957,9240934,30.332,820.853030
12.0,Afghanistan,1962,10267083,31.997,853.100710
13.0,Afghanistan,1967,11537966,34.020,836.197138
14.0,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
1709.0,Zimbabwe,1987,9216418,62.351,706.157306
1710.0,Zimbabwe,1992,10704340,60.377,693.420786
1711.0,Zimbabwe,1997,11404948,46.809,792.449960
1712.0,Zimbabwe,2002,11926563,39.989,672.038623


In [89]:
df.index[1] # implicit index 1 gave explicit index 11

# df.index[implicit_index] -> Explicit Index

11.0

In [91]:
#Now lets see how to keep strings as index
sample = df.head()

In [92]:
sample

Unnamed: 0,country,year,population,life_exp,gdp_cap
10.0,Afghanistan,1952,8425333,28.801,779.445314
11.0,Afghanistan,1957,9240934,30.332,820.85303
12.0,Afghanistan,1962,10267083,31.997,853.10071
13.0,Afghanistan,1967,11537966,34.02,836.197138
14.0,Afghanistan,1972,13079460,36.088,739.981106


In [93]:
#setting strings as index

sample.index = ["a", "b", "c", "d", "e"]
sample

Unnamed: 0,country,year,population,life_exp,gdp_cap
a,Afghanistan,1952,8425333,28.801,779.445314
b,Afghanistan,1957,9240934,30.332,820.85303
c,Afghanistan,1962,10267083,31.997,853.10071
d,Afghanistan,1967,11537966,34.02,836.197138
e,Afghanistan,1972,13079460,36.088,739.981106


In [94]:
#Changing the explicit indexes of the dataframe to integers
df.index = np.arange(10, df.shape[0] + 10, dtype = "int")
df

Unnamed: 0,country,year,population,life_exp,gdp_cap
10,Afghanistan,1952,8425333,28.801,779.445314
11,Afghanistan,1957,9240934,30.332,820.853030
12,Afghanistan,1962,10267083,31.997,853.100710
13,Afghanistan,1967,11537966,34.020,836.197138
14,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
1709,Zimbabwe,1987,9216418,62.351,706.157306
1710,Zimbabwe,1992,10704340,60.377,693.420786
1711,Zimbabwe,1997,11404948,46.809,792.449960
1712,Zimbabwe,2002,11926563,39.989,672.038623


In [95]:
#Lets see how to access a single row in a Series

In [96]:
Series1 = df["country"]

In [97]:
Series1

10      Afghanistan
11      Afghanistan
12      Afghanistan
13      Afghanistan
14      Afghanistan
           ...     
1709       Zimbabwe
1710       Zimbabwe
1711       Zimbabwe
1712       Zimbabwe
1713       Zimbabwe
Name: country, Length: 1704, dtype: object

In [104]:
#Lets access the row with label 22 in the Series. Works, no issues (basically we are indexing the Series by its explicit label)
Series1[22]

'Albania'

In [103]:
Series1.head(15)

10    Afghanistan
11    Afghanistan
12    Afghanistan
13    Afghanistan
14    Afghanistan
15    Afghanistan
16    Afghanistan
17    Afghanistan
18    Afghanistan
19    Afghanistan
20    Afghanistan
21    Afghanistan
22        Albania
23        Albania
24        Albania
Name: country, dtype: object

In [105]:
#Lets access rows 5 to 15 in the Series. Slicing behaves differently

Series1[5:15]

15    Afghanistan
16    Afghanistan
17    Afghanistan
18    Afghanistan
19    Afghanistan
20    Afghanistan
21    Afghanistan
22        Albania
23        Albania
24        Albania
Name: country, dtype: object

Notice something different though?

- **Indexing in Series** used **explicit indices**
- **Slicing** however used **implicit indices**

Let's try the same for the dataframe.

In [106]:
df[0]

KeyError: 0

Notice that this syntax is exactly same as how we tried accessing a column.

- `df[x]` looks for column with name `x`

**How can we access a slice of rows in the dataframe?**

In [108]:
df[5:15]

Unnamed: 0,country,year,population,life_exp,gdp_cap
15,Afghanistan,1977,14880372,38.438,786.11336
16,Afghanistan,1982,12881816,39.854,978.011439
17,Afghanistan,1987,13867957,40.822,852.395945
18,Afghanistan,1992,16317921,41.674,649.341395
19,Afghanistan,1997,22227415,41.763,635.341351
20,Afghanistan,2002,25268405,42.129,726.734055
21,Afghanistan,2007,31889923,43.828,974.580338
22,Albania,1952,1282697,55.23,1601.056136
23,Albania,1957,1476505,59.28,1942.284244
24,Albania,1962,1728137,64.82,2312.888958


Woah, so the slicing works.

This can be a cause for confusion.

To avoid this, Pandas provides special indexers, loc and iloc

__loc:__

==> loc is a label-based indexer in pandas used to select data by label or by a boolean array.

==> It allows you to access data in a DataFrame using explicit index labels for both rows and columns.

==> Syntax: __df.loc[row_label, column_label]__ or __df.loc[row_label_condition, column_label_condition]__

__iloc:__

==> iloc is an integer-based indexer in pandas used to select data by position.

==> It allows you to access data in a DataFrame using implicit (integer) index positions for both rows and columns.

==> Syntax: __df.iloc[row_position, column_position]__ or __df.iloc[row_position_condition, column_position_condition]__


__In simple terms:__

__loc:__ Use when you want to access data by label / explicit index.

__iloc:__ Use when you want to access data by position / implicit index (integer index).
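A minimal sketch of both indexers on a hypothetical dataframe whose labels start at 10, similar to the shifted index used in these notes. It also shows `loc` with a boolean array and a list of column labels.

```python
import pandas as pd

# Small dataframe with explicit labels 10, 11, 12 (illustration only)
demo = pd.DataFrame(
    {"country": ["Afghanistan", "Afghanistan", "Albania"],
     "year": [1952, 1957, 1952]},
    index=[10, 11, 12],
)

label_row = demo.loc[11]      # row whose label is 11
position_row = demo.iloc[1]   # second row by position, the same row here

# loc with a boolean array for rows and a list of labels for columns
mask_rows = demo.loc[demo["year"] > 1955, ["country", "year"]]

print(label_row["year"], position_row.name)
print(mask_rows)
```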

In [109]:
df

Unnamed: 0,country,year,population,life_exp,gdp_cap
10,Afghanistan,1952,8425333,28.801,779.445314
11,Afghanistan,1957,9240934,30.332,820.853030
12,Afghanistan,1962,10267083,31.997,853.100710
13,Afghanistan,1967,11537966,34.020,836.197138
14,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
1709,Zimbabwe,1987,9216418,62.351,706.157306
1710,Zimbabwe,1992,10704340,60.377,693.420786
1711,Zimbabwe,1997,11404948,46.809,792.449960
1712,Zimbabwe,2002,11926563,39.989,672.038623


#### LOC:

In [117]:
# Let's access a single row of the DataFrame (returned as a Series)
df.loc[12]

country       Afghanistan
year                 1962
population       10267083
life_exp           31.997
gdp_cap         853.10071
Name: 12, dtype: object

In [119]:
# Let's access a single row of the DataFrame as a DataFrame
df.loc[[12]]

Unnamed: 0,country,year,population,life_exp,gdp_cap
12,Afghanistan,1962,10267083,31.997,853.10071


In [123]:
# Let's access a single row of the DataFrame with selected columns
df.loc[12, "year":"life_exp"]

year              1962
population    10267083
life_exp        31.997
Name: 12, dtype: object

In [125]:
# Let's access multiple rows of the DataFrame
df.loc[10:12]

Unnamed: 0,country,year,population,life_exp,gdp_cap
10,Afghanistan,1952,8425333,28.801,779.445314
11,Afghanistan,1957,9240934,30.332,820.85303
12,Afghanistan,1962,10267083,31.997,853.10071


#### Note: a `loc` slice includes the end label as well

#### ILOC

In [126]:
df.iloc[1]

country       Afghanistan
year                 1957
population        9240934
life_exp           30.332
gdp_cap         820.85303
Name: 11, dtype: object

#### Will iloc also consider the range inclusive?

In [127]:
df.iloc[0:2]

Unnamed: 0,country,year,population,life_exp,gdp_cap
10,Afghanistan,1952,8425333,28.801,779.445314
11,Afghanistan,1957,9240934,30.332,820.85303


#### No, because **`iloc` works with implicit Python-style indices**, so the end position is excluded.
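The two slicing rules can be compared side by side. A minimal sketch, assuming a hypothetical frame `toy` with labels 10 to 13:

```python
import pandas as pd

# Hypothetical toy data with explicit row labels 10..13.
toy = pd.DataFrame(
    {"year": [1952, 1957, 1962, 1967], "life_exp": [28.8, 30.3, 32.0, 34.0]},
    index=[10, 11, 12, 13],
)

# loc slices by label and includes the end label: rows 10, 11 and 12.
label_slice = toy.loc[10:12]

# iloc slices by position and excludes the end position: positions 0 and 1.
position_slice = toy.iloc[0:2]

print(len(label_slice), len(position_slice))
```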

**Which one should we use?**
- Generally, explicit indexing is considered to be better than implicit indexing.
- But it is recommended to always use `loc` or `iloc` rather than plain `df[...]` when selecting rows, so that it is clear whether labels or positions are meant.

#### What if we want to access multiple non-consecutive rows at same time?

In [128]:
df.iloc[[1, 10, 100]]  # positions 1, 10 and 100; loc takes a list of labels instead

Unnamed: 0,country,year,population,life_exp,gdp_cap
11,Afghanistan,1957,9240934,30.332,820.85303
20,Afghanistan,2002,25268405,42.129,726.734055
110,Bangladesh,1972,70759295,45.252,630.233627


#### We can just pack the indices in [] and pass them to `loc` (as labels) or `iloc` (as positions).
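A short sketch of list-based selection with both indexers, using a hypothetical frame `toy`. It also shows that `loc` raises a `KeyError` when the list contains labels that are not in the index.

```python
import pandas as pd

# Hypothetical toy data with explicit row labels 10..14.
toy = pd.DataFrame(
    {"year": [1952, 1957, 1962, 1967, 1972]},
    index=[10, 11, 12, 13, 14],
)

# iloc takes a list of positions.
by_position = toy.iloc[[0, 2, 4]]

# loc takes a list of labels; these labels sit at positions 0, 2 and 4.
by_label = toy.loc[[10, 12, 14]]
same_rows = by_position.equals(by_label)

# Passing positions to loc fails here, because no row is labelled 0, 2 or 4.
try:
    toy.loc[[0, 2, 4]]
    missing_labels_raise = False
except KeyError:
    missing_labels_raise = True

print(same_rows, missing_labels_raise)
```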

##### What about a negative index? Which of `iloc` and `loc` would work?

In [130]:
df.iloc[-1]

# Works and gives last row in dataframe

country         Zimbabwe
year                2007
population      12311143
life_exp          43.487
gdp_cap       469.709298
Name: 1713, dtype: object

In [131]:
df.loc[-1]

# Does not work

KeyError: -1

**So, why did `iloc[-1]` work, but `loc[-1]` didn't?**

- Because **`iloc` works with positional indices, while `loc` works with assigned labels**.
- `[-1]` in `iloc` points to the **row at the last position**, whereas `loc[-1]` looks for a row labelled `-1`, which does not exist here.
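To make the point concrete, the sketch below uses a hypothetical frame `toy` and then relabels a copy so that `-1` becomes a real label. Once the label exists, `loc[-1]` works, but it returns the row carrying that label rather than the last row.

```python
import pandas as pd

# Hypothetical toy data with explicit row labels 10, 11, 12.
toy = pd.DataFrame({"year": [1952, 1957, 1962]}, index=[10, 11, 12])

# iloc counts from the end, like a Python list.
last_row = toy.iloc[-1]

# loc searches for the label -1, which is absent.
try:
    toy.loc[-1]
    loc_failed = False
except KeyError:
    loc_failed = True

# If -1 is an actual label, loc finds it (here it is the first row).
relabelled = toy.copy()
relabelled.index = [-1, 0, 1]
labelled_minus_one = relabelled.loc[-1, "year"]

print(last_row["year"], loc_failed, labelled_minus_one)
```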

__6. df.set_index__


The __df.set_index()__ is used to set one or more columns as the index of the DataFrame. This method allows you to reorganize your DataFrame by assigning one or more columns to be used as the index labels, instead of the default integer-based index.

__syntax:__ _df.set_index(keys, drop=True, inplace=False)_

==> __keys:__ This parameter specifies the column label(s) that you want to set as the new index. It can be a single column label, or a list of column labels if you want a multi-level index.

==> __drop:__ This parameter is a boolean (default is True) that indicates whether to drop the column(s) used as the new index from the DataFrame or not.

==> __inplace:__ This parameter is a boolean (default is False) that indicates whether to modify the DataFrame in place or return a new DataFrame with the updated index.
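The parameters above can be tried on a small frame. This is a sketch with hypothetical data (`toy` and its values are not from the dataset used in this notebook):

```python
import pandas as pd

# Hypothetical toy data with the default integer index.
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B"],
        "year": [1952, 1957, 1952],
        "population": [100, 110, 200],
    }
)

# Default drop=True: "country" moves from the columns into the index.
by_country = toy.set_index("country")

# drop=False keeps "country" as a column as well.
kept = toy.set_index("country", drop=False)

# A list of labels builds a multi-level index.
multi = toy.set_index(["country", "year"])
b_1952 = multi.loc[("B", 1952), "population"]

# inplace defaults to False, so toy itself is unchanged.
print(list(by_country.columns), list(kept.columns), b_1952)
print(list(toy.columns))
```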

In [132]:
temp = df.set_index("country")
temp

country,year,population,life_exp,gdp_cap
Afghanistan,1952,8425333,28.801,779.445314
Afghanistan,1957,9240934,30.332,820.853030
Afghanistan,1962,10267083,31.997,853.100710
Afghanistan,1967,11537966,34.020,836.197138
Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...
Zimbabwe,1987,9216418,62.351,706.157306
Zimbabwe,1992,10704340,60.377,693.420786
Zimbabwe,1997,11404948,46.809,792.449960
Zimbabwe,2002,11926563,39.989,672.038623


In [133]:
temp.loc['Afghanistan']

country,year,population,life_exp,gdp_cap
Afghanistan,1952,8425333,28.801,779.445314
Afghanistan,1957,9240934,30.332,820.85303
Afghanistan,1962,10267083,31.997,853.10071
Afghanistan,1967,11537966,34.02,836.197138
Afghanistan,1972,13079460,36.088,739.981106
Afghanistan,1977,14880372,38.438,786.11336
Afghanistan,1982,12881816,39.854,978.011439
Afghanistan,1987,13867957,40.822,852.395945
Afghanistan,1992,16317921,41.674,649.341395
Afghanistan,1997,22227415,41.763,635.341351


__7. df.reset_index()__

The __df.reset_index()__ is used to reset the index of a DataFrame to the default integer-based index (0, 1, 2, ...). By default, the current index labels (whether they are integer-based or custom labels) are moved into a new column; with `drop=True` they are discarded instead.

__syntax:__ _df.reset_index(level=None, drop=False, inplace=False)_

==> __level:__ This parameter specifies the level(s) of the index to be reset. By default, it resets all levels. You can specify the level(s) by providing the level's position or label.

==> __drop:__ This parameter is a boolean (default is False) that indicates whether to drop the index column(s) after resetting the index or not. If True, the current index will be removed and not added as a column in the DataFrame.

==> __inplace:__ This parameter is a boolean (default is False) that indicates whether to modify the DataFrame in place or return a new DataFrame with the reset index.
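The effect of `drop` is easiest to see on a frame with a custom index. A minimal sketch with hypothetical data (`toy` and `indexed` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data, indexed by country.
toy = pd.DataFrame({"country": ["A", "A", "B"], "year": [1952, 1957, 1952]})
indexed = toy.set_index("country")

# Default drop=False: the old index comes back as a regular column.
restored = indexed.reset_index()

# drop=True: the old index is discarded entirely.
discarded = indexed.reset_index(drop=True)

print(list(restored.columns))
print(list(discarded.columns), list(discarded.index))
```

In both cases the result has a fresh 0-based integer index, and `indexed` itself is left unchanged because `inplace` defaults to False.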

In [134]:
df.reset_index(drop=True) # by using drop=True we can prevent creation of a new column

Unnamed: 0,country,year,population,life_exp,gdp_cap
0,Afghanistan,1952,8425333,28.801,779.445314
1,Afghanistan,1957,9240934,30.332,820.853030
2,Afghanistan,1962,10267083,31.997,853.100710
3,Afghanistan,1967,11537966,34.020,836.197138
4,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,62.351,706.157306
1700,Zimbabwe,1992,10704340,60.377,693.420786
1701,Zimbabwe,1997,11404948,46.809,792.449960
1702,Zimbabwe,2002,11926563,39.989,672.038623


In [135]:
df

Unnamed: 0,country,year,population,life_exp,gdp_cap
10,Afghanistan,1952,8425333,28.801,779.445314
11,Afghanistan,1957,9240934,30.332,820.853030
12,Afghanistan,1962,10267083,31.997,853.100710
13,Afghanistan,1967,11537966,34.020,836.197138
14,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
1709,Zimbabwe,1987,9216418,62.351,706.157306
1710,Zimbabwe,1992,10704340,60.377,693.420786
1711,Zimbabwe,1997,11404948,46.809,792.449960
1712,Zimbabwe,2002,11926563,39.989,672.038623


In [136]:
df.reset_index(drop=True, inplace=True)

In [137]:
df

Unnamed: 0,country,year,population,life_exp,gdp_cap
0,Afghanistan,1952,8425333,28.801,779.445314
1,Afghanistan,1957,9240934,30.332,820.853030
2,Afghanistan,1962,10267083,31.997,853.100710
3,Afghanistan,1967,11537966,34.020,836.197138
4,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,62.351,706.157306
1700,Zimbabwe,1992,10704340,60.377,693.420786
1701,Zimbabwe,1997,11404948,46.809,792.449960
1702,Zimbabwe,2002,11926563,39.989,672.038623
