
# **Data Manipulation with Pandas**

This section introduces how to work with data using the pandas library in Python.

## **Installing and Using Pandas**

This covers how to get pandas ready for use and how to start using its features.

In [39]:
import pandas
# Print the installed pandas version
pandas.__version__

'2.2.2'

In [40]:
# Import the pandas library and alias it as 'pd'
import pandas as pd

## **Reminder About Built-in Documentation**

As a reminder, Python's built-in help features, such as `help()` or the `?` suffix in IPython and Jupyter, show how pandas functions work.
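
As a minimal sketch of that reminder, the snippet below reads a docstring directly, which is the same text `help()` would display. The choice of `pd.DataFrame.head` is only an illustration; any pandas function or method could be substituted.

```python
import pandas as pd

# help(pd.DataFrame.head) prints this documentation interactively;
# reading __doc__ gives the same text as a plain string.
doc = pd.DataFrame.head.__doc__

# Show only the first few lines of the docstring
first_lines = doc.strip().splitlines()[:3]
print("\n".join(first_lines))
```

In IPython or Jupyter, typing `pd.DataFrame.head?` in a cell shows the same documentation.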

# **Data Indexing and Selection**

This part explains different ways to pick out specific pieces of data from your dataset.

## **Data Selection in Series**

This focuses on selecting data from a one-dimensional data structure called a Series.

### **Series as Dictionary**

This describes how a pandas Series can be thought of and accessed like a Python dictionary.

In [41]:
import pandas as pd
# Create a pandas Series with custom index labels
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
# Display the Series
data

Unnamed: 0,0
a,0.25
b,0.5
c,0.75
d,1.0


In [42]:
# Access an element of the Series using its explicit index
data['b']

np.float64(0.5)

In [43]:
# Get the index labels of the Series
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [44]:
# Add a new element to the Series with a new index label
data['e'] = 1.25
# Display the updated Series
data

Unnamed: 0,0
a,0.25
b,0.5
c,0.75
d,1.0
e,1.25
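
The dictionary analogy extends to a few other operations. The sketch below rebuilds the same Series and shows membership tests, `items()`, and `get()`; the variable names (`has_a`, `pairs`, `fallback`) are illustrative only.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

# Membership tests look at the index labels, not the values
has_a = 'a' in data
has_z = 'z' in data

# (label, value) pairs, analogous to dict.items()
pairs = list(data.items())

# .get returns a default instead of raising KeyError for a missing label
fallback = data.get('z', -1)

print(has_a, has_z, pairs[0], fallback)
```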


### **Series as One-Dimensional Array**

This describes how a pandas Series can also be thought of and accessed like a one-dimensional array.

In [45]:
# slicing by explicit index
data['a':'c']

Unnamed: 0,0
a,0.25
b,0.5
c,0.75


In [46]:
# slicing by implicit integer index
data[0:2]

Unnamed: 0,0
a,0.25
b,0.5


In [47]:
# masking: select elements based on a boolean condition
data[(data > 0.3) & (data < 0.8)]

Unnamed: 0,0
b,0.5
c,0.75


In [48]:
# fancy indexing: select elements using a list of index labels
data[['a', 'e']]

Unnamed: 0,0
a,0.25
e,1.25
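
One detail in the slicing examples above is easy to miss: a slice by explicit label includes its final label, while a slice by integer position excludes its final position. A small sketch, using `.iloc` for the positional slice so the intent is unambiguous:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd', 'e'])

# Label slice: the stop label 'c' is included
by_label = data['a':'c']

# Positional slice: position 2 is excluded, as in ordinary Python slicing
by_position = data.iloc[0:2]

print(list(by_label.index))     # labels a, b, c
print(list(by_position.index))  # labels a, b
```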


### **Indexers: loc and iloc**

This explains how to use loc and iloc to select data based on either the index labels or the integer position.

In [49]:
# Create a new Series with integer index labels
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
# Display the Series
data

Unnamed: 0,0
1,a
3,b
5,c


In [50]:
# explicit index when indexing (uses the label)
data[1]

'a'

In [51]:
# implicit index when slicing (uses the integer position)
data[1:3]

Unnamed: 0,0
3,b
5,c


In [52]:
# Use .loc for explicit index access
data.loc[1]

'a'

In [53]:
# Use .loc for explicit index slicing
data.loc[1:3]

Unnamed: 0,0
1,a
3,b


In [54]:
# Use .iloc for implicit integer index access
data.iloc[1]

'b'

In [55]:
# Use .iloc for implicit integer index slicing
data.iloc[1:3]

Unnamed: 0,0
3,b
5,c
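
The difference between the two indexers also shows up in how they fail. The following sketch, which assumes the same integer-labeled Series, looks up a label that does not exist with `.loc` and a position that does not exist with `.iloc`; the names `loc_error` and `iloc_error` are illustrative.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# .loc works on labels: there is no row labeled 2
try:
    data.loc[2]
    loc_error = None
except KeyError as exc:
    loc_error = type(exc).__name__

# .iloc works on positions: only positions 0, 1, and 2 exist
try:
    data.iloc[5]
    iloc_error = None
except IndexError as exc:
    iloc_error = type(exc).__name__

# The same element reached both ways
by_label = data.loc[5]      # row labeled 5
by_position = data.iloc[2]  # third row

print(loc_error, iloc_error, by_label, by_position)
```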


## **Data Selection in DataFrames**

This covers how to select data from a two-dimensional data structure called a DataFrame.


In [56]:
# Create a Series for area data
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'Florida': 170312, 'New York': 141297,
                  'Pennsylvania': 119280})
# Create a Series for population data
pop = pd.Series({'California': 39538223, 'Texas': 29145505,
                 'Florida': 21538187, 'New York': 20201249,
                 'Pennsylvania': 13002700})
# Create a DataFrame from the area and pop Series
data = pd.DataFrame({'area':area, 'pop':pop})
# Display the DataFrame
data

Unnamed: 0,area,pop
California,423967,39538223
Texas,695662,29145505
Florida,170312,21538187
New York,141297,20201249
Pennsylvania,119280,13002700


In [57]:
# Access a column of the DataFrame using dictionary-style indexing
data['area']

Unnamed: 0,area
California,423967
Texas,695662
Florida,170312
New York,141297
Pennsylvania,119280


In [58]:
# Access a column of the DataFrame using attribute-style access
data.area

Unnamed: 0,area
California,423967
Texas,695662
Florida,170312
New York,141297
Pennsylvania,119280


In [59]:
# Attribute-style access does not work for every column: data.pop is the
# DataFrame.pop() method, not the 'pop' column, so this comparison is False
data.pop is data['pop']

False
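
The `False` result comes from a name clash: `pop` is also the name of a DataFrame method, so `data.pop` never reaches the column. The sketch below checks this on a smaller, two-row version of the same data (the reduced DataFrame is an assumption made to keep the example short).

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [39538223, 29145505]},
                    index=['California', 'Texas'])

# data.pop resolves to the DataFrame.pop method, which is callable
attr_is_method = callable(data.pop)

# Dictionary-style access always returns the column
item_is_series = isinstance(data['pop'], pd.Series)

# 'area' does not clash with any method, so both styles give the same values
area_matches = data.area.equals(data['area'])

print(attr_is_method, item_is_series, area_matches)
```

Dictionary-style access is therefore the safer default, especially for assignment or for column names that are not valid Python identifiers.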

In [60]:
# Add a new column 'density' to the DataFrame by calculating population density
data['density'] = data['pop'] / data['area']
# Display the updated DataFrame
data

Unnamed: 0,area,pop,density
California,423967,39538223,93.257784
Texas,695662,29145505,41.896072
Florida,170312,21538187,126.463121
New York,141297,20201249,142.97012
Pennsylvania,119280,13002700,109.009893


### **DataFrame as Two-Dimensional Array**

This describes how a pandas DataFrame can be thought of and accessed like a two-dimensional array.

In [61]:
# Access the underlying NumPy array of the DataFrame
data.values

array([[4.23967000e+05, 3.95382230e+07, 9.32577842e+01],
       [6.95662000e+05, 2.91455050e+07, 4.18960717e+01],
       [1.70312000e+05, 2.15381870e+07, 1.26463121e+02],
       [1.41297000e+05, 2.02012490e+07, 1.42970120e+02],
       [1.19280000e+05, 1.30027000e+07, 1.09009893e+02]])
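
The integer `area` and `pop` columns appear in scientific notation above because a single NumPy array holds one dtype, so the integer columns are converted to floats alongside `density`. A small sketch on a two-row version of the data (an assumption for brevity) makes the conversion visible:

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [39538223, 29145505]},
                    index=['California', 'Texas'])
data['density'] = data['pop'] / data['area']

# Each column keeps its own dtype inside the DataFrame
column_dtypes = data.dtypes.astype(str).to_dict()

# The combined array has a single dtype wide enough for every column
array = data.to_numpy()   # same contents as data.values
array_dtype = str(array.dtype)

print(column_dtypes)
print(array_dtype)
```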

In [62]:
# Transpose the DataFrame (swap rows and columns)
data.T

Unnamed: 0,California,Texas,Florida,New York,Pennsylvania
area,423967.0,695662.0,170312.0,141297.0,119280.0
pop,39538220.0,29145500.0,21538190.0,20201250.0,13002700.0
density,93.25778,41.89607,126.4631,142.9701,109.0099


In [63]:
# Access the first row of the underlying NumPy array
data.values[0]

array([4.23967000e+05, 3.95382230e+07, 9.32577842e+01])

In [64]:
# Access the 'area' column of the DataFrame
data['area']

Unnamed: 0,area
California,423967
Texas,695662
Florida,170312
New York,141297
Pennsylvania,119280


In [65]:
# Use .iloc for integer-position based slicing (select first 3 rows and first 2 columns)
data.iloc[:3, :2]

Unnamed: 0,area,pop
California,423967,39538223
Texas,695662,29145505
Florida,170312,21538187


In [66]:
# Use .loc for label-based slicing (select rows up to 'Florida' and columns up to 'pop')
data.loc[:'Florida', :'pop']

Unnamed: 0,area,pop
California,423967,39538223
Texas,695662,29145505
Florida,170312,21538187


In [67]:
# Use .loc with a boolean condition to select rows where density is greater than 120,
# and select only the 'pop' and 'density' columns
data.loc[data.density > 120, ['pop', 'density']]

Unnamed: 0,pop,density
Florida,21538187,126.463121
New York,20201249,142.97012


In [68]:
# Modify a specific element in the DataFrame using integer-position based indexing
data.iloc[0, 2] = 90
# Display the modified DataFrame
data

Unnamed: 0,area,pop,density
California,423967,39538223,90.0
Texas,695662,29145505,41.896072
Florida,170312,21538187,126.463121
New York,141297,20201249,142.97012
Pennsylvania,119280,13002700,109.009893
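
The same cell can be addressed by label instead of position. As a sketch on a two-row version of the data (an assumption for brevity), `.loc` with the row and column labels modifies the cell that `.iloc[0, 2]` points to:

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [39538223, 29145505]},
                    index=['California', 'Texas'])
data['density'] = data['pop'] / data['area']

# Label-based assignment to the same cell as data.iloc[0, 2]
data.loc['California', 'density'] = 90

by_label = data.loc['California', 'density']
by_position = data.iloc[0, 2]

print(by_label, by_position)
```

Label-based assignment keeps working if columns are later reordered, whereas a positional index would then point at a different cell.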


### **Additional Indexing Conventions**

This covers a few extra conventions for plain `[]` indexing on a DataFrame: a single key refers to columns, while slices and boolean masks refer to rows.

In [69]:
# Slice rows using explicit index labels
data['Florida':'New York']

Unnamed: 0,area,pop,density
Florida,170312,21538187,126.463121
New York,141297,20201249,142.97012


In [70]:
# Slice rows using implicit integer positions
data[1:3]

Unnamed: 0,area,pop,density
Texas,695662,29145505,41.896072
Florida,170312,21538187,126.463121


In [71]:
# Select rows based on a boolean condition on the 'density' column
data[data.density > 120]

Unnamed: 0,area,pop,density
Florida,170312,21538187,126.463121
New York,141297,20201249,142.97012
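
To summarize these conventions, the sketch below contrasts the three behaviors of plain `[]` on a DataFrame and pairs the row selections with explicit `.iloc` and `.loc` spellings. The three-row DataFrame and the population threshold are assumptions chosen for illustration.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 170312],
                     'pop': [39538223, 29145505, 21538187]},
                    index=['California', 'Texas', 'Florida'])

# A single key selects a column
column = data['area']

# A slice selects rows
rows_by_slice = data[1:3]

# A boolean mask selects rows
rows_by_mask = data[data['pop'] > 25_000_000]

# Explicit spellings of the two row selections
same_slice = data.iloc[1:3]
same_mask = data.loc[data['pop'] > 25_000_000]

print(list(rows_by_slice.index))
print(list(rows_by_mask.index))
```

The explicit forms are longer, but they make it clear to a reader whether rows or columns are being selected.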
