# What is Pandas?

**Pandas** is a powerful and flexible open-source Python library used for data manipulation, analysis, and cleaning.

It provides two primary data structures:

**Series:** *1-dimensional* labeled array (like a column in a spreadsheet).

**DataFrame:** *2-dimensional* labeled table (like an Excel sheet or SQL table).
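
As a minimal sketch of the two structures, the snippet below builds a Series and a DataFrame by hand. The values and the index labels `a`, `b`, `c` are made up for illustration and are not taken from any dataset.

```python
import pandas as pd

# A Series: one-dimensional values, each paired with an index label.
sepal_length = pd.Series([5.1, 4.9, 4.7], index=["a", "b", "c"], name="SepalLengthCm")

# A DataFrame: a table whose columns are Series that share one index.
flowers = pd.DataFrame(
    {"SepalLengthCm": [5.1, 4.9, 4.7], "Species": ["Iris-setosa"] * 3},
    index=["a", "b", "c"],
)

print(sepal_length["b"])               # look up a value by its label
print(type(flowers["SepalLengthCm"]))  # selecting one column gives a Series
print(flowers.shape)                   # (rows, columns)
```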

## Key Features of Pandas

*   **Easy Data Loading:** Load data from CSV, Excel, SQL, JSON, etc.
*   **Powerful Indexing:** Label-based and position-based indexing (.loc, .iloc).
*   **Data Cleaning Tools:** Handle missing values, duplicates, outliers, etc.
*   **Data Aggregation:** Summarize data with groupby, pivot_table, and agg().
*   **Reshaping:** Use melt, pivot, stack, unstack to reshape data.
*   **Merge and Join:** Combine multiple datasets like SQL joins (merge, concat).
*   **Vectorized Operations:** Fast operations on entire columns (no need for loops).
*   **Built-in Plotting:** Simple plots using .plot() with Matplotlib backend.
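
To illustrate the "vectorized operations" point from the list above, here is a small sketch that compares an explicit Python loop with the equivalent column-wise expression. The two short Series are illustrative values only.

```python
import pandas as pd

lengths = pd.Series([5.1, 4.9, 4.7])
widths = pd.Series([3.5, 3.0, 3.2])

# Loop version: multiply the values pair by pair.
loop_result = [l * w for l, w in zip(lengths, widths)]

# Vectorized version: one expression over the whole columns.
vectorized = lengths * widths

print(vectorized.round(2).tolist())
```

Both versions compute the same numbers; the vectorized form is shorter and is usually much faster on large columns.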


# What is the Iris Dataset?

The **Iris dataset** is a small, classic dataset in the field of machine learning and statistics. It contains measurements of flowers from three species of the iris plant:

* Iris setosa
* Iris versicolor
* Iris virginica

# Import Pandas

In [118]:
import pandas as pd

# Load the Dataset
How to load a dataset from a CSV file into a DataFrame?

In [119]:
df = pd.read_csv("/content/Iris.csv")  # read_csv already returns a DataFrame
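
If the file is not at hand, the same call can be tried on in-memory text. The sketch below assumes a two-row stand-in with the same column names as the Iris file; `csv_text`, `sample`, and `indexed` are hypothetical names used only here.

```python
import io

import pandas as pd

# A tiny stand-in for Iris.csv (hypothetical sample, same column names).
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
"""

# read_csv accepts a path or any file-like object and returns a DataFrame.
sample = pd.read_csv(io.StringIO(csv_text))
print(type(sample).__name__)

# index_col uses an existing column as the row index instead of 0, 1, 2, ...
indexed = pd.read_csv(io.StringIO(csv_text), index_col="Id")
print(indexed.index.tolist())
```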

# Dataset Overview

**What does head() show us?**<br>
By default, it shows the first 5 rows of the DataFrame; pass a number, such as head(10), to see more.

In [120]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


How to show the last 5 rows of the DataFrame?

In [121]:
df.tail()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
149,150,5.9,3.0,5.1,1.8,Iris-virginica


How do we get a random sample of rows? (The selected rows change on every run unless you pass random_state.)

In [122]:
df.sample(10)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
64,65,5.6,2.9,3.6,1.3,Iris-versicolor
134,135,6.1,2.6,5.6,1.4,Iris-virginica
92,93,5.8,2.6,4.0,1.2,Iris-versicolor
34,35,4.9,3.1,1.5,0.1,Iris-setosa
108,109,6.7,2.5,5.8,1.8,Iris-virginica
102,103,7.1,3.0,5.9,2.1,Iris-virginica
63,64,6.1,2.9,4.7,1.4,Iris-versicolor
104,105,6.5,3.0,5.8,2.2,Iris-virginica
66,67,5.6,3.0,4.5,1.5,Iris-versicolor
70,71,5.9,3.2,4.8,1.8,Iris-versicolor


How many rows and columns are in the dataset?

In [123]:
df.shape

(150, 6)

What are the column names?

In [124]:
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

# Data Types and Info
How to explore column types and general information?

In [125]:
df.dtypes

Unnamed: 0,0
Id,int64
SepalLengthCm,float64
SepalWidthCm,float64
PetalLengthCm,float64
PetalWidthCm,float64
Species,object


In [126]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


# Descriptive Statistics

How do we generate summary statistics of the dataset? <br>
What are the mean, min, max, and other statistics of the numeric columns?

In [127]:
df.describe()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


How to describe categorical data?

In [128]:
df.describe(include="object")

Unnamed: 0,Species
count,150
unique,3
top,Iris-setosa
freq,50


**What does value_counts() do?**<br>
It returns the count of each unique value in a Series, sorted from most to least frequent.

How can we see how many times each species appears?

In [129]:
df['Species'].value_counts()

Unnamed: 0_level_0,count
Species,Unnamed: 1_level_1
Iris-setosa,50
Iris-versicolor,50
Iris-virginica,50
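
Counts can also be expressed as proportions. The sketch below uses a small hand-made Series (not the real dataset) and the `normalize=True` option of `value_counts()`.

```python
import pandas as pd

# Illustrative values only: four flowers, three species.
species = pd.Series(["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-versicolor"])

counts = species.value_counts()                # absolute counts
shares = species.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```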


# Selecting Columns

How can we access a single column?

In [130]:
df['SepalLengthCm']

Unnamed: 0,SepalLengthCm
0,5.1
1,4.9
2,4.7
3,4.6
4,5.0
...,...
145,6.7
146,6.3
147,6.5
148,6.2


How can we select multiple columns?

In [131]:
df[['SepalLengthCm', 'SepalWidthCm']]

Unnamed: 0,SepalLengthCm,SepalWidthCm
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6
...,...,...
145,6.7,3.0
146,6.3,2.5
147,6.5,3.0
148,6.2,3.4


# Create a New Column
How can we create a new calculated column based on existing ones?

In [132]:
df['SepalArea'] = df['SepalLengthCm'] * df['SepalWidthCm']

In [133]:
df['SepalArea']

Unnamed: 0,SepalArea
0,17.85
1,14.70
2,15.04
3,14.26
4,18.00
...,...
145,20.10
146,15.75
147,19.50
148,21.08


# Missing Values
How to check for missing values in a dataset?

In [134]:
df.isnull().sum()

Unnamed: 0,0
Id,0
SepalLengthCm,0
SepalWidthCm,0
PetalLengthCm,0
PetalWidthCm,0
Species,0
SepalArea,0
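
The Iris data has no missing values, so the check above returns zeros everywhere. To see what the tools do when values are missing, here is a sketch on a small made-up frame; `demo` and its contents are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing number and one missing label.
demo = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 4.7],
    "Species": ["Iris-setosa", "Iris-setosa", None],
})

missing_per_column = demo.isnull().sum()  # count of missing values per column
dropped = demo.dropna()                   # keep only rows with no missing values

# Fill the numeric gap with the column mean; other columns are left untouched.
filled = demo.fillna({"SepalLengthCm": demo["SepalLengthCm"].mean()})

print(missing_per_column)
print(len(dropped))
print(filled)
```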


# Filter Rows

How to filter rows based on conditions?

In [135]:
df[df['PetalWidthCm'] > 2]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,SepalArea
100,101,6.3,3.3,6.0,2.5,Iris-virginica,20.79
102,103,7.1,3.0,5.9,2.1,Iris-virginica,21.3
104,105,6.5,3.0,5.8,2.2,Iris-virginica,19.5
105,106,7.6,3.0,6.6,2.1,Iris-virginica,22.8
109,110,7.2,3.6,6.1,2.5,Iris-virginica,25.92
112,113,6.8,3.0,5.5,2.1,Iris-virginica,20.4
114,115,5.8,2.8,5.1,2.4,Iris-virginica,16.24
115,116,6.4,3.2,5.3,2.3,Iris-virginica,20.48
117,118,7.7,3.8,6.7,2.2,Iris-virginica,29.26
118,119,7.7,2.6,6.9,2.3,Iris-virginica,20.02


In [136]:
df[df['Species'] == 'Iris-virginica']

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,SepalArea
100,101,6.3,3.3,6.0,2.5,Iris-virginica,20.79
101,102,5.8,2.7,5.1,1.9,Iris-virginica,15.66
102,103,7.1,3.0,5.9,2.1,Iris-virginica,21.3
103,104,6.3,2.9,5.6,1.8,Iris-virginica,18.27
104,105,6.5,3.0,5.8,2.2,Iris-virginica,19.5
105,106,7.6,3.0,6.6,2.1,Iris-virginica,22.8
106,107,4.9,2.5,4.5,1.7,Iris-virginica,12.25
107,108,7.3,2.9,6.3,1.8,Iris-virginica,21.17
108,109,6.7,2.5,5.8,1.8,Iris-virginica,16.75
109,110,7.2,3.6,6.1,2.5,Iris-virginica,25.92
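
Conditions can be combined. The sketch below, on a small made-up frame, shows `&` for "and", `isin()` for membership tests, and `query()` as an alternative spelling. Each condition needs its own parentheses when `&` or `|` is used.

```python
import pandas as pd

# Illustrative rows only.
demo = pd.DataFrame({
    "PetalWidthCm": [0.2, 1.3, 2.5, 1.8],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Both conditions must hold.
wide_virginica = demo[(demo["PetalWidthCm"] > 2) & (demo["Species"] == "Iris-virginica")]

# Keep rows whose species is in a given list.
not_setosa = demo[demo["Species"].isin(["Iris-versicolor", "Iris-virginica"])]

# The same filter as wide_virginica, written as a query string.
same_with_query = demo.query("PetalWidthCm > 2 and Species == 'Iris-virginica'")

print(wide_virginica)
print(not_setosa)
```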


# Grouping and Aggregation

How do we compute the mean for each group?

In [137]:
df.groupby('Species').mean()

Unnamed: 0_level_0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,SepalArea
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Iris-setosa,25.5,5.006,3.418,1.464,0.244,17.2088
Iris-versicolor,75.5,5.936,2.77,4.26,1.326,16.5262
Iris-virginica,125.5,6.588,2.974,5.552,2.026,19.6846
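
Several statistics can be computed per group in one call with `agg()`. The sketch below uses a made-up frame; the `Note` column is a hypothetical text column added to show `numeric_only=True`, which skips non-numeric columns (in pandas 2.0 and later, leaving it out is expected to raise a TypeError when such columns are present).

```python
import pandas as pd

# Illustrative rows only; "Note" is a hypothetical text column.
demo = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
    "SepalWidthCm": [3.5, 3.0, 2.8, 3.2],
    "Note": ["a", "b", "c", "d"],
})

# Several aggregations of one column per group.
summary = demo.groupby("Species")["SepalWidthCm"].agg(["mean", "min", "max"])

# Mean of every numeric column per group, ignoring text columns.
means = demo.groupby("Species").mean(numeric_only=True)

print(summary)
print(means)
```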


# Sorting Values
How do we sort rows by the values in a column?

In [138]:
df.sort_values(by='SepalLengthCm', ascending=False)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,SepalArea
131,132,7.9,3.8,6.4,2.0,Iris-virginica,30.02
122,123,7.7,2.8,6.7,2.0,Iris-virginica,21.56
118,119,7.7,2.6,6.9,2.3,Iris-virginica,20.02
117,118,7.7,3.8,6.7,2.2,Iris-virginica,29.26
135,136,7.7,3.0,6.1,2.3,Iris-virginica,23.10
...,...,...,...,...,...,...,...
41,42,4.5,2.3,1.3,0.3,Iris-setosa,10.35
42,43,4.4,3.2,1.3,0.2,Iris-setosa,14.08
8,9,4.4,2.9,1.4,0.2,Iris-setosa,12.76
38,39,4.4,3.0,1.3,0.2,Iris-setosa,13.20


What does this code do?

In [139]:
df.groupby('Species')['SepalWidthCm'].mean().sort_values(ascending=False)

Unnamed: 0_level_0,SepalWidthCm
Species,Unnamed: 1_level_1
Iris-setosa,3.418
Iris-virginica,2.974
Iris-versicolor,2.77


It groups the data by species, calculates the average sepal width, and sorts the result in descending order.

# Drop Columns
How can we drop columns?

In [140]:
df.drop(columns=['SepalArea'], inplace=True)

# Renaming Columns
How can we rename columns?

In [141]:
df.rename(columns={'SepalLengthCm': 'SepalLength'}, inplace=True)

In [142]:
df

Unnamed: 0,Id,SepalLength,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
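
Both `drop()` and `rename()` also work without `inplace=True`: they then return a new DataFrame and leave the original unchanged, which allows chaining. A minimal sketch on a one-row made-up frame:

```python
import pandas as pd

# Illustrative frame only.
demo = pd.DataFrame({"SepalLengthCm": [5.1], "SepalArea": [17.85]})

# Each call returns a new DataFrame, so the calls can be chained.
cleaned = (
    demo
    .drop(columns=["SepalArea"])
    .rename(columns={"SepalLengthCm": "SepalLength"})
)

print(list(cleaned.columns))
print(list(demo.columns))  # the original is untouched
```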


How can we count unique values in each column?

In [143]:
df.nunique()

Unnamed: 0,0
Id,150
SepalLength,35
SepalWidthCm,23
PetalLengthCm,43
PetalWidthCm,22
Species,3


How do we count unique species?

In [144]:
df['Species'].nunique()

3

# loc vs iloc

**What is the difference between loc and iloc?**<br>
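loc selects by index labels and column names; iloc selects by integer positions. In this dataset the row labels are the default 0 to 149, so the row label and the row position happen to be the same number.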

In [145]:
df.iloc[0]

Unnamed: 0,0
Id,1
SepalLength,5.1
SepalWidthCm,3.5
PetalLengthCm,1.4
PetalWidthCm,0.2
Species,Iris-setosa


In [146]:
df.iloc[12]

Unnamed: 0,12
Id,13
SepalLength,4.8
SepalWidthCm,3.0
PetalLengthCm,1.4
PetalWidthCm,0.1
Species,Iris-setosa


In [147]:
df.iloc[0:2, 2:4]

Unnamed: 0,SepalWidthCm,PetalLengthCm
0,3.5,1.4
1,3.0,1.4


In [148]:
df.loc[0, 'PetalLengthCm']  # Specific value of a cell

np.float64(1.4)

In [149]:
df.loc[56, 'SepalWidthCm']

np.float64(3.3)

In [150]:
df.loc[56, 'Species']

'Iris-versicolor'
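
The difference becomes visible once the index is not the default 0, 1, 2, and so on. The sketch below assumes a made-up frame with row labels 10, 20, 30.

```python
import pandas as pd

# Illustrative frame whose row labels differ from the row positions.
demo = pd.DataFrame({"SepalLength": [5.1, 4.9, 4.7]}, index=[10, 20, 30])

by_label = demo.loc[10, "SepalLength"]  # row labelled 10
by_position = demo.iloc[0, 0]           # first row, first column

# Slices differ too: loc includes the end label, iloc excludes the end position.
label_slice = demo.loc[10:20]
position_slice = demo.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```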

# Pivot Table

**What is a pivot table?**<br>
It groups rows by one or more keys and aggregates a value column for each group, much like groupby.

How to summarize data using a pivot table?

In [151]:
df.pivot_table(values='SepalWidthCm', index='Species', aggfunc='mean')

Unnamed: 0_level_0,SepalWidthCm
Species,Unnamed: 1_level_1
Iris-setosa,3.418
Iris-versicolor,2.77
Iris-virginica,2.974
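
A pivot table can also spread a second key across the columns. The sketch below uses a made-up frame; the `sepal_label` column and its values are assumptions for illustration.

```python
import pandas as pd

# Illustrative rows only.
demo = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
    "sepal_label": ["short", "long", "long", "long"],
    "SepalWidthCm": [3.0, 3.5, 3.0, 3.4],
})

# Rows are species, columns are labels, cells hold the mean sepal width.
table = demo.pivot_table(
    values="SepalWidthCm",
    index="Species",
    columns="sepal_label",
    aggfunc="mean",
)

print(table)
```

Combinations that never occur in the data (here, a short virginica) show up as NaN.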


# Concatenate DataFrames
How to concatenate rows and columns from multiple DataFrames?

In [152]:
df1 = df.iloc[:20]
df2 = df.iloc[70:80]

What does this code do?

In [153]:
pd.concat([df1, df2], axis=0) # axis=0 means row-wise concatenation (adds more rows)

Unnamed: 0,Id,SepalLength,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
5,6,5.4,3.9,1.7,0.4,Iris-setosa
6,7,4.6,3.4,1.4,0.3,Iris-setosa
7,8,5.0,3.4,1.5,0.2,Iris-setosa
8,9,4.4,2.9,1.4,0.2,Iris-setosa
9,10,4.9,3.1,1.5,0.1,Iris-setosa


**When should we use this?**<br>
When we want to combine data as additional entries (rows)
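
Row-wise concatenation keeps the original index labels of each piece. If a fresh 0, 1, 2, ... index is preferred, `ignore_index=True` can be passed. A sketch with two tiny made-up frames:

```python
import pandas as pd

# Illustrative pieces with non-contiguous index labels.
part1 = pd.DataFrame({"Id": [1, 2]}, index=[0, 1])
part2 = pd.DataFrame({"Id": [71, 72]}, index=[70, 71])

stacked = pd.concat([part1, part2], axis=0)                        # keeps labels
renumbered = pd.concat([part1, part2], axis=0, ignore_index=True)  # new labels

print(stacked.index.tolist())
print(renumbered.index.tolist())
```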


---



What does this code do?

In [154]:
df1 = df['SepalLength']
df2 = df['PetalWidthCm']

pd.concat([df1, df2], axis=1) # axis=1 means column-wise concatenation (adds more columns)

Unnamed: 0,SepalLength,PetalWidthCm
0,5.1,0.2
1,4.9,0.2
2,4.7,0.2
3,4.6,0.2
4,5.0,0.2
...,...,...
145,6.7,2.3
146,6.3,1.9
147,6.5,2.0
148,6.2,2.3


**When should we use this?**<br>
When we want to combine related columns for the same rows


---



# Merge with Another DataFrame
How do we merge two DataFrames?

In [155]:
farsi_names = pd.DataFrame({
    'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
    'name_fa': ['ستوسا', 'ورسیکالر', 'ویرجینیکا']
})

In [156]:
farsi_names

Unnamed: 0,Species,name_fa
0,Iris-setosa,ستوسا
1,Iris-versicolor,ورسیکالر
2,Iris-virginica,ویرجینیکا


In [157]:
df.merge(farsi_names, on='Species')

Unnamed: 0,Id,SepalLength,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,name_fa
0,1,5.1,3.5,1.4,0.2,Iris-setosa,ستوسا
1,2,4.9,3.0,1.4,0.2,Iris-setosa,ستوسا
2,3,4.7,3.2,1.3,0.2,Iris-setosa,ستوسا
3,4,4.6,3.1,1.5,0.2,Iris-setosa,ستوسا
4,5,5.0,3.6,1.4,0.2,Iris-setosa,ستوسا
...,...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica,ویرجینیکا
146,147,6.3,2.5,5.0,1.9,Iris-virginica,ویرجینیکا
147,148,6.5,3.0,5.2,2.0,Iris-virginica,ویرجینیکا
148,149,6.2,3.4,5.4,2.3,Iris-virginica,ویرجینیکا
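
By default `merge()` performs an inner join, so rows without a match are dropped. The sketch below shows a left join with `indicator=True` on made-up frames; `Iris-unknown` and `common_name` are hypothetical values chosen to produce an unmatched row.

```python
import pandas as pd

# Illustrative frames; "Iris-unknown" has no entry in the lookup table.
measurements = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-unknown"],
    "SepalLength": [5.1, 6.0],
})
names = pd.DataFrame({"Species": ["Iris-setosa"], "common_name": ["setosa"]})

inner = measurements.merge(names, on="Species")  # default how="inner"

# Keep every row of the left frame and record where each row came from.
left = measurements.merge(names, on="Species", how="left", indicator=True)

print(inner)
print(left)
```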


# Masking and Conditional Columns
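How do we label rows based on a condition? <br>
mask() replaces the values where the condition is True and keeps the others unchanged, so the new column below mixes the label 'long' with the original numbers.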

In [158]:
df['LongSepal'] = df['SepalLength'].mask(df['SepalLength'] > 5, 'long')

In [159]:
df['LongSepal'].value_counts()

Unnamed: 0_level_0,count
LongSepal,Unnamed: 1_level_1
long,118
5.0,10
4.9,6
4.8,5
4.6,4
4.4,3
4.7,2
4.3,1
4.5,1


In [160]:
df['LongSepal']

Unnamed: 0,LongSepal
0,long
1,4.9
2,4.7
3,4.6
4,5.0
...,...
145,long
146,long
147,long
148,long
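
If a column that contains only labels is wanted, one common alternative is `numpy.where`, which picks one value where the condition holds and another where it does not. This is a sketch of a different technique than the `mask()` call above, on a small made-up frame:

```python
import numpy as np
import pandas as pd

# Illustrative values only.
demo = pd.DataFrame({"SepalLength": [5.1, 4.9, 5.0]})

# "long" where the condition is True, "short" everywhere else.
demo["LongSepal"] = np.where(demo["SepalLength"] > 5, "long", "short")

print(demo)
```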


# Apply Functions to Columns


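**What is the difference between apply and map?**<br>
apply is available on both Series and DataFrames. map works element by element on a Series (pandas 2.1 and later also provide an element-wise DataFrame.map).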

**How does apply() work on columns?**<br>
On a DataFrame, it passes each column (axis=0) or each row (axis=1) to the function.

How do we apply a function to each column, and to every value in a single column?

In [161]:
df[['SepalLength', 'SepalWidthCm']].apply(lambda x: x.mean(), axis=0)

Unnamed: 0,0
SepalLength,5.843333
SepalWidthCm,3.054


In [162]:
df['species_length'] = df['Species'].map(len)

In [163]:
df

Unnamed: 0,Id,SepalLength,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,LongSepal,species_length
0,1,5.1,3.5,1.4,0.2,Iris-setosa,long,11
1,2,4.9,3.0,1.4,0.2,Iris-setosa,4.9,11
2,3,4.7,3.2,1.3,0.2,Iris-setosa,4.7,11
3,4,4.6,3.1,1.5,0.2,Iris-setosa,4.6,11
4,5,5.0,3.6,1.4,0.2,Iris-setosa,5.0,11
...,...,...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica,long,14
146,147,6.3,2.5,5.0,1.9,Iris-virginica,long,14
147,148,6.5,3.0,5.2,2.0,Iris-virginica,long,14
148,149,6.2,3.4,5.4,2.3,Iris-virginica,long,14


In [164]:
df['sepal_label'] = df['SepalLength'].apply(lambda x: 'short' if x < 5 else 'long')

In [165]:
df

Unnamed: 0,Id,SepalLength,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,LongSepal,species_length,sepal_label
0,1,5.1,3.5,1.4,0.2,Iris-setosa,long,11,long
1,2,4.9,3.0,1.4,0.2,Iris-setosa,4.9,11,short
2,3,4.7,3.2,1.3,0.2,Iris-setosa,4.7,11,short
3,4,4.6,3.1,1.5,0.2,Iris-setosa,4.6,11,short
4,5,5.0,3.6,1.4,0.2,Iris-setosa,5.0,11,long
...,...,...,...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica,long,14,long
146,147,6.3,2.5,5.0,1.9,Iris-virginica,long,14,long
147,148,6.5,3.0,5.2,2.0,Iris-virginica,long,14,long
148,149,6.2,3.4,5.4,2.3,Iris-virginica,long,14,long
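
The three variants used above can be compared side by side on a small made-up frame: `apply` along columns, `apply` along rows, and `map` on a single Series.

```python
import pandas as pd

# Illustrative values only.
demo = pd.DataFrame({"SepalLength": [5.1, 4.9], "SepalWidthCm": [3.5, 3.0]})

# axis=0: the function receives each column as a Series.
col_means = demo.apply(lambda col: col.mean(), axis=0)

# axis=1: the function receives each row as a Series.
row_sums = demo.apply(lambda row: row.sum(), axis=1)

# map: the function receives each individual value of one column.
labels = demo["SepalLength"].map(lambda x: "short" if x < 5 else "long")

print(col_means)
print(row_sums)
print(labels.tolist())
```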


# Save the Modified Dataset
How to export a DataFrame to a CSV file?

In [166]:
df.to_csv("iris_modified.csv", index=False)
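
To see what `index=False` changes without writing a file, `to_csv()` can be called without a path, in which case it returns the CSV text. The sketch below uses a made-up two-row frame and also reads the text back in.

```python
import io

import pandas as pd

# Illustrative frame only.
demo = pd.DataFrame({"Id": [1, 2], "Species": ["Iris-setosa", "Iris-setosa"]})

without_index = demo.to_csv(index=False)  # header is just the column names
with_index = demo.to_csv()                # header starts with an empty index column

# Round trip: read the exported text back into a DataFrame.
restored = pd.read_csv(io.StringIO(without_index))

print(without_index.splitlines()[0])
print(with_index.splitlines()[0])
print(restored)
```

Writing the index and reading the file back later is what produces an extra `Unnamed: 0` column, which is why `index=False` is used above.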