# What is Pandas?
Pandas is an open-source Python library used for:

- Data manipulation

- Data cleaning

- Data analysis

It is built on top of NumPy and provides powerful data structures like:

- Series – 1D

- DataFrame – 2D (rows and columns, like an Excel sheet)

# 🛠️ Installation
If you haven’t installed it yet:

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


# 🧾 Importing Pandas

In [6]:
import pandas as pd
# We use pd as the alias for pandas (a common convention).

# 🔍 Why Pandas?

| Feature            | Description                                |
| ------------------ | ------------------------------------------ |
| Fast and efficient | Handles large datasets with ease           |
| Flexible           | Works with different formats (CSV, Excel)  |
| Rich functionality | Grouping, filtering, joining, pivoting     |
| Easy integration   | With NumPy, Matplotlib, Scikit-learn, etc. |


# 📦 Two Main Data Structures in Pandas

1. Series: A single column (like a 1D array with labels)

2. DataFrame: A 2D table of rows and columns (like a spreadsheet)

In [7]:
# Creating a Series
s = pd.Series([10, 20, 30])
print(s)

0    10
1    20
2    30
dtype: int64


In [8]:
# Creating a DataFrame

data = {
  'Name': ['Alice', 'Bob', 'Charlie'],
  'Age': [25, 30, 35]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age
0,Alice,25
1,Bob,30
2,Charlie,35


In [12]:
# small exercise

s = pd.Series([10, 20, 30, 40])
print(s)
print(type(s)) # pandas.core.series.Series

0    10
1    20
2    30
3    40
dtype: int64
<class 'pandas.core.series.Series'>


In [15]:
df = pd.DataFrame(data={
  'Item': ["Apple", "Banana", "Orange"],
  'Price': [250, 40, 300]
})

print(df)
print(type(df)) # pandas.core.frame.DataFrame

     Item  Price
0   Apple    250
1  Banana     40
2  Orange    300
<class 'pandas.core.frame.DataFrame'>


# 🧩 Topic 2: Pandas Series – One-Dimensional Data

A Series is like a 1D array or list in Python, but it has labels (index).

## Creating a Series

In [16]:
# 1. From a list:

data = [10, 20, 30, 40]
s = pd.Series(data)
print(s)

0    10
1    20
2    30
3    40
dtype: int64


In [19]:
# 2. From a list with a custom index

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s

a    10
b    20
c    30
dtype: int64

In [20]:
# 3. From a dictionary
d = {'apple': 100, 'banana': 200, 'cherry': 300}
s = pd.Series(d)
s


apple     100
banana    200
cherry    300
dtype: int64

In [22]:
# 4. From a NumPy array

import numpy as np

arr = np.array([1,2,3,4,5])
s = pd.Series(arr)
s

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [27]:
s = pd.Series(arr, index=['a', 'b', 'c', 'd', 'e']) # index length must match the number of values
s

a    1
b    2
c    3
d    4
e    5
dtype: int64

# 🔍 Accessing Data in a Series

In [31]:
# By label (index)

s = pd.Series([100, 200, 300], index=['x', 'y', 'z'])
if int(s['y']) == 200:
  print("hello")

hello


In [36]:
# By position
s.iloc[1]

np.int64(200)
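
To make the difference between label-based and position-based access explicit, here is a minimal sketch using `.loc` and `.iloc` on the same Series. The variable names (`by_label`, `pos_slice`, and so on) are illustrative only and are not part of the original notebook.

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=['x', 'y', 'z'])

by_label = s.loc['y']          # label-based lookup
by_position = s.iloc[1]        # position-based lookup (positions start at 0)
label_slice = s.loc['x':'y']   # label slices include the end label
pos_slice = s.iloc[0:1]        # position slices exclude the end position

print(by_label, by_position)
print(label_slice.tolist(), pos_slice.tolist())
```

Attribute access such as `s.y` also looks up a label, not a position, and it only works when the label is a valid Python identifier that does not clash with an existing Series attribute.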

# 🛠️ Series Attributes and Methods

| Method / Attribute | Description                     |
| ------------------ | ------------------------------- |
| `s.index`          | Get index labels                |
| `s.values`         | Get all values as a NumPy array |
| `s.dtype`          | Data type of the values         |
| `s.size`           | Number of elements              |
| `s.head(n)`        | First `n` elements              |
| `s.tail(n)`        | Last `n` elements               |
| `s.describe()`     | Summary statistics (if numeric) |


In [37]:
s.index

Index(['x', 'y', 'z'], dtype='object')

In [49]:
print(s.values)
for i in list(s.values):
  print(i)

[100 200 300]
100
200
300


In [39]:
s.value_counts()

100    1
200    1
300    1
Name: count, dtype: int64

In [40]:
s.size

3

In [42]:
s.head(2)

x    100
y    200
dtype: int64

In [43]:
s.tail(1)

z    300
dtype: int64

In [44]:
s.describe()

count      3.0
mean     200.0
std      100.0
min      100.0
25%      150.0
50%      200.0
75%      250.0
max      300.0
dtype: float64

# 🔢 Vectorized Operations

  Pandas applies arithmetic to every element of a Series at once, with no explicit loop.

In [51]:
s = pd.Series([1, 2, 3])
print(s + 10)
print(s * 2)

0    11
1    12
2    13
dtype: int64
0    2
1    4
2    6
dtype: int64
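
Element-wise operations also work between two Series, in which case pandas lines the values up by index label before computing. The sketch below assumes two hypothetical Series, `prices` and `quantities`, that share only some labels.

```python
import pandas as pd

prices = pd.Series([250, 40, 300], index=['Apple', 'Banana', 'Orange'])
quantities = pd.Series([2, 5], index=['Banana', 'Apple'])

# Values are matched by label, not by position.
# 'Orange' has no quantity, so its result is NaN.
totals = prices * quantities

# Comparisons are element-wise too and produce a boolean mask.
expensive = prices[prices > 100]

print(totals)
print(expensive)
```

Because a label missing on either side yields NaN, it is worth checking the result with `isnull()` when the two indexes may differ.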


In [52]:
# Exercise

s = pd.Series([70, 90, 89, 88, 99], index=['Raj', 'Baj', 'Moj', 'Boj', 'Tej'])

In [53]:
s

Raj    70
Baj    90
Moj    89
Boj    88
Tej    99
dtype: int64

In [54]:
s.Tej

np.int64(99)

In [55]:
s.iloc[2] # position-based; plain s[2] on a label index is deprecated

np.int64(89)

In [61]:
s.describe()

count     5.000000
mean     87.200000
std      10.568822
min      70.000000
25%      88.000000
50%      89.000000
75%      90.000000
max      99.000000
dtype: float64

In [62]:
s

Raj    70
Baj    90
Moj    89
Boj    88
Tej    99
dtype: int64

In [63]:
s.mean()

np.float64(87.2)

# 🧾 Topic 3: DataFrame – Two-Dimensional Data

A DataFrame is a table-like structure with rows and columns (just like an Excel sheet).

It is one of the most powerful data structures in Pandas.

## ✅ Creating a DataFrame

In [64]:
# 1. From a dictionary of lists:

data = {
  'Name': ['Alice', 'Bob', 'Charlie'],
  'Age': [25, 30, 35]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age
0,Alice,25
1,Bob,30
2,Charlie,35


In [68]:
# 2. From a list of dictionaries:
data = [
  {
    'Name': 'Alice',
    'Age': 25
  },{
    'Name': 'Bob',
    'Age': 30
  },{
    'Name': 'Charlie',
    'Age': 35
  },

]
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age
0,Alice,25
1,Bob,30
2,Charlie,35


In [69]:
# 3. From a 2D list with column names:

data = [[1, 'Apple'], [2, 'Banana'], [3, 'Cherry']]
df = pd.DataFrame(data, columns=['ID', 'Fruit'])
df

Unnamed: 0,ID,Fruit
0,1,Apple
1,2,Banana
2,3,Cherry


# 📌 Basic Information and Exploration

| Method          | Description                            |
| --------------- | -------------------------------------- |
| `df.head(n)`    | First `n` rows (default 5)             |
| `df.tail(n)`    | Last `n` rows                          |
| `df.shape`      | (rows, columns)                        |
| `df.columns`    | Column names                           |
| `df.index`      | Row index                              |
| `df.info()`     | Summary of DataFrame                   |
| `df.describe()` | Summary statistics (numerical columns) |
| `df.dtypes`     | Data types of each column              |


In [71]:
df.head(1)

Unnamed: 0,ID,Fruit
0,1,Apple


In [72]:
df.tail()

Unnamed: 0,ID,Fruit
0,1,Apple
1,2,Banana
2,3,Cherry


In [73]:
df.columns

Index(['ID', 'Fruit'], dtype='object')

In [77]:
for i in df.index:
  print(i)

0
1
2


In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      3 non-null      int64 
 1   Fruit   3 non-null      object
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes


In [79]:
df.describe()

Unnamed: 0,ID
count,3.0
mean,2.0
std,1.0
min,1.0
25%,1.5
50%,2.0
75%,2.5
max,3.0


In [80]:
df

Unnamed: 0,ID,Fruit
0,1,Apple
1,2,Banana
2,3,Cherry


In [81]:
df.dtypes

ID        int64
Fruit    object
dtype: object

In [82]:
df.shape

(3, 2)

In [83]:
df

Unnamed: 0,ID,Fruit
0,1,Apple
1,2,Banana
2,3,Cherry


In [86]:
df["ID"]

0    1
1    2
2    3
Name: ID, dtype: int64

In [89]:
df[["ID", "Fruit"]]

Unnamed: 0,ID,Fruit
0,1,Apple
1,2,Banana
2,3,Cherry


# 🔍 Accessing Rows

In [None]:
# Using loc[] (label-based):
df.loc[0] # Row with index label 0

ID           1
Fruit    Apple
Name: 0, dtype: object

In [94]:
# Using iloc[] (position-based)
df.iloc[0] # First row by position

ID           1
Fruit    Apple
Name: 0, dtype: object

In [95]:
df

Unnamed: 0,ID,Fruit
0,1,Apple
1,2,Banana
2,3,Cherry


In [100]:
df['ID'] = [20, 30, 40]

In [101]:
df

Unnamed: 0,ID,Fruit
0,20,Apple
1,30,Banana
2,40,Cherry


In [102]:
df.drop(columns=['Fruit'], inplace=True)

In [103]:
df

Unnamed: 0,ID
0,20
1,30
2,40


In [104]:
df.shape

(3, 1)

In [105]:
# Exercise

df = pd.DataFrame(data= {
  'Name': ['Su', 'Mu', "Ds", 'Py'],
  'Department': ['HR', 'Dev', 'Hu', 'Op'],
  'Salary': [3000, 44555, 60000, 7000]
})

In [106]:
df

Unnamed: 0,Name,Department,Salary
0,Su,HR,3000
1,Mu,Dev,44555
2,Ds,Hu,60000
3,Py,Op,7000


In [109]:
df.iloc[0:2,:]

Unnamed: 0,Name,Department,Salary
0,Su,HR,3000
1,Mu,Dev,44555


In [111]:
df.iloc[:,1]

0     HR
1    Dev
2     Hu
3     Op
Name: Department, dtype: object

In [112]:
df.columns

Index(['Name', 'Department', 'Salary'], dtype='object')

In [113]:
df.shape

(4, 3)

In [114]:
df

Unnamed: 0,Name,Department,Salary
0,Su,HR,3000
1,Mu,Dev,44555
2,Ds,Hu,60000
3,Py,Op,7000


In [115]:
df.Salary = df.Salary + 30000

In [116]:
df

Unnamed: 0,Name,Department,Salary
0,Su,HR,33000
1,Mu,Dev,74555
2,Ds,Hu,90000
3,Py,Op,37000


In [117]:
df.drop(columns=['Department'])

Unnamed: 0,Name,Salary
0,Su,33000
1,Mu,74555
2,Ds,90000
3,Py,37000


In [118]:
df

Unnamed: 0,Name,Department,Salary
0,Su,HR,33000
1,Mu,Dev,74555
2,Ds,Hu,90000
3,Py,Op,37000


# 🧭 Topic 4: Data Selection & Indexing

This topic covers how to access, filter, and slice data from a DataFrame.



# 1. Accessing Columns

In [119]:
df

Unnamed: 0,Name,Department,Salary
0,Su,HR,33000
1,Mu,Dev,74555
2,Ds,Hu,90000
3,Py,Op,37000


In [121]:
# ✅ As a Series:
df['Name']

0    Su
1    Mu
2    Ds
3    Py
Name: Name, dtype: object

In [124]:
# ✅ As a DataFrame:

df[['Name']] # Single column as DataFrame

Unnamed: 0,Name
0,Su
1,Mu
2,Ds
3,Py


In [126]:
df[['Name', 'Salary']] # Multiple columns

Unnamed: 0,Name,Salary
0,Su,33000
1,Mu,74555
2,Ds,90000
3,Py,37000


In [127]:
df

Unnamed: 0,Name,Department,Salary
0,Su,HR,33000
1,Mu,Dev,74555
2,Ds,Hu,90000
3,Py,Op,37000


# 2. Accessing Rows

In [134]:
# Using loc[] – label-based:
df.loc[0] # Row with index label 0

Name             Su
Department       HR
Salary        33000
Name: 0, dtype: object

In [None]:
# Using loc[] – label-based:
df.loc[0:2] # Rows from label 0 to 2 (inclusive)

Unnamed: 0,Name,Department,Salary
0,Su,HR,33000
1,Mu,Dev,74555
2,Ds,Hu,90000


In [None]:
# Using iloc[] – position-based:
df.iloc[0] # First row

Name             Su
Department       HR
Salary        33000
Name: 0, dtype: object

In [140]:
df.iloc[0:3] # First 3 rows (end position 3 is excluded)

Unnamed: 0,Name,Department,Salary
0,Su,HR,33000
1,Mu,Dev,74555
2,Ds,Hu,90000
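
With the default 0, 1, 2, ... index, `loc` and `iloc` often return the same rows, which can hide how they differ. The sketch below uses a hypothetical index (10, 20, 30, 40) chosen so that labels and positions no longer coincide.

```python
import pandas as pd

df = pd.DataFrame(
    {'Department': ['HR', 'Dev', 'Hu', 'Op'],
     'Salary': [33000, 74555, 90000, 37000]},
    index=[10, 20, 30, 40],   # hypothetical labels that differ from positions
)

by_label = df.loc[10:30]     # labels 10 through 30, end label included
by_position = df.iloc[0:3]   # positions 0, 1, 2; end position excluded
first_row = df.iloc[0]       # position 0 always exists in a non-empty frame

# df.loc[0] would raise KeyError here, because no row is labeled 0.
print(by_label.index.tolist(), by_position.index.tolist())
```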


# 🎯 3. Accessing a Cell (Row + Column)

In [141]:
df

Unnamed: 0,Name,Department,Salary
0,Su,HR,33000
1,Mu,Dev,74555
2,Ds,Hu,90000
3,Py,Op,37000


In [144]:
df.loc[1, 'Name'] # Row with index label 1 and column 'Name'

'Mu'

In [145]:
df.iloc[1, 0] # Second row, first column

'Mu'

# 🔍 4. Conditional Filtering

In [168]:
# Example: filter employees with salary > 4000 who are not in HR

df[(df.Salary > 4000) & ~(df.Department == 'HR')] # Parentheses are required around each condition. Operators: & (and), | (or), ~ (not)

Unnamed: 0,Name,Department,Salary
1,Mu,Dev,74555
2,Ds,Hu,90000
3,Py,Op,37000
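
A few other common ways to write filters are sketched below: `|` for "or", `isin()` for membership tests, and `query()` for a string expression. The data mirrors the employee table above, and the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Su', 'Mu', 'Ds', 'Py'],
    'Department': ['HR', 'Dev', 'Hu', 'Op'],
    'Salary': [33000, 74555, 90000, 37000],
})

# OR: salary above 60000, or in the HR department
high_or_hr = df[(df['Salary'] > 60000) | (df['Department'] == 'HR')]

# Membership: department is one of several values
in_depts = df[df['Department'].isin(['Dev', 'Op'])]

# query(): the same kind of condition written as a string
mid_range = df.query('Salary >= 35000 and Salary < 80000')

print(high_or_hr['Name'].tolist())
print(in_depts['Name'].tolist())
print(mid_range['Name'].tolist())
```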


In [158]:
df

Unnamed: 0,Name,Department,Salary
0,Su,HR,33000
1,Mu,Dev,74555
2,Ds,Hu,90000
3,Py,Op,37000


In [170]:
df = df.set_index('Name')

In [171]:
df

Unnamed: 0_level_0,Department,Salary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Su,HR,33000
Mu,Dev,74555
Ds,Hu,90000
Py,Op,37000


In [172]:
df = df.reset_index()

In [173]:
df

Unnamed: 0,Name,Department,Salary
0,Su,HR,33000
1,Mu,Dev,74555
2,Ds,Hu,90000
3,Py,Op,37000


# 💡 Pro Tip – at[] and iat[] for fast single-cell access

- `at[row_label, column_name]` → label-based

- `iat[row_pos, col_pos]` → position-based
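
Both accessors can also write a single cell. A minimal sketch, using a small hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Su', 'Mu'],
    'Salary': [33000, 74555],
})

df.at[0, 'Salary'] = 35000   # set by row label and column name
df.iat[1, 1] = 75000         # set by row position and column position

print(df)
```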

In [181]:
df

Unnamed: 0,Name,Department,Salary
0,Su,HR,33000
1,Mu,Dev,74555
2,Ds,Hu,90000
3,Py,Op,37000


In [183]:
df.at[0, 'Department']

'HR'

In [184]:
df.loc[0,'Department']

'HR'

In [186]:
df.iat[0, 1]

'HR'

In [187]:
df

Unnamed: 0,Name,Department,Salary
0,Su,HR,33000
1,Mu,Dev,74555
2,Ds,Hu,90000
3,Py,Op,37000


In [188]:
df.iat[1,2]

np.int64(74555)

In [189]:
df[df.Department == 'HR']

Unnamed: 0,Name,Department,Salary
0,Su,HR,33000


In [190]:
bonus = [345, 66767, 677, 788]
df_new = pd.DataFrame(bonus, columns=["Bonus"])

In [191]:
df_new

Unnamed: 0,Bonus
0,345
1,66767
2,677
3,788


In [196]:
df = pd.concat([df, df_new], copy=False, axis=1)

In [197]:
df

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,HR,33000,345
1,Mu,Dev,74555,66767
2,Ds,Hu,90000,677
3,Py,Op,37000,788


In [198]:
df[(df.Bonus >= 500) & (df.Salary < 60000)]

Unnamed: 0,Name,Department,Salary,Bonus
3,Py,Op,37000,788


In [199]:
df = df.set_index('Name')

In [200]:
df

Unnamed: 0_level_0,Department,Salary,Bonus
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Su,HR,33000,345
Mu,Dev,74555,66767
Ds,Hu,90000,677
Py,Op,37000,788


In [201]:
df.loc['Su']

Department       HR
Salary        33000
Bonus           345
Name: Su, dtype: object

In [202]:
df = df.reset_index()

In [203]:
df

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,HR,33000,345
1,Mu,Dev,74555,66767
2,Ds,Hu,90000,677
3,Py,Op,37000,788


In [204]:
df.isnull()

Unnamed: 0,Name,Department,Salary,Bonus
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False


In [208]:
df_new = pd.DataFrame(data=[{
  'Name': 'Nagesh'
}])

In [209]:
df_new

Unnamed: 0,Name
0,Nagesh


In [210]:
df = pd.concat([df, df_new])

In [216]:
df

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,HR,33000.0,345.0
1,Mu,Dev,74555.0,66767.0
2,Ds,Hu,90000.0,677.0
3,Py,Op,37000.0,788.0
0,Nagesh,,,
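
In the output above, the appended row keeps its original index label 0, and the numeric columns become floats because NaN cannot be stored in an integer column. A minimal sketch of both effects, assuming a smaller two-column frame, with `ignore_index=True` as one way to get fresh labels:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Su', 'Mu'], 'Salary': [33000, 74555]})
df_new = pd.DataFrame([{'Name': 'Nagesh'}])

stacked = pd.concat([df, df_new])                         # keeps labels 0, 1, 0
renumbered = pd.concat([df, df_new], ignore_index=True)   # new labels 0, 1, 2

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(renumbered['Salary'].dtype)   # float64, since the new row has no Salary
```

Duplicate labels are legal, but `df.loc[0]` then returns more than one row, which is easy to miss.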


# 1. Detecting Missing Data
Use .isnull() and .notnull():

In [217]:
df.isnull() # Shows True for missing values

Unnamed: 0,Name,Department,Salary,Bonus
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
0,False,True,True,True


In [218]:
df.isnull().sum() # Total missing values per column

Name          0
Department    1
Salary        1
Bonus         1
dtype: int64

In [219]:
df

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,HR,33000.0,345.0
1,Mu,Dev,74555.0,66767.0
2,Ds,Hu,90000.0,677.0
3,Py,Op,37000.0,788.0
0,Nagesh,,,


# 2. Dropping Missing Values

In [None]:
df.dropna(inplace=True) # Drop rows with ANY missing value
# df.dropna(axis=1)        # Drop columns with ANY missing value

In [223]:
df

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,HR,33000.0,345.0
1,Mu,Dev,74555.0,66767.0
2,Ds,Hu,90000.0,677.0
3,Py,Op,37000.0,788.0


In [224]:
df = pd.concat([df, df_new])

In [226]:
df

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,HR,33000.0,345.0
1,Mu,Dev,74555.0,66767.0
2,Ds,Hu,90000.0,677.0
3,Py,Op,37000.0,788.0
0,Nagesh,,,


In [227]:
df.dropna(how='all') # Drop only if all values are missing

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,HR,33000.0,345.0
1,Mu,Dev,74555.0,66767.0
2,Ds,Hu,90000.0,677.0
3,Py,Op,37000.0,788.0
0,Nagesh,,,
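
`how='all'` leaves the Nagesh row in place because its Name is present. The sketch below, on a hypothetical three-row frame, compares the main `dropna` options, including a row where every value is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Su', 'Nagesh', None],
    'Salary': [33000.0, np.nan, np.nan],
})

any_missing = df.dropna()                 # drop rows with at least one missing value
all_missing = df.dropna(how='all')        # drop rows where every value is missing
need_name = df.dropna(subset=['Name'])    # only look at the Name column
enough = df.dropna(thresh=2)              # keep rows with at least 2 non-missing values

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(need_name.index.tolist())
print(enough.index.tolist())
```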


# 3. Filling Missing Values

In [None]:
df.fillna(0) # Replace all NaNs with 0

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,HR,33000.0,345.0
1,Mu,Dev,74555.0,66767.0
2,Ds,Hu,90000.0,677.0
3,Py,Op,37000.0,788.0
0,Nagesh,0,0.0,0.0


In [235]:
# Chained inplace call: triggers the warning below and may leave df unchanged
df['Salary'].fillna(df['Salary'].mean(), inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Salary'].fillna(df['Salary'].mean(), inplace=True)


In [240]:
df.fillna({'Salary': df['Salary'].mean()}, inplace=True) # Recommended: call fillna on the DataFrame with a {column: value} dict

In [242]:
df

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,HR,33000.0,345.0
1,Mu,Dev,74555.0,66767.0
2,Ds,Hu,90000.0,677.0
3,Py,Op,37000.0,788.0
0,Nagesh,,58638.75,
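
The two alternatives named in the warning text can be sketched as follows, on a small hypothetical frame. Both avoid calling an inplace method on a selected column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', None, 'Dev'],
    'Salary': [33000.0, np.nan, 90000.0],
})

# Option 1: compute the filled column and assign it back
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Option 2: pass a {column: value} dict to DataFrame.fillna
df = df.fillna({'Department': 'Unknown'})

print(df)
```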


In [243]:
df.duplicated()

0    False
1    False
2    False
3    False
0    False
dtype: bool

In [245]:
# 4. Replacing Values
df.replace('HR', 'Human Resources', inplace=True)
df.replace([10, 20], [100, 200])

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,Human Resources,33000.0,345.0
1,Mu,Dev,74555.0,66767.0
2,Ds,Hu,90000.0,677.0
3,Py,Op,37000.0,788.0
0,Nagesh,,58638.75,


In [246]:
df

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,Human Resources,33000.0,345.0
1,Mu,Dev,74555.0,66767.0
2,Ds,Hu,90000.0,677.0
3,Py,Op,37000.0,788.0
0,Nagesh,,58638.75,


In [249]:
df.replace([33000.00, 74555.00], [334545, 5445545])

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,Human Resources,334545.0,345.0
1,Mu,Dev,5445545.0,66767.0
2,Ds,Hu,90000.0,677.0
3,Py,Op,37000.0,788.0
0,Nagesh,,58638.75,


In [251]:
# 5. Removing Duplicates

df.duplicated()               # Check for duplicates
df.drop_duplicates(inplace=True)


In [252]:
df

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,Human Resources,33000.0,345.0
1,Mu,Dev,74555.0,66767.0
2,Ds,Hu,90000.0,677.0
3,Py,Op,37000.0,788.0
0,Nagesh,,58638.75,


In [253]:
df.duplicated()

0    False
1    False
2    False
3    False
0    False
dtype: bool
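
The frame above has no duplicate rows, so `drop_duplicates` changes nothing. A minimal sketch on hypothetical data that does contain repeats, also showing the `subset` and `keep` parameters:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Su', 'Mu', 'Su', 'Su'],
    'Department': ['HR', 'Dev', 'HR', 'Ops'],
})

flags = df.duplicated()               # True where a full row repeats an earlier one
unique_rows = df.drop_duplicates()    # removes the repeated (Su, HR) row

# Compare only the Name column and keep the last occurrence of each name
unique_names = df.drop_duplicates(subset=['Name'], keep='last')

print(flags.tolist())
print(unique_rows.index.tolist())
print(unique_names.index.tolist())
```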

In [254]:
df

Unnamed: 0,Name,Department,Salary,Bonus
0,Su,Human Resources,33000.0,345.0
1,Mu,Dev,74555.0,66767.0
2,Ds,Hu,90000.0,677.0
3,Py,Op,37000.0,788.0
0,Nagesh,,58638.75,


In [258]:
df.iloc[3]

Name               Py
Department         Op
Salary        37000.0
Bonus           788.0
Name: 3, dtype: object

In [259]:
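# Pitfall: df.iloc[3] is a Series, so concat adds it as a new column labeled 3 instead of a row.
# To append it as a row, pass a one-row DataFrame: df.iloc[[3]]
df = pd.concat([df, df.iloc[3]])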


In [260]:
df

Unnamed: 0,Name,Department,Salary,Bonus,3
0,Su,Human Resources,33000.0,345.0,
1,Mu,Dev,74555.0,66767.0,
2,Ds,Hu,90000.0,677.0,
3,Py,Op,37000.0,788.0,
0,Nagesh,,58638.75,,
Name,,,,,Py
Department,,,,,Op
Salary,,,,,37000.0
Bonus,,,,,788.0
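
The scrambled table above comes from concatenating a DataFrame with a Series. A minimal sketch of appending a copy of a row instead, assuming a small hypothetical frame: selecting with a list (`iloc[[1]]`) returns a one-row DataFrame, which stacks as expected.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Su', 'Py'], 'Salary': [33000, 37000]})

row_as_series = df.iloc[1]     # Series: column names become its index
row_as_frame = df.iloc[[1]]    # one-row DataFrame: columns are preserved

appended = pd.concat([df, row_as_frame], ignore_index=True)

print(type(row_as_series).__name__, type(row_as_frame).__name__)
print(appended)
print(appended.duplicated().tolist())
```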


In [261]:
df.dropna(axis=1)

0
1
2
3
0
Name
Department
Salary
Bonus


In [262]:
df

Unnamed: 0,Name,Department,Salary,Bonus,3
0,Su,Human Resources,33000.0,345.0,
1,Mu,Dev,74555.0,66767.0,
2,Ds,Hu,90000.0,677.0,
3,Py,Op,37000.0,788.0,
0,Nagesh,,58638.75,,
Name,,,,,Py
Department,,,,,Op
Salary,,,,,37000.0
Bonus,,,,,788.0


In [263]:
df.dropna(axis=1, inplace=True)

In [264]:
df

0
1
2
3
0
Name
Department
Salary
Bonus


In [273]:
# Pitfall: with inplace=True, drop() returns None, so this assignment sets df to None
df = df.drop([], inplace=True)

In [274]:
df
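
The empty output above is because `df` is now `None`. A minimal sketch of the two calling styles, on a hypothetical fruit table: either modify in place and ignore the return value, or keep the returned copy, but not both.

```python
import pandas as pd

df = pd.DataFrame({'ID': [20, 30, 40], 'Fruit': ['Apple', 'Banana', 'Cherry']})

# In-place style: df itself changes and the call returns None
result = df.drop(columns=['Fruit'], inplace=True)

# Copy style: df is left alone and a new DataFrame is returned
trimmed = df.drop(index=[0])

print(result)
print(df.shape, trimmed.shape)
```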

In [275]:
df = pd.DataFrame(data= {
  'Name': ['Su', 'Mu', "Ds", 'Py'],
  'Department': ['HR', 'Dev', 'Hu', 'Op'],
  'Salary': [3000, 44555, 60000, 7000]
})

In [276]:
df

Unnamed: 0,Name,Department,Salary
0,Su,HR,3000
1,Mu,Dev,44555
2,Ds,Hu,60000
3,Py,Op,7000


In [277]:
df['Name'] = df['Name'].str.lower()

In [278]:
df

Unnamed: 0,Name,Department,Salary
0,su,HR,3000
1,mu,Dev,44555
2,ds,Hu,60000
3,py,Op,7000


# ✂️ 8. String Cleaning (Object/Text Columns)

In [281]:
df['Name'] = df['Name'].str.strip()          # Remove leading/trailing whitespace
df['Name'] = df['Name'].str.upper()          # Convert to uppercase
df['Name'] = df['Name'].str.replace(' ', '_') # Replace spaces with underscores


In [282]:
df

Unnamed: 0,Name,Department,Salary
0,SU,HR,3000
1,MU,Dev,44555
2,DS,Hu,60000
3,PY,Op,7000
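
The `.str` methods can be chained, and they pass missing values through rather than raising. A minimal sketch on a hypothetical Series with untidy names and one missing entry:

```python
import pandas as pd

names = pd.Series([' Alice Smith ', 'BOB', None])

cleaned = (
    names.str.strip()                          # trim leading/trailing whitespace
         .str.lower()                          # normalize case
         .str.replace(' ', '_', regex=False)   # literal replacement, not a regex
)

print(cleaned.tolist())
```

Passing `regex=False` makes it explicit that the pattern is plain text, which matters once the pattern contains characters such as `.` or `+`.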


# 🧪 Practice Tasks
Use the sample DataFrame created in the first cell below, then try:

1. Show how many missing values there are.

2. Fill missing Salary with the mean.

3. Drop rows with a missing Name.

4. Strip and lowercase the Name column.

5. Remove duplicates (if any).

In [306]:
data = {
    'Name': [' Alice ', 'Bob', 'Charlie', None],
    'Department': ['HR', 'IT', None, 'HR'],
    'Salary': [50000, None, 70000, 50000]
}
df = pd.DataFrame(data)


In [307]:
df

Unnamed: 0,Name,Department,Salary
0,Alice,HR,50000.0
1,Bob,IT,
2,Charlie,,70000.0
3,,HR,50000.0


In [308]:
df.isnull()

Unnamed: 0,Name,Department,Salary
0,False,False,False
1,False,False,True
2,False,True,False
3,True,False,False


In [309]:
df.isnull().sum()

Name          1
Department    1
Salary        1
dtype: int64

In [310]:
df.fillna({'Salary': df['Salary'].mean()}, inplace=True)

In [311]:
df

Unnamed: 0,Name,Department,Salary
0,Alice,HR,50000.0
1,Bob,IT,56666.666667
2,Charlie,,70000.0
3,,HR,50000.0


In [312]:
df.isnull()

Unnamed: 0,Name,Department,Salary
0,False,False,False
1,False,False,False
2,False,True,False
3,True,False,False


In [316]:
df['Name'] = df['Name'].str.strip()
df['Name'] = df['Name'].str.lower()

In [317]:
df

Unnamed: 0,Name,Department,Salary
0,alice,HR,50000.0
1,bob,IT,56666.666667
2,charlie,,70000.0
3,,HR,50000.0


In [318]:
df.duplicated()

0    False
1    False
2    False
3    False
dtype: bool

In [320]:
df.drop_duplicates(inplace=True)

In [321]:
df

Unnamed: 0,Name,Department,Salary
0,alice,HR,50000.0
1,bob,IT,56666.666667
2,charlie,,70000.0
3,,HR,50000.0


In [322]:
df.dropna(subset=['Name'], inplace=True)

In [323]:
df

Unnamed: 0,Name,Department,Salary
0,alice,HR,50000.0
1,bob,IT,56666.666667
2,charlie,,70000.0
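
The practice steps above can be collected into one function. This is only a sketch: `clean_employees` is a hypothetical helper name, and the order of steps is one reasonable choice, not the only one.

```python
import pandas as pd

def clean_employees(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    out = raw.copy()
    # Fill missing salaries with the column mean
    out['Salary'] = out['Salary'].fillna(out['Salary'].mean())
    # Normalize names; missing names stay missing
    out['Name'] = out['Name'].str.strip().str.lower()
    # Drop rows without a name, then remove exact duplicate rows
    out = out.dropna(subset=['Name'])
    out = out.drop_duplicates().reset_index(drop=True)
    return out

data = {
    'Name': [' Alice ', 'Bob', 'Charlie', None],
    'Department': ['HR', 'IT', None, 'HR'],
    'Salary': [50000, None, 70000, 50000],
}
raw = pd.DataFrame(data)
cleaned = clean_employees(raw)
print(cleaned)
```

Working on a copy and returning the result sidesteps the inplace pitfalls seen earlier and keeps the raw data available for comparison.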


# New Set of Problems

### Task: Create a DataFrame containing names of students, their ages, and scores in math.

In [1]:
import pandas as pd

In [4]:
data = {
  'Name': ['Alice', 'Bob', 'Charlie', 'David'],
  'Age': [20, 21, 19, 22],
  'Math_Score': [85, 90, 78, 92]
}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Math_Score
0,Alice,20,85
1,Bob,21,90
2,Charlie,19,78
3,David,22,92


In [5]:
type(df)


pandas.core.frame.DataFrame

In [6]:
df.shape

(4, 3)

In [9]:
list(df.columns)

['Name', 'Age', 'Math_Score']

In [12]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [13]:
df["Name"]

0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object

In [18]:
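# df[0]          # raises KeyError: df[...] looks up a column label, and no column is named 0
# df[0, "Name"]  # raises KeyError: the pair is treated as a single column label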

In [24]:
# df.loc[0, "name"] raises KeyError because column names are case-sensitive

df.loc[0, "Name"]


'Alice'

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        4 non-null      object
 1   Age         4 non-null      int64 
 2   Math_Score  4 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 228.0+ bytes


In [30]:
df.head(2)

Unnamed: 0,Name,Age,Math_Score
0,Alice,20,85
1,Bob,21,90


In [31]:
df.columns

Index(['Name', 'Age', 'Math_Score'], dtype='object')

In [33]:
print(df.describe())

             Age  Math_Score
count   4.000000    4.000000
mean   20.500000   86.250000
std     1.290994    6.238322
min    19.000000   78.000000
25%    19.750000   83.250000
50%    20.500000   87.500000
75%    21.250000   90.500000
max    22.000000   92.000000


In [34]:
df.describe()

Unnamed: 0,Age,Math_Score
count,4.0,4.0
mean,20.5,86.25
std,1.290994,6.238322
min,19.0,78.0
25%,19.75,83.25
50%,20.5,87.5
75%,21.25,90.5
max,22.0,92.0


In [36]:
df["Name"] # Access single column

0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object

In [38]:
df[["Name", "Age"]]

Unnamed: 0,Name,Age
0,Alice,20
1,Bob,21
2,Charlie,19
3,David,22


In [39]:
df

Unnamed: 0,Name,Age,Math_Score
0,Alice,20,85
1,Bob,21,90
2,Charlie,19,78
3,David,22,92


In [42]:
# df.loc[0, 0] # raises KeyError: loc is label-based and no column is labeled 0
df.loc[0, 'Name']

'Alice'

In [45]:
df[["Age"]]

Unnamed: 0,Age
0,20
1,21
2,19
3,22


In [None]:
df.iloc[0:2, 0:2] # iloc is position-based: first two rows, first two columns

Unnamed: 0,Name,Age
0,Alice,20
1,Bob,21


In [50]:
df.loc[0:1, ["Name", "Age"]] # loc slices include both endpoints (labels 0 and 1)

Unnamed: 0,Name,Age
0,Alice,20
1,Bob,21


In [52]:
df.iloc[0] # only first row

Name          Alice
Age              20
Math_Score       85
Name: 0, dtype: object

In [56]:
df.iloc[0:2]

Unnamed: 0,Name,Age,Math_Score
0,Alice,20,85
1,Bob,21,90


In [55]:
print(df.iloc[0])      # First row
print(df.iloc[1:3])    # Rows 1 and 2


Name          Alice
Age              20
Math_Score       85
Name: 0, dtype: object
      Name  Age  Math_Score
1      Bob   21          90
2  Charlie   19          78
