
In [1]:
!pip install pandas



In [2]:
# import micropip
# await micropip.install("pandas")  # run only in a Pyodide-based environment such as JupyterLite

**Pandas**

 Pandas is an open-source Python library designed for fast, flexible data manipulation and analysis. It
 offers two core data structures—Series and DataFrame—that allow for intuitive manipulation of
 data.

Key Features:

Efficient Data Structures:

 Provides Series (one-dimensional) and DataFrame (two-dimensional)
 objects.

Data Input/Output:

 Supports a variety of file formats including CSV, Excel, JSON, and SQL.

Data Cleaning:

 Offers tools for handling missing data, duplicate data, and inconsistent formats.

Data Analysis:

 Includes functions for grouping, merging, and reshaping data, making it ideal for exploratory data analysis (EDA).

Time Series:

  Provides specialized features for time series data analysis.

In [4]:
import pandas as pd   # Importing pandas library

**Creating a Series from a List:**

data → A Python list.

pd.Series(data) → Converts list into a Pandas Series.

Index → Labels for each value. The default is 0, 1, 2, ...

Output → Two columns: index on left, values on right.

Creating a Series from a Python list

In [5]:
data = [10, 20, 30, 40, 50]  # a plain Python list

'pd.Series()' converts the list into a Series object
 A Series is like a 1D array, but with labels (index).
 By default, the index starts from 0, 1, 2, ...

In [6]:
#Creating a Pandas Series from the list
series = pd.Series(data)

In [7]:
print("Default Index Series:")
print(series)

Default Index Series:
0    10
1    20
2    30
3    40
4    50
dtype: int64


**Creating a Series with Custom Index Labels**

index=['a','b','c','d','e'] → assigns custom names to each position.

Instead of 0,1,2..., now elements are labeled a,b,c,d,e.

In [8]:
# Creating a Series with custom labels for the index
series_custom = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])

In [9]:
print("\nSeries with custom indices:")
print(series_custom)


Series with custom indices:
a    10
b    20
c    30
d    40
e    50
dtype: int64
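
Once a Series has custom labels, values can be read back by label or by position. The snippet below is a minimal sketch that rebuilds the same Series; the variable names `value_b`, `slice_b_to_d`, and `first_value` are illustrative and not part of the original notebook.

```python
import pandas as pd

series_custom = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Look up a single value by its label
value_b = series_custom['b']

# Slicing by label includes both endpoints ('b', 'c', and 'd')
slice_b_to_d = series_custom['b':'d']

# Position-based access goes through .iloc
first_value = series_custom.iloc[0]

print(value_b)
print(slice_b_to_d.tolist())
print(first_value)
```

Using `.iloc` for positions keeps label lookups and position lookups clearly separate.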


**Creating a Series from a Dictionary**

Dictionary → has key: value pairs.

In Pandas Series:

Keys → become index labels (x, y, z).

Values → become data (100, 200, 300).

In [12]:
# A Python dictionary: keys act as labels, values as data
data_dict = {'x': 100, 'y': 200, 'z': 300}

In [13]:
# Creating a Series directly from the dictionary
series_from_dict = pd.Series(data_dict)

In [14]:
print("\nSeries from a dictionary:")
print(series_from_dict)


Series from a dictionary:
x    100
y    200
z    300
dtype: int64


**Mathematical Operations on a Series**

Pandas allows vectorized operations (no need for loops).

series * 2 → multiplies every element by 2.

The result is a new Series with the transformed values; the original Series is unchanged.

In [15]:
# Multiplying every element in the Series by 2
series_multiplied = series * 2

In [16]:
print("\nSeries after multiplying by 2:")
print(series_multiplied)


Series after multiplying by 2:
0     20
1     40
2     60
3     80
4    100
dtype: int64
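
Vectorized operations also work between two Series, in which case pandas matches elements by index label rather than by position. The following sketch uses made-up `prices` and `quantities` Series (assumptions for illustration) to show that alignment.

```python
import pandas as pd

prices = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
quantities = pd.Series([1, 2, 3], index=['c', 'b', 'a'])

# Element-wise arithmetic aligns on index labels, not on position
totals = prices * quantities
print(totals)

# A label that is missing from one side produces NaN in the result
partial = prices + pd.Series([100], index=['a'])
print(partial)
```

Here `totals['a']` is 10 * 3, because label `a` is paired with label `a` even though the two Series list their labels in a different order.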


**Pandas DataFrame**:

A DataFrame is like an Excel table or SQL table:

Rows → represent records/observations.

Columns → represent fields/attributes.

It’s the core data structure in Pandas for 2D data.

Dictionary keys ('Name','Age','City') → become columns.

Lists inside dictionary → become column values.

Output: a neat table with rows (0–3) and 3 columns.

In [18]:
# A Python dictionary: each key is a column name, and values are lists of column entries
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

In [19]:
# Convert dictionary to DataFrame
df = pd.DataFrame(data)

In [20]:
print("DataFrame from dictionary:")
print(df)

DataFrame from dictionary:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston


**Creating a DataFrame from a List of Dictionaries**

Each dictionary = one row.

Keys inside dict = column names.

Useful when loading JSON-like structured data.

In [21]:
# A list of dictionaries: each dictionary represents one row
data_list = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'},
    {'Name': 'David', 'Age': 40, 'City': 'Houston'}
]

In [22]:
# Convert list of dictionaries to DataFrame
df_list = pd.DataFrame(data_list)


In [23]:
print("\nDataFrame from list of dictionaries:")
print(df_list)


DataFrame from list of dictionaries:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston


**Adding a New Column**:

df['Salary'] → creates a new column.

The list values [70000, 80000, 90000, 85000] are assigned row by row.

Now the DataFrame has 4 columns: Name, Age, City, Salary.

In [24]:
# Adding a new column called 'Salary' to the DataFrame
df['Salary'] = [70000, 80000, 90000, 85000]


In [25]:
print("\nDataFrame after adding Salary column:")
print(df)


DataFrame after adding Salary column:
      Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago   90000
3    David   40      Houston   85000
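
A new column does not have to come from a hand-written list; it can also be computed from existing columns. This is a small sketch under that assumption, using a reduced copy of the data; the column names `Salary_k` and `Bonus` are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [70000, 80000, 90000, 85000],
})

# Derive a column from an existing one with vectorized arithmetic
df_demo['Salary_k'] = df_demo['Salary'] / 1000

# .assign() returns a new DataFrame and leaves df_demo unchanged
df_with_bonus = df_demo.assign(Bonus=df_demo['Salary'] // 10)

print(df_with_bonus)
```

`df['col'] = ...` modifies the DataFrame in place, while `.assign()` is convenient when the original should stay untouched.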


**Pandas Data Input & Output**:

Pandas can read/write data from many sources: CSV, Excel, SQL databases, JSON, etc.
We’ll demonstrate with small dummy datasets.

**Reading from CSV Files**:

to_csv() → saves DataFrame as a CSV file.

index=False → prevents Pandas from writing row numbers into the file.

read_csv() → loads the CSV file back into a DataFrame.

In [26]:
# Step 1: Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

In [29]:
df = pd.DataFrame(data)

In [30]:
# Step 2: Save this DataFrame as a CSV file
df.to_csv("sample_data.csv", index=False)   # index=False avoids writing row numbers

In [31]:
# Step 3: Read the CSV file back into a DataFrame
df_csv = pd.read_csv("sample_data.csv")

In [32]:
print("Data read from CSV file:")
print(df_csv)

Data read from CSV file:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston
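
To see what `index=False` changes, it helps to look at the CSV text itself. The sketch below avoids touching the disk: calling `to_csv()` without a path returns the text as a string, and `io.StringIO` stands in for a file when reading it back. The two-row `df_demo` is an illustrative assumption.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With no path, to_csv() returns the CSV text instead of writing a file
csv_without_index = df_demo.to_csv(index=False)
csv_with_index = df_demo.to_csv()

print(csv_without_index)
print(csv_with_index)   # note the extra unnamed first column holding 0, 1

# StringIO wraps the text so read_csv can treat it like a file
df_back = pd.read_csv(io.StringIO(csv_without_index))
print(df_back)
```

If the index is written and then read back without further options, it shows up as an extra column, which is why `index=False` is used above.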


**Reading from Excel Files**:

to_excel() → writes DataFrame into an Excel file.

sheet_name → specifies which sheet to save/read.

read_excel() → loads Excel data.

In [33]:
# Save the DataFrame to an Excel file
df.to_excel("sample_data.xlsx", sheet_name="Sheet1", index=False)


In [34]:
# Read back the Excel file
df_excel = pd.read_excel("sample_data.xlsx", sheet_name="Sheet1")


In [35]:
print("\nData read from Excel file:")
print(df_excel)


Data read from Excel file:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston


**Reading from SQL Databases**:

We’ll use SQLite (a lightweight database) for demonstration.



sqlite3.connect() → connects to a SQLite database (creates the file if it does not exist).

to_sql() → saves DataFrame into a database table.

read_sql_query() → fetches SQL results into a DataFrame.

In [42]:
import sqlite3

In [37]:
# Step 1: Create a connection to SQLite (creates a local DB file)
conn = sqlite3.connect("sample_data.db")


In [38]:
# Step 2: Save DataFrame to a SQL table
df.to_sql("people", conn, if_exists="replace", index=False)

4

In [39]:
# Step 3: Query data back into Pandas
df_sql = pd.read_sql_query("SELECT * FROM people", conn)


In [40]:
# Close connection
conn.close()

In [41]:
print("\nData read from SQL database:")
print(df_sql)


Data read from SQL database:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston
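
The query passed to `read_sql_query()` can be any SELECT statement, including one with a filter. The following is a minimal sketch that uses an in-memory SQLite database (`":memory:"`) so that no file is left behind, and a `?` placeholder for the filter value; the table contents are a reduced copy of the data above.

```python
import sqlite3
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
})

# ":memory:" keeps the database in RAM, so nothing is written to disk
conn = sqlite3.connect(":memory:")
try:
    df_demo.to_sql("people", conn, if_exists="replace", index=False)
    # The ? placeholder is filled from params rather than pasted into the SQL text
    df_over_30 = pd.read_sql_query(
        "SELECT Name, Age FROM people WHERE Age > ? ORDER BY Age",
        conn,
        params=(30,),
    )
finally:
    conn.close()   # always release the connection, even if a step fails

print(df_over_30)
```

Filtering inside the query means only the matching rows are loaded into the DataFrame.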


**Selection & Indexing in Pandas**:

**1. Selecting Columns**:

df['Age'] → returns a Series of ages.

df[['Name','City']] → returns a DataFrame with only selected columns.

In [43]:
# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Salary': [70000, 80000, 90000, 85000]
}

In [44]:
df = pd.DataFrame(data)

In [45]:
# Selecting a single column
ages = df['Age']


In [46]:
print("Single column 'Age':")
print(ages)

Single column 'Age':
0    25
1    30
2    35
3    40
Name: Age, dtype: int64


In [47]:
# Selecting multiple columns
subset = df[['Name', 'City']]

In [48]:
print("\nSubset with 'Name' and 'City':")
print(subset)


Subset with 'Name' and 'City':
      Name         City
0    Alice     New York
1      Bob  Los Angeles
2  Charlie      Chicago
3    David      Houston


**Selecting Rows with .loc[] (Label-based Indexing)**:

.loc[] → selects rows/columns by labels (index names).

You can pass a single label ('Alice') or a label slice (1:3). Unlike ordinary Python slices, a .loc slice includes both endpoints.

You can also specify column names in the second argument.

In [63]:
# Set 'Name' as the index for demonstration
df_indexed = df.set_index('Name')

In [64]:
# Select row for 'Alice' by label
row_alice = df_indexed.loc['Alice']


In [65]:
print("\nRow for 'Alice' using .loc:")
print(row_alice)


Row for 'Alice' using .loc:
Age             25
City      New York
Salary       70000
Name: Alice, dtype: object


In [66]:
# Select multiple rows + specific columns by labels
sub_df = df.loc[1:3, ['Name', 'Age']]


In [67]:
print("\nRows 1 to 3 with 'Name' and 'Age':")
print(sub_df)


Rows 1 to 3 with 'Name' and 'Age':
      Name  Age
1      Bob   30
2  Charlie   35
3    David   40


**Selecting Rows with .iloc[] (Position-based Indexing):**

.iloc[] → selects rows/columns by integer positions.

iloc[0] → first row.

iloc[0:3, 0:2] → first 3 rows, first 2 columns (the end positions are excluded).

In [68]:
# Select first row by position
first_row = df.iloc[0]

In [58]:
print("\nFirst row using .iloc:")
print(first_row)


First row using .iloc:
Name         Alice
Age             25
City      New York
Salary       70000
Name: 0, dtype: object


In [69]:
# Select first 3 rows and first 2 columns
sub_df_iloc = df.iloc[0:3, 0:2]

In [70]:
print("\nFirst 3 rows & 2 columns using .iloc:")
print(sub_df_iloc)


First 3 rows & 2 columns using .iloc:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
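
The main practical difference between the two indexers is how slices treat the end point. The sketch below puts them side by side on a small DataFrame indexed by name; `df_demo` is an illustrative assumption.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Age': [25, 30, 35, 40]},
    index=['Alice', 'Bob', 'Charlie', 'David'],
)

# .loc slices by label and includes the end label
by_label = df_demo.loc['Bob':'David']

# .iloc slices by position and excludes the end position
by_position = df_demo.iloc[1:3]

print(by_label.index.tolist())
print(by_position.index.tolist())
```

The label slice returns three rows (Bob through David), while the position slice returns two (positions 1 and 2).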


**Conditional Selection (Boolean Indexing)**:

df['Age'] >= 30 → returns True/False for each row.

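Combine conditions with & (AND), | (OR), and ~ (NOT). Wrap each condition in parentheses, because these operators bind more tightly than comparisons.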

.isin() → checks if column values match a list.

In [71]:
# Filter rows where Age >= 30
df_age = df[df['Age'] >= 30]

In [72]:
print("\nRows where Age >= 30:")
print(df_age)


Rows where Age >= 30:
      Name  Age         City  Salary
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago   90000
3    David   40      Houston   85000


In [76]:
# Filter with multiple conditions: Age > 25 and City = 'Chicago'
df_multiple = df[(df['Age'] > 25) & (df['City'] == 'Chicago')]

In [77]:
print("\nRows where Age > 25 and City = 'Chicago':")
print(df_multiple)


Rows where Age > 25 and City = 'Chicago':
      Name  Age     City  Salary
2  Charlie   35  Chicago   90000


In [78]:
# Select rows where City is either New York or Chicago
cities = ['New York', 'Chicago']
df_cities = df[df['City'].isin(cities)]

In [79]:
print("\nRows where City is New York or Chicago:")
print(df_cities)


Rows where City is New York or Chicago:
      Name  Age      City  Salary
0    Alice   25  New York   70000
2  Charlie   35   Chicago   90000
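
The examples above cover AND and `.isin()`; OR and NOT follow the same pattern. This is a short sketch on a copy of the same data, with variable names chosen for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})

# OR: a row is kept if either condition is True
young_or_houston = df_demo[(df_demo['Age'] < 30) | (df_demo['City'] == 'Houston')]

# NOT: ~ inverts a boolean mask, here "City is not in the list"
not_in_cities = df_demo[~df_demo['City'].isin(['New York', 'Chicago'])]

print(young_or_houston)
print(not_in_cities)
```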


**Setting & Resetting Index**:

.set_index() → sets a column as the index.

.reset_index() → restores default integer index.

In [80]:
# Set 'Name' as index
df_indexed = df.set_index('Name')

In [62]:
print("\nDataFrame with 'Name' as index:")
print(df_indexed)


DataFrame with 'Name' as index:
         Age         City  Salary
Name                             
Alice     25     New York   70000
Bob       30  Los Angeles   80000
Charlie   35      Chicago   90000
David     40      Houston   85000


In [None]:
# Reset index back to default numbers
df_reset = df_indexed.reset_index()

In [None]:
print("\nDataFrame after resetting index:")
print(df_reset)
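
By default `.reset_index()` moves the old index back into a regular column. The sketch below, on a two-row DataFrame made up for illustration, also shows `drop=True`, which discards the old index instead.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df_indexed = df_demo.set_index('Name')

# Default: the old index comes back as a regular column
df_reset = df_indexed.reset_index()

# drop=True throws the old index away instead of restoring it
df_dropped = df_indexed.reset_index(drop=True)

print(df_reset.columns.tolist())
print(df_dropped.columns.tolist())
```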

**Operations on DataFrames**:

**1. Viewing Data: .head() and .tail()**

.head() → shows top rows (default = 5).

.tail() → shows bottom rows.

You can pass a number inside (e.g., head(2)).

In [81]:
print("\nFirst five rows (head):")
print(df.head())


First five rows (head):
      Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago   90000
3    David   40      Houston   85000


In [82]:
# Show last 5 rows
print("\nLast five rows (tail):")
print(df.tail())


Last five rows (tail):
      Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago   90000
3    David   40      Houston   85000


In [83]:
# Show first 2 rows
print("\nFirst two rows only:")
print(df.head(2))


First two rows only:
    Name  Age         City  Salary
0  Alice   25     New York   70000
1    Bob   30  Los Angeles   80000


**2. Analyzing Data: .unique() and .value_counts()**

.unique() → shows unique entries in a column.

.value_counts() → counts how many times each value occurs.

normalize=True → gives proportions (fractions that sum to 1) instead of counts.

In [84]:
# Unique values in a column
unique_cities = df['City'].unique()

In [85]:
print("\nUnique cities in DataFrame:")
print(unique_cities)


Unique cities in DataFrame:
['New York' 'Los Angeles' 'Chicago' 'Houston']


In [86]:
# Value counts for each city
city_counts = df['City'].value_counts()

In [87]:
print("\nCity frequency counts:")
print(city_counts)



City frequency counts:
City
New York       1
Los Angeles    1
Chicago        1
Houston        1
Name: count, dtype: int64


In [88]:
# Value counts normalized (proportions between 0 and 1; multiply by 100 for percentages)
city_normalized = df['City'].value_counts(normalize=True)

In [89]:
print("\nCity counts normalized (percentages):")
print(city_normalized)



City counts normalized (percentages):
City
New York       0.25
Los Angeles    0.25
Chicago        0.25
Houston        0.25
Name: proportion, dtype: float64
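
Because every city in the DataFrame above appears exactly once, the counts are all 1. The sketch below uses a made-up Series with repeated values (an assumption for illustration) to show how `.value_counts()` orders and normalizes its result.

```python
import pandas as pd

cities = pd.Series(['Chicago', 'Houston', 'Chicago', 'New York', 'Chicago', 'Houston'])

# Counts are sorted from most to least frequent
counts = cities.value_counts()

# normalize=True divides each count by the total number of values
shares = cities.value_counts(normalize=True)

print(counts)
print(shares)
```

Here Chicago appears in 3 of 6 entries, so its proportion is 0.5.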


**3. Applying Functions with .apply()**

.apply() → applies a function to each element of a column.

Can use custom functions or lambda (inline) functions.

In [90]:
# Custom function to categorize ages
def age_group(age):
    if age < 30:
        return 'Young'
    elif age < 40:
        return 'Mid-age'
    else:
        return 'Old'

In [91]:
# Apply the function to Age column
df['Age Group'] = df['Age'].apply(age_group)

In [92]:
print("\nDataFrame after adding 'Age Group':")
print(df)


DataFrame after adding 'Age Group':
      Name  Age         City  Salary Age Group
0    Alice   25     New York   70000     Young
1      Bob   30  Los Angeles   80000   Mid-age
2  Charlie   35      Chicago   90000   Mid-age
3    David   40      Houston   85000       Old


In [93]:
# Using a lambda function: square of Age
df['Age Squared'] = df['Age'].apply(lambda x: x**2)

In [94]:
print("\nDataFrame after adding 'Age Squared':")
print(df)


DataFrame after adding 'Age Squared':
      Name  Age         City  Salary Age Group  Age Squared
0    Alice   25     New York   70000     Young          625
1      Bob   30  Los Angeles   80000   Mid-age          900
2  Charlie   35      Chicago   90000   Mid-age         1225
3    David   40      Houston   85000       Old         1600
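
`.apply()` calls the Python function once per value, which is flexible but can be slow on large columns. As a possible alternative, not used in the original notebook, the same two columns can be produced with vectorized tools: `pd.cut()` for the age groups and plain arithmetic for the square. The bin edges below are chosen to mirror the `age_group` function.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 30, 35, 40])

# right=False makes each bin include its left edge: [0, 30), [30, 40), [40, inf)
age_group = pd.cut(
    ages,
    bins=[0, 30, 40, np.inf],
    labels=['Young', 'Mid-age', 'Old'],
    right=False,
)

# Vectorized arithmetic replaces the lambda for squaring
age_squared = ages ** 2

print(age_group.tolist())
print(age_squared.tolist())
```

Note that `pd.cut()` returns a categorical column rather than plain strings, which may or may not be what a later step expects.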


**4. Getting Column Names & Index Values:**

df.columns.tolist() → returns all column names.

df.index.tolist() → returns row index values.

In [95]:
# List of column names
print("\nColumn names:")
print(df.columns.tolist())


Column names:
['Name', 'Age', 'City', 'Salary', 'Age Group', 'Age Squared']


In [96]:
# List of index values
print("\nIndex values:")
print(df.index.tolist())


Index values:
[0, 1, 2, 3]


**5. Sorting & Ordering**

.sort_values() → sorts rows by a column (or multiple).

ascending=[True, False] → sort direction per column.

na_position='first' → puts missing values at the top.


In [97]:
# Sort by Age (ascending order is the default)
df_sorted_age = df.sort_values(by='Age')

In [98]:
print("\nSorted by Age (ascending):")
print(df_sorted_age)


Sorted by Age (ascending):
      Name  Age         City  Salary Age Group  Age Squared
0    Alice   25     New York   70000     Young          625
1      Bob   30  Los Angeles   80000   Mid-age          900
2  Charlie   35      Chicago   90000   Mid-age         1225
3    David   40      Houston   85000       Old         1600


In [103]:
# Sort by City (asc) then Age (desc)
df_sorted_multi = df.sort_values(by=['City','Age'], ascending=[True, False])

In [102]:
print("\nSorted by City (A-Z) and Age (high to low):")
print(df_sorted_multi)


Sorted by City (A-Z) and Age (high to low):
      Name  Age         City  Salary Age Group  Age Squared
2  Charlie   35      Chicago   90000   Mid-age         1225
3    David   40      Houston   85000       Old         1600
1      Bob   30  Los Angeles   80000   Mid-age          900
0    Alice   25     New York   70000     Young          625


In [104]:
# Sort by Salary with NaN values
df.loc[1, 'Salary'] = None   # Insert a NaN for demo
df_sorted_nan = df.sort_values(by='Salary', na_position='first')

In [105]:
print("\nSorted by Salary with NaN at the top:")
print(df_sorted_nan)


Sorted by Salary with NaN at the top:
      Name  Age         City   Salary Age Group  Age Squared
1      Bob   30  Los Angeles      NaN   Mid-age          900
0    Alice   25     New York  70000.0     Young          625
3    David   40      Houston  85000.0       Old         1600
2  Charlie   35      Chicago  90000.0   Mid-age         1225
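
Sorted results keep their original row labels, as the outputs above show. If a fresh 0, 1, 2, ... index is preferred, `sort_values()` accepts `ignore_index=True`. A minimal sketch on a reduced copy of the data:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [70000, 80000, 90000, 85000],
})

# Highest salary first; ignore_index=True renumbers the rows after sorting
top_paid = df_demo.sort_values(by='Salary', ascending=False, ignore_index=True)

print(top_paid)
```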


**6. Handling Missing Values:**

.isnull() → flags missing values (True where a value is missing).

.fillna() → fills missing values with a constant or a statistic such as the mean.

.replace() → replaces specific values.



In [106]:
# Detect missing values (True/False)
print("\nCheck for missing values:")
print(df.isnull())


Check for missing values:
    Name    Age   City  Salary  Age Group  Age Squared
0  False  False  False   False      False        False
1  False  False  False    True      False        False
2  False  False  False   False      False        False
3  False  False  False   False      False        False


In [107]:
# Count missing values per column
print("\nCount of missing values per column:")
print(df.isnull().sum())


Count of missing values per column:
Name           0
Age            0
City           0
Salary         1
Age Group      0
Age Squared    0
dtype: int64


In [108]:
# Fill missing values with mean (for numeric column)
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print("\nAfter filling missing Salary values with mean:")
print(df)



After filling missing Salary values with mean:
      Name  Age         City        Salary Age Group  Age Squared
0    Alice   25     New York  70000.000000     Young          625
1      Bob   30  Los Angeles  81666.666667   Mid-age          900
2  Charlie   35      Chicago  90000.000000   Mid-age         1225
3    David   40      Houston  85000.000000       Old         1600


In [109]:
# Replace a value in a column
df['City'] = df['City'].replace('Chicago', 'Chi-Town')
print("\nAfter replacing 'Chicago' with 'Chi-Town':")
print(df)


After replacing 'Chicago' with 'Chi-Town':
      Name  Age         City        Salary Age Group  Age Squared
0    Alice   25     New York  70000.000000     Young          625
1      Bob   30  Los Angeles  81666.666667   Mid-age          900
2  Charlie   35     Chi-Town  90000.000000   Mid-age         1225
3    David   40      Houston  85000.000000       Old         1600


**7. Dropping Rows & Columns:**

.drop(columns=[]) → removes a column.

.dropna() → removes rows with missing values.

.drop(index=0) → removes a specific row.

.drop_duplicates() → removes duplicate rows.

In [110]:
# Drop a column
df_drop_col = df.drop(columns=['Age Squared'])

In [111]:
print("\nAfter dropping 'Age Squared' column:")
print(df_drop_col)


After dropping 'Age Squared' column:
      Name  Age         City        Salary Age Group
0    Alice   25     New York  70000.000000     Young
1      Bob   30  Los Angeles  81666.666667   Mid-age
2  Charlie   35     Chi-Town  90000.000000   Mid-age
3    David   40      Houston  85000.000000       Old


In [112]:
# Drop rows with any missing values (none remain here, because Salary was filled above)
df_dropna = df.dropna()

In [113]:
print("\nAfter dropping rows with missing values:")
print(df_dropna)


After dropping rows with missing values:
      Name  Age         City        Salary Age Group  Age Squared
0    Alice   25     New York  70000.000000     Young          625
1      Bob   30  Los Angeles  81666.666667   Mid-age          900
2  Charlie   35     Chi-Town  90000.000000   Mid-age         1225
3    David   40      Houston  85000.000000       Old         1600


In [114]:
# Drop a row by index
df_drop_row = df.drop(index=0)

In [115]:
print("\nAfter dropping row with index 0:")
print(df_drop_row)


After dropping row with index 0:
      Name  Age         City        Salary Age Group  Age Squared
1      Bob   30  Los Angeles  81666.666667   Mid-age          900
2  Charlie   35     Chi-Town  90000.000000   Mid-age         1225
3    David   40      Houston  85000.000000       Old         1600
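
The list above mentions `.drop_duplicates()`, but the DataFrame used so far has no repeated rows. The following sketch uses a small made-up DataFrame with duplicates to show the default behaviour and the `subset` and `keep` parameters.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'City': ['New York', 'Chicago', 'New York', 'Houston'],
})

# By default, rows are duplicates only when every column matches
unique_rows = df_demo.drop_duplicates()

# subset limits the comparison to the listed columns;
# keep='last' retains the final occurrence of each duplicate
last_per_name = df_demo.drop_duplicates(subset=['Name'], keep='last')

print(unique_rows)
print(last_per_name)
```

Row 2 repeats row 0 exactly, so it is the only row removed by the default call.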


**Missing Data Handling in Pandas:**

Missing data is common in real-world datasets. Pandas provides several ways to detect, remove, or fill missing values.


**1. Detecting Missing Data:**

.isnull() → Boolean mask of missing values.

.notnull() → Opposite of .isnull().

.isnull().sum() → counts how many missing values each column has.

In [118]:
# Sample DataFrame with missing values
import numpy as np
data_missing = {
    'Name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age': [25, np.nan, 35, None, 38],
    'City': ['New York', 'Los Angeles', None, 'Houston', 'Phoenix'],
    'Salary': [70000, 85000, None, 65000, 90000]
}

In [119]:
df = pd.DataFrame(data_missing)

In [120]:
print("Original DataFrame with missing values:")
print(df)

Original DataFrame with missing values:
    Name   Age         City   Salary
0  Alice  25.0     New York  70000.0
1    Bob   NaN  Los Angeles  85000.0
2   None  35.0         None      NaN
3  David   NaN      Houston  65000.0
4    Eve  38.0      Phoenix  90000.0


In [121]:
# Check for missing values (True = missing)
print("\nCheck for missing data (True = missing):")
print(df.isnull())


Check for missing data (True = missing):
    Name    Age   City  Salary
0  False  False  False   False
1  False   True  False   False
2   True  False   True    True
3  False   True  False   False
4  False  False  False   False


In [122]:

# Opposite check: non-missing values
print("\nCheck for non-missing data (True = not missing):")
print(df.notnull())



Check for non-missing data (True = not missing):
    Name    Age   City  Salary
0   True   True   True    True
1   True  False   True    True
2  False   True  False   False
3   True  False   True    True
4   True   True   True    True


In [123]:
# Count missing values per column
print("\nMissing values count per column:")
print(df.isnull().sum())


Missing values count per column:
Name      1
Age       2
City      1
Salary    1
dtype: int64


**2. Dropping Missing Data:**

dropna() → drops rows with missing values (default).

dropna(axis=1) → drops columns with missing values.

In [125]:
# Drop rows with any missing values
df_dropped_rows = df.dropna()

In [126]:
print("\nAfter dropping rows with any missing value:")
print(df_dropped_rows)


After dropping rows with any missing value:
    Name   Age      City   Salary
0  Alice  25.0  New York  70000.0
4    Eve  38.0   Phoenix  90000.0


In [127]:
# Drop columns with any missing values (every column here has at least one, so none survive)
df_dropped_cols = df.dropna(axis=1)

In [128]:
print("\nAfter dropping columns with any missing value:")
print(df_dropped_cols)


After dropping columns with any missing value:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
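
Dropping every row or column that contains any missing value can discard a lot of data, as the empty result above shows. `dropna()` has parameters for being more selective. The sketch below uses a three-column copy of the data and illustrative variable names.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age': [25, np.nan, 35, np.nan, 38],
    'Salary': [70000, 85000, np.nan, 65000, 90000],
})

# subset: drop a row only if 'Name' is missing
has_name = df_demo.dropna(subset=['Name'])

# thresh: keep rows that have at least 2 non-missing values
mostly_complete = df_demo.dropna(thresh=2)

# how='all': drop a row only when every value in it is missing
not_empty = df_demo.dropna(how='all')

print(has_name)
print(mostly_complete)
print(not_empty)
```

Row 2 has only one non-missing value (Age), so it is removed by both the `subset` and the `thresh` call, while `how='all'` keeps all five rows.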


**3. Filling Missing Data:**

.fillna(constant) → fills missing values with a fixed constant.

.fillna(df['col'].mean()) → fills numeric missing values with the mean.



In [129]:
# Fill missing values with constant values
df_filled_const = df.fillna(value={'Age': 0, 'Salary': 50000, 'City': 'Unknown'})

In [130]:
print("\nAfter filling missing values with constants:")
print(df_filled_const)


After filling missing values with constants:
    Name   Age         City   Salary
0  Alice  25.0     New York  70000.0
1    Bob   0.0  Los Angeles  85000.0
2   None  35.0      Unknown  50000.0
3  David   0.0      Houston  65000.0
4    Eve  38.0      Phoenix  90000.0


In [131]:
# Fill missing numeric values with column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

In [132]:
print("\nAfter filling numeric columns with mean:")
print(df)


After filling numeric columns with mean:
    Name        Age         City   Salary
0  Alice  25.000000     New York  70000.0
1    Bob  32.666667  Los Angeles  85000.0
2   None  35.000000         None  77500.0
3  David  32.666667      Houston  65000.0
4    Eve  38.000000      Phoenix  90000.0
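
When several numeric columns need the same treatment, the per-column means can be passed to `fillna()` in one call: `df.mean()` returns one value per column, and `fillna()` matches them to columns by name. A minimal sketch using only the numeric columns of the data above:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Age': [25, np.nan, 35, np.nan, 38],
    'Salary': [70000, 85000, np.nan, 65000, 90000],
})

# One mean per column; fillna() applies each to the column of the same name
column_means = df_demo.mean()
df_filled = df_demo.fillna(column_means)

print(column_means)
print(df_filled)
```

The same pattern works with `df_demo.median()` if the mean is too sensitive to outliers.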


**Forward Fill & Backward Fill:**

Forward Fill (ffill) → replaces missing value with value above it.

Backward Fill (bfill) → replaces missing value with value below it.

In [133]:
# Forward fill: fill missing value with the last known value
# (df.ffill() replaces df.fillna(method='ffill'), which is deprecated in recent pandas versions)
df_ffill = df.ffill()


In [134]:
print("\nAfter forward-fill (use previous value):")
print(df_ffill)


After forward-fill (use previous value):
    Name        Age         City   Salary
0  Alice  25.000000     New York  70000.0
1    Bob  32.666667  Los Angeles  85000.0
2    Bob  35.000000  Los Angeles  77500.0
3  David  32.666667      Houston  65000.0
4    Eve  38.000000      Phoenix  90000.0


In [135]:
# Backward fill: fill missing value with the next known value
# (df.bfill() replaces df.fillna(method='bfill'), which is deprecated in recent pandas versions)
df_bfill = df.bfill()


In [136]:
print("\nAfter backward-fill (use next value):")
print(df_bfill)


After backward-fill (use next value):
    Name        Age         City   Salary
0  Alice  25.000000     New York  70000.0
1    Bob  32.666667  Los Angeles  85000.0
2  David  35.000000      Houston  77500.0
3  David  32.666667      Houston  65000.0
4    Eve  38.000000      Phoenix  90000.0
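
Forward and backward fill will happily carry a value across a long run of gaps, which is not always wanted. The sketch below uses a made-up `readings` Series with three consecutive missing values to show the `limit` parameter, which caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Without a limit, the last known value is carried across the whole gap
filled_all = readings.ffill()

# limit=1 fills at most one consecutive missing value
filled_limited = readings.ffill(limit=1)

# Backward fill pulls the next known value upward instead
filled_back = readings.bfill()

print(filled_all.tolist())
print(filled_limited.tolist())
print(filled_back.tolist())
```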
