# Learning & Revision of Python Pandas Day 3

- **Objective**: Clean and manipulate data to prepare it for analysis.
- **Topics**:
    - Handling missing values: `.fillna()`, `.dropna()`, `.interpolate()`.
    - Removing duplicates: `.drop_duplicates()`.
    - Renaming columns and rows: `.rename()`.
    - Data transformations: `.apply()`, `.map()`, `.applymap()` (renamed to `DataFrame.map()` in pandas 2.1).
- **Exercises**:
    - Clean a messy dataset (e.g., remove missing values, duplicates).
    - Rename columns for clarity.
    - Apply a function to transform data.

## Topic 01: Handling missing values

In [7]:
import pandas as pd
import numpy as np

## Missing Data

In [9]:
# Sample DataFrame with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, np.nan, 35, 40],
        'City': ['New York', np.nan, 'Chicago', 'Houston']}
df = pd.DataFrame(data)


In [11]:
df

Unnamed: 0,Name,Age,City
0,Alice,25.0,New York
1,Bob,,
2,Charlie,35.0,Chicago
3,David,40.0,Houston


**Check for missing data**

In [14]:
df.isnull()

Unnamed: 0,Name,Age,City
0,False,False,False
1,False,True,True
2,False,False,False
3,False,False,False


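#### **Total missing values in this DataFrame**

In [18]:
# Sum the per-column counts to get one total.
# (isnull().value_counts().sum() would return the number of rows, 4, not the number of missing cells.)
df.isnull().sum().sum()

2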

#### **Missing values per column**

In [24]:
df.isnull().sum()

Name    0
Age     1
City    1
dtype: int64
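As a cross-check on the counts above, the following sketch rebuilds the same sample frame and derives the per-column count, the overall total, the number of affected rows, and the share of missing values per column. The variable names (`missing_per_column`, `total_missing`, `rows_with_missing`, `missing_share`) are illustrative and do not appear elsewhere in these notes.

```python
import numpy as np
import pandas as pd

# Same sample data as in the notes
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'City': ['New York', np.nan, 'Chicago', 'Houston'],
})

# Missing cells per column
missing_per_column = df.isnull().sum()

# Missing cells in the whole frame
total_missing = int(missing_per_column.sum())

# Rows that contain at least one missing cell
rows_with_missing = int(df.isnull().any(axis=1).sum())

# Fraction of missing cells per column (the mean of a boolean column)
missing_share = df.isnull().mean()

print(missing_per_column)
print(total_missing, rows_with_missing)
print(missing_share)
```

Taking the mean of the boolean mask is a handy shortcut when deciding whether a column has too many gaps to be worth filling.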

In [36]:
# Fill missing values with column-specific defaults (Age -> 30, City -> 'Unknown')
df_filled = df.fillna({'Age': 30, 'City': 'Unknown'})
print(df_filled)

print("\n")

# Forward fill: copy the last valid value downward in every column
# (df.ffill() replaces the deprecated fillna(method='ffill'))
df_filled_forward = df.ffill()
print(df_filled_forward)


      Name   Age      City
0    Alice  25.0  New York
1      Bob  30.0   Unknown
2  Charlie  35.0   Chicago
3    David  40.0   Houston


      Name   Age      City
0    Alice  25.0  New York
1      Bob  25.0  New York
2  Charlie  35.0   Chicago
3    David  40.0   Houston




In [40]:
df.isna()

Unnamed: 0,Name,Age,City
0,False,False,False
1,False,True,True
2,False,False,False
3,False,False,False
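The topic list also mentions `.interpolate()`, which estimates a missing number from its neighbours instead of copying one of them. Here is a minimal sketch on the same `Age` values, assuming the default linear method is acceptable; the names `ages` and `ages_interpolated` are illustrative.

```python
import numpy as np
import pandas as pd

# The Age column from the sample data, with Bob's age missing
ages = pd.Series([25, np.nan, 35, 40], name='Age')

# Linear interpolation (the default) places the gap halfway between 25 and 35
ages_interpolated = ages.interpolate()

print(ages_interpolated)
```

Interpolation only makes sense for numeric data where row order carries meaning (for example, a time series), so it is applied to the numeric column here rather than to the whole frame.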


#### Drop the missing values

In [45]:
df_dropped = df.dropna()

In [47]:
df_dropped

Unnamed: 0,Name,Age,City
0,Alice,25.0,New York
2,Charlie,35.0,Chicago
3,David,40.0,Houston


## Duplicate Data

#### Drop the duplicates

In [55]:
# Create a DataFrame with duplicate rows
data = {'Name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David'],
        'Age': [25, 30, 30, 35, 40, 40],
        'City': ['New York', 'Los Angeles', 'Los Angeles', 'Chicago', 'Houston', 'Houston']}
df_duplicates = pd.DataFrame(data)


In [57]:
df_duplicates

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Bob,30,Los Angeles
3,Charlie,35,Chicago
4,David,40,Houston
5,David,40,Houston


In [69]:
df_no_duplicates = df_duplicates.drop_duplicates()


In [71]:
df_no_duplicates

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
3,Charlie,35,Chicago
4,David,40,Houston


In [73]:
# Remove duplicates based on a specific column (e.g., 'Name')
df_no_duplicates_name = df_duplicates.drop_duplicates(subset='Name')
print(df_no_duplicates_name)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
3  Charlie   35      Chicago
4    David   40      Houston
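Before dropping anything, it can help to see which rows pandas treats as duplicates. The sketch below uses `.duplicated()` for that, then shows the `keep` parameter and an index reset. The names `duplicate_mask`, `df_keep_last`, and `df_reset` are illustrative.

```python
import pandas as pd

# Same duplicate-laden sample data as in the notes
df_duplicates = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David'],
    'Age': [25, 30, 30, 35, 40, 40],
    'City': ['New York', 'Los Angeles', 'Los Angeles',
             'Chicago', 'Houston', 'Houston'],
})

# True for every row that repeats an earlier row
duplicate_mask = df_duplicates.duplicated()
print(duplicate_mask.sum(), "duplicate rows")

# keep='last' retains the final occurrence instead of the first
df_keep_last = df_duplicates.drop_duplicates(keep='last')

# drop_duplicates() keeps the original index labels; reset them if gaps are unwanted
df_reset = df_duplicates.drop_duplicates().reset_index(drop=True)
print(df_reset)
```

Resetting the index is optional, but it avoids surprises later if rows are selected by label and some labels no longer exist.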


## Rename Columns and Rows

In [86]:
df

Unnamed: 0,Name,Age,City
0,Alice,25.0,New York
1,Bob,,
2,Charlie,35.0,Chicago
3,David,40.0,Houston


In [90]:
df.rename(columns={"Name":"Full Name", "City":"Town"})

Unnamed: 0,Full Name,Age,Town
0,Alice,25.0,New York
1,Bob,,
2,Charlie,35.0,Chicago
3,David,40.0,Houston


In [92]:
df_renamed_index = df.rename(index={0: 'Row 1', 1: 'Row 2', 2: 'Row 3', 3: 'Row 4'})

In [94]:
df_renamed_index

Unnamed: 0,Name,Age,City
Row 1,Alice,25.0,New York
Row 2,Bob,,
Row 3,Charlie,35.0,Chicago
Row 4,David,40.0,Houston
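Note that `.rename()` returns a new DataFrame and leaves the original untouched unless the result is assigned. The sketch below demonstrates that, and also shows that `columns` accepts a function as well as a dictionary. The names `renamed` and `lowered` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'City': ['New York', np.nan, 'Chicago', 'Houston'],
})

# Dictionary form: only the listed columns change
renamed = df.rename(columns={'Name': 'Full Name', 'City': 'Town'})

# The original frame still has its old column names
print(list(df.columns))
print(list(renamed.columns))

# Function form: the callable is applied to every column label
lowered = df.rename(columns=str.lower)
print(list(lowered.columns))
```

Assigning the result back (`df = df.rename(...)`) is the usual way to make the change stick.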


## Applying Functions to Columns

In [141]:
df['Age in Months'] = df['Age']*12

In [143]:
df

Unnamed: 0,Name,Age,City,Age in Months
0,Alice,25.0,New York,300.0
1,Bob,,,
2,Charlie,35.0,Chicago,420.0
3,David,40.0,Houston,480.0


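In [145]:
def city_length(p):
    """Return the number of characters in a city name."""
    return len(p)

In [147]:
# Cast to str so len() works on every row; NaN becomes the literal string 'nan'
df['City'] = df['City'].astype(str)

In [149]:
# Bob's missing city is now the string 'nan', so its length is 3
df['City Lenght'] = df['City'].apply(city_length)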

In [151]:
df

Unnamed: 0,Name,Age,City,Age in Months,City Lenght
0,Alice,25.0,New York,300.0,8
1,Bob,,nan,,3
2,Charlie,35.0,Chicago,420.0,7
3,David,40.0,Houston,480.0,7
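The topic list also names `.map()` and `.applymap()`, which the cells above do not exercise. The sketch below shows `Series.map` with a lookup dictionary, a NaN-preserving alternative to `apply(len)`, and a column-wise `apply` that works across pandas versions (`.applymap()` was renamed to `DataFrame.map()` in pandas 2.1). The `city_to_state` lookup and the names `states`, `city_lengths`, and `text_upper` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'City': ['New York', np.nan, 'Chicago', 'Houston'],
})

# Hypothetical lookup table, used only for this example
city_to_state = {'New York': 'NY', 'Chicago': 'IL', 'Houston': 'TX'}

# Series.map translates each value; missing or unmapped values become NaN
states = df['City'].map(city_to_state)

# .str.len() measures strings and leaves NaN as NaN,
# so no astype(str) cast (and no 'nan' of length 3) is needed
city_lengths = df['City'].str.len()

# Element-wise transformation of several text columns at once
text_upper = df[['Name', 'City']].apply(lambda col: col.str.upper())

print(states)
print(city_lengths)
print(text_upper)
```

Keeping NaN as NaN in the derived columns makes later calls such as `.fillna()` or `.dropna()` still see the gap.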


## Extra

##### Replace the string "nan" in the City column with "Unknown"

In [159]:
df['City'] = df['City'].replace({"nan":"Unknown"})

In [161]:
df

Unnamed: 0,Name,Age,City,Age in Months,City Lenght
0,Alice,25.0,New York,300.0,8
1,Bob,,Unknown,,3
2,Charlie,35.0,Chicago,420.0,7
3,David,40.0,Houston,480.0,7


##### Fill the missing value in the Age column using *forward fill*

In [163]:
# Series.ffill() replaces the deprecated fillna(method="ffill")
df['Age'] = df['Age'].ffill()


In [165]:
df

Unnamed: 0,Name,Age,City,Age in Months,City Lenght
0,Alice,25.0,New York,300.0,8
1,Bob,25.0,Unknown,,3
2,Charlie,35.0,Chicago,420.0,7
3,David,40.0,Houston,480.0,7
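To tie the exercises together, here is one possible cleaning pipeline that chains the steps from this day: drop rows missing a key field, remove duplicates, fill the remaining gaps, rename columns, and apply a function. The `raw` dataset, the snake_case column names, and the choice of the median for `age` are assumptions made for illustration, not part of the original exercises.

```python
import numpy as np
import pandas as pd

# Hypothetical messy dataset: one duplicate row, gaps in every column
raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie', None],
    'Age': [25, np.nan, np.nan, 35, 40],
    'City': ['New York', 'Chicago', 'Chicago', None, 'Houston'],
})

clean = (
    raw
    .dropna(subset=['Name'])          # a row without a name is unusable here
    .drop_duplicates()                # the two identical Bob rows collapse to one
    .fillna({'City': 'Unknown'})      # placeholder for missing cities
    .rename(columns={'Name': 'full_name', 'Age': 'age', 'City': 'city'})
    .reset_index(drop=True)
)

# Fill the remaining numeric gap with the column median (an assumed policy)
clean['age'] = clean['age'].fillna(clean['age'].median())

# Apply a function to derive a new column
clean['city_length'] = clean['city'].apply(len)

print(clean)
```

Dropping duplicates before filling keeps the fill values from hiding or creating duplicate rows; the order of the other steps is largely a matter of taste.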
