Exercises:
Identify and handle missing data:
Use isna(), notna(), dropna(), and fillna() to handle missing values.
Rename columns and convert data types (e.g., change a column type from string to integer using astype()).
Filter data based on conditions (e.g., select rows where Age > 30).
Sort data using sort_values().

0. Dataset Overview

In [2]:
import pandas as pd

df = pd.read_csv('employee_data.csv')
print(df)

      Name   Age Department   Salary  Start_Date   Bonus
0    Alice  25.0         HR  50000.0  2022-01-15  5000.0
1      Bob  35.0         IT  70000.0  2021-05-20     NaN
2  Charlie   NaN    Finance      NaN  2020-08-01  4500.0
3    David  45.0         IT  80000.0         NaN  7000.0
4      Eve  29.0  Marketing  55000.0  2023-03-10     NaN
5    Frank  50.0      Sales  65000.0  2019-11-25  8000.0
6    Grace   NaN         HR  52000.0  2023-06-15     NaN
7    Helen  40.0    Finance  72000.0  2018-04-01  6000.0
8      Ian  28.0  Marketing      NaN  2022-12-11  3000.0
9    Julia  32.0      Sales  75000.0  2017-09-05     NaN


1. Identify Missing Data

In [20]:
# Check for missing data in each column
missing_data = df.isna()

# Count of missing values in each column
missing_data_summary = df.isna().sum()

# Display the missing data matrix, then the per-column summary
missing_data, missing_data_summary

(    Name    Age  Department  Salary  Start_Date  Bonus
 0  False  False       False   False       False  False
 1  False  False       False   False       False   True
 2  False   True       False    True       False  False
 3  False  False       False   False        True  False
 4  False  False       False   False       False   True
 5  False  False       False   False       False  False
 6  False   True       False   False       False   True
 7  False  False       False   False       False  False
 8  False  False       False    True       False  False
 9  False  False       False   False       False   True,
 Name          0
 Age           2
 Department    0
 Salary        2
 Start_Date    1
 Bonus         4
 dtype: int64)
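
The exercise list also names notna(), which the cell above does not use. A minimal sketch of how it could complement isna() follows; the DataFrame is rebuilt inline from the values printed in the Dataset Overview so the block runs without employee_data.csv, and the names present_counts and rows_with_missing are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Rebuild the dataset from the overview so the sketch runs without the CSV file.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve",
             "Frank", "Grace", "Helen", "Ian", "Julia"],
    "Age": [25, 35, np.nan, 45, 29, 50, np.nan, 40, 28, 32],
    "Department": ["HR", "IT", "Finance", "IT", "Marketing",
                   "Sales", "HR", "Finance", "Marketing", "Sales"],
    "Salary": [50000, 70000, np.nan, 80000, 55000,
               65000, 52000, 72000, np.nan, 75000],
    "Start_Date": ["2022-01-15", "2021-05-20", "2020-08-01", None, "2023-03-10",
                   "2019-11-25", "2023-06-15", "2018-04-01", "2022-12-11", "2017-09-05"],
    "Bonus": [5000, np.nan, 4500, 7000, np.nan, 8000, np.nan, 6000, 3000, np.nan],
})

# notna() is the element-wise inverse of isna(); summing counts present values.
present_counts = df.notna().sum()

# Select every row that has at least one missing value in any column.
rows_with_missing = df[df.isna().any(axis=1)]

print(present_counts)
print(rows_with_missing)
```

For each column, the isna() count and the notna() count add up to the number of rows, which makes a quick sanity check.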

In [10]:
df

      Name   Age Department   Salary  Start_Date   Bonus
0    Alice  25.0         HR  50000.0  2022-01-15  5000.0
1      Bob  35.0         IT  70000.0  2021-05-20     NaN
2  Charlie   NaN    Finance      NaN  2020-08-01  4500.0
3    David  45.0         IT  80000.0         NaN  7000.0
4      Eve  29.0  Marketing  55000.0  2023-03-10     NaN
5    Frank  50.0      Sales  65000.0  2019-11-25  8000.0
6    Grace   NaN         HR  52000.0  2023-06-15     NaN
7    Helen  40.0    Finance  72000.0  2018-04-01  6000.0
8      Ian  28.0  Marketing      NaN  2022-12-11  3000.0
9    Julia  32.0      Sales  75000.0  2017-09-05     NaN


In [16]:
df.dtypes

Name           object
Age           float64
Department     object
Salary        float64
Start_Date     object
Bonus         float64
dtype: object
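
The dtypes output shows Start_Date stored as object, that is, as plain strings. One possible way to turn it into a proper date column is pd.to_datetime; the sketch below is an illustration rather than a step from the original notebook, and it uses only the first five Start_Date values from the overview, with None standing in for the missing entry.

```python
import pandas as pd

# Start_Date values from the first five rows of the dataset;
# None represents the missing value in David's row.
start_dates = pd.Series(
    ["2022-01-15", "2021-05-20", "2020-08-01", None, "2023-03-10"],
    name="Start_Date",
)

# Parse the ISO-formatted strings; missing entries become NaT instead of NaN.
parsed = pd.to_datetime(start_dates, format="%Y-%m-%d")

print(parsed.dtype)
print(parsed)
```

This conversion would need to happen before missing dates are replaced with a placeholder string such as "Unknown", because that string cannot be parsed as a date unless errors="coerce" is passed.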


2. Handling Missing Data

In [30]:
# Example of dropping rows with any missing data
data_dropped = df.dropna()

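# Example of filling missing values with:
# - the column mean for numerical columns (Age, Salary, Bonus)
# - "Unknown" for the text column Start_Date

data_filled = df.copy()
# Assign each filled column back rather than calling fillna(inplace=True) on a
# column selection, which avoids pandas' chained-assignment warnings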
data_filled['Age'] = data_filled['Age'].fillna(df['Age'].mean())
data_filled['Salary'] = data_filled['Salary'].fillna(df['Salary'].mean())
data_filled['Bonus'] = data_filled['Bonus'].fillna(df['Bonus'].mean())
data_filled['Start_Date'] = data_filled['Start_Date'].fillna("Unknown")

# Display the cleaned datasets
print(data_dropped.head())
print('\n')
print(data_filled.head())

    Name   Age Department   Salary  Start_Date   Bonus
0  Alice  25.0         HR  50000.0  2022-01-15  5000.0
5  Frank  50.0      Sales  65000.0  2019-11-25  8000.0
7  Helen  40.0    Finance  72000.0  2018-04-01  6000.0


      Name   Age Department   Salary  Start_Date        Bonus
0    Alice  25.0         HR  50000.0  2022-01-15  5000.000000
1      Bob  35.0         IT  70000.0  2021-05-20  5583.333333
2  Charlie  35.5    Finance  64875.0  2020-08-01  4500.000000
3    David  45.0         IT  80000.0     Unknown  7000.000000
4      Eve  29.0  Marketing  55000.0  2023-03-10  5583.333333
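
As a variation on the cell above, dropna() and fillna() both accept arguments that narrow or batch the operation. The sketch below is one possible alternative, not the notebook's method: it rebuilds the dataset inline, drops only rows lacking a Salary, and fills all four columns in a single fillna() call. The names salary_known, fill_values, and data_filled_alt are hypothetical.

```python
import numpy as np
import pandas as pd

# Rebuild the dataset from the overview so the sketch runs without the CSV file.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve",
             "Frank", "Grace", "Helen", "Ian", "Julia"],
    "Age": [25, 35, np.nan, 45, 29, 50, np.nan, 40, 28, 32],
    "Department": ["HR", "IT", "Finance", "IT", "Marketing",
                   "Sales", "HR", "Finance", "Marketing", "Sales"],
    "Salary": [50000, 70000, np.nan, 80000, 55000,
               65000, 52000, 72000, np.nan, 75000],
    "Start_Date": ["2022-01-15", "2021-05-20", "2020-08-01", None, "2023-03-10",
                   "2019-11-25", "2023-06-15", "2018-04-01", "2022-12-11", "2017-09-05"],
    "Bonus": [5000, np.nan, 4500, 7000, np.nan, 8000, np.nan, 6000, 3000, np.nan],
})

# Drop only the rows where Salary is missing; gaps in other columns are kept.
salary_known = df.dropna(subset=["Salary"])

# Fill several columns in one call by passing a column-to-value mapping.
fill_values = {
    "Age": df["Age"].mean(),
    "Salary": df["Salary"].mean(),
    "Bonus": df["Bonus"].mean(),
    "Start_Date": "Unknown",
}
data_filled_alt = df.fillna(fill_values)

print(salary_known.shape)
print(data_filled_alt.head())
```

Dropping every incomplete row leaves only 3 of the 10 employees here, so restricting dropna() to the columns that matter for a given analysis preserves more data.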


3. Renaming Columns

In [34]:
# Renaming columns for simplicity
data_renamed = data_filled.rename(columns={
    'Name': 'Employee_Name',
    'Age': 'Employee_Age',
    'Department': 'Dept',
    'Salary': 'Annual_Salary',
    'Start_Date': 'Employment_Start',
    'Bonus': 'Yearly_Bonus'
})

# Display the renamed dataset
data_renamed.head()

  Employee_Name  Employee_Age       Dept  Annual_Salary Employment_Start  Yearly_Bonus
0         Alice          25.0         HR        50000.0       2022-01-15   5000.000000
1           Bob          35.0         IT        70000.0       2021-05-20   5583.333333
2       Charlie          35.5    Finance        64875.0       2020-08-01   4500.000000
3         David          45.0         IT        80000.0          Unknown   7000.000000
4           Eve          29.0  Marketing        55000.0       2023-03-10   5583.333333
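
The remaining exercises (converting a type with astype(), filtering on age, and sorting with sort_values()) are not worked through above. A minimal sketch of how they might look on the renamed data follows. It uses only the five rows displayed by data_renamed.head(), and the names data_typed, over_30, and by_salary are hypothetical. Note that it converts Annual_Salary from float to integer rather than from string to integer, since no string-typed numeric column exists in this dataset.

```python
import pandas as pd

# The five rows shown in the renamed output above (a subset of the full data).
data_renamed = pd.DataFrame({
    "Employee_Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Employee_Age": [25.0, 35.0, 35.5, 45.0, 29.0],
    "Dept": ["HR", "IT", "Finance", "IT", "Marketing"],
    "Annual_Salary": [50000.0, 70000.0, 64875.0, 80000.0, 55000.0],
})

# Convert the salary column from float to integer. This works only because the
# missing values were filled first; astype(int) raises an error on NaN.
data_typed = data_renamed.astype({"Annual_Salary": int})

# Keep only employees older than 30 using a boolean mask.
over_30 = data_typed[data_typed["Employee_Age"] > 30]

# Sort the filtered rows by salary, highest first.
by_salary = over_30.sort_values("Annual_Salary", ascending=False)

print(by_salary)
```

Employee_Age is left as float here because the mean-filled value 35.5 would be truncated to 35 by an integer conversion.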
