## **1. Handling Missing Data**
In pandas, missing data usually appears as **`NaN`** (Not a Number); missing datetimes appear as **`NaT`** (Not a Time).

In [1]:
import pandas as pd
import numpy as np # Often used for np.nan

# Load the messy data
df = pd.read_csv("messy_hr_data.csv")

print("--- Initial DataFrame Info ---")
df.info() # Notice the non-null counts are different
print("\n--- Initial DataFrame ---")
df

--- Initial DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Employee ID  10 non-null     object 
 1   Department   10 non-null     object 
 2   Age          9 non-null      float64
 3   Gender       9 non-null      object 
 4   Salary       9 non-null      float64
 5   Start Date   10 non-null     object 
 6   Last Login   8 non-null      object 
dtypes: float64(2), object(5)
memory usage: 692.0+ bytes

--- Initial DataFrame ---


,Employee ID,Department,Age,Gender,Salary,Start Date,Last Login
0,E001,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00
1,E002,Sales,28.0,,55000.0,2021-03-20,2023-11-02 11:30
2,E003,IT,,Male,90000.0,2019-07-11,2023-10-30 09:00
3,E004,HR,45.0,Female,65000.0,2018-05-25,
4,E005,Sales,32.0,Male,58000.0,2022-09-01,2023-11-02 14:00
5,E006,IT,41.0,Male,,2017-11-14,2023-11-01 15:10
6,E007,Marketing,29.0,Female,72000.0,2022-02-10,2023-11-02 08:45
7,E008,IT,38.0,Male,95000.0,2019-08-01,2023-10-29 17:00
8,E009,Sales,33.0,Female,60000.0,2021-10-05,
9,E010,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00


**Identifying Missing Values:**
- **.isnull() or .isna():** Returns a boolean DataFrame of the same shape, with True where values are missing.
- **.notnull():** The opposite of .isnull().
- You can chain .sum() to these methods to get a count of missing values per column.

In [2]:
print("\n--- Missing Value Counts ---")
print(df.isnull().sum())


--- Missing Value Counts ---
Employee ID    0
Department     0
Age            1
Gender         1
Salary         1
Start Date     0
Last Login     2
dtype: int64
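
The same boolean mask can be summarized in other ways than a per-column count. The sketch below uses a small made-up frame (`demo`, not the HR file) to show the share of missing values per column, the number of gaps per row, and the rows that contain at least one gap. All names and values here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the HR file
demo = pd.DataFrame({
    "Age": [35.0, np.nan, 41.0, 29.0],
    "Gender": ["Male", "Female", None, "Female"],
    "Salary": [70000.0, 55000.0, np.nan, 72000.0],
})

mask = demo.isna()                       # True where a value is missing
missing_share = mask.mean()              # fraction of missing values per column
missing_per_row = mask.sum(axis=1)       # number of missing values in each row
rows_with_gaps = demo[mask.any(axis=1)]  # rows with at least one missing value

print(missing_share)
print(missing_per_row)
print(rows_with_gaps)
```

Taking the mean of a boolean mask works because True counts as 1 and False as 0.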


**Dropping Missing Values:**
- **.dropna():** Removes rows (or columns) with missing values.
- **axis=0** to drop rows (default), **axis=1** to drop columns.
- **how='any'** to drop if any NaN is present (default), **how='all'** to drop if all values are NaN.
- **thresh=N** to keep rows with at least N non-null values.
- **subset=['col1', 'col2']** to only consider specific columns for dropping NaNs.

In [3]:
# Drop any row with at least one missing value

df_dropped_rows = df.dropna()
print("\n--- DataFrame after dropping rows with any NaN ---")
df_dropped_rows


--- DataFrame after dropping rows with any NaN ---


,Employee ID,Department,Age,Gender,Salary,Start Date,Last Login
0,E001,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00
4,E005,Sales,32.0,Male,58000.0,2022-09-01,2023-11-02 14:00
6,E007,Marketing,29.0,Female,72000.0,2022-02-10,2023-11-02 08:45
7,E008,IT,38.0,Male,95000.0,2019-08-01,2023-10-29 17:00
9,E010,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00


In [4]:
# Drop columns that have any missing values

df_dropped_cols = df.dropna(axis=1)
print("\n--- DataFrame after dropping columns with any NaN ---")
df_dropped_cols


--- DataFrame after dropping columns with any NaN ---


,Employee ID,Department,Start Date
0,E001,Marketing,2020-01-15
1,E002,Sales,2021-03-20
2,E003,IT,2019-07-11
3,E004,HR,2018-05-25
4,E005,Sales,2022-09-01
5,E006,IT,2017-11-14
6,E007,Marketing,2022-02-10
7,E008,IT,2019-08-01
8,E009,Sales,2021-10-05
9,E010,Marketing,2020-01-15
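
The two cells above cover the defaults and `axis=1`, but not `thresh` or `subset`. Here is a minimal sketch of those options on a made-up four-row frame (`demo` and its values are assumptions for illustration, not the HR file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each row has a different pattern of gaps
demo = pd.DataFrame({
    "Employee ID": ["E001", "E002", "E003", "E004"],
    "Age": [35.0, np.nan, 41.0, np.nan],
    "Salary": [70000.0, 55000.0, np.nan, 65000.0],
    "Last Login": ["2023-11-01 10:00", "2023-11-02 11:30", "2023-10-30 09:00", None],
})

# Keep rows that have at least 3 non-null values (out of 4 columns)
at_least_three = demo.dropna(thresh=3)

# Drop a row only when 'Salary' is missing; gaps in other columns are tolerated
salary_known = demo.dropna(subset=["Salary"])

# Drop columns in which every value is missing (none here, so nothing is removed)
no_empty_cols = demo.dropna(axis=1, how="all")

print(at_least_three.index.tolist())
print(salary_known.index.tolist())
print(no_empty_cols.shape)
```

`thresh` and `subset` usually discard far less data than a bare `.dropna()`, which removed half of the rows in the example above.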


**Filling Missing Values (Imputation):**
- **.fillna(value):** Fills NaN values with a specified value. This is often preferred over dropping data.
- The value can be a scalar (e.g., 0), a dictionary (to fill different columns with different values), or the result of a calculation (like the mean or median).
- **.ffill()** (forward fill) and **.bfill()** (backward fill) propagate the last or next valid observation. The older **fillna(method='ffill')** spelling is deprecated in recent pandas versions.

In [5]:
# Create a copy to work with
df_filled = df.copy()

# Fill missing 'Age' with the mean age
mean_age = df_filled['Age'].mean()
df_filled['Age'] = df_filled['Age'].fillna(mean_age) # assign back; inplace=True on a selected column is deprecated

# Fill missing 'Salary' with the median salary
median_salary = df_filled['Salary'].median()
df_filled['Salary'] = df_filled['Salary'].fillna(median_salary)

# Fill missing 'Gender' with the mode (most common value)
mode_gender = df_filled['Gender'].mode()[0] # .mode() returns a Series, so we take the first item
df_filled['Gender'] = df_filled['Gender'].fillna(mode_gender)

print("\n--- DataFrame after filling missing values ---")
df_filled


--- DataFrame after filling missing values ---


,Employee ID,Department,Age,Gender,Salary,Start Date,Last Login
0,E001,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00
1,E002,Sales,28.0,Male,55000.0,2021-03-20,2023-11-02 11:30
2,E003,IT,35.111111,Male,90000.0,2019-07-11,2023-10-30 09:00
3,E004,HR,45.0,Female,65000.0,2018-05-25,
4,E005,Sales,32.0,Male,58000.0,2022-09-01,2023-11-02 14:00
5,E006,IT,41.0,Male,70000.0,2017-11-14,2023-11-01 15:10
6,E007,Marketing,29.0,Female,72000.0,2022-02-10,2023-11-02 08:45
7,E008,IT,38.0,Male,95000.0,2019-08-01,2023-10-29 17:00
8,E009,Sales,33.0,Female,60000.0,2021-10-05,
9,E010,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00


In [6]:
df_filled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Employee ID  10 non-null     object 
 1   Department   10 non-null     object 
 2   Age          10 non-null     float64
 3   Gender       10 non-null     object 
 4   Salary       10 non-null     float64
 5   Start Date   10 non-null     object 
 6   Last Login   8 non-null      object 
dtypes: float64(2), object(5)
memory usage: 692.0+ bytes
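
The dictionary form of `.fillna()` and the forward/backward fills mentioned in the list above are not used in the cells so far. A minimal sketch on made-up data follows; the frame `demo` and the `"Unknown"` placeholder are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in every column
demo = pd.DataFrame({
    "Age": [35.0, np.nan, 41.0],
    "Gender": ["Male", None, "Female"],
    "Salary": [np.nan, 55000.0, np.nan],
})

# One call, with a different fill value per column
fill_values = {
    "Age": demo["Age"].mean(),         # 38.0
    "Gender": "Unknown",               # placeholder label, chosen for this sketch
    "Salary": demo["Salary"].median(), # 55000.0
}
filled = demo.fillna(fill_values)

# Forward fill copies the previous valid value down; a leading gap stays NaN
salary_ffill = demo["Salary"].ffill()

# Backward fill copies the next valid value up; a trailing gap stays NaN
salary_bfill = demo["Salary"].bfill()

print(filled)
print(salary_ffill.tolist())
print(salary_bfill.tolist())
```

Forward and backward fills only make sense when row order carries meaning (for example, time series); for unordered records like employees, a statistic such as the mean, median, or mode is usually the safer choice.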


## **2. Correcting Data Types**
The .info() method showed us that Start Date and Last Login are object (string) types, not datetime types. This prevents us from doing date-based calculations.
- **pd.to_datetime(series):** Converts a Series of date strings to the datetime64 dtype.
- **.astype(new_type):** A general method to cast a series to a new type (e.g., int, float, str).

In [7]:
# Convert date columns to datetime objects

df_filled['Start Date'] = pd.to_datetime(df_filled['Start Date'])
df_filled['Last Login'] = pd.to_datetime(df_filled['Last Login'], errors='coerce')
# errors='coerce' turns any unparseable dates into NaT (Not a Time); missing entries become NaT as well

print("\n--- Data types after converting date columns ---")
df_filled.info()


--- Data types after converting date columns ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Employee ID  10 non-null     object        
 1   Department   10 non-null     object        
 2   Age          10 non-null     float64       
 3   Gender       10 non-null     object        
 4   Salary       10 non-null     float64       
 5   Start Date   10 non-null     datetime64[ns]
 6   Last Login   8 non-null      datetime64[ns]
dtypes: datetime64[ns](2), float64(2), object(3)
memory usage: 692.0+ bytes


In [8]:
# Example: Now we can do date calculations (the result depends on the date this cell is run)

df_filled['Days Since Last Login'] = (pd.to_datetime('now') - df_filled['Last Login']).dt.days
print("\n--- DataFrame with new calculated column ---")
df_filled.head()


--- DataFrame with new calculated column ---


,Employee ID,Department,Age,Gender,Salary,Start Date,Last Login,Days Since Last Login
0,E001,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00:00,608.0
1,E002,Sales,28.0,Male,55000.0,2021-03-20,2023-11-02 11:30:00,607.0
2,E003,IT,35.111111,Male,90000.0,2019-07-11,2023-10-30 09:00:00,611.0
3,E004,HR,45.0,Female,65000.0,2018-05-25,NaT,
4,E005,Sales,32.0,Male,58000.0,2022-09-01,2023-11-02 14:00:00,607.0


## **3. Handling Duplicates**
- **.duplicated():** Returns a boolean Series indicating which rows are duplicates.
- **.drop_duplicates():** Returns a DataFrame with duplicate rows removed.
    - keep='first' (default) keeps the first occurrence, keep='last' keeps the last, and keep=False drops every copy.
    - subset=['col1', 'col2'] to define uniqueness based on specific columns.

In [9]:
# Check for duplicate rows
print(f"\nNumber of duplicate rows: {df_filled.duplicated().sum()}")

# Drop duplicate rows
df_cleaned = df_filled.drop_duplicates()
print("\n--- DataFrame after dropping duplicates ---")
df_cleaned


Number of duplicate rows: 0

--- DataFrame after dropping duplicates ---


,Employee ID,Department,Age,Gender,Salary,Start Date,Last Login,Days Since Last Login
0,E001,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00:00,608.0
1,E002,Sales,28.0,Male,55000.0,2021-03-20,2023-11-02 11:30:00,607.0
2,E003,IT,35.111111,Male,90000.0,2019-07-11,2023-10-30 09:00:00,611.0
3,E004,HR,45.0,Female,65000.0,2018-05-25,NaT,
4,E005,Sales,32.0,Male,58000.0,2022-09-01,2023-11-02 14:00:00,607.0
5,E006,IT,41.0,Male,70000.0,2017-11-14,2023-11-01 15:10:00,608.0
6,E007,Marketing,29.0,Female,72000.0,2022-02-10,2023-11-02 08:45:00,608.0
7,E008,IT,38.0,Male,95000.0,2019-08-01,2023-10-29 17:00:00,611.0
8,E009,Sales,33.0,Female,60000.0,2021-10-05,NaT,
9,E010,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00:00,608.0
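
In the run above, `.duplicated()` reports 0 because a row only counts as a duplicate when every column matches. Rows 0 and 9 agree in every column except Employee ID, so they are not flagged. Whether E010 really is a repeat of E001 is a judgment about the data, not something the code can decide. The sketch below, on a made-up three-row frame, only shows how `subset` changes what counts as a duplicate.

```python
import pandas as pd

# Hypothetical toy data: the first and last rows differ only in 'Employee ID'
demo = pd.DataFrame({
    "Employee ID": ["E001", "E002", "E010"],
    "Department": ["Marketing", "Sales", "Marketing"],
    "Salary": [70000.0, 55000.0, 70000.0],
    "Start Date": ["2020-01-15", "2021-03-20", "2020-01-15"],
})

# Comparing whole rows finds nothing, because the IDs differ
full_row_dupes = int(demo.duplicated().sum())

# Compare on every column except the ID
compare_cols = [col for col in demo.columns if col != "Employee ID"]
dupes_ignoring_id = demo.duplicated(subset=compare_cols)

# Keep the first occurrence of each group of matching rows
deduped = demo.drop_duplicates(subset=compare_cols, keep="first")

# keep=False marks every member of a duplicate group, which is handy for inspection
all_copies = demo[demo.duplicated(subset=compare_cols, keep=False)]

print(full_row_dupes)
print(dupes_ignoring_id.tolist())
print(all_copies)
```

Inspecting `all_copies` before dropping anything is a cheap way to confirm that the flagged rows are genuine repeats.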


## **4. Applying Functions to Data**
- **.apply(function):** On a DataFrame, applies a function along an axis (rows or columns); on a Series, applies it to each value. Often used with lambda functions for custom transformations.

In [10]:
# Let's create a new column 'Salary Grade' based on salary
def assign_salary_grade(salary):
    if salary > 80000:
        return 'High'
    elif salary > 60000:
        return 'Medium'
    else:
        return 'Low'

# Apply this function to the 'Salary' column
df_cleaned['Salary Grade'] = df_cleaned['Salary'].apply(assign_salary_grade)

# Using a lambda function for a simpler task
df_cleaned['Department Lowercase'] = df_cleaned['Department'].apply(lambda dept: dept.lower())

print("\n--- DataFrame with new columns from .apply() ---")
df_cleaned.head()


--- DataFrame with new columns from .apply() ---


,Employee ID,Department,Age,Gender,Salary,Start Date,Last Login,Days Since Last Login,Salary Grade,Department Lowercase
0,E001,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00:00,608.0,Medium,marketing
1,E002,Sales,28.0,Male,55000.0,2021-03-20,2023-11-02 11:30:00,607.0,Low,sales
2,E003,IT,35.111111,Male,90000.0,2019-07-11,2023-10-30 09:00:00,611.0,High,it
3,E004,HR,45.0,Female,65000.0,2018-05-25,NaT,,Medium,hr
4,E005,Sales,32.0,Male,58000.0,2022-09-01,2023-11-02 14:00:00,607.0,Low,sales
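
One thing to watch with `.apply()` is how the function treats missing values. The sketch below reuses `assign_salary_grade` from the cell above on a made-up Series that contains a `NaN`, and compares it with two vectorized alternatives, `pd.cut` and `.str.lower()`. These are offered as possible substitutes under the same grade boundaries, not as the required approach.

```python
import numpy as np
import pandas as pd

def assign_salary_grade(salary):
    if salary > 80000:
        return 'High'
    elif salary > 60000:
        return 'Medium'
    else:
        return 'Low'

# Hypothetical salaries, including one missing value
salaries = pd.Series([70000.0, 55000.0, 90000.0, np.nan])

# NaN compares False against any number, so a missing salary falls through to 'Low'
grades_apply = salaries.apply(assign_salary_grade)

# pd.cut bins the whole column at once and leaves a missing salary as NaN.
# Intervals are right-inclusive by default: (-inf, 60000], (60000, 80000], (80000, inf)
grades_cut = pd.cut(
    salaries,
    bins=[-np.inf, 60000, 80000, np.inf],
    labels=['Low', 'Medium', 'High'],
)

# .str.lower() skips missing values, whereas lambda dept: dept.lower() would raise on None
departments = pd.Series(['Marketing', 'IT', None])
dept_lower = departments.str.lower()

print(grades_apply.tolist())
print(grades_cut.tolist())
print(dept_lower.tolist())
```

In this notebook the salaries were filled before grading, so the difference does not show up, but it matters whenever `.apply()` runs on a column that still has gaps.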


## **Exercises**

**1. Missing Value Imputation:**
- Load the messy_hr_data.csv again into a new DataFrame called df_ex1.
- For the Salary column, fill the missing values with the mean salary of their respective Department. (This is more advanced! Hint: Use df.groupby('Department')['Salary'].transform('mean')).
- For the Age column, fill the missing values with the median age of the entire dataset.
- Print the first 7 rows of the modified DataFrame and the count of missing values to confirm your changes.

In [32]:
df_ex1 = pd.read_csv("messy_hr_data.csv")

# Per-row mean salary of each employee's department, aligned to df_ex1's index
mean_salary = df_ex1.groupby('Department')['Salary'].transform('mean')
df_ex1['Salary'] = df_ex1['Salary'].fillna(mean_salary)

# Median age across the whole dataset
median_age = df_ex1['Age'].median()
df_ex1['Age'] = df_ex1['Age'].fillna(median_age)

print("Missing Values -----\n")
print(df_ex1.isnull().sum())
df_ex1.head(7)

Missing Values -----

Employee ID    0
Department     0
Age            0
Gender         1
Salary         0
Start Date     0
Last Login     2
dtype: int64


,Employee ID,Department,Age,Gender,Salary,Start Date,Last Login
0,E001,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00
1,E002,Sales,28.0,,55000.0,2021-03-20,2023-11-02 11:30
2,E003,IT,35.0,Male,90000.0,2019-07-11,2023-10-30 09:00
3,E004,HR,45.0,Female,65000.0,2018-05-25,
4,E005,Sales,32.0,Male,58000.0,2022-09-01,2023-11-02 14:00
5,E006,IT,41.0,Male,92500.0,2017-11-14,2023-11-01 15:10
6,E007,Marketing,29.0,Female,72000.0,2022-02-10,2023-11-02 08:45
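
The solution above relies on `transform('mean')` returning one value per original row rather than one value per group. A minimal sketch of that difference on a made-up frame (`demo` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing IT salary
demo = pd.DataFrame({
    "Department": ["IT", "Sales", "IT", "IT", "Sales"],
    "Salary": [90000.0, 55000.0, np.nan, 95000.0, 58000.0],
})

# .mean() collapses to one row per department
per_group = demo.groupby("Department")["Salary"].mean()

# transform('mean') broadcasts each department's mean back to every row,
# keeping the original index so it lines up with demo["Salary"]
per_row = demo.groupby("Department")["Salary"].transform("mean")

# fillna with a Series fills each gap from the value at the same index
demo["Salary"] = demo["Salary"].fillna(per_row)

print(per_group)
print(per_row.tolist())
print(demo)
```

Because `per_row` shares the frame's index, `fillna` only touches the rows that were missing and leaves the known salaries unchanged.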


**2. Data Type Conversion and Feature Creation:**
- Work with the df_cleaned DataFrame from the examples (duplicates dropped; Age, Gender, and Salary filled).
- Every Employee ID starts with the letter "E". Create a new column called Numeric ID that contains only the integer part of the Employee ID. (Hint: Use the .str accessor and slicing, like .str[1:], then use .astype(int)).
- Calculate the number of years an employee has been with the company and store it in a new column called Years of Service. (Hint: (pd.to_datetime('now') - df['Start Date']).dt.days / 365.25).
- Print the head of the DataFrame with these two new columns.

In [33]:
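# Strip the leading "E" and cast the remaining digits to int
df_cleaned['Numeric ID'] = df_cleaned['Employee ID'].str[1:].astype(int)
# Elapsed days since the start date, converted to years (the result depends on the run date)
df_cleaned['Years of Service'] = (pd.to_datetime('now') - df_cleaned['Start Date']).dt.days / 365.25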

df_cleaned.head()

,Employee ID,Department,Age,Gender,Salary,Start Date,Last Login,Days Since Last Login,Salary Grade,Department Lowercase,Numeric ID,Years of Service
0,E001,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00:00,608.0,Medium,marketing,1,5.462012
1,E002,Sales,28.0,Male,55000.0,2021-03-20,2023-11-02 11:30:00,607.0,Low,sales,2,4.284736
2,E003,IT,35.111111,Male,90000.0,2019-07-11,2023-10-30 09:00:00,611.0,High,it,3,5.976728
3,E004,HR,45.0,Female,65000.0,2018-05-25,NaT,,Medium,hr,4,7.104723
4,E005,Sales,32.0,Male,58000.0,2022-09-01,2023-11-02 14:00:00,607.0,Low,sales,5,2.833676


**3. Find and Remove Outliers (Simple Method):**
- An "outlier" can be defined as a value that is significantly different from other observations. A common way to find them is using the Interquartile Range (IQR).
- Let's focus on the Salary column of df_cleaned.
- Step 1: Calculate the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile) of the Salary. (Hint: df['Salary'].quantile(0.25)).
- Step 2: Calculate the IQR: IQR = Q3 - Q1.
- Step 3: Define the outlier boundaries: lower_bound = Q1 - 1.5 * IQR and upper_bound = Q3 + 1.5 * IQR.
- Step 4: Use boolean masking to select and print the rows of the DataFrame that are considered "outliers" (i.e., salary is less than lower_bound or greater than upper_bound).
- Step 5: Create a new DataFrame that excludes these outliers. Print its shape.

In [34]:
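# Quartiles and interquartile range of Salary
q1 = df_cleaned['Salary'].quantile(0.25)
q3 = df_cleaned['Salary'].quantile(0.75)
iqr = q3 - q1

# Fences: anything more than 1.5 * IQR beyond the quartiles counts as an outlier
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)

# Boolean mask: salary below the lower fence or above the upper fence
outliers = df_cleaned.loc[(df_cleaned['Salary'] < lower_bound) | (df_cleaned['Salary'] > upper_bound)]
outliers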

,Employee ID,Department,Age,Gender,Salary,Start Date,Last Login,Days Since Last Login,Salary Grade,Department Lowercase,Numeric ID,Years of Service
2,E003,IT,35.111111,Male,90000.0,2019-07-11,2023-10-30 09:00:00,611.0,High,it,3,5.976728
7,E008,IT,38.0,Male,95000.0,2019-08-01,2023-10-29 17:00:00,611.0,High,it,8,5.919233


In [35]:
df_no_outliers = df_cleaned.loc[(df_cleaned['Salary'] >= lower_bound) & (df_cleaned['Salary'] <= upper_bound)]
df_no_outliers

,Employee ID,Department,Age,Gender,Salary,Start Date,Last Login,Days Since Last Login,Salary Grade,Department Lowercase,Numeric ID,Years of Service
0,E001,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00:00,608.0,Medium,marketing,1,5.462012
1,E002,Sales,28.0,Male,55000.0,2021-03-20,2023-11-02 11:30:00,607.0,Low,sales,2,4.284736
3,E004,HR,45.0,Female,65000.0,2018-05-25,NaT,,Medium,hr,4,7.104723
4,E005,Sales,32.0,Male,58000.0,2022-09-01,2023-11-02 14:00:00,607.0,Low,sales,5,2.833676
5,E006,IT,41.0,Male,70000.0,2017-11-14,2023-11-01 15:10:00,608.0,Medium,it,6,7.63039
6,E007,Marketing,29.0,Female,72000.0,2022-02-10,2023-11-02 08:45:00,608.0,Medium,marketing,7,3.389459
8,E009,Sales,33.0,Female,60000.0,2021-10-05,NaT,,Low,sales,9,3.739904
9,E010,Marketing,35.0,Male,70000.0,2020-01-15,2023-11-01 10:00:00,608.0,Medium,marketing,10,5.462012


In [36]:
print("\nShape of the DataFrame before removing outliers:", df_cleaned.shape)
print("Shape of the DataFrame after removing outliers:", df_no_outliers.shape)


Shape of the DataFrame before removing outliers: (10, 12)
Shape of the DataFrame after removing outliers: (8, 12)
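
To make the fences concrete, the sketch below repeats the IQR steps on the ten Salary values shown in df_cleaned, wrapped in a small helper. The function name `iqr_bounds` and its `k` parameter are assumptions introduced for this sketch. With pandas' default linear interpolation the quartiles come out as Q1 = 61250 and Q3 = 71500, so IQR = 10250 and the fences are 45875 and 86875.

```python
import pandas as pd

# Salary values as they appear in df_cleaned above
salary = pd.Series(
    [70000.0, 55000.0, 90000.0, 65000.0, 58000.0,
     70000.0, 72000.0, 95000.0, 60000.0, 70000.0],
    name="Salary",
)

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower_bound, upper_bound = iqr_bounds(salary)

# between() is inclusive on both ends by default, matching the >= and <= mask
is_inlier = salary.between(lower_bound, upper_bound)
outlier_values = salary[~is_inlier]

print(lower_bound, upper_bound)
print(outlier_values.tolist())
```

Only the 90000 and 95000 salaries fall above the upper fence, which matches the two IT rows flagged in the solution. With just ten rows the quartiles are sensitive to single values, so treat the result as a screening step rather than proof that those salaries are wrong.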
