# Week 6: Advanced Pandas - Duplicates, Apply and Regex

This week, we'll explore some more advanced pandas techniques that will make you much more effective at data cleaning and analysis.

We'll cover three main topics:
1. **Removing duplicates** - How to find and handle duplicate data intelligently
2. **The `.apply()` method** - How to create custom functions and apply them to your data
3. **Regular expressions (regex)** - How to find and extract patterns from text

These skills are essential for real-world data work!

## Load and Explore Our Dataset

Let's start by loading our employee dataset and taking a look at what we're working with.

In [None]:
import pandas as pd
import numpy as np
import re

# Load the dataset
df = pd.read_csv('employee_reviews_clean.csv')

print(f'Shape of the DataFrame: {df.shape}')
df.head()

Let's also look at the data types and get a sense of what each column contains.

In [None]:
print('Data types:')
print(df.dtypes)
print('\nSample of each column:')
for col in df.columns:
    print(f'{col}: {df[col].iloc[0]}')

## 1. Finding and Removing Duplicate Data

In real-world datasets, you often encounter duplicate records. This can happen when data is collected from multiple sources, or when the same person is entered into a system multiple times. Let's learn how to handle this intelligently.

### Step 1: Detecting Duplicates

First, let's see if we have any duplicate rows in our dataset.

In [None]:
# Check for completely identical rows
print(f'Total rows: {len(df)}')
print(f'Duplicate rows (completely identical): {df.duplicated().sum()}')
print(f'Unique rows: {len(df.drop_duplicates())}')
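
To see exactly which rows `duplicated()` flags, here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical, not taken from our employee file). By default, the first copy of a repeated row is not counted as a duplicate; only the later copies are.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
toy = pd.DataFrame({
    'employee_id': ['A1', 'B2', 'A1', 'C3'],
    'salary': [50000, 60000, 50000, 70000],
})

# Default keep='first': only the later copy is flagged
flags_default = toy.duplicated()
print(flags_default.tolist())   # [False, False, True, False]

# keep=False flags every row that has an identical twin
flags_all = toy.duplicated(keep=False)
print(flags_all.tolist())       # [True, False, True, False]
```

This is why `df.duplicated().sum()` counts the "extra" copies rather than every row involved in a duplicate.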

That shows us duplicates across ALL columns. But often, we care about duplicates in specific columns. Let's check for duplicate employee IDs and names.

In [None]:
# Check for duplicates based on specific columns
print('Duplicates based on employee_id:')
print(f'Duplicate employee_ids: {df.duplicated(subset=["employee_id"]).sum()}')

print('\nDuplicates based on full_name:')
print(f'Duplicate names: {df.duplicated(subset=["full_name"]).sum()}')

### Step 2: Examining the Duplicate Records

Let's look at the actual duplicate records to understand what we're dealing with.

In [None]:
# Show all rows where employee_id is duplicated
duplicate_ids = df[df.duplicated(subset=['employee_id'], keep=False)]
print('Rows with duplicate employee IDs:')
duplicate_ids.sort_values('employee_id')

Notice that Alice Smith and Bob Jones appear twice, but with different information (salary, start date, email). This is a common scenario - we need to decide which record to keep.

### Step 3: Simple Duplicate Removal

The simplest approach is to just keep the first occurrence of each duplicate.

In [None]:
# Keep the first occurrence of each employee_id
df_simple = df.drop_duplicates(subset=['employee_id'], keep='first')
print(f'After removing duplicates (keep first): {len(df_simple)} rows')
print('\nWhich Alice Smith record did we keep?')
df_simple[df_simple['full_name'] == 'Alice Smith'][['full_name', 'salary', 'start_date']]

### Step 4: Controlled Duplicate Removal with Sorting

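Often, we want to keep a specific record - like the most recent one, or the one with the highest salary. We can do this by sorting first, then removing duplicates with `keep='first'`. Note that sorting `start_date` only gives chronological order if the column holds real datetimes or ISO-formatted strings (YYYY-MM-DD).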

**Example 1: Keep the Most Recent Record**

In [None]:
# Sort by start_date (most recent first) then remove duplicates
df_recent = df.sort_values('start_date', ascending=False).drop_duplicates(subset=['employee_id'], keep='first')
print('Keeping the most recent record for each employee:')
recent_duplicates = df_recent[df_recent['full_name'].isin(['Alice Smith', 'Bob Jones'])]
recent_duplicates[['full_name', 'salary', 'start_date']]

**Example 2: Keep the Highest Paid Record**

In [None]:
# Sort by salary (highest first) then remove duplicates
df_highest_paid = df.sort_values('salary', ascending=False).drop_duplicates(subset=['employee_id'], keep='first')
print('Keeping the highest paid record for each employee:')
highest_paid_duplicates = df_highest_paid[df_highest_paid['full_name'].isin(['Alice Smith', 'Bob Jones'])]
highest_paid_duplicates[['full_name', 'salary', 'start_date']]
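
The sort-then-drop idea can be sketched on a small made-up DataFrame (the `toy` values below are hypothetical). This version converts `start_date` to real datetimes before sorting, and also shows one possible alternative for the "highest paid" case using `groupby` with `idxmax`. Treat it as an illustration, not the only way to do this.

```python
import pandas as pd

# Hypothetical toy data: each employee_id appears twice
toy = pd.DataFrame({
    'employee_id': ['A1', 'B2', 'A1', 'B2'],
    'salary': [50000, 60000, 55000, 58000],
    'start_date': ['2020-01-15', '2019-06-01', '2022-03-10', '2021-09-30'],
})

# Convert to real datetimes so sorting is chronological
toy['start_date'] = pd.to_datetime(toy['start_date'])

# Most recent record per employee: sort newest first, keep the first of each id
most_recent = (
    toy.sort_values('start_date', ascending=False)
       .drop_duplicates(subset=['employee_id'], keep='first')
       .sort_values('employee_id')
)
print(most_recent['salary'].tolist())   # [55000, 58000]

# Alternative: look up the row label of the largest salary within each id
highest_paid = toy.loc[toy.groupby('employee_id')['salary'].idxmax()]
print(highest_paid['salary'].tolist())  # [55000, 60000]
```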

### Step 5: Different Strategies for Handling Duplicates

The `keep` parameter gives us different options for handling duplicates.

In [None]:
print('Different strategies for handling duplicates:\n')

# Keep first occurrence
df_keep_first = df.drop_duplicates(subset=['employee_id'], keep='first')
print(f'Keep first: {len(df_keep_first)} rows')

# Keep last occurrence
df_keep_last = df.drop_duplicates(subset=['employee_id'], keep='last')
print(f'Keep last: {len(df_keep_last)} rows')

# Remove all duplicates (keep none)
df_remove_all = df.drop_duplicates(subset=['employee_id'], keep=False)
print(f'Remove all duplicates: {len(df_remove_all)} rows')
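
Row counts alone don't show which rows survive. As a small sketch on hypothetical data, printing the surviving index labels makes the difference between the three `keep` options visible:

```python
import pandas as pd

# Hypothetical toy data: employee_id 'A1' appears at index 0 and index 2
toy = pd.DataFrame({
    'employee_id': ['A1', 'B2', 'A1', 'C3'],
    'salary': [50000, 60000, 55000, 70000],
})

kept_first = toy.drop_duplicates(subset=['employee_id'], keep='first')
kept_last = toy.drop_duplicates(subset=['employee_id'], keep='last')
kept_none = toy.drop_duplicates(subset=['employee_id'], keep=False)

print(kept_first.index.tolist())  # [0, 1, 3]
print(kept_last.index.tolist())   # [1, 2, 3]
print(kept_none.index.tolist())   # [1, 3]
```

Notice that `keep='first'` and `keep='last'` return the same number of rows, but not the same rows.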

For the rest of our exercises, let's work with clean data (no duplicates). We'll keep the most recent record for each employee.

In [None]:
# Create our clean dataset for the rest of the notebook
df_clean = df.sort_values('start_date', ascending=False).drop_duplicates(subset=['employee_id'], keep='first').copy()
print(f'Clean dataset: {len(df_clean)} unique employees')
df_clean.head()

## 2. The `.apply()` Method - Custom Data Transformations

The `.apply()` method is one of the most powerful tools in pandas. It allows you to apply any function to your data - whether it's a built-in function or one you create yourself.

### Using `.apply()` with Built-in Functions

Let's start simple. We can use `.apply()` with functions that Python already provides.

**Example: Getting the Length of Names**

In [None]:
# Get the length of each employee's name
df_clean['name_length'] = df_clean['full_name'].apply(len)
print('Name lengths:')
df_clean[['full_name', 'name_length']].head()

**Example: Converting Text to Uppercase**

In [None]:
# Convert department names to uppercase
df_clean['department_upper'] = df_clean['department'].apply(str.upper)
print('Department names in uppercase:')
df_clean[['department', 'department_upper']].head()
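
For simple text operations like these, pandas also offers built-in `.str` methods that do the same job without `.apply()`. One practical difference is that they skip missing values, whereas `len` or `str.upper` raise an error when they hit a `NaN`. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical names, including one missing value
names = pd.Series(['Alice Smith', np.nan, 'Bob Jones'])

# .str methods leave the missing value as NaN instead of raising an error
lengths = names.str.len()
upper = names.str.upper()

print(lengths.iloc[0], lengths.iloc[2])  # lengths of the two real names
print(upper.iloc[0])                     # ALICE SMITH
```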

### Creating Your Own Functions

The real power of `.apply()` comes when you create your own functions to solve specific problems.

**Example: Categorizing Salaries**

Let's create a function that puts salaries into categories: Low, Medium, or High.

In [None]:
def categorize_salary(salary):
    """Put salaries into Low, Medium, or High categories"""
    if salary >= 100000:
        return 'High'
    elif salary >= 75000:
        return 'Medium'
    else:
        return 'Low'

# Apply our function to the salary column
df_clean['salary_category'] = df_clean['salary'].apply(categorize_salary)
print('Salary categories:')
df_clean[['full_name', 'salary', 'salary_category']].head()
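
For binning numbers into categories, `pd.cut` is a possible alternative to a custom function. The sketch below reproduces the same thresholds on some made-up salaries. Setting `right=False` makes each bin include its lower edge, which matches the `>=` comparisons in `categorize_salary`.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries, chosen to sit on and around the thresholds
salaries = pd.Series([60000, 75000, 99999, 100000, 150000])

categories = pd.cut(
    salaries,
    bins=[-np.inf, 75000, 100000, np.inf],
    labels=['Low', 'Medium', 'High'],
    right=False,  # bins are [low, high), so 75000 counts as Medium
)
print(categories.tolist())  # ['Low', 'Medium', 'Medium', 'High', 'High']
```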

**Example: Extracting Email Domains**

Let's create a function to extract the domain part of email addresses (the part after @).

In [None]:
def get_email_domain(email):
    """Extract the domain from an email address"""
    return email.split('@')[1]

# Apply the function to extract domains
df_clean['email_domain'] = df_clean['email'].apply(get_email_domain)
print('Email domains:')
df_clean[['email', 'email_domain']].head()
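
One thing to watch: `email.split('@')[1]` raises an error if a value has no `@` in it, or if the value is missing. If your data might contain such values, a more forgiving sketch is to use `.str.extract()`, which returns `NaN` where nothing matches. The sample emails below are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical emails: one valid, one malformed, one missing
emails = pd.Series(['alice@example.com', 'not-an-email', np.nan])

# expand=False returns a Series; rows without a match become NaN
domains = emails.str.extract(r'@(.+)$', expand=False)
print(domains.iloc[0])  # example.com
```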

### Using Multiple Columns in Your Functions

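Sometimes you need to use information from multiple columns. To do this, call `.apply()` on the whole DataFrame with `axis=1`. Your function then receives one row at a time and can read any column by name, such as `row['salary']`.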

**Example: Calculating Years of Experience**

In [None]:
def calculate_years_experience(row):
    """Calculate years of experience based on start date"""
    start_date = pd.to_datetime(row['start_date'])
    current_date = pd.to_datetime('2024-01-01')  # Assuming current date
    years = (current_date - start_date).days / 365.25
    return round(years, 1)

# Apply the function to each row
df_clean['years_experience'] = df_clean.apply(calculate_years_experience, axis=1)
print('Years of experience:')
df_clean[['full_name', 'start_date', 'years_experience']].head()
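
Because this calculation only needs one column, it can also be written without a row-wise `.apply()`. Here is a sketch of the same arithmetic done on the whole column at once, using made-up dates and the same assumed reference date of 2024-01-01:

```python
import pandas as pd

# Hypothetical start dates
toy = pd.DataFrame({'start_date': ['2020-01-01', '2014-01-01']})

start = pd.to_datetime(toy['start_date'])
reference_date = pd.Timestamp('2024-01-01')  # same assumed "current date" as above

# Subtracting gives a column of time differences; .dt.days turns them into day counts
toy['years_experience'] = ((reference_date - start).dt.days / 365.25).round(1)
print(toy['years_experience'].tolist())  # [4.0, 10.0]
```

On large datasets the column-wise version is usually much faster, since the dates are parsed once rather than once per row.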

**Example: Creating Display Names**

In [None]:
def create_display_name(row):
    """Create a display name that includes the department"""
    name = row['full_name']
    dept = row['department']
    return f"{name} ({dept})"

# Apply the function to create display names
df_clean['display_name'] = df_clean.apply(create_display_name, axis=1)
print('Display names:')
df_clean[['display_name']].head()

### Using Functions with Extra Parameters

Sometimes your function needs additional information beyond what's in the DataFrame. Any extra keyword arguments you give to `.apply()` are passed straight through to your function.

**Example: Calculating Bonuses with Different Rates**

In [None]:
def calculate_bonus(row, base_rate, high_performer_bonus):
    """Calculate bonus based on salary and performance"""
    base_bonus = row['salary'] * base_rate
    
    # Give extra bonus to high performers
    if 'Exceeds' in row['performance_review']:
        base_bonus = base_bonus * high_performer_bonus
    
    return round(base_bonus, 2)

# Apply the function with custom parameters
df_clean['bonus'] = df_clean.apply(
    calculate_bonus, 
    axis=1, 
    base_rate=0.05,  # 5% base bonus rate
    high_performer_bonus=1.5  # 50% extra for high performers
)

print('Bonuses calculated:')
df_clean[['full_name', 'salary', 'performance_review', 'bonus']].head()
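
Extra values can also be passed by position using the `args` parameter of `.apply()`. The sketch below uses a hypothetical helper, `add_raise`, on a made-up Series to show that both styles give the same result.

```python
import pandas as pd

def add_raise(salary, rate, floor):
    """Hypothetical helper: raise a salary by `rate`, but by at least `floor`."""
    return salary + max(salary * rate, floor)

# Hypothetical salaries
salaries = pd.Series([50000, 120000])

# Extra values can go by position via args=(...) or by name as keywords
by_position = salaries.apply(add_raise, args=(0.03, 2000))
by_keyword = salaries.apply(add_raise, rate=0.03, floor=2000)

print(by_position.tolist() == by_keyword.tolist())  # True
```

Keyword arguments are usually easier to read, because the name of each value is visible at the call site.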

### Practice Exercises - Apply Method

#### Exercise 1: Extract First Names

Create a function that extracts just the first name from the `full_name` column. Apply it to create a new column called `first_name`.

In [None]:
# Your solution here


#### Exercise 2: Create Employee Codes

Create a function that takes a row and generates an employee code using the first 2 letters of the department and the last 2 digits of the start year. Apply this to create an `emp_code` column.

In [None]:
# Your solution here


#### Exercise 3: Salary Adjustment Calculator

Create a function that adjusts salaries based on years of experience. Use these parameters: `base_adjustment=0.02` (2% base increase) and `experience_bonus=0.01` (1% per year of experience). Apply this function.

In [None]:
# Your solution here


## 3. Regular Expressions (Regex) - Finding Patterns in Text

Regular expressions (regex) are a powerful way to find patterns in text. They might look scary at first, but once you understand the basics, they become incredibly useful for data cleaning and extraction.

### What is Regex?

Regex is like a search pattern that can find specific combinations of characters in text. For example, you could use regex to find all phone numbers, email addresses, or ID codes in a dataset.

### Basic Regex Building Blocks

Let's start with the fundamental pieces that make up regex patterns:

- `\d` = any digit (0-9)
- `\w` = any word character (letters, numbers, underscore)
- `[A-Z]` = any uppercase letter
- `[a-z]` = any lowercase letter
- `[0-9]` = any digit from 0 to 9 (same as `\d` for plain English text)
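
As a quick illustration, here is each building block applied to one short made-up string:

```python
import re

text = "Room_7b"  # hypothetical sample string

digits = re.findall(r'\d', text)
word_chars = re.findall(r'\w', text)
upper = re.findall(r'[A-Z]', text)
lower = re.findall(r'[a-z]', text)

print(digits)      # ['7']
print(word_chars)  # ['R', 'o', 'o', 'm', '_', '7', 'b']
print(upper)       # ['R']
print(lower)       # ['o', 'o', 'm', 'b']
```

Notice that `\w` picks up the underscore as well as the letters and the digit.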

In [None]:
# Let's practice with some simple text examples
sample_text = "Employee ID: ABC123, Phone: 555-1234, Date: 2023-12-25"
print(f'Sample text: {sample_text}')

### Finding Single Characters

Let's start by finding individual characters in our sample text.

In [None]:
# Find all digits in the text
digits = re.findall(r'\d', sample_text)
print(f'All digits found: {digits}')

# Find all uppercase letters
uppercase = re.findall(r'[A-Z]', sample_text)
print(f'All uppercase letters: {uppercase}')

### Using Quantifiers - How Many Characters?

Quantifiers tell regex how many characters to match:

- `+` = one or more
- `*` = zero or more
- `{3}` = exactly 3
- `{2,4}` = between 2 and 4

In [None]:
# Find groups of digits (not just individual ones)
digit_groups = re.findall(r'\d+', sample_text)
print(f'Groups of digits: {digit_groups}')

# Find runs of 3 uppercase letters (a longer run like ABCD would also yield ABC)
three_letters = re.findall(r'[A-Z]{3}', sample_text)
print(f'Exactly 3 uppercase letters: {three_letters}')
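
The other two quantifiers, `*` and `{2,4}`, behave as sketched below on a made-up string. Quantifiers are "greedy": they grab as many characters as the pattern allows.

```python
import re

text = "a1 b22 c333 d55555 e"  # hypothetical sample string

# + needs at least one digit after the letter, so the lone 'e' is skipped
one_or_more = re.findall(r'[a-z]\d+', text)
print(one_or_more)   # ['a1', 'b22', 'c333', 'd55555']

# * allows zero digits, so the lone 'e' matches too
zero_or_more = re.findall(r'[a-z]\d*', text)
print(zero_or_more)  # ['a1', 'b22', 'c333', 'd55555', 'e']

# {2,4} takes 2 to 4 digits; from '55555' it grabs the first four
two_to_four = re.findall(r'\d{2,4}', text)
print(two_to_four)   # ['22', '333', '5555']
```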

### Building More Complex Patterns

Now let's combine these pieces to find specific patterns.

**Example: Finding Phone Numbers**

Pattern: 3 digits, dash, 4 digits

In [None]:
phone_text = "Call me at 555-1234 or 555-5678"
print(f'Text: {phone_text}')

# Pattern: 3 digits, dash, 4 digits
phone_pattern = r'\d{3}-\d{4}'
phones = re.findall(phone_pattern, phone_text)
print(f'Phone numbers found: {phones}')

**Example: Finding Date Patterns**

Pattern: 4 digits, dash, 2 digits, dash, 2 digits

In [None]:
date_text = "Important dates: 2023-12-25 and 2024-01-15"
print(f'Text: {date_text}')

# Pattern: 4 digits, dash, 2 digits, dash, 2 digits
date_pattern = r'\d{4}-\d{2}-\d{2}'
dates = re.findall(date_pattern, date_text)
print(f'Dates found: {dates}')

**Example: Finding Mixed Letter-Number Codes**

Pattern: 3 uppercase letters followed by 3 digits

In [None]:
code_text = "Product codes: ABC123, XYZ789, DEF456"
print(f'Text: {code_text}')

# Pattern: 3 uppercase letters followed by 3 digits
code_pattern = r'[A-Z]{3}\d{3}'
codes = re.findall(code_pattern, code_text)
print(f'Product codes found: {codes}')
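
One more building block is worth knowing before we move to pandas: parentheses create a "capture group", which marks the part of the match you actually want back. A short sketch with the same product codes shows how the result of `re.findall` changes as groups are added:

```python
import re

code_text = "Product codes: ABC123, XYZ789, DEF456"

# No group: findall returns the whole match
whole = re.findall(r'[A-Z]{3}\d{3}', code_text)
print(whole)         # ['ABC123', 'XYZ789', 'DEF456']

# One group: findall returns only the part inside the parentheses
letters_only = re.findall(r'([A-Z]{3})\d{3}', code_text)
print(letters_only)  # ['ABC', 'XYZ', 'DEF']

# Two groups: findall returns one tuple per match
pairs = re.findall(r'([A-Z]{3})(\d{3})', code_text)
print(pairs)         # [('ABC', '123'), ('XYZ', '789'), ('DEF', '456')]
```

The pandas method `.str.extract()` relies on the same idea: it returns whatever the capture groups in your pattern match.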

### Using Regex with Our Employee Data

Now let's apply regex to extract information from our employee dataset. Our employee IDs follow the pattern DEPARTMENT + YEAR + LASTNAME.

**Example 1: Extracting Department Codes**

We want to extract the letters at the beginning of each employee ID.

In [None]:
# Look at a few employee IDs to understand the pattern
print('Sample employee IDs:')
print(df_clean['employee_id'].head().tolist())

# Extract the department code (letters at the beginning)
# The ^ anchor makes sure we only match at the start of the ID
df_clean['dept_code'] = df_clean['employee_id'].str.extract(r'^([A-Z]+)')
print('\nExtracted department codes:')
df_clean[['employee_id', 'dept_code']].head()

**Example 2: Extracting Years**

Now let's extract the 4-digit year from each employee ID.

In [None]:
# Extract the 4-digit year from employee IDs
df_clean['id_year'] = df_clean['employee_id'].str.extract(r'(\d{4})')
print('Extracted years from employee IDs:')
df_clean[['employee_id', 'id_year']].head()

**Example 3: Extracting Last Names**

Finally, let's extract the last name (letters at the end of the ID).

In [None]:
# Extract the last name (letters at the end)
df_clean['id_lastname'] = df_clean['employee_id'].str.extract(r'([A-Z]+)$')
print('Extracted last names from employee IDs:')
df_clean[['employee_id', 'id_lastname']].head()
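
The three pieces can also be pulled out in a single pass by giving each capture group a name with `(?P<name>...)`. `.str.extract()` then uses those names as column names. The IDs below are made up to follow the DEPARTMENT + YEAR + LASTNAME pattern and assume all-uppercase letters; the real IDs in the dataset may differ.

```python
import pandas as pd

# Hypothetical IDs following DEPARTMENT + YEAR + LASTNAME, plus one bad value
ids = pd.Series(['ENG2021SMITH', 'HR2019JONES', 'bad-id'])

# Named groups become column names; ^ and $ require the whole ID to fit the pattern
parts = ids.str.extract(
    r'^(?P<dept_code>[A-Z]+)(?P<id_year>\d{4})(?P<id_lastname>[A-Z]+)$'
)
print(parts.loc[0].tolist())  # ['ENG', '2021', 'SMITH']
```

IDs that don't fit the pattern, like `'bad-id'`, come back as `NaN` in every column, which makes them easy to spot.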

### Working with Different Phone Number Formats

Our dataset has phone numbers in different formats. Let's use regex to work with them.

In [None]:
# Look at the different phone formats in our data
print('Different phone number formats:')
print(df_clean['phone'].unique()[:10])

**Extracting Area Codes**

Let's extract the first 3 digits (area code) from any phone format.

In [None]:
# Extract the first 3 digits (area code) from any phone format
df_clean['area_code'] = df_clean['phone'].str.extract(r'(\d{3})')
print('Extracted area codes:')
df_clean[['phone', 'area_code']].head()

**Identifying Phone Format Types**

Let's create a function that identifies which format each phone number uses.

In [None]:
def identify_phone_format(phone):
    """Identify the format of a phone number"""
    if re.match(r'\d{3}-\d{3}-\d{4}', phone):
        return 'dash'
    elif re.match(r'\(\d{3}\) \d{3}-\d{4}', phone):
        return 'parentheses'
    elif re.match(r'\d{3}\.\d{3}\.\d{4}', phone):
        return 'dot'
    else:
        return 'other'

df_clean['phone_format'] = df_clean['phone'].apply(identify_phone_format)
print('Phone format types:')
df_clean[['phone', 'phone_format']].head(10)
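
A detail worth knowing about `re.match`: it only checks that the pattern fits at the start of the string, so any extra characters at the end are ignored. If you need the whole string to fit the pattern, `re.fullmatch` is the stricter option. A small sketch with a made-up phone number that has one digit too many:

```python
import re

pattern = r'\d{3}-\d{3}-\d{4}'
too_long = '555-123-45678'  # hypothetical value with an extra digit

# match: only the start of the string has to fit
starts_ok = re.match(pattern, too_long) is not None
print(starts_ok)  # True

# fullmatch: the entire string has to fit
whole_ok = re.fullmatch(pattern, too_long) is not None
print(whole_ok)   # False
```

Which one you want depends on the task: `re.match` is more forgiving, while `re.fullmatch` is safer for validation.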

### Practice Exercises - Regular Expressions

#### Exercise 1: Email Validation

Create a regex pattern to check if email addresses are valid. A valid email should have: letters/numbers/dots before @, followed by a domain name, followed by .com

In [None]:
# Your solution here


#### Exercise 2: Performance Review Analysis

Extract both the quarter (Q1, Q2, etc.) and year from the performance_review column using regex.

In [None]:
# Your solution here
# Extract quarter and year separately


#### Exercise 3: Creating Standardised Phone Numbers

Use regex to extract all the digits from phone numbers and create a standardised format (XXX-XXX-XXXX).

In [None]:
# Your solution here


## Summary

You've learned some powerful advanced pandas techniques:

### Drop Duplicates
- How to find duplicate records in your data
- Different strategies for handling duplicates (keep first, last, or remove all)
- How to use sorting to control which duplicate records to keep

### Apply Method
- Using built-in functions with `.apply()`
- Creating your own custom functions
- Working with multiple columns using `axis=1`
- Passing additional parameters to your functions

### Regular Expressions
- Basic regex building blocks (`\d`, `\w`, `[A-Z]`)
- Quantifiers for specifying how many characters (`+`, `*`, `{3}`)
- Building complex patterns step by step
- Using regex with pandas to extract information from text columns

These skills will make you much more effective at cleaning and analysing real-world data!