# CME538 - Introduction to Data Science
## Lecture 3.3 - Cleaning Data

---

### Analyzing Canada's Government Travel Expenses

In today’s connected world, where Zoom calls and online collaboration platforms dominate, government officials still seem to have a love for travel. Why spend thousands on trips when technology allows us to have a meeting without ever leaving our desks? It turns out, travel is sometimes unavoidable—whether for meeting key stakeholders, building international relations, or attending summits. However, with taxpayer money at stake, it's important to examine the costs.

This notebook dives deep into the data of Canadian government travel expenses, based on the Proactive Disclosure - Travel Expenses dataset (see [here](https://open.canada.ca/data/en/dataset/009f9a49-c2d9-4d29-a6d4-1a228da335ce/resource/8282db2a-878f-475c-af10-ad56aa8fa72c)).

**Fun Fact:** Did you know that the CEO of a government agency once spent over $14,000 attending a conference in Australia? We’ll be exploring whether this is just the tip of the iceberg.

---

### Lecture 3.3 Overview

As part of CME538's focus on data science, Lecture 3.3 covered a range of important data cleaning techniques. Here's a summary of what was discussed in the lecture and what you'll see applied in this notebook:

- **Type Conversion**: Converting data types to ensure they match the values they represent.
- **Duplicates**: Removing any repeated entries.
- **Missing Data**: Handling gaps in the dataset through removal or imputation.
- **Implausible Data**: Flagging values that seem incorrect or out of bounds.
- **Irrelevant Data**: Removing unnecessary columns to streamline our dataset.
- **Character Encoding**: Fixing encoding issues.
- **Datetime Parsing**: Ensuring dates are formatted consistently.
- **Outliers**: Identifying and potentially removing extreme values.
- **Inconsistent Data Entry**: Standardizing any inconsistent input formats (e.g., addresses).
- **Unit Conversion**: Standardizing any discrepancies between units (like miles vs kilometers).

This notebook will walk through these concepts step by step as we clean and analyze Canada's government travel expense data. Get ready for a mix of interesting insights, and maybe a surprise or two about how tax dollars are being spent!
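
Two items from the list above, type conversion and unit conversion, are easiest to picture on a toy table. The sketch below is only an illustration: the column names `cost` and `distance_miles` are made up and are not assumed to exist in the travel dataset.

```python
import pandas as pd

# Toy data with hypothetical column names; not taken from the travel dataset
toy = pd.DataFrame({
    "cost": ["1,200.50", "300", "n/a"],   # numbers stored as text
    "distance_miles": [10.0, 250.0, 3.5],
})

# Type conversion: strip thousands separators, then coerce to float.
# Unparseable entries such as "n/a" become NaN instead of raising an error.
toy["cost"] = pd.to_numeric(
    toy["cost"].str.replace(",", "", regex=False), errors="coerce"
)

# Unit conversion: 1 mile = 1.609344 km
toy["distance_km"] = toy["distance_miles"] * 1.609344

print(toy.dtypes)
print(toy)
```

Coercing with `errors="coerce"` turns bad entries into missing values, which can then be handled with the same tools as any other gap.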

## Step 0: Set Up the Notebook

Before diving into the data, it's important to set up our working environment. We'll be using Python’s powerful data analysis libraries, primarily `pandas` for data manipulation and `numpy` for numerical operations. Additionally, we’ll configure `pandas` to display every column and up to 100 rows for better visibility, especially when analyzing data later.

Let's get started by importing the necessary libraries and setting up display options for ease of analysis.


In [None]:
# %pip install pandas numpy matplotlib seaborn

# Step 0: Setup the Notebook
import pandas as pd
import numpy as np

# Display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Other necessary imports
import matplotlib.pyplot as plt
import seaborn as sns

## Step 1: Load the Data

In this step, we will load the dataset into our notebook. The dataset consists of travel expenses disclosed by the Canadian government. This data will help us understand government spending on travel and explore potential inefficiencies or unusual expenses.

We will load the CSV file using `pandas` and display the first few rows to familiarize ourselves with its structure and content.


In [None]:
# Step 1: Load the Data
data_file_path = 'travelq.csv'
data = pd.read_csv(data_file_path)

# Display the first few rows of the dataset to get an overview
data.head()


### What's the Data Telling Us?
The dataset contains several columns, including information about the purpose of travel, destination, expenses for airfare, lodging, meals, and other associated costs. This gives us a comprehensive view of travel-related expenses. 

Notice that some columns contain both English and French translations, which is common for Canadian datasets. We’ll need to handle these appropriately when cleaning the data.
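
One possible way to handle the bilingual columns is to keep only the English ones. The sketch below assumes French columns end in `_fr`, mirroring the `_en` suffix seen in names like `destination_en`. That suffix is an assumption, so check `data.columns` before relying on it.

```python
import pandas as pd

# Toy frame; the "_fr" suffix is an assumption about how French columns are named
toy = pd.DataFrame({
    "purpose_en": ["Conference"],
    "purpose_fr": ["Conférence"],
    "destination_en": ["Ottawa"],
    "destination_fr": ["Ottawa"],
    "total": [1250.0],
})

# Keep every column whose name does not end in "_fr"
english_only = toy.loc[:, ~toy.columns.str.endswith("_fr")]
print(english_only.columns.tolist())
```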


## Step 2: Data Exploration and Initial Checks

In this step, we are going to explore the data and get a better understanding of it. We’ll check for the basics like data types, missing values, and any possible anomalies. This is where we lay the groundwork for deeper analysis and cleaning.

#### What Are We Looking For?

- **Data Types:** Are the columns properly formatted (e.g., dates as `datetime`, numbers as `float` or `int`)?
- **Missing Values:** Are there any gaps in the data that we need to handle?
- **Duplicates:** Are there any duplicate rows that are unnecessarily inflating the data?

Time to dive in and see what we’re dealing with!

In [None]:
# Check data types and missing values
data.info()

In [None]:
# More explicit check for missing values
missing_values = data.isnull().sum()
print('Missing values per column:')
print(missing_values)

# Check for duplicates
duplicate_rows = data.duplicated().sum()
print(f'Total duplicate rows: {duplicate_rows}')

### What's Happening?

- **Data Types:** `data.info()` gives a quick overview of the data types, but we might need to adjust some (e.g., converting `start_date` and `end_date` to proper `datetime` types).
- **Missing Values:** Chaining `isnull().sum()` gives an explicit count of missing values for each column.
- **Duplicates:** We can see the number of duplicate rows, which we’ll deal with shortly if needed.

Getting cleaner data is like prepping for a road trip — you don’t want any bumps in the road!

#### What's Missing?

Here, we're checking how many missing values each column has. It's essential to get a sense of how severe the missing data issue is before deciding what to do. Let's take a look at the result above to see which columns are affected and how many entries are missing. 

Once we know that, we'll decide on the best way to handle it: do we drop the rows, fill in the blanks, or take another approach? Keep your cleaning tools ready!
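
Raw counts are hard to judge without the table size, so it can help to look at the share of missing values per column. This is a minimal sketch on toy data; the column names are borrowed from the cost columns used later in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "airfare": [500.0, np.nan, 320.0, 410.0],
    "lodging": [np.nan, np.nan, 150.0, np.nan],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction missing
missing_share = toy.isnull().mean().mul(100).round(1)
print(missing_share.sort_values(ascending=False))
```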

#### Discussion: “Did Someone Forget to Fill This In?”

If the number of missing values is small, we might just drop the rows. If it’s a significant portion, we’ll consider filling in the blanks using an appropriate method, such as filling with a median for numerical data or the most frequent value for categorical data. Let’s move ahead and clean up the missing values!
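
The filling strategies mentioned above can be sketched as follows. This is an illustration on toy data, not the approach this notebook takes in Step 3, and the values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "meals": [40.0, np.nan, 60.0, 1000.0],
    "purpose_en": ["Meeting", "Meeting", None, "Conference"],
})

filled = toy.copy()

# Numerical column: fill gaps with the median (60.0 here)
filled["meals"] = filled["meals"].fillna(filled["meals"].median())

# Categorical column: fill gaps with the most frequent value ("Meeting" here)
most_frequent = filled["purpose_en"].mode().iloc[0]
filled["purpose_en"] = filled["purpose_en"].fillna(most_frequent)

print(filled)
```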

## Step 3: Dropping Rows with Missing Values (Bye-Bye, Gaps!)

Now that we've identified the missing values, we're going to drop the rows with missing data. Keep in mind that `dropna()` with no arguments removes every row that has a missing value in any column, so always check how many rows survive. Since our dataset has synthetic issues added, it's a good opportunity to practice this important data cleaning step. Removing rows with missing values helps us avoid errors or misleading results during analysis or modeling.

In [None]:
# Drop rows with missing values
data_cleaned = data.dropna()

# Check the shape of the cleaned dataset to see how much data was dropped
data_cleaned.shape

#### Data Cleaned and Ready to Roll!

We've successfully dropped the rows containing missing values. You can see above how much data we have left. This is a common cleaning step when missing values are scattered across the dataset.
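
If a plain `dropna()` removes more rows than expected, one option is to restrict it to the columns that matter for the analysis. The sketch below uses toy data, and the `comments` column is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "airfare": [500.0, np.nan, 320.0],
    "lodging": [200.0, 150.0, 90.0],
    "comments": [None, "n/a", None],   # hypothetical optional column
})

# Drops a row if ANY column is missing
strict = toy.dropna()

# Drops a row only if one of the listed columns is missing
targeted = toy.dropna(subset=["airfare", "lodging"])

print(len(toy), len(strict), len(targeted))
```

Here the strict version keeps no rows at all, while the targeted version keeps the two rows with complete cost information.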

In the next step, we'll tackle another common data cleaning task: dealing with duplicates.

## Step 4: Detecting and Removing Duplicates (Seeing Double?)

Sometimes, datasets contain duplicate rows, which can occur due to errors during data entry or when datasets are merged. Duplicate entries can skew results and lead to incorrect conclusions. In this step, we’ll identify and remove any duplicate rows to ensure our dataset is unique.

In [None]:
# Detect duplicates
duplicates_count = data_cleaned.duplicated().sum()

print(f"Number of duplicate rows: {duplicates_count}")

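# Remove duplicate rows; .copy() gives an independent DataFrame so the in-place
# edits in later steps do not raise SettingWithCopyWarning
data_no_duplicates = data_cleaned.drop_duplicates().copy()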

# Check the shape of the dataset after removing duplicates
data_no_duplicates.shape

#### Duplicate Data, Gone!

We've identified and removed any duplicate rows in the dataset. By doing this, we ensure that our analysis won't be affected by repeated entries that could skew our results.
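
By default, `duplicated()` and `drop_duplicates()` only treat rows as duplicates when every column matches. When repeated records differ in an unimportant field, the `subset` and `keep` parameters can help. This is a sketch on toy data, and `ref_number` is a hypothetical identifier column.

```python
import pandas as pd

toy = pd.DataFrame({
    "ref_number": ["T-1", "T-1", "T-2"],   # hypothetical identifier column
    "total": [100.0, 100.0, 250.0],
    "note": ["a", "b", "c"],
})

# Full-row comparison: no duplicates, because "note" differs
exact = toy.duplicated().sum()

# Compare only the columns that define a unique trip
by_key = toy.duplicated(subset=["ref_number", "total"]).sum()

# Keep the first occurrence of each (ref_number, total) pair
deduped = toy.drop_duplicates(subset=["ref_number", "total"], keep="first")

print(exact, by_key, len(deduped))
```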

In the next step, we'll deal with implausible data — things like negative expenses or unrealistic numbers. Time to do a reality check!

## Step 5: Dealing with Implausible Data (Reality Check Time!)

In some cases, datasets can contain implausible or impossible values, such as negative amounts for travel expenses. These errors can seriously mess up your analysis if not corrected. Here, we'll identify any such cases and fix them by replacing implausible values with `NaN`, so we can decide later how to handle them.

Let's start by checking for negative values in key numeric columns like airfare, lodging, and meals. We'll then replace them with `NaN` for further action.



In [None]:
# Detect implausible data (negative values)
implausible_columns = ['airfare', 'lodging', 'meals', 'other_expenses']
implausible_data = data_no_duplicates[implausible_columns] < 0

# Sum the number of implausible entries in each column
implausible_data_sum = implausible_data.sum()
print(f"Number of implausible entries:\n{implausible_data_sum}")

# Replace implausible values with NaN using .loc with a boolean mask for each column
for column in implausible_columns:
    data_no_duplicates.loc[data_no_duplicates[column] < 0, column] = np.nan

# Verify that the implausible values are handled
data_no_duplicates[implausible_columns].describe()
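
As an alternative to looping with `.loc`, the same replacement can be sketched with `DataFrame.mask`, which returns a new object and leaves the input untouched. The toy values below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "airfare": [500.0, -20.0, 320.0],
    "meals": [40.0, 55.0, -1.0],
})
cost_cols = ["airfare", "meals"]

# Count negative entries per column before changing anything
negatives_per_column = (toy[cost_cols] < 0).sum()

# mask() replaces values where the condition is True with NaN
cleaned = toy.copy()
cleaned[cost_cols] = cleaned[cost_cols].mask(cleaned[cost_cols] < 0)

print(negatives_per_column)
print(cleaned)
```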

## Step 6: Cleaning Inconsistent Data Entries (Name Game Fix!)

Sometimes, inconsistent data entries, such as variations in names or addresses, can creep into our dataset. This could be something like "Toronto" vs "Toronto, Ontario" or "Vancouver" vs "Vancouver, BC." We can fix such inconsistencies by using the `fuzzywuzzy` library, which helps match similar strings based on token similarity.

Let's apply this to the `destination_en` column of our dataset and clean up inconsistent destination names.

In [None]:
from fuzzywuzzy import fuzz, process

# Canonical spellings to standardize to (illustrative; extend for your data).
# Variants such as 'Toronto, Ontario' or 'Montréal' should map onto these.
canonical_names = ['Toronto', 'Vancouver', 'Montreal']

# Create a function to replace inconsistent names
def replace_inconsistent_names(column, canonical_names):
    corrected_names = []

    for name in column:
        # Find the closest canonical name for the current value.
        # token_set_ratio gives a perfect score when one string's tokens are a subset
        # of the other's, so 'Toronto, Ontario' matches 'Toronto'. It can also
        # over-match, so review the result before trusting it.
        match = process.extractOne(name, canonical_names, scorer=fuzz.token_set_ratio)
        # If the match score is high enough, replace with the canonical name
        if match and match[1] > 80:  # Threshold set to 80 to limit false matches
            corrected_names.append(match[0])
        else:
            corrected_names.append(name)  # Keep the name as it is if no good match is found

    return corrected_names

# Apply the function to the 'destination_en' column
data_no_duplicates.loc[:, 'destination_en'] = replace_inconsistent_names(data_no_duplicates['destination_en'], canonical_names)

# Check the cleaned column
data_no_duplicates['destination_en'].unique()

#### Consistency is Key!

We've cleaned up the inconsistent city names in the 'destination_en' column. Now, instead of multiple variations of the same location, we have consistent and standardized names. This makes analysis much easier and ensures that our insights are accurate.
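
For comparison, here is a minimal sketch of the same idea using only the standard library's `difflib` instead of `fuzzywuzzy`. It is a swapped-in technique, not the method used in the cell above. The canonical list, the 0.8 cutoff, and the rule of dropping everything after a comma are all assumptions chosen for this toy example.

```python
import difflib
import pandas as pd

# Assumed canonical spellings for this toy example
canonical = ["Toronto", "Vancouver", "Montreal"]

def standardize(name, choices=canonical, cutoff=0.8):
    """Return the closest canonical name, or the input unchanged if nothing is close."""
    city = name.split(",")[0].strip()   # drop a trailing ", Province" part
    match = difflib.get_close_matches(city, choices, n=1, cutoff=cutoff)
    return match[0] if match else name

destinations = pd.Series(
    ["Toronto, Ontario", "Montréal", "Vancouver, BC", "Ottawa", "Toronto"]
)

# Score each distinct spelling once, then map the result back onto every row
mapping = {name: standardize(name) for name in destinations.unique()}
cleaned = destinations.map(mapping)
print(cleaned.tolist())
```

Building the mapping from the unique values avoids re-scoring the same spelling on every row, which matters on a dataset with many repeated destinations.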

Coming up next, we'll handle outliers — those data points that might be way outside the norm. Let's see how to identify and deal with them.

## Step 7: Outlier Detection 🧐

Now, we're going to identify outliers using the **Interquartile Range (IQR)** method. Outliers are values that are significantly higher or lower than most of the data and can skew the analysis.

We'll calculate the IQR for columns such as `airfare`, `lodging`, `meals`, and `other_expenses`. An outlier is any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the 25th and 75th percentiles and IQR = Q3 − Q1.

### Why?
Detecting outliers can help us clean up extreme values that might distort our analysis or misrepresent spending patterns.

In [None]:
# Define the columns where we want to detect outliers
numeric_columns = ['airfare', 'lodging', 'meals', 'other_expenses']

# Function to detect outliers using IQR and replace them with NaN
def detect_and_remove_outliers(df, columns):
    df = df.copy()  # Work on a copy so the caller's DataFrame is not modified
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Replace outliers with NaN
        df.loc[(df[col] < lower_bound) | (df[col] > upper_bound), col] = np.nan
    return df

# Apply the function to detect and remove outliers
data_cleaned = detect_and_remove_outliers(data_no_duplicates, numeric_columns)

# Check a summary of the cleaned data
data_cleaned[numeric_columns].describe()

### What just happened? 🤔
We detected and replaced the outliers with `NaN` for columns like airfare, lodging, meals, and other expenses. This helps ensure that extreme values don’t skew our results in later analysis.

We’ll next decide whether to **drop** or **impute** these outliers in Step 8.
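
To make the IQR rule concrete, here is a small worked example on five invented airfare values. It only flags the outliers and does not replace them.

```python
import pandas as pd

airfare = pd.Series([100.0, 120.0, 130.0, 150.0, 5000.0])

q1 = airfare.quantile(0.25)   # 120.0
q3 = airfare.quantile(0.75)   # 150.0
iqr = q3 - q1                 # 30.0

lower = q1 - 1.5 * iqr        # 75.0
upper = q3 + 1.5 * iqr        # 195.0

# Boolean mask: True where a value falls outside the bounds
is_outlier = (airfare < lower) | (airfare > upper)
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=({lower}, {upper})")
print(airfare[is_outlier].tolist())
```

Only the 5000.0 entry falls outside the interval from 75 to 195, so it is the single flagged value.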

## Step 8: Handling Outliers 🚀

Now that we’ve identified the outliers and replaced them with `NaN`, we need to decide how to handle them. The two main strategies are:

- **Imputation**: Replace the `NaN` values (the flagged outliers, plus the negative values flagged in Step 5) with something like the median or mean of the column.
- **Dropping**: Remove rows that contain these outliers.

Since we want to maintain as much data as possible, we'll go with **imputation** using the median for each column. The median is a robust measure because it is far less sensitive to extreme values than the mean.

### Why?
Imputing with the median will help preserve the data while preventing outliers from distorting our results.
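
A quick illustration of why the median is the safer fill value: adding one extreme entry to an invented set of airfares moves the mean a lot and the median only slightly.

```python
import pandas as pd

typical = pd.Series([100.0, 120.0, 130.0, 150.0])
with_extreme = pd.Series([100.0, 120.0, 130.0, 150.0, 5000.0])

print("mean:  ", typical.mean(), "->", with_extreme.mean())      # 125.0 -> 1100.0
print("median:", typical.median(), "->", with_extreme.median())  # 125.0 -> 130.0
```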

In [None]:
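# Impute missing values (flagged outliers and the negative values from Step 5)
# with the median of each column
def impute_outliers(df, columns):
    df = df.copy()  # Avoid modifying the input DataFrame
    for col in columns: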
        median_value = df[col].median()
        df[col] = df[col].fillna(median_value)
    return df

# Apply imputation to outlier-affected columns
data_imputed = impute_outliers(data_cleaned, numeric_columns)

# Check if outliers are handled
data_imputed[numeric_columns].describe()

### What happened here? 🤓
We replaced all `NaN` values (the outliers flagged in Step 7 and the negative values flagged in Step 5) with the median of their respective columns. This lets us keep the rows while limiting the influence of extreme values.

Next, we'll move on to parsing and standardizing **datetime** values to ensure consistency. Let's move on! 🎯

## Step 9: Parsing Dates

### Parsing Dates and Standardizing Formats

We often encounter inconsistent date formats in real-world datasets. In this step, we'll ensure that all dates are consistently formatted by converting them to `datetime` objects. This allows for easier filtering, manipulation, and analysis later.

---

### **🔍 Time Travel with Date Parsing!**
We’re about to fix the dates. Imagine sorting travel plans for an organization that has meetings all across the world. Now, let’s make sure all the date columns are in the same format so we can analyze the data efficiently!

---




In [None]:
# Step 9: Parse the 'start_date' and 'end_date' columns
data_imputed['start_date'] = pd.to_datetime(data_imputed['start_date'], errors='coerce')
data_imputed['end_date'] = pd.to_datetime(data_imputed['end_date'], errors='coerce')

# Let's check if there are any issues with the conversion
print(data_imputed[['start_date', 'end_date']].dtypes)
print(data_imputed[['start_date', 'end_date']].head())

### **🕰️ What Happened?**
In this block:
- We used `pd.to_datetime()` to convert the `start_date` and `end_date` columns into `datetime` objects.
- The `errors='coerce'` argument ensures that any invalid dates are converted into `NaT` (Not a Time), so we can handle those later.
- The result shows whether our conversion was successful and gives us a preview of the dates.
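
Because `errors='coerce'` hides failures by design, it is worth checking which raw values turned into `NaT`. The sketch below uses invented date strings and assumes an ISO `YYYY-MM-DD` format; the real columns may use a different one.

```python
import pandas as pd

raw = pd.Series(["2023-01-15", "2023-02-30", "not a date", None])

# Invalid or impossible dates (February 30th) become NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Values that were present in the raw data but could not be parsed
failed = raw[parsed.isna() & raw.notna()]

print(parsed)
print(failed.tolist())
```

Separating the values that were already missing from the values that failed to parse shows whether the problem is gaps in the data or an unexpected date format.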

---

After parsing dates, we can proceed with exploring our travel durations.
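
Once both columns are `datetime` values, a travel duration is a simple subtraction. This is a minimal sketch on invented dates; `duration_days` and `bad_dates` are hypothetical column names introduced here for illustration.

```python
import pandas as pd

trips = pd.DataFrame({
    "start_date": pd.to_datetime(["2023-03-01", "2023-03-10", "2023-04-05"]),
    "end_date": pd.to_datetime(["2023-03-04", "2023-03-10", "2023-04-01"]),
})

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days
trips["duration_days"] = (trips["end_date"] - trips["start_date"]).dt.days

# A trip that ends before it starts is implausible and worth flagging
trips["bad_dates"] = trips["duration_days"] < 0

print(trips)
```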

## Step 11: Character Encoding

Sometimes datasets may contain special characters or non-UTF-8 encoding that can cause issues when reading or processing the data. In this step, we will:
- Identify any encoding issues using Python’s `chardet` package.
- Convert the data to UTF-8 format if necessary and handle errors during conversion.

---

### **🧙‍♂️ Encoding Magic!**
Our data may contain hidden encoding issues. Let's detect and fix them before they cause any headaches!

---

In [None]:
# Install the chardet package if you haven't already
# %pip install chardet

import chardet

# Function to detect encoding issues
def detect_encoding(file_path):
    with open(file_path, 'rb') as file:
        result = chardet.detect(file.read())
        return result

# Detect encoding of the CSV file
encoding_info = detect_encoding('travelq.csv')
print(f"Detected Encoding: {encoding_info}")

# Read the file using the detected encoding
data_clean = pd.read_csv('travelq.csv', encoding=encoding_info['encoding'])

# Write a UTF-8 copy of the re-read raw file (the original travelq.csv is left untouched).
# Note: this saves the raw data, not the cleaned DataFrame from the earlier steps.
data_clean.to_csv('travelq_cleaned.csv', encoding='utf-8', index=False)

print("Data successfully written with UTF-8 encoding!")

### **🔧 What Happened?**
- We used the `chardet` package to detect the encoding of the file.
- We then re-read the file using the detected encoding and saved a UTF-8 copy as `travelq_cleaned.csv`. This copy holds the raw data, not the cleaned DataFrame from the earlier steps.
- This ensures all characters, especially special characters, are correctly interpreted.
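
To see what an encoding problem actually looks like, the standard-library sketch below encodes one accented city name two ways and decodes it with the wrong codec. The helper `decode_with_fallback` is hypothetical and only illustrates a try-in-order strategy; it is not part of `chardet` or `pandas`.

```python
text = "Montréal"

latin1_bytes = text.encode("latin-1")   # b'Montr\xe9al'
utf8_bytes = text.encode("utf-8")       # b'Montr\xc3\xa9al'

# Decoding UTF-8 bytes as Latin-1 silently produces garbled text (mojibake)
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)   # MontrÃ©al

# Decoding Latin-1 bytes as UTF-8 fails loudly instead
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)   # True

def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in order, return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

print(decode_with_fallback(latin1_bytes))
```

Trying UTF-8 first makes sense because it rejects most invalid input, whereas Latin-1 accepts any byte sequence and so can never signal a mismatch.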

## Step 12: Conclusion

Now that we've successfully cleaned the data through multiple steps, it’s time to visualize the result and highlight why clean data is essential for any analysis. We’ll create some cool charts to visualize the cleaned data and emphasize how clean data allows us to create meaningful visualizations and insights.

---

### **🎉 Clean Data, Cool Insights!**
Data cleaning isn’t just about fixing errors—it unlocks the power to gain valuable insights from your data. Here are some cool charts that show how the Canadian government is spending on travel.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Top 5 most expensive travel purposes
top_expenses = data_imputed.groupby('purpose_en')['total'].sum().sort_values(ascending=False).head(5)

# Plotting the top 5 most expensive purposes
plt.figure(figsize=(10, 6))
sns.barplot(x=top_expenses.values, y=top_expenses.index, hue=top_expenses.index, dodge=False, palette="viridis")
plt.title('Top 5 Most Expensive Travel Purposes')
plt.xlabel('Total Cost ($)')
plt.ylabel('Purpose')
plt.legend([],[], frameon=False)  # Hide the legend 
plt.show()


# Distribution of total expenses
filtered_expenses = data_imputed[data_imputed['total'] < 50000] # filtered to remove extreme values

plt.figure(figsize=(10, 6))
sns.histplot(filtered_expenses['total'], bins=30, kde=True, color="blue")
plt.title('Distribution of Total Travel Expenses')
plt.xlabel('Total Expense ($)')
plt.ylabel('Frequency')
plt.show()

# Travel expenses by destination
top_destinations = data_imputed.groupby('destination_en')['total'].sum().nlargest(10).index
filtered_data = data_imputed[data_imputed['destination_en'].isin(top_destinations)]

plt.figure(figsize=(10, 6))
sns.boxplot(x='destination_en', y='total', data=filtered_data)
plt.xticks(rotation=45)
plt.title('Distribution of Travel Expenses by Top 10 Destinations')
plt.xlabel('Destination')
plt.ylabel('Total Expense ($)')
plt.show()

### **📊 What We See:**
- **Top 5 Travel Purposes:** A bar plot showing which purposes cost the most.
- **Expense Distribution:** A histogram of total travel expenses, showing the spread of travel costs.
- **Expenses by Destination:** A boxplot showing the variation of travel expenses by destination.

---

### **📝 Final Thoughts:**
In this notebook, we’ve gone through several crucial steps to clean and preprocess the data, including:
- Handling missing values, duplicates, implausible data, and outliers.
- Parsing dates and resolving inconsistencies.
- Addressing encoding issues.

Clean data allows us to generate powerful insights through visualizations and analysis. Without proper cleaning, the analysis could lead to incorrect conclusions.