# Data Cleaning

What is Data Cleaning?

1. Definition of data cleaning
2. Importance of clean data for analysis
3. Common data issues (missing values, duplicates, inconsistent formats)
4. Impact of unclean data on analysis -> Examples of data errors in real-life scenarios

In [None]:
# import packages

import numpy as np
import pandas as pd

In [None]:
# Artificial dataset

data = {
    'student_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'name': ['Alice', 'Bob', 'C@rla', 'David', 'Eve', 'Frank', np.nan, 'Hannah', 'Ian', 'Jessica'],
    'age': [20, 'twenty-one', 22, 21, np.nan, 23, 22, 24, 'unknown', 22],
    'gpa': [3.8, 2.9, 3.5, 4.1, np.nan, 2.5, 'n/a', 3.0, 2.7, 3.9],
    'major': ['Physics', 'Mathematics', 'Computer Science', 'History', 'Physics', 'History', 'Psychology', 'Computer Science', np.nan, 'Mathematics'],
    'enrollment_year': [2020, 2019, 2021, 2018, 2019, 2019, 2018, 2020, 2017, '2020ish'],
    'credits_earned': [90, 120, np.nan, 100, 115, -5, 140, 100, 85, 'seventy'],
    'clubs': ['Drama, Tennis', 'Chess', 'Coding Club, AI Club', np.nan, 'Drama, Football', 'Football', 'Chess', 'Drama', np.nan, 'Chess, AI Club'],
    'grad_date': ['2023-05-10', np.nan, '2024-06-15', '2022-12-01', '2023-07-07', 'invalid', np.nan, '2024-06-15', 'NaT', '2023-05-10']
}

df = pd.DataFrame(data)

*Task 1*: What problems or issues do you detect in this dataset?

*Answer*: 


As a first step, check how much data is missing and what kind of data it is.

Keep in mind: missing data can be labeled in different ways, e.g. NA, 0, or negative values (such as -99 or -77), so always check the data documentation first. A value might denote a missing entry in one column and be a regular value in another one. Be cautious, otherwise it will distort your analysis.
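The point about column-specific missing codes can be illustrated with a minimal sketch. The `survey` frame, its columns, and the `missing_codes` mapping are hypothetical and only serve as an example; in practice the codes come from the data documentation.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: -99 and -77 are assumed to be missing-value
# codes for 'income', while 0 is a legitimate value for 'children'.
survey = pd.DataFrame({
    'income': [2500, -99, 3100, -77],
    'children': [0, 2, -99, 1],
})

# Assumed mapping: column name -> codes documented as "missing"
missing_codes = {'income': [-99, -77], 'children': [-99]}

cleaned = survey.copy()
for column, codes in missing_codes.items():
    # Set only the listed codes of this column to NaN
    cleaned[column] = cleaned[column].mask(cleaned[column].isin(codes))

print(cleaned.isnull().sum())
```

Handling the codes per column keeps the regular 0 in `children` intact, which a blanket replacement across the whole data set would not guarantee.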

In [None]:
# show a Boolean mask for the whole data set - False: regular value - True: NA
print(df.isnull())

In [None]:
# show the number of NAs per column
print(df.isnull().sum())

How are you going to deal with the missing data?

Drop all rows that have a missing value? Drop all columns that have one?

In [None]:
# drop all rows that include at least one NA

df.dropna()

In [None]:
# drop only the rows that have an NA in the clubs column

df.dropna(subset=["clubs"])

In [None]:
# drop all columns that include at least one NA

df.dropna(axis=1)

In [None]:
# What if you want to change the value of the NA to something else? e.g. -99

df.fillna(-99)
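Note that `dropna` and `fillna` return a new DataFrame by default and leave the original unchanged, so the result has to be assigned to a variable if it is to be kept. The following sketch compares the strategies on a small, hypothetical frame called `toy` (not part of the student data).

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical)
toy = pd.DataFrame({
    'a': [1, np.nan, 3],
    'b': [4, 5, 6],
    'c': [np.nan, 8, 9],
})

rows_dropped = toy.dropna()                 # keeps only complete rows
subset_dropped = toy.dropna(subset=['a'])   # only checks column 'a'
cols_dropped = toy.dropna(axis=1)           # keeps only complete columns
filled = toy.fillna(-99)                    # keeps everything, replaces NaN

# Compare how much data survives each strategy
print(toy.shape, rows_dropped.shape, subset_dropped.shape, cols_dropped.shape)

# The original frame still contains its missing values
print(toy.isnull().sum().sum())
```

Comparing the shapes shows how much data each strategy discards, which is worth checking before committing to one of them.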

## Cleaning of variable 'Age'

In [None]:
# Instead of dropping observations, you can also fill the NAs with a typical value of the column, such as the mean or the median (used below) -> of course this is only possible for numeric data like the age variable

def convert_age(age):
    if isinstance(age, str):
        age = age.lower()
        if age == 'twenty-one':
            age = 21
        elif age == 'unknown':
            age = np.nan
    return pd.to_numeric(age, errors='coerce')

# Non-numeric strings are converted to numeric values or NaN
df['age'] = df['age'].apply(convert_age)

# Missing values are then filled with the median age
# (assigning the result back avoids the unreliable chained inplace=True pattern)
df['age'] = df['age'].fillna(df['age'].median())

print(df.isnull().sum())

## Cleaning of variable 'GPA'

In [None]:
# GPA values are coerced to numeric, and invalid values (outside 0.0-4.0) are set to NaN
df['gpa'] = pd.to_numeric(df['gpa'], errors='coerce')
df['gpa'] = df['gpa'].where(df['gpa'].between(0.0, 4.0))

# Missing GPAs are filled with the mean GPA
df['gpa'] = df['gpa'].fillna(df['gpa'].mean())

print(df.isnull().sum())
df['gpa']

## Cleaning of variable 'Enrollment Year'

*Task 2*: Clean the variable 'Enrollment Year'

In [None]:
# Non-numeric values are converted to numeric

# Missing values are filled with the most common year
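One possible approach to the two steps named in the comments above is sketched here on a standalone Series that repeats the `enrollment_year` values. It is only one option: it treats '2020ish' as missing rather than extracting 2020 from it, which is an assumption about how the value should be handled.

```python
import pandas as pd

years = pd.Series(
    [2020, 2019, 2021, 2018, 2019, 2019, 2018, 2020, 2017, '2020ish'],
    name='enrollment_year',
)

# Coerce non-numeric entries such as '2020ish' to NaN
years = pd.to_numeric(years, errors='coerce')

# mode() returns a Series because ties are possible, so take the first entry
most_common_year = years.mode().iloc[0]

# Fill the gaps and store the years as integers again
years = years.fillna(most_common_year).astype(int)

print(years)
```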


## Cleaning of variable 'Credits Earned'

In [None]:
# A function is written that is supposed to automate the cleaning process
def convert_credits(credits):
    if isinstance(credits, str):
        if credits.lower() == 'seventy':
            credits = 70
    credits = pd.to_numeric(credits, errors='coerce')

    if credits >= 0:
        return credits
    else:
        return np.nan
    
# Strings are converted to numeric, and negative or invalid values are replaced with NaN.
df['credits_earned'] = df['credits_earned'].apply(convert_credits)

# Missing values are filled with the median
df['credits_earned'] = df['credits_earned'].fillna(df['credits_earned'].median())

print(df.isnull().sum())

df['credits_earned']

## Cleaning of variable 'Graduation Date'

In [None]:
# Invalid graduation dates are converted to NaT (Not a Time), and proper date formatting is ensured
df['grad_date'] = pd.to_datetime(df['grad_date'], errors='coerce')

print(df.isnull().sum())
# the number of NAs in the grad_date column has increased: the strings 'invalid' and 'NaT' were not counted as missing before and are now coerced to NaT
df['grad_date']
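To see which entries were turned into NaT by the coercion, the raw values can be compared with the parsed ones. The sketch below uses a shortened, standalone copy of the graduation dates; the names `raw`, `parsed`, and `newly_missing` are chosen for illustration only.

```python
import numpy as np
import pandas as pd

raw = pd.Series(['2023-05-10', np.nan, '2024-06-15', 'invalid', 'NaT'])
parsed = pd.to_datetime(raw, errors='coerce')

# Entries that were present before parsing but are NaT afterwards
newly_missing = raw[raw.notna() & parsed.isna()]

print(newly_missing)
```

Inspecting these values before discarding them helps to decide whether they are truly missing or just formatted in an unexpected way.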

## Cleaning of variable 'Name'

*Task 3*: Clean the 'name' column yourself
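As a starting point, suspicious names can be flagged with a pattern check. The sketch below works on a shortened, standalone Series and rests on assumptions that are not stated in the task: valid names consist only of letters, spaces, hyphens, and apostrophes, '@' stands for the letter 'a', and missing names are labeled 'Unknown'.

```python
import numpy as np
import pandas as pd

names = pd.Series(['Alice', 'C@rla', np.nan, 'Hannah'], name='name')

# Assumption: a valid name starts with a letter and contains only
# letters, spaces, hyphens, and apostrophes; NaN counts as invalid
is_valid = names.str.fullmatch(r"[A-Za-z][A-Za-z '\-]*", na=False)
suspicious = names[~is_valid]
print(suspicious)

# Assumed repair rules: '@' -> 'a', missing names -> 'Unknown'
cleaned_names = names.str.replace('@', 'a', regex=False).fillna('Unknown')
print(cleaned_names)
```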

## Cleaning of variable 'Club Membership'

*Task 4*: Replace missing values in the clubs column with 'No Clubs'
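A minimal sketch of one way to do this, shown on a shortened, standalone copy of the column rather than on `df` itself:

```python
import numpy as np
import pandas as pd

clubs = pd.Series(['Drama, Tennis', 'Chess', np.nan, 'Football', np.nan], name='clubs')

# fillna returns a new Series, so the result is assigned back
clubs = clubs.fillna('No Clubs')

print(clubs)
```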