# Python Pandas Data Cleaning Demonstration

This notebook demonstrates a short data-cleaning workflow using Python's Pandas library. Data cleaning is an essential step in data analysis, because the quality of the data largely determines how useful any resulting analysis or model is.

Dataset Overview:
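The dataset is synthetic: it is generated in the notebook with Python's `random` module and contains first names, surnames, salaries, e-mail addresses, countries, and IDs. Roughly half of the e-mail addresses are generated with a deliberate formatting error.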

Key Steps:

1. Data Import: The generated data is loaded into a Pandas DataFrame for further manipulation.
2. Data Verification: The ID column is checked for uniqueness and every column is checked for missing values. In this run, each ID is unique and no values are missing.
3. Data Cleaning: Leading and trailing whitespace is stripped from string values, and common e-mail errors (stray spaces, doubled dots, a doubled `@`, and an inserted `.pl.pl`) are corrected.


In [1]:
import pandas as pd
import random

In [2]:
# Creating list of names, surnames and countries
names = ["Piotr", "Katarzyna", "Jan", "Anna", "Tomasz", "Agnieszka", "Andrzej", "Magdalena", "Marek", "Joanna"]
surnames = ["Kowalski", "Nowak", "Woźniak", "Mazur", "Kaczmarek", "Dziedzic", "Zawisza", "Sienkiewicz", "Malinowski", "Jankowski"]
countries = ["Poland", "Germany", "Poland", "France", "Poland", "UK", "USA", "Poland", "Canada", "Poland"]

In [3]:
# Generating random e-mail addresses with some possible errors
def generate_email(name, surname):
    domains = ["gmail.com", "yahoo.com", "outlook.com", "onet.pl", "wp.pl"]
    domain = random.choice(domains)
    errors = [" ", "..", ".pl.pl", "@@"] 
    if random.choice([True, False]):     # Randomly decide if there's an error in the email
        return name.lower() + "." + surname.lower() + "@" + domain
    else:
        error = random.choice(errors)
        if error == " ":
            return name.lower() + "." + surname.lower() + " " + "@" + domain
        else:
            return name.lower() + error + surname.lower() + "@" + domain
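
Because `generate_email` draws from `random`, every run of the notebook produces different addresses and different errors. A minimal sketch of making the output repeatable, assuming a fixed seed is acceptable for a demonstration; the seed value `42` and the helper `draw_domains` are illustrative and not part of the original notebook:

```python
import random

domains = ["gmail.com", "yahoo.com", "outlook.com", "onet.pl", "wp.pl"]

def draw_domains(seed, n=5):
    """Return n randomly chosen domains after seeding the generator (hypothetical helper)."""
    random.seed(seed)  # same seed -> same sequence of random choices
    return [random.choice(domains) for _ in range(n)]

first_run = draw_domains(42)
second_run = draw_domains(42)
print(first_run == second_run)  # True: both runs make identical choices
```

Calling `random.seed(...)` once at the top of the notebook would have the same effect on all later cells.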

In [17]:
# Building the column data for the dataframe
n_rows = 16
data = {
    "name": [random.choice(names) for _ in range(n_rows)],
    "surname": [random.choice(surnames) for _ in range(n_rows)],
    "salary": [random.randint(2000, 10000) for _ in range(n_rows)],
    "e-mail": [],  # filled in below, once names and surnames are known
    "country": [random.choice(countries) for _ in range(n_rows)],
    "id": list(range(1, n_rows + 1))
}

# Each address is derived from the name and surname in the same row
for name, surname in zip(data["name"], data["surname"]):
    data["e-mail"].append(generate_email(name, surname))

In [18]:
df = pd.DataFrame(data)
df

|   | name | surname | salary | e-mail | country | id |
|---|---|---|---|---|---|---|
| 0 | Tomasz | Kowalski | 4746 | tomasz.kowalski@yahoo.com | UK | 1 |
| 1 | Andrzej | Nowak | 2291 | andrzej.nowak@yahoo.com | France | 2 |
| 2 | Marek | Woźniak | 6530 | marek.woźniak @outlook.com | Poland | 3 |
| 3 | Anna | Sienkiewicz | 4580 | anna.pl.plsienkiewicz@gmail.com | Canada | 4 |
| 4 | Joanna | Kowalski | 5071 | joanna.kowalski@yahoo.com | Poland | 5 |
| 5 | Marek | Dziedzic | 3291 | marek.dziedzic @onet.pl | France | 6 |
| 6 | Jan | Malinowski | 7346 | jan..malinowski@onet.pl | Germany | 7 |
| 7 | Katarzyna | Mazur | 4636 | katarzyna.pl.plmazur@yahoo.com | France | 8 |
| 8 | Katarzyna | Kowalski | 8946 | katarzyna.kowalski@outlook.com | USA | 9 |
| 9 | Anna | Malinowski | 7517 | anna@@malinowski@yahoo.com | Canada | 10 |


In [6]:
# Removing any whitespace from the beginning or end of strings
# (DataFrame.applymap is deprecated since pandas 2.1; DataFrame.map is its replacement there)
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
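
Note that `strip` only touches the ends of a string. A small sketch on made-up sample values, showing that a space in the middle of an address (the kind `generate_email` inserts before `@`) survives this step and has to be handled by the regex replacement in the next cell:

```python
import pandas as pd

# Illustrative values: one padded name, one address with an internal space
sample = pd.Series(["  Anna ", "marek.woźniak @outlook.com"])

stripped = sample.str.strip()
print(stripped.tolist())
# The padded name is trimmed; the internal space in the address is unchanged
```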

In [7]:
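# Fixing common email errors.
# Note: these replacements collapse the injected error instead of restoring the "."
# between name and surname, so "@@" and ".pl.pl" errors still leave malformed addresses
# (e.g. joanna@zawisza@onet.pl and jan.plwoźniak@gmail.com in the final output).
df['e-mail'] = (
    df['e-mail']
    .str.replace(r'\.\.', '.', regex=True)
    .str.replace('@@', '@', regex=True)
    .str.replace(r'\.pl\.pl', '.pl', regex=True)
    .str.replace(r'\s+', '', regex=True)
)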
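
In `generate_email`, the `..`, `@@`, and `.pl.pl` errors all take the place of the `.` between name and surname, while the space error is added before `@`. One possible alternative is therefore to map each of those three patterns back to a single `.`. This is a sketch under that assumption: it relies on knowing how the errors were generated and would not be safe on real-world addresses. The `emails` sample is copied from the first table above.

```python
import pandas as pd

emails = pd.Series([
    "marek.woźniak @outlook.com",
    "anna.pl.plsienkiewicz@gmail.com",
    "jan..malinowski@onet.pl",
    "anna@@malinowski@yahoo.com",
    "tomasz.kowalski@yahoo.com",
])

fixed = (
    emails
    .str.replace(r"\s+", "", regex=True)                # drop stray whitespace
    .str.replace(r"\.pl\.pl|@@|\.\.", ".", regex=True)  # restore the "." separator
)
print(fixed.tolist())
```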

In [20]:
# Ensuring the id column has unique values
unique_ids = df['id'].nunique() == df.shape[0]
unique_ids

True
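
Had the check returned `False`, the next step would be to locate the offending rows. A small sketch on a hypothetical frame (the `sample` data is made up for illustration) using `Series.is_unique` and `Series.duplicated`:

```python
import pandas as pd

# Hypothetical data with one repeated id
sample = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Anna", "Jan", "Jan", "Marek"],
})

ids_unique = sample["id"].is_unique
# keep=False marks every occurrence of a repeated id, not just the later ones
duplicate_rows = sample[sample["id"].duplicated(keep=False)]
print(ids_unique)
print(duplicate_rows)
```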

In [21]:
# Checking for any missing data
missing_data = df.isnull().sum()
missing_data

name       0
surname    0
salary     0
e-mail     0
country    0
id         0
dtype: int64
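
The generated data has no gaps, so the counts above are all zero. For comparison, a sketch of the same check on a hypothetical frame that does contain missing values, together with one way to list the affected rows (the `sample` data is made up for illustration):

```python
import pandas as pd

# Hypothetical data with one missing salary and one missing name
sample = pd.DataFrame({
    "name": ["Anna", "Jan", None],
    "salary": [4580, None, 7346],
})

missing_per_column = sample.isnull().sum()
# Rows where at least one column is missing
rows_with_missing = sample[sample.isnull().any(axis=1)]
print(missing_per_column)
print(rows_with_missing)
```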

In [16]:
# Resetting the index to start at 1 so that it lines up with the id column
df.index = range(1, len(df) + 1)
df

|   | name | surname | salary | e-mail | country | id |
|---|---|---|---|---|---|---|
| 1 | Agnieszka | Sienkiewicz | 2705 | agnieszka.sienkiewicz@outlook.com | Poland | 1 |
| 2 | Agnieszka | Kowalski | 4762 | agnieszka.kowalski@yahoo.com | France | 2 |
| 3 | Marek | Kaczmarek | 9367 | marek.kaczmarek@outlook.com | Poland | 3 |
| 4 | Andrzej | Mazur | 5770 | andrzej.mazur@onet.pl | USA | 4 |
| 5 | Marek | Woźniak | 2975 | marek.woźniak@onet.pl | Poland | 5 |
| 6 | Jan | Woźniak | 5405 | jan.plwoźniak@gmail.com | USA | 6 |
| 7 | Joanna | Zawisza | 3370 | joanna@zawisza@onet.pl | Poland | 7 |
| 8 | Tomasz | Woźniak | 3563 | tomasz.woźniak@gmail.com | Poland | 8 |
| 9 | Agnieszka | Kowalski | 8407 | agnieszka@kowalski@onet.pl | UK | 9 |
| 10 | Tomasz | Woźniak | 7252 | tomasz.woźniak@yahoo.com | France | 10 |
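
The final table still contains addresses such as `jan.plwoźniak@gmail.com` and `joanna@zawisza@onet.pl`. Since every address was generated as `name.surname@domain`, one way to flag such leftovers is to compare the part before the last `@` with the value expected from the name columns. A sketch, assuming that naming rule holds; the `sample` frame reuses three rows from the table and the `email_ok` column is illustrative:

```python
import pandas as pd

sample = pd.DataFrame({
    "name": ["Marek", "Jan", "Joanna"],
    "surname": ["Woźniak", "Woźniak", "Zawisza"],
    "e-mail": [
        "marek.woźniak@onet.pl",
        "jan.plwoźniak@gmail.com",
        "joanna@zawisza@onet.pl",
    ],
})

# Local part the generator would have produced without an error
expected_local = sample["name"].str.lower() + "." + sample["surname"].str.lower()
# Everything before the last "@" in the stored address
local_part = sample["e-mail"].str.rsplit("@", n=1).str[0]

sample["email_ok"] = local_part == expected_local
print(sample[["e-mail", "email_ok"]])
```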
