# Data Cleaning in Python

# Import Required Libraries

Putting the import statement right at the top means all the powerful DataFrame methods are ready whenever I need them throughout the notebook.

In [2]:
import pandas as pd

# Load Data

Loading the data is my first real look at what I'm working with. Pandas makes reading Excel files so simple, but I know the raw data will have all sorts of inconsistencies that need systematic cleaning. Getting a clear view of the initial structure helps me map out the transformations I need to make.

In [3]:
df = pd.read_excel(r'E:\PortfolioProjects\Python Project\learning pandas\Customer Call List.xlsx')

In [4]:
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Not_Useful_Column
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No,True
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes,False
2,1003,Walter,/White,7066950392,298 Drugs Driveway,N,,True
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y,True
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No,True
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes,True
6,1007,Jeff,Winger,,1209 South Street,No,No,False
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No,False
8,1009,Gandalf,,N/a,123 Middle Earth,Yes,,False
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No,True


# Remove Duplicate Records

Duplicates are always one of the first issues I address because they can seriously skew my analysis results. The drop_duplicates() method is incredibly efficient at spotting and removing exact duplicate rows, keeping my data clean and ensuring I'm working with solid, accurate information.

In [5]:
df = df.drop_duplicates()

In [6]:
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Not_Useful_Column
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No,True
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes,False
2,1003,Walter,/White,7066950392,298 Drugs Driveway,N,,True
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y,True
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No,True
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes,True
6,1007,Jeff,Winger,,1209 South Street,No,No,False
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No,False
8,1009,Gandalf,,N/a,123 Middle Earth,Yes,,False
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No,True
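Before dropping anything, it can help to count how many rows are duplicated and to decide whether "duplicate" means an identical row or just a repeated key. The sketch below uses a small made-up sample (not the real call list) and assumes `CustomerID` is the key I would deduplicate on.

```python
import pandas as pd

# Hypothetical sample: rows 1 and 2 are exact copies; row 3 repeats the ID with a different phone.
sample = pd.DataFrame({
    'CustomerID': [1001, 1002, 1002, 1002],
    'First_Name': ['Frodo', 'Abed', 'Abed', 'Abed'],
    'Phone_Number': ['123-545-5421', '123-643-9775', '123-643-9775', '999-999-9999'],
})

# duplicated() flags every row that repeats an earlier row exactly.
n_exact_dupes = int(sample.duplicated().sum())

# Default behaviour: only rows identical in every column are removed.
exact = sample.drop_duplicates()

# subset= compares only the listed columns; keep='first' retains the first occurrence.
by_id = sample.drop_duplicates(subset='CustomerID', keep='first')

print(n_exact_dupes, len(exact), len(by_id))
```

Which variant is appropriate depends on whether a repeated ID with different details is an error or a legitimate second record.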


# Remove Unnecessary Columns

Not every column in my dataset contributes to the analysis I'm doing. I prefer to drop the irrelevant ones right away - it reduces memory usage and keeps the focus on the key information that actually matters for my work.

In [7]:
df = df.drop(columns='Not_Useful_Column')

In [8]:
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes
2,1003,Walter,/White,7066950392,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,,1209 South Street,No,No
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No
8,1009,Gandalf,,N/a,123 Middle Earth,Yes,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No
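One notebook-specific wrinkle: re-running a cell that drops a column raises a `KeyError` the second time, because the column is already gone. A minimal sketch of a re-runnable version, using a made-up two-column frame, passes `errors='ignore'` to `drop()`.

```python
import pandas as pd

# Hypothetical frame standing in for the call list.
sample = pd.DataFrame({
    'CustomerID': [1001, 1002],
    'Not_Useful_Column': [True, False],
})

# errors='ignore' skips labels that are not present instead of raising KeyError.
sample = sample.drop(columns='Not_Useful_Column', errors='ignore')

# Running the same line again is now harmless.
sample = sample.drop(columns='Not_Useful_Column', errors='ignore')

print(sample.columns.tolist())
```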


# Clean Last Name Data

String cleaning is one of my favorite parts of data prep. Methods like rstrip() and lstrip() remove unwanted characters from the ends of a string. Each takes a set of characters, so a single call such as rstrip('123/._') strips any trailing 1, 2, 3, slash, period, or underscore, which keeps my code compact and readable.

In [9]:
df['Last_Name'] = df['Last_Name'].str.rstrip('123/._') 
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes
2,1003,Walter,/White,7066950392,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,,1209 South Street,No,No
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No
8,1009,Gandalf,,N/a,123 Middle Earth,Yes,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No
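The argument to `rstrip()`, `lstrip()`, and `strip()` is a set of characters, not a prefix or suffix, and `rstrip()` only touches the right-hand end, which is why `/White` is unchanged in the output above. A small sketch on made-up names shows `str.strip()` handling both ends in one call.

```python
import pandas as pd

# Hypothetical messy last names with junk on either side.
names = pd.Series(['/White', '...Potter', 'Swanson_', 'Baggins'])

# rstrip() only removes trailing characters, so leading junk survives.
right_only = names.str.rstrip('123/._')

# strip() removes any of the listed characters from both ends.
both_ends = names.str.strip('123/._')

print(right_only.tolist())
print(both_ends.tolist())
```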


# Clean Phone Number Data

Phone numbers always seem to come with a bunch of formatting characters that need to go. A series of str.replace() calls, one per unwanted character, gives me a clean, systematic way to strip out all the unwanted bits and get down to the actual digits.

In [10]:
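# regex=False treats '(', '.', and '|' as literal characters rather than regex syntax.
# Note: the .str accessor returns NaN for non-string cells, so a phone number stored
# as an integer (row 2) comes out empty after this cell.
df['Phone_Number'] = df['Phone_Number'].str.replace('(', '', regex=False)
df['Phone_Number'] = df['Phone_Number'].str.replace(')', '', regex=False)
df['Phone_Number'] = df['Phone_Number'].str.replace('-', '', regex=False)
df['Phone_Number'] = df['Phone_Number'].str.replace(' ', '', regex=False)
df['Phone_Number'] = df['Phone_Number'].str.replace('/', '', regex=False)
df['Phone_Number'] = df['Phone_Number'].str.replace('.', '', regex=False)
df['Phone_Number'] = df['Phone_Number'].str.replace('|', '', regex=False)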
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,1235455421,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,1236439775,93 West Main Street,No,Yes
2,1003,Walter,/White,,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,1235432345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,8766783469,123 Dragons Road,Y,No
5,1006,Ron,Swanson,3047622467,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,,1209 South Street,No,No
7,1008,Sherlock,Holmes,8766783469,98 Clue Drive,N,No
8,1009,Gandalf,,Na,123 Middle Earth,Yes,
9,1010,Peter,Parker,1235455421,"25th Main Street, New York",Yes,No
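In the output above, row 2 lost its number: it was stored as an integer, and `.str` methods return NaN for non-string cells. One possible alternative, sketched here on made-up values, converts each cell to text first and then removes every non-digit with a single regular expression. The helper name `digits_only` is my own, not part of pandas.

```python
import re

import pandas as pd

# Hypothetical phone values mixing strings, an integer, a missing value, and a placeholder.
phones = pd.Series(['123-545-5421', '123/643/9775', 7066950392, '876|678|3469', None, 'N/a'])


def digits_only(value):
    """Return only the digits of a value, or '' when the value is missing."""
    if pd.isna(value):
        return ''
    # \D matches any character that is not a digit.
    return re.sub(r'\D', '', str(value))


digits = phones.map(digits_only)
print(digits.tolist())
```

Whether a placeholder like `N/a` should become an empty string or stay visible for review is a judgment call. This sketch simply blanks it.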


In [11]:
df['Last_Name'] = df['Last_Name'].str.lstrip('/') 
df['Last_Name'] = df['Last_Name'].str.lstrip('...')

df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,1235455421,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,1236439775,93 West Main Street,No,Yes
2,1003,Walter,White,,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,1235432345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,8766783469,123 Dragons Road,Y,No
5,1006,Ron,Swanson,3047622467,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,,1209 South Street,No,No
7,1008,Sherlock,Holmes,8766783469,98 Clue Drive,N,No
8,1009,Gandalf,,Na,123 Middle Earth,Yes,
9,1010,Peter,Parker,1235455421,"25th Main Street, New York",Yes,No


In [12]:
df['Phone_Number'] = df['Phone_Number'].apply(lambda x: str(x))

df['Phone_Number'] = df['Phone_Number'].apply(lambda x: x[0:3] + '-'+ x[3:6]+ '-' + x[6:10])

In [13]:
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,123-643-9775,93 West Main Street,No,Yes
2,1003,Walter,White,nan--,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,876-678-3469,123 Dragons Road,Y,No
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,nan--,1209 South Street,No,No
7,1008,Sherlock,Holmes,876-678-3469,98 Clue Drive,N,No
8,1009,Gandalf,,Na--,123 Middle Earth,Yes,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No


In [14]:
df['Phone_Number'] = df['Phone_Number'].str.replace('nan--','')
df['Phone_Number'] = df['Phone_Number'].str.replace('Na--','')

In [15]:
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,123-643-9775,93 West Main Street,No,Yes
2,1003,Walter,White,,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,876-678-3469,123 Dragons Road,Y,No
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,,1209 South Street,No,No
7,1008,Sherlock,Holmes,876-678-3469,98 Clue Drive,N,No
8,1009,Gandalf,,,123 Middle Earth,Yes,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No
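The `nan--` and `Na--` values appear because the slicing lambda formats every string, including the text of missing values. A sketch of a guard that only formats ten-digit strings would avoid the follow-up cleanup. It assumes the column already holds digit-only strings, and `format_phone` is a hypothetical helper name.

```python
import pandas as pd


def format_phone(digits):
    """Format a 10-digit string as XXX-XXX-XXXX; return '' for anything else."""
    if len(digits) != 10 or not digits.isdigit():
        return ''
    return f'{digits[0:3]}-{digits[3:6]}-{digits[6:10]}'


# Hypothetical inputs: one valid number plus the kinds of leftovers seen above.
raw = pd.Series(['1235455421', 'nan', 'Na', ''])
formatted = raw.map(format_phone)
print(formatted.tolist())
```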


# Split Address into Components

Complex address strings are perfect candidates for str.split() with expand=True. I love how it transforms messy unstructured text into clean, structured columns that make the data so much more usable.

In [16]:
df[['Street_Address', 'State', 'Zip_Code']] = df['Address'].str.split(',', n=2, expand=True)

In [17]:
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Street_Address,State,Zip_Code
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No,123 Shire Lane,Shire,
1,1002,Abed,Nadir,123-643-9775,93 West Main Street,No,Yes,93 West Main Street,,
2,1003,Walter,White,,298 Drugs Driveway,N,,298 Drugs Driveway,,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y,980 Paper Avenue,Pennsylvania,18503.0
4,1005,Jon,Snow,876-678-3469,123 Dragons Road,Y,No,123 Dragons Road,,
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes,768 City Parkway,,
6,1007,Jeff,Winger,,1209 South Street,No,No,1209 South Street,,
7,1008,Sherlock,Holmes,876-678-3469,98 Clue Drive,N,No,98 Clue Drive,,
8,1009,Gandalf,,,123 Middle Earth,Yes,,123 Middle Earth,,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No,25th Main Street,New York,
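One detail worth checking after a comma split: every piece after the first keeps the space that followed the comma. The sketch below, on three made-up addresses, splits the same way and then strips whitespace from each resulting column.

```python
import pandas as pd

# Hypothetical addresses with zero, one, and two commas.
addresses = pd.Series([
    '123 Shire Lane, Shire',
    '93 West Main Street',
    '980 Paper Avenue, Pennsylvania, 18503',
])

# n=2 allows at most two splits; expand=True returns one column per piece.
parts = addresses.str.split(',', n=2, expand=True)
parts.columns = ['Street_Address', 'State', 'Zip_Code']

# Pieces after a comma start with a space, so strip each column.
parts = parts.apply(lambda col: col.str.strip())

print(parts)
```

Rows with fewer commas get missing values in the trailing columns, which matches the blanks in the output above.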


In [18]:
# Drop both columns in a single call by passing a list.
df = df.drop(columns=['Address', 'Zip_Code'])

In [19]:
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Paying Customer,Do_Not_Contact,Street_Address,State
0,1001,Frodo,Baggins,123-545-5421,Yes,No,123 Shire Lane,Shire
1,1002,Abed,Nadir,123-643-9775,No,Yes,93 West Main Street,
2,1003,Walter,White,,N,,298 Drugs Driveway,
3,1004,Dwight,Schrute,123-543-2345,Yes,Y,980 Paper Avenue,Pennsylvania
4,1005,Jon,Snow,876-678-3469,Y,No,123 Dragons Road,
5,1006,Ron,Swanson,304-762-2467,Yes,Yes,768 City Parkway,
6,1007,Jeff,Winger,,No,No,1209 South Street,
7,1008,Sherlock,Holmes,876-678-3469,N,No,98 Clue Drive,
8,1009,Gandalf,,,Yes,,123 Middle Earth,
9,1010,Peter,Parker,123-545-5421,Yes,No,25th Main Street,New York


In [20]:
df['Street_Address'] = df['Street_Address'].str.replace('N/a','')

In [21]:
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Paying Customer,Do_Not_Contact,Street_Address,State
0,1001,Frodo,Baggins,123-545-5421,Yes,No,123 Shire Lane,Shire
1,1002,Abed,Nadir,123-643-9775,No,Yes,93 West Main Street,
2,1003,Walter,White,,N,,298 Drugs Driveway,
3,1004,Dwight,Schrute,123-543-2345,Yes,Y,980 Paper Avenue,Pennsylvania
4,1005,Jon,Snow,876-678-3469,Y,No,123 Dragons Road,
5,1006,Ron,Swanson,304-762-2467,Yes,Yes,768 City Parkway,
6,1007,Jeff,Winger,,No,No,1209 South Street,
7,1008,Sherlock,Holmes,876-678-3469,N,No,98 Clue Drive,
8,1009,Gandalf,,,Yes,,123 Middle Earth,
9,1010,Peter,Parker,123-545-5421,Yes,No,25th Main Street,New York


# Standardize Categorical Values

I can't stress enough how important consistent formatting is for categorical data - it prevents so many analysis errors down the line. Converting things like 'Yes/No' to 'Y/N' standardizes responses across the entire dataset and keeps everything clean.

In [22]:
#df['Paying Customer'] = df['Paying Customer'].apply(lambda x: str(x))

df['Paying Customer'] = df['Paying Customer'].str.replace('No','N')
df['Paying Customer'] = df['Paying Customer'].str.replace('Yes','Y')

df['Do_Not_Contact'] = df['Do_Not_Contact'].str.replace('No','N')
df['Do_Not_Contact'] = df['Do_Not_Contact'].str.replace('Yes','Y')

df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Paying Customer,Do_Not_Contact,Street_Address,State
0,1001,Frodo,Baggins,123-545-5421,Y,N,123 Shire Lane,Shire
1,1002,Abed,Nadir,123-643-9775,N,Y,93 West Main Street,
2,1003,Walter,White,,N,,298 Drugs Driveway,
3,1004,Dwight,Schrute,123-543-2345,Y,Y,980 Paper Avenue,Pennsylvania
4,1005,Jon,Snow,876-678-3469,Y,N,123 Dragons Road,
5,1006,Ron,Swanson,304-762-2467,Y,Y,768 City Parkway,
6,1007,Jeff,Winger,,N,N,1209 South Street,
7,1008,Sherlock,Holmes,876-678-3469,N,N,98 Clue Drive,
8,1009,Gandalf,,,Y,,123 Middle Earth,
9,1010,Peter,Parker,123-545-5421,Y,N,25th Main Street,New York
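`str.replace()` works on substrings, which is fine when the column only contains Yes, No, Y, and N, but it would also rewrite part of a longer value. A sketch of a whole-value alternative, using `Series.replace()` with a dictionary on made-up responses:

```python
import pandas as pd

# Hypothetical responses, including one free-text value.
answers = pd.Series(['Yes', 'No', 'Y', 'N', 'Not sure'])

# Series.replace with a dict swaps only cells that match a key exactly.
whole_value = answers.replace({'Yes': 'Y', 'No': 'N'})

# str.replace matches substrings, so 'Not sure' is altered as well.
substring = (
    answers.str.replace('No', 'N', regex=False)
           .str.replace('Yes', 'Y', regex=False)
)

print(whole_value.tolist())
print(substring.tolist())
```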


# Handle Missing Values

Missing values are one of those things I tackle early to keep things consistent. I treat empty strings and NaN values the same way across the board. Using fillna() and replace() ensures uniform handling of missing data in all columns.

In [23]:
df = df.replace('N/a','')

In [24]:
df = df.fillna('')

In [25]:
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Paying Customer,Do_Not_Contact,Street_Address,State
0,1001,Frodo,Baggins,123-545-5421,Y,N,123 Shire Lane,Shire
1,1002,Abed,Nadir,123-643-9775,N,Y,93 West Main Street,
2,1003,Walter,White,,N,,298 Drugs Driveway,
3,1004,Dwight,Schrute,123-543-2345,Y,Y,980 Paper Avenue,Pennsylvania
4,1005,Jon,Snow,876-678-3469,Y,N,123 Dragons Road,
5,1006,Ron,Swanson,304-762-2467,Y,Y,768 City Parkway,
6,1007,Jeff,Winger,,N,N,1209 South Street,
7,1008,Sherlock,Holmes,876-678-3469,N,N,98 Clue Drive,
8,1009,Gandalf,,,Y,,123 Middle Earth,
9,1010,Peter,Parker,123-545-5421,Y,N,25th Main Street,New York
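To see what those two steps do, it helps to count missing values before and after. This is a minimal sketch on a made-up two-column frame, assuming `N/a` is the only placeholder in use.

```python
import pandas as pd

# Hypothetical frame with real gaps (None) and a text placeholder ('N/a').
sample = pd.DataFrame({
    'Phone_Number': ['123-545-5421', 'N/a', None],
    'State': [None, 'Shire', 'N/a'],
})

# isna() only sees true missing values, not the 'N/a' text.
missing_before = int(sample.isna().sum().sum())

# Treat the placeholder and real gaps the same way: both become ''.
cleaned = sample.replace('N/a', '').fillna('')

missing_after = int(cleaned.isna().sum().sum())
print(missing_before, missing_after)
```

Note that after `fillna('')` the gaps are empty strings, so later checks need to compare against `''` rather than call `isna()`.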


# Filter Unwanted Records

Boolean indexing is one of my favorite ways to filter DataFrames - it's so elegant. I use it to remove contacts who prefer not to be contacted, which respects their privacy and keeps my analysis focused on the data that matters.

In [26]:
for x in df.index:
    if df.loc[x, 'Do_Not_Contact'] == 'Y':
        df.drop(x, inplace=True)

df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Paying Customer,Do_Not_Contact,Street_Address,State
0,1001,Frodo,Baggins,123-545-5421,Y,N,123 Shire Lane,Shire
2,1003,Walter,White,,N,,298 Drugs Driveway,
4,1005,Jon,Snow,876-678-3469,Y,N,123 Dragons Road,
6,1007,Jeff,Winger,,N,N,1209 South Street,
7,1008,Sherlock,Holmes,876-678-3469,N,N,98 Clue Drive,
8,1009,Gandalf,,,Y,,123 Middle Earth,
9,1010,Peter,Parker,123-545-5421,Y,N,25th Main Street,New York
10,1011,Samwise,Gamgee,,Y,N,612 Shire Lane,Shire
11,1012,Harry,Potter,,Y,,2394 Hogwarts Avenue,
12,1013,Don,Draper,123-543-2345,Y,N,2039 Main Street,


In [27]:
for x in df.index:
    if df.loc[x, 'Phone_Number'] == '':
        df.drop(x, inplace=True)

df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Paying Customer,Do_Not_Contact,Street_Address,State
0,1001,Frodo,Baggins,123-545-5421,Y,N,123 Shire Lane,Shire
4,1005,Jon,Snow,876-678-3469,Y,N,123 Dragons Road,
7,1008,Sherlock,Holmes,876-678-3469,N,N,98 Clue Drive,
9,1010,Peter,Parker,123-545-5421,Y,N,25th Main Street,New York
12,1013,Don,Draper,123-543-2345,Y,N,2039 Main Street,
13,1014,Leslie,Knope,876-678-3469,Y,N,343 City Parkway,
14,1015,Toby,Flenderson,304-762-2467,N,N,214 HR Avenue,
15,1016,Ron,Weasley,123-545-5421,N,N,2395 Hogwarts Avenue,
16,1017,Michael,Scott,123-643-9775,Y,N,121 Paper Avenue,Pennsylvania
19,1020,Anakin,Skywalker,876-678-3469,Y,N,910 Tatooine Road,Tatooine
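The two cells above drop rows one at a time inside a loop. The same filters can be expressed as a boolean mask, which is what boolean indexing refers to. Here is a sketch on a made-up frame that combines both conditions in one step.

```python
import pandas as pd

# Hypothetical contacts after the earlier cleaning steps.
sample = pd.DataFrame({
    'CustomerID': [1001, 1002, 1003, 1007, 1005],
    'Do_Not_Contact': ['N', 'Y', '', 'N', ''],
    'Phone_Number': ['123-545-5421', '123-643-9775', '', '', '876-678-3469'],
})

# Each comparison yields a True/False Series; & keeps rows where both are True.
can_contact = sample['Do_Not_Contact'] != 'Y'
has_phone = sample['Phone_Number'] != ''

kept = sample[can_contact & has_phone]
print(kept)
```

The original row labels are preserved, so the result still has gaps in its index until it is reset.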


# Reset Index

After filtering operations, the indices often end up with gaps that can be problematic. I make it a habit to reset the index to ensure continuous numbering and keep the DataFrame integrity intact for any subsequent operations.

In [28]:
df = df.reset_index(drop=True)
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Paying Customer,Do_Not_Contact,Street_Address,State
0,1001,Frodo,Baggins,123-545-5421,Y,N,123 Shire Lane,Shire
1,1005,Jon,Snow,876-678-3469,Y,N,123 Dragons Road,
2,1008,Sherlock,Holmes,876-678-3469,N,N,98 Clue Drive,
3,1010,Peter,Parker,123-545-5421,Y,N,25th Main Street,New York
4,1013,Don,Draper,123-543-2345,Y,N,2039 Main Street,
5,1014,Leslie,Knope,876-678-3469,Y,N,343 City Parkway,
6,1015,Toby,Flenderson,304-762-2467,N,N,214 HR Avenue,
7,1016,Ron,Weasley,123-545-5421,N,N,2395 Hogwarts Avenue,
8,1017,Michael,Scott,123-643-9775,Y,N,121 Paper Avenue,Pennsylvania
9,1020,Anakin,Skywalker,876-678-3469,Y,N,910 Tatooine Road,Tatooine
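The `drop=True` argument matters here. A short sketch on a made-up filtered frame shows the difference: without it, the old labels are kept as a new column named `index`.

```python
import pandas as pd

# Hypothetical frame whose index has gaps after filtering.
filtered = pd.DataFrame({'CustomerID': [1001, 1005, 1008]}, index=[0, 4, 7])

# drop=True discards the old labels and renumbers from 0.
renumbered = filtered.reset_index(drop=True)

# The default keeps the old labels in an extra 'index' column.
with_old_labels = filtered.reset_index()

print(renumbered.index.tolist())
print(with_old_labels.columns.tolist())
```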


# Export Cleaned Data to Excel

After cleaning the data, I save it to a new Excel file. Passing index=False leaves the row index out of the exported sheet.

In [29]:
df.to_excel(r'E:\PortfolioProjects\Python Project\learning pandas\Customer Call List_Cleaned.xlsx', index=False)

# Key Takeaways

Data cleaning is what transforms my raw, messy data into something truly analysis-ready. Each step builds on the previous one, creating a systematic approach to data preparation that I can count on. String methods, filtering techniques, and pandas operations give me the powerful tools I need for handling real-world data challenges. Sticking to consistent cleaning practices ensures I get reliable insights and prevents downstream analysis errors.

## My Data Cleaning Tips

- I always tackle duplicates and missing values first - they can skew everything else I do
- The right cleaning approach depends on what data quality issues I'm seeing - no one-size-fits-all solution
- I avoid over-cleaning data that might be useful for analysis, even if it's messy
- Clear documentation of each cleaning step saves me time when I revisit the data later
- When cleaning methods get complex, I break them into small, testable steps to avoid errors
- I always check my work with df.head() or df.describe() after major cleaning operations
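Alongside `df.head()` and `df.describe()`, a couple of one-line counts make a quick sanity check after each step. This sketch uses a made-up frame, and the variable names are my own.

```python
import pandas as pd

# Hypothetical frame with one repeated row and two missing phone numbers.
sample = pd.DataFrame({
    'CustomerID': [1001, 1002, 1002],
    'Phone_Number': ['123-545-5421', None, None],
})

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(sample.duplicated().sum())

# Missing values per column.
missing_per_column = sample.isna().sum()

print(n_duplicates)
print(missing_per_column)
```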

## My Learnings

- Pandas string methods are incredibly powerful for text cleaning - they handle so much complexity with simple syntax
- Different cleaning operations serve different data quality needs: duplicates for integrity, missing values for completeness, standardization for consistency
- Always check data types and formats early - they affect which cleaning methods I can use
- Use consistent naming conventions and clear variable names throughout the cleaning process
- For complex datasets, I preprocess in stages rather than trying to do everything at once
- Remember: the right cleaning approach depends on my analysis goals and data characteristics
- Practice makes perfect - I experiment with different methods to understand their effects and edge cases
- Documentation is as important as the cleaning itself - future me will thank present me