In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np


In [2]:
# Load your messy data into a pandas dataframe
df = pd.read_csv('messy_data.csv')

In [3]:
# Identify missing values
missing_values = df.isnull().sum()


In [4]:
# Identify duplicate entries
duplicate_entries = df.duplicated().sum()

In [5]:
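# Identify inconsistent formatting (number of distinct Python types per column)
# Note: DataFrame.applymap is deprecated in recent pandas, so map each column instead
inconsistent_formatting = df.apply(lambda col: col.map(type).nunique())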


In [7]:
# Print out the results
print("Missing values:\n", missing_values)
print("\nInconsistent formatting:\n", inconsistent_formatting)
print("\nDuplicate entries:\n", duplicate_entries)

Missing values:
 Index          0
Age            7
Salary         0
Rating         1
Location       0
Established    0
Easy Apply     0
dtype: int64

Inconsistent formatting:
 Index          1
Age            1
Salary         1
Rating         1
Location       1
Established    1
Easy Apply     1
dtype: int64

Duplicate entries:
 0
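
The "Inconsistent formatting" numbers above count how many distinct Python types appear in each column, so a value of 1 means every entry in that column has the same type. As a rough illustration, here is a minimal sketch on a small made-up DataFrame (the `toy_df` data and the `mixed_columns` name are assumptions for this example, not part of messy_data.csv):

```python
import pandas as pd

# Hypothetical toy data: 'Age' mixes numbers with a string, 'Location' is consistent
toy_df = pd.DataFrame({
    "Age": [34, "unknown", 29],
    "Location": ["Austin", "Boston", "Denver"],
})

# Count the distinct Python types found in each column
type_counts = toy_df.apply(lambda col: col.map(type).nunique())
print(type_counts)

# Columns with more than one type are candidates for cleanup
mixed_columns = type_counts[type_counts > 1].index.tolist()
print("Columns with mixed types:", mixed_columns)
```

A count above 1 only says that types are mixed. It will not catch inconsistencies within a single type, such as differently formatted strings.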


Chapter 2 Code

In [8]:
# Import necessary libraries
import pandas as pd

In [9]:
# Load your messy data into a pandas dataframe
df = pd.read_csv('messy_data.csv')

In [10]:
# Identify which columns "spark joy"
joyful_columns = ['Age', 'Salary', 'Location']

In [11]:
# Create a new dataframe that only includes the joyful columns
tidy_df = df[joyful_columns]

In [12]:
# Save the tidy data to a new file
tidy_df.to_csv('tidy_data.csv', index=False)

In [13]:
# Print out a message to confirm success
print("Your data has been tidied up! Enjoy your sparkling clean dataset.")

Your data has been tidied up! Enjoy your sparkling clean dataset.


This code demonstrates the idea of keeping only what "sparks joy" in your data. In this example, we have identified which columns we care about (Age, Salary, and Location), and created a new dataframe that only includes those columns. We then save the tidy data to a new file, and print out a message to confirm success.

Of course, you'll need to adjust this code to fit your specific dataset and cleaning needs, but hopefully this gives you a good idea of how to "tidy up" your messy data.
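
One adjustment you may need: selecting a column that is not in the dataframe raises a KeyError. Here is a minimal sketch of a guard against that, using a small made-up DataFrame (the `toy_df` data and the `present` and `missing` names are assumptions for this example):

```python
import pandas as pd

# Hypothetical toy data standing in for messy_data.csv (note: no 'Location' column)
toy_df = pd.DataFrame({
    "Age": [34, 29],
    "Salary": [52000, 61000],
    "Rating": [4.1, 3.8],
})

joyful_columns = ["Age", "Salary", "Location"]

# Split the wish list into columns that exist and columns that do not
present = [col for col in joyful_columns if col in toy_df.columns]
missing = [col for col in joyful_columns if col not in toy_df.columns]

# Keep only the joyful columns that are actually available
tidy_toy_df = toy_df[present].copy()

print("Kept:", list(tidy_toy_df.columns))
print("Not found:", missing)
```

Reporting the columns that were not found, rather than silently skipping them, makes a typo in a column name easier to spot.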

Chapter 3 Code

In [14]:
# Import necessary libraries
import pandas as pd
import numpy as np

In [15]:
# Load your messy data into a pandas dataframe
df = pd.read_csv('messy_data.csv')

In [17]:
# Identify missing values
missing_values = df.isnull().sum()
print(missing_values)

Index          0
Age            7
Salary         0
Rating         1
Location       0
Established    0
Easy Apply     0
dtype: int64


In [19]:
# Replace missing values with mean or mode
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Rating'] = df['Rating'].fillna(df['Rating'].mode()[0])

In [20]:
# Confirm there are no longer any missing values
missing_values = df.isnull().sum()
print(missing_values)

Index          0
Age            0
Salary         0
Rating         0
Location       0
Established    0
Easy Apply     0
dtype: int64


In [21]:
# Save the tidy data to a new file
df.to_csv('tidy_data.csv', index=False)

In [22]:
# Print out a message to confirm success
print("Your data has been tidied up! Missing values have been handled.")

Your data has been tidied up! Missing values have been handled.


This code demonstrates one common strategy for handling missing values: replacing them with the mean or mode of the non-missing values. (The other common strategy, dropping every row that contains a missing value, is not used here.) In this example, we replace the missing Age values with the mean age, and replace the missing Rating value with the mode (most common) rating. We then confirm that no missing values remain, save the tidy data to a new file, and print out a message to confirm success.

Again, you'll need to adjust this code to fit your specific dataset and cleaning needs, but this should give you a good starting point for handling missing values in your data.
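
To compare filling with the row-dropping strategy, here is a minimal sketch on a small made-up DataFrame (the `toy_df` data and the `filled` and `dropped` names are assumptions for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Age and one missing Rating
toy_df = pd.DataFrame({
    "Age": [34.0, np.nan, 29.0, 41.0],
    "Rating": [4.1, 3.8, np.nan, 3.8],
})

# Strategy 1: fill missing values (mean for Age, mode for Rating)
filled = toy_df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Rating"] = filled["Rating"].fillna(filled["Rating"].mode()[0])

# Strategy 2: drop every row that contains at least one missing value
dropped = toy_df.dropna()

print("Rows after filling: ", len(filled))
print("Rows after dropping:", len(dropped))
```

Filling keeps every row but inserts estimated values. Dropping keeps only observed values but can shrink a small dataset quickly.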

Chapter 4 Code

Here's some Python code that demonstrates how to deal with duplicates in your data, as discussed in Chapter 4:

In [23]:
# Import necessary libraries
import pandas as pd

In [24]:
# Load your messy data into a pandas dataframe
df = pd.read_csv('messy_data.csv')


In [25]:
# Identify duplicate entries
duplicates = df.duplicated()

In [26]:
# Drop duplicate entries
df = df.drop_duplicates()

In [27]:
# Save the tidy data to a new file
df.to_csv('tidy_data.csv', index=False)

In [28]:
# Print out a message to confirm success
print("Your data has been tidied up! Duplicate entries have been removed.")

Your data has been tidied up! Duplicate entries have been removed.


This code identifies any duplicate entries in your data using the duplicated() method, which marks each row that repeats an earlier row, and then drops those duplicate entries using the drop_duplicates() method, which keeps the first occurrence of each row. We then save the tidy data to a new file, and print out a message to confirm success.

Of course, you'll need to adjust this code to fit your specific dataset and cleaning needs, but this should give you a good starting point for dealing with duplicates in your data.
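
One possible adjustment is deciding which columns define a duplicate. Here is a minimal sketch on a small made-up DataFrame (the `toy_df` data and the variable names are assumptions for this example) that counts duplicates before dropping them and also shows the `subset` and `keep` parameters of drop_duplicates():

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are exact copies of each other
toy_df = pd.DataFrame({
    "Location": ["Austin", "Austin", "Boston", "Austin"],
    "Salary": [52000, 52000, 61000, 58000],
})

# duplicated() marks each row that repeats an earlier row
duplicate_mask = toy_df.duplicated()
print("Exact duplicate rows:", duplicate_mask.sum())

# Drop rows that match an earlier row in every column
deduped = toy_df.drop_duplicates()

# Treat rows as duplicates when only 'Location' matches, keeping the first one
by_location = toy_df.drop_duplicates(subset=["Location"], keep="first")

print("Rows after full-row dedup:", len(deduped))
print("Rows after dedup by Location:", len(by_location))
```

Checking the count first shows how many rows will be removed before the data is changed.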

Bonus Code

Here's some Python code that demonstrates how to handle outliers in your data, as discussed in (hidden) Chapter 5:

In [29]:
# Import necessary libraries
import pandas as pd
import numpy as np
from scipy import stats

In [30]:
# Load your messy data into a pandas dataframe
df = pd.read_csv('messy_data.csv')

In [32]:
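# Identify outliers in the age column using z-scores
# (nan_policy='omit' keeps the missing ages from turning every z-score into NaN)
z_scores = np.abs(stats.zscore(df['Age'], nan_policy='omit'))
outliers = (z_scores > 3)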

In [33]:
# Replace outliers with the median age
df.loc[outliers, 'Age'] = df['Age'].median()

In [34]:
# Save the tidy data to a new file
df.to_csv('tidy_data.csv', index=False)

In [35]:
# Print out a message to confirm success
print("Your data has been tidied up! Outliers in the age column have been handled.")

Your data has been tidied up! Outliers in the age column have been handled.


This code identifies outliers in the Age column of your data using z-scores (any value more than 3 standard deviations from the mean), and replaces those outliers with the median age using the .loc indexer. We then save the tidy data to a new file, and print out a message to confirm success.

Again, you'll need to adjust this code to fit your specific dataset and cleaning needs, but this should give you a good starting point for handling outliers in your data.
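
If you would rather not depend on scipy, the same z-score idea can be sketched with pandas alone. This is a minimal illustration on made-up ages (the `toy_df` data and the variable names are assumptions for this example). The pandas mean(), std(), and median() methods skip missing values by default, so the missing age stays missing and is not flagged:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 20 plausible ages, one obvious outlier, one missing value
ages = [25, 31, 28, 35, 42, 38, 29, 33, 27, 36,
        40, 30, 34, 26, 39, 32, 37, 41, 28, 35, 300, np.nan]
toy_df = pd.DataFrame({"Age": ages})

age = toy_df["Age"]

# z-score: distance from the mean in (population) standard deviations
z_scores = (age - age.mean()) / age.std(ddof=0)

# Flag values more than 3 standard deviations from the mean (NaN is never flagged)
outliers = z_scores.abs() > 3

# Replace flagged values with the median age, computed before the replacement
median_age = age.median()
toy_df.loc[outliers, "Age"] = median_age

print("Outliers replaced:", int(outliers.sum()))
print("Largest remaining age:", toy_df["Age"].max())
```

The toy data uses more than ten rows on purpose: with population z-scores, a single value among n points can be at most sqrt(n - 1) standard deviations from the mean, so a threshold of 3 can never be exceeded in a sample of ten or fewer values.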
