# 🗑️ Data Cleaning

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](<https://colab.research.google.com/>)

Remove null values and strip everything except the information relevant to our goal.

# 1. ⚙️ Imports

Import the necessary libraries and packages.

In [1]:
import re
import pandas as pd

# Local Libraries
from data import data

# 2. 📁 Data

Get the 🇺🇸 US Presidential Elections by States data.

In [2]:
states_dataframe_path = data.get_dataset_path("rbs", "raw", "us_presidential_elections_by_states", 1)

In [3]:
states_dataframe = pd.read_csv(states_dataframe_path)

print("The first five records of the dataframe: \n")
states_dataframe.head()

The first five records of the dataframe: 



|   | Year | 1972 | 1976 | 1980 | 1984 | 1988 | 1992 | 1996 | 2000 | 2004 | 2008 | 2012 | 2016 | 2020 | 2024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | National popular vote | Nixon | Carter | Reagan | Reagan | Bush | Clinton | Clinton | Gore | Bush | Obama | Obama | Clinton | Biden | Trump |
| 1 | Alabama | Nixon | Carter | Reagan | Reagan | Bush | Bush | Dole | Bush | Bush | McCain | Romney | Trump | Trump | Trump |
| 2 | Alaska | Nixon | Ford | Reagan | Reagan | Bush | Bush | Dole | Bush | Bush | McCain | Romney | Trump | Trump | Trump |
| 3 | Arizona | Nixon | Ford | Reagan | Reagan | Bush | Bush | Clinton | Bush | Bush | McCain | Romney | Trump | Biden | Trump |
| 4 | Arkansas | Nixon | Carter | Reagan | Reagan | Bush | Clinton | Clinton | Bush | Bush | McCain | Romney | Trump | Trump | Trump |


# 3. 🧹 Cleaning Messy Data

⚓️ Let's check the count of null values in every column. ⤵️

## 3.1. 🪹 Drop Null Values

Drop every row that contains a null value, since such rows do not contribute to our end goal.

In [4]:
state_null_count = states_dataframe.isnull().sum().reset_index()
state_null_count.columns = ["Column", "Null Count"]

print("Count of null values in each column: \n")
state_null_count

Count of null values in each column: 



|   | Column | Null Count |
|---|---|---|
| 0 | Year | 0 |
| 1 | 1972 | 1 |
| 2 | 1976 | 1 |
| 3 | 1980 | 1 |
| 4 | 1984 | 2 |
| 5 | 1988 | 3 |
| 6 | 1992 | 3 |
| 7 | 1996 | 3 |
| 8 | 2000 | 3 |
| 9 | 2004 | 3 |


In [5]:
print("Shape of the state dataframe: ", states_dataframe.shape)

Shape of the state dataframe:  (55, 15)
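
Before dropping anything, it can help to see which rows actually hold the nulls. The following is a minimal sketch on a small made-up frame; `toy_frame` and its `Notes` row are assumptions for illustration, not part of the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for states_dataframe.
toy_frame = pd.DataFrame(
    {
        "Year": ["Alabama", "Alaska", "Notes"],
        "1972": ["Nixon", "Nixon", None],
        "1976": ["Carter", "Ford", None],
    }
)

# Boolean mask: True for every row with at least one null cell.
rows_with_nulls = toy_frame.isnull().any(axis=1)

# Labels (first column) of the rows that dropna() would remove.
dropped_labels = toy_frame.loc[rows_with_nulls, "Year"].tolist()
print(dropped_labels)
```

Running the same mask on the real dataframe would list the rows that are about to be dropped, which makes it easier to confirm that none of them is an actual state.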


⚓️ Now drop every row that contains null data, since those rows do not hold the winning candidates. ⤵️

In [6]:
states_dataframe.dropna(inplace = True)

In [7]:
print("Shape of the state dataframe after dropping all the rows containing null values: ", states_dataframe.shape)

Shape of the state dataframe after dropping all the rows containing null values:  (52, 15)
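
As an alternative to `inplace = True`, `dropna()` can return a new frame and leave the original untouched, which keeps the raw data available for comparison. A minimal sketch, assuming a made-up `toy_frame` rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy frame; the "Footnote" row plays the role of a non-state row.
toy_frame = pd.DataFrame(
    {"Year": ["Alabama", "Footnote", "Alaska"], "1972": ["Nixon", None, "Nixon"]}
)

# Without inplace, dropna() returns a new frame; reset_index renumbers the rows.
toy_clean = toy_frame.dropna().reset_index(drop=True)

# The difference in length is the number of rows that were removed.
rows_removed = len(toy_frame) - len(toy_clean)
print(rows_removed, toy_clean.shape)
```

Comparing the lengths before and after gives the same check as the two shape printouts above, in a single cell.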


## 3.2. 🥡 Remove Unnecessary Data

Remove the unnecessary part of each cell in the dataframe, as it carries no information that helps us reach our goal.

In [8]:
print("Let's check the value of the 21st row and 7th column: ", states_dataframe.iloc[20, 6])

Let's check the value of the 21st row and 7th column:  Clinton (at-large and ME-01)


⚓️ The `(at-large and ME-01)` part of `Clinton (at-large and ME-01)` is unnecessary for our end goal.
Because several cells carry annotations like this, we have to strip them from the strings. ⤵️

⚓️ Function for removing the unnecessary parts of the strings.

In [9]:
def remove_unnecessary_strings(string: str) -> str:
    """
    Removes any substrings enclosed in parentheses (including the parentheses themselves).
    Parameters:
        - string (str): The input string potentially containing text in parentheses.
    Returns:
        - str: A cleaned string with all substrings in parentheses removed.
    """
    return re.sub(r"\s*\([^)]*\)", "", string)
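
To see what the pattern does, the function can be tried on a few sample strings. This is a small sketch that repeats the function so it runs on its own; the third sample, with two parenthesised parts, is a made-up input rather than a value from the dataset.

```python
import re


def remove_unnecessary_strings(string: str) -> str:
    # Same pattern as above: optional whitespace, then "(", any non-")" text, then ")".
    return re.sub(r"\s*\([^)]*\)", "", string)


samples = [
    "Clinton (at-large and ME-01)",  # value seen in the dataframe
    "Trump",                         # nothing to remove
    "Biden (at-large) (NE-02)",      # hypothetical input with two annotations
]

cleaned = [remove_unnecessary_strings(sample) for sample in samples]
print(cleaned)
```

The leading `\s*` also consumes the space before the opening parenthesis, so no trailing whitespace is left behind.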

⚓️ Remove all the unnecessary parts of the strings and create a new dataframe. ⤵️

In [10]:
clean_states_dataframe = states_dataframe.map(remove_unnecessary_strings)

In [11]:
print("The value of the 21st row and 7th column (after removing unnecessary parts): ", clean_states_dataframe.iloc[20, 6])

The value of the 21st row and 7th column (after removing unnecessary parts):  Clinton
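
The element-wise cleaning above works because every remaining cell is a string once the null rows are gone. If the cleaning ever ran before `dropna`, `re.sub` would raise a `TypeError` on non-string cells. The sketch below shows one possible guard; `strip_parentheses_safe` and `toy_frame` are hypothetical names, and the column-wise `Series.map` is used here only because `DataFrame.map` requires pandas 2.1 or newer.

```python
import re

import pandas as pd

PAREN_PATTERN = re.compile(r"\s*\([^)]*\)")


def strip_parentheses_safe(value):
    """Hypothetical variant: clean strings, pass any other value through unchanged."""
    if isinstance(value, str):
        return PAREN_PATTERN.sub("", value)
    return value


# Hypothetical toy frame with one missing cell.
toy_frame = pd.DataFrame(
    {"Year": ["Maine", "Texas"], "1992": ["Clinton (at-large and ME-01)", None]}
)

# Apply the guard to every cell, one column at a time.
toy_clean = toy_frame.apply(lambda column: column.map(strip_parentheses_safe))
print(toy_clean.loc[0, "1992"])
```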


# 4. 💣 Export the DataFrame

In [12]:
data_path = data.get_dataset_path("rbs", "processed", "clean_us_presidential_elections_by_states", 1)

try:
    clean_states_dataframe.to_csv(data_path, index = False)
    print("🎉 Saved data to CSV at: ", data_path)
except Exception as e:
    print(f"❌ Error: {e}")

🎉 Saved data to CSV at:  /../../../../../../Volumes/Workstation/Datasets/Red.Blue.States/raw/clean_us_presidential_elections_by_states.csv
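
A quick way to confirm an export is to read the file back and compare it with the frame that was written. The following is a minimal sketch under stated assumptions: it writes a made-up `toy_clean` frame to a temporary directory instead of using the project's `data.get_dataset_path` helper.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for clean_states_dataframe.
toy_clean = pd.DataFrame({"Year": ["Alabama", "Alaska"], "1972": ["Nixon", "Nixon"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "toy_clean.csv"
    toy_clean.to_csv(csv_path, index=False)
    # Read every column back as text so type inference cannot alter the values.
    reloaded = pd.read_csv(csv_path, dtype=str)

# Compare column labels and cell values of the written and reloaded frames.
round_trip_ok = (
    list(reloaded.columns) == list(toy_clean.columns)
    and reloaded.values.tolist() == toy_clean.values.tolist()
)
print(round_trip_ok)
```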


🎉 Congratulations! The `Data Cleaning` is complete!