So far, we have:

Created a DataFrame
Viewed the first and last rows using the head and tail methods
Printed the column names
Checked for null values

Next, we'll dive deeper into our data to gain more insights. We'll explore the data types of each column and perform some descriptive analysis on the dataset.

We're fortunate that our data is clean, meaning it doesn't contain any errors or inconsistencies. However, this isn't always the case. The steps we're taking now help us understand our dataset and identify potential issues early on.

Understanding the data types of columns in a DataFrame is crucial because it influences how you analyze and manipulate the data. Different data types (e.g., integers, floats, strings, dates) allow for different operations and analyses. For example:

Numerical columns support mathematical operations and statistical analysis.
Categorical columns are useful for grouping and aggregation.
Date/Time columns enable time-based filtering and analysis.
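
The three cases above can be sketched in a few lines. The orders table and its column names are made up for this illustration; they are not part of our dataset.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
orders = pd.DataFrame({
    "amount": [20.0, 35.5, 12.0, 50.0],            # numerical
    "region": ["East", "West", "East", "West"],    # categorical
    "ordered": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-02-20", "2023-03-01"]
    ),                                             # date/time
})

# Numerical column: statistics work directly
average_amount = orders["amount"].mean()

# Categorical column: group the rows, then aggregate each group
total_by_region = orders.groupby("region")["amount"].sum()

# Date/time column: filter rows by comparing against a timestamp
february_onward = orders[orders["ordered"] >= pd.Timestamp("2023-02-01")]

print(average_amount)
print(total_by_region)
print(february_onward)
```

If the ordered column had been left as plain text, the date comparison would compare strings rather than dates, which is one reason checking data types comes first.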

How to Check Data Types in a DataFrame
In pandas, you can easily check the data types of each column using the .dtypes attribute:

In [1]:
import pandas as pd

# Example DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'JoinDate': ['2023-01-01', '2022-06-15', '2023-03-20'],
    'IsMember': [True, False, True]
}

df = pd.DataFrame(data)

# Check data types
print(df.dtypes)

Name        object
Age          int64
JoinDate    object
IsMember      bool
dtype: object


What This Tells Us:

Name is an object, typically indicating text data.
Age is an int64, meaning it's a whole number and suitable for numerical operations.
JoinDate is also an object, but since it's a date, we might need to convert it to a datetime type for date-based analysis.
IsMember is a bool, useful for filtering and logical operations.

Converting Data Types if Needed:

If needed, you can convert data types using methods like pd.to_datetime() for dates or .astype() for other types:

In [2]:
# Convert 'JoinDate' to datetime
df['JoinDate'] = pd.to_datetime(df['JoinDate'])
print(df.dtypes)

Name                object
Age                  int64
JoinDate    datetime64[ns]
IsMember              bool
dtype: object


By ensuring the correct data types, you'll avoid errors and gain more powerful tools for data analysis.
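
The cell above covers pd.to_datetime(). As a minimal sketch of the .astype() route, assume a small table (invented here) whose numbers arrived as text:

```python
import pandas as pd

# Hypothetical data: numbers that were read in as strings
raw = pd.DataFrame({
    "Age": ["25", "30", "22"],
    "Score": ["88.5", "92.0", "79.25"],
})

# .astype() converts a column to the requested type
raw["Age"] = raw["Age"].astype("int64")
raw["Score"] = raw["Score"].astype("float64")

print(raw.dtypes)
print(raw["Age"].sum())  # arithmetic now works on the converted column
```

Note that .astype() stops with an error if any value cannot be converted (for example the text "unknown" in a numeric column), so it is best suited to columns that are already clean.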

In the example above, we converted the data type of a single column. With pandas, you can target one column at a time by specifying its name within square brackets. This approach allows you to access or modify specific columns directly.

For example:

In [4]:
# data_frame['column_name_here']
df['Name']

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

Bringing It All Together: Week 1

When we start a new data project, one of the first things we do is read in the data and take a good look at it. This isn't just busywork—it's our chance to get the lay of the land before we dive into analysis.

We kick things off by checking out the first and last few rows with head() and tail(). This gives us a quick snapshot of the data's structure and helps us catch anything weird right away, like unexpected values or formatting issues.

Next, we hunt for nulls. Why? Because nothing derails a good analysis like missing data. If half a column is blank, we need to know so we can decide whether to fill those gaps, drop them, or handle them in some other way.

We also check the shape of the DataFrame (.shape) to see how many rows and columns we’re dealing with. This is especially important when merging or transforming data—if our row count suddenly doubles, something probably went wrong.

Finally, we look at descriptive stats (.describe()). This gives us a quick read on our numerical data—things like means, medians, min/max values, and percentiles. It helps us spot outliers and understand the data's spread, which can guide our next steps.

This kind of exploratory data analysis (EDA) might feel like a routine, but it's one of the best habits to build. It sets the stage for solid, reliable analysis and helps us avoid surprises down the road. Plus, as we get into more complex projects, these steps become even more crucial. Think of them as our pre-flight checklist before taking off into deeper analytics or machine learning.

Want to make your future self happy? Make EDA your new best friend.
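
As a rough sketch, the whole checklist fits in one cell. The small table below is invented to stand in for a freshly loaded file.

```python
import pandas as pd

# Hypothetical data standing in for a freshly loaded file
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, None, 41],
    "City": ["Louisville", None, "Lexington", "Frankfort"],
})

print(df.head(2))                  # first rows
print(df.tail(2))                  # last rows

null_counts = df.isnull().sum()    # missing values per column
print(null_counts)

n_rows, n_cols = df.shape          # (rows, columns)
print(n_rows, n_cols)

summary = df.describe()            # statistics for the numeric columns
print(summary)
```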

Understanding Null and Missing Values in pandas
When working with data in pandas, you'll often encounter null or missing values. These represent the absence of data in a DataFrame or Series. Missing values can occur for various reasons, such as incomplete data collection, errors during data entry, or merging datasets with inconsistent records. In pandas, missing values are typically represented as NaN (Not a Number) for numeric data or None for object data types.

Example of Missing Values
Consider the following gradebook data with missing values:

In [3]:
import pandas as pd

data = {
    "Student": ["Alice", "Bob", "Charlie", "David"],
    "Math": [85, None, 78, 92],
    "Science": [90, 88, None, 95],
    "English": [None, 80, 87, 85]
}

df = pd.DataFrame(data)
print(df)

   Student  Math  Science  English
0    Alice  85.0     90.0      NaN
1      Bob   NaN     88.0     80.0
2  Charlie  78.0      NaN     87.0
3    David  92.0     95.0     85.0


In this example, some values are missing, represented by NaN.
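
Before deciding how to handle the gaps, it helps to count them and see which rows they affect. A short sketch using the same gradebook:

```python
import pandas as pd

data = {
    "Student": ["Alice", "Bob", "Charlie", "David"],
    "Math": [85, None, 78, 92],
    "Science": [90, 88, None, 95],
    "English": [None, 80, 87, 85],
}
df = pd.DataFrame(data)

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```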

Handling Missing Values

1. Removing Missing Values
a. dropna() Method:

How: Removes rows (or columns) containing missing values.
When to Use: When missing values represent a small, insignificant portion of the dataset or when those records are not essential for analysis.

In [4]:
import pandas as pd

data = {
    "Student": ["Alice", "Bob", "Charlie", "David"],
    "Math": [85, None, 78, 92],
    "Science": [90, 88, None, 95],
    "English": [None, 80, 87, 85]
}

df = pd.DataFrame(data)
df_cleaned = df.dropna()
print(df_cleaned)

  Student  Math  Science  English
3   David  92.0     95.0     85.0


Pros: Ensures only complete data is used, which can improve the reliability of analysis.
Cons: Potentially reduces the dataset size significantly, leading to biased results if many rows are removed.
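
dropna() does not have to be all-or-nothing. The sketch below shows three of its options; the fifth student, Erin, is a hypothetical row added only so that each option gives a different result.

```python
import pandas as pd

data = {
    "Student": ["Alice", "Bob", "Charlie", "David", "Erin"],  # Erin is hypothetical
    "Math": [85, None, 78, 92, None],
    "Science": [90, 88, None, 95, None],
    "English": [None, 80, 87, 85, 70],
}
df = pd.DataFrame(data)

# Drop a row only if its Math score is missing
only_math_required = df.dropna(subset=["Math"])

# Keep rows that have at least 3 non-missing cells (the name counts as one)
at_least_three = df.dropna(thresh=3)

# Drop columns, rather than rows, that contain any missing value
complete_columns = df.dropna(axis=1)

print(only_math_required)
print(at_least_three)
print(complete_columns)
```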

2. Filling Missing Values
a. fillna() Method:

How: Replaces NaN values with a specified value (e.g., a fixed number, the mean of the column, or forward/backward filling).
When to Use: When removing rows is not ideal, and estimates or placeholders can be used instead.

In [5]:
import pandas as pd

data = {
    "Student": ["Alice", "Bob", "Charlie", "David"],
    "Math": [85, None, 78, 92],
    "Science": [90, 88, None, 95],
    "English": [None, 80, 87, 85]
}

df = pd.DataFrame(data)
df_filled = df.fillna(0)  # Fill with 0
print(df_filled)

   Student  Math  Science  English
0    Alice  85.0     90.0      0.0
1      Bob   0.0     88.0     80.0
2  Charlie  78.0      0.0     87.0
3    David  92.0     95.0     85.0


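Pros: Preserves the size of the dataset. Useful when the missing data is not entirely random or when approximate values can be used.
Cons: Can introduce bias if the filled values do not accurately reflect the missing data.

Alternative Filling Strategies:

Instead of a fixed value, missing entries can be filled with the column mean or with a neighboring row's value (forward or backward fill). Each is demonstrated further below, after the remove-versus-fill summary.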

When to Remove vs. When to Fill

✅ Remove Missing Values: When the dataset is large, and the missing data is minimal or irrelevant to your analysis. This is ideal for ensuring only high-quality data is used.
✅ Fill Missing Values: When preserving the dataset size is crucial, and the missing values can be reasonably estimated. This approach is helpful for maintaining trends or sequences, such as in time series data.

By carefully considering the context of your data and analysis goals, you can choose the appropriate method to handle missing values effectively.
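
One way to put this decision into code is to measure how much of the data is incomplete and branch on the result. This is only a sketch: the 10% cutoff is an arbitrary assumption for illustration, not a rule, and the right threshold depends on the dataset.

```python
import pandas as pd

data = {
    "Student": ["Alice", "Bob", "Charlie", "David"],
    "Math": [85, None, 78, 92],
    "Science": [90, 88, None, 95],
    "English": [None, 80, 87, 85],
}
df = pd.DataFrame(data)

# Fraction of rows that contain at least one missing value
share_incomplete = df.isna().any(axis=1).mean()

# Hypothetical cutoff, chosen only for this illustration
MAX_SHARE_TO_DROP = 0.10

if share_incomplete <= MAX_SHARE_TO_DROP:
    # Few incomplete rows: removing them costs little
    result = df.dropna()
else:
    # Many incomplete rows: keep them and fill with the column means
    result = df.fillna(df.mean(numeric_only=True))

print(share_incomplete)
print(result)
```

Here three of the four rows are incomplete, so dropping would leave a single student and the sketch falls through to filling.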

In [17]:
import pandas as pd

data = {
    "Student": ["Alice", "Bob", "Charlie", "David"],
    "Math": [85, None, 78, 92],
    "Science": [90, 88, None, 95],
    "English": [None, 80, 87, 85]
}

df = pd.DataFrame(data)
# Calling df.fillna(df.mean()) directly fails here, because df.mean()
# cannot average the text column "Student".

# Fill NaN in numeric columns with their mean
df_mean_filled = df.fillna(df.select_dtypes(include='number').mean())

# The line above:
#  - selects only the numeric columns using select_dtypes(include='number'),
#  - calculates the mean of each of those columns,
#  - fills NaN values in each column with that column's mean.

print(df_mean_filled)

   Student  Math  Science  English
0    Alice  85.0     90.0     84.0
1      Bob  85.0     88.0     80.0
2  Charlie  78.0     91.0     87.0
3    David  92.0     95.0     85.0


In [9]:
import pandas as pd

data = {
    "Student": ["Alice", "Bob", "Charlie", "David"],
    "Math": [85, None, 78, 92],
    "Science": [90, 88, None, 95],
    "English": [None, 80, 87, 85]
}

df = pd.DataFrame(data)
# fillna(method='ffill') is deprecated in recent pandas; ffill() is the replacement
df_forward_filled = df.ffill()  # Forward fill: copy the last valid value downward
print(df_forward_filled)

   Student  Math  Science  English
0    Alice  85.0     90.0      NaN
1      Bob  85.0     88.0     80.0
2  Charlie  78.0     88.0     87.0
3    David  92.0     95.0     85.0


In [10]:
import pandas as pd

data = {
    "Student": ["Alice", "Bob", "Charlie", "David"],
    "Math": [85, None, 78, 92],
    "Science": [90, 88, None, 95],
    "English": [None, 80, 87, 85]
}

df = pd.DataFrame(data)
# fillna(method='bfill') is deprecated in recent pandas; bfill() is the replacement
df_backward_filled = df.bfill()  # Backward fill: copy the next valid value upward
print(df_backward_filled)

   Student  Math  Science  English
0    Alice  85.0     90.0     80.0
1      Bob  78.0     88.0     80.0
2  Charlie  78.0     95.0     87.0
3    David  92.0     95.0     85.0
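
Forward fill cannot repair a gap in the first row (Alice's English score has no earlier value to copy), and backward fill has the same limitation for the last row. One possible workaround, sketched here with the same gradebook, is to chain the two so that whatever forward fill leaves behind is filled backward. Whether copying a neighbor's value makes sense at all depends on the data; it suits ordered data such as time series better than a gradebook.

```python
import pandas as pd

data = {
    "Student": ["Alice", "Bob", "Charlie", "David"],
    "Math": [85, None, 78, 92],
    "Science": [90, 88, None, 95],
    "English": [None, 80, 87, 85],
}
df = pd.DataFrame(data)

# Forward fill first, then backward fill any gaps left at the top
both_filled = df.ffill().bfill()
print(both_filled)
```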


Our client, the Kentucky School Board, recently digitized records from the 1950s. Unfortunately, a coffee spill damaged some of the original documents, resulting in lost data. To restore the missing information, we would like to fill the gaps with the class averages for each subject. The relevant class data has been extracted from the database. Please write a script to repair the data accordingly.

Sales Team Dashboard Issue

Instructor Solution

In [18]:
import pandas as pd

data = {
    "Employee": ["Evelyn", "Frank", "Grace", "Henry"],
    "Sales": [250, None, 300, 275],
    "Marketing": [None, 180, 200, 190],
    "IT": [150, 160, None, 175]
}

df = pd.DataFrame(data)

# your code here 

df = df.fillna(df.mean(numeric_only=True))

print(df)

  Employee  Sales  Marketing          IT
0   Evelyn  250.0      190.0  150.000000
1    Frank  275.0      180.0  160.000000
2    Grace  300.0      200.0  161.666667
3    Henry  275.0      190.0  175.000000


Transpose in Python: Transposing a DataFrame means swapping its rows and columns, which pandas exposes through the .T attribute. When transposed, the original rows become columns and vice versa. Transposing is particularly useful when you want to change the orientation of your data, making it easier to perform column-wise operations that might otherwise require row-wise calculations. For example, in the context of a gradebook or sales data, transposing can help align the data by subjects or products, enabling more straightforward aggregation and analysis.

Calculating Column Means: Once the DataFrame is transposed, calculating the mean of each column is simple with the .mean() method. The .mean() function computes the arithmetic mean of numerical values within each column, ignoring any NaN values by default. By chaining .round(), the mean values are rounded to the nearest whole number, which can enhance readability or suit specific use cases. These mean values can be stored as a Series, allowing them to be reused for filling missing data.

Filling Missing Values: Filling NaN values with column averages is a practical method in data cleaning, especially when missing data is sparse. Using the .fillna() method, each missing value is replaced by the average of its column, which keeps every row and leaves the column mean essentially unchanged. The trade-off is that the filled values all sit at the center of the column, so its spread shrinks slightly. This matters little when only a few values are missing from a large dataset, but it can distort results when many are. This technique is often applied in scenarios such as gradebooks, sales data, and any situation where you need to preserve as much information as possible while preparing the data for analysis or modeling.
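
To see what the transpose step buys us, here is a small sketch with an invented two-product table. It compares transposing and then taking column means with asking for row means directly through axis=1. Both give the same numbers; transposing has the extra benefit that the result can be passed straight to .fillna(), which matches a Series of fill values by column name.

```python
import pandas as pd

# Hypothetical mini sales table: stores as columns, products as rows
sales = pd.DataFrame(
    {"Store_1": [150, 100], "Store_2": [170, None], "Store_3": [160, 120]},
    index=["Apples", "Bananas"],
)

# Option 1: transpose so products become columns, then take column means
by_product_transposed = sales.T.mean()

# Option 2: keep the layout and take the mean across each row instead
by_product_axis = sales.mean(axis=1)

print(sales.T)
print(by_product_transposed)
print(by_product_axis)
```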

In [19]:
import pandas as pd

# Sales data for different products in various stores with some missing values
sales_data = {
    "Store_1": {"Apples": 150, "Bananas": None, "Cherries": 200, "Dates": 120, "Eggplants": 90},
    "Store_2": {"Apples": None, "Bananas": 180, "Cherries": 210, "Dates": None, "Eggplants": 100},
    "Store_3": {"Apples": 170, "Bananas": 160, "Cherries": None, "Dates": 130, "Eggplants": None},
    "Store_4": {"Apples": 160, "Bananas": 170, "Cherries": 220, "Dates": 140, "Eggplants": 110},
    "Store_5": {"Apples": 155, "Bananas": None, "Cherries": 215, "Dates": 125, "Eggplants": 95},
    "Store_6": {"Apples": None, "Bananas": 165, "Cherries": None, "Dates": None, "Eggplants": 105},
}

# Convert to a DataFrame and transpose
sales_df = pd.DataFrame(sales_data).T

# Calculate column means (rounded)
column_means = sales_df.mean().round()
print("Column Means (Rounded):\n", column_means)

# Fill NaN values with column averages
sales_df = sales_df.fillna(column_means)

print("\nUpdated DataFrame:\n", sales_df)

Column Means (Rounded):
 Apples       159.0
Bananas      169.0
Cherries     211.0
Dates        129.0
Eggplants    100.0
dtype: float64

Updated DataFrame:
          Apples  Bananas  Cherries  Dates  Eggplants
Store_1   150.0    169.0     200.0  120.0       90.0
Store_2   159.0    180.0     210.0  129.0      100.0
Store_3   170.0    160.0     211.0  130.0      100.0
Store_4   160.0    170.0     220.0  140.0      110.0
Store_5   155.0    169.0     215.0  125.0       95.0
Store_6   159.0    165.0     211.0  129.0      105.0


Kentucky School Board Grades

In [22]:
import pandas as pd

gradebook = {
    "Student_1": {"Math": 85, "Science": 90, "History": 78, "English": None, "Art": 92},
    "Student_2": {"Math": 74, "Science": None, "History": 88, "English": 80, "Art": 76},
    "Student_3": {"Math": None, "Science": 85, "History": 91, "English": 87, "Art": 70},
    "Student_4": {"Math": 92, "Science": 89, "History": None, "English": 95, "Art": 88},
    "Student_5": {"Math": 67, "Science": 73, "History": 80, "English": 85, "Art": None},
    "Student_6": {"Math": 88, "Science": 92, "History": 84, "English": 90, "Art": 86},
    "Student_7": {"Math": 76, "Science": None, "History": 79, "English": 83, "Art": 91},
    "Student_8": {"Math": 95, "Science": 97, "History": None, "English": 93, "Art": 89},
    "Student_9": {"Math": None, "Science": 82, "History": 85, "English": 88, "Art": 90},
    "Student_10": {"Math": 81, "Science": 87, "History": 90, "English": None, "Art": 78},
    "Student_11": {"Math": 69, "Science": 74, "History": 80, "English": 77, "Art": None},
    "Student_12": {"Math": None, "Science": 91, "History": 85, "English": 90, "Art": 93},
    "Student_13": {"Math": 86, "Science": None, "History": 88, "English": 82, "Art": 79},
    "Student_14": {"Math": 90, "Science": 85, "History": 92, "English": 88, "Art": None},
    "Student_15": {"Math": None, "Science": 80, "History": 75, "English": 85, "Art": 87},
}

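gradebook_df = pd.DataFrame(gradebook)

# Transpose the data (students become rows, subjects become columns)
grades_df_t = pd.DataFrame(gradebook).T

# Caution: the two steps below use gradebook_df (not transposed) instead of
# grades_df_t, so column_means holds each student's average rather than each
# subject's class average. The next cell repeats the steps on the transposed frame.
column_means = gradebook_df.mean().round()
print("Column Means (Rounded):\n", column_means)

# Fill NaN values with those per-student averages
grades_df_t = gradebook_df.fillna(column_means)

print("\nUpdated DataFrame:\n", grades_df_t)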

Column Means (Rounded):
 Student_1     86.0
Student_2     80.0
Student_3     83.0
Student_4     91.0
Student_5     76.0
Student_6     88.0
Student_7     82.0
Student_8     94.0
Student_9     86.0
Student_10    84.0
Student_11    75.0
Student_12    90.0
Student_13    84.0
Student_14    89.0
Student_15    82.0
dtype: float64

Updated DataFrame:
          Student_1  Student_2  Student_3  Student_4  Student_5  Student_6  \
Math          85.0       74.0       83.0       92.0       67.0         88   
Science       90.0       80.0       85.0       89.0       73.0         92   
History       78.0       88.0       91.0       91.0       80.0         84   
English       86.0       80.0       87.0       95.0       85.0         90   
Art           92.0       76.0       70.0       88.0       76.0         86   

         Student_7  Student_8  Student_9  Student_10  Student_11  Student_12  \
Math          76.0       95.0       86.0        81.0        69.0        90.0   
Science       82.0       97.0  

In [23]:
import pandas as pd

gradebook = {
    "Student_1": {"Math": 85, "Science": 90, "History": 78, "English": None, "Art": 92},
    "Student_2": {"Math": 74, "Science": None, "History": 88, "English": 80, "Art": 76},
    "Student_3": {"Math": None, "Science": 85, "History": 91, "English": 87, "Art": 70},
    "Student_4": {"Math": 92, "Science": 89, "History": None, "English": 95, "Art": 88},
    "Student_5": {"Math": 67, "Science": 73, "History": 80, "English": 85, "Art": None},
    "Student_6": {"Math": 88, "Science": 92, "History": 84, "English": 90, "Art": 86},
    "Student_7": {"Math": 76, "Science": None, "History": 79, "English": 83, "Art": 91},
    "Student_8": {"Math": 95, "Science": 97, "History": None, "English": 93, "Art": 89},
    "Student_9": {"Math": None, "Science": 82, "History": 85, "English": 88, "Art": 90},
    "Student_10": {"Math": 81, "Science": 87, "History": 90, "English": None, "Art": 78},
    "Student_11": {"Math": 69, "Science": 74, "History": 80, "English": 77, "Art": None},
    "Student_12": {"Math": None, "Science": 91, "History": 85, "English": 90, "Art": 93},
    "Student_13": {"Math": 86, "Science": None, "History": 88, "English": 82, "Art": 79},
    "Student_14": {"Math": 90, "Science": 85, "History": 92, "English": 88, "Art": None},
    "Student_15": {"Math": None, "Science": 80, "History": 75, "English": 85, "Art": 87},
}

# Convert to DataFrame
gradebook_df = pd.DataFrame(gradebook)

# Transpose the DataFrame (students as rows)
grades_df_t = gradebook_df.T

# Calculate column means and round them
column_means = grades_df_t.mean().round()

print("Column Means (Rounded):\n", column_means)

# Fill NaN values with column means
grades_df_t = grades_df_t.fillna(column_means)

print("\nUpdated DataFrame:\n", grades_df_t)

Column Means (Rounded):
 Math       82.0
Science    85.0
History    84.0
English    86.0
Art        85.0
dtype: float64

Updated DataFrame:
             Math  Science  History  English   Art
Student_1   85.0     90.0     78.0     86.0  92.0
Student_2   74.0     85.0     88.0     80.0  76.0
Student_3   82.0     85.0     91.0     87.0  70.0
Student_4   92.0     89.0     84.0     95.0  88.0
Student_5   67.0     73.0     80.0     85.0  85.0
Student_6   88.0     92.0     84.0     90.0  86.0
Student_7   76.0     85.0     79.0     83.0  91.0
Student_8   95.0     97.0     84.0     93.0  89.0
Student_9   82.0     82.0     85.0     88.0  90.0
Student_10  81.0     87.0     90.0     86.0  78.0
Student_11  69.0     74.0     80.0     77.0  85.0
Student_12  82.0     91.0     85.0     90.0  93.0
Student_13  86.0     85.0     88.0     82.0  79.0
Student_14  90.0     85.0     92.0     88.0  85.0
Student_15  82.0     80.0     75.0     85.0  87.0


Normalizing Data: Why It Matters

When working with datasets, especially in programming and data analytics, normalizing data is a crucial step to ensure consistency, readability, and ease of use. Normalization involves standardizing the data format, which can include renaming columns, formatting values consistently, and preparing the data for analysis. We'll use the mock_data_2 example to demonstrate key normalization techniques, including adding underscores to column names, handling file names with spaces, and converting data to uppercase.

1. Replacing Spaces with Underscores in Column Names

   Column names with spaces can lead to coding challenges. For instance, accessing columns with spaces often requires bracket notation (df['Product Name']), whereas standardized column names (df.Product_Name) provide cleaner, more readable code. Replacing spaces with underscores also improves compatibility with various tools and programming languages that may not handle spaces well.

In [24]:
import pandas as pd

# Original DataFrame
mock_data_2 = {
    "Product Name": ["LaPtOp", "PhOnE", "TaBlEt", "MoNiToR", "KeYbOaRd"],
    "Brand Name": ["DeLl", "ApPlE", "SaMsUnG", "Hp", "LoGiTeCh"],
    "Serial Number": ["SN12345", "SN67890", "SN54321", "SN09876", "SN11223"],
    "Purchase Date": ["2023-01-15", "2022-12-10", "2023-05-21", "2023-07-30", "2023-03-18"],
    "Warranty Status": ["AcTiVe", "ExPiReD", "AcTiVe", "ExPiReD", "AcTiVe"]
}

df = pd.DataFrame(mock_data_2)

# Normalizing column names by replacing spaces with underscores
df.columns = df.columns.str.replace(' ', '_')
print(df.columns)

Index(['Product_Name', 'Brand_Name', 'Serial_Number', 'Purchase_Date',
       'Warranty_Status'],
      dtype='object')
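
The difference in access style can be sketched with a two-column excerpt of mock_data_2. Bracket notation works before and after the rename, while attribute access only becomes possible once the spaces are gone.

```python
import pandas as pd

# Two columns taken from mock_data_2, shortened for this sketch
df = pd.DataFrame({
    "Product Name": ["LaPtOp", "PhOnE"],
    "Brand Name": ["DeLl", "ApPlE"],
})

# With a space in the name, only bracket notation is available
with_spaces = df["Product Name"]

# Replace spaces with underscores in every column name
df.columns = df.columns.str.replace(" ", "_")

with_brackets = df["Product_Name"]   # still works
with_attribute = df.Product_Name     # now works as well

print(df.columns.tolist())
print(with_attribute)
```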


2. The Importance of Avoiding Spaces in File Names

   When saving or loading files, spaces in file names can cause issues, especially in terminal commands or when using certain software. For example, to save a DataFrame to a CSV file, using underscores in the file name (mock_data_2.csv) is more practical than dealing with escape characters (mock\ data\ 2.csv).

In [25]:
import pandas as pd

# Original DataFrame
mock_data_2 = {
    "Product Name": ["LaPtOp", "PhOnE", "TaBlEt", "MoNiToR", "KeYbOaRd"],
    "Brand Name": ["DeLl", "ApPlE", "SaMsUnG", "Hp", "LoGiTeCh"],
    "Serial Number": ["SN12345", "SN67890", "SN54321", "SN09876", "SN11223"],
    "Purchase Date": ["2023-01-15", "2022-12-10", "2023-05-21", "2023-07-30", "2023-03-18"],
    "Warranty Status": ["AcTiVe", "ExPiReD", "AcTiVe", "ExPiReD", "AcTiVe"]
}

df = pd.DataFrame(mock_data_2)

# Normalizing column names by replacing spaces with underscores
# df.columns = df.columns.str.replace(' ', '_')

df.to_csv('mock_data_2.csv', index=False)  # Clean and simple file name

# Calling to_csv() without a path returns the CSV text, which lets us preview it
print(df.to_csv(index=False))

Product Name,Brand Name,Serial Number,Purchase Date,Warranty Status
LaPtOp,DeLl,SN12345,2023-01-15,AcTiVe
PhOnE,ApPlE,SN67890,2022-12-10,ExPiReD
TaBlEt,SaMsUnG,SN54321,2023-05-21,AcTiVe
MoNiToR,Hp,SN09876,2023-07-30,ExPiReD
KeYbOaRd,LoGiTeCh,SN11223,2023-03-18,AcTiVe
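
If file names come from somewhere we don't control, the same replace-spaces idea can be applied to the name before saving. The helper safe_file_name below is hypothetical (it is not a pandas function), and the sketch writes to a temporary folder so nothing is left behind.

```python
import tempfile
from pathlib import Path

import pandas as pd


def safe_file_name(name: str) -> str:
    """Hypothetical helper: trim the name and swap spaces for underscores."""
    return name.strip().replace(" ", "_")


df = pd.DataFrame({"Product Name": ["LaPtOp", "PhOnE"]})

file_name = safe_file_name("mock data 2.csv")

# Save into a temporary folder, then read the file back to confirm it worked
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / file_name
    df.to_csv(csv_path, index=False)
    round_trip = pd.read_csv(csv_path)

print(file_name)
print(round_trip)
```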


3. Converting Data to Uppercase

   A simple but effective normalization technique is converting all text data to uppercase using .map(str.upper). This approach ensures uniformity, making data comparisons easier and improving overall readability.

In [26]:
import pandas as pd

# Original DataFrame
mock_data_2 = {
    "Product Name": ["LaPtOp", "PhOnE", "TaBlEt", "MoNiToR", "KeYbOaRd"],
    "Brand Name": ["DeLl", "ApPlE", "SaMsUnG", "Hp", "LoGiTeCh"],
    "Serial Number": ["SN12345", "SN67890", "SN54321", "SN09876", "SN11223"],
    "Purchase Date": ["2023-01-15", "2022-12-10", "2023-05-21", "2023-07-30", "2023-03-18"],
    "Warranty Status": ["AcTiVe", "ExPiReD", "AcTiVe", "ExPiReD", "AcTiVe"]
}

df = pd.DataFrame(mock_data_2)

# Convert all string data to uppercase for consistency
# (DataFrame.map requires pandas 2.1 or later; older versions call it applymap)
df = df.map(str.upper)
print(df)

  Product Name Brand Name Serial Number Purchase Date Warranty Status
0       LAPTOP       DELL       SN12345    2023-01-15          ACTIVE
1        PHONE      APPLE       SN67890    2022-12-10         EXPIRED
2       TABLET    SAMSUNG       SN54321    2023-05-21          ACTIVE
3      MONITOR         HP       SN09876    2023-07-30         EXPIRED
4     KEYBOARD   LOGITECH       SN11223    2023-03-18          ACTIVE
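
The approach above works because every cell in mock_data_2 is text. str.upper raises a TypeError when it meets a number or a missing value, so for mixed data a safer sketch is to uppercase only the text columns with the .str accessor, which skips missing values. The Price column and the missing brand below are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical data: one numeric column and one missing text value
df = pd.DataFrame({
    "Product": ["LaPtOp", "PhOnE", "TaBlEt"],
    "Brand": ["DeLl", None, "SaMsUnG"],
    "Price": [999.99, 799.0, 450.5],
})

# Uppercase only the columns we know contain text
text_columns = ["Product", "Brand"]
for column in text_columns:
    df[column] = df[column].str.upper()

print(df)
```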


Our client, Super-R-US Store, provided us with sales data that requires normalization. It appears that whoever entered the data had some issues with their keyboard, resulting in inconsistent formatting. To standardize the dataset, we will update column names by replacing spaces with underscores to align with our naming conventions. Additionally, we'll convert all raw data to uppercase to match how records are stored in our system, ensuring uniformity and ease of use.

In [27]:
import pandas as pd

Customers = {
    "First Name": ["aLiCe", "bOb", "ChArLiE", "DaViD", "eVa"],
    "Last Name": ["SmItH", "JoNeS", "BrOwN", "WiLsOn", "TaYlOr"],
    "Email Address": ["aLiCe@example.com", "bOb@example.com", "cHaRlIe@example.com", "DaViD@example.com", "eVa@example.com"],
    "Phone Number": ["123-456-7890", "234-567-8901", "345-678-9012", "456-789-0123", "567-890-1234"],
    "Street Address": ["123 MaIn St", "456 SeCoNd Ave", "789 ThIrD Blvd", "101 FoUrTh Rd", "202 FiFtH Ln"]
}

Customers = pd.DataFrame(Customers)

#your code here 
Customers.columns = Customers.columns.str.replace(' ','_')
print(Customers.columns)

#your code here
Customers = Customers.map(str.upper)
print(Customers)

Index(['First_Name', 'Last_Name', 'Email_Address', 'Phone_Number',
       'Street_Address'],
      dtype='object')
  First_Name Last_Name        Email_Address  Phone_Number  Street_Address
0      ALICE     SMITH    ALICE@EXAMPLE.COM  123-456-7890     123 MAIN ST
1        BOB     JONES      BOB@EXAMPLE.COM  234-567-8901  456 SECOND AVE
2    CHARLIE     BROWN  CHARLIE@EXAMPLE.COM  345-678-9012  789 THIRD BLVD
3      DAVID    WILSON    DAVID@EXAMPLE.COM  456-789-0123   101 FOURTH RD
4        EVA    TAYLOR      EVA@EXAMPLE.COM  567-890-1234    202 FIFTH LN


In [28]:
import pandas as pd

url = 'https://raw.githubusercontent.com/CodeYouOrg/DataOpenClass/refs/heads/main/SalaryData.csv'
df = pd.read_csv(url)

print(df.head())

   CalYear Employee_Name          Department               jobTitle  \
0     2020           NaN  Parks & Recreation     Park Worker II-CDL   
1     2020           NaN  Parks & Recreation  Recreation Instructor   
2     2020           NaN  Parks & Recreation        Recreation Aide   
3     2020           NaN      Human Services  Staff Helper/Internal   
4     2020           NaN  Parks & Recreation        Recreation Aide   

   Annual_Rate  Regular_Rate  Overtime_Rate  Incentive_Allowance  Other  \
0      33321.6       1249.56            0.0                  0.0    NaN   
1      22880.0        569.25            0.0                  0.0    NaN   
2      21840.0        152.25            0.0                  0.0    NaN   
3      21008.0        333.30            0.0                  0.0    NaN   
4      21840.0        152.25            0.0                  0.0    NaN   

   YTD_Total  ObjectId  
0    1249.56         1  
1     569.25         2  
2     152.25         3  
3     333.30         4