## Introduction

This notebook demonstrates basic data cleaning and integrity checks on the dataset `us_census_income_data_clean.csv`.  
The purpose is not to perform detailed data manipulation, but to confirm that the dataset is properly structured and ready for analysis.  

The following steps are included:
- Displaying sample rows to verify column alignment and values.
- Checking for missing values across all columns.
- Identifying and removing duplicate rows.
- Confirming the total number of records after cleaning.
- Reviewing column data types to ensure consistency.

These checks provide confidence that the dataset is clean and reliable for further use in the project.
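The checks listed above can also be gathered into a single helper. The following is only a minimal sketch: the function name `integrity_report` and the toy `sample` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def integrity_report(df: pd.DataFrame) -> dict:
    """Collect row count, missing values, duplicates, and dtypes in one dict.

    Hypothetical helper for illustration; the notebook runs these checks
    one cell at a time instead.
    """
    return {
        "rows": len(df),
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Toy frame standing in for the real CSV (values are illustrative only)
sample = pd.DataFrame(
    {
        "age": [39, 50, 39],
        "workclass": ["state_gov", "private", "state_gov"],
    }
)

report = integrity_report(sample)
print(report)
```

On the real data, the same function could be called as `integrity_report(df)` once the CSV has been loaded.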

In [18]:
import pandas as pd

# Load the dataset
df = pd.read_csv("us_census_income_data_clean.csv")

# Set display options for full-width view
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 1000)

# Display first 5 rows
display(df.head())



| | age | workclass | education_num | marital_status | occupation | relationship | race | gender | hours_per_week | native_country | capital | income_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | state_gov | 13 | never_married | adm_clerical | not_in_family | white | male | 40 | united_states | 2174 | <=50k |
| 1 | 50 | self_emp_not_inc | 13 | married_civ_spouse | exec_managerial | husband | white | male | 13 | united_states | 0 | <=50k |
| 2 | 38 | private | 9 | divorced | handlers_cleaners | not_in_family | white | male | 40 | united_states | 0 | <=50k |
| 3 | 53 | private | 7 | married_civ_spouse | handlers_cleaners | husband | other | male | 40 | united_states | 0 | <=50k |
| 4 | 28 | private | 13 | married_civ_spouse | prof_specialty | wife | other | female | 40 | other | 0 | <=50k |


In [19]:
# Display total number of rows
display(len(df))

45222

In [20]:
# Display column data types
display(df.dtypes)

age                int64
workclass         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
gender            object
hours_per_week     int64
native_country    object
capital            int64
income_status     object
dtype: object
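Reading the dtype listing by eye works for twelve columns, but the same review can be expressed as a check. The sketch below assumes that the four `int64` columns shown above are the ones expected to be numeric; `EXPECTED_NUMERIC` and `non_numeric_columns` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Assumption: these are the columns that should hold numeric values
EXPECTED_NUMERIC = ["age", "education_num", "hours_per_week", "capital"]


def non_numeric_columns(df: pd.DataFrame, expected_numeric: list) -> list:
    """Return expected-numeric columns that are absent or not numeric."""
    return [
        col
        for col in expected_numeric
        if col not in df.columns or not pd.api.types.is_numeric_dtype(df[col])
    ]


# Toy frame with one column deliberately stored as text
sample = pd.DataFrame(
    {
        "age": [39, 50],
        "education_num": [13, 13],
        "hours_per_week": ["40", "13"],  # text instead of integers
        "capital": [2174, 0],
    }
)

problems = non_numeric_columns(sample, EXPECTED_NUMERIC)
print(problems)
```

An empty list would indicate that every expected column is present and numeric.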

In [21]:
# Display missing values per column
display(df.isnull().sum())

age               0
workclass         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
gender            0
hours_per_week    0
native_country    0
capital           0
income_status     0
dtype: int64

In [22]:
# Display number of duplicate rows
display(df.duplicated().sum())

np.int64(6099)
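The count alone does not show which rows repeat. Since the columns shown include no unique record identifier, it may be worth looking at the repeated rows before dropping them. The following sketch uses a toy frame with illustrative values; on the real data, `sample` would be replaced by `df`.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
sample = pd.DataFrame(
    {
        "age": [39, 50, 39, 28],
        "occupation": [
            "adm_clerical",
            "exec_managerial",
            "adm_clerical",
            "prof_specialty",
        ],
    }
)

# keep=False flags every member of a duplicate group, not only the repeats
all_copies = sample[sample.duplicated(keep=False)]

# Count how often each distinct row occurs and keep those seen more than once
group_sizes = sample.groupby(list(sample.columns)).size().reset_index(name="count")
repeated = group_sizes[group_sizes["count"] > 1]

print(all_copies)
print(repeated)
```

Note that `groupby` skips rows with missing keys by default, which is not a concern here because no missing values were found.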

In [23]:
import pandas as pd

# Load the dataset
df = pd.read_csv("us_census_income_data_clean.csv")

# Check total rows and duplicates before cleaning
print("Total rows before cleaning:", len(df))
print("Duplicate rows:", df.duplicated().sum())

# Drop duplicates
df = df.drop_duplicates()

# Confirm after cleaning
print("Total rows after cleaning:", len(df))

Total rows before cleaning: 45222
Duplicate rows: 6099
Total rows after cleaning: 39123
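The three printed figures are related by simple arithmetic: rows before cleaning minus duplicate rows equals rows after cleaning. A minimal sketch of that consistency check follows; the helper name `drop_duplicates_checked` and the toy frame are assumptions made for illustration.

```python
import pandas as pd


def drop_duplicates_checked(df: pd.DataFrame):
    """Drop exact duplicate rows and verify the row counts add up."""
    rows_before = len(df)
    n_duplicates = int(df.duplicated().sum())
    deduped = df.drop_duplicates()

    # Each flagged duplicate should account for exactly one removed row
    assert len(deduped) == rows_before - n_duplicates
    # No duplicates should remain after cleaning
    assert not deduped.duplicated().any()
    return deduped, n_duplicates


# Toy frame: rows 2 and 4 repeat row 0 (values are illustrative only)
sample = pd.DataFrame(
    {
        "age": [39, 50, 39, 28, 39],
        "gender": ["male", "male", "male", "female", "male"],
    }
)

deduped, removed = drop_duplicates_checked(sample)
print(len(sample), removed, len(deduped))

# The figures reported in the notebook satisfy the same relation
assert 45222 - 6099 == 39123
```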


## Conclusion

The cleaning process confirmed that:
- The dataset contained 45,222 rows initially.
- 6,099 duplicate rows were identified and removed.
- The final dataset consists of 39,123 unique records.
- No missing values were found across columns.
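- Column data types are consistent: the four numeric columns (`age`, `education_num`, `hours_per_week`, `capital`) are `int64`, and the remaining eight categorical columns are `object`.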

With duplicates removed and no missing values found, the dataset is suitable for analysis or reporting.  
Although the project does not focus on detailed data manipulation, these steps demonstrate good practice in preparing data for academic and professional work.