# **Finding Missing Values**


Estimated time needed: **30** minutes


Data wrangling is the process of cleaning, transforming, and organizing data to make it suitable for analysis. Finding and handling missing values is a crucial step in this process to ensure data accuracy and completeness. In this lab, you will focus exclusively on identifying and handling missing values in the dataset.


## Objectives


After completing this lab, you will be able to:


- Identify missing values in the dataset.

- Quantify missing values for specific columns.

- Impute missing values using various strategies.


## Hands-on Lab


##### Setup: Install Required Libraries


In [None]:
!pip install pandas
!pip install matplotlib
!pip install seaborn

##### Import Necessary Modules:


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Tasks


<h2>1. Load the Dataset</h2>
<p>
We use the <code>pandas.read_csv()</code> function to read CSV files. This version of the lab runs on JupyterLite, so the dataset is loaded from a URL using the code provided below.
</p>


The code below loads the dataset from its URL into a DataFrame:



In [None]:
# Define the URL of the dataset
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())


In [4]:
df

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65432,65433,I am a developer by profession,18-24 years old,"Employed, full-time",Remote,Apples,Hobby;School or academic work,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","On the job training;School (i.e., University, ...",,...,,,,,,,,,,
65433,65434,I am a developer by profession,25-34 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects,,,,...,,,,,,,,,,
65434,65435,I am a developer by profession,25-34 years old,"Employed, full-time",In-person,Apples,Hobby,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Other online resources (e.g., videos, blogs, f...",Technical documentation;Stack Overflow;Social ...,...,,,,,,,,,,
65435,65436,I am a developer by profession,18-24 years old,"Employed, full-time","Hybrid (some remote, some in-person)",Apples,Hobby;Contribute to open-source projects;Profe...,"Secondary school (e.g. American high school, G...",On the job training;Other online resources (e....,Technical documentation;Blogs;Written Tutorial...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,


### 2. Explore the Dataset
##### Task 1: Display basic information and summary statistics of the dataset.


a. Get DataFrame Shape

In [5]:
# Find out the number of rows and columns:
print(f'The dataset contains {df.shape[0]} rows and {df.shape[1]} columns.')


The dataset contains 65437 rows and 114 columns.


b. Overview of Columns and Data Types

In [6]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65437 entries, 0 to 65436
Columns: 114 entries, ResponseId to JobSat
dtypes: float64(13), int64(1), object(100)
memory usage: 56.9+ MB


c. List of Column Names

In [7]:
print(df.columns.tolist())


['ResponseId', 'MainBranch', 'Age', 'Employment', 'RemoteWork', 'Check', 'CodingActivities', 'EdLevel', 'LearnCode', 'LearnCodeOnline', 'TechDoc', 'YearsCode', 'YearsCodePro', 'DevType', 'OrgSize', 'PurchaseInfluence', 'BuyNewTool', 'BuildvsBuy', 'TechEndorse', 'Country', 'Currency', 'CompTotal', 'LanguageHaveWorkedWith', 'LanguageWantToWorkWith', 'LanguageAdmired', 'DatabaseHaveWorkedWith', 'DatabaseWantToWorkWith', 'DatabaseAdmired', 'PlatformHaveWorkedWith', 'PlatformWantToWorkWith', 'PlatformAdmired', 'WebframeHaveWorkedWith', 'WebframeWantToWorkWith', 'WebframeAdmired', 'EmbeddedHaveWorkedWith', 'EmbeddedWantToWorkWith', 'EmbeddedAdmired', 'MiscTechHaveWorkedWith', 'MiscTechWantToWorkWith', 'MiscTechAdmired', 'ToolsTechHaveWorkedWith', 'ToolsTechWantToWorkWith', 'ToolsTechAdmired', 'NEWCollabToolsHaveWorkedWith', 'NEWCollabToolsWantToWorkWith', 'NEWCollabToolsAdmired', 'OpSysPersonal use', 'OpSysProfessional use', 'OfficeStackAsyncHaveWorkedWith', 'OfficeStackAsyncWantToWorkWith',

##### Check for Missing Values

In [11]:
# Total missing values per column
missing_values = df.isnull().sum().sort_values(ascending=False)
print(missing_values)


AINextMuch less integrated    64289
AINextLess integrated         63082
AINextNo change               52939
AINextMuch more integrated    51999
EmbeddedAdmired               48704
                              ...  
MainBranch                        0
Check                             0
Employment                        0
Age                               0
ResponseId                        0
Length: 114, dtype: int64


##### Visualizing Missing Data

In [None]:
# Using a heatmap to visualize:
plt.figure(figsize=(12,8))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Heatmap')
plt.show()


##### Summary Statistics of Numerical Columns

a. Default Summary

In [14]:
df.describe()


Unnamed: 0,ResponseId,CompTotal,WorkExp,JobSatPoints_1,JobSatPoints_4,JobSatPoints_5,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,ConvertedCompYearly,JobSat
count,65437.0,33740.0,29658.0,29324.0,29393.0,29411.0,29450.0,29448.0,29456.0,29456.0,29450.0,29445.0,23435.0,29126.0
mean,32719.0,2.963841e+145,11.466957,18.581094,7.52214,10.060857,24.343232,22.96522,20.278165,16.169432,10.955713,9.953948,86155.29,6.935041
std,18890.179119,5.444117e+147,9.168709,25.966221,18.422661,21.833836,27.08936,27.01774,26.10811,24.845032,22.906263,21.775652,186757.0,2.088259
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,16360.0,60000.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,32712.0,6.0
50%,32719.0,110000.0,9.0,10.0,0.0,0.0,20.0,15.0,10.0,5.0,0.0,0.0,65000.0,7.0
75%,49078.0,250000.0,16.0,22.0,5.0,10.0,30.0,30.0,25.0,20.0,10.0,10.0,107971.5,8.0
max,65437.0,1e+150,50.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,16256600.0,10.0


b. Including All Columns

To get statistics on all columns, including categorical ones:

In [15]:
df.describe(include='all')


Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
count,65437.0,65437,65437,65437,54806,65437,54466,60784,60488,49237,...,29450.0,29448.0,29456.0,29456.0,29450.0,29445.0,56182,56238,23435.0,29126.0
unique,,5,8,110,3,1,118,8,418,10853,...,,,,,,,3,3,,
top,,I am a developer by profession,25-34 years old,"Employed, full-time","Hybrid (some remote, some in-person)",Apples,Hobby,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Appropriate in length,Easy,,
freq,,50207,23911,39041,23015,65437,9993,24942,3674,603,...,,,,,,,38767,30071,,
mean,32719.0,,,,,,,,,,...,24.343232,22.96522,20.278165,16.169432,10.955713,9.953948,,,86155.29,6.935041
std,18890.179119,,,,,,,,,,...,27.08936,27.01774,26.10811,24.845032,22.906263,21.775652,,,186757.0,2.088259
min,1.0,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,,,1.0,0.0
25%,16360.0,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,,,32712.0,6.0
50%,32719.0,,,,,,,,,,...,20.0,15.0,10.0,5.0,0.0,0.0,,,65000.0,7.0
75%,49078.0,,,,,,,,,,...,30.0,30.0,25.0,20.0,10.0,10.0,,,107971.5,8.0


##### Summary Statistics of Categorical Columns

a. Common Values in Categorical Columns

For a quick overview:

In [16]:
categorical_cols = df.select_dtypes(include=['object']).columns

for col in categorical_cols:
    print(f"\nColumn: {col}")
    print(df[col].value_counts().head())



Column: MainBranch
MainBranch
I am a developer by profession                                                           50207
I am not primarily a developer, but I write code sometimes as part of my work/studies     6511
I am learning to code                                                                     3875
I code primarily as a hobby                                                               3334
I used to be a developer by profession, but no longer am                                  1510
Name: count, dtype: int64

Column: Age
Age
25-34 years old    23911
35-44 years old    14942
18-24 years old    14098
45-54 years old     6249
55-64 years old     2575
Name: count, dtype: int64

Column: Employment
Employment
Employed, full-time                                                         39041
Independent contractor, freelancer, or self-employed                         4846
Student, full-time                                                           4709
Employed, full-time;Ind

b. Unique Values per Column

In [17]:
unique_values = df.nunique()
print(unique_values)


ResponseId             65437
MainBranch                 5
Age                        8
Employment               110
RemoteWork                 3
                       ...  
JobSatPoints_11           79
SurveyLength               3
SurveyEase                 3
ConvertedCompYearly     6113
JobSat                    11
Length: 114, dtype: int64


##### Frequency Distribution

a. Employment Status Distribution

In [19]:
employment_counts = df['Employment'].value_counts()
print(employment_counts)


Employment
Employed, full-time                                                                                                                                  39041
Independent contractor, freelancer, or self-employed                                                                                                  4846
Student, full-time                                                                                                                                    4709
Employed, full-time;Independent contractor, freelancer, or self-employed                                                                              3557
Not employed, but looking for work                                                                                                                    2341
                                                                                                                                                     ...  
Employed, full-time;Student, full-time;Independent contract

Visualization:

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(y='Employment', data=df, order=employment_counts.index, palette='coolwarm')
plt.title('Employment Status Distribution')
plt.xlabel('Count')
plt.ylabel('Employment Status')
plt.show()


b. MainBranch Distribution

In [21]:
mainbranch_counts = df['MainBranch'].value_counts()
print(mainbranch_counts)


MainBranch
I am a developer by profession                                                           50207
I am not primarily a developer, but I write code sometimes as part of my work/studies     6511
I am learning to code                                                                     3875
I code primarily as a hobby                                                               3334
I used to be a developer by profession, but no longer am                                  1510
Name: count, dtype: int64


##### Correlation Analysis

In [None]:
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
corr_matrix = df[numerical_cols].corr()

plt.figure(figsize=(12,8))
sns.heatmap(corr_matrix, annot=True, cmap='Blues')
plt.title('Correlation Matrix')
plt.show()


##### Data Types and Memory Usage
Analyzing data types helps optimize memory usage:

In [25]:
# Data types
print(df.dtypes)

ResponseId               int64
MainBranch              object
Age                     object
Employment              object
RemoteWork              object
                        ...   
JobSatPoints_11        float64
SurveyLength            object
SurveyEase              object
ConvertedCompYearly    float64
JobSat                 float64
Length: 114, dtype: object


In [27]:
# Memory usage
memory_usage = df.memory_usage(deep=True).sort_values(ascending=False)
print(memory_usage)

AIChallenges       12386773
LearnCode          10425960
SOHow               9776632
LearnCodeOnline     8948411
EdLevel             8869497
                     ...   
JobSatPoints_1       523496
WorkExp              523496
ResponseId           523496
CompTotal            523496
Index                   132
Length: 115, dtype: int64
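
As a hedged illustration of how data types affect memory, the sketch below converts a repetitive text column to the `category` dtype and compares `memory_usage(deep=True)` before and after. The `toy` DataFrame and its values are made up for the example; they are not taken from the survey data.

```python
import pandas as pd

# Hypothetical toy data: a text column with only three distinct values
toy = pd.DataFrame({"Age": ["25-34 years old", "35-44 years old", "18-24 years old"] * 1000})

before = toy["Age"].memory_usage(deep=True)

# 'category' stores each distinct string once plus a small integer code per row
as_category = toy["Age"].astype("category")
after = as_category.memory_usage(deep=True)

print(f"text column:     {before} bytes")
print(f"category column: {after} bytes")
```

The saving depends on how few distinct values a column has, so columns with thousands of unique strings (such as free-form multi-select answers) may benefit much less.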


### 3. Finding Missing Values
##### Task 2: Identify missing values for all columns.


##### Task 3: Visualize missing values using a heatmap (Using seaborn library).



##### Task 4: Count the number of missing values in a specific column (e.g., `Employment`).


1. Using isnull().sum()
The simplest way to count missing values in a specific column is by using the isnull() method combined with sum().

In [28]:
# Count missing values in the 'Employment' column
missing_employment = df['Employment'].isnull().sum()
print(f"Number of missing values in 'Employment' column: {missing_employment}")


Number of missing values in 'Employment' column: 0


Explanation:

- df['Employment']: Accesses the 'Employment' column in your DataFrame.

- .isnull(): Returns a Boolean Series where True indicates missing (NaN) values.

- .sum(): Sums up the True values, effectively counting the number of missing entries.
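
The three steps above can be traced on a tiny example. This is a minimal sketch on a made-up Series (the values are assumptions, not survey data); it also shows that `.mean()` on the same Boolean mask gives the fraction of missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries
s = pd.Series(["Remote", None, "In-person", np.nan, "Remote"], name="RemoteWork")

mask = s.isnull()             # Boolean Series: True where the value is missing
missing_count = mask.sum()    # True counts as 1, False as 0
missing_share = mask.mean()   # mean of a Boolean Series is the fraction of True values

print(mask.tolist())
print(missing_count, missing_share)
```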

2. Calculating the Percentage of Missing Values

Understanding the proportion of missing data relative to the total dataset can provide context.

In [29]:
# Total number of rows in the DataFrame
total_rows = df.shape[0]

# Calculate the percentage of missing values
percent_missing = (missing_employment / total_rows) * 100
print(f"Percentage of missing values in 'Employment' column: {percent_missing:.2f}%")


Percentage of missing values in 'Employment' column: 0.00%


3. Investigating Rows with Missing Employment Data
Understanding where the missing data occurs can inform your data cleaning strategy.

In [None]:
# Display the first few rows where 'Employment' is missing
missing_employment_rows = df[df['Employment'].isnull()]
print("Sample rows with missing 'Employment' data:")
missing_employment_rows.head()


Considerations:

- Check if these rows have missing values in other important columns.

- Determine if the missingness is random or has a pattern (e.g., specific age groups, countries).
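
One way to look for a pattern is to compute the missing rate of one column within each group of another. The sketch below uses a made-up DataFrame in which `RemoteWork` is missing only for students; the data and the choice of grouping column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'RemoteWork' is missing only for students
toy = pd.DataFrame({
    "Employment": ["Employed, full-time", "Student, full-time", "Employed, full-time",
                   "Student, full-time", "Employed, full-time"],
    "RemoteWork": ["Remote", None, "In-person", None, "Remote"],
})

# Fraction of missing 'RemoteWork' values within each 'Employment' group
missing_rate_by_group = toy["RemoteWork"].isnull().groupby(toy["Employment"]).mean()
print(missing_rate_by_group)
```

Rates that differ sharply between groups suggest the missingness is related to that column rather than random.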

### 4. Imputing Missing Values
##### Task 5: Identify the most frequent (majority) value in a specific column (e.g., `Employment`).


##### Handling Missing Employment Data
Depending on your analysis goals, you might choose to handle the missing data in different ways.

a. Dropping Missing Values
If the missing data is minimal and random:

In [None]:
# # Drop rows where 'Employment' is missing
# df_cleaned = df.dropna(subset=['Employment'])

# # Verify the number of rows after dropping
# print(f"Number of rows after dropping missing 'Employment' data: {df_cleaned.shape[0]}")


b. Imputing Missing Values

If you prefer to retain all data, you can fill missing values with a placeholder:

In [None]:
# # Fill missing 'Employment' entries with 'Unknown'
# df['Employment'] = df['Employment'].fillna('Unknown')
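
Since `Employment` has no missing values in this dataset, the two options are easier to compare on a small made-up DataFrame. The sketch below is illustrative only; the `toy` data is an assumption.

```python
import pandas as pd

# Hypothetical toy data with one missing 'Employment' entry
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "Employment": ["Employed, full-time", None, "Student, full-time", "Employed, full-time"],
})

# Option a: drop rows where 'Employment' is missing (returns a new DataFrame)
dropped = toy.dropna(subset=["Employment"])

# Option b: keep every row and mark the gap with a placeholder
filled = toy.assign(Employment=toy["Employment"].fillna("Unknown"))

print(len(toy), len(dropped), len(filled))
print(filled["Employment"].tolist())
```

Both calls leave `toy` itself unchanged, which makes it easy to compare the results before committing to one approach.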


##### Function to Count Missing Values in Any Column
To make your analysis more flexible, you can create a function that counts missing values for any specified column.

In [31]:
def count_missing_values(dataframe, column_name):
    total_rows = dataframe.shape[0]
    missing_count = dataframe[column_name].isnull().sum()
    missing_percent = (missing_count / total_rows) * 100
    print(f"Column: {column_name}")
    print(f" - Total Rows: {total_rows}")
    print(f" - Missing Values: {missing_count}")
    print(f" - Percentage Missing: {missing_percent:.2f}%\n")

# Example usage for 'Employment' column
count_missing_values(df, 'Employment')

# You can apply this function to other columns as needed
count_missing_values(df, 'Age')
count_missing_values(df, 'RemoteWork')


Column: Employment
 - Total Rows: 65437
 - Missing Values: 0
 - Percentage Missing: 0.00%

Column: Age
 - Total Rows: 65437
 - Missing Values: 0
 - Percentage Missing: 0.00%

Column: RemoteWork
 - Total Rows: 65437
 - Missing Values: 10631
 - Percentage Missing: 16.25%



##### Additional Insights

a. Comparing Missing Data Across Columns

Check whether rows missing 'Employment' are also missing other critical data.

In [32]:
# Find rows where 'Employment' and 'Age' are both missing
missing_both = df[df['Employment'].isnull() & df['Age'].isnull()]
print(f"Number of rows missing both 'Employment' and 'Age': {missing_both.shape[0]}")


Number of rows missing both 'Employment' and 'Age': 0


b. Analyzing the Impact of Missing Data

Understanding how missing data might affect your analysis is crucial.

- Bias: If data is not missing at random, analyses may be biased.

- Imputation Strategies: Advanced methods like multiple imputation can help when data is missing at random (MAR). Data that is missing not at random (MNAR) usually also requires modeling the missingness mechanism itself.
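
To make the bias point concrete, the sketch below shows how filling gaps with the most frequent value inflates that value's share. The Series is made up (6 observed answers, 4 missing) and is not survey data.

```python
import pandas as pd

# Hypothetical toy column: 6 observed answers and 4 missing ones
s = pd.Series(["Remote"] * 3 + ["Hybrid"] * 2 + ["In-person"] + [None] * 4)

observed_share = s.value_counts(normalize=True)  # shares among observed answers only
most_frequent = s.mode()[0]                      # mode() ignores missing values by default
imputed_share = s.fillna(most_frequent).value_counts(normalize=True)

print(observed_share["Remote"], imputed_share["Remote"])
```

If the respondents who skipped the question were not in fact mostly 'Remote', the imputed column overstates that category, so the larger the missing share, the more caution this strategy needs.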

##### Summary of Missing Values in All Columns

To get a broader picture:

In [33]:
# Calculate missing values for all columns
missing_summary = df.isnull().sum().sort_values(ascending=False)

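# Create a DataFrame to display counts and percentages
# (the constructor aligns the two Series by column name, so the rows come out
# in alphabetical order rather than in the sorted order of missing_summary)
missing_percent = (df.isnull().sum() / df.shape[0]) * 100
missing_data = pd.DataFrame({'Missing Count': missing_summary, 'Percentage': missing_percent})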

# Display columns with missing data
missing_data = missing_data[missing_data['Missing Count'] > 0]
print("Summary of Missing Values in the Dataset:")
missing_data


Summary of Missing Values in the Dataset:


Unnamed: 0,Missing Count,Percentage
AIAcc,28135,42.995553
AIBen,28543,43.619053
AIChallenges,27906,42.645598
AIComplex,28416,43.424974
AIEthics,23889,36.506869
...,...,...
WebframeHaveWorkedWith,20276,30.985528
WebframeWantToWorkWith,26902,41.111298
WorkExp,35779,54.677018
YearsCode,5568,8.508948
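
If you want the summary ordered by how much data is missing, one option is to sort after the table is built. The helper below is a hypothetical sketch (`summarize_missing` is not part of the lab), shown on made-up data.

```python
import numpy as np
import pandas as pd

def summarize_missing(dataframe):
    """Return missing counts and percentages per column, largest first (hypothetical helper)."""
    counts = dataframe.isnull().sum()
    summary = pd.DataFrame({
        "Missing Count": counts,
        "Percentage": counts / len(dataframe) * 100,
    })
    # Keep only columns with gaps, then sort so the order is explicit
    summary = summary[summary["Missing Count"] > 0]
    return summary.sort_values("Missing Count", ascending=False)

# Hypothetical toy data
toy = pd.DataFrame({
    "A": [1, np.nan, 3, 4],
    "B": [np.nan, np.nan, np.nan, 4],
    "C": [1, 2, 3, 4],
})
result = summarize_missing(toy)
print(result)
```

Calling `summarize_missing(df)` on the survey DataFrame should list the columns with the most missing values first.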


##### Steps to Identify the Most Frequent Value in the Employment Column

- Check for Missing Values: Ensure that the column doesn't have a significant number of missing values that might affect the analysis.

- Calculate Value Counts: Use the value_counts() method to get the frequency of each unique value in the column.

- Extract the Most Frequent Value: Identify the value with the highest count.

- Calculate the Percentage: Determine what percentage of the total responses this value represents.

- Visualize the Distribution: Create a bar chart to visualize the frequencies of all categories.
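
The first four steps can be sketched on a small made-up column as follows. The values are assumptions for illustration, and `Series.mode()` is shown as a shorter alternative to `value_counts().idxmax()`.

```python
import pandas as pd

# Hypothetical toy column with one missing entry
s = pd.Series(["Employed, full-time", "Student, full-time", None,
               "Employed, full-time", "Employed, full-time"], name="Employment")

counts = s.value_counts()                       # missing values are excluded by default
top_value = counts.idxmax()                     # label with the highest count
top_share = counts.max() / counts.sum() * 100   # share among non-missing answers

# Series.mode() returns every value tied for most frequent; take the first
same_as_mode = s.mode()[0] == top_value

print(top_value, f"{top_share:.2f}%", same_as_mode)
```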

##### Handling Missing Values

If there are missing values, you might choose to:

- Exclude them from the analysis.

- Fill them with a placeholder like 'Unknown' or 'Other'.

For accurate frequency counts, we'll exclude missing values:

In [35]:
# Exclude missing values
employment_data = df['Employment'].dropna()


##### Calculate Value Counts for Employment

Use the value_counts() method:

In [36]:
# Get the frequency of each unique value
employment_counts = employment_data.value_counts()

print("Frequency of each employment status:")
print(employment_counts)


Frequency of each employment status:
Employment
Employed, full-time                                                                                                                                  39041
Independent contractor, freelancer, or self-employed                                                                                                  4846
Student, full-time                                                                                                                                    4709
Employed, full-time;Independent contractor, freelancer, or self-employed                                                                              3557
Not employed, but looking for work                                                                                                                    2341
                                                                                                                                                     ...  
Employed, full-time;St

##### Identify the Most Frequent Value

Extract the top entry:

In [37]:
# Get the most frequent value
most_frequent_value = employment_counts.idxmax()
most_frequent_count = employment_counts.max()

print(f"\nThe most frequent employment status is: '{most_frequent_value}'")
print(f"Number of respondents with this status: {most_frequent_count}")



The most frequent employment status is: 'Employed, full-time'
Number of respondents with this status: 39041


##### Calculate the Percentage of the Most Frequent Value

To understand how dominant this category is:

In [38]:
# Calculate percentage
percentage = (most_frequent_count / employment_data.shape[0]) * 100
print(f"Percentage of respondents: {percentage:.2f}%")


Percentage of respondents: 59.66%


##### Insights from the Analysis

Based on the sample analysis:

- Dominant Employment Status: The majority of respondents are 'Employed, full-time', making up 59.66% of the dataset.

- Other Significant Categories: 'Independent contractor, freelancer, or self-employed' (4,846 respondents) and 'Student, full-time' (4,709 respondents) also have notable representation.

- Minor Categories: Categories like 'Retired' or 'Not employed, and not looking for work' have relatively few respondents.

##### Function to Find Most Frequent Value in Any Column
To make your analysis flexible, you can create a function:

In [39]:
def find_most_frequent_value(dataframe, column_name):
    # Drop missing values
    data = dataframe[column_name].dropna()
    
    # Get value counts
    counts = data.value_counts()
    
    # Get the most frequent value
    most_common_value = counts.idxmax()
    most_common_count = counts.max()
    total = data.shape[0]
    percentage = (most_common_count / total) * 100
    
    # Print the results
    print(f"Most frequent value in '{column_name}': '{most_common_value}'")
    print(f"Count: {most_common_count} ({percentage:.2f}% of {total} responses)")
    
    return counts

# Example usage for 'Employment' column
employment_counts = find_most_frequent_value(df, 'Employment')


Most frequent value in 'Employment': 'Employed, full-time'
Count: 39041 (59.66% of 65437 responses)


##### Task 6: Impute missing values in the `Employment` column with the most frequent value.



In [40]:
# Impute missing values in 'Employment' with the most frequent value
df['Employment'] = df['Employment'].fillna(most_frequent_value)


Explanation:

- .fillna(value) returns a copy of the column with all NaNs replaced by the specified value.

- Assigning the result back to df['Employment'] stores the filled column in the DataFrame. Calling df['Employment'].fillna(value, inplace=True) instead raises a warning in recent pandas versions and will stop working in pandas 3.0, because df['Employment'] behaves as a copy there and the original DataFrame is never updated.
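
pandas recommends two forms in place of a chained `inplace=True` call: assigning the filled column back, or calling `fillna` on the DataFrame with a `{column: value}` mapping. The sketch below compares them on a made-up DataFrame; the data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data with one missing entry
toy = pd.DataFrame({"Employment": ["Employed, full-time", None, "Employed, full-time", "Retired"]})
most_frequent = toy["Employment"].mode()[0]

# Form 1: assign the filled column back to the DataFrame
form1 = toy.copy()
form1["Employment"] = form1["Employment"].fillna(most_frequent)

# Form 2: call fillna on the DataFrame with a {column: value} mapping
form2 = toy.fillna({"Employment": most_frequent})

print(form1["Employment"].tolist())
print(form1.equals(form2))
```

The mapping form is convenient when several columns each need a different fill value.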

##### Verify the Imputation
Let's make sure all missing values have been filled.

In [41]:
# Count missing values in 'Employment' after imputation
missing_after = df['Employment'].isnull().sum()
print(f"Missing values after imputation: {missing_after}")

# Confirm the imputation was successful
if missing_after == 0:
    print("All missing values have been successfully imputed.")
else:
    print("There are still missing values remaining.")


Missing values after imputation: 0
All missing values have been successfully imputed.


##### Analyze the Updated Employment Column
Compare the distribution after imputation with the earlier counts. Because 'Employment' had no missing values in this dataset, the counts are unchanged.

In [42]:
# Get the value counts of 'Employment' after imputation
employment_counts = df['Employment'].value_counts()
print("\nUpdated Employment Status Counts:")
print(employment_counts)



Updated Employment Status Counts:
Employment
Employed, full-time                                                                                                                                  39041
Independent contractor, freelancer, or self-employed                                                                                                  4846
Student, full-time                                                                                                                                    4709
Employed, full-time;Independent contractor, freelancer, or self-employed                                                                              3557
Not employed, but looking for work                                                                                                                    2341
                                                                                                                                                     ...  
Employed, full-time;Stud

### 5. Visualizing Imputed Data
##### Task 7: Visualize the distribution of a column after imputation (e.g., `Employment`).


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for visualization
sns.set(style='whitegrid')

plt.figure(figsize=(10,6))
ax = sns.countplot(y='Employment', data=df, order=employment_counts.index, palette='coolwarm')

# Add count labels next to the bars
for p in ax.patches:
    width = p.get_width()
    plt.text(width + 1, p.get_y() + p.get_height()/2, int(width), va='center')

# Set titles and labels
plt.title('Employment Status Distribution After Imputation', fontsize=16)
plt.xlabel('Number of Respondents', fontsize=14)
plt.ylabel('Employment Status', fontsize=14)
plt.show()
