# **Lab: Exploratory Data Analysis**


Estimated time needed: **30** minutes


In this lab, you will perform exploratory data analysis (EDA) on a cleaned dataset.


## Objectives


In this lab, you will perform the following:


- Examine the structure of a dataset.

- Handle missing values effectively.

- Compute summary statistics on key columns.

- Analyze employment status, job satisfaction, programming language usage, and trends in remote work.


## Hands-on Lab


#### Step 1: Install and Import Libraries


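Import the libraries needed for data manipulation and visualization. If any of them is not installed yet, install it first (for example, with `pip install pandas numpy matplotlib seaborn`).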


In [41]:
def import_libraries():
    """
    Imports necessary libraries for data manipulation and visualization.
    """
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Set visualization styles
    sns.set(style='whitegrid')
    plt.rcParams['figure.figsize'] = (12, 6)

    print("Libraries imported successfully.")
    return pd, np, plt, sns


In [42]:
pd, np, plt, sns = import_libraries()

Libraries imported successfully.

Explanation:

This function imports the essential libraries:

- pandas for data manipulation.

- numpy for numerical operations.

- matplotlib.pyplot and seaborn for data visualization.

We also set some default styles for better-looking plots.
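
Before importing, it can help to confirm that the libraries are actually available. The following is a minimal sketch using only the standard library; the helper name `find_missing` and the `REQUIRED` list are illustrative and not part of the lab.

```python
import importlib.util

# Package names follow the imports used in this lab.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn"]


def find_missing(packages):
    """Return the names in `packages` that cannot be located for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

`importlib.util.find_spec` only locates a package without importing it, so the check is cheap and has no side effects.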

#### Step 2: Load and Preview the Dataset
Load the dataset from the provided URL. Use df.head() to display the first few rows to get an overview of the structure.


In [43]:
def load_and_preview_dataset(url):
    """
    Loads the dataset from the provided URL and displays the first few rows.
    Parameters:
    - url (str): The URL to load the dataset from.
    Returns:
    - df (DataFrame): The loaded DataFrame.
    """
    pd, _, _, _ = import_libraries()
    try:
        df = pd.read_csv(url)
        print("Dataset loaded successfully.")
        print("First five rows of the dataset:")
        print(df.head())
        return df
    except Exception as e:
        print("An error occurred while loading the dataset:", e)
        return None


Explanation:

- The function loads data from a given URL.

- It displays the first few rows to give you an overview.

- Error handling is included to catch any issues during loading.
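
Beyond `df.head()`, the shape, column types, and missing-value counts give a fuller picture of the structure. The sketch below runs on a small made-up frame rather than the survey data; `summarize_structure` and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration.
sample = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "Employment": ["Employed, full-time", "Employed, full-time", "Student, full-time", None],
    "RemoteWork": ["Remote", None, None, "Hybrid"],
})


def summarize_structure(frame):
    """Return the shape, per-column dtypes, and per-column missing counts."""
    return {
        "shape": frame.shape,
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "missing": frame.isna().sum().to_dict(),
    }


summary = summarize_structure(sample)
print(summary["shape"])
print(summary["dtypes"])
print(summary["missing"])
```

The same helper could be called on the loaded `df` once `load_and_preview_dataset` has returned successfully.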

In [44]:
dataset_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv'
df = load_and_preview_dataset(dataset_url)

Libraries imported successfully.


Dataset loaded successfully.
First five rows of the dataset:
   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                            

#### Step 3: Handling Missing Data


Identify and manage missing values in critical columns such as `Employment`, `JobSat`, and `RemoteWork`. Implement a strategy to fill or drop these values, depending on the significance of the missing data.


In [45]:
def handle_missing_data(df, critical_columns):
    """
    Identifies and manages missing values in critical columns.
    Parameters:
    - df (DataFrame): The DataFrame to process.
    - critical_columns (list): List of critical columns to check.
    Returns:
    - df (DataFrame): The DataFrame after handling missing data.
    """
    pd, _, _, _ = import_libraries()
    print("Handling missing data...")

    # Check initial shape
    initial_shape = df.shape
    print(f"Initial dataset shape: {initial_shape}")

    # Drop rows with missing values in critical columns
    df = df.dropna(subset=critical_columns)

    # Check final shape
    final_shape = df.shape
    print(f"Final dataset shape after dropping missing values: {final_shape}")
    print(f"Total rows dropped: {initial_shape[0] - final_shape[0]}")

    return df


Explanation:

- Drops rows where any of the critical columns have missing values.

- Prints out the number of rows dropped for transparency.

- This approach ensures that analyses on these columns are based on complete data, at the cost of sample size: in the run below it removes 36,320 of 65,437 rows.
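
Dropping rows is only one of the two strategies the step mentions. As a rough sketch on made-up data, the example below first measures how much is missing and then compares dropping with filling; the choice of mode for the categorical column and median for the numeric one is an assumption, not a recommendation from the lab.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Employment": ["Employed, full-time", None, "Student, full-time", "Employed, full-time"],
    "JobSat": [8.0, 6.0, None, 7.0],
})

# Share of missing values per column, to judge how costly dropping would be.
missing_share = toy.isna().mean()

# Option 1: drop incomplete rows (the approach handle_missing_data takes).
dropped = toy.dropna(subset=["Employment", "JobSat"])

# Option 2: fill instead; mode for the categorical column, median for the numeric one.
filled = toy.fillna({
    "Employment": toy["Employment"].mode()[0],
    "JobSat": toy["JobSat"].median(),
})

print(missing_share)
print(len(dropped), len(filled))
```

Filling keeps every row but introduces values that respondents never gave, so it suits columns where a small share is missing.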

In [46]:
critical_cols = ['Employment', 'JobSat', 'RemoteWork']
df = handle_missing_data(df, critical_cols)

Libraries imported successfully.
Handling missing data...
Initial dataset shape: (65437, 114)
Final dataset shape after dropping missing values: (29117, 114)
Total rows dropped: 36320


#### Step 4: Analysis of Experience and Job Satisfaction


Analyze the relationship between years of professional coding experience (`YearsCodePro`) and job satisfaction (`JobSat`). Summarize `YearsCodePro` and calculate median satisfaction scores based on experience ranges.

- Create experience ranges for `YearsCodePro` (e.g., `0-5`, `5-10`, `10-20`, `>20` years).

- Calculate the median `JobSat` for each range.

- Visualize the relationship using a bar plot or similar visualization.


In [47]:
def analyze_experience_job_satisfaction(df):
    """
    Analyzes the relationship between YearsCodePro and JobSat.
    Returns:
    - df (DataFrame): A copy of the input with numeric YearsCodePro and an added ExperienceRange column.
    """
    pd, np, plt, sns = import_libraries()
    print("Analyzing experience and job satisfaction...")

    # Ensure 'YearsCodePro' is numeric; non-numeric entries become NaN
    df = df.copy()
    df['YearsCodePro'] = pd.to_numeric(df['YearsCodePro'], errors='coerce')

    # Define left-closed experience ranges: [0, 5), [5, 10), [10, 20), [20, inf).
    # With right=False, a last edge equal to the maximum would exclude the
    # largest value, so the upper edge is left open-ended.
    bins = [0, 5, 10, 20, float('inf')]
    labels = ['0-5', '5-10', '10-20', '>20']
    df['ExperienceRange'] = pd.cut(df['YearsCodePro'], bins=bins, labels=labels, right=False)

    # Calculate median JobSat for each range
    experience_median_job_sat = df.groupby('ExperienceRange', observed=False)['JobSat'].median().reset_index()

    print("Median Job Satisfaction by Experience Range:")
    print(experience_median_job_sat)

    # Visualize the relationship; hue mirrors x so the palette is applied per bar
    sns.barplot(x='ExperienceRange', y='JobSat', hue='ExperienceRange', data=df, errorbar=None, estimator=np.median, palette='viridis', legend=False)
    plt.title('Median Job Satisfaction by Experience Range')
    plt.xlabel('Years of Professional Coding Experience')
    plt.ylabel('Median Job Satisfaction')
    plt.show()

    return df


Explanation:

- Converts `YearsCodePro` to numeric; entries that cannot be parsed as numbers become missing values (`NaN`).

- Categorizes experience into defined ranges.

- Calculates and prints the median job satisfaction for each range.

- Visualizes the results for better understanding.
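
To see how the binning assigns boundary values, here is a small sketch on made-up numbers. It assumes left-closed intervals (`right=False`) and an open-ended last bin, so a value of exactly 5 falls into `5-10` and the largest value still receives a label.

```python
import pandas as pd

# Made-up years of experience, chosen to sit on the bin boundaries.
years = pd.Series([0, 4, 5, 10, 19, 20, 50])

bins = [0, 5, 10, 20, float("inf")]
labels = ["0-5", "5-10", "10-20", ">20"]

# right=False makes each interval closed on the left: [0, 5), [5, 10), ...
ranges = pd.cut(years, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"years": years, "range": ranges}))
```

If the last edge were set to the maximum of the data instead, `right=False` would leave that maximum outside every bin and it would come back as `NaN`.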

In [None]:
df = analyze_experience_job_satisfaction(df)

#### Step 5: Visualize Job Satisfaction


Use a count plot to show the distribution of `JobSat` values. This provides insights into the overall satisfaction levels of respondents.


In [49]:
def visualize_job_satisfaction(df):
    """
    Creates a count plot to show the distribution of JobSat values.
    """
    pd, np, plt, sns = import_libraries()
    print("Visualizing job satisfaction...")

    # Plot the distribution
    sns.countplot(x='JobSat', hue='JobSat', data=df, palette='coolwarm', order=sorted(df['JobSat'].unique()), legend=False)
    plt.title('Distribution of Job Satisfaction')
    plt.xlabel('Job Satisfaction')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.show()


Explanation:

- Provides a visual representation of how job satisfaction is distributed among respondents.

- Uses sorting to present the satisfaction levels in order.
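
The numbers behind a count plot can also be tabulated directly, which makes exact counts and shares easier to read off. This is a minimal sketch on made-up satisfaction scores, not the survey values.

```python
import pandas as pd

# Made-up satisfaction scores for illustration.
job_sat = pd.Series([8, 7, 8, 10, 5, 8, 7], name="JobSat")

# Counts per score, ordered by score rather than by frequency.
counts = job_sat.value_counts().sort_index()

# The same distribution expressed as shares of all responses.
shares = job_sat.value_counts(normalize=True).sort_index().round(2)

print(pd.DataFrame({"count": counts, "share": shares}))
```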

In [None]:
# Creates a count plot to show the distribution of JobSat values.
visualize_job_satisfaction(df)


#### Step 6: Analyzing Remote Work Preferences by Job Role


Analyze trends in remote work based on job roles. Use the `RemoteWork` and `Employment` columns to explore preferences and examine if specific job roles prefer remote work more than others.

- Use a count plot to show remote work distribution.

- Cross-tabulate remote work preferences by employment type (e.g., full-time, part-time) and job roles.


In [51]:
def analyze_remote_work_preferences(df):
    """
    Analyzes trends in remote work based on job roles.
    """
    pd, np, plt, sns = import_libraries()
    print("Analyzing remote work preferences by job role...")

    # Plot remote work distribution
    sns.countplot(x='RemoteWork', hue='RemoteWork', data=df, palette='Set2', order=sorted(df['RemoteWork'].unique()), legend=False)
    plt.title('Distribution of Remote Work Preferences')
    plt.xlabel('Remote Work')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.show()

    # Cross-tabulate remote work by employment type
    remote_employment_ct = pd.crosstab(df['Employment'], df['RemoteWork'])
    print("Remote Work Preferences by Employment Type:")
    print(remote_employment_ct)

    # Heatmap of the cross-tabulation
    sns.heatmap(remote_employment_ct, annot=True, fmt='d', cmap='YlGnBu')
    plt.title('Remote Work Preferences by Employment Type')
    plt.ylabel('Employment Type')
    plt.xlabel('Remote Work Preference')
    plt.show()


Explanation:

- Plots how often each remote work arrangement occurs in the dataset.

- Cross-tabulates `RemoteWork` against `Employment` and shows the counts as a heatmap; job roles are not broken out separately in this function.
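
Raw counts in a cross-tabulation are dominated by the largest employment group. One possible refinement, sketched here on made-up data with simplified labels, is to normalize each row so that groups of different sizes can be compared.

```python
import pandas as pd

# Made-up responses with simplified labels, for illustration only.
toy = pd.DataFrame({
    "Employment": ["Full-time", "Full-time", "Full-time", "Full-time", "Part-time", "Part-time"],
    "RemoteWork": ["Remote", "Remote", "Hybrid", "In-person", "Remote", "Hybrid"],
})

# normalize="index" divides each row by its total, giving shares within each employment type.
row_share = pd.crosstab(toy["Employment"], toy["RemoteWork"], normalize="index")

print(row_share.round(2))
```

Because the normalized table holds floats, a heatmap of it would need a float format such as `fmt='.2f'` instead of `fmt='d'`.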

In [None]:
analyze_remote_work_preferences(df)

#### Step 7: Analyzing Programming Language Trends by Region


Analyze the popularity of programming languages by region. Use the `LanguageHaveWorkedWith` column to investigate which languages are most used in different regions.

- Filter data by country or region.

- Visualize the top programming languages by region with a bar plot or heatmap.


In [53]:
def analyze_programming_language_trends(df, region_column='Country'):
    """
    Analyzes the popularity of programming languages by region.
    Parameters:
    - df (DataFrame): The dataset to analyze.
    - region_column (str): The column representing regions (e.g., 'Country').
    """
    pd, np, plt, sns = import_libraries()
    print("Analyzing programming language trends by region...")

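    # Split the languages into lists on a new frame, leaving the caller's DataFrame unchanged
    df_split = df.assign(LanguagesWorkedWith=df['LanguageHaveWorkedWith'].str.split(';'))

    # Explode the languages so that each row holds a single language
    df_exploded = df_split.explode('LanguagesWorkedWith')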

    # Group and count
    language_region_counts = df_exploded.groupby([region_column, 'LanguagesWorkedWith']).size().reset_index(name='Counts')

    # For illustration, let's focus on the top 5 regions
    top_regions = df[region_column].value_counts().head(5).index
    filtered_data = language_region_counts[language_region_counts[region_column].isin(top_regions)]

    # Pivot for heatmap
    pivot_table = filtered_data.pivot(index='LanguagesWorkedWith', columns=region_column, values='Counts').fillna(0)

    # Visualize with a heatmap
    sns.heatmap(pivot_table, cmap='Blues')
    plt.title('Programming Language Popularity by Region')
    plt.ylabel('Programming Language')
    plt.xlabel('Region')
    plt.show()


Explanation:

- Handles multiple programming languages per respondent by splitting and exploding the data.

- Restricts the heatmap to the five regions with the most respondents to keep it readable.

- Uses a heatmap to show which languages are most popular in different regions.
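
The split-and-explode step is easier to follow on a tiny example. The sketch below uses made-up countries and answers; the column name `Language` is chosen here for illustration.

```python
import pandas as pd

# Made-up responses; each cell holds a semicolon-separated list of languages.
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "LanguageHaveWorkedWith": ["Python;SQL", "Python", "Rust;Python"],
})

# Split each answer into a list, then give every language its own row.
exploded = (
    toy.assign(Language=toy["LanguageHaveWorkedWith"].str.split(";"))
       .explode("Language")
)

# Count respondents per country and language; absent combinations become 0.
counts = exploded.groupby(["Country", "Language"]).size().unstack(fill_value=0)

print(counts)
```

Three respondents become five rows after exploding, one per country and language pair mentioned.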

In [None]:
analyze_programming_language_trends(df)

#### Step 8: Correlation Between Experience and Satisfaction


Examine how years of experience (`YearsCodePro`) correlate with job satisfaction (`JobSatPoints_1`). Use a scatter plot to visualize this relationship.


In [59]:
def correlate_experience_satisfaction(df):
    """
    Examines how YearsCodePro correlates with JobSatPoints_1.
    """
    pd, np, plt, sns = import_libraries()
    print("Analyzing correlation between experience and satisfaction...")

    # Convert to numeric on a separate frame so the caller's DataFrame is not modified;
    # non-numeric entries become NaN
    df_clean = df[['YearsCodePro', 'JobSatPoints_1']].apply(pd.to_numeric, errors='coerce')

    # Drop rows where either value is missing
    df_clean = df_clean.dropna()

    # Scatter plot
    sns.scatterplot(x='YearsCodePro', y='JobSatPoints_1', data=df_clean)
    plt.title('Correlation Between Experience and Job Satisfaction')
    plt.xlabel('Years of Professional Coding Experience')
    plt.ylabel('Job Satisfaction Points')
    plt.show()

    # Calculate correlation
    correlation = df_clean['YearsCodePro'].corr(df_clean['JobSatPoints_1'])
    print(f"Pearson correlation coefficient: {correlation:.2f}")


Explanation:

- Converts relevant columns to numeric types.

- Visualizes the relationship with a scatter plot.

- Calculates and prints the Pearson correlation coefficient.
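
For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up numbers and compares the result with `Series.corr`; the helper `pearson` is illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up paired observations for illustration.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])


def pearson(a, b):
    """Pearson r: sum of centered products over the root of the product of centered sums of squares."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))


r_manual = pearson(x, y)
r_pandas = x.corr(y)  # pandas uses Pearson by default

print(round(r_manual, 4), round(r_pandas, 4))
```

Pearson measures linear association only, so a coefficient near zero does not rule out a non-linear relationship.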

In [None]:
correlate_experience_satisfaction(df)

#### Step 9: Educational Background and Employment Type


Explore how educational background (`EdLevel`) relates to employment type (`Employment`). Use cross-tabulation and visualizations to understand if higher education correlates with specific employment types.


In [63]:
def analyze_education_employment(df):
    """
    Explores how educational background relates to employment type.
    """
    pd, np, plt, sns = import_libraries()
    print("Analyzing educational background and employment type...")

    # Cross-tabulation
    education_employment_ct = pd.crosstab(df['EdLevel'], df['Employment'])

    print("Educational Background vs. Employment Type:")
    print(education_employment_ct)

    # Visualize with a heatmap
    sns.heatmap(education_employment_ct, annot=True, fmt='d', cmap='coolwarm')
    plt.title('Education Level by Employment Type')
    plt.ylabel('Education Level')
    plt.xlabel('Employment Type')
    plt.xticks(rotation=45)
    plt.show()


Explanation:

- Uses cross-tabulation to examine the relationship.

- Provides a heatmap for visual insight.

- Helps understand if higher education levels correlate with certain employment types.

In [None]:
analyze_education_employment(df)

#### Step 10: Save the Cleaned and Analyzed Dataset


After your analysis, save the modified dataset for further use or sharing.


In [66]:
def save_cleaned_dataset(df, filename='cleaned_dataset.csv'):
    """
    Saves the modified dataset to a CSV file.
    Parameters:
    - df (DataFrame): The DataFrame to save.
    - filename (str): The name of the file to save to.
    """
    pd, _, _, _ = import_libraries()
    df.to_csv(filename, index=False)
    print(f"Dataset saved successfully as {filename}.")


Explanation:

- Saves the DataFrame to a CSV file without the index.

- Default filename is 'cleaned_dataset.csv', but you can specify any name.
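
A quick way to confirm that a save worked is to read the file back and compare it with the original. The following sketch does this for a small made-up frame in a temporary directory; the file name `toy_dataset.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up frame for illustration.
toy = pd.DataFrame({"ResponseId": [1, 2], "JobSat": [8.0, 6.5]})

# Write to a temporary directory so that nothing is left behind afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# equals() checks values, column order, and dtypes.
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```

Columns holding categoricals or Python lists will not survive a CSV round trip unchanged, because CSV stores everything as text.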

In [67]:
save_cleaned_dataset(df, 'cleaned_dataset.csv')

Libraries imported successfully.
Dataset saved successfully as cleaned_dataset.csv.


**Putting It All Together**

Here's how you might use these functions in your script:

In [None]:
# Import libraries
pd, np, plt, sns = import_libraries()

# Step 2: Load and preview the dataset
dataset_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv'  # Replace with your actual dataset URL or file path
df = load_and_preview_dataset(dataset_url)

# Step 3: Handle missing data
critical_columns = ['Employment', 'JobSat', 'RemoteWork']
df = handle_missing_data(df, critical_columns)

# Step 4: Analyze experience and job satisfaction
df = analyze_experience_job_satisfaction(df)

# Step 5: Visualize job satisfaction
visualize_job_satisfaction(df)

# Step 6: Analyze remote work preferences
analyze_remote_work_preferences(df)

# Step 7: Analyze programming language trends
analyze_programming_language_trends(df)

# Step 8: Correlation between experience and satisfaction
correlate_experience_satisfaction(df)

# Step 9: Analyze educational background and employment type
analyze_education_employment(df)

# Step 10: Save the cleaned and analyzed dataset
save_cleaned_dataset(df, 'analyzed_dataset.csv')
