<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Finding Correlation**


Estimated time needed: **30** minutes


In this lab, you will work with a cleaned dataset to perform exploratory data analysis (EDA). You will examine the distribution of the data, identify outliers, and determine the correlation between different columns in the dataset.


## Objectives


In this lab, you will perform the following:


- Identify the distribution of compensation data in the dataset.

- Remove outliers to refine the dataset.

- Identify correlations between various features in the dataset.


## Hands-on Lab


### Step 1: Install and Import Required Libraries


In [None]:
# Install the necessary libraries
!pip install pandas
!pip install matplotlib
!pip install seaborn

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


### Step 2: Load the Dataset


In [None]:
# Load the dataset from the given URL
file_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(file_url)

# Display the first few rows to understand the structure of the dataset
df.head()

<h3>Step 3: Analyze and Visualize Compensation Distribution</h3>


**Task**: Plot the distribution and histogram for `ConvertedCompYearly` to examine the spread of yearly compensation among respondents.


In [None]:
## Write your code here

# `df` and the plotting libraries are already available from Steps 1 and 2,
# so the dataset is not reloaded here.

# Check if 'ConvertedCompYearly' column exists
if 'ConvertedCompYearly' not in df.columns:
    print("ConvertedCompYearly column not found in the dataset.")
else:
    # Distribution Plot (KDE Plot)
    plt.figure(figsize=(12, 6))
    sns.kdeplot(df['ConvertedCompYearly'], fill=True)
    plt.title('Distribution of Yearly Compensation (ConvertedCompYearly)')
    plt.xlabel('Yearly Compensation (USD)')
    plt.ylabel('Density')
    plt.show()

    # Histogram
    plt.figure(figsize=(12, 6))
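    # Drop missing values explicitly so the histogram does not depend on how
    # the installed Matplotlib version treats NaN; adjust bins as needed
    plt.hist(df['ConvertedCompYearly'].dropna(), bins=50, edgecolor='black')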
    plt.title('Histogram of Yearly Compensation (ConvertedCompYearly)')
    plt.xlabel('Yearly Compensation (USD)')
    plt.ylabel('Frequency')
    plt.show()
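
The plots show the shape of the distribution, and a few summary numbers can make the skew explicit. The sketch below is a minimal illustration on synthetic log-normal data rather than the survey file. The helper `summarize_distribution` and the sample values are assumptions introduced here and are not part of the lab.

```python
import numpy as np
import pandas as pd


def summarize_distribution(series: pd.Series) -> pd.Series:
    """Return basic spread statistics for a numeric column, ignoring NaN."""
    clean = series.dropna()
    return pd.Series({
        "count": clean.count(),
        "mean": clean.mean(),
        "median": clean.median(),
        "std": clean.std(),
        "skew": clean.skew(),  # positive values indicate a long right tail
    })


# Synthetic, right-skewed "compensation" values (assumption: not real survey data)
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=11, sigma=0.8, size=1000))
sample.iloc[::10] = np.nan  # simulate respondents who skipped the question

summary = summarize_distribution(sample)
print(summary)
```

On the real data, the same helper could be called as `summarize_distribution(df['ConvertedCompYearly'])`. A mean well above the median is a quick sign that a few very large values dominate the right tail.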

<h3>Step 4: Calculate Median Compensation for Full-Time Employees</h3>


**Task**: Filter the data to calculate the median compensation for respondents whose employment status is `Employed, full-time`.


In [None]:
## Write your code here


# `df` is already loaded in Step 2, so the dataset is not reloaded here.

# Check if 'Employment' and 'ConvertedCompYearly' columns exist
if 'Employment' not in df.columns or 'ConvertedCompYearly' not in df.columns:
    print("One or both of 'Employment' and 'ConvertedCompYearly' columns not found in the dataset.")
else:
    # Filter for full-time employees
    full_time_employees = df[df['Employment'] == 'Employed, full-time']

    # Calculate median compensation
    median_compensation = full_time_employees['ConvertedCompYearly'].median()

    # Print the result
    print(f"Median Compensation for Full-Time Employees: ${median_compensation:.2f}")
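
The equality filter above keeps only rows whose `Employment` value is exactly `Employed, full-time`. If the column stores multi-select answers as semicolon-separated strings (an assumption about this survey export, worth checking with `df['Employment'].unique()`), respondents who combine full-time work with another status are excluded. The sketch below uses a small made-up DataFrame to show how the two filters can differ.

```python
import pandas as pd

# Toy data (assumption: values are invented to illustrate the filter only)
toy = pd.DataFrame({
    "Employment": [
        "Employed, full-time",
        "Employed, full-time;Independent contractor, freelancer, or self-employed",
        "Student, full-time",
        "Employed, full-time",
        None,
    ],
    "ConvertedCompYearly": [60000.0, 90000.0, 5000.0, None, 40000.0],
})

# Exact match: only rows whose whole value is "Employed, full-time"
exact_mask = toy["Employment"] == "Employed, full-time"

# Substring match: any row that mentions "Employed, full-time";
# na=False treats missing answers as non-matches, regex=False matches literally
contains_mask = toy["Employment"].str.contains(
    "Employed, full-time", na=False, regex=False
)

# median() skips NaN compensation values by default
exact_median = toy.loc[exact_mask, "ConvertedCompYearly"].median()
contains_median = toy.loc[contains_mask, "ConvertedCompYearly"].median()

print(f"Exact match median:     {exact_median:,.2f}")
print(f"Substring match median: {contains_median:,.2f}")
```

Which filter is appropriate depends on how the task defines a full-time employee; the exact match follows the task wording most literally.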


<h3>Step 5: Analyzing Compensation Range and Distribution by Country</h3>


**Task**: Explore the range of compensation in the `ConvertedCompYearly` column across countries. Use box plots to compare each country's compensation distribution and to spot variations and anomalies, which gives insight into global compensation trends.



In [None]:
## Write your code here


# `df` and the plotting libraries are already available from Steps 1 and 2,
# so the dataset is not reloaded here.

# Check if 'Country' and 'ConvertedCompYearly' columns exist
if 'Country' not in df.columns or 'ConvertedCompYearly' not in df.columns:
    print("One or both of 'Country' and 'ConvertedCompYearly' columns not found in the dataset.")
else:
    # Filter out countries with very few respondents to improve readability
    country_counts = df['Country'].value_counts()
    valid_countries = country_counts[country_counts > 50].index  # Adjust the threshold as needed
    df_filtered = df[df['Country'].isin(valid_countries)]

    # Create box plots for compensation by country
    plt.figure(figsize=(18, 10))  # Adjust figure size for better visualization
    sns.boxplot(x='Country', y='ConvertedCompYearly', data=df_filtered)
    plt.title('Compensation Distribution by Country')
    plt.xlabel('Country')
    plt.ylabel('Yearly Compensation (USD)')
    plt.xticks(rotation=45, ha='right')  # Rotate country labels for readability
    plt.tight_layout()
    plt.show()
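
A box plot with many countries can be hard to read, so a per-country table of the same quartiles is a useful companion. The following is a minimal sketch on invented data. The threshold `min_respondents` and the helper names `q1` and `q3` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (assumption: three fictional countries, not survey values)
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4 + ["C"],
    "ConvertedCompYearly": [10.0, 20.0, 30.0, 40.0,
                            100.0, 200.0, 300.0, 400.0,
                            999.0],
})


def q1(s: pd.Series) -> float:
    """First quartile (lower edge of the box in a box plot)."""
    return s.quantile(0.25)


def q3(s: pd.Series) -> float:
    """Third quartile (upper edge of the box in a box plot)."""
    return s.quantile(0.75)


min_respondents = 3  # hypothetical cutoff; the cell above uses more than 50

stats = toy.groupby("Country")["ConvertedCompYearly"].agg(
    count="count", q1=q1, median="median", q3=q3
)

# Keep countries with enough respondents and order them by median pay
stats = stats[stats["count"] >= min_respondents].sort_values(
    "median", ascending=False
)
print(stats)
```

The sorted index can also be passed to the `order` parameter of `sns.boxplot` so that the boxes appear from highest to lowest median.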

<h3>Step 6: Removing Outliers from the Dataset</h3>


**Task**: Create a new DataFrame by removing outliers from the `ConvertedCompYearly` column to get a refined dataset for correlation analysis.


In [None]:
## Write your code here

# `df` is already loaded in Step 2, so the dataset is not reloaded here.

# Check if 'ConvertedCompYearly' column exists
if 'ConvertedCompYearly' not in df.columns:
    print("ConvertedCompYearly column not found in the dataset.")
else:
    # Calculate Q1, Q3, and IQR
    Q1 = df['ConvertedCompYearly'].quantile(0.25)
    Q3 = df['ConvertedCompYearly'].quantile(0.75)
    IQR = Q3 - Q1

    # Determine upper and lower bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Create a new DataFrame without outliers
    df_no_outliers = df[(df['ConvertedCompYearly'] >= lower_bound) & (df['ConvertedCompYearly'] <= upper_bound)]

    # Print the size of the original and new DataFrames
    print(f"Original DataFrame size: {len(df)}")
    print(f"DataFrame size after removing outliers: {len(df_no_outliers)}")

    # Optional: Display the first few rows of the new DataFrame
    print("\nDataFrame without outliers (first 5 rows):")
    print(df_no_outliers.head())
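
The IQR rule used above can be wrapped in a small reusable function. This is a sketch under stated assumptions: the function name `remove_iqr_outliers` and its `k` parameter are hypothetical, and the data is a made-up column with one extreme value and one missing value.

```python
import pandas as pd


def remove_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)  # quantile() ignores NaN
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends and evaluates to False for NaN
    return frame[frame[column].between(lower, upper)]


# Toy data (assumption: invented values)
toy = pd.DataFrame({
    "ConvertedCompYearly": [40000, 50000, 60000, 70000, 80000, 1000000, None]
})

trimmed = remove_iqr_outliers(toy, "ConvertedCompYearly")
print(f"Rows before: {len(toy)}, rows after: {len(trimmed)}")
```

Here Q1 is 52,500 and Q3 is 77,500, so the bounds are 15,000 and 115,000 and the 1,000,000 row is removed. The row with a missing value is removed as well, because a comparison against NaN is never true. The same applies to the `>=` and `<=` filter in the cell above, so the reported size difference counts rows with missing compensation in addition to true outliers.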

<h3>Step 7: Finding Correlations Between Key Variables</h3>


**Task**: Calculate correlations between `ConvertedCompYearly`, `WorkExp`, and `JobSatPoints_1`. Visualize these correlations with a heatmap.


In [None]:
## Write your code here

# `df` and the plotting libraries are already available from Steps 1 and 2,
# so the dataset is not reloaded here.

# Check if required columns exist
required_columns = ['ConvertedCompYearly', 'WorkExp', 'JobSatPoints_1']
if not all(col in df.columns for col in required_columns):
    missing_cols = [col for col in required_columns if col not in df.columns]
    print(f"Missing columns: {missing_cols}")
else:
    # Calculate correlations
    correlation_matrix = df[required_columns].corr()

    # Visualize correlations with a heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title('Correlation Heatmap')
    plt.show()
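
`DataFrame.corr()` computes the Pearson coefficient by default, which measures linear association. For skewed variables such as compensation, the rank-based Spearman coefficient (`method="spearman"`) may be a useful cross-check. The sketch below compares the two on a tiny invented dataset where the relationship is perfectly monotonic but not linear.

```python
import pandas as pd

# Toy data (assumption: compensation grows as the cube of experience)
toy = pd.DataFrame({"WorkExp": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["ConvertedCompYearly"] = toy["WorkExp"] ** 3

pearson = toy.corr(method="pearson")    # linear association
spearman = toy.corr(method="spearman")  # association between ranks

pearson_r = pearson.loc["WorkExp", "ConvertedCompYearly"]
spearman_r = spearman.loc["WorkExp", "ConvertedCompYearly"]

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Pearson comes out below 1 because the points do not fall on a straight line, while Spearman reaches 1 because the ordering of the two columns agrees exactly. On the lab data, either matrix can be passed to `sns.heatmap` in the same way as `correlation_matrix`.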

<h3>Step 8: Scatter Plot for Correlations</h3>


**Task**: Create scatter plots to examine specific correlations between `ConvertedCompYearly` and `WorkExp`, as well as between `ConvertedCompYearly` and `JobSatPoints_1`.


In [None]:
## Write your code here

# `df` and the plotting libraries are already available from Steps 1 and 2,
# so the dataset is not reloaded here.

# Check if required columns exist
required_columns = ['ConvertedCompYearly', 'WorkExp', 'JobSatPoints_1']
if not all(col in df.columns for col in required_columns):
    missing_cols = [col for col in required_columns if col not in df.columns]
    print(f"Missing columns: {missing_cols}")
else:
    # Scatter plot: ConvertedCompYearly vs. WorkExp
    plt.figure(figsize=(8, 6))
    sns.scatterplot(x='WorkExp', y='ConvertedCompYearly', data=df)
    plt.title('Scatter Plot: Yearly Compensation vs. Work Experience')
    plt.xlabel('Work Experience (Years)')
    plt.ylabel('Yearly Compensation (USD)')
    plt.show()

    # Scatter plot: ConvertedCompYearly vs. JobSatPoints_1
    plt.figure(figsize=(8, 6))
    sns.scatterplot(x='JobSatPoints_1', y='ConvertedCompYearly', data=df)
    plt.title('Scatter Plot: Yearly Compensation vs. Job Satisfaction')
    plt.xlabel('Job Satisfaction (Points)')
    plt.ylabel('Yearly Compensation (USD)')
    plt.show()
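
With tens of thousands of points and a long right tail, a plain scatter plot can collapse into a dense band near the bottom of the axis. One possible remedy, sketched below on synthetic data, is to make the markers semi-transparent and put compensation on a logarithmic axis. The data-generating values are assumptions chosen only to produce a skewed cloud of points.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data (assumption: not drawn from the survey)
rng = np.random.default_rng(1)
work_exp = rng.integers(0, 40, size=300)
comp = 30000 * np.exp(0.04 * work_exp) * rng.lognormal(0.0, 0.5, size=300)
toy = pd.DataFrame({"WorkExp": work_exp, "ConvertedCompYearly": comp})

fig, ax = plt.subplots(figsize=(8, 6))

# alpha < 1 lets overlapping points show up as darker regions
ax.scatter(toy["WorkExp"], toy["ConvertedCompYearly"], alpha=0.3, s=15)

# A log scale spreads out the low and middle range of compensation
ax.set_yscale("log")
ax.set_title("Yearly Compensation vs. Work Experience (log scale)")
ax.set_xlabel("Work Experience (Years)")
ax.set_ylabel("Yearly Compensation (USD, log scale)")
plt.show()
```

A log axis requires strictly positive values, so rows with zero or missing compensation would need to be filtered out before plotting the real data this way.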

<h3>Summary</h3>


In this lab, you practiced essential skills in correlation analysis by:

- Examining the distribution of yearly compensation with histograms and box plots.
- Detecting and removing outliers from compensation data.
- Calculating correlations between key variables such as compensation, work experience, and job satisfaction.
- Visualizing relationships with scatter plots and heatmaps to gain insights into the associations between these features.

By following these steps, you have developed a solid foundation for analyzing relationships within the dataset.


## Authors:
Ayushi Jain


### Other Contributors:
- Rav Ahuja
- Lakshmi Holla
- Malika


Copyright © IBM Corporation. All rights reserved.
