<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Finding Outliers**


Estimated time needed: **30** minutes


In this lab, you will work with a cleaned dataset to perform exploratory data analysis (EDA).
You will explore the distribution of key variables and focus on identifying outliers.


## Objectives


In this lab, you will perform the following:


-  Analyze the distribution of key variables in the dataset.

-  Identify and remove outliers using statistical methods.

-  Perform relevant statistical and correlation analysis.


#### Install and import the required libraries


In [None]:
!pip install pandas
!pip install matplotlib
!pip install seaborn

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<h3>Step 1: Load and Explore the Dataset</h3>


Load the dataset into a DataFrame and examine the structure of the data.
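
Beyond `head()`, a few other calls help when examining the structure of a DataFrame. The following is a minimal sketch on a tiny inline CSV; the rows are made up for illustration and only the column names (`ResponseId`, `Industry`, `ConvertedCompYearly`) are taken from this lab.

```python
import io

import pandas as pd

# Made-up rows standing in for the survey CSV; only the column names
# ResponseId, Industry, and ConvertedCompYearly come from the lab.
csv_text = """ResponseId,Industry,ConvertedCompYearly
1,Healthcare,65000
2,,
3,Fintech,120000
"""

sample = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample.shape          # (rows, columns)
column_types = sample.dtypes           # dtype inferred for each column
missing_counts = sample.isna().sum()   # number of missing values per column

print(f"Rows: {n_rows}, columns: {n_cols}")
print(column_types)
print(missing_counts)
```

The same three calls can be applied to `df` after it is loaded. Checking the missing-value counts early is useful because `ConvertedCompYearly` may contain gaps that affect the later steps.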


In [None]:
file_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

#Create the dataframe
df = pd.read_csv(file_url)

#Display the top 10 records
df.head(10)


<h3>Step 2: Plot the Distribution of Industry</h3>


Explore how respondents are distributed across different industries.

- Plot a bar chart to visualize the distribution of respondents by industry.

- Highlight any notable trends.


In [None]:
##Write your code here
# Plot the distribution of 'Industry'
plt.figure(figsize=(16, 8))  # Adjust figure size for better readability
industry_counts = df['Industry'].value_counts()
# Assigning hue (with legend=False) avoids seaborn's deprecation warning about passing palette without hue
sns.barplot(x=industry_counts.index, y=industry_counts.values, hue=industry_counts.index, palette="viridis", legend=False)
plt.title('Distribution of Respondents by Industry')
plt.xlabel('Industry')
plt.ylabel('Number of Respondents')
plt.xticks(rotation=45, ha='right', fontsize=10) # Reduced font size of x-axis labels
plt.tight_layout()
plt.show()

<h3>Step 3: Identify High Compensation Outliers</h3>


Identify respondents with extremely high yearly compensation.

- Calculate basic statistics (mean, median, and standard deviation) for `ConvertedCompYearly`.

- Identify compensation values exceeding a defined threshold (e.g., 3 standard deviations above the mean).
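
One caveat with the "mean plus 3 standard deviations" rule is that extreme values inflate both the mean and the standard deviation, so the threshold can rise above the very values it is meant to flag. The sketch below illustrates this on a small series of made-up compensation figures (not taken from the survey data) and compares the result with the IQR rule.

```python
import pandas as pd

# Made-up compensation values for illustration; one value is clearly extreme.
comp = pd.Series([40_000, 50_000, 55_000, 60_000, 65_000, 70_000, 80_000, 1_000_000])

# Rule 1: mean + 3 standard deviations (pandas uses the sample std, ddof=1).
sigma_threshold = comp.mean() + 3 * comp.std()
sigma_outliers = comp[comp > sigma_threshold]

# Rule 2: Q3 + 1.5 * IQR, which depends only on the middle half of the data.
q1 = comp.quantile(0.25)
q3 = comp.quantile(0.75)
iqr_threshold = q3 + 1.5 * (q3 - q1)
iqr_outliers = comp[comp > iqr_threshold]

print(f"3-sigma threshold: {sigma_threshold:,.0f} -> {len(sigma_outliers)} flagged")
print(f"IQR threshold:     {iqr_threshold:,.0f} -> {len(iqr_outliers)} flagged")
```

In this toy series the 3-sigma threshold ends up above 1,000,000, so nothing is flagged, while the IQR rule flags the extreme value. On a large dataset the effect is less dramatic, but it is one reason the two methods in this step can report very different outlier counts.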


In [None]:
##Write your code here
# Calculate basic statistics for ConvertedCompYearly
mean_comp = df['ConvertedCompYearly'].mean()
median_comp = df['ConvertedCompYearly'].median()
std_comp = df['ConvertedCompYearly'].std()

print(f"Mean Compensation: {mean_comp}")
print(f"Median Compensation: {median_comp}")
print(f"Standard Deviation of Compensation: {std_comp}")

# Identify outliers using 3 standard deviations above the mean
outlier_threshold = mean_comp + (3 * std_comp)
high_comp_outliers = df[df['ConvertedCompYearly'] > outlier_threshold]

print(f"\nNumber of high compensation outliers (>{outlier_threshold}): {len(high_comp_outliers)}")
print("\nHigh Compensation Outliers:")
print(high_comp_outliers[['ResponseId', 'ConvertedCompYearly', 'Country']])

#Alternative way using quantiles
q75 = df['ConvertedCompYearly'].quantile(0.75)
q25 = df['ConvertedCompYearly'].quantile(0.25)
iqr = q75 - q25
upper_bound = q75 + (1.5 * iqr)
high_comp_outliers_iqr = df[df['ConvertedCompYearly'] > upper_bound]

print(f"\nNumber of high compensation outliers using IQR (>{upper_bound}): {len(high_comp_outliers_iqr)}")
print("\nHigh Compensation Outliers using IQR:")
print(high_comp_outliers_iqr[['ResponseId', 'ConvertedCompYearly', 'Country']])

<h3>Step 4: Detect Outliers in Compensation</h3>


Identify outliers in the `ConvertedCompYearly` column using the IQR method.

- Calculate the Interquartile Range (IQR).

- Determine the upper and lower bounds for outliers.

- Count and visualize outliers using a box plot.


In [None]:
##Write your code here
# Calculate quartiles and IQR
Q1 = df['ConvertedCompYearly'].quantile(0.25)
Q3 = df['ConvertedCompYearly'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df['ConvertedCompYearly'] < lower_bound) | (df['ConvertedCompYearly'] > upper_bound)]

# Count outliers
num_outliers = len(outliers)
print(f"Number of outliers: {num_outliers}")
print(f"Lower Bound: {lower_bound}")
print(f"Upper Bound: {upper_bound}")

# Visualize outliers with a box plot
plt.figure(figsize=(8, 6))
sns.boxplot(y='ConvertedCompYearly', data=df)
plt.title('Box Plot of Yearly Compensation with Outliers')
plt.ylabel('Yearly Compensation')
plt.yscale('log') # Use logarithmic scale for better visualization of spread
plt.show()

# Visualize outliers with a histogram
plt.figure(figsize=(10, 6))
plt.hist(df['ConvertedCompYearly'], bins=50, range=(0, upper_bound*2)) #Limit the range to avoid extreme outliers
plt.title('Histogram of Yearly Compensation with Outliers')
plt.xlabel('Yearly Compensation')
plt.ylabel('Frequency')
plt.show()


In [None]:
# Calculate the Interquartile Range (IQR) for ConvertedCompYearly
Q1 = df['ConvertedCompYearly'].quantile(0.25)
Q3 = df['ConvertedCompYearly'].quantile(0.75)
IQR = Q3 - Q1

print(f'Q1 (25th percentile): {Q1}')
print(f'Q3 (75th percentile): {Q3}')
print(f'IQR (Interquartile Range): {IQR}')

# Determine the upper and lower bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f'Lower bound for outliers: {lower_bound}')
print(f'Upper bound for outliers: {upper_bound}')

# Identify outliers in the ConvertedCompYearly column
outliers = df[(df['ConvertedCompYearly'] < lower_bound) | (df['ConvertedCompYearly'] > upper_bound)]

print(f'Number of outliers: {len(outliers)}')
print(outliers[['ResponseId', 'ConvertedCompYearly']])

# Visualize outliers using a box plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='ConvertedCompYearly')
plt.xlabel('Yearly Compensation')
plt.title('Box Plot of Yearly Compensation with Outliers')
plt.show()

<h3>Step 5: Remove Outliers and Create a New DataFrame</h3>


Remove outliers from the dataset.

- Create a new DataFrame excluding rows with outliers in `ConvertedCompYearly`.
- Validate the size of the new DataFrame.
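
The removal step can also be wrapped in a small helper that returns a new DataFrame and leaves the input untouched, which makes the size check straightforward. This is a sketch under stated assumptions: `remove_iqr_outliers` is a hypothetical helper name, `k=1.5` is the conventional IQR multiplier, and the toy rows are made up.

```python
import numpy as np
import pandas as pd


def remove_iqr_outliers(frame, column, k=1.5):
    """Return a copy of `frame` without IQR outliers in `column`, plus the bounds.

    Hypothetical helper for illustration. Rows where `column` is missing are
    dropped too, because `between` evaluates to False for NaN.
    """
    q1 = frame[column].quantile(0.25)  # quantile() skips NaN by default
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    keep = frame[column].between(lower, upper)  # inclusive on both ends
    return frame[keep].copy(), lower, upper


# Made-up rows: one extreme value and one missing value.
toy = pd.DataFrame({
    "ResponseId": range(1, 10),
    "ConvertedCompYearly": [40_000, 50_000, 55_000, 60_000, 65_000,
                            70_000, 80_000, 1_000_000, np.nan],
})

cleaned, lower, upper = remove_iqr_outliers(toy, "ConvertedCompYearly")

print(f"Bounds: [{lower:,.0f}, {upper:,.0f}]")
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Because the helper does not reassign its input, the original DataFrame remains available for comparison or for later steps that should still see every row.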


In [None]:
##Write your code here
# Keep only rows with non-negative compensation. The comparison is False for NaN,
# so rows with missing compensation are dropped here as well.
df = df[df['ConvertedCompYearly'] >= 0]

# Calculate quartiles and IQR
Q1 = df['ConvertedCompYearly'].quantile(0.25)
Q3 = df['ConvertedCompYearly'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Create a new DataFrame *excluding* outliers
df_no_outliers = df[(df['ConvertedCompYearly'] >= lower_bound) & (df['ConvertedCompYearly'] <= upper_bound)]

# Validate the size of the new DataFrame
# (df here already excludes rows with missing or negative compensation)
print(f"DataFrame size before removing outliers: {len(df)}")
print(f"DataFrame size after removing outliers: {len(df_no_outliers)}")

# Visualize the data without outliers with a box plot
plt.figure(figsize=(8, 6))
sns.boxplot(y='ConvertedCompYearly', data=df_no_outliers)
plt.title('Box Plot of Yearly Compensation without Outliers')
plt.ylabel('Yearly Compensation')
plt.yscale('log') # Use logarithmic scale for better visualization of spread
plt.show()

# Visualize the data without outliers with a histogram
plt.figure(figsize=(10, 6))
plt.hist(df_no_outliers['ConvertedCompYearly'], bins=50)
plt.title('Histogram of Yearly Compensation without Outliers')
plt.xlabel('Yearly Compensation')
plt.ylabel('Frequency')
plt.show()

<h3>Step 6: Correlation Analysis</h3>


Analyze the correlation between `Age` (transformed) and other numerical columns.

- Map the `Age` column to approximate numeric values.

- Compute correlations between `Age` and other numeric variables.

- Visualize the correlation matrix.
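
The steps above can be sketched as follows. The `Age` category labels and the numeric values assigned to them (roughly the midpoint of each range) are assumptions, as are the toy rows and the `WorkExp` column; check `df['Age'].unique()` against the mapping before applying it to the real data.

```python
import pandas as pd

# Assumed category labels and approximate numeric values (range midpoints).
# Labels not present in the mapping, such as "Prefer not to say", become NaN.
age_map = {
    "Under 18 years old": 17,
    "18-24 years old": 21,
    "25-34 years old": 29.5,
    "35-44 years old": 39.5,
    "45-54 years old": 49.5,
    "55-64 years old": 59.5,
    "65 years or older": 65,
}

# Made-up rows for illustration; WorkExp is a hypothetical numeric column.
toy = pd.DataFrame({
    "Age": ["18-24 years old", "25-34 years old", "35-44 years old",
            "45-54 years old", "Prefer not to say"],
    "WorkExp": [2, 8, 15, 25, 10],
    "ConvertedCompYearly": [30_000, 60_000, 90_000, 120_000, 50_000],
})

# Map the Age categories to approximate numeric values.
toy["AgeNumeric"] = toy["Age"].map(age_map)

# Pearson correlation matrix over the numeric columns only;
# rows with NaN are excluded pairwise.
corr = toy.select_dtypes(include="number").corr()

# Correlation of AgeNumeric with every other numeric column.
age_corr = corr["AgeNumeric"].drop("AgeNumeric")
print(age_corr)
```

On the real data, a call such as `sns.heatmap(corr, annot=True, cmap='coolwarm')` followed by `plt.show()` would draw the matrix. Keep in mind that the mapped ages are coarse approximations, so the resulting correlations are indicative rather than exact.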


In [None]:
##Write your code here

<h3> Summary </h3>


In this lab, you developed essential skills in **Exploratory Data Analysis (EDA)** with a focus on outlier detection and removal. Specifically, you:


- Loaded and explored the dataset to understand its structure.

- Analyzed the distribution of respondents across industries.

- Identified and removed high compensation outliers using statistical thresholds and the Interquartile Range (IQR) method.

- Performed correlation analysis, including transforming the `Age` column into numeric values for better analysis.


<!--
## Change Log
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-10-01|1.1|Madhusudan Moole|Reviewed and updated lab|
|2024-09-29|1.0|Raghul Ramesh|Created lab|
-->


Copyright © IBM Corporation. All rights reserved.
