**Your Task**

Download and import the Data Science Job Salary dataset.

Normalize the ‘salary’ column using Min-Max normalization, which scales all salary values to the range 0 to 1.

Apply a dimensionality reduction technique such as Principal Component Analysis (PCA) or t-SNE to reduce the number of features (columns) in the dataset.

Group the dataset by the ‘experience_level’ column and calculate the average and median salary for each experience level (e.g., Junior, Mid-level, Senior).

**Hint:**

As a reminder, normalization matters when features have very different ranges. Salary data, for example, can span a wide range (e.g., from $20,000 to $200,000). Min-Max normalization rescales each value as (x - min) / (max - min), so every salary falls within a consistent range (0 to 1). This is particularly helpful when the data will feed machine learning models, since some algorithms (like k-nearest neighbors or neural networks) perform better when features are on comparable scales. It ensures that no single large-valued feature, such as raw salary, dominates the learning process, making the analysis more balanced.
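To make the formula concrete, here is a minimal sketch of Min-Max normalization written by hand with NumPy. The function name `min_max_normalize` and the four salary values are illustrative assumptions, not part of the dataset; the values simply span the $20,000 to $200,000 range mentioned above.

```python
import numpy as np

def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # All values are identical, so there is no range to scale by.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

# Hypothetical salaries chosen for illustration only.
salaries = [20_000, 65_000, 110_000, 200_000]
normalized = min_max_normalize(salaries)
print(normalized)  # [0.   0.25 0.5  1.  ]
```

The smallest salary maps to 0, the largest to 1, and everything else lands proportionally in between.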

Dimensionality reduction simplifies complex datasets by reducing the number of variables under consideration. This can make the data more manageable and help avoid the curse of dimensionality, a phenomenon where machine learning models struggle with high-dimensional data. PCA, for instance, projects the data onto the directions of greatest variance, retaining most of the information while reducing noise and redundancy. It can also speed up model training and help visualize the data in fewer dimensions.
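As an illustration of how PCA keeps variance while discarding redundancy, the sketch below computes principal components directly with NumPy's SVD. The data is synthetic and assumed for the example: three features where the third is almost a copy of the first, so two components should capture nearly all the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (assumed) data: 200 samples, 3 features; x3 is nearly redundant with x1.
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# PCA works on mean-centered data.
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first two principal directions: 3 features become 2.
X_reduced = X_centered @ Vt[:2].T

print(explained_variance_ratio)
print(X_reduced.shape)  # (200, 2)
```

In practice `sklearn.decomposition.PCA` wraps these steps; the manual version is only meant to show what the reduction does.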

Aggregating data helps in understanding trends within subgroups of the dataset. Calculating average and median salaries for each experience level gives insights into the compensation distribution and disparities across different job levels. This kind of aggregation can help in answering business questions like “How does salary evolve with experience?” or “What is the salary distribution for senior-level roles?”
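The sketch below shows why it can be worth reporting both statistics. It uses a small made-up table (the salaries are assumptions for illustration): one unusually high Entry salary pulls the mean up, while the median stays close to the typical value.

```python
import pandas as pd

# Made-up salaries for illustration; the 70,000 Entry value acts as an outlier.
toy = pd.DataFrame({
    "experience_level": ["Entry", "Entry", "Entry", "Senior", "Senior", "Senior"],
    "salary": [30_000, 32_000, 70_000, 60_000, 68_000, 70_000],
})

# One row per experience level, with the mean and median salary side by side.
stats = toy.groupby("experience_level")["salary"].agg(["mean", "median"])
print(stats)
```

For the Entry group the mean is 44,000 but the median is 32,000, so the gap between the two is a quick signal of skew within a group.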

Importing Libraries

In [10]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

Loading the Dataset

In [1]:
from google.colab import files
files.upload()

Saving datascience_salaries.csv to datascience_salaries.csv



In [5]:
df = pd.read_csv('datascience_salaries.csv')
df.head()

,Unnamed: 0,job_title,job_type,experience_level,location,salary_currency,salary
0,0,Data scientist,Full Time,Senior,New York City,USD,149000
1,2,Data scientist,Full Time,Senior,Boston,USD,120000
2,3,Data scientist,Full Time,Senior,London,USD,68000
3,4,Data scientist,Full Time,Senior,Boston,USD,120000
4,5,Data scientist,Full Time,Senior,New York City,USD,149000
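The `Unnamed: 0` column in the preview appears to come from an unlabeled first column in the CSV (a saved row index). One possible way to avoid carrying it around as a feature is to pass `index_col=0` to `pd.read_csv`. The sketch below assumes that file layout and uses a few rows copied from the preview, wrapped in `io.StringIO` so it runs without the file or Colab.

```python
import io
import pandas as pd

# A few rows in the assumed layout of the file: the first header field is empty.
csv_text = (
    ",job_title,job_type,experience_level,location,salary_currency,salary\n"
    "0,Data scientist,Full Time,Senior,New York City,USD,149000\n"
    "2,Data scientist,Full Time,Senior,Boston,USD,120000\n"
    "3,Data scientist,Full Time,Senior,London,USD,68000\n"
)

# index_col=0 uses the unlabeled first column as the row index
# instead of creating an 'Unnamed: 0' data column.
df_sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_sample.columns.tolist())
```

With the real file, the equivalent call would be `pd.read_csv('datascience_salaries.csv', index_col=0)`.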


Min-Max Normalize the Salary Column

In [7]:
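# Scale 'salary' to the [0, 1] range: (x - min) / (max - min)
scaler = MinMaxScaler()
# fit_transform expects a 2-D input and returns an (n, 1) array; ravel() flattens it
df['salary_normalized'] = scaler.fit_transform(df[['salary']]).ravel()
df.head()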

,Unnamed: 0,job_title,job_type,experience_level,location,salary_currency,salary,salary_normalized
0,0,Data scientist,Full Time,Senior,New York City,USD,149000,0.60101
1,2,Data scientist,Full Time,Senior,Boston,USD,120000,0.454545
2,3,Data scientist,Full Time,Senior,London,USD,68000,0.191919
3,4,Data scientist,Full Time,Senior,Boston,USD,120000,0.454545
4,5,Data scientist,Full Time,Senior,New York City,USD,149000,0.60101


Dimensionality Reduction with PCA

In [14]:
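# Keep the numeric columns and drop the raw 'salary' column.
# Caveat: select_dtypes also keeps identifier columns such as 'Unnamed: 0', and on a
# re-run it keeps the 'PCA1'/'PCA2' columns added below, so the result depends on
# how many times this cell has been executed.
df_numeric = df.select_dtypes(include=[np.number]).drop(columns=['salary'])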

# Applying PCA to reduce to 2 components for visualization
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df_numeric)

# Adding PCA results to the dataframe
df['PCA1'] = pca_result[:, 0]
df['PCA2'] = pca_result[:, 1]

df[['PCA1', 'PCA2', 'experience_level']].head()


,PCA1,PCA2,experience_level
0,-1863.241672,0.855106,Senior
1,-1859.241672,0.562166,Senior
2,-1857.241674,0.036908,Senior
3,-1855.241672,0.562155,Senior
4,-1853.241672,0.855079,Senior
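Because most columns in this dataset are categorical, one alternative is to one-hot encode them, standardize the result, and run PCA on that. The sketch below is an assumption about how this could be done, not the approach used above; it runs on a small hand-built sample shaped like the dataset (the `sample` frame and the choice of `feature_cols` are illustrative).

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hand-built sample in the shape of the dataset (illustrative, not the full data).
sample = pd.DataFrame({
    "experience_level": ["Senior", "Senior", "Senior", "Entry", "Senior", "Senior"],
    "location": ["New York City", "Boston", "London", "BangPa-in", "Sydney", "San Francisco"],
    "salary": [149000, 120000, 68000, 35000, 68000, 140000],
})

# Choose features explicitly so identifier columns never reach PCA.
feature_cols = ["experience_level", "location", "salary"]
encoded = pd.get_dummies(
    sample[feature_cols], columns=["experience_level", "location"], dtype=float
)

# Standardize so the large salary values do not dominate the 0/1 indicator columns.
scaled = StandardScaler().fit_transform(encoded)

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
reduced = pd.DataFrame(components, columns=["PCA1", "PCA2"], index=sample.index)

print(encoded.shape, "->", reduced.shape)
print(pca.explained_variance_ratio_)
```

Writing the components into a separate frame rather than back into the source frame keeps the cell safe to re-run.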


Group by Experience Level and Compute Salary Statistics

In [15]:
salary_stats = df.groupby('experience_level')['salary'].agg(['mean', 'median']).reset_index()
salary_stats.columns = ['Experience Level', 'Average Salary', 'Median Salary']
salary_stats

,Experience Level,Average Salary,Median Salary
0,Entry,36111.111111,30000.0
1,Executive,76076.923077,46000.0
2,Mid,51786.885246,51000.0
3,Senior,75088.033012,68000.0
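The table above is sorted alphabetically, which puts Executive before Mid. A possible follow-up, sketched below, reorders the rows by seniority using an ordered categorical and adds the gap between mean and median. The ordering in `career_order` is an assumption about how the levels rank; the numbers are copied from the output above.

```python
import pandas as pd

# Values copied from the grouped output above.
stats = pd.DataFrame({
    "Experience Level": ["Entry", "Executive", "Mid", "Senior"],
    "Average Salary": [36111.111111, 76076.923077, 51786.885246, 75088.033012],
    "Median Salary": [30000.0, 46000.0, 51000.0, 68000.0],
})

# Assumed seniority ordering of the levels.
career_order = ["Entry", "Mid", "Senior", "Executive"]
stats["Experience Level"] = pd.Categorical(
    stats["Experience Level"], categories=career_order, ordered=True
)
stats = stats.sort_values("Experience Level").reset_index(drop=True)

# A large positive gap suggests a few high salaries are pulling the mean up.
stats["Mean - Median"] = stats["Average Salary"] - stats["Median Salary"]
print(stats)
```

The Executive row shows by far the largest gap between mean and median, which suggests its average is driven by a small number of high salaries.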
