Daily Challenge: Data Handling and Analysis in Python

What You Will Learn
Advanced techniques for data normalization, reduction, and aggregation.
Skills in gathering, exploring, integrating, and cleaning data using Python.
Proficiency in using Pandas for complex data manipulation.

Your Task
Download and import the Data Science Job Salary dataset.
Normalize the ‘salary’ column using Min-Max normalization, which scales all salary values to the range 0 to 1.
Implement a dimensionality reduction technique such as Principal Component Analysis (PCA) or t-SNE to reduce the number of features (columns) in the dataset.
Group the dataset by the ‘experience_level’ column and calculate the average and median salary for each experience level (e.g., Junior, Mid-level, Senior).

Hint:
As a reminder, normalization is crucial when features have very different ranges. For example, salary data might span a wide range (e.g., from $20,000 to $200,000). Min-Max normalization rescales each value x to (x - min) / (max - min), so all salary values fall within a consistent range (0 to 1). This is particularly helpful when the data is going to be used in machine learning models, as some algorithms (like k-nearest neighbors or neural networks) perform better when features are on comparable scales. It ensures that a feature such as salary does not dominate the learning process simply because its raw values are large, making the analysis more balanced.
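
As a minimal sketch of the formula, the snippet below applies Min-Max scaling by hand and with scikit-learn's `MinMaxScaler` to a handful of made-up salaries spanning the range mentioned in the hint. The salary values and the variable names are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical salaries spanning the $20,000 to $200,000 range from the hint
salaries = pd.DataFrame({"salary": [20_000, 65_000, 110_000, 200_000]})

# Manual Min-Max: x' = (x - min) / (max - min)
col = salaries["salary"]
manual = (col - col.min()) / (col.max() - col.min())

# Same computation with scikit-learn; fit_transform returns a 2-D array,
# so flatten it to compare against the 1-D manual result
scaled = MinMaxScaler().fit_transform(salaries[["salary"]]).ravel()

print(manual.tolist())  # [0.0, 0.25, 0.5, 1.0]
print(scaled.tolist())
```

The smallest value always maps to 0 and the largest to 1, so a single extreme salary compresses every other value toward 0.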

Dimensionality reduction helps simplify complex datasets by reducing the number of variables under consideration. This can make the data more manageable and help avoid the curse of dimensionality, a phenomenon where machine learning models struggle when dealing with high-dimensional data.
PCA, for instance, helps in retaining the most important information (variance) from the dataset while reducing noise and redundancy.
It can also speed up the training process for models and help in visualizing data in fewer dimensions.
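
To illustrate how PCA removes redundancy, here is a small sketch on synthetic data in which one of three features is almost a copy of another. The data, the random seed, and the choice of two components are assumptions made for the illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 3 features; the third is nearly a copy of the first
base = rng.normal(size=(200, 2))
noisy_copy = base[:, 0] + 0.05 * rng.normal(size=200)
X = np.column_stack([base[:, 0], base[:, 1], noisy_copy])

# Standardize so each feature contributes on the same scale
X_std = StandardScaler().fit_transform(X)

# Project the 3 features onto the 2 directions of largest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)  # (200, 2)
print(pca.explained_variance_ratio_.sum())  # share of variance kept
```

Because the third feature carries almost no new information, two components should retain nearly all of the variance here.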

Aggregating data helps in understanding trends within subgroups of the dataset.
Calculating average and median salaries for each experience level gives insights into the compensation distribution and disparities across different job levels. This kind of aggregation can help in answering business questions like “How does salary evolve with experience?” or “What is the salary distribution for senior-level roles?”
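
The difference between the two statistics is easiest to see on a toy example. The sketch below uses made-up rows (not the real dataset) where one large senior salary pulls the mean well above the median.

```python
import pandas as pd

# Hypothetical rows: one senior salary is much larger than the rest
toy = pd.DataFrame({
    "experience_level": ["Entry", "Entry", "Entry", "Senior", "Senior", "Senior"],
    "salary": [30_000, 32_000, 34_000, 60_000, 70_000, 200_000],
})

# Named aggregation: output column name = aggregation function
summary = toy.groupby("experience_level")["salary"].agg(
    average_salary="mean",
    median_salary="median",
)
print(summary)
```

For the Entry group the mean and median coincide (32,000), while for the Senior group the mean (110,000) sits far above the median (70,000), which signals a skewed distribution.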

In [None]:
#Your Task
#Download and import the Data Science Job Salary dataset. Done.
#Normalize the ‘salary’ column using Min-Max normalization which scales all salary values between 0 and 1.
#Implement dimensionality reduction like Principal Component Analysis (PCA) or t-SNE to reduce the number of features (columns) in the dataset.
#Group the dataset by the ‘experience_level’ column and calculate the average and median salary for each experience level (e.g., Junior, Mid-level, Senior).

In [1]:
#Normalize the ‘salary’ column using Min-Max normalization which scales all salary values between 0 and 1.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [3]:
df = pd.read_csv("C:/Users/cibei/Desktop/PSTB_GenAI/WEEK3/DAY3/DailyChallenge/datascience_salaries.csv")

In [5]:
# Initialize the scaler:
scaler = MinMaxScaler()

In [7]:
# Check the column names:
print(df.columns)

Index(['Unnamed: 0', 'job_title', 'job_type', 'experience_level', 'location',
       'salary_currency', 'salary'],
      dtype='object')


In [10]:
df['salary_normalized'] = scaler.fit_transform(df[['salary']])

In [12]:

# Verification: compare the raw and normalized values
print(df[['salary', 'salary_normalized']].head())

   salary  salary_normalized
0  149000           0.601010
1  120000           0.454545
2   68000           0.191919
3  120000           0.454545
4  149000           0.601010
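
The printed values can be sanity-checked against the Min-Max formula. Working backwards from the three distinct rows shown, they are consistent with a column minimum of 30,000 and a maximum of 228,000. These bounds are inferred from the output above, not read from the CSV, so treat them as an assumption; `min_max` is a hypothetical helper.

```python
# Bounds inferred from the printed rows (an assumption, not read from the CSV)
inferred_min, inferred_max = 30_000, 228_000


def min_max(x, lo, hi):
    """Scale x into [0, 1] given the column's minimum and maximum."""
    return (x - lo) / (hi - lo)


for salary in (149_000, 120_000, 68_000):
    print(salary, round(min_max(salary, inferred_min, inferred_max), 6))
# 149000 0.60101
# 120000 0.454545
# 68000 0.191919
```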


In [14]:
#Implement dimensionality reduction like Principal Component Analysis (PCA) or t-SNE to reduce the number of features (columns) in the dataset.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [16]:
# Keep only the numeric columns, dropping the leftover 'Unnamed: 0' index column if present:
numeric_df = df.select_dtypes(include='number').drop(columns=['Unnamed: 0'], errors='ignore')

In [None]:
#Normalize all numerical columns before reduction:
numeric_scaled = scaler.fit_transform(numeric_df)
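
The cells above stop after scaling, and the dataset has only one genuinely numeric feature (`salary`), so PCA has little to work with as is. One possible way to complete the step, sketched here under assumptions, is to one-hot encode the categorical columns first and then apply PCA. The `demo` rows below are hypothetical stand-ins that reuse the column names printed earlier (the `job_type` values in particular are invented), and keeping two components is an arbitrary choice.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in rows using column names from the notebook
demo = pd.DataFrame({
    "job_type": ["Full Time", "Full Time", "Internship",
                 "Full Time", "Internship", "Full Time"],
    "experience_level": ["Senior", "Mid", "Entry",
                         "Senior", "Entry", "Executive"],
    "salary": [149_000, 120_000, 30_000, 68_000, 36_000, 76_000],
})

# One-hot encode the categorical columns so PCA sees several numeric features
features = pd.get_dummies(
    demo, columns=["job_type", "experience_level"], dtype=float
)

# Scale every feature to [0, 1] before reduction
scaled = MinMaxScaler().fit_transform(features)

# Reduce the 7 encoded features to 2 principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

reduced = pd.DataFrame(components, columns=["pc1", "pc2"], index=demo.index)
print(reduced.shape)  # (6, 2)
print(pca.explained_variance_ratio_)
```

On the real data, `demo` would be replaced by `df` and the list of encoded columns adjusted to the categorical columns actually worth keeping.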

In [20]:
#Group the dataset by the ‘experience_level’ column and calculate the average and median salary for each experience level (e.g., Junior, Mid-level, Senior).
salary_stats = df.groupby('experience_level')['salary'].agg(['mean', 'median']).reset_index()

In [21]:
# Rename columns for clarity
salary_stats.columns = ['experience_level', 'average_salary_usd', 'median_salary_usd']

In [22]:
print(salary_stats)

  experience_level  average_salary_usd  median_salary_usd
0            Entry        36111.111111            30000.0
1        Executive        76076.923077            46000.0
2              Mid        51786.885246            51000.0
3           Senior        75088.033012            68000.0
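
The summary is printed in alphabetical order, which places Executive before Mid. A possible refinement, sketched below, is to sort the rows by seniority using an ordered categorical. The numbers are copied from the printed table; the ordering in `level_order` is an assumption about how the levels rank.

```python
import pandas as pd

# Values copied from the printed summary above
salary_stats = pd.DataFrame({
    "experience_level": ["Entry", "Executive", "Mid", "Senior"],
    "average_salary_usd": [36111.111111, 76076.923077, 51786.885246, 75088.033012],
    "median_salary_usd": [30000.0, 46000.0, 51000.0, 68000.0],
})

# Assumed career ordering; adjust if the dataset defines the levels differently
level_order = ["Entry", "Mid", "Senior", "Executive"]
salary_stats["experience_level"] = pd.Categorical(
    salary_stats["experience_level"], categories=level_order, ordered=True
)

# Sorting an ordered categorical follows the category order, not the alphabet
ordered = salary_stats.sort_values("experience_level").reset_index(drop=True)
print(ordered)
```

Read in this order, the median rises from Entry to Senior. The Executive row stands out because its mean (about 76,077) is far above its median (46,000), which suggests a few high salaries are pulling the average up.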
