# Table of Contents

* [Chapter 1 : Import Packages](#chapter1)
* [Chapter 2 : Data Exploration](#chapter2)
    * [Section 2.1 : Reading and Understanding the dataset](#subchapter2.1)
    * [Section 2.2 : Data Transformation](#subchapter2.2)
    * [Section 2.3a : Data Visualization (Categorical Data)](#subchapter2.3a)
    * [Section 2.3b : Data Visualization (Numerical Data)](#subchapter2.3b)
    * [Section 2.4 : Bivariate Analysis](#subchapter2.4)
    * [Section 2.5 : Correlation](#subchapter2.5)
* [Chapter 3 : Prediction](#chapter3)
    * [Section 3.1 : Defining Evaluation Metrics](#subchapter3.1)
    * [Section 3.2 : Linear Regression](#subchapter3.2)
    * [Section 3.3 : Decision Tree Regressor](#subchapter3.3)
    * [Section 3.4 : Lasso Regression](#subchapter3.4)
    * [Section 3.5 : Ridge Regression](#subchapter3.5)
    * [Section 3.6 : Elastic Net Regression](#subchapter3.6)
* [Chapter 4 : Conclusion](#chapter4)

# Introduction:

This notebook covers various aspects of data analysis and prediction using Python. I read the dataset, explore and visualize it, and then predict salaries with several regression techniques.

**Table of Contents:**

**Chapter 1: Import Packages**
- I start by importing the necessary packages and libraries required for data analysis, visualization, and prediction.

**Chapter 2: Data Exploration**
- I will explore the dataset by reading and understanding its contents.
- I'll perform data transformation to preprocess and clean the dataset for analysis.
- I'll visualize both categorical and numerical data using appropriate visualization techniques.
- Bivariate analysis will be conducted to uncover relationships between variables.
- The correlation between variables will be explored to understand their relationships.

**Chapter 3: Prediction**
- I'll focus on predicting the target variable using various regression techniques.
- I'll define evaluation metrics to assess the performance of the prediction models.
- Linear Regression will be explored as a fundamental prediction method.
- A Decision Tree Regressor will be used to make predictions based on decision rules.
- Lasso Regression and Ridge Regression will be employed to apply regularization and reduce overfitting.
- Elastic Net Regression will be introduced as a hybrid of Lasso and Ridge regressions.

**Chapter 4: Conclusion**
- I'll summarize the key takeaways from the data analysis and prediction techniques covered.

# 1. Import packages <a class="anchor"  id="chapter1"></a>

In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
!pip install pycountry
import pycountry

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import warnings
# Ignore all warnings
warnings.filterwarnings("ignore")




# 2. Data Exploration  <a class="anchor"  id="chapter2"></a>

## 2.1 Reading and Understanding the data set  <a class="anchor"  id="subchapter2.1"></a>

In [2]:
# Reading the dataset
df = pd.read_csv('/kaggle/input/data-science-salaries-2023/ds_salaries.csv')
display(df.head())
print(df.shape)

| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | SE | FT | Principal Data Scientist | 80000 | EUR | 85847 | ES | 100 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30000 | USD | 30000 | US | 100 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25500 | USD | 25500 | US | 100 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |


(3755, 11)


In [3]:
# df.info() prints its summary directly and returns None,
# so it does not need to be wrapped in display()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB

The Data Science Job Salaries dataset contains 11 columns:

* work_year: The year the salary was paid.
* experience_level: The experience level in the job during the year.
* employment_type: The type of employment for the role.
* job_title: The role worked in during the year.
* salary: The total gross salary amount paid.
* salary_currency: The currency of the salary paid as an ISO 4217 currency code.
* salary_in_usd: The salary in USD.
* employee_residence: Employee's primary country of residence during the work year as an ISO 3166 country code.
* remote_ratio: The overall amount of work done remotely (0, 50, or 100).
* company_location: The country of the employer's main office or contracting branch.
* company_size: The median number of people that worked for the company during the year, bucketed as S, M, or L.

## 2.2 Data Transformation  <a class="anchor"  id="subchapter2.2"></a>

In [4]:
# Replace experience level codes with descriptive labels
df['experience_level'] = df['experience_level'].replace({
    'SE': 'Senior',
    'EN': 'Entry level',
    'EX': 'Executive level',
    'MI': 'Mid/Intermediate level',
})

# Replace employment type codes with descriptive labels
df['employment_type'] = df['employment_type'].replace({
    'FL': 'Freelancer',
    'CT': 'Contractor',
    'FT': 'Full-time',
    'PT': 'Part-time'
})

# Replace company size codes with descriptive labels
df['company_size'] = df['company_size'].replace({
    'S': 'SMALL',
    'M': 'MEDIUM',
    'L': 'LARGE',
})

# Convert remote ratio to string and replace codes with descriptive labels
df['remote_ratio'] = df['remote_ratio'].astype(str)
df['remote_ratio'] = df['remote_ratio'].replace({
    '0': 'On-Site',
    '50': 'Half-Remote',
    '100': 'Full-Remote',
})


Converting the various abbreviations into descriptive labels for better understanding:
1. Experience Level :
    *     'SE': 'Senior'
    *     'EN': 'Entry level'
    *     'EX': 'Executive level'
    *     'MI': 'Mid/Intermediate level'
2. Employment Type : 
    *    'FL': 'Freelancer'
    *    'CT': 'Contractor'
    *    'FT': 'Full-time'
    *    'PT': 'Part-time'
3. Company Size : 
    *    'S': 'SMALL'
    *    'M': 'MEDIUM'
    *    'L': 'LARGE'
4. Remote Ratio : 
    *    '0': 'On-Site'
    *    '50': 'Half-Remote'
    *    '100': 'Full-Remote'
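
One thing worth checking after a relabeling step like this is whether any code was left out of the mapping. The sketch below is only an illustration on a hypothetical five-row frame (the `'XX'` code is made up and is not in the dataset): it uses `Series.map`, which yields `NaN` for unmapped codes, whereas `replace` leaves them unchanged without any signal.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset;
# 'XX' is an invented code used to trigger the check
toy = pd.DataFrame({'experience_level': ['SE', 'MI', 'EN', 'EX', 'XX']})

experience_labels = {
    'SE': 'Senior',
    'EN': 'Entry level',
    'EX': 'Executive level',
    'MI': 'Mid/Intermediate level',
}

# .map() returns NaN for codes missing from the dictionary,
# while .replace() would leave such codes untouched
mapped = toy['experience_level'].map(experience_labels)

# Collect the original codes that had no label
unmapped_codes = toy.loc[mapped.isna(), 'experience_level'].unique().tolist()
print(unmapped_codes)
```

An empty list on the real column would suggest that every code in that column has a descriptive label.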

In [5]:
# Define a function to convert country code to country name using pycountry library
def country_code_to_name(code):
    try:
        # Get the country object using the alpha-2 code
        country = pycountry.countries.get(alpha_2=code)
        if country:
            return country.name
        else:
            return 'Unknown'
    except AttributeError:
        return 'Invalid Code'
    
# Apply the country_code_to_name function to the 'employee_residence' column and create 'employee_residence_country'
df['employee_residence_country'] = df['employee_residence'].apply(country_code_to_name)

# Apply the country_code_to_name function to the 'company_location' column and create 'company_location_country'
df['company_location_country'] = df['company_location'].apply(country_code_to_name)

# Uncomment the following line to display the DataFrame with added country name columns
# display(df)


In [6]:
# Define a function to assign broader job categories based on specific job titles
def assign_broader_category(job_title):
    # Define lists of job titles for different broader categories
    data_engineering = ["Data Engineer", "Data Analyst", "Analytics Engineer", "BI Data Analyst", "Business Data Analyst", "BI Developer", "BI Analyst", "Business Intelligence Engineer", "BI Data Engineer", "Power BI Developer"]
    data_scientist = ["Data Scientist", "Applied Scientist", "Research Scientist", "3D Computer Vision Researcher", "Deep Learning Researcher", "AI/Computer Vision Engineer"]
    machine_learning = ["Machine Learning Engineer", "ML Engineer", "Lead Machine Learning Engineer", "Principal Machine Learning Engineer"]
    data_architecture = ["Data Architect", "Big Data Architect", "Cloud Data Architect", "Principal Data Architect"]
    management = ["Data Science Manager", "Director of Data Science", "Head of Data Science", "Data Scientist Lead", "Head of Machine Learning", "Manager Data Management", "Data Analytics Manager"]
    
    # Check if the job title belongs to any of the defined lists and assign the corresponding broader category
    if job_title in data_engineering:
        return "Data Engineering"
    elif job_title in data_scientist:
        return "Data Science"
    elif job_title in machine_learning:
        return "Machine Learning"
    elif job_title in data_architecture:
        return "Data Architecture"
    elif job_title in management:
        return "Management"
    else:
        return "Other"

# Apply the function to the 'job_title' column and create a new column 'job_category'
df['job_category'] = df['job_title'].apply(assign_broader_category)

Because there are many different job titles, I define broader job categories based on specific job titles.
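
With a hand-written title list, it can help to see which titles fall through to "Other". The following is a minimal sketch, assuming a shortened, hypothetical lookup dictionary (`CATEGORY_BY_TITLE`) in place of the full lists in the function above, and a made-up five-row frame.

```python
import pandas as pd

# Hypothetical, shortened lookup: job title -> broader category
CATEGORY_BY_TITLE = {
    'Data Engineer': 'Data Engineering',
    'Data Analyst': 'Data Engineering',
    'Data Scientist': 'Data Science',
    'ML Engineer': 'Machine Learning',
    'Data Architect': 'Data Architecture',
    'Data Science Manager': 'Management',
}

# Invented rows used only for illustration
toy = pd.DataFrame({'job_title': [
    'Data Engineer', 'ML Engineer', 'Principal Data Scientist',
    'Data Scientist', 'Principal Data Scientist',
]})

# Titles missing from the lookup become NaN, then "Other"
toy['job_category'] = toy['job_title'].map(CATEGORY_BY_TITLE).fillna('Other')

# Titles that ended up in "Other", most frequent first
other_titles = toy.loc[toy['job_category'] == 'Other', 'job_title'].value_counts()
print(other_titles)
```

Frequent titles in that listing are candidates for being assigned to one of the named categories.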

In [7]:
for col in df.columns:
    print(f'{col} has {df[col].nunique()} unique values')

work_year has 4 unique values
experience_level has 4 unique values
employment_type has 4 unique values
job_title has 93 unique values
salary has 815 unique values
salary_currency has 20 unique values
salary_in_usd has 1035 unique values
employee_residence has 78 unique values
remote_ratio has 3 unique values
company_location has 72 unique values
company_size has 3 unique values
employee_residence_country has 78 unique values
company_location_country has 72 unique values
job_category has 6 unique values


## 2.3a Data Visualization (Categorical Data) <a class="anchor"  id="subchapter2.3a"></a>

In [8]:
CATEGORICAL_COLS = ['experience_level', 'employment_type', 'remote_ratio','company_size','job_category']

In [9]:
# Create a 4x2 grid of subplots using Plotly Graph Objects
cat_fig = make_subplots(rows=4,cols=2, subplot_titles=['Experience Level', 'Employee Residence', 'Employment Type', 'Company Location', 'Company Size', 'Job Title', 'Job Category'])

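# Add a bar plot for the distribution of experience levels
# (go.Bar keeps each count label aligned with its own category)
experience_counts = df['experience_level'].value_counts()
cat_fig.add_trace(go.Bar(x=experience_counts.index, y=experience_counts.values, text=experience_counts.values), row=1, col=1)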

# Get the top 15 employee residences by count and add a bar plot
top_employee_residence = df['employee_residence'].value_counts().nlargest(15)
cat_fig.add_trace(go.Bar(x=top_employee_residence.index, y=top_employee_residence.values, text=top_employee_residence.values), row=1, col=2)

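# Add a bar plot for the distribution of employment types
# (go.Bar keeps each count label aligned with its own category)
employment_counts = df['employment_type'].value_counts()
cat_fig.add_trace(go.Bar(x=employment_counts.index, y=employment_counts.values, text=employment_counts.values), row=2, col=1)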

# Get the top 15 company locations by count and add a bar plot
top_company_location = df['company_location'].value_counts().nlargest(15)
cat_fig.add_trace(go.Bar(x=top_company_location.index, y=top_company_location.values, text=top_company_location.values), row=2, col=2)

# Add a bar plot for the distribution of company sizes
top_company_size = df['company_size'].value_counts()
cat_fig.add_trace(go.Bar(x=top_company_size.index, y=top_company_size.values, text=top_company_size.values), row=3, col=1)

# Get the top 15 job titles by count and add a bar plot
top_job_title = df['job_title'].value_counts().nlargest(15)
cat_fig.add_trace(go.Bar(x=top_job_title.index, y=top_job_title.values, text=top_job_title.values), row=3, col=2)

# Add a bar plot for the distribution of job categories
top_job_category = df['job_category'].value_counts()
cat_fig.add_trace(go.Bar(x=top_job_category.index, y=top_job_category.values, text=top_job_category.values), row=4, col=1)

# Display the subplots
cat_fig.show()

1. **Experience Level**: 2516 records are senior-level roles, followed by 805 mid/intermediate, 320 entry-level, and 114 executive roles.
2. **Employee Residence**: The most common employee residence is the United States, with 3004 records.
3. **Employment Type**: 3718 records are full-time positions, with 17 contractors, 10 freelancers, and 10 part-time positions.
4. **Company Location**: Most companies are located in the United States, with 3040 records.
5. **Company Size**: Most records, 3153, belong to medium-sized companies, followed by 454 in large companies and 148 in small companies.
6. **Job Title**: The most frequent job title is "Data Engineer," closely followed by "Data Scientist" and "Data Analyst."
7. **Job Category**: 1813 records fall under Data Engineering (which here also includes analyst and BI titles), 985 under Data Science, 421 under Other, 327 under Machine Learning, 105 under Data Architecture, and 104 under Management.

## 2.3b Data Visualization (Numerical Data) <a class="anchor"  id="subchapter2.3b"></a>

In [10]:
NUMERICAL_COLS = ['work_year', 'salary_in_usd']

In [11]:
# Get the counts of each work year
total_work_year = df['work_year'].value_counts()

# Create a new figure for the pie chart
work_year = go.Figure()

# Add a Pie chart trace using the total work year counts
work_year.add_trace(go.Pie(labels=total_work_year.index, values=total_work_year.values, hole=0.3))

# Update the layout of the figure to set the title
work_year.update_layout(title='Work Years Distribution')

Most of the reported salaries come from 2023 and 2022. This may be because the field of data science was less prevalent in 2021 and 2020, although it could equally reflect how the data was collected.

In [12]:
# Create a new figure for the Swarm plot
swarm_plot = go.Figure()

# Add a Box plot trace to the figure
swarm_plot.add_trace(go.Box(
    y=df['salary_in_usd'],
    boxpoints='all',  # 'all' shows individual data points on the plot
    jitter=0.3,  # Adjusts the spread of data points along the axis
    pointpos=0,  # Shifts the position of data points along the axis
    marker=dict(size=6, color='black'),
    name='Box Plot'
))

# Add labels and title to the Swarm plot
swarm_plot.update_layout(
    title='Salary Distribution (Swarm Plot)',
    yaxis=dict(title='Salary')
)

# Show the Swarm plot
swarm_plot.show()

The box plot shows the distribution of salaries. The maximum salary is 450k, the 75th percentile (third quartile) is 175k, the median is 135k, the 25th percentile (first quartile) is 95k, and the minimum salary is 5,132.
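
The statistics a box plot displays can also be computed directly. The sketch below uses five made-up salaries (not values from the dataset) and pandas' default linear interpolation; Plotly may use a slightly different quartile method, so hover values on the chart can differ marginally from `Series.quantile`.

```python
import pandas as pd

# Hypothetical salaries in USD, chosen only for illustration
salaries = pd.Series([50_000, 95_000, 135_000, 175_000, 450_000])

# First quartile, median, and third quartile
q1, median, q3 = salaries.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1  # interquartile range

# Whisker limits under the common 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones usually drawn as outliers
outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]
print(q1, median, q3, outliers.tolist())
```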

## 2.4 Bivariate analysis <a class="anchor"  id="subchapter2.4"></a>

Bivariate analysis is a statistical method that involves analyzing the relationship between two variables. It focuses on understanding how changes in one variable are related to changes in another variable. Bivariate analysis is commonly used to uncover patterns, correlations, or associations between pairs of variables in a dataset.

In [13]:
# Create subplots with specified titles for each subplot
salary_dist = make_subplots(rows=5, cols=1, subplot_titles=[
    'Salary distribution by Employment Type', 
    'Salary distribution by Experience Level', 
    'Salary distribution by Company Size', 
    'Salary distribution by Job Category', 
    'Salary distribution by Job Title'
])

# Add a Box plot trace for each category to the subplots
salary_dist.add_trace(go.Box(y=df['salary_in_usd'], x=df['employment_type']), row=1, col=1) # Box plot for Employment Type
salary_dist.add_trace(go.Box(y=df['salary_in_usd'], x=df['experience_level']), row=2, col=1) # Box plot for Experience Level
salary_dist.add_trace(go.Box(y=df['salary_in_usd'], x=df['company_size']), row=3, col=1) # Box plot for Company Size
salary_dist.add_trace(go.Box(y=df.sort_values('salary_in_usd', ascending=False)['salary_in_usd'], 
                             x=df.sort_values('salary_in_usd', ascending=False)['job_category']), row=4, col=1) # Box plot for Job Category
salary_dist.add_trace(go.Box(y=df[df['job_title'].isin(top_job_title.index)].sort_values('salary_in_usd', ascending=False)['salary_in_usd'], 
                             x=df[df['job_title'].isin(top_job_title.index)].sort_values('salary_in_usd', ascending=False)['job_title']), row=5, col=1) # Box plot for job title

# Update the layout of the subplots
salary_dist.update_layout(
    title='Bivariate Analysis of Salary against Other Columns',
    xaxis=dict(title='Employment Type'),
    yaxis=dict(title='Salary'),
    height=1500  # Set the overall height of the subplots (you can adjust this value as needed)
)

# Show the subplots
salary_dist.show()

1. **Analyzing Salary by Employment Type**: Among the employment types, full-time roles show the highest salary distribution, with a maximum of 450k, a 75th percentile of 175.1k, a median of 135k, and a 25th percentile of 95.55k.
2. **Exploring Salary by Experience Level**: The Executive level has the highest figures, with a 75th percentile of 239k, a median of 196k, and a 25th percentile of 145k.
3. **Investigating Salary by Company Size**: Medium-sized companies have the highest figures, with a 75th percentile of 180k, a median of 140k, and a 25th percentile of 102.1k.
4. **Scrutinizing Salary by Job Category**: The Management job category has the highest figures, with a 75th percentile of 221.32k, a median of 159.5k, and a 25th percentile of 134.118k.
5. **Examining Salary by Job Title**: Among the 15 most frequent job titles, Data Science Manager has the highest figures, with a 75th percentile of 245.1k, a median of 183.78k, and a 25th percentile of 144k.
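
Quartiles per group, like the ones read off the box plots, can be tabulated with a grouped quantile. This is a minimal sketch on six invented rows; the column names follow the dataset, but the salary values are hypothetical.

```python
import pandas as pd

# Invented rows used only for illustration
toy = pd.DataFrame({
    'experience_level': ['Senior', 'Senior', 'Senior',
                         'Entry level', 'Entry level', 'Entry level'],
    'salary_in_usd': [120_000, 150_000, 180_000, 50_000, 70_000, 90_000],
})

# One row per group, one column per requested quantile
quartiles = (
    toy.groupby('experience_level')['salary_in_usd']
       .quantile([0.25, 0.50, 0.75])
       .unstack()
)
quartiles.columns = ['q1', 'median', 'q3']

# Rank the groups by their median salary
quartiles = quartiles.sort_values('median', ascending=False)
print(quartiles)
```

Swapping `'experience_level'` for another categorical column gives the same table for that grouping.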

In [14]:
# Calculate the median salary for each job category and year
# (the variable is named mean_salary, but it holds medians)
mean_salary = df.groupby(['job_category', 'work_year'])['salary_in_usd'].median().reset_index()

# Create a new figure for the heatmap
heatmap = go.Figure()

# Add a Heatmap trace to the figure
heatmap.add_trace(go.Heatmap(
    x=mean_salary['work_year'],
    y=mean_salary['job_category'],
    z=mean_salary['salary_in_usd'],
    colorscale='Greens',  # You can choose a different colorscale if desired
    colorbar=dict(title='Median Salary')
))

# Add labels and title to the Heatmap
heatmap.update_layout(
    title='Median Salary by Job Category and Year',
    xaxis=dict(title='Year'),
    yaxis=dict(title='Job Category')
)

# Show the Heatmap
heatmap.show()

The heatmap above shows the median salary by job category and year. \
In 2020, the "Management" job category has the highest median salary, followed closely by the "Machine Learning" job category. \
In 2021, the "Data Architecture" job category leads in median salary. \
In 2022 and 2023, the job categories show a relatively similar range of median salaries.

In [15]:
# Create a new figure for the line chart
line_chart = go.Figure()

# Iterate over unique job categories and add a Scatter trace for each
for job_category in mean_salary['job_category'].unique():
    # Filter the mean_salary DataFrame for the current job_category
    df_filtered = mean_salary[mean_salary['job_category'] == job_category]
    
    # Add a Scatter trace for the current job_category
    line_chart.add_trace(go.Scatter(
        x=df_filtered['work_year'],
        y=df_filtered['salary_in_usd'],
        mode='lines+markers',  # Display lines and markers
        name=job_category
    ))

# Add labels and title to the line chart
line_chart.update_layout(
    title='Median Salary Trend by Job Category and Year',
    xaxis=dict(title='Year'),
    yaxis=dict(title='Salary')
)

# Show the line chart
line_chart.show()

The line graph above shows the median salary trend by job category and year. \
It agrees with the heatmap: in 2020, the "Management" job category has the highest median salary, followed closely by the "Machine Learning" category. In 2021, the "Data Architecture" category takes the lead. In 2022 and 2023, the job categories show a relatively similar range of median salaries.

In [16]:
# Calculate the median salary for each country
median_salary = df.groupby(['company_location_country'])['salary_in_usd'].median().reset_index()

# Display the median salary data sorted in descending order
# display(median_salary.sort_values('salary_in_usd', ascending=False))

# Create a Choropleth map for median salary distribution
world_map = go.Figure(data=go.Choropleth(
    locations=median_salary['company_location_country'],
    z=median_salary['salary_in_usd'],
    locationmode='country names',
    text=median_salary['company_location_country'],
    colorscale='Greens',  # You can choose a different colorscale if desired
    colorbar=dict(title='Median Salary (USD)')
))

# Update the layout of the map
world_map.update_layout(
    title='Median Salary Distribution by Region',
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='natural earth'
    )
)

# Show the Choropleth world map
world_map.show()

# Create a bar plot for median salary distribution by country
median_salary.sort_values('salary_in_usd', ascending=False, inplace=True)
world_bar = go.Figure(go.Bar(x=median_salary['company_location_country'], y=median_salary['salary_in_usd']))

world_bar.update_layout(
    title = 'Median Salary Distribution by Country')
# Show the bar plot
world_bar.show()

On the world map, the United States stands out as one of the countries with the highest median salaries. \
The bar chart that follows ranks countries by median salary: Israel is first, Puerto Rico second, and the United States third.
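
A ranking by median says nothing about how many records sit behind each country, so it can be useful to report the count next to the median. The sketch below uses invented rows; the salaries and the `MIN_RECORDS` threshold are hypothetical and are not claims about the real dataset.

```python
import pandas as pd

# Invented rows: one country with several records, one with a single record
toy = pd.DataFrame({
    'company_location_country': ['United States'] * 4 + ['Israel'],
    'salary_in_usd': [100_000, 140_000, 150_000, 200_000, 250_000],
})

# Median salary and number of records per country, highest median first
summary = (
    toy.groupby('company_location_country')['salary_in_usd']
       .agg(median_salary='median', n_records='count')
       .sort_values('median_salary', ascending=False)
)

# Hypothetical cut-off: keep only countries with enough records
MIN_RECORDS = 3
reliable = summary[summary['n_records'] >= MIN_RECORDS]
print(summary)
print(reliable)
```

In this toy example the country ranked first has a single record, which is the kind of case the count column is meant to expose.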

## 2.5 Correlation <a class="anchor"  id="subchapter2.5"></a>

Correlation is a statistical measure that quantifies the strength and direction of the linear relationship between two continuous variables. It indicates whether changes in one variable are associated with changes in another variable. Correlation values range between -1 and 1:

* A correlation coefficient of 1 indicates a perfect positive correlation, where, as one variable increases, the other variable also increases proportionally.
* A correlation coefficient of -1 indicates a perfect negative correlation, where, as one variable increases, the other variable decreases proportionally.
* A correlation coefficient close to 0 indicates a weak or no linear correlation between the variables.
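
As a small illustration of the definition, the Pearson coefficient can be computed by hand and compared with NumPy. The arrays and the helper name `pearson_r` are made up for this sketch.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson_r(a, b):
    """Covariance of a and b divided by the product of their spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 matrix

# Exact linear relationships give the two extreme values
r_perfect_positive = pearson_r(x, 2 * x + 1)
r_perfect_negative = pearson_r(x, -x)
print(r_manual, r_numpy, r_perfect_positive, r_perfect_negative)
```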

A note on scaling and correlation:

1. Scale Invariance: The Pearson correlation coefficient is unchanged when a variable is shifted or multiplied by a positive constant, so standardizing a column does not change its correlation with any other column. Variables on larger scales do not dominate the calculation.
2. Impact of Outliers: Outliers can significantly affect correlation calculations. Standardization does not remove them, because an extreme salary is still extreme after scaling, so they are worth inspecting separately (for example in the box plots above).
3. Interpretation: Because the coefficient is already unit-free, correlations computed on raw and on standardized salaries can be read in exactly the same way.

The salary is nevertheless standardized below with `StandardScaler`, because the resulting `scaled_salary` column is reused as the prediction target in Chapter 3, where errors are then expressed in standard deviations of salary.

In summary, scaling is not required for the correlation analysis itself; it is a preparation step for the models that follow, and the correlation values reflect the strength and direction of the linear relationships either way.
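
To see concretely what standardization does to a correlation coefficient, the sketch below compares the Pearson correlation before and after scaling on synthetic data. The variable names and the data-generating numbers are assumptions for illustration; `standardize` applies the same transform as `StandardScaler` (subtract the mean, divide by the population standard deviation).

```python
import numpy as np

# Synthetic data: salary depends linearly on experience plus noise
rng = np.random.default_rng(0)
experience = rng.normal(size=200)
salary = 100_000 + 30_000 * experience + rng.normal(scale=20_000, size=200)

def standardize(values):
    # Zero mean and unit variance, using the population std (ddof=0)
    return (values - values.mean()) / values.std()

r_raw = np.corrcoef(experience, salary)[0, 1]
r_scaled = np.corrcoef(experience, standardize(salary))[0, 1]

# The two coefficients agree up to floating-point rounding
print(r_raw, r_scaled)
```

In this sketch the coefficient is the same before and after scaling, which is a general property of the Pearson correlation under positive linear rescaling.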

In [17]:
dummy_variables = pd.get_dummies(df, columns=CATEGORICAL_COLS, drop_first=False)

In [18]:
scaler = StandardScaler()

dummy_variables['scaled_salary'] = scaler.fit_transform(df['salary_in_usd'].values.reshape(-1,1))

In [19]:
corr_df = dummy_variables.drop(columns=['work_year','job_title', 'salary', 'salary_currency', 'salary_in_usd',
                                       'employee_residence', 'company_location', 'employee_residence_country',
                                       'company_location_country'], axis=1)
display(corr_df)

Unnamed: 0,experience_level_Entry level,experience_level_Executive level,experience_level_Mid/Intermediate level,experience_level_Senior,employment_type_Contractor,employment_type_Freelancer,employment_type_Full-time,employment_type_Part-time,remote_ratio_Full-Remote,remote_ratio_Half-Remote,...,company_size_LARGE,company_size_MEDIUM,company_size_SMALL,job_category_Data Architecture,job_category_Data Engineering,job_category_Data Science,job_category_Machine Learning,job_category_Management,job_category_Other,scaled_salary
0,0,0,0,1,0,0,1,0,1,0,...,1,0,0,0,0,0,0,0,1,-0.820391
1,0,0,1,0,1,0,0,0,1,0,...,0,0,1,0,0,0,1,0,0,-1.706187
2,0,0,1,0,1,0,0,0,1,0,...,0,0,1,0,0,0,1,0,0,-1.777563
3,0,0,0,1,0,0,1,0,1,0,...,0,1,0,0,0,1,0,0,0,0.593676
4,0,0,0,1,0,0,1,0,1,0,...,0,1,0,0,0,1,0,0,0,-0.278686
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3750,0,0,0,1,0,0,1,0,1,0,...,1,0,0,0,0,1,0,0,0,4.352762
3751,0,0,1,0,0,0,1,0,1,0,...,1,0,0,0,0,0,0,0,1,0.213009
3752,1,0,0,0,0,0,1,0,1,0,...,0,0,1,0,0,1,0,0,0,-0.516603
3753,1,0,0,0,1,0,0,0,1,0,...,1,0,0,0,1,0,0,0,0,-0.595909


In [20]:
# Calculate the correlation matrix for the DataFrame
corr = corr_df.corr()

# Display the correlation matrix
# display(corr)

# Create a heatmap using the Plotly library
heatmap = go.Figure()

# Add a heatmap trace to the figure using the correlation matrix values
heatmap.add_trace(go.Heatmap(
    z=corr.values,  # Correlation values as the heatmap data
    y=corr.columns,  # Y-axis labels (columns)
    x=corr.index,  # X-axis labels (index)
    colorscale='Greens'  # Choose a colorscale for the heatmap (e.g., Greens)
    ))

# Customize the layout of the heatmap
heatmap.update_layout(
    title='Correlation Heatmap Matrix',  # Set the title of the heatmap
    xaxis=dict(title='Feature'),  # Set the x-axis title
    yaxis=dict(title='Feature')  # Set the y-axis title
)

# Show the generated heatmap
heatmap.show()


In [21]:
# Calculate the correlation of features with the 'scaled_salary' column
corr_scaled_salary = corr['scaled_salary'].sort_values(ascending=False)

# Exclude the 'scaled_salary' itself and create a DataFrame with target correlations
target_corr_df = corr_scaled_salary.drop(['scaled_salary']).to_frame()

# Create a new Plotly figure for the scaled salary heatmap
scaled_salary_heatmap = go.Figure()

# Add a heatmap trace to the figure using target correlations as data
scaled_salary_heatmap.add_trace(go.Heatmap(
    z=target_corr_df.values,  # Target correlations as the heatmap data
    y=target_corr_df.index,   # Y-axis labels (features)
    colorscale='bluered_r'    # Choose a colorscale for the heatmap (e.g., bluered_r)
))

# Customize the layout of the heatmap
scaled_salary_heatmap.update_layout(
    yaxis=dict(
        autorange='reversed'  # Reverse the y-axis for better visualization
    ),
    title='Heatmap to Display Correlation with Salary'  # Set the title of the heatmap
)

# Show the generated heatmap
scaled_salary_heatmap.show()


The heatmap of correlations with salary indicates that senior experience level, medium company size, executive experience level, and full-time employment type are positively correlated with salary. \
Conversely, entry-level experience, mid/intermediate-level experience, half-remote work, and small and large company sizes are negatively correlated with salary.

# 3. Predictions <a class="anchor"  id="chapter3"></a>

In [22]:
X = corr_df.drop(columns=['scaled_salary'], axis=1)
y = corr_df['scaled_salary']

CROSS_VALIDATION = 10 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state= 42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(2628, 20) (1127, 20) (2628,) (1127,)


## 3.1 Defining Evaluation Metrics <a class="anchor"  id="subchapter3.1"></a>

In [23]:
# Create an empty DataFrame with specified columns
comparison_col = ['Model', 'train_MSE', 'train_MAE', 'train_R2', 'test_MSE', 'test_MAE', 'test_R2']
comparison_df = pd.DataFrame(columns=comparison_col)
display(comparison_df)

# Function to add model performance metrics to the comparison DataFrame
def add_to_comparison(comparison_df, model_name, train_mse, train_mae, train_r2, test_mse, test_mae, test_r2):
    # Build a one-row DataFrame with the model metrics
    # (named `row` so it does not shadow the global `df`)
    row = pd.DataFrame({'Model': [model_name], 'train_MSE': [train_mse], 'train_MAE': [train_mae], 'train_R2': [train_r2],
                        'test_MSE': [test_mse], 'test_MAE': [test_mae], 'test_R2': [test_r2]})
    
    # Append the row and renumber the index so each model gets a unique label
    comparison_df = pd.concat([comparison_df, row], ignore_index=True)
    
    # Return the updated comparison DataFrame
    return comparison_df

Unnamed: 0,Model,train_MSE,train_MAE,train_R2,test_MSE,test_MAE,test_R2


The models are compared using Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R²). Each metric captures something different, and its value has to be read in the context of the data and the problem:

1. **Mean Squared Error (MSE)**:
   - Interpretation: MSE measures the average squared difference between the predicted values and the actual values. A lower MSE indicates better predictive accuracy.
   - Analysis:
     - Compare MSE across models: Compare the MSE values of different models. A model with a lower MSE is generally preferred.
     - Relative scale: Keep in mind that MSE is dependent on the scale of the target variable. A small MSE might be good for a small-scale target variable, but the interpretation may differ for a larger-scale variable.

2. **Mean Absolute Error (MAE)**:
   - Interpretation: MAE measures the average absolute difference between the predicted values and the actual values. It is less sensitive to outliers compared to MSE.
   - Analysis:
     - Interpretability: MAE is easier to interpret as it represents the average error in the original units of the target variable.
     - Outliers: If your data has outliers that disproportionately affect the MSE, MAE might provide a more balanced view of the model's performance.

3. **R-squared (R²)**:
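   - Interpretation: R² measures the proportion of the variance in the dependent variable that is explained by the independent variables. Its maximum is 1, and a higher value indicates a better fit. A value of 0 corresponds to always predicting the mean of the target, and on held-out data R² can be negative when the model does worse than that.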
   - Analysis:
     - Model fit: A higher R² indicates that the model is capturing more of the variability in the data. However, a high R² doesn't necessarily mean that the model is valid; it could be overfitting.
     - Context: R² should be considered in context with the problem and the data. Some data might inherently have lower R² due to noise or external factors.

Additional Analysis Tips:
- Compare with baseline: Compare your model's MSE, MAE, and R² with a baseline model or naive approach to evaluate its improvement.
- Model complexity: Consider the trade-off between model complexity and performance metrics. More complex models might achieve lower training MSE but could overfit.
- Overfitting and underfitting: Analyze the metrics on both training and testing data to identify potential overfitting (low training MSE but high testing MSE) or underfitting (high training and testing MSE).
- Visualize: Plotting the predicted values against the actual values or creating residual plots can provide insights into the model's performance and identify patterns.

In summary, analyzing MSE, MAE, and R² involves interpreting their values, comparing models, considering the context of the problem and data, and using them in conjunction with visual analysis and other evaluation metrics. These metrics provide different perspectives on the model's accuracy, fit, and performance.
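
The definitions above can be checked by hand. The following is a minimal sketch on made-up toy values (not the notebook's data); `regression_metrics` is a hypothetical helper written only for illustration. It shows that always predicting the mean gives an R² of 0, and that a single large miss inflates MSE much more than MAE and can push R² below 0.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)                    # average squared error
    mae = np.mean(np.abs(residuals))                 # average absolute error
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2

# Toy targets (assumption: illustrative values only)
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Naive baseline: always predict the mean of the targets
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_metrics = regression_metrics(y_true, baseline_pred)

# Perfect on four points, badly wrong on one
one_big_miss = np.array([1.0, 2.0, 3.0, 4.0, 15.0])
miss_metrics = regression_metrics(y_true, one_big_miss)

print("baseline (MSE, MAE, R2):", baseline_metrics)
print("one big miss (MSE, MAE, R2):", miss_metrics)
```

On these toy values the mean baseline gives MSE 2.0, MAE 1.2 and R² 0.0, while the prediction with one large miss gives MSE 20.0, MAE 2.0 and R² -9.0.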

## 3.2 Linear Regression <a class="anchor"  id="subchapter3.2"></a>

Linear regression is a statistical technique that models the relationship between a dependent variable and one or more independent variables as a linear function (a straight line for a single feature, a hyperplane for several). Ordinary least squares finds the coefficients that minimize the sum of squared differences between observed and predicted values, which allows for predictions and for understanding the relationship between variables.

In [24]:
# Create a Linear Regression model
LR = LinearRegression()

# Perform 5-fold cross-validation and get MSE, MAE, and R2 scores
train_mse = -cross_val_score(LR, X_train, y_train, cv=CROSS_VALIDATION, scoring='neg_mean_squared_error')
train_mae = -cross_val_score(LR, X_train, y_train, cv=CROSS_VALIDATION, scoring='neg_mean_absolute_error')
train_r2 = cross_val_score(LR, X_train, y_train, cv=CROSS_VALIDATION, scoring='r2')

# Train the Linear Regression model
LR.fit(X_train, y_train)

# Predict target variable using the trained model
y_pred = LR.predict(X_test)

# Calculate Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared on the test set
test_mse = mean_squared_error(y_test, y_pred)
test_mae = mean_absolute_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

# Add model metrics to the comparison DataFrame
comparison_df = add_to_comparison(comparison_df, 'LR', train_mse.mean(), train_mae.mean(), train_r2.mean(),
                                 test_mse, test_mae, test_r2)

# Display the updated comparison DataFrame
display(comparison_df)

|  | Model | train_MSE | train_MAE | train_R2 | test_MSE | test_MAE | test_R2 |
|---|---|---|---|---|---|---|---|
| 0 | LR | 0.748349 | 0.668722 | 0.250295 | 0.775179 | 0.669313 | 0.222811 |
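
The three `cross_val_score` calls above run cross-validation once per metric. As an alternative sketch, scikit-learn's `cross_validate` accepts several scorers in one pass. The data below is synthetic (`make_regression`), and `cv=5` is an assumption standing in for the notebook's `CROSS_VALIDATION` constant; the names `X_demo`, `y_demo` and `scoring` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for X_train / y_train (assumption: not the notebook's data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# One scorer per metric; the "neg_" scorers return negated errors
scoring = {
    "mse": "neg_mean_squared_error",
    "mae": "neg_mean_absolute_error",
    "r2": "r2",
}

# cv=5 is assumed here in place of CROSS_VALIDATION
cv_results = cross_validate(LinearRegression(), X_demo, y_demo, cv=5, scoring=scoring)

# Flip the sign of the negated error scores and average over the folds
cv_mse = -cv_results["test_mse"].mean()
cv_mae = -cv_results["test_mae"].mean()
cv_r2 = cv_results["test_r2"].mean()

print(f"CV MSE: {cv_mse:.3f}, CV MAE: {cv_mae:.3f}, CV R2: {cv_r2:.3f}")
```

Note that the `test_` prefix in the `cross_validate` result keys refers to the validation fold inside the training data, which is what the `train_*` columns of the comparison table hold.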


## 3.3 Decision Tree Regressor <a class="anchor"  id="subchapter3.3"></a>

Decision tree regression is a machine learning technique that uses a tree-like structure to model the relationship between features and a target variable. It recursively splits the data into smaller subsets based on feature thresholds and predicts the average value of the target variable within each final subset (leaf).

In [25]:
# Create a Decision Tree Regressor model
DT = DecisionTreeRegressor(random_state=42)

# Perform 5-fold cross-validation and get MSE, MAE, and R2 scores
train_mse = -cross_val_score(DT, X_train, y_train, cv=CROSS_VALIDATION, scoring='neg_mean_squared_error')
train_mae = -cross_val_score(DT, X_train, y_train, cv=CROSS_VALIDATION, scoring='neg_mean_absolute_error')
train_r2 = cross_val_score(DT, X_train, y_train, cv=CROSS_VALIDATION, scoring='r2')

# Fit the model to the training data
DT.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = DT.predict(X_test)

# Calculate evaluation metrics
test_mse = mean_squared_error(y_test, y_pred)
test_mae = mean_absolute_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

# Add model metrics to the comparison DataFrame
comparison_df = add_to_comparison(comparison_df, 'DT', train_mse.mean(), train_mae.mean(), train_r2.mean(), 
                                  test_mse, test_mae, test_r2)

# Display the updated comparison DataFrame
display(comparison_df)

|  | Model | train_MSE | train_MAE | train_R2 | test_MSE | test_MAE | test_R2 |
|---|---|---|---|---|---|---|---|
| 0 | LR | 0.748349 | 0.668722 | 0.250295 | 0.775179 | 0.669313 | 0.222811 |
| 0 | DT | 0.76675 | 0.6688 | 0.230441 | 0.783235 | 0.662914 | 0.214735 |
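
To make the "average within each subset" idea concrete, here is a small sketch on toy data (an assumption, not the notebook's dataset). A tree limited to one split is fitted, and its predictions are compared with the mean target value of the samples that fall into each leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data with two clear groups (assumption: illustrative values only)
X_demo = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

# max_depth=1 allows a single split, so the tree has two leaves
tree = DecisionTreeRegressor(max_depth=1, random_state=42).fit(X_demo, y_demo)

# apply() returns the index of the leaf each sample ends up in
leaf_ids = tree.apply(X_demo)

# Rebuild the predictions by hand: the mean target of each leaf
expected = np.empty_like(y_demo)
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    expected[in_leaf] = y_demo[in_leaf].mean()
    print(f"leaf {leaf}: samples={in_leaf.sum()}, mean target={y_demo[in_leaf].mean():.1f}")

predictions = tree.predict(X_demo)
print("predictions equal leaf means:", np.allclose(predictions, expected))
```

With these values the tree separates the low group from the high group and predicts 2.0 and 11.0, the means of the two groups.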


## 3.4 Lasso Regression <a class="anchor"  id="subchapter3.4"></a>

Lasso regression is a regression technique that adds a penalty on the sum of the absolute values of the feature coefficients (an L1 penalty). This penalty can shrink some coefficients exactly to zero, so the model keeps only the more important features and is less prone to overfitting. It therefore combines feature selection with regularization, making the model more interpretable and improving its generalization to new data.

In [26]:
# Create a Lasso Regression model
alpha = 0.01  # Regularization parameter
lasso_model = Lasso(alpha=alpha)

# Perform 5-fold cross-validation and get MSE, MAE, and R2 scores
train_mse = -cross_val_score(lasso_model, X_train, y_train, cv=CROSS_VALIDATION, scoring='neg_mean_squared_error')
train_mae = -cross_val_score(lasso_model, X_train, y_train, cv=CROSS_VALIDATION, scoring='neg_mean_absolute_error')
train_r2 = cross_val_score(lasso_model, X_train, y_train, cv=CROSS_VALIDATION, scoring='r2')

# Fit the model to the training data
lasso_model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = lasso_model.predict(X_test)

# Calculate evaluation metrics
test_mse = mean_squared_error(y_test, y_pred)
test_mae = mean_absolute_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

# Add model metrics to the comparison DataFrame
comparison_df = add_to_comparison(comparison_df, 'Lasso', train_mse.mean(), train_mae.mean(), train_r2.mean(), 
                                 test_mse, test_mae, test_r2)

# Display the updated comparison DataFrame
display(comparison_df)

|  | Model | train_MSE | train_MAE | train_R2 | test_MSE | test_MAE | test_R2 |
|---|---|---|---|---|---|---|---|
| 0 | LR | 0.748349 | 0.668722 | 0.250295 | 0.775179 | 0.669313 | 0.222811 |
| 0 | DT | 0.76675 | 0.6688 | 0.230441 | 0.783235 | 0.662914 | 0.214735 |
| 0 | Lasso | 0.755617 | 0.674044 | 0.24305 | 0.77904 | 0.675434 | 0.218941 |


The choice of `alpha` controls the strength of regularization. Higher values of `alpha` increase the amount of regularization applied.
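
The effect of `alpha` can be seen by counting how many coefficients Lasso sets exactly to zero. This is a sketch on synthetic data from `make_regression` (an assumption, not the notebook's dataset), and the `alpha` values are arbitrary illustration points rather than tuned settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of which actually drive the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=42
)

alphas = [0.01, 1.0, 10.0, 1000.0]  # arbitrary values for illustration
zero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    n_zero = int(np.sum(model.coef_ == 0))  # coefficients removed by the L1 penalty
    zero_counts.append(n_zero)
    print(f"alpha={alpha:<7} zero coefficients: {n_zero} of {model.coef_.size}")
```

With a very large `alpha` the penalty dominates and every coefficient is driven to zero, so the model predicts only the intercept.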

## 3.5 Ridge Regression <a class="anchor"  id="subchapter3.5"></a>

Ridge regression is a regression method that adds a penalty on the sum of the squared values of the feature coefficients (an L2 penalty), shrinking them towards zero without setting them exactly to zero. This reduces the impact of multicollinearity and overfitting, leading to a more stable and generalizable model.

In [27]:
# Create a Ridge Regression model
alpha = 1.0  # Regularization parameter
ridge_model = Ridge(alpha=alpha)

# Perform 5-fold cross-validation and get MSE, MAE, and R2 scores
train_mse = -cross_val_score(ridge_model, X_train, y_train, cv=CROSS_VALIDATION, scoring='neg_mean_squared_error')
train_mae = -cross_val_score(ridge_model, X_train, y_train, cv=CROSS_VALIDATION, scoring='neg_mean_absolute_error')
train_r2 = cross_val_score(ridge_model, X_train, y_train, cv=CROSS_VALIDATION, scoring='r2')

# Fit the model to the training data
ridge_model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = ridge_model.predict(X_test)

# Calculate evaluation metrics
test_mse = mean_squared_error(y_test, y_pred)
test_mae = mean_absolute_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

# Add model metrics to the comparison DataFrame
comparison_df = add_to_comparison(comparison_df, 'Ridge', train_mse.mean(), train_mae.mean(), train_r2.mean(), 
                                 test_mse, test_mae, test_r2)

# Display the updated comparison DataFrame
display(comparison_df)

|  | Model | train_MSE | train_MAE | train_R2 | test_MSE | test_MAE | test_R2 |
|---|---|---|---|---|---|---|---|
| 0 | LR | 0.748349 | 0.668722 | 0.250295 | 0.775179 | 0.669313 | 0.222811 |
| 0 | DT | 0.76675 | 0.6688 | 0.230441 | 0.783235 | 0.662914 | 0.214735 |
| 0 | Lasso | 0.755617 | 0.674044 | 0.24305 | 0.77904 | 0.675434 | 0.218941 |
| 0 | Ridge | 0.745775 | 0.667508 | 0.252963 | 0.773944 | 0.669588 | 0.22405 |


The choice of `alpha` controls the strength of regularization. Higher values of `alpha` increase the amount of regularization applied.
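
For Ridge, the shrinkage can be written in closed form. The sketch below uses synthetic data (an assumption, not the notebook's dataset) and fits without an intercept so that the textbook formula w = (XᵀX + αI)⁻¹Xᵀy applies directly; `ridge_closed_form` is a hypothetical helper for illustration. It compares the formula with scikit-learn's `Ridge` and tracks how the coefficient norm shrinks as `alpha` grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data (assumption: illustrative only)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=42)

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms = []
matches = []
for alpha in [0.1, 10.0, 1000.0]:  # arbitrary values for illustration
    sk_coef = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_
    manual_coef = ridge_closed_form(X_demo, y_demo, alpha)
    matches.append(np.allclose(sk_coef, manual_coef, rtol=1e-6))
    norms.append(np.linalg.norm(sk_coef))
    print(f"alpha={alpha:<7} coefficient norm: {norms[-1]:.3f}  matches formula: {matches[-1]}")
```

Unlike Lasso, the coefficients get smaller as `alpha` increases but generally do not become exactly zero.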

## 3.6 Elastic Net Regression <a class="anchor"  id="subchapter3.6"></a>

Elastic Net regression is a combination of Lasso and Ridge regressions, adding both L1 (absolute value) and L2 (squared value) penalties to the coefficients of features. It strikes a balance between feature selection and regularization, addressing the limitations of individual methods and providing a more flexible approach to regression modeling.

In [28]:
# Create an Elastic Net Regression model
alpha = 0.01  # Regularization parameter
l1_ratio = 0.5  # Mixing parameter between L1 and L2 regularization
elastic_net_model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)

# Perform 5-fold cross-validation and get MSE, MAE, and R2 scores
train_mse = -cross_val_score(elastic_net_model, X_train, y_train, cv=CROSS_VALIDATION, scoring='neg_mean_squared_error')
train_mae = -cross_val_score(elastic_net_model, X_train, y_train, cv=CROSS_VALIDATION, scoring='neg_mean_absolute_error')
train_r2 = cross_val_score(elastic_net_model, X_train, y_train, cv=CROSS_VALIDATION, scoring='r2')

# Fit the model to the training data
elastic_net_model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = elastic_net_model.predict(X_test)

# Calculate evaluation metrics
test_mse = mean_squared_error(y_test, y_pred)
test_mae = mean_absolute_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

# Add model metrics to the comparison DataFrame
comparison_df = add_to_comparison(comparison_df, 'Elastic Net', train_mse.mean(), train_mae.mean(), train_r2.mean(), 
                                 test_mse, test_mae, test_r2)

# Display the updated comparison DataFrame
display(comparison_df)

|  | Model | train_MSE | train_MAE | train_R2 | test_MSE | test_MAE | test_R2 |
|---|---|---|---|---|---|---|---|
| 0 | LR | 0.748349 | 0.668722 | 0.250295 | 0.775179 | 0.669313 | 0.222811 |
| 0 | DT | 0.76675 | 0.6688 | 0.230441 | 0.783235 | 0.662914 | 0.214735 |
| 0 | Lasso | 0.755617 | 0.674044 | 0.24305 | 0.77904 | 0.675434 | 0.218941 |
| 0 | Ridge | 0.745775 | 0.667508 | 0.252963 | 0.773944 | 0.669588 | 0.22405 |
| 0 | Elastic Net | 0.750112 | 0.671209 | 0.248616 | 0.774702 | 0.673105 | 0.22329 |
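
To illustrate how `l1_ratio` mixes the two penalties, the sketch below fits `ElasticNet` on synthetic data (an assumption, not the notebook's dataset) for a few arbitrary `l1_ratio` values and counts the coefficients set exactly to zero. It also checks that `l1_ratio=1.0` reproduces a Lasso fit with the same `alpha`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data: 10 features, only 3 of which are informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=42
)

alpha = 1.0  # arbitrary penalty strength for illustration
zero_counts = {}
for l1_ratio in [0.1, 0.5, 1.0]:
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X_demo, y_demo)
    zero_counts[l1_ratio] = int(np.sum(model.coef_ == 0))
    print(f"l1_ratio={l1_ratio}: zero coefficients = {zero_counts[l1_ratio]}")

# With l1_ratio=1.0 the L2 part vanishes and the objective is the Lasso objective
lasso_coef = Lasso(alpha=alpha).fit(X_demo, y_demo).coef_
en_as_lasso_coef = ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X_demo, y_demo).coef_
same_as_lasso = np.allclose(lasso_coef, en_as_lasso_coef)
print("l1_ratio=1.0 equals Lasso:", same_as_lasso)
```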


# 4. Conclusion  <a class="anchor"  id="chapter4"></a>

A model with low Mean Squared Error (MSE), low Mean Absolute Error (MAE), and high R-squared (R2) is considered to be a good-performing model for regression analysis. Here's an explanation of each of these evaluation metrics and how they contribute to the assessment of the model's performance:

1. **Mean Squared Error (MSE):**
   MSE is the average squared difference between the predicted values and the actual target values. Lower MSE values indicate that the model's predictions are closer to the actual values. Because the errors are squared, a few large misses raise MSE sharply, so a low MSE also means the model rarely makes large errors.

2. **Mean Absolute Error (MAE):**
   MAE is a measure of the average absolute difference between the predicted values and the actual target values. Similar to MSE, lower MAE values indicate that the model's predictions are closer to the actual values. It measures the average magnitude of errors without considering their direction.

3. **R-squared (R2) or Coefficient of Determination:**
   R-squared measures the proportion of the variance in the dependent variable (target) that is explained by the independent variables (features) in the model. Its maximum is 1, with a higher value indicating a better fit of the model to the data; on held-out data it can fall below 0 when the model predicts worse than simply using the mean of the target. A high R2 indicates that a large portion of the variability in the target variable is explained by the model's predictions.

When you have a model with low MSE and MAE, it means that the model's predictions are accurate and close to the actual values. A high R2 indicates that the model is capturing a significant amount of the variability in the target variable, which suggests that the model's predictions align well with the underlying patterns in the data.

In summary, a model with low MSE, low MAE, and high R2 makes accurate predictions, closely follows the actual values, and effectively explains the variability in the target variable. When these values are measured on held-out test data, as in the `test_*` columns of the comparison table, such a model is generally a good choice for regression tasks and is more likely to provide reliable predictions on new, unseen data.
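
These criteria can be applied directly to the final comparison table. The sketch below re-enters the test metrics reported in Section 3.6 by hand (the `results` DataFrame is an illustrative copy, not the notebook's `comparison_df`) and picks the best model for each metric.

```python
import pandas as pd

# Test-set metrics copied from the final comparison table in Section 3.6
results = pd.DataFrame({
    "Model": ["LR", "DT", "Lasso", "Ridge", "Elastic Net"],
    "test_MSE": [0.775179, 0.783235, 0.77904, 0.773944, 0.774702],
    "test_MAE": [0.669313, 0.662914, 0.675434, 0.669588, 0.673105],
    "test_R2": [0.222811, 0.214735, 0.218941, 0.22405, 0.22329],
})

# Lower is better for MSE and MAE, higher is better for R2
best_by_mse = results.loc[results["test_MSE"].idxmin(), "Model"]
best_by_mae = results.loc[results["test_MAE"].idxmin(), "Model"]
best_by_r2 = results.loc[results["test_R2"].idxmax(), "Model"]

print(results.sort_values("test_MSE").to_string(index=False))
print("Lowest test MSE:", best_by_mse)
print("Lowest test MAE:", best_by_mae)
print("Highest test R2:", best_by_r2)
```

On the reported numbers, Ridge has the lowest test MSE and the highest test R2, while the Decision Tree has the lowest test MAE. The gaps between models are small, and every model has a test R2 of roughly 0.21 to 0.22.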