# Preprocessing and Training Data
## 1. Introduction
### Data Storytelling

The data science field has expanded significantly in recent years, and wages and working conditions have changed with it. This project examines how data science salaries correlate with socio-economic indicators across countries. Understanding the relationship between professionals' income, a country's economic conditions, and its population's quality of life can help professionals and organizations make informed career and salary decisions.

### Dataset Description:

The dataset combines data science compensation records from 2020 to 2024 with country-level indicators for the countries represented. The indicators cover demographic statistics, economic indicators, environmental factors, healthcare metrics, and education statistics, among others.



## 2. Import Libraries

In [74]:
# Import relevant libraries and packages.
import numpy as np 
import pandas as pd 
from scipy.stats import norm

In [75]:
from sklearn.model_selection import train_test_split

In [76]:
from sklearn.preprocessing import StandardScaler

## 3. Data Collection

Load the data into the working environment.

In [77]:
# Load the interim dataset saved after EDA
salaries = pd.read_csv('/Users/juliabolgova/Documents/CapstoneProject/data/interim/salariesEDA.csv')
# Discard the leftover 'Unnamed: 0' column and keep a fresh 0..n-1 index
salaries.set_index('Unnamed: 0', inplace=True)
salaries.reset_index(drop=True, inplace=True)
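
As an alternative to the `set_index` / `reset_index` pair, the saved index column can be consumed at read time. The snippet below is a minimal sketch on a hypothetical two-row CSV held in memory (the sample rows and the `csv_text` name are illustrative, not taken from the project file); it assumes the unnamed first column is only a leftover row index.

```python
import io

import pandas as pd

# Hypothetical sample mimicking a CSV that was saved with its index:
# the first header cell is empty, which pandas would name 'Unnamed: 0'.
csv_text = (
    ",job_title,salary_in_usd\n"
    "0,Data Engineer,210000\n"
    "1,Data Scientist,140000\n"
)

# index_col=0 uses the first column as the index instead of a data column
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Replace the stored index with a clean 0..n-1 RangeIndex
sample = sample.reset_index(drop=True)

print(sample.columns.tolist())
```

Saving with `to_csv(..., index=False)` in the earlier notebook would avoid the extra column entirely.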

In [78]:
salaries.head(50)

Unnamed: 0,job_title,employment_type,experience_level,expertise_level,salary_currency,company_location,salary_in_usd,employee_residence,company_size,year,Country,density_p_km2,cpi,cpi_change_pct,fertility_rate,gdp,gross_primary_education_enrollment_pct,gross_tertiary_education_enrollment_pct,life_expectancy,minimum_wage,official_language,population,population_labor_force_participation_pct,tax_revenue_pct,total_tax_rate,unemployment_rate,urban_population
0,Data Engineer,Full-Time,Senior,Expert,United States Dollar,United States,210000,United States,Medium,2023,United States,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,Spanish,328239500.0,62.0,9.6,36.6,14.7,270663028.0
1,Data Engineer,Full-Time,Senior,Expert,United States Dollar,United States,165000,United States,Medium,2023,United States,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,Spanish,328239500.0,62.0,9.6,36.6,14.7,270663028.0
2,Data Engineer,Full-Time,Senior,Expert,United States Dollar,United States,185900,United States,Medium,2023,United States,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,Spanish,328239500.0,62.0,9.6,36.6,14.7,270663028.0
3,Data Engineer,Full-Time,Senior,Expert,United States Dollar,United States,129300,United States,Medium,2023,United States,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,Spanish,328239500.0,62.0,9.6,36.6,14.7,270663028.0
4,Data Scientist,Full-Time,Senior,Expert,United States Dollar,United States,140000,United States,Medium,2023,United States,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,Spanish,328239500.0,62.0,9.6,36.6,14.7,270663028.0
5,Data Scientist,Full-Time,Senior,Expert,United States Dollar,United States,126000,United States,Medium,2023,United States,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,Spanish,328239500.0,62.0,9.6,36.6,14.7,270663028.0
6,Data Scientist,Full-Time,Senior,Expert,United States Dollar,United States,170000,United States,Medium,2023,United States,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,Spanish,328239500.0,62.0,9.6,36.6,14.7,270663028.0
7,Data Scientist,Full-Time,Senior,Expert,United States Dollar,United States,130000,United States,Medium,2023,United States,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,Spanish,328239500.0,62.0,9.6,36.6,14.7,270663028.0
8,Data Engineer,Full-Time,Mid,Intermediate,British Pound Sterling,United Kingdom,104584,United Kingdom,Medium,2023,United Kingdom,281.0,119.62,1.7,1.68,2827113000000.0,101.2,60.0,81.3,10.13,English,66834400.0,62.8,25.5,30.6,3.85,55908316.0
9,Data Engineer,Full-Time,Mid,Intermediate,British Pound Sterling,United Kingdom,92280,United Kingdom,Medium,2023,United Kingdom,281.0,119.62,1.7,1.68,2827113000000.0,101.2,60.0,81.3,10.13,English,66834400.0,62.8,25.5,30.6,3.85,55908316.0


In [87]:
salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5613 entries, 0 to 5612
Data columns (total 27 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   job_title                                 5613 non-null   object 
 1   employment_type                           5613 non-null   object 
 2   experience_level                          5613 non-null   object 
 3   expertise_level                           5613 non-null   object 
 4   salary_currency                           5613 non-null   object 
 5   company_location                          5613 non-null   object 
 6   salary_in_usd                             5613 non-null   int64  
 7   employee_residence                        5613 non-null   object 
 8   company_size                              5613 non-null   object 
 9   year                                      5613 non-null   int64  
 10  Country                             

In [88]:
salaries.describe()

Unnamed: 0,salary_in_usd,year,density_p_km2,cpi,cpi_change_pct,fertility_rate,gdp,gross_primary_education_enrollment_pct,gross_tertiary_education_enrollment_pct,life_expectancy,minimum_wage,population,population_labor_force_participation_pct,tax_revenue_pct,total_tax_rate,unemployment_rate,urban_population
count,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0
mean,140912.158917,2022.744878,71.994655,118.812496,6.486211,1.722685,17539810000000.0,101.940264,83.462908,78.816105,7.362546,286457600.0,61.869214,11.489346,36.804026,12.90023,229586500.0
std,63158.347118,0.641945,262.150194,12.775854,2.769618,0.208599,7775901000000.0,1.942687,12.529344,2.080722,1.669579,152378300.0,2.19772,4.759618,5.860639,3.896863,92924550.0
min,15000.0,2020.0,3.0,99.55,-1.9,1.14,2220307000.0,84.7,3.0,52.8,0.25,502653.0,41.2,0.1,11.3,0.09,475902.0
25%,93000.0,2023.0,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,328239500.0,62.0,9.6,36.6,14.7,270663000.0
50%,135842.0,2023.0,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,328239500.0,62.0,9.6,36.6,14.7,270663000.0
75%,183414.0,2023.0,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,328239500.0,62.0,9.6,36.6,14.7,270663000.0
max,319000.0,2024.0,8358.0,288.57,53.5,5.39,21427700000000.0,126.6,136.6,84.2,13.59,1397715000.0,86.8,37.2,106.3,28.18,842934000.0


In [89]:
print(salaries.isna().sum())

job_title                                   0
employment_type                             0
experience_level                            0
expertise_level                             0
salary_currency                             0
company_location                            0
salary_in_usd                               0
employee_residence                          0
company_size                                0
year                                        0
Country                                     0
density_p_km2                               0
cpi                                         0
cpi_change_pct                              0
fertility_rate                              0
gdp                                         0
gross_primary_education_enrollment_pct      0
gross_tertiary_education_enrollment_pct     0
life_expectancy                             0
minimum_wage                                0
official_language                           0
population                        

In [90]:
# isnull() is an alias of isna(), so this repeats the check above
print(salaries.isnull().sum())

job_title                                   0
employment_type                             0
experience_level                            0
expertise_level                             0
salary_currency                             0
company_location                            0
salary_in_usd                               0
employee_residence                          0
company_size                                0
year                                        0
Country                                     0
density_p_km2                               0
cpi                                         0
cpi_change_pct                              0
fertility_rate                              0
gdp                                         0
gross_primary_education_enrollment_pct      0
gross_tertiary_education_enrollment_pct     0
life_expectancy                             0
minimum_wage                                0
official_language                           0
population                        

In [91]:
# Print the list of DataFrame columns
print("DataFrame columns:")
print(salaries.columns)

DataFrame columns:
Index(['job_title', 'employment_type', 'experience_level', 'expertise_level',
       'salary_currency', 'company_location', 'salary_in_usd',
       'employee_residence', 'company_size', 'year', 'Country',
       'density_p_km2', 'cpi', 'cpi_change_pct', 'fertility_rate', 'gdp',
       'gross_primary_education_enrollment_pct',
       'gross_tertiary_education_enrollment_pct', 'life_expectancy',
       'minimum_wage', 'official_language', 'population',
       'population_labor_force_participation_pct', 'tax_revenue_pct',
       'total_tax_rate', 'unemployment_rate', 'urban_population'],
      dtype='object')


## 3. Create dummy or indicator features for categorical variables

In [93]:
# Define the categorical features
categorical_features = ['job_title', 'employment_type', 'experience_level', 'expertise_level',
       'salary_currency', 'company_location', 'company_size',
       'employee_residence', 'year', 'Country',
       'official_language']

# Convert the categorical features with one-hot encoding
salaries_encoded = pd.get_dummies(salaries, columns=categorical_features)

# Show the first rows of the encoded DataFrame
salaries_encoded.head()

# Save the encoded data to a new CSV file (optional)
salaries_encoded.to_csv('/Users/juliabolgova/Documents/CapstoneProject/data/interim/salaries_encoded.csv', index=False)
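
The hand-written `categorical_features` list can be cross-checked against the column dtypes. The following is a rough sketch on a hypothetical three-column frame (`toy` and its values are illustrative only). Note that `year` is stored as an integer, so a dtype-based selection would not pick it up; treating it as categorical is a deliberate modelling choice that still has to be added by hand.

```python
import pandas as pd

# Hypothetical miniature of the salaries frame
toy = pd.DataFrame({
    "job_title": ["Data Engineer", "Data Scientist"],
    "year": [2023, 2023],
    "salary_in_usd": [210000, 140000],
})

# Columns that are not numeric are candidates for one-hot encoding
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# 'year' is numeric by dtype, so it is appended explicitly
candidates = non_numeric + ["year"]
print(candidates)
```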

In [94]:
# Display all columns when printing a DataFrame
pd.set_option('display.max_columns', None)

In [98]:
# Convert the boolean dummy columns to 0/1 integers; other columns are left unchanged
salaries_encoded = salaries_encoded.apply(lambda col: col.map({False: 0, True: 1}) if col.dtype == bool else col)
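
The boolean-to-integer conversion can also be requested directly from `pd.get_dummies` through its `dtype` parameter. This is a small sketch on hypothetical data (`toy` is not the project frame) and is one possible simplification rather than the approach used in this notebook.

```python
import pandas as pd

# Hypothetical sample with one categorical and one numeric column
toy = pd.DataFrame({
    "company_size": ["Medium", "Small", "Medium"],
    "salary_in_usd": [210000, 92280, 140000],
})

# dtype=int makes the indicator columns 0/1 integers instead of booleans
toy_encoded = pd.get_dummies(toy, columns=["company_size"], dtype=int)

print(toy_encoded)
```

With several hundred indicator columns, as in this dataset, it may also be worth considering `drop_first=True` or grouping rare categories, depending on the model used later.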

In [99]:
salaries_encoded

Unnamed: 0,salary_in_usd,density_p_km2,cpi,cpi_change_pct,fertility_rate,gdp,gross_primary_education_enrollment_pct,gross_tertiary_education_enrollment_pct,life_expectancy,minimum_wage,population,population_labor_force_participation_pct,tax_revenue_pct,total_tax_rate,unemployment_rate,urban_population,job_title_AI Architect,job_title_AI Developer,job_title_AI Engineer,job_title_AI Product Manager,job_title_AI Programmer,job_title_AI Research Engineer,job_title_AI Scientist,job_title_AWS Data Architect,job_title_Analytics Engineer,job_title_Applied Data Scientist,job_title_Applied Machine Learning Engineer,job_title_Applied Machine Learning Scientist,job_title_Applied Scientist,job_title_Autonomous Vehicle Technician,job_title_Azure Data Engineer,job_title_BI Analyst,job_title_BI Data Analyst,job_title_BI Data Engineer,job_title_BI Developer,job_title_Bear Robotics,job_title_Big Data Architect,job_title_Big Data Engineer,job_title_Business Data Analyst,job_title_Business Intelligence Analyst,job_title_Business Intelligence Data Analyst,job_title_Business Intelligence Developer,job_title_Business Intelligence Engineer,job_title_Business Intelligence Manager,job_title_Business Intelligence Specialist,job_title_Cloud Data Architect,job_title_Cloud Data Engineer,job_title_Cloud Database Engineer,job_title_Compliance Data Analyst,job_title_Computer Vision Engineer,job_title_Computer Vision Software Engineer,job_title_Consultant Data Engineer,job_title_Data Analyst,job_title_Data Analyst Lead,job_title_Data Analytics Consultant,job_title_Data Analytics Engineer,job_title_Data Analytics Lead,job_title_Data Analytics Manager,job_title_Data Analytics Specialist,job_title_Data Architect,job_title_Data DevOps Engineer,job_title_Data Developer,job_title_Data Engineer,job_title_Data Engineer 2,job_title_Data Infrastructure Engineer,job_title_Data Integration Engineer,job_title_Data Integration Specialist,job_title_Data Lead,job_title_Data Management Analyst,job_title_Data 
Management Specialist,job_title_Data Manager,job_title_Data Modeler,job_title_Data Modeller,job_title_Data Operations Analyst,job_title_Data Operations Engineer,job_title_Data Operations Manager,job_title_Data Operations Specialist,job_title_Data Product Manager,job_title_Data Product Owner,job_title_Data Quality Analyst,job_title_Data Quality Engineer,job_title_Data Quality Manager,job_title_Data Science,job_title_Data Science Consultant,job_title_Data Science Director,job_title_Data Science Engineer,job_title_Data Science Lead,job_title_Data Science Manager,job_title_Data Science Practitioner,job_title_Data Scientist,job_title_Data Scientist Lead,job_title_Data Specialist,job_title_Data Strategist,job_title_Data Strategy Manager,job_title_Data Visualization Analyst,job_title_Data Visualization Engineer,job_title_Data Visualization Specialist,job_title_Decision Scientist,job_title_Deep Learning Engineer,job_title_Deep Learning Researcher,job_title_Director of Data Science,job_title_ETL Developer,job_title_ETL Engineer,job_title_Finance Data Analyst,job_title_Financial Data Analyst,job_title_Head of Data,job_title_Head of Data Science,job_title_Head of Machine Learning,job_title_Insight Analyst,job_title_Lead Data Analyst,job_title_Lead Data Engineer,job_title_Lead Data Scientist,job_title_Lead Machine Learning Engineer,job_title_ML Engineer,job_title_MLOps Engineer,job_title_Machine Learning Developer,job_title_Machine Learning Engineer,job_title_Machine Learning Infrastructure Engineer,job_title_Machine Learning Manager,job_title_Machine Learning Modeler,job_title_Machine Learning Operations Engineer,job_title_Machine Learning Research Engineer,job_title_Machine Learning Researcher,job_title_Machine Learning Scientist,job_title_Machine Learning Software Engineer,job_title_Machine Learning Specialist,job_title_Manager Data Management,job_title_Managing Director Data Science,job_title_Marketing Data Analyst,job_title_Marketing Data Engineer,job_title_Marketing Data 
Scientist,job_title_NLP Engineer,job_title_Power BI Developer,job_title_Principal Data Analyst,job_title_Principal Data Architect,job_title_Principal Data Engineer,job_title_Principal Data Scientist,job_title_Principal Machine Learning Engineer,job_title_Product Data Analyst,job_title_Prompt Engineer,job_title_Research Analyst,job_title_Research Engineer,job_title_Research Scientist,job_title_Sales Data Analyst,job_title_Software Data Engineer,job_title_Staff Data Analyst,job_title_Staff Data Scientist,job_title_Staff Machine Learning Engineer,employment_type_Contract,employment_type_Freelance,employment_type_Full-Time,employment_type_Part-Time,experience_level_Entry,experience_level_Executive,experience_level_Mid,experience_level_Senior,expertise_level_Director,expertise_level_Expert,expertise_level_Intermediate,expertise_level_Junior,salary_currency_Australian Dollar,salary_currency_Brazilian Real,salary_currency_British Pound Sterling,salary_currency_Canadian Dollar,salary_currency_Chilean Peso,salary_currency_Danish Krone,salary_currency_Euro,salary_currency_Hungarian Forint,salary_currency_Indian Rupee,salary_currency_Japanese Yen,salary_currency_Mexican Peso,salary_currency_Norwegian Krone,salary_currency_Philippine Peso,salary_currency_Polish Zloty,salary_currency_Singapore Dollar,salary_currency_South African Rand,salary_currency_Swiss Franc,salary_currency_Thai Baht,salary_currency_Turkish Lira,salary_currency_United States Dollar,company_location_Algeria,company_location_Argentina,company_location_Armenia,company_location_Australia,company_location_Austria,company_location_Belgium,company_location_Brazil,company_location_Canada,company_location_Central African 
Republic,company_location_Chile,company_location_China,company_location_Colombia,company_location_Croatia,company_location_Denmark,company_location_Egypt,company_location_Estonia,company_location_Finland,company_location_France,company_location_Germany,company_location_Ghana,company_location_Greece,company_location_Honduras,company_location_Hungary,company_location_India,company_location_Indonesia,company_location_Iraq,company_location_Israel,company_location_Italy,company_location_Japan,company_location_Kenya,company_location_Latvia,company_location_Lithuania,company_location_Luxembourg,company_location_Malaysia,company_location_Malta,company_location_Mauritius,company_location_Mexico,company_location_Netherlands,company_location_New Zealand,company_location_Nigeria,company_location_Norway,company_location_Pakistan,company_location_Philippines,company_location_Poland,company_location_Portugal,company_location_Qatar,company_location_Romania,company_location_Saudi Arabia,company_location_Singapore,company_location_Slovenia,company_location_South Africa,company_location_Spain,company_location_Sweden,company_location_Switzerland,company_location_Thailand,company_location_Turkey,company_location_Ukraine,company_location_United Arab Emirates,company_location_United Kingdom,company_location_United States,company_size_Large,company_size_Medium,company_size_Small,employee_residence_Algeria,employee_residence_Argentina,employee_residence_Armenia,employee_residence_Australia,employee_residence_Austria,employee_residence_Belgium,"employee_residence_Bolivia, Plurinational State of",employee_residence_Brazil,employee_residence_Bulgaria,employee_residence_Canada,employee_residence_Central African Republic,employee_residence_Chile,employee_residence_China,employee_residence_Colombia,employee_residence_Costa Rica,employee_residence_Croatia,employee_residence_Cyprus,employee_residence_Czechia,employee_residence_Denmark,employee_residence_Dominican 
Republic,employee_residence_Egypt,employee_residence_Estonia,employee_residence_Finland,employee_residence_France,employee_residence_Germany,employee_residence_Ghana,employee_residence_Greece,employee_residence_Honduras,employee_residence_Hong Kong,employee_residence_Hungary,employee_residence_India,employee_residence_Indonesia,employee_residence_Iraq,employee_residence_Italy,employee_residence_Japan,employee_residence_Jersey,employee_residence_Kenya,employee_residence_Kuwait,employee_residence_Latvia,employee_residence_Lithuania,employee_residence_Luxembourg,employee_residence_Malaysia,employee_residence_Malta,employee_residence_Mauritius,employee_residence_Mexico,"employee_residence_Moldova, Republic of",employee_residence_Netherlands,employee_residence_New Zealand,employee_residence_Nigeria,employee_residence_Norway,employee_residence_Pakistan,employee_residence_Peru,employee_residence_Philippines,employee_residence_Poland,employee_residence_Portugal,employee_residence_Puerto Rico,employee_residence_Qatar,employee_residence_Romania,employee_residence_Russian Federation,employee_residence_Saudi Arabia,employee_residence_Serbia,employee_residence_Singapore,employee_residence_Slovenia,employee_residence_South Africa,employee_residence_Spain,employee_residence_Sweden,employee_residence_Switzerland,employee_residence_Thailand,employee_residence_Tunisia,employee_residence_Turkey,employee_residence_Uganda,employee_residence_Ukraine,employee_residence_United Arab Emirates,employee_residence_United Kingdom,employee_residence_United States,employee_residence_Uzbekistan,employee_residence_Viet Nam,year_2020,year_2021,year_2022,year_2023,year_2024,Country_Algeria,Country_Argentina,Country_Armenia,Country_Australia,Country_Austria,Country_Belgium,Country_Brazil,Country_Canada,Country_Central African 
Republic,Country_Chile,Country_China,Country_Colombia,Country_Croatia,Country_Denmark,Country_Egypt,Country_Estonia,Country_Finland,Country_France,Country_Germany,Country_Ghana,Country_Greece,Country_Honduras,Country_Hungary,Country_India,Country_Indonesia,Country_Iraq,Country_Israel,Country_Italy,Country_Japan,Country_Kenya,Country_Latvia,Country_Lithuania,Country_Luxembourg,Country_Malaysia,Country_Malta,Country_Mauritius,Country_Mexico,Country_Netherlands,Country_New Zealand,Country_Nigeria,Country_Norway,Country_Pakistan,Country_Philippines,Country_Poland,Country_Portugal,Country_Qatar,Country_Romania,Country_Saudi Arabia,Country_Singapore,Country_Slovenia,Country_South Africa,Country_Spain,Country_Sweden,Country_Switzerland,Country_Thailand,Country_Turkey,Country_Ukraine,Country_United Arab Emirates,Country_United Kingdom,Country_United States,official_language_Afrikaans,official_language_Arabic,official_language_Armenian,official_language_Croatian,official_language_Danish,official_language_Dutch,official_language_Eglish,official_language_English,official_language_Estonian,official_language_French,official_language_German,official_language_Greek,official_language_Hebrew,official_language_Hindi,official_language_Hungarian,official_language_Indonesian,official_language_Italian,official_language_Japanese,official_language_Latvian,official_language_Lithuanian,official_language_Luxembourgish,official_language_Malay,official_language_Malaysian language,official_language_Maltese,official_language_Modern Standard Arabic,official_language_Norwegian,official_language_Polish,official_language_Portuguese,official_language_Romanian,official_language_Slovene language,official_language_Spanish,official_language_Standard Chinese,official_language_Swahili,official_language_Swedish,official_language_Thai,official_language_Turkish,official_language_Ukrainian,official_language_Urdu
0,210000,36.0,117.24,7.5,1.73,2.142770e+13,101.8,88.2,78.5,7.25,328239523.0,62.0,9.6,36.6,14.70,270663028.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,165000,36.0,117.24,7.5,1.73,2.142770e+13,101.8,88.2,78.5,7.25,328239523.0,62.0,9.6,36.6,14.70,270663028.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,185900,36.0,117.24,7.5,1.73,2.142770e+13,101.8,88.2,78.5,7.25,328239523.0,62.0,9.6,36.6,14.70,270663028.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,129300,36.0,117.24,7.5,1.73,2.142770e+13,101.8,88.2,78.5,7.25,328239523.0,62.0,9.6,36.6,14.70,270663028.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
4,140000,36.0,117.24,7.5,1.73,2.142770e+13,101.8,88.2,78.5,7.25,328239523.0,62.0,9.6,36.6,14.70,270663028.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5608,73824,281.0,119.62,1.7,1.68,2.827113e+12,101.2,60.0,81.3,10.13,66834405.0,62.8,25.5,30.6,3.85,55908316.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5609,49216,281.0,119.62,1.7,1.68,2.827113e+12,101.2,60.0,81.3,10.13,66834405.0,62.8,25.5,30.6,3.85,55908316.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5610,190000,4.0,116.76,1.9,1.50,1.736426e+12,100.9,68.9,81.9,9.51,36991981.0,65.1,12.8,24.5,5.56,30628482.0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5611,150000,4.0,116.76,1.9,1.50,1.736426e+12,100.9,68.9,81.9,9.51,36991981.0,65.1,12.8,24.5,5.56,30628482.0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## 4. Standardize the magnitude of numeric features using a scaler

In [102]:
from sklearn.preprocessing import StandardScaler

# Numeric features to scale
numerical_features = ['salary_in_usd', 'density_p_km2', 'cpi', 'cpi_change_pct', 'fertility_rate', 'gdp',
                      'gross_primary_education_enrollment_pct', 'gross_tertiary_education_enrollment_pct',
                      'life_expectancy', 'minimum_wage', 'population', 'population_labor_force_participation_pct',
                      'tax_revenue_pct', 'total_tax_rate', 'unemployment_rate', 'urban_population']

# Initialize the StandardScaler
scaler = StandardScaler()

# Scale the numeric features in place.
# Note: the list above includes the target (salary_in_usd), and the scaler is fitted on the full dataset.
salaries_encoded[numerical_features] = scaler.fit_transform(salaries_encoded[numerical_features])

# Optionally save the scaled data to a new CSV file
salaries_encoded.to_csv('/Users/juliabolgova/Documents/CapstoneProject/data/interim/salaries_scaled.csv', index=False)

In [103]:
# Show the first rows of the scaled DataFrame
salaries_encoded.head()

Unnamed: 0,salary_in_usd,density_p_km2,cpi,cpi_change_pct,fertility_rate,gdp,gross_primary_education_enrollment_pct,gross_tertiary_education_enrollment_pct,life_expectancy,minimum_wage,population,population_labor_force_participation_pct,tax_revenue_pct,total_tax_rate,unemployment_rate,urban_population,job_title_AI Architect,job_title_AI Developer,job_title_AI Engineer,job_title_AI Product Manager,job_title_AI Programmer,job_title_AI Research Engineer,job_title_AI Scientist,job_title_AWS Data Architect,job_title_Analytics Engineer,job_title_Applied Data Scientist,job_title_Applied Machine Learning Engineer,job_title_Applied Machine Learning Scientist,job_title_Applied Scientist,job_title_Autonomous Vehicle Technician,job_title_Azure Data Engineer,job_title_BI Analyst,job_title_BI Data Analyst,job_title_BI Data Engineer,job_title_BI Developer,job_title_Bear Robotics,job_title_Big Data Architect,job_title_Big Data Engineer,job_title_Business Data Analyst,job_title_Business Intelligence Analyst,job_title_Business Intelligence Data Analyst,job_title_Business Intelligence Developer,job_title_Business Intelligence Engineer,job_title_Business Intelligence Manager,job_title_Business Intelligence Specialist,job_title_Cloud Data Architect,job_title_Cloud Data Engineer,job_title_Cloud Database Engineer,job_title_Compliance Data Analyst,job_title_Computer Vision Engineer,job_title_Computer Vision Software Engineer,job_title_Consultant Data Engineer,job_title_Data Analyst,job_title_Data Analyst Lead,job_title_Data Analytics Consultant,job_title_Data Analytics Engineer,job_title_Data Analytics Lead,job_title_Data Analytics Manager,job_title_Data Analytics Specialist,job_title_Data Architect,job_title_Data DevOps Engineer,job_title_Data Developer,job_title_Data Engineer,job_title_Data Engineer 2,job_title_Data Infrastructure Engineer,job_title_Data Integration Engineer,job_title_Data Integration Specialist,job_title_Data Lead,job_title_Data Management Analyst,job_title_Data 
Management Specialist,job_title_Data Manager,job_title_Data Modeler,job_title_Data Modeller,job_title_Data Operations Analyst,job_title_Data Operations Engineer,job_title_Data Operations Manager,job_title_Data Operations Specialist,job_title_Data Product Manager,job_title_Data Product Owner,job_title_Data Quality Analyst,job_title_Data Quality Engineer,job_title_Data Quality Manager,job_title_Data Science,job_title_Data Science Consultant,job_title_Data Science Director,job_title_Data Science Engineer,job_title_Data Science Lead,job_title_Data Science Manager,job_title_Data Science Practitioner,job_title_Data Scientist,job_title_Data Scientist Lead,job_title_Data Specialist,job_title_Data Strategist,job_title_Data Strategy Manager,job_title_Data Visualization Analyst,job_title_Data Visualization Engineer,job_title_Data Visualization Specialist,job_title_Decision Scientist,job_title_Deep Learning Engineer,job_title_Deep Learning Researcher,job_title_Director of Data Science,job_title_ETL Developer,job_title_ETL Engineer,job_title_Finance Data Analyst,job_title_Financial Data Analyst,job_title_Head of Data,job_title_Head of Data Science,job_title_Head of Machine Learning,job_title_Insight Analyst,job_title_Lead Data Analyst,job_title_Lead Data Engineer,job_title_Lead Data Scientist,job_title_Lead Machine Learning Engineer,job_title_ML Engineer,job_title_MLOps Engineer,job_title_Machine Learning Developer,job_title_Machine Learning Engineer,job_title_Machine Learning Infrastructure Engineer,job_title_Machine Learning Manager,job_title_Machine Learning Modeler,job_title_Machine Learning Operations Engineer,job_title_Machine Learning Research Engineer,job_title_Machine Learning Researcher,job_title_Machine Learning Scientist,job_title_Machine Learning Software Engineer,job_title_Machine Learning Specialist,job_title_Manager Data Management,job_title_Managing Director Data Science,job_title_Marketing Data Analyst,job_title_Marketing Data Engineer,job_title_Marketing Data 
Scientist,job_title_NLP Engineer,job_title_Power BI Developer,job_title_Principal Data Analyst,job_title_Principal Data Architect,job_title_Principal Data Engineer,job_title_Principal Data Scientist,job_title_Principal Machine Learning Engineer,job_title_Product Data Analyst,job_title_Prompt Engineer,job_title_Research Analyst,job_title_Research Engineer,job_title_Research Scientist,job_title_Sales Data Analyst,job_title_Software Data Engineer,job_title_Staff Data Analyst,job_title_Staff Data Scientist,job_title_Staff Machine Learning Engineer,employment_type_Contract,employment_type_Freelance,employment_type_Full-Time,employment_type_Part-Time,experience_level_Entry,experience_level_Executive,experience_level_Mid,experience_level_Senior,expertise_level_Director,expertise_level_Expert,expertise_level_Intermediate,expertise_level_Junior,salary_currency_Australian Dollar,salary_currency_Brazilian Real,salary_currency_British Pound Sterling,salary_currency_Canadian Dollar,salary_currency_Chilean Peso,salary_currency_Danish Krone,salary_currency_Euro,salary_currency_Hungarian Forint,salary_currency_Indian Rupee,salary_currency_Japanese Yen,salary_currency_Mexican Peso,salary_currency_Norwegian Krone,salary_currency_Philippine Peso,salary_currency_Polish Zloty,salary_currency_Singapore Dollar,salary_currency_South African Rand,salary_currency_Swiss Franc,salary_currency_Thai Baht,salary_currency_Turkish Lira,salary_currency_United States Dollar,company_location_Algeria,company_location_Argentina,company_location_Armenia,company_location_Australia,company_location_Austria,company_location_Belgium,company_location_Brazil,company_location_Canada,company_location_Central African 
Republic,company_location_Chile,company_location_China,company_location_Colombia,company_location_Croatia,company_location_Denmark,company_location_Egypt,company_location_Estonia,company_location_Finland,company_location_France,company_location_Germany,company_location_Ghana,company_location_Greece,company_location_Honduras,company_location_Hungary,company_location_India,company_location_Indonesia,company_location_Iraq,company_location_Israel,company_location_Italy,company_location_Japan,company_location_Kenya,company_location_Latvia,company_location_Lithuania,company_location_Luxembourg,company_location_Malaysia,company_location_Malta,company_location_Mauritius,company_location_Mexico,company_location_Netherlands,company_location_New Zealand,company_location_Nigeria,company_location_Norway,company_location_Pakistan,company_location_Philippines,company_location_Poland,company_location_Portugal,company_location_Qatar,company_location_Romania,company_location_Saudi Arabia,company_location_Singapore,company_location_Slovenia,company_location_South Africa,company_location_Spain,company_location_Sweden,company_location_Switzerland,company_location_Thailand,company_location_Turkey,company_location_Ukraine,company_location_United Arab Emirates,company_location_United Kingdom,company_location_United States,company_size_Large,company_size_Medium,company_size_Small,employee_residence_Algeria,employee_residence_Argentina,employee_residence_Armenia,employee_residence_Australia,employee_residence_Austria,employee_residence_Belgium,"employee_residence_Bolivia, Plurinational State of",employee_residence_Brazil,employee_residence_Bulgaria,employee_residence_Canada,employee_residence_Central African Republic,employee_residence_Chile,employee_residence_China,employee_residence_Colombia,employee_residence_Costa Rica,employee_residence_Croatia,employee_residence_Cyprus,employee_residence_Czechia,employee_residence_Denmark,employee_residence_Dominican 
Republic,employee_residence_Egypt,employee_residence_Estonia,employee_residence_Finland,employee_residence_France,employee_residence_Germany,employee_residence_Ghana,employee_residence_Greece,employee_residence_Honduras,employee_residence_Hong Kong,employee_residence_Hungary,employee_residence_India,employee_residence_Indonesia,employee_residence_Iraq,employee_residence_Italy,employee_residence_Japan,employee_residence_Jersey,employee_residence_Kenya,employee_residence_Kuwait,employee_residence_Latvia,employee_residence_Lithuania,employee_residence_Luxembourg,employee_residence_Malaysia,employee_residence_Malta,employee_residence_Mauritius,employee_residence_Mexico,"employee_residence_Moldova, Republic of",employee_residence_Netherlands,employee_residence_New Zealand,employee_residence_Nigeria,employee_residence_Norway,employee_residence_Pakistan,employee_residence_Peru,employee_residence_Philippines,employee_residence_Poland,employee_residence_Portugal,employee_residence_Puerto Rico,employee_residence_Qatar,employee_residence_Romania,employee_residence_Russian Federation,employee_residence_Saudi Arabia,employee_residence_Serbia,employee_residence_Singapore,employee_residence_Slovenia,employee_residence_South Africa,employee_residence_Spain,employee_residence_Sweden,employee_residence_Switzerland,employee_residence_Thailand,employee_residence_Tunisia,employee_residence_Turkey,employee_residence_Uganda,employee_residence_Ukraine,employee_residence_United Arab Emirates,employee_residence_United Kingdom,employee_residence_United States,employee_residence_Uzbekistan,employee_residence_Viet Nam,year_2020,year_2021,year_2022,year_2023,year_2024,Country_Algeria,Country_Argentina,Country_Armenia,Country_Australia,Country_Austria,Country_Belgium,Country_Brazil,Country_Canada,Country_Central African 
Republic,Country_Chile,Country_China,Country_Colombia,Country_Croatia,Country_Denmark,Country_Egypt,Country_Estonia,Country_Finland,Country_France,Country_Germany,Country_Ghana,Country_Greece,Country_Honduras,Country_Hungary,Country_India,Country_Indonesia,Country_Iraq,Country_Israel,Country_Italy,Country_Japan,Country_Kenya,Country_Latvia,Country_Lithuania,Country_Luxembourg,Country_Malaysia,Country_Malta,Country_Mauritius,Country_Mexico,Country_Netherlands,Country_New Zealand,Country_Nigeria,Country_Norway,Country_Pakistan,Country_Philippines,Country_Poland,Country_Portugal,Country_Qatar,Country_Romania,Country_Saudi Arabia,Country_Singapore,Country_Slovenia,Country_South Africa,Country_Spain,Country_Sweden,Country_Switzerland,Country_Thailand,Country_Turkey,Country_Ukraine,Country_United Arab Emirates,Country_United Kingdom,Country_United States,official_language_Afrikaans,official_language_Arabic,official_language_Armenian,official_language_Croatian,official_language_Danish,official_language_Dutch,official_language_Eglish,official_language_English,official_language_Estonian,official_language_French,official_language_German,official_language_Greek,official_language_Hebrew,official_language_Hindi,official_language_Hungarian,official_language_Indonesian,official_language_Italian,official_language_Japanese,official_language_Latvian,official_language_Lithuanian,official_language_Luxembourgish,official_language_Malay,official_language_Malaysian language,official_language_Maltese,official_language_Modern Standard Arabic,official_language_Norwegian,official_language_Polish,official_language_Portuguese,official_language_Romanian,official_language_Slovene language,official_language_Spanish,official_language_Standard Chinese,official_language_Swahili,official_language_Swedish,official_language_Thai,official_language_Turkish,official_language_Ukrainian,official_language_Urdu
0,1.09398,-0.137318,-0.123094,0.366072,0.035071,0.500037,-0.072207,0.378114,-0.151935,-0.067416,0.274223,0.059515,-0.396989,-0.034816,0.461892,0.442081,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,0.381422,-0.137318,-0.123094,0.366072,0.035071,0.500037,-0.072207,0.378114,-0.151935,-0.067416,0.274223,0.059515,-0.396989,-0.034816,0.461892,0.442081,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,0.712366,-0.137318,-0.123094,0.366072,0.035071,0.500037,-0.072207,0.378114,-0.151935,-0.067416,0.274223,0.059515,-0.396989,-0.034816,0.461892,0.442081,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,-0.183874,-0.137318,-0.123094,0.366072,0.035071,0.500037,-0.072207,0.378114,-0.151935,-0.067416,0.274223,0.059515,-0.396989,-0.034816,0.461892,0.442081,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
4,-0.014444,-0.137318,-0.123094,0.366072,0.035071,0.500037,-0.072207,0.378114,-0.151935,-0.067416,0.274223,0.059515,-0.396989,-0.034816,0.461892,0.442081,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
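
For reference, `StandardScaler` replaces each value x in a column with z = (x - mean) / std, where the mean and the population standard deviation (ddof = 0) are computed per column. The block below is a minimal NumPy sketch of that calculation on a toy sample of three `salary_in_usd` values. The variable names are illustrative and do not come from the notebook.

```python
import numpy as np

# Toy sample: three salary_in_usd values (illustrative only)
salaries_usd = np.array([49216.0, 190000.0, 150000.0])

# Column statistics; ddof=0 gives the population standard deviation
mean = salaries_usd.mean()
std = salaries_usd.std(ddof=0)

# Standardize: z = (x - mean) / std
z_scores = (salaries_usd - mean) / std

# Invert the transform to get back to dollars: x = z * std + mean
restored = z_scores * std + mean

print("z-scores:", np.round(z_scores, 3))
print("restored:", restored)
```

Because `salary_in_usd` is in the scaled column list, a model trained on this frame predicts salaries in standardized units. The inversion shown in the sketch is what `scaler.inverse_transform` applies to all fitted columns at once.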


### Checking for categorical data

In [104]:
# Checking for columns of type object
object_columns = salaries_encoded.select_dtypes(include=['object']).columns
if len(object_columns) > 0:
    print("Columns with object data type:")
    print(object_columns)
else:
    print("No columns with object data type found.")

No columns with object data type found.


### Checking for NaN values

In [105]:
missing_values = salaries_encoded.isnull().sum()
print("Missing values in each column:")
print(missing_values[missing_values > 0])


Missing values in each column:
Series([], dtype: int64)
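
The two checks above can also be combined into one reusable guard that fails loudly instead of printing. The sketch below assumes a hypothetical helper named `assert_model_ready`, which is not part of the notebook or of pandas, and runs it on two tiny made-up frames.

```python
import numpy as np
import pandas as pd

def assert_model_ready(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if df has non-numeric columns or NaNs."""
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        raise ValueError(f"Non-numeric columns: {non_numeric}")
    n_missing = int(df.isnull().sum().sum())
    if n_missing:
        raise ValueError(f"{n_missing} missing values found")

# Made-up frames for illustration
good = pd.DataFrame({"cpi": [119.62, 116.76], "year_2024": [0, 1]})
bad = pd.DataFrame({"cpi": [119.62, np.nan], "job_title": ["Data Engineer", "Data Analyst"]})

assert_model_ready(good)  # passes silently
try:
    assert_model_ready(bad)
except ValueError as err:
    print("Rejected:", err)
```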


## 5. Split into testing and training datasets

In [107]:
# Separate the data into features (X) and the target variable (y)
# 'salary_in_usd' is the target variable
X = salaries_encoded.drop('salary_in_usd', axis=1)
y = salaries_encoded['salary_in_usd']

In [108]:
# Split into training and test sets (80% / 20%, fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [109]:
# Check the shapes of the resulting training and test sets
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

Training set size: (4490, 422)
Test set size: (1123, 422)
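
The notebook fits the scaler on the full dataset in section 4, before this split, so the test rows contribute to the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data. The frame `demo`, its values, and all variable names are assumptions made for illustration, not the notebook's own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (hypothetical values)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "gdp": rng.normal(2e12, 5e11, size=100),
    "life_expectancy": rng.normal(80.0, 3.0, size=100),
    "salary_in_usd": rng.normal(150_000, 40_000, size=100),
})

X_demo = demo.drop("salary_in_usd", axis=1)
y_demo = demo["salary_in_usd"]

# Split first, so the test rows play no part in fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training rows only, then apply the same transform to both sets
demo_scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(
    demo_scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_scaled = pd.DataFrame(
    demo_scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training columns have mean 0 and unit variance exactly, while the test columns are only approximately standardized, which is expected.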


In [110]:
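# Save the preprocessed data (the full salaries_encoded frame; the train/test split itself is not saved)
salaries_encoded.to_csv('/Users/juliabolgova/Documents/CapstoneProject/data/interim/salaries_preprocess.csv', index=False)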