# Preprocessing and Training Data
## 1. Introduction
### Data Storytelling

The data science field has expanded significantly in recent years, and wages and working conditions have changed with it. This project examines how data science salaries correlate with socio-economic indicators across countries. The goal is to identify the relationship between professionals' income levels, countries' economic conditions, and the population's quality of life, so that professionals and organizations can make informed career and salary decisions.

### Dataset Description

The dataset combines data science compensation records from 2020 to 2024 with country-level information for countries worldwide. The country-level attributes cover demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and more.



## 2. Import Libraries

In [26]:
# Import relevant libraries and packages.
import numpy as np 
import pandas as pd 
from scipy.stats import norm

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
from sklearn.preprocessing import StandardScaler

## 3. Data Collection

Import the data saved at the end of the EDA step into the working environment.

In [29]:
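# Load the dataset saved at the end of the EDA step
salaries = pd.read_csv('/Users/juliabolgova/Documents/CapstoneProject/data/interim/salariesEDA.csv')
# Drop the leftover 'Unnamed: 0' column (typically the index written by an earlier to_csv call)
salaries.set_index('Unnamed: 0', inplace=True)
salaries.reset_index(drop=True, inplace=True)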
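
The two index calls above discard the `Unnamed: 0` column that appears when a CSV was written together with its index. As a minimal sketch, assuming a small in-memory CSV as a stand-in for the real file (the rows and the `csv_text` name are made up for illustration), reading with `index_col=0` should give the same columns and values in one step:

```python
import io
import pandas as pd

# Hypothetical stand-in for salariesEDA.csv: a CSV saved with its index,
# so the first column has an empty header and loads as 'Unnamed: 0'.
csv_text = ",job_title,salary_in_usd\n0,Data Engineer,210000\n1,Data Scientist,140000\n"

# Approach used in the notebook: promote the column to the index, then discard it.
via_set_index = pd.read_csv(io.StringIO(csv_text))
via_set_index.set_index('Unnamed: 0', inplace=True)
via_set_index.reset_index(drop=True, inplace=True)

# Alternative: read the first column as the index directly.
via_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(via_set_index.columns))
print(list(via_index_col.columns))
```

Either form works; `index_col=0` simply avoids creating the extra column in the first place.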

In [30]:
salaries.head()

Unnamed: 0,job_title,employment_type,experience_level,expertise_level,salary_currency,company_location,salary_in_usd,employee_residence,company_size,year,...,gross_tertiary_education_enrollment_pct,life_expectancy,minimum_wage,official_language,population,population_labor_force_participation_pct,tax_revenue_pct,total_tax_rate,unemployment_rate,urban_population
0,Data Engineer,Full-Time,Senior,Expert,United States Dollar,United States,210000,United States,Medium,2023,...,88.2,78.5,7.25,Spanish,328239523.0,62.0,9.6,36.6,14.7,270663028.0
1,Data Engineer,Full-Time,Senior,Expert,United States Dollar,United States,165000,United States,Medium,2023,...,88.2,78.5,7.25,Spanish,328239523.0,62.0,9.6,36.6,14.7,270663028.0
2,Data Engineer,Full-Time,Senior,Expert,United States Dollar,United States,185900,United States,Medium,2023,...,88.2,78.5,7.25,Spanish,328239523.0,62.0,9.6,36.6,14.7,270663028.0
3,Data Engineer,Full-Time,Senior,Expert,United States Dollar,United States,129300,United States,Medium,2023,...,88.2,78.5,7.25,Spanish,328239523.0,62.0,9.6,36.6,14.7,270663028.0
4,Data Scientist,Full-Time,Senior,Expert,United States Dollar,United States,140000,United States,Medium,2023,...,88.2,78.5,7.25,Spanish,328239523.0,62.0,9.6,36.6,14.7,270663028.0


In [31]:
salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5613 entries, 0 to 5612
Data columns (total 27 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   job_title                                 5613 non-null   object 
 1   employment_type                           5613 non-null   object 
 2   experience_level                          5613 non-null   object 
 3   expertise_level                           5613 non-null   object 
 4   salary_currency                           5613 non-null   object 
 5   company_location                          5613 non-null   object 
 6   salary_in_usd                             5613 non-null   int64  
 7   employee_residence                        5613 non-null   object 
 8   company_size                              5613 non-null   object 
 9   year                                      5613 non-null   int64  
 10  Country                             

In [32]:
salaries.describe()

Unnamed: 0,salary_in_usd,year,density_p_km2,cpi,cpi_change_pct,fertility_rate,gdp,gross_primary_education_enrollment_pct,gross_tertiary_education_enrollment_pct,life_expectancy,minimum_wage,population,population_labor_force_participation_pct,tax_revenue_pct,total_tax_rate,unemployment_rate,urban_population
count,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0,5613.0
mean,140912.158917,2022.744878,71.994655,118.812496,6.486211,1.722685,17539810000000.0,101.940264,83.462908,78.816105,7.362546,286457600.0,61.869214,11.489346,36.804026,12.90023,229586500.0
std,63158.347118,0.641945,262.150194,12.775854,2.769618,0.208599,7775901000000.0,1.942687,12.529344,2.080722,1.669579,152378300.0,2.19772,4.759618,5.860639,3.896863,92924550.0
min,15000.0,2020.0,3.0,99.55,-1.9,1.14,2220307000.0,84.7,3.0,52.8,0.25,502653.0,41.2,0.1,11.3,0.09,475902.0
25%,93000.0,2023.0,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,328239500.0,62.0,9.6,36.6,14.7,270663000.0
50%,135842.0,2023.0,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,328239500.0,62.0,9.6,36.6,14.7,270663000.0
75%,183414.0,2023.0,36.0,117.24,7.5,1.73,21427700000000.0,101.8,88.2,78.5,7.25,328239500.0,62.0,9.6,36.6,14.7,270663000.0
max,319000.0,2024.0,8358.0,288.57,53.5,5.39,21427700000000.0,126.6,136.6,84.2,13.59,1397715000.0,86.8,37.2,106.3,28.18,842934000.0


In [33]:
print(salaries.isnull().sum())

job_title                                   0
employment_type                             0
experience_level                            0
expertise_level                             0
salary_currency                             0
company_location                            0
salary_in_usd                               0
employee_residence                          0
company_size                                0
year                                        0
Country                                     0
density_p_km2                               0
cpi                                         0
cpi_change_pct                              0
fertility_rate                              0
gdp                                         0
gross_primary_education_enrollment_pct      0
gross_tertiary_education_enrollment_pct     0
life_expectancy                             0
minimum_wage                                0
official_language                           0
population                        

## 3. Create dummy or indicator features for categorical variables

In [34]:
# Creating indicator features for all categorical variables
categorical_columns = ['job_title', 'employment_type', 'experience_level', 'expertise_level', 'salary_currency', 'company_location', 'employee_residence', 'company_size', 'Country', 'official_language']
salaries_dummies = pd.get_dummies(salaries, columns=categorical_columns, drop_first=True)

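# Cast every column to int so the indicators become 0/1.
# Note: this also truncates the float features (e.g. cpi 117.24 becomes 117).
salaries_dummies = salaries_dummies.astype(int)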
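
Casting the whole frame with `astype(int)` turns the indicator columns into 0/1, but it also drops the decimals of every float column. The sketch below uses a toy frame with made-up rows (not the real data) to compare that with passing `dtype=int` to `pd.get_dummies`, which only affects the newly created indicator columns:

```python
import pandas as pd

# Toy frame with one float feature and one categorical feature (illustrative values).
toy = pd.DataFrame({
    'cpi': [117.24, 119.62, 116.76],
    'company_size': ['Medium', 'Large', 'Small'],
})

# Approach in the cell above: encode, then cast every column to int.
cast_all = pd.get_dummies(toy, columns=['company_size'], drop_first=True).astype(int)

# Alternative: request integer indicators only; 'cpi' keeps its decimals.
cast_dummies = pd.get_dummies(toy, columns=['company_size'], drop_first=True, dtype=int)

print(cast_all['cpi'].tolist())      # decimals are truncated
print(cast_dummies['cpi'].tolist())  # decimals are preserved
print(cast_dummies.dtypes)
```

Whether the truncation matters depends on the model, but keeping the original precision is usually the safer default for features such as `cpi` or `fertility_rate`.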


In [35]:
salaries_dummies

Unnamed: 0,salary_in_usd,year,density_p_km2,cpi,cpi_change_pct,fertility_rate,gdp,gross_primary_education_enrollment_pct,gross_tertiary_education_enrollment_pct,life_expectancy,...,official_language_Romanian,official_language_Slovene language,official_language_Spanish,official_language_Standard Chinese,official_language_Swahili,official_language_Swedish,official_language_Thai,official_language_Turkish,official_language_Ukrainian,official_language_Urdu
0,210000,2023,36,117,7,1,21427700000000,101,88,78,...,0,0,1,0,0,0,0,0,0,0
1,165000,2023,36,117,7,1,21427700000000,101,88,78,...,0,0,1,0,0,0,0,0,0,0
2,185900,2023,36,117,7,1,21427700000000,101,88,78,...,0,0,1,0,0,0,0,0,0,0
3,129300,2023,36,117,7,1,21427700000000,101,88,78,...,0,0,1,0,0,0,0,0,0,0
4,140000,2023,36,117,7,1,21427700000000,101,88,78,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5608,73824,2023,281,119,1,1,2827113184696,101,60,81,...,0,0,0,0,0,0,0,0,0,0
5609,49216,2023,281,119,1,1,2827113184696,101,60,81,...,0,0,0,0,0,0,0,0,0,0
5610,190000,2023,4,116,1,1,1736425629520,100,68,81,...,0,0,0,0,0,0,0,0,0,0
5611,150000,2023,4,116,1,1,1736425629520,100,68,81,...,0,0,0,0,0,0,0,0,0,0


In [36]:
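# Select the numeric features by dtype.
# Note: after the astype(int) cast above, every column has an integer dtype,
# so this selection also includes the 0/1 indicator columns.
numeric_features = salaries_dummies.select_dtypes(include=['float64', 'int64']).columns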

# Creating a StandardScaler object
scaler = StandardScaler()

# Applying StandardScaler to numeric features
salaries_dummies[numeric_features] = scaler.fit_transform(salaries_dummies[numeric_features])

# Checking the result
print(salaries_dummies.head())

   salary_in_usd      year  density_p_km2       cpi  cpi_change_pct  \
0       1.093980  0.397456      -0.137318 -0.117142        0.369824   
1       0.381422  0.397456      -0.137318 -0.117142        0.369824   
2       0.712366  0.397456      -0.137318 -0.117142        0.369824   
3      -0.183874  0.397456      -0.137318 -0.117142        0.369824   
4      -0.014444  0.397456      -0.137318 -0.117142        0.369824   

   fertility_rate       gdp  gross_primary_education_enrollment_pct  \
0       -0.121095  0.500037                               -0.108808   
1       -0.121095  0.500037                               -0.108808   
2       -0.121095  0.500037                               -0.108808   
3       -0.121095  0.500037                               -0.108808   
4       -0.121095  0.500037                               -0.108808   

   gross_tertiary_education_enrollment_pct  life_expectancy  ...  \
0                                 0.380161        -0.151156  ...   
1         

## 4. Standardize the magnitude of numeric features using a scaler

In [37]:
# Explicit list of the numeric features to scale (indicator columns are excluded)
numeric_features = ['salary_in_usd', 'year', 'density_p_km2', 'cpi', 'cpi_change_pct', 'fertility_rate', 'gdp', 'gross_primary_education_enrollment_pct', 'gross_tertiary_education_enrollment_pct', 'life_expectancy', 'minimum_wage', 'population', 'population_labor_force_participation_pct', 'tax_revenue_pct', 'total_tax_rate', 'unemployment_rate', 'urban_population']

In [38]:
print(numeric_features)

['salary_in_usd', 'year', 'density_p_km2', 'cpi', 'cpi_change_pct', 'fertility_rate', 'gdp', 'gross_primary_education_enrollment_pct', 'gross_tertiary_education_enrollment_pct', 'life_expectancy', 'minimum_wage', 'population', 'population_labor_force_participation_pct', 'tax_revenue_pct', 'total_tax_rate', 'unemployment_rate', 'urban_population']


In [39]:
# Creating a StandardScaler object
scaler = StandardScaler()

# Applying StandardScaler to numeric features
salaries_dummies[numeric_features] = scaler.fit_transform(salaries_dummies[numeric_features])
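
`StandardScaler` replaces each value `x` in a column with `(x - mean) / std`, where `std` is the population standard deviation (`ddof=0`). As a small sketch, the five salaries from the `head()` output above are scaled on their own here (so the numbers differ from the full-dataset result) and compared with the manual formula:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The five salaries shown in the head() output, used as a toy column.
toy = pd.DataFrame({'salary_in_usd': [210000.0, 165000.0, 185900.0, 129300.0, 140000.0]})

# Scale with scikit-learn.
scaled = StandardScaler().fit_transform(toy[['salary_in_usd']])

# Same computation by hand: subtract the mean, divide by the population std.
col = toy['salary_in_usd'].to_numpy()
manual = (col - col.mean()) / col.std(ddof=0)

print(np.allclose(scaled.ravel(), manual))
```

After scaling, the column has mean 0 and standard deviation 1, which puts features such as `gdp` and `fertility_rate` on a comparable magnitude.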

In [40]:
# Checking the result
print(salaries_dummies.head())

   salary_in_usd      year  density_p_km2       cpi  cpi_change_pct  \
0       1.093980  0.397456      -0.137318 -0.117142        0.369824   
1       0.381422  0.397456      -0.137318 -0.117142        0.369824   
2       0.712366  0.397456      -0.137318 -0.117142        0.369824   
3      -0.183874  0.397456      -0.137318 -0.117142        0.369824   
4      -0.014444  0.397456      -0.137318 -0.117142        0.369824   

   fertility_rate       gdp  gross_primary_education_enrollment_pct  \
0       -0.121095  0.500037                               -0.108808   
1       -0.121095  0.500037                               -0.108808   
2       -0.121095  0.500037                               -0.108808   
3       -0.121095  0.500037                               -0.108808   
4       -0.121095  0.500037                               -0.108808   

   gross_tertiary_education_enrollment_pct  life_expectancy  ...  \
0                                 0.380161        -0.151156  ...   
1         

### Checking for categorical data

In [42]:
# Checking for columns of type object
object_columns = salaries_dummies.select_dtypes(include=['object']).columns
if len(object_columns) > 0:
    print("Columns with object data type:")
    print(object_columns)
else:
    print("No columns with object data type found.")

No columns with object data type found.


### Checking for NaN values

In [41]:
missing_values = salaries_dummies.isnull().sum()
print("Missing values in each column:")
print(missing_values[missing_values > 0])


Missing values in each column:
Series([], dtype: int64)


## 5. Split into testing and training datasets

In [43]:
# Split the data into features (X) and the target variable (y).
# 'salary_in_usd' is the target (already standardized in the scaling step above).
X = salaries_dummies.drop('salary_in_usd', axis=1)
y = salaries_dummies['salary_in_usd']

In [44]:
# Split into training and test sets (80/20) with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [45]:
# Check the size of the resulting sets
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

Training set size: (4490, 408)
Test set size: (1123, 408)
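
In this notebook the scaler is fitted on the full dataset before the split, so statistics from the test rows influence the scaled training features. One common alternative, sketched below on synthetic data (the column names are borrowed from the dataset, the values are random, and the target is left unscaled as an assumption), is to split first and fit the scaler on the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared data (values are random, for illustration only).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'gdp': rng.normal(2e13, 5e12, size=100),
    'life_expectancy': rng.normal(78.0, 2.0, size=100),
    'salary_in_usd': rng.normal(140000, 60000, size=100),
})

X = toy.drop('salary_in_usd', axis=1)
y = toy['salary_in_usd']  # target kept in its original units in this sketch

# Split first, so the test rows stay unseen during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering the test set is transformed with the training mean and standard deviation, so its scaled columns will not have exactly zero mean, which is expected.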


In [48]:
salaries = salaries_dummies

In [50]:
# Save the prepared (encoded and scaled) dataset without the index column
salaries.to_csv('/Users/juliabolgova/Documents/CapstoneProject/data/interim/salaries_prepared.csv', index=False)