In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# --- Load data ---
df_salaries = pd.read_csv('data_salaries.csv')

# For numerical features
numerical_features = df_salaries.select_dtypes(include=[np.number])
numerical_summary = numerical_features.describe().transpose()

# For categorical features
categorical_features = df_salaries.select_dtypes(include=['object'])
categorical_summary = categorical_features.describe(include=['object']).transpose()

# Display summary
print("Summary of Numerical Features:\n", numerical_summary)
print("\n")
print("Summary of Categorical Features:\n", categorical_summary)

Here I summarise the numerical and categorical features separately. The categorical features are still stored as text at this point and need to be converted to numerical form before the AI model can use the data.
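
The cell above only summarises the columns, so the conversion from categorical to numerical still has to happen somewhere. The following is a minimal sketch of one way to do it, not necessarily the approach used later in this notebook. It assumes `experience_level` and `company_size` can be treated as ordered (the 'MI' and 'SE' codes and the ordering are assumptions about the dataset) and one-hot encodes `employment_type`. The rows in `toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for df_salaries (values are made up).
toy = pd.DataFrame({
    "experience_level": ["EN", "MI", "SE", "EX"],
    "company_size": ["S", "M", "L", "M"],
    "employment_type": ["FT", "PT", "FT", "CT"],
    "salary_in_usd": [50000, 80000, 120000, 180000],
})

# Assumed orderings: categories with a natural order map to integer codes.
experience_order = ["EN", "MI", "SE", "EX"]
size_order = ["S", "M", "L"]

encoded = toy.copy()
encoded["experience_level"] = pd.Categorical(
    encoded["experience_level"], categories=experience_order, ordered=True
).codes
encoded["company_size"] = pd.Categorical(
    encoded["company_size"], categories=size_order, ordered=True
).codes

# Categories without a natural order become one indicator column per value.
encoded = pd.get_dummies(encoded, columns=["employment_type"], dtype=int)
print(encoded)
```

Integer codes keep the ordering information in a single column, while one-hot columns avoid implying an order between employment types.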

In [None]:
# --- Cleaning ---
# Remove outliers: drop any row with a numeric value more than 3 standard deviations from its column mean
def remove_outliers(df):
    numeric_cols = df.select_dtypes(include=[np.number])
    mean = numeric_cols.mean()
    std = numeric_cols.std()
    is_outlier = (np.abs(numeric_cols - mean) > 3 * std).any(axis=1)
    return df[~is_outlier]

df_salaries = remove_outliers(df_salaries)

Here I went through the numeric columns and removed every row containing a value more than three standard deviations from that column's mean.
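
To check what the three-standard-deviation rule actually removes, it can help to run it on a small example and count the rows before and after. The sketch below repeats the `remove_outliers` function from the cell above so that it runs on its own; the `toy` data is made up, with 19 typical salaries and one extreme value.

```python
import numpy as np
import pandas as pd

def remove_outliers(df):
    # Same rule as in the notebook: flag rows where any numeric column
    # lies more than 3 standard deviations from that column's mean.
    numeric_cols = df.select_dtypes(include=[np.number])
    mean = numeric_cols.mean()
    std = numeric_cols.std()
    is_outlier = (np.abs(numeric_cols - mean) > 3 * std).any(axis=1)
    return df[~is_outlier]

# Made-up data: 19 typical salaries and one extreme value.
toy = pd.DataFrame({
    "salary_in_usd": [100_000] * 19 + [1_000_000],
    "company_size": ["M"] * 20,
})

cleaned = remove_outliers(toy)
removed = len(toy) - len(cleaned)
print(f"Rows before: {len(toy)}, after: {len(cleaned)}, removed: {removed}")
```

Printing the same counts for `df_salaries` would show how much of the real dataset the filter discards. Note that the rule applies to every numeric column, not only `salary_in_usd`.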

In [None]:
# --- Statistics ---
# Calculate mean of 'salary_in_usd'
mean_salary = df_salaries['salary_in_usd'].mean()
print("Mean Salary: ", mean_salary)

# Calculate standard deviation of 'salary_in_usd'
std_salary = df_salaries['salary_in_usd'].std()
print("Standard Deviation of Salary: ", std_salary)

# --- Visualization ---
# Show first 10 instances
print("df_salaries first ten instances:\n", df_salaries.head(10))

Here I calculated the mean and standard deviation of salary in USD, as it is the variable I want my model to predict: every record has it, and I believe it is the best target to test. I also show the first ten instances in df_salaries.

In [None]:
# Box plot Salaries vs Company Size
plt.figure(figsize=(10,7))
sns.boxplot(x="company_size", y="salary_in_usd", data=df_salaries, order=['S', 'M', 'L'])
plt.title('Salaries vs Company Size')
plt.show()

Graph 1: The graph shows a clear relationship between company size and employee wages. Smaller companies pay less, likely because they have limited funds or are start-ups. Medium-sized companies pay competitively and represent the majority of the data. Large companies pay the second most and have a wide wage distribution, possibly due to the diverse roles and work experience among their employees. For example, employees could have joined when the company was small and received pay increases over time as it grew.
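
A box plot reading like this can be cross-checked with a grouped summary, which also shows how many rows sit behind each box. The following is a minimal sketch on made-up values; `toy` stands in for `df_salaries`, and the numbers are not results from the real dataset.

```python
import pandas as pd

# Hypothetical rows standing in for df_salaries (values are made up).
toy = pd.DataFrame({
    "company_size": ["S", "S", "M", "M", "M", "L", "L"],
    "salary_in_usd": [60000, 80000, 120000, 140000, 160000, 110000, 130000],
})

# Count, median and mean salary per company size, in S/M/L order.
summary = (
    toy.groupby("company_size")["salary_in_usd"]
    .agg(["count", "median", "mean"])
    .reindex(["S", "M", "L"])
)
print(summary)
```

The `count` column is the quickest way to confirm which company size represents the majority of the data.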


In [None]:
# Grouped bar plot of Salary Comparison by experience level and company size
plt.figure(figsize=(10, 7))
sns.barplot(data=df_salaries, x='experience_level', y='salary_in_usd', hue='company_size')
plt.xlabel('Experience Level')
plt.ylabel('Salary (USD)')
plt.title('Salary Comparison by Experience Level and Company Size')
plt.legend(title='Company Size')
plt.show()

Graph 2: The graph reveals a link between experience level and pay in data science. The 'EX' level earns the most, mirroring Graph 1's distribution across company sizes. The 'EN' level is paid the least overall but earns slightly more in smaller companies. The overall trend across experience levels and company sizes is that smaller companies pay the least, followed by large and then medium companies.
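
The heights in a seaborn bar plot are group means by default, so the same comparison can be read off a pivot table of mean salaries. This is a sketch on made-up values (the `toy` frame is hypothetical), intended only to show the shape of the table.

```python
import pandas as pd

# Hypothetical rows standing in for df_salaries (values are made up).
toy = pd.DataFrame({
    "experience_level": ["EN", "EN", "EN", "EX", "EX", "EX"],
    "company_size": ["S", "M", "M", "S", "M", "L"],
    "salary_in_usd": [50000, 40000, 60000, 150000, 200000, 180000],
})

# Mean salary for each experience level (rows) and company size (columns).
table = toy.pivot_table(
    index="experience_level",
    columns="company_size",
    values="salary_in_usd",
    aggfunc="mean",
).reindex(columns=["S", "M", "L"])
print(table)
```

Combinations with no rows appear as `NaN`, which makes gaps in the data visible where a bar would simply be absent from the plot.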





In [None]:
# Box plot Salaries vs Employment Type and Company Size
plt.figure(figsize=(10, 7))
sns.boxplot(x="company_size", y="salary_in_usd", hue="employment_type", data=df_salaries, order=['S', 'M', 'L'])
plt.title('Salaries vs Employment Type and Company Size')
plt.legend(title='Employment Type')
plt.show()

# Box plot Salary usd vs employment type
plt.figure(figsize=(10,7))
sns.boxplot(x='employment_type', y='salary_in_usd', data=df_salaries)
plt.xlabel('Employment Type')
plt.ylabel('Salary (USD)')
plt.title('Salary Comparison by Employment Type')
plt.show()


Graph 3 indicates that full-time (FT) is the most common and highest-paying employment type. Part-time (PT) jobs have the widest pay spread, especially in medium companies. Freelance (FL) workers in medium companies earn more than PT workers, possibly due to longer work hours. Contract (CT) workers in medium companies have a symmetrical pay spread. Larger companies employ only FT and PT workers and appear to pay less; given the additional roles in medium companies, this suggests that medium companies pay more in data science. Smaller companies, with FT, CT, and PT jobs, follow the trend of FT paying more.


Graph 4 confirms that FL workers are the second-highest earners with the smallest pay spread, and that PT workers earn the least.
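
Claims about which employment type is most common, and which types appear in which company sizes, can be checked with simple counts. Small groups also deserve caution, since a box drawn from a handful of rows may not be representative. The sketch below uses made-up rows in a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical rows standing in for df_salaries (values are made up).
toy = pd.DataFrame({
    "employment_type": ["FT", "FT", "FT", "FT", "PT", "FL", "CT"],
    "company_size": ["S", "M", "M", "L", "L", "M", "M"],
})

# Number of rows for each employment type within each company size.
counts = pd.crosstab(toy["employment_type"], toy["company_size"])
counts = counts.reindex(columns=["S", "M", "L"])
print(counts)

# Overall number of rows per employment type.
type_totals = toy["employment_type"].value_counts()
print(type_totals)
```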

# 5 Summary:

From examining the data, I found that 'salary_currency' and 'company_location' had the weakest correlation with salary in USD, whereas company_size, experience_level and employment_type contributed the most to predicting the salaries of people in the data science field.
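
The notebook does not show how the strength of these relationships was measured. One option for a categorical feature against a numeric target is the correlation ratio (eta squared), the share of the variance in salary explained by the group means. The sketch below is an assumption about how such a comparison could be made, not the method used here; the helper `correlation_ratio` and the `toy` data are hypothetical.

```python
import pandas as pd

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Eta squared: between-group sum of squares over total sum of squares."""
    values = values.astype(float)
    grand_mean = values.mean()
    grouped = values.groupby(categories)
    ss_between = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(ss_between / ss_total) if ss_total > 0 else 0.0

# Made-up data: salary depends entirely on company size, not on currency.
toy = pd.DataFrame({
    "company_size": ["S", "S", "M", "M"],
    "salary_currency": ["USD", "EUR", "USD", "EUR"],
    "salary_in_usd": [50000, 50000, 150000, 150000],
})

for column in ["company_size", "salary_currency"]:
    eta_sq = correlation_ratio(toy[column], toy["salary_in_usd"])
    print(f"{column}: eta^2 = {eta_sq:.2f}")
```

Values near 0 mean the feature explains almost none of the salary variance, and values near 1 mean it explains almost all of it. Features with many levels, such as a location column, can score high simply because they split the data into many small groups.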

My exploration and preprocessing of the data began with loading the salary dataset into a pandas DataFrame. This dataset provides insight into the salaries of data science professionals and how they vary based on different factors.

In order to ensure the quality of the data, the cleaning process involved removing outliers from the dataset. I implemented a function that removes any data point more than three standard deviations from the mean in the numeric columns. This step was vital because outliers can significantly skew results and impact the predictive performance of machine learning models.

Following the data cleaning, I performed statistical analysis on the 'salary_in_usd' field. By calculating the mean and standard deviation, I was able to determine the dispersion of the salaries, which provided an understanding of the overall salary distribution.

The next step involved diving deeper into the data through visualisations. These included box plots and bar plots, which I used to compare salaries against variables such as company size, experience level, and employment type. This phase was crucial in visualising trends and patterns in the data that would not be evident from just looking at raw data, meaning the best variables were easier to identify.

Some of the interesting findings from the EDA included:

- A correlation between company size and wages became obvious when graphing the features: smaller companies generally pay less than larger ones, while medium ones pay the most.
- Experience level had a clear impact on salaries, with the most experienced professionals ('EX') earning the most.
- Salaries also varied by employment type, with full-time roles usually offering higher salaries than part-time ones.

Through preprocessing the data, I developed a thorough understanding of the dataset, revealing essential insights that will guide the subsequent steps in model training and evaluation. These insights include identifying key features and their potential influence on predicting salary outcomes.
