In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# --- Cleaning ---
# Load data
df_salaries = pd.read_csv('data_salaries.csv')

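# Function to remove outliers
def remove_outliers(df):
    """Drop rows where any numeric column lies more than 3 standard deviations from its column mean."""
    numeric_cols = df.select_dtypes(include=[np.number])
    mean = numeric_cols.mean()
    std = numeric_cols.std()
    # A row is an outlier if at least one of its numeric values exceeds the 3-sigma threshold
    is_outlier = (np.abs(numeric_cols - mean) > 3 * std).any(axis=1)
    return df[~is_outlier]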

df_salaries = remove_outliers(df_salaries)

In [None]:
# --- Statistics ---
# Calculate mean of 'salary_in_usd'
mean_salary = df_salaries['salary_in_usd'].mean()
print("Mean Salary: ", mean_salary)

# Calculate standard deviation of 'salary_in_usd'
std_salary = df_salaries['salary_in_usd'].std()
print("Standard Deviation of Salary: ", std_salary)

# Show the first 10 rows
print("df_salaries first 10 rows:\n", df_salaries.head(10))

![Screenshot 2023-08-05 at 1.52.36 PM.png](attachment:72681e44-8328-46e6-9ef7-e04b792f4e23.png)

Here I show the mean of the salary in USD, since that is the variable I am most interested in. I also show its standard deviation and the first 10 rows of the data frame we are given.

In [None]:
# --- Visualization ---


# Box plot Salaries vs Company Size
plt.figure(figsize=(10,6))
sns.boxplot(x="company_size", y="salary_in_usd", data=df_salaries, order=['S', 'M', 'L'])
plt.title('Salaries vs Company Size')
plt.show()

# Grouped bar plot of Salary Comparison by experience level and company size
plt.figure(figsize=(12, 8))
sns.barplot(data=df_salaries, x='experience_level', y='salary_in_usd', hue='company_size')
plt.xlabel('Experience Level')
plt.ylabel('Salary (USD)')
plt.title('Salary Comparison by Experience Level and Company Size')
plt.legend(title='Company Size')
plt.show()

# Box plot Salaries vs Employment Type and Company Size
plt.figure(figsize=(12, 6))
sns.boxplot(x="company_size", y="salary_in_usd", hue="employment_type", data=df_salaries, order=['S', 'M', 'L'])
plt.title('Salaries vs Employment Type and Company Size')
plt.legend(title='Employment Type')
plt.show()

# Box plot Salary Comparison by employment type
plt.figure(figsize=(10,6))
sns.boxplot(x='employment_type', y='salary_in_usd', data=df_salaries)
plt.xlabel('Employment Type')
plt.ylabel('Salary (USD)')
plt.title('Salary Comparison by Employment Type')
plt.show()

The cleaning cell above removes all the numerical outliers from the data before the statistics and plots are produced.

Graph 1: The box plot suggests a relationship between company size and employee wages. Small companies pay the least, likely because of limited funds. Medium-sized companies, which make up the majority of the data, pay the most. Large companies pay the second most and show a wide wage distribution, possibly because of the diverse roles and levels of work experience among their employees.


![image.png](attachment:41257b68-bb1f-4544-bb61-c3173ec8e501.png)
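
The comparison in Graph 1 can also be checked numerically. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `df_salaries` with made-up values, and the summary reports the median (the line inside each box) alongside the row count and mean for each company size.

```python
import pandas as pd

# Hypothetical toy rows; the notebook itself uses df_salaries loaded from the CSV.
toy = pd.DataFrame({
    "company_size": ["S", "S", "M", "M", "M", "L", "L"],
    "salary_in_usd": [60_000, 80_000, 120_000, 140_000, 160_000, 90_000, 150_000],
})

# Count, median and mean salary per company size, listed in S, M, L order.
size_summary = (
    toy.groupby("company_size")["salary_in_usd"]
    .agg(["count", "median", "mean"])
    .reindex(["S", "M", "L"])
)
print(size_summary)
```

The `count` column is worth reading next to the medians, because a group with very few rows gives a much less reliable picture than a group that holds most of the data.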


Graph 2: The bar plot reveals a link between experience level and pay in data science. The 'EX' level earns the most, and its pattern across company sizes mirrors the distribution in Graph 1. The 'EN' level is paid the least overall, although it earns slightly more in small companies. Across the experience levels, small companies generally pay the least, followed by large and then medium companies.

![image.png](attachment:05eb59c4-bb4f-4822-a961-72358c11a50d.png)
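
The bar heights in a grouped bar plot like Graph 2 are group means, so the same numbers can be laid out as a table. This is a minimal sketch on hypothetical toy rows (the values are invented, and only the 'EN' and 'EX' levels mentioned above are used), not the figures from the real dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for df_salaries.
toy = pd.DataFrame({
    "experience_level": ["EN", "EN", "EN", "EX", "EX", "EX"],
    "company_size": ["S", "M", "M", "S", "M", "L"],
    "salary_in_usd": [50_000, 60_000, 70_000, 150_000, 200_000, 180_000],
})

# Mean salary for every (experience level, company size) pair.
# Combinations that never occur in the data show up as NaN.
mean_by_group = toy.pivot_table(
    index="experience_level",
    columns="company_size",
    values="salary_in_usd",
    aggfunc="mean",
)
print(mean_by_group)
```

A NaN cell in the table is a useful warning that a bar is missing from the plot because no rows exist for that combination, not because the salary is zero.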


Graph 3: The box plot indicates that full-time (FT) is the most common and highest-paying employment type. Part-time (PT) jobs have the widest pay spread, especially in medium companies. Freelance (FL) workers in medium companies earn more than PT workers, possibly because they work longer hours. Contract (CT) workers in medium companies have a symmetrical pay spread. Large companies employ only FT and PT workers and appear to pay less; taken together with the additional employment types found in medium companies, this suggests that medium companies pay more in data science. Small companies, with FT, CT, and PT jobs, follow the trend of FT paying more.

![image.png](attachment:78b980e2-62e8-47aa-91bf-3112def3b585.png)
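
Statements about which employment types appear in which company size can be verified with a frequency table. The sketch below uses hypothetical toy rows, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows standing in for df_salaries.
toy = pd.DataFrame({
    "company_size": ["S", "S", "M", "M", "M", "M", "L", "L"],
    "employment_type": ["FT", "CT", "FT", "FT", "FL", "PT", "FT", "PT"],
})

# Number of rows for each (company size, employment type) pair.
# Pairs that never occur are filled with 0.
type_counts = pd.crosstab(toy["company_size"], toy["employment_type"])
print(type_counts)
```

Cells with a count of 0 correspond to boxes that are absent from the plot, and cells with very small counts mark boxes whose spread should be read with caution.
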

Graph 4: The box plot confirms that FL workers are the second-highest earners and have the smallest pay spread, while PT workers earn the least.

![image.png](attachment:ac8abe35-ad37-4f77-bf5e-5564676f96e7.png)



From examining the data, I found that 'salary_currency' and 'company_location' had the weakest association with salary in USD, whereas company_size, experience_level and employment_type contributed the most when it came to predicting the salary of people working in the data science field.
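
Because these predictors are categorical, an ordinary correlation coefficient does not apply to them directly. One possible way to put a number on the association is the correlation ratio (eta squared), the share of salary variance explained by group membership. This is not the method used in the notebook; the helper `correlation_ratio` and the toy rows below are hypothetical and only sketch the idea.

```python
import pandas as pd

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Eta squared: fraction of the variance in `values` explained by the groups."""
    grand_mean = values.mean()
    grouped = values.groupby(categories)
    # Between-group sum of squares: group size times squared distance of the
    # group mean from the overall mean.
    ss_between = (grouped.size() * (grouped.mean() - grand_mean) ** 2).sum()
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(ss_between / ss_total) if ss_total > 0 else 0.0

# Hypothetical toy rows: salary depends on experience level only.
toy = pd.DataFrame({
    "experience_level": ["EN", "EN", "EX", "EX"],
    "company_size": ["S", "M", "S", "M"],
    "salary_in_usd": [50_000, 50_000, 150_000, 150_000],
})

eta_experience = correlation_ratio(toy["experience_level"], toy["salary_in_usd"])
eta_size = correlation_ratio(toy["company_size"], toy["salary_in_usd"])
print(eta_experience, eta_size)
```

A value near 1 means the category almost fully determines the salary, while a value near 0 means the group means are nearly identical. In the toy data, experience level explains all of the variance and company size explains none.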


#5 Summary:

My exploratory data analysis (EDA) and preprocessing began with loading the salary dataset into a pandas DataFrame. This dataset provides a comprehensive insight into the salaries of data science professionals and how they vary based on different factors.

To ensure the integrity and quality of the data, the cleaning process involved removing outliers from the dataset. I implemented a function that removes any row containing a value more than three standard deviations from the column mean in any of the numeric columns. This step was vital because outliers can significantly skew results and impact the predictive performance of machine learning models.
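
To see the three-standard-deviation rule in action, the sketch below applies the same logic to a small hypothetical frame. The `job_title` column and all values are made up for illustration. It contains 19 ordinary salaries and one extreme one.

```python
import numpy as np
import pandas as pd

def remove_outliers(df):
    """Drop rows where any numeric value is more than 3 std from its column mean."""
    numeric_cols = df.select_dtypes(include=[np.number])
    exceeds = np.abs(numeric_cols - numeric_cols.mean()) > 3 * numeric_cols.std()
    return df[~exceeds.any(axis=1)]

# Hypothetical toy data: 19 salaries between 100,000 and 118,000 plus one of 1,000,000.
toy = pd.DataFrame({
    "job_title": ["Data Scientist"] * 20,  # non-numeric column, ignored by the filter
    "salary_in_usd": [100_000 + 1_000 * i for i in range(19)] + [1_000_000],
})

cleaned = remove_outliers(toy)
print(len(toy), "->", len(cleaned))
```

Only the extreme row is dropped here. Note that the mean and standard deviation are computed with the outlier still included, so in very small samples a single extreme value inflates the standard deviation enough that it may never cross the threshold.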

Following the data cleaning, I performed statistical analysis on the 'salary_in_usd' field. By calculating the mean and standard deviation, I was able to determine the central tendency and dispersion of the salaries, which provided an understanding of the overall salary distribution.
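
As a small worked example of these two statistics, the sketch below uses three hypothetical salaries. One detail worth knowing is that pandas computes the sample standard deviation (`ddof=1`) by default, which is slightly larger than the population version (`ddof=0`).

```python
import pandas as pd

# Three hypothetical salaries, chosen so the arithmetic is easy to follow.
salaries = pd.Series([50_000, 60_000, 70_000], name="salary_in_usd")

mean_salary = salaries.mean()            # (50000 + 60000 + 70000) / 3
sample_std = salaries.std()              # divides the squared deviations by n - 1
population_std = salaries.std(ddof=0)    # divides the squared deviations by n

print("Mean:", mean_salary)
print("Sample std:", sample_std)
print("Population std:", population_std)
```

Here the mean is 60,000 and the sample standard deviation is 10,000, since the squared deviations sum to 200,000,000 and are divided by 2 before taking the square root.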

The next step involved diving deeper into the data through visualisations. These included box plots and bar plots, which I used to compare salaries against variables such as company size, experience level, and employment type. This phase of EDA was crucial in visualising trends and patterns in the data that would not be evident from just looking at raw data.

Some of the interesting findings from the EDA included:

- A correlation between company size and wages, with smaller companies generally paying less than larger ones, while medium ones pay the most.
- A clear impact of experience level on salaries, with more experienced professionals ('EX') earning more.
- A variation in salaries based on employment type, with full-time roles usually offering higher salaries than part-time ones.

Through EDA and preprocessing, I developed a thorough understanding of the dataset, revealing essential insights that will guide the subsequent steps in model training and evaluation. These insights include identifying key features and their potential influence on predicting salary outcomes.
