# EXPLORATORY ANALYSIS

This project performs an exploratory analysis of data science salary data covering the years 2020 to 2024.



# Data Dictionary


## Columns and their meanings:
<br>

**job_title: The job title or role associated with the reported salary.**

**experience_level: The level of experience of the individual.**

**employment_type: Indicates whether the employment is full-time, part-time, etc.**

**work_models: Describes different working models (remote, on-site, hybrid).**

**work_year: The specific year in which the salary information was recorded.**

**employee_residence: The residence location of the employee.**

**salary: The reported salary in the original currency.**

**salary_currency: The currency in which the salary is denominated.**

**salary_in_usd: The converted salary in US dollars.**

**company_location: The geographic location of the employing organization.**

**company_size: The size of the company, categorized by the number of employees.**

<br>


Data source: https://www.kaggle.com/datasets/sazidthe1/data-science-salaries/data

In [47]:
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import matplotlib.pyplot as plt

# First Look at the Data and Its Shape

In [48]:
salario = pd.read_csv('data_science_salaries.csv')

print(salario.shape,
      f'The dataset has {salario.shape[0]} rows and {salario.shape[1]} columns')

salario.head()


(6599, 11) The dataset has 6599 rows and 11 columns


| | job_title | experience_level | employment_type | work_models | work_year | employee_residence | salary | salary_currency | salary_in_usd | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Engineer | Mid-level | Full-time | Remote | 2024 | United States | 148100 | USD | 148100 | United States | Medium |
| 1 | Data Engineer | Mid-level | Full-time | Remote | 2024 | United States | 98700 | USD | 98700 | United States | Medium |
| 2 | Data Scientist | Senior-level | Full-time | Remote | 2024 | United States | 140032 | USD | 140032 | United States | Medium |
| 3 | Data Scientist | Senior-level | Full-time | Remote | 2024 | United States | 100022 | USD | 100022 | United States | Medium |
| 4 | BI Developer | Mid-level | Full-time | On-site | 2024 | United States | 120000 | USD | 120000 | United States | Medium |


In [49]:
# Columns available in the dataset
salario.columns

Index(['job_title', 'experience_level', 'employment_type', 'work_models',
       'work_year', 'employee_residence', 'salary', 'salary_currency',
       'salary_in_usd', 'company_location', 'company_size'],
      dtype='object')

## Are there missing values? Are the data types correct?

In [50]:
salario.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6599 entries, 0 to 6598
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   job_title           6599 non-null   object
 1   experience_level    6599 non-null   object
 2   employment_type     6599 non-null   object
 3   work_models         6599 non-null   object
 4   work_year           6599 non-null   int64 
 5   employee_residence  6599 non-null   object
 6   salary              6599 non-null   int64 
 7   salary_currency     6599 non-null   object
 8   salary_in_usd       6599 non-null   int64 
 9   company_location    6599 non-null   object
 10  company_size        6599 non-null   object
dtypes: int64(3), object(8)
memory usage: 567.2+ KB


There are no null values, and the data types are correct: the three numeric columns are `int64` and the remaining eight are `object` (text).
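
The same conclusions can also be checked programmatically instead of reading the `info()` output by eye. The snippet below is a minimal sketch: `sample` is a small hypothetical stand-in built from a few of the rows shown earlier, and the list of columns expected to be numeric is an assumption based on the data dictionary.

```python
import pandas as pd

# Tiny stand-in for the real file, built from rows shown in the head() output.
sample = pd.DataFrame({
    "job_title": ["Data Engineer", "Data Scientist", "BI Developer"],
    "work_year": [2024, 2024, 2024],
    "salary": [148100, 140032, 120000],
    "salary_currency": ["USD", "USD", "USD"],
    "salary_in_usd": [148100, 140032, 120000],
})

# Null count per column; all zeros means nothing is missing.
null_counts = sample.isna().sum()

# Columns expected to hold integers (assumption based on the data dictionary).
numeric_cols = ["work_year", "salary", "salary_in_usd"]
all_numeric = all(pd.api.types.is_integer_dtype(sample[c]) for c in numeric_cols)

print(null_counts)
print("numeric columns are integers:", all_numeric)
```

On the real data, the same two checks would be run on `salario` in place of `sample`.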

## Are there duplicate rows?

In [51]:
salario.duplicated().sum()

0

## How do the numeric variables behave?

### Because the values in "salary" are so large, the display format for numeric values needs to be changed; otherwise they would be shown in scientific notation.

In [52]:
# Show all numbers as floats, without scientific notation
pd.set_option('display.float_format', lambda x: '%.3f' % x)
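
Note that `pd.set_option` changes the display format for the rest of the session. As a possible alternative, the sketch below scopes the same formatting to a single block with `pd.option_context`; the `values` series is a hypothetical example using numbers of the same magnitude as the salary column.

```python
import pandas as pd

# Example values of very different magnitudes (illustrative only).
values = pd.Series([30400000.0, 179283.255, 0.675])

# Apply the fixed-point format only inside this block; the global
# display setting is restored when the block exits.
with pd.option_context("display.float_format", "{:.3f}".format):
    fixed_text = values.to_string()

print(fixed_text)
```

Either approach works for this notebook; the scoped version is mainly useful when other cells should keep the default display.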

In [53]:
salario.describe(percentiles=[0.25,0.5,0.75,0.90,0.99])

| | work_year | salary | salary_in_usd |
|---|---|---|---|
| count | 6599.0 | 6599.0 | 6599.0 |
| mean | 2022.818 | 179283.255 | 145560.559 |
| std | 0.675 | 526372.242 | 70946.838 |
| min | 2020.0 | 14000.0 | 15000.0 |
| 25% | 2023.0 | 96000.0 | 95000.0 |
| 50% | 2023.0 | 140000.0 | 138666.0 |
| 75% | 2023.0 | 187500.0 | 185000.0 |
| 90% | 2023.0 | 240070.0 | 235000.0 |
| 99% | 2024.0 | 750000.0 | 345012.0 |
| max | 2024.0 | 30400000.0 | 750000.0 |


### Such a high standard deviation in the "salary" column is explained by the differing values of the local currencies in which salaries are reported.

### For example, the record with the highest "salary" value, 30400000 CLP, corresponds to only 40038 USD, which is below the 25th percentile of "salary_in_usd"!

In [54]:
salario.query('salary==30400000')

| | job_title | experience_level | employment_type | work_models | work_year | employee_residence | salary | salary_currency | salary_in_usd | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6504 | Data Scientist | Mid-level | Full-time | Remote | 2021 | Chile | 30400000 | CLP | 40038 | Chile | Large |
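
To make the currency effect concrete, the ratio between `salary` and `salary_in_usd` gives the exchange rate implied by each record. The sketch below uses only two rows copied from the outputs above; `implied_rate` is a hypothetical helper column added for illustration, not part of the dataset.

```python
import pandas as pd

# Two rows copied from the outputs above: one paid in USD, one in CLP.
rows = pd.DataFrame({
    "salary": [148100, 30400000],
    "salary_currency": ["USD", "CLP"],
    "salary_in_usd": [148100, 40038],
})

# Units of local currency per US dollar implied by each record.
rows["implied_rate"] = rows["salary"] / rows["salary_in_usd"]

print(rows)
```

Because `salary` mixes currencies with very different implied rates, comparisons between records are more meaningful on `salary_in_usd`.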


### Now let's take a look at the non-numeric values

In [55]:
salario.describe(exclude=np.number)

| | job_title | experience_level | employment_type | work_models | employee_residence | salary_currency | company_location | company_size |
|---|---|---|---|---|---|---|---|---|
| count | 6599 | 6599 | 6599 | 6599 | 6599 | 6599 | 6599 | 6599 |
| unique | 132 | 4 | 4 | 3 | 87 | 22 | 75 | 3 |
| top | Data Engineer | Senior-level | Full-time | On-site | United States | USD | United States | Medium |
| freq | 1307 | 4105 | 6552 | 3813 | 5305 | 5827 | 5354 | 5860 |


### In this dataset, 6552 of the 6599 records are "Full-time" employees. The "unique" row, which gives the number of categories per column, shows 4 employment types, so the vast majority fall into just one of them.


### Some other observations: most people in this dataset live in the United States (5305 records), are paid in US dollars (5827 records), and work at medium-sized companies (5860 records).
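
The "freq" and "count" rows of the summary are enough to express these observations as shares of the dataset. The following is a small arithmetic sketch using only numbers copied from the table; the dictionary keys are illustrative labels.

```python
# Total rows and the "freq" of the top category, copied from the summary table.
total_rows = 6599
top_freq = {
    "employment_type (Full-time)": 6552,
    "employee_residence (United States)": 5305,
    "salary_currency (USD)": 5827,
    "company_size (Medium)": 5860,
}

# Share of the dataset taken by the most frequent category of each column.
shares = {name: freq / total_rows for name, freq in top_freq.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

On the full DataFrame, `salario[column].value_counts(normalize=True)` would give the same shares for every category rather than just the top one.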

In [56]:
salario.job_title.unique()

array(['Data Engineer', 'Data Scientist', 'BI Developer',
       'Research Analyst', 'Business Intelligence Developer',
       'Data Analyst', 'Director of Data Science', 'MLOps Engineer',
       'Machine Learning Scientist', 'Machine Learning Engineer',
       'Data Science Manager', 'Applied Scientist',
       'Business Intelligence Analyst', 'Analytics Engineer',
       'Business Intelligence Engineer', 'Data Science',
       'Research Scientist', 'Research Engineer',
       'Managing Director Data Science', 'AI Engineer', 'Data Specialist',
       'Data Architect', 'Data Visualization Specialist', 'ETL Developer',
       'Data Science Practitioner', 'Computer Vision Engineer',
       'Data Lead', 'ML Engineer', 'Data Developer', 'Data Modeler',
       'Data Science Consultant', 'AI Architect',
       'Data Analytics Manager', 'Data Science Engineer',
       'Data Product Manager', 'Data Quality Analyst', 'Data Strategist',
       'Prompt Engineer', 'Data Science Lead',
       'Busi

--------------------------------------------------------------------------------------------------------------------------------

# Next, we perform a univariate analysis of the columns to get a better view of how each one behaves

# Univariate Analysis

In [57]:
fig=px.box(salario,x='salary_in_usd')
fig.show()
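
A box plot typically marks points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The sketch below applies that rule to the quartiles of `salary_in_usd` reported by `describe()` earlier. It assumes the plot uses the conventional 1.5 × IQR whiskers, and Plotly's quartile computation may differ slightly from pandas, so the fences are approximate.

```python
# Quartiles and upper percentiles of salary_in_usd, copied from describe().
q1, q3 = 95000.0, 185000.0
p99, maximum = 345012.0, 750000.0

# Conventional 1.5 * IQR fences (assumed whisker rule).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr:.0f}")
print(f"fences: [{lower_fence:.0f}, {upper_fence:.0f}]")
# True here means roughly 1% or more of the salaries exceed the upper fence.
print("99th percentile above upper fence:", p99 > upper_fence)
print("maximum above upper fence:", maximum > upper_fence)
```

Since the lower fence is negative and the minimum salary is positive, any outliers under this rule can only sit on the high side of the distribution.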

In [58]:
fig = px.histogram(salario, x="salary_in_usd")
fig.show()

In [59]:
fig = px.histogram(salario, x="experience_level",color="experience_level")

fig.show()

In [60]:
fig = px.histogram(salario, x="work_models",color="work_models")

fig.show()

In [61]:
fig = px.histogram(salario, x="company_size",color="company_size")

fig.show()

In [62]:
fig = px.histogram(salario, x="company_location",color="company_location")

fig.show()

In [63]:
fig = px.histogram(salario, x="salary_currency",color="salary_currency")

fig.show()

In [64]:
fig = px.histogram(salario, x="work_year",color="work_year")

fig.show()

In [65]:
fig = px.histogram(salario, x="job_title",color="job_title")

fig.show()

In [66]:
fig = px.histogram(salario, x="employment_type",color="employment_type")

fig.show()
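
Columns with many categories, such as `job_title` (132 distinct values) and `company_location` (75), produce very crowded histograms. One possible way to make them readable is to keep the most frequent categories and group the rest. The sketch below is illustrative: `top_categories` is a hypothetical helper, and `toy` is made-up data standing in for the real DataFrame.

```python
import pandas as pd

def top_categories(df, column, n=10):
    """Count the n most frequent values of `column`; lump the rest as 'Other'."""
    counts = df[column].value_counts()
    top = counts.head(n)
    rest = counts.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({"Other": rest})])
    # Return a tidy two-column frame: category name and its count.
    return top.rename_axis(column).reset_index(name="count")

# Toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    "job_title": ["Data Engineer"] * 3 + ["Data Scientist"] * 2
                 + ["AI Engineer", "Data Architect"],
})

summary = top_categories(toy, "job_title", n=2)
print(summary)
```

The resulting frame could then be plotted with something like `px.bar(summary, x="job_title", y="count")` in place of the raw histogram.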
