<a href="https://colab.research.google.com/github/LaisST/FIAP_202501_HandsOn_data_analytics/blob/main/Fase_2_Aula03_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis (EDA)

## INTRODUCTION

### What problem are we solving?


Suppose you have been hired by Spotify as a data scientist, and one of the company's biggest current pain points is understanding the characteristics of the songs by its Top Artists and the behavior those songs trigger in its users.

Every great data science project starts from a pain point or problem to be solved! So, after building an initial understanding of the project and of what the company stands to gain, you start investigating the database. The first case you decide to analyze is the famous rock band The Rolling Stones.

## EXPLORATORY ANALYSIS

### What is exploratory analysis?


Exploratory analysis is an essential step in a data scientist's day-to-day work: it is the process of discovering insights and getting to know your data. A good exploratory analysis reveals trends, patterns, and possible relationships between variables.

### Why do I need to understand my data?


When we understand how our data behaves, we can identify current business limitations, possible errors or inconsistencies in how the data was recorded, and many other insights. Left untreated, these issues can lead to inconclusive results or even poor decisions for your company.

In short, understanding your data is fundamental to any data work: it helps you decide which variables are important and reliable for your project, and it helps ensure that results are communicated clearly and understandably.

## Data preparation

In [1]:
# Import libraries
import pandas as pd

In [3]:
df_banda = pd.read_excel('/content/dataset_rolling_stones.xlsx')
df_banda.head()

Unnamed: 0,name,album,release_date,track_number,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,duration_ms
0,Concert Intro Music - Live,Licked Live In NYC,2022-06-10,1,0.0824,0.463,0.993,0.996,0.932,-12913.0,0.11,118001.0,0.0302,33,48640
1,Street Fighting Man - Live,Licked Live In NYC,2022-06-10,2,0.437,0.326,0.965,0.233,0.961,-4803.0,0.0759,131455.0,0.318,34,253173
2,Start Me Up - Live,Licked Live In NYC,2022-06-10,3,0.416,0.386,0.969,0.4,0.956,-4936.0,0.115,130066.0,0.313,34,263160
3,If You Can't Rock Me - Live,Licked Live In NYC,2022-06-10,4,0.567,0.369,0.985,0.000107,0.895,-5535.0,0.193,132994.0,0.147,32,305880
4,Don’t Stop - Live,Licked Live In NYC,2022-06-10,5,0.4,0.303,0.969,0.0559,0.966,-5098.0,0.093,130533.0,0.206,32,305106


In [4]:
# One of the first steps is to check which data types the dataset contains
df_banda.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1610 entries, 0 to 1609
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   name              1610 non-null   object        
 1   album             1610 non-null   object        
 2   release_date      1610 non-null   datetime64[ns]
 3   track_number      1610 non-null   int64         
 4   acousticness      1610 non-null   float64       
 5   danceability      1610 non-null   float64       
 6   energy            1610 non-null   float64       
 7   instrumentalness  1610 non-null   float64       
 8   liveness          1610 non-null   float64       
 9   loudness          1610 non-null   float64       
 10  speechiness       1610 non-null   float64       
 11  tempo             1610 non-null   float64       
 12  valence           1610 non-null   float64       
 13  popularity        1610 non-null   int64         
 14  duration_ms       1610 n

In [7]:
# Check the size of the dataset
df_banda.shape
# A more readable way to print it
print(f'The dataset contains {df_banda.shape[0]} rows and {df_banda.shape[1]} columns.')

The dataset contains 1610 rows and 15 columns.


In [8]:
# Find the band's earliest release date
print(f"First release date: {df_banda['release_date'].min()}")

First release date: 1964-04-16 00:00:00


In [9]:
# Find the date of their most recent release in the dataset
print(f"Last release date: {df_banda['release_date'].max()}")

Last release date: 2022-06-10 00:00:00
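To make the two dates above easier to read together, here is a minimal sketch that formats them without the time component and computes the span in calendar years. The `toy` frame is a hypothetical stand-in built from the dates shown in the outputs, not the full dataset.

```python
import pandas as pd

# Hypothetical three-row stand-in for df_banda, using dates seen in the outputs
toy = pd.DataFrame(
    {"release_date": pd.to_datetime(["1964-04-16", "1978-06-09", "2022-06-10"])}
)

first = toy["release_date"].min()
last = toy["release_date"].max()

# Difference between calendar years, not an exact day-based duration
span_years = last.year - first.year

first_str = first.strftime("%Y-%m-%d")
last_str = last.strftime("%Y-%m-%d")
print(f"Releases from {first_str} to {last_str} ({span_years} years)")
```

Using `strftime` drops the `00:00:00` that appears when a timestamp is printed directly.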


In [10]:
# Check for null values
df_banda.isnull().sum()

name,0
album,0
release_date,0
track_number,0
acousticness,0
danceability,0
energy,0
instrumentalness,0
liveness,0
loudness,0
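Every column above reports zero nulls, so the check has nothing to show on this dataset. As a hedged illustration of what it reports when values are missing, the sketch below runs the same call on a small made-up frame (`toy` and its values are hypothetical) and also expresses the nulls as a share of rows.

```python
import pandas as pd

# Hypothetical data with a few missing values
toy = pd.DataFrame(
    {
        "name": ["Song A", "Song B", None, "Song D"],
        "tempo": [120.0, None, None, 98.5],
    }
)

# Number of nulls per column (same call as in the cell above)
null_counts = toy.isnull().sum()

# Fraction of rows that are null in each column
null_share = toy.isnull().mean()

print(null_counts)
print(null_share)
```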


In [11]:
# Count duplicated rows
df_banda.duplicated().sum()

np.int64(6)

In [12]:
# Show the duplicated rows
df_banda[df_banda.duplicated()]

Unnamed: 0,name,album,release_date,track_number,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,duration_ms
928,Some Girls - Remastered,Some Girls (Deluxe Version),1978-06-09,4,0.527,0.474,0.938,0.52,0.299,-2643.0,0.0898,71995.0,0.505,21,276933
929,Lies - Remastered,Some Girls (Deluxe Version),1978-06-09,5,0.437,0.382,0.997,0.95,0.617,-1568.0,0.188,162428.0,0.563,16,191266
935,Claudine,Some Girls (Deluxe Version),1978-06-09,1,0.0144,0.439,0.977,0.0221,0.383,-4386.0,0.128,105124.0,0.364,17,222253
939,No Spare Parts,Some Girls (Deluxe Version),1978-06-09,5,0.24,0.594,0.762,1.5e-05,0.712,-5145.0,0.0292,72648.0,0.54,19,270466
940,Don't Be A Stranger,Some Girls (Deluxe Version),1978-06-09,6,0.061,0.72,0.867,0.0297,0.385,-5871.0,0.039,127329.0,0.847,15,246266
946,Petrol Blues,Some Girls (Deluxe Version),1978-06-09,12,0.769,0.835,0.621,0.114,0.116,-8007.0,0.0406,115.87,0.336,13,95626


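In this case the duplicated rows do not need to be dropped, because they refer to different songs. Note that `duplicated()` flags a row only when every column matches an earlier row, so it is worth looking at the matching earlier rows before deciding.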
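By default `duplicated()` marks only the later copies, so the table above does not show the rows they match. A minimal sketch of how to see every copy side by side, using `keep=False` on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly; row 3 differs in the album
toy = pd.DataFrame(
    {
        "name": ["Song A", "Song B", "Song A", "Song A"],
        "album": ["Album X", "Album X", "Album X", "Album Y"],
        "track_number": [5, 1, 5, 5],
    }
)

# Default behaviour: only the later copy is flagged
flagged = toy[toy.duplicated()]

# keep=False flags every member of a duplicated group
all_copies = toy[toy.duplicated(keep=False)]

print(flagged)
print(all_copies.sort_values(["name", "album", "track_number"]))
```

Sorting the `keep=False` result puts matching rows next to each other, which makes it easier to judge whether they are genuine repeats or separate tracks.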

### Exploratory Analysis

In [14]:
# Get descriptive statistics
df_banda.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
release_date,1610.0,1992-04-23 12:57:14.534161536,1964-04-16 00:00:00,1970-09-04 00:00:00,1986-03-24 00:00:00,2017-12-01 00:00:00,2022-06-10 00:00:00,
track_number,1610.0,8.613665,1.0,4.0,7.0,11.0,47.0,6.56022
acousticness,1610.0,0.250475,0.000009,0.05835,0.183,0.40375,0.994,0.227397
danceability,1610.0,0.46886,0.104,0.36225,0.458,0.578,0.887,0.141775
energy,1610.0,0.792352,0.141,0.674,0.8485,0.945,0.999,0.179886
instrumentalness,1610.0,0.16417,0.0,0.000219,0.01375,0.179,0.996,0.276249
liveness,1610.0,0.49173,0.0219,0.153,0.3795,0.89375,0.998,0.3491
loudness,1610.0,-6406.640075,-24408.0,-8829.5,-6179.0,-4254.75,-2.31,3474.285941
speechiness,1610.0,0.069512,0.0232,0.0365,0.0512,0.0866,0.624,0.051631
tempo,1610.0,114078.725261,65.99,98996.5,120319.0,140853.75,216304.0,46196.602233
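The `loudness` and `tempo` rows above mix very different magnitudes (for example, a `tempo` minimum of 65.99 against a median of 120319.0), which may point to an inconsistency in how the source file was recorded. A minimal sketch of how such values could be flagged for review follows. The `TEMPO_MAX` threshold is an assumption of this example, not something stated in the dataset, and `toy` holds only a few `tempo` values copied from the outputs above.

```python
import pandas as pd

# A few tempo values copied from the outputs above
toy = pd.DataFrame({"tempo": [118001.0, 131455.0, 115.87, 65.99]})

# Assumed upper bound for a plausible tempo; adjust to your own judgment
TEMPO_MAX = 300

# Rows whose tempo falls outside the assumed range
suspicious = toy[toy["tempo"] > TEMPO_MAX]
share_suspicious = len(suspicious) / len(toy)

print(suspicious)
print(f"{share_suspicious:.0%} of rows exceed the assumed bound")
```

The sketch only flags values; whether and how to correct them depends on how the file was produced.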


In [19]:
# Convert the duration_ms column to minutes for easier reading
df_banda['Duracao_minutos'] = df_banda['duration_ms'] / 60000
df_banda.head()

Unnamed: 0,name,album,release_date,track_number,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,duration_ms,Duracao_minutos
0,Concert Intro Music - Live,Licked Live In NYC,2022-06-10,1,0.0824,0.463,0.993,0.996,0.932,-12913.0,0.11,118001.0,0.0302,33,48640,0.810667
1,Street Fighting Man - Live,Licked Live In NYC,2022-06-10,2,0.437,0.326,0.965,0.233,0.961,-4803.0,0.0759,131455.0,0.318,34,253173,4.21955
2,Start Me Up - Live,Licked Live In NYC,2022-06-10,3,0.416,0.386,0.969,0.4,0.956,-4936.0,0.115,130066.0,0.313,34,263160,4.386
3,If You Can't Rock Me - Live,Licked Live In NYC,2022-06-10,4,0.567,0.369,0.985,0.000107,0.895,-5535.0,0.193,132994.0,0.147,32,305880,5.098
4,Don’t Stop - Live,Licked Live In NYC,2022-06-10,5,0.4,0.303,0.969,0.0559,0.966,-5098.0,0.093,130533.0,0.206,32,305106,5.0851
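Dividing by 60000 works because one minute is 60 × 1000 milliseconds, so 253173 ms becomes 4.21955 minutes. Decimal minutes are not the same as minutes and seconds, though: 4.21955 minutes is about 4:13, not 4:21. The sketch below adds a minutes:seconds label; the helper `to_mm_ss` and the column `duracao_mm_ss` are hypothetical names introduced for this example.

```python
import pandas as pd

# First three duration_ms values from the head() output above
toy = pd.DataFrame({"duration_ms": [48640, 253173, 263160]})

# Same conversion as in the cell above: 1 minute = 60,000 ms
toy["Duracao_minutos"] = toy["duration_ms"] / 60000


def to_mm_ss(ms) -> str:
    """Format a duration in milliseconds as M:SS, discarding leftover ms."""
    total_seconds = int(ms) // 1000
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


toy["duracao_mm_ss"] = toy["duration_ms"].apply(to_mm_ss)
print(toy)
```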


In [20]:
# Get descriptive statistics again
df_banda.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
release_date,1610.0,1992-04-23 12:57:14.534161536,1964-04-16 00:00:00,1970-09-04 00:00:00,1986-03-24 00:00:00,2017-12-01 00:00:00,2022-06-10 00:00:00,
track_number,1610.0,8.613665,1.0,4.0,7.0,11.0,47.0,6.56022
acousticness,1610.0,0.250475,0.000009,0.05835,0.183,0.40375,0.994,0.227397
danceability,1610.0,0.46886,0.104,0.36225,0.458,0.578,0.887,0.141775
energy,1610.0,0.792352,0.141,0.674,0.8485,0.945,0.999,0.179886
instrumentalness,1610.0,0.16417,0.0,0.000219,0.01375,0.179,0.996,0.276249
liveness,1610.0,0.49173,0.0219,0.153,0.3795,0.89375,0.998,0.3491
loudness,1610.0,-6406.640075,-24408.0,-8829.5,-6179.0,-4254.75,-2.31,3474.285941
speechiness,1610.0,0.069512,0.0232,0.0365,0.0512,0.0866,0.624,0.051631
tempo,1610.0,114078.725261,65.99,98996.5,120319.0,140853.75,216304.0,46196.602233


In [26]:
# Compute the average track duration in minutes for each album
df_banda.groupby('album')['Duracao_minutos'].mean()

album,Duracao_minutos
12 X 5,2.682068
12 x 5,2.682772
A Bigger Bang (2009 Re-Mastered),4.016356
A Bigger Bang (Live),5.176630
Aftermath,3.899185
...,...
Undercover,4.492750
Undercover (2009 Re-Mastered),4.497595
Voodoo Lounge (Remastered 2009),4.118779
Voodoo Lounge Uncut (Live),5.248507
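The output above lists `12 X 5` and `12 x 5` as separate groups because `groupby` treats strings that differ only in letter case as different keys. If these turn out to be the same album (an assumption to check against the data, since they could also be distinct editions), a minimal sketch of grouping on a normalized key looks like this. The column `album_key` and the duration values are hypothetical.

```python
import pandas as pd

# Hypothetical durations; album names mirror the casing issue seen above
toy = pd.DataFrame(
    {
        "album": ["12 X 5", "12 x 5", "12 X 5", "Aftermath"],
        "Duracao_minutos": [2.5, 3.0, 3.5, 4.0],
    }
)

# Grouping on the raw name keeps the two spellings apart
raw_means = toy.groupby("album")["Duracao_minutos"].mean()

# Normalize whitespace and case into a separate key, leaving the original intact
toy["album_key"] = toy["album"].str.strip().str.casefold()
merged_means = toy.groupby("album_key")["Duracao_minutos"].mean()

print(raw_means)
print(merged_means)
```

Keeping the normalized value in its own column preserves the original album title for display.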


In [42]:
# Which songs have the shortest and the longest duration in minutes?
menor_duracao = df_banda['Duracao_minutos'].min()
maior_duracao = df_banda['Duracao_minutos'].max()

Musica_menor_duracao = df_banda[df_banda['Duracao_minutos'] == menor_duracao]['name'].iloc[0]
Musica_maior_duracao = df_banda[df_banda['Duracao_minutos'] == maior_duracao]['name'].iloc[0]

# Store the albums in variables too: reusing single quotes inside a
# single-quoted f-string is a SyntaxError before Python 3.12
Album_menor_duracao = df_banda[df_banda['Duracao_minutos'] == menor_duracao]['album'].iloc[0]
Album_maior_duracao = df_banda[df_banda['Duracao_minutos'] == maior_duracao]['album'].iloc[0]

print(f'The shortest Rolling Stones song is {Musica_menor_duracao} from the album {Album_menor_duracao}, at {menor_duracao} minutes.')
print(f'The longest Rolling Stones song is {Musica_maior_duracao} from the album {Album_maior_duracao}, at {maior_duracao} minutes.')

# You can either create variables to locate the value or put the expression directly inside print.
# Creating the variable is good practice.


The shortest Rolling Stones song is Show Intro - Live from the album Live 1965: Music From Charlie Is My Darling (Live From England/1965), at 0.35 minutes.
The longest Rolling Stones song is Miss You - Live from the album Bridges To Bremen (Live), at 16.364433333333334 minutes.
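The same lookup can be written without filtering the frame repeatedly. The sketch below is one possible alternative, not the notebook's own method: `idxmin` and `idxmax` return the index label of the extreme row, and `loc` then retrieves the whole row at once. The `toy` frame is hypothetical; its album names and the middle row are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for df_banda with placeholder album names
toy = pd.DataFrame(
    {
        "name": ["Show Intro - Live", "Start Me Up - Live", "Miss You - Live"],
        "album": ["Album A", "Album B", "Album C"],
        "Duracao_minutos": [0.35, 4.386, 16.36],
    }
)

# Each lookup returns the full row (a Series) for the extreme value
shortest = toy.loc[toy["Duracao_minutos"].idxmin()]
longest = toy.loc[toy["Duracao_minutos"].idxmax()]

# :.2f rounds the duration to two decimals for display only
print(f"Shortest: {shortest['name']} ({shortest['album']}), {shortest['Duracao_minutos']:.2f} min")
print(f"Longest: {longest['name']} ({longest['album']}), {longest['Duracao_minutos']:.2f} min")
```

When several rows tie for the minimum or maximum, `idxmin` and `idxmax` return the first one, which matches the `.iloc[0]` behavior of the original cell.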
