# Setup 💻 (do not forget to run these cells before starting 👇)

## 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Loading Spotify dataset

In [None]:
df = pd.read_csv('spotify-dataset.csv')
df

----------

# Your Assignment starts here 🔥

Note that **there are no solutions** to the exercises below. However, these questions are very similar to the exercise you did in the last class, **Exploring the AirBnB dataset**.

So make sure to keep that one open in another tab, so you can easily look back at your previous answers 🙌 And of course the internet, [Pandas guide](https://pandas.pydata.org/docs/user_guide/index.html) and the teachers are here to help!

## Can you display the first 10 rows of the dataset?

In [None]:
# Your code goes here 💪

## How many songs are there in the dataset?

In [None]:
# Your code goes here 💪

## How many artists?

In [7]:
# Your code goes here 💪

## What's the average duration of a song, in minutes?

**Note** that the column is called `duration_ms`, so the duration is in milliseconds.
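
As a quick reminder of the units, one minute is 60,000 milliseconds. The sketch below converts a small made-up Series; the name `toy_durations_ms` and its values are illustrative assumptions and are not taken from the dataset:

```python
import pandas as pd

# Made-up durations in milliseconds (not real dataset values)
toy_durations_ms = pd.Series([180_000, 240_000, 210_000])

# 1 minute = 60 seconds * 1000 milliseconds = 60_000 ms
toy_durations_min = toy_durations_ms / 60_000

print(toy_durations_min.tolist())  # [3.0, 4.0, 3.5]
```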

In [None]:
# Your code goes here 💪

## Can you plot the distribution of the durations?

In [None]:
# Your code goes here 💪

## [Follow-up] Can you make this plot more readable by removing outliers (extreme values)?

[Boolean filtering](https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/) will help here! Make sure to check back with the AirBnB challenge on how we dealt with outliers there.

**Note** that we only remove around 470 songs (out of almost 170k) and get a much more insightful plot!
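
As a refresher, boolean filtering builds a `True`/`False` mask and keeps only the rows where the mask is `True`. Here is a minimal sketch on a made-up DataFrame; the `toy` data and the 600,000 ms cut-off are arbitrary assumptions for illustration, so pick your own threshold for the real dataset:

```python
import pandas as pd

# Made-up data: four ordinary songs and one extreme value
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "duration_ms": [200_000, 210_000, 190_000, 3_600_000, 205_000],
})

# Hypothetical cut-off, chosen only for this toy example
cutoff_ms = 600_000

# The comparison returns a boolean Series (the "mask")
mask = toy["duration_ms"] < cutoff_ms

# Indexing with the mask keeps only the rows where it is True
filtered = toy[mask]

removed = len(toy) - len(filtered)
print(f"Removed {removed} row(s), kept {len(filtered)}")
```

Counting how many rows were removed, as in the last two lines, is a handy check that the filter is not throwing away too much data.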

In [73]:
# Your code goes here 💪

**Tip:** you can add the `bins` parameter to your plot function to change the number of bars in your graph. For example:

`sns.histplot(df['duration_ms'], bins=30)`

(`sns.distplot`, which you may have seen in older notebooks, is deprecated in recent Seaborn versions.)

Try to change the number of `bins` and see how that affects the readability of your graph.
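
To get a feeling for what `bins` does, it can help to look at the counting step on its own. The sketch below uses NumPy's `np.histogram` on randomly generated, made-up values (not the Spotify data) to show how the value range is cut into equal-width intervals; a histogram plot then draws one bar per interval:

```python
import numpy as np

# Made-up values, generated with a fixed seed so the run is repeatable
rng = np.random.default_rng(seed=0)
toy_values = rng.normal(loc=200, scale=30, size=1_000)

for n_bins in (5, 30):
    # counts: number of values per interval; edges: the interval boundaries
    counts, edges = np.histogram(toy_values, bins=n_bins)
    bar_width = edges[1] - edges[0]
    print(f"{n_bins} bins -> {len(counts)} bars, each about {bar_width:.1f} wide")
```

More bins mean narrower bars and more detail, but also more noise, so it is worth trying a few values.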

## What are the top 10 longest songs (duration)?

In [70]:
# Your code goes here 💪

## What are the top 10 most popular songs? We would like a DataFrame with only the columns `artists`, `name` and `popularity` as output!

In [None]:
# Your code goes here 💪

# You get a call from Daniel Ek, Spotify CEO ☎️

<div style="display: flex; margin-top: 20px;">
    <p style="font-style: italic; padding: 30px;">Hi there! Excited to have you working on analysing our data and looking forward to your findings! 👍 <br> I was hoping you could help me answer some of the questions below. It could help guide the business on our artist selection strategy.</p>
    <img src="./photo.jpeg" width=300/>
</div>


## First of all, let's create a new DataFrame with only the post-2000 songs

In [74]:
# Your code here

<details><summary>Solution</summary>

```python
post2000 = df[df['year'] > 1999]
```

</details>

## What are the top 10 artists in terms of number of songs in this period?

**Tip**: this can be calculated with `.value_counts()`, or with `.groupby()` followed by the [.count()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.count.html) method, a cousin of the `.mean()` and `.sum()` methods.

As a reminder, here's the typical `.groupby()` syntax:

```python
df.groupby('COLUMN_NAME')[['COLUMN_TO_MEASURE']].some_function()
```

Selecting the columns before aggregating means Pandas only processes those columns, which also avoids errors when other columns contain text.

For example:

```python
df.groupby('popularity')[['artists']].count()
df.groupby('year')[['duration_ms']].sum()
```

Or if you want to see **multiple columns**:

```python
df.groupby('popularity')[['artists', 'key']].count()
df.groupby('year')[['duration_ms', 'loudness']].sum()
```

Pandas documentation has a [thorough guide on using .groupby()](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html). It might be tricky to read, but you can find lots of insight here!
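
To see how `.value_counts()` and `.groupby()` relate, here is a small sketch on a made-up DataFrame. The `toy` data is invented for illustration and only borrows the column names from this notebook:

```python
import pandas as pd

# Made-up data: three artists with a few songs each
toy = pd.DataFrame({
    "artists": ["A", "B", "A", "C", "A", "B"],
    "name": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "popularity": [50, 70, 60, 40, 70, 90],
})

# Number of rows per artist, sorted from most to least frequent
counts_vc = toy["artists"].value_counts()

# Same counts via groupby, sorted by artist name instead
counts_gb = toy.groupby("artists")["name"].count()

# Average popularity per artist (a numeric aggregation)
mean_pop = toy.groupby("artists")[["popularity"]].mean()

print(counts_vc)
print(counts_gb)
print(mean_pop)
```

Both approaches give the same numbers; they mainly differ in how the result is sorted.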

In [132]:
# Your code goes here 💪

<details><summary>Solution</summary>

```python
post2000['artists'].value_counts().head(10)
```
or

```python
artist_songs = post2000.groupby('artists').count()[['name']]
artist_songs.sort_values('name').tail(10)
```
</details>


### Make sure to make your findings visual 📊 You're not going to send Daniel your Python code, are you? 👀

In [106]:
# Your code here



<details><summary>Solution</summary>

```python
top10artists = post2000['artists'].value_counts().head(10)
top10artists.plot(kind='bar')
```
    
</details>


## [Follow-up] Do the same artists often get the highest popularity? Let's check the 10 most popular artists and see if the artists above often appear in the list

**Tip:** since `popularity` is a _numerical_ column, we can calculate the `.mean()` of the popularity, grouped by `artists`.

In [87]:
# Your code goes here 💪

Explain your findings to Spotify CEO here 💪

<details><summary>Solution</summary>


```python
post2000.groupby('artists')[['popularity']].mean().sort_values('popularity').tail(10)
```

This would be a one-liner solution, but remember that you can also break your code up into multiple lines to make it more readable:

```python
artist_popularity = post2000.groupby('artists')[['popularity']].mean()
artist_popularity.sort_values('popularity').tail(10)
```

We take the `tail()` because `sort_values()` sorts from lowest to highest by default.
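
If you would rather see the highest values first, a few equivalent options exist. The sketch below compares them on a made-up Series (`toy_scores` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up average popularity per artist
toy_scores = pd.Series({"A": 60.0, "B": 80.0, "C": 40.0, "D": 75.0})

# Default order is ascending, so the largest values sit at the end
lowest_to_highest_top2 = toy_scores.sort_values().tail(2)

# Sorting in descending order puts the largest values at the start
highest_first_top2 = toy_scores.sort_values(ascending=False).head(2)

# nlargest() is a shortcut that also returns the largest values first
nlargest_top2 = toy_scores.nlargest(2)

print(list(lowest_to_highest_top2.index))  # ['D', 'B']
print(list(highest_first_top2.index))      # ['B', 'D']
print(list(nlargest_top2.index))           # ['B', 'D']
```

All three select the same rows; only the display order differs.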
    
</details>


## Can you visually check how `popularity` changes with `year`?

**Tip:** a **scatterplot** would be great to visualize a relationship between two numeric columns.

In [109]:
# Your code goes here 

Explain your findings to Spotify CEO here 💪



<details><summary>Solution</summary>

```python
sns.scatterplot(data=post2000, x='year', y='popularity')
```
    
</details>


**Tip**: even though you can probably already see a trend, you can use the Seaborn `regplot` to draw a trend line. Give it a try 👇

In [117]:
sns.regplot(data=post2000, x='year', y='popularity')
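
If you want to put a number on the trend, you can fit a straight line yourself. By default `regplot` draws a linear fit, and NumPy's `np.polyfit` with `deg=1` computes a least-squares line of the same kind. The sketch below runs it on made-up, perfectly linear data; the `toy` values are assumptions, not results from the Spotify dataset:

```python
import numpy as np
import pandas as pd

# Made-up data where popularity rises by exactly 1 point per year
toy = pd.DataFrame({
    "year": [2000, 2005, 2010, 2015, 2020],
    "popularity": [40, 45, 50, 55, 60],
})

# Fit popularity = slope * year + intercept (deg=1 means a straight line)
slope, intercept = np.polyfit(
    toy["year"].to_numpy(), toy["popularity"].to_numpy(), deg=1
)

print(f"Popularity changes by about {slope:.2f} points per year")
```

A positive slope means popularity tends to rise with the year, a negative one that it tends to fall.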

## Most importantly - what makes awesome artists? 🤔

You have many interesting features in the data, covering different aspects of a song, like its `speechiness`, `danceability`, `energy` and many more.

I want to know what differentiates top artists from the least popular ones. What kind of artists should we look for? 👩‍🎤

In [None]:
# Tip: do this step by step 💪
# 1. Create a DataFrame from the Top 10 most popular artists 
# (similar to what you did above, but note we want a DataFrame with all the data, not just one column this time)



<details><summary>Solution</summary>

```python
top10 = post2000.groupby('artists').mean(numeric_only=True).sort_values('popularity').tail(10)
```
    
</details>


In [122]:
# 2. Create a DataFrame from the 10 least popular artists 
# (you should be able to reuse the code above, and just take 10 artists from the opposite end ;))

<details><summary>Solution</summary>

```python
worst10 = post2000.groupby('artists').mean(numeric_only=True).sort_values('popularity').head(10)
```
    
</details>

In [None]:
# 3. Calculate the mean() of the different features for each DataFrame
# (you can use the .mean() function on the whole DataFrame in one go!)

In [None]:
# Do it in separate cells, so you see both results side by side

<details><summary>Solution</summary>

```python
top10.mean()
worst10.mean()
```
    
</details>
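
If flipping between two cells gets tedious, one option is to put both sets of means into a single table. This is only a sketch on made-up DataFrames; `toy_top`, `toy_bottom` and their values are assumptions standing in for your own two DataFrames:

```python
import pandas as pd

# Made-up feature values for two small groups of artists
toy_top = pd.DataFrame({"danceability": [0.8, 0.7], "energy": [0.6, 0.8]})
toy_bottom = pd.DataFrame({"danceability": [0.4, 0.5], "energy": [0.3, 0.5]})

# Place the two Series of column means next to each other
comparison = pd.concat(
    [toy_top.mean(), toy_bottom.mean()],
    axis=1,
    keys=["top", "bottom"],
)

# A difference column makes the largest gaps easy to spot
comparison["difference"] = comparison["top"] - comparison["bottom"]

print(comparison)
```

Sorting the table by the `difference` column is a quick way to find the features that separate the two groups most.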

In [124]:
# 4. Analyze! Compare the two results and tell Daniel what differences you notice :)

Your findings and recommendations to Spotify CEO here 💪