## Exercise: Pandas & NumPy with Spotify Dataset
Use pandas and NumPy to analyze, filter, manipulate, and visualize data from the Spotify 2023 dataset.

### Task 1: Data exploration and cleaning
1. Load the dataset in pandas.
2. Check for missing values and handle them:
    - Replace missing values in the "key" column with "Unknown".
    - Fill missing values in "in_shazam_charts" with 0.
3. Filter the dataset:
    - Extract all tracks from 2023 that have been in Spotify Charts at least 50 times.
    - Save this subset as "popular_tracks_2023.csv".

In [21]:
import pandas as pd

spotify_df = pd.read_csv("spotify-2023.csv", encoding_errors="ignore")
spotify_df.head()

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,...,bpm,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,Seven (feat. Latto) (Explicit Ver.),"Latto, Jung Kook",2,2023,7,14,553,147,141381703,43,...,125,B,Major,80,89,83,31,0,8,4
1,LALA,Myke Towers,1,2023,3,23,1474,48,133716286,48,...,92,C#,Major,71,61,74,7,0,10,4
2,vampire,Olivia Rodrigo,1,2023,6,30,1397,113,140003974,94,...,138,F,Major,51,32,53,17,0,31,6
3,Cruel Summer,Taylor Swift,1,2019,8,23,7858,100,800840817,116,...,170,A,Major,55,58,72,11,0,11,15
4,WHERE SHE GOES,Bad Bunny,1,2023,5,18,3133,50,303236322,84,...,144,A,Minor,65,23,80,14,63,11,6


In [24]:
spotify_df.isnull().sum()
spotify_df.fillna({"key": "Unknown", "in_shazam_charts": 0}, inplace=True)
spotify_df.isnull().sum()

track_name              2
artist(s)_name          0
artist_count            0
released_year           0
released_month          0
released_day            0
in_spotify_playlists    0
in_spotify_charts       0
streams                 0
in_apple_playlists      0
in_apple_charts         0
in_deezer_playlists     0
in_deezer_charts        0
in_shazam_charts        0
bpm                     0
key                     0
mode                    0
danceability_%          0
valence_%               0
energy_%                0
acousticness_%          0
instrumentalness_%      0
liveness_%              0
speechiness_%           0
dtype: int64
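
The cleaning step can also be written without `inplace=True`, which keeps the raw frame around for comparison. The following is a minimal sketch on a made-up three-row frame (`toy_df` and its values are assumptions for illustration, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Spotify data; the values are invented for illustration.
toy_df = pd.DataFrame({
    "track_name": ["a", "b", "c"],
    "key": ["C#", np.nan, "F"],
    "in_shazam_charts": [12.0, np.nan, 3.0],
})

# Count missing values per column before cleaning.
missing_before = toy_df.isnull().sum()

# fillna with a dict applies a separate fill value to each named column
# and returns a new DataFrame, leaving toy_df unchanged.
cleaned_df = toy_df.fillna({"key": "Unknown", "in_shazam_charts": 0})
missing_after = cleaned_df.isnull().sum()

print(missing_before)
print(missing_after)
```

Comparing `missing_before` with `missing_after` confirms that only the two named columns were touched.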

In [31]:
spotify_charts = spotify_df[(spotify_df["released_year"] == 2023) & (spotify_df["in_spotify_charts"] >= 50)]
print(spotify_charts)

spotify_charts.to_csv("popular_tracks_2023.csv", index=False)
print(spotify_charts.head())


                                            track_name  \
0                  Seven (feat. Latto) (Explicit Ver.)   
2                                              vampire   
4                                       WHERE SHE GOES   
5                                             Sprinter   
6                                      Ella Baila Sola   
8                                             fukumean   
11                                           Super Shy   
12                                             Flowers   
13                                            Daylight   
16                                   Cupid - Twin Ver.   
17   What Was I Made For? [From The Motion Picture ...   
19                                          Like Crazy   
24   Popular (with Playboi Carti & Madonna) - The I...   
29             Dance The Night (From Barbie The Album)   
30                                                Rush   
44    Barbie World (with Aqua) [From Barbie The Album]   
51            
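
The filter above combines two conditions in one expression. A small sketch, assuming a made-up four-row frame (`toy_df` is hypothetical), shows the same idea with each condition named separately, which can make the logic easier to check:

```python
import pandas as pd

# Toy frame with invented values; only the column names match the dataset.
toy_df = pd.DataFrame({
    "track_name": ["a", "b", "c", "d"],
    "released_year": [2023, 2023, 2019, 2023],
    "in_spotify_charts": [147, 48, 100, 50],
})

# Build each condition as a boolean Series, then combine with & (element-wise AND).
is_2023 = toy_df["released_year"] == 2023
charted_50_plus = toy_df["in_spotify_charts"] >= 50
popular_2023 = toy_df[is_2023 & charted_50_plus]

# to_csv without a path returns the CSV text, which allows a quick check
# of the content before writing a file to disk.
csv_text = popular_2023.to_csv(index=False)
print(csv_text)
```

Track "d" is kept because "at least 50" includes exactly 50, while "b" (48 chart appearances) and "c" (released in 2019) are dropped.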

### Task 2: Statistical analysis and aggregation
1. Calculate basic statistics:
    - Find the average BPM (tempo) by key.
    - Find the average energy level for songs with more than 100 million streams.
2. Sort the dataset:
    - Find the top 10 most streamed songs.
    - Find the 5 least danceable songs.
3. Group the dataset:
    - Count how many tracks belong to each mode (Major/Minor).

In [48]:
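print(spotify_df.columns)

# Grouping by both "bpm" and "key" puts every distinct bpm value in its own group,
# so the mean just echoes that bpm back; this is why the printed result looks wrong.
Bpm_mean = spotify_df.groupby(["bpm", "key"])["bpm"].mean()
print(Bpm_mean)

# For the average BPM per key, group by "key" alone.
bpm_by_key = spotify_df.groupby("key")["bpm"].mean()

# Convert "streams" to numeric; values that cannot be parsed become NaN.
spotify_df["streams"] = pd.to_numeric(spotify_df["streams"], errors="coerce")

# Average energy of songs with more than 100 million streams.
high_stream_songs = spotify_df[spotify_df["streams"] > 100_000_000]
average_songs = high_stream_songs["energy_%"].mean()
print(average_songs)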

Index(['track_name', 'artist(s)_name', 'artist_count', 'released_year',
       'released_month', 'released_day', 'in_spotify_playlists',
       'in_spotify_charts', 'streams', 'in_apple_playlists', 'in_apple_charts',
       'in_deezer_playlists', 'in_deezer_charts', 'in_shazam_charts', 'bpm',
       'key', 'mode', 'danceability_%', 'valence_%', 'energy_%',
       'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%'],
      dtype='object')
bpm  key   
65   A          65.0
     Unknown    65.0
67   C#         67.0
71   C#         71.0
     Unknown    71.0
               ...  
202  A#        202.0
     G         202.0
204  F#        204.0
206  G         206.0
     Unknown   206.0
Name: bpm, Length: 599, dtype: float64
64.1125
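
To see why the grouping columns matter for "average BPM by key", here is a minimal sketch on a made-up four-row frame (`toy_df` and its values are assumptions for illustration). It contrasts grouping by `key` alone with grouping by both `bpm` and `key`:

```python
import pandas as pd

# Toy frame with invented values.
toy_df = pd.DataFrame({
    "key": ["A", "A", "C#", "C#"],
    "bpm": [100, 120, 90, 110],
})

# One group per key: the mean is taken over every track in that key.
bpm_by_key = toy_df.groupby("key")["bpm"].mean()

# One group per (bpm, key) pair: each group holds a single bpm value,
# so the "mean" simply repeats it.
bpm_by_bpm_and_key = toy_df.groupby(["bpm", "key"])["bpm"].mean()

print(bpm_by_key)
print(bpm_by_bpm_and_key)
```

The first result has one row per key, while the second has one row per distinct (bpm, key) combination, which matches the shape of the long output printed above.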


In [50]:
print(spotify_df.sort_values(by="streams", ascending=False).head(10))
print(spotify_df.sort_values(by="danceability_%", ascending=True).head(5))

                                        track_name  \
55                                 Blinding Lights   
179                                   Shape of You   
86                               Someone You Loved   
620                                   Dance Monkey   
41   Sunflower - Spider-Man: Into the Spider-Verse   
162                                      One Dance   
84                       STAY (with Justin Bieber)   
140                                       Believer   
725                                         Closer   
48                                         Starboy   

                   artist(s)_name  artist_count  released_year  \
55                     The Weeknd             1           2019   
179                    Ed Sheeran             1           2017   
86                  Lewis Capaldi             1           2018   
620                   Tones and I             1           2019   
41          Post Malone, Swae Lee             2           2018   
162      


In [54]:
mode_counts = spotify_df.groupby("mode")["track_name"].count()
print(mode_counts)

mode
Major    549
Minor    402
Name: track_name, dtype: int64
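
One detail worth checking here: `count()` only counts non-missing values in the selected column, so tracks with a missing `track_name` are left out of the totals. The sketch below uses a made-up frame (`toy_df` is hypothetical) to compare it with `size()` and `value_counts()`, which count rows regardless of missing names:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing track name; values are invented.
toy_df = pd.DataFrame({
    "track_name": ["a", np.nan, "c", "d"],
    "mode": ["Major", "Major", "Minor", "Major"],
})

# count() skips the row whose track_name is NaN.
counted_names = toy_df.groupby("mode")["track_name"].count()

# size() counts every row in each group.
group_sizes = toy_df.groupby("mode").size()

# value_counts() on the column itself gives the same row counts.
mode_counts = toy_df["mode"].value_counts()

print(counted_names)
print(group_sizes)
print(mode_counts)
```

Which variant is appropriate depends on whether tracks without a name should be included in the tally.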


### Task 3: Feature engineering
1. Create a new variable "track_popularity" using the following logic:
    - "Super Hit" if streams > 500M.
    - "Hit" if streams between 100M and 500M.
    - "Moderate" if streams between 50M and 100M.
    - "Less Popular" otherwise.
2. Save the modified dataset with the new column as "track_popularity_data.csv".

In [57]:
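def classify_track_popularity(streams):
    # Thresholds follow the task statement. Missing stream counts (NaN) fail
    # every comparison and therefore fall through to "Less Popular".
    if streams > 500_000_000:
        return "Super Hit"
    elif streams >= 100_000_000:
        return "Hit"
    elif streams >= 50_000_000:
        return "Moderate"
    else:
        return "Less Popular"

# Create the new column before saving, and give the output file its .csv extension.
spotify_df["track_popularity"] = spotify_df["streams"].apply(classify_track_popularity)
spotify_df.to_csv("track_popularity_data.csv", index=False)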
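
Since the exercise also covers NumPy, the same tiers can be assigned without a row-by-row function by using `np.select`. This is a sketch of an alternative, not the required solution. The sample stream counts are made up, and treating exactly 500M as "Hit" and exactly 100M as "Hit" is an assumption about how the task's boundaries should be read:

```python
import numpy as np
import pandas as pd

# Invented stream counts, including one missing value.
streams = pd.Series([800_840_817, 141_381_703, 60_000_000, 1_000_000, np.nan])

# Conditions are checked in order; the first one that is True wins.
conditions = [
    streams > 500_000_000,
    streams >= 100_000_000,
    streams >= 50_000_000,
]
labels = ["Super Hit", "Hit", "Moderate"]

# Rows matching no condition (including NaN) receive the default label.
track_popularity = pd.Series(
    np.select(conditions, labels, default="Less Popular"),
    index=streams.index,
)
print(track_popularity)
```

Because the conditions are evaluated in order, each tier only needs its lower bound.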

### Task 4: Unique Task
Each student must create their own unique variable in the dataset. Choose one approach:
1. Assign a playlist ranking (playlist_rank):
    - Generate a random rank between 1 and 100 for each track using np.random.randint().
2. Create an emotional category (mood_category):
    - Categorize songs based on valence (happiness) and energy.
      - "Energetic & Happy" if valence > 60 and energy > 70.
      - "Calm & Happy" if valence > 60 and energy <= 70.
      - "Sad" if valence < 40.
      - "Neutral" otherwise.
3. Define a personal popularity score (custom_popularity_score):
    - Use a custom formula (e.g., (streams / bpm) * danceability_%).

In [None]:
import numpy as np
spotify_df["playlist_rank"] = np.random.randint(1, 101, size=len(spotify_df))

def categorize_mood(valence, energy):
    if valence > 60 and energy > 70:
        return "Energetic & Happy"
    elif valence > 60 and energy <= 70:
        return "Calm & Happy"
    elif valence < 40:
        return "Sad"
    else:
        return "Neutral"

# The dataset's columns are named "valence_%" and "energy_%".
spotify_df["mood_category"] = spotify_df.apply(lambda x: categorize_mood(x["valence_%"], x["energy_%"]), axis=1)
spotify_df["custom_popularity_score"] = (spotify_df["streams"] / spotify_df["bpm"]) * spotify_df["danceability_%"]
print(spotify_df[["track_name", "playlist_rank", "mood_category", "custom_popularity_score"]].head())
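
Two optional refinements are sketched below on a made-up four-row frame (`toy_df` is hypothetical). First, a seeded `np.random.default_rng` generator is swapped in for `np.random.randint()` so the random ranks are the same on every run. Second, the mood rules are expressed with `np.select` instead of a row-wise `apply`. The seed value 42 is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names match the dataset.
toy_df = pd.DataFrame({
    "valence_%": [89, 61, 32, 50],
    "energy_%": [83, 70, 53, 80],
})

# A seeded Generator makes the random ranks repeatable between runs.
# integers(1, 101) draws from 1 to 100 because the upper bound is exclusive.
rng = np.random.default_rng(seed=42)
toy_df["playlist_rank"] = rng.integers(1, 101, size=len(toy_df))

# Mood rules from the task, checked in order; unmatched rows become "Neutral".
happy = toy_df["valence_%"] > 60
energetic = toy_df["energy_%"] > 70
toy_df["mood_category"] = np.select(
    [happy & energetic, happy & ~energetic, toy_df["valence_%"] < 40],
    ["Energetic & Happy", "Calm & Happy", "Sad"],
    default="Neutral",
)
print(toy_df)
```

The second row lands in "Calm & Happy" because an energy of exactly 70 does not satisfy `energy > 70`.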
