# Moosic EDA :: Iteration v1



## Importing required libraries

* numpy
* pandas
* seaborn
* matplotlib
* scikit-learn

In [None]:
# IMPORT LIBRARIES
import warnings

try:
    import numpy as np
    import pandas as pd

    # visualisation
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split

except ImportError as error:
    print(f"Required dependencies are missing: {error}")

    %pip install numpy
    %pip install pandas
    %pip install seaborn
    %pip install matplotlib
    %pip install scikit-learn

    print("Dependencies installed. Re-run this cell to import them.")

# Suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')




Import the Datasets

In [None]:
df_artists = pd.read_csv('../.data/NB_03_artists.csv', low_memory=False)
df_tracks = pd.read_csv('../.data/NB_03_tracks.csv', low_memory=False)

## Data Overview Artists

| column | additional information |
|--------|------------------------|
| id | id of artist |
| followers | number of followers | 
| genres | genres associated with artist |
| name | name of artist |
| popularity | popularity of artist in range 0 to 100 |

## Data Overview Tracks

| column | additional information |
|--------|------------------------|
| id | id of track |
| name | name of track | 
| popularity | popularity of track in range 0 to 100 |
| duration_ms | duration of songs in ms |
| explicit | whether it contains explicit content or not |
| artists | artists who created the track | 
| id_artists | id of artists who created the track |
| release_date | date of release |
| danceability | how danceable a song is in range 0 to 1 |
| energy | how energized a song is in range 0 to 1 |
| key | The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1 |
| loudness | The overall loudness of a track in decibels (dB) |
| mode |  Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0 |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic |
| instrumentalness | Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry) |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration | 
| time_signature | An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7 indicating time signatures of 3/4, to 7/4. | 
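
The value ranges documented above can be checked once the data is loaded. The following is a minimal sketch on a small hypothetical sample rather than the real files; `expected_ranges` and `out_of_range_counts` are illustrative names that do not appear elsewhere in this notebook, and the upper bound of 11 for `key` is assumed from standard Pitch Class notation.

```python
import pandas as pd

# Hypothetical sample; the real check would run on df_tracks
sample = pd.DataFrame({
    'popularity': [0, 55, 100, 120],
    'danceability': [0.1, 0.5, 1.2, 0.9],
    'key': [-1, 0, 11, 5],
})

# Ranges taken from the overview tables (key: -1 means "no key detected")
expected_ranges = {
    'popularity': (0, 100),
    'danceability': (0.0, 1.0),
    'key': (-1, 11),
}

def out_of_range_counts(df, ranges):
    """Return the number of values outside [low, high] for each column."""
    return {
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }

counts = out_of_range_counts(sample, expected_ranges)
print(counts)
```

Any column with a non-zero count would deserve a closer look before modelling.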

In [None]:
df_artists.columns

In [None]:
df_tracks.columns

Get general information from df_artists:

In [None]:
df_artists.info()

In [None]:
df_artists.head()

In [None]:
df_artists.describe()

Get general information from df_tracks:

In [None]:
df_tracks.info()

In [None]:
df_tracks.head()

In [None]:
df_tracks.describe()

As we want all necessary information in a single dataset, we need to combine the genre and followers columns from the artists df with the tracks df.

We can do this by matching the artist IDs in both dataframes. First, though, all entries need to be in the same format: the 'genres' column in df_artists and the 'artists' and 'id_artists' columns in df_tracks store their values as list-like strings such as ['...'].

In [None]:
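# Remove square brackets and quotes from the list-like columns of df_tracks.
# Only these columns are touched so that numeric columns keep their dtypes.
for col in ['artists', 'id_artists']:
    df_tracks[col] = df_tracks[col].astype(str).str.replace(r"[\[\]']", '', regex=True)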

In [None]:
df_tracks.head().T

In [None]:
# Remove square brackets and quotes from the list-like 'genres' column of df_artists
df_artists['genres'] = df_artists['genres'].astype(str).str.replace(r"[\[\]']", '', regex=True)

In [None]:
df_artists.head().T
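
To see what stripping brackets and quotes does to a single value, here is a small sketch on made-up strings. It also shows `ast.literal_eval` from the standard library as one possible alternative that parses the string into a real list; that alternative is only an illustration and is not used in this notebook.

```python
import ast

def strip_list_chars(text):
    """Remove square brackets and single quotes, as in the cleaning step."""
    return text.replace('[', '').replace(']', '').replace("'", "")

# Hypothetical value in the list-like string format
raw = "['Uli', 'Dick Haymes']"
stripped = strip_list_chars(raw)
parsed = ast.literal_eval(raw)
print(stripped)  # Uli, Dick Haymes
print(parsed)    # ['Uli', 'Dick Haymes']

# A name containing an apostrophe is quoted with double quotes instead
raw_apostrophe = '["Guns N\' Roses"]'
stripped_apostrophe = strip_list_chars(raw_apostrophe)
parsed_apostrophe = ast.literal_eval(raw_apostrophe)
print(stripped_apostrophe)  # "Guns N Roses"
print(parsed_apostrophe)    # ["Guns N' Roses"]
```

The second case shows a side effect of plain character stripping: the apostrophe is lost and the double quotes remain, so such names may no longer match names from other sources.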

Now that all entries should be in the same 'clean' format, we can merge the two dataframes on the artist ID:

In [None]:
# Merge df_artists and df_tracks using 'id' from df_artists and 'id_artists' from df_tracks
combined_df = df_tracks.merge(df_artists, left_on='id_artists', right_on='id', how='left')

In [None]:
combined_df.info()

In [None]:
combined_df.head().T
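
Because both dataframes have columns called `id`, `name` and `popularity`, pandas appends suffixes to tell them apart. The sketch below uses hypothetical toy data to show that `_x` marks the left dataframe (tracks) and `_y` the right one (artists).

```python
import pandas as pd

# Hypothetical toy data with the same overlapping column names
tracks = pd.DataFrame({
    'id': ['t1', 't2'],
    'name': ['Song A', 'Song B'],
    'popularity': [10, 20],
    'id_artists': ['a1', 'a2'],
})
artists = pd.DataFrame({
    'id': ['a1', 'a2'],
    'name': ['Artist A', 'Artist B'],
    'popularity': [70, 80],
})

merged = tracks.merge(artists, left_on='id_artists', right_on='id', how='left')

# '_x' columns come from tracks, '_y' columns come from artists
print(merged[['name_x', 'popularity_x', 'name_y', 'popularity_y']])
```

Here `popularity_x` holds the track values (10, 20) and `popularity_y` the artist values (70, 80).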

Now let's check for null values and duplicates in the new combined_df:

In [None]:
# Check for null values in the dataframe
null_counts = combined_df.isnull().sum()

# Check for duplicate rows in the dataframe
duplicate_counts = combined_df.duplicated().sum()

print("Null value counts:")
print(null_counts)

print("\nNumber of duplicate rows:", duplicate_counts)

There are many null values after our join. Let's look at them in more detail:

In [None]:
# Print rows with null values in the columns from df_artists
null_rows = combined_df[combined_df['id_y'].isnull()]
print("Rows with null values in df_artists columns:")
null_rows.head().T
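
One possible source of unmatched rows is tracks with more than one artist: once the brackets are stripped, their `id_artists` value is a combined string such as 'a1, a2', which equals no single artist id. The sketch below uses hypothetical toy data and the `indicator=True` option of `merge` to list the rows that found no partner.

```python
import pandas as pd

# Hypothetical toy data: 't2' has two artists, 'a9' is not in the artists table
tracks = pd.DataFrame({
    'id': ['t1', 't2', 't3'],
    'id_artists': ['a1', 'a1, a2', 'a9'],
})
artists = pd.DataFrame({'id': ['a1', 'a2'], 'followers': [100, 200]})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
merged = tracks.merge(artists, left_on='id_artists', right_on='id',
                      how='left', indicator=True)

unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['id_x', 'id_artists']])
```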

We can drop these rows, because their `id_artists` value has no matching id in df_artists.

In [None]:
combined_df_cleaned = combined_df.dropna()

In [None]:
combined_df_cleaned.info()

In [None]:
combined_df_cleaned.head().T

We need to rename some columns:

In [None]:
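# Rename the columns. After the merge, the '_x' suffix marks columns from df_tracks
# and the '_y' suffix marks columns from df_artists.
combined_df_cleaned = combined_df_cleaned.rename(columns={
    'id_x': 'track_id',
    'id_artists': 'artists_id',
    'name_x': 'track_name',
    'artists': 'artist_name',
    'popularity_x': 'track_popularity',
    'popularity_y': 'artist_popularity'
})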

In [None]:
combined_df_cleaned.head().T

Drop the columns we no longer need:

In [None]:
df_cleaned_1 = combined_df_cleaned.drop(['name_y', 'id_y'], axis=1)

In [None]:
df_cleaned_1.info()

In [None]:
df_cleaned_1.columns

Rearrange the order of columns

In [None]:
# Define the desired column order
desired_column_order = ['artists_id', 'track_id', 'artist_name', 'track_name', 'genres', 'release_date', 'explicit', 'duration_ms',
                        'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
                        'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature', 'followers',
                        'artist_popularity', 'track_popularity']

# Rearrange the columns
df_reordered = df_cleaned_1[desired_column_order]

In [None]:
df_reordered.head().T

Now we want to make sure that the most popular artists are still represented in our dataset after the cleaning process.
To do so, we can compare our dataset with the artists from the "Most Streamed Artists" table:

In [None]:
# Read the HTML table with the Spotify most streamed artists of all time
url = 'https://kworb.net/spotify/artists.html'
df_ms_artists = pd.read_html(url)[0]

In [None]:
df_ms_artists.head().T

Count how many artists from this table also appear in our dataset:

In [None]:
# Get the unique artists from df_ms_artists
ms_unique_artists = df_ms_artists['Artist'].unique()

# Count how many of these artists appear in df_reordered.
# nunique() counts each artist once; isin(...).sum() alone would count tracks.
is_top_artist = df_reordered['artist_name'].isin(ms_unique_artists)
matching_artist_count = df_reordered.loc[is_top_artist, 'artist_name'].nunique()

print("Number of artists from df_ms_artists in df_reordered:", matching_artist_count)
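
When measuring this overlap, matching rows and matching artists are different quantities, because one artist usually has many tracks. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: artist 'A' has two tracks
tracks = pd.DataFrame({'artist_name': ['A', 'A', 'B', 'C']})
top_artists = ['A', 'C', 'D']

is_top = tracks['artist_name'].isin(top_artists)

matching_rows = int(is_top.sum())                                # tracks by a top artist
matching_artists = tracks.loc[is_top, 'artist_name'].nunique()   # distinct top artists

print(matching_rows, matching_artists)
```

Here three tracks belong to a top artist, but only two distinct top artists are present.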

The overlap seems large enough to continue with our dataset.

In [None]:
# Find the most represented artists in df_reordered
most_presented_artists = df_reordered['artist_name'].value_counts()

# Find the most represented genres
most_presented_genres = df_reordered['genres'].value_counts()

print("Most represented artists:")
print(most_presented_artists)

print("\nMost represented genres:")
print(most_presented_genres)

There are a lot of 'Hörspiele' (German audio dramas) in our dataset. Let's try removing them and see how much data is lost.

In [None]:
df_reordered.info()

In [None]:
# Create a boolean mask for rows with 'hoerspiel' in the 'genres' column
mask = df_reordered['genres'].str.contains('hoerspiel', case=False)

# Filter the dataframe to exclude rows with 'hoerspiel' in the 'genres' column
df_filtered = df_reordered[~mask]

In [None]:
df_filtered.info()
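
Instead of comparing two `info()` outputs by eye, the loss can be computed from the mask itself. This is a sketch on hypothetical toy data; `na=False` is an added assumption so that missing genre values would count as "no match".

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'genres': ['hoerspiel', 'pop, rock', 'Hoerspiel, kinder', 'jazz']})

# True for every row whose genres mention 'hoerspiel' (case-insensitive)
mask = df['genres'].str.contains('hoerspiel', case=False, na=False)

removed = int(mask.sum())
share = removed / len(df)
print(f"{removed} of {len(df)} rows removed ({share:.0%})")
```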

Are there any other genres like podcast which we don't need in our Dataset? 

In [None]:
column_values = df_filtered['genres']

unique_values = column_values.unique()

for value in unique_values:
    print(value)


In [None]:
unique_genre_count = df_filtered['genres'].nunique()

print("Number of unique genres:", unique_genre_count)
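
Note that each cell of `genres` can hold several comma-separated genres, so `unique()` and `nunique()` on the column count genre combinations. To count single genres and how often they appear, the strings can be split first. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data in the cleaned, comma-separated format
df = pd.DataFrame({'genres': ['pop, dance pop', 'pop', 'jazz, vocal jazz', '']})

# Split each cell into a list, put every genre on its own row, trim whitespace
single_genres = df['genres'].str.split(',').explode().str.strip()

# Drop the empty strings left over from artists without genres
single_genres = single_genres[single_genres != '']

genre_counts = single_genres.value_counts()
print(genre_counts)
```

In this sample 'pop' appears twice and the other three genres once each.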

Search for Podcasts

In [None]:
# 'podcast' also matches 'podcasts', so a single pattern is enough
filtered_data = df_filtered[df_filtered['genres'].str.contains('podcast', case=False)]

print(filtered_data)

## Main Music Genres

* Jazz
* Country
* Pop
* Reggae
* Electronic
* Indie Rock
* Gospel
* House
* Hip Hop
* Classical Music
* R&B
* Punk Rock
* Folk Music
* Techno
* Disco
* EDM
* Rock
* Blues
* Metal
* Soul 
* Funk
* Alternative
* Dubstep
* World Music

Approach:

To categorize the sub-genres, we look at the genre list of each artist, search every entry for a main-genre keyword and add it to the set of sub-genres for that main genre.

Ordering the genres by how often they appear should also help, since niche genres appear less often than well-known ones.

In [None]:
# Maps each main genre (lower case) to the set of sub-genres assigned to it
sub_genres = {}

main_genres = ['Jazz', 'Country', 'Pop', 'Reggae', 'Electronic', 'Indie Rock', 'Gospel', 
                'House', 'Hip Hop', 'Classical', 'R&B', 'Punk Rock', 'Folk', 'Techno', 'Disco', 'EDM', 'Rock',
                'Blues', 'Metal', 'Soul', 'Funk', 'Alternative', 'Dubstep', 'World Music', 'Rockabilly', 'Other']

other_genres = set()
main_genres_col = []

# dict with main genres and sets for sub genres
for genre in main_genres:
    sub_genres[genre.lower()] = set()

for genre_text in df_filtered['genres'].values.tolist():
    main_genres_cell = []
    genres = genre_text.strip().split(',')
    for genre in genres:
        genre = genre.strip()
        if not genre:
            continue
        actual_genres = []
        for main_genre in sub_genres:
            if main_genre == 'other':
                continue
            try:
                index = genre.index(main_genre) # index and length to decide on genre: indie rock should go into indie rock and not rock
            except ValueError:
                continue
            end = index + len(main_genre)
            length = len(main_genre)
            m = main_genre
            if main_genre in ('pop', 'rock') and genre.endswith(main_genre):
                # genres ending in 'pop' or 'rock' count as pop / rock, even inside a longer word
                pass
            elif main_genre == 'metal' and 'metalcore' in genre:
                # 'metalcore' counts as metal
                pass
            elif (index and 'a' <= genre[index - 1] <= 'z') or (end < len(genre) and 'a' <= genre[end] <= 'z'):
                # the keyword is only part of a longer word: treat the genre as 'other'
                other_genres.add(genre)
                m = 'other'
                length = 0

            actual_genres.append([end, length, m])

        if actual_genres:
            actual_genre = sorted(actual_genres)[-1][2]
            sub_genres[actual_genre].add(genre)
            main_genres_cell.append(actual_genre)
    main_genres_col.append(sorted(set(main_genres_cell)))  # de-duplicate: ['jazz', 'jazz', 'jazz'] becomes ['jazz']


df_with_main_genres = df_filtered.copy(deep=True)
# make new column from list
df_with_main_genres['main_genres'] = main_genres_col


#for m, s in sub_genres.items():
#    print(m, s)
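
The core idea of the cell above, preferring the longest matching keyword so that 'indie rock' wins over 'rock', can be sketched more compactly. This is a simplified illustration, not a replacement: it uses word-boundary regular expressions instead of the index arithmetic above, does not reproduce the special cases for 'pop', 'rock' and 'metalcore', and `MAIN_GENRES` and `main_genre_for` are hypothetical names.

```python
import re

# Shortened, hypothetical list of main genres (lower case)
MAIN_GENRES = ['indie rock', 'rock', 'pop', 'house', 'metal']

def main_genre_for(sub_genre, main_genres=MAIN_GENRES):
    """Return the longest main genre found as whole words, else 'other'."""
    matches = [
        m for m in main_genres
        if re.search(rf'\b{re.escape(m)}\b', sub_genre)
    ]
    return max(matches, key=len) if matches else 'other'

for example in ['german indie rock', 'hard rock', 'deep house', 'madhouse']:
    print(example, '->', main_genre_for(example))
```

With whole-word matching, 'madhouse' falls into 'other' because 'house' is only part of a longer word, while 'german indie rock' is assigned to 'indie rock' rather than 'rock'.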

In [None]:

for i, m in enumerate(main_genres_col):
    if m: 
        print(i, m)


In [None]:
df_with_main_genres.loc[169]