# 4. Data Exploration

## 4.1
**How many observations does your dataset have?**


999,056 unique songs
- The track metadata table contains 1,000,000 tracks with various metadata columns, one of which is song_id. Some tracks share a song_id because they are slightly different recordings of the same song. We do not need that distinction for now, so we drop the duplicates and keep one track per song_id, which leaves 999,056 unique songs.

280,831 tracks have genre labels (rows in the genre label file).

48,373,586 (user, song, plays) triplets, each mapping a user to a song with the number of times the user played it.

In [1]:
import pandas as pd
import sqlite3
triplets_path = "train_triplets.txt"
track_metadata_path = "track_metadata.db"
genre_labels_path = "msd_tagtraum_cd2.cls"



In [2]:
track_metadata_conn = sqlite3.connect(track_metadata_path)
track_metadata_sql_query = "SELECT * FROM songs"
track_metadata_df = pd.read_sql_query(
    track_metadata_sql_query,
    track_metadata_conn
    ).drop_duplicates(subset=['song_id'])
track_metadata_conn.close()
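
To see how many rows the de-duplication removes, the row count can be compared before and after `drop_duplicates`. The following is a minimal sketch on a made-up table standing in for the `songs` table; the IDs and the names `toy_tracks` and `toy_unique` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the `songs` table (hypothetical IDs, not real data).
toy_tracks = pd.DataFrame({
    "track_id": ["TR1", "TR2", "TR3", "TR4"],
    "song_id": ["SO1", "SO1", "SO2", "SO3"],  # TR1 and TR2 share a song_id
})

# Count rows whose song_id already appeared earlier in the table.
n_duplicates = int(toy_tracks.duplicated(subset=["song_id"]).sum())

# drop_duplicates keeps the first row seen for each song_id by default.
toy_unique = toy_tracks.drop_duplicates(subset=["song_id"])

print(n_duplicates)     # 1
print(len(toy_unique))  # 3
```

Given the counts reported above, the same check on the full table should show 1,000,000 - 999,056 = 944 dropped rows.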

In [3]:
# The full file has 48,373,586 rows; uncomment nrows below to load a small sample for testing
user_song_play_df = pd.read_csv(
    triplets_path,
    sep='\t',
    names=['user_id', 'song_id', 'plays'],
    # nrows=500
    )

In [4]:
genre_labels_df = pd.read_csv(
    genre_labels_path,
    sep='\t',
    names=['track_id', 'majority_genre', 'minority_genre'],
    skiprows=7
    )

In [5]:
track_metadata_with_genre_df = pd.merge(
    track_metadata_df,
    genre_labels_df,
    on='track_id',
    how='left'   # keep every song, including those without a genre label
)
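
Because this is a left join, songs without a genre label stay in the result with missing genre columns. One way to measure label coverage is the `indicator` option of `pd.merge`. This is a sketch on made-up data; the frame names and values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the metadata and genre tables (hypothetical values).
toy_metadata = pd.DataFrame({
    "track_id": ["TR1", "TR2", "TR3"],
    "title": ["A", "B", "C"],
})
toy_genres = pd.DataFrame({
    "track_id": ["TR1", "TR3"],
    "majority_genre": ["Rock", "Jazz"],
    "minority_genre": [None, "Blues"],
})

# indicator=True adds a `_merge` column recording where each row came from.
merged = pd.merge(
    toy_metadata, toy_genres, on="track_id", how="left", indicator=True
)

n_labeled = int((merged["_merge"] == "both").sum())         # matched a genre row
n_unlabeled = int((merged["_merge"] == "left_only").sum())  # no genre row

print(n_labeled, n_unlabeled)  # 2 1
```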

In [6]:
len(genre_labels_df)

280831

In [7]:
len(track_metadata_df)

999056

In [8]:
len(user_song_play_df)

48373586

In [9]:
# Attach song metadata and genre labels to each (user, song, plays) triplet (inner join on song_id)
songs_df = pd.merge(user_song_play_df, track_metadata_with_genre_df, on='song_id')

In [10]:
len(songs_df)

48373586
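
The merged row count equals the triplet count, which is what we expect when every triplet matches exactly one song. That expectation can also be enforced at merge time with the `validate` argument of `pd.merge`. The sketch below uses made-up data; the frame names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the triplets and the de-duplicated song table.
toy_plays = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "song_id": ["SO1", "SO2", "SO1"],
    "plays": [3, 1, 5],
})
toy_songs = pd.DataFrame({
    "song_id": ["SO1", "SO2"],
    "title": ["A", "B"],
})

# many_to_one: song_id may repeat on the left but must be unique on the right.
joined = pd.merge(toy_plays, toy_songs, on="song_id", validate="many_to_one")
rows_preserved = len(joined) == len(toy_plays)

# If the right table still contained duplicate song_ids, the merge would fail.
dup_songs = pd.concat([toy_songs, toy_songs.iloc[[0]]], ignore_index=True)
try:
    pd.merge(toy_plays, dup_songs, on="song_id", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(rows_preserved, raised)  # True True
```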

## 4.2

**Describe all columns in your dataset, their scales, and their data distributions. Describe the categorical and continuous variables in your dataset. Describe your target column and, if you are using images, plot some example classes of the images.**


- **user_id**: Unique identifier for the user (from triplets table).  
- **song_id**: Echo Nest song ID. Multiple tracks can share the same song_id if they are different versions of the same song. However, we removed these duplicates from the track_metadata table.  
- **plays**: Number of times the user played the song (from triplets table).  
- **track_id**: The Echo Nest ID of this particular track on which the analysis was done. Unique identifier for each track within the Million Song Dataset.  
- **title**: Title of the song.  
- **release**: Release (album name) from which the track was taken.  
- **artist_id**: Unique identifier for the artist. Matches the artist_id field in the track_metadata table and artist_term table.  
- **artist_mbid**: MusicBrainz ID (MBID) for the artist.  
- **artist_name**: Name of the artist.  
- **duration**: Duration of the track in seconds.  
- **artist_familiarity**: EchoNest numerical estimation of how familiar a song’s artist currently is to the world.  
- **artist_hotttnesss**: EchoNest numerical estimation of how hottt a song’s artist currently is.  
- **year**: The year the track was released according to MusicBrainz.  
- **track_7digitalid**: Identifier linking the song to the 7digital online music catalog.  
- **shs_perf**: Performance number on the SecondHandSongs website. Default value is -1.  
- **shs_work**: Clique numbers from the SecondHandSongs train and test files. If the work is known by MillionSongs, then the number is positive.  
- **majority_genre**: The dominant genre assigned to the track in the genre label file (msd_tagtraum_cd2.cls). Missing for tracks without a genre label.  
- **minority_genre**: A secondary, less frequent genre assigned to the track in the same file. Missing for most tracks.  


Target column:
*   Our goal is to cluster users, which is an unsupervised task, so there is no target column to predict.
*   Our entity of interest is therefore the user, identified by "user_id".
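
Since the user is the entity of interest, the row-per-play table eventually has to be summarized into one row per user. The following is only a sketch of one possible aggregation on made-up data, not the feature set we have settled on; `toy_songs_df` and the feature names `n_songs` and `total_plays` are hypothetical.

```python
import pandas as pd

# Toy stand-in for songs_df with only the columns needed here.
toy_songs_df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2"],
    "song_id": ["SO1", "SO2", "SO1", "SO3", "SO4"],
    "plays": [3, 1, 5, 2, 1],
})

# One row per user: number of distinct songs and total play count.
user_features = (
    toy_songs_df.groupby("user_id")
    .agg(n_songs=("song_id", "nunique"), total_plays=("plays", "sum"))
    .reset_index()
)

print(user_features)
```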



In [13]:
songs_df.describe(include='all')

| | user_id | song_id | plays | track_id | title | release | artist_id | artist_mbid | artist_name | duration | artist_familiarity | artist_hotttnesss | year | track_7digitalid | shs_perf | shs_work | majority_genre | minority_genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48373586 | 48373586 | 48373590.0 | 48373586 | 48373586 | 48373586 | 48373586 | 48373586.0 | 48373586 | 48373590.0 | 48373590.0 | 48373590.0 | 48373590.0 | 48373590.0 | 48373590.0 | 48373590.0 | 29864500 | 8265936 |
| unique | 1019318 | 384546 | | 384546 | 306841 | 81835 | 30543 | 28464.0 | 42055 | | | | | | | | 15 | 15 |
| top | ec6dfcf19485cb011e0b22637075037aae34cf26 | SOFRQTD12A81C233C0 | | TRDMBIJ128F4290431 | Sehr kosmisch | Greatest Hits | ARJ7KF01187B98D717 | | Coldplay | | | | | | | | Rock | Rock |
| freq | 4400 | 110479 | | 110479 | 110479 | 294574 | 412421 | 618894.0 | 412421 | | | | | | | | 14743685 | 1955566 |
| mean | | | 2.866859 | | | | | | | 248.518 | 0.7283596 | 0.5435861 | 1605.213 | 3445440.0 | 1899.954 | 1415.716 | | |
| std | | | 6.437725 | | | | | | | 97.15339 | 0.1316299 | 0.1438316 | 796.5815 | 2686464.0 | 12063.7 | 9614.157 | | |
| min | | | 1.0 | | | | | | | 0.522 | -1.0 | 0.0 | 0.0 | -1.0 | -1.0 | -935.0 | | |
| 25% | | | 1.0 | | | | | | | 198.4257 | 0.6456383 | 0.4534857 | 1981.0 | 821454.0 | -1.0 | 0.0 | | |
| 50% | | | 1.0 | | | | | | | 233.8999 | 0.7508383 | 0.530432 | 2001.0 | 3159011.0 | -1.0 | 0.0 | | |
| 75% | | | 3.0 | | | | | | | 278.2298 | 0.8308846 | 0.6001319 | 2007.0 | 5611643.0 | -1.0 | 0.0 | | |


In [14]:
songs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48373586 entries, 0 to 48373585
Data columns (total 18 columns):
 #   Column              Dtype  
---  ------              -----  
 0   user_id             object 
 1   song_id             object 
 2   plays               int64  
 3   track_id            object 
 4   title               object 
 5   release             object 
 6   artist_id           object 
 7   artist_mbid         object 
 8   artist_name         object 
 9   duration            float64
 10  artist_familiarity  float64
 11  artist_hotttnesss   float64
 12  year                int64  
 13  track_7digitalid    int64  
 14  shs_perf            int64  
 15  shs_work            int64  
 16  majority_genre      object 
 17  minority_genre      object 
dtypes: float64(3), int64(5), object(10)
memory usage: 6.5+ GB


## 4.3
**Do you have missing and duplicate values in your dataset?**

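Yes. Genre labels are missing for many rows: of the 48,373,586 rows in songs_df, 29,864,500 have a majority_genre and only 8,265,936 have a minority_genre. We have not yet decided how to handle the missing genres. The track metadata also contained duplicate song_id values (different tracks of the same song). We dropped these duplicates because we are not interested in the slight audio differences between them, and because having one row per song makes it easier to map users to songs for counting purposes.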
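
Explicit NaN values are not the only form of missing data. The `describe()` output above suggests that some columns use placeholder values instead, for example a minimum `year` of 0 and an empty string as the most frequent `artist_mbid`. The sketch below counts both kinds on made-up data. Treating `year == 0` and an empty `artist_mbid` as "unknown" is an assumption we have not verified against the dataset documentation, and the frame `toy` is hypothetical.

```python
import pandas as pd

# Toy stand-in for a few songs_df columns (hypothetical values).
toy = pd.DataFrame({
    "year": [2001, 0, 1981, 0],
    "artist_mbid": ["abc", "", "def", "ghi"],
    "majority_genre": ["Rock", None, None, "Pop"],
})

# Explicit missing values only (NaN / None).
explicit_missing = toy.isna().sum()

# Assumption: year == 0 and an empty artist_mbid both mean "unknown".
cleaned = toy.copy()
cleaned["year"] = cleaned["year"].mask(cleaned["year"] == 0)
cleaned["artist_mbid"] = cleaned["artist_mbid"].mask(cleaned["artist_mbid"] == "")

# Missing values after converting the placeholders to NaN.
total_missing = cleaned.isna().sum()

print(explicit_missing.to_dict())
print(total_missing.to_dict())
```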

