# Reading, analyzing, and storing fetched Genius API data

This notebook demonstrates how we can load the chunks of lyric data we fetched from Genius, merge them into a single DataFrame, analyze the data and finally write it back to a single JSON file.

In [1]:
import numpy as np
import pandas as pd
import os

## Loading chunks and combining them into a single DataFrame

In [2]:
data_dir = "data"
lyric_chunk_folder_path = os.path.join(data_dir, "lyric_chunks")
chunk_paths = [os.path.join(lyric_chunk_folder_path, name) for name in os.listdir(lyric_chunk_folder_path)]
chunk_paths.sort()
chunk_paths

['data/lyric_chunks/0-99.json',
 'data/lyric_chunks/100-199.json',
 'data/lyric_chunks/1000-1099.json',
 'data/lyric_chunks/10000-10099.json',
 'data/lyric_chunks/10100-10199.json',
 'data/lyric_chunks/10200-10299.json',
 'data/lyric_chunks/10300-10399.json',
 'data/lyric_chunks/10400-10499.json',
 'data/lyric_chunks/10500-10599.json',
 'data/lyric_chunks/10600-10699.json',
 'data/lyric_chunks/10700-10799.json',
 'data/lyric_chunks/10800-10899.json',
 'data/lyric_chunks/10900-10999.json',
 'data/lyric_chunks/1100-1199.json',
 'data/lyric_chunks/11000-11099.json',
 'data/lyric_chunks/11100-11199.json',
 'data/lyric_chunks/11200-11299.json',
 'data/lyric_chunks/11300-11399.json',
 'data/lyric_chunks/11400-11499.json',
 'data/lyric_chunks/11500-11599.json',
 'data/lyric_chunks/11600-11699.json',
 'data/lyric_chunks/11700-11799.json',
 'data/lyric_chunks/11800-11899.json',
 'data/lyric_chunks/11900-11999.json',
 'data/lyric_chunks/1200-1299.json',
 'data/lyric_chunks/12000-12099.json',
 'd
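
Note that `chunk_paths.sort()` orders the paths lexicographically, which is why `1000-1099.json` appears directly after `100-199.json` in the listing above. If the row order across chunks matters, one option is to sort by the numeric start index instead. The following is a minimal sketch, assuming every filename follows the `<start>-<end>.json` pattern; `chunk_start` is a hypothetical helper and the example paths are only illustrative.

```python
import os


def chunk_start(path):
    """Return the integer start index encoded in a name like '100-199.json'."""
    name = os.path.basename(path)
    return int(name.split("-")[0])


# Illustrative paths only; in the notebook this would be `chunk_paths`.
example_paths = [
    "data/lyric_chunks/0-99.json",
    "data/lyric_chunks/1000-1099.json",
    "data/lyric_chunks/100-199.json",
    "data/lyric_chunks/10000-10099.json",
]

# Sort by the numeric start index rather than by the raw string.
sorted_paths = sorted(example_paths, key=chunk_start)
```

In the notebook itself, this would amount to calling `chunk_paths.sort(key=chunk_start)` in place of the plain `sort()`.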

In [4]:
chunks = [pd.read_json(path, orient="index") for path in chunk_paths]
chunks

[   _type  annotation_count        api_path               artist  \
 0   song               1.0  /songs/1174438         Jamie Cullum   
 1   song               0.0   /songs/752887             Adam Ant   
 2   song               1.0   /songs/315464    Halvdan Sivertsen   
 3   song               4.0    /songs/43024           Snoop Dogg   
 4   song               3.0  /songs/1196276     Planet P Project   
 ..   ...               ...             ...                  ...   
 95  song               1.0   /songs/713994   Earl Thomas Conley   
 96  song               7.0   /songs/116136             Megadeth   
 97  song               1.0  /songs/1394969  Black Label Society   
 98  song               1.0  /songs/1561639           David Kitt   
 99  song               1.0  /songs/1758229       Rory Gallagher   
 
            artist_names featured_artists  \
 0          Jamie Cullum               []   
 1              Adam Ant               []   
 2     Halvdan Sivertsen               []   
 3

## Quick look into the data

In [5]:
data = pd.concat(chunks).reset_index()
data

Unnamed: 0,index,_type,annotation_count,api_path,artist,artist_names,featured_artists,full_title,header_image_thumbnail_url,header_image_url,...,release_date_for_display,release_date_with_abbreviated_month_for_display,song_art_image_thumbnail_url,song_art_image_url,stats,title,title_with_featured,updated_by_human_at,url,0
0,0,song,1.0,/songs/1174438,Jamie Cullum,Jamie Cullum,[],It's About Time by Jamie Cullum,https://images.genius.com/d2e786fa65b27312eff1...,https://images.genius.com/d2e786fa65b27312eff1...,...,"October 20, 2003","Oct. 20, 2003",https://images.genius.com/d2e786fa65b27312eff1...,https://images.genius.com/d2e786fa65b27312eff1...,"{'unreviewed_annotations': 1, 'hot': False}",It’s About Time,It's About Time,2017-07-04 01:04:10,https://genius.com/Jamie-cullum-its-about-time...,
1,1,song,0.0,/songs/752887,Adam Ant,Adam Ant,[],Something Girls by Adam Ant,https://images.genius.com/619296e88bc3842348a8...,https://images.genius.com/619296e88bc3842348a8...,...,"July 26, 1982","Jul. 26, 1982",https://images.genius.com/619296e88bc3842348a8...,https://images.genius.com/619296e88bc3842348a8...,"{'unreviewed_annotations': 0, 'hot': False}",Something Girls,Something Girls,2021-06-24 14:55:00,https://genius.com/Adam-ant-something-girls-ly...,
2,2,song,1.0,/songs/315464,Halvdan Sivertsen,Halvdan Sivertsen,[],Små ord by Halvdan Sivertsen,https://images.genius.com/6bc596531a7d2ce10137...,https://images.genius.com/6bc596531a7d2ce10137...,...,,,https://images.genius.com/6bc596531a7d2ce10137...,https://images.genius.com/6bc596531a7d2ce10137...,"{'unreviewed_annotations': 1, 'hot': False}",Små ord,Små ord,2013-12-23 21:25:48,https://genius.com/Halvdan-sivertsen-sma-ord-l...,
3,3,song,4.0,/songs/43024,Snoop Dogg,Snoop Dogg,[],The One and Only by Snoop Dogg,https://images.genius.com/020258c5b50e382d2763...,https://images.genius.com/020258c5b50e382d2763...,...,"November 26, 2002","Nov. 26, 2002",https://images.genius.com/020258c5b50e382d2763...,https://images.genius.com/020258c5b50e382d2763...,"{'unreviewed_annotations': 1, 'hot': False, 'p...",The One and Only,The One and Only,2022-05-02 01:56:48,https://genius.com/Snoop-dogg-the-one-and-only...,
4,4,song,3.0,/songs/1196276,Planet P Project,Planet P Project,[],Pink World by Planet P Project,https://images.genius.com/566104d7c7e5e0ecbdbb...,https://images.genius.com/566104d7c7e5e0ecbdbb...,...,1984,1984,https://images.genius.com/566104d7c7e5e0ecbdbb...,https://images.genius.com/566104d7c7e5e0ecbdbb...,"{'unreviewed_annotations': 3, 'hot': False}",Pink World,Pink World,2021-03-13 06:37:13,https://genius.com/Planet-p-project-pink-world...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83186,9995,song,13.0,/songs/30041,Mack 10,Mack 10 (Ft. Mary Jane Girls),"[{'_type': 'artist', 'api_path': '/artists/166...",On Them Thangs by Mack 10 (Ft. Mary Jane Girls),https://images.genius.com/baa48957d0452cea23d4...,https://images.genius.com/baa48957d0452cea23d4...,...,"June 20, 1995","Jun. 20, 1995",https://images.genius.com/baa48957d0452cea23d4...,https://images.genius.com/baa48957d0452cea23d4...,"{'unreviewed_annotations': 12, 'hot': False}",On Them Thangs,On Them Thangs (Ft. Mary Jane Girls),2019-04-17 08:34:34,https://genius.com/Mack-10-on-them-thangs-lyrics,
83187,9996,song,1.0,/songs/926485,Moonspell,Moonspell,[],Night Eternal by Moonspell,https://images.genius.com/f04af66582850931a374...,https://images.genius.com/f04af66582850931a374...,...,,,https://images.genius.com/f04af66582850931a374...,https://images.genius.com/f04af66582850931a374...,"{'unreviewed_annotations': 0, 'hot': False}",Night Eternal,Night Eternal,2021-11-22 20:00:02,https://genius.com/Moonspell-night-eternal-lyrics,
83188,9997,song,1.0,/songs/1160047,Thee More Shallows,Thee More Shallows,[],Cloisterphobia by Thee More Shallows,https://images.genius.com/b1702cadce49692dda49...,https://images.genius.com/b1702cadce49692dda49...,...,,,https://images.genius.com/b1702cadce49692dda49...,https://images.genius.com/b1702cadce49692dda49...,"{'unreviewed_annotations': 0, 'hot': False}",Cloisterphobia,Cloisterphobia,2016-12-14 21:20:29,https://genius.com/Thee-more-shallows-cloister...,
83189,9998,song,2.0,/songs/169684,Shakira,Shakira,[],Men in This Town by Shakira,https://images.genius.com/ae22030b63ff20633418...,https://images.genius.com/ae22030b63ff20633418...,...,"October 9, 2009","Oct. 9, 2009",https://images.genius.com/ae22030b63ff20633418...,https://images.genius.com/ae22030b63ff20633418...,"{'unreviewed_annotations': 1, 'hot': False}",Men in This Town,Men in This Town,2022-09-25 21:51:36,https://genius.com/Shakira-men-in-this-town-ly...,
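
The `index` column in the output above holds the index each row carried in its chunk file, because `reset_index()` keeps the old index as a regular column. If that column is not needed, `pd.concat` can discard it directly via `ignore_index=True`. A small sketch on toy frames (the `chunk_a` and `chunk_b` names and their contents are made up for illustration):

```python
import pandas as pd

# Two toy chunks that reuse the same index labels.
chunk_a = pd.DataFrame({"title": ["a", "b"]}, index=[0, 1])
chunk_b = pd.DataFrame({"title": ["c", "d"]}, index=[0, 1])

# Variant used in the notebook: the old index survives as an "index" column.
with_old_index = pd.concat([chunk_a, chunk_b]).reset_index()

# Alternative: drop the old index and number the rows 0..n-1.
fresh_index = pd.concat([chunk_a, chunk_b], ignore_index=True)
```

Which variant is preferable depends on whether the per-chunk index is still needed later, for example to trace a row back to its original position.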


In [6]:
data.lyrics.isna().sum()

1971

In [7]:
data.lyrics.isna().sum() / len(data)

0.023692466733180272

We see that for only around 2.4% of the songs in our dataset, no lyrics were found on Genius.
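
The share of missing values can also be computed in one step, since the mean of a boolean mask equals the fraction of `True` entries. A minimal sketch on a toy frame (the `toy` data is invented for illustration):

```python
import pandas as pd

# Toy data with one missing lyric out of four songs.
toy = pd.DataFrame({"lyrics": ["la la", None, "do re mi", "na na"]})

missing_mask = toy["lyrics"].isna()
missing_count = missing_mask.sum()   # number of missing lyrics
missing_share = missing_mask.mean()  # fraction of missing lyrics
```

Applied to the notebook's DataFrame, `data.lyrics.isna().mean()` should give the same value as dividing the count by `len(data)`.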

In [8]:
data.language.value_counts()

en              73780
es               2395
fr               1141
de               1133
pt                571
it                553
fi                413
sv                244
no                132
nl                 84
pl                 64
da                 36
tr                 35
hr                 31
la                 25
cs                 24
ga                 21
ar                 17
hu                 15
ja                 14
ro                 14
ca                 13
eu                 11
gl                 10
sw                  9
bs                  9
gd                  9
is                  9
tl                  7
ru                  7
ln                  7
sco                 5
sr                  4
wo                  4
co                  4
sk                  3
ko                  3
lt                  3
id                  2
bg                  2
cy                  2
af                  2
romanization        1
eo                  1
ia                  1
sa        

The vast majority of the lyrics are in English, which is very convenient for NLP purposes!
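
To express the language distribution as proportions rather than raw counts, `value_counts` accepts `normalize=True`. A short sketch on a toy Series (the values are invented and do not reflect the real dataset):

```python
import pandas as pd

# Toy language column: three English songs and one Spanish song.
toy_languages = pd.Series(["en", "en", "en", "es"], name="language")

# Relative frequencies instead of absolute counts.
shares = toy_languages.value_counts(normalize=True)
english_share = shares["en"]
```

Note that `value_counts` skips missing values by default, so the shares are relative to the songs that have a detected language.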

## Writing data to JSON

In [9]:
data.to_json(
    os.path.join(data_dir, "genius_api_data.json"),
    orient="index",
)
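
A quick way to check the written file is to read it back with the same `orient` and compare it with the original. The sketch below does this for a toy frame in a temporary directory; the file name `toy.json` and the frame contents are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"title": ["a", "b"], "language": ["en", "es"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy.json")
    # Write with the same orientation the notebook uses.
    toy.to_json(out_path, orient="index")
    # Read back with a matching orient so rows and columns line up again.
    restored = pd.read_json(out_path, orient="index")
```

For the full dataset, the same `pd.read_json(..., orient="index")` call on `genius_api_data.json` should restore the combined DataFrame, although dtypes of some columns may differ after the round trip.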