# EXISTING: Data Analysis
- This notebook builds three data frames: Male Rapper Lyrics, Female Rapper Lyrics, and All Lyrics


In [1]:
# import libraries
import pandas as pd
import os, json
from glob import glob

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Trying out DataFrames with Small Dataset
- I used `Lyrics_2Pac.json` for this initial exploration
- The following code chunks create the initial dataframe `pac_lyrics_df` from the 2Pac data
- They also check what types of objects are created along the way

In [2]:
# Try building a data frame for one of the male artists: "2Pac"
m_path = './rap_lyrics/male_lyrics/'    # path with the male artists
# Pass the JSON into a dictionary
with open(m_path+'Lyrics_2Pac.json') as json_file:
    data = json.load(json_file)
# confirming that it is a dictionary type object   
type(data)    # It is!
# print out the keys to see where the song lyrics may be held
data.keys()   # It should be in 'songs'

dict

dict_keys(['alternate_names', 'api_path', 'description', 'facebook_name', 'followers_count', 'header_image_url', 'id', 'image_url', 'instagram_name', 'is_meme_verified', 'is_verified', 'name', 'translation_artist', 'twitter_name', 'url', 'current_user_metadata', 'description_annotation', 'user', 'songs'])

In [3]:
# build a data frame from dictionary using pd.DataFrame.from_dict
pac_lyrics_df = pd.DataFrame.from_dict(data['songs'])
# data frame should have the columns 'artist', 'title' (for song title), and 'lyrics'
pac_lyrics_df = pac_lyrics_df[['artist', 'title', 'lyrics']]
# did it populate correctly?
pac_lyrics_df.head() # it did!

Unnamed: 0,artist,title,lyrics
0,2Pac,16 on Death Row,Death Row\nThat's where mothafuckas is endin' ...
1,2Pac,1995 Police Station Testimony,"Woman – Sir, will you raise your right hand, p..."
2,2Pac,1 for April,2 me your name alone is poetry\nI barely know ...
3,2Pac,1st impression,Just when I thought I'd seen it all\nour paths...
4,2Pac,1st Impressions: 4 Irene,Just when I thought I'd seen it all\nour paths...


## Now let's try it with the full male artist directory
- The following code chunk populates the official male lyrics dataframe (`mlyrics_df`) with the data from the 10 male artists:
    - J. Cole, Jay-Z, Kanye West, The Notorious B.I.G., Kendrick Lamar, Lil Wayne, Snoop Dogg, Nas, Drake, 2Pac
    - It reuses the code from the previous chunks

In [15]:
# empty df for male lyrics with the column titles: artist, title, lyrics
mlyrics_df = pd.DataFrame(columns=['artist', 'title', 'lyrics'])

# for loop to populate mlyrics_df
for filename in [file for file in os.listdir(m_path) if file.endswith('.json')]:
    # prints out a list of the filenames in the directory
    print(filename)
    # read in each filename and load it
    with open(m_path + filename) as json_file:
        data = json.load(json_file)
        # populate a temp_df with necessary info
        temp_df = pd.DataFrame.from_dict(data['songs'])
        temp_df = temp_df[['artist', 'title', 'lyrics']]
        # concatenate temp_df to mlyrics_df
        mlyrics_df = pd.concat([mlyrics_df,temp_df])
mlyrics_df
type(data)
type(mlyrics_df)

Lyrics_J.Cole.json
Lyrics_JAYZ.json
Lyrics_KanyeWest.json
Lyrics_TheNotoriousB.I.G..json
Lyrics_KendrickLamar.json
Lyrics_LilWayne.json
Lyrics_SnoopDogg.json
Lyrics_Nas.json
Lyrics_Drake.json
Lyrics_2Pac.json


Unnamed: 0,artist,title,lyrics
0,J. Cole,03' Adolescence,La la la\nLa la la la la\nLa la la\nLa la la l...
1,J. Cole,102.1 Jamz Freestyle,For all y’all boys cheap talking\nKeep walking...
2,J. Cole,102 Jamz Freestyle,For all y’all boys cheap talking\nKeep walking...
3,J. Cole,1985,"1985, I arrived\n33 years, damn, I'm grateful ..."
4,J. Cole,2012,"Yes, straight out the Ville and I'm blessed\nN..."
...,...,...,...
95,2Pac,Flex,"Flex, flex flex\nFlex, flex flex\n\nSlippin' t..."
96,2Pac,Forever And Today,U say that u'll love me forever but what about...
97,2Pac,For Mrs. Hawkins (In Memory of Yusef Hawkins),This poem is addressed 2 Mrs. Hawkins\nwho los...
98,2Pac,Fortune & Fame,"And my niggas say, we want the fame!\nC'mon\n\..."


dict

pandas.core.frame.DataFrame

In [16]:
# 10 artists, 1000 values, 979 unique titles, 993 unique lyrics
    # so there are some duplicates with the titles and the lyrics
mlyrics_df.describe()
# The Notorious B.I.G. is the most common artist
# Anything is the most common song title

Unnamed: 0,artist,title,lyrics
count,1000,1000,1000.0
unique,10,979,993.0
top,The Notorious B.I.G.,Anything,
freq,100,3,4.0


In [18]:
# 1000 non-null values in each column. Good!
mlyrics_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 99
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   artist  1000 non-null   object
 1   title   1000 non-null   object
 2   lyrics  1000 non-null   object
dtypes: object(3)
memory usage: 31.2+ KB


In [19]:
# Every artist has their 100 lyrics as expected
mlyrics_df.artist.value_counts()

The Notorious B.I.G.    100
Drake                   100
2Pac                    100
J. Cole                 100
Lil Wayne               100
JAY-Z                   100
Snoop Dogg              100
Kanye West              100
Nas                     100
Kendrick Lamar          100
Name: artist, dtype: int64

**Observations**:
- There are some duplicates in the titles (979 unique out of 1000) and the lyrics (993 unique out of 1000)
- I would start the cleaning at the song title level
- Each artist's rows are indexed 0-99, so the index labels repeat across artists
    - Still deciding whether to reset the index
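
One way to look into both observations, sketched on a small made-up frame (the rows are illustrative stand-ins, not the real data): `duplicated` with `keep=False` flags every row that shares a value with another row, and `reset_index(drop=True)` replaces the repeated labels with one running index.

```python
import pandas as pd

# Toy stand-in for mlyrics_df: two per-artist frames whose indexes both start at 0
toy_df = pd.concat([
    pd.DataFrame({'artist': ['A', 'A'], 'title': ['Anything', 'Intro'], 'lyrics': ['la la', 'yeah']}),
    pd.DataFrame({'artist': ['B', 'B'], 'title': ['Anything', 'Outro'], 'lyrics': ['oh oh', 'bye']}),
])

# index labels repeat (0, 1, 0, 1), like the 0-99 blocks in mlyrics_df
repeated_labels = toy_df.index.tolist()

# give every row its own label
toy_df = toy_df.reset_index(drop=True)

# every row whose title appears more than once, across all artists
shared_titles = toy_df[toy_df.duplicated(subset='title', keep=False)]

# rows repeating the same artist AND title are the likelier true duplicates
true_dupes = toy_df[toy_df.duplicated(subset=['artist', 'title'], keep=False)]
```

Checking `['artist', 'title']` together separates songs that merely share a name across artists from rows that were collected twice for the same artist.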

## Now let's try it with the full female directory
- The following code chunk populates the official female lyrics dataframe (`flyrics_df`) with the data from the following 10 female artists:
    - Rico Nasty, Missy Elliott, Megan Thee Stallion, Lil Kim, Cardi B, Remy Ma, Rapsody, Trina, Nicki Minaj, Queen Latifah
    - *Note.* There are fewer rows in this data frame than in `mlyrics_df` because not all artists reached the 100 songs specified in the data collection code (e.g. Remy Ma [86], Cardi B [76])
        - This was expected because Cardi B is a fairly new artist and Remy Ma was incarcerated for a long time and is just now getting back to music.
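
The loading loop is the same for both directories apart from the path, so it could be wrapped in a function. A minimal sketch, assuming each file keeps the `songs` list seen in the 2Pac file; `load_lyrics_dir` and the demo files are hypothetical and not part of the original notebook:

```python
import json
import os
import tempfile

import pandas as pd


def load_lyrics_dir(path):
    """Hypothetical helper: read every .json file in `path` into one DataFrame."""
    frames = []
    for filename in sorted(os.listdir(path)):
        if not filename.endswith('.json'):
            continue
        with open(os.path.join(path, filename), encoding='utf-8') as json_file:
            data = json.load(json_file)
        # keep only the three columns used in this notebook
        frames.append(pd.DataFrame(data['songs'])[['artist', 'title', 'lyrics']])
    # one concat at the end; ignore_index gives a single running index
    return pd.concat(frames, ignore_index=True)


# Small self-contained demo with two made-up artist files
with tempfile.TemporaryDirectory() as demo_dir:
    for name, n_songs in [('Artist A', 2), ('Artist B', 1)]:
        songs = [{'artist': name, 'title': f'Song {i}', 'lyrics': 'la la', 'id': i}
                 for i in range(n_songs)]
        out_path = os.path.join(demo_dir, f"Lyrics_{name.replace(' ', '')}.json")
        with open(out_path, 'w', encoding='utf-8') as f:
            json.dump({'name': name, 'songs': songs}, f)
    demo_df = load_lyrics_dir(demo_dir)

demo_df.artist.value_counts()
```

Collecting the per-file frames in a list and concatenating once avoids growing a data frame inside the loop. Note that `pd.concat` raises an error if the directory contains no `.json` files.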

In [39]:
f_path = './rap_lyrics/female_lyrics/'

# empty df for female lyrics with the column titles: artist, title, lyrics
flyrics_df = pd.DataFrame(columns=['artist', 'title', 'lyrics'])

# for loop to populate flyrics_df
for filename in [file for file in os.listdir(f_path) if file.endswith('.json')]:
    # prints out a list of the filenames in the directory
    print(filename)
    # read in each filename and load it
    with open(f_path + filename) as json_file:
        data = json.load(json_file)
        # populate a temp_df with necessary info
        temp_df = pd.DataFrame.from_dict(data['songs'])
        temp_df = temp_df[['artist', 'title', 'lyrics']]
        # concatenate temp_df to flyrics_df
        flyrics_df = pd.concat([flyrics_df,temp_df])
flyrics_df

Lyrics_RicoNasty.json
Lyrics_MissyElliott.json
Lyrics_MeganTheeStallion.json
Lyrics_LilKim.json
Lyrics_CardiB.json
Lyrics_RemyMa.json
Lyrics_Rapsody.json
Lyrics_Trina.json
Lyrics_NickiMinaj.json
Lyrics_QueenLatifah.json


Unnamed: 0,artist,title,lyrics
0,Rico Nasty,10Fo,"Smoov, what's good, baby? (Woo)\nWake up F1LTH..."
1,Rico Nasty,Animal,"I'm a bear, you a mother fuckin' reindeer\nWhe..."
2,Rico Nasty,Ar-15,Pointing red lasers on you\nDo you need a head...
3,Rico Nasty,Arenas,Yeah-yeah-yeah-yeah\n\nI can't wait till I sel...
4,Rico Nasty,Back & Forth,CashMoneyAP\n\nI said I'm back in this bitch\n...
...,...,...,...
95,Queen Latifah,The World,"The world, oh, oh, oh, the world\nThe world, o..."
96,Queen Latifah,Trav’lin’ Light,I'm trav'lin' light\nBecause my man has gone\n...
97,Queen Latifah,Turn You On,Did I make you hot? Tell me\nI didn't mean to ...
98,Queen Latifah,U.N.I.T.Y.,"Uh, U.N.I.T.Y., U.N.I.T.Y. that's a unity\nU.N..."


In [40]:
# 10 artists, 962 values, 950 unique song titles, 952 unique song lyrics
    # there are duplicates here
# Nicki Minaj is the most common artist
# Crazy is the most common song title...must look into this...
flyrics_df.describe()

Unnamed: 0,artist,title,lyrics
count,962,962,962.0
unique,10,950,952.0
top,Nicki Minaj,Crazy,
freq,100,3,11.0


In [42]:
# 962 values in each column as expected!
    # REMEMBER: Cardi B (76), Remy Ma (86)
flyrics_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 962 entries, 0 to 99
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   artist  962 non-null    object
 1   title   962 non-null    object
 2   lyrics  962 non-null    object
dtypes: object(3)
memory usage: 30.1+ KB


In [43]:
# all values are as expected 
flyrics_df.artist.value_counts()

Nicki Minaj            100
Queen Latifah          100
Rico Nasty             100
Lil’ Kim               100
Rapsody                100
Megan Thee Stallion    100
Trina                  100
Missy Elliott          100
Remy Ma                 86
Cardi B                 76
Name: artist, dtype: int64

## There are some issues...let's clean
- I would like to remove any rows whose titles contain Skit or Interlude
    - I will explore the data frames to see what other problems there are
    - I'm seeing some duplicates that I will need to delete
    - There are also newline characters (definitely delete) and name headers marking featured artists who are not credited in the song titles, which I may or may not remove
        - My initial reaction is that I may not need to remove them, but then my conclusions will have to be about the songs overall rather than the individual artists.
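
A rough sketch of the first three steps on made-up rows (the toy data and the `clean_df` name are assumptions for illustration, not the cleaning that was eventually run):

```python
import pandas as pd

# Made-up rows standing in for the real lyrics frames
toy_df = pd.DataFrame({
    'artist': ['A', 'A', 'A', 'B', 'B'],
    'title': ['Busted Skit', 'Real Song', 'Real Song', 'Intro (Interlude)', 'Other Song'],
    'lyrics': ['Mamma!!\nkeep quiet', 'line one\nline two', 'line one\nline two',
               'talking', 'first\n\nsecond'],
})

# drop rows whose title mentions Skit or Interlude (case-insensitive)
is_skit = toy_df.title.str.contains(r'skit|interlude', case=False, regex=True)
clean_df = toy_df[~is_skit]

# drop exact repeats of the same artist, title and lyrics
clean_df = clean_df.drop_duplicates(subset=['artist', 'title', 'lyrics'])

# turn runs of newlines into single spaces
clean_df = clean_df.assign(lyrics=clean_df.lyrics.str.replace(r'\n+', ' ', regex=True))
clean_df = clean_df.reset_index(drop=True)
```

The pattern matches "skit" anywhere in a title, so it would also catch a song whose name merely contains those letters; checking the matched titles before dropping them would guard against that.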

In [44]:
mlyrics_df.head()
mlyrics_df.tail()

Unnamed: 0,artist,title,lyrics
0,J. Cole,03' Adolescence,La la la\nLa la la la la\nLa la la\nLa la la l...
1,J. Cole,102.1 Jamz Freestyle,For all y’all boys cheap talking\nKeep walking...
2,J. Cole,102 Jamz Freestyle,For all y’all boys cheap talking\nKeep walking...
3,J. Cole,1985,"1985, I arrived\n33 years, damn, I'm grateful ..."
4,J. Cole,2012,"Yes, straight out the Ville and I'm blessed\nN..."


Unnamed: 0,artist,title,lyrics
95,2Pac,Flex,"Flex, flex flex\nFlex, flex flex\n\nSlippin' t..."
96,2Pac,Forever And Today,U say that u'll love me forever but what about...
97,2Pac,For Mrs. Hawkins (In Memory of Yusef Hawkins),This poem is addressed 2 Mrs. Hawkins\nwho los...
98,2Pac,Fortune & Fame,"And my niggas say, we want the fame!\nC'mon\n\..."
99,2Pac,Friends,"I want to be, yo, let me fuck that nigga down\..."


In [45]:
# okay, [99] has only one character, so maybe drop any lyrics shorter than 10 characters?
# There are some quotation marks that aren't needed
# [8] has only a link and nothing else

mlyrics_df.sample(10)

Unnamed: 0,artist,title,lyrics
96,Kendrick Lamar,"​good kid, m.A.A.d city [Booklet]",
9,J. Cole,Album of the Year (Freestyle),Yeah\nMy mind state feel like the crime in the...
69,Snoop Dogg,Cadillacs,"Cadillacs, croker sacks\n501's, policies and g..."
94,Kanye West,DJ Khaled’s Son,I keep my gun everywhere I go like DJ Khaled's...
56,J. Cole,Disgusting,Can't help but think about it all the time\nYo...
4,Kendrick Lamar,A.D.H.D.,"Uh-uh, fuck that (fuck that)\n\nEight doobies ..."
10,Kendrick Lamar,Another Nigga (To Pimp a Butterfly),I remember you was conflicted\nMisusing your i...
94,JAY-Z,Do It Again (Put Ya Hands Up),Roc-A-Fella.. y'all know what this is\nWe givi...
57,2Pac,Check Out Time,"Ay what time is it nigga?\n(""I don't know."")\n..."
83,J. Cole,Get It,"Gotta get my groove back, you know?\nIt's been..."


In [47]:
# Let's check out what's up with Anything since it was the most frequent
mlyrics_df[mlyrics_df.title=='Anything']
# JAY-Z, Kanye and Lil Wayne all have a song called 'Anything'

Unnamed: 0,artist,title,lyrics
32,JAY-Z,Anything,"Uh huh yea, yeah\nDuro!\nYou gotta let it bump..."
26,Kanye West,Anything,I mean wow. You know? Man\n\nLookin' out my lo...
46,Lil Wayne,Anything,"I'd risk everything\nFor one kiss, everything\..."


In [37]:
# Sampling turned up some songs with odd titles
mlyrics_df[mlyrics_df.title=='.']
mlyrics_df[mlyrics_df.title=='E']

Unnamed: 0,artist,title,lyrics
0,Kanye West,.,.
0,Drake,.,.


Unnamed: 0,artist,title,lyrics
99,JAY-Z,E,E


Unnamed: 0,artist,title,lyrics
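
Placeholder rows like these, where the title and the lyrics are both a single character, could be caught with a length check. A sketch on made-up rows; the 10-character cut-off (`MIN_CHARS`) is an assumption based on the "less than 10" idea noted earlier, not a tested threshold:

```python
import pandas as pd

# Made-up rows: two placeholder entries and one real song
toy_df = pd.DataFrame({
    'artist': ['A', 'B', 'C'],
    'title': ['.', 'E', 'Real Song'],
    'lyrics': ['.', 'E', 'a full verse with plenty of words'],
})

MIN_CHARS = 10  # assumed cut-off, adjust after inspecting the flagged rows

# character count of each lyric after trimming surrounding whitespace
n_chars = toy_df.lyrics.str.strip().str.len()

too_short = toy_df[n_chars < MIN_CHARS]   # rows to inspect before dropping
kept_df = toy_df[n_chars >= MIN_CHARS]
```

Looking at `too_short` first, rather than dropping straight away, makes it easier to spot a real song that happens to fall under the cut-off.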


In [48]:
flyrics_df.head()
flyrics_df.tail()

Unnamed: 0,artist,title,lyrics
0,Rico Nasty,10Fo,"Smoov, what's good, baby? (Woo)\nWake up F1LTH..."
1,Rico Nasty,Animal,"I'm a bear, you a mother fuckin' reindeer\nWhe..."
2,Rico Nasty,Ar-15,Pointing red lasers on you\nDo you need a head...
3,Rico Nasty,Arenas,Yeah-yeah-yeah-yeah\n\nI can't wait till I sel...
4,Rico Nasty,Back & Forth,CashMoneyAP\n\nI said I'm back in this bitch\n...


Unnamed: 0,artist,title,lyrics
95,Queen Latifah,The World,"The world, oh, oh, oh, the world\nThe world, o..."
96,Queen Latifah,Trav’lin’ Light,I'm trav'lin' light\nBecause my man has gone\n...
97,Queen Latifah,Turn You On,Did I make you hot? Tell me\nI didn't mean to ...
98,Queen Latifah,U.N.I.T.Y.,"Uh, U.N.I.T.Y., U.N.I.T.Y. that's a unity\nU.N..."
99,Queen Latifah,Walk the Dinosaur (From ”Ice Age: Dawn of the ...,Boom boom acka-lacka lacka boom\nBoom boom ack...


In [49]:
# there are some section headers with curly brackets '{}'
# there is also punctuation that I won't need
# some clean versions made it through
flyrics_df.sample(10)

Unnamed: 0,artist,title,lyrics
38,Cardi B,Never Give Up,Cardi\nJosh X\n\nI see the pain in your eyes\n...
94,Megan Thee Stallion,W.A.B,"(Mario)\nWhen I say weak ass, you say bitch (A..."
14,Rapsody,Black Diamonds,Rap Diddy and Raekwon The Chef\nSalute to the ...
95,Missy Elliott,Religious Blessings - Outro,"Been caught up in the fame, a part of the glam..."
94,Rico Nasty,Same Thing,"Rico, rico\nRico, rico\nWoah!\n\nHe told her t..."
56,Lil’ Kim,How Many Licks?,"Hold up, so what you're saying is\n(Niggas got..."
7,Megan Thee Stallion,Big Pimpin,"Big pimpin', we spendin’ cheese\nBig pimpin' o..."
22,Trina,Busted Skit,"Mamma!!\nMan, keep dat kid quiet, damn!\nNeris..."
51,Missy Elliott,Higher Ground,Mmmhmmmm.. mmmmmm..\nOhh yea yea yea\nOhhhh.. ...
50,Missy Elliott,Go to the Floor,Hard-working\nThe star of the show\nMisdemeano...
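
The curly-bracket headers and the punctuation could probably be handled with regular expressions. A sketch on a made-up lyric string (the sample text and the choice to keep apostrophes are assumptions):

```python
import re

# Made-up lyric text with section headers in curly brackets
sample = "{Verse 1}\nHold up, so what you're saying is...\n{Chorus}\nYeah, yeah!"

# remove anything wrapped in curly brackets, e.g. {Verse 1}
no_headers = re.sub(r'\{[^}]*\}', '', sample)

# drop punctuation but keep straight apostrophes so "you're" survives
no_punct = re.sub(r"[^\w\s']", '', no_headers)

# collapse whitespace (including leftover newlines) into single spaces
cleaned = ' '.join(no_punct.split())
```

Some of the lyrics use curly apostrophes (’), which this pattern would strip unless they are added to the character class. The same patterns can be applied to a whole column with `Series.str.replace(..., regex=True)`.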


In [None]:
# will run the following code when both mlyrics_df and flyrics_df have been cleaned
# all_lyrics_df = pd.concat([mlyrics_df, flyrics_df])
# all_lyrics_df
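
When the two frames are combined, it may help to record which frame each row came from. A sketch on made-up stand-ins; the `group` column name and its values are hypothetical:

```python
import pandas as pd

# Made-up stand-ins for the cleaned mlyrics_df and flyrics_df
m_toy = pd.DataFrame({'artist': ['A'], 'title': ['Song 1'], 'lyrics': ['la la']})
f_toy = pd.DataFrame({'artist': ['B', 'C'], 'title': ['Song 2', 'Song 3'],
                      'lyrics': ['oh oh', 'hey']})

# 'group' is a hypothetical column recording which frame a row came from
all_toy = pd.concat(
    [m_toy.assign(group='male'), f_toy.assign(group='female')],
    ignore_index=True,   # one running index instead of two overlapping ones
)

all_toy['group'].value_counts()
```

Without such a column, the combined frame would need a lookup by artist name to separate the two groups again.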