**Requirement for project**

Only necessary to run once.

In [None]:
!pip install lyricsgenius
!pip install nltk

Collecting lyricsgenius
[?25l  Downloading https://files.pythonhosted.org/packages/41/c1/b7d56971a43e430214727daf774623d8edd0c13fe7bac1f484d0934af29b/lyricsgenius-2.0.2-py3-none-any.whl (46kB)
[K     |███████▏                        | 10kB 16.0MB/s eta 0:00:01[K     |██████████████▎                 | 20kB 19.4MB/s eta 0:00:01[K     |█████████████████████▍          | 30kB 18.0MB/s eta 0:00:01[K     |████████████████████████████▌   | 40kB 12.2MB/s eta 0:00:01[K     |████████████████████████████████| 51kB 4.2MB/s 
Installing collected packages: lyricsgenius
Successfully installed lyricsgenius-2.0.2


In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet as wn
from nltk import pos_tag
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk.data
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict
import operator
import lyricsgenius
import string
import requests
import urllib
import re
import numpy as np
import pandas as pd
from google.colab import files
from bs4 import BeautifulSoup
from google.colab import drive

# Mount your Drive to the Colab VM.
drive.mount("/content/drive")
PROFANITY_TEXT_URL = "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"




In [None]:
nltk.download('stopwords')
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

**We will now scrape the names of artists in the Billboard hot 100**

Given a year, this function will scrape billboard for the hot 100 songs and save artist names for that year and return a set with these names added. We use beautiful soup to clean our scraped data.


In [None]:
def get_billboard_year_songs(year):
    url = 'https://en.m.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_' + str(year)
    req = requests.get(url)
    bs4 = BeautifulSoup(req.text, 'lxml')
    rows = bs4.find('table').find_all('tr')[1:]
    # rows = rows
    artists = set()
    songLocator = ['td', 'th']
    for row in rows:
        cols = row.find_all(songLocator)
        cols = [t.text.strip().strip('"') for t in cols]
        try:
            artists.add(cols[2])
        except:
            continue
    return (artists)



**We get artist names for our desired range of years**

We iterate over our desired range of years and gather artist names that we have scraped. We then remove duplicates by doing a union operation across all gathered artist name sets.

In [None]:
START_YEAR = 1980
END_YEAR = 2020
artistSetList = []
for year in range(START_YEAR, END_YEAR + 1):
  artistSetList.append(get_billboard_year_songs(year))
artistNames = set().union(*artistSetList)
print("We will be collecting songs for " + str(len(artistNames)) + " artists")

We will be collecting songs for 1983 artists


**The below script retrieves our songs for us**

We use the lyricsgenius package to query the Genius API. We clean the artist string to remove features and strange formatting issues. 

We then query the genius API to retrieve an artist object from which we retrieve 25 songs per artist. Along with this, we also gather additional metadata points and the song lyrics.

Once we complete collecting songs for an artist, we periodically save this gathered data in a CSV file by mounting our drive. We will use this csv file henceforth and there is no need to run this script again.


In [None]:
genius = lyricsgenius.Genius("at79YK0XXNNSWi25Dh4EtJUke1t5s0lQY-5KqwMdWBZNthInRQibKkLrvAhY5x2z")
artists = list(artistNames)
songData = []

for artist in artists:
  artist = artist.split('featuring')
  try:
    artistObj = genius.search_artist(artist[0].strip(), max_songs = 25, sort="popularity")
    for song in artistObj.songs:
      songData.append({"Song" : song.title, "Artist": song.artist, "Year": song.year, "Album": song.album, "Lyric": song.lyrics})
    with open('/gdrive/My Drive/dataset.csv', 'w') as f:
      pd.DataFrame(songData).to_csv(f)
  except:
    continue


dataset = pd.DataFrame(songData)
with open('/gdrive/My Drive/dataset.csv', 'w') as f:
      dataset.to_csv(f)

NameError: ignored

**We will now read the saved CSV file and save it into dataframe**

We use pandas for easier manupilation of data. This format is scalable and if we decide to mine more data in the future on songs and add it to the csv file, it can seamlessly be loaded into a new dataframe.

In [None]:
df = pd.read_csv('/content/drive/MyDrive/dataset.csv')

**We will begin cleaning the retrieved data**

Most artist, album and song names are in the correct format for the purposes of this project. We need to clean the lyrics to prepare them for sentiment analysis. We do this by removing stopwords, fixing formatting issues and getting rid of unnecesary details such as verse labels etc. 

In [None]:
df.drop_duplicates()
df.Lyric = df.Lyric.fillna("")
df.Lyric = df.Lyric.str.lower()
df.Song = df.Song.str.lower()
df.Artist = df.Artist.str.lower()
df.Album = df.Album.str.lower()
df.Year = df.Year.str.slice(0, 4)
df.Lyric = df.Lyric.str.replace(r"\[.*\]", "")
df.Lyric = df.Lyric.str.replace(r"\{.*\}", "")
df.Lyric = df.Lyric.str.replace(r"\(", "")
df.Lyric = df.Lyric.str.replace(r"\)", "")
df.Lyric = df.Lyric.str.replace(r"\n", " ")
df.Lyric = df.Lyric.str.replace(r"\\", "")
df.Lyric = df.Lyric.str.strip()
df.Lyric = df.Lyric.str.replace(r"instrumental|intro|guitar|solo","")
df.Lyric = df.Lyric.str.replace("\n"," ").str.replace(r"[^\w\d'\s]+","").str.replace("efil ym fo flah","")
stop = stopwords.words('english')
df['SentimentLyrics'] = df.Lyric
df.Lyric = df.Lyric.apply(lambda x: [item for item in x.split() if item not in stop])
df.head()

Unnamed: 0.1,Unnamed: 0,Song,Artist,Year,Album,Lyric,SentimentLyrics
0,0,nobody knows,juelz santana,2013,god will’n,"[nobody, knows, go, put, inside, shoes, got, f...",nobody knows what i go through if you can put ...
1,1,soft,juelz santana,2013,god will’n,"[say, im, comin, hard, huh, cheaheh, say, nigg...",they say im comin too hard huh cheaheh i say t...
2,2,everything is good,juelz santana,2013,god will’n,"[wiz, khalifa, aight, hehehehehe, yeah, uhh, f...",wiz khalifa aight hehehehehe yeah uhh feeli...
3,3,time ticking,juelz santana,2016,the get back,"[getting, money, hating, real, nigga, celebrat...",we just getting to the money why they hating t...
4,4,black out,juelz santana,2013,god will’n,"[fly, nigga, i'm, taking, whooo, uh, flyfly, n...",fly nigga i'm taking off whooo uh flyfly nigga...


**We will now conduct sentiment analysis on our lyrics column**

We use VADER (Valence Aware Dictionary and sEntiment Reasoner) of the NLKT Python Library is a lexicon and rule-based sentiment analysis tool. 

We will be using VADER's Compound Metric that calculates the sum of all the lexicon rating, which is normalized between -1(max limit of negativity) and 1(max limit of positivity).

In [None]:
compoundScores = []
sid = SentimentIntensityAnalyzer()
for i in df.index:
    scores = sid.polarity_scores(df.SentimentLyrics.iloc[i])
    compoundScores.append(scores['compound'])
df['Sentiment'] = compoundScores
df.head()

Unnamed: 0.1,Unnamed: 0,Song,Artist,Year,Album,Lyric,SentimentLyrics,Sentiment
0,0,nobody knows,juelz santana,2013,god will’n,"[nobody, knows, go, put, inside, shoes, got, f...",nobody knows what i go through if you can put ...,0.9966
1,1,soft,juelz santana,2013,god will’n,"[say, im, comin, hard, huh, cheaheh, say, nigg...",they say im comin too hard huh cheaheh i say t...,-0.9991
2,2,everything is good,juelz santana,2013,god will’n,"[wiz, khalifa, aight, hehehehehe, yeah, uhh, f...",wiz khalifa aight hehehehehe yeah uhh feeli...,0.9781
3,3,time ticking,juelz santana,2016,the get back,"[getting, money, hating, real, nigga, celebrat...",we just getting to the money why they hating t...,-0.9887
4,4,black out,juelz santana,2013,god will’n,"[fly, nigga, i'm, taking, whooo, uh, flyfly, n...",fly nigga i'm taking off whooo uh flyfly nigga...,-0.9987


**We will now work to assign a profanity score to each song**

We do this by using a compiled list of profane words and counting occurances of these words within our lyrics. We will then normalize this count and add it as a column to our dataframe.

In [None]:
data = urllib.request.urlopen(PROFANITY_TEXT_URL)
bytetext = data.read()
profanityString = str(bytetext, 'utf-8')
profanitySet = set(profanityString.split())

In [None]:
def getProfanityCount(lyricWords):
  profanityCount = 0
  for word in lyricWords:
    if word in profanitySet:
      profanityCount += 1
  return profanityCount

In [None]:
counter = 0
profanityCount = []
for i in df.index:
  profanityCount.append(getProfanityCount(df.Lyric.iloc[i]))
df['Profanity'] = profanityCount
df.head()

Unnamed: 0.1,Unnamed: 0,Song,Artist,Year,Album,Lyric,SentimentLyrics,Sentiment,Profanity
0,0,nobody knows,juelz santana,2013,god will’n,"[nobody, knows, go, put, inside, shoes, got, f...",nobody knows what i go through if you can put ...,0.9966,9
1,1,soft,juelz santana,2013,god will’n,"[say, im, comin, hard, huh, cheaheh, say, nigg...",they say im comin too hard huh cheaheh i say t...,-0.9991,54
2,2,everything is good,juelz santana,2013,god will’n,"[wiz, khalifa, aight, hehehehehe, yeah, uhh, f...",wiz khalifa aight hehehehehe yeah uhh feeli...,0.9781,21
3,3,time ticking,juelz santana,2016,the get back,"[getting, money, hating, real, nigga, celebrat...",we just getting to the money why they hating t...,-0.9887,39
4,4,black out,juelz santana,2013,god will’n,"[fly, nigga, i'm, taking, whooo, uh, flyfly, n...",fly nigga i'm taking off whooo uh flyfly nigga...,-0.9987,40


In [None]:
def filterbyartist(df, artistname):
  return df.loc[df['Artist'] == artistname.lower()]

In [None]:
def filterbyalbum(df, albumname):
  return df.loc[df['Album'] == albumname.lower()]

In [73]:
def filterbyyear(df, startyear, endyear):
  #format of year - string as YYYY-MM-DD
  return df.loc[df['Year'] >= float(startyear) and df['Year'] <= float(endyear)]

**We will be using the following metric to evaluate sentiment** 

positive sentiment : (compound score >= 0.05)

neutral sentiment : (compound score > -0.05) and (compound score < 0.05)

negative sentiment : (compound score <= -0.05)

In [76]:
def filterbysentiment(df, word):
  happyset = {"joyful", "cheerful", "delightful", "pleasing", "jolly", "merry", "lighthearted", "ecstatic", "gleeful", "happy"}
  sadset = {"unhappy", "sorrowful", "downhearted", "miserable", "gloomy", "woeful", "melancholy", "despressing", "mournful", "distressing", "sad"}
  
  if word in happyset:
    return df.loc[df['Sentiment'] >= 0]

  elif word in sadset:
    return df.loc[df['Sentiment'] < 0]

  #SNF -  Sentiment not found
  else:
    return df

In [None]:
df = pd.read_csv('/content/drive/MyDrive/unnormalizedDataset.csv')

**The following cell will pre-process TF-IDF for similarity calculations**

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

We will be using this formula and applying the functions to the cleaned lyrics.

In [None]:
df = df.drop_duplicates(subset='Song', keep="first")
import ast 


# Create Vocabulary
vocabulary = set()
for doc in df.Lyric:
    res = ast.literal_eval(doc)
    vocabulary.update(res)

vocabulary = list(vocabulary)

tfidf = TfidfVectorizer(sublinear_tf = True, stop_words='english', vocabulary = vocabulary)
tfidf_tran = tfidf.fit_transform(df.Lyric)


In [None]:
def gen_vector_T(tokens):
    Q = np.zeros((len(vocabulary)))
    x = tfidf.transform(tokens)
    for token in tokens[0].split(','):
        try:
            ind = vocabulary.index(token)
            Q[ind] = x[0, tfidf.vocabulary_[token]]
        except:
            pass
    return Q

In [None]:
def cosine_sim(a, b):
    cos_sim = np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))
    return cos_sim

In [None]:
def cosine_similarity_T(k, query):
    preprocessed_query = preprocessed_query = re.sub("\W+", " ", query).strip()
    tokens = word_tokenize(str(preprocessed_query))
    q_df = pd.DataFrame(columns=['q_clean'])
    q_df.loc[0,'q_clean'] = tokens
    q_df['q_clean'] = ','.join(q_df.q_clean[0])
    d_cosines = []
    query_vector = gen_vector_T(q_df['q_clean'])
    for d in tfidf_tran.A:
        d_cosines.append(cosine_sim(query_vector, d))
    out = np.array(d_cosines).argsort()[-k:][::-1]
    a = pd.DataFrame()
    for i,index in enumerate(out):
        a.loc[i,'Song'] = df['Song'][index]
    for j, simScore in enumerate(d_cosines[-k:][::-1]):
        a.loc[j,'Score'] = simScore
    return a

In [None]:
cosine_similarity_T(10, 'guns fuck kill')
# df
# print(tfidf_tran)

  


Unnamed: 0,Song,Score
0,t. mata,0.0
1,gammer gerten’s needle,0.0
2,i robot,0.06368
3,sirius,0.013781
4,jazzy,0.032928
5,choir practice,0.025346
6,axel f,0.0
7,memories,0.0
8,,0.0
9,the one after,0.0


In [None]:
cols_to_norm = ['Profanity']
df[cols_to_norm] = df[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

In [67]:

df['Profanity'].median()

0.009900990099009901

In [69]:
def filterbyprofanity(df, prof):
  if prof == '0':
    return df.loc[df['Profanity'] < 0.009900990099009901]
  
  elif prof == '1':
    return df.loc[df['Profanity'] >= 0.009900990099009901]

  else:
    return df

In [None]:
def lyricSearch(df, lyrics):
  search = [lyrics]
  return df.loc[df['Lyric'].str.contains(lyrics.lower()) ]


In [None]:
def

In [62]:
lyricSearch(df,'dick')

Unnamed: 0.1,Unnamed: 0,Song,Artist,Year,Album,Lyric,SentimentLyrics,Sentiment,Profanity
1,1,soft,juelz santana,2013.0,god will’n,"['say', 'im', 'comin', 'hard', 'huh', 'cheaheh...",they say im comin too hard huh cheaheh i say t...,-0.9991,0.267327
4,4,black out,juelz santana,2013.0,god will’n,"['fly', 'nigga', ""i'm"", 'taking', 'whooo', 'uh...",fly nigga i'm taking off whooo uh flyfly nigga...,-0.9987,0.198020
5,5,my will,juelz santana,2013.0,god will’n,"['hope', 'son', 'learn', 'like', 'die', 'tonig...",i hope my son learn to be not like me if i die...,-0.9958,0.153465
25,25,damn!,youngbloodz (hip hop),2003.0,drankin’ patnaz,"['calling', 'come', 'back', 'streets', 'sean',...",they calling me to come back to the streets se...,0.9992,0.420792
27,27,whatchu lookin’ at,youngbloodz (hip hop),,drankin’ patnaz,"['yeah', 'yeah', 'yeah', 'whatchu', 'lookin', ...",yeah yeah yeah whatchu lookin at nigga whatchu...,-0.9978,0.242574
...,...,...,...,...,...,...,...,...,...
9154,11024,fuck what happens tonight,french montana,2013.0,excuse my french,"['fuck', 'ho', 'shit', 'fuck', 'fuck', 'boys',...",fuck all that ho shit fuck all you fuck boys b...,-0.9995,0.366337
9257,11167,one minute man,missy elliott,2001.0,miss e ...so addictive,"['ooh', 'want', 'need', ""can't"", 'stand', 'min...",ooh i don't want i don't need i can't stand no...,0.4645,0.064356
9260,11170,sock it 2 me,missy elliott,1997.0,supa dupa fly,"['hehe', 'nigga', ""i'm"", 'nasty', 'looking', '...",hehe nigga i'm nasty do it do it do it do it d...,-0.9879,0.143564
9261,11171,busa rhyme,missy elliott,1999.0,da real world,"['uh', 'slim', 'shady', 'uh', 'slim', 'shady',...",uh slim shady uh slim shady uh slim shady uh y...,-0.9981,0.193069


In [77]:
## input cell
print("-------SmartLyrics Search Engine---------")
newdf = df
artist = input('enter artist: ')
album = input('enter album: ')
startyear = (input('enter start year: '))
endyear = (input('enter end year: '))
lyrics = input('enter lyrics: ')
sentiment = input('enter sentiment: ')
profanity = input('enter profanity(explicit(1) not explicit(0): ')
lyricSearch(filterbyartist(df,artist), lyrics)

if artist == '':
  pass
else:
  newdf = filterbyartist(newdf, artist)

if album == '':
  pass
else:
  newdf = filterbyalbum(newdf, album)

if lyrics == '':
  pass
else:
  newdf = lyricSearch(newdf, lyrics)

if startyear == '' and endyear == '':
  pass
else:
  newdf = filterbyyear(newdf, startyear, endyear)

if sentiment == '':
  pass
else:
  newdf = filterbysentiment(newdf, sentiment)

if profanity == '':
  pass
else:
  newdf = filterbysentiment(newdf, profanity)


newdf








-------SmartLyrics Search Engine---------
enter artist: drake
enter album: scorpion
enter start year: 
enter end year: 
enter lyrics: 
enter sentiment: happy
enter profanity(explicit(1) not explicit(0): 1


Unnamed: 0.1,Unnamed: 0,Song,Artist,Year,Album,Lyric,SentimentLyrics,Sentiment,Profanity
1450,1499,god’s plan,drake,2018.0,scorpion,"[""wishin'"", ""wishin'"", ""wishin'"", ""wishin'"", ""...",and they wishin' and wishin' and wishin' and w...,0.932,0.024752
