# **Scraping From azlyrics.com using BeautifulSoup**

## Song Title Scraping

To begin, we import the libraries we need.

In [1]:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
from time import sleep
import csv
import requests
import urllib

We use a base URL to scrape all song titles from azlyrics. Note that {0} and {1} are placeholders for the first letter of the artist's name and the artist's page name, respectively.

In [2]:
base_url = 'https://www.azlyrics.com/{0}/{1}.html' 

We create a list of the artist names that we want to scrape and then make an empty dictionary (songs_dict) to put our scraped song data into.

In [3]:
artists_first_names = ["florence", "hayley", "hozier"]
artists_last_names = ["florencethemachine", "heynderickx", "hozier"]
artists_full_names = ["florencethemachine", "hayleyheynderickx", "hozier"]
artists = ["florencethemachine", "haleyheynderickx", "hozier"]
songs_dict = {}

The first character of each artist's name is what makes the base URL resolve correctly on azlyrics. Because Florence And The Machine is listed under its full band name, I had to make its values in artists_last_names and artists_full_names the same, so that the correct letter is used in the base URL during scraping.

In [4]:
first_char = [w[0] for w in artists]
first_char 

['f', 'h', 'h']
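
As a minimal sketch of the URL pattern described above, the helper below fills both placeholders from a single artist name: the first character goes into {0} and the full name into {1}. The function name artist_url is hypothetical and not part of the original notebook.

```python
# Same URL pattern as the notebook's base_url
base_url = 'https://www.azlyrics.com/{0}/{1}.html'

def artist_url(artist_slug):
    """Build an artist page URL; artist_url is a hypothetical helper name."""
    # {0} is the first character of the name, {1} is the whole name
    return base_url.format(artist_slug[0], artist_slug)

for slug in ["florencethemachine", "haleyheynderickx", "hozier"]:
    print(artist_url(slug))
```

This is the same call the scraping loop makes inline with base_url.format(artist[0], artist).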

I explained my process by using comments within the code. 

In [5]:
#Looping through each artist in the artists list to create a specific URL for each individual artist using the base_url
#base_url will use the first character of the artist's name and the full artist name
for artist in artists:
    artist_url = base_url.format(artist[0], artist)
    print("Going to url:", artist_url) #Check 1

    #Making a HTTP request and parsing the HTML
    headers = {'User-Agent': 'Mozilla/5.0'}  
    #Setting a user-agent header to avoid potential blocking by mimicking a web browser
    response = requests.get(artist_url, headers=headers)
    #Using get to send a HTTP GET request to the artist_url 
    content = response.content 
    soup = BeautifulSoup(content, 'html.parser') #Parsing using beautiful soup
    

    #Finding and retrieving the song list in the HTML structure of the page
    songs_list = soup.find_all('div', attrs={'class': 'listalbum-item'})

    #Making an empty list for this artist in songs_dict
    songs_dict[artist] = []
    
    #Then taking the text of X song 
    #Using strip() to remove any leading and trailing whitespaces in text
    #Adding cleaned text to artist's list of songs in 'songs_dict'
    for song in songs_list:
        song_name = song.text.strip()
        songs_dict[artist].append(song_name)
    print("Artist:", artist)
    print(songs_dict[artist][:5])
    #Making a 10 second delay to not overwhelm the server with requests 
    sleep(10)

#Looping through 'songs_dict' to print each artist's name and the total number of songs found for them
#Then printing out the entire dictionary with all of scraped song titles
for key, val in songs_dict.items():
    print(key, len(val))
print(songs_dict) #Check 2

#Run Time: ~32 seconds

Going to url: https://www.azlyrics.com/f/florencethemachine.html
Artist: florencethemachine
['Dog Days Are Over', 'Rabbit Heart (Raise It Up)', "I'm Not Calling You A Liar", 'Howl', 'Kiss With A Fist']
Going to url: https://www.azlyrics.com/h/haleyheynderickx.html
Artist: haleyheynderickx
['Drinking Song', "First I'm Sorry", 'Fish Eyes', 'Sane', 'No Face']
Going to url: https://www.azlyrics.com/h/hozier.html
Artist: hozier
['Take Me To Church', 'Angel Of Small Death & The Codeine Scene', 'Jackie And Wilson', 'Someone New', 'To Be Alone']
florencethemachine 113
haleyheynderickx 21
hozier 67
{'florencethemachine': ['Dog Days Are Over', 'Rabbit Heart (Raise It Up)', "I'm Not Calling You A Liar", 'Howl', 'Kiss With A Fist', 'Girl With One Eye', 'Drumming Song', 'Between Two Lungs', 'Cosmic Love', 'My Boy Builds Coffins', 'Hurricane Drunk', 'Blinding', "You've Got The Love", 'Swimming', 'Heavy In Your Arms(from "The Twilight Saga: Eclipse" soundtrack)', 'Ghosts(Demo)', "You've Got The Dirte
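
To illustrate the find_all step without sending a request, here is a small sketch that runs the same lookup on an inline HTML string. The markup in sample_html is a simplified, made-up stand-in; the real artist page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for an artist page; the real markup may differ
sample_html = """
<div id="listAlbum">
  <div class="album">album: <b>"Example Album"</b></div>
  <div class="listalbum-item"><a href="/lyrics/artist/songone.html">Song One</a></div>
  <div class="listalbum-item"><a href="/lyrics/artist/songtwo.html"> Song Two </a></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Same lookup as in the scraping loop: every div with class 'listalbum-item'
items = soup.find_all("div", attrs={"class": "listalbum-item"})

# .text gathers the text inside each div; strip() removes surrounding whitespace
titles = [item.text.strip() for item in items]
print(titles)
```

Only the divs with the matching class are returned, so the album header div is skipped.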

Next, I created a JSON file to hold the scraped song titles for each artist. More in-depth explanations can be found in the comments included in the code.

In [6]:
import json

#Naming a new JSON file
json_file = "Artists-Songs Mapping.json"

#Saving dictionary 'songs_dict' to 'Artists-Songs Mapping'
with open(json_file, "w") as file:
    json.dump(songs_dict, file)

#Loading 'Artists-Songs Mapping' file and printing contents
with open(json_file, "r") as file:
    data = json.load(file)
    print(data)

{'florencethemachine': ['Dog Days Are Over', 'Rabbit Heart (Raise It Up)', "I'm Not Calling You A Liar", 'Howl', 'Kiss With A Fist', 'Girl With One Eye', 'Drumming Song', 'Between Two Lungs', 'Cosmic Love', 'My Boy Builds Coffins', 'Hurricane Drunk', 'Blinding', "You've Got The Love", 'Swimming', 'Heavy In Your Arms(from "The Twilight Saga: Eclipse" soundtrack)', 'Ghosts(Demo)', "You've Got The Dirtee Love(Live)", 'Dog Days Are Over (Yeasayer Remix)', 'Falling', 'Are You Hurting The One You Love?', 'Addicted To Love', 'Bird Song', 'Hospital Beds', 'Hardest Of Hearts', 'Only If For A Night', 'Shake It Out', 'What The Water Gave Me', 'Never Let Me Go', 'Breaking Down', 'Lover To Lover', 'No Light, No Light', 'Seven Devils', 'Heartlines', 'Spectrum', 'All This And Heaven Too', 'Leave My Body', 'Remain Nameless(Deluxe Edition Bonus Track)', 'Strangeness & Charm(Deluxe Edition Bonus Track)', 'Bedroom Hymns(Deluxe Edition Bonus Track)', 'What The Water Gave Me (Demo)(Deluxe Edition Bonus Tra

Song Title Scraping Complete!
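
As an optional variation on the save and load step, the sketch below writes the file with an explicit UTF-8 encoding, `ensure_ascii=False`, and `indent=2`, which keeps accented characters readable and makes the file easier to inspect by hand. The data and the file name example_mapping.json are made up for illustration, and the file goes to a temporary directory so the notebook's own JSON file is untouched.

```python
import json
import os
import tempfile

# Made-up data for illustration only
songs = {"exampleartist": ["Song One", "Sóng Two"]}

# Writing to a temporary directory so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "example_mapping.json")

with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps characters like 'ó' as-is instead of \u escapes
    json.dump(songs, f, ensure_ascii=False, indent=2)

with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

# The round trip should give back an equal dictionary
print(loaded == songs)
```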

## Song Lyric Scraping

Similar to the song title scraping, we start by importing the libraries we need.

In [7]:
import threading
import queue
import requests 
from urllib.request import urlopen
from bs4 import BeautifulSoup
from time import sleep
import csv
import json
import re
import time
import random
import pandas as pd

I then created a proxy text file for each individual artist, so that each artist's requests can use their own session object and proxy list.

In [8]:
#Listing filenames to make
filenames = ["proxy_list_florencethemachine.txt", "proxy_list_haleyheynderickx.txt", "proxy_list_hozier.txt"]

#Looping through filenames and creating new text file for each one
for filename in filenames:
    with open(filename, "w") as f:
        f.write("")


In [9]:
#florencethemachine
with open("proxy_list_florencethemachine.txt", "w") as txt:
    txt.write("103.176.108.105:1402, 72.170.220.17:8080, 220.226.202.146:80, 80.240.130.161:8888, 47.56.110.204:8989, 50.223.129.106:80, 50.222.245.44:80, 41.207.187.178:80, 50.207.199.85:80, 178.33.3.163:8080, 50.219.106.81:80, 50.221.227.130:80, 50.223.129.104:80, 50.221.203.193:80, 117.251.103.186:8080,50.227.121.34:80,50.217.153.73:80,68.185.57.66:80,50.217.226.42:80,50.207.199.87:80,50.237.89.165:80,213.171.44.134:3128,50.227.121.37:80,50.174.7.157:80,77.48.244.78:80,47.74.226.8:5001,184.10.84.74:80,50.168.49.111:80,50.228.141.96:80,50.217.153.72:80,213.143.113.82:80,50.217.29.198:80,50.222.245.50:80,50.168.49.108:80,50.169.62.110:80,107.1.93.216:80,50.175.31.244:80,96.113.159.162:80,50.218.57.67:80,50.221.203.205:80,50.223.129.110:80,50.173.157.72:80,50.219.106.82:80,50.171.32.227:80,50.223.38.98:80,50.173.140.146:80,50.170.90.26:80,66.191.31.158:80,107.1.93.217:80,50.169.62.109:80,50.228.141.98:80,50.204.219.225:80,50.223.129.111:80,50.218.57.70:80,85.26.146.169:80,50.168.72.114:80,50.228.141.100:80,50.206.111.90:80,50.217.226.44:80,50.221.203.195:80,80.228.235.6:80,50.218.57.64:80,32.223.6.94:80,50.222.245.43:80,50.169.62.104:80,50.173.140.149:80,50.220.168.134:80,50.207.199.80:80,50.168.163.181:80,50.169.62.107:80,47.177.148.110:80,50.174.7.158:80,50.173.140.144:80,50.218.57.68:80,50.168.10.174:80,50.219.106.85:80,50.207.199.82:80,50.171.32.226:80,50.221.203.222:80,50.168.163.183:80,50.231.110.26:80,50.173.140.147:80,50.223.129.107:80,50.168.72.113:80,50.230.222.202:80,50.171.2.11:80,50.217.226.43:80,50.222.245.40:80,50.219.106.83:80,50.217.226.47:80,50.206.25.107:80,107.1.93.223:80,77.73.241.154:80,50.221.203.210:80,50.223.129.109:80,50.239.72.18:80,50.237.89.164:80,50.206.25.108:80,201.148.32.162:80,50.222.245.46:80,191.37.208.201:8080,190.103.87.3:8080,186.3.38.200:999,190.110.35.16:999,45.236.171.76:999,103.78.36.163:46977,50.235.240.86:80,103.167.134.31:80,50.122.86.118:80,50.168.72.116:80,107.1.93.215:80,116.203.28.43:80,50.168.163.179:80,50.237.89.16
2:80,50.219.106.84:80,50.174.7.162:80,81.25.227.216:3128,50.174.145.12:80,75.114.77.35:8080,50.221.203.208:80,20.206.106.192:80,47.88.3.19:8080,186.121.235.222:8080,50.204.219.226:80,75.114.77.37:8080,50.173.157.70:80,212.107.31.118:80,50.204.233.30:80,50.168.10.168:80,103.36.25.253:80,50.168.163.177:80,50.174.145.15:80,50.171.32.230:80,108.48.164.252:80,50.204.190.234:80,107.1.93.218:80,50.231.104.58:80,75.114.77.38:8080,96.113.158.126:80,50.171.2.9:80,134.209.29.120:8080,103.149.130.38:80,50.168.72.118:80,50.237.89.163:80,50.206.111.91:80,50.221.203.192:80,50.239.72.19:80,50.221.203.212:80,103.168.53.53:41367,50.228.141.103:80,50.221.203.216:80,91.107.140.81:80,103.49.202.252:80,202.61.192.193:80,68.188.59.198:80,213.33.126.130:80,50.207.199.86:80,207.2.120.19:80,103.83.232.122:80,47.88.62.42:80,213.33.2.28:80,185.140.53.137:80,62.99.138.162:80,178.234.31.40:3128,50.171.32.228:80,221.151.181.101:8000,158.69.66.131:9300,158.160.56.149:8080,176.95.54.202:83,154.202.110.223:3128,38.15.154.68:3128,64.137.90.165:5785,154.202.115.11:3128,154.202.124.95:3128,154.202.96.123:3128,23.230.167.29:3128,154.202.118.203:3128,154.202.120.213:3128,154.202.120.95:3128,154.202.109.134:3128,154.202.120.85:3128,154.202.98.155:3128,154.202.109.232:3128,154.202.121.252:3128,154.202.96.103:3128,154.202.120.113:3128,154.84.143.25:3128,13.127.176.125:80,167.172.182.154:80,154.201.62.63:3128,103.76.117.197:6462,161.123.208.228:6472,64.137.75.78:5998,139.59.36.242:80,138.197.178.229:80,64.137.126.25:6633,159.112.235.143:80,154.202.115.85:3128,206.189.56.220:80,154.202.108.13:3128,154.202.109.154:3128,64.137.75.31:5951,154.202.96.215:3128,141.193.213.142:80,154.84.143.83:3128,142.93.132.62:80,3.6.249.143:80,45.114.12.170:5238,64.137.126.242:6850,161.123.209.37:6537,139.59.157.13:80,3.16.193.154:80,157.245.137.189:80,154.201.62.229:3128,68.183.90.211:80,3.20.3.157:80,154.202.120.21:3128,154.202.117.89:3128,13.233.196.149:80,154.202.98.145:3128,154.202.116.102:3128,45.114.15.148:6129,194.50.243
.156:3128,154.201.62.167:3128,154.202.115.191:3128,154.202.98.93:3128,154.202.108.101:3128,154.202.120.109:3128,154.83.9.50:3128,154.202.98.47:3128,154.202.120.115:3128,154.202.121.250:3128,154.202.121.20:3128,154.202.96.45:3128,154.202.108.221:3128,52.78.94.158:80,154.202.108.187:3128,66.235.200.157:80,154.202.121.188:3128,68.183.229.114:80,154.201.62.65:3128,154.83.9.74:3128,154.202.126.87:3128,45.114.12.135:5203,154.201.61.131:3128,203.13.32.55:80,154.202.120.233:3128,203.34.28.215:80,51.83.96.229:80,142.93.207.16:80,138.197.132.234:80,154.202.119.46:3128,3.14.150.7:80,154.201.63.221:3128,3.21.218.72:80,154.202.109.50:3128,154.202.112.120:3128,154.201.61.213:3128,154.202.115.7:3128,154.202.122.167:3128,154.202.108.177:3128,154.201.63.83:3128,194.50.243.240:3128,194.50.243.118:3128,154.202.113.13:3128,154.202.123.178:3128,154.202.96.203:3128,154.202.115.131:3128,154.202.98.103:3128,13.124.46.251:80,154.202.121.126:3128,64.137.126.212:6820,154.201.61.69:3128,154.202.96.149:3128,154.84.143.91:3128,167.71.252.61:80,154.202.125.228:3128,154.202.119.14:3128,154.201.63.39:3128,154.83.10.33:3128,154.202.122.157:3128,167.71.56.252:80,167.71.158.219:80,154.202.119.10:3128,154.202.122.85:3128,154.202.114.222:3128,103.76.117.189:6454,154.202.126.71:3128,154.201.62.21:3128,154.202.121.226:3128,154.202.126.9:3128,154.201.63.63:3128,154.202.114.202:3128,194.50.243.28:3128,154.84.142.43:3128,154.84.142.159:3128,154.202.111.177:3128,154.202.122.233:3128,194.50.243.108:3128,154.202.109.52:3128")

In [10]:
#haleyheynderickx
with open("proxy_list_haleyheynderickx.txt", "w") as txt:
    txt.write("115.144.99.220:11116,81.10.92.35:8080,190.102.229.74:999,187.63.157.60:999,222.127.75.23:8085,13.37.228.120:80,117.102.75.234:8080,154.236.168.179:1981,174.108.200.2:8080,147.182.204.92:3129,41.57.15.46:6060,41.186.44.106:3128,27.54.71.234:8080,177.105.232.114:8080,189.251.19.215:999,164.52.12.230:3128,176.213.141.107:8080,167.250.29.235:3128,51.222.131.111:8050,192.111.150.17:8080,94.45.208.164:8080,103.165.126.66:8080,104.248.83.79:80,163.53.186.148:80,103.154.86.46:8080,103.36.35.135:8080,36.37.146.119:32650,197.210.186.226:8080,103.6.177.174:8002,139.193.102.34:8080,154.236.168.141:1976,45.81.145.128:8080,50.84.48.130:8080,86.105.188.14:8080,82.222.11.215:8080,103.95.40.218:8080,43.251.168.254:8080,217.77.218.193:3128,196.204.24.251:8080,203.202.255.67:8080,186.3.155.25:8080,123.25.15.209:9812,114.4.226.247:8080,103.126.87.177:8080,114.129.18.94:8080,200.111.186.215:999,116.58.239.34:80,200.160.105.197:3128,129.154.225.163:8100,103.164.58.88:8080,111.95.41.154:8080,47.242.3.214:8081,201.77.110.1:999,45.174.87.18:999,103.144.38.65:8080,121.126.200.123:11361,124.40.246.210:8080,200.106.167.130:999,176.98.234.124:8080,103.171.31.127:8080,129.151.191.20:80,103.187.167.182:8080,142.93.223.219:8080,93.170.90.223:3128,103.175.242.32:8080,88.255.185.247:8080,191.102.68.109:999,125.167.12.181:8080,185.74.6.247:80,186.148.182.154:999,36.89.158.91:4480,161.97.97.155:3128,124.198.90.115:12652,24.152.40.49:8080,209.97.171.82:8080,66.188.181.143:8080,179.61.229.86:999,66.85.128.252:8080,103.247.22.125:3127,201.91.82.155:3128,115.144.1.222:12089,152.67.46.249:3128,45.190.52.24:8080,168.195.211.189:8080,103.169.254.164:8061,165.16.28.96:8080,103.124.138.67:3126,23.132.48.1:999,45.167.253.225:999,183.89.208.108:8080,91.250.83.200:3128,191.7.216.26:8080,47.90.126.138:9090,88.255.217.17:8080,148.251.110.152:3128,88.99.21.162:3128,121.58.210.211:8080,154.236.191.51:1981,103.167.71.39:8080,217.197.237.74:8080,203.192.217.11:8080,115.144.99.223:11119,200.123.29.45:3128
,41.33.254.186:1981,104.128.102.195:8080,41.65.55.10:1976,102.38.6.225:8080,14.194.101.219:3128,103.155.196.114:8080,45.230.169.253:999,177.69.180.171:8080,201.168.136.169:999,43.132.175.181:81,212.175.118.169:8080,190.15.221.21:8080,85.172.15.98:80,212.174.17.15:8085,103.171.164.98:8080,177.93.51.168:999,200.116.198.222:9812,81.12.36.51:3128,72.169.67.101:87,36.92.93.61:8080,51.79.248.87:3128,177.234.211.63:999,43.240.101.89:8080,102.129.157.231:8080,45.173.12.142:1994,41.65.103.15:1976,182.253.45.223:32650,103.112.253.89:32650,43.242.239.36:3128,103.118.44.244:8080,102.129.157.171:8080,103.118.44.24:8080,179.49.119.214:8080,18.184.26.64:80,45.174.249.45:999,45.133.168.50:8080,37.61.78.196:8888,94.20.183.172:80,88.119.22.135:8080,103.11.107.50:3125,93.157.196.58:8080,102.129.157.59:8080,103.118.44.56:8080,102.129.157.98:8080,177.125.89.101:8080,179.49.118.13:8080,102.129.157.5:8080,102.129.157.254:8080,103.167.172.104:41890,200.82.188.101:999,102.38.27.12:8080,102.129.157.163:8080,41.210.138.242:8080,181.78.64.83:999,108.170.12.11:80,102.129.157.121:8080,103.118.44.224:8080,103.154.230.106:5678,103.152.232.74:8080,124.158.175.26:8080,200.55.250.20:6969,201.71.2.143:999,110.78.114.161:8080,102.38.22.32:8080,102.39.193.213:8080,180.211.158.122:58375,206.189.199.91:80,167.99.124.118:80,102.134.98.222:8081,178.33.3.163:8080,196.20.125.149:8083,89.117.32.209:80,103.155.217.105:41403,117.54.114.32:80,143.198.228.250:80,3.226.168.144:80,190.82.105.123:43949,201.238.248.139:9229,35.209.198.222:80,5.135.136.60:9090,103.130.90.203:80,188.166.56.246:80,206.189.146.13:8080,103.174.102.127:80,54.219.125.50:8080,137.74.65.101:80,196.20.125.157:8083,162.241.207.217:80,110.49.34.126:32650,14.139.242.7:80,67.205.179.93:31028,142.93.61.46:80,68.183.143.134:80,195.201.99.153:80,37.27.6.46:80,113.53.231.133:3129,190.2.137.225:3128,184.60.66.122:80,190.202.3.22:32650,212.112.113.178:3128,35.222.50.197:80,176.9.238.155:16379,198.199.86.11:3128,103.203.136.253:80,181.170.189.125:8080,8.2
09.114.72:3129,146.196.54.68:443,86.100.71.126:8080,117.54.114.102:80,103.175.99.167:80,153.122.86.46:80,124.198.11.101:12425,122.175.58.131:80,143.47.185.211:80,64.176.5.119:80,103.216.103.163:80,51.222.152.223:80,219.78.195.115:80,186.121.235.222:8080,47.252.1.180:1234,159.203.3.234:80,190.5.77.211:80,51.178.47.12:80,103.37.88.10:80,190.103.177.131:80,34.122.187.196:80,200.69.210.59:80,94.232.11.178:46449,117.54.114.103:80,162.19.50.37:80,172.105.128.71:56444,51.15.242.202:8888,207.2.120.57:80,202.169.229.139:53281,165.232.169.44:8080,187.44.167.78:60786,134.209.29.120:3128,129.151.141.65:80,91.229.114.137:80,41.77.188.131:80,20.219.235.172:3129,182.72.203.246:80,78.28.152.111:80,23.238.33.186:80,163.172.85.30:80,20.219.178.121:3129,207.2.120.16:80,50.224.251.204:80,103.197.251.202:80,103.83.232.122:80,212.107.31.118:80,20.219.176.57:3129,47.74.152.29:8888,144.217.233.75:80,212.182.90.118:80,203.109.19.137:12241,110.164.177.114:80,157.245.65.206:80,209.145.60.213:80,91.221.67.197:8082,45.85.45.30:80,162.223.94.164:80,217.76.50.200:8000,158.160.56.149:8080,20.219.180.149:3129,207.2.120.15:80,34.211.142.157:3128,62.201.218.82:8080,190.90.8.74:8080,185.128.240.66:8080,38.44.237.62:999,167.250.51.71:999,103.173.230.50:1080,50.199.32.226:8080,202.93.245.46:8080,45.174.79.101:999,157.119.211.133:8080,124.158.167.242:8080,192.111.150.11:8080,43.153.117.113:8800,84.74.141.235:80,36.91.45.11:51672,103.78.97.38:8080,45.5.92.94:8137,115.124.68.226:8080,104.211.29.96:80,167.172.62.114:80,186.121.235.66:8080,202.0.107.133:80,154.79.254.236:32650,128.199.202.122:3128,147.50.205.2:8080,35.199.88.137:80,51.75.122.80:80,139.177.185.242:80,38.242.244.29:80,142.11.232.45:80")

In [11]:
#hozier
with open("proxy_list_hozier.txt", "w") as txt:
    txt.write("107.1.93.214:80,50.221.203.195:80,50.168.49.105:80,50.168.49.111:80,50.221.203.193:80,50.204.219.225:80,107.1.93.221:80,50.168.10.171:80,50.221.203.222:80,20.206.106.192:80,206.189.30.235:80,91.249.134.148:80,50.227.121.34:80,50.227.121.33:80,41.230.216.70:80,50.171.32.226:80,50.222.245.41:80,50.206.25.106:80,50.169.62.107:80,50.221.227.130:80,50.206.25.109:80,107.1.93.219:80,50.207.253.118:80,50.173.140.149:80,50.174.145.8:80,50.169.62.110:80,50.170.90.34:80,50.219.106.85:80,212.145.210.146:80,50.228.141.96:80,50.171.32.227:80,50.168.49.108:80,50.219.106.81:80,50.207.199.83:80,50.170.90.25:80,172.104.97.150:32539,47.74.152.29:8888,107.1.93.211:80,50.168.49.106:80,50.219.106.80:80,50.175.31.247:80,50.175.31.244:80,50.227.121.36:80,212.107.31.118:80,50.204.219.231:80,50.169.62.109:80,190.61.88.147:8080,207.2.120.19:80,50.202.75.26:80,50.171.32.229:80,50.217.29.198:80,213.157.6.50:80,82.119.96.254:80,50.168.72.122:80,50.169.91.138:80,24.205.201.186:80,50.173.157.74:80,50.221.74.130:80,50.218.57.66:80,50.172.71.203:80,127.0.0.7:80,50.169.37.50:80,50.237.89.170:80,50.217.226.41:80,50.217.153.75:80,50.206.111.90:80,107.1.93.208:80,50.168.72.113:80,107.1.93.223:80,47.177.148.110:80,50.173.140.146:80,50.217.226.44:80,50.168.72.114:80,68.185.57.66:80,50.173.140.147:80,50.200.12.86:80,50.170.90.31:80,194.158.203.14:80,50.207.199.82:80,50.168.49.109:80,50.221.203.219:80,50.217.153.79:80,50.207.199.81:80,50.223.38.6:80,50.206.25.107:80,80.228.235.6:80,50.173.157.72:80,50.169.62.106:80,213.143.113.82:80,77.73.241.154:80,50.170.90.26:80,50.173.140.144:80,50.171.2.13:80,50.172.23.10:80,50.217.226.42:80,50.218.57.64:80,20.210.26.214:3333,50.168.163.181:80,50.207.199.80:80,50.168.72.115:80,188.64.132.59:3127,165.154.236.174:80,34.88.86.0:8888,50.217.153.77:80,50.171.32.228:80,50.219.106.74:80,134.195.101.34:8080,103.149.130.38:80,77.48.244.78:80,107.1.93.222:80,50.204.219.229:80,50.168.10.173:80,50.218.57.71:80,50.204.219.224:80,80.240.130.161:8888,50.173.157.78:80,5
0.237.89.164:80,50.235.240.86:80,178.21.163.24:80,50.168.72.112:80,50.175.31.250:80,107.1.93.217:80,50.231.110.26:80,50.218.57.68:80,50.237.89.162:80,103.83.232.122:80,50.174.145.14:80,50.175.31.243:80,50.223.129.104:80,50.222.245.50:80,50.221.203.208:80,50.217.226.46:80,50.170.90.24:80,43.154.150.99:8080,50.204.190.234:80,50.218.57.70:80,50.221.203.192:80,103.49.202.252:80,50.200.12.85:80,50.217.153.74:80,117.251.103.186:8080,47.56.110.204:8989,50.169.62.114:80,159.203.61.169:3128,129.153.157.63:3128,50.168.34.138:80,47.88.62.42:80,50.171.2.11:80,50.174.7.153:80,50.218.57.74:80,50.200.12.84:80,50.206.111.88:80,139.59.1.14:8080,50.168.163.183:80,135.181.53.229:80,0.0.0.0:80,85.26.146.169:80,50.219.106.82:80,50.219.106.83:80,50.238.154.98:80,201.148.32.162:80,50.217.153.73:80,50.231.104.58:80,50.218.57.67:80,75.114.77.38:8080,50.171.2.9:80,50.174.7.154:80,50.169.62.104:80,50.204.233.30:80,75.114.77.35:8080,50.206.111.89:80,50.174.145.15:80,137.74.65.101:80,202.5.16.44:80,50.222.245.45:80,50.200.12.81:80,181.78.64.83:999,195.222.165.209:3128,45.136.50.113:3128,45.81.146.7:8080,190.121.195.78:999,201.184.53.180:999,167.250.50.11:999,177.234.206.180:999,103.94.125.110:8080,72.170.220.17:8080,103.168.53.105:41407,186.121.235.66:8080,64.225.4.85:9989,183.181.8.173:11070,103.184.50.27:8080,129.159.112.251:3128,27.254.217.116:8081,50.204.219.227:80,50.221.203.205:80,178.33.3.163:8080,50.223.129.108:80,50.230.222.202:80,50.223.129.111:80,107.1.93.216:80,50.217.226.47:80,50.217.226.43:80,50.221.203.210:80,50.239.72.18:80,62.99.138.162:80,197.255.125.12:80,50.222.245.40:80,50.223.129.109:80,50.168.10.174:80,50.222.245.46:80,50.174.7.158:80,50.227.121.37:80,50.207.199.87:80,184.10.84.74:80,50.206.25.108:80,50.237.89.165:80,50.222.245.43:80,50.223.129.107:80,50.174.7.157:80,50.222.245.44:80,85.8.68.2:80,96.113.159.162:80,50.221.230.186:80,173.249.23.14:80,50.223.129.110:80,41.207.187.178:80,50.223.129.106:80,103.148.76.153:80,47.88.3.19:8080,186.121.235.222:8080,91.187.113.50:80
80,50.175.31.240:80,50.239.72.19:80,50.217.153.76:80,50.200.12.87:80,50.237.89.163:80,50.220.21.202:80,50.174.7.152:80,50.217.153.78:80,50.221.203.209:80,50.239.72.16:80,50.206.25.111:80,50.171.1.222:80,50.171.32.222:80,50.228.141.99:80,62.141.11.68:80,50.206.25.104:80,50.221.203.217:80,50.171.32.224:80,41.77.188.131:80,50.173.157.73:80,50.204.219.230:80,50.239.72.17:80,50.227.121.39:80,50.219.106.86:80,50.168.210.226:80,50.171.2.12:80,50.168.163.166:80,50.173.140.145:80,50.171.32.225:80,8.219.97.248:80,50.227.121.38:80,50.173.157.79:80,107.1.93.215:80,50.173.140.148:80,107.1.93.213:80,50.206.111.91:80,50.218.57.69:80,50.204.219.228:80,50.237.207.186:80,50.169.62.111:80,96.113.158.126:80,50.174.145.9:80,50.228.141.102:80,50.174.7.159:80,50.237.89.166:80,50.223.129.105:80,50.231.167.218:80,50.219.106.84:80,107.1.93.212:80,50.174.7.156:80,50.171.32.231:80,50.221.203.212:80,107.1.93.209:80,50.207.199.84:80,118.69.111.51:8080,80.120.130.231:80,50.221.203.216:80,202.61.192.193:80,50.221.203.218:80,181.214.29.14:999,47.254.244.202:8080,171.5.15.92:8081,216.169.73.65:34679,45.5.68.2:999,220.226.202.146:80,103.176.108.105:1402,47.74.226.8:5001,50.217.153.72:80,50.220.168.134:80")

Note: The proxy list files contain free proxy addresses from free-proxy-list.net.
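
Since the proxy files store the addresses as one comma-separated string, a reader needs to split on commas before using them. The sketch below shows one way to do that and to build the `proxies` mapping that requests accepts. The helper names parse_proxy_text and pick_proxies are hypothetical, and using the same proxy for both schemes is an assumption made here for simplicity.

```python
import random

def parse_proxy_text(text):
    """Split comma-separated 'host:port' entries, ignoring whitespace and line breaks."""
    entries = [p.strip() for p in text.replace("\n", ",").split(",")]
    return [p for p in entries if p]

def pick_proxies(proxy_list, rng=random):
    """Build a requests-style proxies mapping from one randomly chosen entry."""
    proxy = "http://" + rng.choice(proxy_list)
    return {"http": proxy, "https": proxy}

# Short sample in the same format as the proxy files
sample = "103.176.108.105:1402, 72.170.220.17:8080,\n220.226.202.146:80"
proxy_list = parse_proxy_text(sample)
print(proxy_list)
print(pick_proxies(proxy_list))
```

The resulting dictionary could be assigned to a session's proxies attribute or passed as the proxies argument of requests.get.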

This code worked before, but when I re-ran it, it seemed to have trouble scraping the text for the song titles. So I tried...

In [12]:
import requests
from bs4 import BeautifulSoup
import json
import re
import random
from time import sleep

#Loading Artists-Songs Mapping.json file
with open("Artists-Songs Mapping.json") as file:
    songs_dict = json.load(file)

#Listing all artists (lyrics need to be found and written for each one)
artists = ["florencethemachine", "haleyheynderickx", "hozier"]

#Dictionary mapping each artist to a path to their file containing their proxy list
proxy_files = {
    "florencethemachine": "proxy_list_florencethemachine.txt",
    "haleyheynderickx": "proxy_list_haleyheynderickx.txt",
    "hozier": "proxy_list_hozier.txt"
}

#base_url to scrape the lyrics from
base_url = "https://www.azlyrics.com/lyrics/{}/{}.html"

#Making a file in which the lyrics will be saved
lyrics_file = "lyrics_scraped_USE2.txt"

#Making variables/lists
song_lyrics = []
titles = []
lyrics_not_found_for = []

#Setting limits for a randomized delay between requests
min_delay = 7
max_delay = 23

#Making a file
with open(lyrics_file, "w") as file:
    for artist in artists: #Looping through artists
        artist_last_name = artist.replace(" ", "").lower() #No spaces + all lowercase of artists names
        proxy_file = proxy_files.get(artist_last_name) #Connecting to individual proxy list file
        proxy_list = []

        if proxy_file is not None:
            #Loading the proxy list for the current artist from the specified file
            with open(proxy_file, "r") as f:
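                #The proxy files are comma-separated, so splitting on commas and trimming whitespace
                proxy_list = [p.strip() for p in f.read().split(",") if p.strip()]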

        songs = songs_dict[artist]
        processed_songs = []

        #Preprocessing the songs name for scraping
        for song in songs:
            parentheses_removed = re.sub(r'\(.*\)',"",song) #Removing text within parentheses
            processed_song = re.sub(r'\W+', '', parentheses_removed).lower() #Removing non-word characters and lowercasing
            processed_songs.append(processed_song)

        #Removing duplicate songs
        processed_songs = list(set(processed_songs))

        #Starting a requests Session that uses randomly chosen proxies from the current artist's list
        requests_tool_proxy = requests.Session()
        requests_tool_proxy.proxies = {'http': random.choice(proxy_list), 'https': random.choice(proxy_list)}

        for song in processed_songs:
            final_url = base_url.format(artist, song)

            try:
                #Making a request to the URL through the proxied session
                response = requests_tool_proxy.get(final_url)

                #Seeing if the request was blocked by AZLyrics
                if "request for access" in response.text.lower():
                    print("Request for access detected. Waiting...")
                    #If blocked, do a randomized delay and try again
                    delay = random.randint(min_delay, max_delay)
                    sleep(delay)
                    response = requests_tool_proxy.get(final_url)

                #Parsing the HTML response using BeautifulSoup
                soup = BeautifulSoup(response.text, 'html.parser')
                
                #Finding song name
                html_pointer = soup.find('div', attrs={'class':'ringtone'})
                song_name = html_pointer.find_next('b').contents[0].strip()
                
                #Getting lyrics
                lyrics = html_pointer.find_next('div').text.strip()
                
                #Appending lyrics and song name to their respective lists
                song_lyrics.append(lyrics)
                titles.append(song_name)

                #Writing the lyrics to the lyrics file
                file.write("###"+song_name+"###")
                file.write("\n\n")
                file.write(lyrics)
                file.write("\n\n")
                
                print("Lyrics successfully written to file for : " + song_name)
                
            except Exception: #Unlike a bare except, this lets KeyboardInterrupt stop the run
                print("Lyrics not found for : " + song)
                lyrics_not_found_for.append(song)
                
            finally:
                #Randomized delay between requests
                delay = random.randint(min_delay, max_delay)
                sleep(delay)

#Printing the list of songs for which lyrics were not found
print("Lyrics not found for the following songs:")
print(lyrics_not_found_for)

Lyrics not found for : conductor


KeyboardInterrupt: 

The code below is a modified version of the code above. Specifically, it does not use a proxy list, since it seemed that the website was possibly detecting that code like the one above was using different proxies to scrape and blocking those requests.
I explained my process by using comments within the code.
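
The block check and randomized delay used in these cells can also be pulled out into a small helper. The sketch below is one possible shape, not the notebook's method: fetch_with_retry is a hypothetical name, the fetch function is passed in so the logic can be tried without touching the network, and the limit of three attempts is an assumption.

```python
import random
from time import sleep

def fetch_with_retry(fetch, url, max_attempts=3, min_delay=7, max_delay=23, wait=sleep):
    """Call fetch(url) until the page no longer looks blocked or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        text = fetch(url)
        if "request for access" not in text.lower():
            return text
        if attempt < max_attempts:
            # Randomized pause so requests do not arrive at a fixed rhythm
            wait(random.randint(min_delay, max_delay))
    return None  # still blocked after max_attempts

# Stand-in for a real request: blocked on the first call, fine on the second
responses = iter(["Request for access", "<html>lyrics page</html>"])
result = fetch_with_retry(lambda url: next(responses), "https://example.com",
                          wait=lambda seconds: None)
print(result)
```

With requests, the fetch argument could be something like `lambda url: requests.get(url).text`, leaving wait at its default so the pauses are real.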

In [13]:
#KEEP: works without proxies, probably because the proxies were being blocked

import requests
from bs4 import BeautifulSoup
import json
import re
import random
from time import sleep

#List of artists for which the lyrics need to be written
artists = ["florencethemachine", "haleyheynderickx", "hozier"]

#base_url to scrape the lyrics from
base_url = "https://www.azlyrics.com/lyrics/{}/{}.html"

#File in which the lyrics will be saved
lyrics_file = "lyrics_scraped.txt"

#Making variables/lists
song_lyrics = []
titles = []
lyrics_not_found_for = []

#Randomized delay between requests
min_delay = 7
max_delay = 23

with open(lyrics_file, "w") as file:
    for artist in artists:
        artist_last_name = artist.replace(" ", "").lower()

        songs = songs_dict[artist_last_name]
        processed_songs = []

        #Preprocessing the songs name for scraping
        for song in songs:
            numbers_in_brackets_removed = re.sub(r'\(.*\)', "", song)
            processed_song = re.sub(r'\W+', '', numbers_in_brackets_removed).lower()
            processed_songs.append(processed_song)

        #Removing duplicate songs
        processed_songs = list(set(processed_songs))

        for song in processed_songs:
            final_url = base_url.format(artist_last_name, song)

            try:
                #Making a request to the URL
                response = requests.get(final_url)

                #Seeing if request was blocked by AZLyrics
                if "request for access" in response.text.lower():
                    print("Request for access detected. Waiting...")
                    # If the request was blocked, wait for a randomized delay and try again
                    delay = random.randint(min_delay, max_delay)
                    sleep(delay)
                    response = requests.get(final_url)

                #Parsing the HTML response using BeautifulSoup
                soup = BeautifulSoup(response.text, 'html.parser')

                #Finding song name
                html_pointer = soup.find('div', attrs={'class': 'ringtone'})
                song_name = html_pointer.find_next('b').contents[0].strip()

                #Getting lyrics
                lyrics = html_pointer.find_next('div').text.strip()

                #Appending the lyrics and song name to the respective lists
                song_lyrics.append(lyrics)
                titles.append(song_name)

                # Write the lyrics to the lyrics file
                file.write("###" + song_name + "###")
                file.write("\n\n")
                file.write(lyrics)
                file.write("\n\n")

                print("Lyrics successfully written to file for : " + song_name)

            except Exception: #Unlike a bare except, this lets KeyboardInterrupt stop the run
                print("Lyrics not found for : " + song)
                lyrics_not_found_for.append(song)

            finally:
                #Randomized delay between requests
                delay = random.randint(min_delay, max_delay)
                sleep(delay)

#Printing the list of songs for which lyrics were not found
print("Lyrics not found for the following songs:")
print(lyrics_not_found_for)

#Run Time:~55 min

Lyrics successfully written to file for : "Conductor"
Lyrics successfully written to file for : "Postcards From Italy"
Lyrics successfully written to file for : "Long & Lost"
Lyrics successfully written to file for : "Morning Elvis"
Lyrics successfully written to file for : "Which Witch"
Lyrics successfully written to file for : "The Bomb"
Lyrics not found for : jennyofoldstones
Lyrics successfully written to file for : "Falling"
Lyrics successfully written to file for : "Haunted House"
Lyrics successfully written to file for : "Prayer Factory"
Lyrics successfully written to file for : "As Far As I Could Get"
Lyrics successfully written to file for : "Halo"
Lyrics successfully written to file for : "How Big, How Blue, How Beautiful"
Lyrics successfully written to file for : "Various Storms & Saints"
Lyrics successfully written to file for : "Never Let Me Go"
Lyrics successfully written to file for : "Throwing Bricks"
Lyrics successfully written to file for : "Make Up Your Mind"
Lyrics 

Having the code print a line for each song title that failed to scrape allowed me to check each URL individually, see why it failed, and fix it where possible.

Lyrics not found for the following songs:
['jennyofoldstones', 'rabbitheart', 'whereareünow', 'landscape', 'constructionat8am', 'bigolmiyazakitears', 'abstract', 'momentssilence', 'jackbootjump', 'tonoisemaking', 'tosomeonefromawarmclimate', 'almost', 'throughme', 'icarrion', 'deselby']

After checking each one individually, these should be the correct titles, along with their index numbers in the JSON file:

FOR florencethemachine:
jennyofoldstonesgameofthrones | 99
rabbitheartraiseitup | 1
wherearenow | 110
landscapedemo | 40

FOR haleyheynderickx: 
constructionat8amlive | 19
bigolmiyazakitearslive | 18

FOR hozier: 
momentssilencecommontongue | 19 
jackbootjumplive | 56
tonoisemakingsing | 26
tosomeonefromawarmclimateuiscefhuaraithe | 45
almostsweetmusic | 22
throughmetheflood | 65
icarrionicarian | 39
deselbypart2 | 36


Got rid of songs by Hozier: 
    -abstract | 48
    -deselbypart1 | 35
Why? 
    -There were no lyrics/text to scrape
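
Most of the failed titles above differ from the corrected ones only by a missing parenthetical ('rabbitheart' vs. 'rabbitheartraiseitup'), which suggests that the preprocessing step that strips parenthesized text is the cause. The sketch below illustrates this with a hypothetical `make_slug` helper that mirrors the two regular expressions used in the scraping cell. The song titles are illustrative inputs, and the `drop_parens` and `ascii_only` switches are assumptions added for the comparison.

```python
import re

def make_slug(title, drop_parens=True, ascii_only=False):
    """Hypothetical helper mirroring the notebook's title preprocessing."""
    if drop_parens:
        # Same pattern as the scraping cell: remove "(...)" and its contents
        title = re.sub(r'\(.*\)', '', title)
    # Remove every non-word character; with re.ASCII, accented letters are removed too
    flags = re.ASCII if ascii_only else 0
    return re.sub(r'\W+', '', title, flags=flags).lower()

print(make_slug("Rabbit Heart (Raise It Up)"))                     # rabbitheart
print(make_slug("Rabbit Heart (Raise It Up)", drop_parens=False))  # rabbitheartraiseitup
print(make_slug("Where Are Ü Now"))                                # whereareünow
print(make_slug("Where Are Ü Now", ascii_only=True))               # wherearenow
```

Keeping parentheticals for every song is not guaranteed to match the site's URLs either, so a manual check like the one above may still be needed.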

Editing song titles for florencethemachine

In [14]:
#EDIT FOR florencethemachine

import json

#Loading in 'Artists-Songs Mapping'
with open("Artists-Songs Mapping.json", "r") as f:
    data = json.load(f)

#Modifying song titles
artist_last_name = "florencethemachine"
songs_to_modify = [
    {"title": "jennyofoldstonesgameofthrones", "index": 99},
    {"title": "rabbitheartraiseitup", "index": 1},
    {"title": "wherearenow", "index": 110},
    {"title": "landscapedemo", "index": 40}
]

#Using a for loop to go through and change titles by corresponding index number
for song in songs_to_modify:
    data[artist_last_name][song["index"]] = song["title"]

#Writing the updated dictionary to the 'Artists-Songs Mapping'
with open("Artists-Songs Mapping.json", "w") as f:
    json.dump(data, f)

Editing song titles for haleyheynderickx

In [21]:
#EDIT FOR haleyheynderickx

import json

#Loading in 'Artists-Songs Mapping'
with open("Artists-Songs Mapping.json", "r") as f:
    data = json.load(f)

#Modifying song titles
artist_last_name = "haleyheynderickx"
songs_to_modify = [
    {"title": "constructionat8amlive", "index": 19},
    {"title": "bigolmiyazakitearslive", "index": 18}
]

#Using a for loop to go through and change titles by corresponding index number
for song in songs_to_modify:
    data[artist_last_name][song["index"]] = song["title"]

#Writing the updated dictionary to the 'Artists-Songs Mapping'
with open("Artists-Songs Mapping.json", "w") as f:
    json.dump(data, f)


Editing song titles for hozier

In [23]:
#EDIT FOR hozier 

import json

#Loading in 'Artists-Songs Mapping'
with open("Artists-Songs Mapping.json", "r") as f:
    data = json.load(f)

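#Modifying song titles
#Note: these indices are the positions after the two songs without lyrics (original
#indices 35 and 48) were removed, so some are lower than in the list above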
artist_last_name = "hozier"
songs_to_modify = [
    {"title": "momentssilencecommontongue", "index": 19},
    {"title": "jackbootjumplive", "index": 54},
    {"title": "tonoisemakingsing", "index": 26},
    {"title": "tosomeonefromawarmclimateuiscefhuaraithe", "index": 44},
    {"title": "almostsweetmusic", "index": 22},
    {"title": "throughmetheflood", "index": 63},
    {"title": "icarrionicarian", "index": 38},
    {"title": "deselbypart2", "index": 35}
]

#Using a for loop to go through and change titles by corresponding index number
for song in songs_to_modify:
    data[artist_last_name][song["index"]] = song["title"]

#Writing the updated dictionary to the 'Artists-Songs Mapping'
with open("Artists-Songs Mapping.json", "w") as f:
    json.dump(data, f)
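
The three edit cells above repeat the same load, modify, and save steps. As a minimal sketch, the modify step could be folded into one function. `apply_title_fixes` and its bounds check are assumptions for illustration, not part of the original notebook, and the example uses a small in-memory dictionary instead of the real 'Artists-Songs Mapping.json'.

```python
def apply_title_fixes(data, artist, fixes):
    """Replace titles at the given list positions; return the positions that were skipped."""
    skipped = []
    for fix in fixes:
        index = fix["index"]
        # Guard against indices that fall outside the artist's song list
        if 0 <= index < len(data.get(artist, [])):
            data[artist][index] = fix["title"]
        else:
            skipped.append(index)
    return skipped

# Toy data standing in for the JSON mapping
data = {"hozier": ["almost", "deselby", "throughme"]}
fixes = [
    {"title": "almostsweetmusic", "index": 0},
    {"title": "throughmetheflood", "index": 2},
    {"title": "outofrange", "index": 10},
]
skipped = apply_title_fixes(data, "hozier", fixes)
print(data["hozier"])  # ['almostsweetmusic', 'deselby', 'throughmetheflood']
print(skipped)         # [10]
```

Returning the skipped indices makes an out-of-range entry visible instead of silently ignoring it.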


Now I will delete the two songs from the artist hozier. 

In [32]:
import json

#Loading in 'Artists-Songs Mapping'
with open("Artists-Songs Mapping.json", "r") as f:
    data = json.load(f)

#Removing songs
artist_last_name = "hozier"
songs_to_remove = ["-Abstract (Psychopomp)", "-De Selby (Part 1)"]

#Using song title
for song in songs_to_remove:
    if song in data[artist_last_name]:
        data[artist_last_name].remove(song)

#Writing the updated dictionary to the file
with open("Artists-Songs Mapping.json", "w") as f:
    json.dump(data, f)


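That code did not remove anything, most likely because the strings in songs_to_remove do not exactly match the titles stored in the JSON file (the `if song in ...` check skips non-matching titles without any message). Deleting by index number DID work.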

In [17]:
import json

#Loading in 'Artists-Songs Mapping'
with open("Artists-Songs Mapping.json", "r") as f:
    data = json.load(f)

#Removing songs 
artist_last_name = "hozier"
songs_to_remove = [35, 48]

#Using index number
for index in sorted(songs_to_remove, reverse=True):
    del data[artist_last_name][index]

#Writing the updated dictionary to the 'Artists-Songs Mapping'
with open("Artists-Songs Mapping.json", "w") as f:
    json.dump(data, f)
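
Two details of the removal cells are easy to miss. Title-based removal only works when the string matches a stored entry exactly, and index-based deletion has to go from the highest index down so that earlier deletions do not shift the later targets. A small sketch on made-up data:

```python
songs = ["a", "b", "c", "d", "e"]

# Title-based removal: a non-matching string is skipped silently
for title in ["-B (Live)"]:
    if title in songs:
        songs.remove(title)
print(songs)  # ['a', 'b', 'c', 'd', 'e']

# Index-based deletion, highest index first so remaining targets keep their positions
to_remove = [1, 3]
for index in sorted(to_remove, reverse=True):
    del songs[index]
print(songs)  # ['a', 'c', 'e']

# Deleting in ascending order removes the wrong element the second time
wrong = ["a", "b", "c", "d", "e"]
for index in sorted(to_remove):
    del wrong[index]
print(wrong)  # ['a', 'c', 'd']
```

For the same reason, index-based deletion should only run once: repeating it on the already shortened list removes different songs.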


Now that the titles have been corrected for scraping, I will try scraping all songs again.

In [28]:
import requests
from bs4 import BeautifulSoup
import json
import re
import random
from time import sleep

#List of artists for which the lyrics need to be written
artists = ["florencethemachine", "haleyheynderickx", "hozier"]

#base_url to scrape the lyrics from
base_url = "https://www.azlyrics.com/lyrics/{}/{}.html"

#Making a file in which the lyrics will be saved
lyrics_file = "lyrics_scraped.txt"

#Randomized delay between requests
min_delay = 7
max_delay = 23

#Loading the mapping of artists to song titles
with open("Artists-Songs Mapping.json", "r") as f:
    songs_dict = json.load(f)

#Updating and deleting song titles for each artist
#New titles
hozier_songs_to_modify = [
    {"title": "momentssilencecommontongue", "index": 19},
    {"title": "jackbootjumplive", "index": 54},
    {"title": "tonoisemakingsing", "index": 26},
    {"title": "tosomeonefromawarmclimateuiscefhuaraithe", "index": 44},
    {"title": "almostsweetmusic", "index": 22},
    {"title": "throughmetheflood", "index": 63},
    {"title": "icarrionicarian", "index": 38},
    {"title": "deselbypart2", "index": 35}
]
haleyheyndericks_songs_to_modify = [
    {"title": "constructionat8amlive", "index": 19},
    {"title": "bigolmiyazakitearslive", "index": 18}
]
florencethemachine_songs_to_modify = [
    {"title": "jennyofoldstonesgameofthrones", "index": 99},
    {"title": "rabbitheartraiseitup", "index": 1},
    {"title": "wherearenow", "index": 110},
    {"title": "landscapedemo", "index": 40}
]
# Titles to delete
#Only the index values are used below; the titles are labels for readability
hozier_songs_to_delete = [
    {"title": "-De Selby (Part 1)", "index": 35},
    {"title": "-Abstract (Psychopomp)", "index": 48}
]

#Updating song titles for Hozier
for song in hozier_songs_to_modify:
    index = song["index"]
    if index < len(songs_dict["hozier"]):
        songs_dict["hozier"][index] = song["title"]

#Updating song titles for Haley Heynderickx
for song in haleyheyndericks_songs_to_modify:
    index = song["index"]
    if index < len(songs_dict["haleyheynderickx"]):
        songs_dict["haleyheynderickx"][index] = song["title"]

#Updating song titles for Florence + The Machine
for song in florencethemachine_songs_to_modify:
    index = song["index"]
    if index < len(songs_dict["florencethemachine"]):
        songs_dict["florencethemachine"][index] = song["title"]

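#Deleting unwanted song titles for Hozier
#Caution: the update loop above has just written "deselbypart2" to index 35, so deleting
#index 35 here removes it. Indices 35 and 48 refer to the list before the two songs were
#removed; skip this step if the earlier cell already deleted them.
hozier_songs_to_delete_indices = [song["index"] for song in hozier_songs_to_delete]
songs_dict["hozier"] = [song for idx, song in enumerate(songs_dict["hozier"]) if idx not in hozier_songs_to_delete_indices]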

#Writing the updated dictionary to the 'Artists-Songs Mapping'
with open("Artists-Songs Mapping.json", "w") as f:
    json.dump(songs_dict, f)

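#Note: song_lyrics, titles and lyrics_not_found_for are not reset in this cell,
#so they still hold whatever the earlier scraping cell put in them
with open(lyrics_file, "w") as file: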
    for artist in artists:
        artist_last_name = artist.replace(" ", "").lower()

        #Getting the list of song titles for artist
        songs = songs_dict.get(artist_last_name, [])
        processed_songs = []

        #Preprocessing songs name for scraping
        for song in songs:
            numbers_in_brackets_removed = re.sub(r'\(.*\)', "", song)
            processed_song = re.sub(r'\W+', '', numbers_in_brackets_removed).lower()
            processed_songs.append(processed_song)

        #Removing duplicate songs
        processed_songs = list(set(processed_songs))

        for song in processed_songs:
            final_url = base_url.format(artist_last_name, song)

            try:
                #Making a request to the URL
                response = requests.get(final_url)

                #Seeing if request was blocked by AZLyrics
                if "request for access" in response.text.lower():
                    print("Request for access detected. Waiting...")
                    # If the request was blocked, wait for a randomized delay and try again
                    delay = random.randint(min_delay, max_delay)
                    sleep(delay)
                    response = requests.get(final_url)

                #Parsing the HTML response with BeautifulSoup
                soup = BeautifulSoup(response.text, 'html.parser')

                #Finding the song name
                html_pointer = soup.find('div', attrs={'class': 'ringtone'})
                song_name = html_pointer.find_next('b').contents[0].strip()

                #Getting the lyrics
                lyrics = html_pointer.find_next('div').text.strip()

                #Appending the lyrics and song name to the respective lists
                song_lyrics.append(lyrics)
                titles.append(song_name)

                #Writing the lyrics to the lyrics file
                file.write("###" + song_name + "###")
                file.write("\n\n")
                file.write(lyrics)
                file.write("\n\n")

                print("Lyrics successfully written to file for : " + song_name)

            #Any failure (request error or missing page element) is recorded as "not found"
            except Exception:
                print("Lyrics not found for : " + song)
                lyrics_not_found_for.append(song)

            finally:
                #Randomized delay between requests
                delay = random.randint(min_delay, max_delay)
                sleep(delay)

#Printing the list of songs for which lyrics were not found
print("Lyrics not found for the following songs:")
print(lyrics_not_found_for)

#Run Time: ~51 min

Lyrics successfully written to file for : "Conductor"
Lyrics successfully written to file for : "Postcards From Italy"
Lyrics successfully written to file for : "Long & Lost"
Lyrics successfully written to file for : "Morning Elvis"
Lyrics successfully written to file for : "Which Witch"
Lyrics successfully written to file for : "The Bomb"
Lyrics successfully written to file for : "Rabbit Heart (Raise It Up)"
Lyrics successfully written to file for : "Falling"
Lyrics successfully written to file for : "Haunted House"
Lyrics successfully written to file for : "Prayer Factory"
Lyrics successfully written to file for : "As Far As I Could Get"
Lyrics successfully written to file for : "Halo"
Lyrics successfully written to file for : "How Big, How Blue, How Beautiful"
Lyrics successfully written to file for : "Various Storms & Saints"
Lyrics successfully written to file for : "Never Let Me Go"
Lyrics successfully written to file for : "Throwing Bricks"
Lyrics successfully written to file fo

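Although the final printout still lists songs as not found, I checked each song individually, and the lyrics for all songs were found in 'lyrics_scraped.txt'. The list was most likely carried over from the first scraping run, since lyrics_not_found_for is not reset in this cell.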
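
The manual check described above can also be approximated in code. Every song is written to the file as a `###title###` marker followed by its lyrics, so the markers can be read back and compared against an expected list. This is only a sketch: `titles_in_file` is a hypothetical helper, and the sample text merely imitates the layout of 'lyrics_scraped.txt'.

```python
import re

def titles_in_file(text):
    """Return the titles found in '###title###' marker lines."""
    return re.findall(r'^###(.+?)###$', text, flags=re.MULTILINE)

# Sample text imitating the layout written by the scraping cell
sample = (
    '###"Conductor"###\n\nfirst lyric line\nsecond lyric line\n\n'
    '###"Halo"###\n\nanother lyric line\n\n'
)
found = titles_in_file(sample)
print(found)  # ['"Conductor"', '"Halo"']

# Any expected title without a marker in the file is reported as missing
expected = ['"Conductor"', '"Halo"', '"Falling"']
missing = [title for title in expected if title not in found]
print(missing)  # ['"Falling"']
```

With the real file, `sample` would be replaced by the text read from 'lyrics_scraped.txt'.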

In the next portion, I created a JSON file titled 'Everything' to hold all of the data (song titles and song lyrics). More in-depth explanations can be found in the comments included in the code.

In [33]:
final_dict = dict(zip(titles, song_lyrics))

In [34]:
import json
json_file = "Everything.json"
with open(json_file, 'w') as file:
    json.dump(final_dict, file)
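
One property of `dict(zip(titles, song_lyrics))` is worth keeping in mind: if a title appears more than once, for example because the lists still hold entries from an earlier scraping run, only the last lyrics for that title are kept. The lists below are made-up stand-ins for the real ones:

```python
titles = ['"Halo"', '"Falling"', '"Halo"']
song_lyrics = ["first version", "falling lyrics", "second version"]

# zip pairs items by position; dict keeps the last value for a repeated key
final_dict = dict(zip(titles, song_lyrics))
print(len(final_dict))       # 2
print(final_dict['"Halo"'])  # second version
```

`zip` also stops at the end of the shorter list, so the two lists need the same length for every song to be included.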

Now that the lyrics for all songs have been found and saved to a file, it will be much easier to put together CSVs for each artist so the analysis can continue.

## Cleaning

Importing all needed libraries.

In [36]:
import io
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(context='talk', style='ticks')

Reading in 'Everything.json'

In [37]:
df = pd.read_json('Everything.json', orient = 'records', typ='series')
df

"Conductor"                                           I tried to control my shaking\nWith just one s...
"Postcards From Italy"                                The times we had\nOh when the wind would blow ...
"Long & Lost"                                         Lost in the fog, these hollow hills\nBlood run...
"Morning Elvis"                                       When they dressed me and they put me on a plan...
"Which Witch"                                         And it's my whole heart\nWeighed and measured ...
                                                                            ...                        
"Jackboot Jump (Live)"                                At Standing Rock the Jackboot Jump\nYou'd swea...
"Moment's Silence (Common Tongue)"                    When stunted hand earns place with man by mere...
"Through Me (The Flood)"                              Picture a man\nSeen like a speck out from the ...
"To Someone From A Warm Climate (Uiscefhuaraithe)"              

Deleting any text enclosed in square brackets ([...])

In [39]:
#Raw string avoids invalid-escape warnings; \[.*?\] matches the same text as the original pattern
new_df = pd.DataFrame(df.str.replace(r"\[.*?\]", "", regex=True), columns=['Lyrics'])
new_df.head()

                          Lyrics
"Conductor"               I tried to control my shaking\nWith just one s...
"Postcards From Italy"    The times we had\nOh when the wind would blow ...
"Long & Lost"             Lost in the fog, these hollow hills\nBlood run...
"Morning Elvis"           When they dressed me and they put me on a plan...
"Which Witch"             And it's my whole heart\nWeighed and measured ...


Deleting \n and \r line breaks

In [40]:
df_stripped = new_df['Lyrics'].str.replace('\n', ' ').str.replace('\r', ' ')
df_stripped.head()

"Conductor"               I tried to control my shaking With just one so...
"Postcards From Italy"    The times we had Oh when the wind would blow w...
"Long & Lost"             Lost in the fog, these hollow hills Blood runn...
"Morning Elvis"           When they dressed me and they put me on a plan...
"Which Witch"             And it's my whole heart Weighed and measured i...
Name: Lyrics, dtype: object
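
The two cleaning steps above (bracket removal, then line-break replacement) can be tried on a single made-up string with Python's `re` module. `clean_lyrics` is a hypothetical helper for illustration; the notebook itself applies the same operations through pandas string methods.

```python
import re

def clean_lyrics(text):
    """Hypothetical helper: drop [bracketed] text, then turn line breaks into spaces."""
    without_brackets = re.sub(r"\[.*?\]", "", text)
    return without_brackets.replace("\n", " ").replace("\r", " ")

sample = "[Chorus]\nHello there [x2]\nSecond line"
print(repr(clean_lyrics(sample)))  # ' Hello there  Second line'

# "." does not match a newline by default, so a bracket spanning two lines survives
multi = "[spoken\nintro] Hi"
print(repr(clean_lyrics(multi)))   # '[spoken intro] Hi'
```

If bracketed notes can span several lines, replacing the line breaks before removing the brackets would let the pattern catch them.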

Creating a 'Songs' column from the index, stripping the quotation marks (\") from it, and setting it as the new index

In [41]:
df_stripped = pd.DataFrame(df_stripped)
df_stripped['Songs'] = df_stripped.index
df_stripped['Songs'] = df_stripped['Songs'].str.replace('\"','')
df_stripped.set_index('Songs',inplace=True)
df_stripped.head()

                        Lyrics
Songs
Conductor               I tried to control my shaking With just one so...
Postcards From Italy    The times we had Oh when the wind would blow w...
Long & Lost             Lost in the fog, these hollow hills Blood runn...
Morning Elvis           When they dressed me and they put me on a plan...
Which Witch             And it's my whole heart Weighed and measured i...


Saving the cleaned data to a new csv titled 'Everything Cleaned'

In [42]:
df_stripped.to_csv('Everything Cleaned.csv')

# **Webscraping Portion Complete**