The dataset, available at http://www.cp.jku.at/datasets/MMTD/, contains over a million tweets from all seven continents. It was split locally into one subset per continent, and this analysis is based on the European subset.

# Installing packages and importing libraries

The necessary packages are installed.
Beautiful Soup (bs4) is needed for web scraping and parsing.

In [0]:
pip install pyspark bs4 numpy tqdm

Python interpreter will be restarted.
Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Collecting tqdm
  Using cached tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark, bs4
  Building wheel for pyspark (setup.py): started
  Building wheel for pyspark (setup.py): finished with status 'done'
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845512 sha256=2b2148efc4f3c5446ed734c2d9ef27b9bdde38f607500a180d7127a94e8690d3
  Stored in directory: /root/.cache/pip/wheels/51/c8/18/298a4ced8ebb3ab8a7d26a7198c0cc7035abb906bde94a4c4b
  Building wheel for bs4 (setup.py): started
  Building wheel for bs4 (setup.py): finished with status 'done'
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1272 sha256=7fa73badc9e484761e901d698ba5ad2e621d4af69cbdbe3af3f44ef262fe

In [0]:
import requests
from bs4 import BeautifulSoup, SoupStrainer
import numpy as np
from pyspark.sql.types import StringType
from multiprocessing.pool import ThreadPool
from tqdm import tqdm

# Functions needed

## fetchGenreInfo

fetchGenreInfo takes the raw artist name and formats it for use in a Wikipedia URL: double quotes are removed, spaces are replaced with underscores, and ampersands are percent-encoded. URLs with the `_(band)` and `_(singer)` suffixes are tried before the plain artist name, because some artist names are common nouns or personal names whose page returns an infobox with no genre information.

In [0]:
def fetchGenreInfo(RawArtist):
    genre = None
    WikiURL = "https://en.wikipedia.org/wiki/"
    RawArtist = RawArtist[0]  # each input is a Row; the artist name is its first field
    artist = RawArtist
    artist = artist.replace("\"", "")  # remove double quotes
    artist = artist.replace(" ", "_")  # Wikipedia titles use underscores for spaces
    artist = artist.replace("&", "%26")  # percent-encode ampersands

    BaseURL = WikiURL + artist
    URLSinger = BaseURL + "_(singer)"
    URLBand = BaseURL + "_(band)"
    # The base URL is tried last to avoid false-positive infoboxes with no genre
    attempts = [URLBand, URLSinger, BaseURL]

    for url in attempts:
        soup = getHTML(url)

        infobox = getInfobox(soup)
        if infobox is not None:
            genre = findGenre(infobox)
            break

    genres = [RawArtist, genre]
    pbar.update(1)  # advance the shared progress bar
    return genres
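
The URL formatting described above can be tried in isolation, without any network access. The following is a minimal sketch, not the notebook's own code: the helper name `candidate_urls` is hypothetical, and it assumes that `urllib.parse.quote` is an acceptable substitute for the manual `replace("&", "%26")` step (it also percent-encodes other unsafe characters, which the notebook does not do).

```python
from urllib.parse import quote

WIKI_URL = "https://en.wikipedia.org/wiki/"


def candidate_urls(raw_artist):
    """Return the Wikipedia URLs to try for an artist, most specific first.

    Hypothetical helper for illustration only.
    """
    # Drop double quotes and use underscores for spaces
    title = raw_artist.replace('"', "").replace(" ", "_")
    # Percent-encode characters that are unsafe in a URL path ("&" becomes "%26")
    title = quote(title)
    base = WIKI_URL + title
    # The plain title comes last, mirroring the order used in fetchGenreInfo
    return [base + "_(band)", base + "_(singer)", base]


for url in candidate_urls("Florence & The Machine"):
    print(url)
```

Printing the candidates for a handful of artist names is a quick way to check the formatting before sending thousands of requests.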


## getHTML

getHTML requests the HTML for the generated URL. A Beautiful Soup SoupStrainer restricts parsing to the table elements of the page, which is where the infobox lives, to speed up the process.

In [0]:
def getHTML(URL):
    response = requests.get(URL)
    if response.status_code == 404:
        # Page not found: return an empty document so that no infobox is found
        return BeautifulSoup("", "html.parser")
    strained = SoupStrainer("table")  # parse only <table> elements to reduce run time
    soup = BeautifulSoup(response.content, "html.parser", parse_only=strained)
    return soup
    


## getInfobox

In [0]:
def getInfobox(soup):
    infobox = soup.find("table", attrs={"class":"infobox"})
    return infobox

## findGenre

In [0]:
def findGenre(infobox):
    genre = ""

    for tr in infobox.find_all("tr"):  # search through the rows of the infobox
        for th in tr.find_all("th"):  # search through the header cells of the row

            # Check for the spelling variants of the genre header
            if th.get_text() in ("Genres", "genres", "Genre", "genre"):
                found = False

                # The genre is usually stored as a hyperlink
                for a in tr.find_all("a"):
                    text = a.get_text()

                    # The first hyperlink can be a footnote number in square brackets or the header itself
                    if "[" not in text and text != "Genre":
                        genre = simplifyGenre(text)
                        found = True
                        break

                # Where the genre is plain text rather than a link, it can be a
                # comma-separated list, so the first item is extracted
                td = tr.find("td")
                if not found and td is not None:
                    genre = simplifyGenre(td.get_text().split(",")[0].strip())

    return genre
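
The plain-text case mentioned in the comments (a genre cell that holds a list rather than links) can be sketched on its own. This is an illustrative helper under stated assumptions, not part of the notebook: the name `first_genre_from_text` is hypothetical, and it assumes that entries are separated by commas or line breaks and that footnote markers appear in square brackets.

```python
import re


def first_genre_from_text(cell_text):
    """Return the first entry of a comma- or newline-separated genre cell.

    Hypothetical helper for illustration only.
    """
    # Remove footnote markers such as "[1]"
    cleaned = re.sub(r"\[[^\]]*\]", "", cell_text)
    # Split on commas and line breaks, then keep the first non-empty piece
    for piece in re.split(r"[,\n]", cleaned):
        piece = piece.strip()
        if piece:
            return piece
    return ""


print(first_genre_from_text("Indie rock[1], garage rock"))
```

Splitting the text, rather than slicing at the index returned by `str.find`, also behaves sensibly when the cell contains no comma at all.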

## simplifyGenre

The simplifyGenre function takes the genre extracted from the Wikipedia page and maps it to a broader category, because the pages list many subgenres of rock, metal, pop, reggae and hip hop.

In [0]:
def simplifyGenre(genre):
    if "pop" in genre or "Pop" in genre:
        genre = "pop"
    elif "ock" in genre:
        genre = "rock"
    elif "lternative" in genre:
        genre = "alternative"
    elif "etal" in genre:
        genre = "metal"
    elif "hip" in genre or "Hip" in genre or "hop" in genre or "Hop" in genre:
        genre= "hip hop"
    elif "azz" in genre:
        genre = "jazz"
    elif "eggae" in genre:
        genre = "reggae"
    else:
        genre = genre
    return genre
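
The same mapping can also be written as an ordered table of rules, which makes the precedence between categories explicit. This is a sketch of an alternative, not the function used in the notebook: the names `GENRE_RULES` and `simplify_genre_table` are hypothetical, and it matches full lowercase keywords (for example "rock" rather than "ock"), so a few edge cases may be classified differently.

```python
# Ordered (keyword, label) pairs; the first keyword found in the genre wins
GENRE_RULES = [
    ("pop", "pop"),
    ("rock", "rock"),
    ("alternative", "alternative"),
    ("metal", "metal"),
    ("hip", "hip hop"),
    ("hop", "hip hop"),
    ("jazz", "jazz"),
    ("reggae", "reggae"),
]


def simplify_genre_table(genre):
    """Map a subgenre to a broad category, or return it unchanged."""
    lowered = genre.lower()
    for keyword, label in GENRE_RULES:
        if keyword in lowered:
            return label
    return genre  # no rule matched, keep the original genre


for example in ["Indie rock", "Synth-pop", "Trip hop", "Comedy"]:
    print(example, "->", simplify_genre_table(example))
```

Because the rules are checked in order, a genre such as "Pop rock" is classified as pop, which matches the order of the `if`/`elif` chain.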

# Loading Dataset

The raw dataset is loaded and the unique artist names are extracted to reduce the running time. Looking up every occurrence of an artist would mean requesting the same HTML page many times, especially for popular artists. The subset of the raw data relating to Europe was selected for this analysis.

In [0]:
MMTD = (spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/MMTD/mmtdEurope.csv")  # this path selects which continent's subset is analysed
)

artists = MMTD.select("artist_name").distinct()


artists = artists.collect()
display(artists)

artist_name
The Black Keys
Cinema Bizarre
Grimes
Jupiter Jones
Generation X
New Kids
Mt Eden
Eurythmics
Kate Nash
Gallon Drunk


# Generating Genre Info

The fetchGenreInfo function is mapped over the list of artists using a thread pool. Multithreading improves the running time, which matters because the full dataset contains roughly 24,000 unique artists.

In [0]:

pool = ThreadPool(9)

pbar = tqdm(total=len(artists))

genreResults = pool.map(fetchGenreInfo, artists)
pool.close()  # no more tasks will be submitted
pbar.close()  # finish the progress bar output

print("genre info gathered")

columns = ["Artist","Genre"]
genresDF = spark.createDataFrame(data = genreResults,schema = columns)

display(genresDF)
genresDF.show()


  0%|          | 0/16433 [00:00<?, ?it/s]

_1,_2
The Black Keys,rock
Cinema Bizarre,rock
Grimes,pop
Jupiter Jones,
Generation X,rock
New Kids,Comedy
Mt Eden,
Eurythmics,
Kate Nash,pop
Gallon Drunk,


genre info gathered
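
The thread-pool pattern used here can be exercised without any network access. The sketch below is an illustration under stated assumptions rather than the notebook's code: `fake_fetch` is a hypothetical stand-in for fetchGenreInfo, and progress is counted in the main thread with `imap` instead of updating a shared progress bar from inside the workers.

```python
from multiprocessing.pool import ThreadPool


def fake_fetch(name):
    """Hypothetical stand-in for fetchGenreInfo: no network access."""
    genre = "rock" if "rock" in name.lower() else None
    return [name, genre]


names = ["Rock Band A", "Singer B", "Rock Band C"]

results = []
done = 0
with ThreadPool(3) as pool:
    # imap yields results in input order as they become available
    for result in pool.imap(fake_fetch, names):
        results.append(result)
        done += 1  # progress is tracked in the main thread only

print(done, "of", len(names), "artists processed")
print(results)
```

Threads suit this workload because each task spends most of its time waiting on an HTTP response rather than using the CPU.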


# Adding Genre info to dataset and saving

The genre information is joined to the raw dataset and saved for use in the investigation.

In [0]:
MMTD = MMTD.join(genresDF,MMTD.artist_name ==  genresDF.Artist,"inner")

savepath = "/FileStore/tables/MMTDGenres"
MMTD.coalesce(1).write.option("header",True).format("csv").mode("overwrite").save(savepath)

print("Write Complete")