# Scrape From Chosic to Classify Genres

It is difficult to plot trends in popular genres because Spotify uses many granular genres, such as "bubblegum pop". To group them into broader categories, I am using `https://www.chosic.com/list-of-music-genres/`, which already contains genre data scraped and classified by Chosic's algorithm.

In [30]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import glob

In [11]:
url = "https://www.chosic.com/list-of-music-genres/"

# The website automatically blocks scraping requests, so send a browser-like User-Agent.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# A timeout keeps the request from hanging indefinitely if the server never responds.
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()

In [16]:
soup = BeautifulSoup(response.text, "html.parser")

genre_elements = soup.find_all("li", class_="capital-letter genre-term")
genres = [genre.text.strip() for genre in genre_elements]

genre_df = pd.DataFrame(genres, columns=["Genre"])

genre_df

| | Genre |
|---|---|
| 0 | pop |
| 1 | dance pop |
| 2 | pov: indie |
| 3 | k-pop |
| 4 | indonesian pop |
| ... | ... |
| 5771 | musique peule |
| 5772 | zim urban groove |
| 5773 | filmi |
| 5774 | manila sound |


I tried to scrape the entire page as `Category:Subcategories` pairs but ran into issues with the HTML. Instead, I manually pulled the first genre in each category and split the list into categories by hand. Luckily, the first genre in each category is the category name itself (e.g. Pop:Pop, Rock:Rock):

In [None]:
main_genres = [
    "Pop", "Electronic", "Hip Hop", "R&B", "Latin", "Rock",
    "Metal", "Country", "Folk/Acoustic", "Classical", "Jazz",
    "Blues", "Easy listening", "New age", "Traditional music"
]

genre_df["Main Genre"] = None

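# Walk the list in order: each main genre starts a new category, and every
# row after it belongs to that category until the next main genre appears.
current_main_genre = None

for index, row in genre_df.iterrows():
    if row["Genre"] in main_genres:
        current_main_genre = row["Genre"]
    genre_df.at[index, "Main Genre"] = current_main_genre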
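
The loop above carries the most recent main genre forward over the rows that follow it. As a minimal sketch, the same idea can be written without an explicit loop using `Series.where` and `Series.ffill`. The names `sample_df` and `sample_main_genres` are hypothetical stand-ins for `genre_df` and `main_genres`, and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the scraped list: each main genre is
# followed by the subgenres that belong to it.
sample_df = pd.DataFrame(
    {"Genre": ["Pop", "pop", "dance pop", "Rock", "rock", "classic rock"]}
)
sample_main_genres = ["Pop", "Rock"]

# Keep the value only on rows that are main-genre headers (NaN elsewhere),
# then forward-fill each header over the rows that follow it.
is_header = sample_df["Genre"].isin(sample_main_genres)
sample_df["Main Genre"] = sample_df["Genre"].where(is_header).ffill()

print(sample_df)
```

One difference from the loop: rows that appear before the first main genre would end up as `NaN` here rather than `None`.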


In [28]:
genre_df

| | Genre | Main Genre |
|---|---|---|
| 0 | Pop | Pop |
| 1 | pop | Pop |
| 2 | dance pop | Pop |
| 3 | pov: indie | Pop |
| 4 | k-pop | Pop |
| ... | ... | ... |
| 5786 | musique peule | Traditional music |
| 5787 | zim urban groove | Traditional music |
| 5788 | filmi | Traditional music |
| 5789 | manila sound | Traditional music |


In [29]:
# index=False avoids writing the row index as an extra unnamed column.
genre_df.to_csv("genre_categorized.csv", index=False)

## Attach Genres to Music Data
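
As a rough sketch of the lookup this section performs on each row: take the first entry of a comma-separated genre string and map it to a main genre, falling back to "Unknown". The helper names `first_genre` and `main_genre`, the `sample_lookup` dictionary, and the sample rows are all hypothetical. The `.strip()` call is an added assumption to guard against stray whitespace.

```python
def first_genre(genres_field):
    """Return the first comma-separated genre, or "Unknown" if the field is empty."""
    if genres_field is None:
        return "Unknown"
    first = str(genres_field).split(",")[0].strip()
    return first if first else "Unknown"


def main_genre(genres_field, lookup):
    """Map the first listed genre to its main genre, defaulting to "Unknown"."""
    return lookup.get(first_genre(genres_field), "Unknown")


# Hypothetical lookup and rows, for illustration only.
sample_lookup = {"dance pop": "Pop", "classic rock": "Rock"}
sample_rows = ["dance pop, pop", "classic rock", "", None, "vaporwave"]

mapped = [main_genre(row, sample_lookup) for row in sample_rows]
print(mapped)
```

Counting how many rows come back as "Unknown" gives a quick sense of how much of the data the genre list actually covers.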

In [31]:
genre_mapping = pd.read_csv("genre_categorized.csv")

genre_dict = dict(zip(genre_mapping["Genre"], genre_mapping["Main Genre"]))
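
One thing to keep in mind with `dict(zip(...))`: if the same genre appears more than once with different main genres, the later pair silently wins. Whether the Chosic list contains such repeats is not something I have verified, so the check below is only a sketch. `conflicting_keys` and `sample_pairs` are hypothetical names and data.

```python
from collections import defaultdict


def conflicting_keys(pairs):
    """Return keys that are paired with more than one distinct value."""
    seen = defaultdict(set)
    for key, value in pairs:
        seen[key].add(value)
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


# Hypothetical pairs: "indie folk" is listed under two main genres.
sample_pairs = [
    ("pop", "Pop"),
    ("indie folk", "Folk/Acoustic"),
    ("indie folk", "Rock"),
]

# Building a dict keeps only the last value seen for a repeated key.
print(dict(sample_pairs)["indie folk"])
print(conflicting_keys(sample_pairs))
```

The same function could be called on `zip(genre_mapping["Genre"], genre_mapping["Main Genre"])` to list any genres that are assigned to more than one category.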

In [39]:
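for year in range(1960, 2025):
    file_path = f"./by_year_data/{year}_music_data.csv"

    print(f"Add genre to {year}")

    df = pd.read_csv(file_path)

    # Use the first listed genre of each row as its lookup key.
    df["First Genre"] = df["Genres"].apply(lambda x: str(x).split(",")[0].strip() if pd.notna(x) else "Unknown")

    # Map to a main genre; genres missing from the mapping become "Unknown".
    df["Main Genre"] = df["First Genre"].map(genre_dict).fillna("Unknown")

    df.drop(columns=["First Genre"], inplace=True)

    # Overwrite the yearly file in place with the new column.
    df.to_csv(file_path, index=False)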


Add genre to 1960
Add genre to 1961
Add genre to 1962
Add genre to 1963
Add genre to 1964
Add genre to 1965
Add genre to 1966
Add genre to 1967
Add genre to 1968
Add genre to 1969
Add genre to 1970
Add genre to 1971
Add genre to 1972
Add genre to 1973
Add genre to 1974
Add genre to 1975
Add genre to 1976
Add genre to 1977
Add genre to 1978
Add genre to 1979
Add genre to 1980
Add genre to 1981
Add genre to 1982
Add genre to 1983
Add genre to 1984
Add genre to 1985
Add genre to 1986
Add genre to 1987
Add genre to 1988
Add genre to 1989
Add genre to 1990
Add genre to 1991
Add genre to 1992
Add genre to 1993
Add genre to 1994
Add genre to 1995
Add genre to 1996
Add genre to 1997
Add genre to 1998
Add genre to 1999
Add genre to 2000
Add genre to 2001
Add genre to 2002
Add genre to 2003
Add genre to 2004
Add genre to 2005
Add genre to 2006
Add genre to 2007
Add genre to 2008
Add genre to 2009
Add genre to 2010
Add genre to 2011
Add genre to 2012
Add genre to 2013
Add genre to 2014
Add genre 