### Web Scraping: Get Coldplay's Top 20 Songs on Spotify from Kworb
Kworb is a music data analytics website that aggregates and showcases Spotify streaming charts and artist rankings. The platform offers insights into song popularity and trends across various regions globally.

In [115]:
from io import StringIO

import pandas as pd
import requests

# URL of Coldplay's songs page on Kworb
url = "https://kworb.net/spotify/artist/4gzpq5DPGxSnKTe4SA8HAU_songs.html"
# request the page content
r = requests.get(url)
# make sure the request succeeded
if r.status_code == 200:
    # parse every table on the page (newer pandas versions expect a file-like object, not a raw string)
    df_list = pd.read_html(StringIO(r.text))
    # get the second table on the page
    df = df_list[1]
    # remove the repeated row...
    df = df.drop(13, axis=0)
    # save the table in CSV format
    df.to_csv("spotify_songs.csv", index=False)
    print(df.head(20))
else:
    print(f"Request failed with status code {r.status_code}")

                       Song Title     Streams      Daily
0      * Something Just Like This  2360429097  1024226.0
1                          Yellow  1994170156  1619221.0
2                    Viva La Vida  1817301965  1397710.0
3                   The Scientist  1730247039   809670.0
4                         Fix You  1331066416   726609.0
5             A Sky Full of Stars  1306202500   927520.0
6            Hymn for the Weekend  1269746647   634085.0
7                     My Universe  1165484041   599792.0
8                        Paradise  1162178312   653060.0
9         Adventure of a Lifetime   923446921   532144.0
10                         Clocks   793528827   576293.0
11                         Sparks   771267093  1062281.0
12                          Magic   669899679   157968.0
14                        Trouble   329969893   160220.0
15                   Higher Power   320956467   153495.0
16                    In My Place   300647715   193981.0
17               Christmas Ligh
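
The cell above removes the repeated row by its fixed index label (13), which only works while the page keeps the same order. As a hedged alternative, the sketch below drops repeats by value and then keeps the most-streamed songs. It assumes the unwanted row is an exact repeat of a song title; the rows are made-up stand-ins, not real Kworb figures.

```python
import pandas as pd

# Made-up rows standing in for the scraped table (not real Kworb figures)
songs = pd.DataFrame({
    "Song Title": ["Song A", "Song B", "Song B", "Song C"],
    "Streams": [300, 200, 200, 100],
})

# Drop repeated titles by value rather than by a fixed row label,
# then renumber the index so it has no gaps
deduped = songs.drop_duplicates(subset="Song Title").reset_index(drop=True)

# Keep the n most-streamed songs (n would be 20 in the notebook, 2 here)
top = deduped.nlargest(2, "Streams")
print(top)
```

With the real table, `nlargest(20, "Streams")` would make the "top 20" selection explicit instead of relying on the page's row order.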

### Cleaning & Annotating

#### 1. Install and import everything we are going to use

In [3]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install spacy



In [5]:
pip install pandas



In [12]:
pip install wordninja

Collecting wordninja
  Downloading wordninja-2.0.0.tar.gz (541 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wordninja
  Building wheel for wordninja (setup.py) ... done
  Created wheel for wordninja: filename=wordninja-2.0.0-py3-none-any.whl size=541530 sha256=b47d83f828c8744d7b7b5ca274dc7e9b97542c19fbaebca5b1e0e01e3593911f
  Stored in directory: /Users/irontree/Library/Caches/pip/wheels/7c/e6/e6/e95742bec8d8c3d40687c0c50b8537bb71347ce84a2b322234
Successfully built wordninja
Installing collected packages: wordninja
Successfully installed wordninja-2.0.0


In [71]:
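# import everything
import os
import re
import string
import pandas as pd
import nltk
import spacy
from nltk.tokenize import word_tokenize
# load the spaCy pipeline that get_pos_and_lemma relies on
# (assumption: the small English model, installed with `python -m spacy download en_core_web_sm`)
spacy_nlp = spacy.load('en_core_web_sm')
nltk.download('punkt')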

[nltk_data] Downloading package punkt to /Users/irontree/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### 2. Define the function for cleaning the texts

In [73]:
def clean_and_tokenize_lyrics(text):
    # Lowercase the text
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenization
    tokens = word_tokenize(text)
    return tokens
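
Because punctuation is removed before tokenization, contractions collapse into single tokens. The sketch below illustrates this on a made-up sentence. It uses plain `str.split()` as a stand-in for `word_tokenize` so it runs without NLTK data, and the helper name `strip_punctuation` is hypothetical, not part of the notebook.

```python
import string

def strip_punctuation(text):
    """Lowercase text and delete every ASCII punctuation character."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

sample = "Don't stop, it's over!"  # made-up sentence for illustration
cleaned = strip_punctuation(sample)
tokens = cleaned.split()  # stand-in for word_tokenize
print(tokens)  # ['dont', 'stop', 'its', 'over']
```

Note that `string.punctuation` covers ASCII characters only, so a typographic apostrophe (`’`) in a lyrics file would survive this step.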

#### 3. Define the function for getting POS tags and lemmas

In [74]:
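def get_pos_and_lemma(tokens):
    # spaCy re-tokenizes the joined string, so its tokens
    # can differ from the NLTK tokens passed in
    spacy_doc = spacy_nlp(' '.join(tokens))
    # get the POS tag of each spaCy token
    pos = [token.pos_ for token in spacy_doc]
    # get the lemma of each spaCy token
    lemmas = [token.lemma_ for token in spacy_doc]
    return pos, lemmas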

#### 4. Read the text files and build a structured dataset

In [75]:
file_data = []
lyrics_dir = 'Lyrics Dataset of top20 Songs'
# read every .txt file in the lyrics folder
for filename in os.listdir(lyrics_dir):
    if filename.endswith('.txt'):
        with open(os.path.join(lyrics_dir, filename), 'r', encoding='utf-8') as file:
            text = file.read()
        # clean and tokenize the text
        cleaned_tokens = clean_and_tokenize_lyrics(text)
        # get the POS tags and lemmas of the text
        pos_tags, lemmas = get_pos_and_lemma(cleaned_tokens)
        file_data.append({
            # remove the file extension from the name
            'filename': os.path.splitext(filename)[0],
            'original_text': text,
            'tokens': cleaned_tokens,
            'POS': pos_tags,
            'lemmas': lemmas
        })
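
`os.listdir` returns files in arbitrary order, so the rows of the dataset may come out in a different order on another machine. A minimal alternative sketch with `pathlib` is shown below: it sorts the files and takes the song name from `Path.stem`. The function name `read_lyrics`, the file names, and the file contents are placeholders for illustration.

```python
from pathlib import Path
import tempfile

def read_lyrics(folder):
    """Return (song_name, text) pairs for every .txt file in folder, sorted by file name."""
    pairs = []
    for path in sorted(Path(folder).glob("*.txt")):
        # path.stem is the file name without its extension
        pairs.append((path.stem, path.read_text(encoding="utf-8")))
    return pairs

# Demo on a throwaway folder with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "Yellow.txt").write_text("placeholder text one", encoding="utf-8")
    Path(tmp, "Clocks.txt").write_text("placeholder text two", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    songs = read_lyrics(tmp)

print([name for name, _ in songs])  # ['Clocks', 'Yellow']
```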

#### 5. Convert file_data into a DataFrame and export it as CSV

In [76]:
df = pd.DataFrame(file_data)

df.to_csv('Corpus.csv', index=False)
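
One thing to keep in mind when reloading this file: list-valued cells such as `tokens`, `POS`, and `lemmas` are written to CSV as their string representation, so they come back as plain strings. The sketch below shows the round trip with the standard-library `csv` module as a stand-in for pandas, using a made-up row; `ast.literal_eval` turns the string back into a list.

```python
import ast
import csv
import io

# Made-up row standing in for one entry of file_data
row = {"filename": "example_song", "tokens": ["first", "line"]}

# Write the row to an in-memory CSV; the list is stored via str()
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: the tokens cell is now a string, not a list
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(repr(loaded["tokens"]))  # "['first', 'line']"

# Parse the string back into a real Python list
tokens = ast.literal_eval(loaded["tokens"])
print(tokens)  # ['first', 'line']
```

With pandas, the same parsing could be applied at load time by passing `converters={"tokens": ast.literal_eval}` to `pd.read_csv`.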