# Spotify in Egypt
> data analysis for music in Egypt

- toc: true
- badges: true
- comments: true
- categories: [web scraping, API, data analysis]

This is an attempt to study and analyze the music industry in Egypt over the last 70 years.
First, I set up the Spotify API to search and look up data for each song and each artist. However, the API gives no way to tell which artists are the most listened to in Egypt (whether they are Egyptian, Arabic, or neither). So I looked for a list with that kind of data and found a [website](https://www.last.fm/tag/egyptian/artists?page=1) with stats similar to what I wanted (not perfect, though); it was the best I could find browsing the web.

Questions to be answered:
1. What are the most listened-to genres?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import requests
import base64
import datetime
from urllib.parse import urlencode

from bs4 import BeautifulSoup
from time import sleep

%matplotlib inline

## Retrieving Data from Spotify API

`client_id` and `client_secret` are retrieved from the project created on the [Spotify developer dashboard](https://developer.spotify.com/dashboard/applications).

In [15]:
endpoint = "https://api.spotify.com/v1" 

client_id = "CLIENT_ID"
client_secret = "CLIENT_SECRET"

In [3]:
token_url = "https://accounts.spotify.com/api/token"
client_creds = f"{client_id}:{client_secret}"
client_creds_b64 = base64.b64encode(client_creds.encode())

token_data = {
    "grant_type": "client_credentials"
}

token_headers = {
    "Authorization": f"Basic {client_creds_b64.decode()}", # <base64 encoded client_id:client_secret>
}

Extracting the access token and its expiry time from the response

In [4]:
r = requests.post(token_url, data=token_data, headers=token_headers)
valid_request = r.status_code in range(200, 300)

if valid_request:
    token_response_data = r.json()
    access_token = token_response_data['access_token']

    now = datetime.datetime.now()
    expire_in = token_response_data['expires_in'] # seconds
    expires = now + datetime.timedelta(seconds=expire_in)
    did_expire = expires < now
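
Since `did_expire` is computed right after the token is issued, it is always `False` at that point. A minimal sketch of how the check could be repeated later, before each request, assuming the `expires` timestamp from the cell above is kept around. The helper name `token_expired` and the 30-second safety margin are my own assumptions, not part of the Spotify API:

```python
import datetime

def token_expired(expires, now=None, margin_seconds=30):
    """Return True if the token has expired or is about to (within the margin)."""
    if now is None:
        now = datetime.datetime.now()
    return now >= expires - datetime.timedelta(seconds=margin_seconds)

# illustrative timestamps only; a real run would pass the `expires` value computed above
issued = datetime.datetime(2022, 6, 18, 18, 54, 47)
expires_example = issued + datetime.timedelta(seconds=3600)

print(token_expired(expires_example, now=issued))                                # False
print(token_expired(expires_example, now=issued + datetime.timedelta(hours=2)))  # True
```

When the helper returns `True`, the token request cell would need to be run again to get a fresh `access_token`.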

In [8]:
# it expires in an hour from now
print(expires)

2022-06-18 19:54:47.124216


## Search for an item

In [12]:
headers = {
    "Authorization": f"Bearer {access_token}"
}

In [20]:
def search_spotify(access_token, query, search_type):
    search_endpoint = f"{endpoint}/search"
    data = urlencode({
        "q": query,
        "type": search_type,
    })
    
    lookup_url = f"{search_endpoint}?{data}"
    
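    # use the token passed in rather than the global `headers`
    auth_headers = {"Authorization": f"Bearer {access_token}"}
    r = requests.get(lookup_url, headers=auth_headers)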
    if r.status_code not in range(200, 300):
        return {}
    return r.json()
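
To get at the genre question, the search response has to be unpacked. A minimal sketch, assuming the response for `search_type="artist"` has the shape `{"artists": {"items": [...]}}` with a `genres` list on each item. The helper name `first_artist_genres` and the sample dictionary are made up for illustration and are not real API output:

```python
def first_artist_genres(search_response):
    """Return (name, genres) of the top artist hit, or None if nothing matched."""
    items = search_response.get("artists", {}).get("items", [])
    if not items:
        return None
    top = items[0]
    return top.get("name"), top.get("genres", [])

# hand-written sample mimicking the assumed response shape
sample_response = {
    "artists": {
        "items": [
            {"name": "Example Artist", "genres": ["example genre a", "example genre b"]},
        ]
    }
}

print(first_artist_genres(sample_response))
# an empty dict is what search_spotify returns on a failed request
print(first_artist_genres({}))
```

Returning `None` for an empty result keeps a failed lookup distinguishable from an artist that simply has no genres listed.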

## Scraping Data out of Last.fm

In [115]:
artists_names_array = []

for page_number in range(1, 5):
    # small sleep before each request to avoid ConnectionResetError (errno 104)
    sleep(1)
    html = requests.get(f"https://www.last.fm/tag/egyptian/artists?page={page_number}")
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(html.content, "html.parser")

    main_div = soup.find_all('div', class_="col-main")[0]
    artists = main_div.find_all('li', attrs={"itemtype":"http://schema.org/MusicGroup"})

    for artist in artists:
        artists_names_array.append(artist.h3.text)

In [116]:
artists_names = pd.Series(artists_names_array)  # default index is already 0..n-1

In [117]:
artists_names

0                 عمر دياب
1             Karl Sanders
2             Oum Kalthoum
3                  Sherine
4                    Hakim
              ...         
79                   شيرين
80                Shahinaz
81              Hathorious
82         Hesham El Araby
83    Mohamed Ali Ensemble
Length: 84, dtype: object

### Translate Arabic names into English

Using the Google Translate endpoint on RapidAPI

In [125]:
def translate(query, tl="en", sl="ar"):
    url = "https://google-translate1.p.rapidapi.com/language/translate/v2"

    payload = urlencode({
        "q": query,
        "target": tl,
        "source": sl
    })
    
    headers = {
        "content-type": "application/x-www-form-urlencoded",
        "Accept-Encoding": "application/gzip",
        "X-RapidAPI-Key": "API-KEY",
        "X-RapidAPI-Host": "google-translate1.p.rapidapi.com"
    }

    response = requests.post(url, data=payload, headers=headers)
    if response.status_code not in range(200, 300):
        print(response.status_code)
        return {}
    
    return response.text

In [130]:
artists_names

0                 عمر دياب
1             Karl Sanders
2             Oum Kalthoum
3                  Sherine
4                    Hakim
              ...         
79                   شيرين
80                Shahinaz
81              Hathorious
82         Hesham El Araby
83    Mohamed Ali Ensemble
Length: 84, dtype: object

In [131]:
translate(artists_names[79])

'{"data":{"translations":[{"translatedText":"Shereen"}]}}'
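
`translate` returns the raw JSON text, so the name still has to be pulled out of it. A minimal sketch based on the payload shown above; the helper name `extract_translation` is an assumption of mine, and it treats the empty `{}` that `translate` returns on failure as "no translation":

```python
import json

def extract_translation(response_text):
    """Return the first translated string, or None if the payload has none."""
    if not response_text:
        # covers the empty dict returned by translate() on a failed request
        return None
    payload = json.loads(response_text)
    translations = payload.get("data", {}).get("translations", [])
    if not translations:
        return None
    return translations[0].get("translatedText")

raw = '{"data":{"translations":[{"translatedText":"Shereen"}]}}'
print(extract_translation(raw))  # Shereen
```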

### Drop Duplicates
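Some artists appear more than once in the scraped list, so repeats need to be removed once all names are in the same script. A minimal plain-Python sketch, assuming duplicates differ at most by letter case or surrounding whitespace; the helper name `drop_duplicate_names` and the sample list are hypothetical:

```python
def drop_duplicate_names(names):
    """Remove repeated names (ignoring case and outer whitespace), keeping the first occurrence."""
    seen = set()
    unique = []
    for name in names:
        key = name.strip().casefold()
        if key not in seen:
            seen.add(key)
            unique.append(name.strip())
    return unique

sample_names = ["Sherine", "Hakim", "sherine ", "Hakim", "Shahinaz"]
print(drop_duplicate_names(sample_names))  # ['Sherine', 'Hakim', 'Shahinaz']
```

For exact duplicates only, `artists_names.drop_duplicates()` on the pandas Series does the same job. Spelling variants such as "Shereen" and "Sherine" are not caught by either approach.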

### Fix the inconsistency in the data
e.g. Umm Kulthum vs. Oum Kalthoum.
See [Kaggle's lesson on data inconsistency](https://www.kaggle.com/code/alexisbcook/inconsistent-data-entry).
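
One simple way to handle known variants is a hand-maintained alias table. A minimal sketch, assuming the variants are collected manually; the table `CANONICAL_NAMES`, the helper `canonical_name`, and the choice of "Umm Kulthum" as the canonical spelling are all my own assumptions:

```python
# hypothetical alias table: normalised variant -> canonical spelling
CANONICAL_NAMES = {
    "oum kalthoum": "Umm Kulthum",
    "umm kulthum": "Umm Kulthum",
}

def canonical_name(name):
    """Map a known spelling variant to its canonical form; leave other names unchanged."""
    key = " ".join(name.split()).casefold()  # collapse whitespace, ignore case
    return CANONICAL_NAMES.get(key, name.strip())

print(canonical_name("Oum Kalthoum"))  # Umm Kulthum
print(canonical_name(" Hakim "))       # Hakim
```

An explicit table is easy to review but has to be extended by hand; fuzzy string matching could be used to suggest candidate pairs for it.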