# Scraping Top Charts on ‘JioSaavn’  using Python libraries requests, BeautifulSoup and pandas

![JioSaavn-homepage](https://i.imgur.com/JntSV0z.png)

[JioSaavn](https://www.jiosaavn.com) is an Indian online music streaming service and a digital distributor of Hindi, English, Malayalam, Bengali, Kannada, Tamil, Telugu, Bhojpuri and other regional Indian music around the world. Here songs are categorized in many charts based on popularity, artists, release era and many others.

In this project, we will retrieve information from Top charts pages/urls using [Web Scrapping](www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it): The process of extracting and parsing data from websites in an automated fashion using a computer program / code.
![Web Scraping](https://i.imgur.com/g7xoZZ8.png)
We will use the python libraries-
[Requests](https://realpython.com/python-requests), [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc) and [Pandas](https://pandas.pydata.org) to scrape data from this page.

Here's an outline of the steps we'll follow: 
1. Install important libraries that will be helpful for the project i.e. requests, BeautifulSoup4, pandas.
2. Download the web page using the requests library.
3. Parsing the HTML source code using beautiful soup library.
4. Inspecting HTML source code of the web pages.
5. Extracts song title, artists name(s), song duration from the web pages.
6. Compile extracted information into python lists and dictionaries.
7. Save the extracted information to a csv file.

By the end of project, we'll create a csv file which have the top 25 songs of all 7 top charts in the following format-

Title,Artist_name, Time_duration

Tujhe Dekha To,	Lata Mangeshkar Kumar Sanu,	5:03

Mera Dil Bhi Kitna Pagal Hai,	Kumar Sanu S. P. Balasubrahmanyam Alka Yagnik,	5:24

## 1. Install important libraries that will be helpful for the project i.e. requests, BeautifulSoup4, pandas.
The requests library allows us to send HTTP requests using Python. The HTTP request returns a Response Object with all the response data (content, encoding, status, etc). So, we have to install requests to download the webpage.
Beautiful Soup is a Python library for pulling data out of HTML and XML files. So, we have to install Beautiful Soup to parse through HTML tags.
Pandas is a python library and is used to analyse data.
After installing, we have to import library for use in our program.

In [1]:
!pip install requests --upgrade --quiet
!pip install BeautifulSoup4 --upgrade --quiet
!pip install pandas --upgrade --quiet

In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

## 2. Download the web page using the requests library.
1. We'll use the requests library to download the web page using 'requests.get' function from requests which return a response object.
2. with response.text, we will get the text file for the url.
3.. we can run response.status_code property to check if webpage successfully parsed or not. If [HTTP status code]('https://www.geeksforgeeks.org/http-status-codes-successful-responses') value between 200-299 then it is successful.
4. We will save it to a file and view the page locally using "File>Open" within Jupyter. In same manner, we can read saved file. saved file preview looks similar to the original page but none of the links works.
Here, I have link 2 images to show similarity between both.
We  have to do this process many times in our project. So, here in section 3.1, we have defined a function for it.
![JioSaavn_original_image](https://i.imgur.com/i0unC4S.png)
![JioSaavan_request_get_page](https://i.imgur.com/Z6zwcal.png)

## 3. Parsing the HTML source code using beautiful soup library. 
### To extract information from the HTML source code using programming, we will use the Beautiful Soup library. Beautiful Soup will return an object containing several properties and methods to extract the information from HTML documents.

As we need to do step 2 & 3 many times for home url (https://www.jiosaavn.com) and then for Top Charts which have 11 pages. So, we will define below function here.

3.1 Source_code to get the source code (HTML file)of url  and to make a beautiful soup objects.

3.2 get_name to get the title of songs from a chart.

3.3 get_artist_name to get the artist name(s) for a song.

3.4 get_duration to get the duration for a song.

3.5 get_songs_from_url_list to get the required song's information from the url list.

### 3.1 Defining a function 'Source_code' to get the source code (HTML file)of url and to make a beautiful soup objects.

In [4]:
def Source_code(url):
    response = requests.get(url)
    #To ensure that the response is successful
    if response.status_code != 200:
        raise Exception('Unable tp download the requested web page')
    url_contents = response.text
    with open('jiosavan.html','w',encoding='utf-8') as file:
        file.write(url_contents)
    with open('jiosavan.html','r',encoding = 'utf-8') as fle:
        jiosavan_source = fle.read()
    doc = BeautifulSoup(jiosavan_source,'html.parser')
    return doc

### 3.2 get_name to get the title of songs from a chart.
to do this, at first we have to inspect the HTML code for chart. From code, we can see that every song is listed in div'' tag with class 'c-drag' and title in 'a' tag with class 'u-color-js-gray'. you can check below.
![song_source_code](https://i.imgur.com/HqgHejy.png) So, first we will find all div tags and then use for loop to scrap first 25 songs' titles.

In [5]:
def get_name(doc):
    List_tag = doc.find_all('div', {'class':'c-drag'}) #to find out all song's main tags.
    Name = [] #we have created an empty list to store the titles of songs.
    for tag in List_tag:
        S_name = tag.find('a', {'class':'u-color-js-gray'}).text.strip() #to scrap the titles of song.
        Name.append(S_name)
        if len(Name) ==25 : #to limit it to 25
            break
    return(Name)

### 3.3 get_artist_name to get the artist name(s) for a song.
to do this, at first we have to inspect the HTML code for chart. From code, we can see that every song is listed in div'' tag with class 'c-drag' and title in 'p' tag with class 'u-centi u-ellipsis u-color-js-gray u-margin-right@sm u-margin-right-none@lg'. you can check below.
![song_artist_source_code](https://i.imgur.com/G9MwwyS.png). For a song, there can be more artists for a song. So, we will use loop inside loop. So, first we will find all div tags and then use for loop inside loop to scrap first 25 songs' artists.

In [6]:
#defining functions get_artist_names to scrap artist name.
def get_artist_name(doc):
    List_tag = doc.find_all('div', {'class':'c-drag'}) #to scrap all the song's tag.
    Artist_name = [] #we have created an empty list to store the artists of songs.
    for tag in List_tag:
        main_tag_artist = tag.find_all('p', {'class':'u-centi u-ellipsis u-color-js-gray u-margin-right@sm u-margin-right-none@lg'}) #to scrap artist names main tag for a song.
        for tags in main_tag_artist: 
            artist_tag = tags.find_all('a') #to scrap all artist names for a song.
            Artist = ""
            for atag in artist_tag: #to add all artists name to list item.
                artist1 = atag.text
                Artist = Artist + artist1 +','
            Artist_name.append(Artist) #to add artists related to a song in list.
            if len(Artist_name) == 25 : #to limit it to 25
                break
    return(Artist_name)

### 3.4 get_duration to get the duration for a song.
to do this, at first we have to inspect the HTML code for chart. From code, we can see that every song is listed in div'' tag with class 'c-drag' and title in 'span' tag with class 'o-snippet__action-init u-centi'. you can check below.
![song_duration_source_code](https://i.imgur.com/dETRHrS.png). So, first we will find all div tags and then use for loop to scrap first 25 songs' durations.

In [7]:
#defining a function get_duration for duration of song
def get_duration(doc):
    import time
    List_tag = doc.find_all('div', {'class':'c-drag'}) #to scrap all song's main tag
    duration = [] #we have created an enpty list to store the duration of songs.
    for tag in List_tag:
        S_duration = tag.find_all('span', {'class':'o-snippet__action-init u-centi'}) #to scrap duration for a song.
        for atag in S_duration:
            S_duration = atag.text #here we have fetched time as string.
        duration.append(S_duration) #to add duration of a song to a list.
        if len(duration) ==25 : #to limit it to 25
            break
    return(duration)

### 3.5 get_songs_from_url_list to get the required song's information from the url list.
We will pass treat this as main function and pass list of urls to  scrap the required informations from all of the urls in a combined dictionary.

In [8]:
def get_songs_from_url_list(lst):
    name = [] #we will create an empty list to store the related scrapped data
    Artist =[]
    duration = []
    for urls in url:
        doc = Source_code(urls)
        #now, we will pass the source code parse from Source_code to get_name to fetch the titles of songs.
        Titles = get_name(doc)
        name = name + Titles
        #now, we will pass the source code parse from Source_code to get_artist_name to fetch the Artists of songs.
        Artist_Title = get_artist_name(doc)
        Artist = Artist + Artist_Title
        #now, we will pass the source code parse from Source_code to get_duration to fetch the duration of songs.
        Time_duration = get_duration(doc)
        duration = duration + Time_duration
    #now, we will convert above 3 lists in single dictionary
    Add_dict = {}
    for i in range(len(name)):
         Add_dict[i] = {'Title' : name[i],
                        'Artist_name' : Artist[i],
                        'Time_duration' : duration[i]
             }  
    return(Add_dict)

### Now, we have to scrap urls for top charts from home page url and for this, we have to get source code of this page by Source_code function.

In [10]:
url_main = 'https://www.jiosaavn.com' #homepage url

In [11]:
doc_home = Source_code(url_main)

In [12]:
doc_home


<!DOCTYPE html>

<!--[if IEMobile 7 ]> <html dir="ltr" lang="en-US"class="no-js iem7"> <![endif]-->
<!--[if IE 9 ]>    <html dir="ltr" lang="en-US" class="no-js ie9"> <![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)|!(IEMobile)|!(IE)]><!-->
<html class="no-js" dir="ltr" lang="en-US">
<!--<![endif]-->
<head>
<base href="/"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=5, user-scalable=0" name="viewport"/>
<link href="https://staticfe.saavn.com/web6/jioindw/dist/1660222378/_i/favicon.ico" rel="icon"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible">
<html lang="en">
<head>
<title>Online Songs on JioSaavn: Download &amp; Play Latest Music for Free</title>
<meta content="Listen to New Hindi Songs Online Only on JioSaavn." name="subtitle">
<meta content="JioSaavn" name="author"/>
<meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=5, user-scalable=0" name="viewport"/>
<meta content="IE=edge" h

## 4. Inspecting HTML source code of the web pages.
Now we have to scrap the url for top charts from the HTML code. For this, we have to inspect the HTML code. From source code, we have inspect that 'Weekly top songs'is inside div tag with class 'o-layout__item u-1/2@sm u-1/3@md u-1/4@lg u-1/5@xxl'. you can check in 

![Weekly_top_song_source_code](https://i.imgur.com/erIMGgR.png) 

and other charts are inside div tag with class 'o-layout__item u-48@md u-1/3@lg u-1/4@xxl'. you can check in 

![Trending today source code](https://i.imgur.com/iclxiYa.png).

So, at first, we have to find out 'href' for url of 'weekly top songs' and then we will use for loop for other charts.

In [13]:
url = [] #empty list to store the required url/href
List_tag = doc_home.find('div',{'class':'o-layout__item u-1/2@sm u-1/3@md u-1/4@lg u-1/5@xxl'}) #to scrap 'weekly top songs' main tag
url_name = List_tag.find('a', {'class':'o-block__link'}, href = True).get('href') #to scrap 'href' from 'weekly top songs'
url.append('https://www.jiosaavn.com'+url_name) #now,we have concatenate href with base url to make complete url and store it in url list.
    
List_tag2 = doc_home.find_all('div', {'class':'o-layout__item u-48@md u-1/3@lg u-1/4@xxl'}) #to scrap other charts main tag.
for tag in List_tag2:
    url_name2 = tag.find('a', {'class':'o-block__link'}, href = True).get('href') #to scrap other charts href.
    url.append('https://www.jiosaavn.com'+url_name2) #to add other charts url to list.

In [15]:
url #url lists we are going to scrap

['https://www.jiosaavn.com/featured/weekly-top-songs/8MT-LQlP35c_',
 'https://www.jiosaavn.com/featured/trending_today/I3kvhipIy73uCJW60TJk1Q__',
 'https://www.jiosaavn.com/featured/romantic_top_40/m9Qkal5S733ufxkxMEIbIw__',
 'https://www.jiosaavn.com/featured/hindi_90s/T64MUCqdndw_',
 'https://www.jiosaavn.com/featured/hindi_70s/VSMrnr-njCk_',
 'https://www.jiosaavn.com/featured/hindi_retro/dYn-,-QcKzA_',
 'https://www.jiosaavn.com/featured/hindi_chartbusters/1HiqW,xnqZTuCJW60TJk1Q__',
 'https://www.jiosaavn.com/featured/hindi_00s/tsJahdem34A_',
 'https://www.jiosaavn.com/featured/hindi_80s/fE9YxTvTDjU_',
 'https://www.jiosaavn.com/featured/hindi_60s/TOL5Rewc8Mk_',
 'https://www.jiosaavn.com/featured/delhi_hot_50/GTNWyqVzfESO0eMLZZxqsA__']

So, we have scrapped 11 top charts url here to scrap songs from. Now we will scrap first 25 songs (title, artist name(s), duration) from every top chart. As we have to do for 11 scrapped url so, now, we will define balance required functions first.

In [16]:
#Now, we will pass ht url_list 'url' to main function get_songs_from_url_list to get the required dictionary with required scrapped data.
song_dict = get_songs_from_url_list(url)
#Convert parsing parts into csv file
#Let’s first convert dictionary into Pandas DataFrame. A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns. Then, using to_csv, will save the DataFrame into CSV file.
data_frame = pd.DataFrame(song_dict)
data_frame = data_frame.transpose()
data_frame.to_csv('Songs.csv', index=None)
#Have a look on csv file using pandas library
#read_csv helps to read a comma-separated values (csv) file into DataFrame.
pd.read_csv('Songs.csv')

Unnamed: 0,Title,Artist_name,Time_duration
0,"Kesariya (From ""Brahmastra"")","Pritam, Arijit Singh, Amitabh Bhattacharya,",4:28
1,Baarish Aayi Hai,"Javed-Mohsin, Stebin Ben, Shreya Ghoshal,",3:32
2,Iss Baarish Mein,"Yasser Desai, Neeti Mohan,",4:00
3,Ijazzat Hai,"Raj Barman,",3:15
4,Dhoke Pyaar Ke,"Rochak Kohli, B Praak,",4:19
...,...,...,...
270,Kana Yaari,"Payal Dev, Jubin Nautiyal,",3:46
271,Halki Si Barsaat,"Yasser Desai, Neeti Mohan,",3:32
272,"Nikamma (From ""Nikamma"")","Payal Dev, Stebin Ben,",2:48
273,Jaa Rahe Ho,"Sachet Tandon,",4:07


## Summary  

1. Downloaded the webpage using requests library 
2. Scrapped top 25 top song's titles, artist names, time duration for 11 top charts listed on home page at jioSaavn by parsing the HTML source code of the web page using the Beautiful Soup library. 
3. Combined the lists of all the required informations in a dictionary. 
4. parsed the scrapped dictiobary into CSV file of 275 rows and 3 columns from 11 different pages. 

## Future works ideas
1.	Applying sort functions to make a playlist of self choice.
2.	We can do analysis for finding the short duration songs.
3. Improving the documentation part.

## References
1. [JioSaavn](https://www.jiosaavn.com)
2. [Jovian web scrapping with Python - project ideas](https://jovian.ai/learn/zero-to-data-analyst-bootcamp/assignment/project-web-scraping-with-python)
3. [Jovian web scrapping with python](https://jovian.ai/learn/zero-to-data-analyst-bootcamp/assignment/project-web-scraping-with-python)
4. [jovian documentation and story telling](https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/documentation-and-storytelling)

In [None]:
jovian.commit(files = ['Songs.csv','JioSavan.ipynb'])

<IPython.core.display.Javascript object>