Scraping Apple charts using requests & BeautifulSoup

Purpose
In this notebook, the Apple charts dataset is scraped directly from the iTunes website. Only each app's title, description, rating, genre and the title of its chart are kept. We use BeautifulSoup rather than XPath because the same piece of data can sit at a different position on different apps' pages, and a fixed XPath would break in those cases.
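
To illustrate why matching on tag and class is less sensitive to layout than a positional path, here is a minimal sketch. The two HTML snippets and the class name `rating` are made up for the example and are not taken from the real iTunes pages; `html.parser` is used only so that the sketch needs no extra parser package.

```python
from bs4 import BeautifulSoup

# Two hypothetical app pages: the rating sits at a different depth in each.
page_a = '<div><ul><li class="rating">4.5, 1K Ratings</li></ul></div>'
page_b = ('<div><section><p>New</p><ul><li>x</li>'
          '<li class="rating">3.9, 200 Ratings</li></ul></section></div>')

def extract_rating(html):
    """Return the text before the first comma in the <li class="rating"> element."""
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.find('li', {'class': 'rating'})
    if node is None or node.string is None:
        return None
    return node.string.split(',')[0]

# The same lookup works on both layouts.
ratings = [extract_rating(page_a), extract_rating(page_b)]
```

A positional path such as "first li of the first ul" would return the rating on the first page but the unrelated `x` item on the second.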

In [1]:
import json
import os.path
import time
import random

import requests
from requests.compat import urljoin
from bs4 import BeautifulSoup

In [2]:
apple_save_path = '../../datasets/1100_apple_chart_dataset.json'

Each chart's URL and title

In [3]:
apple_start_url = ['https://www.apple.com/itunes/charts/top-grossing-apps/',
           'https://www.apple.com/itunes/charts/free-apps/',
           'https://www.apple.com/itunes/charts/paid-apps/']

table_title = ['top-grossing-apps',
             'free-apps',
             'paid-apps']
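
Each chart title here is the last path segment of its URL, so the second list could also be derived from the first instead of being kept in sync by hand. A minimal sketch, where `chart_title_from_url` is a hypothetical helper name that does not appear elsewhere in the notebook:

```python
from urllib.parse import urlparse

def chart_title_from_url(url):
    """Return the last non-empty path segment of a chart URL."""
    path = urlparse(url).path              # e.g. '/itunes/charts/free-apps/'
    return path.rstrip('/').split('/')[-1]

urls = ['https://www.apple.com/itunes/charts/top-grossing-apps/',
        'https://www.apple.com/itunes/charts/free-apps/',
        'https://www.apple.com/itunes/charts/paid-apps/']
titles = [chart_title_from_url(u) for u in urls]
```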

To reduce the chance of being blocked, we pick a random 'User-Agent' for every request and wait a random interval of up to two seconds before sending it.

In [4]:
headers = { "Accept":"text/html,application/xhtml+xml,application/xml;",
            "Accept-Encoding":"gzip",
            "Referer":"http://www.example.com/" }

user_agent_list = [
 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
  'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
 'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

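def get_soup(search_url):
    """Fetch search_url with a random User-Agent and return the parsed page."""
    # Work on a copy so the shared headers dict is not modified.
    request_headers = dict(headers)
    request_headers['User-Agent'] = random.choice(user_agent_list)

    # Wait up to two seconds so requests are not sent in a tight loop.
    time.sleep(random.random() * 2)
    # The timeout keeps a stalled connection from hanging the notebook.
    response = requests.get(search_url, headers=request_headers, timeout=30)
    return BeautifulSoup(response.text, features='html5lib')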
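
The rotation and delay used by `get_soup` can be looked at in isolation, without any network access. The sketch below is only an illustration: `build_request_plan`, `BASE_HEADERS` and the placeholder agent strings are assumptions made for the example, not names from the notebook.

```python
import random

BASE_HEADERS = {"Accept": "text/html,application/xhtml+xml,application/xml",
                "Accept-Encoding": "gzip"}
AGENTS = ['agent-one/1.0', 'agent-two/2.0', 'agent-three/3.0']  # placeholder strings

def build_request_plan(base_headers, agents, max_wait=2.0, rng=random):
    """Return (headers, wait_seconds) for one request.

    A copy of base_headers is returned, so the shared dict is never modified.
    """
    headers = dict(base_headers)
    headers['User-Agent'] = rng.choice(agents)
    wait_seconds = rng.random() * max_wait
    return headers, wait_seconds

headers, wait_seconds = build_request_plan(BASE_HEADERS, AGENTS)
```

Keeping the choice of headers and delay in a pure function like this makes the behaviour easy to check before any request is sent.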

In [5]:
apple_dataset = []

Create a Spider class. It is initialised with the URL and the title of a chart. Its "work" method returns a list of items, each holding an app's title, description, rating, genre and the title of the chart.

In [6]:
class Spider(object):
    def __init__(self, search_url, chart_title):
        self.soup = get_soup(search_url)
        self.li = self.soup.find('div', {'id': 'main'}).find('section').find('div').find('ul').find_all('li')
        self.link = []
        self.title = []
        self.chart = chart_title
        
        for x in self.li:
            anchor = x.find('h3').find('a')
            # urljoin leaves absolute links unchanged and resolves relative ones.
            self.link.append(urljoin(search_url, anchor['href']))
            self.title.append(anchor.string.strip())
            
    def work(self):
        result = []
        for title, url in zip(self.title, self.link):
            soup = get_soup(url)

            # Every field is optional. A missing node raises AttributeError
            # (None has no .find) or IndexError, and the field is stored as None.
            try:
                rating = soup.find('li', {'class': 'product-header__list__item app-header__list__item--user-rating'}).find('figcaption').string.split(',')[0]
            except (AttributeError, IndexError):
                rating = None

            div = soup.find('div', {'class': 'animation-wrapper is-visible ember-view'})
            try:
                section = div.find_all('div', {'class': 'ember-view'})[1].find('section', {'class': 'l-content-width section section--hero product-hero ember-view'})
                description = section.find('header').find_all('h2')[0].string
            except (AttributeError, IndexError):
                description = None
            try:
                section = div.find_all('div', {'class': 'ember-view'})[1].find_all('section', {'class': 'l-content-width section section--bordered'})[3]
                genre = section.find_all('dd')[2].find('a').string
            except (AttributeError, IndexError):
                genre = None

            result.append({
                'title': title,
                'description': description,
                'chart': self.chart,
                'rating': rating,
                'genre': genre,
            })

        return result
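
For reference, one element of the list returned by "work" has the shape sketched below. The values are invented for illustration; only the keys mirror what the method stores.

```python
import json

# Invented values; only the keys follow the items built in Spider.work().
example_item = {
    'title': 'Some App',
    'description': 'A short tagline',
    'chart': 'free-apps',
    'rating': '4.5',
    'genre': 'Games',
}

# A field that could not be scraped is None and becomes null in the JSON file.
missing_item = dict(example_item, rating=None)

encoded = json.dumps([example_item, missing_item])
decoded = json.loads(encoded)
```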


A connection failure can happen at any time. We therefore save the dataset as JSON each time a chart is finished, so that the charts already scraped are kept and we can carry on with the next chart after a failure.

In [7]:
for url, chart_title in zip(apple_start_url, table_title):
    spider = Spider(url, chart_title)
    apple_dataset.extend(spider.work())

    # Rewrite the file after every chart so finished charts survive a failure.
    with open(apple_save_path, 'w') as file:
        json.dump(apple_dataset, file)
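
The loop above saves after every chart but always starts again from the first chart. One possible way to resume after a failure is to reload the saved file and skip the charts that already have items in it. This is a sketch under that assumption: `load_checkpoint` and `charts_still_to_do` are hypothetical helpers, and a temporary file stands in for the real save path.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return the previously saved item list, or an empty list if no file exists."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f)

def charts_still_to_do(chart_titles, saved_items):
    """Return the chart titles that have no saved items yet, in original order."""
    done = {item['chart'] for item in saved_items}
    return [t for t in chart_titles if t not in done]

# Demonstration with a temporary file instead of the real dataset path.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'checkpoint.json')
    first_run = load_checkpoint(path)          # no file yet
    with open(path, 'w') as f:
        json.dump([{'title': 'Some App', 'chart': 'free-apps'}], f)
    saved = load_checkpoint(path)
    remaining = charts_still_to_do(
        ['top-grossing-apps', 'free-apps', 'paid-apps'], saved)
```

Note that this treats a chart as finished as soon as one of its items is saved, which matches the per-chart saving used here but would not detect a chart that was only partly written.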