Scraping developers' data for common apps

Purpose
This notebook builds a developer-level dataset restricted to developers who release the same app on both platforms (the Apple App Store and Google Play).

In [1]:
from os.path import exists, isfile
import random
import time

import pandas as pd
import numpy as np

import requests
from bs4 import BeautifulSoup

In [2]:
save_path = '../../datasets/2400_combine_developer_datasets.csv'

In [3]:
dataset_path = "../../datasets/2300_combine_kaggle_datasets.csv"
if not exists(dataset_path):
    print("Missing dataset file")

df = pd.read_csv(dataset_path)
df.head()

Unnamed: 0,apple_id,trim_title,apple_title,genre,apple_rating,apple_reviews,apple_size,pegi,normed_apple_rating,google_title,...,log_google_reviews,log_apple_reviews,z_score_google_rating,z_score_apple_rating,z_score_google_sub_apple,norm_google_sub_apple,google_revenue,apple_revenue,log_google_revenue,log_apple_revenue
0,898968647,Call of Duty®,Call of Duty®: Heroes,Games,4.5,179416,201.075195,Teen,0.9,Call of Duty®: Heroes,...,6.205244,5.253861,0.449144,0.638043,-0.188899,-0.02,0.0,0.0,,
1,1147297267,Dont Starve,Don't Starve: Shipwrecked,Games,3.5,495,604.341797,Everyone 10+,0.7,Don't Starve: Shipwrecked,...,3.166726,2.694605,-0.143135,-0.806018,0.662884,0.12,7325.32,2470.05,3.864827,3.392706
2,352670055,F,F-Sim Space Shuttle,Games,4.5,6403,72.855469,Everyone,0.9,F-Sim Space Shuttle,...,3.73456,3.806384,0.449144,0.638043,-0.188899,-0.02,27080.73,31950.97,4.43266,4.504484
3,763692274,Grand Theft Auto,Grand Theft Auto: San Andreas,Games,4.0,32533,1964.96582,Mature 17+,0.8,Grand Theft Auto: San Andreas,...,5.542778,4.512324,0.449144,-0.083987,0.533131,0.08,2439244.38,227405.67,6.387255,5.356801
4,771989093,LEGO® Friends,LEGO® Friends,Games,4.0,400,730.941406,Everyone,0.8,LEGO® Friends,...,2.931458,2.60206,0.449144,-0.083987,0.533131,0.08,4261.46,1996.0,3.629558,3.300161


To avoid being blocked, we change the 'User-Agent' every time we send a request.

In [4]:
headers = { "Accept":"text/html,application/xhtml+xml,application/xml;",
            "Accept-Encoding":"gzip",
            "Referer":"http://www.example.com/" }

user_agent_list = [
 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
  'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
 'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

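def get_soup(search_url):
    """Fetch a page with a random User-Agent and return the parsed HTML."""
    # Rotate the User-Agent on every request to reduce the chance of being blocked.
    headers['User-Agent'] = random.choice(user_agent_list)

    # Pause for a random interval of up to 2 seconds between requests.
    time.sleep(random.random() * 2)
    response = requests.get(search_url, headers=headers)
    soup = BeautifulSoup(response.text, features='html5lib')

    return soup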
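
The header rotation and the random pause can be checked in isolation, without sending a request. The sketch below is a minimal variant under stated assumptions: `build_request_headers`, `jittered_delay`, `BASE_HEADERS`, and the shortened `AGENTS` list are hypothetical names for illustration and are not part of the notebook.

```python
import random

# Hypothetical base headers and a shortened agent list, for illustration only.
BASE_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Encoding": "gzip",
}

AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
]


def build_request_headers(base, agents, rng=random):
    """Return a new header dict with a randomly chosen User-Agent."""
    headers = dict(base)  # copy, so the base dict is left unchanged
    headers["User-Agent"] = rng.choice(agents)
    return headers


def jittered_delay(max_seconds=2.0, rng=random):
    """Return a random delay in the range [0, max_seconds)."""
    return rng.random() * max_seconds


request_headers = build_request_headers(BASE_HEADERS, AGENTS)
delay = jittered_delay()
```

Passing `rng=random.Random(0)` instead of the module makes the choice reproducible, which can help when debugging a blocked request.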

We look up each app on the Apple App Store by its app ID, which identifies it exactly. If that lookup fails, we fall back to searching Google Play by the app's Google title. Because we only search for apps that are common to both stores, every developer we find releases apps on both platforms.
Merging separate Google and Apple developer datasets would be a poor alternative, because most developers use different names on the two platforms.

In [5]:
from urllib.parse import quote_plus

def app2developer(app_id, title):
    # First try the Apple App Store page, looked up by app ID.
    if pd.notna(app_id):
        search_url = 'https://itunes.apple.com/ie/app/id{0}?mt=8'.format(str(app_id))
        soup = get_soup(search_url)

        try:
            div = soup.find('div', {'class': 'animation-wrapper is-visible ember-view'})
            section = div.find_all('div', {'class': 'ember-view'})[1].find('section', {'class': 'l-content-width section section--hero product-hero ember-view'})
            developer = section.find('header').find_all('h2')[1].find('a').string.strip()
            return developer
        except (AttributeError, IndexError):
            # A missing element means the page did not have the expected layout.
            print('Cannnot find this app on apple store')

    # Otherwise fall back to a Google Play search by title.
    if pd.notna(title):
        # URL-encode the title so spaces and punctuation survive in the query string.
        search_url = 'https://play.google.com/store/search?q={0}&c=apps'.format(quote_plus(str(title)))
        soup = get_soup(search_url)

        try:
            developer = soup.find('a', {'class': 'subtitle'}).string.strip()
            return developer
        except AttributeError:
            print('Cannnot find this app on google play')

    return None
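
Titles that contain spaces, colons, or apostrophes need URL-encoding before they go into the Google Play query string. The sketch below shows how the two lookup URLs can be built and inspected without any network access. The helper names `apple_lookup_url` and `google_search_url` are hypothetical; the URL templates are the ones used in this notebook.

```python
from urllib.parse import quote_plus

APPLE_URL = "https://itunes.apple.com/ie/app/id{0}?mt=8"
GOOGLE_URL = "https://play.google.com/store/search?q={0}&c=apps"


def apple_lookup_url(app_id):
    """Build the App Store page URL for a numeric app ID."""
    return APPLE_URL.format(app_id)


def google_search_url(title):
    """Build a Google Play search URL with the title percent-encoded."""
    return GOOGLE_URL.format(quote_plus(str(title)))


# Print both URLs for one app from the dataset preview.
print(apple_lookup_url(1147297267))
print(google_search_url("Don't Starve: Shipwrecked"))
```

Printing the URLs for a handful of rows before running the full scrape is a cheap way to spot malformed queries.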

In [6]:
# Keep failed lookups as None (not the string 'None') so that dropna() can remove them later.
df['developer'] = df.apply(lambda x: app2developer(x.apple_id, x.google_title), axis=1)

Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
Cannnot find this app on apple store
C


In [7]:
df = df.dropna(subset=['developer'])
#df.to_csv('../../scraper.csv', index=False)
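
`dropna` only removes real missing values, so a failed lookup has to reach the frame as `None`/`NaN` rather than as text. A small sketch on a made-up frame (the `toy` data and variable names are invented for illustration):

```python
import pandas as pd

# Invented example: the second app has no developer.
toy = pd.DataFrame({
    "app": ["a", "b", "c"],
    "developer": ["Dev A", None, "Dev C"],
})

# Converting every value with str() turns the missing value into a plain string,
# so dropna() no longer sees anything to drop.
as_text = toy["developer"].apply(str)
kept_text = toy.assign(developer=as_text).dropna(subset=["developer"])

# Leaving the missing value in place lets dropna() remove that row.
kept_missing = toy.dropna(subset=["developer"])

print(len(kept_text), len(kept_missing))
```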

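Create the developer dataset by aggregating the app-level columns per developer: means for ratings and revenue, the number of apps, and medians for review counts.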

In [9]:
df_grouped = df.groupby('developer')
# Mean ratings and revenue per developer.
df_mean = df_grouped[['apple_rating', 'google_rating', 'normed_apple_rating','normed_google_rating', 'apple_revenue', 'google_revenue','z_score_apple_rating','z_score_google_rating']].mean()
# Number of apps per developer; renamed so it does not clash with the mean 'apple_rating'.
df_count = df_grouped['apple_rating'].count().rename('app_num')
# Median review counts per developer.
df_median = df_grouped[['log_google_reviews', 'log_apple_reviews', 'google_reviews', 'apple_reviews']].median()

# The column names carry over from the pieces, giving a flat (single-level) header.
developer = pd.concat([df_mean, df_count, df_median], axis=1)
developer.head()

developer,apple_rating,google_rating,normed_apple_rating,normed_google_rating,apple_revenue,google_revenue,z_score_apple_rating,z_score_google_rating,app_num,log_google_reviews,log_apple_reviews,google_reviews,apple_reviews
123RF Limited,4.0,4.4,0.8,0.88,0.0,0.0,-0.083987,0.449144,1,6.065666,3.322012,1163232.0,2099.0
1Der Entertainment,4.0,4.5,0.8,0.9,0.0,0.0,-0.083987,0.64657,1,4.906755,2.754348,80678.0,568.0
"2K, Inc.",4.0,4.2,0.8,0.84,28521.45,137382.48,-0.083987,0.054292,1,4.138366,3.455606,13752.0,2855.0
2ThumbsApp,3.5,4.2,0.7,0.84,0.0,0.0,-0.806018,0.054292,1,4.878206,4.222274,75545.0,16683.0
365Scores,4.5,4.6,0.9,0.92,0.0,0.0,0.638043,0.843996,1,5.823814,3.994713,666521.0,9879.0
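
The same kind of per-developer summary can also be written with pandas named aggregation, which sets the output column names in one step. This is a minimal sketch on made-up data (the `apps` frame and its values are invented for illustration and are not taken from the dataset), not the notebook's own method:

```python
import pandas as pd

# Invented app-level rows: two apps for "Dev A", one for "Dev B".
apps = pd.DataFrame({
    "developer": ["Dev A", "Dev A", "Dev B"],
    "apple_rating": [4.0, 5.0, 3.5],
    "apple_reviews": [100, 300, 50],
})

# Each keyword is an output column: (input column, aggregation function).
summary = apps.groupby("developer").agg(
    apple_rating=("apple_rating", "mean"),
    app_num=("apple_rating", "count"),
    apple_reviews=("apple_reviews", "median"),
)

print(summary)
```

Named aggregation keeps the header flat, with `developer` as the index.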


In [10]:
developer.to_csv(save_path, index=False)
developer.shape

(444, 13)