Combining the Google and Apple datasets

Purpose
This notebook combines the cleaned Google and Apple app datasets into one dataset of apps that appear in both.

In [1]:
import math
import random
import re
import time
from os.path import exists, isfile

import numpy as np
import pandas as pd

In [2]:
save_path = '../../datasets/2300_combine_kaggle_datasets.csv'

In [3]:
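# Fail early with a clear message if the cleaned Apple dataset is missing.
if not exists("../../datasets/2200_clean_apple.csv"):
    raise FileNotFoundError("Missing dataset file: ../../datasets/2200_clean_apple.csv")

df_apple = pd.read_csv("../../datasets/2200_clean_apple.csv")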
df_apple.head()

Unnamed: 0,apple_id,apple_title,apple_size,apple_price,apple_reviews,apple_rating,apple_pegi,apple_genre,normed_apple_rating,z_score_apple,log_apple_reviews
0,281656475,PAC-MAN Premium,96.119141,3.99,21292,4.0,Everyone,Games,0.8,-0.083987,4.328216
1,281796108,Evernote - stay organized,151.232422,0.0,161065,4.0,Everyone,Business,0.8,-0.083987,5.207001
2,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",95.867188,0.0,188583,3.5,Everyone,Others,0.7,-0.806018,5.275503
3,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",122.558594,0.0,262241,4.0,Teen,Lifestyle,0.8,-0.083987,5.418701
4,282935706,Bible,88.476562,0.0,985920,4.5,Everyone,Books & Reference,0.9,0.638043,5.993842


In [4]:
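# Fail early with a clear message if the cleaned Google dataset is missing.
if not exists("datasets/2100_clean_google.csv"):
    raise FileNotFoundError("Missing dataset file: datasets/2100_clean_google.csv")

df_google = pd.read_csv("datasets/2100_clean_google.csv")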
df_google.head()

Unnamed: 0,google_title,google_genre,google_rating,google_reviews,google_size,google_price,google_pegi,log_google_reviews,normed_google_rating,z_score_google
0,Photo Editor & Candy Camera & Grid & ScrapBook,Utilities,4.1,159,19.0,0.0,Everyone,2.201397,0.82,-0.143135
1,Coloring book moana,Utilities,3.9,967,14.0,0.0,Everyone,2.985426,0.78,-0.537987
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",Utilities,4.7,87510,8.7,0.0,Everyone,4.942058,0.94,1.041422
3,Sketch - Draw & Paint,Utilities,4.5,215644,25.0,0.0,Teen,5.333737,0.9,0.64657
4,Pixel Draw - Number Art Coloring Book,Utilities,4.3,967,2.8,0.0,Everyone,2.985426,0.86,0.251718


Title to title mappings

The same app can be released under different titles on different platforms. This typically happens when a title has the form "A - B" or "A : B", where the part after the separator differs between platforms. However, sharing the prefix "A" does not guarantee that two titles refer to the same app. For example, different games in a series can be titled "A - B", "A - C", and "A - D", and these must not be treated as the same app.

One possible solution is to first extract "A" from "A - B" or "A : B" and use it as the key for joining the two datasets. If a key is unique in the joined dataset, we treat it as a match. If a key is not unique, we keep only the rows whose Google title and Apple title are identical.

In [5]:
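def clean_title(x):
    """Return the part of a title before the first separator, without punctuation."""
    x = str(x)
    # Positions of the separators in the original title (-1 when absent).
    pos1 = x.find('-')
    pos2 = x.find('–')
    pos3 = x.find(':')
    pos4 = x.find('(')
    # Cut at each separator that was found; the earliest one determines the result.
    if pos1 != -1:
        x = x[:pos1].strip()
    if pos2 != -1:
        x = x[:pos2].strip()
    if pos3 != -1:
        x = x[:pos3].strip()
    if pos4 != -1:
        x = x[:pos4].strip()
    # Remove the remaining ASCII and full-width punctuation.
    r1 = '[’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\\]^_`{|}~]+'
    x = re.sub(r1, '', x)
    x = x.strip()
    return x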
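
To make the effect of the trimming step concrete, here is a minimal sketch on a few titles taken from the tables in this notebook. The helper `trim_key`, the `SEPARATORS` tuple, and the reduced punctuation set are assumptions made for illustration. The sketch mirrors the cut-at-first-separator idea rather than reproducing `clean_title` exactly.

```python
import re

# Hypothetical, simplified stand-in for clean_title (illustration only).
SEPARATORS = ('-', '–', ':', '(')

def trim_key(title):
    """Keep the text before the first separator and drop some punctuation."""
    title = str(title)
    # Positions of every separator that occurs in the title.
    cuts = [title.find(sep) for sep in SEPARATORS if title.find(sep) != -1]
    if cuts:
        title = title[:min(cuts)]
    # Reduced punctuation set, assumed for this sketch.
    title = re.sub(r"[!',.?]+", '', title)
    return title.strip()

examples = [
    "Sworkit - Custom Workouts for Exercise & Fitness",  # Apple title
    "Sworkit: Workouts & Fitness Plans",                 # Google title
    "Don't Starve: Shipwrecked",
    "F-Sim Space Shuttle",
]
keys = [trim_key(t) for t in examples]
for title, key in zip(examples, keys):
    print(f"{title!r} -> {key!r}")
```

The two Sworkit titles collapse to the same key, which is the intended behaviour. The last case shows the cost of the approach: a hyphen inside the app name itself, as in "F-Sim Space Shuttle", is also treated as a separator, so the key becomes "F".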

In [6]:
df_google['trim_title'] = df_google['google_title'].apply(clean_title)
df_apple['trim_title'] = df_apple['apple_title'].apply(clean_title)

In [7]:
combine_apps = df_apple.set_index('trim_title').join(
    df_google.set_index('trim_title'), how='inner').reset_index()
combine_apps.shape

(781, 22)
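
The joined row count can exceed the number of distinct apps because an inner join pairs every Apple row with every Google row that shares a key. The following is a small sketch of that effect. The titles and the frames `apple` and `google` are invented toy data, not rows from the real datasets.

```python
import pandas as pd

# Toy data (invented): two games in a series share the key "Saga".
apple = pd.DataFrame({
    'trim_title': ['Saga', 'Saga', 'Solo'],
    'apple_title': ['Saga - One', 'Saga - Two', 'Solo - Chat'],
})
google = pd.DataFrame({
    'trim_title': ['Saga', 'Saga', 'Solo'],
    'google_title': ['Saga - One', 'Saga - Two', 'Solo: Chat App'],
})

# Same join pattern as in the cell above.
joined = apple.set_index('trim_title').join(
    google.set_index('trim_title'), how='inner').reset_index()

# "Saga" yields 2 x 2 = 4 rows, "Solo" yields 1 row.
print(joined)
print(joined.shape)
```

Only two of the four "Saga" rows pair a game with itself, which is why a filtering step on non-unique keys is needed after the join.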

In [8]:
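# duplicated() defaults to keep='first': only the second and later rows of a repeated key are flagged.
is_repeat = combine_apps.duplicated('trim_title')
same_title = combine_apps[is_repeat & (combine_apps['google_title'] == combine_apps['apple_title'])]
same_trim = combine_apps[~is_repeat]
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
combine_apps = pd.concat([same_title, same_trim])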
combine_apps.sample(10)

Unnamed: 0,trim_title,apple_id,apple_title,apple_size,apple_price,apple_reviews,apple_rating,apple_pegi,apple_genre,normed_apple_rating,...,google_title,google_genre,google_rating,google_reviews,google_size,google_price,google_pegi,log_google_reviews,normed_google_rating,z_score_google
74,Bejeweled Classic,479536744,Bejeweled Classic,94.979492,0.0,183259,4.5,Everyone,Games,0.9,...,Bejeweled Classic,Others,4.4,203101,,0.0,Everyone,5.307712,0.88,0.449144
375,Mad Skills Motocross,410229362,Mad Skills Motocross,16.164062,0.99,9341,4.5,Teen,Games,0.9,...,Mad Skills Motocross,Games,4.0,32522,44.0,0.0,Everyone,4.512177,0.8,-0.340561
417,My Talking Pet,662090840,My Talking Pet,48.642578,1.99,9035,4.5,Everyone,Entertainment,0.9,...,My Talking Pet,Entertainment,4.6,6238,,4.99,Everyone,3.795045,0.92,0.843996
173,Dots Co,1100331256,Dots & Co: A New Puzzle Adventure,172.30957,0.0,10693,4.5,Everyone,Games,0.9,...,Dots & Co: A Puzzle Adventure,Others,4.5,81001,85.0,0.0,Everyone,4.90849,0.9,0.64657
583,Sworkit,527219710,Sworkit - Custom Workouts for Exercise & Fitness,136.53418,0.0,16819,5.0,Everyone,Health & Fitness,1.0,...,Sworkit: Workouts & Fitness Plans,Health & Fitness,4.6,109756,78.0,0.0,Everyone,5.040428,0.92,0.843996
44,Angry Birds Rio,420635506,Angry Birds Rio,131.912109,0.0,170843,4.5,Everyone,Games,0.9,...,Angry Birds Rio,Games,4.4,2610526,46.0,0.0,Everyone,6.416728,0.88,0.449144
62,Bad Piggies HD,545229893,Bad Piggies HD,247.994141,0.0,19018,4.5,Everyone,Games,0.9,...,Bad Piggies HD,Others,4.4,764967,69.0,0.0,Everyone,5.883643,0.88,0.449144
63,Badoo,351331194,"Badoo - Meet New People, Chat, Socialize.",150.323242,0.0,34428,4.5,Mature 17+,Social Networking,0.9,...,Badoo - Free Chat & Dating App,Social Networking,4.3,3781770,,0.0,Mature 17+,6.577695,0.86,0.251718
207,Endless Ducker,1117844680,Endless Ducker,152.351562,0.0,639,4.0,Everyone,Games,0.8,...,Endless Ducker,Games,4.5,8193,41.0,0.0,Everyone,3.913443,0.9,0.64657
523,Retro City Rampage DX,1088540036,Retro City Rampage DX,14.888672,4.99,68,4.5,Teen,Games,0.9,...,Retro City Rampage DX,Games,4.7,416,16.0,2.99,Teen,2.619093,0.94,1.041422


In [9]:
combine_apps.shape

(604, 22)
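
One detail of the filter in In [8] is worth noting: `duplicated` uses `keep='first'` by default, so the first joined row of a non-unique key is kept even when its two titles differ. If the stricter rule described earlier is wanted (keep rows with a non-unique key only when the titles are identical), one option is `keep=False`. The sketch below compares the two on invented toy data. The names `strict` and `loose` are hypothetical, and this is not necessarily what the notebook intended.

```python
import pandas as pd

# Toy joined data (invented): "Saga" is a non-unique key, "Solo" is unique.
combined = pd.DataFrame({
    'trim_title':   ['Saga', 'Saga', 'Saga', 'Saga', 'Solo'],
    'apple_title':  ['Saga - One', 'Saga - One', 'Saga - Two', 'Saga - Two', 'Solo - Chat'],
    'google_title': ['Saga - Two', 'Saga - One', 'Saga - One', 'Saga - Two', 'Solo: Chat App'],
})
exact = combined['apple_title'] == combined['google_title']

# Strict variant: keep=False flags every row of a repeated key.
ambiguous = combined.duplicated('trim_title', keep=False)
strict = combined[~ambiguous | exact]

# Variant matching In [8]: the default keep='first' leaves the first row unflagged.
is_repeat = combined.duplicated('trim_title')
loose = pd.concat([combined[is_repeat & exact], combined[~is_repeat]])

print(strict)
print(loose)
```

In this toy case `loose` retains row 0, which pairs "Saga - One" with "Saga - Two", while `strict` drops it. Switching the real pipeline to the strict rule would change the row count reported above, so it should be treated as a design decision rather than a drop-in replacement.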

The difference between each app's two ratings is calculated here as Google minus Apple, first on the z-scored ratings and then on the normalized ratings.

In [10]:
combine_apps['z_score_google_sub_apple'] = combine_apps['z_score_google'] - combine_apps['z_score_apple']
combine_apps.sample(5)

Unnamed: 0,trim_title,apple_id,apple_title,apple_size,apple_price,apple_reviews,apple_rating,apple_pegi,apple_genre,normed_apple_rating,...,google_genre,google_rating,google_reviews,google_size,google_price,google_pegi,log_google_reviews,normed_google_rating,z_score_google,z_score_google_sub_apple
760,YouTube Kids,936971630,YouTube Kids,113.712891,0.0,28560,4.5,Everyone,Entertainment,0.9,...,Entertainment,4.5,470089,,0.0,Everyone,5.67218,0.9,0.64657,0.008527
44,Angry Birds Rio,420635506,Angry Birds Rio,131.912109,0.0,170843,4.5,Everyone,Games,0.9,...,Games,4.4,2610526,46.0,0.0,Everyone,6.416728,0.88,0.449144,-0.188899
405,My Boo,706099830,My Boo - Virtual Pet with Mini Games for Kids,100.272461,0.0,55783,4.5,Everyone,Games,0.9,...,Others,4.4,899748,80.0,0.0,Everyone,5.954121,0.88,0.449144,-0.188899
612,THE KING OF FIGHTERS,507937883,THE KING OF FIGHTERS-i 2012,1140.796875,2.99,228,4.5,Teen,Games,0.9,...,Games,4.4,406511,21.0,0.0,Teen,5.609072,0.88,0.449144,-0.188899
372,MTV,422366403,MTV,108.516602,0.0,5987,2.5,Teen,Entertainment,0.5,...,Entertainment,3.8,35279,15.0,0.0,Teen,4.547516,0.76,-0.735413,1.514667


In [11]:
combine_apps['norm_google_sub_apple'] = combine_apps['normed_google_rating'] - combine_apps['normed_apple_rating']
combine_apps.sample(5)

Unnamed: 0,trim_title,apple_id,apple_title,apple_size,apple_price,apple_reviews,apple_rating,apple_pegi,apple_genre,normed_apple_rating,...,google_rating,google_reviews,google_size,google_price,google_pegi,log_google_reviews,normed_google_rating,z_score_google,z_score_google_sub_apple,norm_google_sub_apple
258,FollowMeter for Instagram,651309421,FollowMeter for Instagram - Followers Tracking,11.271484,0.0,11976,4.5,Everyone,Social Networking,0.9,...,4.4,90082,8.8,0.0,Everyone,4.954638,0.88,0.449144,-0.188899,-0.02
375,Mad Skills Motocross,410229362,Mad Skills Motocross,16.164062,0.99,9341,4.5,Teen,Games,0.9,...,4.0,32522,44.0,0.0,Everyone,4.512177,0.8,-0.340561,-0.978604,-0.1
573,Storm Shield,526831380,Storm Shield,25.273438,2.99,2516,4.5,Teen,Others,0.9,...,3.5,2000,14.0,0.0,Everyone,3.30103,0.7,-1.327691,-1.965734,-0.2
614,Talking Tom Bubble Shooter,1037420924,Talking Tom Bubble Shooter,182.532227,0.0,2659,4.5,Everyone,Games,0.9,...,4.4,687136,54.0,0.0,Everyone,5.837043,0.88,0.449144,-0.188899,-0.02
752,Yahoo Sports,286058814,"Yahoo Sports - Teams, Scores, News & Highlights",124.53418,0.0,137951,4.0,Everyone,Health & Fitness,0.8,...,4.5,32386,19.0,0.0,Everyone,4.510357,0.9,0.64657,0.730557,0.1
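
As a quick check of the sign convention, the sketch below recomputes both differences for three apps, using values copied from the tables above. The frame name `sample` is an assumption for illustration.

```python
import pandas as pd

# Values copied from the output tables in this notebook.
sample = pd.DataFrame({
    'trim_title': ['YouTube Kids', 'Angry Birds Rio', 'Yahoo Sports'],
    'normed_apple_rating': [0.9, 0.9, 0.8],
    'normed_google_rating': [0.9, 0.88, 0.9],
    'z_score_apple': [0.638043, 0.638043, -0.083987],
    'z_score_google': [0.64657, 0.449144, 0.64657],
})

# Google minus Apple: positive values mean the app scores higher on Google.
sample['z_score_google_sub_apple'] = sample['z_score_google'] - sample['z_score_apple']
sample['norm_google_sub_apple'] = sample['normed_google_rating'] - sample['normed_apple_rating']

print(sample[['trim_title', 'z_score_google_sub_apple', 'norm_google_sub_apple']].round(6))
```

YouTube Kids has the same star rating (4.5) in both stores, so its normalized difference is zero, yet its z-score difference is slightly positive. This is presumably because each z-score was standardized against its own store's rating distribution in the earlier cleaning notebooks.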


We keep only the common features for the new dataset. Genre and PEGI rating are taken from the Apple columns, and price is taken from the Google column.

In [12]:
use_cols = [
    'apple_id', 'trim_title', 'apple_title', 'apple_genre', 'apple_rating',
    'apple_reviews', 'apple_size', 'apple_pegi',
    'normed_apple_rating', 'google_title', 'google_rating',
    'google_reviews', 'google_size', 'google_price',
    'normed_google_rating', 'log_google_reviews', 'log_apple_reviews',
    'z_score_google', 'z_score_apple', 'z_score_google_sub_apple', 'norm_google_sub_apple'
]

df = combine_apps[use_cols].copy()
df.head()

Unnamed: 0,apple_id,trim_title,apple_title,apple_genre,apple_rating,apple_reviews,apple_size,apple_pegi,normed_apple_rating,google_title,...,google_reviews,google_size,google_price,normed_google_rating,log_google_reviews,log_apple_reviews,z_score_google,z_score_apple,z_score_google_sub_apple,norm_google_sub_apple
107,898968647,Call of Duty®,Call of Duty®: Heroes,Games,4.5,179416,201.075195,Teen,0.9,Call of Duty®: Heroes,...,1604146,57.0,0.0,0.88,6.205244,5.253861,0.449144,0.638043,-0.188899,-0.02
170,1147297267,Dont Starve,Don't Starve: Shipwrecked,Games,3.5,495,604.341797,Everyone 10+,0.7,Don't Starve: Shipwrecked,...,1468,4.9,4.99,0.82,3.166726,2.694605,-0.143135,-0.806018,0.662884,0.12
223,352670055,F,F-Sim Space Shuttle,Games,4.5,6403,72.855469,Everyone,0.9,F-Sim Space Shuttle,...,5427,,4.99,0.88,3.73456,3.806384,0.449144,0.638043,-0.188899,-0.02
301,763692274,Grand Theft Auto,Grand Theft Auto: San Andreas,Games,4.0,32533,1964.96582,Mature 17+,0.8,Grand Theft Auto: San Andreas,...,348962,26.0,6.99,0.88,5.542778,4.512324,0.449144,-0.083987,0.533131,0.08
355,771989093,LEGO® Friends,LEGO® Friends,Games,4.0,400,730.941406,Everyone,0.8,LEGO® Friends,...,854,6.9,4.99,0.88,2.931458,2.60206,0.449144,-0.083987,0.533131,0.08


In [13]:
# Rename by name rather than by position so the mapping is explicit.
df = df.rename(columns={
    'apple_genre': 'genre',
    'apple_pegi': 'pegi',
    'google_price': 'price',
    'z_score_google': 'z_score_google_rating',
    'z_score_apple': 'z_score_apple_rating',
})
df.shape

(604, 21)

In [14]:
df.to_csv(save_path, index=False)
df.shape

(604, 21)