# Finding business opportunities - Analyzing Google Play and App Store data
> My goal for this project is to analyze data from the App Store and Google Play to understand what types of `free apps` are likely to attract more users, in order to help developers make data-driven decisions about the kind of apps they build.

As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play. Collecting data for over four million apps requires a significant amount of time and money, so I will analyze a sample of the data instead.

- The first data set contains data about approximately ten thousand Android apps from Google Play. You can download it directly from this [link](https://www.kaggle.com/datasets/lava18/google-play-store-apps?sort=votes).
- The second data set contains data about approximately seven thousand iOS apps from the App Store. You can download it directly from this [link](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps?sort=votes).


I will start by opening the two data sets and then continue by exploring the data.

In [1]:
from csv import reader

## iOS data ##
opened_file = open("AppleStore.csv")
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

## Android data ##
opened_file = open("googleplaystore.csv")
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]
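
The cell above leaves both files open. As a minimal sketch, the same loading step could be wrapped in a small helper that closes the file automatically. The name `load_csv` and the `utf8` default are my own assumptions, not part of the original notebook.

```python
from csv import reader


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file and close the file afterwards."""
    # newline="" is what the csv module documentation recommends for files
    # passed to csv.reader.
    with open(path, encoding=encoding, newline="") as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]
```

With this helper, the two data sets could be loaded as `ios_header, ios = load_csv("AppleStore.csv")` and `android_header, android = load_csv("googleplaystore.csv")`.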





In [2]:
# Define the explore_data function to easily explore the data
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(ios_header)
print("\n")
print(android_header)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [5]:
explore_data(android, 10472, 10473, rows_and_columns=True)
explore_data(ios, 0, 2, rows_and_columns=True)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10841
Number of columns: 13
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


Number of rows: 7197
Number of columns: 17


In [6]:
# Deleting the corrupt row
del android[10472]
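
The row deleted here has 12 values while the header has 13: the `Category` value is missing, so every later value is shifted one column to the left. Below is a minimal sketch of how such rows could be found programmatically, assuming the check runs before the `del` statement. `find_malformed_rows` is a hypothetical helper and the sample rows are made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Made-up sample: the second row is missing its 'Category' value.
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["Some App", "TOOLS", "4.1"],
    ["Shifted App", "1.9"],
]
print(find_malformed_rows(sample_header, sample_rows))
# prints [(1, ['Shifted App', '1.9'])]
```

On the real data this would be called as `find_malformed_rows(android_header, android)`.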

I deleted the incorrect row from the Android data set; I found out about it in a [Kaggle discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015?sort=votes). Now I am going to check whether there are any duplicates. If so, I will keep only the entry with the highest number of reviews, since a lower review count indicates older data.

In [7]:
# Making list for duplicate and unique apps
duplicate_apps = []
unique_apps = []

# Determining number of duplicate and unique apps for android
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print(len(duplicate_apps))
print(len(unique_apps))


1181
9659


In [8]:
# Making list for duplicate and unique apps
duplicate_apps = []
unique_apps = []

# Determining number of duplicate and unique apps for ios
# (app[1] is the 'id' column, so duplicates are detected by app id)
for app in ios:
    app_id = app[1]
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else:
        unique_apps.append(app_id)
print(len(duplicate_apps))
print(len(unique_apps))


0
7197


The Android data set has `1181 duplicate` apps and the iOS data set has `no duplicate` apps.
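
The two cells above test membership in a list, which rescans the list for every row. A minimal sketch of the same count using a set, which should give the same numbers with far less work on larger data; `count_duplicates` is a hypothetical helper name.

```python
def count_duplicates(rows, key_index):
    """Return (number of duplicate rows, number of unique keys)."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = row[key_index]
        if key in seen:
            # Every repeat after the first occurrence counts as a duplicate.
            duplicates += 1
        else:
            seen.add(key)
    return duplicates, len(seen)
```

For example, `count_duplicates(android, 0)` would count duplicates by app name in the Android data set.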

In [9]:
# Map each app name to its highest review count
reviews_max = {}


for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews


print(len(reviews_max))

9659


In the cell above I built the `reviews_max` dictionary to record the highest review count per app. Next I build a list of lists, `android_clean`, containing the entry with the most reviews for each app. I also keep an `already_added` list to track the apps that have already been added.

In [10]:
# Create 2 lists
android_clean = []
already_added = []

# Adding unique apps to android_clean list
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
print(android_clean[:1])
print(len(android_clean))
    
    

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']]
9659
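
The two passes above (first `reviews_max`, then `android_clean`) can also be sketched as a single pass that stores the best row per app directly. This is an alternative I am suggesting, not the method used in the notebook. `keep_most_reviewed` is a hypothetical name, and the default indexes assume the Android column layout shown earlier.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace only on a strictly higher count, so on ties the first
        # row seen is kept.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())
```

The kept rows should match the two-pass version, although their order can differ: here they are ordered by the first appearance of each app name.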


Now I am going to check whether there are any `non-English apps`. For that I wrote the `char_check` function.

In [11]:
# Define a function that returns False when a string has more than 3 non-ASCII characters
def char_check(string):
    non_ascii = 0
    for x in string:
        if ord(x) > 127:
            non_ascii += 1
            if non_ascii > 3:
                return False
    return True
char_check('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
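
The threshold of three non-ASCII characters matters because English app names can contain an emoji or a trademark sign. A short sketch that repeats `char_check` so it runs on its own; the first three sample names are made up for illustration.

```python
def char_check(string):
    """Return False if the string has more than 3 non-ASCII characters."""
    non_ascii = 0
    for ch in string:
        if ord(ch) > 127:
            non_ascii += 1
            if non_ascii > 3:
                return False
    return True


print(char_check("Instagram"))                      # True: no non-ASCII characters
print(char_check("Docs To Go™ Free Office Suite"))  # True: one non-ASCII character
print(char_check("Instachat 😜"))                   # True: one non-ASCII character
print(char_check("爱奇艺PPS -《欢乐颂2》电视剧热播"))  # False: more than three
```

The check is only a heuristic: a non-English name written with ASCII letters still passes, and an English name with four or more emoji is rejected.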

In [12]:
# Creating android and ios lists for only english apps
android_char_check = []
ios_char_check = []

# Keep only the apps whose name passes char_check
for app in android_clean:
    if char_check(app[0]):
        android_char_check.append(app)
for app in ios:
    if char_check(app[2]):
        ios_char_check.append(app)
        
        
print(android_char_check[0], "\n")
print(ios_char_check[0], "\n")
print(len(android_char_check))
print(len(ios_char_check))

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'] 

9614
6183


Lastly, I am going to isolate all free apps.

In [13]:
# Create empty android and ios free apps lists
android_free = []
ios_free = []

# Create android and ios free apps lists
for app in android_char_check:
    if app[6] == "Free":
        android_free.append(app)
for app in ios_char_check:
    if app[5] == "0":
        ios_free.append(app)

print(len(android_free))
print(len(ios_free))



8863
3222
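
The iOS filter above compares the price with the exact string `"0"`. As a minimal sketch, the price could instead be parsed as a number, which would also accept spellings such as `"0.0"` or `"$0"` if a data set used them. `is_free` is a hypothetical helper, and it assumes every price is a plain number with an optional dollar sign.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0' means free."""
    cleaned = price.replace("$", "").strip()
    return float(cleaned) == 0.0
```

For example, `[app for app in ios_char_check if is_free(app[5])]` would build the list of free iOS apps.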


Because my end goal is to publish the app on both Google Play and the App Store, I need to find app profiles that are successful in both markets. For instance, a profile that works well in both markets might be a productivity app, a game, or a social networking app.

Let's begin the analysis by determining the most common genres for each market. For this, I will need to build frequency tables for a few columns in my datasets.

In [14]:
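# Build a frequency table for one column
def freq_table(dataset, index):
    # Create the dict inside the function so that counts from
    # earlier calls do not carry over into later tables
    freq_t = {}
    for row in dataset:
        data = row[index]
        if data in freq_t:
            freq_t[data] += 1
        else:
            freq_t[data] = 1
    return freq_t

# Define a function to display a frequency table, sorted by count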
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

# Print a few frequency tables
display_table(android_free, 9)
print("\n\n")
display_table(android_free, 1)
print("\n\n")
display_table(ios_free, 12)


Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent
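
Raw counts are hard to compare between two markets of different size, so percentages may be easier to read. A minimal sketch using `collections.Counter` from the standard library; `freq_table_pct` and `display_table_pct` are hypothetical names, not functions from the notebook.

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Return {value: share of rows in percent} for one column."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: count / total * 100 for key, count in counts.items()}


def display_table_pct(dataset, index):
    """Print the percentage table, largest share first."""
    table = freq_table_pct(dataset, index)
    for key, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{key} : {pct:.2f}")
```

Because a new `Counter` is built on every call, consecutive calls for different columns cannot share counts.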

In [20]:
freq_t = {}
def freq_table(dataset, index):
    for row in dataset:
        data = row[index]
        if data in freq_t:
            freq_t[data] += 1
        else:
            freq_t[data] = 1
    return freq_t


ios_prime_genre_freq = freq_table(ios_free, 12)

# Print number of reviews / number of apps in each genre
best_genre = 0
for genre in ios_prime_genre_freq:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[12]
        if genre_app == genre:
            total += round(float(app[6]), ndigits=0)
            len_genre += 1
    avrg = round(total/len_genre, ndigits=0)
    block = 50
    genre_len = len(genre)
    avrg_len = len(str(avrg))
    block_i = block - (genre_len + avrg_len)
    print(genre, ":", " "*block_i ,avrg)
    if best_genre < avrg:
        best_genre = avrg
print("Highest genre average is: " + str(best_genre))
    

Productivity :                                 21028.0
Weather :                                      52280.0
Shopping :                                     26920.0
Reference :                                    74942.0
Finance :                                      31468.0
Music :                                        57327.0
Utilities :                                    18684.0
Travel :                                       28244.0
Social Networking :                            71548.0
Sports :                                       23009.0
Health & Fitness :                             23298.0
Games :                                        22789.0
Food & Drink :                                 33334.0
News :                                         21248.0
Book :                                         39758.0
Photo & Video :                                28442.0
Entertainment :                                14030.0
Business :                                      7491.0
Lifestyle 
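
The cell above loops over all apps once per genre. A minimal sketch of the same per-genre average computed in a single pass, which also returns the genre names sorted by average so the best one can be read off directly. `average_by_group` is a hypothetical helper.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return [(group, average value)] sorted from highest to lowest average."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
```

For the iOS data this would be called as `average_by_group(ios_free, 12, 6)`, using the same `prime_genre` and `rating_count_tot` columns as the cell above.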

In [19]:
freq_t = {}
def freq_table(dataset, index):
    for row in dataset:
        data = row[index]
        if data in freq_t:
            freq_t[data] += 1
        else:
            freq_t[data] = 1
    return freq_t
android_category_freq = freq_table(android_free, 1)

# Print average number of installs per app in each category
best_category = 0
for category in android_category_freq:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace("+", "")
            installs = installs.replace(",", "")
            installs = round(float(installs), ndigits=0)
            total += installs
            len_category += 1
    avrg = round(total/len_category, ndigits=0)
    block = 50
    category_len = len(category)
    avrg_len = len(str(avrg))
    block_i = block - (category_len + avrg_len)
    print(category, ":", " "*block_i, avrg)
    if best_category < avrg:
        best_category = avrg
print(f"Best category is: {best_category}")
    

ART_AND_DESIGN :                             1986335.0
AUTO_AND_VEHICLES :                           647318.0
BEAUTY :                                      513152.0
BOOKS_AND_REFERENCE :                        8767812.0
BUSINESS :                                   1712290.0
COMICS :                                      817657.0
COMMUNICATION :                             38456119.0
DATING :                                      854029.0
EDUCATION :                                  1833495.0
ENTERTAINMENT :                             11640706.0
EVENTS :                                      253542.0
FINANCE :                                    1387692.0
FOOD_AND_DRINK :                             1924898.0
HEALTH_AND_FITNESS :                         4188822.0
HOUSE_AND_HOME :                             1331541.0
LIBRARIES_AND_DEMO :                          638504.0
LIFESTYLE :                                  1437816.0
GAME :                                      15588016.0
FAMILY :  
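
The `Installs` column stores values such as `'1,000+'`, so the cell above strips the comma and the plus sign before converting. A minimal sketch of that step as a reusable function; `parse_installs` is a hypothetical name.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000+' to the integer 1000."""
    return int(text.replace(",", "").replace("+", ""))


print(parse_installs("1,000+"))   # 1000
print(parse_installs("10,000+"))  # 10000
```

Since the plus sign means "at least this many", the averages computed from these numbers are best read as lower bounds rather than exact install counts.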

In [18]:
freq_t = {}
def freq_table(dataset, index):
    for row in dataset:
        data = row[index]
        if data in freq_t:
            freq_t[data] += 1
        else:
            freq_t[data] = 1
    return freq_t
android_category_freq = freq_table(android_free, 9)

# Print average number of installs per app in each genre
best_category = 0
for category in android_category_freq:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[9]
        if category_app == category:
            installs = app[5]
            installs = installs.replace("+", "")
            installs = installs.replace(",", "")
            installs = round(float(installs), ndigits=0)
            total += installs
            len_category += 1
    avrg = round(total/len_category, ndigits=0)
    block = 50
    category_len = len(category)
    avrg_len = len(str(avrg))
    block_i = block - (category_len + avrg_len)
    print(category, ":", " "*block_i, avrg)
    if best_category < avrg:
        best_category = avrg
print(f"Best category is: {best_category}")

Art & Design :                               2122851.0
Art & Design;Creativity :                     285000.0
Auto & Vehicles :                             647318.0
Beauty :                                      513152.0
Books & Reference :                          8767812.0
Business :                                   1712290.0
Comics :                                      831873.0
Comics;Creativity :                            50000.0
Communication :                             38456119.0
Dating :                                      854029.0
Education :                                   550185.0
Education;Creativity :                       2875000.0
Education;Education :                        4759517.0
Education;Pretend Play :                     1800000.0
Education;Brain Games :                      5333333.0
Entertainment :                              5602793.0
Entertainment;Brain Games :                  3314286.0
Entertainment;Creativity :                   4000000.0
Entertainm

## Conclusion
By average number of user ratings per app, the best genre on the App Store is `Navigation`; by average number of installs per app, the best category on Google Play is `communication`. The second best on the App Store is `social networking`, and the corresponding Google Play category, labeled `social`, is in third place. Therefore, it seems that building a social networking platform would be the best bet.