# Exploring mobile apps in the App Store and Google Play Market
This project looks for the types of mobile apps that are most popular among users. We'll examine the two largest app marketplaces: the Play Market for Android and the App Store for Apple devices.

The aims of this first project are:
* to practice the kind of work a data scientist does on real data
* to explore the large body of data that is available
* to find patterns that could be useful and profitable for a business
* to demonstrate my basic professional knowledge
* to write a clear, good-looking article that shows what a data scientist's work involves


In [2]:
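from csv import reader

def open_csv(path):
    # Read a CSV file into a list of rows; the file is closed automatically.
    # An explicit encoding avoids platform-dependent decoding of non-ASCII app names.
    with open(path, encoding='utf8', newline='') as opened_file:
        return list(reader(opened_file))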

app_store = open_csv('AppleStore.csv')

play_market = open_csv('googleplaystore.csv')

def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row, '\n')
        
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))
        

print('Some apps from the App Store')
explore_data(app_store[1:], 0, 5, True)
print('\n\nSome apps from the Play Market')
explore_data(play_market[1:], 0, 5, True)

print('\n\nColumn names of the App Store dataset')
print(app_store[0])
print('\nColumn names of the Play Market dataset')
print(play_market[0])
            
        

Some apps from the App Store
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] 

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'] 

['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'] 

['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1'] 

Number of rows:  7197
Number of columns:  16


Some apps from the Play Market
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and 

Descriptions of each column are available in the primary sources: [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) and [Play Market](https://www.kaggle.com/lava18/google-play-store-apps)
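
The cells in this notebook address columns by number (for example `row[5]`). As a small sketch of how the column descriptions could be used instead, the hypothetical helper `column_index` below looks up a position by header name. The abbreviated header and the column names in it are assumptions made for illustration; check them against the header row the notebook prints.

```python
# Hypothetical helper (not part of the original notebook):
# look up a column position by its header name.
def column_index(header, name):
    """Return the position of `name` in the header row."""
    try:
        return header.index(name)
    except ValueError:
        raise KeyError(f'column {name!r} not found in header') from None

# Abbreviated example header and row; the column names are assumed for illustration.
sample_header = ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot']
sample_row = ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676']

name_col = column_index(sample_header, 'track_name')
price_col = column_index(sample_header, 'price')
print(sample_row[name_col], sample_row[price_col])
```

Named lookups make later cells easier to read and fail loudly if a column is missing.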

# Deleting wrong data

In [3]:
# Check whether any row has a different number of columns than the header
def check_mistakes(dataset):
    for index, row in enumerate(dataset):
        if len(row) != len(dataset[0]):
            print("Mistake at ", index, " row")
            print(row)
            print("Wrong length: ", len(row))


check_mistakes(play_market)

Mistake at  10473  row
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Wrong length:  12


One row is missing its "Category" value, so it has 12 fields instead of 13 and every later value is shifted one column to the left. We remove this row to keep the data clean.

In [4]:
del play_market[10473]
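
Deleting by a hard-coded index works once, but running the cell a second time would remove a valid row. As an alternative sketch, malformed rows can be filtered out by length. The function name `split_by_length` and the toy rows are hypothetical and only illustrate the idea.

```python
# Hypothetical alternative to deleting by index: keep only rows whose
# length matches the header, and collect the rest for inspection.
def split_by_length(dataset):
    header = dataset[0]
    expected = len(header)
    good = [header]
    bad = []
    for index, row in enumerate(dataset[1:], start=1):
        if len(row) == expected:
            good.append(row)
        else:
            bad.append((index, row))
    return good, bad

# Toy data, made up for illustration.
toy = [
    ['App', 'Category', 'Rating'],
    ['Alpha', 'TOOLS', '4.1'],
    ['Beta', '1.9'],            # the category field is missing
    ['Gamma', 'GAME', '4.5'],
]

clean, rejected = split_by_length(toy)
print(len(clean) - 1, 'rows kept')
print(rejected)
```

Because the filter builds a new list, it can be re-run safely without removing anything else.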

In [5]:
# Run the same column-count check on the App Store dataset
check_mistakes(app_store)

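The check prints nothing, so every row in the App Store dataset has the expected number of columns.

Next, we check for duplicate rows. When an app appears more than once, we keep only the most recent entry, which we take to be the one with the highest value in the number-of-reviews column.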

# Removing duplicate entries

In [6]:
def duplicate_check(dataset):
    # Compares the first column of each row (the app name in the Play Market data)
    repeating_apps = []
    unique_apps = set()  # set membership tests are faster than list lookups

    for row in dataset[1:]:  # skip the header row
        name = row[0]
        if name in unique_apps:
            repeating_apps.append(name)
        else:
            unique_apps.add(name)

    print('Number of repeating apps: ', len(repeating_apps))
    print('\nExamples: ', repeating_apps[:15], '...')
    print('\nLength of set should be: ', len(dataset[1:]) - len(repeating_apps))
    
    
duplicate_check(play_market)

Number of repeating apps:  1181

Examples:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software'] ...

Length of set should be:  9659


In [7]:
reviews_max = {}
for row in play_market[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max), 'rows')

9659 rows


In [8]:
android_clean = []
already_added = []

for row in play_market[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
print(len(android_clean), 'rows (apps) total left')

9659 rows (apps) total left


We now have a new list of lists, `android_clean`, which contains each app exactly once and no malformed rows. Unlike `play_market`, it has no header row.
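
The two-step approach above (first collecting the maximum review count per name, then filtering the rows) can also be condensed into a single pass. The sketch below is one possible way to do it, not the method used in this notebook; `keep_most_reviewed` and the toy rows are hypothetical. Note that it orders the result by the first appearance of each name, which may differ from the order produced by the cells above.

```python
# Hypothetical single-pass variant: remember the best row seen so far per name.
def keep_most_reviewed(rows, name_col, reviews_col):
    best = {}
    for row in rows:
        name = row[name_col]
        reviews = float(row[reviews_col])
        # Replace the stored row only when this one has strictly more reviews,
        # so the first row wins in case of a tie.
        if name not in best or reviews > float(best[name][reviews_col]):
            best[name] = row
    return list(best.values())

# Toy rows (name, category, rating, reviews); the values are made up.
toy_rows = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '161'],
]

unique_rows = keep_most_reviewed(toy_rows, name_col=0, reviews_col=3)
for row in unique_rows:
    print(row)
```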

In [9]:
duplicate_check(app_store)

Number of repeating apps:  0

Examples:  [] ...

Length of set should be:  7197


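There are no duplicates in the App Store dataset. Note that `duplicate_check` compares column 0, which in this dataset holds the app id rather than the app name.

Next, we'll exclude apps with non-English names, because our target audience is English speakers.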

# Removing non-English apps

In [10]:
def english_check(word):
    # Count characters outside the ASCII range (code points above 127)
    non_ascii = 0
    for letter in word:
        if ord(letter) > 127:
            non_ascii += 1

    # Treat a name as English if it has fewer than three such characters
    return non_ascii < 3

In [11]:
print(english_check('Instagram'))
print(english_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_check('™ Free Office Suite'))
print(english_check('😜😜😜😜'))

True
False
True
False
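
The cut-off of three non-ASCII characters is a judgment call. As a sketch of how the tolerance could be made explicit and adjustable, the hypothetical function `looks_english` below takes the threshold as a parameter; the default of 2 mirrors the "fewer than three" rule used above.

```python
# Hypothetical variant of the check with an adjustable tolerance.
def looks_english(name, max_non_ascii=2):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

for name in ['Instagram', '™ Free Office Suite', '😜😜😜😜']:
    print(name, looks_english(name))

# A stricter setting rejects any name containing a non-ASCII character.
print(looks_english('™ Free Office Suite', max_non_ascii=0))
```

This remains a heuristic: a non-English name written entirely in ASCII letters still passes the check.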


In [12]:
eng_clean_android = []
# Note: android_clean has no header row, so the [1:] slice skips its first app
for row in android_clean[1:]:
    if english_check(row[0]):
        eng_clean_android.append(row)
        
print(len(eng_clean_android), 'english apps for Android')

9596 english apps for Android


In [13]:
eng_clean_ios = []
for row in app_store[1:]:
    if english_check(row[1]):
        eng_clean_ios.append(row)
        
print(len(eng_clean_ios), 'english apps for iOS')
        
    

6155 english apps for iOS


The last data-preparation step is to isolate the free apps, since our analysis covers only those.

# Isolating the Free apps

In [14]:
final_free_android = []
test_1 = []

# Note: eng_clean_android has no header row, so the [1:] slice skips its first app
for row in eng_clean_android[1:]:
    if row[7] == '0':        # Price column
        final_free_android.append(row)
    if row[6] == 'Free':     # Type column
        test_1.append(row)

    # Cross-check: a zero price should always come with the 'Free' type
    if row[7] == '0' and row[6] != 'Free':
        print(row)
        print('One string has NaN instead of Free, but it is the only mistake in this, so we confirm data')

print(len(final_free_android), 'number of free android apps')
print(len(test_1), 'testing number, should be equal')

['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']
One string has NaN instead of Free, but it is the only mistake in this, so we confirm data
8846 number of free android apps
8845 testing number, should be equal


In [15]:
final_free_ios = []

# Note: eng_clean_ios has no header row, so the [1:] slice skips its first app
for row in eng_clean_ios[1:]:
    if row[4] == '0.0':
        final_free_ios.append(row)
        
print(len(final_free_ios), 'number of free ios apps')


3202 number of free ios apps
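
Both cells above compare the price as text (`'0'` for Android, `'0.0'` for iOS). A numeric comparison is one way to avoid depending on the exact spelling. The sketch below is hypothetical: `is_free` is not part of the notebook, and the `'$4.99'` format for paid apps is an assumption about the data.

```python
# Hypothetical helper: decide whether a price string means "free".
def is_free(price_text):
    # Assumption: paid prices may carry a leading dollar sign, e.g. '$4.99'.
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that cannot be parsed as a number is treated as not free.
        return False

for price in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```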


--------
The business plan for the project is as follows:
* we attract users with a free app
* we show in-app ads to earn revenue
* the more users we have, the higher the revenue and profit
* to reach more users, we look for the most popular kinds of apps

The next steps:
1. Build a minimal Android version and publish it on Google Play
2. If the response is good, develop it further
3. If it is profitable after 6 months, build an iOS version

So we have to find a category that is worthwhile on both markets.

Now let's build category frequency tables and display them in descending order for analysis.

# Most common apps by genre

In [16]:
def freq_table(dataset, index):
    freq_dictionary = {}
    # Note: [1:] assumes a header row; the filtered lists passed in below have none,
    # so their first app is not counted
    for row in dataset[1:]:
        category = row[index]
        if category in freq_dictionary:
            freq_dictionary[category] += 1
        else:
            freq_dictionary[category] = 1

    return freq_dictionary

In [27]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
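
Raw counts are hard to compare between two stores of different sizes. As a sketch, the hypothetical function `freq_percentages` below uses `collections.Counter` from the standard library to return each value's count together with its share of the total. The toy rows are made up for illustration.

```python
from collections import Counter

# Hypothetical helper: counts and percentage shares, most common first.
def freq_percentages(rows, index):
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count, 100 * count / total)
            for value, count in counts.most_common()]

# Toy rows (name, genre), made up for illustration.
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

shares = freq_percentages(toy_rows, 1)
for value, count, pct in shares:
    print(f'{value} : {count} ({pct:.1f}%)')
```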

In [18]:
print('Prime genres of App Store:')
display_table(final_free_ios, 11)

print('\nGenres of Play Market:')
display_table(final_free_android, 9)
print('\nCategory of Play Market:')
display_table(final_free_android, 1)

Prime genres of App Store:
Games : 1866
Entertainment : 251
Photo & Video : 159
Education : 118
Social Networking : 105
Shopping : 83
Utilities : 79
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 50
News : 43
Travel : 40
Finance : 35
Weather : 28
Food & Drink : 26
Reference : 17
Business : 17
Book : 12
Navigation : 6
Medical : 6
Catalogs : 4

Genres of Play Market:
Tools : 747
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 343
Finance : 328
Medical : 313
Sports : 306
Personalization : 294
Communication : 286
Action : 274
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 189
Simulation : 181
Dating : 165
Arcade : 163
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 123
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 71
Weather : 70
Events : 63
A

The frequency tables show that the App Store is dominated by apps designed for fun (Games alone account for 1866 entries, far ahead of Entertainment with 251), while the Play Market has a more balanced mix of fun and practical apps.

We can also estimate which genres are most popular by computing the average number of installs or ratings per app in each genre.

# Most Popular Apps by Genre on the App Store

In [26]:
prime_genres_freq = freq_table(final_free_ios, 11)

for genre in prime_genres_freq:
    total = 0
    len_genre = 0
    
    for row in final_free_ios:
        genre_app = row[11]
        if genre_app == genre:
            ratings_num = float(row[5])
            total += ratings_num
            len_genre += 1
            
    average_ratings = total / len_genre
    print(genre, ' : ', average_ratings)
    
print('\nThe Navigation genre has the highest average number of ratings, which is a bit surprising')
print('Reference also ranks high')

Games  :  22886.36709539121
Music  :  57326.530303030304
Social Networking  :  43899.514285714286
Reference  :  79350.4705882353
Health & Fitness  :  23298.015384615384
Weather  :  52279.892857142855
Utilities  :  19156.493670886077
Travel  :  28243.8
Shopping  :  27230.734939759037
News  :  21248.023255813954
Navigation  :  86090.33333333333
Lifestyle  :  16815.48
Photo & Video  :  28441.54375
Entertainment  :  14195.358565737051
Food & Drink  :  33333.92307692308
Sports  :  23008.898550724636
Book  :  46384.916666666664
Finance  :  32367.02857142857
Education  :  7003.983050847458
Productivity  :  21028.410714285714
Business  :  7491.117647058823
Catalogs  :  4004.0
Medical  :  612.0

The Navigation genre has the highest average number of ratings, which is a bit surprising
Reference also ranks high
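
The cell above scans the whole dataset once per genre. As a sketch of an alternative, the averages can be accumulated in a single pass and sorted so the leaders appear first. The function `average_by_group` and the toy rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical helper: per-group averages in one pass, highest first.
def average_by_group(rows, group_col, value_col):
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_col]
        totals[group] += float(row[value_col])
        counts[group] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Toy rows (name, genre, number of ratings), made up for illustration.
toy_rows = [['a', 'Games', '100'], ['b', 'Games', '300'], ['c', 'Music', '500']]

ranking = average_by_group(toy_rows, group_col=1, value_col=2)
for genre, average in ranking:
    print(genre, ':', average)
```

A mean is sensitive to a few very large apps, so it is worth looking at the individual apps behind a high average before drawing conclusions.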


# Most Popular Apps by Genre on Google Play

In [20]:
category_freq = freq_table(final_free_android, 1)

for category in category_freq:
    total = 0
    len_category = 0
    
    for row in final_free_android:
        category_app = row[1]
        if category_app == category:
            installs_num = (row[5]).replace(',', '')
            installs_num = float(installs_num.replace('+', ''))
            total += installs_num
            len_category += 1
            
    average_installs = total / len_category
    print(category, ' : ', average_installs)
    
print('\nThe Communication category has the highest average number of installs!')
print('Entertainment, Game, Social, Photography, Travel, Tools, Productivity and Video players also rank high')

ART_AND_DESIGN  :  1967474.5454545454
AUTO_AND_VEHICLES  :  647317.8170731707
BEAUTY  :  513151.88679245283
BOOKS_AND_REFERENCE  :  8814199.78835979
BUSINESS  :  1712290.1474201474
COMICS  :  832613.8888888889
COMMUNICATION  :  38590581.08741259
DATING  :  854028.8303030303
EDUCATION  :  1833495.145631068
ENTERTAINMENT  :  11640705.88235294
EVENTS  :  253542.22222222222
FINANCE  :  1387692.475609756
FOOD_AND_DRINK  :  1924897.7363636363
HEALTH_AND_FITNESS  :  4188821.9853479853
HOUSE_AND_HOME  :  1360598.042253521
LIBRARIES_AND_DEMO  :  638503.734939759
LIFESTYLE  :  1446158.2238372094
GAME  :  15544014.51048951
FAMILY  :  3695641.8198090694
MEDICAL  :  120550.61980830671
SOCIAL  :  23253652.127118643
SHOPPING  :  7036877.311557789
PHOTOGRAPHY  :  17840110.40229885
SPORTS  :  3650602.276666667
TRAVEL_AND_LOCAL  :  13984077.710144928
TOOLS  :  10830251.970588235
PERSONALIZATION  :  5201482.6122448975
PRODUCTIVITY  :  16787331.344927534
PARENTING  :  542603.6206896552
WEATHER  :  5145550
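
The install counts in the Play Market data are text such as `'10,000+'`, which is why the cell above strips the commas and the plus sign before converting. A small sketch of that conversion as a reusable function follows; `parse_installs` is a hypothetical name.

```python
# Hypothetical helper: convert an installs string such as '10,000+' to an integer.
def parse_installs(text):
    return int(text.replace(',', '').replace('+', ''))

for raw in ['10,000+', '1,000+', '0']:
    print(raw, '->', parse_installs(raw))
```

Values like `'10,000+'` are open-ended buckets meaning "at least 10,000", so averages computed from them are rough lower-bound estimates rather than exact figures.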

In [24]:
genre_freq = freq_table(final_free_android, 9)

for genre in genre_freq:
    total = 0
    len_genre = 0
    
    for row in final_free_android:
        genre_app = row[9]
        if genre_app == genre:
            installs_num = (row[5]).replace(',', '')
            installs_num = float(installs_num.replace('+', ''))
            total += installs_num
            len_genre += 1
            
    average_installs = total / len_genre
    print(genre, ' : ', average_installs)
    
print('\nThe Social group would be a good choice')

Art & Design;Creativity  :  285000.0
Art & Design  :  2107864.705882353
Auto & Vehicles  :  647317.8170731707
Beauty  :  513151.88679245283
Books & Reference  :  8814199.78835979
Business  :  1712290.1474201474
Comics  :  847380.1886792453
Comics;Creativity  :  50000.0
Communication  :  38590581.08741259
Dating  :  854028.8303030303
Education  :  550185.4430379746
Education;Creativity  :  2875000.0
Education;Education  :  4759517.0
Education;Pretend Play  :  1800000.0
Education;Brain Games  :  5333333.333333333
Entertainment  :  5602792.775092937
Entertainment;Brain Games  :  3314285.714285714
Entertainment;Creativity  :  4000000.0
Entertainment;Music & Video  :  6413333.333333333
Events  :  253542.22222222222
Finance  :  1387692.475609756
Food & Drink  :  1924897.7363636363
Health & Fitness  :  4188821.9853479853
House & Home  :  1360598.042253521
Libraries & Demo  :  638503.734939759
Lifestyle  :  1421219.9096209912
Lifestyle;Pretend Play  :  10000000.0
Card  :  3815462.5
Arcade  :  

**In summary, I would confidently recommend the social media niche**, based on the balance between popularity, frequency, installs and ratings on both platforms.

# Conclusion
In this small project we analyzed data about App Store and Google Play mobile apps, both for practice and to recommend an app profile that can be profitable on both markets.

We concluded that a new social network or media app with distinguishing features, such as a high level of privacy and security, could be a profitable and successful startup.