# Project:  Analyzing Mobile App Data

This project seeks to analyze mobile app data from the Google App Store and App Store. 
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Our aim is to help our developers understand what type of apps are likely to attract more users on Google Play and the App Store in order to develop more of those to generate more revenue.

# Opening and Exploring the Data

In [5]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

In [6]:
# this function prints the number of rows and columns from the dataset
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [7]:
print(android_header)    #returns the columns of the dataset
print('\n')         #returns a line space between each row
explore_data(android, 0, 3, True)    #calls the function

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The columns that could help our analysis include; `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'`. For more information about the columns, the documentation can be found [here](https://www.kaggle.com/datasets/lava18/google-play-store-apps).

In [8]:
print(ios_header)       #returns the columns of the dataset
print('\n')             #returns a line space between each row
explore_data(ios, 0, 3, True)       #calls the function  

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


the columns that could help our analysis include; `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'`.  For more information about the columns, the documentation can be found [here](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

# Deleting Wrong Data

In [9]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


I will have to check the length of `android` before and after deleting the row to confirm if the code actually worked.

**N/B:** do not repeat this procedure to avoid deleting more than one row.

In [10]:
print(len(android))    #returns length of android before deleting row
del android[10472]     #deletes row
print(len(android))    #returns length of android after deleting row

10841
10840


# Removing Duplicate Entries

# Part One

If you explore the Google Play data set long enough or look at the [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, you'll notice some apps have duplicate entries. For instance, Instagram has four entries:

In [11]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [12]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


When we work with data, it's important to make sure that we're not counting the same thing multiple times. In this case, we're looking at apps, and we want to make sure that we don't count an app more than once when we're analyzing our data. This is important because if we count an app multiple times, we might get inaccurate results and make incorrect conclusions.

To make sure we're only counting each app once, we need to remove duplicate entries from our dataset. This means that if there are two or more rows in our dataset that refer to the same app, we only keep one of those rows and remove the duplicates.

One way to remove the duplicate rows would be to do it randomly, but that might not be the best approach. Instead, we can look for a better way to identify which row to keep and which ones to remove.

If you examine the snippet above for the Instagram app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. We won't remove rows randomly, but rather we'll keep the rows that have the highest number of reviews because the higher the number of reviews, the more reliable the ratings.

# Removing Duplicate Entries

# Part Two

To do this, we will:

1) Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app

2) Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

In [13]:
reviews_max = {}   #empty dictionary

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [14]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max)) #length of the dictionary containg unique values

Expected length: 9659
Actual length: 9659


In [15]:
android_clean = [] #empty list which will store the new cleaned dataset
already_added =[]  #empty list which will store the app names

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))

9659


In [16]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


We have 9659 rows as expected.

# Removing Non-English Apps

# Part One

In [17]:
print(ios[813][1])
print(ios[6731][1])

print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


In [18]:
def check_english(a_string):
    
    for character in a_string:
        if ord(character) > 127:
            return False
    
    return True
    

print(check_english('Instagram'))
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))

True
False
False
False


Our results are accurate

# Removing Non-English Apps

# Part Two

To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

In [19]:
# function that has `a_string` as an argument and  filters out non-english apps

def check_english(a_string):
    non_ascii = 0
    
    for character in a_string:
        if ord(character) > 127:
             non_ascii += 1
                
    if non_ascii > 3:
        return False
    else:
        return True          
    
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))

False
True
True


In [21]:
android_english = []
ios_english = []

for apps in android_clean:
    name = apps[0]
    if check_english(name):
        android_english.append(apps)

for apps in ios:
    name = apps[1]
    if check_english(name):
        ios_english.append(apps)

# calling the function
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

# Isolating the Free Apps

In [22]:
android_free_apps = []
ios_free_apps = []

for apps in android_english:
    price = apps[7]
    if price == '0':
        android_free_apps.append(apps)
        
for apps in ios_english:
    price = apps[4]
    if price == '0.0':
        ios_free_apps.append(apps)
        
print(len(android_free_apps))
print(len(ios_free_apps))
    

8864
3222


# Most Common Apps by Genre

# Part One

The reason we want to find an app profile that fits both the App Store and Google Play is that it allows us to reach a wider audience, which can potentially increase our revenue. The validation strategy for an app idea involves building a minimal Android version of the app and adding it to Google Play. If the app receives a good response from users, we further develop it. If it is profitable after six months, we build an iOS version of the app and add it to the App Store.

To determine the most common genres in each market, we can use the following columns in our datasets:

For the App Store dataset:

`"prime_genre"`: the genre of the app

For the Google Play dataset:

`"Genres"`: the primary genre of the app

`"Category"`: the category of the app

We can use these columns to generate frequency tables to determine the most common genres in each market.






# Part Two

In [23]:
def freq_table(dataset, index):
    
    freq = {}
    total = 0
   
    # Counting the frequencies
    for data in dataset:
        value = data[index]
        if value in freq:
            freq[value] += 1
        else:
            freq[value] = 1
    total += 1
          
         # Converting to percentages
    freq_percentages = {}
    for key in freq:
        percentage = (freq[key] / total) * 100
        freq_percentages[key] = percentage 
    
    return freq_percentages

    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

# Part Three

We start by examining the frequency table for the `prime_genre` column of the App Store data set.

In [24]:
display_table(ios_free_apps, -5)

Games : 187400.0
Entertainment : 25400.0
Photo & Video : 16000.0
Education : 11800.0
Social Networking : 10600.0
Shopping : 8400.0
Utilities : 8100.0
Sports : 6900.0
Music : 6600.0
Health & Fitness : 6500.0
Productivity : 5600.0
Lifestyle : 5100.0
News : 4300.0
Travel : 4000.0
Finance : 3600.0
Weather : 2800.0
Food & Drink : 2600.0
Reference : 1800.0
Business : 1700.0
Book : 1400.0
Navigation : 600.0
Medical : 600.0
Catalogs : 400.0


From this frequency table, we can answer the following questions:

* **What is the most common genre? What is the next most common?**

The most common genre is "Games", which appears in 58.16% of the apps in the dataset. The next most common genre is "Entertainment", which appears in 7.88% of the apps.

* **What other patterns do you see?**

We can see that the vast majority of apps in the dataset (over 75%) are designed for either games or entertainment purposes. Additionally, there are a few other genres that are somewhat popular, such as photo and video, education, and social networking.

* **What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?**

Based on this frequency table, it seems that the majority of apps are designed for entertainment purposes. However, there are still a significant number of apps in the dataset that are designed for practical purposes.

* **Can you recommend an app profile for the App Store market based on this frequency table alone?**

Based on this frequency table alone, it would make sense to focus on developing apps in the "Games" or "Entertainment" categories, as these are by far the most popular genres in the dataset.

* **If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?**

Not necessarily. While a large number of apps in a particular genre may 
indicate that there is high demand for that type of app, it does not necessarily mean that each individual app in that genre has a large number of users. Additionally, the competition in popular genres like "Games" and "Entertainment" can be very high






Let's continue by examining the `Genres` and `Category` columns of the Google Play data set (two columns which seem to be related).

In [25]:
display_table(android_free_apps, 1) # Category

FAMILY : 167600.0
GAME : 86200.0
TOOLS : 75000.0
BUSINESS : 40700.0
LIFESTYLE : 34600.0
PRODUCTIVITY : 34500.0
FINANCE : 32800.0
MEDICAL : 31300.0
SPORTS : 30100.0
PERSONALIZATION : 29400.0
COMMUNICATION : 28700.0
HEALTH_AND_FITNESS : 27300.0
PHOTOGRAPHY : 26100.0
NEWS_AND_MAGAZINES : 24800.0
SOCIAL : 23600.0
TRAVEL_AND_LOCAL : 20700.0
SHOPPING : 19900.0
BOOKS_AND_REFERENCE : 19000.0
DATING : 16500.0
VIDEO_PLAYERS : 15900.0
MAPS_AND_NAVIGATION : 12400.0
FOOD_AND_DRINK : 11000.0
EDUCATION : 10300.0
ENTERTAINMENT : 8500.0
LIBRARIES_AND_DEMO : 8300.0
AUTO_AND_VEHICLES : 8200.0
HOUSE_AND_HOME : 7300.0
WEATHER : 7100.0
EVENTS : 6300.0
PARENTING : 5800.0
ART_AND_DESIGN : 5700.0
COMICS : 5500.0
BEAUTY : 5300.0


* The most common genres in the Category column of the Google Play dataset are Family and Game.


* Other patterns that can be observed from the frequency table are that the most common app genres fall under the categories of Family, Game, Tools, Business, and Medical. The least common genres are Beauty and Parenting apps.


* When compared to the App Store market, the Google Play market has a higher proportion of apps designed for practical purposes such as Tools, Business, and Medical. However, similar to the App Store, the most common genres are still Games and Family.


* Based on this frequency table alone, it would also be difficult to recommend a specific app profile for the Google Play market. Similar to the App Store, the most frequent app genres do not necessarily imply that those genres have the most users. It is important to consider other factors such as user reviews, ratings, and downloads to determine the popularity of a particular app genre.

# Most Popular Apps by Genre on the App Store

In [26]:
genre_ios = freq_table(ios_free_apps, -5)

for genre in genre_ios:
    total = 0
    len_genre = 0
    
    for apps in ios_free_apps:
        genre_app = apps[-5]
        if genre_app == genre:
            n_ratings = float(apps[5])
            total += n_ratings
            len_genre += 1
            
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Based on the results, here are some potential app profile recommendations for the App Store:

* Gaming apps have the highest number of user ratings on average, so developing a gaming app may be a profitable option.

* Social networking and music apps also have a high number of user ratings on average, so creating an app in one of these categories could be a good choice.

* Education and book apps have a relatively low number of user ratings on average, but they also have a lower level of competition compared to other genres. Developing an app in this category may be a good option for a niche audience.

# Most Popular Apps by Genre on Google Play

In [27]:
categories_android = freq_table(android_free_apps, 1)

for category in categories_android:
    total = 0
    len_category = 0
    n_installs = 0  # moved outside the for loop and initialized to 0
    for apps in android_free_apps:
        category_app = apps[1]  # corrected variable name to "category_app"
        if category_app == category:
            n_installs_str = apps[5]  # corrected variable name to "n_installs_str"
            n_installs_str = n_installs_str.replace(',', '')
            n_installs_str = n_installs_str.replace('+', '')
            n_installs = int(n_installs_str)  # converted to int
            total += n_installs
            len_category += 1
    avg_n_installs = total / len_category  
    print(category, ':', avg_n_installs)


ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

Based on the results, here are a few potential app profile recommendations for Google Play:

* Communication: This category has the highest average number of installs (38,456,119), which suggests that there is high demand for communication apps. A potential app profile could be a messaging app that includes additional features such as voice and video calls, file sharing, and social media integration.

* Video Players: This category has the second highest average number of installs (24,727,872), indicating that there is a large audience for video player apps. A potential app profile could be a video player that allows users to stream content from various sources, including social media platforms and online streaming services.

* Social: This category has the third highest average number of installs (23,253,652), suggesting that social media apps continue to be popular. A potential app profile could be a social media app that focuses on a specific niche or interest group, such as sports fans, hobbyists, or music lovers.

# Conclusions

In conclusion, our analysis aimed to identify app profiles that have the potential to be profitable on both the App Store and Google Play. We started by cleaning and exploring the data sets, then generated frequency tables to determine the most common genres in each market. We found that the most common genres in the App Store are Games, Entertainment, and Photo & Video, while the most common genres on Google Play are Tools, Entertainment, and Education.

Based on our analysis, we recommend developing a gaming app that has elements of entertainment and social networking. This type of app can potentially attract a wide range of users and keep them engaged for longer periods, which can translate to higher revenue. However, it's important to note that our recommendation is based on the current trends and may change in the future.

Overall, the success of any app idea depends on several factors, such as marketing, user engagement, and competition. Therefore, it's crucial to thoroughly research and validate any app idea before investing time and resources into its development.