Our goal for this project is to analyze data to help our developers understand which types of apps are likely to attract more users.

In [1]:
from csv import reader

In [2]:
# Open the Play Store CSV, read it, and store the rows as a list of lists
with open('googleplaystore.csv', encoding="utf8") as opened_file:
    android = list(reader(opened_file))
android_header = android[0]  # first row holds the column names
android = android[1:]        # remaining rows are the app records

The explore_data() function:

Takes in four parameters:
1. dataset, which is expected to be a list of lists.
2. start and end, which are both expected to be integers and represent the starting and the ending indices of a slice from the data set.
3. rows_and_columns, which is expected to be a Boolean and has False as a default argument.

Slices the data set using dataset[start:end].

Loops through the slice, and for each iteration, prints a row and adds a new line after that row using print('\n').

The \n in print('\n') is a special character and won't be printed. Instead, the \n character adds a new line, and we use print('\n') to add some blank space between rows.

Prints the number of rows and columns if rows_and_columns is True.

dataset shouldn't have a header row, otherwise the function will print the wrong number of rows (one more row compared to the actual length).

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
explore_data(android, 0,3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Data cleaning: identify rows with wrong or missing data points

In [5]:
# Compare every row with the header length; a shorter or longer row is malformed
for index, data in enumerate(android):
    if len(data) != len(android_header):
        print('Length of dataset row is {} and of data at {} is {}'.format(len(android_header), index, len(data)))
        print(data)

Length of dataset row is 13 and of data at 10472 is 12
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
# Delete the malformed row with the del statement. Caution: run this cell only once; running it again would delete a valid row.
del android[10472]

In [7]:
# Check that the row was deleted
print(android[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


In [8]:
# Data cleaning: identify duplicated apps
unique_data = []     # app names seen for the first time
duplicate_data = []  # app names that appear again
for data in android:
    app_name = data[0]
    if app_name in unique_data:
        duplicate_data.append(app_name)
    else:
        unique_data.append(app_name)
print('Number of unique data: {} and duplicated data: {}'.format(len(unique_data), len(duplicate_data)))


Number of unique data: 9659 and duplicated data: 1181


Inspecting the duplicated entries shows that some rows for the same app are identical except for the number of reviews. So, for each app, we keep only the entry with the highest number of reviews, tracking that maximum in a new dictionary.

In [9]:
# Create a dictionary, reviews, where each key is a unique app name and
# the value is the highest value found at index 2 for that app.
# Note: in the rows printed earlier, index 2 holds the rating ('4.1') and index 3 the
# review count ('159'); the output shown for this cell comes from index 2.
reviews = {}

for data in android:
    name = data[0]
    n_review = float(data[2])
    if name in reviews and reviews[name] < n_review:
        reviews[name] = n_review
    elif name not in reviews:
        reviews[name] = n_review
print(len(reviews))

# Remove duplicated apps, keeping one entry per app (the one that matches the stored maximum)
android_clean = []   # stores the cleaned data set
already_added = []   # stores the app names already kept
for data in android:
    name = data[0]
    n_review = float(data[2])

    if (n_review == reviews[name]) and (name not in already_added):
        android_clean.append(data)
        already_added.append(name)

# Explore the android_clean data set to check the result
explore_data(android_clean, 0, 5, True)


9659
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 8196
Number of columns: 13
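
The stated rule is to keep, for each app, the entry with the most reviews. A minimal sketch of that rule on hypothetical sample rows follows. It assumes the review count is the fourth field (index 3), as the printed rows suggest, and the names `sample`, `REVIEWS_INDEX`, `best_row`, and `deduplicated` are illustrative. Because it compares review counts rather than ratings, an app whose rating is 'NaN' is kept too (`float('nan') == float('nan')` is False in Python, so an equality test on ratings would never match such a row).

```python
# Hypothetical sample rows in the same layout as the data set:
# name, category, rating, reviews
sample = [
    ['App A', 'TOOLS', '4.1', '100'],
    ['App A', 'TOOLS', '4.1', '250'],   # same app, more reviews
    ['App B', 'GAME', 'NaN', '10'],     # missing rating, still a valid app
    ['App A', 'TOOLS', '4.1', '250'],   # exact duplicate of the best entry
]

REVIEWS_INDEX = 3  # assumption: the review count is the fourth field

# Map each app name to the row with the most reviews seen so far.
best_row = {}
for row in sample:
    name = row[0]
    n_reviews = int(row[REVIEWS_INDEX])
    if name not in best_row or n_reviews > int(best_row[name][REVIEWS_INDEX]):
        best_row[name] = row

deduplicated = list(best_row.values())
for row in deduplicated:
    print(row)
```

Storing the whole row in the dictionary means a single pass is enough, and no separate list of already-added names is needed.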


In [10]:
# Data cleaning: remove apps with non-English names, since only apps aimed at an English-speaking audience are analyzed
# Helper that decides whether an app name is English
def English(char):
    # Count the characters outside the ASCII range (code point above 127)
    i = 0
    for a in char:
        if ord(a) > 127:
            i += 1
    # Allow up to three such characters so names with an emoji or a symbol like ™ are kept
    return i <= 3

English_dataset = []   # stores all English apps

for data in android_clean:
    name = data[0]
    if English(name):
        English_dataset.append(data)
explore_data(English_dataset, 0, 5, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 8166
Number of columns: 13


In [11]:
# Data cleaning : Isolate free apps from paid apps
Final_dataset= []

for data in English_dataset:
    if data[6]=="Free":
        Final_dataset.append(data)
len(Final_dataset)
        

7565

Our aim is to determine the kinds of apps that are likely to attract more users because the revenue is highly influenced by the number of people using the apps.

To minimize risks and overhead, our validation strategy for an app idea comprises three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres in the market.
We'll build two functions to analyze the frequency tables:

1. One function to generate frequency tables that show percentages.
2. Another function to display the percentages in descending order.
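
As a compact illustration of these two helpers, the sketch below uses `collections.Counter` from the standard library on hypothetical sample rows. The names `freq_percentages`, `print_sorted`, and `sample` are illustrative. This is an alternative formulation, not the implementation used in this notebook.

```python
from collections import Counter

def freq_percentages(rows, index):
    """Return {value: percentage of rows that have that value in column `index`}."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

def print_sorted(percentages):
    """Print the entries from the highest percentage to the lowest."""
    for value, pct in sorted(percentages.items(), key=lambda kv: kv[1], reverse=True):
        print(value, ':', pct)

# Hypothetical rows: name, category
sample = [['a', 'GAME'], ['b', 'TOOLS'], ['c', 'GAME'], ['d', 'GAME']]
table = freq_percentages(sample, 1)
print_sorted(table)
```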

In [12]:
def freq_table(dataset, index):
    """Build a frequency table, as percentages, for one column.

    Takes the dataset to evaluate and the index of the column (genre or category).
    Example, for the frequency of each genre:
    freq_table(Final_dataset, -4)
    """
    table = {}
    i = 0  # total number of rows
    for data in dataset:
        i += 1
        genre = data[index]
        if genre in table:
            table[genre] += 1
        else:
            table[genre] = 1
    genre_percentage = {}
    for item in table:
        percentage = (table[item] / i) * 100
        genre_percentage[item] = percentage

    return genre_percentage


def display_table(dataset, index):
    """Call freq_table, convert the resulting dictionary to (percentage, key) tuples,
    sort them, and print them in descending order."""
    table = freq_table(dataset, index)
    table_display = []  # renamed so it does not shadow the function name

    for key in table:
        key_tuples = (table[key], key)
        table_display.append(key_tuples)

    sorted_table = sorted(table_display, reverse=True)
    for item in sorted_table:
        print(item[1], ":", item[0])
    
    
    

In [13]:
#Analyse for category
display_table(Final_dataset,1)

FAMILY : 19.07468605419696
GAME : 11.037673496364839
TOOLS : 8.67151354923992
FINANCE : 3.8202247191011236
PRODUCTIVITY : 3.727693324520819
LIFESTYLE : 3.688037012557832
BUSINESS : 3.344348975545274
PHOTOGRAPHY : 3.2782551222736287
SPORTS : 3.146067415730337
COMMUNICATION : 3.0931923331130204
PERSONALIZATION : 3.0799735624586915
HEALTH_AND_FITNESS : 3.0667547918043625
MEDICAL : 3.0138797091870457
SOCIAL : 2.6569729015201586
NEWS_AND_MAGAZINES : 2.6173165895571713
TRAVEL_AND_LOCAL : 2.3661599471249173
SHOPPING : 2.3529411764705883
BOOKS_AND_REFERENCE : 2.1017845340383343
VIDEO_PLAYERS : 1.903502974223397
DATING : 1.7316589557171185
EDUCATION : 1.4937210839391937
MAPS_AND_NAVIGATION : 1.4805023132848645
ENTERTAINMENT : 1.3218770654329148
FOOD_AND_DRINK : 1.2161269001982815
AUTO_AND_VEHICLES : 0.9517514871116985
WEATHER : 0.8592200925313946
LIBRARIES_AND_DEMO : 0.8460013218770654
HOUSE_AND_HOME : 0.8195637805684072
ART_AND_DESIGN : 0.7534699272967614
COMICS : 0.7005948446794449
PARENTING 

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column.

In [14]:
display_table(Final_dataset,5)

1,000,000+ : 18.413747521480502
100,000+ : 13.284864507600794
10,000,000+ : 12.346331791143424
10,000+ : 11.421017845340383
5,000,000+ : 7.984137475214805
1,000+ : 7.455386649041638
500,000+ : 6.51685393258427
50,000+ : 5.43291473892928
5,000+ : 4.732319894249835
100+ : 3.132848645076008
50,000,000+ : 2.683410442828817
100,000,000+ : 2.4983476536682088
500+ : 2.154659616655651
10+ : 0.6741573033707865
50+ : 0.5551883674818242
500,000,000+ : 0.3172504957038995
1,000,000,000+ : 0.2643754130865829
5+ : 0.11896893588896232
1+ : 0.013218770654329148


We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not precise: most values are open-ended (100+, 1,000+, 5,000+, etc.). For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000.
That is acceptable for our purposes: we only want to find out which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.
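
The conversion described above can be sketched as a small helper. The name `parse_installs` is illustrative and not part of the notebook.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to a float (1000000.0)."""
    return float(text.replace(',', '').replace('+', ''))

for bucket in ['100+', '10,000+', '1,000,000+']:
    print(bucket, '->', parse_installs(bucket))
```

Calling `float('1,000,000+')` directly raises a `ValueError`, which is why both characters are removed first.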

In [15]:
# Generate a frequency table for the Category column of the Google Play data set to get the unique app categories
category_table = freq_table(Final_dataset, 1)

# Loop over the unique categories of the Google Play data set. For each iteration:
#   - Initialize total to 0; it stores the sum of installs for the category.
#   - Initialize len_category to 0; it stores the number of apps in the category.
#   - Loop over the Google Play data set, and for each row:
#       - Save the app's category to category_app.
#       - If category_app equals category (the iteration variable of the main loop):
#           - Save the number of installs.
#           - Remove any + or , characters, then convert the string to a float.
#           - Add the number of installs to total.
#           - Increment len_category by 1.
#   - Compute the average number of installs as total / len_category (outside the nested loop).
#   - Print the category and its average number of installs (also outside the nested loop).

category_install_data ={}
for category in category_table:
    total =0
    len_category =0
        
    for item in Final_dataset:
        category_app = item[1]
        if category_app == category: 
            n_installs = item[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            total+=n_installs
            len_category +=1
        
    Avg_install = total/len_category
    print(category,':',Avg_install)
    category_install_data.update({category:Avg_install})    
    



ART_AND_DESIGN : 2003791.2280701755
AUTO_AND_VEHICLES : 737219.4444444445
BEAUTY : 640861.9047619047
BOOKS_AND_REFERENCE : 10476157.264150944
BUSINESS : 2753974.1501976284
COMICS : 847567.9245283019
COMMUNICATION : 47166160.384615384
DATING : 1075582.5190839695
EDUCATION : 3108407.079646018
ENTERTAINMENT : 21134600.0
EVENTS : 354431.3333333333
FINANCE : 1574833.2179930797
FOOD_AND_DRINK : 2300192.934782609
HEALTH_AND_FITNESS : 4885919.051724138
HOUSE_AND_HOME : 1565838.7096774194
LIBRARIES_AND_DEMO : 813796.875
LIFESTYLE : 1782802.9032258065
GAME : 16655938.269461079
FAMILY : 3045982.508662509
MEDICAL : 168882.35087719298
SOCIAL : 27302664.05472637
SHOPPING : 7866974.382022472
PHOTOGRAPHY : 18738970.201612905
SPORTS : 4601628.844537815
TRAVEL_AND_LOCAL : 16171381.56424581
TOOLS : 12344508.658536585
PERSONALIZATION : 6562636.9527897
PRODUCTIVITY : 20537621.879432622
PARENTING : 647208.5416666666
WEATHER : 5542846.153846154
VIDEO_PLAYERS : 27268931.944444444
NEWS_AND_MAGAZINES : 11960046
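
The nested loop above rescans the whole data set once per category. The same averages can be accumulated in a single pass, sketched here on hypothetical sample rows with `collections.defaultdict`. The names `sample`, `totals`, `counts`, and `averages` are illustrative.

```python
from collections import defaultdict

# Hypothetical rows: name, category, rating, reviews, size, installs
sample = [
    ['a', 'GAME', '4.0', '10', '1M', '1,000+'],
    ['b', 'GAME', '4.0', '10', '1M', '3,000+'],
    ['c', 'TOOLS', '4.0', '10', '1M', '500+'],
]

totals = defaultdict(float)  # sum of installs per category
counts = defaultdict(int)    # number of apps per category

for row in sample:
    category = row[1]
    installs = float(row[5].replace(',', '').replace('+', ''))
    totals[category] += installs
    counts[category] += 1

# Divide each total by its count to get the average per category.
averages = {category: totals[category] / counts[category] for category in totals}
for category, avg in averages.items():
    print(category, ':', avg)
```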

In [16]:
# Sort the categories by average number of installs
def sort_avg_install(category_install_data):
    """Print the entries of a {name: value} dictionary, sorted by value in descending order.

    Input:  sort_avg_install(category_install_data)
    Output: COMMUNICATION : 47166160.384615384
            SOCIAL : 27302664.05472637
            VIDEO_PLAYERS : 27268931.944444444
            ENTERTAINMENT : 21134600.0
            PRODUCTIVITY : 20537621.879432622 .....
    """
    table_display = []
    for category in category_install_data:
        key_tuples = (category_install_data[category], category)
        table_display.append(key_tuples)

    sorted_table = sorted(table_display, reverse=True)
    for item in sorted_table:
        print(item[1], ":", item[0])


sort_avg_install(category_install_data)
    
   

COMMUNICATION : 47166160.384615384
SOCIAL : 27302664.05472637
VIDEO_PLAYERS : 27268931.944444444
ENTERTAINMENT : 21134600.0
PRODUCTIVITY : 20537621.879432622
PHOTOGRAPHY : 18738970.201612905
GAME : 16655938.269461079
TRAVEL_AND_LOCAL : 16171381.56424581
TOOLS : 12344508.658536585
NEWS_AND_MAGAZINES : 11960046.212121213
BOOKS_AND_REFERENCE : 10476157.264150944
SHOPPING : 7866974.382022472
PERSONALIZATION : 6562636.9527897
WEATHER : 5542846.153846154
HEALTH_AND_FITNESS : 4885919.051724138
SPORTS : 4601628.844537815
MAPS_AND_NAVIGATION : 4491486.25
EDUCATION : 3108407.079646018
FAMILY : 3045982.508662509
BUSINESS : 2753974.1501976284
FOOD_AND_DRINK : 2300192.934782609
ART_AND_DESIGN : 2003791.2280701755
LIFESTYLE : 1782802.9032258065
FINANCE : 1574833.2179930797
HOUSE_AND_HOME : 1565838.7096774194
DATING : 1075582.5190839695
COMICS : 847567.9245283019
LIBRARIES_AND_DEMO : 813796.875
AUTO_AND_VEHICLES : 737219.4444444445
PARENTING : 647208.5416666666
BEAUTY : 640861.9047619047
EVENTS : 354

__Conclusion 1__: Apps in the COMMUNICATION category have the highest average number of installs, whereas apps in the MEDICAL category have the lowest.



In [17]:
# Collect the install bucket and rating of every app in the MEDICAL category
app_n_rating = {}
for item in Final_dataset:
    if item[1] == "MEDICAL":
        app_name = item[0]
        app_rating = item[2]
        app_install = item[5]
        app_n_rating[app_name] = [app_install, app_rating]

# Convert the install buckets to floats so the medical apps can be sorted by installs
# (this reuses, and overwrites, the category_install_data name from the earlier cell)
category_install_data = {}
for item in app_n_rating:
    n_installs = app_n_rating[item][0]
    n_installs = n_installs.replace('+', '')
    n_installs = n_installs.replace(',', '')
    n_installs = float(n_installs)
    category_install_data[item] = n_installs


print("Sorted Table")
sort_avg_install(category_install_data) 
        
    



Sorted Table
My Calendar - Period Tracker : 5000000.0
Blood Pressure : 5000000.0
mySugr: the blood sugar tracker made just for you : 1000000.0
Pregnancy Week By Week : 1000000.0
Pregnancy Calculator and Tracker app : 1000000.0
Period Tracker : 1000000.0
Ovia Pregnancy Tracker & Baby Countdown Calendar : 1000000.0
Ovia Fertility Tracker & Ovulation Calculator : 1000000.0
MyChart : 1000000.0
GoodRx Drug Prices and Coupons : 1000000.0
FollowMyHealth® : 1000000.0
Epocrates Plus : 1000000.0
Drugs.com Medication Guide : 1000000.0
Doctor On Demand : 1000000.0
CareZone : 1000000.0
Blood Pressure(BP) Diary : 1000000.0
Anatomy Learning - 3D Atlas : 1000000.0
Ada - Your Health Guide : 1000000.0
1800 Contacts - Lens Store : 1000000.0
Zocdoc: Find Doctors & Book Appointments : 500000.0
Teladoc Member : 500000.0
Teach Me Anatomy : 500000.0
OneTouch Reveal : 500000.0
OnTrack Diabetes : 500000.0
Migraine Buddy - The Migraine and Headache tracker : 500000.0
Mayo Clinic : 500000.0
CVS Caremark : 500000.

The most-installed apps in the medical category are mostly tracking apps. To evaluate whether there is potential for developing more tracking apps, let's look at the variety of apps available, their number of installs, and their ratings.

In [18]:
# Check whether a keyword for a particular condition (period, pregnancy, blood, etc.) appears in each app name.
print("\nPeriod Tracking Apps")
for item in app_n_rating:
    if "period" in item.lower() or "ovulation" in item.lower():
        print(item, app_n_rating[item][1], app_n_rating[item][0])

print("\nPregnancy Tracking Apps")
for item in app_n_rating:
    if "pregnancy" in item.lower() or "baby" in item.lower():
        print(item, app_n_rating[item][1], app_n_rating[item][0])

print("\nBlood Pressure Tracking Apps")
for item in app_n_rating:
    if "blood" in item.lower():
        print(item, app_n_rating[item][1], app_n_rating[item][0])

print("\nDiabetes Tracking Apps")
for item in app_n_rating:
    if "diabetes" in item.lower() or "sugar" in item.lower():
        print(item, app_n_rating[item][1], app_n_rating[item][0])

print("\nMigraine Tracking Apps")
for item in app_n_rating:
    if "migr" in item.lower():
        print(item, app_n_rating[item][1], app_n_rating[item][0])


Period Tracking Apps
Ovia Fertility Tracker & Ovulation Calculator 4.8 1,000,000+
My Calendar - Period Tracker 4.7 5,000,000+
Period Tracker 4.9 1,000,000+

Pregnancy Tracking Apps
Ovia Pregnancy Tracker & Baby Countdown Calendar 4.8 1,000,000+
Pregnancy Week By Week 4.8 1,000,000+
Ovia Parenting & Baby Development Tracker 4.7 100,000+
Pregnancy Calculator and Tracker app 4.8 1,000,000+

Blood Pressure Tracking Apps
Blood Pressure 4.2 5,000,000+
Blood Donor 4.2 500,000+
Blood Pressure Log - MyDiary 4.7 500,000+
mySugr: the blood sugar tracker made just for you 4.6 1,000,000+
JH Blood Pressure Monitor 3.7 500+
Blood Pressure(BP) Diary 3.7 1,000,000+
Blood Pressure Monitor 4.3 10,000+
BP Journal - Blood Pressure Diary 5.0 1,000+
Blood Pressure - Stay Healthy 4.6 1,000+
Low Blood Pressure Symptoms 4.2 10,000+
Hypertension High blood pressure 3.8 100,000+
Cardio Journal — Blood Pressure Log 4.5 100,000+
MedM Blood Pressure 4.0 10,000+
Diabetes, Blood Pressure, Health Tracker App 4.5 10,00

__Conclusion 2__: Further analysis shows that apps for tracking health conditions such as periods, pregnancy, migraine, diabetes, or blood pressure are the more popular and relevant ones. Among these there are only 3 period tracker apps, 3 pregnancy tracker apps, and only 1 migraine tracking app. So there is room for new apps that track other health conditions or related items, such as vaccine schedules, medical checkup schedules, or medication schedules, which could be profitable.