## Profitable App Profiles for the App Store and Google Play Markets
Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.


In [1]:
from csv import reader

# Open the Google Play data set (raw strings keep the Windows backslashes literal)
with open(r'C:\Users\Eslam\Desktop\DQ project\data sets\googleplaystore.csv', encoding='utf8') as opened_file:
    data = list(reader(opened_file))
android_header = data[0]
android = data[1:]

# Open the App Store data set
with open(r'C:\Users\Eslam\Desktop\DQ project\data sets\AppleStore.csv', encoding='utf8') as opened_file:
    data = list(reader(opened_file))
ios_header = data[0]
ios = data[1:]

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Eslam\\Desktop\\DQ project\\data sets\\googleplaystore.csv'
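The absolute path in the cell above only resolves on one machine, which is one way to end up with a FileNotFoundError. As a minimal sketch, the loading step could be wrapped in a small helper that takes the path as an argument. The name load_csv and the sample file are hypothetical and are not part of the original notebook; the demo writes a throwaway file so it runs anywhere.

```python
import csv
import tempfile
from pathlib import Path

def load_csv(path):
    """Return (header, rows) for a CSV file. Hypothetical helper for illustration."""
    with open(path, encoding='utf8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Demo on a temporary file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / 'sample_store.csv'
    sample_path.write_text('App,Category\nDemo App,TOOLS\n', encoding='utf8')
    sample_header, sample_rows = load_csv(sample_path)

print(sample_header)
print(sample_rows)
```

With a helper like this, the two data sets could be loaded from paths relative to the notebook instead of a user-specific folder.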


To make it easier to explore the two data sets, we'll first write a function named explore_data() that we can use repeatedly to explore rows in a more readable way. We'll also add an option for our function to show the number of rows and columns for any data set.



In [None]:
def explore_data(dataset , start , end , colm_rows = False):
    data_slice = dataset[start : end]
    for app in data_slice:
        print(app)
        print('\n')  # blank lines between rows for readability
    if colm_rows:  # print the dimensions only when requested
        print("number of rows =" , len(dataset))
        print("number of columns =" , len(dataset[0]))

explore_data(android , 5 , 10 , True)




We see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

Now let's take a look at the App Store data set.

In [None]:
print(ios_header , '\n')
explore_data(ios , 0 , 3 , True)

If we check the Google Play data set, we find an error in row 10472: the rating is '19', which is impossible because Google Play ratings only go up to 5. We should delete this row.

In [None]:
explore_data(android , 10471 , 10473 )
print(len(android))

In [None]:
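print(android[10472] , '\n')  # the incorrect row
del android[10472]  # run this only once, otherwise a valid row is deleted too
print(android[10472])  # the row that now occupies index 10472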
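Spotting a bad row by eye does not scale, so here is a hedged sketch of one automated check: flag every row whose number of fields differs from the header's. This assumes the error comes from a missing or extra field, which the notebook does not confirm for row 10472. The helper name find_malformed_rows and the toy rows are made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Toy data (not taken from the real data set).
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Alpha', 'TOOLS', '4.1'],
    ['Beta', '19'],             # one field is missing, so values sit in the wrong columns
    ['Gamma', 'GAME', '4.7'],
]

bad_indexes = find_malformed_rows(toy_header, toy_rows)
print(bad_indexes)

# Delete from the end so that the earlier indexes stay valid.
for i in reversed(bad_indexes):
    del toy_rows[i]
print(len(toy_rows))
```

Because the deletion is driven by the check itself, rerunning the cell does not remove valid rows.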


### Checking for Duplicates
The Google Play data set contains duplicate entries, and the App Store data set has two duplicates as well.
Below, we print the Google Play entries for 'Instagram' and 'Twitter' as examples.

In [None]:
# checking duplicates
for app in android:
    name = app[0]
    if name == 'Instagram' or name == 'Twitter':
        print(app)
        print('\n')


In [None]:
# for checking the number of duplicates on each dataset 
def check_number_of_duplicates(dataset):
    unique_apps = []
    duplicate_apps = []
    clean_data = []
    for app in dataset:
        name = app[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
            clean_data.append(app)
    return len(duplicate_apps) , len(clean_data) , len(unique_apps)

check_number_of_duplicates(android)
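The function above checks membership in a list for every row, which gets slow as the data grows. As a sketch of an alternative, collections.Counter can tally the names in one pass. The helper name count_duplicates and the toy rows are hypothetical.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return (number of duplicate rows, number of unique names)."""
    name_counts = Counter(row[name_index] for row in rows)
    n_unique = len(name_counts)
    # Every occurrence beyond the first one counts as a duplicate.
    n_duplicates = sum(count - 1 for count in name_counts.values())
    return n_duplicates, n_unique

# Toy rows: [name, reviews]
toy_rows = [
    ['Instagram', '100'],
    ['Twitter', '50'],
    ['Instagram', '120'],
    ['Instagram', '120'],
]
print(count_duplicates(toy_rows))
```

The two numbers always add up to the number of rows, which is a quick sanity check on the result.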


### Removing Duplicates
To remove duplicates, we separate the duplicate entries from the clean data, so we define two empty lists:
- one for the duplicates
- another one for the unique (clean) entries

In [None]:
# First, count how many rows are duplicates and will need to be removed
duplicate_apps = []
unique_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print(len(unique_apps) , len(duplicate_apps))  # expected: 10840 - 1181 = 9659 unique apps, 1181 duplicates




Next, we remove the duplicates based on the following criterion: for each duplicated app, we keep only the entry with the highest number of reviews.

To do that, we first build a dictionary named reviews_max that maps each app name to the highest number of reviews found among its entries. Then we use this dictionary to remove the duplicates. In the code cell below:

- We start by initializing two empty lists, android_clean and already_added.
- We loop through the android data set, and for every iteration:
   - We isolate the name of the app and the number of reviews.
   - We add the current row (app) to the android_clean list, and the app name (name) to the already_added list if:
     - The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; and
     - The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.

In [None]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (name in reviews_max and n_reviews > reviews_max[name]):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

# Keep only the entry with the highest number of reviews for each app
android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
print(len(android_clean))
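The two loops above can also be folded into one. The following is a sketch of that idea, not the notebook's method: a dictionary maps each name to the best row seen so far. The helper name keep_most_reviewed and the toy rows are hypothetical, and the output order follows the first appearance of each name, which can differ from the order produced by the two-loop version.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row on ties, so equal-review duplicates collapse to one.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows: [name, category, rating, reviews]
toy_rows = [
    ['App A', 'TOOLS', '4.0', '10'],
    ['App B', 'GAME', '4.5', '7'],
    ['App A', 'TOOLS', '4.1', '25'],
    ['App B', 'GAME', '4.5', '7'],   # same review count as the earlier 'App B' row
]
toy_clean = keep_most_reviewed(toy_rows)
print(toy_clean)
```

The tie-handling comment plays the same role as the already_added check in the original cell.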

Let's explore the data set to make sure that our data is correct.

In [None]:
explore_data(android_clean , 5 , 10 , True)

## Removing Non-English Apps
### Part One
If we explore the data long enough, we find that some apps have non-English names.

In [None]:
print(ios[814][2])
print(ios[6734][2])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

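To remove these apps, we can write a function that checks whether the characters in a name fall within the ASCII range (0 to 127), which covers the characters commonly used in English text.
It is a simple criterion that we can apply to both data sets.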

In [None]:
# First attempt (for illustration only): reject any name containing a non-ASCII character
def is_english(string):
    for char in string:
        if ord(char) > 127:
            return False
    return True

print(is_english('Instagram'))

# Refined version: tolerate up to three non-ASCII characters (such as ™)
def is_english(string):
    non_ascii = 0
    for char in string:
        if ord(char) > 127:
            non_ascii += 1
            if non_ascii > 3:
                return False
    return True

print(is_english('eslam hosam للال'))
print(is_english('Docs To Go™ Free Office Suite'))

The function is still not perfect, and a few non-English apps might get past our filter, but this is good enough at this point in our analysis, and we shouldn't spend too much time on optimization yet.
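To make the threshold explicit, here is a compact sketch of the same check with the limit exposed as a parameter. The names count_non_ascii, is_mostly_ascii and max_non_ascii are hypothetical; the sample strings are the ones already used in this notebook.

```python
def count_non_ascii(text):
    """Count the characters whose code point falls outside the ASCII range (0 to 127)."""
    return sum(1 for char in text if ord(char) > 127)

def is_mostly_ascii(text, max_non_ascii=3):
    """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

samples = ['Instagram', 'Docs To Go™ Free Office Suite', 'eslam hosam للال']
for sample in samples:
    print(sample, '->', is_mostly_ascii(sample))
```

Lowering max_non_ascii to 0 reproduces the stricter first attempt, which would reject the 'Docs To Go™' name because of the ™ symbol.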

Below, we use the is_english() function to filter out the non-English apps for both data sets:

In [None]:
android_english = []
ios_english = []
#for android
for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
#for ios
for app in ios:
    name = app[2]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english , 2 , 5 , True)
print('\n')
explore_data(ios_english , 2 , 5 , True)

### Isolating the Free Apps
As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

In [None]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[5]
    if price == '0':
        ios_final.append(app)

explore_data(ios_final , 2 , 4 , True)
explore_data(android_final , 2 , 4 , True)


We're left with 8864 Android apps and 3222 iOS apps, which should be enough for our analysis.
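The filter above compares the price to the exact string '0'. As a hedged sketch, a numeric comparison would also accept spellings such as '0.0' and skip a leading currency symbol. Whether such spellings occur in these files is an assumption, and the helper name is_free and the toy prices are made up for illustration.

```python
def is_free(price_text):
    """Return True if the price parses to zero. Unparseable prices count as non-free."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

# Toy price strings (hypothetical, not taken from the data sets).
toy_prices = ['0', '0.0', '0.00', '$4.99', '2.99', 'unknown']
free_prices = [price for price in toy_prices if is_free(price)]
print(free_prices)
```

Comparing the row counts produced by both filters is a cheap way to find out whether the exact-string comparison misses any free apps.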

## Most Common Apps by Genre
### Part One
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

### Part Two
We'll build two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function that we can use to display the percentages in descending order

In [None]:
# Build a frequency table (as percentages) for one column of any data set
def freq_tables(dataset , index):
    table = {}
    all_values = 0
    for app in dataset:
        all_values += 1
        column = app[index]
        if column in table:
            table[column] += 1
        else:
            table[column] = 1
    table_percentages = {}
    for key in table:
        percentage = (table[key] / all_values) * 100
        table_percentages[key] = percentage

    return table_percentages

# Display the percentages in descending order: convert the table to a list
# of (percentage, value) tuples and sort it
def display_percentages(dataset , index):
    table = freq_tables(dataset , index)
    list_of_tuples = []
    for value in table:
        value_as_tuple = (table[value] , value)
        list_of_tuples.append(value_as_tuple)
    # Sort once, after the list is complete
    sorted_list = sorted(list_of_tuples , reverse = True)

    for v in sorted_list:
        print(v[1] , ":" , v[0])
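The same two steps can be sketched more compactly with collections.Counter and a sort key. This is an illustrative alternative, not the notebook's implementation; the function names percentage_table and sorted_percentages and the toy rows are hypothetical.

```python
from collections import Counter

def percentage_table(rows, index):
    """Map each value found in column `index` to its share of the rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

def sorted_percentages(rows, index):
    """Return (value, percentage) pairs, highest percentage first."""
    table = percentage_table(rows, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Toy rows: [name, category]
toy_rows = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
for value, pct in sorted_percentages(toy_rows, 1):
    print(value, ':', pct)
```

Sorting with a key avoids building the intermediate list of swapped tuples, at the cost of being a little less explicit for beginners.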



In [None]:
def freq_table(dataset , index):
    table = {}
    values = 0
    for app in dataset:
        values += 1
        col = app[index]
        if col in table:
            table[col] += 1
        else:
            table[col] = 1
    table_percentages = {}
    for key in table:
        percentage = (table[key] / values) * 100
        table_percentages[key] = percentage

    return table_percentages

def display_table(dataset , index):
    table = freq_table(dataset , index)
    list_of_tuples = []
    for key in table:
        value_as_tuple = (table[key] , key)
        list_of_tuples.append(value_as_tuple)
    # Sort once, after the list is complete
    sorted_list = sorted(list_of_tuples , reverse = True)

    for entry in sorted_list:
        print(entry[1] , entry[0])
        
display_table(android_final , -4)

In [None]:
display_percentages(android_final, -4)

In [None]:
display_percentages(android_final , 1)


The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kind of apps that have most users.

## Most Popular Apps by Genre on the App Store
One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [None]:

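genres_ios = freq_table(ios_final, -5)
# Look up the rating_count_tot column by name; index 5 is the price column used above
ratings_index = ios_header.index('rating_count_tot')

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[ratings_index])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)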
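The nested loop above scans the whole data set once per genre. As a sketch, the averages can also be accumulated in a single pass, with the columns looked up by name so that a wrong index is harder to introduce. The helper name average_by_group and the toy rows are hypothetical; only the column names rating_count_tot and prime_genre come from the notebook.

```python
from collections import defaultdict

def average_by_group(header, rows, group_column, value_column):
    """Return {group: mean of value_column} computed in one pass over the rows."""
    group_index = header.index(group_column)
    value_index = header.index(value_column)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy header and rows (values are made up for illustration).
toy_header = ['track_name', 'price', 'rating_count_tot', 'prime_genre']
toy_rows = [
    ['A', '0', '100', 'Games'],
    ['B', '0', '300', 'Games'],
    ['C', '0', '50', 'Weather'],
]
averages = average_by_group(toy_header, toy_rows, 'prime_genre', 'rating_count_tot')
for genre, avg in averages.items():
    print(genre, ':', avg)
```

header.index raises a ValueError if a column name is misspelled, which surfaces the mistake immediately instead of silently averaging the wrong column.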