# Profitable App Profiles for Google Play Store and iOS App Store
In this project, we will be analyzing apps from the Google Play Store and iOS App Store for profitability. We are working as Data Analysts to see which apps perform the best in terms of user count. Our app will be free and make money through in-app advertisements. 

Our goal is to find out which types of apps attract the most users, so that we can build an app that earns as much as possible from in-app advertisements.

# Opening Our Files
We must open our csv files to access the data. 
1. Import reader from csv by using `from csv import reader`.
2. Use `open('AppleStore.csv')` to open the file and save it to the variable `opened_app_store`.
3. Use `reader(opened_app_store)` to read the file and save it to the variable `read_app_store`.
4. Use `list(read_app_store)` to create a list of the data and save it to the variable `app_store_data`.

Repeat steps 2-4 for `googleplaystore.csv`.

In [1]:
# Open AppleStore.csv
from csv import reader
opened_app_store = open('AppleStore.csv')
read_app_store = reader(opened_app_store)
app_store_data = list(read_app_store)

# Open googleplaystore.csv
opened_play_store = open('googleplaystore.csv')
read_play_store = reader(opened_play_store)
play_store_data = list(read_play_store)
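
The four steps above can also be wrapped in a small helper. This is only a sketch and not part of the original notebook: the name `load_csv` is hypothetical, and passing `encoding='utf8'` assumes both files are UTF-8 encoded, which seems plausible because some app names contain Chinese and Arabic characters. Using `with` closes the file as soon as the rows have been read.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Read a CSV file and return its rows as a list of lists."""
    # newline="" is what the csv module documentation recommends for csv.reader.
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))


# Demonstration on a tiny throwaway file, so the sketch runs without the real data sets.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", encoding="utf8", newline="") as f:
        f.write("App,Category\nInstagram,SOCIAL\n")
    sample_rows = load_csv(sample_path)

print(sample_rows)
```

With the real files, the call would look like `app_store_data = load_csv('AppleStore.csv')`.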

# Exploring the CSV for Relevant Data
We print the first few rows of the data sets to see what columns we will be using for our analysis. 

We use the header row as a reference to select our columns, as well as the rows with actual data to visualize what the data will look like.

We then print the number of rows and number of columns.

In [2]:
# Print the first few rows
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        rows = ('Number of Rows', len(dataset))
        columns = ('Number of Columns', len(dataset[0]))
        print(rows)
        return columns

print('App Store Data')
print('--------------------------------------------------------------')
print('\n')
print(explore_data(app_store_data, 0, 3, True))
print('\n')
print('\n')
print('Play Store Data')
print('--------------------------------------------------------------')
print('\n')
print(explore_data(play_store_data, 0, 3, True))

App Store Data
--------------------------------------------------------------


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


('Number of Rows', 7198)
('Number of Columns', 17)




Play Store Data
--------------------------------------------------------------


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+'

# Columns We Will Be Using
* [App Store Documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
* [Google Play Store Documentation](https://www.kaggle.com/lava18/google-play-store-apps)

From `AppleStore.csv`, we will be using:
* track_name - App Name
* currency - Currency Type
* price - Price of App
* rating_count_tot - Total Ratings (all versions)
* rating_count_ver - Total Ratings (current version)
* user_rating - Avg User Rating (all versions)
* prime_genre - Primary Genre

From `googleplaystore.csv`, we will be using: 
* App - App Name
* Category - App Category
* Rating - Avg User Rating
* Installs - Number of Installs
* Type - Paid or Free
* Price - Price of App
* Genres - App Genre

# Deleting Incorrect Data
Row `10473` in our Google Play Store data set is missing its `Category` value, so every value after the app name is shifted one column to the left. As a result, `19` lands in the `Rating` column, which is not possible because ratings only go up to 5. We delete the row by executing `del play_store_data[10473]`. This cell should only be run once: running it again would delete a valid row.

In [3]:
print(play_store_data[0])
print('\n')
print(play_store_data[10473])
del play_store_data[10473]
print('\n')
print(play_store_data[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


We print the header row to compare it with row `10473`. Because the `Category` value is missing, all values to its right are shifted one column to the left. After the deletion, we print index `10473` again: it now holds the next row, which has a `Category` value and the expected number of columns.
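
Shifted rows like this one can also be found without knowing their index in advance. The sketch below is an illustration rather than part of the original notebook: the helper name `find_malformed_rows` is hypothetical, and it assumes that a malformed row has a different number of values than the header (the row printed above has 12 values, while the header has 13).

```python
def find_malformed_rows(dataset):
    """Return the indexes of rows whose length differs from the header's."""
    expected_length = len(dataset[0])
    return [
        index
        for index, row in enumerate(dataset[1:], start=1)
        if len(row) != expected_length
    ]


# Toy data: the second data row is missing its Category value.
sample = [
    ["App", "Category", "Rating"],
    ["Sketch - Draw & Paint", "ART_AND_DESIGN", "4.5"],
    ["Life Made WI-Fi Touchscreen Photo Frame", "1.9"],
]
bad_indexes = find_malformed_rows(sample)
print(bad_indexes)

# Delete from the end so that earlier indexes stay valid.
for index in sorted(bad_indexes, reverse=True):
    del sample[index]
```

A length check only catches rows that lost or gained a value. A row with the right number of values but a wrong entry would still need a separate check.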

# Removing Duplicate Entries: Part One
There are multiple cases in which duplicate rows appear in the Google Play Store data set. As you can see below, `Instagram` appears 4 separate times.

In [4]:
for row in play_store_data[1:]:
    if row[0] == 'Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In fact, there are a total of 1,181 cases of duplicate rows in the Google Play Store data set.

In [5]:
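def find_duplicates(data_set, id_index=0):
    # id_index is the column that identifies an app
    duplicates = []
    unique = []
    for row in data_set[1:]:
        app_id = row[id_index]
        if app_id in unique:
            duplicates.append(app_id)
        else:
            unique.append(app_id)
            
    print('Number of Unique Apps: ', len(unique))
    print('Number of Duplicate Apps: ', len(duplicates))

print('Google Play Store Data:')
# The Play Store data has no id column, so we use the app name (index 0)
find_duplicates(play_store_data)
print('\n')
print('App Store Data:')
# Index 0 of the App Store data is only a row number, so we use `id` (index 1)
find_duplicates(app_store_data, 1)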

Google Play Store Data:
Number of Unique Apps:  9659
Number of Duplicate Apps:  1181


App Store Data:
Number of Unique Apps:  7197
Number of Duplicate Apps:  0


Using `Instagram` as an example, we keep the row with the highest number of reviews for each app and delete the duplicates with fewer reviews. Since review counts generally grow over time, the row with the most reviews is most likely the most recent one.

# Removing Duplicate Entries: Part Two
We want to create a dictionary, where each key is a unique app, and each value is the highest number of reviews recorded for that app.

### Play Store Expected Length

In [6]:
# Create a new dictionary
reviews_max = {}
# Iterate through data set
for row in play_store_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print('Expected Length of Play Store List: ', len(reviews_max))

Expected Length of Play Store List:  9659


We iterate through the data set and, for each row, check whether the app's name is already a key in the dictionary.

If the name is not in the dictionary yet, we add it with that row's number of reviews. If the name is already there **and** the current row has more reviews than the stored value, we update the stored value to the larger number.

### Play Store Length Without Duplicates

In [7]:
android_no_duplicates = []
already_added = []

for row in play_store_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_no_duplicates.append(row)
        already_added.append(name)
        
print('Length of Play Store List Without Duplicates: ', len(android_no_duplicates))
print('\n')
explore_data(android_no_duplicates, 0, 3, True)

Length of Play Store List Without Duplicates:  9659


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


('Number of Rows', 9659)


('Number of Columns', 13)

We create two lists: `android_no_duplicates` for our clean data set and `already_added` for apps that already exist in `android_no_duplicates`.

We then iterate through our original data set, `play_store_data`, and grab the `app name` and `number of reviews`.

We check `if` the `n_reviews` is equal to the number of reviews in our dictionary `reviews_max` for that app, **and** the `app name` is not in the `already_added` list. If this is **True**, then we append the `row` to `android_no_duplicates` and the `app name` to the `already_added` list.

Finally, we print the length of the `android_no_duplicates` list, `9,659`, which is the expected length for our data set.
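
The two steps described above (find the maximum review count, then keep one matching row) can also be combined into a single pass. The following is only a sketch under stated assumptions, not the notebook's method: the function name `remove_duplicates` and its parameters are hypothetical, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
def remove_duplicates(rows, name_index, reviews_index):
    """Keep, for each app name, the first row that has the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict > keeps the earliest row when review counts tie.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())


# Toy rows (no header) with only the columns needed here.
sample = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Sketch - Draw & Paint", "ART_AND_DESIGN", "4.5", "215644"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
]
deduplicated = remove_duplicates(sample, name_index=0, reviews_index=3)
for row in deduplicated:
    print(row)
```

The result contains the same rows as the two-step approach, but the order can differ: here each app appears at the position where its name was first seen.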

We can now repeat this process for our `App Store` data set, as shown below.

We repeat the same process of finding our expected length first.

### App Store Expected Length

In [8]:
# Create a new dictionary
reviews_max = {}
# Iterate through data set
# In the App Store data, index 0 is only a row number:
# `id` is at index 1 and `rating_count_tot` is at index 6
for row in app_store_data[1:]:
    app_id = row[1]
    n_reviews = float(row[6])
    
    if (app_id in reviews_max) and (reviews_max[app_id] < n_reviews):
        reviews_max[app_id] = n_reviews
    elif app_id not in reviews_max:
        reviews_max[app_id] = n_reviews

print('Expected Length of App Store List: ', len(reviews_max))

Expected Length of App Store List:  7197


Then we find the length after removing the duplicates.

### App Store Length Without Duplicates

In [9]:
ios_no_duplicates = []
already_added = []

for row in app_store_data[1:]:
    # Same columns as above: `id` (index 1) and `rating_count_tot` (index 6)
    app_id = row[1]
    n_reviews = float(row[6])
    
    if (n_reviews == reviews_max[app_id]) and (app_id not in already_added):
        ios_no_duplicates.append(row)
        already_added.append(app_id)
        
print('Length of App Store List Without Duplicates: ', len(ios_no_duplicates))
print('\n')
explore_data(ios_no_duplicates, 0, 3, True)

Length of App Store List Without Duplicates:  7197


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


('Number of Rows', 7197)


('Number of Columns', 17)

Now we have our cleaned `App Store` list along with our cleaned `Play Store` list.

# Removing Non-English Apps: Part One
Our company only develops apps in English, so we only want to analyze apps aimed at an English-speaking audience. Both data sets, however, contain apps whose names suggest otherwise, as the examples below show.

In [10]:
print(app_store_data[815][2])
print(app_store_data[820][2])
print('\n')
print(android_no_duplicates[4412][0])
print(android_no_duplicates[7940][0])

搜狐新闻—新闻热点资讯掌上阅读软件
聚力视频-蓝光电视剧电影在线热播


中国語 AQリスニング
لعبة تقدر تربح DZ


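Each character has an associated number, its Unicode code point, which Python's built-in `ord()` function returns. The characters commonly used in English text (letters, digits, punctuation, and so on) are all in the ASCII range, from 0 to 127. A character with a value greater than 127 is outside that range and is not a common English character.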

In [11]:
print(ord('a'))
print(ord('V'))
print(ord('搜'))
print(ord('7'))
print(ord('='))

97
86
25628
55
61


We can build a function that detects whether a character belongs to the set of common English characters, which will assist us in finding and removing all the non-English apps.

In [12]:
def is_english(string):
    for char in string:
        if ord(char) > 127:
            return False
        
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


# Removing Non-English Apps: Part Two
We can see that the function works well for `Instagram` and `爱奇艺PPS -《欢乐颂2》电视剧热播`, returning `True` and `False` respectively. However, it also returns `False` for English app names that contain characters like `™` or emojis, so those apps would be removed by mistake.

This is because the values of these characters fall outside the ASCII range of 0-127, as shown below.

In [13]:
print(ord('™'))
print(ord('😜'))

8482
128540


To minimize the data loss that would be caused by this function, we can change our function to return `False` if the `app name` has more than **three** characters that fall outside the ASCII range.

The function is not perfect, but it will reduce the amount of data loss from the previous version of the function.

In [14]:
def is_english(string):
    count = 0
    for char in string:
        if ord(char) > 127:
            count += 1
        if count > 3:
            return False
        
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


We use `count` to track the number of characters that fall outside the ASCII range.

If a character's value is greater than 127, we add 1 to `count`. As soon as `count > 3`, we return `False`.
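
The counting rule can also be written with a configurable limit. This is a sketch for illustration only: the name `is_probably_english` and the parameter `max_non_ascii` are hypothetical, and the default of 3 simply mirrors the threshold chosen above.

```python
def is_probably_english(text, max_non_ascii=3):
    """Return True when at most max_non_ascii characters fall outside ASCII."""
    non_ascii_count = sum(1 for char in text if ord(char) > 127)
    return non_ascii_count <= max_non_ascii


print(is_probably_english("Docs To Go™ Free Office Suite"))
print(is_probably_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))
# A limit of 0 reproduces the stricter first version of the check.
print(is_probably_english("Instachat 😜", max_non_ascii=0))
```

Making the limit a parameter makes it easy to see how many apps are kept or dropped for different thresholds before settling on one.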

Now that we have improved our function, we can use it on our data sets. 

### Removing Non-English Apps From App Store
We will now use the function defined above on the `App Store` data set. 

We append the English apps to the `ios_english` list.

In [15]:
ios_english = []
for row in ios_no_duplicates:
    if is_english(row[2]):
        ios_english.append(row)

print(explore_data(ios_english, 0, 3, True))

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


('Number of Rows', 6183)
('Number of Columns', 17)


### Removing Non-English Apps From Play Store
We will now use the function defined above on the `Play Store` data set.

We append the English apps to the `android_english` list.

In [16]:
android_english = []
for row in android_no_duplicates:
    if is_english(row[0]):
        android_english.append(row)

print(explore_data(android_english, 0, 3, True))

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


('Number of Rows', 9614)
('Number of Columns', 13)


# Isolating Free Apps
Now, we have two lists of apps with no inaccurate data, no duplicates, and no apps that our check identified as non-English.

However, our lists contain both paid and free apps. We only want to analyze the free apps. Therefore, we must isolate these free apps in our list.

### Isolating Free Apps From App Store

In [17]:
# Price is at index 5
ios_free = []
for row in ios_english:
    if row[5] == '0':
        ios_free.append(row)
        
print('Number of Free Apps in the App Store: ', len(ios_free))

Number of Free Apps in the App Store:  3222


### Isolating Free Apps From Play Store

In [18]:
# Price is at index 7
android_free = []
for row in android_english:
    if row[7] == '0':
        android_free.append(row)
        
print('Number of Free Apps in the Play Store: ', len(android_free))

Number of Free Apps in the Play Store:  8864


As we can see above, the two lists containing all English apps were narrowed down to just `Free` apps.

In the `App Store List`, we have **3,222** free apps.

In the `Play Store List`, we have **8,864** free apps.
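
Comparing the price with the exact string `'0'` works for the rows shown here, but it depends on how each file formats its prices. As a hedged alternative, the sketch below converts the price to a number first. The helper name `is_free` is hypothetical, and the assumption that paid Play Store prices carry a leading `$` (for example `'$4.99'`) is not confirmed by the output shown in this notebook.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip("$")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Prices that cannot be parsed are treated as not free rather than guessed.
        return False


for price in ["0", "0.0", "$0.00", "3.99", "$4.99", "Everyone"]:
    print(price, is_free(price))
```

The filter in the cells above would then read `if is_free(row[5]):` for the App Store and `if is_free(row[7]):` for the Play Store.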

# Most Common Apps by Genre: Part One
The goal of this project is to determine what kind of apps are likely to attract more users since our revenue is influenced by the number of people using our app.

Our main strategy is:

1. Develop a minimal Android app for the `Google Play Store`. 
2. Then, if the app is doing well, we will develop it further. 
3. Finally, if the app is profitable after six months, we will build an iOS app for the `iOS App Store`. 

The way we do this is by first analyzing what app profiles are the most successful on both app stores.

We will be creating a frequency table using the `prime_genre` column for the `App Store Data Set`, and the `Genres` and `Category` columns for the `Google Play Store Data Set`.

# Most Common Apps by Genre: Part Two
We need two functions:

1. One function to generate a frequency table that shows percentages.
2. A second function that displays the percentages in descending order.

In [19]:
# Function to generate a frequency table that shows percentages
def freq_table(dataset, index):
    frequency_table = {}
    for row in dataset:
        value = row[index]
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1
    
    frequency_percentages = {}
    for key in frequency_table:
        # Divide by the total number of rows, not by the number of unique values
        percentage = (frequency_table[key] / len(dataset)) * 100
        frequency_percentages[key] = percentage
    
    return frequency_percentages

# Function that displays the percentages in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
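
As a cross-check for the two functions above, the same kind of table can be sketched with `collections.Counter` from the standard library. This is an illustrative alternative, not the notebook's implementation: the name `percentage_table` is hypothetical, and the rows are toy data. Percentages are computed relative to the total number of rows, so they add up to 100.

```python
from collections import Counter


def percentage_table(dataset, index):
    """Return (value, percentage of rows) pairs, most frequent value first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields (value, count) pairs from most to least frequent.
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Toy rows without a header: (app label, genre).
sample = [
    ["app_a", "Games"],
    ["app_b", "Productivity"],
    ["app_c", "Games"],
    ["app_d", "Weather"],
]
table = percentage_table(sample, 1)
for genre, percentage in table:
    print(genre, ':', percentage)
```

The function expects a non-empty list of rows without a header row. An empty list would cause a division by zero.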