# The Profile of a Profitable App
This project analyzes the Apple App Store and the Google Play Store to help developers determine which kinds of apps are likely to become breakouts.

Our developers are planning to build free apps, so most of our revenue will come from in-app ads and purchases.

Note that this guided project assumes the analyst is a Python beginner.
An analysis using advanced packages and techniques will be posted soon.

Google Play Store dataset: https://dq-content.s3.amazonaws.com/350/googleplaystore.csv

Apple App Store dataset: https://dq-content.s3.amazonaws.com/350/AppleStore.csv

Solution link: https://github.com/dataquestio/solutions/blob/master/Mission350Solutions.ipynb

In [1]:
from csv import reader
# Read each CSV file into a list of lists (the first row of each is the header)
applestore = list(reader(open('AppleStore.csv', encoding='utf8')))
googlestore = list(reader(open('googleplaystore.csv', encoding='utf8')))

In [2]:
# A function to help explore list-of-lists data sets
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
explore_data(applestore, 0, 5)

Link to documentation: 

In [5]:
import pandas as pd
import numpy as np

In [6]:
applestore = pd.read_csv('AppleStore.csv')
googlestore = pd.read_csv('googleplaystore.csv')

In [7]:
applestore.head()

| | id | track_name | size_bytes | currency | price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | ver | cont_rating | prime_genre | sup_devices.num | ipadSc_urls.num | lang.num | vpp_lic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 284882215 | Facebook | 389879808 | USD | 0.0 | 2974676 | 212 | 3.5 | 3.5 | 95.0 | 4+ | Social Networking | 37 | 1 | 29 | 1 |
| 1 | 389801252 | Instagram | 113954816 | USD | 0.0 | 2161558 | 1289 | 4.5 | 4.0 | 10.23 | 12+ | Photo & Video | 37 | 0 | 29 | 1 |
| 2 | 529479190 | Clash of Clans | 116476928 | USD | 0.0 | 2130805 | 579 | 4.5 | 4.5 | 9.24.12 | 9+ | Games | 38 | 5 | 18 | 1 |
| 3 | 420009108 | Temple Run | 65921024 | USD | 0.0 | 1724546 | 3842 | 4.5 | 4.0 | 1.6.2 | 9+ | Games | 40 | 5 | 1 | 1 |
| 4 | 284035177 | Pandora - Music & Radio | 130242560 | USD | 0.0 | 1126879 | 3594 | 4.0 | 4.5 | 8.4.1 | 12+ | Music | 37 | 4 | 1 | 1 |


We'll need to exclude the first (header) row of the App Store data set.

In [8]:
googlestore.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


To help readers gain context into your project, use the first Markdown cell of the notebook to:

1. Add a title.
2. Write a short introduction where you explain (in no more than two paragraphs):
    - What the project is about.
    - What your goal is in this project.

The title and the introduction are tentative at this point, so don't spend too much time here; you can come back later to refine them.

1. Open the two data sets we mentioned above, and save both as lists of lists.

    - The App Store data set is stored in a CSV file named AppleStore.csv, and the Google Play data set is stored in a CSV file named googleplaystore.csv.
    - Both CSV files can be opened directly in the Jupyter Notebook interface you see on the right of the screen.
    - If you run into an error named UnicodeDecodeError, add encoding="utf8" to the open() function (for instance, use open('AppleStore.csv', encoding='utf8')).

2. Explore both data sets using the explore_data() function.

    - Print the first few rows of each data set.
    - Find the number of rows and columns of each data set (recall that the function assumes the argument for the dataset parameter doesn't have a header row).
3. Print the column names and try to identify the columns that could help us with our analysis. Use the documentation of the data sets if you're having trouble understanding what a column describes. Add a link to the documentation for readers if you think the column names are not descriptive enough.
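
The steps above might look like the following minimal sketch. To stay self-contained it reads a tiny in-memory sample instead of the real files; the helper `open_dataset()` and the sample rows are hypothetical. With the real data you would pass something like `open('AppleStore.csv', encoding='utf8')` in place of the `StringIO` object.

```python
from csv import reader
from io import StringIO

def open_dataset(file_obj):
    """Read an open CSV file and return (header_row, data_rows)."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]

def explore_data(dataset, start, end, rows_and_columns=False):
    """Print a slice of a header-less list of lists, plus its shape if asked."""
    for row in dataset[start:end]:
        print(row)
        print('\n')  # adds an empty line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Hypothetical stand-in for a real CSV file
sample_csv = StringIO(
    'id,track_name,price,prime_genre\n'
    '1,App One,0.0,Games\n'
    '2,App Two,1.99,Music\n'
)

ios_header, ios = open_dataset(sample_csv)
print(ios_header)  # column names, kept apart from the data rows
explore_data(ios, 0, 2, rows_and_columns=True)
```

Separating the header once, right after loading, means `explore_data()` and every later loop can assume the data set has no header row.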



The Google Play data set has a dedicated discussion section, and we can see that one of the discussions describes an error for a certain row.

1. Read the discussion and find out what the index of the row is.

2. Print the row at that index to check whether it's indeed incorrect. Take into account that the user reporting the error might or might not have removed the header row, so the index number might vary.

3. If the row has an error, remove the row using the del statement. For instance, to remove the row with the index 149 from a data set named data that is stored as a list of lists, you can use the code del data[149].
4. Make sure you don't run the del statement more than once, otherwise you'll delete more than one row.

5. Read the discussion section for the App Store data set, and see whether you can find any reports of wrong data.
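
One way to confirm a reported row without relying on the exact index is to compare each row's length with the header's length. The sketch below uses a hypothetical four-column sample, not the real Google Play data, and assumes the error is a missing value that shifts the remaining columns.

```python
# Hypothetical miniature data set: the last row is missing its 'Category' value
android_header = ['App', 'Category', 'Rating', 'Reviews']
android = [
    ['App A', 'TOOLS', '4.1', '159'],
    ['App B', 'GAME', '4.5', '967'],
    ['App C', '1.9', '19'],
]

# Collect the indexes of rows whose length differs from the header's length
bad_indexes = []
for i, row in enumerate(android):
    if len(row) != len(android_header):
        bad_indexes.append(i)
print(bad_indexes)

# Delete from the highest index down so earlier deletions don't shift later ones
for i in sorted(bad_indexes, reverse=True):
    del android[i]
print(len(android))
```

Because the indexes are recomputed on every run, re-running this cell does not delete additional rows, unlike a bare `del data[149]`.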



1. Using a combination of narrative and code, explain to the reader that the Google Play data set has duplicate entries. Print a few duplicate rows to confirm.

2. Count the number of duplicates using the technique we learned above.

3. Explain that you won't remove the duplicates randomly. Describe the criterion you're going to use to remove the duplicates.

    - We already suggested a criterion above, but you can come up with another criterion if you want. Make sure you support your criterion with at least one argument.
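
A simple way to count duplicates is to keep two lists, one for names seen once and one for repeats. This is a sketch on hypothetical rows; it assumes the app name is in the first column.

```python
# Hypothetical rows: [App, Category, Reviews]
android = [
    ['App A', 'SOCIAL', '100'],
    ['App A', 'SOCIAL', '250'],
    ['App B', 'TOOLS', '40'],
    ['App A', 'SOCIAL', '180'],
]

unique_apps = []
duplicate_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)  # seen before, so this row is a duplicate
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('Examples of duplicate apps:', duplicate_apps[:5])
```

The expected size of the cleaned data set is then the total number of rows minus `len(duplicate_apps)`.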

1. Create a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

    - Start by creating an empty dictionary named reviews_max.
    - Loop through the Google Play data set (make sure you don't include the header row). For each iteration:
        * Assign the app name to a variable named name.
        * Convert the number of reviews to float. Assign it to a variable named n_reviews.
        * If name already exists as a key in the reviews_max dictionary and reviews_max[name] < n_reviews, update the number of reviews for that entry in the reviews_max dictionary.
        * If name is not in the reviews_max dictionary as a key, create a new entry in the dictionary where the key is the app name, and the value is the number of reviews. Make sure you don't use an else clause here, otherwise the number of reviews will be incorrectly updated whenever reviews_max[name] < n_reviews evaluates to False.
    - Inspect the dictionary to make sure everything went as expected. Measure the length of the dictionary — remember that the expected length is 9,659 entries.
2. Use the dictionary you created above to remove the duplicate rows:

    - Start by creating two empty lists: android_clean (which will store our new cleaned data set) and already_added (which will just store app names).
    - Loop through the Google Play data set (make sure you don't include the header row), and for each iteration:
        * Assign the app name to a variable named name.
        * Convert the number of reviews to float, and assign it to a variable named n_reviews.
        * If n_reviews is the same as the maximum number of reviews for that app name (the number can be found in the reviews_max dictionary) and name is not already in the already_added list (read the solution notebook to find out why we need this supplementary condition):
            - Append the entire row to the android_clean list (which will eventually be a list of lists and store our cleaned data set).
            - Append the app name to the already_added list. This helps us keep track of the apps we have already added.
3. Explore the android_clean data set to ensure everything went as expected. The data set should have 9,659 rows. The two steps above are a bit more involved, so make sure you use Markdown to explain to readers the steps you took.
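
The two steps can be sketched as follows on a hypothetical three-column sample. The column positions `NAME` and `REVIEWS` are assumptions for this sample; check them against the real header row.

```python
# Hypothetical rows: [App, Category, Reviews]
android = [
    ['App A', 'SOCIAL', '100'],
    ['App A', 'SOCIAL', '250'],
    ['App B', 'TOOLS', '40'],
    ['App A', 'SOCIAL', '250'],  # the maximum appears in two rows
]
NAME, REVIEWS = 0, 2  # assumed column positions for this sample

# Step 1: highest number of reviews seen for each app name
reviews_max = {}
for app in android:
    name = app[NAME]
    n_reviews = float(app[REVIEWS])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:  # deliberately not an else clause
        reviews_max[name] = n_reviews

# Step 2: keep one row per app, the one with the maximum number of reviews
android_clean = []
already_added = []
for app in android:
    name = app[NAME]
    n_reviews = float(app[REVIEWS])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(len(reviews_max), len(android_clean))
```

In this sample the last row shows one reason the `already_added` check is useful: when the maximum appears in more than one duplicate row, only the first such row is kept.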

1. Write a function that takes in a string and returns False if there's any character in the string that doesn't belong to the set of common English characters, otherwise it returns True.

    - Inside the function, iterate over the input string. For each iteration, check whether the number associated with the character is greater than 127. When a character's number is greater than 127, the function should immediately return False: the app name is probably non-English since it contains a character that doesn't belong to the set of common English characters.
    - If the loop finishes running without the return statement being executed, then no character had a corresponding number over 127. The app name is probably English, so the function should return True.
2. Use your function to check whether these app names are detected as English or non-English:

    - 'Instagram'
    - '爱奇艺PPS -《欢乐颂2》电视剧热播'
    - 'Docs To Go™ Free Office Suite'
    - 'Instachat 😜'
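
A minimal sketch of such a function, using the built-in `ord()` to get the number associated with each character. The name `is_english` is a choice made here, not one prescribed by the instructions.

```python
def is_english(string):
    """Return False as soon as any character falls outside the ASCII range."""
    for character in string:
        if ord(character) > 127:
            return False
    return True

print(is_english('Instagram'))                      # True
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
print(is_english('Docs To Go™ Free Office Suite'))  # False, because of '™'
print(is_english('Instachat 😜'))                   # False, because of the emoji
```

The last two results show the weakness of this strict version: English names that contain a single symbol or emoji are rejected.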

1. Change the function you created in the previous screen. If the input string has more than three characters that fall outside the ASCII range (0 - 127), then the function should return False (identify the string as non-English), otherwise it should return True.

2. Use the new function to check whether these app names are detected as English or non-English:

    - 'Docs To Go™ Free Office Suite'
    - 'Instachat 😜'
    - '爱奇艺PPS -《欢乐颂2》电视剧热播'
3. Use the new function to filter out non-English apps from both data sets. Loop through each data set. If an app name is identified as English, append the whole row to a separate list.

4. Explore the data sets and see how many rows you have remaining for each data set.
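
The relaxed rule and the filtering loop might look like this sketch. The single-column sample rows are hypothetical, and the app name is assumed to be in column 0.

```python
def is_english(string):
    """Return False only if more than three characters are outside ASCII."""
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    return non_ascii <= 3

# Hypothetical one-column rows; in the real data sets the name column differs
rows = [
    ['Docs To Go™ Free Office Suite'],
    ['Instachat 😜'],
    ['爱奇艺PPS -《欢乐颂2》电视剧热播'],
]

english_rows = []
for row in rows:
    name = row[0]
    if is_english(name):
        english_rows.append(row)

print(len(english_rows))  # 2
```

This filter is still a heuristic: a few non-English names made mostly of ASCII characters can slip through, and an English name with four or more symbols would be dropped.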

1. Loop through each data set to isolate the free apps in separate lists. Make sure you identify the columns describing the app price correctly.

2. After you isolate the free apps, check the length of each data set to see how many apps you have remaining.
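
Isolating the free apps could be sketched as below. The rows and price column positions are hypothetical. The sketch assumes prices are stored as strings, with Google Play using `'0'` for free apps and the App Store using `'0.0'`, as the previews earlier in this notebook suggest. The `'$4.99'` value for a paid app is an assumed format.

```python
# Hypothetical rows; the price column positions are assumptions for this sample
android = [['App A', 'Free', '0'], ['App B', 'Paid', '$4.99'], ['App C', 'Free', '0']]
ios = [['App X', '0.0'], ['App Y', '2.99']]
ANDROID_PRICE, IOS_PRICE = 2, 1

android_free = []
for app in android:
    # Compare as a string: paid prices may carry a '$' that float() would reject
    if app[ANDROID_PRICE] == '0':
        android_free.append(app)

ios_free = []
for app in ios:
    if float(app[IOS_PRICE]) == 0.0:
        ios_free.append(app)

print(len(android_free))
print(len(ios_free))
```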

1. Give readers more context into why we want to find an app profile that fits both the App Store and Google Play. Explain our validation strategy for an app idea.

2. Inspect both data sets and identify the columns you could use to generate frequency tables to find out what the most common genres are in each market.

1. Create a function named freq_table() that takes in two inputs: dataset (which is expected to be a list of lists) and index (which is expected to be an integer).

    - The function should return the frequency table (as a dictionary) for any column we want. The frequencies should also be expressed as percentages.
    - We already learned how to build frequency tables in the mission on dictionaries.
2. Copy the display_table() function we wrote above. Use it to display the frequency table of the columns prime_genre, Genres, and Category. We'll analyze the resulting tables on the next screen.
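
A possible shape for both functions is sketched below. The original `display_table()` is not reproduced in this notebook, so the version here is a stand-in written for illustration, and the sample rows are hypothetical.

```python
def freq_table(dataset, index):
    """Return {value: percentage} for the column at position index."""
    counts = {}
    for row in dataset:
        value = row[index]
        if value in counts:
            counts[value] += 1
        else:
            counts[value] = 1

    total = len(dataset)
    percentages = {}
    for value in counts:
        percentages[value] = counts[value] / total * 100
    return percentages

def display_table(dataset, index):
    """Print the frequency table sorted from most to least common."""
    table = freq_table(dataset, index)
    pairs = [(table[key], key) for key in table]
    for percentage, key in sorted(pairs, reverse=True):
        print(key, ':', percentage)

# Hypothetical rows: [App, Genre]
ios = [['A', 'Games'], ['B', 'Games'], ['C', 'Music'], ['D', 'Games']]
genre_table = freq_table(ios, 1)
display_table(ios, 1)
```

Sorting `(percentage, key)` tuples is a workaround for the fact that a dictionary cannot be sorted by value directly with the basic tools used in this project.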

1. Analyze the frequency table you generated for the prime_genre column of the App Store data set.

    - What is the most common genre? What is the runner-up?
    - What other patterns do you see?
    - What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?
    - Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?
2. Analyze the frequency tables you generated for the Category and Genres columns of the Google Play data set.

    - What are the most common genres?
    - What other patterns do you see?
    - Compare the patterns you see for the Google Play market with those you saw for the App Store market.
    - Can you recommend an app profile based on what you found so far? Do the frequency tables you generated reveal the most frequent app genres or what genres have the most users?

1. Start by generating a frequency table for the prime_genre column to get the unique app genres (below, we'll need to loop over the unique genres). You can use the freq_table() function you wrote in a previous screen.

2. Loop over the unique genres of the App Store data set. For each iteration (below, we'll assume that the iteration variable is named genre):

    - Initialize a variable named total with a value of 0. This variable will store the sum of user ratings (the number of ratings, not the actual ratings) specific to each genre.
    - Initialize a variable named len_genre with a value of 0. This variable will store the number of apps specific to each genre.
    - Loop over the App Store data set, and for each iteration:
        - Save the app genre to a variable named genre_app.
        - If genre_app is the same as genre (the iteration variable of the main loop), then:
            * Save the number of user ratings of the app as a float.
            * Add up the number of user ratings to the total variable.
            * Increment the len_genre variable by 1.
    - Compute the average number of user ratings by dividing total by len_genre. This should be done outside the nested loop.
    - Print the app genre and the average number of user ratings. This should also be done outside the nested loop.
3. Analyze the results and try to come up with at least one app profile recommendation for the App Store. Note that there's no fixed answer here, and it's perfectly fine if the app profile you recommend is different from the one recommended in the solution notebook.
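
The nested loop described above can be sketched as follows. The rows and column positions are hypothetical. The unique genres are collected with a plain loop here, though the keys of the `freq_table()` result would serve equally well.

```python
# Hypothetical rows: [App, Genre, Number of user ratings]
ios = [
    ['App X', 'Games', '1000'],
    ['App Y', 'Games', '3000'],
    ['App Z', 'Music', '500'],
]
GENRE, RATING_COUNT = 1, 2  # assumed column positions for this sample

# Unique genres, in order of first appearance
genres = []
for app in ios:
    if app[GENRE] not in genres:
        genres.append(app[GENRE])

avg_ratings = {}
for genre in genres:
    total = 0      # sum of the number of user ratings in this genre
    len_genre = 0  # number of apps in this genre
    for app in ios:
        genre_app = app[GENRE]
        if genre_app == genre:
            total += float(app[RATING_COUNT])
            len_genre += 1
    # Computed outside the nested loop, once per genre
    avg_ratings[genre] = total / len_genre
    print(genre, ':', avg_ratings[genre])
```

An average can be skewed by a handful of very popular apps, so it is worth printing the individual apps of a promising genre before drawing conclusions.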

1. Start by generating a frequency table for the Category column of the Google Play data set to get the unique app genres (below, we'll need to loop over the unique genres). You can use the freq_table() function you wrote in a previous screen.

2. Loop over the unique genres of the Google Play data set. For each iteration (below, we'll assume that the iteration variable is named category):

    - Initialize a variable named total with a value of 0. This variable will store the sum of installs specific to each genre.
    - Initialize a variable named len_category with a value of 0. This variable will store the number of apps specific to each genre.
    - Loop over the Google Play data set, and for each iteration:
        * Save the app genre to a variable named category_app.
        * If category_app is the same as category (the iteration variable of the main loop), then:
            - Save the number of installs.
            - Remove any + or , character, and then convert the string to a float.
            - Add the number of installs to the total variable.
            - Increment the len_category variable by 1.
    - Compute the average number of installs by dividing total by len_category. This should be done outside the nested loop.
    - Print the app genre and the average number of installs. This should also be done outside the nested loop.
3. Analyze the results and try to come up with at least one app profile recommendation for Google Play. Remember, our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play. Note that there's no fixed answer here, and it's perfectly fine if the app profile you recommend is different from the one recommended in the solution notebook.
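
A sketch of the same computation for Google Play, where the install counts are strings such as `'10,000+'` and need cleaning first. The rows and column positions are hypothetical.

```python
# Hypothetical rows: [App, Category, Installs]
android = [
    ['App A', 'GAME', '1,000,000+'],
    ['App B', 'GAME', '500,000+'],
    ['App C', 'TOOLS', '10,000+'],
]
CATEGORY, INSTALLS = 1, 2  # assumed column positions for this sample

# Unique categories, in order of first appearance
categories = []
for app in android:
    if app[CATEGORY] not in categories:
        categories.append(app[CATEGORY])

avg_installs = {}
for category in categories:
    total = 0         # sum of installs in this category
    len_category = 0  # number of apps in this category
    for app in android:
        category_app = app[CATEGORY]
        if category_app == category:
            n_installs = app[INSTALLS]
            n_installs = n_installs.replace(',', '')  # '1,000,000+' -> '1000000+'
            n_installs = n_installs.replace('+', '')  # '1000000+'   -> '1000000'
            total += float(n_installs)
            len_category += 1
    avg_installs[category] = total / len_category
    print(category, ':', avg_installs[category])
```

The install figures are open-ended buckets rather than exact counts, so treating `'500,000+'` as 500,000 gives only a rough lower bound. That is usually precise enough for comparing categories.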