# Profitable App Profiles for the App Store and Google Play Markets

![Image](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTZMG2LTP5xD1lZ9SxeiWuql2IJdUG9ct4Ukg&usqp=CAU)

Table of Contents
=================

   * [1. Project Overview](#chapter1)
   * [2. Getting Started](#chapter2)
       * [2.1 Importing relevant libraries and checking their versions](#chapter2.1)
       * [2.2 Reading in datasets as lists of lists](#chapter2.2)
       * [2.3 Sneak peek at the datasets](#chapter2.3)
       
       
   * [3. Data Cleaning](#chapter3)
       * [3.1 Removing inaccurate data](#3.1-Removing-inaccurate-data)
       
       
   * [4. Conclusion](#-4.-Conclusion)

## 1. Project Overview <a class="anchor" id="chapter1"></a>

This is a data cleaning and analysis project. *We deliberately avoid the NumPy and pandas libraries here and work with plain Python lists of lists instead.*

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.



## 2. Getting Started <a class="anchor" id="chapter2"></a>

### 2.1 Importing relevant libraries and checking their versions <a class="anchor" id="chapter2.1"></a>

In [1]:
# Importing libraries
import requests
from csv import reader

In [2]:
# Printing versions of Python modules and packages with watermark, the IPython magic extension.
%load_ext watermark

%watermark -v -p requests,csv

Python implementation: CPython
Python version       : 3.7.11
IPython version      : 7.29.0

requests: 2.26.0
csv     : 1.0



### 2.2 Reading in datasets as lists of lists <a class="anchor" id="chapter2.2"></a>

As of the first quarter of 2021, there were approximately 2 million iOS apps available on the App Store and 3.5 million Android apps on Google Play[<sup>1</sup>](#fn1).

Collecting data for more than 5 million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we first look for relevant existing data that is available at no cost. Luckily, two datasets seem suitable for our goals:

* [A dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the dataset directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* [A dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the dataset directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

Now we'll define a function, `open_data()`, that imports the two datasets mentioned above and stores each one as a *list of lists*.

In [3]:
# Saving links as variables
url_android = 'https://dq-content.s3.amazonaws.com/350/googleplaystore.csv'
url_ios = 'https://dq-content.s3.amazonaws.com/350/AppleStore.csv'

# Defining the `open_data` function
def open_data(url):
    """
    This function takes in a link to a CSV file as input and
    returns the dataset in the list of lists format as output
    """
    with requests.Session() as s:
        response = s.get(url)
        response.raise_for_status()  # stop early if the download failed
        decoded_content = response.content.decode('utf-8')
        read_file = reader(decoded_content.splitlines(), delimiter=',')
        return list(read_file)
    

# Loading both datasets
android = open_data(url_android)   
android_header = android[0]
android = android[1:]

ios = open_data(url_ios) 
ios_header = ios[0]
ios = ios[1:]
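
To make the *list of lists* format concrete without a network call, here is a minimal sketch that parses a small in-memory CSV string the same way `open_data()` parses the downloaded text. The sample rows and the names `sample_csv`, `sample_header`, and `sample_data` are invented for illustration and do not come from either dataset.

```python
from csv import reader

# Hypothetical sample, not taken from either dataset.
# The first app name contains a comma, so it is wrapped in quotes.
sample_csv = (
    'App,Category,Price\n'
    '"Notes, Lists & Tasks",PRODUCTIVITY,0\n'
    'Sketch Pad,ART_AND_DESIGN,0\n'
)

# Same parsing step as in `open_data`: split into lines, then let csv.reader split the fields
rows = list(reader(sample_csv.splitlines(), delimiter=','))

sample_header = rows[0]   # first row holds the column names
sample_data = rows[1:]    # remaining rows hold the data

print(sample_header)
print(sample_data)
```

Each row becomes a list of strings, and the quoted comma in the first app name stays inside a single value rather than creating an extra column.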

### 2.3 Sneak peek at the datasets <a class="anchor" id="chapter2.3"></a>

Let's print the column names and take a look at the data dictionaries to try to identify the columns that could help us with our analysis.

In [4]:
# Exploring the column names for the Google Play dataset
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Data Dictionary for the Google Play dataset

| Column Name | Description |
| ------------ | ------------- |
| App | Application name |
| Category | Category the app belongs to |
| Rating | Overall user rating of the app (at the time of scraping) |
| Reviews | Number of user reviews for the app (at the time of scraping) |
| Size | Size of the app (at the time of scraping) |
| Installs | Number of user downloads/installs for the app (at the time of scraping) |
| Type | Paid or Free |
| Price | Price of the app (at the time of scraping) |
| Content Rating | Age group the app is targeted at: Children / Mature 21+ / Adult |
| Genres | An app can belong to multiple genres (apart from its main category). For example, a musical family game belongs to the Music, Game, and Family genres. |
| Last Updated | Date when the app was last updated on Play Store (at the time of scraping) |
| Current Ver | Current version of the app available on Play Store (at the time of scraping) |
| Android Ver | Minimum required Android version (at the time of scraping) |


In [5]:
# Exploring the column names for the App Store dataset
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Data Dictionary for the App Store dataset

| Column Name | Description |
| ------------ | ------------- |
| id | App ID |
| track_name | App name |
| size_bytes | Size (in bytes) |
| currency | Currency type |
| price | Price amount |
| rating_count_tot | User rating count (for all versions) |
| rating_count_ver | User rating count (for the current version) |
| user_rating | Average user rating value (for all versions) |
| user_rating_ver | Average user rating value (for the current version) |
| ver | Latest version code |
| cont_rating | Content rating |
| prime_genre | Primary genre |
| sup_devices.num | Number of supported devices |
| ipadSc_urls.num | Number of screenshots shown for display |
| lang.num | Number of supported languages |
| vpp_lic | VPP device-based licensing enabled |

At a quick glance, the columns that might be useful for our analysis are `App` and `track_name`, `Category` and `prime_genre`, `Installs` and `rating_count_tot`, and `Price` and `price`.
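
Because each row is a plain list, these columns are reached by position rather than by name. The sketch below shows one possible way to look up those positions from the header rows printed above. The helper `column_indexes()` and the names `android_cols` and `ios_cols` are hypothetical and are not part of the original notebook.

```python
# Header rows copied from the printed output above
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


def column_indexes(header, names):
    """Hypothetical helper: map each column name to its position in the header row."""
    return {name: header.index(name) for name in names}


android_cols = column_indexes(android_header, ['App', 'Category', 'Installs', 'Price'])
ios_cols = column_indexes(ios_header, ['track_name', 'prime_genre', 'rating_count_tot', 'price'])

print(android_cols)
print(ios_cols)
```

With such a mapping, an expression like `row[android_cols['Installs']]` reads a value by column name instead of a hard-coded number.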

To make it easier to explore the two datasets, we'll write a function named `explore_data()` that we can use repeatedly to explore rows in a more readable way. We'll also add an option for our function to show the number of rows and columns for any dataset.

In [6]:
# Defining the `explore_data` function
def explore_data(dataset, start, end, rows_and_columns=False):
    """
    This function loops through the slice of a dataset, 
    and for each iteration, prints a row and adds a new line after that row, 
    and prints the number of rows and columns if `rows_and_columns` is `True`
    """
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [7]:
# Exploring the first three rows of each dataset
print('Google Play Data')
print('\n')
explore_data(android, 0, 3, True)
print('\n')
print('Apple Store Data')
print('\n')
explore_data(ios, 0, 3, True)

Google Play Data


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Apple Store Data


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805

## 3. Data Cleaning <a class="anchor" id="chapter3"></a>

Before beginning our analysis, we need to make sure the data we analyze is accurate, or the results of our analysis will be wrong. This means that we need to do the following:

* Detect inaccurate data, and correct or remove it.
* Detect duplicate data, and remove the duplicates.

Recall that at our company, we only build apps that are free to download and install, and we design them for an English-speaking audience. This means that we'll need to do the following:

* Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
* Remove apps that aren't free.

Let's begin by detecting and removing inaccurate data.

### 3.1 Removing inaccurate data

Let's check our datasets one by one to see whether every row has the same length as the header row. If a value is missing from a row, all of the values after it shift left into the wrong columns, which would be a problem. Of course, a missing value in the last column would be less problematic, but let's see what we have.
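
The following sketch illustrates the shift on a toy example. The header and rows are invented for illustration; pairing each value with its column name via `zip()` is only a convenient way to display the mismatch.

```python
# Invented three-column example, not taken from either dataset
header = ['App', 'Category', 'Rating']
complete_row = ['Sketch Pad', 'ART_AND_DESIGN', '4.1']
short_row = ['Sketch Pad', '4.1']  # the 'Category' value is missing

# Pair each value with the column name at the same position
aligned = dict(zip(header, complete_row))
shifted = dict(zip(header, short_row))

print(aligned)
print(shifted)
```

In `shifted`, the rating `'4.1'` ends up under `Category` and the `Rating` column has no value at all, which is exactly the kind of row a length check can catch.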

We'll write a function named `check_lengths()` that we can use repeatedly to print every row whose length differs from the length of the dataset's header row, together with that row's index.

In [12]:
# Defining the `check_lengths` function
def check_lengths(dataset, dataset_header):
    """
    This function prints out the rows whose length is not equal
    to the length of a dataset's header row, along with each row's index
    """
    header_length = len(dataset_header)
    is_found = False
    # enumerate() gives each row's true position, even when rows are duplicated
    for index, row in enumerate(dataset):
        if len(row) != header_length:
            print(row)
            print(index)
            is_found = True
    if not is_found:
        print("All Good!")

In [13]:
# Checking the row lengths in the Google Play dataset
print(android_header)
print('\n')
check_lengths(android, android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


The output above shows that row *10472* is missing its `Category` value; therefore, all of the values after it have been shifted left into the wrong columns.
Now, let's check the second dataset:

In [14]:
# Checking the row lengths in the App Store dataset
print(ios_header)
print('\n')
check_lengths(ios, ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


All Good!


The App Store dataset has no rows of the wrong length.

Because row *10472* is missing its `Category` value, we'll remove it. The `if` clause lets us run this cell more than once without deleting any other row.

In [17]:
if android[10472][0]=='Life Made WI-Fi Touchscreen Photo Frame':
    del android[10472]  
print('New number of rows:',len(android))
print('Number of columns:', len(android[0]))

New number of rows: 10840
Number of columns: 13
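
As an alternative to deleting by index, the sketch below builds a new list that keeps only the rows matching the header length. The helper `keep_full_rows()` and the toy data are hypothetical and only illustrate the idea; the notebook itself uses the guarded `del` shown above.

```python
def keep_full_rows(dataset, dataset_header):
    """Hypothetical helper: return a new list with only the rows as long as the header."""
    expected_length = len(dataset_header)
    return [row for row in dataset if len(row) == expected_length]


# Invented toy data: the second row is missing its 'Category' value
toy_header = ['App', 'Category', 'Rating']
toy_data = [
    ['App A', 'GAME', '4.0'],
    ['App B', '3.5'],
    ['App C', 'TOOLS', '4.2'],
]

cleaned = keep_full_rows(toy_data, toy_header)
cleaned_again = keep_full_rows(cleaned, toy_header)  # a second run changes nothing

print('Rows before:', len(toy_data))
print('Rows after:', len(cleaned))
```

Because the function returns a new list instead of deleting in place, running it repeatedly gives the same result, which serves the same purpose as the `if` guard.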


## Footnotes

<span id="fn1"> 1. [Number of apps available in leading app stores as of 1st quarter 2021, Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)</span>

In [None]:
import pandas as pd
app_data = pd.read_csv('AppleStore.csv')

In [None]:
print(app_data.shape)

In [None]:
# reading in the App Store data directly from its URL
host = 'https://dq-content.s3.amazonaws.com/350/AppleStore.csv'

# saving the data as a pandas DataFrame named `app_data_from_url`
app_data_from_url = pd.read_csv(host)
print(app_data_from_url.shape)

In [None]:
import csv
import requests

CSV_URL = 'https://dq-content.s3.amazonaws.com/350/googleplaystore.csv'


with requests.Session() as s:
    download = s.get(CSV_URL)

    decoded_content = download.content.decode('utf-8')

    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    my_list = list(cr)
    
explore_data(my_list, 0, 3, True)