# Profitable App Profiles for Google Play Store and iOS App Store
In this project, we will be analyzing apps from the Google Play Store and iOS App Store for profitability. We are working as Data Analysts to see which apps perform the best in terms of user count. Our app will be free and make money through in-app advertisements. 

Our goal is to find which types of apps attract the most users. Because our revenue comes from in-app advertisements, more users means more ad views and therefore more revenue.

# Opening Our Files
We must open our CSV files to access the data.
1. Import reader from csv by using `from csv import reader`.
2. Use `open('AppleStore.csv')` to open the file and save it to the variable `opened_app_store`.
3. Use `reader(opened_app_store)` to read the file and save it to the variable `read_app_store`.
4. Use `list(read_app_store)` to create a list of the data and save it to the variable `app_store_data`.

Repeat steps 2-4 for `googleplaystore.csv`.

In [1]:
# Open AppleStore.csv
from csv import reader
opened_app_store = open('AppleStore.csv')
read_app_store = reader(opened_app_store)
app_store_data = list(read_app_store)

# Open googleplaystore.csv
opened_play_store = open('googleplaystore.csv')
read_play_store = reader(opened_play_store)
play_store_data = list(read_play_store)
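
The four steps above can also be wrapped in a small helper. The following is a minimal sketch, not the code used in this project: `load_csv` is a hypothetical name, and `encoding='utf8'` is an assumption about how the files are encoded. It opens the file in a `with` block so that the file is closed once it has been read.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file. Hypothetical helper."""
    # newline='' is what the csv module documentation recommends
    # for file objects that are passed to reader()
    with open(path, encoding=encoding, newline='') as opened_file:
        data = list(reader(opened_file))
    # The first row is the header; everything after it is data
    return data[0], data[1:]
```

Under these assumptions, `app_store_header, app_store_rows = load_csv('AppleStore.csv')` would give the header row and the data rows separately, which avoids having to skip the header with `[1:]` later on.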

# Exploring the CSV for Relevant Data
We print the first few rows of the data sets to see what columns we will be using for our analysis. 

We use the header row as a reference to select our columns, as well as the rows with actual data to visualize what the data will look like.

We then print the number of rows and number of columns.

In [2]:
# Print a slice of rows and, optionally, the row and column counts
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        # The row count includes the header row
        print(('Number of Rows', len(dataset)))
        print(('Number of Columns', len(dataset[0])))

print('App Store Data')
print('--------------------------------------------------------------')
print('\n')
explore_data(app_store_data, 0, 3, True)
print('\n')
print('\n')
print('Play Store Data')
print('--------------------------------------------------------------')
print('\n')
explore_data(play_store_data, 0, 3, True)

App Store Data
--------------------------------------------------------------


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


('Number of Rows', 7198)
('Number of Columns', 17)




Play Store Data
--------------------------------------------------------------


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+'

# Columns We Will Be Using
* [App Store Documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
* [Google Play Store Documentation](https://www.kaggle.com/lava18/google-play-store-apps)

From `AppleStore.csv`, we will be using:
* track_name - App Name
* currency - Currency Type
* price - Price of App
* rating_count_tot - Total Ratings (all versions)
* rating_count_ver - Total Ratings (current version)
* user_rating - Avg User Rating (all versions)
* prime_genre - Primary Genre

From `googleplaystore.csv`, we will be using: 
* App - App Name
* Category - App Category
* Rating - Avg User Rating
* Installs - Number of Installs
* Type - Paid or Free
* Price - Price of App
* Genres - App Genre

# Deleting Incorrect Data
Row `10473` in our Google Play Store data set is missing its `Category`, which shifts every later value one column to the left. As a result, `19` lands in the `Rating` column, which is not possible because Play Store ratings only go up to 5. We delete the row by executing `del play_store_data[10473]`.

In [3]:
print(play_store_data[0])
print('\n')
print(play_store_data[10473])
del play_store_data[10473]
print('\n')
print(play_store_data[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


We print the header row to compare it with row `10473`. Because the `Category` value is missing, every value after the app name is shifted one column to the left. After the deletion, we print row `10473` again: the row that now occupies that index has a `Category` value (`TOOLS`), so its columns line up with the header. Note that the `del` statement should run only once, since running the cell again would delete this valid row.
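
The index `10473` is specific to this copy of the file. A shifted row like this one can also be spotted by comparing its length to the header's, since the bad row has 12 values against the header's 13. The following is a minimal sketch using the three rows printed above; `find_malformed_rows` is a hypothetical helper, and it only catches rows whose length is wrong, not every kind of bad data.

```python
def find_malformed_rows(dataset):
    """Return indices of rows whose length differs from the header row's."""
    expected = len(dataset[0])
    return [index for index, row in enumerate(dataset) if len(row) != expected]

# The three rows printed above: header, shifted row, valid row
sample = [
    ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price',
     'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+',
     'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up'],
]

print(find_malformed_rows(sample))  # [1]
```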

# Removing Duplicate Entries: Part One
The Google Play Store data set contains duplicate rows for some apps. For example, `Instagram` appears four times, and the rows differ only in their number of reviews.

In [4]:
for row in play_store_data[1:]:
    if row[0] == 'Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In fact, there are a total of 1,181 cases of duplicate rows.

In [5]:
duplicates = []
unique = []
for row in play_store_data[1:]:
    if row[0] in unique:
        duplicates.append(row[0])
    else:
        unique.append(row[0])

print('Number of Unique Apps: ', len(unique))
print('Number of Duplicate Apps: ', len(duplicates))

Number of Unique Apps:  9659
Number of Duplicate Apps:  1181
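
Checking `row[0] in unique` scans a list on every iteration, which gets slower as the list grows. A possible alternative, sketched here on toy data, tracks the names already seen in a set. `count_duplicates` and `sample_rows` are hypothetical, and the sample rows hold only the app name.

```python
def count_duplicates(rows):
    """Return (number of unique names, number of repeated occurrences)."""
    seen = set()
    duplicates = 0
    for row in rows:
        name = row[0]
        if name in seen:
            # A name we have already met: count it as a duplicate
            duplicates += 1
        else:
            seen.add(name)
    return len(seen), duplicates

# Toy rows: four Instagram entries and one other app
sample_rows = [['Instagram'], ['Instagram'], ['Instagram'], ['Instagram'],
               ['osmino Wi-Fi: free WiFi']]

print(count_duplicates(sample_rows))  # (2, 3)
```

Every row is counted either as unique or as a duplicate, so the two numbers always add up to the number of data rows. In the real data set that is 9,659 + 1,181 = 10,840 rows.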


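For each app, we keep the row with the highest number of reviews and delete the duplicates with fewer reviews. Using `Instagram` as an example, we keep the row with 66,577,446 reviews. A higher review count suggests that the row was collected more recently, so it is the most up-to-date entry for that app. After removing the 1,181 duplicates, we expect 9,659 rows to remain.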

In [6]:
# The -1 accounts for the header row
expected_length = len(play_store_data) - len(duplicates) - 1
print('Expected Length: ', expected_length)

Expected Length:  9659


# Removing Duplicate Entries: Part Two
We want to create a dictionary, where each key is a unique app, and each value is the app's number of reviews.

In [7]:
# Create a new dictionary
reviews_max = {}
# Iterate through data set
for row in play_store_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

# Check that the dictionary length matches the expected length, which is 9,659
print('Dictionary Length: ', len(reviews_max))

Dictionary Length:  9659


We iterate through the data set and, for each row, check whether the app's name is already a key in the dictionary.

If the app is not in the dictionary yet, we add it with its number of reviews. If it is already there **and** the current row has more reviews than the stored value, we update the stored value to the larger number.
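
To make the update rule concrete, the sketch below traces it over the review counts of the four `Instagram` rows printed earlier. It is an illustration only: `instagram_reviews` is a hypothetical name, and the two branches of the original code are collapsed into a single condition that behaves the same way.

```python
# Review counts of the four Instagram rows printed earlier, in order
instagram_reviews = ['66577313', '66577446', '66577313', '66509917']

reviews_max = {}
for reviews in instagram_reviews:
    name = 'Instagram'
    n_reviews = float(reviews)
    # Store the count if the app is new or this row beats the stored value
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    print(reviews, '->', reviews_max[name])
```

After the loop, the dictionary holds one entry for `Instagram` with the value `66577446.0`, the largest of the four counts.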

In [8]:
android_clean = []
already_added = []

for row in play_store_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
        
print('Length of Cleaned List: ', len(android_clean))

Length of Cleaned List:  9659


We create two lists: `android_clean` for our clean data set and `already_added` for apps that already exist in `android_clean`.

We then iterate through our original data set, `play_store_data`, and grab the `app name` and `number of reviews`.

We check whether `n_reviews` equals the number of reviews stored in `reviews_max` for that app **and** whether the app name is not yet in the `already_added` list. If both conditions are **True**, we append the `row` to `android_clean` and the app name to `already_added`. The second condition is needed because duplicate rows can share the same highest review count, and without it each of those rows would be added.

Finally, we print the length of the `android_clean` list, `9,659`, which is the expected length for our data set.
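
The `name not in already_added` condition matters when duplicate rows tie for the highest review count. The toy run below sketches that case. The app names and counts are made up, `clean_rows` is a hypothetical wrapper around the same logic, and it uses a set rather than a list for `already_added` as a small variation.

```python
def clean_rows(rows, reviews_max):
    """Keep one row per app: the first row whose review count equals the app's maximum."""
    cleaned = []
    already_added = set()
    for row in rows:
        name = row[0]
        n_reviews = float(row[3])
        if n_reviews == reviews_max[name] and name not in already_added:
            cleaned.append(row)
            already_added.add(name)
    return cleaned

# Made-up rows: only column 0 (name) and column 3 (reviews) matter here
toy_rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App A', 'TOOLS', '4.0', '250'],
    ['App A', 'TOOLS', '4.0', '250'],  # ties with the previous row
    ['App B', 'SOCIAL', '4.5', '10'],
]
toy_reviews_max = {'App A': 250.0, 'App B': 10.0}

for row in clean_rows(toy_rows, toy_reviews_max):
    print(row)
```

Only one of the two tied `App A` rows is kept, so the result has one row per app.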