# Elements of success in iOS's App Store
As of September 2018, there were approximately 2 million iOS apps available on the App Store. By looking at a sample of those, we'll ask the questions:

> Are there shared elements that contribute to a **free** mobile application's *success*?

> Are there shared elements that contribute to a **free** mobile application's *failure*?

### OBJECTIVE
By the end of this analysis we expect a graphical representation of the correlation between specific elements and the success/failure of the app. The **elements to be evaluated** are:
- App's name length & key words
- App's genre in comparison to other genres
- App's size in bytes

And the **success/failure** will be defined as a score considering:
- Absolute Rating Count
- Average Rating itself
- Both measures, considered overall and grouped by genre

### FLOW STRUCTURE
Throughout this document, the colors and the title numbers (1, 2...) indicate which part of the flow we are in. A summary of the findings is given at the end.
1. <font color=red>**Data Cleaning**</font>: Making sure the data is properly displayed before analysis
2. <font color=purple>**Success Score**</font>: Generating the score and ranking the apps accordingly
3. <font color=orange>**Splitting Sample**</font>: Defining the borderline between Success & Failure
4. <font color=blue>**Element's Evaluation**</font>: Correlation analysis between each element and the Success score
5. <font color=green>**Conclusion**</font>: Summary of the findings

### LIBRARIES
- Pandas
- Numpy
- Matplotlib
- Scipy
- Itertools
- Regex

### RESOURCES

In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import itertools
import scipy.stats as sp
%matplotlib inline

In [5]:
apple_store = pd.read_csv("AppleStore.csv")
print("The dataset contains", apple_store.shape[0],
      "rows and", apple_store.shape[1],
      "columns. Here are the first five rows:")
apple_store.head()

The dataset contains 7197 rows and 16 columns. Here are the first five rows:


Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


At first glance, we can already decide which columns to keep and which to drop. The table below describes every column in the dataset and states which ones will actually be useful to us.

| Column | Description | Keep | Why? | Rename to |
| :--- | :--- | :--- | :--- | :--- |
| "id" | App ID | **Yes** | It might help us when trying to identify duplicates | "id" |
| "track_name" | App Name | **Yes** | Identifying the mobile apps | "app_name" |
| "size_bytes" | Size (in Bytes) | **Yes** | If the data proves reliable, it can be a point of analysis | "size_bytes" |
| "currency" | Currency Type | No | The focus is on free apps, so this information is irrelevant | - |
| "price" | Price amount | **Yes** | To identify the free apps | "price" |
| "rating_count_tot" | User Rating count (for all versions) | **Yes** | It is the only indicator we have of the number of active users | "rating_count" |
| "rating_count_ver" | User Rating count (for current version) | No | Our analysis looks at the big picture, so version information is not meaningful | - |
| "user_rating" | Average User Rating value (for all versions) | **Yes** | As a quality indicator, it will help to measure success | "rating" |
| "user_rating_ver" | Average User Rating value (for current version) | No | Our analysis looks at the big picture, so version information is not meaningful | - |
| "ver" | Latest version code | No | Our analysis looks at the big picture, so version information is not meaningful | - |
| "cont_rating" | Content Rating | No | Understanding content rating is outside the objective of this analysis | - |
| "prime_genre" | Primary Genre | **Yes** | It will serve to group the apps into categories and evaluate each app against its peers | "genre" |
| "sup_devices.num" | Number of supported devices | No | Although it might seem important to our analysis, it is already evident that an app supporting a wide range of devices has a greater chance of being successful | - |
| "ipadSc_urls.num" | Number of screenshots shown for display | No | It is relevant information, yet not the main focus of the analysis | - |
| "lang.num" | Number of supported languages | No | Although it might seem important to our analysis, it is already evident that an app supporting a wide range of languages has a greater chance of being successful | - |
| "vpp_lic" | Vpp Device Based Licensing Enabled | No | The licensing is not relevant to the objective of our analysis | - |


## 1. <font color=red>Data Cleaning</font>

The cleaning process is going to follow the order:
- Dropping/Reordering columns
- Renaming columns
- Implementing proper data-types to the columns
- Inspecting NaN values and any other wrong inputs
- Dropping paid apps
- Dealing with duplicates

In [6]:
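# Keep only the columns marked "Yes" in the table; .copy() returns an independent frame.
keep_cols = ["id", "track_name", "prime_genre", "size_bytes", "price", "rating_count_tot", "user_rating"]
apple_crop = apple_store[keep_cols].copy()
# Rename positionally, in the same order as keep_cols.
apple_crop.columns = ["id", "app_name", "genre", "size_bytes", "price", "rating_count", "rating"]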
apple_crop.head()

Unnamed: 0,id,app_name,genre,size_bytes,price,rating_count,rating
0,284882215,Facebook,Social Networking,389879808,0.0,2974676,3.5
1,389801252,Instagram,Photo & Video,113954816,0.0,2161558,4.5
2,529479190,Clash of Clans,Games,116476928,0.0,2130805,4.5
3,420009108,Temple Run,Games,65921024,0.0,1724546,4.5
4,284035177,Pandora - Music & Radio,Music,130242560,0.0,1126879,4.0
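
The keep/rename decisions from the table can also be written as a single old-name to new-name mapping, which avoids relying on column order. This is a minimal sketch on a one-row toy frame; `KEEP_AND_RENAME`, `crop_columns` and `toy` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Old name -> new name, for the columns marked "Yes" in the table.
KEEP_AND_RENAME = {
    "id": "id",
    "track_name": "app_name",
    "prime_genre": "genre",
    "size_bytes": "size_bytes",
    "price": "price",
    "rating_count_tot": "rating_count",
    "user_rating": "rating",
}


def crop_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the mapped columns, in mapping order, and rename them."""
    return df[list(KEEP_AND_RENAME)].rename(columns=KEEP_AND_RENAME).copy()


# Toy frame standing in for AppleStore.csv (values taken from the first row shown above).
toy = pd.DataFrame({
    "id": [284882215],
    "track_name": ["Facebook"],
    "size_bytes": [389879808],
    "currency": ["USD"],
    "price": [0.0],
    "rating_count_tot": [2974676],
    "user_rating": [3.5],
    "prime_genre": ["Social Networking"],
})

cropped = crop_columns(toy)
print(list(cropped.columns))
```

Because the mapping pairs each old name with its new name explicitly, reordering the source columns cannot silently mislabel them.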


In [45]:
for c in apple_crop.columns:
    print(c, ':', apple_crop[c].dtype)

id : int64
app_name : object
genre : object
size_bytes : int64
price : float64
rating_count : int64
rating : float64


In [55]:
print("Are there any NaN values looking column by column?", '\n')
for c in apple_crop.columns:
    print(apple_crop[c].isna().value_counts(), '\n')

Are there any NaN values looking column by column? 

False    7197
Name: id, dtype: int64 

False    7197
Name: app_name, dtype: int64 

False    7197
Name: genre, dtype: int64 

False    7197
Name: size_bytes, dtype: int64 

False    7197
Name: price, dtype: int64 

False    7197
Name: rating_count, dtype: int64 

False    7197
Name: rating, dtype: int64 
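
The same check can be condensed into one NaN count per column. The sketch below uses a small toy frame with deliberately missing values (not the real dataset) to show what a non-zero count looks like; `toy`, `nan_per_column` and `has_any_nan` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing name and one missing rating.
toy = pd.DataFrame({
    "app_name": ["A", "B", None],
    "rating": [4.5, np.nan, 3.0],
    "rating_count": [10, 20, 30],
})

# One NaN count per column, instead of one value_counts() table per column.
nan_per_column = toy.isna().sum()
print(nan_per_column)

# Single yes/no answer for the whole frame.
has_any_nan = bool(toy.isna().any().any())
print("Any NaN at all?", has_any_nan)
```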



In [61]:
# Filter first, then copy, so apple_free is independent of apple_crop.
apple_free = apple_crop[apple_crop['price'] == 0].copy()
print("The new dataset with only free apps contains",
      apple_free.shape[0], "rows and", apple_free.shape[1],
      "columns. Here are the first five rows:")
apple_free.head()

The new dataset with only free apps contains 4056 rows and 7 columns. Here are the first five rows:


Unnamed: 0,id,app_name,genre,size_bytes,price,rating_count,rating
0,284882215,Facebook,Social Networking,389879808,0.0,2974676,3.5
1,389801252,Instagram,Photo & Video,113954816,0.0,2161558,4.5
2,529479190,Clash of Clans,Games,116476928,0.0,2130805,4.5
3,420009108,Temple Run,Games,65921024,0.0,1724546,4.5
4,284035177,Pandora - Music & Radio,Music,130242560,0.0,1126879,4.0


In [65]:
print("Out of the remaining dataset, how many rows are duplicated taking into consideration the \"app_name\"?")
apple_free.duplicated("app_name", keep=False).value_counts()

Out of the remaining dataset, how many rows are duplicated taking into consideration the "app_name"?


False    4052
True        4
dtype: int64

In [66]:
print("These are the duplicated rows:")
apple_free[apple_free.duplicated("app_name", keep=False)]

These are the duplicated rows:


Unnamed: 0,id,app_name,genre,size_bytes,price,rating_count,rating
2948,1173990889,Mannequin Challenge,Games,109705216,0.0,668,3.0
4442,952877179,VR Roller Coaster,Games,169523200,0.0,107,3.5
4463,1178454060,Mannequin Challenge,Games,59572224,0.0,105,4.0
4831,1089824278,VR Roller Coaster,Games,240964608,0.0,67,3.5


Looking at the duplicated rows, the names are exactly the same, but **"size_bytes"**, **"id"**, **"rating_count"**, and **"rating"** all differ. We therefore assume these are different apps that happen to share a name, and the names themselves are generic rather than distinctively branded. For this reason, we <font color=red>**will not drop**</font> these rows.
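
The reasoning above can be checked programmatically. The sketch below rebuilds the four rows from the output as a small frame and confirms that each repeated name maps to distinct ids and that no row is a full duplicate. The names `dupes`, `distinct_ids`, `rows_per_name`, `same_name_different_app` and `fully_duplicated` are illustrative.

```python
import pandas as pd

# The four rows flagged as duplicated by "app_name", copied from the output above.
dupes = pd.DataFrame({
    "id": [1173990889, 952877179, 1178454060, 1089824278],
    "app_name": ["Mannequin Challenge", "VR Roller Coaster",
                 "Mannequin Challenge", "VR Roller Coaster"],
    "size_bytes": [109705216, 169523200, 59572224, 240964608],
    "rating_count": [668, 107, 105, 67],
    "rating": [3.0, 3.5, 4.0, 3.5],
})

# If every row under a name has its own id, the rows are different apps.
distinct_ids = dupes.groupby("app_name")["id"].nunique()
rows_per_name = dupes.groupby("app_name").size()
same_name_different_app = bool((distinct_ids == rows_per_name).all())

# Rows that are identical in every column would be true duplicates.
fully_duplicated = int(dupes.duplicated(keep=False).sum())

print(distinct_ids)
print("Same name, different app:", same_name_different_app)
print("Fully duplicated rows:", fully_duplicated)
```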

To finish the cleaning process, we drop two columns that are no longer needed for the upcoming analysis: "id" and "price".

In [67]:
# Select the remaining columns, then copy, so apple_final is independent of apple_free.
apple_final = apple_free[["app_name", "genre", "size_bytes", "rating_count", "rating"]].copy()
apple_final.head()

Unnamed: 0,app_name,genre,size_bytes,rating_count,rating
0,Facebook,Social Networking,389879808,2974676,3.5
1,Instagram,Photo & Video,113954816,2161558,4.5
2,Clash of Clans,Games,116476928,2130805,4.5
3,Temple Run,Games,65921024,1724546,4.5
4,Pandora - Music & Radio,Music,130242560,1126879,4.0


## 2. <font color=purple>Success Score</font>
To generate the score and rank the apps accordingly we will follow the steps:
- Defining variables and their weights
- Defining Success Rate Formula
- Prototyping and reviewing outcome
- Implementing it to the whole dataset

The score is going to take into consideration two criteria:
- The **app overall** numbers with respect to all other apps
- The app numbers with respect to apps **within the same "genre"**

For both perspectives, the indicators we will look at are "rating_count" and "rating". The first is an absolute number that indirectly reflects how many users use the app; the second is those users' evaluation of the product.

Wouldn't it be better to measure the total number of users instead of "rating_count"?
> Yes, definitely. However, we don't have access to that data, so we rely on the assumption that more ratings mean more active users, more installs and ultimately more success.

<div class="alert alert-block alert-info">
<b>App Overall:</b> We want two equivalent coefficients, one for "rating_count" and another for "rating", to which we can apply different weights and then sum them into a number ranging between 0 and 1.
</div>

<div class="alert alert-block alert-info">
<b>App within Genre:</b> Similarly, we want the same coefficients, but comparing each app only with the others in its "genre".
</div>

<div class="alert alert-block alert-warning">
<b>Final Success Score:</b> (Overall Coefficient * Weight) + (Genre Coefficient * Weight)
</div>

The two numbers are radically different in nature: "rating_count" is an absolute number that can reach the millions, while "rating" ranges from 0 to 5 and is the result of an average. With this in mind, to generate comparable coefficients ranging between 0 and 1, we will use ***Min-max feature scaling***:

[Reference](https://en.wikipedia.org/wiki/Normalization_(statistics))

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/0222c9472478eec2857b8bcbfa4148ece4a11b84" alt="Min-max feature scaling" title="Min-max feature scaling" />

*For future references:*
- **'rc'** refers to **"rating_count"**
- **'r'** refers to **"rating"**
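
To make the scaling concrete, here is a minimal sketch on made-up numbers, applied both over the whole sample and within each genre. It is an illustration under stated assumptions, not the notebook's own implementation: the helper `minmax`, the toy frame and the derived column names are hypothetical, and mapping a constant group to 0.0 is a choice made here to avoid dividing by zero.

```python
import pandas as pd


def minmax(series: pd.Series) -> pd.Series:
    """Scale a numeric Series to [0, 1]; a constant Series maps to 0.0 (assumption)."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span


# Made-up apps: three games and two music apps.
toy = pd.DataFrame({
    "genre": ["Games", "Games", "Games", "Music", "Music"],
    "rating_count": [100, 300, 500, 50, 150],
    "rating": [3.0, 4.5, 4.0, 5.0, 2.5],
})

# Overall: min and max are taken over every app.
toy["rc_overall"] = minmax(toy["rating_count"])
toy["r_overall"] = minmax(toy["rating"])

# Within genre: min and max are taken per genre group.
toy["rc_genre"] = toy.groupby("genre")["rating_count"].transform(minmax)
toy["r_genre"] = toy.groupby("genre")["rating"].transform(minmax)

print(toy)
```

Note how the music app with 150 ratings scores low overall but 1.0 within its genre, which is exactly the difference the two perspectives are meant to capture.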

In [68]:
weight_rc = 0.4
weight_r = 0.6
weight_overall = 0.5
weight_genre = 0.5
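
As a worked example of how these weights combine, the sketch below computes the final score for one hypothetical app whose four scaled coefficients are already known. The function names and the input values are assumptions for illustration; only the weights and the overall shape of the formula come from the text above.

```python
# Weights as defined in the notebook.
weight_rc = 0.4
weight_r = 0.6
weight_overall = 0.5
weight_genre = 0.5


def weighted_coefficient(rc_scaled: float, r_scaled: float) -> float:
    """Combine scaled rating_count and scaled rating into one value in [0, 1]."""
    return weight_rc * rc_scaled + weight_r * r_scaled


def success_score(rc_overall: float, r_overall: float,
                  rc_genre: float, r_genre: float) -> float:
    """(Overall Coefficient * Weight) + (Genre Coefficient * Weight)."""
    overall = weighted_coefficient(rc_overall, r_overall)
    genre = weighted_coefficient(rc_genre, r_genre)
    return weight_overall * overall + weight_genre * genre


# Hypothetical app: mid-range overall, top of its genre by rating count.
score = success_score(rc_overall=0.5, r_overall=0.8, rc_genre=1.0, r_genre=0.6)
print(round(score, 2))
```

Since each pair of weights sums to 1 and every input lies between 0 and 1, the final score also stays between 0 and 1.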

In [None]:
def minmax_scaling_overall(x, column):
    """Min-max scale x against the whole dataset; column is either 'r' or 'rc'."""
    # Map the short codes to the actual column names in apple_final.
    names = {'rc': 'rating_count', 'r': 'rating'}
    if column not in names:
        # Raise instead of returning a string, so a bad input cannot end up in a numeric column.
        raise ValueError("Wrong 'column' input: expected 'rc' or 'r'")
    min_ = apple_final[names[column]].min()
    max_ = apple_final[names[column]].max()
    return (x - min_) / (max_ - min_)
    
def minmax_scaling_genre(x, column):
    