## 1. Introduction
<p><img src="https://assets.datacamp.com/production/project_1197/img/google_play_store.png" alt="Google Play logo"></p>
<p>Mobile apps are everywhere. They are easy to create and can be very lucrative from a business standpoint. Android in particular keeps expanding as an operating system and has captured more than 74% of the total market<sup><a href="https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009">[1]</a></sup>.</p>
<p>The Google Play Store apps data has enormous potential to facilitate data-driven decisions and insights for businesses. In this notebook, we will analyze the Android app market by comparing ~10k apps in Google Play across different categories. We will also use the user reviews to draw a qualitative comparison between the apps.</p>
<p>The dataset you will use here was scraped from Google Play Store in September 2018 and was published on <a href="https://www.kaggle.com/lava18/google-play-store-apps">Kaggle</a>. Here are the details: <br>
<br></p>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/apps.csv</b></div>
This file contains all the details of the apps on Google Play. There are 9 features that describe a given app.
<ul>
    <li><b>App:</b> Name of the app</li>
    <li><b>Category:</b> Category of the app. Some examples are: ART_AND_DESIGN, FINANCE, COMICS, BEAUTY etc.</li>
    <li><b>Rating:</b> The current average rating (out of 5) of the app on Google Play</li>
    <li><b>Reviews:</b> Number of user reviews given on the app</li>
    <li><b>Size:</b> Size of the app in MB (megabytes)</li>
    <li><b>Installs:</b> Number of times the app was downloaded from Google Play</li>
    <li><b>Type:</b> Whether the app is paid or free</li>
    <li><b>Price:</b> Price of the app in US$</li>
    <li><b>Last Updated:</b> Date on which the app was last updated on Google Play </li>

</ul>
</div>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/user_reviews.csv</b></div>
This file contains a random sample of 100 <i><a href="https://www.androidpolice.com/2019/01/21/google-play-stores-redesigned-ratings-and-reviews-section-lets-you-easily-filter-by-star-rating/">most helpful first</a></i> user reviews for each app. The text in each review has been pre-processed and passed through a sentiment analyzer.
<ul>
    <li><b>App:</b> Name of the app on which the user review was provided. Matches the <code>App</code> column of the <code>apps.csv</code> file</li>
    <li><b>Review:</b> The pre-processed user review text</li>
    <li><b>Sentiment Category:</b> Sentiment category of the user review - Positive, Negative or Neutral</li>
    <li><b>Sentiment Score:</b> Sentiment score of the user review. It lies in the range [-1, 1]. A higher score denotes a more positive sentiment.</li>

</ul>
</div>
<p>From here on, it will be your task to explore and manipulate the data until you are able to answer the three questions described in the instructions panel.<br></p>
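Before analyzing the reviews, it can help to confirm that the sentiment scores really fall in the documented [-1, 1] interval. The sketch below is only an illustration: `toy_reviews` is a hypothetical, hand-written DataFrame standing in for `datasets/user_reviews.csv`, and its values are made up.

```python
import pandas as pd

# Hypothetical rows standing in for datasets/user_reviews.csv (values are made up)
toy_reviews = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Sentiment Score": [0.5, -0.25, 1.2],  # 1.2 is deliberately out of range
})

# Keep only the rows whose score falls outside the closed interval [-1, 1]
out_of_range = toy_reviews[~toy_reviews["Sentiment Score"].between(-1, 1)]
print(out_of_range)
```

Note that `between` returns `False` for missing values, so rows with a missing score would also be flagged by this check.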

In [None]:
# Use this cell to begin your analysis, and add as many as you would like!

In [None]:
import pandas as pd
import numpy as np


In [None]:
apps = pd.read_csv('datasets/apps.csv')
apps.info()
apps.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   App           9659 non-null   object 
 1   Category      9659 non-null   object 
 2   Rating        8196 non-null   float64
 3   Reviews       9659 non-null   int64  
 4   Size          8432 non-null   float64
 5   Installs      9659 non-null   object 
 6   Type          9659 non-null   object 
 7   Price         9659 non-null   float64
 8   Last Updated  9659 non-null   object 
dtypes: float64(3), int64(1), object(5)
memory usage: 679.3+ KB


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,"10,000+",Free,0.0,"January 7, 2018"
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,"500,000+",Free,0.0,"January 15, 2018"
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,"5,000,000+",Free,0.0,"August 1, 2018"
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25.0,"50,000,000+",Free,0.0,"June 8, 2018"
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,"100,000+",Free,0.0,"June 20, 2018"
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6,"50,000+",Free,0.0,"March 26, 2017"
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19.0,"50,000+",Free,0.0,"April 26, 2018"
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29.0,"1,000,000+",Free,0.0,"June 14, 2018"
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33.0,"1,000,000+",Free,0.0,"September 20, 2017"
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1,"10,000+",Free,0.0,"July 3, 2018"


In [None]:
# Strip the thousands separators and the trailing '+' so the counts parse as integers
apps['Installs'] = apps['Installs'].str.replace(r'[,+]', '', regex=True)

# Convert the cleaned strings to int
apps['Installs'] = apps['Installs'].astype(int)
apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   App           9659 non-null   object 
 1   Category      9659 non-null   object 
 2   Rating        8196 non-null   float64
 3   Reviews       9659 non-null   int64  
 4   Size          8432 non-null   float64
 5   Installs      9659 non-null   int64  
 6   Type          9659 non-null   object 
 7   Price         9659 non-null   float64
 8   Last Updated  9659 non-null   object 
dtypes: float64(3), int64(2), object(4)
memory usage: 679.3+ KB
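
The `Installs` cleaning step can be checked on a few hand-written values before trusting it on the full column. This is a minimal sketch: `toy_installs` is a hypothetical Series in the same format as the `Installs` column, and the vectorized `str.replace` with a character class is one possible way to remove both characters in a single pass.

```python
import pandas as pd

# Hypothetical values in the same format as the Installs column
toy_installs = pd.Series(["10,000+", "5,000,000+", "0"])

# Remove every ',' and '+' in one pass, then cast to int
cleaned = toy_installs.str.replace(r"[,+]", "", regex=True).astype(int)
print(cleaned.tolist())  # [10000, 5000000, 0]
```

The cast to `int` raises a `ValueError` if any unexpected character survives, which makes it a useful guard against values that do not follow the expected format.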


In [None]:

# Per-category summary: number of apps, mean price, and mean rating
# (mean() skips missing ratings by default)
app_category_info = apps.groupby('Category').agg({'App': 'count', 'Price': 'mean', 'Rating': 'mean'})
app_category_info = app_category_info.rename(columns={'App': 'Number of apps', 'Price': 'Average price', 'Rating': 'Average rating'})
app_category_info

Category,Number of apps,Average price,Average rating
ART_AND_DESIGN,64,0.093281,4.357377
AUTO_AND_VEHICLES,85,0.158471,4.190411
BEAUTY,53,0.0,4.278571
BOOKS_AND_REFERENCE,222,0.539505,4.34497
BUSINESS,420,0.417357,4.098479
COMICS,56,0.0,4.181481
COMMUNICATION,315,0.263937,4.121484
DATING,171,0.160468,3.970149
EDUCATION,119,0.150924,4.364407
ENTERTAINMENT,102,0.078235,4.135294
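
Since `Rating` has missing values (8196 non-null out of 9659), the average rating of a category is computed over fewer apps than the app count suggests. The sketch below illustrates this on a hypothetical `toy_apps` frame with made-up values; the output column names `n_apps`, `n_rated`, and `avg_rating` are also illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical apps; "b" has no rating
toy_apps = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Category": ["X", "X", "Y"],
    "Rating": [4.0, np.nan, 3.0],
})

# Named aggregation: count apps, count rated apps, and average the ratings
summary = toy_apps.groupby("Category").agg(
    n_apps=("App", "count"),
    n_rated=("Rating", "count"),
    avg_rating=("Rating", "mean"),
)
print(summary)
```

Category `X` has two apps but only one rating, so its average rating equals that single value. Reporting the number of rated apps next to the average makes such cases visible.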


In [None]:
finance_apps = apps[apps['Category']=='FINANCE']
finance_apps.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 345 entries, 837 to 9636
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   App           345 non-null    object 
 1   Category      345 non-null    object 
 2   Rating        302 non-null    float64
 3   Reviews       345 non-null    int64  
 4   Size          299 non-null    float64
 5   Installs      345 non-null    int64  
 6   Type          345 non-null    object 
 7   Price         345 non-null    float64
 8   Last Updated  345 non-null    object 
dtypes: float64(3), int64(2), object(4)
memory usage: 27.0+ KB


In [None]:
user_reviews = pd.read_csv("datasets/user_reviews.csv")

# Average sentiment score per app
user_reviews_with_sent_score = user_reviews.groupby('App').agg({'Sentiment Score': 'mean'})
user_reviews_with_sent_score

App,Sentiment Score
10 Best Foods for You,0.470733
104 找工作 - 找工作 找打工 找兼職 履歷健檢 履歷診療室,0.392405
11st,0.181294
1800 Contacts - Lens Store,0.318145
1LINE – One Line with One Touch,0.196290
...,...
Hotspot Shield Free VPN Proxy & Wi-Fi Security,0.251765
Hotstar,0.038178
Hotwire Hotel & Car Rental App,0.187029
Housing-Real Estate & Property,-0.021427


In [None]:
# Inner-join finance apps with their mean sentiment score; apps without reviews are dropped
finance_app_with_reviews = pd.merge(finance_apps, user_reviews_with_sent_score, on='App')
user_feedback = finance_app_with_reviews[['App', 'Sentiment Score']]

# nlargest already returns rows in descending order, so no separate sort is needed
top_10_user_feedback = user_feedback.nlargest(n=10, columns=['Sentiment Score'])
top_10_user_feedback

Unnamed: 0,App,Sentiment Score
4,BBVA Spain,0.515086
36,Associated Credit Union Mobile,0.388093
34,BankMobile Vibe App,0.353455
44,A+ Mobile,0.329592
29,Current debit card and app made for teens,0.327258
8,BZWBK24 mobile,0.326883
12,"Even - organize your money, get paid early",0.283929
7,Credit Karma,0.270052
45,Fortune City - A Finance App,0.266966
19,Branch,0.26423
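
`pd.merge` performs an inner join by default, so finance apps that have no entry in the reviews table silently disappear from the ranking. The sketch below shows one way to see which apps are lost, using a left join with `indicator=True`. The frames `toy_finance` and `toy_scores` are hypothetical and their values are made up.

```python
import pandas as pd

# Hypothetical finance apps and per-app mean sentiment scores
toy_finance = pd.DataFrame({"App": ["Bank A", "Bank B", "Bank C"]})
toy_scores = pd.DataFrame({
    "App": ["Bank A", "Bank C", "Other"],
    "Sentiment Score": [0.3, 0.1, 0.9],
})

# Default inner join: only apps present in both frames survive
inner = pd.merge(toy_finance, toy_scores, on="App")

# Left join with an indicator column to list apps that have no reviews
audit = pd.merge(toy_finance, toy_scores, on="App", how="left", indicator=True)
missing = audit.loc[audit["_merge"] == "left_only", "App"].tolist()
print(len(inner), missing)  # 2 ['Bank B']
```

If many apps turn out to be missing, the top 10 reflects only the reviewed subset of the category, which is worth stating alongside the result.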
