## 1. Introduction
<p><img src="https://assets.datacamp.com/production/project_1197/img/google_play_store.png" alt="Google Play logo"></p>
<p>Mobile apps are everywhere. They are easy to create and can be very lucrative from the business standpoint. Specifically, Android is expanding as an operating system and has captured more than 74% of the total market<sup><a href="https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009">[1]</a></sup>. </p>
<p>The Google Play Store apps data has enormous potential to facilitate data-driven decisions and insights for businesses. In this notebook, we will analyze the Android app market by comparing ~10k apps in Google Play across different categories. We will also use the user reviews to draw a qualitative comparison between the apps.</p>
<p>The dataset you will use here was scraped from Google Play Store in September 2018 and was published on <a href="https://www.kaggle.com/lava18/google-play-store-apps">Kaggle</a>. Here are the details: <br>
<br></p>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/apps.csv</b></div>
This file contains all the details of the apps on Google Play. There are 9 features that describe a given app.
<ul>
    <li><b>App:</b> Name of the app</li>
    <li><b>Category:</b> Category of the app. Some examples are: ART_AND_DESIGN, FINANCE, COMICS, BEAUTY etc.</li>
    <li><b>Rating:</b> The current average rating (out of 5) of the app on Google Play</li>
    <li><b>Reviews:</b> Number of user reviews given on the app</li>
    <li><b>Size:</b> Size of the app in MB (megabytes)</li>
    <li><b>Installs:</b> Number of times the app was downloaded from Google Play</li>
    <li><b>Type:</b> Whether the app is paid or free</li>
    <li><b>Price:</b> Price of the app in US$</li>
    <li><b>Last Updated:</b> Date on which the app was last updated on Google Play </li>

</ul>
</div>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/user_reviews.csv</b></div>
This file contains a random sample of 100 <i><a href="https://www.androidpolice.com/2019/01/21/google-play-stores-redesigned-ratings-and-reviews-section-lets-you-easily-filter-by-star-rating/">most helpful first</a></i> user reviews for each app. The text in each review has been pre-processed and passed through a sentiment analyzer.
<ul>
    <li><b>App:</b> Name of the app on which the user review was provided. Matches the <code>App</code> column of the <code>apps.csv</code> file</li>
    <li><b>Review:</b> The pre-processed user review text</li>
    <li><b>Sentiment Category:</b> Sentiment category of the user review - Positive, Negative or Neutral</li>
    <li><b>Sentiment Score:</b> Sentiment score of the user review. It lies in the range [-1, 1]. A higher score denotes a more positive sentiment.</li>

</ul>
</div>
<p>From here on, it will be your task to explore and manipulate the data until you are able to answer the three questions described in the instructions panel.<br></p>

In [2]:
# Read in dataset
import pandas as pd
apps_with_duplicates = pd.read_csv("datasets/apps.csv")

# Drop duplicates from apps_with_duplicates
apps = apps_with_duplicates.drop_duplicates()

# Print the total number of apps
print('Total number of apps in the dataset = ', str(len(apps.index)))

# Have a look at a random sample of 5 rows
print(apps.sample(5))

Total number of apps in the dataset =  9659
                                 App      Category  Rating  Reviews  Size  \
6661                          CQ Key         TOOLS     NaN        0   1.7   
1445                     PUBG MOBILE          GAME     4.4  3715656  36.0   
8775  EU Brazil Green Business Forum  PRODUCTIVITY     NaN        0   8.7   
6952                           My CW         TOOLS     3.0        1  55.0   
643             GMAT Math Flashcards     EDUCATION     4.4     1769   NaN   

         Installs  Type  Price      Last Updated  
6661         100+  Free    0.0    April 26, 2018  
1445  50,000,000+  Free    0.0     July 24, 2018  
8775          10+  Free    0.0    April 18, 2017  
6952         100+  Free    0.0  October 27, 2016  
643      100,000+  Free    0.0     July 11, 2018  


In [3]:
apps.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9659 entries, 0 to 9658
Data columns (total 9 columns):
App             9659 non-null object
Category        9659 non-null object
Rating          8196 non-null float64
Reviews         9659 non-null int64
Size            8432 non-null float64
Installs        9659 non-null object
Type            9659 non-null object
Price           9659 non-null float64
Last Updated    9659 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 754.6+ KB


## 2.0 Data Inspection

Data inspection is one of the most important steps in data science. Let's look at the first rows and count the missing values in each column.

In [4]:
apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,"10,000+",Free,0.0,"January 7, 2018"
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,"500,000+",Free,0.0,"January 15, 2018"
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,"5,000,000+",Free,0.0,"August 1, 2018"
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25.0,"50,000,000+",Free,0.0,"June 8, 2018"
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,"100,000+",Free,0.0,"June 20, 2018"


In [5]:
apps.isna().sum()

App                0
Category           0
Rating          1463
Reviews            0
Size            1227
Installs           0
Type               0
Price              0
Last Updated       0
dtype: int64
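
Beyond counting missing values, a per-column summary can help at this stage. The following is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration and are not part of the dataset); on the real data the same calls would presumably be applied to `apps`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking a few columns of apps.csv
toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9, 4.7],
    "Size": [19.0, 14.0, np.nan, np.nan],
    "Type": ["Free", "Free", "Paid", "Free"],
})

# Summary statistics for the numeric columns (NaN values are skipped)
numeric_summary = toy.describe()

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Frequency of each value in a categorical column
type_counts = toy["Type"].value_counts()

print(numeric_summary)
print(missing_share)
print(type_counts)
```

Reporting the share of missing values rather than the raw count makes columns easier to compare when deciding whether to drop or impute.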

## 2.5 Data cleaning
<p>Data cleaning is one of the most essential subtasks of any data science project. Although it can be a very tedious process, its worth should never be underestimated.</p>
<p>Looking at the random sample of rows printed earlier, we observe that entries in the <code>Installs</code> column contain special characters (<code>+</code> and <code>,</code>) because of the way the numbers are represented. This prevents the column from being purely numeric, making it difficult to use in subsequent mathematical calculations. Ideally, as its name suggests, we would want this column to contain only digits from [0-9]. A <code>Price</code> column stored as text would have the same problem with <code>$</code>, although the <code>info()</code> output above shows that <code>Price</code> is already <code>float64</code> in this file.</p>
<p>Hence, we now proceed to clean our data. Specifically, the special characters <code>,</code> and <code>+</code> present in the <code>Installs</code> column need to be removed. Because <code>Price</code> is already numeric here, no <code>$</code> needs to be stripped from it.</p>
<p>It is also good practice to print a summary of your dataframe after completing data cleaning. We will use the <code>info()</code> method to achieve this.</p>

In [6]:
# List of characters to remove
chars_to_remove = ["+", ","]
# List of column names to clean
cols_to_clean = ["Installs"]

# Loop for each column in cols_to_clean
for col in cols_to_clean:
    # Loop for each char in chars_to_remove
    for char in chars_to_remove:
        # Replace the character with an empty string
        apps[col] = apps[col].apply(lambda x: x.replace(char, ""))

# Print a summary of the apps dataframe
print(apps.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9659 entries, 0 to 9658
Data columns (total 9 columns):
App             9659 non-null object
Category        9659 non-null object
Rating          8196 non-null float64
Reviews         9659 non-null int64
Size            8432 non-null float64
Installs        9659 non-null object
Type            9659 non-null object
Price           9659 non-null float64
Last Updated    9659 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 754.6+ KB
None
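
As an alternative to the `apply` loop, the same cleaning can be sketched with pandas' vectorized string methods. The frame below is hypothetical (made-up values in the style of the raw columns, including a text `Price` with `$`, which is an assumption about how an uncleaned file might look).

```python
import pandas as pd

# Hypothetical raw values; not taken from the dataset
raw = pd.DataFrame({
    "Installs": ["10,000+", "500,000+", "100+"],
    "Price": ["$4.99", "0", "$0.99"],
})

# regex=False treats "+" and "$" as literal characters, not regex syntax
for char in ["+", ","]:
    raw["Installs"] = raw["Installs"].str.replace(char, "", regex=False)
raw["Price"] = raw["Price"].str.replace("$", "", regex=False)

print(raw)
```

Passing `regex=False` matters here because `+` and `$` both have a special meaning in regular expressions.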


## 3. Correcting data types
<p>From the previous task we noticed that <code>Installs</code> is still categorized as <code>object</code> data type (and not <code>int</code> or <code>float</code>) as we would like. This is because the column originally had mixed input types, digits and special characters, so its values were read as strings and remain strings after cleaning. To know more about Pandas data types, read <a href="https://datacarpentry.org/python-ecology-lesson/04-data-types-and-format/">this</a>.</p>
<p>The four features that we will be working with most frequently henceforth are <code>Installs</code>, <code>Size</code>, <code>Rating</code> and <code>Price</code>. <code>Size</code>, <code>Rating</code> and <code>Price</code> are already <code>float</code> (i.e. purely numerical data types), so only <code>Installs</code> still needs to be converted to a numeric type.</p>

In [13]:
import numpy as np

# Convert Installs to integer data type
apps["Installs"] = apps["Installs"].astype(int)

# Checking dtypes of the apps dataframe
print(apps.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9659 entries, 0 to 9658
Data columns (total 9 columns):
App             9659 non-null object
Category        9659 non-null object
Rating          8196 non-null float64
Reviews         9659 non-null int64
Size            8432 non-null float64
Installs        9659 non-null int64
Type            9659 non-null object
Price           9659 non-null float64
Last Updated    9659 non-null object
dtypes: float64(3), int64(2), object(4)
memory usage: 754.6+ KB
None
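
`astype(int)` raises an error if any value cannot be parsed. When a column might still contain stray text, a more forgiving conversion can be sketched with `pd.to_numeric`. The series below is hypothetical and deliberately includes one bad entry.

```python
import pandas as pd

# Hypothetical cleaned strings, including one value that is not a number
installs = pd.Series(["10000", "500000", "Free"])

# errors="coerce" turns unparseable entries into NaN instead of raising
installs_numeric = pd.to_numeric(installs, errors="coerce")

# Count how many entries could not be converted
n_failed = installs_numeric.isna().sum()

print(installs_numeric)
print(n_failed)
```

Note that a column containing `NaN` comes back as `float64` rather than `int64`, so this trades strictness for robustness.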


## 4. Find statistics about price, rating and quantity by category


In [8]:
# Count the number of apps in each category
app_category_info = pd.DataFrame(apps.groupby("Category")["App"].count())

app_category_info = app_category_info.rename({"App": "Number of apps"}, axis='columns')

# Select the column before aggregating so that text columns are never averaged
app_category_info["Average price"] = apps.groupby("Category")["Price"].mean()

app_category_info["Average rating"] = apps.groupby("Category")["Rating"].mean()

app_category_info

Category,Number of apps,Average price,Average rating
ART_AND_DESIGN,64,0.093281,4.357377
AUTO_AND_VEHICLES,85,0.158471,4.190411
BEAUTY,53,0.0,4.278571
BOOKS_AND_REFERENCE,222,0.539505,4.34497
BUSINESS,420,0.417357,4.098479
COMICS,56,0.0,4.181481
COMMUNICATION,315,0.263937,4.121484
DATING,171,0.160468,3.970149
EDUCATION,119,0.150924,4.364407
ENTERTAINMENT,102,0.078235,4.135294


## 5. Top 10 free FINANCE apps having the highest average sentiment score

In [9]:
free_finance_apps = apps[(apps["Category"] == "FINANCE") & (apps["Type"] == "Free")]
free_finance_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
837,K PLUS,FINANCE,4.4,124424,,10000000,Free,0.0,"June 26, 2018"
838,ING Banking,FINANCE,4.4,39041,,1000000,Free,0.0,"August 3, 2018"
839,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018"
840,The postal bank,FINANCE,3.7,36718,,5000000,Free,0.0,"July 16, 2018"
841,KTB Netbank,FINANCE,3.8,42644,19.0,5000000,Free,0.0,"June 28, 2018"


In [10]:
# Load user_reviews.csv
reviews_df = pd.read_csv("datasets/user_reviews.csv")

# Join the two dataframes on the shared App column
merged_df = free_finance_apps.merge(reviews_df, on='App')

# Drop rows with a missing Sentiment Score
merged_df = merged_df.dropna(subset=['Sentiment Score'])

In [11]:
merged_df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated,Review,Sentiment Category,Sentiment Score
0,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018","Forget paying app, designed make fail payments...",Negative,-0.5
1,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018","It's working expected, talking best bank Mexic...",Positive,0.4
2,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018",It has many problems with Android 8.1. You can...,Positive,0.25
3,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018","I changed my phone to a Xiaomi Redmi Note 5, t...",Positive,0.175
4,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018",In her eagerness to make her look pretty with ...,Negative,-0.158333


In [12]:
top_10_user_feedback = merged_df.sort_values(by="Sentiment Score", ascending=False).head(10)
top_10_user_feedback

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated,Review,Sentiment Category,Sentiment Score
2063,A+ Mobile,FINANCE,3.9,730,6.3,10000,Free,0.0,"June 26, 2018",AWESOME!!! Thank You.,Positive,1.0
1344,E*TRADE Mobile,FINANCE,3.9,10658,58.0,1000000,Free,0.0,"August 2, 2018",This incredible !,Positive,1.0
1416,Current debit card and app made for teens,FINANCE,4.3,685,21.0,50000,Free,0.0,"August 3, 2018",It's best way get kid understand money.,Positive,1.0
195,BBVA Spain,FINANCE,4.2,36746,,5000000,Free,0.0,"July 24, 2018",Version 2018 works perfect. Zero problems.,Positive,1.0
1269,CNBC: Breaking Business News & Live Market Data,FINANCE,4.2,24647,,1000000,Free,0.0,"July 13, 2018",Best finance news source,Positive,1.0
2070,A+ Mobile,FINANCE,3.9,730,6.3,10000,Free,0.0,"June 26, 2018",APFCU greatest !!!,Positive,1.0
1029,Branch,FINANCE,4.6,69973,3.8,1000000,Free,0.0,"July 23, 2018","Branch best apps,they gave 1 week repay loan w...",Positive,1.0
1560,BBVA Compass Banking,FINANCE,4.3,5905,50.0,500000,Free,0.0,"July 27, 2018",Best banking ever!,Positive,1.0
243,Banorte Movil,FINANCE,4.1,111632,65.0,1000000,Free,0.0,"June 9, 2018",To review the charges generated with the credi...,Positive,1.0
2071,A+ Mobile,FINANCE,3.9,730,6.3,10000,Free,0.0,"June 26, 2018",LOVE IT!!!!,Positive,1.0
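
The table above ranks individual reviews, which is why the same app can appear more than once. To rank apps by their average sentiment score instead, one option is to group by `App` before sorting. The sketch below uses hypothetical data (`toy_reviews` is made up); on the real data the same steps would presumably be applied to `merged_df`.

```python
import pandas as pd

# Hypothetical merged reviews: one row per review
toy_reviews = pd.DataFrame({
    "App": ["X", "X", "Y", "Y", "Z"],
    "Sentiment Score": [1.0, -0.5, 0.5, 0.25, 0.1],
})

# Average the score per app, then sort from most to least positive
avg_sentiment = (
    toy_reviews.groupby("App")["Sentiment Score"]
    .mean()
    .sort_values(ascending=False)
)

# Keep at most the ten highest-scoring apps
top_apps = avg_sentiment.head(10)

print(top_apps)
```

In this toy example app `X` has the single most positive review but only the second-highest average, which illustrates how the two rankings can differ.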
