## 1. Introduction
<p><img src="https://assets.datacamp.com/production/project_1197/img/google_play_store.png" alt="Google Play logo"></p>
<p>Mobile apps are everywhere. They are easy to create and can be very lucrative from a business standpoint. Specifically, Android is expanding as an operating system and has captured more than 74% of the total market<sup><a href="https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009">[1]</a></sup>. </p>
<p>The Google Play Store apps data has enormous potential to facilitate data-driven decisions and insights for businesses. In this notebook, we will analyze the Android app market by comparing ~10k apps in Google Play across different categories. We will also use the user reviews to draw a qualitative comparison between the apps.</p>
<p>The dataset you will use here was scraped from Google Play Store in September 2018 and was published on <a href="https://www.kaggle.com/lava18/google-play-store-apps">Kaggle</a>. Here are the details: <br>
<br></p>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/apps.csv</b></div>
This file contains all the details of the apps on Google Play. There are 9 features that describe a given app.
<ul>
    <li><b>App:</b> Name of the app</li>
    <li><b>Category:</b> Category of the app. Some examples are: ART_AND_DESIGN, FINANCE, COMICS, BEAUTY etc.</li>
    <li><b>Rating:</b> The current average rating (out of 5) of the app on Google Play</li>
    <li><b>Reviews:</b> Number of user reviews given on the app</li>
    <li><b>Size:</b> Size of the app in MB (megabytes)</li>
    <li><b>Installs:</b> Number of times the app was downloaded from Google Play</li>
    <li><b>Type:</b> Whether the app is paid or free</li>
    <li><b>Price:</b> Price of the app in US$</li>
    <li><b>Last Updated:</b> Date on which the app was last updated on Google Play </li>

</ul>
</div>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/user_reviews.csv</b></div>
This file contains a random sample of 100 <i><a href="https://www.androidpolice.com/2019/01/21/google-play-stores-redesigned-ratings-and-reviews-section-lets-you-easily-filter-by-star-rating/">most helpful first</a></i> user reviews for each app. The text in each review has been pre-processed and passed through a sentiment analyzer.
<ul>
    <li><b>App:</b> Name of the app on which the user review was provided. Matches the <code>App</code> column of the <code>apps.csv</code> file</li>
    <li><b>Review:</b> The pre-processed user review text</li>
    <li><b>Sentiment Category:</b> Sentiment category of the user review - Positive, Negative or Neutral</li>
    <li><b>Sentiment Score:</b> Sentiment score of the user review. It lies between [-1,1]. A higher score denotes a more positive sentiment.</li>

</ul>
</div>
<p>From here on, it will be your task to explore and manipulate the data until you are able to answer the three questions described in the instructions panel.<br></p>
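
Before any analysis, it can help to confirm that the review data matches the description above. The following is a minimal sketch on a hypothetical toy frame (`toy_reviews` and its rows are invented for illustration, not taken from `datasets/user_reviews.csv`); it checks that every sentiment score lies in [-1, 1] and that every sentiment category is one of the three documented labels.

```python
import pandas as pd

# Hypothetical stand-in for datasets/user_reviews.csv (values invented for illustration)
toy_reviews = pd.DataFrame({
    "App": ["Bank A", "Bank A", "Game C"],
    "Review": ["works fine", "keeps crashing", "ok"],
    "Sentiment Category": ["Positive", "Negative", "Neutral"],
    "Sentiment Score": [0.4, -0.5, 0.0],
})

# Ignore rows without a score so missing values do not fail the range check
scored = toy_reviews.dropna(subset=["Sentiment Score"])

# Every score should lie in the documented interval [-1, 1]
scores_in_range = bool(scored["Sentiment Score"].between(-1, 1).all())

# Every category should be one of the three documented labels
expected_labels = ["Positive", "Negative", "Neutral"]
categories_ok = bool(scored["Sentiment Category"].isin(expected_labels).all())

print(scores_in_range, categories_ok)
```

The same two checks could be run on the real file after reading it with `pd.read_csv`, assuming its columns are named as listed above.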

## Solution from DataCamp (my own solution follows after this cell)

In [111]:
# Importing the pandas module
import pandas as pd
apps = pd.read_csv("datasets/apps.csv")

chars_to_remove = ['+', ',']

for char in chars_to_remove:
    apps['Installs'] = apps['Installs'].apply(lambda x: x.replace(char, ''))
    
apps['Installs'] = apps['Installs'].astype(int)

app_category_info = apps.groupby('Category').agg({'App': 'count', 'Price': 'mean', 'Rating': 'mean'})

# Rename the columns for easier understanding
app_category_info = app_category_info.rename(columns={"App": "Number of apps", "Price": "Average price", "Rating": "Average rating"})


# QUESTION 3
# Read datasets/user_reviews.csv
reviews = pd.read_csv('datasets/user_reviews.csv')

# Select finance apps
finance_apps = apps[apps['Category'] == 'FINANCE']
# Select free finance apps
free_finance_apps = finance_apps[finance_apps['Type'] == 'Free']
# We can also combine the two conditions in a single line of code using the & operator
# free_finance_apps = apps[(apps['Category'] == 'FINANCE') & (apps['Type'] == 'Free')]

# Join the dataframes
merged_df = pd.merge(free_finance_apps, reviews, on = "App", how = "inner")
# The default value of "how" argument is "inner", so we can skip specifying it. 
# But it is a good practice to specify the type of your join for better code readability.

# Find the average sentiment score for each app
app_sentiment_score = merged_df.groupby('App').agg({'Sentiment Score' :'mean'})

# Sort the average sentiment score from highest to lowest (ie - in decreasing order)
user_feedback = app_sentiment_score.sort_values(by = 'Sentiment Score', ascending = False)

# select first 10
top_10_user_feedback = user_feedback[:10]
top_10_user_feedback

App,Sentiment Score
BBVA Spain,0.515086
Associated Credit Union Mobile,0.388093
BankMobile Vibe App,0.353455
A+ Mobile,0.329592
Current debit card and app made for teens,0.327258
BZWBK24 mobile,0.326883
"Even - organize your money, get paid early",0.283929
Credit Karma,0.270052
Fortune City - A Finance App,0.266966
Branch,0.26423


### Problem 1
- Read the apps.csv file and clean the Installs column to convert it into integer data type. 
- Save your answer as a DataFrame apps. 
- Going forward, you will do all your analysis on the apps DataFrame.

In [112]:
import pandas as pd
import numpy as np

apps = pd.read_csv("datasets/apps.csv")
apps.head(2)  # inspect the first rows, including the raw Installs column

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,"10,000+",Free,0.0,"January 7, 2018"
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,"500,000+",Free,0.0,"January 15, 2018"


In [113]:
# The column to clean is Installs: remove the "," separators, then keep only the digits (this drops the trailing "+")

apps['Installs'] = apps['Installs'].str.replace(',', '')
apps['Installs'] = apps['Installs'].str.extract(r'(\d+)', expand=False).astype(int)
apps.head(2)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,10000,Free,0.0,"January 7, 2018"
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,500000,Free,0.0,"January 15, 2018"
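
As an alternative to the two steps above, both characters can be removed in a single regular-expression pass. This is a minimal sketch on a hypothetical toy Series (`raw_installs` is invented for illustration); `pd.to_numeric` is used so that any unexpected non-numeric value raises an error instead of being silently truncated.

```python
import pandas as pd

# Hypothetical stand-in for apps['Installs'] (values invented for illustration)
raw_installs = pd.Series(["10,000+", "500,000+", "0", "1,000,000+"])

# Remove every "+" and "," in one vectorized pass, then convert to integers
clean_installs = pd.to_numeric(
    raw_installs.str.replace(r"[+,]", "", regex=True),
    errors="raise",  # the default: fail loudly on anything that is not a number
).astype(int)

print(clean_installs.tolist())  # [10000, 500000, 0, 1000000]
```

Compared with `str.extract`, which returns only the first run of digits, this variant fails on malformed input, which may be preferable when the column is expected to contain nothing but counts.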


### Problem 2
- Find the number of apps in each category, the average price, and the average rating. 
- Save your answer as a DataFrame app_category_info. 
- You should rename the four columns as: Category, Number of apps, Average price, Average rating.

In [114]:
# Just trying out different grouping methods

# app_category_info1 = apps.groupby('Category').agg({'App':'count'})
# app_category_info1 = apps.groupby('Category')[['App', 'Price', 'Rating']].agg(['count','mean'])
# app_category_info1.head()

In [115]:
app_category_info = (apps
                    .groupby('Category')
                    .agg({
                        'App': 'count',
                        'Price': 'mean',
                        'Rating': 'mean'})
                    .rename({
                        "App": "Number of apps",
                        "Price": "Average price",
                        "Rating": "Average rating"}, axis=1)
                    .reset_index()
                    )

In [116]:
app_category_info.head(2)

Unnamed: 0,Category,Number of apps,Average price,Average rating
0,ART_AND_DESIGN,64,0.093281,4.357377
1,AUTO_AND_VEHICLES,85,0.158471,4.190411
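
The aggregate-then-rename pattern above can also be written with pandas named aggregation, which sets the output column names inside `agg` itself. A minimal sketch, assuming a hypothetical toy frame (`toy_apps` and its rows are invented for illustration); the `**{...}` unpacking is only needed because the target names contain spaces.

```python
import pandas as pd

# Hypothetical stand-in for the apps DataFrame (values invented for illustration)
toy_apps = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Category": ["FINANCE", "FINANCE", "COMICS", "COMICS"],
    "Price": [0.0, 2.0, 0.0, 0.0],
    "Rating": [4.0, 5.0, 3.0, None],
})

# Named aggregation: output name -> (input column, aggregation function)
toy_category_info = (
    toy_apps
    .groupby("Category")
    .agg(**{
        "Number of apps": ("App", "count"),
        "Average price": ("Price", "mean"),
        "Average rating": ("Rating", "mean"),
    })
    .reset_index()
)

print(toy_category_info)
```

Note that `mean` skips missing ratings, so a category's average rating is computed only over the apps that have one, while `count` on `App` still counts every app.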


### Problem 3

- Find the top 10 free FINANCE apps having the highest average sentiment score. 
- Save your answer as a DataFrame top_10_user_feedback. 
- Your answer should have exactly 10 rows and two columns named: App and Sentiment Score, 
  where the average Sentiment Score is sorted from highest to lowest.

In [117]:
# Filtering free FINANCE apps

free_finance_apps = apps[(apps['Category'] == 'FINANCE') & (apps['Type'] == 'Free')].reset_index(drop=True)
free_finance_apps.head(2)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
0,K PLUS,FINANCE,4.4,124424,,10000000,Free,0.0,"June 26, 2018"
1,ING Banking,FINANCE,4.4,39041,,1000000,Free,0.0,"August 3, 2018"


In [118]:
# Importing user_reviews and merging

user_reviews = pd.read_csv("datasets/user_reviews.csv")
merged_df = free_finance_apps.merge(user_reviews, on = 'App')
merged_df.head(2) 

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated,Review,Sentiment Category,Sentiment Score
0,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018","Forget paying app, designed make fail payments...",Negative,-0.5
1,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018","It's working expected, talking best bank Mexic...",Positive,0.4


In [119]:
top_10_user_feedback = (merged_df
                        .groupby('App')
                        .agg({'Sentiment Score': 'mean'})
                        .sort_values('Sentiment Score', ascending=False)
                        .reset_index(drop=False)
                        .head(10)
                        )

In [121]:
top_10_user_feedback

Unnamed: 0,App,Sentiment Score
0,BBVA Spain,0.515086
1,Associated Credit Union Mobile,0.388093
2,BankMobile Vibe App,0.353455
3,A+ Mobile,0.329592
4,Current debit card and app made for teens,0.327258
5,BZWBK24 mobile,0.326883
6,"Even - organize your money, get paid early",0.283929
7,Credit Karma,0.270052
8,Fortune City - A Finance App,0.266966
9,Branch,0.26423
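
The whole Problem 3 pipeline (filter, join, average, rank) can be condensed into one chain. The following is a minimal sketch on hypothetical toy frames (`toy_apps` and `toy_reviews` are invented for illustration); it combines both filter conditions in a single boolean mask and uses `nlargest` in place of sorting followed by `head`.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets (values invented for illustration)
toy_apps = pd.DataFrame({
    "App": ["Bank A", "Bank B", "Game C", "Bank D"],
    "Category": ["FINANCE", "FINANCE", "GAME", "FINANCE"],
    "Type": ["Free", "Paid", "Free", "Free"],
})
toy_reviews = pd.DataFrame({
    "App": ["Bank A", "Bank A", "Bank B", "Game C", "Bank D"],
    "Sentiment Score": [0.5, 0.1, 0.9, 0.8, 0.6],
})

# Both conditions must hold, so they are combined with & in one mask
is_free_finance = (toy_apps["Category"] == "FINANCE") & (toy_apps["Type"] == "Free")

toy_top = (
    toy_apps[is_free_finance]
    .merge(toy_reviews, on="App", how="inner")   # keep only apps that have reviews
    .groupby("App", as_index=False)["Sentiment Score"]
    .mean()                                      # average score per app
    .nlargest(10, "Sentiment Score")             # highest averages first, at most 10 rows
    .reset_index(drop=True)
)

print(toy_top)
```

In this toy data, the paid app "Bank B" has the highest score but is excluded by the mask, which is exactly the case that filtering on category alone would get wrong.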
