# The Android App Market on Google Play
## Introduction
<p>The dataset you will use here was scraped from the Google Play Store in September 2018 and published on <a href="https://www.kaggle.com/lava18/google-play-store-apps">Kaggle</a>. Here are the details:</p>
<br>
<br>
<div style="font-size:20px"><b>datasets/apps.csv</b></div>
<p>This file contains all the details of the apps on Google Play. There are 9 features that describe a given app.</p>
<ul>
    <li><b>App:</b> Name of the app</li>
    <li><b>Category:</b> Category of the app. Some examples are: ART_AND_DESIGN, FINANCE, COMICS, BEAUTY etc.</li>
    <li><b>Rating:</b> The current average rating (out of 5) of the app on Google Play</li>
    <li><b>Reviews:</b> Number of user reviews given on the app</li>
    <li><b>Size:</b> Size of the app in MB (megabytes)</li>
    <li><b>Installs:</b> Number of times the app was downloaded from Google Play</li>
    <li><b>Type:</b> Whether the app is paid or free</li>
    <li><b>Price:</b> Price of the app in US dollars</li>
    <li><b>Last Updated:</b> Date on which the app was last updated on Google Play </li>
</ul>
<div style="font-size:20px"><b>datasets/user_reviews.csv</b></div>
<p>This file contains a random sample of 100 user reviews for each app. The text in each review has been pre-processed and passed through a sentiment analyzer.</p>
<ul>
    <li><b>App:</b> Name of the app on which the user review was provided. Matches the <code>App</code> column of the <code>apps.csv</code> file</li>
    <li><b>Review:</b> The pre-processed user review text</li>
    <li><b>Sentiment Category:</b> Sentiment category of the user review - Positive, Negative or Neutral</li>
    <li><b>Sentiment Score:</b> Sentiment score of the user review. It lies in the range [-1, 1]; a higher score denotes a more positive sentiment.</li>
</ul>
<p>From here on, it will be your task to explore and manipulate the data until you are able to answer the three questions described in the instructions panel.</p>

<p>1. Read the <b>apps.csv</b> file and clean the <b>Installs</b> column to convert it to an integer data type. Save your answer as a DataFrame <b>apps</b>. Going forward, you will do all your analysis on the apps DataFrame.</p>

In [1]:
# Import modules
import pandas as pd

In [2]:
# Import dataset
apps = pd.read_csv('./datasets/apps.csv')
print(apps.head())

                                                 App        Category  Rating  \
0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1   
1                                Coloring book moana  ART_AND_DESIGN     3.9   
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7   
3                              Sketch - Draw & Paint  ART_AND_DESIGN     4.5   
4              Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3   

   Reviews  Size     Installs  Type  Price      Last Updated  
0      159  19.0      10,000+  Free    0.0   January 7, 2018  
1      967  14.0     500,000+  Free    0.0  January 15, 2018  
2    87510   8.7   5,000,000+  Free    0.0    August 1, 2018  
3   215644  25.0  50,000,000+  Free    0.0      June 8, 2018  
4      967   2.8     100,000+  Free    0.0     June 20, 2018  


In [3]:
# Change the data type of the Installs column to integer:
# strip the thousands separators and the trailing '+', then cast
apps['Installs'] = (apps['Installs']
                    .str.replace(',', '', regex=False)
                    .str.rstrip('+')
                    .astype('int64'))
print(apps.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   App           9659 non-null   object 
 1   Category      9659 non-null   object 
 2   Rating        8196 non-null   float64
 3   Reviews       9659 non-null   int64  
 4   Size          8432 non-null   float64
 5   Installs      9659 non-null   int64  
 6   Type          9659 non-null   object 
 7   Price         9659 non-null   float64
 8   Last Updated  9659 non-null   object 
dtypes: float64(3), int64(2), object(4)
memory usage: 679.3+ KB
None


In [4]:
# Another possible solution, using a for loop:

# int_list = []
# for value in apps['Installs']:
#     value_cleaned = value.rstrip('+').replace(',', '')
#     value_int = int(value_cleaned)
#     int_list.append(value_int)

# apps['Installs']=pd.Series(int_list)  

# print(apps.info())
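
Before casting a scraped text column to integers, it can help to check for values that do not fit the expected pattern. The sketch below is only an illustration: the sample values and the names `raw_installs`, `stripped`, `numeric` and `bad_values` are made up and do not come from apps.csv. It uses `pd.to_numeric` with `errors='coerce'` so that unparseable entries become NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical sample in the same format as the Installs column;
# 'Free' is a deliberately malformed entry
raw_installs = pd.Series(['10,000+', '500,000+', '0', 'Free'])

# Drop the thousands separators and the trailing '+'
stripped = raw_installs.str.replace(',', '', regex=False).str.rstrip('+')

# errors='coerce' turns anything non-numeric into NaN instead of raising
numeric = pd.to_numeric(stripped, errors='coerce')

# Show the original strings that could not be converted
bad_values = raw_installs[numeric.isna()]
print(bad_values.tolist())
```

If `bad_values` comes back empty on the real column, casting the whole column to an integer type should be safe.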

<p>2. Find <b>the number of apps</b> in each category, <b>the average price</b>, and <b>the average rating</b>. Save your answer as a DataFrame <b>app_category_info</b>. You should rename the four columns as: Category, Number of apps, Average price, Average rating.</p>

In [5]:
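# Count the apps and average the price and rating within each category
app_category_info = apps.groupby('Category').agg({'App': 'count', 'Price': 'mean', 'Rating': 'mean'})

# Rename the aggregated columns and turn the Category index into a regular column
app_category_info = app_category_info.rename(
    columns={'App': 'Number of apps', 'Price': 'Average price', 'Rating': 'Average rating'}
).reset_index()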
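
The same kind of summary can also be written with pandas named aggregation, which sets the output column names in the `agg` call itself. This is a minimal sketch on made-up data (`toy_apps` and `toy_info` are hypothetical names, not part of the exercise). It also shows that `mean` skips missing values, which matters here because the `Rating` column has fewer non-null entries than the other columns.

```python
import pandas as pd

# Hypothetical toy data, not taken from apps.csv
toy_apps = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Category': ['FINANCE', 'FINANCE', 'COMICS', 'COMICS'],
    'Price': [0.0, 2.0, 0.0, 0.0],
    'Rating': [4.0, None, 3.0, 5.0],
})

# Named aggregation: each key is an output column,
# each value is an (input column, function) pair
toy_info = (
    toy_apps.groupby('Category')
    .agg(**{
        'Number of apps': ('App', 'count'),
        'Average price': ('Price', 'mean'),
        'Average rating': ('Rating', 'mean'),
    })
    .reset_index()  # make Category a regular column
)
print(toy_info)
```

The FINANCE group has one missing rating, so its average rating is computed from the single available value.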

<p>3. Find <b>the top 10 free FINANCE apps</b> having <b>the highest average sentiment score</b>. Save your answer as a DataFrame <b>top_10_user_feedback</b>. Your answer should have exactly 10 rows and two columns named: App and Sentiment Score, where the average Sentiment Score is sorted from highest to lowest.</p>

In [6]:
# Import dataset
user_reviews = pd.read_csv('./datasets/user_reviews.csv')

In [7]:
# Merge the two DataFrames
merged = apps.merge(user_reviews, on = 'App')
merged.head()

,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated,Review,Sentiment Category,Sentiment Score
0,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,500000,Free,0.0,"January 15, 2018",A kid's excessive ads. The types ads allowed a...,Negative,-0.25
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,500000,Free,0.0,"January 15, 2018",It bad >:(,Negative,-0.725
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,500000,Free,0.0,"January 15, 2018",like,Neutral,0.0
3,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,500000,Free,0.0,"January 15, 2018",,,
4,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,500000,Free,0.0,"January 15, 2018",I love colors inspyering,Positive,0.5
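
`merge` performs an inner join by default, so apps with no reviews and reviews for unknown apps are both dropped, and each app row is repeated once per matching review. The sketch below illustrates this on made-up frames (`toy_apps`, `toy_reviews` and `toy_merged` are hypothetical names). The `validate='one_to_many'` argument is an optional safeguard that assumes app names are unique on the left side; it raises an error if they are not.

```python
import pandas as pd

# Hypothetical toy frames, not taken from the real CSV files
toy_apps = pd.DataFrame({'App': ['a', 'b', 'c'],
                         'Type': ['Free', 'Paid', 'Free']})
toy_reviews = pd.DataFrame({
    'App': ['a', 'a', 'c', 'z'],
    'Sentiment Score': [0.5, None, -0.25, 1.0],
})

# Default how='inner' keeps only apps present in both frames;
# validate= checks that each app appears at most once on the left
toy_merged = toy_apps.merge(toy_reviews, on='App', validate='one_to_many')
print(toy_merged)
```

App `b` (no reviews) and review app `z` (no matching app) are absent from the result, while the review with a missing score is kept as NaN, as in row 3 of the output above.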


In [8]:
# Select only the rows where the category is finance and the apps are free
free_finance_apps = merged[(merged['Category'] == "FINANCE") 
                           & (merged['Type'] == "Free")][['App', 'Sentiment Score']]

# Calculate the mean sentiment score, sort the results and limit them to the 10 highest
mean_sentiment_score = free_finance_apps.groupby('App')['Sentiment Score'].mean()
top_10_user_feedback = mean_sentiment_score.sort_values(ascending = False).iloc[:10]
# Convert the Series to a one-column DataFrame indexed by App
top_10_user_feedback = top_10_user_feedback.to_frame()
top_10_user_feedback

App,Sentiment Score
BBVA Spain,0.515086
Associated Credit Union Mobile,0.388093
BankMobile Vibe App,0.353455
A+ Mobile,0.329592
Current debit card and app made for teens,0.327258
BZWBK24 mobile,0.326883
"Even - organize your money, get paid early",0.283929
Credit Karma,0.270052
Fortune City - A Finance App,0.266966
Branch,0.26423
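
As an alternative to sorting and slicing, `Series.nlargest` selects the highest values in one step. This is a minimal sketch on made-up scores (`toy_scores` and `top_2` are hypothetical names). Calling `reset_index()` at the end makes `App` a regular column, so the result has the two columns App and Sentiment Score.

```python
import pandas as pd

# Hypothetical toy data, not taken from user_reviews.csv
toy_scores = pd.DataFrame({
    'App': ['a', 'a', 'b', 'b', 'c'],
    'Sentiment Score': [0.25, 0.75, 1.0, None, -0.5],
})

top_2 = (
    toy_scores.groupby('App')['Sentiment Score']
    .mean()          # missing scores are skipped
    .nlargest(2)     # highest means first
    .reset_index()   # App becomes a regular column
)
print(top_2)
```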
