## 1. Introduction
<p><img src="https://assets.datacamp.com/production/project_1197/img/google_play_store.png" alt="Google Play logo"></p>
<p>Mobile apps are everywhere. They are easy to create and can be very lucrative from a business standpoint. Android in particular is expanding as an operating system and has captured more than 74% of the total market<sup><a href="https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009">[1]</a></sup>.</p>
<p>The Google Play Store apps data has enormous potential to facilitate data-driven decisions and insights for businesses. In this notebook, we will analyze the Android app market by comparing ~10k apps in Google Play across different categories. We will also use the user reviews to draw a qualitative comparison between the apps.</p>
<p>The dataset you will use here was scraped from Google Play Store in September 2018 and was published on <a href="https://www.kaggle.com/lava18/google-play-store-apps">Kaggle</a>. Here are the details: <br>
<br></p>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/apps.csv</b></div>
This file contains all the details of the apps on Google Play. There are 9 features that describe a given app.
<ul>
    <li><b>App:</b> Name of the app</li>
    <li><b>Category:</b> Category of the app. Some examples are: ART_AND_DESIGN, FINANCE, COMICS, BEAUTY etc.</li>
    <li><b>Rating:</b> The current average rating (out of 5) of the app on Google Play</li>
    <li><b>Reviews:</b> Number of user reviews given on the app</li>
    <li><b>Size:</b> Size of the app in MB (megabytes)</li>
    <li><b>Installs:</b> Number of times the app was downloaded from Google Play</li>
    <li><b>Type:</b> Whether the app is paid or free</li>
    <li><b>Price:</b> Price of the app in US$</li>
    <li><b>Last Updated:</b> Date on which the app was last updated on Google Play </li>

</ul>
</div>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/user_reviews.csv</b></div>
This file contains a random sample of 100 <i><a href="https://www.androidpolice.com/2019/01/21/google-play-stores-redesigned-ratings-and-reviews-section-lets-you-easily-filter-by-star-rating/">most helpful first</a></i> user reviews for each app. The text in each review has been pre-processed and passed through a sentiment analyzer.
<ul>
    <li><b>App:</b> Name of the app on which the user review was provided. Matches the <code>App</code> column of the <code>apps.csv</code> file</li>
    <li><b>Review:</b> The pre-processed user review text</li>
    <li><b>Sentiment Category:</b> Sentiment category of the user review - Positive, Negative or Neutral</li>
    <li><b>Sentiment Score:</b> Sentiment score of the user review. It lies between [-1,1]. A higher score denotes a more positive sentiment.</li>

</ul>
</div>
<p>From here on, it will be your task to explore and manipulate the data until you are able to answer the three questions described in the instructions panel.<br></p>

In [37]:
# Use this cell to begin your analysis, and add as many as you would like!
# importing libraries
import pandas as pd

# Reading and Exploring Data 

In [38]:
# reading data file and exploring data
apps = pd.read_csv('datasets/apps.csv')
print('Shape of data is:', apps.shape)
apps.sample(5)

Shape of data is: (9659, 9)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
4995,Outdoor Movies BC,EVENTS,2.9,7,,500+,Free,0.0,"June 22, 2018"
6427,CK Life,HEALTH_AND_FITNESS,,3,60.0,100+,Free,0.0,"August 2, 2018"
1153,Monitor Your Weight,HEALTH_AND_FITNESS,4.5,126017,4.3,"5,000,000+",Free,0.0,"July 5, 2018"
8174,EF Lens Simulator,PHOTOGRAPHY,,2,33.0,100+,Paid,9.99,"February 27, 2017"
5632,IKEA Store,SHOPPING,3.4,25515,,"10,000,000+",Free,0.0,"July 16, 2018"


In [39]:
apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 9 columns):
App             9659 non-null object
Category        9659 non-null object
Rating          8196 non-null float64
Reviews         9659 non-null int64
Size            8432 non-null float64
Installs        9659 non-null object
Type            9659 non-null object
Price           9659 non-null float64
Last Updated    9659 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 679.2+ KB
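
The `info()` output above shows that `Rating` and `Size` have fewer non-null entries than the other columns. A minimal sketch of counting the missing values per column is given below. It assumes a small hypothetical frame, `toy_apps`, built from three of the sample rows shown earlier, as a stand-in for `datasets/apps.csv`.

```python
import pandas as pd

# Hypothetical stand-in for apps.csv: three rows copied from the sample output above.
toy_apps = pd.DataFrame({
    'App': ['Outdoor Movies BC', 'CK Life', 'Monitor Your Weight'],
    'Rating': [2.9, float('nan'), 4.5],
    'Size': [float('nan'), 60.0, 4.3],
})

# Count the missing values in each column.
missing_counts = toy_apps.isna().sum()
print(missing_counts)
```

Applied to the full `apps` dataframe, the counts implied by `info()` would be 9659 - 8196 = 1463 missing ratings and 9659 - 8432 = 1227 missing sizes.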


# Data cleaning and processing

### instruction 1 

In [40]:
# removing special characters from the Installs column so that its type can be changed
character_to_remove = [',','+']
for char in character_to_remove:
    # regex=False treats each character literally ('+' is a regex metacharacter)
    apps['Installs'] = apps['Installs'].str.replace(char, '', regex=False)

apps.sample(10)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
8826,Survival Forest : Survivor Home Builder,FAMILY,3.6,9389,64.0,1000000,Free,0.0,"May 31, 2018"
1942,VetCode,MEDICAL,4.9,28,5.7,5000,Free,0.0,"July 7, 2018"
1752,Medical ID - In Case of Emergency (ICE),MEDICAL,4.6,717,5.4,5000,Paid,5.99,"May 31, 2018"
6245,Offline Jízdní řády CG Transit,MAPS_AND_NAVIGATION,4.6,7314,7.0,100000,Free,0.0,"September 7, 2017"
7979,3G DZ Configuration,COMMUNICATION,4.3,488,2.5,50000,Free,0.0,"November 19, 2015"
6685,Sniper Killer Shooter,GAME,4.0,119368,19.0,10000000,Free,0.0,"February 21, 2017"
3633,R Programing Offline Tutorial,BOOKS_AND_REFERENCE,5.0,4,3.9,1000,Free,0.0,"March 15, 2018"
2379,Fantasy Football Manager (FPL),SPORTS,4.4,63650,4.6,1000000,Free,0.0,"August 3, 2018"
8879,GO Launcher EX UI5.0 theme,PERSONALIZATION,4.3,111634,13.0,5000000,Free,0.0,"May 26, 2014"
7717,SmartCircle Remote DS,BUSINESS,,2,1.6,500,Free,0.0,"May 22, 2018"


In [41]:
# converting installs column from string into integer
apps['Installs'] = apps['Installs'].astype(int)
apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 9 columns):
App             9659 non-null object
Category        9659 non-null object
Rating          8196 non-null float64
Reviews         9659 non-null int64
Size            8432 non-null float64
Installs        9659 non-null int64
Type            9659 non-null object
Price           9659 non-null float64
Last Updated    9659 non-null object
dtypes: float64(3), int64(2), object(4)
memory usage: 679.2+ KB
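
The two cells above strip `,` and `+` in a loop and then cast the column to an integer type. As a possible alternative, both characters can be removed in one pass with a regex character class. The sketch below assumes a hypothetical series, `raw_installs`, holding strings in the same format as the original `Installs` column.

```python
import pandas as pd

# Hypothetical sample of raw Installs strings, in the format shown in the sample output.
raw_installs = pd.Series(['500+', '100+', '5,000,000+', '10,000,000+'])

# Remove commas and plus signs with one character-class regex, then convert to integers.
clean_installs = raw_installs.str.replace(r'[,+]', '', regex=True).astype(int)
print(clean_installs.tolist())
```

Inside a character class the `+` needs no escaping, so `[,+]` matches either character literally.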


### instruction 2

In [42]:
# Grouping by category and finding the number of apps, the mean price and the mean rating
app_category_info = apps.groupby('Category').agg({'App' :'count' , 'Price' : 'mean', 'Rating': 'mean'})
app_category_info.head(10)

Category,App,Price,Rating
ART_AND_DESIGN,64,0.093281,4.357377
AUTO_AND_VEHICLES,85,0.158471,4.190411
BEAUTY,53,0.0,4.278571
BOOKS_AND_REFERENCE,222,0.539505,4.34497
BUSINESS,420,0.417357,4.098479
COMICS,56,0.0,4.181481
COMMUNICATION,315,0.263937,4.121484
DATING,171,0.160468,3.970149
EDUCATION,119,0.150924,4.364407
ENTERTAINMENT,102,0.078235,4.135294


In [43]:
# renaming columns
app_category_info.rename(columns = {'App':'Number of apps' , 'Price':'Average price', 'Rating':'Average rating'}, inplace = True)
app_category_info.head()

Category,Number of apps,Average price,Average rating
ART_AND_DESIGN,64,0.093281,4.357377
AUTO_AND_VEHICLES,85,0.158471,4.190411
BEAUTY,53,0.0,4.278571
BOOKS_AND_REFERENCE,222,0.539505,4.34497
BUSINESS,420,0.417357,4.098479
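
The aggregation and the renaming above could also be done in a single step with pandas named aggregation. This is only a sketch under assumptions: `toy_apps` is a hypothetical three-row frame, and the output column names are passed through a dictionary because they contain spaces.

```python
import pandas as pd

# Hypothetical stand-in for the apps dataframe.
toy_apps = pd.DataFrame({
    'App': ['a', 'b', 'c'],
    'Category': ['FINANCE', 'FINANCE', 'BEAUTY'],
    'Price': [0.0, 2.0, 0.0],
    'Rating': [4.0, 3.0, float('nan')],
})

# Named aggregation: each output column maps to (input column, aggregation function).
summary = toy_apps.groupby('Category').agg(
    **{
        'Number of apps': ('App', 'count'),
        'Average price': ('Price', 'mean'),
        'Average rating': ('Rating', 'mean'),
    }
)
print(summary)
```

Note that `mean` skips missing ratings, so a category whose apps are all unrated ends up with a missing average rather than zero.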


### instruction 3 

In [44]:
# reading and exploring the user_reviews dataset
reviews = pd.read_csv('datasets/user_reviews.csv')
print('The shape of data is: ', reviews.shape)
reviews.head()

The shape of data is:  (64295, 4)


Unnamed: 0,App,Review,Sentiment Category,Sentiment Score
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25
2,10 Best Foods for You,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4
4,10 Best Foods for You,Best idea us,Positive,1.0
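
Row 2 in the output above has no review text and no sentiment values. If such rows should be excluded before any further analysis, one option is `dropna` restricted to the score column. The sketch below uses a hypothetical frame, `toy_reviews`, modeled on the rows shown above.

```python
import pandas as pd

# Hypothetical stand-in for user_reviews.csv, modeled on the rows shown above.
toy_reviews = pd.DataFrame({
    'App': ['10 Best Foods for You'] * 3,
    'Review': ['This help eating healthy exercise regular basis', None, 'Best idea us'],
    'Sentiment Score': [0.25, float('nan'), 1.0],
})

# Keep only the reviews that actually carry a sentiment score.
scored_reviews = toy_reviews.dropna(subset=['Sentiment Score'])
print(len(scored_reviews))
```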


In [45]:
# extracting finance apps (free and paid) from the apps dataframe
finance_apps = apps[apps['Category'] == 'FINANCE'] 
finance_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
837,K PLUS,FINANCE,4.4,124424,,10000000,Free,0.0,"June 26, 2018"
838,ING Banking,FINANCE,4.4,39041,,1000000,Free,0.0,"August 3, 2018"
839,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018"
840,The postal bank,FINANCE,3.7,36718,,5000000,Free,0.0,"July 16, 2018"
841,KTB Netbank,FINANCE,3.8,42644,19.0,5000000,Free,0.0,"June 28, 2018"
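
The filter above keeps every `FINANCE` app regardless of `Type`. If only free apps are wanted, the two conditions can be combined with `&`. The sketch below assumes a hypothetical frame, `toy_apps`, with made-up app names.

```python
import pandas as pd

# Hypothetical stand-in for the apps dataframe (app names are made up).
toy_apps = pd.DataFrame({
    'App': ['Bank A', 'Bank B', 'Game C'],
    'Category': ['FINANCE', 'FINANCE', 'GAME'],
    'Type': ['Free', 'Paid', 'Free'],
})

# Each condition is wrapped in parentheses because & binds tighter than ==.
is_free_finance = (toy_apps['Category'] == 'FINANCE') & (toy_apps['Type'] == 'Free')
free_finance_apps = toy_apps[is_free_finance]
print(free_finance_apps)
```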


In [46]:
# merging the two dataframes on the App column
merged_finance = pd.merge(reviews, finance_apps, on = 'App')
print('The shape of data is: ', merged_finance.shape)

The shape of data is:  (2200, 12)
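
`pd.merge` performs an inner join by default, so only reviews whose `App` also appears in the finance subset survive. The 12 columns are the 4 review columns plus the 9 app columns, minus the shared `App` key. The sketch below makes the join type explicit and adds a `validate` check. It assumes two hypothetical frames, `toy_reviews` and `toy_finance`, with made-up app names.

```python
import pandas as pd

# Hypothetical stand-ins for the reviews and finance_apps dataframes.
toy_reviews = pd.DataFrame({
    'App': ['Bank A', 'Bank A', 'Game C'],
    'Sentiment Score': [0.5, -0.1, 0.9],
})
toy_finance = pd.DataFrame({
    'App': ['Bank A', 'Bank B'],
    'Category': ['FINANCE', 'FINANCE'],
})

# Inner join on App; validate raises MergeError if an app name repeats on the right side.
merged = pd.merge(toy_reviews, toy_finance, on='App', how='inner', validate='many_to_one')
print(merged.shape)
```

Here `Game C` is dropped because it is not a finance app, and `Bank B` is dropped because it has no reviews.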


In [47]:
# Grouping by app and determining average sentiment score
sentiment_score = merged_finance.groupby('App').agg({'Sentiment Score' : 'mean'})
sentiment_score.head()

App,Sentiment Score
A+ Mobile,0.329592
ACE Elite,0.252171
Acorns - Invest Spare Change,0.046667
Amex Mobile,0.175666
Associated Credit Union Mobile,0.388093


In [48]:
# Sorting sentiment score
final_df = sentiment_score.sort_values('Sentiment Score', ascending = False)
final_df.head()

App,Sentiment Score
BBVA Spain,0.515086
Associated Credit Union Mobile,0.388093
BankMobile Vibe App,0.353455
A+ Mobile,0.329592
Current debit card and app made for teens,0.327258


In [49]:
# displaying top 10 apps
top_10_user_feedback = final_df.head(10)
top_10_user_feedback

App,Sentiment Score
BBVA Spain,0.515086
Associated Credit Union Mobile,0.388093
BankMobile Vibe App,0.353455
A+ Mobile,0.329592
Current debit card and app made for teens,0.327258
BZWBK24 mobile,0.326883
"Even - organize your money, get paid early",0.283929
Credit Karma,0.270052
Fortune City - A Finance App,0.266966
Branch,0.26423
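
The group, sort and `head` steps above can also be written more compactly with `Series.nlargest`. This is a sketch under assumptions rather than the notebook's own method: `toy_merged` is a hypothetical frame with made-up app names, and `n` is 2 instead of 10 to keep the example small.

```python
import pandas as pd

# Hypothetical stand-in for merged_finance; Bank B has no scored review.
toy_merged = pd.DataFrame({
    'App': ['Bank A', 'Bank A', 'Bank B', 'Bank C', 'Bank C'],
    'Sentiment Score': [0.5, 0.1, float('nan'), 0.9, 0.7],
})

# Mean score per app; the mean skips missing scores, so Bank B's mean is NaN.
mean_scores = toy_merged.groupby('App')['Sentiment Score'].mean()

# Take the n highest means, already sorted in descending order.
top_apps = mean_scores.nlargest(2)
print(top_apps)
```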
