Some interesting questions that can be answered using this dataset are:

1. What apps are most reviewed? Of those, which ones have the highest rating?

2. 

### Dataset Context

The applications, or *apps*, offered in the Google Play store number in the millions, and the count is growing. At the time of writing, the Google Play store is estimated to hold 2.6 million applications. The creator of this dataset, Lavanya Gupta, obtained data on roughly 10,000 of these apps.

She obtained the dataset by scraping the store, which uses dynamic page loading. Dynamic page loading means that the store page displays apps based on what Google knows about the user requesting the page. Scraping means that she wrote a script that runs through the dynamically loaded page, reads the data, and writes it to a structured file, such as the csv file I will be working with for this project.

The dataset also includes a second file containing a sentiment analysis conducted on this sample of Google Play apps using the nltk (Natural Language Toolkit) Python library. The objective of that analysis is to understand what user reviews convey about users' opinions of these apps.

### Dataset Content

#### A. googleplaystore.csv

This file contains the main dataset. It has 10,841 rows of data with the following columns:

*App*: Name of the app.

*Category*: Category of the app, such as beauty, business, entertainment, or education.

*Rating*: How users rate the app out of 5, with 1 being the lowest rating and 5 being the highest.

*Reviews*: The number of user reviews each app has received.

*Size*: The memory size needed to install the application.

*Installs*: The number of times each application has been installed by users.

*Type*: Whether the app is free or a paid app.

*Price*: The price of the app.

*Content Rating*: The intended audience for the app, such as teens, mature audiences, or everyone.

*Genres*: The sub-category for each app. Example: for the Education category, this could be Education: Pretend Play.

*Last Updated*: Release date of the most recent update for the app.

*Current Ver*: The app's current version.

*Android Ver*: The oldest version of Android OS supported by the app.


#### B. googleplaystore_user_reviews.csv

This file contains the result of the sentiment analysis conducted by the dataset creator. It has 64,295 rows of data with the following columns:

*App*: Name of the app.

*Translated_Review*: Either the original review in English, or a translated version if the original review is in another language.

*Sentiment*: The result of the sentiment analysis conducted on a review. The value is either Positive, Neutral, or Negative.

*Sentiment_Polarity*: A value indicating the positivity or negativity of the sentiment. Values range from -1 (most negative) to 1 (most positive).

*Sentiment_Subjectivity*: A value from 0 to 1 indicating the subjectivity of the review. Lower values indicate the review is based on factual information, and higher values indicate the review is based on personal or public opinions or judgements.
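
The documented ranges for the two score columns can be sanity-checked once the file is loaded. The following is only a sketch of such a check: the three sample rows are invented for illustration and are not taken from the dataset, and the variable names are my own.

```python
import pandas as pd

# Invented sample rows, for illustration only (not taken from the dataset)
sample = pd.DataFrame({
    "Sentiment": ["Positive", "Neutral", "Negative"],
    "Sentiment_Polarity": [0.8, 0.0, -0.4],
    "Sentiment_Subjectivity": [0.9, 0.0, 0.6],
})

# Rows with a missing score are skipped; the rest must fall in the documented ranges
polarity_ok = sample["Sentiment_Polarity"].dropna().between(-1, 1).all()
subjectivity_ok = sample["Sentiment_Subjectivity"].dropna().between(0, 1).all()
print(polarity_ok, subjectivity_ok)
```

The same two checks could be run on the real dataframe after import; any `False` would point to rows worth inspecting.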

### Summary of Limitations

The limitations of the Google Play Store Apps data are:

1. The apps included reflect the dataset creator's activity on Google-related sites. She is a Machine Learning Software Developer based in India, so the apps returned were most likely selected based on their popularity in the region around India, while this analysis is intended for an audience in the U.S. or North America.


2. With cloud-based storage available for Android users at little or no cost, app size may have no significant contribution to app popularity. Therefore the *Size* column will be removed.


3. I am not sure whether all apps follow the same software versioning process, so I will assume the *Current Ver* column is irrelevant to this analysis. Otherwise it would have been useful for measuring how actively each app is supported by its developers.


4. I will assume that the vast majority of users can upgrade their Android devices to the latest version. Based on that, the *Android Ver* column will also be excluded. Any limitations that may justify relying on older versions of Android most probably do not apply to the majority of the population.


5. Scraping data off of a Google website is an unconventional way to obtain it, which may result in misplaced data. This largely depends on the scraper built by the dataset creator.


6. The sentiment analysis result is limited by the abilities of Python's nltk library, which does not support all languages. Reviews with unsupported languages will not be translated and should have no values within the analysis output.

I will start by importing the csv files into two Pandas dataframes, one called *app_data* which contains the main data on the applications, and another called *sentiment_data* containing the sentiment analysis results on app reviews.

In [101]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [102]:
#Import the googleplaystore.csv into a Pandas dataframe
app_data = pd.read_csv(r"C:\Users\Mohammad\Documents\Thinkful\7.11 Capstone 1 Analytic Report and Research Proposal\google-play-store-apps\googleplaystore.csv")

#Show the first 3 rows of the dataframe
app_data.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


The rows are in their original sorting order. While the most popular apps here would be the most popular according to usage in India and Lavanya's own user behavior, I am confident they are still recognizable to a North American audience because they are offered globally:

In [103]:
# Sort the original dataset by number of installs to see most popular apps first
app_data = app_data.sort_values(by="Installs", ascending=False)
app_data.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,
420,UC Browser - Fast Download Private & Secure,COMMUNICATION,4.5,17714850,40M,"500,000,000+",Free,0,Teen,Communication,"August 2, 2018",12.8.5.1121,4.0 and up
474,LINE: Free Calls & Messages,COMMUNICATION,4.2,10790289,Varies with device,"500,000,000+",Free,0,Everyone,Communication,"July 26, 2018",Varies with device,Varies with device


*Life Made WI-Fi Touchscreen Photo Frame* is listed at the top, but not because of a huge number of installs. Its values appear to be shifted into the wrong columns, which is probably a data-entry error attributable to the scraper used to build this dataset. Fortunately, only one erroneous row sits above the true most-installed apps. Since erroneous rows are likely to fall outside the range [0, maximum value], I must also check whether any of them sort below the rows with 0 installs:

In [104]:
#re-sort the data in ascending order to show least installed apps first
app_data = app_data.sort_values(by="Installs")
app_data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
9148,Command & Conquer: Rivals,FAMILY,,0,Varies with device,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device,Varies with device
9337,EG | Explore Folegandros,TRAVEL_AND_LOCAL,,0,56M,0+,Paid,$3.99,Everyone,Travel & Local,"January 22, 2017",1.1.1,4.1 and up
9719,EP Cook Book,MEDICAL,,0,3.2M,0+,Paid,$200.00,Everyone,Medical,"July 26, 2015",1.0,3.0 and up
6692,cronometra-br,PRODUCTIVITY,,0,5.4M,0+,Paid,$154.99,Everyone,Productivity,"November 24, 2017",1.0.0,4.1 and up
8081,CX Network,BUSINESS,,0,10M,0+,Free,0,Everyone,Business,"August 6, 2018",1.3.1,4.1 and up
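
One thing worth keeping in mind when reading these two sorts: the `+` suffix suggests that *Installs* is stored as text, and text is ordered character by character rather than by magnitude. The sketch below illustrates that behavior on a handful of invented values; the `"Free"` entry is a hypothetical stand-in for a value that landed in the wrong column.

```python
import pandas as pd

# Invented values in the same text format as the Installs column
installs = pd.Series(["10,000+", "500,000,000+", "0+", "5+", "Free"])

# Text is compared character by character, so "5+" outranks "10,000+"
# and any value starting with a letter outranks every digit
ordered = installs.sort_values(ascending=False).tolist()
print(ordered)
```

Under that assumption, a descending sort surfaces non-numeric values first, and the order of the remaining rows is only a rough proxy for popularity until the column is converted to numbers.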


This is good news: no erroneous rows sort below 0 in the *Installs* column. Therefore all we have to do is delete that one erroneous row. I will proceed to drop it, along with the *Size*, *Current Ver*, and *Android Ver* columns. Note that the erroneous row's index is **10472**, which will be used as the argument to drop the row:

In [105]:
#re-sort the data with most installed apps first
app_data = app_data.sort_values(by="Installs", ascending=False)

#remove the erroneous row from the original dataset
app_data = app_data.drop(10472, axis=0)

#remove unusable columns
app_data = app_data.drop(columns=["Size", "Current Ver","Android Ver"])

app_data.head()

Unnamed: 0,App,Category,Rating,Reviews,Installs,Type,Price,Content Rating,Genres,Last Updated
1661,Temple Run 2,GAME,4.3,8118609,"500,000,000+",Free,0,Everyone,Action,"July 5, 2018"
474,LINE: Free Calls & Messages,COMMUNICATION,4.2,10790289,"500,000,000+",Free,0,Everyone,Communication,"July 26, 2018"
3574,Cloud Print,PRODUCTIVITY,4.1,282460,"500,000,000+",Free,0,Everyone,Productivity,"May 23, 2018"
3326,Gboard - the Google Keyboard,TOOLS,4.2,1859109,"500,000,000+",Free,0,Everyone,Tools,"July 31, 2018"
431,Viber Messenger,COMMUNICATION,4.3,11334973,"500,000,000+",Free,0,Everyone,Communication,"July 18, 2018"
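
Dropping by a hard-coded index works here because only one bad row was found by eye. As an alternative sketch, malformed rows could be located by pattern instead. The toy frame below is invented, and the regular expression is my own assumption about what a well-formed install count looks like (it also relies on `Series.str.fullmatch`, available in pandas 1.1 and later).

```python
import pandas as pd

# Invented rows; "Free" stands in for a value shifted into the wrong column
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Installs": ["1,000+", "0", "Free", "500,000,000+"],
})

# Well-formed counts: digits with optional comma groups and an optional trailing "+"
well_formed = toy["Installs"].astype(str).str.fullmatch(r"\d{1,3}(,\d{3})*\+?")

malformed = toy[~well_formed]   # rows to inspect or drop
cleaned = toy[well_formed]      # rows that look like install counts
print(malformed.index.tolist())
```

This approach would also catch erroneous rows that happen to sort into the middle of the data, which checking only the two ends of a sorted column cannot do.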


The dataframe looks much better now that the top rows show apps known to mobile users worldwide. To proceed in answering the first part of our first question, *What apps are most reviewed?*, we need to make sure the *Installs* column is numeric. This calls for removing the *+* sign from the end of each value. The commas still need to be removed from the numbers as well, followed by a log transformation; the cell below handles only the *+* sign:

In [110]:
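#build the list once, before the loop, so it is not reset on every row
installs = []
for value in app_data["Installs"]:
    value = str(value)
    #strip the trailing "+" sign, if present
    if value.endswith("+"):
        value = value[:-1]
    print(value)
    installs.append(value)

print(installs)
#pass the original index so the values line up with the sorted rows
#app_data["Installs"] = pd.Series(installs, index=app_data.index)
#app_data["Installs"]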

500,000,000
500,000,000
500,000,000
[output truncated: one install count per row with the trailing + removed; values seen include 500,000,000, 100,000,000, 50,000,000, 10,000,000, 1,000,000, 500,000, 100,000, 5,000, 10, and 5]
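
The remaining cleanup steps (removing the commas, converting to numbers, and applying a log transformation) can also be done without an explicit loop. The following is a minimal sketch on invented values; using `log10(1 + x)` is my own assumption, chosen so that apps with 0 installs stay defined, and is not necessarily the transformation the final analysis should use.

```python
import numpy as np
import pandas as pd

# Invented values in the same text format as the Installs column
raw = pd.Series(["500,000,000+", "1,000+", "5+", "0"])

# Strip the trailing "+" and the thousands separators, then convert to integers
installs = pd.to_numeric(
    raw.str.replace("+", "", regex=False).str.replace(",", "", regex=False)
)

# log10(1 + x) keeps apps with 0 installs defined instead of producing -inf
log_installs = np.log10(installs + 1)
print(installs.tolist())
```

Because these string methods operate on the whole column and keep its index, the result could be assigned straight back to `app_data["Installs"]` without building an intermediate list.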


Next, I will import the *googleplaystore_user_reviews* csv file into a dataframe named *sentiment_data*. Then I will remove all rows where the analysis has no results. Before that, however, I want to get an idea of how many reviews were not translated: