### Dataset Context

The Google Play store offers millions of applications, or *apps*, and the number keeps growing. At the time of writing, the store is estimated to hold 2.6 million applications. The creator of this dataset, Lavanya Gupta, obtained data on roughly 10,000 of these apps.

She obtained the dataset by scraping the store, which uses dynamic page loading. Dynamic page loading means that the page is filled in by scripts after the initial request and, as I understand it, the apps displayed depend on what Google knows about the user requesting the page. Scraping means that she wrote a script that runs through the dynamically loaded page, reads the data, and outputs it into a structured file, such as the csv file I will be working with for this project.
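
To make the idea of scraping concrete, here is a minimal sketch that reads app names and ratings out of a piece of HTML and writes them as csv text. It is not the dataset creator's script: the markup in `SAMPLE_HTML`, the class names, and the `AppParser` helper are all made up for illustration, and a real scraper would first have to fetch and render the dynamically loaded page.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: the real store page uses different, script-generated HTML.
SAMPLE_HTML = """
<div class="app"><span class="name">Example App</span><span class="rating">4.1</span></div>
<div class="app"><span class="name">Another App</span><span class="rating">3.9</span></div>
"""

class AppParser(HTMLParser):
    """Collects the text of the "name" and "rating" spans for each app."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per app
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "app":
            self.rows.append({})          # a new app starts here
        elif tag == "span" and css_class in ("name", "rating"):
            self._field = css_class       # the next text belongs to this field

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

parser = AppParser()
parser.feed(SAMPLE_HTML)

# Write the collected rows as csv text (a file object would work the same way)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "rating"])
writer.writeheader()
writer.writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Every value comes out as a string, which is one reason scraped columns usually need cleaning before analysis.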

The data files include a second file containing a sentiment analysis conducted on this sample of Google Play apps using nltk (Natural Language Toolkit), a Python library. The objective of this analysis is to understand what user reviews convey about users' opinions of these apps.

### Dataset Content

#### A. googleplaystore.csv

This file contains the main dataset. It has 10,841 rows of data with the following columns:

*App*: Name of the app.

*Category*: Category of the app, such as beauty, business, entertainment, or education.

*Rating*: How users rate the app out of 5, with 1 being the lowest rating and 5 being the highest.

*Reviews*: The number of user reviews each app has received.

*Size*: The storage space needed to install the application.

*Installs*: The number of times each application has been installed by users.

*Type*: Whether the app is free or a paid app.

*Price*: The price of the app.

*Content Rating*: The intended audience for the app, such as teens, mature audiences, or everyone.

*Genres*: The sub-category for each app. Example: for the Education category, this could be Education: Pretend Play.

*Last Updated*: Release date of the most recent update for the app.

*Current Ver*: The app's current version.

*Android Ver*: The oldest version of Android OS supported by the app.


#### B. googleplaystore_user_reviews.csv

This file contains the result of the sentiment analysis conducted by the dataset creator. It has 64,295 rows of data with the following columns:

*App*: Name of the app.

*Translated_Review*: Either the original review in English, or a translated version if the original review is in another language.

*Sentiment*: The result of the sentiment analysis conducted on a review. The value is either Positive, Neutral, or Negative.

*Sentiment_Polarity*: A value indicating how positive or negative the sentiment is, ranging from -1 (most negative) to 1 (most positive).

*Sentiment_Subjectivity*: A value from 0 to 1 indicating the subjectivity of the review. Lower values indicate the review is based on factual information, and higher values indicate the review is based on personal or public opinions or judgements.

### Summary of Limitations

The limitations of the Google Play Store Apps data are:

1. The apps included may reflect the dataset creator's activity on Google-related sites, so the sample is not necessarily representative of the whole store. She is a machine learning software developer based in India.


2. I am not sure whether all apps follow the same software versioning process, so I will treat the *Current Ver* column as irrelevant to this analysis. Otherwise it would have been useful for measuring how actively developers support each app.


3. I will assume that the vast majority of users can upgrade their Android devices to the latest version. Based on that, the *Android Ver* column will also be excluded. Any limitations that may justify relying on older versions of Android most probably do not apply to the majority of the population.


4. Scraping data from a Google website is an unconventional way to obtain it and may result in misplaced data, such as a value landing in the wrong column. This largely depends on the scraper built by the dataset creator.


5. The sentiment analysis result is limited by the abilities of Python's nltk library, which does not support all languages. Reviews in unsupported languages will not be translated and should have no values in the analysis output.

I will start by importing the csv files into two Pandas dataframes, one called *app_data* which contains the main data on the applications, and another called *sentiment_data* containing the sentiment analysis results on app reviews.

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [18]:
#Import the googleplaystore.csv into a Pandas dataframe
app_data = pd.read_csv(r"C:\Users\Mohammad\Documents\Thinkful\7.11 Capstone 1 Analytic Report and Research Proposal\google-play-store-apps\googleplaystore.csv")

#Show the first 3 rows of the dataframe
app_data.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


Now I will drop the last two columns, *Current Ver* and *Android Ver*, from the dataset:

In [21]:
#drop the Current Ver and Android Ver columns
app_data = app_data.drop(columns = ["Current Ver", "Android Ver"])
app_data.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018"
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018"
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018"
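
The *Installs* and *Size* values in the rows above are text ("10,000+", "19M") rather than numbers, so they would need converting before any numeric comparison. The sketch below shows one possible conversion. The helper names `parse_installs` and `parse_size_mb` are hypothetical, and the code assumes only the two formats visible in the sample rows. Any size that does not end in "M" is returned as `None` rather than guessed at.

```python
def parse_installs(text):
    """Turn a string such as "10,000+" into the integer 10000."""
    return int(text.replace(",", "").rstrip("+"))

def parse_size_mb(text):
    """Turn a size such as "8.7M" into a float number of megabytes.

    Assumption: only the "M" suffix seen in the sample rows is handled;
    anything else yields None so it can be inspected separately.
    """
    if text.endswith("M"):
        return float(text[:-1])
    return None

# Values taken from the three sample rows shown above
installs = [parse_installs(v) for v in ["10,000+", "500,000+", "5,000,000+"]]
sizes = [parse_size_mb(v) for v in ["19M", "14M", "8.7M"]]
print(installs)
print(sizes)
```

Functions like these could be applied to a whole column with `app_data["Installs"].map(parse_installs)`, provided every value in the column matches the assumed format.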


Next, I will import the *googleplaystore_user_reviews* csv file into a dataframe named *sentiment_data*. I will then remove all rows where the analysis has no results. Before doing that, I want to get an idea of how many reviews were not translated:

In [32]:
#Import the googleplaystore_user_reviews.csv into a Pandas dataframe
sentiment_data = pd.read_csv(r"C:\Users\Mohammad\Documents\Thinkful\7.11 Capstone 1 Analytic Report and Research Proposal\google-play-store-apps\googleplaystore_user_reviews.csv")

#Show an example of rows where reviews could not be translated
#isna() flags missing values; .loc is an indexer and cannot be called with "NaN"
unsupported = sentiment_data[sentiment_data["Translated_Review"].isna()]
print(len(unsupported), "reviews have no translated text")
unsupported.head()
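
The step of removing rows where the analysis has no results could be sketched as follows. The dataframe here is a small made-up stand-in for *sentiment_data*, so the counts are only illustrative. It also assumes that a missing review shows up as `NaN` in both the *Translated_Review* and *Sentiment* columns.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for googleplaystore_user_reviews.csv
sample = pd.DataFrame({
    "App": ["App A", "App A", "App B"],
    "Translated_Review": ["Works well", np.nan, "Too many ads"],
    "Sentiment": ["Positive", np.nan, "Negative"],
})

# Count the rows with no review text before removing them
missing_count = sample["Translated_Review"].isna().sum()

# Keep only the rows that have both a review and a sentiment result
cleaned = sample.dropna(subset=["Translated_Review", "Sentiment"])

print(missing_count)
print(len(cleaned))
```

Passing `subset` to `dropna` limits the check to the named columns, so a row is not discarded because of a missing value in some unrelated column.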