# So you want to make a killer app?  Can you use data from existing apps to tell whether your app is going to be successful?

# What kind of app should you make?

# In this project, we analyze data from the Google Play Store.  We will use the pandas package to build two dataframes from CSV files: one holding app data and one holding user review data.

# After previewing the data we have to work with, we will pose questions about it, use pandas to answer them, and finally use Matplotlib and WordCloud to visualize our findings.

In [3]:
# We first import the common packages relevant to data analysis
import pandas as pd
import seaborn as sns

# We will also use the WordCloud class from the wordcloud package to visualize review data later in the project.
from wordcloud import WordCloud

In [4]:
# This creates dataframes from the csv files holding the data we are analyzing.
app_data = pd.read_csv("Data/google-play-store-apps/googleplaystore.csv")
review_data = pd.read_csv("Data/google-play-store-apps/googleplaystore_user_reviews.csv")

In [5]:
# We use the .head() method to preview what data the app_data dataframe contains.
app_data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [6]:
# We use the .head() method to preview what data the review_data dataframe contains.
review_data.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


# After seeing the data we have available, we are left with the following questions:

# 1. How well received is each app category; that is, what is the mean rating for each category of apps?
# 2. How populous is each category; that is, which category has the most apps?
# 3. How often is each category of apps installed; that is, what is the total number of installs per category?
# 4. Are there any relationships between mean rating, number of apps, and total installs by category?
# 5. Given a list of words of interest, which word appears most often in written reviews?
# 6. Given a list of words of interest, which other words appear in reviews where the word of interest appears?
# 7. Given a list of words of interest, what percentage of the reviews containing each word are positive, negative, and neutral?
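
## As a rough sketch of how questions 1 through 3 could be approached, the cell below groups a small made-up dataframe by category.  The rows in toy_apps and the Installs_num column are illustrative assumptions, not values from our dataset.  Because Installs is stored as text such as "10,000+", we strip the commas and plus signs before summing, so the totals are lower bounds rather than exact counts.

```python
import pandas as pd

# Hypothetical toy rows shaped like app_data (values invented for illustration).
toy_apps = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Rating": [4.0, 4.5, 3.0, 4.0],
    "Installs": ["10,000+", "500,000+", "1,000+", "5,000,000+"],
})

# Installs is text such as "10,000+", so remove "," and "+" before converting to integers.
toy_apps["Installs_num"] = (
    toy_apps["Installs"].str.replace(r"[,+]", "", regex=True).astype(int)
)

# One row per category: mean rating, number of apps, and total (lower-bound) installs.
category_summary = toy_apps.groupby("Category").agg(
    mean_rating=("Rating", "mean"),
    app_count=("App", "count"),
    total_installs=("Installs_num", "sum"),
)
print(category_summary)
```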

## Before we proceed with our analysis, we will look for any limitations our data may have.

In [9]:
# In the following lines, we check the shape of both dataframes using the .shape attribute, as well as the number of non-missing entries in each column using the .count() method.
# By doing so, we check for missing data.
print("app_data:")
print(app_data.shape)
print(app_data.count())
print("\nreview_data:")
print(review_data.shape)
print(review_data.count())

app_data:
(10841, 13)
App               10841
Category          10841
Rating             9367
Reviews           10841
Size              10841
Installs          10841
Type              10840
Price             10841
Content Rating    10840
Genres            10841
Last Updated      10841
Current Ver       10833
Android Ver       10838
dtype: int64

review_data:
(64295, 5)
App                       64295
Translated_Review         37427
Sentiment                 37432
Sentiment_Polarity        37432
Sentiment_Subjectivity    37432
dtype: int64
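
## The gap between the row count and .count() can also be read off directly.  The cell below is a minimal sketch on a made-up dataframe (toy is an assumption for illustration), using .isna() to count the missing entries per column and the share of rows they represent.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries (values invented for illustration).
toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Rating": [4.1, np.nan, 3.9],
    "Type": ["Free", "Free", None],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Fraction of rows that are missing in each column.
missing_share = toy.isna().mean()

print(missing_per_column)
print(missing_share)
```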


## We notice that our app_data dataframe has 10,841 rows, yet several columns have fewer entries; Rating, for example, has only 9,367.  We also notice that our review_data dataframe has 64,295 rows, yet every column except App has only about 37,430 entries, roughly 58% of the rows.  Before we proceed, we will create subsets of both dataframes that contain only rows with complete data.

In [8]:
# By using the .dropna() method with how="any", we drop every row that has at least one missing value.
app_data_no_na = app_data.dropna(how="any")
review_data_no_na = review_data.dropna(how="any")
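
## Dropping every row with any missing value is the simplest choice, and it is the one we use in this project.  As a possible alternative, sketched here on made-up data, the subset parameter of .dropna() limits the check to the columns a given question needs, which can keep more rows.  The toy dataframe and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values invented for illustration).
toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Rating": [4.1, np.nan, 3.9],
    "Current Ver": ["1.0", "2.0", None],
})

# Drop rows with a missing value in any column.
all_complete = toy.dropna(how="any")

# Drop rows only when Rating is missing; a missing Current Ver is tolerated.
rating_complete = toy.dropna(subset=["Rating"])

print(len(all_complete), len(rating_complete))
```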

In [10]:
# In the following lines, we check the shape of both subsets using the .shape attribute, as well as the number of non-missing entries in each column using the .count() method.
# By doing so, we confirm that the subsets we are using only contain rows with complete data.
print("App Data:")
print(app_data_no_na.shape)
print(app_data_no_na.count())
print("\nReview Data")
print(review_data_no_na.shape)
print(review_data_no_na.count())

App Data:
(9360, 13)
App               9360
Category          9360
Rating            9360
Reviews           9360
Size              9360
Installs          9360
Type              9360
Price             9360
Content Rating    9360
Genres            9360
Last Updated      9360
Current Ver       9360
Android Ver       9360
dtype: int64

Review Data
(37427, 5)
App                       37427
Translated_Review         37427
Sentiment                 37427
Sentiment_Polarity        37427
Sentiment_Subjectivity    37427
dtype: int64
