# COGS 108 - Final Project 

# Overview

With the huge variety of applications available today, competition in the app market is extremely high. Any application that aims to succeed has to use all available data to improve its quality and avoid poor decisions. Careful analysis of app-related phenomena and the discovery of hidden patterns in the relevant data are essential to achieving these goals. Our team therefore set out to investigate whether an application’s name could be an influential indicator of its success.

# Names

- Aliakasandr Samushchyk
- Jiayi Zhang
- Soumya Agrawal
- Richard Duong
- Titan Ngo
- Yaman Jandali

# Group Members IDs

- A15672156
- A14533542
- A14402679
- A15196673
- A15525832
- A15753076

# Research Question

1) Is the length of an app’s name significant in explaining its success?

2) Are there any words, letters, letter combinations, or capital letters that tend to appear in the names of successful applications?

3) What other factors predict an application’s success?


## Background and Prior Work

The dataset we found offers many interesting insights into user preferences, and we want to explore how certain preferences affect app rating and popularity. Beyond the obvious correlation between app rating and popularity, we have little background knowledge of how user preferences affect an app's popularity, so we would like to use the data to look for subtler biases that might affect it, such as the app's title and title length.

References:
- 1) “A Statistical Analysis of the Apple App Store” by Colin Eberhardt. A statistical analysis of app prices in the Apple App Store. It goes little beyond basic statistics, such as the genre distribution of apps and the price differences across genres. It found a positive correlation between price and app rating.
Source: https://blog.scottlogic.com/2014/03/20/app-store-analysis.html
- 2) A statistical analysis of various factors that contribute to the success of an app in the Google Play Store. It found that most free apps are monetized by advertisements, that ~80% of apps on the Play Store have been downloaded fewer than 50k times, and that only a small share of users who install an app take the time to write a review.
Source: https://nycdatascience.com/blog/student-works/web-scraping/analysis-of-apps-in-the-google-play-store/

The scope of our project goes a bit beyond the analysis these projects present, but in a similar vein. While these projects analyzed the more basic factors that contribute to the success of an app, we will focus on subtler user preferences whose link to success is less obvious.

# Hypothesis


We will measure the success of an application using several metrics: the number of installations, the rating of the application, and the number of reviews for the app combined with sentiment analysis of the content of these reviews. We expect that the title has an influence on an application’s success. In particular, we expect the length of the title to be a significant explanatory variable; in our view, the longer a name takes to read, the less interest the reader retains. We also expect that other features of applications, such as cost, size, and category, are important in explaining success. At the end of the project, we will also try to build a machine learning model that estimates the probability of an app’s success and classifies whether an app is successful given its title.
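
As a minimal sketch of how the title-length hypothesis could be checked, the snippet below derives two title features and computes a Pearson correlation with rating. The five name/rating pairs come from the Google Play Store Apps dataset; the column names `title_length` and `word_count` are our own illustrative choices, and five rows are far too few to support any conclusion.

```python
import pandas as pd

# A few (name, rating) pairs taken from the Google Play Store Apps dataset.
apps = pd.DataFrame({
    "App": [
        "Coloring book moana",
        "Sketch - Draw & Paint",
        "Paper flowers instructions",
        "Infinite Painter",
        "Garden Coloring Book",
    ],
    "Rating": [3.9, 4.5, 4.4, 4.1, 4.4],
})

# Illustrative title features (these column names are assumptions,
# they do not exist in the dataset itself).
apps["title_length"] = apps["App"].str.len()             # characters, including spaces
apps["word_count"] = apps["App"].str.split().str.len()   # whitespace-separated tokens

# Pearson correlation between title length and rating.
r = apps["title_length"].corr(apps["Rating"])
print(apps)
print(f"Pearson r (title length vs. rating): {r:.2f}")
```

On the full dataset, the same two feature columns could also serve as inputs to a regression or classifier alongside cost, size, and category.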

# Dataset(s)

- Dataset Name: Google Play Store Apps
- Link to the dataset: https://www.kaggle.com/lava18/google-play-store-apps?fbclid=IwAR0I6EIgxdnc3LWhwVVg85gZ9RokprTW6xDo47EQxwDu5Qkce24ZC2MbIBs#googleplaystore_user_reviews.csv
- Number of observations: 64.3k

This dataset provides us with a zipped folder containing two files. The first file has around 64.3k observations with 5 variables: the app name, translated review, sentiment, sentiment polarity, and sentiment subjectivity. Most of these variables, such as the translated review and sentiment, have already been preprocessed. Many observations (reviews) in this file refer to the same app.

The second file has around 10.8k observations with 13 variables: app name, category of the app, overall user rating, number of reviews, size of the app, number of installs, whether the app is paid or free, price of the app, content rating, genres, date of the last update, current version of the app available on the app store, and minimum required Android version for the app. With these features, we can address our main question, which centers on the popularity of an app and what makes it popular. Using sentiment analysis of the reviews and analysis of the title (popular keywords in titles, length of the titles), we can identify factors beyond the rating or number of installs that relate to an app's popularity.
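
Several of these variables are stored as text (for example, installs such as `10,000+` and sizes such as `19M`), so they would need to be converted to numbers before any quantitative analysis. The helpers below are a hedged sketch of one way to do that. The function names `parse_installs` and `parse_size_mb` are hypothetical, and the handling of a `k` suffix and of non-numeric entries such as `Varies with device` is an assumption about values that may occur in the file.

```python
def parse_installs(text):
    """Convert an installs string such as '10,000+' to an int.

    Returns None when the value is not a plain number.
    """
    digits = text.replace(",", "").rstrip("+").strip()
    return int(digits) if digits.isdigit() else None


def parse_size_mb(text):
    """Convert a size string such as '19M' to megabytes as a float.

    Assumes 'M' means megabytes and 'k' means kilobytes (1024 k = 1 M).
    Returns None for anything else, e.g. 'Varies with device'.
    """
    text = text.strip()
    units = {"M": 1.0, "k": 1.0 / 1024}
    if text and text[-1] in units:
        try:
            return float(text[:-1]) * units[text[-1]]
        except ValueError:
            return None
    return None


print(parse_installs("5,000,000+"))
print(parse_size_mb("8.7M"))
```

With pandas, these could be applied column-wise, for example `df['Installs'].map(parse_installs)`.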

# Setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display plots directly in the notebook instead of in a new window
%matplotlib inline

# Data Cleaning

To keep only the data we need, we first clean the main dataset, because it contains some information that won't be helpful, such as Current Version and Android Version. After cleaning, we are left with application name, category, rating, reviews, size, installs, type, price, etc.

The second dataset contains reviews for over 1,000 apps, along with preprocessed indicators such as sentiment and sentiment_polarity. We will also clean this dataset and use the preprocessed sentiment indicators as a reference for our analysis.

In [16]:
# Load both datasets
df = pd.read_csv('googleplaystore.csv')
df_review = pd.read_csv('googleplaystore_user_reviews.csv')

# Drop columns we do not need; Genres largely repeats the information in Category
df = df.drop(['Genres', 'Current Ver', 'Android Ver'], axis=1)

# Drop rows with any NaN in the first dataset, and rows missing Translated_Review in the second
df = df.dropna()
df_review = df_review.dropna(subset=['Translated_Review'])

# Display the main dataset after cleaning
df

# Open question:
# How should these two datasets be merged?

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Last Updated |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | January 7, 2018 |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | January 15, 2018 |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | August 1, 2018 |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | June 8, 2018 |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | June 20, 2018 |
| 5 | Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 50,000+ | Free | 0 | Everyone | March 26, 2017 |
| 6 | Smoke Effect Photo Maker - Smoke Editor | ART_AND_DESIGN | 3.8 | 178 | 19M | 50,000+ | Free | 0 | Everyone | April 26, 2018 |
| 7 | Infinite Painter | ART_AND_DESIGN | 4.1 | 36815 | 29M | 1,000,000+ | Free | 0 | Everyone | June 14, 2018 |
| 8 | Garden Coloring Book | ART_AND_DESIGN | 4.4 | 13791 | 33M | 1,000,000+ | Free | 0 | Everyone | September 20, 2017 |
| 9 | Kids Paint Free - Drawing Fun | ART_AND_DESIGN | 4.7 | 121 | 3.1M | 10,000+ | Free | 0 | Everyone | July 3, 2018 |
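
One possible answer to the merge question raised in the cleaning cell, sketched under assumptions: since the review file holds many rows per app, the reviews could first be summarized to one row per app and then left-joined onto the main table by `App`. The review values below are made up for illustration, and the summary column names `mean_polarity` and `n_reviews` are hypothetical; only `App`, `Rating`, and `Sentiment_Polarity` mirror the real files.

```python
import pandas as pd

# Toy stand-ins for the two cleaned tables; the review values are invented.
apps = pd.DataFrame({
    "App": ["Coloring book moana", "Infinite Painter", "Garden Coloring Book"],
    "Rating": [3.9, 4.1, 4.4],
})
reviews = pd.DataFrame({
    "App": ["Coloring book moana", "Coloring book moana", "Infinite Painter"],
    "Sentiment_Polarity": [0.5, -0.1, 0.8],
})

# Collapse the many reviews per app into a single summary row per app.
review_summary = (
    reviews.groupby("App")["Sentiment_Polarity"]
    .agg(mean_polarity="mean", n_reviews="count")
    .reset_index()
)

# A left join keeps every app and leaves NaN where an app has no reviews.
# Duplicate app names are dropped first so each app appears once.
merged = apps.drop_duplicates(subset="App").merge(
    review_summary, on="App", how="left"
)
print(merged)
```

A left join is used here so that apps without any reviews stay in the analysis; an inner join would instead restrict the data to apps that appear in both files.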


# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [5]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

*Fill in your ethics & privacy discussion here*

# Conclusion & Discussion

*Fill in your discussion information here*