# How can we make and promote an app successfully?

## I. Introduction

According to Statista, global mobile internet penetration has exceeded half the world’s population, and Millennials now spend an average of 185 minutes a day accessing online content from a mobile device such as a smartphone, tablet, or wearable. The two largest global platforms for app distribution are Apple’s App Store, which caters to iOS users, and Google Play, the official app store for Android. Mobile apps are relatively easier to create than desktop applications and considerably cheaper, which has produced an industry that releases more apps every year. It is impossible to know exactly how many apps exist, but as of March 2018 there were some 3.6 million apps in Google Play alone. So how can a developer create a mobile app that wins customers in such fierce competition? What key ingredients do popular apps share? To answer these questions, we carry out a descriptive statistical analysis in this report.

## II. Dataset 

The dataset contains web-scraped data on the 1,000 most popular apps in the Google Play Store. To enrich the project, we also use a dataset of the top 1,000 apps by score, provided by www.kaggle.com. Eight features describe each app: App Category, Rating, Reviews, Size, Installs, Type, Price, and Content Rating. We run a descriptive statistical analysis on the data and summarize the attributes these successful apps share in order to offer suggestions to app developers.

### 1. About the Category 

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv("google app final.csv")
df

In [None]:
from wordcloud import WordCloud
df_category = df["category"]
text = df_category
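result = "/".join(text)  # join the categories with a separator; a plain-text corpus would also need jieba segmentation
wordcloud = WordCloud(font_path=r"c:\windows\Fonts\simhei.ttf").generate(result)  # raw string avoids invalid escape sequences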
%pylab inline
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

![Fig.1](https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.1.png?raw=true)

In [None]:
df['installs'] = df['installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)
df['installs'] = df['installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)
df['installs'] = df['installs'].apply(lambda x: int(x))
df['installs']=df['installs'].astype('float')

In [None]:
#installs = []
#category = []
#for i in df["installs"]:
   # installs = installs.append(i)
#print(installs)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt  
plt.rcParams['font.sans-serif'] = [u'SimHei']
plt.rcParams['axes.unicode_minus'] = False

df = pd.DataFrame({'Category':df["category"],'Installs':df["installs"]})
groupbyDistrict=df.groupby('Category')
groupbyD = groupbyDistrict.mean().sort_values('Installs', ascending=False).reset_index()

'''
groupbyD.columns = groupbyD.columns.droplevel(0) 
groupbyClass.rename(columns = {'':'Installs'},inplace = True)
'''
ax = groupbyD.head(3).plot.bar(x='Category', y='Installs', rot=0, color = ["g","y","y","y","y"])

![Fig.2](https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.2.png?raw=true)

The most numerous apps are those in Tools, Arcade, Action, Entertainment, and Simulation (word cloud). However, Video Players & Editors, Communication, and Productivity are the three categories with the highest average installs (bar chart). This mismatch between supply and user demand suggests there is still considerable room for apps that offer video playing and editing, communication, and productivity services.
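The supply-versus-demand comparison can be illustrated with a minimal sketch. It uses only the standard library and a handful of hypothetical `(category, installs)` rows that are not taken from the dataset; the variable names are assumptions for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical sample rows (category, installs); not taken from the dataset.
apps = [
    ("Tools", 1_000_000),
    ("Tools", 5_000_000),
    ("Tools", 500_000),
    ("Arcade", 10_000_000),
    ("Arcade", 1_000_000),
    ("Communication", 100_000_000),
    ("Video Players & Editors", 500_000_000),
]

# Supply: how many apps each category contains.
supply = Counter(category for category, _ in apps)

# Demand: mean installs per app within each category.
installs_by_category = defaultdict(list)
for category, installs in apps:
    installs_by_category[category].append(installs)
demand = {c: mean(v) for c, v in installs_by_category.items()}

most_supplied = supply.most_common(1)[0][0]
most_demanded = max(demand, key=demand.get)
print("Most apps:", most_supplied)
print("Highest mean installs:", most_demanded)
```

When the two answers differ, as they do in this toy sample, the category with high mean installs but few apps is the kind of gap the paragraph above describes.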

### 2. About the Pricing Strategy 

In [None]:
import csv
import pandas as pd

In [None]:
df_free=pd.read_csv('free app.csv')

In [None]:
df_free

In [None]:
df_free['Type']='Free'

In [None]:
df_free

In [None]:
df_free=df_free[['title','score','size','installs','content_rating','category','Type','date']]

In [None]:
df_free

In [None]:
df_paid=pd.read_csv('paid app.csv')

In [None]:
df_paid

In [None]:
df_paid['Type']='Paid'

In [None]:
df_paid=df_paid[['title','score','size','installs','content_rating','category','Type','date']]

In [None]:
df_paid

In [None]:
frames=[df_free,df_paid]

In [None]:
df_app=pd.concat(frames)

In [None]:
df_app

In [None]:
df_app.to_csv('google app final.csv')

In [None]:
df_app['content_rating'].value_counts()

In [None]:
df=df_app

In [None]:
df['installs'] = df['installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)
df['installs'] = df['installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)
df['installs'] = df['installs'].apply(lambda x: int(x))

In [None]:
df['size'] = df['size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
df['size'] = df['size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)
df['size'] = df['size'].apply(lambda x: int(str(x).replace('k', '')) / 1024 if 'k' in str(x) else x)
df['size'].value_counts()

In [None]:
df_app

In [None]:
df_app.describe()

In [None]:
df=df_app.drop_duplicates(subset='title')

In [None]:
df

In [None]:
Freetop10_category=df_free['category'].value_counts().head(10)
Freetop10_category

In [None]:
Paidtop10_category=df_paid['category'].value_counts().head(10)
Paidtop10_category

In [None]:
from pyecharts import Pie
from IPython.display import IFrame

In [None]:
attr_Paid=list(Paidtop10_category.index)
attr_Free=list(Freetop10_category.index)
vPaid=Paidtop10_category
vFree=Freetop10_category
pie=Pie('Type top10 Apps',title_pos='center',width=1000,height=400)
pie.add('Paid',attr_Paid,vPaid,center=[26,50],radius=[35,75],rosetype='radius',
       is_legend_show=False,is_label_show=True,legend_orient='vertical')
pie.add('Free',attr_Free,vFree,center=[70,50],is_random=True,radius=[35,75],rosetype='radius',
       is_legend_show=False,is_label_show=True,legend_orient='vertical')
pie.render('Type top10 Apps(final).html')

In [None]:
IFrame('Type top10 Apps(final).html',width=1000,height=400)

![Fig.3](https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.3.png?raw=true)

The chart on the left shows the category mix among paid apps, and the chart on the right shows it among free apps. Tools is the most common category among both free and paid apps. Adventure, Simulation, and Personalization apps are popular among paid apps, and mobile games usually require payment. Meanwhile, free Communication, Social, and Shopping apps are well liked by users. We suppose these kinds of apps need to offer free services to attract more users, since a large user base can bring substantial investment.

In [None]:
import csv
import pandas as pd
import numpy as ny 
df = pd.read_csv('googleapp Data cleaning.csv',index_col=0)

In [None]:
Fufei = df[(df['Type'] == 'Paid')]

In [None]:
import matplotlib.pyplot as plt
x = Fufei.Price
y = Fufei.Installs 
plt.scatter(x, y, alpha=0.5)
plt.xlabel('Price')
plt.ylabel('Installs')
plt.title('Price & Installs') 
plt.autoscale(tight=True)
plt.show()

![Fig.4](https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.4.png?raw=true)

The two pie charts show that people are more willing to pay for Personalization and Game apps, so developers should consider personalization and entertaining design to improve profits. Prices still need to stay within a reasonable range: the scatter plot shows that apps costing less than $5 are more popular. Users are reluctant to pay much for an app, so more expensive apps may need to offer a free trial to increase willingness to pay. Alternatively, they can change the sales model by lowering the install price and offering in-app purchases, as some mobile games do; we call this the “Kejin” model.
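One way to put a number on what the scatter plot suggests is to compare median installs below and above a price threshold. The sketch below is only illustrative: the `(price, installs)` pairs are hypothetical and the $5 cut-off is taken from the discussion above.

```python
from statistics import median

# Hypothetical (price in USD, installs) pairs for paid apps; not real data.
paid_apps = [
    (0.99, 1_000_000),
    (1.99, 500_000),
    (2.99, 100_000),
    (4.99, 100_000),
    (6.99, 10_000),
    (9.99, 50_000),
    (14.99, 5_000),
]

PRICE_THRESHOLD = 5.0  # cut-off discussed in the text

# Split installs into the two price groups.
cheap = [installs for price, installs in paid_apps if price < PRICE_THRESHOLD]
expensive = [installs for price, installs in paid_apps if price >= PRICE_THRESHOLD]

median_cheap = median(cheap)
median_expensive = median(expensive)
print("Median installs under $5:", median_cheap)
print("Median installs at $5 or more:", median_expensive)
```

The median is used instead of the mean because install counts are heavily skewed, so a single blockbuster app would otherwise dominate the comparison.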

### 3. About Apps’ Sizes

In [None]:
df=df[~df['size'].str.contains('Varies with device',na=False)]
df['size'].value_counts()

In [None]:
df['size']=df['size'].astype('float')

In [None]:
df.info()

In [None]:
size=df['size']
se_size=pd.Series(size)
bins=[0,20,40,60,80,100,120]
se1=pd.cut(se_size,bins)
se1

In [None]:
size_dis=se1.value_counts()
size_dis

In [None]:
df['size section']=se1

In [None]:
df

In [None]:
installs=df['installs']
se_installs=pd.Series(installs)
bins=[0,100000,10000000,1000000000]
install_section=pd.cut(se_installs,bins)
install_section

In [None]:
installs_dis=install_section.value_counts()
installs_dis

In [None]:
df['install_section'] =install_section
df

In [None]:
import seaborn as sns

In [None]:
ax=sns.violinplot(x='install_section', y='size', data=df)

![Fig.5](https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.5.png?raw=true)

The chart illustrates that users are more likely to install apps whose size is in the range of 15 to 60 MB. Developers should therefore try to keep app size within this range; if an app cannot be compressed further, they could split the download into several stages or deliver content through incremental updates.
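As a rough check on the size range, one could compute the share of apps within 15 to 60 MB for each install tier. This is a sketch under assumptions: `share_in_range` is a hypothetical helper, and the sizes and tier labels are made up for illustration.

```python
def share_in_range(sizes_mb, low=15, high=60):
    """Return the fraction of sizes that fall within [low, high] MB."""
    if not sizes_mb:
        return 0.0
    inside = sum(1 for size in sizes_mb if low <= size <= high)
    return inside / len(sizes_mb)


# Hypothetical app sizes (MB) grouped by install tier; not real data.
sizes_by_tier = {
    "up to 100k installs": [3.2, 8.0, 12.5, 25.0],
    "100k to 10M installs": [18.0, 22.0, 35.0, 70.0],
    "over 10M installs": [16.0, 28.0, 45.0, 58.0, 95.0],
}

shares = {tier: share_in_range(sizes) for tier, sizes in sizes_by_tier.items()}
for tier, share in shares.items():
    print(f"{tier}: {share:.0%} of apps are between 15 and 60 MB")
```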

### 4. About Promotion Strategy 

We scrape the reviews of some apps and analyze them in terms of Polarity and Subjectivity. Polarity indicates whether a review is positive, negative, or neutral. Subjectivity indicates whether the user’s attitude is subjective or objective.
Facebook scores 4 stars and Pass scores only 3 stars. Details of their reviews are shown below:

In [None]:
#Facebook 4*
from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://play.google.com/store/apps/details?id=com.facebook.katana&hl=en_GB&showAllReviews=true")

articles_facebook = []
for session in browser.find_elements_by_css_selector('.d15Mdf.bAhLNe'): 
    article = {}
    h = session.find_element_by_css_selector(".UD7Dzf")
    article['headline'] = h.text #find headline block

    #article['url'] = h.get_attribute('href') #get url attributes from headline block
    #article['date'] = session.find_element_by_css_selector("span.cnnDateStamp").text #find date
    articles_facebook.append(article)
    
    
articles_facebook

In [None]:
from textblob import TextBlob
i = len(articles_facebook)-1

while i >= 0:
    print(i+1)
    print(TextBlob(articles_facebook[i]["headline"]).sentiment)
    i=i-1

In [None]:
i=len(articles_facebook)-1
polar=[]
subject=[]

while i >=0:
    a=TextBlob(articles_facebook[i]['headline']).sentiment.polarity
    b=TextBlob(articles_facebook[i]['headline']).sentiment.subjectivity

    subject.append(b)
    polar.append(a)

    i=i-1
#print("polarity: ","\n",polar,"\n","\n","subjectivity:","\n",subject)

import pandas as pd
import numpy as np
df_facebook = pd.DataFrame(polar,columns = ["Polarity"])
df_facebook.insert(0, "Subjectivity", subject)

df_facebook.to_csv("facebook.csv", index=False )
df_facebook

In [None]:
df_facebook.mean()

In [None]:
#Pass 3*
from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://play.google.com/store/apps/details?id=com.sonymobile.xperialounge.services&hl=en_GB&showAllReviews=true")

articles_pass = []
for session in browser.find_elements_by_css_selector('.d15Mdf.bAhLNe'):  
    article = {}
    h = session.find_element_by_css_selector(".UD7Dzf")
    article['headline'] = h.text #find headline block

    #article['url'] = h.get_attribute('href') #get url attributes from headline block
    #article['date'] = session.find_element_by_css_selector("span.cnnDateStamp").text #find date
    articles_pass.append(article)
    
    
#articles_pass

In [None]:
i = len(articles_pass)-1

while i >= 0:
    print(i+1)
    print(TextBlob(articles_pass[i]["headline"]).sentiment)
    i=i-1

In [None]:
i=len(articles_pass)-1
polar=[]
subject=[]

while i >=0:
    a=TextBlob(articles_pass[i]['headline']).sentiment.polarity
    b=TextBlob(articles_pass[i]['headline']).sentiment.subjectivity

    subject.append(b)
    polar.append(a)

    i=i-1
#print("polarity: ","\n",polar,"\n","\n","subjectivity:","\n",subject)

import pandas as pd
import numpy as np
df_pass = pd.DataFrame(polar,columns = ["Polarity"])
df_pass.insert(0, "Subjectivity", subject)

df_pass.to_csv("pass.csv", index=False )
df_pass

In [None]:
df_pass.mean()

In [None]:
# Facebook reviews

In [None]:
import pandas as pd
import numpy as np
df_facebook = pd.read_csv('facebook.csv')
df_facebook.mean()

In [None]:
from pyecharts import Pie
from IPython.display import IFrame
attr = ['Negative','Neutral','Positive']
v1 = [18,4,18]
pie = Pie(width ='100%', height='100vh')
pie.add('',attr,v1,is_label_show = True)
pie.render('facebook_Polarity.html') 
pie

In [7]:
from IPython.display import Image
from IPython.core.display import HTML
Image(url= "https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.6.png?raw=true", width=700, height=700)

In [None]:
attr = ['Subjective','Objective']
v1 = [19,21]
pie = Pie(width ='100%', height='100vh')
pie.add('',attr,v1,is_label_show = True)
pie.render('Facebook_Subjectivity.html') 
pie

In [9]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.8.png?raw=true", width=700, height=700)

In [None]:
# Pass reviews

In [None]:
import pandas as pd
import numpy as np
df_pass = pd.read_csv('pass.csv')
df_pass.mean()

In [None]:
from pyecharts import Pie
from IPython.display import IFrame

attr = ['Negative','Neutral','Positive']
v1 = [14,12,4]
pie = Pie(width='100%', height='100vh')
pie.add('',attr,v1,is_label_show = True)


pie.render('Pass_Polarity.html') 

pie

In [8]:
from IPython.display import Image
from IPython.core.display import HTML
Image(url= "https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.7.png?raw=true", width=700, height=700)

In [None]:
attr = ['Subjective','Objective']
v1 = [9,31]
pie = Pie(width ='100%', height='100vh')
pie.add('',attr,v1,is_label_show = True)
pie.render('Pass_Subjectivity.html') 
pie

In [10]:
from IPython.display import Image
from IPython.core.display import HTML
Image(url= "https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.9.png?raw=true", width=700, height=700)

#### (1) Polarity:

As the score drops, the percentage of positive comments decreases sharply from 45% to 13%; however, the percentage of negative comments does not show an upward trend. Instead, it is neutral comments that take up a larger share of the total. From this small sample, the proportion of negative reviews appears to remain fairly steady, so app developers should try to win over neutral users and turn their comments positive.

#### (2) Subjectivity: 

From the comparison of the pie charts, we can see that subjectivity is more intense in the comments on the lower-scoring app, whereas positive comments tend to appear more objective. As we suspected, a terrible experience brings bad feelings such as disappointment, boredom, or anger, which may cause stronger subjectivity. It is better for app developers to show the review pop-up after users feel a sense of fulfillment (for example, after completing a task or earning an achievement in a game). Developers should likewise avoid showing the review pop-up right after a poor experience (for example, in function modules that have not yet been improved).
Of course, our analysis is confined to a small sample. To validate this conjecture or generalize the conclusion, more samples should be added in subsequent studies.
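The pie-chart counts in this section are entered by hand. A minimal sketch of how such counts could be derived from TextBlob-style scores follows. Polarity lies in [-1, 1] and subjectivity in [0, 1]; the cut-offs (`neutral_band`, `threshold`) and the sample scores are assumptions for illustration, not the values used in this report.

```python
from collections import Counter


def label_polarity(score, neutral_band=0.05):
    """Map a polarity score in [-1, 1] to a label; the band width is an assumed cut-off."""
    if score > neutral_band:
        return "Positive"
    if score < -neutral_band:
        return "Negative"
    return "Neutral"


def label_subjectivity(score, threshold=0.5):
    """Map a subjectivity score in [0, 1] to a label; the threshold is an assumed cut-off."""
    return "Subjective" if score >= threshold else "Objective"


# Hypothetical scores standing in for the saved Polarity and Subjectivity columns.
polarities = [0.6, -0.3, 0.0, 0.02, -0.8, 0.4]
subjectivities = [0.9, 0.4, 0.0, 0.1, 0.7, 0.5]

polarity_counts = Counter(label_polarity(p) for p in polarities)
subjectivity_counts = Counter(label_subjectivity(s) for s in subjectivities)
print(polarity_counts)
print(subjectivity_counts)
```

Deriving the counts in code keeps the charts reproducible when the reviews are scraped again, and makes the chosen cut-offs explicit.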

In [None]:
#df['Reviews']

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv("googleapp Data cleaning.csv")
df


import pandas as pd
import matplotlib.pyplot as plt  
plt.rcParams['font.sans-serif'] = [u'SimHei']
plt.rcParams['axes.unicode_minus'] = False

df = pd.DataFrame({'Reviews':df["Reviews"],'Installs':df["Installs"]})

groupbyDistrict=df.groupby('Reviews')


groupbyD = groupbyDistrict.mean().sort_values('Installs', ascending=True).reset_index()
#print(groupbyD)

#ax = df.plot.bar(x='Reviews', y='Installs', rot=270)

print(groupbyD.head())
plt.plot(groupbyD.head(100))
plt.show()

![Fig.10](https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.10.png?raw=true)

It is also striking that, no matter how much the installs increase, the number of reviews stays relatively stable, with no more than 1k users willing to review. Users' willingness to comment needs to be stimulated; app developers should not simply wait for users to leave positive assessments voluntarily but should guide them to write positive reviews, since these help build a good reputation and product image.
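If reviews stay flat while installs grow, the share of users who review must fall. The sketch below makes that explicit with a hypothetical `review_rate` helper and made-up `(name, reviews, installs)` rows.

```python
def review_rate(reviews, installs):
    """Return reviews per install, or 0.0 when there are no installs."""
    return reviews / installs if installs else 0.0


# Hypothetical (name, reviews, installs) rows; not taken from the dataset.
apps = [
    ("App A", 800, 100_000),
    ("App B", 900, 1_000_000),
    ("App C", 950, 10_000_000),
]

rates = {name: review_rate(reviews, installs) for name, reviews, installs in apps}
for name, rate in rates.items():
    print(f"{name}: {rate:.4%} of installs led to a review")
```

Tracking this ratio, not just the raw review count, would show whether prompts for reviews actually change user behaviour.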

### 5. About Content Rating and Potential Market  

In [None]:
import csv
import pandas as pd
import numpy as ny 
df = pd.read_csv('google app final.csv',index_col=0)

In [None]:
a = df[(df['content_rating'] == 'Teen')]
b = df[(df['content_rating'] == 'Everyone 10+')]
c = df[(df['content_rating'] == 'Mature 17+')] 
d = df[(df['content_rating'] == 'Everyone')]
User_type = pd.concat([a,b,c])

In [None]:
import plotly.plotly as py
import plotly.graph_objs as go

trace1 = go.Bar(
    x = User_type['category'],
    y = d['content_rating'],
    name= '',
    marker=dict(
        color='white',
))
trace2 = go.Bar(
    x = User_type['category'],
    y = a['content_rating'],
    name='Teen'
)
trace3 = go.Bar(
    x = User_type['category'],
    y= b['content_rating'],
    name='Everyone 10+'
)
trace4 = go.Bar(
    x = User_type['category'],
    y =c['content_rating'],
    name='Mature 17+'
)
data = [trace1,trace2, trace3,trace4]
layout = go.Layout(
    barmode='stack',
    title='Content Rating& Category of Popular Apps'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, title='Content Rating & Category',filename='stacked-bar')

![Fig.11](https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.11.png?raw=true)

In [None]:
import csv
import pandas as pd
import numpy as ny 
df = pd.read_csv('googleapp Data cleaning.csv',index_col=0)

In [None]:
a = df[(df['Content Rating'] == 'Teen')]
b = df[(df['Content Rating'] == 'Everyone 10+')]
c = df[(df['Content Rating'] == 'Mature 17+')]
d = df[(df['Content Rating'] == 'Everyone')]
User_type = pd.concat([a,b,c])

In [None]:
import plotly.plotly as py
import plotly.graph_objs as go

trace1 = go.Bar(
    x = User_type['Category'],
    y = d['Content Rating'],
    name= '',
    marker=dict(
        color='white',
))
trace2 = go.Bar(
    x = User_type['Category'],
    y = a['Content Rating'],
    name='Teen'
)
trace3 = go.Bar(
    x = User_type['Category'],
    y= b['Content Rating'],
    name='Everyone 10+'
)
trace4 = go.Bar(
    x = User_type['Category'],
    y =c['Content Rating'],
    name='Mature 17+'
)
data = [trace1,trace2, trace3,trace4]
layout = go.Layout(
    barmode='stack',
    title='Content Rating& Category of High Score Apps'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, title='Content Rating & Category',filename='stacked-bar')

![Fig.12](https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.12.png?raw=true)

Based on Content Rating, the apps fall into four types: Everyone, Teen, Everyone 10+, and Mature 17+. The bar charts show the three types other than Everyone (Fig.11 & Fig.12). Given that the ratings are nested, we focus mainly on Mature 17+ and Teen. We find that Social and Game apps are usually popular and also earn high scores. Music apps do not achieve high scores, although their download numbers are always large, while Family apps fail to attract users’ attention despite their high quality. Moreover, the most popular category for adults is Social, but this market is clearly very mature in both quality and quantity. There are still gaps in the Education market for both teens and adults.
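The counts behind a stacked bar chart of this kind can be sketched as a simple cross-tabulation of category against content rating. The rows below are hypothetical and the names (`rows`, `RATINGS_OF_INTEREST`) are assumptions for illustration.

```python
from collections import Counter

# Hypothetical (category, content rating) rows; not taken from the dataset.
rows = [
    ("Social", "Mature 17+"),
    ("Social", "Teen"),
    ("Social", "Mature 17+"),
    ("Game", "Teen"),
    ("Game", "Everyone 10+"),
    ("Education", "Everyone"),
    ("Music", "Teen"),
]

# The charts in this section leave out the "Everyone" rating.
RATINGS_OF_INTEREST = ("Teen", "Everyone 10+", "Mature 17+")

# Count apps for each (category, rating) pair of interest.
counts = Counter(
    (category, rating) for category, rating in rows if rating in RATINGS_OF_INTEREST
)

# Print one line per category with its count for each rating.
categories = sorted({category for category, _ in counts})
for category in categories:
    cells = ", ".join(f"{r}: {counts[(category, r)]}" for r in RATINGS_OF_INTEREST)
    print(f"{category} -> {cells}")
```

A category that appears with few or no apps for a given rating, such as Education here, is a candidate market gap of the kind discussed above.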

In [None]:
df2=df[(df['Content Rating'] == 'Mature 17+')]
df2

![Fig.13](https://github.com/Jasmine-dudu/Final-Project/blob/master/Figures-Final%20Project/Fig.13.png?raw=true)

In addition, we find something interesting. Apps in the Family category are mainly aimed at parents and children and offer education or brain games. Yet many Family apps carry the Mature 17+ tag, which we find very strange. For example, the app called anakare, a typical adventure game not intended for users under 18, is placed in the Family category, while Chat room for kids carries the Mature 17+ tag. We think this may be a marketing strategy: anakare faces less fierce competition in the Family category, and people who are curious about why such a game carries an irrelevant label may install it, which raises its rank on the list. We suppose this is quite an effective way of advertising.

## III. Conclusions 

If someone wants to create an app, we offer suggestions on the following aspects:
First, focusing on categories such as Education and Video Players & Editors will expose developers to less fierce competition. Second, since users are reluctant to pay much for downloading an app, operators need to adjust pricing to offer attractive products at a bearable margin, and it is better to provide a free trial or in-app purchase of value-added services. Third, keep app size under 60 MB; otherwise, split the download into stages or deliver content through incremental updates. Fourth, do not simply wait for users to leave reviews voluntarily; encourage them to write reviews, especially positive ones. Last but not least, there are still gaps in the market for teens and adults, and developers can design more apps for these two groups.

We hope our project can help you make and promote an app successfully!