# Introduction

In this notebook, we analyse the Android app market by comparing thousands of apps in the Google Play Store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [1]:
import pandas as pd


# Notebook Presentation

In [2]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [3]:
df_apps = pd.read_csv('/media/faisal-chap/Python/DataScience/APPSTORE_Analysis/Final Part/apps.csv')

# Data Cleaning

**Challenge**: How many rows and columns does `df_apps` have? What are the column names? Look at a random sample of 5 different rows with [.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

In [4]:
print(df_apps.shape)


(10841, 12)


In [5]:
print(df_apps.columns)


Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')


In [6]:
print(df_apps.sample(5))

                        App         Category  Rating  Reviews  Size_MBs  \
435   Jigsaw Puzzles AK 74M  PERSONALIZATION     NaN        0      4.90   
235    Trovami se ci riesci             GAME    5.00       11      6.10   
1030             BP Service         BUSINESS     NaN        0     26.00   
4175             Toronto FC           SPORTS    3.40      144     79.00   
1735         MBU DX Cluster            TOOLS    4.00        5      1.60   

     Installs  Type Price Content_Rating           Genres      Last_Updated  \
435        10  Free     0     Mature 17+  Personalization       May 7, 2018   
235        10  Free     0       Everyone           Arcade    March 11, 2017   
1030      100  Free     0       Everyone         Business  January 17, 2018   
4175   10,000  Free     0           Teen           Sports      May 11, 2018   
1735      500  Free     0       Everyone            Tools     June 16, 2015   

       Android_Ver  
435   4.0.3 and up  
235     2.3 and up  
1030    4.1

### Drop Unused Columns

**Challenge**: Remove the columns called `Last_Updated` and `Android_Ver` from the DataFrame. We will not use these columns.

In [7]:
df_apps = df_apps.drop(columns=['Last_Updated','Android_Ver'])


### Find and Remove NaN values in Ratings

**Challenge**: How many rows have a NaN value (not-a-number) in the `Rating` column? Create a DataFrame called `df_apps_clean` that does not include these rows.

In [8]:
print(df_apps['Rating'].isna().sum())
df_apps_clean = df_apps.dropna(subset=['Rating'])
print(df_apps_clean['Rating'].isna().sum())

1474
0


### Find and Remove Duplicates

**Challenge**: Are there any duplicates in the data? Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) method. How many entries can you find for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`.


In [9]:
print(df_apps_clean.duplicated().sum())

476


In [10]:
print(df_apps_clean[df_apps_clean['App'] == 'Instagram'].shape[0])

4


In [11]:
df_apps_clean = df_apps_clean.drop_duplicates()

In [12]:
print(df_apps_clean.duplicated().sum())

0
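Note that `.drop_duplicates()` with no arguments only removes rows that match in every column, so entries for the same app that differ in, say, `Reviews` are kept. A minimal sketch of a stricter approach follows. The toy rows are invented for illustration, and the choice of `subset` columns is an assumption, not something this notebook applies:

```python
import pandas as pd

# Toy data (made up): the same app scraped several times,
# once with a slightly different review count
toy = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Instagram', 'Notes'],
    'Type': ['Free', 'Free', 'Free', 'Free'],
    'Price': ['0', '0', '0', '0'],
    'Reviews': [100, 100, 105, 7],
})

# Full-row comparison: only the exact copy is dropped
full_row = toy.drop_duplicates()

# Comparison on identifying columns only: one row per app is kept
by_key = toy.drop_duplicates(subset=['App', 'Type', 'Price'])

print(len(full_row), len(by_key))
```

Whether to deduplicate on a key like this depends on the analysis; it changes every count computed afterwards.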


# Find Highest Rated Apps

**Challenge**: Identify which apps are the highest rated. What problem might you encounter if you rely on ratings alone to determine the quality of an app?

In [13]:
print(df_apps_clean.sort_values('Rating',ascending=False).head())

                        App Category  Rating  Reviews  Size_MBs Installs  \
186  Basics of Orthopaedics  MEDICAL    5.00        1      5.60       10   
21      KBA-EZ Health Guide  MEDICAL    5.00        4     25.00        1   
28                 Ra Ga Ba     GAME    5.00        2     20.00        1   
47                  Mu.F.O.     GAME    5.00        2     16.00        1   
237                EC Mover     GAME    5.00        5      4.60       10   

     Type  Price Content_Rating   Genres  
186  Free      0       Everyone  Medical  
21   Free      0       Everyone  Medical  
28   Paid  $1.49       Everyone   Arcade  
47   Paid  $0.99       Everyone   Arcade  
237  Free      0       Everyone   Racing  
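
The apps at the top of this list have only a handful of reviews, so a 5.0 rating says little about their quality. One possible workaround, sketched here on invented toy data with an arbitrary threshold (`min_reviews` is a hypothetical name and value), is to require a minimum number of reviews before ranking:

```python
import pandas as pd

# Toy data (made up) standing in for df_apps_clean
toy = pd.DataFrame({
    'App': ['Tiny App', 'Popular App', 'Mid App'],
    'Rating': [5.0, 4.6, 4.8],
    'Reviews': [2, 250000, 1200],
})

min_reviews = 1000  # arbitrary cut-off chosen for illustration

# Keep only apps with enough reviews, then rank by rating
trusted = toy[toy['Reviews'] >= min_reviews]
ranked = trusted.sort_values('Rating', ascending=False)
print(ranked)
```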


In [14]:
find = df_apps_clean[df_apps_clean['Content_Rating'] == 'Adults only 18+']
print(find)

                                           App Category  Rating  Reviews  \
6041                               Manga Books   COMICS    3.80     7326   
6272  Manga Master - Best manga & comic reader   COMICS    4.60    24005   
7609         DraftKings - Daily Fantasy Sports   SPORTS    4.50    50017   

      Size_MBs   Installs  Type Price   Content_Rating  Genres  
6041      8.05    500,000  Free     0  Adults only 18+  Comics  
6272      4.90    500,000  Free     0  Adults only 18+  Comics  
7609     41.00  1,000,000  Free     0  Adults only 18+  Sports  


# Find the 5 Largest Apps in Terms of Size (MBs)

**Challenge**: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place, or can developers make apps as large as they please?

In [15]:
print(df_apps_clean.sort_values('Size_MBs',ascending=False).head())

                                  App   Category  Rating  Reviews  Size_MBs  \
9945                  Ultimate Tennis     SPORTS    4.30   183004    100.00   
9944     Gangster Town: Vice District     FAMILY    4.30    65146    100.00   
9943            Miami crime simulator       GAME    4.00   254518    100.00   
9942   Talking Babsy Baby: Baby Games  LIFESTYLE    4.00   140995    100.00   
10687          Hungry Shark Evolution       GAME    4.50  6074334    100.00   

          Installs  Type Price Content_Rating                  Genres  
9945    10,000,000  Free     0       Everyone                  Sports  
9944    10,000,000  Free     0     Mature 17+              Simulation  
9943    10,000,000  Free     0     Mature 17+                  Action  
9942    10,000,000  Free     0       Everyone  Lifestyle;Pretend Play  
10687  100,000,000  Free     0           Teen                  Arcade  


# Find the 5 Apps with the Most Reviews

**Challenge**: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [16]:
top_reviewed = df_apps_clean.sort_values('Reviews',ascending=False).head(50)
print(top_reviewed.head())


print(top_reviewed['Type'].value_counts())



                      App       Category  Rating   Reviews  Size_MBs  \
10805            Facebook         SOCIAL    4.10  78158306      5.30   
10811            Facebook         SOCIAL    4.10  78128208      5.30   
10785  WhatsApp Messenger  COMMUNICATION    4.40  69119316      3.50   
10797  WhatsApp Messenger  COMMUNICATION    4.40  69109672      3.50   
10808           Instagram         SOCIAL    4.50  66577446      5.30   

            Installs  Type Price Content_Rating         Genres  
10805  1,000,000,000  Free     0           Teen         Social  
10811  1,000,000,000  Free     0           Teen         Social  
10785  1,000,000,000  Free     0       Everyone  Communication  
10797  1,000,000,000  Free     0       Everyone  Communication  
10808  1,000,000,000  Free     0           Teen         Social  
Type
Free    50
Name: count, dtype: int64


# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

In [17]:
import plotly.express as px


In [18]:
content_rating = df_apps_clean['Content_Rating'].value_counts().reset_index()

content_rating.columns = ['Content_Rating','Count']






In [19]:
fig = px.pie(content_rating,values='Count',names='Content_Rating',title='Content Rating Distribution of different apps in app store')
fig.show()

In [20]:
fig = px.pie(content_rating,values='Count',names='Content_Rating',title='Content Rating Distribution of different apps in app store',hole=0.4)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(showlegend=True)
fig.show()

# Numeric Type Conversion: Examine the Number of Installs

**Challenge**: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install? 

Check the datatype of the Installs column.

Count the number of apps at each level of installations. 

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first.

In [21]:
counted_apps = df_apps_clean['Installs'].value_counts().reset_index()
counted_apps.columns = ['Installs','Count']

# step 1: strip non-numeric characters (regex=False treats '+' literally); step 2: cast to int
counted_apps['Installs'] = counted_apps['Installs'].str.replace('+','',regex=False).str.replace(',','',regex=False).astype(int)
# sort from the highest install level down and renumber the rows
counted_apps = counted_apps.sort_values('Installs',ascending=False).reset_index(drop=True)
print(f"Apps over 1 Billion Installation in Numbers are: {counted_apps['Count'].head(1).values[0]}")

print(counted_apps.head(20))


Apps over 1 Billion Installation in Numbers are: 49
      Installs  Count
0   1000000000     49
1    500000000     61
2    100000000    369
3     50000000    272
4     10000000   1130
5      5000000    683
6      1000000   1486
7       500000    516
8       100000   1110
9        50000    462
10       10000    989
11        5000    426
12        1000    699
13         500    199
14         100    303
15          50     56
16          10     69
17           5      9
18           1      3


In [22]:
print(f"The App with 1 Installation are: {counted_apps['Count'].tail(1).values[0]}")

The App with 1 Installation are: 3
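
The two-step conversion mentioned in the hint can be sketched in isolation as follows. The values are invented for illustration; the `+` suffix is an assumption about how raw install counts may be formatted:

```python
import pandas as pd

installs = pd.Series(['1,000,000+', '500', '10,000'])  # made-up values

# Step 1: strip non-numeric characters (regex=False treats '+' literally)
stripped = (installs.str.replace(',', '', regex=False)
                    .str.replace('+', '', regex=False))

# Step 2: cast the cleaned strings to a numeric dtype
numeric = pd.to_numeric(stripped)
print(numeric.tolist())
```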


In [23]:
counted_apps['Installs'] = counted_apps['Installs'].astype(str)
fig = px.bar(counted_apps,
             x='Installs',
             y='Count',
             title='Number of Apps by Install Level',
             text='Count',
             color_continuous_scale='Blues',
             color='Count')
fig.update_layout(
    xaxis_title='Number of Installs',
    yaxis_title='Number of Apps',
    xaxis_tickformat=',',  # Add comma formatting
    xaxis_tickangle=-45,
    template='plotly_white'
)
fig.show()


# Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.

**Challenge**: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the `df_apps_clean` DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?


In [24]:
# converting the price to numeric values
df_apps_clean['Price'] = df_apps_clean['Price'].str.replace('$','',regex=False)
df_apps_clean['Price'] = pd.to_numeric(df_apps_clean['Price'])



In [25]:
# converting the installs to numeric values (remove the thousands separators first)
df_apps_clean['Installs'] = df_apps_clean['Installs'].str.replace(',','',regex=False)
df_apps_clean['Installs'] = pd.to_numeric(df_apps_clean['Installs'])

In [26]:
paid_apps = df_apps_clean[df_apps_clean['Type'] == 'Paid']
print(len(paid_apps))
# print(paid_apps.sort_values('Price',ascending=False))

# removing junk
paid_apps_clean = paid_apps[paid_apps['Price'] < 250]
# print(paid_apps_clean.sort_values('Price',ascending=False))


# top 20
print(paid_apps_clean.sort_values('Price',ascending=False).head(20))



613
                                                    App     Category  Rating  \
2281                          Vargo Anesthesia Mega App      MEDICAL    4.60   
1407                                       LTC AS Legal      MEDICAL    4.00   
2629                                   I am Rich Person    LIFESTYLE    4.20   
2481                            A Manual of Acupuncture      MEDICAL    3.50   
504                           AP Art History Flashcards       FAMILY    5.00   
2463                                 PTA Content Master      MEDICAL    4.20   
4264                            Golfshot Plus: Golf GPS       SPORTS    4.10   
2207                                           EMT PASS      MEDICAL    3.40   
3241  Muscle Premium - Human Anatomy, Kinesiology, B...      MEDICAL    4.20   
4774   Human Anatomy Atlas 2018: Complete 3D Human Body      MEDICAL    4.50   
4772   Human Anatomy Atlas 2018: Complete 3D Human Body      MEDICAL    4.50   
526                          USMLE S

In [27]:
# adding the revenue estimate column
# .assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
paid_apps_clean = paid_apps_clean.assign(
    Revenue_Estimate=paid_apps_clean['Installs'] * paid_apps_clean['Price']
)


print(paid_apps_clean.head())


                                        App   Category  Rating  Reviews  \
28                                 Ra Ga Ba       GAME    5.00        2   
47                                  Mu.F.O.       GAME    5.00        2   
233  Chess of Blades (BL/Yaoi Game) (No VA)     FAMILY    4.80        4   
248                            The DG Buddy   BUSINESS    3.70        3   
291                     AC DC Power Monitor  LIFESTYLE    5.00        1   

     Size_MBs  Installs  Type  Price Content_Rating     Genres  \
28      20.00         1  Paid   1.49       Everyone     Arcade   
47      16.00         1  Paid   0.99       Everyone     Arcade   
233     23.00        10  Paid  14.99           Teen     Casual   
248     11.00        10  Paid   2.49       Everyone   Business   
291      1.20        10  Paid   3.04       Everyone  Lifestyle   

     Revenue_Estimate  
28               1.49  
47               0.99  
233            149.90  
248             24.90  
291             30.40  



### The most expensive apps sub $250

### Highest Grossing Paid Apps (ballpark estimate)

In [28]:
# top 10 paid apps

top10_paid = paid_apps_clean.sort_values('Revenue_Estimate',ascending=False).head(10)
print(top10_paid)


# checking for games in it 
print(len(top10_paid[top10_paid['Category'] == 'GAME']))

                                App     Category  Rating  Reviews  Size_MBs  \
9224                      Minecraft       FAMILY    4.50  2375336     19.00   
9220                      Minecraft       FAMILY    4.50  2376564     19.00   
8825                  Hitman Sniper         GAME    4.60   408292     29.00   
7151  Grand Theft Auto: San Andreas         GAME    4.40   348962     26.00   
7477            Facetune - For Free  PHOTOGRAPHY    4.40    49553     48.00   
7977        Sleep as Android Unlock    LIFESTYLE    4.50    23966      0.85   
6594            DraStic DS Emulator         GAME    4.60    87766     12.00   
6082                   Weather Live      WEATHER    4.50    76593      4.75   
7044                         Tasker        TOOLS    4.60    43045      3.40   
7954                    Bloons TD 5       FAMILY    4.60   190086     94.00   

      Installs  Type  Price Content_Rating                     Genres  \
9224  10000000  Paid   6.99   Everyone 10+  Arcade;Action

# Plotly Bar Charts & Scatter Plots: Analysing App Categories

In [29]:
fig = px.bar(top10_paid,
             x='App',y='Revenue_Estimate',
             title='Top 10 Paid Earning Apps on PlayStore',
             text='Revenue_Estimate',
             color='Category')
fig.update_layout(
    xaxis_tickangle = 45,
    template = 'plotly_white',
    yaxis_title = 'Revenue Estimate ($)'

)
fig.show()

### Vertical Bar Chart - Highest Competition (Number of Apps)

In [30]:
highest_competition = df_apps_clean['Category'].value_counts().reset_index()
highest_competition.columns = ['Category','Number_of_apps']



fig = px.bar(
    highest_competition,
    x='Category',
    y='Number_of_apps',
    color_continuous_scale='plasma',
    text='Number_of_apps',
    color='Number_of_apps'
)
fig.update_layout(
    xaxis_tickangle = -45,
    template='plotly_white'
)
fig.show()

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

In [31]:
category_installs = df_apps_clean.groupby('Category')['Installs'].sum().reset_index()
category_installs = category_installs.sort_values('Installs',ascending=False)


fig = px.bar(
    category_installs,
    x='Installs',y='Category',
    orientation='h',
    text='Installs',
    color='Installs',
    color_continuous_scale='Viridis'
)
fig.update_layout(
    template='plotly_white'
)

fig.show()

### Category Concentration - Downloads vs. Competition

**Challenge**: 
* First, create a DataFrame that has the number of apps in one column and the number of installs in another:

<img src=https://imgur.com/uQRSlXi.png width="350">

* Then use the [plotly express examples from the documentation](https://plotly.com/python/line-and-scatter/) alongside the [.scatter() API reference](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html) to create a scatter plot that looks like this.

<img src=https://imgur.com/cHsqh6a.png>

*Hint*: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log') 

In [34]:
category_conc = df_apps_clean.groupby('Category').agg({
    'Installs': 'sum',
    'App': 'count'
})

category_conc = category_conc.reset_index()
category_conc.rename(columns={'App': 'Number_of_apps', 'Installs': 'Total_Installs',}, inplace=True)
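
As an alternative sketch, the same kind of table can be built in one step with pandas named aggregation, which sets the output column names inside `.agg()`. The toy data and the name `category_summary` are invented for illustration:

```python
import pandas as pd

# Toy data (made up) standing in for df_apps_clean
toy = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Category': ['GAME', 'GAME', 'TOOLS', 'GAME'],
    'Installs': [100, 1000, 50, 10],
})

# new_column=(source_column, aggregation) names each result column directly
category_summary = (
    toy.groupby('Category')
       .agg(Number_of_apps=('App', 'count'),
            Total_Installs=('Installs', 'sum'))
       .reset_index()
)
print(category_summary)
```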

In [37]:
fig = px.scatter(
    category_conc,
    x='Number_of_apps',
    y='Total_Installs',
    size='Total_Installs',
    color='Category',
    hover_name='Category',
    title='Total Installs vs Number of Apps by Category',
    size_max=60,
    template='plotly_white'
)

fig.update_layout(
    xaxis_title='Number of Apps',
    yaxis_title='Total Installs',
    yaxis=dict(type='log'),
    template='plotly_white'
)

fig.show()

# Extracting Nested Data from a Column

**Challenge**: How many different genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values. See if you can work around this problem by using the .str.split() method and the DataFrame's [.stack() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html).


In [47]:
# genre counts before splitting the nested values
genre = df_apps_clean['Genres']
print(genre.value_counts().head(10))




# split the nested values on ';' so each genre gets its own column
genre_series = df_apps_clean['Genres'].str.split(';',expand=True)
# stack the columns into a single Series with one genre per row
genre_split = genre_series.stack()
# discard the (row, column) MultiIndex created by stack(); only the values are needed
genre_split =  genre_split.reset_index(drop=True)



# top 10 genres
print(genre_split.value_counts().head(10))




Genres
Tools              732
Entertainment      498
Education          446
Action             349
Productivity       334
Finance            317
Personalization    310
Communication      307
Lifestyle          305
Photography        304
Name: count, dtype: int64
Tools              733
Education          626
Entertainment      538
Action             364
Productivity       334
Finance            317
Personalization    310
Communication      308
Lifestyle          307
Photography        304
Name: count, dtype: int64
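
Another way to flatten the nested values, shown here on invented toy data, is to split each string into a list and call `.explode()` instead of `.stack()`. This is an alternative sketch, not the approach the challenge asks for:

```python
import pandas as pd

# Toy data (made up): some entries hold two genres separated by ';'
genres = pd.Series(['Arcade;Action', 'Tools', 'Education;Pretend Play', 'Tools'])

# .str.split(';') yields a list per row; .explode() gives each list item its own row
flat = genres.str.split(';').explode()
counts = flat.value_counts()
print(counts)
```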


# Colour Scales in Plotly Charts - Competition in Genres

**Challenge**: Can you create this chart with the Series containing the genre data? 

<img src=https://imgur.com/DbcoQli.png width=400>

Try experimenting with the built in colour scales in Plotly. You can find a full list [here](https://plotly.com/python/builtin-colorscales/). 

* Find a way to set the colour scale using the color_continuous_scale parameter. 
* Find a way to make the color axis disappear by using coloraxis_showscale. 

In [54]:
# genre_split comes from the previous section

genre_count = genre_split.value_counts().reset_index()
genre_count.columns = ['genre','number of apps']

fig = px.bar(
    genre_count,
    x='genre',
    y='number of apps',
    title='Genre Competition',
    color_continuous_scale='Inferno',
    color='number of apps'
)
fig.update_layout(
    xaxis_tickangle=45,
    xaxis_title = 'Genre',
    yaxis_title = 'Number of Apps',
    coloraxis_showscale = False,
    

)

fig.show()

# Grouped Bar Charts: Free vs. Paid Apps per Category

In [56]:
grouped_by_type = df_apps_clean.groupby(['Category','Type'])['App'].count().reset_index()
grouped_by_type.columns = ['Category', 'Type', 'Number_of_Apps']



**Challenge**: Use the plotly express bar [chart examples](https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories) and the [.bar() API reference](https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.bar) to create this bar chart: 

<img src=https://imgur.com/LE0XCxA.png>

You'll want to use the `grouped_by_type` DataFrame that you created above, which has the number of free and paid apps per category.

See if you can figure out how to get the look above by changing the `categoryorder` to 'total descending' as outlined in the documentation [here](https://plotly.com/python/categorical-axes/#automatically-sorting-categories-by-name-or-total-value).

In [57]:
fig  = px.bar(
    grouped_by_type,
    x='Category',
    y='Number_of_Apps',
    color='Type',
    title='Number of Apps by Category and Type',
    barmode='group',
    color_discrete_sequence=px.colors.qualitative.Plotly
)

fig.update_layout(
    xaxis={'categoryorder': 'total descending'},  # sort categories by their total number of apps
    xaxis_tickangle=-45,
    xaxis_title='Category',
    yaxis_title='Number of Apps',
    template='plotly_white'
)
fig.show()

# Plotly Box Plots: Lost Downloads for Paid Apps

**Challenge**: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the [Box Plots Guide](https://plotly.com/python/box-plots/) and the [.box API reference](https://plotly.com/python-api-reference/generated/plotly.express.box.html) to create the following chart. 

<img src=https://imgur.com/uVsECT3.png>


In [61]:
fig = px.box(
    df_apps_clean,
    x='Type',
    y='Installs',
    title='Installs by App Type (Free vs. Paid)',
    color='Type',
    points='all',
    template='plotly_white',
    color_discrete_sequence=['#00cc96', '#ab63fa']
)

fig.update_layout(
    xaxis_tickangle=-45,
    xaxis_title='Type',
    yaxis_title='Installs',
    template='plotly_white'
)
fig.show()  
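
The medians the challenge asks about can also be read off numerically rather than from the hover text. A minimal sketch on invented toy data (the real values come from `df_apps_clean`):

```python
import pandas as pd

# Toy data (made up) standing in for df_apps_clean
toy = pd.DataFrame({
    'Type': ['Free', 'Free', 'Free', 'Paid', 'Paid', 'Paid'],
    'Installs': [1000, 500000, 10000000, 100, 5000, 10000],
})

# Median installs per app type
medians = toy.groupby('Type')['Installs'].median()
print(medians)
```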

# Plotly Box Plots: Revenue by App Category

**Challenge**: See if you can generate the chart below: 

<img src=https://imgur.com/v4CiNqX.png>

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories. 

In [64]:
# print(paid_apps_clean[['Category', 'Price', 'Installs', 'Revenue_Estimate']].head())


fig = px.box(
    paid_apps_clean,
    x='Category',
    y='Revenue_Estimate',
    title='Revenue Estimate Distribution by Category',
    color='Category',
    points='all',
    template='plotly_white'
)

fig.update_layout(
    xaxis={'categoryorder': 'min ascending'},
    yaxis_type='log',  # Revenue can range from $0 to millions!
    xaxis_title='App Category',
    yaxis_title='Revenue Estimate ($)',
    showlegend=False,
    title_font_size=22
)
fig.show()



# How Much Can You Charge? Examine Paid App Pricing Strategies by Category

**Challenge**: What is the median price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using `{'categoryorder': 'max descending'}` to sort the categories.

In [65]:
fig = px.box(
    paid_apps_clean,
    x='Category',
    y='Price',
    title='Price Distribution by Category',
    color='Category',
    points='outliers',
    template='plotly_white'
)


In [66]:
fig.update_layout(
    xaxis={'categoryorder': 'max descending'},
    xaxis_title='App Category',
    yaxis_title='Price (USD)',
    yaxis_range=[0, 60],  # Optional: Cap Y-axis if there are crazy outliers
    showlegend=False,
    title_font_size=22
)
fig.show()
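
The first question in the challenge, the median price of a paid app, is not answered by the chart alone. A minimal sketch of the calculation on invented toy data (with the notebook's data, the same calls would presumably be applied to `paid_apps_clean`):

```python
import pandas as pd

# Toy data (made up) standing in for paid_apps_clean
toy = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'MEDICAL', 'MEDICAL', 'TOOLS'],
    'Price': [0.99, 4.99, 9.99, 24.99, 2.99],
})

# Median price across all paid apps
overall_median = toy['Price'].median()

# Median price per category, most expensive category first
by_category = (toy.groupby('Category')['Price']
                  .median()
                  .sort_values(ascending=False))

print(overall_median)
print(by_category)
```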
