# Analysis of the Android App Market on Google Play

---
![Google Play Logo](images/google_play_store.webp)

---

## Table of Contents
- **Overview**
- **Dataset** 
- **Data Cleaning**
- **Correcting Data Types**
- **Exploring App Categories**
- **Distribution of App Ratings**
- **Relation Between Size and Price of an App**
- **Relation Between App Category and App Price**
- **Filtering out "junk" Apps** 
- **Popularity of Paid Apps vs Free Apps**
- **Sentiment Analysis of Users Reviews**
- **Conclusions**
- **Growth and Next Steps**
- **Acknowledgement** 
---


## Overview 
Mobile apps are an integral part of modern life, offering a wide range of functionalities from entertainment to productivity. With the ease of app development and the potential for significant financial returns, the mobile app market continues to expand rapidly. As a result, more apps are being developed every day, contributing to a highly competitive and dynamic ecosystem.

In this notebook, we will perform a comprehensive analysis of the Android app market by examining data from over ten thousand apps available on the Google Play Store. By exploring key metrics across various app categories, we aim to uncover valuable insights that can inform strategies for app growth and user retention.

Our analysis will address the following key questions:
- What are the most popular app categories?
- How do user ratings and reviews vary across different types of apps?
- What pricing models (free vs. paid) are most prevalent, and how do they impact app success?
- Are there patterns in app size, update frequency, or installation numbers that correlate with app performance?

Through data visualisation and exploration, this notebook seeks to provide actionable insights for app developers, marketers, and stakeholders in the Android app ecosystem.

---



## Dataset 
For this analysis, we will use data from two files that provide comprehensive information about Android apps and user reviews on the Google Play Store:

### 1. `apps.csv`
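This file contains detailed information about various applications available on Google Play. It includes **13 features** that describe each app (the loaded DataFrame also shows a 14th column, `Unnamed: 0`, which appears to be a row index saved with the CSV rather than an app feature):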

- **App**: The name of the application.
- **Category**: The category to which the app belongs (e.g., Games, Productivity).
- **Rating**: Average user rating (out of 5).
- **Reviews**: Number of user reviews.
- **Size**: Size of the app file.
- **Installs**: Total number of installations.
- **Type**: Type of the app (free or paid).
- **Price**: Cost of the app.
- **Content Rating**: Age group for which the app is suitable.
- **Genres**: App genres.
- **Last Updated**: Date of the last update.
- **Current Ver**: The app's current version.
- **Android Ver**: Minimum Android version required to run the app.

### 2. `user_reviews.csv`
This file contains up to **100 user reviews per app**, focusing on the most helpful reviews. In addition to the review text, the data includes three key features:

- **Sentiment**: Indicates whether the review is Positive, Negative, or Neutral.
- **Sentiment Polarity**: A numerical value ranging from -1 (most negative) to +1 (most positive), representing the emotional tone of the review.
- **Sentiment Subjectivity**: A score between 0 and 1 indicating how subjective or opinion-based the review is (higher values suggest greater subjectivity).

These datasets together provide a rich source of information for understanding app performance, user satisfaction, and market trends. We will use them to explore patterns, draw insights, and develop strategies for improving app growth and user engagement.



In [1]:
# Read in apps dataset
import pandas as pd
apps_with_duplicates = pd.read_csv("datasets/apps.csv")

# Drop duplicates from apps_with_duplicates
apps = apps_with_duplicates.drop_duplicates()

# Print the shape of the dataset as (rows, columns)
print('Shape of the apps dataset (rows, columns) = ', apps.shape)

# Have a look at the first 5 rows
print(apps.head())

Shape of the apps dataset (rows, columns) =  (9659, 14)
   Unnamed: 0                                                App  \
0           0     Photo Editor & Candy Camera & Grid & ScrapBook   
1           1                                Coloring book moana   
2           2    U Launcher Lite – FREE Live Cool Themes, Hide ...   
3           3                              Sketch - Draw & Paint   
4           4              Pixel Draw - Number Art Coloring Book   

         Category  Rating  Reviews  Size     Installs  Type Price  \
0  ART_AND_DESIGN     4.1      159  19.0      10,000+  Free     0   
1  ART_AND_DESIGN     3.9      967  14.0     500,000+  Free     0   
2  ART_AND_DESIGN     4.7    87510   8.7   5,000,000+  Free     0   
3  ART_AND_DESIGN     4.5   215644  25.0  50,000,000+  Free     0   
4  ART_AND_DESIGN     4.3      967   2.8     100,000+  Free     0   

  Content Rating                     Genres      Last Updated  \
0       Everyone               Art & Design   January 7, 20

In [2]:
# Read in user reviews dataset
import pandas as pd
reviews_with_duplicates = pd.read_csv("datasets/user_reviews.csv")

# Drop duplicates from reviews_with_duplicates
reviews = reviews_with_duplicates.drop_duplicates()

# Print the shape of the dataset as (rows, columns)
print('Shape of the reviews dataset (rows, columns) = ', reviews.shape)

# Have a look at the first 5 rows
print(reviews.head())

Shape of the reviews dataset (rows, columns) =  (30679, 5)
                     App                                             Review  \
0  10 Best Foods for You  I like eat delicious food. That's I'm cooking ...   
1  10 Best Foods for You    This help eating healthy exercise regular basis   
2  10 Best Foods for You                                                NaN   
3  10 Best Foods for You         Works great especially going grocery store   
4  10 Best Foods for You                                       Best idea us   

  Sentiment  Sentiment_Polarity  Sentiment_Subjectivity  
0  Positive                1.00                0.533333  
1  Positive                0.25                0.288462  
2       NaN                 NaN                     NaN  
3  Positive                0.40                0.875000  
4  Positive                1.00                0.300000  


---

## Data Cleaning

Data cleaning is a crucial step in any data science project, ensuring that the dataset is accurate, consistent, and ready for analysis. Although it can be a time-consuming process, its importance cannot be overstated, as clean data leads to more reliable insights and accurate results.

### Issues Identified in the Dataset
Upon examining the first few rows of the dataset, we noticed some inconsistencies in the formatting of key columns, specifically:

- **Installs**: Contains special characters like commas (`,`) and plus signs (`+`), which prevent this column from being purely numeric.
- **Price**: Contains dollar signs (`$`), making it impossible to perform mathematical operations without further processing.

To prepare the data for analysis, we need to clean these columns by removing the unwanted characters and converting the values to numeric types.

### Data Cleaning Steps
1. **Remove Special Characters**:
   - Strip commas (`,`), plus signs (`+`), and dollar signs (`$`) from the `Installs` and `Price` columns.
2. **Convert to Numeric**:
   - Convert the cleaned values to numeric data types (`int` for `Installs`, `float` for `Price`). This conversion is carried out in the Correcting Data Types section.

### Printing the Data Summary
After stripping the characters, it is good practice to print a summary of the dataset to confirm the state of each column. We will use the `info()` method to check the data types. Note that removing characters does not change a column's dtype: `Installs` and `Price` are still reported as `object` at this stage and only become numeric once they are explicitly converted.

By addressing these issues, we ensure that the dataset is properly formatted and ready for accurate analysis in subsequent steps.
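
As a minimal sketch of the character-stripping step, the same clean-up can be written with pandas' vectorised string methods. The three rows below are made up for illustration and are not taken from `apps.csv`; `toy` is a hypothetical name.

```python
import pandas as pd

# Toy data for illustration only; the values are hypothetical
toy = pd.DataFrame({
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
    "Price": ["0", "$4.99", "$0.99"],
})

chars_to_remove = ["+", ",", "$"]
cols_to_clean = ["Installs", "Price"]

for col in cols_to_clean:
    for char in chars_to_remove:
        # regex=False treats '+' and '$' as literal characters,
        # not as regular-expression metacharacters
        toy[col] = toy[col].str.replace(char, "", regex=False)

print(toy)
# The values are still strings at this point; no numeric conversion has happened
print(toy.dtypes)
```

Passing `regex=False` matters here because `+` and `$` have special meanings in regular expressions.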


In [3]:
# List of characters to remove
chars_to_remove = ['+', ',', '$']
# List of column names to clean
cols_to_clean = ["Installs", "Price"]

# Loop for each column in cols_to_clean
for col in cols_to_clean:
    # Loop for each char in chars_to_remove
    for char in chars_to_remove:
        # Strip the character (regex=False treats '+' and '$' literally)
        apps[col] = apps[col].str.replace(char, "", regex=False)
        
# Print a summary of the apps dataframe
print(apps.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      9659 non-null   int64  
 1   App             9659 non-null   object 
 2   Category        9659 non-null   object 
 3   Rating          8196 non-null   float64
 4   Reviews         9659 non-null   int64  
 5   Size            8432 non-null   float64
 6   Installs        9659 non-null   object 
 7   Type            9659 non-null   object 
 8   Price           9659 non-null   object 
 9   Content Rating  9659 non-null   object 
 10  Genres          9659 non-null   object 
 11  Last Updated    9659 non-null   object 
 12  Current Ver     9651 non-null   object 
 13  Android Ver     9657 non-null   object 
dtypes: float64(2), int64(2), object(10)
memory usage: 1.0+ MB
None


---

## Correcting Data Types

The summary printed during data cleaning shows that the **Installs** and **Price** columns are stored as the `object` data type instead of the desired `int` or `float` types. This happened because the columns originally contained special characters (commas, plus signs, dollar signs), so pandas read every value as a string. Stripping those characters leaves the values as strings until we convert them explicitly.

### Why Correcting Data Types Matters
Accurate data types are essential for performing mathematical operations, statistical analysis, and visualisations. By ensuring that **Installs** and **Price** are numeric, we can efficiently analyze trends, correlations, and distributions in the dataset.

### Columns to Focus On
The four key features we will be analyzing most frequently are:

- **Installs**: Needs conversion to `int` after removing special characters.
- **Price**: Needs conversion to `float` after removing the dollar sign (`$`).
- **Size**: Already a `float` type, no changes needed.
- **Rating**: Already a `float` type, no changes needed.

### Data Type Corrections
To correct the data types:
1. **Installs**: Convert the cleaned `Installs` column to `int`.
2. **Price**: Convert the cleaned `Price` column to `float`.

### Pandas Data Types Overview
In Pandas, common data types include:
- `int64`: Integer values.
- `float64`: Decimal values.
- `object`: Mixed or string data.

For more information on Pandas data types, refer to the Pandas documentation.

By ensuring **Installs** and **Price** are numeric, we set the foundation for accurate and effective analysis in the upcoming sections.
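
A hedged sketch of the conversion on made-up values: `astype` raises an error if any value cannot be parsed, whereas `pd.to_numeric` with `errors="coerce"` turns such values into `NaN` so they can be inspected. The toy values, including the stray `"Free"` entry, are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical, already-cleaned string values
installs = pd.Series(["10000", "500000", "5000000"])
price = pd.Series(["0", "4.99", "Free"])  # "Free" is a deliberately bad value

# Strict conversion: succeeds only when every value is a valid integer
installs_int = installs.astype("int64")

# Lenient conversion: unparsable values become NaN instead of raising
price_float = pd.to_numeric(price, errors="coerce")

print(installs_int.dtype)
print(price_float)
# Count how many values failed to parse
print(price_float.isna().sum())
```

The strict form is a reasonable default when the cleaning step is trusted. The lenient form is useful for finding rows that still contain unexpected text.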


In [4]:
import numpy as np

# Convert Installs to integer data type
apps["Installs"] = apps["Installs"].astype("int64")

# Convert Price to float data type
apps["Price"] = apps["Price"].astype("float")

# Checking dtypes of the apps dataframe
print(apps.dtypes)

Unnamed: 0          int64
App                object
Category           object
Rating            float64
Reviews             int64
Size              float64
Installs            int64
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object


---

## Exploring App Categories

With over 1 billion active users in 190 countries, Google Play is a key platform for distributing mobile apps to a global audience. To enhance discoverability, Google groups apps into various **categories**, allowing users to easily find apps that meet their needs. Understanding these categories can provide valuable insights into market trends and opportunities for app developers.

### Key Questions
Our exploration focuses on answering the following questions:
1. **Which category has the highest share of active apps in the market?**
2. **Is any specific category dominating the market?**
3. **Which categories have the fewest number of apps?**



In [12]:
import plotly
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go

# Print the total number of unique categories
num_categories = apps["Category"].nunique()
print('Number of categories = ', num_categories)

# Count the number of apps in each 'Category'. 
num_apps_in_category = apps["Category"].value_counts()

# Sort num_apps_in_category in descending order based on the count of apps in each category
sorted_num_apps_in_category = num_apps_in_category.sort_values(ascending=False)

data = [go.Bar(
        x = sorted_num_apps_in_category.index, # index = category name
        y = sorted_num_apps_in_category.values, # value = count
)]

layout = {
    'title': 'Number of Apps in Each Category',  # Chart title
    'xaxis': {'title': 'Category'},  # X-axis label
    'yaxis': {'title': 'Number of Apps'}  # Y-axis label
}

plotly.offline.iplot({'data': data, 'layout': layout})

Number of categories =  33


### Overview of Categories
The dataset reveals **33 unique app categories**, ranging from entertainment and productivity to specialized fields like medical and educational apps. This classification plays a crucial role in app visibility and user engagement.

### Insights from the Dataset
- **Family** and **Game** apps dominate the market, making up the largest share of active apps. This reflects the high demand for entertainment and educational content for children and families.
- **Tools**, **Business**, and **Medical** apps also rank high, highlighting the growing need for productivity solutions, professional applications, and healthcare-related services.
- On the other hand, categories like **Comics**, **Beauty**, and **Parenting** have fewer apps, indicating potential niches with less competition.

By analysing the distribution of apps across categories, we better understand the dynamics of the Android app market and identify areas where developers might find growth opportunities or face stiff competition.
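
To put a number on statements such as "largest share", the counts can be turned into percentages. The sketch below uses a small made-up `Category` column rather than the real dataset, so the figures are illustrative only.

```python
import pandas as pd

# Hypothetical category labels for illustration
toy = pd.DataFrame({
    "Category": ["FAMILY", "FAMILY", "FAMILY", "GAME", "GAME",
                 "TOOLS", "COMICS", "BEAUTY"]
})

# normalize=True returns each category's fraction of all rows,
# sorted from largest to smallest
category_share = toy["Category"].value_counts(normalize=True).mul(100).round(1)

print(category_share.head(2))  # largest categories by share (%)
print(category_share.tail(2))  # smallest categories by share (%)
```

On the real data, the same call on `apps["Category"]` would give each category's percentage of all apps in the dataset.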

---

## Distribution of App Ratings

App ratings, ranging from **1 to 5**, are a critical performance indicator in the Android app market. They influence an app's visibility on the Google Play Store, affect user conversion rates, and contribute to a company's overall brand perception. Higher ratings not only improve the likelihood of being featured but also foster trust and engagement among users.



In [None]:
# Average rating of apps
avg_app_rating = apps["Rating"].mean()
print('Average app rating = ', avg_app_rating)

# Distribution of apps according to their ratings
data = [go.Histogram(
        x = apps['Rating']
)]

# Title, x label, y label and vertical dashed line to indicate the average app rating
layout = {
    'title': 'Distribution of App Ratings',  # Title of the histogram
    'xaxis': {'title': 'App Rating'},  # X-axis label
    'yaxis': {'title': 'Count of Apps'},  # Y-axis label
    'shapes': [{
              'type' :'line',
              'x0': avg_app_rating,
              'y0': 0,
              'x1': avg_app_rating,
              'y1': 1000,
              'line': { 'dash': 'dashdot'}
          }]
          }

plotly.offline.iplot({'data': data, 'layout': layout})

Average app rating =  4.173243045387994


### Key Insight: Average Rating
From our analysis, the **average app rating** across all categories is **4.17**, indicating that most apps receive favourable feedback from users.

### Histogram Analysis
A histogram plot of app ratings reveals that the distribution is **left-skewed**:
- The majority of apps are rated highly, with ratings clustered around 4.0 and above.
- Only a small portion of apps receive low ratings (below 3.0), which suggests that poorly rated apps are the exception rather than the norm.

### Implications
The skewed distribution indicates that users generally have positive experiences with most apps, which could be due to the competitive nature of the market encouraging developers to maintain high-quality standards. However, the few low-rated apps highlight the importance of addressing user feedback and maintaining app performance to avoid poor ratings.

Understanding the distribution of ratings helps developers benchmark their app's performance against industry standards and underscores the importance of maintaining high user satisfaction to succeed in the crowded app marketplace.
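
The skew and the share of low-rated apps can also be checked numerically instead of being read off the histogram. A minimal sketch on made-up ratings (not the real `Rating` column): a negative `skew()` value corresponds to a left-skewed distribution.

```python
import pandas as pd

# Hypothetical ratings; None stands in for apps without a rating
ratings = pd.Series([4.5, 4.3, 4.7, 4.1, 4.4, 3.9, 4.6, 2.1, 4.2, None])

mean_rating = ratings.mean()   # missing values are skipped by default
skewness = ratings.skew()      # negative => longer tail towards low ratings
share_below_3 = (ratings.dropna() < 3).mean() * 100

print(f"Mean rating: {mean_rating:.2f}")
print(f"Skewness: {skewness:.2f}")
print(f"Share of ratings below 3.0: {share_below_3:.1f}%")
```

Applied to `apps["Rating"]`, these three values would summarise the shape described above without relying on the plot.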

---
