# **Education Apps: Market Trends, Monetization, and Growth Opportunities**


## Introduction

The education app market has experienced rapid growth in recent years, driven by increased mobile device adoption, digital learning trends, and demand for accessible educational content. This analysis leverages a dataset of over 2 million apps to uncover key trends, revenue strategies, and growth opportunities within the education category.

**Objectives of this analysis:**
1. Identify the distribution of education apps by type (free vs paid) and monetization strategy (ads, in-app purchases, freemium models).  
2. Explore trends in app downloads and user engagement to determine which strategies correlate with higher reach.  
3. Provide actionable insights for companies and app developers to optimize revenue, improve user acquisition, and prioritize app development focus areas.

**Dataset Overview:**
- Size: 2,000,000+ apps  
- Features: app category, pricing model, average installs, ratings, revenue indicators, and more  
- Scope: Analysis focuses specifically on apps within the *Education* category

**Key Value for Stakeholders:**  
By analyzing market patterns and monetization strategies, companies can make informed decisions about app development, marketing, and pricing, targeting segments with the highest growth and revenue potential.



---

### Packages & Setup

We’ll use these packages for data cleaning, analysis, and visualization.

In [1]:
# Make sure these packages are installed before running the notebook
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px

### Data Import

In [2]:
import os

def load_dataset(file_path):
    """Load CSV dataset with error handling."""
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"The file {file_path} was not found. Please check the path.")
    try:
        df = pd.read_csv(file_path)
        print(f"Dataset loaded successfully: {df.shape[0]} rows, {df.shape[1]} columns")
        return df
    except Exception as e:
        # Chain the original exception so the full traceback is preserved
        raise RuntimeError(f"Error while loading dataset: {e}") from e


In [3]:
file_path = r"C:\Users\A\Desktop\playstore_app_market_insights\dataset\Google-Playstore.csv"
df = load_dataset(file_path)

Dataset loaded successfully: 2312944 rows, 24 columns


The dataset contains 24 columns of app features that are useful for revenue and user-behavior analysis.
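Before cleaning, it can help to see which columns actually have gaps. The following is a minimal sketch on a toy frame: the column names mirror the dataset, but the values are invented for illustration and the `sample` and `missing` names are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: column names mirror the dataset, values are invented.
sample = pd.DataFrame({
    "App Name": ["Math Quiz", None, "Word Cards"],
    "Category": ["Education", "Education", "Tools"],
    "Rating": [4.5, np.nan, np.nan],
    "Released": ["Jan 5, 2019", None, "Mar 1, 2020"],
})

# Count and share of missing values per column, largest first.
missing = pd.DataFrame({
    "missing": sample.isna().sum(),
    "share": sample.isna().mean().round(2),
}).sort_values("missing", ascending=False)

print(missing)
```

Running the same two expressions on the full `df` would show which columns the cleaning step needs to handle.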

## **Data Cleaning & Transformation**

In [4]:
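def clean_dataset(df):
    """Clean the raw Play Store data: fill gaps, drop unused columns, derive features."""

    # 1. Handle Missing Values
    # .copy() returns an independent DataFrame, which avoids SettingWithCopyWarning on later assignments
    df = df.dropna(subset=['App Name']).copy()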
    df['Rating'] = df.groupby('Category')['Rating'].transform(lambda x: x.fillna(x.median()))
    df['Released_missing'] = df['Released'].isna().astype(int)
    df['Released'] = df['Released'].fillna(df['Last Updated'])
    df['Developer Id'] = df['Developer Id'].fillna("N/A")
    # Note: despite its name, max_inst_miss flags rows where 'Minimum Installs' was missing
    df['max_inst_miss'] = df['Minimum Installs'].isna().astype(int)
    df['Minimum Installs'] = df['Minimum Installs'].fillna(df['Maximum Installs'])
    df['Currency'] = df['Currency'].fillna("N/A")

    # 2. Drop columns not used in this analysis
    df = df.drop([
        'Developer Website', 'Developer Email', 'Privacy Policy', 'Scraped Time', 
        'App Id', 'Installs', 'Rating Count', 'Minimum Android'
    ], axis=1, errors='ignore')

    # 3. Normalize Size (result is in kilobytes)
    df["Size"] = df["Size"].astype(str).str.replace(",", "", regex=False).str.replace(" ", "", regex=False)
    def convert_size(value):
        """Convert a size string such as '10M' or '512k' to kilobytes."""
        try:
            val = str(value).strip()
            if val.lower() in {"varieswithdevice", "na", "n/a", "nan", ""}:
                return np.nan
            if val[-1].lower() == "m":
                return float(val[:-1]) * 1000
            elif val[-1].lower() == "k":
                return float(val[:-1])
            else:
                return float(val)
        except ValueError:
            # Unrecognized formats become missing values
            return np.nan
    df["size"] = df["Size"].apply(convert_size)
    df = df.drop(['Size'], axis=1, errors='ignore')

    # 4. Convert Boolean to Int
    df['Free'] = df['Free'].astype(int)
    df['Ad Supported'] = df['Ad Supported'].astype(int)
    df['In App Purchases'] = df['In App Purchases'].astype(int)
    df['Editors Choice'] = df['Editors Choice'].astype(int)

    # 5. Derived Columns
    df['avg_installs'] = ((df['Minimum Installs'] + df['Maximum Installs']) / 2).round(0)
    df['Released'] = pd.to_datetime(df['Released'], errors='coerce')
    df['released_year'] = df['Released'].dt.year

    # 6. Rename Columns (snake_case)
    df = df.rename(columns={
        "App Name": "app_name",
        "Category": "category",
        "Rating": "rating",
        "Free": "app_status",
        "Currency": "currency",
        "Developer Id": "developer_name",
        "Released": "released_date",
        "Last Updated": "last_update",
        "Content Rating": "content_target",
        "Ad Supported": "ads_flag",
        "In App Purchases": "in_app_purchases_flag",
        "Editors Choice": "play_store_recommend"
    })

    # Ensure consistency between Price and app_status
    df.loc[df['Price'] > 0, 'app_status'] = 0  # Paid
    df.loc[df['Price'] == 0, 'app_status'] = 1  # Free

    # 7. Remove Duplicates
    df = df.drop_duplicates(['app_name'], keep='first')

    print(f"Cleaning complete: {df.shape[0]} rows, {df.shape[1]} columns remain.")
    return df
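The size normalization in step 3 can be tried in isolation. This is a minimal sketch that mirrors the rule used in `clean_dataset` (an `M` suffix is multiplied by 1000, a `k` suffix is kept as is, so the result is in kilobytes). The helper name `size_to_kb` and the sample strings are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def size_to_kb(value):
    """Hypothetical stand-alone version of the size rule: 'M' -> x1000, 'k' -> as is."""
    val = str(value).replace(",", "").replace(" ", "").lower()
    if val in {"varieswithdevice", "na", "n/a", "nan", ""}:
        return np.nan
    try:
        if val.endswith("m"):
            return float(val[:-1]) * 1000
        if val.endswith("k"):
            return float(val[:-1])
        return float(val)
    except ValueError:
        # Anything that is not a plain number with an optional M/k suffix
        return np.nan

# Invented sample values covering each branch
sizes = pd.Series(["10M", "3.7k", "Varies with device", "1,024k"])
converted = sizes.apply(size_to_kb)
print(converted.tolist())
```

Strings that do not match any branch end up as `NaN`, so they are treated as missing sizes rather than raising an error.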


In [5]:
df = clean_dataset(df)

Cleaning complete: 2177943 rows, 20 columns remain.
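The rating imputation in step 1 fills each missing rating with the median of its own category. A small worked example, using invented values on a toy frame (the `ratings` name is an assumption for illustration), shows the effect:

```python
import numpy as np
import pandas as pd

# Toy data: two categories, one missing rating in each.
ratings = pd.DataFrame({
    "Category": ["Education", "Education", "Education", "Tools", "Tools"],
    "Rating": [4.0, 5.0, np.nan, 3.0, np.nan],
})

# transform() returns a result aligned to the original index,
# so each missing rating is filled with its own category's median.
ratings["Rating"] = ratings.groupby("Category")["Rating"].transform(
    lambda s: s.fillna(s.median())
)

print(ratings)
```

Here the missing Education rating becomes 4.5 (the median of 4.0 and 5.0) and the missing Tools rating becomes 3.0. A category with no ratings at all would stay `NaN`, since its median is undefined.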


## **Data Validation**

In [None]:
def validate_dataset(df):
    errors = []

    # Ratings between 0–5
    if not df['rating'].between(0, 5).all():
        errors.append("Invalid ratings detected outside range 0–5.")

    # Released Year check
    if df['released_year'].isna().any():
        errors.append("Missing release years detected.")
    # Check the range only for non-missing years so NaN rows are not reported twice
    known_years = df['released_year'].dropna()
    invalid_years = known_years[~known_years.between(2008, 2025)]
    if not invalid_years.empty:
        errors.append(f"{len(invalid_years)} apps have invalid release years.")

    # Boolean flags check
    for col in ['app_status', 'ads_flag', 'in_app_purchases_flag', 'play_store_recommend']:
        if not df[col].isin([0, 1]).all():
            errors.append(f"Invalid values in {col} (should be 0 or 1).")

    if errors:
        print("Validation Issues Found:")
        for e in errors:
            print("-", e)
    else:
        print("Dataset validation passed.")

In [None]:
validate_dataset(df)

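All app categories conform to the standard taxonomy, and no negative installs were found. Note that `validate_dataset` itself only checks ratings, release years, and the 0/1 flag columns; category names and install counts are not covered by that function.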
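The category and install statements could be backed by explicit checks along these lines. This is a sketch under assumptions: the helper `extra_checks` and the `known_categories` set are hypothetical, while the column names `category` and `avg_installs` are those produced by `clean_dataset`.

```python
import pandas as pd

def extra_checks(df, known_categories):
    """Return a list of issues for install counts and category names."""
    issues = []

    # Install counts should never be negative.
    negative = int((df["avg_installs"] < 0).sum())
    if negative:
        issues.append(f"{negative} apps have negative avg_installs.")

    # Any category outside the expected set is reported by name.
    unknown = sorted(set(df["category"].dropna()) - set(known_categories))
    if unknown:
        issues.append(f"Unexpected categories: {unknown}")

    return issues

# Toy frame with one negative install count and one misspelled category.
toy = pd.DataFrame({
    "category": ["Education", "Tools", "Educaton"],
    "avg_installs": [1500.0, -10.0, 300.0],
})
issues = extra_checks(toy, known_categories={"Education", "Tools"})
print(issues)
```

An empty list would mean both checks passed. The reference set of category names has to come from an external source, since it cannot be derived from the data being checked.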

---

## **Save Dataset**

In [6]:
from pathlib import Path

# 'df' is the cleaned DataFrame returned by clean_dataset

# Define save path
save_path = Path(r"C:\Users\A\Desktop\playstore_app_market_insights\dataset") / "Google-Playstore-transformed.csv"

# Save the DataFrame
df.to_csv(save_path, index=False)  # don't save the index
print(f"Saved transformed dataset to: {save_path}")

Saved transformed dataset to: C:\Users\A\Desktop\playstore_app_market_insights\dataset\Google-Playstore-transformed.csv
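One thing to keep in mind when reloading the saved file: CSV stores datetimes as text, so `released_date` comes back as a string column unless it is parsed again. A minimal round-trip sketch on a toy frame, written to a temporary directory rather than the real path (the file name and values here are invented):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with the same column names as the cleaned dataset.
toy = pd.DataFrame({
    "app_name": ["Math Quiz", "Word Cards"],
    "released_date": pd.to_datetime(["2019-01-05", "2020-03-01"]),
    "released_year": [2019, 2020],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy-transformed.csv"
    toy.to_csv(path, index=False)
    # CSV stores dates as text, so ask pandas to parse them on reload.
    reloaded = pd.read_csv(path, parse_dates=["released_date"])

print(reloaded.dtypes)
```

The same `parse_dates` argument can be passed when loading `Google-Playstore-transformed.csv` in a later notebook.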


*See Part II for the exploratory data analysis (EDA).*