<a href="https://colab.research.google.com/github/Dhairyaxshah/Appfluence/blob/main/notebooks/03_supervised_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised Machine Learning: App Popularity Prediction

This notebook builds supervised machine learning models to predict app popularity
based on cleaned Google Play Store and Apple App Store datasets.

Exploratory Data Analysis (EDA) and data cleaning were performed in a separate notebook.
Insights from EDA guide feature selection and modeling decisions in this notebook.


In [1]:
import pandas as pd
import numpy as np

gp_url = "https://raw.githubusercontent.com/Dhairyaxshah/Appfluence/main/data/google_play_cleaned.csv"
as_url = "https://raw.githubusercontent.com/Dhairyaxshah/Appfluence/main/data/apple_store_cleaned.csv"

df_gp = pd.read_csv(gp_url)
df_as = pd.read_csv(as_url)

df_gp.shape, df_as.shape
# The shapes confirm that both cleaned datasets loaded correctly, so no further cleaning is needed here


((8196, 10), (7195, 10))

## Problem Definition

The goal is to predict app popularity using supervised machine learning.

- For the Google Play Store, popularity is measured by install counts.
- For the Apple App Store, popularity is measured by the total number of user ratings (`rating_count_tot`).
- Because install values are extremely skewed (confirmed via EDA), a classification
  approach is adopted instead of regression.

Apps are grouped into three popularity classes: Low, Medium, and High.


In [2]:
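# Create the install class for Google Play:
# Low = up to 100K installs, Medium = up to 10M, High = above 10M
df_gp['install_class'] = pd.cut(
    df_gp['Installs'],
    bins=[0, 1e5, 1e7, df_gp['Installs'].max()],
    labels=['Low', 'Medium', 'High'],
    include_lowest=True  # keep apps with 0 installs in 'Low' instead of NaN
)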

df_gp['install_class'].value_counts()


| install_class | count |
|---|---|
| Low | 4299 |
| Medium | 3463 |
| High | 434 |
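
As a side note on how `pd.cut` treats bin edges: intervals are right-closed by default, so a value equal to the lowest edge falls outside the first bin unless `include_lowest=True` is passed. The sketch below uses a handful of made-up install counts (not the notebook's data) placed on and around the edges to show both behaviours.

```python
import pandas as pd

# Hypothetical install counts, chosen to sit on and around the bin edges.
installs = pd.Series([0, 1, 100_000, 100_001, 10_000_000, 1_000_000_000])

bins = [0, 1e5, 1e7, installs.max()]
labels = ['Low', 'Medium', 'High']

# Default behaviour: bins are (0, 1e5], (1e5, 1e7], (1e7, max],
# so a count of exactly 0 is not assigned to any class (NaN).
default_classes = pd.cut(installs, bins=bins, labels=labels)

# With include_lowest=True the first bin also contains the value 0.
inclusive_classes = pd.cut(
    installs, bins=bins, labels=labels, include_lowest=True
)

print(pd.DataFrame({
    'installs': installs,
    'default': default_classes,
    'include_lowest': inclusive_classes,
}))
```

Comparing the number of missing values in the class column against the row count is a quick way to check that no app slipped through the bins.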


In [3]:
# For the App Store, use the total rating count as the popularity measure
# and split it into three quantile-based classes (terciles)
df_as['popularity_class'] = pd.qcut(
    df_as['rating_count_tot'],
    q=3,
    labels=['Low', 'Medium', 'High']
)

df_as['popularity_class'].value_counts()


| popularity_class | count |
|---|---|
| Low | 2405 |
| High | 2399 |
| Medium | 2391 |
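
To illustrate why quantile binning gives near-balanced classes on skewed counts, here is a minimal sketch comparing `pd.qcut` with equal-width `pd.cut`. The data are synthetic log-normal values generated for this example, not the App Store ratings, and the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic, heavy-tailed rating counts (log-normal); not the App Store data.
rng = np.random.default_rng(0)
counts = pd.Series(rng.lognormal(mean=6, sigma=2, size=3000).round())

labels = ['Low', 'Medium', 'High']

# Equal-width bins: the long tail stretches the range,
# so almost every row lands in the first bin.
equal_width = pd.cut(counts, bins=3, labels=labels)

# Quantile bins: each class holds roughly a third of the rows.
terciles = pd.qcut(counts, q=3, labels=labels)

print(equal_width.value_counts().sort_index())
print(terciles.value_counts().sort_index())
```

Because rating counts are integers, ties at the cut points can leave the quantile classes slightly unequal in size.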


## Feature Selection

In [4]:
# Feature Selection for Google Play store based on EDA
X_gp = df_gp[
    ['Rating', 'Reviews', 'Size', 'Price',
     'Type', 'Category', 'Content Rating']
]

y_gp = df_gp['install_class']


In [5]:
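# Feature Selection for App Store based on EDA
# Note: rating_count_ver (ratings for the current version) is closely related to
# rating_count_tot, which defines the target, so it may leak label information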
X_as = df_as[
    ['price', 'size_mb', 'user_rating',
     'rating_count_ver', 'prime_genre',
     'cont_rating', 'lang.num']
]

y_as = df_as['popularity_class']


In [6]:
# Train-test split (80/20), stratified so that both splits keep
# the class proportions of install_class
from sklearn.model_selection import train_test_split

X_train_gp, X_test_gp, y_train_gp, y_test_gp = train_test_split(
    X_gp, y_gp,
    test_size=0.2,
    random_state=42,
    stratify=y_gp
)
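
The effect of `stratify` can be checked with a small sketch like the one below. It uses synthetic labels whose imbalance only roughly mimics `install_class` (the probabilities and the single `feature` column are assumptions made for illustration) and compares class proportions across the full data, the training split, and the test split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with an imbalance similar in spirit to install_class;
# the probabilities are illustrative, not taken from the notebook's data.
rng = np.random.default_rng(42)
y = pd.Series(
    rng.choice(['Low', 'Medium', 'High'], size=2000, p=[0.52, 0.43, 0.05])
)
X = pd.DataFrame({'feature': rng.normal(size=len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Class shares should be nearly identical in all three columns.
proportions = pd.DataFrame({
    'full': y.value_counts(normalize=True),
    'train': y_train.value_counts(normalize=True),
    'test': y_test.value_counts(normalize=True),
}).round(3)
print(proportions)
```

Without stratification, a rare class such as High could end up noticeably over- or under-represented in the test split, which makes per-class metrics less reliable.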
