# 1. Introduction

With technology dominating the modern world, the app industry is thriving. New apps reach the market all the time. App usage overall is still growing at a steady rate, but some apps grow faster than others, and demand varies from app to app depending on several features. Data science can help guide app-making businesses and app developers in the right direction.

# 2. Dataset Description

Google Play Store is a large digital distribution service that provides apps for Android-certified devices and Chrome OS. We found a dataset on Kaggle that contains data on roughly 10k Play Store apps for analyzing the Android market. The dataset has 13 columns, which are shown in the following subsection, and 10841 rows. We aim to use these apps’ statistics to predict which apps are more likely to be installed or to receive a high rating.

# 2.1. Sneak Peek at the Dataset

The dataset is available on Kaggle: https://www.kaggle.com/lava18/google-play-store-apps

In [1]:
# Import the libraries
import numpy as np
import pandas as pd

# Read the dataset and preview the first 10 rows
df = pd.read_csv('googleplaystore.csv')
df.head(10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up


In [2]:
df.shape

(10841, 13)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


# 2.2. Features Explanation

1. App: the name of the app.
2. Category: the category the app belongs to.
3. Rating: the app's rating in Google Play Store on a 5-point numerical scale.
4. Reviews: the number of reviews given to the app.
5. Size: the size of the app in megabytes.
6. Installs: the number of times the app has been installed, given as a lower bound (for example, 10,000+).
7. Type: whether the app is free or paid.
8. Price: the price of the app in dollars. If the app is free, the value is 0.
9. Content Rating: the audience the content is rated for, either everyone or a specific group (for example, Teen).
10. Genres: the genre of the app (it appears to be largely the same as the Category feature, sometimes with an added sub-genre).
11. Last Updated: the date of the app's last update.
12. Current Ver: the current version number of the app.
13. Android Ver: the minimum Android version that the app supports.
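
Because `df.info()` reports most of these columns as `object`, the numeric features in the list above would need converting before any modeling. The following is a minimal sketch of one way to do that, run on a few values copied from the preview rows rather than on the full file. `clean_numeric_columns` is a hypothetical helper name, and two details are assumptions: that paid prices may carry a `$` prefix, and that only the `M` suffix needs handling in Size (any other value becomes NaN).

```python
import pandas as pd

# A few values copied from the df.head(10) preview; the same steps
# would be applied to the full DataFrame in the notebook.
sample = pd.DataFrame({
    "Reviews": ["159", "967", "87510"],
    "Size": ["19M", "14M", "8.7M"],
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
    "Price": ["0", "0", "0"],
})


def clean_numeric_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Reviews, Installs, Price and Size as numbers.

    Hypothetical helper, not part of the original notebook.
    """
    out = frame.copy()
    out["Reviews"] = pd.to_numeric(out["Reviews"], errors="coerce")
    # Drop the thousands separators and the trailing "+" (e.g. "10,000+").
    out["Installs"] = pd.to_numeric(
        out["Installs"].str.replace(r"[,+]", "", regex=True), errors="coerce"
    )
    # ASSUMPTION: paid apps may have a leading "$"; the preview only shows "0".
    out["Price"] = pd.to_numeric(
        out["Price"].str.replace("$", "", regex=False), errors="coerce"
    )
    # ASSUMPTION: only the "M" suffix is handled; anything else becomes NaN.
    out["Size"] = pd.to_numeric(
        out["Size"].str.replace("M", "", regex=False), errors="coerce"
    )
    return out


cleaned = clean_numeric_columns(sample)
print(cleaned.dtypes)
```

Using `errors="coerce"` turns unparseable entries into NaN instead of raising, so they can be inspected and handled in a later cleaning step.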

# 3. Purpose of the Project

The goal of this project is to classify apps based on their Rating, taking into consideration other features such as Category, Reviews, Installs, Type, and Price. We also aim to run many experiments with different dataset splits and different classifier parameter values to discover their effect on the accuracy scores.

# 4. Feature Engineering and Data Cleaning

We plan to perform basic cleaning of the dataset and analyze some features. We will drop columns that turn out to be irrelevant to our analysis.

We also plan to add a new column called app_demand with 4 categories (on_demand, moderate_demand, low_demand, no_demand) that represent the demand for an app based on its Rating. The first category will cover Ratings between the above-average and maximum values. The second will cover Ratings between the mean and above-average values. The third will cover Ratings between the below-average and mean values. The fourth will cover Ratings between the minimum and below-average values.

In principle, we plan to use the features Category, Reviews, Installs, Type, and Price as input variables, and the new app_demand feature will be the output of our classification models.
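
The four-way labeling described in this section could be sketched with `pandas.cut`. The text does not say where the "above Average" and "below Average" cut points fall, so this sketch assumes they are the midpoints between the mean and the maximum, and between the minimum and the mean. The helper name `label_app_demand` is hypothetical. The first five ratings come from the preview rows, and the last three are made-up values added only so that every category appears.

```python
import pandas as pd

# First five values are from the preview rows; the last three are
# illustrative only, added so that all four categories are exercised.
ratings = pd.Series([4.1, 3.9, 4.7, 4.5, 4.3, 1.0, 2.5, 5.0], name="Rating")


def label_app_demand(rating: pd.Series) -> pd.Series:
    """Bin ratings into the four app_demand categories (hypothetical helper)."""
    r_min, r_mean, r_max = rating.min(), rating.mean(), rating.max()
    # ASSUMPTION: the "below Average" and "above Average" thresholds are
    # midpoints; the proposal does not define them.
    below_avg = (r_min + r_mean) / 2
    above_avg = (r_mean + r_max) / 2
    bins = [r_min, below_avg, r_mean, above_avg, r_max]
    labels = ["no_demand", "low_demand", "moderate_demand", "on_demand"]
    # include_lowest=True keeps the minimum rating inside the first bin.
    return pd.cut(rating, bins=bins, labels=labels, include_lowest=True)


demand = label_app_demand(ratings)
print(pd.concat([ratings, demand.rename("app_demand")], axis=1))
```

Rows with a missing Rating would come out as NaN here, so they would need to be dropped or imputed before labeling.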

# 5. Algorithms

We plan to develop Support Vector Machine and K-Nearest Neighbor models to classify the apps into categories based on their Ratings. The classification approach compares each app's Rating with the maximum, average, and minimum Rating values and assigns the app the appropriate category.
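
As a rough illustration of how the two classifiers might be set up in scikit-learn, the sketch below trains both on synthetic data. The data from `make_classification` is only a stand-in for the cleaned input features and the four demand labels. The scaling step, the RBF kernel, `C=1.0`, and `n_neighbors=5` are assumed starting points, not choices made in this proposal. On the real data, categorical inputs such as Category and Type would also need encoding first.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 5 numeric features and 4 classes, mirroring the
# five planned inputs and the four app_demand categories.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Both models are distance-based, so features are standardized first.
models = {
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test split
    print(f"{name}: accuracy = {scores[name]:.3f}")
```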

# 6. Tools

- Anaconda Navigator 2.1.1 / Jupyter Notebook 6.4.5 for implementing both algorithms and creating the models.
- A set of libraries for modeling and visualization. In principle, we are going to import these libraries: pandas, NumPy, Matplotlib, Seaborn, and scikit-learn. We might need other libraries during the implementation.

# 7. Experiments and Evaluation

We will run many experiments with both classifiers to find the best and the worst results. For each experiment, we will change the parameter values and the dataset split, then record the results and compare them.
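
One possible shape for such an experiment loop is sketched below, again on synthetic stand-in data. The specific split sizes, the `k` values for K-Nearest Neighbor, and the `C` values for the Support Vector Machine are illustrative assumptions, not the values the project will necessarily use.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned features and app_demand labels.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# ASSUMPTION: illustrative grids for the splits and parameters.
test_sizes = [0.2, 0.3, 0.4]
k_values = [3, 5, 9]
c_values = [0.1, 1.0, 10.0]

results = []
for test_size in test_sizes:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0
    )
    # Build one candidate model per parameter value for each classifier.
    candidates = [
        ("knn", k, make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k)))
        for k in k_values
    ] + [
        ("svm", c, make_pipeline(StandardScaler(), SVC(C=c)))
        for c in c_values
    ]
    for name, param, model in candidates:
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        results.append(
            {"model": name, "test_size": test_size, "param": param, "accuracy": acc}
        )

# Collect every run in one table so the best and worst can be compared.
results_df = pd.DataFrame(results)
best = results_df.loc[results_df["accuracy"].idxmax()]
worst = results_df.loc[results_df["accuracy"].idxmin()]
print(results_df.sort_values("accuracy", ascending=False))
```

Recording each run as a row makes it easy to group the results by model or by split when comparing accuracy scores.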