## How Good Is Your Android App?
COSC-247 Machine Learning Project
by Mia Jung

In [1]:
import os
import pandas as pd
import csv 

try:
    data = pd.read_csv('processed app data.csv') # Read playstore csv file
    
except HTTPError:
    s = 'processed app store data.csv'
    print('From local path:', s)
    data = pd.read_csv(s,
                     header=None,
                     encoding='utf-8')

data.head()

Unnamed: 0,App,Rating,Installs,Price,"DSLU (days since dec 1, 202)",Reviews,Size,Type,Current Ver,Android Ver,Category,Genres,Content Rating
0,Photo Editor & Candy Camera & Grid & ScrapBook,4.1,10000,0.0,1789,159,19000000.0,Free,1.0.0,4.0.3 and up,ART_AND_DESIGN,Art & Design,Everyone
1,Coloring book moana,3.9,500000,0.0,1781,967,14000000.0,Free,2.0.0,4.0.3 and up,ART_AND_DESIGN,Art & Design;Pretend Play,Everyone
2,"U Launcher Lite â€“ FREE Live Cool Themes, Hid...",4.7,5000000,0.0,1583,87510,8.7,Free,1.2.4,4.0.3 and up,ART_AND_DESIGN,Art & Design,Everyone
3,Sketch - Draw & Paint,4.5,50000000,0.0,1637,215644,25000000.0,Free,Varies with device,4.2 and up,ART_AND_DESIGN,Art & Design,Teen
4,Pixel Draw - Number Art Coloring Book,4.3,100000,0.0,1625,967,2.8,Free,1.1,4.4 and up,ART_AND_DESIGN,Art & Design;Creativity,Everyone


In [2]:
data.tail()

Unnamed: 0,App,Rating,Installs,Price,"DSLU (days since dec 1, 202)",Reviews,Size,Type,Current Ver,Android Ver,Category,Genres,Content Rating
7707,Chemin (fr),4.8,1000,0.0,3175,44,619000.0,Free,0.8,2.2 and up,BOOKS_AND_REFERENCE,Books & Reference,Everyone
7708,FR Calculator,4.0,500,0.0,1992,7,2.6,Free,1.0.0,4.1 and up,FAMILY,Education,Everyone
7709,Sya9a Maroc - FR,4.5,5000,0.0,1955,38,53000000.0,Free,1.48,4.1 and up,FAMILY,Education,Everyone
7710,Fr. Mike Schmitz Audio Teachings,5.0,100,0.0,1609,4,3.6,Free,1,4.1 and up,FAMILY,Education,Everyone
7711,iHoroscope - 2018 Daily Horoscope & Astrology,4.5,10000000,0.0,1590,398307,19000000.0,Free,Varies with device,Varies with device,LIFESTYLE,Lifestyle,Everyone


In [3]:
data.shape # produces (numRow, numCol)

(7712, 13)

In [4]:
import numpy as np

# Extract response variable 1 (user rating score) from the data- but remember the numbers are in Strings.
y1Str = data.iloc[0:7712,1].values

# convert response variable 1 from String to float.
y1 = [float(numString) for numString in y1Str]

# Extract the response variable 2 (num installs) from the data- but remember the numbers are in Strings.
y2Str = data.iloc[0:7712,2].values

# convert response variable 2 from String to float.
y2 = [float(numString) for numString in y2Str]

# predictor variables: price, dslu (days since last update), number of reviews and app size.
X = data.iloc[0:, [3,4,5,6]].values

In [5]:
### splitting data into trainng and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y1_train, y1_test = train_test_split(X, y1, test_size=0.3, random_state=1)

X_train, X_test, y2_train, y2_test = train_test_split(X, y2, test_size=0.3, random_state=1)

In [6]:
### Training a random forest regression model for target variable 2
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=100, criterion='squared_error', random_state=1, n_jobs=2)
forest.fit(X_train, y1_train) # fit a random forest regression model using training data

# Testing the model on the training and testing sets.
y1_train_pred = forest.predict(X_train)
y1_test_pred = forest.predict(X_test)

# Reporting how well the model fits the training and testing sets by reporting the R^2 values.

from sklearn.metrics import r2_score 

r2_train = r2_score(y1_train, y1_train_pred) 
r2_test =r2_score(y1_test, y1_test_pred)

print(f'R^2 train: {r2_train:.2f}') 
print(f'R^2 test: {r2_test:.2f}') 

R^2 train: 0.86
R^2 test: -0.03


Although the R^2 coefficient value is pretty good for the training data, I keep getting negative R^2 values for the testing data, even though I tried different combinations of features. There must be bad overfitting happening. What could I do to improve the performance?

In [7]:
num_col = data.select_dtypes(include=np.number).columns.tolist()
data[num_col].corr()["Rating"] # checking the correlation between Rating and other variables.

Rating                          1.000000
Installs                        0.052600
Price                           0.017559
DSLU (days since dec 1, 202)   -0.132764
Reviews                         0.079750
Size                            0.081479
Name: Rating, dtype: float64

In [8]:
data[num_col].corr()["Installs"]# checking the correlation between total number of installations and other variables.

Rating                          0.052600
Installs                        1.000000
Price                          -0.029600
DSLU (days since dec 1, 202)   -0.085429
Reviews                         0.626173
Size                            0.162948
Name: Installs, dtype: float64

It seems like many of the features (price, dslu, number of reviews and app size) I expected would have a correlation with the target variables (rating score and number of installations) are not very correlated to my target variables as I had thought. 

I see a very slight correlation between the target variable Rating and the following features: total number of installations, price, number of reviews, and app size. I didn't think of including the number of installations, a target variable I'm interested in, as one of the features for the target variable Rating. It seems that the DSLU is not quite correlated to Rating, so I might want to take that out when developing my model for the target variable Rating.

I see a surpirsingly strong correlation between Installs and number of Reviews and app Size. I also see a slight correlation between Installs and Rating. However, it seems that DSLU and price are not quite correlated to this target variable. I might want to take those features out of my model. 

I will try making a random forest regression model with the new combination of features.

In [9]:
# predictor variables for target variable 1 (Ratings) = Installs, reviews, app size (columns 2, 5, and 6)
X1 = data.iloc[0:, [2,5,6]].values

# predictor variables for target variable 2 (Installs) = Ratings, reviews, and size (columns 1, 5, and 6)
X2 = data.iloc[0:, [1,5,6]].values

In [10]:
### splitting data into trainng and testing sets.
from sklearn.model_selection import train_test_split
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3, random_state=1)

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, random_state=1)

In [11]:
# Training a random forest regression model for response variable 1
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=1000, criterion='squared_error', random_state=1, n_jobs=5)
forest.fit(X1_train, y1_train) # fit a random forest regression model using training data

# Testing the model on the training and testing sets.
y1_train_pred = forest.predict(X1_train)
y1_test_pred = forest.predict(X1_test)

# Reporting how well the model fits the training and testing sets by reporting the R^2 values.

from sklearn.metrics import r2_score 

r2_train = r2_score(y1_train, y1_train_pred) 
r2_test =r2_score(y1_test, y1_test_pred)

print(f'R^2 train: {r2_train:.2f}') 
print(f'R^2 test: {r2_test:.2f}')

R^2 train: 0.85
R^2 test: 0.02


For response variable 1, Rating, the R^2 train decreased a little bit, but our R^2 test is no longer negative, at least!

In [12]:
### Training a random forest regression model for response variable 2
from sklearn.ensemble import RandomForestRegressor

forest2 = RandomForestRegressor(n_estimators=100, criterion='squared_error', random_state=1, n_jobs=5)
forest2.fit(X2_train, y2_train) # fit a random forest regression model using training data
print(X2_train)

# Testing the model on the training and testing sets.
y2_train_pred = forest2.predict(X2_train)
print(y2_train_pred)
y2_test_pred = forest2.predict(X2_test)
print(y2_test_pred)

# Reporting how well the model fits the training and testing sets by reporting the R^2 values.

from sklearn.metrics import r2_score 

r2_train = r2_score(y2_train, y2_train_pred) 
r2_test =r2_score(y2_test, y2_test_pred)
print()
print(f'R^2 train: {r2_train:.2f}') 
print(f'R^2 test: {r2_test:.2f}') 

[[4.60000e+00 6.90000e+01 3.10000e+00]
 [4.30000e+00 2.36000e+02 7.00000e+00]
 [4.40000e+00 7.47440e+04 6.00000e+07]
 ...
 [4.50000e+00 8.70928e+05 1.60000e+07]
 [4.10000e+00 1.20880e+04 2.10000e+07]
 [5.00000e+00 8.00000e+00 1.40000e+07]]
[2.495e+03 3.970e+04 1.616e+06 ... 9.510e+07 1.590e+06 7.495e+01]
[1.296e+05 3.609e+02 4.499e+04 ... 8.900e+07 5.475e+04 4.571e+05]

R^2 train: 0.98
R^2 test: 0.95


This is much, much better! The R^2 values are in the high 90's for both the training and testing data. This model works very well! 

So far, I have created a random forest regressor that predicts the user rating score of a given app after taking 3 features of the app as input- Installs, reviews, app size. The regressor works pretty well for the training data, but not well for the testing data.

Also, I created a wonderful regression model that predicts the number of installations of an app after taking these three features of an app: user ratings, number of revies, and app size.

However, I was skeptical of how my R^2 values were so high for the number of installations. And after taking a closer look at the data again, I realized that number of installations isn't a quanitative variable as I had assumed. It is categorical! 

So, something that definitely remains to be done is creating a random forest classifer for target variable 2 (number of Installations) rather than a regressor. Another thing that remains is that I would like to create something useful other than the random tree regressor for targe variable 1, since  the R^2 value for testing data is still very low. Perhaps I could make user ratings into a categorical variable as well- such as making a binary classifier that tells me whether my app will have above average (greater than 4.17 ) ratings or below average ratings, given certain features.

In [13]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a Classifier
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training sets y_pred=clf.predict(X_test)
clf.fit(X2_train,y2_train)

y2_pred=clf.predict(X2_test)

#Import scikit-learn metrics module to get performance metrics for our random forest classifier.
from sklearn import metrics

# Accuracy: how often the classifier is correct.
print("Accuracy:",metrics.accuracy_score(y2_test, y2_pred))

# Precision: the fraction of the predicted positive instances which are true positive. 
print("Precision:",metrics.precision_score(y2_test, y2_pred,average='weighted'))

#Recall: sensitivity, aka the the fraction of positives events predicted correctly.
print("Recall(Sensitivity):",metrics.recall_score(y2_test, y2_pred,average='weighted'))

# F1 Score: the harmonic mean of precision and recall.
print("F1 Score:",metrics.f1_score(y2_test, y2_pred,average='weighted'))

Accuracy: 0.5496974935177182
Precision: 0.5373793552727121
Recall(Sensitivity): 0.5496974935177182
F1 Score: 0.5397299599737548


Here is my attempt at creating a random tree classifier for target variable 2 (number of installations), which, as stated before, turned out to be a categorical variable. 

The performance metrics are all slightly above 0.50. Although an accuracy of 50 percent is not ideal, this classifier would tell me the correct prediction (about which bin/category of installations an app would belong to) more than 50 percent of the time, which I think isn't too bad given that this is real world data.

In [20]:
# Apps that have above-average ratings will be classified as 1. Otherwise, they will be classified as 0.
y1binary=[]
for rating in y1:
    if (rating>=4.17):
        y1binary.append(1)
    else:
        y1binary.append(0)
        
### splitting data into trainng and testing sets.
from sklearn.model_selection import train_test_split
X1binary_train, X1binary_test, y1binary_train, y1binary_test =\
        train_test_split(X1, y1binary, test_size=0.3, random_state=1) # the same features will be used to predict

In [22]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a Classifier
binary_clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training sets y_pred=clf.predict(X_test)
binary_clf.fit(X1binary_train,y1binary_train)

y1binary_pred=clf.predict(X1binary_test)

#Import scikit-learn metrics module to get performance metrics for our random forest classifier.
from sklearn import metrics

# Accuracy: how often the classifier is correct.
print("Accuracy:",metrics.balanced_accuracy_score(y1binary_test, y1binary_pred))

# Precision: the fraction of the predicted positive instances which are true positive. 
print("Precision:",metrics.precision_score(y1binary_test, y1binary_pred,average='weighted'))

#Recall: sensitivity, aka the the fraction of positives events predicted correctly.
print("Recall(Sensitivity):",metrics.recall_score(y1binary_test, y1binary_pred,average='weighted'))

# F1 Score: the harmonic mean of precision and recall.
print("F1 Score:",metrics.f1_score(y1binary_test, y1binary_pred,average='weighted'))

Accuracy: 0.0
Precision: 0.0
Recall(Sensitivity): 0.0
F1 Score: 0.0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


This is a model that tells me (with 66 percent accuracy) whether or not the user rating score of an app would be above average (4.17), given the 3 features: total number of installations, number of reviews, and app size. The performance scores for this model are better than my random forest classifier that has number of installations as the target variable!