# EXAMINING DETERMINANTS OF APP POPULARITY/RATING

## BACKGROUND

Both the Apple App Store and Google Play host millions of apps. How, then, do you ensure that
your app succeeds in such a crowded market? It is vital to examine the
factors that determine app popularity or rating.

## STEP 1: IMPORT THE DATASETS

In [None]:
#Import pandas and numpy libraries
import pandas as pd
import numpy as np

#read the data and store in variables
data = pd.read_csv("appleStore_data.csv", encoding='utf8')
desc = pd.read_csv("appleStore_description.csv", encoding='utf8')

data.head()
desc.head()

## STEP 2: PROCESS THE DATA

Assumption: we can infer information about an app from its description.
The longer the description, the more detailed the information.

In [None]:
#compute length of app description (vectorized; missing descriptions become NaN)
desc["desc_len"] = desc['app_desc'].str.len()
desc.head()

In [None]:
#append the new desc_len column to the main dataframe (i.e., data)
data["desc_len"] = desc["desc_len"]
data.head()
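
Assigning one dataframe's column to another relies on both tables sharing the same row order and index. If that cannot be guaranteed, matching on a key is safer. The sketch below uses toy data and assumes, hypothetically, that the description table carries the same `id` key as the main table; the column values are made up for illustration.

```python
import pandas as pd

# Toy frames; the shared "id" key in the description table is an assumption.
apps = pd.DataFrame({"id": [10, 20, 30], "track_name": ["A", "B", "C"]})
descs = pd.DataFrame({"id": [30, 10, 20],
                      "app_desc": ["three", "a", "to"]})

# Length of each description, computed on the description table.
descs["desc_len"] = descs["app_desc"].str.len()

# A left merge keeps every app and matches lengths by id, not by row position.
merged = apps.merge(descs[["id", "desc_len"]], on="id", how="left")
print(merged)
```

With a left merge, apps that have no description simply receive a missing `desc_len` instead of silently picking up a neighbour's value.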

In [None]:
#Examine the categorical columns in the data
data['cont_rating'].value_counts() #gets frequency distribution

In [None]:
#Examine the categorical columns in the data
data['prime_genre'].value_counts()

In [None]:
#shape shows number of rows and columns in the dataframe
data.shape

In [None]:
#get some summary statistics
data.describe()

In [None]:
#remove unrated apps from the data
#select only the rows for which user_rating_ver column values are not 0
data = data[data.user_rating_ver !=0]

In [None]:
#get summary statistics again
data.describe()

In [None]:
#get the shape of the data again
data.shape

In [None]:
data['cont_rating'].value_counts()

In [None]:
data['prime_genre'].value_counts()

In [None]:
#for prime_genre, combine categories with fewer values into one category
#decision: combine all those with fewer than 100 into one category

In [None]:
#list the selected categories to combine
#combined category is "Other"
to_rep = ['Lifestyle','Sports','Shopping','Weather','Travel',
          'Book','Reference','Finance','News','Business',
          'Food & Drink','Navigation','Medical','Catalogs']

#replace those categories with "Other" (matches whole genre names)
data['prime_genre'] = data['prime_genre'].replace(to_rep, 'Other')


In [None]:
#get frequencies again
data['prime_genre'].value_counts()
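
The same grouping can be derived from the frequency table instead of a hand-typed list. This is a minimal sketch on toy data; the helper name `collapse_rare` and the example counts are hypothetical, and the threshold is a parameter rather than the cut-off of 100 used above.

```python
import pandas as pd

def collapse_rare(series, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with the combined label.
    return series.where(~series.isin(rare), other_label)

# Toy genre column: two genres are frequent, two appear only once.
genres = pd.Series(["Games"] * 5 + ["Education"] * 3 + ["Weather", "Medical"])
collapsed = collapse_rare(genres, min_count=3)
print(collapsed.value_counts())
```

Deriving the list from the counts keeps the grouping in step with the data if rows are filtered differently later on.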

In [None]:
#convert app size from bytes to megabytes
#dividing the column is vectorized: it is applied to every value of the
#column at once
mb = 1048576 #bytes per megabyte in the binary system (1024 * 1024)
data['size_mb'] = data['size_bytes'] / mb

In [None]:
data.head()

In [None]:

#evaluate value in price: if 0, label the app "Free", else "Paid"
#np.where applies the rule to the whole column at once
data['pricing'] = np.where(data['price'] == 0, "Free", "Paid")

In [None]:
data.head()

In [None]:
#get frequencies
data['pricing'].value_counts()

In [None]:
#discretize rating:
data["user_rating"].value_counts()

In [None]:
#if rating >= 4, Highly_rated = "Yes", otherwise "No"
Highly_rated = np.where(data["user_rating"] >= 4, "Yes", "No")

#preview the first few labels instead of printing every value
print(Highly_rated[:10])
    

In [None]:
#append to dataframe
data['Highly_rated'] = Highly_rated
data.head()

In [None]:
pd.crosstab(data['prime_genre'],data['Highly_rated'])

In [None]:
pd.crosstab(data['cont_rating'],data['Highly_rated'])

In [None]:
pd.crosstab(data['ipadSc_urls.num'],data['Highly_rated'])

In [None]:
#specify columns to drop and drop them
to_drop = ['Unnamed: 0','id','track_name','size_bytes','currency','price','ver','vpp_lic']

#drop the specified columns and assign the result to a new dataframe
data1 = data.drop(to_drop, axis=1)
data1.head()

In [None]:
#convert categorical variables into numerical dummy variables
#dummy variable assigns 1 for presence of category and 0 otherwise

In [None]:

#dtype=int gives 0/1 columns (recent pandas versions default to True/False)
data3 = pd.get_dummies(data1, drop_first=True, dtype=int)

In [None]:
data3.head()

In [None]:
data3.tail()
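
To see what `drop_first=True` does, here is a small sketch on a toy frame. The rows are made up; only the column names mirror the dataset. The first category of each variable (in sorted order) is dropped and becomes the reference category against which the remaining dummies are interpreted.

```python
import pandas as pd

# Toy data with two categorical columns (values are illustrative only).
toy = pd.DataFrame({"pricing": ["Free", "Paid", "Free"],
                    "cont_rating": ["4+", "12+", "9+"]})

# drop_first=True removes the first level of each variable:
# "Free" for pricing and "12+" for cont_rating (string sort order).
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies)
```

Note that the levels are sorted as strings, so "12+" comes before "4+" and ends up as the reference level in this toy example.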

In [None]:
#transform numerical data using log transformation
#adjusts values into a comparable scale

In [None]:
#get numerical columns
num_cols=['rating_count_tot','rating_count_ver','user_rating',
          'user_rating_ver','sup_devices.num','ipadSc_urls.num',
          'lang.num','size_mb','desc_len']

#create a dataframe of numerical columns
df_numeric =data3[num_cols]
df_numeric.head()

In [None]:
#apply log(1 + x) transformation to numeric columns (defined for zero values)
df_num_log = np.log1p(df_numeric)
df_num_log.head()

In [None]:
#replace dataframe columns value with transformed values for numeric columns
data3[num_cols] = df_num_log
data3.head()
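
The transformation used here is `log1p`, i.e. log(1 + x), whose exact inverse is `expm1`. The short sketch below, on made-up values, shows the round trip and why a plain `exp` is off by one when values are converted back to the original scale.

```python
import numpy as np

raw = np.array([0.0, 1.0, 4.5, 1000.0])

logged = np.log1p(raw)       # log(1 + x); well defined at x = 0
restored = np.expm1(logged)  # exp(y) - 1, the exact inverse of log1p
wrong = np.exp(logged)       # equals 1 + x, so every value is off by one

print(restored)
print(wrong)
```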

In [None]:
#Compute correlations for continuous variables
correl = df_num_log.corr()
#print(correl)

#import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

#figsize is given in inches
fig, ax = plt.subplots(figsize=(5,5))
sns.heatmap(correl, annot=True, linewidths=.5, ax=ax, cmap='coolwarm', fmt='.2g', cbar=False, square=True)

Based on the correlation coefficients, rating_count_tot and rating_count_ver are highly
correlated (0.66), implying multicollinearity. Similarly, user_rating and
user_rating_ver have a high correlation coefficient of 0.72. The variables in each pair should not
be used together in the same model.
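
Reading pairs off a heatmap gets harder as the number of columns grows. As a rough sketch, the helper below (a hypothetical name, `high_corr_pairs`, run on synthetic data) lists every pair of columns whose absolute correlation exceeds a chosen threshold; the threshold of 0.6 is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.6):
    """Return (col_a, col_b, |r|) for each column pair with |r| > threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = [(cols[i], cols[j], float(corr.iloc[i, j]))
             for i in range(len(cols))
             for j in range(i + 1, len(cols))   # upper triangle only
             if corr.iloc[i, j] > threshold]
    # Strongest correlations first.
    return sorted(pairs, key=lambda p: -p[2])

# Synthetic data: b is a noisy copy of a, c is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a,
                    "b": a + rng.normal(scale=0.1, size=200),
                    "c": rng.normal(size=200)})
pairs = high_corr_pairs(toy)
print(pairs)
```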

## STEP 3: RUN LOGISTIC EXPLANATORY REGRESSION MODEL

Estimate a model to examine the factors that influence whether an app is highly rated.
Exclude all user rating variables (assumption: we want to use only intrinsic features of the app).

In [None]:
data3.head()

In [None]:
#Estimate explanatory model (models are built to EXPLAIN and to PREDICT)
import statsmodels.api as sm  #geared toward explanation rather than prediction

#Regression equation (log-odds scale): logit(P(Y=1)) = a0 + sum(ai*Xi)

#specify X and Y variables
#drop the rating variables and the target itself from the predictors
#(prime_genre is already dummy-coded in data3, so it is not listed here)
drop_cols=['sup_devices.num','rating_count_tot','rating_count_ver',
           'user_rating_ver', 'user_rating', 'Highly_rated_Yes']


X = data3.drop(drop_cols,axis=1)  #use the entire dataset for explanation

y = data3['Highly_rated_Yes']


print(X.head())
print(y.head())

In [None]:
#statsmodels does not add an intercept on its own, so add the constant explicitly
logit_model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
result = logit_model.fit()
print(result.summary2())

In [None]:
# odds ratios and 95% CI
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print(np.exp(conf))
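
Exponentiating a logit coefficient turns it into an odds ratio. As a worked illustration with a made-up coefficient of 0.5 and a made-up interval (these are not results from this model), the sketch below shows the arithmetic and the usual "percent change in odds" reading.

```python
import numpy as np

def odds_ratio_summary(coef, ci_low, ci_high):
    """Convert a logit coefficient and its confidence bounds to odds ratios."""
    return {"OR": float(np.exp(coef)),
            "2.5%": float(np.exp(ci_low)),
            "97.5%": float(np.exp(ci_high))}

# Hypothetical coefficient and interval, for illustration only.
out = odds_ratio_summary(0.5, 0.1, 0.9)

# An odds ratio of about 1.65 means the odds rise by roughly 65% per unit of X.
pct_change = (out["OR"] - 1) * 100
print(out, round(pct_change, 1))
```

An interval whose odds-ratio bounds both lie above 1 (or both below 1) corresponds to a coefficient that is significant at the 5% level.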

In [None]:
data3.shape

### EXPLANATORY MODEL RESULTS

#### Overall Model Fit: F-statistic and adjusted R-squared

F-statistic = 53.18 (p < 0.001), which means that the model is statistically significant and useful
in explaining the variation in app rating. Note that the F-statistic and adjusted R-squared are
fit statistics of a linear (OLS) regression of app rating.

However, the coefficient of determination (adjusted R-squared) is only 0.147, which implies that
14.7% of the variation in app rating is explained by the factors in the model. This is a low
value but is typical of cross-sectional data.

#### Individual Coefficients: significance (significant or not) and nature of relationship (positive or negative)

Continuous variables:
ipadSc_urls.num (b=0.0181; p < 0.001) and lang.num (b=0.0107; p < 0.001) are statistically significant and
have a positive effect on app rating. For example, a unit increase in the number of pictures
of the app (on the transformed scale used here) increases app rating by 0.018, and a unit increase
in the number of supported languages increases app rating by 0.011.

Categorical variables (interpreted relative to the reference category):
prime_genre_Games (b=0.0363; p < 0.001) is statistically significant and has a positive effect on
app rating. It implies that, relative to Education, an app in the Games genre has a rating that is
higher by 0.036.


#### Managerial implications

To increase the overall rating of their apps, developers should focus on providing a clear pictorial
representation of their app features and ensure that their app can be used across multiple
languages.

## STEP 4: RUN PREDICTION MODEL

In [None]:
#IMPORT ALL THE NECESSARY LIBRARIES
from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt

#split the data
#Split the processed data into 60% training set and 40% test set
df_train,df_test=train_test_split(data3,test_size=0.4, random_state=0)
print(df_train.shape)
print(df_test.shape)

In [None]:
#first install xgboost
#!pip install xgboost 
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score, log_loss


In [None]:
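#specify X and Y variables
#drop the rating variables and the target itself so the label does not leak into the features
drop_cols1=['rating_count_ver', 'user_rating_ver', 'user_rating', 'Highly_rated_Yes']


X_train = df_train.drop(drop_cols1, axis=1)

y_train = df_train['Highly_rated_Yes']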

X_train.shape
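
When features and target are separated by hand, it is easy to leave the target (or a variable derived from it) among the predictors, which inflates every score. A small helper can make the separation explicit. This is a sketch on a toy frame; the helper name `split_features_target` is hypothetical and the rows are made up.

```python
import pandas as pd

def split_features_target(df, target, also_drop=()):
    """Return (X, y) with the target and any related columns removed from X."""
    X = df.drop(columns=[target, *also_drop])
    y = df[target]
    return X, y

# Toy frame: user_rating determines Highly_rated_Yes, so both must leave X.
toy = pd.DataFrame({"lang.num": [1, 5, 12],
                    "user_rating": [3.0, 4.5, 4.0],
                    "Highly_rated_Yes": [0, 1, 1]})

X, y = split_features_target(toy, "Highly_rated_Yes", also_drop=["user_rating"])
print(list(X.columns))
```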

### Model Pipeline

In [None]:
#Names of the various classification approaches for easy presentation of the results
names = ["Logistic Regression", "Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost", "XGBoost","Naive Bayes", "QDA"]

scores = [] #list variable to hold classification scores

classifiers = [
    LogisticRegression(),
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=0.001, solver='lbfgs', learning_rate='adaptive', max_iter=1000),
    AdaBoostClassifier(),
    XGBClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

#test-set features and class labels used to score the classifiers
X_test_clf = df_test.drop(drop_cols1, axis=1)
y_test_clf = df_test['Highly_rated_Yes']

for classifier in classifiers:
    pipeline = Pipeline(steps=[
                      ('classifier', classifier)])
    pipeline.fit(X_train, y_train)
    print(classifier)
    score = pipeline.score(X_test_clf, y_test_clf)  #accuracy on the test set
    scores.append(score)
    print("model score: %.3f" % score)
    print("\n -----------------------------------------------------------------------------------")

#end of pipeline

#Create a dataframe for prediction scores
scores_df = pd.DataFrame(list(zip(names,scores)), columns=['Classifier', 'Accuracy Score'])
scores_df
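
Accuracy scores are easier to judge against a naive baseline: the accuracy obtained by always predicting the most frequent class. The sketch below computes that baseline on made-up labels; the helper name `majority_baseline` is hypothetical.

```python
import pandas as pd

def majority_baseline(y):
    """Accuracy of always predicting the most frequent class in y."""
    return float(pd.Series(y).value_counts(normalize=True).max())

# Toy labels: three of four apps are highly rated.
labels = [1, 1, 1, 0]
baseline = majority_baseline(labels)
print(baseline)
```

A classifier is only adding information if its test accuracy clearly exceeds this figure for the test labels.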

### OLS Model

In [None]:
#specify X and Y variables (training and test datasets)
#the regression target is user_rating (log-transformed earlier), not the Highly_rated_Yes class label
y_train = df_train['user_rating']

X_test = df_test.drop(drop_cols1, axis=1)

y_test = df_test['user_rating']

lr = LinearRegression()  #create linear regression (lr) object

#fit using training set
#fit the data using the fit() method of lr object (gives coefficients
#required for predictions)
lr.fit(X_train,y_train) 

pred_lr = lr.predict(X_test)  #trying to predict y_test

#print the RMSE (assesses the predictive performance of the model)
from sklearn.metrics import mean_squared_error
print(np.sqrt(mean_squared_error(y_test,pred_lr)))
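
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as the target. A minimal NumPy version, with a small worked example on made-up numbers, is sketched below; the function name `rmse` is introduced here for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are 0, 0 and 2, so the MSE is 4/3 and the RMSE is its square root.
example = rmse([1, 2, 3], [1, 2, 5])
print(example)
```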

### Lasso Regression

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

#Lasso Regression uses regularization parameter (alpha) to improve model performance
#Lasso regression: OLS + alpha*summation(abs(coeffs))


lasso = Lasso()  #create lasso object

parameters = {'alpha':[1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 0.1, 1,2,5,10,20]}
lasso_regressor = GridSearchCV(lasso,parameters,scoring='neg_mean_squared_error', cv=5)

lasso_regressor.fit(X_train,y_train) #fit model using training set

print(lasso_regressor.best_params_)
#print(lasso_regressor.best_score_)

In [None]:
#make prediction using optimal alpha for lasso model (taken from the grid search)
lasso = Lasso(alpha=lasso_regressor.best_params_['alpha'])
lasso.fit(X_train,y_train)

pred_test_lasso = lasso.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,pred_test_lasso)))

### Ridge Regression

In [None]:
#Ridge Regression uses regularization parameter (alpha) to improve model performance
#Ridge regression: OLS + alpha*summation(squared(coeffs))


from sklearn import linear_model

#select optimal alpha (regularization parameter)
alphas = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 0.1, 1,2,5,10,20]

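#store_cv_values keeps the cross-validation errors for each alpha; scikit-learn 1.5+
#renames it to store_cv_results (and cv_values_ to cv_results_)
ridge_reg = linear_model.RidgeCV(alphas=alphas, store_cv_values=True)

ridge_reg.fit(X_train,y_train) #fit the model to generate the parameters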

cv_mse = np.mean(ridge_reg.cv_values_,axis=0)
print("alphas: %s" % alphas)
print("CV MSE: %s" % cv_mse)

print("Best alpha using built-in RidgeCV: %f" % ridge_reg.alpha_)

In [None]:
#generate the prediction using the best model
alpha = ridge_reg.alpha_ #transcribe the alpha for the best model
ridge_reg = linear_model.Ridge(alpha=alpha)

ridge_reg.fit(X_train, y_train) #fit model using training dataset

pred_y = ridge_reg.predict(X_test) #make predictions using the test set
print(np.sqrt(mean_squared_error(y_test,pred_y)))
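
The two penalties differ only in how they measure coefficient size: lasso adds the sum of absolute coefficients, ridge the sum of squared coefficients. The sketch below evaluates both objectives for a fixed coefficient vector on made-up data, using the scaling that scikit-learn documents for `Lasso` (residual sum of squares divided by 2n) and `Ridge` (unscaled residual sum of squares). The function names are hypothetical.

```python
import numpy as np

def lasso_objective(X, y, w, alpha):
    """(1 / (2n)) * ||y - Xw||^2 + alpha * sum(|w|)."""
    resid = y - X @ w
    return float(resid @ resid / (2 * len(y)) + alpha * np.abs(w).sum())

def ridge_objective(X, y, w, alpha):
    """||y - Xw||^2 + alpha * sum(w^2)."""
    resid = y - X @ w
    return float(resid @ resid + alpha * (w @ w))

# Toy problem: residuals are [0, 1], so the residual sum of squares is 1.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 1.0])

lasso_value = lasso_objective(X, y, w, alpha=0.5)  # 1/4 + 0.5 * 2
ridge_value = ridge_objective(X, y, w, alpha=0.5)  # 1 + 0.5 * 2
print(lasso_value, ridge_value)
```

Because the absolute-value penalty has a corner at zero, lasso can set coefficients exactly to zero, while ridge only shrinks them toward zero.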

Using the RMSE statistic, we select the model with the lowest score. Here, the OLS model has the lowest
score and will therefore be used to make out-of-sample predictions.

## Make Out-of-Sample Predictions with the Best Model
(simulate the production environment)

In [None]:
#Linear regression is the Best model
summary = df_test.describe()  #describe() already returns a dataframe
print(summary)
summary.to_csv('summary.csv')

In [None]:


#Predict the expected app rating for the following X's:
Xnew = pd.read_csv("X_new.csv")
Xnew.head()

In [None]:
numeric_cols = ['rating_count_tot','sup_devices.num','ipadSc_urls.num','lang.num','desc_len',
                'size_mb']
df_num_new = Xnew[numeric_cols]
print(df_num_new)

df_num_new_log = df_num_new.apply(lambda x: np.log1p(x))
print(df_num_new_log)

In [None]:
Xnew[numeric_cols] = df_num_new_log[numeric_cols]
print(Xnew)

In [None]:
# make a prediction
ynew = lr.predict(Xnew)
# show the predicted outputs
#the target was transformed with log1p, so invert with expm1 (not exp)
print("Predicted app rating:", np.expm1(ynew))

Based on the given features of an app, we predict its rating to be a 5, which means 
an excellent rating.

In [None]:
data3.to_csv("data3.csv")

In [None]:
data3.head()
