# **PlayStore App Prediction**

This project predicts the download bracket of a Play Store app (for example `10,000+`). The training data describes each app by its category, rating, review count, size, price, content rating, update history, and a few other fields; after the less useful columns are dropped, the classifiers below are trained on category, rating, review count, price, and content rating. Once trained, a model can be used to predict the download bracket of a new app.

### Importing the modules and data

In [1]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score

In [2]:
train_df = pd.read_csv('PlayStore/Train.csv')

In [3]:
train_df.head(5)

| | Offered_By | Category | Rating | Reviews | Size | Price | Content_Rating | Last_Updated_On | Release_Version | OS_Version_Required | Downloads |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ps_id-24654 | Finance | 4.18 | 1481 | Varies with device | Free | Everyone | May 05 2020 | Varies with device | Varies with device | 100,000+ |
| 1 | ps_id-35329 | Music And Audio | 4.81 | 302 | 10M | Free | Everyone | Mar 26 2020 | 3.9.18 | 4.1 and up | 5,000+ |
| 2 | ps_id-11044 | Game Casual | 4.27 | 374 | 27M | Free | Everyone | May 01 2020 | 1.10.1 | 4.1 and up | 10,000+ |
| 3 | ps_id-36068 | Business | 4.03 | 122058 | Varies with device | Free | Teen | May 02 2020 | Varies with device | Varies with device | 10,000,000+ |
| 4 | ps_id-35831 | Medical | 4.6 | 358 | Varies with device | 297.5742 | Everyone | Nov 29 2018 | Varies with device | Varies with device | 5,000+ |


In [4]:
train_df.describe()

| | Rating | Reviews |
|---|---|---|
| count | 16516.0 | 16516.0 |
| mean | 4.259646 | 193197.3 |
| std | 0.498968 | 1953846.0 |
| min | 1.0 | 1.0 |
| 25% | 4.09 | 147.0 |
| 50% | 4.36 | 1890.0 |
| 75% | 4.58 | 22669.25 |
| max | 5.0 | 85766430.0 |


In [5]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16516 entries, 0 to 16515
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Offered_By           16516 non-null  object 
 1   Category             16516 non-null  object 
 2   Rating               16516 non-null  float64
 3   Reviews              16516 non-null  int64  
 4   Size                 16516 non-null  object 
 5   Price                16516 non-null  object 
 6   Content_Rating       16516 non-null  object 
 7   Last_Updated_On      16516 non-null  object 
 8   Release_Version      16516 non-null  object 
 9   OS_Version_Required  16516 non-null  object 
 10  Downloads            16516 non-null  object 
dtypes: float64(1), int64(1), object(9)
memory usage: 1.4+ MB


### Removing less effective features before model training

In [6]:
train = train_df.drop(['Offered_By', 'Size', 'Last_Updated_On', 'Release_Version', 'OS_Version_Required'], axis = 1)

In [7]:
train.head(5)

| | Category | Rating | Reviews | Price | Content_Rating | Downloads |
|---|---|---|---|---|---|---|
| 0 | Finance | 4.18 | 1481 | Free | Everyone | 100,000+ |
| 1 | Music And Audio | 4.81 | 302 | Free | Everyone | 5,000+ |
| 2 | Game Casual | 4.27 | 374 | Free | Everyone | 10,000+ |
| 3 | Business | 4.03 | 122058 | Free | Teen | 10,000,000+ |
| 4 | Medical | 4.6 | 358 | 297.5742 | Everyone | 5,000+ |
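
The dropped `Size` and `Last_Updated_On` columns are stored as text and would need converting to numbers before a model could use them. If one wanted to keep them instead of dropping them, a minimal sketch of that conversion could look like the following. The helper names `parse_size_mb` and `days_since_update`, the reference date, and the `k` (kilobyte) suffix are assumptions for illustration; only the `M` suffix and `Varies with device` appear in the sample rows.

```python
from datetime import date, datetime
from typing import Optional


def parse_size_mb(size: str) -> Optional[float]:
    """Convert a size string such as '10M' to megabytes; None if unknown."""
    text = size.strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):  # assumed kilobyte suffix, not seen in the sample rows
        return float(text[:-1]) / 1024
    return None  # e.g. 'Varies with device'


def days_since_update(last_updated: str, reference: date) -> int:
    """Days between a 'May 05 2020'-style date and a reference date."""
    updated = datetime.strptime(last_updated, "%b %d %Y").date()
    return (reference - updated).days


print(parse_size_mb("10M"))
print(parse_size_mb("Varies with device"))
print(days_since_update("May 01 2020", date(2020, 5, 5)))
```

Rows where the size is unknown would still need a strategy, such as imputing a typical value or adding a separate "unknown size" indicator column.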


### Encoding the categorical features and the target

In [8]:
cat = preprocessing.LabelEncoder()
train['Category'] = cat.fit_transform(train['Category'])

In [9]:
con = preprocessing.LabelEncoder()
train['Content_Rating'] = con.fit_transform(train['Content_Rating'])

In [10]:
dow = preprocessing.LabelEncoder()
train['Downloads'] = dow.fit_transform(train['Downloads'])
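
`LabelEncoder` assigns integer codes to the sorted class labels, and the `Downloads` labels are strings, so the codes follow text order, not download volume. That is harmless for a classifier, which treats the codes as unordered classes, but it matters when reading the encoded values. The sketch below shows the effect with plain Python sorting and a hypothetical helper, `downloads_lower_bound`, that recovers the numeric lower bound of a bracket.

```python
def downloads_lower_bound(label: str) -> int:
    """Turn a bracket label such as '100,000+' into the integer 100000."""
    return int(label.rstrip("+").replace(",", ""))


labels = ["100,000+", "5,000+", "10,000+", "10,000,000+"]

# Text order, which is the order LabelEncoder uses for its classes
text_order = sorted(labels)
# Numeric order by download volume
numeric_order = sorted(labels, key=downloads_lower_bound)

print(text_order)
print(numeric_order)
```

In text order `5,000+` sorts last even though it is the smallest bracket, so an encoded value should always be mapped back with `dow.inverse_transform` before it is interpreted.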

In [11]:
# 'Free' apps cost nothing, so map the label to 0 before converting the column to float
train['Price'] = train['Price'].replace('Free', 0)
train['Price'] = train['Price'].astype(float)

### Separating Features and Target

In [12]:
X = train.drop('Downloads', axis = 1)
y = train['Downloads']

### Data Standardization

In [13]:
scaler = StandardScaler()
X = scaler.fit_transform(X)



### Train Test Split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle = True)
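
In the cells above, the scaler is fitted on all rows before the split, so statistics from the test rows influence the training features. A common alternative is to split first and wrap the scaler and model in a `Pipeline`, so that scaling is learned from the training portion only. The sketch below uses synthetic data from `make_classification` as a stand-in for the Play Store features; `random_state=0` and `stratify=y` are added assumptions for reproducibility and balanced class proportions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and download classes
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Split first so the test rows stay unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# The scaler is fitted inside the pipeline, on the training rows only
model = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier(
        n_estimators=20, max_depth=10, min_samples_leaf=2, random_state=0
    )),
])
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

Tree-based models are largely insensitive to feature scaling, so the scaling step matters mostly for the SVM, logistic regression, and nearest-neighbor models trained later.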

## Model Training

### Random Forest Classifier

In [15]:
clf = RandomForestClassifier(n_estimators = 20, max_depth = 10, min_samples_leaf = 2)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
score_random = accuracy_score(y_test, y_pred)
print(score_random)

0.5606310768666299
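
The forest's settings above were chosen by hand. A cross-validated grid search is one way to make that choice systematic. The sketch below runs on synthetic data, and the candidate values in `param_grid` are illustrative assumptions, not values tested in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded features and download classes
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Candidate settings (assumed values, chosen only for illustration)
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [5, 10],
    "min_samples_leaf": [1, 2],
}

# Every combination is scored with 3-fold cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

On the real data, the search would be fitted on `X_train` and `y_train` only, leaving the test split for a final check.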


### Support Vector Machines

In [16]:
from sklearn.svm import SVC
svm = SVC(C = 10, gamma = 'auto')
svm.fit(X_train,y_train)
y_pred = svm.predict(X_test)
# True labels first, then predictions
score_svm = accuracy_score(y_test, y_pred)
print(score_svm)

0.31773986424509265


### Logistic Regression

In [None]:
lr = LogisticRegression(solver = 'lbfgs', multi_class = 'auto')
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
# True labels first, then predictions
score_lr = accuracy_score(y_test, y_pred)
print(score_lr)

### K Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors = 1)
neigh.fit(X_train,y_train)
y_pred = neigh.predict(X_test)
# True labels first, then predictions
score_knn = accuracy_score(y_test, y_pred)
print(score_knn)

In [None]:
print("Accuracy for Random Forest Classifier - {}".format(score_random))
print("Accuracy for Support Vector Machines- {}".format(score_svm))
print("Accuracy for Logistic Regression - {}".format(score_lr))
print("Accuracy for K Nearest Neighbors - {}".format(score_knn))
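
Accuracy figures are easier to judge next to a trivial baseline. The sketch below evaluates several classifiers on one shared split and includes scikit-learn's `DummyClassifier`, which always predicts the most frequent training class. The data is synthetic, and the `models` dictionary and `max_iter=1000` are illustrative choices that are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded features and download classes
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

models = {
    "Majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "Random Forest": RandomForestClassifier(
        n_estimators=20, max_depth=10, min_samples_leaf=2, random_state=0
    ),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbors": KNeighborsClassifier(n_neighbors=1),
}

# Fit every model on the same training rows and score it on the same test rows
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Report from best to worst
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print("Accuracy for {} - {:.3f}".format(name, score))
```

A model is only useful if it clearly beats the baseline row; with many imbalanced download brackets, the baseline can be higher than one might expect.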

### From the above results, the **Random Forest Classifier** gave the *best* accuracy (about 0.56 on the test split, against about 0.32 for the SVM) after manual *hyperparameter tuning* of all the classifiers.
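
Predicting the download bracket of a new app reuses the fitted encoders and scaler: transform the new row with the same objects, predict, then map the predicted code back to a label with `inverse_transform`. The sketch below uses a small made-up table in place of `Train.csv` and only three features for brevity; the rows and the new app's values are invented for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up rows standing in for the training data
train = pd.DataFrame({
    "Category": ["Finance", "Medical", "Business", "Finance", "Medical", "Business"],
    "Rating": [4.2, 4.6, 4.0, 4.1, 4.7, 3.9],
    "Reviews": [1500, 350, 120000, 1800, 300, 150000],
    "Downloads": ["100,000+", "5,000+", "10,000,000+",
                  "100,000+", "5,000+", "10,000,000+"],
})

# Fit one encoder per text column and keep them for later reuse
cat = LabelEncoder()
dow = LabelEncoder()
train["Category"] = cat.fit_transform(train["Category"])
y = dow.fit_transform(train["Downloads"])

feature_names = ["Category", "Rating", "Reviews"]
scaler = StandardScaler()
X = scaler.fit_transform(train[feature_names])

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X, y)

# A new app, encoded and scaled with the already-fitted objects
new_app = pd.DataFrame({"Category": ["Medical"], "Rating": [4.5], "Reviews": [400]})
new_app["Category"] = cat.transform(new_app["Category"])
new_X = scaler.transform(new_app[feature_names])

# Map the predicted integer code back to its download bracket label
predicted_label = dow.inverse_transform(clf.predict(new_X))[0]
print(predicted_label)
```

Note that `cat.transform` raises an error for a category that was not present during fitting, so a new app with an unseen category needs separate handling.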