# **App Rating Prediction**

## **Introduction**

The goal of this project is to predict app ratings from data on roughly 10k Play Store apps and their user reviews.
The dataset consists of two *csv* files:
*   *googleplaystore_user_reviews.csv*: contains the first 100 'most relevant' reviews for each app
*   *googleplaystore.csv*: contains the details of the apps

We start by loading some libraries and importing the dataset.

In [0]:
from sklearn.linear_model import LinearRegression 
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import csv

In [0]:
app_data_file_name = 'googleplaystore.csv'
user_reviews_file_name = 'googleplaystore_user_reviews.csv'

In [0]:
app_data = pd.read_csv(app_data_file_name)
user_reviews = pd.read_csv(user_reviews_file_name)

## **I. Data Exploration**

In this section we explore the data and look at some of its relevant properties, such as its size and the types of its fields.

### **I. 1. Details of the apps**


In [0]:
print("We have", app_data.shape[0], "apps and", app_data.shape[1], "features for each app.")

In [0]:
print("these are the features : \n", app_data.columns)

Here is what the dataset looks like. Most of the fields do not hold numerical values; we will take care of that in the data cleaning section.


In [0]:
app_data.head()

In [0]:
app_data.info()

Of the 10841 apps, only 9367 are rated, so we need to restrict our analysis to the rated portion of the data.
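
Restricting the data to rated apps amounts to keeping the rows whose `Rating` is not missing. Here is a minimal sketch on a small made-up table (the app names and values are illustrative assumptions, not rows from the dataset). The `dropna()` call in the cleaning section has the same effect, and additionally removes rows with gaps in other columns.

```python
import pandas as pd

# Toy stand-in for app_data; names and values are made up for illustration.
toy_apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, None, 3.9, None],
})

# Keep only the rows that actually carry a rating.
rated_apps = toy_apps[toy_apps["Rating"].notna()]

print(len(toy_apps), "apps,", len(rated_apps), "rated")
```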

In [0]:
app_data.describe(include='all')

In [0]:
app_data.hist(bins=5,column="Rating", figsize=(5,5), grid=False, range=(0, 5))

### **I. 2. User reviews**

In [0]:
user_reviews.shape

In [0]:
user_reviews.columns

In [0]:
user_reviews.head()

In [0]:
user_reviews.info()

In [0]:
user_reviews.describe(include='all')

In [0]:
user_reviews.hist(bins=20, column = ['Sentiment_Polarity', 'Sentiment_Subjectivity'], figsize=(10,5), grid=False)

## **II. Data Cleaning**

In [0]:
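def commas_to_int(commas):
  """Convert a string such as '10,000' to the integer 10000."""
  return int(commas.replace(',', ''))

def string_to_int(s):
  """Parse the textual numeric fields of the dataset into integers.

  Handles sizes ('19M', '201k'), install counts ('10,000+'),
  prices ('$4.99', returned in cents) and plain numbers ('159').
  """
  if s[-1]=='M':
    n = float(s[:-1])*1000000
  elif s[-1]=='K' or s[-1]=='k':
    n = float(s[:-1])*1000
  elif s[-1]=='+':
    n = commas_to_int(s[:-1])
  elif s[0] == "$":
    n = float(s[1:])*100
  else:
    n = float(s)
  # Round before casting: float arithmetic can land just below the
  # intended value (e.g. 1.15*100), and int() alone would truncate it.
  return int(round(n))

def type_to_binary(t):
  """Encode the Type field: 0 for 'Free', 1 for anything else."""
  if t == "Free":
    return 0
  else :
    return 1

def CR_to_int(list_of_ratings):
  """Return a function mapping each label to an integer code.

  The first label in list_of_ratings gets the highest code and the
  last one gets 0, so the codes follow the order of the labels passed in.
  """
  list_of_ratings = list(list_of_ratings)
  return lambda rating : len(list_of_ratings) - list_of_ratings.index(rating) - 1 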


### II. 1. Dropping unnecessary columns

Assuming that the Android version, the current app version and the last update date are unrelated to the rating, we can drop them from our data set.

In [0]:
app_data.drop(labels = ['Last Updated','Current Ver','Android Ver'], axis = 1, inplace = True)
app_data.head()

### II. 2. Removing rows with NaN values

In [0]:
previous_number_of_rows = app_data.shape[0]
app_data = app_data.dropna()
print("new shape = ", app_data.shape)
print('Ratio of deleted rows:', round((1-app_data.shape[0]/previous_number_of_rows)*100), '%')

In [0]:
previous_number_of_rows = user_reviews.shape[0]
user_reviews = user_reviews.dropna()
print("new shape = ", user_reviews.shape)
print('Ratio of deleted rows:', round((1-user_reviews.shape[0]/previous_number_of_rows)*100), '%')

### II. 3. Changing fields with numerical nature to numbers



#### a) REVIEWS

In [0]:
app_data['Reviews'].unique()

We only have integer values, so we can directly convert the reviews into integers.

In [0]:
app_data['Reviews'] = app_data['Reviews'].apply(string_to_int)
app_data['Reviews'].unique()

#### b) SIZE

In [0]:
app_data['Size'].unique()

We need to convert all the sizes to the same unit, get rid of the 'M' or 'k' suffix, and remove the rows whose size is 'Varies with device'.

In [0]:
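# .copy() makes the filtered frame independent, so the column assignment below does not trigger SettingWithCopyWarning
app_data = app_data[app_data.Size != 'Varies with device'].copy()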
app_data['Size'] = app_data['Size'].apply(string_to_int)
app_data['Size'].unique()
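
The same unit conversion can also be written with vectorized pandas string operations instead of a row-by-row `apply`. This is only a sketch on made-up values; `toy_sizes` and `multipliers` are names introduced here for illustration and are not part of the notebook.

```python
import pandas as pd

# Made-up size strings in the same style as the Size column.
toy_sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])

# Multiplier for each unit suffix (illustrative helper).
multipliers = {"M": 1_000_000, "k": 1_000, "K": 1_000}

# Numeric part: everything but the last character; unparsable text becomes NaN.
number = pd.to_numeric(toy_sizes.str[:-1], errors="coerce")
# Unit part: the last character, mapped to its multiplier (NaN if unknown).
factor = toy_sizes.str[-1].map(multipliers)

# Rows that could not be parsed are NaN and get dropped before the integer cast.
toy_sizes_int = (number * factor).round().dropna().astype(int)
print(toy_sizes_int.tolist())
```

Rounding before the integer cast guards against floating-point results that fall just below the intended value.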

#### c) INSTALLS

In [0]:
app_data['Installs'].unique()

We only need to remove the "+" and the thousands separators.

In [0]:
app_data['Installs'] = app_data['Installs'].apply(string_to_int)
app_data['Installs'].unique()

#### d) PRICE

In [0]:
app_data['Price'].unique()

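We have to remove the "$" sign. Note that `string_to_int` also multiplies the price by 100, so prices are stored as integer cents.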

In [0]:
app_data['Price'] = app_data['Price'].apply(string_to_int)
app_data['Price'].unique()

### II. 4. Changing fields with non numerical nature to numbers

#### a) TYPE

In [0]:
app_data['Type'].unique()

We can turn the Type column into a binary 0/1 column since it contains either 'Free' or 'Paid' tags.

In [0]:
app_data['Type'] = app_data['Type'].apply(type_to_binary)
app_data['Type'].unique()

#### b) CONTENT RATING

In [0]:
app_data['Content Rating'].unique()

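For the content rating we can assign an integer code to each tag, for example Unrated => 0 and Everyone => 5. Note that `CR_to_int` derives the codes from the order in which the tags first appear in the column.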

In [0]:
app_data['Content Rating'] = app_data['Content Rating'].apply(CR_to_int(app_data['Content Rating'].unique()))
app_data['Content Rating'].unique()
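
Because `CR_to_int` takes its codes from the order in which the labels first appear, the resulting numbers depend on row order. One possible alternative is to spell the order out in a dictionary. The ranking below is an assumption made for illustration (it keeps Unrated at 0 and Everyone at 5, as in the example above), and `content_rating_order` is a hypothetical name.

```python
import pandas as pd

# Explicit, hand-chosen order (an assumption, not derived from the data).
content_rating_order = {
    "Unrated": 0,
    "Adults only 18+": 1,
    "Mature 17+": 2,
    "Teen": 3,
    "Everyone 10+": 4,
    "Everyone": 5,
}

# Made-up column values for illustration.
toy_ratings = pd.Series(["Teen", "Everyone", "Unrated", "Everyone"])

# map() looks each label up in the dictionary; unknown labels would become NaN.
codes = toy_ratings.map(content_rating_order)
print(codes.tolist())
```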

#### c) CATEGORY

In [0]:
app_data['Category'].unique()

Since the categories have no natural order, it is not recommended to transform this field into an integer field. Instead, we create dummy columns, one for each category.

In [0]:
app_data = pd.get_dummies(app_data, columns=['Category'])
app_data.head(2)
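
To make the dummy-column idea concrete, here is a small sketch on made-up rows: each distinct category becomes its own indicator column, so no artificial order is imposed on the categories. The table `toy` is an illustrative assumption.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Category": ["GAME", "TOOLS", "GAME"],
})

# One indicator column per distinct category, named Category_<value>.
toy_encoded = pd.get_dummies(toy, columns=["Category"])
print(toy_encoded.columns.tolist())
```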

#### d) GENRES

In [0]:
app_data['Genres'].unique()


We can make dummy columns for this field too

In [0]:
app_data = pd.get_dummies(app_data, columns=['Genres'])
app_data.head(2)

### II. 5. Reuniting the two datasets

#### a) SENTIMENT

The most important feature in the user reviews table is the sentiment.

In [0]:
user_reviews['Sentiment'].unique()

First we need to transform it into an integer field: Positive => 2, Negative => 0.

In [0]:
user_reviews['Sentiment'] = user_reviews['Sentiment'].apply(CR_to_int(user_reviews['Sentiment'].unique()))
user_reviews['Sentiment'].unique()

#### b) Can we add the Sentiment average to the apps table?

In [0]:
A = app_data['App'].unique()
B = user_reviews['App'].unique()
C = [] 
for b in B:
    if b in A :
        C.append(b)

print("We have", len(A), "apps in our data set")
print("There are", len(B), "reviewed apps")
print("There are", len(C), "reviewed apps in our data set")

Since the number of reviewed apps is very small compared to the size of our data set, the user_reviews table does not provide enough information to be worth adding to the app_data table.
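
The overlap count above can also be computed with set operations. If the overlap were large enough, a per-app sentiment average could then be attached with a group-by and a left join. The sketch below uses made-up rows; `toy_apps`, `toy_reviews` and `Mean_Polarity` are illustrative names, and the notebook itself does not perform this merge.

```python
import pandas as pd

# Made-up tables for illustration.
toy_apps = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.5, 3.8, 4.1]})
toy_reviews = pd.DataFrame({
    "App": ["A", "A", "D"],
    "Sentiment_Polarity": [0.5, 0.1, -0.2],
})

# Apps present in both tables.
common = set(toy_apps["App"]) & set(toy_reviews["App"])
print(len(common), "reviewed apps are also in the app table")

# Average polarity per app, then a left join so every app row is kept.
mean_polarity = (
    toy_reviews.groupby("App")["Sentiment_Polarity"].mean()
    .rename("Mean_Polarity")
    .reset_index()
)
merged = toy_apps.merge(mean_polarity, on="App", how="left")
print(merged)
```

Apps without any review end up with a missing `Mean_Polarity`, which is why a small overlap makes the extra column of little use.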

### II. 6. Scaling data

Scaling every feature to a comparable range makes training less sensitive to the original scale of the features.

In [0]:
app_data.columns[1:]


In [0]:
app_data_final_array = normalize(np.array(app_data)[:,1:], axis = 0, norm = 'max')

In [0]:
app_data_final_df = pd.DataFrame(data = app_data_final_array, columns = app_data.columns[1:] )

In [0]:
app_data_final_df.head()
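
With `norm = 'max'` and `axis = 0`, `normalize` divides each column by its largest absolute value, so every feature ends up in the range [-1, 1]. The NumPy sketch below reproduces that computation on a made-up matrix; it is an illustration of the idea, not a replacement for the scikit-learn call.

```python
import numpy as np

# Made-up feature matrix: rows are samples, columns are features.
toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [4.0, 100.0]])

# Largest absolute value of each column.
# This sketch assumes no column is entirely zero (that would divide by zero).
col_max = np.abs(toy).max(axis=0)

# Divide every column by its own maximum.
scaled = toy / col_max
print(scaled)
```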

### II. 7. Visualizing the cleaned data

In [0]:
sub_df = pd.DataFrame(app_data_final_df, columns=['Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating'])
sns.pairplot(sub_df)

In [0]:
sns.heatmap(sub_df.corr(),annot=True)

## **III. Machine Learning Models**

Once the data is cleaned, we can start to apply ML models to it in order to make rating predictions.

In [0]:
X = app_data_final_df.drop(labels = ['Rating'], axis = 1)
y = app_data_final_df['Rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

def model_results(model):
  model.fit(X_train, y_train)
  return model.predict(X_test)
  

def display_errors(test, results):
  print('MSE :', metrics.mean_squared_error(test, results))
  print('MAE :', metrics.mean_absolute_error(test, results))
  #print('MSLE :', metrics.mean_squared_log_error(test, results))
  
def plot_results(test, results, title):  
  plt.figure(figsize=(12,7))
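  # x and y are passed by keyword, as recent seaborn versions require;
  # the legend call is dropped because no plotted element carries a label
  sns.regplot(x=results, y=test)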
  plt.title(title)
  plt.xlabel('Predicted Ratings')
  plt.ylabel('Actual Ratings')
  plt.show()
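
For reference, the two error measures printed by `display_errors` are the mean of the squared errors (MSE) and the mean of the absolute errors (MAE). Here is a minimal NumPy sketch on made-up ratings. It also compares against a constant predictor that always outputs the mean rating, a baseline added here for illustration that is not part of the notebook.

```python
import numpy as np

# Made-up actual and predicted ratings.
y_true = np.array([4.0, 3.5, 5.0, 4.5])
y_pred = np.array([4.2, 3.0, 4.6, 4.5])

errors = y_true - y_pred
mse = np.mean(errors ** 2)      # mean squared error
mae = np.mean(np.abs(errors))   # mean absolute error

# Baseline: always predict the mean of the observed ratings.
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_mse = np.mean((y_true - baseline_pred) ** 2)

print("MSE :", mse)
print("MAE :", mae)
print("Baseline MSE :", baseline_mse)
```

A model is only informative if its errors are clearly below those of such a constant baseline.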

### **III. 1. Linear Regression Model**


In [0]:
LR_model = LinearRegression()
LR_results = model_results(LR_model)

In [0]:
display_errors(y_test, LR_results)

In [0]:
plot_results(y_test, LR_results, title = 'Linear Regression Model')