# Playstore_App_Rating_Prediction

## Importing the Required Libraries 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly 
import plotly.offline as pyo
from plotly.offline import iplot,plot,init_notebook_mode
import plotly.express as px
import cufflinks as cf
import plotly.graph_objects as go
import seaborn as sns
plt.rc('figure', figsize=(20.0, 10.0))

In [None]:
pyo.init_notebook_mode(connected=True)
cf.go_offline()

In [None]:
import plotly.io as pio
pio.renderers.default = 'colab'

## Importing Dataset

In [None]:
dataset=pd.read_csv("/kaggle/input/playstore-analysis/googleplaystore.csv")

In [None]:
dataset.head()

## Finding and Handling any Missing Data

Analysing various columns for any null values

In [None]:
dataset.isnull().sum()

Dropping the rows which have any null records

In [None]:
dataset=dataset.dropna()
dataset=dataset.reset_index(drop=True)
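
Dropping every row that contains a null is the simplest option, but it helps to know how many rows it costs. A minimal sketch on a made-up frame (the `toy` data is hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Play Store data
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, 3.9, 4.5],
    "Android Ver": ["4.0 and up", "4.1 and up", None, "5.0 and up"],
})

before = len(toy)
cleaned = toy.dropna().reset_index(drop=True)  # same two steps as above
dropped = before - len(cleaned)

print(f"dropped {dropped} of {before} rows ({dropped / before:.0%})")
```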

Checking for any null records

In [None]:
dataset.isnull().sum()

## Data Preparation

Inspecting the data type of each column in the dataset

In [None]:
dataset.info()

Converting the Reviews column into integers

In [None]:
dataset['Reviews']=dataset["Reviews"].astype(int)
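
`astype(int)` raises a `ValueError` as soon as one value is not a clean integer string. A hedged sketch of a more defensive conversion with `pd.to_numeric`; the sample values are invented for illustration:

```python
import pandas as pd

# Invented sample: two clean counts and one malformed value
reviews = pd.Series(["159", "87510", "3.0M"])

# errors="coerce" turns anything unparseable into NaN instead of raising
parsed = pd.to_numeric(reviews, errors="coerce")
bad_rows = reviews[parsed.isna()]

print(bad_rows.tolist())  # values that would need manual handling
```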

Converting the Size column to a single numeric unit (kB) by stripping the 'M' and 'k' suffixes and removing the rows marked "Varies with device"

In [None]:
dataset["Size"].unique()

This function checks whether a value ends in 'M' or 'k' and converts it to kilobytes accordingly; any other value is returned unchanged.

In [None]:
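def mb_to_kb(a):
  if a.endswith("M"):
    return float(a[:-1])*1000  # megabytes -> kilobytes (1 MB taken as 1000 kB)
  elif a.endswith("k"):
    return float(a[:-1])       # already in kilobytes
  else:
    return a                   # e.g. "Varies with device", handled below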

In [None]:
dataset["Size"]=dataset["Size"].apply(lambda x:mb_to_kb(x))
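
To see what the helper does, here is a small self-contained sketch. `size_to_kb` is a hypothetical variant of `mb_to_kb` that returns `NaN` for anything without an `M` or `k` suffix, so the column ends up fully numeric. It keeps the 1 MB = 1000 kB convention used above.

```python
import numpy as np
import pandas as pd

def size_to_kb(value):
    """Hypothetical variant of mb_to_kb: size in kB, NaN when unknown."""
    if value.endswith("M"):
        return float(value[:-1]) * 1000  # megabytes -> kilobytes
    if value.endswith("k"):
        return float(value[:-1])         # already in kilobytes
    return np.nan                        # e.g. "Varies with device"

sizes = pd.Series(["19M", "201k", "Varies with device"])
sizes_kb = sizes.apply(size_to_kb)
print(sizes_kb.tolist())
```

With this variant, rows of unknown size could be removed with `dropna(subset=["Size"])` instead of comparing against the string.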

In [None]:
dataset[dataset["Size"]=="Varies with device"]

In [None]:
rows=dataset[dataset["Size"]=="Varies with device"].index

In [None]:
dataset.drop(rows,inplace=True)
dataset["Size"]=dataset["Size"].astype(float)  # only numeric sizes (kB) remain

Removing the '+' symbol and the thousands separators from each value in the Installs column

In [None]:
dataset["Installs"].value_counts()

In [None]:
dataset["Installs"]=dataset["Installs"].str[:-1]
dataset["Installs"]=dataset["Installs"].apply(lambda x:x.replace(",",""))

In [None]:
dataset["Installs"]=dataset["Installs"].astype(int)
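
The same cleanup can be written with vectorised string methods. A sketch on invented values, assuming the only non-digit characters are `+` and `,`. Unlike slicing off the last character, this leaves a value without a trailing `+` intact.

```python
import pandas as pd

installs = pd.Series(["10,000+", "500+", "1,000,000+"])

cleaned = (
    installs.str.replace("+", "", regex=False)   # drop the plus sign
            .str.replace(",", "", regex=False)   # drop thousands separators
            .astype(int)
)
print(cleaned.tolist())
```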

Removing the '$' sign from the Price Column

In [None]:
dataset["Price"].unique()

In [None]:
dataset["Price"]=dataset["Price"].apply(lambda x:x.replace("$",""))
dataset["Price"]=dataset["Price"].astype(float)

Removing the rows that have more reviews than installs

In [None]:
dataset["Rating"].between(0,5).sum()

In [None]:
rows=dataset[dataset["Installs"]<dataset["Reviews"]].index
dataset.drop(rows,inplace=True)
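
The consistency check above boils down to a boolean mask. A sketch on a hypothetical frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Installs": [1000, 50, 10],
    "Reviews": [120, 50, 35],   # app "c" has more reviews than installs
})

suspicious = toy["Reviews"] > toy["Installs"]
kept = toy[~suspicious]
print(kept["App"].tolist())
```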

## Univariate Analysis

In [None]:
dataset.head()

## Outlier Correction

In [None]:
sns.boxplot(data=dataset,orient="h",palette="Set2")

The box plot shows that there are some outliers in the Reviews, Installs and Price columns.

In [None]:
dataset["Reviews"].value_counts()

Very few apps have a very high number of reviews. These are star apps that do not help the analysis and would in fact skew it, so applications with more than 2 million reviews are removed.

In [None]:
rows=dataset[dataset["Reviews"]>2000000].index

In [None]:
dataset.drop(rows,inplace=True)

The box plot also shows some apps with a very high price. A price above $200 for a Play Store application is suspicious, so applications priced above $200 are removed.

In [None]:
rows=dataset[dataset["Price"]>200].index

In [None]:
dataset.drop(rows,inplace=True)

There also seem to be outliers in the Installs field, so the threshold is set at 500,000 installs.

In [None]:
perc=[.10, .25, .50, .70, .90, .95, .99]
dataset["Installs"].describe(percentiles=perc)

In [None]:
sns.histplot(dataset["Installs"])  # histplot replaces the deprecated distplot(kde=False)

In [None]:
rows=dataset[dataset["Installs"]>500000].index

In [None]:
dataset.drop(rows,inplace=True)
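
The cut-offs above are chosen by eye. One common alternative, not used in this notebook, is the 1.5 * IQR rule. The helper name `iqr_upper_fence` and the sample values are assumptions for illustration:

```python
import pandas as pd

def iqr_upper_fence(series, k=1.5):
    """Upper outlier fence: Q3 + k * (Q3 - Q1)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q3 + k * (q3 - q1)

# Invented install counts with one extreme value
installs = pd.Series([100, 500, 1000, 5000, 10000, 50000, 100000000])
fence = iqr_upper_fence(installs)
outliers = installs[installs > fence]
print(fence, outliers.tolist())
```

On heavily skewed columns such as Installs this rule can flag a large share of the rows, so a fixed, domain-driven threshold may still be the more practical choice.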

In [None]:
sns.histplot(dataset["Rating"])

In [None]:
sns.histplot(dataset["Size"])

The two histograms above show that neither Rating nor Size has any significant outliers.

## Multivariate Analysis

In [None]:
dataset

In [None]:
plt.figure(figsize=(11,8))
sns.scatterplot(x=dataset["Rating"],y=dataset["Price"],hue=dataset["Rating"])

There is no clear pattern showing that paid apps get better ratings, but apps priced at $9 or more get at least an average rating of 2.5.

In [None]:
px.scatter(dataset,x="Rating",y="Size",color="Size",color_continuous_scale=px.colors.sequential.Viridis)

This scatter plot also shows that a larger size does not guarantee a high rating, although heavier apps are mostly rated better than lighter ones.

In [None]:
px.scatter(dataset,x="Rating",y="Reviews",color="Size",color_continuous_scale=px.colors.sequential.Viridis)

There is no particular pattern between reviews and rating, but beyond a certain number of reviews the rating appears to become independent of popularity.

In [None]:
px.box(dataset,y="Rating",x="Content Rating")

Adult apps have the highest ratings.
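
A box plot comparison can be backed with numbers by grouping. A sketch on invented ratings (the values are hypothetical, so the ordering here says nothing about the real data):

```python
import pandas as pd

toy = pd.DataFrame({
    "Content Rating": ["Everyone", "Everyone", "Teen", "Teen", "Mature 17+"],
    "Rating": [4.0, 4.4, 4.1, 4.5, 3.9],
})

summary = (
    toy.groupby("Content Rating")["Rating"]
       .agg(["median", "count"])          # count shows how small a group is
       .sort_values("median", ascending=False)
)
print(summary)
```

Reporting the count next to the median helps, since a group with only a handful of apps can top the ranking by chance.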

In [None]:
px.box(dataset,y="Rating",x="Category")

## Data Preprocessing

In [None]:
dataset.columns

Resetting the row index

In [None]:
dataset=dataset.reset_index(drop=True)

Dropping all the unnecessary columns from the dataset

In [None]:
dataset.drop(["App","Installs","Type","Content Rating",'Last Updated', 'Current Ver',
       'Android Ver'],axis=1,inplace=True)

In [None]:
dataset

Separating the independent and dependent variables

In [None]:
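# Features in a fixed order so the encoder indices used later line up:
# 0=Category, 1=Reviews, 2=Size, 3=Price, 4=Genres; the target is Rating
X=dataset[["Category","Reviews","Size","Price","Genres"]].values
y=dataset["Rating"].values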

In [None]:
X

One-hot encoding the categorical data in the "Category" and "Genres" columns

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# scikit-learn >= 1.2 uses sparse_output; older versions call this argument sparse
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(sparse_output=False), [4])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [None]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(sparse_output=False), [-4])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
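
What one-hot encoding produces is easier to see on a small frame. The sketch below uses `pd.get_dummies` instead of `OneHotEncoder` purely for illustration (a swapped-in technique, not the method used above); the toy rows are invented:

```python
import pandas as pd

toy = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME"],
    "Reviews": [120, 45, 3000],
    "Genres": ["Action", "Tools", "Arcade"],
})

# Each distinct label becomes its own 0/1 column; numeric columns pass through
encoded = pd.get_dummies(toy, columns=["Category", "Genres"], dtype=int)
print(encoded.columns.tolist())
```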

Splitting the dataset into training and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
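
Without a fixed seed the split, and therefore every score computed from it, changes on each run. A sketch of a reproducible split on made-up arrays; `random_state=42` is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 made-up samples, 2 features
y_demo = np.arange(10)

# train_test_split returns [X_train, X_test, y_train, y_test]
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same seed -> identical test rows on every run
print(split_a[1].shape, np.array_equal(split_a[1], split_b[1]))
```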

## Model Training


Using the LinearRegression model from sklearn library

In [None]:
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
model=regressor.fit(X_train, y_train)

Predicting the ratings for the test set

In [None]:
y_pred=model.predict(X_test)

Computing the R2 score and the RMSE with sklearn.metrics to evaluate the regression model

In [None]:
from sklearn.metrics import r2_score,mean_squared_error
print('R2_Score=',r2_score(y_test,y_pred))
print('Root_Mean_Squared_Error(RMSE)=',np.sqrt(mean_squared_error(y_test,y_pred)))
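
For reference, both metrics can be computed by hand. With the residual sum of squares SS_res and the total sum of squares SS_tot, R2 = 1 - SS_res / SS_tot, and RMSE is the square root of the mean squared residual. A sketch with invented numbers:

```python
import numpy as np

y_true = np.array([3.0, 4.0, 5.0])
y_hat = np.array([2.5, 4.0, 5.5])

residuals = y_true - y_hat
ss_res = np.sum(residuals ** 2)                   # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # squared error of the mean

r2 = 1 - ss_res / ss_tot
rmse = np.sqrt(np.mean(residuals ** 2))
print(r2, rmse)
```

An R2 close to 0 means the model does little better than always predicting the mean rating, which is a useful baseline to keep in mind when reading the scores.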

In [None]:

a=pd.DataFrame({'Actual':y_test.flatten(),'Predicted':y_pred.flatten()})
a.head(10)

In [None]:
fig=a.head(25)
fig.plot(kind='bar',figsize=(10,8))