**IMPORTING THE LIBRARIES AND DATASET**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv("../input/playstore-analysis/googleplaystore.csv")

In [None]:
df.head()

**FINDING AND HANDLING THE MISSING DATA**

In [None]:
df.isnull().sum()

In [None]:
df.shape

In [None]:
df.dropna(inplace=True)

In [None]:
df.shape

The Size column uses the suffixes 'M' (mega) and 'k' (kilo). We strip the suffix and express every value in kilobytes, multiplying the 'M' values by 1000, and then store the column as float64.

In [None]:
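def change(x):
    """Convert a size string such as '19M' or '201k' to kilobytes."""
    if x.endswith('M'):
        return float(x[:-1]) * 1000  # megabytes -> kilobytes
    elif x.endswith('k'):
        return float(x[:-1])         # already in kilobytes
    else:
        return None                  # no 'M'/'k' suffix: becomes NaN

df.Size = df.Size.map(change)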

In [None]:
df.Size
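
As a cross-check, the same conversion can be sketched without a Python-level function. This is only an illustrative alternative, not the method used in this notebook: the sample values are made up, and the regular expression assumes every valid size is a number followed by a single 'M' or 'k'.

```python
import pandas as pd

# Toy values that mimic the Size column; they are not taken from the dataset.
sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])

# Split each value into a numeric part and a unit suffix ('M' or 'k').
# Rows that do not match the pattern get NaN in both columns.
parts = sizes.str.extract(r"^([\d.]+)([Mk])$")
number = pd.to_numeric(parts[0])
factor = parts[1].map({"M": 1000.0, "k": 1.0})

# Sizes in kilobytes; unmatched rows stay NaN, just like in change().
size_kb = number * factor
print(size_kb.tolist())
```

Both approaches leave NaN wherever the suffix is missing, so the missing-value handling that follows applies either way.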

Some values have neither an 'M' nor a 'k' suffix. These were replaced with NaN and now have to be filled.

In [None]:
df["Size"].isnull().sum()

In [None]:
df["Size"] = df["Size"].ffill()  # forward fill: copy the last valid size downwards
df["Size"].isnull().sum()
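
To see what a forward fill does, here is a small sketch on made-up numbers. Two caveats are worth noticing: a NaN in the very first row has nothing before it and stays NaN, and each gap receives the size of whichever app happens to sit in the previous row. The median fill shown at the end is an assumption of mine, offered only as a possible alternative.

```python
import numpy as np
import pandas as pd

# Toy sizes in kilobytes; NaN marks a missing value.
s = pd.Series([np.nan, 19000.0, np.nan, np.nan, 201.0])

# Forward fill copies the last valid value into the gaps that follow it.
filled = s.ffill()
print(filled.tolist())  # [nan, 19000.0, 19000.0, 19000.0, 201.0]

# Possible alternative: fill every gap with the column median.
median_filled = s.fillna(s.median())
print(median_filled.isna().sum())  # 0
```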

Now we will change the data type of the Reviews, Price and Installs columns to float64. The Price and Installs columns contain characters such as '$', ',' and '+', which have to be removed first.

In [None]:
df["Reviews"]=df["Reviews"].astype('float')

In [None]:
df.Price = df.Price.apply(lambda x: x.replace('$',''))
df.Price=df.Price.astype('float')


In [None]:
df.Installs = df.Installs.apply(lambda x: x.replace(',','').replace('+',''))
df.Installs=df.Installs.astype('float')
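
The same cleanup can also be written with the pandas string accessor. The sketch below uses a tiny made-up frame rather than the real data; `regex=False` is passed for the dollar sign so that it is treated as a literal character instead of a regular-expression anchor.

```python
import pandas as pd

# Toy rows that imitate the raw string formats of the two columns.
raw = pd.DataFrame({
    "Price": ["0", "$4.99"],
    "Installs": ["10,000+", "1,000,000+"],
})

# Remove the literal '$' and convert to a numeric dtype.
price = pd.to_numeric(raw["Price"].str.replace("$", "", regex=False))

# Remove every ',' and '+' with one character class, then convert to float.
installs = pd.to_numeric(
    raw["Installs"].str.replace("[,+]", "", regex=True)
).astype(float)

print(price.tolist())     # [0.0, 4.99]
print(installs.tolist())  # [10000.0, 1000000.0]
```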


Check the data types of each column

In [None]:
df.dtypes

**OUTLIERS DETECTION AND CORRECTION**

In [None]:
df["Rating"].shape

**The maximum and minimum ratings allowed on the Play Store are 5 and 0 respectively, so a rating higher than 5 is not possible. If any such rows exist, delete them.**

In [None]:
a=df.Rating>5

In [None]:
a.value_counts()

Free apps must not have a price. If any free app has a price above 0, delete that row.

In [None]:
b=(df.Type=='Free')&(df.Price>0)

In [None]:
b.value_counts()

The number of reviews must be less than the number of installs, because a user cannot review an app without installing it. Rows that break this rule are deleted.

In [None]:
c=df.Reviews>df.Installs

In [None]:
c.value_counts()

In [None]:
df=df[df.Reviews<df.Installs].copy()
print(df.shape)
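
One detail of the filter is easy to miss: the check counts rows where Reviews is greater than Installs, while the filter keeps only rows where Reviews is strictly less than Installs, so rows where the two are equal are dropped as well. The toy frame below (made-up numbers) shows the difference between the strict filter and simply negating the check.

```python
import pandas as pd

# Three toy apps: fewer reviews than installs, equal, and more.
toy = pd.DataFrame({
    "Reviews": [5.0, 10.0, 20.0],
    "Installs": [10.0, 10.0, 10.0],
})

# Rows that clearly violate the rule.
invalid = toy["Reviews"] > toy["Installs"]
print(invalid.sum())  # 1

# Strict filter (as used above) versus keeping everything not flagged.
kept_strict = toy[toy["Reviews"] < toy["Installs"]]
kept_inclusive = toy[~invalid]
print(len(kept_strict), len(kept_inclusive))  # 1 2
```

Which of the two is appropriate depends on whether an app with exactly as many reviews as installs is considered plausible.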

*A price of $200 or more for a Play Store app is suspicious, so we keep only the apps priced below $200 and drop the rest.*

In [None]:
df=df[df.Price<200].copy()
print(df.shape)

Very few apps have a very high number of reviews. These are all-star apps that don't help with the analysis and, in fact, will skew it. Drop the records that have more than 2 million reviews.

In [None]:
d = df.Reviews>2000000

In [None]:
d.value_counts()

In [None]:
df=df[df.Reviews<=2000000].copy()
print(df.shape)

Let's check for the outliers that are still left.

In [None]:
df.boxplot()

We can reduce the effect of the remaining outliers by applying a logarithmic transformation to the affected columns.

In [None]:
df.Installs=df.Installs.apply(func=np.log1p)
df.Reviews=df.Reviews.apply(func=np.log1p)

df.hist(column=['Installs','Reviews'])
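
A short sketch of what `np.log1p` does to a heavily skewed column, using made-up install counts. `log1p(x)` computes log(1 + x), so a count of 0 maps to 0 instead of minus infinity, and `np.expm1` reverses the transformation. Note that the transformation compresses large values rather than removing them.

```python
import numpy as np
import pandas as pd

# Toy install counts spanning several orders of magnitude.
installs = pd.Series([0.0, 10.0, 1_000.0, 1_000_000.0])

# log1p(x) = log(1 + x): safe at zero and compresses the long right tail.
logged = np.log1p(installs)
print(logged.round(2).tolist())

# expm1 is the inverse, so the original scale can be recovered.
restored = np.expm1(logged)
print(np.allclose(restored, installs))  # True
```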

**Bivariate analysis:**

In [None]:
plt.figure(figsize=(25,8))
sns.scatterplot(x=df.Price,y=df.Rating,hue=df.Rating)  # recent seaborn versions require x= and y= keywords
plt.show()

There is no clear pattern showing that paid apps get better ratings, but apps priced at $9 or more are rated at least about 2.5.

In [None]:
plt.figure(figsize=(25,8))
sns.scatterplot(x=df.Size,y=df.Rating,hue=df.Rating)
plt.show()

This scatterplot also shows that a larger size does not guarantee a high rating, although heavy apps are mostly rated better than lighter apps.

In [None]:
plt.figure(figsize=(25,8))
sns.scatterplot(x=df.Reviews,y=df.Rating,hue=df.Rating)
plt.show()

There is no particular pattern between reviews and rating, but beyond a certain number of reviews the rating appears to become independent of popularity.
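
Impressions from scatterplots can be backed up with a number. One option, sketched here on made-up data, is a rank correlation: ranking every column and then taking the ordinary Pearson correlation of the ranks gives Spearman's coefficient, which only looks at monotonic trends. The values in `toy` are invented for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Invented values standing in for a few numeric columns of the dataset.
toy = pd.DataFrame({
    "Price": [0.0, 0.5, 1.99, 4.99, 9.99],
    "Reviews": [3.2, 8.1, 5.5, 6.0, 4.4],
    "Rating": [4.1, 4.5, 3.9, 4.3, 4.0],
})

# Pearson correlation of the ranks is Spearman's rank correlation.
rank_corr = toy.rank().corr()

# Correlation of every column with Rating; values near 0 mean no monotonic trend.
print(rank_corr["Rating"].round(2))
```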

In [None]:
plt.figure(figsize=(25,8))
sns.boxplot(x=df["Content Rating"],y=df["Rating"])
plt.show()

The highest typical rating is achieved by the "Adults only 18+" apps. Note that the line inside each box marks the median, not the mean.

In [None]:
plt.figure(figsize=(25,8))
sns.boxplot(x=df.Category,y=df.Rating)
plt.xticks(fontsize=18,rotation='vertical')
plt.show()

Apps in the health & fitness and books & reference categories seem to have the highest median ratings.
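
The medians read off a boxplot can also be computed directly, together with the number of apps behind each box. The sketch below uses a made-up frame with hypothetical categories A, B and C. It shows why the count matters: a group with a single app can top the ranking.

```python
import pandas as pd

# Hypothetical categories and ratings, for illustration only.
toy = pd.DataFrame({
    "Category": ["A", "A", "A", "B", "B", "C"],
    "Rating": [4.0, 4.2, 4.4, 3.9, 4.1, 4.8],
})

# Median rating and group size per category, best median first.
summary = (
    toy.groupby("Category")["Rating"]
    .agg(["median", "count"])
    .sort_values("median", ascending=False)
)
print(summary)
```

The same call on `df` with "Category" or "Content Rating" as the grouping column would give the figures behind the two boxplots.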

**APPLYING REGRESSION MODEL TO PREDICT THE RATING**

Drop the columns which are not necessary for the analysis.

In [None]:
df.drop(["App","Last Updated","Current Ver","Android Ver"],inplace=True,axis=1)

Convert the columns with categorical values into dummy variables and drop the first level of each one to avoid the **dummy variable trap**.

In [None]:
df=pd.get_dummies(df,drop_first=True,dtype=float)  # numeric dummies so that statsmodels OLS accepts them
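
What `drop_first=True` does is easiest to see on a tiny made-up frame. For a column with the levels "Free" and "Paid", only the indicator for "Paid" is kept. "Free" becomes the baseline, which is implied whenever the indicator is 0.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "Rating": [4.1, 3.8, 4.5],
    "Type": ["Free", "Paid", "Free"],
})

# One indicator per level except the first (alphabetically, "Free").
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)
print(encoded.columns.tolist())      # ['Rating', 'Type_Paid']
print(encoded["Type_Paid"].tolist())  # [0.0, 1.0, 0.0]
```

Numeric columns are left untouched and stay in front of the new indicator columns.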

Define the dependent and independent variables.

In [None]:
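x=df.drop(columns=["Rating"])  # independent variables: every column except Rating
y=df[["Rating"]]               # dependent variable, kept as a one-column DataFrame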

Do the train test split

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split( x, y, test_size=0.30, random_state=1)

Apply linear regression to the training set 

In [None]:
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
model=regressor.fit(x_train, y_train)

Get the predictions for the test set.

In [None]:
y_pred=regressor.predict(x_test)

In [None]:
from statsmodels.api import OLS
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as ms

Find the adjusted R2, the R2 score and the RMSE.

In [None]:
# Note: statsmodels OLS does not add an intercept term on its own
summ=OLS(y_train,x_train).fit()
summ.summary()


In [None]:
print('R2_Score=',r2_score(y_test,y_pred))
print('Root_Mean_Squared_Error(RMSE)=',np.sqrt(ms(y_test,y_pred)))
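
The R2 from the OLS summary and the R2 score printed here are not the same quantity. `OLS` does not add an intercept on its own, and for a model without a constant statsmodels reports an uncentered R2, which compares the residuals with the raw sum of squares of `y` instead of the spread around its mean. The sketch below uses simulated ratings and a predictor that only knows the mean. It is not a fit on the real data, and `adjusted_r2` is a hypothetical helper name.

```python
import numpy as np

# Simulated ratings centred near 4.2; not the real dataset.
rng = np.random.default_rng(0)
y = 4.2 + 0.5 * rng.standard_normal(1000)

# A "model" that always predicts the mean explains none of the variation.
pred = np.full_like(y, y.mean())
ss_res = np.sum((y - pred) ** 2)

# Centered R2 (what r2_score computes) versus uncentered R2.
r2_centered = 1 - ss_res / np.sum((y - y.mean()) ** 2)
r2_uncentered = 1 - ss_res / np.sum(y ** 2)
print(round(r2_centered, 3), round(r2_uncentered, 3))


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 for a model with an intercept and n_features predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)
```

The centered value is 0 while the uncentered one is close to 1, simply because the ratings sit far from zero. `adjusted_r2(r2_score(y_test, y_pred), len(y_test), x_test.shape[1])` would give an adjusted value on the same footing as the test-set R2.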

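**THE ADJUSTED R2 REPORTED BY THE OLS SUMMARY IS AROUND 0.987. THIS MODEL WAS FIT ON THE TRAINING SET WITHOUT AN INTERCEPT, SO STATSMODELS REPORTS AN UNCENTERED R2 THAT IS NOT DIRECTLY COMPARABLE TO THE TEST-SET R2 SCORE PRINTED ABOVE. IN GENERAL, THE CLOSER R2 IS TO 1, THE BETTER THE PREDICTED VALUES MATCH THE OBSERVED VALUES.**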
