# Google Play Store App Rating Prediction using Linear Regression

**Problem Statement:**

Google needs to show predicted ratings for new apps that are uploaded to the Google Play Store.

# Steps to create ML model

1. Import the required libraries
2. Import the dataset (connect the dataset to Colab)
3. Preprocess the data
4. Convert categorical data into numerical data
5. Extract the dependent variable (target) Y and the independent variables (features) X
6. Split the data into training and testing sets
7. Apply the algorithm: Linear Regression
8. Test the algorithm

In [1]:
# Import required libraries
import pandas as pd # Data preprocessing
import numpy as np # Array Management
import matplotlib.pyplot as plt # Data Visualization

In [2]:
# import the dataset
data=pd.read_csv('/content/drive/MyDrive/Python Datasets/googleplaystore.csv')

In [3]:
# Data Understanding
data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,7-Jan-18,1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,15-Jan-18,2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,1-Aug-18,1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,8-Jun-18,Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,20-Jun-18,1.1,4.4 and up


In [4]:
data.shape # Number of rows and columns in dataset

(10841, 13)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


# Pre-Processing

Pandas library: data cleaning

1. Remove Null Values
2. Correct the data types
3. Clean wrong entries
4. Remove Duplicates
5. Remove unwanted columns
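The cells below cover null values, data types and unwanted columns, but no cell removes duplicates. A minimal sketch of that step on a made-up toy frame (the values are illustrative, not taken from the dataset); on the real data the equivalent call would presumably be `data.drop_duplicates(inplace=True)`:

```python
import pandas as pd

# Toy frame standing in for the Play Store data (values are made up)
toy = pd.DataFrame({
    "App": ["A", "A", "B", "C", "C"],
    "Rating": [4.1, 4.1, 3.9, 4.5, 4.4],
    "Reviews": [100, 100, 50, 10, 12],
})

# Count rows that repeat an earlier row in every column
exact_dupes = toy.duplicated().sum()

# Drop exact repeats, keeping the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(exact_dupes)
print(deduped.shape)
```

Only rows that match in every column are dropped here; the two "C" rows differ in `Rating` and `Reviews`, so both are kept.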

In [7]:
data.isnull().sum() # Count the null values in each column

Unnamed: 0,0
App,0
Category,0
Rating,1474
Reviews,0
Size,0
Installs,0
Type,1
Price,0
Content Rating,1
Genres,0


In [8]:
data.dropna(inplace=True) # Remove null values

In [9]:
data.isnull().sum()

Unnamed: 0,0
App,0
Category,0
Rating,0
Reviews,0
Size,0
Installs,0
Type,0
Price,0
Content Rating,0
Genres,0


In [11]:
data.shape

(9360, 13)

In [10]:
# Make Corrections in Data-type
data.dtypes

Unnamed: 0,0
App,object
Category,object
Rating,float64
Reviews,object
Size,object
Installs,object
Type,object
Price,object
Content Rating,object
Genres,object


In [12]:
data['Reviews']=data['Reviews'].astype(int) # Convert Reviews from object to int

In [13]:
data.dtypes

Unnamed: 0,0
App,object
Category,object
Rating,float64
Reviews,int64
Size,object
Installs,object
Type,object
Price,object
Content Rating,object
Genres,object


In [14]:
data['Installs']=data['Installs'].astype(int)

ValueError: invalid literal for int() with base 10: '10,000+'

In [15]:
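# Strip the thousands separators and the trailing '+'.
# regex=False treats '+' as a literal character on every pandas version.
data['Installs']=data['Installs'].str.replace(',','',regex=False)
data['Installs']=data['Installs'].str.replace('+','',regex=False)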

In [16]:
data['Installs']=data['Installs'].astype(int)

In [17]:
data.dtypes

Unnamed: 0,0
App,object
Category,object
Rating,float64
Reviews,int64
Size,object
Installs,int64
Type,object
Price,object
Content Rating,object
Genres,object


In [18]:
data['Price']=data['Price'].astype(float)

ValueError: could not convert string to float: '$4.99 '

In [19]:
# Strip the currency sign; regex=False treats '$' as a literal character
data['Price']=data['Price'].str.replace('$','',regex=False)

In [20]:
data['Price']=data['Price'].astype(float)
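The `Installs` and `Price` conversions follow the same pattern: strip the decoration characters, then cast. A minimal sketch of one reusable helper, assuming it is acceptable to turn unparseable entries into `NaN` instead of raising. The name `to_number` and the sample values are hypothetical, not part of the notebook:

```python
import pandas as pd

def to_number(col: pd.Series) -> pd.Series:
    """Strip ',', '+' and '$' literally, then convert to numbers (NaN if unparseable)."""
    cleaned = col.astype(str)
    for ch in [",", "+", "$"]:
        # regex=False so that '+' and '$' are not read as regex metacharacters
        cleaned = cleaned.str.replace(ch, "", regex=False)
    # errors="coerce" yields NaN for anything that is still not numeric
    return pd.to_numeric(cleaned.str.strip(), errors="coerce")

# Made-up samples, including one bad entry in each column
installs = to_number(pd.Series(["10,000+", "500,000+", "Free"]))
price = to_number(pd.Series(["0", "$4.99 ", "Everyone"]))

print(installs)
print(price)
```

Coercing to `NaN` makes bad entries visible in a later `isnull().sum()` check, at the cost of returning floats where `astype(int)` would return integers.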

In [21]:
data['Size']=data['Size'].astype(float)

ValueError: could not convert string to float: '19M'

In [22]:
data['Size'].unique()

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '5.5M', '17M', '39M', '31M',
       '4.2M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M', '11M',
       '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M', '26M',
       '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', '3.6M', '5.7M',
       '8.6M', '2.4M', '27M', '2.7M', '2.5M', '7.0M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '22M', '6.4M', '3.2M', '8.2M', '4.9M', '9.5M', '5.0M',
       '5.9M', '13M', '73M', '6.8M', '3.5M', '4.0M', '2.3M', '2.1M',
       '42M', '9.1M', '55M', '23k', '7.3M', '6.5M', '1.5M', '7.5M', '51M',
       '41M', '48M', '8.5M', '46M', '8.3M', '4.3M', '4.7M', '3.3M', '40M',
       '7.8M', '8.8M', '6.6M', '5.1M', '61M', '66M', '79k', '8.4M',
       '3.7M', '118k', '44M', '695k', '1.6M', '6.2M', '53M', '1.4M',
      

In [29]:
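def change_size(size):
  """Convert a size string such as '19M' or '201k' to megabytes."""
  if 'M' in size:
    # Megabytes: drop the suffix
    return float(size.replace('M',''))
  elif 'k' in size:
    # Kilobytes: drop the suffix and scale to MB (1 MB taken as 1000 kB)
    return float(size.replace('k',''))/1000
  else:
    # 'Varies with device' has no numeric size; mark it as missing
    return None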

In [30]:
data['Size']=data['Size'].apply(change_size)
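The same conversion can also be written without a row-by-row `apply`. A hedged sketch on a few sample strings, assuming every valid entry is a number followed by exactly `M` or `k`; the variable names are illustrative:

```python
import pandas as pd

sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])

# Split each entry into a number and a unit suffix; non-matching rows become NaN
parts = sizes.str.extract(r"^([\d.]+)([Mk])$")
number = pd.to_numeric(parts[0])

# Divisor per unit, using the same 1 MB = 1000 kB convention as change_size
divisor = parts[1].map({"M": 1.0, "k": 1000.0})

size_mb = number / divisor
print(size_mb)
```

Entries such as `'Varies with device'` do not match the pattern and come out as `NaN`, which mirrors the `None` returned by `change_size`.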

In [31]:
data.dtypes

Unnamed: 0,0
App,object
Category,object
Rating,float64
Reviews,int64
Size,float64
Installs,int64
Type,object
Price,float64
Content Rating,object
Genres,object


In [32]:
data.isnull().sum()

Unnamed: 0,0
App,0
Category,0
Rating,0
Reviews,0
Size,1637
Installs,0
Type,0
Price,0
Content Rating,0
Genres,0


In [33]:
# Forward-fill the missing sizes. Assigning the result back avoids the pandas
# FutureWarning raised by chained fillna(..., inplace=True), and ffill()
# replaces the deprecated method='ffill' argument.
data['Size']=data['Size'].ffill()
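Forward fill copies the previous row's value, so the result depends on row order, and a missing value in the very first row stays missing. A small sketch on made-up numbers that also shows median fill as an order-independent alternative; median fill is not the notebook's method and is included only for comparison:

```python
import numpy as np
import pandas as pd

size = pd.Series([np.nan, 19.0, np.nan, 8.7, np.nan])

# Forward fill copies the previous valid value; a leading NaN has nothing to copy
ffilled = size.ffill()

# Median fill does not depend on row order (alternative, not the notebook's choice)
median_filled = size.fillna(size.median())

print(ffilled.isna().sum())
print(median_filled.isna().sum())
```

Running `isnull().sum()` again after the fill is a cheap way to confirm that no missing sizes remain.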


In [35]:
inp1=data.copy() # Work on a copy so the cleaned data stays unchanged

In [38]:
inp1.drop(['App','Last Updated','Current Ver','Android Ver'],axis=1,inplace=True)

In [39]:
inp1.head()

Unnamed: 0,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
0,ART_AND_DESIGN,4.1,159,19.0,10000,Free,0.0,Everyone,Art & Design
1,ART_AND_DESIGN,3.9,967,14.0,500000,Free,0.0,Everyone,Art & Design;Pretend Play
2,ART_AND_DESIGN,4.7,87510,8.7,5000000,Free,0.0,Everyone,Art & Design
3,ART_AND_DESIGN,4.5,215644,25.0,50000000,Free,0.0,Teen,Art & Design
4,ART_AND_DESIGN,4.3,967,2.8,100000,Free,0.0,Everyone,Art & Design;Creativity


In [40]:
inp1.shape

(9360, 9)

# Convert Categorical data into numerical data

In [41]:
inp1=pd.get_dummies(inp1)

In [42]:
inp1.shape

(9360, 161)
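The jump from 9 to 161 columns comes from one-hot encoding: every distinct value of each text column becomes its own indicator column, while numeric columns pass through unchanged. A minimal sketch on a made-up three-row frame; `drop_first=True` is shown as an optional variation that the notebook does not use:

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (values are made up)
toy = pd.DataFrame({
    "Type": ["Free", "Paid", "Free"],
    "Content Rating": ["Everyone", "Teen", "Everyone"],
    "Price": [0.0, 4.99, 0.0],
})

# One indicator column per category value; 'Price' is left as-is
encoded = pd.get_dummies(toy)

# Optional: drop one level per category to avoid redundant columns
reduced = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
print(encoded.shape, reduced.shape)
```

With an intercept in the model, the full set of indicators for a category is linearly dependent, which is the usual reason for considering `drop_first=True` with linear regression.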

# Extract Independent and dependent variables

In [44]:
y=inp1.pop('Rating') # Dependent variable (target); pop also removes it from inp1
x=inp1 # Independent variables (features)

In [45]:
# Separate the dataframe into X_train, X_test, Y_train, Y_test
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)

# Apply Linear Regression

In [46]:
from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(x_train,y_train)
y_pred=model.predict(x_test)

# Test your Linear Regression Model

In [47]:
from sklearn.metrics import mean_squared_error,r2_score
print(mean_squared_error(y_test,y_pred))
print(r2_score(y_test,y_pred))

0.249903383490216
0.030521419467818278
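The first number is the mean squared error; its square root, about 0.50, is the typical error in rating points. The second is R², so the model explains roughly 3% of the variance in ratings. A useful reference point is a baseline that always predicts the mean training rating. The sketch below shows how such a baseline could be scored, using synthetic ratings in place of the notebook's `y_train` and `y_test`; all `*_demo` names are hypothetical:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic ratings standing in for y_train / y_test (not the notebook's data)
rng = np.random.default_rng(42)
y_train_demo = rng.normal(4.2, 0.5, size=200)
y_test_demo = rng.normal(4.2, 0.5, size=100)

# DummyRegressor ignores the features, so placeholder columns are enough
X_train_demo = np.zeros((200, 1))
X_test_demo = np.zeros((100, 1))

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
baseline_pred = baseline.predict(X_test_demo)

baseline_mse = mean_squared_error(y_test_demo, baseline_pred)
baseline_rmse = np.sqrt(baseline_mse)
baseline_r2 = r2_score(y_test_demo, baseline_pred)

print(baseline_mse, baseline_rmse, baseline_r2)
```

A constant prediction can never score above zero on R², so an R² near 0.03 indicates a model only slightly better than predicting the average rating for every app.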
