# Google Play Store Predictions using Linear Regression

***Problem Statement:*** Whenever a new app is uploaded to the Google Play Store, predict the rating it is likely to receive.

# Steps to create ML model
1. Import Library
2. Import the dataset
3. Data pre-processing
   * Data cleaning
   * Understanding the data
4. Convert the categorical data into numerical data
5. Extract the dependent variable (target) and independent variables (features)
6. Split the data into train and test sets
7. Apply the algorithm (Linear Regression)
8. Test the algorithm





# STEP-1: Import Library
1. Pandas: data cleaning and manipulation
2. NumPy: array handling
3. Matplotlib: visualization
4. Seaborn: advanced statistical visualization

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# STEP-2: Import the data
1. Use the Pandas library to import the dataset

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

In [4]:
data= pd.read_csv('/content/drive/MyDrive/googleplaystore.csv')


Understand Your dataset
1. head()
2. shape
3. info()
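
Of the three inspection tools listed, `shape` is an attribute rather than a method, so it is read without parentheses. A minimal sketch on a hypothetical three-row `sample` frame (not the real dataset) that mimics a few of its columns:

```python
import pandas as pd

# Hypothetical sample that mimics a few columns of the Play Store data.
sample = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, None, 4.7],
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
})

# shape returns a (rows, columns) tuple.
n_rows, n_cols = sample.shape

# Count missing ratings the same way the notebook does for every column.
missing_ratings = int(sample["Rating"].isnull().sum())

print(n_rows, n_cols, missing_ratings)
sample.info()  # prints column names, non-null counts and dtypes
```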

In [None]:
data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,7-Jan-18,1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,15-Jan-18,2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,1-Aug-18,1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,8-Jun-18,Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,20-Jun-18,1.1,4.4 and up


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [None]:
data.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

# Pre-Processing
Pandas library: data cleaning

1. Clean empty cells (remove null values)
2. Clean wrong formats (data types)
3. Clean wrong data (mistakes)
4. Remove duplicates
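
The cells that follow cover null values and data types. For the last item in the list, removing duplicates, a minimal sketch might look like the following. It uses a hypothetical `toy` frame, and treating fully identical rows as duplicates is an assumption; the real dataset may need a different rule, such as comparing on the `App` column only.

```python
import pandas as pd

# Hypothetical frame with one fully duplicated row.
toy = pd.DataFrame({
    "App": ["Notes", "Notes", "Maps"],
    "Rating": [4.2, 4.2, 4.5],
})

# duplicated() flags every repeat of an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```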

In [5]:
data.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

In [7]:
data.dropna(inplace=True)

In [8]:
data.isnull().sum()

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64

**Clean Data Type mistakes**

In [10]:
data['Reviews']= data['Reviews'].astype(int)

In [11]:
data.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [13]:
data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,7-Jan-18,1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,15-Jan-18,2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,1-Aug-18,1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,8-Jun-18,Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,20-Jun-18,1.1,4.4 and up


In [12]:
data['Installs']= data['Installs'].astype(int)

ValueError: invalid literal for int() with base 10: '10,000+'
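
The error means at least one value cannot be parsed as an integer. One way to list the offending values is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN. The sketch below runs on a hypothetical `installs` series rather than the real column:

```python
import pandas as pd

# Hypothetical values: two carry formatting characters, one is already clean.
installs = pd.Series(["10,000+", "500,000+", "1000"])

# Unparseable strings become NaN instead of raising an error.
as_number = pd.to_numeric(installs, errors="coerce")

# Select the original strings that failed to parse.
offending = installs[as_number.isna()].tolist()

print(offending)
```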

In [14]:
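# Strip the '+' suffix and the thousands separators; regex=False treats both as literal characters
data['Installs']=data['Installs'].str.replace('+','',regex=False)
data['Installs']=data['Installs'].str.replace(',','',regex=False)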

In [15]:
data['Installs']= data['Installs'].astype(int)

In [16]:
data.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs            int64
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [17]:
data['Price']= data['Price'].astype(float)

ValueError: could not convert string to float: '$4.99 '

In [18]:
# '$' is a regex anchor, so match it literally before converting to float
data['Price']=data['Price'].str.replace('$','',regex=False)
data['Price']= data['Price'].astype(float)

In [19]:
data.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs            int64
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [20]:
data['Size']= data['Size'].astype(float)

ValueError: could not convert string to float: '19M'

In [24]:
data['Size']


0                       19M
1                       14M
2                      8.7M
3                       25M
4                      2.8M
                ...        
10834                  2.6M
10836                   53M
10837                  3.6M
10839    Varies with device
10840                   19M
Name: Size, Length: 9360, dtype: object

In [25]:
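def change_size(size):
  """Convert a size string such as '19M' or '19k' to kilobytes (float)."""
  if size.endswith('M'):
    # Megabytes: drop the suffix and scale to kilobytes, e.g. '19M' -> 19000.0
    return float(size[:-1])*1000
  elif size.endswith('k'):
    # Kilobytes: drop the suffix, e.g. '19k' -> 19.0
    return float(size[:-1])
  else:
    # 'Varies with device' has no numeric size, so leave it missing
    return None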

In [27]:
data['Size']= data['Size'].apply(change_size)
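
The same conversion can also be written without a row-by-row `apply`. The sketch below is an alternative, not the notebook's method: it uses `str.extract` on a hypothetical `sizes` series and assumes every numeric size ends in `M` or `k`.

```python
import pandas as pd

# Hypothetical values covering the three cases the notebook handles.
sizes = pd.Series(["19M", "8.7M", "512k", "Varies with device"])

# Capture the numeric part and the unit; rows that do not match become NaN.
parts = sizes.str.extract(r"^([\d.]+)([Mk])$")

# Scale megabytes to kilobytes and leave kilobytes unchanged.
multiplier = parts[1].map({"M": 1000.0, "k": 1.0})
size_kb = parts[0].astype(float) * multiplier

print(size_kb)
```

Rows such as `Varies with device` end up as NaN, which matches the `None` returned by `change_size`.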

In [28]:
data.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size              float64
Installs            int64
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [29]:
data.isnull().sum()

App                  0
Category             0
Rating               0
Reviews              0
Size              1637
Installs             0
Type                 0
Price                0
Content Rating       0
Genres               0
Last Updated         0
Current Ver          0
Android Ver          0
dtype: int64

In [30]:
# Forward-fill the sizes left missing by 'Varies with device'
data['Size']= data['Size'].ffill()
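
Forward fill copies the size of whichever app sits in the previous row, so the result depends on row order. A minimal sketch of one alternative, median imputation, on a hypothetical `size_kb` series; whether it suits this dataset better is an assumption that would need checking:

```python
import pandas as pd

# Hypothetical sizes in kilobytes with two missing entries.
size_kb = pd.Series([19000.0, None, 8700.0, None, 2800.0])

# Forward fill: each gap takes the value from the row above it.
filled_ffill = size_kb.ffill()

# Median fill: each gap takes the median of the observed values.
filled_median = size_kb.fillna(size_kb.median())

print(filled_ffill.tolist())
print(filled_median.tolist())
```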

In [31]:
inp1=data.copy()

In [32]:
inp1.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000.0,10000,Free,0.0,Everyone,Art & Design,7-Jan-18,1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000.0,500000,Free,0.0,Everyone,Art & Design;Pretend Play,15-Jan-18,2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8700.0,5000000,Free,0.0,Everyone,Art & Design,1-Aug-18,1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000.0,50000000,Free,0.0,Teen,Art & Design,8-Jun-18,Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2800.0,100000,Free,0.0,Everyone,Art & Design;Creativity,20-Jun-18,1.1,4.4 and up


In [34]:
inp1.drop(['App','Last Updated','Current Ver','Android Ver'],axis=1,inplace=True)

In [35]:
inp1.head()

Unnamed: 0,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
0,ART_AND_DESIGN,4.1,159,19000.0,10000,Free,0.0,Everyone,Art & Design
1,ART_AND_DESIGN,3.9,967,14000.0,500000,Free,0.0,Everyone,Art & Design;Pretend Play
2,ART_AND_DESIGN,4.7,87510,8700.0,5000000,Free,0.0,Everyone,Art & Design
3,ART_AND_DESIGN,4.5,215644,25000.0,50000000,Free,0.0,Teen,Art & Design
4,ART_AND_DESIGN,4.3,967,2800.0,100000,Free,0.0,Everyone,Art & Design;Creativity


In [36]:
inp1.shape

(9360, 9)

# Convert Categorical data into numerical data

In [37]:
inp1=pd.get_dummies(inp1)

In [38]:
inp1.shape

(9360, 161)
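
The jump from 9 to 161 columns comes from one-hot encoding: `pd.get_dummies` replaces each text column with one indicator column per distinct value and leaves numeric columns unchanged. A small sketch on a hypothetical `toy` frame shows the effect:

```python
import pandas as pd

# Hypothetical frame: two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Type": ["Free", "Paid", "Free"],
    "Content Rating": ["Everyone", "Teen", "Everyone"],
    "Price": [0.0, 4.99, 0.0],
})

# Each categorical column expands into one column per distinct value.
encoded = pd.get_dummies(toy)

print(toy.shape, "->", encoded.shape)
print(list(encoded.columns))
```

Here two categorical columns with two values each become four indicator columns, plus the untouched `Price` column.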

In [42]:
inp1.head()

Unnamed: 0,Rating,Reviews,Size,Installs,Price,Category_ART_AND_DESIGN,Category_AUTO_AND_VEHICLES,Category_BEAUTY,Category_BOOKS_AND_REFERENCE,Category_BUSINESS,...,Genres_Tools,Genres_Tools;Education,Genres_Travel & Local,Genres_Travel & Local;Action & Adventure,Genres_Trivia,Genres_Video Players & Editors,Genres_Video Players & Editors;Creativity,Genres_Video Players & Editors;Music & Video,Genres_Weather,Genres_Word
0,4.1,159,19000.0,10000,0.0,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,3.9,967,14000.0,500000,0.0,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,4.7,87510,8700.0,5000000,0.0,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,4.5,215644,25000.0,50000000,0.0,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,4.3,967,2800.0,100000,0.0,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


# Extract Independent and dependent variables

In [43]:
y=inp1.pop('Rating')
x=inp1

In [44]:
# Separate the dataframe into X_train, X_test, y_train and y_test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
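
With `test_size=0.3`, roughly 30% of the rows are held out for testing, and a fixed `random_state` makes the split repeatable. A minimal sketch on hypothetical arrays (10 rows, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 30% of 10 rows gives 3 test rows and 7 training rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_tr.shape, X_te.shape)
```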

# Apply Linear Regression

In [45]:
from sklearn.linear_model import LinearRegression
linear_reg= LinearRegression()

In [46]:
linear_reg.fit(X_train,y_train) # Fit the Linear Regression model on the training data

In [47]:
y_pred=linear_reg.predict(X_test)

In [49]:
from sklearn.metrics import mean_squared_error,r2_score
print('MSE: ',mean_squared_error(y_test,y_pred))
print('R2 Score: ',r2_score(y_test,y_pred))

MSE:  0.24990338348778796
R2 Score:  0.030521419477237854
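
An R2 score of about 0.03 means the model explains roughly 3% of the variance in ratings, only slightly better than always predicting the mean rating. A mean-only baseline gives the MSE a reference point. The sketch below uses synthetic ratings and scikit-learn's `DummyRegressor`; the numbers it prints are not results for this dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic ratings, purely for illustration (not the Play Store data).
rng = np.random.default_rng(0)
y_train = rng.normal(4.2, 0.5, size=200)
y_test = rng.normal(4.2, 0.5, size=100)

# The baseline ignores the features, so placeholder columns are enough.
X_train = np.zeros((200, 1))
X_test = np.zeros((100, 1))

# Always predict the mean of the training ratings.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_mse = mean_squared_error(y_test, baseline_pred)
baseline_r2 = r2_score(y_test, baseline_pred)

print("Baseline MSE:", baseline_mse)
print("Baseline R2:", baseline_r2)
```

A constant prediction cannot score above 0 on R2, so a model is only useful to the extent that its R2 rises clearly above that level.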
