# **Project 7: Google App Rating Predictor**

***NumPy*** : *Fundamental package for numerical computing in Python.*

***Seaborn*** : *Statistical data visualization based on Matplotlib.*

***Matplotlib*** : *Comprehensive library for creating static, animated, and interactive visualizations in Python.*

***Pandas*** : *Powerful data analysis and manipulation library for Python.*

***Warnings*** : *Standard-library module for controlling warning messages, which flag potential issues or non-critical errors during code execution.*


# **Data Loading**

**Importing the libraries needed for data loading, data visualization, and data cleaning.**

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
%matplotlib inline

**The code warnings.simplefilter("ignore") in Python sets the warning filter to ignore all warnings that might occur during program execution. This means warnings will not be displayed or logged.**

In [2]:
warnings.simplefilter("ignore")
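**Silencing every warning can also hide deprecation notices that matter. As a hedged alternative, the sketch below suppresses a single warning category inside a limited scope. The function "noisy" is a hypothetical stand-in for any library call that emits a warning.**

```python
import warnings


def noisy():
    # Hypothetical stand-in for a library call that emits a warning.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Suppress only DeprecationWarning, and only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    result = noisy()

print(result)
```

**Outside the "with" block the previous warning filters are restored, so other warnings stay visible.**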

**This code reads a CSV file containing Google Play Store data into a Pandas DataFrame named "df" and displays the first 5 rows of the DataFrame.**

In [3]:
df = pd.read_csv("/content/googleplaystore.csv")
df.head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,07-Jan-18,1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,15-Jan-18,2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,01-Aug-18,1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,08-Jun-18,Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,20-Jun-18,1.1,4.4 and up


**This line of code retrieves the dimensions (number of rows and columns) of the DataFrame "df".**

In [4]:
df.shape

(10841, 13)

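**This line generates descriptive statistics for the numerical columns in the DataFrame "df". Only "Rating" is numeric at this stage. Note the maximum of 19.0, which lies outside the 1 to 5 rating scale and suggests at least one malformed row.**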

In [5]:
df.describe()

Unnamed: 0,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0


# **Data Cleaning**

**This line of code computes the sum of missing values (NaN) for each column in the DataFrame "df".**

In [6]:
df.isnull().sum()

Unnamed: 0,0
App,0
Category,0
Rating,1474
Reviews,0
Size,0
Installs,0
Type,1
Price,0
Content Rating,1
Genres,0


**This code creates a "SimpleImputer" object that replaces missing values with the mean of the column. It then applies this imputer to the "Rating" column of the DataFrame "df", filling in any missing values with the column's mean.**

In [7]:
from sklearn.impute import SimpleImputer

Impute = SimpleImputer(strategy = 'mean')

df["Rating"] = Impute.fit_transform(df[["Rating"]])
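**For intuition, mean imputation can also be expressed with pandas alone. This is a minimal sketch on made-up toy data (the values are illustrative, not taken from the dataset) and is not a replacement for the "SimpleImputer" step.**

```python
import numpy as np
import pandas as pd

# Toy ratings with one missing value (illustrative data only).
toy = pd.DataFrame({"Rating": [4.0, np.nan, 5.0]})

# Replace NaN with the mean of the observed values, as strategy="mean" does.
toy["Rating"] = toy["Rating"].fillna(toy["Rating"].mean())

print(toy["Rating"].tolist())  # [4.0, 4.5, 5.0]
```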

**"df = df.dropna(how='any')" removes rows with any missing values from the DataFrame. The "df.isnull().sum()" command then counts and returns the number of missing values in each column, which should now be zero.**

In [8]:
df = df.dropna(how = 'any')

df.isnull().sum()

Unnamed: 0,0
App,0
Category,0
Rating,0
Reviews,0
Size,0
Installs,0
Type,0
Price,0
Content Rating,0
Genres,0


**After using "df = df.dropna(how='any')", "df.head(5)" will display the first 5 rows of the DataFrame "df" with any rows containing missing values removed. This shows the initial portion of the cleaned DataFrame.**

In [9]:
df.head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,07-Jan-18,1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,15-Jan-18,2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,01-Aug-18,1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,08-Jun-18,Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,20-Jun-18,1.1,4.4 and up


**After executing "df = df.drop(columns=["Last Updated", "App"])", the DataFrame "df" no longer includes the columns "Last Updated" and "App". Calling "df.head(5)" displays the first 5 rows of the updated DataFrame, reflecting these column removals.**

In [10]:
df = df.drop(columns=["Last Updated", "App"])

df.head(5)

Unnamed: 0,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Current Ver,Android Ver
0,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,1.0.0,4.0.3 and up
1,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,2.0.0,4.0.3 and up
2,ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,1.2.4,4.0.3 and up
3,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,Varies with device,4.2 and up
4,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,1.1,4.4 and up


**This command retrieves all unique values from the "Category" column in the DataFrame "df".**

In [11]:
df["Category"].unique()

array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION'],
      dtype=object)

**This code converts categorical variables in the "Category" column of DataFrame "df" into dummy variables and displays the first 5 rows of the updated DataFrame.**

In [12]:
df = pd.get_dummies(df, columns=["Category"])

df.head(5)

Unnamed: 0,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Current Ver,Android Ver,...,Category_PERSONALIZATION,Category_PHOTOGRAPHY,Category_PRODUCTIVITY,Category_SHOPPING,Category_SOCIAL,Category_SPORTS,Category_TOOLS,Category_TRAVEL_AND_LOCAL,Category_VIDEO_PLAYERS,Category_WEATHER
0,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,1.0.0,4.0.3 and up,...,False,False,False,False,False,False,False,False,False,False
1,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,2.0.0,4.0.3 and up,...,False,False,False,False,False,False,False,False,False,False
2,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,1.2.4,4.0.3 and up,...,False,False,False,False,False,False,False,False,False,False
3,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,Varies with device,4.2 and up,...,False,False,False,False,False,False,False,False,False,False
4,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,1.1,4.4 and up,...,False,False,False,False,False,False,False,False,False,False


**This command returns a list of all column names in the DataFrame "df" after any transformations or operations that may have been applied to it.**

In [13]:
df.columns

Index(['Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price',
       'Content Rating', 'Genres', 'Current Ver', 'Android Ver',
       'Category_ART_AND_DESIGN', 'Category_AUTO_AND_VEHICLES',
       'Category_BEAUTY', 'Category_BOOKS_AND_REFERENCE', 'Category_BUSINESS',
       'Category_COMICS', 'Category_COMMUNICATION', 'Category_DATING',
       'Category_EDUCATION', 'Category_ENTERTAINMENT', 'Category_EVENTS',
       'Category_FAMILY', 'Category_FINANCE', 'Category_FOOD_AND_DRINK',
       'Category_GAME', 'Category_HEALTH_AND_FITNESS',
       'Category_HOUSE_AND_HOME', 'Category_LIBRARIES_AND_DEMO',
       'Category_LIFESTYLE', 'Category_MAPS_AND_NAVIGATION',
       'Category_MEDICAL', 'Category_NEWS_AND_MAGAZINES', 'Category_PARENTING',
       'Category_PERSONALIZATION', 'Category_PHOTOGRAPHY',
       'Category_PRODUCTIVITY', 'Category_SHOPPING', 'Category_SOCIAL',
       'Category_SPORTS', 'Category_TOOLS', 'Category_TRAVEL_AND_LOCAL',
       'Category_VIDEO_PLAYERS', 'Category_W

**This code converts the dummy columns listed in "columns_to_encode", which "pd.get_dummies" created as booleans (True/False), into integers (0/1) in the DataFrame "df".**

In [14]:
columns_to_encode = ['Category_ART_AND_DESIGN', 'Category_AUTO_AND_VEHICLES',
       'Category_BEAUTY', 'Category_BOOKS_AND_REFERENCE', 'Category_BUSINESS',
       'Category_COMICS', 'Category_COMMUNICATION', 'Category_DATING',
       'Category_EDUCATION', 'Category_ENTERTAINMENT', 'Category_EVENTS',
       'Category_FAMILY', 'Category_FINANCE', 'Category_FOOD_AND_DRINK',
       'Category_GAME', 'Category_HEALTH_AND_FITNESS',
       'Category_HOUSE_AND_HOME', 'Category_LIBRARIES_AND_DEMO',
       'Category_LIFESTYLE', 'Category_MAPS_AND_NAVIGATION',
       'Category_MEDICAL', 'Category_NEWS_AND_MAGAZINES', 'Category_PARENTING',
       'Category_PERSONALIZATION', 'Category_PHOTOGRAPHY',
       'Category_PRODUCTIVITY', 'Category_SHOPPING', 'Category_SOCIAL',
       'Category_SPORTS', 'Category_TOOLS', 'Category_TRAVEL_AND_LOCAL',
       'Category_VIDEO_PLAYERS', 'Category_WEATHER']

df[columns_to_encode] = df[columns_to_encode].astype(int)
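**The encoding and the integer conversion could likely be combined into one step by passing "dtype=int" to "pd.get_dummies". The sketch below uses a small made-up DataFrame, not the real dataset.**

```python
import pandas as pd

# Illustrative toy data with a categorical column.
toy = pd.DataFrame({"Category": ["GAME", "TOOLS", "GAME"],
                    "Rating": [4.1, 3.9, 4.5]})

# dtype=int makes the dummy columns 0/1 directly, so no later astype(int) is needed.
encoded = pd.get_dummies(toy, columns=["Category"], dtype=int)

print(encoded.columns.tolist())
print(encoded["Category_GAME"].tolist())
```

**This avoids maintaining a hand-written list of dummy column names.**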

**df.head(5) displays the first 5 rows of the DataFrame df.**

In [15]:
df.head(5)

Unnamed: 0,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Current Ver,Android Ver,...,Category_PERSONALIZATION,Category_PHOTOGRAPHY,Category_PRODUCTIVITY,Category_SHOPPING,Category_SOCIAL,Category_SPORTS,Category_TOOLS,Category_TRAVEL_AND_LOCAL,Category_VIDEO_PLAYERS,Category_WEATHER
0,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,1.0.0,4.0.3 and up,...,0,0,0,0,0,0,0,0,0,0
1,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,2.0.0,4.0.3 and up,...,0,0,0,0,0,0,0,0,0,0
2,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,1.2.4,4.0.3 and up,...,0,0,0,0,0,0,0,0,0,0
3,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,Varies with device,4.2 and up,...,0,0,0,0,0,0,0,0,0,0
4,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,1.1,4.4 and up,...,0,0,0,0,0,0,0,0,0,0


**This code replaces specific substrings in the "Size" column of the DataFrame "df": "Varies with device" becomes "0", and the unit suffixes "M" and "k" are removed. It then displays the first 5 rows of the updated DataFrame. Note that stripping both suffixes puts megabyte and kilobyte values on the same scale, so a value such as "500k" would end up larger than "19M".**

In [16]:
df["Size"] = df["Size"].str.replace("Varies with device", "0")
df["Size"] = df["Size"].str.replace("M", "")
df["Size"] = df["Size"].str.replace("k", "")
df.head(5)

Unnamed: 0,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Current Ver,Android Ver,...,Category_PERSONALIZATION,Category_PHOTOGRAPHY,Category_PRODUCTIVITY,Category_SHOPPING,Category_SOCIAL,Category_SPORTS,Category_TOOLS,Category_TRAVEL_AND_LOCAL,Category_VIDEO_PLAYERS,Category_WEATHER
0,4.1,159,19.0,"10,000+",Free,0,Everyone,Art & Design,1.0.0,4.0.3 and up,...,0,0,0,0,0,0,0,0,0,0
1,3.9,967,14.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,2.0.0,4.0.3 and up,...,0,0,0,0,0,0,0,0,0,0
2,4.7,87510,8.7,"5,000,000+",Free,0,Everyone,Art & Design,1.2.4,4.0.3 and up,...,0,0,0,0,0,0,0,0,0,0
3,4.5,215644,25.0,"50,000,000+",Free,0,Teen,Art & Design,Varies with device,4.2 and up,...,0,0,0,0,0,0,0,0,0,0
4,4.3,967,2.8,"100,000+",Free,0,Everyone,Art & Design;Creativity,1.1,4.4 and up,...,0,0,0,0,0,0,0,0,0,0
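**One way to keep the units consistent is to convert every size to megabytes. The sketch below is an assumption-laden example: the helper "size_to_mb" is hypothetical, it assumes 1 MB = 1024 kB, and it maps unknown sizes to NaN instead of 0.**

```python
import numpy as np
import pandas as pd


def size_to_mb(value):
    """Convert strings like '19M' or '512k' to megabytes; unknown sizes become NaN."""
    if value.endswith("M"):
        return float(value[:-1])
    if value.endswith("k"):
        return float(value[:-1]) / 1024  # assumption: 1 MB = 1024 kB
    return np.nan  # e.g. "Varies with device"


# Illustrative values in the same format as the "Size" column.
sizes = pd.Series(["19M", "512k", "Varies with device"])
size_mb = sizes.map(size_to_mb)

print(size_mb.tolist())  # [19.0, 0.5, nan]
```

**If NaN is used for unknown sizes, those rows would need to be imputed or dropped before model training.**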


**This command retrieves all unique values from the "Installs" column in the DataFrame "df".**

In [17]:
df["Installs"].unique()

array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
       '10+', '1+', '5+', '0+'], dtype=object)

**This code snippet removes specific characters ("+" and ",") from the "Installs" column in the DataFrame "df" and displays the first 5 rows of the updated DataFrame.**

In [18]:
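# regex=False treats "+" and "," as literal characters ("+" is a regex metacharacter)
df["Installs"] = df["Installs"].str.replace("+", "", regex=False)
df["Installs"] = df["Installs"].str.replace(",", "", regex=False)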
df.head(5)

Unnamed: 0,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Current Ver,Android Ver,...,Category_PERSONALIZATION,Category_PHOTOGRAPHY,Category_PRODUCTIVITY,Category_SHOPPING,Category_SOCIAL,Category_SPORTS,Category_TOOLS,Category_TRAVEL_AND_LOCAL,Category_VIDEO_PLAYERS,Category_WEATHER
0,4.1,159,19.0,10000,Free,0,Everyone,Art & Design,1.0.0,4.0.3 and up,...,0,0,0,0,0,0,0,0,0,0
1,3.9,967,14.0,500000,Free,0,Everyone,Art & Design;Pretend Play,2.0.0,4.0.3 and up,...,0,0,0,0,0,0,0,0,0,0
2,4.7,87510,8.7,5000000,Free,0,Everyone,Art & Design,1.2.4,4.0.3 and up,...,0,0,0,0,0,0,0,0,0,0
3,4.5,215644,25.0,50000000,Free,0,Teen,Art & Design,Varies with device,4.2 and up,...,0,0,0,0,0,0,0,0,0,0
4,4.3,967,2.8,100000,Free,0,Everyone,Art & Design;Creativity,1.1,4.4 and up,...,0,0,0,0,0,0,0,0,0,0


**This command retrieves all unique values from the "Type" column in the DataFrame "df".**

In [19]:
df["Type"].unique()

array(['Free', 'Paid'], dtype=object)

**This code converts categorical variables in the "Type" column of DataFrame "df" into dummy variables and displays the first 5 rows of the updated DataFrame.**

In [20]:
df = pd.get_dummies(df, columns=["Type"])
df.head(5)

Unnamed: 0,Rating,Reviews,Size,Installs,Price,Content Rating,Genres,Current Ver,Android Ver,Category_ART_AND_DESIGN,...,Category_PRODUCTIVITY,Category_SHOPPING,Category_SOCIAL,Category_SPORTS,Category_TOOLS,Category_TRAVEL_AND_LOCAL,Category_VIDEO_PLAYERS,Category_WEATHER,Type_Free,Type_Paid
0,4.1,159,19.0,10000,0,Everyone,Art & Design,1.0.0,4.0.3 and up,1,...,0,0,0,0,0,0,0,0,True,False
1,3.9,967,14.0,500000,0,Everyone,Art & Design;Pretend Play,2.0.0,4.0.3 and up,1,...,0,0,0,0,0,0,0,0,True,False
2,4.7,87510,8.7,5000000,0,Everyone,Art & Design,1.2.4,4.0.3 and up,1,...,0,0,0,0,0,0,0,0,True,False
3,4.5,215644,25.0,50000000,0,Teen,Art & Design,Varies with device,4.2 and up,1,...,0,0,0,0,0,0,0,0,True,False
4,4.3,967,2.8,100000,0,Everyone,Art & Design;Creativity,1.1,4.4 and up,1,...,0,0,0,0,0,0,0,0,True,False


**This code converts the boolean dummy variables "Type_Free" and "Type_Paid" (created by "pd.get_dummies") into integers ("0" or "1") and displays the first 5 rows of the DataFrame "df" with these columns updated.**

In [21]:
df["Type_Free"] = df["Type_Free"].astype(int)
df["Type_Paid"] = df["Type_Paid"].astype(int)
df.head(5)

Unnamed: 0,Rating,Reviews,Size,Installs,Price,Content Rating,Genres,Current Ver,Android Ver,Category_ART_AND_DESIGN,...,Category_PRODUCTIVITY,Category_SHOPPING,Category_SOCIAL,Category_SPORTS,Category_TOOLS,Category_TRAVEL_AND_LOCAL,Category_VIDEO_PLAYERS,Category_WEATHER,Type_Free,Type_Paid
0,4.1,159,19.0,10000,0,Everyone,Art & Design,1.0.0,4.0.3 and up,1,...,0,0,0,0,0,0,0,0,1,0
1,3.9,967,14.0,500000,0,Everyone,Art & Design;Pretend Play,2.0.0,4.0.3 and up,1,...,0,0,0,0,0,0,0,0,1,0
2,4.7,87510,8.7,5000000,0,Everyone,Art & Design,1.2.4,4.0.3 and up,1,...,0,0,0,0,0,0,0,0,1,0
3,4.5,215644,25.0,50000000,0,Teen,Art & Design,Varies with device,4.2 and up,1,...,0,0,0,0,0,0,0,0,1,0
4,4.3,967,2.8,100000,0,Everyone,Art & Design;Creativity,1.1,4.4 and up,1,...,0,0,0,0,0,0,0,0,1,0


**This command retrieves all unique values from the "Price" column in the DataFrame "df".**

In [22]:
df["Price"].unique()

array(['0', '$4.99', '$3.99', '$6.99', '$1.49', '$2.99', '$7.99', '$5.99',
       '$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49',
       '$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99',
       '$1.00', '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99',
       '$15.99', '$33.99', '$74.99', '$39.99', '$3.95', '$4.49', '$1.70',
       '$8.99', '$2.00', '$3.88', '$25.99', '$399.99', '$17.99',
       '$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$2.50',
       '$1.59', '$6.49', '$1.29', '$5.00', '$13.99', '$299.99', '$379.99',
       '$37.99', '$18.99', '$389.99', '$19.90', '$8.49', '$1.75',
       '$14.00', '$4.85', '$46.99', '$109.99', '$154.99', '$3.08',
       '$2.59', '$4.80', '$1.96', '$19.40', '$3.90', '$4.59', '$15.46',
       '$3.04', '$4.29', '$2.60', '$3.28', '$4.60', '$28.99', '$2.95',
       '$2.90', '$1.97', '$200.00', '$89.99', '$2.56', '$30.99', '$3.61',
       '$394.99', '$1.26', '$1.20', '$1.04'], dtype=object)

**This code removes the dollar sign ('$') from the values in the "Price" column of the DataFrame "df" and displays the first 5 rows of the updated DataFrame.**

In [23]:
# regex=False treats "$" as a literal character (it is a regex end-of-string anchor)
df["Price"] = df["Price"].str.replace("$", "", regex=False)
df.head(5)

Unnamed: 0,Rating,Reviews,Size,Installs,Price,Content Rating,Genres,Current Ver,Android Ver,Category_ART_AND_DESIGN,...,Category_PRODUCTIVITY,Category_SHOPPING,Category_SOCIAL,Category_SPORTS,Category_TOOLS,Category_TRAVEL_AND_LOCAL,Category_VIDEO_PLAYERS,Category_WEATHER,Type_Free,Type_Paid
0,4.1,159,19.0,10000,0,Everyone,Art & Design,1.0.0,4.0.3 and up,1,...,0,0,0,0,0,0,0,0,1,0
1,3.9,967,14.0,500000,0,Everyone,Art & Design;Pretend Play,2.0.0,4.0.3 and up,1,...,0,0,0,0,0,0,0,0,1,0
2,4.7,87510,8.7,5000000,0,Everyone,Art & Design,1.2.4,4.0.3 and up,1,...,0,0,0,0,0,0,0,0,1,0
3,4.5,215644,25.0,50000000,0,Teen,Art & Design,Varies with device,4.2 and up,1,...,0,0,0,0,0,0,0,0,1,0
4,4.3,967,2.8,100000,0,Everyone,Art & Design;Creativity,1.1,4.4 and up,1,...,0,0,0,0,0,0,0,0,1,0
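**After the symbols are stripped, columns such as "Reviews", "Installs", and "Price" still hold strings. A hedged sketch of an explicit numeric conversion with "pd.to_numeric" follows, using a small made-up DataFrame in the same format.**

```python
import pandas as pd

# Illustrative cleaned strings, as they look after the replace steps.
toy = pd.DataFrame({
    "Reviews": ["159", "967"],
    "Installs": ["10000", "500000"],
    "Price": ["0", "4.99"],
})

# Convert to real numbers; errors="coerce" turns anything unparseable into NaN.
for col in ["Reviews", "Installs", "Price"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```

**Explicit conversion makes the dtypes visible in "df.dtypes" and surfaces unparseable values as NaN instead of leaving the conversion to the model.**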


**This command retrieves all unique values from the "Content Rating" column in the DataFrame "df".**

In [24]:
df['Content Rating'].unique()

array(['Everyone', 'Teen', 'Everyone 10+', 'Mature 17+',
       'Adults only 18+', 'Unrated'], dtype=object)

**This code converts categorical variables in the "Content Rating" column of DataFrame "df" into dummy variables and displays the first 5 rows of the updated DataFrame.**

In [25]:
df = pd.get_dummies(df, columns=["Content Rating"])
df.head(5)

Unnamed: 0,Rating,Reviews,Size,Installs,Price,Genres,Current Ver,Android Ver,Category_ART_AND_DESIGN,Category_AUTO_AND_VEHICLES,...,Category_VIDEO_PLAYERS,Category_WEATHER,Type_Free,Type_Paid,Content Rating_Adults only 18+,Content Rating_Everyone,Content Rating_Everyone 10+,Content Rating_Mature 17+,Content Rating_Teen,Content Rating_Unrated
0,4.1,159,19.0,10000,0,Art & Design,1.0.0,4.0.3 and up,1,0,...,0,0,1,0,False,True,False,False,False,False
1,3.9,967,14.0,500000,0,Art & Design;Pretend Play,2.0.0,4.0.3 and up,1,0,...,0,0,1,0,False,True,False,False,False,False
2,4.7,87510,8.7,5000000,0,Art & Design,1.2.4,4.0.3 and up,1,0,...,0,0,1,0,False,True,False,False,False,False
3,4.5,215644,25.0,50000000,0,Art & Design,Varies with device,4.2 and up,1,0,...,0,0,1,0,False,False,False,False,True,False
4,4.3,967,2.8,100000,0,Art & Design;Creativity,1.1,4.4 and up,1,0,...,0,0,1,0,False,True,False,False,False,False


**The df.columns command will return a list of all column names in the DataFrame df.**

In [26]:
df.columns

Index(['Rating', 'Reviews', 'Size', 'Installs', 'Price', 'Genres',
       'Current Ver', 'Android Ver', 'Category_ART_AND_DESIGN',
       'Category_AUTO_AND_VEHICLES', 'Category_BEAUTY',
       'Category_BOOKS_AND_REFERENCE', 'Category_BUSINESS', 'Category_COMICS',
       'Category_COMMUNICATION', 'Category_DATING', 'Category_EDUCATION',
       'Category_ENTERTAINMENT', 'Category_EVENTS', 'Category_FAMILY',
       'Category_FINANCE', 'Category_FOOD_AND_DRINK', 'Category_GAME',
       'Category_HEALTH_AND_FITNESS', 'Category_HOUSE_AND_HOME',
       'Category_LIBRARIES_AND_DEMO', 'Category_LIFESTYLE',
       'Category_MAPS_AND_NAVIGATION', 'Category_MEDICAL',
       'Category_NEWS_AND_MAGAZINES', 'Category_PARENTING',
       'Category_PERSONALIZATION', 'Category_PHOTOGRAPHY',
       'Category_PRODUCTIVITY', 'Category_SHOPPING', 'Category_SOCIAL',
       'Category_SPORTS', 'Category_TOOLS', 'Category_TRAVEL_AND_LOCAL',
       'Category_VIDEO_PLAYERS', 'Category_WEATHER', 'Type_Free', 'Typ

**This code converts the "Content Rating" dummy columns listed in "columns_to_encode", which "pd.get_dummies" created as booleans (True/False), into integers (0/1) in the DataFrame "df".**

In [27]:
columns_to_encode = ['Content Rating_Adults only 18+', 'Content Rating_Everyone',
       'Content Rating_Everyone 10+', 'Content Rating_Mature 17+',
       'Content Rating_Teen', 'Content Rating_Unrated']

df[columns_to_encode] = df[columns_to_encode].astype(int)

**df.head(5) will display the first 5 rows of the updated DataFrame "df".**

In [28]:
df.head(5)

Unnamed: 0,Rating,Reviews,Size,Installs,Price,Genres,Current Ver,Android Ver,Category_ART_AND_DESIGN,Category_AUTO_AND_VEHICLES,...,Category_VIDEO_PLAYERS,Category_WEATHER,Type_Free,Type_Paid,Content Rating_Adults only 18+,Content Rating_Everyone,Content Rating_Everyone 10+,Content Rating_Mature 17+,Content Rating_Teen,Content Rating_Unrated
0,4.1,159,19.0,10000,0,Art & Design,1.0.0,4.0.3 and up,1,0,...,0,0,1,0,0,1,0,0,0,0
1,3.9,967,14.0,500000,0,Art & Design;Pretend Play,2.0.0,4.0.3 and up,1,0,...,0,0,1,0,0,1,0,0,0,0
2,4.7,87510,8.7,5000000,0,Art & Design,1.2.4,4.0.3 and up,1,0,...,0,0,1,0,0,1,0,0,0,0
3,4.5,215644,25.0,50000000,0,Art & Design,Varies with device,4.2 and up,1,0,...,0,0,1,0,0,0,0,0,1,0
4,4.3,967,2.8,100000,0,Art & Design;Creativity,1.1,4.4 and up,1,0,...,0,0,1,0,0,1,0,0,0,0


**This code removes the columns "Genres", "Current Ver", and "Android Ver" from the DataFrame "df" and displays the first 5 rows of the updated DataFrame.**

In [29]:
df = df.drop(columns=["Genres", "Current Ver", "Android Ver"])
df.head(5)

Unnamed: 0,Rating,Reviews,Size,Installs,Price,Category_ART_AND_DESIGN,Category_AUTO_AND_VEHICLES,Category_BEAUTY,Category_BOOKS_AND_REFERENCE,Category_BUSINESS,...,Category_VIDEO_PLAYERS,Category_WEATHER,Type_Free,Type_Paid,Content Rating_Adults only 18+,Content Rating_Everyone,Content Rating_Everyone 10+,Content Rating_Mature 17+,Content Rating_Teen,Content Rating_Unrated
0,4.1,159,19.0,10000,0,1,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0
1,3.9,967,14.0,500000,0,1,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0
2,4.7,87510,8.7,5000000,0,1,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0
3,4.5,215644,25.0,50000000,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
4,4.3,967,2.8,100000,0,1,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0


# **Model Building**

**These imports bring in functionalities from scikit-learn ("sklearn") for splitting data into training and testing sets ("train_test_split") and for calculating mean squared error ("mean_squared_error").**

In [30]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

**These two lines select the features (X) and target variable (y) from the DataFrame df for a machine learning task.**

In [31]:
X = df[['Reviews', 'Size', 'Installs', 'Price',
       'Category_ART_AND_DESIGN', 'Category_AUTO_AND_VEHICLES',
       'Category_BEAUTY', 'Category_BOOKS_AND_REFERENCE', 'Category_BUSINESS',
       'Category_COMICS', 'Category_COMMUNICATION', 'Category_DATING',
       'Category_EDUCATION', 'Category_ENTERTAINMENT', 'Category_EVENTS',
       'Category_FAMILY', 'Category_FINANCE', 'Category_FOOD_AND_DRINK',
       'Category_GAME', 'Category_HEALTH_AND_FITNESS',
       'Category_HOUSE_AND_HOME', 'Category_LIBRARIES_AND_DEMO',
       'Category_LIFESTYLE', 'Category_MAPS_AND_NAVIGATION',
       'Category_MEDICAL', 'Category_NEWS_AND_MAGAZINES', 'Category_PARENTING',
       'Category_PERSONALIZATION', 'Category_PHOTOGRAPHY',
       'Category_PRODUCTIVITY', 'Category_SHOPPING', 'Category_SOCIAL',
       'Category_SPORTS', 'Category_TOOLS', 'Category_TRAVEL_AND_LOCAL',
       'Category_VIDEO_PLAYERS', 'Category_WEATHER', 'Type_Free', 'Type_Paid',
       'Content Rating_Adults only 18+', 'Content Rating_Everyone',
       'Content Rating_Everyone 10+', 'Content Rating_Mature 17+',
       'Content Rating_Teen', 'Content Rating_Unrated']]

y = df["Rating"]

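**This line splits the features "X" and the target "y" into training ("X_train", "y_train") and testing ("X_test", "y_test") sets, with 20% of the data held out for testing ("test_size=0.2"). Because no "random_state" is set, the split differs on each run, so the MSE values reported below will vary slightly between runs.**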

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
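**To make a split repeatable, "train_test_split" accepts a "random_state" seed. The sketch below uses small synthetic arrays ("X_demo", "y_demo" are illustrative names), and the seed value 42 is an arbitrary choice.**

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same random_state yields the same partition on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(len(split_a[0]), len(split_a[1]))             # training rows, test rows
print(np.array_equal(split_a[1], split_b[1]))       # identical test sets
```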

**1: Random Forest Regressor is a machine learning algorithm for regression tasks, that is, predicting continuous values rather than classes or categories. It averages the predictions of many decision trees (here, 500).**

**2: Mean Squared Error (MSE) quantifies the average squared difference between predicted values ("y_pred") and actual values ("y_test") in the test set. Lower MSE values indicate better model performance.**

**3: The printed MSE value shows how closely the RandomForestRegressor model, fitted on "X_train" and "y_train", predicts the held-out ratings in "y_test". A lower MSE means more accurate predictions on unseen data.**

In [33]:
from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor(n_estimators=500)

RFR.fit(X_train, y_train)

y_pred = RFR.predict(X_test)

MSE = mean_squared_error(y_test, y_pred)

print("Random Forest Regressor Mean Squared Error: ", MSE)

Random Forest Regressor Mean Squared Error:  0.21095479069029385
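
**To make the metric concrete, the sketch below computes MSE by hand with NumPy on three made-up ratings. It also takes the square root (RMSE), which is expressed in the same units as the rating itself.**

```python
import numpy as np

# Illustrative actual and predicted ratings.
y_true = np.array([4.0, 3.5, 5.0])
y_hat = np.array([4.5, 3.5, 4.0])

# Mean of the squared residuals: (0.25 + 0.0 + 1.0) / 3
mse = np.mean((y_true - y_hat) ** 2)
rmse = np.sqrt(mse)  # same units as the rating scale

print(mse, rmse)
```

**By the same reasoning, an MSE of about 0.211 corresponds to an RMSE of roughly 0.46 rating points.**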



**1: Linear Regression is a fundamental machine learning algorithm for predicting continuous numeric values. It assumes a linear relationship between the input features ("X_train") and the target variable ("y_train").**

**2: Mean Squared Error (MSE) measures the average squared difference between predicted values ("y_pred") and actual values ("y_test") in the test dataset. Lower MSE values indicate that the model's predictions are closer to the actual values.**

**3: The printed MSE value quantifies how well the Linear Regression model, fitted on "X_train" and "y_train", predicts the held-out values in "y_test". A lower MSE means the model predicts app ratings more accurately from the given features.**

In [34]:
from sklearn.linear_model import LinearRegression

LR = LinearRegression(n_jobs=-1)

LR.fit(X_train, y_train)

y_pred = LR.predict(X_test)

MSE = mean_squared_error(y_test, y_pred)

print("Linear Regression Mean Squared Error: ", MSE)

Linear Regression Mean Squared Error:  0.23104416042197473


**1: Decision Tree Regressor is a machine learning algorithm for regression tasks, where it predicts continuous values based on input features (X_train).**

**2: Mean Squared Error (MSE) measures the average squared difference between predicted values (y_pred) and actual values (y_test) in the test dataset. A lower MSE indicates better model performance, as it signifies smaller errors between predicted and actual values.**

**3: The printed MSE value shows how well the DecisionTreeRegressor model, limited to a maximum depth of 20 and trained on (X_train, y_train), predicts the held-out values in y_test. A lower MSE means more accurate predictions on unseen data.**

In [35]:
from sklearn.tree import DecisionTreeRegressor

DTR = DecisionTreeRegressor(max_depth=20)

DTR.fit(X_train, y_train)

y_pred = DTR.predict(X_test)

MSE = mean_squared_error(y_test, y_pred)

print("Decision Tree Regressor Mean Squared Error: ", MSE)

Decision Tree Regressor Mean Squared Error:  0.3094600114058787


**1: SVR (Support Vector Regressor) is a regression algorithm that predicts continuous values based on input features (X_train), here using a radial basis function ('rbf') kernel.**

**2: Mean Squared Error (MSE) measures the average squared difference between predicted values (y_pred) and actual values (y_test) in the test dataset. Lower MSE values indicate more accurate predictions by the SVR model.**

**3: The printed MSE value shows how well the SVR model, trained on X_train and y_train, predicts y_test. A lower MSE indicates better performance. Note that SVR is sensitive to feature scale, and the features here are not standardized, which may limit its accuracy.**

In [36]:
from sklearn.svm import SVR

svr = SVR(kernel='rbf')

svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)

MSE = mean_squared_error(y_test, y_pred)

print("SVR Mean Squared Error: ", MSE)

SVR Mean Squared Error:  0.24573895154857378


**1: Ridge Regression is a linear regression algorithm that adds L2 regularization to reduce overfitting by penalizing large coefficients.**

**2: Mean Squared Error (MSE) calculates the average squared difference between predicted values (y_pred) and actual values (y_test) in the test dataset. Lower MSE values indicate better predictive performance of the Ridge Regression model.**

**3: The printed MSE value quantifies how well the Ridge Regression model, with regularization strength alpha=1.0 and trained on (X_train, y_train), predicts y_test (actual values). A lower MSE means more accurate predictions on unseen data.**

In [37]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)

ridge.fit(X_train, y_train)

y_pred = ridge.predict(X_test)

MSE = mean_squared_error(y_test, y_pred)

print("Ridge Mean Squared Error: ", MSE)

Ridge Mean Squared Error:  0.2310347053287163


**1: Lasso Regression is a linear regression algorithm that adds L1 regularization to reduce overfitting by penalizing the absolute size of coefficients.**

**2: Mean Squared Error (MSE) calculates the average squared difference between predicted values (y_pred) and actual values (y_test) in the test dataset. Lower MSE values indicate better predictive performance of the Lasso Regression model.**

**3: The printed MSE value quantifies how well the Lasso Regression model, with regularization strength alpha=1.0 and trained on (X_train, y_train), predicts y_test (actual values). The L1 penalty can shrink some coefficients to exactly zero, which acts as a form of feature selection. A lower MSE means more accurate predictions on unseen data.**

In [38]:
from sklearn.linear_model import Lasso

Ls = Lasso(alpha=1.0)

Ls.fit(X_train, y_train)

y_pred = Ls.predict(X_test)

MSE = mean_squared_error(y_test, y_pred)

print("Lasso Mean Squared Error: ", MSE)

Lasso Mean Squared Error:  0.2362909524938178


# **Conclusion:**

**Based on the mean squared error (MSE) results for the regression models above:**

- **Random Forest Regressor (RFR): MSE = 0.211**
- **Linear Regression: MSE = 0.231**
- **Decision Tree Regressor: MSE = 0.309**
- **Support Vector Regressor (SVR): MSE = 0.246**
- **Ridge Regression: MSE = 0.231**
- **Lasso Regression: MSE = 0.236**

**The Random Forest Regressor (RFR) has the lowest mean squared error and therefore outperforms the other models on this test split. A lower MSE indicates better predictive performance because it means the model's predictions are closer to the actual values.**

**Based on these results, the Random Forest Regressor (RFR) appears to be the best model among those tested for this dataset. As an ensemble method, it combines the predictions of multiple decision trees, which generally gives robust performance in regression tasks. Since the comparison rests on a single random train/test split, the gap between models may shift somewhat from run to run.**
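
**A comparison that depends less on one particular split could use k-fold cross-validation. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from "make_regression" instead of the app dataset, compares just two of the models, and uses a small forest to keep the run short.**

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; in the notebook this would be X and y.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Linear Regression": LinearRegression(),
}

cv_mse = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that higher is better; flip the sign back.
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(f"{name}: mean CV MSE = {cv_mse[name]:.3f}")
```

**Averaging over five folds gives each model five test sets instead of one, which makes the ranking less sensitive to a lucky or unlucky split.**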

# **Save The Model**

**The code snippet uses Python's pickle module to serialize and save a trained RandomForestRegressor model (RFR) as a binary file named "Finalized-Model.pickle", facilitating easy reuse and deployment for future predictions without retraining.**

In [39]:
import pickle

with open("Finalized-Model.pickle", "wb") as file:
  pickle.dump(RFR, file)
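
**For completeness, a saved model can be loaded back with "pickle.load". The sketch below is a self-contained round trip that uses a small stand-in LinearRegression model and a temporary directory, both chosen only for illustration. In the notebook, the loaded object would be the fitted RFR.**

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model; in the notebook this would be the fitted RFR.
X_demo = np.array([[0.0], [1.0], [2.0]])
y_demo = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary location so the example leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), "Finalized-Model.pickle")
with open(path, "wb") as file:
    pickle.dump(model, file)

# Load the model back and confirm it predicts the same values.
with open(path, "rb") as file:
    loaded = pickle.load(file)

same = np.allclose(model.predict(X_demo), loaded.predict(X_demo))
print(same)
```

**Pickle files should only be loaded from trusted sources, and it is safest to load a model with the same scikit-learn version that was used to save it.**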