## EDA and Feature Engineering of the Google Play Store Dataset

## Problem Statement
The Google Play Store hosts a vast array of mobile applications across various categories, catering to diverse user needs. However, with the increasing volume of apps, understanding the factors that contribute to an app’s success is critical for developers and businesses aiming to optimize visibility, engagement, and revenue on the platform. The goal is to analyze the Google Play Store dataset to uncover insights regarding app success metrics, including app ratings, download numbers, and the impact of various features such as category, size, price, and content rating. Through this analysis, we aim to answer the following questions:

1. What are the key factors that influence app ratings and downloads?
2. How do category, app size, content rating, and price affect an app's performance in terms of ratings and downloads?
3. What are the general trends in app pricing, and how does pricing relate to ratings and download counts?
4. How can app developers use these insights to make data-driven decisions about future app development and marketing?

Answering these questions should help app developers, marketers, and strategists understand the Google Play Store landscape and make improvements that match user preferences and expectations.

## Dataset Description

The Google Play Store dataset describes the characteristics and performance of mobile applications. Its columns are:

- App: The name of the application.
- Category: The genre or category to which the app belongs, such as "Education," "Entertainment," or "Productivity."
- Rating: The user rating of the app, expected to fall between 1 and 5.
- Reviews: The number of user reviews received.
- Size: The size of the app in MB, stored as text with a unit suffix (for example, "19M").
- Installs: The number of times the app has been downloaded, stored as a text bucket (for example, "10,000+").
- Type: Indicates whether the app is "Free" or "Paid."
- Price: The price of the app; 0 for free apps.
- Content Rating: The age group for which the app is appropriate, such as "Everyone" or "Teen."
- Genres: The genre of the app, providing a more granular view than the broader Category field.
- Last Updated: The date the app was last updated.
- Current Ver: The current version of the app.
- Android Ver: The minimum Android version required to run the app.

Together, these columns let us analyze how various factors correlate with app ratings and download numbers.

# Steps We Are Going to Follow
1. Data Cleaning

2. Exploratory Data Analysis

3. Feature Engineering

In [20]:
import numpy as np 
import pandas as pd    
import matplotlib.pyplot as plt                 
%matplotlib inline  
       


In [21]:
df = pd.read_csv('googleplaystore.csv')
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [22]:
df.shape

(10841, 13)

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [24]:
# Summary of Dataset 

df.describe()

Unnamed: 0,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0
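
The `max` of 19.0 reported by `describe()` lies outside the 1-to-5 range that ratings are expected to follow, which suggests at least one malformed row. A minimal sketch of how such rows might be flagged is given here; the small `ratings` frame and its app names are hypothetical stand-ins rather than rows from the real file.

```python
import pandas as pd

# Hypothetical sample; the 19.0 mirrors the suspicious maximum from describe().
ratings = pd.DataFrame({
    "App": ["app_a", "app_b", "app_c", "app_d"],
    "Rating": [4.1, 3.9, 19.0, None],
})

# Ratings are expected to lie in [1, 5]. Comparisons against NaN are False,
# so missing ratings are not flagged by this check.
out_of_range = ratings[(ratings["Rating"] < 1) | (ratings["Rating"] > 5)]
print(out_of_range)
```

Applied to the real data, the same mask would be built from `df["Rating"]`.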


In [25]:
df.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64
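
Raw counts are easier to judge as a share of all rows: the 1474 missing `Rating` values out of 10841 rows come to roughly 13.6%. One way to tabulate counts and percentages together is sketched here on a small hypothetical frame (`sample` and its values are illustrative only).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few missing values.
sample = pd.DataFrame({
    "Rating": [4.1, np.nan, 4.7, np.nan],
    "Type": ["Free", "Free", None, "Paid"],
    "Reviews": ["159", "967", "87510", "215644"],
})

# isnull().mean() gives the fraction of missing values per column.
missing = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(2),
})

# Keep only columns that actually have gaps, worst first.
missing = missing[missing["missing_count"] > 0].sort_values(
    "missing_count", ascending=False
)
print(missing)
```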

# Insight And Observation 

In [26]:
df.head(2)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up


In [27]:
df['Reviews'].unique()

array(['159', '967', '87510', ..., '603', '1195', '398307'], dtype=object)

In [28]:
df['Reviews'].str.isnumeric().sum()

np.int64(10840)

In [29]:
df[~df['Reviews'].str.isnumeric()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,
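
As an alternative to `str.isnumeric()`, `pd.to_numeric` with `errors="coerce"` can locate unparseable values without hard-coding a row position. The sketch below uses a hypothetical three-row frame; `"3.0M"` stands in for the kind of shifted value seen in row 10472.

```python
import pandas as pd

# Hypothetical sample; "3.0M" imitates a value that landed in the wrong column.
sample = pd.DataFrame({
    "App": ["app_a", "app_b", "app_c"],
    "Reviews": ["159", "3.0M", "967"],
})

# errors="coerce" turns anything that cannot be parsed into NaN.
parsed = pd.to_numeric(sample["Reviews"], errors="coerce")

bad_rows = sample[parsed.isna()]        # rows to inspect or drop
clean = sample[parsed.notna()].copy()   # rows that parsed cleanly
clean["Reviews"] = parsed[parsed.notna()].astype(int)

print(bad_rows)
print(clean.dtypes)
```

Note that `pd.to_numeric` also accepts decimal strings such as `"19.0"`, which `str.isnumeric()` rejects, so the two checks are not strictly equivalent.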


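In [30]:
# Work on a copy so the raw DataFrame stays untouched
df_copy = df.copy()

In [31]:
# Drop the malformed row found above (position 10472)
df_copy = df_copy.drop(df_copy.index[10472])

In [32]:
# Convert only the Reviews column; assigning the result to df_copy itself
# would replace the whole DataFrame with a single Series
df_copy['Reviews'] = df_copy['Reviews'].astype(int)

In [34]:
df_copy['Reviews'].head()

0       159
1       967
2     87510
3    215644
4       967
Name: Reviews, dtype: int64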
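
Other text columns in the `df.head()` output, such as `Installs` ("10,000+") and `Size` ("19M"), might be converted in a similar way. The sketch below is a rough illustration only: it assumes every value follows the formats visible in the first rows, and the full columns may contain other formats that would need separate handling. The `Size_MB` column name is a hypothetical choice.

```python
import pandas as pd

# Values copied from the first rows of df.head(); other formats are not handled.
sample = pd.DataFrame({
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
    "Size": ["19M", "14M", "8.7M"],
})

# Strip thousands separators and the trailing "+" before casting to int.
sample["Installs"] = (
    sample["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(int)
)

# Strip the "M" suffix and keep the size as a float number of megabytes.
sample["Size_MB"] = sample["Size"].str.replace("M", "", regex=False).astype(float)

print(sample)
```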