# **Project Name**    - 
Google Play Store review

##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Omkar Dandvate


# **Project Summary -**

The project aims to analyze data from the Google Play Store to gain insights into app categories, user ratings, and user reviews. The data will be obtained through web scraping and will include information such as app name, category, rating, and reviews. The project will use Python and data analysis libraries such as Pandas, NumPy, and Matplotlib to clean and analyze the data.

The first step of the project will involve cleaning and transforming data into a usable format using Pandas. The analysis will then focus on identifying the most popular app categories, analyzing user ratings, and identifying common themes in user reviews.

The project will also explore the relationship between app size and user ratings and analyze the impact of pricing on user ratings. Additionally, the project will look at the relationship between app size and app category and examine the popularity of free vs. paid apps.

The project will conclude with visualizations and insights that can help developers understand user preferences and make data-driven decisions in developing and marketing their apps.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

The business objective of the Google Play Store data analysis project is to provide app developers with insights that can help them optimize their app development and marketing strategies, increase user engagement and retention, and drive revenue growth by understanding user preferences and market trends.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Import Libraries

### Dataset Loading

In [None]:
# Load Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
dataset= pd.read_csv('/content/drive/MyDrive/Play Store Data.csv')
reviews= pd.read_csv('/content/drive/MyDrive/User Reviews.csv')


In [None]:
df=dataset
rv=reviews

In [None]:
# Correlate only the numeric columns; text columns cannot be correlated
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=.05, fmt=".1f")
plt.title("Heatmap for numerical columns", size=5)
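
The raw frame mixes text and numeric columns, and a correlation matrix is only defined for the numeric ones. A minimal sketch of that selection step, using a made-up frame (the values are illustrative and not taken from the Play Store data):

```python
import pandas as pd

# Toy data for illustration only; not the Play Store dataset
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, 3.9, 4.7, 4.5],
    "Reviews": [100, 50, 900, 400],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
numeric_toy = toy.select_dtypes(include="number")
corr = numeric_toy.corr()
print(corr)
```

The resulting `corr` frame can be passed to `sns.heatmap` in the same way as above.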

### Dataset First View

In [None]:
# Dataset First Look

In [None]:
dataset.head()

In [None]:
reviews.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
dataset.shape

In [None]:
reviews.shape

In [None]:
dataset.columns

In [None]:
reviews.columns

### Dataset Information

In [None]:
# Dataset Info

In [None]:
dataset.info()

In [None]:
reviews.info()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Let's set the theme of plots.
sns.set_theme()
sns.set(rc={"figure.dpi":300, "figure.figsize":(8,5)})

In [None]:
sns.heatmap(df.isnull(), cbar=False)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
dataset.isnull().sum()

In [None]:
reviews.isnull().sum()
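
Raw counts are easier to judge next to the share of rows they represent. A small sketch, on a toy frame with hypothetical values, of expressing missing values as a percentage per column:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 4.7, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
    "App": ["a", "b", "c", "d"],
})

# isnull().mean() gives the fraction of missing entries in each column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```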

#### **Plotting graphs to check for outliers**

In [None]:
dataset.boxplot()

In [None]:
dataset.hist();

### What did you know about your dataset?

The `Rating` column has dtype float, so it is the only column that `boxplot()` and `hist()` draw at this stage; both plots accept numerical values only.

The plots show that the dataset contains outliers.
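
A box plot flags outliers visually. The same idea can be sketched numerically with the common 1.5 × IQR rule. The helper name `iqr_outliers` and the sample values are hypothetical; this is one possible check, not necessarily the rule the plots above apply to this dataset.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Toy ratings for illustration only; the last value is deliberately extreme
toy_ratings = pd.Series([4.1, 4.3, 4.5, 3.9, 4.0, 19.0])
flagged = iqr_outliers(toy_ratings)
print(flagged)
```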

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
dataset[dataset.Rating>5]

In [None]:
dataset.drop([10472],inplace=True)

In [None]:
#to confirm our row is deleted successfully

dataset[10470:10475]


In [None]:
#Now lets make box plot again

dataset.boxplot()

dataset.hist()

### What all manipulations have you done and insights you found?

Above we found a few null values. Let's clean the data to get meaningful insights.

In [None]:
dataset.isnull().sum()

In [None]:
# median is a method and must be called to obtain the value
dataset['Rating'] = dataset['Rating'].fillna(dataset['Rating'].median())

In [None]:
dataset.isnull().sum()

In [None]:
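# mode() returns a Series (ties are possible), so take its first entry
dataset['Android Ver'] = dataset['Android Ver'].fillna(dataset['Android Ver'].mode()[0])
dataset['Current Ver'] = dataset['Current Ver'].fillna(dataset['Current Ver'].mode()[0])
dataset['Type'] = dataset['Type'].fillna(dataset['Type'].mode()[0])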

In [None]:
dataset.isnull().sum()
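
The imputation step above can be sketched on a toy frame: numeric gaps are filled with the column median and categorical gaps with the most frequent value. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Rating": [4.0, np.nan, 5.0, 3.0],
    "Type": ["Free", "Free", None, "Paid"],
})

# median() and mode() are methods and must be called;
# mode() returns a Series, so its first entry is used as the fill value
toy["Rating"] = toy["Rating"].fillna(toy["Rating"].median())
toy["Type"] = toy["Type"].fillna(toy["Type"].mode()[0])
print(toy)
```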

In [None]:
# Strip the dollar sign and convert the price to a float
dataset['Price'] = dataset['Price'].astype(str).str.replace('$', '', regex=False).astype(float)

In [None]:
dataset['Reviews']= pd.to_numeric(dataset['Reviews'],errors='coerce')

In [None]:
# Strip the trailing '+' and the thousands separators, then convert to a number
dataset['Installs'] = dataset['Installs'].astype(str).str.replace('+', '', regex=False)
dataset['Installs'] = dataset['Installs'].str.replace(',', '', regex=False)
dataset['Installs'] = dataset['Installs'].astype(float)


In [None]:
dataset['Installs'] = dataset['Installs'].apply(lambda x :int(x))
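
The `Price` and `Installs` clean-up can also be written with pandas' vectorized string methods. A short sketch on toy strings that follow the same pattern as the columns above (the values themselves are invented):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Price": ["0", "$4.99", "$0.99"],
    "Installs": ["10,000+", "500+", "1,000,000+"],
})

# regex=False treats '$' and '+' as literal characters, not regex metacharacters
toy["Price"] = toy["Price"].str.replace("$", "", regex=False).astype(float)
toy["Installs"] = (
    toy["Installs"]
    .str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
print(toy.dtypes)
```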

In [None]:
#importing the datetime library
from datetime import datetime

In [None]:
#changing the data type of last updated column from string to datetime
dataset['Last Updated'] = dataset['Last Updated'].apply(lambda x: datetime.strptime(x,'%B %d, %Y'))
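
`pd.to_datetime` offers a vectorized alternative to applying `datetime.strptime` row by row. A minimal sketch with the same format string, using made-up dates:

```python
import pandas as pd

# Toy dates for illustration only, written in the '%B %d, %Y' layout used above
toy_dates = pd.Series(["January 7, 2018", "August 1, 2018"])

# Parse the whole column in one call
parsed = pd.to_datetime(toy_dates, format="%B %d, %Y")
print(parsed)
```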

In [None]:
dataset.info()

In [None]:
dataset['Content Rating'].unique()

In [None]:
dataset['Type'].value_counts()

In [None]:
dataset['Reviews'].value_counts()

In [None]:
# Duplicate values in App column
dataset['App'].value_counts()

In [None]:
#deleting the duplicate values in App column
dataset.drop_duplicates(subset = 'App', inplace = True)

In [None]:
dataset['App'].value_counts()
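
`drop_duplicates` keeps the first occurrence of each app by default. If one assumes that the duplicate with the most reviews is the most recent snapshot (an assumption, not something the data confirms), the rows can be sorted first so that this row is the one kept. A sketch on toy data:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "App": ["a", "a", "b"],
    "Reviews": [10, 25, 7],
})

# Sort so the most-reviewed duplicate comes first, keep it, then restore row order
deduped = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates(subset="App", keep="first")
       .sort_index()
)
print(deduped)
```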

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

In [None]:
 # No of apps in each category

dataset['Category'].value_counts().plot.barh(figsize=(10,12), color = 'r').invert_yaxis()
plt.ylabel('App Categories')
plt.xlabel('Number of apps')
plt.title('Number of apps in each category in the playstore')
plt.legend()

In [None]:
grpcat = dataset.groupby('Category')
x = grpcat['Installs'].mean()   # average installs per category
y = grpcat['Price'].sum()       # total price per category
z = grpcat['Reviews'].mean()    # average reviews per category
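
The three per-category aggregates can also be collected into a single table with named aggregation. A sketch on toy data; the output column names `avg_installs`, `total_price` and `avg_reviews` are hypothetical.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS"],
    "Installs": [100, 300, 50],
    "Price": [0.0, 1.99, 0.0],
    "Reviews": [10, 30, 5],
})

# One row per category, one column per aggregate
summary = toy.groupby("Category").agg(
    avg_installs=("Installs", "mean"),
    total_price=("Price", "sum"),
    avg_reviews=("Reviews", "mean"),
)
print(summary)
```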


In [None]:
plt.figure(figsize=(16,5))
plt.plot(x, 'o', color='b')
plt.xticks(rotation=90)
plt.title('Category vs Installs')
plt.xlabel('Category')
plt.ylabel('Installs')
plt.show()

In [None]:
plt.figure(figsize=(16,5))
plt.plot(y, '--', color='k')
plt.xticks(rotation=90)
plt.title('Category vs Price')
plt.xlabel('Category')
plt.ylabel('Price')
plt.show()

In [None]:
plt.figure(figsize=(16,5))
plt.plot(z, color='c')  # plot average reviews (z), not installs (x)
plt.xticks(rotation=90)
plt.title('Category vs Reviews')
plt.xlabel('Category')
plt.ylabel('Reviews')
plt.show()

In [None]:
dataset.dtypes

##### 1. What is/are the insight(s) found from the chart?

Answer: Apps in the Communication, Video Players and Social categories have the highest average number of installs compared to apps in other categories.

##### 2. Will the gained insights help creating a positive business impact? 

Answer: The chart suggests there is currently great scope to develop apps in the booming Communication category.

#### Chart - 2

Let's analyze the dataset on the basis of **Rating**.

In [None]:
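# Rating may hold floats or strings, so cast to str before searching for letters
contains_alpha = dataset['Rating'].dropna().astype(str).str.contains('[a-zA-Z]').any()

if contains_alpha:
    print("Rating contains alphabetic characters.")
else:
    print("Rating does not contain any alphabetic characters.")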

In [None]:
dataset['Rating'] = pd.to_numeric(dataset['Rating'], errors='coerce')

In [None]:
points = dataset.Rating.unique()
points

Any **NaN** values in the Rating column are replaced with the median below.

In [None]:
# Find Median of all non NaN values of rating column
median_rating = dataset[~dataset['Rating'].isnull()]['Rating'].median()

In [None]:
dataset['Rating'] = dataset['Rating'].fillna(value=median_rating)

In [None]:
dataset.Rating.unique()

#### Now that the data is clean, we can apply functions for our analysis

In [None]:
def rates(var):
  # Map a numeric rating (1 to 5) to a descriptive group
  try:
    if 1 <= var < 2:
      return 'Under Rated'
    elif 2 <= var < 3:
      return 'Average'
    elif 3 <= var < 4:
      return 'Above Average'
    elif 4 <= var <= 5:
      return 'Top Rated'
    else:
      # Missing or out-of-range ratings
      return 'Unknown'
  except TypeError:
    # Non-numeric values are returned unchanged
    return var

In [None]:
# applying the Rates function in the main df

dataset['Rates'] = dataset['Rating'].apply(lambda x: rates(x))
dataset.head(10)
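
Binning a numeric column into labelled groups can also be sketched with `pd.cut`. The version below uses the same group labels as `rates` and assumes that a rating of exactly 5.0 should count as 'Top Rated'; that choice and the toy values are illustrative.

```python
import pandas as pd

# Toy ratings for illustration only
toy_ratings = pd.Series([1.5, 2.0, 3.7, 4.3, 5.0])

labels = ["Under Rated", "Average", "Above Average", "Top Rated"]
# right=False makes each bin left-closed: [1, 2), [2, 3), [3, 4), [4, inf)
groups = pd.cut(
    toy_ratings,
    bins=[1, 2, 3, 4, float("inf")],
    labels=labels,
    right=False,
)
print(groups)
```

Values below 1 or missing values fall outside every bin and come back as NaN.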

In [None]:
# no of apps belonging to each Rate group

dataset['Rates'].value_counts().nlargest(30).plot.barh(figsize=(8,4), color='g').invert_yaxis()
plt.title("Number of apps in each rating group")
plt.xlabel('No of apps')
plt.ylabel('Rating group')
plt.legend()

In [None]:
dataset['Rates'].value_counts().plot.pie(figsize = (7,14), autopct='%1.1f%%')
plt.legend()

#### It is seen that most of the apps from Google play store are Top Rated.

In [None]:
# just for our reference
dataset[dataset['Rating']==4.3].value_counts()

In [None]:
# Average app ratings

dataset['Rating'].value_counts().plot.bar(figsize=(12,7) )
plt.xlabel('Average rating')
plt.ylabel('Number of apps')
plt.title('Avg rating of apps in Google playstore')
plt.legend()

##### 2. What is/are the insight(s) found from the chart?

*Answer* The most common rating is 4.3, which means that most apps are rated well. Because missing ratings were filled with the median earlier, the count at the median value is inflated.

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

#### Let's clean the Content Rating column

In [None]:
dataset['Content Rating'].hist()

In [None]:
dataset['Content Rating'].value_counts()

Two categories ('Unrated' and 'Adults only 18+') contain very few apps, so they are dropped for clearer results.

In [None]:
dataset[dataset['Content Rating'] == 'Unrated']

In [None]:
dataset[dataset['Content Rating'] == 'Adults only 18+']

In [None]:
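# Drop the rare categories by value rather than by hard-coded row index
rare_ratings = ['Unrated', 'Adults only 18+']
dataset.drop(dataset[dataset['Content Rating'].isin(rare_ratings)].index, inplace=True)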

In [None]:
dataset['Content Rating'].value_counts()

In [None]:
dataset['Content Rating'].value_counts().plot.pie(figsize = (7,14), autopct='%1.1f%%')
plt.legend()

##### 2. What is/are the insight(s) found from the chart?

Answer: The majority of apps can be used by everyone, and a smaller number of apps are aimed at teens.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: Apps rated 'Adults only 18+' or 'Unrated' make up only a very small share of the Play Store.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

#### Let's try to find insights from the 'Type' column.

In [None]:
dataset['Type'].value_counts()

In [None]:
dataset[dataset['Type'] == 'Free'].head()

In [None]:
dataset['Type'].value_counts().plot.pie(figsize = (5,5), autopct='%1.1f%%')
#plt.legend()
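
The percentages shown on the pie chart can also be read off directly with `value_counts(normalize=True)`. A short sketch on a toy series (the values are invented):

```python
import pandas as pd

# Toy data for illustration only
toy_type = pd.Series(["Free", "Free", "Free", "Paid"], name="Type")

# normalize=True returns fractions instead of raw counts
share = toy_type.value_counts(normalize=True) * 100
print(share)
```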

##### 2. What is/are the insight(s) found from the chart?

Answer: It is clear that most apps are free.

In [None]:
# Creating a df for only free apps
 
free_df = df[df['Type'] == 'Free']

In [None]:
free_df

In [None]:
free_df.shape

In [None]:
top_free_df = free_df[free_df['Installs'] == free_df['Installs'].max()]
top_free_df.head()

In [None]:
top_free_df.shape

In [None]:
# Categories in which the top 20 free apps belong to

top_free_df['Category'].value_counts().plot.bar(figsize=(12,5))
plt.xlabel('Category')
plt.ylabel('Number of apps')
plt.title('Categories in which the top 20 free apps belong')
plt.xticks(rotation=90)
plt.legend()

In [None]:
top_free_df['App']

In [None]:
# Creating a df containing only paid apps

paid_df = df[df['Type'] == 'Paid']
paid_df.head()

In [None]:
top_paiddf = paid_df.sort_values('Installs', ascending=False).head(20)

In [None]:
top_paiddf.head()

In [None]:
top_paiddf['Category'].value_counts()

In [None]:
top_paiddf['Category'].value_counts().plot.bar(figsize=(12,5))
plt.xlabel('Category')
plt.ylabel('Number of apps')
plt.title('Categories in which the top 20 paid apps belong')
plt.xticks(rotation=90)
plt.legend()

In [None]:
# Number of apps that can be installed at a particular price 

paid_df.groupby('Price')['App'].count().sort_values(ascending= False).plot.bar(figsize = (20,5))

In [None]:
paid_df['Price'].value_counts()

In [None]:
def price_group(var):

  try:
    if var < 1:
      return 'Below 1'
    elif var >= 1 and var <5:
      return '1-5'
    elif var >= 5 and var <10:
      return '5-10'
    elif var >= 10 and var <50:
      return '10-50'
    elif var >= 50 and var <100:
      return '50-100'
    else:
      return '100 and above'
  except:
    return var

In [None]:
# applying the price_group function in the paid df

# assign returns a new frame, which avoids writing into a slice of df
paid_df = paid_df.assign(Paid_group=paid_df['Price'].apply(price_group))
paid_df.head(10)

In [None]:
paid_df['Paid_group'].value_counts()

In [None]:
paid_df['Paid_group'].value_counts().plot.pie(figsize = (10,18), autopct='%1.1f%%')
plt.legend()

In [None]:
# Average price of paid apps in each category

paid_df.groupby('Category')['Price'].mean().sort_values(ascending=False).plot.barh(figsize = (10,9), color='r').invert_yaxis()
plt.xlabel('Average Price (USD)')
plt.title('Average price of paid apps in each category')
plt.legend()

#### It can be seen that apps in Finance, Lifestyle, Style are costly

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

In [None]:
dataset['Size'].tail()

In [None]:
# Function to group the apps based on its size in MB
def size_group(var):

  try:
    if var < 1:
      return 'Below 1'
    elif var >= 1 and var <10:
      return '1-10'
    elif var >= 10 and var <20:
      return '10-20'
    elif var >= 20 and var <30:
      return '20-30'
    elif var >= 30 and var <40:
      return '30-40'
    elif var >= 40 and var <50:
      return '40-50'
    elif var >= 50 and var <60:
      return '50-60'
    elif var >= 60 and var <70:
      return '60-70'
    elif var >= 70 and var <80:
      return '70-80'
    elif var >= 80 and var <90:
      return '80-90'
    else:
      return '90 and above'
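  except TypeError:
    # Non-numeric sizes (e.g. 'Varies with device') cannot be compared and are returned unchanged
    return var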

In [None]:
# applying the size_group function in the main df

dataset['Size_group'] = dataset['Size'].apply(lambda x: size_group (x))
dataset.head(10)
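
`size_group` compares its input against numbers, so it only bins values that are already numeric. A hedged sketch of converting size strings to megabytes first; the helper name `size_to_mb`, the 'k' suffix for kilobytes and the 1024 divisor are assumptions, and the sample strings are illustrative.

```python
import numpy as np
import pandas as pd

def size_to_mb(size):
    """Convert strings such as '19M' or '512k' to megabytes; anything else becomes NaN."""
    if isinstance(size, str):
        text = size.strip()
        try:
            if text.endswith("M"):
                return float(text[:-1])
            if text.endswith("k"):
                # Assumes 'k' means kilobytes and 1 MB = 1024 kB
                return float(text[:-1]) / 1024
        except ValueError:
            return np.nan
    return np.nan

# Toy sizes for illustration only
toy_sizes = pd.Series(["19M", "512k", "Varies with device"])
size_mb = toy_sizes.apply(size_to_mb)
print(size_mb)
```

The numeric result could then be passed to `size_group` to obtain the size bins.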

In [None]:
abc_df = dataset.sort_values('Size', ascending=False)

In [None]:
abc_df['Size'].value_counts()

In [None]:
pp_df = abc_df[abc_df['Size'] != 'Varies with device']


In [None]:
pp_df['Size'].value_counts().nlargest(20).plot.barh(figsize=(12,4), color='m').invert_yaxis()
plt.title("Number of apps in different size groups")
plt.xlabel('No of apps')
plt.ylabel('App size in MB')
plt.legend()

#### Most apps are between 11M and 15M in size

##### 2. What is/are the insight(s) found from the chart?

Answer: The majority of apps lie in the range from 1 to 20 MB.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

In [None]:
reviews.info()


In [None]:
reviews.isnull().sum()

In [None]:
reviews.shape

In [None]:
reviews[reviews['Translated_Review'].isnull()]

In [None]:
# Deleting the rows containing NaN values

reviews = reviews.dropna()

In [None]:
reviews.shape

In [None]:
reviews['Sentiment'].value_counts()

In [None]:
reviews['Sentiment'].value_counts().plot.pie(figsize = (7,15), autopct='%1.1f%%')

In [None]:
reviews.describe()

In [None]:
rv.head(10)

In [None]:
# Review sentiment for each app

rv.groupby('App')['Sentiment'].value_counts()
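
The grouped counts above form a long Series with one row per app and sentiment. Unstacking them gives one row per app and one column per sentiment, which is easier to sort and compare. A sketch on toy reviews (invented data):

```python
import pandas as pd

# Toy data for illustration only
toy_reviews = pd.DataFrame({
    "App": ["a", "a", "a", "b", "b"],
    "Sentiment": ["Positive", "Negative", "Positive", "Positive", "Positive"],
})

# Count sentiments per app, then pivot the sentiment level into columns
sentiment_table = (
    toy_reviews.groupby("App")["Sentiment"]
    .value_counts()
    .unstack(fill_value=0)
)
print(sentiment_table)
```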

In [None]:
# positive reviews

positive_reviews_df = rv[rv['Sentiment'] == 'Positive']
positive_reviews_df

In [None]:
positive_reviews_df.groupby('App')['Sentiment'].value_counts().nlargest(20).plot.barh(figsize=(12,4), color='m').invert_yaxis()
plt.xlabel('Total number of positive reviews')
plt.title('Apps with the highest number of positive reviews')

Helix Jump has the highest number of positive reviews.

In [None]:
# negative reviews

negative_reviews_df = rv[rv['Sentiment'] == 'Negative']
negative_reviews_df

In [None]:
negative_reviews_df.groupby('App')['Sentiment'].value_counts().nlargest(20).plot.barh(figsize=(12,4), color='g').invert_yaxis()
plt.xlabel('Total number of negative reviews')
plt.title('Apps with the highest number of negative reviews')

Angry Birds Classic has the highest number of negative reviews.

# **Conclusion**

There is a strong positive correlation between Reviews and Installs.

The majority of apps can be used by everyone.

The Family, Game, and Tools categories have the most apps compared to other categories.

The Communication, Video Players, and Social categories have the highest average number of installs.

The Finance, Family, and Lifestyle categories have the most paid apps compared to other categories.

The majority of apps in the Play Store (approximately 80%) are Top Rated, which suggests that customers are happy with the apps.


Highest number of positive reviews: Helix Jump

Highest number of negative reviews: Angry Birds Classic

The most common rating in the Play Store is 4.3, which is on the higher side.

More than 90% of the apps in the Play Store are free.

The most-installed free apps mainly belong to the following categories:
1. Communication
2. Social

The top paid apps mainly belong to the following categories:
1. Games
2. Family

Apps in Finance, Lifestyle, and Style are costly.

The majority of paid apps in the Play Store are priced between $1 and $5.

Excluding the 'Varies with device' values, most apps in the Play Store are between 11M and 15M in size.

Most apps have received positive reviews.




### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***