<a href="https://colab.research.google.com/github/MaazAnsari-OO7/Play-Store-App-Review-Analysis/blob/main/Play_Store_App_Review_Analysis_Maaz_Ansari_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. </b>

## <b> Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps.</b>

## <b> Explore and analyze the data to discover key factors responsible for app engagement and success. </b>

In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
playstore_data_path = '/content/drive/MyDrive/Capstone Files/EDA/Play Store Data.csv'
user_review_data_path = '/content/drive/MyDrive/Capstone Files/EDA/User Reviews.csv'

In [4]:
playstore_df = pd.read_csv(playstore_data_path)
review_df = pd.read_csv(user_review_data_path)

# Exploring Playstore and User Review Dataframe.

In [None]:
playstore_df.shape

In [None]:
playstore_df.head()

In [None]:
playstore_df.tail()

In [None]:
playstore_df.info()

In [None]:
review_df.shape

In [None]:
review_df.info()

In [None]:
review_df.head()

In [None]:
review_df.tail()

# Data Cleaning.

In [13]:
df = playstore_df.copy()

# 1- Converting "Reviews" column type from "object" to "int".


In [14]:
df['Reviews']=df['Reviews'].apply(lambda x : eval(x))

SyntaxError: ignored

## Converting Reviews to "int" raised an error.
## It shows that one of the values in the Reviews column is 3.0M.
## So we have to remove that row from the dataset.

In [15]:
# Finding the row index of that value.
df[df['Reviews']=='3.0M']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


## The value 3.0M is at index 10472. In that row every value is shifted one column to the left (the Category is missing), so the whole row is malformed.
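
A malformed value like this can also be located without knowing it in advance. The sketch below is one possible approach, not the method used in this notebook: it runs `pd.to_numeric` with `errors='coerce'` on a small made-up Series (the values are illustrative, not taken from the dataset) and lists the entries that fail to parse.

```python
import pandas as pd

# Toy stand-in for the Reviews column; the values are illustrative only.
reviews = pd.Series(['159', '967', '3.0M', '87510'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
parsed = pd.to_numeric(reviews, errors='coerce')

# The entries that became NaN are the ones that need attention.
bad_rows = reviews[parsed.isna()]
print(bad_rows)
```

The index of `bad_rows` gives the row labels to inspect or drop.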


In [16]:
# Dropping that row with the drop method and resetting the index.
df = df.drop(10472).reset_index(drop=True)

In [17]:
# Converting the values to int.
df['Reviews'] = df['Reviews'].astype(int)

# 2- Converting "Size" type from "object" to "float".

In [18]:
df.tail()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10835,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10836,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10837,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10838,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device
10839,iHoroscope - 2018 Daily Horoscope & Astrology,LIFESTYLE,4.5,398307,19M,"10,000,000+",Free,0,Everyone,Lifestyle,"July 25, 2018",Varies with device,Varies with device


### It is observed that the "Size" column contains values in MB (M), KB (k) and "Varies with device".


In [19]:
df[df['Size']=='Varies with device'].shape

(1695, 13)

## As we can see, there are **1695** rows that have the value **"Varies with device"** in the **"Size"** column. Replacing this value with the mean would distort the visualizations, so we replace **Varies with device** with np.nan, which is of float type.


In [20]:
def converting_size_into_float(string):
  '''
  Converts a size string to a float in MB: strips a trailing 'M' (MB), converts a trailing 'k' (KB) to MB, and replaces 'Varies with device' with np.nan.
  '''

  if string[-1] == 'M':
    return float(string.rstrip('M'))

  elif string[-1] == 'k':
    # 1 MB = 1024 KB; the result is rounded to one decimal place.
    return round(float(string.rstrip('k')) / 1024, 1)

  elif string == 'Varies with device':
    return np.nan
  else:
    return float(string)

In [21]:
# Applying defined function.
df['Size']= df['Size'].apply(lambda x : converting_size_into_float(x))
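
As a quick check of the conversion rules described above, here is a minimal, self-contained sketch on made-up size strings. The helper name `size_to_mb` is hypothetical, and unlike the function above it returns NaN for anything that does not end in 'M' or 'k'.

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    # Hypothetical helper: size in MB as a float, or NaN when unknown.
    if value.endswith('M'):
        return float(value[:-1])
    if value.endswith('k'):
        # 1 MB = 1024 KB, rounded to one decimal place.
        return round(float(value[:-1]) / 1024, 1)
    return np.nan

# Illustrative values only.
sizes = pd.Series(['19M', '3.6M', '512k', 'Varies with device'])
sizes_mb = sizes.apply(size_to_mb)
print(sizes_mb)
```

Because NaN is a float, the resulting column is numeric and plotting libraries can simply skip the missing entries.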

# 3- Converting "Installs" type from "object" to "int".

## In the **Installs** column the values are of **object** type and contain '+' and ','. We are going to remove '+' and ',' from the values and then convert them to **int**.

In [22]:
# Creating function to remove + and ,.
def remove_plus_and_comma(string):
  '''
  This function removes '+' and ',' from the string.
  '''
  string = string.replace(',','')
  string = string.strip('+')
  return string

In [23]:
# Applying the function to the column and converting the values to int.
df['Installs'] = df['Installs'].apply(lambda x: int(remove_plus_and_comma(x)))
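
The same cleaning can be written with pandas string methods instead of a row-wise function. This is only a sketch on made-up values, offered as an alternative rather than a replacement for the cell above.

```python
import pandas as pd

# Illustrative install strings in the same format as the dataset.
installs = pd.Series(['1,000+', '100+', '10,000,000+'])

installs_int = (
    installs.str.replace(',', '', regex=False)  # drop thousands separators
            .str.rstrip('+')                    # drop the trailing plus sign
            .astype(int)
)
print(installs_int.tolist())
```

Note that a value such as "1,000+" is a lower bound, so the integer is the minimum number of installs, not an exact count.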

# 4- Converting "Price" type from "object" to "float".

## **Price** values carry a $ symbol and are of object type. We'll remove the symbol and change the type.

In [24]:
# Creating function to remove $ symbol.
def remove_sign(string):
  '''
  This function removes $ symbol from the string and convert given string data type from 'str' to 'float'.
  '''
  return round(float(string.strip('$')),2)

In [25]:
# Applying the function.
df['price_in_dollar'] = df['Price'].apply(lambda x : remove_sign(x))

# 5- Converting **Last Updated** type from "object" to "datetime".

In [26]:
# Converting str to datetime format using the 'to_datetime' function.
df['Last Updated'] = pd.to_datetime(df['Last Updated'])

# 6- Handling Rating column 

In [27]:
len(df[df['Rating'].isnull()]['Rating'])

1474

## The Rating column contains 1474 NaN values. We cannot drop that many rows from the dataset.
## We are going to replace all the NaN values with the average of the non-null values.

In [28]:
# Finding average of non-null values from Rating column.
non_null_mean= round(df[~df['Rating'].isnull()]['Rating'].mean(),1)

In [29]:
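# Replacing null values with average rating.
# Assigning the result back avoids the chained-assignment warning that inplace=True on a column can raise.
df['Rating'] = df['Rating'].fillna(non_null_mean)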
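
A small sketch of mean imputation on made-up ratings may help show what this step does. One side effect worth keeping in mind: filling with the mean leaves the average roughly unchanged but shrinks the spread of the column.

```python
import numpy as np
import pandas as pd

# Illustrative ratings with two missing values.
ratings = pd.Series([4.5, np.nan, 3.5, 4.0, np.nan])

# mean() skips NaN by default, so this is the mean of the non-null values.
mean_rating = round(ratings.mean(), 1)
filled = ratings.fillna(mean_rating)

print(filled.tolist())
print(ratings.std(), filled.std())  # the spread is smaller after imputation
```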


# 7- Removing Null value from "Type" column.

## The **Type** column contains only one NaN value.

In [30]:
df[df['Type'].isnull()]['Type']

9148    NaN
Name: Type, dtype: object

## It is observed that the Type column contains a null value at index 9148.

In [31]:
# Removing the row with index 9148 and resetting the index.
df = df.drop(9148).reset_index(drop=True)

# 8- Removing Null values from "Current Ver".

In [32]:
df[df['Current Ver'].isnull()]['Current Ver'].shape

(8,)

## The 'Current Ver' column contains 8 null values.


In [33]:
# Removing the rows containing Null values from Dataframe.
df = df[~df['Current Ver'].isnull()]

# 9- Removing Null values from "Android Ver" column.

In [34]:
len(df[df['Android Ver'].isnull()]['Android Ver'])

2

## Number of Null values present in the "Android Ver" column is 2.

In [35]:
# Removing Rows containing Null value from dataframe.
df= df[~df['Android Ver'].isnull()]

## Android Ver is of object type and holds many different values. We'll create another column in the Dataframe that stores the minimum Android version required by the app.

In [36]:
# Creating a function to obtain the minimum version for the App.
def get_ver(string):
  '''
  This function returns the minimum Android version required, as a float.
  'Varies with device' is mapped to 1.0.
  '''

  if string == 'Varies with device':
    return 1.0
  else:
    # The first three characters hold the major.minor version, e.g. '4.1'.
    return float(string[0:3])
  




In [37]:
# Applying the function.
df['min_ver']=df['Android Ver'].apply(lambda x : get_ver(x))
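
For comparison, the leading version number can also be pulled out with a regular expression. This sketch uses made-up version strings and differs from the function above in one deliberate way: it leaves "Varies with device" as NaN instead of assuming 1.0.

```python
import pandas as pd

# Illustrative strings in the same format as the Android Ver column.
android_ver = pd.Series(['4.1 and up', '2.2 and up', '4.0.3 and up', 'Varies with device'])

# Capture the leading major.minor number; rows without a match become NaN.
min_ver = android_ver.str.extract(r'^(\d+\.\d+)', expand=False).astype(float)
print(min_ver)
```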

# Checking and Removing Duplicate values from the data set.

## The only column in the Dataframe whose values should be **UNIQUE**, and therefore never repeated, is the **App** column.

In [38]:
# Counting how many times each app name occurs.
a = df['App'].value_counts()

In [39]:
# Finding the number of apps that appear more than once.
(a >= 2).sum()

798

### As we can see above, 798 apps appear more than once in the Dataframe.

In [40]:
# Removing DUPLICATES from the Dataframe.
df=df.drop_duplicates(subset = 'App')
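
The number of apps that repeat is not the same as the number of rows that get removed. A minimal sketch on a made-up frame shows the difference, and also that `drop_duplicates` keeps the first occurrence of each app by default.

```python
import pandas as pd

# Illustrative data: 'A' appears three times, 'B' twice, 'C' once.
apps = pd.DataFrame({
    'App': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Reviews': [10, 12, 12, 5, 5, 7],
})

counts = apps['App'].value_counts()
repeated_apps = int((counts >= 2).sum())               # apps seen more than once
extra_rows = int(apps.duplicated(subset='App').sum())  # rows that will be dropped

deduped = apps.drop_duplicates(subset='App')           # keeps the first row per app
print(repeated_apps, extra_rows, len(deduped))
```

If the repeated rows differ (for example in review count), it may be worth sorting before dropping so that the preferred row is the one kept.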

Resetting index

In [41]:
df=df.reset_index()

In [42]:
df.drop(['index'],axis = 1,inplace=True)

In [43]:
df.shape

(9648, 15)

### - After removing "Duplicates" and "NaN values" from the Dataframe we now have a modified Dataframe with 9648 rows and 15 columns.

## Cleaning User Review data

In [44]:
r_df = review_df.copy()

In [45]:
r_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     64295 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37432 non-null  object 
 3   Sentiment_Polarity      37432 non-null  float64
 4   Sentiment_Subjectivity  37432 non-null  float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB


In [46]:
r_df.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


# Handling Null values in User data review Dataframe.

In [47]:
r_df[r_df['Translated_Review'].isnull()].shape

(26868, 5)

## There are 26868 null values in the **Translated_Review** column.

## Rows with NaN in **Translated_Review** are of no use, and we cannot impute values for this column.
## If there is no review then there will be no sentiment.
## Therefore, we will remove all rows that contain NaN in the Translated_Review column.

In [48]:
r_df= r_df[~r_df['Translated_Review'].isnull()]
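
The same filtering is often written with `dropna(subset=...)`, which can be chained with an index reset. The sketch below uses a tiny made-up frame and is an equivalent form, not a change to the notebook's logic.

```python
import numpy as np
import pandas as pd

# Illustrative reviews; the second row has no review text and no sentiment.
reviews = pd.DataFrame({
    'App': ['A', 'A', 'B'],
    'Translated_Review': ['Works great', np.nan, 'Best idea'],
    'Sentiment': ['Positive', np.nan, 'Positive'],
})

# Drop rows with a missing review, then renumber the index from 0.
cleaned = reviews.dropna(subset=['Translated_Review']).reset_index(drop=True)
print(cleaned)
```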

Resetting index

In [49]:
r_df= r_df.reset_index()

In [50]:
r_df.drop(['index'],axis=1,inplace=True)

In [51]:
r_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37427 entries, 0 to 37426
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     37427 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37427 non-null  object 
 3   Sentiment_Polarity      37427 non-null  float64
 4   Sentiment_Subjectivity  37427 non-null  float64
dtypes: float64(2), object(3)
memory usage: 1.4+ MB


# **Data Visualization**

# Q- Total number of apps in each category.

In [52]:
# Creating a temporary dataset that contains the unique categories and the number of apps in each category.
cat_df = df.groupby('Category')['App'].count().sort_values(ascending = False).reset_index()

In [53]:
# Plotting Bar graph.
fig =px.bar(cat_df,x= 'Category',y= 'App',labels={'App':'Number of Apps'},text_auto=True)
fig.update_layout(title_text='Number of Apps in each Category', title_x=0.5,title_font=dict(size =22, color='black', family='Arial, sans-serif'))

fig.show()

## **Observation:**
### As we can see, the **Family** category has the most apps in the Playstore, followed by the **Game** and **Tools** categories.
### The **Beauty** and **Comics** categories have the fewest apps.

# Q- Top 10 apps of Free type.

In [54]:
# Dataframe of free type apps.
free_df=df[df['Type']=='Free']

In [135]:
# Top 10 free apps by installs.
free_df[['App','Installs']].sort_values(by='Installs',ascending=False).head(10)

Unnamed: 0,App,Installs
2928,Google Play Movies & TV,1000000000
1354,Subway Surfers,1000000000
303,Gmail,1000000000
301,Google Chrome: Fast & Secure,1000000000
299,WhatsApp Messenger,1000000000
298,Messenger – Text and Video Chat for Free,1000000000
2189,Google Photos,1000000000
2906,YouTube,1000000000
700,Google Play Games,1000000000
2418,Maps - Navigate & Explore,1000000000


# Q- Top 10 apps of Paid type.

Paid apps are ranked by the **revenue** they generated rather than by number of installs.

In [58]:
# Creating Revenue column.
df['Revenue']= df['Installs']*df['price_in_dollar']
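
The revenue figure is an estimate: installs multiplied by the current price. A worked sketch on made-up numbers (the app names and values are hypothetical) makes the arithmetic explicit. Since install counts in this dataset are lower bounds such as "1,000+", the result is best read as a rough lower-bound estimate that ignores price changes and in-app purchases.

```python
import pandas as pd

# Hypothetical paid apps with illustrative installs and prices.
paid = pd.DataFrame({
    'App': ['App X', 'App Y'],
    'Installs': [100000, 5000],
    'price_in_dollar': [2.50, 10.00],
})

# Estimated revenue = minimum installs * listed price.
paid['Revenue'] = paid['Installs'] * paid['price_in_dollar']
print(paid.sort_values(by='Revenue', ascending=False))
```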

In [59]:
# Top apps of Paid type.
top_paid_df= df[df['Type']=='Paid'].sort_values(by='Revenue',ascending=False)

In [136]:
 # Top 10 Paid apps.
top_paid_df[['App','Revenue']].head(10)

Unnamed: 0,App,Revenue
1741,Minecraft,69900000.0
4392,I am rich,39999000.0
4396,I Am Rich Premium,19999500.0
3206,Hitman Sniper,9900000.0
6362,Grand Theft Auto: San Andreas,6990000.0
2258,Facetune - For Free,5990000.0
4603,Sleep as Android Unlock,5990000.0
7678,DraStic DS Emulator,4990000.0
3467,I'm Rich - Trump Edition,4000000.0
3463,💎 I'm rich,3999900.0


In [134]:
# Plotting Bar graph for top 10 paid apps.
top_10_paid_apps = top_paid_df[['App','Revenue']].head(10)
fig = px.bar(top_10_paid_apps,x='App',y='Revenue',color= 'App',text_auto=True)
fig.update_layout(title_text='Top 10 apps in Paid Type by Revenue', title_x=0.5,title_font=dict(size =22, color='black', family='Arial, sans-serif'))
fig.show()

## **Observation:**
### The app with the highest estimated download revenue is **Minecraft**, at about **$69.9 million**.

# Q- Number of apps in each category by type (Free & Paid).

In [63]:
type_df = df.groupby(['Category','Type'])['App'].count().reset_index()

In [121]:
# Top Category of Free type.
type_df[type_df['Type']=='Free'].sort_values(by='App',ascending=False).head(2)

Unnamed: 0,Category,Type,App
20,FAMILY,Free,1646
26,GAME,Free,877


In [122]:
# Top Category of Paid type.
type_df[type_df['Type']=='Paid'].sort_values(by='App',ascending=False).head(2)

Unnamed: 0,Category,Type,App
21,FAMILY,Paid,182
38,MEDICAL,Paid,83


In [126]:
# Category with least number of apps in Free type.
type_df[type_df['Type']=='Free'].sort_values(by='App',ascending=False).tail(2)

Unnamed: 0,Category,Type,App
9,COMICS,Free,56
4,BEAUTY,Free,53


In [124]:
# Category with least number of apps in Paid type.
type_df[type_df['Type']=='Paid'].sort_values(by='App',ascending=False).tail(2)

Unnamed: 0,Category,Type,App
32,LIBRARIES_AND_DEMO,Paid,1
19,EVENTS,Paid,1


In [132]:
fig = px.bar(type_df, x='App', y='Category', color='Type',text_auto=True,height=800,labels={'App':'Number of Apps'})
fig.update_layout(title_text='Number of Apps in Each Category type wise', title_x=0.5,title_font=dict(size =22, color='black', family='Arial, sans-serif'))
fig.show()

## **Observation:**
### In the Playstore, the **Family** category has the most apps in both types: 1646 Free and 182 Paid.
### **Comics** and **Beauty** have the fewest Free apps, with 56 and 53 respectively.
### In the Paid type, **Libraries And Demo** and **Events** contain only 1 app each.

# Q-Percentage of Review Sentiments.

In [119]:
# Plotting user sentiments using pie chart.
fig=px.pie(r_df,names='Sentiment',color='Sentiment')
fig.update_layout(title_text='Sentiments percentage wise', title_x=0.5,title_font=dict(size =22, color='black', family='Arial, sans-serif'))
fig.show()

## **Observation:**
### Of the customer reviews of Playstore apps, **64.1%** are **Positive**, **22.1%** are **Negative** and **13.8%** are **Neutral**.
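
The percentages shown in the pie chart can also be computed directly. This is a minimal sketch on a made-up Sentiment column, so the numbers are illustrative and are not the dataset's figures.

```python
import pandas as pd

# Illustrative sentiment labels: 6 positive, 3 negative, 1 neutral.
sentiment = pd.Series(['Positive'] * 6 + ['Negative'] * 3 + ['Neutral'] * 1)

# normalize=True returns fractions; multiply by 100 for percentages.
share = (sentiment.value_counts(normalize=True) * 100).round(1)
print(share)
```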


# Q- Average rating of free and paid type apps category wise.

In [117]:
#Creating avg dataframe which store Category,Type and avg. Rating columns. 
avg_df=round(df.groupby(['Category','Type'])['Rating'].mean().reset_index(),1)
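
An average on its own hides how many apps it is based on, and a group mean may rest on only a handful of apps. The sketch below, on made-up data, reports the count next to the mean; it is a suggestion for reading the results and not part of the original analysis.

```python
import pandas as pd

# Illustrative ratings; GAME/Paid has a single app.
toy = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'GAME', 'TOOLS'],
    'Type': ['Free', 'Free', 'Paid', 'Free'],
    'Rating': [4.0, 4.4, 4.8, 3.9],
})

# Mean rating and number of apps for each Category/Type pair.
summary = (
    toy.groupby(['Category', 'Type'])['Rating']
       .agg(['mean', 'count'])
       .round(1)
       .reset_index()
)
print(summary)
```

Here the highest mean belongs to the group with a single app, which is a reminder to check the count before drawing conclusions.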

In [116]:
# Top Rated category in Free type.
avg_df[avg_df['Type']=='Free'].sort_values(by='Rating',ascending=False).head()

Unnamed: 0,Category,Type,Rating
0,ART_AND_DESIGN,Free,4.4
18,EVENTS,Free,4.4
4,BEAUTY,Free,4.3
5,BOOKS_AND_REFERENCE,Free,4.3
14,EDUCATION,Free,4.3


In [115]:
# Top Rated category in paid type.
avg_df[avg_df['Type']=='Paid'].sort_values(by='Rating',ascending=False).head()

Unnamed: 0,Category,Type,Rating
40,NEWS_AND_MAGAZINES,Paid,4.8
15,EDUCATION,Paid,4.8
1,ART_AND_DESIGN,Paid,4.7
17,ENTERTAINMENT,Paid,4.6
50,SHOPPING,Paid,4.5


In [103]:
fig= px.bar(avg_df, x="Category", y="Rating",color='Type', barmode='group',height=600,text_auto=True,labels={'Rating':'Average Rating'})
fig.update_layout(title_text='Average rating category wise', title_x=0.5,title_font=dict(size =22, color='black', family='Arial, sans-serif'))
fig.show()

## **Observation:**
### Among **Free** apps, **ART AND DESIGN** and **EVENTS** are the highest-rated categories in the Playstore, with an average rating of 4.4.
### Among **Paid** apps, **NEWS AND MAGAZINES** and **EDUCATION** are the highest-rated categories, with an average rating of 4.8.