<a href="https://colab.research.google.com/github/Parag-A/Play-Store-Review-Analysis/blob/main/Play_Store_App_Review_Analysis_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Play Store App Review Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 - Parag Agrawal**


# **Project Summary -**

### **📱 Hi everybody !**

In this notebook, I analyze Google Play Store data using Python. This is my first data analysis project.

Mobile apps are everywhere. They are easy to create and can be lucrative. Because of these two factors, more and more apps are being developed. In this notebook, we will do a comprehensive analysis of the Android app market by comparing over ten thousand apps in Google Play across different categories. We'll look for insights in the data to devise strategies to drive growth and retention.

Let's take a look at the data, which consists of two files:

* **playstore data.csv:** contains all the details of the applications on Google Play. There are 13 features that describe a given app.
* **user_reviews.csv:** contains 100 reviews for each app, most helpful first. The text in each review has been pre-processed and attributed with three new features: Sentiment (Positive, Negative or Neutral), Sentiment Polarity and Sentiment Subjectivity.


# **Problem Statement**



1. What are the top categories on the Play Store?
2. Are the majority of the apps Paid or Free?
3. How important is the rating of the application?
4. Which audience categories should the app be aimed at?
5. Which category has the most installations?
6. How does the count of apps vary by genre?
7. How does the last update affect the rating?
8. How are ratings affected when the app is a paid one?
9. How are reviews and ratings correlated?
10. Let us discuss sentiment subjectivity.
11. Are subjectivity and polarity proportional to each other?
12. What is the percentage of each review sentiment?
13. How does sentiment polarity vary between paid and free apps?
14. How does Content Rating affect the app?
15. Does the Last Updated date have an effect on rating?
16. Distribution of app updates over the years.
17. Distribution of paid and free app updates over the months.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below and answers the questions listed.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
#Connect the Colab with Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
# plotly
import plotly
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import warnings
#sns.set(font_scale=1.5)
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Load Play Store Dataset
Play_store_df = pd.read_csv('/content/drive/MyDrive/Almabetter/Project/Datasets/Play Store Data1.csv')
Play_store_df

In [None]:
# Load User Review Dataset
User_Review_df = pd.read_csv('/content/drive/MyDrive/Almabetter/Project/Datasets/User Reviews1.csv')
User_Review_df

### Dataset First View

In [None]:
# Play Store Dataset First Look
Play_store_df.head()

### Dataset Rows & Columns count

In [None]:
# Play Store dataset Rows & Columns count
Play_store_df.shape

### Dataset Information

In [None]:
# Play Store Dataset Info
Play_store_df.info()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count of Play store Datasets
print(Play_store_df.isnull().sum())

In [None]:
# Visualizing the missing values
# We have already imported libraries with essentials as pd,np and sns

#Calculate the count of missing values in each column
missing_data = Play_store_df.isnull().sum()

# Set Figure size for the plot
plt.figure(figsize=(10,6))

# Create a bar plot to visualize missing values
plt.bar(missing_data.index, missing_data.values)

# Set labels and title for the plot
plt.xlabel('Columns')
plt.ylabel('Missing Values Count')
plt.title('Missing Values by Column')

# Optionally, rotate x-axis labels for better readability
plt.xticks(rotation=90)

# Show the plot
plt.show()

In [None]:
# Set the figure size for the heatmap
plt.figure(figsize=(10, 6))

# Create a heatmap to visualize missing values
sns.heatmap(Play_store_df.isnull(), cbar=False, cmap='viridis')

# Set a title for the heatmap
plt.title('Missing Values Heatmap')

# Show the plot
plt.show()

### What did you know about your dataset?

Here's what we know about our datasets

Let's first talk about our Play Store App Review dataset:


1. The dataset contains 10,358 entries.
2. It contains only float and object data types.
3. There are 13 columns in total: App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, and Android Ver.
4. Some columns have missing values, such as Rating, Current Ver and Android Ver, with Rating having the most.
5. There are no duplicate values in the dataset anymore.
6. This dataset provides information on Play Store apps: their rating, downloads, versions, category, size, number of installs, genres and more.






## ***2. Understanding Your Variables***

In [None]:
# First Dataset Columns
Play_store_df.columns

In [None]:
#  First Dataset Describe
Play_store_df.describe()

### Variables Description

Here is a description of the variables in the Play Store dataset:

1. **App** - The name of the application, with an optional short description.
2. **Category** - The category the app belongs to.
3. **Rating** - The average rating the app received from its users.
4. **Reviews** - The total number of users who have reviewed the application.
5. **Size** - The space the application occupies on the mobile phone.
6. **Installs** - The total number of installs/downloads of the application.
7. **Type** - Whether the app is free to use or paid.
8. **Price** - The price payable to install the app. For free apps, the price is zero.
9. **Content Rating** - Whether the app is suitable for all age groups or restricted to certain ones.
10. **Genres** - The other categories to which the application can belong.
11. **Last Updated** - When the application was last updated.
12. **Current Ver** - The current version of the application.
13. **Android Ver** - The Android version that can support the application.

## **Cleaning of the data**

The three features we will work with most often from here on are Installs, Size, and Price. A careful look at the dataset shows that these columns need cleaning before the code we write later can use them. Specifically, the special characters (`,` `$` `+`) and letters (`M` `k`) in the Installs, Size, and Price columns make it difficult to convert them to a numeric data type. Let's clean the columns by removing these characters and converting each column to a numeric type.
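As a rough sketch of this idea, the snippet below strips the special characters from a small made-up sample using pandas string methods. The `toy` DataFrame and its values are illustrative assumptions, not rows from the actual dataset, and the notebook's own cleaning functions take a row-by-row approach instead.

```python
import pandas as pd

# Hypothetical sample that mimics the raw string formats described above
toy = pd.DataFrame({
    "Installs": ["10,000+", "500+", "0"],
    "Price": ["$4.99", "0", "$0.99"],
})

# Remove ',' and '+' from Installs, then convert to integers
toy["Installs"] = (
    toy["Installs"].str.replace(r"[,+]", "", regex=True).astype(int)
)

# Remove '$' from Price, then convert to floats
toy["Price"] = toy["Price"].str.replace("$", "", regex=False).astype(float)

print(toy.dtypes)
```

The vectorized `.str` methods process a whole column at once, which is usually shorter than writing a helper function per column.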



**Handling the NaN values in the Play Store data and checking the unique values for each variable**

In [None]:
# This user-defined function returns the dtype, the counts of null and non-null values, the null percentage and the unique count of each column
def playstoreinfo():
  temp=pd.DataFrame(index=Play_store_df.columns)
  temp["datatype"]=Play_store_df.dtypes
  temp["not null values"]=Play_store_df.count()
  temp["null value"]=Play_store_df.isnull().sum()
  # isnull().mean() is a fraction, so multiply by 100 to get a percentage
  temp["% of the null value"]=Play_store_df.isnull().mean().round(4)*100
  temp["unique count"]=Play_store_df.nunique()
  return temp
playstoreinfo()

**Findings**

The number of null values in each column:

* `Rating` has 1465 null values, which is 14.14% of the data.
* `Type` has 1 null value, which is about 0.01% of the data.
* `Content Rating` has 1 null value, which is about 0.01% of the data.
* `Current Ver` has 8 null values, which is 0.07% of the data.
* `Android Ver` has 3 null values, which is 0.03% of the data.

## 3. ***Data Wrangling***

### Data Wrangling Code

### What all manipulations have you done and insights you found?

Let's first deal with the columns that contain fewer NaN values. For each of them, we must either find a way to replace the NaN values with valid values or find a reason for the NaN being there.

### **`1). Android Ver: There are a total of 3 NaN values in this column.`**

---



In [None]:
# The rows containing NaN values in the Android Ver column
Play_store_df[Play_store_df["Android Ver"].isnull()]

In [None]:
# Finding the different values the 'Android Ver' column takes
Play_store_df["Android Ver"].value_counts()

Since the NaN values in the Android Ver column cannot be replaced by any particular value, and since only 3 rows contain NaN values in this column, which accounts for about 0.03% of the total rows in the dataset, these rows can be dropped.

In [None]:
# dropping rows corresponding to the NaN values in the 'Android Ver' column.
Play_store_df=Play_store_df[Play_store_df['Android Ver'].notna()]
# Shape of the updated dataframe
Play_store_df.shape

We have successfully handled the NaN values in the `Android Ver` column.

### **`2). Current Ver: There are a total of 8 NaN values in this column.`**

In [None]:
# The rows containing NaN values in the Current Ver column
Play_store_df[Play_store_df["Current Ver"].isnull()]

In [None]:
# Finding the different values the 'Current Ver' column takes
Play_store_df['Current Ver'].value_counts()

Since only 8 rows contain NaN values in the Current Ver column, which accounts for just around 0.07% of the total rows in the dataset, and there is no particular value with which we can replace them, these rows can be dropped.

In [None]:
# dropping rows corresponding to the values which contain NaN in the column 'Current Ver'.
Play_store_df=Play_store_df[Play_store_df["Current Ver"].notna()]
# Shape of the updated dataframe
Play_store_df.shape

### **`3). Type: There is only one NaN value in this column.`**

In [None]:
# The row containing NaN values in the Type column
Play_store_df[Play_store_df["Type"].isnull()]

In [None]:
# Finding the different values the 'Type' column takes
Play_store_df["Type"].value_counts()

The `Type` column takes only two values, `Free` and `Paid`. If an app is paid, its price appears in the corresponding `Price` column; otherwise the price is '0'. In this row the price is '0', which means the app is free. Hence we can replace this NaN value with `Free`.

In [None]:
# Replacing the NaN value in 'Type' column corresponding to row index 9148 with 'Free'
Play_store_df.loc[9148,'Type']='Free'

In [None]:
Play_store_df[Play_store_df["Type"].isnull()]
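The same rule can also be expressed without a hard-coded row index. The sketch below is a minimal illustration on made-up rows (the `toy` DataFrame is an assumption, not the real data): it fills `Type` with `Free` wherever `Type` is missing and the price string is `'0'`.

```python
import pandas as pd

# Hypothetical rows; Price is still a string at this stage of the notebook
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Type": ["Free", None, "Paid"],
    "Price": ["0", "0", "$2.99"],
})

# Select rows where Type is missing and the price is zero
mask = toy["Type"].isnull() & (toy["Price"] == "0")

# Fill only those rows with 'Free'
toy.loc[mask, "Type"] = "Free"

print(toy)
```

A mask of this kind keeps working if the row order or index changes, which a fixed index such as 9148 does not.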

### **`4). Rating: This column contains 1461 NaN values.`**

In [None]:
# The rows containing NaN values in the Rating column
Play_store_df[Play_store_df['Rating'].isnull()]

We know that the rating of any app in the Play Store lies between 1 and 5. Let's check whether any ratings fall outside this range.

In [None]:
Play_store_df[(Play_store_df['Rating'] <1) | (Play_store_df['Rating']>5)]

* The `Rating` column contains 1461 NaN values, which accounts for approximately 14.14% of the rows in the entire dataset. It is not practical to drop these rows, because we would lose a large amount of data, which may affect the quality of the analysis.
* The NaN values in this case can be imputed with an aggregate (mean or median) of the remaining values in the Rating column.

In [None]:
# Finding mean and median in the Rating column excluding the NaN values.

mean_rating = round(Play_store_df[~Play_store_df['Rating'].isnull()]['Rating'].mean(),4)

median_rating = Play_store_df[~Play_store_df['Rating'].isnull()]['Rating'].median()

[mean_rating , median_rating]

**Visualization of the distribution of ratings using distplot and detecting the outliers through a boxplot.**

In [None]:
fig, ax = plt.subplots(2,1, figsize=(12,7))
sns.distplot(Play_store_df['Rating'],color='firebrick',ax=ax[0])
sns.boxplot(x='Rating',data=Play_store_df, ax=ax[1])

* The mean of the average ratings (excluding the NaN values) comes to 4.2.

* The median of the entries (excluding the NaN values) in the 'Rating' column comes to 4.3. From this we can say that about half of the apps have an average rating of 4.3 or above, and the rest 4.3 or below.
* From the distplot visualization, it is clear that the ratings are left skewed.
* When a variable is skewed, the mean is pulled towards the values at the far end of the distribution. Therefore, the median is a better representation of the majority of the values in the variable.
* Hence we will impute the NaN values in the Rating column with its median.

In [None]:
# Replacing the NaN values in the 'Rating' column with its median value
# Assigning the result back avoids calling fillna(inplace=True) on a column selection
Play_store_df['Rating'] = Play_store_df['Rating'].fillna(median_rating)
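To see why the median is the safer choice for a skewed column, here is a small sketch on invented numbers. The `ratings` values are assumptions chosen to be left skewed, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical left-skewed ratings with two missing values
ratings = pd.Series([4.5, 4.4, 4.3, 4.3, 4.2, 1.0, np.nan, np.nan])

mean_value = ratings.mean()      # pulled down by the single 1.0 rating
median_value = ratings.median()  # less sensitive to that low value

# Impute the missing values with the median
filled = ratings.fillna(median_value)

print(round(mean_value, 2), median_value)
```

In this toy series a single low rating drags the mean well below where most values sit, while the median stays with the bulk of the data.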

# **Handling duplicate values and manipulating the dataset:**


### **`1). Handling the duplicates in the App column`**

In [None]:
# Handling the error values in the Play store data
Play_store_df.head()

In [None]:
Play_store_df['App'].value_counts()

In [None]:
# Inspecting the duplicates values.
Play_store_df[Play_store_df['App']=='ROBLOX']

In [None]:
Play_store_df[Play_store_df.duplicated()]

In [None]:
# dropping duplicates from the 'App' column.
Play_store_df.drop_duplicates(subset = 'App', inplace = True)
Play_store_df.shape


In [None]:
# Checking whether the duplicates in the 'App' column are taken care of or not
Play_store_df[Play_store_df['App']=='ROBLOX']

We have successfully handled all the duplicate values in the App column. After dropping the duplicate rows in the App column, 9649 rows remain.
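`drop_duplicates` keeps the first occurrence of each app by default. If one wanted to control which duplicate survives, one possible variation is to sort first, for example keeping the row with the most reviews. The sketch below shows this on made-up data; the `toy` DataFrame and its review counts are assumptions, and this is an alternative to, not a description of, the step used above.

```python
import pandas as pd

# Hypothetical data with one app listed twice
toy = pd.DataFrame({
    "App": ["ROBLOX", "ROBLOX", "Notes"],
    "Reviews": [4447388, 4449910, 120],
})

# Sort so the row with the most reviews comes first for each app,
# keep that first row per app, then restore the original row order
deduped = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates(subset="App", keep="first")
       .sort_index()
)

print(deduped)
```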

### **`2). Changing the datatype of the Last Updated column from string to datetime.`**

In [None]:
# Pandas to_datetime() function applied to the values in the last updated column helps to convert string Date time into Python Date time object.
Play_store_df["Last Updated"] = pd.to_datetime(Play_store_df['Last Updated'])
Play_store_df.head()

### **`3). Changing the datatype of the Price column from string to float`.**

In [None]:
#Converting the data type of the Price column from string to float for easier use.
Play_store_df['Price'].value_counts()

To convert this column from string to float, we must first drop the $ symbol from all the values. Then we can assign the float datatype to those values.

We apply the `convert_dollar` function to convert the values in the `Price` column from string datatype to float datatype.

In [None]:
# Creating a function convert_dollar which drops the $ symbol if it is present and returns the output as a float.
def convert_dollar(val):
  '''
  This function drops the $ symbol if present and returns the value with float datatype.
  '''
  if '$' in val:
    return float(val[1:])
  else:
    return float(val)

In [None]:
# The convert_dollar function applied to the Price column
Play_store_df['Price']=Play_store_df['Price'].apply(convert_dollar)
Play_store_df.head()

In [None]:
#Check for Paid Application
Play_store_df[Play_store_df['Price']!=0].head()

We have successfully converted the datatype of values in the Price column from string to float.

### **`4). Converting the values of the Installs column from string datatype to integer datatype.`**

In [None]:
# Checking the contents of the 'Installs' column
Play_store_df['Installs'].value_counts()

To convert all the values in the **Installs** column from string datatype to integer datatype, we must first drop the '+' and ',' symbols from all the entries where present, and then change the datatype.

We apply the `convert_plus` function to convert the values in the `Installs` column from string datatype to integer datatype.

In [None]:
# Creating a function convert_plus which drops the '+' and ',' symbols if present and returns the output as an integer.

def convert_plus(val):
  '''
  This function drops the + and , symbols if present and returns the value with int datatype.
  '''
  # str.replace leaves the string unchanged when the symbol is absent, so no branching is needed
  return int(val.replace('+', '').replace(',', ''))

In [None]:
# The convert_plus function applied to the Installs column

Play_store_df['Installs'] = Play_store_df['Installs'].apply(lambda x: convert_plus(x))
Play_store_df.head()

The resulting values in the **Installs** column are of the integer datatype, and each represents the minimum number of times a particular app has been installed.





* **Installs** = 0 indicates that the app has not been installed by anyone yet.
* **Installs** = 1 indicates that the app has been installed by at least one user.
* **Installs** = 1000000 indicates that the app has been installed by at least one million users, and so on.
* We have successfully converted the datatype of the values in the Installs column from string to int.

### **`5). Converting the values in the Size column to the same unit of measure (MB).`**

In [None]:
Play_store_df['Size'].value_counts()

We can see that the values in the Size column use different units: 'M' stands for MB and 'k' stands for KB. To analyse this column easily, all the values must be in a single unit. In this case, we will convert everything to MB.

Since 1 MB = 1024 KB, we convert KB to MB by dividing the KB values by 1024.

In [None]:
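# Defining a function to convert all the entries in KB to MB and then converting them to float datatype.

def convert_kb_to_mb(val):
  '''
  This function converts all the valid entries in KB to MB and returns the result in float datatype.
  Entries that cannot be parsed (for example 'Varies with device' or NaN) are returned unchanged.
  '''
  try:
    if 'M' in val:
      return float(val[:-1])
    elif 'k' in val:
      return round(float(val[:-1])/1024, 4)
    else:
      return val
  except (TypeError, ValueError):
    # TypeError: val is not a string; ValueError: the numeric part could not be parsed
    return val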

We apply the `convert_kb_to_mb` function to convert the values in the Size column to a single unit of measure (MB) and the datatype from string to float.

In [None]:
# The convert_kb_to_mb function applied to the Size column

Play_store_df['Size'] = Play_store_df['Size'].apply(lambda x: convert_kb_to_mb(x))
Play_store_df.head()

In [None]:
# 'Varies with device' cannot be parsed as a number, so errors='coerce' turns it into NaN
Play_store_df['Size'] = pd.to_numeric(Play_store_df['Size'], errors='coerce')

Many entries in the Size column hold the value Varies with device. Since this entry cannot be used for analysis, let's see whether it can be imputed with the mean or median of the other entries in this column.

In [None]:
# Finding max, min, mean, and median in the Size column.
# The 'Varies with device' entries are NaN by now, and pandas skips NaN in these aggregations.

max_size = Play_store_df['Size'].max()

min_size = Play_store_df['Size'].min()

mean_size = round(Play_store_df['Size'].mean(),4)

median_size = Play_store_df['Size'].median()

[max_size, min_size, mean_size, median_size]

**Visualization of the distribution of `Size` using distplot and detecting the outliers through a boxplot.**

In [None]:
# Distplot
fig, ax = plt.subplots(2,1, figsize=(12,7))
sns.distplot(Play_store_df['Size'].dropna(), color='purple', ax=ax[0])
sns.boxplot(x='Size',data=Play_store_df, ax=ax[1])

* It is clear from the visualizations that the data in the **Size** column is skewed towards the right.
* Also, many entries in this column had the value **Varies with device**. Replacing them with a central tendency value (mean or median) may give misleading visualizations and results. Hence these values are left as missing (NaN).

* We have successfully converted all the valid entries in the **Size** column to a single unit of measure (MB) and the datatype from string to float.
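The Size conversion described in this section can also be written in vectorized form. The sketch below is one possible approach on made-up strings: the `size_raw` values and the regular expression are assumptions for illustration, and unlike the function above it does not round the result.

```python
import pandas as pd

# Hypothetical raw Size strings
size_raw = pd.Series(["19M", "512k", "Varies with device", "1.5M"])

# Split each entry into a numeric part and a unit suffix (M or k);
# entries that do not match the pattern yield NaN in both columns
parts = size_raw.str.extract(r"^(?P<number>[\d.]+)(?P<unit>[Mk])$")

# Map each unit to its factor in MB (1 MB = 1024 KB)
factor = parts["unit"].map({"M": 1.0, "k": 1 / 1024})

# Multiply the number by the factor; unmatched entries stay NaN
size_mb = parts["number"].astype(float) * factor

print(size_mb)
```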

### **`6). Converting the datatype of values in the Reviews column from string to int.`**

In [None]:
# Converting the datatype of the values in the reviews column from string to int
Play_store_df['Reviews'] = Play_store_df['Reviews'].astype(int)
Play_store_df.head()

In [None]:
Play_store_df.describe()

We have successfully converted the datatype of the values in the Reviews column from string to int.

Now that we have handled the errors and NaN values in the playstore data.csv file, let's do the same for the user_reviews.csv file.

## **User Review Dataset Cleaning**

### Dataset First View

In [None]:
# Checking the top 5 rows of the data (head() returns 5 rows by default)
User_Review_df.head()

### Dataset Rows & Columns count

In [None]:
# User Review dataset Rows & Columns count
User_Review_df.shape

### Dataset Information

In [None]:
# User Review Dataset Info
User_Review_df.info()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count of the User Review Dataset
print(User_Review_df.isnull().sum())

In [None]:
# Visualizing the missing values
# We have already imported libraries with essentials as pd,np and sns

#Calculate the count of missing values in each column
missing_data = User_Review_df.isnull().sum()

# Set Figure size for the plot
plt.figure(figsize=(10,6))

# Create a bar plot to visualize missing values
plt.bar(missing_data.index, missing_data.values)

# Set labels and title for the plot
plt.xlabel('Columns')
plt.ylabel('Missing Values Count')
plt.title('Missing Values by Column')

# Optionally, rotate x-axis labels for better readability
plt.xticks(rotation=90)

# Show the plot
plt.show()

In [None]:
# Set the figure size for the heatmap
plt.figure(figsize=(10, 6))

# Create a heatmap to visualize missing values
sns.heatmap(User_Review_df.isnull(), cbar=False, cmap='viridis')

# Set a title for the heatmap
plt.title('Missing Values Heatmap')

# Show the plot
plt.show()

### Variables Description

**Let us first define what information the columns contain based on our inspection.**

user_reviews dataframe has 64295 rows and 5 columns. The 5 columns are identified as follows:

* **App:** Contains the name of the app with a short description (optional).
* **Translated_Review:** It contains the English translation of the review dropped by the user of the app.
* **Sentiment:** It gives the attitude/emotion of the writer. It can be ‘Positive’, ‘Negative’, or ‘Neutral’.
* **Sentiment_Polarity:** It gives the polarity of the review. Its range is [-1,1], where 1 means ‘Positive statement’ and -1 means a ‘Negative statement’.
* **Sentiment_Subjectivity:** This value indicates how much of the review is personal opinion rather than factual information. Its range is [0,1]. The higher the subjectivity, the more opinion-based the review is; lower subjectivity indicates that the review is closer to factual information.
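As a rough illustration of how the `Sentiment` label and `Sentiment_Polarity` relate, the sketch below maps a polarity score to a label by its sign. This rule is an assumption made for illustration; the source does not state how the labels in the dataset were actually derived, and `label_from_polarity` is a hypothetical helper.

```python
def label_from_polarity(polarity: float) -> str:
    """Hypothetical helper: map a polarity score in [-1, 1] to a label."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


# Example scores: clearly positive, mildly negative, and exactly neutral
labels = [label_from_polarity(p) for p in (0.8, -0.25, 0.0)]
print(labels)
```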

In [None]:
def Urinfo():
  temp1=pd.DataFrame(index=User_Review_df.columns)
  temp1["datatype"]=User_Review_df.dtypes
  temp1["not null values"]=User_Review_df.count()
  temp1["null value"]=User_Review_df.isnull().sum()
  temp1["% of the null value"]=User_Review_df.isnull().mean().round(4)*100
  temp1["unique count"]=User_Review_df.nunique()
  return temp1
Urinfo()

**Findings**

The numbers of null values after removing duplicates are:
* **Translated_Review** has 987 null values, which account for **3.22%** of the data.
* **Sentiment** has 982 null values, which account for **3.20%** of the data.
* **Sentiment_Polarity** has 982 null values, which account for **3.20%** of the data.
* **Sentiment_Subjectivity** has 982 null values, which account for **3.20%** of the data.

### **Handling the error and NaN values in the User reviews**

In [None]:
# Finding the total no of NaN values in each column.
User_Review_df.isnull().sum()

There are a lot of NaN values. We need to analyse these values and see how we can handle them.

In [None]:
# checking the NaN values in the translated review column
User_Review_df[User_Review_df['Translated_Review'].isnull()]

There are a total of 987 rows containing NaN values in the Translated_Review column.

We can say that rows which do not have a review (a NaN value instead) also tend to have NaN values in the `Sentiment`, `Sentiment_Polarity`, and `Sentiment_Subjectivity` columns in the majority of cases.

**Lets check if there are any exceptions.**

In [None]:
# The rows corresponding to the NaN values in the translated_review column, where the rest of the columns are non null.
User_Review_df[User_Review_df['Translated_Review'].isnull() & User_Review_df['Sentiment'].notna()]

In the few exceptional cases where the remaining columns are non-null although the Translated_Review column is null, the values appear to be errors. This is because the sentiment, sentiment polarity and sentiment subjectivity of a review can be determined only if there is a corresponding review.

Hence these values are wrong and can be deleted altogether.

In [None]:
# Deleting the rows containing NaN values
User_Review_df = User_Review_df.dropna()
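To make the effect of this step concrete, the sketch below reproduces both cases on a tiny made-up table: a row with no review and no sentiment, and an inconsistent row with a sentiment but no review. The `toy` DataFrame is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical review rows covering the complete, empty and inconsistent cases
toy = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Translated_Review": ["Great app", np.nan, np.nan],
    "Sentiment": ["Positive", np.nan, "Neutral"],
    "Sentiment_Polarity": [0.8, np.nan, 0.0],
    "Sentiment_Subjectivity": [0.75, np.nan, 0.0],
})

# Rows with no review text but a sentiment label (the inconsistent case)
inconsistent = toy[toy["Translated_Review"].isnull() & toy["Sentiment"].notna()]

# dropna() removes every row that has at least one NaN,
# so both the empty row and the inconsistent row are dropped
cleaned = toy.dropna()

print(len(inconsistent), len(cleaned))
```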

In [None]:
# The shape of the updated df
User_Review_df.shape

There are a total of 29692 rows in the updated df.

Hence we have taken care of all the NaN values in the df.

Let's inspect the updated df.

In [None]:
# Inspecting the sentiment column
User_Review_df['Sentiment'].value_counts()

The values in the `Sentiment_Polarity` and `Sentiment_Subjectivity` columns look correct.

On the given datasets, we successfully developed a data pipeline. We can now examine this data flow and create user-friendly visuals. It is easy to compare different measures using the visualizations, and thus to draw implications from them.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### **Data Visualization on play store data:**
We have successfully cleaned the data. Now we can perform some data visualization and draw insights from the given datasets.



### **`1) Correlation Heatmap`**

In [None]:
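# Finding correlation between the numeric columns in the play store data
# numeric_only=True skips text and date columns, which would otherwise raise an error in recent pandas versions
Play_store_df.corr(numeric_only=True)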

In [None]:
# Heat map for play_store
plt.figure(figsize = (20,10))
sns.heatmap(Play_store_df.corr(numeric_only=True), annot= True)
plt.title('Correlation Heatmap for Playstore Data', size=20)

* There is a strong positive correlation between the `Reviews` and `Installs` columns. This is expected: the higher the number of installs, the larger the user base, and the more reviews users leave.
* The `Price` is slightly negatively correlated with `Rating`, `Reviews`, and `Installs`. This means that as the price of an app increases, the average rating, the total number of reviews and the installs tend to fall slightly.
* The `Rating` is slightly positively correlated with the `Installs` and `Reviews` columns. This indicates that apps with a higher average user rating also tend to have more installs and more reviews.
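For readers who want to reproduce this kind of check on their own data, here is a minimal sketch on invented numbers. The `toy` DataFrame is an assumption, `numeric_only=True` requires pandas 1.5 or later, and the Spearman variant is an addition that the analysis above does not use.

```python
import pandas as pd

# Hypothetical apps with one text column and three numeric columns
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Reviews": [10, 200, 3000, 40000],
    "Installs": [100, 5000, 100000, 1000000],
    "Price": [4.99, 0.99, 0.0, 0.0],
})

# Pearson correlation; numeric_only=True skips text columns such as App
corr = toy.corr(numeric_only=True)

# Spearman correlation works on rank order, so it is less affected by very large values
rank_corr = toy[["Reviews", "Installs"]].corr(method="spearman")

print(corr)
print(rank_corr)
```

Because install and review counts span several orders of magnitude, a rank-based measure can be a useful cross-check on the Pearson values shown in the heatmap.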

#### **Let us check whether there is any correlation across the two dataframes.**

In [None]:
merged_df = pd.merge(Play_store_df, User_Review_df, on='App', how = "inner")
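An inner merge keeps only the apps that appear in both dataframes, so apps without reviews (and reviews for unknown apps) silently drop out. The sketch below shows this on two tiny made-up tables, and uses the `indicator=True` option of `pd.merge` on an outer join to count what an inner join would discard. The `apps` and `reviews` frames are illustrative assumptions.

```python
import pandas as pd

# Hypothetical app table and review table that only partly overlap
apps = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.1, 3.9, 4.6]})
reviews = pd.DataFrame({
    "App": ["A", "A", "D"],
    "Sentiment": ["Positive", "Negative", "Neutral"],
})

# Inner join: keeps only apps that appear in both tables
inner = pd.merge(apps, reviews, on="App", how="inner")

# Outer join with indicator=True adds a _merge column showing each row's origin
outer = pd.merge(apps, reviews, on="App", how="outer", indicator=True)
origin_counts = outer["_merge"].value_counts()

print(inner)
print(origin_counts)
```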

In [None]:
def merged_dfinfo():
  temp = pd.DataFrame(index=merged_df.columns)
  temp['data_type'] = merged_df.dtypes
  temp["count of non null values"] = merged_df.count()
  temp['NaN values'] = merged_df.isnull().sum()
  temp['% NaN values'] =merged_df.isnull().mean()
  temp['unique_count'] = merged_df.nunique()
  return temp
merged_dfinfo()

In [None]:
# numeric_only=True restricts the correlation to numeric columns
merged_df.corr(numeric_only=True)

In [None]:
# Correlation heatmap
# Heat Map for the merged data frame
plt.figure(figsize = (15,10))
sns.heatmap(merged_df.corr(numeric_only=True), annot= True, cmap='Greens')
plt.title(' Heatmap for merged Dataframe', size=20)

In [None]:
merged_df = merged_df.dropna(subset=['Sentiment', 'Translated_Review'])

In [None]:
merged_df.head()

### **`2) What is the ratio of number of Paid apps and Free apps?`**


In [None]:
data = Play_store_df['Type'].value_counts()
# Take the labels from the counts so they always match the order of the slices
labels = data.index

# create pie chart
plt.figure(figsize=(10,10))
colors = ["#00EE76","#7B8895"]
explode=(0.01,0.1)
plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Distribution of Paid and Free apps',size=15,loc='center')
plt.legend()

**Findings:**

From the above graph we can see that about 93% of the apps in the Google Play Store are free and about 7% are paid.
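The percentages shown on a pie chart can also be computed directly, which is handy for quoting exact figures. The sketch below uses `value_counts(normalize=True)` on a made-up series; the `app_type` values are an assumption and do not reflect the real split.

```python
import pandas as pd

# Hypothetical Type column: nine free apps and one paid app
app_type = pd.Series(["Free"] * 9 + ["Paid"])

# normalize=True returns fractions instead of counts; scale them to percentages
share = app_type.value_counts(normalize=True).mul(100).round(2)

print(share)
```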

In [None]:
Play_store_df['Content Rating'].unique()

### **`3) Which category of Apps from the Content Rating column are found more on playstore ?`**

In [None]:
# Content rating of the apps
data = Play_store_df['Content Rating'].value_counts()
# Take the labels from the counts so each slice is labelled with its own category
labels = data.index

#create pie chart
plt.figure(figsize=(10,10))
explode=(0,0.1,0.1,0.1,0.0,1.3)
colors = ['C4', 'r', 'c', 'g', 'm', 'k']
plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Content Rating',size=20,loc='center')
plt.legend()

A majority of the apps (81%) in the Play Store can be used by everyone. The remaining apps have various age restrictions.

### **`4) Top categories on Google Playstore?`**

In [None]:
Play_store_df.groupby("Category")["App"].count().sort_values(ascending= False)

In [None]:
# Count of apps per category, in descending order
category_counts = Play_store_df['Category'].value_counts()
x_list = category_counts.tolist()        # number of apps in each category
y_list = category_counts.index.tolist()  # category names

In [None]:
#Number of apps belonging to each category in the playstore
plt.figure(figsize=(20,8))
graph = sns.barplot(y = x_list, x = y_list, palette= "tab10")
# Categories are on the x-axis and app counts on the y-axis
graph.set_xlabel('App Categories', fontsize=15)
graph.set_ylabel('Number of Apps', fontsize=15)
graph.set_title("Top categories on Playstore", fontsize = 25)
graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right');

**Findings:**

There are 33 categories in the dataset. From the above output we can conclude that most apps in the Play Store fall under the `FAMILY` and `GAME` categories, and the fewest under the `EVENTS` and `BEAUTY` categories.

In [None]:
# Percentage of apps belonging to each category in the playstore
plt.figure(figsize=(11,11))
plt.pie(Play_store_df.Category.value_counts(), labels=Play_store_df.Category.value_counts().index, autopct='%1.2f%%')
my_circle = plt.Circle( (0,0), 0.50, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.title('% of apps share in each Category', fontsize = 25)
plt.show()


### **`5) Which category of apps has the most installs?`**

In [None]:
# total app installs in each category of the play store

a = Play_store_df.groupby(['Category'])['Installs'].sum().sort_values()
a.plot.barh(figsize=(15,8), color = 'c', )
# barh puts the categories on the y-axis and the install totals on the x-axis
plt.ylabel('App Categories', fontsize = 15)
plt.xlabel('Total app Installs', fontsize = 15)
plt.title('Total app installs in each category', fontsize = 20)

**Findings:**

This shows which categories of apps have the most installs. The `Game`, `Communication` and `Tools` categories have the highest number of installs compared to the other categories.

### **`6). Average rating of the apps`**

In [None]:
# Number of apps at each rating value, ordered by rating

Play_store_df['Rating'].value_counts().sort_index().plot.bar(figsize=(20,8), color = 'm' )
plt.xlabel('Average rating',fontsize = 15 )
plt.ylabel('Number of apps', fontsize = 15)
plt.title('Average rating of apps in Playstore', fontsize = 20)
plt.legend()

We can represent the ratings in a better way if we group the ratings between certain intervals. Here, we can group the rating as follows:

* 4-5: Top rated
* 3-4: Above average
* 2-3: Average
* 1-2: Below average

**Let's create a new column `Rating_group` in the main dataframe and apply these groups.**

In [None]:
# Defining a function Rating_app to group the ratings as mentioned above
def Rating_app(val):
  '''
  This function categorises a rating from 1 to 5 as
  Top rated, Above Average, Average or Below Average.
  Each boundary value (4, 3, 2) belongs to the higher group.
  '''
  if val>=4:
    return 'Top rated'
  elif val>=3:
    return 'Above Average'
  elif val>=2:
    return 'Average'
  else:
    # Ratings below 2 (and missing ratings, which fail every comparison) end up here
    return 'Below Average'

**Let's apply the `Rating_app` function on the Rating column and save the output in a new column named `Rating_group` in the main df.**

In [None]:
# Applying the Rating_app function
Play_store_df['Rating_group']=Play_store_df['Rating'].apply(Rating_app)
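
As an alternative to a hand-written function, the same kind of grouping can be sketched with `pd.cut`. This is only an illustration on made-up ratings. It assumes each boundary value (2, 3, 4) belongs to the higher group, and unlike the function above it leaves missing ratings as `NaN` instead of assigning them to a group:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, including a missing value
ratings = pd.Series([4.5, 4.0, 3.0, 2.5, 1.9, 5.0, np.nan])

# right=False makes the bins [0, 2), [2, 3), [3, 4) and [4, inf)
rating_group = pd.cut(
    ratings,
    bins=[0, 2, 3, 4, np.inf],
    labels=['Below Average', 'Average', 'Above Average', 'Top rated'],
    right=False,
)
print(rating_group)
```

Keeping missing ratings as `NaN` means they are not counted in any group when `value_counts()` is called on the result.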

In [None]:
# Number of apps in each rating group
Play_store_df['Rating_group'].value_counts().plot.bar(figsize=(15,5), color = 'royalblue')
plt.xlabel('Rating Group', fontsize = 12)
plt.ylabel('Number of apps', fontsize = 12)
plt.title('Number of apps in each rating group', fontsize = 18)
plt.xticks(rotation=0)
plt.legend()

### **`7). What are the Top 10 installed apps in any category?`**

In [None]:
def findtop10incategory(category):
    '''Plot the ten most-installed apps of the given category (case-insensitive).'''
    # The parameter is not called `str`, which would shadow the built-in type
    category = category.upper()
    top10 = Play_store_df[Play_store_df['Category'] == category]
    top10apps = top10.sort_values(by='Installs', ascending=False).head(10)
    plt.figure(figsize=(20,5), dpi=100)
    plt.title('Top 10 Installed Apps',size = 20)
    graph = sns.barplot(x = top10apps.App, y = top10apps.Installs, palette= "icefire")
    graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right')

In [None]:
findtop10incategory('GAME')

**Findings:**

From the above graph we can see that in the **`Game`** category, **`Subway Surfers, Pou and Temple Run 2`** have the highest installs. In the same way, by passing a different category name to the function, we can get the top 10 installed apps of any category.

### **`8). Top apps that are of free type.`**


In [None]:
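# Creating a df for only free apps
# .copy() makes an independent frame, so columns can be added to it later without a SettingWithCopyWarning

free_df = Play_store_df[Play_store_df['Type'] == 'Free'].copy()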

In [None]:
# Creating a df for top free apps

top_free_df = free_df[free_df['Installs'] == free_df['Installs'].max()]
top10free_apps=top_free_df.nlargest(10, 'Installs', keep='first')
top10free_apps.head(10)

In [None]:
# Top free apps

top_free_df['App']

In [None]:
# Categories in which the top 20 free apps belong to
top_free_df['Category'].value_counts().plot.bar(figsize=(20,6), color= ('darkcyan','blueviolet'))
plt.xlabel('Category', size=15)
plt.ylabel('Number of apps', size=15)
plt.title('Categories in which the top 20 free apps belong', size=19)
plt.xticks(rotation=45)
plt.legend()

### **`9). Top apps that are of paid type.`**

In [None]:
# Creating a df containing only paid apps
# .copy() makes an independent frame, so new columns such as 'Revenue' can be added safely
paid_df=Play_store_df[Play_store_df['Type']=='Paid'].copy()

In [None]:
# Number of apps that can be installed at a particular price

paid_df.groupby('Price')['App'].count().sort_values(ascending= False).plot.bar(figsize = (20,6), color = 'crimson')

* Paid apps charge users a certain amount to download and install the app. This amount varies from one app to another.
* Many apps charge a small amount, whereas some charge much more. In this dataset the price of an app ranges from USD 0.99 to USD 400.
* To select the top paid apps, it would not be fair to look only at the number of installs, because apps with a lower installation fee will generally be installed by more people.
* A better way to determine the top apps in the paid category is to find the revenue each app generated through installs.
* This is given by:

 Revenue generated through installs = (Number of installs) x (Price to install the app)
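
To see why revenue and installs can rank apps differently, here is a small worked example of the formula. The three apps and their numbers are hypothetical and are not taken from the dataset:

```python
import pandas as pd

# Hypothetical paid apps (names, installs and prices are made up)
paid_sample = pd.DataFrame({
    'App': ['App A', 'App B', 'App C'],
    'Installs': [1_000_000, 50_000, 10_000],
    'Price': [0.99, 4.99, 399.99],
})

# Revenue generated through installs = installs x price
paid_sample['Revenue'] = paid_sample['Installs'] * paid_sample['Price']

# Rank by revenue instead of by installs
top = paid_sample.nlargest(2, 'Revenue')
print(top[['App', 'Installs', 'Revenue']])
```

In this sample 'App A' has by far the most installs, but 'App C' generates the most revenue because of its high price.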


**Lets define a new column Revenue in paid_df which gives the revenue generated by the app through installs alone.**

In [None]:
# Creating a new column 'Revenue' in paid_df
paid_df['Revenue'] = paid_df['Installs']*paid_df['Price']
paid_df.head()

In [None]:
# Top app in the paid category

paid_df[paid_df['Revenue'] == paid_df['Revenue'].max()]

In [None]:
# Top 10 paid apps in the play store
top10paid_apps=paid_df.nlargest(10, 'Revenue', keep='first')
top10paid_apps['App']

In [None]:
# Categories in which the top 10 paid apps belong to
top10paid_apps['Category'].value_counts().plot.bar(figsize=(15,5), color= ["orange", "red", "green", "blue", "purple"])
plt.xlabel('Category',size=12)
plt.ylabel('Number of apps',size=12)
plt.title('Categories in which the top 10 paid apps belong', size=15)
plt.xticks(rotation=0)
plt.legend()

In [None]:
# Top paid apps according to the revenue generated through installs alone
top10paid_apps.groupby('App')['Revenue'].mean().sort_values(ascending= True).plot.barh(figsize=(16,10), color='darkorange')
plt.xlabel('Revenue Generated (USD)', size=15)
plt.title('Top apps based on revenue generated through installation fee', size=20)
plt.legend()

In [None]:
# Paid apps with the highest number of installs
paid_df[paid_df['Installs'] == paid_df['Installs'].max()]

### **`10). Distribution of apps based on its size`**

In [None]:
# Values calculated earlier
[mean_size,median_size,max_size,min_size]


*   The size of an app in our dataset varies from 0.0083 MB to 100 MB. We can analyse the app sizes by grouping them into intervals.
*   We have already established that the numeric values in the 'Size' column are concentrated towards the smaller sizes (the distribution is right-skewed).
*   Let's group the data in the 'Size' column into intervals of 10 MB as follows:

(Below 1 MB, 1-10, 10-20, 20-30, ..., 80-90, 90 and above, 'Varies with device')



**Lets create a function to create the size intervals**

In [None]:
# Function to group the apps based on its size in MB

def size_apps(var):
  '''
  This function groups the size of an app
  between ~0 to 100 MB into certain intervals.
  '''
  try:
    if var < 1:
      return 'Below 1'
    elif var >= 1 and var <10:
      return '1-10'
    elif var >= 10 and var <20:
      return '10-20'
    elif var >= 20 and var <30:
      return '20-30'
    elif var >= 30 and var <40:
      return '30-40'
    elif var >= 40 and var <50:
      return '40-50'
    elif var >= 50 and var <60:
      return '50-60'
    elif var >= 60 and var <70:
      return '60-70'
    elif var >= 70 and var <80:
      return '70-80'
    elif var >= 80 and var <90:
      return '80-90'
    else:
      return '90 and above'
  except TypeError:
    # Raised when var is a string such as 'Varies with device'; keep it unchanged
    return var

**Let's apply the `size_apps` function on the Size column and store the results in a new column named `size_group`.**

In [None]:
Play_store_df['size_group']=Play_store_df['Size'].apply(size_apps)
Play_store_df.head()
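
The same size intervals can also be sketched with `pd.cut` instead of a chain of `elif` branches. This is a minimal illustration on a made-up `sizes` series. The variable names are hypothetical, and the sketch assumes the column mixes numbers in MB with the text 'Varies with device':

```python
import numpy as np
import pandas as pd

# Hypothetical sizes in MB, plus one non-numeric entry
sizes = pd.Series([0.5, 1.0, 9.9, 10.0, 45.2, 100.0, 'Varies with device'])

# Non-numeric entries become NaN so that the numeric ones can be binned
numeric = pd.to_numeric(sizes, errors='coerce')

# Bin edges 0, 1, 10, 20, ..., 90, inf with one label per bin
edges = [0, 1] + list(range(10, 100, 10)) + [np.inf]
labels = (['Below 1', '1-10']
          + [f'{lo}-{lo + 10}' for lo in range(10, 90, 10)]
          + ['90 and above'])
binned = pd.cut(numeric, bins=edges, labels=labels, right=False)

# Put the original text back wherever the value was not numeric
size_group = binned.astype(object).where(numeric.notna(), sizes)
print(size_group.tolist())
```

With `right=False` every bin includes its lower edge, so a 10.0 MB app lands in '10-20', which matches the `>=` comparisons used in `size_apps`.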

In [None]:
# no of apps belonging to each size group
Play_store_df['size_group'].value_counts().plot.barh(figsize=(20,8),color='r').invert_yaxis()
plt.title("Number of apps in different size groups", size=20)
plt.ylabel('App size in MB', size=15)
plt.xlabel('No of apps', size=15)
plt.legend()

In [None]:
# average no of user reviews in each size group
Play_store_df.groupby('size_group')['Reviews'].mean().sort_values().plot.barh(figsize=(20,8), color = 'green')
plt.title("Average number of user reviews (in millions)", size=20)
plt.xlabel('Average no of user reviews', size=15)
plt.ylabel('App size in MB', size=15)
plt.legend()

In [None]:
# average number of app installs in each size group

Play_store_df.groupby('size_group')['Installs'].mean().sort_values(ascending= False).plot.barh(figsize=(20,8),color='sandybrown').invert_yaxis()
plt.title("Average number of app installs (In 10 millions)", size=20)
plt.ylabel('App size in MB', size=15)
plt.xlabel('Average no of app installs',  size=15)
plt.legend()


*   The majority of the apps are between 1 and 20 MB in size.
*   There are a good number of apps whose size varies with the device.
*   Smaller apps have, on average, fewer installs and fewer user reviews.

### **`11). Android version based on each category`**
Now I am going to group the Android versions into major versions 1.0 to 8.0. Null values and 'Varies with device' are set to 1.0.

In [None]:
# Collapse every raw 'Android Ver' label into its major version.
# 'Varies with device' and missing values are assigned to '1.0'.
version_groups = {
    '1.0': ['1.0','1.0 and up','1.5 and up','1.6 and up','Varies with device'],
    '2.0': ['2.0 and up','2.0.1 and up','2.1 and up','2.2 and up','2.2 - 7.1.1','2.3 and up','2.3.3 and up'],
    '3.0': ['3.0 and up','3.1 and up','3.2 and up'],
    '4.0': ['4.0 and up','4.0.3 and up','4.0.3 - 7.1.1','4.1 and up','4.1 - 7.1.1','4.2 and up','4.3 and up','4.4','4.4 and up','4.4W and up'],
    '5.0': ['5.0 - 6.0','5.0 - 7.1.1','5.0 - 8.0','5.0 and up','5.1 and up'],
    '6.0': ['6.0 and up'],
    '7.0': ['7.0 - 7.1.1','7.0 and up','7.1 and up'],
    '8.0': ['8.0 and up'],
}
version_map = {raw: major for major, raws in version_groups.items() for raw in raws}
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
Play_store_df['Android Ver'] = Play_store_df['Android Ver'].replace(version_map).fillna('1.0')
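
Listing every label by hand is explicit but easy to get out of date. A more compact alternative, shown here only as a sketch on a few sample labels, is to pull the leading major version number out with a regular expression. The `android_ver` series is hypothetical and only stands in for the real column:

```python
import pandas as pd

# Hypothetical sample of raw 'Android Ver' labels, including a missing value
android_ver = pd.Series(['4.0.3 and up', 'Varies with device', '4.4W and up',
                         '5.0 - 8.0', None, '2.3.3 and up'])

# Capture the digits at the start of the label; labels without them give NaN
major = android_ver.str.extract(r'^(\d+)', expand=False)

# Missing or non-numeric labels fall back to major version 1
grouped = major.fillna('1') + '.0'
print(grouped.tolist())
```

For ranges such as '5.0 - 8.0' this keeps the lower bound, which is the same choice the explicit grouping makes.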

In [None]:
# Count the apps per Android version within each category once, then print and plot the result
ver_counts = Play_store_df.groupby('Category')['Android Ver'].value_counts()
print(ver_counts)
ver_counts.unstack().plot.bar(figsize=(25,8), width=2)
plt.show()

**Findings:**

It is evident from the above plot that the majority of the apps work on **`Android_Ver 4.0 and up`**.

# **Data Exploration--Univariate & Bivariate Analysis**
A **pair plot** helps identify which pairs of features best explain a relationship between two variables or form the most clearly separated clusters. It also helps in building simple classification models, by showing where simple lines could linearly separate the data.

We plot a pairwise plot of all the quantitative variables to look for any evident patterns or relationships between the features.

In [None]:
Rating = Play_store_df['Rating']
Size = Play_store_df['Size']
Installs = Play_store_df['Installs']
Reviews = Play_store_df['Reviews']
Type = Play_store_df['Type']
Price = Play_store_df['Price']

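# Installs and Reviews are log-transformed with the same base (log10) so their scales are comparable;
# note that a count of zero maps to -inf
p = sns.pairplot(pd.DataFrame(list(zip(Rating, Size, np.log10(Installs), np.log10(Reviews), Price, Type)),
                        columns=['Rating','Size', 'Installs', 'Reviews', 'Price','Type']), hue='Type')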
p.fig.suptitle("Pairwise Plot - Rating, Size, Installs, Reviews, Price",x=0.5, y=1.0, fontsize=16)

**FINDINGS**
* Most of the apps are free.
* Most of the paid apps have a rating of around 4.
* As the number of installs increases, the number of reviews of an app also increases.
* Most of the apps are lightweight.
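
The relationship between installs and reviews can also be put into a single number with a correlation coefficient. The following is a minimal sketch on made-up values. The `sample` frame is hypothetical, so the printed coefficient says nothing about the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical installs and review counts for six apps
sample = pd.DataFrame({
    'Installs': [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000],
    'Reviews':  [3, 40, 150, 2_500, 30_000, 400_000],
})

# Keep only rows where both counts are positive, because log10(0) is -inf
sample = sample[(sample > 0).all(axis=1)]

# Pearson correlation on the log scale, as used in the pair plot
log_sample = np.log10(sample)
r = log_sample['Installs'].corr(log_sample['Reviews'])
print(round(r, 3))
```

A value close to 1 means that apps with more installs also tend to have more reviews.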

## **☘ Let us see what insight we can have on the basis of Size of an app**

## **`Size vs Rating`**

In [None]:
sns.set_style("whitegrid", {'axes.grid' : False})
sns.lmplot(y='Rating',x='Size',data=ps_df,col="Category", hue="Category",col_wrap=4,line_kws={'color': 'red'})


# **Data Visualization on User Reviews:**


### **`1). Percentage of Review Sentiments`**

In [None]:
# Basic inspection
User_Review_df.columns

In [None]:
import matplotlib
# Take counts and labels from the same value_counts() result so they stay aligned
sentiment_counts = User_Review_df['Sentiment'].value_counts()
counts = list(sentiment_counts)
labels = [f'{sentiment} Reviews' for sentiment in sentiment_counts.index]
matplotlib.rcParams['font.size'] = 20
matplotlib.rcParams['figure.figsize'] = (10, 15)
plt.pie(counts, labels=labels, explode=[0.01, 0.05, 0.05], shadow=True, autopct="%.2f%%")
plt.title('Percentage of Review Sentiments', fontsize=20)
plt.axis('off')
plt.legend(bbox_to_anchor=(0.9, 0, 0.5, 1))
plt.show()

**Findings:**

1. Positive reviews are **64.04%**
2. Negative reviews are **21.29%**
3. Neutral reviews are **14.67%**

### **`2). Apps with the highest number of positive reviews`**

In [None]:
# positive reviews
positive_ur_df=User_Review_df[User_Review_df['Sentiment']=='Positive']
positive_ur_df

In [None]:
# Every row here is a positive review, so counting rows per app gives the number of positive reviews
positive_ur_df['App'].value_counts().nlargest(10).plot.barh(figsize=(15,8),color='seagreen').invert_yaxis()
plt.title("Top 10 positive review apps")
plt.xlabel('Total number of positive reviews')
plt.legend()

### **`3). Apps with the highest number of negative reviews.`**

In [None]:
negative_ur_df=User_Review_df[User_Review_df['Sentiment']=='Negative']
negative_ur_df

In [None]:
# Every row here is a negative review, so counting rows per app gives the number of negative reviews
negative_ur_df['App'].value_counts().nlargest(10).plot.barh(figsize=(15,8),color='crimson').invert_yaxis()
plt.title("Top 10 negative review apps")
plt.xlabel('Total number of negative reviews')
plt.legend()

### **`4). Histogram of Subjectivity`**

In [None]:
merged_df.Sentiment_Subjectivity.value_counts()

In [None]:
plt.figure(figsize=(18,9))
plt.xlabel("Subjectivity")
plt.title("Distribution of Subjectivity")
plt.hist(merged_df[merged_df['Sentiment_Subjectivity'].notnull()]['Sentiment_Subjectivity'])
plt.show()

**Findings:**

**`0 - objective(fact), 1 - subjective(opinion)`**

Most sentiment subjectivity values lie between 0.4 and 0.7. From this we can conclude that most users review the applications according to their own experience.

### **`5). Is sentiment_subjectivity proportional to sentiment_polarity?`**

In [None]:
# scatterplot of sentiment polarity and sentiment subjectivity
plt.figure(figsize=(15, 10))
sns.scatterplot(x=User_Review_df['Sentiment_Subjectivity'], y=User_Review_df['Sentiment_Polarity'],hue = User_Review_df['Sentiment'], edgecolor='white', palette="inferno")
plt.title("Google Play Store Reviews Sentiment Analysis", fontsize=20)
plt.show()

From the above scatter plot it can be concluded that sentiment subjectivity is not always proportional to sentiment polarity, but in most cases it shows proportional behaviour when the variance is very high or very low.
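
One way to check a statement like this numerically is to compare two correlations: subjectivity against the signed polarity, and subjectivity against the strength (absolute value) of the polarity. The sketch below uses made-up review scores that were chosen to show the difference between the two. It does not reproduce results from the dataset:

```python
import pandas as pd

# Hypothetical review scores (not taken from User_Review_df)
reviews = pd.DataFrame({
    'Sentiment_Polarity':     [0.8, -0.6, 0.0, 0.3, -0.9, 0.1, 0.5, 0.0],
    'Sentiment_Subjectivity': [0.9,  0.7, 0.0, 0.4,  1.0, 0.3, 0.6, 0.1],
})

# Correlation with the signed polarity
signed_r = reviews['Sentiment_Polarity'].corr(reviews['Sentiment_Subjectivity'])

# Correlation with the strength of the polarity, ignoring its direction
strength_r = reviews['Sentiment_Polarity'].abs().corr(reviews['Sentiment_Subjectivity'])

print(round(signed_r, 2), round(strength_r, 2))
```

In this toy sample the signed correlation is weak while the correlation with the polarity strength is strong, which is the pattern to look for when running the same two lines on the real review data.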

# **How Content Rating Affects Apps**

### **1.) Paid App Content Rating**

In [None]:
paid_df['Content Rating'].value_counts().plot.bar(figsize=(10,10),color='c')
plt.legend()

### **2.) Free App content Rating**

In [None]:
free_df['Content Rating'].value_counts().plot.bar(figsize=(10,10),color='blue')
plt.legend()

Most apps on the Google Play Store have a content rating that allows everyone to use them. The remaining apps have various age restrictions.

### **3.) Does the last update date have an effect on the rating?**

In [None]:
print(Play_store_df['Last Updated'].head())
#fetch update year from date (assumes 'Last Updated' is already a datetime column)
Play_store_df["Update year"] = Play_store_df["Last Updated"].dt.year.astype('int64')

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
sns.regplot(x="Update year", y="Rating", data=Play_store_df)
plt.title("Update Year VS Rating")

### **4.) Distribution of app updates over the years**

In [None]:
# Extract the update year (assumes 'Last Updated' is already a datetime column)
paid_df["Update year"] = paid_df["Last Updated"].dt.year.astype('int64')
free_df["Update year"] = free_df["Last Updated"].dt.year.astype('int64')

In [None]:
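# Number of paid and free apps last updated in each year, drawn on the same axes
paid_df.groupby("Update year")["App"].count().plot.line(marker='o', label='Paid')
free_df.groupby('Update year')['App'].count().plot.line(marker='o', label='Free')
plt.ylabel('Number of apps')
plt.legend()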

The above plot compares the number of free and paid apps updated or added in each year. There were no paid apps before 2011, and in the following years far more free apps than paid apps were added. Between 2011 and 2018, the share of free apps among the apps updated or added rose from 80% to 96%, while the share of paid apps fell from 20% to 4%. This suggests that most people are after free apps.
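
Yearly shares like the ones quoted above can be computed with a normalised cross-tabulation. This is a minimal sketch on a tiny hypothetical `apps` frame, so its percentages are illustrative only:

```python
import pandas as pd

# Hypothetical apps with their update year and type
apps = pd.DataFrame({
    'Update year': [2011, 2011, 2011, 2011, 2011, 2018, 2018, 2018, 2018, 2018],
    'Type':        ['Free', 'Free', 'Free', 'Free', 'Paid',
                    'Free', 'Free', 'Free', 'Free', 'Free'],
})

# normalize='index' turns each row (year) into fractions that sum to 1
share = pd.crosstab(apps['Update year'], apps['Type'], normalize='index').mul(100)
print(share)
```

Each row of `share` then gives the percentage of free and paid apps among the apps updated in that year.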

### **5.) Distribution of paid and free app updates over the months**

In [None]:
# Extract the update month (assumes 'Last Updated' is already a datetime column)
paid_df["Update month"] = paid_df["Last Updated"].dt.month.astype('int64')
free_df["Update month"] = free_df["Last Updated"].dt.month.astype('int64')

In [None]:
paid_df.groupby("Update month")["App"].count().plot.bar(figsize=(10,8), color= "green")
plt.title("Paid Apps update over the month", size=20)
plt.legend()

Like the free apps, most of the paid apps were updated in July.

In [None]:
free_df.groupby("Update month")["App"].count().plot.bar(figsize=(10,8), color='purple')
plt.title("Free Apps update over the month", size=20)
plt.legend()

In this data, almost 50% of the apps were added or updated in July, about 25% in August, and the remaining 25% across the other months.

# ▶**Analysis Summary**
In this project of analyzing Play Store applications, we have worked on several parameters that would help AlmaBetter do well in launching their apps on the Play Store.

In the initial phase, we focused on the problem statements and data cleaning, in order to ensure that the analysis gives them the best results.

AlmaBetter needs to focus more on:
1. Developing apps in the least crowded categories, such as Events and Beauty, as they are not explored much.
2. Free apps, since most of the apps are free.
3. Content available for Everyone, which will increase the chances of getting the highest installs.
4. Updating their apps regularly, so that they attract more users.
5. Users' needs and features, keeping in mind that the sentiments of users keep varying as they keep using the app.

* Percentage of free apps = ~92%
* Percentage of apps with no age restrictions = ~82%
* Most competitive category: Family
* Category with the highest average app installs: Game
* Percentage of apps that are top rated = ~80%
* Family, Game and Tools are the top three categories, with 1906, 926 and 829 apps respectively.
* Tools, Entertainment, Education, Business and Medical are the top genres.
* 8783 apps are smaller than 50 MB. 7749 apps, free and paid combined, have a rating above 4.0.
* There are 20 free apps that have been installed over a billion times.
* Minecraft is the only app in the paid category with over 10M installs. This app has also produced the most revenue from the installation fee alone.
* Category in which the paid apps have the highest average installation fee: Finance
* The median size of all apps in the play store is 12 MB.
* The apps whose size varies with device have the highest average number of installs.
* The apps larger than 90 MB have the highest average number of user reviews, i.e., they are more popular than the rest.
* Helix Jump has the highest number of positive reviews and Angry Birds Classic has the highest number of negative reviews.
* Overall sentiment in the merged dataset: Positive 64%, Negative 22% and Neutral 13%.

**1.Rating**

Most of the apps have a rating between 4 and 5.

The most common rating is 4.3.

The app categories have an average rating of more than 4.

**2.Size**

Most of the applications in the dataset are small in size.

**3.Installs**

The majority of the apps fall into three categories: Family, Game and Tools.

Although most apps in the Google Play Store come under Family, Game and Tools, the picture is different for installs. The most installed apps come under Game, Communication, Productivity and Social.

Subway Surfers, Facebook, Messenger and Google Drive are the most installed apps.

**4.Type(Free/Paid)**

About 92% of the apps are free and 8% are paid.

The category ‘Family’ has the highest number of paid apps.

Free apps are installed more than paid apps.

The app “I’m Rich — Trump Edition” from the category ‘Lifestyle’ is the most costly app priced at $400

**5.Content Rating**

Apps rated Everyone have the most installs, while Unrated and Adults only 18+ apps have the fewest.

**6.Reviews**

The number of installs is positively correlated with the number of reviews, with a correlation of 0.64.

Sentiment Analysis

**7.Sentiment**

Most of the reviews have Positive sentiment, while Negative and Neutral reviews are far fewer.

**8.Sentiment Polarity / Sentiment Subjectivity**

The collection of reviews shows a wide range of subjectivity, and most of the reviews fall in the [-0.50, 0.75] range of the polarity scale, implying that extremely negative or extremely positive sentiments are rare.
Most of the reviews show a mid-range of negative and positive sentiments.

Sentiment subjectivity is not always proportional to sentiment polarity, but in most cases it shows proportional behaviour when the variance is very high or very low.

Sentiment Polarity is not highly correlated with Sentiment Subjectivity.

# **Challenges & Future Work**
1. Our major challenge was data cleaning.
2. 13.60% of reviews were NaN values, and even after merging both the dataframes, we could not infer much in order to fill them. Thus we had to drop them.
3. The merged dataframe of the play store and user reviews data had only 816 common apps. This is just 10% of the cleaned data. We could have given a more valuable analysis if at least 70% - 80% of the data had been available in the merged dataframe.
4. User Reviews had 42% NaN values. That data could have been used to develop an understanding of the category-wise sentiments, which would have helped us fill the 13.60% NaN values of the Reviews column.
5. There is much more that can be explored. For example, the current version and the Android version can be explored in detail to show how they affect an app and what needs to be kept in mind while developing apps for the users.
6. We can explore how the size of the app and the Android version relate to the number of installs.
7. Machine learning can help us gain more insights by developing models that help us interpret the data even better. We have left this as future work.
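
The figures in points 2 to 4 (share of NaN values and number of common apps) are the kind of numbers that can be obtained with a few pandas calls. The sketch below uses two tiny hypothetical frames, `play_store` and `user_reviews`, which only stand in for the real dataframes:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two datasets
play_store = pd.DataFrame({'App': ['A', 'B', 'C', 'D', 'E'],
                           'Rating': [4.1, np.nan, 3.9, 4.6, np.nan]})
user_reviews = pd.DataFrame({'App': ['A', 'A', 'C', 'Z'],
                             'Sentiment': ['Positive', np.nan, 'Negative', 'Neutral']})

# Percentage of missing values in every column
nan_share = play_store.isna().mean().mul(100)

# Apps present in both datasets, and their share of the play store apps
common_apps = set(play_store['App']) & set(user_reviews['App'])
coverage = 100 * len(common_apps) / play_store['App'].nunique()

# An inner merge keeps only the rows of those common apps
merged = play_store.merge(user_reviews, on='App', how='inner')
print(nan_share['Rating'], sorted(common_apps), coverage, len(merged))
```

Checking the coverage before merging shows early on how much of the play store data will survive an inner merge.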