# **Project Name**    - **Playstore App Review Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name**            - Shweta Dhore

# **Project Summary -**


EDA of Play Store data provided valuable insights into the characteristics and trends of mobile applications. By leveraging statistical techniques and data visualization, we gained a deeper understanding of user preferences, app performance, and market trends. These insights can inform various stakeholders, including developers, marketers, and users, in making informed decisions regarding app development, marketing strategies, and user engagement.

This project serves as a foundation for further analysis and exploration of Play Store data, with potential avenues for research including sentiment analysis of user reviews, predictive modeling of app success, and market segmentation based on user demographics.



# **GitHub Link -**

https://github.com/ShwetaDhore

# **Problem Statement**



The Google Play Store is a vast marketplace for Android applications, offering millions of apps to users worldwide. Understanding the trends, patterns, and characteristics of these apps can provide valuable insights for app developers, marketers, and users alike. In this project, we aim to conduct an Exploratory Data Analysis (EDA) on Play Store data to extract meaningful information and answer key questions.

#### **Define Your Business Objective?**
Business Objective:

The primary objective of this Exploratory Data Analysis (EDA) project on the Play Store data is to provide actionable insights to stakeholders involved in app development, marketing, and user engagement. Specifically, the business aims to use the analysis to:

*   Inform Product Development
*   Optimize Marketing Strategies
*   Enhance User Experience
*   Drive Revenue Generation
*   Support Decision-Making

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries


In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt   # visualizing data
import seaborn as sns
import missingno as msno

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
# Dataset First Look
df=pd.read_csv('/content/drive/MyDrive/Project 2/Play Store Data.csv',encoding='latin1')
df2=pd.read_csv('/content/drive/MyDrive/Project 2/User Reviews.csv',encoding='latin1')
df

### Dataset Rows & Columns count

In [None]:
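# Dataset Rows & Columns count
# A cell only displays its last expression, so print both shapes explicitly.
print('Play Store data:', df.shape)
print('User reviews:', df2.shape)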

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
num_duplicates = df.duplicated().sum()
num_duplicates

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values_count = df.isnull().sum()
missing_values_count

In [None]:
# Visualizing the missing values
msno.matrix(df)
plt.title('Missing Values Matrix')
plt.show()

In [None]:
#Visualizing the missing values
null_counts =df.isnull().sum()
null_counts.plot.bar()
plt.title('Null Value Counts')
plt.xlabel('Columns')
plt.ylabel('Count')
plt.show()

In [None]:
#Visualizing the missing values
null_counts =df2.isnull().sum()
null_counts.plot.bar()
plt.title('Null Value Counts')
plt.xlabel('Columns')
plt.ylabel('Count')
plt.show()

### What did you know about your dataset?

Columns: The dataset contains multiple columns, each representing different attributes of mobile applications available on the Google Play Store. These attributes include app name, category, rating, reviews, size, installs, type, price, content rating, genres, last updated, current version, and Android version.

Data Types: The data types of the columns vary. Most columns are stored as object (string) data, such as app name and category. 'Rating' is loaded as a float, while 'Reviews', 'Size', 'Installs' and 'Price' hold numbers stored as text and need conversion before numerical analysis.

Missing Values: Some columns have missing values, as indicated by the "non-null" counts provided in the summary. For example, the "Rating" column has missing values, with only 9367 non-null values out of 10841 total rows.

Categorical Data: Certain columns likely contain categorical data, such as "Category," "Type," and "Content Rating." These columns may have a limited number of unique values representing different categories or classifications.

Numerical Data: Other columns, such as "Rating," "Reviews," "Installs," and "Price," likely contain numerical data that can be analyzed quantitatively.

Potential Data Quality Issues: The presence of missing values and discrepancies in non-null counts across columns may indicate potential data quality issues that need to be addressed during data preprocessing.
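
The points above (data types, missing values, number of distinct values) can be collected in one table per column. The following is a minimal sketch, assuming a hypothetical helper `summarize_columns` and a small toy DataFrame that only imitates the Play Store columns; it is not part of the original notebook.

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, null count, null share (%), unique values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "nulls": frame.isnull().sum(),
        "null_pct": (frame.isnull().mean() * 100).round(2),
        "unique": frame.nunique(dropna=True),
    })

# Toy stand-in for the Play Store data (values are illustrative only).
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, None, 4.5, 3.9],
    "Installs": ["1,000+", "500+", "1,000+", "10+"],
})

summary = summarize_columns(toy)
print(summary)
```

On the real data, `summarize_columns(df)` would give the same overview as `df.info()` and `df.isnull().sum()` combined, which can make columns that need cleaning easier to spot.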

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Print both, since a cell only displays its last expression.
print(df.columns)
print(df2.columns)

In [None]:
# Dataset Describe
dataset_description = df.describe()
dataset_description

### Variables Description

App: This column likely contains the names of the mobile applications available on the Google Play Store.

Category: This column likely categorizes the apps into different categories or genres, such as "Games," "Social," "Finance," etc.

Rating: This column likely represents the rating of each app, typically on a scale of 0 to 5 stars, indicating user satisfaction or feedback.

Reviews: This column likely contains the number of user reviews or feedback received for each app.

Size: This column likely indicates the size of each app in terms of storage space, possibly in megabytes (MB) or gigabytes (GB).

Installs: This column likely represents the number of times each app has been downloaded or installed by users.

Type: This column likely indicates whether the app is free or paid.

Price: This column likely represents the price of the app for paid apps, while free apps may have a value of "0" or "Free."

Content Rating: This column likely provides information about the content rating or age suitability of each app, such as "Everyone," "Teen," "Mature 17+," etc.

Genres: This column likely provides additional information about the genres or subcategories of each app, which may complement the main category.

Last Updated: This column likely indicates the date when each app was last updated on the Google Play Store.

Current Ver: This column likely represents the current version of each app.

Android Ver: This column likely indicates the minimum required Android version to run each app.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    # Get unique values for the column
    unique_values = df[column].unique()

    # Print column name and unique values
    print(f"Unique values for {column}:")
    print(unique_values)


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
df.boxplot()

In [None]:
## From the above boxplot it is clear that there is an outlier in 'Rating', which holds invalid data.
# Ratings cannot exceed 5, so rows with a rating greater than 5 have to be discarded.
df[df['Rating']>5]

In [None]:
# Drop the malformed row found above (its Rating is greater than 5).
df.drop(index=df[df['Rating'] > 5].index, inplace=True)

Filtering the necessary insights for further use.

By analyzing the DataFrame we found that:

1. The columns 'Reviews', 'Size', 'Installs' and 'Price' are of type object and have to be converted to a numeric type for further use.

2. Values in the 'Size' column are strings ending in 'M' or 'k', or the text 'Varies with device'.

3. Values in the 'Installs' column are strings containing '+' and ','.

4. Values in the 'Price' column are strings containing '$'.

5. Values in the 'Android Ver' column are strings such as 'Varies with device', '1.0 and up', '2.0 and up' and so on.
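
The 'Size' strings described in point 2 can be turned into a numeric column in megabytes. This is a minimal sketch under stated assumptions: the helper name `size_to_mb` is hypothetical, 1 MB is taken as 1024 kB, and 'Varies with device' is mapped to NaN because it carries no number.

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Convert '19M' / '201k' style strings to megabytes; anything else becomes NaN."""
    text = str(value).strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1024  # assumption: 1 MB = 1024 kB
    return np.nan  # e.g. 'Varies with device'

# Illustrative values in the same format as the 'Size' column.
sizes = pd.Series(["19M", "201k", "Varies with device", "8.7M"])
size_mb = sizes.apply(size_to_mb)
print(size_mb)
```

On the real data this could be applied as `df['Size'].apply(size_to_mb)`; whether to keep, fill or drop the resulting NaN values is a separate decision.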

In [None]:
# Cleaning the 'Price' column: strip the '$' sign and convert the dtype from object to float.
df['Price'] = df['Price'].astype(str).str.replace('$', '', regex=False).astype(float)

In [None]:
# Cleaning the 'Reviews' column: convert the dtype from object to numeric.
df['Reviews']= pd.to_numeric(df['Reviews'])

In [None]:
# Cleaning the 'Installs' column: remove '+' and ',' and convert the dtype from object to float.
df['Installs'] = (df['Installs'].astype(str)
                  .str.replace('+', '', regex=False)
                  .str.replace(',', '', regex=False)
                  .astype(float))

In [None]:
# Grouping 'Android Ver' into major versions 1.0 to 8.0.
# Assigning the result back avoids calling replace(inplace=True) on a column selection,
# which newer pandas versions warn about or ignore.
android_groups = {
    '1.0': ['Varies with device','1.0 and up','1.5 and up','1.6 and up'],
    '2.0': ['2.0 and up','2.0.1 and up','2.1 and up','2.2 and up','2.2 - 7.1.1','2.3 and up','2.3.3 and up'],
    '3.0': ['3.0 and up','3.1 and up','3.2 and up'],
    '4.0': ['4.0 and up','4.0.3 and up','4.0.3 - 7.1.1','4.1 and up','4.1 - 7.1.1','4.2 and up','4.3 and up','4.4W and up','4.4 and up'],
    '5.0': ['5.0 - 6.0','5.0 - 7.1.1','5.0 - 8.0','5.0 and up','5.1 and up'],
    '6.0': ['6.0 and up'],
    '7.0': ['7.0 - 7.1.1','7.0 and up','7.1 and up'],
    '8.0': ['8.0 and up'],
}
android_map = {label: major for major, labels in android_groups.items() for label in labels}
df['Android Ver'] = df['Android Ver'].replace(android_map)
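
The grouping above lists every label by hand. As an alternative, the major version can be read from the start of each label with a regular expression. This is only a sketch: the function name `major_android_version` is hypothetical, and unlike the notebook (which maps 'Varies with device' to '1.0') it returns `None` for labels that contain no version number.

```python
import re

def major_android_version(label):
    """Return the leading major version as 'N.0', or None if the label has no version number."""
    match = re.match(r"\s*(\d+)\.", str(label))
    return f"{match.group(1)}.0" if match else None

# Illustrative labels in the same format as the 'Android Ver' column.
examples = ["4.0.3 and up", "2.2 - 7.1.1", "4.4W and up", "8.0 and up", "Varies with device"]
grouped = [major_android_version(label) for label in examples]
print(grouped)
```

The benefit of this variant is that labels not seen before are still grouped; the cost is that the special handling of 'Varies with device' has to be decided separately.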

Handling the null values of playstore Dataset

1. Column 'Rating' has 1474 null values, so they will be filled with the median of the 'Rating' column, because the median is a robust statistic that is not sensitive to outliers and extreme values.

2. Columns 'Type' and 'Android Ver' have very few null values, so the rows containing them will be dropped.
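
The reason for preferring the median can be illustrated on a toy series. This is a small sketch with made-up ratings, where the value 19.0 mimics an invalid entry; it is not taken from the dataset.

```python
import pandas as pd

# Made-up ratings: one missing value and one invalid extreme value.
ratings = pd.Series([4.0, 4.2, 4.4, None, 19.0])

mean_value = ratings.mean()      # pulled upwards by the extreme value
median_value = ratings.median()  # stays close to the typical ratings

mean_fill = ratings.fillna(mean_value)
median_fill = ratings.fillna(median_value)

print("mean used for filling:  ", round(mean_value, 2))
print("median used for filling:", round(median_value, 2))
```

Here the mean is about 7.9, which is not even a valid rating, while the median is about 4.3, so filling with the median keeps the imputed value inside the range of real ratings.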

In [None]:
## Replacing the null values of the 'Rating' column with the median of 'Rating'.
def revised_rating(new):
  return new.fillna(new.median())

In [None]:
df['Rating']=df['Rating'].transform(revised_rating)
df['Rating']=df['Rating'].astype(float)

In [None]:
## Dropping the row corresponding to column 'Type' having null value.
index = df[df['Type'].isnull()].index
df.drop(axis=0, inplace=True, index=index)

In [None]:
## Dropping the row corresponding to column 'Android Ver' having null value
index = df[df['Android Ver'].isnull()].index
df.drop(axis=0, inplace=True, index=index)

Now reviewing the information, numeric columns and null value counts after cleaning the Play Store **dataset**

In [None]:
# checking the info of dataset .
df.info()

In [None]:
# reviewing again the dataset column having numeric data.
df.describe()

In [None]:
# crossverifying that null values are eliminated.
df.isnull().sum()

 Data Cleaning of User review Dataset

Handling the Null values of user review Dataset

In [None]:
## Eliminating the null value rows
nonnullvalue_df=df2[~df2['Sentiment_Polarity'].isnull()]

In [None]:
##Checking the info of dataset
nonnullvalue_df.info()

DATA PREPARATION

1. Initiating Data Preparation using Playstore Dataset

In [None]:
## Taking APP and its count
df['App'].value_counts()

In [None]:
# Dropping duplicate rows corresponding to the 'App' column.
df.drop_duplicates(subset=['App'],inplace=True)
# sorting data by 'Reviews'
df.sort_values(by='Reviews', ascending=False,inplace=True)
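
Note that `drop_duplicates` keeps the first occurrence of each app in the current row order. If the intention is to keep the entry with the most reviews for each app, one option is to sort before dropping duplicates. The following is a sketch on a toy DataFrame, not the notebook's actual data.

```python
import pandas as pd

# Toy data: app 'A' appears twice with different review counts.
toy = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Reviews": [100, 250, 40],
})

# Sort first so that the most-reviewed row of each app is the one that is kept.
deduped = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates(subset="App", keep="first")
       .reset_index(drop=True)
)
print(deduped)
```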

In [None]:
df.head()

In [None]:
## Preparing a DataFrame which contains the total 'Installs' per 'Category'
category_installs_sum_df= df.groupby(['Category'])[['Installs']].sum().reset_index()

In [None]:
category_installs_sum_df

In [None]:
## Preparing a DataFrame which contains the total 'Installs' of free apps per genre
free_app_installs_df=df[df['Price']==0].groupby(['Genres'])[['Installs']].sum().reset_index()

In [None]:
free_app_installs_df

In [None]:
## Preparing Dataframe which contains 'Category,Android Ver' count
cat_Android =df.groupby('Category')['Android Ver'].value_counts().reset_index(name='Android ver count')

In [None]:
cat_Android.sort_values('Android ver count',ascending=False)

In [None]:
## Preparing Dataframe which contains count of 'Content Rating'
content_rating_df=df['Content Rating'].value_counts()

In [None]:
content_rating_df

In [None]:
## Preparing Dataframe which contains count of APP TYPE (i.e.- FREE AND PAID)
type_df=df['Type'].value_counts()
type_df

Initiating Data Preparation using User review Dataset

In [None]:
## Preparing Dataframe which contains count of 'Sentiment'[Positive,Negative or Neutral]
sentiment_df=nonnullvalue_df['Sentiment'].value_counts()
sentiment_df

In [None]:
## Preparing Dataframe which contains count of 'Positive Sentiments'
positive_df=nonnullvalue_df[nonnullvalue_df['Sentiment']=='Positive']

In [None]:
positive_df_1=positive_df.groupby(['App'])['Sentiment'].value_counts().reset_index(name='count')
positive_df_1

In [None]:
## Preparing Dataframe which contains count of 'Negative Sentiments'
Negative_df=nonnullvalue_df[nonnullvalue_df['Sentiment']=='Negative']

In [None]:
Negative_df_1=Negative_df.groupby(['App'])['Sentiment'].value_counts().reset_index(name='count')
Negative_df_1

Merging both datasets to establish relations between variables and to look for further insights, if any

In [None]:
# Merge datasets based on a common column (key in this case)
merged_df = pd.merge(df, df2, on='App', how='inner')  # Change how='inner' to other types of joins if needed
merged_df
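
An inner join silently drops apps that appear in only one of the two datasets. One way to see how many rows match is the `indicator` option of `pd.merge`, and `validate` can confirm that 'App' is unique on the app side. This is a sketch with toy frames standing in for the two datasets.

```python
import pandas as pd

# Toy stand-ins: one row per app on the left, several reviews per app on the right.
apps = pd.DataFrame({"App": ["A", "B", "C"], "Category": ["GAME", "TOOLS", "FAMILY"]})
reviews = pd.DataFrame({
    "App": ["A", "A", "B", "Z"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral"],
})

# validate='one_to_many' raises an error if 'App' is duplicated on the left side.
checked = pd.merge(apps, reviews, on="App", how="outer",
                   indicator=True, validate="one_to_many")
match_counts = checked["_merge"].value_counts()
print(match_counts)

# Rows present in both frames are exactly what how='inner' would return.
inner = checked[checked["_merge"] == "both"].drop(columns="_merge")
```

The `left_only` and `right_only` counts show how many apps have no reviews and how many reviews refer to apps missing from the app table.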

### What all manipulations have you done and insights you found?

The following manipulations were applied:

*   The malformed row with a 'Rating' above 5 was identified and removed.
*   'Price', 'Reviews' and 'Installs' were converted from object to numeric types after stripping the '$', '+' and ',' characters.
*   'Android Ver' values were grouped into major versions from 1.0 to 8.0.
*   Missing 'Rating' values were filled with the column median, and rows with a missing 'Type' or 'Android Ver' were dropped.
*   In the user review dataset, rows with a missing 'Sentiment_Polarity' were removed.
*   Duplicate apps were dropped, and summary DataFrames were prepared for installs per category, free-app installs per genre, Android version per category, content rating, app type and review sentiment.
*   Both datasets were merged on the 'App' column.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
 # visualizing the number of apps for each category
sns.set_style('whitegrid')
plt.figure(figsize=(10, 5))
sns.countplot(x='Category', data=df)
plt.title('Number of Apps Per Category')
plt.xticks(rotation=90)
plt.ylabel('Number of Apps')
plt.show()

##### 1. Why did you pick the specific chart?

We use a countplot because it represents the distribution of categorical data. It is particularly useful when we want to understand the frequency or count of each category in a dataset.

##### 2. What is/are the insight(s) found from the chart?

From the above plot it is visible that the categories 'FAMILY', 'GAME' and 'TOOLS' are the most dominant, which means most of the apps in the Play Store belong to these categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help to identify which app categories are most common in the Play Store, but this does not tell us which categories are used the most. To find out which categories are used most, we have to visualize the category against another variable such as installs.

#### Chart - 2 - Pie Chart(Univariate)

In [None]:
# Chart - 2 visualization code
## Visualizing the percentage of apps for each content rating (the age groups allowed to access the apps).
content_rating_df.plot(kind='pie', fontsize=10, explode= (0.1,0.2,0.3,0.4,0.5,0.1), autopct='%1.2f%%', pctdistance=.58, labeldistance=1.24)

##### 1. Why did you pick the specific chart?

A pie chart shows a part-to-whole relationship: each slice is proportional to that category's share of the total, and the whole pie represents 100%. Here it shows the share of apps for each content rating: Everyone, Teen, Mature 17+, Everyone 10+, Adults only 18+ and Unrated.

##### 2. What is/are the insight(s) found from the chart?

From the above plot it is visible that the majority of the apps in the Play Store can be accessed by everyone (i.e. all age groups), while some apps are restricted to particular age groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights derived will have a positive impact on business, as the majority of the apps have no age barrier (i.e. they can be accessed by all age groups), and the restricted ones make up only a small percentage, so they hardly affect the business.

#### Chart - 3 - Histogram

In [None]:
# Chart - 3 visualization code
#visualizing distribution of rating
plt.hist(df['Rating'],bins=30,color='purple')

plt.xlabel('Rating')
plt.title('Distribution of rating')
plt.ylabel('Number of apps')
plt.show()

##### 1. Why did you pick the specific chart?

Histogram plots are commonly used in data visualization to display the distribution of a continuous variable. They are particularly useful for summarizing large datasets and identifying patterns or trends in the data. Histograms work by dividing the data into a series of intervals, or "bins," and then counting the number of data points that fall into each bin. The height of each bar represents the count of data points within that bin. Histograms can also be used to identify outliers, that is, data points that fall far outside the main range of the data.
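
The binning step described above can be reproduced without a plot using `numpy.histogram`. This is a small sketch with made-up ratings and four bins between 3.0 and 5.0; the numbers are illustrative and not taken from the dataset.

```python
import numpy as np

# Made-up ratings for illustration only.
ratings = np.array([3.1, 3.9, 4.0, 4.2, 4.3, 4.3, 4.5, 4.7, 4.9, 5.0])

# Four equal-width bins over [3.0, 5.0]; the last bin includes its right edge.
counts, edges = np.histogram(ratings, bins=4, range=(3.0, 5.0))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count} apps")
```

Each printed count corresponds to the height of one bar in the histogram, which is what `plt.hist` computes internally before drawing.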

##### 2. What is/are the insight(s) found from the chart?

The histogram shows that most apps in the Play Store have a rating between 4.0 and 5.0, which indicates that users like the majority of apps. The most common rating is about 4.3. Part of that peak comes from the missing ratings that were filled with the median.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights derived have a positive outcome: the ratings of most apps lie between 4 and 5, which can be considered top-rated and shows that users like most of the apps.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
df['Genres'].value_counts().iloc[:15]



In [None]:
textprops = {"fontsize":15} # Font size of text in pie chart
plt.figure(figsize = (9,9)) # fixing pie chart size
df['Genres'].value_counts().iloc[:15].plot(kind = 'pie', shadow = True, autopct='%1.1f%%', textprops =textprops)
plt.title("Genres")

##### 1. Why did you pick the specific chart?

Pie charts are best used when there are a limited number of categories to compare (like the top 15 in this case). For a larger number of categories or more detailed comparisons, other types of charts like bar charts might be more appropriate.

##### 2. What is/are the insight(s) found from the chart?

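The most common genre among the top 15 is Tools; the remaining genres have broadly similar shares. Note that this chart counts apps per genre, so it shows how common a genre is rather than how much users like it.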

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

While the insights from the pie chart can significantly help in making informed decisions that drive positive business impact, it is crucial to balance the focus on popular genres with attention to emerging trends and diverse preferences to avoid potential negative growth.








#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Histogram - Rating Distribution: Display the distribution of app ratings
# Selecting top 10 categories
top_10_categories = df['Category'].value_counts().head(10).index

# Filtering dataframe to include only top 10 categories
df_top_10 = df[df['Category'].isin(top_10_categories)]

# Plotting histogram of ratings for top 10 categories
plt.figure(figsize=(10, 6))
sns.histplot(df_top_10['Rating'].dropna(), kde=True)
plt.title('Distribution of App Ratings for Top 10 Categories')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram with a KDE overlay was chosen because it displays the distribution of app ratings for the top 10 app categories and highlights the overall shape of that distribution.








##### 2. What is/are the insight(s) found from the chart?

Analyzing the histogram using these insights can help in making informed decisions about app development, marketing, and quality assurance to better meet user expectations and improve overall satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The histogram insights can drive significant positive business impact through improved user satisfaction, better product development, effective marketing, and efficient resource allocation. However, careful consideration is necessary to avoid potential pitfalls such as over-reliance on popular categories, market saturation, ignoring emerging trends, and misallocation of resources. Balancing current insights with a forward-looking strategy will help in sustaining long-term growth and success.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Count of apps in each content rating

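dataa = df.groupby('Content Rating').count()
# Offset every slice except 'Everyone'; built from the data so the length always matches.
explode = [0 if name == 'Everyone' else 0.1 for name in dataa.index]
# Take the labels from the grouped index so that each slice is labelled correctly.
labels = dataa.index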
plt.figure(figsize=(8,8))
plt.pie(dataa['App'], autopct ="%0.1f%%", explode = explode, labels = labels)
plt.legend(title="Content Rating", loc = 'upper right')
plt.show()

##### 1. Why did you pick the specific chart?

This pie chart provides a clear, immediate understanding of the distribution of apps across different content ratings, aiding in strategic decisions regarding content development and marketing efforts.








##### 2. What is/are the insight(s) found from the chart?

The pie chart provides a clear visual summary of how apps are distributed across different content ratings. The insights gained from this chart can guide strategic decisions in app development, marketing, and content creation, ensuring that the business can effectively cater to its target audiences and explore potential growth opportunities in underrepresented categories.








##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the pie chart on content ratings can guide strategic decisions that positively impact the business by focusing on high-demand areas, optimizing resource allocation, and improving user satisfaction. However, it is essential to balance this focus with an awareness of emerging trends and niche markets to avoid potential negative growth. A well-rounded strategy that considers both current data and future possibilities will help sustain long-term growth and success.








#### Chart - 7

In [None]:
# Chart - 7 visualization code
## visualizing each category corresponds to installs.
plt.figure(figsize=(10,5))

sns.barplot(x='Category',y='Installs',data=category_installs_sum_df)
plt.xticks(rotation=90)
plt.title('Category vs Installs')
plt.show()

##### 1. Why did you pick the specific chart?

Bar plots are a simple and effective way to visually represent categorical data. They provide an easy-to-understand way to compare values and identify patterns and trends in the data. They allow us to see at a glance which categories have the highest or lowest values, making it easier to identify areas of focus or concern. Bar plots can also help us to identify outliers or anomalies in the data.

##### 2. What is/are the insight(s) found from the chart?

From the above plot it is visible that the highest numbers of app installations come from the categories 'GAME', 'COMMUNICATION' and 'TOOLS', in that order.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the visual representation makes it very clear which app categories are installed most, and this creates a partial positive impact. On the other hand, we can see that the 'FAMILY' category is dominant in the Play Store by number of apps, but its installations are on the lower side compared with other categories. To achieve a fully positive impact on business, this gap should be addressed by analyzing user complaints and acting on them.
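
The comparison between how many apps a category has and how often they are installed can be put into one table. This is a sketch on toy data with made-up install numbers; the column names `app_count`, `total_installs` and `median_installs` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data: FAMILY has the most apps, GAME has the most installs.
toy = pd.DataFrame({
    "Category": ["FAMILY", "FAMILY", "FAMILY", "GAME", "GAME", "TOOLS"],
    "Installs": [1_000.0, 5_000.0, 10_000.0, 1_000_000.0, 500_000.0, 100_000.0],
})

# Named aggregation: count of apps, total installs and median installs per category.
per_category = (
    toy.groupby("Category")["Installs"]
       .agg(app_count="count", total_installs="sum", median_installs="median")
       .sort_values("total_installs", ascending=False)
)
print(per_category)
```

Looking at the median installs per app next to the app count helps separate categories that are crowded from categories that are actually popular.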

#### Chart - 8

In [None]:
# Chart - 8 visualization code
##Visualizing Free app installed corresponding to Genres
# Express free-app installs in millions in a new column, so re-running the cell does not rescale the data again
free_app_installs_df['Installs (millions)'] = free_app_installs_df['Installs'] / 1e6
fig, ax = plt.subplots(figsize=(24,6))
sns.barplot(data=free_app_installs_df, x='Genres', y='Installs (millions)', ax=ax)

ax.set_title('Genres Vs Free app Installs Count')
ax.set_xlabel('Genres')
ax.set_ylabel('Number of Installations (millions)')


plt.xticks(rotation=90)

plt.show()

##### 1. Why did you pick the specific chart?

Bar plots are a simple and effective way to visually represent categorical data. They provide an easy-to-understand way to compare values and identify patterns and trends in the data. They allow us to see at a glance which categories have the highest or lowest values, making it easier to identify areas of focus or concern. Bar plots can also help us to identify outliers or anomalies in the data, and they support decision-making by providing a clear and concise representation of the data.

##### 2. What is/are the insight(s) found from the chart?

The plot clearly shows that apps in the 'COMMUNICATION' genre have the highest number of free installations, followed by 'TOOLS', 'EDUCATION', 'PRODUCTIVITY', 'SOCIAL' and so on.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights derived will have a positive impact on business

#### Chart - 9 - Cat Plot(Bivariate)

In [None]:
# Chart - 9 visualization code
g = sns.catplot(data=cat_Android, x='Category', y='Android ver count', hue='Android Ver',
                kind='bar', palette='viridis', aspect=2, height=8)

g.set(xlabel='Category')

g.set(ylabel='Count of Version')

g.fig.suptitle('Count of Android Versions for each Category', fontsize=20)

g.set_xticklabels(rotation=90)

plt.show()

##### 1. Why did you pick the specific chart?

A catplot (short for categorical plot) is used to visualize the relationship between a categorical variable and a numeric variable. It is used to explore the distribution of data across categories, identify patterns or trends in the data, and compare the distribution of different groups or subgroups. catplot returns a FacetGrid object, which allows us to create multiple plots side by side, grouping the data by one or more categorical variables.

##### 2. What is/are the insight(s) found from the chart?

It is visible from the above plot that, in every category, most apps require Android version 4.0 and up.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights derived have a positive impact on business. In every category most apps run on Android version 4.0 and up, including the categories that are installed most, so the outcome is positive.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
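## Visualizing the apps in terms of their rating, size and type.

# 'Size' is still text ('19M', '201k', 'Varies with device'), so build a numeric size in MB first.
size_text = df['Size'].astype(str)
size_value = pd.to_numeric(size_text.str.rstrip('Mk'), errors='coerce')  # NaN for 'Varies with device'
size_mb = size_value.where(~size_text.str.endswith('k'), size_value / 1024)  # kB to MB (1 MB = 1024 kB assumed)

sns.set_style('darkgrid', {'grid.color': 'gray'})
plt.figure(figsize=(12, 6))
sns.scatterplot(x=size_mb,
                y='Rating',
                hue='Type',
                palette={'Free': 'blue', 'Paid': 'red'},
                data=df,
                s=50)
plt.xlabel('Size (MB)')
plt.show()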

##### 1. Why did you pick the specific chart?

A scatter plot represents two variables of a dataset as points on coordinate axes. With this plot we check how the rating behaves as the size of an app increases, and which type of app tends to have higher ratings.

##### 2. What is/are the insight(s) found from the chart?

From the above scatter plot it can be seen that the majority of free apps are small in size (up to 20 MB, with most around 10 MB) and also have higher ratings. In contrast, paid apps are spread fairly evenly in terms of size and rating.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights have a positive impact on business. Users appear to favor free apps over paid ones, and smaller apps tend to be installed more.

B.) Data Visualisation of User review Dataset

#### Chart - 11

In [None]:
# Chart - 11 visualization code
## percentage of user Review Sentiments.
plt.rcParams['figure.figsize']=(5,5)
sentiment_df.plot(kind='pie', explode= (0.1,0.1,0.1), shadow=True, autopct='%1.2f%%', pctdistance=1, labeldistance=1.2)


##### 1. Why did you pick the specific chart?

A pie chart shows a part-to-whole relationship: each slice is proportional to that category's share of the total, and the whole pie represents 100%. Here it shows the share of reviews for each sentiment (Positive, Negative, Neutral).

##### 2. What is/are the insight(s) found from the chart?

From the above plot it can be stated that the majority of user sentiment is positive, so we can conclude that the overall reviews are also positive.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights derived have a positive impact on business.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
## Visualizing the Distribution of Sentiment Subjectivity
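sentiment_subjectivity_df = nonnullvalue_df['Sentiment_Subjectivity']
sns.set_style('darkgrid', {'grid.color': 'gray'})
# displot is a figure-level function that creates its own figure, so the size is
# set through height/aspect instead of a separate plt.figure() call.
sns.displot(sentiment_subjectivity_df, height=6, aspect=2)
plt.xlabel("Subjectivity")
plt.ylabel('Reviews count')
plt.title('Distribution of Subjectivity')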

##### 1. Why did you pick the specific chart?

A histogram is a graphical representation of the frequency distribution of a continuous variable. It consists of a series of adjacent rectangles whose heights are proportional to the frequency of the data values that fall within each interval. Histograms are useful because they let us quickly see the overall shape of a dataset, including its central tendency, spread, and any outliers or unusual features.

##### 2. What is/are the insight(s) found from the chart?

The plot indicates that most reviews have a subjectivity between 0.4 and 0.6. We can conclude that the majority of user reviews reflect the users' own experience and personal opinions.
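The share of reviews that fall in the 0.4 to 0.6 band can be measured rather than read off the histogram. A minimal sketch, using a toy Series of made-up subjectivity scores in place of `nonnullvalue_df['Sentiment_Subjectivity']`:

```python
import pandas as pd

# Toy subjectivity scores in [0, 1] (illustrative only).
toy_subjectivity = pd.Series([0.1, 0.45, 0.5, 0.55, 0.6, 0.9])

# Boolean mask for scores in the band; between() is inclusive on both ends by default.
in_band = toy_subjectivity.between(0.4, 0.6)

# The mean of a boolean mask is the fraction of True values.
share_in_band = in_band.mean()
print(f"{share_in_band:.1%} of reviews fall between 0.4 and 0.6")
```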

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insight has a positive outcome: the reviews are evenly distributed within a specific range of subjectivity, which reflects users' experience.

#### Chart - 13

In [None]:
# Chart - 3 visualization code
## GRAPH-1
## Visualising the Most positive Review App.

top_apps = positive_df_1.nlargest(10, 'count')['App']

# Filter the positive_df_1 dataset to only include rows for the top 10 apps
top_positive_df = positive_df_1[positive_df_1['App'].isin(top_apps)]

sns.catplot(x='count', y='App', data=top_positive_df, kind='bar',color='green', height=5, aspect=1.5)
plt.ylabel('App Name')
plt.gca().invert_yaxis()
plt.xlabel('Total Number of Positive Reviews')
plt.title('Top 10 Positive Review Apps')
plt.show()

In [None]:
## GRAPH-2
##  Visualising the Most Negative Review App.


top_apps_1 = Negative_df_1.nlargest(10, 'count')['App']

# Filter the Negative_df_1 dataset to only include rows for the top 10 apps
top_Negative_df = Negative_df_1[Negative_df_1['App'].isin(top_apps_1)]
top_Negative_df = top_Negative_df.sort_values('count', ascending=True)

sns.catplot(x='count', y='App', data=top_Negative_df, kind='bar',color='crimson', height=5, aspect=1.5)
plt.ylabel('App Name')
plt.gca().invert_yaxis()
plt.xlabel('Total Number of Negative Reviews')
plt.title('Top 10 Negative Review Apps')
plt.show()


##### 1. Why did you pick the specific chart?

Bar plots are a simple and effective way to visually represent categorical data. They provide an easy-to-understand way to compare values and identify patterns and trends in the data. They allow us to see at a glance which categories have the highest or lowest values, making it easier to identify areas of focus or concern. Bar plots can also help us identify outliers or anomalies, and they support data-driven decision-making by providing a clear and concise representation of the data.

##### 2. What is/are the insight(s) found from the chart?

GRAPH-1 visualises the top 10 apps by number of positive reviews; the app with the most positive reviews is Helix Jump.

GRAPH-2 visualises the top 10 apps by number of negative reviews; the app with the most negative reviews is Angry Birds Classic.
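A per-app review count like the one plotted above can be derived with a filter and a group-by. This sketch uses a toy DataFrame with hypothetical app names; the `App` and `Sentiment` column names are assumptions, and it is not necessarily how `positive_df_1` was built.

```python
import pandas as pd

# Toy review data with hypothetical app names (illustrative only).
toy_reviews = pd.DataFrame({
    "App": ["App A", "App A", "App A", "App B", "App B", "App C"],
    "Sentiment": ["Positive", "Positive", "Positive", "Positive", "Negative", "Negative"],
})

# Keep positive reviews, count them per app, and sort from most to fewest.
positive_counts = (
    toy_reviews[toy_reviews["Sentiment"] == "Positive"]
    .groupby("App")
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)

# The first rows are the most positively reviewed apps.
top_positive = positive_counts.head(2)
print(top_positive)
```

Swapping `"Positive"` for `"Negative"` in the filter gives the corresponding ranking for negative reviews.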

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights have a positive business impact, as we learned which apps receive the most positive reviews and which receive the most negative reviews. Most apps are reviewed positively; for those reviewed negatively, the user complaints show what needs to be improved.

#### Chart - 14

In [None]:
# Chart-4 visualisation code
## Visualizing relationship between Subjectivity and Polarity
sns.set_style('darkgrid', {'grid.color': 'gray'})
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Sentiment_Subjectivity',
                y='Sentiment_Polarity',
                data=nonnullvalue_df,
                s=50,
                color='Orange')

##### 1. Why did you pick the specific chart?

Scatter plots are used to visualize the relationship between two continuous variables in a dataset. The main purpose of a scatter plot is to show whether there is a correlation or association between two variables. Correlation refers to the strength and direction of the relationship between two variables, while association refers to the presence or absence of a relationship.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot suggests that subjectivity and polarity behave proportionally in some regions, but not consistently. Where the data points are widely spread (high variance), the relationship between subjectivity and polarity tends to look more linear, with a positive or negative correlation.

Conversely, where the data points are clustered together (low variance), the relationship between subjectivity and polarity is unclear and may even appear random, with no clear correlation.
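A correlation coefficient puts a number on the visual impression from the scatter plot. This is a minimal sketch on toy values; it reuses the column names from the plotting code, but the data is made up.

```python
import pandas as pd

# Toy subjectivity/polarity pairs (illustrative only).
toy_sentiment = pd.DataFrame({
    "Sentiment_Subjectivity": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    "Sentiment_Polarity": [0.0, 0.1, -0.1, 0.3, -0.3, 0.5],
})

# Pearson correlation: +1 is a perfect increasing line, -1 a perfect
# decreasing line, and values near 0 mean no linear relationship.
pearson_r = toy_sentiment["Sentiment_Subjectivity"].corr(
    toy_sentiment["Sentiment_Polarity"]
)
print(f"Pearson r = {pearson_r:.3f}")
```

A value close to zero on the real data would support the observation that the relationship is not consistently linear.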

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?



After analysing the whole dataset, we arrived at some suggestions to help the client achieve the business objective:

1. 'COST' is an independent variable that includes both fixed and variable costs. App prices should be kept as low as possible, which increases installations and makes the app accessible to everyone.

EX- In our dataset, the 'FINANCE' and 'FAMILY' categories contain the costliest apps, and users show less interest in them. Of the two, the 'FAMILY' category makes up a very large portion of the dataset yet is installed less. Developers should therefore pay attention to app pricing.

2. 'JUNK APPS' that are of little interest to users should be eliminated. Junk apps are apps in a language users cannot relate to, or apps published only for testing purposes by developers.

EX- The app 'I am Rich/Eu Son Rico' in the 'LIFESTYLE' category is an example of a junk app; notably, its installation count is zero.

3. Focus on the kinds of negative reviews users give, so that the problems of the app can be analysed and fixed.

EX- Some types of negative reviews given by users: Waste Time, Many Ads, Loading Time, Stuck Most Times.

4. Apps should be developed for the operating system version used by most users, so that installations are as high as possible. This requires thorough market research and analysis.

EX- In our dataset, the most used Android version is 4.0, so we should focus on it rather than on Android versions 1.0, 2.0 and 3.0.
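Suggestion 3 can be made concrete by tallying how often common complaint terms appear in negative reviews. This is a rough sketch; the review texts and the keyword list are hypothetical, and a real analysis would draw on the negative reviews in the user review dataset.

```python
import pandas as pd

# Toy negative review texts (illustrative only).
toy_negative_reviews = pd.Series([
    "Too many ads",
    "Loading time is long",
    "Stuck most times",
    "So many ads, waste of time",
])

# Hypothetical complaint keywords to look for.
keywords = ["ads", "loading", "stuck", "waste"]

# Lowercase once, then count the reviews containing each keyword as a plain substring.
lowered = toy_negative_reviews.str.lower()
keyword_counts = {
    kw: int(lowered.str.contains(kw, regex=False).sum()) for kw in keywords
}
print(keyword_counts)
```

Plain substring matching is crude (for example, "ads" also matches "loads"), so the counts are only a first indication of which complaints dominate.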

# **Conclusion**

The conclusions derived from the Exploratory Data Analysis of both datasets are as follows:

1. The 'FACEBOOK' app has received the highest number of reviews.

2. The majority of apps are rated for 'EVERYONE' (i.e. all age groups, 81.82%).

3. Apps in the 'Family' category are the most numerous on the Play Store but have fewer installations because of their high price.

4. Apps in the 'Game' category have the highest number of installations, followed by the 'Communication' and 'Tools' categories respectively.

5. The majority of apps are of type 'FREE', i.e. 92.18%.

6. The majority of apps have a rating in the range 4-5. The average rating is 4.3.

7. In every category, most apps run on Android Ver 4.0 and up.

8. The average app size is between 10MB and 20MB for the majority of apps. Only apps in the 'GAME' category stand out, with a size of 34MB.

9. Most user sentiment is 'Positive', i.e. 64.11%.

10. The app with the most positive reviews is 'Helix Jump'.

More precisely, the current trend in the Android market is led by apps from the 'GAME', 'COMMUNICATION' and 'TOOLS' categories, which are entertaining, communicating or assisting apps.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***