<a href="https://colab.research.google.com/github/ShouryaKumar1996/EDA_Project/blob/main/EDA_PlayStore_Final_Shourya.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - EDA on Playstore app data




##### **Project Type**    - EDA on Play Store dataset
##### **Contribution**    - Individual
##### **Prepared By**    - Shourya Kumar




# **Project Summary -**

In this project, the aim is to conduct an Exploratory Data Analysis (EDA) on a dataset comprising information from the Google Play Store. The dataset in question is divided into two main parts: "Play Store Data.csv" and "User Reviews.csv". The "Play Store Data.csv" file includes various details about apps available in the Google Play Store, such as app names, categories, ratings, reviews, size, installs, type (free or paid), price, content rating, genres, last updated, current version, and Android version. Meanwhile, the "User Reviews.csv" file contains user-generated reviews for these apps, providing insights into user sentiment and feedback.

This project's core objective is to extract insights and trends from the dataset that can help developers, marketers, and analysts understand the current app market on the Google Play Store. It involves cleaning the data by handling missing values, converting non-numeric values to a suitable numeric format for analysis, and deriving additional metrics that offer deeper insight into app performance and user preferences. The analysis is organized around a set of business questions, listed below.


# **GitHub Link -**

Provide your GitHub Link here.

Q1. How are apps distributed across different categories in the Google Play Store?

Q2. What is the relationship between app ratings and the number of reviews?

Q3. How do app ratings vary across different content ratings (e.g., Everyone, Teen, Mature)?

Q4. How are free and paid apps distributed across categories?

Q5. What is the distribution of app prices in the paid apps category?

Q6. How do app ratings vary with the year of the last update?

Q7. What are the common words used in positive and negative reviews in the User Reviews dataset?

Q8. How does the sentiment polarity of reviews vary across top categories?

Q9. What percentage of apps support different versions of Android?

Q10. What is the total number of app installs for each content rating?

Q11. What percentage of apps are free and what percentage are paid?

Q12. What is the average rating per category?

Q13. What is the correlation between rating, reviews, installs and price?

Q14. What is the average app rating per year?

# STEPS:


Steps involved:

1. Importing libraries

2. Loading the dataset

3. Data Cleaning

4. Exploratory Data Analysis

5. Visualization

6. Observations

7. Conclusion

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
playstore_data = pd.read_csv('/content/drive/MyDrive/EDA_shourya/Play Store Data.csv')
user_review_data = pd.read_csv('/content/drive/MyDrive/EDA_shourya/User Reviews.csv')

### Dataset First View

In [None]:
# Dataset First Look
playstore_data.head(5)

In [None]:
playstore_data.tail(5)

In [None]:
playstore_data.info()

In [None]:
playstore_data.describe()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
playstore_data.shape,user_review_data.shape

### Dataset Information

In [None]:
# Dataset Info
user_review_data.head()

In [None]:
user_review_data.tail()

In [None]:
user_review_data.info()

In [None]:
user_review_data.describe()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# keep=False marks every row that belongs to a group of identical rows
duplicate_counts_ps = playstore_data.duplicated(keep=False).sum()
duplicate_counts_ur = user_review_data.duplicated(keep=False).sum()


print("No. of duplicates in Playstore data:", duplicate_counts_ps)
print("No. of duplicates in User review data:", duplicate_counts_ur)




In [None]:

# Data preparation
duplicate_status = ['Duplicates', 'Unique']
counts = [playstore_data.duplicated(keep=False).sum(), (~playstore_data.duplicated(keep=False)).sum()]

# Plotting
plt.figure(figsize=(8, 6))
plt.bar(duplicate_status, counts, color=['red', 'green'])
plt.title('Duplicate vs Unique Rows (Playstore data)')
plt.ylabel('Count')
plt.show()


In [None]:

# Data preparation
duplicate_status = ['Duplicates', 'Unique']
counts = [user_review_data.duplicated(keep=False).sum(), (~user_review_data.duplicated(keep=False)).sum()]

# Plotting
plt.figure(figsize=(8, 6))
plt.bar(duplicate_status, counts, color=['red', 'green'])
plt.title('Duplicate vs Unique Rows (User review data)')
plt.ylabel('Count')
plt.show()
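
The counts above use `duplicated(keep=False)`, which counts every member of a duplicated group rather than only the redundant copies. A minimal sketch on a toy frame (the values are invented for illustration) of how the two counts differ:

```python
import pandas as pd

# Toy frame with invented values: rows 0-1 are identical, rows 3-5 are identical.
toy = pd.DataFrame({
    "App": ["a", "a", "b", "c", "c", "c"],
    "Rating": [4.1, 4.1, 3.9, 4.5, 4.5, 4.5],
})

# keep=False flags every row that has at least one identical twin.
rows_in_duplicate_groups = int(toy.duplicated(keep=False).sum())

# The default keep="first" flags only the repeats after the first occurrence,
# which is the number of rows drop_duplicates() would remove.
redundant_rows = int(toy.duplicated().sum())

deduplicated = toy.drop_duplicates()
print(rows_in_duplicate_groups, redundant_rows, len(deduplicated))
```

When the goal is to know how many rows would disappear after de-duplication, the default `keep="first"` count is usually the more relevant number.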


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_play=playstore_data.isna().sum()
null_play


In [None]:
null_user=user_review_data.isna().sum()
null_user

In [None]:
# Visualizing the missing values (Playstore data)
plt.figure(figsize=(10, 8))
null_play.plot(kind='bar', color='coral')
plt.title('Number of Null Values per Column (Playstore data)')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.show()


In [None]:
# Visualizing the missing values (User review data)
plt.figure(figsize=(10, 8))
null_user.plot(kind='bar', color='coral')
plt.title('Number of Null Values per Column (User review data)')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.show()


In [None]:
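# Most frequent value of each column that has missing entries
Rating_Mode = playstore_data['Rating'].mode().values[0]
Type_Mode = playstore_data['Type'].mode().values[0]
Content_Rating_Mode = playstore_data['Content Rating'].mode().values[0]
Android_Ver_Mode = playstore_data['Android Ver'].mode().values[0]

# NOTE: indexing the whole frame with a boolean row mask assigns the value to
# every column of the matching rows, not only to the missing cell. The rows
# overwritten here are identified (App == Rating) and dropped in the next cells.
playstore_data[playstore_data['Rating'].isna()] = Rating_Mode
playstore_data[playstore_data['Type'].isna()] = Type_Mode
playstore_data[playstore_data['Content Rating'].isna()] = Content_Rating_Mode
playstore_data[playstore_data['Android Ver'].isna()] = Android_Ver_Mode

# Assign the result back instead of calling fillna(inplace=True) on a column selection
playstore_data['Current Ver'] = playstore_data['Current Ver'].fillna('Not Available')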
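
The mode-based step above assigns through a boolean row mask on the whole frame. A minimal sketch on a toy frame (invented values, with `App` deliberately kept as object dtype) of how that differs from a column-wise `fillna`, which is the usual way to impute a single column:

```python
import pandas as pd

# Toy frame with invented values; "App" is kept as object dtype on purpose.
toy = pd.DataFrame({
    "App": pd.Series(["A", "B", "C"], dtype=object),
    "Rating": [4.0, None, 4.0],
})
rating_mode = toy["Rating"].mode()[0]

# Row-mask assignment: every column of the matching rows receives the value.
overwritten = toy.copy()
overwritten[overwritten["Rating"].isna()] = rating_mode

# Column-wise fill: only the missing Rating cell changes.
filled = toy.copy()
filled["Rating"] = filled["Rating"].fillna(rating_mode)

print(overwritten)
print(filled)
```

In the first result the app name of the affected row is lost, while the second keeps the row intact. Switching the notebook to the column-wise form would keep more rows and could therefore change the downstream figures.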

In [None]:
playstore_data.isna().sum()

In [None]:
# Rows whose 'App' value equals their 'Rating' value were overwritten by the imputation step above
Data_to_drop = playstore_data[playstore_data['App'] == playstore_data['Rating']]
Data_to_drop

In [None]:
playstore_data.drop(Data_to_drop.index, axis=0, inplace=True)

In [None]:
user_review_data.dropna(subset=['Translated_Review','Sentiment','Sentiment_Polarity','Sentiment_Subjectivity'],inplace=True)
user_review_data.info()

In [None]:
playstore_data.info()



In [None]:

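# Make the unit explicit in the column name
playstore_data.rename({'Price':'Price (in $)'}, inplace=True, axis=1)

# Replace leftover placeholder strings with 0, then convert review counts to integers
playstore_data['Reviews'] = playstore_data['Reviews'].replace('4.1 and up', 0).replace('Everyone', 0).astype(int)

# Replace placeholder strings with the mean of the numeric ratings, then convert to float
rating_mean = pd.to_numeric(playstore_data['Rating'], errors='coerce').mean()
playstore_data['Rating'] = playstore_data['Rating'].replace('4.1 and up', rating_mean).replace('Everyone', rating_mean).astype(float)

# Strip the trailing '+' and the thousands separators, e.g. '10,000+' -> 10000
playstore_data['Installs'] = playstore_data['Installs'].apply(lambda x: int(x.rstrip('+').replace(',', '')))

# Treat placeholder strings as free (0), then strip the '$' sign; str() keeps the numeric 0 parseable
playstore_data['Price (in $)'] = playstore_data['Price (in $)'].replace('4.1 and up', 0).replace('Everyone', 0).apply(lambda x: float(str(x).replace('$', '')))

# The last four characters of 'Last Updated' hold the year
playstore_data['Last Update Year'] = playstore_data['Last Updated'].str[-4:].astype(int)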



In [None]:
playstore_data.info()

In [None]:
user_review_data.info()

In [None]:
playstore_data.head(70)

### Data cleaning

### What all manipulations have you done and insights you found?

  1. Replaced any occurrences of '4.1 and up' and 'Everyone' in the Reviews column with 0 and converted the column to integers. Reviews should be numeric, so these strings are treated as data entry errors or placeholders.

  2. Handled the same placeholder values ('4.1 and up' and 'Everyone') in the Rating column by replacing them with the mean rating of the dataset, then converted the column to float so that ratings are stored as decimals.

  3. Cleaned the Installs column by removing the trailing '+' and the thousands separators and converting the values to integers.

  4. Renamed Price to Price (in $), replaced '4.1 and up' and 'Everyone' with 0 (apps with these placeholders are treated as free), and removed the '$' sign so that prices are stored as floats.

  5. Extracted the year from Last Updated into a new Last Update Year column.

  6. Dropped rows in the user review data that have no translated review, sentiment, sentiment polarity or sentiment subjectivity.

  7. Handled missing values in the Play Store data using the mode of Rating, Type, Content Rating and Android Ver, and filled missing Current Ver values with 'Not Available'.
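
The string conversions listed above can be sketched as small standalone helpers. The function names are hypothetical, and the input formats ('1,000,000+', '$4.99', 'January 7, 2018') are assumptions about how the raw columns are written:

```python
from datetime import datetime


def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to an int."""
    return int(text.rstrip("+").replace(",", ""))


def parse_price(text):
    """Convert a price such as '$4.99' (or '0' for a free app) to a float."""
    return float(text.replace("$", ""))


def parse_update_year(text):
    """Extract the year from a date assumed to look like 'January 7, 2018'."""
    return datetime.strptime(text, "%B %d, %Y").year


print(parse_installs("1,000,000+"))         # 1000000
print(parse_price("$4.99"))                 # 4.99
print(parse_update_year("January 7, 2018"))  # 2018
```

Using `rstrip("+")` rather than dropping the last character also handles a value without a trailing plus sign, such as `'0'`.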

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
playstore_data.columns

In [None]:
user_review_data.columns

In [None]:
# Dataset Describe
playstore_data.describe()

In [None]:
user_review_data.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in playstore_data.columns:
    print(f"Unique values in '{column}':", playstore_data[column].nunique())


In [None]:
for column in user_review_data.columns:
    print(f"Unique values in '{column}':", user_review_data[column].nunique())



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Distribution of apps across different categories

In [None]:
# Q1: Distribution of apps across different categories
plt.figure(figsize=(12, 8))
category_count = playstore_data['Category'].value_counts()
sns.barplot(x=category_count, y=category_count.index, palette='viridis')
plt.title('Distribution of Apps Across Categories in Google Play Store')
plt.xlabel('Number of Apps')
plt.ylabel('Category')
plt.show()


##### 1. Why did you pick the specific chart?


A bar chart makes categories easy to compare because the length of each bar shows the number of apps in that category.


##### 2. What is/are the insight(s) found from the chart?

The Family, Games and Tools categories have the highest number of apps.

##### 3. Will the gained insights help creating a positive business impact?


A business can target less crowded categories and look for an app idea that could become the leading app there. In the top categories the number of apps is already very high, so the room to stand out may be smaller.

#### Chart - 2 Relationship between app ratings and number of reviews

In [None]:

# Q2: Relationship between app ratings and number of reviews
plt.figure(figsize=(10, 6))
sns.scatterplot(data=playstore_data, x='Rating', y='Reviews', alpha=0.5)
plt.title('Relationship Between App Ratings and Number of Reviews')
plt.xlabel('Rating')
plt.ylabel('Number of Reviews')
plt.show()


##### 1. Why did you pick the specific chart?

The scatter plot was chosen to illustrate the relationship between two quantitative variables: app ratings and the number of reviews.

##### 2. What is/are the insight(s) found from the chart?

1. There is dense clustering of data points with a rating of around 4.0, which may indicate that the majority of apps have good ratings and a moderate number of reviews.
2. The highest concentration of reviews is among apps rated between 4 and 5. Very few apps with low ratings (below 3) have a large number of reviews, indicating that users might be less inclined to review poorly rated apps.

#### Chart - 3 App ratings vary across different content ratings

In [None]:

# Q3: App ratings vary across different content ratings
plt.figure(figsize=(10, 6))
sns.boxplot(data=playstore_data, x='Content Rating', y='Rating', palette='coolwarm')
plt.title('App Ratings Across Different Content Ratings')
plt.xlabel('Content Rating')
plt.ylabel('Rating')
plt.show()


##### 2. What is/are the insight(s) found from the chart?


The box plot visualizes the distribution of app ratings across different content ratings in the Google Play Store. Each box plot corresponds to one content rating category ('Everyone', 'Teen', 'Everyone 10+', 'Mature 17+', 'Adults only 18+', and 'Unrated').
Analyzing each content rating:

Everyone: The box plot shows that most apps rated for "Everyone" have a median rating around 4.3-4.5, with a fairly tight IQR, indicating that ratings are generally consistent. There are several outliers, particularly on the lower end, suggesting that there are a number of such apps with lower ratings.

Teen: Apps with a "Teen" rating have a similar median rating to the "Everyone" category with less spread of extreme outliers.


Mature 17+: The median rating is lower than the other categories for younger audiences, and the IQR is broad, suggesting greater variability in ratings. There are a few outliers showing both very high and very low ratings.

Adults only 18+: This category has the highest median rating but also a very small box and fewer data points, which suggests that there are fewer apps in this category and they tend to be rated more positively, although the sample size might be too small to draw definitive conclusions.

Unrated: The "Unrated" box plot is quite unique with a very small range and no IQR, implying very few apps in this category, and they tend to have similar ratings, which are around the median value.

Overall, the graph suggests that while there are differences in rating distributions across content ratings, most apps are rated above 4.0, with a trend toward a slightly lower median rating as the intended audience age increases. However, the "Adults only 18+" category bucks this trend, potentially due to the small number of apps with this rating.
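
The reading of the box plot above rests on three numbers per group: the median, the interquartile range (IQR) and the number of apps. A minimal sketch, on invented ratings, of how these could be tabulated next to the plot so that small groups are easy to spot:

```python
import pandas as pd

# Toy data with invented ratings; the real frame would be playstore_data.
toy = pd.DataFrame({
    "Content Rating": ["Everyone"] * 4 + ["Teen"] * 3 + ["Adults only 18+"],
    "Rating": [4.0, 4.2, 4.4, 4.6, 3.8, 4.2, 4.3, 4.5],
})

grouped = toy.groupby("Content Rating")["Rating"]
summary = pd.DataFrame({
    "count": grouped.count(),
    "median": grouped.median(),
    "q1": grouped.quantile(0.25),
    "q3": grouped.quantile(0.75),
})
# The IQR is the height of the box in a box plot.
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```

A group with a count of one, like the toy "Adults only 18+" row, has an IQR of zero, which is why a collapsed box says more about sample size than about consistency.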

#### Chart - 4 Free vs. Paid Apps in Categories

In [None]:
# Q4: Free vs. Paid Apps in Categories
# Recompute 'Type' from the price so that every app is labelled 'Free' or 'Paid'
playstore_data['Type'] = playstore_data['Price (in $)'].apply(lambda x: 'Free' if x == 0 else 'Paid')
category_type_counts = playstore_data.groupby(['Category', 'Type']).size().unstack().fillna(0)
# DataFrame.plot creates its own figure, so the size is set here instead of via plt.figure
category_type_counts.plot(kind='bar', stacked=True, figsize=(14, 8), color=['skyblue', 'orange'])
plt.title('Number of Free vs. Paid Apps Across Categories')
plt.xlabel('Category')
plt.ylabel('Number of Apps')
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

A stacked bar chart is compact and shows two things at once: the total number of apps in each category and how that total splits between free and paid apps.

##### 2. What is/are the insight(s) found from the chart?


Paid App Presence: While the number of paid apps is significantly lower compared to free apps, certain categories like "Personalization", "Medical" and "Tools" seem to have a relatively higher presence of paid apps. This could suggest that users may be more willing to pay for apps that offer utility or customization.


Niche Markets: Categories with a lower overall number of apps, including paid apps, might represent niche markets. These could be areas where specialized apps cater to specific user needs, and there may be opportunities for new entrants if they can provide value.



#### Chart - 5 Distribution of App Prices


In [None]:
# Q5: Distribution of App Prices
plt.figure(figsize=(10, 6))
# Filter out free apps for the price distribution of paid apps
paid_apps = playstore_data[playstore_data['Type'] == 'Paid']
sns.histplot(paid_apps['Price (in $)'], kde=True, bins=30, color='purple')
plt.title('Distribution of App Prices in Paid Apps Category')
plt.xlabel('Price (in $)')
plt.ylabel('Frequency')
plt.show()


##### 2. What is/are the insight(s) found from the chart?

Price Skew: The distribution of app prices is heavily right-skewed, meaning most of the paid apps are priced at the lower end of the spectrum.

Common Price Range: A large number of apps are clustered around the lowest price point, suggesting that most paid apps are relatively inexpensive.

Long Tail: There is a long tail extending toward the more expensive price points, indicating that while there are paid apps with higher prices, they are much rarer.

High-Price Outliers: The bars at the far end of the x-axis suggest there are a few apps with very high prices compared to the rest. These could be specialty apps or those aimed at a niche market.

Affordable Pricing Strategy: The high frequency of apps in the lower price range may indicate that developers price apps to encourage impulse purchases or to make the price of entry low enough to attract a larger user base.

Limited High-Priced App Market: The graph suggests that the market for very expensive apps is limited, with few developers pricing their apps in the higher brackets, possibly due to a lower demand at these price points.

Majority Below $50: Most paid apps are priced below $50.
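
The skew and the "below $50" observation can also be expressed as numbers instead of being read off the histogram. A minimal sketch on invented prices; the real input would be the `Price (in $)` column of the paid apps:

```python
import pandas as pd

# Invented prices with one expensive outlier, mimicking a long right tail.
prices = pd.Series([0.99, 0.99, 1.99, 2.99, 4.99, 9.99, 399.99], name="Price (in $)")

skewness = prices.skew()              # positive for a right-skewed distribution
share_below_50 = (prices < 50).mean()  # fraction of apps priced under $50
median_price = prices.median()         # robust to the outlier, unlike the mean

print(f"skewness={skewness:.2f}, share below $50={share_below_50:.1%}, median=${median_price}")
```

Reporting the median alongside the share below a threshold avoids letting a handful of very expensive apps dominate the summary.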


#### Chart - 6 Update Frequency and Ratings

In [None]:
# Q6: Update Frequency and Ratings
plt.figure(figsize=(10, 6))
# Assuming 'Last Update Year' is a column with the year of the last update
sns.scatterplot(data=playstore_data, x='Last Update Year', y='Rating', alpha=0.5)
plt.title('App Update Frequency vs. Rating')
plt.xlabel('Last Update Year')
plt.ylabel('Rating')
plt.show()


##### 2. What is/are the insight(s) found from the chart?

The number of data points for each year seems to increase over time, suggesting that more apps have been updated in recent years, or possibly that the number of apps in the store has been increasing.

#### Chart - 7 What are the common words used in positive and negative reviews in the User Reviews dataset?

In [None]:
#Q7. What are the common words used in positive and negative reviews in the User Reviews dataset?
from wordcloud import WordCloud

# Assuming 'Sentiment' is a column in 'user_review_data' that marks reviews as Positive, Negative, or Neutral
# And 'Translated_Review' is the column with the review text

# Filter positive and negative reviews
positive_reviews = ' '.join(user_review_data[user_review_data['Sentiment'] == 'Positive']['Translated_Review'].fillna(''))
negative_reviews = ' '.join(user_review_data[user_review_data['Sentiment'] == 'Negative']['Translated_Review'].fillna(''))

# Generate word clouds
plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
pos_wordcloud = WordCloud(width=800, height=400, background_color ='white').generate(positive_reviews)
plt.imshow(pos_wordcloud, interpolation='bilinear')
plt.title('Common Words in Positive Reviews')
plt.axis('off')

plt.subplot(1, 2, 2)
neg_wordcloud = WordCloud(width=800, height=400, background_color ='white').generate(negative_reviews)
plt.imshow(neg_wordcloud, interpolation='bilinear')
plt.title('Common Words in Negative Reviews')
plt.axis('off')
plt.show()


In positive reviews: Words like "great," "love," "good," "best," and "awesome" are prominent, suggesting strong satisfaction with the apps.

In negative reviews: The most prominent words are "bad," "problem," "issue," "worst," and "annoying," indicating common themes in negative reviews.
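
A word cloud scales words by how often they occur, so the same reading can be checked with a plain frequency count. A minimal sketch using only the standard library; the review texts and the tiny stop-word set are invented for illustration:

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"the", "a", "is", "it", "this", "and", "i", "to"}


def top_words(reviews, n=3):
    """Return the n most frequent non-stop-words across the given reviews."""
    words = []
    for review in reviews:
        for word in re.findall(r"[a-z']+", review.lower()):
            if word not in STOPWORDS:
                words.append(word)
    return Counter(words).most_common(n)


positive = ["Great app, I love it", "Great design and great support"]
negative = ["Bad update, the app is slow", "Bad ads, bad experience"]

print(top_words(positive, 1))
print(top_words(negative, 1))
```

An explicit count table is easier to compare across sentiment classes than two images, and it makes the effect of the stop-word list visible.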

#### Chart - 8 Sentiment Polarity Across Categories

In [None]:
playstore_df = playstore_data[['App','Category']]
user_review_df = user_review_data[['App','Sentiment','Sentiment_Polarity']]
# Keep every review and attach the category of its app
new = pd.merge(playstore_df, user_review_df, how='right', on='App')
new

In [None]:
#Q8: Sentiment Polarity Across Categories
plt.figure(figsize=(14, 8))
top_categories = new['Category'].value_counts().nlargest(5).index
filtered_reviews = new[new['Category'].isin(top_categories)]
sns.boxplot(data=filtered_reviews, x='Category', y='Sentiment_Polarity', palette='Set2')
plt.title('Sentiment Polarity of Reviews Across Top Categories')
plt.xlabel('Category')
plt.ylabel('Sentiment Polarity')
plt.xticks(rotation=45)
plt.show()


##### 2. What is/are the insight(s) found from the chart?

Health and Fitness: This category has a median sentiment polarity above 0.25, which indicates generally positive sentiment in the reviews. The spread of sentiment is narrow, suggesting that opinions about these apps are mostly positive with few outliers.


Neutral Sentiment: Categories with medians close to zero might suggest that on average, reviews tend to be more neutral or mixed, with positive and negative sentiments balancing each other out.

This box plot helps app developers and marketers understand which categories are well-received and which may require attention to improve user satisfaction, as reflected in the sentiment of the reviews.
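
The medians discussed above come from reviews joined to app categories. A minimal sketch on invented data of that join followed by a per-category median; it also shows one thing to watch for, namely that an app name listed more than once in the app table multiplies its reviews in the merged result:

```python
import pandas as pd

# Invented data; "FitNow" is listed twice on purpose.
apps = pd.DataFrame({
    "App": ["FitNow", "FitNow", "PuzzleBox"],
    "Category": ["HEALTH_AND_FITNESS", "HEALTH_AND_FITNESS", "GAME"],
})
reviews = pd.DataFrame({
    "App": ["FitNow", "FitNow", "PuzzleBox", "PuzzleBox"],
    "Sentiment_Polarity": [0.5, 0.3, 0.1, -0.1],
})

# Joining against the raw app table repeats each FitNow review twice.
naive = reviews.merge(apps, on="App", how="left")

# One row per app keeps the review count unchanged; validate raises if it is not.
unique_apps = apps.drop_duplicates(subset="App")
merged = reviews.merge(unique_apps, on="App", how="left", validate="many_to_one")

median_polarity = merged.groupby("Category")["Sentiment_Polarity"].median()
print(len(naive), len(merged))
print(median_polarity)
```

Whether this matters for the notebook depends on whether duplicate app rows remain in the Play Store table at the time of the merge.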

#### Chart - 9 What percentage of apps support different versions of Android?

In [None]:

#Q9.What percentage of apps support different versions of Android?
android_version_counts = playstore_data['Android Ver'].value_counts(normalize=True) * 100
plt.figure(figsize=(30, 15))
android_version_counts.plot(kind='pie', autopct='%1.1f%%')
plt.title('Percentage of Apps by Android Version Support')
plt.ylabel('')  # Hide the y-label as it's not needed for pie charts
plt.show()
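
The pie chart above draws one slice per distinct `Android Ver` label, which can produce many thin slices. A minimal sketch, assuming labels of the form "4.0.3 and up", of grouping them by major version before computing percentages; the helper name and the sample labels are hypothetical:

```python
from collections import Counter


def major_version_bucket(label):
    """Map '4.0.3 and up' to '4.x and up'; leave any other label unchanged."""
    if not label.endswith(" and up"):
        return label
    major = label.split()[0].split(".")[0]
    return f"{major}.x and up"


labels = ["4.0.3 and up", "4.1 and up", "5.0 and up", "Varies with device", "4.4 and up"]
counts = Counter(major_version_bucket(label) for label in labels)
percentages = {bucket: 100 * n / len(labels) for bucket, n in counts.items()}
print(percentages)
```

Labels that do not follow the assumed pattern, such as "Varies with device", are passed through untouched rather than guessed at.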


##### 2. What is/are the insight(s) found from the chart?

Broad Compatibility: A significant proportion of apps support a wide range of Android versions (e.g., "4.0.3 and up"), suggesting developers aim for broad compatibility to reach a larger user base.

Modern Android Support: There is a notable percentage of apps that require more recent versions of Android (e.g., "5.0 and up"), indicating developers are taking advantage of newer Android features and technologies.

Fragmentation: The variety of Android versions supported reflects the fragmentation of the Android ecosystem. Developers must decide whether to target newer versions with more features or older versions with a potentially larger user base.

"Varies with device": This category is quite substantial, showing that many apps have different minimum version requirements depending on the device. This could indicate adaptive app development practices where compatibility is tailored to individual device capabilities. Some of these may be apps that manufacturers pre-install on their own devices.

#### Chart - 10 Number of app installs by content rating.



In [None]:
#Q10. Number of app installs by content rating.

# Summarizing total installs by content rating
installs_by_content_rating = playstore_data.groupby('Content Rating')['Installs'].sum().reset_index()

# Sorting the data to ensure the chart is ordered by the number of installs
installs_by_content_rating = installs_by_content_rating.sort_values('Installs', ascending=False)


In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(data=installs_by_content_rating, x='Content Rating', y='Installs', palette='coolwarm')
plt.title('Total Installs by Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Total Number of Installs')
plt.xticks(rotation=45)
plt.show()


Dominance of 'Everyone' Category: Apps rated as appropriate for 'Everyone' have by far the most installs. This suggests that apps with the broadest possible audience reach the largest number of installations.

Decreasing Installs with Age Restriction: As the content rating becomes more restrictive (from 'Everyone' to 'Teen' to 'Everyone 10+' and so on), the total number of installs decreases. This trend could be due to a smaller target audience as the content rating narrows to older age groups.

#### Chart - 11 - Percentage of free and paid apps?

In [None]:

# Q11. Percentage of free and paid apps?
# Count how many apps fall under each 'Type' value (this variable was not defined before)
app_types = playstore_data['Type'].value_counts()
plt.figure(figsize=(7, 7))  # Adjust size to be square to accommodate the pie chart
app_types.plot(kind='pie', autopct='%1.1f%%', colors=['green', 'red'], startangle=90)
plt.title('Proportion of Free vs. Paid Apps')
plt.ylabel('')  # Remove the y-axis label for clarity as it's not needed for pie charts
plt.show()


About 93% of apps are free and roughly 7% are paid.

#### Chart - 12 Calculate the average rating per category.

In [None]:
#Q12 Calculate the average rating per category.

average_ratings = playstore_data.groupby('Category')['Rating'].mean().sort_values()

# Plot
plt.figure(figsize=(10, 8))
average_ratings.plot(kind='barh', color='skyblue')
plt.title('Average Rating by Category')
plt.xlabel('Average Rating')
plt.ylabel('Category')
plt.show()



Some categories might inherently receive higher ratings due to the nature of the apps they contain. For instance, "Education" and "Events" apps might fulfill specific needs, leading to higher satisfaction and thus higher ratings.

 The "Dating" category shows a lower average rating compared to others, indicating that users might have a harder time finding apps that meet their expectations or that they experience more dissatisfaction in this category.

#### Chart - 13 Correlation heatmap of Rating, Reviews, Installs and Price

In [None]:
data_corr = playstore_data[['Rating', 'Reviews', 'Installs', 'Price (in $)']]
corr_playstore_data = data_corr.corr()
corr_playstore_data

In [None]:
plt.figure(figsize=(10,5))
sns.heatmap(corr_playstore_data, vmin=-1, vmax=1, cmap='coolwarm', annot=True)
plt.show()

A heatmap was chosen because it shows every pairwise correlation in one view, with color encoding both the strength and the sign.

The overall insight from this heatmap is that the number of reviews and installs are moderately correlated, while price does not show a significant correlation with any of the other variables. Additionally, ratings have a very weak correlation with reviews and installs, and almost no correlation with price.
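
The coefficients in the heatmap are Pearson correlations, which is the default method of `DataFrame.corr()`. A minimal sketch of that formula in plain Python, on invented numbers, to make explicit what a value near 1, 0 or -1 means:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


installs = [100, 1000, 10000, 100000]
reviews = [10, 100, 1000, 10000]   # exactly installs / 10: perfectly linear
unrelated = [1, -1, -1, 1]          # no linear relation to the position in the list

print(pearson(installs, reviews))
print(pearson([1, 2, 3, 4], unrelated))
```

Pearson only measures linear association, so a weak coefficient between rating and installs does not rule out a non-linear relationship.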

#### Chart - 14 Average app rating per year

In [None]:
# Convert 'Last Updated' to datetime and extract the year
playstore_data['Last Updated'] = pd.to_datetime(playstore_data['Last Updated'])
playstore_data['Year Updated'] = playstore_data['Last Updated'].dt.year

# Calculate the average rating per year
average_rating_per_year = playstore_data.groupby('Year Updated')['Rating'].mean().reset_index()


In [None]:
plt.figure(figsize=(12, 6))
sns.lineplot(data=average_rating_per_year, x='Year Updated', y='Rating', marker='o')
plt.title('Average App Rating Over Time')
plt.xlabel('Year')
plt.ylabel('Average Rating')
plt.xticks(average_rating_per_year['Year Updated'])  # Ensure all years are displayed
plt.grid(True)  # Add a grid for easier reading
plt.show()


Initial Decline: There's a sharp decline in the average rating from 2010 to 2011. This might indicate either a surge of new apps with lower quality or changes in user expectations or rating behaviors.

Low Point in 2012: The year 2012 shows the lowest point in average app ratings. It would be interesting to explore if any significant market events occurred around this time, such as a flood of new app entries or policy changes in the store.

Recovery and Growth: Post-2012, there is a noticeable recovery and a steady growth in ratings. This could be due to app developers improving app quality, users becoming more familiar with the app ecosystem, or possibly adjustments in the app rating system itself.

Stabilization Period: Between 2013 and 2016, the average rating stabilizes around 4.0. This could suggest market maturity where apps maintain a consistent quality, or it might reflect a period where there were no significant changes impacting user ratings.

Uptrend in Recent Years: There's a clear upward trend from 2016 to 2018, indicating an improvement in average app ratings. This could be due to better app development practices, enhanced user engagement strategies, or perhaps a shift in the user base.

Overall Range: The overall range of the average ratings from the lowest to highest point is within a relatively narrow band (from about 3.8 to 4.3). This might suggest that while there are fluctuations, app quality, as perceived by users, does not vary drastically year over year.
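
Because the number of apps per update year grows over time, the yearly averages above rest on very different sample sizes. A minimal sketch on invented ratings that reports the count next to each mean:

```python
import pandas as pd

# Invented ratings: one app last updated in 2011, two in 2012, four in 2018.
toy = pd.DataFrame({
    "Year Updated": [2011, 2012, 2012, 2018, 2018, 2018, 2018],
    "Rating": [4.5, 3.0, 4.0, 4.2, 4.4, 4.3, 4.5],
})

# The count column shows how many apps stand behind each yearly mean.
per_year = toy.groupby("Year Updated")["Rating"].agg(["mean", "count"])
print(per_year)
```

A sharp move between two early years can come from a handful of apps, so the count column helps decide how much weight a dip or peak deserves.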

# **Conclusion**

1. The Family, Games and Tools categories have the most apps. A developer can therefore aim to build a high-quality app in a less crowded category, where there is more room to stand out.

2. The highest concentration of reviews is among apps that are rated between 4 and 5, which indicates that users might be less inclined to review poorly rated apps.

3. Rating distributions differ across content ratings, but most apps are rated above 4.0, with a trend toward a slightly lower median rating as the intended audience age increases. However, the "Adults only 18+" category bucks this trend, potentially due to the small number of apps with this rating.

4. Certain categories like "Personalization", "Medical" and "Tools" seem to have a relatively higher presence of paid apps. This could suggest that users may be more willing to pay for apps that offer utility or customization.

   Niche Markets: Categories with a lower overall number of apps, including paid apps, might represent niche markets. These could be areas where specialized apps cater to specific user needs, and there may be opportunities for new entrants if they can provide value.

5. Most paid apps are priced below $50. This might be a strategy by developers to keep the entry price low and encourage impulse purchases.

6. The number of data points for each year seems to increase over time, suggesting that more apps have been updated in recent years, or possibly that the number of apps in the store has been increasing.

7. In positive reviews: Words like "great," "love," "good," "best," and "awesome" are prominent, suggesting strong satisfaction with the apps.

  In negative reviews: The most prominent words are "bad," "problem," "issue," "worst," and "annoying," indicating common themes in negative reviews.

8. The Health and Fitness category has a median sentiment polarity above 0.25, and its spread shows that the reviews are mostly positive. Categories whose median is close to zero suggest neutral or mixed sentiment.

9. Broad Compatibility: A significant proportion of apps support a wide range of Android versions (e.g., "4.0.3 and up"), suggesting developers aim for broad compatibility to reach a larger user base.

  Modern Android Support: There is a notable percentage of apps that require more recent versions of Android (e.g., "5.0 and up"), indicating developers are taking advantage of newer Android features and technologies.

  Fragmentation: The variety of Android versions supported reflects the fragmentation of the Android ecosystem. Developers must decide whether to target newer versions with more features or older versions with a potentially larger user base.

  "Varies with device": This category is quite substantial, showing that many apps have different minimum version requirements depending on the device. This could indicate adaptive app development practices where compatibility is tailored to individual device capabilities. Some of these may be apps that manufacturers pre-install on their own devices.

10. The total number of installs decreases as the age restriction increases.

11. About 93% of apps are free and only about 7% are paid. Free apps are clearly preferred by developers and end users, although the paid share may grow as users' spending capacity increases.

12. The Education and Events categories have high average ratings, possibly because these apps fulfill specific needs.

  The Dating category shows a lower average rating, possibly because users have differing feature expectations that current apps do not fulfill. Regular updates can help in such cases.

13. The number of reviews and the number of installs are moderately correlated, while price does not show a significant correlation with any of the other variables. Additionally, ratings have a very weak correlation with reviews and installs, and almost no correlation with price.

14. The average rating rises steeply from 2012 to 2013, which might be due to the boom of the Android phone market or to better app development.

  There is also an uptrend in recent years, as more sophisticated apps are being developed with good development practices.
