# **Project Name**    - Play Store App Review Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

In today's digital age, user feedback plays a crucial role in shaping the success of mobile applications. Understanding user sentiments and preferences can help app developers make informed decisions to enhance their products. The Play Store App Review Analysis project aims to analyze user reviews and associated data from the Google Play Store to extract insights that can guide app developers in improving their offerings.

This study analyzes Play Store app data using Python, with a focus on identifying the key factors that influence app engagement and success. Using Python libraries such as NumPy, Pandas, Seaborn, and Matplotlib for data manipulation and visualization, the objective is to provide actionable insights to refine app performance in the Android market. The project revolves around two datasets: the Play Store Dataset and the User Reviews Dataset.

Results will be visualized using graphs, charts, and word clouds to make the findings more interpretable and accessible. The project will gather a large dataset of app reviews from the Google Play Store API or through web scraping techniques.

In conclusion, the Play Store App Review Project offers a comprehensive approach to understanding user feedback and improving app quality. By applying data analysis techniques, we can gain valuable insights into user sentiments, preferences, and pain points, enabling developers to make informed decisions and deliver better-performing applications. Continuous monitoring and analysis of user reviews are essential for maintaining app competitiveness and ensuring customer satisfaction in the dynamic mobile app market.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The mobile application market on the Google Play Store presents both opportunities and challenges for developers aiming to create successful apps. The problem at hand is to conduct a comprehensive analysis of the Play Store app ecosystem to uncover actionable insights that can guide developers towards creating more impactful and successful applications.

#### **Define Your Business Objective?**

Analyze the Play Store apps dataset using Python. Gain insights into competitors' strengths and weaknesses by analyzing user reviews of competing apps, identifying gaps in the market, and capitalizing on opportunities to differentiate and outperform competitors.

Encourage positive reviews and ratings by actively addressing user concerns, acknowledging feedback, and continuously improving the app based on user input, leading to improved app visibility and credibility on the Play Store.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

# Note: despite their names, `path` and `user_review_path` hold the loaded DataFrames, not file paths.
path = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Play_Store_Data.csv')

user_review_path = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/User_Reviews_project.csv')


### Dataset First View

In [None]:
# Dataset First Look

# display the play store app data
print('Play_Store_Data')
path.head()

In [None]:
# Dataset First Look

# display the user reviews project
print('User_Reviews_project')
user_review_path.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('Play_Store_Data Rows:', path.shape[0])
print('Play_Store_Data Columns:', path.shape[1])
print('User_Reviews_project Rows:', user_review_path.shape[0])
print('User_Reviews_project columns:', user_review_path.shape[1])


### Dataset Information

In [None]:
# Dataset Info
print(path.info())
print('\n')

print(user_review_path.info())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_count = path.duplicated().sum()
print('Duplicate row count in play store data:',duplicate_count)

duplicate_count = user_review_path.duplicated().sum()
print('Duplicate row count in user reviews project:',duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

missing_values_count = path.isnull().sum()
print('Missing values per column in play store data:',missing_values_count,sep='\n')

print('\n')

missing_values_count = user_review_path.isnull().sum()
print('Missing values per column in user reviews project:',missing_values_count,sep='\n')

In [None]:
# Visualizing the missing values
# checking null values by plotting heatmap for play store data

print('Null value heatmap for play store data')

sns.heatmap(path.isnull(), cbar=False)
plt.title('Null value heatmap for play store data')
plt.show()


In [None]:
# Visualizing the missing values
# checking null values by plotting heatmap for user reviews project

print('Null value heatmap for User Reviews project')

sns.heatmap(user_review_path.isnull(), cbar=False)
plt.title('Null value heatmap for user reviews project')
plt.show()

### What did you know about your dataset?

The two datasets together support a comprehensive analysis of Android apps on the Play Store.

1) The Play Store dataset has 10,841 rows and 13 columns.

2) In the Play Store dataset, the Rating column has the most missing values: 1,474 nulls.

3) The Play Store dataset contains 483 duplicate rows.

4) The User Reviews dataset has 64,295 rows and 5 columns.

5) In the User Reviews dataset, the Translated_Review column has the most missing values: 26,868 nulls.

6) The User Reviews dataset contains 33,616 duplicate rows.
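
The figures listed above can be gathered in one pass rather than cell by cell. The following is a minimal sketch, assuming a helper named `summarize_dataset` (hypothetical, not part of the notebook) and a tiny made-up frame standing in for the real CSV data.

```python
import pandas as pd

def summarize_dataset(df: pd.DataFrame) -> dict:
    """Collect row/column counts, duplicate rows, and the column with most nulls."""
    null_counts = df.isnull().sum()
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "most_missing_column": null_counts.idxmax(),
        "most_missing_count": int(null_counts.max()),
    }

# Hypothetical demo data, used only to illustrate the helper
demo_frame = pd.DataFrame({
    "App": ["A", "A", "B", "C"],
    "Rating": [4.5, 4.5, None, None],
})
summary = summarize_dataset(demo_frame)
print(summary)
```

In the notebook, the same helper could be called as `summarize_dataset(path)` and `summarize_dataset(user_review_path)` before any cleaning is applied.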

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

print('Play_Store_Data columns:',path.columns,sep='\n')

print('\n')

print('User_Reviews_project columns:',user_review_path.columns,sep='\n')

In [None]:
# Dataset Describe
print('Play store Dataset description:',path.describe(include='all'),sep='\n')

print('\n')

print('user reviews project Dataset description:',user_review_path.describe(include='all'),sep='\n')

### Variables Description

**Variables descriptions for Play Store Dataset:**

App: The name of the application.

Category: The Play Store category the app belongs to, which helps users browse and discover apps that meet their needs or interests.

Rating: The app's average user rating.

Reviews: The number of user reviews the app has received.

Size: The space the app occupies on a device, given in megabytes (M) or kilobytes (k), or 'Varies with device'.

Installs: The number of times the app has been installed, reported as a lower bound with a trailing '+'.

Type: Indicates whether the app is free or paid.

Price: The amount of money charged for the app.

Content Rating: The audience the app is suitable for (for example, Everyone, Teen, or Mature 17+).

Genres: The genre or genres the app is classified under.

Last Updated: The date of the app's last update.

Current Ver: The app's current version.

Android Ver: The Android version supporting the app.

**Variables descriptions for User Reviews Dataset:**

App: The app's name with a brief description.

Translated_Review: English translation of the user's review.

Sentiment: The reviewer’s attitude categorized as 'Positive', 'Negative', or 'Neutral'.

Sentiment_Polarity: The review's polarity, ranging from -1 (Negative) to 1 (Positive).

Sentiment_Subjectivity: The score indicates how subjective the review text is, with a range of [0, 1]. Higher scores suggest that the review expresses personal opinion, while lower scores indicate more factual information in the review.
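
The ranges stated for the two sentiment scores can be checked directly against the data. The sketch below is one way to do that; `out_of_range_counts` and the demo frame are hypothetical names introduced for illustration only.

```python
import pandas as pd

def out_of_range_counts(reviews: pd.DataFrame) -> dict:
    """Count non-null scores that fall outside their documented ranges."""
    polarity = reviews["Sentiment_Polarity"].dropna()
    subjectivity = reviews["Sentiment_Subjectivity"].dropna()
    return {
        # Polarity is documented as lying in [-1, 1]
        "Sentiment_Polarity": int((~polarity.between(-1, 1)).sum()),
        # Subjectivity is documented as lying in [0, 1]
        "Sentiment_Subjectivity": int((~subjectivity.between(0, 1)).sum()),
    }

# Hypothetical demo data with one bad value in each column
demo_reviews = pd.DataFrame({
    "Sentiment_Polarity": [0.5, -1.0, 1.2, None],
    "Sentiment_Subjectivity": [0.0, 1.0, 0.3, -0.1],
})
range_report = out_of_range_counts(demo_reviews)
print(range_report)
```

On the real data, `out_of_range_counts(user_review_path)` would be expected to report zero for both columns if the documented ranges hold.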

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# unique value for each variable in play store dataset

unique_value_in_play_store_data = path.nunique()
print('unique value for each variable in play store dataset:',unique_value_in_play_store_data,sep='\n')

print('\n')

# unique value for each variable in user reviews dataset
unique_value_in_user_reviews_project_data = user_review_path.nunique()
print('unique value for each variable in user reviews dataset:',unique_value_in_user_reviews_project_data ,sep='\n')

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
# Print the rows with non-numeric characters in the 'Reviews' column
non_numeric_reviews =path[path['Reviews'].str.contains(r'\D')]
print("Rows with non-numeric characters in 'Reviews' column:")
non_numeric_reviews

In [None]:
# The row at index 10472 contains data that is entirely incorrect or irrelevant.
# The row is deemed unusable or misleading for the analysis, and removing it ensures the integrity and accuracy of the overall dataset.
path = path.drop(index=10472)

# Resetting the index ensures that the DataFrame has continuous and ordered indices after dropping a row.
path = path.reset_index(drop=True)

In [None]:
# Convert the 'Reviews' column to integer datatype
path['Reviews'] = path['Reviews'].astype(int)

In [None]:
# Convert the 'Last Updated' column to datetime format
path['Last Updated'] = pd.to_datetime(path['Last Updated'])


In [None]:
# Creating a function drop_dollar, which drops the $ symbol if present and returns the value as a float.
def drop_dollar(value):

    if '$' in value:
        return float(value[1:])
    else:
        return float(value)

# Applying the drop_dollar function to the 'Price' column
path['Price'] = path['Price'].apply(lambda x: drop_dollar(x))


In [None]:
# Defining a function drop_plus that removes the '+' and ',' symbols if present and returns the result as an integer.

def drop_plus(value):
    '''
    Drops the '+' suffix and any ',' thousands separators, then returns the value as an int.
    If the value is not a valid integer, returns 0.
    '''
    try:
        # Strip both symbols in one step; the check `'+' and ',' in value` only tests for ','
        return int(value.replace('+', '').replace(',', ''))
    except ValueError:
        return 0

# After conversion, 'Installs' holds the minimum number of times an app has been installed:
#   0       -> the app has not been installed
#   1       -> the app has been installed at least once
#   1000000 -> the app has been installed at least one million times, and so on.

# The drop_plus function applied to the 'Installs' column
path['Installs'] = path['Installs'].apply(drop_plus)


In [None]:
# Defining a function to convert all the entries in KB to MB and then converting them to float datatype.

def kb_to_mb(entry):
    '''
    Converts size entries to MB. Returns as a float if in megabytes (M), or converts and rounds to 4 decimal places if in kilobytes (k).
    Returns the original entry if not in either format or if any conversion exception occurs.
    '''
    try:
        if 'M' in entry:
            return float(entry[:-1])
        elif 'k' in entry:
            return round(float(entry[:-1]) / 1024, 4)
        else:
            return entry
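    except (TypeError, ValueError):
        # TypeError: entry is not a string (e.g. NaN); ValueError: the number could not be parsed
        return entry

# The kb_to_mb function applied to the 'Size' column
path['Size'] = path['Size'].apply(kb_to_mb)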

In [None]:
# Verifying the data type information after type conversion
print('Play Store Updated Data Info:')
path.info()

In [None]:
# Extract non-float values in 'Size' column
non_float_size_values = path['Size'][~path['Size'].apply(lambda x: isinstance(x, float))]

# Calculate the percentage of non-float values in 'Size' column
percentage_non_float = (len(non_float_size_values) / len(path['Size'])) * 100

# Print the result
print(f"Non-float values in the 'Size' column: {non_float_size_values.unique()}")
print(f"Percentage of non-float values in the 'Size' column: {percentage_non_float:.2f}%")


# 'Varies with device' is the only non-float entry and accounts for 15.64% of the rows,
# so rows with this value in the 'Size' column are retained rather than dropped.


In [None]:
# Show Dataset Rows & Columns count Before Removing Duplicates

print('Play_Store_Data Rows count:', path.shape[0])
print('Play_Store_Data Columns count:', path.shape[1])
print('User_Reviews_project Rows count:', user_review_path.shape[0])
print('User_Reviews_project columns count:', user_review_path.shape[1])
print('\n')
# remove duplicates
path.drop_duplicates(inplace=True)
user_review_path.drop_duplicates(inplace=True)


# Show Dataset Rows & Columns count After Removing Duplicates

print('Play_Store_Data Rows count:', path.shape[0])
print('Play_Store_Data Columns count:', path.shape[1])
print('User_Reviews_project Rows count:', user_review_path.shape[0])
print('User_Reviews_project columns count:', user_review_path.shape[1])


In [None]:
# Fill missing values for numerical columns with the median and categorical with the mode
# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which newer pandas versions warn about and may not apply to the original DataFrame.
# For Play Store
path['Rating'] = path['Rating'].fillna(path['Rating'].median())
path['Type'] = path['Type'].fillna(path['Type'].mode()[0])
path['Content Rating'] = path['Content Rating'].fillna(path['Content Rating'].mode()[0])
path['Current Ver'] = path['Current Ver'].fillna('Varies with device')
path['Android Ver'] = path['Android Ver'].fillna('Varies with device')

# For User Reviews
user_review_path['Sentiment_Polarity'] = user_review_path['Sentiment_Polarity'].fillna(user_review_path['Sentiment_Polarity'].median())
user_review_path['Sentiment_Subjectivity'] = user_review_path['Sentiment_Subjectivity'].fillna(user_review_path['Sentiment_Subjectivity'].median())
user_review_path['Sentiment'] = user_review_path['Sentiment'].fillna(user_review_path['Sentiment'].mode()[0])
user_review_path['Translated_Review'] = user_review_path['Translated_Review'].fillna('No review')

# Check missing values again to confirm
user_reviews_missing_updated = user_review_path.isnull().sum()
play_store_missing_updated = path.isnull().sum()

print('\nUpdated number of missing values in Play Store dataset:')
print(play_store_missing_updated)
print('Updated number of missing values in User Reviews dataset:')
print(user_reviews_missing_updated)

In [None]:
# Show Dataset Rows & Columns count Before Removing Outliers
print('Shape Before Removing Outliers:')
print('Play Store Data Rows count:',path.shape[0])
print('Play Store Data Columns count:',path.shape[1])
print('User Reviews Data Rows count:',user_review_path.shape[0])
print('User Reviews Data Columns count:',user_review_path.shape[1],end='\n\n')

# Removing Outliers from Data
# Define the quantile range
quantile_low = 0.05
quantile_high = 0.95

# Remove outliers for Reviews column
path = path[(path['Reviews'] >= path['Reviews'].quantile(quantile_low)) &
                      (path['Reviews'] <= path['Reviews'].quantile(quantile_high))]

# Remove outliers for Installs column
path = path[(path['Installs'] >= path['Installs'].quantile(quantile_low)) &
                      (path['Installs'] <= path['Installs'].quantile(quantile_high))]

# Show Dataset Rows & Columns count After Removing Outliers
print('Shape After Removing Outliers:')
print('Play Store Data Rows count:',path.shape[0])
print('Play Store Data Columns count:',path.shape[1])
print('User Reviews Data Rows count:',user_review_path.shape[0])
print('User Reviews Data Columns count:',user_review_path.shape[1])
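
The two filters above run one after the other, so the Installs percentiles are computed on rows that already survived the Reviews filter. As a variation (an assumption, not what the notebook does), the sketch below computes every bound on the untouched frame and applies them together. `filter_by_quantiles` and the demo data are hypothetical.

```python
import pandas as pd

def filter_by_quantiles(df: pd.DataFrame, columns, low=0.05, high=0.95) -> pd.DataFrame:
    """Keep rows whose value lies within [low, high] quantiles for every listed column."""
    mask = pd.Series(True, index=df.index)
    for column in columns:
        # Both bounds come from the original, unfiltered column
        lower, upper = df[column].quantile([low, high])
        mask &= df[column].between(lower, upper)
    return df[mask]

# Hypothetical demo data: Reviews runs from 1 to 100, Installs is constant
demo_apps = pd.DataFrame({
    "Reviews": list(range(1, 101)),
    "Installs": [10] * 100,
})
trimmed = filter_by_quantiles(demo_apps, ["Reviews", "Installs"])
print(trimmed.shape)
```

Whether the sequential or the simultaneous variant is preferable depends on the analysis; the simultaneous one makes the retained range of each column independent of the order in which columns are processed.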

### What all manipulations have you done and insights you found?

**The following actions were taken to make the datasets analysis-ready:**

1) Identifying Non-Numeric Reviews:

Checked and printed rows with non-numeric characters in the 'Reviews' column.

2) Removing Irrelevant Row:

Dropped the row at index 10472 as it contained incorrect or irrelevant data, ensuring dataset integrity.

3) Converting Reviews to Integer:

Converted the 'Reviews' column to integer data type for numerical analysis.

4) Converting Last Updated to Datetime:

Converted the 'Last Updated' column to datetime format for temporal analysis.

5) Handling Price Values:

Created a function (drop_dollar) to drop the '$' symbol and convert the 'Price' column to float data type.

6) Handling Installs Values:

Created a function (drop_plus) to drop the '+' symbol and convert the 'Installs' column to integer data type.

7) Converting Size Entries:

Created a function (kb_to_mb) to convert size entries to MB and handle 'k' or 'M' units.

8) Verifying Data Types:

Checked and printed the updated data type information after the type conversion.

9) Removing Duplicates:

Removed duplicate rows from both the Play Store and User Reviews datasets.

10) Handling Missing Values:

Filled missing values for numerical columns with the median and categorical columns with the mode.
Checked and printed the updated number of missing values in both datasets.

11) Handling Outliers:

Removed outliers from the Reviews and Installs columns by keeping only the values between the 5th and 95th percentiles.
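
Steps 5 to 7 above rely on row-wise `apply` helpers. A vectorized variant might look like the sketch below. It is not the notebook's method: the function name `clean_numeric_columns` and the demo frame are hypothetical, and unlike `kb_to_mb` it turns 'Varies with device' into NaN instead of keeping the string.

```python
import pandas as pd

def clean_numeric_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Price, Installs and Size converted to numeric values."""
    out = df.copy()
    # '$4.99' -> 4.99
    out["Price"] = pd.to_numeric(
        out["Price"].str.replace("$", "", regex=False), errors="coerce"
    )
    # '1,000+' -> 1000; unparseable entries become 0, as in drop_plus
    out["Installs"] = (
        pd.to_numeric(out["Installs"].str.replace("[+,]", "", regex=True), errors="coerce")
        .fillna(0)
        .astype(int)
    )
    # '19M' -> 19.0, '512k' -> 0.5; anything else becomes NaN (assumption of this sketch)
    size = out["Size"].astype(str)
    number = pd.to_numeric(size.str[:-1], errors="coerce")
    in_mb = number.where(size.str.endswith("M"))
    in_kb = (number / 1024).round(4)
    out["Size"] = in_kb.where(size.str.endswith("k"), in_mb)
    return out

# Hypothetical demo rows covering each input format
demo_raw = pd.DataFrame({
    "Price": ["0", "$4.99", "0"],
    "Installs": ["1,000+", "50+", "0"],
    "Size": ["512k", "19M", "Varies with device"],
})
cleaned = clean_numeric_columns(demo_raw)
print(cleaned)
```

Keeping 'Size' fully numeric simplifies later grouping, at the cost of losing the explicit 'Varies with device' label.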

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Distribution of Sentiment Polarity
# Set up the visualisation settings
sns.histplot(user_review_path['Sentiment_Polarity'], edgecolor='black', kde=True , bins=30, color='skyblue')
plt.xlabel('Sentiment Polarity', size=15)
plt.ylabel('Frequency', size=15)
plt.title('Distribution of Sentiment Polarity')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is effective at showing how a numeric variable, here the sentiment polarity of reviews, is distributed.

##### 2. What is/are the insight(s) found from the chart?

**Positive sentiment:** the positive side of the histogram (polarity above 0) represents reviews from users who are satisfied with the app.

**Negative sentiment:** the negative side (polarity below 0) represents reviews that express dissatisfaction, such as complaints about harm, distractions, or a poor experience.

**Histogram chart:** the bar heights show where reviews are concentrated, from the most common polarity values to the least common, giving an overall picture of user satisfaction with Play Store apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

POSITIVE BUSINESS IMPACT:

* Positive sentiment from users across the globe indicates that most apps are clean, user-friendly, and give a good experience, which helps attract new users. Positive reviews and high satisfaction tell developers which apps are gaining popularity and succeeding.

* The distribution is concentrated on the positive side of zero, which suggests that most apps are positively received. This implies a general trend of user satisfaction, which can contribute to positive growth. Developers can build on this positive sentiment by reinforcing the features users appreciate and further enhancing the user experience.


NEGATIVE GROWTH:

* Negative sentiment indicates that some apps have major bugs that users run into. This dissatisfaction shows up as negative reviews and feedback, and it may lead users to seek alternative apps that give a better experience.

* User dissatisfaction lowers app ratings, and it can be hard to recover users' trust in the market afterwards.
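
The statements above about where polarity values concentrate can be checked numerically instead of read off the chart. The sketch below is one hedged way to do so; `polarity_summary` and the demo series are hypothetical. Note that the notebook fills missing polarity values with the median beforehand, which affects these shares.

```python
import pandas as pd

def polarity_summary(polarity: pd.Series) -> dict:
    """Share of positive, negative and zero polarity values, plus sample skewness."""
    values = polarity.dropna()
    return {
        "positive_share": float((values > 0).mean()),
        "negative_share": float((values < 0).mean()),
        "zero_share": float((values == 0).mean()),
        # Positive skewness means a longer tail towards high values
        "skewness": float(values.skew()),
    }

# Hypothetical demo values, not taken from the dataset
demo_polarity = pd.Series([0.8, 0.5, 0.3, 0.0, -0.4])
polarity_stats = polarity_summary(demo_polarity)
print(polarity_stats)
```

In the notebook this could be called as `polarity_summary(user_review_path['Sentiment_Polarity'])`.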

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# distribution of app rating in play store data
sns.histplot(path['Rating'], bins=40, kde=True, edgecolor= 'black')
plt.title('Distribution of app rating', size=20)
plt.xlabel('Rating', size=15)
plt.ylabel('Frequency', size=15)
plt.xlim(0,6)
plt.show()

##### 1. Why did you pick the specific chart?

A histogram gives a clear view of the Rating variable in the Play Store dataset and shows how frequently each rating level occurs.

##### 2. What is/are the insight(s) found from the chart?

Most ratings fall between 4 and 5, which suggests that users are generally satisfied with the apps they have installed and enjoy using them.

Apps with lower ratings reflect user dissatisfaction and negative reviews, and they need improvement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact:**

* Positive impact: the majority of users are satisfied with the apps.

* Rating enthusiasm: positive ratings enhance an app's reputation and value.

**Negative Growth:**

* User dissatisfaction: low-rated apps that do not serve users well may lose value in the market, as users switch to alternatives that are more useful to them.

* With so many apps on the market, users have plenty of alternatives, so a low-rated app may lose ground.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Distribution of Free and Paid app
# Count each app type in the 'Type' column
app_type = path['Type'].value_counts()
# Take the labels from the counts so they always match the slice order
labels = app_type.index
explode = [0,0.2]
colors=['lightskyblue','yellow']

plt.pie(app_type,labels=labels, explode=explode, colors=colors, autopct='%1.2f%%', shadow=True)
plt.title('Distribution of Free and Paid app')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is a simple and straightforward way to show how a whole is split between a small number of categories, here free and paid apps.

##### 2. What is/are the insight(s) found from the chart?

In this pie chart, the larger slice represents free apps with 92.34% of the total, and the smaller slice represents paid apps with 7.66%.
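
The percentages shown on the pie chart can also be computed as plain numbers, which is handy for quoting them in text. This is a minimal sketch; `type_shares` and the demo series are hypothetical names.

```python
import pandas as pd

def type_shares(types: pd.Series) -> pd.Series:
    """Percentage of apps per value of the Type column, rounded to two decimals."""
    return (types.value_counts(normalize=True) * 100).round(2)

# Hypothetical demo data: three free apps and one paid app
demo_types = pd.Series(["Free", "Free", "Free", "Paid"])
shares = type_shares(demo_types)
print(shares)
```

Calling `type_shares(path['Type'])` in the notebook should reproduce the percentages that `autopct` prints on the chart.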

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact:**

* Free apps are more likely to attract users to download.

* Developers of free apps can generate revenue through monetization avenues such as advertising and premium features.

**Negative growth:**

* Paid apps have fewer users than free apps. Users are often unwilling to spend money on an app or its premium tier and look for alternatives instead.

* Developers of free apps often earn money through advertising, which can hurt the user experience: ads shown in the middle of a session make users wait before they can continue using the app.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Top 5 install genres
genre_install = path.groupby('Genres')['Installs'].sum().reset_index()

# Sort by installs in descending order
sorted_genre_installs = genre_install.sort_values(by='Installs', ascending=False)

# select the top 5 genres
top_5_genres= sorted_genre_installs.head(5)

sns.barplot(x='Genres', y='Installs', data=top_5_genres, palette ='Set2')
plt.title('Top 5 Genres', size=20)
plt.xlabel('App Genres',size=15)
plt.ylabel('Installs',size=15)

plt.show()


##### 1. Why did you pick the specific chart?

Bar plots are useful for visualizing categorical data and comparing the values of different categories.

##### 2. What is/are the insight(s) found from the chart?

* Tools apps are the most popular genre. This is likely due to the increasing reliance on smartphones and tablets for work and productivity.

* Action apps are the second most popular genre. Action games are typically fast-paced and exciting, and they appeal to a wide range of users.

* Photography apps are the third most popular genre. This is likely due to the increasing popularity of smartphone photography.

* Entertainment apps are the fourth most popular genre. Entertainment apps include various streaming services as well as social media apps.

* Communication apps are the fifth most popular genre. Communication apps include messaging apps and video conferencing apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1) Tools Apps Popularity: With Tools apps being the most popular genre, businesses can capitalize on this trend by developing and optimizing apps that cater to work and productivity needs. This insight suggests a demand for applications that enhance users' efficiency and organization, presenting an opportunity for businesses to create valuable and practical tools.

2) Action Apps Appeal: The popularity of Action apps, known for their fast-paced and exciting nature, indicates a broad appeal. Developers can leverage this insight to create engaging and entertaining games, potentially attracting a wide user base. This genre's popularity suggests a demand for immersive and thrilling experiences, which can be monetized effectively.

3) Photography Apps Trend: The increasing popularity of Photography apps aligns with the growing trend of smartphone photography. Businesses can seize this opportunity by developing innovative photography apps, offering features that enhance photo editing, organization, and sharing. This genre's popularity reflects a consumer interest in visual content creation.

4) Entertainment Apps Opportunities: Entertainment apps, encompassing streaming services and social media apps, hold significant popularity. Businesses can explore opportunities within this genre, either by creating new content streaming services or optimizing social media platforms. This trend indicates a sustained demand for diverse entertainment options.

5) Communication Apps Importance: The popularity of Communication apps, including messaging and video conferencing, highlights the essential role of connectivity. Businesses can focus on creating user-friendly and feature-rich communication apps, meeting the increasing demand for seamless connectivity and collaboration.

**Negative Growth:**

1) Competitive Challenges: If certain genres have a saturated market with intense competition, it may be challenging for new apps to gain visibility and user traction. This could lead to negative growth for apps in those highly competitive categories.

2) Addressing Negative Reviews: Negative sentiment in reviews or lower app ratings may indicate areas for improvement. Ignoring or failing to address these issues could result in negative user experiences, leading to decreased installs and usage.

3) Adapting to Trends: App markets are dynamic, and user preferences can change. Failing to adapt to emerging trends or technological advancements may result in declining popularity and negative growth for apps that become outdated.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Content Ratings by Installs
# Group by Content Rating, summing the installs for each content rating
content_rating_installs = path.groupby('Content Rating')['Installs'].sum().reset_index()

# Sort by installs in descending order
sorted_content_rating_installs = content_rating_installs.sort_values(by='Installs', ascending=False)

# Set up the plot

sns.barplot(x='Installs', y='Content Rating',data=sorted_content_rating_installs, palette='Set2')

# Customize the plot
plt.title('Content Ratings by Installs',size=20)
plt.xlabel('Total Installs',size=15)
plt.ylabel('Content Ratings',size=15)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts serve as a visual tool to effectively represent and compare data across distinct categories. Utilizing bars of varying lengths, these charts visually convey the values associated with each category, facilitating the observation of differences and similarities. Hence, I opted for a bar chart to analyze the variations in content ratings based on installations.

##### 2. What is/are the insight(s) found from the chart?

The categories Everyone and Teen stand out with the highest number of installs, indicating preferences for apps suitable for all ages or users aged 13 and above. These categories encompass apps with minimal or mild content, including educational, entertainment, or social apps.

The Everyone 10+ category follows with the third-highest installs, suggesting a preference for apps suitable for users aged 10 and above. Such apps may contain more moderate content, such as fantasy or science fiction.

The Mature 17+ and Adults only 18+ categories exhibit significantly fewer installs. This implies a limited preference for apps tailored to users aged 17 or older or 18 and older, which may feature intense or graphic content like violence, sexual content, drug use, or gambling.

The Unrated category records the fewest installs, suggesting minimal interest in apps lacking official ratings. These apps may have unknown or variable content, potentially unsuitable for some users.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
The Everyone and Teen categories exhibit the highest number of installs, suggesting a positive reception and indicating a market demand for apps suitable for a broad audience, including educational and entertainment content.

The Everyone 10+ category, with the third-highest installs, reflects a positive response to apps tailored for users aged 10 and above. This indicates potential business opportunities in developing content with moderate themes for this demographic.

**Negative Growth Consideration:**
Limited installs for the Mature 17+ and Adults only 18+ categories suggest a potential negative impact. The lower preference for apps with intense or graphic content for users aged 17 or older may indicate a narrower market, prompting consideration before heavy investment in such content development.

The Unrated category, recording the fewest installs, highlights user reluctance towards apps lacking official ratings. This hesitation could be attributed to uncertainties about the app's content, posing a challenge for positive user engagement and potential business growth.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Size Groups by Installs
# Define the function to group app sizes
def size_groups(value):
    try:
        if value < 20:
            return 'Below 20'
        elif value >= 20 and value <= 40:
            return '20-40'
        elif value > 40 and value <= 60:
            return '40-60'
        elif value > 60 and value <= 80:
            return '60-80'
        elif value >80 and value <=100:
            return '80-100'
        else:
            return 'Above 100'
    except:
        return value

# Apps with size 'Varies with device' have dynamic sizes
# that are not explicitly stated, making it challenging to categorize them accurately.
# Exclude rows where size is 'Varies with device'
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added below
df_filtered_size = path[path['Size'] != 'Varies with device'].copy()

# Apply the size_groups function to create a new 'Size Group' column
df_filtered_size['Size Group'] = df_filtered_size['Size'].apply(size_groups)

# Group by Size Group, summing the installs for each size group
size_group_installs = df_filtered_size.groupby('Size Group')['Installs'].sum().reset_index()

# Sort by installs in descending order
sorted_size_group_installs = size_group_installs.sort_values(by='Installs', ascending=False)

# Set up the plot
plt.figure(figsize=(10, 5))
sns.barplot(x='Installs', y='Size Group', data=sorted_size_group_installs, palette='pastel')

# Customize the plot
plt.title('Size Groups by Installs',size=20)
plt.xlabel('Total Installs',size=15)
plt.ylabel('Size Groups(MB)',size=15)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

Using bars of varying lengths, bar charts are a visual way to efficiently compare and contrast data across different categories. The values for each category are shown by the bars, making it easy to see the differences and similarities. For this purpose, I have used a bar chart to show the connection between size groups and the total installs of Play Store apps.

##### 2. What is/are the insight(s) found from the chart?

The Below 20 group has the most installs, followed in order by the 20-40, 40-60, 60-80, and 80-100 groups.

The 80-100 group has the fewest installs of these, with less than a quarter of the installs of the Below 20 group. This suggests that users prefer smaller apps over larger ones. It could indicate that users have limited storage space on their devices, or that they are more selective about the apps they download and install.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Optimizing App Sizes: Adapting app sizes to align with user preferences, especially in the popular Below 20 and 20-40 size groups, can lead to increased downloads and positive user experiences.

Enhancing User Satisfaction: Meeting user expectations for smaller app sizes addresses potential device constraints and enhances overall user satisfaction.

Strategic Development: Focusing on developing apps within the preferred size ranges may result in a positive impact on business growth.

**Negative Growth Consideration:**

Limited Installs in 80-100 Size Group: The 80-100 size group exhibits significantly fewer installs, raising concerns about potential negative growth.

User Preference Challenges: Neglecting user preferences for smaller apps, as evident in the popular size groups, may lead to reduced downloads and user engagement.

App Availability Concerns: Insufficient installs in the 80-100 size group may indicate either a lack of user interest in larger apps or a scarcity of apps within this size range.
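
The question of whether a size group has few installs or simply few apps can be checked by looking at app counts and median installs alongside the totals. The following is a minimal sketch on a small hypothetical frame (`apps` and its values are invented for illustration, not taken from the dataset). It uses `pd.cut` with left-closed bins, so values that fall exactly on a boundary are grouped slightly differently than in `size_groups`.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned Play Store data (sizes in MB)
apps = pd.DataFrame({
    "Size": [5.0, 12.0, 18.0, 25.0, 35.0, 90.0],
    "Installs": [1_000, 50_000, 10_000, 100_000, 5_000, 1_000_000],
})

# Bin edges chosen to approximate the groups used in Chart 6
bins = [0, 20, 40, 60, 80, 100, float("inf")]
labels = ["Below 20", "20-40", "40-60", "60-80", "80-100", "Above 100"]
apps["Size Group"] = pd.cut(apps["Size"], bins=bins, labels=labels, right=False)

# Count, total, and median per group; observed=True drops empty groups
summary = (
    apps.groupby("Size Group", observed=True)["Installs"]
    .agg(app_count="count", total_installs="sum", median_installs="median")
)
print(summary)
```

If a group shows a low total but a high median, the low total is more likely to reflect a small number of apps than weak demand for apps of that size.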

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Distribution of Ratings by Update Year
# Extract the year from the 'Last Updated' column
path['Update Year'] = path['Last Updated'].dt.year

# Set up the regression plot
plt.figure(figsize=(10, 5))
sns.regplot(x='Update Year', y='Rating', data=path, scatter_kws={'alpha':0.5}, line_kws={'color':'orange'})
plt.title('Distribution of Ratings by Update Year',size=20)
plt.xlabel('Update Year',size=15)
plt.ylabel('Rating',size=15)
plt.show()


##### 1. Why did you pick the specific chart?

A regplot is a statistical visualization that reveals the relationship between two continuous variables through a scatter plot and a best-fit linear regression line. It helps identify trends, patterns, and the strength of the correlation between variables. I utilized regplot to explore the distribution of Ratings with respect to Update Years.

##### 2. What is/are the insight(s) found from the chart?

* The average rating has shown an improvement, rising from approximately 3.5 in 2010 to nearly 4.5 in 2018. This indicates a general trend of increasing satisfaction among users with the product over the years.

* The orange regression line shows that the overall trend is towards increasing ratings. This is a positive sign for the product.

* The slope of the regression line is positive, which indicates that the relationship between rating and update year is positive. This means that ratings tend to increase as the update year increases.
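
The trend read from the plot can also be expressed as numbers. The sketch below, using invented values in a hypothetical `ratings` frame, computes the mean rating per update year and fits a straight line with `np.polyfit`. It illustrates the idea and is not the notebook's actual result.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings by update year (illustrative values only)
ratings = pd.DataFrame({
    "Update Year": [2016, 2016, 2017, 2017, 2018, 2018],
    "Rating": [3.9, 4.1, 4.1, 4.3, 4.3, 4.5],
}).dropna()

# Mean rating for each update year
yearly_mean = ratings.groupby("Update Year")["Rating"].mean()

# Centre the years so the fit is numerically well behaved
years_centred = ratings["Update Year"] - ratings["Update Year"].min()

# Degree-1 polynomial fit: slope is the change in rating per year
slope, intercept = np.polyfit(years_centred, ratings["Rating"], deg=1)

print(yearly_mean)
print(f"slope per year: {slope:.3f}")
```

A slope close to zero would mean the visual trend is weak even if the line appears to tilt upward.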

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Increasing Average Rating: The rise in average ratings from approximately 3.5 in 2010 to nearly 4.5 in 2018 indicates an upward trend in user satisfaction. This is a positive signal as it suggests that users are generally more content with the product over time.

Improvement Over Time: The orange regression line has a positive slope, indicating a consistent increase in ratings. This suggests that developers may be making continuous improvements that positively influence user satisfaction.

Possible Explanations for Rating Increase:

Introduction of New Features: Developers may be adding new features, enhancing the product's functionality, and providing users with more value.

Enhancements in Reliability and Usability: The product may be improving in terms of reliability and user-friendliness, contributing to a better overall experience.

Increased Popularity: A growing user base could lead to a more positive user experience, as popularity often correlates with user satisfaction.

**Negative Growth Consideration:**

Limited Historical Context: While the increasing trend is positive, it's crucial to consider the context. Ratings might be influenced by various factors, and without a deeper understanding of the product's evolution or changes, relying solely on the increasing trend gives a limited picture.

Potential Plateau: Over time, achieving consistently higher ratings becomes challenging, and there might be a plateau effect. If the ratings reach a saturation point, further improvements might yield diminishing returns, potentially leading to stagnation.

User Base Shift: Ratings might be influenced by changes in the user base. If the user demographic shifts or if newer users have different expectations, the historical trend may not accurately reflect the current user sentiment.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Top 10 Categories by Average Revenue
# Calculate Revenue for each app
path['Revenue'] = path['Installs'] * path['Price']

# Group by Category and calculate the mean revenue, sorted in descending order
category_revenue = path.groupby('Category')['Revenue'].mean().sort_values(ascending=False)

# Select the top 10 categories
top_10_categories_revenue = category_revenue.head(10)

# Set up the figure and the title of the plot
plt.figure(figsize=(10, 5))
plt.title('Top 10 Categories by Average Revenue',size=20)

# Create a bar plot for the top 10 categories
sns.barplot(x=top_10_categories_revenue.values, y=top_10_categories_revenue.index, palette='Set3')
plt.xlabel('Average Revenue (USD)',size=15)
plt.ylabel('Category',size=15)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is a valuable visualization for comparing diverse data points, particularly when examining and contrasting data across different categories. So, I opted for a bar chart to explore the top 10 categories by average revenue.

##### 2. What is/are the insight(s) found from the chart?

The top 3 revenue-generating categories, namely Lifestyle, Finance, and Weather, indicate a willingness among users to invest in products and services associated with their personal lives and finances.

The next 3 categories in the revenue ranking (Game, Photography, and Family) are all linked to entertainment and leisure, signaling an increasing trend of expenditure on experiences meant for enjoyment and shared moments.

Contrastingly, among the ten categories shown, Sports records the lowest average revenue, which may point to lower paid-app demand in this category. The Education category follows closely with the second lowest average revenue, highlighting lower profitability.

The Medical category ranks third lowest in average revenue, indicating potential challenges in terms of convenience or security for apps within this category. Lastly, the Personalization category has the fourth-lowest average revenue, suggesting a relatively lower level of user interest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Strategic Focus on High-Revenue Categories: Businesses can leverage the knowledge about the top revenue-generating categories like Lifestyle, Finance, and Weather to strategically focus on developing and optimizing apps within these genres. This aligns with user preferences and the willingness to invest in products associated with personal lives and finances.

Entertainment and Leisure Trends: Recognizing the revenue potential in Game, Photography, and Family categories allows businesses to tap into the increasing trend of expenditure on experiences related to entertainment and leisure. Developers can create engaging and enjoyable apps within these genres to attract users and generate revenue.

**Negative Growth Consideration:**

Low Revenue in Sports and Education Categories: The Sports category recording the lowest average revenue suggests a lack of popularity among users. Similarly, the Education category having the second lowest average revenue indicates lower profitability. If businesses are heavily invested in these categories, they may face challenges in generating substantial revenue. This doesn't necessarily lead to negative growth, but it does signal areas for strategic evaluation and potential adjustments.

Challenges in Personalization and Medical Categories: The relatively low average revenue in the Personalization and Medical categories suggests challenges. In the Personalization category, there may be a lower level of user interest, while in the Medical category, potential issues related to convenience or security may be impacting revenue. Addressing these challenges through targeted improvements or considering alternative strategies is essential to mitigate any negative impact.
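
Because revenue here is estimated as installs multiplied by price, a single expensive or very popular app can dominate a category's average. A minimal sketch, on invented data in a hypothetical `apps` frame, compares the mean with the median to show how the ranking can change:

```python
import pandas as pd

# Hypothetical paid apps; one LIFESTYLE app has a very high price
apps = pd.DataFrame({
    "Category": ["LIFESTYLE", "LIFESTYLE", "LIFESTYLE", "GAME", "GAME", "GAME"],
    "Installs": [1_000, 500, 10_000, 10_000, 20_000, 30_000],
    "Price": [1.0, 2.0, 400.0, 1.0, 1.0, 1.0],
})

# Same revenue estimate as in the chart code: installs times price
apps["Revenue"] = apps["Installs"] * apps["Price"]

# Mean is pulled up by outliers; median describes the typical app
revenue_stats = apps.groupby("Category")["Revenue"].agg(["mean", "median", "max"])
print(revenue_stats.sort_values("mean", ascending=False))
```

In this toy example LIFESTYLE leads on the mean while GAME leads on the median, so a category ranking based on averages alone is worth cross-checking before acting on it.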

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Comparing the Revenue and Android Version of the Top 8 Paid Apps in the Play Store

#Exclude apps with Android version 'Varies with device'
top_8_paid_apps = path[path['Android Ver'] != 'Varies with device'].nlargest(8, 'Revenue', keep='first')

# Set up the plot
plt.figure(figsize=(10, 5))

# Plotting lollipops for revenue using Seaborn scatterplot
sns.scatterplot(x='App', y='Revenue', hue='Android Ver', data=top_8_paid_apps, palette='Dark2', s=300, zorder=2)

# Plotting bars for revenue using Seaborn barplot
sns.barplot(x='App', y='Revenue', data=top_8_paid_apps, color='darkorange', width=0.08, zorder=1)

# Customize the plot
plt.xlabel('App', size=15)
plt.xticks(rotation=45,ha='right')
plt.ylabel('Revenue (USD)', size=15)
plt.title('Top Eight Paid Apps: Revenue and Android Version', size=20)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Lollipop charts offer a visually appealing alternative to traditional bar charts. In this chart, the lollipop's colour shows the Android version, while the length represents the app's revenue. It effectively highlights the top 8 revenue-generating apps and provides insights into Android version compatibility, which made this visualization my preferred choice.

##### 2. What is/are the insight(s) found from the chart?

The apps with Android versions 4.0 and above dominate the higher revenue ranks, suggesting a correlation between app compatibility with newer Android versions and revenue generation.

Among the top 8 high revenue apps, six are designed for Android versions 4.0 and above. The exceptions are "Grand Theft Auto: San Andreas" (Android 3.0 and up) and "DraStic DS Emulator" (Android 2.3 and up), both of which are on the lower end of the revenue spectrum.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Targeted Development: Knowing that apps compatible with Android versions 4.0 and above tend to generate higher revenue can guide developers in focusing their efforts on creating and optimizing apps for these versions. This targeted development approach may result in more successful and lucrative applications.

Market Alignment: Aligning app development with the Android versions preferred by users can enhance market penetration and user adoption. This alignment may lead to increased downloads and, subsequently, higher revenue.

**Negative Growth Consideration:**

Compatibility Challenges: Apps designed for older Android versions (e.g., Android 2.3 and 3.0) are associated with lower revenue. Investing resources in developing or maintaining apps for these versions may not yield significant returns, potentially leading to negative growth.

Revenue Discrepancy: The notable revenue difference between apps for Android 4.0 and above versus older versions suggests a market preference for more recent Android iterations. Failing to adapt to this preference may result in negative growth as user demand shifts toward newer Android releases.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Sentiment counts across categories.
# Set style to 'whitegrid'
sns.set(style='whitegrid')

# Set the figure size
plt.figure(figsize=(14, 6))

# Merging the Play Store data with the User Reviews data
merged_df = pd.merge(path, user_review_path, on='App', how='inner')

# Calculating the average sentiment polarity for each app
average_sentiment_polarity = merged_df.groupby('App')['Sentiment_Polarity'].mean()

# Merging the average sentiment polarity back to the original play store dataframe
play_store_sentiment_df = path.join(average_sentiment_polarity, on='App')

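# Count the number of sentiments for each category
# fill_value=0 keeps the stacked bars intact when a category has no reviews of a given sentiment
grouped_df = merged_df.groupby(['Category', 'Sentiment']).size().unstack(fill_value=0)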

sns.barplot(data=grouped_df.reset_index(), x='Category', y='Positive', color='skyblue', label='Positive')
sns.barplot(data=grouped_df.reset_index(), x='Category', y='Negative', color='gray', bottom=grouped_df['Positive'], label='Negative')
sns.barplot(data=grouped_df.reset_index(), x='Category', y='Neutral', color='orange', bottom=grouped_df['Positive'] + grouped_df['Negative'], label='Neutral')

# Adding labels and title
plt.xlabel('Category',size=15)
plt.xticks(rotation=90)
plt.ylabel('Count of Sentiments',size=15)
plt.title('Sentiments vs Category',size=20)

# Adding legend
plt.legend(title='Sentiment')

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

Stacked bar charts are useful when we want to compare the composition of different components that contribute to a whole. Each bar is divided into sub-bars, representing levels of the second categorical variable, offering a clear depiction of contributions to the total. So, I used it to display sentiment counts across categories.

##### 2. What is/are the insight(s) found from the chart?

Top 5 categories with the highest positive sentiments:

* GAME
* FAMILY
* HEALTH_AND_FITNESS
* TRAVEL_AND_LOCAL
* DATING

Top 5 categories with the highest negative sentiments:

* GAME
* FAMILY
* TRAVEL_AND_LOCAL
* DATING
* SPORTS

The stacked bar chart reveals a complex interplay of positive and negative sentiments across different categories. While people express positive sentiments towards categories like games, family, health and fitness, travel and local, and dating, there is also a notable presence of negative sentiment associated with most of these same categories.
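
Raw counts favour categories with many reviews, which may explain why the same categories lead both lists. A sketch of a normalized view, using `pd.crosstab` on an invented `reviews` frame (the values are hypothetical), shows each sentiment as a share of the category's reviews:

```python
import pandas as pd

# Hypothetical review-level data: GAME has more reviews than SPORTS
reviews = pd.DataFrame({
    "Category": ["GAME"] * 9 + ["SPORTS"] * 4,
    "Sentiment": (
        ["Positive"] * 6 + ["Negative"] * 3
        + ["Positive"] * 1 + ["Negative"] * 2 + ["Neutral"] * 1
    ),
})

# Raw counts per category and sentiment
sentiment_counts = pd.crosstab(reviews["Category"], reviews["Sentiment"])

# Row-normalized shares: each row sums to 1
sentiment_share = pd.crosstab(
    reviews["Category"], reviews["Sentiment"], normalize="index"
)
print(sentiment_share.round(2))
```

Here GAME has more negative reviews in absolute terms, yet SPORTS has the larger negative share, which is the comparison that matters when judging relative dissatisfaction.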

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Mixed Sentiments Across Top Categories:

The presence of both positive and negative sentiments within the same categories (GAME, FAMILY, TRAVEL_AND_LOCAL, DATING) suggests a nuanced user experience. Users seem to have mixed feelings and varied interactions within these categories.

Positive Exclusivity in HEALTH_AND_FITNESS:

HEALTH_AND_FITNESS stands out as the only category among the top five for positive sentiments that does not also appear among the top five for negative sentiments, indicating a notably favorable perception among users. This category appears to provide positive experiences with relatively few negative sentiments.

**Negative Growth Consideration:**

Addressing Negative Sentiments:

Negative sentiments within the same categories suggest potential challenges or dissatisfaction among users. Ignoring or neglecting these negative sentiments may lead to negative growth, impacting user retention and brand reputation.

Mitigating Challenges in SPORTS:

Given the high negative sentiments in the SPORTS category, businesses should conduct a detailed analysis to understand and address the challenges. Proactive measures to improve user experiences in this category are essential to preventing negative growth.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Progression of update counts and the distribution of sentiment counts over time
# Group by 'Update Year' and count the rows of the merged data
# (one row per review, used here as a proxy for the number of updates)
update_counts = merged_df.groupby("Update Year")["App"].count()

# Group by 'Update Year' and 'Sentiment' and count occurrences
sentiment_counts = merged_df.groupby(['Update Year', 'Sentiment']).size().unstack()

# Plotting
plt.figure(figsize=(8, 8))

# Plotting the number of updates received
plt.subplot(2, 1, 1)
plt.plot(update_counts.index, update_counts, label='Number of Updates', marker='o', color='gray')
plt.ylabel('Number of Updates', size=15)
plt.title('Number of Updates and Sentiments over Update Years', size=15)
plt.legend()

# Plotting sentiments
plt.subplot(2, 1, 2)
plt.plot(sentiment_counts.index, sentiment_counts['Positive'], label='Positive', marker='o')
plt.plot(sentiment_counts.index, sentiment_counts['Negative'], label='Negative', marker='o')
plt.plot(sentiment_counts.index, sentiment_counts['Neutral'], label='Neutral', marker='o')
plt.xlabel('Update Year', size=15)
plt.ylabel('Number of Sentiments', size=15)
plt.legend()

# Adjust layout for better readability
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Line charts are an effective way to present data in a sequential manner, especially over time. In this case, I used line charts to illustrate the progression of update counts and the distribution of sentiment counts over time. The purpose was to investigate whether there is any correlation between these two sets of data.

##### 2. What is/are the insight(s) found from the chart?

There is a general trend of increasing positive sentiments over time. This suggests that people are generally becoming more satisfied with the updates they are receiving.

The number of updates is increasing over time. This suggests that the developers are releasing new updates more frequently.

The number of negative sentiments is relatively stable. This suggests that people are generally not very unhappy with the updates they are receiving.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Increasing Positive Sentiments Over Time: Positive sentiments trending upward over time indicates growing satisfaction among users. This is a positive signal for the business, as satisfied customers are more likely to stay engaged, recommend the product to others, and contribute positively to the brand image.

Increasing Number of Updates: The rising number of updates implies that developers are actively working on improving the product and addressing user needs. This can have a positive impact on user engagement and loyalty, as users appreciate continuous improvement and new features.

Stable Negative Sentiments: The stability of negative sentiments suggests that, in general, users are not significantly dissatisfied with the updates. While some negative sentiments may be inevitable, the fact that they are stable indicates that any issues are not escalating. This stability can be seen as a positive aspect, as it implies that negative feedback is manageable and not worsening.

**Negative Growth Consideration:**

Continuously Monitor Negative Feedback: Regularly monitoring negative sentiments and feedback can help identify specific areas for improvement. Even if negative sentiments are stable, addressing specific concerns can lead to enhanced customer satisfaction.

Engage with Users: Proactively engaging with users to understand their concerns and suggestions can provide valuable insights. Addressing user feedback and concerns demonstrates a commitment to customer satisfaction and can contribute to positive growth.

Competitor Analysis: Assessing competitor products and user feedback can provide a comparative perspective. Understanding what competitors are doing well and areas where they face challenges can help in refining the business strategy.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Relationship between Sentiment Polarity, Rating, and Installs
# Set style to 'whitegrid'
sns.set(style='whitegrid')

# Set up the plot
plt.figure(figsize=(10, 5))

# Scatter plot with size based on the number of installs
sns.scatterplot(x='Rating', y='Sentiment_Polarity', size='Installs', data=play_store_sentiment_df, sizes=(50, 300), edgecolor='white',legend=True)

# Customize the plot
plt.title('Sentiment Polarity by Rating and Installs', size=15)
plt.xlabel('Rating', size=15)
plt.ylabel('Average Sentiment Polarity', size=15)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A bubble chart visually represents data points as bubbles. It is an extension of a scatter plot that uses the size of the data points to represent a third dimension of data, making it ideal for illustrating the relationships among three variables. So, I used it to explore the connections between average sentiment polarity, app rating, and the number of installs.

##### 2. What is/are the insight(s) found from the chart?

Apps with higher ratings tend to have higher average sentiment polarity: This correlation is logical, as users are more inclined to leave positive reviews for apps they enjoy using.

Niche Apps with High Sentiment Polarity: Some apps exhibit high average sentiment polarity despite having relatively low install counts. This indicates the presence of niche apps that are deeply cherished by their user base, even though they may not enjoy widespread popularity.

Popularity vs. Sentiment Polarity Discrepancy: The large bubbles representing the most installed apps generally show lower average sentiment polarity. This implies that widespread popularity doesn't consistently align with positive user sentiment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Identifying popular apps with high sentiment: Focusing marketing efforts towards popular apps with high sentiment polarity can further boost their popularity and attract new users.

Promoting niche apps with high sentiment: Promoting niche apps with high sentiment polarity can help them reach their target audience and achieve sustainable growth.

Understanding user sentiment variations: By analyzing how sentiment polarity varies across different app features and user segments, developers can identify areas for improvement and implement changes to increase user satisfaction.

Prioritizing feedback based on popularity and sentiment: Insights from the chart can guide developers in prioritizing user feedback based on both app popularity and user sentiment, ensuring resources are allocated effectively.

**Negative Growth Consideration:**

Focusing solely on popular apps: Focusing solely on promoting popular apps might neglect niche apps with high user satisfaction, potentially missing out on valuable market segments.

Misinterpreting sentiment polarity: Taking average sentiment polarity at face value might lead to overlooking important aspects of user feedback. A deeper analysis of individual reviews is necessary to understand specific user concerns.

Ignoring install count trends: Ignoring the relationship between install count and sentiment polarity might result in neglecting potential issues with popular apps, leading to user churn and dissatisfaction.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Relationship between Sentiment Subjectivity and Sentiment Polarity
# Set the size of the figure
plt.figure(figsize=(12, 6))

# Create a scatter plot using seaborn
sns.scatterplot(data=merged_df, x='Sentiment_Subjectivity', y='Sentiment_Polarity', hue='Sentiment', palette='Set2')

# Set labels for the x and y axes
plt.xlabel('Sentiment Subjectivity', size=15)
plt.ylabel('Sentiment Polarity', size=15)

# Set the title of the plot
plt.title('Relationship between Sentiment Subjectivity and Sentiment Polarity', size=15)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Scatter plots are a powerful tool for exploratory data analysis, as they allow us to quickly identify patterns and correlations in the data. By using the hue parameter, we can color-code the data points based on a third variable, making it easier to identify patterns and correlations between the variables. So, I used a scatter plot to identify the patterns between Sentiment Subjectivity and Sentiment Polarity with the Sentiment variable represented by hue.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot of Sentiment Polarity and Sentiment Subjectivity shows a moderate positive correlation between the two variables. This means that, in general, as sentiment polarity increases, sentiment subjectivity tends to increase as well. However, the relationship is not very strong. This suggests that there is a tendency for people to express their opinions more strongly when they are feeling positive than when they are feeling negative.
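
A visual impression of "moderate" correlation can be backed by a coefficient. The sketch below computes Pearson's r with `Series.corr` on a small invented `sample` frame. It also correlates subjectivity with the absolute polarity, on the assumption (not stated in the source) that the strength of an opinion, rather than its direction, may track subjectivity more closely.

```python
import pandas as pd

# Hypothetical review scores (illustrative values only)
sample = pd.DataFrame({
    "Sentiment_Subjectivity": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    "Sentiment_Polarity": [0.0, 0.1, -0.2, 0.5, 0.3, 0.8],
})

subjectivity = sample["Sentiment_Subjectivity"]
polarity = sample["Sentiment_Polarity"]

# Pearson correlation between subjectivity and signed polarity
r_signed = subjectivity.corr(polarity)

# Pearson correlation between subjectivity and polarity magnitude
r_magnitude = subjectivity.corr(polarity.abs())

print(f"signed: {r_signed:.2f}, magnitude: {r_magnitude:.2f}")
```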

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Targeted Marketing: Understanding that positive sentiments are linked with increased subjectivity, businesses can tailor marketing messages and campaigns to resonate with the emotional and expressive aspects of their satisfied customers.

Enhanced Customer Engagement: Recognizing the correlation, businesses can focus on creating platforms for customers to share their positive experiences in a more detailed and expressive manner. This could include encouraging reviews, testimonials, or social media interactions that capture the enthusiasm of satisfied customers.

Product/Service Improvements: Analyzing the correlation might lead to insights into what aspects of products or services generate strong positive sentiments. Businesses can use this information to prioritize and refine features that contribute to customer satisfaction.

**Negative Growth Consideration:**

Overreliance on Subjective Opinions: Businesses should not solely rely on subjective opinions to drive decisions, as they may not reflect broader customer sentiment or objective performance metrics.

Misinterpretation of Feedback: Strong expressions of negative sentiment may not always be indicative of a widespread issue. Careful analysis and context-specific interpretation are crucial.

Potential for Bias: Subjective opinions may be influenced by individual biases, personal experiences, and external factors, potentially leading to inaccurate assessments of overall customer sentiment.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
#  Relationships among Rating, Reviews, Installs, Price, Sentiment_Polarity, and Sentiment_Subjectivity
# Selecting numerical columns from the merged dataframe

numerical_columns = merged_df[['Rating', 'Reviews', 'Installs', 'Price', 'Sentiment_Polarity', 'Sentiment_Subjectivity']]

# Create a correlation matrix
correlation_matrix = numerical_columns.corr()

# Create a heatmap
plt.figure(figsize=(10, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='rocket', fmt=".2f")
plt.title('Correlation Heatmap',size=20)
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is a powerful tool for identifying relationships between pairs of variables in a dataset. The color-coding allows us to see at a glance whether variables are positively correlated (they move together) or negatively correlated (they move in opposite directions). With the 'rocket' colormap used here, lighter cells correspond to higher correlation coefficients and darker cells to lower ones. Correlation coefficients range from -1 to 1. So, I used a correlation heatmap to find the relationships among Rating, Reviews, Installs, Price, Sentiment_Polarity, and Sentiment_Subjectivity.

##### 2. What is/are the insight(s) found from the chart?

Rating has a moderate positive correlation with Reviews, Installs, and Sentiment_Polarity. This means that apps with higher ratings tend to have more reviews, more installs, and more positive sentiment in their reviews. Rating has a weak negative correlation with Price. This means that apps with higher ratings tend to be slightly cheaper.

Reviews has a strong positive correlation with Installs. This means that apps with more reviews tend to have more installs. Reviews has a weak negative correlation with Price. This means that apps with more reviews tend to be slightly cheaper.

Installs has a weak negative correlation with Price. This means that apps with more installs tend to be slightly cheaper.

Sentiment_Polarity has a moderate positive correlation with Sentiment_Subjectivity. This means that apps with more positive sentiment in their reviews tend to have slightly more subjective reviews.
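
One caveat when reading these coefficients: the merged frame has one row per review, so app-level columns such as Rating and Installs are repeated for every review an app has, and apps with many reviews weigh more heavily. A sketch of an app-level alternative, on an invented `merged` frame, aggregates to one row per app before correlating:

```python
import pandas as pd

# Hypothetical merged data: app "A" has three reviews, "B" and "C" one each
merged = pd.DataFrame({
    "App": ["A", "A", "A", "B", "C"],
    "Rating": [4.5, 4.5, 4.5, 3.0, 4.0],
    "Installs": [1_000_000, 1_000_000, 1_000_000, 10_000, 50_000],
    "Sentiment_Polarity": [0.5, 0.1, 0.3, -0.2, 0.4],
})

# One row per app: keep app-level values, average the review-level polarity
app_level = merged.groupby("App").agg(
    Rating=("Rating", "first"),
    Installs=("Installs", "first"),
    Sentiment_Polarity=("Sentiment_Polarity", "mean"),
)

# Compare correlations computed per review and per app
review_level_corr = merged[["Rating", "Installs", "Sentiment_Polarity"]].corr()
app_level_corr = app_level.corr()
print(app_level_corr.round(2))
```

Which level is appropriate depends on the question: per-review correlations describe the review corpus, while per-app correlations describe the apps themselves.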

#### Chart - 15 - Pair Plot

In [None]:
# Chart - 15 - Pair Plot visualization code
# Correlations among Rating, Installs, Reviews and Price.
# Selecting numerical columns from the merged dataframe
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added
included_columns = merged_df[['Rating', 'Installs', 'Reviews', 'Price', 'Type']].copy()

# Log-transform 'Installs' and 'Reviews' (base 10 for both, so the axes are comparable)
included_columns['Installs(Log)'] = np.log10(included_columns['Installs'])
included_columns['Reviews(Log)'] = np.log10(included_columns['Reviews'])

# Selecting columns for pair plot
selected_columns = included_columns[['Rating', 'Installs(Log)', 'Reviews(Log)', 'Price', 'Type']]

# Create a pair plot
p = sns.pairplot(selected_columns, hue='Type', markers=["o", "s"], palette={"Free": "blue", "Paid": "orange"})
p.fig.suptitle("Pair Plot - Rating, Installs, Reviews, Price", x=0.5, y=1.02, fontsize=20)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Pair plots are a powerful tool for visualizing the relationships between multiple variables in a dataset. Each subplot in the grid represents a pair of variables, making it easy to identify patterns, trends, and correlations. So, I used a pair plot to explore the patterns and correlations between Rating, Installs, Reviews and Price.

##### 2. What is/are the insight(s) found from the chart?

Rating and Installs: There is a positive correlation between rating and installs, meaning that apps with higher ratings tend to have more installs. One plausible explanation is that users are more willing to install apps that other users have rated highly.

Rating and Reviews: There is also a positive correlation between rating and reviews, meaning that apps with higher ratings tend to have more reviews. This may be because users are more inclined to review apps they enjoy, although widely installed apps also simply have more users who can leave a review.

Rating and Price: There is a weak negative correlation between rating and price, meaning that apps with higher ratings tend to be slightly cheaper. The relationship is weak and should not be read as causal; the plot alone does not show why cheaper apps rate slightly higher.

Installs and Reviews: There is a strong positive correlation between installs and reviews, meaning that apps with more installs tend to have more reviews. This is expected, since a larger user base means more people who can write a review.

Installs and Price: There is a weak negative correlation between installs and price, meaning that apps with more installs tend to be slightly cheaper. A likely explanation is that free and low-priced apps have a lower barrier to installation, rather than popularity driving prices down.
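Because Installs and Reviews are heavily skewed, the visual impressions above can be cross-checked with a rank-based (Spearman) correlation, which is unaffected by the log transform used for the plot. The sketch below is illustrative only: the toy values are hypothetical, and Spearman is computed as Pearson on ranks rather than taken from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real analysis would use merged_df instead.
toy = pd.DataFrame({
    "Rating": [3.9, 4.1, 4.4, 4.0, 4.6, 4.3],
    "Installs": [1_000, 50_000, 1_000_000, 10_000, 100_000_000, 5_000_000],
    "Reviews": [12, 800, 25_000, 150, 3_000_000, 90_000],
    "Price": [2.99, 0.0, 0.0, 4.99, 0.0, 0.99],
})

# Same transform as the pair plot: log10(x + 1) on the skewed columns.
logged = toy.copy()
logged["Installs"] = np.log10(logged["Installs"] + 1)
logged["Reviews"] = np.log10(logged["Reviews"] + 1)

# Spearman correlation is Pearson correlation computed on ranks.
spearman_raw = toy.rank().corr()
spearman_log = logged.rank().corr()

# Pearson, by contrast, changes when the scale changes.
pearson_raw = toy.corr()
pearson_log = logged.corr()

print("Spearman (raw and logged values give the same matrix):")
print(spearman_raw.round(2))
print("\nPearson on raw values:")
print(pearson_raw.round(2))
print("\nPearson on log-transformed values:")
print(pearson_log.round(2))
```

If the Spearman and Pearson matrices tell different stories for a pair, the relationship is probably monotonic but not linear, and the log-scaled panels of the pair plot are the more reliable view.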

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

1) Understand your target audience: Conduct thorough market research to understand the needs, preferences, and behaviors of your target audience. Use analytics tools to gather data on user demographics, location, device types, and usage patterns.

2) Improve user experience: Focus on providing a seamless and intuitive user experience within the app. Ensure that the app is easy to navigate, visually appealing, and responsive across different devices and screen sizes.

3) Optimize app store presence: Invest in app store optimization (ASO) to improve the visibility and discoverability of your app on the Play Store. This includes optimizing keywords, titles, descriptions, and visuals to attract more downloads and higher rankings in search results.

4) Engage users with valuable content: Regularly update the app with fresh and relevant content that adds value to users. This could include new features, updates, promotions, discounts, or informative content related to your app's niche.

5) Encourage user engagement and retention: Implement features that encourage users to engage with the app frequently and stay active over time. This could include gamification elements, push notifications, personalized recommendations, loyalty programs, or social features like user communities and forums.

6) Collect and analyze user feedback: Actively solicit feedback from users and use it to continuously improve the app. Pay attention to user reviews, ratings, and comments on the Play Store, as well as feedback gathered through in-app surveys or feedback forms.

7) Monetize strategically: If your app includes monetization strategies such as in-app purchases, subscriptions, or ads, ensure that these are implemented in a user-friendly and non-intrusive manner. Balance revenue generation with providing value to users to maintain a positive user experience.

8) Stay informed about industry trends: Keep up-to-date with trends, developments, and best practices in the mobile app industry. Adapt your strategies accordingly to capitalize on emerging opportunities and stay ahead of the competition.

9) Build a strong brand presence: Establish and maintain a strong brand presence for your app across various channels, including social media, blogs, email newsletters, and other marketing channels. Consistent branding helps to build trust and loyalty among users.

10) Address negative feedback: Investigate apps with negative sentiment to pinpoint the specific issues causing dissatisfaction. Prioritize improvements in the areas highlighted by negative sentiment to enhance user satisfaction and overall app performance.
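Recommendation 10 can be made concrete by ranking apps by their share of negative reviews. The following is a minimal sketch under stated assumptions: the column names `App` and `Sentiment`, the label `"Negative"`, the helper `negative_share_by_app` and its `min_reviews` cut-off are all hypothetical and should be adapted to the actual User Reviews dataset.

```python
import pandas as pd

# Hypothetical toy reviews; column names and labels are assumptions.
reviews = pd.DataFrame({
    "App": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "Sentiment": [
        "Negative", "Negative", "Positive",
        "Positive", "Positive",
        "Negative", "Positive", "Neutral", "Positive",
    ],
})


def negative_share_by_app(df, min_reviews=2):
    """Rank apps by the fraction of their reviews labelled 'Negative'.

    Apps with fewer than `min_reviews` reviews are dropped so that a single
    bad review does not push an app to the top of the list.
    """
    summary = (
        df.assign(is_negative=df["Sentiment"].eq("Negative"))
        .groupby("App")
        .agg(
            n_reviews=("is_negative", "size"),
            negative_share=("is_negative", "mean"),
        )
    )
    summary = summary[summary["n_reviews"] >= min_reviews]
    return summary.sort_values("negative_share", ascending=False)


ranking = negative_share_by_app(reviews)
print(ranking)
```

Apps at the top of this ranking are the natural starting point for reading individual negative reviews and identifying recurring complaints.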

# **Conclusion**

This project has successfully analyzed the Play Store app dataset using Python, uncovering valuable insights into key factors for app engagement and success. The data visualizations and interpretations provide a comprehensive understanding of user sentiment, app ratings, genre preferences, content suitability, and the impact of updates.

Based on these insights, I have crafted actionable recommendations for the client to optimize app performance and achieve their business objectives. I advise them to focus on fostering positive sentiment, addressing user concerns promptly, catering to user preferences like smaller apps and frequent updates, targeting specific genres and content ratings strategically, and building loyal followings in niche markets.

By embracing a data-driven approach and continuously adapting to user preferences, the client can ensure long-term success in the competitive Android app market.



**Analysis by: Jonty Dutta**

