# **Project Name**    - Play Store App Review Analysis






##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**


This project endeavors to harness the potential of Play Store app data to offer actionable insights for app-making enterprises aiming to flourish in the Android marketplace.

Evaluating customer reviews for sentiment is particularly crucial as it offers a direct line to user experiences and feedback. Sentiment analysis of these reviews can reveal user preferences, common pain points and areas where the app excels or falls short. This information is invaluable for refining app features and ensuring that updates align with user expectations.

Furthermore, visually intuitive representations of the data, such as charts and graphs, make it easier for stakeholders to grasp complex insights and trends at a glance. This data-driven approach helps not only in retaining current users but also in attracting new ones, thus fostering business expansion.




# **GitHub Link -**

**Github Link** :




# **Problem Statement**



**Objective**: Harness Play Store app data to provide actionable insights for app-making enterprises.

#### **Define Your Business Objective?**



- Provide insights from the Play Store app data.
- Utilize Play Store app data to derive actionable insights.
- Facilitate app-making businesses' success in the Android marketplace.
- Identify key factors influencing app engagement and success.
- Optimize app features and user experiences based on data-driven insights.
- Make informed decisions to drive business growth and innovation.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following:
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Define file paths
apps_data_path = '/content/drive/My Drive/EDA/Play_Store_Data.csv'
reviews_data_path = '/content/drive/My Drive/EDA/UserReviews.csv'

# Load datasets
apps_data = pd.read_csv(apps_data_path)
reviews_data = pd.read_csv(reviews_data_path)

### Dataset First View

In [None]:
# Dataset First Look
apps_data.head()

In [None]:
apps_data.tail()

In [None]:
reviews_data.head()

In [None]:
reviews_data.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
apps_data.shape

In [None]:
reviews_data.shape

### Dataset Information

In [None]:
# Dataset Info
apps_data.info()

In [None]:
reviews_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
apps_duplicate_count = apps_data.duplicated().sum()
print("Number of duplicate values in apps_data:", apps_duplicate_count)

In [None]:
# Find duplicate count in reviews_data
reviews_duplicate_count = reviews_data.duplicated().sum()
print("Number of duplicate values in reviews_data:", reviews_duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Find missing values count in apps_data
apps_missing_count = apps_data.isnull().sum()
print("Missing values count in apps_data:")
print(apps_missing_count)

In [None]:
# Calculate percentage of null values in apps_data
apps_null_percentage = (apps_data.isnull().sum() / len(apps_data)) * 100
apps_null_percentage

In [None]:
# Find missing values count in reviews_data
reviews_missing_count = reviews_data.isnull().sum()
print("\nMissing values count in reviews_data:")
print(reviews_missing_count)

In [None]:
# Calculate percentage of null values in reviews_data
reviews_null_percentage = (reviews_data.isnull().sum() / len(reviews_data)) * 100
reviews_null_percentage

In [None]:
# Visualizing the missing values
import missingno as msno

In [None]:
# Visualize missing values in apps_data
# msno.matrix creates its own figure, so the size is passed through figsize
msno.matrix(apps_data, figsize=(10, 8))
plt.title('Missing Values in Apps Data')
plt.show()


In [None]:
# Visualize missing values in reviews_data
# msno.matrix creates its own figure, so the size is passed through figsize
msno.matrix(reviews_data, figsize=(10, 6))
plt.title('Missing Values in Reviews Data')
plt.show()


### What did you know about your dataset?


- The apps data consists of 10,841 entries and 13 columns, including
  information such as the app name, category, rating, number of reviews, size, installs, and more.

- The reviews data consists of 64,295 entries and 5 columns, including the
  app name, translated review, sentiment, sentiment polarity, and sentiment subjectivity.

- There are 483 duplicate values in the apps data and 33,616 duplicate
  values in the reviews data.

- The apps data contains missing values in the 'Rating', 'Type', 'Content
  Rating', 'Current Ver', and 'Android Ver' columns, with the highest number of missing values in the 'Rating' column (1,474).

- The reviews data contains missing values in the 'Translated_Review',
  'Sentiment', 'Sentiment_Polarity', and 'Sentiment_Subjectivity' columns, with the highest number of missing values in the 'Translated_Review' column (26,868).
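
The duplicate and missing-value figures quoted above can be gathered in one pass. The following is a minimal sketch rather than the notebook's own code: `summarize_quality` is a hypothetical helper, and the small frame only stands in for `apps_data`.

```python
import numpy as np
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages (hypothetical helper)."""
    missing = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy stand-in for apps_data; the values are illustrative only.
toy = pd.DataFrame({
    "App": ["A", "B", "B", "C"],
    "Rating": [4.1, np.nan, np.nan, 3.9],
    "Type": ["Free", "Paid", "Paid", None],
})

quality = summarize_quality(toy)
duplicate_rows = int(toy.duplicated().sum())
print(quality)
print("Duplicate rows:", duplicate_rows)
```

Calling `summarize_quality(apps_data)` and `summarize_quality(reviews_data)` would give the same kind of table for the real datasets.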

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
apps_data.columns

In [None]:
apps_data.dtypes

In [None]:
reviews_data.columns

In [None]:
reviews_data.dtypes

In [None]:
# Dataset Describe

In [None]:
apps_data.describe(include = 'all')

In [None]:
reviews_data.describe(include = 'all')

### Variables Description

For Apps_data:
- App (object) is the app's name.
- Category (object) indicates the app's category.
- Rating (float64) shows the average user rating.
- Reviews (object) lists the number of user reviews.
- Size (object) represents the app's size (e.g., '19M').
- Installs (object) indicates the number of installs (e.g., '1,000,000+').
- Type (object) denotes if the app is 'Free' or 'Paid'.
- Price (object) shows the price in USD if paid.
- Content Rating (object) specifies the age suitability.
- Genres (object) provides additional genre details.
- Last Updated (object) is the date of the latest update.
- Current Ver (object) denotes the app's current version.
- Android Ver (object) indicates the required Android version.

For Reviews_data:

The reviews_data dataset holds individual user reviews together with sentiment scores for each review.

- App (object) indicates the name of the application being reviewed.
- Translated_Review (object) contains the user review text translated into English.
- Sentiment (object) categorizes the review's overall sentiment as Positive, Negative, or Neutral.
- Sentiment_Polarity (float64) is a numerical score ranging from -1 to 1 representing the sentiment polarity, with -1 being very negative, 0 neutral and 1 very positive.
- Sentiment_Subjectivity (float64) scores from 0 to 1 indicating the review's subjectivity where 0 is very objective and 1 is very subjective.
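
The relationship between `Sentiment_Polarity` and `Sentiment` can be spot-checked. The sketch below assumes the label follows the sign of the polarity score (positive above 0, negative below 0, neutral at exactly 0). That rule is an assumption made for illustration and should be verified against the data. `label_from_polarity` is a hypothetical helper.

```python
import pandas as pd


def label_from_polarity(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a label (assumed sign-based rule)."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


# Toy stand-in for reviews_data; the values are illustrative only.
toy_reviews = pd.DataFrame({
    "Sentiment": ["Positive", "Negative", "Neutral", "Positive"],
    "Sentiment_Polarity": [0.5, -0.25, 0.0, -0.1],
})

expected = toy_reviews["Sentiment_Polarity"].apply(label_from_polarity)
mismatches = toy_reviews[toy_reviews["Sentiment"] != expected]
print(f"{len(mismatches)} row(s) disagree with the assumed rule")
```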


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in apps_data.columns:
    unique_values = apps_data[column].unique()
    print(f"{column}: {unique_values}")

In [None]:
for column in reviews_data.columns:
    unique_values = reviews_data[column].unique()
    print(f"{column}: {unique_values}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
# Dropping the missing values since they make up a significant percentage of apps_data and reviews_data

In [None]:
# Drop duplicates from apps_data
apps_data.drop_duplicates(inplace = True)
apps_data.shape

In [None]:
reviews_data.drop_duplicates(inplace = True)
reviews_data.shape

In [None]:
apps_data.isnull().sum()

In [None]:
reviews_data.isnull().sum()

In [None]:
# Drop missing values from apps_data
# .copy() avoids SettingWithCopyWarning when columns are reassigned later
apps_data_cleaned = apps_data.dropna().copy()
print("Shape of apps_data_cleaned:", apps_data_cleaned.shape)

In [None]:
# Drop missing values from reviews_data
reviews_data_cleaned = reviews_data.dropna()
print("Shape of reviews_data_cleaned:", reviews_data_cleaned.shape)

In [None]:
# Check data types of columns in apps_data_cleaned
print("Data types of columns in apps_data_cleaned:")
print(apps_data_cleaned.dtypes)

In [None]:
# Check data types of columns in reviews_data_cleaned
print("\nData types of columns in reviews_data_cleaned:")
print(reviews_data_cleaned.dtypes)

In [None]:
apps_data_cleaned.dtypes

In [None]:
apps_data_cleaned.head()

In [None]:
apps_data_cleaned['Rating'] = apps_data_cleaned['Rating'].astype(float)


In [None]:
apps_data_cleaned.dtypes

In [None]:
apps_data_cleaned['Reviews']



In [None]:
# Remove rows where 'Varies with device' is found in the 'Size' column
# .copy() avoids SettingWithCopyWarning on the column assignments that follow
apps_data_cleaned = apps_data_cleaned[apps_data_cleaned['Size'] != 'Varies with device'].copy()


In [None]:
apps_data_cleaned.shape

In [None]:
# Remove commas from the 'Size' column
apps_data_cleaned['Size'] = apps_data_cleaned['Size'].str.replace(',', '')

In [None]:

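# Remove '+' sign from the 'Size' column
# regex=False treats '+' as a literal character; as a regex it is a quantifier and raises an error
apps_data_cleaned['Size'] = apps_data_cleaned['Size'].str.replace('+', '', regex=False)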

In [None]:
# Multiply by 1000 if 'k' is found, and by 1000000 if 'M' is found
def clean_size(size):
    if size.endswith('M'):
        return float(size[:-1]) * 1000000
    elif size.endswith('k'):
        return float(size[:-1]) * 1000
    else:
        return float(size)

apps_data_cleaned['Size'] = apps_data_cleaned['Size'].apply(clean_size)
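
Dropping every 'Varies with device' row also discards apps whose other columns are still usable. As an alternative sketch (not what this notebook does), the size can be parsed to bytes while unparseable entries are kept as `NaN`. `parse_size` is a hypothetical helper, and its 1,000 and 1,000,000 multipliers mirror the `clean_size` function above.

```python
import numpy as np
import pandas as pd


def parse_size(size) -> float:
    """Convert '19M' or '201k' to bytes; return NaN for anything else."""
    if not isinstance(size, str):
        return np.nan
    text = size.strip().replace(",", "")
    multipliers = {"M": 1_000_000, "k": 1_000}
    suffix = text[-1:]
    if suffix in multipliers:
        try:
            return float(text[:-1]) * multipliers[suffix]
        except ValueError:
            return np.nan
    try:
        return float(text)
    except ValueError:
        return np.nan


# Illustrative values only, including two that cannot be parsed.
sizes = pd.Series(["19M", "201k", "Varies with device", "1,000+"])
parsed = sizes.apply(parse_size)
print(parsed.tolist())
```

Keeping `NaN` values lets plots that do not involve `Size` use the full set of apps, at the cost of handling missing sizes explicitly later.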


In [None]:
apps_data_cleaned.head()

In [None]:
apps_data_cleaned.dtypes

In [None]:

# Remove rows where 'Free' is found in the 'Installs' column
# .copy() avoids SettingWithCopyWarning on the assignments below
apps_data_cleaned = apps_data_cleaned[apps_data_cleaned['Installs'] != 'Free'].copy()
# Remove commas from the 'Installs' column
apps_data_cleaned['Installs'] = apps_data_cleaned['Installs'].str.replace(',', '', regex=False)

# Remove '+' sign from the 'Installs' column (regex=False: '+' is a literal, not a quantifier)
apps_data_cleaned['Installs'] = apps_data_cleaned['Installs'].str.replace('+', '', regex=False)

In [None]:
apps_data_cleaned.shape

In [None]:
apps_data_cleaned.dtypes

In [None]:
# Convert the 'Installs' column to float
apps_data_cleaned['Installs'] = apps_data_cleaned['Installs'].astype(float)
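
A single malformed value makes `astype(float)` raise. A more forgiving sketch uses `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into `NaN` so they can be inspected before being dropped. The sample values are illustrative.

```python
import pandas as pd

installs_raw = pd.Series(["1,000,000+", "500+", "Free", "10,000+"])

# regex=False treats ',' and '+' as literal characters rather than patterns.
installs_clean = (
    installs_raw
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
)
installs = pd.to_numeric(installs_clean, errors="coerce")

print(installs.tolist())
print("Unparseable rows:", installs_raw[installs.isna()].tolist())
```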

In [None]:
apps_data_cleaned.head()

In [None]:
# Remove rows where '0' and blanks are present in the 'Type' column
# .copy() avoids SettingWithCopyWarning on later column assignments
apps_data_cleaned = apps_data_cleaned[(apps_data_cleaned['Type'] != '0') & (apps_data_cleaned['Type'] != '')].copy()


In [None]:
apps_data_cleaned.shape

In [None]:
apps_data_cleaned.head()

In [None]:
apps_data_cleaned.dtypes

In [None]:
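# Remove dollar sign ('$') from the 'Price' column
# regex=False is needed: as a regex, '$' matches the end of the string and removes nothing
apps_data_cleaned['Price'] = apps_data_cleaned['Price'].str.replace('$', '', regex=False)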

# Convert the 'Price' column to float
apps_data_cleaned['Price'] = apps_data_cleaned['Price'].astype(float)

In [None]:
apps_data_cleaned.dtypes

In [None]:
# Remove rows with blanks in the 'Content Rating' column
# .copy() avoids SettingWithCopyWarning on later column assignments
apps_data_cleaned = apps_data_cleaned[apps_data_cleaned['Content Rating'] != ''].copy()


In [None]:
apps_data_cleaned[['Last Updated']]

In [None]:
# Convert 'Last Updated' column to datetime format
apps_data_cleaned['Last Updated'] = pd.to_datetime(apps_data_cleaned['Last Updated'])
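
When the date strings share one layout, passing an explicit `format` makes parsing stricter and easier to audit. The sketch assumes values look like `January 7, 2018`; that layout is an assumption and should be checked against `apps_data_cleaned['Last Updated']` before relying on it.

```python
import pandas as pd

# Illustrative values; the last one is deliberately not a date.
dates_raw = pd.Series(["January 7, 2018", "August 1, 2018", "1.0.19"])

# errors='coerce' turns values that do not match the format into NaT.
last_updated = pd.to_datetime(dates_raw, format="%B %d, %Y", errors="coerce")

print(last_updated)
print("Unparseable rows:", int(last_updated.isna().sum()))
```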

In [None]:
apps_data_cleaned.head()

In [None]:
apps_data_cleaned.tail()

In [None]:
reviews_data_cleaned.head()

In [None]:
reviews_data_cleaned.dtypes

In [None]:
reviews_data_cleaned.isnull().sum()

In [None]:
reviews_data_cleaned.head()

In [None]:
reviews_data_cleaned.to_excel("/content/drive/My Drive/EDA/b.xlsx")

In [None]:
reviews_data_cleaned

In [None]:
# Check Unique Values for each variable.
for column in reviews_data_cleaned.columns:
    unique_values = reviews_data_cleaned[column].unique()
    print(f"{column}: {unique_values}")

In [None]:
reviews_data_cleaned

### What all manipulations have you done and insights you found?

- I dropped the rows with missing values from apps_data and reviews_data, since the number of missing values is very large.
- I cleaned the columns of both datasets and converted each column to an appropriate data type.

In [None]:
reviews_data_cleaned.dtypes

In [None]:
for column in reviews_data_cleaned.columns:
    no_of_unique_values = reviews_data_cleaned[column].nunique()
    print(f"{column}: {no_of_unique_values}")

In [None]:
for column in apps_data_cleaned.columns:
    no_of_unique_values1 = apps_data_cleaned[column].nunique()
    print(f"{column}: {no_of_unique_values1}")

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Merge the cleaned apps and reviews data on the shared 'App' column
merged_df = pd.merge(apps_data_cleaned, reviews_data_cleaned, on='App', how='inner')
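
An inner merge keeps only the apps that appear in both tables and repeats each app row once per matching review. A minimal sketch on toy data, using the `validate` argument of `pd.merge` to assert that app names are unique on the left side:

```python
import pandas as pd

# Toy stand-ins for the cleaned apps and reviews tables.
apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma"],
    "Category": ["GAME", "TOOLS", "GAME"],
})
reviews = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta", "Delta"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral"],
})

# validate raises MergeError if 'App' is duplicated in the left frame.
merged = pd.merge(apps, reviews, on="App", how="inner", validate="one_to_many")

print(merged)
print("Rows:", len(merged), "| distinct apps:", merged["App"].nunique())
```

If the real apps table still contains several rows per app name, the check fails, which signals that the table needs deduplicating on `App` first.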


In [None]:
apps_data_cleaned.shape, reviews_data_cleaned.shape

In [None]:
merged_df

In [None]:
merged_df.dtypes

In [None]:
# Chart - 1 visualization code

# Count merged rows per category (each app row is repeated once per matching review)
category_counts = merged_df['Category'].value_counts()

# Plotting the bar graph
plt.figure(figsize=(10, 6))
category_counts.plot(kind='bar', color='skyblue')
plt.title('Number of Apps in Each Category')
plt.xlabel('Category')
plt.ylabel('Number of Apps')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I picked a bar chart because it is well suited to comparing counts across categorical data.

##### 2. What is/are the insight(s) found from the chart?

Here, I have plotted the number of apps in each category. The 'GAME' category has the most apps and the 'COMICS' category has the fewest.
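
Note that `value_counts()` on the merged table counts rows, and an app with many reviews contributes many rows. To count distinct apps per category, one option is `groupby` with `nunique`. A sketch on toy data showing how the two counts can differ:

```python
import pandas as pd

# Toy merged table: 'Alpha' has three reviews, the others have one each.
merged = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Alpha", "Beta", "Gamma"],
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
})

rows_per_category = merged["Category"].value_counts()
apps_per_category = (
    merged.groupby("Category")["App"].nunique().sort_values(ascending=False)
)

print(rows_per_category.to_dict())
print(apps_per_category.to_dict())
```

In this toy case GAME leads by row count while TOOLS leads by distinct apps, so it is worth checking both before drawing conclusions about category saturation.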

##### 3. Will the gained insights help creating a positive business impact?


- The "GAME" category is highly saturated indicating strong competition. If you're planning to launch a game app, it will be crucial to offer unique features or a highly engaging user experience to stand out.

- Categories with fewer apps such as "COMICS" or "WEATHER" may present opportunities to capture market share with less competition.

- Investing in categories like "HEALTH_AND_FITNESS" or "FINANCE" which have substantial numbers of apps but not as high as "GAME" or "FAMILY" can be lucrative. These areas might still have robust user interest but with less intense competition.

- Understanding where the majority of apps are concentrated helps in making informed decisions about where to allocate development and marketing resources. For instance, in highly saturated categories more marketing budget might be needed to gain visibility.

- Different categories might require different engagement strategies. For example, gaming apps often rely on frequent updates and community building while health and fitness apps might benefit from partnerships with fitness influencers and integrating with wearable tech.

- Observing the distribution of apps across categories can also help in identifying emerging trends. If a traditionally less populated category starts seeing an increase in app numbers, it might indicate a growing interest or a new trend.

- Companies can use this data to strategically plan their app development pipeline. For instance, if entering a saturated market, a differentiated product is necessary. Conversely, entering a less crowded market can be an opportunity to establish dominance.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Histograms
numeric_vars = ['Rating', 'Price', 'Size', 'Installs']
merged_df[numeric_vars].hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A histogram shows the frequency of numerical data using rectangular bins, so I picked it to show the distribution of the numeric columns.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
merged_df.dtypes

In [None]:
import seaborn as sns

In [None]:
# Increase figure size for better visibility
plt.rcParams["figure.figsize"] = (14, 10)

# Barplots
categorical_vars = ['Category', 'Type', 'Content Rating', 'Genres']
for var in categorical_vars:
    sns.countplot(data=merged_df, y=var)
    plt.show()

##### 1. Why did you pick the specific chart?

Here, I have picked the count plot to show the counts of categorical variables.

##### 2. What is/are the insight(s) found from the chart?

- The category with the maximum count is GAME and the category with the minimum count is COMICS. The dominance of the game category over the comics category within the app market has significant business implications affecting revenue generation, market competition, user engagement, monetization strategies, demographic appeal, content diversity and investment opportunities. While the game category benefits from a larger user base and more diverse monetization options, the comics category may need to focus on niche audiences and alternative revenue streams to enhance its competitiveness and profitability within the market.
- There are more free apps than paid apps. Free apps tend to attract a larger user base compared to paid apps. This broader user base can potentially translate into more opportunities for revenue generation through other means such as in-app advertisements, in-app purchases or premium subscriptions.
- The distribution of apps across content ratings shows that the majority are rated "Everyone," followed by "Teen," "Mature 17+," "Everyone 10+," and "Adults only 18+." This indicates that developers predominantly target a broad audience, ensuring accessibility to all age groups, while also catering to specific segments like teenagers and adults with more mature content.



#### Chart - 4

In [None]:
# Chart - 4 visualization code
for column in merged_df.columns:
    # Check if the column contains numeric data
    if merged_df[column].dtype in ['int64', 'float64']:
        # Create boxplot using Seaborn
        plt.figure(figsize=(8, 6))
        sns.boxplot(data=merged_df[column])
        plt.title(f'Boxplot of {column}')
        plt.ylabel(column)
        plt.show()

##### 1. Why did you pick the specific chart?

I have picked box plot to get an idea about outliers present in numeric columns.

##### 2. What is/are the insight(s) found from the chart?

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By identifying and analyzing outliers through box plots, businesses can gain valuable insights into user experiences, address critical issues, and make data-driven decisions that enhance overall app performance and user satisfaction.
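
A box plot with default settings marks points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically to list the outlying values. `iqr_outlier_mask` is a hypothetical helper and the ratings are illustrative.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Illustrative ratings with one clearly low value.
ratings = pd.Series([4.1, 4.3, 4.4, 4.5, 4.2, 1.0])
mask = iqr_outlier_mask(ratings)
print(ratings[mask].tolist())
```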

#### Chart - 5

In [None]:
# Pie charts
for var in ['Content Rating', 'Type']:
    plt.figure(figsize=(6, 6))
    merged_df[var].value_counts().plot.pie(autopct='%1.1f%%')
    plt.axis('equal')
    plt.title(f'Distribution of {var}')
    plt.show()


##### 1. Why did you pick the specific chart?

I have picked pie chart to visualize the distribution of content rating and type of subscriptions available for different apps.

##### 2. What is/are the insight(s) found from the chart?

From the Content Rating pie chart, the key findings are:

- Everyone: 78.9% - This category is the most prevalent indicating that the majority of content is suitable for all audiences.

- Teen: 10.9% - A smaller portion of content is intended for teenage users.

- Mature 17+: 5.5% - Content in this category is suitable for individuals aged 17 and above.

- Everyone 10+: 4.6% - This category covers content appropriate for users aged 10 and above.

- Adults only 18+: 0.1% - A very small percentage of content is restricted to adult users only.

This distribution can help in understanding the target audience for content in an app store, with the majority being accessible to everyone.

Another Pie Chart for distribution of app subscription is showing these key findings:

- Free: 98.6% - The vast majority of apps are free, indicating that most users can download and use them without any cost.

- Paid: 1.4% - A small fraction of apps require payment, reflecting a limited market for paid applications.

This distribution suggests that free apps dominate the market significantly.
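
The percentages printed on the pie slices can also be computed directly with `value_counts(normalize=True)`. The toy values below are illustrative and are not the dataset's figures.

```python
import pandas as pd

# Illustrative 'Type' column: three free apps and one paid app.
types = pd.Series(["Free", "Free", "Free", "Paid"])

share_pct = (types.value_counts(normalize=True) * 100).round(1)
print(share_pct.to_dict())
```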




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 6))
sns.violinplot(data=merged_df, x='Category', y='Rating')
plt.xticks(rotation=90)
plt.show()


##### 2. What is/are the insight(s) found from the chart?


This plot displays the distribution of app ratings across different categories in the app store:

- Y-axis (Rating): Shows the range of ratings from about 2.5 to 5.
- X-axis (Category): Represents various app categories like Art & Design, Family, Auto & Vehicles, etc.
- Shape of each plot: Indicates the distribution density of ratings within that category. Wider sections imply a higher concentration of ratings at that level.

Key Findings:

- Most categories have a high concentration of ratings around 4.0 to 4.5.
- Some categories, like Family and Auto & Vehicles, show a wider spread of ratings.
- Categories such as Books and Reference, and Events have more tightly clustered ratings, suggesting consistency in user feedback.

This visualization helps identify how user satisfaction varies across different app categories.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 6))
sns.barplot(data=merged_df, x='Category', y='Rating', errorbar='sd')  # sd for standard deviation
plt.xticks(rotation=90)
plt.show()


##### 2. What is/are the insight(s) found from the chart?

The bar chart presents the average ratings of various app categories on a scale from 0 to 5. The x-axis lists categories like ART_AND_DESIGN, FAMILY, AUTO_AND_VEHICLES, and so on, while the y-axis represents the average rating. Most categories have average ratings between 4.0 and 4.5, indicating generally high user satisfaction. The EDUCATION category stands out with the highest average rating, nearing 5, whereas categories like DATING and COMICS have slightly lower averages, closer to 4. The error bars, which show the variability of ratings within each category, are relatively short across all categories, suggesting consistent ratings. Overall, the chart indicates that app ratings are generally positive and consistent across different categories, with EDUCATION apps being particularly well-rated.
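
Values read off a bar chart are approximate, so it can help to tabulate the per-category mean and standard deviation that the bars and error bars summarize. A sketch with illustrative values:

```python
import pandas as pd

# Toy data; the ratings are illustrative only.
toy = pd.DataFrame({
    "Category": ["EDUCATION", "EDUCATION", "DATING", "DATING"],
    "Rating": [4.6, 4.8, 3.8, 4.2],
})

rating_stats = (
    toy.groupby("Category")["Rating"]
    .agg(["mean", "std", "count"])
    .sort_values("mean", ascending=False)
)
print(rating_stats.round(2))
```

Including `count` shows how many rows stand behind each average, which matters when comparing small and large categories.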

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(data=merged_df, x='Category', y='Rating', hue='Type')
plt.xticks(rotation=90)
plt.show()


##### 2. What is/are the insight(s) found from the chart?

Key Observations:

ART_AND_DESIGN: Both free and paid apps have high median ratings, with paid apps showing a slightly higher median and a narrower IQR.

FAMILY: Free apps have a wider range of ratings and more outliers, while paid apps have a higher median rating.

BEAUTY: Free apps have a higher median rating than paid apps, with a wider spread.

BOOKS_AND_REFERENCE: Paid apps have a higher median rating compared to free apps.

BUSINESS: Free apps have a relatively consistent rating with a few outliers, while paid apps have a slightly higher median rating.

COMICS: Ratings for free apps are lower and more variable compared to paid apps.

EDUCATION: Both free and paid apps have high median ratings, with paid apps having a higher median.

MEDICAL: Paid apps have a higher median rating than free apps, with less variability.

SPORTS: Free apps have a wider range of ratings, while paid apps have higher and more consistent ratings.

TOOLS: Paid apps have a higher median rating and less variability compared to free apps.

Overall, paid apps generally have higher median ratings and less variability compared to free apps across many categories. This suggests that users tend to rate paid apps more favorably, potentially reflecting higher quality or better user experience.
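
The free versus paid comparison can be tabulated with a pivot table of median ratings. Since paid apps make up a small share of the rows (1.4% in the Type pie chart), the group sizes behind each median are worth checking as well. The sketch uses toy data, and `Paid_minus_Free` is a column name introduced here for illustration.

```python
import pandas as pd

# Toy data; the ratings are illustrative only.
toy = pd.DataFrame({
    "Category": ["TOOLS", "TOOLS", "TOOLS", "TOOLS", "MEDICAL", "MEDICAL"],
    "Type": ["Free", "Free", "Paid", "Paid", "Free", "Paid"],
    "Rating": [4.0, 4.2, 4.5, 4.7, 3.9, 4.4],
})

median_by_type = toy.pivot_table(
    index="Category", columns="Type", values="Rating", aggfunc="median"
)
# Positive values mean paid apps have the higher median in that category.
median_by_type["Paid_minus_Free"] = median_by_type["Paid"] - median_by_type["Free"]
print(median_by_type.round(2))
```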








#### Chart - 10

In [None]:
# Chart - 10 visualization code
g = sns.FacetGrid(merged_df, col="Type", row="Category", margin_titles=True)
g.map(sns.scatterplot, "Rating", "Installs")
plt.show()


##### 2. What is/are the insight(s) found from the chart?

From these plots, it is evident that certain categories exhibit a positive correlation between the number of installs and the average rating, suggesting that more popular apps tend to have higher ratings. For instance, categories like "GAME," "TOOLS," and "COMMUNICATION" show a trend where higher installs correlate with better ratings. However, in some categories, such as "DATING" and "COMICS," the correlation is less pronounced or even negative, indicating that higher installs do not necessarily equate to higher ratings.

Overall, while there is a general trend of more installs correlating with higher ratings in many categories, this is not a universal rule. Each category shows unique patterns, suggesting that factors influencing app ratings can vary significantly depending on the app type.
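
Visual impressions of correlation from a grid of scatter plots can be checked numerically. Because install counts are bucketed and span several orders of magnitude, a rank (Spearman) correlation is one reasonable choice. The sketch computes it as the Pearson correlation of ranks, on toy data; `rank_correlation` is a hypothetical helper.

```python
import pandas as pd


def rank_correlation(group: pd.DataFrame) -> float:
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return group["Rating"].rank().corr(group["Installs"].rank())


# Toy data: ratings rise with installs in GAME and fall in DATING.
toy = pd.DataFrame({
    "Category": ["GAME"] * 4 + ["DATING"] * 4,
    "Rating": [3.5, 4.0, 4.3, 4.6, 3.2, 3.8, 4.1, 4.4],
    "Installs": [1e3, 1e4, 1e5, 1e6, 1e6, 1e5, 1e4, 1e3],
})

per_category = {
    name: rank_correlation(group) for name, group in toy.groupby("Category")
}
print(per_category)
```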

#### Chart - 13

In [None]:
# Chart - 13 visualization code
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(merged_df['Translated_Review'].dropna()))
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Select only integer and float columns
numeric_columns = merged_df.select_dtypes(include=['int64', 'float64'])

# Correlation Heatmap visualization code
plt.figure(figsize=(10, 8))
sns.heatmap(numeric_columns.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.show()


##### 2. What is/are the insight(s) found from the chart?

This heatmap shows the correlation matrix for variables related to mobile applications. Rating has a positive correlation with Size, Sentiment_Polarity, and Sentiment_Subjectivity, and a weak negative correlation with Price. Size is positively correlated with Installs but negatively with Sentiment_Polarity. Installs has a positive correlation with Size but shows negligible correlations with other variables. Price has very weak correlations with all other variables. Sentiment_Polarity is negatively correlated with Size but positively correlated with Sentiment_Subjectivity. Lastly, Sentiment_Subjectivity has a slight positive correlation with Rating and Sentiment_Polarity.
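
Reading pairwise relationships off a heatmap is easier with the pairs ranked by absolute correlation. The sketch below keeps the upper triangle of the matrix so each pair appears once; the data is illustrative.

```python
import numpy as np
import pandas as pd

# Toy numeric columns; the values are illustrative only.
toy = pd.DataFrame({
    "Rating": [4.0, 4.2, 4.5, 3.8, 4.7],
    "Size": [10.0, 20.0, 35.0, 5.0, 50.0],
    "Price": [0.0, 0.0, 1.0, 0.0, 0.0],
})

corr = toy.corr()
# Keep the upper triangle only: drops the diagonal and mirrored duplicates.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs.round(2))
```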






#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns

# Select only integer and float columns
numeric_columns = merged_df.select_dtypes(include=['int64', 'float64'])

# Create pair plot
sns.pairplot(numeric_columns)
plt.show()


##### 2. What is/are the insight(s) found from the chart?

This pair plot displays the relationships between various variables related to mobile applications. The diagonal plots show the distribution of each variable, with Ratings skewed towards higher values and Prices mostly clustered around zero. Size shows a positive relationship with Installs but a negative one with Sentiment_Polarity, while Sentiment_Subjectivity is widely spread. Ratings have a slight positive relationship with Size and Sentiment_Subjectivity but show no clear pattern with Installs and Price. Installs are concentrated at lower values and have a positive relationship with Size but no clear relationship with other variables. Sentiment_Polarity has a negative relationship with Size and a slight positive relationship with Sentiment_Subjectivity. Overall, the pair plot highlights how these variables interact, with some showing clear relationships while others exhibit more scattered distributions.








### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***