<a href="https://colab.research.google.com/github/Rudreshmishraa/Exploratory_data_analysis/blob/main/M2_PROJECT_SUBMISSION_RUDRESH_MISHRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

In the competitive world of the Google Play Store, developers face intense global competition with thousands of new apps launching daily. The revenue model is complex, especially where free apps dominate, making it difficult to gauge the impact of in-app purchases, ads, and subscriptions on success. Consequently, an app's success is often measured by installs and user reviews, with revenue being secondary. User ratings, while important, can be skewed by insufficient votes and discrepancies between ratings and reviews.

In this environment, understanding the factors that contribute to an app's success is crucial. Developers need to focus on both user acquisition and retention through quality updates, engaging content, and responsive support. This holistic approach helps attract and maintain a loyal user base, leading to long-term success.

This research conducts an extensive analysis of Play Store app data using Python, aiming to identify key factors that drive app engagement and success. By utilizing Python libraries such as NumPy, Pandas, Seaborn, and Matplotlib for data manipulation and visualization, the goal is to derive actionable insights to enhance app performance in the Android market.

The project centers on two datasets—the Play Store Dataset and the User Reviews Dataset—that contain a variety of app information. The Play Store dataset, with its 10,841 rows and 13 columns, faces a significant issue with 483 duplicated rows. This redundancy compromises the dataset's integrity and skews analytical results, leading to potentially misleading conclusions.

Furthermore, the presence of missing values in essential columns complicates the dataset. The 'Rating' column, critical for app quality assessment, has 13.60% null values, causing ambiguity and impeding accurate evaluations. The 'Type,' 'Content Rating,' 'Current Ver,' and 'Android Ver' columns also suffer from missing values, requiring careful attention to maintain a reliable dataset.

The User Reviews dataset, containing 64,295 rows and 5 columns, also presents challenges, particularly with 33,616 duplicated rows. These duplicates create redundancy and can exaggerate the impact of certain reviews, affecting the dataset's representativeness.

Data cleaning becomes crucial when dealing with the high percentage of missing values in the 'Translated_Review,' 'Sentiment,' 'Sentiment_Polarity,' and 'Sentiment_Subjectivity' columns. With about 41.78% null values, a significant portion of the dataset is incomplete, necessitating strategic handling to ensure coherent and informative analysis. This process includes identifying non-numeric reviews, converting data types, handling missing values, and eliminating duplicates. Normalization, scaling, and outlier treatment further refine the datasets, setting the stage for meaningful insights.

The Exploratory Data Analysis (EDA) phase reveals critical insights, going beyond basic statistical analysis. Visual tools like histograms, pie charts, bar charts, and reg plots offer a detailed understanding of user sentiments, app ratings, genre preferences, and the impact of updates over time. This thorough exploration lays the foundation for strategic decision-making in app development and optimization.

The project concludes with strategic recommendations for app developers, emphasizing areas such as genre development, prioritizing free apps, optimizing app size, tailoring content ratings, generating revenue strategically, ensuring compatibility with the latest Android versions, fostering user engagement in popular categories, continuous improvement, and prioritizing positive user sentiment. Addressing negative feedback is also highlighted as a key aspect of app optimization.

Beyond the comprehensive analysis, the study underscores the dynamic nature of user sentiment and the evolving landscape of app preferences. The ever-changing dynamics of the Play Store highlight the importance of staying agile and adapting strategies in real-time. Regularly monitoring emerging trends, competitor activities, and user feedback should be integrated into the app development and optimization processes.

In summary, the project's deep insights, combined with an adaptive strategy and user-focused approach, position the client to not only optimize their current app portfolio but also to navigate future challenges and seize emerging opportunities in the competitive Android app market. This holistic approach ensures a resilient and sustainable presence, fostering long-term success.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps. Explore and analyze the data to discover key factors responsible for app engagement and success.


#### **Define Your Business Objective?**

Advanced data analysis can reveal trends and patterns, such as the impact of app size on user ratings, and help prioritize updates based on user feedback. Sentiment analysis of reviews provides insights into user emotions, guiding the creation of more engaging interfaces.

Developers must also navigate varying user expectations across regions and demographics. The prevalence of free apps further complicates monetization, requiring a balance between user acquisition and revenue generation. Additionally, app store algorithms favor apps with high engagement metrics, pushing developers to focus on continuous improvement and user satisfaction.

Ultimately, leveraging these insights enables developers to refine their strategies, enhance user satisfaction, and achieve sustained success in the competitive Android market. This data-driven approach ensures developers stay ahead by making informed decisions and continuously innovating.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below and answers the questions listed after it.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import all the necessary libraries and dependencies for data analysis and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
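import os
import warnings

# Suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# google.colab exists only inside Colab; skip it elsewhere so the notebook still runs end to end
try:
    from google.colab import files
except ImportError:
    files = None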

### Dataset Loading

In [None]:
# Load the Play Store app metadata
df_apps = pd.read_csv('Play Store Data.csv')

# Load the user reviews with their sentiment scores
df_reviews = pd.read_csv('User Reviews.csv')

### Dataset First View

In [None]:
df_apps.head()  #Overview of the first few rows to understand the data structure and initial values.

In [None]:
df_reviews.head()  #Overview of the first few rows to understand the data structure and initial values.

In [None]:
df_apps.tail()  #Examination of the last few rows to check for end-of-file anomalies and overall consistency.

In [None]:
df_reviews.tail()  #Examination of the last few rows to check for end-of-file anomalies and overall consistency.

### Dataset Rows & Columns count

* **Data Summary and Key Statistics**

In [None]:
# Print the number of rows and columns in the Play Store Data DataFrame
print('Number of Rows in Play Store Data:', df_apps.shape[0])  # Number of rows (entries) in the Play Store Data DataFrame
print('Number of Columns in Play Store Data:', df_apps.shape[1])  # Number of columns (features) in the Play Store Data DataFrame

# Print the number of rows and columns in the User Reviews Data DataFrame
print('Number of Rows in User Reviews Data:', df_reviews.shape[0])  # Number of rows (entries) in the User Reviews Data DataFrame
print('Number of Columns in User Reviews Data:', df_reviews.shape[1])  # Number of columns (features) in the User Reviews Data DataFrame


### Dataset Information

In [None]:
# Print information about the Play Store Data DataFrame
print('Playstore Data Information:')
df_apps.info()  # Displays a concise summary of the DataFrame, including the number of non-null entries and data types for each column

print('\n')  # Print a newline for better readability between outputs

# Print information about the User Reviews Data DataFrame
print('User Reviews Data Information:')
df_reviews.info()  # Displays a concise summary of the DataFrame, including the number of non-null entries and data types for each column


#### Missing Values/Null Values

In [None]:
def null_value_percentage(data_df):
    null_values = pd.DataFrame(index=data_df.columns)  # Create a new DataFrame to store missing value statistics

    null_values['datatype'] = data_df.dtypes  # Add the data type of each column

    null_values['Non_Null_Values'] = data_df.count()  # Count the number of non-null values in each column

    null_values['Null_Values'] = data_df.isnull().sum()  # Count the number of null values in each column

    null_values['Null_Value_Percentage'] = round(data_df.isnull().sum() / len(data_df) * 100, 2)  # Calculate the percentage of null values in each column

    return null_values

# Print the percentage of missing values for each column in the Play Store dataset
print('Percentage of Missing Values In Play Store Dataset:', null_value_percentage(df_apps), sep='\n')

print('\n')   # Print a blank line to separate the outputs for better readability

# Print the percentage of missing values for each column in the User Reviews dataset
print('Percentage of Missing Values In User Reviews Dataset:', null_value_percentage(df_reviews), sep='\n')


In [None]:
# Visualizing the missing values
# Checking Null Values by plotting Heatmap for Play Store Data
sns.heatmap(df_apps.isnull(), cbar=False, cmap='viridis')   # Create a heatmap of null values, disabling the color bar and setting the colormap to 'viridis'
plt.xlabel('Columns')    # Label the x-axis as 'Columns'
plt.ylabel('Rows')     # Label the y-axis as 'Rows'
plt.title('Null Values Heatmap for Play Store Data')    # Add a title to the heatmap
plt.show()  # Display the plot



Summary For Null Values In Play Store Data:
* The Rating column has a significant number of missing values (13.60%).
* The Type, Content Rating, Current Ver, and Android Ver columns each have only a handful of missing values (under 0.1%), too few to stand out in the heatmap.
* The remaining columns (App, Category, Reviews, Size, Installs, Price, Genres, Last Updated) have no missing values and appear as a single solid color.
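
A heatmap drawn over roughly ten thousand rows can hide columns that have only a few missing entries. As a minimal sketch, sorting the per-column null counts makes such columns visible. The `toy` frame and its values are made up for illustration and are not the Play Store data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_apps; the values are made up.
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d", "e"],
    "Rating": [4.1, np.nan, np.nan, 3.9, 4.5],
    "Type": ["Free", "Free", None, "Paid", "Free"],
})

# Count nulls per column and keep only the columns that have any.
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

# Express the same counts as a percentage of all rows.
null_pct = (null_counts / len(toy) * 100).round(2)

print(pd.DataFrame({"nulls": null_counts, "percent": null_pct}))
```

Applied to `df_apps`, the same three lines would list every column with at least one missing value, however small the share.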

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap for User Reviews Data
# Customize the heatmap
sns.heatmap(df_reviews.isnull(), cbar=False, cmap='viridis')  # Create heatmap showing missing values with 'viridis' color map
plt.title('Missing Value Heatmap for User Reviews Data')  # Set the title of the heatmap
plt.xlabel('Columns')  # Label for the x-axis
plt.ylabel('Rows')  # Label for the y-axis
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()  # Display the heatmap

Summary Of Null Values In User Review Data:
* The columns Translated_Review, Sentiment, Sentiment_Polarity, and Sentiment_Subjectivity have a significant number of missing values.
* The App column does not have any missing values.

# What did you know about your dataset?

### Play Store Dataset Summary:
- **Focus:** Mobile application industry, specifically Android platform.
- **Dimensions:** 10,841 rows, 13 columns.
- **Missing Values:**
  - `Rating`: 13.60% null values.
  - `Type`: 0.01% null values.
  - `Content Rating`: 0.01% null values.
  - `Current Ver`: 0.07% null values.
  - `Android Ver`: 0.03% null values.
- **Primary Aim:** Identify key factors that contribute to app engagement and success on the Play Store.

### User Reviews Dataset Summary:
- **Focus:** User reviews of mobile applications on the Android platform.
- **Dimensions:** 64,295 rows, 5 columns.
- **Missing Values:**
  - `Translated_Review`: 41.79% null values.
  - `Sentiment`: 41.78% null values.
  - `Sentiment_Polarity`: 41.78% null values.
  - `Sentiment_Subjectivity`: 41.78% null values.
- **Primary Aim:** Analyze user feedback to understand sentiment and its impact on app success.

### Key Insights:
- Both datasets provide comprehensive coverage of the Android mobile app market.
- The Play Store dataset offers a detailed view of app characteristics and their metadata.
- The User Reviews dataset provides valuable insights into user opinions and sentiment, which can be correlated with app features from the Play Store dataset.
- A significant portion of the data, especially in the User Reviews dataset, has missing values, which will require cleaning and preprocessing.
- By analyzing these datasets, we aim to uncover patterns and trends that can inform strategies for improving app performance and user satisfaction.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

# Print the column names in the Play Store dataset
print('Play Store Dataset Columns:', df_apps.columns, sep='\n', end='\n\n')

# Print the column names in the User Review dataset
print('User Review Dataset Columns:', df_reviews.columns, sep='\n')


In [None]:
# dataset description
# Generate descriptive statistics for all columns in the Play Store dataset
apps_description = df_apps.describe(include='all')

# Generate descriptive statistics for all columns in the User Reviews dataset
reviews_description = df_reviews.describe(include='all')

# Print the descriptive statistics for the Play Store dataset
print('Play Store Dataset Description:', apps_description, sep='\n', end='\n\n')

# Print the descriptive statistics for the User Reviews dataset
print('User Reviews Dataset Description:', reviews_description, sep='\n')


### Variables Description

**Brief analysis of both datasets:**
### Play Store Dataset Columns:
- **App (object)**: Name of the app. This is a categorical column.
- **Category (object)**: App category (e.g., Games, Productivity). This is a categorical column.
- **Rating (float64)**: User rating of the app on a scale from 1 to 5. This is a numerical column.
- **Reviews (object)**: Number of user reviews. This is numerical information, but it is stored as text (object) and must be converted to an integer type for analysis.
- **Size (object)**: Size of the app (e.g., '19M', 'Varies with device'). This is stored as text; entries with an 'M' or 'k' unit can be converted to a numerical size in megabytes.
- **Installs (object)**: Number of installations (e.g., '1,000,000+'). This is a categorical column, but should be converted to a numerical format for analysis.
- **Type (object)**: Whether the app is free or paid. This is a categorical column.
- **Price (object)**: Price of the app (e.g., '$4.99', '0'). This is a categorical column, but should be converted to a numerical format (after removing the currency symbol) for analysis.
- **Content Rating (object)**: Age group suitable for the app (e.g., Everyone, Teen). This is a categorical column.
- **Genres (object)**: App genres (e.g., 'Action', 'Puzzle'). This is a categorical column.
- **Last Updated (object)**: Last update date of the app (e.g., 'August 3, 2018'). This is a categorical column, but should ideally be converted to a datetime format for analysis.
- **Current Ver (object)**: Current version of the app. This is a categorical column.
- **Android Ver (object)**: Minimum Android version required. This is a categorical column.

### User Reviews Dataset Columns:
- **App (object)**: Name of the app. This is a categorical column.
- **Translated_Review (object)**: Translated user review text. This is a categorical column.
- **Sentiment (object)**: Sentiment of the review (Positive, Neutral, Negative). This is a categorical column.
- **Sentiment_Polarity (float64)**: The review's polarity, ranging from -1 (Negative) to 1 (Positive). This is a numerical column.
- **Sentiment_Subjectivity (float64)**: Subjectivity score of the review, ranging from 0 to 1. Higher scores indicate more personal opinion, while lower scores indicate more factual, objective content. This is a numerical column.

Several columns stored as `object` (Reviews, Size, Installs, Price, Last Updated) hold numeric or date information and are converted during data wrangling.
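
Before converting a text column to a numeric type, it helps to know which entries would block the conversion. The sketch below uses `pd.to_numeric` with `errors="coerce"` for that purpose. The `reviews_raw` series and the malformed value `"3.0M"` are hypothetical examples, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical sample of a count column stored as text; "3.0M" is a made-up malformed entry.
reviews_raw = pd.Series(["159", "967", "3.0M", "87510"], name="Reviews")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
reviews_num = pd.to_numeric(reviews_raw, errors="coerce")

# Entries that were present as text but failed to parse are the malformed ones.
malformed = reviews_raw[reviews_num.isna() & reviews_raw.notna()]

print(malformed)
```

Once the malformed rows are inspected and handled, the column can be cast to an integer type without errors.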

### Check Unique Values for each variable.

In [None]:
# Iterate through each column in the Play Store dataset and print the total number of unique values in each column
for i in df_apps.columns.tolist():
    print('Total Unique Value in', i, ':', df_apps[i].nunique())

print('\n')  # Print a newline for better readability

# Iterate through each column in the User Reviews dataset and print the total number of unique values in each column
for i in df_reviews.columns.tolist():
    print('Total Unique Value in', i, ':', df_reviews[i].nunique())


In [None]:
# Print the number of duplicate values in the Play Store dataset
print('Duplicate Values in Play Store Dataset:', df_apps.duplicated().sum())

# Print the number of duplicate values in the User Review dataset
print('Duplicate Values in User Review Dataset:', df_reviews.duplicated().sum())
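
Note that `duplicated()` with its default settings flags only the repeats after the first occurrence, so the counts above are the number of rows that `drop_duplicates()` would remove. A small sketch on made-up rows (the `toy` frame is an illustrative assumption) shows the difference between counting repeats and inspecting every copy:

```python
import pandas as pd

# Hypothetical reviews frame; the rows are made up.
toy = pd.DataFrame({
    "App": ["A", "A", "A", "B"],
    "Sentiment": ["Positive", "Positive", "Positive", "Negative"],
})

# Default keep="first": only the second and third copies are flagged.
n_repeats = toy.duplicated().sum()

# keep=False flags every member of a duplicated group, which is handy for inspection.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps one row per group of identical rows.
deduped = toy.drop_duplicates()

print(n_repeats, len(all_copies), len(deduped))
```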


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Print the rows with non-numeric characters in the 'Reviews' column
non_numeric_data = df_apps[df_apps['Reviews'].str.contains(r'\D')]
print("Rows with non-numeric characters in 'Reviews' column:")
non_numeric_data

In [None]:
# The row at index 10472 contains data that is entirely incorrect and irrelevant.

# The row is unusable and would mislead the analysis; removing it preserves the integrity and accuracy of the dataset.
df_apps = df_apps.drop(index=10472)

# Resetting the index ensures that the DataFrame has continuous and ordered indices after dropping a row.
df_apps = df_apps.reset_index(drop=True)

In [None]:
# Convert the 'Reviews' column to integer datatype
df_apps['Reviews'] = df_apps['Reviews'].astype(int)

In [None]:
# Convert the 'Last Updated' column to datetime format
df_apps['Last Updated'] = pd.to_datetime(df_apps['Last Updated'])

In [None]:
def drop_dollar_symbol(data):
    """
    Function to remove the dollar sign from a price string and convert it to a float.

    Parameters:
    data (str): The price string which may contain a dollar sign.

    Returns:
    float: The numeric value of the price without the dollar sign.
    """
    if '$' in data:
        # If the price contains a dollar sign, remove it and convert the remaining string to float
        return float(data[1:])
    else:
        # If the price does not contain a dollar sign, directly convert it to float
        return float(data)

# Applying the drop_dollar_symbol function to the 'Price' column
# This converts the 'Price' column values from strings with dollar signs to numeric values (floats)
df_apps['Price'] = df_apps['Price'].apply(drop_dollar_symbol)


In [None]:
def drop_addition_symbol(data):
    """
    Function to remove addition symbols and commas from a string representing installation numbers
    and convert it to an integer.

    Parameters:
    data (str): The installation number as a string, which may contain '+' or ',' symbols.

    Returns:
    int: The numeric value of the installation number after cleaning the string.
    """
    try:
        # Remove any ',' separators and the trailing '+', then convert to integer
        return int(data.replace(',', '').replace('+', ''))
    except ValueError:
        # Return 0 if conversion fails due to invalid format
        return 0

# Applying the drop_addition_symbol function to the 'Installs' column
# This converts the 'Installs' column values from strings with symbols to integer values
df_apps['Installs'] = df_apps['Installs'].apply(drop_addition_symbol)


In [None]:
def convert_kb_to_mb(data):
    """
    Function to convert size values from kilobytes to megabytes and keep megabytes as is.

    Parameters:
    data (str): The size of the app as a string, which may contain 'M' (for MB) or 'k' (for KB).

    Returns:
    float or str: The size in megabytes, or the original value if it has no 'M' or 'k' unit.
    """
    try:
        if 'M' in data:
            # If the size is in megabytes ('M'), remove the 'M' and convert to float
            return float(data[:-1])
        elif 'k' in data:
            # If the size is in kilobytes ('k'), remove the 'k', convert to float, and convert to megabytes
            return round(float(data[:-1]) / 1024, 4)
        else:
            # If the size has neither unit (e.g. 'Varies with device'), return the data as is
            return data
    except (ValueError, TypeError):
        # Leave values that cannot be parsed unchanged
        return data

# Applying the convert_kb_to_mb function to the 'Size' column
# This converts the 'Size' column values from strings with 'M' or 'k' to numeric values in megabytes
df_apps['Size'] = df_apps['Size'].apply(convert_kb_to_mb)


In [None]:
# Verifying the data type information after type conversion
print('Play Store Updated Data Info:')

# Display information about the DataFrame, including the data types of each column and non-null values
df_apps.info()


In [None]:
# Extract non-float values in 'Size' column
non_float_size_values = df_apps['Size'][~df_apps['Size'].apply(lambda x: isinstance(x, float))]

# Calculate the percentage of non-float values in 'Size' column
percentage_of_non_float = (len(non_float_size_values) / len(df_apps['Size'])) * 100

# Print the result
print(f"Non-float values in the 'Size' column: {non_float_size_values.unique()}")
print(f"Percentage of non-float values in the 'Size' column: {percentage_of_non_float:.2f}%")

# Note: 'Varies with device' being the only non-float entry, constituting 15.64%,
# led to the decision to retain rows with this value in the 'Size' column.


In [None]:
# Display the shape of the datasets before removing duplicates
print('Shape Before Removing Duplicates:')
print('Play Store Data Rows count Before Removing Duplicate Values:', df_apps.shape[0])  # Number of rows in Play Store data
print('Play Store Data Columns count Before Removing Duplicate Values:', df_apps.shape[1])  # Number of columns in Play Store data
print('User Reviews Data Rows count Before Removing Duplicate Values:', df_reviews.shape[0])  # Number of rows in User Reviews data
print('User Reviews Data Columns count Before Removing Duplicate Values:', df_reviews.shape[1])  # Number of columns in User Reviews data


print('\n')  # printing extra space for better readability of output

# Remove duplicate rows from both datasets
df_reviews.drop_duplicates(inplace=True)  # Remove duplicates from User Reviews data
df_apps.drop_duplicates(inplace=True)  # Remove duplicates from Play Store data

# Display the shape of the datasets after removing duplicates
print('Shape After Removing Duplicates:')
print('Play Store Data Rows count After Removing Duplicate Values:', df_apps.shape[0])  # Number of rows in Play Store data after duplicates removed
print('Play Store Data Columns count After Removing Duplicate Values:', df_apps.shape[1])  # Number of columns in Play Store data after duplicates removed
print('User Reviews Data Rows count After Removing Duplicate Values:', df_reviews.shape[0])  # Number of rows in User Reviews data after duplicates removed
print('User Reviews Data Columns count After Removing Duplicate Values:', df_reviews.shape[1])  # Number of columns in User Reviews data after duplicates removed


In [None]:
# Filling missing values for numerical and categorical columns in the play store data

# For numerical columns, use the median to fill missing values
# (the result is assigned back because fillna(inplace=True) on a selected column
# may not update the DataFrame in recent pandas versions)
df_apps['Rating'] = df_apps['Rating'].fillna(df_apps['Rating'].median())

# For categorical columns, use the mode (most frequent value) to fill missing values
df_apps['Type'] = df_apps['Type'].fillna(df_apps['Type'].mode()[0])
df_apps['Content Rating'] = df_apps['Content Rating'].fillna(df_apps['Content Rating'].mode()[0])

# For columns with specific default values, fill missing values with 'Varies with device'
df_apps['Current Ver'] = df_apps['Current Ver'].fillna('Varies with device')
df_apps['Android Ver'] = df_apps['Android Ver'].fillna('Varies with device')

# Filling missing values for numerical and categorical columns in the user review data

# For numerical columns, use the median to fill missing values
df_reviews['Sentiment_Polarity'] = df_reviews['Sentiment_Polarity'].fillna(df_reviews['Sentiment_Polarity'].median())
df_reviews['Sentiment_Subjectivity'] = df_reviews['Sentiment_Subjectivity'].fillna(df_reviews['Sentiment_Subjectivity'].median())

# For categorical columns, use the mode (most frequent value) to fill missing values
df_reviews['Sentiment'] = df_reviews['Sentiment'].fillna(df_reviews['Sentiment'].mode()[0])

# For text data, fill missing values with a placeholder text
df_reviews['Translated_Review'] = df_reviews['Translated_Review'].fillna('No Review')

# Check for any remaining missing values after imputation
updated_missing_value_count_apps = df_apps.isnull().sum()
updated_missing_value_count_reviews = df_reviews.isnull().sum()

# Print the count of missing values for each column in both datasets
print(f'Missing value in Apps Data: {updated_missing_value_count_apps}')

print('\n')  # Print a newline for better readability of the output

print(f'Missing value in Reviews Data: {updated_missing_value_count_reviews}')


In [None]:
# Visualization of outliers through box plots

# Create a figure with two subplots side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# Plot a box plot for the 'Reviews' column on the first subplot
sns.boxplot(ax = axes[0], x = df_apps['Reviews'])
axes[0].set_title('Box Plot - Reviews')  # Set the title for the first subplot

# Plot a box plot for the 'Installs' column on the second subplot
sns.boxplot(ax = axes[1], x = df_apps['Installs'])
axes[1].set_title('Box Plot - Installs')  # Set the title for the second subplot

# Adjust the layout to prevent overlap of plots and titles
plt.tight_layout()

# Display the plots
plt.show()


In [None]:
# Display the shape of the datasets before removing outliers
print('Shape Before Removing Outliers in Apps Data:')
print('Play Store Data Row Count:', df_apps.shape[0])
print('Play Store Data Column Count:', df_apps.shape[1])
print('\n')
print('Shape Before Removing Outliers in Reviews Data:')
print('User Reviews Data Row Count:', df_reviews.shape[0])
print('User Reviews Data Column Count:', df_reviews.shape[1])

# Define quantiles for outlier removal
quantile_low = 0.05
quantile_high = 0.95

# Remove outliers from the 'Reviews' column in df_apps
df_apps = df_apps[(df_apps['Reviews'] >= df_apps['Reviews'].quantile(quantile_low)) &
                  (df_apps['Reviews'] <= df_apps['Reviews'].quantile(quantile_high))]

# Remove outliers from the 'Installs' column in df_apps
df_apps = df_apps[(df_apps['Installs'] >= df_apps['Installs'].quantile(quantile_low)) &
                  (df_apps['Installs'] <= df_apps['Installs'].quantile(quantile_high))]

# Display the shape of the datasets after removing outliers
print('Shape After Removing Outliers:')
print('Play Store Data Row Count:', df_apps.shape[0])
print('Play Store Data Column Count:', df_apps.shape[1])
print('User Reviews Data Row Count:', df_reviews.shape[0])
print('User Reviews Data Column Count:', df_reviews.shape[1])
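
Trimming at the 5th and 95th percentiles removes rows by construction, whether or not they are erroneous: each filtered column can lose up to about 10% of its rows. The sketch below illustrates this on made-up data. The helper name `trim_by_quantile` and the `toy` frame are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd


def trim_by_quantile(df, column, low=0.05, high=0.95):
    """Keep rows whose `column` value lies within the [low, high] quantile range."""
    lower, upper = df[column].quantile([low, high])
    # between() is inclusive on both ends, matching the >= / <= filter used above.
    return df[df[column].between(lower, upper)]


# Hypothetical data: 100 apps with review counts 1..100.
toy = pd.DataFrame({"Reviews": np.arange(1, 101)})
trimmed = trim_by_quantile(toy, "Reviews")

print(len(toy), "->", len(trimmed))
```

Comparing the row counts before and after, as the cell above does, is a quick check on how much data the chosen quantiles discard.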


In [None]:
# Dropping unnecessary columns which are not required for analysis from Play Store and User Reviews datasets

# Drop the 'Current Ver' column from df_apps
df_apps = df_apps.drop('Current Ver', axis=1)

# Drop the 'Translated_Review' column from df_reviews
df_reviews = df_reviews.drop('Translated_Review', axis=1)

print('Shape After Removing Unnecessary Columns:')
# Display the number of rows and columns in each dataset after removing unnecessary columns
print('Play Store Data Rows count:', df_apps.shape[0])
print('Play Store Data Columns count:', df_apps.shape[1])
print('User Reviews Data Rows count:', df_reviews.shape[0])
print('User Reviews Data Columns count:', df_reviews.shape[1])


### What all manipulations have you done and insights you found?

The following actions were taken to make the datasets analysis-ready:

1. **Identifying Non-Numeric Reviews:**
   - Checked and printed rows with non-numeric characters in the 'Reviews' column.

2. **Removing Irrelevant Row:**
   - Dropped the row at index 10472 as it contained incorrect or irrelevant data, ensuring dataset integrity.

3. **Converting Reviews to Integer:**
   - Converted the 'Reviews' column to integer data type for numerical analysis.

4. **Converting Last Updated to Datetime:**
   - Converted the 'Last Updated' column to datetime format for temporal analysis.

5. **Handling Price Values:**
   - Created a function (`drop_dollar_symbol`) to drop the '$' symbol and convert the 'Price' column to float data type.

6. **Handling Installs Values:**
   - Created a function (`drop_addition_symbol`) to drop the '+' and ',' symbols and convert the 'Installs' column to integer data type.

7. **Converting Size Entries:**
   - Created a function (`convert_kb_to_mb`) to convert size entries to MB and handle 'k' or 'M' units; 'Varies with device' entries are left unchanged.

8. **Verifying Data Types:**
   - Checked and printed the updated data type information after the type conversion.

9. **Removing Duplicates:**
   - Removed duplicate rows from both the Play Store and User Reviews datasets.

10. **Handling Missing Values:**
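    - Filled missing values in numerical columns with the median and in categorical columns with the mode; filled 'Current Ver' and 'Android Ver' with 'Varies with device' and 'Translated_Review' with 'No Review'.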
    - Checked and printed the updated number of missing values in both datasets.

11. **Handling Outliers:**
    - Visualized outliers through box plots for Reviews and Installs.
    - Removed outliers from data based on quantile range (5% to 95%) for Reviews and Installs.

12. **Removing Unnecessary Columns:**
    - Certain columns were considered non-significant to the analysis and were dropped: the 'Current Ver' column in the Play Store dataset (`df_apps`) and the 'Translated_Review' column in the User Reviews dataset (`df_reviews`).
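
The type conversions listed above can also be written with vectorized string methods instead of row-wise functions. This is a sketch under stated assumptions, not the notebook's implementation: the `raw` frame holds made-up values, and unlike the notebook it maps 'Varies with device' to NaN so that the size column becomes fully numeric.

```python
import pandas as pd

# Hypothetical rows mimicking the raw text formats described above.
raw = pd.DataFrame({
    "Price": ["0", "$4.99", "$1.50"],
    "Installs": ["1,000,000+", "500+", "10+"],
    "Size": ["19M", "512k", "Varies with device"],
})

clean = pd.DataFrame(index=raw.index)

# Price: strip the literal '$' and convert to float.
clean["Price"] = raw["Price"].str.replace("$", "", regex=False).astype(float)

# Installs: strip ',' and '+' and convert to integer.
clean["Installs"] = raw["Installs"].str.replace(r"[,+]", "", regex=True).astype(int)

# Size: split into a number and a unit; entries that do not match become NaN.
parts = raw["Size"].str.extract(r"^([\d.]+)([Mk])$")
number = pd.to_numeric(parts[0], errors="coerce")
# 'k' values are divided by 1024 to express them in MB, as in the notebook.
factor = parts[1].map({"M": 1.0, "k": 1 / 1024})
clean["Size_MB"] = number * factor

print(clean)
```

Keeping the size as a float column with NaN for unknown sizes allows numeric plots and correlations without filtering out string values first.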

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Distribution of Sentiment Polarity

# Set the style of the visualization
sns.set_style('darkgrid')

# Create a new figure for the plot
plt.figure(figsize=(10, 5))

# Set the title of the plot
plt.title('Distribution of Sentiment Polarity', size=20)

# Plot the distribution of sentiment polarity using a histogram with KDE
sns.histplot(df_reviews['Sentiment_Polarity'], bins=50, kde=True)

# Set the labels for the x and y axes
plt.xlabel('Sentiment Polarity', size=15)
plt.ylabel('Frequency', size=15)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Histogram charts prove valuable when visualizing the distribution of a singular numerical variable. In this case, the histogram displays the frequency of different sentiment polarity values, while the KDE adds a smooth curve to highlight overall patterns and trends. This combination helps to clearly see the distribution shape, central tendency, and variability in sentiment scores, making it easier to interpret the sentiment data.

##### 2. What is/are the insight(s) found from the chart?


1. **Neutral sentiment dominance**: The histogram shows a significant peak at the neutral sentiment polarity (around 0.0). This indicates that a large portion of the data exhibits neutral sentiment, suggesting that the content or interactions analyzed often do not evoke strong positive or negative feelings.

2. **Prevalence of positive sentiment**: There is a noticeable right skew in the histogram, with a concentration of data points on the positive side of the polarity spectrum. This implies that the content generally leans towards a more positive sentiment, reflecting an overall positive reaction or sentiment in the dataset.

3. **Presence of negative sentiment**: Although less prominent, the histogram also reveals a tail on the negative side of the polarity scale. This suggests that some instances in the dataset carry negative sentiment, indicating areas where dissatisfaction or negative feedback is present.

4. **Distribution shape**: The distribution is skewed towards the positive side, meaning there is a higher concentration of data points with positive sentiment compared to negative ones. This trend implies that, overall, the sentiment within the dataset is more positive, with a general inclination towards user satisfaction.
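
One way to back the visual reading above with numbers is to compare the mean with the median and to compute the sample skewness. The sketch below uses a small made-up polarity series (not the review data) and pandas' `Series.skew()`; on the real data the same calls would be applied to `df_reviews['Sentiment_Polarity']`.

```python
import pandas as pd

# Toy polarity values for illustration only (not taken from the review data)
polarity = pd.Series([-0.2, 0.0, 0.0, 0.0, 0.1, 0.2, 0.3, 0.5, 0.8, 1.0])

summary = {
    'mean': polarity.mean(),
    'median': polarity.median(),
    'skewness': polarity.skew(),  # a value above 0 means a longer tail to the right
    'share_positive': (polarity > 0).mean(),
    'share_neutral': (polarity == 0).mean(),
    'share_negative': (polarity < 0).mean(),
}
print(summary)
```

A mean above the median together with a positive skewness would be consistent with the right-skewed shape described in the insights.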

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:

1. **Overall Positive Sentiment**:
   - **Potential Positive Impact**: The strong presence of neutral and positive sentiment indicates that most content or interactions are well-received by users. This positive reception can enhance user experiences, increase retention, and potentially attract new users. Positive feedback and satisfaction often lead to favorable word-of-mouth, which can contribute significantly to the popularity and success of the business.

2. **Distribution Shape and User Satisfaction**:
   - **Positive Growth Opportunity**: The right-skewed distribution, with a higher concentration of positive sentiment, suggests a general trend of user satisfaction. This trend can be leveraged for positive growth. Developers and businesses can use this positive sentiment to reinforce popular features and further enhance the overall user experience, driving continued success and growth.

### Negative Growth Consideration:

1. **Presence of Negative Sentiment**:
   - **Potential Negative Impact**: The presence of negative sentiment, though less dominant, indicates that certain aspects of the content or interactions are causing user dissatisfaction. If this negative feedback is not addressed, it could lead to user churn, negative reviews, and a decline in reputation. Users might seek alternatives that better meet their expectations, potentially impacting the business negatively.

2. **Areas for Improvement**:
   - **Negative Growth Risk**: The negative sentiment also highlights areas that may require improvement. If these issues are not promptly addressed, they pose a risk of negative growth. Dissatisfied users could impact the overall ratings, retention rates, and success of the business in a competitive market. Addressing these concerns is crucial to prevent potential negative outcomes and maintain positive momentum.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Distribution Of App Rating In Play Store Data

# Set the visual style of the plot to 'darkgrid'
sns.set(style='darkgrid')

# Create a new figure
plt.figure(figsize=(10, 5))

# Add a title to the chart
plt.title('Distribution Of App Rating Values', size=20)

# Plot a histogram of the 'Rating' column from the DataFrame 'df_apps'
sns.histplot(df_apps['Rating'], bins=50, kde=True)

# Label the x-axis as 'Rating'
plt.xlabel('Rating', size=15)

# Label the y-axis as 'Frequency'
plt.ylabel('Frequency', size=15)

# Set the x-axis limits to display all values up to 6 while keeping the lower limit automatic
plt.xlim([None, 6])

# Display the plot
plt.show()



##### 1. Why did you pick the specific chart?

Histogram charts are beneficial for visualizing the distribution of a single numerical variable. In this scenario, utilizing a Histogram chart allowed me to portray the distribution of app ratings from the Play Store data, providing a clear representation of the frequency of various rating levels assigned by users.

##### 2. What is/are the insight(s) found from the chart?

- **Overall sentiment:** The majority of users are satisfied with the apps, as evidenced by the high concentration of positive ratings.

- **Exceptional apps:** A notable fraction of users demonstrates high enthusiasm for the apps, as evidenced by the concentration observed between ratings 4 and 5.

- **Areas for improvement:** There is some room for improvement, as evidenced by the presence of lower ratings. By analyzing the negative reviews, developers can identify specific areas where the apps can be enhanced.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:

1. **Overall Positive Sentiment:**
   - **Potential Positive Impact:** The concentration of positive ratings suggests that the majority of users are satisfied with the apps. This positive sentiment can contribute to user retention, positive word-of-mouth, and potentially attract new users.

2. **Exceptional Apps and User Enthusiasm:**
   - **Potential Positive Impact:** The presence of a notable fraction of highly positive ratings indicates that some users are exceptionally enthusiastic about the apps. This could translate into a dedicated user base, potential advocates for the apps, and positive reviews that can enhance the apps' reputation.

3. **Areas for Improvement:**
   - **Positive Business Impact Opportunity:** Identifying areas for improvement, as indicated by lower ratings, presents an opportunity for positive business impact. By addressing specific concerns highlighted in negative reviews, developers can enhance the user experience, potentially leading to increased user satisfaction, better app reviews, and improved app performance.

### Negative Growth Consideration:
While the insights provide opportunities for positive impact, it's essential to consider the potential negative growth factors:

- **User Dissatisfaction:** The presence of lower ratings suggests that there are users who are not fully satisfied with the apps. If these concerns are not addressed promptly, it could lead to negative word-of-mouth, decreased user retention, and a potential decline in the apps' popularity.

- **Competitive Landscape:** If issues identified in the lower ratings are not addressed, it could impact the apps' competitiveness in the market. Users have various alternatives, and negative feedback might drive them towards competing apps that better meet their needs.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Distribution of App Type: Free or Paid

# Get the counts of each app type (Free or Paid) from the 'Type' column
apps_type = df_apps['Type'].value_counts()

# Create a new figure for the pie chart with a specified size of 8x8 inches
plt.figure(figsize=(8, 8))

# Plot the pie chart: label each slice, pull the second slice out slightly, and print percentages
plt.pie(apps_type, labels=apps_type.index, explode=(0, 0.1), autopct='%1.1f%%', colors=sns.color_palette("Paired"))

# Set the title of the pie chart with a font size of 20
plt.title('Distribution of Paid and Free Apps', size=20)

# Display the pie chart
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart visually conveys percentage distribution within a dataset, making it ideal for illustrating part-to-whole relationships. In this case, I utilized a pie chart to efficiently communicate the relative proportions of 'Free' and 'Paid' categories in the 'Type' column.

##### 2. What is/are the insight(s) found from the chart?

The larger slice of the pie chart represents free apps, at 92.3% of the total, while the smaller slice represents paid apps, at 7.7%. This indicates that the vast majority of apps on the Google Play Store are free to download and use.
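
The percentages that `autopct` prints on the pie chart can also be computed directly, which is handy when quoting them in text. A minimal sketch with a made-up `Type` column (the counts are illustrative, not the Play Store figures):

```python
import pandas as pd

# Toy 'Type' column for illustration only
types = pd.Series(['Free'] * 9 + ['Paid'] * 1, name='Type')

# normalize=True returns shares instead of raw counts
type_share = types.value_counts(normalize=True).mul(100).round(1)
print(type_share)
```

On the notebook's data the same call would be applied to `df_apps['Type']`.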

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:
1. **Increased User Engagement:** Free apps are more likely to attract downloads and user engagement due to the absence of a cost barrier. This can result in a broader user base and heightened interaction with the app.

2. **Diverse Monetization Options:** Although free, developers can explore various monetization avenues, including in-app purchases, advertising, and premium features. This flexibility allows developers to generate revenue through different channels.

### Negative Growth Consideration:
1. **Monetization Challenges:** The dominance of free apps may pose challenges for developers seeking direct revenue through upfront payments. Monetizing free apps effectively requires thoughtful strategies, and some developers may find it challenging to strike the right balance between user satisfaction and sustainable business practices.

2. **Competition and Visibility:** The sheer volume of free apps can lead to increased competition for visibility. Standing out in a crowded market becomes a critical challenge, and developers may need to invest in marketing and discoverability strategies.

3. **User Experience Concerns:** Monetization through advertising must be approached carefully to avoid negatively impacting the user experience. Excessive or intrusive ads can lead to user dissatisfaction and potential abandonment of the app.

#### Chart - 4

In [None]:
# Chart - 4 Visualization Code
# To visualize the top 10 genres with the highest number of installs

# Group the data by 'Genres' and sum up the 'Installs' for each genre
genre_install = df_apps.groupby('Genres')['Installs'].sum().reset_index()

# Sort the genres by the total number of installs in descending order
sorted_genres_install = genre_install.sort_values(by='Installs', ascending=False)

# Select the top 10 genres with the highest number of installs
top_10_genres = sorted_genres_install.head(10)

# Create a figure for the bar plot with specified size
plt.figure(figsize=(10, 5))

# Create a bar plot showing the total installs for the top 10 genres
sns.barplot(x='Genres', y='Installs', data=top_10_genres, palette='colorblind')

# Rotate x-axis labels to fit them better and align them to the right
plt.xticks(rotation=45, ha='right')

# Set the title of the plot
plt.title('Top 10 Genres by Installs', size=20)

# Label the x-axis
plt.xlabel('App Genres', size=15)

# Label the y-axis
plt.ylabel('Total Installs', size=15)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Bar charts compare a numeric value across the levels of a categorical or nominal variable. I chose a bar chart to identify the top 10 app genres by total number of installs.

##### 2. What is/are the insight(s) found from the chart?

- Tools apps are the most popular genre. This is likely due to the increasing reliance on smartphones and tablets for work and productivity.

- Action apps are the second most popular genre. Action games are typically fast-paced and exciting, and they appeal to a wide range of users.

- Photography apps are the third most popular genre. This is likely due to the increasing popularity of smartphone photography.

- Entertainment apps are the fourth most popular genre. Entertainment apps include various streaming services as well as social media apps.

- Communication apps are the fifth most popular genre. Communication apps include messaging apps and video conferencing apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:
1. **Tools Apps Popularity:** With Tools apps being the most popular genre, businesses can capitalize on this trend by developing and optimizing apps that cater to work and productivity needs. This insight suggests a demand for applications that enhance users' efficiency and organization, presenting an opportunity for businesses to create valuable and practical tools.

2. **Action Apps Appeal:** The popularity of Action apps, known for their fast-paced and exciting nature, indicates a broad appeal. Developers can leverage this insight to create engaging and entertaining games, potentially attracting a wide user base. This genre's popularity suggests a demand for immersive and thrilling experiences, which can be monetized effectively.

3. **Photography Apps Trend:** The increasing popularity of Photography apps aligns with the growing trend of smartphone photography. Businesses can seize this opportunity by developing innovative photography apps, offering features that enhance photo editing, organization, and sharing. This genre's popularity reflects a consumer interest in visual content creation.

4. **Entertainment Apps Opportunities:** Entertainment apps, encompassing streaming services and social media apps, hold significant popularity. Businesses can explore opportunities within this genre, either by creating new content streaming services or optimizing social media platforms. This trend indicates a sustained demand for diverse entertainment options.

5. **Communication Apps Importance:** The popularity of Communication apps, including messaging and video conferencing, highlights the essential role of connectivity. Businesses can focus on creating user-friendly and feature-rich communication apps, meeting the increasing demand for seamless connectivity and collaboration.

### Negative Growth Consideration:

1. **Competitive Challenges:** If certain genres have a saturated market with intense competition, it may be challenging for new apps to gain visibility and user traction. This could lead to negative growth for apps in those highly competitive categories.

2. **Addressing Negative Reviews:** Negative sentiment in reviews or lower app ratings may indicate areas for improvement. Ignoring or failing to address these issues could result in negative user experiences, leading to decreased installs and usage.

3. **Adapting to Trends:** App markets are dynamic, and user preferences can change. Failing to adapt to emerging trends or technological advancements may result in declining popularity and negative growth for apps that become outdated.


#### Chart - 5

In [None]:
# Chart - 5 Visualization Code
# Visualizing Content Ratings by Total Number of Installs

# Group the data by 'Content Rating' and sum up the 'Installs' for each content rating
content_rating_install = df_apps.groupby('Content Rating')['Installs'].sum().reset_index()

# Sort the content ratings by the total number of installs in descending order
sorted_content_rating = content_rating_install.sort_values(by='Installs', ascending=False)

# Set up the figure size for the plot
plt.figure(figsize=(10, 5))

# Create a bar plot showing the total installs for each content rating
sns.barplot(x='Installs', y='Content Rating', data=sorted_content_rating, palette='colorblind')

# Set the title of the plot and labels for x and y axis
plt.title('Content Ratings by Installs', size=20)
plt.xlabel('Total Installs', size=15)
plt.ylabel('Content Ratings', size=15)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Bar charts serve as a visual tool to effectively represent and compare data across distinct categories. Utilizing bars of varying lengths, these charts visually convey the values associated with each category, facilitating the observation of differences and similarities. Hence, I opted for a bar chart to analyze the variations in content ratings based on installations.

##### 2. What is/are the insight(s) found from the chart?

- The categories **Everyone** and **Teen** stand out with the highest number of installs, indicating preferences for apps suitable for all ages or users aged 13 and above. These categories encompass apps with minimal or mild content, including educational, entertainment, or social apps.

- The **Everyone 10+** category follows with the third-highest installs, suggesting a preference for apps suitable for users aged 10 and above. Such apps may contain more moderate content, such as fantasy or science fiction.

- The **Mature 17+** and **Adults only 18+** categories exhibit significantly fewer installs. This implies a limited preference for apps tailored to users aged 17 or older or 18 and older, which may feature intense or graphic content like violence, sexual content, drug use, or gambling.

- The **Unrated** category records the fewest installs, suggesting minimal interest in apps lacking official ratings. These apps may have unknown or variable content, potentially unsuitable for some users.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:

- The **Everyone** and **Teen** categories exhibit the highest number of installs, suggesting a positive reception and indicating a market demand for apps suitable for a broad audience, including educational and entertainment content.

- The **Everyone 10+** category, with the third-highest installs, reflects a positive response to apps tailored for users aged 10 and above. This indicates potential business opportunities in developing content with moderate themes for this demographic.

### Negative Growth Consideration:

- Limited installs for the **Mature 17+** and **Adults only 18+** categories suggest a potential negative impact. The lower preference for apps with intense or graphic content for users aged 17 or older may indicate a narrower market, prompting consideration before heavy investment in such content development.

- The **Unrated** category, recording the fewest installs, highlights user reluctance towards apps lacking official ratings. This hesitation could be attributed to uncertainties about the app's content, posing a challenge for positive user engagement and potential business growth.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Size Groups by Installs

# Define the function to group app sizes
def size_groups(value):
    try:
        if value < 20:
            return 'Below 20'
        elif value >= 20 and value <= 40:
            return '20-40'
        elif value > 40 and value <= 60:
            return '40-60'
        elif value > 60 and value <= 80:
            return '60-80'
        elif value > 80 and value <= 100:
            return '80-100'
        else:
            return 'Above 100'
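    except TypeError:
        # Non-numeric entries (e.g. strings) cannot be compared with numbers; return them unchanged
        return value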

# Apps with size 'Varies with device' have no fixed size,
# so they cannot be assigned to a size group.
# Exclude those rows; .copy() avoids a SettingWithCopyWarning when the new column is added below
df_filtered_size = df_apps[df_apps['Size'] != 'Varies with device'].copy()

# Apply the size_groups function to create a new 'Size Group' column
df_filtered_size['Size Group'] = df_filtered_size['Size'].apply(size_groups)

# Group by Size Group, summing the installs for each size group
size_group_installs = df_filtered_size.groupby('Size Group')['Installs'].sum().reset_index()

# Sort by installs in descending order
sorted_size_group_installs = size_group_installs.sort_values(by='Installs', ascending=False)

# Set up the plot
plt.figure(figsize=(10, 5))
sns.barplot(x='Installs', y='Size Group', data=sorted_size_group_installs, palette='pastel')

# Customize the plot
plt.title('Size Groups by Installs',size=20)
plt.xlabel('Total Installs',size=15)
plt.ylabel('Size Groups(MB)',size=15)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

This horizontal bar chart was selected to visualize the relationship between different size groups of apps and the total number of installs for each group on the Google Play Store. Bar charts are particularly effective for this purpose because they allow for easy comparison across categories (in this case, app size groups) by using bars of varying lengths.
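
The same binning can also be expressed with `pandas.cut`, which avoids the chain of `if`/`elif` branches. This is a sketch on made-up sizes, not a drop-in replacement: with `right=True` each bin includes its upper edge, so a size of exactly 20 MB lands in 'Below 20' here, whereas `size_groups` puts it in '20-40'.

```python
import numpy as np
import pandas as pd

# Toy app sizes in MB and install counts, for illustration only
toy = pd.DataFrame({
    'Size': [3.5, 18.0, 25.0, 40.0, 55.0, 75.0, 99.0, 150.0],
    'Installs': [1000, 5000, 500, 100, 100, 50, 10, 10],
})

bins = [-np.inf, 20, 40, 60, 80, 100, np.inf]
labels = ['Below 20', '20-40', '40-60', '60-80', '80-100', 'Above 100']

# right=True (the default) makes each bin include its upper edge
toy['Size Group'] = pd.cut(toy['Size'], bins=bins, labels=labels, right=True)

# observed=True keeps only the groups that actually occur in the data
installs_by_group = toy.groupby('Size Group', observed=True)['Installs'].sum()
print(installs_by_group)
```

Because the result is an ordered categorical, the groups keep their natural size order when plotted, instead of being sorted alphabetically.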

##### 2. What is/are the insight(s) found from the chart?

**Insights from the Chart:**

1. Dominance of Smaller Apps:

The "Below 20 MB" size group has the highest total installs, indicating that smaller apps tend to be more popular or are installed more frequently compared to larger apps. This could be due to several factors, such as lower data usage, quicker downloads, or limited device storage on the part of users.

2. Moderate Popularity for Medium-Sized Apps:

The size groups "20-40 MB" and "40-60 MB" also have significant numbers of installs but are less popular than the smallest apps. These apps may offer more features or content, which could appeal to users despite their larger size.

3. Lower Popularity of Larger Apps:

The "60-80 MB" and "80-100 MB" size groups have noticeably fewer installs compared to the smaller size groups. This suggests that users might be more selective or hesitant to download larger apps, possibly due to concerns about storage space, data usage, or the perception that larger apps are less essential.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the chart can help create a **positive business impact** if leveraged effectively. Here’s how:

### Positive Business Impact:
1. **Optimizing App Size for Greater Adoption:**
   - Knowing that smaller apps ("Below 20 MB") tend to have higher install rates, businesses can focus on optimizing their app sizes to be as small as possible. This could attract more users, particularly those with limited storage or data plans, thereby increasing the app’s reach and user base.

2. **Targeted Marketing Strategies:**
   - Understanding that medium-sized apps ("20-40 MB" and "40-60 MB") also perform well, businesses can tailor their marketing strategies to highlight the value these apps provide. For example, they can emphasize features that justify the larger size, ensuring that potential users understand the benefits of downloading a slightly bigger app.

3. **User Retention and Satisfaction:**
   - By focusing on reducing app size without compromising functionality, businesses can enhance user satisfaction. This approach can lead to better reviews, higher user retention, and more word-of-mouth referrals, contributing to sustained growth and profitability.

4. **Resource Allocation:**
   - Businesses can allocate resources more effectively by investing in optimizing smaller or medium-sized apps rather than developing large, resource-intensive apps that might not perform as well in terms of installs.

### Potential Risks (Negative Impact if Not Addressed):
- **Over-Optimization:**
  - If a business overly focuses on minimizing app size, it might strip away essential features or degrade the user experience, which could lead to negative reviews and a drop in user retention.

- **Ignoring High-Value Niche Markets:**
  - While smaller apps are more popular in general, there may be niche markets that prefer larger apps due to specific functionalities or content offerings. Ignoring these could mean missing out on potentially lucrative segments.


#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Distribution of Ratings by Update Year
# Extract the year from the 'Last Updated' column
df_apps['Update Year'] = df_apps['Last Updated'].dt.year

# Set up the regression plot
plt.figure(figsize=(10, 5))
sns.regplot(x='Update Year', y='Rating', data=df_apps, scatter_kws={'alpha':0.5}, line_kws={'color':'red'})

# Customize the plot
plt.title('Distribution of Ratings by Update Year',size=20)
plt.xlabel('Update Year',size=15)
plt.ylabel('Rating',size=15)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

A regplot is a statistical visualization that reveals the relationship between two continuous variables through a scatter plot and a best-fit linear regression line. It helps identify trends, patterns, and the strength of the correlation between variables. I utilized regplot to explore the distribution of Ratings with respect to Update Years.

##### 2. What is/are the insight(s) found from the chart?

-  The average rating has shown an improvement, rising from approximately 3.5 in 2010 to nearly 4.5 in 2018. This indicates a general trend of increasing satisfaction among users with the product over the years.

-  The red line shows that the overall trend is towards increasing ratings. This is a positive sign for the product.

-  The slope of the red line is positive, which indicates that the relationship between rating and update year is positive. This means that ratings tend to increase as the update year increases.
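
The slope of the trend line can be read off numerically by fitting the same kind of straight line. The sketch below uses `numpy.polyfit` on a few made-up (year, rating) pairs; the numbers are illustrative and are not the Play Store values. `regplot` fits an ordinary least-squares line by default, so a degree-1 fit on the real columns should give a comparable slope.

```python
import numpy as np

# Toy (update year, rating) pairs for illustration only
years = np.array([2014, 2015, 2016, 2017, 2018], dtype=float)
ratings = np.array([3.9, 4.0, 4.1, 4.2, 4.3])

# Degree-1 least-squares fit: rating = slope * year + intercept
slope, intercept = np.polyfit(years, ratings, deg=1)
print(f'Estimated change in rating per year: {slope:.3f}')
```

A positive `slope` corresponds to the upward-tilting line in the chart; its magnitude indicates how much the fitted rating changes per year.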

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:
1. **Increasing Average Rating:** The rise in average ratings from approximately 3.5 in 2010 to nearly 4.5 in 2018 indicates an upward trend in user satisfaction. This is a positive signal as it suggests that users are generally more content with the product over time.

2. **Improvement Over Time:** The red line representing the trend shows a positive slope, indicating a consistent increase in ratings. This implies that developers are making continuous improvements, positively influencing user satisfaction.

3. **Possible Explanations for Rating Increase:**

   - **Introduction of New Features:** Developers may be adding new features, enhancing the product's functionality, and providing users with more value.

   - **Enhancements in Reliability and Usability:** The product may be improving in terms of reliability and user-friendliness, contributing to a better overall experience.

   - **Increased Popularity:** A growing user base could lead to a more positive user experience, as popularity often correlates with user satisfaction.

### Negative Growth Consideration:
1. **Limited Historical Context:** While the increasing trend is positive, it's crucial to consider the context. Ratings might be influenced by various factors, and without a deeper understanding of the product's evolution or changes, solely relying on the increasing trend might be limited.

2. **Potential Plateau:** Over time, achieving consistently higher ratings becomes challenging, and there might be a plateau effect. If the ratings reach a saturation point, further improvements might yield diminishing returns, potentially leading to stagnation.

3. **User Base Shift:** Ratings might be influenced by changes in the user base. If the user demographic shifts or if newer users have different expectations, the historical trend may not accurately reflect the current user sentiment.


#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Top 10 Categories by Average Revenue
# Calculate Revenue for each app
df_apps['Revenue'] = df_apps['Installs'] * df_apps['Price']

# Group by Category and calculate the mean revenue, sorted in descending order
category_revenue = df_apps.groupby('Category')['Revenue'].mean().sort_values(ascending=False)

# Select the top 10 categories
top_10_categories_revenue = category_revenue.head(10)

# Set up the plot
plt.figure(figsize=(10, 5))

# Title of the plot
plt.title('Top 10 Categories by Average Revenue',size=20)

# Create a bar plot for the top 10 categories
sns.barplot(x=top_10_categories_revenue.values, y=top_10_categories_revenue.index, palette='Set3')

# Labeling the x and y axes
plt.xlabel('Average Revenue (USD)',size=15)
plt.ylabel('Category',size=15)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is a valuable visualization for comparing diverse data points, particularly when examining and contrasting data across different categories. So, I opted for a bar chart to explore the top 10 categories by average revenue.

##### 2. What is/are the insight(s) found from the chart?

- The top 3 revenue-generating categories, namely Lifestyle, Finance, and Weather, indicate a willingness among users to invest in products and services associated with their personal lives and finances.

- The next 3 categories in the revenue ranking, Game, Photography, and Family, are all linked to entertainment and leisure, signaling an increasing trend of expenditure on experiences meant for enjoyment and shared moments.

- Among the ten categories shown, Sports records the lowest average revenue and Education the second lowest, which points to lower profitability relative to the leading categories.

- Medical ranks third lowest and Personalization fourth lowest among the ten. For Personalization this suggests a relatively lower level of user spending, while for Medical it may reflect challenges in terms of convenience or security for apps within this category.
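
Because `Revenue` is computed as `Installs * Price`, every free app contributes zero, so a category's mean depends heavily on how many free apps it contains. The install counts are also lower-bound buckets (the '+' suffix removed during wrangling), so the product is a rough estimate at best. The sketch below, on made-up rows, contrasts the mean over all apps with the mean over paid apps only; the column names follow the notebook, while the data and the paid-only comparison are assumptions added for illustration.

```python
import pandas as pd

# Toy rows for illustration only (not Play Store data)
toy = pd.DataFrame({
    'Category': ['FINANCE', 'FINANCE', 'GAME', 'GAME', 'GAME', 'GAME'],
    'Installs': [1000, 1000, 10000, 1000, 1000, 1000],
    'Price': [5.0, 0.0, 1.0, 0.0, 0.0, 0.0],
})
toy['Revenue'] = toy['Installs'] * toy['Price']

# Mean over all apps: free apps pull the average down
mean_all = toy.groupby('Category')['Revenue'].mean()

# Mean over paid apps only: free apps are excluded first
mean_paid = toy[toy['Price'] > 0].groupby('Category')['Revenue'].mean()

print(pd.DataFrame({'all apps': mean_all, 'paid only': mean_paid}))
```

In this toy data the two categories tie on the all-apps mean but differ on the paid-only mean, which is why it can be worth checking both before ranking categories.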

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:
1. **Strategic Focus on High-Revenue Categories:** Businesses can leverage the knowledge about the top revenue-generating categories like Lifestyle, Finance, and Weather to strategically focus on developing and optimizing apps within these genres. This aligns with user preferences and the willingness to invest in products associated with personal lives and finances.

2. **Entertainment and Leisure Trends:** Recognizing the revenue potential in Game, Photography, and Family categories allows businesses to tap into the increasing trend of expenditure on experiences related to entertainment and leisure. Developers can create engaging and enjoyable apps within these genres to attract users and generate revenue.

### Negative Growth Consideration:
1. **Low Revenue in Sports and Education Categories:** Among the ten categories shown, Sports records the lowest average revenue and Education the second lowest, indicating lower profitability than the leading categories. Businesses heavily invested in these categories may find it harder to generate substantial revenue. This doesn't necessarily lead to negative growth, but it does signal areas for strategic evaluation and potential adjustments.

2. **Challenges in Personalization and Medical Categories:** The relatively low average revenue in the Personalization and Medical categories suggests challenges. In the Personalization category, there may be a lower level of user interest, while in the Medical category, potential issues related to convenience or security may be impacting revenue. Addressing these challenges through targeted improvements or considering alternative strategies is essential to mitigate any negative impact.



#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Comparing the Revenue and Android Version of the Top 8 Paid Apps in the Play Store
# Set style to 'whitegrid'
sns.set(style='whitegrid')

#Exclude apps with Android version 'Varies with device'
top_8_paid_apps = df_apps[df_apps['Android Ver'] != 'Varies with device'].nlargest(8, 'Revenue', keep='first')

# Set up the plot
plt.figure(figsize=(10, 5))

# Plotting lollipops for revenue using Seaborn scatterplot
sns.scatterplot(x='App', y='Revenue', hue='Android Ver', data=top_8_paid_apps, palette='Dark2', s=300, zorder=2)

# Plotting bars for revenue using Seaborn barplot
sns.barplot(x='App', y='Revenue', data=top_8_paid_apps, color='darkorange', width=0.08, zorder=1)

# Customize the plot
plt.xlabel('App', size=15)
plt.xticks(rotation=45,ha='right')
plt.ylabel('Revenue (USD)', size=15)
plt.title('Top Eight Paid Apps: Revenue and Android Version', size=20)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

Lollipop charts offer a visually appealing alternative to traditional bar charts. In this chart, the lollipop's colour shows the Android version, while the stem height represents the app's revenue. It highlights the top 8 revenue-generating paid apps and shows the Android version each one requires, which made this visualization my preferred choice.

##### 2. What is/are the insight(s) found from the chart?


1. Apps that require Android 4.0 or above dominate the higher revenue ranks, which hints at an association between targeting newer Android versions and revenue. With only eight apps in the chart, this is an observation rather than an established correlation.

2. Among the top 8 high revenue apps, six are designed for Android versions 4.0 and above. The exceptions are "Grand Theft Auto: San Andreas" (Android 3.0 and up) and "DraStic DS Emulator" (Android 2.3 and up), both of which are on the lower end of the revenue spectrum.
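The comparison above can be made explicit by turning the `Android Ver` text into a number. The following is a minimal sketch, assuming the column holds strings of the form "4.0 and up"; the `top_apps` frame, its app names, and its revenue figures are hypothetical stand-ins, not values from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the top-apps table; names and values are illustrative only.
top_apps = pd.DataFrame({
    "App": ["App A", "App B", "App C", "App D"],
    "Android Ver": ["4.0 and up", "2.3 and up", "4.1 and up", "3.0 and up"],
    "Revenue": [400.0, 100.0, 300.0, 150.0],
})

# Extract the leading version number from strings such as "4.0 and up".
top_apps["Min Android"] = (
    top_apps["Android Ver"]
    .str.extract(r"^(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)

# Label each app by whether its minimum supported version is 4.0 or higher.
top_apps["Tier"] = top_apps["Min Android"].ge(4.0).map({True: "4.0+", False: "below 4.0"})

# Number of apps and mean revenue in each tier.
summary = top_apps.groupby("Tier")["Revenue"].agg(["count", "mean"])
print(summary)
```

Applied to the real `top_8_paid_apps` frame, the same steps would show how many of the eight apps fall in each tier and how their average revenue compares.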

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:

1. **Targeted Development:** Knowing that apps compatible with Android versions 4.0 and above tend to generate higher revenue can guide developers in focusing their efforts on creating and optimizing apps for these versions. This targeted development approach may result in more successful and lucrative applications.

2. **Market Alignment:** Aligning app development with the Android versions preferred by users can enhance market penetration and user adoption. This alignment may lead to increased downloads and, subsequently, higher revenue.

### Negative Growth Consideration:

1. **Compatibility Challenges:** Apps designed for older Android versions (e.g., Android 2.3 and 3.0) are associated with lower revenue. Investing resources in developing or maintaining apps for these versions may not yield significant returns, potentially leading to negative growth.

2. **Revenue Discrepancy:** The notable revenue difference between apps for Android 4.0 and above versus older versions suggests a market preference for more recent Android iterations. Failing to adapt to this preference may result in negative growth as user demand shifts toward newer Android releases.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Sentiment counts across categories.
# Set style to 'whitegrid'
sns.set(style='whitegrid')

# Set the figure size
plt.figure(figsize=(14, 6))

# Merging the Play Store data with the User Reviews data
merged_df = pd.merge(df_apps, df_reviews, on='App', how='inner')

# Calculating the average sentiment polarity for each app
average_sentiment_polarity = merged_df.groupby('App')['Sentiment_Polarity'].mean()

# Merging the average sentiment polarity back to the original play store dataframe
play_store_sentiment_df = df_apps.join(average_sentiment_polarity, on='App')

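# Count the number of sentiments for each category
# (fill_value=0 keeps the stacked 'bottom' offsets free of NaN when a category lacks a sentiment)
grouped_df = merged_df.groupby(['Category', 'Sentiment']).size().unstack(fill_value=0)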

sns.barplot(data=grouped_df.reset_index(), x='Category', y='Positive', color='skyblue', label='Positive')
sns.barplot(data=grouped_df.reset_index(), x='Category', y='Negative', color='gray', bottom=grouped_df['Positive'], label='Negative')
sns.barplot(data=grouped_df.reset_index(), x='Category', y='Neutral', color='orange', bottom=grouped_df['Positive'] + grouped_df['Negative'], label='Neutral')

# Adding labels and title
plt.xlabel('Category',size=15)
plt.xticks(rotation=90)
plt.ylabel('Count of Sentiments',size=15)
plt.title('Sentiments vs Category',size=20)

# Adding legend
plt.legend(title='Sentiment')

# Displaying the plot
plt.show()


##### 1. Why did you pick the specific chart?

Stacked bar charts are useful when we want to compare the composition of different components that contribute to a whole. Each bar is divided into sub-bars, representing levels of the second categorical variable, offering a clear depiction of contributions to the total. So, I used it to display sentiment counts across categories.

##### 2. What is/are the insight(s) found from the chart?

Top 5 categories with the highest positive sentiments:

- GAME
- FAMILY
- HEALTH_AND_FITNESS
- TRAVEL_AND_LOCAL
- DATING

Top 5 categories with the highest negative sentiments:
- GAME
- FAMILY
- TRAVEL_AND_LOCAL
- DATING
- SPORTS

The stacked bar chart reveals a complex interplay of positive and negative sentiments across different categories. While people express positive sentiments towards categories like games, family, health and fitness, travel and local, and dating, there is also a notable presence of negative sentiment associated with most of these same categories.
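Because categories with many reviews tend to lead both lists, it can help to look at sentiment shares next to raw counts. The sketch below uses a small hypothetical `reviews` frame in place of `merged_df`; the rows are illustrative only.

```python
import pandas as pd

# Hypothetical review-level rows standing in for merged_df; values are illustrative only.
reviews = pd.DataFrame({
    "Category": ["GAME"] * 7 + ["SPORTS"] * 4,
    "Sentiment": [
        "Positive", "Positive", "Positive", "Negative", "Negative", "Negative", "Neutral",
        "Positive", "Negative", "Negative", "Neutral",
    ],
})

# Raw counts per category and sentiment (missing combinations become 0).
counts = pd.crosstab(reviews["Category"], reviews["Sentiment"])

# Row-normalised shares: each row sums to 1, so categories of different size are comparable.
shares = pd.crosstab(reviews["Category"], reviews["Sentiment"], normalize="index")

# Rank categories by negative count and by negative share.
top_by_count = counts["Negative"].nlargest(5)
top_by_share = shares["Negative"].nlargest(5)
print(top_by_count)
print(top_by_share)
```

In this toy data GAME has the most negative reviews, while SPORTS has the larger negative share. A ranking by share on the real data would show which categories are disproportionately negative rather than simply heavily reviewed.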

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:
1. **Mixed Sentiments Across Top Categories:**

  The presence of both positive and negative sentiments within the same categories (GAME, FAMILY, TRAVEL_AND_LOCAL, DATING) suggests a nuanced user experience. Users seem to have mixed feelings and varied interactions within these categories.

2. **Predominantly Positive Sentiment in HEALTH_AND_FITNESS:**

  HEALTH_AND_FITNESS stands out because it ranks among the top five categories for positive sentiment but not among the top five for negative sentiment, indicating a notably favorable perception among users. This category appears to provide positive experiences with comparatively few negative sentiments.

### Negative Growth Consideration:

1. **Addressing Negative Sentiments:**

  Negative sentiments within the same categories suggest potential challenges or dissatisfaction among users. Ignoring or neglecting these negative sentiments may lead to negative growth, impacting user retention and brand reputation.

2. **Mitigating Challenges in SPORTS:**

  Given the high negative sentiments in the SPORTS category, businesses should conduct a detailed analysis to understand and address the challenges. Proactive measures to improve user experiences in this category are essential to preventing negative growth.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Progression of update counts and the distribution of sentiment counts over time
# Group by 'Update Year' and count rows (merged_df has one row per app-review pair, so apps with many reviews are counted many times)
update_counts = merged_df.groupby("Update Year")["App"].count()

# Group by 'Update Year' and 'Sentiment' and count occurrences
sentiment_counts = merged_df.groupby(['Update Year', 'Sentiment']).size().unstack()

# Plotting
plt.figure(figsize=(8, 8))

# Plotting the number of updates received
plt.subplot(2, 1, 1)
plt.plot(update_counts.index, update_counts, label='Number of Updates', marker='o', color='gray')
plt.ylabel('Number of Updates', size=15)
plt.title('Number of Updates and Sentiments over Update Years', size=15)
plt.legend()

# Plotting sentiments
plt.subplot(2, 1, 2)
plt.plot(sentiment_counts.index, sentiment_counts['Positive'], label='Positive', marker='o')
plt.plot(sentiment_counts.index, sentiment_counts['Negative'], label='Negative', marker='o')
plt.plot(sentiment_counts.index, sentiment_counts['Neutral'], label='Neutral', marker='o')
plt.xlabel('Update Year', size=15)
plt.ylabel('Number of Sentiments', size=15)
plt.legend()

# Adjust layout for better readability
plt.tight_layout()

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

Line charts are an effective way to present data in a sequential manner, especially over time. In this case, I used line charts to illustrate the progression of update counts and the distribution of sentiment counts over time. The purpose was to investigate whether there is any correlation between these two sets of data.

##### 2. What is/are the insight(s) found from the chart?

- There is a general trend of increasing positive sentiments over time. This suggests that people are generally becoming more satisfied with the updates they are receiving.

- The number of updates is increasing over time. This suggests that the developers are releasing new updates more frequently.

- The number of negative sentiments is relatively stable. This suggests that people are generally not very unhappy with the updates they are receiving.
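To check whether the two series actually move together, the yearly counts can be correlated directly and the positive share computed per year. This is a rough sketch: the `yearly` frame is hypothetical and its numbers are illustrative, not taken from the dataset.

```python
import pandas as pd

# Hypothetical yearly totals; the numbers are illustrative only.
yearly = pd.DataFrame(
    {
        "updates": [10, 20, 40, 80],
        "Positive": [5, 12, 22, 50],
        "Negative": [3, 4, 9, 16],
    },
    index=pd.Index([2015, 2016, 2017, 2018], name="Update Year"),
)

# Pearson correlation of each sentiment count with the update count.
corr_with_updates = yearly[["Positive", "Negative"]].corrwith(yearly["updates"])

# Share of positive reviews among positive and negative ones, per year.
positive_share = yearly["Positive"] / (yearly["Positive"] + yearly["Negative"])

print(corr_with_updates)
print(positive_share)
```

Since both counts come from the same merged rows, they can rise together simply because more rows fall in later years. The per-year positive share is one way to separate growing volume from growing satisfaction.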

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:

- **Increasing Positive Sentiments Over Time:**
  Positive sentiments trending upward over time indicates growing satisfaction among users. This is a positive signal for the business, as satisfied customers are more likely to stay engaged, recommend the product to others, and contribute positively to the brand image.

- **Increasing Number of Updates:**
  The rising number of updates implies that developers are actively working on improving the product and addressing user needs. This can have a positive impact on user engagement and loyalty, as users appreciate continuous improvement and new features.

- **Stable Negative Sentiments:**
  The stability of negative sentiments suggests that, in general, users are not significantly dissatisfied with the updates. While some negative sentiments may be inevitable, the fact that they are stable indicates that any issues are not escalating. This stability can be seen as a positive aspect, as it implies that negative feedback is manageable and not worsening.

### Negative Growth Consideration:
- **Continuously Monitor Negative Feedback:**
  Regularly monitoring negative sentiments and feedback can help identify specific areas for improvement. Even if negative sentiments are stable, addressing specific concerns can lead to enhanced customer satisfaction.

- **Engage with Users:**
  Proactively engaging with users to understand their concerns and suggestions can provide valuable insights. Addressing user feedback and concerns demonstrates a commitment to customer satisfaction and can contribute to positive growth.

- **Competitor Analysis:**
  Assessing competitor products and user feedback can provide a comparative perspective. Understanding what competitors are doing well and areas where they face challenges can help in refining the business strategy.



#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Set up the plot
plt.figure(figsize=(10, 5))

# Create a bubble chart: rating on x, average sentiment polarity on y, installs as both bubble size and color
scatter = sns.scatterplot(x='Rating', y='Sentiment_Polarity', size='Installs', hue='Installs', data=play_store_sentiment_df, sizes=(50, 200), palette='viridis', edgecolor='white', legend=True)

# Customize the legend for bubble size (Installs)
scatter.legend(loc='upper left', title='Installs', bbox_to_anchor=(1, 1))

# Customize the plot
plt.title('Sentiment Polarity by Rating and Installs', size=18, fontweight='bold')
plt.xlabel('Rating', size=14)
plt.ylabel('Average Sentiment Polarity', size=14)

# Show gridlines for better readability
plt.grid(True, linestyle='--', alpha=0.7)

# Show the plot
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bubble chart is an extension of a scatter plot that uses the size of each data point to represent a third dimension of data, making it ideal for illustrating the relationships among three variables. So, I used it to explore the connections between average sentiment polarity, app rating, and the number of installs.

##### 2. What is/are the insight(s) found from the chart?


1. **Apps with higher ratings tend to have higher average sentiment polarity:** This correlation is logical, as users are more inclined to leave positive reviews for apps they enjoy using.

2. **Niche Apps with High Sentiment Polarity:**
Some apps exhibit high average sentiment polarity despite having relatively low install counts. This indicates the presence of niche apps that are deeply cherished by their user base, even though they may not enjoy widespread popularity.

3. **Popularity vs. Sentiment Polarity Discrepancy:**
The large bubbles representing the most installed apps generally show lower average sentiment polarity. This implies that widespread popularity doesn't consistently align with positive user sentiment.
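The visual impressions above can be backed with numbers. The sketch below computes a rank correlation and a per-tier mean on a hypothetical `apps` frame standing in for `play_store_sentiment_df`; the values and the install tier boundaries are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical app-level rows standing in for play_store_sentiment_df; values are illustrative.
apps = pd.DataFrame({
    "Rating": [3.5, 4.0, 4.2, 4.5, 4.8, 4.1],
    "Installs": [1_000, 10_000, 1_000_000, 50_000, 5_000, 100_000_000],
    "Sentiment_Polarity": [0.05, 0.15, 0.10, 0.30, 0.45, 0.08],
})

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
# Ranks are less sensitive to the heavy skew in install counts.
rating_vs_polarity = apps["Rating"].rank().corr(apps["Sentiment_Polarity"].rank())
installs_vs_polarity = apps["Installs"].rank().corr(apps["Sentiment_Polarity"].rank())

# Mean polarity per install tier, using fixed (assumed) bin edges.
tiers = pd.cut(
    apps["Installs"],
    bins=[0, 10_000, 1_000_000, float("inf")],
    labels=["<=10K", "10K-1M", ">1M"],
)
polarity_by_tier = apps.groupby(tiers, observed=True)["Sentiment_Polarity"].mean()

print(rating_vs_polarity, installs_vs_polarity)
print(polarity_by_tier)
```

On the real data, a clearly positive rating correlation together with a lower mean polarity in the top install tier would support insights 1 and 3.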

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:
- **Identifying popular apps with high sentiment:** Focusing marketing efforts towards popular apps with high sentiment polarity can further boost their popularity and attract new users.

- **Promoting niche apps with high sentiment:** Promoting niche apps with high sentiment polarity can help them reach their target audience and achieve sustainable growth.

- **Understanding user sentiment variations:** By analyzing how sentiment polarity varies across different app features and user segments, developers can identify areas for improvement and implement changes to increase user satisfaction.

- **Prioritizing feedback based on popularity and sentiment:** Insights from the chart can guide developers in prioritizing user feedback based on both app popularity and user sentiment, ensuring resources are allocated effectively.

### Negative Growth Consideration:
- **Focusing solely on popular apps:** Focusing solely on promoting popular apps might neglect niche apps with high user satisfaction, potentially missing out on valuable market segments.

- **Misinterpreting sentiment polarity:** Taking average sentiment polarity at face value might lead to overlooking important aspects of user feedback. A deeper analysis of individual reviews is necessary to understand specific user concerns.

- **Ignoring install count trends:** Ignoring the relationship between install count and sentiment polarity might result in neglecting potential issues with popular apps, leading to user churn and dissatisfaction.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Relationship between Sentiment Subjectivity and Sentiment Polarity
# Set the size of the figure
plt.figure(figsize=(12, 6))

# Create a scatter plot using seaborn
sns.scatterplot(data=merged_df, x='Sentiment_Subjectivity', y='Sentiment_Polarity', hue='Sentiment', palette='Set2')

# Set labels for the x and y axes
plt.xlabel('Sentiment Subjectivity', size=15)
plt.ylabel('Sentiment Polarity', size=15)

# Set the title of the plot
plt.title('Relationship between Sentiment Subjectivity and Sentiment Polarity', size=15)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Scatter plots are a powerful tool for exploratory data analysis, as they allow us to quickly identify patterns and correlations in the data. By using the hue parameter, we can color-code the data points based on a third variable, making it easier to identify patterns and correlations between the variables. So, I used a scatter plot to identify the patterns between Sentiment Subjectivity and Sentiment Polarity with the Sentiment variable represented by hue.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot of Sentiment Polarity and Sentiment Subjectivity shows a moderate positive correlation between the two variables. This means that, in general, as sentiment polarity increases, sentiment subjectivity tends to increase as well. However, the relationship is not very strong. This suggests that there is a tendency for people to express their opinions more strongly when they are feeling positive than when they are feeling negative.
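One way to probe this reading is to compare the correlation of subjectivity with signed polarity against its correlation with the magnitude of polarity, and to average subjectivity per sentiment label. The following sketch uses a hypothetical `reviews` frame with illustrative values only.

```python
import pandas as pd

# Hypothetical review-level rows; values are illustrative only.
reviews = pd.DataFrame({
    "Sentiment": ["Positive", "Positive", "Negative", "Negative", "Neutral"],
    "Sentiment_Polarity": [0.8, 0.4, -0.6, -0.2, 0.0],
    "Sentiment_Subjectivity": [0.9, 0.5, 0.7, 0.3, 0.0],
})

# Pearson correlation of signed polarity with subjectivity.
signed_corr = reviews["Sentiment_Polarity"].corr(reviews["Sentiment_Subjectivity"])

# Pearson correlation of polarity magnitude (positive or negative) with subjectivity.
magnitude_corr = reviews["Sentiment_Polarity"].abs().corr(reviews["Sentiment_Subjectivity"])

# Average subjectivity per sentiment label.
subjectivity_by_label = reviews.groupby("Sentiment")["Sentiment_Subjectivity"].mean()

print(signed_corr, magnitude_corr)
print(subjectivity_by_label)
```

If the magnitude correlation is much higher than the signed one on the real data, subjectivity is tied to how strongly an opinion is expressed in either direction, not only to positive feeling.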

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:
- **Targeted Marketing:** Understanding that positive sentiments are linked with increased subjectivity, businesses can tailor marketing messages and campaigns to resonate with the emotional and expressive aspects of their satisfied customers.

- **Enhanced Customer Engagement:** Recognizing the correlation, businesses can focus on creating platforms for customers to share their positive experiences in a more detailed and expressive manner. This could include encouraging reviews, testimonials, or social media interactions that capture the enthusiasm of satisfied customers.

- **Product/Service Improvements:** Analyzing the correlation might lead to insights into what aspects of products or services generate strong positive sentiments. Businesses can use this information to prioritize and refine features that contribute to customer satisfaction.

### Negative Growth Consideration:
- **Overreliance on Subjective Opinions:** Businesses should not solely rely on subjective opinions to drive decisions, as they may not reflect broader customer sentiment or objective performance metrics.

- **Misinterpretation of Feedback:** Strong expressions of negative sentiment may not always be indicative of a widespread issue. Careful analysis and context-specific interpretation are crucial.

- **Potential for Bias:** Subjective opinions may be influenced by individual biases, personal experiences, and external factors, potentially leading to inaccurate assessments of overall customer sentiment.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14 - Correlation Heatmap visualization code
# Relationships among Rating, Reviews, Installs, Price, Sentiment_Polarity, and Sentiment_Subjectivity
# Selecting numerical columns from the merged dataframe
numerical_columns = merged_df[['Rating', 'Reviews', 'Installs', 'Price', 'Sentiment_Polarity', 'Sentiment_Subjectivity']]

# Create a correlation matrix
correlation_matrix = numerical_columns.corr()

# Create a heatmap
plt.figure(figsize=(10, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='rocket', fmt=".2f")
plt.title('Correlation Heatmap',size=20)
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap is a compact way to identify relationships between pairs of variables in a dataset. The color-coding shows at a glance whether variables are positively correlated (they move together) or negatively correlated (they move in opposite directions). Correlation ranges from -1 to 1. With the sequential 'rocket' palette used here, lighter cells mark values closer to +1 and darker cells mark lower values, including those near zero and negative ones, so the annotated numbers are needed to tell a weak correlation from a negative one. I used a correlation heatmap to find the relationships among Rating, Reviews, Installs, Price, Sentiment_Polarity, and Sentiment_Subjectivity.

##### 2. What is/are the insight(s) found from the chart?

- Rating has a positive correlation with Reviews, Installs, and Sentiment_Polarity. This means that apps with higher ratings tend to have more reviews, more installs, and more positive sentiment in their reviews. The correlation with Sentiment_Polarity is weak (0.12).

- Rating has a weak negative correlation with Price. This means that apps with higher ratings tend to be slightly cheaper.

- Reviews has a strong positive correlation with Installs (0.61). This means that apps with more reviews tend to have more installs.

- Reviews has a weak negative correlation with Price (-0.04). This means that apps with more reviews tend to be slightly cheaper.

- Installs has a weak negative correlation with Price (-0.08). This means that apps with more installs tend to be slightly cheaper.

- Sentiment_Polarity has a moderate positive correlation with Sentiment_Subjectivity. This means that apps with more positive sentiment in their reviews tend to have slightly more subjective reviews.
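Two details can influence these coefficients: the merged frame repeats each app's Rating, Reviews, and Installs once per review, and install and review counts are heavily skewed. The sketch below, on a hypothetical `merged` frame with illustrative values, shows one way to correlate app-level columns on one row per app and to compare Pearson with Spearman.

```python
import pandas as pd

# Hypothetical merged rows: app-level columns repeat once per review. Values are illustrative.
merged = pd.DataFrame({
    "App": ["A", "A", "A", "B", "C", "C"],
    "Rating": [4.5, 4.5, 4.5, 3.9, 4.1, 4.1],
    "Reviews": [1_000_000, 1_000_000, 1_000_000, 200, 5_000, 5_000],
    "Installs": [50_000_000, 50_000_000, 50_000_000, 10_000, 100_000, 100_000],
})

app_cols = ["Rating", "Reviews", "Installs"]

# Keep one row per app so heavily reviewed apps are not over-weighted.
per_app = merged.drop_duplicates(subset="App")[app_cols]

# Pearson measures linear association; Spearman works on ranks and is robust to skew.
pearson = per_app.corr(method="pearson")
spearman = per_app.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

Review-level columns such as Sentiment_Polarity would need to be averaged per app before joining them into such a table.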

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:
1. **Correlation between Installs and Reviews (0.61)**:
   - There is a strong positive correlation between the number of installs and the number of reviews. This suggests that more popular apps (with high install counts) tend to generate more user engagement in the form of reviews.
   - **Business Impact**: Focusing on strategies that increase installs (e.g., marketing campaigns, partnerships) could amplify user feedback through reviews, which in turn helps improve the app’s visibility and credibility.

2. **Positive correlation between Rating and Sentiment Polarity (0.12)**:
   - Apps with higher ratings tend to have slightly more positive sentiment in the reviews. Even though this correlation is weak, improving app ratings can help generate more positive reviews, reinforcing a good user experience.
   - **Business Impact**: Encouraging users to rate the app positively through incentives, or improving app quality, could result in higher ratings, which marginally improves sentiment.

### Negative Insights and Potential for Negative Growth:
1. **Weak correlation between Price and other factors**:
   - Price shows very weak correlations with key factors like Reviews (-0.04), Installs (-0.08), and Sentiment Polarity (0.02). This suggests that price may not be a strong driving factor for user engagement or growth.
   - **Negative Growth Insight**: Pricing strategies may not significantly impact installs or user sentiment. Therefore, focusing too much on price adjustments may not lead to the expected business growth. A more value-driven approach (e.g., adding features or improving user experience) might be better for increasing installs and reviews.

2. **Sentiment Polarity vs. Reviews (-0.06) and Installs (-0.07)**:
   - The correlations between sentiment polarity and the number of reviews/installs are negative, though weak. This suggests that the number of installs or reviews does not necessarily improve the sentiment in user feedback.
   - **Negative Growth Insight**: While higher installs and reviews can increase visibility, they do not guarantee improved sentiment. Poor app experiences (despite high install numbers) could lead to negative reviews, potentially harming long-term growth. This highlights the need to focus on user satisfaction and app quality to ensure that higher installs translate into positive sentiment.


#### Chart - 15 - Pair Plot

In [None]:
# Chart - 15 - Pair Plot visualization code
# Correlations among Rating, Installs, Reviews and Price.
# Selecting the numerical columns plus 'Type' (used for hue) from the merged dataframe
included_columns = merged_df[['Rating', 'Installs', 'Reviews', 'Price', 'Type']].copy()  # copy avoids SettingWithCopyWarning

# Log-transform 'Installs' and 'Reviews' (base 10 for both so the two axes are comparable)
included_columns['Installs(Log)'] = np.log10(included_columns['Installs'])
included_columns['Reviews(Log)'] = np.log10(included_columns['Reviews'])

# Selecting columns for pair plot
selected_columns = included_columns[['Rating', 'Installs(Log)', 'Reviews(Log)', 'Price', 'Type']]

# Create a pair plot
p = sns.pairplot(selected_columns, hue='Type', markers=["o", "s"], palette={"Free": "blue", "Paid": "orange"})
p.fig.suptitle("Pair Plot - Rating, Installs, Reviews, Price", x=0.5, y=1.02, fontsize=20)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

Pair plots are a powerful tool for visualizing the relationships between multiple variables in a dataset. Each subplot in the grid represents a pair of variables, making it easy to identify patterns, trends, and correlations. So, I used a pair plot to explore the patterns and correlations between Rating, Installs, Reviews and Price.
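The pair plot relies on log-transformed Installs and Reviews, and a logarithm is undefined at zero. The sketch below shows two common ways to guard against zeros, using a hypothetical `installs` series; which option suits the real data is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical install counts including a zero; values are illustrative only.
installs = pd.Series([0, 10, 1_000, 1_000_000], name="Installs")

# Option 1: mask non-positive values to NaN before log10, so they drop out of the plot.
log_masked = np.log10(installs.where(installs > 0))

# Option 2: log10(1 + x) keeps zeros in the data and maps 0 to 0.
log_shifted = np.log10(installs + 1)

print(log_masked.tolist())
print(log_shifted.round(3).tolist())
```

Option 1 keeps the axis in plain powers of ten, while option 2 retains every row at the cost of a slightly shifted scale for small values.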

##### 2. What is/are the insight(s) found from the chart?

- **Rating and Installs:** There is a positive correlation between rating and installs, meaning that apps with higher ratings tend to have more installs. This is likely because users are more likely to install apps that have been positively reviewed by other users.

- **Rating and Reviews:** There is also a positive correlation between rating and reviews, meaning that apps with higher ratings tend to have more reviews. This is likely because users are more likely to write reviews for apps that they enjoy using.

- **Rating and Price:** There is a weak negative correlation between rating and price, meaning that apps with higher ratings tend to be slightly cheaper. One possible explanation is that free and low-priced apps reach more users and collect more ratings, although a correlation this weak does not establish a cause.

- **Installs and Reviews:** There is a strong positive correlation between installs and reviews, meaning that apps with more installs tend to have more reviews. This is likely because users are more likely to write reviews for apps that they have used extensively.

- **Installs and Price:** There is a weak negative correlation between installs and price, meaning that apps with more installs tend to be slightly cheaper. This is likely because developers of popular apps are able to charge lower prices due to the high volume of installs.



## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain Briefly.

### Actionable Recommendations:
- **Strategic Development in Popular Genres:**
  Acknowledge the popularity of certain genres, such as Tools, Action, Photography, and Entertainment.
  Consider investing in or refining apps within these genres to align with user preferences and maximize engagement.

- **Emphasize Free Apps:**
  Free apps overwhelmingly dominate the market, indicating the significance of providing an engaging free version to attract users.
  Given that the majority of apps are free, focus on monetization strategies that complement the free model, such as in-app purchases, ads, or premium features.

- **Optimize App Size:**
  Recognize the preference for smaller apps, as indicated by the popularity of the below 20 and 20-40 MB size groups.
  Ensure that new app developments and updates prioritize efficiency and minimal storage requirements to align with user preferences for smaller-sized apps.

- **Tailor Content Ratings:**
  Understand the user preference for Everyone and Teen categories and ensure new apps align with these preferences.
  Be mindful of the limited interest in Mature and Adults-only categories, adjusting content accordingly.

- **Strategic Revenue Generation:**
  Consider app development or improvements in categories that generate higher revenue, such as Lifestyle, Finance, and Weather.
  Evaluate user preferences within lower-revenue categories to identify opportunities for enhancement.

- **Compatibility with Latest Android Versions:**
  Given the correlation between higher revenue and compatibility with newer Android versions, prioritize app development and updates for the latest versions of the Android operating system.

- **User Engagement in Popular Categories:**
  Recognize the positive sentiments associated with popular categories like Games, Family, Health and Fitness, Travel and Local, and Dating.
  Strategically engage users through marketing, promotions, and feature enhancements in these categories.

- **Continuous Improvement:**
  Monitor the progression of update counts and user sentiments over time.
  Respond to user feedback with timely updates to demonstrate a commitment to app improvement.
  
- **Focus on Positive User Sentiment:**
  Identify the characteristics of apps with consistently positive sentiment and leverage those aspects in future app development.
  Encourage and amplify positive user experiences through marketing and feature enhancements.

- **Address Negative Feedback:**
  Investigate apps with negative sentiment to pinpoint specific issues causing dissatisfaction.
  Prioritize improvements in areas highlighted by negative sentiment to enhance user satisfaction and overall app performance.

# **Conclusion**

This project has successfully analyzed the Play Store app dataset using Python, uncovering valuable insights into key factors for app engagement and success. The data visualizations and interpretations provide a comprehensive understanding of user sentiment, app ratings, genre preferences, content suitability, and the impact of updates.

Based on these insights, I have crafted actionable recommendations for the client to optimize app performance and achieve their business objectives. I advise them to focus on fostering positive sentiment, addressing user concerns promptly, catering to user preferences like smaller apps and frequent updates, targeting specific genres and content ratings strategically, and building loyal followings in niche markets.

By embracing a data-driven approach and continuously adapting to user preferences, the client can ensure long-term success in the competitive Android app market.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***

### ***Exploratory Data Analysis (EDA) Done By: Rudresh Mishra***
### ***Email: mishrasm64@gmail.com***
### ***Technical Document: https://docs.google.com/document/d/19dzbJeSd_Jcne1NpS3Zwa98n4VyLI4dm/edit?usp=drive_link&ouid=106955276825311176485&rtpof=true&sd=true***