<a href="https://colab.research.google.com/github/B-7792/Exploratory-Data-Analysis/blob/main/Exploratory_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -
# **Exploratory Data Analysis of Google Play Store Apps**


##### **Project Type**    - Data Analysis / Exploratory Data Analysis (EDA)
##### **Contribution**    - Individual
##### **Team Member 1** - *Bhushan Mohod*


# **Project Summary -**
* Google Play Store Apps Data: Contains various attributes for each app such as category, rating, size, installs, and price.

* Google Play Store User Reviews Data: Contains user reviews with attributes such as review ID, username, content, score, thumbs up count, and sentiment.

**Understanding Your Dataset**

To gain insight into a dataset, analyze its structure and content systematically. The key steps are:

1. **Dataset Columns:** Identify the columns present in the dataset. Columns are its features or attributes and can hold numerical values, categorical labels, or text; together, their names form the structure of the dataset.
2. **Dataset Describe:** Use the `describe()` method to generate a statistical summary. By default it covers the numerical columns, showing their distribution and range; with `include='all'` it also reports counts and the most frequent value for categorical columns.
3. **Variables Description:** Record what each column represents: its name, its data type, and a description based on the dataset's context.
4. **Check Unique Values:** Examine the unique values in each column. This shows the variability of the data, helps identify categorical columns, and can expose potential anomalies.

Together, these steps give a solid grasp of the dataset's content and structure and lay the groundwork for effective data cleaning, feature engineering, and analysis.
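
The four steps can be gathered into a single helper. The sketch below is only an illustration: `summarize_dataset` and the `toy` frame are hypothetical names and data, not part of the project's datasets.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame) -> dict:
    """Collect columns, summary statistics, dtypes and unique counts in one place."""
    return {
        "columns": list(df.columns),                  # step 1: dataset columns
        "describe": df.describe(include="all"),       # step 2: statistical summary
        "dtypes": df.dtypes.astype(str).to_dict(),    # step 3: input for variable descriptions
        "n_unique": df.nunique().to_dict(),           # step 4: unique values per column
    }


# Hypothetical toy data standing in for the Play Store table
toy = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME"],
    "Rating": [4.5, 3.9, 4.1],
})

summary = summarize_dataset(toy)
print(summary["columns"])
print(summary["n_unique"])
```

Returning a dictionary rather than printing inside the function keeps the results available for later cells.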

# **GitHub Link -**


https://github.com/B-7792/Exploratory-Data-Analysis


git clone https://github.com/B-7792/Exploratory-Data-Analysis.git





# **Problem Statement**



This project aims to give app developers the insights they need to succeed in the competitive landscape of the Google Play Store. Through an exploratory data analysis of app attributes and user reviews, it seeks to uncover the key factors that drive app engagement and success, so that developers can optimize their apps and achieve greater user satisfaction.

#### **Define Your Business Objective?**

Perform an exploratory data analysis (EDA) of the Google Play Store apps data and user reviews to uncover the key factors that influence app engagement and success. The goal is to provide actionable insights that app developers can use to optimize their apps for better performance and user satisfaction.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





**Guidelines recap:**

* Well-structured, formatted, and commented code is required.
* Exception handling, production-grade code, and deployment-ready code will be a plus.
* The notebook should be executable in one go without any errors.
* Each and every piece of logic should have proper comments.

**Charts and Insights:** each chart must address:

* Why the specific chart was chosen.
* Insights gained from the chart.
* Potential business impact of the insights.
* Any insights that may indicate negative growth, with justifications.

**Analysis Plan:**

Univariate Analysis (U):

* App Ratings: Distribution of app ratings.
* App Sizes: Distribution of app sizes.
* Number of Installs: Distribution of installs.
* App Prices: Distribution of app prices.

Bivariate Analysis (B):

* Ratings by Category: Average rating by app category.
* Installs vs. Ratings: Relationship between number of installs and ratings.
* Price vs. Ratings: Relationship between app price and ratings.
* Sentiment by Ratings: Sentiment polarity versus app ratings.

Multivariate Analysis (M):

* Ratings by Category and Sentiment: How category and sentiment together affect app ratings.
* Installs, Size, and Ratings: Relationship between app size, installs, and ratings.
* Price, Category, and Ratings: Impact of price and category on app ratings.

# Data Preprocessing:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
from google.colab import files

uploaded = files.upload()
import pandas as pd

# Assume the file name is 'Play Store Data.csv'
df = pd.read_csv('Play Store Data.csv')
print(df.head())


In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Upload the CSV file
from google.colab import files
uploaded = files.upload()

# Step 2: Verify the uploaded file
!ls

# Step 3: Read the CSV file
reviews = pd.read_csv('User Reviews.csv')

# Display the first few rows of the reviews DataFrame
reviews.head()


# Data Cleaning:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load datasets
apps = pd.read_csv('Play Store Data.csv')
reviews = pd.read_csv('User Reviews.csv')

# Data cleaning for apps DataFrame
apps.dropna(inplace=True)
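# regex=False treats ',', '+' and '$' as literal characters, not regex metacharacters
apps['Installs'] = apps['Installs'].str.replace(',', '', regex=False).str.replace('+', '', regex=False).astype(int)
apps['Price'] = apps['Price'].str.replace('$', '', regex=False).astype(float)
# 'Reviews' may be read as text; convert it so later plots and sorts treat it as a number
apps['Reviews'] = pd.to_numeric(apps['Reviews'], errors='coerce')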
apps['App'] = apps['App'].str.strip().str.lower()
apps = apps.drop_duplicates(subset=['App'])

# Data cleaning for reviews DataFrame
reviews.dropna(subset=['App'], inplace=True)
reviews['App'] = reviews['App'].str.strip().str.lower()
# Drop only exact duplicate rows; keeping one row per app would leave a single review to average
reviews = reviews.drop_duplicates()
reviews['Sentiment_Polarity'] = pd.to_numeric(reviews['Sentiment_Polarity'], errors='coerce')

# Check the alignment of 'App' column
common_apps = set(apps['App']).intersection(set(reviews['App']))
unique_apps_in_apps = set(apps['App']).difference(set(reviews['App']))
unique_apps_in_reviews = set(reviews['App']).difference(set(apps['App']))

print(f"Number of common apps: {len(common_apps)}")
print(f"Number of unique apps in apps dataset: {len(unique_apps_in_apps)}")
print(f"Number of unique apps in reviews dataset: {len(unique_apps_in_reviews)}")

# Sentiment Analysis and Merge
avg_sentiment = reviews.groupby('App')['Sentiment_Polarity'].mean().reset_index()
avg_sentiment = avg_sentiment.merge(apps[['App', 'Category']], on='App', how='inner')

# Visualization
plt.figure(figsize=(12, 8))
sns.boxplot(x='Sentiment_Polarity', y='Category', data=avg_sentiment)
plt.title('Average Sentiment Polarity by App Category')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Category')
plt.show()
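
The string clean-up above can also be written as small reusable helpers that tolerate unexpected values instead of raising. This is a sketch under assumptions: `clean_installs` and `clean_price` are hypothetical helper names, and the sample values are made up for illustration.

```python
import pandas as pd


def clean_installs(series: pd.Series) -> pd.Series:
    """Turn strings such as '1,000+' into integers; unparseable values become <NA>."""
    stripped = (
        series.astype(str)
        .str.replace(",", "", regex=False)   # literal comma
        .str.replace("+", "", regex=False)   # literal plus sign, not a regex quantifier
    )
    return pd.to_numeric(stripped, errors="coerce").astype("Int64")


def clean_price(series: pd.Series) -> pd.Series:
    """Turn strings such as '$4.99' into floats; unparseable values become NaN."""
    stripped = series.astype(str).str.replace("$", "", regex=False)  # literal dollar sign
    return pd.to_numeric(stripped, errors="coerce")


# Hypothetical sample values, including one malformed entry per column
installs = clean_installs(pd.Series(["1,000+", "500,000+", "Free"]))
prices = clean_price(pd.Series(["$4.99", "0", "Everyone"]))
print(installs)
print(prices)
```

Coercing bad values to missing rather than failing keeps the notebook runnable end to end; the resulting missing values can then be inspected or dropped explicitly.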


# Data Distribution:

In [None]:
# Display basic information about the reviews DataFrame
reviews.info()

# Check for missing values
reviews.isnull().sum()

# Drop rows with missing values (if necessary)
reviews.dropna(inplace=True)

# Sentiment distribution of user reviews
plt.figure(figsize=(10, 6))
sns.countplot(x='Sentiment', data=reviews)
plt.title('Sentiment Distribution of User Reviews')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

# Example: Average sentiment polarity by app category
avg_sentiment = reviews.groupby('App')['Sentiment_Polarity'].mean().reset_index()
avg_sentiment = avg_sentiment.merge(apps[['App', 'Category']], on='App')

plt.figure(figsize=(12, 8))
sns.boxplot(x='Sentiment_Polarity', y='Category', data=avg_sentiment)
plt.title('Average Sentiment Polarity by App Category')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Category')
plt.show()


# Top 10 Categories by Number of Apps

In [None]:
plt.figure(figsize=(12, 8))
top_categories = apps['Category'].value_counts().head(10)
sns.barplot(x=top_categories.values, y=top_categories.index, palette='viridis')
plt.title('Top 10 Categories by Number of Apps')
plt.xlabel('Number of Apps')
plt.ylabel('Category')
plt.show()


# Univariate Analysis (U):
* Distribution of Paid App Prices:

In [None]:
plt.figure(figsize=(12, 8))
sns.histplot(apps[apps['Price'] > 0]['Price'], bins=30, kde=True)
plt.title('Distribution of Paid App Prices')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.show()


# Average Rating by App Type (Free vs. Paid):

In [None]:
plt.figure(figsize=(10, 6))
apps['Type'] = apps['Type'].fillna('Free')
sns.boxplot(x='Type', y='Rating', data=apps)
plt.title('Average Rating by App Type (Free vs. Paid)')
plt.xlabel('Type')
plt.ylabel('Rating')
plt.show()


# Number of Reviews vs. Rating:

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Reviews', y='Rating', data=apps, alpha=0.5)
plt.title('Number of Reviews vs. Rating')
plt.xlabel('Number of Reviews')
plt.ylabel('Rating')
plt.xscale('log')
plt.show()


# Average App Size by Category:

In [None]:
# Preprocess 'Size' column
apps['Size'] = apps['Size'].replace('Varies with device', pd.NA)

# Handle 'M' and 'k' replacements separately
apps['Size'] = apps['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else str(x))
apps['Size'] = apps['Size'].apply(lambda x: str(float(str(x).replace('k', '')) / 1000) if 'k' in str(x) else str(x))

# Convert to numeric, coercing errors to NaN
apps['Size'] = pd.to_numeric(apps['Size'], errors='coerce')

# Drop rows with NaN 'Size'
apps.dropna(subset=['Size'], inplace=True)

# Calculate average size by category
avg_size_by_category = apps.groupby('Category')['Size'].mean().sort_values(ascending=False)

# Plot the average size by category
plt.figure(figsize=(12, 8))
sns.barplot(x=avg_size_by_category, y=avg_size_by_category.index, palette='coolwarm')
plt.title('Average App Size by Category')
plt.xlabel('Average Size (MB)')
plt.ylabel('Category')
plt.show()
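
The size conversion can be expressed as one function that handles each case explicitly. The sketch below assumes the same conventions as the cell above (an `M` suffix for megabytes, a `k` suffix divided by 1000, anything else treated as missing); `parse_size_mb` is a hypothetical helper name.

```python
import math

import pandas as pd


def parse_size_mb(value) -> float:
    """Convert '19M' to 19.0 and '512k' to 0.512; anything unparseable becomes NaN."""
    text = str(value).strip()
    try:
        if text.endswith("M"):
            return float(text[:-1])
        if text.endswith("k"):
            return float(text[:-1]) / 1000  # same k-to-MB factor as the cell above
        return float(text)  # already a plain number
    except ValueError:
        return float("nan")  # e.g. 'Varies with device'


# Hypothetical sample values
sizes = pd.Series(["19M", "512k", "Varies with device"]).map(parse_size_mb)
print(sizes)
```

Because plain numbers pass through unchanged, running the function a second time on an already converted column leaves the values as they are.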


# Total Installs by Category:

In [None]:
plt.figure(figsize=(12, 8))
total_installs_by_category = apps.groupby('Category')['Installs'].sum().sort_values(ascending=False)
sns.barplot(x=total_installs_by_category, y=total_installs_by_category.index, palette='inferno')
plt.title('Total Installs by Category')
plt.xlabel('Total Installs')
plt.ylabel('Category')
plt.show()


# Rating Distribution by Content Rating:

In [None]:
plt.figure(figsize=(12, 8))
sns.boxplot(x='Rating', y='Content Rating', data=apps)
plt.title('Rating Distribution by Content Rating')
plt.xlabel('Rating')
plt.ylabel('Content Rating')
plt.show()


# Correlation Matrix:

In [None]:
# Preprocess 'Size' column
apps['Size'] = apps['Size'].replace('Varies with device', pd.NA)
apps['Size'] = apps['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else str(x))
apps['Size'] = apps['Size'].apply(lambda x: str(float(str(x).replace('k', '')) / 1000) if 'k' in str(x) else str(x))
apps['Size'] = pd.to_numeric(apps['Size'], errors='coerce')

# Drop rows with NaN 'Size'
apps.dropna(subset=['Size'], inplace=True)

# Select only numeric columns for correlation matrix
numeric_cols = apps.select_dtypes(include=[float, int])

# Compute correlation matrix
correlation_matrix = numeric_cols.corr()

# Plot the correlation matrix
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()


# Top Apps by Number of Reviews:

In [None]:
plt.figure(figsize=(12, 8))
top_apps_by_reviews = apps.sort_values(by='Reviews', ascending=False).head(10)
sns.barplot(x='Reviews', y='App', data=top_apps_by_reviews, palette='Blues_r')
plt.title('Top 10 Apps by Number of Reviews')
plt.xlabel('Number of Reviews')
plt.ylabel('App')
plt.show()


# App Categories with Highest Average Sentiment Polarity:

In [None]:
# Calculate average sentiment polarity by app category
avg_sentiment_by_category = avg_sentiment.groupby('Category')['Sentiment_Polarity'].mean().sort_values(ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x=avg_sentiment_by_category, y=avg_sentiment_by_category.index, palette='magma')
plt.title('App Categories with Highest Average Sentiment Polarity')
plt.xlabel('Average Sentiment Polarity')
plt.ylabel('Category')
plt.show()


# Sentiment Distribution of User Review:

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='Sentiment', data=reviews)
plt.title('Sentiment Distribution of User Reviews')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()


# Pairplot of Selected Features:

In [None]:
# pairplot builds its own figure, so set a figure-level title instead of an axes title
sns.pairplot(apps[['Rating', 'Reviews', 'Installs', 'Price']])
plt.suptitle('Pairplot of Selected Features', y=1.02)
plt.show()


# Multivariate Analysis
*Correlation Heatmap*

In [None]:
print(apps.dtypes)


# Selecting only numeric columns
apps_numeric = apps.select_dtypes(include=['float64', 'int64'])

# Check for missing values
print(apps_numeric.isnull().sum())

plt.figure(figsize=(10, 6))
sns.heatmap(apps_numeric.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()



# Bivariate Analysis (B):
* Ratings by Category:
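
A minimal sketch of this comparison, assuming a cleaned frame with `Category` and `Rating` columns like `apps` above. The `apps_demo` data is hypothetical and only illustrates the grouping; a bar plot of `mean_rating` would be the natural chart on the real data.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned apps DataFrame
apps_demo = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "FAMILY"],
    "Rating": [4.5, 4.1, 3.8, None, 4.0],
})

# Average rating per category, with the number of rated apps behind each average
rating_by_category = (
    apps_demo.dropna(subset=["Rating"])
    .groupby("Category")["Rating"]
    .agg(mean_rating="mean", n_apps="count")
    .sort_values("mean_rating", ascending=False)
)
print(rating_by_category)
```

Keeping `n_apps` next to the mean makes it visible when a category's average rests on very few apps.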

# Multivariate Analysis (M):
* Ratings by Category and Sentiment:
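
One possible way to look at category and sentiment together is to average review polarity per app, join it to the app table, and tabulate the mean rating by category and sentiment group. The frames and the `Sentiment_Group` split at zero below are assumptions made for illustration, not the project's data or a fixed method.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned apps and reviews DataFrames
apps_demo = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Rating": [4.6, 3.9, 4.2, 3.5],
})
reviews_demo = pd.DataFrame({
    "App": ["a", "a", "b", "c", "d", "d"],
    "Sentiment_Polarity": [0.8, 0.4, -0.3, 0.5, -0.6, -0.2],
})

# Average polarity per app, then attach category and rating
avg_polarity = reviews_demo.groupby("App", as_index=False)["Sentiment_Polarity"].mean()
merged = apps_demo.merge(avg_polarity, on="App", how="inner")

# Assumed split: average polarity above zero counts as positive
merged["Sentiment_Group"] = merged["Sentiment_Polarity"].map(
    lambda p: "Positive" if p > 0 else "Non-positive"
)

# Mean rating for every category and sentiment group combination
rating_table = merged.pivot_table(
    index="Category", columns="Sentiment_Group", values="Rating", aggfunc="mean"
)
print(rating_table)
```

The resulting table can be passed to `sns.heatmap` to show all three variables in one chart.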

In [None]:
!pip install vaderSentiment


In [None]:
import os

# Print current working directory
print(os.getcwd())

In [None]:
from google.colab import files
import pandas as pd
import io

# Upload file
uploaded = files.upload()

# Load the uploaded file into a DataFrame directly from the uploaded bytes
for filename in uploaded.keys():
    reviews_df = pd.read_csv(io.BytesIO(uploaded[filename]))
    print(f"Loaded {filename} into DataFrame.")

print(reviews_df.head())



In [None]:
# Print the column names
print(reviews_df.columns)


# Exception Handling and Production Grade Code:

In [None]:
try:
    # Example of data loading and preprocessing with exception handling
    apps_df = pd.read_csv('Play Store Data.csv')
    reviews_df = pd.read_csv('User Reviews.csv')

    # Further processing steps...

except FileNotFoundError as e:
    print(f"Error: {e}. Please ensure the dataset files are in the correct directory.")
except pd.errors.ParserError as e:
    print(f"Error: {e}. There was an issue parsing the CSV files.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
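
The same pattern can be wrapped in a function so that every dataset is loaded the same way and a failed load is easy to detect. This is a sketch only: `load_csv` is a hypothetical helper, and the temporary file exists just to make the example self-contained.

```python
import os
import tempfile
from typing import Optional

import pandas as pd


def load_csv(path: str) -> Optional[pd.DataFrame]:
    """Read a CSV file and return a DataFrame, or None when loading fails."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"Error: '{path}' was not found. Check the file location.")
    except pd.errors.EmptyDataError:
        print(f"Error: '{path}' is empty.")
    except pd.errors.ParserError as exc:
        print(f"Error: could not parse '{path}': {exc}")
    return None


# Demonstration with a temporary file and a path that does not exist
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    pd.DataFrame({"App": ["a"], "Rating": [4.2]}).to_csv(demo_path, index=False)
    loaded = load_csv(demo_path)
    missing = load_csv(os.path.join(tmp, "absent.csv"))

print(loaded)
print(missing)
```

Returning `None` lets later cells check `if loaded is not None:` before continuing, so a missing file does not cascade into unrelated errors.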


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import data manipulation and analysis libraries
import pandas as pd
import numpy as np

# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Import machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Import sentiment analysis library (if needed)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


In [None]:
!pip install pandas numpy matplotlib seaborn scikit-learn vaderSentiment


### Dataset Loading

In [None]:
from google.colab import files
import pandas as pd
import io

# Upload the file
uploaded = files.upload()

# Load the uploaded file into a DataFrame
for filename in uploaded.keys():
    apps_df = pd.read_csv(io.BytesIO(uploaded[filename]))
    print(f"Loaded {filename} into DataFrame.")


### Dataset First View

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
!ls "/content/drive/My Drive/User Reviews.csv"



In [None]:
# Load your dataset
import pandas as pd

# Correct path with slashes
df = pd.read_csv('/content/drive/My Drive/User Reviews.csv')


# Display the first 5 rows of the dataset
print(df.head())


### Dataset Rows & Columns count

In [None]:
# Get the number of rows and columns
rows, columns = df.shape
print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")


### Dataset Information

In [None]:
# Display detailed information about the dataset
# info() prints its report itself and returns None, so no print() is needed
df.info()


#### Duplicate Values

In [None]:
# Check for duplicate rows
duplicates = df.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")

# Display duplicate rows
print("Duplicate rows:")
print(df[duplicates])

# Remove duplicate rows
df_cleaned = df.drop_duplicates()
print(f"Number of rows after removing duplicates: {df_cleaned.shape[0]}")


#### Missing Values/Null Values

In [None]:
!pip install missingno


In [None]:
import missingno as msno
import pandas as pd

# Load your dataset

df = pd.read_csv('/content/drive/My Drive/User Reviews.csv')
# Visualize missing values
msno.matrix(df)
msno.heatmap(df)
msno.bar(df)


In [None]:
!pip install seaborn


In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Generate a mask for missing values
missing_mask = df.isnull()

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(missing_mask, cbar=False, cmap='viridis', yticklabels=False, xticklabels=df.columns)
plt.title('Missing Values Heatmap')
plt.show()


In [None]:
# Count missing values per column
missing_count = df.isnull().sum()

# Plot the bar plot
plt.figure(figsize=(12, 6))
missing_count.plot(kind='bar', color='skyblue')
plt.title('Missing Values Count by Column')
plt.xlabel('Column')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
plt.show()


### What did you know about your dataset?

Based on the analysis steps above, here is what each one tells you about the dataset:

1. First View
   * Content: The first few rows give a preview of the data structure and the type of information each column holds (e.g., text, numbers).
   * Column Names: You learn the names of the columns and their possible values.
2. Rows and Columns Count
   * Size: The total number of rows (records) and columns (features) shows the dataset's scale and complexity.
3. Dataset Information
   * Column Names and Data Types: The name and data type of each column tell you whether it contains integers, floats, strings, etc.
   * Non-null Counts: The number of non-null values per column helps identify columns with missing data.
   * Memory Usage: Shows how much memory the dataset uses, which can matter for performance.
4. Duplicate Values
   * Number of Duplicates: How many duplicate rows exist in the dataset.
   * Handling Duplicates: Whether to keep, remove, or otherwise handle the duplicates.
5. Missing/Null Values
   * Missing Values Count: How many missing values exist per column.
   * Patterns of Missing Data: Whether missing values are concentrated in specific columns or distributed evenly.
   * Handling Missing Data: Whether to fill in, drop, or otherwise handle missing values based on the patterns observed.
6. Visualizations of Missing Values
   * Heatmap: Shows where missing values occur in the dataset, highlighting columns with significant missing data.
   * Bar Plot: Gives a clear count of missing values for each column, which helps identify the columns that need attention.


## ***2. Understanding Your Variables***

In [None]:
# Display the column names
print("Dataset Columns:")
print(df.columns)


In [None]:
# Statistical summary of numerical columns
print("Statistical Summary:")
print(df.describe())

# Summary of categorical columns
print("\nCategorical Columns Summary:")
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col} value counts:")
    print(df[col].value_counts())


### Variables Description

Manually describe each column based on its name and data type:

Column Names: Identify what each column represents.
Data Types: Understand the type of data each column contains (e.g., numerical, categorical, text, date).
Column Descriptions: Provide descriptions based on the context of your dataset

In [None]:
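# Manually describe each column of the User Reviews dataset
column_descriptions = {
    'App': 'Name of the app the review belongs to (text)',
    'Translated_Review': 'Translated text of the user review (text)',
    'Sentiment': 'Sentiment label of the review: Positive, Neutral, or Negative (categorical)',
    'Sentiment_Polarity': 'Polarity score from -1 (negative) to 1 (positive) (numerical)',
    'Sentiment_Subjectivity': 'Subjectivity score; values closer to 1 are more subjective (numerical)',
}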

print("\nVariable Descriptions:")
for col, description in column_descriptions.items():
    print(f"{col}: {description}")


### Check Unique Values for each variable.

In [None]:
# Check unique values for each column
print("\nUnique Values in Each Column:")
for col in df.columns:
    print(f"\nUnique values in {col}:")
    print(df[col].unique())
    print(f"Number of unique values: {df[col].nunique()}")


## 3. ***Data Wrangling***

In [None]:
!pip install pandas numpy


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
ls


In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/drive/My Drive/User Reviews.csv')
# Initial exploration (info() prints its own report and returns None)
df.info()
print(df.head())

# Handle missing values

# 1. Forward fill for non-numeric columns
#    (assigning the result back avoids the deprecated fillna(method=..., inplace=True) pattern)
df['Translated_Review'] = df['Translated_Review'].ffill()
df['Sentiment'] = df['Sentiment'].ffill()

# 2. Fill missing numeric columns with the mean
df['Sentiment_Polarity'] = df['Sentiment_Polarity'].fillna(df['Sentiment_Polarity'].mean())
df['Sentiment_Subjectivity'] = df['Sentiment_Subjectivity'].fillna(df['Sentiment_Subjectivity'].mean())

# Verify the changes
df.info()
print(df.head())

# Save the cleaned data
# Note: this overwrites the original file, so later cells that reload this path get the imputed data
df.to_csv('/content/drive/My Drive/User Reviews.csv', index=False)


### What all manipulations have you done and insights you found?

Here is a summary of the data wrangling steps performed on the "User Reviews" dataset and the insights that can be derived from it:

1. Data Loading
   * The dataset "User Reviews.csv" was loaded. It contains 64,295 rows and 5 columns: App, Translated_Review, Sentiment, Sentiment_Polarity, and Sentiment_Subjectivity.
2. Initial Exploration
   * The data types of the columns were checked.
   * Missing values were identified in the Translated_Review, Sentiment, Sentiment_Polarity, and Sentiment_Subjectivity columns.
3. Handling Missing Values
   * Text columns:
     * Translated_Review: Missing values were filled using forward fill (ffill), which propagates the last valid observation forward.
     * Sentiment: Missing values were also filled using forward fill.
   * Numeric columns:
     * Sentiment_Polarity: Missing values were filled with the mean of the column.
     * Sentiment_Subjectivity: Missing values were similarly filled with the mean.
4. Feature Engineering and Data Cleaning
   * Imputing Missing Data: This made the dataset more complete and ready for analysis by ensuring no rows were left with missing values.
   * No Dropping of Rows: Because missing values were filled rather than removed, the dataset kept its original size.
5. Insights from the Wrangled Data
   * Sentiment Analysis: The Sentiment column provides a high-level label (Positive, Neutral, Negative) for each user review. Examining the distribution of sentiments shows how users generally feel about different apps.
   * Polarity and Subjectivity Scores: Sentiment_Polarity ranges from -1 (negative sentiment) to 1 (positive sentiment), and the average polarity can be analyzed for each app. Sentiment_Subjectivity indicates how subjective a review is, with values closer to 1 indicating more subjective opinions. This helps in understanding the tone of reviews.
   * Correlation Analysis: The correlation between sentiment polarity and sentiment subjectivity can show whether more subjective reviews tend to be more positive or more negative.
   * App-Specific Analysis: Grouping the data by App allows sentiment metrics to be compared across apps, highlighting which apps users perceive more positively or negatively.

Next Steps for Deeper Insights

* Sentiment Trends: Plot the sentiment distribution across different apps to detect trends. The five columns listed above include no date, so trends over time would need additional data.
* Feature Importance: If a predictive model were built, analyze which features (e.g., length of review, app category) are most predictive of positive or negative sentiment.
* Outlier Detection: Identify reviews with extremely high or low polarity scores as potential outliers for further manual review.
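
Two of these next steps, a review-length feature and the flagging of extreme polarity scores, can be sketched in a few lines. The frame below is hypothetical, and the `EXTREME` threshold of 0.9 is an assumed cut-off, not a value taken from the dataset.

```python
import pandas as pd

# Hypothetical reviews using the same column names as the User Reviews dataset
reviews_demo = pd.DataFrame({
    "App": ["a", "a", "b", "b", "b"],
    "Translated_Review": [
        "Great app",
        "Terrible, crashes a lot",
        "Fine",
        "Love it so much",
        "Worst update ever",
    ],
    "Sentiment_Polarity": [0.8, -1.0, 0.1, 1.0, -0.9],
})

# Feature: review length in words
reviews_demo["Review_Length"] = reviews_demo["Translated_Review"].apply(
    lambda text: len(str(text).split())
)

# Outlier flag: polarity at or beyond an assumed threshold in either direction
EXTREME = 0.9
reviews_demo["Is_Extreme"] = reviews_demo["Sentiment_Polarity"].abs() >= EXTREME

# App-level comparison: mean polarity, review count, and number of extreme reviews
per_app = reviews_demo.groupby("App").agg(
    mean_polarity=("Sentiment_Polarity", "mean"),
    n_reviews=("Sentiment_Polarity", "size"),
    n_extreme=("Is_Extreme", "sum"),
)
print(per_app)
```

Flagged rows are candidates for manual review rather than automatic removal, since a strongly worded review can still be genuine.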

## ***4. Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables***

#### Chart - 1

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Load the dataset again if necessary
df = pd.read_csv('/content/drive/My Drive/User Reviews.csv')

# Set the style for seaborn
sns.set(style="whitegrid")

# Now you can run the previous plotting code
plt.figure(figsize=(12, 6))
sns.countplot(y="App", hue="Sentiment", data=df, order=df['App'].value_counts().index[:10])
plt.title('Sentiment Distribution for Top 10 Apps')
plt.xlabel('Count')
plt.ylabel('App')
plt.legend(title='Sentiment')
plt.show()


##### 1. Why did you pick the specific chart?

Bar plots are well suited to comparing category frequencies across groups. This chart shows how sentiments are distributed across the 10 most-reviewed apps, making it easy to compare which apps receive the most positive, neutral, or negative feedback.

##### 2. What is/are the insight(s) found from the chart?

This plot shows the distribution of sentiments (Positive, Negative, Neutral) for the top 10 apps. You can identify which apps have the highest count of each sentiment, revealing which apps are most positively or negatively received by users.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing the proportion of positive, negative, and neutral sentiments helps tailor marketing strategies. Positive sentiment can be leveraged in marketing campaigns, while negative sentiment can be addressed proactively.

#### Chart - 2

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Sentiment_Polarity', y='Sentiment_Subjectivity', hue='Sentiment', data=df)
plt.title('Relationship between Sentiment Polarity and Subjectivity')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Sentiment Subjectivity')
plt.show()

##### 1. Why did you pick the specific chart?

Scatter plots are ideal for showing the relationship between two numerical variables. This chart helps show whether there is any correlation between how subjective a review is and its polarity (whether it is positive or negative).

##### 2. What is/are the insight(s) found from the chart?

This scatter plot illustrates the relationship between sentiment polarity (degree of positivity or negativity) and subjectivity (how subjective or objective a review is). It helps in understanding whether more subjective reviews tend to be more extreme (positive or negative) or whether there is another pattern in the distribution of sentiment scores.
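
The question of whether subjective reviews are more extreme can also be checked numerically by correlating subjectivity with the absolute value of polarity. The numbers below are hypothetical and chosen only to show the calculation; `Polarity_Strength` is a column name introduced for this sketch.

```python
import pandas as pd

# Hypothetical scores using the dataset's column names
demo = pd.DataFrame({
    "Sentiment_Polarity": [0.9, -0.8, 0.1, 0.0, -0.1, 0.7],
    "Sentiment_Subjectivity": [0.9, 0.8, 0.2, 0.0, 0.3, 0.6],
})

# Strength of opinion regardless of direction
demo["Polarity_Strength"] = demo["Sentiment_Polarity"].abs()

# Pearson correlation of subjectivity with signed polarity and with polarity strength
signed_corr = demo["Sentiment_Polarity"].corr(demo["Sentiment_Subjectivity"])
strength_corr = demo["Polarity_Strength"].corr(demo["Sentiment_Subjectivity"])

print(f"Signed polarity vs subjectivity:   {signed_corr:.2f}")
print(f"Polarity strength vs subjectivity: {strength_corr:.2f}")
```

A weak signed correlation combined with a strong correlation for the absolute value would suggest that subjective reviews are more extreme in both directions rather than simply more positive.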

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Insights from these plots can guide product development teams on whether to focus on enhancing features or improving user experience based on the observed sentiment patterns.

#### Chart - 3

In [None]:
plt.figure(figsize=(12, 6))
df_grouped = df.groupby('App')['Sentiment_Polarity'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=df_grouped.values, y=df_grouped.index, palette="viridis")
plt.title('Top 10 Apps by Average Sentiment Polarity')
plt.xlabel('Average Sentiment Polarity')
plt.ylabel('App')
plt.show()

##### 1. Why did you pick the specific chart?

This bar plot was chosen to show the average sentiment polarity of the 10 apps with the highest averages, allowing easy comparison of user sentiment across them.

##### 2. What is/are the insight(s) found from the chart?

This bar plot displays the average sentiment polarity for the top 10 apps. It helps in identifying which apps are generally perceived as more positive or negative based on their average sentiment scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding which apps have the highest and lowest average sentiment polarity helps prioritize which apps to focus on for improvements or marketing efforts. Apps with high positive sentiment can be promoted more, while those with low sentiment can be improved based on user feedback.
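
When ranking apps by average polarity, an app with a single very positive review can outrank apps with many reviews. One hedged way to guard against this is a minimum review count; `MIN_REVIEWS = 3` and the data below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical reviews: app "b" has one perfect review, "a" and "c" have three each
demo = pd.DataFrame({
    "App": ["a", "a", "a", "b", "c", "c", "c"],
    "Sentiment_Polarity": [0.5, 0.6, 0.4, 1.0, 0.2, 0.1, 0.3],
})

MIN_REVIEWS = 3  # assumed threshold; tune it to the size of the real dataset

# Mean polarity and number of reviews per app
per_app = demo.groupby("App")["Sentiment_Polarity"].agg(
    mean_polarity="mean", n_reviews="count"
)

# Rank only the apps that have enough reviews to make the average meaningful
ranked = per_app[per_app["n_reviews"] >= MIN_REVIEWS].sort_values(
    "mean_polarity", ascending=False
)
print(ranked)
```

Without the filter, the app with a single review would top the ranking; with it, only apps with enough feedback are compared.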

#### Chart - 4

In [None]:
# Drop missing reviews and cast to str so join() never receives a non-string value
positive_reviews = " ".join(df.loc[df['Sentiment'] == 'Positive', 'Translated_Review'].dropna().astype(str))
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(positive_reviews)

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud of Positive Reviews')
plt.axis("off")
plt.show()


##### 1. Why did you pick the specific chart?

Word clouds are effective for visualizing the most frequently occurring words in a text. This chart was chosen to highlight common themes and keywords in positive reviews, helping you understand what users like.

##### 2. What is/are the insight(s) found from the chart?

This word cloud highlights the most frequently occurring words in positive reviews. It can reveal common themes or features praised by users.
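
The frequencies behind a word cloud can be inspected directly with a word count. This is a simplified sketch: `top_words`, the small `STOPWORDS` set, and the sample reviews are all hypothetical, and the `wordcloud` package applies its own tokenization and stopword list, so its results may differ.

```python
import re
from collections import Counter

# Assumed minimal stopword list, for illustration only
STOPWORDS = {"the", "a", "is", "it", "and", "to", "i", "this"}


def top_words(reviews, n=3):
    """Return the n most frequent non-stopword words across the given reviews."""
    counts = Counter()
    for review in reviews:
        words = re.findall(r"[a-z']+", str(review).lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts.most_common(n)


# Hypothetical positive reviews
sample = [
    "I love this game",
    "Love the graphics",
    "The game is great and I love it",
]
print(top_words(sample))
```

A table of counts complements the word cloud because it gives exact frequencies that can be compared between positive and negative reviews.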

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By identifying common words and themes in positive and negative reviews, businesses can understand what users appreciate and what issues they face. Addressing frequent complaints and enhancing features praised by users can improve overall user satisfaction and retention.

#### Chart - 5

In [None]:
# Drop missing reviews and cast to str so join() never receives a non-string value
negative_reviews = " ".join(df.loc[df['Sentiment'] == 'Negative', 'Translated_Review'].dropna().astype(str))
wordcloud = WordCloud(width=800, height=400, background_color="black", colormap="Reds").generate(negative_reviews)

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud of Negative Reviews')
plt.axis("off")
plt.show()

##### 1. Why did you pick the specific chart?

Similar to the positive reviews word cloud, this chart was selected to reveal common complaints or negative aspects mentioned by users.

##### 2. What is/are the insight(s) found from the chart?

This word cloud displays the most common words in negative reviews. It helps identify recurring issues or complaints mentioned by users.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By identifying common words and themes in positive and negative reviews, businesses can understand what users appreciate and what issues they face. Addressing frequent complaints and enhancing features praised by users can improve overall user satisfaction and retention.

#### Chart - 6

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Sentiment', y='Sentiment_Polarity', data=df)
plt.title('Sentiment Polarity Distribution by Sentiment Type')
plt.xlabel('Sentiment')
plt.ylabel('Sentiment Polarity')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot shows the distribution of a variable and highlights outliers. This chart was chosen to show how sentiment polarity varies across the sentiment categories (Positive, Neutral, Negative).

##### 2. What is/are the insight(s) found from the chart?

The box plot provides an overview of the distribution of sentiment polarity for each sentiment type. It shows the central tendency, spread, and any potential outliers in polarity scores for positive, negative, and neutral reviews.
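
The box plot drawn for this chart marks points beyond 1.5 times the interquartile range as outliers (seaborn's default `whis=1.5`). The sketch below reproduces that rule numerically, assuming a small hypothetical `sample` DataFrame with the same column names as `df`; `iqr_outliers` is a helper name introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample with the same column names as the review DataFrame.
sample = pd.DataFrame({
    "Sentiment": ["Positive"] * 5 + ["Negative"] * 5,
    "Sentiment_Polarity": [0.2, 0.3, 0.4, 0.5, 1.0, -0.1, -0.2, -0.3, -0.4, -1.0],
})


def iqr_outliers(values):
    """Return the values lying beyond 1.5 * IQR from the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]


# Collect the outlying polarity scores for each sentiment type.
outliers = {
    name: iqr_outliers(group).tolist()
    for name, group in sample.groupby("Sentiment")["Sentiment_Polarity"]
}
print(outliers)
```

Listing the outlying rows this way makes it possible to read the reviews behind the points that the box plot only shows as dots.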

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If polarity shows high variance or a large number of strongly negative outliers, it indicates significant dissatisfaction in specific areas. This highlights areas needing immediate attention.

Justification: Negative insights, such as high variance in sentiment or persistent negative sentiment in reviews, point to a poor user experience, decreased user satisfaction, and potential loss of users. If these issues are not addressed, they can result in negative growth, affecting both user retention and acquisition.

#### Chart - 7

In [None]:
plt.figure(figsize=(10, 6))
sns.violinplot(x='Sentiment', y='Sentiment_Subjectivity', data=df)
plt.title('Sentiment Subjectivity Distribution by Sentiment Type')
plt.xlabel('Sentiment')
plt.ylabel('Sentiment Subjectivity')
plt.show()

##### 1. Why did you pick the specific chart?

A violin plot combines aspects of a box plot and a density plot, showing the full shape of the distribution. This chart was selected to provide a more detailed look at how sentiment subjectivity is distributed across sentiment types.

##### 2. What is/are the insight(s) found from the chart?

This plot reveals the distribution of sentiment subjectivity for each sentiment category. It shows the density of subjectivity scores and highlights the variability within each sentiment type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If negative reviews are highly subjective, it can indicate that specific issues are perceived differently by different users. This can complicate pinpointing exact problems and hinder effective solutions, which can lead to negative growth if the issues are left unaddressed.

#### Chart - 8

In [None]:
# Scatter plot of polarity against subjectivity, colored by sentiment type
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Sentiment_Polarity', y='Sentiment_Subjectivity', hue='Sentiment', data=df, alpha=0.6)
plt.title('Scatter Plot of Polarity and Subjectivity')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Sentiment Subjectivity')
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot shows the relationship between two numeric variables. This chart was chosen to show how sentiment polarity and subjectivity relate to each other, with each review colored by its sentiment type.

##### 2. What is/are the insight(s) found from the chart?

This scatter plot shows each review by its sentiment polarity and subjectivity for each sentiment type, allowing individual data points and patterns within the data to be seen.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If sentiment polarity is skewed negatively, it indicates overall dissatisfaction. Persistent negative sentiment can drive users away, affecting business growth. High subjectivity in negative reviews can also make the underlying problems harder to pinpoint.

#### Chart - 9

In [None]:
sns.pairplot(df, hue='Sentiment', vars=['Sentiment_Polarity', 'Sentiment_Subjectivity'], diag_kind='kde')
plt.suptitle('Pair Plot of Sentiment Polarity and Subjectivity', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot shows the pairwise relationships between numeric variables together with the distribution of each variable on the diagonal. This chart was chosen to view sentiment polarity and subjectivity jointly, split by sentiment type.

##### 2. What is/are the insight(s) found from the chart?

The scatter panels of the pair plot illustrate the relationship between sentiment polarity (degree of positivity or negativity) and subjectivity (how subjective or objective a review is), while the diagonal panels show the density of each score per sentiment type. Together they help in understanding whether more subjective reviews tend to be more extreme (positive or negative) or whether there is another pattern in the distribution of sentiment scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Insights from these plots can guide product development teams on whether to focus on enhancing features or improving user experience based on the observed sentiment patterns.

#### Chart - 10

In [None]:
plt.figure(figsize=(10, 6))
sns.stripplot(x='Sentiment', y='Sentiment_Polarity', data=df, jitter=True)
plt.title('Strip Plot of Sentiment Polarity by Sentiment Type')
plt.xlabel('Sentiment')
plt.ylabel('Sentiment Polarity')
plt.show()

##### 1. Why did you pick the specific chart?

A strip plot draws every review as an individual point, with jitter to reduce overlap. This chart was chosen to show the raw polarity values within each sentiment type, complementing the summary view given by a box plot.

##### 2. What is/are the insight(s) found from the chart?

The strip plot shows the individual sentiment polarity values for each sentiment type. It shows where scores cluster, how widely they spread, and any isolated extreme values for positive, negative, and neutral reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If there is high variance or a large number of strongly negative points, it indicates significant dissatisfaction in specific areas. This can highlight areas needing immediate attention, but if not addressed, it can lead to negative growth.

#### Chart - 11

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df['Sentiment_Polarity'], bins=20, kde=True)
plt.title('Distribution of Sentiment Polarity')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE overlay shows how a single numeric variable is distributed. This chart was chosen to show the overall distribution of sentiment polarity across all reviews.

##### 2. What is/are the insight(s) found from the chart?

The histogram shows how sentiment polarity scores are distributed across all reviews, and the KDE curve gives a smoothed view of the same distribution. It provides a quick overview of whether reviews lean positive, negative, or neutral.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

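If sentiment polarity is skewed negatively, it indicates overall dissatisfaction. Persistent negative sentiment can drive users away, affecting business growth.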

#### Chart - 12

In [None]:
plt.figure(figsize=(10, 6))
# fill=True replaces the deprecated shade=True argument of sns.kdeplot
sns.kdeplot(df[df['Sentiment'] == 'Positive']['Sentiment_Polarity'], label='Positive', fill=True)
sns.kdeplot(df[df['Sentiment'] == 'Negative']['Sentiment_Polarity'], label='Negative', fill=True)
sns.kdeplot(df[df['Sentiment'] == 'Neutral']['Sentiment_Polarity'], label='Neutral', fill=True)
plt.title('KDE Plot of Sentiment Polarity by Sentiment Type')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Density')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

A KDE plot gives a smoothed estimate of a distribution and allows several groups to be overlaid on the same axes. This chart was chosen to compare the polarity distributions of positive, negative, and neutral reviews directly.

##### 2. What is/are the insight(s) found from the chart?

The KDE plot provides a smoothed density estimate of sentiment polarity for each sentiment type. It helps in visualizing the probability distribution of sentiment scores and identifying overlaps or differences between sentiment categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If sentiment polarity is skewed negatively, it indicates overall dissatisfaction. Persistent negative sentiment can drive users away, affecting business growth.
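
Whether polarity is skewed negatively can also be checked numerically rather than read off the plot. The following is a rough sketch, assuming a hypothetical `polarity` Series in place of `df['Sentiment_Polarity']`; it uses the sample skewness from pandas and the share of reviews with a negative score.

```python
import pandas as pd

# Hypothetical polarity scores with one strongly negative review (long left tail).
polarity = pd.Series([-0.9, 0.1, 0.2, 0.3, 0.3], name="Sentiment_Polarity")

# Negative skewness means the left (negative) tail is longer than the right one.
skewness = polarity.skew()

# Fraction of reviews whose polarity is below zero.
negative_share = (polarity < 0).mean()

print(f"skewness: {skewness:.2f}, share of negative reviews: {negative_share:.0%}")
```

Skewness alone does not say how many reviews are negative, which is why the sketch reports the negative share alongside it.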

#### Chart - 13

In [None]:
g = sns.FacetGrid(df, col="Sentiment", height=5)
g.map(sns.scatterplot, "Sentiment_Polarity", "Sentiment_Subjectivity")
g.add_legend()
plt.suptitle('Sentiment Polarity vs Subjectivity by Sentiment Type', y=1.05)
plt.show()

##### 1. Why did you pick the specific chart?

A faceted scatter plot draws one panel per sentiment type, so the groups do not overlap. This chart was chosen to compare the relationship between polarity and subjectivity separately for positive, negative, and neutral reviews.

##### 2. What is/are the insight(s) found from the chart?

These scatter plots illustrate the relationship between sentiment polarity (degree of positivity or negativity) and subjectivity (how subjective or objective a review is). They help in understanding whether more subjective reviews tend to be more extreme (positive or negative) or whether there is another pattern in the distribution of sentiment scores.
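
One way to test whether more subjective reviews tend to be more extreme is to correlate subjectivity with the absolute value of polarity, since strongly positive and strongly negative reviews cancel out in the raw correlation. This is a minimal sketch under that assumption, using a hypothetical `sample` DataFrame with the same column names as `df`.

```python
import pandas as pd

# Hypothetical sample: extreme polarity (either sign) paired with high subjectivity.
sample = pd.DataFrame({
    "Sentiment_Polarity": [-0.8, -0.4, 0.0, 0.4, 0.8],
    "Sentiment_Subjectivity": [0.9, 0.5, 0.1, 0.5, 0.9],
})

# Pearson correlation of the signed polarity with subjectivity.
raw_corr = sample["Sentiment_Polarity"].corr(sample["Sentiment_Subjectivity"])

# Pearson correlation of the polarity magnitude with subjectivity.
abs_corr = sample["Sentiment_Polarity"].abs().corr(sample["Sentiment_Subjectivity"])

print(f"signed: {raw_corr:.2f}, magnitude: {abs_corr:.2f}")
```

In this constructed sample the signed correlation is close to zero while the magnitude correlation is close to one, which is the V-shaped pattern a scatter plot would show if subjective reviews were more extreme.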

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Product Development: Insights from these scatter plots can guide product development teams on whether to focus on enhancing features or improving user experience based on the observed sentiment patterns.

#### Chart - 14 - Correlation Heatmap

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix for numerical columns
corr_matrix = df[['Sentiment_Polarity', 'Sentiment_Subjectivity']].corr()

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap of Sentiment Polarity and Subjectivity')
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap shows the pairwise correlation coefficients between numeric variables as a color-coded matrix. This chart was chosen to check how strongly sentiment polarity and sentiment subjectivity are linearly related.

##### 2. What is/are the insight(s) found from the chart?

The heatmap displays the Pearson correlation coefficient between sentiment polarity and sentiment subjectivity. It helps in identifying whether the two scores tend to move together.

#### Chart - 15 - Pair Plot

In [None]:
sns.pairplot(df, hue='Sentiment', vars=['Sentiment_Polarity', 'Sentiment_Subjectivity'], diag_kind='kde')
plt.suptitle('Pair Plot of Sentiment Polarity and Subjectivity', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot provides a compact view of the pairwise relationships and individual distributions of the numeric variables. This chart was selected to give a high-level overview of sentiment polarity and subjectivity by sentiment type. Each of these charts provides a unique perspective on the data, allowing you to explore various aspects and uncover different insights.

##### 2. What is/are the insight(s) found from the chart?

The pair plot shows the distribution of polarity and subjectivity for each sentiment type (Positive, Negative, Neutral) on the diagonal and the relationship between the two scores in the off-diagonal panels. It provides a quick overview of how the two scores differ among the sentiment types.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve business objectives based on the sentiment analysis insights, here is a brief strategy:

1. Enhance User Experience
   * Action: Address common complaints and issues identified from negative reviews.
   * Objective: Improve user satisfaction and reduce negative sentiment by making product or service adjustments based on user feedback.
2. Leverage Positive Feedback
   * Action: Highlight features or aspects praised in positive reviews in marketing campaigns and product promotions.
   * Objective: Increase user acquisition and brand loyalty by showcasing strengths and positive user experiences.
3. Prioritize Product Improvements
   * Action: Focus on features or apps with low average sentiment scores for improvements.
   * Objective: Increase overall user satisfaction and positive sentiment by refining and enhancing areas of weakness.
4. Implement Targeted Marketing Strategies
   * Action: Use sentiment distribution insights to tailor marketing strategies, such as promoting apps with high positive sentiment and addressing concerns related to apps with negative sentiment.
   * Objective: Optimize marketing efforts and resource allocation to maximize impact and engagement.
5. Monitor and Adapt
   * Action: Continuously monitor sentiment trends and user feedback to adapt strategies in real time.
   * Objective: Stay responsive to user needs and preferences, ensuring ongoing alignment with business objectives and market demands.
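
The prioritization step in this strategy, focusing on apps with low average sentiment scores, can be sketched as a simple ranking. This is a hedged illustration only: the `sample` DataFrame is hypothetical, and the `App` column name is an assumption about the review data rather than something shown in this section.

```python
import pandas as pd

# Hypothetical review data; the 'App' column name is assumed for illustration.
sample = pd.DataFrame({
    "App": ["A", "A", "B", "B", "C", "C"],
    "Sentiment_Polarity": [0.8, 0.6, -0.5, -0.1, 0.2, None],
})

# Average polarity and number of scored reviews per app, lowest average first.
avg_polarity = (
    sample.dropna(subset=["Sentiment_Polarity"])
    .groupby("App")["Sentiment_Polarity"]
    .agg(["mean", "count"])
    .sort_values("mean")
)
print(avg_polarity)
```

Keeping the review count next to the mean helps avoid prioritizing an app on the basis of only one or two reviews.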

Summary

Address Issues: Tackle negative feedback to enhance user experience and satisfaction.

Promote Strengths: Use positive feedback to drive marketing and brand positioning.

Focus on Improvements: Prioritize enhancements based on user sentiment to boost overall performance.

Adapt Strategies: Regularly update strategies based on current sentiment trends to remain competitive and aligned with user expectations.

By integrating these actions into a comprehensive strategy, the client can achieve their business objectives, enhance user satisfaction, and drive positive growth.

# **Conclusion**

Utilizing insights from sentiment analysis can significantly advance a business's objectives by strategically addressing both positive and negative feedback. By enhancing user experience through targeted improvements, leveraging positive sentiment in marketing efforts, and continuously adapting strategies based on real-time feedback, businesses can achieve greater user satisfaction and drive growth.

Addressing common issues identified from negative reviews ensures that user dissatisfaction is minimized, while promoting features highlighted in positive feedback can boost brand loyalty and attract new users. Regular monitoring and adaptation of strategies based on sentiment trends will keep the business responsive to user needs and market dynamics. Implementing these strategies effectively will lead to improved user engagement, increased retention, and overall positive business growth.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***