# **Google Play Store Apps Analysis**

Problem Statement:

"What are the key factors that influence the success of Google Play Store apps in terms of user ratings, downloads, and revenue, and how can developers optimize these factors to enhance user engagement and profitability?"



# Apps Data


## Data Cleaning

First, we import the libraries needed for data handling and plotting.


In [None]:
# Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Then we mount Google Drive and load the apps data file.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Import App Data File
df = pd.read_csv('/content/drive/MyDrive/Anudip (Python )/googleplaystore.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/Anudip (Python )/googleplaystore.csv'

In [None]:
# Check the shape (rows, columns) of the data
df.shape

Next, we inspect the data to understand its structure and to spot issues such as missing values, duplicate rows, and incorrect data types.

In [None]:
# Print first 5 rows
df.head()

In [None]:
# Gives information about data
df.info()

In [None]:
# Check for null values
pd.isnull(df).sum()

In [None]:
# Fill null values with forward fill: each gap takes the value from the row above
# (df.ffill() replaces the deprecated fillna(method='ffill') form)
df.ffill(inplace=True)
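
To make the effect of forward filling concrete, here is a minimal sketch on a made-up frame (the `toy` values are hypothetical and not taken from the dataset). It only illustrates the mechanics: every gap receives the value from the row above, so the result depends on row order.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate forward filling
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, np.nan, 3.8],
})

# Each missing Rating is replaced by the last non-missing value above it
filled = toy.ffill()
print(filled["Rating"].tolist())  # [4.1, 4.1, 4.1, 3.8]
```

Note that a gap in the very first row has nothing above it and would stay empty, which is one reason to re-check the null counts afterwards.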

In [None]:
# Checking if there is any duplicate row
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

In [None]:
# Removes duplicate rows
df.drop_duplicates(inplace=True)

In [None]:
# how many unique values are there
df.nunique()

In [None]:
# Cleaning the Installs column (removing unwanted symbols or letters)
# Remove '+' and ',' from the values, then keep only rows that are purely numeric

df['Installs'] = df['Installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x) # removing +
df['Installs'] = df['Installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x) # removing ,
df = df[df['Installs'].str.isnumeric()].copy() # .copy() avoids SettingWithCopyWarning on the next line
df['Installs'] = df['Installs'].astype(int)
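
The same cleaning can also be written with pandas' vectorized string methods instead of `apply`. The sketch below runs on a small made-up Series (the `raw` values are hypothetical examples of the formats this column can contain) and is meant as an alternative formulation, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical raw values, including one non-numeric entry
raw = pd.Series(["10,000+", "500+", "Free", "1,000,000+"], name="Installs")

# Strip '+' and ',' as literal characters (regex=False), then drop non-numeric rows
cleaned = raw.str.replace("+", "", regex=False).str.replace(",", "", regex=False)
cleaned = cleaned[cleaned.str.isnumeric()].astype(int)

print(cleaned.tolist())  # [10000, 500, 1000000]
```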

In [None]:
# cleaning Price column (removing unwanted symbols or letters)
# removing $ sign

df['Price'] = df['Price'].apply(lambda x: str(x).replace('$', '') if '$' in str(x) else str(x))
df['Price'] = df['Price'].apply(lambda x: float(x)) # converting into float

In [None]:
# cleaning Size column  (removing unwanted symbols or letters)

df['Size'] = df['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)

df['Size'] = df['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: float(str(x).replace('k', '')) / 1000 if 'k' in str(x) else x)

df['Size'] = df['Size'].apply(lambda x: float(x))
df['Installs'] = df['Installs'].apply(lambda x: float(x))
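
As a sketch of the same Size conversion in one place, the helper below (the name `parse_size` is hypothetical) turns each string into megabytes and returns `NaN` for anything it cannot parse. It uses the same k to M factor of 1000 as the cell above and is tested only on made-up values.

```python
import numpy as np
import pandas as pd

def parse_size(value):
    """Convert a Size string such as '19M' or '201k' to megabytes (hypothetical helper)."""
    text = str(value).replace(",", "")
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1000  # same k -> M factor as the cell above
    return np.nan  # 'Varies with device' or any other unparseable value

# Hypothetical sample values
sizes = pd.Series(["19M", "201k", "Varies with device", "8.7M"])
parsed = sizes.apply(parse_size)
print(parsed)
```

Keeping the rules in a single function makes it easier to see which inputs end up as missing values before they are filled.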

In [None]:
# Cleaning review column (setting dtype)
df['Reviews'] = df['Reviews'].apply(lambda x: int(x))

In [None]:
# Check null values again, because 'Varies with device' was replaced with NaN above
pd.isnull(df).sum()

In [None]:
# Fill the new null values, again with forward fill
df.ffill(inplace=True)

In [None]:
# Final shape of the data after cleaning
df.shape

## Data Visualization
Here, we visualize the cleaned data using several plots and charts.


## Count of Apps by Category
This code generates a countplot that shows the number of apps by category. From this plot, we can observe that the top 3 categories with the highest app counts are Family, Games, and Tools.

In [None]:
colormap = plt.get_cmap("inferno")
num_categories = df["Category"].nunique()
colors = [colormap(i / num_categories) for i in range(num_categories)]   # Generate list of colors

plt.figure(figsize=(15, 6))
sns.countplot(x = df["Category"],palette=colors)
plt.xticks(rotation=90, fontsize=7)
plt.title("Count of Apps by Category")
plt.show()

## Average of Rating
This code calculates and prints the average rating of apps from a dataset.


In [None]:
# Average of rating
avg = np.mean(df["Rating"])
print("Average rating of apps", round(avg,2))

## No. of Apps based on their Rating
This code generates a donut chart that shows the number of apps in each rating range. From this, we can see that a large number of apps fall within the 4 to 4.5 rating range.

In [None]:
# Define labels and values
labels = ["Below 3", "3-3.5", "3.5-4", "4-4.5", "4.5-5"]

# Each value is a cumulative count minus the previous cumulative count
values = [
    (df["Rating"] < 3).sum(),
    (df["Rating"] < 3.5).sum() - (df["Rating"] < 3).sum(),
    (df["Rating"] < 4).sum() - (df["Rating"] < 3.5).sum(),
    (df["Rating"] < 4.5).sum() - (df["Rating"] < 4).sum(),
    (df["Rating"] <= 5).sum() - (df["Rating"] < 4.5).sum()  # <= 5 so that 5.0 ratings are counted
]

# Colors for the chart
colormap = plt.get_cmap("twilight_shifted")
colors = [colormap(i / len(labels)) for i in range(len(labels))]

# Plot a pie chart with a hole in the middle (donut chart)
plt.figure(figsize=(5, 5))
plt.pie(values, labels=labels, colors=colors, autopct='%1.1f%%', startangle=140, wedgeprops=dict(width=0.4, edgecolor='white'))

# Set the title
plt.title("No. of Apps based on their Rating", fontsize=20)

# Show the plot
plt.show()
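
The rating ranges can also be computed with `pd.cut`, which assigns each rating to a bin in one step. This is a minimal sketch on made-up ratings. The bin edges, the use of `np.inf` as the top edge so that 5.0 is included, and the label names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, used only to illustrate the binning
ratings = pd.Series([2.5, 3.0, 3.7, 4.0, 4.3, 4.5, 5.0])

edges = [0, 3, 3.5, 4, 4.5, np.inf]
bin_labels = ["Below 3", "3-3.5", "3.5-4", "4-4.5", "4.5-5"]

# right=False makes every bin include its lower edge and exclude its upper edge
binned = pd.cut(ratings, bins=edges, labels=bin_labels, right=False)
counts = binned.value_counts().reindex(bin_labels)

print(counts.tolist())  # [1, 1, 1, 2, 2]
```

The resulting `counts` could be passed to `plt.pie` in place of the manually computed values.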


## No. of Top Genres
This code generates a donut chart that shows the 10 most common genres and their share of apps among those 10.

In [None]:
# Extract labels and values for the top 10 genres
labels = df["Genres"].value_counts()[:10].index
values = df["Genres"].value_counts()[:10]

# Get the 'inferno' colormap and generate colors based on the number of labels
colormap = plt.get_cmap("inferno")
colors = [colormap(i / len(labels)) for i in range(len(labels))]  # Generate list of colors

# Create a pie chart with a hole in the middle (donut chart)
plt.figure(figsize=(8, 8))
plt.pie(values, labels=labels, colors=colors, autopct='%1.1f%%', pctdistance=0.45 , startangle=140,
        wedgeprops=dict(width=0.5, edgecolor='white'))  # Donut chart with white edges

# Set title for the chart
plt.title("No. of Top Genres", fontsize=20)

# Show the plot
plt.show()


## Count of Apps by Type (Free and Paid)
This code generates a countplot that shows the number of apps by type (free or paid). From this, we can see that there are more free apps than paid apps.

In [None]:
sns.countplot(x=df["Type"], palette = "inferno")
plt.title("Count of Apps by Type")
plt.show()

## Count of Apps by Content Rating based on Type
This code generates a countplot that shows the number of apps by Content Rating for each app type.

In [None]:
sns.countplot(x=df["Type"],hue=df["Content Rating"],palette = "plasma")
plt.title("Count of Apps by Content Rating based on Type")
plt.show()

## Count of Apps in Each Category by Type
This code generates a countplot that shows the number of apps in each category, split by type. From this, we can see that the top 3 categories for free apps are Family, Game, and Tools, while the top 3 for paid apps are Family, Game, and Medical.

In [None]:
plt.figure(figsize=(16,8))
sns.countplot(x=df["Category"],hue=df["Type"], palette = "plasma")
plt.xticks(rotation=90, fontsize=7)
plt.title("Count of Apps in Each Category by Type", fontsize = 20)
plt.show()
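
Rankings like these can be hard to read off a crowded plot, so they can also be checked numerically. The sketch below counts categories within each type and keeps the three largest per type. It runs on a small made-up frame (the `toy` rows are hypothetical), so its output says nothing about the real data.

```python
import pandas as pd

# Hypothetical rows, used only to illustrate the per-type ranking
toy = pd.DataFrame({
    "Type": ["Free"] * 6 + ["Paid"] * 3,
    "Category": ["FAMILY", "FAMILY", "FAMILY", "GAME", "GAME", "TOOLS",
                 "MEDICAL", "MEDICAL", "FAMILY"],
})

# Count categories inside each Type (sorted descending), then keep the top 3 per Type
top3 = (
    toy.groupby("Type")["Category"]
       .value_counts()
       .groupby(level=0)
       .head(3)
)
print(top3)
```

Running the same expression on `df` would list the top categories for free and paid apps directly.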

#  Apps Review Data


## Data Cleaning

In [None]:
# Loading apps review data
ddf = pd.read_csv("/content/drive/MyDrive/Anudip (Python )/googleplaystore_user_reviews.csv")

In [None]:
# Check Shape of User Review data
ddf.shape

As before, we inspect the review data to understand its structure and to spot issues such as missing values, duplicate rows, and incorrect data types.

In [None]:
ddf.info()
# around 40% data is empty
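
The share of missing values can be computed per column rather than estimated from the `info()` output. This is a minimal sketch on a made-up frame. The column names follow the review data, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical review rows with some missing entries
toy = pd.DataFrame({
    "App": ["A", "A", "B", "B", "C"],
    "Sentiment": ["Positive", np.nan, np.nan, "Negative", "Neutral"],
    "Sentiment_Polarity": [0.8, np.nan, np.nan, -0.5, 0.0],
})

# isnull() gives True/False per cell; the column mean is the fraction missing
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))
```

Applying the same expression to `ddf` gives the exact percentage for each column.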

In [None]:
# Check the number of unique values per column
ddf.nunique()

In [None]:
# check how many null values are there
ddf.isnull().sum()

In [None]:
# Drop rows with null values
ddf.dropna(inplace=True)

In [None]:
# Check how many duplicate rows there are
ddf.duplicated().sum()

In [None]:
# Drop duplicate rows
ddf.drop_duplicates(inplace=True)

In [None]:
# Final shape of our data frame
ddf.shape

In [None]:
# now data is cleaned
ddf.info()

## Data Visualization

## Count of Sentiment
This code generates a countplot that shows the distribution of sentiments (positive, negative, and neutral) in the dataset. From this, we can see that positive reviews are the most common.

In [None]:
sns.countplot(x=ddf["Sentiment"], palette="plasma")
plt.title("Count of Sentiment")
plt.show()

## Sentiment Distribution: Subjectivity vs. Polarity
This code generates a scatter plot that shows the relationship between sentiment subjectivity and sentiment polarity. From this, we can see that positive and negative reviews tend to be more subjective.

In [None]:
sns.scatterplot(x=ddf["Sentiment_Subjectivity"],y=ddf["Sentiment_Polarity"],
                hue=ddf["Sentiment"],palette="twilight")
plt.title("Sentiment Distribution: Subjectivity vs. Polarity", fontsize =20)
plt.show()

# Merging both data frames

In [None]:
# Merge the app data with the app review data and save the result as a new file
df_new = pd.merge(df,ddf,on="App",how="inner") # the App column is common to both tables
df_new.to_csv('merged_file.csv', index=False) # to_csv writes CSV, so the file gets a .csv extension
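
To show what an inner merge keeps, here is a minimal sketch on two made-up frames (`apps` and `reviews` are hypothetical stand-ins for `df` and `ddf`). Only apps present in both tables survive, and an app with several reviews appears once per review. The `validate` argument is an optional check, added here as an assumption about what one might want to guard against: it raises an error if an app name repeats in the left table, which would otherwise multiply the review rows.

```python
import pandas as pd

# Hypothetical stand-ins for the app table and the review table
apps = pd.DataFrame({"App": ["A", "B", "C"], "Type": ["Free", "Paid", "Free"]})
reviews = pd.DataFrame({
    "App": ["A", "A", "D"],
    "Sentiment": ["Positive", "Negative", "Neutral"],
})

# Inner join: keeps only apps found in both tables (here just "A", twice)
merged = pd.merge(apps, reviews, on="App", how="inner")
print(merged.shape)  # (2, 3)

# Same merge, but fail loudly if "App" is not unique in the left table
checked = pd.merge(apps, reviews, on="App", how="inner", validate="one_to_many")
```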

## Data Visualization

## Sentiment Distribution: Subjectivity vs. Polarity by Content Rating
This code generates a scatter plot that shows the relationship between sentiment subjectivity and polarity, colored by content rating. From this, we can see that apps rated for Everyone have the most subjective positive and negative reviews.

In [None]:
colors = ['steelblue','indianred','forestgreen','orange','darkmagenta']
sns.scatterplot(x=df_new["Sentiment_Subjectivity"],y=df_new["Sentiment_Polarity"],
                hue=df_new["Content Rating"], palette = colors)
plt.title("Sentiment Distribution: Subjectivity vs. Polarity by Content Rating ", fontsize =20)
plt.show()

## Sentiment Analysis by App Type
This code generates a countplot that shows the sentiment counts by app type (free and paid). Here we can see that free apps have more positive reviews than paid apps.

In [None]:
sns.countplot(hue=df_new["Sentiment"],x=df_new["Type"],palette = "inferno")
plt.title("Sentiment Analysis by App Type")
plt.show()
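
Raw counts favor whichever type has more reviews, so it can help to compare shares as well. The sketch below uses `pd.crosstab` with `normalize="index"` on a made-up frame (the `toy` rows are hypothetical) to show how the two views can differ.

```python
import pandas as pd

# Hypothetical reviews: many more for free apps than for paid apps
toy = pd.DataFrame({
    "Type": ["Free"] * 6 + ["Paid"] * 2,
    "Sentiment": ["Positive", "Positive", "Positive", "Negative", "Negative",
                  "Neutral", "Positive", "Negative"],
})

# Raw counts per type and sentiment
counts = pd.crosstab(toy["Type"], toy["Sentiment"])

# Row-wise shares: each row sums to 1, so types of different size are comparable
shares = pd.crosstab(toy["Type"], toy["Sentiment"], normalize="index")

print(counts)
print(shares.round(2))
```

In this toy data free apps have three times as many positive reviews as paid apps, yet the positive share is 0.5 for both. The same comparison on `df_new` would show whether the difference seen in the plot holds in relative terms.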

## Sentiment Analysis by App Category
This code generates a countplot that shows the sentiment counts by app category.

In [None]:
plt.figure(figsize=(16,8))
fig=sns.countplot(hue=df_new["Sentiment"],x=df_new["Category"], palette = "plasma")
plt.xticks(rotation=90, fontsize=7)
plt.title("Sentiment Analysis by App Category", fontsize = 20)
plt.show()