<a href="https://colab.research.google.com/github/KostasTheOne/Mobile-Apps-Project/blob/main/Profitable_Apps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Profitable App Analysis for the App Store and Google Play Markets


Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

# Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.
Collecting data for over four million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first check whether any relevant existing data is available at no cost. Luckily, there are two data sets that seem suitable for our purpose:

A [data set](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play.

A [data set](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store.

In [None]:
import pandas as pd

In [None]:
apple_data = pd.read_csv("/content/AppleStore.csv")
apple_data.head()

In [None]:
apple_data.describe()

In [None]:
apple_data.shape

In [None]:
android_data = pd.read_csv("/content/googleplaystore.csv")
android_data.head()

In [None]:
android_data.describe()

As we can see, there are some irregularities in our data. For instance, the maximum value in the Rating column is 19, which is clearly incorrect since ratings should range from 0 to 5. Additionally, the output of the describe() function only shows the Rating column, indicating that it is the only numeric column in the dataset. To make our analysis more meaningful, we will attempt to convert other columns (such as Reviews, Size, Installs, and Price) into numeric formats where appropriate.

In [None]:
android_data.dtypes

In [None]:
android_data.shape

We will convert the Reviews column to numeric values and then verify the changes by using the describe() function again.

In [None]:
android_data["Reviews"] = pd.to_numeric(android_data["Reviews"], errors="coerce")
android_data.describe()
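
To see what `errors="coerce"` does in the conversion above, here is a minimal sketch on a toy Series. The values are made up for illustration and are not taken from the dataset; the idea is that anything that cannot be parsed as a number becomes NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical review counts; the last entry is deliberately not a plain number
reviews = pd.Series(["159", "967", "3.0M"])

# Unparseable entries are replaced with NaN rather than raising an error
numeric_reviews = pd.to_numeric(reviews, errors="coerce")

print(numeric_reviews)
print(numeric_reviews.isna().sum())  # how many entries could not be converted
```

Counting the NaN values after the conversion is a quick way to check how many entries were silently coerced.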


If we attempt to convert the Installs column to numeric, a ValueError will occur, indicating that the string "Free" cannot be converted to a number. This shows that the column contains non-numeric values, which must be identified and removed from the dataset before conversion.

In [None]:
android_data["Installs"] = android_data["Installs"].str.replace(",", "", regex=True)

In [None]:
# This attempt raises a ValueError because one entry still contains the string "Free"
android_data["Installs"] = android_data["Installs"].str.replace(r"\+", "", regex=True).astype(int)
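
The failure described above can be reproduced on a small toy Series. The values below are hypothetical and only mimic the Installs column after commas and plus signs are stripped; the point is that `astype(int)` fails as soon as a single entry is non-numeric.

```python
import pandas as pd

# Hypothetical values, not taken from the dataset
installs = pd.Series(["10000", "500000", "Free"])

try:
    installs.astype(int)
    conversion_failed = False
except ValueError as err:
    # The whole conversion fails because of the one non-numeric entry
    conversion_failed = True
    print("Conversion failed:", err)
```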

We identify rows in the 'Installs' column that still contain non-numeric values
even after removing commas and plus signs. We use str.isnumeric() to check
which entries are purely numeric. The tilde (~) negates the condition, so
we select rows that are NOT numeric. We then print the app name, installs,
and type columns to inspect the problematic entries.

In [None]:
non_numeric_installs = android_data[~android_data["Installs"].str.replace(",", "").str.replace(r"\+", "", regex=True).str.isnumeric()]
print(non_numeric_installs[["App", "Installs", "Type"]])
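
The same filtering idea can be tried on a toy DataFrame. This is a sketch with made-up app names and values, assuming the column holds strings such as "10,000+"; it only illustrates how the negated `str.isnumeric()` mask isolates the offending row.

```python
import pandas as pd

# Hypothetical data: one row has a non-numeric value in Installs
toy = pd.DataFrame({
    "App": ["App A", "App B", "App C"],
    "Installs": ["10,000+", "Free", "500+"],
})

# Strip commas and plus signs as literal characters (no regex needed)
cleaned = toy["Installs"].str.replace(",", "", regex=False).str.replace("+", "", regex=False)

# True where the cleaned value is NOT made up of digits only
mask = ~cleaned.str.isnumeric()
non_numeric = toy[mask]
print(non_numeric)
```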


We then inspect the problematic row in our data.

In [None]:
android_data.loc[10472]

This row contains incorrect values and explains the irregularities we noticed at the start of our Google Play dataset analysis. We therefore remove the entire row and reset the dataset's index.

In [None]:
android_data.drop(index=10472, inplace=True)

In [None]:
android_data.reset_index(drop=True, inplace=True)

In [None]:
android_data.shape
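
As a small illustration of the drop-and-reset step, the sketch below uses a toy DataFrame with a hypothetical bad row at label 1. Without `reset_index(drop=True)`, the index would keep a gap where the row was removed.

```python
import pandas as pd

# Hypothetical data with a bad row at index label 1
toy = pd.DataFrame({"App": ["App A", "Broken row", "App C"]})

# Remove the row by its label, then renumber the index from 0
toy = toy.drop(index=1).reset_index(drop=True)

print(toy.index.tolist())
print(toy["App"].tolist())
```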

In [None]:
android_data["Installs"] = android_data["Installs"].str.replace(r"\+", "", regex=True).astype(int)

In [None]:
print(android_data["Installs"])

In [None]:
android_data.describe()

In [None]:
android_data["Price"] = android_data["Price"].str.replace(r"\$", "", regex=True).astype(float)

In [None]:
android_data.duplicated().sum()

In [None]:
android_data[android_data.duplicated()]

In [None]:
apple_data.duplicated().sum()

We observe that the Android dataset contains duplicate entries, whereas the Apple dataset appears to be clean. It is important to clearly define what we mean by duplicate values. Using the code above, we identify rows that are identical across all columns. However, there may also be apps that share the same name but have different values in other columns, and these are not captured by this definition of duplicates.
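
The difference between the two definitions of a duplicate can be seen on a toy DataFrame. The data below is hypothetical: "App A" appears twice with identical rows, while "App B" appears twice with different review counts.

```python
import pandas as pd

# Hypothetical data, not taken from the dataset
toy = pd.DataFrame({
    "App": ["App A", "App A", "App B", "App B"],
    "Reviews": [100, 100, 50, 75],
})

# Rows identical across all columns (only the second "App A" row qualifies)
full_row_duplicates = toy.duplicated().sum()

# Rows that repeat an app name, regardless of the other columns
name_duplicates = toy.duplicated(subset=["App"]).sum()

print(full_row_duplicates, name_duplicates)
```

By default, `duplicated()` marks every repeat after the first occurrence, so each count reflects the number of extra rows rather than the number of affected apps.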

In [None]:
duplicated_values = android_data[android_data.duplicated(subset=["App"], keep=False)]

In [None]:
print(duplicated_values)

In [None]:
android_data[android_data["App"] == "Instagram"]

In [None]:
# Collect every app name that appears more than once
duplicated_apps = []
unique_apps = []

for app in android_data["App"]:
  if app in unique_apps:
    # Name already seen, so this row is a repeat
    duplicated_apps.append(app)
  else:
    unique_apps.append(app)
print(len(duplicated_apps))
print(duplicated_apps[:10])

We should delete the duplicates, but not at random. As the "Instagram" example shows, the only difference between the duplicate rows is the number of reviews. This suggests the data was collected at different times, so we are going to keep only the row with the most reviews for each app, which should be the most recent entry.
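
One possible way to express this rule in pandas is sketched below on a toy DataFrame with made-up values. It sorts by review count and keeps the first row per app name; this is an illustrative alternative and not necessarily the approach taken in the cells that follow.

```python
import pandas as pd

# Hypothetical data: "App A" appears twice with different review counts
toy = pd.DataFrame({
    "App": ["App A", "App A", "App B"],
    "Reviews": [100, 250, 40],
})

deduplicated = (
    toy.sort_values("Reviews", ascending=False)   # highest review count first
       .drop_duplicates(subset="App", keep="first")  # keep the top row per app
       .sort_index()                               # restore the original row order
       .reset_index(drop=True)
)
print(deduplicated)
```

This assumes the Reviews column has already been converted to a numeric type; sorting a string column would order the values lexicographically instead.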

In [None]:
max_reviews = {}
for app in android_data["App"]:
  if app in
