## Google Play Store Apps

## Motivation
An entrepreneur has approached us with the intention of investing in a new mobile application.
The requirement is to find unsaturated markets where there is the least competition.
To do this, we will first keep only the categories whose market share is less than 5%.
We will then identify the five categories whose average rating is the lowest.
Finally, we will exclude any of those categories that already have a relatively high number of installations.

To conclude, we will recommend the categories to invest in, along with a price, content rating and size.
We also provide a word cloud built from all the application names to help with the branding.

### Data Exploration

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS

# Here we load our app store data using pandas and print a sample of the data so that we can become more familiar with it.
store = pd.read_csv('/kaggle/input/google-play-store-apps/googleplaystore.csv')
store.sample(10)

In [None]:
# Here we inspect the dimensions of the data set
store.shape

In [None]:
# Here we inspect the column data types. It is clear that several columns need to be converted:
# 'Price', 'Installs', 'Size' and 'Last Updated'.
# We will perform these type conversions in the next blocks
store.info()

## Data Cleaning

### Last Updated

In [None]:
# Here we update the 'Last Updated' type from 'object' to 'datetime'
store['Last Updated'] = pd.to_datetime(store['Last Updated'], format='mixed', errors='coerce')
store['Last Updated'].unique()

### Price

In [None]:
# Here we clean up the 'Price' column and convert it to float without the dollar sign
# We also found stray records with 'Everyone' in the 'Price' column; we delete these too.

# .copy() avoids pandas' SettingWithCopyWarning when we assign to columns below
store = store[store['Price'] != 'Everyone'].copy()
# regex=False so that '$' is treated as a literal character rather than a regex anchor
store['Price'] = store['Price'].str.replace('$', '', regex=False)
store['Price'] = store['Price'].str.replace(' ', '0', regex=False)
store['Price'] = store['Price'].str.replace(',', '', regex=False)
store['Price'] = store['Price'].astype(float)
store.Price.unique()

### Installs

In [None]:
# Here we clean up the 'Installs' column and convert it to float without the plus sign and commas
items_mude = ['+', ',', '$']
cols = ['Installs']

for item in items_mude:
    for col in cols:
        # regex=False so that '+' and '$' are treated as literal characters
        store[col] = store[col].str.replace(item, '', regex=False)

store['Installs'] = store['Installs'].astype(float)
store.Installs.unique()
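
The 'Price' and 'Installs' cells follow the same pattern: strip a few literal characters, then convert to a number. A minimal sketch of that pattern as a reusable helper is shown below. The function name `clean_numeric` and the toy values are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def clean_numeric(series, remove=('+', ',', '$')):
    """Strip literal characters from a string column and convert it to float."""
    cleaned = series.astype(str)
    for token in remove:
        # regex=False treats '+' and '$' as plain characters, not regex operators
        cleaned = cleaned.str.replace(token, '', regex=False)
    # Values that still cannot be parsed (e.g. 'Free') become NaN instead of raising
    return pd.to_numeric(cleaned, errors='coerce')

# Toy values standing in for the real columns
toy = pd.Series(['1,000+', '500,000+', '$4.99', 'Free'])
result = clean_numeric(toy)
print(result.tolist())
```

Using `errors='coerce'` means a stray value shows up as NaN in `isna().sum()` rather than stopping the notebook, which may or may not be the behaviour you want.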

### Size

In [None]:
# Here we convert the 'Size' column to megabytes

# Split each value into its number and its unit suffix ('M' for megabytes, 'k' for kilobytes).
# Values that do not match, such as 'Varies with device', become NaN.
size_parts = store['Size'].str.extract(r'^([\d.,]+)\s*([Mk])$')
size_value = pd.to_numeric(size_parts[0].str.replace(',', '', regex=False), errors='coerce')

# Kilobyte values are divided by 1000 so that every size is expressed in megabytes
size_factor = size_parts[1].map({'M': 1.0, 'k': 1 / 1000})
store['Size'] = size_value * size_factor

# Replace NaN ('Varies with device') with 0
store['Size'] = store['Size'].fillna(0)

store.Size.unique()

In [None]:
# We show how many empty cells there are in each column
print(store.shape)
store.isna().sum()

In [None]:
# Next we drop all rows with empty values, such as missing ratings, current versions and Android versions
# We are not concerned with losing this data since we want to consider only applications that are well defined and that have ratings.
store = store.dropna()

In [None]:
# Here we can see that no column contains empty values any more
# We can also see that we successfully removed 1481 applications from our dataset
print(store.shape)
store.isna().sum()

## Data Analysis

### Category Distribution

### A bar chart showing the number of applications per category on the store

In [None]:
# Here we create a bar chart showing the number of applications per category 
category = store['Category'].value_counts()

top3_cat = category.index[:3]
colors = ['steelblue' if cat in top3_cat else 'lightgray' for cat in category.index]

plt.figure(figsize=(10,8))
sns.barplot(y=category.index,x=category.values, orient='h', palette=colors)

for index, value in enumerate(category.values):
    plt.text(value + 2, index, str(value), fontsize=8, va='center')

plt.title('Number of apps per category', size=12)
plt.yticks(fontsize=8)
plt.xticks([])

plt.show()

### A Pie chart showing number of applications per category in percentage

In [None]:
# We want to create an application in a market that is not saturated
# For this we begin by finding the distribution of application categories across the store (categories with less than 5% interest us)

plt.figure(figsize=(10, 10))
value_counts = store['Category'].value_counts()

plt.pie(value_counts, labels=value_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Categories')
plt.show()



## Keeping only applications in categories with market share less than 5%

In [None]:
# Here we filter OUT the categories whose share is greater than 5% of all apps
store_unsaturated = store[~store["Category"].isin(['FAMILY', 'GAME', 'TOOLS'])]
store_unsaturated.Category.unique()
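
The three categories above are hard-coded. As a hedged alternative, the sketch below derives the list from the data by computing each category's share of all apps. The helper name `unsaturated_categories`, the `threshold` parameter and the toy DataFrame are assumptions for illustration; on the real `store` data the result may differ from the hard-coded list.

```python
import pandas as pd

def unsaturated_categories(df, threshold=0.05):
    """Return the categories whose share of all apps is below `threshold`."""
    # normalize=True turns the counts into fractions of the total number of rows
    share = df['Category'].value_counts(normalize=True)
    return sorted(share[share < threshold].index)

# Toy data standing in for the real `store` DataFrame:
# FAMILY holds 60% of the apps, GAME 30% and DATING 10%
toy = pd.DataFrame({'Category': ['FAMILY'] * 6 + ['GAME'] * 3 + ['DATING']})

# A 20% threshold is used here only because the toy data has three categories
small = unsaturated_categories(toy, threshold=0.2)
toy_unsaturated = toy[toy['Category'].isin(small)]
print(small)
```

In the notebook, the equivalent call would be `store[store['Category'].isin(unsaturated_categories(store))]`, which keeps the 5% cut-off in one place if the dataset is refreshed.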

## Finding the distribution of average price across categories

In [None]:
# Here we find the average price per category among the unsaturated categories
# Let's create a histogram so that we can choose a logical market price
means = store_unsaturated.groupby('Category')['Price'].mean()
plt.hist(means, bins=20)
plt.xlabel("Price")
plt.ylabel('Number of Categories')
plt.show()

## Finding the five categories with the lowest average ratings

In [None]:
# Next we print the five application categories with the lowest average rating
mask = store_unsaturated.groupby('Category')['Rating'].mean()
mask = mask.sort_values(ascending=True)
print(mask.iloc[: 5])

## Here we continue by keeping only the categories that have low installations

After inspecting the graph below, we find that 'Travel and Local' and 'Video Players' have a relatively high number of installations.
They will be excluded from our category recommendations.

In [None]:
# Total installs per category, sorted ascending so that the largest bar is drawn at the top
category_installs = store_unsaturated.groupby('Category')['Installs'].sum().sort_values(ascending=True)

plt.figure(figsize=(8,6))
category_installs.plot(kind='barh', color='skyblue')
plt.title('Total installs per category')
plt.xlabel('Total Installs')
plt.ylabel('Categories')
plt.show()
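
The shortlist is chosen by reading the rating table and the install chart by eye. The sketch below shows one way the two steps could be combined in code. The cut-off used here (the median of total installs across categories) is an assumption made for illustration, not the rule used in this notebook, and `shortlist_categories` is a hypothetical helper.

```python
import pandas as pd

def shortlist_categories(df, n_lowest=5):
    """Take the n lowest-rated categories, then drop those with high installs."""
    stats = df.groupby('Category').agg(
        mean_rating=('Rating', 'mean'),
        total_installs=('Installs', 'sum'),
    )
    # Assumed cut-off: the median of total installs across all categories
    install_cutoff = stats['total_installs'].median()
    lowest_rated = stats.nsmallest(n_lowest, 'mean_rating')
    return lowest_rated[lowest_rated['total_installs'] < install_cutoff]

# Toy data standing in for `store_unsaturated`
toy = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'C', 'D'],
    'Rating': [3.0, 3.4, 3.5, 4.5, 4.8],
    'Installs': [100, 200, 5000, 1000, 50],
})
shortlist = shortlist_categories(toy, n_lowest=2)
print(shortlist)
```

In the toy data, A and B are the two lowest-rated categories, and B is then dropped because its installs are above the cut-off. A different cut-off would be just as defensible, so the result should be checked against the chart.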

## Here we analyze the distribution of Content Rating across categories

In [None]:
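content = store_unsaturated['Content Rating'].value_counts()

# Highlight the most common content rating; the colors must follow `content`, not `category`
top1_content = content.index[:1]
colors = ['steelblue' if rating in top1_content else 'lightgray' for rating in content.index]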


# Plot
plt.figure(figsize=(10,5))
sns.barplot(y=content.index,x=content.values, orient='h', palette=colors)

for index, value in enumerate(content.values):
    plt.text(value, index, str(value), fontsize=10, va='center')

plt.title('Number of applications per content rating', size=12)
plt.ylabel('')
plt.yticks(fontsize=10)
plt.xticks([])

plt.show()

## Here we investigate the distribution of application type ('Free' or 'Paid')

In [None]:
plt.figure(figsize=(6, 6))
value_counts = store_unsaturated['Type'].value_counts()

plt.pie(value_counts, labels=value_counts.index, autopct='%1.1f%%')
plt.title('Distribution of application types')
plt.show()
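
To turn the pie chart into numbers that can inform a price recommendation, a small summary such as the one sketched below may help. It reports the share of free apps and the median price among paid apps. The helper name `pricing_summary` and the toy prices are assumptions for illustration; it relies only on a numeric 'Price' column like the one cleaned earlier.

```python
import pandas as pd

def pricing_summary(df):
    """Share of free apps and median price among the paid ones."""
    paid = df[df['Price'] > 0]
    return {
        'free_share': float((df['Price'] == 0).mean()),
        # None when there are no paid apps at all
        'median_paid_price': float(paid['Price'].median()) if not paid.empty else None,
    }

# Toy prices standing in for `store_unsaturated`
toy = pd.DataFrame({'Price': [0, 0, 0, 0.99, 4.99]})
summary = pricing_summary(toy)
print(summary)
```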

## Here we create a scatter plot showing app size vs. rating

We find that the majority of apps in the dataset are small in size.
We can also see that smaller apps reach the top ratings more often than larger apps.

In [None]:
plt.figure(figsize=(10, 6))

# Scatter plot
plt.scatter(store_unsaturated['Size'], store_unsaturated['Rating'], alpha=0.5, color='orange')
plt.title('App Size vs. Ratings')
plt.xlabel('App Size (MB)')
plt.ylabel('Rating')
plt.grid(True)

plt.show()
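
A scatter plot with many overlapping points can be hard to read, so the observation about size and rating can be cross-checked with a table. The sketch below groups apps into size buckets and reports the count and mean rating per bucket. The bucket edges and the helper name `rating_by_size_bucket` are assumptions chosen for illustration. Note that apps whose size was 'Varies with device' were filled with 0 earlier, so they would land in the first bucket.

```python
import pandas as pd

def rating_by_size_bucket(df, edges=(0, 20, 40, 60, 80, 100)):
    """Count apps and average their rating within each size bucket (in MB)."""
    # include_lowest=True keeps apps whose size is exactly the first edge (0 MB)
    buckets = pd.cut(df['Size'], bins=list(edges), include_lowest=True)
    # observed=True lists only the buckets that actually contain apps
    return df.groupby(buckets, observed=True)['Rating'].agg(['count', 'mean'])

# Toy data standing in for `store_unsaturated`
toy = pd.DataFrame({'Size': [5, 10, 30, 90], 'Rating': [4.5, 4.7, 4.0, 3.8]})
summary = rating_by_size_bucket(toy)
print(summary)
```

Sizes outside the chosen edges are left out of the table, so the edges should be adjusted to the range seen in `store_unsaturated['Size']`.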


## Here we create a word cloud from application name to help us with branding

In [None]:
wordcloud = WordCloud(width=1900,
                      height=1000,
                      stopwords=STOPWORDS,
                      background_color='white').generate(" ".join(store_unsaturated['App']))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
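
A word cloud shows relative prominence but not exact counts. As a rough complement, the sketch below counts single words in app names with `collections.Counter`. The helper name `top_words` and the toy names are assumptions for illustration. Unlike `WordCloud`, which by default also picks up two-word phrases, this sketch counts single words only.

```python
import re
from collections import Counter

def top_words(names, stopwords=frozenset(), n=10):
    """Return the n most frequent words across the given app names."""
    counts = Counter()
    for name in names:
        # Lower-case each name and split it into alphabetic words
        for word in re.findall(r"[a-z']+", str(name).lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)

# Toy names standing in for store_unsaturated['App']
toy_names = ['Photo Editor Pro', 'Free Photo Collage', 'Chat Free', 'Free Maps']
top = top_words(toy_names, n=2)
print(top)
```

In the notebook, the `STOPWORDS` set already imported from `wordcloud` could be passed as `stopwords` so that the counts and the cloud ignore the same common words.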

## Conclusion
Based on this data analysis, we can confidently guide our entrepreneur on how to invest with the following recommendations:

- Categories to invest in should be 'DATING', 'MAPS_AND_NAVIGATION' and 'LIFESTYLE'
- Price should be between 0 and 1 dollars
- Content rating should be 'Everyone'
- Size should be between 0 and 20 MB
- Popular words derived from application names are 'Free', 'App', 'New', 'Pro', 'Mobile', 'Theme', 'Chat', 'Photo Editor', 'Tracker', 'Google' and more.