<a href="https://colab.research.google.com/github/Pooja-Dalwani/PlayStoreAppReviewAnalysisEDA/blob/main/Capstone_Project_PlayStoreAppReviewAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. </b>

## <b> Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps.</b>

## <b> Explore and analyze the data to discover key factors responsible for app engagement and success. </b>

In this era of rapidly growing technology, there is an app for nearly every need. From the stock market to food delivery, apps have got us covered. Whether you need to travel, invest, track your expenses, maintain a healthy lifestyle, or simply entertain yourself, there is an app for it.

These needs are fulfilled by various businesses. Through an app, a business can gain considerably in customer loyalty and engagement. As needs grow, the number of apps and categories is surging. As of 2022, Google Play Store is the biggest store, offering a total of 3.48 million apps across 33 categories. That's a lot, but not all of them survive the competition.

We have two datasets from Google Play Store: 1) apps and their information, and 2) app reviews. Our aim is to explore the factors on which the success of an app depends. We shall go about it in the following sequence:



1.   **Data cleaning**
2.   **Exploration and Analysis**
3. **Inferences and conclusion**

Without further ado, let's begin!

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from io import StringIO

In [None]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Creating a common path
path = '/content/drive/MyDrive/'

In [None]:
# Creating dataframes for both data sets
playstore_df = pd.read_csv(path + 'Copy of Play Store Data.csv')
reviews_df = pd.read_csv(path + 'Copy of User Reviews.csv')

In [None]:
# Storing a copy of the playstore dataset in a temporary variable,
# so that in-place changes to temp_df do not alter playstore_df
temp_df = playstore_df.copy()
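
When keeping a working version of a DataFrame, it helps to know that plain assignment and `.copy()` behave differently. The sketch below illustrates this on a made-up frame; `original`, `alias`, and `independent` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy frame standing in for the Play Store data (values are made up)
original = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.1, 3.9, 4.7]})

alias = original               # a second name for the same object
independent = original.copy()  # a separate object with its own data

# Dropping through the alias also removes the row from `original`
alias.drop(0, axis=0, inplace=True)

# Dropping from the copy leaves `original` untouched
independent.drop(1, axis=0, inplace=True)

print(alias is original)        # the two names refer to one object
print(list(original.index))     # row 0 is gone here as well
print(list(independent.index))  # only row 1 is gone from the copy
```

With plain assignment, every in-place change made through the second name is also visible through the first one.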

In [None]:
temp_df.shape

(10841, 13)

In [None]:
temp_df.head(10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
8403,"""i DT"" Fútbol. Todos Somos Técnicos.",29,4.3,27,3.6M,500,1,0.0,2,Sports,1073,0.22,4.1 and up
8061,+Download 4 Instagram Twitter,28,4.5,40467,22M,1000000,1,0.0,2,Social,1371,5.03,4.1 and up
291,- Free Comics - Comic Apps,6,3.5,115,9.1M,10000,1,0.0,4,Comics,1351,5.0.12,5.0 and up
4086,.R,30,4.5,259,203k,10000,1,0.0,2,Tools,224,1.1.06,1.5 and up
4181,/u/app,7,4.7,573,53M,10000,1,0.0,4,Communication,1341,4.2.4,4.1 and up
5483,058.ba,22,4.4,27,14M,100,1,0.0,2,News & Magazines,1344,1.0,4.2 and up
9770,1. FC Köln App,29,4.6,2019,41M,100000,1,0.0,2,Sports,1358,1.13.0,4.4 and up
1225,10 Best Foods for You,16,4.0,2490,3.8M,500000,1,0.0,3,Health & Fitness,850,1.9,2.3.3 and up
8012,10 Minutes a Day Times Tables,12,4.1,681,48M,100000,1,0.0,2,Education,193,1.2,2.2 and up
7268,10 WPM Amateur ham radio CW Morse code trainer,7,3.5,10,3.8M,100,2,1.49,2,Communication,1289,2.1.4,2.1 and up


So now we know that the data set contains a total of **10841** rows and **13** columns. The columns are:



1.   **App** - The name of the app 
2.   **Category** - The category a particular app belongs to, e.g., 'Game', 'Travel and local', 'Dating' etc.
3. **Rating** - The average user rating at the time the data was extracted.
4. **Reviews** - The number of reviews an app has received.
5. **Size** - The space an app will occupy in your storage if you install it.
6. **Installs** - Number of times an app has been downloaded since its launch.
7. **Price** - The amount you have to pay if you purchase the app. 
8. **Type** - Whether it's a free or paid app.
9. **Content Rating** - The age group for which the content of an app is suitable, e.g., content rated "Mature 17+" is intended for users aged 17 and above.
10. **Genres** - Similar to Category.
11. **Last Updated** - The date on which new additions/features were last introduced in the app.
12. **Current Ver** - The current version of the app.
13. **Android Ver** - The minimum Android version a device needs for the app to perform well.

Having been familiarized with the data set a bit, let's begin our journey.

## 1. Data cleaning

In this step we are going to:


>(i) Remove unnecessary columns and rows<br>
(ii)  Check the data type of variables and if required convert them<br>
(iii) Remove duplicate and repetitive entries, if any<br>
(iv) Treat null values
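
As a rough illustration of steps (iii) and (iv), the sketch below counts duplicates and missing values on a made-up frame. `toy_df` and its rows are hypothetical; only the column names mirror the Play Store data.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names mirror the Play Store data
toy_df = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta", "Gamma"],
    "Rating": [4.2, 4.2, np.nan, 3.8],
    "Type": ["Free", "Free", "Paid", np.nan],
})

# (iii) Count and drop rows that repeat an earlier row exactly
n_duplicates = int(toy_df.duplicated().sum())
deduped_df = toy_df.drop_duplicates().reset_index(drop=True)

# (iv) Count missing values per column before deciding how to treat them
null_counts = deduped_df.isnull().sum()

print(n_duplicates)
print(null_counts)
```

Whether nulls are then dropped or filled depends on the column, so this sketch stops at counting them.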

For this, let us first check the unique values in the variables concerned.

In [None]:
temp_df['Category'].unique()

array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION',
       '1.9'], dtype=object)

In [None]:
temp_df['Rating'].unique()

array([ 4.1,  3.9,  4.7,  4.5,  4.3,  4.4,  3.8,  4.2,  4.6,  3.2,  4. ,
        nan,  4.8,  4.9,  3.6,  3.7,  3.3,  3.4,  3.5,  3.1,  5. ,  2.6,
        3. ,  1.9,  2.5,  2.8,  2.7,  1. ,  2.9,  2.3,  2.2,  1.7,  2. ,
        1.8,  2.4,  1.6,  2.1,  1.4,  1.5,  1.2, 19. ])

In [None]:
temp_df['Size'].unique()

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
       '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
       '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
       '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
       '4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
       '4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
       '23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
       '8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
       '5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
     

In [None]:
temp_df['Installs'].unique()

array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
       '10+', '1+', '5+', '0+', '0', 'Free'], dtype=object)

In [None]:
temp_df['Type'].unique()

array(['Free', 'Paid', nan, '0'], dtype=object)

In [None]:
# Checking the datatype of variables
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


Here we can see that "Reviews", "Size", "Installs", "Price", and "Last Updated" are stored as strings (dtype `object`). To perform the necessary analysis, we will have to convert them to numeric or date types as required.
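
Before converting a string column to numbers, it can help to find the entries that would not parse. The following is a minimal sketch on a made-up series in the style of the "Reviews" column; `reviews` and its values are assumptions for illustration, not rows taken from the dataset.

```python
import pandas as pd

# Made-up sample in the style of the 'Reviews' column, stored as strings
reviews = pd.Series(["159", "967", "3.0M", "87510"])

# errors="coerce" turns anything unparseable into NaN instead of raising
as_numbers = pd.to_numeric(reviews, errors="coerce")

# Entries that were present but failed to parse
bad_entries = reviews[as_numbers.isna() & reviews.notna()]
print(bad_entries.tolist())
```

Any entry listed this way needs to be repaired or removed before `astype(int)` can succeed on the column.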

In [None]:
# Extracting the row with element '1.9' in Category column 
temp_df[(temp_df['Category'] == '1.9')]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


In the row above, the 'Category' value is missing, so all subsequent values have shifted one column to the left. We are going to remove the row altogether.
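
Misaligned rows like this one can also be located with a pattern check rather than by eye. The sketch below assumes that valid categories consist only of upper-case letters and underscores, as in the `unique()` output earlier; `toy_df` and its rows are hypothetical.

```python
import pandas as pd

# Made-up rows; the middle one imitates a row whose values shifted left
toy_df = pd.DataFrame({
    "App": ["App one", "Photo frame", "App three"],
    "Category": ["TOOLS", "1.9", "GAME"],
})

# Assumption: a valid category is made of upper-case letters and underscores
looks_valid = toy_df["Category"].str.match(r"^[A-Z_]+$")

suspect_rows = toy_df[~looks_valid]  # rows to inspect by hand
cleaned_df = toy_df[looks_valid]     # rows that pass the check

print(suspect_rows)
```

Inspecting `suspect_rows` before dropping anything keeps the decision to remove a row explicit.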

In [None]:
temp_df.drop(10472,axis=0,inplace=True)

In [None]:
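# For 'Reviews'
temp_df['Reviews'] = temp_df['Reviews'].astype(int)

# For 'Installs': strip the thousands separators and the trailing '+'
# regex=False treats ',' and '+' as literal characters, not as regex patterns
temp_df['Installs'] = temp_df['Installs'].str.replace(',', '', regex=False)
temp_df['Installs'] = temp_df['Installs'].str.replace('+', '', regex=False)
temp_df['Installs'] = temp_df['Installs'].astype(int)

# For 'Price': strip the literal '$' sign
temp_df['Price'] = temp_df['Price'].str.replace('$', '', regex=False)
temp_df['Price'] = temp_df['Price'].astype(float)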

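# For 'Size': express every size in megabytes
temp_df['Size'] = temp_df['Size'].replace("Varies with device", "0")

sizes_list = temp_df['Size']

list_of_new_sizes = []
for size in sizes_list:
  # Compare for equality: a substring test would also zero out sizes like '10M' or '201k'
  if size == '0':
    size = float(0)
  elif 'k' in size:
    # Kilobytes: strip the suffix and convert to megabytes
    size = size.replace('k', '')
    size = float(size)
    size = size/1024
  elif 'M' in size:
    # Megabytes: strip the suffix
    size = size.replace('M', '')
    size = float(size)

  list_of_new_sizes.append(size)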

#For 'Last Updated'
temp_df['Last Updated'] = pd.to_datetime(temp_df['Last Updated'])
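
The size conversion can also be written without an explicit loop. The sketch below is one possible vectorized version on a made-up series, assuming every size is either a number followed by `k` or `M`, or the text "Varies with device"; `sizes` and `size_mb` are hypothetical names.

```python
import pandas as pd

# Made-up sample in the style of the 'Size' column
sizes = pd.Series(["19M", "201k", "Varies with device", "8.7M", "10M"])

# Split each entry into a number and a unit suffix; non-matching entries become NaN
parts = sizes.str.extract(r"^([\d.]+)([kM])$")
numbers = pd.to_numeric(parts[0])

# Megabytes per unit: 'k' values are divided by 1024, 'M' values are kept as-is
factors = parts[1].map({"k": 1 / 1024, "M": 1.0})

# 'Varies with device' did not match the pattern; fill it with 0 to mirror the loop
size_mb = (numbers * factors).fillna(0.0)
print(size_mb.tolist())
```

Because the unit is matched as a suffix, a size such as `10M` is never mistaken for a zero entry.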

  
