## <b> The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. </b>

## <b> Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps.</b>

## <b> Explore and analyze the data to discover key factors responsible for app engagement and success. </b>

# **STEP 1: Importing essential libraries and Loading the Dataset**




In this step, import the essential libraries:


*   numpy and pandas for data analysis
*   seaborn and matplotlib.pyplot for data visualization.

Then load the two datasets, which are in .csv format.





In [None]:
# Importing essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Mounting drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# loading the Play Store Data.csv file 
df = pd.read_csv("/content/drive/MyDrive/Almabetter/Capstone project/1. EDA - Playstore App review analysis/Play Store Data.csv")

# loading the User Reviews.csv file 
user_reviews_df = pd.read_csv("/content/drive/MyDrive/Almabetter/Capstone project/1. EDA - Playstore App review analysis/User Reviews.csv")

In [None]:
# View the dataframe df
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [None]:
# View the dataframe user_reviews_df
user_reviews_df.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


This marks the end of STEP 1.

Next, we clean the data.

# **STEP 2: Data Cleaning**



This step is also known as Pre-processing.

In this step, we clean the data.

We look for and handle:
*   NaN/null/missing values
*   Duplicate entries
*   Mismatched entries

We also aim for:
*   Standardized columns
*   Transformed columns (object data type converted to numeric data type)
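
Before fixing anything, it helps to measure how widespread each problem is. The sketch below does this on a small made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not rows from the Play Store data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing rating and one repeated app
toy = pd.DataFrame({
    "App": ["A", "B", "B", "C"],
    "Rating": [4.1, np.nan, 3.9, 4.5],
    "Installs": ["10,000+", "500+", "500+", "1,000+"],
})

null_counts = toy.isna().sum()                   # missing values per column
duplicate_apps = toy["App"].duplicated().sum()   # repeated app names after the first
column_dtypes = toy.dtypes                       # text columns are candidates for conversion

print(null_counts)
print("Duplicate app names:", duplicate_apps)
print(column_dtypes)
```

The same three calls can be run on `df` and `user_reviews_df` to decide which of the cleaning steps below are actually needed.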






## **1) Handling Null values from user_reviews_df DataFrame**

In [None]:
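# Keep only the rows of user_reviews_df where 'Sentiment_Subjectivity' is not NaN.
# .copy() makes the result an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning.
user_reviews_df = user_reviews_df[~user_reviews_df['Sentiment_Subjectivity'].isna()].copy()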

Adding a 'Sentiment_marks' column to user_reviews_df dataframe, which will be used later on.

In [None]:
# Create a function that converts the sentiment label to a number
def change_to_numbers(value):
  if value == 'Positive':
    return 1
  elif value=='Neutral':
    return 0
  elif value == 'Negative':
    return -1
  else:
    return value

In [None]:
# Apply the function to column
user_reviews_df['Sentiment_marks'] = user_reviews_df['Sentiment'].apply(change_to_numbers)

In [None]:
# View the unique values from 'Sentiment_marks' column
user_reviews_df['Sentiment_marks'].unique()

array([ 1,  0, -1])
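
As an alternative to a hand-written function, the same conversion could be expressed as a dictionary lookup with `Series.map`. This is only a sketch on a made-up Series; the names `sentiment` and `sentiment_to_mark` are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the 'Sentiment' column
sentiment = pd.Series(["Positive", "Neutral", "Negative", "Positive"])

# Dictionary lookup; labels missing from the dict would become NaN
sentiment_to_mark = {"Positive": 1, "Neutral": 0, "Negative": -1}
marks = sentiment.map(sentiment_to_mark)

print(marks.tolist())
```

One behavioral difference: `change_to_numbers` returns an unrecognized value unchanged, whereas `map` with a dictionary replaces it with NaN, which makes unexpected labels easier to spot.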

In [None]:
# Adding Sentiment_Score column which is the product of Sentiment_Polarity and Sentiment_Subjectivity
user_reviews_df['Sentiment_Score'] = user_reviews_df['Sentiment_Polarity'] * user_reviews_df['Sentiment_Subjectivity']

## **2) Getting rid of Duplicate entries**
Sometimes a dataset contains duplicate values.
Duplicate values, if large in number, can significantly skew the results of summarization.
Duplicate values also make the dataset unnecessarily bulky.
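
To illustrate how duplicates can skew a summary, here is a minimal sketch on made-up ratings (the `toy` frame is an assumption for illustration only). The app that appears three times pulls the mean toward its own rating.

```python
import pandas as pd

# Hypothetical ratings: app "X" appears three times
toy = pd.DataFrame({
    "App": ["X", "X", "X", "Y", "Z"],
    "Rating": [5.0, 5.0, 5.0, 3.0, 1.0],
})

# Mean over all rows counts app "X" three times
mean_with_duplicates = toy["Rating"].mean()

# Mean over one row per app
mean_without_duplicates = toy.drop_duplicates(subset="App")["Rating"].mean()

print(mean_with_duplicates, mean_without_duplicates)
```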


In [None]:
# Check the data info first
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [None]:
# Check whether there are duplicate entries in App column
boolean = df['App'].duplicated().any()
boolean

True

In [None]:
# Check the count of each app; this shows how many times each app appears in the data.
df['App'].value_counts()

ROBLOX                                                9
CBS Sports App - Scores, News, Stats & Watch Live     8
ESPN                                                  7
Duolingo: Learn Languages Free                        7
Candy Crush Saga                                      7
                                                     ..
Meet U - Get Friends for Snapchat, Kik & Instagram    1
U-Report                                              1
U of I Community Credit Union                         1
Waiting For U Launcher Theme                          1
iHoroscope - 2018 Daily Horoscope & Astrology         1
Name: App, Length: 9660, dtype: int64

In [None]:
# Check the top entry to see whether its duplicate rows are identical across all features.
df[df['App']=='ROBLOX']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1653,ROBLOX,GAME,4.5,4447388,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
1701,ROBLOX,GAME,4.5,4447346,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
1748,ROBLOX,GAME,4.5,4448791,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
1841,ROBLOX,GAME,4.5,4449882,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
1870,ROBLOX,GAME,4.5,4449910,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
2016,ROBLOX,FAMILY,4.5,4449910,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
2088,ROBLOX,FAMILY,4.5,4450855,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
2206,ROBLOX,FAMILY,4.5,4450890,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
4527,ROBLOX,FAMILY,4.5,4443407,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up


In [None]:
# We can't go through each row of data to search for duplicates,
# but the following code removes any duplicates present in the data.
# Sort so that the highest number of reviews is on top.
# Note: 'Reviews' is still an object (string) column at this point, so this sort is lexicographic, not numeric.

df.sort_values("Reviews", ascending=False, inplace=True)

# Drop all duplicate apps from 'df' in place, keeping only the first entry for each app
df.drop_duplicates(subset="App", keep='first', inplace=True)

In [None]:
# Verify the change
df[df['App']=='ROBLOX']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2206,ROBLOX,FAMILY,4.5,4450890,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up


In [None]:
# Verify the change
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9660 entries, 2989 to 4177
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             9660 non-null   object 
 1   Category        9660 non-null   object 
 2   Rating          8197 non-null   float64
 3   Reviews         9660 non-null   object 
 4   Size            9660 non-null   object 
 5   Installs        9660 non-null   object 
 6   Type            9659 non-null   object 
 7   Price           9660 non-null   object 
 8   Content Rating  9659 non-null   object 
 9   Genres          9660 non-null   object 
 10  Last Updated    9660 non-null   object 
 11  Current Ver     9652 non-null   object 
 12  Android Ver     9657 non-null   object 
dtypes: float64(1), object(12)
memory usage: 1.0+ MB


So we have successfully removed the duplicate entries. This can be verified from the number of entries, which dropped from 10841 to 9660, the number of unique app names reported by `value_counts()`.
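
One caveat: `df.info()` reports `Reviews` as an object column, so sorting it compares strings character by character rather than by magnitude. The sketch below, on made-up rows, contrasts that with a numeric sort; the `toy` frame and the `Reviews_num` helper column are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows: the same app listed with different review counts (as text)
toy = pd.DataFrame({
    "App": ["X", "X", "Y"],
    "Reviews": ["999", "10000", "50"],
})

# String sort: "999" ranks above "10000" because "9" > "1"
kept_by_string_sort = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates(subset="App", keep="first")
)

# Numeric sort: convert first; errors="coerce" turns unparseable text into NaN
toy["Reviews_num"] = pd.to_numeric(toy["Reviews"], errors="coerce")
kept_by_numeric_sort = (
    toy.sort_values("Reviews_num", ascending=False)
       .drop_duplicates(subset="App", keep="first")
)

print(kept_by_string_sort[["App", "Reviews"]])
print(kept_by_numeric_sort[["App", "Reviews"]])
```

When the duplicate rows of an app all have review counts with the same number of digits, as in the ROBLOX rows, both sorts keep the same row. They can differ when the digit counts differ.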

We will take care of NaN/null or missing values from the df DataFrame in further sections as we encounter them.

## **3) Handling Mis-matched data**


In [None]:
df.reset_index(inplace=True)

In [None]:
# The easiest way to find mismatched data in the df DataFrame is to run a (rating > 5) query on the "Rating" column, since the maximum rating is 5.
df[df['Rating']>5]

Unnamed: 0,index,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
4484,10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


In [None]:
# Drop the row with the mis-match data using its index
df.drop(df.index[4484],inplace=True)

In [None]:
# Check for the change
df[df['Rating']>5]

Unnamed: 0,index,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver


In [None]:
# Run a (rating < 0) query as well
df[df['Rating']< 0]

Unnamed: 0,index,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver


In [None]:
# View the DataFrame
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,GollerCepte Live Score,SPORTS,4.2,9992,31M,"1,000,000+",Free,0,Everyone,Sports,"May 23, 2018",6.5,4.1 and up
1,Ad Block REMOVER - NEED ROOT,TOOLS,3.3,999,91k,"100,000+",Free,0,Everyone,Tools,"December 17, 2013",3.2,2.2 and up
2,SnipSnap Coupon App,SHOPPING,4.2,9975,18M,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018",1.4,4.3 and up
3,US Open Tennis Championships 2018,SPORTS,4.0,9971,33M,"1,000,000+",Free,0,Everyone,Sports,"June 5, 2018",7.1,5.0 and up
4,DreamTrips,TRAVEL_AND_LOCAL,4.7,9971,22M,"500,000+",Free,0,Teen,Travel & Local,"August 6, 2018",1.28.1,5.0 and up


In [None]:
# Drop the unnecessary index column
df.drop(columns=['index'],inplace=True)

In [None]:
# View the DataFrame
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,GollerCepte Live Score,SPORTS,4.2,9992,31M,"1,000,000+",Free,0,Everyone,Sports,"May 23, 2018",6.5,4.1 and up
1,Ad Block REMOVER - NEED ROOT,TOOLS,3.3,999,91k,"100,000+",Free,0,Everyone,Tools,"December 17, 2013",3.2,2.2 and up
2,SnipSnap Coupon App,SHOPPING,4.2,9975,18M,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018",1.4,4.3 and up
3,US Open Tennis Championships 2018,SPORTS,4.0,9971,33M,"1,000,000+",Free,0,Everyone,Sports,"June 5, 2018",7.1,5.0 and up
4,DreamTrips,TRAVEL_AND_LOCAL,4.7,9971,22M,"500,000+",Free,0,Teen,Travel & Local,"August 6, 2018",1.28.1,5.0 and up


The mismatched data has been removed.
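
The same check can also be written as a single boolean filter, which avoids looking up a row position by hand. This is a sketch on made-up ratings, assuming valid ratings lie between 0 and 5; the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, including an out-of-range value and a missing one
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.2, 19.0, np.nan, 0.5],
})

# Rows whose rating is present but outside the valid 0-5 range
out_of_range = toy["Rating"].notna() & ~toy["Rating"].between(0, 5)

# Keep everything else; missing ratings are preserved for later handling
cleaned = toy[~out_of_range].reset_index(drop=True)

print(cleaned)
```

Passing `drop=True` to `reset_index` discards the old index instead of adding it as a new column, so no separate step is needed to drop an `index` column afterwards.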