<a href="https://colab.research.google.com/github/MaazAnsari-OO7/Play-Store-App-Review-Analysis/blob/main/Play_Store_App_Review_Analysis_Maaz_Ansari_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. </b>

## <b> Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps.</b>

## <b> Explore and analyze the data to discover key factors responsible for app engagement and success. </b>

In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
playstore_data_path = '/content/drive/MyDrive/Capstone Files/EDA/Play Store Data.csv'
user_review_data_path = '/content/drive/MyDrive/Capstone Files/EDA/User Reviews.csv'

In [4]:
playstore_df = pd.read_csv(playstore_data_path)
review_df = pd.read_csv(user_review_data_path)

# Exploring the Play Store and User Review Dataframes.

In [None]:
playstore_df.shape

In [None]:
playstore_df.head()

In [None]:
playstore_df.tail()

In [None]:
playstore_df.info()

In [None]:
review_df.shape

In [None]:
review_df.info()

In [None]:
review_df.head()

In [None]:
review_df.tail()

# Data Cleaning.

In [328]:
df = playstore_df.copy()

# 1- Converting "Reviews" column type from "object" to "int".


In [329]:
df['Reviews']=df['Reviews'].apply(lambda x : eval(x))

SyntaxError: ignored

## Converting Reviews to "int" type raised an error.
## The error shows that one of the values in the Reviews column is "3.0M", which cannot be evaluated as a number.
## So we have to remove that row from the data set.

In [330]:
# Finding the row index of that value.
df[df['Reviews']=='3.0M']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


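## The row at index 10472 contains the value 3.0M. In this row every value is shifted one column to the left (Category holds 1.9 and Rating holds 19.0), so the whole row is malformed.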
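A more general way to locate such entries, without knowing the offending value in advance, is to coerce the column to numbers and inspect whatever fails to parse. The snippet below is only a sketch: `reviews` is a small made-up Series standing in for `df['Reviews']`, not the real data.

```python
import pandas as pd

# Hypothetical sample standing in for df['Reviews']; not the real data.
reviews = pd.Series(['159', '967', '3.0M', '87510'])

# errors='coerce' turns every value that cannot be parsed into NaN.
as_number = pd.to_numeric(reviews, errors='coerce')

# Keep the entries that were present but could not be parsed as numbers.
bad_rows = reviews[as_number.isna() & reviews.notna()]
print(bad_rows)
```

The index of `bad_rows` gives the rows to inspect or drop.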


In [331]:
# Dropping that row using the drop method and resetting the index.
df = df.drop(10472).reset_index(drop=True)

In [333]:
# Evaluating the values.
df['Reviews']=df['Reviews'].apply(lambda x : eval(x))

# 2- Converting "Size" type from "object" to "float".

In [334]:
df.tail()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10835,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10836,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10837,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10838,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device
10839,iHoroscope - 2018 Daily Horoscope & Astrology,LIFESTYLE,4.5,398307,19M,"10,000,000+",Free,0,Everyone,Lifestyle,"July 25, 2018",Varies with device,Varies with device


### It is observed that the "Size" column contains values in MB (M), kB (k) and "Varies with device".


In [335]:
df[df['Size']=='Varies with device'].shape

(1695, 13)

## As we can see, there are **1695** rows that have the value **"Varies with device"** in the **"Size"** column. Replacing this value with the mean would distort the visualizations, so we replace **Varies with device** with np.nan, which is of float type.


In [336]:
def converting_size_into_float(string):
  '''
  Removes the trailing 'M' (MB) or 'k' (kB) from the string and returns the size in MB as a number.
  Sizes given in kB are divided by 1024 and rounded to one decimal place.
  'Varies with device' is replaced with np.nan.
  '''

  if string[-1] == 'M':
    return eval(string.strip('M'))

  elif string[-1] == 'k':
    a = string.strip('k')
    b = str(round(eval(a)/1024,1))
    return eval(b)

  elif string == 'Varies with device':
    # np.nan is a float, so the column stays numeric.
    return np.nan
  else:
    return eval(string)

In [337]:
# Applying defined function.
df['Size']= df['Size'].apply(lambda x : converting_size_into_float(x))
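Because `eval` executes whatever Python expression it is given, a conversion built on `float()` may be a safer choice for data read from a file. The following is a minimal sketch of the same size logic under that assumption; `size_to_mb` and the sample `sizes` Series are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Convert strings such as '19M' or '512k' to a size in MB (float)."""
    if value == 'Varies with device':
        return np.nan
    if value.endswith('M'):
        return float(value[:-1])
    if value.endswith('k'):
        # kB to MB, rounded to one decimal place as in the notebook.
        return round(float(value[:-1]) / 1024, 1)
    return float(value)

# Hypothetical sample values, not the real data.
sizes = pd.Series(['19M', '512k', 'Varies with device', '3.6M'])
size_mb = sizes.apply(size_to_mb)
print(size_mb)
```

A malformed value then raises a `ValueError` from `float()` instead of being executed as code.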

# 3- Converting "Installs" type from "object" to "int".

## In the **Installs** column the values are of **object** type and contain '+' and ','. So we are going to remove '+' and ',' from the values and then convert them into **int** type using **eval**.

In [338]:
# Creating function to remove + and ,.
def remove_plus_and_comma(string):
  '''
  This function removes '+' and ',' from the string.
  '''
  string = string.replace(',','')
  string = string.strip('+')
  return string

In [339]:
# Applying defined function on the column and evaluating those values.
df['Installs'] = df['Installs'].apply(lambda x: eval(remove_plus_and_comma(x)))
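The same cleanup can also be written with pandas string methods, which avoids both the helper function and `eval`. This is a sketch on a made-up `installs` Series (an assumption, not the real column), using a regular expression that matches either character.

```python
import pandas as pd

# Hypothetical sample standing in for df['Installs']; not the real data.
installs = pd.Series(['10,000+', '500,000+', '100+', '0'])

# Remove every '+' and ',' in one pass, then cast the digits to integers.
cleaned = installs.str.replace(r'[+,]', '', regex=True).astype(int)
print(cleaned)
```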

# 4- Converting "Price" type from "object" to "float".

## Values in the **Price** column contain the $ symbol and are of object type. We'll remove the symbol and change the type to float.

In [340]:
# Creating function to remove $ symbol.
def remove_sign(string):
  '''
  This function removes the $ symbol from the string and converts it from 'str' to 'float', rounded to two decimal places.
  '''
  return round(float(string.strip('$')),2)

In [341]:
# Applying the function.
df['price_in_dollar'] = df['Price'].apply(lambda x : remove_sign(x))

# 5- Converting **Last Updated** type from "object" to "datetime".

In [342]:
# Converting str into datetime format using the 'to_datetime' function.
df['Last Updated'] = pd.to_datetime(df['Last Updated'])
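When the date strings all follow one pattern, as the values shown earlier such as "July 25, 2017" suggest, passing an explicit `format` makes the parsing unambiguous. The sketch below assumes that pattern holds for every row and uses a made-up `dates` Series rather than the real column.

```python
import pandas as pd

# Hypothetical sample standing in for df['Last Updated']; not the real data.
dates = pd.Series(['January 7, 2018', 'July 25, 2017', 'February 11, 2018'])

# %B = full month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(dates, format='%B %d, %Y')
print(parsed)
```

A value that does not match the format raises an error, which surfaces malformed dates early.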

# 6- Handling Rating column 

In [343]:
len(df[df['Rating'].isnull()]['Rating'])

1474

## The Rating column contains 1474 NaN values. We cannot drop this many rows from the dataset.
## We are going to replace all the NaN values with the average of the non-null values.

In [344]:
# Finding average of non-null values from Rating column.
non_null_mean= round(df[~df['Rating'].isnull()]['Rating'].mean(),1)

In [345]:
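# Replacing null values with average rating.
# Assigning the result back avoids calling fillna in place on a column selection, which newer pandas versions warn about.
df['Rating'] = df['Rating'].fillna(non_null_mean)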
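To see what mean imputation does to a column, here is a small sketch on a made-up `ratings` Series (an assumption, not the real data). Note that `Series.mean()` skips NaN values by default, so the non-null filter is not strictly required.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for df['Rating']; not the real data.
ratings = pd.Series([4.0, np.nan, 4.5, 3.5, np.nan])

# mean() ignores NaN by default: (4.0 + 4.5 + 3.5) / 3
mean_rating = round(ratings.mean(), 1)

# fillna returns a new Series with the gaps replaced.
filled = ratings.fillna(mean_rating)
print(filled)

# The mean is preserved, but the spread shrinks because the
# imputed values all sit exactly at the mean.
print(ratings.std(), filled.std())
```

This narrowing of the spread is worth keeping in mind when reading rating histograms built from the imputed column.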


# 7- Removing Null value from "Type" column.

## The **Type** column contains only one NaN value.

In [346]:
df[df['Type'].isnull()]['Type']

9148    NaN
Name: Type, dtype: object

## It is observed that the Type column contains a Null value at index 9148.

In [347]:
# Removing the row with index 9148 and resetting the index.
df = df.drop(9148).reset_index(drop=True)

# 8- Removing Null values from "Current Ver".

In [348]:
df[df['Current Ver'].isnull()]['Current Ver'].shape

(8,)

## The number of Null values present in the 'Current Ver' column is 8.


In [349]:
# Removing the rows containing Null values from Dataframe.
df = df[~df['Current Ver'].isnull()]

# 9- Removing Null values from "Android Ver" column.

In [350]:
len(df[df['Android Ver'].isnull()]['Android Ver'])

2

## Number of Null values present in the "Android Ver" column is 2.

In [351]:
# Removing Rows containing Null value from dataframe.
df= df[~df['Android Ver'].isnull()]

## The Android Ver column is of object type and contains many different values. We'll create another column in the Dataframe that stores the minimum Android version required by the App.

In [352]:
# creating a function to obtain minimum version for the App.
def get_ver(string):
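  '''
  Returns the minimum Android version as a number, taken from the first three characters of the string (e.g. '4.0.3 and up' gives 4.0).
  'Varies with device' is mapped to 1.0.
  '''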

  if string =='Varies with device':
    return eval('1.0')
  else:
    string = string[0:3]
    return eval(string)
  




In [353]:
# Applying the function.
df['min_ver']=df['Android Ver'].apply(lambda x : get_ver(x))
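A regular expression can make the "leading major.minor number" rule explicit instead of relying on the first three characters. The sketch below assumes the version strings start with digits, a dot, and digits, and keeps the notebook's choice of 1.0 for anything that does not match; `android_ver` is a made-up sample, not the real column.

```python
import pandas as pd

# Hypothetical sample standing in for df['Android Ver']; not the real data.
android_ver = pd.Series(
    ['4.0.3 and up', '2.2 and up', 'Varies with device', '4.1 - 7.1.1']
)

# Capture the leading "major.minor" part; non-matching rows become NaN.
min_ver = (
    android_ver.str.extract(r'^(\d+\.\d+)', expand=False)
    .astype(float)
    .fillna(1.0)  # same fallback the notebook uses for 'Varies with device'
)
print(min_ver)
```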

# Checking and Removing Duplicate values from the data set.

## The only column in the Dataframe whose values should be **UNIQUE** and never repeat is the **App** column.

In [354]:
# Counting how many times each app name occurs.
a = df['App'].value_counts()

In [355]:
# Finding the number of app names that occur more than once.
len(a[a >= 2])

798

### As we can see above, 798 apps appear more than once in the Dataframe.

In [356]:
# Removing DUPLICATES from the Dataframe.
df=df.drop_duplicates(subset = 'App')
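The number of app names that repeat and the number of rows that `drop_duplicates` removes are two different quantities. The sketch below illustrates the difference on a tiny made-up frame (`apps` is hypothetical, not the real data).

```python
import pandas as pd

# Hypothetical sample; 'A' appears three times and 'B' twice.
apps = pd.DataFrame({
    'App': ['A', 'B', 'A', 'C', 'A', 'B'],
    'Reviews': [10, 5, 12, 7, 12, 5],
})

# How many distinct app names occur more than once.
counts = apps['App'].value_counts()
repeated_names = int((counts > 1).sum())

# How many rows are repeats of an earlier row with the same name.
extra_rows = int(apps.duplicated(subset='App').sum())

# drop_duplicates keeps the first occurrence of each name by default.
deduped = apps.drop_duplicates(subset='App').reset_index(drop=True)
print(repeated_names, extra_rows, len(deduped))
```

Which occurrence is kept can be changed with the `keep` parameter of `drop_duplicates`.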

Resetting index

In [365]:
df=df.reset_index()

In [366]:
df.drop(['index'],axis = 1,inplace=True)

In [382]:
df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,price_in_dollar,min_ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,10000,Free,0,Everyone,Art & Design,2018-01-07,1.0.0,4.0.3 and up,0.0,4.0
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,500000,Free,0,Everyone,Art & Design;Pretend Play,2018-01-15,2.0.0,4.0.3 and up,0.0,4.0
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,5000000,Free,0,Everyone,Art & Design,2018-08-01,1.2.4,4.0.3 and up,0.0,4.0
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25.0,50000000,Free,0,Teen,Art & Design,2018-06-08,Varies with device,4.2 and up,0.0,4.2
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,100000,Free,0,Everyone,Art & Design;Creativity,2018-06-20,1.1,4.4 and up,0.0,4.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9643,Sya9a Maroc - FR,FAMILY,4.5,38,53.0,5000,Free,0,Everyone,Education,2017-07-25,1.48,4.1 and up,0.0,4.1
9644,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6,100,Free,0,Everyone,Education,2018-07-06,1.0,4.1 and up,0.0,4.1
9645,Parkinson Exercices FR,MEDICAL,4.2,3,9.5,1000,Free,0,Everyone,Medical,2017-01-20,1.0,2.2 and up,0.0,2.2
9646,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,,1000,Free,0,Mature 17+,Books & Reference,2015-01-19,Varies with device,Varies with device,0.0,1.0


In [368]:
df.shape

(9648, 15)

### - After removing "Duplicates" and "NaN values" from the Dataframe we now have a modified Dataframe with 9648 rows and 15 columns.

## Cleaning User Review data

In [396]:
r_df = review_df.copy()

In [397]:
r_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     64295 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37432 non-null  object 
 3   Sentiment_Polarity      37432 non-null  float64
 4   Sentiment_Subjectivity  37432 non-null  float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB


In [398]:
r_df.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


# Handling Null values in the User Review Dataframe.

In [399]:
r_df[r_df['Translated_Review'].isnull()].shape

(26868, 5)

## There are 26868 Null values in the **Translated_Review** column.

## We remove the NaN values from the **Translated_Review** column because rows containing them are of no use and we cannot impute review text for this column.
## If there is no review, there is no sentiment.
## Therefore, we will remove all the rows that contain NaN values in the Translated_Review column.

In [400]:
r_df= r_df[~r_df['Translated_Review'].isnull()]

Resetting index

In [401]:
r_df= r_df.reset_index()

In [402]:
r_df.drop(['index'],axis=1,inplace=True)
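The filter, index reset, and removal of the old index column can also be expressed in one chain with `dropna(subset=...)` and `reset_index(drop=True)`. This is a sketch on a small made-up frame (`reviews` is hypothetical), not a rerun of the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the review data.
reviews = pd.DataFrame({
    'App': ['X', 'X', 'Y'],
    'Translated_Review': ['Works great', np.nan, 'Best idea'],
    'Sentiment': ['Positive', np.nan, 'Positive'],
})

# Drop rows without a review, then renumber the index from 0.
# drop=True discards the old index instead of adding it as a column.
cleaned = reviews.dropna(subset=['Translated_Review']).reset_index(drop=True)
print(cleaned)
```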

In [405]:
r_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37427 entries, 0 to 37426
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     37427 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37427 non-null  object 
 3   Sentiment_Polarity      37427 non-null  float64
 4   Sentiment_Subjectivity  37427 non-null  float64
dtypes: float64(2), object(3)
memory usage: 1.4+ MB
