
# **Problem statement**
## <b> The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. </b>

## <b> Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the android apps.</b>

## <b> Explore and analyze the data to discover key factors responsible for app engagement and success. </b> 

Google Play Store is a major app distribution platform for Android users. Thousands of applications for various purposes are available here, so competition in the app market is fierce. To carve out a niche in this market, companies need to act strategically. App developers would benefit greatly from knowing what works and what does not work on the Play Store.

> ***Data comes to our rescue!***

We can make use of historical data on apps in the Play Store. The Play Store apps data available to us contains details such as the category, rating, size and several other attributes of each app. We also have another dataset containing customer reviews of Android apps.
By performing Exploratory Data Analysis on these datasets, we can figure out the key factors responsible for app engagement and success.

So, let's get started!


First, we import the libraries and modules that we will use in this analysis.

## **Importing the required libraries**

In [235]:
#importing the required libraries and modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



## **Bringing in the data**


In [236]:
#mounting the drive containing the data files
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [237]:
#reading the playstore data and user reviews data
PS_df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Capstone projects/Project 1- Playstore Data Analysis/ PlayStore_Data.csv')
UR_df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Capstone projects/Project 1- Playstore Data Analysis/ User_Reviews.csv')


---
## **Data cleaning and preparation** 


***“No data is clean, but most is useful.***”~ Dean Abbott

An important part of the data analysis process is Data Cleaning.  Only when the data has been cleaned can it be analysed and transformed into something beneficial.



>## ***Playstore data***


In [238]:
#To display the first 3 observations of the data 
PS_df.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


In [239]:
#To display the last 5 observations of the data 
PS_df.tail(5)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device
10840,iHoroscope - 2018 Daily Horoscope & Astrology,LIFESTYLE,4.5,398307,19M,"10,000,000+",Free,0,Everyone,Lifestyle,"July 25, 2018",Varies with device,Varies with device


### **Data Description of Playstore dataframe**

The **PS_df** contains **10841** rows and **13** columns in total.
The columns present in the Play Store dataframe are:

1.  **App**: Name of the application
2.  **Category**: Category to which the app belongs
3.  **Rating**: Rating given by users to the app
4.  **Reviews**: The number of reviews received by the app
5.  **Size**: The amount of space required to install the app
6.  **Installs**: The number of installations of the app
7.  **Type**: The type of the app, whether it is free or paid
8.  **Price**: The price of the app
9.  **Content Rating**: The target audience of the app
10. **Genres**: The genre of content offered by the app
11. **Last Updated**: The date on which the app was last updated
12. **Current Ver**: The current version of the app
13. **Android Ver**: The Android version(s) supported by the app


In [240]:
def dataframe_info(df):
  ''' 
  Returns a dataframe displaying the column datatypes,
  count of unique values and count & percent of missing values in the dataframe
  '''
  info_df=df.isnull().sum().reset_index()
  info_df.rename(columns={'index':'Column_name',0:'Count_of_missing_values'},inplace=True)
  info_df['% of  NaN']=round((info_df['Count_of_missing_values']/len(df))*100,2)
  info_df['dtype']=df.dtypes.values
  info_df['Count_of_unique_values']=df.nunique(axis=0).values
  return(info_df)

In [241]:
#To understand the datatypes of features in playstore data and to determine the count of missing values
dataframe_info(PS_df)

Unnamed: 0,Column_name,Count_of_missing_values,% of NaN,dtype,Count_of_unique_values
0,App,0,0.0,object,9660
1,Category,0,0.0,object,34
2,Rating,1474,13.6,float64,40
3,Reviews,0,0.0,object,6002
4,Size,0,0.0,object,462
5,Installs,0,0.0,object,22
6,Type,1,0.01,object,3
7,Price,0,0.0,object,93
8,Content Rating,1,0.01,object,6
9,Genres,0,0.0,object,120


* It is evident that the dataset contains several null values.
* Columns such as Reviews, Size, Installs and Price hold numeric information but are stored with the object datatype, so they need to be converted.

In [242]:
#displaying the unique values of some columns in PS_df
for col in PS_df.columns:
  if PS_df[col].nunique()<500:
    print(f"The unique values in {col} column are :")
    print(PS_df[col].unique())
    print('\n')

The unique values in Category column are :
['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
 'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
 'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
 'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
 'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
 'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
 'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION' '1.9']


The unique values in Rating column are :
[ 4.1  3.9  4.7  4.5  4.3  4.4  3.8  4.2  4.6  3.2  4.   nan  4.8  4.9
  3.6  3.7  3.3  3.4  3.5  3.1  5.   2.6  3.   1.9  2.5  2.8  2.7  1.
  2.9  2.3  2.2  1.7  2.   1.8  2.4  1.6  2.1  1.4  1.5  1.2 19. ]


The unique values in Size column are :
['19M' '14M' '8.7M' '25M' '2.8M' '5.6M' '29M' '33M' '3.1M' '28M' '12M'
 '20M' '21M' '37M' '2.7M' '5.5M' '17M' '39M' '31M' '4.2M' '7.0M' '23M'
 '6.0M' '6.1M' '4.6M' '9.2M' '5.2M' '11M' '24M' 'Va

Besides null values, the data could also contain duplicate values, values with incorrect datatypes and formats, irrelevant columns and outliers. It is important to replace or remove these values before proceeding with the analysis. So let's get to work and treat them right!

**1. Dropping duplicate rows**

In [243]:
#finding the number of duplicates in 'App' column
PS_df.duplicated(['App']).value_counts()

False    9660
True     1181
dtype: int64

In [244]:
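# Updates the playstore dataframe with duplicate rows removed
# Note: 'Reviews' is still stored as object (string) here, so this sort is lexicographic, not numeric
PS_df.sort_values(["App","Reviews"],inplace=True)
PS_df.drop_duplicates(subset="App",keep="last",inplace=True)
PS_df.reset_index(drop=True,inplace=True)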
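
One caveat with the step above: while `Reviews` is stored as strings, sorting on it orders values character by character. The sketch below uses made-up data (the toy frame and the `Reviews_num` helper column are illustrative assumptions, not part of this notebook) to show the difference and one possible way to keep the row with the numerically largest review count.

```python
import pandas as pd

# Toy data: the same app listed three times with review counts stored as strings
toy = pd.DataFrame({
    "App": ["A", "A", "A", "B"],
    "Reviews": ["999", "1000", "25", "7"],
})

# Lexicographic sort: "999" sorts after "1000", so keep="last" keeps "999"
lexical = toy.sort_values(["App", "Reviews"]).drop_duplicates(subset="App", keep="last")

# Numeric sort: coerce to numbers first (unparseable strings become NaN)
toy["Reviews_num"] = pd.to_numeric(toy["Reviews"], errors="coerce")
numeric = toy.sort_values(["App", "Reviews_num"]).drop_duplicates(subset="App", keep="last")

print(lexical[["App", "Reviews"]])
print(numeric[["App", "Reviews"]])
```

With `errors="coerce"`, malformed entries turn into NaN and are sorted last by default, so they would need separate handling before relying on `keep="last"`.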

**2. Handling missing or NaN values**

>**2.1. Treating null value in 'Type' column**

In [245]:
#identifying the app with type NaN
PS_df[PS_df['Type'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2641,Command & Conquer: Rivals,FAMILY,,0,Varies with device,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device,Varies with device


After cross-checking on the Google Play Store, I found that this app is of Type 'Free'.

In [246]:
#Filling the type of the app as Free
PS_df['Type'] = PS_df['Type'].fillna("Free")

>**2.2. Treating null value in 'Content Rating' column**


In [247]:
#identifying the app with Content Rating NaN
PS_df[PS_df['Content Rating'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
5806,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


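The Category value is missing in this row, so every later value has slipped one column to the left (the rating 1.9 appears under Category, for example). The Genres value is missing as well. This app belongs to the 'LIFESTYLE' category and the matching 'Lifestyle' genre.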

In [248]:
#Filling in the values for Category and Genres
PS_df.loc[5806, :] = ['Life Made WI-Fi Touchscreen Photo Frame', 'LIFESTYLE',1.9, 19.0, '3.0M',
       '1,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'February 11, 2018',
       '1.0.19', '4.0 and up']
PS_df.loc[5806, :]

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                        LIFESTYLE
Rating                                                1.9
Reviews                                              19.0
Size                                                 3.0M
Installs                                           1,000+
Type                                                 Free
Price                                                   0
Content Rating                                   Everyone
Genres                                          Lifestyle
Last Updated                            February 11, 2018
Current Ver                                        1.0.19
Android Ver                                    4.0 and up
Name: 5806, dtype: object
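
The row above was repaired by typing every value out by hand. As a hedged alternative, a misaligned row can also be realigned by shifting its values one position to the right and then filling in what is still missing. The toy frame below is made up for illustration and only mimics the shape of the problem.

```python
import numpy as np
import pandas as pd

# Toy frame with one row whose values after 'App' are shifted one column to the left
toy = pd.DataFrame(
    [["Good app", "TOOLS", 4.2, "Free"],
     ["Shifted app", 1.9, "Free", np.nan]],
    columns=["App", "Category", "Rating", "Type"],
)

bad = 1  # index label of the misaligned row
# Shift everything after 'App' one position to the right; 'Category' becomes NaN
toy.loc[bad, "Category":] = toy.loc[bad, "Category":].shift(1)
# The category itself still has to be supplied manually
toy.loc[bad, "Category"] = "LIFESTYLE"

print(toy.loc[bad])
```

Columns that held mixed values keep the object dtype after this repair, so the usual type conversion is still needed afterwards.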

>**2.3. Treating the null values in 'Rating' column**


As seen earlier, the 'Rating' column has 1474 null values, which is a significant share of the data (13.6%). Dropping those rows could distort the analysis, so it is better to replace the null values with the median Rating.

In [249]:
#Computing the median of the 'Rating' column (NaN values are ignored)
rating_median = PS_df['Rating'].median()

In [250]:
#Replacing the NaN values of "Rating" with the median value
PS_df['Rating'] = PS_df['Rating'].fillna(rating_median)
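
To see what median imputation does to a distribution, here is a small sketch on made-up ratings (the values are illustrative, not taken from the dataset). The median stays the same, while the spread shrinks because the filled values all sit at the centre.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.5, 3.0, np.nan, 4.0, np.nan, 5.0])  # made-up values

median_before = ratings.median()        # NaN values are skipped by default
filled = ratings.fillna(median_before)  # returns a new Series

print(median_before, filled.median())   # the median is unchanged
print(ratings.std(), filled.std())      # the standard deviation gets smaller
```

This is worth keeping in mind when reading rating histograms later: a spike at the median partly reflects the imputed values.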

>**2.4. Dropping the null values in 'Android Ver'  and 'Current Ver' columns**

In [251]:
#dropping null values in 'Android Ver' and 'Current Ver' columns
PS_df = PS_df.dropna(subset= ["Current Ver","Android Ver"],how = "any")

In [252]:
#check if the playstore dataframe has any null values to ensure that all modifications are in place
PS_df.isnull().sum()


App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64

Perfect! No more null values.

**3. Correcting data types**

Reviews, Size, Installs, and Price should have numeric datatypes, but they are currently stored as object. Let's convert them to the appropriate datatypes.


>  **3.1. Correcting datatype of 'Reviews' column**

In [253]:
#Changing datatype for "Reviews" column
PS_df['Reviews'] = PS_df['Reviews'].astype(float)


>  **3.2. Correcting datatype of 'Size' column**

In [254]:
#the count of number of apps with the same size 
PS_df['Size'].value_counts()

Varies with device    1227
12M                    181
11M                    181
13M                    177
14M                    176
                      ... 
914k                     1
353k                     1
784k                     1
951k                     1
549k                     1
Name: Size, Length: 457, dtype: int64

In [255]:
# Replacing/removing the string characters in  "Size" column

PS_df['Size'] = PS_df.Size.apply(lambda x: x.replace(',', ''))
PS_df['Size'] = PS_df.Size.apply(lambda x: x.replace('M', 'e+6'))
PS_df['Size'] = PS_df.Size.apply(lambda x: x.replace('k', 'e+3'))

#For the time being, replacing 'Varies with device' with NaN
PS_df['Size'] = PS_df.Size.replace('Varies with device', np.nan)

In [256]:
#Changing datatype for "Size" column
PS_df['Size'] = pd.to_numeric(PS_df['Size'])

In [257]:
#Converting the value in bytes to megabytes for easier interpretation
PS_df['Size']=PS_df['Size']/(10**6) 
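
The conversion above is spread over several cells. As a sketch, the same logic can be expressed as one function; `size_to_mb` is a hypothetical helper name, and it assumes the same decimal convention used above (1M = 1000k).

```python
import numpy as np
import pandas as pd

def size_to_mb(size):
    """Convert strings such as '19M' or '914k' to megabytes; anything else becomes NaN."""
    size = str(size).replace(",", "").strip()
    if size.endswith("M"):
        return float(size[:-1])
    if size.endswith("k"):
        return float(size[:-1]) / 1000  # kilobytes to megabytes
    return np.nan  # e.g. 'Varies with device'

sizes = pd.Series(["19M", "8.7M", "914k", "Varies with device"])
converted = sizes.map(size_to_mb)
print(converted)
```

Keeping the parsing in one place makes it easier to test the edge cases before applying it to the full column.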

In [258]:
#Summary statistics of app sizes; describe() skips NaN values on its own
PS_df['Size'].describe()

count    8423.000000
mean       20.409646
std        21.830924
min         0.008500
25%         4.600000
50%        12.000000
75%        28.000000
max       100.000000
Name: Size, dtype: float64

'Varies with device' is the most common single entry in the Size column (1227 apps). Setting those apps aside, the smallest app in the dataset is 8.5 KB and the largest is 100 MB. Three quarters of the apps are 28 MB or smaller, with a median of 12 MB; a tail of larger apps pulls the mean up to about 20 MB.
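
A note on excluding missing sizes: comparing against `np.nan` with `!=` does not filter anything, because NaN is not equal to any value, including itself. The summary above is still correct, since `describe()` leaves NaN out of its statistics. A short illustration on made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print((s != np.nan).tolist())  # [True, True, True]: the comparison never excludes NaN
print(s.notna().tolist())      # [True, False, True]: the reliable way to test for NaN
print(s.describe()["count"])   # 2.0: describe() counts only non-missing values
```

When an explicit filter is needed, `notna()` or `dropna()` are the usual choices.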


> **3.3. Correcting datatype of 'Installs' column**

In [259]:
#Displaying the unique values in Installs column
PS_df['Installs'].unique()

array(['500+', '1,000,000+', '10,000+', '100+', '100,000+', '500,000+',
       '10,000,000+', '5,000+', '50,000+', '5+', '1,000+', '10+',
       '50,000,000+', '100,000,000+', '5,000,000+', '50+', '0+', '1+',
       '500,000,000+', '0', '1,000,000,000+'], dtype=object)

In [260]:
# Replacing/removing the string characters in "Installs" column
PS_df['Installs']=PS_df['Installs'].apply(lambda x: x.strip('+'))
PS_df['Installs']=PS_df['Installs'].apply(lambda x: x.replace(',',''))

#Changing datatype for "Installs" column
PS_df['Installs']=PS_df['Installs'].astype(int)
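
For reference, the same cleanup can be sketched with pandas' vectorized string methods instead of `apply`. The sample values below are taken from the unique values listed above; the variable names are illustrative.

```python
import pandas as pd

installs = pd.Series(["1,000+", "500,000+", "0", "1,000,000,000+"])

# Remove '+' and ',' in one pass with a regular expression, then convert to integers
cleaned = installs.str.replace(r"[+,]", "", regex=True).astype("int64")
print(cleaned.tolist())  # [1000, 500000, 0, 1000000000]
```

Requesting `int64` explicitly avoids depending on the platform's default integer width.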

In [261]:
# #The counts of apps with same number of installations
# PS_df['Installs'].value_counts()

It looks like the largest number of apps fall in the 1 million installs bracket. There are 20 apps with 100 million downloads.

In [262]:
PS_df['Installs'].describe()

count    9.650000e+03
mean     7.787301e+06
std      5.378459e+07
min      0.000000e+00
25%      1.000000e+03
50%      1.000000e+05
75%      1.000000e+06
max      1.000000e+09
Name: Installs, dtype: float64

About 75% of the apps have at most 1 million installs. The maximum number of installs for an app is 1 billion (1,000,000,000+). At the other end, the minimum is 0, so some apps have not been installed at all.


> **3.4. Correcting datatype of 'Price' column**

In [263]:
# Replacing/removing the string characters in "Price" column
PS_df['Price']=PS_df['Price'].apply(lambda x: x.replace('$',''))

#Changing datatype for "Price" column
PS_df['Price']=PS_df['Price'].astype(float)

> **3.5.  Correcting datatype of 'Last Updated' column**

In [264]:
#Changing the datatype of column "Last Updated" from string to datetime
PS_df['Last Updated']=PS_df['Last Updated'].apply(lambda x : datetime.strptime(x,"%B %d, %Y"))


**4. Sanity checks**

In [265]:
PS_df[PS_df['Reviews']>PS_df['Installs']]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
434,AX Watch for WatchMaker,PERSONALIZATION,4.3,2.0,0.238,1,Paid,0.99,Everyone,Personalization,2017-08-18,1.0,2.3 and up
629,Alarmy (Sleep If U Can) - Pro,LIFESTYLE,4.8,10249.0,,10000,Paid,2.49,Everyone,Lifestyle,2018-07-30,Varies with device,Varies with device
1765,Brick Breaker BR,GAME,5.0,7.0,19.0,5,Free,0.0,Everyone,Arcade,2018-07-23,1.0,4.1 and up
2970,DN Blog,SOCIAL,5.0,20.0,4.2,10,Free,0.0,Teen,Social,2018-07-23,1.0,4.0 and up
3151,DZ Puzzle,FAMILY,4.3,14.0,47.0,10,Paid,0.99,Everyone,Puzzle,2017-04-22,1.2,2.3 and up
5532,KBA-EZ Health Guide,MEDICAL,5.0,4.0,25.0,1,Free,0.0,Everyone,Medical,2018-08-02,1.0.72,4.0.3 and up
6349,Mu.F.O.,GAME,5.0,2.0,16.0,1,Paid,0.99,Everyone,Arcade,2017-03-03,1.0,2.3 and up
7301,RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템,FAMILY,4.3,4.0,64.0,1,Free,0.0,Everyone,Education,2018-07-17,1.0.1,4.4 and up
7314,Ra Ga Ba,GAME,5.0,2.0,20.0,1,Paid,1.49,Everyone,Arcade,2017-02-08,1.0.4,2.3 and up
7643,Sam.BN Pro,TOOLS,4.3,11.0,2.0,10,Paid,0.99,Everyone,Tools,2015-03-27,1.0.0,4.0.3 and up


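An app should not have more reviews than installs, so let's remove these entries. Note that Installs is a lower bound (for example '10,000+'), so a few of these rows may in fact be legitimate.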



In [266]:
PS_df.drop(PS_df[PS_df['Reviews']>PS_df['Installs']].index , inplace=True)

Are there any entries with ratings less than zero or greater than 5?

In [267]:
PS_df[(PS_df['Rating']>5) | (PS_df['Rating']<0)]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver


There are no such  entries.

In [268]:
#Statistical summary of numerical columns
PS_df.describe()

Unnamed: 0,Rating,Reviews,Size,Installs,Price
count,9639.0,9639.0,8413.0,9639.0,9639.0
mean,4.191493,217035.7,20.409713,7796186.0,1.100396
std,0.496792,1833271.0,21.833382,53814630.0,16.86958
min,1.0,0.0,0.0085,0.0,0.0
25%,4.0,25.0,4.6,1000.0,0.0
50%,4.3,975.0,12.0,100000.0,0.0
75%,4.5,29473.5,28.0,1000000.0,0.0
max,5.0,78158310.0,100.0,1000000000.0,400.0


The datatypes of all the columns have been corrected.

The cleaned Playstore dataset has 9639 entries. Before moving on to the User Review dataset, let's derive a few helper columns.

### **Creating new columns**

In [269]:
# segment and sort the numeric data values in Size column into bins
#Creating Size_groups
def size_group(app_size):
  '''
  Maps an app size in MB to a size-group label.
  NaN sizes (originally 'Varies with device') fail every comparison and reach the final else.
  '''
  if app_size<1:
    return '<1MB'
  elif app_size<=20:   # includes apps of exactly 1 MB
    return '1MB-20MB'
  elif app_size<=40:
    return '20MB-40MB'
  elif app_size<=60:
    return '40MB-60MB'
  elif app_size<=80:
    return '60MB-80MB'
  elif app_size<=100:
    return '80MB-100MB'
  else:
    return 'Varies with device'

In [270]:
#Creating size group from 'Size' column
PS_df['Size_group']=PS_df['Size'].apply(size_group)
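
Binning like this can also be sketched with `pd.cut`. The example below runs on made-up sizes and is only one possible formulation: it assumes sizes of exactly 1 MB belong in the '1MB-20MB' group, which is why the first bin edge is set just below 1 using `np.nextafter`.

```python
import numpy as np
import pandas as pd

sizes = pd.Series([0.5, 1.0, 20.0, 20.5, 100.0, np.nan])  # made-up sizes in MB

# Right-closed bins: [0, <1], (<1, 20], (20, 40], ..., (80, 100]
bins = [0, np.nextafter(1.0, 0.0), 20, 40, 60, 80, 100]
labels = ['<1MB', '1MB-20MB', '20MB-40MB', '40MB-60MB', '60MB-80MB', '80MB-100MB']

groups = pd.cut(sizes, bins=bins, labels=labels, include_lowest=True)
# NaN sizes stay NaN after cut; give them their own category
groups = groups.cat.add_categories('Varies with device').fillna('Varies with device')

print(groups.tolist())
```

The result is a categorical column whose categories follow size order, which can be convenient when plotting the groups.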

In [None]:
#Extracting year and month from 'Last Updated' column
PS_df['Last_update_year']=PS_df['Last Updated'].apply(lambda x :x.year)
PS_df['Last_update_month']=PS_df['Last Updated'].apply(lambda x :x.month)



> ## ***User review data***


In [None]:
#To display first 5 observations of User Review dataframe
UR_df.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


In [None]:
#To display last 5 observations of User Review dataframe
UR_df.tail()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
64290,Houzz Interior Design Ideas,,,,
64291,Houzz Interior Design Ideas,,,,
64292,Houzz Interior Design Ideas,,,,
64293,Houzz Interior Design Ideas,,,,
64294,Houzz Interior Design Ideas,,,,


## **Data Description of User Reviews dataframe**

The User Review dataframe contains the following information:

1.   **App:** Name of the application.
2.   **Translated_Review:** The review given by the user, translated to English.
3.   **Sentiment:** The nature of the review: Positive, Negative, or Neutral.
4.   **Sentiment_Polarity:** A numerical value assigned to the user's sentiment by analyzing the translated review. Its value ranges over [-1,1]. A review with polarity in [-1,0) has negative sentiment, a value of 0 corresponds to neutral sentiment, and values in (0,1] indicate positive sentiment.
5.   **Sentiment_Subjectivity:** Quantifies how much personal opinion versus factual information the translated review contains. Its value ranges over [0,1]. A value of 0 indicates that the review is purely objective (fact) and a value of 1 implies that it is purely subjective (opinion).
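
The polarity ranges described above can be written as a small function. This is a sketch of the rule as stated here, not necessarily how the labels in the dataset were produced; `sentiment_from_polarity` is a hypothetical helper name.

```python
def sentiment_from_polarity(polarity):
    """Label a polarity score in [-1, 1] as Negative, Neutral or Positive."""
    if polarity < 0:
        return "Negative"
    if polarity > 0:
        return "Positive"
    return "Neutral"

for p in (-0.5, 0.0, 0.25, 1.0):
    print(p, sentiment_from_polarity(p))
```

Only a polarity of exactly 0 maps to Neutral under this rule, so even a tiny positive or negative score counts as Positive or Negative.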



In [None]:
#To understand the datatypes of features in UR_df and to determine the count of missing values
dataframe_info(UR_df)

Unnamed: 0,Column_name,Count_of_missing_values,% of NaN,dtype
0,App,0,0.0,object
1,Translated_Review,26868,41.79,object
2,Sentiment,26863,41.78,object
3,Sentiment_Polarity,26863,41.78,float64
4,Sentiment_Subjectivity,26863,41.78,float64


**1. Dealing with duplicate rows**

In [None]:
#Checking for the presence of entries which are entirely duplicated
UR_df.duplicated().value_counts()

False    29692
True      7735
dtype: int64

The reviews in the Translated_Review column need not be unique: different users can leave identical short reviews (such as 'Good') for the same app. So we are not dropping the duplicate rows.

In [None]:
UR_df[UR_df.duplicated()].head(5)

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
78,10 Best Foods for You,Good,Positive,0.7,0.6
79,10 Best Foods for You,Good,Positive,0.7,0.6
100,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
101,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
103,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875


**2. Handling missing values**

In [None]:
#checking for null values
UR_df.isnull().sum()

App                           0
Translated_Review         26868
Sentiment                 26863
Sentiment_Polarity        26863
Sentiment_Subjectivity    26863
dtype: int64

In [None]:
pd.concat([UR_df[UR_df['Translated_Review'].isnull()].head(5),UR_df[UR_df['Translated_Review'].isnull()].tail(5)])

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
2,10 Best Foods for You,,,,
7,10 Best Foods for You,,,,
15,10 Best Foods for You,,,,
102,10 Best Foods for You,,,,
107,10 Best Foods for You,,,,
64290,Houzz Interior Design Ideas,,,,
64291,Houzz Interior Design Ideas,,,,
64292,Houzz Interior Design Ideas,,,,
64293,Houzz Interior Design Ideas,,,,
64294,Houzz Interior Design Ideas,,,,


These entries contain no text in the Translated_Review column, so they are not useful to us. It's better to drop these rows.

In [None]:
#Eliminating the existing null value(s) from Translated_Review
UR_df.dropna(subset = ['Translated_Review'], inplace=True)

In [None]:
#Checking for null values in Sentiment,Sentiment_Polarity,Sentiment_Subjectivity columns
UR_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37427 entries, 0 to 64230
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     37427 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37427 non-null  object 
 3   Sentiment_Polarity      37427 non-null  float64
 4   Sentiment_Subjectivity  37427 non-null  float64
dtypes: float64(2), object(3)
memory usage: 1.7+ MB


All the null values have been eliminated.

**3. Sanity checks**

In [None]:
#inspecting the elements in the Sentiment column
set(UR_df['Sentiment'])

{'Negative', 'Neutral', 'Positive'}

Let's check whether the range of values in the Sentiment_Polarity column matches the Sentiment of the review.

In [None]:
for sentiment in UR_df['Sentiment'].unique():
  print(f'The value distribution of Sentiment Polarity of {sentiment} reviews')
  print(UR_df[UR_df['Sentiment']==sentiment]['Sentiment_Polarity'].describe())
  print('\n')

The value distribution of Sentiment Polarity of Positive reviews
count    2.399800e+04
mean     3.724021e-01
std      2.526559e-01
min      5.551115e-18
25%      1.666667e-01
50%      3.300000e-01
75%      5.000000e-01
max      1.000000e+00
Name: Sentiment_Polarity, dtype: float64


The value distribution of Sentiment Polarity of Neutral reviews
count    5158.0
mean        0.0
std         0.0
min         0.0
25%         0.0
50%         0.0
75%         0.0
max         0.0
Name: Sentiment_Polarity, dtype: float64


The value distribution of Sentiment Polarity of Negative reviews
count    8.271000e+03
mean    -2.561728e-01
std      2.354828e-01
min     -1.000000e+00
25%     -3.645833e-01
50%     -1.833333e-01
75%     -8.125000e-02
max     -2.523234e-18
Name: Sentiment_Polarity, dtype: float64
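
The per-sentiment summaries above show that Positive reviews have polarity above 0, Neutral reviews sit at exactly 0 and Negative reviews are below 0. The same check can be condensed into one `groupby`. The frame below is made up for illustration and only borrows the two column names.

```python
import pandas as pd

reviews = pd.DataFrame({  # made-up rows with the same two columns
    "Sentiment": ["Positive", "Positive", "Neutral", "Negative", "Negative"],
    "Sentiment_Polarity": [0.4, 1.0, 0.0, -0.25, -1.0],
})

# Minimum and maximum polarity for each sentiment label
ranges = reviews.groupby("Sentiment")["Sentiment_Polarity"].agg(["min", "max"])
print(ranges)

# Every label should sit on the expected side of zero
consistent = (
    (ranges.loc["Positive", "min"] > 0)
    and (ranges.loc["Neutral", ["min", "max"]] == 0).all()
    and (ranges.loc["Negative", "max"] < 0)
)
print(consistent)
```

A single table of minimum and maximum values per label is often easier to scan than three separate `describe()` outputs.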




In [None]:
#inspecting elements in the Sentiment_Polarity column
UR_df[(UR_df['Sentiment_Polarity']>1) | (UR_df['Sentiment_Polarity']<-1)]

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity


Therefore, all the elements in Sentiment_Polarity lie within the range [-1,1].

In [None]:
#inspecting elements in the Sentiment_Subjectivity column
UR_df[(UR_df['Sentiment_Subjectivity']<0) | (UR_df['Sentiment_Subjectivity']>1)]

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity


Therefore, all the elements in Sentiment_Subjectivity lie within the range [0,1].

In [None]:
#The number of unique apps in User Review dataframe
UR_df['App'].nunique()

865

After cleaning, the User Review dataframe **UR_df** contains reviews for **865** applications.



> ## ***Merging the Play Store and User reviews data***



To find correlations between the features in PS_df and UR_df, let's merge the two dataframes.

In [None]:
#Creating a new dataframe by merging UR_df and PS_df
new_df_merged=pd.merge(left=PS_df,right=UR_df,how='inner',on='App')

In [None]:
new_df_merged.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Last_update_year,Last_update_month,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,HEALTH_AND_FITNESS,4.0,2490.0,3.8,500000,Free,0.0,Everyone 10+,Health & Fitness,2017-02-17,1.9,2.3.3 and up,2017,2017,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,HEALTH_AND_FITNESS,4.0,2490.0,3.8,500000,Free,0.0,Everyone 10+,Health & Fitness,2017-02-17,1.9,2.3.3 and up,2017,2017,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,HEALTH_AND_FITNESS,4.0,2490.0,3.8,500000,Free,0.0,Everyone 10+,Health & Fitness,2017-02-17,1.9,2.3.3 and up,2017,2017,Works great especially going grocery store,Positive,0.4,0.875


In [None]:
new_df_merged['App'].nunique()

816

There are **816** apps common to both the User Review dataframe **UR_df** and the Playstore dataframe **PS_df**.
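
Since UR_df covers 865 apps and only 816 survive the inner merge, 49 reviewed apps have no match in the cleaned PS_df. One way to find out which ones, sketched here on made-up frames, is an outer merge with `indicator=True`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

apps = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.1, 3.9, 4.7]})
reviews = pd.DataFrame({"App": ["A", "A", "D"],
                        "Sentiment": ["Positive", "Negative", "Neutral"]})

# Inner join: only apps present in both frames survive
inner = pd.merge(apps, reviews, how="inner", on="App")

# Outer join with indicator=True labels each row as left_only, right_only or both
outer = pd.merge(apps, reviews, how="outer", on="App", indicator=True)
only_in_reviews = outer.loc[outer["_merge"] == "right_only", "App"].unique()

print(inner["App"].nunique())
print(list(only_in_reviews))
```

Inspecting the unmatched names can reveal whether they were removed during cleaning or simply differ in spelling between the two files.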