<a href="https://colab.research.google.com/github/Jahnavi-Jaolekar/Play-Store-Review-Analysis/blob/main/Play_Store_App_Review_Analysis_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b> Problem Statement<b>
## <b>The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. </b>

## <b> Each app (row) has values for catergory, rating, size, and more. Another dataset contains customer reviews of the android apps.</b>

## <b> Explore and analyze the data to discover key factors responsible for app engagement and success. </b>

In [24]:
Title_of_the_project= "abc"

In [1]:
# Importing necessary libraries required
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


## **Let's mount our data**

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Store the data in a variable and view it
data='/content/drive/MyDrive/capstone project/Copy of Play Store Data.csv'
play_store= pd.read_csv(data)
play_store.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


## **Determining number of rows and columns**

In [25]:
# Determining the shape of the data
print(f'number of rows:{play_store.shape[0]}')
print(f'number of columns:{play_store.shape[1]}')

number of rows:10840
number of columns:13


## **Let's just dig into the data a bit more!!**


In [6]:
# Columns is an attribute that returns column labels, which helps to understand which column contains what type of information.
play_store.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

In [7]:
# Defining a function to to determine number of null values,unique values and data type of all values
def give_info():
  extra_info = pd.DataFrame(index=play_store.columns)
  extra_info['data_type'] = play_store.dtypes
  extra_info['null_count'] = play_store.isnull().sum()
  extra_info['unique_count'] = play_store.nunique()
  return extra_info
  

In [8]:
give_info()

Unnamed: 0,data_type,null_count,unique_count
App,object,0,9660
Category,object,0,34
Rating,float64,1474,40
Reviews,object,0,6002
Size,object,0,462
Installs,object,0,22
Type,object,1,3
Price,object,0,93
Content Rating,object,1,6
Genres,object,0,120








*   This shows that this data contains 1474 null values in rating column, current ver has 8, android ver has 3, content rating and type each have 1 null value.
*   Only rating has numeric values in the form of floats.





# **Dealing with null values in the data**

In [9]:
def impute_median(series):
  return series.fillna(series.median)

In [10]:
#Replacing null values in rating column with median of all values present in that column
play_store.Rating=play_store['Rating'].transform(impute_median)

In [11]:
#Removing/droping the row containg null value for the column content rating
play_store[play_store['Content Rating'].isna()]
play_store.drop([10472],inplace=True)

In [12]:
#Type,Current Ver,Android Ver, columns are replaced with mode of respective columns.
#Mode is most recurring value of that particular column, so null values are being replaced by mode.
play_store['Type'].fillna(str(play_store['Type'].mode().values[0]),inplace=True)
play_store['Current Ver'].fillna(str(play_store['Current Ver'].mode().values[0]),inplace=True)
play_store['Android Ver'].fillna(str(play_store['Android Ver'].mode().values[0]),inplace=True)

In [13]:
#Check if still null values are present
play_store.isnull().sum()

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64

# **What is to be cleaned?**

### *   Reviews: It has to be a numerical column and datatype into int/float

### *   Size: Every Entry has "M" and "K" which needs to be removed and datatype into int/float
### *   Installs: Every entry has "+" at the end which needs to be removed and datatype into int/float also it has entries like "Free"

### *   Price: It has "S" at the begining it needs to be cleaned and datatype into int/float
### *   Last Updated: it is date time column,datatype shouldbe changed accordingly and in feature engineering we will splitt the column into year and month

### *   Current Ver: The entries are 1.0.1,1.2.1,1.2 so we will make it 101,121,120, correct upto 3 places
### *   Android Ver: we will remove "And up" from the end and making 4.0.3 as 403




### **Review Column**

In [28]:
# Columns review hsould be of int type
# play_store['Reviews'] = play_store.Reviews.astype(int)
# give_info()

### **Size Column**

In [15]:
# Removing the ',' and '+' Signs
play_store['Size'] = play_store.Size.apply(lambda x: x.strip('+'))
play_store['Size'] = play_store.Size.apply(lambda x: x.replace(',', ''))

In [16]:
# Convert M to Mega
play_store['Size'] = play_store.Size.apply(lambda x: x.replace('M', 'e+6'))

# Convert the K to Kilo
play_store['Size'] = play_store.Size.apply(lambda x: x.replace('k', 'e+3'))

# Replacing the Varies with device value with Nan
#play_store['Size'] = play_store.Size.replace('Varies with device', np.NaN)

In [19]:
# Convert Size Column from object to integer
#play_store['Size'] = pd.to_numeric(play_store['Size'])

### **Installs Column**

In [18]:
#Remove '+' sign from 
play_store['Installs'] = play_store.Installs.apply(lambda x: x.strip('+'))

# Remove the ',' from the numbers.
play_store['Installs'] = play_store.Installs.apply(lambda x: x.replace(',', ''))

# Convert it from string type to numeric type
play_store['Installs'] = pd.to_numeric(play_store['Installs'])

### **Price Column**

In [20]:
# Remove'$' sign
play_store['Price'] = play_store.Price.apply(lambda x: x.strip('$'))

# Convert the values to Numeric type.
play_store['Price'] = pd.to_numeric(play_store['Price'])

In [26]:
give_info()

TypeError: ignored