# Project name: Google Playstore Apps Analysis & Visualization
# Team name: Cluster champ
## Team members: Abdul Jaweed, Masna Ashraf, Shaziya Shaikh and Tanu Rajput

## About the project

In this project, you will work with a real-world dataset from the Google Play Store, one of the most widely used platforms for downloading Android apps. The project aims to clean the dataset, analyze it, and mine it for informative insights. It also involves visualizing the data so that trends and differences between categories are easier to understand.

## Project Description

This project will help you understand how a real-world database is analyzed using SQL, how to extract as many insights as possible from the dataset, how to pre-process the data with Python so that later steps run more smoothly, how Structured Query Language helps retrieve useful information from the database, and how to visualize the data with Power BI.

# Module 1: Pre-processing and analyzing data using Python and SQL

In this module, you will query the dataset using Structured Query Language to gain insights from the database. The problem statements will be provided to you, and you need to solve each one using your own logic. Several SQL concepts are used in this process, such as aggregating, grouping, and ordering the data. The subtasks of Module 1 follow.
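As a rough illustration of the aggregating, grouping, and ordering mentioned above, here is a minimal sketch that runs one such query from Python with the standard-library `sqlite3` module. The table name `apps`, the in-memory database, and the five rows are assumptions made up for this example; they are not taken from the project's database or problem statements.

```python
import sqlite3

# Hypothetical rows, loosely modelled on the Play Store columns; not real data.
rows = [
    ("App A", "TOOLS", 4.0, 1000),
    ("App B", "TOOLS", 3.0, 500),
    ("App C", "GAME", 5.0, 100000),
    ("App D", "GAME", 4.0, 50000),
    ("App E", "BEAUTY", 4.5, 100),
]

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apps (App TEXT, Category TEXT, Rating REAL, Installs INTEGER)"
)
conn.executemany("INSERT INTO apps VALUES (?, ?, ?, ?)", rows)

# Aggregate per category, then order the categories by total installs.
query = """
    SELECT Category,
           COUNT(*)      AS app_count,
           AVG(Rating)   AS avg_rating,
           SUM(Installs) AS total_installs
    FROM apps
    GROUP BY Category
    ORDER BY total_installs DESC
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)

conn.close()
```

The same `GROUP BY` and `ORDER BY` pattern should carry over to whichever SQL engine the project actually uses, although function names and type handling can differ between engines.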

In [70]:
# import libraries

import numpy as np
import pandas as pd

In [71]:
# load the playstore_apps data set, using the App column as the index

app = pd.read_csv("playstore_apps.csv", index_col='App')

In [72]:
# drop duplicate rows
# note: keep=False removes every copy of a duplicated row, not only the extra copies

app.drop_duplicates(keep=False,inplace=True)
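The effect of the `keep` argument can be seen on a toy frame. The sketch below is only an illustration: the frame `toy` and its values are invented, and it is not a claim about which setting the project should use.

```python
import pandas as pd

# Hypothetical toy frame: "App A" appears twice with identical column values.
toy = pd.DataFrame(
    {"Category": ["TOOLS", "TOOLS", "GAME"], "Rating": [4.0, 4.0, 4.5]},
    index=pd.Index(["App A", "App A", "App B"], name="App"),
)

# keep=False drops every row that has a duplicate, including the first copy.
all_copies_removed = toy.drop_duplicates(keep=False)

# keep="first" retains one copy of each duplicated row.
first_copy_kept = toy.drop_duplicates(keep="first")

print(len(all_copies_removed))
print(len(first_copy_kept))
```

Note that `drop_duplicates` compares column values only and ignores the index, so with `App` as the index two differently named apps whose other columns all match would also count as duplicates.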

In [73]:
app.head()

App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,19M,10000.0,Free,0.0,Everyone,Art & Design,07-01-2018,1.0.0,4.0.3 and up
Coloring book moana,ART_AND_DESIGN,3.9,967.0,14M,500000.0,Free,0.0,Everyone,Art & Design;Pretend Play,15-01-2018,2.0.0,4.0.3 and up
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510.0,8.7M,5000000.0,Free,0.0,Everyone,Art & Design,01-08-2018,1.2.4,4.0.3 and up
Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644.0,25M,50000000.0,Free,0.0,Teen,Art & Design,08-06-2018,Varies with device,4.2 and up
Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967.0,2.8M,100000.0,Free,0.0,Everyone,Art & Design;Creativity,20-06-2018,1.1,4.4 and up


In [74]:
app.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9930 entries, Photo Editor & Candy Camera & Grid & ScrapBook to iHoroscope - 2018 Daily Horoscope & Astrology
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Category        9930 non-null   object 
 1   Rating          8487 non-null   float64
 2   Reviews         9929 non-null   float64
 3   Size            9930 non-null   object 
 4   Installs        9929 non-null   float64
 5   Type            9929 non-null   object 
 6   Price           9929 non-null   float64
 7   Content Rating  9929 non-null   object 
 8   Genres          9930 non-null   object 
 9   Last Updated    9929 non-null   object 
 10  Current Ver     9922 non-null   object 
 11  Android Ver     9927 non-null   object 
dtypes: float64(4), object(8)
memory usage: 1008.5+ KB


In [75]:
# find the number and percentage of missing values in each column
def no_perc_null(df):
    no_of_null = df.isnull().sum()
    percentage_of_null = no_of_null / len(df) * 100
    dtypecol = df.dtypes
    null_val = pd.concat([no_of_null, percentage_of_null, dtypecol], axis=1)
    null_val_new = null_val.rename(columns={0: "Null values", 1: "Percentage", 2: "Datatype"})
    return null_val_new
no_perc_null(app)

Column,Null values,Percentage,Datatype
Category,0,0.0,object
Rating,1443,14.531722,float64
Reviews,1,0.01007,float64
Size,0,0.0,object
Installs,1,0.01007,float64
Type,1,0.01007,object
Price,1,0.01007,float64
Content Rating,1,0.01007,object
Genres,0,0.0,object
Last Updated,1,0.01007,object


In [76]:
# Find the number of unique values in each column
print("Print unique values of each column")
for i in app.columns:
    print("*"*70)
    print("number of unique values of ",i," column :",app[i].nunique())
    print("unique values:")
    print(app[i].unique())

Print unique values of each column
**********************************************************************
number of unique values of  Category  column : 34
unique values:
['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
 'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
 'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
 'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
 'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
 'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
 'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION' '1.9']
**********************************************************************
number of unique values of  Rating  column : 40
unique values:
[ 4.1  3.9  4.7  4.5  4.3  4.4  3.8  4.2  4.6  3.2  4.   nan  4.8  4.9
  3.6  3.7  3.3  3.4  3.5  3.1  5.   2.6  3.   2.5  1.   1.9  2.9  2.8
  2.3  2.2  1.7  2.   1.8  2.7  2.4  1.6  2.1  1.4  1.5  1.2 19. ]
**************

In [29]:
# find the malformed row whose Category is '1.9' so that it can be dropped
print(app[app['Category'] == '1.9'])

                                        Category  Rating  Reviews    Size  \
App                                                                         
Life Made WI-Fi Touchscreen Photo Frame      1.9    19.0      NaN  1,000+   

                                         Installs Type  Price Content Rating  \
App                                                                            
Life Made WI-Fi Touchscreen Photo Frame       NaN    0    NaN            NaN   

                                                    Genres Last Updated  \
App                                                                       
Life Made WI-Fi Touchscreen Photo Frame  February 11, 2018          NaN   

                                        Current Ver Android Ver  
App                                                              
Life Made WI-Fi Touchscreen Photo Frame  4.0 and up         NaN  


In [30]:

app.drop("Life Made WI-Fi Touchscreen Photo Frame",inplace=True)

In [31]:
# replace missing (NaN) values in Rating with 0
app["Rating"] = app["Rating"].fillna(0)
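Filling missing ratings with 0 changes any later average, because pandas skips NaN by default but counts the zeros. The sketch below shows the difference on a made-up series; the values are hypothetical and are not drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one missing value.
ratings = pd.Series([4.0, 5.0, np.nan, 3.0], name="Rating")

# mean() skips NaN by default, so only the three known ratings are averaged.
mean_ignoring_missing = ratings.mean()

# After filling with 0, the missing entry is counted as a real rating of 0.
mean_with_zero_fill = ratings.fillna(0).mean()

print(mean_ignoring_missing)
print(mean_with_zero_fill)
```

If average ratings are computed later in SQL or Power BI, it may be worth deciding whether unrated apps should count as 0 or be excluded from the average.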

In [32]:
# drop any rows that still contain NaN values
app.dropna(inplace=True)

In [63]:
# rename columns so that their names contain no spaces
app = app.rename(columns={
    "Current Ver": "current_Ver",
    "Android Ver": "Android_Ver",
    "Last Updated": "Last_Updated",
    "Content Rating": "Content_Rating",
})

In [64]:
app.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9918 entries, Photo Editor & Candy Camera & Grid & ScrapBook to iHoroscope - 2018 Daily Horoscope & Astrology
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Category        9918 non-null   object 
 1   Rating          9918 non-null   float64
 2   Reviews         9918 non-null   float64
 3   Size            9918 non-null   object 
 4   Installs        9918 non-null   float64
 5   Type            9918 non-null   object 
 6   Price           9918 non-null   float64
 7   Content_Rating  9918 non-null   object 
 8   Genres          9918 non-null   object 
 9   Last_Updated    9918 non-null   object 
 10  current_Ver     9918 non-null   object 
 11  Android_Ver     9918 non-null   object 
dtypes: float64(4), object(8)
memory usage: 1007.3+ KB


In [77]:
# read Playstore_reviews data
reviews=pd.read_csv("playstore_reviews.csv",index_col='App')

In [78]:
reviews.head()

App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
10 Best Foods for You,,,,
10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
10 Best Foods for You,Best idea us,Positive,1.0,0.3


In [79]:
reviews.shape

(64295, 4)

In [80]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 64295 entries, 10 Best Foods for You to Houzz Interior Design Ideas
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Translated_Review       37427 non-null  object 
 1   Sentiment               37432 non-null  object 
 2   Sentiment_Polarity      37432 non-null  float64
 3   Sentiment_Subjectivity  37432 non-null  float64
dtypes: float64(2), object(2)
memory usage: 2.5+ MB


In [81]:
# drop rows that contain NaN values
reviews.dropna(axis=0,inplace=True)

In [82]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37427 entries, 10 Best Foods for You to Housing-Real Estate & Property
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Translated_Review       37427 non-null  object 
 1   Sentiment               37427 non-null  object 
 2   Sentiment_Polarity      37427 non-null  float64
 3   Sentiment_Subjectivity  37427 non-null  float64
dtypes: float64(2), object(2)
memory usage: 1.4+ MB


In [83]:
import string
string.punctuation


'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [84]:
# Remove double quotes from a review string
quotes = '"'

def remove_quotes(text):
    for char in quotes:
        text = text.replace(char, '')
    return text


In [85]:
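# assign the result back: apply() returns a new Series and does not modify the column in place
reviews['Translated_Review'] = reviews['Translated_Review'].apply(remove_quotes)
reviews['Translated_Review']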

App
10 Best Foods for You             I like eat delicious food. That's I'm cooking ...
10 Best Foods for You               This help eating healthy exercise regular basis
10 Best Foods for You                    Works great especially going grocery store
10 Best Foods for You                                                  Best idea us
10 Best Foods for You                                                      Best way
                                                        ...                        
Housing-Real Estate & Property    Most ads older many agents ..not much owner po...
Housing-Real Estate & Property    If photos posted portal load, fit purpose. I'm...
Housing-Real Estate & Property    Dumb app, I wanted post property rent give opt...
Housing-Real Estate & Property    I property business got link SMS happy perform...
Housing-Real Estate & Property    Useless app, I searched flats kondapur, Hydera...
Name: Translated_Review, Length: 37427, dtype: object
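A vectorized alternative to calling `remove_quotes` row by row is the pandas string accessor. This is a minimal sketch on an invented two-row series (`sample` is hypothetical), not the notebook's own method. As with `apply`, the result is a new Series, so it has to be assigned back to the column to take effect.

```python
import pandas as pd

# Hypothetical reviews; the first one contains double quotes.
sample = pd.Series(
    ['He said "great" app', "No quotes here"],
    name="Translated_Review",
)

# regex=False treats the pattern as a literal character, not a regular expression.
cleaned = sample.str.replace('"', "", regex=False)

print(cleaned.tolist())
```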

In [88]:
reviews.to_csv("cleaned_review.csv",encoding='utf-8')

In [89]:
app.to_csv("cleaned_app.csv",encoding='utf-8')