# PySpark Analysis on Google Play Store Dataset

Agenda
1. Top 10 apps by number of reviews.
2. Top 10 apps by installs, and the distribution of type (Free/Paid).
3. Category-wise distribution of installed apps.
4. Top paid apps.
5. Top-rated paid apps.


In [43]:
# Import libraries
from pyspark.sql import SparkSession
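# Import only the functions used below; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col, regexp_replace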

Create a SparkSession object

In [44]:
spark = SparkSession.builder.master('local[*]').appName("GoogleAppStore")\
    .config("spark.driver.bindAddress", "10.0.2.15").getOrCreate()


Read the input CSV data into a PySpark DataFrame

In [45]:
df = spark.read.csv("/home/hdoop/Documents/python/Pyspark_Analysis/googleplaystore/googleplaystore.csv",header=True,inferSchema=True)
df.show(10)

+--------------------+--------------+------+-------+----+-----------+----+-----+--------------+--------------------+------------------+------------------+------------+
|                 App|      Category|Rating|Reviews|Size|   Installs|Type|Price|Content Rating|              Genres|      Last Updated|       Current Ver| Android Ver|
+--------------------+--------------+------+-------+----+-----------+----+-----+--------------+--------------------+------------------+------------------+------------+
|Photo Editor & Ca...|ART_AND_DESIGN|   4.1|    159| 19M|    10,000+|Free|    0|      Everyone|        Art & Design|   January 7, 2018|             1.0.0|4.0.3 and up|
| Coloring book moana|ART_AND_DESIGN|   3.9|    967| 14M|   500,000+|Free|    0|      Everyone|Art & Design;Pret...|  January 15, 2018|             2.0.0|4.0.3 and up|
|U Launcher Lite –...|ART_AND_DESIGN|   4.7|  87510|8.7M| 5,000,000+|Free|    0|      Everyone|        Art & Design|    August 1, 2018|             1.2.4|4.0.3 

# Data Exploration

DataFrame shape

In [46]:
print("Total Rows in dataframe:",df.count())
print("Total columns in dataframe:",len(df.columns))

Total Rows in dataframe: 10841
Total columns in dataframe: 13


In [47]:
df.dtypes

[('App', 'string'),
 ('Category', 'string'),
 ('Rating', 'string'),
 ('Reviews', 'string'),
 ('Size', 'string'),
 ('Installs', 'string'),
 ('Type', 'string'),
 ('Price', 'string'),
 ('Content Rating', 'string'),
 ('Genres', 'string'),
 ('Last Updated', 'string'),
 ('Current Ver', 'string'),
 ('Android Ver', 'string')]

# Data Cleaning

Removing columns from the DataFrame that are not needed

In [48]:
df1 = df.drop("Size","Content Rating","Last Updated","Current Ver","Android Ver")

In [49]:
df1.show()

+--------------------+--------------+------+-------+-----------+----+-----+--------------------+
|                 App|      Category|Rating|Reviews|   Installs|Type|Price|              Genres|
+--------------------+--------------+------+-------+-----------+----+-----+--------------------+
|Photo Editor & Ca...|ART_AND_DESIGN|   4.1|    159|    10,000+|Free|    0|        Art & Design|
| Coloring book moana|ART_AND_DESIGN|   3.9|    967|   500,000+|Free|    0|Art & Design;Pret...|
|U Launcher Lite –...|ART_AND_DESIGN|   4.7|  87510| 5,000,000+|Free|    0|        Art & Design|
|Sketch - Draw & P...|ART_AND_DESIGN|   4.5| 215644|50,000,000+|Free|    0|        Art & Design|
|Pixel Draw - Numb...|ART_AND_DESIGN|   4.3|    967|   100,000+|Free|    0|Art & Design;Crea...|
|Paper flowers ins...|ART_AND_DESIGN|   4.4|    167|    50,000+|Free|    0|        Art & Design|
|Smoke Effect Phot...|ART_AND_DESIGN|   3.8|    178|    50,000+|Free|    0|        Art & Design|
|    Infinite Painter|ART_AND_

# Data Preprocessing

In [50]:
df1.printSchema()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Reviews: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Genres: string (nullable = true)



Changing data types of columns

In [51]:
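# Installs: strip every non-digit ("10,000+" becomes "10000") before casting to int.
# Price: strip the "$" sign; the int cast then drops any cents (cast to 'double' to keep them).
df1 = df1.withColumn('Rating', col('Rating').cast('float'))\
         .withColumn('Reviews', col('Reviews').cast('int'))\
         .withColumn('Installs', regexp_replace(col('Installs'), "[^0-9]", "")).withColumn('Installs', col('Installs').cast('int'))\
         .withColumn('Price', regexp_replace(col('Price'), "[$]", "")).withColumn('Price', col('Price').cast('int'))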

df1.show()


+--------------------+--------------+------+-------+--------+----+-----+--------------------+
|                 App|      Category|Rating|Reviews|Installs|Type|Price|              Genres|
+--------------------+--------------+------+-------+--------+----+-----+--------------------+
|Photo Editor & Ca...|ART_AND_DESIGN|   4.1|    159|   10000|Free|    0|        Art & Design|
| Coloring book moana|ART_AND_DESIGN|   3.9|    967|  500000|Free|    0|Art & Design;Pret...|
|U Launcher Lite –...|ART_AND_DESIGN|   4.7|  87510| 5000000|Free|    0|        Art & Design|
|Sketch - Draw & P...|ART_AND_DESIGN|   4.5| 215644|50000000|Free|    0|        Art & Design|
|Pixel Draw - Numb...|ART_AND_DESIGN|   4.3|    967|  100000|Free|    0|Art & Design;Crea...|
|Paper flowers ins...|ART_AND_DESIGN|   4.4|    167|   50000|Free|    0|        Art & Design|
|Smoke Effect Phot...|ART_AND_DESIGN|   3.8|    178|   50000|Free|    0|        Art & Design|
|    Infinite Painter|ART_AND_DESIGN|   4.1|  36815| 1000000
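
The cleaning rules used in the cast above can be checked on a few values without starting Spark. The following is a minimal plain-Python sketch of the same two regular expressions, not the Spark code itself. The helper names `clean_installs` and `clean_price` are made up for this illustration, and `"$4.99"` is a sample value chosen to show what an integer cast does to a price with cents.

```python
import re


def clean_installs(raw: str) -> int:
    """Same idea as regexp_replace(col("Installs"), "[^0-9]", "") plus an int cast."""
    return int(re.sub(r"[^0-9]", "", raw))


def clean_price(raw: str) -> float:
    """Strip the dollar sign, then parse as a float so that cents survive."""
    return float(re.sub(r"[$]", "", raw))


# Install values as they appear in the raw Installs column
samples_installs = ["10,000+", "500,000+", "50,000,000+"]
cleaned_installs = [clean_installs(value) for value in samples_installs]
print(cleaned_installs)  # [10000, 500000, 50000000]

# "$4.99" is a hypothetical sample price
price = clean_price("$4.99")
print(price)       # 4.99
print(int(price))  # 4, the whole-dollar part that an integer cast keeps
```

If exact prices matter for the later "top paid apps" query, casting the price column to a floating-point type rather than `int` is probably the safer choice.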

Renaming columns to lowercase for easier querying

In [52]:
df1.columns

['App', 'Category', 'Rating', 'Reviews', 'Installs', 'Type', 'Price', 'Genres']

In [53]:
rename_columns = {"App":"app","Category":"category","Rating":"rating","Reviews":"reviews","Installs":"installs","Type":"type","Price":"price","Genres":"genres"}

for old_column_name, new_column_name in rename_columns.items():
    df1 = df1.withColumnRenamed(old_column_name,new_column_name)
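
Since every new name is just the lowercase form of the old one, the mapping does not have to be typed by hand. A small sketch, assuming the column list printed by `df1.columns` above:

```python
# Column names as printed by df1.columns
columns = ['App', 'Category', 'Rating', 'Reviews', 'Installs', 'Type', 'Price', 'Genres']

# Build the old-name -> new-name mapping from the list itself
rename_columns = {name: name.lower() for name in columns}
print(rename_columns)
```

The resulting dictionary can be fed to the same `withColumnRenamed` loop. If columns containing spaces had been kept, `name.lower().replace(" ", "_")` would be one way to make them easier to reference in SQL.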

In [54]:
df1.show()

+--------------------+--------------+------+-------+--------+----+-----+--------------------+
|                 app|      category|rating|reviews|installs|type|price|              genres|
+--------------------+--------------+------+-------+--------+----+-----+--------------------+
|Photo Editor & Ca...|ART_AND_DESIGN|   4.1|    159|   10000|Free|    0|        Art & Design|
| Coloring book moana|ART_AND_DESIGN|   3.9|    967|  500000|Free|    0|Art & Design;Pret...|
|U Launcher Lite –...|ART_AND_DESIGN|   4.7|  87510| 5000000|Free|    0|        Art & Design|
|Sketch - Draw & P...|ART_AND_DESIGN|   4.5| 215644|50000000|Free|    0|        Art & Design|
|Pixel Draw - Numb...|ART_AND_DESIGN|   4.3|    967|  100000|Free|    0|Art & Design;Crea...|
|Paper flowers ins...|ART_AND_DESIGN|   4.4|    167|   50000|Free|    0|        Art & Design|
|Smoke Effect Phot...|ART_AND_DESIGN|   3.8|    178|   50000|Free|    0|        Art & Design|
|    Infinite Painter|ART_AND_DESIGN|   4.1|  36815| 1000000

Creating a temp view of the DataFrame to perform SQL operations

In [55]:
df1.createOrReplaceTempView("playstore")

# SQL Operations

1. Top 10 apps by review count

In [56]:
query = """
        select app, sum(reviews) as review_count
        from playstore
        group by app
        order by review_count desc
        limit 10
        """

spark.sql(query).show()

+--------------------+------------+
|                 app|review_count|
+--------------------+------------+
|           Instagram|   266241989|
|  WhatsApp Messenger|   207348304|
|      Clash of Clans|   179558781|
|Messenger – Text ...|   169932272|
|      Subway Surfers|   166331958|
|    Candy Crush Saga|   156993136|
|            Facebook|   156286514|
|         8 Ball Pool|    99386198|
|        Clash Royale|    92530298|
|            Snapchat|    68045010|
+--------------------+------------+



2. Top 10 apps by installs and distribution of type (Free/Paid)

In [65]:
query = """
    select app, type, sum(installs) as install_count
    from playstore
    where type = 'Free' or type = 'Paid'
    group by 1, 2
    order by 3 desc
    limit 10
    """

spark.sql(query).show()

+------------------+----+-------------+
|               app|type|install_count|
+------------------+----+-------------+
|    Subway Surfers|Free|   6000000000|
|         Instagram|Free|   4000000000|
|      Google Drive|Free|   4000000000|
|          Hangouts|Free|   4000000000|
|     Google Photos|Free|   4000000000|
|       Google News|Free|   4000000000|
|  Candy Crush Saga|Free|   3500000000|
|WhatsApp Messenger|Free|   3000000000|
|             Gmail|Free|   3000000000|
|      Temple Run 2|Free|   3000000000|
+------------------+----+-------------+
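
One caveat about these totals: `sum(installs)` adds up every row for an app, so if the same app is listed on more than one row, its install bucket is counted several times. This has not been verified against the file here, but a total such as 6,000,000,000 for a single app is worth double-checking for that reason. The sketch below shows the effect on a tiny in-memory SQLite table with made-up rows (the table contents are hypothetical, and SQLite stands in for Spark SQL only because the `group by` logic is the same).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table playstore (app text, type text, installs integer)")

# Hypothetical rows: "App A" is listed three times with the same install bucket
conn.executemany(
    "insert into playstore values (?, ?, ?)",
    [
        ("App A", "Free", 1000),
        ("App A", "Free", 1000),
        ("App A", "Free", 1000),
        ("App B", "Free", 2000),
    ],
)

# Summing counts the duplicated rows three times
summed = conn.execute(
    "select app, sum(installs) from playstore group by app order by 2 desc, app"
).fetchall()

# Taking the maximum per app ignores the duplicates
per_app_max = conn.execute(
    "select app, max(installs) from playstore group by app order by 2 desc, app"
).fetchall()

print(summed)       # [('App A', 3000), ('App B', 2000)]
print(per_app_max)  # [('App B', 2000), ('App A', 1000)]
conn.close()
```

Swapping `sum` for `max`, or removing duplicate rows before aggregating, changes the ranking in this toy example. Whether it does so in the real data depends on how many repeated rows the dataset actually contains.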



In [None]:
# Identify the distinct values in the 'type' column
df1.select('type').distinct().show()

+------+
|  type|
+------+
|     0|
|102248|
|   NaN|
|  Free|
|  Paid|
|  2509|
+------+
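
Values such as `102248` and `2509` look like they belong to another column, which suggests that a few rows were split into the wrong fields when the CSV was read (`NaN` more likely marks a missing value). One plausible cause, stated here as an assumption rather than a verified diagnosis, is an app name that contains commas or doubled quotes. The plain-Python sketch below uses a made-up CSV line to show how such a field shifts the columns when quoting is not handled.

```python
import csv
import io

# Hypothetical CSV line: the app name contains a comma and doubled quotes
line = '"My ""Best"" App, Pro",GAME,Free\n'

# Splitting on commas ignores the quoting and produces one field too many
naive_fields = line.strip().split(",")

# A quote-aware parser keeps the app name in one field
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

Spark's CSV reader exposes `quote` and `escape` options, and its default escape character is a backslash. If the file escapes quotes by doubling them, reading it with `escape='"'` may reduce the number of misaligned rows. Filtering on `type` being `Free` or `Paid`, as the queries here do, sidesteps the problem either way.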



Distribution of type (Free/Paid)

In [None]:

query = """
SELECT type, count(*) AS count
FROM playstore
where type="Free" or type="Paid"
GROUP BY type
"""
spark.sql(query).show()

+----+-----+
|type|count|
+----+-----+
|Free|10037|
|Paid|  800|
+----+-----+
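
The counts are easier to compare as percentages. The short calculation below uses only the numbers printed in this notebook: the two counts above and the total row count from the data exploration step.

```python
# Counts copied from the query output above
type_counts = {"Free": 10037, "Paid": 800}
total_rows = 10841  # df.count() from the data exploration step

valid_rows = sum(type_counts.values())
shares = {t: round(100 * n / valid_rows, 1) for t, n in type_counts.items()}
excluded_rows = total_rows - valid_rows

print(shares)         # {'Free': 92.6, 'Paid': 7.4}
print(excluded_rows)  # 4
```

About 92.6% of the rows with a valid type are free apps, and 4 of the 10841 rows fall outside the Free/Paid filter.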



3. Category-wise distribution of installed apps

In [67]:
query = """
select category,sum(installs) as install_count from playstore GROUP BY 1 order by 2 desc
"""

spark.sql(query).show()

+-------------------+-------------+
|           category|install_count|
+-------------------+-------------+
|               GAME|  35086024415|
|      COMMUNICATION|  32647276251|
|       PRODUCTIVITY|  14176091369|
|             SOCIAL|  14069867902|
|              TOOLS|  11452771915|
|             FAMILY|  10258263505|
|        PHOTOGRAPHY|  10088247655|
| NEWS_AND_MAGAZINES|   7496317760|
|   TRAVEL_AND_LOCAL|   6868887146|
|      VIDEO_PLAYERS|   6222002720|
|           SHOPPING|   3247848785|
|      ENTERTAINMENT|   2869160000|
|    PERSONALIZATION|   2325494782|
|BOOKS_AND_REFERENCE|   1921469576|
|             SPORTS|   1751174498|
| HEALTH_AND_FITNESS|   1582072512|
|           BUSINESS|   1001914865|
|            FINANCE|    876648734|
|          EDUCATION|    871452000|
|MAPS_AND_NAVIGATION|    719281890|
+-------------------+-------------+
only showing top 20 rows



4. Top Paid Apps

In [76]:
# price in whole dollars (the earlier int cast dropped any cents)
query = """
        select app,sum(price) as price from playstore where type="Paid" group by 1 order by 2 desc
        """

spark.sql(query).show()

+--------------------+-----+
|                 app|price|
+--------------------+-----+
|I'm Rich - Trump ...|  400|
|most expensive ap...|  399|
|   I Am Rich Premium|  399|
|I'm Rich/Eu sou R...|  399|
|         💎 I'm rich|  399|
|I am rich (Most e...|  399|
|           I am rich|  399|
|       I Am Rich Pro|  399|
|  I AM RICH PRO PLUS|  399|
|  I am rich(premium)|  399|
|           I am Rich|  399|
|          I am Rich!|  399|
|      I am Rich Plus|  399|
|         Eu Sou Rico|  394|
|           I Am Rich|  389|
| I am extremely Rich|  379|
|       I am rich VIP|  299|
|        EP Cook Book|  200|
|Vargo Anesthesia ...|  158|
|       cronometra-br|  154|
+--------------------+-----+
only showing top 20 rows
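
The agenda also lists top-rated paid apps, which the notebook does not query. One possible shape for that query is sketched below on a small in-memory SQLite table. The rows are made up for illustration, the `reviews` tie-breaker is an assumption about what "top" should mean, and the query has not been run against the `playstore` view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table playstore (app text, type text, rating real, reviews integer)")

# Hypothetical rows, for illustration only
conn.executemany(
    "insert into playstore values (?, ?, ?, ?)",
    [
        ("Paid A", "Paid", 4.8, 1200),
        ("Paid B", "Paid", 4.9, 300),
        ("Paid C", "Paid", None, 10),  # missing rating, filtered out below
        ("Free A", "Free", 5.0, 50),   # not a paid app, filtered out below
    ],
)

query = """
    select app, rating, reviews
    from playstore
    where type = 'Paid' and rating is not null
    order by rating desc, reviews desc
    limit 10
"""
top_paid_by_rating = conn.execute(query).fetchall()
print(top_paid_by_rating)  # [('Paid B', 4.9, 300), ('Paid A', 4.8, 1200)]
conn.close()
```

The same query text should be usable with `spark.sql(query).show()`. In Spark, a rating that parsed as `NaN` sorts above every number in descending order, so the Spark version would likely also need `and not isnan(rating)` in the `where` clause.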



Closing the Spark session

In [None]:
spark.stop()