## Source of This Notebook
https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/ 
#### Data file download (requires a registered Kaggle account)
https://www.kaggle.com/datasets/ysf12ff/app-store-dataset 
Once you are logged in, a 'Download' link is available near the top-right corner of the page.
#### From the downloaded zip file, upload the data file (loaded below as `AppleStore.csv` into the `train` DataFrame) to Jupyter Notebook

## Description of dataset
The data describes the applications available on the App Store: their current version, the rating they received, their price and genre, and the number of devices each application supports.

### Columns
id   Application Unique ID  
track_name  Application Name  
size_bytes  Application size in bytes  
currency   price currency  
price      official price  
rating_count_tot   total rating count  
rating_count_ver   rating count for the current version  
user_rating rating by user  
user_rating_ver rating by user on current version  
ver current version on app store  
cont_rating  
prime_genre  
sup_devices.num  
ipadSc_urls.num  
lang.num  
vpp_lic  

#### Setup Apache Spark Context


In [2]:
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext('local[*]')
sqlContext = SQLContext(sc)

#### Creating the DataFrame from CSV file
#### This is where you import your own csv file

In [3]:
train = sqlContext.read.csv("AppleStore.csv", header=True, inferSchema=True)


#### How to see datatype of columns?


In [3]:
train.printSchema()


root
 |-- id: integer (nullable = true)
 |-- track_name: string (nullable = true)
 |-- size_bytes: long (nullable = true)
 |-- currency: string (nullable = true)
 |-- price: double (nullable = true)
 |-- rating_count_tot: integer (nullable = true)
 |-- rating_count_ver: integer (nullable = true)
 |-- user_rating: double (nullable = true)
 |-- user_rating_ver: double (nullable = true)
 |-- ver: string (nullable = true)
 |-- cont_rating: string (nullable = true)
 |-- prime_genre: string (nullable = true)
 |-- sup_devices.num: integer (nullable = true)
 |-- ipadSc_urls.num: integer (nullable = true)
 |-- lang.num: integer (nullable = true)
 |-- vpp_lic: integer (nullable = true)



#### How to show the first n observations? 
We can use the head operation to see the first n observations (say, 5). The head operation in PySpark is similar to the head operation in Pandas, except that it returns a list of Row objects rather than a DataFrame.

In [4]:
train.head(5)


[Row(id=284882215, track_name='Facebook', size_bytes=389879808, currency='USD', price=0.0, rating_count_tot=2974676, rating_count_ver=212, user_rating=3.5, user_rating_ver=3.5, ver='95', cont_rating='4+', prime_genre='Social Networking', sup_devices.num=37, ipadSc_urls.num=1, lang.num=29, vpp_lic=1),
 Row(id=389801252, track_name='Instagram', size_bytes=113954816, currency='USD', price=0.0, rating_count_tot=2161558, rating_count_ver=1289, user_rating=4.5, user_rating_ver=4.0, ver='10.23', cont_rating='12+', prime_genre='Photo & Video', sup_devices.num=37, ipadSc_urls.num=0, lang.num=29, vpp_lic=1),
 Row(id=529479190, track_name='Clash of Clans', size_bytes=116476928, currency='USD', price=0.0, rating_count_tot=2130805, rating_count_ver=579, user_rating=4.5, user_rating_ver=4.5, ver='9.24.12', cont_rating='9+', prime_genre='Games', sup_devices.num=38, ipadSc_urls.num=5, lang.num=18, vpp_lic=1),
 Row(id=420009108, track_name='Temple Run', size_bytes=65921024, currency='USD', price=0.0, r

The result above is a list of Row objects. To see the result in a more readable, tabular form (rows under column headers), we can use the show operation. Let’s apply show to train and display its first 2 rows. Passing the argument truncate = True shortens long cell values.

In [5]:
train.show(2, truncate=True)


+---------+----------+----------+--------+-----+----------------+----------------+-----------+---------------+-----+-----------+-----------------+---------------+---------------+--------+-------+
|       id|track_name|size_bytes|currency|price|rating_count_tot|rating_count_ver|user_rating|user_rating_ver|  ver|cont_rating|      prime_genre|sup_devices.num|ipadSc_urls.num|lang.num|vpp_lic|
+---------+----------+----------+--------+-----+----------------+----------------+-----------+---------------+-----+-----------+-----------------+---------------+---------------+--------+-------+
|284882215|  Facebook| 389879808|     USD|  0.0|         2974676|             212|        3.5|            3.5|   95|         4+|Social Networking|             37|              1|      29|      1|
|389801252| Instagram| 113954816|     USD|  0.0|         2161558|            1289|        4.5|            4.0|10.23|        12+|    Photo & Video|             37|              0|      29|      1|
+---------+---------
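
As a rough illustration of what truncate = True does to long cell values: to my understanding, Spark shortens strings longer than 20 characters and marks the cut with `...`. The helper below is a plain-Python sketch of that rule, not Spark code; the name `truncate_cell` and the long example string are assumptions made for illustration.

```python
def truncate_cell(value, width=20):
    """Mimic how show(truncate=True) shortens a long cell (sketch only)."""
    text = str(value)
    if len(text) <= width:
        return text
    # Keep width - 3 characters and append an ellipsis marker.
    return text[: width - 3] + "..."


short_name = truncate_cell("Facebook")                 # unchanged
long_name = truncate_cell("Pandora - Music & Radio")   # cut to 20 characters
print(short_name)
print(long_name)
```

Passing truncate = False to show prints the full cell values instead.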

#### How to Count the number of rows in DataFrame?


In [6]:
train.count()


7197

#### How many columns do we have in train, and what are their names?


In [7]:
len(train.columns), train.columns


(16,
 ['id',
  'track_name',
  'size_bytes',
  'currency',
  'price',
  'rating_count_tot',
  'rating_count_ver',
  'user_rating',
  'user_rating_ver',
  'ver',
  'cont_rating',
  'prime_genre',
  'sup_devices.num',
  'ipadSc_urls.num',
  'lang.num',
  'vpp_lic'])

#### How to get the summary statistics (count, mean, standard deviation, min, max) of numerical columns in a DataFrame? 
The describe operation is used to calculate the summary statistics of the numerical column(s) in a DataFrame. If we don’t specify column names, it calculates summary statistics for every column; the output below also includes string columns such as track_name and currency.

In [8]:
train.describe().show()


+-------+--------------------+--------------------+--------------------+--------+------------------+------------------+-----------------+------------------+------------------+------------------+-----------+-----------+-----------------+------------------+-----------------+-------------------+
|summary|                  id|          track_name|          size_bytes|currency|             price|  rating_count_tot| rating_count_ver|       user_rating|   user_rating_ver|               ver|cont_rating|prime_genre|  sup_devices.num|   ipadSc_urls.num|         lang.num|            vpp_lic|
+-------+--------------------+--------------------+--------------------+--------+------------------+------------------+-----------------+------------------+------------------+------------------+-----------+-----------+-----------------+------------------+-----------------+-------------------+
|  count|                7197|                7197|                7197|    7197|              7197|              7197

In [9]:
train.describe('user_rating').show()


+-------+------------------+
|summary|       user_rating|
+-------+------------------+
|  count|              7197|
|   mean| 3.526955675976101|
| stddev|1.5179475936298863|
|    min|               0.0|
|    max|               5.0|
+-------+------------------+
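
To make the five statistics concrete, here is a plain-Python sketch that computes them for the user_rating values of the first five rows of train. It assumes that the stddev reported by describe is the sample standard deviation (dividing by n - 1), which is my understanding of Spark's behaviour; the function name `describe` is only borrowed for illustration.

```python
import statistics


def describe(values):
    """Summary statistics for a list of numbers (sketch of describe())."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        # Sample standard deviation: divides by n - 1, not n.
        "stddev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


ratings = [3.5, 4.5, 4.5, 4.5, 4.0]  # user_rating of the first five rows
summary = describe(ratings)
print(summary)
```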



#### How to select column(s) from the DataFrame?


In [10]:
train.select('id','track_name').show(5)


+---------+--------------------+
|       id|          track_name|
+---------+--------------------+
|284882215|            Facebook|
|389801252|           Instagram|
|529479190|      Clash of Clans|
|420009108|          Temple Run|
|284035177|Pandora - Music &...|
+---------+--------------------+
only showing top 5 rows



#### What if I want to calculate the pairwise frequency of categorical columns? 
We can use the crosstab operation on a DataFrame to calculate the pairwise frequency of two columns. Let’s apply the crosstab operation to the ‘prime_genre’ and ‘user_rating’ columns of the train DataFrame.

In [12]:
train.crosstab('prime_genre', 'user_rating').show()


+-----------------------+---+---+---+---+---+---+---+---+----+---+
|prime_genre_user_rating|0.0|1.0|1.5|2.0|2.5|3.0|3.5|4.0| 4.5|5.0|
+-----------------------+---+---+---+---+---+---+---+---+----+---+
|               Shopping| 16|  0|  0|  0|  5| 10| 14| 24|  41| 12|
|                 Sports| 13|  4|  1| 10| 12| 14| 22| 17|  15|  6|
|              Lifestyle| 31|  4|  7|  4| 12| 10| 10| 29|  29|  8|
|          Photo & Video| 24|  4|  4|  3| 16| 14| 29| 71| 154| 30|
|                Medical|  3|  2|  1|  0|  0|  0|  3|  1|  11|  2|
|           Productivity|  6|  0|  2|  3|  2|  5| 16| 53|  78| 13|
|                   Book| 47|  1|  0|  0|  3|  2|  4| 11|  30| 14|
|          Entertainment| 64|  3|  6| 21| 40| 72| 68|117| 118| 26|
|              Reference| 11|  0|  1|  0|  1|  2|  7| 13|  21|  8|
|                  Music|  4|  0|  0|  1|  3|  7| 17| 42|  58|  6|
|               Catalogs|  5|  0|  0|  0|  0|  0|  1|  2|   1|  1|
|                Weather|  6|  0|  1|  0|  2|  8| 11| 18|  24|
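
A pairwise frequency table simply counts how often each (row value, column value) pair occurs. The following plain-Python sketch builds such a table for a small, made-up list of (prime_genre, user_rating) pairs; the data is hypothetical and the code only mirrors the idea behind crosstab, not Spark's implementation.

```python
from collections import Counter

# Hypothetical mini-dataset of (prime_genre, user_rating) pairs.
rows = [
    ("Games", 4.5), ("Games", 4.5), ("Games", 4.0),
    ("Music", 4.5), ("Music", 3.5),
]

pair_counts = Counter(rows)
genres = sorted({genre for genre, _ in rows})
ratings = sorted({rating for _, rating in rows})

# One row per genre, one column per distinct rating; missing pairs count as 0.
table = {g: [pair_counts[(g, r)] for r in ratings] for g in genres}

print("genre", ratings)
for genre, counts in table.items():
    print(genre, counts)
```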

#### What if I want a DataFrame without the duplicate rows of a given DataFrame?


In [13]:
train.select('prime_genre','user_rating').dropDuplicates().show()


+-----------------+-----------+
|      prime_genre|user_rating|
+-----------------+-----------+
|             Book|        5.0|
|       Navigation|        4.5|
|         Business|        4.0|
|        Utilities|        3.5|
|             Book|        1.0|
|     Food & Drink|        5.0|
|           Sports|        2.5|
|            Music|        2.5|
|             News|        3.0|
|    Photo & Video|        4.5|
|          Medical|        4.0|
|             Book|        2.5|
|        Utilities|        4.0|
|Social Networking|        3.5|
|           Travel|        3.0|
|        Utilities|        2.5|
|        Lifestyle|        0.0|
|    Photo & Video|        0.0|
|           Travel|        1.0|
|             News|        3.5|
+-----------------+-----------+
only showing top 20 rows



#### What if I want to drop all rows with null values? 
The dropna operation can be used here. It takes three options that control which rows are dropped. 
* how – ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null.
* thresh – int, default None. If specified, drop rows that have fewer than thresh non-null values. This overrides the how parameter.
* subset – optional list of column names to consider. 

Let’s drop null rows in train with the default parameters and count the rows in the output DataFrame. The defaults are ‘any’, None, None for how, thresh, subset respectively.


In [6]:
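# Calling train.dropna().count() directly raised an AnalysisException because three
# column names contain a dot, so those columns are renamed first.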

train = train.withColumnRenamed("sup_devices.num", "sup_devices_num")
train = train.withColumnRenamed("ipadSc_urls.num", "ipadSc_urls_num")
train = train.withColumnRenamed("lang.num", "lang_num")

train.dropna().count()

7197
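
Since train has no null values, the count above does not show how the three options interact. The sketch below models the rules in plain Python on a tiny, made-up dataset (rows are dicts and None stands for null). It is an illustration of the documented semantics under those assumptions, not Spark's code.

```python
def dropna(rows, how="any", thresh=None, subset=None):
    """Plain-Python model of the dropna rules; None plays the role of null."""
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        non_null = sum(row[c] is not None for c in cols)
        if thresh is not None:
            keep = non_null >= thresh       # thresh overrides how
        elif how == "any":
            keep = non_null == len(cols)    # drop if any value is null
        else:                               # how == "all"
            keep = non_null > 0             # drop only if every value is null
        if keep:
            kept.append(row)
    return kept


# Hypothetical rows with some missing values.
rows = [
    {"id": 1, "price": 0.0, "ver": "95"},
    {"id": 2, "price": None, "ver": "1.0"},
    {"id": None, "price": None, "ver": None},
]
print(len(dropna(rows)))                  # how='any' keeps only complete rows
print(len(dropna(rows, how="all")))       # drops only the all-null row
print(len(dropna(rows, thresh=2)))        # needs at least 2 non-null values
print(len(dropna(rows, subset=["id"])))   # looks at the id column only
```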

#### What if I want to fill the null values in a DataFrame with a constant? 
Use the fillna operation here. fillna takes two parameters to fill the null values. 
+ value:
  + A dictionary that maps each column name to the value that replaces its nulls, or
  + A single value (int, float, string) for all columns.
+ subset: optional list of columns to which the fill is restricted.

Let’s fill ‘-1’ in place of null values in the train DataFrame.

In [15]:
train.fillna(-1).show(2)


+---------+----------+----------+--------+-----+----------------+----------------+-----------+---------------+-----+-----------+-----------------+---------------+---------------+--------+-------+
|       id|track_name|size_bytes|currency|price|rating_count_tot|rating_count_ver|user_rating|user_rating_ver|  ver|cont_rating|      prime_genre|sup_devices.num|ipadSc_urls.num|lang.num|vpp_lic|
+---------+----------+----------+--------+-----+----------------+----------------+-----------+---------------+-----+-----------+-----------------+---------------+---------------+--------+-------+
|284882215|  Facebook| 389879808|     USD|  0.0|         2974676|             212|        3.5|            3.5|   95|         4+|Social Networking|             37|              1|      29|      1|
|389801252| Instagram| 113954816|     USD|  0.0|         2161558|            1289|        4.5|            4.0|10.23|        12+|    Photo & Video|             37|              0|      29|      1|
+---------+---------
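
Because train contains no nulls, the output looks unchanged. The plain-Python sketch below shows the difference between a single fill value, a per-column dictionary, and a subset on a small, made-up dataset. It is a simplified model: as far as I know, Spark additionally skips columns whose data type does not match the fill value (for example, -1 is not written into string columns), which this sketch does not reproduce.

```python
def fillna(rows, value, subset=None):
    """Plain-Python model of fillna; rows are dicts and None is null."""
    filled = []
    for row in rows:
        new_row = dict(row)
        for col, cell in row.items():
            if cell is not None:
                continue
            if isinstance(value, dict):
                # Per-column replacement; columns not in the dict stay null.
                if col in value:
                    new_row[col] = value[col]
            elif subset is None or col in subset:
                new_row[col] = value
        filled.append(new_row)
    return filled


# Hypothetical rows with missing numeric values.
rows = [
    {"price": None, "lang_num": 29},
    {"price": 0.99, "lang_num": None},
]
print(fillna(rows, -1))
print(fillna(rows, {"price": 0.0}))
print(fillna(rows, -1, subset=["lang_num"]))
```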

#### What if I want to filter the rows in train which have user_rating more than 3? 
We can apply the filter operation to the user_rating column of the train DataFrame to keep only the rows with values greater than 3. We need to pass a condition. Let’s apply filter on user_rating and print the number of rows whose user_rating is greater than 3.

In [16]:
train.filter(train.user_rating > 3).count()


5483

#### How to find the mean rating_count_tot of each prime_genre group in train? 
The groupby operation can be used here to find the mean of rating_count_tot for each prime_genre group in train. Let’s see how to compute the mean of the ‘rating_count_tot’ column per group.

In [18]:
train.groupby('prime_genre').agg({'rating_count_tot': 'mean'}).show()


+-----------------+---------------------+
|      prime_genre|avg(rating_count_tot)|
+-----------------+---------------------+
|        Education|   2239.2295805739514|
|       Navigation|    11853.95652173913|
|    Entertainment|    7533.678504672897|
|           Sports|   14026.929824561403|
|     Food & Drink|   13938.619047619048|
|    Photo & Video|   14352.280802292264|
|           Travel|   14129.444444444445|
|          Finance|   11047.653846153846|
|Social Networking|    45498.89820359281|
|             Book|            5125.4375|
|         Shopping|    18615.32786885246|
|        Reference|          22410.84375|
| Health & Fitness|    9913.172222222222|
|        Utilities|    6863.822580645161|
|     Productivity|   8051.3258426966295|
|            Games|   13691.996633868463|
|            Music|   28842.021739130436|
|        Lifestyle|    6161.763888888889|
|         Business|    4788.087719298245|
|         Catalogs|               1732.5|
+-----------------+---------------

We can also apply sum, min, max, count with groupby when we want a different summary for each group. Let’s take one more example of groupby to count the number of rows in each user_rating group.

In [20]:
train.groupby('user_rating').count().show()


+-----------+-----+
|user_rating|count|
+-----------+-----+
|        0.0|  929|
|        3.5|  702|
|        4.5| 2663|
|        2.5|  196|
|        1.0|   44|
|        4.0| 1626|
|        3.0|  383|
|        2.0|  106|
|        1.5|   56|
|        5.0|  492|
+-----------+-----+
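
Conceptually, groupby collects the rows that share a key and then reduces each group to one value. A minimal plain-Python sketch of the two aggregations used above (mean and count), on made-up (prime_genre, rating_count_tot) pairs, could look like this; the data is hypothetical and the code is not how Spark executes the query.

```python
from collections import defaultdict

# Hypothetical (prime_genre, rating_count_tot) pairs.
rows = [("Games", 100), ("Games", 300), ("Music", 50), ("Book", 10), ("Book", 30)]

# Step 1: bucket the values by key.
groups = defaultdict(list)
for genre, rating_count in rows:
    groups[genre].append(rating_count)

# Step 2: reduce every bucket to a single number.
mean_by_genre = {g: sum(v) / len(v) for g, v in groups.items()}
count_by_genre = {g: len(v) for g, v in groups.items()}

print(mean_by_genre)
print(count_by_genre)
```

Note that the Spark output above is not sorted by the grouping column; add orderBy if a particular order is needed.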



#### How to create a sample DataFrame from the base DataFrame? 
We can use the sample operation to take a sample of a DataFrame. The sample method returns a new DataFrame containing a sample of the base DataFrame. It takes 3 parameters.

+ withReplacement = True or False to select an observation with or without replacement.
+ fraction = x, where x = .5 means that we want roughly 50% of the data in the sample DataFrame.
+ seed to reproduce the result.

Let’s create two DataFrames, t1 and t2, from train. Each is a sample of about 20% of train; we then count the number of rows in each.

In [21]:
t1 = train.sample(False, 0.2, 42)
t2 = train.sample(False, 0.2, 43)
t1.count(),t2.count()

(1447, 1391)
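
Neither count is exactly 20% of 7197 (about 1439). A likely explanation, as far as I understand sampling without replacement in Spark, is that each row is kept independently with probability fraction, so the sample size itself is random. The plain-Python sketch below imitates that idea; `bernoulli_sample` is a hypothetical helper, and its counts will not match Spark's because the random number generators differ.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(7197))  # stand-in for the 7197 rows of train
s1 = bernoulli_sample(rows, 0.2, seed=42)
s2 = bernoulli_sample(rows, 0.2, seed=43)

# Both sizes land near 0.2 * 7197, but they are not fixed in advance.
print(len(s1), len(s2))
```

Re-running with the same seed reproduces the same sample, which is the purpose of the seed parameter.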

#### How to apply map operation on DataFrame columns? 
We can apply a function to each row of a DataFrame using the map operation. The result is an RDD. Let’s apply a map operation to the id column of train and print the first 5 elements of the mapped RDD, where a lambda function turns each row x into the pair (x, 1). 

Spark 2.0 (https://stackoverflow.com/questions/39535447/attributeerror-dataframe-object-has-no-attribute-map) 
You can't map a dataframe, but you can convert the dataframe to an RDD and map that by doing spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map would alias to spark_df.rdd.map(). With Spark 2.0, you must explicitly call .rdd first.

Map vs Filter (https://stackoverflow.com/questions/40459695/map-vs-filter-operations)

Map, you pass in a function which returns a value for each element in an array. <strong>The return value of this function represents what an element becomes in our new array.</strong>

Filter, you pass in a function which returns either true or false for each element. <strong>If the function that you pass returns true for an element, then that element is included in the final array.</strong>

In [22]:
train.select('id').rdd.map(lambda x:(x,1)).take(5)


[(Row(id=284882215), 1),
 (Row(id=389801252), 1),
 (Row(id=529479190), 1),
 (Row(id=420009108), 1),
 (Row(id=284035177), 1)]
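
The difference between map and filter quoted above can be tried out with Python's built-in `map` and `filter` on ordinary lists. This is only an analogy to the RDD operations, using the first three ids and the first five user_rating values from train.

```python
ids = [284882215, 389801252, 529479190]   # first three ids from train
ratings = [3.5, 4.5, 4.5, 4.5, 4.0]       # first five user_rating values

# map: every element is transformed, so the output has the same length.
pairs = list(map(lambda x: (x, 1), ids))

# filter: elements are kept or dropped, never changed.
high = list(filter(lambda r: r > 4.0, ratings))

print(pairs)
print(high)
```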

#### How to sort the DataFrame based on column(s)? 
We can use the orderBy operation on a DataFrame to get sorted output based on some column. The orderBy operation takes two arguments.

+ List of columns.
+ ascending = True or False for getting the results in ascending or descending order (a list in case of more than one column).

Let’s sort the train DataFrame by ‘cont_rating’ in descending order. Because cont_rating is a string column, the order is lexicographic, which is why ‘9+’ comes first.

In [23]:
train.orderBy(train.cont_rating.desc()).show(5)


+---------+--------------------+----------+--------+-----+----------------+----------------+-----------+---------------+-----+-----------+-----------+---------------+---------------+--------+-------+
|       id|          track_name|size_bytes|currency|price|rating_count_tot|rating_count_ver|user_rating|user_rating_ver|  ver|cont_rating|prime_genre|sup_devices.num|ipadSc_urls.num|lang.num|vpp_lic|
+---------+--------------------+----------+--------+-----+----------------+----------------+-----------+---------------+-----+-----------+-----------+---------------+---------------+--------+-------+
|924373886|Crossy Road - End...| 165471232|     USD|  0.0|          669079|            1087|        4.5|            4.5|1.5.4|         9+|      Games|             38|              5|      13|      1|
|572395608|        Temple Run 2| 158025728|     USD|  0.0|          295211|              91|        4.5|            4.0| 1.37|         9+|      Games|             38|              5|       1|      1|
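
The rows with cont_rating ‘9+’ appear first even though ‘17+’ is the highest age rating, because cont_rating is a string column and strings are compared character by character. The plain-Python sketch below shows the same effect and one possible way to get the numeric order; the parsing with `rstrip("+")` is an assumption about the value format, based on the four values seen in this dataset.

```python
cont_ratings = ["4+", "12+", "9+", "17+"]

# String comparison goes character by character, so "9+" sorts above "17+".
as_strings = sorted(cont_ratings, reverse=True)

# Stripping the "+" and converting to int gives the numeric order instead.
as_numbers = sorted(cont_ratings, key=lambda s: int(s.rstrip("+")), reverse=True)

print(as_strings)   # ['9+', '4+', '17+', '12+']
print(as_numbers)   # ['17+', '12+', '9+', '4+']
```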


#### How to add a new column to the DataFrame? 
We can use the withColumn operation to add a new column to (or replace an existing column of) the base DataFrame; it returns a new DataFrame. The withColumn operation takes 2 parameters.

+ The name of the column we want to add or replace.
+ An expression on a column.

Let’s see how withColumn works. I am adding a new column named ‘user_rating_new’ to train, calculated by dividing the user_rating column by 2.

In [24]:
train.withColumn('user_rating_new', train.user_rating / 2.0).select('user_rating', 'user_rating_new').show(5)


+-----------+---------------+
|user_rating|user_rating_new|
+-----------+---------------+
|        3.5|           1.75|
|        4.5|           2.25|
|        4.5|           2.25|
|        4.5|           2.25|
|        4.0|            2.0|
+-----------+---------------+
only showing top 5 rows



#### How to apply SQL queries on a DataFrame? 
A DataFrame carries the names and datatypes of its columns. Unlike an RDD, this additional information allows Spark to run SQL queries on a DataFrame. To apply SQL queries to a DataFrame, we first need to register it as a table. Let’s register the train DataFrame as a temporary view.

In [25]:
# NOTE: createOrReplaceTempView requires Spark 2.0.0+
train.createOrReplaceTempView('train_table')

In [26]:
sqlContext.sql('select id from train_table').show(5)


+---------+
|       id|
+---------+
|284882215|
|389801252|
|529479190|
|420009108|
|284035177|
+---------+
only showing top 5 rows



Let’s get maximum user_rating of each prime_genre group in train_table.


In [27]:
sqlContext.sql('select prime_genre, max(user_rating) from train_table group by prime_genre').show()


+-----------------+----------------+
|      prime_genre|max(user_rating)|
+-----------------+----------------+
|        Education|             5.0|
|       Navigation|             5.0|
|    Entertainment|             5.0|
|           Sports|             5.0|
|     Food & Drink|             5.0|
|    Photo & Video|             5.0|
|           Travel|             5.0|
|          Finance|             5.0|
|Social Networking|             5.0|
|             Book|             5.0|
|         Shopping|             5.0|
|        Reference|             5.0|
| Health & Fitness|             5.0|
|        Utilities|             5.0|
|     Productivity|             5.0|
|            Games|             5.0|
|            Music|             5.0|
|        Lifestyle|             5.0|
|         Business|             5.0|
|         Catalogs|             5.0|
+-----------------+----------------+
only showing top 20 rows



### Did all data processing steps work well with your new dataset? If not, which instructions did not work and why do you think it happened?

No, there was one step which resulted in an error:  

#### Instruction:
train.dropna().count()

#### Error:
AnalysisException: 'Cannot resolve column name "sup_devices.num" among (id, track_name, size_bytes, currency, price, rating_count_tot, rating_count_ver, user_rating, user_rating_ver, ver, cont_rating, prime_genre, sup_devices.num, ipadSc_urls.num, lang.num, vpp_lic);'

#### Why it happened?
The error occurred because the column name "sup_devices.num" contains a dot ("."), which Spark treats as a separator for accessing nested fields unless the name is quoted with backticks. I handled the error by renaming the three columns that contain a dot, which resolved the issue.
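
Rather than renaming each affected column by hand, the new names can be derived from the full column list. The snippet below is a plain-Python sketch of that idea; choosing an underscore as the replacement is an assumption that matches the manual renames used in this notebook.

```python
# The 16 column names reported by train.columns.
columns = [
    "id", "track_name", "size_bytes", "currency", "price",
    "rating_count_tot", "rating_count_ver", "user_rating", "user_rating_ver",
    "ver", "cont_rating", "prime_genre",
    "sup_devices.num", "ipadSc_urls.num", "lang.num", "vpp_lic",
]

# Replace the dot with an underscore in every column name.
clean_columns = [name.replace(".", "_") for name in columns]
renamed = [(old, new) for old, new in zip(columns, clean_columns) if old != new]

print(renamed)

# With a live DataFrame, the complete list of new names could then be applied
# in one call, e.g. train = train.toDF(*clean_columns)
```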


## Building three questions to investigate further

#### 1. How does the average user rating (user_rating) vary across different genres (prime_genre)?

In [28]:
sqlContext.sql('SELECT prime_genre, AVG(user_rating) AS avg_user_rating \
                FROM train_table \
                GROUP BY prime_genre \
                ORDER BY avg_user_rating DESC').show()


+-----------------+------------------+
|      prime_genre|   avg_user_rating|
+-----------------+------------------+
|     Productivity|  4.00561797752809|
|            Music|3.9782608695652173|
|    Photo & Video|3.8008595988538683|
|         Business| 3.745614035087719|
| Health & Fitness|               3.7|
|            Games|3.6850077679958573|
|          Weather|3.5972222222222223|
|         Shopping| 3.540983606557377|
|        Reference|          3.453125|
|           Travel| 3.376543209876543|
|        Education| 3.376379690949227|
|          Medical| 3.369565217391304|
|        Utilities| 3.278225806451613|
|    Entertainment|3.2467289719626167|
|     Food & Drink|3.1825396825396823|
|Social Networking|2.9850299401197606|
|           Sports| 2.982456140350877|
|             News|              2.98|
|        Lifestyle|2.8055555555555554|
|       Navigation|2.6847826086956523|
+-----------------+------------------+
only showing top 20 rows



#### 2. How does the average price of apps vary across different content ratings (cont_rating)?

In [29]:
sqlContext.sql('SELECT cont_rating, AVG(price) AS avg_price \
                FROM train_table \
                GROUP BY cont_rating \
                ORDER BY cont_rating').show()


+-----------+------------------+
|cont_rating|         avg_price|
+-----------+------------------+
|        12+|1.5666666666666695|
|        17+|0.9811093247588443|
|         4+|1.7772005413940217|
|         9+|2.1535055724417433|
+-----------+------------------+



#### 3. What is the relationship between app size and user ratings for each prime_genre?

In [30]:
sqlContext.sql('SELECT prime_genre, AVG(size_bytes) AS avg_app_size, AVG(user_rating) AS avg_user_rating \
                FROM train_table \
                GROUP BY prime_genre \
                ORDER BY prime_genre').show()


+-----------------+--------------------+------------------+
|      prime_genre|        avg_app_size|   avg_user_rating|
+-----------------+--------------------+------------------+
|             Book| 1.788206262857143E8|2.4776785714285716|
|         Business|  6.41684929122807E7| 3.745614035087719|
|         Catalogs|         5.0181632E7|               2.1|
|        Education|1.8042422133995584E8| 3.376379690949227|
|    Entertainment|1.0147872102056074E8|3.2467289719626167|
|          Finance| 7.823585928846154E7|2.4326923076923075|
|     Food & Drink| 7.759499784126984E7|3.1825396825396823|
|            Games|2.8365830049792856E8|3.6850077679958573|
| Health & Fitness| 9.010664106666666E7|               3.7|
|        Lifestyle| 6.230646750694445E7|2.8055555555555554|
|          Medical| 3.763890086956522E8| 3.369565217391304|
|            Music|1.0963563757971014E8|3.9782608695652173|
|       Navigation| 1.033544852826087E8|2.6847826086956523|
|             News|6.2470853266666666E7|
