# Cohort Assignment

### Write a Spark program (in Python, Scala, or any other language) to achieve the following:

#### 1. Calculate hourly, daily, weekly and monthly averages of the aggregated counts for each customer, and make them available for querying.

#### 2. Calculate week-on-week cohorts for customers taking rides in that week. An example cohort output is shown as follows.
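
To make the cohort definition concrete, here is a minimal sketch in plain Python on a toy sample. It assumes one common definition: a customer's cohort is the week (starting Monday) of their first ride, and each cohort row counts how many of its customers are active 0, 1, 2, ... weeks later. The function names and the sample events are invented for illustration and are not part of the assignment. In Spark the same idea could be expressed with `date_trunc('week', ...)`, a per-user `min`, and a distinct count.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def week_start(ts):
    """Return the Monday (as a date) of the week containing ts."""
    return (ts - timedelta(days=ts.weekday())).date()


def weekly_cohorts(events):
    """events: iterable of (timestamp_string, user_id) pairs.

    Returns {cohort_week: {weeks_since_first_ride: distinct_user_count}}.
    """
    active_weeks = defaultdict(set)  # user_id -> set of week-start dates
    for ts_text, user_id in events:
        ts = datetime.strptime(ts_text, "%Y-%m-%d %H:%M:%S")
        active_weeks[user_id].add(week_start(ts))

    cohorts = defaultdict(lambda: defaultdict(int))
    for user_id, weeks in active_weeks.items():
        first_week = min(weeks)  # the user's cohort week
        for week in weeks:
            offset = (week - first_week).days // 7
            cohorts[first_week][offset] += 1
    return {week: dict(counts) for week, counts in cohorts.items()}


# Toy sample, invented for illustration (not taken from the real dataset).
sample_events = [
    ("2018-04-02 09:00:00", "u1"),
    ("2018-04-03 10:00:00", "u2"),
    ("2018-04-10 09:30:00", "u1"),
    ("2018-04-11 18:00:00", "u3"),
    ("2018-04-17 08:15:00", "u1"),
    ("2018-04-18 19:45:00", "u3"),
]

cohort_table = weekly_cohorts(sample_events)
for cohort_week in sorted(cohort_table):
    print(cohort_week, cohort_table[cohort_week])
```

In this toy sample, the cohort of the week starting 2018-04-02 has two customers in week 0 and one of them still riding in weeks 1 and 2.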


## The dataset is available at the following link

##### https://drive.google.com/file/d/1gGDJpGrKTFdy-IKOXxRv5QNuTlJHmp1f/view?usp=sharing

In [1]:
# Remove any previous extract (-f: no error if the file is absent) before unzipping
!rm -f DE_Cohorts_Assignment_Data.csv
!unzip DE_Cohorts_Assignment_Data.csv.zip
!pwd

Archive:  DE_Cohorts_Assignment_Data.csv.zip
  inflating: DE_Cohorts_Assignment_Data.csv  
/home/rita/Documents/Spark/Assignments


## Import all relevant packages

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window, substring, col
# Note: the wildcard import below shadows Python built-ins such as sum, min, max and round
from pyspark.sql.functions import *
from datetime import datetime
import time

## Create a Spark Session

In [3]:
spark = SparkSession.builder.appName("Cohort Assignment").master("local[*]").getOrCreate()



## Read data from the csv file

In [4]:
df = spark.read.format("csv").options(header='true', inferschema='true', delimiter=',').load("/home/rita/Documents/Spark/Assignments/DE_Cohorts_Assignment_Data.csv")

                                                                                

### The data has two columns: 'timestamp' and 'user_id'

In [30]:
df.show()

+-------------------+-------+
|          timestamp|user_id|
+-------------------+-------+
|2018-04-07 07:07:17|  14626|
|2018-04-07 07:32:27|  85490|
|2018-04-07 07:36:44|  05408|
|2018-04-07 07:38:00|  58940|
|2018-04-07 07:39:29|  05408|
|2018-04-07 07:43:08|  05408|
|2018-04-07 07:43:55|  50266|
|2018-04-07 07:52:31|  58940|
|2018-04-07 07:52:42|  58940|
|2018-04-07 07:53:23|  28126|
|2018-04-07 07:55:05|  99251|
|2018-04-07 07:55:24|  99251|
|2018-04-07 08:00:04|  34808|
|2018-04-07 08:00:16|  34808|
|2018-04-07 08:03:16|  89714|
|2018-04-07 08:03:24|  89714|
|2018-04-07 08:07:16|  82060|
|2018-04-07 08:08:57|  18815|
|2018-04-07 08:11:35|  38288|
|2018-04-07 08:17:24|  80401|
+-------------------+-------+
only showing top 20 rows



### Shape of the data (number of rows, number of columns)

In [31]:
print((df.count(), len(df.columns)))

(8381556, 2)



### Record count per hour of day (all dates pooled)

In [40]:
hourly_count = df.groupby(hour(col("timestamp")).alias('hour')).count().orderBy('hour')
hourly_count.show()


+----+------+
|hour| count|
+----+------+
|   0| 87920|
|   1| 32117|
|   2|  8532|
|   3|  4066|
|   4|  4689|
|   5|  9656|
|   6| 39227|
|   7|166430|
|   8|448867|
|   9|863465|
|  10|848744|
|  11|447322|
|  12|360131|
|  13|349257|
|  14|301419|
|  15|270961|
|  16|313228|
|  17|447311|
|  18|787922|
|  19|930550|
+----+------+
only showing top 20 rows



                                                                                

In [48]:
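# Note: hourly_count already has one row per hour, so each group holds a single value and avg(count) equals count
hourly_avg = hourly_count.groupby('hour').agg(avg(col("count")))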
hourly_avg.show()



+----+----------+
|hour|avg(count)|
+----+----------+
|  12|  360131.0|
|  22|  342316.0|
|   1|   32117.0|
|  13|  349257.0|
|  16|  313228.0|
|   6|   39227.0|
|   3|    4066.0|
|  20|  669083.0|
|   5|    9656.0|
|  19|  930550.0|
|  15|  270961.0|
|   9|  863465.0|
|  17|  447311.0|
|   4|    4689.0|
|   8|  448867.0|
|  23|  187113.0|
|   7|  166430.0|
|  10|  848744.0|
|  21|  461230.0|
|  11|  447322.0|
+----+----------+
only showing top 20 rows
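
Averaging a table that has a single row per hour returns the counts unchanged, as the output above shows. One way to get a real hourly average is to count records per calendar date and hour first, then average those counts per hour of day. The following is a minimal sketch of that two-step logic in plain Python. The function name is hypothetical and the sample mixes a few timestamps from the table above with invented ones. In Spark, the first step would correspond to grouping by `to_date(col("timestamp"))` together with `hour(col("timestamp"))`.

```python
from collections import Counter, defaultdict
from datetime import datetime


def average_count_per_hour_of_day(timestamps):
    """Average records per hour of day, over the dates on which that hour has data.

    Dates with no record for a given hour are not counted as zero.
    """
    # Step 1: count records per (calendar date, hour of day).
    per_date_hour = Counter()
    for ts_text in timestamps:
        ts = datetime.strptime(ts_text, "%Y-%m-%d %H:%M:%S")
        per_date_hour[(ts.date(), ts.hour)] += 1

    # Step 2: average those per-date counts for each hour of day.
    counts_by_hour = defaultdict(list)
    for (_, hour_of_day), count in per_date_hour.items():
        counts_by_hour[hour_of_day].append(count)
    return {h: sum(c) / len(c) for h, c in sorted(counts_by_hour.items())}


# Toy sample for illustration only.
sample = [
    "2018-04-07 07:07:17", "2018-04-07 07:32:27", "2018-04-07 07:36:44",
    "2018-04-08 07:10:00",
    "2018-04-07 08:00:04",
    "2018-04-08 08:03:16",
]

hourly_average = average_count_per_hour_of_day(sample)
print(hourly_average)  # {7: 2.0, 8: 1.0}
```

Hour 7 has three records on one date and one on the other, so its average is 2.0 rather than the pooled total of 4.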



                                                                                

### Record count per user and hour of day

In [42]:
hourly_user_count = df.groupby(hour(col("timestamp")).alias('hour'),'user_id').count().orderBy('hour')
hourly_user_count.show()



+----+-------+-----+
|hour|user_id|count|
+----+-------+-----+
|   0|  57736|    6|
|   0|  11898|   11|
|   0|  32325|    7|
|   0|  47147|    8|
|   0|  97775|   19|
|   0|  27178|    1|
|   0|  01246|    2|
|   0|  39708|    6|
|   0|  53656|    2|
|   0|  57384|    6|
|   0|  86880|    1|
|   0|  32014|    2|
|   0|  77001|    1|
|   0|  80967|    2|
|   0|  19673|    2|
|   0|  00278|    1|
|   0|  43585|    2|
|   0|  76993|    5|
|   0|  06413|    2|
|   0|  62460|   54|
+----+-------+-----+
only showing top 20 rows



                                                                                

In [41]:
# Note: dayofmonth() pools the same day number across every month and year in the data
daily_count = df.groupby(dayofmonth(col("timestamp")).alias('day')).count()
daily_count.show()



+---+------+
|day| count|
+---+------+
| 31|176963|
| 28|296030|
| 26|260514|
| 27|263165|
| 12|278631|
| 22|289733|
|  1|296340|
| 13|260829|
| 16|238040|
|  6|272150|
|  3|265798|
| 20|289614|
|  5|358409|
| 19|264761|
| 15|274835|
|  9|256080|
| 17|242458|
|  4|315497|
|  8|256411|
| 23|229617|
+---+------+
only showing top 20 rows



                                                                                

In [49]:
# Note: daily_count already has one row per day, so avg(count) equals count
daily_avg = daily_count.groupby('day').agg(avg(col("count")))
daily_avg.show()



+---+----------+
|day|avg(count)|
+---+----------+
| 31|  176963.0|
| 28|  296030.0|
| 26|  260514.0|
| 27|  263165.0|
| 12|  278631.0|
| 22|  289733.0|
|  1|  296340.0|
| 13|  260829.0|
| 16|  238040.0|
|  6|  272150.0|
|  3|  265798.0|
| 20|  289614.0|
|  5|  358409.0|
| 19|  264761.0|
| 15|  274835.0|
|  9|  256080.0|
| 17|  242458.0|
|  4|  315497.0|
|  8|  256411.0|
| 23|  229617.0|
+---+----------+
only showing top 20 rows



                                                                                

In [43]:
daily_user_count = df.groupby(dayofmonth(col("timestamp")).alias('day'),'user_id').count()
daily_user_count.show()



+---+-------+-----+
|day|user_id|count|
+---+-------+-----+
|  7|  98098|   26|
|  7|  33825|   18|
|  7|  98699|    2|
|  7|  00341|    3|
|  7|  64446|   14|
|  7|  21277|    4|
|  7|  67642|    5|
|  7|  67570|   20|
|  7|  39253|    2|
|  7|  62460|   52|
|  7|  93312|    1|
|  7|  55075|   15|
|  7|  23739|   21|
|  7|  35507|    4|
|  8|  19825|    7|
|  8|  89285|    8|
|  8|  27050|   13|
|  8|  64132|   30|
|  8|  35015|   76|
|  8|  63489|    2|
+---+-------+-----+
only showing top 20 rows
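
Task 1 asks for averages per customer. A possible reading is: for each customer, count records per calendar bucket (hour, day, week or month), then average those counts over the buckets in which the customer was active. The sketch below implements that reading in plain Python. `bucket_key` and `average_count_per_user` are hypothetical names, and the sample is a toy one built around user IDs that appear in the tables above.

```python
from collections import Counter, defaultdict
from datetime import datetime


def bucket_key(ts, granularity):
    """Map a datetime to a calendar bucket that includes the year."""
    if granularity == "hourly":
        return (ts.year, ts.month, ts.day, ts.hour)
    if granularity == "daily":
        return (ts.year, ts.month, ts.day)
    if granularity == "weekly":
        iso = ts.isocalendar()
        return (iso[0], iso[1])  # ISO year and ISO week number
    if granularity == "monthly":
        return (ts.year, ts.month)
    raise ValueError("unknown granularity: " + granularity)


def average_count_per_user(events, granularity):
    """events: iterable of (timestamp_string, user_id) pairs."""
    per_user_bucket = Counter()
    for ts_text, user_id in events:
        ts = datetime.strptime(ts_text, "%Y-%m-%d %H:%M:%S")
        per_user_bucket[(user_id, bucket_key(ts, granularity))] += 1

    totals = defaultdict(lambda: [0, 0])  # user_id -> [records, active buckets]
    for (user_id, _), count in per_user_bucket.items():
        totals[user_id][0] += count
        totals[user_id][1] += 1
    return {user: records / buckets for user, (records, buckets) in totals.items()}


# Toy sample for illustration only.
events = [
    ("2018-04-07 07:36:44", "05408"),
    ("2018-04-07 07:39:29", "05408"),
    ("2018-04-07 07:43:08", "05408"),
    ("2018-04-08 09:00:00", "05408"),
    ("2018-04-07 07:38:00", "58940"),
]

daily = average_count_per_user(events, "daily")
weekly = average_count_per_user(events, "weekly")
monthly = average_count_per_user(events, "monthly")
print(daily, weekly, monthly)
```

Averaging only over active buckets is a design choice. Treating inactive days or weeks as zero would need the full calendar range as the denominator instead.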



                                                                                

In [45]:
# Note: weekofyear() pools the same ISO week number across 2018 and 2019
weekly_count = df.groupby(weekofyear(col('timestamp')).alias('week')).count()
weekly_count.show()



+----+------+
|week| count|
+----+------+
|  26| 60759|
|  22| 63352|
|  16| 62111|
|  20| 56615|
|  19| 51375|
|  15| 69476|
|  17| 55846|
|  23| 59813|
|  25| 63241|
|  24| 60669|
|  21| 57284|
|  14|596244|
|  18| 47554|
|  31| 64243|
|  34| 97551|
|  28| 64492|
|  27| 61134|
|  35|113572|
|  29| 65746|
|  32| 74065|
+----+------+
only showing top 20 rows
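
Because the data spans 2018 and 2019, grouping by week number alone merges, for example, week 14 of 2018 with week 14 of 2019. The small sketch below shows the difference between a week-only key and a (year, week) key, using Python's ISO calendar, which follows the same ISO 8601 week convention as Spark's `weekofyear`. The helper name and the two 2019 timestamps are invented for illustration. In Spark, grouping by `date_trunc('week', col('timestamp'))` would be one way to keep the years apart.

```python
from collections import Counter
from datetime import datetime


def iso_week_key(ts_text):
    """Return (ISO year, ISO week number) for a timestamp string."""
    ts = datetime.strptime(ts_text, "%Y-%m-%d %H:%M:%S")
    iso = ts.isocalendar()
    return (iso[0], iso[1])


# One timestamp from the dataset preview plus two invented 2019 timestamps.
sample = ["2018-04-07 07:07:17", "2019-04-06 09:00:00", "2019-04-06 10:00:00"]

by_week_only = Counter(iso_week_key(t)[1] for t in sample)
by_year_and_week = Counter(iso_week_key(t) for t in sample)

print(dict(by_week_only))      # {14: 3}
print(dict(by_year_and_week))  # {(2018, 14): 1, (2019, 14): 2}
```

The same concern applies to grouping by `month` or `dayofmonth` alone.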



                                                                                

In [54]:
# Note: weekly_count already has one row per week, so avg(count) equals count
weekly_avg = weekly_count.groupby('week').agg(avg(col("count")))
weekly_avg.show()


+----+----------+
|week|avg(count)|
+----+----------+
|  26|   60759.0|
|  22|   63352.0|
|  16|   62111.0|
|  20|   56615.0|
|  19|   51375.0|
|  15|   69476.0|
|  17|   55846.0|
|  23|   59813.0|
|  25|   63241.0|
|  24|   60669.0|
|  21|   57284.0|
|  14|  596244.0|
|  18|   47554.0|
|  31|   64243.0|
|  34|   97551.0|
|  28|   64492.0|
|  27|   61134.0|
|  35|  113572.0|
|  29|   65746.0|
|  32|   74065.0|
+----+----------+
only showing top 20 rows





In [46]:
weekly_user_count = df.groupby(weekofyear(col('timestamp')).alias('week'),'user_id').count()
weekly_user_count.show()


+----+-------+-----+
|week|user_id|count|
+----+-------+-----+
|  14|  27783|    3|
|  14|  04643|   13|
|  14|  73301|    1|
|  14|  28358|    1|
|  14|  67169|   10|
|  14|  26118|   75|
|  14|  05757|    2|
|  14|  99723|    8|
|  14|  07125|    1|
|  14|  97131|   30|
|  14|  93694|    1|
|  14|  24798|   15|
|  14|  22527|    3|
|  14|  47815|    7|
|  14|  98884|    5|
|  15|  70020|    3|
|  15|  91992|   57|
|  15|  19825|   14|
|  15|  48479|    2|
|  15|  76219|    1|
+----+-------+-----+
only showing top 20 rows





In [63]:
# Note: month() pools the same month number across 2018 and 2019
monthly_count = df.groupby(month(col('timestamp')).alias('month')).count().orderBy('month')
monthly_count.show()



+-----+-------+
|month|  count|
+-----+-------+
|    1| 876345|
|    2|1293520|
|    3|1761390|
|    4| 790434|
|    5| 246228|
|    6| 263027|
|    7| 282412|
|    8| 389824|
|    9| 573085|
|   10| 623576|
|   11| 610519|
|   12| 671196|
+-----+-------+



                                                                                

In [58]:
# Note: monthly_count already has one row per month, so avg(count) equals count
monthly_avg = monthly_count.groupby('month').agg(avg(col('count'))).orderBy('month')
monthly_avg.show()



+-----+----------+
|month|avg(count)|
+-----+----------+
|    1|  876345.0|
|    2| 1293520.0|
|    3| 1761390.0|
|    4|  790434.0|
|    5|  246228.0|
|    6|  263027.0|
|    7|  282412.0|
|    8|  389824.0|
|    9|  573085.0|
|   10|  623576.0|
|   11|  610519.0|
|   12|  671196.0|
+-----+----------+



                                                                                

In [60]:
# Use a separate name so the per-month totals in monthly_count are not overwritten
monthly_user_count = df.groupby(month(col('timestamp')).alias('month'),'user_id').count()
monthly_user_count.show()


+-----+-------+-----+
|month|user_id|count|
+-----+-------+-----+
|    4|  22868|   22|
|    4|  00698|    5|
|    4|  62234|   11|
|    4|  66804|   20|
|    4|  34467|   60|
|    4|  85144|   27|
|    4|  06110|   11|
|    4|  63444|    5|
|    4|  55826|    1|
|    4|  61000|  154|
|    4|  14508|   59|
|    4|  45868|    4|
|    4|  09569|   11|
|    4|  65064|   88|
|    4|  68322|    4|
|    4|  35605|    4|
|    4|  00029|    3|
|    4|  82228|    1|
|    4|  27517|  175|
|    4|  26082|    1|
+-----+-------+-----+
only showing top 20 rows



                                                                                

In [61]:
yearly_count = df.groupby(year(col('timestamp')).alias('year')).count().orderBy('year')
yearly_count.show()


+----+-------+
|year|  count|
+----+-------+
|2018|3869148|
|2019|4512408|
+----+-------+




In [64]:
# Use a separate name so the per-year totals in yearly_count are not overwritten
yearly_user_count = df.groupby(year(col('timestamp')).alias('year'),'user_id').count().orderBy('year')
yearly_user_count.show()



+----+-------+-----+
|year|user_id|count|
+----+-------+-----+
|2018|  23844|  100|
|2018|  08096|   63|
|2018|  85257|   51|
|2018|  99459|    6|
|2018|  58098|   88|
|2018|  83151|  337|
|2018|  32434|   88|
|2018|  26482|  169|
|2018|  69357|   45|
|2018|  20127|   82|
|2018|  97283|   89|
|2018|  58418|   30|
|2018|  98878|   31|
|2018|  06736|   54|
|2018|  70103|   17|
|2018|  64297|  150|
|2018|  80246|   47|
|2018|  61307|   74|
|2018|  48690|  137|
|2018|  73963|  300|
+----+-------+-----+
only showing top 20 rows
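
Task 1 also asks for the averages to be accessible for querying. In a Spark job, the usual options would be to register the result with `createOrReplaceTempView` and query it through `spark.sql`, or to write it out as a table or Parquet files. As a stand-in that runs without Spark, the sketch below loads a few per-user averages into an in-memory SQLite table and queries them. The table name, the column names and the average values are all hypothetical.

```python
import sqlite3

# Hypothetical per-user daily averages, invented for illustration.
daily_averages = [("05408", 2.0), ("58940", 1.0), ("62460", 52.0)]

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE daily_user_avg (user_id TEXT PRIMARY KEY, avg_count REAL)"
)
connection.executemany("INSERT INTO daily_user_avg VALUES (?, ?)", daily_averages)
connection.commit()


def average_for(user_id):
    """Look up one user's average, or None if the user is unknown."""
    row = connection.execute(
        "SELECT avg_count FROM daily_user_avg WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


# Query 1: point lookup for a single user.
print(average_for("05408"))

# Query 2: the user with the highest average.
top_user = connection.execute(
    "SELECT user_id FROM daily_user_avg ORDER BY avg_count DESC LIMIT 1"
).fetchone()[0]
print(top_user)
```

User IDs are stored as text so that leading zeros, as in `05408`, are preserved.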



                                                                                