##Big Data Analysis  
The project investigates the relationship between sleep and stress, as well as the everyday effect of stress on a person.  
For this purpose I will use two data sets, which can be found at the following links:

https://www.kaggle.com/datasets/laavanya/human-stress-detection-in-and-through-sleep  
https://www.kaggle.com/datasets/laavanya/stress-level-detection?select=Stress-Lysis.csv


The rest of the project follows the given specification, accompanied by additional explanations.

In [0]:
%fs ls FileStore/tables/stress.csv

path,name,size,modificationTime
dbfs:/FileStore/tables/stress.csv/Stress_Lysis.csv,Stress_Lysis.csv,36495,1676071444000


In [0]:
%fs ls FileStore/tables/sleep_stress.csv

path,name,size,modificationTime
dbfs:/FileStore/tables/sleep_stress.csv/SaYoPillow.csv,SaYoPillow.csv,32492,1676071770000


Below, the data sets are loaded, the tables are displayed, and the schemas are defined.

In [0]:
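#from pyspark.sql import Row, Column
from pyspark.sql import SparkSession
# Explicit imports instead of wildcards, so Python built-ins such as max and sum are not shadowed
from pyspark.sql.functions import col, mean
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType

# First pass: let Spark infer the column types so the raw data can be inspected
df_stress = spark.read.csv("/FileStore/tables/stress.csv", header=True, inferSchema=True)
df_sleep_stress = spark.read.csv("/FileStore/tables/sleep_stress.csv", header=True, inferSchema=True)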


In [0]:
display(df_stress)


Humidity,Temperature,Step_count,Stress_Level
21.33,90.33,123,1
21.41,90.41,93,1
27.12,96.12,196,2
27.64,96.64,177,2
10.87,79.87,87,0
11.31,80.31,40,0
18.16,87.16,88,1
28.2,97.2,162,2
14.25,83.25,61,0
26.13,95.13,168,2


In [0]:
display(df_sleep_stress)

sr0,rr,t,lm,bo,rem,sr6,hr,sl
93.8,25.68,91.84,16.6,89.84,99.6,1.84,74.2,3
91.64,25.104,91.552,15.88,89.552,98.88,1.552,72.76,3
60.0,20.0,96.0,10.0,95.0,85.0,7.0,60.0,1
85.76,23.536,90.768,13.92,88.768,96.92,0.768,68.84,3
48.12,17.248,97.872,6.496,96.248,72.48,8.248,53.12,0
56.88,19.376,95.376,9.376,94.064,83.44,6.376,58.44,1
47.0,16.8,97.2,5.6,95.8,68.0,7.8,52.0,0
50.0,18.0,99.0,8.0,97.0,80.0,9.0,55.0,0
45.28,16.112,96.168,4.224,95.112,61.12,7.112,50.28,0
55.52,19.104,95.104,9.104,93.656,82.76,6.104,57.76,1


####Explanation of abbreviations:

sr0 - snoring range of the user  
rr - respiration rate  
t - body temperature  
lm - limb movement rate   
bo - blood oxygen levels   
rem - eye movement  
sr6 - number of hours of sleep  
hr - heart rate   
sl - Stress Levels

In [0]:
stress_schema = StructType(
    [StructField('humidity', DoubleType(), True),
     StructField('temperature', DoubleType(), True),
     StructField('step_count', IntegerType(), True),
     StructField('stress_level', IntegerType(), True)])

sleep_stress_schema = StructType(
    [StructField('sr0', DoubleType(), True),
     StructField('rr', DoubleType(), True),
     StructField('t', DoubleType(), True),
     StructField('lm', DoubleType(), True),
     StructField('bo', DoubleType(), True),
     StructField('rem', DoubleType(), True),
     StructField('sr6', DoubleType(), True),
     StructField('hr', DoubleType(), True),
     StructField('sl', IntegerType(), True)])


In [0]:
stress_df = spark.read.csv("/FileStore/tables/stress.csv", header=True, schema=stress_schema)
sleep_stress_df = spark.read.csv("/FileStore/tables/sleep_stress.csv", header=True, schema=sleep_stress_schema)
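
With an explicit schema and `header=True`, the header line of the file is skipped and the column names and types come from the schema, which is why the columns end up as `humidity`, `temperature` and so on although the file header says `Humidity`, `Temperature`. The following is a minimal plain-Python sketch of that idea, not Spark itself; it assumes fields are matched to the schema by position, and `STRESS_TYPES` and `read_typed` are hypothetical names used only for this illustration.

```python
import csv
import io

# Column name -> Python type, mirroring stress_schema
# (DoubleType -> float, IntegerType -> int)
STRESS_TYPES = {
    "humidity": float,
    "temperature": float,
    "step_count": int,
    "stress_level": int,
}

# First rows of Stress_Lysis.csv, as shown by display(df_stress)
raw = """Humidity,Temperature,Step_count,Stress_Level
21.33,90.33,123,1
21.41,90.41,93,1
27.12,96.12,196,2
"""


def read_typed(text, types):
    """Parse CSV text, naming and casting each field by its position in `types`."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header; names come from `types`, not from the file
    names = list(types)
    return [
        {name: types[name](value) for name, value in zip(names, fields)}
        for fields in reader
    ]


rows = read_typed(raw, STRESS_TYPES)
print(rows[0])
```

The sketch raises an error on a value that cannot be cast; how Spark treats such values depends on the reader's `mode` option, which is not explored here.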

Printing the schemas

In [0]:
stress_df.printSchema()

root
 |-- humidity: double (nullable = true)
 |-- temperature: double (nullable = true)
 |-- step_count: integer (nullable = true)
 |-- stress_level: integer (nullable = true)



In [0]:
sleep_stress_df.printSchema()

root
 |-- sr0: double (nullable = true)
 |-- rr: double (nullable = true)
 |-- t: double (nullable = true)
 |-- lm: double (nullable = true)
 |-- bo: double (nullable = true)
 |-- rem: double (nullable = true)
 |-- sr6: double (nullable = true)
 |-- hr: double (nullable = true)
 |-- sl: integer (nullable = true)



The data should be cached to get the best possible performance, since both DataFrames are reused in many of the steps below.

In [0]:
stress_df.cache()
sleep_stress_df.cache()

Out[8]: DataFrame[sr0: double, rr: double, t: double, lm: double, bo: double, rem: double, sr6: double, hr: double, sl: int]

####Transforming the first data set (stress)

In [0]:
stress_df.show()

+--------+-----------+----------+------------+
|humidity|temperature|step_count|stress_level|
+--------+-----------+----------+------------+
|   21.33|      90.33|       123|           1|
|   21.41|      90.41|        93|           1|
|   27.12|      96.12|       196|           2|
|   27.64|      96.64|       177|           2|
|   10.87|      79.87|        87|           0|
|   11.31|      80.31|        40|           0|
|   18.16|      87.16|        88|           1|
|    28.2|       97.2|       162|           2|
|   14.25|      83.25|        61|           0|
|   26.13|      95.13|       168|           2|
|   23.61|      92.61|       200|           2|
|   19.37|      88.37|       117|           1|
|   29.08|      98.08|       179|           2|
|   17.83|      86.83|        55|           1|
|   28.06|      97.06|       148|           2|
|   19.43|      88.43|       123|           1|
|   26.85|      95.85|       169|           2|
|   26.51|      95.51|       135|           2|
|   29.49|   

Renaming the columns

In [0]:
stress_df = stress_df.withColumnRenamed("humidity", "vlažnost")
stress_df = stress_df.withColumnRenamed("temperature", "temperatura")
stress_df = stress_df.withColumnRenamed("step_count", "broj_koraka")
stress_df = stress_df.withColumnRenamed("stress_level", "nivo_stresa")


Removing duplicates

In [0]:
stress_df = stress_df.dropDuplicates()

Filtering by the number of steps taken (keeping rows with more than 50 steps)

In [0]:
stress_df = stress_df.filter(stress_df['broj_koraka'] > 50)
#display(stress_df)

Filtering by temperature (keeping rows with a temperature above 80)

In [0]:
stress_df = stress_df.filter(stress_df['temperatura'] > 80)

In [0]:
stress_df.select('temperatura').distinct().show(truncate=False)

+-----------+
|temperatura|
+-----------+
|87.16      |
|95.69      |
|98.49      |
|98.08      |
|89.94      |
|96.12      |
|88.43      |
|95.85      |
|90.41      |
|96.64      |
|90.33      |
|83.25      |
|86.83      |
|83.43      |
|95.51      |
|92.61      |
|88.37      |
|97.2       |
|97.6       |
|95.13      |
+-----------+
only showing top 20 rows



Finding all stress levels in the data set

In [0]:
stress_df.select('nivo_stresa').distinct().show()

+-----------+
|nivo_stresa|
+-----------+
|          1|
|          2|
|          0|
+-----------+



Discarding people without stress, i.e. with stress level 0

In [0]:
stress_df = stress_df.filter(stress_df['nivo_stresa'] > 0)
stress_df.select('nivo_stresa').distinct().show()

+-----------+
|nivo_stresa|
+-----------+
|          1|
|          2|
+-----------+
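
The three filters applied so far (step count, temperature, stress level) are combined with a logical AND, since each one only narrows the result of the previous one. A minimal plain-Python sketch of the combined predicate, run on the ten sample rows shown by `display(df_stress)` earlier; the function name `keep` is hypothetical:

```python
# (humidity, temperature, step_count, stress_level) from display(df_stress)
rows = [
    (21.33, 90.33, 123, 1),
    (21.41, 90.41, 93, 1),
    (27.12, 96.12, 196, 2),
    (27.64, 96.64, 177, 2),
    (10.87, 79.87, 87, 0),
    (11.31, 80.31, 40, 0),
    (18.16, 87.16, 88, 1),
    (28.2, 97.2, 162, 2),
    (14.25, 83.25, 61, 0),
    (26.13, 95.13, 168, 2),
]


def keep(row):
    """Same three conditions as the notebook's filters, joined with AND."""
    _humidity, temperature, step_count, stress_level = row
    return step_count > 50 and temperature > 80 and stress_level > 0


kept = [row for row in rows if keep(row)]
print(len(kept))  # 7 of the 10 sample rows pass all three conditions
```

In PySpark the same conditions could presumably be written as a single `filter` call with the column expressions joined by `&`, each wrapped in parentheses.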



Final view of the (stress) data set

In [0]:
display(stress_df)

vlažnost,temperatura,broj_koraka,nivo_stresa
15.97,84.97,51,1
24.64,93.64,161,2
26.57,95.57,141,2
25.71,94.71,197,2
21.57,90.57,109,1
26.54,95.54,138,2
26.06,95.06,161,2
27.93,96.93,135,2
19.75,88.75,117,1
17.92,86.92,64,1


####Transforming the second data set (sleep_stress)

In [0]:
sleep_stress_df.show()

+------+------+------+------+------+------+-----+-----+---+
|   sr0|    rr|     t|    lm|    bo|   rem|  sr6|   hr| sl|
+------+------+------+------+------+------+-----+-----+---+
|  93.8| 25.68| 91.84|  16.6| 89.84|  99.6| 1.84| 74.2|  3|
| 91.64|25.104|91.552| 15.88|89.552| 98.88|1.552|72.76|  3|
|  60.0|  20.0|  96.0|  10.0|  95.0|  85.0|  7.0| 60.0|  1|
| 85.76|23.536|90.768| 13.92|88.768| 96.92|0.768|68.84|  3|
| 48.12|17.248|97.872| 6.496|96.248| 72.48|8.248|53.12|  0|
| 56.88|19.376|95.376| 9.376|94.064| 83.44|6.376|58.44|  1|
|  47.0|  16.8|  97.2|   5.6|  95.8|  68.0|  7.8| 52.0|  0|
|  50.0|  18.0|  99.0|   8.0|  97.0|  80.0|  9.0| 55.0|  0|
| 45.28|16.112|96.168| 4.224|95.112| 61.12|7.112|50.28|  0|
| 55.52|19.104|95.104| 9.104|93.656| 82.76|6.104|57.76|  1|
| 73.44|21.344|93.344|11.344|91.344| 91.72|4.016|63.36|  2|
| 59.28|19.856|95.856| 9.856|94.784| 84.64|6.856|59.64|  1|
|  48.6| 17.44| 98.16|  6.88| 96.44|  74.4| 8.44| 53.6|  0|
|96.288|26.288| 85.36|17.144|82.432|100.

I will drop the columns lm and rem, since I consider them unnecessary for the further analysis.

In [0]:
sleep_stress_df = sleep_stress_df.drop('lm')
sleep_stress_df = sleep_stress_df.drop('rem')

Renaming the columns

In [0]:
sleep_stress_df = sleep_stress_df.withColumnRenamed("sr0", "hrkanje")
sleep_stress_df = sleep_stress_df.withColumnRenamed("rr", "brzina_disanja")
sleep_stress_df = sleep_stress_df.withColumnRenamed("t", "temperatura_tela")
sleep_stress_df = sleep_stress_df.withColumnRenamed("bo", "kiseonik_u_krvi")
sleep_stress_df = sleep_stress_df.withColumnRenamed("sr6", "dužina_sna")
sleep_stress_df = sleep_stress_df.withColumnRenamed("hr", "puls")
sleep_stress_df = sleep_stress_df.withColumnRenamed("sl", "stres")


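I will remove the extreme values in the respiration rate column (brzina_disanja). I assume that a rate below 15 or above 25 means the person is not a healthy sample at that moment. Both comparisons in the code are strict, so rows with exactly 15 or 25 are removed as well.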

In [0]:
sleep_stress_df = sleep_stress_df.filter(sleep_stress_df['brzina_disanja'] > 15)
sleep_stress_df = sleep_stress_df.filter(sleep_stress_df['brzina_disanja'] < 25)
display(sleep_stress_df)

hrkanje,brzina_disanja,temperatura_tela,kiseonik_u_krvi,dužina_sna,puls,stres
60.0,20.0,96.0,95.0,7.0,60.0,1
85.76,23.536,90.768,88.768,0.768,68.84,3
48.12,17.248,97.872,96.248,8.248,53.12,0
56.88,19.376,95.376,94.064,6.376,58.44,1
47.0,16.8,97.2,95.8,7.8,52.0,0
50.0,18.0,99.0,97.0,9.0,55.0,0
45.28,16.112,96.168,95.112,7.112,50.28,0
55.52,19.104,95.104,93.656,6.104,57.76,1
73.44,21.344,93.344,91.344,4.016,63.36,2
59.28,19.856,95.856,94.784,6.856,59.64,1
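
The two respiration rate filters amount to keeping values in the open interval (15, 25). A minimal plain-Python sketch of that check, using the `rr` values from the first rows of `display(df_sleep_stress)`; the two boundary values 15.0 and 25.0 are hypothetical and added only to show that the bounds themselves are excluded:

```python
# rr (brzina_disanja) values from the first ten rows of display(df_sleep_stress)
rates = [25.68, 25.104, 20.0, 23.536, 17.248, 19.376, 16.8, 18.0, 16.112, 19.104]

# Hypothetical boundary values, not taken from the data set
rates_with_bounds = rates + [15.0, 25.0]


def in_range(rate, low=15, high=25):
    """Strict comparison on both sides, matching the two filter calls."""
    return low < rate < high


kept = [rate for rate in rates_with_bounds if in_range(rate)]
print(kept)
```

The eight values that remain are the respiration rates of the first eight rows in the output above. Spark's `Column.between` is inclusive at both ends, so it would not be an exact substitute for the two strict filters.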


In [0]:
#sleep_stress_df.withColumnRenamed("kiseonik_u_krvi", "max kiseonika").select("max kiseonika").where(col("max kiseonika") > 5).show(5, False)
#sleep_stress_df.withColumnRenamed("puls", "max puls").select("max puls").where(col("max puls") > 5).show(5, False)

Next, I will show the stress levels that occur in this data set.

In [0]:
sleep_stress_df.select('stres').distinct().show()

+-----+
|stres|
+-----+
|    1|
|    3|
|    2|
|    0|
+-----+



This step removes the cases with no stress or a very low stress level, i.e. stress level 0.

In [0]:
sleep_stress_df = sleep_stress_df.filter(sleep_stress_df['stres'] > 0)
sleep_stress_df.select('stres').distinct().show()

+-----+
|stres|
+-----+
|    1|
|    3|
|    2|
+-----+



In [0]:
# Average sleep duration over the rows that remain after the filters above
sleep_stress_df.select(mean('dužina_sna')).show()


+-----------------+
|  avg(dužina_sna)|
+-----------------+
|3.661664739884393|
+-----------------+



I will discard rows whose snoring value is not above 50 dB.

In [0]:
sleep_stress_df = sleep_stress_df.filter(sleep_stress_df['hrkanje'] > 50)
#sleep_stress_df.select('hrkanje').distinct().show()

In the next step I will just check whether any snoring value is greater than 100 dB.  
I will show this with orderBy in descending order, picking out the 15 largest samples.

In [0]:
sleep_stress_df.select(col('hrkanje').alias('max hrkanja')).orderBy(col('hrkanje').desc()).limit(15).show()

+-----------+
|max hrkanja|
+-----------+
|      91.16|
|      91.04|
|      90.92|
|       90.8|
|      90.68|
|      90.56|
|      90.44|
|      90.32|
|       90.2|
|      90.08|
|      89.96|
|      89.84|
|      89.72|
|       89.6|
|      89.48|
+-----------+
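
Since the question is only whether any value exceeds 100 dB, the maximum alone would answer it. A minimal plain-Python sketch of both the top-N selection and the threshold check, using a handful of `hrkanje` values that appear in the notebook's outputs; the variable names are hypothetical:

```python
import heapq

# A few hrkanje values taken from the notebook's printed outputs
snoring = [60.0, 85.76, 56.88, 55.52, 73.44, 59.28, 87.8,
           52.32, 52.64, 86.24, 91.16, 91.04, 90.92]

# Largest three values, in descending order (the notebook takes the top 15)
top3 = heapq.nlargest(3, snoring)
print(top3)

# The actual question: does any value exceed 100 dB?
exceeds_100 = max(snoring) > 100
print(exceeds_100)
```

For these values the largest is 91.16, so the check is false, which agrees with the table above.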



In [0]:
display(sleep_stress_df)

hrkanje,brzina_disanja,temperatura_tela,kiseonik_u_krvi,dužina_sna,puls,stres
60.0,20.0,96.0,95.0,7.0,60.0,1
85.76,23.536,90.768,88.768,0.768,68.84,3
56.88,19.376,95.376,94.064,6.376,58.44,1
55.52,19.104,95.104,93.656,6.104,57.76,1
73.44,21.344,93.344,91.344,4.016,63.36,2
59.28,19.856,95.856,94.784,6.856,59.64,1
87.8,24.08,91.04,89.04,1.04,70.2,3
52.32,18.464,94.464,92.696,5.464,56.16,1
52.64,18.528,94.528,92.792,5.528,56.32,1
86.24,23.664,90.832,88.832,0.832,69.16,3


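####Joining the data sets  
I will join the data sets on temperature. Note that the join condition used below is an inequality (`temperatura_tela != temperatura`), so each row of the sleep data is paired with every row of the stress data that has a different temperature; the result is therefore close to a cross join.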

In [0]:
#all_df = stress_df.join(sleep_stress_df, stress_df.nivo_stresa == sleep_stress_df.stres, "inner")
all_df = sleep_stress_df.join(stress_df, sleep_stress_df.temperatura_tela != stress_df.temperatura, "inner")

In [0]:
all_df = all_df.drop('temperatura')

In [0]:
display(all_df)

hrkanje,brzina_disanja,temperatura_tela,kiseonik_u_krvi,dužina_sna,puls,stres,vlažnost,broj_koraka,nivo_stresa
60.0,20.0,96.0,95.0,7.0,60.0,1,15.97,51,1
85.76,23.536,90.768,88.768,0.768,68.84,3,15.97,51,1
56.88,19.376,95.376,94.064,6.376,58.44,1,15.97,51,1
55.52,19.104,95.104,93.656,6.104,57.76,1,15.97,51,1
73.44,21.344,93.344,91.344,4.016,63.36,2,15.97,51,1
59.28,19.856,95.856,94.784,6.856,59.64,1,15.97,51,1
87.8,24.08,91.04,89.04,1.04,70.2,3,15.97,51,1
52.32,18.464,94.464,92.696,5.464,56.16,1,15.97,51,1
52.64,18.528,94.528,92.792,5.528,56.32,1,15.97,51,1
86.24,23.664,90.832,88.832,0.832,69.16,3,15.97,51,1
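
The output above pairs every sleep row with the same stress row, which is what an inner join on an inequality produces. A minimal plain-Python sketch of a nested-loop inner join, comparing `!=` with `==` on the first three temperatures from each filtered data set; the helper `join` is hypothetical and is not how Spark executes the join:

```python
# temperatura_tela from the first rows of the filtered sleep data
sleep_temps = [96.0, 90.768, 95.376]
# temperatura from the first rows of the filtered stress data
stress_temps = [84.97, 93.64, 95.57]


def join(left, right, predicate):
    """Nested-loop inner join: keep every (l, r) pair for which predicate holds."""
    return [(l, r) for l in left for r in right if predicate(l, r)]


not_equal = join(sleep_temps, stress_temps, lambda l, r: l != r)
equal = join(sleep_temps, stress_temps, lambda l, r: l == r)

print(len(not_equal))  # 9: every pair qualifies, i.e. a full cross product
print(len(equal))      # 0: no exact temperature match in this sample
```

With `!=` the row count grows as the product of the two inputs, while an exact equality on floating-point temperatures matches nothing in this sample, so neither condition links a sleep measurement to a related stress measurement.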


####Saving the data set

In [0]:
# Spark writes a directory named df.csv that holds the output files, not a single CSV file
all_data_path = 'dbfs:/stresTransformacije/df.csv'
all_df.write.csv(all_data_path, header=True, mode='overwrite')

Saving the data for Structured Streaming

In [0]:
%fs ls dbfs:/stresTransformacije/df.csv

path,name,size,modificationTime
dbfs:/stresTransformacije/df.csv/_committed_2056761049369130746,_committed_2056761049369130746,201,1676207400000
dbfs:/stresTransformacije/df.csv/_committed_2417139068891011757,_committed_2417139068891011757,202,1676250187000
dbfs:/stresTransformacije/df.csv/_committed_2636612712425275446,_committed_2636612712425275446,199,1676203372000
dbfs:/stresTransformacije/df.csv/_committed_2783157834056723206,_committed_2783157834056723206,201,1676207092000
dbfs:/stresTransformacije/df.csv/_committed_366300356189847713,_committed_366300356189847713,200,1676288340000
dbfs:/stresTransformacije/df.csv/_committed_4600391673446376028,_committed_4600391673446376028,200,1676203273000
dbfs:/stresTransformacije/df.csv/_committed_5912154206775318348,_committed_5912154206775318348,201,1676207112000
dbfs:/stresTransformacije/df.csv/_committed_6338825090602400376,_committed_6338825090602400376,200,1676207063000
dbfs:/stresTransformacije/df.csv/_committed_7052203698561823375,_committed_7052203698561823375,201,1676207618000
dbfs:/stresTransformacije/df.csv/_committed_7501370630618151542,_committed_7501370630618151542,212,1676160712000


In [0]:
all_df.printSchema()

root
 |-- hrkanje: double (nullable = true)
 |-- brzina_disanja: double (nullable = true)
 |-- temperatura_tela: double (nullable = true)
 |-- kiseonik_u_krvi: double (nullable = true)
 |-- dužina_sna: double (nullable = true)
 |-- puls: double (nullable = true)
 |-- stres: integer (nullable = true)
 |-- vlažnost: double (nullable = true)
 |-- broj_koraka: integer (nullable = true)
 |-- nivo_stresa: integer (nullable = true)



In [0]:
df_str  = 'dbfs:/streaming/df.csv'
all_df.write.csv(df_str, header=True, mode='overwrite')

#dbutils.fs.rm("dbfs:/streaming", recurse=True)