Use each of these functions:
* `unix_timestamp()` 
* `date_format()`
* `to_unix_timestamp()`
* `from_unixtime()`
* `to_date()` 
* `to_timestamp()` 
* `from_utc_timestamp()` 
* `to_utc_timestamp()`

In [0]:
from pyspark.sql.functions import current_date, current_timestamp

kolumny = ["timestamp", "unix", "Date"]
dane = [("2015-03-22T14:13:34", 1646641525847, "May, 2021"),
        ("2015-03-22T15:03:18", 1646641557555, "Mar, 2021"),
        ("2015-03-22T14:38:39", 1646641578622, "Jan, 2021")]

dataFrame = spark.createDataFrame(dane, kolumny) \
    .withColumn("current_date", current_date()) \
    .withColumn("current_timestamp", current_timestamp())

display(dataFrame)

timestamp,unix,Date,current_date,current_timestamp
2015-03-22T14:13:34,1646641525847,"May, 2021",2025-03-17,2025-03-17T13:57:14.229+0000
2015-03-22T15:03:18,1646641557555,"Mar, 2021",2025-03-17,2025-03-17T13:57:14.229+0000
2015-03-22T14:38:39,1646641578622,"Jan, 2021",2025-03-17,2025-03-17T13:57:14.229+0000


In [0]:

dataFrame.printSchema()

root
 |-- timestamp: string (nullable = true)
 |-- unix: long (nullable = true)
 |-- Date: string (nullable = true)
 |-- current_date: date (nullable = false)
 |-- current_timestamp: timestamp (nullable = false)



## unix_timestamp(..) & cast(..)

Converting a **string** to a **timestamp**.

The functions are located in:
* `pyspark.sql.functions` in the case of Python
* `org.apache.spark.sql.functions` in the case of Scala & Java

## 1. Changing the format of timestamp values given as yyyy-MM-dd'T'HH:mm:ss
`unix_timestamp(..)`

API documentation for `unix_timestamp(..)`:
> Convert time string with given pattern (see <a href="http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html" target="_blank">SimpleDateFormat</a>) to Unix time stamp (in seconds), return null if fail.

`SimpleDateFormat` is part of the Java API and provides support for parsing and formatting date and time values.

In [0]:
from pyspark.sql.functions import to_date, from_unixtime, to_timestamp, to_utc_timestamp, from_utc_timestamp, unix_timestamp, date_format

2. Change the format to **yyyy-MM-dd HH:mm:ss**, following the `SimpleDateFormat` pattern syntax
  * a. Display the schema and the data to check whether the values changed

In [0]:

zmianaFormatu = dataFrame.withColumn("timestamp", date_format("timestamp", "yyyy-MM-dd HH:mm:ss"))
display(zmianaFormatu)
zmianaFormatu.printSchema()

timestamp,unix,Date,current_date,current_timestamp
2015-03-22 14:13:34,1646641525847,"May, 2021",2025-03-17,2025-03-17T14:21:34.705+0000
2015-03-22 15:03:18,1646641557555,"Mar, 2021",2025-03-17,2025-03-17T14:21:34.705+0000
2015-03-22 14:38:39,1646641578622,"Jan, 2021",2025-03-17,2025-03-17T14:21:34.705+0000


root
 |-- timestamp: string (nullable = true)
 |-- unix: long (nullable = true)
 |-- Date: string (nullable = true)
 |-- current_date: date (nullable = false)
 |-- current_timestamp: timestamp (nullable = false)



In [0]:
#unix_timestamp
tempE = dataFrame.withColumn("timestamp", unix_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ss"))
display(tempE)

timestamp,unix,Date,current_date,current_timestamp
1427033614,1646641525847,"May, 2021",2025-03-17,2025-03-17T14:21:41.209+0000
1427036598,1646641557555,"Mar, 2021",2025-03-17,2025-03-17T14:21:41.209+0000
1427035119,1646641578622,"Jan, 2021",2025-03-17,2025-03-17T14:21:41.209+0000
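
As a cross-check on the values above, the same conversion can be reproduced with Python's standard library. This is a minimal sketch, not what Spark does internally. It assumes the session time zone is UTC (the `+0000` offsets in the outputs suggest so), and the helper name `to_unix_seconds` is hypothetical.

```python
from datetime import datetime, timezone


def to_unix_seconds(text, pattern="%Y-%m-%dT%H:%M:%S"):
    """Parse a timestamp string as UTC and return whole seconds since the epoch."""
    parsed = datetime.strptime(text, pattern).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


# First row of the "timestamp" column
print(to_unix_seconds("2015-03-22T14:13:34"))  # 1427033614
```

Note that the Python pattern syntax (`%Y-%m-%d`) differs from the Java-style pattern (`yyyy-MM-dd`) that `unix_timestamp(..)` expects.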


## Add new columns to the DataFrame with the values year(..), month(..), dayofyear(..)

In [0]:
#date_format
yearDate = dataFrame \
    .withColumn("year", date_format("timestamp", "yyyy")) \
    .withColumn("month", date_format("timestamp", "MM")) \
    .withColumn("dayofyear", date_format("timestamp", "DD"))
display(yearDate)

timestamp,unix,Date,current_date,current_timestamp,year,month,dayofyear
2015-03-22T14:13:34,1646641525847,"May, 2021",2025-03-17,2025-03-17T14:05:51.668+0000,2015,3,81
2015-03-22T15:03:18,1646641557555,"Mar, 2021",2025-03-17,2025-03-17T14:05:51.668+0000,2015,3,81
2015-03-22T14:38:39,1646641578622,"Jan, 2021",2025-03-17,2025-03-17T14:05:51.668+0000,2015,3,81
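
The three derived values can be verified outside Spark. The sketch below uses only the Python standard library and the first timestamp from the sample data. It yields integers, whereas `date_format(..)` returns strings; the `year(..)`, `month(..)` and `dayofyear(..)` functions named in the heading also exist in `pyspark.sql.functions` and return integers.

```python
from datetime import datetime

ts = datetime.strptime("2015-03-22T14:13:34", "%Y-%m-%dT%H:%M:%S")

year = ts.year
month = ts.month
day_of_year = ts.timetuple().tm_yday  # 31 (Jan) + 28 (Feb) + 22 = 81

print(year, month, day_of_year)  # 2015 3 81
```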


In [0]:
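# to_date()
# Note: the new "date" column replaces the existing "Date" column,
# because Spark resolves column names case-insensitively by default.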
toDate = dataFrame \
    .withColumn("year", date_format("timestamp", "yyyy")) \
    .withColumn("month", date_format("timestamp", "MM")) \
    .withColumn("dayofyear", date_format("timestamp", "DD")) \
    .withColumn("date", to_date("timestamp"))
display(toDate)

timestamp,unix,date,current_date,current_timestamp,year,month,dayofyear
2015-03-22T14:13:34,1646641525847,2015-03-22,2025-03-17,2025-03-17T14:07:55.165+0000,2015,3,81
2015-03-22T15:03:18,1646641557555,2015-03-22,2025-03-17,2025-03-17T14:07:55.165+0000,2015,3,81
2015-03-22T14:38:39,1646641578622,2015-03-22,2025-03-17,2025-03-17T14:07:55.165+0000,2015,3,81


In [0]:
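# from_unixtime()
# Note: from_unixtime() expects seconds, but the "unix" column holds milliseconds,
# which is why the result below lands in the year +54149.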
fromUnix = dataFrame \
    .withColumn("year", date_format("timestamp", "yyyy")) \
    .withColumn("month", date_format("timestamp", "MM")) \
    .withColumn("dayofyear", date_format("timestamp", "DD")) \
    .withColumn("from_unixtime", from_unixtime("unix", "dd-MM-yyyy HH:mm:ss"))
display(fromUnix)

timestamp,unix,Date,current_date,current_timestamp,year,month,dayofyear,from_unixtime
2015-03-22T14:13:34,1646641525847,"May, 2021",2025-03-17,2025-03-17T14:09:35.274+0000,2015,3,81,28-12-+54149 23:50:47
2015-03-22T15:03:18,1646641557555,"Mar, 2021",2025-03-17,2025-03-17T14:09:35.274+0000,2015,3,81,29-12-+54149 08:39:15
2015-03-22T14:38:39,1646641578622,"Jan, 2021",2025-03-17,2025-03-17T14:09:35.274+0000,2015,3,81,29-12-+54149 14:30:22
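
The year `+54149` in the output comes from reading a millisecond value as seconds. The sketch below shows the arithmetic with the Python standard library, assuming UTC and using the first value of the `unix` column; the variable names are illustrative only.

```python
from datetime import datetime, timezone

unix_millis = 1646641525847  # first value of the "unix" column

# Split into whole seconds and the leftover milliseconds
seconds, millis = divmod(unix_millis, 1000)

as_utc = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(as_utc.strftime("%d-%m-%Y %H:%M:%S"))  # 07-03-2022 08:25:25
```

In the notebook, the analogous fix would presumably be to divide the column by 1000 and cast it to a long before passing it to `from_unixtime(..)`; that variant was not run here.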


In [0]:
#to_timestamp()
toTimestamp = dataFrame \
    .withColumn("year", date_format("timestamp", "yyyy")) \
    .withColumn("month", date_format("timestamp", "MM")) \
    .withColumn("dayofyear", date_format("timestamp", "DD")) \
    .withColumn("timestamp", to_timestamp("timestamp"))
display(toTimestamp)


timestamp,unix,Date,current_date,current_timestamp,year,month,dayofyear
2015-03-22T14:13:34.000+0000,1646641525847,"May, 2021",2025-03-17,2025-03-17T14:10:23.551+0000,2015,3,81
2015-03-22T15:03:18.000+0000,1646641557555,"Mar, 2021",2025-03-17,2025-03-17T14:10:23.551+0000,2015,3,81
2015-03-22T14:38:39.000+0000,1646641578622,"Jan, 2021",2025-03-17,2025-03-17T14:10:23.551+0000,2015,3,81


In [0]:
#to_utc_timestamp()
toUtcTimestamp = dataFrame \
    .withColumn("year", date_format("timestamp", "yyyy")) \
    .withColumn("month", date_format("timestamp", "MM")) \
    .withColumn("dayofyear", date_format("timestamp", "DD")) \
    .withColumn("UTC", to_utc_timestamp("timestamp", "Pacific/Auckland"))
display(toUtcTimestamp)



timestamp,unix,Date,current_date,current_timestamp,year,month,dayofyear,UTC
2015-03-22T14:13:34,1646641525847,"May, 2021",2025-03-17,2025-03-17T14:32:58.183+0000,2015,3,81,2015-03-22T01:13:34.000+0000
2015-03-22T15:03:18,1646641557555,"Mar, 2021",2025-03-17,2025-03-17T14:32:58.183+0000,2015,3,81,2015-03-22T02:03:18.000+0000
2015-03-22T14:38:39,1646641578622,"Jan, 2021",2025-03-17,2025-03-17T14:32:58.183+0000,2015,3,81,2015-03-22T01:38:39.000+0000


In [0]:
#from_utc_timestamp()
fromUtcTimestamp = dataFrame \
    .withColumn("year", date_format("timestamp", "yyyy")) \
    .withColumn("month", date_format("timestamp", "MM")) \
    .withColumn("dayofyear", date_format("timestamp", "DD")) \
    .withColumn("Auckland", from_utc_timestamp("timestamp", "Pacific/Auckland"))
display(fromUtcTimestamp)

timestamp,unix,Date,current_date,current_timestamp,year,month,dayofyear,Auckland
2015-03-22T14:13:34,1646641525847,"May, 2021",2025-03-17,2025-03-17T14:11:56.480+0000,2015,3,81,2015-03-23T03:13:34.000+0000
2015-03-22T15:03:18,1646641557555,"Mar, 2021",2025-03-17,2025-03-17T14:11:56.480+0000,2015,3,81,2015-03-23T04:03:18.000+0000
2015-03-22T14:38:39,1646641578622,"Jan, 2021",2025-03-17,2025-03-17T14:11:56.480+0000,2015,3,81,2015-03-23T03:38:39.000+0000
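
The two outputs above differ only in the direction of the shift. A minimal sketch of that arithmetic follows, assuming a fixed offset of UTC+13 for `Pacific/Auckland` (daylight saving time was in effect there on 2015-03-22). The helper names are hypothetical, and a real conversion should use a time zone database rather than a hard-coded offset.

```python
from datetime import datetime, timedelta

# Assumption: Auckland was at UTC+13 (NZDT) on 2015-03-22
NZDT_OFFSET = timedelta(hours=13)


def to_utc(local_wall_clock):
    """Treat the input as Auckland wall-clock time and return the UTC wall-clock time."""
    return local_wall_clock - NZDT_OFFSET


def from_utc(utc_wall_clock):
    """Treat the input as UTC wall-clock time and return the Auckland wall-clock time."""
    return utc_wall_clock + NZDT_OFFSET


ts = datetime(2015, 3, 22, 14, 13, 34)
print(to_utc(ts))    # 2015-03-22 01:13:34
print(from_utc(ts))  # 2015-03-23 03:13:34
```

Both results match the `UTC` and `Auckland` columns for the first row.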


In [0]:
%fs ls dbfs:/databricks-datasets/airlines/part-00002


path,name,size,modificationTime
dbfs:/databricks-datasets/airlines/part-00002,part-00002,67108930,1436493185000


In [0]:
# Task 2
filePath = "dbfs:/FileStore/tables/Files/actors.csv"

actors = spark.read.format("csv") \
            .option("header", "true") \
            .load(filePath)

actors.printSchema()
type(actors)


root
 |-- imdb_title_id: string (nullable = true)
 |-- ordering: string (nullable = true)
 |-- imdb_name_id: string (nullable = true)
 |-- category: string (nullable = true)
 |-- job: string (nullable = true)
 |-- characters: string (nullable = true)

Out[1]: pyspark.sql.dataframe.DataFrame

In [0]:
# Task 3
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
filePath = "/FileStore/tables/wrong_movies.csv"
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("title", StringType(), False),
    StructField("genre", StringType(), True)
])

In [0]:
# Task 3

movies_permissive = spark.read.format("csv") \
            .option("header", "true") \
            .option("mode", "PERMISSIVE") \
            .option("delimiter", ";") \
            .schema(schema) \
            .load(filePath)
            

movies_permissive.show(10)
# PERMISSIVE is the default mode

+---+----------------+---------+
| id|           title|    genre|
+---+----------------+---------+
|  1|              HP|  fantasy|
|  2|          Hobbit|  fantasy|
|  3|            LOTR|  fantasy|
|  4|            Cars|animation|
|  5|            null|    drama|
|  6|            null|   horror|
|  7|       Spiderman|     null|
|  8|?mier? w Wenecji|  thriler|
|  9|            Rush|     null|
| 10|           Shrek|animation|
+---+----------------+---------+



In [0]:
# Task 3
movies_dropmalformed = spark.read.format("csv") \
            .option("header", "true") \
            .option("mode", "DROPMALFORMED") \
            .option("delimiter", ";") \
            .schema(schema) \
            .load(filePath)
            

movies_dropmalformed.show(10)
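# I expected this to drop the rows containing nulls, but it does not:
# DROPMALFORMED only drops rows that cannot be parsed against the schema,
# and an empty field is read as a valid null, so no rows are removed here.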

+---+----------------+---------+
| id|           title|    genre|
+---+----------------+---------+
|  1|              HP|  fantasy|
|  2|          Hobbit|  fantasy|
|  3|            LOTR|  fantasy|
|  4|            Cars|animation|
|  5|            null|    drama|
|  6|            null|   horror|
|  7|       Spiderman|     null|
|  8|?mier? w Wenecji|  thriler|
|  9|            Rush|     null|
| 10|           Shrek|animation|
+---+----------------+---------+



In [0]:
# Task 3
movies_failfast = spark.read.format("csv") \
            .option("header", "true") \
            .option("mode", "FAILFAST") \
            .option("delimiter", ";") \
            .schema(schema) \
            .load(filePath)
            

movies_failfast.show(10)
# Same result as above: no row fails to parse, so FAILFAST raises no error

+---+----------------+---------+
| id|           title|    genre|
+---+----------------+---------+
|  1|              HP|  fantasy|
|  2|          Hobbit|  fantasy|
|  3|            LOTR|  fantasy|
|  4|            Cars|animation|
|  5|            null|    drama|
|  6|            null|   horror|
|  7|       Spiderman|     null|
|  8|?mier? w Wenecji|  thriler|
|  9|            Rush|     null|
| 10|           Shrek|animation|
+---+----------------+---------+
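
All three modes return the same rows because none of the rows is malformed: a missing value is a valid null, not a parse failure. The sketch below illustrates the difference between the modes using only the Python standard library. It is a simplified model, not Spark's implementation: the sample data, `parse_row` and `read_rows` are hypothetical, and in PERMISSIVE mode it nulls the whole row, whereas Spark's handling is more detailed.

```python
import csv
import io

# Hypothetical sample: row 5 has an empty title, row "x" has a non-integer id
SAMPLE = "id;title;genre\n1;HP;fantasy\n5;;drama\nx;Broken;horror\n"


def parse_row(fields):
    """Return (id, title, genre); empty strings become None. Raise ValueError if malformed."""
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    raw_id, title, genre = (f if f != "" else None for f in fields)
    return (int(raw_id) if raw_id is not None else None, title, genre)


def read_rows(text, mode="PERMISSIVE"):
    reader = csv.reader(io.StringIO(text), delimiter=";")
    next(reader)  # skip the header line
    rows = []
    for fields in reader:
        try:
            rows.append(parse_row(fields))
        except ValueError:
            if mode == "FAILFAST":
                raise  # stop at the first malformed row
            if mode == "PERMISSIVE":
                rows.append((None, None, None))  # keep the row, with nulls
            # DROPMALFORMED: skip the malformed row silently
    return rows


print(read_rows(SAMPLE, "PERMISSIVE"))
print(read_rows(SAMPLE, "DROPMALFORMED"))
```

The row with the empty title survives in every mode; only the row whose `id` cannot be parsed as an integer is treated differently. To remove rows that contain nulls, a separate filtering step is needed after reading.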



In [0]:
# Task 4

movies_permissive.write.mode("overwrite").parquet("dbfs:/FileStore/tables/movies_parquet")
movies_permissive.write.mode("overwrite").json("dbfs:/FileStore/tables/movies_json")

parquetDf = spark.read.parquet("dbfs:/FileStore/tables/movies_parquet")
print("Parquet:")
parquetDf.show(5)

jsonDf = spark.read.json("dbfs:/FileStore/tables/movies_json")
print("json:")
jsonDf.show(5)

Parquet:
+---+------+---------+
| id| title|    genre|
+---+------+---------+
|  1|    HP|  fantasy|
|  2|Hobbit|  fantasy|
|  3|  LOTR|  fantasy|
|  4|  Cars|animation|
|  5|  null|    drama|
+---+------+---------+
only showing top 5 rows

json:
+---------+---+------+
|    genre| id| title|
+---------+---+------+
|  fantasy|  1|    HP|
|  fantasy|  2|Hobbit|
|  fantasy|  3|  LOTR|
|animation|  4|  Cars|
|    drama|  5|  null|
+---------+---+------+
only showing top 5 rows

