Names.csv
* Add a column holding the notebook execution time in epoch format
* Add a column with the height converted to feet
* Answer the question: what is the most popular name?
* Add a column with the actors' age
* Drop the columns (bio, death_details)
* Rename the columns: capitalize them and remove `_`
* Sort the dataframe by name in ascending order
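The question about the most popular name is not computed in the cell that follows. One way to approach it, sketched here in plain Python so that it runs without a cluster, is to treat the first whitespace-separated token of `name` as the first name and count occurrences. The sample values are names that appear in the output further down; `first_name` and `most_popular_first_name` are hypothetical helpers, and the "first token" rule is an assumption that mislabels entries such as `50 Cent`.

```python
from collections import Counter


def first_name(full_name: str) -> str:
    """Return the first whitespace-separated token (assumed to be the first name)."""
    parts = full_name.split()
    return parts[0] if parts else ""


def most_popular_first_name(names):
    """Return (first_name, count) for the most frequent first name, skipping missing values."""
    counts = Counter(first_name(n) for n in names if n and n.strip())
    return counts.most_common(1)[0]


# Sample taken from the displayed rows; None stands in for a missing name.
sample = ["A Martinez", "A. Baldwin Sloane", "A. Bhimsingh",
          "A. Hans Scheirl", "A. Jonathan Benny", "50 Cent", None]
print(most_popular_first_name(sample))
```

In PySpark the same idea would be a `groupBy` on the extracted token followed by `count` and a descending `orderBy`, the pattern the notebook already uses for production companies.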

In [0]:
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import DataFrame


filePath = "dbfs:/FileStore/tables/Files/names.csv"
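namesDf = (spark.read.format("csv")
             .option("header", "true")
             .option("inferSchema", "true")
             .load(filePath))
# explain() prints the physical plan itself and returns None,
# so wrapping it in print() also emits the "None" that follows the plan in the output.
print(namesDf.explain())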


def clean_and_convert_date(df: DataFrame, date_column: str) -> DataFrame:
    """Normalize the text in `date_column` and parse it into a date.

    Recognized layouts: dd.MM.yyyy, yyyy.MM.dd and a bare yyyy; anything else becomes null.
    """
    # Step 1 keeps only digits and dots, so hyphens are removed here as well.
    # Step 2 therefore never matches: a value such as "1895-02-01" becomes "18950201",
    # fits none of the patterns below and ends up as null.
    df = (df.withColumn(date_column, F.regexp_replace(date_column, r"[^0-9.]", ""))
            .withColumn(date_column, F.regexp_replace(date_column, r"[-]", "."))
            .withColumn(date_column, 
                        F.when(F.col(date_column).rlike(r"^\d{2}\.\d{2}\.\d{4}$"), F.to_date(F.col(date_column), "dd.MM.yyyy"))
                         .when(F.col(date_column).rlike(r"^\d{4}\.\d{2}\.\d{2}$"), F.to_date(F.col(date_column), "yyyy.MM.dd"))
                         .when(F.col(date_column).rlike(r"^\d{4}$"), F.to_date(F.col(date_column), "yyyy"))
                         .otherwise(None)
            ))
    return df

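# current_timestamp() yields a timestamp value, not an integer epoch.
# 0.0328 is a rounded centimetre-to-feet factor (1 ft = 30.48 cm).
names_df = (namesDf.withColumn("CurrentEpoch", F.current_timestamp())
                    .withColumn("HeightFeet", F.col("height") * 0.0328)
                    .drop("bio", "death_details"))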

for date_col in ["date_of_birth", "date_of_death"]:
    names_df = clean_and_convert_date(names_df, date_col)

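# Age in whole years, approximated as days / 365 (leap days are ignored).
# For people with a date of death, the age at death is used.
names_df = names_df.withColumn(
    "Age",
    F.floor(
        F.datediff(F.coalesce(F.col("date_of_death"), F.current_date()),
                   F.col("date_of_birth")) / 365
    ),
)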

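# snake_case -> PascalCase. str.title() also lowercases the remaining letters of each word,
# so "CurrentEpoch" and "HeightFeet" come out as "Currentepoch" and "Heightfeet".
formatted_columns = [col.replace("_", " ").title().replace(" ", "") for col in names_df.columns]
names_df = names_df.toDF(*formatted_columns)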

names_df = names_df.orderBy(F.col("Name"))

display(names_df.limit(20))

== Physical Plan ==
FileScan csv [imdb_name_id#2753,name#2754,birth_name#2755,height#2756,bio#2757,birth_details#2758,date_of_birth#2759,place_of_birth#2760,death_details#2761,date_of_death#2762,place_of_death#2763,reason_of_death#2764,spouses_string#2765,spouses#2766,divorces#2767,spouses_with_children#2768,children#2769] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[dbfs:/FileStore/tables/Files/names.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<imdb_name_id:string,name:string,birth_name:string,height:int,bio:string,birth_details:stri...


None


ImdbNameId,Name,BirthName,Height,BirthDetails,DateOfBirth,PlaceOfBirth,DateOfDeath,PlaceOfDeath,ReasonOfDeath,SpousesString,Spouses,Divorces,SpousesWithChildren,Children,Currentepoch,Heightfeet,Age
nm1001478,'Big' LeRoy Mobley,LeRoy King Mobley III,193.0,"April 1, 1973 in Atlantic City, New Jersey, USA",1973-04-01,"Atlantic City, New Jersey, USA",,,,,0,0,0,0,2025-04-02T09:12:05.558+0000,6.330400000000001,52.0
nm0521811,'Ducky' Louie,Lawrence Louie,,"July 22, 1931 in Berkeley, California, USA",1931-07-22,"Berkeley, California, USA",,,,,0,0,0,0,2025-04-02T09:12:05.558+0000,,93.0
nm0722372,'Little Billy' Rhodes,William H. Rhodes,,"February 1, 1895 in Illinois, USA",,"Illinois, USA",1967-07-24,"Hollywood, California, USA",stroke,,0,0,0,0,2025-04-02T09:12:05.558+0000,,
nm0946148,'Weird Al' Yankovic,Alfred Matthew Yankovic,183.0,"October 23, 1959 in Downey, California, USA",1959-10-23,"Downey, California, USA",,,,Suzanne Krajewski (10 February 2001 - present) (1 child),1,0,1,1,2025-04-02T09:12:05.558+0000,6.002400000000001,65.0
nm1265067,50 Cent,Curtis James Jackson III,183.0,"July 6, 1975 in Queens, New York City, New York, USA",1975-07-06,"Queens, New York City, New York, USA",,,,,0,0,0,0,2025-04-02T09:12:05.558+0000,6.002400000000001,49.0
nm0553436,A Martinez,Adolph Larrue Martinez III,175.0,"September 27, 1948 in Glendale, California, USA",1948-09-27,"Glendale, California, USA",,,,Leslie Bryans (17 July 1982 - present) (3 children)Mare Winningham (1981 - 29 January 1982) (divorced),2,1,1,3,2025-04-02T09:12:05.558+0000,5.74,76.0
nm1100197,A. Baldwin Sloane,A. Baldwin Sloane,,"August 28, 1872 in Baltimore, Maryland, USA",,"Baltimore, Maryland, USA",1925-02-21,"Red Bank, New Jersey, USA",,,0,0,0,0,2025-04-02T09:12:05.558+0000,,
nm0080406,A. Bhimsingh,A. Bhimsingh,,"July 15, 1924 in Tirupati, Andhra Pradesh, India",1924-07-15,"Tirupati, Andhra Pradesh, India",1978-01-16,"Madras, Tamil Nadu, India",,Sukumari (? - 16 January 1978) (his death) (1 child),1,0,1,1,2025-04-02T09:12:05.558+0000,,53.0
nm0770661,A. Hans Scheirl,Angela Hans Schierl,,"1956 in Salzburg, Austria",1956-01-01,"Salzburg, Austria",,,,,0,0,0,0,2025-04-02T09:12:05.558+0000,,69.0
nm0072200,A. Jonathan Benny,A. Jonathan Benny,,"November 4, 1970",1970-11-04,,,,,,0,0,0,0,2025-04-02T09:12:05.558+0000,,54.0
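
Two details of the output above can be reproduced without Spark. The header contains `Currentepoch` and `Heightfeet` because `str.title()` lowercases every letter after the first one in each word. The rows whose birth details mention 1895 or 1872 have an empty `DateOfBirth`, which is what `clean_and_convert_date` produces for a hyphenated value; that the raw values were hyphenated is an assumption, since the CSV itself is not shown. The sketch uses Python's `re` in place of `F.regexp_replace`, and `to_pascal_case` and `normalize_date_text` are hypothetical names for one possible fix, not the notebook's code.

```python
import re


def to_pascal_case(column: str) -> str:
    """snake_case -> PascalCase, keeping capitals that are already present."""
    return "".join(part[:1].upper() + part[1:] for part in column.split("_") if part)


def normalize_date_text(raw: str) -> str:
    """Turn hyphens into dots first, then drop everything that is not a digit or a dot."""
    return re.sub(r"[^0-9.]", "", raw.replace("-", "."))


# The notebook's renaming expression, for comparison.
print("CurrentEpoch".title())           # Currentepoch
print(to_pascal_case("CurrentEpoch"))   # CurrentEpoch
print(to_pascal_case("date_of_birth"))  # DateOfBirth

# The notebook's order of operations strips the hyphens before they can become dots.
print(re.sub(r"[^0-9.]", "", "1895-02-01"))  # 18950201
print(normalize_date_text("1895-02-01"))     # 1895.02.01
```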


Movies.csv
* Add a column holding the notebook execution time in epoch format
* Add a column that computes how many years have passed since the movie was published
* Add a column that shows the movie budget as a numeric value (the currency symbols have to be removed)
* Drop the rows of the dataframe that contain null values
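
For the budget task, the order of operations matters: the currency marker has to be stripped before the value is converted to a number, because converting the raw text first leaves nothing to clean. A minimal plain-Python sketch of that order follows. `parse_budget` is a hypothetical helper, the sample strings are made up for illustration, and discarding the currency marker means that amounts in different currencies are no longer comparable.

```python
import re
from typing import Optional


def parse_budget(raw: Optional[str]) -> Optional[int]:
    """Strip every non-digit character, then convert; None when no digits remain."""
    if raw is None:
        return None
    digits = re.sub(r"[^0-9]", "", raw)
    return int(digits) if digits else None


# Made-up sample values: two currencies, an empty string and a missing value.
for value in ["$ 2000000", "ITL 3000000", "", None]:
    print(repr(value), "->", parse_budget(value))
```

The PySpark counterpart would apply `F.regexp_replace` to the original string column and cast the result afterwards; casting to `long` rather than `int` leaves room for amounts above 2,147,483,647.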

In [0]:
filePath = "dbfs:/FileStore/tables/Files/movies.csv"
moviesDf = (spark.read.format("csv")
              .option("header","true")
              .option("inferSchema","true")
              .load(filePath))

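# dropna() without arguments removes every row that has a null in any column.
# Budget: the raw text carries a currency marker, so casting it to int first yields null
# and the regexp_replace that follows has nothing left to clean (Budget is empty in the output).
movies_df = (moviesDf.withColumn("CurrentEpoch", F.current_timestamp())
                     .dropna()
                     .withColumn("Budget", F.col("budget").cast("int").alias("Budget"))
                     .withColumn("Budget", F.regexp_replace(F.col("Budget"), r"[^0-9]", "").cast("int"))
)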

movies_df = clean_and_convert_date(movies_df, "date_published")
movies_df = movies_df.withColumn("YearsSincePublished", F.floor(F.datediff(F.current_date(), F.col("date_published")) / 365))

display(movies_df.limit(20))

top_companies = (movies_df.withColumn("ProductionCompany", F.col("production_company"))
                          .groupby("ProductionCompany")
                          .count()
                          .orderBy(F.col("count").desc()))

top_companies.show(3)

imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,Budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,CurrentEpoch,YearsSincePublished
tt0017136,Metropolis,Metropolis,1927,1928-10-01,"Drama, Sci-Fi",153,Germany,German,Fritz Lang,"Thea von Harbou, Thea von Harbou",Universum Film (UFA),"Alfred Abel, Gustav Fröhlich, Rudolf Klein-Rogge, Fritz Rasp, Theodor Loos, Erwin Biswanger, Heinrich George, Brigitte Helm","In a futuristic city sharply divided between the working class and the city planners, the son of the city's mastermind falls in love with a working class prophet who predicts the coming of a savior to mediate their differences.",08.mar,156076,,$ 1236166,$ 1349711,98.0,495.0,208.0,2025-04-02T09:05:53.430+0000,96
tt0021749,Luci della città,City Lights,1931,1931-04-02,"Comedy, Drama, Romance",87,USA,English,Charles Chaplin,Charles Chaplin,Charles Chaplin Productions,"Virginia Cherrill, Florence Lee, Harry Myers, Al Ernest Garcia, Hank Mann, Charles Chaplin","With the aid of a wealthy erratic tippler, a dewy-eyed tramp who has fallen in love with a sightless flower girl accumulates money to be able to help her medically.",08.maj,162668,,$ 19181,$ 46008,99.0,295.0,122.0,2025-04-02T09:05:53.430+0000,94
tt0027977,Tempi moderni,Modern Times,1936,1937-03-12,"Comedy, Drama, Family",87,USA,English,Charles Chaplin,Charles Chaplin,Charles Chaplin Productions,"Charles Chaplin, Paulette Goddard, Henry Bergman, Tiny Sandford, Chester Conklin, Hank Mann, Stanley Blystone, Al Ernest Garcia, Richard Alexander, Cecil Reynolds, Mira McKinney, Murdock MacQuarrie, Wilfred Lucas, Edward LeSaint, Fred Malatesta",The Tramp struggles to live in modern industrial society with the help of a young homeless woman.,08.maj,211250,,$ 163577,$ 457688,96.0,307.0,115.0,2025-04-02T09:05:53.430+0000,88
tt0029453,Il bandito della Casbah,Pépé le Moko,1937,1937-10-22,"Crime, Drama, Romance",94,France,"French, Arabic",Julien Duvivier,"Henri La Barthe, Henri La Barthe",Paris Film,"Jean Gabin, Gabriel Gabrio, Saturnin Fabre, Fernand Charpin, Lucas Gridoux, Gilbert Gil, Marcel Dalio, Charles Granval, Gaston Modot, René Bergeron, Paul Escoffier, Roger Legris, Jean Témerson, Robert Ozanne, Philippe Richard","A wanted gangster is both king and prisoner of the Casbah. He is protected from arrest by his friends, but is torn by his desire for freedom outside. A visiting Parisian beauty may just tempt his fate.",07.lip,6180,,$ 155895,$ 155895,98.0,46.0,55.0,2025-04-02T09:05:53.430+0000,87
tt0029583,Biancaneve e i sette nani,Snow White and the Seven Dwarfs,1937,1938-11-30,"Animation, Family, Fantasy",83,USA,English,"William Cottrell, David Hand","Jacob Grimm, Wilhelm Grimm",Walt Disney Productions,"Roy Atwell, Stuart Buchanan, Adriana Caselotti, Eddie Collins, Pinto Colvig, Marion Darlington, Billy Gilbert, Otis Harlan, Lucille La Verne, James MacDonald, Scotty Mattraw, Moroni Olsen, Purv Pullen, Harry Stockwell, Bill Thompson","Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household.",07.cze,177157,,$ 184925486,$ 184925486,95.0,260.0,173.0,2025-04-02T09:05:53.430+0000,86
tt0031381,Via col vento,Gone with the Wind,1939,1949-03-12,"Drama, History, Romance",238,USA,English,"Victor Fleming, George Cukor","Margaret Mitchell, Sidney Howard",Selznick International Pictures,"Thomas Mitchell, Barbara O'Neil, Vivien Leigh, Evelyn Keyes, Ann Rutherford, George Reeves, Fred Crane, Hattie McDaniel, Oscar Polk, Butterfly McQueen, Victor Jory, Everett Brown, Howard Hickman, Alicia Rhett, Leslie Howard",A manipulative woman and a roguish man conduct a turbulent romance during the American Civil War and Reconstruction periods.,08.sty,283975,,$ 200852579,$ 402352579,97.0,881.0,197.0,2025-04-02T09:05:53.430+0000,76
tt0031679,Mr. Smith va a Washington,Mr. Smith Goes to Washington,1939,1947-04-05,"Comedy, Drama",129,USA,English,Frank Capra,"Sidney Buchman, Lewis R. Foster",Columbia Pictures,"Jean Arthur, James Stewart, Claude Rains, Edward Arnold, Guy Kibbee, Thomas Mitchell, Eugene Pallette, Beulah Bondi, H.B. Warner, Harry Carey, Astrid Allwyn, Ruth Donnelly, Grant Mitchell, Porter Hall, H.V. Kaltenborn","A naive man is appointed to fill a vacancy in the United States Senate. His plans promptly collide with political corruption, but he doesn't back down.",08.sty,104547,,$ 144738,$ 144738,73.0,296.0,88.0,2025-04-02T09:05:53.430+0000,78
tt0032138,Il mago di Oz,The Wizard of Oz,1939,1949-04-19,"Adventure, Family, Fantasy",102,USA,English,"Victor Fleming, George Cukor","Noel Langley, Florence Ryerson",Metro-Goldwyn-Mayer (MGM),"Judy Garland, Frank Morgan, Ray Bolger, Bert Lahr, Jack Haley, Billie Burke, Margaret Hamilton, Charley Grapewin, Pat Walshe, Clara Blandick, Terry, The Singer Midgets",Dorothy Gale is swept away from a farm in Kansas to a magical land of Oz in a tornado and embarks on a quest with her new friends to see the Wizard who can help her return home to Kansas and help her friends as well.,8.0,366293,,$ 24790250,$ 26142032,100.0,688.0,168.0,2025-04-02T09:05:53.430+0000,76
tt0032455,Fantasia,Fantasia,1940,1946-09-19,"Animation, Family, Fantasy",125,USA,English,"James Algar, Samuel Armstrong","Joe Grant, Dick Huemer",Walt Disney Productions,"Deems Taylor, Leopold Stokowski, The Philadelphia Orchestra",A collection of animated interpretations of great works of Western classical music.,07.sie,86795,,$ 76408097,$ 76411401,96.0,342.0,119.0,2025-04-02T09:05:53.430+0000,78
tt0032910,Pinocchio,Pinocchio,1940,1947-11-27,"Animation, Comedy, Family",88,USA,English,"Norman Ferguson, T. Hee","Carlo Collodi, Ted Sears",Walt Disney Animation Studios,"Mel Blanc, Don Brodie, Stuart Buchanan, Walter Catlett, Marion Darlington, Frankie Darro, Cliff Edwards, Dickie Jones, Charles Judels, John McLeish, Clarence Nash, Patricia Page, Christian Rub, Bill Thompson, Evelyn Venable","A living puppet, with the help of a cricket as his conscience, must prove himself worthy to become a real boy.",07.kwi,127618,,$ 84254167,$ 121892045,99.0,202.0,140.0,2025-04-02T09:05:53.430+0000,77
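
The `avg_vote` column in the output above holds values such as `08.mar` and `07.lip` next to a plain `8.0`. One plausible reading, and it is only an assumption, is that a spreadsheet with a Polish locale reinterpreted decimals like `8.3` as a day and a month (`mar` being the abbreviation of marzec, March). Under that assumption the rating can be recovered as sketched below; `recover_rating` and the month table are hypothetical and not part of the notebook.

```python
from typing import Optional

# Polish month abbreviations for months 1-9, assumed to encode the decimal digit.
POLISH_MONTHS = {"sty": 1, "lut": 2, "mar": 3, "kwi": 4, "maj": 5,
                 "cze": 6, "lip": 7, "sie": 8, "wrz": 9}


def recover_rating(raw: str) -> Optional[float]:
    """Map '08.mar' back to 8.3, pass plain decimals through, return None if unreadable."""
    whole, _, frac = raw.strip().partition(".")
    try:
        if frac.lower() in POLISH_MONTHS:
            return float(f"{int(whole)}.{POLISH_MONTHS[frac.lower()]}")
        return float(raw)
    except ValueError:
        return None


for value in ["08.mar", "07.lip", "8.0", "n/a"]:
    print(value, "->", recover_rating(value))
```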


+------------------+-----+
| ProductionCompany|count|
+------------------+-----+
|Universal Pictures|  307|
| Columbia Pictures|  296|
|      Warner Bros.|  296|
+------------------+-----+
only showing top 3 rows



ratings.csv
* Add a column holding the notebook execution time in epoch format
* Leave `nulls` out of every calculation below
* Who gives higher ratings across the whole dataset: boys or girls?
* Change the data type of one of the columns to `long`
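
All three task lists ask for the execution time in epoch format, that is, the number of seconds elapsed since 1970-01-01 00:00:00 UTC, whereas the `current_epoch` column in the output holds formatted timestamps such as `2025-04-02T08:49:13.664+0000`. The plain-Python sketch below shows the conversion for that value; `to_epoch_seconds` is a hypothetical helper. In PySpark, `F.unix_timestamp()` called without arguments returns the current time as epoch seconds.

```python
from datetime import datetime, timezone


def to_epoch_seconds(text: str) -> int:
    """Parse a timestamp with a numeric UTC offset into whole epoch seconds."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")
    return int(parsed.timestamp())


print(to_epoch_seconds("2025-04-02T08:49:13.664+0000"))  # 1743583753

# Epoch seconds for "now", the plain-Python counterpart of a notebook run time.
print(int(datetime.now(timezone.utc).timestamp()))
```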

In [0]:


filePath = "dbfs:/FileStore/tables/Files/ratings.csv"
ratingsDf = (spark.read.format("csv")
              .option("header","true")
              .option("inferSchema","true")
              .load(filePath))


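# dropna() removes rows with a null in any column; this is stricter than the task needs,
# because F.avg already skips nulls in the column it aggregates.
ratingsDf_copy = (ratingsDf.withColumn("current_epoch", F.current_timestamp())
                       .dropna()
                       .withColumn("total_votes", F.col("total_votes").cast("long")))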

# ratingsDf_copy.printSchema()
display(ratingsDf_copy.limit(20))

# Unweighted mean of the per-title averages for each gender.
rates_avg = ratingsDf_copy.agg(
    F.avg("males_allages_avg_vote").alias("average_male_votes"),
    F.avg("females_allages_avg_vote").alias("average_female_votes"),
)
# rates_avg.explain()
rates_avg.show()
print("Women give higher average ratings.")


imdb_title_id,weighted_average_vote,total_votes,mean_vote,median_vote,votes_10,votes_9,votes_8,votes_7,votes_6,votes_5,votes_4,votes_3,votes_2,votes_1,allgenders_0age_avg_vote,allgenders_0age_votes,allgenders_18age_avg_vote,allgenders_18age_votes,allgenders_30age_avg_vote,allgenders_30age_votes,allgenders_45age_avg_vote,allgenders_45age_votes,males_allages_avg_vote,males_allages_votes,males_0age_avg_vote,males_0age_votes,males_18age_avg_vote,males_18age_votes,males_30age_avg_vote,males_30age_votes,males_45age_avg_vote,males_45age_votes,females_allages_avg_vote,females_allages_votes,females_0age_avg_vote,females_0age_votes,females_18age_avg_vote,females_18age_votes,females_30age_avg_vote,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes,current_epoch
tt0000009,5.9,154,5.9,6.0,12,4,10,43,28,28,9,1,5,14,7.2,4.0,6.0,38.0,5.7,50.0,6.6,35.0,6.2,97.0,7.0,1.0,5.9,24.0,5.6,36.0,6.7,31.0,6.0,35.0,7.3,3.0,5.9,14.0,5.7,13.0,4.5,4.0,5.7,34.0,6.4,51.0,6.0,70.0,2025-04-02T08:49:13.664+0000
tt0002130,7.0,2237,6.9,7.0,210,225,436,641,344,169,66,39,20,87,7.5,4.0,7.0,402.0,7.0,895.0,7.1,482.0,7.0,1607.0,8.0,2.0,7.0,346.0,7.0,804.0,7.0,396.0,7.2,215.0,7.0,2.0,7.0,52.0,7.3,82.0,7.4,77.0,6.9,139.0,7.0,488.0,7.0,1166.0,2025-04-02T08:49:13.664+0000
tt0003740,7.1,3073,6.5,7.0,285,301,591,727,443,199,85,27,18,397,6.0,3.0,7.0,393.0,7.0,1126.0,7.2,1006.0,7.1,2149.0,6.0,2.0,7.0,323.0,7.0,976.0,7.3,799.0,6.9,402.0,6.0,1.0,6.8,67.0,6.9,134.0,6.8,194.0,6.9,177.0,7.0,1035.0,7.0,1332.0,2025-04-02T08:49:13.664+0000
tt0004972,6.3,22213,6.4,7.0,3661,1741,3314,3963,2876,1928,978,701,577,2474,4.7,14.0,6.0,3183.0,6.3,8861.0,6.7,4901.0,6.4,14818.0,4.7,12.0,6.1,2623.0,6.4,7669.0,6.8,4104.0,5.7,2417.0,6.0,1.0,5.4,517.0,5.8,1106.0,5.9,725.0,6.5,355.0,6.3,7452.0,6.4,8306.0,2025-04-02T08:49:13.664+0000
tt0005680,5.9,130,5.9,6.0,4,3,8,20,52,27,8,5,2,1,5.0,3.0,5.8,19.0,5.9,44.0,5.9,50.0,5.9,102.0,5.0,2.0,5.8,16.0,5.8,40.0,6.0,43.0,5.9,12.0,5.0,1.0,5.5,2.0,6.5,4.0,6.2,5.0,5.7,50.0,5.9,26.0,5.8,87.0,2025-04-02T08:49:13.664+0000
tt0006206,7.3,4166,6.7,7.0,620,439,763,872,484,232,140,40,31,545,6.5,7.0,7.1,517.0,7.3,1661.0,7.3,1294.0,7.3,2835.0,6.8,6.0,7.1,414.0,7.3,1384.0,7.3,976.0,7.2,670.0,4.0,1.0,6.9,98.0,7.1,256.0,7.7,306.0,6.8,204.0,7.5,1350.0,7.2,1945.0,2025-04-02T08:49:13.664+0000
tt0006864,7.8,13875,7.8,8.0,3477,2230,3214,2249,1179,605,340,181,133,267,7.9,8.0,7.8,1795.0,7.7,5451.0,7.8,3667.0,7.8,9441.0,8.0,7.0,7.9,1498.0,7.7,4734.0,7.8,2970.0,7.5,1632.0,7.0,1.0,7.2,276.0,7.4,660.0,8.0,662.0,7.5,321.0,7.7,4286.0,7.8,5954.0,2025-04-02T08:49:13.664+0000
tt0009611,7.3,5895,6.7,7.0,552,564,1303,1742,724,228,69,35,22,656,5.0,4.0,7.3,833.0,7.3,2162.0,7.3,1730.0,7.3,4074.0,4.7,3.0,7.3,729.0,7.3,1888.0,7.3,1340.0,7.4,746.0,6.0,1.0,7.4,96.0,7.3,257.0,7.6,380.0,6.8,283.0,7.3,1734.0,7.3,2694.0,2025-04-02T08:49:13.664+0000
tt0010323,8.1,55601,7.9,8.0,11426,11262,15971,8883,3517,1562,725,483,348,1424,8.5,47.0,8.1,10734.0,8.1,23386.0,8.0,8271.0,8.1,35834.0,8.5,38.0,8.1,8304.0,8.0,19321.0,8.0,6821.0,8.1,7660.0,8.8,7.0,8.1,2236.0,8.2,3807.0,8.1,1341.0,7.5,559.0,8.0,13136.0,8.1,25399.0,2025-04-02T08:49:13.664+0000
tt0011130,7.0,4753,7.1,7.0,393,335,965,1666,816,326,107,54,35,56,7.0,4.0,6.9,546.0,7.0,1830.0,7.0,1428.0,7.0,3253.0,6.0,3.0,6.9,415.0,6.9,1541.0,7.0,1207.0,7.2,606.0,10.0,1.0,6.8,128.0,7.3,269.0,7.2,202.0,6.9,239.0,7.1,1733.0,6.9,1815.0,2025-04-02T08:49:13.664+0000


+------------------+--------------------+
|average_male_votes|average_female_votes|
+------------------+--------------------+
| 6.175647515893578|   6.371356251471637|
+------------------+--------------------+

Women give higher average ratings.
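
The comparison above averages the per-title averages, so a title rated by a handful of women counts as much as one rated by thousands. A vote-weighted mean is one alternative. The sketch below computes both in plain Python for three (average, votes) pairs copied from the `females_allages_*` columns of the displayed rows, plus one made-up missing pair. `plain_mean` and `weighted_mean` are hypothetical helpers, and which of the two means better answers the question is a judgment call.

```python
def plain_mean(pairs):
    """Unweighted mean of the averages, ignoring missing values."""
    values = [avg for avg, _ in pairs if avg is not None]
    return sum(values) / len(values)


def weighted_mean(pairs):
    """Mean of the averages weighted by vote count, ignoring incomplete pairs."""
    complete = [(avg, votes) for avg, votes in pairs
                if avg is not None and votes is not None]
    total_votes = sum(votes for _, votes in complete)
    return sum(avg * votes for avg, votes in complete) / total_votes


# (females_allages_avg_vote, females_allages_votes); the last pair is a made-up null row.
female = [(6.0, 35.0), (7.2, 215.0), (5.7, 2417.0), (None, None)]
print(round(plain_mean(female), 2), round(weighted_mean(female), 2))
```

In PySpark the weighted version would divide the sum of `avg * votes` by the sum of `votes` over the same two columns.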


Spark UI is an interface for monitoring and analyzing the performance of Spark applications. Its key sections are:

- **Jobs**: Lists all jobs together with their status. From here we can open the details of a specific job and check its progress and execution time.
- **Stages**: Shows the data-processing stages, which are separated by shuffle operations. It includes task execution times, Shuffle Read/Write sizes, and other performance details.
- **Storage**: Shows information about data held in the cache. If no data was cached, this section may stay empty.
- **Environment**: Summarizes the environment configuration, including the memory and CPU settings of the executors. It lets us verify that the cluster is set up as expected.
- **Executors**: Lists the executors together with their state and resource usage. When running on a single node, there is usually one executor that also acts as the driver.
- **SQL/Dataframe**: Shows detailed execution plans for SQL queries and DataFrame operations, which helps when optimizing code performance.

The remaining Spark UI sections contain additional diagnostic information.