# Data Engineering 2: Graded Lab 02
---------------

#### Grading
For this graded lab you can get a total of 20 points. These 20 points count for 10% of your final grade for the course.

#### Note
Check each result carefully. Use data filtering, cleaning, and transformation methods wherever needed. The data can be messy and contain hidden issues.

#### Submission
You are allowed to submit the solution in groups of **two or three** students.
Submit your GradedLab02.ipynb file renamed to FirstnameStudent01LastnameStudent01_FirstnameStudent02LastnameStudent02_FirstnameStudent03LastnameStudent03.ipynb in Moodle.   
Please submit a runnable Python Jupyter notebook file.
All other submissions will be rejected and graded with 0 points.

#### Part 01: In this part of the graded lab you need to solve different tasks for analysing Twitter data (10 points)
###### Note: The data for this part is contained in the part01 folder

##### Task 01: Print the top 10 words of the tweets (2 points)
###### Read the 'text' column of the tweets from the file "2023_01_01" into an RDD and print the top 10 words of the tweets. Note: lowercase all words and remove the stopwords from the stopwords list of the archive.

In [None]:
from pyspark.sql.functions import lower, explode, split, col, regexp_replace, length, trim

# Read the JSON file
tweets_2023_01_01 = spark.read.json("/FileStore/tables/data_january2023/2023_01_01.json")

# Read stopwords from file and collect as a lowercase Python list
# (the tweet text is lowercased below, so the stopwords must match that casing)
stopwords_df = spark.read.text("/FileStore/tables/stopwords.txt")
stopwords = [row.value.strip().lower() for row in stopwords_df.collect()]

# Remove URLs
tweets_2023_01_01 = tweets_2023_01_01.withColumn(
    "text",
    regexp_replace(col("text"), "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "")
)

# Lowercase the text
tweets_2023_01_01 = tweets_2023_01_01.withColumn("text", lower(col("text")))

# Split text into words, remove punctuation, and explode into rows
words_df = tweets_2023_01_01.select(
    explode(
        split(
            regexp_replace(col("text"), r"[^a-zA-Z0-9]", " "),  # Replace non-alphanumeric with space
            r"\s+"
        )
    ).alias("word")
)

# Remove empty strings and stopwords
words_df = words_df.withColumn("word", trim(col("word")))
words_df = words_df.filter(
    (length(col("word")) > 0) &
    (~col("word").isin(stopwords))
)

# Count word frequencies and show top 10
print("\nTop 10 most frequent words:")
words_df.groupBy("word").count().orderBy(col("count").desc()).show(10)
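
The cleaning steps above (strip URLs, lowercase, replace non-alphanumeric characters, drop stopwords) can be checked on a few strings without a cluster. The following is a minimal sketch in plain Python. `clean_tokens`, `top_words`, the sample texts, and the sample stopword set are hypothetical names and data introduced for illustration, and the simplified URL pattern is an assumption rather than the exact expression used in the cell. Since the task text asks for an RDD, a function of this shape could also be passed to `flatMap` on an RDD of tweet texts.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")       # simplified URL pattern (assumption)
NON_ALNUM_RE = re.compile(r"[^a-z0-9]+")   # applied after lowercasing

def clean_tokens(text, stopwords):
    """Return the lowercase, stopword-free tokens of one tweet text."""
    if not text:                            # messy data: missing or empty text
        return []
    text = URL_RE.sub(" ", text).lower()
    tokens = NON_ALNUM_RE.sub(" ", text).split()
    return [t for t in tokens if t not in stopwords]

def top_words(texts, stopwords, n=10):
    """Count tokens over all texts and return the n most frequent ones."""
    counts = Counter()
    for text in texts:
        counts.update(clean_tokens(text, stopwords))
    return counts.most_common(n)

# Invented sample data, including a missing text
sample = ["Happy New Year! https://t.co/abc", "happy 2023 to the world", None]
print(top_words(sample, stopwords={"to", "the"}, n=3))
```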

##### Task 02: Find the user with the most tweets in January 2023 (2 points) 
###### Use paired RDDs and their functions to find the user with the most tweets in January 2023.

In [0]:
# Read all JSON files as a DataFrame first
tweets_df = spark.read.json("/FileStore/tables/data_january2023/*.json")

# Convert the DataFrame to an RDD
tweets_rdd = tweets_df.rdd

# Map to ((user_id, screen_name), 1) pairs
user_screen_pairs = tweets_rdd.map(lambda row: ((row.user_id, row.screen_name), 1))

# Reduce by key to count tweets per (user_id, screen_name)
user_screen_counts = user_screen_pairs.reduceByKey(lambda a, b: a + b)

# Find the (user_id, screen_name) with the most tweets
most_tweets_user = user_screen_counts.reduce(lambda a, b: a if a[1] > b[1] else b)

# Print the result
print("\nUser with the most tweets in January 2023:")
print(f"User ID: {most_tweets_user[0][0]}, Screen Name: {most_tweets_user[0][1]}, Number of tweets: {most_tweets_user[1]}")
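
The counting pattern used above can be traced on a small in-memory list. In this sketch, `reduce_by_key` is a hypothetical local stand-in for `RDD.reduceByKey` (it is not a Spark API), and the sample rows are invented. It also shows one possible way to make the final reduction deterministic: with `a if a[1] > b[1] else b`, users with equal counts are resolved by the order in which elements are combined, so the sketch compares the key as a secondary criterion.

```python
from functools import reduce

def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

# Invented (user_id, screen_name) rows
rows = [("1", "alice"), ("2", "bob"), ("1", "alice"), ("1", "alice"), ("2", "bob")]

# Same shape as in the cell above: ((user_id, screen_name), 1)
pairs = [((uid, name), 1) for uid, name in rows]
counts = reduce_by_key(pairs, lambda a, b: a + b)

# Compare (count, key) so that ties do not depend on the combination order
winner = reduce(lambda a, b: a if (a[1], a[0]) >= (b[1], b[0]) else b, counts)
print(winner)
```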

##### Task 03: Print the top 5 users with the most tweets in January 2023 including their top 5 terms (2 points)
###### Use paired RDDs and their functions to find the users with the most tweets in January 2023. Afterwards analyse the text content of their tweets and print the top 5 terms from all their posted tweets. Note: lowercase all words and remove the stopwords from the stopwords list of the archive.

In [0]:
import re

# Map to ((user_id, screen_name), 1) pairs
user_screen_pairs = tweets_rdd.map(lambda row: ((row.user_id, row.screen_name), 1))

# Reduce by key to count tweets per user
user_screen_counts = user_screen_pairs.reduceByKey(lambda a, b: a + b)

# Get the top 5 users with the most tweets
top5_users = user_screen_counts.takeOrdered(5, key=lambda x: -x[1])

# Read stopwords from file
stopwords = set([row.value.strip().lower() for row in spark.read.text("/FileStore/tables/stopwords.txt").collect()])

# Tokenize one tweet: remove URLs, lowercase, and drop stopwords
def extract_words(text):
    # Messy data: tweets without text yield no words
    if not text:
        return []
    # Remove URLs
    text = re.sub(r"http[s]?://\S+", "", text)
    # Lowercase and split into words
    words = re.findall(r'\b\w+\b', text.lower())
    # Remove stopwords
    return [word for word in words if word not in stopwords]

for user in top5_users:
    user_id, screen_name = user[0]
    # Filter tweets for this user
    user_tweets = tweets_rdd.filter(lambda row: row.user_id == user_id)
    # Extract and clean words from all their tweets
    words_rdd = user_tweets.flatMap(lambda row: extract_words(row.text))
    # Count word frequencies
    word_counts = words_rdd.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
    # Get top 5 terms
    top5_terms = word_counts.takeOrdered(5, key=lambda x: -x[1])
    # Print results
    print(f"\nTop 5 terms for User ID: {user_id}, Screen Name: {screen_name}:")
    for word, count in top5_terms:
        print(f"{word}: {count}")
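
The loop above scans the tweets once per user. A possible alternative is to count terms for all selected users in a single pass. The sketch below shows that idea in plain Python so it can be run locally. `top_terms_per_user` and the sample tweets are hypothetical and introduced for illustration only. On Spark, the same shape would roughly correspond to filtering on the set of top user IDs, mapping to `((user_id, word), 1)`, and applying `reduceByKey`.

```python
import re
from collections import Counter, defaultdict

def top_terms_per_user(tweets, user_ids, stopwords, n=5):
    """tweets: iterable of (user_id, text); returns {user_id: [(term, count), ...]}."""
    wanted = set(user_ids)
    counts = defaultdict(Counter)
    for user_id, text in tweets:
        # Skip users we are not interested in and tweets without text
        if user_id not in wanted or not text:
            continue
        text = re.sub(r"https?://\S+", "", text).lower()
        words = re.findall(r"\b\w+\b", text)
        counts[user_id].update(w for w in words if w not in stopwords)
    return {uid: counts[uid].most_common(n) for uid in wanted}

# Invented sample data
tweets = [
    ("u1", "Spark is fast"),
    ("u1", "spark rocks https://x.y"),
    ("u2", "hello spark"),
    ("u3", None),
]
result = top_terms_per_user(tweets, ["u1", "u3"], stopwords={"is"}, n=2)
for uid in sorted(result):
    print(uid, result[uid])
```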

##### Task 04: Print the top 5 terms per minute within the tweets of the first of January (2 points)
###### Find a solution of your choice to print the top 5 terms per minute within the tweets of the file "2023_01_01". Note: lowercase all words and remove the stopwords from the stopwords list of the archive.

In [0]:
from pyspark.sql.functions import col, lower, regexp_replace, split, explode, trim, length
from pyspark.sql import Window
import pyspark.sql.functions as F

# Read tweets and stopwords
tweets_df = spark.read.json("/FileStore/tables/data_january2023/2023_01_01.json")
stopwords = [row.value.strip().lower() for row in spark.read.text("/FileStore/tables/stopwords.txt").collect()]

# Remove URLs from text
tweets_df = tweets_df.withColumn(
    "clean_text",
    regexp_replace(col("text"), r"http[s]?://\S+", "")
)

# Lowercase and split into words, remove punctuation
tweets_df = tweets_df.withColumn(
    "clean_text",
    lower(col("clean_text"))
)
tweets_df = tweets_df.withColumn(
    "word",
    explode(split(regexp_replace(col("clean_text"), r"[^a-zA-Z0-9]", " "), r"\s+"))
)

# Remove empty strings and stopwords
tweets_df = tweets_df.withColumn("word", trim(col("word")))
tweets_df = tweets_df.filter(
    (length(col("word")) > 0) &
    (~col("word").isin(stopwords))
)

# Extract minute from time
tweets_df = tweets_df.withColumn("minute", col("time").substr(1, 16))  # "2023-01-01T00:00"

# Count word frequencies per minute
word_counts = tweets_df.groupBy("minute", "word").count()

# For each minute, get the top 5 words
window = Window.partitionBy("minute").orderBy(col("count").desc())
word_counts = word_counts.withColumn("rank", F.row_number().over(window))
top5_per_minute = word_counts.filter(col("rank") <= 5)

# Now, for each minute, create a DataFrame and store in a dictionary
minutes = [row.minute for row in top5_per_minute.select("minute").distinct().collect()]
minute_dfs = {}

for minute in minutes:
    df = top5_per_minute.filter(col("minute") == minute).select("minute", "word", "count")
    minute_dfs[minute] = df

# Print the top5 words in tweets of every minute
for minute in sorted(minute_dfs.keys()):
    print(f"\nTop 5 words for minute {minute}:")
    minute_dfs[minute].show()
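
The minute bucketing and the per-minute ranking can be illustrated on a few records in plain Python. This is a minimal sketch, assuming ISO 8601 timestamps of the form suggested by the comment in the cell (`"2023-01-01T00:00"` as the first 16 characters). `top_terms_by_minute` and the sample records are hypothetical. The sketch sorts by count and then alphabetically, because a ranking on the count alone leaves the order of equally frequent words undefined. In the DataFrame version, a comparable effect could be obtained by adding `col("word")` as a second sort key in the window.

```python
from collections import Counter, defaultdict

def top_terms_by_minute(records, n=5):
    """records: iterable of (iso_timestamp, word) pairs."""
    per_minute = defaultdict(Counter)
    for timestamp, word in records:
        # The first 16 characters are "YYYY-MM-DDTHH:MM"
        per_minute[timestamp[:16]][word] += 1
    result = {}
    for minute, counter in per_minute.items():
        # Count descending, then word ascending, so ties are stable
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        result[minute] = ranked[:n]
    return result

# Invented sample records
records = [
    ("2023-01-01T00:00:05+00:00", "year"),
    ("2023-01-01T00:00:40+00:00", "happy"),
    ("2023-01-01T00:00:59+00:00", "happy"),
    ("2023-01-01T00:01:02+00:00", "new"),
]
top = top_terms_by_minute(records)
for minute in sorted(top):
    print(minute, top[minute])
```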


##### Task 05: Your analysis (2 points)
###### Find an interesting analysis question for the Twitter data and answer it with a solution of your choice. Briefly explain why your question is interesting and what value the information you create has.

My idea is to use the time zone offset that is attached to every timestamp to infer the region in which a user most likely lives.
This is relevant because it allows conclusions about the geographic origin of users without explicit location information or geotags. For example, regional focus areas and target groups can be identified more reliably, content can be delivered in a more targeted way, and unusual activity patterns that might point to automated accounts or bots can be detected.

In [None]:
# Mapping from UTC offset to a likely region

offset_to_region = {
    # North and South America (West to East)
    "-10:00": "Hawaii-Aleutian Time",
    "-09:00": "Alaska Time",
    "-08:00": "Pacific Time (US/Canada)",
    "-07:00": "Mountain Time (US/Canada)",
    "-06:00": "Central Time (US/Canada)",
    "-05:00": "Eastern Time (US/Canada)",
    "-04:00": "Atlantic Time (Canada)/Caribbean",
    "-03:00": "Argentina/Brazil/Chile",
    "-02:00": "Mid-Atlantic",

    # Europe and Africa (West to East)
    "+00:00": "UK/Ireland/Portugal (GMT)",
    "+01:00": "Central European Time (Germany, France, Italy)",
    "+02:00": "Eastern European Time (Finland, Greece, Egypt)",
    "+03:00": "Moscow, East Africa",

    # Asia and Oceania (West to East)
    "+04:00": "Dubai, UAE, Azerbaijan",
    "+05:00": "Pakistan, Kazakhstan",
    "+05:30": "India, Sri Lanka",
    "+06:00": "Bangladesh, Bhutan",
    "+07:00": "Thailand, Vietnam, Indonesia",
    "+08:00": "China, Singapore, Malaysia",
    "+09:00": "Japan, South Korea",
    "+09:30": "Central Australia",
    "+10:00": "Eastern Australia",
    "+11:00": "Solomon Islands",
    "+12:00": "New Zealand, Fiji"
}

In [6]:
def analyze_user_timezone(user_id_input):
    """
    Analyze the time zone activity of a specific Twitter user.

    Args:
        user_id_input (str): The Twitter user ID

    Returns:
        None: Prints the analysis results
    """

    try:
        # Load all tweets from January 2023
        tweets_df = spark.read.json("/FileStore/tables/data_january2023/*.json")
        tweets_rdd = tweets_df.rdd

        # Keep only the tweets of the given user; cached because the RDD is reused below
        user_tweets = tweets_rdd.filter(lambda row: row.user_id == user_id_input).cache()

        # Count once instead of triggering a new Spark job for every use
        total_tweets = user_tweets.count()

        # Check whether the user exists
        if total_tweets == 0:
            print(f"No tweets found for user ID: {user_id_input}")
            return

        # Get the screen name of the user
        screen_name = user_tweets.first().screen_name

        print(f"\nTime zone analysis for user {screen_name} (ID: {user_id_input}):")
        print(f"Total number of tweets: {total_tweets}")

        # Extract the time zone offset at the end of the timestamp
        def extract_offset(row):
            import re
            # A missing timestamp is treated as an empty string
            match = re.search(r'([+-]\d{2}:\d{2})$', row.time or "")
            return match.group(1) if match else None

        offsets = user_tweets.map(extract_offset).filter(lambda x: x is not None)

        # Count how often each offset occurs
        offset_counts = offsets.map(lambda offset: (offset, 1)).reduceByKey(lambda a, b: a + b)

        # Find the most frequent offset
        if not offset_counts.isEmpty():
            most_common_offset = offset_counts.reduce(lambda a, b: a if a[1] > b[1] else b)

            print(f"\nMost frequent time zone offset: {most_common_offset[0]}")
            print(f"Number of tweets in this time zone: {most_common_offset[1]}")
            print(f"Likely region: {offset_to_region.get(most_common_offset[0], 'Unknown region')}")

            # Show the distribution over all time zones
            print("\nDistribution over all time zones:")
            for offset, n_tweets in sorted(offset_counts.collect()):
                region = offset_to_region.get(offset, "Unknown region")
                percentage = (n_tweets / total_tweets) * 100
                print(f"Time zone {offset} ({region}): {n_tweets} tweets ({percentage:.1f}%)")
        else:
            print("No time zone information found in the tweets.")

    except Exception as e:
        print(f"An error occurred: {str(e)}")

# Example usage:
analyze_user_timezone(most_tweets_user[0][0])
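
The offset extraction can be tested on its own before it is run on the cluster. The sketch below is a standalone variant that takes the timestamp string directly. Handling a trailing `Z` (the ISO 8601 shorthand for UTC) is an added assumption, since the source does not state whether such timestamps occur in the data, and the sample timestamps are invented.

```python
import re

OFFSET_RE = re.compile(r"([+-]\d{2}:\d{2})$")

def extract_offset(timestamp):
    """Return the trailing UTC offset such as '+01:00', or None if absent."""
    if not timestamp:
        return None
    if timestamp.endswith("Z"):   # 'Z' denotes UTC in ISO 8601 (assumed to be possible here)
        return "+00:00"
    match = OFFSET_RE.search(timestamp)
    return match.group(1) if match else None

# Invented sample timestamps
for ts in ["2023-01-01T12:00:00+01:00", "2023-01-01T12:00:00Z", "2023-01-01T12:00:00", None]:
    print(ts, "->", extract_offset(ts))
```

Note that a fixed offset is only a coarse indicator: regions that observe daylight saving time switch between two offsets over the year, and several regions share the same offset.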

#### Part 02: In this part of the graded lab you need to solve different tasks for analysing Graph data (10 points)

###### Note: The data for this part is contained in the part02 folder. The library to import (as mentioned in Lab06) is also in the archive.

##### Task 01: Construct the graph (2 points)
###### Create a graph with the structure you can find on the images in the part02 folder. Print 10 items of the vertices and 10 of the edges.
###### Note: There is one image for vertices and one image for edges. You need to do some transformation to get to the desired result.

##### Task 02: Motifs (2 points)
###### Define a pattern that detects all flights from New York to Las Vegas whose total delay is no more than 60 minutes.

##### Task 03: Graph Pattern Analysis (2 points)
###### What are the flight routes with no direct connection?
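
One possible reading of this question is: which airport pairs are connected through exactly one intermediate stop but have no direct edge? The sketch below illustrates that reading in plain Python on invented edges. `indirect_only_routes` and the airport codes are hypothetical. In GraphFrames, a comparable pattern could presumably be written as a motif with a negated term, along the lines of `"(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)"`.

```python
def indirect_only_routes(edges):
    """Return pairs (a, c) with a path a -> b -> c but no direct edge a -> c."""
    direct = set(edges)

    # Adjacency sets: source airport -> set of destination airports
    out = {}
    for src, dst in direct:
        out.setdefault(src, set()).add(dst)

    result = set()
    for a, mids in out.items():
        for b in mids:
            for c in out.get(b, ()):
                # Ignore round trips and pairs that already have a direct flight
                if c != a and (a, c) not in direct:
                    result.add((a, c))
    return result

# Invented sample edges (source, destination)
edges = [("JFK", "ORD"), ("ORD", "LAS"), ("JFK", "LAX"), ("LAX", "LAS"), ("ORD", "JFK")]
print(sorted(indirect_only_routes(edges)))
```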

##### Task 04: What are the most important airports, according to PageRank? (2 points)
###### Note: You are allowed to use the pageRank function of the graph library.
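
To make clear what the library function computes, the following is a minimal sketch of PageRank by power iteration in plain Python. It is an illustration, not the library's implementation. The function `pagerank`, its defaults, and the sample edges are assumptions, and the absolute values may be scaled differently from the library's output, so only the ordering should be compared. With GraphFrames, the usual call is `g.pageRank(resetProbability=0.15, maxIter=10)`, whose result carries a `pagerank` column on the vertices.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a list of (src, dst) edges; ranks sum to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        # Every node passes its damped rank on in equal shares to its targets
        for src, targets in out_links.items():
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank

# Invented sample edges: three airports fly into ORD, ORD flies to JFK
edges = [("JFK", "ORD"), ("LAX", "ORD"), ("ORD", "JFK"), ("LAS", "ORD")]
ranks = pagerank(edges)
for airport in sorted(ranks, key=ranks.get, reverse=True):
    print(airport, round(ranks[airport], 4))
```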

##### Task 05: Your analysis (2 points)
###### Find an interesting analysis question for the Graph data and answer it with a solution of your choice. Briefly explain why your question is interesting and what value the information you create has.