# In-class exercise
This exercise uses the global warming tweets from January 2023. I have selected four metrics for each account (followers count, following count, listed count, and tweet count). Please complete the following steps:
- check whether the metrics contain any null values
- develop a strategy to replace the null values
- identify the outliers in each metric
- develop a strategy to replace or remove the outliers
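
As a warm-up, the four steps can be sketched in plain Python on a small made-up list before applying them to the Spark DataFrame. Everything here is an assumption chosen for illustration: the toy values, the choice of the median as the replacement for nulls, and the 1.5 × IQR rule for outliers. Other strategies may suit the real data better.

```python
import statistics

# Hypothetical follower counts; None stands in for a null value.
followers = [120, 95, None, 130, 110, 25000, None, 105]

# Step 1: count the nulls.
null_count = sum(v is None for v in followers)

# Step 2: replace nulls with the median of the observed values
# (the median is less sensitive to extreme values than the mean).
observed = [v for v in followers if v is not None]
median = statistics.median(observed)
filled = [median if v is None else v for v in followers]

# Step 3: flag outliers with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(filled, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in filled if v < lower or v > upper]

# Step 4: cap values at the fences instead of dropping the rows.
capped = [min(max(v, lower), upper) for v in filled]

print("nulls:", null_count)
print("median used for filling:", median)
print("outliers:", outliers)
print("capped values:", capped)
```

Capping keeps every account in the data set, while removing outliers shrinks it. Which is preferable depends on whether very large accounts are noise or part of what you want to study.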

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from helper_functions import displayByGroup
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Check whether a Spark session is already active. If it is, stop it.

try:
    if spark:
        spark.stop()
except NameError:
    # `spark` has not been defined yet, so there is nothing to stop
    pass

spark = (SparkSession.builder.appName("Multidimensional Data Frame")
        .config("spark.port.maxRetries", "100")
        .config("spark.sql.mapKeyDedupPolicy", "LAST_WIN")  # When a map has duplicate keys, keep the last value instead of raising an error.
#        .config("spark.driver.memory", "16g")
        .getOrCreate())

# configure the log level (the default is WARN)
spark.sparkContext.setLogLevel('ERROR')

# read the global warming tweets

df=spark.read.parquet('/opt/shared/globalwarming_202301')

In [None]:
df.select('author.public_metrics').printSchema()

In [None]:
df1=df.select('author.name', 'author.public_metrics.followers_count', 'author.public_metrics.following_count',
          'author.public_metrics.listed_count', 
          'author.public_metrics.tweet_count').distinct()

df1.orderBy('name').show()

Look at the table above. Why do some names appear multiple times?

In [None]:
df1=df.select('author.id','author.name', 'created_at', 'author.public_metrics.followers_count', 'author.public_metrics.following_count',
          'author.public_metrics.listed_count', 
          'author.public_metrics.tweet_count').distinct()

df1.orderBy('name').show()

In [None]:
# for each account, we need to use the metrics from the most recent date

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, desc, col

windowSpec=Window.partitionBy('id').orderBy(desc('created_at'))

df2=df1.withColumn('rank', rank().over(windowSpec)).filter(col('rank')==1).drop('rank')

df2.select('name', 'followers_count').orderBy('name').show()
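
To make the window logic above concrete, here is a plain-Python equivalent of "keep the most recent row per account". The rows, ids, and timestamps are hypothetical and only illustrate the idea. This is a sketch of the technique, not a replacement for the Spark code.

```python
# Hypothetical rows: (id, created_at, followers_count).
# ISO-formatted timestamps compare correctly as strings.
rows = [
    ("a1", "2023-01-03T10:00:00", 150),
    ("a1", "2023-01-20T08:30:00", 162),
    ("b2", "2023-01-11T12:00:00", 40),
    ("a1", "2023-01-09T17:45:00", 155),
]

latest = {}
for account_id, created_at, followers in rows:
    # Keep a row only if it is newer than the one already stored for this id.
    if account_id not in latest or created_at > latest[account_id][0]:
        latest[account_id] = (created_at, followers)

for account_id, (created_at, followers) in sorted(latest.items()):
    print(account_id, created_at, followers)
```

One difference is worth noting. The loop keeps exactly one row per id, whereas `rank()` gives tied rows the same rank, so two rows for one account with an identical `created_at` would both pass the `rank == 1` filter. If that turns out to happen in the data, `row_number()` is one possible alternative that always returns a single row per partition.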

In [None]:
metrics=['followers_count',
 'following_count',
 'listed_count',
 'tweet_count']

df2.select(metrics).summary().show()