# Analysis of Global Warming Tweets from January 2023
This project is based on global warming tweets posted in January 2023. I have added an emotion label (anger, joy, optimism, or sadness) and a gender label at the end of each tweet record. Emotion analysis was performed using a pre-trained model from Hugging Face (Twitter-roBERTa-base for Emotion Recognition). This is a roBERTa-base model trained on ~58M tweets and fine-tuned for emotion recognition with the TweetEval benchmark. Each tweet is classified into one of four emotions (joy, optimism, anger, and sadness) with a confidence score. In addition, gender is inferred from the first name of the user account, provided the account uses a real first name and the Python package gender-guesser can identify its gender.
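The gender step described above can be sketched roughly as follows. This is only an illustration of the idea, not the code used to build the dataset: the helper names `first_name`, `guess_gender`, and `demo_guesser` are hypothetical, and the tiny lookup table stands in for the gender-guesser package so the sketch runs without extra dependencies. With the real package, `gender_guesser.detector.Detector().get_gender` could presumably be passed in place of `demo_guesser`.

```python
from typing import Callable, Optional


def first_name(display_name: str) -> Optional[str]:
    """Return the first token of a display name if it looks like a real first name."""
    tokens = display_name.strip().split()
    if not tokens:
        return None
    token = tokens[0]
    # Reject handles, initials, and anything containing digits or symbols.
    if len(token) < 2 or not token.isalpha():
        return None
    return token.capitalize()


def guess_gender(display_name: str, guesser: Callable[[str], str]) -> Optional[str]:
    """Return 'male' or 'female' when the guesser is unambiguous, otherwise None."""
    name = first_name(display_name)
    if name is None:
        return None
    label = guesser(name)
    # Ambiguous or unknown answers are treated as "gender not identified".
    return label if label in ("male", "female") else None


# Hypothetical stand-in for the gender-guesser lookup (assumption, not real data).
_DEMO_NAMES = {"Maria": "female", "John": "male", "Alex": "andy"}


def demo_guesser(name: str) -> str:
    return _DEMO_NAMES.get(name, "unknown")


print(guess_gender("maria lopez", demo_guesser))    # female
print(guess_gender("ClimateBot_42", demo_guesser))  # None
print(guess_gender("Alex Kim", demo_guesser))       # None
```

Returning `None` for ambiguous names keeps the gender column sparse but avoids assigning a label the lookup cannot support.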

# Read tweets into a data frame

In [2]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from helper_functions import displayByGroup
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Check whether a Spark session is already active. If it is, stop it.

try:
    if spark:
        spark.stop()
except NameError:
    # `spark` is not defined yet, so there is no session to stop
    pass

spark = (SparkSession.builder.appName("Multidimensional Data Frame")
        .config("spark.port.maxRetries", "100")
        .config("spark.sql.mapKeyDedupPolicy", "LAST_WIN")  # Keep the last value when a map has duplicate keys instead of raising an error.
#        .config("spark.driver.memory", "16g")
        .getOrCreate())

# configure the log level (default is WARN)
spark.sparkContext.setLogLevel('ERROR')

# read the global warming tweets

df=spark.read.parquet('/opt/shared/globalwarming_202301')

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/27 18:57:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/27 18:57:34 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/09/27 18:57:34 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
23/09/27 18:57:34 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
23/09/27 18:57:34 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
23/09/27 18:57:34 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
23/09/27 18:57:34 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
23/09/27 18:57:34 WARN Utils: Service 'SparkUI' could not bind on port 4046. Attempting port 4047.
23/09/27 18:57:34 WARN Utils: Serv

# In-class exercise

## Extract the mentions each account received

In [153]:
#df.select('author').printSchema()

In [3]:
df.select('author.entities.description.mentions').printSchema()

root
 |-- mentions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- end: long (nullable = true)
 |    |    |-- start: long (nullable = true)
 |    |    |-- username: string (nullable = true)



In [4]:
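from pyspark.sql.functions import col, explode, desc

# Keep each author's username and the usernames listed under author.entities.description.mentions
df1 = (df.select('author.username', col('author.entities.description.mentions.username').alias('mentions'))
         .filter(col('mentions').isNotNull()))

# One row per (username, mentioned account) pair
df2 = df1.select('username', explode('mentions').alias('mentions'))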

In [None]:
# users who received the most mentions

df2.groupBy('mentions').count().orderBy(desc('count')).show()

In [161]:
# users who mentioned the most accounts

df2.groupBy('username').count().orderBy(desc('count')).show()



+---------------+-----+
|       username|count|
+---------------+-----+
| IMPraveenDalal| 1128|
|          _PTLB|  678|
| _DigitalPolice|  410|
|  FurusetGerden|  308|
|DisasterReliefs|  306|
|  ManishKhurana|  209|
|         _AFPOH|  186|
| FiyyazAhmed_06|  172|
|  SwatiBhalla23|  151|
|   GeraldKutney|  144|
|       _CEDILRI|  131|
|   HaarpDecoded|  109|
|Qlightenigma007|   96|
|   Living4Earth|   94|
|     PtlbSchool|   92|
| EastMeetsWest0|   92|
|     jonburkeUK|   88|
|   leon_mugisho|   79|
|          pmagn|   79|
|    SaleemulHuq|   78|
+---------------+-----+
only showing top 20 rows
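To make the explode-then-count pattern above concrete, here is a minimal plain-Python sketch of the same logic on a handful of made-up rows. The sample data and the helper `explode_mentions` are hypothetical; the sketch only mirrors what the `isNotNull()` filter, `explode`, and the two `groupBy(...).count()` calls do, and is not a substitute for the Spark code.

```python
from collections import Counter

# Hypothetical sample rows: (author username, usernames in the mentions array).
rows = [
    ("alice", ["bob", "carol"]),
    ("bob", None),  # null array: dropped, like the isNotNull() filter
    ("carol", ["bob"]),
    ("dave", ["bob", "alice", "carol"]),
]


def explode_mentions(rows):
    """Yield one (username, mention) pair per array element, skipping null arrays."""
    for username, mentions in rows:
        if mentions is None:
            continue
        for mention in mentions:
            yield username, mention


pairs = list(explode_mentions(rows))

# Equivalent of groupBy('mentions').count(): how often each account is mentioned.
received = Counter(mention for _, mention in pairs)
# Equivalent of groupBy('username').count(): how many mentions each author made.
made = Counter(username for username, _ in pairs)

print(received.most_common())  # [('bob', 3), ('carol', 2), ('alice', 1)]
print(made.most_common())      # [('dave', 3), ('alice', 2), ('carol', 1)]
```

Because exploding produces one row per array element, the same exploded pairs answer both questions: grouping by the mention side gives mentions received, grouping by the author side gives mentions made.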



                                                                                

## Extract user locations from tweets

In [162]:
df.select('author.location').show()

+--------------------+
|            location|
+--------------------+
| Manchester, England|
|Down very long tr...|
|                null|
|Santa Ana, Califo...|
|                null|
|      Morgantown, WV|
|                null|
|                null|
|              U.S.A.|
|LORD HIS EXCELLEN...|
|      Earth USA S.C.|
|On stolen Kulin l...|
|Rural Hall, North...|
|Johannesburg. Sou...|
|                null|
|    Chicago & Galway|
|             Germany|
|        Alabama, USA|
|           TOKYO-III|
|                null|
+--------------------+
only showing top 20 rows



## Extract hashtags from tweets

In [167]:
df.select('entities.hashtags.tag').show()

+--------------------+
|                 tag|
+--------------------+
|                null|
|                null|
|                null|
|                null|
|                null|
|                null|
|[GretaThunberg, G...|
|                null|
|                null|
|                null|
|                null|
|                null|
|[climate, change,...|
|                null|
|                null|
|                null|
|                null|
|                null|
|                null|
|                null|
+--------------------+
only showing top 20 rows



## Extract entities (places, persons, organizations) from tweets

In [174]:
df.select('entities.annotations.normalized_text', 'entities.annotations.type').show()

+--------------------+--------------+
|     normalized_text|          type|
+--------------------+--------------+
|                null|          null|
|                null|          null|
|[America, Pikas N...|[Place, Other]|
|              [Elon]|      [Person]|
|                null|          null|
|                null|          null|
|     [GretaThunberg]|      [Person]|
| [Bible, Revelation]|[Other, Other]|
|           [Florida]|       [Place]|
|        [Ice-Age, -]|[Other, Other]|
|                null|          null|
|              [IPCC]|[Organization]|
|     [MammothSteppe]|       [Place]|
|                null|          null|
|                null|          null|
|[Ice Road Trucker...|[Other, Other]|
|                null|          null|
|    [Global Warming]|       [Other]|
|                null|          null|
|                null|          null|
+--------------------+--------------+
only showing top 20 rows



In [179]:
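# Pair each entity text with its type; duplicate texts within a tweet rely on the LAST_WIN policy set on the session
df1 = df.select(F.map_from_arrays('entities.annotations.normalized_text', 'entities.annotations.type').alias('entities'))

# One row per (entity text, entity type) pair; rows with a null map are dropped
df1.select(explode('entities')).show()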

+--------------------+------------+
|                 key|       value|
+--------------------+------------+
|             America|       Place|
|       Pikas Now Pre|       Other|
|                Elon|      Person|
|       GretaThunberg|      Person|
|               Bible|       Other|
|          Revelation|       Other|
|             Florida|       Place|
|             Ice-Age|       Other|
|                   -|       Other|
|                IPCC|Organization|
|       MammothSteppe|       Place|
|Ice Road Truckers...|       Other|
|Don’t Call It Glo...|       Other|
|      Global Warming|       Other|
|               Greta|      Person|
|       GretaThunberg|      Person|
|            Libtards|       Other|
|         mumbo-jumbo|       Other|
|The Age of Aquari...|       Other|
|          Solar Myth|       Other|
+--------------------+------------+
only showing top 20 rows
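One consequence of building a map from the two arrays is worth spelling out: with `spark.sql.mapKeyDedupPolicy` set to `LAST_WIN`, an entity text that appears twice in the same tweet collapses to a single key, and the later type overwrites the earlier one. The plain-Python sketch below imitates that behaviour on made-up values; `map_from_arrays_last_win` and `entity_pairs` are hypothetical helpers written for illustration, not Spark functions.

```python
def map_from_arrays_last_win(keys, values):
    """Imitate map_from_arrays under LAST_WIN: later duplicates overwrite earlier ones."""
    if keys is None or values is None:
        return None
    if len(keys) != len(values):
        raise ValueError("key and value arrays must have the same length")
    result = {}
    for key, value in zip(keys, values):
        result[key] = value  # a repeated key keeps only its last value
    return result


def entity_pairs(keys, values):
    """Alternative that keeps every (text, type) pair, including repeated texts."""
    if keys is None or values is None:
        return []
    return list(zip(keys, values))


# Hypothetical annotations for a single tweet, with "Greta" tagged twice.
texts = ["Greta", "IPCC", "Greta"]
types = ["Person", "Organization", "Other"]

print(map_from_arrays_last_win(texts, types))
# {'Greta': 'Other', 'IPCC': 'Organization'}
print(entity_pairs(texts, types))
# [('Greta', 'Person'), ('IPCC', 'Organization'), ('Greta', 'Other')]
```

If every occurrence matters, for example when counting entity frequencies, keeping pairs rather than a map avoids silently dropping repeats within a tweet.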

