# MapReduce with Spark

To start a Spark application, a few preliminary steps are needed:
- create a Spark configuration
- create a Spark session
- obtain a Spark *context*

In [2]:
from pyspark import SparkConf
from pyspark.sql import SparkSession
import sagemaker_pyspark

conf = SparkConf()
conf.set("spark.driver.extraClassPath", ":".join(sagemaker_pyspark.classpath_jars()))
conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider')
spark = (
    SparkSession
    .builder
    .config(conf=conf)
    .appName("test")
    .getOrCreate()
)
sc = spark.sparkContext

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


25/02/11 14:40:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Notes:
Setting the `spark.hadoop.fs.s3a.aws.credentials.provider` property to `org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider` makes Spark use the anonymous AWS credentials provider, so it talks to Amazon S3 without authenticating. This is useful when the data in the S3 bucket is publicly accessible and no AWS credentials (such as an access key and secret key) are required to read it.

## Reading the input

Spark can read input directly from S3. With the `textFile` method we read the whole input line by line; each line (which in our case holds the complete JSON of one tweet) becomes one element of a single-valued RDD of type `RDD[str]`, that is, an RDD in which every row contains exactly one element, and that element is a string.


In [2]:
rdd = sc.textFile('s3a://mudab-2025-big-data/twitter-data/Eurovision-00.json')

**Note**: If we specify a *folder* instead of a *file*, Spark processes every file in that folder!

In [3]:
rdd.count()

25/02/10 22:42:35 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties


                                                                                

222465

## Reusing code

We can reuse some of the methods we wrote earlier, for example the code that parses strings into Tweets.

In [4]:
from dataclasses import dataclass

@dataclass
class Tweet:
  """Class to model a Tweet"""
  id: int         # The unique ID of a tweet
  content: str    # The textual content of a tweet
  author: str     # The nickname of the author of the tweet
  language: str   # The language of the tweet

In [5]:
import json

def toTweet(line: str): # FROM STRING TO TWEET
      try: #TRY-EXCEPT BLOCK
        parsed = json.loads(line)
        return Tweet(parsed['id'], parsed['text'], parsed['user']['name'], parsed['lang'])
      except Exception as e:
        return None # IF AN EXCEPTION OCCURS

# For small input tests, extract the first 10 values
#small_data = rdd.take(10) # Takes first 10 elements
#small_rdd = sc.parallelize(small_data) # Create an RDD with first 10 elements
# for small tests, use small_rdd instead of rdd below

processed = (rdd  # rdd[string]
    .map(toTweet) # parse string into Tweet. Return rdd[Tweet]
    .filter(lambda x: x is not None) # filter empty values. Return same type (rdd[Tweet])
)

## Counting tweets

The code above produces an `RDD[Tweet]` from an `RDD[str]`. The Spark action `.count()` returns the number of elements in the RDD.
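
The parse-then-filter step can be tried locally without Spark. The following is a minimal sketch in plain Python: the sample lines are hypothetical (not taken from the dataset), `to_tweet` is an illustrative stand-in for `toTweet`, and the list comprehension plays the role of `.map(...).filter(...)`.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tweet:
    id: int
    content: str
    author: str
    language: str

def to_tweet(line: str) -> Optional[Tweet]:
    """Parse one JSON line into a Tweet, or return None if it cannot be parsed."""
    try:
        parsed = json.loads(line)
        return Tweet(parsed["id"], parsed["text"], parsed["user"]["name"], parsed["lang"])
    except (ValueError, KeyError, TypeError):
        return None

# Hypothetical sample lines, for illustration only
sample_lines = [
    '{"id": 1, "text": "hello #Eurovision", "user": {"name": "Ana"}, "lang": "en"}',
    '{"delete": {"status": {"id": 2}}}',  # valid JSON, but lacks the tweet fields
    'not json at all',                     # malformed line
]

# Equivalent of .map(toTweet).filter(lambda x: x is not None) on a plain list
parsed_tweets = [t for t in map(to_tweet, sample_lines) if t is not None]
print(len(parsed_tweets))  # 1
```

Lines that are not valid JSON, or that lack one of the expected fields, are dropped, so the parsed collection can be smaller than the raw input.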

In [6]:
print(processed.count())



102968


                                                                                

In [7]:
processed.take(5)

[Tweet(id=995443356309311493, content='RT @jk_rowling: France ❤️ #Eurovision', author='Liz', language='fr'),
 Tweet(id=995443356602982401, content='RT @neltropico: Salvador Sobral entregándole el premio a Israel #Eurovision https://t.co/sApzlSoPMb', author='Marina', language='es'),
 Tweet(id=995443356552527873, content='RT @jungjaeguns: cuando apago la luz del pasillo para irme a mi habitación y que no me maten los espíritus #Eurovision https://t.co/0naU3Xm…', author='Rosalinda ♥', language='es'),
 Tweet(id=995443356921720835, content='RT @itsbrookewar: Enrique amigooo!!! \n#Eurovision https://t.co/HTbbMUwEHd', author='Laura Pérez🐒', language='es'),
 Tweet(id=995443356825202690, content='RT @snugglycamila: i’m not surprised eurovision is confusing for americans since the concept of the person with the most votes actually win…', author='Maggi', language='en')]

### Note:
`take()` is a Spark action: it triggers execution of the job, but only as far as needed to collect the first 5 elements.
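
As a rough analogy only (Spark works on partitions and may read more than exactly five elements), lazy evaluation can be sketched with a Python generator: building the pipeline does no work, and pulling the first five results processes only what is needed. The names `parse`, `evaluated`, and `pipeline` are illustrative.

```python
from itertools import islice

evaluated = []  # records which inputs were actually processed

def parse(x):
    evaluated.append(x)
    return x * 10

numbers = range(1_000_000)
pipeline = (parse(x) for x in numbers)  # like a transformation: nothing runs yet
print(len(evaluated))                   # 0

first_five = list(islice(pipeline, 5))  # like an action: pulls only what it needs
print(first_five)                       # [0, 10, 20, 30, 40]
print(len(evaluated))                   # 5
```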

## Tweets by language

We can rewrite an earlier example (counting Tweets per language) using Spark. The code is very compact.

In [8]:
by_lang = (
    processed # RDD OF TWEET
        .map(lambda x: (x.language, 1)) # from rdd[Tweet] to rdd[str, int]
        .reduceByKey(lambda x, y: x + y) # reduce produces same type rdd[str, int]
        .sortBy(lambda x: x[1], ascending=False) # sort the output. 
            # Since it's a pair-rdd, select which element of the pair should be used to order 
)

print(by_lang.collect()) # we use collect() because we know that the output will be of limited size



[('es', 50874), ('en', 30675), ('pt', 5831), ('fr', 3766), ('ru', 2774), ('und', 2659), ('it', 2326), ('tr', 672), ('el', 672), ('pl', 547), ('de', 332), ('ca', 286), ('nl', 269), ('iw', 181), ('et', 132), ('sv', 132), ('in', 109), ('tl', 102), ('uk', 95), ('cy', 79), ('ht', 52), ('no', 48), ('cs', 40), ('hu', 39), ('ja', 38), ('da', 36), ('ro', 35), ('vi', 34), ('is', 30), ('fi', 24), ('eu', 21), ('lt', 11), ('sr', 10), ('fa', 9), ('bg', 7), ('ar', 6), ('sl', 4), ('th', 4), ('hi', 3), ('lv', 3), ('ko', 1)]


                                                                                

The cell above counts the tweets for each language, sorts the results by count in descending order, and prints the final result.
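
To see what `reduceByKey` computes, here is a single-machine sketch in plain Python. It is not how Spark implements it (Spark combines values per partition and then shuffles); `reduce_by_key` and the sample language codes are hypothetical.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Group (key, value) pairs by key and fold each group's values with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

# Hypothetical language codes, one per tweet
languages = ["es", "en", "es", "fr", "es", "en"]
pairs = [(lang, 1) for lang in languages]            # like .map(lambda x: (x.language, 1))
counts = reduce_by_key(pairs, lambda x, y: x + y)    # like .reduceByKey(...)
by_lang_local = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)  # like .sortBy(...)
print(by_lang_local)  # [('es', 3), ('en', 2), ('fr', 1)]
```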

## WordCount

The following code implements Word Count with Spark. We can reuse the previous RDD `processed`, which already holds our collection of Tweets.

In [9]:
word_count = (
    processed # RDD OF TWEET
        .flatMap(lambda x: x.content.split()) # from rdd[Tweet] to rdd[str] 
        .map(lambda x: (x.lower(), 1)) # from rdd[str] to rdd[str, int]
        .reduceByKey(lambda x, y: x + y) # reduce to same rdd[str, int]
        .sortBy(lambda x: x[1], ascending=False) # sort the output. 
            # Since it's a pair-rdd, select which element of the pair should be used to order 
)

print(word_count.take(100))



[('rt', 85609), ('#eurovision', 64604), ('de', 27967), ('que', 27257), ('a', 26931), ('la', 20603), ('el', 17377), ('the', 16859), ('y', 16235), ('en', 13159), ('no', 10726), ('eurovision', 9431), ('un', 9042), ('israel', 8688), ('con', 8147), ('to', 7935), ('los', 7354), ('is', 7064), ('cuando', 6779), ('por', 6569), ('una', 6508), ('and', 6496), ('me', 6244), ('es', 6136), ('in', 5893), ('of', 5866), ('para', 5784), ('lo', 5714), ('i', 5554), ('año', 5342), ('se', 5103), ('for', 4845), ('you', 4386), ('del', 4325), ('ha', 4321), ('mi', 4020), ('te', 4017), ('este', 4013), ('gana', 3750), ('-', 3714), ('this', 3680), ('o', 3650), ('pero', 3283), ('that', 3193), ('yo', 3160), ('song', 2864), ('#finaleurovision', 2845), ('su', 2799), ('e', 2799), ('al', 2719), ('it', 2663), ('le', 2656), ('on', 2615), ('las', 2597), ('canción', 2551), ('españa', 2480), ('but', 2437), ('si', 2403), ('was', 2317), ('so', 2263), ('tu', 2229), ('winner', 2206), ('@manelnmusic:', 2192), ('we', 2173), ('my', 

                                                                                

The cell above counts the words in the tweet data and sorts them by frequency in descending order. In this output the top entries are mostly the retweet marker `rt`, the `#eurovision` hashtag, and common function words in Spanish and English.
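
The same flatMap / map / reduceByKey pattern can be checked on a tiny input with the standard library. This is a local sketch with hypothetical tweet texts, useful for validating the logic before running it on the cluster.

```python
from collections import Counter
from itertools import chain

# Hypothetical tweet texts
texts = ["RT @user: Israel wins #Eurovision", "rt israel #eurovision again"]

# flatMap: one flat stream of whitespace-separated tokens
words = chain.from_iterable(text.split() for text in texts)
# map + reduceByKey: lower-case each token and count occurrences
local_word_count = Counter(word.lower() for word in words)

print(local_word_count["rt"], local_word_count["#eurovision"], local_word_count["wins"])  # 2 2 1
```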

In [10]:
word_count = (
    processed # RDD OF TWEET
        .filter(lambda x: x.language == 'es') # filter for Spanish tweets
        .flatMap(lambda x: x.content.split()) # from rdd[Tweet] to rdd[str] 
        .map(lambda x: (x.lower(), 1)) # from rdd[str] to rdd[str, int]
        .reduceByKey(lambda x, y: x + y) # reduce to same rdd[str, int]
        .sortBy(lambda x: x[1], ascending=False) # sort the output. 
            # Since it's a pair-rdd, select which element of the pair should be used to order 
)

print(word_count.take(100))



[('rt', 46290), ('#eurovision', 36209), ('de', 24563), ('que', 24539), ('la', 17957), ('el', 17271), ('y', 16142), ('a', 16001), ('en', 12437), ('no', 9635), ('un', 8104), ('con', 7897), ('los', 7347), ('cuando', 6778), ('por', 6320), ('una', 6227), ('es', 6046), ('lo', 5582), ('año', 5342), ('para', 4824), ('me', 4466), ('se', 4382), ('israel', 4369), ('del', 4096), ('ha', 4007), ('te', 3928), ('este', 3770), ('gana', 3744), ('mi', 3701), ('pero', 3281), ('yo', 3140), ('su', 2762), ('#finaleurovision', 2642), ('las', 2589), ('canción', 2543), ('eurovision', 2479), ('españa', 2475), ('al', 2470), ('si', 2210), ('@manelnmusic:', 2192), ('tu', 2127), ('nos', 1934), ('gallo', 1880), ('como', 1873), ('han', 1833), ('así', 1770), ('esto', 1770), ('amaia', 1763), ('pasado', 1730), ('alfred', 1720), ('-', 1704), ('más', 1697), ('mucho', 1651), ('quedo', 1611), ('le', 1561), ('ganado', 1560), ('último', 1547), ('gallina...', 1501), ('https://t.co/efvxqbb8jp', 1500), ('mejor', 1487), ('ya', 144

                                                                                

In [11]:
word_count = (
    processed # RDD OF TWEET
        .filter(lambda x: x.language == 'es') # filter for Spanish tweets
        .flatMap(lambda x: x.content.split()) # from rdd[Tweet] to rdd[str] 
        .map(lambda x: (x.lower(), 1)) # from rdd[str] to rdd[str, int]
        .reduceByKey(lambda x, y: x + y) # reduce to same rdd[str, int]
        .sortBy(lambda x: x[1], ascending=False) # sort the output. 
            # Since it's a pair-rdd, select which element of the pair should be used to order 
        .map(lambda x :x[0]) # RDD OF STRING
)

print(word_count.take(100))



['rt', '#eurovision', 'de', 'que', 'la', 'el', 'y', 'a', 'en', 'no', 'un', 'con', 'los', 'cuando', 'por', 'una', 'es', 'lo', 'año', 'para', 'me', 'se', 'israel', 'del', 'ha', 'te', 'este', 'gana', 'mi', 'pero', 'yo', 'su', '#finaleurovision', 'las', 'canción', 'eurovision', 'españa', 'al', 'si', '@manelnmusic:', 'tu', 'nos', 'gallo', 'como', 'han', 'así', 'esto', 'amaia', 'pasado', 'alfred', '-', 'más', 'mucho', 'quedo', 'le', 'ganado', 'último', 'gallina...', 'https://t.co/efvxqbb8jp', 'mejor', 'ya', 'qué', 'está', 'país', 'contra', 'os', 'europa', 'ganadora', 'sin', 'va', 'todos', 'esta', 'memes', 'madre', 'eurovisión', 'puntos', 'antes', 'ser', 'hay', 'nada', 'son', 'viene', 'israel,', 'tiene', 'todo', '@amaia_ot2017', 'porque', 'hasta', 'o', '@alfred_ot2017', 'ni', 'hace', 'años', 'ganar', 'luz', 'actuación', 'siempre', 'les', 'puestos', 'pasillo']


                                                                                

The cell above runs the word count on the Spanish tweets in the `processed` RDD and outputs only the 100 most frequent words, without their counts.

In [12]:
word_count = (
    processed # RDD OF TWEET
        .flatMap(lambda x: x.content.split()) # from rdd[Tweet] to rdd[str] 
        .filter(lambda x: x.startswith('#')) # filter for hashtags
        .map(lambda x: (x.lower(), 1)) # from rdd[str] to rdd[str, int]
        .reduceByKey(lambda x, y: x + y) # reduce to same rdd[str, int]
        .sortBy(lambda x: x[1], ascending=False) # sort the output. 
            # Since it's a pair-rdd, select which element of the pair should be used to order 
        .map(lambda x :x[0]) # RDD OF STRING
)

print(word_count.take(100))



['#eurovision', '#finaleurovision', '#esc2018', '#eurovision2018', '#allaboard', '#israel', '#escita', '#eurovision.', '#isr', '#eurovision:', '#gaza', '#eurocancion', '#amaiaalfred12points', '#finaleurovisión', '#eurovisiongr', '#eurovision…', '#eurovision,', '#finaleu…', '#e…', '#eurovi…', '#eur…', '#metamoroesc2018', '#almaia', '#…', '#netta', '#eurovisi…', '#eurovis…', '#eurov…', '#píntalocántalo', '#metamoro', '#foustanela', '#freepalestine', '#cze', '#czechrepublic', '#yolaidaypa…', '#eurovision?', '#esp', '#cyprus', '#hun', '#toy', '#ukr', '#eu…', '#esf18', '#esc', '#den', '#euro…', '#allaboa…', '#bds', '#евровидение', '#irl', '#ita', '#yolaidaypacho12points', '#cyp', '#por', '#евровидение2018', '#mercy', '#voteisrael', '#eurovisionfinal', '#met…', '#esc18', '#españa', '#hungary', '#eurovisio…', '#freepalestina', '#eurowizja', '#israel.', '#eurovison', '#fin', '#esc2018!', '#nettabarzilai', '#esc2018…', '#eurovision!', '#graciasalfredyamaia', '#am…', '#fortnite', '#france', '#me

                                                                                

In [13]:
word_count = (
    processed # RDD OF TWEET
        .flatMap(lambda x: x.content.split()) # from rdd[Tweet] to rdd[str] 
        .filter(lambda x: x.startswith('#')) # filter for hashtags
        .map(lambda x: (x.lower(), 1)) # from rdd[str] to rdd[str, int]
        .reduceByKey(lambda x, y: x + y) # reduce to same rdd[str, int]
        .sortBy(lambda x: x[1], ascending=False) # sort the output. 
            # Since it's a pair-rdd, select which element of the pair should be used to order 
        #.map(lambda x :x[0]) # RDD OF STRING
)

print(word_count.take(100))



[('#eurovision', 64604), ('#finaleurovision', 2845), ('#esc2018', 1977), ('#eurovision2018', 1618), ('#allaboard', 1372), ('#israel', 1328), ('#escita', 838), ('#eurovision.', 542), ('#isr', 360), ('#eurovision:', 346), ('#gaza', 339), ('#eurocancion', 334), ('#amaiaalfred12points', 317), ('#finaleurovisión', 296), ('#eurovisiongr', 283), ('#eurovision…', 281), ('#eurovision,', 278), ('#finaleu…', 252), ('#e…', 215), ('#eurovi…', 213), ('#eur…', 195), ('#metamoroesc2018', 195), ('#almaia', 188), ('#…', 174), ('#netta', 169), ('#eurovisi…', 168), ('#eurovis…', 162), ('#eurov…', 160), ('#píntalocántalo', 159), ('#metamoro', 158), ('#foustanela', 158), ('#freepalestine', 155), ('#cze', 146), ('#czechrepublic', 144), ('#yolaidaypa…', 143), ('#eurovision?', 142), ('#esp', 138), ('#cyprus', 136), ('#hun', 134), ('#toy', 129), ('#ukr', 126), ('#eu…', 121), ('#esf18', 112), ('#esc', 109), ('#den', 108), ('#euro…', 104), ('#allaboa…', 98), ('#bds', 98), ('#евровидение', 88), ('#irl', 88), ('#it
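
The output above counts `#eurovision`, `#eurovision.`, `#eurovision:` and `#eurovision,` as separate keys, because tokens are split on whitespace only. One possible refinement, shown here as a plain-Python sketch with a hypothetical helper `normalize_hashtag`, is to strip trailing ASCII punctuation before counting. It does not recover hashtags truncated with `…`, which is not an ASCII character.

```python
import string
from collections import Counter

def normalize_hashtag(token: str) -> str:
    """Lower-case a hashtag and strip trailing ASCII punctuation."""
    return token.lower().rstrip(string.punctuation)

# Hypothetical tokens modelled on the variants seen in the output
tokens = ["#Eurovision", "#eurovision.", "#eurovision:", "#Israel", "#israel."]
hashtag_counts = Counter(normalize_hashtag(t) for t in tokens if t.startswith("#"))
print(hashtag_counts["#eurovision"], hashtag_counts["#israel"])  # 3 2
```

In the Spark pipeline this would correspond to using such a helper in the `.map(...)` step instead of `x.lower()`.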

                                                                                

In [14]:
word_count = (
    processed # RDD OF TWEET
        .filter(lambda x: x.language == 'es') # filter for Spanish tweets
        .flatMap(lambda x: x.content.split()) # from rdd[Tweet] to rdd[str] 
        .filter(lambda x: x not in ['el', 'la', 'los']) # drop some stopwords; case-sensitive, since it runs before lower()
        .map(lambda x: (x.lower(), 1)) # from rdd[str] to rdd[str, int]
        .reduceByKey(lambda x, y: x + y) # reduce to same rdd[str, int]
        .sortBy(lambda x: x[1], ascending=False) # sort the output. 
            # Since it's a pair-rdd, select which element of the pair should be used to order 
        #.map(lambda x :x[0]) # RDD OF STRING
)

print(word_count.take(100))



[('rt', 46290), ('#eurovision', 36209), ('de', 24563), ('que', 24539), ('y', 16142), ('a', 16001), ('en', 12437), ('no', 9635), ('un', 8104), ('con', 7897), ('cuando', 6778), ('por', 6320), ('una', 6227), ('es', 6046), ('lo', 5582), ('año', 5342), ('para', 4824), ('me', 4466), ('se', 4382), ('israel', 4369), ('del', 4096), ('ha', 4007), ('te', 3928), ('este', 3770), ('gana', 3744), ('mi', 3701), ('el', 3663), ('pero', 3281), ('la', 3234), ('yo', 3140), ('su', 2762), ('#finaleurovision', 2642), ('las', 2589), ('canción', 2543), ('eurovision', 2479), ('españa', 2475), ('al', 2470), ('si', 2210), ('@manelnmusic:', 2192), ('tu', 2127), ('nos', 1934), ('gallo', 1880), ('como', 1873), ('han', 1833), ('así', 1770), ('esto', 1770), ('amaia', 1763), ('pasado', 1730), ('alfred', 1720), ('-', 1704), ('más', 1697), ('mucho', 1651), ('quedo', 1611), ('le', 1561), ('ganado', 1560), ('último', 1547), ('gallina...', 1501), ('https://t.co/efvxqbb8jp', 1500), ('mejor', 1487), ('ya', 1442), ('qué', 1373)
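
The output above still contains `('el', 3663)` and `('la', 3234)`. A likely explanation is that the stopword filter runs before `lower()`, so capitalised forms such as `El` or `La` pass the filter and are lower-cased afterwards. The plain-Python sketch below, with hypothetical tokens, contrasts the two orderings.

```python
stopwords = {"el", "la", "los"}
tokens = ["El", "gallo", "la", "LA", "gallina", "los"]  # hypothetical tokens

# Filter first, lower-case second: capitalised stopwords slip through
before = [t.lower() for t in tokens if t not in stopwords]

# Lower-case first, filter second: every case variant is removed
lowered = [t.lower() for t in tokens]
after = [t for t in lowered if t not in stopwords]

print(before)  # ['el', 'gallo', 'la', 'gallina']
print(after)   # ['gallo', 'gallina']
```

In the RDD pipeline, the second ordering would mean lower-casing the words before applying the stopword filter.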

                                                                                

# Questions (Home, Moodle)

These questions must be answered over the **entire** collection of Tweets.

In [10]:
rdd = sc.textFile('s3a://mudab-2025-big-data/twitter-data/Eurovision-*.json')

In [4]:
rdd.count()

25/02/11 14:41:15 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties


                                                                                

2583345

In [11]:
from dataclasses import dataclass

@dataclass
class Tweet:
  """Class to model a Tweet"""
  id: int         # The unique ID of a tweet
  content: str    # The textual content of a tweet
  author: str     # The nickname of the author of the tweet
  language: str   # The language of the tweet

In [12]:
import json

def toTweet(line: str): # FROM STRING TO TWEET
      try: #TRY-EXCEPT BLOCK
        parsed = json.loads(line)
        return Tweet(parsed['id'], parsed['text'], parsed['user']['name'], parsed['lang'])
      except Exception as e:
        return None # IF AN EXCEPTION OCCURS

# For small input tests, extract the first 10 values
#small_data = rdd.take(10) # Takes first 10 elements
#small_rdd = sc.parallelize(small_data) # Create an RDD with first 10 elements
# for small tests, use small_rdd instead of rdd below

processed = (rdd  # rdd[string]
    .map(toTweet) # parse string into Tweet. Return rdd[Tweet]
    .filter(lambda x: x is not None) # filter empty values. Return same type (rdd[Tweet])
)

## 1. Find the 100 longest words among the Spanish tweets

Calling `take(100)` on the sorted RDD returns its first 100 elements as a Python list. Since the RDD is sorted by word length in descending order, these are the 100 longest words.

In [7]:
word_count = (
    processed # RDD OF TWEET
        .filter(lambda x: x.language == 'es') # filter for Spanish tweets
        .flatMap(lambda x: x.content.split()) # from rdd[Tweet] to rdd[str]
        .map(lambda x: (x.lower(), len(x))) # from rdd[str] to rdd[str, int]
        .sortBy(lambda x: x[1], ascending=False) # sort by word length in descending order
        .take(100) # take the top 100 longest words
)

print(word_count)



[('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajjajajajajjsajaajajajajajajjajajajajajajsjajajajjajj…', 124), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajjajajajajjsajaajajajajajajjajajajajajajsjajajajjajj…', 124), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajjajajajajjsajaajajajajajajjajajajajajajsjajajajjajj…', 124), ('hungría&gt;moldavia&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;resto', 124), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjjajajajajajajajajajajajajajajajajjjajajajaj…', 122), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjjajajajajajajajajajajajajajajajajjjajajajaj…', 122), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjjajajajajajajajajajajajajajajajajjjajajajaj…', 122), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajaja

                                                                                

Here `take(100)` is called separately, after the sorted RDD has been defined, so `longest_word` remains an RDD and only the 100 longest words are returned to the driver as a list.

In [15]:
longest_word = (
    processed # RDD OF TWEET
        .filter(lambda x: x.language == 'es') # filter for Spanish tweets
        .flatMap(lambda x: x.content.split()) # from rdd[Tweet] to rdd[str]
        .map(lambda x: (x.lower(), len(x))) # from rdd[str] to rdd[str, int]
        .sortBy(lambda x: x[1], ascending=False) # sort by word length in descending order
)

print(longest_word.take(100))



[('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajjajajajajjsajaajajajajajajjajajajajajajsjajajajjajj…', 124), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajjajajajajjsajaajajajajajajjajajajajajajsjajajajjajj…', 124), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjjajajajajajajajajajajajajajajajajjjajajajaj…', 122), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjjajajajajajajajajajajajajajajajajjjajajajaj…', 122), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjjajajajajajajajajajajajajajajajajjjajajajaj…', 122), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajjajajajajjsajaajajajajajajjajajajajajajsjaj…', 116), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj…', 116), ('mooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

                                                                                

Using `top(100, key=lambda x: x[1])` is the most efficient of the three approaches shown here: it avoids sorting the whole RDD and returns a list of the 100 longest words, ordered by word length in descending order.
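
Conceptually, a top-N query only needs the N largest items from each partition, which are then merged, instead of a full sort of all the data. The sketch below illustrates that idea in plain Python with `heapq`; the function `top_n` and the sample partitions are hypothetical and are not Spark's actual code.

```python
import heapq

def top_n(partitions, n, key):
    """Keep the n largest items of each partition, then merge the partial results."""
    partial = [heapq.nlargest(n, part, key=key) for part in partitions]
    merged = [item for part in partial for item in part]
    return heapq.nlargest(n, merged, key=key)

# Hypothetical (word, length) pairs split over two partitions
partitions = [
    [("hola", 4), ("eurovision", 10), ("y", 1)],
    [("canción", 7), ("de", 2), ("ganadora", 8)],
]
longest_two = top_n(partitions, 2, key=lambda x: x[1])
print(longest_two)  # [('eurovision', 10), ('ganadora', 8)]
```

Each partition contributes at most `n` candidates, so the amount of data that has to be merged stays small even when the input is large.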

In [16]:
longest_word = (
    processed # RDD OF TWEET
        .filter(lambda x: x.language == 'es') # filter for Spanish tweets
        .flatMap(lambda x: x.content.split()) # from rdd[Tweet] to rdd[str]
        .map(lambda x: (x.lower(), len(x))) # from rdd[str] to rdd[str, int]
)

print(longest_word.top(100, key=lambda x: x[1]))



[('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajjajajajajjsajaajajajajajajjajajajajajajsjajajajjajj…', 124), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajjajajajajjsajaajajajajajajjajajajajajajsjajajajjajj…', 124), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjjajajajajajajajajajajajajajajajajjjajajajaj…', 122), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjjajajajajajajajajajajajajajajajajjjajajajaj…', 122), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjjajajajajajajajajajajajajajajajajjjajajajaj…', 122), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajjajajajajjsajaajajajajajajjajajajajajajsjaj…', 116), ('jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj…', 116), ('mooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

                                                                                

## 2. Find the 100 longest palindromic words among the English tweets

A palindrome is a word that reads the same from left to right and from right to left.

Examples of English palindromes: `noon`, `level`, etc.
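
The palindrome test itself is a one-line comparison of a word with its reverse. A minimal sketch in plain Python, with a hypothetical helper `is_palindrome` and sample words:

```python
def is_palindrome(word: str) -> bool:
    """Return True if the word reads the same in both directions (case-insensitive)."""
    w = word.lower()
    return w == w[::-1]

words = ["Noon", "level", "Eurovision", "wow"]
palindromes = sorted((w for w in words if is_palindrome(w)), key=len, reverse=True)
print(palindromes)  # ['level', 'Noon', 'wow']
```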

### WITH EMOJIS

In [8]:
english_palindrome = (
    processed # RDD OF TWEET
        .filter(lambda x: x.language == 'en') # filter for English tweets
        .flatMap(lambda x: x.content.split()) # from rdd[Tweet] to rdd[str]
        .map(lambda x: (x.lower(), len(x))) # from rdd[str] to rdd[str, int]
        .filter(lambda x: x[0] == x[0][::-1]) # filter for palindromes
        .sortBy(lambda x: x[1], ascending=False) # sort by word length in descending order
)

print(english_palindrome.take(100))





                                                                                

### WITHOUT EMOJIS

In [17]:
import re

english_palindrome = (
    processed # RDD OF TWEET
        .filter(lambda x: x.language == 'en') # FILTER FOR ENGLISH TWEETS
        # EXCLUDE EMOJIS
        .flatMap(lambda x: re.findall(r'\b\w+\b', re.sub(r'[^\x00-\x7F]+', '', x.content).lower())) # from rdd[Tweet] to rdd[str]
        .map(lambda x: re.sub(r'[^a-z0-9]', '', x)) # REMOVE PUNCTUATION AND SPECIAL CHARACTERS
        .distinct() # REMOVE DUPLICATES
        .map(lambda x: (x, len(x))) # from rdd[str] to rdd[str, int]
        .filter(lambda x: x[0] == x[0][::-1] and len(x[0]) > 1) # FILTER PALINDROMES
        .sortBy(lambda x: x[1], ascending=False) # SORT BY LENGTH, DESCENDING
)

print(english_palindrome.take(100))



[('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 106), ('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 74), ('aaaaaaaaaaaaaaaaaaaaaaaaa', 25), ('aaaaaaaaaaaaaaaaaaaaaa', 22), ('aaaaaaaaaaaaaaaaaaaaa', 21), ('looooooooooooooool', 18), ('eeeeeeeeeeeeeeee', 16), ('oooooooooooooooo', 16), ('zzzzzzzzzzzzzzzz', 16), ('loooooooooooool', 15), ('aaaaaaaaaaaaaaa', 15), ('zzzzzzzzzzzzzz', 14), ('oooooooooooooo', 14), ('aaaaaaaaaaaaaa', 14), ('hahahahahahah', 13), ('lolololololol', 13), ('aaaaaaaaaaaaa', 13), ('ahahahahahaha', 13), ('aaaaaaaaaaaa', 12), ('eeeeeeeeeeee', 12), ('oooooooooooo', 12), ('woooooooooow', 12), ('looooooooool', 12), ('muuuuuuuuum', 11), ('lololololol', 11), ('ahahahahaha', 11), ('lemononomel', 11), ('wwwooooowww', 11), ('mmmmmmmmmmm', 11), ('hahahahahah', 11), ('nerak3karen', 11), ('wooooooooow', 11), ('loooooooool', 11), ('renyapayner', 11), ('wowowowowow', 11), ('aaaaaaaaaaa', 11

                                                                                

- `re.sub(r'[^\x00-\x7F]+', '', ...)` replaces every run of characters outside the ASCII range with an empty string, so emojis and other non-ASCII characters are deleted from the input string.
- `re.findall(r'\b\w+\b', ...)` returns every run of word characters (letters, digits, or underscores) in the resulting string. `\b` matches a word boundary and `\w+` matches one or more word characters.
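
The two regular expressions can be tried on a single string. The sketch below wraps them in a hypothetical helper `tokenize` and applies it to a made-up tweet text.

```python
import re
from typing import List

def tokenize(text: str) -> List[str]:
    """Drop non-ASCII characters, lower-case, and return the runs of word characters."""
    ascii_only = re.sub(r'[^\x00-\x7F]+', '', text)      # removes emojis and any other non-ASCII character
    return re.findall(r'\b\w+\b', ascii_only.lower())    # runs of letters, digits, underscores

tokens = tokenize("WOW!! ❤️ level_up, noon… #Eurovision")
print(tokens)  # ['wow', 'level_up', 'noon', 'eurovision']
```

Note that the ASCII filter also removes accented letters, so a word such as `canción` would become `cancin`. That matters little for English tweets but would distort Spanish text.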

In [18]:
from tabulate import tabulate

results = english_palindrome.take(100)
table = [[word, len(word)] for word, _ in results]
# tablefmt parameter -> specify the table format
print(tabulate(table, headers=['Word', 'Len.'], tablefmt='orgtbl'))



| Word                                                                                                       |   Len. |
|------------------------------------------------------------------------------------------------------------+--------|
| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |    106 |
| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa                                 |     74 |
| aaaaaaaaaaaaaaaaaaaaaaaaa                                                                                  |     25 |
| aaaaaaaaaaaaaaaaaaaaaa                                                                                     |     22 |
| aaaaaaaaaaaaaaaaaaaaa                                                                                      |     21 |
| looooooooooooooool                                                                                         |     18 |
| eeeeeeeeeeeeeeee                      

                                                                                