# Class 3 - Usar ficheros desde S3

## Objetivo

Usar datos reales de gran tamaño almacenados en S3

### Steps

1. subir datos a S3
2. utilizar S3 como fuente de datos de Tweets 
3. contestar preguntas sobre estos datos

# Parte 1: Operaciones en S3

Para praticar las operaciones disponibles en S3 usaremos la consola del Learner Lab. 

## Buckets 

### Crear un bucket 

Para crear el bucket `bucket-de-ejemplo`, el comando es: 

```aws
aws s3api create-bucket -bucket bucket-de-ejemplo
```

Con el comando 
```aws
aws s3api list-buckets  
```

podemos comprobar que el bucket haya sido creado. 

&#x26A0; **Importante!** No pueden existir globalmente en S3 dos buckets con el mismo nombre, aunque sean creados por personas/cuentas diferentes. 

### Eliminar un bucket 

Para eliminar el bucket `bucket-de-ejemplo`, el comando es: 

```aws
aws s3api delete-bucket -bucket bucket-de-ejemplo
```

Con el comando 
```aws
aws s3api list-buckets  
```

podemos comprobar que el bucket haya sido eliminado. 

## Ficheros

### Hacer el listado de ficheros en un directorio 

En S3 los directorios como si no existen, pero cuando dos objetos tienen un prefijo comun separado por `/` este prefijo comun se puede interpretar como un directorio. Por ejemplo, el fichero con key `test/sub/file1` y el fichero con key `test/sub/file2` comparten el mismo prefijo `test/sub`, que podemos interpretar como que en el bucket contiene un directorio `test` que contiene un directorio `sub`, que contiene dos ficheros `file1` y `file2`. 

Por lo tanto, imaginando que estos ficheros sean dentro de un bucket llamado `my-bucket`, podemos pedir por consola de ver el contenido del directorio `test/sub` con el siguiente comando:

```
aws s3 ls s3://my-bucket/test/sub 
```

### Copiar un fichero a S3 

Un fichero `my-file.txt` desde el ordenador se puede copiar a S3 con el siguiente comando:

```
aws s3 cp my-file.txt s3://my-bucket/my-dir/my-file.txt
```

También es posible copiar un fichero de S3 a S3, siempre y cuando tengamos los permisos suficientes para realizar la operación (permisos de lectura del fichero de origen, y permiso de escritura al bucket de destino).

Por ejemplo el siguiente comando: 
```
aws s3 cp s3://my-bucket/file1.txt s3://my-other-bucket/copy/file1
```

copia el fichero `file1.txt` desde el bucket `my-bucket` al bucket `my-other-bucket` con el key `copy/file1`. Esto quiere decir que el fichero `file1` estará automáticamente en un directorio `copy` que se creará sin necesidad de existir previamente. Con el comando 

```
aws s3 ls s3://my-other-bucket/copy/ 
```
confirmaremos que el fichero está en el directorio `copy` bajo el nombre `file1` 

### Otras operaciones sobre ficheros "locales"

Para hacer pruebas, puede ser útil crear ficheros **desde la consola del Learner Lab**. 

El siguiente comando crea un fichero vacío llamado `my-file.txt`

```bash
touch my-file.txt
```

El siguiente comando imprime en pantalla el listado de ficheros en el directorio actual: 

```bash
ls
```

También podemos crear un fichero con un contenido desde la consola: 

```bash
echo '12345' > my-second-file.txt 
```

El contenido del fichero `my-second-file.txt` será el texto `12345`

## Actividad 1 (en clase)
**Usando la consola del Learner Lab**, ejecuta lo siguiente:

- Crea un bucket llamado `mudab-ixxxxx` donde `ixxxxx` es tu número identificativo de estudiante
- Crea un fichero local con el contenido `abcde` y nombre `file1.txt`
- Copia el fichero en tu bucket, bajo el nombre `test/consola/file1`
- Copia de nuevo el fichero, bajo el nombre `test/consola/file2`
- Comprueba el contenido de tu bucket bajo la key `test/consola/`, deberían aparecer 2 ficheros
- Borra el fichero `file2`
- Comprueba el contenido de tu bucket bajo la key `test/consola/`, debería aparecer 1 fichero

### Descarga en tu ordenador  
Si quieres tener los ficheros en tu ordenador para hacer pruebas por tu cuenta, pues los objetos públicos de S3 tiene un URL desde donde descargar el objeto. La versión comprimida de los tweet está disponible aquí: https://mudab-2025-big-data.s3.us-east-1.amazonaws.com/twitter-data-compressed/twitter-data-from-eurovision-2018-splits.zip

# Parte 2: S3 de manera programática 

En esta parte, añadiremos el acceso a S3 como sistema de ficheros a través de una librería llamada [s3fs](https://s3fs.readthedocs.io/en/latest/). Esta librería nos permite acceder a S3 como si fuera un sistema de ficheros local. 

Aunque el acceso a S3 de manera programatica requiera credenciales, la ejecución en entorno SageMaker nos permite omitir este paso de configuración, y tener un código más sencillo. 



In [None]:
import s3fs
s3 = s3fs.S3FileSystem(anon=False)

bucket='mudab-2025-big-data'
data_key = 'twitter-data/Eurovision-00.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

### Operaciones sobre ficheros con la librería s3fs 

La librería s3fs es un ejemplo de acceso programático a los servicios de AWS. En este caso, nos permite acceder a S3 y hacer operaciones utilizando Python. Las operaciones sobre ficheros explicadas arriba se traducen en pequeños *snippets* de código.

#### Hacer el listado de ficheros en un directorio

```python
s3.ls("my-bucket/test/sub")
```

#### Copiar un fichero a S3

```python
s3.put("my-file.txt", "s3://my-bucket/my-dir/my-file.txt")
```

#### Borrar un fichero desde S3 

```python
s3.rm("my-bucket/my-dir/my-file.txt")
```

## Actividad 2 (en clase)

Primero, crea en tu ordenador un fichero local con el contenido que quieras, y nombre `f1.txt`, luego cargalo a tu entorno Jupyter a través del icono de upload &#x2B06;. Usaremos el mismo bucket creado en la actividad anterior

Luego, **Usando el entorno de Jupyter**, ejecuta lo siguiente:

- Copia el fichero en tu bucket, bajo el nombre `test/jupyter/file1`
- Copia de nuevo el fichero, bajo el nombre `test/jupyter/file2`
- Comprueba el contenido de tu bucket bajo la key `test/jupyter/`, deberían aparecer 2 ficheros
- Borra el fichero `file2`
- Comprueba el contenido de tu bucket bajo la key `test/jupyter/`, debería aparecer 1 fichero

In [1]:
import s3fs
s3 = s3fs.S3FileSystem(anon=False) # CREATE CLIENT

s3.put('mini_input.txt', 's3://mudab-2025-pc1262057/dir3/') # COPIES LOCAL FILE 

In [2]:
s3.ls('s3://mudab-2025-pc1262057/dir3/')

['mudab-2025-pc1262057/dir3/mini_input.txt']

In [3]:
s3.put('mini_input.txt', 's3://mudab-2025-pc1262057/dir3/mini_input2.txt')
s3.ls('s3://mudab-2025-pc1262057/dir3/')

['mudab-2025-pc1262057/dir3/mini_input.txt',
 'mudab-2025-pc1262057/dir3/mini_input2.txt']

In [4]:
s3.rm('s3://mudab-2025-pc1262057/dir3/mini_input2.txt') # ERASE
s3.ls('s3://mudab-2025-pc1262057/dir3/')

['mudab-2025-pc1262057/dir3/mini_input.txt']

# Parte 3: Big data desde S3 

En esta parte, intentaremos procesar la colección entera de Tweets que reside en S3. Los pasos a seguir serán los siguientes: 

1. Desde la consola del Learner Lab, copiar la colección entera desde el bucket de origen (`mudab-2025-big-data`) bajo el path `twitter-data`. Los ficheros se denominan `Eurovision-XX.json` donde `XX` es un número entre `00` y `09`. Para hacer una copia recursiva de ficheros se puede utilizar el comando de copia de una manera más sofisticada utilizando la opción `--recursive`. 
```bash
aws s3 cp —recursive s3://mudab-2023/twitter-data/ s3://mudab2023-i123456/input/
```

2. Utilizar el mísmo código de la versión anterior para contestar a las siguientes preguntas: 
- Cuántos Tweeets son en español? 
- Cual son las 100 palabras más frecuentes en español? 

## Modelo de Tweets

Similar a la clase anterior, creamos un modelo de Tweet 

In [5]:
from dataclasses import dataclass

@dataclass
class Tweet:
  """Class to model a Tweet"""
  id: int         # The unique ID of a tweet
  content: str    # The textual content of a tweet
  author: str     # The nickname of the author of the tweet
  language: str   # The language of the tweet


## Procesamiento de input (ETL) 

Similar a la clase anterior, almacenamos todos los tweets en memoria. 

**Pregunta**: funcionarà esto para nuestro input completo? Que hacer si así no es? 

In [7]:
import json, dataclasses

tweets = []

bucket='mudab-2025-pc1262057'
data_key = 'input/Eurovision-00.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

def parse_line(line: str):
  """Try to parse a string into a Person"""
  error = 0
  try:
    parsed = json.loads(line)
    return Tweet(parsed['id'], parsed['text'], parsed['user']['name'], parsed['lang'])
  except Exception as e:
    error += 1    
#   print(f"Error parsing '{line}': {e}")

with s3.open(data_location) as input: #CLIENT S3. CAN READ THE FILE LINE BY LINE
  for line in input.readlines():
    if len(line.strip()) > 0:
      tweet = parse_line(line)
      if tweet: # We add only if the tweet is not 'None'
         tweets.append(tweet)

for modeled_tweet in tweets[0:10]:
  print(modeled_tweet)

Tweet(id=995443356309311493, content='RT @jk_rowling: France ❤️ #Eurovision', author='Liz', language='fr')
Tweet(id=995443356602982401, content='RT @neltropico: Salvador Sobral entregándole el premio a Israel #Eurovision https://t.co/sApzlSoPMb', author='Marina', language='es')
Tweet(id=995443356552527873, content='RT @jungjaeguns: cuando apago la luz del pasillo para irme a mi habitación y que no me maten los espíritus #Eurovision https://t.co/0naU3Xm…', author='Rosalinda ♥', language='es')
Tweet(id=995443356921720835, content='RT @itsbrookewar: Enrique amigooo!!! \n#Eurovision https://t.co/HTbbMUwEHd', author='Laura Pérez🐒', language='es')
Tweet(id=995443356825202690, content='RT @snugglycamila: i’m not surprised eurovision is confusing for americans since the concept of the person with the most votes actually win…', author='Maggi', language='en')
Tweet(id=995443357055967233, content='RT @JESSthir: Fuegos artificiales para celebrarlo vamooooooos #eurovision https://t.co/2hwMsyRVAZ', 

# Procesamiento de datos

In [8]:
import json, dataclasses

def read_clean_tweets(input: str):
  tweets = []
  with open(input, 'r') as f:
    lines = f.readlines()
  for line in lines:
    parsed = json.loads(line)
    tweet = Tweet(**parsed)
    tweets.append(tweet)
  return tweets

def count_tweets(language: str, tweets: list[Tweet]):
  count = 0
  for tweet in tweets:
    if (tweet.language == language):
      count = count + 1
  return count

def most_frequent_word(tweets: list[Tweet]):
  count = {}
  for tweet in tweets:
    words = tweet.content.split(' ')
    for word in words:
      if (word in count):
        new_val = count[word] + 1
        count[word] = new_val
      else:
        count[word] = 1
  return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

spanish_tweets_count = count_tweets('es', tweets)

words_by_frequence = list(most_frequent_word(tweets).items())[0:100]

print(spanish_tweets_count)
print(words_by_frequence)

50874
[('RT', 85512), ('#Eurovision', 45318), ('de', 26529), ('a', 23982), ('que', 23961), ('la', 16756), ('y', 14531), ('the', 14414), ('el', 13689), ('en', 11688), ('un', 8159), ('to', 7662), ('Israel', 7521), ('', 7442), ('no', 7327), ('con', 7203), ('una', 6211), ('is', 6209), ('los', 6100), ('#eurovision', 6040), ('por', 5799), ('and', 5562), ('in', 5386), ('año', 5282), ('Eurovision', 5220), ('of', 5200), ('es', 4987), ('for', 4565), ('para', 4427), ('Cuando', 4321), ('lo', 4291), ('se', 4120), ('me', 4087), ('del', 4041), ('ha', 3972), ('I', 3783), ('te', 3769), ('eurovision', 3622), ('gana', 3565), ('you', 3377), ('este', 3324), ('mi', 2886), ('this', 2870), ('A', 2854), ('that', 2836), ('#EUROVISION', 2822), ('su', 2749), ('o', 2742), ('#FinalEurovision', 2672), ('pero', 2548), ('canción', 2362), ('e', 2353), ('QUE', 2324), ('on', 2307), ('España', 2286), ('it', 2281), ('El', 2263), ('le', 2223), ('was', 2222), ('-', 2222), ('al', 2211), ('@ManelNMusic:', 2192), ('las', 2171),

# Question 3 (Home, Moodle)

Los ejemplos de arriba se han hecho con *un solo* fichero, pero las siguientes preguntas aplican a **TODA** la colección (el conjunto de 10 ficheros). 

- 3.1. Cuántos tweets en español son originales (es decir, no son retweets)? 
- 3.2. Cual es el porcentaje de tweets para cada lenguaje? Es decir, de toda la colleccion, XX% son en idioma YY, ZZ% son en idioma WW, etc...
- 3.3. Cuáles son las palabras más frecuentse en castellano? 

Añade tu respuesta en un bloque de Jupyter aquí abajo 

In [45]:
from dataclasses import dataclass

@dataclass
class Tweet:
  """Class to model a Tweet"""
  id: int         # The unique ID of a tweet
  content: str    # The textual content of a tweet
  author: str     # The nickname of the author of the tweet
  language: str   # The language of the tweet
  retweeted_status: bool 

# 1:

In [46]:
import s3fs
s3 = s3fs.S3FileSystem(anon=False)

bucket='mudab-2025-big-data'
data_key = 'twitter-data/Eurovision-01.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

In [47]:
import json, dataclasses

tweets = []

bucket='mudab-2025-pc1262057'
data_key = 'input/Eurovision-01.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

def parse_line(line: str):
  """Try to parse a string into a Person"""
  error = 0
  try:
    parsed = json.loads(line)
    return Tweet(parsed['id'], parsed['text'], parsed['user']['name'], parsed['lang'],parsed.get('retweeted_status', False))
  except Exception as e:
    error += 1    
#   print(f"Error parsing '{line}': {e}")

with s3.open(data_location) as input: #CLIENT S3. CAN READ THE FILE LINE BY LINE
  for line in input.readlines():
    if len(line.strip()) > 0:
      tweet = parse_line(line)
      if tweet: # We add only if the tweet is not 'None'
         tweets.append(tweet)

for modeled_tweet in tweets[0:10]:
  print(modeled_tweet)

Tweet(id=995452386658410496, content='RT @Uraa_: El amor nunca gana #Eurovision https://t.co/p9ZLqpz9eg', author='Monana Geller', language='es', retweeted_status={'created_at': 'Sat May 12 22:42:34 +0000 2018', 'id': 995434084619911168, 'id_str': '995434084619911168', 'text': 'El amor nunca gana #Eurovision https://t.co/p9ZLqpz9eg', 'display_text_range': [0, 30], 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 138489714, 'id_str': '138489714', 'name': 'Bimbolles', 'screen_name': 'Uraa_', 'location': 'Palma', 'url': None, 'description': 'No sé decir la erre lo cual me ha impedido presentar el telediario, luego reinar', 'translator_type': 'regular', 'protected': False, 'verified': False, 'followers_count': 3835, 'friends_count': 316, 'listed_coun

In [72]:
import json, dataclasses
def read_clean_tweets(input: str):
    tweets = []
    with open(input, 'r') as f:
        lines = f.readlines()
    for line in lines:
        parsed = json.loads(line)
        tweet = Tweet(**parsed)
        tweets.append(tweet)
    return tweets

def count_tweets(language: str, tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == language:
            count += 1
    return count 
    

def most_frequent_word(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        words = tweet.content.split(' ')
        for word in words:
            if word in count:
                count[word] += 1
            else:
                count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def most_frequent_words_spanish(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        if tweet.language == 'es':
            words = tweet.content.split(' ')
            for word in words:
                if word in count:
                    count[word] += 1
                else:
                    count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def get_total_tweets_count(tweets: list[Tweet]):
    return len(tweets)

def count_original_tweets(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.retweeted_status == False:
            count += 1
    return count

def count_original_tweets_spanish(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == 'es' and tweet.retweeted_status == False:
            count += 1
    return count


tweets = read_clean_tweets('clean-dataset')

english_tweets_count = count_tweets('en', tweets)
spanish_tweets_count = count_tweets('es', tweets)
catalan_tweets_count = count_tweets('ca', tweets)
galician_tweets_count = count_tweets('gl', tweets)
basque_tweets_count = count_tweets('eu', tweets)
portuguese_tweets_count = count_tweets('pt', tweets)
french_tweets_count = count_tweets('fr', tweets)
italian_tweets_count = count_tweets('it', tweets)
romanian_tweets_count = count_tweets('ro', tweets)
occitan_tweets_count = count_tweets('oc', tweets) 
corsican_tweets_count = count_tweets('co', tweets) 
breton_tweets_count = count_tweets('br', tweets) 
luxembourgish_tweets_count = count_tweets('lb', tweets)
greek_tweets_count = count_tweets('el', tweets)
turkish_tweets_count = count_tweets('tr', tweets)
german_tweets_count = count_tweets('de', tweets)
swedish_tweets_count = count_tweets('sv', tweets)
norwegian_tweets_count = count_tweets('no', tweets)
danish_tweets_count = count_tweets('da', tweets) 
finnish_tweets_count = count_tweets('fi', tweets)
dutch_tweets_count = count_tweets('nl', tweets)
ukrainian_tweets_count = count_tweets('uk', tweets) 
polish_tweets_count = count_tweets('pl', tweets) 
czech_tweets_count = count_tweets('cs', tweets) 
hungarian_tweets_count = count_tweets('hu', tweets) 
bulgarian_tweets_count = count_tweets('bg', tweets) 
albanian_tweets_count = count_tweets('sq', tweets)
bosnian_tweets_count = count_tweets('bs', tweets)
icelandic_tweets_count = count_tweets('is', tweets)
estonian_tweets_count = count_tweets('et', tweets)
maltese_tweets_count = count_tweets('mt', tweets)
montenegrin_tweets_count = count_tweets('me', tweets)
macedonian_tweets_count = count_tweets('mk', tweets)
azerbaijani_tweets_count = count_tweets('az', tweets)
lithuanian_tweets_count = count_tweets('lt', tweets)
latvian_tweets_count = count_tweets('lv', tweets)
armenian_tweets_count = count_tweets('hy', tweets)
georgian_tweets_count = count_tweets('ka', tweets)
serbian_tweets_count = count_tweets('sr', tweets)
croatian_tweets_count = count_tweets('hr', tweets)
slovenian_tweets_count = count_tweets('sl', tweets)
slovak_tweets_count = count_tweets('sk', tweets)
russian_tweets_count = count_tweets('ru', tweets) 
belarusian_tweets_count = count_tweets('be', tweets) 
hebrew_tweets_count = count_tweets('he', tweets)
unknown_tweets_count = count_tweets('und', tweets)


# PERCENTAGE FOR EACH LANGUAGE
total_tweets = get_total_tweets_count(tweets)
english_tweets_percentage = (english_tweets_count / total_tweets) * 100
spanish_tweets_percentage = (spanish_tweets_count / total_tweets) * 100
catalan_tweets_percentage = (catalan_tweets_count / total_tweets) * 100
galician_tweets_percentage = (galician_tweets_count / total_tweets) * 100
basque_tweets_percentage = (basque_tweets_count / total_tweets) * 100
portuguese_tweets_percentage = (portuguese_tweets_count / total_tweets) * 100
french_tweets_percentage = (french_tweets_count / total_tweets) * 100
italian_tweets_percentage = (italian_tweets_count / total_tweets) * 100
romanian_tweets_percentage = (romanian_tweets_count / total_tweets) * 100
occitan_tweets_percentage = (occitan_tweets_count / total_tweets) * 100
corsican_tweets_percentage = (corsican_tweets_count / total_tweets) * 100
breton_tweets_percentage = (breton_tweets_count / total_tweets) * 100
luxembourgish_tweets_percentage = (luxembourgish_tweets_count / total_tweets) * 100
greek_tweets_percentage = (greek_tweets_count / total_tweets) * 100
turkish_tweets_percentage = (turkish_tweets_count / total_tweets) * 100
german_tweets_percentage = (german_tweets_count / total_tweets) * 100
swedish_tweets_percentage = (swedish_tweets_count / total_tweets) * 100
norwegian_tweets_percentage = (norwegian_tweets_count / total_tweets) * 100
danish_tweets_percentage = (danish_tweets_count / total_tweets) * 100 
finnish_tweets_percentage = (finnish_tweets_count / total_tweets) * 100 
dutch_tweets_percentage = (dutch_tweets_count / total_tweets) * 100
ukrainian_tweets_percentage = (ukrainian_tweets_count / total_tweets) * 100
polish_tweets_percentage = (polish_tweets_count / total_tweets) * 100
czech_tweets_percentage = (czech_tweets_count / total_tweets) * 100
hungarian_tweets_percentage = (hungarian_tweets_count / total_tweets) * 100
bulgarian_tweets_percentage = (bulgarian_tweets_count / total_tweets) * 100
albanian_tweets_percentage = (albanian_tweets_count / total_tweets) * 100
bosnian_tweets_percentage = (bosnian_tweets_count / total_tweets) * 100
icelandic_tweets_percentage = (icelandic_tweets_count / total_tweets) * 100
estonian_tweets_percentage = (estonian_tweets_count / total_tweets) * 100
maltese_tweets_percentage = (maltese_tweets_count / total_tweets) * 100
montenegrin_tweets_percentage = (montenegrin_tweets_count / total_tweets) * 100
macedonian_tweets_percentage = (macedonian_tweets_count / total_tweets) * 100
azerbaijani_tweets_percentage = (azerbaijani_tweets_count / total_tweets) * 100
lithuanian_tweets_percentage = (lithuanian_tweets_count / total_tweets) * 100
latvian_tweets_percentage = (latvian_tweets_count / total_tweets) * 100
armenian_tweets_percentage = (armenian_tweets_count / total_tweets) * 100
georgian_tweets_percentage = (georgian_tweets_count / total_tweets) * 100
serbian_tweets_percentage = (serbian_tweets_count / total_tweets) * 100
croatian_tweets_percentage = (croatian_tweets_count / total_tweets) * 100
slovenian_tweets_percentage = (slovenian_tweets_count / total_tweets) * 100
slovak_tweets_percentage = (slovak_tweets_count / total_tweets) * 100
russian_tweets_percentage = (russian_tweets_count / total_tweets) * 100
belarusian_tweets_percentage = (belarusian_tweets_count / total_tweets) * 100
hebrew_tweets_percentage = (hebrew_tweets_count / total_tweets) * 100 
unknown_tweets_percentage = (unknown_tweets_count / total_tweets) * 100

tweets_count = get_total_tweets_count(tweets)
original_tweets_count = count_original_tweets(tweets)
words_by_frequence = list(most_frequent_word(tweets).items())[0:100]
spanish_tweets_count = count_original_tweets_spanish(tweets)
spanish_words_by_frequence = list(most_frequent_words_spanish(tweets).items())[0:100]

print(f"Number of English tweets: {english_tweets_count}")  #ENGLISH
print(f"Percentage of English tweets: {english_tweets_percentage:.2f}%")
print(f"Number of Spanish tweets: {spanish_tweets_count}") #SPANISH
print(f"Percentage of Spanish tweets: {spanish_tweets_percentage:.2f}%")
print(f"Number of Catalan tweets: {catalan_tweets_count}") #CATALAN
print(f"Percentage of Catalan tweets: {catalan_tweets_percentage:.2f}%")
print(f"Number of Galician tweets: {galician_tweets_count}") #GALICIAN
print(f"Percentage of Galician tweets: {galician_tweets_percentage:.2f}%")
print(f"Number of Basque tweets: {basque_tweets_count}") #BASQUE
print(f"Percentage of Basque tweets: {basque_tweets_percentage:.2f}%")
print(f"Number of Portuguese tweets: {portuguese_tweets_count}") #PORTUGUESE
print(f"Percentage of Portuguese tweets: {portuguese_tweets_percentage:.2f}%")
print(f"Number of French tweets: {french_tweets_count}") #FRENCH
print(f"Percentage of French tweets: {french_tweets_percentage:.2f}%") 
print(f"Number of Italian tweets: {italian_tweets_count}") #ITALIAN
print(f"Percentage of Italian tweets: {italian_tweets_percentage:.2f}%")
print(f"Number of Romanian tweets: {romanian_tweets_count}") #ROMANIAN
print(f"Percentage of Romanian tweets: {romanian_tweets_percentage:.2f}%")
print(f"Number of Occitan tweets: {occitan_tweets_count}") #OCCITAN
print(f"Percentage of Occitan tweets: {occitan_tweets_percentage:.2f}%")
print(f"Number of Corsican tweets: {corsican_tweets_count}") #CORSICAN
print(f"Percentage of Corsican tweets: {corsican_tweets_percentage:.2f}%")
print(f"Number of Breton tweets: {breton_tweets_count}") #BRETON
print(f"Percentage of Breton tweets: {breton_tweets_percentage:.2f}%")
print(f"Number of Luxembourgish tweets: {luxembourgish_tweets_count}") #LUXEMBOURGISH
print(f"Percentage of Luxembourgish tweets: {luxembourgish_tweets_percentage:.2f}%")
print(f"Number of Greek tweets: {greek_tweets_count}") #GREEK
print(f"Percentage of Greek tweets: {greek_tweets_percentage:.2f}%")
print(f"Number of Turkish tweets: {turkish_tweets_count}") #TURKISH
print(f"Percentage of Turkish tweets: {turkish_tweets_percentage:.2f}%")
print(f"Number of German tweets: {german_tweets_count}") #GERMAN
print(f"Percentage of German tweets: {german_tweets_percentage:.2f}%")
print(f"Number of Swedish tweets: {swedish_tweets_count}") #SWEDISH
print(f"Percentage of Swedish tweets: {swedish_tweets_percentage:.2f}%") 
print(f"Number of Norwegian tweets: {norwegian_tweets_count}") #NORWEGIAN
print(f"Percentage of Norwegian tweets: {norwegian_tweets_percentage:.2f}%") 
print(f"Number of Danish tweets: {danish_tweets_count}") #DANISH
print(f"Percentage of Danish tweets: {danish_tweets_percentage:.2f}%")
print(f"Number of Finnish tweets: {finnish_tweets_count}") #FINNISH
print(f"Percentage of Finnish tweets: {finnish_tweets_percentage:.2f}%")
print(f"Number of Dutch tweets: {dutch_tweets_count}") #DUTCH
print(f"Percentage of Dutch tweets: {dutch_tweets_percentage:.2f}%")
print(f"Number of Ukrainian tweets: {ukrainian_tweets_count}") #UKRAINIAN
print(f"Percentage of Ukrainian tweets: {ukrainian_tweets_percentage:.2f}%")
print(f"Number of Polish tweets: {polish_tweets_count}") #POLISH 
print(f"Percentage of Polish tweets: {polish_tweets_percentage:.2f}%")
print(f"Number of Czech tweets: {czech_tweets_count}") #CZECH
print(f"Percentage of Czech tweets: {czech_tweets_percentage:.2f}%")
print(f"Number of Hungarian tweets: {hungarian_tweets_count}") #HUNGARIAN
print(f"Percentage of Hungarian tweets: {hungarian_tweets_percentage:.2f}%")
print(f"Number of Bulgarian tweets: {bulgarian_tweets_count}") #BULGARIAN
print(f"Percentage of Bulgarian tweets: {bulgarian_tweets_percentage:.2f}%")
print(f"Number of Albanian tweets: {albanian_tweets_count}") #ALBANIAN
print(f"Percentage of Albanian tweets: {albanian_tweets_percentage:.2f}%")
print(f"Number of Bosnian tweets: {bosnian_tweets_count}") #BOSNIAN
print(f"Percentage of Bosnian tweets: {bosnian_tweets_percentage:.2f}%")
print(f"Number of Icelandic tweets: {icelandic_tweets_count}") #ICELANDIC
print(f"Percentage of Icelandic tweets: {icelandic_tweets_percentage:.2f}%")
print(f"Number of Estonian tweets: {estonian_tweets_count}") #ESTONIAN
print(f"Percentage of Estonian tweets: {estonian_tweets_percentage:.2f}%")
print(f"Number of Maltese tweets: {maltese_tweets_count}") #MALTESE
print(f"Percentage of Maltese tweets: {maltese_tweets_percentage:.2f}%")
print(f"Number of Montenegrin tweets: {montenegrin_tweets_count}") #MONTENEGRIN
print(f"Percentage of Montenegrin tweets: {montenegrin_tweets_percentage:.2f}%")
print(f"Number of Macedonian tweets: {macedonian_tweets_count}") #MACEDONIAN
print(f"Percentage of Macedonian tweets: {macedonian_tweets_percentage:.2f}%")
print(f"Number of Azerbaijani tweets: {azerbaijani_tweets_count}") #AZERBAIJANI
print(f"Percentage of Azerbaijani tweets: {azerbaijani_tweets_percentage:.2f}%")
print(f"Number of Lithuanian tweets: {lithuanian_tweets_count}") #LITHUANIAN
print(f"Percentage of Lithuanian tweets: {lithuanian_tweets_percentage:.2f}%")
print(f"Number of Latvian tweets: {latvian_tweets_count}") #LATVIAN
print(f"Percentage of Latvian tweets: {latvian_tweets_percentage:.2f}%")
print(f"Number of Armenian tweets: {armenian_tweets_count}") #ARMENIAN
print(f"Percentage of Armenian tweets: {armenian_tweets_percentage:.2f}%")
print(f"Number of Georgian tweets: {georgian_tweets_count}") #GEORGIAN
print(f"Percentage of Georgian tweets: {georgian_tweets_percentage:.2f}%")
print(f"Number of Serbian tweets: {serbian_tweets_count}") #SERBIAN
print(f"Percentage of Serbian tweets: {serbian_tweets_percentage:.2f}%")
print(f"Number of Croatian tweets: {croatian_tweets_count}") #CROATIAN
print(f"Percentage of Croatian tweets: {croatian_tweets_percentage:.2f}%")
print(f"Number of Slovenian tweets: {slovenian_tweets_count}") #SLOVENIAN
print(f"Percentage of Slovenian tweets: {slovenian_tweets_percentage:.2f}%")
print(f"Number of Slovak tweets: {slovak_tweets_count}") #SLOVAK
print(f"Percentage of Slovak tweets: {slovak_tweets_percentage:.2f}%") 
print(f"Number of Russian tweets: {russian_tweets_count}") #RUSSIAN
print(f"Percentage of Russian tweets: {russian_tweets_percentage:.2f}%")
print(f"Number of Belarusian tweets: {belarusian_tweets_count}") #BELARUSIAN
print(f"Percentage of Belarusian tweets: {belarusian_tweets_percentage:.2f}%")
print(f"Number of Hebrew tweets: {hebrew_tweets_count}") #HEBREW
print(f"Percentage of Hebrew tweets: {hebrew_tweets_percentage:.2f}%")
print(f"Number of Unknown tweets: {unknown_tweets_count}") #UNKNOWN
print(f"Percentage of Unknown tweets: {unknown_tweets_percentage:.2f}%")

print("----------------------------------")
print(f"Number of tweets: {tweets_count}")
print(f"Number of original tweets: {original_tweets_count}")
print(f"Number of original Spanish tweets: {spanish_tweets_count}")
print("-----------------------------------")
print(f"Frequent words: {words_by_frequence}\n")
print(f"Most frequent Spanish words: {spanish_words_by_frequence}\n")
print(tweets[0:2])

Number of English tweets: 491
Percentage of English tweets: 49.10%
Number of Spanish tweets: 94
Percentage of Spanish tweets: 23.60%
Number of Catalan tweets: 5
Percentage of Catalan tweets: 0.50%
Number of Galician tweets: 0
Percentage of Galician tweets: 0.00%
Number of Basque tweets: 1
Percentage of Basque tweets: 0.10%
Number of Portuguese tweets: 6
Percentage of Portuguese tweets: 0.60%
Number of French tweets: 96
Percentage of French tweets: 9.60%
Number of Italian tweets: 30
Percentage of Italian tweets: 3.00%
Number of Romanian tweets: 1
Percentage of Romanian tweets: 0.10%
Number of Occitan tweets: 0
Percentage of Occitan tweets: 0.00%
Number of Corsican tweets: 0
Percentage of Corsican tweets: 0.00%
Number of Breton tweets: 0
Percentage of Breton tweets: 0.00%
Number of Luxembourgish tweets: 0
Percentage of Luxembourgish tweets: 0.00%
Number of Greek tweets: 7
Percentage of Greek tweets: 0.70%
Number of Turkish tweets: 21
Percentage of Turkish tweets: 2.10%
Number of German t

# 2:

In [49]:
import s3fs
s3 = s3fs.S3FileSystem(anon=False)

bucket='mudab-2025-big-data'
data_key = 'twitter-data/Eurovision-02.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

In [50]:
import json, dataclasses

tweets = []

bucket='mudab-2025-pc1262057'
data_key = 'input/Eurovision-02.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

def parse_line(line: str):
  """Try to parse a string into a Person"""
  error = 0
  try:
    parsed = json.loads(line)
    return Tweet(parsed['id'], parsed['text'], parsed['user']['name'], parsed['lang'],parsed.get('retweeted_status', False))
  except Exception as e:
    error += 1    
#   print(f"Error parsing '{line}': {e}")

with s3.open(data_location) as input: #CLIENT S3. CAN READ THE FILE LINE BY LINE
  for line in input.readlines():
    if len(line.strip()) > 0:
      tweet = parse_line(line)
      if tweet: # We add only if the tweet is not 'None'
         tweets.append(tweet)

for modeled_tweet in tweets[0:10]:
  print(modeled_tweet)

Tweet(id=995463126647631872, content='Wtf, #Eurovision https://t.co/J5zdXNOZR5', author='Jonathan #TUCAM', language='und', retweeted_status=False)
Tweet(id=995463127876624384, content='RT @Uznare: eurovision rules https://t.co/I8cG3D5tCh', author='Ospreys', language='en', retweeted_status={'created_at': 'Sat May 12 19:13:51 +0000 2018', 'id': 995381560277979136, 'id_str': '995381560277979136', 'text': 'eurovision rules https://t.co/I8cG3D5tCh', 'display_text_range': [0, 16], 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 29056256, 'id_str': '29056256', 'name': 'ウズ@ころあず最高', 'screen_name': 'Uznare', 'location': 'Hyperliterate México', 'url': 'http://outerheaven.xyz', 'description': '田所あずさが大好きです', 'translator_type': 'none', 'protected': False, 'verified': False, '

In [63]:
import json, dataclasses
def read_clean_tweets(input: str):
    tweets = []
    with open(input, 'r') as f:
        lines = f.readlines()
    for line in lines:
        parsed = json.loads(line)
        tweet = Tweet(**parsed)
        tweets.append(tweet)
    return tweets

def count_tweets(language: str, tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == language:
            count += 1
    return count 
    

def most_frequent_word(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        words = tweet.content.split(' ')
        for word in words:
            if word in count:
                count[word] += 1
            else:
                count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def most_frequent_words_spanish(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        if tweet.language == 'es':
            words = tweet.content.split(' ')
            for word in words:
                if word in count:
                    count[word] += 1
                else:
                    count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def get_total_tweets_count(tweets: list[Tweet]):
    return len(tweets)

def count_original_tweets(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.retweeted_status == False:
            count += 1
    return count

def count_original_tweets_spanish(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == 'es' and tweet.retweeted_status == False:
            count += 1
    return count


tweets = read_clean_tweets('clean-dataset')

english_tweets_count = count_tweets('en', tweets)
spanish_tweets_count = count_tweets('es', tweets)
catalan_tweets_count = count_tweets('ca', tweets)
galician_tweets_count = count_tweets('gl', tweets)
basque_tweets_count = count_tweets('eu', tweets)
portuguese_tweets_count = count_tweets('pt', tweets)
french_tweets_count = count_tweets('fr', tweets)
italian_tweets_count = count_tweets('it', tweets)
romanian_tweets_count = count_tweets('ro', tweets)
occitan_tweets_count = count_tweets('oc', tweets) 
corsican_tweets_count = count_tweets('co', tweets) 
breton_tweets_count = count_tweets('br', tweets) 
luxembourgish_tweets_count = count_tweets('lb', tweets)
greek_tweets_count = count_tweets('el', tweets)
turkish_tweets_count = count_tweets('tr', tweets)
german_tweets_count = count_tweets('de', tweets)
swedish_tweets_count = count_tweets('sv', tweets)
norwegian_tweets_count = count_tweets('no', tweets)
danish_tweets_count = count_tweets('da', tweets) 
finnish_tweets_count = count_tweets('fi', tweets)
dutch_tweets_count = count_tweets('nl', tweets)
ukrainian_tweets_count = count_tweets('uk', tweets) 
polish_tweets_count = count_tweets('pl', tweets) 
czech_tweets_count = count_tweets('cs', tweets) 
hungarian_tweets_count = count_tweets('hu', tweets) 
bulgarian_tweets_count = count_tweets('bg', tweets) 
albanian_tweets_count = count_tweets('sq', tweets)
bosnian_tweets_count = count_tweets('bs', tweets)
icelandic_tweets_count = count_tweets('is', tweets)
estonian_tweets_count = count_tweets('et', tweets)
maltese_tweets_count = count_tweets('mt', tweets)
montenegrin_tweets_count = count_tweets('me', tweets)
macedonian_tweets_count = count_tweets('mk', tweets)
azerbaijani_tweets_count = count_tweets('az', tweets)
lithuanian_tweets_count = count_tweets('lt', tweets)
latvian_tweets_count = count_tweets('lv', tweets)
armenian_tweets_count = count_tweets('hy', tweets)
georgian_tweets_count = count_tweets('ka', tweets)
serbian_tweets_count = count_tweets('sr', tweets)
croatian_tweets_count = count_tweets('hr', tweets)
slovenian_tweets_count = count_tweets('sl', tweets)
slovak_tweets_count = count_tweets('sk', tweets)
russian_tweets_count = count_tweets('ru', tweets) 
belarusian_tweets_count = count_tweets('be', tweets) 
hebrew_tweets_count = count_tweets('he', tweets)
unknown_tweets_count = count_tweets('und', tweets)


# PERCENTAGE FOR EACH LANGUAGE
total_tweets = get_total_tweets_count(tweets)
english_tweets_percentage = (english_tweets_count / total_tweets) * 100
spanish_tweets_percentage = (spanish_tweets_count / total_tweets) * 100
catalan_tweets_percentage = (catalan_tweets_count / total_tweets) * 100
galician_tweets_percentage = (galician_tweets_count / total_tweets) * 100
basque_tweets_percentage = (basque_tweets_count / total_tweets) * 100
portuguese_tweets_percentage = (portuguese_tweets_count / total_tweets) * 100
french_tweets_percentage = (french_tweets_count / total_tweets) * 100
italian_tweets_percentage = (italian_tweets_count / total_tweets) * 100
romanian_tweets_percentage = (romanian_tweets_count / total_tweets) * 100
occitan_tweets_percentage = (occitan_tweets_count / total_tweets) * 100
corsican_tweets_percentage = (corsican_tweets_count / total_tweets) * 100
breton_tweets_percentage = (breton_tweets_count / total_tweets) * 100
luxembourgish_tweets_percentage = (luxembourgish_tweets_count / total_tweets) * 100
greek_tweets_percentage = (greek_tweets_count / total_tweets) * 100
turkish_tweets_percentage = (turkish_tweets_count / total_tweets) * 100
german_tweets_percentage = (german_tweets_count / total_tweets) * 100
swedish_tweets_percentage = (swedish_tweets_count / total_tweets) * 100
norwegian_tweets_percentage = (norwegian_tweets_count / total_tweets) * 100
danish_tweets_percentage = (danish_tweets_count / total_tweets) * 100 
finnish_tweets_percentage = (finnish_tweets_count / total_tweets) * 100 
dutch_tweets_percentage = (dutch_tweets_count / total_tweets) * 100
ukrainian_tweets_percentage = (ukrainian_tweets_count / total_tweets) * 100
polish_tweets_percentage = (polish_tweets_count / total_tweets) * 100
czech_tweets_percentage = (czech_tweets_count / total_tweets) * 100
hungarian_tweets_percentage = (hungarian_tweets_count / total_tweets) * 100
bulgarian_tweets_percentage = (bulgarian_tweets_count / total_tweets) * 100
albanian_tweets_percentage = (albanian_tweets_count / total_tweets) * 100
bosnian_tweets_percentage = (bosnian_tweets_count / total_tweets) * 100
icelandic_tweets_percentage = (icelandic_tweets_count / total_tweets) * 100
estonian_tweets_percentage = (estonian_tweets_count / total_tweets) * 100
maltese_tweets_percentage = (maltese_tweets_count / total_tweets) * 100
montenegrin_tweets_percentage = (montenegrin_tweets_count / total_tweets) * 100
macedonian_tweets_percentage = (macedonian_tweets_count / total_tweets) * 100
azerbaijani_tweets_percentage = (azerbaijani_tweets_count / total_tweets) * 100
lithuanian_tweets_percentage = (lithuanian_tweets_count / total_tweets) * 100
latvian_tweets_percentage = (latvian_tweets_count / total_tweets) * 100
armenian_tweets_percentage = (armenian_tweets_count / total_tweets) * 100
georgian_tweets_percentage = (georgian_tweets_count / total_tweets) * 100
serbian_tweets_percentage = (serbian_tweets_count / total_tweets) * 100
croatian_tweets_percentage = (croatian_tweets_count / total_tweets) * 100
slovenian_tweets_percentage = (slovenian_tweets_count / total_tweets) * 100
slovak_tweets_percentage = (slovak_tweets_count / total_tweets) * 100
russian_tweets_percentage = (russian_tweets_count / total_tweets) * 100
belarusian_tweets_percentage = (belarusian_tweets_count / total_tweets) * 100
hebrew_tweets_percentage = (hebrew_tweets_count / total_tweets) * 100 
unknown_tweets_percentage = (unknown_tweets_count / total_tweets) * 100

tweets_count = get_total_tweets_count(tweets)
original_tweets_count = count_original_tweets(tweets)
words_by_frequence = list(most_frequent_word(tweets).items())[0:100]
spanish_tweets_count = count_original_tweets_spanish(tweets)
spanish_words_by_frequence = list(most_frequent_words_spanish(tweets).items())[0:100]

print(f"Number of English tweets: {english_tweets_count}")  #ENGLISH
print(f"Percentage of English tweets: {english_tweets_percentage:.2f}%")
print(f"Number of Spanish tweets: {spanish_tweets_count}") #SPANISH
print(f"Percentage of Spanish tweets: {spanish_tweets_percentage:.2f}%")
print(f"Number of Catalan tweets: {catalan_tweets_count}") #CATALAN
print(f"Percentage of Catalan tweets: {catalan_tweets_percentage:.2f}%")
print(f"Number of Galician tweets: {galician_tweets_count}") #GALICIAN
print(f"Percentage of Galician tweets: {galician_tweets_percentage:.2f}%")
print(f"Number of Basque tweets: {basque_tweets_count}") #BASQUE
print(f"Percentage of Basque tweets: {basque_tweets_percentage:.2f}%")
print(f"Number of Portuguese tweets: {portuguese_tweets_count}") #PORTUGUESE
print(f"Percentage of Portuguese tweets: {portuguese_tweets_percentage:.2f}%")
print(f"Number of French tweets: {french_tweets_count}") #FRENCH
print(f"Percentage of French tweets: {french_tweets_percentage:.2f}%") 
print(f"Number of Italian tweets: {italian_tweets_count}") #ITALIAN
print(f"Percentage of Italian tweets: {italian_tweets_percentage:.2f}%")
print(f"Number of Romanian tweets: {romanian_tweets_count}") #ROMANIAN
print(f"Percentage of Romanian tweets: {romanian_tweets_percentage:.2f}%")
print(f"Number of Occitan tweets: {occitan_tweets_count}") #OCCITAN
print(f"Percentage of Occitan tweets: {occitan_tweets_percentage:.2f}%")
print(f"Number of Corsican tweets: {corsican_tweets_count}") #CORSICAN
print(f"Percentage of Corsican tweets: {corsican_tweets_percentage:.2f}%")
print(f"Number of Breton tweets: {breton_tweets_count}") #BRETON
print(f"Percentage of Breton tweets: {breton_tweets_percentage:.2f}%")
print(f"Number of Luxembourgish tweets: {luxembourgish_tweets_count}") #LUXEMBOURGISH
print(f"Percentage of Luxembourgish tweets: {luxembourgish_tweets_percentage:.2f}%")
print(f"Number of Greek tweets: {greek_tweets_count}") #GREEK
print(f"Percentage of Greek tweets: {greek_tweets_percentage:.2f}%")
print(f"Number of Turkish tweets: {turkish_tweets_count}") #TURKISH
print(f"Percentage of Turkish tweets: {turkish_tweets_percentage:.2f}%")
print(f"Number of German tweets: {german_tweets_count}") #GERMAN
print(f"Percentage of German tweets: {german_tweets_percentage:.2f}%")
print(f"Number of Swedish tweets: {swedish_tweets_count}") #SWEDISH
print(f"Percentage of Swedish tweets: {swedish_tweets_percentage:.2f}%") 
print(f"Number of Norwegian tweets: {norwegian_tweets_count}") #NORWEGIAN
print(f"Percentage of Norwegian tweets: {norwegian_tweets_percentage:.2f}%") 
print(f"Number of Danish tweets: {danish_tweets_count}") #DANISH
print(f"Percentage of Danish tweets: {danish_tweets_percentage:.2f}%")
print(f"Number of Finnish tweets: {finnish_tweets_count}") #FINNISH
print(f"Percentage of Finnish tweets: {finnish_tweets_percentage:.2f}%")
print(f"Number of Dutch tweets: {dutch_tweets_count}") #DUTCH
print(f"Percentage of Dutch tweets: {dutch_tweets_percentage:.2f}%")
print(f"Number of Ukrainian tweets: {ukrainian_tweets_count}") #UKRAINIAN
print(f"Percentage of Ukrainian tweets: {ukrainian_tweets_percentage:.2f}%")
print(f"Number of Polish tweets: {polish_tweets_count}") #POLISH 
print(f"Percentage of Polish tweets: {polish_tweets_percentage:.2f}%")
print(f"Number of Czech tweets: {czech_tweets_count}") #CZECH
print(f"Percentage of Czech tweets: {czech_tweets_percentage:.2f}%")
print(f"Number of Hungarian tweets: {hungarian_tweets_count}") #HUNGARIAN
print(f"Percentage of Hungarian tweets: {hungarian_tweets_percentage:.2f}%")
print(f"Number of Bulgarian tweets: {bulgarian_tweets_count}") #BULGARIAN
print(f"Percentage of Bulgarian tweets: {bulgarian_tweets_percentage:.2f}%")
print(f"Number of Albanian tweets: {albanian_tweets_count}") #ALBANIAN
print(f"Percentage of Albanian tweets: {albanian_tweets_percentage:.2f}%")
print(f"Number of Bosnian tweets: {bosnian_tweets_count}") #BOSNIAN
print(f"Percentage of Bosnian tweets: {bosnian_tweets_percentage:.2f}%")
print(f"Number of Icelandic tweets: {icelandic_tweets_count}") #ICELANDIC
print(f"Percentage of Icelandic tweets: {icelandic_tweets_percentage:.2f}%")
print(f"Number of Estonian tweets: {estonian_tweets_count}") #ESTONIAN
print(f"Percentage of Estonian tweets: {estonian_tweets_percentage:.2f}%")
print(f"Number of Maltese tweets: {maltese_tweets_count}") #MALTESE
print(f"Percentage of Maltese tweets: {maltese_tweets_percentage:.2f}%")
print(f"Number of Montenegrin tweets: {montenegrin_tweets_count}") #MONTENEGRIN
print(f"Percentage of Montenegrin tweets: {montenegrin_tweets_percentage:.2f}%")
print(f"Number of Macedonian tweets: {macedonian_tweets_count}") #MACEDONIAN
print(f"Percentage of Macedonian tweets: {macedonian_tweets_percentage:.2f}%")
print(f"Number of Azerbaijani tweets: {azerbaijani_tweets_count}") #AZERBAIJANI
print(f"Percentage of Azerbaijani tweets: {azerbaijani_tweets_percentage:.2f}%")
print(f"Number of Lithuanian tweets: {lithuanian_tweets_count}") #LITHUANIAN
print(f"Percentage of Lithuanian tweets: {lithuanian_tweets_percentage:.2f}%")
print(f"Number of Latvian tweets: {latvian_tweets_count}") #LATVIAN
print(f"Percentage of Latvian tweets: {latvian_tweets_percentage:.2f}%")
print(f"Number of Armenian tweets: {armenian_tweets_count}") #ARMENIAN
print(f"Percentage of Armenian tweets: {armenian_tweets_percentage:.2f}%")
print(f"Number of Georgian tweets: {georgian_tweets_count}") #GEORGIAN
print(f"Percentage of Georgian tweets: {georgian_tweets_percentage:.2f}%")
print(f"Number of Serbian tweets: {serbian_tweets_count}") #SERBIAN
print(f"Percentage of Serbian tweets: {serbian_tweets_percentage:.2f}%")
print(f"Number of Croatian tweets: {croatian_tweets_count}") #CROATIAN
print(f"Percentage of Croatian tweets: {croatian_tweets_percentage:.2f}%")
print(f"Number of Slovenian tweets: {slovenian_tweets_count}") #SLOVENIAN
print(f"Percentage of Slovenian tweets: {slovenian_tweets_percentage:.2f}%")
print(f"Number of Slovak tweets: {slovak_tweets_count}") #SLOVAK
print(f"Percentage of Slovak tweets: {slovak_tweets_percentage:.2f}%") 
print(f"Number of Russian tweets: {russian_tweets_count}") #RUSSIAN
print(f"Percentage of Russian tweets: {russian_tweets_percentage:.2f}%")
print(f"Number of Belarusian tweets: {belarusian_tweets_count}") #BELARUSIAN
print(f"Percentage of Belarusian tweets: {belarusian_tweets_percentage:.2f}%")
print(f"Number of Hebrew tweets: {hebrew_tweets_count}") #HEBREW
print(f"Percentage of Hebrew tweets: {hebrew_tweets_percentage:.2f}%")
print(f"Number of Unknown tweets: {unknown_tweets_count}") #UNKNOWN
print(f"Percentage of Unknown tweets: {unknown_tweets_percentage:.2f}%")

print("----------------------------------")
print(f"Number of tweets: {tweets_count}")
print(f"Number of original tweets: {original_tweets_count}")
print(f"Number of original Spanish tweets: {spanish_tweets_count}")
print("-----------------------------------")
print(f"Frequent words: {words_by_frequence}\n")
print(f"Most frequent Spanish words: {spanish_words_by_frequence}\n")
print(tweets[0:2])

Number of English tweets: 491
Percentage of English tweets: 49.10%
Number of Spanish tweets: 94
Percentage of Spanish tweets: 23.60%
Number of Catalan tweets: 5
Percentage of Catalan tweets: 0.50%
Number of Galician tweets: 0
Percentage of Galician tweets: 0.00%
Number of Basque tweets: 1
Percentage of Basque tweets: 0.10%
Number of Portuguese tweets: 6
Percentage of Portuguese tweets: 0.60%
Number of French tweets: 96
Percentage of French tweets: 9.60%
Number of Italian tweets: 30
Percentage of Italian tweets: 3.00%
Number of Romanian tweets: 1
Percentage of Romanian tweets: 0.10%
Number of Occitan tweets: 0
Percentage of Occitan tweets: 0.00%
Number of Corsican tweets: 0
Percentage of Corsican tweets: 0.00%
Number of Breton tweets: 0
Percentage of Breton tweets: 0.00%
Number of Luxembourgish tweets: 0
Percentage of Luxembourgish tweets: 0.00%
Number of Greek tweets: 7
Percentage of Greek tweets: 0.70%
Number of Turkish tweets: 21
Percentage of Turkish tweets: 2.10%
Number of German t

# 3:

In [18]:
import s3fs
s3 = s3fs.S3FileSystem(anon=False)

bucket='mudab-2025-big-data'
data_key = 'twitter-data/Eurovision-03.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

In [20]:
import json, dataclasses

tweets = []

bucket='mudab-2025-pc1262057'
data_key = 'input/Eurovision-03.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

def parse_line(line: str):
  """Try to parse a string into a Person"""
  error = 0
  try:
    parsed = json.loads(line)
    return Tweet(parsed['id'], parsed['text'], parsed['user']['name'], parsed['lang'],parsed.get('retweeted_status', False))
  except Exception as e:
    error += 1    
#   print(f"Error parsing '{line}': {e}")

with s3.open(data_location) as input: #CLIENT S3. CAN READ THE FILE LINE BY LINE
  for line in input.readlines():
    if len(line.strip()) > 0:
      tweet = parse_line(line)
      if tweet: # We add only if the tweet is not 'None'
         tweets.append(tweet)

for modeled_tweet in tweets[0:10]:
  print(modeled_tweet)

Tweet(id=995492134441639936, content='RT @vickygom3z: Como han estado mis niños @Amaia_ot2017 y @Alfred_ot2017 de brillantes???? 😍😍😍 Que emoción por dios! Son tan especiales...…', author='ALFRED🎺1016🎺', language='es', retweeted_status={'created_at': 'Sat May 12 19:44:41 +0000 2018', 'id': 995389318213586945, 'id_str': '995389318213586945', 'text': 'Como han estado mis niños @Amaia_ot2017 y @Alfred_ot2017 de brillantes???? 😍😍😍 Que emoción por dios! Son tan especi… https://t.co/rU5JpZr6Js', 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'truncated': True, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 485237915, 'id_str': '485237915', 'name': 'Vicky Gómez', 'screen_name': 'vickygom3z', 'location': None, 'url': None, 'description': 'Bailarina y coreógrafa/ Profesora de baile de OT Instagram: @vickygom3z', 'transla

In [64]:
import json, dataclasses
def read_clean_tweets(input: str):
    tweets = []
    with open(input, 'r') as f:
        lines = f.readlines()
    for line in lines:
        parsed = json.loads(line)
        tweet = Tweet(**parsed)
        tweets.append(tweet)
    return tweets

def count_tweets(language: str, tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == language:
            count += 1
    return count 
    

def most_frequent_word(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        words = tweet.content.split(' ')
        for word in words:
            if word in count:
                count[word] += 1
            else:
                count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def most_frequent_words_spanish(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        if tweet.language == 'es':
            words = tweet.content.split(' ')
            for word in words:
                if word in count:
                    count[word] += 1
                else:
                    count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def get_total_tweets_count(tweets: list[Tweet]):
    return len(tweets)

def count_original_tweets(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.retweeted_status == False:
            count += 1
    return count

def count_original_tweets_spanish(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == 'es' and tweet.retweeted_status == False:
            count += 1
    return count


tweets = read_clean_tweets('clean-dataset')

english_tweets_count = count_tweets('en', tweets)
spanish_tweets_count = count_tweets('es', tweets)
catalan_tweets_count = count_tweets('ca', tweets)
galician_tweets_count = count_tweets('gl', tweets)
basque_tweets_count = count_tweets('eu', tweets)
portuguese_tweets_count = count_tweets('pt', tweets)
french_tweets_count = count_tweets('fr', tweets)
italian_tweets_count = count_tweets('it', tweets)
romanian_tweets_count = count_tweets('ro', tweets)
occitan_tweets_count = count_tweets('oc', tweets) 
corsican_tweets_count = count_tweets('co', tweets) 
breton_tweets_count = count_tweets('br', tweets) 
luxembourgish_tweets_count = count_tweets('lb', tweets)
greek_tweets_count = count_tweets('el', tweets)
turkish_tweets_count = count_tweets('tr', tweets)
german_tweets_count = count_tweets('de', tweets)
swedish_tweets_count = count_tweets('sv', tweets)
norwegian_tweets_count = count_tweets('no', tweets)
danish_tweets_count = count_tweets('da', tweets) 
finnish_tweets_count = count_tweets('fi', tweets)
dutch_tweets_count = count_tweets('nl', tweets)
ukrainian_tweets_count = count_tweets('uk', tweets) 
polish_tweets_count = count_tweets('pl', tweets) 
czech_tweets_count = count_tweets('cs', tweets) 
hungarian_tweets_count = count_tweets('hu', tweets) 
bulgarian_tweets_count = count_tweets('bg', tweets) 
albanian_tweets_count = count_tweets('sq', tweets)
bosnian_tweets_count = count_tweets('bs', tweets)
icelandic_tweets_count = count_tweets('is', tweets)
estonian_tweets_count = count_tweets('et', tweets)
maltese_tweets_count = count_tweets('mt', tweets)
montenegrin_tweets_count = count_tweets('me', tweets)
macedonian_tweets_count = count_tweets('mk', tweets)
azerbaijani_tweets_count = count_tweets('az', tweets)
lithuanian_tweets_count = count_tweets('lt', tweets)
latvian_tweets_count = count_tweets('lv', tweets)
armenian_tweets_count = count_tweets('hy', tweets)
georgian_tweets_count = count_tweets('ka', tweets)
serbian_tweets_count = count_tweets('sr', tweets)
croatian_tweets_count = count_tweets('hr', tweets)
slovenian_tweets_count = count_tweets('sl', tweets)
slovak_tweets_count = count_tweets('sk', tweets)
russian_tweets_count = count_tweets('ru', tweets) 
belarusian_tweets_count = count_tweets('be', tweets) 
hebrew_tweets_count = count_tweets('he', tweets)
unknown_tweets_count = count_tweets('und', tweets)


# PERCENTAGE FOR EACH LANGUAGE
total_tweets = get_total_tweets_count(tweets)
english_tweets_percentage = (english_tweets_count / total_tweets) * 100
spanish_tweets_percentage = (spanish_tweets_count / total_tweets) * 100
catalan_tweets_percentage = (catalan_tweets_count / total_tweets) * 100
galician_tweets_percentage = (galician_tweets_count / total_tweets) * 100
basque_tweets_percentage = (basque_tweets_count / total_tweets) * 100
portuguese_tweets_percentage = (portuguese_tweets_count / total_tweets) * 100
french_tweets_percentage = (french_tweets_count / total_tweets) * 100
italian_tweets_percentage = (italian_tweets_count / total_tweets) * 100
romanian_tweets_percentage = (romanian_tweets_count / total_tweets) * 100
occitan_tweets_percentage = (occitan_tweets_count / total_tweets) * 100
corsican_tweets_percentage = (corsican_tweets_count / total_tweets) * 100
breton_tweets_percentage = (breton_tweets_count / total_tweets) * 100
luxembourgish_tweets_percentage = (luxembourgish_tweets_count / total_tweets) * 100
greek_tweets_percentage = (greek_tweets_count / total_tweets) * 100
turkish_tweets_percentage = (turkish_tweets_count / total_tweets) * 100
german_tweets_percentage = (german_tweets_count / total_tweets) * 100
swedish_tweets_percentage = (swedish_tweets_count / total_tweets) * 100
norwegian_tweets_percentage = (norwegian_tweets_count / total_tweets) * 100
danish_tweets_percentage = (danish_tweets_count / total_tweets) * 100 
finnish_tweets_percentage = (finnish_tweets_count / total_tweets) * 100 
dutch_tweets_percentage = (dutch_tweets_count / total_tweets) * 100
ukrainian_tweets_percentage = (ukrainian_tweets_count / total_tweets) * 100
polish_tweets_percentage = (polish_tweets_count / total_tweets) * 100
czech_tweets_percentage = (czech_tweets_count / total_tweets) * 100
hungarian_tweets_percentage = (hungarian_tweets_count / total_tweets) * 100
bulgarian_tweets_percentage = (bulgarian_tweets_count / total_tweets) * 100
albanian_tweets_percentage = (albanian_tweets_count / total_tweets) * 100
bosnian_tweets_percentage = (bosnian_tweets_count / total_tweets) * 100
icelandic_tweets_percentage = (icelandic_tweets_count / total_tweets) * 100
estonian_tweets_percentage = (estonian_tweets_count / total_tweets) * 100
maltese_tweets_percentage = (maltese_tweets_count / total_tweets) * 100
montenegrin_tweets_percentage = (montenegrin_tweets_count / total_tweets) * 100
macedonian_tweets_percentage = (macedonian_tweets_count / total_tweets) * 100
azerbaijani_tweets_percentage = (azerbaijani_tweets_count / total_tweets) * 100
lithuanian_tweets_percentage = (lithuanian_tweets_count / total_tweets) * 100
latvian_tweets_percentage = (latvian_tweets_count / total_tweets) * 100
armenian_tweets_percentage = (armenian_tweets_count / total_tweets) * 100
georgian_tweets_percentage = (georgian_tweets_count / total_tweets) * 100
serbian_tweets_percentage = (serbian_tweets_count / total_tweets) * 100
croatian_tweets_percentage = (croatian_tweets_count / total_tweets) * 100
slovenian_tweets_percentage = (slovenian_tweets_count / total_tweets) * 100
slovak_tweets_percentage = (slovak_tweets_count / total_tweets) * 100
russian_tweets_percentage = (russian_tweets_count / total_tweets) * 100
belarusian_tweets_percentage = (belarusian_tweets_count / total_tweets) * 100
hebrew_tweets_percentage = (hebrew_tweets_count / total_tweets) * 100 
unknown_tweets_percentage = (unknown_tweets_count / total_tweets) * 100

tweets_count = get_total_tweets_count(tweets)
original_tweets_count = count_original_tweets(tweets)
words_by_frequence = list(most_frequent_word(tweets).items())[0:100]
spanish_tweets_count = count_original_tweets_spanish(tweets)
spanish_words_by_frequence = list(most_frequent_words_spanish(tweets).items())[0:100]

print(f"Number of English tweets: {english_tweets_count}")  #ENGLISH
print(f"Percentage of English tweets: {english_tweets_percentage:.2f}%")
print(f"Number of Spanish tweets: {spanish_tweets_count}") #SPANISH
print(f"Percentage of Spanish tweets: {spanish_tweets_percentage:.2f}%")
print(f"Number of Catalan tweets: {catalan_tweets_count}") #CATALAN
print(f"Percentage of Catalan tweets: {catalan_tweets_percentage:.2f}%")
print(f"Number of Galician tweets: {galician_tweets_count}") #GALICIAN
print(f"Percentage of Galician tweets: {galician_tweets_percentage:.2f}%")
print(f"Number of Basque tweets: {basque_tweets_count}") #BASQUE
print(f"Percentage of Basque tweets: {basque_tweets_percentage:.2f}%")
print(f"Number of Portuguese tweets: {portuguese_tweets_count}") #PORTUGUESE
print(f"Percentage of Portuguese tweets: {portuguese_tweets_percentage:.2f}%")
print(f"Number of French tweets: {french_tweets_count}") #FRENCH
print(f"Percentage of French tweets: {french_tweets_percentage:.2f}%") 
print(f"Number of Italian tweets: {italian_tweets_count}") #ITALIAN
print(f"Percentage of Italian tweets: {italian_tweets_percentage:.2f}%")
print(f"Number of Romanian tweets: {romanian_tweets_count}") #ROMANIAN
print(f"Percentage of Romanian tweets: {romanian_tweets_percentage:.2f}%")
print(f"Number of Occitan tweets: {occitan_tweets_count}") #OCCITAN
print(f"Percentage of Occitan tweets: {occitan_tweets_percentage:.2f}%")
print(f"Number of Corsican tweets: {corsican_tweets_count}") #CORSICAN
print(f"Percentage of Corsican tweets: {corsican_tweets_percentage:.2f}%")
print(f"Number of Breton tweets: {breton_tweets_count}") #BRETON
print(f"Percentage of Breton tweets: {breton_tweets_percentage:.2f}%")
print(f"Number of Luxembourgish tweets: {luxembourgish_tweets_count}") #LUXEMBOURGISH
print(f"Percentage of Luxembourgish tweets: {luxembourgish_tweets_percentage:.2f}%")
print(f"Number of Greek tweets: {greek_tweets_count}") #GREEK
print(f"Percentage of Greek tweets: {greek_tweets_percentage:.2f}%")
print(f"Number of Turkish tweets: {turkish_tweets_count}") #TURKISH
print(f"Percentage of Turkish tweets: {turkish_tweets_percentage:.2f}%")
print(f"Number of German tweets: {german_tweets_count}") #GERMAN
print(f"Percentage of German tweets: {german_tweets_percentage:.2f}%")
print(f"Number of Swedish tweets: {swedish_tweets_count}") #SWEDISH
print(f"Percentage of Swedish tweets: {swedish_tweets_percentage:.2f}%") 
print(f"Number of Norwegian tweets: {norwegian_tweets_count}") #NORWEGIAN
print(f"Percentage of Norwegian tweets: {norwegian_tweets_percentage:.2f}%") 
print(f"Number of Danish tweets: {danish_tweets_count}") #DANISH
print(f"Percentage of Danish tweets: {danish_tweets_percentage:.2f}%")
print(f"Number of Finnish tweets: {finnish_tweets_count}") #FINNISH
print(f"Percentage of Finnish tweets: {finnish_tweets_percentage:.2f}%")
print(f"Number of Dutch tweets: {dutch_tweets_count}") #DUTCH
print(f"Percentage of Dutch tweets: {dutch_tweets_percentage:.2f}%")
print(f"Number of Ukrainian tweets: {ukrainian_tweets_count}") #UKRAINIAN
print(f"Percentage of Ukrainian tweets: {ukrainian_tweets_percentage:.2f}%")
print(f"Number of Polish tweets: {polish_tweets_count}") #POLISH 
print(f"Percentage of Polish tweets: {polish_tweets_percentage:.2f}%")
print(f"Number of Czech tweets: {czech_tweets_count}") #CZECH
print(f"Percentage of Czech tweets: {czech_tweets_percentage:.2f}%")
print(f"Number of Hungarian tweets: {hungarian_tweets_count}") #HUNGARIAN
print(f"Percentage of Hungarian tweets: {hungarian_tweets_percentage:.2f}%")
print(f"Number of Bulgarian tweets: {bulgarian_tweets_count}") #BULGARIAN
print(f"Percentage of Bulgarian tweets: {bulgarian_tweets_percentage:.2f}%")
print(f"Number of Albanian tweets: {albanian_tweets_count}") #ALBANIAN
print(f"Percentage of Albanian tweets: {albanian_tweets_percentage:.2f}%")
print(f"Number of Bosnian tweets: {bosnian_tweets_count}") #BOSNIAN
print(f"Percentage of Bosnian tweets: {bosnian_tweets_percentage:.2f}%")
print(f"Number of Icelandic tweets: {icelandic_tweets_count}") #ICELANDIC
print(f"Percentage of Icelandic tweets: {icelandic_tweets_percentage:.2f}%")
print(f"Number of Estonian tweets: {estonian_tweets_count}") #ESTONIAN
print(f"Percentage of Estonian tweets: {estonian_tweets_percentage:.2f}%")
print(f"Number of Maltese tweets: {maltese_tweets_count}") #MALTESE
print(f"Percentage of Maltese tweets: {maltese_tweets_percentage:.2f}%")
print(f"Number of Montenegrin tweets: {montenegrin_tweets_count}") #MONTENEGRIN
print(f"Percentage of Montenegrin tweets: {montenegrin_tweets_percentage:.2f}%")
print(f"Number of Macedonian tweets: {macedonian_tweets_count}") #MACEDONIAN
print(f"Percentage of Macedonian tweets: {macedonian_tweets_percentage:.2f}%")
print(f"Number of Azerbaijani tweets: {azerbaijani_tweets_count}") #AZERBAIJANI
print(f"Percentage of Azerbaijani tweets: {azerbaijani_tweets_percentage:.2f}%")
print(f"Number of Lithuanian tweets: {lithuanian_tweets_count}") #LITHUANIAN
print(f"Percentage of Lithuanian tweets: {lithuanian_tweets_percentage:.2f}%")
print(f"Number of Latvian tweets: {latvian_tweets_count}") #LATVIAN
print(f"Percentage of Latvian tweets: {latvian_tweets_percentage:.2f}%")
print(f"Number of Armenian tweets: {armenian_tweets_count}") #ARMENIAN
print(f"Percentage of Armenian tweets: {armenian_tweets_percentage:.2f}%")
print(f"Number of Georgian tweets: {georgian_tweets_count}") #GEORGIAN
print(f"Percentage of Georgian tweets: {georgian_tweets_percentage:.2f}%")
print(f"Number of Serbian tweets: {serbian_tweets_count}") #SERBIAN
print(f"Percentage of Serbian tweets: {serbian_tweets_percentage:.2f}%")
print(f"Number of Croatian tweets: {croatian_tweets_count}") #CROATIAN
print(f"Percentage of Croatian tweets: {croatian_tweets_percentage:.2f}%")
print(f"Number of Slovenian tweets: {slovenian_tweets_count}") #SLOVENIAN
print(f"Percentage of Slovenian tweets: {slovenian_tweets_percentage:.2f}%")
print(f"Number of Slovak tweets: {slovak_tweets_count}") #SLOVAK
print(f"Percentage of Slovak tweets: {slovak_tweets_percentage:.2f}%") 
print(f"Number of Russian tweets: {russian_tweets_count}") #RUSSIAN
print(f"Percentage of Russian tweets: {russian_tweets_percentage:.2f}%")
print(f"Number of Belarusian tweets: {belarusian_tweets_count}") #BELARUSIAN
print(f"Percentage of Belarusian tweets: {belarusian_tweets_percentage:.2f}%")
print(f"Number of Hebrew tweets: {hebrew_tweets_count}") #HEBREW
print(f"Percentage of Hebrew tweets: {hebrew_tweets_percentage:.2f}%")
print(f"Number of Unknown tweets: {unknown_tweets_count}") #UNKNOWN
print(f"Percentage of Unknown tweets: {unknown_tweets_percentage:.2f}%")

print("----------------------------------")
print(f"Number of tweets: {tweets_count}")
print(f"Number of original tweets: {original_tweets_count}")
print(f"Number of original Spanish tweets: {spanish_tweets_count}")
print("-----------------------------------")
print(f"Frequent words: {words_by_frequence}\n")
print(f"Most frequent Spanish words: {spanish_words_by_frequence}\n")
print(tweets[0:2])

Number of English tweets: 491
Percentage of English tweets: 49.10%
Number of Spanish tweets: 94
Percentage of Spanish tweets: 23.60%
Number of Catalan tweets: 5
Percentage of Catalan tweets: 0.50%
Number of Galician tweets: 0
Percentage of Galician tweets: 0.00%
Number of Basque tweets: 1
Percentage of Basque tweets: 0.10%
Number of Portuguese tweets: 6
Percentage of Portuguese tweets: 0.60%
Number of French tweets: 96
Percentage of French tweets: 9.60%
Number of Italian tweets: 30
Percentage of Italian tweets: 3.00%
Number of Romanian tweets: 1
Percentage of Romanian tweets: 0.10%
Number of Occitan tweets: 0
Percentage of Occitan tweets: 0.00%
Number of Corsican tweets: 0
Percentage of Corsican tweets: 0.00%
Number of Breton tweets: 0
Percentage of Breton tweets: 0.00%
Number of Luxembourgish tweets: 0
Percentage of Luxembourgish tweets: 0.00%
Number of Greek tweets: 7
Percentage of Greek tweets: 0.70%
Number of Turkish tweets: 21
Percentage of Turkish tweets: 2.10%
Number of German t

# 4:

In [22]:
import s3fs
s3 = s3fs.S3FileSystem(anon=False)

bucket='mudab-2025-big-data'
data_key = 'twitter-data/Eurovision-04.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

In [24]:
import json, dataclasses

tweets = []

bucket='mudab-2025-pc1262057'
data_key = 'input/Eurovision-04.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

def parse_line(line: str):
  """Try to parse a string into a Person"""
  error = 0
  try:
    parsed = json.loads(line)
    return Tweet(parsed['id'], parsed['text'], parsed['user']['name'], parsed['lang'],parsed.get('retweeted_status', False))
  except Exception as e:
    error += 1    
#   print(f"Error parsing '{line}': {e}")

with s3.open(data_location) as input: #CLIENT S3. CAN READ THE FILE LINE BY LINE
  for line in input.readlines():
    if len(line.strip()) > 0:
      tweet = parse_line(line)
      if tweet: # We add only if the tweet is not 'None'
         tweets.append(tweet)

for modeled_tweet in tweets[0:10]:
  print(modeled_tweet)

Tweet(id=995364942193848322, content='RT @NetflixES: Ella está al mando. Con @PaquitaSalas nada malo puede pasar, ¿no? #Eurovision https://t.co/5HeUDCqxX6', author='GH taxi', language='es', retweeted_status={'created_at': 'Sat May 12 17:35:18 +0000 2018', 'id': 995356756770467840, 'id_str': '995356756770467840', 'text': 'Ella está al mando. Con @PaquitaSalas nada malo puede pasar, ¿no? #Eurovision https://t.co/5HeUDCqxX6', 'display_text_range': [0, 77], 'source': '<a href="https://studio.twitter.com" rel="nofollow">Media Studio</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 3143260474, 'id_str': '3143260474', 'name': 'Netflix España', 'screen_name': 'NetflixES', 'location': 'Spain', 'url': 'http://www.netflix.es', 'description': 'Oh Bella Ciao, bella ciao, bella ciao, ciao, ciao. Ayuda: @Netflixhelps', 'translator_type': 'none', 'pro

In [65]:
import json, dataclasses
def read_clean_tweets(input: str):
    tweets = []
    with open(input, 'r') as f:
        lines = f.readlines()
    for line in lines:
        parsed = json.loads(line)
        tweet = Tweet(**parsed)
        tweets.append(tweet)
    return tweets

def count_tweets(language: str, tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == language:
            count += 1
    return count 
    

def most_frequent_word(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        words = tweet.content.split(' ')
        for word in words:
            if word in count:
                count[word] += 1
            else:
                count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def most_frequent_words_spanish(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        if tweet.language == 'es':
            words = tweet.content.split(' ')
            for word in words:
                if word in count:
                    count[word] += 1
                else:
                    count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def get_total_tweets_count(tweets: list[Tweet]):
    return len(tweets)

def count_original_tweets(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.retweeted_status == False:
            count += 1
    return count

def count_original_tweets_spanish(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == 'es' and tweet.retweeted_status == False:
            count += 1
    return count


tweets = read_clean_tweets('clean-dataset')

english_tweets_count = count_tweets('en', tweets)
spanish_tweets_count = count_tweets('es', tweets)
catalan_tweets_count = count_tweets('ca', tweets)
galician_tweets_count = count_tweets('gl', tweets)
basque_tweets_count = count_tweets('eu', tweets)
portuguese_tweets_count = count_tweets('pt', tweets)
french_tweets_count = count_tweets('fr', tweets)
italian_tweets_count = count_tweets('it', tweets)
romanian_tweets_count = count_tweets('ro', tweets)
occitan_tweets_count = count_tweets('oc', tweets) 
corsican_tweets_count = count_tweets('co', tweets) 
breton_tweets_count = count_tweets('br', tweets) 
luxembourgish_tweets_count = count_tweets('lb', tweets)
greek_tweets_count = count_tweets('el', tweets)
turkish_tweets_count = count_tweets('tr', tweets)
german_tweets_count = count_tweets('de', tweets)
swedish_tweets_count = count_tweets('sv', tweets)
norwegian_tweets_count = count_tweets('no', tweets)
danish_tweets_count = count_tweets('da', tweets) 
finnish_tweets_count = count_tweets('fi', tweets)
dutch_tweets_count = count_tweets('nl', tweets)
ukrainian_tweets_count = count_tweets('uk', tweets) 
polish_tweets_count = count_tweets('pl', tweets) 
czech_tweets_count = count_tweets('cs', tweets) 
hungarian_tweets_count = count_tweets('hu', tweets) 
bulgarian_tweets_count = count_tweets('bg', tweets) 
albanian_tweets_count = count_tweets('sq', tweets)
bosnian_tweets_count = count_tweets('bs', tweets)
icelandic_tweets_count = count_tweets('is', tweets)
estonian_tweets_count = count_tweets('et', tweets)
maltese_tweets_count = count_tweets('mt', tweets)
montenegrin_tweets_count = count_tweets('me', tweets)
macedonian_tweets_count = count_tweets('mk', tweets)
azerbaijani_tweets_count = count_tweets('az', tweets)
lithuanian_tweets_count = count_tweets('lt', tweets)
latvian_tweets_count = count_tweets('lv', tweets)
armenian_tweets_count = count_tweets('hy', tweets)
georgian_tweets_count = count_tweets('ka', tweets)
serbian_tweets_count = count_tweets('sr', tweets)
croatian_tweets_count = count_tweets('hr', tweets)
slovenian_tweets_count = count_tweets('sl', tweets)
slovak_tweets_count = count_tweets('sk', tweets)
russian_tweets_count = count_tweets('ru', tweets) 
belarusian_tweets_count = count_tweets('be', tweets) 
hebrew_tweets_count = count_tweets('he', tweets)
unknown_tweets_count = count_tweets('und', tweets)


# PERCENTAGE FOR EACH LANGUAGE
total_tweets = get_total_tweets_count(tweets)
english_tweets_percentage = (english_tweets_count / total_tweets) * 100
spanish_tweets_percentage = (spanish_tweets_count / total_tweets) * 100
catalan_tweets_percentage = (catalan_tweets_count / total_tweets) * 100
galician_tweets_percentage = (galician_tweets_count / total_tweets) * 100
basque_tweets_percentage = (basque_tweets_count / total_tweets) * 100
portuguese_tweets_percentage = (portuguese_tweets_count / total_tweets) * 100
french_tweets_percentage = (french_tweets_count / total_tweets) * 100
italian_tweets_percentage = (italian_tweets_count / total_tweets) * 100
romanian_tweets_percentage = (romanian_tweets_count / total_tweets) * 100
occitan_tweets_percentage = (occitan_tweets_count / total_tweets) * 100
corsican_tweets_percentage = (corsican_tweets_count / total_tweets) * 100
breton_tweets_percentage = (breton_tweets_count / total_tweets) * 100
luxembourgish_tweets_percentage = (luxembourgish_tweets_count / total_tweets) * 100
greek_tweets_percentage = (greek_tweets_count / total_tweets) * 100
turkish_tweets_percentage = (turkish_tweets_count / total_tweets) * 100
german_tweets_percentage = (german_tweets_count / total_tweets) * 100
swedish_tweets_percentage = (swedish_tweets_count / total_tweets) * 100
norwegian_tweets_percentage = (norwegian_tweets_count / total_tweets) * 100
danish_tweets_percentage = (danish_tweets_count / total_tweets) * 100 
finnish_tweets_percentage = (finnish_tweets_count / total_tweets) * 100 
dutch_tweets_percentage = (dutch_tweets_count / total_tweets) * 100
ukrainian_tweets_percentage = (ukrainian_tweets_count / total_tweets) * 100
polish_tweets_percentage = (polish_tweets_count / total_tweets) * 100
czech_tweets_percentage = (czech_tweets_count / total_tweets) * 100
hungarian_tweets_percentage = (hungarian_tweets_count / total_tweets) * 100
bulgarian_tweets_percentage = (bulgarian_tweets_count / total_tweets) * 100
albanian_tweets_percentage = (albanian_tweets_count / total_tweets) * 100
bosnian_tweets_percentage = (bosnian_tweets_count / total_tweets) * 100
icelandic_tweets_percentage = (icelandic_tweets_count / total_tweets) * 100
estonian_tweets_percentage = (estonian_tweets_count / total_tweets) * 100
maltese_tweets_percentage = (maltese_tweets_count / total_tweets) * 100
montenegrin_tweets_percentage = (montenegrin_tweets_count / total_tweets) * 100
macedonian_tweets_percentage = (macedonian_tweets_count / total_tweets) * 100
azerbaijani_tweets_percentage = (azerbaijani_tweets_count / total_tweets) * 100
lithuanian_tweets_percentage = (lithuanian_tweets_count / total_tweets) * 100
latvian_tweets_percentage = (latvian_tweets_count / total_tweets) * 100
armenian_tweets_percentage = (armenian_tweets_count / total_tweets) * 100
georgian_tweets_percentage = (georgian_tweets_count / total_tweets) * 100
serbian_tweets_percentage = (serbian_tweets_count / total_tweets) * 100
croatian_tweets_percentage = (croatian_tweets_count / total_tweets) * 100
slovenian_tweets_percentage = (slovenian_tweets_count / total_tweets) * 100
slovak_tweets_percentage = (slovak_tweets_count / total_tweets) * 100
russian_tweets_percentage = (russian_tweets_count / total_tweets) * 100
belarusian_tweets_percentage = (belarusian_tweets_count / total_tweets) * 100
hebrew_tweets_percentage = (hebrew_tweets_count / total_tweets) * 100 
unknown_tweets_percentage = (unknown_tweets_count / total_tweets) * 100

tweets_count = get_total_tweets_count(tweets)
original_tweets_count = count_original_tweets(tweets)
words_by_frequence = list(most_frequent_word(tweets).items())[0:100]
spanish_tweets_count = count_original_tweets_spanish(tweets)
spanish_words_by_frequence = list(most_frequent_words_spanish(tweets).items())[0:100]

print(f"Number of English tweets: {english_tweets_count}")  #ENGLISH
print(f"Percentage of English tweets: {english_tweets_percentage:.2f}%")
print(f"Number of Spanish tweets: {spanish_tweets_count}") #SPANISH
print(f"Percentage of Spanish tweets: {spanish_tweets_percentage:.2f}%")
print(f"Number of Catalan tweets: {catalan_tweets_count}") #CATALAN
print(f"Percentage of Catalan tweets: {catalan_tweets_percentage:.2f}%")
print(f"Number of Galician tweets: {galician_tweets_count}") #GALICIAN
print(f"Percentage of Galician tweets: {galician_tweets_percentage:.2f}%")
print(f"Number of Basque tweets: {basque_tweets_count}") #BASQUE
print(f"Percentage of Basque tweets: {basque_tweets_percentage:.2f}%")
print(f"Number of Portuguese tweets: {portuguese_tweets_count}") #PORTUGUESE
print(f"Percentage of Portuguese tweets: {portuguese_tweets_percentage:.2f}%")
print(f"Number of French tweets: {french_tweets_count}") #FRENCH
print(f"Percentage of French tweets: {french_tweets_percentage:.2f}%") 
print(f"Number of Italian tweets: {italian_tweets_count}") #ITALIAN
print(f"Percentage of Italian tweets: {italian_tweets_percentage:.2f}%")
print(f"Number of Romanian tweets: {romanian_tweets_count}") #ROMANIAN
print(f"Percentage of Romanian tweets: {romanian_tweets_percentage:.2f}%")
print(f"Number of Occitan tweets: {occitan_tweets_count}") #OCCITAN
print(f"Percentage of Occitan tweets: {occitan_tweets_percentage:.2f}%")
print(f"Number of Corsican tweets: {corsican_tweets_count}") #CORSICAN
print(f"Percentage of Corsican tweets: {corsican_tweets_percentage:.2f}%")
print(f"Number of Breton tweets: {breton_tweets_count}") #BRETON
print(f"Percentage of Breton tweets: {breton_tweets_percentage:.2f}%")
print(f"Number of Luxembourgish tweets: {luxembourgish_tweets_count}") #LUXEMBOURGISH
print(f"Percentage of Luxembourgish tweets: {luxembourgish_tweets_percentage:.2f}%")
print(f"Number of Greek tweets: {greek_tweets_count}") #GREEK
print(f"Percentage of Greek tweets: {greek_tweets_percentage:.2f}%")
print(f"Number of Turkish tweets: {turkish_tweets_count}") #TURKISH
print(f"Percentage of Turkish tweets: {turkish_tweets_percentage:.2f}%")
print(f"Number of German tweets: {german_tweets_count}") #GERMAN
print(f"Percentage of German tweets: {german_tweets_percentage:.2f}%")
print(f"Number of Swedish tweets: {swedish_tweets_count}") #SWEDISH
print(f"Percentage of Swedish tweets: {swedish_tweets_percentage:.2f}%") 
print(f"Number of Norwegian tweets: {norwegian_tweets_count}") #NORWEGIAN
print(f"Percentage of Norwegian tweets: {norwegian_tweets_percentage:.2f}%") 
print(f"Number of Danish tweets: {danish_tweets_count}") #DANISH
print(f"Percentage of Danish tweets: {danish_tweets_percentage:.2f}%")
print(f"Number of Finnish tweets: {finnish_tweets_count}") #FINNISH
print(f"Percentage of Finnish tweets: {finnish_tweets_percentage:.2f}%")
print(f"Number of Dutch tweets: {dutch_tweets_count}") #DUTCH
print(f"Percentage of Dutch tweets: {dutch_tweets_percentage:.2f}%")
print(f"Number of Ukrainian tweets: {ukrainian_tweets_count}") #UKRAINIAN
print(f"Percentage of Ukrainian tweets: {ukrainian_tweets_percentage:.2f}%")
print(f"Number of Polish tweets: {polish_tweets_count}") #POLISH 
print(f"Percentage of Polish tweets: {polish_tweets_percentage:.2f}%")
print(f"Number of Czech tweets: {czech_tweets_count}") #CZECH
print(f"Percentage of Czech tweets: {czech_tweets_percentage:.2f}%")
print(f"Number of Hungarian tweets: {hungarian_tweets_count}") #HUNGARIAN
print(f"Percentage of Hungarian tweets: {hungarian_tweets_percentage:.2f}%")
print(f"Number of Bulgarian tweets: {bulgarian_tweets_count}") #BULGARIAN
print(f"Percentage of Bulgarian tweets: {bulgarian_tweets_percentage:.2f}%")
print(f"Number of Albanian tweets: {albanian_tweets_count}") #ALBANIAN
print(f"Percentage of Albanian tweets: {albanian_tweets_percentage:.2f}%")
print(f"Number of Bosnian tweets: {bosnian_tweets_count}") #BOSNIAN
print(f"Percentage of Bosnian tweets: {bosnian_tweets_percentage:.2f}%")
print(f"Number of Icelandic tweets: {icelandic_tweets_count}") #ICELANDIC
print(f"Percentage of Icelandic tweets: {icelandic_tweets_percentage:.2f}%")
print(f"Number of Estonian tweets: {estonian_tweets_count}") #ESTONIAN
print(f"Percentage of Estonian tweets: {estonian_tweets_percentage:.2f}%")
print(f"Number of Maltese tweets: {maltese_tweets_count}") #MALTESE
print(f"Percentage of Maltese tweets: {maltese_tweets_percentage:.2f}%")
print(f"Number of Montenegrin tweets: {montenegrin_tweets_count}") #MONTENEGRIN
print(f"Percentage of Montenegrin tweets: {montenegrin_tweets_percentage:.2f}%")
print(f"Number of Macedonian tweets: {macedonian_tweets_count}") #MACEDONIAN
print(f"Percentage of Macedonian tweets: {macedonian_tweets_percentage:.2f}%")
print(f"Number of Azerbaijani tweets: {azerbaijani_tweets_count}") #AZERBAIJANI
print(f"Percentage of Azerbaijani tweets: {azerbaijani_tweets_percentage:.2f}%")
print(f"Number of Lithuanian tweets: {lithuanian_tweets_count}") #LITHUANIAN
print(f"Percentage of Lithuanian tweets: {lithuanian_tweets_percentage:.2f}%")
print(f"Number of Latvian tweets: {latvian_tweets_count}") #LATVIAN
print(f"Percentage of Latvian tweets: {latvian_tweets_percentage:.2f}%")
print(f"Number of Armenian tweets: {armenian_tweets_count}") #ARMENIAN
print(f"Percentage of Armenian tweets: {armenian_tweets_percentage:.2f}%")
print(f"Number of Georgian tweets: {georgian_tweets_count}") #GEORGIAN
print(f"Percentage of Georgian tweets: {georgian_tweets_percentage:.2f}%")
print(f"Number of Serbian tweets: {serbian_tweets_count}") #SERBIAN
print(f"Percentage of Serbian tweets: {serbian_tweets_percentage:.2f}%")
print(f"Number of Croatian tweets: {croatian_tweets_count}") #CROATIAN
print(f"Percentage of Croatian tweets: {croatian_tweets_percentage:.2f}%")
print(f"Number of Slovenian tweets: {slovenian_tweets_count}") #SLOVENIAN
print(f"Percentage of Slovenian tweets: {slovenian_tweets_percentage:.2f}%")
print(f"Number of Slovak tweets: {slovak_tweets_count}") #SLOVAK
print(f"Percentage of Slovak tweets: {slovak_tweets_percentage:.2f}%") 
print(f"Number of Russian tweets: {russian_tweets_count}") #RUSSIAN
print(f"Percentage of Russian tweets: {russian_tweets_percentage:.2f}%")
print(f"Number of Belarusian tweets: {belarusian_tweets_count}") #BELARUSIAN
print(f"Percentage of Belarusian tweets: {belarusian_tweets_percentage:.2f}%")
print(f"Number of Hebrew tweets: {hebrew_tweets_count}") #HEBREW
print(f"Percentage of Hebrew tweets: {hebrew_tweets_percentage:.2f}%")
print(f"Number of Unknown tweets: {unknown_tweets_count}") #UNKNOWN
print(f"Percentage of Unknown tweets: {unknown_tweets_percentage:.2f}%")

print("----------------------------------")
print(f"Number of tweets: {tweets_count}")
print(f"Number of original tweets: {original_tweets_count}")
print(f"Number of original Spanish tweets: {spanish_tweets_count}")
print("-----------------------------------")
print(f"Frequent words: {words_by_frequence}\n")
print(f"Most frequent Spanish words: {spanish_words_by_frequence}\n")
print(tweets[0:2])

Number of English tweets: 491
Percentage of English tweets: 49.10%
Number of Spanish tweets: 94
Percentage of Spanish tweets: 23.60%
Number of Catalan tweets: 5
Percentage of Catalan tweets: 0.50%
Number of Galician tweets: 0
Percentage of Galician tweets: 0.00%
Number of Basque tweets: 1
Percentage of Basque tweets: 0.10%
Number of Portuguese tweets: 6
Percentage of Portuguese tweets: 0.60%
Number of French tweets: 96
Percentage of French tweets: 9.60%
Number of Italian tweets: 30
Percentage of Italian tweets: 3.00%
Number of Romanian tweets: 1
Percentage of Romanian tweets: 0.10%
Number of Occitan tweets: 0
Percentage of Occitan tweets: 0.00%
Number of Corsican tweets: 0
Percentage of Corsican tweets: 0.00%
Number of Breton tweets: 0
Percentage of Breton tweets: 0.00%
Number of Luxembourgish tweets: 0
Percentage of Luxembourgish tweets: 0.00%
Number of Greek tweets: 7
Percentage of Greek tweets: 0.70%
Number of Turkish tweets: 21
Percentage of Turkish tweets: 2.10%
Number of German t

# 5:

In [26]:
import s3fs
s3 = s3fs.S3FileSystem(anon=False)

bucket='mudab-2025-big-data'
data_key = 'twitter-data/Eurovision-05.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

In [28]:
import json, dataclasses

tweets = []

bucket='mudab-2025-pc1262057'
data_key = 'input/Eurovision-05.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

def parse_line(line: str):
  """Try to parse a string into a Person"""
  error = 0
  try:
    parsed = json.loads(line)
    return Tweet(parsed['id'], parsed['text'], parsed['user']['name'], parsed['lang'],parsed.get('retweeted_status', False))
  except Exception as e:
    error += 1    
#   print(f"Error parsing '{line}': {e}")

with s3.open(data_location) as input: #CLIENT S3. CAN READ THE FILE LINE BY LINE
  for line in input.readlines():
    if len(line.strip()) > 0:
      tweet = parse_line(line)
      if tweet: # We add only if the tweet is not 'None'
         tweets.append(tweet)

for modeled_tweet in tweets[0:10]:
  print(modeled_tweet)

Tweet(id=995386043187769344, content='RT @tasvolverhalen: Lekker catchy voor een uitvaart. #ltu #esf18 #eurovision', author='Susanne', language='nl', retweeted_status={'created_at': 'Sat May 12 19:31:13 +0000 2018', 'id': 995385927190097926, 'id_str': '995385927190097926', 'text': 'Lekker catchy voor een uitvaart. #ltu #esf18 #eurovision', 'source': '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 54936488, 'id_str': '54936488', 'name': 'Tas', 'screen_name': 'tasvolverhalen', 'location': 'All over the land', 'url': 'http://omafiettekoop.tumblr.com', 'description': 'Kijkt graag pit tijdens den twit, copy tekst online redactie, ex-dierenarts, Tukker en gewoon verschrikkelijk goed met frikadellen. ©Dagrapportade', 'translator_type': 'none', 'protected': False, 

In [66]:
import json, dataclasses
def read_clean_tweets(input: str):
    tweets = []
    with open(input, 'r') as f:
        lines = f.readlines()
    for line in lines:
        parsed = json.loads(line)
        tweet = Tweet(**parsed)
        tweets.append(tweet)
    return tweets

def count_tweets(language: str, tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == language:
            count += 1
    return count 
    

def most_frequent_word(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        words = tweet.content.split(' ')
        for word in words:
            if word in count:
                count[word] += 1
            else:
                count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def most_frequent_words_spanish(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        if tweet.language == 'es':
            words = tweet.content.split(' ')
            for word in words:
                if word in count:
                    count[word] += 1
                else:
                    count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def get_total_tweets_count(tweets: list[Tweet]):
    return len(tweets)

def count_original_tweets(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.retweeted_status == False:
            count += 1
    return count

def count_original_tweets_spanish(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == 'es' and tweet.retweeted_status == False:
            count += 1
    return count


tweets = read_clean_tweets('clean-dataset')

english_tweets_count = count_tweets('en', tweets)
spanish_tweets_count = count_tweets('es', tweets)
catalan_tweets_count = count_tweets('ca', tweets)
galician_tweets_count = count_tweets('gl', tweets)
basque_tweets_count = count_tweets('eu', tweets)
portuguese_tweets_count = count_tweets('pt', tweets)
french_tweets_count = count_tweets('fr', tweets)
italian_tweets_count = count_tweets('it', tweets)
romanian_tweets_count = count_tweets('ro', tweets)
occitan_tweets_count = count_tweets('oc', tweets) 
corsican_tweets_count = count_tweets('co', tweets) 
breton_tweets_count = count_tweets('br', tweets) 
luxembourgish_tweets_count = count_tweets('lb', tweets)
greek_tweets_count = count_tweets('el', tweets)
turkish_tweets_count = count_tweets('tr', tweets)
german_tweets_count = count_tweets('de', tweets)
swedish_tweets_count = count_tweets('sv', tweets)
norwegian_tweets_count = count_tweets('no', tweets)
danish_tweets_count = count_tweets('da', tweets) 
finnish_tweets_count = count_tweets('fi', tweets)
dutch_tweets_count = count_tweets('nl', tweets)
ukrainian_tweets_count = count_tweets('uk', tweets) 
polish_tweets_count = count_tweets('pl', tweets) 
czech_tweets_count = count_tweets('cs', tweets) 
hungarian_tweets_count = count_tweets('hu', tweets) 
bulgarian_tweets_count = count_tweets('bg', tweets) 
albanian_tweets_count = count_tweets('sq', tweets)
bosnian_tweets_count = count_tweets('bs', tweets)
icelandic_tweets_count = count_tweets('is', tweets)
estonian_tweets_count = count_tweets('et', tweets)
maltese_tweets_count = count_tweets('mt', tweets)
montenegrin_tweets_count = count_tweets('me', tweets)
macedonian_tweets_count = count_tweets('mk', tweets)
azerbaijani_tweets_count = count_tweets('az', tweets)
lithuanian_tweets_count = count_tweets('lt', tweets)
latvian_tweets_count = count_tweets('lv', tweets)
armenian_tweets_count = count_tweets('hy', tweets)
georgian_tweets_count = count_tweets('ka', tweets)
serbian_tweets_count = count_tweets('sr', tweets)
croatian_tweets_count = count_tweets('hr', tweets)
slovenian_tweets_count = count_tweets('sl', tweets)
slovak_tweets_count = count_tweets('sk', tweets)
russian_tweets_count = count_tweets('ru', tweets) 
belarusian_tweets_count = count_tweets('be', tweets) 
hebrew_tweets_count = count_tweets('he', tweets)
unknown_tweets_count = count_tweets('und', tweets)


# PERCENTAGE FOR EACH LANGUAGE
total_tweets = get_total_tweets_count(tweets)
english_tweets_percentage = (english_tweets_count / total_tweets) * 100
spanish_tweets_percentage = (spanish_tweets_count / total_tweets) * 100
catalan_tweets_percentage = (catalan_tweets_count / total_tweets) * 100
galician_tweets_percentage = (galician_tweets_count / total_tweets) * 100
basque_tweets_percentage = (basque_tweets_count / total_tweets) * 100
portuguese_tweets_percentage = (portuguese_tweets_count / total_tweets) * 100
french_tweets_percentage = (french_tweets_count / total_tweets) * 100
italian_tweets_percentage = (italian_tweets_count / total_tweets) * 100
romanian_tweets_percentage = (romanian_tweets_count / total_tweets) * 100
occitan_tweets_percentage = (occitan_tweets_count / total_tweets) * 100
corsican_tweets_percentage = (corsican_tweets_count / total_tweets) * 100
breton_tweets_percentage = (breton_tweets_count / total_tweets) * 100
luxembourgish_tweets_percentage = (luxembourgish_tweets_count / total_tweets) * 100
greek_tweets_percentage = (greek_tweets_count / total_tweets) * 100
turkish_tweets_percentage = (turkish_tweets_count / total_tweets) * 100
german_tweets_percentage = (german_tweets_count / total_tweets) * 100
swedish_tweets_percentage = (swedish_tweets_count / total_tweets) * 100
norwegian_tweets_percentage = (norwegian_tweets_count / total_tweets) * 100
danish_tweets_percentage = (danish_tweets_count / total_tweets) * 100 
finnish_tweets_percentage = (finnish_tweets_count / total_tweets) * 100 
dutch_tweets_percentage = (dutch_tweets_count / total_tweets) * 100
ukrainian_tweets_percentage = (ukrainian_tweets_count / total_tweets) * 100
polish_tweets_percentage = (polish_tweets_count / total_tweets) * 100
czech_tweets_percentage = (czech_tweets_count / total_tweets) * 100
hungarian_tweets_percentage = (hungarian_tweets_count / total_tweets) * 100
bulgarian_tweets_percentage = (bulgarian_tweets_count / total_tweets) * 100
albanian_tweets_percentage = (albanian_tweets_count / total_tweets) * 100
bosnian_tweets_percentage = (bosnian_tweets_count / total_tweets) * 100
icelandic_tweets_percentage = (icelandic_tweets_count / total_tweets) * 100
estonian_tweets_percentage = (estonian_tweets_count / total_tweets) * 100
maltese_tweets_percentage = (maltese_tweets_count / total_tweets) * 100
montenegrin_tweets_percentage = (montenegrin_tweets_count / total_tweets) * 100
macedonian_tweets_percentage = (macedonian_tweets_count / total_tweets) * 100
azerbaijani_tweets_percentage = (azerbaijani_tweets_count / total_tweets) * 100
lithuanian_tweets_percentage = (lithuanian_tweets_count / total_tweets) * 100
latvian_tweets_percentage = (latvian_tweets_count / total_tweets) * 100
armenian_tweets_percentage = (armenian_tweets_count / total_tweets) * 100
georgian_tweets_percentage = (georgian_tweets_count / total_tweets) * 100
serbian_tweets_percentage = (serbian_tweets_count / total_tweets) * 100
croatian_tweets_percentage = (croatian_tweets_count / total_tweets) * 100
slovenian_tweets_percentage = (slovenian_tweets_count / total_tweets) * 100
slovak_tweets_percentage = (slovak_tweets_count / total_tweets) * 100
russian_tweets_percentage = (russian_tweets_count / total_tweets) * 100
belarusian_tweets_percentage = (belarusian_tweets_count / total_tweets) * 100
hebrew_tweets_percentage = (hebrew_tweets_count / total_tweets) * 100 
unknown_tweets_percentage = (unknown_tweets_count / total_tweets) * 100

tweets_count = get_total_tweets_count(tweets)
original_tweets_count = count_original_tweets(tweets)
words_by_frequence = list(most_frequent_word(tweets).items())[0:100]
spanish_tweets_count = count_original_tweets_spanish(tweets)
spanish_words_by_frequence = list(most_frequent_words_spanish(tweets).items())[0:100]

print(f"Number of English tweets: {english_tweets_count}")  #ENGLISH
print(f"Percentage of English tweets: {english_tweets_percentage:.2f}%")
print(f"Number of Spanish tweets: {spanish_tweets_count}") #SPANISH
print(f"Percentage of Spanish tweets: {spanish_tweets_percentage:.2f}%")
print(f"Number of Catalan tweets: {catalan_tweets_count}") #CATALAN
print(f"Percentage of Catalan tweets: {catalan_tweets_percentage:.2f}%")
print(f"Number of Galician tweets: {galician_tweets_count}") #GALICIAN
print(f"Percentage of Galician tweets: {galician_tweets_percentage:.2f}%")
print(f"Number of Basque tweets: {basque_tweets_count}") #BASQUE
print(f"Percentage of Basque tweets: {basque_tweets_percentage:.2f}%")
print(f"Number of Portuguese tweets: {portuguese_tweets_count}") #PORTUGUESE
print(f"Percentage of Portuguese tweets: {portuguese_tweets_percentage:.2f}%")
print(f"Number of French tweets: {french_tweets_count}") #FRENCH
print(f"Percentage of French tweets: {french_tweets_percentage:.2f}%") 
print(f"Number of Italian tweets: {italian_tweets_count}") #ITALIAN
print(f"Percentage of Italian tweets: {italian_tweets_percentage:.2f}%")
print(f"Number of Romanian tweets: {romanian_tweets_count}") #ROMANIAN
print(f"Percentage of Romanian tweets: {romanian_tweets_percentage:.2f}%")
print(f"Number of Occitan tweets: {occitan_tweets_count}") #OCCITAN
print(f"Percentage of Occitan tweets: {occitan_tweets_percentage:.2f}%")
print(f"Number of Corsican tweets: {corsican_tweets_count}") #CORSICAN
print(f"Percentage of Corsican tweets: {corsican_tweets_percentage:.2f}%")
print(f"Number of Breton tweets: {breton_tweets_count}") #BRETON
print(f"Percentage of Breton tweets: {breton_tweets_percentage:.2f}%")
print(f"Number of Luxembourgish tweets: {luxembourgish_tweets_count}") #LUXEMBOURGISH
print(f"Percentage of Luxembourgish tweets: {luxembourgish_tweets_percentage:.2f}%")
print(f"Number of Greek tweets: {greek_tweets_count}") #GREEK
print(f"Percentage of Greek tweets: {greek_tweets_percentage:.2f}%")
print(f"Number of Turkish tweets: {turkish_tweets_count}") #TURKISH
print(f"Percentage of Turkish tweets: {turkish_tweets_percentage:.2f}%")
print(f"Number of German tweets: {german_tweets_count}") #GERMAN
print(f"Percentage of German tweets: {german_tweets_percentage:.2f}%")
print(f"Number of Swedish tweets: {swedish_tweets_count}") #SWEDISH
print(f"Percentage of Swedish tweets: {swedish_tweets_percentage:.2f}%") 
print(f"Number of Norwegian tweets: {norwegian_tweets_count}") #NORWEGIAN
print(f"Percentage of Norwegian tweets: {norwegian_tweets_percentage:.2f}%") 
print(f"Number of Danish tweets: {danish_tweets_count}") #DANISH
print(f"Percentage of Danish tweets: {danish_tweets_percentage:.2f}%")
print(f"Number of Finnish tweets: {finnish_tweets_count}") #FINNISH
print(f"Percentage of Finnish tweets: {finnish_tweets_percentage:.2f}%")
print(f"Number of Dutch tweets: {dutch_tweets_count}") #DUTCH
print(f"Percentage of Dutch tweets: {dutch_tweets_percentage:.2f}%")
print(f"Number of Ukrainian tweets: {ukrainian_tweets_count}") #UKRAINIAN
print(f"Percentage of Ukrainian tweets: {ukrainian_tweets_percentage:.2f}%")
print(f"Number of Polish tweets: {polish_tweets_count}") #POLISH 
print(f"Percentage of Polish tweets: {polish_tweets_percentage:.2f}%")
print(f"Number of Czech tweets: {czech_tweets_count}") #CZECH
print(f"Percentage of Czech tweets: {czech_tweets_percentage:.2f}%")
print(f"Number of Hungarian tweets: {hungarian_tweets_count}") #HUNGARIAN
print(f"Percentage of Hungarian tweets: {hungarian_tweets_percentage:.2f}%")
print(f"Number of Bulgarian tweets: {bulgarian_tweets_count}") #BULGARIAN
print(f"Percentage of Bulgarian tweets: {bulgarian_tweets_percentage:.2f}%")
print(f"Number of Albanian tweets: {albanian_tweets_count}") #ALBANIAN
print(f"Percentage of Albanian tweets: {albanian_tweets_percentage:.2f}%")
print(f"Number of Bosnian tweets: {bosnian_tweets_count}") #BOSNIAN
print(f"Percentage of Bosnian tweets: {bosnian_tweets_percentage:.2f}%")
print(f"Number of Icelandic tweets: {icelandic_tweets_count}") #ICELANDIC
print(f"Percentage of Icelandic tweets: {icelandic_tweets_percentage:.2f}%")
print(f"Number of Estonian tweets: {estonian_tweets_count}") #ESTONIAN
print(f"Percentage of Estonian tweets: {estonian_tweets_percentage:.2f}%")
print(f"Number of Maltese tweets: {maltese_tweets_count}") #MALTESE
print(f"Percentage of Maltese tweets: {maltese_tweets_percentage:.2f}%")
print(f"Number of Montenegrin tweets: {montenegrin_tweets_count}") #MONTENEGRIN
print(f"Percentage of Montenegrin tweets: {montenegrin_tweets_percentage:.2f}%")
print(f"Number of Macedonian tweets: {macedonian_tweets_count}") #MACEDONIAN
print(f"Percentage of Macedonian tweets: {macedonian_tweets_percentage:.2f}%")
print(f"Number of Azerbaijani tweets: {azerbaijani_tweets_count}") #AZERBAIJANI
print(f"Percentage of Azerbaijani tweets: {azerbaijani_tweets_percentage:.2f}%")
print(f"Number of Lithuanian tweets: {lithuanian_tweets_count}") #LITHUANIAN
print(f"Percentage of Lithuanian tweets: {lithuanian_tweets_percentage:.2f}%")
print(f"Number of Latvian tweets: {latvian_tweets_count}") #LATVIAN
print(f"Percentage of Latvian tweets: {latvian_tweets_percentage:.2f}%")
print(f"Number of Armenian tweets: {armenian_tweets_count}") #ARMENIAN
print(f"Percentage of Armenian tweets: {armenian_tweets_percentage:.2f}%")
print(f"Number of Georgian tweets: {georgian_tweets_count}") #GEORGIAN
print(f"Percentage of Georgian tweets: {georgian_tweets_percentage:.2f}%")
print(f"Number of Serbian tweets: {serbian_tweets_count}") #SERBIAN
print(f"Percentage of Serbian tweets: {serbian_tweets_percentage:.2f}%")
print(f"Number of Croatian tweets: {croatian_tweets_count}") #CROATIAN
print(f"Percentage of Croatian tweets: {croatian_tweets_percentage:.2f}%")
print(f"Number of Slovenian tweets: {slovenian_tweets_count}") #SLOVENIAN
print(f"Percentage of Slovenian tweets: {slovenian_tweets_percentage:.2f}%")
print(f"Number of Slovak tweets: {slovak_tweets_count}") #SLOVAK
print(f"Percentage of Slovak tweets: {slovak_tweets_percentage:.2f}%") 
print(f"Number of Russian tweets: {russian_tweets_count}") #RUSSIAN
print(f"Percentage of Russian tweets: {russian_tweets_percentage:.2f}%")
print(f"Number of Belarusian tweets: {belarusian_tweets_count}") #BELARUSIAN
print(f"Percentage of Belarusian tweets: {belarusian_tweets_percentage:.2f}%")
print(f"Number of Hebrew tweets: {hebrew_tweets_count}") #HEBREW
print(f"Percentage of Hebrew tweets: {hebrew_tweets_percentage:.2f}%")
print(f"Number of Unknown tweets: {unknown_tweets_count}") #UNKNOWN
print(f"Percentage of Unknown tweets: {unknown_tweets_percentage:.2f}%")

print("----------------------------------")
print(f"Number of tweets: {tweets_count}")
print(f"Number of original tweets: {original_tweets_count}")
print(f"Number of original Spanish tweets: {spanish_tweets_count}")
print("-----------------------------------")
print(f"Frequent words: {words_by_frequence}\n")
print(f"Most frequent Spanish words: {spanish_words_by_frequence}\n")
print(tweets[0:2])

Number of English tweets: 491
Percentage of English tweets: 49.10%
Number of Spanish tweets: 94
Percentage of Spanish tweets: 23.60%
Number of Catalan tweets: 5
Percentage of Catalan tweets: 0.50%
Number of Galician tweets: 0
Percentage of Galician tweets: 0.00%
Number of Basque tweets: 1
Percentage of Basque tweets: 0.10%
Number of Portuguese tweets: 6
Percentage of Portuguese tweets: 0.60%
Number of French tweets: 96
Percentage of French tweets: 9.60%
Number of Italian tweets: 30
Percentage of Italian tweets: 3.00%
Number of Romanian tweets: 1
Percentage of Romanian tweets: 0.10%
Number of Occitan tweets: 0
Percentage of Occitan tweets: 0.00%
Number of Corsican tweets: 0
Percentage of Corsican tweets: 0.00%
Number of Breton tweets: 0
Percentage of Breton tweets: 0.00%
Number of Luxembourgish tweets: 0
Percentage of Luxembourgish tweets: 0.00%
Number of Greek tweets: 7
Percentage of Greek tweets: 0.70%
Number of Turkish tweets: 21
Percentage of Turkish tweets: 2.10%
Number of German t

# 6:

In [30]:
import s3fs
s3 = s3fs.S3FileSystem(anon=False)

bucket='mudab-2025-big-data'
data_key = 'twitter-data/Eurovision-06.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

In [32]:
import json, dataclasses

tweets = []

bucket='mudab-2025-pc1262057'
data_key = 'input/Eurovision-06.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

def parse_line(line: str):
  """Try to parse a string into a Person"""
  error = 0
  try:
    parsed = json.loads(line)
    return Tweet(parsed['id'], parsed['text'], parsed['user']['name'], parsed['lang'],parsed.get('retweeted_status', False))
  except Exception as e:
    error += 1    
#   print(f"Error parsing '{line}': {e}")

with s3.open(data_location) as input: #CLIENT S3. CAN READ THE FILE LINE BY LINE
  for line in input.readlines():
    if len(line.strip()) > 0:
      tweet = parse_line(line)
      if tweet: # We add only if the tweet is not 'None'
         tweets.append(tweet)

for modeled_tweet in tweets[0:10]:
  print(modeled_tweet)

Tweet(id=995398563206258688, content='WHY IS AUSTRALIA STILL TAKING PART?? #EUROVISION', author='Luce ʕ´• ᴥ •`ʔ', language='en', retweeted_status=False)
Tweet(id=995398562921009152, content='RT @deef4ever: Winter is coming ❄️❄️❄️\n\n#den #Tweurovision #Eurovision\xa0\xa0\xa0  #Eurovision18 #Eurovision2018 #ESC2018\xa0\xa0\xa0  #ESC #ESC18 #ESF #ESF…', author='noella broekhuizen', language='en', retweeted_status={'created_at': 'Sat May 12 20:19:21 +0000 2018', 'id': 995398041501913091, 'id_str': '995398041501913091', 'text': 'Winter is coming ❄️❄️❄️\n\n#den #Tweurovision #Eurovision\xa0\xa0\xa0  #Eurovision18 #Eurovision2018 #ESC2018\xa0\xa0\xa0  #ESC #ESC18… https://t.co/S0cm6DtNdv', 'display_text_range': [0, 140], 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'truncated': True, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None,

In [67]:
import json, dataclasses
def read_clean_tweets(input: str):
    tweets = []
    with open(input, 'r') as f:
        lines = f.readlines()
    for line in lines:
        parsed = json.loads(line)
        tweet = Tweet(**parsed)
        tweets.append(tweet)
    return tweets

def count_tweets(language: str, tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == language:
            count += 1
    return count 
    

def most_frequent_word(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        words = tweet.content.split(' ')
        for word in words:
            if word in count:
                count[word] += 1
            else:
                count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def most_frequent_words_spanish(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        if tweet.language == 'es':
            words = tweet.content.split(' ')
            for word in words:
                if word in count:
                    count[word] += 1
                else:
                    count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def get_total_tweets_count(tweets: list[Tweet]):
    return len(tweets)

def count_original_tweets(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.retweeted_status == False:
            count += 1
    return count

def count_original_tweets_spanish(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == 'es' and tweet.retweeted_status == False:
            count += 1
    return count


tweets = read_clean_tweets('clean-dataset')

english_tweets_count = count_tweets('en', tweets)
spanish_tweets_count = count_tweets('es', tweets)
catalan_tweets_count = count_tweets('ca', tweets)
galician_tweets_count = count_tweets('gl', tweets)
basque_tweets_count = count_tweets('eu', tweets)
portuguese_tweets_count = count_tweets('pt', tweets)
french_tweets_count = count_tweets('fr', tweets)
italian_tweets_count = count_tweets('it', tweets)
romanian_tweets_count = count_tweets('ro', tweets)
occitan_tweets_count = count_tweets('oc', tweets) 
corsican_tweets_count = count_tweets('co', tweets) 
breton_tweets_count = count_tweets('br', tweets) 
luxembourgish_tweets_count = count_tweets('lb', tweets)
greek_tweets_count = count_tweets('el', tweets)
turkish_tweets_count = count_tweets('tr', tweets)
german_tweets_count = count_tweets('de', tweets)
swedish_tweets_count = count_tweets('sv', tweets)
norwegian_tweets_count = count_tweets('no', tweets)
danish_tweets_count = count_tweets('da', tweets) 
finnish_tweets_count = count_tweets('fi', tweets)
dutch_tweets_count = count_tweets('nl', tweets)
ukrainian_tweets_count = count_tweets('uk', tweets) 
polish_tweets_count = count_tweets('pl', tweets) 
czech_tweets_count = count_tweets('cs', tweets) 
hungarian_tweets_count = count_tweets('hu', tweets) 
bulgarian_tweets_count = count_tweets('bg', tweets) 
albanian_tweets_count = count_tweets('sq', tweets)
bosnian_tweets_count = count_tweets('bs', tweets)
icelandic_tweets_count = count_tweets('is', tweets)
estonian_tweets_count = count_tweets('et', tweets)
maltese_tweets_count = count_tweets('mt', tweets)
montenegrin_tweets_count = count_tweets('me', tweets)
macedonian_tweets_count = count_tweets('mk', tweets)
azerbaijani_tweets_count = count_tweets('az', tweets)
lithuanian_tweets_count = count_tweets('lt', tweets)
latvian_tweets_count = count_tweets('lv', tweets)
armenian_tweets_count = count_tweets('hy', tweets)
georgian_tweets_count = count_tweets('ka', tweets)
serbian_tweets_count = count_tweets('sr', tweets)
croatian_tweets_count = count_tweets('hr', tweets)
slovenian_tweets_count = count_tweets('sl', tweets)
slovak_tweets_count = count_tweets('sk', tweets)
russian_tweets_count = count_tweets('ru', tweets) 
belarusian_tweets_count = count_tweets('be', tweets) 
hebrew_tweets_count = count_tweets('he', tweets)
unknown_tweets_count = count_tweets('und', tweets)


# PERCENTAGE FOR EACH LANGUAGE
total_tweets = get_total_tweets_count(tweets)
english_tweets_percentage = (english_tweets_count / total_tweets) * 100
spanish_tweets_percentage = (spanish_tweets_count / total_tweets) * 100
catalan_tweets_percentage = (catalan_tweets_count / total_tweets) * 100
galician_tweets_percentage = (galician_tweets_count / total_tweets) * 100
basque_tweets_percentage = (basque_tweets_count / total_tweets) * 100
portuguese_tweets_percentage = (portuguese_tweets_count / total_tweets) * 100
french_tweets_percentage = (french_tweets_count / total_tweets) * 100
italian_tweets_percentage = (italian_tweets_count / total_tweets) * 100
romanian_tweets_percentage = (romanian_tweets_count / total_tweets) * 100
occitan_tweets_percentage = (occitan_tweets_count / total_tweets) * 100
corsican_tweets_percentage = (corsican_tweets_count / total_tweets) * 100
breton_tweets_percentage = (breton_tweets_count / total_tweets) * 100
luxembourgish_tweets_percentage = (luxembourgish_tweets_count / total_tweets) * 100
greek_tweets_percentage = (greek_tweets_count / total_tweets) * 100
turkish_tweets_percentage = (turkish_tweets_count / total_tweets) * 100
german_tweets_percentage = (german_tweets_count / total_tweets) * 100
swedish_tweets_percentage = (swedish_tweets_count / total_tweets) * 100
norwegian_tweets_percentage = (norwegian_tweets_count / total_tweets) * 100
danish_tweets_percentage = (danish_tweets_count / total_tweets) * 100 
finnish_tweets_percentage = (finnish_tweets_count / total_tweets) * 100 
dutch_tweets_percentage = (dutch_tweets_count / total_tweets) * 100
ukrainian_tweets_percentage = (ukrainian_tweets_count / total_tweets) * 100
polish_tweets_percentage = (polish_tweets_count / total_tweets) * 100
czech_tweets_percentage = (czech_tweets_count / total_tweets) * 100
hungarian_tweets_percentage = (hungarian_tweets_count / total_tweets) * 100
bulgarian_tweets_percentage = (bulgarian_tweets_count / total_tweets) * 100
albanian_tweets_percentage = (albanian_tweets_count / total_tweets) * 100
bosnian_tweets_percentage = (bosnian_tweets_count / total_tweets) * 100
icelandic_tweets_percentage = (icelandic_tweets_count / total_tweets) * 100
estonian_tweets_percentage = (estonian_tweets_count / total_tweets) * 100
maltese_tweets_percentage = (maltese_tweets_count / total_tweets) * 100
montenegrin_tweets_percentage = (montenegrin_tweets_count / total_tweets) * 100
macedonian_tweets_percentage = (macedonian_tweets_count / total_tweets) * 100
azerbaijani_tweets_percentage = (azerbaijani_tweets_count / total_tweets) * 100
lithuanian_tweets_percentage = (lithuanian_tweets_count / total_tweets) * 100
latvian_tweets_percentage = (latvian_tweets_count / total_tweets) * 100
armenian_tweets_percentage = (armenian_tweets_count / total_tweets) * 100
georgian_tweets_percentage = (georgian_tweets_count / total_tweets) * 100
serbian_tweets_percentage = (serbian_tweets_count / total_tweets) * 100
croatian_tweets_percentage = (croatian_tweets_count / total_tweets) * 100
slovenian_tweets_percentage = (slovenian_tweets_count / total_tweets) * 100
slovak_tweets_percentage = (slovak_tweets_count / total_tweets) * 100
russian_tweets_percentage = (russian_tweets_count / total_tweets) * 100
belarusian_tweets_percentage = (belarusian_tweets_count / total_tweets) * 100
hebrew_tweets_percentage = (hebrew_tweets_count / total_tweets) * 100 
unknown_tweets_percentage = (unknown_tweets_count / total_tweets) * 100

tweets_count = get_total_tweets_count(tweets)
original_tweets_count = count_original_tweets(tweets)
words_by_frequence = list(most_frequent_word(tweets).items())[0:100]
spanish_tweets_count = count_original_tweets_spanish(tweets)
spanish_words_by_frequence = list(most_frequent_words_spanish(tweets).items())[0:100]

print(f"Number of English tweets: {english_tweets_count}")  #ENGLISH
print(f"Percentage of English tweets: {english_tweets_percentage:.2f}%")
print(f"Number of Spanish tweets: {spanish_tweets_count}") #SPANISH
print(f"Percentage of Spanish tweets: {spanish_tweets_percentage:.2f}%")
print(f"Number of Catalan tweets: {catalan_tweets_count}") #CATALAN
print(f"Percentage of Catalan tweets: {catalan_tweets_percentage:.2f}%")
print(f"Number of Galician tweets: {galician_tweets_count}") #GALICIAN
print(f"Percentage of Galician tweets: {galician_tweets_percentage:.2f}%")
print(f"Number of Basque tweets: {basque_tweets_count}") #BASQUE
print(f"Percentage of Basque tweets: {basque_tweets_percentage:.2f}%")
print(f"Number of Portuguese tweets: {portuguese_tweets_count}") #PORTUGUESE
print(f"Percentage of Portuguese tweets: {portuguese_tweets_percentage:.2f}%")
print(f"Number of French tweets: {french_tweets_count}") #FRENCH
print(f"Percentage of French tweets: {french_tweets_percentage:.2f}%") 
print(f"Number of Italian tweets: {italian_tweets_count}") #ITALIAN
print(f"Percentage of Italian tweets: {italian_tweets_percentage:.2f}%")
print(f"Number of Romanian tweets: {romanian_tweets_count}") #ROMANIAN
print(f"Percentage of Romanian tweets: {romanian_tweets_percentage:.2f}%")
print(f"Number of Occitan tweets: {occitan_tweets_count}") #OCCITAN
print(f"Percentage of Occitan tweets: {occitan_tweets_percentage:.2f}%")
print(f"Number of Corsican tweets: {corsican_tweets_count}") #CORSICAN
print(f"Percentage of Corsican tweets: {corsican_tweets_percentage:.2f}%")
print(f"Number of Breton tweets: {breton_tweets_count}") #BRETON
print(f"Percentage of Breton tweets: {breton_tweets_percentage:.2f}%")
print(f"Number of Luxembourgish tweets: {luxembourgish_tweets_count}") #LUXEMBOURGISH
print(f"Percentage of Luxembourgish tweets: {luxembourgish_tweets_percentage:.2f}%")
print(f"Number of Greek tweets: {greek_tweets_count}") #GREEK
print(f"Percentage of Greek tweets: {greek_tweets_percentage:.2f}%")
print(f"Number of Turkish tweets: {turkish_tweets_count}") #TURKISH
print(f"Percentage of Turkish tweets: {turkish_tweets_percentage:.2f}%")
print(f"Number of German tweets: {german_tweets_count}") #GERMAN
print(f"Percentage of German tweets: {german_tweets_percentage:.2f}%")
print(f"Number of Swedish tweets: {swedish_tweets_count}") #SWEDISH
print(f"Percentage of Swedish tweets: {swedish_tweets_percentage:.2f}%") 
print(f"Number of Norwegian tweets: {norwegian_tweets_count}") #NORWEGIAN
print(f"Percentage of Norwegian tweets: {norwegian_tweets_percentage:.2f}%") 
print(f"Number of Danish tweets: {danish_tweets_count}") #DANISH
print(f"Percentage of Danish tweets: {danish_tweets_percentage:.2f}%")
print(f"Number of Finnish tweets: {finnish_tweets_count}") #FINNISH
print(f"Percentage of Finnish tweets: {finnish_tweets_percentage:.2f}%")
print(f"Number of Dutch tweets: {dutch_tweets_count}") #DUTCH
print(f"Percentage of Dutch tweets: {dutch_tweets_percentage:.2f}%")
print(f"Number of Ukrainian tweets: {ukrainian_tweets_count}") #UKRAINIAN
print(f"Percentage of Ukrainian tweets: {ukrainian_tweets_percentage:.2f}%")
print(f"Number of Polish tweets: {polish_tweets_count}") #POLISH 
print(f"Percentage of Polish tweets: {polish_tweets_percentage:.2f}%")
print(f"Number of Czech tweets: {czech_tweets_count}") #CZECH
print(f"Percentage of Czech tweets: {czech_tweets_percentage:.2f}%")
print(f"Number of Hungarian tweets: {hungarian_tweets_count}") #HUNGARIAN
print(f"Percentage of Hungarian tweets: {hungarian_tweets_percentage:.2f}%")
print(f"Number of Bulgarian tweets: {bulgarian_tweets_count}") #BULGARIAN
print(f"Percentage of Bulgarian tweets: {bulgarian_tweets_percentage:.2f}%")
print(f"Number of Albanian tweets: {albanian_tweets_count}") #ALBANIAN
print(f"Percentage of Albanian tweets: {albanian_tweets_percentage:.2f}%")
print(f"Number of Bosnian tweets: {bosnian_tweets_count}") #BOSNIAN
print(f"Percentage of Bosnian tweets: {bosnian_tweets_percentage:.2f}%")
print(f"Number of Icelandic tweets: {icelandic_tweets_count}") #ICELANDIC
print(f"Percentage of Icelandic tweets: {icelandic_tweets_percentage:.2f}%")
print(f"Number of Estonian tweets: {estonian_tweets_count}") #ESTONIAN
print(f"Percentage of Estonian tweets: {estonian_tweets_percentage:.2f}%")
print(f"Number of Maltese tweets: {maltese_tweets_count}") #MALTESE
print(f"Percentage of Maltese tweets: {maltese_tweets_percentage:.2f}%")
print(f"Number of Montenegrin tweets: {montenegrin_tweets_count}") #MONTENEGRIN
print(f"Percentage of Montenegrin tweets: {montenegrin_tweets_percentage:.2f}%")
print(f"Number of Macedonian tweets: {macedonian_tweets_count}") #MACEDONIAN
print(f"Percentage of Macedonian tweets: {macedonian_tweets_percentage:.2f}%")
print(f"Number of Azerbaijani tweets: {azerbaijani_tweets_count}") #AZERBAIJANI
print(f"Percentage of Azerbaijani tweets: {azerbaijani_tweets_percentage:.2f}%")
print(f"Number of Lithuanian tweets: {lithuanian_tweets_count}") #LITHUANIAN
print(f"Percentage of Lithuanian tweets: {lithuanian_tweets_percentage:.2f}%")
print(f"Number of Latvian tweets: {latvian_tweets_count}") #LATVIAN
print(f"Percentage of Latvian tweets: {latvian_tweets_percentage:.2f}%")
print(f"Number of Armenian tweets: {armenian_tweets_count}") #ARMENIAN
print(f"Percentage of Armenian tweets: {armenian_tweets_percentage:.2f}%")
print(f"Number of Georgian tweets: {georgian_tweets_count}") #GEORGIAN
print(f"Percentage of Georgian tweets: {georgian_tweets_percentage:.2f}%")
print(f"Number of Serbian tweets: {serbian_tweets_count}") #SERBIAN
print(f"Percentage of Serbian tweets: {serbian_tweets_percentage:.2f}%")
print(f"Number of Croatian tweets: {croatian_tweets_count}") #CROATIAN
print(f"Percentage of Croatian tweets: {croatian_tweets_percentage:.2f}%")
print(f"Number of Slovenian tweets: {slovenian_tweets_count}") #SLOVENIAN
print(f"Percentage of Slovenian tweets: {slovenian_tweets_percentage:.2f}%")
print(f"Number of Slovak tweets: {slovak_tweets_count}") #SLOVAK
print(f"Percentage of Slovak tweets: {slovak_tweets_percentage:.2f}%") 
print(f"Number of Russian tweets: {russian_tweets_count}") #RUSSIAN
print(f"Percentage of Russian tweets: {russian_tweets_percentage:.2f}%")
print(f"Number of Belarusian tweets: {belarusian_tweets_count}") #BELARUSIAN
print(f"Percentage of Belarusian tweets: {belarusian_tweets_percentage:.2f}%")
print(f"Number of Hebrew tweets: {hebrew_tweets_count}") #HEBREW
print(f"Percentage of Hebrew tweets: {hebrew_tweets_percentage:.2f}%")
print(f"Number of Unknown tweets: {unknown_tweets_count}") #UNKNOWN
print(f"Percentage of Unknown tweets: {unknown_tweets_percentage:.2f}%")

print("----------------------------------")
print(f"Number of tweets: {tweets_count}")
print(f"Number of original tweets: {original_tweets_count}")
print(f"Number of original Spanish tweets: {spanish_tweets_count}")
print("-----------------------------------")
print(f"Frequent words: {words_by_frequence}\n")
print(f"Most frequent Spanish words: {spanish_words_by_frequence}\n")
print(tweets[0:2])

Number of English tweets: 491
Percentage of English tweets: 49.10%
Number of Spanish tweets: 94
Percentage of Spanish tweets: 23.60%
Number of Catalan tweets: 5
Percentage of Catalan tweets: 0.50%
Number of Galician tweets: 0
Percentage of Galician tweets: 0.00%
Number of Basque tweets: 1
Percentage of Basque tweets: 0.10%
Number of Portuguese tweets: 6
Percentage of Portuguese tweets: 0.60%
Number of French tweets: 96
Percentage of French tweets: 9.60%
Number of Italian tweets: 30
Percentage of Italian tweets: 3.00%
Number of Romanian tweets: 1
Percentage of Romanian tweets: 0.10%
Number of Occitan tweets: 0
Percentage of Occitan tweets: 0.00%
Number of Corsican tweets: 0
Percentage of Corsican tweets: 0.00%
Number of Breton tweets: 0
Percentage of Breton tweets: 0.00%
Number of Luxembourgish tweets: 0
Percentage of Luxembourgish tweets: 0.00%
Number of Greek tweets: 7
Percentage of Greek tweets: 0.70%
Number of Turkish tweets: 21
Percentage of Turkish tweets: 2.10%
Number of German t

# 7:

In [34]:
import s3fs
s3 = s3fs.S3FileSystem(anon=False)

bucket='mudab-2025-big-data'
data_key = 'twitter-data/Eurovision-07.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

In [35]:
import json, dataclasses

tweets = []

bucket='mudab-2025-pc1262057'
data_key = 'input/Eurovision-07.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

def parse_line(line: str):
  """Try to parse a string into a Person"""
  error = 0
  try:
    parsed = json.loads(line)
    return Tweet(parsed['id'], parsed['text'], parsed['user']['name'], parsed['lang'],parsed.get('retweeted_status', False))
  except Exception as e:
    error += 1    
#   print(f"Error parsing '{line}': {e}")

with s3.open(data_location) as input: #CLIENT S3. CAN READ THE FILE LINE BY LINE
  for line in input.readlines():
    if len(line.strip()) > 0:
      tweet = parse_line(line)
      if tweet: # We add only if the tweet is not 'None'
         tweets.append(tweet)

for modeled_tweet in tweets[0:10]:
  print(modeled_tweet)

Tweet(id=995410604914610177, content='RT @DaniCarMa_96: YEEEHAAA #Eurovision #NED https://t.co/Gg8wMrCPUa', author='Emelie @ death™', language='en', retweeted_status={'created_at': 'Sat May 12 20:55:13 +0000 2018', 'id': 995407069342785536, 'id_str': '995407069342785536', 'text': 'YEEEHAAA #Eurovision #NED https://t.co/Gg8wMrCPUa', 'display_text_range': [0, 25], 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 326443832, 'id_str': '326443832', 'name': 'Danié', 'screen_name': 'DaniCarMa_96', 'location': 'Middle Earth', 'url': None, 'description': '21. Estudiante de Química + Ingeniería de Materiales en la US. Former IB student. Turn to page three hundred and ninety-four.', 'translator_type': 'regular', 'protected': False, 'verified': False, 'foll

In [68]:
import json, dataclasses
def read_clean_tweets(input: str):
    tweets = []
    with open(input, 'r') as f:
        lines = f.readlines()
    for line in lines:
        parsed = json.loads(line)
        tweet = Tweet(**parsed)
        tweets.append(tweet)
    return tweets

def count_tweets(language: str, tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == language:
            count += 1
    return count 
    

def most_frequent_word(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        words = tweet.content.split(' ')
        for word in words:
            if word in count:
                count[word] += 1
            else:
                count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def most_frequent_words_spanish(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        if tweet.language == 'es':
            words = tweet.content.split(' ')
            for word in words:
                if word in count:
                    count[word] += 1
                else:
                    count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def get_total_tweets_count(tweets: list[Tweet]):
    return len(tweets)

def count_original_tweets(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.retweeted_status == False:
            count += 1
    return count

def count_original_tweets_spanish(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == 'es' and tweet.retweeted_status == False:
            count += 1
    return count


tweets = read_clean_tweets('clean-dataset')

english_tweets_count = count_tweets('en', tweets)
spanish_tweets_count = count_tweets('es', tweets)
catalan_tweets_count = count_tweets('ca', tweets)
galician_tweets_count = count_tweets('gl', tweets)
basque_tweets_count = count_tweets('eu', tweets)
portuguese_tweets_count = count_tweets('pt', tweets)
french_tweets_count = count_tweets('fr', tweets)
italian_tweets_count = count_tweets('it', tweets)
romanian_tweets_count = count_tweets('ro', tweets)
occitan_tweets_count = count_tweets('oc', tweets) 
corsican_tweets_count = count_tweets('co', tweets) 
breton_tweets_count = count_tweets('br', tweets) 
luxembourgish_tweets_count = count_tweets('lb', tweets)
greek_tweets_count = count_tweets('el', tweets)
turkish_tweets_count = count_tweets('tr', tweets)
german_tweets_count = count_tweets('de', tweets)
swedish_tweets_count = count_tweets('sv', tweets)
norwegian_tweets_count = count_tweets('no', tweets)
danish_tweets_count = count_tweets('da', tweets) 
finnish_tweets_count = count_tweets('fi', tweets)
dutch_tweets_count = count_tweets('nl', tweets)
ukrainian_tweets_count = count_tweets('uk', tweets) 
polish_tweets_count = count_tweets('pl', tweets) 
czech_tweets_count = count_tweets('cs', tweets) 
hungarian_tweets_count = count_tweets('hu', tweets) 
bulgarian_tweets_count = count_tweets('bg', tweets) 
albanian_tweets_count = count_tweets('sq', tweets)
bosnian_tweets_count = count_tweets('bs', tweets)
icelandic_tweets_count = count_tweets('is', tweets)
estonian_tweets_count = count_tweets('et', tweets)
maltese_tweets_count = count_tweets('mt', tweets)
montenegrin_tweets_count = count_tweets('me', tweets)
macedonian_tweets_count = count_tweets('mk', tweets)
azerbaijani_tweets_count = count_tweets('az', tweets)
lithuanian_tweets_count = count_tweets('lt', tweets)
latvian_tweets_count = count_tweets('lv', tweets)
armenian_tweets_count = count_tweets('hy', tweets)
georgian_tweets_count = count_tweets('ka', tweets)
serbian_tweets_count = count_tweets('sr', tweets)
croatian_tweets_count = count_tweets('hr', tweets)
slovenian_tweets_count = count_tweets('sl', tweets)
slovak_tweets_count = count_tweets('sk', tweets)
russian_tweets_count = count_tweets('ru', tweets) 
belarusian_tweets_count = count_tweets('be', tweets) 
hebrew_tweets_count = count_tweets('he', tweets)
unknown_tweets_count = count_tweets('und', tweets)


# PERCENTAGE FOR EACH LANGUAGE
total_tweets = get_total_tweets_count(tweets)
english_tweets_percentage = (english_tweets_count / total_tweets) * 100
spanish_tweets_percentage = (spanish_tweets_count / total_tweets) * 100
catalan_tweets_percentage = (catalan_tweets_count / total_tweets) * 100
galician_tweets_percentage = (galician_tweets_count / total_tweets) * 100
basque_tweets_percentage = (basque_tweets_count / total_tweets) * 100
portuguese_tweets_percentage = (portuguese_tweets_count / total_tweets) * 100
french_tweets_percentage = (french_tweets_count / total_tweets) * 100
italian_tweets_percentage = (italian_tweets_count / total_tweets) * 100
romanian_tweets_percentage = (romanian_tweets_count / total_tweets) * 100
occitan_tweets_percentage = (occitan_tweets_count / total_tweets) * 100
corsican_tweets_percentage = (corsican_tweets_count / total_tweets) * 100
breton_tweets_percentage = (breton_tweets_count / total_tweets) * 100
luxembourgish_tweets_percentage = (luxembourgish_tweets_count / total_tweets) * 100
greek_tweets_percentage = (greek_tweets_count / total_tweets) * 100
turkish_tweets_percentage = (turkish_tweets_count / total_tweets) * 100
german_tweets_percentage = (german_tweets_count / total_tweets) * 100
swedish_tweets_percentage = (swedish_tweets_count / total_tweets) * 100
norwegian_tweets_percentage = (norwegian_tweets_count / total_tweets) * 100
danish_tweets_percentage = (danish_tweets_count / total_tweets) * 100 
finnish_tweets_percentage = (finnish_tweets_count / total_tweets) * 100 
dutch_tweets_percentage = (dutch_tweets_count / total_tweets) * 100
ukrainian_tweets_percentage = (ukrainian_tweets_count / total_tweets) * 100
polish_tweets_percentage = (polish_tweets_count / total_tweets) * 100
czech_tweets_percentage = (czech_tweets_count / total_tweets) * 100
hungarian_tweets_percentage = (hungarian_tweets_count / total_tweets) * 100
bulgarian_tweets_percentage = (bulgarian_tweets_count / total_tweets) * 100
albanian_tweets_percentage = (albanian_tweets_count / total_tweets) * 100
bosnian_tweets_percentage = (bosnian_tweets_count / total_tweets) * 100
icelandic_tweets_percentage = (icelandic_tweets_count / total_tweets) * 100
estonian_tweets_percentage = (estonian_tweets_count / total_tweets) * 100
maltese_tweets_percentage = (maltese_tweets_count / total_tweets) * 100
montenegrin_tweets_percentage = (montenegrin_tweets_count / total_tweets) * 100
macedonian_tweets_percentage = (macedonian_tweets_count / total_tweets) * 100
azerbaijani_tweets_percentage = (azerbaijani_tweets_count / total_tweets) * 100
lithuanian_tweets_percentage = (lithuanian_tweets_count / total_tweets) * 100
latvian_tweets_percentage = (latvian_tweets_count / total_tweets) * 100
armenian_tweets_percentage = (armenian_tweets_count / total_tweets) * 100
georgian_tweets_percentage = (georgian_tweets_count / total_tweets) * 100
serbian_tweets_percentage = (serbian_tweets_count / total_tweets) * 100
croatian_tweets_percentage = (croatian_tweets_count / total_tweets) * 100
slovenian_tweets_percentage = (slovenian_tweets_count / total_tweets) * 100
slovak_tweets_percentage = (slovak_tweets_count / total_tweets) * 100
russian_tweets_percentage = (russian_tweets_count / total_tweets) * 100
belarusian_tweets_percentage = (belarusian_tweets_count / total_tweets) * 100
hebrew_tweets_percentage = (hebrew_tweets_count / total_tweets) * 100 
unknown_tweets_percentage = (unknown_tweets_count / total_tweets) * 100

tweets_count = get_total_tweets_count(tweets)
original_tweets_count = count_original_tweets(tweets)
words_by_frequence = list(most_frequent_word(tweets).items())[0:100]
spanish_tweets_count = count_original_tweets_spanish(tweets)
spanish_words_by_frequence = list(most_frequent_words_spanish(tweets).items())[0:100]

print(f"Number of English tweets: {english_tweets_count}")  #ENGLISH
print(f"Percentage of English tweets: {english_tweets_percentage:.2f}%")
print(f"Number of Spanish tweets: {spanish_tweets_count}") #SPANISH
print(f"Percentage of Spanish tweets: {spanish_tweets_percentage:.2f}%")
print(f"Number of Catalan tweets: {catalan_tweets_count}") #CATALAN
print(f"Percentage of Catalan tweets: {catalan_tweets_percentage:.2f}%")
print(f"Number of Galician tweets: {galician_tweets_count}") #GALICIAN
print(f"Percentage of Galician tweets: {galician_tweets_percentage:.2f}%")
print(f"Number of Basque tweets: {basque_tweets_count}") #BASQUE
print(f"Percentage of Basque tweets: {basque_tweets_percentage:.2f}%")
print(f"Number of Portuguese tweets: {portuguese_tweets_count}") #PORTUGUESE
print(f"Percentage of Portuguese tweets: {portuguese_tweets_percentage:.2f}%")
print(f"Number of French tweets: {french_tweets_count}") #FRENCH
print(f"Percentage of French tweets: {french_tweets_percentage:.2f}%") 
print(f"Number of Italian tweets: {italian_tweets_count}") #ITALIAN
print(f"Percentage of Italian tweets: {italian_tweets_percentage:.2f}%")
print(f"Number of Romanian tweets: {romanian_tweets_count}") #ROMANIAN
print(f"Percentage of Romanian tweets: {romanian_tweets_percentage:.2f}%")
print(f"Number of Occitan tweets: {occitan_tweets_count}") #OCCITAN
print(f"Percentage of Occitan tweets: {occitan_tweets_percentage:.2f}%")
print(f"Number of Corsican tweets: {corsican_tweets_count}") #CORSICAN
print(f"Percentage of Corsican tweets: {corsican_tweets_percentage:.2f}%")
print(f"Number of Breton tweets: {breton_tweets_count}") #BRETON
print(f"Percentage of Breton tweets: {breton_tweets_percentage:.2f}%")
print(f"Number of Luxembourgish tweets: {luxembourgish_tweets_count}") #LUXEMBOURGISH
print(f"Percentage of Luxembourgish tweets: {luxembourgish_tweets_percentage:.2f}%")
print(f"Number of Greek tweets: {greek_tweets_count}") #GREEK
print(f"Percentage of Greek tweets: {greek_tweets_percentage:.2f}%")
print(f"Number of Turkish tweets: {turkish_tweets_count}") #TURKISH
print(f"Percentage of Turkish tweets: {turkish_tweets_percentage:.2f}%")
print(f"Number of German tweets: {german_tweets_count}") #GERMAN
print(f"Percentage of German tweets: {german_tweets_percentage:.2f}%")
print(f"Number of Swedish tweets: {swedish_tweets_count}") #SWEDISH
print(f"Percentage of Swedish tweets: {swedish_tweets_percentage:.2f}%") 
print(f"Number of Norwegian tweets: {norwegian_tweets_count}") #NORWEGIAN
print(f"Percentage of Norwegian tweets: {norwegian_tweets_percentage:.2f}%") 
print(f"Number of Danish tweets: {danish_tweets_count}") #DANISH
print(f"Percentage of Danish tweets: {danish_tweets_percentage:.2f}%")
print(f"Number of Finnish tweets: {finnish_tweets_count}") #FINNISH
print(f"Percentage of Finnish tweets: {finnish_tweets_percentage:.2f}%")
print(f"Number of Dutch tweets: {dutch_tweets_count}") #DUTCH
print(f"Percentage of Dutch tweets: {dutch_tweets_percentage:.2f}%")
print(f"Number of Ukrainian tweets: {ukrainian_tweets_count}") #UKRAINIAN
print(f"Percentage of Ukrainian tweets: {ukrainian_tweets_percentage:.2f}%")
print(f"Number of Polish tweets: {polish_tweets_count}") #POLISH 
print(f"Percentage of Polish tweets: {polish_tweets_percentage:.2f}%")
print(f"Number of Czech tweets: {czech_tweets_count}") #CZECH
print(f"Percentage of Czech tweets: {czech_tweets_percentage:.2f}%")
print(f"Number of Hungarian tweets: {hungarian_tweets_count}") #HUNGARIAN
print(f"Percentage of Hungarian tweets: {hungarian_tweets_percentage:.2f}%")
print(f"Number of Bulgarian tweets: {bulgarian_tweets_count}") #BULGARIAN
print(f"Percentage of Bulgarian tweets: {bulgarian_tweets_percentage:.2f}%")
print(f"Number of Albanian tweets: {albanian_tweets_count}") #ALBANIAN
print(f"Percentage of Albanian tweets: {albanian_tweets_percentage:.2f}%")
print(f"Number of Bosnian tweets: {bosnian_tweets_count}") #BOSNIAN
print(f"Percentage of Bosnian tweets: {bosnian_tweets_percentage:.2f}%")
print(f"Number of Icelandic tweets: {icelandic_tweets_count}") #ICELANDIC
print(f"Percentage of Icelandic tweets: {icelandic_tweets_percentage:.2f}%")
print(f"Number of Estonian tweets: {estonian_tweets_count}") #ESTONIAN
print(f"Percentage of Estonian tweets: {estonian_tweets_percentage:.2f}%")
print(f"Number of Maltese tweets: {maltese_tweets_count}") #MALTESE
print(f"Percentage of Maltese tweets: {maltese_tweets_percentage:.2f}%")
print(f"Number of Montenegrin tweets: {montenegrin_tweets_count}") #MONTENEGRIN
print(f"Percentage of Montenegrin tweets: {montenegrin_tweets_percentage:.2f}%")
print(f"Number of Macedonian tweets: {macedonian_tweets_count}") #MACEDONIAN
print(f"Percentage of Macedonian tweets: {macedonian_tweets_percentage:.2f}%")
print(f"Number of Azerbaijani tweets: {azerbaijani_tweets_count}") #AZERBAIJANI
print(f"Percentage of Azerbaijani tweets: {azerbaijani_tweets_percentage:.2f}%")
print(f"Number of Lithuanian tweets: {lithuanian_tweets_count}") #LITHUANIAN
print(f"Percentage of Lithuanian tweets: {lithuanian_tweets_percentage:.2f}%")
print(f"Number of Latvian tweets: {latvian_tweets_count}") #LATVIAN
print(f"Percentage of Latvian tweets: {latvian_tweets_percentage:.2f}%")
print(f"Number of Armenian tweets: {armenian_tweets_count}") #ARMENIAN
print(f"Percentage of Armenian tweets: {armenian_tweets_percentage:.2f}%")
print(f"Number of Georgian tweets: {georgian_tweets_count}") #GEORGIAN
print(f"Percentage of Georgian tweets: {georgian_tweets_percentage:.2f}%")
print(f"Number of Serbian tweets: {serbian_tweets_count}") #SERBIAN
print(f"Percentage of Serbian tweets: {serbian_tweets_percentage:.2f}%")
print(f"Number of Croatian tweets: {croatian_tweets_count}") #CROATIAN
print(f"Percentage of Croatian tweets: {croatian_tweets_percentage:.2f}%")
print(f"Number of Slovenian tweets: {slovenian_tweets_count}") #SLOVENIAN
print(f"Percentage of Slovenian tweets: {slovenian_tweets_percentage:.2f}%")
print(f"Number of Slovak tweets: {slovak_tweets_count}") #SLOVAK
print(f"Percentage of Slovak tweets: {slovak_tweets_percentage:.2f}%") 
print(f"Number of Russian tweets: {russian_tweets_count}") #RUSSIAN
print(f"Percentage of Russian tweets: {russian_tweets_percentage:.2f}%")
print(f"Number of Belarusian tweets: {belarusian_tweets_count}") #BELARUSIAN
print(f"Percentage of Belarusian tweets: {belarusian_tweets_percentage:.2f}%")
print(f"Number of Hebrew tweets: {hebrew_tweets_count}") #HEBREW
print(f"Percentage of Hebrew tweets: {hebrew_tweets_percentage:.2f}%")
print(f"Number of Unknown tweets: {unknown_tweets_count}") #UNKNOWN
print(f"Percentage of Unknown tweets: {unknown_tweets_percentage:.2f}%")

print("----------------------------------")
print(f"Number of tweets: {tweets_count}")
print(f"Number of original tweets: {original_tweets_count}")
print(f"Number of original Spanish tweets: {spanish_tweets_count}")
print("-----------------------------------")
print(f"Frequent words: {words_by_frequence}\n")
print(f"Most frequent Spanish words: {spanish_words_by_frequence}\n")
print(tweets[0:2])

Number of English tweets: 491
Percentage of English tweets: 49.10%
Number of Spanish tweets: 94
Percentage of Spanish tweets: 23.60%
Number of Catalan tweets: 5
Percentage of Catalan tweets: 0.50%
Number of Galician tweets: 0
Percentage of Galician tweets: 0.00%
Number of Basque tweets: 1
Percentage of Basque tweets: 0.10%
Number of Portuguese tweets: 6
Percentage of Portuguese tweets: 0.60%
Number of French tweets: 96
Percentage of French tweets: 9.60%
Number of Italian tweets: 30
Percentage of Italian tweets: 3.00%
Number of Romanian tweets: 1
Percentage of Romanian tweets: 0.10%
Number of Occitan tweets: 0
Percentage of Occitan tweets: 0.00%
Number of Corsican tweets: 0
Percentage of Corsican tweets: 0.00%
Number of Breton tweets: 0
Percentage of Breton tweets: 0.00%
Number of Luxembourgish tweets: 0
Percentage of Luxembourgish tweets: 0.00%
Number of Greek tweets: 7
Percentage of Greek tweets: 0.70%
Number of Turkish tweets: 21
Percentage of Turkish tweets: 2.10%
Number of German t

# 8:

In [37]:
import s3fs
s3 = s3fs.S3FileSystem(anon=False)

bucket='mudab-2025-big-data'
data_key = 'twitter-data/Eurovision-08.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

In [38]:
import json, dataclasses

tweets = []

bucket='mudab-2025-pc1262057'
data_key = 'input/Eurovision-08.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

def parse_line(line: str):
  """Try to parse a string into a Person"""
  error = 0
  try:
    parsed = json.loads(line)
    return Tweet(parsed['id'], parsed['text'], parsed['user']['name'], parsed['lang'],parsed.get('retweeted_status', False))
  except Exception as e:
    error += 1    
#   print(f"Error parsing '{line}': {e}")

with s3.open(data_location) as input: #CLIENT S3. CAN READ THE FILE LINE BY LINE
  for line in input.readlines():
    if len(line.strip()) > 0:
      tweet = parse_line(line)
      if tweet: # We add only if the tweet is not 'None'
         tweets.append(tweet)

for modeled_tweet in tweets[0:10]:
  print(modeled_tweet)

Tweet(id=995420583667040256, content='RT @Uznare: eurovision rules https://t.co/I8cG3D5tCh', author='gonçalo', language='en', retweeted_status={'created_at': 'Sat May 12 19:13:51 +0000 2018', 'id': 995381560277979136, 'id_str': '995381560277979136', 'text': 'eurovision rules https://t.co/I8cG3D5tCh', 'display_text_range': [0, 16], 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 29056256, 'id_str': '29056256', 'name': 'ウズ@ころあず最高', 'screen_name': 'Uznare', 'location': 'Hyperliterate México', 'url': 'http://outerheaven.xyz', 'description': '田所あずさが大好きです', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 656, 'friends_count': 449, 'listed_count': 16, 'favourites_count': 22950, 'statuses_count': 92272, 'created_at': 'Sun Apr 05 20:2

In [69]:
import json, dataclasses
def read_clean_tweets(input: str):
    tweets = []
    with open(input, 'r') as f:
        lines = f.readlines()
    for line in lines:
        parsed = json.loads(line)
        tweet = Tweet(**parsed)
        tweets.append(tweet)
    return tweets

def count_tweets(language: str, tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == language:
            count += 1
    return count 
    

def most_frequent_word(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        words = tweet.content.split(' ')
        for word in words:
            if word in count:
                count[word] += 1
            else:
                count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def most_frequent_words_spanish(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        if tweet.language == 'es':
            words = tweet.content.split(' ')
            for word in words:
                if word in count:
                    count[word] += 1
                else:
                    count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def get_total_tweets_count(tweets: list[Tweet]):
    return len(tweets)

def count_original_tweets(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.retweeted_status == False:
            count += 1
    return count

def count_original_tweets_spanish(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == 'es' and tweet.retweeted_status == False:
            count += 1
    return count


tweets = read_clean_tweets('clean-dataset')

english_tweets_count = count_tweets('en', tweets)
spanish_tweets_count = count_tweets('es', tweets)
catalan_tweets_count = count_tweets('ca', tweets)
galician_tweets_count = count_tweets('gl', tweets)
basque_tweets_count = count_tweets('eu', tweets)
portuguese_tweets_count = count_tweets('pt', tweets)
french_tweets_count = count_tweets('fr', tweets)
italian_tweets_count = count_tweets('it', tweets)
romanian_tweets_count = count_tweets('ro', tweets)
occitan_tweets_count = count_tweets('oc', tweets) 
corsican_tweets_count = count_tweets('co', tweets) 
breton_tweets_count = count_tweets('br', tweets) 
luxembourgish_tweets_count = count_tweets('lb', tweets)
greek_tweets_count = count_tweets('el', tweets)
turkish_tweets_count = count_tweets('tr', tweets)
german_tweets_count = count_tweets('de', tweets)
swedish_tweets_count = count_tweets('sv', tweets)
norwegian_tweets_count = count_tweets('no', tweets)
danish_tweets_count = count_tweets('da', tweets) 
finnish_tweets_count = count_tweets('fi', tweets)
dutch_tweets_count = count_tweets('nl', tweets)
ukrainian_tweets_count = count_tweets('uk', tweets) 
polish_tweets_count = count_tweets('pl', tweets) 
czech_tweets_count = count_tweets('cs', tweets) 
hungarian_tweets_count = count_tweets('hu', tweets) 
bulgarian_tweets_count = count_tweets('bg', tweets) 
albanian_tweets_count = count_tweets('sq', tweets)
bosnian_tweets_count = count_tweets('bs', tweets)
icelandic_tweets_count = count_tweets('is', tweets)
estonian_tweets_count = count_tweets('et', tweets)
maltese_tweets_count = count_tweets('mt', tweets)
montenegrin_tweets_count = count_tweets('me', tweets)
macedonian_tweets_count = count_tweets('mk', tweets)
azerbaijani_tweets_count = count_tweets('az', tweets)
lithuanian_tweets_count = count_tweets('lt', tweets)
latvian_tweets_count = count_tweets('lv', tweets)
armenian_tweets_count = count_tweets('hy', tweets)
georgian_tweets_count = count_tweets('ka', tweets)
serbian_tweets_count = count_tweets('sr', tweets)
croatian_tweets_count = count_tweets('hr', tweets)
slovenian_tweets_count = count_tweets('sl', tweets)
slovak_tweets_count = count_tweets('sk', tweets)
russian_tweets_count = count_tweets('ru', tweets) 
belarusian_tweets_count = count_tweets('be', tweets) 
hebrew_tweets_count = count_tweets('he', tweets)
unknown_tweets_count = count_tweets('und', tweets)


# PERCENTAGE FOR EACH LANGUAGE
total_tweets = get_total_tweets_count(tweets)
english_tweets_percentage = (english_tweets_count / total_tweets) * 100
spanish_tweets_percentage = (spanish_tweets_count / total_tweets) * 100
catalan_tweets_percentage = (catalan_tweets_count / total_tweets) * 100
galician_tweets_percentage = (galician_tweets_count / total_tweets) * 100
basque_tweets_percentage = (basque_tweets_count / total_tweets) * 100
portuguese_tweets_percentage = (portuguese_tweets_count / total_tweets) * 100
french_tweets_percentage = (french_tweets_count / total_tweets) * 100
italian_tweets_percentage = (italian_tweets_count / total_tweets) * 100
romanian_tweets_percentage = (romanian_tweets_count / total_tweets) * 100
occitan_tweets_percentage = (occitan_tweets_count / total_tweets) * 100
corsican_tweets_percentage = (corsican_tweets_count / total_tweets) * 100
breton_tweets_percentage = (breton_tweets_count / total_tweets) * 100
luxembourgish_tweets_percentage = (luxembourgish_tweets_count / total_tweets) * 100
greek_tweets_percentage = (greek_tweets_count / total_tweets) * 100
turkish_tweets_percentage = (turkish_tweets_count / total_tweets) * 100
german_tweets_percentage = (german_tweets_count / total_tweets) * 100
swedish_tweets_percentage = (swedish_tweets_count / total_tweets) * 100
norwegian_tweets_percentage = (norwegian_tweets_count / total_tweets) * 100
danish_tweets_percentage = (danish_tweets_count / total_tweets) * 100 
finnish_tweets_percentage = (finnish_tweets_count / total_tweets) * 100 
dutch_tweets_percentage = (dutch_tweets_count / total_tweets) * 100
ukrainian_tweets_percentage = (ukrainian_tweets_count / total_tweets) * 100
polish_tweets_percentage = (polish_tweets_count / total_tweets) * 100
czech_tweets_percentage = (czech_tweets_count / total_tweets) * 100
hungarian_tweets_percentage = (hungarian_tweets_count / total_tweets) * 100
bulgarian_tweets_percentage = (bulgarian_tweets_count / total_tweets) * 100
albanian_tweets_percentage = (albanian_tweets_count / total_tweets) * 100
bosnian_tweets_percentage = (bosnian_tweets_count / total_tweets) * 100
icelandic_tweets_percentage = (icelandic_tweets_count / total_tweets) * 100
estonian_tweets_percentage = (estonian_tweets_count / total_tweets) * 100
maltese_tweets_percentage = (maltese_tweets_count / total_tweets) * 100
montenegrin_tweets_percentage = (montenegrin_tweets_count / total_tweets) * 100
macedonian_tweets_percentage = (macedonian_tweets_count / total_tweets) * 100
azerbaijani_tweets_percentage = (azerbaijani_tweets_count / total_tweets) * 100
lithuanian_tweets_percentage = (lithuanian_tweets_count / total_tweets) * 100
latvian_tweets_percentage = (latvian_tweets_count / total_tweets) * 100
armenian_tweets_percentage = (armenian_tweets_count / total_tweets) * 100
georgian_tweets_percentage = (georgian_tweets_count / total_tweets) * 100
serbian_tweets_percentage = (serbian_tweets_count / total_tweets) * 100
croatian_tweets_percentage = (croatian_tweets_count / total_tweets) * 100
slovenian_tweets_percentage = (slovenian_tweets_count / total_tweets) * 100
slovak_tweets_percentage = (slovak_tweets_count / total_tweets) * 100
russian_tweets_percentage = (russian_tweets_count / total_tweets) * 100
belarusian_tweets_percentage = (belarusian_tweets_count / total_tweets) * 100
hebrew_tweets_percentage = (hebrew_tweets_count / total_tweets) * 100 
unknown_tweets_percentage = (unknown_tweets_count / total_tweets) * 100

tweets_count = get_total_tweets_count(tweets)
original_tweets_count = count_original_tweets(tweets)
words_by_frequence = list(most_frequent_word(tweets).items())[0:100]
spanish_tweets_count = count_original_tweets_spanish(tweets)
spanish_words_by_frequence = list(most_frequent_words_spanish(tweets).items())[0:100]

print(f"Number of English tweets: {english_tweets_count}")  #ENGLISH
print(f"Percentage of English tweets: {english_tweets_percentage:.2f}%")
print(f"Number of Spanish tweets: {spanish_tweets_count}") #SPANISH
print(f"Percentage of Spanish tweets: {spanish_tweets_percentage:.2f}%")
print(f"Number of Catalan tweets: {catalan_tweets_count}") #CATALAN
print(f"Percentage of Catalan tweets: {catalan_tweets_percentage:.2f}%")
print(f"Number of Galician tweets: {galician_tweets_count}") #GALICIAN
print(f"Percentage of Galician tweets: {galician_tweets_percentage:.2f}%")
print(f"Number of Basque tweets: {basque_tweets_count}") #BASQUE
print(f"Percentage of Basque tweets: {basque_tweets_percentage:.2f}%")
print(f"Number of Portuguese tweets: {portuguese_tweets_count}") #PORTUGUESE
print(f"Percentage of Portuguese tweets: {portuguese_tweets_percentage:.2f}%")
print(f"Number of French tweets: {french_tweets_count}") #FRENCH
print(f"Percentage of French tweets: {french_tweets_percentage:.2f}%") 
print(f"Number of Italian tweets: {italian_tweets_count}") #ITALIAN
print(f"Percentage of Italian tweets: {italian_tweets_percentage:.2f}%")
print(f"Number of Romanian tweets: {romanian_tweets_count}") #ROMANIAN
print(f"Percentage of Romanian tweets: {romanian_tweets_percentage:.2f}%")
print(f"Number of Occitan tweets: {occitan_tweets_count}") #OCCITAN
print(f"Percentage of Occitan tweets: {occitan_tweets_percentage:.2f}%")
print(f"Number of Corsican tweets: {corsican_tweets_count}") #CORSICAN
print(f"Percentage of Corsican tweets: {corsican_tweets_percentage:.2f}%")
print(f"Number of Breton tweets: {breton_tweets_count}") #BRETON
print(f"Percentage of Breton tweets: {breton_tweets_percentage:.2f}%")
print(f"Number of Luxembourgish tweets: {luxembourgish_tweets_count}") #LUXEMBOURGISH
print(f"Percentage of Luxembourgish tweets: {luxembourgish_tweets_percentage:.2f}%")
print(f"Number of Greek tweets: {greek_tweets_count}") #GREEK
print(f"Percentage of Greek tweets: {greek_tweets_percentage:.2f}%")
print(f"Number of Turkish tweets: {turkish_tweets_count}") #TURKISH
print(f"Percentage of Turkish tweets: {turkish_tweets_percentage:.2f}%")
print(f"Number of German tweets: {german_tweets_count}") #GERMAN
print(f"Percentage of German tweets: {german_tweets_percentage:.2f}%")
print(f"Number of Swedish tweets: {swedish_tweets_count}") #SWEDISH
print(f"Percentage of Swedish tweets: {swedish_tweets_percentage:.2f}%") 
print(f"Number of Norwegian tweets: {norwegian_tweets_count}") #NORWEGIAN
print(f"Percentage of Norwegian tweets: {norwegian_tweets_percentage:.2f}%") 
print(f"Number of Danish tweets: {danish_tweets_count}") #DANISH
print(f"Percentage of Danish tweets: {danish_tweets_percentage:.2f}%")
print(f"Number of Finnish tweets: {finnish_tweets_count}") #FINNISH
print(f"Percentage of Finnish tweets: {finnish_tweets_percentage:.2f}%")
print(f"Number of Dutch tweets: {dutch_tweets_count}") #DUTCH
print(f"Percentage of Dutch tweets: {dutch_tweets_percentage:.2f}%")
print(f"Number of Ukrainian tweets: {ukrainian_tweets_count}") #UKRAINIAN
print(f"Percentage of Ukrainian tweets: {ukrainian_tweets_percentage:.2f}%")
print(f"Number of Polish tweets: {polish_tweets_count}") #POLISH 
print(f"Percentage of Polish tweets: {polish_tweets_percentage:.2f}%")
print(f"Number of Czech tweets: {czech_tweets_count}") #CZECH
print(f"Percentage of Czech tweets: {czech_tweets_percentage:.2f}%")
print(f"Number of Hungarian tweets: {hungarian_tweets_count}") #HUNGARIAN
print(f"Percentage of Hungarian tweets: {hungarian_tweets_percentage:.2f}%")
print(f"Number of Bulgarian tweets: {bulgarian_tweets_count}") #BULGARIAN
print(f"Percentage of Bulgarian tweets: {bulgarian_tweets_percentage:.2f}%")
print(f"Number of Albanian tweets: {albanian_tweets_count}") #ALBANIAN
print(f"Percentage of Albanian tweets: {albanian_tweets_percentage:.2f}%")
print(f"Number of Bosnian tweets: {bosnian_tweets_count}") #BOSNIAN
print(f"Percentage of Bosnian tweets: {bosnian_tweets_percentage:.2f}%")
print(f"Number of Icelandic tweets: {icelandic_tweets_count}") #ICELANDIC
print(f"Percentage of Icelandic tweets: {icelandic_tweets_percentage:.2f}%")
print(f"Number of Estonian tweets: {estonian_tweets_count}") #ESTONIAN
print(f"Percentage of Estonian tweets: {estonian_tweets_percentage:.2f}%")
print(f"Number of Maltese tweets: {maltese_tweets_count}") #MALTESE
print(f"Percentage of Maltese tweets: {maltese_tweets_percentage:.2f}%")
print(f"Number of Montenegrin tweets: {montenegrin_tweets_count}") #MONTENEGRIN
print(f"Percentage of Montenegrin tweets: {montenegrin_tweets_percentage:.2f}%")
print(f"Number of Macedonian tweets: {macedonian_tweets_count}") #MACEDONIAN
print(f"Percentage of Macedonian tweets: {macedonian_tweets_percentage:.2f}%")
print(f"Number of Azerbaijani tweets: {azerbaijani_tweets_count}") #AZERBAIJANI
print(f"Percentage of Azerbaijani tweets: {azerbaijani_tweets_percentage:.2f}%")
print(f"Number of Lithuanian tweets: {lithuanian_tweets_count}") #LITHUANIAN
print(f"Percentage of Lithuanian tweets: {lithuanian_tweets_percentage:.2f}%")
print(f"Number of Latvian tweets: {latvian_tweets_count}") #LATVIAN
print(f"Percentage of Latvian tweets: {latvian_tweets_percentage:.2f}%")
print(f"Number of Armenian tweets: {armenian_tweets_count}") #ARMENIAN
print(f"Percentage of Armenian tweets: {armenian_tweets_percentage:.2f}%")
print(f"Number of Georgian tweets: {georgian_tweets_count}") #GEORGIAN
print(f"Percentage of Georgian tweets: {georgian_tweets_percentage:.2f}%")
print(f"Number of Serbian tweets: {serbian_tweets_count}") #SERBIAN
print(f"Percentage of Serbian tweets: {serbian_tweets_percentage:.2f}%")
print(f"Number of Croatian tweets: {croatian_tweets_count}") #CROATIAN
print(f"Percentage of Croatian tweets: {croatian_tweets_percentage:.2f}%")
print(f"Number of Slovenian tweets: {slovenian_tweets_count}") #SLOVENIAN
print(f"Percentage of Slovenian tweets: {slovenian_tweets_percentage:.2f}%")
print(f"Number of Slovak tweets: {slovak_tweets_count}") #SLOVAK
print(f"Percentage of Slovak tweets: {slovak_tweets_percentage:.2f}%") 
print(f"Number of Russian tweets: {russian_tweets_count}") #RUSSIAN
print(f"Percentage of Russian tweets: {russian_tweets_percentage:.2f}%")
print(f"Number of Belarusian tweets: {belarusian_tweets_count}") #BELARUSIAN
print(f"Percentage of Belarusian tweets: {belarusian_tweets_percentage:.2f}%")
print(f"Number of Hebrew tweets: {hebrew_tweets_count}") #HEBREW
print(f"Percentage of Hebrew tweets: {hebrew_tweets_percentage:.2f}%")
print(f"Number of Unknown tweets: {unknown_tweets_count}") #UNKNOWN
print(f"Percentage of Unknown tweets: {unknown_tweets_percentage:.2f}%")

print("----------------------------------")
print(f"Number of tweets: {tweets_count}")
print(f"Number of original tweets: {original_tweets_count}")
print(f"Number of original Spanish tweets: {spanish_tweets_count}")
print("-----------------------------------")
print(f"Frequent words: {words_by_frequence}\n")
print(f"Most frequent Spanish words: {spanish_words_by_frequence}\n")
print(tweets[0:2])

Number of English tweets: 491
Percentage of English tweets: 49.10%
Number of Spanish tweets: 94
Percentage of Spanish tweets: 23.60%
Number of Catalan tweets: 5
Percentage of Catalan tweets: 0.50%
Number of Galician tweets: 0
Percentage of Galician tweets: 0.00%
Number of Basque tweets: 1
Percentage of Basque tweets: 0.10%
Number of Portuguese tweets: 6
Percentage of Portuguese tweets: 0.60%
Number of French tweets: 96
Percentage of French tweets: 9.60%
Number of Italian tweets: 30
Percentage of Italian tweets: 3.00%
Number of Romanian tweets: 1
Percentage of Romanian tweets: 0.10%
Number of Occitan tweets: 0
Percentage of Occitan tweets: 0.00%
Number of Corsican tweets: 0
Percentage of Corsican tweets: 0.00%
Number of Breton tweets: 0
Percentage of Breton tweets: 0.00%
Number of Luxembourgish tweets: 0
Percentage of Luxembourgish tweets: 0.00%
Number of Greek tweets: 7
Percentage of Greek tweets: 0.70%
Number of Turkish tweets: 21
Percentage of Turkish tweets: 2.10%
Number of German t

# 9: 

In [40]:
import s3fs
s3 = s3fs.S3FileSystem(anon=False)

bucket='mudab-2025-big-data'
data_key = 'twitter-data/Eurovision-09.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

In [41]:
import json, dataclasses

tweets = []

bucket='mudab-2025-pc1262057'
data_key = 'input/Eurovision-09.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

def parse_line(line: str):
  """Try to parse a string into a Person"""
  error = 0
  try:
    parsed = json.loads(line)
    return Tweet(parsed['id'], parsed['text'], parsed['user']['name'], parsed['lang'],parsed.get('retweeted_status', False))
  except Exception as e:
    error += 1    
#   print(f"Error parsing '{line}': {e}")

with s3.open(data_location) as input: #CLIENT S3. CAN READ THE FILE LINE BY LINE
  for line in input.readlines():
    if len(line.strip()) > 0:
      tweet = parse_line(line)
      if tweet: # We add only if the tweet is not 'None'
         tweets.append(tweet)

for modeled_tweet in tweets[0:10]:
  print(modeled_tweet)

Tweet(id=995432738101645313, content='Gospodarze z Portugalii dzisiaj na ostatnim miejscu.\n#eurovision', author='Tomasz Witowski 🇵🇱', language='pl', retweeted_status=False)
Tweet(id=995432738223214593, content='#Eurovision \n\nNEJ!!! I følge min mening er hverken Østrig eller Israel de rette vindere....', author='Mihyunie SVT Bunny', language='da', retweeted_status=False)
Tweet(id=995432738068029447, content='все пизда, обидно #Eurovision', author="цветочек ри' estonia 💙", language='ru', retweeted_status=False)
Tweet(id=995432737736708097, content='incompréhensible l #Eurovision  ex la Moldavie Chypre et d autres', author='oth sisterhood', language='fr', retweeted_status=False)
Tweet(id=995432738227458049, content='raga al televoto non ci batte nessuno #escita #Eurovision #socialsybelli', author='Chiara Yuchi', language='it', retweeted_status=False)
Tweet(id=995432738252521473, content='все плохо\n#Eurovision', author='кичи|24', language='ru', retweeted_status=False)
Tweet(id=99543273

In [70]:
import json, dataclasses
def read_clean_tweets(input: str):
    tweets = []
    with open(input, 'r') as f:
        lines = f.readlines()
    for line in lines:
        parsed = json.loads(line)
        tweet = Tweet(**parsed)
        tweets.append(tweet)
    return tweets

def count_tweets(language: str, tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == language:
            count += 1
    return count 
    

def most_frequent_word(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        words = tweet.content.split(' ')
        for word in words:
            if word in count:
                count[word] += 1
            else:
                count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def most_frequent_words_spanish(tweets: list[Tweet]):
    count = {}
    for tweet in tweets:
        if tweet.language == 'es':
            words = tweet.content.split(' ')
            for word in words:
                if word in count:
                    count[word] += 1
                else:
                    count[word] = 1
    return dict(sorted(count.items(), key=lambda item: item[1], reverse=True))

def get_total_tweets_count(tweets: list[Tweet]):
    return len(tweets)

def count_original_tweets(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.retweeted_status == False:
            count += 1
    return count

def count_original_tweets_spanish(tweets: list[Tweet]):
    count = 0
    for tweet in tweets:
        if tweet.language == 'es' and tweet.retweeted_status == False:
            count += 1
    return count


tweets = read_clean_tweets('clean-dataset')

english_tweets_count = count_tweets('en', tweets)
spanish_tweets_count = count_tweets('es', tweets)
catalan_tweets_count = count_tweets('ca', tweets)
galician_tweets_count = count_tweets('gl', tweets)
basque_tweets_count = count_tweets('eu', tweets)
portuguese_tweets_count = count_tweets('pt', tweets)
french_tweets_count = count_tweets('fr', tweets)
italian_tweets_count = count_tweets('it', tweets)
romanian_tweets_count = count_tweets('ro', tweets)
occitan_tweets_count = count_tweets('oc', tweets) 
corsican_tweets_count = count_tweets('co', tweets) 
breton_tweets_count = count_tweets('br', tweets) 
luxembourgish_tweets_count = count_tweets('lb', tweets)
greek_tweets_count = count_tweets('el', tweets)
turkish_tweets_count = count_tweets('tr', tweets)
german_tweets_count = count_tweets('de', tweets)
swedish_tweets_count = count_tweets('sv', tweets)
norwegian_tweets_count = count_tweets('no', tweets)
danish_tweets_count = count_tweets('da', tweets) 
finnish_tweets_count = count_tweets('fi', tweets)
dutch_tweets_count = count_tweets('nl', tweets)
ukrainian_tweets_count = count_tweets('uk', tweets) 
polish_tweets_count = count_tweets('pl', tweets) 
czech_tweets_count = count_tweets('cs', tweets) 
hungarian_tweets_count = count_tweets('hu', tweets) 
bulgarian_tweets_count = count_tweets('bg', tweets) 
albanian_tweets_count = count_tweets('sq', tweets)
bosnian_tweets_count = count_tweets('bs', tweets)
icelandic_tweets_count = count_tweets('is', tweets)
estonian_tweets_count = count_tweets('et', tweets)
maltese_tweets_count = count_tweets('mt', tweets)
montenegrin_tweets_count = count_tweets('me', tweets)
macedonian_tweets_count = count_tweets('mk', tweets)
azerbaijani_tweets_count = count_tweets('az', tweets)
lithuanian_tweets_count = count_tweets('lt', tweets)
latvian_tweets_count = count_tweets('lv', tweets)
armenian_tweets_count = count_tweets('hy', tweets)
georgian_tweets_count = count_tweets('ka', tweets)
serbian_tweets_count = count_tweets('sr', tweets)
croatian_tweets_count = count_tweets('hr', tweets)
slovenian_tweets_count = count_tweets('sl', tweets)
slovak_tweets_count = count_tweets('sk', tweets)
russian_tweets_count = count_tweets('ru', tweets) 
belarusian_tweets_count = count_tweets('be', tweets) 
hebrew_tweets_count = count_tweets('he', tweets)
unknown_tweets_count = count_tweets('und', tweets)


# PERCENTAGE FOR EACH LANGUAGE
total_tweets = get_total_tweets_count(tweets)
english_tweets_percentage = (english_tweets_count / total_tweets) * 100
spanish_tweets_percentage = (spanish_tweets_count / total_tweets) * 100
catalan_tweets_percentage = (catalan_tweets_count / total_tweets) * 100
galician_tweets_percentage = (galician_tweets_count / total_tweets) * 100
basque_tweets_percentage = (basque_tweets_count / total_tweets) * 100
portuguese_tweets_percentage = (portuguese_tweets_count / total_tweets) * 100
french_tweets_percentage = (french_tweets_count / total_tweets) * 100
italian_tweets_percentage = (italian_tweets_count / total_tweets) * 100
romanian_tweets_percentage = (romanian_tweets_count / total_tweets) * 100
occitan_tweets_percentage = (occitan_tweets_count / total_tweets) * 100
corsican_tweets_percentage = (corsican_tweets_count / total_tweets) * 100
breton_tweets_percentage = (breton_tweets_count / total_tweets) * 100
luxembourgish_tweets_percentage = (luxembourgish_tweets_count / total_tweets) * 100
greek_tweets_percentage = (greek_tweets_count / total_tweets) * 100
turkish_tweets_percentage = (turkish_tweets_count / total_tweets) * 100
german_tweets_percentage = (german_tweets_count / total_tweets) * 100
swedish_tweets_percentage = (swedish_tweets_count / total_tweets) * 100
norwegian_tweets_percentage = (norwegian_tweets_count / total_tweets) * 100
danish_tweets_percentage = (danish_tweets_count / total_tweets) * 100 
finnish_tweets_percentage = (finnish_tweets_count / total_tweets) * 100 
dutch_tweets_percentage = (dutch_tweets_count / total_tweets) * 100
ukrainian_tweets_percentage = (ukrainian_tweets_count / total_tweets) * 100
polish_tweets_percentage = (polish_tweets_count / total_tweets) * 100
czech_tweets_percentage = (czech_tweets_count / total_tweets) * 100
hungarian_tweets_percentage = (hungarian_tweets_count / total_tweets) * 100
bulgarian_tweets_percentage = (bulgarian_tweets_count / total_tweets) * 100
albanian_tweets_percentage = (albanian_tweets_count / total_tweets) * 100
bosnian_tweets_percentage = (bosnian_tweets_count / total_tweets) * 100
icelandic_tweets_percentage = (icelandic_tweets_count / total_tweets) * 100
estonian_tweets_percentage = (estonian_tweets_count / total_tweets) * 100
maltese_tweets_percentage = (maltese_tweets_count / total_tweets) * 100
montenegrin_tweets_percentage = (montenegrin_tweets_count / total_tweets) * 100
macedonian_tweets_percentage = (macedonian_tweets_count / total_tweets) * 100
azerbaijani_tweets_percentage = (azerbaijani_tweets_count / total_tweets) * 100
lithuanian_tweets_percentage = (lithuanian_tweets_count / total_tweets) * 100
latvian_tweets_percentage = (latvian_tweets_count / total_tweets) * 100
armenian_tweets_percentage = (armenian_tweets_count / total_tweets) * 100
georgian_tweets_percentage = (georgian_tweets_count / total_tweets) * 100
serbian_tweets_percentage = (serbian_tweets_count / total_tweets) * 100
croatian_tweets_percentage = (croatian_tweets_count / total_tweets) * 100
slovenian_tweets_percentage = (slovenian_tweets_count / total_tweets) * 100
slovak_tweets_percentage = (slovak_tweets_count / total_tweets) * 100
russian_tweets_percentage = (russian_tweets_count / total_tweets) * 100
belarusian_tweets_percentage = (belarusian_tweets_count / total_tweets) * 100
hebrew_tweets_percentage = (hebrew_tweets_count / total_tweets) * 100 
unknown_tweets_percentage = (unknown_tweets_count / total_tweets) * 100

tweets_count = get_total_tweets_count(tweets)
original_tweets_count = count_original_tweets(tweets)
words_by_frequence = list(most_frequent_word(tweets).items())[0:100]
spanish_tweets_count = count_original_tweets_spanish(tweets)
spanish_words_by_frequence = list(most_frequent_words_spanish(tweets).items())[0:100]

print(f"Number of English tweets: {english_tweets_count}")  #ENGLISH
print(f"Percentage of English tweets: {english_tweets_percentage:.2f}%")
print(f"Number of Spanish tweets: {spanish_tweets_count}") #SPANISH
print(f"Percentage of Spanish tweets: {spanish_tweets_percentage:.2f}%")
print(f"Number of Catalan tweets: {catalan_tweets_count}") #CATALAN
print(f"Percentage of Catalan tweets: {catalan_tweets_percentage:.2f}%")
print(f"Number of Galician tweets: {galician_tweets_count}") #GALICIAN
print(f"Percentage of Galician tweets: {galician_tweets_percentage:.2f}%")
print(f"Number of Basque tweets: {basque_tweets_count}") #BASQUE
print(f"Percentage of Basque tweets: {basque_tweets_percentage:.2f}%")
print(f"Number of Portuguese tweets: {portuguese_tweets_count}") #PORTUGUESE
print(f"Percentage of Portuguese tweets: {portuguese_tweets_percentage:.2f}%")
print(f"Number of French tweets: {french_tweets_count}") #FRENCH
print(f"Percentage of French tweets: {french_tweets_percentage:.2f}%") 
print(f"Number of Italian tweets: {italian_tweets_count}") #ITALIAN
print(f"Percentage of Italian tweets: {italian_tweets_percentage:.2f}%")
print(f"Number of Romanian tweets: {romanian_tweets_count}") #ROMANIAN
print(f"Percentage of Romanian tweets: {romanian_tweets_percentage:.2f}%")
print(f"Number of Occitan tweets: {occitan_tweets_count}") #OCCITAN
print(f"Percentage of Occitan tweets: {occitan_tweets_percentage:.2f}%")
print(f"Number of Corsican tweets: {corsican_tweets_count}") #CORSICAN
print(f"Percentage of Corsican tweets: {corsican_tweets_percentage:.2f}%")
print(f"Number of Breton tweets: {breton_tweets_count}") #BRETON
print(f"Percentage of Breton tweets: {breton_tweets_percentage:.2f}%")
print(f"Number of Luxembourgish tweets: {luxembourgish_tweets_count}") #LUXEMBOURGISH
print(f"Percentage of Luxembourgish tweets: {luxembourgish_tweets_percentage:.2f}%")
print(f"Number of Greek tweets: {greek_tweets_count}") #GREEK
print(f"Percentage of Greek tweets: {greek_tweets_percentage:.2f}%")
print(f"Number of Turkish tweets: {turkish_tweets_count}") #TURKISH
print(f"Percentage of Turkish tweets: {turkish_tweets_percentage:.2f}%")
print(f"Number of German tweets: {german_tweets_count}") #GERMAN
print(f"Percentage of German tweets: {german_tweets_percentage:.2f}%")
print(f"Number of Swedish tweets: {swedish_tweets_count}") #SWEDISH
print(f"Percentage of Swedish tweets: {swedish_tweets_percentage:.2f}%") 
print(f"Number of Norwegian tweets: {norwegian_tweets_count}") #NORWEGIAN
print(f"Percentage of Norwegian tweets: {norwegian_tweets_percentage:.2f}%") 
print(f"Number of Danish tweets: {danish_tweets_count}") #DANISH
print(f"Percentage of Danish tweets: {danish_tweets_percentage:.2f}%")
print(f"Number of Finnish tweets: {finnish_tweets_count}") #FINNISH
print(f"Percentage of Finnish tweets: {finnish_tweets_percentage:.2f}%")
print(f"Number of Dutch tweets: {dutch_tweets_count}") #DUTCH
print(f"Percentage of Dutch tweets: {dutch_tweets_percentage:.2f}%")
print(f"Number of Ukrainian tweets: {ukrainian_tweets_count}") #UKRAINIAN
print(f"Percentage of Ukrainian tweets: {ukrainian_tweets_percentage:.2f}%")
print(f"Number of Polish tweets: {polish_tweets_count}") #POLISH 
print(f"Percentage of Polish tweets: {polish_tweets_percentage:.2f}%")
print(f"Number of Czech tweets: {czech_tweets_count}") #CZECH
print(f"Percentage of Czech tweets: {czech_tweets_percentage:.2f}%")
print(f"Number of Hungarian tweets: {hungarian_tweets_count}") #HUNGARIAN
print(f"Percentage of Hungarian tweets: {hungarian_tweets_percentage:.2f}%")
print(f"Number of Bulgarian tweets: {bulgarian_tweets_count}") #BULGARIAN
print(f"Percentage of Bulgarian tweets: {bulgarian_tweets_percentage:.2f}%")
print(f"Number of Albanian tweets: {albanian_tweets_count}") #ALBANIAN
print(f"Percentage of Albanian tweets: {albanian_tweets_percentage:.2f}%")
print(f"Number of Bosnian tweets: {bosnian_tweets_count}") #BOSNIAN
print(f"Percentage of Bosnian tweets: {bosnian_tweets_percentage:.2f}%")
print(f"Number of Icelandic tweets: {icelandic_tweets_count}") #ICELANDIC
print(f"Percentage of Icelandic tweets: {icelandic_tweets_percentage:.2f}%")
print(f"Number of Estonian tweets: {estonian_tweets_count}") #ESTONIAN
print(f"Percentage of Estonian tweets: {estonian_tweets_percentage:.2f}%")
print(f"Number of Maltese tweets: {maltese_tweets_count}") #MALTESE
print(f"Percentage of Maltese tweets: {maltese_tweets_percentage:.2f}%")
print(f"Number of Montenegrin tweets: {montenegrin_tweets_count}") #MONTENEGRIN
print(f"Percentage of Montenegrin tweets: {montenegrin_tweets_percentage:.2f}%")
print(f"Number of Macedonian tweets: {macedonian_tweets_count}") #MACEDONIAN
print(f"Percentage of Macedonian tweets: {macedonian_tweets_percentage:.2f}%")
print(f"Number of Azerbaijani tweets: {azerbaijani_tweets_count}") #AZERBAIJANI
print(f"Percentage of Azerbaijani tweets: {azerbaijani_tweets_percentage:.2f}%")
print(f"Number of Lithuanian tweets: {lithuanian_tweets_count}") #LITHUANIAN
print(f"Percentage of Lithuanian tweets: {lithuanian_tweets_percentage:.2f}%")
print(f"Number of Latvian tweets: {latvian_tweets_count}") #LATVIAN
print(f"Percentage of Latvian tweets: {latvian_tweets_percentage:.2f}%")
print(f"Number of Armenian tweets: {armenian_tweets_count}") #ARMENIAN
print(f"Percentage of Armenian tweets: {armenian_tweets_percentage:.2f}%")
print(f"Number of Georgian tweets: {georgian_tweets_count}") #GEORGIAN
print(f"Percentage of Georgian tweets: {georgian_tweets_percentage:.2f}%")
print(f"Number of Serbian tweets: {serbian_tweets_count}") #SERBIAN
print(f"Percentage of Serbian tweets: {serbian_tweets_percentage:.2f}%")
print(f"Number of Croatian tweets: {croatian_tweets_count}") #CROATIAN
print(f"Percentage of Croatian tweets: {croatian_tweets_percentage:.2f}%")
print(f"Number of Slovenian tweets: {slovenian_tweets_count}") #SLOVENIAN
print(f"Percentage of Slovenian tweets: {slovenian_tweets_percentage:.2f}%")
print(f"Number of Slovak tweets: {slovak_tweets_count}") #SLOVAK
print(f"Percentage of Slovak tweets: {slovak_tweets_percentage:.2f}%") 
print(f"Number of Russian tweets: {russian_tweets_count}") #RUSSIAN
print(f"Percentage of Russian tweets: {russian_tweets_percentage:.2f}%")
print(f"Number of Belarusian tweets: {belarusian_tweets_count}") #BELARUSIAN
print(f"Percentage of Belarusian tweets: {belarusian_tweets_percentage:.2f}%")
print(f"Number of Hebrew tweets: {hebrew_tweets_count}") #HEBREW
print(f"Percentage of Hebrew tweets: {hebrew_tweets_percentage:.2f}%")
print(f"Number of Unknown tweets: {unknown_tweets_count}") #UNKNOWN
print(f"Percentage of Unknown tweets: {unknown_tweets_percentage:.2f}%")

print("----------------------------------")
print(f"Number of tweets: {tweets_count}")
print(f"Number of original tweets: {original_tweets_count}")
print(f"Number of original Spanish tweets: {spanish_tweets_count}")
print("-----------------------------------")
print(f"Frequent words: {words_by_frequence}\n")
print(f"Most frequent Spanish words: {spanish_words_by_frequence}\n")
print(tweets[0:2])

Number of English tweets: 491
Percentage of English tweets: 49.10%
Number of Spanish tweets: 94
Percentage of Spanish tweets: 23.60%
Number of Catalan tweets: 5
Percentage of Catalan tweets: 0.50%
Number of Galician tweets: 0
Percentage of Galician tweets: 0.00%
Number of Basque tweets: 1
Percentage of Basque tweets: 0.10%
Number of Portuguese tweets: 6
Percentage of Portuguese tweets: 0.60%
Number of French tweets: 96
Percentage of French tweets: 9.60%
Number of Italian tweets: 30
Percentage of Italian tweets: 3.00%
Number of Romanian tweets: 1
Percentage of Romanian tweets: 0.10%
Number of Occitan tweets: 0
Percentage of Occitan tweets: 0.00%
Number of Corsican tweets: 0
Percentage of Corsican tweets: 0.00%
Number of Breton tweets: 0
Percentage of Breton tweets: 0.00%
Number of Luxembourgish tweets: 0
Percentage of Luxembourgish tweets: 0.00%
Number of Greek tweets: 7
Percentage of Greek tweets: 0.70%
Number of Turkish tweets: 21
Percentage of Turkish tweets: 2.10%
Number of German t