# Class 2 - Answering simple questions about data

## Objectives

Use real data and write code to answer questions about the dataset.

### Steps

1. Read the input (one JSON object per line).
2. Model each valid line as a data class that represents a Tweet.
3. Answer questions about the resulting dataset.

## Python review: file I/O

### Writing files

The following snippet shows how to write a file (note the `w`, for *write*, passed as the second argument to the `open` function).
Also note the special character `\n`, which inserts a line ending.


```python
with open('/some/path/to/a/file', 'w') as f:
  f.write("a line\n")  # file objects provide write() and writelines(); there is no writeline()
```

-> `\n` marks the end of a line

In [None]:
with open('/tmp/test-file.txt', 'w') as f:
  f.write("test1\n")
  f.write("test2\n")
  f.writelines(["test3\n", 'test4\n'])


In [3]:
with open('test-file.txt', 'w') as f:
  f.write("test1\n")
  f.write("test2\n")
  f.writelines(["test3\n", 'test4\n'])

### Reading files

The following snippet shows how to read a file (note the `r`, for *read*, passed as the second argument to the `open` function).


```python
with open('/some/path/to/a/file', 'r') as f:
  line = f.readline()
```
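As a rough comparison, the sketch below contrasts three ways of reading from a file object: `readline()`, `read()` and `readlines()`. The temporary file and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical content, for illustration only).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")
with open(path, "w") as f:
    f.write("test1\ntest2\n")

with open(path, "r") as f:
    first_line = f.readline()  # reads up to and including the first "\n"
    rest = f.read()            # reads whatever is left, as a single string

with open(path, "r") as f:
    all_lines = f.readlines()  # reads every line into a list of strings

print(repr(first_line))
print(repr(rest))
print(all_lines)
```

Note that `read()` and `readlines()` load the whole remaining content into memory, so they are best kept for small files.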

### Reading line by line

This is the recommended way to read large files in Python: it avoids holding the whole file in memory, which is a scarce resource.

In [5]:
with open('test-file.txt', 'r') as f:
    for line in f:
      print(line.rstrip()) 

test1
test2
test3
test4


The `rstrip()` method removes any trailing whitespace characters (including newlines) from the string.
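A small sketch of how `rstrip()` behaves, compared with passing an explicit argument and with `strip()`. The sample strings are made up.

```python
line = "test1  \n"  # made-up line with trailing spaces and a newline

stripped = line.rstrip()          # removes all trailing whitespace
newline_only = line.rstrip("\n")  # removes only trailing newline characters
both_sides = "  test1\n".strip()  # strip() trims both ends of the string

print(repr(stripped))      # 'test1'
print(repr(newline_only))  # 'test1  '
print(repr(both_sides))    # 'test1'
```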

### Reading only the first N lines

This can be useful for exploring input files and getting an idea of the shape of the data.

In [7]:
with open('test-file.txt', 'r') as f:
    # Call next(f) once per element of range(N) and collect the stripped lines in the list head
    head = [next(f).rstrip() for _ in range(2)]  # next 2 times over f
    # Print each line in head (which is a list)
    for line in head:
        print(line)

test1
test2


In [8]:
with open('test-file.txt', 'r') as f:
    # Call next(f) once per element of range(N) and collect the stripped lines in the list head
    head = [next(f).rstrip() for _ in range(3)]  # next 3 times over f
    # Print each line in head (which is a list)
    for line in head:
        print(line)

test1
test2
test3
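One caveat with the `next(f)` approach: if the file has fewer than N lines, `next` raises `StopIteration`. A possible alternative, sketched here with `itertools.islice`, simply stops at the end of the file. The helper name `head` and the temporary file are assumptions made for this example.

```python
import os
import tempfile
from itertools import islice

# Throwaway two-line file (made-up content).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "short-file.txt")
with open(path, "w") as f:
    f.writelines(["test1\n", "test2\n"])

def head(path: str, n: int) -> list[str]:
    """Return up to the first n lines of a file, without trailing whitespace."""
    with open(path, "r") as f:
        # islice stops after n items or at the end of the file, whichever comes first
        return [line.rstrip() for line in islice(f, n)]

first_lines = head(path, 5)  # asks for 5 lines, but the file only has 2
print(first_lines)
```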


# ETL (Extract, Transform, Load)




### Examining the input

We download the file `mini_input.txt` and upload it to our Jupyter environment. We then explore it by reading its first few lines.

In [9]:
with open('mini_input.txt', 'r') as f:
    head = [next(f).rstrip() for _ in range(3)]
    for line in head:
        print(line)

{"created_at":"Sat May 12 15:58:53 +0000 2018","id":995332494974210048,"id_str":"995332494974210048","text":"RT @carloscarmo98: -Manel, algo que decir sobre tu actuaci\u00f3n en Eurovision?\n-Kikiriketediga https:\/\/t.co\/yXGYtKmJoM","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":492271155,"id_str":"492271155","name":"alba aguirre","screen_name":"Alba137","location":"en pleno akelarre","url":null,"description":"no todo lo que brilla es oro, a veces es highlight \u2728\ud83d\udc9c","translator_type":"regular","protected":false,"verified":false,"followers_count":718,"friends_count":416,"listed_count":2,"favourites_count":24718,"statuses_count":21764,"created_at":"Tue Feb 14 14:46:34 +0000 2012","utc_offset":10800,"time_zone":"Athen

## Modeling Tweets

We represent a Tweet as a Python dataclass. From the input above we know that:
- the input is in JSON format
- it contains many fields, so we will select a few relevant ones and ignore the rest

In [1]:
from dataclasses import dataclass

@dataclass
class Tweet:
  """Class to model a Tweet"""
  id: int         # The unique ID of a tweet
  content: str    # The textual content of a tweet
  author: str     # The nickname of the author of the tweet
  language: str   # The language of the tweet
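To get a feel for what `@dataclass` provides, the sketch below builds a `Tweet` by hand and shows the generated `repr`, the field-by-field equality, and the conversion to a dictionary with `dataclasses.asdict`. The field values are invented for the example.

```python
import dataclasses
from dataclasses import dataclass

@dataclass
class Tweet:
    """Class to model a Tweet"""
    id: int
    content: str
    author: str
    language: str

# Made-up values, not taken from the dataset.
example = Tweet(id=1, content="hello #eurovision", author="some_user", language="en")

print(example)  # readable repr generated by @dataclass
print(example == Tweet(1, "hello #eurovision", "some_user", "en"))  # equality compares fields

as_dict = dataclasses.asdict(example)  # plain dict, ready for json.dumps
print(as_dict)
```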

In the following code, the `parse_line` function interprets a line of input as JSON and maps it to an instance of the `Tweet` dataclass.

For now, although this is not recommended for large inputs, we will store **all** the Tweets in a list `tweets`, which we will then write to a file `clean-dataset`.

We complete the code of `parse_line` together *in class*.

In [2]:
import json, dataclasses

tweets = [] # empty list to store the parsed tweet objects.

def parse_line(line: str):
  """Try to parse a string into a Tweet; return None if parsing fails"""
  try:
    parsed = json.loads(line) # decode the JSON text into a dict
    return Tweet(parsed['id'], parsed['text'], parsed['user']['screen_name'], parsed['lang'])
  except Exception as e:
    print(f"Error parsing '{line}': {e}")
    return None

with open("mini_input.txt") as f:
    for line in f:
        if len(line.rstrip()) > 0:
          tweet = parse_line(line)
          if tweet is not None: # skip lines that could not be parsed
            tweets.append(tweet)

#print(tweets)

with open("clean-dataset", 'w') as f:
  tweet_strings = map(lambda x: json.dumps(dataclasses.asdict(x)) + '\n', tweets)
  f.writelines(tweet_strings)
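The cell above keeps every parsed tweet in the list `tweets`. For larger inputs, one possible alternative is to stream tweets with a generator and write each one out as soon as it is parsed. This is only a sketch: the name `stream_tweets`, the temporary file paths and the sample lines are assumptions made for the example.

```python
import dataclasses
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class Tweet:
    id: int
    content: str
    author: str
    language: str

def parse_line(line: str) -> Optional[Tweet]:
    """Return a Tweet, or None when the line is not a usable tweet."""
    try:
        parsed = json.loads(line)
        return Tweet(parsed["id"], parsed["text"], parsed["user"]["screen_name"], parsed["lang"])
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        print(f"Error parsing line: {e}")
        return None

def stream_tweets(path: str) -> Iterator[Tweet]:
    """Yield tweets one at a time instead of building a list."""
    with open(path, "r") as f:
        for line in f:
            if line.strip():
                tweet = parse_line(line)
                if tweet is not None:
                    yield tweet

# Made-up input: two valid lines, one broken line and one empty line.
tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "raw.txt")
clean_path = os.path.join(tmp_dir, "clean.txt")
with open(raw_path, "w") as f:
    f.write('{"id": 1, "text": "hola", "user": {"screen_name": "a"}, "lang": "es"}\n')
    f.write("not json\n")
    f.write("\n")
    f.write('{"id": 2, "text": "hello", "user": {"screen_name": "b"}, "lang": "en"}\n')

# Only one tweet is held in memory at any time.
with open(clean_path, "w") as out:
    for tweet in stream_tweets(raw_path):
        out.write(json.dumps(dataclasses.asdict(tweet)) + "\n")

with open(clean_path) as f:
    clean_lines = f.readlines()
print(len(clean_lines))
```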


In [3]:
for modeled_tweet in tweets[0:10]:
  print(modeled_tweet)

Tweet(id=995332494974210048, content='RT @carloscarmo98: -Manel, algo que decir sobre tu actuación en Eurovision?\n-Kikiriketediga https://t.co/yXGYtKmJoM', author='Alba137', language='es')
Tweet(id=995332495783727105, content="RT @DougJ7777: If Britain wins #Eurovision then we have to rejoin the EU. It's in the rules. #Eurovision2018", author='DougJ7777', language='en')
Tweet(id=995332497029332994, content='RT @AndreaMBaeza: Enserio esto es BRUTAL. ESTAMOS TODOS A UNA CON ELLOS!!!!!!!! OS QUEREMOS!! ❤️💛❤️ #aMaiaALFRED12POINTS #AmaiaAlfredLisboa…', author='cf_pablochape', language='es')
Tweet(id=995332494185680897, content='8. беларусь • 2018 \n#eurovision https://t.co/eWK7qRykgz', author='blcklcfr', language='ru')
Tweet(id=995332497419403265, content="RT @Mystificus: Of course I'll watch #eurovision tonight. After all, 200 million people can't be wrong, can they?\nEr...🍊🔫...", author='EcuadorDon', language='en')
Tweet(id=995332497943777281, content='RT @K

# Answering questions about data

From our clean dataset `clean-dataset` we can now try to answer concrete questions, for example:

1. How many tweets are there in our file `mini_input.txt`?
2. How many of these tweets are in Spanish? And how many are in English?
3. What is the most significant word in the Spanish-language tweets?
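Question 2 can be generalized to "how many tweets are there per language?". A minimal sketch using `collections.Counter` is shown below; the helper name `count_by_language` and the sample tweets are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Tweet:
    id: int
    content: str
    author: str
    language: str

# Made-up sample, not taken from the dataset.
sample = [
    Tweet(1, "hola", "a", "es"),
    Tweet(2, "hello", "b", "en"),
    Tweet(3, "adiós", "c", "es"),
]

def count_by_language(tweets: list[Tweet]) -> Counter:
    """Count tweets per language code in a single pass."""
    return Counter(tweet.language for tweet in tweets)

counts = count_by_language(sample)
print(counts["es"], counts["en"])  # a Counter returns 0 for languages it has not seen
print(counts.most_common(1))       # the most frequent language and its count
```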

In [4]:
import json, dataclasses

def read_clean_tweets(input: str):
  tweets = []
  with open(input, 'r') as f:
    lines = f.readlines()
  for line in lines:
    parsed = json.loads(line)
    tweet = Tweet(**parsed) 
    tweets.append(tweet)
  return tweets

def count_tweets(tweets: list[Tweet]): # Function 1
    return len(tweets)

def count_spanish_tweets(tweets: list[Tweet]):  # Function 2
    count = 0
    for tweet in tweets:
        if tweet.language == 'es':
            count += 1
    return count           

def most_significant_word_in_lang(tweets: list[Tweet], lang: str): # Function 3
    counts = {} # EMPTY DICTIONARY 
    # ITERATION
    for tweet in tweets:
        if tweet.language == lang:
            for word in tweet.content.split(' '):
                if word in counts: 
                    current_value = counts[word]
                    new_value = current_value + 1
                    counts[word] = new_value
                else:
                    counts[word] = 1
    return dict(sorted(counts.items(), key = lambda x: x[1], reverse = True))
   

tweets = read_clean_tweets('clean-dataset')

print(tweets[0:2]) # FIRST 2

tweets_count = count_tweets(tweets) # CALL 

spanish_tweets_count = count_spanish_tweets(tweets)

most_significant_word_in_spanish = most_significant_word_in_lang(tweets, 'es')

print(tweets_count)

print(spanish_tweets_count)

print(most_significant_word_in_spanish)

[Tweet(id=995332494974210048, content='RT @carloscarmo98: -Manel, algo que decir sobre tu actuación en Eurovision?\n-Kikiriketediga https://t.co/yXGYtKmJoM', author='Alba137', language='es'), Tweet(id=995332495783727105, content="RT @DougJ7777: If Britain wins #Eurovision then we have to rejoin the EU. It's in the rules. #Eurovision2018", author='DougJ7777', language='en')]
1000
236
{'de': 146, 'RT': 142, 'que': 96, 'y': 92, 'a': 87, 'en': 81, '#Eurovision': 75, 'la': 73, 'el': 68, 'no': 48, 'para': 31, 'Amaia': 28, 'Eurovision': 27, 'con': 27, 'es': 26, 'Alfred': 26, 'un': 26, 'por': 25, 'me': 22, 'ya': 21, 'los': 19, '@Alfred_ot2017': 18, 'nos': 18, 'del': 18, 'si': 17, 'noche': 16, 'ver': 16, '@Amaia_ot2017': 15, 'al': 15, 'España': 15, 'esta': 15, 'ha': 15, 'lo': 15, 'este': 15, 'La': 14, 'se': 14, 'No': 14, 'porque': 14, 'esto': 13, 'A': 13, 'Hoy': 13, 'hemos': 13, 'han': 13, 'más': 13, 'todos': 12, '#AmaiaAlfred12points': 12, 'va': 12, 'importa': 12, 'ganado': 11, 'hoy': 11, '

`for word in tweet.content.split(' '):` breaks the content of a tweet into individual words. `most_significant_word_in_lang` then counts how often each word occurs and returns the counts sorted from most to least frequent.
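In the output above, the top entries are function words such as `de`, `RT` and `que`, and `split(' ')` does not split on newlines. One possible refinement, sketched below, is to lowercase the text, split on any whitespace, and skip a list of stop words. The `STOP_WORDS` set, the helper name `top_words` and the sample texts are assumptions made for this example, not part of the class material.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"rt", "de", "que", "y", "a", "en", "la", "el", "no"}

def top_words(texts: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent words that are not stop words."""
    counts = Counter()
    for text in texts:
        # split() with no argument splits on any whitespace, including "\n"
        for word in text.lower().split():
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)

# Made-up sample texts.
sample_texts = [
    "RT que noche la de Eurovision",
    "Eurovision\nesta noche",
    "no importa Eurovision",
]

result = top_words(sample_texts, 2)
print(result)  # [('eurovision', 3), ('noche', 2)]
```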

## Brainstorming

What other questions could we ask of the data?

Which account has the most retweets and followers?

# Question 2.1

How many tweets are original?

We define a Tweet as *original* if it is not a retweet. For our data, we say that a tweet is a retweet if it contains the `retweeted_status` field (a missing or null field means the tweet is original).

Steps to follow:
- Add a field to our `Tweet` dataclass (which type best represents this information?)
- Read the input again, taking into account this new field and the criterion above for deciding whether a tweet is a retweet
- Write a function `count_original_tweets` that returns the number of original tweets
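The retweet criterion can be sketched as a small helper that turns the presence of `retweeted_status` into a boolean, treating a missing or null field as "original". The helper name `is_retweet` and the sample lines are made up for illustration.

```python
import json

def is_retweet(parsed: dict) -> bool:
    """True when the parsed tweet carries a non-null `retweeted_status` field."""
    return parsed.get("retweeted_status") is not None

# Made-up input lines covering the three cases.
retweet_line = '{"id": 1, "text": "RT @x: hi", "retweeted_status": {"id": 0}}'
original_line = '{"id": 2, "text": "hi"}'
null_line = '{"id": 3, "text": "hi", "retweeted_status": null}'

flags = [is_retweet(json.loads(line)) for line in (retweet_line, original_line, null_line)]
print(flags)  # [True, False, False]
```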

In [None]:
from dataclasses import dataclass

@dataclass
class Tweet:
  """Class to model a Tweet"""
  id: int         # The unique ID of a tweet
  content: str    # The textual content of a tweet
  author: str     # The nickname of the author of the tweet
  language: str   # The language of the tweet
  retweeted_status: bool  # False when the tweet is original (not a retweet)

In [None]:
import json, dataclasses

tweets = [] # empty list to store the parsed tweet objects.

def parse_line(line: str):
  """Try to parse a string into a Tweet; return None if parsing fails"""
  try:
    parsed = json.loads(line) # decode the JSON text into a dict
    # .get returns the nested retweet object when the field is present, otherwise False
    return Tweet(parsed['id'], parsed['text'], parsed['user']['screen_name'], parsed['lang'], parsed.get('retweeted_status', False))
  except Exception as e:
    print(f"Error parsing '{line}': {e}")
    return None

with open("mini_input.txt") as f:
    for line in f:
        if len(line.rstrip()) > 0:
          tweet = parse_line(line)
          if tweet is not None: # skip lines that could not be parsed
            tweets.append(tweet)

#print(tweets)

with open("clean-dataset", 'w') as f:
  tweet_strings = map(lambda x: json.dumps(dataclasses.asdict(x)) + '\n', tweets)
  f.writelines(tweet_strings)

In [None]:
import json, dataclasses

def read_clean_tweets(input: str):
  tweets = []
  with open(input, 'r') as f:
    lines = f.readlines()
  for line in lines:
    parsed = json.loads(line)
    tweet = Tweet(**parsed) 
    tweets.append(tweet)
  return tweets

def count_tweets(tweets: list[Tweet]): # Function 1
    return len(tweets)


def count_original_tweets(tweets: list[Tweet]):  # Function 2
    count = 0
    for tweet in tweets:
        if not tweet.retweeted_status: # missing (False) or null (None) means original
            count += 1
    return count 

tweets = read_clean_tweets('clean-dataset')

print(tweets[0:2]) # FIRST 2

tweets_count = count_tweets(tweets) # CALL 

original_tweets_count = count_original_tweets(tweets)


print(tweets_count)

print(original_tweets_count)

In [65]:
import json, dataclasses

def read_clean_tweets(input: str):
  tweets = []
  with open(input, 'r') as f:
    lines = f.readlines()
  for line in lines:
    parsed = json.loads(line)
    tweet = Tweet(**parsed) 
    tweets.append(tweet)
  return tweets

def count_tweets(tweets: list[Tweet]): # Function 1
    return len(tweets)

def count_spanish_tweets(tweets: list[Tweet]):  # Function 2
    count = 0
    for tweet in tweets:
        if tweet.language == 'es':
            count += 1
    return count   

def count_original_tweets(tweets: list[Tweet]):  # Function 3
    count = 0
    for tweet in tweets:
        if not tweet.retweeted_status: # missing (False) or null (None) means original
            count += 1
    return count 

def most_significant_word_in_lang(tweets: list[Tweet], lang: str): # Function 4
    counts = {} # EMPTY DICTIONARY 
    # ITERATION
    for tweet in tweets:
        if tweet.language == lang:
            for word in tweet.content.split(' '):
                if word in counts: 
                    current_value = counts[word]
                    new_value = current_value + 1
                    counts[word] = new_value
                else:
                    counts[word] = 1
    return dict(sorted(counts.items(), key = lambda x: x[1], reverse = True))
   

tweets = read_clean_tweets('clean-dataset')

print(tweets[0:2]) # FIRST 2

tweets_count = count_tweets(tweets) # CALL 

spanish_tweets_count = count_spanish_tweets(tweets)

original_tweets_count = count_original_tweets(tweets)

most_significant_word_in_spanish = most_significant_word_in_lang(tweets, 'es')

print(tweets_count)

print(spanish_tweets_count)

print(original_tweets_count)

print(most_significant_word_in_spanish)

[Tweet(id=995332494974210048, content='RT @carloscarmo98: -Manel, algo que decir sobre tu actuación en Eurovision?\n-Kikiriketediga https://t.co/yXGYtKmJoM', author='Alba137', language='es', retweeted_status={'created_at': 'Sat May 13 20:57:18 +0000 2017', 'id': 863498411517108224, 'id_str': '863498411517108224', 'text': '-Manel, algo que decir sobre tu actuación en Eurovision?\n-Kikiriketediga https://t.co/yXGYtKmJoM', 'display_text_range': [0, 72], 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1651197529, 'id_str': '1651197529', 'name': 'Carlos Carmona', 'screen_name': 'carloscarmo98', 'location': 'Logroño, España', 'url': None, 'description': 'Estudiante de Geografía e Historia y seguidor del Valencia C.F. Nacido en Villanueva de la S