# GETTING DATA

This section shows how tweets are extracted with the Twitter API. This requires a developer research account, so review the requirements for obtaining access first. The advantage of a developer research account is that it can access any tweet in the historical Twitter database, whereas other account types only allow access to the last few days (see the `elevated case` section).

First we import the libraries and the `bearer` file, which contains the `bearer token` of the research account. The token is required to establish the connection with the API.

In [None]:
# import libraries
import tweepy # To consume Twitter's API
import bearer # To get the bearer token
import pandas as pd # To handle data from the Twitter API
import datetime # To handle dates
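
The `bearer` module itself is not shown in this notebook. As a minimal sketch of what such a file could contain, assuming the token is kept in an environment variable rather than written into the file (the variable name `TWITTER_BEARER_TOKEN` and the helper `load_bearer_token` are hypothetical):

```python
# bearer.py (hypothetical layout; the real file is not part of this notebook)
import os


def load_bearer_token(env_var="TWITTER_BEARER_TOKEN"):
    """Return the token stored in the given environment variable.

    Raises RuntimeError when the variable is unset or empty, so a missing
    token is noticed before any request is sent.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"environment variable {env_var} is not set")
    return token


# The notebook reads `bearer.bearer_token`, so the module would end with:
# bearer_token = load_bearer_token()
```

Keeping the token out of the source file makes it harder to commit it to a repository by accident.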

Then we create the connection. The `search/all` endpoint (`search_all_tweets` in tweepy) will be used to search the full tweet history.

In [None]:
# full-archive search endpoint, kept for reference (tweepy builds the request URL itself)
search_url = "https://api.twitter.com/2/tweets/search/all"
tweepyclient = tweepy.Client(bearer_token=bearer.bearer_token)

Before running the search we create a function that cleans the tweet text by removing links. The rest of the dataset is not altered.

In [None]:
def fixing_text(texto):

  '''receives the text of the tweet and returns it without links, on a single line, so that it does not affect the csv structure'''

  # split() breaks on any whitespace, so line breaks inside the tweet are dropped too
  words = texto.split()
  # filtering to remove links from the tweet
  kept = [w for w in words if not w.startswith('http')]

  return ' '.join(kept)
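
As an alternative to the word-by-word filter, the same cleaning can be sketched with a regular expression. This is only an illustration: `strip_links` and `URL_PATTERN` are hypothetical names, the sample text is made up, and the behaviour differs slightly from `fixing_text`, since only tokens that begin with `http://` or `https://` are removed rather than every word starting with `http`.

```python
import re

# matches http:// or https:// followed by any run of non-space characters
URL_PATTERN = re.compile(r'https?://\S+')


def strip_links(text):
    """Remove URLs, then collapse all whitespace (including line breaks)."""
    without_links = URL_PATTERN.sub(' ', text)
    return ' '.join(without_links.split())


# made-up sample text, not taken from the downloaded data
sample = 'Nuevo evento en Medellin\nhttps://t.co/abc123  mas info'
print(strip_links(sample))  # Nuevo evento en Medellin mas info
```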


We then select the year for which tweets will be downloaded by setting the variable `anio` to a value between 2019 and 2022.

In [None]:
# generating a list of dictionaries with all the date windows for which the search will be made
anio = 2022         # modify the year for which tweets are to be downloaded

start_date = datetime.date(anio,1,1)
# each window ends at T00:00:00Z, so the last one must end on
# 1 January of the following year to cover 31 December
year_end = datetime.date(anio+1,1,1)
dates_list = []
while start_date < year_end:
  # 10-day windows; the last one is cut at the end of the year
  end_date = min(start_date + datetime.timedelta(days=10), year_end)

  dates_dict = {
      'start_date':start_date,
      'end_date':end_date
  }
  dates_list.append(dates_dict)
  start_date = end_date
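
The window logic can also be written as a small generator, which makes it easy to check the windows before sending any request. This is a sketch under the assumption that the stop day is exclusive, so passing 1 January of the following year covers 31 December as well; `date_windows` and its parameters are hypothetical names.

```python
import datetime


def date_windows(first_day, stop_day, step_days=10):
    """Yield (start, end) pairs covering [first_day, stop_day) in fixed steps."""
    step = datetime.timedelta(days=step_days)
    start = first_day
    while start < stop_day:
        # the last window is shortened so it never runs past stop_day
        end = min(start + step, stop_day)
        yield start, end
        start = end


windows = list(date_windows(datetime.date(2022, 1, 1), datetime.date(2023, 1, 1)))
print(len(windows))   # 37
print(windows[0])     # (datetime.date(2022, 1, 1), datetime.date(2022, 1, 11))
print(windows[-1])    # (datetime.date(2022, 12, 27), datetime.date(2023, 1, 1))
```

Each window starts where the previous one ends, so no day is skipped or searched twice.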

Now we download the tweets and store them. Tweets are downloaded by date range (10-day windows in the code below) and by filter word. For 2019 the word chosen is `Medellin`. Once the data understanding is done and the categories are selected (see the `Data preparation` notebook), a new search is run for the years 2019 to 2022 with those words. In this way the tweets (including those of 2019 that were previously stored together) are separated into files differentiated by date and category.
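
Running one search per category word amounts to building one query string per word. A minimal sketch, assuming the same filters as the query used in this notebook: `build_query` is a hypothetical helper, and every word in the list except `tecnologia` is a placeholder, not one of the categories actually selected.

```python
# 'tecnologia' comes from the query used in this notebook;
# the other words are placeholders for illustration only
category_words = ['tecnologia', 'movilidad', 'seguridad']


def build_query(word, city='medellin'):
    """Build the search query for one category word (no retweets, Spanish only)."""
    return f'{city} {word} -is:retweet lang:es'


# one query per category, keyed by the word so it can also be used in file names
queries = {word: build_query(word) for word in category_words}
print(queries['tecnologia'])  # medellin tecnologia -is:retweet lang:es
```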

In [None]:
for date in dates_list:
  
  try:

    print(f'running {date["start_date"]}')

    # query parameters
    query = 'medellin tecnologia -is:retweet lang:es' 

    # timestamps in the format the API expects; isoformat() gives YYYY-MM-DD
    start_time = f'{date["start_date"].isoformat()}T00:00:00Z'
    end_time = f'{date["end_date"].isoformat()}T00:00:00Z'

    # get the tweets
    # (text and id are returned by default)
    tweets = tweepyclient.search_all_tweets(
        query=query, 
        tweet_fields=['created_at',
                      'public_metrics',
                      'geo',
                      'conversation_id',
                      'author_id'], 
        user_fields = ['username'],
        place_fields = ['name',
                        'full_name'],
        expansions=['author_id',
                    'geo.place_id'],
        start_time=start_time,
        end_time=end_time,
        max_results=500)

    # saving the data into a DF

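    # Get users from the includes object, keyed by user id
    users = {u["id"]: u for u in tweets.includes.get('users', [])}

    # Get places from the includes object, keyed by place id
    # (the key is absent when no tweet in the window is geo-tagged)
    places = {p["id"]: p for p in tweets.includes.get('places', [])}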

    # creating the dic to fill with the tweets that we fetch
    diccionario = {
        'full_text':[],
        'user':[],
        'location':[],
        'date':[],
        'tweet_id':[],
        'number_rt':[],
        'number_likes':[],
        'number_reply':[],
        'conversation_id':[]
    }

    diccionario_to_fill = diccionario.copy()

    # saving the tweets data into a dictionary
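    # (tweets.data is None when the window has no results)
    for tweet in tweets.data or []:

      diccionario_to_fill['full_text'].append(tweet.text)

      # users.get avoids a KeyError when the author is missing from includes
      user = users.get(tweet.author_id)
      diccionario_to_fill['user'].append(user.username if user else '')

      # look up the place attached to this tweet, if any,
      # instead of reusing the first place for every tweet
      place_id = (tweet.geo or {}).get('place_id')
      place = places.get(place_id)
      diccionario_to_fill['location'].append(str(place) if place else '')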
      diccionario_to_fill['date'].append(tweet.created_at)
      diccionario_to_fill['tweet_id'].append(tweet.id)
      diccionario_to_fill['number_rt'].append(tweet.public_metrics['retweet_count'])
      diccionario_to_fill['number_likes'].append(tweet.public_metrics['like_count'])
      diccionario_to_fill['number_reply'].append(tweet.public_metrics['reply_count'])
      diccionario_to_fill['conversation_id'].append(tweet.conversation_id)

    df = pd.DataFrame.from_dict(diccionario_to_fill)

    # cleaning the column with the tweet text
    df['full_text'] = df['full_text'].apply(fixing_text)

    # file name: tweets_<start DDMMYYYY>_<end DDMMYYYY>.csv
    st = date['start_date'].strftime('%d%m%Y')
    et = date['end_date'].strftime('%d%m%Y')
    df.to_csv(f'data/research/tweets_{st}_{et}.csv', index=False)

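    # short pause before the next request, presumably to space out API calls
    # (replaces a busy-wait loop; imported here because it is only used for this pause)
    import time
    time.sleep(1)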
  except Exception as e:
    # report the failing window and the reason, then continue with the next one
    print(f'#Error: {e}')
    print(date['start_date'])
    print(date['end_date'])
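
The request above returns at most 500 tweets per window (`max_results=500`), so a window with more matches is only partly downloaded. The remaining results are reached by following the `next_token` found in the `meta` part of each response. The sketch below shows that loop in isolation. `fetch_all_pages` is a hypothetical helper, and `fake_search` is a stand-in that imitates the shape of a response so the example runs without API access. tweepy also provides `tweepy.Paginator` for this purpose.

```python
from types import SimpleNamespace


def fetch_all_pages(search, max_pages=None, **params):
    """Call `search` repeatedly, following next_token, and return all responses.

    `search` is any callable shaped like tweepy's Client.search_all_tweets:
    it accepts a `next_token` keyword and returns an object whose `meta`
    dict contains 'next_token' while more pages remain.
    """
    responses = []
    token = None
    while max_pages is None or len(responses) < max_pages:
        if token is not None:
            params['next_token'] = token
        response = search(**params)
        responses.append(response)
        token = (response.meta or {}).get('next_token')
        if token is None:
            break
    return responses


def fake_search(next_token=None, **params):
    """Stand-in for the API: two pages of made-up items."""
    pages = {None: (['t1', 't2'], 'p2'), 'p2': (['t3'], None)}
    data, following = pages[next_token]
    meta = {'next_token': following} if following else {}
    return SimpleNamespace(data=data, meta=meta)


pages = fetch_all_pages(fake_search, query='medellin tecnologia -is:retweet lang:es')
all_items = [item for page in pages for item in (page.data or [])]
print(all_items)  # ['t1', 't2', 't3']
```

With the real client, `tweepyclient.search_all_tweets` would take the place of `fake_search`, and the `includes` of every page would need to be merged as well.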