<a href="https://colab.research.google.com/github/TailUFPB/storIA/blob/main/redditScrap.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To train the model, we need to download a large number of posts from the chosen subreddit. For each post we need both its text and its popularity.

Reddit's own API already supports this. To use it, you need to create an account and request API access (https://www.reddit.com/dev/api/).
With the account created, we use the PRAW library (https://praw.readthedocs.io/en/stable/) and supply the authentication credentials generated in the previous step.
The API returns the posts that make up the dataframe used to train the model. However, a listing is limited to 1000 posts. That leaves two options: fetch the 1000 most popular posts or the 1000 most recent ones. The API can also return the post that matches a specific ID.
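
As a minimal sketch of how such a listing could become dataframe rows, the block below converts submission-like objects into plain dictionaries. The helper names `submission_to_row` and `collect_rows` are hypothetical, and the environment variable names in the commented usage are assumptions; the fields roughly mirror the ones this notebook saves.

```python
from itertools import islice

# Columns kept for each post (roughly the fields this notebook stores).
COLUMNS = ["title", "score", "id", "subreddit", "url",
           "num_comments", "body", "created_utc"]


def submission_to_row(post):
    """Turn one submission-like object into a plain dict."""
    return {
        "title": post.title,
        "score": post.score,
        "id": post.id,
        "subreddit": str(post.subreddit),
        "url": post.url,
        "num_comments": post.num_comments,
        "body": post.selftext,
        "created_utc": post.created_utc,
    }


def collect_rows(listing, limit=1000):
    """Read at most `limit` posts from any iterable of submissions."""
    return [submission_to_row(post) for post in islice(listing, limit)]


# Usage with PRAW (needs network access and real credentials, so it stays commented out):
# import os, praw
# reddit = praw.Reddit(client_id=os.environ["REDDIT_CLIENT_ID"],        # assumed variable name
#                      client_secret=os.environ["REDDIT_CLIENT_SECRET"],  # assumed variable name
#                      user_agent="storIA")
# popular = collect_rows(reddit.subreddit("nosleep").top(time_filter="all", limit=1000))
# recent = collect_rows(reddit.subreddit("nosleep").new(limit=1000))
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows, columns=COLUMNS)`.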

One idea for getting past this limit was to use the API's random function, which returns a random post, and call it X times until we had a satisfactory number of posts.
In practice, random tended to return the same group of posts, even when it was called thousands of times.

The solution we found was to use a second API (https://api.pushshift.io/reddit/search/submission/) to build a list of post IDs, and then pass those IDs to the first API to retrieve the texts.
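
The sketch below shows one way the query for this second API could be assembled and its response reduced to a list of IDs. The parameter values are the ones used in this notebook's code; the function names `build_query` and `extract_ids` are hypothetical.

```python
from urllib.parse import urlencode

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_query(subreddit, after, before, size=1000):
    """Build the search URL for posts created between two Unix timestamps."""
    params = {
        "subreddit": subreddit,
        "sort": "desc",
        "sort_type": "created_utc",
        "after": after,
        "before": before,
        "size": size,
        "fields": "id",  # only the post ID is needed at this stage
    }
    return PUSHSHIFT_URL + "?" + urlencode(params)


def extract_ids(payload):
    """Pull the IDs out of a decoded JSON response shaped like {"data": [{"id": ...}]}."""
    return [post["id"] for post in payload["data"]]


print(build_query("nosleep", 1600000000, 1600086400))
```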

This second API also limits the size of its responses, so we requested the list of posts for each day over the last ten years. The API often failed and returned no posts for a given day, yet it returned results normally when the same day was requested again. So every day that failed was saved to a list and retried later.
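
The day-by-day strategy with retries can be sketched as follows. This is an illustration under stated assumptions, not the exact code of this notebook: `fetch` stands for any callable that returns the IDs of one day or raises on failure, and the names `day_windows` and `fetch_with_retries` are hypothetical.

```python
DAY = 86400  # seconds in one day


def day_windows(end, days):
    """Return (after, before) timestamp pairs, one per day, walking back from `end`."""
    return [(end - (i + 1) * DAY, end - i * DAY) for i in range(days)]


def fetch_with_retries(windows, fetch, max_rounds=5):
    """Call fetch(after, before) for every window and retry the ones that fail.

    Returns (ids, windows_still_failing).
    """
    ids = []
    pending = list(windows)
    for _ in range(max_rounds):
        if not pending:
            break
        failed = []
        for after, before in pending:
            try:
                ids.extend(fetch(after, before))
            except Exception:
                # Keep the failed day so the next round tries it again.
                failed.append((after, before))
        pending = failed
    return ids, pending


# Demo with a fake fetcher that fails the first time each day is requested.
seen = set()

def flaky_fetch(after, before):
    if after not in seen:
        seen.add(after)
        raise RuntimeError("temporary failure")
    return ["post-%d" % after]

ids, failing = fetch_with_retries(day_windows(10 * DAY, 3), flaky_fetch)
print(len(ids), len(failing))  # 3 0
```

Keeping the failed days in memory between rounds is a simplification; the notebook instead writes them to CSV files so the retries can run at another time.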

With the full list of IDs ready, we requested the posts for those IDs. Since the subreddit is several years old, some of the older posts had been deleted or removed. We also had to check for duplicate posts.
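
A minimal sketch of those two checks is shown below. The function names are hypothetical, and the placeholder strings used to recognise unavailable posts are an assumption about how deleted or removed bodies appear in the data.

```python
def unique_ids(ids):
    """Drop duplicate IDs while keeping the order of first appearance."""
    return list(dict.fromkeys(ids))


# Assumed placeholder bodies for posts that are no longer available.
MISSING_BODIES = {"[deleted]", "[removed]", ""}


def is_available(body):
    """Return True when a post body still holds real text."""
    return body is not None and body.strip() not in MISSING_BODIES


print(unique_ids(["a", "b", "a", "c", "b"]))  # ['a', 'b', 'c']
```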

Finally, the same process was run for the subreddits shortscarystories, scarystories and nosleep.





In [None]:
!pip install praw -q
import praw
import pandas as pd
from tqdm import tqdm
import requests
import datetime
from joblib import Parallel, delayed

In [None]:
folder = '/content/drive/MyDrive/dados/nlp2/'
subred = 'nosleep'
#subred = 'poetry'
def main(years = 10):
  !mkdir '$folder$subred/' -p
  print('1st run IDS: ' + str(getIds(years)))
  n = getErrors()
  print(n)
  print('TOTAL IDS: ' + str(mergeIDS(n)))
  n = getPosts()
  print(n)
  print(mergePosts(n))

In [None]:
def getIds(years):
  data = int(datetime.datetime.today().timestamp())  # end of the current one-day window (Unix time)
  link = 'https://api.pushshift.io/reddit/search/submission/?subreddit=' + subred + '&sort=desc&sort_type=created_utc&after={}&before={}&size=1000&fields=id'
  ids = []
  errors = []
  dias = 86400  # seconds in one day
  for i in tqdm(range(365*years)):
    try:
      # Request the IDs of the posts created during the day that ends at `data`.
      r = requests.get(link.format(data - dias, data), timeout = 5)
      for post in r.json()['data']:
        ids.append(post['id'])
    except Exception:
      # Remember the failed day so getErrors() can retry it later.
      errors.append(data)
    data = data - dias
  print(len(ids))
  errors = pd.DataFrame({'errors':errors})
  errors.to_csv('{}{}/errors.csv'.format(folder,subred), index = False)
  df = pd.DataFrame({'id':ids})
  df.to_csv('{}{}/ids.csv'.format(folder,subred), index = False)
  return len(df)


def getErrors():
  # Retry the days that failed, for up to 100 rounds, until no failed day is left.
  # Returns the number of rounds saved to disk (ids1.csv ... idsN.csv).
  link = 'https://api.pushshift.io/reddit/search/submission/?subreddit=' + subred + '&sort=desc&sort_type=created_utc&after={}&before={}&size=1000&fields=id'
  dias = 86400  # seconds in one day
  try:
    errors = pd.read_csv('{}{}/errors.csv'.format(folder,subred))['errors'].tolist()
  except (FileNotFoundError, pd.errors.EmptyDataError):
    return 0
  rounds = 0
  for n in range(0,100):
    if len(errors) == 0:
      break
    ids = []
    errorsf = []
    for i in tqdm(errors):
      try:
        r = requests.get(link.format(i - dias, i), timeout = 10)
        for post in r.json()['data']:
          ids.append(post['id'])
      except Exception:
        errorsf.append(i)
    print(len(ids))
    print(len(errorsf))
    # Save this round before deciding whether to stop, so its IDs are not lost.
    pd.DataFrame({'errors':errorsf}).to_csv('{}{}/errors{}.csv'.format(folder,subred,n+1), index = False)
    pd.DataFrame({'id':ids}).to_csv('{}{}/ids{}.csv'.format(folder,subred,n+1), index = False)
    rounds = n + 1
    errors = errorsf
  return rounds

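def mergeIDS(n):
  # Join ids.csv with the files written by the retry rounds (ids1.csv ... idsN.csv).
  frames = [pd.read_csv('{}{}/ids.csv'.format(folder,subred))]
  for i in range(1,n+1):
    frames.append(pd.read_csv('{}{}/ids{}.csv'.format(folder,subred,i)))
  # DataFrame.append was removed in pandas 2.0, so the parts are joined with pd.concat.
  df = pd.concat(frames, ignore_index=True)
  df.to_csv('{}{}/allIDS.csv'.format(folder,subred), index = False)
  return len(df)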

def getPosts():
  import os
  import praw
  # Credentials are read from environment variables (the variable names are an assumption),
  # so the client secret is not stored in the notebook itself.
  reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'], client_secret=os.environ['REDDIT_CLIENT_SECRET'], user_agent='storIA', check_for_async=False)
  df = pd.read_csv('{}{}/allIDS.csv'.format(folder,subred))
  lista = df['id'].tolist()
  cs = 1000  # number of posts per chunk and per output file
  parts = len(df)//cs + 1
  def capture(part):
    posts = []
    # Slicing never runs past the end of the list, so the last chunk is simply shorter.
    for post_id in lista[cs*part : cs*(part+1)]:
      try:
        post = reddit.submission(id=post_id)
        posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
      except Exception as e:
        # Posts that cannot be fetched are skipped.
        print('skipped {}: {}'.format(post_id, e))
    posts = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
    posts.to_csv('{}{}/posts{}.csv'.format(folder,subred,part), index = False)
  Parallel(n_jobs=8)(delayed(capture)(part) for part in tqdm(range(parts)))
  return parts

def mergePosts(n):
  # Join the per-chunk files posts0.csv ... posts{n-1}.csv into a single file.
  # DataFrame.append was removed in pandas 2.0, so the parts are joined with pd.concat.
  frames = [pd.read_csv('{}{}/posts{}.csv'.format(folder,subred,i)) for i in range(n)]
  df = pd.concat(frames, ignore_index=True)
  df.to_csv('{}{}/allPOSTS{}.csv'.format(folder,subred,subred), index = False)
  return(df.shape)


main(14)

100%|██████████| 5110/5110 [06:16<00:00, 13.57it/s]
  0%|          | 2/4727 [00:00<05:02, 15.60it/s]

1
1st runt IDS: 1


[Output of the retry rounds truncated: each round prints the number of IDs recovered followed by the number of days that still failed, which drops from 4363 after the first round to 0 after the last one.]

85


100%|██████████| 2/2 [00:00<00:00, 123.39it/s]

TOTAL IDS: 23





1
(23, 8)
