# Get Newsela Dataset

<div class="alert alert-block alert-warning">
This notebook covers the following steps:
<li>Get access to the Newsela API</li>
<li>Load the article URLs first</li>
<li>Extract each text from the API using the article URL</li>
</div>

In [33]:
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import pandas as pd
import ast
import re

In [3]:
df_path = "data/newsela/articles.csv"

In [2]:
test_urls = ["https://newsela.com/read/elem-sparrow-song/id/44677",
"https://newsela.com/read/gorilla-poop-treasure-hunt/id/44303",
"https://newsela.com/read/jurassic-park-generation/id/44270",
"https://newsela.com/read/school-fights-for-racial-equity/id/44370",
"https://newsela.com/read/can-algorithms-be-art/id/43488",
"https://newsela.com/read/decolonize-student-reading/id/41535",
"https://newsela.com/read/conservative-students-difficult-college-decisions/id/42964",
"https://newsela.com/read/DIY-tech-helps-marine-scientists/id/42157",
"https://newsela.com/read/yanny-laurel-explained/id/43684",
"https://newsela.com/read/elem-mobile-libraries-around-the-world/id/42697"]
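
The test URLs above all follow the layout `/read/<slug>/id/<id>`, while the helper functions below work with slugs. As a minimal sketch, assuming that layout holds for every reader URL, a hypothetical helper `slug_and_id` (not part of the original notebook) could split a URL into its two parts:

```python
from urllib.parse import urlparse

def slug_and_id(article_url):
    """Split a reader URL of the form /read/<slug>/id/<id> into (slug, id)."""
    parts = urlparse(article_url).path.strip("/").split("/")
    # Expected layout: ["read", <slug>, "id", <numeric id>]
    if len(parts) != 4 or parts[0] != "read" or parts[2] != "id":
        raise ValueError(f"unexpected URL layout: {article_url}")
    return parts[1], int(parts[3])

print(slug_and_id("https://newsela.com/read/elem-sparrow-song/id/44677"))
# -> ('elem-sparrow-song', 44677)
```

Raising on an unexpected layout keeps malformed URLs from silently producing a wrong slug.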

## Import Data

### Helper Functions

In [55]:
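def get_newsela_headers():
    """Collect the slugs listed on pages 1-199 of the article header endpoint."""
    url = "https://newsela.com/api/v2/articleheader/"
    slugs = []
    for i in tqdm(range(1,200)):
        # Fetch first, then read the slugs, so the last page is not dropped
        # and the first page is not requested twice.
        page_json = requests.get(url+f"?page={i}").json()
        slugs += [article["slug"] for article in page_json]
    return slugs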

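def save_article_to_file(title,ident,score,text):
    # One file per article version; the with-block closes the file on exit.
    # UTF-8 is set explicitly because the texts contain accented characters.
    with open(f"data/newsela/{title}-{ident}-{score}.txt","w",encoding="utf-8") as f:
        f.write(text)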
        
def api_request_newsela(url):
    # Remove runs of double hyphens; "+" avoids the empty matches "*" allows.
    regex_rm_minus = re.compile(r"(--)+")
    r = requests.get(url)
    # The last URL segment is the article slug.
    text_title = url.split("/")[-1]
    data = r.json()
    # Each entry in "articles" is one reading-level version of the same story.
    level_articles = data["articles"]
    articles = []
    for article in level_articles:
        text = article["teaser"] + "\n" + article["text"]
        text = regex_rm_minus.sub("", text)
        score = article["lexile_level"]
        ident = article["id"]
        articles.append((text_title,ident,score,text))
    return articles

def create_article_files(title_path):
    base_url = "https://newsela.com/api/v2/articleheader/"
    with open(title_path,"r") as f:
        # Skip blank lines so no request is sent to the bare endpoint.
        titles = [line for line in f.read().split("\n") if line]

    all_articles = []
    for ind,title in enumerate(titles):
        articles = api_request_newsela(base_url+title)
        # "slug" instead of "title" so the outer loop variable is not overwritten.
        for slug,ident,score,text in articles:
            save_article_to_file(slug,ident,score,text)
        all_articles += articles
        # Print a progress marker every 50 titles.
        if ind % 50 == 0:
            print(ind)
    return all_articles
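
`get_newsela_headers` walks a fixed number of pages. A possible variant, sketched here under assumptions, stops as soon as a page comes back empty and removes duplicate slugs while keeping their order. The name `collect_slugs` and the `fetch_page` callable are hypothetical. Whether the real endpoint returns an empty list past the last page has not been verified.

```python
def collect_slugs(fetch_page, max_pages=200):
    """Collect unique slugs page by page until a page comes back empty.

    fetch_page is a hypothetical callable: page number -> list of dicts
    that each carry a "slug" key.
    """
    slugs = []
    seen = set()
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break  # assumed end-of-listing signal
        for item in items:
            slug = item["slug"]
            if slug not in seen:
                seen.add(slug)
                slugs.append(slug)
    return slugs

# Stand-in data instead of the real API, so the sketch runs offline.
fake_pages = {1: [{"slug": "a"}, {"slug": "b"}], 2: [{"slug": "b"}, {"slug": "c"}]}
print(collect_slugs(lambda page: fake_pages.get(page, [])))
# -> ['a', 'b', 'c']
```

With the real API, `fetch_page` could be something like `lambda page: requests.get(url, params={"page": page}).json()`, where `url` is the header endpoint used above.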

### Connect to API

In [13]:
base_url = 'https://newsela.com'
url = base_url + '/articles/#/rule/latest'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get(url, headers=headers)


In [63]:
slugs = get_newsela_headers()
# set() removes duplicate slugs; the with-block closes the file on exit.
with open("data/newsela_article_titles.txt","w") as f:
    f.write("\n".join(set(slugs)))

In [79]:
articles = create_article_files("data/newsela_article_titles.txt")

0
50
100
[... one line per 50 titles ...]
5250
5300
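
The run above issues several thousand requests, and a single failing request would abort the whole loop. The following sketch records failures instead of stopping. `fetch_all` and `fake_fetch` are hypothetical names, and `fetch` stands in for a function like `api_request_newsela` applied to a slug.

```python
def fetch_all(slugs, fetch, report_every=50):
    """Call fetch(slug) for each slug and keep going when one request fails."""
    results, failed = [], []
    for index, slug in enumerate(slugs):
        try:
            results.extend(fetch(slug))
        except Exception as error:  # deliberately broad in this sketch
            failed.append((slug, repr(error)))
        if index % report_every == 0:
            print(index)
    return results, failed

# Stand-in fetch function that fails for one slug, so the sketch runs offline.
def fake_fetch(slug):
    if slug == "broken":
        raise ValueError("no such article")
    return [(slug, 1, 500, "text")]

articles_ok, failed = fetch_all(["a", "broken", "b"], fake_fetch)
print(len(articles_ok), [slug for slug, _ in failed])
```

The `failed` list can then be retried in a second pass without repeating the successful requests.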


### Show Example

In [80]:
articles[0]

('farmworkers-mexico-spanish',
 24778,
 680,
 'Mexico farmworkers strike during harvest\nSAN QUINTÍN (México) — Verónica Zaragoza creció recogiendo fresas y tomates en México. Desde que comenzó han cambiado muchas cosas. Sin embargo, a ella le siguen pagando la misma cantidad de dinero.\n\nEn los campos se han instalado líneas de riego. Estas líneas llevan agua a todos los cultivos. También se han construido nuevos invernaderos. Gracias a ellos, se cultivan más frutas y verduras en espacios cerrados.\n\nLos recolectores tienen más trabajo que antes, pero Zaragoza todavía gana 110 pesos, unos 8 dólares al día. Esto es solo un poco más de lo que ganaba cuando empezó a trabajar como recolectora a los 13 años de edad. Ahora tiene 26 años y es madre de tres hijos.\n\nEsta semana, Zaragoza se unió a miles de recolectores que protestaban por sus bajos salarios. Dejaron de trabajar y abandonaron los campos. Es la primera huelga de trabajadores agrícolas que se organiza aquí en muchos años.\n\n
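
Each entry is a tuple `(title, id, score, text)`, and one title can appear several times, once per reading level. As a small sketch, assuming all scores are numeric, the hypothetical helper `scores_by_title` groups the available levels per title. Apart from the first tuple's title, id, and score, the sample values are made up for illustration.

```python
from collections import defaultdict

def scores_by_title(articles):
    """Map each title to its (score, id) pairs, sorted by score."""
    grouped = defaultdict(list)
    for title, ident, score, _text in articles:
        grouped[title].append((score, ident))
    return {title: sorted(pairs) for title, pairs in grouped.items()}

# Illustrative sample; only the first title/id/score come from the output above.
sample = [
    ("farmworkers-mexico-spanish", 24778, 680, "text"),
    ("farmworkers-mexico-spanish", 99901, 520, "text"),
    ("other-article", 99902, 900, "text"),
]
print(scores_by_title(sample))
```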

## Save articles to files

In [85]:
article_df = pd.DataFrame(data=articles,columns=["title","id","newsela_score","text"])
article_df.to_csv("data/newsela_articles.csv",sep=";",index=False)
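
The article texts contain line breaks and may contain semicolons, so the file depends on CSV quoting to stay readable. The sketch below uses the standard-library `csv` module rather than pandas, with a made-up text, to show that a semicolon-delimited file with quoted fields survives a round trip. When reading the real file with pandas, the same separator would need to be passed, for example `pd.read_csv("data/newsela_articles.csv", sep=";")`.

```python
import csv
import io

# Made-up row in the same column layout as article_df.
rows = [("farmworkers-mexico-spanish", 24778, 680, "Line one; with semicolon\nLine two")]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerow(["title", "id", "newsela_score", "text"])
writer.writerows(rows)

# Read the data back; quoted fields keep their semicolons and line breaks.
buffer.seek(0)
restored = list(csv.DictReader(buffer, delimiter=";"))
print(restored[0]["text"] == rows[0][3])
# -> True
```

Numeric columns come back as strings with the `csv` module, whereas `pd.read_csv` infers numeric types.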