In [1]:
!pip install torch news-please numpy pandas datasets matplotlib python-dotenv transformers psycopg2-binary



In [2]:
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset
from dotenv import load_dotenv
import os

load_dotenv()  # looks for .env in current directory or parents

print(torch.__version__)

print(torch.randn(3,3))
print(torch.cuda.is_available())

db_user = os.getenv("DB_USER")
db_password = os.getenv("DB_PASSWORD")




2.5.1
tensor([[-0.8820, -0.8982,  0.5360],
        [-0.5079, -0.3259, -1.3447],
        [-0.0790,  0.3232,  0.3885]])
True


## Load the data

Dataset source: [Data](https://huggingface.co/datasets/copenlu/mm-framing)

In [3]:
# load_dataset returns Apache Arrow-backed data that is memory-mapped from disk
data = load_dataset("copenlu/mm-framing")
data_raw = data
data = data['full']
data[1], type(data)

({'uuid': '0000442d-4ce2-4d1c-b654-5561c3cab3b7',
  'title': 'NJ community teams up to help repair historic cemetery damaged by August storms',
  'date_publish': '2023-08-23 22:25:00',
  'source_domain': 'www.cbsnews.com',
  'url': 'https://www.cbsnews.com/philadelphia/video/nj-community-teams-up-to-help-repair-historic-cemetery-damaged-by-august-storms/',
  'political_leaning': 'left_lean',
  'text-topic': 'Historic Cemetery Damage and Restoration',
  'text-topic-exp': "The article discusses the damage caused to a historic cemetery by storms and the community's efforts to repair it. The cemetery is significant as it is the final resting place of several black Civil War veterans.",
  'text-entity-name': "Woodstown church's historic cemetery",
  'text-entity-sentiment': 'Positive',
  'text-entity-sentiment-exp': 'The article describes efforts to repair and clean up the cemetery, indicating a positive sentiment towards the entity.',
  'text-generic-frame': "['Economic', 'Health and safet

In [4]:
len(data)


478856

### Testing URL hydration

In [5]:
from newsplease import NewsPlease
test_index = 2

url = data[test_index]["url"]

article = NewsPlease.from_url(url)

print(article.maintext)
print(url)
data[test_index]

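def get_article_text(url):
    """Return the main text of the article at `url` (None if extraction fails)."""
    article = NewsPlease.from_url(url)
    return article.maintext if article else None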



For someone whose affably rumpled appearance suggests Matt Damon waking up from a nap, the comedian Mike Birbiglia is remarkably industrious. He is a film-maker (movies include Don’t Think Twice, a comedy about an improv group starring Keegan-Michael Key) and an actor (The Fault in Our Stars, Orange Is the New Black), and he recently played Taylor Swift’s avaricious son, who discovers at her funeral that he has been left out of her will, in the video for Anti-Hero. He also has a podcast, The Old Ones, in which celebrity guests pick apart his earlier standup routines, and another, Working It Out, in which they help him develop new material. “Like Tom Sawyer getting his friends to paint the fence,” as one guest, Nathan Lane, memorably put it.
Now the 45-year-old is bringing his latest show, The Old Man and the Pool, to the UK after a sell-out Broadway run. If he had his way, he wouldn’t reveal anything about it in advance. “I’d just say, ‘I wrote a show about mortality and I guarantee yo

After checking several examples of the text, a large share of the articles appear to be non-political, video-only, or very short. This looks like a limitation of the dataset from the original paper.

TO DO: Filter the dataset to get just the truly 'political' news stories

TO DO: Check the other dataset, the 'validation' set, to see whether its content differs
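One way the filtering TO DO could be approached is a row-level predicate on the topic annotation. The sketch below is only an illustration: the allow-list `POLITICAL_TOPICS` and the helper `is_political` are hypothetical, and the actual `gpt-topic` label values need to be inspected before relying on it.

```python
# Hypothetical allow-list; the real 'gpt-topic' values must be checked first.
POLITICAL_TOPICS = {"politics"}

def is_political(example):
    """Return True when the row's 'gpt-topic' is in the allow-list."""
    topic = example.get("gpt-topic")
    if topic is None:  # some rows carry no text annotations at all
        return False
    return topic.strip().lower() in POLITICAL_TOPICS

# Toy rows standing in for dataset records.
rows = [
    {"title": "Budget vote", "gpt-topic": "Politics"},
    {"title": "Mets recap", "gpt-topic": "Sports"},
    {"title": "Video only", "gpt-topic": None},
]
kept = [row["title"] for row in rows if is_political(row)]
print(kept)
```

Because the predicate takes one row dict and returns a bool, it could be passed to the Arrow dataset as `data.filter(is_political)`.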

In [6]:
data_raw['valid_framing_subset'][150]

{'uuid': '019a66c8-1d6f-42f7-ae2c-ade5b91d6b45',
 'title': 'Florida neighborhood terrorized after crocodile eats small dog',
 'date_publish': '2023-08-02 16:09:12',
 'source_domain': 'www.foxnews.com',
 'url': 'https://www.foxnews.com/us/florida-neighborhood-terrorized-crocodile-eats-small-dog',
 'political_leaning': 'right',
 'text-topic': 'Crocodile Attack',
 'text-topic-exp': "The article discusses an incident where a crocodile ate a small dog in a Florida canal, and it includes quotes from witnesses and the Florida Fish and Wildlife Conservation Commission's response to the incident.",
 'text-entity-name': 'Crocodile',
 'text-entity-sentiment': 'negative',
 'text-entity-sentiment-exp': 'The crocodile is portrayed as a dangerous and harmful entity that caused the death of a pet dog and caused concern among the residents of Satellite Beach, Florida.',
 'text-generic-frame': "['crime', 'security', 'public_op']",
 'text-generic-frame-exp': "The article discusses an incident of a crocod

The validation set likewise contains non-political articles.


For now, run with just a subset of the data and set this issue aside. Filter before the full training runs.

Arrow datasets are immutable, so to create a hydrated version of the data we use the .map() method, which returns a new dataset with the added column.

In [7]:
# Create a minimal dataset to process in this notebook with hydrated article text
import time
from newsplease import NewsPlease

def scrape_and_add(example):
    """Scrape article text from URL with error handling"""
    url = example['url']

    try:
        # Add a small delay to avoid rate limiting
        time.sleep(0.5)

        # Try to fetch the article
        article = NewsPlease.from_url(url)

        # Return the maintext if available, otherwise empty string
        if article and article.maintext:
            return {"article_text": article.maintext}
        else:
            print(f"No text found for: {url}")
            return {"article_text": ""}

    except Exception as e:
        # Handle 403s and other errors gracefully
        print(f"Error fetching {url}: {type(e).__name__} - {str(e)}")
        return {"article_text": ""}

# Create a small dataset for testing
df = data.shuffle(seed=42).select(range(25))

# Apply the scraping function to each row (batched=False means one at a time)
print("Starting to hydrate URLs...")
df = df.map(scrape_and_add, batched=False)
print("Hydration complete!")

Starting to hydrate URLs...
Hydration complete!


In [8]:
df[7]

{'uuid': '71f2fc7a-72e4-4001-9a1a-ab2c962c81de',
 'title': '"People" magazine editor-in-chief shares exclusive excerpts from Britney Spears\' new memoir',
 'date_publish': '2023-10-17 12:30:00',
 'source_domain': 'www.cbsnews.com',
 'url': 'https://www.cbsnews.com/video/people-magazine-editor-in-chief-shares-exclusive-excerpts-from-britney-spears-new-memoir/',
 'political_leaning': 'left_lean',
 'text-topic': 'Britney Spears',
 'text-topic-exp': "The article is about an exclusive excerpt from Britney Spears' new memoir and a recent interview and cover shoot with her, as shared by the editor-in-chief of 'People' magazine.",
 'text-entity-name': 'Britney Spears',
 'text-entity-sentiment': None,
 'text-entity-sentiment-exp': 'The article does not provide enough context to determine the sentiment towards Britney Spears.',
 'text-generic-frame': "['Cultural identity', 'Public Opinion']",
 'text-generic-frame-exp': "The article discusses Britney Spears, a cultural icon, and her memoir, which

In [9]:
# Practice storing data

type(df)

df.save_to_disk("test_results")

Saving the dataset (1/1 shards): 100%|██████████| 25/25 [00:00<00:00, 3412.78 examples/s]


In [10]:
# Practice loading from disk
# all within the datasets library
from datasets import load_from_disk

# use the load from disk function
df_loaded = load_from_disk("test_results")

df_loaded[1]

{'uuid': '2f7aae04-dec6-428a-b0fd-c87a5e16b102',
 'title': 'Cowboys and safety Malik Hooker agree on a $24 million, 3-year contract extension',
 'date_publish': '2023-08-05 17:02:33',
 'source_domain': 'apnews.com',
 'url': 'https://apnews.com/article/cowboys-malik-hooker-contract-5531a25a7aecda62883756a7270dfc89',
 'political_leaning': 'left_lean',
 'text-topic': 'Sports',
 'text-topic-exp': 'The article discusses a sports-related event, specifically a contract extension in American Football (NFL) between the Dallas Cowboys and Malik Hooker.',
 'text-entity-name': 'Malik Hooker',
 'text-entity-sentiment': 'positive',
 'text-entity-sentiment-exp': 'He has been healthy since joining Dallas, played a significant number of games in the past two seasons, and tied for second on the club with three interceptions last season',
 'text-generic-frame': "['Economic', 'Capacity and resources', 'Crime and punishment', 'Quality of life']",
 'text-generic-frame-exp': "The article discusses the financ

#### Practice script generation for hydration script

With more than 400,000 rows, hydration needs to be efficient.

The CLI mode appears to be the best way to use news-please to hydrate the dataset.

https://github.com/fhamborg/news-please/tree/master



In [11]:
len(data)

478856

In [12]:
with open("../urls_to_crawl.txt", "w") as f:
    for url in data['url']:
        f.write(f"{url}\n")
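With roughly 479,000 URLs, a crawl that fails partway is expensive to restart. A minimal sketch, assuming the crawler can be pointed at smaller URL lists: deduplicate the URLs first, then split them into chunks that can be crawled and retried separately. The helpers `dedupe_preserving_order` and `chunked` are hypothetical, and the chunk size here is arbitrary.

```python
def dedupe_preserving_order(urls):
    """Drop blank and repeated URLs while keeping first-seen order."""
    cleaned = (u.strip() for u in urls)
    return list(dict.fromkeys(u for u in cleaned if u))

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy input standing in for data['url'].
urls = ["https://a.example/1", "https://a.example/2", "https://a.example/1", " "]
unique_urls = dedupe_preserving_order(urls)
batches = list(chunked(unique_urls, 1))
print(len(unique_urls), len(batches))
```

Each chunk could then be written to its own text file with the same `f.write` loop used above.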



## Tokenize and encode

In [13]:
type(df['article_text'])

datasets.arrow_dataset.Column

In [14]:
# use the default auto tokenizer 
from transformers import AutoTokenizer

# Pick a test model checkpoint, change this later if desired
model = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model, local_files_only=False)

def tokenize_function(examples):
    return tokenizer(
        examples["article_text"],
        padding="max_length",
        truncation=True
    )

tokenized_df = df.map(tokenize_function, batched=True)
tokenized_df[3]

{'uuid': '8cdfd2d3-ff6a-475d-b9e0-baefbc8955b8',
 'title': 'Mets better be right about being ‘capable of better’',
 'date_publish': '2023-06-06 01:15:07',
 'source_domain': 'nypost.com',
 'url': 'https://nypost.com/2023/06/05/mets-better-be-right-about-them-being-capable-of-better/',
 'political_leaning': 'right_lean',
 'text-topic': 'MLB (New York Mets Performance)',
 'text-topic-exp': "The article discusses the performance of the New York Mets in Major League Baseball, their current record, run differential, and individual player performances. It also mentions the team's owner's response to a fan's request to acquire Shohei Ohtani, a current MLB player. The focus is primarily on the Mets' on-field performance and their potential for improvement.",
 'text-entity-name': 'Steve Cohen',
 'text-entity-sentiment': 'neutral',
 'text-entity-sentiment-exp': 'The article portrays Steve Cohen as a businessman focused on the current issues of his team, the Mets, and not actively pursuing other p
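Passing `padding="max_length"` together with `truncation=True` makes every row the same length, which is what lets rows be stacked into a batch later. The pure-Python sketch below illustrates the idea only; `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation, and it simplifies truncation by cutting the tail rather than preserving a trailing `[SEP]` token.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]       # truncate inputs that are too long
    mask = [1] * len(ids)                    # 1 marks a real token
    n_pad = max_length - len(ids)            # 0 when nothing needs padding
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(range(10), max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is how the model distinguishes real tokens from padding, which is why it is kept alongside `input_ids`.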

## Set Format

In [15]:
tokenized_df

Dataset({
    features: ['uuid', 'title', 'date_publish', 'source_domain', 'url', 'political_leaning', 'text-topic', 'text-topic-exp', 'text-entity-name', 'text-entity-sentiment', 'text-entity-sentiment-exp', 'text-generic-frame', 'text-generic-frame-exp', 'text-issue-frame', 'text-issue-frame-exp', 'img-generic-frame', 'img-frame-exp', 'img-entity-name', 'img-entity-sentiment', 'img-entity-sentiment-exp', 'gpt-topic', 'article_text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 25
})

In [16]:
# set_format doesn't delete the other columns, it just hides them; reset_format() brings them back
tokenized_df.set_format(
    type="torch",
    columns=['input_ids', 'token_type_ids', 'attention_mask', "text-topic"]
)
tokenized_df.format

{'type': 'torch',
 'format_kwargs': {},
 'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'text-topic'],
 'output_all_columns': False}

In [17]:
# toggling back to see the hidden columns
tokenized_df.reset_format()
tokenized_df.format

{'type': None,
 'format_kwargs': {},
 'columns': ['uuid',
  'title',
  'date_publish',
  'source_domain',
  'url',
  'political_leaning',
  'text-topic',
  'text-topic-exp',
  'text-entity-name',
  'text-entity-sentiment',
  'text-entity-sentiment-exp',
  'text-generic-frame',
  'text-generic-frame-exp',
  'text-issue-frame',
  'text-issue-frame-exp',
  'img-generic-frame',
  'img-frame-exp',
  'img-entity-name',
  'img-entity-sentiment',
  'img-entity-sentiment-exp',
  'gpt-topic',
  'article_text',
  'input_ids',
  'token_type_ids',
  'attention_mask'],
 'output_all_columns': False}

In [18]:
tokenized_df.set_format(
    type="torch",
    columns=['input_ids', 'token_type_ids', 'attention_mask', "text-topic"]
)

## Creating the data loader

In [19]:
from torch.utils.data import DataLoader

# Create the conveyor belt
train_dataloader = DataLoader(
    tokenized_df,
    shuffle = True, # mixes the data so the model doesn't learn the order
    batch_size = 4 # number of articles to process at once
)

# in the above, the model sees 4 rows at once, each padded or truncated to 512 token IDs
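With 25 rows and a batch size of 4, the last batch is smaller than the rest. A small sketch of the arithmetic, assuming the loader keeps the incomplete final batch (the `DataLoader` default, `drop_last=False`); `batch_sizes` is a hypothetical helper written for illustration.

```python
def batch_sizes(num_rows, batch_size, drop_last=False):
    """List the size of each batch a loader would yield."""
    full, remainder = divmod(num_rows, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)   # keep the smaller final batch
    return sizes

sizes = batch_sizes(num_rows=25, batch_size=4)
print(len(sizes), sizes[-1])
```

Code that assumes a fixed first dimension of 4 would therefore need to handle the final batch, or the loader could be built with `drop_last=True`.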


## Run sanity checks

In [20]:
# We'll just look at the first 2 batches to make sure it works
for i, batch in enumerate(train_dataloader):
    if i >= 2: 
        break
        
    print(f"--- Batch {i} ---")
    print(f"Input IDs shape: {batch['input_ids'].shape}")
    print(f"Attention Mask shape: {batch['attention_mask'].shape}")
    print(f"Labels in this batch: {batch['text-topic']}")
    
    # If you want to see what the model "sees" (first 10 tokens of first row)
    # This turns those IDs back into words!
    # USE tokenizer.decode for specific debugging
    example_text = tokenizer.decode(batch['input_ids'][0][:10])
    print(f"First 10 tokens decoded: {example_text}")
    print("\n")

--- Batch 0 ---
Input IDs shape: torch.Size([4, 512])
Attention Mask shape: torch.Size([4, 512])
Labels in this batch: ['Gender-Affirming Health Services in Schools', 'Bank Tax Proposal in Italy', 'Abortion Pill Access', 'Tech']
First 10 tokens decoded: [CLS] the health clinic that has black panther party roots


--- Batch 1 ---
Input IDs shape: torch.Size([4, 512])
Attention Mask shape: torch.Size([4, 512])
Labels in this batch: ['Accident', 'Baseball Transfers', 'Drag Worship Service Controversy', 'Politics']
First 10 tokens decoded: [CLS] miami - a miami - dade police officer had




# Let's run a test model on a small subset of the data

### testing data connection for hydrated dataset

In [21]:
# firstly let's load in our hydrated dataset
import psycopg2

# create a connection object
conn = psycopg2.connect(
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT")
)

In [22]:
# Run a simple query
cur = conn.cursor()
cur.execute("""SELECT COUNT(*) 
            FROM newsarticles;""")
result = cur.fetchall()
result

[(361975,)]

In [23]:
# Load query into pandas
import pandas as pd
df = pd.read_sql("SELECT COUNT(*) FROM newsarticles",
                 conn)
df.head()
conn.close()
cur.close()




In [24]:
# ALWAYS CLOSE CONNECTIONS
cur.close()
conn.close()
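Rather than relying on a separate cleanup cell, the close calls can be tied to the block that uses the connection. The sketch below uses `contextlib.closing`; `sqlite3` is only a stand-in so the example runs without a database server, and the same pattern should apply to any object with a `close()` method, including a psycopg2 connection and cursor. Note that psycopg2's own `with conn:` block ends the transaction but does not close the connection.

```python
import sqlite3
from contextlib import closing

# sqlite3 stands in for the PostgreSQL connection used in this notebook.
with closing(sqlite3.connect(":memory:")) as conn:
    with closing(conn.cursor()) as cur:
        cur.execute("CREATE TABLE newsarticles (url TEXT, maintext TEXT)")
        cur.execute("INSERT INTO newsarticles VALUES ('https://a.example/1', 'body')")
        cur.execute("SELECT COUNT(*) FROM newsarticles")
        row_count = cur.fetchone()[0]

# Both objects are closed here, even if a query above had raised.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False
print(row_count, still_open)
```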

### Create a small hydrated data subset

In [25]:
# let's just learn to play around with the arrow_dataset
type(data), len(data)

(datasets.arrow_dataset.Dataset, 478856)

In [26]:
data.column_names

['uuid',
 'title',
 'date_publish',
 'source_domain',
 'url',
 'political_leaning',
 'text-topic',
 'text-topic-exp',
 'text-entity-name',
 'text-entity-sentiment',
 'text-entity-sentiment-exp',
 'text-generic-frame',
 'text-generic-frame-exp',
 'text-issue-frame',
 'text-issue-frame-exp',
 'img-generic-frame',
 'img-frame-exp',
 'img-entity-name',
 'img-entity-sentiment',
 'img-entity-sentiment-exp',
 'gpt-topic']

In [27]:
df = data.shuffle(seed=42).select(range(1000)).to_pandas()

df.head()

# then you can do all kinds of EDA on it

Unnamed: 0,uuid,title,date_publish,source_domain,url,political_leaning,text-topic,text-topic-exp,text-entity-name,text-entity-sentiment,...,text-generic-frame,text-generic-frame-exp,text-issue-frame,text-issue-frame-exp,img-generic-frame,img-frame-exp,img-entity-name,img-entity-sentiment,img-entity-sentiment-exp,gpt-topic
0,f2de4efd-6ed0-4a0d-99e3-51a8c79e2b3e,,2024-03-19 16:58:00,www.cbsnews.com,https://www.cbsnews.com/news/trump-suing-abc-n...,left_lean,,,,,...,,,,,['Political'],The image features two prominent political fig...,Donald Trump,neutral,The image shows Donald Trump in a formal setti...,
1,2f7aae04-dec6-428a-b0fd-c87a5e16b102,Cowboys and safety Malik Hooker agree on a $24...,2023-08-05 17:02:33,apnews.com,https://apnews.com/article/cowboys-malik-hooke...,left_lean,Sports,"The article discusses a sports-related event, ...",Malik Hooker,positive,...,"['Economic', 'Capacity and resources', 'Crime ...",The article discusses the financial implicatio...,Sports Success Story,The article focuses on the contract extension ...,"['Cultural identity', 'None']","The image shows football players in action, we...",Dallas Cowboys,positive,The players are celebrating with one player ho...,Sports
2,fef978c8-f085-4871-9d6f-0fe4409148f7,KCAL News Anchor Rudabeh Shahbazi speaks at Ir...,2024-03-04 07:15:00,www.cbsnews.com,https://www.cbsnews.com/losangeles/video/kcal-...,left_lean,Iranian American Women,The article is about KCAL News anchor Rudabeh ...,Rudabeh Shahbazi,Positive,...,"['Cultural identity', 'Public Opinion']",The article discusses an event organized by th...,Empowerment of Women,The article focuses on a conference for Irania...,,,,,,Social
3,8cdfd2d3-ff6a-475d-b9e0-baefbc8955b8,Mets better be right about being ‘capable of b...,2023-06-06 01:15:07,nypost.com,https://nypost.com/2023/06/05/mets-better-be-r...,right_lean,MLB (New York Mets Performance),The article discusses the performance of the N...,Steve Cohen,neutral,...,"['Economic', 'Quality of life', 'Policy prescr...",The article discusses the financial implicatio...,Underperforming Athletes,The article focuses on the poor performance of...,"['Cultural identity', 'None']",The image primarily features baseball players ...,New York Mets,positive,"The image shows players in action, celebrating...",Sports
4,af3378b7-447e-44e7-8c1f-1f752ecc5539,"QBs Watson, Howell have solid starts, Brissett...",2023-08-12 03:53:58,apnews.com,https://apnews.com/article/browns-commanders-d...,left_lean,NFL Preseason Game,The article discusses an NFL preseason game be...,Deshaun Watson,positive,...,"['Economic', 'Quality of life', 'Legality, con...",The article discusses the financial investment...,Redemption Narrative,The article focuses on Deshaun Watson's return...,['None'],"The image depicts a football player in action,...",Deshaun Watson,neutral,The image shows Deshaun Watson in action durin...,Sports


In [None]:

# but for now, let's just have our shuffled 2500 sample to train a small model on and merge in from the database
data_subset = data.shuffle(seed=42).select(range(2500)).to_pandas() # conversion might not be possible in the full version
len(data_subset), type(data_subset), data_subset.columns

(2500,
 pandas.core.frame.DataFrame,
 Index(['uuid', 'title', 'date_publish', 'source_domain', 'url',
        'political_leaning', 'text-topic', 'text-topic-exp', 'text-entity-name',
        'text-entity-sentiment', 'text-entity-sentiment-exp',
        'text-generic-frame', 'text-generic-frame-exp', 'text-issue-frame',
        'text-issue-frame-exp', 'img-generic-frame', 'img-frame-exp',
        'img-entity-name', 'img-entity-sentiment', 'img-entity-sentiment-exp',
        'gpt-topic'],
       dtype='object'))

In [106]:
# Trying with the full data
data_pd = data.to_pandas()

### Merging in news articles

Next steps:
- merge in the news articles

The most efficient way to do this is to upload the annotation table to the database, run the join on the server, and pull back only the merged result.

We can do this with execute_values from psycopg2.extras
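The upload, join, and fetch pattern described here can be sketched end to end on toy data. This is a minimal illustration under stated assumptions: `sqlite3` stands in for PostgreSQL so the example runs anywhere, the table `mm_framing_demo` and the example URLs are hypothetical, and sqlite uses `?` placeholders where psycopg2 uses `%s`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the PostgreSQL connection
cur = conn.cursor()

# Server-side table that already holds the scraped article bodies.
cur.execute("CREATE TABLE newsarticles (url TEXT, maintext TEXT)")
cur.executemany(
    "INSERT INTO newsarticles VALUES (?, ?)",
    [("https://a.example/1", "body one"), ("https://a.example/3", "body three")],
)

# 1) Upload the local annotation rows.
cur.execute("CREATE TABLE mm_framing_demo (url TEXT, political_leaning TEXT)")
annotations = [
    ("https://a.example/1", "left_lean"),
    ("https://a.example/2", "right"),   # never scraped, so the inner join drops it
]
cur.executemany("INSERT INTO mm_framing_demo VALUES (?, ?)", annotations)
conn.commit()

# 2) Join inside the database, 3) pull back only the matched rows.
cur.execute("""
    SELECT a.url, a.political_leaning, b.maintext
    FROM mm_framing_demo a
    JOIN newsarticles b ON a.url = b.url
""")
merged = cur.fetchall()
cur.close()
conn.close()
print(merged)
```

An inner join keeps only URLs present in both tables, so annotated rows whose articles were never scraped silently disappear from the result; a `LEFT JOIN` would keep them with a null `maintext`.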

In [116]:
import psycopg2

# create a connection object
conn = psycopg2.connect(
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT")
)
cur = conn.cursor()



In [108]:
data_pd.columns

Index(['uuid', 'title', 'date_publish', 'source_domain', 'url',
       'political_leaning', 'text-topic', 'text-topic-exp', 'text-entity-name',
       'text-entity-sentiment', 'text-entity-sentiment-exp',
       'text-generic-frame', 'text-generic-frame-exp', 'text-issue-frame',
       'text-issue-frame-exp', 'img-generic-frame', 'img-frame-exp',
       'img-entity-name', 'img-entity-sentiment', 'img-entity-sentiment-exp',
       'gpt-topic'],
      dtype='object')

In [109]:
# unquoted SQL identifiers can't contain hyphens, so swap them for underscores
data_pd.columns = data_pd.columns.str.replace("-","_")
data_pd.columns

Index(['uuid', 'title', 'date_publish', 'source_domain', 'url',
       'political_leaning', 'text_topic', 'text_topic_exp', 'text_entity_name',
       'text_entity_sentiment', 'text_entity_sentiment_exp',
       'text_generic_frame', 'text_generic_frame_exp', 'text_issue_frame',
       'text_issue_frame_exp', 'img_generic_frame', 'img_frame_exp',
       'img_entity_name', 'img_entity_sentiment', 'img_entity_sentiment_exp',
       'gpt_topic'],
      dtype='object')

In [110]:
# we also need the url as the first column to act as a key
cols = ['url'] + [c for c in data_pd.columns if c != 'url']
data_pd = data_pd[cols]
data_pd.columns

Index(['url', 'uuid', 'title', 'date_publish', 'source_domain',
       'political_leaning', 'text_topic', 'text_topic_exp', 'text_entity_name',
       'text_entity_sentiment', 'text_entity_sentiment_exp',
       'text_generic_frame', 'text_generic_frame_exp', 'text_issue_frame',
       'text_issue_frame_exp', 'img_generic_frame', 'img_frame_exp',
       'img_entity_name', 'img_entity_sentiment', 'img_entity_sentiment_exp',
       'gpt_topic'],
      dtype='object')

In [112]:
# in the full pipeline, this pandas conversion might not be possible
cols = list(data_pd.columns)
values = [tuple(row) for row in data_pd.itertuples(index=False)] # itertuples iterates over the rows, yielding each one as a tuple


In [113]:
# First, create the table in the database
cur.execute("""
            CREATE TABLE mm_framing_full (
    url TEXT,
    uuid TEXT,
    title TEXT,
    date_publish TEXT,
    source_domain TEXT,
    political_leaning TEXT,
    text_topic TEXT,
    text_topic_exp TEXT,
    text_entity_name TEXT,
    text_entity_sentiment TEXT,
    text_entity_sentiment_exp TEXT,
    text_generic_frame TEXT,
    text_generic_frame_exp TEXT,
    text_issue_frame TEXT,
    text_issue_frame_exp TEXT,
    img_generic_frame TEXT,
    img_frame_exp TEXT,
    img_entity_name TEXT,
    img_entity_sentiment TEXT,
    img_entity_sentiment_exp TEXT,
    gpt_topic TEXT);
            """)

In [114]:
from psycopg2.extras import execute_values

query = f"""
INSERT INTO mm_framing_full ({",".join(cols)})

    VALUES %s
"""
# the single %s placeholder is expanded by execute_values() into the list of row tuples from `values`

with conn.cursor() as cur:
    execute_values(cur, query, values)

conn.commit() # commits the transaction so the inserted rows persist; uncommitted changes are discarded when the connection closes

# The table is now populated with the full dataset, so we can query it and merge the data on the server

In [117]:
# Try a sample merge
cur.execute("""
           SELECT a.url, a.political_leaning, a.title, a.gpt_topic, a.text_issue_frame, a.text_generic_frame,
           b.maintext
           FROM mm_framing_full a
           JOIN newsarticles b
            ON a.url = b.url 
            """)

result= cur.fetchall()

cur.close()
conn.close()

In [121]:
len(result)
type(result)

result[1]

('https://www.cbsnews.com/philadelphia/news/northern-snakeheads-invasive-fish-found-pennsylvania/',
 'left_lean',
 'Advisory issued for invasive fish in Pennsylvania that can survive outside of water',
 'Environment',
 'Environmental Threat',
 "['Economic', 'Health and safety', 'Policy prescription and evaluation', 'External regulation and reputation']",
 'PITTSBURGH (KDKA) — The Pennsylvania Fish and Boat Commission has issued a "strong advisory" to anglers in Pennsylvania regarding an invasive fish.\nIn a release on Wednesday, the agency encourages anyone who catches a northern snakehead anywhere in the state to report and dispose of it.\n"Northern Snakeheads are voracious predators and may cause declines in important sport fisheries, such as bass and panfish, and may inhibit recovery efforts for species of conservation concern in the region such as American Shad and Chesapeake Logperch," said Sean Hartzell of the Pennsylvania Fish and Boat Commission.\nOfficials have been monitoring

In [122]:
conn.close()
cur.close()

### Finishing the data preparation
Next steps:
* ensure the data is in the right format (tensors)
* tokenize the data with the Hugging Face Transformers tokenizer
* generate the dataloader
* split the data into train and test sets (the Arrow dataset provides a train_test_split method)
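String labels cannot be converted to tensors directly, so one preparatory step is mapping the target column to integer ids. The sketch below assumes `political_leaning` is the training target, which this notebook has not yet decided; `build_label_map` is a hypothetical helper, and the example values are the leanings seen in the records above.

```python
def build_label_map(labels):
    """Map each distinct label string to a stable integer id."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

# Assumed target column: 'political_leaning'.
leanings = ["left_lean", "right", "right_lean", "left_lean"]
label2id = build_label_map(leanings)
id2label = {idx: label for label, idx in label2id.items()}
encoded = [label2id[label] for label in leanings]
print(label2id, encoded)
```

Sorting before enumerating keeps the mapping identical across runs, so saved model outputs can always be decoded with the same `id2label`.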

### Training loop
