This repository has been archived by the owner on Jan 13, 2023. It is now read-only.
I created the training data using the code from: https://datalab.office.datisan.com.au/notebooks/training-data-analyst/blogs/textclassification/txtcls.ipynb
as follows:
query="""
SELECT source, REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ') AS title FROM
(SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  title
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
  AND LENGTH(title) > 10
)
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""
from google.cloud import bigquery
client = bigquery.Client()  # uses the default project and credentials
df = client.query(query).to_dataframe()  # requires pandas
# Write without a header row; the columns are source, then title
df.to_csv('titles_full.csv', header=False, index=False, encoding='utf-8', sep=',')
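To make it easier to see what the query's `source` and `title` expressions produce, here is a rough pure-Python approximation of the same logic. The helper names `extract_source` and `clean_title` are hypothetical, Python's `re` is not identical to BigQuery's regex engine, and the sketch returns `None` where BigQuery would filter the row out, so treat it as an illustration only.

```python
import re

def extract_source(url):
    """Approximate the query's `source` column for one URL (hypothetical helper)."""
    # Mirrors REGEXP_EXTRACT(url, '.*://(.[^/]+)/'): capture the host part
    match = re.search(r'.*://(.[^/]+)/', url)
    if match is None:
        return None
    host = match.group(1)
    # Mirrors REGEXP_CONTAINS(host, '.com$'): keep only hosts ending in "com"
    if not re.search(r'.com$', host):
        return None
    parts = host.split('.')
    # ARRAY_REVERSE(...)[OFFSET(1)] is the second element from the end
    return parts[-2] if len(parts) >= 2 else None

def clean_title(title):
    """Mirror REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')."""
    return re.sub(r'[^a-zA-Z0-9 $.-]', ' ', title)

print(extract_source('https://github.com/foo/bar'))
print(extract_source('http://www.nytimes.com/2010/x.html'))
print(clean_title('Show HN: foo!'))
```

Note that a URL without a slash after the host (for example `https://github.com`) does not match the pattern, so such rows yield no source.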
I had to swap the column order so that it matches the CSV written above (source first, then title):
COLUMNS = ['source', 'title']
Without this change, the loss was minimised after 20.
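A small sketch of why the order matters: the CSV has no header row, so whatever reads it assigns column names purely by position. The sample rows and the `read_rows` helper below are made up for illustration and use the standard `csv` module rather than the notebook's own input pipeline.

```python
import csv
import io

# Hypothetical rows in the same layout as titles_full.csv:
# no header, source first, then title.
SAMPLE_CSV = (
    "github,Show HN  a tiny example project\n"
    "nytimes,An example headline about technology\n"
)

def read_rows(text, columns):
    """Parse headerless CSV text, naming the fields by position (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=columns)
    return list(reader)

COLUMNS = ['source', 'title']
rows = read_rows(SAMPLE_CSV, COLUMNS)               # names match the file layout
swapped = read_rows(SAMPLE_CSV, ['title', 'source'])  # names reversed

print(rows[0]['source'])     # the label, as intended
print(swapped[0]['source'])  # the title text, mislabelled as the source
```

With the reversed order, the model would be handed title text as its label and the source name as its input, which would plausibly explain the odd loss behaviour.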
"some stuff here about setting up Eval jobs"