Skip to content

Using regular expressions to search for date expressions in news texts

Notifications You must be signed in to change notification settings

MMDPY/Date-Recognition-with-Regular-Expressions

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Date Recognition with RE

Installation and Execution Instructions

  • Clone the repository
  • Open a terminal inside the cloned folder (the folder should consist of src, output, data, and README.md)
  • No third-party library are required. All of the imported libraries are part of Python's Standard Library (re, sys, os, csv)
  • Run the command python3 src/main.py data/dev/ output/dev.csv from the terminal
  • The output can be found inside output

Run

python3 src/main.py data/dev/ output/dev.csv

Data

The data can be found inside data/dev.

Introduction

In this tutprial we will explore the use of regular expressions to extract information from written text. This method has been widely applied in information retrieval, and it is used in text processing applications and research.

A regular expression is a notation used to match strings in a text. It works like the CTRL+F search feature in browsers, but it has the added bonus of allowing the use of special characters to count, exclude, and group specific strings. Like in a language, regular expressions have a set of characters with predefined functions, and these characters can be used to create search patterns. For example, the character + means one or more occurrences. The regular expression /e+/ searches for strings of one or more “e”s. In the string “Feed the Birds”, this regular expression would match “ee”.

Task

In this project we are going to use regular expressions to search for date expressions in news texts. We are interested in two types of date expressions. The first one is simple date expressions, strings like “14 June 2019” and “Fall 2020” which represent absolute points in time and are independent of when you are reading them. The second type is deictic date expressions, dates that are relative to the current time, for example, “the day before yesterday”, “next Friday”, and “two weeks prior”.

Input: news articles

News is a genre that makes use of dates to convey more information about when an event took place and to help the readers place future and past occurrences in time. The input dataset is a collection of news articles that discusses several topics, such as politics, tech, and business.

Article 265 in this dataset has sentences like this:

“Pipa conducted the poll from 15 November 2004 to 3 January 2005 across 22 countries in face-to-face or telephone interviews.”

Output: CSV file, code, and documentation

CSV file

As mentioned before, news articles usually have a lot of time references, and we want you to search for those references in the input data. The output of the search is a CSV file with all of the dates expressions found. The file contains four columns, one column for the id of the article, one for the type of date expression found, one for the date expression itself, and one for the offset in characters from the beginning of the file to the beginning of the date expression, that is the position of the first character of the date expression in the file.

  • The output is like:
  • article_id, expr_type, value, char_offset
  • 265.txt, date, 15 November 2004, 30
  • 265.txt, date, 3 January 2005, 50

Useful links

  • This website has a good overview of the library Python RegEx. It lists the library functions available and regular expression’s metacharacters along with their uses.
  • This post is very useful to look for metacharacters and their applications.
  • You may want to test your regular expression before embedding them to your code. This website is perfect for that.
  • You may also want to improve the readability of your code by adding comments within a regular expression. The VERBOSE mode in Python provides such functionality, as described here.

More information

  • Book Chapter: Chapter 2

  • Learning Objectives : Learn how to use regular expressions to extract information from text.

Execution

Use the following command in the current directory.

python3 src/main.py data/dev/ output/dev.csv

About

Using regular expressions to search for date expressions in news texts

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages