# Information Extraction

Information extraction (IE): automatically extracting structured information from unstructured text.

IE includes:

- named entity recognition

- time and event extraction

- template filling

- relation extraction

Applications:

- building knowledge bases

- question answering

## Named Entity Recognition (NER)

Named Entity Recognition (NER): identifying and categorizing named entities in text, such as people, locations, organizations, teams, newspapers, companies, and geopolitical entities.

NER has two main subtasks:

1. Segmentation: Determining which words or phrases belong to a named entity.

2. Classification: Categorizing the identified named entities into predefined types.
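As a rough illustration of the two subtasks, the sketch below uses a toy gazetteer lookup: finding the span boundaries is the segmentation step, and looking up the span's type is the classification step. The `GAZETTEER` contents, the `tag_entities` helper, and the `(start, end, type)` span format are assumptions made for this example, not a standard API; practical systems usually learn both steps jointly.

```python
from typing import Dict, List, Tuple

# Hypothetical toy gazetteer: token sequences mapped to entity types.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Barack", "Obama"): "PER",
    ("United", "Nations"): "ORG",
    ("Paris",): "LOC",
}


def tag_entities(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) spans over tokens; end is exclusive."""
    max_len = max(len(key) for key in GAZETTEER)
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        match = None
        # Segmentation: try the longest candidate span starting at i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in GAZETTEER:
                # Classification: assign the type stored for this span.
                match = (i, i + n, GAZETTEER[candidate])
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return spans


tokens = "Barack Obama addressed the United Nations in Paris".split()
spans = tag_entities(tokens)
for start, end, label in spans:
    print(label, " ".join(tokens[start:end]))
```

A pure lookup like this cannot handle unseen names or ambiguous ones, which is why the features and challenges below matter.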

### Features

Typical features for a feature-based NER tagger, where $w_i$ is the word being labeled:

- Identity of word $w_i$
- Identity of neighboring words
- Part of speech of word $w_i$ and neighboring words
- Base-phrase syntactic chunk label of $w_i$ and neighboring words
- Presence of $w_i$ in a gazetteer (a list of known entities)
- Prefixes and suffixes of $w_i$ (up to length 4)
- Capitalization pattern of $w_i$
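- Word shape of $w_i$ and neighboring words (uppercase letters mapped to "X", lowercase letters to "x", digits to "d"; e.g., "Mc-Donald" becomes "Xx-Xxxxxx")
- Short word shape of $w_i$ and neighboring words (word shape with runs of the same symbol collapsed; e.g., "Mc-Donald" becomes "Xx-Xx")
- Presence of a hyphen in $w_i$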
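Several of the features above can be computed directly from the token strings. The following is a minimal sketch under stated assumptions: the function names, the feature keys, and the shape convention (uppercase to `X`, lowercase to `x`, digits to `d`, runs collapsed for the short shape) are choices made for this example. Part-of-speech and chunk features are left out because they need an external tagger.

```python
import re
from typing import Dict, List, Set


def word_shape(word: str) -> str:
    """Map uppercase letters to X, lowercase letters to x, digits to d."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def short_word_shape(word: str) -> str:
    """Word shape with runs of the same symbol collapsed to one symbol."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))


def token_features(tokens: List[str], i: int, gazetteer: Set[str]) -> Dict[str, object]:
    """Build a feature dictionary for tokens[i] (feature names are illustrative)."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "word": word,
        "prev_word": tokens[i - 1] if i > 0 else "<s>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "</s>",
        "in_gazetteer": word in gazetteer,
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "shape": word_shape(word),
        "short_shape": short_word_shape(word),
        "has_hyphen": "-" in word,
    }
    # Prefixes and suffixes up to length 4.
    for n in range(1, min(4, len(word)) + 1):
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]
    return feats


feats = token_features(["Visit", "Mc-Donald", "today"], 1, gazetteer={"Mc-Donald"})
print(feats["shape"], feats["short_shape"])
```

Dictionaries like this one are typically fed to a sequence classifier, one dictionary per token.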

### Labels

The label set depends on the application or domain.

Common labels include Person, Organization, Facility, Location, and Geopolitical Entity. A fuller list with examples:

1. Person (PER): Names of individuals, such as "Barack Obama" or "William Shakespeare".

2. Organization (ORG): Names of organizations, companies, institutions, such as "Apple Inc." or "United Nations".

3. Location (LOC): Names of geographical locations such as mountains, rivers, and regions, like "Mount Everest" or "Amazon River". In label sets without a separate GPE type, countries and cities (e.g., "Paris") are also tagged LOC.

4. Geopolitical Entity (GPE): Names of geopolitical entities, such as countries, states, provinces, or cities, like "United States" or "California".

5. Facility (FAC): Names of buildings, airports, stadiums, bridges, and other man-made structures, such as "Eiffel Tower" or "JFK Airport".

6. Product (PROD): Names of products, brands, or services, like "iPhone" or "Coca-Cola".

7. Event (EVT): Names of historical events, natural disasters, or other incidents, such as "World War II" or "Hurricane Katrina".

8. Time (TIME): Time-of-day expressions, like "9 a.m." or "midnight".

9. Date (DATE): Date-related expressions, such as "January 1, 2000", "Tuesday", "June 15th", or "Independence Day".

10. Money (MONEY): Monetary expressions, such as "$100" or "5 euros".

11. Percent (PERCENT): Percentage expressions, like "50%" or "one-third".

12. Quantity (QUANTITY): Expressions of quantity or measurement, such as "5 kilometers" or "100 grams".

### Challenges

1. Ambiguity: The same name can refer to different entity types depending on the context (e.g., "London" may be a person's surname, a city, or, by metonymy, the British government).

2. Variability: Named entities can be written in various ways (e.g., abbreviations, acronyms, alternative spellings), making it challenging to identify and match them.

3. Nested entities: Named entities can be embedded within other named entities, making it difficult to identify and extract them correctly (e.g., "New York University" contains the location entity "New York").

4. Domain-specific entities: NER systems may need to be adapted for specific domains (e.g., medical, legal, financial), as the types of named entities and their representations can vary significantly across domains.

5. Language-specific issues: Different languages have unique characteristics, such as morphological variations and script differences, which can make NER more challenging in some languages compared to others.

6. Lack of labeled data: Supervised learning approaches for NER require large amounts of annotated data, which may not be readily available for all languages or domains.

7. Noisy data: Text data from sources like social media, web pages, or user-generated content may contain spelling errors, abbreviations, and informal language that make NER more challenging.

8. Multilingual and cross-lingual NER: Identifying and classifying named entities in multilingual or cross-lingual settings can be challenging due to variations in language-specific features and named entity representations.

## Relation Extraction

Relation Extraction: identifying and extracting relationships between entities in text. 

### Common Types of Relations:

- Person-person: ParentOf, MarriedTo, Manages
- Person-organization: WorksFor
- Organization-organization: IsPartOf
- Organization-location: IsHeadquarteredAt

### Approaches to Relation Extraction:

1. Using patterns: Employ regular expressions or gazetteers to identify patterns that indicate relations between entities.

2. Supervised learning: Use a classifier to determine whether a relation exists between two entities in a sentence, based on the words in the sentence, especially those between the entities.

3. Semi-supervised learning: Start with seed sentences containing known relations, find other sentences with similar words or expressions, and use these to extract more relations.
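The pattern-based approach (item 1) can be sketched with regular expressions. Everything in this example is an assumption for illustration: the two surface patterns, the crude `NAME` expression for capitalized names, and the sample text. Real pattern sets are usually larger and applied after NER, so that the arguments are known entities rather than anything capitalized.

```python
import re
from typing import List, Tuple

# Crude stand-in for an entity mention: one or more capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical hand-written patterns, one per relation type.
PATTERNS = {
    "WorksFor": re.compile(rf"(?P<arg1>{NAME}) works for (?P<arg2>{NAME})"),
    "IsHeadquarteredAt": re.compile(
        rf"(?P<arg1>{NAME}) is headquartered in (?P<arg2>{NAME})"
    ),
}


def extract_relations(text: str) -> List[Tuple[str, str, str]]:
    """Return (arg1, relation, arg2) triples for every pattern match."""
    triples = []
    for relation, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            triples.append((m.group("arg1"), relation, m.group("arg2")))
    return triples


text = "Alice Smith works for Acme Corp. Acme Corp is headquartered in Paris."
triples = extract_relations(text)
for triple in triples:
    print(triple)
```

Hand-written patterns like these tend to be precise but miss most paraphrases, which motivates the learning-based approaches.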

**Bootstrapping** is a semi-supervised method for relation extraction that starts with a small set of seed tuples (known relations) and iteratively expands the set of extracted relations by discovering new relations in the text.

Algorithm:

1. Initialize the `tuples` set with a few seed tuples that have the target relation `R`.

2. Iterate the following steps until a stopping criterion is met (e.g., no new tuples are found or a certain number of iterations have been completed):

   a. Find `sentences` that contain entities present in the current `tuples` set (only the seed tuples on the first iteration).

   b. Identify `patterns` by generalizing the context between and around the entities in the sentences.

   c. Use the discovered `patterns` to search for more tuples, called `newpairs`.

   d. Filter `newpairs` to only include those with high confidence (e.g., using a threshold or other evaluation metric).

   e. Update the `tuples` set by adding the `newpairs`.

3. Return the final set of `tuples` containing the extracted relations.
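The loop above can be sketched in a few lines. This is a simplified illustration, not a reference implementation, and it rests on several assumptions: entities are single capitalized tokens (`ENTITY`), a pattern is just the literal text between the two entities, and confidence is approximated by how many pattern matches support a pair (`min_support`). The corpus uses invented company names.

```python
import re
from collections import Counter
from typing import List, Set, Tuple

# Assumption: an entity is a single capitalized token.
ENTITY = r"([A-Z][A-Za-z]+)"


def bootstrap(corpus: List[str], seeds: Set[Tuple[str, str]],
              max_iters: int = 5, min_support: int = 1) -> Set[Tuple[str, str]]:
    tuples = set(seeds)
    for _ in range(max_iters):
        # Steps a and b: in sentences containing a known pair, keep the
        # text between the two entities as a pattern.
        patterns = set()
        for sentence in corpus:
            for e1, e2 in tuples:
                m = re.search(re.escape(e1) + r"(.+?)" + re.escape(e2), sentence)
                if m:
                    patterns.add(m.group(1))
        # Step c: apply every pattern to the corpus to find candidate pairs.
        candidates: Counter = Counter()
        for middle in patterns:
            regex = re.compile(ENTITY + re.escape(middle) + ENTITY)
            for sentence in corpus:
                for pair in regex.findall(sentence):
                    candidates[pair] += 1
        # Step d: keep candidates with enough supporting matches.
        newpairs = {p for p, n in candidates.items() if n >= min_support} - tuples
        if not newpairs:  # stopping criterion: nothing new was found
            break
        tuples |= newpairs  # step e
    return tuples


corpus = [
    "Acme is headquartered in Paris.",
    "Globex is headquartered in Berlin.",
    "Globex, based in Berlin, hired more staff.",
    "Initech, based in Tokyo, opened a new office.",
]
result = bootstrap(corpus, seeds={("Acme", "Paris")})
print(sorted(result))
```

In this toy run the seed yields the pattern " is headquartered in ", which finds a second pair; that pair then yields a second pattern, which finds a third. The same mechanism is what lets a single bad pattern pull in wrong tuples (semantic drift), so the confidence filter in step d matters in practice.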



### Evaluation Metrics for Relation Extraction:

- Precision (P): The ratio of correctly extracted relations to **all extracted** relations.

- Recall (R): The ratio of correctly extracted relations to **all existing** relations.

- F1 measure: The harmonic mean of precision and recall, defined as $F_1 = \frac{2PR}{P+R}$.

If no annotated data is available, only precision can be estimated (for example, by manually checking a sample of the extracted relations), because recall and the F1 measure require knowing all relations that actually exist in the text.
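When a gold-standard set of relations is available, the three metrics can be computed with set operations. The function name, the triple format, and the example data below are assumptions for illustration; returning 0.0 for an empty denominator is one common convention, not the only one.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (arg1, relation, arg2)


def precision_recall_f1(extracted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float, float]:
    """Compare extracted relations against gold-standard relations."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations and system output.
gold = {
    ("Alice", "WorksFor", "Acme"),
    ("Bob", "WorksFor", "Globex"),
    ("Acme", "IsHeadquarteredAt", "Paris"),
    ("Globex", "IsHeadquarteredAt", "Berlin"),
}
extracted = {
    ("Alice", "WorksFor", "Acme"),
    ("Acme", "IsHeadquarteredAt", "Paris"),
    ("Bob", "WorksFor", "Acme"),  # incorrect extraction
}

p, r, f1 = precision_recall_f1(extracted, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Here two of the three extracted triples are correct and two of the four gold triples are found, so precision is 2/3 and recall is 1/2.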