In [1]:
!pip3 --version
!pip install transformers

pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)


In [10]:
# Import Libraries
import cv2
import re, string, unicodedata                          # Import Regex, string and unicodedata
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import nltk                                             # NLP tool-kit
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize, sent_tokenize  # Import Tokenizer
from nltk.stem.wordnet import WordNetLemmatizer         # Import Lemmatizer
from nltk.stem import PorterStemmer                     # Stemmer
from google.colab import drive
from google.colab.patches import cv2_imshow
from bs4 import BeautifulSoup
%matplotlib inline

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [29]:
# Load data by providing the path to the file
tweets_df = pd.read_csv('/content/sample_data/Tweets.csv')
tweets_df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [30]:
# Print dataframe's shape and description
print('df Shape: ', tweets_df.shape)
print(tweets_df.info())

tweets_df.describe()

df Shape:  (14640, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 n

Unnamed: 0,tweet_id,airline_sentiment_confidence,negativereason_confidence,retweet_count
count,14640.0,14640.0,10522.0,14640.0
mean,5.692184e+17,0.900169,0.638298,0.08265
std,779111200000000.0,0.16283,0.33044,0.745778
min,5.675883e+17,0.335,0.0,0.0
25%,5.685592e+17,0.6923,0.3606,0.0
50%,5.694779e+17,1.0,0.6706,0.0
75%,5.698905e+17,1.0,1.0,0.0
max,5.703106e+17,1.0,1.0,44.0


Looking at the columns and its values, I feel that, for "Entity Extraction" of airport mentions, we only need the `'airline'` and the `'text'` columns. So let's only consider these 2 columns.

In [31]:
# Drop all other columns except “text” and “airline”
tweets_df = tweets_df[['airline', 'text']]
tweets_df.head()

Unnamed: 0,airline,text
0,Virgin America,@VirginAmerica What @dhepburn said.
1,Virgin America,@VirginAmerica plus you've added commercials t...
2,Virgin America,@VirginAmerica I didn't today... Must mean I n...
3,Virgin America,@VirginAmerica it's really aggressive to blast...
4,Virgin America,@VirginAmerica and it's a really big bad thing...


Looking at the text, it looks like we have a lot of '@'s, '#'s, '$'s, underscores and URLs in the text. We also will not need numbers for extracting airport mentions. So I am preprocessing the data to remove all these. I will refer to the standard NLP preprocessing functions available across the web to avoid reinventing the wheel.


The text also has emojis and smileys which I will remove while removing non-ascii characters.



## Data Preprocessing

In [33]:
# Html tag removal, function to remove HTML tags
def remove_html_tags(textcpy):
    """Remove HTML tags in string of text"""
    return BeautifulSoup(textcpy, 'html.parser').get_text()


# Remove Underscore
def remove_underscore(textcpy):
    return re.sub(r'_', '', textcpy)


# Remove at-rate
def remove_attherate(textcpy):
    return re.sub(r'@', '', textcpy)


# Remove hash #
def remove_hash(textcpy):
    return re.sub(r'#', '', textcpy)


# Remove dollar $
def remove_dollar(textcpy):
    return re.sub(r'$', '', textcpy)


# Remove numbers
def remove_numbers(textcpy):
  """Remove numbers in string of text"""
  return re.sub(r'\d+', '', textcpy)


# Remove non-ascii characters
def remove_non_ascii(textcpy):
    """Remove non-ASCII characters from list of words"""
    words = textcpy.split(' ')

    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return ' '.join(map(str, new_words))


In [34]:
tweets_df['text'] = tweets_df['text'].apply(lambda i: remove_html_tags(i))

tweets_df['text'] = tweets_df['text'].apply(lambda i: remove_underscore(i))

tweets_df['text'] = tweets_df['text'].apply(lambda i: remove_attherate(i))

tweets_df['text'] = tweets_df['text'].apply(lambda i: remove_hash(i))

tweets_df['text'] = tweets_df['text'].apply(lambda i: remove_dollar(i))

tweets_df['text'] = tweets_df['text'].apply(lambda i: remove_numbers(i))

tweets_df['text'] = tweets_df['text'].apply(lambda i: remove_non_ascii(i))

tweets_df.head()

  return BeautifulSoup(textcpy, 'html.parser').get_text()


Unnamed: 0,airline,text
0,Virgin America,VirginAmerica What dhepburn said.
1,Virgin America,VirginAmerica plus you've added commercials to...
2,Virgin America,VirginAmerica I didn't today... Must mean I ne...
3,Virgin America,VirginAmerica it's really aggressive to blast ...
4,Virgin America,VirginAmerica and it's a really big bad thing ...


Let's look at the `'text'` column a little bit more in detail to see how it looks like now.

In [35]:
tweets_df['text'].head(20)

0                     VirginAmerica What dhepburn said.
1     VirginAmerica plus you've added commercials to...
2     VirginAmerica I didn't today... Must mean I ne...
3     VirginAmerica it's really aggressive to blast ...
4     VirginAmerica and it's a really big bad thing ...
5     VirginAmerica seriously would pay $ a flight f...
6     VirginAmerica yes, nearly every time I fly VX ...
7     VirginAmerica Really missed a prime opportunit...
8      virginamerica Well, I didn't...but NOW I DO! :-D
9     VirginAmerica it was amazing, and arrived an h...
10    VirginAmerica did you know that suicide is the...
11    VirginAmerica I < pretty graphics. so much bet...
12    VirginAmerica This is such a great deal! Alrea...
13    VirginAmerica virginmedia I'm flying your fabu...
14                                VirginAmerica Thanks!
15         VirginAmerica SFO-PDX schedule is still MIA.
16    VirginAmerica So excited for my first cross co...
17    VirginAmerica  I flew from NYC to SFO last

In [37]:
for i in range(5):
  print(tweets_df['text'][i])

VirginAmerica What dhepburn said.
VirginAmerica plus you've added commercials to the experience... tacky.
VirginAmerica I didn't today... Must mean I need to take another trip!
VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse
VirginAmerica and it's a really big bad thing about it


Now, I will try a few methods to solve this task of Airport Entity Extraction. Our goal is to extract airport mentions from tweet texts and identify the airports where each airline operates. From my observations, the tweet text contains mentions (tagged person/airline/company/govt agency), or the city name or the airport code. Since we want a method to "identify and extract mentions of airports", here I will make an assumption that we're looking for airport codes.

From Wikipedia, here's a list of airports in the United States: https://en.wikipedia.org/wiki/List_of_airports_in_the_United_States.


In [67]:
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_airports_in_the_United_States"
res = requests.get(URL).text
soup = BeautifulSoup(res,'lxml')

cities_in_usa = set()
airport_codes_in_usa = set()
for items in soup.find('table', class_='wikitable').find_all('tr')[1::1]:
    data = items.find_all(['th','td'])
    try:
      city = data[0].a.text
      if data[1] != None and data[1].a != None and data[1].a.text != None:
        code = data[1].a.text
        cities_in_usa.add(city)
        airport_codes_in_usa.add(code)
    except IndexError:
      pass

print("Total: ", len(cities_in_usa), list(cities_in_usa)[:8])
print("Total: ", len(airport_codes_in_usa), list(airport_codes_in_usa)[:10])

Total:  259 ['Orlando', 'Phoenix', 'Fargo', 'Baton Rouge', 'Minneapolis-St. Paul', 'Kearney', 'Lafayette', 'Minot']
Total:  272 ['SNA', 'SRQ', 'GRK', 'SBY', 'CLE', 'AUS', 'TXK', 'CVG', 'ABI', 'SWF']


In [86]:
print("BFF" in airport_codes_in_usa)

True


Therefore, there are a total of 259 cities in the USA and 272 airport codes in the United States.

# Entity Extraction: Airports & Airlines


In [68]:
# Let's make a copy of our dataframe.
df = tweets_df

In [69]:
airlines_list = df.airline.unique()
print(len(airlines_list), airlines_list)

6 ['Virgin America' 'United' 'Southwest' 'Delta' 'US Airways' 'American']


So there are 6 unique airlines in our dataframe.

In [70]:
# Check the shape of data.
print(df.shape)
df.isnull().sum(axis=0)

(14640, 2)


airline    0
text       0
dtype: int64

## Extract airport mentions from text and identify the airports where each airline operates.


Let's assume that the text in the tweet could have airport mentions in any of the cases: uppercase, lowercase, or mixed of both. Using regular expressions, let's find a string of length 3, followed or preceded by `'\b'`. `'\b'` matches any character that is not a letter or number without including itself in the match.

### 1. Using Regular Expression

In [77]:
import re

def extract_airports_mixed(text):
    # Use a regular expression to find airport codes (assuming three mixed letters)
    airports = re.findall(r'\b[A-Za-z]{3}\b', text)
    result = set()

    for apt in airports:
      if apt.upper() in airport_codes_in_usa:
        result.add(apt.upper())
    return result

In [82]:
# Set to store all airport codes (for comparisons)
total_airports_in_df = set([])

# Dictionary which stores a list of airports for all airlines (used set to avoid duplicates)
airline_airports_dict = {element: set([]) for element in airlines_list}

# Iterate over all rows in df
for index, row in df.iterrows():
  tweet_text = row['text']
  company = row['airline']
  airport_mentions = extract_airports_mixed(tweet_text)

  total_airports_in_df.update(airport_mentions)
  airline_airports_dict[company].update(airport_mentions)

# Prints
print("Total num of Airports in dataframe:", len(total_airports_in_df))

for key, val in airline_airports_dict.items():
  print("\n" + key + ":", val)
  print("Count:", len(val))

Total num of Airports in dataframe: 142

Virgin America: {'PHL', 'SUN', 'SFO', 'SAW', 'AUS', 'DAY', 'FLL', 'PDX', 'PSP', 'MIA', 'SAT', 'FAT', 'CMH', 'LAX', 'DTW', 'DFW', 'EAT', 'SEA', 'LGA', 'LAS', 'EWR', 'MCO', 'DAL', 'JFK', 'SJC', 'SAN', 'BFF', 'EAR'}
Count: 28

United: {'SNA', 'CLE', 'AUS', 'CVG', 'TPA', 'ELP', 'SHV', 'OGG', 'RNO', 'BNA', 'PBI', 'LAW', 'ART', 'RSW', 'GJT', 'LAX', 'FSD', 'ERI', 'EAT', 'MSN', 'JAX', 'BWI', 'GRR', 'SUN', 'MSY', 'HNL', 'FLL', 'PDX', 'HDN', 'RDU', 'PSP', 'OKC', 'LIT', 'ACT', 'MDT', 'ALB', 'PIT', 'DFW', 'LGA', 'HOU', 'ROC', 'MCO', 'MSP', 'BUF', 'ITO', 'PHL', 'ATL', 'SAW', 'FAR', 'DAY', 'SAT', 'BRO', 'PHX', 'FWA', 'DEN', 'ASE', 'CLT', 'EWR', 'LAS', 'JFK', 'KOA', 'COS', 'FAY', 'OMA', 'BTV', 'SFO', 'IAH', 'SDF', 'ONT', 'SBA', 'SMF', 'MKE', 'MCI', 'MOT', 'MIA', 'JMS', 'CMH', 'CAE', 'DTW', 'MHT', 'SEA', 'STL', 'SAN', 'BGM', 'GNV'}
Count: 85

Southwest: {'PHL', 'SNA', 'SAW', 'FAR', 'CLE', 'AUS', 'DAY', 'TPA', 'BRD', 'FAT', 'SAT', 'BNA', 'PHX', 'PBI', 'ALS', 'GE

This does give decent results.

But this does not seem very accurate. Actual English words like EAR, EAT, SAT, BFF, etc. are also getting categorized as airport mentions for that particular airline row.

The downside is that: it will count any 3-letter independently occuring word and check if it exists as a code in the directory of US airport codes that we created above.

### 2. Using the popular NLP library spaCy that provides pre-trained models for Named Entity Recognition

In [89]:
import spacy

# Load spaCy's pre-trained NER model
nlp = spacy.load("en_core_web_sm")

def extract_airports_ner(text):
    doc = nlp(text)
    airport_mentions = [ent.text for ent in doc.ents if ent.label_ == "LOC"]
    return airport_mentions

In [90]:
airport_codes = set([])
airline_airports_dict = {element: set([]) for element in airlines_list}

for index, row in df.iterrows():
  tweet_text = row['text']
  company = row['airline']
  airports_mentioned = extract_airports_ner(tweet_text)

  airport_codes.update(airports_mentioned)
  airline_airports_dict[company].update(airports_mentioned)


# Prints
print("Total num of Airport mentions found:", len(airport_codes))

for key, value in airline_airports_dict.items():
  print("\n" + key + ":", value)
  print("Count:", len(value))

Total num of Airports: 62

Virgin America: {'Virgin America', 'Europe', 'Rockies', 'Southwest', 'Central Baggage', 'Silicon Valley', 'NYC'}
Count: 7

United: {'the United App', 'StartingBloc', 'the Mojave Desert', 'Maui', 'B. Res', 'Bay .', 'East', 'Tarmac', 'Montego Bay', 'the Western Hemisphere', 'North America', 'Asia Pac', 'the East Coast', 'East Bay', 'South Florida', 'Europe', 'Southwest', 'http://t.co/SdyLuKRpt', 'Asia', 'flySFO', 'NYC', 'Pacific Rim', 'east coast', 'Delta'}
Count: 24

Southwest: {'Caribbean', 'Carolina', 'South Florida', 'New England', 'Southwest', 'Northern California', 'nyc', 'Delta', 'earth', 'NYC', 'Columbus OH'}
Count: 11

Delta: {'Caribbean', 'FoodNetwork SOBEWFF', 'South Florida', 'Southwest', 'the Middle East', 'Delta', 'Cartago', 'the Bay Area', 'Tweet', 'Northeast', 'north east', 'West Palm', 'NYC', 'my bay area'}
Count: 14

US Airways: {'East Coast Freeze', 'Neptune', 'MoBay', 'Europe', 'Southwest', 'Manch', "cx'd", 'Eastern GastonCounty', 'Montego B

This is not very accurate.

EUROPE, the MIDDLE EAST, Atlantic, etc. are not airports. The "LOC" tag that we used is used for location. So it's taking any location as result.

Modifying the label in spacy to ent.label_ == "GPE":

In [91]:
import spacy

# Load spaCy's pre-trained NER model
nlp = spacy.load("en_core_web_sm")

def extract_airports_ner(text):
    doc = nlp(text)
    airport_mentions = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    return airport_mentions

In [92]:
airport_codes = set([])
airline_airports_dict = {element: set([]) for element in airlines_list}

for index, row in df.iterrows():
  tweet_text = row['text']
  company = row['airline']
  airports_mentioned = extract_airports_ner(tweet_text)

  airport_codes.update(airports_mentioned)
  airline_airports_dict[company].update(airports_mentioned)

# Prints
print("Total num of Airport mentions found:", len(airport_codes))

for key, value in airline_airports_dict.items():
  print("\n" + key + ":", value)
  print("Count:", len(value))

Total num of Airport mentions found: 540

Virgin America: {'Los Angeles', 'AUS', 'Philadelphia', 'KCIAirport', 'Palm Springs', 'New York', 'Seattle', 'SoundOfMusic', 'UK', 'New Route', 'San Diego', 'Atlanta', 'VA', 'LA', 'Dulles', 'http://t.co/fqXElbOn', 'Elevate', 'Hawaii', 'Austin', 'new york', 'Dallas', 'PA', 'Vegas', 'http://t.co/YAOrMfkaC', 'US', 'SF', 'Newark', 'USA \n http://t.co/AzTDaer', 'Texas', 'Houston', 'Boston', 'America', 'San Jose', 'Paris', 'San Francisco', 'Flighting', 'VX', 'Airline', 'texas', 'DC', 'Thurs'}
Count: 41

United: {'united please', 'Orlando', 'Phoenix', 'Expedia', 'San Pedro', 'Colo. Springs', 'AUS', 'x.x.', 'orlando', 'New York', 'Bahamas', 'Palm Springs', 'chicago', 'Bangkok', 'Colombia', 'San Diego', 'Zambia', 'Phl', 'Cebu', 'Tarmac', 'LH', 'JetBlue', 'Atlanta', 'Rapid City', 'Saipan', 'shanghai', 'united roundtrip', 'united landing', 'Norway', 'RT', 'Edinburgh', 'united nice', 'Montana', 'Minneapolis', 'Istanbul', 'omaha', 'LAX.When', 'Melbourne', 'c

This method finds a lot of airports for each airline, most of which are cities/countries/states/continents.

I will try to implement a method mentioned on the HuggingFace wiki (ref in doc under "Resources").

### 3. Using transformers and the BERT language model (HuggingFace)

In [95]:
import numpy as np
import tensorflow as tf
from transformers import BertTokenizer, TFBertForTokenClassification

# Load pre-trained BERT model and tokenizer for token classification
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForTokenClassification.from_pretrained(model_name)


# Function to extract airport mentions from tweet text
def extract_airports_with_bert(tweet_text):
    # Tokenize the input text
    tokens = tokenizer.encode(tweet_text, return_tensors="tf")

    # Make predictions using the BERT model
    outputs = model(tokens)

    # Get predicted labels for each token
    predicted_labels = np.argmax(outputs.logits.numpy(), axis=-1)

    # Decode the tokens and labels
    token_labels = [(tokenizer.decode(token), label) for token, label in zip(tokens[0].numpy(), predicted_labels[0])]

    # print(token_labels)
    # Extract entities labeled as airports
    airport_entities = [token for token, label in token_labels if label == 8]  # Assuming 1 is the label for airports

    return airport_entities

All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


**Note**: This block below is super slow and can be further optimized given more time.

In [97]:
airport_codes = set([])
airline_airports_dict = {element: set([]) for element in airlines_list}

for index, row in df.iterrows():
    tweet_text = row['text']
    company = row['airline']
    potential_mentions = extract_airports_with_bert(tweet_text)

    processed_codes = []
    is_next_merge_2_prev = False
    for code in potential_mentions:
        code = code.split()

        can_merge_2_prev = is_next_merge_2_prev or code[0] == '#'
        is_next_merge_2_prev = code[-1] == '#'

        code = [c for c in code if c != '#']
        code = ''.join(code)

        if can_merge_2_prev and processed_codes:
            processed_codes[-1] = processed_codes[-1] + code
        else:
            processed_codes.append(code)

    airports_mentioned = [code for code in processed_codes if len(code) == 3 and code.upper() == code]

    if airports_mentioned:
        # print(company, ":", airports_mentioned)
        airport_codes.update(airports_mentioned)
        airline_airports_dict[company].update(airports_mentioned)

# Prints
print("Total num of Airports:", len(airport_codes))

for key, value in airline_airports_dict.items():
    print("\n" + key + ":", value)
    print("Count:", len(value))

Total num of Airports: 66

Virgin America: {'SFO', 'AUS', 'LAX', 'DFW', 'USA', 'LAS', 'JFK', 'MCO', 'DCA', 'SAN', 'NYC'}
Count: 11

United: {'SNA', 'OKA', 'AUS', 'YVR', 'PEK', 'LAX', 'SLC', 'ORD', 'BCN', 'YYZ', 'USA', 'CLT', 'JFK', 'LAS', 'JAX', 'SFO', 'IAH', 'BKK', 'HDN', 'KUL', 'RDU', 'MIA', 'NYC', 'IAD', 'DFW', 'ROC'}
Count: 26

Southwest: {'SNA', 'ORL', 'AUS', 'BNA', 'MDW', 'PHX', 'NSW', 'LAX', 'USA', 'JFK', 'CHI', 'MSY', 'SFO', 'SDF', 'MID', 'MKE', 'FLL', 'RIC', 'RDU', 'OKC', 'NYC', 'CMH', 'BOS', 'HRL', 'LGA', 'SEA', 'BUF', 'ATL', 'SAN'}
Count: 29

Delta: {'SFO', 'SRQ', 'SJU', 'BOS', 'HPN', 'LAX', 'DFW', 'FLL', 'USA', 'PVD', 'JFK', 'LAS', 'BQN', 'SJC', 'MCO', 'BUF', 'NYC'}
Count: 17

US Airways: {'PHL', 'SRQ', 'AUS', 'RNO', 'PBI', 'PHX', 'LAX', 'USA', 'CLT', 'JFK', 'LAS', 'SFO', 'PRN', 'RDU', 'NYC', 'DFW', 'PVD', 'ROC', 'DCA'}
Count: 19

American: {'EGE', 'GRK', 'VCP', 'LAX', 'USF', 'ORD', 'USA', 'JFK', 'LAS', 'JAX', 'SFO', 'RDU', 'OKC', 'LIT', 'NYC', 'IAD', 'CMH', 'DTW', 'DFW', '

The output of the extract_airports_with_bert() function is a list of tokens. The tokens have '#' and '##' in the word, so I've done some post-processing here above to obtain the airport codes from the returned list of tokens.

This block above, however, is super slow and can be further optimized given more time.


In [110]:
final_airports_with_airlines = {element: set([]) for element in airlines_list}

for key, values in airline_airports_dict.items():
  for airport in list(values):
    if airport in airport_codes_in_usa:
      final_airports_with_airlines[key].add(airport)

for k, v in final_airports_with_airlines.items():
  print("\n" + k + ":", v)
  print("Count:", len(v))


Virgin America: {'SFO', 'AUS', 'LAX', 'DFW', 'LAS', 'JFK', 'MCO', 'SAN'}
Count: 8

United: {'SNA', 'SFO', 'IAH', 'AUS', 'LAX', 'DFW', 'CLT', 'JFK', 'LAS', 'HDN', 'JAX', 'ROC', 'RDU', 'MIA'}
Count: 14

Southwest: {'SNA', 'AUS', 'BNA', 'PHX', 'LAX', 'JFK', 'MSY', 'SFO', 'SDF', 'MKE', 'FLL', 'RDU', 'OKC', 'CMH', 'HRL', 'LGA', 'SEA', 'BUF', 'ATL', 'SAN'}
Count: 20

Delta: {'SFO', 'SRQ', 'HPN', 'LAX', 'DFW', 'FLL', 'JFK', 'LAS', 'SJC', 'MCO', 'BUF'}
Count: 11

US Airways: {'PHL', 'SRQ', 'SFO', 'AUS', 'LAX', 'DFW', 'CLT', 'JFK', 'LAS', 'ROC', 'RNO', 'RDU', 'PBI', 'PHX'}
Count: 14

American: {'CMH', 'EGE', 'SFO', 'GRK', 'LAX', 'DTW', 'DFW', 'LGA', 'JFK', 'LAS', 'JAX', 'RDU', 'CLL', 'OKC', 'LIT'}
Count: 15


So this approach seems the best and most accurate of all the methods that we tried above!

We have thus obtained a final list of US airports where each US airline operates.

### **Final Comments**


1.   All the three methods listed above have certain drawbacks.
2.   We could use LangChain's functions to solve this problem, which requires the OpenAI API's key for every user (which is not shareable). Here is how we could typically do it by importing the `openai` library from `langchain`. This is also more accurate than rest of the methods.

 https://python.langchain.com/docs/use_cases/extraction

