# Airline Data Parser
This notebook contains a parser that converts HTML airline tables into structured data: a pandas DataFrame that can be exported to CSV.

## Setup Requirements
- Python 3.9.6
- Required packages: beautifulsoup4, pandas
- Input file: `airline_data.html`, containing the airline tables

In [16]:
# Import required libraries
import pandas as pd
from bs4 import BeautifulSoup
import re
from datetime import datetime
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

## AirlineHTMLParser Class
This class parses the airline HTML file into a pandas DataFrame.

Key features:
- Reads an HTML file containing multiple airline tables
- Extracts each airline's section using its `START AIRLINE` / `END AIRLINE` comment markers
- Converts `Founded` and `Started Operations` dates from `DD Mon YYYY` to DD/MM/YYYY format
- Parses each table's rows into header/value pairs
- Exports data to CSV
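
The section-extraction step can be tried on its own before running the full parser. The following is a minimal sketch that reuses the regular expression from `extract_airline_sections`; the function name `extract_sections`, the airline names, and the table contents are made-up placeholders, not values from the real input file.

```python
import re

# Same marker pattern the parser uses: the name is captured from the START
# comment and the body runs lazily up to the next END comment.
SECTION_PATTERN = re.compile(
    r'<!-- START AIRLINE: (.*?) -->(.*?)<!-- END AIRLINE:',
    re.DOTALL,
)

def extract_sections(html_content):
    """Return a list of (airline_name, section_html) tuples."""
    return [
        (match.group(1).strip(), match.group(2).strip())
        for match in SECTION_PATTERN.finditer(html_content)
    ]

# Made-up sample input with two placeholder airlines.
sample_html = """
<!-- START AIRLINE: Example Air -->
<table class="infobox"><tr><th>Country</th><td>Exampleland</td></tr></table>
<!-- END AIRLINE: Example Air -->
<!-- START AIRLINE: Sample Airways -->
<table class="infobox"><tr><th>Country</th><td>Sampleland</td></tr></table>
<!-- END AIRLINE: Sample Airways -->
"""

sections = extract_sections(sample_html)
for name, section in sections:
    print(name, "->", len(section), "characters")
```

Because the body group is lazy and `re.DOTALL` is set, each match stops at the first `END AIRLINE` marker, so multi-line tables stay within their own section.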

In [23]:
class AirlineHTMLParser:
    """
    Parser for converting HTML airline tables into structured data format
    """
    
    def __init__(self, input_file='airline_data.html'):
        """Initialize parser with input file path"""
        self.input_file = input_file
    
    def read_html_file(self):
        """Read the HTML file containing airline tables"""
        try:
            with open(self.input_file, 'r', encoding='utf-8') as file:
                return file.read()
        except (OSError, UnicodeDecodeError) as e:
            logging.error(f"Error reading file {self.input_file}: {e}")
            return None
            
    def extract_airline_sections(self, html_content):
        """Extract individual airline sections from HTML content"""
        pattern = r'<!-- START AIRLINE: (.*?) -->(.*?)<!-- END AIRLINE:'
        matches = re.finditer(pattern, html_content, re.DOTALL)
        
        airline_sections = []
        for match in matches:
            airline_name = match.group(1)
            airline_html = match.group(2)
            airline_sections.append((airline_name, airline_html))
            
        return airline_sections
        
    def standardize_date(self, date_str):
        """Convert a 'DD Mon YYYY' date string to DD/MM/YYYY format.

        Strings that do not match are logged and returned unchanged.
        """
        try:
            date_obj = datetime.strptime(date_str, '%d %b %Y')
            return date_obj.strftime('%d/%m/%Y')
        except ValueError as e:
            logging.warning(f"Could not parse date: {date_str}. Error: {e}")
            return date_str
            
    def parse_airline_table(self, airline_html):
        """Parse individual airline table HTML"""
        soup = BeautifulSoup(airline_html, 'html.parser')
        table = soup.find('table', class_='infobox')
        
        if not table:
            return None
            
        airline_data = {}
        
        # Extract logo URL
        logo_img = table.find('img', class_='airline-logo__img')
        if logo_img and 'src' in logo_img.attrs:
            airline_data['logo_url'] = logo_img['src']
            
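        # Extract the remaining fields, one header/value pair per table row
        for row in table.find_all('tr'):
            cells = row.find_all(['td', 'th'])
            if len(cells) >= 2:
                header = cells[0].text.strip()
                # Rows with three or more cells take the value from the third cell;
                # two-cell rows take it from the second
                if len(cells) > 2:
                    value = cells[2].text.strip()
                else:
                    value = cells[1].text.strip()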
                    
                # Clean up header
                header = header.replace('\n', ' ').strip()
                
                # Handle specific fields
                if header in ['Founded', 'Started Operations']:
                    value = self.standardize_date(value)
                    
                airline_data[header] = value
                
        return airline_data
        
    def process_airlines(self):
        """Process all airlines in the HTML file"""
        html_content = self.read_html_file()
        if not html_content:
            return None
            
        airline_sections = self.extract_airline_sections(html_content)
        all_airlines_data = []
        
        for airline_name, airline_html in airline_sections:
            logging.info(f"Processing airline: {airline_name}")
            airline_data = self.parse_airline_table(airline_html)
            if airline_data:
                airline_data['Airline Name'] = airline_name
                all_airlines_data.append(airline_data)
                
        return pd.DataFrame(all_airlines_data)
        
    def export_to_csv(self, df, filename='parsed_airline_data.csv'):
        """Export DataFrame to CSV"""
        try:
            df.to_csv(filename, index=False)
            logging.info(f"Data successfully exported to {filename}")
        except Exception as e:
            logging.error(f"Error exporting data to CSV: {e}")

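`standardize_date` accepts a single input format (`%d %b %Y`) and returns anything else unchanged. If the source tables mix date styles, one option is to try several formats in order. The sketch below is an illustration only: the function name `standardize_date_multi` and the second format (`%d %B %Y`, full month name) are assumptions, and month-name parsing assumes an English locale.

```python
from datetime import datetime

# Candidate input formats, tried in order. Only the first is used by the
# notebook's standardize_date; the second is an assumption for illustration.
CANDIDATE_FORMATS = ("%d %b %Y", "%d %B %Y")

def standardize_date_multi(date_str, formats=CANDIDATE_FORMATS):
    """Return date_str as DD/MM/YYYY, or unchanged if no format matches."""
    cleaned = date_str.strip()
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue
    return date_str

# Made-up sample values, not taken from the airline data.
print(standardize_date_multi("05 Jan 2001"))
print(standardize_date_multi("5 January 2001"))
print(standardize_date_multi("2001"))
```

As in the original method, unparseable input is passed through rather than dropped, so no field is lost when a date is in an unexpected format.
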
## Testing the Parser
Next, run the parser on the airline data. Before running it, make sure `airline_data.html` exists and that each airline's table is wrapped in `START AIRLINE` / `END AIRLINE` comment markers, as in this example:

```html
<!-- START AIRLINE: Qatar Airways -->
<table class="infobox">
... Qatar Airways table content ...
</table>
<!-- END AIRLINE: Qatar Airways -->
```

In [24]:
# Initialize parser
parser = AirlineHTMLParser('airline_data.html')

# Process airlines and get DataFrame
df = parser.process_airlines()

if df is not None:
    # Display the data
    print("\nExtracted Airline Data:")
    print(df.to_string())
    
    # Export to CSV
    parser.export_to_csv(df)
else:
    print("Error processing airline data")

2025-01-28 16:44:34,645 - INFO - Processing airline: Qatar Airways
2025-01-28 16:44:34,650 - INFO - Processing airline: Singapore Airlines
2025-01-28 16:44:34,654 - INFO - Processing airline: Emirates Airlines
2025-01-28 16:44:34,657 - INFO - Processing airline: All Nippon Airways
2025-01-28 16:44:34,662 - INFO - Processing airline: Cathay Pacific Airways
2025-01-28 16:44:34,666 - INFO - Processing airline: Japan Airlines
2025-01-28 16:44:34,670 - INFO - Processing airline: Turkish Airlines
2025-01-28 16:44:34,673 - INFO - Processing airline: EVA Air
2025-01-28 16:44:34,677 - INFO - Processing airline: Air France
2025-01-28 16:44:34,682 - INFO - Processing airline: Swiss International Air Lines
2025-01-28 16:44:34,713 - INFO - Data successfully exported to parsed_airline_data.csv



Extracted Airline Data:
logo_url  IATAQR  Full Name  Country  Founded  Started Operations  Group / Part of  Headquarters  Main Hub  Fleet Size  Average Fleet Age 1  Official Site  Airline Name  IATASQ  Predecessors
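
The column header above contains both `IATAQR` and `IATASQ`, which suggests that the IATA label and the airline's code share one header cell, so each airline ends up with a differently named column. One possible way to split such headers is sketched below. It assumes the fused text is always the label `IATA` followed by a two-character code; the helper name `split_iata_header` is hypothetical and not part of the parser.

```python
import re

# Assumption: the fused header is "IATA" immediately followed by a
# two-character alphanumeric airline code, e.g. "IATAQR".
IATA_HEADER = re.compile(r'^IATA\s*([A-Z0-9]{2})$')

def split_iata_header(header):
    """Return (label, code); code is None when the header is not fused."""
    match = IATA_HEADER.match(header)
    if match:
        return 'IATA', match.group(1)
    return header, None

print(split_iata_header('IATAQR'))
print(split_iata_header('Country'))
```

Inside `parse_airline_table`, a non-`None` code could be stored under a single `IATA` key so that all airlines share one column. Whether that fits depends on the actual cell layout in `airline_data.html`.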

In [25]:
# Initialize parser
parser = AirlineHTMLParser('airline_data.html')

# Check that the input file exists before parsing
import os
if not os.path.exists(parser.input_file):
    print(f"Error: {parser.input_file} file not found!")
    print("Current working directory:", os.getcwd())
else:
    # Read the file up front so read errors and empty files are reported separately
    html_content = parser.read_html_file()
    if html_content is None:
        print("Error: Could not read HTML file")
    elif not html_content.strip():
        print("Error: HTML file is empty")
    else:
        print("File read successfully, length:", len(html_content))

        # Process airlines and get DataFrame (process_airlines re-reads the file)
        df = parser.process_airlines()
        
        if df is not None:
            # Display the data
            print("\nExtracted Airline Data:")
            print(df.to_string())
            
            # Export to CSV
            parser.export_to_csv(df)
        else:
            print("Error: Could not process airline data")

2025-01-28 16:44:35,865 - INFO - Processing airline: Qatar Airways
2025-01-28 16:44:35,869 - INFO - Processing airline: Singapore Airlines
2025-01-28 16:44:35,873 - INFO - Processing airline: Emirates Airlines
2025-01-28 16:44:35,877 - INFO - Processing airline: All Nippon Airways
2025-01-28 16:44:35,881 - INFO - Processing airline: Cathay Pacific Airways
2025-01-28 16:44:35,884 - INFO - Processing airline: Japan Airlines
2025-01-28 16:44:35,887 - INFO - Processing airline: Turkish Airlines
2025-01-28 16:44:35,890 - INFO - Processing airline: EVA Air
2025-01-28 16:44:35,893 - INFO - Processing airline: Air France
2025-01-28 16:44:35,897 - INFO - Processing airline: Swiss International Air Lines
2025-01-28 16:44:35,943 - INFO - Data successfully exported to parsed_airline_data.csv


File read successfully, length: 56185

Extracted Airline Data:
logo_url  IATAQR  Full Name  Country  Founded  Started Operations  Group / Part of  Headquarters  Main Hub  Fleet Size  Average Fleet Age 1  Official Site  Airline Name  IATASQ  Predecessors