# PDF to CSV Converter for APEMCET Data

This notebook converts the APEMCET (AP Engineering Common Entrance Test) cutoff data from PDF format to CSV. It uses the following libraries:
- `pdfplumber`: For extracting tables from PDF files
- `pandas`: For data manipulation and CSV output
- `os`: For file path handling

## Setup and Imports

In [2]:
# Import required libraries
import pdfplumber
import pandas as pd
import os

## File Path Configuration

Define the input PDF file path and output CSV file path. We'll use the current directory to make it easier to work with relative paths.

In [3]:
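# Get the current working directory (`__file__` is not defined inside a notebook,
# so resolving the literal string '__file__' only ever returned this directory)
current_dir = os.getcwd()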

# Define input and output paths
pdf_path = os.path.join(current_dir, "APEAPCET_2022_Cutoff.pdf")
csv_path = os.path.join(current_dir, "csv_file.csv")

print(f"PDF file path: {pdf_path}")
print(f"CSV file path: {csv_path}")

PDF file path: d:\Projects\APEMCET_2023\APEAPCET_2022_Cutoff.pdf
CSV file path: d:\Projects\APEMCET_2023\csv_file.csv
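
The same path setup can also be sketched with `pathlib`, which avoids string joins and makes existence checks easy. This is only an illustrative alternative: the names `base_dir`, `pdf_file`, and `csv_file` are hypothetical and chosen so they do not overwrite the variables defined above, and the sketch assumes the notebook is started from the folder that holds the PDF.

```python
from pathlib import Path

# Assumption: the notebook's working directory is the folder containing the PDF.
base_dir = Path.cwd()

# Hypothetical variable names; the file names match the ones used above.
pdf_file = base_dir / "APEAPCET_2022_Cutoff.pdf"
csv_file = base_dir / "csv_file.csv"

# Report whether the input is actually present before any extraction starts.
print(f"PDF file path: {pdf_file} (exists: {pdf_file.exists()})")
print(f"CSV file path: {csv_file}")
```

Functions that accept a path string, such as `os.path.exists`, generally accept `Path` objects as well; wrapping a value in `str()` is a safe fallback where a plain string is required.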


## Define Table Extraction Function

Create a function to extract tables from the PDF file. This function will:
1. Open the PDF file
2. Process each page
3. Extract tables from each page
4. Convert tables to pandas DataFrames
5. Combine all tables
6. Save the result to a CSV file

In [7]:
def extract_tables_from_pdf(pdf_path, csv_path):
    """
    Extract tables from PDF and save them to a CSV file.
    
    Args:
        pdf_path (str): Path to the input PDF file
        csv_path (str): Path where the CSV file will be saved
    """
    # Check if the PDF file exists
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    
    # List to store all tables
    all_tables = []
    
    try:
        # Open the PDF file
        with pdfplumber.open(pdf_path) as pdf:
            # Iterate through all pages
            for page_number, page in enumerate(pdf.pages, 1):
                print(f"Processing page {page_number} of {len(pdf.pages)}")
                
                # Extract tables from the current page
                tables = page.extract_tables()
                
                # Process each table in the page
                for table_number, table in enumerate(tables, 1):
                    print(f"Found table {table_number} on page {page_number}")
                    
                    # Convert table to pandas DataFrame
                    df = pd.DataFrame(table[1:], columns=table[0])
                    
                    # Clean the DataFrame
                    # Remove any empty rows
                    df = df.dropna(how='all')
                    # Remove any empty columns
                    df = df.dropna(axis=1, how='all')
                    
                    all_tables.append(df)
    
        if not all_tables:
            print("No tables found in the PDF")
            return
        
        # Combine all tables
        final_df = pd.concat(all_tables, ignore_index=True)
        
        # Save to CSV
        final_df.to_csv(csv_path, index=False)
        print(f"Successfully saved data to {csv_path}")
        
        # Display first few rows of the extracted data
        print("\nFirst few rows of the extracted data:")
        print(final_df.head())
        
    except Exception as e:
        print(f"An error occurred: {str(e)}")
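
One limitation of the cleaning step above is that `dropna` only removes rows and columns whose cells are missing values; cells that hold an empty or whitespace-only string are kept. The sketch below shows one way the per-table conversion might normalize such cells first. `clean_cell` and `table_to_dataframe` are hypothetical helpers, not part of `pdfplumber` or `pandas`, and the sample table is made up purely to exercise them.

```python
import pandas as pd


def clean_cell(cell):
    """Hypothetical helper: map None and blank strings to None, strip the rest."""
    if cell is None:
        return None
    text = str(cell).strip()
    return text if text else None


def table_to_dataframe(table):
    """Hypothetical helper: convert one extracted table (a list of rows) to a DataFrame."""
    if not table or len(table) < 2:
        # Nothing usable: either no header row or no data rows.
        return pd.DataFrame()
    rows = [[clean_cell(cell) for cell in row] for row in table[1:]]
    df = pd.DataFrame(rows, columns=table[0])
    # With blanks normalized to None, dropna removes fully empty rows and columns.
    return df.dropna(how="all").dropna(axis=1, how="all")


# Made-up table: the second data row and column "C" contain only blanks.
sample_table = [
    ["A", "B", "C"],
    ["1", "", None],
    [None, "  ", ""],
    ["2", "x", None],
]
print(table_to_dataframe(sample_table))
```

Inside `extract_tables_from_pdf`, a helper like this could stand in for the `pd.DataFrame(...)` call and the two `dropna` lines, if blank strings turn out to be present in the extracted tables.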

## Run the Conversion

Now let's run the function to convert the PDF to CSV. This will process all pages in the PDF and save the extracted tables to a CSV file.

In [8]:
# Extract tables and save to CSV
extract_tables_from_pdf(pdf_path, csv_path)

Processing page 1 of 60
Found table 1 on page 1
Processing page 2 of 60
Found table 1 on page 2
Processing page 3 of 60
Found table 1 on page 3
Processing page 4 of 60
Found table 1 on page 4
Processing page 5 of 60
Found table 1 on page 5
Processing page 6 of 60
Found table 1 on page 6
Processing page 7 of 60
Found table 1 on page 7
Processing page 8 of 60
Found table 1 on page 8
Processing page 9 of 60
Found table 1 on page 9
Processing page 10 of 60
Found table 1 on page 10
Processing page 11 of 60
...

## Preview the Results

Let's read the generated CSV file and display the first few rows to verify the conversion:

In [9]:
# Read and display the first few rows of the generated CSV file
try:
    df = pd.read_csv(csv_path)
    print("CSV file shape:", df.shape)
    print("\nColumns in the CSV file:")
    print(df.columns.tolist())
    print("\nFirst few rows of the data:")
    display(df.head())
except Exception as e:
    print(f"Error reading CSV file: {str(e)}")

CSV file shape: (1504, 32)

Columns in the CSV file:
['SNO', 'inst_code', 'inst_name', 'type', 'INST\n_RE\nG.', 'DIST', 'PLACE', 'COED', 'AFFLIA\n.UNIV', 'ESTD', 'branch_\ncode', 'Local_Ar\nea', 'OC_BO\nYS', 'OC_GIR\nLS', 'SC_BO\nYS', 'SC_GIR\nLS', 'ST_BOY\nS', 'ST_GIR\nLS', 'BCA_B\nOYS', 'BCA_GI\nRLS', 'BCB_B\nOYS', 'BCB_GI\nRLS', 'BCC_B\nOYS', 'BCC_GI\nRLS', 'BCD_B\nOYS', 'BCD_GI\nRLS', 'BCE_B\nOYS', 'BCE_GI\nRLS', 'OC_EWS_B\nOYS', 'OC_EWS_GI\nRLS', 'COLLFEE', 'Local\narea']

First few rows of the data:


,SNO,inst_code,inst_name,type,INST\n_RE\nG.,DIST,PLACE,COED,AFFLIA\n.UNIV,ESTD,...,BCC_B\nOYS,BCC_GI\nRLS,BCD_B\nOYS,BCD_GI\nRLS,BCE_B\nOYS,BCE_GI\nRLS,OC_EWS_B\nOYS,OC_EWS_GI\nRLS,COLLFEE,Local\narea
0,1.0,ACEE,ADARSH COLLEGE OF ENGINEERING,PVT,AU,EG,GOLLAPROLU,COED,JNTUK,2008,...,143031.0,143031.0,143031.0,143031.0,143031.0,143031.0,158522.0,,35000.0,
1,2.0,ACEE,ADARSH COLLEGE OF ENGINEERING,PVT,AU,EG,GOLLAPROLU,COED,JNTUK,2008,...,132938.0,139146.0,134908.0,157283.0,144114.0,144114.0,149718.0,120842.0,35000.0,
2,3.0,ACEE,ADARSH COLLEGE OF ENGINEERING,PVT,AU,EG,GOLLAPROLU,COED,JNTUK,2008,...,114459.0,114459.0,114459.0,146246.0,114459.0,114459.0,128288.0,,35000.0,
3,4.0,ACEE,ADARSH COLLEGE OF ENGINEERING,PVT,AU,EG,GOLLAPROLU,COED,JNTUK,2008,...,,,169142.0,169163.0,169163.0,169163.0,,,35000.0,
4,5.0,ACEE,ADARSH COLLEGE OF ENGINEERING,PVT,AU,EG,GOLLAPROLU,COED,JNTUK,2008,...,,169131.0,169142.0,169142.0,169163.0,169163.0,,,35000.0,
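
Several column names in the output above contain embedded line breaks (for example `'BCC_B\nOYS'`), presumably because the header text wraps inside narrow cells in the PDF. A minimal sketch of one way to tidy them follows; `clean_column_names` is a hypothetical helper, and the small stand-in frame reuses a few header names and values from the preview rather than the full 32-column file.

```python
import pandas as pd


def clean_column_names(df):
    """Hypothetical helper: remove line breaks and surrounding spaces from column names."""
    cleaned = df.copy()
    cleaned.columns = [str(name).replace("\n", "").strip() for name in df.columns]
    return cleaned


# Stand-in frame built from a few headers and values shown in the preview above.
sample = pd.DataFrame(
    [[1.0, "ACEE", 143031.0, 35000.0]],
    columns=["SNO", "inst_code", "BCC_B\nOYS", "COLLFEE"],
)
print(clean_column_names(sample).columns.tolist())
# → ['SNO', 'inst_code', 'BCC_BOYS', 'COLLFEE']
```

Note that this only strips the line breaks: `'Local_Ar\nea'` and `'Local\narea'` would become `'Local_Area'` and `'Localarea'`, so they remain two separate columns and would need to be reconciled by hand if they are meant to hold the same field.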
