# <font color='blue'>**PDF Data Extraction Project**</font>

In this project, I am writing a **Python** program to automate the extraction of student information from multiple PDF files and save it into an **Excel** file. The PDFs contain various student-related fields like:

- **Date of Admission**
- **Roll Number**
- **Admission Number**
- **Name of Student**
- **Father’s Name**
- **Mother’s Name**
- **Address**
- **Mobile No**

The goal is to process these files and organize the extracted data in a structured format that can be saved and reviewed efficiently.

## <font color='green'>**Libraries Used**</font>

I use the following libraries for the implementation:

- **`os`**: To interact with the file system and read the contents of the folder containing the PDF files.
- **`pdfplumber`**: This is used to extract text from the PDF documents. It allows us to navigate through the pages of each PDF and extract the required information.
- **`pandas`**: Used for organizing and manipulating the data. Once the information is extracted, it is stored in a `pandas` DataFrame, which is later exported to an Excel file.

## <font color='purple'>**Steps Involved**</font>

### <font size='4' color='red'>**1. Extracting Information from PDFs**</font>

For each PDF, I extract specific pieces of information like the **Date of Admission**, **Roll Number**, **Admission Number**, etc. This is done using the **`pdfplumber`** library. The text from each page is split by lines, and I search for keywords (like _'Date of Admission'_ or _'Roll Number'_) to extract the corresponding values. I also ensure that unnecessary data (such as the "Social Category" mentioned after the mother’s name) is removed.
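The keyword lookup described above can be sketched as a small helper. This is a minimal illustration, not the exact code used later in the notebook: the function name `value_after`, its `stop_at` parameter, and the sample line are all hypothetical, and I assume each label and its value sit on the same line.

```python
from typing import Optional


def value_after(line: str, label: str, stop_at: Optional[str] = None) -> Optional[str]:
    """Return the text that follows `label` on `line`, or None if the label is absent."""
    _, found, rest = line.partition(label)
    if not found:
        return None
    if stop_at:
        # Cut off a trailing field that shares the same line (e.g. "Social Category").
        rest = rest.partition(stop_at)[0]
    # Remove separators and whitespace left around the value.
    return rest.strip(" :\t")


# Hypothetical sample line; real PDFs may lay the fields out differently.
sample_line = "Mother’s Name: Sita Devi Social Category: General"
mother = value_after(sample_line, "Mother’s Name", stop_at="Social Category")
```

Here `str.partition` splits on the first occurrence of the label, whereas `split(label)[-1]` keeps the text after the last occurrence. The two agree whenever the label appears only once on the line.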

### <font size='4' color='red'>**2. Organizing Data in a DataFrame**</font>

After extracting the data, I store it in a **`pandas` DataFrame**. This helps in organizing the information row by row, with each row representing one student’s information, and each column representing a specific data field like **Name of Student** or **Mobile No**.
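As a rough sketch of this step, a list of per-student dictionaries can be passed straight to `pd.DataFrame`. The records and the shortened column list below are made-up placeholders for illustration only.

```python
import pandas as pd

# A subset of the fields, chosen only to keep the example short.
COLUMNS = ["Name of Student", "Roll Number", "Mobile No"]

# Hypothetical records; the second one has no mobile number.
records = [
    {"Name of Student": "Student A", "Roll Number": "12", "Mobile No": "9000000001"},
    {"Name of Student": "Student B", "Roll Number": "13"},
]

# `columns=` fixes the column order; keys missing from a record become NaN.
df = pd.DataFrame(records, columns=COLUMNS)
```

Passing `columns=` explicitly keeps the column order stable even if the first record happens to lack a field.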

### <font size='4' color='red'>**3. Handling Empty Rows**</font>

Sometimes the extraction produces rows in which every field is empty, for example when a page contains none of the expected keywords. I remove these rows so that the final Excel file has no blank rows between student records.
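A minimal sketch of this cleanup, using made-up rows: `dropna(how="all")` removes a row only when every field is missing. Empty strings do not count as missing in pandas, so in this sketch I convert them to `NaN` first; whether that extra step is needed depends on what the extraction returns.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one real record, one all-None row, one all-empty-string row.
df = pd.DataFrame([
    {"Name of Student": "Student A", "Roll Number": "12"},
    {"Name of Student": None, "Roll Number": None},
    {"Name of Student": "", "Roll Number": ""},
])

# Treat empty strings as missing, drop rows where every field is missing,
# then renumber the remaining rows from 0.
cleaned = df.replace("", np.nan).dropna(how="all").reset_index(drop=True)
```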

### <font size='4' color='red'>**4. Saving Data to an Excel File**</font>

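Finally, once all the data has been extracted and organized, I save it to an **Excel** file using the **`pandas`** **`DataFrame.to_excel()`** method with `index=False`, so the row index is not written as an extra column. Writing `.xlsx` files this way relies on the `openpyxl` package. The resulting file holds one row per extracted record and one column per field, which makes it easy to review and analyze.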
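The export can be sketched as a write followed by a read-back check. This assumes `openpyxl` is installed; the file name `students.xlsx`, the sheet name, and the record are placeholders.

```python
import os
import tempfile

import pandas as pd

# Hypothetical single-record DataFrame.
df = pd.DataFrame([{"Name of Student": "Student A", "Roll Number": "12"}])

# Write into a temporary folder so the sketch does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "students.xlsx")
df.to_excel(out_path, index=False, sheet_name="Students")

# Read the file back as text to confirm what was written.
check = pd.read_excel(out_path, sheet_name="Students", dtype=str)
```

Reading with `dtype=str` keeps values such as roll numbers and mobile numbers as text instead of converting them to numbers.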

### <font size='4' color='red'>**5. Managing Multiple PDF Files**</font>

I loop through all PDF files in a specific folder. For each file, I open it, extract the text from each page, and apply the extraction logic to retrieve the required data. Because the loop picks up every PDF in the folder, new files can be processed without changing the code or intervening manually.
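The folder scan on its own might look like the sketch below. The helper name `iter_pdf_paths` is hypothetical, and the demo uses empty placeholder files in a temporary folder rather than real PDFs.

```python
import tempfile
from pathlib import Path


def iter_pdf_paths(folder):
    """Yield the PDF files in `folder`, matching the .pdf extension case-insensitively."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path


# Demo on a throwaway folder containing empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "B.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    found = sorted(p.name for p in iter_pdf_paths(tmp))
```

Each yielded path could then be passed to `pdfplumber.open()` in the same way the main loop does with `os.path.join(folder_path, filename)`.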

## <font color='blue'>**Conclusion**</font>

This project demonstrates how **Python** can be used to automate the extraction of structured data from PDFs, organize it using **pandas**, and save it into a format that is easy to analyze (like **Excel**). By automating this process, I am able to efficiently handle large sets of documents and reduce manual effort significantly.


In [9]:
pip install PyPDF2 pandas openpyxl pdfplumber


Note: you may need to restart the kernel to use updated packages.


In [10]:
import os
import pdfplumber
import pandas as pd

# Function to extract data from each PDF page
def extract_student_info(text):
    lines = text.split('\n')  # Split text into lines
    student_info = {
        'Date of Admission': None,
        'Roll Number': None,
        'Admission Number': None,
        'Name of Student': None,
        'Father’s Name': None,
        'Mother’s Name': None,
        'Address': None,
        'Mobile No': None
    }
    
    # Go through each line and find the relevant information
    for line in lines:
        line = line.strip()  # Remove leading and trailing spaces
        
        if 'Date of Admission' in line:
            student_info['Date of Admission'] = line.split('Date of Admission')[-1].strip()
        elif 'Roll Number' in line:
            roll_number = line.split('Roll Number')[-1].strip()
            if roll_number and roll_number[0].isdigit():  # Accept only values that start with a digit
                student_info['Roll Number'] = roll_number.split()[0]
        elif 'Admission Number' in line:
            student_info['Admission Number'] = line.split('Admission Number')[-1].strip()
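        elif 'Name of the Student' in line or 'Name of Student' in line:
            # Split on whichever label variant is actually present in the line
            label = 'Name of the Student' if 'Name of the Student' in line else 'Name of Student'
            student_info['Name of Student'] = line.split(label)[-1].split('Date of Birth')[0].strip()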
        elif 'Father’s Name' in line:
            student_info['Father’s Name'] = line.split('Father’s Name')[-1].strip()
        elif 'Mother’s Name' in line:
            mother_name = line.split('Mother’s Name')[-1].strip()
            if 'Social Category' in mother_name:
                mother_name = mother_name.split('Social Category')[0].strip()
            student_info['Mother’s Name'] = mother_name
        elif 'Address' in line:
            student_info['Address'] = line.split('Address')[-1].strip()
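        elif 'Mobile Number (Student/ Parent)' in line or 'Mobile Number (Student/Parent)' in line:
            # Split on whichever label variant is present and keep only the first token
            label = ('Mobile Number (Student/ Parent)' if 'Mobile Number (Student/ Parent)' in line
                     else 'Mobile Number (Student/Parent)')
            tokens = line.split(label)[-1].split()
            student_info['Mobile No'] = tokens[0] if tokens else None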

    return student_info

folder_path = "1"  # Folder containing PDF files

# List to hold all the student data
all_student_data = []

# Loop through each file in the folder
for filename in os.listdir(folder_path):
    if filename.lower().endswith('.pdf'):  # Also match upper-case extensions such as .PDF
        file_path = os.path.join(folder_path, filename)
        with pdfplumber.open(file_path) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    student_info = extract_student_info(text)
                    if any(student_info.values()):
                        all_student_data.append(student_info)

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(all_student_data)

# Remove rows where all fields are empty
df.dropna(how='all', inplace=True)

# Save the DataFrame to an Excel file without writing the index column
output_excel = "Book1.xlsx"
df.to_excel(output_excel, index=False)

print(f"Data saved to {output_excel}")


Data saved to Book1.xlsx
