Unstructured Data Format: PDF


In [1]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [5]:
import PyPDF2


def open_pdf_text(file_path):
    """Return a dict mapping each 0-based page index to that page's extracted text."""
    # Open the PDF file in binary mode; it must stay open while pages are read
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)

        # Extract text from each page
        text_dict = {}
        for page_num, page in enumerate(reader.pages):
            text_dict[page_num] = page.extract_text()
    return text_dict


text_dict_family = open_pdf_text('American-Families-and-Living-Arrangements-2022.pdf')

In [6]:
print(text_dict_family[0])

America’s Families and  
Living Arrangements: 2022 
By Paul F. Hemez, Chanell N. Washington, and Rose M. KreiderCurrent Population ReportsPopulation Characteristics
Issued May 2024P20-587
INTRODUCTION 
This report provides a demographic profile of the 
households and living arrangements of Americans, and how these have changed over time. Additionally, the COVID-19 pandemic affected the economic well-being of millions of families, as many businesses oper-
ated under reduced hours or were lost, causing many 
people to become underemployed or unemployed. Thus, a secondary goal of this report is to examine changes in the economic well-being of American families before and after the start of the COVID-19 pandemic.
This report uses data from the American Community 
Survey (ACS) and the Current Population Survey’s 
Annual Social and Economic Supplement (CPS ASEC).
1  
It capitalizes on the strengths of both datasets, using ACS data about how basic family and household char-acteristics vary ac
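
The extracted text above keeps the PDF's line layout, so some words arrive split by a hyphen at a line break (for example "oper-" followed by "ated"). A minimal cleanup sketch, assuming the only goal is to rejoin those splits and trim trailing spaces; the function name `clean_extracted_text` is hypothetical and is not part of PyPDF2:

```python
import re


def clean_extracted_text(text: str) -> str:
    """Rejoin words hyphenated across a line break and trim trailing spaces."""
    # "oper-\nated" -> "operated"; only fires when a letter precedes the hyphen
    # and a lowercase letter starts the next line.
    text = re.sub(r"([A-Za-z])-[ \t]*\n[ \t]*([a-z])", r"\1\2", text)
    # Remove spaces or tabs left dangling at the end of a line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    return text


sample = (
    "as many businesses oper-\n"
    "ated under reduced hours or were lost, causing many \n"
    "people to become underemployed"
)
cleaned = clean_extracted_text(sample)
print(cleaned)
```

This heuristic cannot tell a layout hyphen from a real one, so a compound such as "well-being" that happens to break across lines would be joined as well. Hyphens that appear inside a single line, such as "char-acteristics", are left untouched.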

In [7]:
def get_page(page_num, name):
  """Return the text of page `page_num` (0-based) from the global dict `text_dict_<name>`."""
  text_dict_name = f'text_dict_{name}'  # Build the variable name, e.g. 'text_dict_family'
  text_dict = globals().get(text_dict_name)  # Look the variable up among the notebook's globals

  if text_dict is None:
    raise ValueError(f"Dictionary with name {text_dict_name} not found")
  return text_dict[page_num]

get_page(2, 'family')

'U.S. Census Bureau  3\nTable 1.\nHouseholds by Type and Selected Characteristics: 2019\n(Numbers in thousands)\nCharacteristic\nAll  \nhouseholdsFamily households Nonfamily households\nTotal Married \ncoupleOther families\nTotal Male  \nhouseholderFemale \nhouseholderMale  \nhouseholderFemale  \nhouseholder\nAll Households . . . . . . . . . . . . . . . . . . 122,800 79,590 58,370 6,168 15,060 43,210 20,330 22,880\nAge of Householder\n15 to 24 years  .................... 4,415 1,648 675 370 604 2,766 1,405 1,361\n25 to 34 years  .................... 18,580 11,440 7,387 1,235 2,822 7,137 4,139 2,998\n35 to 44 years  .................... 20,990 16,470 11,520 1,407 3,551 4,511 2,732 1,779\n45 to 54 years  .................... 21,840 16,390 11,950 1,323 3,115 5,446 2,996 2,449\n55 to 64 years  .................... 23,990 15,670 12,410 977 2,278 8,321 3,958 4,363\n65 years and over  ................ 33,000 17,970 14,430 856 2,685 15,030 5,104 9,925\nRace and Hispanic Origin of \nHouseholder
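
The page above is a table that `extract_text()` flattened into lines of the form "label, dot leaders, numbers". A rough sketch of how such rows could be turned back into structured values, assuming every data row uses at least two leader dots and comma-grouped integers; `parse_table_row` and `ROW_PATTERN` are hypothetical names introduced for illustration:

```python
import re

# label, then a run of leader dots (possibly space-separated), then numbers
ROW_PATTERN = re.compile(r"^(?P<label>.*?)\s*(?:\.\s*){2,}(?P<values>[\d,\s]+)$")


def parse_table_row(line):
    """Return (label, [ints]) for a data row, or None for headers and other lines."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    values = [int(v.replace(",", "")) for v in match.group("values").split()]
    return match.group("label").strip(), values


rows = [
    "Age of Householder",
    "All Households . . . . . . . . . . . . . . . . . . 122,800 79,590 58,370 6,168 15,060 43,210 20,330 22,880",
    "15 to 24 years  .................... 4,415 1,648 675 370 604 2,766 1,405 1,361",
]
for row in rows:
    print(parse_table_row(row))
```

The column headers are not recovered here, because the extraction fused them together ("householdsFamily households"); mapping each number to its column would still need the header order from the original table.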

LangChain Framework with Gemini Model

In [22]:
!pip install langchain-community faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0


Extract text from the PDF document using LangChain's PDF loader or a custom PDF parser. Split it into chunks for efficient processing.



In [19]:
!pip install pypdf
from langchain_community.document_loaders import PyPDFLoader

# Load and split the PDF document
loader = PyPDFLoader('American-Families-and-Living-Arrangements-2022.pdf')
documents = loader.load_and_split()
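
To make the idea of chunking concrete, here is a simplified sketch of a fixed-size character splitter with overlap. It is not the algorithm `load_and_split()` uses (LangChain's splitter also looks for natural separators such as paragraph breaks); the function name `chunk_text` and its default sizes are assumptions chosen for illustration:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.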



Use **FAISS** to store vectorized text chunks for fast retrieval during bias evaluation.

In [23]:
!pip install sentence-transformers
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Use embeddings (you can also try OpenAI embeddings or sentence-transformers)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Store embeddings in FAISS vectorstore
vectorstore = FAISS.from_documents(documents, embeddings)
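
Conceptually, retrieval from the vector store means finding the stored vectors closest to the query vector. The sketch below does this by brute force in plain Python, assuming Euclidean (L2) distance; it only illustrates the idea and does not use or mirror the FAISS API. The names `squared_l2` and `nearest_chunks` and the toy 2-dimensional vectors are made up for the example (the real embeddings have hundreds of dimensions):

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_chunks(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors closest to query_vec."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: squared_l2(query_vec, chunk_vecs[i]),
    )
    return ranked[:k]


toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
print(nearest_chunks([1.0, 0.0], toy_vectors, k=2))
```

A vector index exists so that this ranking does not require comparing the query against every stored vector once the collection grows large.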




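Gemini API key: do not paste the key into the notebook. Store it in Colab's Secrets panel under the name GOOGLE_API_KEY, which is where the setup cell below reads it from.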

In [9]:
!pip install -q -U google-generativeai

Set up Gemini for bias evaluation

In [10]:
# Import the Python SDK
import google.generativeai as genai
# Used to securely store your API key
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)
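
`google.colab.userdata` is only available inside Colab. Outside Colab, one possible fallback is to read the key from an environment variable. This is a sketch under that assumption; the helper name `load_api_key` is hypothetical, and the `environ` parameter exists only so the function can be exercised without touching the real environment:

```python
import os


def load_api_key(env_var="GOOGLE_API_KEY", environ=None):
    """Return the API key from the environment, or raise if it is missing."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running")
    return key


# Demonstration with a placeholder mapping, not a real key
demo_key = load_api_key(environ={"GOOGLE_API_KEY": "  example-not-a-real-key  "})
print(demo_key)
```

In a real run the call would be `load_api_key()` with no arguments, and the returned value would be passed to `genai.configure(api_key=...)`.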

In [11]:
model = genai.GenerativeModel('gemini-pro')

In [32]:
!pip install langchain-google-genai
from langchain_google_genai import ChatGoogleGenerativeAI

# Set up Gemini as the chat model used by the RAG chain
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro-latest", google_api_key=GOOGLE_API_KEY)
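
The retrieval chain built in the next cell combines two steps: fetch the chunks most relevant to the question, then place them in the prompt sent to the model. The sketch below shows only the second step in plain Python. The template wording and the function name `build_stuffed_prompt` are assumptions for illustration and are not LangChain's actual prompt:

```python
def build_stuffed_prompt(question, chunks):
    """Place the retrieved chunks ahead of the question in a single prompt string."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_prompt = build_stuffed_prompt(
    "How did household types change?",
    ["  Nonfamily households increased. ", "Family households include married couples."],
)
print(example_prompt)
```

Because the model only sees what is retrieved, a question about the document as a whole (such as overall bias) is answered from a handful of chunks rather than from the full text.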



In [34]:
from langchain.chains import RetrievalQA

# Create the RAG chain with Gemini and FAISS retriever
rag_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

# Ask questions to evaluate bias
prompt = "Is there any bias in the language regarding race or gender in this document? Can you also help me explain the graph?"
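# invoke() supersedes the deprecated run(); the answer text is under the "result" key
result = rag_chain.invoke({"query": prompt})["result"]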

print(result)


I can't determine if there is bias in the document's language regarding race or gender. I can only process the information literally and don't have the capacity to analyze for subjective bias. 

However, I can help you understand the graph! 

**Figure 1: Households by Type: 1970 to 2022**

This graph shows how the types of households in the US have changed from 1970 to 2022.  Here's a breakdown:

**Two Main Categories:**

* **Nonfamily Households:** This includes people living alone (either men or women) and other nonfamily arrangements (like roommates).
* **Family Households:**  This includes married couples (with and without children) and other family households (like single parents or grandparents raising grandchildren).

**Key Trends:**

* **Rise of Nonfamily Households:** The biggest trend is the significant increase in nonfamily households. They made up 19% of households in 1970 but over 36% by 2022.
* **Women Living Alone:** The largest percentage of nonfamily households in both