# Part A: Extracting Information from a CSV File

## Introduction
In this part of the assignment, we extract specific fields from a CSV file of short stories. The file contains empty columns whose header labels name the fields to extract with regular expressions.

## Methodology
We read the CSV file into a pandas DataFrame and apply one regular expression per target column to the 'Description' column, extracting first names, last names, company names, addresses, and the other fields named in the headers. Each pattern has a single capture group: the first match fills the corresponding empty column, and rows with no match receive 'N/A'.
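As a minimal sketch of this approach, the snippet below applies two capture-group patterns to a small in-memory DataFrame with `Series.str.extract`. The sample sentences and the `extract_column` helper are made up for illustration and are not taken from the assignment data.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the assignment CSV.
sample = pd.DataFrame({
    'Description': [
        'Jane Doe: Jane Doe lives in Travis County and can be reached at 512-555-0100.',
        'No structured details in this one.',
    ]
})

def extract_column(frame, pattern, source='Description'):
    """Return the first capture group of `pattern` for each row, or 'N/A'."""
    extracted = frame[source].str.extract(pattern, expand=False)
    return extracted.fillna('N/A')

sample['county'] = extract_column(sample, r'\b([A-Z][a-z]+) County\b')
sample['phone1'] = extract_column(sample, r'(\d{3}-\d{3}-\d{4})')
print(sample[['county', 'phone1']])
```

`str.extract` returns a missing value where the pattern does not match, so a single `fillna('N/A')` covers the no-match case without searching each string twice.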

In [4]:
import pandas as pd
import re

# Load the CSV file
csv_path = 'Assignment0PartA.csv'
data = pd.read_csv(csv_path)

# Updated regular expressions as provided
regex_patterns = {
    'first_name': r'([A-Z][a-z]+)\s[A-Z][a-z]+:',  # Extracting first name
    'last_name': r'[A-Z][a-z]+\s([A-Z][a-z]+):',  # Extracting last name
    'company_name': r'(?:of|from|at)\s+((?:[A-Z][a-z]*,?\s*)+(?:\s+Jr|Esq|Cpa|Service|Dimensions))',  # Extracting company name
    'address': r'(?:at|in|from) (\d+.*?\b(?:St|ZIP|Blvd)\b)',  # Extracting address
    'city': r'\b(?:in|hub of|streets of) ([A-Z][a-z]*(?:\s[A-Z][a-z]+)*)',  # Extracting city
    'county': r'\b([A-Z][a-z]+) County\b',  # Extracting county
    'state': r'\b(A[LKZRAEP]|C[AOT]|DE|FL|GA|HI|I[DLNA]|K[SY]|LA|M[EDAINSOT]|N[EVHJMYCD]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[TA]|W[AVIY])\b',  # Extracting state
    'phone1': r'(\d+-\d+-\d+)',  # Extracting first phone number
    'phone2': r'(\d{3}-\d{3}-\d{4})(?!.*\d{3}-\d{3}-\d{4})',  # Extracting second phone number
    'email': r'(\S+@\S+)'  # Extracting email
}

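# Apply each regular expression and keep the first capture group ('N/A' if no match)
def extract_first(pattern, text):
    match = re.search(pattern, str(text))  # search once; str() guards against NaN cells
    return match.group(1) if match else 'N/A'

for column, pattern in regex_patterns.items():
    data[column] = data['Description'].apply(lambda text: extract_first(pattern, text))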

# Save the updated DataFrame to the same CSV file
data.to_csv(csv_path, index=False)

# Display the first few rows of the updated DataFrame
data.head()

| Unnamed: 0 | first_name | last_name | company_name | address | city | county | state | phone1 | phone2 | email | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | James | Butt | Benton, John B Jr | 6649 N Blue Gum St | New Orleans | Orleans | LA | 504-621-8927 | 504-845-1427 | jbutt@gmail.com | James Butt: An avid historian, James Butt from... |
| 1 | Josephine | Darakjy | Chanay, Jeffrey A Esq | 4 B Blue Ridge Blvd | Brighton | Livingston | MI | 810-292-9388 | 810-374-9840 | josephine_darakjy@darakjy.org. | Josephine Darakjy: Amidst the jazz-filled stre... |
| 2 | Art | Venere | Chemel, James L Cpa | 8014 ZIP | Bridgeport | Gloucester | | 856-636-8749 | 856-264-4130 | | Art Venere: Art Venere, a nature enthusiast at... |
| 3 | Lenna | Paprocki | Feltz Printing Service | 639 Main St | Anchorage | | AK | 907-385-4412 | 907-921-2010 | lpaprocki@hotmail.com, | Lenna Paprocki: While renovating their office ... |
| 4 | Donette | Foller | Printing Dimensions | 34 Center St | Hamilton | | | | | | Donette Foller: In the tech hub of Hamilton, D... |
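
The `phone2` pattern differs from `phone1` only by its negative lookahead. The sketch below shows the effect on a hypothetical single-line sentence: the lookahead rejects any phone number that is followed by another one later on the same line, so only the last number can match.

```python
import re

# Hypothetical single-line description containing two phone numbers.
text = 'Call 504-621-8927 during the day or 504-845-1427 after hours.'

first_phone = r'(\d{3}-\d{3}-\d{4})'
# The negative lookahead fails whenever another phone number appears later
# on the same line, which leaves only the last number as a valid match.
last_phone = r'(\d{3}-\d{3}-\d{4})(?!.*\d{3}-\d{3}-\d{4})'

phone1 = re.search(first_phone, text).group(1)
phone2 = re.search(last_phone, text).group(1)
print(phone1, phone2)

# With a single number in the text, both patterns return that same number.
single = 'Call 504-621-8927.'
only_phone = re.search(last_phone, single).group(1)
```

Two consequences follow from this design: a description with only one phone number yields the same value in both columns, and because `.` does not match newlines by default, the lookahead only inspects the remainder of the current line.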


## Results and Observations
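The table above shows the first five rows of the DataFrame after applying the regular expressions. Most columns are populated as intended, but the output also exposes a few weaknesses. Several cells are empty (for example, the state, phone, and email fields in row 4), the address in row 2 is captured as `8014 ZIP`, which looks like a ZIP fragment rather than a street address, and two email values carry trailing punctuation (`josephine_darakjy@darakjy.org.` and `lpaprocki@hotmail.com,`).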
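
Two email values in the output above end in punctuation (`josephine_darakjy@darakjy.org.` and `lpaprocki@hotmail.com,`) because `\S+@\S+` consumes every non-whitespace character after the `@`. One possible tightening is sketched below. The `strict_email` pattern is an assumption for illustration, not the pattern supplied with the assignment, and the sample sentences are hypothetical.

```python
import re

# Pattern used in the extraction step: any non-whitespace run around '@'.
loose_email = r'(\S+@\S+)'
# Assumed stricter alternative: the match must end in a dot followed by
# letters, so a trailing period or comma is left out.
strict_email = r'([\w.+-]+@[\w.-]+\.[A-Za-z]{2,})'

# Hypothetical sentences in which punctuation directly follows the address.
text_period = 'You can write to josephine_darakjy@darakjy.org.'
text_comma = 'Reach her at lpaprocki@hotmail.com, or by phone.'

loose_result = re.search(loose_email, text_period).group(1)
strict_period = re.search(strict_email, text_period).group(1)
strict_comma = re.search(strict_email, text_comma).group(1)
print(loose_result, strict_period, strict_comma)
```

This pattern is deliberately simple and does not cover every valid address form. It only addresses the trailing-punctuation problem visible in the output.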

# Part B: Analyzing Texts for Health Condition Mentions

In [6]:
import pandas as pd
import re

# Define the regular expressions for each health condition
health_conditions = {
    'Heart disease': r'Heart disease',
    'Cancer': r'Cancer',
    'Stroke': r'Stroke',
    'Respiratory diseases': r'Respiratory diseases',
    "Alzheimer's disease": r"Alzheimer's disease",
    'Diabetes': r'Diabetes',
    'Influenza and Pneumonia': r'Influenza and Pneumonia',
    'Kidney diseases': r'Kidney diseases',
    'Septicemia': r'Septicemia',
    'Liver disease': r'Liver disease',
    'Hypertension': r'Hypertension',
    "Parkinson's disease": r"Parkinson's disease",
    'Chronic lower respiratory disease': r'Chronic lower respiratory disease',
    'Accidents/injuries': r'Accidents/injuries',
    'Osteoporosis': r'Osteoporosis',
    'Asthma': r'Asthma',
    'Depression': r'Depression',
    'Oral health issues': r'Oral health issues',
    'HIV/AIDS': r'HIV/AIDS',
    'Tuberculosis': r'Tuberculosis',
    'Malaria': r'Malaria',
    'Dengue fever': r'Dengue fever',
    'Hepatitis': r'Hepatitis',
    'Epilepsy': r'Epilepsy',
    'Multiple sclerosis': r'Multiple sclerosis'
}

# Initialize a DataFrame to store the frequencies
df = pd.DataFrame(columns=health_conditions.keys())

# Process each volume
for i in range(1, 6):  # Assuming 5 volumes
    with open(f'{i}.txt', 'r', encoding='utf-8') as file:  # Replace with the actual file path
        text = file.read()

        # Count occurrences of each health condition
        counts = {condition: len(re.findall(pattern, text, re.IGNORECASE)) for condition, pattern in health_conditions.items()}
        
        # Add the counts to the DataFrame
        df.loc[f'Volume {i}'] = counts

# Save the DataFrame to a CSV file
df.to_csv('health_conditions_frequency.csv')

# Display the DataFrame
df.head()

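| Unnamed: 0 | Heart disease | Cancer | Stroke | Respiratory diseases | Alzheimer's disease | Diabetes | Influenza and Pneumonia | Kidney diseases | Septicemia | Liver disease | ... | Asthma | Depression | Oral health issues | HIV/AIDS | Tuberculosis | Malaria | Dengue fever | Hepatitis | Epilepsy | Multiple sclerosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Volume 1 | 4 | 97 | 9 | 1 | 0 | 9 | 0 | 0 | 0 | 0 | ... | 6 | 41 | 0 | 0 | 125 | 269 | 0 | 0 | 9 | 0 |
| Volume 2 | 23 | 1335 | 10 | 0 | 0 | 227 | 0 | 0 | 0 | 4 | ... | 10 | 93 | 0 | 0 | 130 | 142 | 0 | 71 | 12 | 0 |
| Volume 3 | 67 | 224 | 12 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | ... | 383 | 73 | 0 | 0 | 192 | 83 | 0 | 2 | 4 | 0 |
| Volume 4 | 15 | 116 | 3 | 0 | 0 | 47 | 0 | 1 | 0 | 2 | ... | 1 | 27 | 0 | 0 | 37 | 70 | 0 | 2 | 35 | 0 |
| Volume 5 | 10 | 41 | 97 | 0 | 0 | 33 | 0 | 0 | 0 | 0 | ... | 15 | 124 | 0 | 0 | 36 | 57 | 0 | 17 | 530 | 5 |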
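
The counts above come from plain substring matching with `re.IGNORECASE`, so a pattern such as `Stroke` also matches inside longer words like "strokes" or "keystroke". Whether that is desirable depends on the analysis. As a hedged alternative, the sketch below escapes each term with `re.escape` and optionally wraps it in `\b` word boundaries. The `count_terms` helper and the sample text are illustrative assumptions, not part of the original notebook.

```python
import re

def count_terms(text, terms, whole_word=True):
    """Count case-insensitive occurrences of each term in `text`."""
    counts = {}
    for term in terms:
        pattern = re.escape(term)  # treat the term as literal text
        if whole_word:
            pattern = rf'\b{pattern}\b'  # do not match inside longer words
        counts[term] = len(re.findall(pattern, text, re.IGNORECASE))
    return counts

# Hypothetical snippet; the real input is the text of each volume.
sample_text = 'A stroke is sudden. Brush strokes are not a stroke. HIV/AIDS and asthma.'
terms = ['Stroke', 'HIV/AIDS', 'Asthma']

substring_counts = count_terms(sample_text, terms, whole_word=False)
whole_word_counts = count_terms(sample_text, terms, whole_word=True)
print(substring_counts)
print(whole_word_counts)
```

Whole-word matching also excludes plurals such as "strokes" or "cancers", so it generally lowers the counts. The choice between the two modes should follow from what counts as a mention in the analysis.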
