# Independent Lab: Manipulating Data

**Intro to Python**  
**Manipulating Data**  
**Cody Thompson**  
**Date:** 4/14/2025

Welcome to my notebook for the Manipulating Data lab! In this notebook, I will be working with two datasets: `CaliforniaHospitalData.csv` and `CaliforniaHospitalData_Personnel.txt`. My task is to pre-process and clean the data, merge the two datasets, filter and rename columns, and perform various data manipulations to prepare the data for analysis by the Business Intelligence team.


In [8]:
# Import necessary libraries
import os
import pandas as pd
import numpy as np

# Set Working Directory
os.chdir('C:\\Users\\cthom\\Downloads\\BGEN 632 Intro to Python\\GitHub_Repos\\Week 7\\week7labs\\data')

# Verify the current working directory
print("Current Working Directory:", os.getcwd())

Current Working Directory: C:\Users\cthom\Downloads\BGEN 632 Intro to Python\GitHub_Repos\Week 7\week7labs\data


In [14]:
# Load the hospital data into a DataFrame
hospital_df = pd.read_csv('CaliforniaHospitalData.csv')

# Load the personnel data into a DataFrame
personnel_df = pd.read_csv('CaliforniaHospitalData_Personnel.txt', delimiter='\t')


In [16]:
# Display the first few rows of the hospital DataFrame to verify it loaded correctly
hospital_df.head()

Unnamed: 0,HospitalID,Name,Zip,Website,TypeControl,Teaching,DonorType,NoFTE,NetPatRev,InOperExp,OutOperExp,OperRev,OperInc,AvlBeds
0,45740,Mammoth Hospital,93546-0660,www.mammothhospital.com,District,Small/Rural,Charity,327.0,135520.2186,20523425.53,34916220.47,49933713,-5505933,15
1,12145,Victor Valley Community Hospital,92392,www.vvch.org,Non Profit,Small/Rural,Charity,345.0,136156.6913,33447542.78,20348596.22,53351748,-444391,99
2,25667,Pioneers Memorial Hospital,92227,www.pmhd.org,District,Small/Rural,Charity,601.2,197094.2541,37254178.67,37832448.33,72933707,-2152920,107
3,46996,Ridgecrest Regional Hospital,93555,www.rrh.org,Non Profit,Small/Rural,Charity,400.0,139170.3798,23385570.1,24661355.9,51087341,3040415,55
4,37393,Barstow Community Hospital,92311,www.barstowhospital.com,Investor,Small/Rural,Charity,262.0,116797.8306,13684502.49,15159986.51,42845642,14001153,42


In [17]:
# Display the first few rows of the personnel DataFrame to verify it loaded correctly
personnel_df.head()

Unnamed: 0,HospitalID,Work_ID,LastName,FirstName,Gender,PositionID,PositionTitle,Compensation,MaxTerm,StartDate,Phone,Email
0,35665,351131,Cherukuri,Dileep,M,4,Safety Inspection Member,23987,2,1/1/2019,405-564-5580,dileep.cherukuri@okstate.edu
1,12145,756481,Rodriguez,Jose,M,1,Regional Representative,46978,4,1/1/2009,405-744-2238,jose.rodriguez@edihealth.com
2,45771,756481,Rodriguez,Jose,M,1,Regional Representative,46978,4,1/1/2011,405-744-2238,jose.rodriguez@edihealth.com
3,43353,756481,Rodriguez,Jose,M,4,Safety Inspection Member,23987,2,1/1/2011,405-744-2238,jose.rodriguez@edihealth.com
4,17718,811240,Charles,Kenneth,M,1,Regional Representative,46978,4,1/1/2009,405-744-3412,kenneth.charles@edihealth.com


In [51]:
# Merge the hospital and personnel data on the 'HospitalID' column
merged_df = pd.merge(hospital_df, personnel_df, on='HospitalID', how='inner')

# Remove unnecessary columns
merged_df.drop(columns=['Work_ID', 'PositionID', 'Website'], inplace=True)

# Display the first few rows of the merged DataFrame
merged_df.head()


Unnamed: 0,HospitalID,Name,Zip,TypeControl,Teaching,DonorType,NoFTE,NetPatRev,InOperExp,OutOperExp,...,AvlBeds,LastName,FirstName,Gender,PositionTitle,Compensation,MaxTerm,StartDate,Phone,Email
0,45740,Mammoth Hospital,93546-0660,District,Small/Rural,Charity,327.0,135520.2186,20523425.53,34916220.47,...,15,Web,David,M,Safety Inspection Member,23987,2,1/1/2012,785-532-2452,david.web@comenitymed.com
1,12145,Victor Valley Community Hospital,92392,Non Profit,Small/Rural,Charity,345.0,136156.6913,33447542.78,20348596.22,...,99,Rodriguez,Jose,M,Regional Representative,46978,4,1/1/2009,405-744-2238,jose.rodriguez@edihealth.com
2,25667,Pioneers Memorial Hospital,92227,District,Small/Rural,Charity,601.2,197094.2541,37254178.67,37832448.33,...,107,Adamson,David,M,Regional Representative,46978,4,1/1/2012,785-532-7573,david.adamson@txbiomed.net
3,46996,Ridgecrest Regional Hospital,93555,Non Profit,Small/Rural,Charity,400.0,139170.3798,23385570.1,24661355.9,...,55,Roberts,Melissa,F,Safety Inspection Member,23987,2,1/1/2009,785-532-9779,melissa.roberts@txbiomed.net
4,37393,Barstow Community Hospital,92311,Investor,Small/Rural,Charity,262.0,116797.8306,13684502.49,15159986.51,...,42,Iwata,Akira,M,Regional Representative,46978,4,1/1/2011,801-611-9161,akira.iwata@hsu.edu
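
An inner merge keeps only the `HospitalID` values present in both files, so unmatched rows disappear without any warning. A minimal sketch of how the join could be checked with the `indicator` and `validate` options of `pd.merge`; the two small frames and their values are made up for illustration and are not the lab data:

```python
import pandas as pd

# Toy stand-ins for hospital_df and personnel_df (values are made up)
hospitals = pd.DataFrame({
    'HospitalID': [1, 2, 3],
    'Name': ['A', 'B', 'C'],
})
personnel = pd.DataFrame({
    'HospitalID': [1, 2, 4],
    'LastName': ['Lee', 'Diaz', 'Kim'],
})

# An outer merge with indicator=True adds a '_merge' column saying where each row came from.
# validate='one_to_many' raises MergeError if HospitalID is duplicated on the hospital side.
check = pd.merge(hospitals, personnel, on='HospitalID', how='outer',
                 indicator=True, validate='one_to_many')

# Count matched vs. unmatched rows
match_counts = check['_merge'].value_counts()
print(match_counts)

# Rows that an inner merge would have dropped
unmatched = check[check['_merge'] != 'both']
print(unmatched)
```

Running the same check on the real frames before switching back to `how='inner'` would show how many hospitals or personnel rows the merge leaves out.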


In [53]:
# Exporting the Data
# Ensure the 'data' directory exists (created relative to the current working directory)
os.makedirs('data', exist_ok=True)

# Filter the merged data based on the specified conditions
filtered_df = merged_df[(merged_df['Teaching'] == 'Small/Rural') &  # Small/Rural hospitals
                        (merged_df['AvlBeds'] >= 15) &  # At least 15 available beds
                        (merged_df['OperInc'] >= 0)]  # Non-negative operating income

# Export the filtered data to a tab-delimited file
filtered_df.to_csv('data/hospital_data_new.txt', sep='\t', index=False)

# Report that the export step completed (to_csv raises an error if the write fails)
print('hospital_data_new.txt exported successfully!')

# Display the first few rows to confirm the data
filtered_df.head()


hospital_data_new.txt exported successfully!


Unnamed: 0,HospitalID,Name,Zip,TypeControl,Teaching,DonorType,NoFTE,NetPatRev,InOperExp,OutOperExp,...,AvlBeds,LastName,FirstName,Gender,PositionTitle,Compensation,MaxTerm,StartDate,Phone,Email
3,46996,Ridgecrest Regional Hospital,93555,Non Profit,Small/Rural,Charity,400.0,139170.3798,23385570.0,24661355.9,...,55,Roberts,Melissa,F,Safety Inspection Member,23987,2,1/1/2009,785-532-9779,melissa.roberts@txbiomed.net
4,37393,Barstow Community Hospital,92311,Investor,Small/Rural,Charity,262.0,116797.8306,13684500.0,15159986.51,...,42,Iwata,Akira,M,Regional Representative,46978,4,1/1/2011,801-611-9161,akira.iwata@hsu.edu
5,17741,St. Elizabeth Community Hospital,96080,Non Profit,Small/Rural,Charity,397.5,232503.0191,36682890.0,36739260.3,...,66,Marlin,Bill,M,Safety Inspection Member,23987,2,1/1/2011,503-645-7508,bill.marlin@larcmed.com
6,20277,Ukiah Valley Medical Center,95482,Non Profit,Small/Rural,Charity,503.5,214516.4481,32709220.0,43571851.35,...,65,Johanson,Sandy,F,Regional Representative,46978,4,1/1/2012,801-216-4821,sandy.johanson@ihc.com
8,29823,Colusa Regional Medical Center,95932-2954,Non Profit,Small/Rural,Charity,168.0,51726.4918,9022366.0,10402509.55,...,48,Tanner,Patricia,F,Acting Director,248904,8,1/1/2009,801-687-7877,patricia.tanner@prohealth.net
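
The printed message shows that `to_csv` ran, but not what ended up on disk. A small sketch of a round-trip check, assuming a temporary directory and made-up rows rather than the lab data:

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for merged_df
df = pd.DataFrame({
    'Teaching': ['Small/Rural', 'Teaching', 'Small/Rural'],
    'AvlBeds': [15, 200, 10],
    'OperInc': [5000, 100, 7000],
})

# Same three conditions as the cell above
subset = df[(df['Teaching'] == 'Small/Rural') &
            (df['AvlBeds'] >= 15) &
            (df['OperInc'] >= 0)]

# Write to a temporary folder, read the file back, and compare with the original
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'hospital_data_new.txt')
    subset.to_csv(out_path, sep='\t', index=False)
    reloaded = pd.read_csv(out_path, sep='\t')

print(reloaded.shape == subset.shape)
print(list(reloaded.columns) == list(subset.columns))
```

Comparing shape and column names is a light check; it does not compare individual values.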


In [54]:
# Load the newly created file into a new DataFrame
filtered_df = pd.read_csv('data/hospital_data_new.txt', sep='\t')

# Rename the specified columns
filtered_df.rename(columns={
    'NoFTE': 'FullTimeCount',
    'NetPatRev': 'NetPatientRevenue',
    'InOperExp': 'InpatientOperExp',
    'OutOperExp': 'OutpatientOperExp',
    'OperRev': 'Operating_Revenue',
    'OperInc': 'Operating_Income'
}, inplace=True)

# Display the first 5 rows to confirm the renaming
print(filtered_df.head(5))


   HospitalID                              Name         Zip TypeControl  \
0       46996      Ridgecrest Regional Hospital       93555  Non Profit   
1       37393        Barstow Community Hospital       92311    Investor   
2       17741  St. Elizabeth Community Hospital       96080  Non Profit   
3       20277       Ukiah Valley Medical Center       95482  Non Profit   
4       29823    Colusa Regional Medical Center  95932-2954  Non Profit   

      Teaching DonorType  FullTimeCount  NetPatientRevenue  InpatientOperExp  \
0  Small/Rural   Charity          400.0        139170.3798      2.338557e+07   
1  Small/Rural   Charity          262.0        116797.8306      1.368450e+07   
2  Small/Rural   Charity          397.5        232503.0191      3.668289e+07   
3  Small/Rural   Charity          503.5        214516.4481      3.270922e+07   
4  Small/Rural   Charity          168.0         51726.4918      9.022366e+06   

   OutpatientOperExp  ...  AvlBeds  LastName  FirstName Gender  \
0 
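
By default `rename` skips any key in the mapping that does not match an existing column, so a misspelled name goes unnoticed. A short sketch of the `errors='raise'` option on a made-up one-row frame; the misspelled key `'NoFTEs'` is deliberate:

```python
import pandas as pd

# Made-up frame with two of the original column names
df = pd.DataFrame({'NoFTE': [327.0], 'NetPatRev': [135520.2186]})

# A correctly spelled mapping renames as expected
renamed = df.rename(columns={'NoFTE': 'FullTimeCount'}, errors='raise')
print(list(renamed.columns))

# A misspelled source name raises KeyError instead of being skipped
try:
    df.rename(columns={'NoFTEs': 'FullTimeCount'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)
```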

In [58]:
# Insert New Records with all specified columns (including Phone and Email)

new_employee_1 = {
    'HospitalID': 46996,
    'Name': 'Ridgecrest Regional Hospital',
    'Zip': '93555',
    'TypeControl': 'Non Profit',
    'Teaching': 'Small/Rural',
    'DonorType': 'Charity',
    'FullTimeCount': 100,  
    'NetPatientRevenue': 50000, 
    'InpatientOperExp': 100000, 
    'OutpatientOperExp': 200000,
    'Operating_Revenue': 300000,
    'Operating_Income': 10000,  
    'AvlBeds': 50,
    'PositionTitle': 'Regional Representative',
    'Compensation': 46978,
    'MaxTerm': 4,
    'StartDate': '2025-04-14',
    'Gender': 'M',
    'LastName': 'Thompson',
    'FirstName': 'Cody',
    'Phone': '555-555-0101',  
    'Email': 'cody.thompson@example.com'  
}

new_employee_2 = {
    'HospitalID': 17741,
    'Name': 'St. Elizabeth Community Hospital',
    'Zip': '96080',  
    'TypeControl': 'Non Profit',
    'Teaching': 'Small/Rural',
    'DonorType': 'Charity',
    'FullTimeCount': 200,  
    'NetPatientRevenue': 70000,  
    'InpatientOperExp': 150000,  
    'OutpatientOperExp': 250000,  
    'Operating_Revenue': 350000,  
    'Operating_Income': 15000,  
    'AvlBeds': 60,
    'PositionTitle': 'State Board Representative',
    'Compensation': 89473,
    'MaxTerm': 3,
    'StartDate': '2025-04-14',
    'Gender': 'M',
    'LastName': 'Thompson',
    'FirstName': 'Cody',
    'Phone': '555-555-0102',  
    'Email': 'cody.thompson2@example.com'
}

# Convert the new records to DataFrames
new_employee_1_df = pd.DataFrame([new_employee_1])
new_employee_2_df = pd.DataFrame([new_employee_2])

# Concatenate the new records with the existing filtered_df
new_merge = pd.concat([filtered_df, new_employee_1_df, new_employee_2_df], ignore_index=True)

# Display the updated DataFrame to confirm the new records
new_merge.tail(10)


Unnamed: 0,HospitalID,Name,Zip,TypeControl,Teaching,DonorType,FullTimeCount,NetPatientRevenue,InpatientOperExp,OutpatientOperExp,...,AvlBeds,LastName,FirstName,Gender,PositionTitle,Compensation,MaxTerm,StartDate,Phone,Email
20,20266,Sonora Regional Medical Center - greenley,95370,Non Profit,Small/Rural,Charity,779.0,367540.6639,68228850.0,69968860.0,...,152,Adams,Sandy,F,Regional Representative,46978,4,1/1/2009,785-532-3333,sandy.adams@comenitymed.com
21,37436,Fallbrook Hospital,92028,District,Small/Rural,Charity,501.0,108960.418,23001690.0,14727470.0,...,146,Johanson,Sandy,F,State Board Representative,89473,3,1/1/2012,801-216-4821,sandy.johanson@ihc.com
22,17736,Sierra Nevada Memorial Hospital,95945,Non Profit,Small/Rural,Charity,524.5,295579.235,56692830.0,50264170.0,...,121,Charles,Kenneth,M,Acting Director,248904,8,1/1/2006,405-744-3412,kenneth.charles@edihealth.com
23,38802,Santa Ynez Valley Cottage Hospital,93463,Non Profit,Small/Rural,Charity,67.0,28773.45355,1780969.0,8235088.0,...,20,Dong,HongWei,F,Regional Representative,46978,4,1/1/2010,479-354-4864,hongwei.dong@brokenhealth.com
24,45067,Glenn Medical Center,95988-2745,Non Profit,Small/Rural,Charity,100.0,29712.3306,2076879.0,9501695.0,...,15,Adamson,David,M,Safety Inspection Member,23987,2,1/1/2012,785-532-7573,david.adamson@txbiomed.net
25,28283,Hi-Desert Medical Center,92252,District,Small/Rural,Charity,451.5,145733.5765,31842680.0,21184930.0,...,179,Smith,Frank,M,Acting Director,248904,8,1/1/2005,405-744-5687,frank.smith@edihealth.com
26,28812,Oak Valley District Hospital,95361,District,Small/Rural,Charity,503.0,137280.7104,19495620.0,29846900.0,...,150,Holmes,Holly,F,Acting Director,248904,8,1/1/2003,785-532-4515,holly.holmes@asu.edu
27,19868,Ojai Valley Community Hospital,93023-3163,Non Profit,Small/Rural,Charity,180.0,59504.62295,11955300.0,10326800.0,...,103,Coulter,Tracy,F,Regional Representative,46978,4,1/1/2010,785-532-6548,tracy.coulter@wou.edu
28,46996,Ridgecrest Regional Hospital,93555,Non Profit,Small/Rural,Charity,100.0,50000.0,100000.0,200000.0,...,50,Thompson,Cody,M,Regional Representative,46978,4,2025-04-14,555-555-0101,cody.thompson@example.com
29,17741,St. Elizabeth Community Hospital,96080,Non Profit,Small/Rural,Charity,200.0,70000.0,150000.0,250000.0,...,60,Thompson,Cody,M,State Board Representative,89473,3,2025-04-14,555-555-0102,cody.thompson2@example.com
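
`pd.concat` aligns on column names, so a key in a new record that does not match an existing column creates an extra, mostly empty column rather than an error. A minimal sketch of a pre-check, using a made-up frame and a record with a deliberately misspelled key (`'Lastname'`):

```python
import pandas as pd

# Made-up frame with a few of the notebook's column names
existing = pd.DataFrame({
    'HospitalID': [46996],
    'AvlBeds': [55],
    'LastName': ['Roberts'],
})

# New record with a deliberate typo in the key 'Lastname'
new_record = {'HospitalID': 17741, 'AvlBeds': 60, 'Lastname': 'Thompson'}

# Compare the record's keys with the existing columns before concatenating
missing = set(existing.columns) - set(new_record)
extra = set(new_record) - set(existing.columns)
print("Missing from record:", missing)
print("Not in DataFrame:", extra)

# Concatenating anyway shows the effect: a new column and NaN in 'LastName'
combined = pd.concat([existing, pd.DataFrame([new_record])], ignore_index=True)
print(combined)
```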


In [60]:
# Filtering Data
# Task 1: Select all hospitals that are non-profit with more than 250 employees, unless the net patient revenue is smaller than $109,000.
# Also, remove the columns containing employee information (e.g., PositionTitle, Compensation, etc.)

# Filter hospitals that meet the conditions
filtered_non_profit_hospitals = new_merge[(new_merge['TypeControl'] == 'Non Profit') &  # Non-profit hospitals
                                          (new_merge['FullTimeCount'] > 250) &  # More than 250 employees
                                          (new_merge['NetPatientRevenue'] >= 109000)]  # Net patient revenue >= $109,000


# Display the result for Task 1
print("Filtered Non-Profit Hospitals with more than 250 employees:")
filtered_non_profit_hospitals.head()


Filtered Non-Profit Hospitals with more than 250 employees:


Unnamed: 0,HospitalID,Name,Zip,TypeControl,Teaching,DonorType,FullTimeCount,NetPatientRevenue,InpatientOperExp,OutpatientOperExp,...,AvlBeds,LastName,FirstName,Gender,PositionTitle,Compensation,MaxTerm,StartDate,Phone,Email
0,46996,Ridgecrest Regional Hospital,93555,Non Profit,Small/Rural,Charity,400.0,139170.3798,23385570.0,24661355.9,...,55,Roberts,Melissa,F,Safety Inspection Member,23987,2,1/1/2009,785-532-9779,melissa.roberts@txbiomed.net
2,17741,St. Elizabeth Community Hospital,96080,Non Profit,Small/Rural,Charity,397.5,232503.0191,36682890.0,36739260.3,...,66,Marlin,Bill,M,Safety Inspection Member,23987,2,1/1/2011,503-645-7508,bill.marlin@larcmed.com
3,20277,Ukiah Valley Medical Center,95482,Non Profit,Small/Rural,Charity,503.5,214516.4481,32709220.0,43571851.35,...,65,Johanson,Sandy,F,Regional Representative,46978,4,1/1/2012,801-216-4821,sandy.johanson@ihc.com
5,13738,St. Mary Medical Center,92307-2206,Non Profit,Small/Rural,Charity,1216.0,540975.1175,125128300.0,66801544.65,...,186,Milgrom,Patricia,F,Safety Inspection Member,23987,2,1/1/2011,479-178-9584,patricia.milgrom@brokenhealth.com
6,38798,Goleta Valley Cottage Hospital,93111,Non Profit,Small/Rural,Charity,288.0,335179.5574,53589040.0,67030147.91,...,119,Iwata,Akira,M,Regional Representative,46978,4,1/1/2011,801-611-9161,akira.iwata@hsu.edu
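
The task comment also asks for the employee columns to be removed, but the cell above only filters rows, so the personnel fields still appear in its output. A minimal sketch of one way to drop them, shown on a made-up frame; the `employee_cols` list is an assumption about which columns count as employee information, based on the personnel file's headers:

```python
import pandas as pd

# Made-up rows with a mix of hospital and employee columns
toy = pd.DataFrame({
    'Name': ['H1', 'H2', 'H3'],
    'TypeControl': ['Non Profit', 'Non Profit', 'District'],
    'FullTimeCount': [400.0, 300.0, 500.0],
    'NetPatientRevenue': [139170.0, 90000.0, 200000.0],
    'LastName': ['Roberts', 'Marlin', 'Smith'],
    'PositionTitle': ['Regional Representative', 'Acting Director', 'Acting Director'],
})

# Assumed list of employee-related columns
employee_cols = ['LastName', 'FirstName', 'Gender', 'PositionTitle',
                 'Compensation', 'MaxTerm', 'StartDate', 'Phone', 'Email']

# Same row conditions as Task 1
mask = ((toy['TypeControl'] == 'Non Profit') &
        (toy['FullTimeCount'] > 250) &
        (toy['NetPatientRevenue'] >= 109000))

# errors='ignore' skips listed columns that are absent from the frame
hospitals_only = toy[mask].drop(columns=employee_cols, errors='ignore')
print(hospitals_only)
```

Applied to `new_merge`, the same `drop` call would leave only the hospital-level columns.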


In [61]:
# Task 2: Select all the Regional Representatives who work at a hospital with operating income greater than $100,000.

filtered_regional_representatives = new_merge[(new_merge['PositionTitle'] == 'Regional Representative') &  # Regional Representatives
                                              (new_merge['Operating_Income'] > 100000)]  # Operating Income > $100,000

# Display the result for Task 2
print("\nFiltered Regional Representatives with Operating Income > $100,000:")
filtered_regional_representatives.head()



Filtered Regional Representatives with Operating Income > $100,000:


Unnamed: 0,HospitalID,Name,Zip,TypeControl,Teaching,DonorType,FullTimeCount,NetPatientRevenue,InpatientOperExp,OutpatientOperExp,...,AvlBeds,LastName,FirstName,Gender,PositionTitle,Compensation,MaxTerm,StartDate,Phone,Email
1,37393,Barstow Community Hospital,92311,Investor,Small/Rural,Charity,262.0,116797.8306,13684502.49,15159986.51,...,42,Iwata,Akira,M,Regional Representative,46978,4,1/1/2011,801-611-9161,akira.iwata@hsu.edu
3,20277,Ukiah Valley Medical Center,95482,Non Profit,Small/Rural,Charity,503.5,214516.4481,32709222.65,43571851.35,...,65,Johanson,Sandy,F,Regional Representative,46978,4,1/1/2012,801-216-4821,sandy.johanson@ihc.com
6,38798,Goleta Valley Cottage Hospital,93111,Non Profit,Small/Rural,Charity,288.0,335179.5574,53589036.09,67030147.91,...,119,Iwata,Akira,M,Regional Representative,46978,4,1/1/2011,801-611-9161,akira.iwata@hsu.edu
7,46348,Barton Memorial Hospital,96150,Non Profit,Small/Rural,Charity,750.0,335179.5574,53589036.09,67030147.91,...,119,Paanua,Kaaluai,M,Regional Representative,46978,4,1/1/2011,479-684-1883,kaaluai.paanua@brokenhealth.com
10,17718,Mercy Medical Center - Mount Shasta,96067,Non Profit,Small/Rural,Charity,215.5,123480.2705,22003500.15,22410053.85,...,60,Charles,Kenneth,M,Regional Representative,46978,4,1/1/2009,405-744-3412,kenneth.charles@edihealth.com


In [66]:
# Convert 'StartDate' column to datetime
new_merge['StartDate'] = pd.to_datetime(new_merge['StartDate'])

# Confirm the conversion by outputting the data types of all columns
print("Data Types after Conversion:")
print(new_merge.dtypes)

# Display the first 5 records of the 'StartDate' column to confirm the conversion
print("\nFirst 5 records of the 'StartDate' column:")
print(new_merge['StartDate'].head())


Data Types after Conversion:
HospitalID                    int64
Name                         object
Zip                          object
TypeControl                  object
Teaching                     object
DonorType                    object
FullTimeCount               float64
NetPatientRevenue           float64
InpatientOperExp            float64
OutpatientOperExp           float64
Operating_Revenue             int64
Operating_Income              int64
AvlBeds                       int64
LastName                     object
FirstName                    object
Gender                       object
PositionTitle                object
Compensation                  int64
MaxTerm                       int64
StartDate            datetime64[ns]
Phone                        object
Email                        object
dtype: object

First 5 records of the 'StartDate' column:
0   2009-01-01
1   2011-01-01
2   2011-01-01
3   2012-01-01
4   2009-01-01
Name: StartDate, dtype: datetime64[ns]
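
The `StartDate` column mixes two formats: the original rows use values like `1/1/2009`, while the two inserted records use `2025-04-14`. Depending on the pandas version, a single `pd.to_datetime` call may not accept both, since newer releases infer one format from the first value and can raise on values that do not match it. A hedged sketch that parses each format explicitly and combines the results; the helper name `parse_start_dates` is hypothetical:

```python
import pandas as pd

def parse_start_dates(values):
    """Parse dates written as either M/D/YYYY or YYYY-MM-DD."""
    s = pd.Series(values, dtype='object')
    # Values that do not match the given format become NaT instead of raising
    us_style = pd.to_datetime(s, format='%m/%d/%Y', errors='coerce')
    iso_style = pd.to_datetime(s, format='%Y-%m-%d', errors='coerce')
    # Fill the gaps left by the first format with results from the second
    return us_style.fillna(iso_style)

parsed = parse_start_dates(['1/1/2009', '1/1/2011', '2025-04-14'])
print(parsed)
```

Any value matching neither format stays `NaT`, so `parsed.isna().sum()` gives a quick count of dates that failed to parse.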
