To update your hospital readmission prediction project to use Big Data technologies, specifically Apache Hadoop and HDFS (Hadoop Distributed File System), follow these steps:

Steps to Use Hadoop and HDFS for Your Project
Set Up Hadoop: Ensure you have a Hadoop cluster running, whether locally or on a cloud provider. Since you are running Hadoop locally, confirm that HDFS is configured and that the NameNode and DataNode processes are up before continuing.

Upload Data to HDFS:

Use the Hadoop command line to copy your CSV files (hospital_with_actual_A1C.csv and hospital_with_predicted_A1C.csv) to HDFS.

In [5]:
# hadoop fs -mkdir /hospital_data
# hadoop fs -put C:\BigData\BigDataHackathon\git2\hospital_with_actual_A1C.csv /hospital_data/
# hadoop fs -put C:\BigData\BigDataHackathon\git2\hospital_with_predicted_A1C.csv /hospital_data/
# hdfs dfs -chmod 777 /hospital_data
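The same upload can be scripted from Python, which helps when the notebook has to be rerun. The following is a minimal sketch, not part of the original project: the helper names `build_upload_commands` and `run_commands` are hypothetical, and it assumes the `hadoop` executable is on the PATH (on Windows this may need to be `hadoop.cmd`). It adds `-p` to `-mkdir` and `-f` to `-put` so that repeating the commands does not fail on an existing directory or file. By default it only prints the commands.

```python
import subprocess
from typing import List


def build_upload_commands(local_files: List[str], hdfs_dir: str) -> List[List[str]]:
    """Return the `hadoop fs` argument lists that create hdfs_dir and upload each file."""
    # -p: do not fail if the directory already exists
    commands = [["hadoop", "fs", "-mkdir", "-p", hdfs_dir]]
    for path in local_files:
        # -f: overwrite the destination file if it is already in HDFS
        commands.append(["hadoop", "fs", "-put", "-f", path, hdfs_dir + "/"])
    return commands


def run_commands(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if Hadoop reports a failure
            subprocess.run(argv, check=True)


local_files = [
    r"C:\BigData\BigDataHackathon\git2\hospital_with_actual_A1C.csv",
    r"C:\BigData\BigDataHackathon\git2\hospital_with_predicted_A1C.csv",
]
commands = build_upload_commands(local_files, "/hospital_data")
run_commands(commands, dry_run=True)  # set dry_run=False to run against the cluster
```

Keeping the commands as argument lists rather than one shell string avoids quoting problems with backslashes in the Windows paths.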


Read Data from HDFS:

Use the hdfs library (HdfsCLI) to read data directly from HDFS in your Python script; pydoop is an alternative, but the code below does not use it. Below is a sample code snippet that reads the CSV files from HDFS using pandas and hdfs, trains an example model, and writes the results back to HDFS.
Install hdfs if you haven't already:

In [6]:
# pip install hdfs
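Before reading, it can help to confirm that the uploaded files are actually visible through WebHDFS. The sketch below assumes the `hdfs` (HdfsCLI) client used in the script that follows, whose `status(path, strict=False)` returns `None` for a missing path instead of raising. The helper name `find_missing_paths` is hypothetical, and the live-cluster usage is left commented out because it needs a running NameNode.

```python
from typing import Iterable, List


def find_missing_paths(client, hdfs_paths: Iterable[str]) -> List[str]:
    """Return the paths that the client reports as absent from HDFS."""
    missing = []
    for path in hdfs_paths:
        # With strict=False, status() returns None for a missing path
        if client.status(path, strict=False) is None:
            missing.append(path)
    return missing


# Usage against a live cluster (not executed here):
# from hdfs import InsecureClient
# client = InsecureClient('http://localhost:9870', user='hadoop')
# missing = find_missing_paths(client, [
#     '/hospital_data/hospital_with_actual_A1C.csv',
#     '/hospital_data/hospital_with_predicted_A1C.csv',
# ])
# if missing:
#     raise FileNotFoundError(f"Not found in HDFS: {missing}")
```

Failing early with the list of missing paths is usually easier to diagnose than an HTTP error raised partway through `pd.read_csv`.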

In [1]:
import pandas as pd
import numpy as np
from hdfs import InsecureClient
import pickle
import io
from sklearn.ensemble import RandomForestClassifier  # Example model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Connect to HDFS
hdfs_client = InsecureClient('http://localhost:9870', user='hadoop')  # Update URL and user as necessary

# Read the DataFrame from HDFS
hdfs_path1 = '/hospital_data/hospital_with_actual_A1C.csv'
hdfs_path2 = '/hospital_data/hospital_with_predicted_A1C.csv'

# Read CSV files from HDFS
with hdfs_client.read(hdfs_path1) as reader:
    df1 = pd.read_csv(reader)

with hdfs_client.read(hdfs_path2) as reader:
    df2 = pd.read_csv(reader)

# Concatenate, reset index, and check data
final_df = pd.concat([df1, df2], axis=0).reset_index(drop=True)

# Debug: Print the column names
print("Columns in final_df:", final_df.columns.tolist())

# Check for the actual target column name
target_column_name = 'Readmitted'  # Replace with your actual target column name

# Ensure the target column exists in the DataFrame
if target_column_name not in final_df.columns:
    raise KeyError(f"Column '{target_column_name}' not found in DataFrame. Available columns: {final_df.columns.tolist()}")

# Define feature and target variables
X = final_df.drop(columns=[target_column_name])
y = final_df[target_column_name]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a model (example: Random Forest)
Readmission_Model = RandomForestClassifier()
Readmission_Model.fit(X_train, y_train)

# Optionally, evaluate the model
y_pred = Readmission_Model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model accuracy: {accuracy:.2f}')

# Save the processed DataFrame back to HDFS
final_hdfs_path = '/hospital_data/hospital_readmissions_final.csv'

try:
    # Use an in-memory buffer to write the DataFrame
    buffer = io.StringIO()
    final_df.to_csv(buffer, index=False)
    buffer.seek(0)  # Move to the beginning of the buffer

    # Write the buffer content to HDFS
    with hdfs_client.write(final_hdfs_path, overwrite=True, encoding='utf-8') as writer:
        writer.write(buffer.getvalue())

    print(f"DataFrame successfully saved to {final_hdfs_path}")
except Exception as e:
    print("Error writing to HDFS:", e)

# Save the trained model locally
with open("Readmission_Model.pkl", "wb") as m:
    pickle.dump(Readmission_Model, m)

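# Optionally, copy the model to HDFS.
# overwrite=True replaces a model left by an earlier run; without it, a rerun
# fails with "Remote path ... already exists", as in the output recorded below.
try:
    hdfs_client.upload('/hospital_data/Readmission_Model.pkl', 'Readmission_Model.pkl', overwrite=True)
    print("Model uploaded successfully.")
except Exception as e:
    print("Error uploading model to HDFS:", e)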

# Continue with testing new user data...


Columns in final_df: ['Gender', 'Admission_Type', 'Diagnosis', 'Num_Lab_Procedures', 'Num_Medications', 'Num_Outpatient_Visits', 'Num_Inpatient_Visits', 'Num_Emergency_Visits', 'Num_Diagnoses', 'A1C_Result', 'Readmitted']
Model accuracy: 0.55
DataFrame successfully saved to /hospital_data/hospital_readmissions_final.csv
Error uploading model to HDFS: Remote path '/hospital_data/Readmission_Model.pkl' already exists.
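The "already exists" message above appears when the cell is rerun and the model file from an earlier run is still in HDFS. One way to avoid the collision while keeping earlier models is to give each upload a timestamped name. This is a minimal sketch under that assumption; the helper `versioned_hdfs_path` and the naming scheme are hypothetical, not something the project prescribes.

```python
import posixpath
from datetime import datetime, timezone


def versioned_hdfs_path(hdfs_dir, stem, suffix=".pkl", now=None):
    """Build an HDFS path such as <hdfs_dir>/<stem>_<UTC timestamp><suffix>."""
    # Default to the current UTC time; pass `now` (in UTC) for reproducible names
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    # HDFS paths always use forward slashes, so use posixpath rather than os.path
    return posixpath.join(hdfs_dir, f"{stem}_{stamp}{suffix}")


example = versioned_hdfs_path(
    "/hospital_data",
    "Readmission_Model",
    now=datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
)
print(example)  # /hospital_data/Readmission_Model_20240101T120000Z.pkl
```

The result can be passed as the first argument of `hdfs_client.upload(...)` in place of the fixed `/hospital_data/Readmission_Model.pkl` path.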

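An accuracy figure such as the 0.55 reported above is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most common training label. The function name is hypothetical and the demo labels are made up for illustration, not taken from the hospital data. scikit-learn's `DummyClassifier(strategy="most_frequent")` offers the same baseline.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)


# Made-up labels, purely for illustration
y_train_demo = [0, 0, 0, 1, 1]
y_test_demo = [0, 1, 0, 1]
print(majority_baseline_accuracy(y_train_demo, y_test_demo))  # 0.5
```

Passing the `y_train` and `y_test` from the cell above shows how far the Random Forest is from this baseline on the actual split.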