<a href="https://colab.research.google.com/github/DinurakshanRavichandran/Visio-Glance/blob/Pre-Processed-Datasets-NLP/drusenPPFinal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
import pandas as pd
from imblearn.over_sampling import SMOTE

# Load the dataset
file_path = '/content/drive/MyDrive/PROJECT 29/DATASETS/Synthetic_Drusen_Dataset.csv'
df = pd.read_csv(file_path)

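# Handle missing values: fill each column with its most frequent value.
# Assign the result back; df[column].fillna(..., inplace=True) is chained
# assignment and raises a FutureWarning (it stops working in pandas 3.0).
for column in df.columns:
    most_frequent = df[column].mode()[0]
    df[column] = df[column].fillna(most_frequent)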

# Convert 'Smoking Status' to binary (1 for 'Yes', 0 for 'No')
df['Smoking Status'] = df['Smoking Status'].map({'Yes': 1, 'No': 0})

# One-Hot Encode 'Visual Symptoms' with lowercase labels after '_'
visual_symptoms_dummies = pd.get_dummies(df['Visual Symptoms'], prefix='Visual Symptoms')

# Lowercase only the part after "Visual Symptoms_".
# maxsplit=1 keeps category labels that themselves contain '_' intact.
visual_symptoms_dummies.columns = [
    'Visual Symptoms_' + col.split('_', 1)[1].lower() if '_' in col else col
    for col in visual_symptoms_dummies.columns
]

# Convert TRUE/FALSE features to binary (1 for TRUE, 0 for FALSE)
df.replace({True: 1, False: 0}, inplace=True)
visual_symptoms_dummies = visual_symptoms_dummies.astype(int)  # Ensure one-hot encoded values are 1/0

# Combine all features
features = pd.concat([df.drop(['Diagnosis', 'Visual Symptoms'], axis=1), visual_symptoms_dummies], axis=1)
target = df['Diagnosis']

# Handle Class Imbalance with SMOTE
smote = SMOTE(random_state=42)
features_smote, target_smote = smote.fit_resample(features, target)

# Reconstruct the DataFrame with resampled data
df_smote = pd.DataFrame(features_smote, columns=features.columns)
df_smote['Diagnosis'] = target_smote

# Save Preprocessed Data
preprocessed_file_path = '/content/drive/MyDrive/PROJECT 29/FINAL MODEL/Preprocessed_Drusen_Dataset.csv'
df_smote.to_csv(preprocessed_file_path, index=False)

# Display the head of the preprocessed dataset
print(df_smote.head())


   Age  Smoking Status   BMI  Blood Pressure  Cholesterol Levels  \
0   50               1  25.7             158                 189   
1   53               0  34.5             113                 241   
2   53               0  26.4             153                 152   
3   89               1  27.7             178                 229   
4   59               0  25.6             113                 219   

   Visual Symptoms_blind spots  Visual Symptoms_blurred vision  \
0                            0                               0   
1                            0                               0   
2                            0                               1   
3                            0                               0   
4                            0                               0   

   Visual Symptoms_distorted vision  Visual Symptoms_light sensitivity  \
0                                 1                                  0   
1                                 0           
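
The SMOTE step in the cell above balances the classes by creating synthetic minority-class rows that lie between an existing row and one of its nearest neighbours. The following is a simplified sketch of that interpolation idea only; it is not imblearn's implementation, it skips the nearest-neighbour search, and the helper name `interpolate_sample` and the toy rows are hypothetical.

```python
import random


def interpolate_sample(x, neighbor, rng):
    """Return a synthetic point on the line segment between x and neighbor."""
    gap = rng.random()  # one random step in [0, 1), shared by all features
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)

# Toy rows for illustration: [Age, BMI, Smoking Status]
sample = [50.0, 25.7, 1.0]
neighbor = [59.0, 25.6, 0.0]

synthetic = interpolate_sample(sample, neighbor, rng)
print(synthetic)
```

Because every feature is interpolated, a 0/1 column such as `Smoking Status` can take a fractional value in a synthetic row whenever the two source rows disagree, so it may be worth checking the binary columns of `df_smote` after resampling.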

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[column].fillna(most_frequent, inplace=True)
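
The warning above recommends assigning the filled column back to the DataFrame instead of calling `fillna(..., inplace=True)` on `df[column]`. A minimal sketch of that pattern, assuming a small made-up frame named `toy` (not the Drusen dataset) and adding a guard for columns that contain no values at all:

```python
import pandas as pd

# Toy frame with gaps (hypothetical values, not the Drusen dataset)
toy = pd.DataFrame({
    "Smoking Status": ["Yes", None, "No", "Yes"],
    "BMI": [25.7, 34.5, None, 25.7],
})

for column in toy.columns:
    modes = toy[column].mode()  # missing values are ignored by default
    if modes.empty:
        # Column is entirely missing, so there is nothing to impute from
        continue
    # Assign the result back instead of using inplace=True on toy[column]
    toy[column] = toy[column].fillna(modes[0])

print(toy)
```

Assigning back operates on the original DataFrame rather than on an intermediate copy, so the fill is applied without the chained-assignment warning.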