1: Import Necessary Libraries and Modules

In [1]:
# Ensure the scripts directory is in your Python path
import sys
sys.path.append('../scripts')

# Import the necessary class from clean_medical_data.py
from clean_medical_data import MedicalDataCleaner

# Import other necessary libraries
import pandas as pd

2: Load the Merged Data

In [2]:
# Define the path to the merged data
merged_data_path = '../src/data/merged_medical_data.csv'

# Create an instance of the MedicalDataCleaner class
cleaner = MedicalDataCleaner(merged_data_path)

# Load the merged data
df = cleaner.load_csv()
df.head()  # Display the first few rows of the DataFrame

2025-02-02 09:21:01,428 - INFO - ✅ CSV file '../src/data/merged_medical_data.csv' loaded successfully.


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date
0,Doctors Ethiopia,@DoctorsET,864,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...,2023-12-18 17:04:02+00:00
1,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39+00:00
2,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር \n\nለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀን...,2023-10-02 16:37:39+00:00
3,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ?\n\nሙ...,2023-09-16 07:54:32+00:00
4,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15+00:00


2.1: Display the Unique Channels in the Merged Data

In [12]:
# Display the unique channels in the merged data
unique_channels = df['Channel Username'].unique()
print("Unique Channels in the Merged Data:")
for channel in unique_channels:
    print(channel)

Unique Channels in the Merged Data:
@DoctorsET
@CheMed123
@lobelia4cosmetics
@yetenaweg
@EAHCI
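
Beyond listing the channel names, it can help to see how many rows each channel contributes. The sketch below uses a small invented frame (not the real merged data) that only shares the 'Channel Username' column name with the notebook.

```python
import pandas as pd

# Toy frame (invented values) with the same 'Channel Username' column.
toy = pd.DataFrame({
    "Channel Username": ["@DoctorsET", "@DoctorsET", "@CheMed123", "@EAHCI", "@DoctorsET"],
})

# value_counts() lists each channel once with its row count, most frequent first.
channel_counts = toy["Channel Username"].value_counts()
print(channel_counts)
```

Running the same call on `df['Channel Username']` would give the per-channel row counts for the merged file.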


3: Clean the Data

3.1: Remove Duplicates

In [3]:
# Remove duplicates: keep the first row seen for each message ID
cleaner.df = cleaner.df.drop_duplicates(subset=["ID"]).copy()
print("✅ Duplicates removed from dataset.")
cleaner.df.head()  # Display the first few rows of the DataFrame after removing duplicates

✅ Duplicates removed from dataset.


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date
0,Doctors Ethiopia,@DoctorsET,864,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...,2023-12-18 17:04:02+00:00
1,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39+00:00
2,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር \n\nለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀን...,2023-10-02 16:37:39+00:00
3,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ?\n\nሙ...,2023-09-16 07:54:32+00:00
4,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15+00:00
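
The choice of `subset` matters if message IDs are numbered separately within each channel, which appears to be the case for Telegram channels. The sketch below uses an invented toy frame to compare deduplicating on the ID alone with deduplicating on the channel and ID together. It is an illustration of the trade-off, not a statement about what the real data contains.

```python
import pandas as pd

# Toy data (invented for illustration): two channels reuse ID 1,
# and @DoctorsET contains one true duplicate row.
toy = pd.DataFrame({
    "Channel Username": ["@DoctorsET", "@DoctorsET", "@CheMed123", "@CheMed123"],
    "ID": [1, 1, 1, 2],
    "Message": ["a", "a", "b", "c"],
})

# Deduplicating on ID alone also drops the @CheMed123 message with ID 1.
by_id = toy.drop_duplicates(subset=["ID"])

# Deduplicating on (channel, ID) removes only the repeated row.
by_channel_and_id = toy.drop_duplicates(subset=["Channel Username", "ID"])

print(len(by_id), len(by_channel_and_id))
```

Comparing the row counts before and after `drop_duplicates` on the real frame is a quick way to check whether IDs collide across channels.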


3.2: Convert Date to Datetime Format

In [4]:
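# Convert Date to datetime format; unparseable values become NaT.
# Direct column assignment is used so the column takes the datetime dtype
# (assigning through .loc[:, 'Date'] can leave it as object in pandas 2.x).
cleaner.df['Date'] = pd.to_datetime(cleaner.df['Date'], errors='coerce')
# Keep missing dates as nulls (a datetime column stores None as NaT)
cleaner.df.loc[:, 'Date'] = cleaner.df['Date'].where(cleaner.df['Date'].notna(), None)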
print("✅ Date column formatted to datetime.")
cleaner.df.head()  # Display the first few rows of the DataFrame after converting dates

✅ Date column formatted to datetime.


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date
0,Doctors Ethiopia,@DoctorsET,864,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...,2023-12-18 17:04:02+00:00
1,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39+00:00
2,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር \n\nለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀን...,2023-10-02 16:37:39+00:00
3,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ?\n\nሙ...,2023-09-16 07:54:32+00:00
4,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15+00:00
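
To see what `errors='coerce'` does, here is a minimal sketch on three invented values. The `utc=True` argument is an addition for this example (it is not used in the cell above) so that the result is a single timezone-aware column.

```python
import pandas as pd

# Toy values (invented): one valid timestamp, one unparseable string, one missing.
raw = pd.Series(["2023-12-18 17:04:02+00:00", "not a date", None])

# errors="coerce" turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

print(parsed.dtype)         # a timezone-aware datetime dtype
print(parsed.isna().sum())  # number of values that could not be parsed
```

Checking `cleaner.df['Date'].isna().sum()` after the conversion shows how many dates in the real data failed to parse.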


3.3: Convert 'ID' to Integer

We'll convert the 'ID' column to integer format for PostgreSQL BIGINT compatibility.

In [5]:
# Convert 'ID' to a 64-bit integer (matches PostgreSQL BIGINT); non-numeric or missing IDs become 0
cleaner.df['ID'] = pd.to_numeric(cleaner.df['ID'], errors="coerce").fillna(0).astype("int64")
print("✅ 'ID' column converted to integer.")
cleaner.df.head()  # Display the first few rows of the DataFrame after converting 'ID'

✅ 'ID' column converted to integer.


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date
0,Doctors Ethiopia,@DoctorsET,864,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...,2023-12-18 17:04:02+00:00
1,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39+00:00
2,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር \n\nለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀን...,2023-10-02 16:37:39+00:00
3,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ?\n\nሙ...,2023-09-16 07:54:32+00:00
4,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15+00:00
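
The conversion chain can be tried on a few invented values to see how bad inputs are handled. The second variant, using the nullable `Int64` dtype, is an alternative shown for comparison only; it is not what the notebook does.

```python
import pandas as pd

# Toy IDs (invented): two valid, one non-numeric, one missing.
raw_ids = pd.Series(["864", "863", "abc", None])

# Same chain as in the notebook: invalid values become NaN, then 0.
ids = pd.to_numeric(raw_ids, errors="coerce").fillna(0).astype("int64")
print(ids.tolist())

# Alternative: keep invalid IDs as missing (<NA>) instead of mapping them to 0.
nullable_ids = pd.to_numeric(raw_ids, errors="coerce").astype("Int64")
print(nullable_ids.isna().sum())
```

Mapping invalid IDs to 0 keeps the column a plain integer, but several bad rows would then share the same ID, which matters if the ID is used as a key.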


3.4: Fill Missing Values

We'll fill missing values in the 'Message' column with the placeholder string "No Message".

In [6]:
# Fill missing values
cleaner.df.loc[:, 'Message'] = cleaner.df['Message'].fillna("No Message")
print("✅ Missing values filled.")
cleaner.df.head()  # Display the first few rows of the DataFrame after filling missing values

✅ Missing values filled.


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date
0,Doctors Ethiopia,@DoctorsET,864,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...,2023-12-18 17:04:02+00:00
1,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39+00:00
2,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር \n\nለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀን...,2023-10-02 16:37:39+00:00
3,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ?\n\nሙ...,2023-09-16 07:54:32+00:00
4,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15+00:00


3.5: Standardize Text Columns

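We'll standardize the text columns: leading and trailing spaces are stripped from the channel title and username, and each message is passed through cleaner.clean_text. Judging by the output below, clean_text also collapses the line breaks inside messages.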

In [7]:
# Standardize text columns
cleaner.df.loc[:, 'Channel Title'] = cleaner.df['Channel Title'].str.strip()
cleaner.df.loc[:, 'Channel Username'] = cleaner.df['Channel Username'].str.strip()
cleaner.df.loc[:, 'Message'] = cleaner.df['Message'].apply(cleaner.clean_text)
print("✅ Text columns standardized.")
cleaner.df.head()  # Display the first few rows of the DataFrame after standardizing text columns

✅ Text columns standardized.


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date
0,Doctors Ethiopia,@DoctorsET,864,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...,2023-12-18 17:04:02+00:00
1,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39+00:00
2,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር ለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀንሰው ...,2023-10-02 16:37:39+00:00
3,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ? ሙሉ ቪ...,2023-09-16 07:54:32+00:00
4,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15+00:00
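
The body of `cleaner.clean_text` is not shown in this notebook. As a rough idea of what such a helper might do, here is a hypothetical stand-in (`clean_text_sketch` is an invented name) that collapses every run of whitespace, including newlines, into a single space.

```python
import re

def clean_text_sketch(text):
    """Hypothetical stand-in for cleaner.clean_text: collapse whitespace runs."""
    # Leave non-string values (for example NaN) untouched.
    if not isinstance(text, str):
        return text
    # Replace any run of spaces, tabs, or newlines with one space, then trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text_sketch("  ሞት በስኳር \n\nለልጆቻችን  "))
```

The real method may do more (or less) than this; the sketch only reproduces the effect visible in the output above.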


3.6: Extract and Remove Emojis

We'll extract emojis from the 'Message' column and store them in a new column, then remove emojis from the 'Message' column.

In [8]:
# Extract emojis and store them in a new column
cleaner.df.loc[:, 'emoji_used'] = cleaner.df['Message'].apply(cleaner.extract_emojis)
print("✅ Emojis extracted and stored in 'emoji_used' column.")

# Remove emojis from message text
cleaner.df.loc[:, 'Message'] = cleaner.df['Message'].apply(cleaner.remove_emojis)
print("✅ Emojis removed from message text.")
cleaner.df.head()  # Display the first few rows of the DataFrame after removing emojis

✅ Emojis extracted and stored in 'emoji_used' column.
✅ Emojis removed from message text.


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,emoji_used
0,Doctors Ethiopia,@DoctorsET,864,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...,2023-12-18 17:04:02+00:00,👈👈👇👇
1,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39+00:00,👇
2,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር ለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀንሰው ...,2023-10-02 16:37:39+00:00,No emoji
3,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ? ሙሉ ቪ...,2023-09-16 07:54:32+00:00,👇👇👇👇
4,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15+00:00,No emoji
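
`cleaner.extract_emojis` and `cleaner.remove_emojis` are defined in the scripts module and are not shown here. A minimal regex-based sketch of the same idea follows. The function names and the Unicode ranges are assumptions for illustration; the ranges cover common pictographs only and miss some emoji (flags and variation selectors, for example). The "No emoji" placeholder is taken from the output above.

```python
import re

# Assumed, simplified emoji ranges; not a complete definition of "emoji".
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)

def extract_emojis_sketch(text):
    """Return all matched emojis joined together, or 'No emoji' if none are found."""
    found = EMOJI_PATTERN.findall(text)
    return "".join(found) or "No emoji"

def remove_emojis_sketch(text):
    """Return the text with all matched emojis deleted."""
    return EMOJI_PATTERN.sub("", text)

message = "ሙሉ ቪዲዮ 👇👇 here 👈"
print(extract_emojis_sketch(message))
print(remove_emojis_sketch(message))
```

As in the notebook, extraction has to run before removal; otherwise there is nothing left to extract.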


3.7: Extract and Remove YouTube Links

We'll extract YouTube links from the 'Message' column and store them in a new column, then remove YouTube links from the 'Message' column.

In [9]:
# Extract YouTube links into a separate column
cleaner.df.loc[:, 'youtube_links'] = cleaner.df['Message'].apply(cleaner.extract_youtube_links)
print("✅ YouTube links extracted and stored in 'youtube_links' column.")

# Remove YouTube links from message text
cleaner.df.loc[:, 'Message'] = cleaner.df['Message'].apply(cleaner.remove_youtube_links)
print("✅ YouTube links removed from message text.")
cleaner.df.head()  # Display the first few rows of the DataFrame after removing YouTube links

✅ YouTube links extracted and stored in 'youtube_links' column.
✅ YouTube links removed from message text.


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,emoji_used,youtube_links
0,Doctors Ethiopia,@DoctorsET,864,"በቀን አንዴ ብቻ የሚባለው የቢዝነስ አማካሪ በ 10,000 ብር ብቻ የተ...",2023-12-18 17:04:02+00:00,👈👈👇👇,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...
1,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39+00:00,👇,https://youtu.be/gwVN5eJQpko?si=xARsSxIEdZtE91GY
2,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር ለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀንሰው ...,2023-10-02 16:37:39+00:00,No emoji,https://youtu.be/oHiSRrNF7I0?si=Absgm414YSt_kjNq
3,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ? ሙሉ ቪ...,2023-09-16 07:54:32+00:00,👇👇👇👇,https://youtu.be/tTeErZxIh_Q?si=jKHyfWcC3sfXbC8L
4,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15+00:00,No emoji,https://youtu.be/0k65P5ouw7s?si=qaUgo75bUa3AMQxD
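
The link helpers are also defined outside this notebook. The sketch below shows one way the extract-then-remove pair could work with a regular expression. The pattern, the function names, and the choice to return `None` when no link is found are all assumptions; the real methods may behave differently.

```python
import re

# Assumed pattern: http(s) links on youtube.com or youtu.be, up to the next whitespace.
YOUTUBE_PATTERN = re.compile(r"https?://(?:www\.)?(?:youtube\.com|youtu\.be)/\S+")

def extract_youtube_links_sketch(text):
    """Return the matched links as one comma-separated string, or None if there are none."""
    links = YOUTUBE_PATTERN.findall(text)
    return ", ".join(links) if links else None

def remove_youtube_links_sketch(text):
    """Delete matched links and tidy up the whitespace they leave behind."""
    without_links = YOUTUBE_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_links).strip()

message = "watch https://youtu.be/abc123?si=xyz now"
print(extract_youtube_links_sketch(message))
print(remove_youtube_links_sketch(message))
```

Keeping the links in their own column preserves them for later analysis while leaving the message text free of URLs.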


3.8: Rename Columns

Finally, we'll rename the columns to match the PostgreSQL schema.

In [10]:
# Rename columns to match PostgreSQL schema
cleaner.df = cleaner.df.rename(columns={
    "Channel Title": "channel_title",
    "Channel Username": "channel_username",
    "ID": "message_id",
    "Message": "message",
    "Date": "message_date",
})
print("✅ Columns renamed to match PostgreSQL schema.")
cleaner.df.head()  # Display the first few rows of the DataFrame after renaming columns

✅ Columns renamed to match PostgreSQL schema.


Unnamed: 0,channel_title,channel_username,message_id,message,message_date,emoji_used,youtube_links
0,Doctors Ethiopia,@DoctorsET,864,"በቀን አንዴ ብቻ የሚባለው የቢዝነስ አማካሪ በ 10,000 ብር ብቻ የተ...",2023-12-18 17:04:02+00:00,👈👈👇👇,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...
1,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39+00:00,👇,https://youtu.be/gwVN5eJQpko?si=xARsSxIEdZtE91GY
2,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር ለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀንሰው ...,2023-10-02 16:37:39+00:00,No emoji,https://youtu.be/oHiSRrNF7I0?si=Absgm414YSt_kjNq
3,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ? ሙሉ ቪ...,2023-09-16 07:54:32+00:00,👇👇👇👇,https://youtu.be/tTeErZxIh_Q?si=jKHyfWcC3sfXbC8L
4,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15+00:00,No emoji,https://youtu.be/0k65P5ouw7s?si=qaUgo75bUa3AMQxD
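
The output above still carries an 'Unnamed: 0' column, which looks like a leftover index from an earlier CSV export. The sketch below, on an invented one-row frame, renames the columns, drops that leftover column, and checks the result against an expected column list. The expected list is an assumption inferred from the rename mapping, not a confirmed PostgreSQL schema.

```python
import pandas as pd

# Assumed target columns, inferred from the rename mapping in the notebook.
EXPECTED_COLUMNS = [
    "channel_title", "channel_username", "message_id",
    "message", "message_date", "emoji_used", "youtube_links",
]

# Toy one-row frame (invented values) with the notebook's column names.
toy = pd.DataFrame({
    "Unnamed: 0": [0],
    "Channel Title": ["Doctors Ethiopia"],
    "Channel Username": ["@DoctorsET"],
    "ID": [864],
    "Message": ["example"],
    "Date": ["2023-12-18 17:04:02+00:00"],
    "emoji_used": ["No emoji"],
    "youtube_links": [None],
})

renamed = toy.rename(columns={
    "Channel Title": "channel_title",
    "Channel Username": "channel_username",
    "ID": "message_id",
    "Message": "message",
    "Date": "message_date",
})

# Drop the leftover index column if it is present.
renamed = renamed.drop(columns=["Unnamed: 0"], errors="ignore")

missing = [c for c in EXPECTED_COLUMNS if c not in renamed.columns]
extra = [c for c in renamed.columns if c not in EXPECTED_COLUMNS]
print("missing:", missing, "extra:", extra)
```

A check like this catches column mismatches before the data is loaded into the database.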


4: Save the Cleaned Data

In [11]:
# Define the path to save the cleaned data
output_path = '../src/data/cleaned_medical_data.csv'

# Save the cleaned merged data
cleaner.save_cleaned_data(output_path)

2025-02-02 09:21:43,332 - INFO - ✅ Cleaned data saved successfully to '../src/data/cleaned_medical_data.csv'.


✅ Cleaned data saved successfully to '../src/data/cleaned_medical_data.csv'.
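
How `cleaner.save_cleaned_data` writes the file is not shown here. Assuming it relies on pandas `to_csv`, the sketch below illustrates one detail worth checking: writing the index produces an extra 'Unnamed: 0' column when the file is read back. It uses an in-memory buffer and invented rows instead of the real output path.

```python
import io
import pandas as pd

# Toy cleaned frame (invented rows) using the renamed columns.
toy = pd.DataFrame({
    "message_id": [864, 863],
    "message_date": pd.to_datetime(
        ["2023-12-18 17:04:02+00:00", "2023-11-03 16:14:39+00:00"], utc=True
    ),
})

def round_trip(frame, write_index):
    """Write the frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=write_index)
    buffer.seek(0)
    return pd.read_csv(buffer, parse_dates=["message_date"])

with_index = round_trip(toy, write_index=True)
without_index = round_trip(toy, write_index=False)

print(list(with_index.columns))
print(list(without_index.columns))
```

Reading the saved file back once and inspecting its columns and dtypes is a cheap way to confirm that the cleaned data survived the export.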
