## Data Cleaning
In this step, you will clean the merged data (`scraped_data.xlsx`) that you generated in Task 1.

#### 1. Load the Merged Data:
Create a new notebook or continue in the existing one. Load the scraped_data.xlsx file.


In [9]:
import pandas as pd
import os

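# Switch to the project root so the relative path below resolves
# (this path is machine-specific; adjust it for your environment)
os.chdir(r'c:\Users\ermias.tadesse\10x\Centralize-Ethiopian-Medical-Business-Data')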
# Load the merged data
scraped_data = pd.read_excel('Data/raw/scraped_data.xlsx')

# Display the first few rows
scraped_data.head()


| Unnamed: 0 | message_id | date | sender_id | message | media_path | channel_name |
|---|---|---|---|---|---|---|
| 0 | 864 | 2023-12-18 17:04:02+00:00 | -1001102021238 | https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD... | | DoctorsET |
| 1 | 863 | 2023-11-03 16:14:39+00:00 | -1001102021238 | ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ... | | DoctorsET |
| 2 | 862 | 2023-10-02 16:37:39+00:00 | -1001102021238 | ሞት በስኳር \r\n\r\nለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን... | | DoctorsET |
| 3 | 861 | 2023-09-16 07:54:32+00:00 | -1001102021238 | ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ?\r\n\... | | DoctorsET |
| 4 | 860 | 2023-09-01 16:16:15+00:00 | -1001102021238 | በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos... | | DoctorsET |
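
The preview above includes an `Unnamed: 0` column. pandas assigns names of that form to header-less columns, which typically happens when a DataFrame was saved together with its index. If that is the case here, the column carries no information and can be dropped before cleaning. The following is a minimal sketch on a toy frame; the helper name `drop_index_artifacts` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def drop_index_artifacts(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns named like 'Unnamed: 0'.

    Hypothetical helper: it assumes such columns are leftover index
    columns and not real data.
    """
    artifact_cols = [c for c in df.columns if str(c).startswith("Unnamed:")]
    return df.drop(columns=artifact_cols)


# Toy frame standing in for the loaded spreadsheet (illustrative values).
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "message_id": [864, 863],
    "channel_name": ["DoctorsET", "DoctorsET"],
})

demo_clean = drop_index_artifacts(demo)
print(list(demo_clean.columns))
```

In the notebook, the same idea would be applied as `scraped_data = drop_index_artifacts(scraped_data)` right after loading. Passing `index_col=0` to `pd.read_excel` is an alternative when the first column is known to be the saved index.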


#### 2. Remove Duplicates:
Identify and remove any duplicate rows from the dataset.

In [10]:
# Check for duplicates based on all columns
duplicates = scraped_data.duplicated()

# Remove duplicates
scraped_data_cleaned = scraped_data.drop_duplicates()
print(f"Removed {duplicates.sum()} duplicate rows.")


Removed 0 duplicate rows.
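
A count of zero does not necessarily mean the data has no repeated messages. `duplicated()` with no arguments compares every column, so if the frame still carries a per-row counter such as the `Unnamed: 0` column shown in the preview, no two rows can ever match. A hedged alternative is to compare only the columns that identify a message. The sketch below assumes that `channel_name` plus `message_id` form such a key; that is an assumption about the data, and the toy values are illustrative.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are the same message, but the counter column differs.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "message_id": [864, 864, 863],
    "message": ["hello", "hello", "world"],
    "channel_name": ["DoctorsET", "DoctorsET", "DoctorsET"],
})

# Comparing all columns finds nothing because 'Unnamed: 0' is unique per row.
full_row_dupes = int(demo.duplicated().sum())

# Comparing only the assumed key columns finds the repeated message.
key_cols = ["channel_name", "message_id"]  # assumed identifying columns
key_dupes = int(demo.duplicated(subset=key_cols).sum())

# Keep the first occurrence of each key.
deduped = demo.drop_duplicates(subset=key_cols, keep="first")

print(f"Full-row duplicates: {full_row_dupes}, key duplicates: {key_dupes}")
```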


#### 3. Handle Missing Values:
Check for missing values in critical columns and decide how to handle them (e.g., dropping rows or filling missing data).

In [11]:
# Check for missing values
print(scraped_data_cleaned.isnull().sum())

# Drop rows with missing messages or dates
scraped_data_cleaned = scraped_data_cleaned.dropna(subset=['message', 'date'])

# Optionally, fill missing values in 'media_path' with a placeholder (e.g., 'No Media').
# Assign the result back to the column; calling fillna(..., inplace=True) on the
# selected column operates on a temporary copy.
scraped_data_cleaned['media_path'] = scraped_data_cleaned['media_path'].fillna('No Media')


message_id        0
date              0
sender_id         0
message         128
media_path      300
channel_name      0
dtype: int64


Note: the chained form `scraped_data_cleaned['media_path'].fillna('No Media', inplace=True)` triggers a pandas warning that the behavior will change in pandas 3.0, because the intermediate object on which the values are set always behaves as a copy. As the warning suggests, use `df[col] = df[col].method(value)` or `df.method({col: value}, inplace=True)` to perform the operation on the original object.
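
Before choosing between dropping and filling, it can help to see what share of each column is missing. The following is a minimal sketch on a toy frame; the helper name `missing_report` and the sample values are illustrative assumptions. It also uses the dictionary form of `fillna`, which fills a single column without chained assignment.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and shares per column (hypothetical helper)."""
    counts = df.isnull().sum()
    report = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    return report.sort_values("missing", ascending=False)


# Toy frame with gaps in every column (illustrative values).
demo = pd.DataFrame({
    "message": ["a", None, "c", "d"],
    "date": ["2023-12-18", "2023-11-03", None, "2023-09-16"],
    "media_path": [None, None, "photo.jpg", None],
})

report = missing_report(demo)
print(report)

# Drop rows missing a critical field, then fill the optional column.
cleaned = (
    demo.dropna(subset=["message", "date"])
        .fillna({"media_path": "No Media"})
)
print(cleaned)
```

Treating `message` and `date` as critical and `media_path` as optional mirrors the choice made in the cell above; a different analysis might reasonably keep media-only posts instead of dropping them.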


#### 4. Standardize Formats:
Ensure all columns are in a consistent format (e.g., convert dates to datetime format).

In [12]:
# Convert 'date' column to datetime format
scraped_data_cleaned['date'] = pd.to_datetime(scraped_data_cleaned['date'], errors='coerce')

# Trim whitespace from text columns (if needed)
scraped_data_cleaned['message'] = scraped_data_cleaned['message'].str.strip()
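
Because `errors='coerce'` silently turns unparseable values into `NaT`, it is worth counting them after the conversion. The preview also shows `\r\n` sequences inside messages, which `str.strip()` removes only at the ends of a string. The sketch below, on illustrative toy data, shows one way to check both; passing `utc=True` and normalizing line endings to `\n` are assumptions about what the later steps need, not requirements from the original notebook.

```python
import pandas as pd

# Toy frame: one valid timestamp, one invalid value, and Windows line endings.
demo = pd.DataFrame({
    "date": ["2023-12-18 17:04:02+00:00", "not a date"],
    "message": ["  line one\r\n\r\nline two  ", "ok"],
})

# Parse to timezone-aware UTC datetimes; invalid values become NaT.
demo["date"] = pd.to_datetime(demo["date"], errors="coerce", utc=True)
unparsed = int(demo["date"].isna().sum())
print(f"{unparsed} date value(s) could not be parsed.")

# Normalize line endings inside the text, then trim the ends.
demo["message"] = (
    demo["message"]
    .str.replace("\r\n", "\n", regex=False)
    .str.strip()
)
print(demo["message"].tolist())
```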