# Task 1: Cleaned Data Analysis

This notebook loads the cleaned data collected from Telegram channels, processes it, and prepares it for the next tasks. We first inspect the data stored in the Excel file and check that each image is properly linked to its entry.

**Steps:**
1. Load the cleaned data from the Excel file.
2. Display basic statistics and a preview of the data.
3. Ensure image paths are correctly stored and accessible.
4. Handle any initial cleaning or formatting necessary.
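
Step 3 above can be sketched as follows. This is only a minimal illustration, assuming that `media_path` holds paths relative to some base directory and that a missing image is stored as an empty value. The helper name `flag_missing_media` and the sample rows are hypothetical and are not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def flag_missing_media(df: pd.DataFrame, base_dir: str = ".") -> pd.DataFrame:
    """Return a copy of df with a boolean 'media_exists' column (hypothetical helper)."""
    base = Path(base_dir)
    out = df.copy()
    # A path counts as accessible only if it is a string and points to an existing file
    out["media_exists"] = out["media_path"].apply(
        lambda p: isinstance(p, str) and (base / p).is_file()
    )
    return out


# Made-up sample: one row without media, one with a real file, one with a broken link
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "photo.jpg").write_bytes(b"")
    sample = pd.DataFrame(
        {"message_id": [1, 2, 3], "media_path": [None, "photo.jpg", "missing.jpg"]}
    )
    checked = flag_missing_media(sample, base_dir=tmp)

print(checked[["message_id", "media_exists"]])
```

Rows where `media_path` is set but `media_exists` is false are the ones worth investigating, since they point to files that were not downloaded or were moved.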


## - Merging with Channel Names

This notebook loads the scraped data from multiple CSV files (one for each Telegram channel), merges them, and adds the `channel_name` attribute based on the file the data was loaded from. The result is saved in a single Excel file for future processing.

**Steps:**
1. Load the scraped data from multiple CSV files.
2. Add the `channel_name` attribute based on the file name.
3. Merge the data into a single DataFrame.
4. Save the merged data into a single Excel file.
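
As a small illustration of step 2, the channel name can be derived from a file path with `pathlib` rather than by string replacement. This is a sketch under the assumption that each CSV file is named after its channel. The helper `channel_name_from_path` is a hypothetical name introduced here.

```python
from pathlib import Path


def channel_name_from_path(csv_path: str) -> str:
    """Return the file name without its directory and extension (hypothetical helper)."""
    return Path(csv_path).stem


print(channel_name_from_path("data/cleaned/DoctorsET.csv"))
```

Unlike `str.replace('.csv', '')`, `Path.stem` only drops the final extension, so a name that happens to contain `.csv` in the middle is left intact.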

In [3]:
# Import necessary libraries
import os
import pandas as pd
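# Switch to the project root so the relative paths below resolve (machine-specific path; adjust as needed)
os.chdir(r'c:\Users\ermias.tadesse\10x\Centralize-Ethiopian-medical-business-data')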
# Define file paths for the CSV files and corresponding channel names
csv_files = {
    'Chemed.csv': 'data/cleaned/Chemed.csv',
    'DoctorsET.csv': 'data/cleaned/DoctorsET.csv',
    'EAHCI.csv': 'data/cleaned/EAHCI.csv',
    'yetenaweg.csv': 'data/cleaned/yetenaweg.csv',
    'lobelia4cosmetics.csv': 'data/cleaned/lobelia4cosmetics.csv'
}

# Define the output file path for the merged data
output_excel_file = 'data/cleaned/scraped_data.xlsx'


## - Load, Add channel_name, and Merge the CSV Files (Code):

In [4]:
# Initialize an empty list to hold the dataframes
dataframes = []

# Loop over the CSV files, read each into a dataframe, and add the 'channel_name' column
for channel, csv_file in csv_files.items():
    try:
        # Load each CSV file into a dataframe
        df = pd.read_csv(csv_file)
        print(f"Loaded {csv_file} with {df.shape[0]} records.")
        
        # Add the 'channel_name' column
        df['channel_name'] = channel.replace('.csv', '')  # Add channel name without the '.csv' extension
        
        # Append dataframe to the list
        dataframes.append(df)
    except FileNotFoundError:
        print(f"File not found: {csv_file}")
    except Exception as e:
        print(f"Error loading {csv_file}: {str(e)}")

# Merge all the dataframes into a single dataframe
if dataframes:
    merged_data = pd.concat(dataframes, ignore_index=True)
    print(f"Successfully merged all files. Total records: {merged_data.shape[0]}")
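else:
    # Fall back to an empty dataframe so later cells can safely check merged_data.empty
    merged_data = pd.DataFrame()
    print("No dataframes were loaded.")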

Loaded data/cleaned/Chemed.csv with 0 records.
Loaded data/cleaned/DoctorsET.csv with 100 records.
Loaded data/cleaned/EAHCI.csv with 100 records.
Loaded data/cleaned/yetenaweg.csv with 100 records.
Loaded data/cleaned/lobelia4cosmetics.csv with 100 records.
Successfully merged all files. Total records: 400
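
The log above shows that `Chemed.csv` loaded with 0 records, so that channel contributes nothing to the merged data. One way to surface such cases is to compare per-channel counts against the list of expected channels. The sketch below uses a small made-up stand-in for `merged_data`, and the variable names are illustrative assumptions.

```python
import pandas as pd

# Channel names taken from the csv_files mapping
expected_channels = ["Chemed", "DoctorsET", "EAHCI", "yetenaweg", "lobelia4cosmetics"]

# Made-up stand-in for merged_data with only two channels present
merged_sample = pd.DataFrame({"channel_name": ["DoctorsET"] * 2 + ["EAHCI"] * 3})

# Count records per channel and fill in 0 for channels that are absent
counts = (
    merged_sample["channel_name"]
    .value_counts()
    .reindex(expected_channels, fill_value=0)
)
empty_channels = counts[counts == 0].index.tolist()

print(counts)
print("Channels with no records:", empty_channels)
```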


## - Save the Merged Data with channel_name Attribute (Code):

In [5]:
# Save the merged data to an Excel file
if not merged_data.empty:
    merged_data.to_excel(output_excel_file, index=False)
    print(f"Merged data saved to {output_excel_file}.")
else:
    print("No data to save.")

Merged data saved to data/cleaned/scraped_data.xlsx.
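
The `date` values carry a `+00:00` offset. Because `read_csv` was called without date parsing, they are written to Excel as plain strings. If the column is later converted to timezone-aware datetimes, pandas is expected to refuse the Excel export, since Excel has no timezone support. A possible workaround, sketched here on made-up rows, is to parse as UTC and then drop the timezone before saving.

```python
import pandas as pd

# Made-up sample using the same timestamp format as the scraped data
sample = pd.DataFrame(
    {"date": ["2023-12-18 17:04:02+00:00", "2023-11-03 16:14:39+00:00"]}
)

# Parse as timezone-aware UTC, then drop the timezone while keeping the UTC clock time
parsed = pd.to_datetime(sample["date"], utc=True)
sample["date"] = parsed.dt.tz_localize(None)

print(sample["date"])
```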


## - Basic Data Inspection (Code):

In [6]:
# Display the first few rows of the merged data
if not merged_data.empty:
    display(merged_data.head())

    # Check for missing values in the merged data
    print("Missing Values in Merged Data:")
    display(merged_data.isnull().sum())

|   | message_id | date | sender_id | message | media_path | channel_name |
|---|---|---|---|---|---|---|
| 0 | 864 | 2023-12-18 17:04:02+00:00 | -1001102021238 | https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD... | NaN | DoctorsET |
| 1 | 863 | 2023-11-03 16:14:39+00:00 | -1001102021238 | ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ... | NaN | DoctorsET |
| 2 | 862 | 2023-10-02 16:37:39+00:00 | -1001102021238 | ሞት በስኳር \r\n\r\nለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን... | NaN | DoctorsET |
| 3 | 861 | 2023-09-16 07:54:32+00:00 | -1001102021238 | ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ?\r\n\... | NaN | DoctorsET |
| 4 | 860 | 2023-09-01 16:16:15+00:00 | -1001102021238 | በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos... | NaN | DoctorsET |


Missing Values in Merged Data:


message_id        0
date              0
sender_id         0
message         127
media_path      300
channel_name      0
dtype: int64
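
The totals above do not show which channels account for the missing `message` and `media_path` values. A per-channel breakdown can help decide how to handle them. The following is a minimal sketch on made-up rows. On the real data, `sample` would be replaced by `merged_data`.

```python
import pandas as pd

# Made-up rows mimicking the merged data layout
sample = pd.DataFrame(
    {
        "message": ["text", None, None, "text"],
        "media_path": [None, None, "photos/a.jpg", None],
        "channel_name": ["DoctorsET", "DoctorsET", "EAHCI", "EAHCI"],
    }
)

# Count missing values in each column, grouped by channel
missing_by_channel = (
    sample[["message", "media_path"]]
    .isna()
    .groupby(sample["channel_name"])
    .sum()
)

print(missing_by_channel)
```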